[WIP] Moving signals away from PyDispatcher #2030


@rootAvish
Contributor
rootAvish commented Jun 5, 2016 edited

Fixes #8

This is by no means complete. Most of the effort up to this point has gone into porting the relevant parts of django.dispatch to Scrapy, rewriting the pieces that depended on django.utils, and introducing methods for API consistency. I'm opening this PR so that the changes can be monitored by the community; I'll keep updating it with details about what the individual changes concern themselves with.

To this point we have:

  • Introduced the scrapy.dispatch module
  • Changed all standard signals to be objects of the Signal class instead of the generic object.

Most of the other work is still outside the mainline and I'll be pushing it as soon as I'm done with it. I'll preferably have a working prototype by the end of this week (Saturday), at the latest by the start of the next one (the Monday after that).

/cc @jdemaeyer

@jdemaeyer
Contributor

Hey @rootAvish,

thanks for the WIP. You need to replace your imports to use either the full path or a relative one (from dispatch import Signal -> from scrapy.dispatch import Signal or from .dispatch import Signal, and a few others in the dispatch subpackage; preferably the absolute path version). You can easily see what I mean if you open a Python interpreter in the repository's base directory and simply try import scrapy.

Did you take a look at the tests?

@rootAvish
Contributor

Right, I thought I had fixed that when I pushed this, but I'm really bad at tracking what I have and haven't pushed yet. I'll be pushing a complete working prototype next (by Monday morning), though; hope everything passes at that point.

@codecov-io
codecov-io commented Jun 20, 2016 edited

Codecov Report

Merging #2030 into master will increase coverage by 0.3%.
The diff coverage is 99.65%.

@@            Coverage Diff            @@
##           master    #2030     +/-   ##
=========================================
+ Coverage   83.59%   83.89%   +0.3%     
=========================================
  Files         161      166      +5     
  Lines        8812     9018    +206     
  Branches     1296     1344     +48     
=========================================
+ Hits         7366     7566    +200     
- Misses       1200     1202      +2     
- Partials      246      250      +4
Impacted Files Coverage Δ
scrapy/core/downloader/handlers/__init__.py 80% <ø> (ø)
scrapy/spiders/__init__.py 98.43% <ø> (ø)
scrapy/extensions/telnet.py 75.51% <100%> (+0.51%)
scrapy/crawler.py 89.87% <100%> (ø)
scrapy/dispatch/utils/__init__.py 100% <100%> (ø)
scrapy/downloadermiddlewares/downloadtimeout.py 92.3% <100%> (ø)
scrapy/extensions/memusage.py 25% <100%> (ø)
scrapy/extensions/corestats.py 88.46% <100%> (ø)
scrapy/extensions/throttle.py 44.89% <100%> (ø)
scrapy/downloadermiddlewares/httpauth.py 100% <100%> (ø)
... and 36 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afac3fd...12f670f.

@jdemaeyer jdemaeyer commented on an outdated diff Jun 21, 2016
scrapy/dispatch/dispatcher.py
+
+        Arguments:
+
+            receiver
+                The registered receiver to disconnect. May be None if
+                dispatch_uid is specified.
+
+            sender
+                The registered sender to disconnect.
+
+            dispatch_uid
+                The unique identifier of the receiver to disconnect.
+        """
+        if weak is not None:
+            warnings.warn("Passing `weak` to disconnect has no effect.",
+                          ScrapyDeprecationWarning, stacklevel=2)
@jdemaeyer
jdemaeyer Jun 21, 2016 Contributor

We can probably drop weak from the keyword arguments, no use in introducing already deprecated code. ;)

@jdemaeyer jdemaeyer commented on an outdated diff Jun 21, 2016
scrapy/utils/signal.py
"""Disconnect all signal handlers. Useful for cleaning up after running
tests
"""
- for receiver in liveReceivers(getAllReceivers(sender, signal)):
- disconnect(receiver, signal=signal, sender=sender)
+ for receiver in signal._live_receivers(sender=sender):
+ signal.disconnect(receiver=receiver, sender=sender)
@jdemaeyer
jdemaeyer Jun 21, 2016 Contributor

This should be a method of Signal, with only a pass-through here (like the two functions above)

@jdemaeyer jdemaeyer commented on an outdated diff Jun 21, 2016
scrapy/signalmanager.py
@@ -1,11 +1,10 @@
from __future__ import absolute_import
-from pydispatch import dispatcher
from scrapy.utils import signal as _signal
@jdemaeyer
jdemaeyer Jun 21, 2016 Contributor

I wonder if we should drop this, directly call the Signal methods, and put deprecation warnings into the util functions? They are pass-throughs now anyway.

@jdemaeyer
Contributor

Great work Avishkar!

I've added a few smaller notes here and there but I think this looks pretty good now. Codecov complains that not all of the newly added code is being executed during the test suite, but that will probably be remedied when we add tests for the new signalling backend.

There are two backwards compatibility issues that we should think about:

  1. All signal receivers must now accept **kwargs. This is a big one and will definitely break a lot of user code out there (just like it already broke a lot of internal code). In the old signalling backend, we were using pydispatcher's robustApply method to handle receivers that weren't able to accept all keyword arguments. Can we port that to work with our new backend? And then throw a warning when we connect signal receivers that don't accept **kwargs, but no longer completely disallow them (see the sketch below).
  2. As you pointed out earlier, user-defined signals will be broken, because they now need to be instances of Signal instead of any arbitrary object. This one is much smaller than the first issue, since user-defined signals should be rare. I think we should try to implement an easy solution if there is one but be ready to let this stand as backwards incompatible if necessary. Given that all signalling tasks go through the SignalManager, maybe it's possible to do something like this:
class SignalManager(object):
    _signal_proxies = {}

    def _ensure_signal(self, obj):
        if isinstance(obj, Signal):
            return obj
        # Warn user that this is deprecated here
        self._signal_proxies.setdefault(obj, Signal())
        return self._signal_proxies[obj]

    def connect(self, receiver, signal, **kwargs):
        signal = self._ensure_signal(signal)
        # ...

    # and similar in the other methods

I haven't put much thought into it though, issue 1 is definitely of higher priority.
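
For concreteness, here is a minimal sketch (not the PR's actual code) of what a connect-time **kwargs check with a deprecation warning could look like; func_accepts_kwargs here is a simplified stand-in that only handles plain functions:

    import warnings

    def func_accepts_kwargs(func):
        # Simplified stand-in: test the CO_VARKEYWORDS flag (0x08) on the
        # code object; real code would also need to handle bound methods
        # and arbitrary callables.
        return bool(func.__code__.co_flags & 0x08)

    def connect(receiver, signal, **kwargs):
        if not func_accepts_kwargs(receiver):
            # Warn instead of rejecting, so old receivers keep working.
            warnings.warn("Signal receivers should accept **kwargs.",
                          DeprecationWarning, stacklevel=2)
        # ... proceed with the normal connect logic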

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
scrapy/dispatch/dispatcher.py
+    def send_robust(self, sender, **named):
+        """
+        Send signal from sender to all connected receivers catching errors.
+        Modified to return Failures.
+
+        Arguments:
+
+            sender
+                The sender of the signal. Can be any python object (normally
+                one registered with a connect if you actually want something
+                to occur).
+
+            named
+                Named arguments which will be passed to receivers. These
+                arguments must be a subset of the argument names defined in
+                providing_args.
+        """
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Tiny nitpick: This docstring should probably still say "Returns a list of tuple pairs [(receiver, response), ... ]."

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
scrapy/dispatch/dispatcher.py
+    def send_robust_deferred(self, sender, **named):
+        """
+        Send signal from sender to all connected receivers catching errors.
+        Modified to work with functions returning twisted deferreds.
+
+        Arguments:
+
+            sender
+                The sender of the signal. Can be any python object (normally
+                one registered with a connect if you actually want something
+                to occur).
+
+            named
+                Named arguments which will be passed to receivers. These
+                arguments must be a subset of the argument names defined in
+                providing_args.
+        """
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Tiny nitpick: This docstring should probably say "Returns a Deferred that fires with a list of tuple pairs [(receiver, response), ... ]." or similar

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
scrapy/dispatch/dispatcher.py
+        with self.lock:
+            self._clear_dead_receivers()
+            for index in range(len(self.receivers)):
+                (r_key, _) = self.receivers[index]
+                if r_key == lookup_key:
+                    disconnected = True
+                    del self.receivers[index]
+                    break
+            self.sender_receivers_cache.clear()
+        return disconnected
+
+    def has_listeners(self, sender=None):
+        return bool(self._live_receivers(sender))
+
+    def send(self, sender, **named):
+        """
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

It would be good if the docstrings were formatted in the same fashion as the other ones in Scrapy, but you'll probably encounter this anyway when you do the docs

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
scrapy/signalmanager.py
         self.sender = sender
+        self.signals = {}
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Why is this dictionary called signals when it actually contains (patched) receivers?

@jdemaeyer jdemaeyer and 1 other commented on an outdated diff Jul 13, 2016
scrapy/signalmanager.py
@@ -23,7 +28,16 @@ def connect(self, receiver, signal, **kwargs):
:type signal: object
"""
         kwargs.setdefault('sender', self.sender)
-        return dispatcher.connect(receiver, signal, **kwargs)
+        if not func_accepts_kwargs(receiver):
+            warnings.warn("The use of handlers that don't accept "
+                          "**kwargs has been deprecated, please refer "
+                          "to the Signals API documentation.",
+                          ScrapyDeprecationWarning, stacklevel=2)
+        self.signals[receiver.__repr__()] = \
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Why are we using receiver.__repr__() as key and not simply receiver?

@rootAvish
rootAvish Jul 23, 2016 Contributor

@jdemaeyer I read (in a Stack Overflow thread) that it's best to use string keys in a Python dictionary, so I decided to use the string representation of the callable as the key instead of the callable instance.

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
tests/test_pipeline_media.py
@@ -31,7 +31,7 @@ def setUp(self):
     def tearDown(self):
         for name, signal in vars(signals).items():
-            if not name.startswith('_'):
+            if not name.startswith('_') and name is not 'Signal':
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Use name != 'Signal' here. The is operator does not check for equality, but for identity (i.e. "does this point to the very same position in memory?"). Using is when you mean == will sometimes work b/c CPython caches some string and number literals, but you can never be sure:

In [1]: teststr = "hello"
In [2]: teststr is "hello"
Out[2]: True
In [3]: teststr += "2"
In [4]: teststr is "hello2"
Out[4]: False
In [5]: teststr == "hello2"
Out[5]: True
@jdemaeyer jdemaeyer commented on the diff Jul 13, 2016
tests/test_engine.py
         self.port.stopListening()
         for name, signal in vars(signals).items():
-            if not name.startswith('_'):
+            if not name.startswith('_') and name is not 'Signal':
@jdemaeyer
jdemaeyer Jul 13, 2016 edited Contributor

Use name != 'Signal' here (see the explanation above)

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
tests/test_utils_signal.py
         with LogCapture() as l:
             send_catch_log(test_signal)
         self.assertEqual(len(l.records), 1)
         self.assertIn("Cannot return deferreds from signal handler", str(l))
-        dispatcher.disconnect(test_handler, test_signal)
+        test_signal.disconnect(test_handler)
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

Missing newline :)

@jdemaeyer jdemaeyer commented on an outdated diff Jul 13, 2016
tests/test_utils_signal.py
+from scrapy.dispatch import Signal
from scrapy.utils.signal import send_catch_log, send_catch_log_deferred
class SendCatchLogTest(unittest.TestCase):
@jdemaeyer
jdemaeyer Jul 13, 2016 Contributor

This test should probably be moved to a new module test_dispatcher or something, since the real code no longer lives in utils.signal

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
+        that might be in use in third party extensions which are still using
+        the standard python objects as signals. This method returns a
+        Signal() class instance and saves it to proxy the user defined
+        signal.
+        """
+        if isinstance(signal, _Signal):
+            return signal
+        # Ensure we got at least an old style signal
+        warnings.warn("Signals in scrapy are no longer instances of the basic "
+                      "python object but rather the Signal class defined in "
+                      "the scrapy.dispatch module. Please refer to the Signal "
+                      "API documentation.",
+                      ScrapyDeprecationWarning, stacklevel=3)
+        signal_name = signal.__repr__()
+        if signal_name not in self._signal_proxies:
+            self._signal_proxies.setdefault(signal_name, _Signal())
+        return self._signal_proxies[signal_name]
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor
  1. There is a race condition here, where signal_name is not in self._signal_proxies at the time when the if condition is evaluated, but is already when the next line is evaluated (oh the perks of asynchronous programming). You're already doing the right thing by using setdefault, but then you can also skip the "check if already in dict" step.
  2. Dictionary keys can be anything in Python, and more importantly, an object's __repr__ is not necessarily unique (since it could potentially be overwritten and no longer contain the object memory address). You could avoid the __repr__ detour and just use signal itself as key.

Lastly, setdefault has the cool quality of returning the value that already was or is now stored in the dictionary, so the four lines above (everything after the warning) could be simplified to a single one:

        return self._signal_proxies.setdefault(signal, _Signal())
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Sorry, I only read your earlier comment just now. I would still argue strongly for using the instance itself.

  • In broader terms, for the wisdom mentioned in the top answer in the SO thread you mention ("premature optimization is the root of all evil", a trap I fall for on a regular basis ;)).

  • In this case in particular, because calculating an object's representation is a very expensive operation. Its costs vastly exceed the savings from using a string, making the repr-based dictionary almost 20 times slower:

    In [17]: timeit.timeit('mydict[a]', 'a = object(); mydict = {a: True}')
    Out[17]: 0.06446212899936654
    
    In [18]: timeit.timeit('mydict[repr(a)]', 'a = object(); mydict = {repr(a): True}')
    Out[18]: 1.168589578999672
    
@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
+        self._signal_proxies = {}
+
+    def _ensure_signal(self, signal):
+        """
+        This method is for backward compatibility for any custom signals
+        that might be in use in third party extensions which are still using
+        the standard python objects as signals. This method returns a
+        Signal() class instance and saves it to proxy the user defined
+        signal.
+        """
+        if isinstance(signal, _Signal):
+            return signal
+        # Ensure we got at least an old style signal
+        warnings.warn("Signals in scrapy are no longer instances of the basic "
+                      "python object but rather the Signal class defined in "
+                      "the scrapy.dispatch module. Please refer to the Signal "
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

It would be good if the warning were formulated as instruction instead of description, e.g. Custom signals should be instances of scrapy.dispatch.Signal.

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
@@ -59,6 +82,7 @@ def send_catch_log(self, signal, **kwargs):
through the :meth:`connect` method).
"""
         kwargs.setdefault('sender', self.sender)
+        signal = self._ensure_signal(signal)
         return _signal.send_catch_log(signal, **kwargs)
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Is there a reason for proxying through _signal.send_catch_log here instead of directly using signal.send_robust()? Since they are all just "pass-through functions" now, I think we can keep the utils.signal functions only for backwards compatibility but stop using them throughout scrapy's source. Or do you think this will "bind us" to the new Signal API too much? /cc @kmike

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/dispatch/utils/robustapply.py
+ """Argument %r specified both positionally and
+ as a keyword for calling %r""" % (
+ name, receiver,
+ )
+ )
+ if not (codeObject.co_flags & 8):
+ # fc does not have a **kwds type parameter, therefore
+ # remove unacceptable arguments.
+ if six.PY2:
+ for arg in named.keys():
+ if arg not in acceptable:
+ del named[arg]
+ else:
+ for arg in list(named):
+ if arg not in acceptable:
+ del named[arg]
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

If I'm not mistaken this code should be suitable for Python 2 as well.
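
A tiny standalone illustration of why the `for arg in list(named)` variant works on both Python 2 and 3 (the values here are made up): list() snapshots the keys, so deleting from the dict while iterating is safe.

    named = {'spider': 1, 'extra': 2}   # hypothetical keyword arguments
    acceptable = ('spider',)
    for arg in list(named):             # list() copies the keys first
        if arg not in acceptable:
            del named[arg]              # safe: we iterate over the copy
    print(named)                        # -> {'spider': 1}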

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/dispatch/dispatcher.py
+
+        dispatch_uid
+            An identifier used to uniquely identify a particular instance
+            of a receiver. This will usually be a string, though it may be
+            anything hashable.
+        """
+
+        # If DEBUG is on, check that we got a good receiver
+        settings = get_project_settings()
+        if settings.get('LOG_ENABLED') \
+                and settings.get('LOG_LEVEL') == 'DEBUG':
+            assert callable(receiver), "Signal receivers must be callable."
+            # Check for **kwargs
+            if not func_accepts_kwargs(receiver):
+                raise ValueError(
+                    "Signal receivers must accept keyword arguments(**kwargs).")
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

I think we should drop the "if debug" switch here and make this a warning instead of an exception. We have a working backwards-compatible mechanism in place and I see no reason not to use it just because we're using debug level logging.

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
tests/test_signalmanager_compatability.py
+import gc
+
+from scrapy.signalmanager import SignalManager
+from scrapy.dispatch.utils import func_accepts_kwargs
+from scrapy.signals import Signal
+
+
+def receiver(**kwargs):
+ pass
+
+
+def receiver_no_kwargs():
+ pass
+
+
+class BackwardCompatabilityTest(unittest.TestCase):
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Compatibility :)

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
tests/test_signalmanager_compatability.py
+def receiver(**kwargs):
+ pass
+
+
+def receiver_no_kwargs():
+ pass
+
+
+class BackwardCompatabilityTest(unittest.TestCase):
+
+    def setUp(self):
+        self.signals = SignalManager()
+
+    def tearDown(self):
+        self.assertFalse(self.signals._patched_receivers)
+        gc.collect()
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Could you add a comment why this is necessary?

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
from scrapy.utils import signal as _signal
+from scrapy.exceptions import ScrapyDeprecationWarning
+from scrapy.dispatch.utils import robust_apply as _robust_apply
+from scrapy.dispatch.utils import func_accepts_kwargs
+from scrapy.dispatch import Signal as _Signal
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

(Reposting this here because I accidentally attached it to an old commit ;)) Is there a reason to add the leading underscore?

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
tests/test_dispatch_utils.py
     def test_func_accepts_kwargs(self):
-        self.assertTrue(func_accepts_kwargs(accepts_kwargs))
-        self.assertFalse(func_accepts_kwargs(no_kwargs))
+        def accept_kwargs(**kwargs):
+            pass
+
+        def no_kwargs():
+            pass
+        # To test the attribute error case
+        someval = 0
+        self.assertTrue(func_accepts_kwargs(accept_kwargs))
+        # TODO: The fallback here is that it is assumed that the "receiver"
+        # is callable. Correct behaviour?
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

I think we should raise a TypeError when the given "function" is not a callable. From the docs:

Raised when an operation or function is applied to an object of inappropriate type.
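
A minimal sketch of the suggested guard (the body after the check is elided; the message text is made up):

    def func_accepts_kwargs(func):
        if not callable(func):
            # TypeError per the docs quoted above: an operation applied
            # to an object of inappropriate type.
            raise TypeError("Expected a callable, got %r" % (func,))
        # ... then inspect the callable for a **kwargs parameter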

@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Oops, just saw that you already did that, nice!

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/dispatch/utils/robustapply.py
+    If fromMethod is true, then the callable already
+    has its first argument bound
+    """
+    if hasattr(receiver, '__call__'):
+        # receiver is a class instance; assume it is callable.
+        # Reassign receiver to the actual method that will be called.
+        if hasattr(receiver.__call__, '__func__') \
+                or hasattr(receiver.__call__, '__code__'):
+            receiver = receiver.__call__
+    if hasattr(receiver, '__func__'):
+        # an instance-method...
+        return receiver, receiver.__code__, 1
+    elif not hasattr(receiver, '__code__'):
+        raise ValueError('unknown receiver type %s %s' %
+                         (receiver, type(receiver)))
+    return receiver, receiver.__code__, 0
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

I know this is from Django and I know I'm being pedantic, but could you change 0 into False and 1 into True above? ;)

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
@@ -23,7 +50,18 @@ def connect(self, receiver, signal, **kwargs):
:type signal: object
"""
         kwargs.setdefault('sender', self.sender)
-        return dispatcher.connect(receiver, signal, **kwargs)
+        signal = self._ensure_signal(signal)
+        if not func_accepts_kwargs(receiver):
+            warnings.warn("The use of handlers that don't accept "
+                          "**kwargs has been deprecated, please refer "
+                          "to the Signals API documentation.",
+                          ScrapyDeprecationWarning, stacklevel=2)
+            if receiver.__repr__() not in self._patched_receivers:
+                self._patched_receivers[receiver.__repr__()] = \
+                    lambda sender, **kw: _robust_apply(receiver, sender, **kw)
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Hmm, I think our solution here breaks the weak=True keyword argument functionality (since we're always keeping a strong reference in our _patched_receivers dictionary). Maybe it's enough to just mention this in the docs. (Since weak=True is a new feature and came together with requiring **kwargs.)
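
To illustrate the mechanism at stake, a minimal sketch (a toy class, not the PR's Signal): with weak=True the signal should hold only a weak reference, so keeping a strong reference to a patched wrapper in a dict defeats the garbage collection that weak=True is meant to allow.

    import weakref

    class ToySignal(object):
        def __init__(self):
            self.receivers = []

        def connect(self, receiver, weak=True):
            # With weak=True the signal does not keep the receiver alive:
            # once user code drops its last strong reference, the weakref
            # dies and the receiver is effectively disconnected. Storing
            # a patched wrapper strongly elsewhere breaks this guarantee.
            self.receivers.append(weakref.ref(receiver) if weak else receiver)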

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/signalmanager.py
@@ -23,7 +50,18 @@ def connect(self, receiver, signal, **kwargs):
:type signal: object
"""
         kwargs.setdefault('sender', self.sender)
-        return dispatcher.connect(receiver, signal, **kwargs)
+        signal = self._ensure_signal(signal)
+        if not func_accepts_kwargs(receiver):
+            warnings.warn("The use of handlers that don't accept "
+                          "**kwargs has been deprecated, please refer "
+                          "to the Signals API documentation.",
+                          ScrapyDeprecationWarning, stacklevel=2)
+            if receiver.__repr__() not in self._patched_receivers:
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Nice catch!

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
tests/test_signalmanager_compatability.py
+        self.signals.connect(receiver, new_signal)
+        self.assertTrue(self.signals.disconnect(receiver, new_signal))
+        self.signals.connect(receiver_no_kwargs, new_signal)
+        recv = self.signals._patched_receivers[receiver_no_kwargs.__repr__()]
+        assert(self.signals.disconnect(receiver_no_kwargs, new_signal))
+        self.signals._patched_receivers[receiver_no_kwargs.__repr__()] = recv
+        self.assertFalse(self.signals.disconnect(receiver_no_kwargs,
+                                                 new_signal))
+        del self.signals._patched_receivers[receiver_no_kwargs.__repr__()]
+        self.assertFalse(self.signals.disconnect(receiver_no_kwargs,
+                                                 new_signal))
+
+    def test_disconnect_all(self):
+        new_signal = object()
+        self.signals.connect(receiver, new_signal)
+        self.signals.disconnect_all(new_signal)
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

Looks like this test makes no assertions

@jdemaeyer jdemaeyer commented on an outdated diff Aug 8, 2016
scrapy/dispatch/utils/robustapply.py
+        raise ValueError('unknown receiver type %s %s' %
+                         (receiver, type(receiver)))
+    return receiver, receiver.__code__, 0
+
+
+def robust_apply(receiver, *arguments, **named):
+    """Call receiver with arguments and an appropriate subset of named
+    """
+    receiver, codeObject, startIndex = function(receiver)
+    acceptable = codeObject.co_varnames[startIndex+len(arguments):
+                                        codeObject.co_argcount]
+    for name in codeObject.co_varnames[startIndex:startIndex+len(arguments)]:
+        if name in named:
+            raise TypeError(
+                """Argument %r specified both positionally and
+                as a keyword for calling %r""" % (
@jdemaeyer
jdemaeyer Aug 8, 2016 Contributor

(Copying this here because I accidentally attached it to an old commit earlier)

This string (as it will be printed to the user) contains the newline and all the blanks before "as a keyword":

Argument %r specified both positionally and\n                as a keyword for calling %r.

You can use Python's magic string concatenation to avoid this:

             raise TypeError(
                """Argument %r specified both positionally and """
                """as a keyword for calling %r""" % (
@jdemaeyer
Contributor
jdemaeyer commented Aug 8, 2016 edited

Hey @rootAvish :) I've added my usual nitpicks here and there but I am quite happy with how this PR looks. Great work!

It would be cool if you could squash the commits into a couple of "meaningful" ones (tip: make a tag or write down the current commit hash before calling rebase, it's an easy way out if you accidentally enter rebase hell ;)). Then when we have settled the remaining smaller issues (the line comments) and the benchmarks are up somewhere I think we're ready to have a second reviewer look over it and add some docs (or the other way around).

@rootAvish
Contributor

Hey @jdemaeyer, thanks for the code review. As mentioned in the email I sent you, sending through the patched receivers was abysmally slow compared to the current API, and, in the spirit of always coming up with overly complicated solutions, this too was due to over-engineering on my part. I had also introduced a large amount of method call overhead, which was taking a lot of time. I reworked that, completely got rid of _patched_receivers, and instead worked through the Signal class itself. IMO the new approach is much simpler and cleaner, and the benchmark is fixed too (in case you didn't check the email yet, here's a link to the benchmark suite: https://github.com/rootAvish/scrapysignalbench).

Would appreciate it if you had a look over that part again; do you see any problems with the new implementation?

@jdemaeyer jdemaeyer commented on an outdated diff Aug 18, 2016
scrapy/dispatch/dispatcher.py
+            objects. If this parameter is false, then strong references
+            will be used.
+
+        dispatch_uid
+            An identifier used to uniquely identify a particular instance
+            of a receiver. This will usually be a string, though it may be
+            anything hashable.
+        """
+        assert callable(receiver), "Signal receivers must be callable."
+        # Check for **kwargs
+        if not func_accepts_kwargs(receiver):
+            warnings.warn("The use of handlers that don't accept "
+                          "**kwargs has been deprecated, please refer "
+                          "to the Signals API documentation.",
+                          ScrapyDeprecationWarning, stacklevel=3)
+           self.receiver_accepts_kwargs[_make_id(receiver)] = False
@jdemaeyer
jdemaeyer Aug 18, 2016 Contributor

The indentation is a space short in this block ;)

@jdemaeyer jdemaeyer commented on an outdated diff Aug 18, 2016
scrapy/dispatch/dispatcher.py
+        else:
+            lookup_key = (_make_id(receiver), _make_id(sender))
+
+        disconnected = False
+        with self.lock:
+            self._clear_dead_receivers()
+            for index in range(len(self.receivers)):
+                (r_key, _) = self.receivers[index]
+                if r_key == lookup_key:
+                    disconnected = True
+                    del self.receivers[index]
+                    break
+            self.sender_receivers_cache.clear()
+        if disconnected:
+            if _make_id(receiver) in self.receiver_accepts_kwargs:
+                del self.receiver_accepts_kwargs[_make_id(receiver)]
@jdemaeyer
jdemaeyer Aug 18, 2016 Contributor

There's a race condition here, you can use self.receiver_accepts_kwargs.pop(_make_id(receiver), None) to get rid of it

@jdemaeyer jdemaeyer commented on an outdated diff Aug 18, 2016
scrapy/dispatch/dispatcher.py
+                The sender of the signal. Either a specific object or None.
+
+            named
+                Named arguments which will be passed to receivers.
+
+        Returns a list of tuple pairs [(receiver, response), ... ].
+        """
+        responses = []
+        if not self.receivers or self.sender_receivers_cache.get(sender) is NO_RECEIVERS:
+            return responses
+
+        for receiver in self._live_receivers(sender):
+            if self.receiver_accepts_kwargs[_make_id(receiver)]:
+                response = receiver(signal=self, sender=sender, **named)
+            else:
+                response = robust_apply(receiver, sender=sender, **named)
@jdemaeyer
jdemaeyer Aug 18, 2016 Contributor

I wonder why we don't pass signal=self here?

@jdemaeyer jdemaeyer commented on an outdated diff Aug 18, 2016
scrapy/dispatch/dispatcher.py
+            dfd = maybeDeferred(
+                receiver, signal=self, sender=sender, **named)
+            dfd.addErrback(logerror, receiver)
+            dfd.addBoth(lambda result: (receiver, result))
+            dfds.append(dfd)
+        d = DeferredList(dfds)
+        d.addCallback(lambda out: [x[1] for x in out])
+        return d
+
+    def _clear_dead_receivers(self):
+        # Note: caller is assumed to hold self.lock.
+        if self._dead_receivers:
+            self._dead_receivers = False
+            new_receivers = []
+            for r in self.receivers:
+                if isinstance(r[1], weakref.ReferenceType) and r[1]() is not None:
@jdemaeyer
jdemaeyer Aug 18, 2016 Contributor

This for loop used to contain:

                if isinstance(r[1], weakref.ReferenceType) and r[1]() is None:
                    continue
                new_receivers.append(r)

(i.e. everything that was not a dead weakref was added to new_receivers) With the changes you made it looks as if new_receivers will almost always be empty, except for live weakrefs. You probably meant to do this, right?

                if not (isinstance(r[1], weakref.ReferenceType) and r[1]() is None):
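
Spelled out, the suggested fix would make the body of _clear_dead_receivers read roughly like this (a sketch of the corrected loop, keeping everything that is not a dead weakref):

    new_receivers = []
    for r in self.receivers:
        # Keep everything that is NOT a dead weakref.
        if not (isinstance(r[1], weakref.ReferenceType) and r[1]() is None):
            new_receivers.append(r)
    self.receivers = new_receivers
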
@jdemaeyer jdemaeyer commented on an outdated diff Aug 18, 2016
scrapy/signalmanager.py
from scrapy.utils import signal as _signal
+from scrapy.dispatch import Signal as Signal
@jdemaeyer
jdemaeyer Aug 18, 2016 Contributor

No need for the as Signal ;)

@jdemaeyer
Contributor

Heyhey Avishkar, nice work! I agree, this solution is quite a bit simpler, and I like how you restored the weak=True functionality. As usual I've added a few small comments (at one of them I think the code is actually broken), and there are still three open ones from earlier commits. (Btw these are never commands or anything, feel free to answer with why you disagree.)

So this leaves us with the following stuff still to do, right?

  • smaller issues (line comments)
  • squash commits
  • finish docs
  • get second review

If you find time for it, maybe you can put the output for the benchmark suite somewhere (just a raw paste of the command line output should be fine)?

@rootAvish
Contributor

@jdemaeyer thanks for responding on such short notice, always appreciate your insights.

there's still three that are open from earlier commits. (Btw these are never commands or anything, feel free to answer with why you disagree.)

Sorry for missing out on those. I don't disagree; I thought you wanted to deliberate further on deprecating utils.signal, so I didn't move on it at the time. I initially thought it would be too much of a change, but then I considered that the utils pass-through methods are not directly usable after the change anyway, plus dropping them also gets rid of the method call overhead.

So this leaves us with the following stuff still to do, right?
smaller issues (line comments)

I think I took care of all of them, but I'll go over them again to make sure I didn't miss any like before.

squash commits

Right, I wanted to do that once I was done with the development.

finish docs

I'm almost done with those and I'll upload them sometime later today. I did the docstrings for dispatcher.py as a starting point, and tried to adhere to the formatting style used in Scrapy.

If you find time for it, maybe you can put the output for the benchmark suite somewhere (just a raw paste of the command line output should be fine)?

Hey, sure. I thought you would like to run the suite on your end for verification, so I didn't include my run at first. Here's a sample: https://github.com/rootAvish/scrapysignalbench/blob/master/scrapysignalbench/benchmarks/README.md

As you can see, connection is in theory slower than before, but most of that time is spent checking for kwargs and raising the deprecation warning (the line_profiler output that I went by suggests so). Further down the road we can always look to migrate away from non-kwargs receivers completely. Do you think the slower connection times are an issue?

@lopuhin referenced this pull request Sep 1, 2016: PyPy support #2213
@rootAvish
Contributor

Hey @jdemaeyer, did you guys change the OpenSSL version or something? I don't think that breakage on Python 3.3 has anything to do with the code. Also, please review the docs when you get a chance. I hope the commits are as desired now.

@redapple referenced this pull request Sep 13, 2016: Refactor signals #8

@redapple
Contributor

Hi @rootAvish, we did not change anything explicitly, but newer packages were used.
If I compare the Python packages from the last PASSING Travis build
and this FAILING one, this is what you get:

diff freeze.passing.txt freeze.failing.txt 
1c1
< PASSING
---
> FAILING
3c3
< attrs==16.0.0
---
> attrs==16.1.0
5c5
< botocore==1.4.48
---
> botocore==1.4.53
7c7
< cffi==1.7.0
---
> cffi==1.8.2
9c9
< cryptography==1.4
---
> cryptography==1.5
11c11
< curtsies==0.2.6
---
> curtsies==0.2.9
35c35
< pyOpenSSL==16.0.0
---
> pyOpenSSL==16.1.0
46,48c46,48
< testfixtures==4.10.0
< traitlets==4.2.2
< Twisted==16.3.2
---
> testfixtures==4.10.1
> traitlets==4.3.0
> Twisted==16.4.0
51c51
< zope.interface==4.2.0
---
> zope.interface==4.3.2
@redapple
Contributor

This recent build for the master branch passes and has the same packages: https://travis-ci.org/scrapy/scrapy/jobs/159358737

@redapple redapple added this to the v1.3 milestone Sep 13, 2016
@redapple
Contributor

hey @rootAvish, I don't know why, but a new PR of mine with only your commits, rebased on the current master branch, built fine, including on Python 3.3
#2235
https://travis-ci.org/scrapy/scrapy/builds/159620850

@kmike kmike commented on the diff Sep 13, 2016
docs/topics/signals.rst
@@ -13,9 +13,20 @@ Even though signals provide several arguments, the handlers that catch them
don't need to accept all of them - the signal dispatching mechanism will only
deliver the arguments that the handler receives.
+Signals underwent a major refactoring with an eye towards speed. Although still
+backwards compatible, receivers now need to have a ``**kwargs`` argument i.e. all
+receivers should now accept a variable keyword args param. We request that you
+make sure all your receivers follow this contract.
@kmike
kmike Sep 13, 2016 Member

It means this change is backwards incompatible, i.e. users need to change all their receivers.

@kmike
kmike Sep 13, 2016 Member

Or is this optional?

@rootAvish
rootAvish Sep 13, 2016 Contributor

@kmike It's optional, as we did not want to break backward compatibility, but those receivers won't get the full benefit of the speedup, since robust_apply would still be invoked.
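
To illustrate the contract with two hypothetical receivers (the first takes the fast path, the second still works but is dispatched through robust_apply and misses the speedup):

    def new_style_receiver(spider, **kwargs):
        # Accepts **kwargs: called directly by Signal.send, full speedup.
        pass

    def old_style_receiver(spider):
        # No **kwargs: still supported, but dispatched via robust_apply,
        # which filters keyword arguments and is slower.
        pass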

@kmike kmike commented on an outdated diff Sep 13, 2016
scrapy/dispatch/utils/inspect.py
@@ -0,0 +1,30 @@
+from __future__ import absolute_import
+
+import inspect
+
+import six
+
+
+def func_accepts_kwargs(func):
+     # Not all callables are inspectable with getargspec, so we'll
@kmike
kmike Sep 13, 2016 Member

indentation is off

@kmike kmike commented on an outdated diff Sep 13, 2016
scrapy/dispatch/dispatcher.py
+ "to the Signals API documentation.",
+ ScrapyDeprecationWarning, stacklevel=3)
+ accepts_kwargs = False
+ if dispatch_uid:
+ lookup_key = (dispatch_uid, _make_id(sender))
+ else:
+ lookup_key = (_make_id(receiver), _make_id(sender))
+ accepts_kwargs_lookup = _make_id(receiver)
+ if weak:
+ ref = weakref.ref
+ receiver_object = receiver
+ # Check for bound methods
+ if hasattr(receiver, '__self__') and hasattr(receiver, '__func__'):
+ ref = WeakMethod
+ receiver_object = receiver.__self__
+ if six.PY3 and sys.version_info >= (3, 4):
@kmike
kmike Sep 13, 2016 Member

why is the six.PY3 check needed? It may return False in Python 4.x, and it looks redundant given the sys.version_info check.
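
Concretely, sys.version_info >= (3, 4) already implies a 3.x-or-later interpreter, so a version-only check suffices. A sketch, assuming the backport package exposes WeakMethod at the top level:

    import sys

    if sys.version_info >= (3, 4):
        from weakref import WeakMethod          # stdlib from 3.4 onwards
    else:
        from weakrefmethod import WeakMethod    # backport (assumed API)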

@kmike kmike and 1 other commented on an outdated diff Sep 13, 2016
scrapy/dispatch/weakref_backports.py
@@ -0,0 +1,67 @@
+"""
+weakref_backports is a partial backport of the weakref module for python
+versions below 3.4.
@rootAvish
rootAvish Sep 13, 2016 edited Contributor

I glanced through the code in its repo here: https://github.com/pjdelport/backports.weakref/blob/master/src/backports/weakref.py and it looks like he has implemented the backport himself. Ours is a verbatim copy of the code in the Python 3.4 source. I can do a build with that module, but it's your call in the end which implementation of the backport you'd like to go with. That implementation would help us have the same code across all Python versions though, since it backports finalize, which ours does not.

@rootAvish
rootAvish Sep 13, 2016 edited Contributor

I'm sorry, as usual I spoke too soon. That implementation backports weakref.finalize from 3.5. We ported weakref.WeakMethod from 3.4. In the current state of that package we would require both this and that package.

@kmike
kmike Sep 13, 2016 Member

Aha, makes sense. Maybe we should raise an issue at https://github.com/pjdelport/backports.weakref, or even contribute a PR with WeakMethod changes. Less code we have in Scrapy the better :) What do you think?
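
For context, a small example (standard weakref semantics, with a made-up class; WeakMethod needs Python 3.4+ or the backport) of what WeakMethod buys us for bound-method receivers:

    import weakref

    class Ext(object):               # hypothetical extension object
        def handler(self, **kwargs):
            pass

    ext = Ext()

    # A plain weakref to a bound method dies immediately (in CPython): the
    # bound method object is created fresh on attribute access and is
    # collected right away.
    dead = weakref.ref(ext.handler)
    assert dead() is None

    # WeakMethod (stdlib since 3.4, hence the backport for older versions)
    # tracks the underlying instance, so it stays alive as long as `ext`.
    alive = weakref.WeakMethod(ext.handler)
    assert alive() is not None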

@kmike kmike and 1 other commented on an outdated diff Sep 13, 2016
scrapy/utils/signal.py
-def disconnect_all(signal=Any, sender=Any):
+def disconnect_all(signal, sender=None):
@kmike
kmike Sep 13, 2016 Member

Previously the signal argument was not required; also, it was possible to pass an object instead of a Signal instance.

@rootAvish
rootAvish Sep 13, 2016 Contributor

This too would depend on what the decision is regarding my other comment here: #2030 (diff).

@kmike kmike and 1 other commented on an outdated diff Sep 13, 2016
scrapy/utils/signal.py
-                      extra={'spider': spider})
-        return failure
-
-    dont_log = named.pop('dont_log', None)
-    spider = named.get('spider', None)
-    dfds = []
-    for receiver in liveReceivers(getAllReceivers(sender, signal)):
-        d = maybeDeferred(robustApply, receiver, signal=signal, sender=sender,
-                          *arguments, **named)
-        d.addErrback(logerror, receiver)
-        d.addBoth(lambda result: (receiver, result))
-        dfds.append(d)
-    d = DeferredList(dfds)
-    d.addCallback(lambda out: [x[1] for x in out])
-    return d
+    return signal.send_catch_log_deferred(sender=sender, **named)
@kmike
kmike Sep 13, 2016 Member

signal was optional previously, and it was not required to use Signal class

@rootAvish
rootAvish Sep 13, 2016 Contributor

@kmike The assumption here was that all signals would be connected using the SignalManager class, with all compatibility implemented either there or inside Signal itself, while deprecating the use of utils.signal. But I see how that can be a problem. I could always implement a layer of compatibility for utils.signal; I'll get on that.

@rootAvish
rootAvish Sep 13, 2016 edited Contributor

@kmike, sorry, I just remembered what my line of thought here was. This module contains just the send_catch_log, send_catch_log_deferred and disconnect_all methods. Assuming somebody was using scrapy.utils.signal directly instead of SignalManager, they most likely used dispatcher.connect from PyDispatcher directly to connect their signal to the receiver. If that is the case, then they are screwed (excuse my language please) anyway, since the entire point here is to move signals away from PyDispatcher.

Assuming they were using SignalManager, that already requires a signal as a compulsory argument everywhere, and it would use the instance methods of Signal objects from there itself, which does not use utils.signal anymore and does not break backward compatibility (it accepts objects instead of Signal instances). I don't think implementing any kind of compatibility here would serve a purpose. But on that note, I'm not sure what use case this module serves anymore. Your thoughts on the same?

@kmike
kmike Sep 13, 2016 Member

Hm, in this case it seems we may drop the whole module, or just keep the current implementation as-is and deprecate it.
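
If the keep-and-deprecate route were chosen, the pass-throughs could look something like this sketch (the send_robust method name follows this PR's dispatcher; the warning text is made up):

    import warnings

    from scrapy.exceptions import ScrapyDeprecationWarning

    def send_catch_log(signal, sender=None, **named):
        """Deprecated pass-through kept for backward compatibility."""
        warnings.warn("scrapy.utils.signal.send_catch_log is deprecated; "
                      "call the Signal object's own methods instead.",
                      ScrapyDeprecationWarning, stacklevel=2)
        return signal.send_robust(sender=sender, **named)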

@rootAvish
Contributor

@redapple I'm unable to make out what the problem is at first glance, but using your build I'm sure I can figure that part out. I'll report back as soon as I have something. Thanks for the pointers.

rootAvish added some commits Sep 18, 2016
@rootAvish rootAvish Added: scrapy.dispatch module, containing the code for the new signal…
… dispatcher
95d9aa3
@rootAvish rootAvish Modified scrapy code everywhere to use the new dispatcher 8546c80
@rootAvish rootAvish Added unit tests for dispatcher and modified existing tests to use th…
…e new dispatcher
b7f919f
@rootAvish rootAvish Added docs for the new dispatcher and signals API
9707556
@rootAvish
Contributor
rootAvish commented Sep 24, 2016 edited

Hey @redapple, sorry for not replying all this time, but I just wanted to give you an update. I haven't been able to solve this issue yet, since the only place I was able to even reproduce the problem was on an Ubuntu 14.04 installation, and there it seems to be an error with OpenSSL itself (scrapy master breaks too).

[20:31:25] avishkar:Downloads $ python3.3                                                                                                         
Python 3.3.6 (default, Jan 28 2015, 17:27:09) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import OpenSSL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/dist-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import rand, crypto, SSL
  File "/usr/local/lib/python3.3/dist-packages/OpenSSL/rand.py", line 12, in <module>
    from OpenSSL._util import (
  File "/usr/local/lib/python3.3/dist-packages/OpenSSL/_util.py", line 6, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "/usr/local/lib/python3.3/dist-packages/cryptography/hazmat/bindings/openssl/binding.py", line 14, in <module>
    from cryptography.hazmat.bindings._openssl import ffi, lib
ImportError: No module named 'cryptography.hazmat.bindings._openssl'
>>> 
@redapple
Contributor
redapple commented Oct 26, 2016 edited

@rootAvish, sorry for the late feedback.
I just tried again submitting a master-branch-rebased version of your commits, and it built OK: #2361

Not sure what issue Travis is having, to be honest.

But perhaps you can try doing the same thing: open a new PR on top of the master branch, and continue/finish up the work there (if there's anything missing)

@redapple redapple modified the milestone: v1.4, v1.3 Nov 23, 2016
@rootAvish
Contributor

@redapple Right; I'm out of ideas anyway on what else I can try here.

rootAvish added some commits Dec 29, 2016
@rootAvish rootAvish Merge branch 'master' of https://github.com/scrapy/scrapy into signal…
…-rewrite

Conflicts:
	scrapy/extensions/logstats.py
799d873
@rootAvish rootAvish dispatcher - PyPI package for WeakMethod backport
Use the PyPI package at: https://pypi.python.org/pypi/weakrefmethod/1.0.0
for WeakMethod instead of packaging the backport code with scrapy itself.
dc06067
@rootAvish
Contributor

@redapple, whatever issue Travis was having seems to be fixed now. I'm waiting for @kmike's comments on the usage of the weakrefmethod package; all the other stuff is taken care of. Sorry for the delay, I was in a crunch at work.
