
Add bytes_received signal #4205

Merged: 20 commits, merged into scrapy:master on May 11, 2020

Conversation

@elacuesta (Member) commented Dec 2, 2019:

As discussed in #3793 (comment)

# In scrapy/core/downloader/handlers/http11.py, inside the response
# reader's dataReceived:
self._crawler.signals.send_catch_log(
    signal=signals.bytes_received,
    data=bodyBytes,
    request=self._request,
)
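
For context, a hedged sketch of how user code might consume the new signal (the ByteCounter extension name and the exact handler signature are illustrative, not part of this PR):

from scrapy import signals

class ByteCounter:
    # Hypothetical extension: tally the raw bytes received during a crawl.
    def __init__(self, crawler):
        self.total = 0
        crawler.signals.connect(self.on_bytes_received,
                                signal=signals.bytes_received)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def on_bytes_received(self, data, request, spider):
        self.total += len(data)
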
@Gallaecio (Member), Dec 3, 2019:

When the signal is not assigned to any handler, does this have any noticeable impact on performance?

@elacuesta (Member, Author), Dec 11, 2019:

Using https://github.com/scrapy/scrapy-bench:

scrapy-bench --n-runs=10 --book_url=http://localhost:8000 bookworm

scrapy @ 454de31 (bytes_received_signal branch)

The results of the benchmark are (all speeds in items/sec) : 
Test = 'Book Spider' Iterations = '10'
Mean : 62.16695929434989 Median : 62.43078124454085 Std Dev : 1.1148966043160284

scrapy @ 07b8cd2 (master branch)

The results of the benchmark are (all speeds in items/sec) : 
Test = 'Book Spider' Iterations = '10'
Mean : 60.77521979854436 Median : 60.963007266556204 Std Dev : 1.6349215293441601

@Gallaecio requested a review from @kmike, Dec 3, 2019
codecov bot commented Dec 3, 2019:

Codecov Report

Merging #4205 into master will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4205      +/-   ##
==========================================
- Coverage   84.50%   84.49%   -0.02%     
==========================================
  Files         164      164              
  Lines        9903     9909       +6     
  Branches     1481     1481              
==========================================
+ Hits         8369     8373       +4     
- Misses       1266     1268       +2     
  Partials      268      268              
Impacted Files                              Coverage Δ
scrapy/core/downloader/handlers/http11.py  92.72% <100.00%> (+0.13%) ⬆️
scrapy/signals.py                           100.00% <100.00%> (ø)
scrapy/crawler.py                           87.15% <0.00%> (-1.12%) ⬇️

assert signals.spider_opened in self.run.signals_catched
assert signals.spider_idle in self.run.signals_catched
assert signals.spider_closed in self.run.signals_catched
def _assert_bytes_received(self):
@kmike (Member), Jan 25, 2020:

Could you please add a test which shows the signal behavior when the data is larger than a buffer, i.e. when a signal is sent several times by Scrapy?

@elacuesta (Member, Author), Jan 29, 2020:

Added in 2c9643d.
Not sure about the 2**14 thing though, it doesn't seem to slow the test down but there might be a better way to force a smaller buffer size.

Running only tests/test_engine.py with the large payload:

real	0m4.765s
user	0m2.445s
sys	0m0.330s

without it:

real	0m4.751s
user	0m2.367s
sys	0m0.338s
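
For illustration, a hedged sketch of what such a test could assert (the crawler object, the handler wiring, and expected_body_length are assumptions, not the PR's actual test code):

from scrapy import signals

received = []

def on_bytes_received(data, request, spider):
    # Each call delivers one freshly received chunk, not the cumulative body.
    received.append(data)

crawler.signals.connect(on_bytes_received, signal=signals.bytes_received)

# After crawling a response larger than a single read buffer (> 2**14 bytes):
assert len(received) > 1                                # arrived in several chunks
assert len(b"".join(received)) == expected_body_length  # chunks reassemble the body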

.. signal:: bytes_received
.. function:: bytes_received(data, request)

Sent by the HTTP 1.1 download handler when a group of bytes is
received for a specific request.
@kmike (Member), Jan 25, 2020:

I wonder if the signal should be renamed, if it is just for the HTTP 1.1 download handler.

Is it sent e.g. for S3 responses, or for FTP responses? It could be, because S3DownloadHandler uses the HTTP download handler, though I haven't checked it. If it is sent, we should document that. If it is not sent for all handlers, imagine in the future we decide to add support for them (e.g. the S3 handler starts to send this signal). It'd be a backwards-incompatible change, as user code may rely on bytes_received being issued only by the HTTP 1.1 download handler, not by S3 or FTP.

Having different signal names can avoid this backwards incompatibility. Alternatively, we can decide to support everything from the beginning. If we support everything, it may be good to add an additional argument to the signal, which would allow distinguishing between e.g. S3 and HTTP 1.1 data.

@elacuesta (Member, Author), Jan 29, 2020:

Just pushed 6f02a8d, which adds a source string parameter to the signal (current values could be "http11" or "s3"). I'm not sure how to test the occurrence of the "s3" value without actually requesting stuff from S3.
Let me know what you think of this approach @kmike.

@kmike (Member), Mar 20, 2020:

Let's check the pros/cons of these approaches, from the user's point of view.

Use case 1: user wants to check traffic received from websites. Two signals:

def on_bytes_received_http(data):
    # ...

crawler.signals.connect(on_bytes_received_http, signals.bytes_received_http)

One signal, "source" field:

def on_bytes_received(data, source):
    if source != 'http11':
        return
    # ...

crawler.signals.connect(on_bytes_received, signals.bytes_received)

Use case 2: user wants to check traffic received from websites and from downloading s3 files (e.g. with seed urls used in start_requests?), but wants to handle them differently:

def on_bytes_received_http(data):
    # ...

def on_bytes_received_s3(data):
    # ...

crawler.signals.connect(on_bytes_received_http, signals.bytes_received_http)
crawler.signals.connect(on_bytes_received_s3, signals.bytes_received_s3)

One signal, "source" field:

def on_bytes_received(data, source):
    if source == 'http11':
        # ...
    elif source == 's3':
        # ...

crawler.signals.connect(on_bytes_received, signals.bytes_received)

Use case 3: user wants overall downloaded data stats, source doesn't matter.

Two signals:

def on_bytes_received(data):
    # ...

crawler.signals.connect(on_bytes_received, signals.bytes_received_http)
crawler.signals.connect(on_bytes_received, signals.bytes_received_s3)

One signal, "source" field:

def on_bytes_received(data, source):
    # ...

crawler.signals.connect(on_bytes_received, signals.bytes_received)

To my taste, the "two signals" approach

  • wins in Use case 1 (shorter, less error prone)
  • wins in Use case 2 (less error prone, but more code)
  • loses in Use case 3 (more code, need to find all signals related to downloading)

In the first two use cases "less error prone" means that if you make a typo in a signal name, an error will be raised (there is no such signal), while if you check against a non-existing source, it will silently work wrong. Because of this "less error prone" property I think having 2 separate signals is better than relying on a "source" field.

For the stats use case a single signal could make sense, fired whenever any other bytes_received signal is fired. It would also allow adding more signal sources without users having to update their code. It could actually be the current bytes_received signal. But I wonder what the performance overhead of that is, having to send 2x signals on every chunk, as compared to the benefit (it may still be worth it).


One more thing: I'm not sure http11 is a good name here, because it may be set for HTTP 1.0 traffic as well, though I may be wrong :) I'm assuming that HTTP11DownloadHandler handles communication with all kinds of HTTP servers.

@Gallaecio (Member), Apr 5, 2020:

I was in favor of multiple signals, but re-reading I’m starting to think that a single signal is much cleaner and should work, as long as we clearly document that additional download handlers may support this signal in the future, and recommend using the request object to filter out non-HTTP traffic if desired.

This does not allow knowing whether the traffic comes from HTTP 1.0, 1.1 or 2.0. Do we want to allow that? I don’t think we need it at the moment, and in the future we could implement a separate, new signal, e.g. protocol_negotiated or similar, that provides this information along with the affected request.

@elacuesta (Member, Author), Apr 6, 2020:

That's absolutely true, thanks for pointing it out. The request object should be enough to determine whether or not the source is S3.
Regarding the version, we do not support HTTP/2 as far as I remember, and HTTP/1.0 needs to be explicitly enabled by changing the download handler. +1 on the protocol_negotiated signal once we add support for HTTP/2.
I'll go ahead and simplify the implementation, thanks both ❤️

@Gallaecio (Member), Apr 16, 2020:

@kmike These would be the 3 use cases with a single signal and relying on request.url:

Use case 1: user wants to check traffic received from websites:

from urllib.parse import urlparse

def on_bytes_received(data, request, spider):
    if urlparse(request.url).scheme not in {'http', 'https'}:
        return
    # …

crawler.signals.connect(on_bytes_received, signals.bytes_received)

Use case 2: user wants to check traffic received from websites and from downloading s3 files (e.g. with seed urls used in start_requests?), but wants to handle them differently:

def on_bytes_received(data, request, spider):
    scheme = urlparse(request.url).scheme
    if scheme in {'http', 'https'}:
        # …
    elif scheme == 's3':
        # …

crawler.signals.connect(on_bytes_received, signals.bytes_received)

Use case 3: user wants overall downloaded data stats, source doesn't matter:

def on_bytes_received(data, request, spider):
    # …

crawler.signals.connect(on_bytes_received, signals.bytes_received)

@kmike (Member), May 3, 2020:

Ok, fair enough. Let's proceed with a single signal, without source :)

@kmike (Member), May 3, 2020:

Re examples, nitpick (in case we decide to put them somewhere, or if someone gets to this page via search): they should use scheme = urlparse_cached(request).scheme, like Scrapy's DownloadHandlers.
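
For reference, a hedged sketch of Use case 1 rewritten with urlparse_cached (assuming the final data/request/spider handler signature):

from scrapy.utils.httpobj import urlparse_cached

def on_bytes_received(data, request, spider):
    # urlparse_cached memoizes the parse result on the request object,
    # which matters here because the signal fires once per received chunk.
    if urlparse_cached(request).scheme not in {'http', 'https'}:
        return
    # …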

@Gallaecio (Member) commented Feb 19, 2020:

I wonder if we should consider #3880 (comment) and allow the signal to somehow stop the response download and send the response to the corresponding callback as is.

Although that could wait for a future pull request, and I’m not sure if a signal is the right place to expose such functionality.

@elacuesta (Member, Author) commented Feb 19, 2020:

That's an interesting use case, though I agree signal handlers might not be the best place to do it.
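
As it happens, later Scrapy releases (2.2+) exposed exactly this through the StopDownload exception, which a bytes_received handler can raise. A minimal sketch of that later API, not part of this PR:

from scrapy.exceptions import StopDownload

def on_bytes_received(data, request, spider):
    # Stops fetching the rest of the body; fail=False hands the partial
    # response to the request's callback instead of its errback.
    raise StopDownload(fail=False)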


Sent by the HTTP 1.1 and S3 download handlers when a group of bytes is
received for a specific request. This signal might be fired multiple
times for the same request, with partial data each time.
@kmike (Member), May 3, 2020:

There are 2 ways to return partial data:

  1. return the data received so far (in this case "partial" means that the response is not fully received yet)
  2. return the next received chunk

Would it be possible to clarify in the docs, saying that (2) is correct?

@elacuesta (Member, Author), May 10, 2020:

Updated in e1948b4

@kmike merged commit b183579 into scrapy:master on May 11, 2020
2 checks passed
@kmike (Member) commented May 11, 2020:

Thanks @elacuesta!
