Port downloader to use twisted.web.client.Agent (HTTP/1.1 client) #109

wants to merge 15 commits into master

4 participants


Lately, I've been fiddling with Downloaders, wanting to use HTTP/1.1 and hopefully later benefit from persistent connections once they're officially in Twisted.

So here's my attempt at refactoring HttpDownloadHandler using twisted.web.client.Agent (which uses HTTP/1.1 starting from Twisted 10?), porting the Scrapy-specific behavior I could identify in ScrapyHTTPClientFactory.

For now, I added the code to scrapy/core/downloader/handlers/http11.py
but this could be moved to scrapy/contrib/downloaders/ or something.

The test cases I added in test_downloader_handlers.py are OK on my dev machine.

The code still has debug traces.

What do you think?

Scrapy project member

overall it is looking great!

my notes:

  • WebClientContextFactory needs to be configurable, as it is after #82
  • HEAD is handled as a special case; is this really needed?
  • twisted.web._newclient.HTTPClientParser (used by HTTP11ClientFactory) handles chunked transfers, which clashes with scrapy.contrib.downloadermiddleware.chunked.
  • HTTPClientParser is a bit strict about Content-Length headers, which could cause problems crawling badly implemented servers.
  • HTTP11ClientFactory doesn't handle redirections as HTTPClientFactory did. Good!

in summary:

  • a lot of code is reused from Twisted
  • clear separation of agent, protocol and parser
  • still extensible
  • no persistent connections yet but the path looks clean
  • ProxyAgent doesn't support SSL proxies (CONNECT), but that looks like a great future improvement for Twisted

Thanks for your comments.

As for the HEAD method special case, a test case was blocking because ScrapyAgentResponseBodyReader never received any bytes (as could be expected...). So I added this very ugly special case to make the test pass.
But it would certainly be better to test the expected body length instead.

Indeed, redirections are not handled by HTTP11ClientFactory, hence RedirectAgent, which you can tweak to handle badly formatted redirections (which I do in a custom, simpler non-Scrapy-based crawler).

Regarding chunked-encoding transfer, does scrapy.contrib.downloadermiddleware.chunked do something that twisted.web._newclient.HTTPClientParser doesn't?
Do you think scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware could simply be removed from DOWNLOADER_MIDDLEWARES_BASE?

In my non-Scrapy-based crawler I've also been successfully using ContentDecoderAgent with GzipDecoder:

from twisted.internet import reactor
from twisted.web.client import Agent, ContentDecoderAgent, GzipDecoder

# ContentDecoderAgent wraps a base agent and transparently decodes
# the listed content-encodings
agent = ContentDecoderAgent(Agent(reactor), [("gzip", GzipDecoder)])

Wonder how this could fit in Scrapy.

Scrapy project member

thanks for submitting this pull request and keep working on it!

As for the HEAD method, I too prefer to test the body length instead of handling it specially in the downloader handler.

About RedirectAgent, Scrapy already has a downloader middleware to handle redirections, scrapy.contrib.downloadermiddleware.redirect, and it is the canonical way to implement this feature because it can be easily disabled, extended and tested. I only foresee Scrapy using two Twisted agents, the basic Agent and ProxyAgent, because they offer the minimal set of features that doesn't clash with Scrapy built-ins.

Regarding chunked transfer encoding, scrapy.contrib.downloadermiddleware.chunked can't handle all the cases that HTTPClientParser does through _ChunkedTransferDecoder, as reported in https://groups.google.com/d/topic/scrapy-developers/Ss2XDTRdSb4/discussion. So using Twisted's chunked decoder instead of the Scrapy middleware is a big plus in this case.
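To make the comparison concrete: chunked transfer encoding frames the body as a sequence of hex-sized chunks terminated by a zero-size chunk. A toy decoder for a complete, well-formed message might look like the sketch below; a real decoder like Twisted's _ChunkedTransferDecoder additionally handles chunk extensions, trailers, and data arriving in arbitrary pieces, which is where a simpler middleware can fall short.

```python
def decode_chunked(data: bytes) -> bytes:
    # Toy HTTP/1.1 chunked-transfer decoder for a complete message.
    # Each chunk is "<hex size>[;extensions]\r\n<data>\r\n"; a chunk
    # of size 0 marks the end of the body.
    body = b""
    rest = data
    while True:
        line, _, rest = rest.partition(b"\r\n")
        size = int(line.split(b";")[0], 16)  # chunk-size is hex, may carry extensions
        if size == 0:
            break
        body += rest[:size]
        rest = rest[size + 2:]  # skip chunk data plus its trailing CRLF
    return body
```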

For gzip and other content encodings there is scrapy.contrib.downloadermiddleware.httpcompression, which has worked very well so far; do you know of a case not handled by this middleware?


Refactored a bit. Explicitly removed the Transfer-Encoding header.
I admit I haven't really tested this patch with chunked-encoded bodies from real servers.

As ProxyAgent doesn't handle CONNECT, I guess it's better to restrict the use of Twisted's Agent to the http scheme only (which is why I removed the port 443 handling).

As for your question about gzip and scrapy.contrib.downloadermiddleware.httpcompression, I simply haven't played with Scrapy enough to have an opinion.

Scrapy project member

hi @redapple, I have been working on your patch, see https://github.com/dangra/scrapy/tree/http11

as you are the original developer for this feature, can I ask you for a quick review?

It requires Twisted 12.1 because it uses HTTPConnectionPool for persistent connections.

Compare link: dangra@master...http11


hi @dangra,
Sure, I'll review it.
I personally had problems with persistent connections and Twisted (outside of Scrapy), but I'll try with 12.1 and your code.


code looks nice and works nicely :)
I've tested it with Twisted trunk.
Checked with Wireshark: far fewer TCP connections than before.

My only comment is on the body producer, which could be set to None when the request body is empty, so as not to send an extra "Content-Length: 0" header.

Scrapy project member

thanks for reviewing it!

I agree with your comment on the body producer; I wrote a simpler alternative line of code that follows your idea.

Scrapy project member

Can you merge my branch into yours and push to GitHub? That's the only way my commits will show in this pull request; otherwise I will have to open another PR.


if you tell me how to, sure :)
I had to separately clone your fork of scrapy to get your code.
Surely there's a more gitonic way of doing it... I'm pretty ignorant about git/GitHub.


Hope it's ok.

Here's what I did:
git remote add dangra git://github.com/dangra/scrapy.git
git fetch dangra
git checkout http11
git merge dangra/http11
git push origin http11

Scrapy project member

looks good to me, thanks.

there is still a broken test case that I want to fix before merging this into master branch.


I did some work refreshing these patches against current origin/master.


Some notes:

See 3077ac8 for a fix where Http11DownloadHandler needs to accept a second argument, since that's how the tests use it.

See 888dfad for a fix where the call to log.err() creates an exception in the log that the test runner notices, so tests that call this method begin to fail. For now I've disabled the call to log.err(). If we want to keep logging the error, we'll need to call flushLoggedErrors() -- see http://twistedmatrix.com/documents/current/core/howto/trial.html#auto11 . (The test this makes fail is scrapy.tests.test_downloader_handlers.Http11TestCase.test_timeout_download_from_spider.)

Commit bdd2a1b (rebased from the original, but otherwise the same) makes many tests fail because there is work left in the reactor. One test you can see this with is scrapy.tests.test_engine.EngineTest.test_crawler

For now I can't promise I'll have time to figure out what's going on with leaving the reactor unclean, but I thought I'd leave a comment with what I have found in a few hours of looking into this pull request!

-- Asheesh.

Scrapy project member

hi @paulproteus, thanks for following up on this

I have a fork myself on this pull request on https://github.com/dangra/scrapy/tree/http11

It is rebased on top of master too and includes a fix for the unclean-reactor messages; basically trial complains about HTTPConnectionPool not being cleaned up before closing the reactor: dangra/scrapy@698105b

I am OK with removing the log.err line in the _agentrequest_failed callback; it is there just to help debug new issues with the http11 downloader, it isn't permanent.

As said before in previous PR comments, the only test case failing is scrapy.tests.test_downloader_handlers.Http11TestCase.test_timeout_download_from_spider

you can see a full test run with all tests passing except this one here: http://travis-ci.org/#!/dangra/scrapy/jobs/2369075

$ trial scrapy.tests.test_downloader_handlers.Http11TestCase
    test_download ...                                                      [OK]
    test_download_head ...                                                 [OK]
    test_host_header_not_in_request_headers ...                            [OK]
    test_host_header_seted_in_request_headers ...                          [OK]
    test_payload ...                                                       [OK]
    test_redirect_status ...                                               [OK]
    test_redirect_status_head ...                                          [OK]
    test_timeout_download_from_spider ...                               [ERROR]

Traceback (most recent call last):
Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.

Ran 8 tests in 0.087s

FAILED (errors=1, successes=7)

By this time I was hoping this Twisted ticket would be closed, but it isn't: http://twistedmatrix.com/trac/ticket/4330
I have been trying workarounds by extending agents and even HTTPConnectionPool, but I always end up discarding them because I don't like the results.

Scrapy project member

I changed the pull request HEAD to my fork using:

curl --user "dangra" --request POST --data '{"issue": "109", "head": "dangra:http11", "base": "master"}' https://api.github.com/repos/scrapy/scrapy/pulls
Scrapy project member

FYI: there is a branch http11 created on the main repo with the latest code for this.

Scrapy project member

Yes! We have it ready; we just need to backport some Twisted bits to support old releases.


Awesome @dangra. You rock!

Scrapy project member

@redapple: you did the initial work, you are the rock star here :)


Thanks ;-)

Scrapy project member

moved to PR #318

@dangra dangra closed this Jun 4, 2013