Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Port downloader to use twisted.web.client.Agent (HTTP/1.1 client) #109

Closed
wants to merge 15 commits into from

4 participants

@redapple

Lately, I've been fiddling with Downloaders, wanting to use HTTP/1.1 and hopefully later benefit from persistent connections once they're officially in Twisted.

So here's my attempt at refactoring HttpDownloadHandler using twisted.web.client.Agent (which uses HTTP/1.1 starting from Twisted 10?) with the Scrapy specificities I could understand in ScrapyHTTPClientFactory.

For now, I added the code to scrapy/core/downloader/handlers/http11.py
but this could be moved to scrapy/contrib/downloaders/ or something.

The test cases I added in test_downloader_handlers.py are OK on my dev machine.

The code still has debug traces.

What you think?

@dangra
Owner

overall it is looking great!

my notes:

  • WebClientContextFactory needs to be configurable as it is after #82
  • HEAD is handled as a special case, is this really needed?
  • twisted.web._newclient.HTTPClientParser (used by HTTP11ClientFactory) handles chunked transfers clashing with scrapy.contrib.downloadermiddleware.chunked.
  • HTTPClientParser is a bit strict about content-length headers, it could cause problems crawling badly implemented servers.
  • HTTP11ClientFactory doesn't handle redirections as it did in HTTPClientFactory. good!

in summary:

  • lot of code reused from twisted
  • clear separation of agent, protocol and parser
  • still extensible
  • no persistent connections yet but the path looks clean
  • ProxyAgent doesn't support SSL proxies (CONNECT) but looks great as future improvement to twisted
@redapple

Thanks for your comments.

As for the HEAD method special case, a test case was blocked since ScrapyAgentResponseBodyReader was not receiving any bytes (as could be expected...). So I added this very ugly test to pass the test.
But it's for sure better to test the expected length of the body instead
http://twistedmatrix.com/documents/current/web/howto/client.html#auto4
http://twistedmatrix.com/documents/current/api/twisted.web.iweb.IResponse.html

Indeed, redirections are not handled by HTTP11ClientFactory, hence RedirectAgent which you can tweak to handle badly formatted redirections (which I do in a custom/simpler non-Scrapy-based crawler)
http://twistedmatrix.com/documents/current/web/howto/client.html#auto6

Regarding chunked-encoding transfer, does scrapy.contrib.downloadermiddleware.chunked do something that twisted.web._newclient.HTTPClientParser doesn't?
Do you think scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware could simply be removed from DOWNLOADER_MIDDLEWARES_BASE?

In my non-Scrapy-based crawler I've also been successfully using ContentDecoderAgent with GzipDecoder

from twisted.internet import reactor
from twisted.web.client import Agent, ContentDecoderAgent, GzipDecoder
...
agent = ContentDecoderAgent(
    Agent(reactor),
    [("gzip", GzipDecoder)])

Wonder how this could fit in Scrapy.

@dangra
Owner

thanks for submitting this pull request and keep working on it!

As for HEAD method, I prefer to test the body length too instead of handling it specially in downloader handler.

About RedirectAgent, Scrapy already have a downloader middleware to handle redirections scrapy.contrib.downloadermiddleware.redirect and it is the canonical way to implement this feature because it can be easily disabled, extended and tested. I only foresee using two twisted agents in Scrapy, they are basic Agent and ProxyAgent because they offer the minimal set of features that doesn't clash with scrapy builtins.

Regarding chunked-encoding transfer, scrapy.contrib.downloadermiddleware.chunked can't handle all the cases that HTTPClientParser does trough _ChunkedTransferDecoder as reported in https://groups.google.com/d/topic/scrapy-developers/Ss2XDTRdSb4/discussion. So using twisted chunked decoder instead of scrapy middleware is a big plus in this case.

For gzip and others content-encodings there is scrapy.contrib.downloadermiddleware.httpcompression that works very well so far, do you know of case not handled by this middleware?

@redapple

Refactored a bit. Explicitly removed Trasnfer-Encoding header.
I admit I haven't really tested this patch with chunked encoded bodies from real servers.

As ProxyAgent doesn't handle CONNECT, I guess it's better to restrict the use of Twisted's Agent for http scheme only (why I removed the 443 port part)

As for your question about gzip and scrapy.contrib.downloadermiddleware.httpcompression, I simply haven't played with Scrapy enough to have an opinion.

@dangra
Owner
@dangra
Owner

hi @redapple, I have been working on your patch, see https://github.com/dangra/scrapy/tree/http11

as you are the original developer for this feature, can I ask you for a quick review?

It requires twisted 12.1 because it is using HTTPConnectionPool for persistent connections

compare link dangra@master...http11

@redapple

hi @dangra,
Sure, I'll review it.
I personally had problems with persistent connections and Twisted (outside of Scrapy), but I'll try with 12.1 and your code.

@redapple

Hi,
code looks nice and works nicely :)
I've tested it with Twisted trunk.
Checked with Wireshark. Got (much) less TCP connections than before.

My only comment is on the body producer which could be set to None when request body is empty, in order not to send an extra "Content-length: 0" header

@dangra
Owner

thanks for reviewing it!

I agree on your comment to body producer, wrote a simpler alternative line of code that follows your idea.

@dangra
Owner
@dangra
Owner

Can you merge my branch into yours and push to github, that's the only way my commits will show in this pull request. otherwise I will have to open another PR.

@redapple

if you tell me how to, sure :)
I had to separately clone your fork of scrapy to get your code.
Surely there's a more gitonic way of doing it... I'm pretty ignorant in git/github

@redapple

Hope it's ok.

Here's what I did:
git remote add dangra git://github.com/dangra/scrapy.git
git fetch dangra
git merge remotes/dangra/http11 http11
git push origin http11

@dangra
Owner

looks good to me, thanks.

there is still a broken test case that I want to fix before merging this into master branch.

@paulproteus

I did some work refreshing these patches against current origin/master.

https://github.com/paulproteus/scrapy/tree/revise-pullrequest-109

Some notes:

See 3077ac8 for a fix where Http11DownloadHandler needs to accept second argument, since that's how the tests use it.

See 888dfad for a fix where the call to log.err() creates an exception in the log that test runner notices, so tests that call this method begin to fail. For now I've disabled the call to log.err(). If we want to keep logging the error, we'll need to call flushLoggedErrors() -- see http://twistedmatrix.com/documents/current/core/howto/trial.html#auto11 . (The test this makes fail is scrapy.tests.test_downloader_handlers.Http11TestCase.test_timeout_download_from_spider )

Commit bdd2a1b (rebased from the original, but otherwise the same) makes many tests fail because there is work left in the reactor. One test you can see this with is scrapy.tests.test_engine.EngineTest.test_crawler

For now I can't promise I'll have time to figure out what's going on with leaving the reactor unclean, but I thought I'd leave a comment with what I have found in a few hours of looking into this pull request!

-- Asheesh.

@dangra
Owner

hi @paulproteus, thanks for following up on this

I have a fork myself on this pull request on https://github.com/dangra/scrapy/tree/http11

It is rebased on top of master too and include a fix for unclean reactor messages, basically trial complains about HTTPConnectionPool not cleaned before closing reactor dangra/scrapy@698105b

I am OK with removing the log.err line in _agentrequest_failed callback, it is there just to help debugging new issues with http11 downloader, isn't something permanent.

As said before in previous PR comments, the only test case failing is scrapy.tests.test_downloader_handlers.Http11TestCase.test_timeout_download_from_spider

you can see a full test run with all tests passing except this here http://travis-ci.org/#!/dangra/scrapy/jobs/2369075

$ trial scrapy.tests.test_downloader_handlers.Http11TestCase
scrapy.tests.test_downloader_handlers
  Http11TestCase
    test_download ...                                                      [OK]
    test_download_head ...                                                 [OK]
    test_host_header_not_in_request_headers ...                            [OK]
    test_host_header_seted_in_request_headers ...                          [OK]
    test_payload ...                                                       [OK]
    test_redirect_status ...                                               [OK]
    test_redirect_status_head ...                                          [OK]
    test_timeout_download_from_spider ...                               [ERROR]

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.

scrapy.tests.test_downloader_handlers.Http11TestCase.test_timeout_download_from_spider
-------------------------------------------------------------------------------
Ran 8 tests in 0.087s

FAILED (errors=1, successes=7

by this time I was hoping this twisted ticket to be closed but it isn't http://twistedmatrix.com/trac/ticket/4330
I have been trying workarounds by extending Agents and even HTTPConnectionPool but I always ends discarding it because I don't like the results

@dangra
Owner

I changed the pull request HEAD to my fork using:

curl --user "dangra" --request POST --data '{"issue": "109", "head": "dangra:http11", "base": "master"}' https://api.github.com/repos/scrapy/scrapy/pulls
@pablohoffman

FYI: there is a branch http11 created on the main repo with the latest code for this.

@dangra
Owner

Yes! we have it ready, just need to backport some twisted bits to support old releases.

@redapple

Awesome @dangra. You rock!

@dangra
Owner

@redapple: you did the initial work, you are the rock star here :)

@redapple

Thanks ;-)

@dangra
Owner

moved to PR #318

@dangra dangra closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 6, 2013
  1. @redapple @dangra

    Use twisted.web.client.Agent for download requests (use of HTTP/1.1)

    redapple authored dangra committed
    Adds http11.HttpDownloadHandler in scrapy.core.downloader.handlers
  2. @redapple @dangra

    Renamed downloader to Http11DownloadHandler and some refactoring

    redapple authored dangra committed
    Only for HTTP, not HTTPS
    Test on expected body length instead of request method (HEAD case)
  3. @redapple @dangra

    Restore handling of HTTPS

    redapple authored dangra committed
  4. @dangra

    http11 cleanup

    dangra authored
  5. @dangra
  6. @dangra

    enable persistent connections

    dangra authored
  7. @dangra

    move ssl context factory to its own module and implement a non-ssl ve…

    dangra authored
    …rsion that warns about pyopenssl support
  8. @dangra
  9. @dangra
  10. @dangra
  11. @dangra

    adapt for singletons removals

    dangra authored
  12. @dangra
  13. @dangra
Commits on Jun 3, 2013
  1. @dangra
Commits on Jun 4, 2013
  1. @dangra
Something went wrong with that request. Please try again.