
Adds the functionality to do HTTPS downloads behind proxies using an HTTP CONNECT #397

Merged
merged 14 commits into scrapy:master Dec 3, 2013

Conversation

@duendex duendex (Contributor) commented Sep 25, 2013

Implementation of #392

            self._tunnelReadyDeferred.callback(self._protocol)
        else:
            # Not sure if this is the best way to handle this error.
            raise SSLError

@dangra dangra (Member) commented Sep 25, 2013

It's probably better to return the proxy response, including the HTTP status: 403s, 500s, ... whatever the proxy returns that is not a 200.

@dangra dangra (Member) commented Sep 25, 2013

In case the response status line isn't a 200, I think we should restore dataReceived and feed it with the bytes; that should return an HTTP response with the correct status to the client.

Do you think we need a check on what bytes are discarded, just in case we switch transport before all proxy-related bytes are read?

@duendex duendex (Author, Contributor) commented Sep 25, 2013

We can't just feed the restored dataReceived with bytes, as the client is not expecting a response from the server yet (the connection is still 'pending' until I trigger self._tunnelReadyDeferred). I will commit a change where I just trigger the deferred and don't switch to TLS when the response from the proxy is not a 200. In that case, the client's request will be sent and the proxy will return an HTTP status 500 to the client.

@@ -5,10 +5,10 @@
 from urlparse import urldefrag

 from zope.interface import implements
-from twisted.internet import defer, reactor, protocol
+from twisted.internet import defer, reactor, protocol, ssl

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

unused import

@duendex duendex (Author, Contributor) commented Sep 26, 2013

removed.

 from twisted.web.http_headers import Headers as TxHeaders
 from twisted.web.iweb import IBodyProducer
-from twisted.internet.error import TimeoutError
+from twisted.internet.error import TimeoutError, SSLError

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

unused import

@duendex duendex (Author, Contributor) commented Sep 26, 2013

removed.

"""
# Restore the protocol dataReceived method.
self._protocol.dataReceived = self._protocolDataReceived
if bytes.find('200 Connection established') > 0:

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

We should only check for the response code, not the reason phrase (since that is bound to vary among different servers). Something like if bytes.startswith('200 ') should do.

@duendex duendex (Author, Contributor) commented Sep 26, 2013

Ok, but the response is of the form HTTP/1.x 200 Connection established. I will use a regex to match 'HTTP/1.x 200' at the beginning.

@pablohoffman pablohoffman (Member) commented Sep 27, 2013

Right, here is where reusing the twisted HTTP client protocol would help, but a regex should be fine for now.
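
A minimal sketch of the status-line check being agreed on here, close to what the PR converged on (the attribute name is illustrative):

import re

# Match only the HTTP version and status code of the proxy's reply to
# CONNECT; the reason phrase ('Connection established', ...) varies between
# proxy implementations, so it is deliberately left unchecked.
_responseMatcher = re.compile(r'HTTP/1\.. 200')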

        # allow the client to send the request and get a response from the
        # proxy, we will intercept the connectionLost message and restore
        # the connection.
        self._protocolConnectionLost = self._protocol.connectionLost

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

We should return some exception here (using self._tunnelReadyDeferred.errback() possibly with a custom exception) instead of just closing the connection.

@duendex duendex (Author, Contributor) commented Sep 26, 2013

Note that the connection is closed by the proxy and not by us when the tunnel can't be opened (at least that's Squid's behaviour). My idea here is that we restore the connection and allow the client to send its request to the proxy. The proxy will finally respond with an error that will be returned to the client. If we do an errback here, who will handle it?

@pablohoffman pablohoffman (Member) commented Sep 27, 2013

It would be propagated through the downloader middleware and ultimately handled by the spider/request errback, if any.

It would be better to return the actual response we got from the proxy, but in order to do that we'd need to do more parsing (like parsing HTTP headers), so this is probably something that could be improved when we port the patch to a cleaner approach using a twisted protocol.
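
A minimal sketch of the errback approach discussed in this thread, reusing the _responseMatcher sketched earlier; the exception name mirrors the error message this feature later produced, but the method body is illustrative rather than the PR's exact code:

class TunnelError(Exception):
    """An HTTP CONNECT tunnel could not be established by the proxy."""

def processProxyResponse(self, bytes):
    # Restore the original dataReceived before deciding how to proceed.
    self._protocol.dataReceived = self._protocolDataReceived
    if _responseMatcher.match(bytes):
        # The proxy accepted CONNECT: upgrade the raw connection to TLS.
        self._protocol.transport.startTLS(self._contextFactory,
                                          self._protocolFactory)
        self._tunnelReadyDeferred.callback(self._protocol)
    else:
        # Fail the tunnel deferred instead of silently closing the
        # connection; the failure propagates through the downloader
        # middlewares to the request errback, if any.
        self._tunnelReadyDeferred.errback(
            TunnelError('Could not open CONNECT tunnel.'))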


-    def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None):
+    def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None,

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

Unnecessary change, let's remove it to keep the commit cleaner.

@duendex duendex (Author, Contributor) commented Sep 26, 2013

My bad, I let pylint convince me...

@pablohoffman pablohoffman (Member) commented Sep 27, 2013

Totally understandable, see http://doc.scrapy.org/en/latest/contributing.html

It's also a good idea, in general, to separate aesthetic from functional changes, although you probably know that :)

        # Restore the connection to the proxy but don't open the tunnel.
        self.connect(self._protocolFactory, False)

    def connect(self, protocolFactory, openTunnel=True):

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

Is there really a need for the openTunnel argument? Since this is a tunneling class (TunnelingTCP4ClientEndpoint) it may as well always tunnel, right?

@duendex duendex (Author, Contributor) commented Sep 26, 2013

Actually it is needed when we restore the connection after the request to open the tunnel fails. Look at the connectionLost method.

@pablohoffman pablohoffman (Member) commented Sep 27, 2013

I'm in favor of removing that hacky retrial too.

@duendex duendex (Author, Contributor) commented Sep 27, 2013

Ok. I did the hacky retrial for the client to get some kind of HTTP response (an error) instead of getting a closed connection. I will remove the retrial and just raise an error using an errback for now.

            return self._ProxyAgent(endpoint)
        _, _, proxyHost, proxyPort, _ = _parse(proxy)
        scheme = _parse(request.url)[0]
        if scheme == 'https':

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

We should also support the old (insecure) mechanism that proxies HTTPS over plain HTTP. I discussed it with Dan and we propose to use an argument in the proxy URL to indicate that the old mechanism should be used.

For example, to use the new (recommended) mechanism the proxy URL would be:

http://localhost:8080

While for using the old mechanism it would be:

http://localhost:8080?noconnect

This if-statement would then check both the scheme and the presence of the noconnect argument.

@duendex duendex (Author, Contributor) commented Sep 26, 2013

I will implement this. It may make things a little bit uglier if we decide we have to modify the URL to remove the noconnect parameter from the request... I guess we don't want that parameter to reach the proxy. WDYT?

@pablohoffman pablohoffman (Member) commented Sep 27, 2013

It shouldn't reach the proxy, I think, since there's no need to pass the proxyArgs to the _TunnelingAgent method.

Remember this is for specifying the proxy in the configuration, not the request URL to visit.

@duendex duendex (Author, Contributor) commented Sep 27, 2013

I totally misread your comment. I thought the noconnect parameter was going to be added to the list of URLs to crawl... ?!?
It makes total sense now!
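
A minimal sketch of the agreed mechanism, assuming Python 2's urlparse as used elsewhere in the codebase (the function name is illustrative):

from urlparse import urlparse

def should_tunnel(request_url, proxy_url):
    # Tunnel via CONNECT only for HTTPS requests, and only when the proxy
    # URL does not carry the 'noconnect' flag selecting the old mechanism.
    omit_connect_tunnel = 'noconnect' in (urlparse(proxy_url).query or '')
    return urlparse(request_url).scheme == 'https' and not omit_connect_tunnel

print(should_tunnel('https://example.com/', 'http://localhost:8080'))            # True
print(should_tunnel('https://example.com/', 'http://localhost:8080?noconnect'))  # False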

@pablohoffman pablohoffman (Member) commented Sep 26, 2013

I left some comments. Two points to add:

  • it would be nice to add some tests with mitmproxy or a similar library (given that twisted doesn't have a proxy server with CONNECT support built in)
  • we should add support for proxy authentication (the Proxy-Authorization header; see the sketch after this list)
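
A minimal sketch of a CONNECT request carrying basic proxy authentication; the helper name and layout are illustrative, not this PR's exact code:

from base64 import b64encode

def connect_request(tunneled_host, tunneled_port, user=None, password=None):
    # CONNECT asks the proxy to open a raw tunnel to host:port; when
    # credentials are given they travel in a Proxy-Authorization header
    # on this request rather than on the tunneled request.
    lines = ['CONNECT %s:%d HTTP/1.1' % (tunneled_host, tunneled_port)]
    if user is not None:
        lines.append('Proxy-Authorization: Basic %s'
                     % b64encode('%s:%s' % (user, password)))
    return '\r\n'.join(lines) + '\r\n\r\n'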
@duendex duendex (Author, Contributor) commented Sep 26, 2013

I answered your comments and will make a commit tomorrow addressing most of them. I will then proceed to implement some tests and the proxy auth.

@duendex duendex (Author, Contributor) commented Sep 30, 2013

I have attended to your comments and implemented proxy authentication and tunnel switching via the noconnect parameter. I will proceed to write some tests now.

@duendex duendex closed this Oct 2, 2013
@duendex duendex reopened this Oct 2, 2013
@pablohoffman pablohoffman reviewed Oct 2, 2013

scrapy/tests/test_proxy_connect.py (Outdated)

    def start(self):
        self._proxy_process_handle = subprocess.Popen(
            ('mitmdump', '-p', '%d' % self._port, '--singleuser',

@pablohoffman pablohoffman (Member) commented Oct 2, 2013

Maybe we should run this as a python module? Like python -m mitmproxy.tool or something like that. It's typically more portable than running the raw command, because it only depends on the python path and not on the system path. Remember this should also run on Windows (although it's not tested so frequently there).
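
A sketch of that suggestion; note that 'mitmproxy.tool' below is the placeholder from the comment above, not a verified entry point, since mitmproxy's runnable modules have varied across releases:

import subprocess
import sys

# Launch the proxy through the current interpreter so the test only depends
# on the Python path, not on a 'mitmdump' executable on the system PATH.
proxy_process_handle = subprocess.Popen(
    [sys.executable, '-m', 'mitmproxy.tool', '-p', '8080'])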

@duendex duendex (Author, Contributor) commented Oct 2, 2013

Excellent point, I will correct that.

@pablohoffman pablohoffman (Member) commented Oct 2, 2013

Great. Also, did you consider using libmproxy on a thread instead of spawning a separate process?
http://mitmproxy.org/doc/scripting/libmproxy.html
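
A rough sketch of that in-process alternative, with class names taken from the linked libmproxy scripting docs; the constructor signatures here are assumptions, not verified against a specific release:

import threading

from libmproxy import dump, proxy

# Run the proxy on a daemon thread inside the test process instead of
# spawning a subprocess. The ProxyConfig/ProxyServer/DumpMaster usage is an
# assumption based on the era's scripting docs.
config = proxy.ProxyConfig()
server = proxy.ProxyServer(config, 8080)
master = dump.DumpMaster(server)
proxy_thread = threading.Thread(target=master.run)
proxy_thread.daemon = True
proxy_thread.start()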

@nramirezuy nramirezuy (Contributor) commented Oct 14, 2013

@duendex can you add the issue to the PR description? #392

@dangra dangra (Member) commented Nov 15, 2013

What is missing from this PR aside from rebasing onto master?

It seems to fulfill all the requirements to be merged, especially because it includes unit tests.

@r-marques r-marques commented Nov 15, 2013

Today I was testing and discussing this PR with @dangra on IRC.
I was trying to access an HTTPS URL through an HTTP proxy.

This is what I get from the scrapy shell:

scrapy shell "https://www.whatismyip.com/"

[default] DEBUG: Retrying <GET https://www.whatismyip.com/> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
[default] DEBUG: Retrying <GET https://www.whatismyip.com/> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
[default] DEBUG: Gave up retrying <GET https://www.whatismyip.com/> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.19.0', 'scrapy')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 492, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1350, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/commands/shell.py", line 50, in run
    shell.start(url=url, spider=spider)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.19.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]

HTTP URLs work fine.

Any idea on how to solve this error?

@r-marques r-marques commented Nov 15, 2013

I tried it now with another HTTP proxy and it is working fine.

@dustinthughes dustinthughes commented Nov 15, 2013

Our staff will kick the tires on this later today. We will let you know if it works. By the close of Monday morning we should have an answer on whether it works well with HTTPS connection proxies. This has been a real long-term issue with our use of the software.

@dangra dangra (Member) commented Nov 15, 2013

@dustinthughes that would be great. This PR is the closest we have come to merging CONNECT support into Scrapy.

@r-marques r-marques commented Nov 15, 2013

@dustinthughes thanks. It would be great to have better support for proxies in scrapy.

@duendex duendex (Author, Contributor) commented Nov 18, 2013

Hi, I am responsible for the PR. It would be really great to receive some feedback. If you find something not working properly I will do my best to fix it.

Cheers


@hfoffani hfoffani commented Nov 25, 2013

Hi @duendex. I'm doing some scraping through a proxy (Privoxy->Tor) against some sites that I need to log in to first (https).
If you like I can run some tests on them.

@duendex duendex (Author, Contributor) commented Nov 25, 2013

Definitely yes. Thanks!


@hfoffani hfoffani commented Nov 26, 2013

So far so good.
Tested https://ipdb.at (a what's-my-IP site) and https://www.linkedin.com/uas/login. I'm using Privoxy 3.0.21 over Tor v0.2.3.25 on OSX Mavericks, and Scrapy 0.21.0 with this PR's http11.py.
Now that it seems to work as expected I'm going to test it more heavily this week. I'll let you know how things go.
Thanks!

FYI. With the new http11.py:

2013-11-26 10:24:14+0100 [test] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-11-26 10:24:14+0100 [test] INFO: Closing spider (finished)
2013-11-26 10:24:14+0100 [test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 875,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 10590,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 11, 26, 9, 24, 14, 225776),
     'log_count/DEBUG': 12,
     'log_count/INFO': 3,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2013, 11, 26, 9, 24, 8, 462974)}
2013-11-26 10:24:14+0100 [test] INFO: Spider closed (finished)

With the old http11.py I get either:

2013-11-26 10:34:20+0100 [test] DEBUG: Gave up retrying <GET https://www.linkedin.com/uas/login> (failed 3 times): User timeout caused connection failure: Getting https://www.linkedin.com/uas/login took longer than 180 seconds..
2013-11-26 10:34:20+0100 [test] ERROR: Error downloading <GET https://www.linkedin.com/uas/login>
    Traceback (most recent call last):
      File "xxx/twisted/internet/defer.py", line 470, in cancel
... etc ...
      File "xxx/scrapy/core/downloader/handlers/http11.py", line 89, in _cb_timeout
        raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
    twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.linkedin.com/uas/login took longer than 180 seconds..

2013-11-26 10:34:20+0100 [test] INFO: Closing spider (finished)
2013-11-26 10:34:20+0100 [test] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.TimeoutError': 3,
     'downloader/request_bytes': 2057,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 4,
     'downloader/response_bytes': 1671,
     'downloader/response_count': 1,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 11, 26, 9, 34, 20, 288418),
     'log_count/DEBUG': 10,
     'log_count/ERROR': 1,
     'log_count/INFO': 12,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 11, 26, 9, 25, 18, 205478)}
2013-11-26 10:34:20+0100 [test] INFO: Spider closed (finished)

or:

2013-11-25 19:53:05+0100 [test] DEBUG: Crawled (200) <GET https://ipdb.at> (referer: None) ['partial']
2013-11-25 19:53:05+0100 [test] INFO: Closing spider (finished)
2013-11-25 19:53:05+0100 [test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 280,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 48,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 11, 25, 18, 53, 5, 297137),
     'log_count/DEBUG': 11,
     'log_count/INFO': 3,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 11, 25, 18, 52, 49, 876982)}
@amrali amrali commented Dec 2, 2013

I'm lucky: just when I got around to actually needing this feature. Any idea when this PR will be merged? @dangra?

@dangra dangra (Member) commented Dec 2, 2013

@duendex : please, rebase the PR so travis-ci can run tests and let's get this merged :)

@duendex duendex (Author, Contributor) commented Dec 2, 2013

PR rebased.

@dangra dangra (Member) commented Dec 3, 2013

Tests are still failing according to travis-ci: https://travis-ci.org/scrapy/scrapy/builds/14836547

@duendex duendex (Author, Contributor) commented Dec 3, 2013

I found a couple of issues that were affecting the test outcomes and fixed them. Currently there is still one test case failing on Travis that does not fail locally, whether I run the tests manually or via tox. Any help regarding what differences there may be between Travis and my local environment is more than welcome.

duendex added 2 commits Dec 3, 2013
…ids triggering the creation of a connect tunnel when downloading from a site with https scheme.
@dangra dangra (Member) commented Dec 3, 2013

LGTM. /cc @pablohoffman

@pablohoffman pablohoffman (Member) commented Dec 3, 2013

LGTM - merge, merge, merge!

dangra added a commit that referenced this pull request Dec 3, 2013

Adds the functionality to do HTTPS downloads behind proxies using an HTTP CONNECT

@dangra dangra merged commit 72543c9 into scrapy:master Dec 3, 2013

1 check passed: default, The Travis CI build passed
@pablohoffman pablohoffman (Member) commented Dec 4, 2013

This should go to the highlights of Scrapy 0.22 release notes! :)

@JakeAustwick JakeAustwick commented Sep 12, 2014

I'm still getting errors in regard to this.

ERROR: Error downloading <GET https://www.xxx.com>: Could not open CONNECT tunnel.

I'm using a proxy, with the Proxy-Authorization header set. This works fine on non-HTTPS URLs. Any suggestions?

scrapy version
Scrapy 0.24.2
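
For reference, a minimal sketch of how a proxy with credentials is typically configured per request in Scrapy of that era; HttpProxyMiddleware parses the user:pass out of the proxy URL and sets the Proxy-Authorization header (the URLs below are placeholders):

from scrapy.http import Request
from scrapy.spider import Spider

class ProxiedSpider(Spider):
    name = 'proxied'

    def start_requests(self):
        # Placeholder proxy URL; the embedded credentials become the
        # Proxy-Authorization header used for the CONNECT request.
        yield Request('https://www.example.com/',
                      meta={'proxy': 'http://user:pass@localhost:8080'})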
@JakeAustwick JakeAustwick commented Sep 12, 2014

Okay, so just upgraded to Scrapy 0.24.4 and this works.
