Adds the functionality to do HTTPS downloads behind proxies using an HTTP CONNECT. Implementation of #392 #397
Conversation
```
    self._tunnelReadyDeferred.callback(self._protocol)
else:
    # Not sure if this is the best way to handle this error.
    raise SSLError
```
dangra
Sep 25, 2013
Member
It's probably better to return the proxy response including the http status: 403s, 500s, ... whatever proxy returns that is not a 200.
dangra
Sep 25, 2013
Member
In case the response status line isn't a 200, I think we should restore `dataReceived` and feed it with `bytes`; that should return an HTTP response with the correct status to the client.
Do you think we need a check on what `bytes` are discarded, just in case we switch transport before all proxy-related bytes are read?
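To make the suggestion concrete, here is a minimal sketch of that flow, using the attribute names from the diff; the method name and the `_switchToTLS` helper are assumptions, and duendex's reply below explains why the replay is not that simple:

```python
def processProxyResponse(self, bytes):
    # Restore the original dataReceived; from here on the HTTP client
    # protocol handles the stream itself.
    self._protocol.dataReceived = self._protocolDataReceived
    if bytes.startswith('HTTP/1.1 200'):
        self._switchToTLS()  # hypothetical helper: wrap the transport in TLS
        self._tunnelReadyDeferred.callback(self._protocol)
    else:
        # Replay the proxy's answer (403, 500, ...) through the restored
        # protocol so the client receives the real status.
        self._protocol.dataReceived(bytes)
```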
duendex
Sep 25, 2013
Author
Contributor
We can't just feed the restored `dataReceived` with `bytes` as the client is not expecting a response from the server yet (the connection is still 'pending' until I trigger `self._tunnelReadyDeferred`). I will commit a change where I just trigger the deferred and do not switch to TLS when the response from the proxy is not a 200. In that case, the request from the client will be sent and the proxy will return an HTTP 500 status to the client.
```
@@ -5,10 +5,10 @@
 from urlparse import urldefrag

 from zope.interface import implements
-from twisted.internet import defer, reactor, protocol
+from twisted.internet import defer, reactor, protocol, ssl
```
pablohoffman
Sep 26, 2013
Member
unused import
duendex
Sep 26, 2013
Author
Contributor
removed.
```
 from twisted.web.http_headers import Headers as TxHeaders
 from twisted.web.iweb import IBodyProducer
-from twisted.internet.error import TimeoutError
+from twisted.internet.error import TimeoutError, SSLError
```
pablohoffman
Sep 26, 2013
Member
unused import
duendex
Sep 26, 2013
Author
Contributor
removed.
""" | ||
# Restore the protocol dataReceived method. | ||
self._protocol.dataReceived = self._protocolDataReceived | ||
if bytes.find('200 Connection established') > 0: |
pablohoffman
Sep 26, 2013
Member
We should only check for the response code, not the reason (since that is bound to vary among different servers). Something like `if bytes.startswith('200 ')` should do.
duendex
Sep 26, 2013
Author
Contributor
Ok, but the response is of the form `HTTP/1.x 200 Connection established`. I will use a regex to match `'HTTP/1.x 200'` at the beginning.
pablohoffman
Sep 27, 2013
Member
Right, here is where reusing the twisted HTTP client protocol would help, but a regex should be fine for now.
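A sketch of that check: anchor the match at the start of the proxy's answer and ignore the reason phrase, which varies between proxies (the names here are illustrative):

```python
import re

# Accept any HTTP/1.x answer with status code 200; the reason phrase
# ('Connection established', 'OK', ...) differs between proxies.
_TUNNEL_OK_RE = re.compile(r'HTTP/1\.. 200')

def tunnel_established(response_bytes):
    # re.match anchors at the beginning of the string.
    return _TUNNEL_OK_RE.match(response_bytes) is not None
```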
```
    # allow the client to send the request and get a response from the
    # proxy, we will intercept the connectionLost message and restore
    # the connection.
    self._protocolConnectionLost = self._protocol.connectionLost
```
pablohoffman
Sep 26, 2013
Member
We should return some exception here (using `self._tunnelReadyDeferred.errback()`, possibly with a custom exception) instead of just closing the connection.
duendex
Sep 26, 2013
Author
Contributor
Note that the connection is closed by the proxy and not by us (at least that's Squid's behaviour) when the tunnel can't be opened. My idea here is that we restore the connection and allow the client to send its request to the proxy. The proxy will finally respond with an error that will be returned to the client. If we do an errback here, who will handle it?
pablohoffman
Sep 27, 2013
Member
It would be propagated through the downloader middleware and ultimately handled by the spider/request errback, if any.
It would be better to return the actual response we got from the proxy, but in order to do that we'd need to do more parsing (like parsing HTTP headers), so this is probably something that could be improved when we port the patch to a cleaner approach using a twisted protocol.
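Sketching the idea under discussion: `TunnelError` stands in for the proposed custom exception, and the failure would surface to the spider roughly like this (the spider-side names are illustrative):

```python
class TunnelError(Exception):
    """Raised when the proxy refuses to open the CONNECT tunnel."""

# In the endpoint, instead of silently dropping the connection:
#     self._tunnelReadyDeferred.errback(
#         TunnelError('Could not open CONNECT tunnel.'))

# In a spider, the failure reaches the request errback, if one is set:
#     Request(url, callback=self.parse, errback=self.on_tunnel_error)
def on_tunnel_error(self, failure):
    if failure.check(TunnelError):  # returns the matched class, or None
        self.log('CONNECT tunnel failed: %s' % failure.getErrorMessage())
```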
```
-    def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None):
+    def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None,
```
pablohoffman
Sep 26, 2013
Member
Unnecessary change, let's remove to keep the commit cleaner.
duendex
Sep 26, 2013
Author
Contributor
My bad, I let pylint convince me...
pablohoffman
Sep 27, 2013
Member
Totally understandable, see http://doc.scrapy.org/en/latest/contributing.html
It's also a good idea, in general, to separate aesthetic from functional changes, although you probably know that :)
```
    # Restore the connection to the proxy but don't open the tunnel.
    self.connect(self._protocolFactory, False)

def connect(self, protocolFactory, openTunnel=True):
```
pablohoffman
Sep 26, 2013
Member
Is there really a need for the `openTunnel` argument? Since this is a tunneling class (`TunnelingTCP4ClientEndpoint`), it may as well always tunnel, right?
duendex
Sep 26, 2013
Author
Contributor
Actually it is needed when we restore the connection after the request to open the tunnel fails. Look at the `connectionLost` method.
pablohoffman
Sep 27, 2013
Member
I'm in favor of removing that hacky retrial too.
duendex
Sep 27, 2013
Author
Contributor
Ok. I did the hacky retrial so that the client would get some kind of HTTP response (an error) instead of a closed connection. I will remove the retrial and just raise an error using an errback for now.
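With the retrial removed, `connectionLost` could shrink to something like this rough sketch (attribute names follow the diff; everything else is an assumption, reusing the `TunnelError` idea from above):

```python
def connectionLost(self, reason):
    # No reconnection attempt anymore: if the proxy closed the
    # connection before the tunnel was ready, fail the deferred so
    # the error propagates up to the request errback.
    if not self._tunnelReadyDeferred.called:
        self._tunnelReadyDeferred.errback(
            TunnelError('Could not open CONNECT tunnel.'))
    # Delegate to the original protocol implementation.
    self._protocolConnectionLost(reason)
```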
```
        return self._ProxyAgent(endpoint)
    _, _, proxyHost, proxyPort, _ = _parse(proxy)
    scheme = _parse(request.url)[0]
    if scheme == 'https':
```
pablohoffman
Sep 26, 2013
Member
We should also support the old (insecure) mechanism that proxies HTTPS over plain HTTP. I discussed it with Dan and we propose to use an argument in the proxy url to indicate that the old mechanism should be used.
For example, to use the new (recommended) mechanism the proxy url would be:
`http://localhost:8080`
While for using the old mechanism it would be:
`http://localhost:8080?noconnect`
This `if` would then check both the scheme and the presence of the `noconnect` argument.
duendex
Sep 26, 2013
Author
Contributor
I will implement this. It may make things a little bit uglier if we decide we have to modify the URL to remove the `noconnect` parameter from the request... I guess we don't want that parameter to reach the proxy. WDYT?
pablohoffman
Sep 27, 2013
Member
It shouldn't reach the proxy, I think, since there's no need to pass the `proxyArgs` to the `_TunnelingAgent` method.
Remember this is to specify the proxy in the configuration, not the request url to visit.
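The scheme-plus-argument check could look roughly like this; `_should_tunnel` is a hypothetical helper (the real patch would likely reuse the module's existing `_parse`):

```python
from urlparse import urlparse, parse_qs  # Python 2, as in this codebase

def _should_tunnel(request_url, proxy_url):
    # Open a CONNECT tunnel only for https requests, unless the proxy
    # URL opts out with a ?noconnect query argument. The argument is
    # consumed here and never forwarded to the proxy itself.
    scheme = urlparse(request_url).scheme
    proxy_args = parse_qs(urlparse(proxy_url).query)
    return scheme == 'https' and 'noconnect' not in proxy_args
```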
duendex
Sep 27, 2013
Author
Contributor
I totally misread your comment. I thought the `noconnect` parameter was going to be added to the list of URLs to crawl... ?!?
It makes total sense now!
I left some comments. Two additional points to add:
I answered your comments and will make a commit tomorrow addressing most of them. I will then proceed to implement some tests and the proxy auth.
I have attended to your comments and implemented proxy authentication and tunnel switching by using the `noconnect` argument.
```
def start(self):
    self._proxy_process_handle = subprocess.Popen(
        ('mitmdump', '-p', '%d' % self._port, '--singleuser',
```
pablohoffman
Oct 2, 2013
Member
Maybe we should run this as a python module, like `python -m mitmproxy.tool` or something like that. It's typically more portable than running the raw command, because it only depends on the python path and not the system path. Remember this should also run on windows (although it's not tested so frequently there).
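For example, something along these lines; the module path just echoes the suggestion above and is an unverified assumption (mitmproxy's importable entry point depends on the release):

```python
import sys
import subprocess

# Run the tool through the current interpreter so it resolves via the
# Python path instead of the system PATH, which is more portable,
# notably on Windows. 'mitmproxy.tool' is a placeholder module name.
proxy_process = subprocess.Popen(
    (sys.executable, '-m', 'mitmproxy.tool', '-p', '%d' % port))
```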
duendex
Oct 2, 2013
Author
Contributor
Excellent point, I will correct that.
pablohoffman
Oct 2, 2013
Member
Great. Also, did you consider using `libmproxy` on a thread instead of spawning a separate process?
http://mitmproxy.org/doc/scripting/libmproxy.html
What is missing from this PR aside from rebasing to master? It seems to fulfill all requirements to be merged, especially since it includes unit tests.
Today I was testing and discussing this PR with @dangra on IRC. This is what I get from the scrapy shell:
HTTP URLs work fine. Any idea on how to solve this error?
I tried it now with another HTTP proxy and it is working fine.
Our staff will give the tires a kick with this later today. We will let you know if it works. By the close of Monday morning we should have an answer on whether it works well with HTTPS connection proxies. This has been a real long-term issue with our use of the software.
@dustinthughes that would be great. This PR is the closest we have come to merging CONNECT support into Scrapy.
@dustinthughes thanks. It would be great to have better support for proxies in Scrapy.
Hi, I am responsible for the PR. It would be really great to receive some feedback. Cheers.
Hi @duendex. I'm doing some scraping with a proxy (Privoxy->Tor) to some sites that I need to log in to first (https).
Definitely yes. Thx.
So far so good. FYI. With the new http11.py:
With the old http11.py either:
or:
I'm lucky: just when I got around to actually needing this feature. Any idea when this PR will be merged? @dangra?
@duendex: please rebase the PR so travis-ci can run the tests, and let's get this merged :)
… unnecessary comments.
…oconnect parameter to the URL of the proxy.
… as a separate process.
PR rebased.
Tests are still failing according to travis-ci: https://travis-ci.org/scrapy/scrapy/builds/14836547
… scheme and failed with PR 397.
I found a couple of issues that were affecting the tests' outcome and fixed them. Currently there is still one test case failing on Travis but not failing locally, whether running the tests manually or using tox. Any help regarding what differences there may be between Travis and my local environment is more than welcome.
…ids triggering the creation of a connect tunnel when downloading from a site with https scheme.
LGTM. /cc @pablohoffman
LGTM - merge, merge, merge!
Adds the functionality to do HTTPS downloads behind proxies using an HTTP CONNECT. Implementation of #392
This should go in the highlights of the Scrapy 0.22 release notes! :)
I'm still getting errors regarding this. `ERROR: Error downloading <GET https://www.xxx.com>: Could not open CONNECT tunnel.` I'm using a proxy with the proxy-authorization set. This works fine on non-https URLs. Any suggestions?
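For anyone hitting this, a minimal usage sketch with a hypothetical proxy host and credentials; with `HttpProxyMiddleware` enabled, the userinfo part of the proxy URL becomes the Proxy-Authorization header sent with the CONNECT request:

```python
import scrapy

class HttpsViaProxySpider(scrapy.Spider):
    name = 'https_via_proxy'

    def start_requests(self):
        # Hypothetical proxy endpoint and credentials.
        yield scrapy.Request(
            'https://www.example.com',
            meta={'proxy': 'http://user:secret@proxy.example.com:8080'},
            callback=self.parse)

    def parse(self, response):
        self.log('Fetched %s through the CONNECT tunnel' % response.url)
```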
Okay, so I just upgraded to Scrapy 0.24.4 and this works.