HTTPS over a proxy, now with Proxy-Connection and CONNECT...HTTP/1.1 #45

Closed
wants to merge 4 commits

5 participants

@sabren

My previous pull request worked with several free proxies (Apache+mod_proxy, Fiddler, Proxoid),
but failed with a 400 Bad Request on the commercial proxy we tested.

This version uses HTTP/1.1 for the CONNECT request and adds a "Proxy-Connection: keep-alive" header, which resolved the issue.
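For reference, the CONNECT handshake this patch sends can be sketched as follows. This is a minimal illustration of the request the patched `connectionMade` writes, not code from the patch itself; the `build_connect_request` helper name is my own:

```python
def build_connect_request(host, port, user_agent="Scrapy"):
    """Build the raw CONNECT request sent to the proxy.

    Uses HTTP/1.1 and a Proxy-Connection: keep-alive header, which is
    what made the commercial proxy accept the tunnel request.
    """
    return (
        "CONNECT %s:%s HTTP/1.1\r\n"
        "Host: %s:%s\r\n"
        "Proxy-Connection: keep-alive\r\n"
        "User-Agent: %s\r\n"
        "\r\n" % (host, port, host, port, user_agent)
    )

print(build_connect_request("example.com", 443))
```

The proxy answers with its own status line (e.g. `HTTP/1.1 200 Connection established`) and a blank line, after which the socket is a raw tunnel to the destination host.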

@sabren

The code above actually does work correctly with HTTPS redirects through a proxy.

I said redirects weren't working because of the special case of https://paypal.com described here:

As far as I can tell, https://paypal.com/ is actually sending a bad Location header when Scrapy visits:

Location: https://www.paypal.comhttps://paypal.com/

I think it may be a bug in their handling of HTTP/1.0 because I can't duplicate the problem with any other client.
(Unfortunately, simply patching scrapy to send HTTP/1.1 results in 400 Bad Request errors.)
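To illustrate why that doubled Location value breaks redirect handling, here is what a URL parser makes of it (shown with Python 3's urllib.parse for convenience; the code in this patch uses the Python 2 urlparse module):

```python
from urllib.parse import urlparse

# The concatenated Location value PayPal returned over HTTP/1.0:
location = "https://www.paypal.comhttps://paypal.com/"

parsed = urlparse(location)
# The netloc absorbs everything up to the first "/", so the "host"
# becomes garbage and any redirect to it fails.
print(parsed.netloc)
print(parsed.path)
```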

@scottyallen scottyallen commented on the diff
scrapy/core/downloader/webclient.py
((5 lines not shown))
proxy = request.meta.get('proxy')
if proxy:
+ old_scheme, old_host, old_port = self.scheme, self.host, self.port
self.scheme, _, self.host, self.port, _ = _parse(proxy)
self.path = self.url

Thanks for the patch - I was in the process of trying to fix this, and it saved me a ton of time:) However, I don't think line 191 is quite right for the tunnel case. It results in sending a GET request with the full url to the destination webserver, which is technically wrong and some sites refuse to handle. Instead, self.path should remain unchanged for the tunnel case. I can send a patch to your patch, if you like...
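The fix Scotty describes amounts to choosing the request-line target by case: the absolute URL is correct only for plain-HTTP proxying; a CONNECT tunnel (like a direct connection) should keep the origin-form path. A sketch of that rule, with a hypothetical helper name and Python 3's urllib.parse standing in for the Python 2 urlparse used by the patch:

```python
from urllib.parse import urlparse

def proxy_request_target(url, proxying, tunnelling):
    """Return what should go in the request line after the method.

    Plain-HTTP proxying sends the absolute URL; a CONNECT tunnel and a
    direct connection both send the origin-form path.
    """
    parsed = urlparse(url)
    path = parsed.path or "/"
    if parsed.query:
        path += "?" + parsed.query
    if proxying and not tunnelling:
        return url   # e.g. GET http://example.com/page HTTP/1.0
    return path      # e.g. GET /page HTTP/1.0
```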

@sabren
sabren added a note

Thanks, Scotty!

I'm sure my client would appreciate it.

Can you make a combined pull request, or should I pull from you and open another pull request here?

@nramirezuy
Scrapy project member

@dangra Can we close this ticket? On #392 you said this is deprecated.

@dangra
Scrapy project member

@nramirezuy: so true.

@dangra dangra closed this
@dangra dangra referenced this pull request
Closed

https request with proxies. #453

Showing with 75 additions and 7 deletions.
  1. +75 −7 scrapy/core/downloader/webclient.py
scrapy/core/downloader/webclient.py
@@ -1,5 +1,6 @@
from time import time
from urlparse import urlparse, urlunparse, urldefrag
+from twisted.internet.ssl import ClientContextFactory
from twisted.python import failure
from twisted.web.client import PartialDownloadError, HTTPClientFactory
@@ -9,7 +10,7 @@
from scrapy.http import Headers
from scrapy.utils.httpobj import urlparse_cached
from scrapy.responsetypes import responsetypes
-
+from scrapy import log
def _parsed_url_args(parsed):
path = urlunparse(('', '', parsed.path or '/', parsed.params, parsed.query, ''))
@@ -35,25 +36,85 @@ class ScrapyHTTPPageGetter(HTTPClient):
def connectionMade(self):
self.headers = Headers() # bucket for response headers
+ if self.factory.use_tunnel:
+ log.msg("Sending CONNECT", log.DEBUG)
+ self.tunnel_started = False
+ self.sendCommand("CONNECT", "%s:%s"
+ % (self.factory.tunnel_to_host, self.factory.tunnel_to_port))
+ self.sendHeaders(only=['Host','Proxy-Connection', 'User-Agent'])
+ del self.factory.headers['Proxy-Connection']
+ else:
+ self.sendEverything()
+
+
+ def sendCommand(self, command, path):
+ if self.factory.use_tunnel and not self.tunnel_started:
+ http_version = "1.1"
+ else:
+ http_version = "1.0"
+ self.transport.write('%s %s HTTP/%s\r\n' % (command, path, http_version))
+
+
+ def sendEverything(self):
+ self.sendMethod()
+ self.sendHeaders()
+ self.sendBody()
+
+ def sendMethod(self):
# Method command
self.sendCommand(self.factory.method, self.factory.path)
- # Headers
- for key, values in self.factory.headers.items():
- for value in values:
+
+ def sendHeaders(self, only=None):
+ # Note: it's a Headers object, not a dict
+ keys = only if only is not None else self.factory.headers.keys()
+ for key in keys:
+ for value in self.factory.headers.getlist(key):
self.sendHeader(key, value)
self.endHeaders()
+
+ def sendBody(self):
# Body
if self.factory.body is not None:
self.transport.write(self.factory.body)
def lineReceived(self, line):
- return HTTPClient.lineReceived(self, line.rstrip())
+ if self.factory.use_tunnel and not self.tunnel_started: log.msg("LINE: %s" % line)
+ if self.factory.use_tunnel and not self.tunnel_started and not line.rstrip():
+ # End of headers from the proxy in response to our CONNECT request
+ # Skip the call to HTTPClient.lineReceived for now, since otherwise
+ # it would switch to raw mode.
+ self.startTunnel()
+ else:
+ return HTTPClient.lineReceived(self, line.rstrip())
+
+ def startTunnel(self):
+
+ log.msg("starting Tunnel")
+
+ # We'll get a new batch of headers through the tunnel. This sets us
+ # up to capture them.
+ self.firstLine = True
+ self.tunnel_started = True
+
+ # Switch to SSL
+ ctx = ClientContextFactory()
+ self.transport.startTLS(ctx, self.factory)
+
+ # And send the normal request:
+ self.sendEverything()
+
def handleHeader(self, key, value):
- self.headers.appendlist(key, value)
+ if self.factory.use_tunnel and not self.tunnel_started:
+ pass # maybe log headers for CONNECT request?
+ else:
+ self.headers.appendlist(key, value)
def handleStatus(self, version, status, message):
- self.factory.gotStatus(version, status, message)
+ if self.factory.use_tunnel and not self.tunnel_started:
+ self.tunnel_status = status
+ else:
+ self.factory.gotStatus(version, status, message)
def handleEndHeaders(self):
self.factory.gotHeaders(self.headers)
@@ -122,10 +183,17 @@ def _build_response(self, body, request):
def _set_connection_attributes(self, request):
parsed = urlparse_cached(request)
self.scheme, self.netloc, self.host, self.port, self.path = _parsed_url_args(parsed)
+ self.use_tunnel = False
proxy = request.meta.get('proxy')
if proxy:
+ old_scheme, old_host, old_port = self.scheme, self.host, self.port
self.scheme, _, self.host, self.port, _ = _parse(proxy)
self.path = self.url

+ if old_scheme=="https":
+ self.headers['Proxy-Connection'] = 'keep-alive'
+ self.use_tunnel = True
+ self.tunnel_to_host = old_host
+ self.tunnel_to_port = old_port
def gotHeaders(self, headers):
self.headers_time = time()
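Putting the diff's pieces together, the tunnel logic in `lineReceived` boils down to a small state machine: consume the proxy's reply to CONNECT line by line, and on the blank line that ends it, start TLS and replay the real request. A simplified, dependency-free sketch (names follow the patch; the TLS switch and request replay are stubbed out as recorded events):

```python
class TunnelHandshake:
    """Mimics the patch's lineReceived behaviour during the CONNECT phase."""

    def __init__(self):
        self.tunnel_started = False
        self.tunnel_status = None
        self.events = []  # records what a real client would do

    def line_received(self, line):
        if not self.tunnel_started:
            stripped = line.rstrip()
            if stripped.startswith("HTTP/"):
                # Proxy status line, e.g. "HTTP/1.1 200 Connection established"
                self.tunnel_status = stripped.split(" ", 2)[1]
            elif not stripped:
                # Blank line ends the proxy's reply: start the tunnel.
                self.start_tunnel()
            # Other proxy headers are ignored, as in the patch.
            return
        # After the tunnel is up, lines belong to the real HTTPS response.
        self.events.append(("response-line", line))

    def start_tunnel(self):
        self.tunnel_started = True
        self.events.append("startTLS")        # transport.startTLS(ctx, factory)
        self.events.append("sendEverything")  # replay method, headers, body

h = TunnelHandshake()
for line in ["HTTP/1.1 200 Connection established", "Via: 1.1 proxy", ""]:
    h.line_received(line)
```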