
TLS connection fails through HTTPS proxy after CONNECT tunnel is established #2491

Closed
aiportal opened this issue Jan 11, 2017 · 10 comments · Fixed by #2495

Comments

aiportal commented Jan 11, 2017

I set the proxy with this middleware:

class HttpProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://127.0.0.1:8787'

The error is:

scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 127.0.0.1:8787

Then I tested the proxy with requests:
resp = requests.get('https://......', proxies={'https': 'https://127.0.0.1:8787'})
It works!

So, what is happening here?

@redapple (Contributor)

What version of Scrapy are you using? (output of scrapy version -v)
What local proxy are you using? Would it be possible for me to run the same setup?
Also, are you able to inspect network traffic (on the loopback interface) to compare the CONNECT requests (and subsequent TLS setup) in both Scrapy and python-requests?

Does scrapy work for you with non-localhost HTTPS proxies?

redapple commented Jan 11, 2017

I was able to connect with Polipo running locally on port 8123, using CONNECT:

$ https_proxy='https://127.0.0.1:8123' scrapy shell https://www.example.com
CONNECT www.example.com:443 HTTP/1.1
Host: www.example.com:443

HTTP/1.1 200 Tunnel established
(... redacted...)

You can find the Wireshark capture file and scrapy logs at https://github.com/redapple/scrapy-issues/tree/master/2491

aiportal commented Jan 12, 2017

What version of scrapy are you using? (output of scrapy version -v)

scrapy -V
Scrapy 1.1.0 - no active project

My current platform is Windows 10; I have not tested it on Ubuntu.

What local proxy are you using? would it be possible for me to run the same setup?

Lantern 3.6.1 (20170110.001954)

Also, are you able to inspect network traffic (on the loopback interface) to compare the CONNECT requests (and subsequent TLS setup) in both Scrapy and python-requests?

No, I have not done that.

Does scrapy work for you with non-localhost HTTPS proxies?

Yes, it works fine with free HTTP/HTTPS proxies that I found on the web.

You can find the Wireshark capture file and scrapy logs at https://github.com/redapple/scrapy-issues/tree/master/2491

Thank you very much, I will do that later.

redapple commented Jan 12, 2017

Thanks @bfbd888,
I'm able to reproduce the issue with Lantern, and even with their Go HTTP(S) proxy. It's not systematic, but I get it very often: sometimes after 1 or 2 retries I'm able to connect. Still, it looks like a serious issue, probably around how the TLS connection is established after the CONNECT tunnel gets opened.

See below a failed attempt, followed by a successful one:

(scrapy13) paul@host:~$ scrapy version -v
Scrapy    : 1.3.0
lxml      : 3.7.0.0
libxml2   : 2.9.4
cssselect : 1.0.0
parsel    : 1.1.0
w3lib     : 1.16.0
Twisted   : 16.6.0
Python    : 2.7.12 (default, Nov 19 2016, 06:48:10) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.8.0-34-generic-x86_64-with-Ubuntu-16.10-yakkety


(scrapy13) paul@host:~$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 12:25:54 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-01-12 12:25:54 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(..)
2017-01-12 12:25:54 [scrapy.core.engine] INFO: Spider opened
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.example.com> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy13/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]


(scrapy13) paul@host:~$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 12:25:58 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(..)
2017-01-12 12:25:58 [scrapy.core.engine] INFO: Spider opened
2017-01-12 12:25:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7ff8e0a4fb50>
[s]   item       {}
[s]   request    <GET https://www.example.com>
[s]   response   <200 https://www.example.com>
[s]   settings   <scrapy.settings.Settings object at 0x7ff8e0a4fad0>
[s]   spider     <DefaultSpider 'default' at 0x7ff8d9f7a4d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

Wireshark tells me that when the ClientHello is sent by Scrapy (after CONNECT gets HTTP 200 back), Scrapy/Twisted closes the TCP connection without even waiting for the ServerHello.

@redapple changed the title from "Can't use localhost proxy." to "TLS connection fails through HTTPS proxy after CONNECT tunnel is established" on Jan 12, 2017
@redapple (Contributor)

@bfbd888 ,
you may be experiencing something different because I don't see scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel but I'm still investigating the TLS error.
(in other words, in your case the CONNECT call is unsuccessful ; in my case, it fails just after)

TunnelError shows a bit more info in recent Scrapy versions (if you can upgrade your Scrapy 1.1.0 to at least 1.1.1 or 1.1.3, that would be great).
Could you paste the whole error from the console that you see?

@redapple (Contributor)

More info:
I am able to reproduce the Could not open CONNECT tunnel error with Scrapy 1.1.0.
This is the same as issue #2069, where the "Host" HTTP header is missing.

$ pip freeze
attrs==16.3.0
cffi==1.9.1
constantly==15.1.0
cryptography==1.7.1
cssselect==1.0.1
enum34==1.1.6
idna==2.2
incremental==16.10.1
ipaddress==1.0.18
lxml==3.7.2
parsel==1.1.0
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.17
PyDispatcher==2.0.5
pyOpenSSL==16.2.0
queuelib==1.4.2
Scrapy==1.1.0
service-identity==16.0.0
six==1.10.0
Twisted==16.6.0
w3lib==1.16.0
zope.interface==4.3.3

$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 17:08:24 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
(...)
2017-01-12 17:08:24 [scrapy] INFO: Spider opened
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy1.1.0/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy localhost:45793

and it can be fixed by upgrading to at least Scrapy 1.1.1.

But once you upgrade, you run into the TLS issue I mentioned earlier (#2491 (comment)).
So this issue is not closed yet...
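For reference, the missing-Host fix boils down to sending a CONNECT request that includes a Host header, as HTTP/1.1 requires; some proxies reject a CONNECT without one. A minimal sketch (illustrative only, not Scrapy's actual code; build_connect_request is a hypothetical helper):

```python
def build_connect_request(host, port):
    """Build the raw bytes of an HTTP/1.1 CONNECT request.

    Illustrative helper, not Scrapy's implementation: it shows the
    Host header whose absence made some proxies reject the CONNECT
    (issue #2069).
    """
    target = "%s:%d" % (host, port)
    return ("CONNECT %s HTTP/1.1\r\n"
            "Host: %s\r\n"
            "\r\n" % (target, target)).encode("ascii")
```

For example, build_connect_request("www.example.com", 443) produces the same two-line CONNECT request shown in the Polipo capture above.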

redapple commented Jan 12, 2017

Alright, I think I figured this one out.
What happens is that Scrapy's TunnelingTCP4ClientEndpoint does not consume all of the proxy's response to the CONNECT request (it only checks that the status code is 200).
Because TunnelingTCP4ClientEndpoint.processProxyResponse() is called with chunks of the response, the remaining bytes on the transport are fed into the TLS layer as if the server had sent them in response to the ClientHello; but these are plain ASCII bytes, so OpenSSL says "No way!"

Step 0: send CONNECT

CONNECT www.example.com:443 HTTP/1.1
Host: www.example.com:443

Step 1: receive first chunk:

'HTTP/1.1 200 OK\r\nKeep-Alive'

Step 2: Scrapy says: "Cool! the proxy is ready, let's initiate the TLS connection."
a ClientHello is sent over the TCP connection...

Step 3: there are more bytes from the proxy where the HTTP 200 came from...

': timeout=38\r\nContent-Length: 0\r\n\r\n'

Step 4: OpenSSL is not happy with these bytes (they are not a ServerHello) and aborts the connection

Simple fix: add a small buffer when reading the initial response from the proxy and detect \r\n\r\n before starting the TLS negotiation.

Advanced fix: use some HTTP parsing state machine (Twisted's?) to do this properly.
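The simple fix could be sketched like this (illustrative Python only, not Scrapy's actual implementation; TunnelBuffer, feed and tunnel_established are hypothetical names):

```python
class TunnelBuffer:
    """Accumulates chunks of the proxy's CONNECT response so the TLS
    handshake only starts after the full HTTP header block arrived."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk):
        """Feed one received chunk. Returns None while the headers are
        incomplete; once b"\\r\\n\\r\\n" is seen, returns a tuple
        (headers, leftover) where leftover is any trailing bytes that
        belong to the tunneled stream, not to the proxy response."""
        self._buf += chunk
        end = self._buf.find(b"\r\n\r\n")
        if end == -1:
            return None  # keep buffering; don't start TLS yet
        headers = self._buf[:end + 4]
        leftover = self._buf[end + 4:]
        return headers, leftover


def tunnel_established(headers):
    """Check that the proxy answered the CONNECT with a 2xx status."""
    status_line = headers.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ")
    return len(parts) >= 2 and parts[1].startswith(b"2")
```

With the two chunks from the steps above, the first feed() call returns None (no blank line yet), and only the second call completes the headers, so the ClientHello would not be sent until the stray ": timeout=38\r\nContent-Length: 0\r\n\r\n" bytes have been consumed.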

@aiportal (Author)

Thank you very much, I'm waiting for a new version with the fix.
I have used Privoxy to forward the proxy tunnel and it works.

@redapple (Contributor)

@bfbd888 , you can try #2495

@aiportal (Author)

Thanks very much. 👍
