
TLS connection fails through HTTPS proxy after CONNECT tunnel is established #2491

Closed
aiportal opened this issue Jan 11, 2017 · 10 comments · Fixed by #2495

Comments

aiportal commented Jan 11, 2017

I set the proxy with this middleware:

class HttpProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://127.0.0.1:8787'

The error is:

scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 127.0.0.1:8787

Then I tested the proxy with requests:
resp = requests.get('https://......', proxies={'https': 'https://127.0.0.1:8787'})
It works!

So, what is happening here?

@redapple (Contributor)

What version of Scrapy are you using? (output of scrapy version -v)
What local proxy are you using? Would it be possible for me to run the same setup?
Also, are you able to inspect network traffic (on the loopback interface) to compare the CONNECT requests (and subsequent TLS setup) in both Scrapy and python-requests?

Does scrapy work for you with non-localhost HTTPS proxies?

redapple commented Jan 11, 2017

I was able to connect with Polipo running locally on port 8123, using CONNECT:

$ https_proxy='https://127.0.0.1:8123' scrapy shell https://www.example.com
CONNECT www.example.com:443 HTTP/1.1
Host: www.example.com:443

HTTP/1.1 200 Tunnel established
(... redacted...)

You can find the Wireshark capture file and scrapy logs at https://github.com/redapple/scrapy-issues/tree/master/2491

aiportal commented Jan 12, 2017

What version of scrapy are you using? (output of scrapy version -v)

scrapy -V
Scrapy 1.1.0 - no active project

My current platform is Windows 10; I have not tested it on Ubuntu.

What local proxy are you using? would it be possible for me to run the same setup?

Lantern 3.6.1 (20170110.001954)

Also, are you able to inspect network traffic (on the loopback interface) to compare the CONNECT requests (and subsequent TLS setup) in both Scrapy and python-requests?

No, I have not done that.

Does scrapy work for you with non-localhost HTTPS proxies?

Yes, it works fine with free HTTP/HTTPS proxies that I found on the web.

You can find the Wireshark capture file and scrapy logs at https://github.com/redapple/scrapy-issues/tree/master/2491

Thank you very much, I will do that later.

redapple commented Jan 12, 2017

Thanks @bfbd888,
I'm able to reproduce the issue with Lantern, and even with their Go HTTP(S) proxy. It's not systematic, but I get it very often: sometimes after 1 or 2 retries I'm able to connect. Still, it looks like a serious issue, probably around how the TLS connection is established after the CONNECT tunnel gets opened.

See below a failed attempt, followed by a successful one:

(scrapy13) paul@host:~$ scrapy version -v
Scrapy    : 1.3.0
lxml      : 3.7.0.0
libxml2   : 2.9.4
cssselect : 1.0.0
parsel    : 1.1.0
w3lib     : 1.16.0
Twisted   : 16.6.0
Python    : 2.7.12 (default, Nov 19 2016, 06:48:10) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.8.0-34-generic-x86_64-with-Ubuntu-16.10-yakkety


(scrapy13) paul@host:~$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 12:25:54 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-01-12 12:25:54 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(..)
2017-01-12 12:25:54 [scrapy.core.engine] INFO: Spider opened
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
2017-01-12 12:25:54 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.example.com> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy13/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapy13/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]


(scrapy13) paul@host:~$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 12:25:58 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(..)
2017-01-12 12:25:58 [scrapy.core.engine] INFO: Spider opened
2017-01-12 12:25:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7ff8e0a4fb50>
[s]   item       {}
[s]   request    <GET https://www.example.com>
[s]   response   <200 https://www.example.com>
[s]   settings   <scrapy.settings.Settings object at 0x7ff8e0a4fad0>
[s]   spider     <DefaultSpider 'default' at 0x7ff8d9f7a4d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

Wireshark tells me that when the ClientHello is sent by Scrapy (after CONNECT gets HTTP 200 back), Scrapy/Twisted closes the TCP connection without even waiting for the ServerHello.

@redapple changed the title from "Can't use localhost proxy." to "TLS connection fails through HTTPS proxy after CONNECT tunnel is established" on Jan 12, 2017
@redapple (Contributor)

@bfbd888 ,
you may be experiencing something different because I don't see scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel but I'm still investigating the TLS error.
(in other words, in your case the CONNECT call is unsuccessful ; in my case, it fails just after)

TunnelError shows a bit more info in recent Scrapy versions (if you can upgrade your Scrapy 1.1.0 to at least 1.1.1 or 1.1.3, that would be great).
Could you paste the whole error from the console that you see?

@redapple (Contributor)

More info:
I am able to reproduce the Could not open CONNECT tunnel error with Scrapy 1.1.0.
This is the same as issue #2069, where the "Host" HTTP header is missing.

$ pip freeze
attrs==16.3.0
cffi==1.9.1
constantly==15.1.0
cryptography==1.7.1
cssselect==1.0.1
enum34==1.1.6
idna==2.2
incremental==16.10.1
ipaddress==1.0.18
lxml==3.7.2
parsel==1.1.0
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.17
PyDispatcher==2.0.5
pyOpenSSL==16.2.0
queuelib==1.4.2
Scrapy==1.1.0
service-identity==16.0.0
six==1.10.0
Twisted==16.6.0
w3lib==1.16.0
zope.interface==4.3.3

$ https_proxy=https://localhost:45793 scrapy shell https://www.example.com
2017-01-12 17:08:24 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
(...)
2017-01-12 17:08:24 [scrapy] INFO: Spider opened
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy1.1.0/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/commands/shell.py", line 71, in run
    shell.start(url=url)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/shell.py", line 47, in start
    self.fetch(url, spider)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/scrapy/shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "/home/paul/.virtualenvs/scrapy1.1.0/local/lib/python2.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy localhost:45793

and it can be fixed by upgrading to at least Scrapy 1.1.1.

But once you upgrade, you run into the TLS issue I mentioned earlier (#2491 (comment)).
So this issue is not closed yet...
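For reference, the missing-Host fix boils down to sending a CONNECT request that includes a Host header, as HTTP/1.1 requires; some proxies reject a CONNECT without one. A minimal sketch (illustrative only, not Scrapy's actual code; build_connect_request is a hypothetical helper):

```python
def build_connect_request(host, port):
    """Build the raw bytes of an HTTP/1.1 CONNECT request.

    Illustrative helper, not Scrapy's implementation: it shows the
    Host header whose absence made some proxies reject the CONNECT
    (issue #2069).
    """
    target = "%s:%d" % (host, port)
    return ("CONNECT %s HTTP/1.1\r\n"
            "Host: %s\r\n"
            "\r\n" % (target, target)).encode("ascii")
```

For example, build_connect_request("www.example.com", 443) produces the same two-line CONNECT request shown in the Polipo capture above.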

redapple commented Jan 12, 2017

Alright, I think I figured this one out.
What happens is that Scrapy's TunnelingTCP4ClientEndpoint does not consume all of the proxy's response to the CONNECT request (it only checks that the status code is 200).
Because TunnelingTCP4ClientEndpoint.processProxyResponse() is called with chunks of the response, the remaining bytes on the transport are fed into the TLS layer as if the server had sent them in response to the ClientHello; but these are plain ASCII bytes, so OpenSSL says "No way!"

Step 0: send CONNECT

CONNECT www.example.com:443 HTTP/1.1
Host: www.example.com:443

Step 1: receive first chunk:

'HTTP/1.1 200 OK\r\nKeep-Alive'

Step 2: Scrapy says: "Cool! the proxy is ready, let's initiate the TLS connection."
a ClientHello is sent over the TCP connection...

Step 3: there are more bytes from the proxy where the HTTP 200 came from...

': timeout=38\r\nContent-Length: 0\r\n\r\n'

Step 4: OpenSSL is not happy with these bytes (they are not a ServerHello) and aborts the connection

Simple fix: add a small buffer when reading the initial response from the proxy and detect \r\n\r\n before starting the TLS negotiation.

Advanced fix: use some HTTP parsing state machine (Twisted's?) to do this properly.
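The simple fix could be sketched like this (illustrative Python only, not Scrapy's actual implementation; TunnelBuffer, feed and tunnel_established are hypothetical names):

```python
class TunnelBuffer:
    """Accumulates chunks of the proxy's CONNECT response so the TLS
    handshake only starts after the full HTTP header block arrived."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk):
        """Feed one received chunk. Returns None while the headers are
        incomplete; once b"\\r\\n\\r\\n" is seen, returns a tuple
        (headers, leftover) where leftover is any trailing bytes that
        belong to the tunneled stream, not to the proxy response."""
        self._buf += chunk
        end = self._buf.find(b"\r\n\r\n")
        if end == -1:
            return None  # keep buffering; don't start TLS yet
        headers = self._buf[:end + 4]
        leftover = self._buf[end + 4:]
        return headers, leftover


def tunnel_established(headers):
    """Check that the proxy answered the CONNECT with a 2xx status."""
    status_line = headers.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ")
    return len(parts) >= 2 and parts[1].startswith(b"2")
```

With the two chunks from the steps above, the first feed() call returns None (no blank line yet), and only the second call completes the headers, so the ClientHello would not be sent until the stray ": timeout=38\r\nContent-Length: 0\r\n\r\n" bytes have been consumed.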

@aiportal (Author)

Thank you very much, I'm waiting for a new version with the fix.
I have used Privoxy to forward the proxy tunnel and it works.

@redapple (Contributor)

@bfbd888 , you can try #2495

@aiportal (Author)

Thanks very much. 👍
