SSL errors crawling https sites using proxies #1855

Closed
Cesped opened this issue Mar 9, 2016 · 7 comments

Cesped commented Mar 9, 2016

I'm unable to scrape HTTPS sites through proxies that support HTTPS. I've tried ProxyMesh as well as other proxy services. I can scrape most of these sites without proxies or by using Tor.

curl seems to work fine, too:
curl -x https://xx.xx.xx.xx:xx --proxy-user user:pass -L https://www.base.net:443
It retrieves the site's HTML.

Setup:

  • OS: OS X El Capitan v10.11.3
  • Scrapy (output of scrapy version -v):

Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.11 (default, Dec  7 2015, 23:36:10) - [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Darwin-15.3.0-x86_64-i386-64bit

Solutions tried:
1 - Installing Scrapy-1.1.0rc3
2016-03-09 12:44:59 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
For another website:
2016-03-09 12:56:45 [scrapy] DEBUG: Retrying <GET https://es.alojadogatopreto.com/es-es/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

2 - #1764 (comment)
Using SSLv23_METHOD
2016-03-09 12:22:40 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Using other SSL methods
2016-03-09 12:24:11 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'wrong version number')]>]

3 - #1227 (comment) | Get same errors as in 1 & 2.
4 - #1429 (comment) | Get same errors as in 1 & 2.

redapple added a commit to redapple/scrapy-issues that referenced this issue Mar 9, 2016
redapple (Contributor) commented Mar 9, 2016

@Cesped, I don't use OS X, but under Ubuntu, with both Scrapy 1.0.5 and Scrapy 1.1rc3:

$ scrapy version -v
Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.9 (default, Apr  2 2015, 15:33:21) - [GCC 4.9.2]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Linux-4.2.0-30-generic-x86_64-with-Ubuntu-15.10-wily

I was able to use two HTTPS proxies from HMA with https://www.python.org and https://www.base.net.

I pushed the Wireshark capture (pcap file) and console logs to https://github.com/redapple/scrapy-issues/tree/master/1855/redapple so you can compare, if possible.

Could it be related to the HTTPS proxies you use?
If you don't want to share here publicly, you can send info to opensource@scrapinghub.com

Cesped (Author) commented Mar 9, 2016

Thanks for answering, @redapple.

The solution was changing base64.encodestring to base64.b64encode in my ProxyMiddleware.
I ran scrapy shell 'https://www.base.net' a few times and printed request.meta. The value of meta['proxy'] changes each time and corresponds to the proxies in my list.
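
A likely reason this change matters: on Python 2 (as used in this thread), base64.encodestring appends a trailing newline and wraps long output, so the stray newline ends up inside the Proxy-Authorization header and can corrupt the CONNECT request to the proxy; base64.b64encode does not do this. A minimal sketch of the difference:

import base64

# encodestring appends '\n' (and wraps output every 76 chars); b64encode does not
print(repr(base64.encodestring('user:pass')))  # 'dXNlcjpwYXNz\n'
print(repr(base64.b64encode('user:pass')))     # 'dXNlcjpwYXNz'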

Cesped closed this as completed Mar 9, 2016
@YasirNazir81

I am trying to crawl Walmart using the ProxyMesh proxy provider and I'm getting the same error. Can I solve this using HTTP proxies?


vionemc commented Sep 14, 2016

Same as @YasirNazir81, I get this error when using ProxyMesh.

@Cesped I don't understand what you mean about base64. May I see your middleware code?
Thanks


vionemc commented Sep 14, 2016

Never mind. I found the solution, just as @Cesped said. Here is the middleware:

import base64

class MeshProxy(object):
    # Override process_request to route every request through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://fr.proxymesh.com:31280"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "user:pass"
        # Set up basic authentication for the proxy
        # (b64encode, unlike encodestring, does not append a newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
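
To use this, the middleware also has to be enabled in the project settings; a minimal sketch, where 'myproject.middlewares.MeshProxy' and the priority 100 are assumptions depending on where MeshProxy lives in your project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Assumed module path; adjust to wherever MeshProxy is defined
    'myproject.middlewares.MeshProxy': 100,
}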

@arturfcesped

Same problem here; I also found the solution thanks to @Cesped.

@Gallaecio (Member)

You can alternatively use w3lib.http.basic_auth_header.
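
A minimal sketch of that alternative, reusing the placeholder ProxyMesh endpoint and credentials from the middleware above; w3lib's basic_auth_header builds the 'Basic ...' value (including the base64 step) for you:

from w3lib.http import basic_auth_header

class MeshProxy(object):
    def process_request(self, request, spider):
        # Placeholder proxy endpoint and credentials from the example above
        request.meta['proxy'] = "http://fr.proxymesh.com:31280"
        # basic_auth_header returns the full b'Basic <base64(user:pass)>' value
        request.headers['Proxy-Authorization'] = basic_auth_header('user', 'pass')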
