SSL errors crawling https sites using proxies #1855

Closed
Cesped opened this issue Mar 9, 2016 · 7 comments

Cesped commented Mar 9, 2016

I'm unable to scrape HTTPS sites through proxies that support HTTPS. I've tried ProxyMesh as well as other proxy services. I can scrape most of these sites without proxies or by using Tor.

curl seems to work fine, too:
curl -x https://xx.xx.xx.xx:xx --proxy-user user:pass -L https://www.base.net:443
It retrieves the site's HTML.

Setup:

  • OS: OS X El Capitan v10.11.3
  • Scrapy (output of scrapy version -v):

Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.11 (default, Dec  7 2015, 23:36:10) - [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Darwin-15.3.0-x86_64-i386-64bit

Solutions tried:
1 - Installing Scrapy-1.1.0rc3
2016-03-09 12:44:59 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
For another website:
2016-03-09 12:56:45 [scrapy] DEBUG: Retrying <GET https://es.alojadogatopreto.com/es-es/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

2 - #1764 (comment)
Using SSLv23_METHOD
2016-03-09 12:22:40 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Using other SSL methods
2016-03-09 12:24:11 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'wrong version number')]>]

3 - #1227 (comment) | Get same errors as in 1 & 2.
4 - #1429 (comment) | Get same errors as in 1 & 2.

redapple added a commit to redapple/scrapy-issues that referenced this issue Mar 9, 2016
redapple (Contributor) commented Mar 9, 2016

@Cesped, I don't use OS X, but under Ubuntu, with both Scrapy 1.0.5 and Scrapy 1.1rc3:

$ scrapy version -v
Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.9 (default, Apr  2 2015, 15:33:21) - [GCC 4.9.2]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2d 9 Jul 2015)
Platform  : Linux-4.2.0-30-generic-x86_64-with-Ubuntu-15.10-wily

I was able to use two HTTPS proxies from HMA with https://www.python.org and https://www.base.net.

I pushed the Wireshark capture (pcap file) and console logs to https://github.com/redapple/scrapy-issues/tree/master/1855/redapple so you can compare, if possible.

Could it be related to the HTTPS proxies you use?
If you don't want to share here publicly, you can send info to opensource@scrapinghub.com

Cesped (Author) commented Mar 9, 2016

Thanks for answering, @redapple.

The solution was changing base64.encodestring to base64.b64encode in my ProxyMiddleware.
I ran scrapy shell 'https://www.base.net' a few times and printed request.meta. The value of meta['proxy'] changes each time and corresponds to the proxies in my list.
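
A likely reason this change matters: on Python 2 (as used in this thread), base64.encodestring appends a trailing newline and wraps long output, so the stray newline ends up inside the Proxy-Authorization header and can corrupt the CONNECT request to the proxy; base64.b64encode does not do this. A minimal sketch of the difference:

import base64

# encodestring appends '\n' (and wraps output every 76 chars); b64encode does not
print(repr(base64.encodestring('user:pass')))  # 'dXNlcjpwYXNz\n'
print(repr(base64.b64encode('user:pass')))     # 'dXNlcjpwYXNz'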

Cesped closed this as completed Mar 9, 2016
@YasirNazir81

I am trying to crawl Walmart using the ProxyMesh proxy provider and I'm getting the same error. Can I solve this using HTTP proxies?


vionemc commented Sep 14, 2016

Same as @YasirNazir81, I get this error when using ProxyMesh.

@Cesped I don't understand what you mean about base64. May I see your middleware code?
Thanks


vionemc commented Sep 14, 2016

Never mind. I found the solution, just as @Cesped said. Here is the middleware:

import base64

class MeshProxy(object):
    # Override process_request to route every request through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://fr.proxymesh.com:31280"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "user:pass"
        # Set up basic authentication for the proxy
        # (b64encode, unlike encodestring, does not append a newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
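
To use this, the middleware also has to be enabled in the project settings; a minimal sketch, where 'myproject.middlewares.MeshProxy' and the priority 100 are assumptions depending on where MeshProxy lives in your project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Assumed module path; adjust to wherever MeshProxy is defined
    'myproject.middlewares.MeshProxy': 100,
}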

@arturfcesped

Same problem here; I also found the solution thanks to @Cesped.

@Gallaecio (Member)

You can alternatively use w3lib.http.basic_auth_header.
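
A minimal sketch of that alternative, reusing the placeholder ProxyMesh endpoint and credentials from the middleware above; w3lib's basic_auth_header builds the 'Basic ...' value (including the base64 step) for you:

from w3lib.http import basic_auth_header

class MeshProxy(object):
    def process_request(self, request, spider):
        # Placeholder proxy endpoint and credentials from the example above
        request.meta['proxy'] = "http://fr.proxymesh.com:31280"
        # basic_auth_header returns the full b'Basic <base64(user:pass)>' value
        request.headers['Proxy-Authorization'] = basic_auth_header('user', 'pass')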
