
Support for socks5 proxy #747

Open · cydu opened this issue Jun 14, 2014 · 28 comments

Comments

@cydu commented Jun 14, 2014

Support for socks5 proxy

http://www.ietf.org/rfc/rfc1928.txt

Maybe we can use the SOCKS5Agent from https://github.com/habnabit/txsocksx.

@pablohoffman (Member) commented Jun 14, 2014

Here's an article about using tsocks with Scrapy:
http://blog.scrapinghub.com/2010/11/12/scrapy-tsocks/

Not sure SOCKS5 is something we'd want to support directly in Scrapy, since HTTP proxies are often enough. Could you elaborate on your need, @cydu?

@cydu (Author) commented Jun 15, 2014

@pablohoffman Thank you for your reply.
But tsocks won't work in my case: I have to crawl several sites using different proxies, for performance and security reasons.

Something like this:

DOWNLOAD_HANDLERS = {
    'aaa.com': 'myspider.http_proxy.HttpProxyDownloadHandler',
    'bbb.com': 'myspider.socks5_proxy.Socks5DownloadHandler',
    'ccc.com': 'myspider.no_proxy.HTTP11DownloadHandler',
} 

I have implemented Socks5DownloadHandler, but it depends on txsocksx:

# txsocksx is installed via: pip install txsocksx
from txsocksx.http import SOCKS5Agent

so I don't know whether it is OK to open a pull request.

Here is my Socks5DownloadHandler code: https://gist.github.com/cydu/8a4b9855c5e21423c9c5
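For context, here is a minimal standalone sketch of how txsocksx's SOCKS5Agent is used (Python 2 / Twisted; the proxy address and target URL are placeholders). A download handler like the one in the gist presumably wires an agent like this into Scrapy's downloader in place of the default twisted.web Agent:

# Minimal SOCKS5Agent usage sketch; host, port, and URL are placeholders.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.web.client import readBody
from txsocksx.http import SOCKS5Agent

# Endpoint pointing at the SOCKS5 proxy itself
proxy_endpoint = TCP4ClientEndpoint(reactor, '127.0.0.1', 1080)
# Agent that tunnels its HTTP requests through that proxy
agent = SOCKS5Agent(reactor, proxyEndpoint=proxy_endpoint)

def print_body(body):
    print body
    reactor.stop()

d = agent.request('GET', 'http://example.com/')
d.addCallback(readBody)
d.addCallback(print_body)
reactor.run()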

@disfasia commented Jun 24, 2014

Hi,
@pablohoffman thanks for your awesome Scrapy!

I was looking for something similar; I think it's a big gap that such a complete piece of software is missing SOCKS support.

My case is a bit different. I currently use a middleware to request a proxy from a remote dispatcher server, then I assign the obtained proxy to the current request. Everything works fine with HTTP proxies, but I'm struggling to get the same behavior with SOCKS.

Imagine I have a list of SOCKS proxies and I want to use a random proxy per crawl. I couldn't find a way to make this work with HTTP-to-SOCKS converters (Polipo, Privoxy, etc.), and I thought I could write a custom server to handle that, maybe with a special header defining which SOCKS proxy to connect to... but I think that is way too much complexity!

Here is what I have for HTTP proxies; I would like to do something very similar for SOCKS:

import base64
import json
import urllib2

class ProxyMiddleware(object):
    # URL of the remote dispatcher that hands out proxies
    proxyDispatcher = "http://my.proxy.dispatcher/"

    def process_request(self, request, spider):
        if spider.name == 'some-particular-spider':
            if not spider.proxy:
                # Ask the dispatcher for a proxy; fall back to no proxy on bad JSON
                response = urllib2.urlopen(self.proxyDispatcher)
                try:
                    spider.proxy = json.loads(response.read())
                except ValueError:
                    spider.proxy = False
            if spider.proxy:
                request.meta['proxy'] = "http://%s" % spider.proxy['host']
                # strip() drops the trailing newline base64.encodestring appends
                request.headers['Proxy-Authorization'] = (
                    'Basic ' + base64.encodestring(spider.proxy['auth']).strip())
            return None
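(For reference: a middleware like this is enabled via DOWNLOADER_MIDDLEWARES in the project's settings.py; the module path and priority below are just illustrative.)

# settings.py -- path and priority are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 543,
}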
@traverseda commented Oct 30, 2014

👍

Can we at least get an "enhancement" tag on this?

@dangra added the enhancement label Oct 30, 2014

@boltgolt commented Apr 23, 2015

Would be really useful, indeed

@crasker (Contributor) commented Sep 11, 2015

Would be really really really really really really really really useful, indeed

@robsonpeixoto commented Mar 8, 2016

👍

@robsonpeixoto commented Mar 9, 2016

@pablohoffman There are a lot of cheap, good proxies that only support SOCKS4/5. This feature would be ridiculously useful.
I want to use Scrapy but I can't because of this. 😢 And I really would like to use Scrapy, because it's f**** amazing.

@redapple (Contributor) commented Mar 9, 2016

@robsonpeixoto, I'll raise the priority of this. Thanks for the feedback!

@AdolphYu commented Apr 6, 2016

Chinese users need SOCKS proxies because of the GFW.

@kmike (Member) commented Apr 6, 2016

A PR with SOCKS proxy support is welcome! AFAIK nobody is working on it now.

@pawelmhm (Contributor) commented Apr 9, 2016

This is somewhat complicated because Scrapy uses Twisted Agents in the downloader, and Twisted doesn't have a SOCKS client; there is only a SOCKS4 server. I did some research on this, and there is an unfinished ticket for a SOCKS client: https://twistedmatrix.com/trac/ticket/3508

To implement SOCKS support for Scrapy we would have to either:

  1. Use an existing SOCKS library for Twisted that is not an official part of Twisted. This one looks like the best one around: https://github.com/habnabit/txsocksx, but it appears to be Python 2 only :/
  2. Contribute to Twisted and add support for a SOCKS client there.
  3. Write a SOCKS client for Scrapy ourselves.

Which option is best?

@daidoji commented Jun 24, 2016

+1

@jbagot commented Jul 29, 2016

I have a big Scrapy project with a large number of crawlers, and I use SOCKS proxies through an HTTP-to-SOCKS converter called Polipo. I have a huge number of SOCKS ports, and I only need to create one Polipo instance for every SOCKS port, connecting each Polipo instance to one SOCKS port. Then I keep a queue of the Polipo (HTTP) ports and pull from it:

polipo socksParentProxy=localhost:$socks_port proxyPort=$polipo_port > /dev/null &

This command runs inside a loop, once per SOCKS port, and that's all; a sketch of the idea follows below.
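A rough Python equivalent of that loop (the port lists are placeholders; polipo must be on PATH):

# Start one Polipo HTTP-to-SOCKS bridge per SOCKS port; ports are placeholders.
import subprocess

socks_ports = [9050, 9051, 9052]
polipo_ports = [8118, 8119, 8120]

devnull = open('/dev/null', 'w')
http_proxy_queue = []
for socks_port, polipo_port in zip(socks_ports, polipo_ports):
    # polipo accepts config assignments directly on the command line
    subprocess.Popen(
        ['polipo',
         'socksParentProxy=localhost:%d' % socks_port,
         'proxyPort=%d' % polipo_port],
        stdout=devnull, stderr=devnull)
    # Scrapy requests can now use this local HTTP endpoint as request.meta['proxy']
    http_proxy_queue.append('http://localhost:%d' % polipo_port)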

But I would prefer direct SOCKS support in Twisted, because the Polipo instances waste server memory and performance is lower than it could be.

@Margular commented Oct 11, 2016

proxychains scrapy crawl spider_name
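(For anyone trying this: proxychains takes its proxy list from /etc/proxychains.conf; a minimal entry for a local SOCKS5 proxy, with a placeholder address and port, looks like this.)

# /etc/proxychains.conf (excerpt); address and port are placeholders
[ProxyList]
socks5 127.0.0.1 1080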

@dchrostowski commented Mar 16, 2017

Also requesting this. +9001 internets to whoever can implement.


@dchrostowski commented Mar 20, 2017

I would, but some extra research on Google led me to a few options I'm going to try first. I'm working on an automatic public proxy farm (not using Tor) which I hope will be useful to all Scrapy enthusiasts, so I will likely release it as open source in lieu of bounty money, regardless of whether or not SOCKS5 gets officially implemented/supported.

I already have a nice prototype that has been running for over a year and has collected well over 350K public proxies. A significant portion of these are SOCKS proxies, and I've been using them with my crawling infrastructure, which was initially built on top of Perl. I'm in the middle of converting everything to Python, though, because I just can't deal with Perl anymore, and I have picked Scrapy as my crawling framework. I intend to release it primarily as a Scrapy middleware, but I also designed it with modularity in mind so that it can be hooked into just about anything with few headaches.

@dchrostowski commented Mar 20, 2017

Basically, you just get free proxy servers and you don't even have to think about it. It maintains itself.

@bakwc commented Jun 9, 2017

Surprised that the best Python scraping library does not have SOCKS5 proxy support in 2017. It's very sad :(

@redapple (Contributor) commented Jun 9, 2017

That's why we need help! Any volunteers?


@dchrostowski commented Jun 13, 2017

I thought I'd just share how I'm getting SOCKS support with Scrapy. Basically there are two pretty good options, DeleGate and Privoxy. I'm going to give an example of a middleware that I implemented using DeleGate, which has worked great for me thus far.

DeleGate is amazingly simple and straightforward; it basically serves as an HTTP-to-SOCKS bridge. In other words, you make a request to it from Scrapy as if it were an HTTP proxy, and it takes care of bridging that over to the SOCKS server. Privoxy can do this too, but it seems like DeleGate has much better documentation and possibly more functionality than Privoxy (maybe...). You can either build from source or download a pre-built binary (Windows, Mac OS X, Linux, BSD, and Solaris are supported). Set it up however you like so that it's on your PATH. In my Ubuntu setup I simply created a symbolic link to the binary in my /usr/bin directory; copying it over there works too. So after it's installed, try running this in your shell:

delegated ADMIN=<whatever-you-want> RESOLV="" -Plocalhost:<localport> SERVER=http SOCKS=<socks-server-address>:<socks-server-port>

This should set up a proxy server on the local machine. A brief explanation of some of the options:

ADMIN - this can be whatever. Ideally it should be an email address to display should the DeleGate server run into a problem.

RESOLV - I forget exactly what this was doing; something to do with DNS resolution. Basically, if I didn't include this argument set to an empty string, I noticed I was inadvertently exposing my IP while testing against my dev server. (You may or may not need this; I suspect I needed it because I have a public DNS A record pointing to the particular machine I was testing DeleGate on.)

-P[localhost]:localport - the address and port of the local DeleGate proxy server which will run. You can just set an arbitrary port.

SERVER - the protocol of the local DeleGate proxy server. In this case we want HTTP, because that's what Scrapy is compatible with.

SOCKS - the address and port of the socks proxy server that DeleGate will "bridge" the request to.

To shut down gracefully, you can run this command in a separate window:

delegated -P[localhost]:<localport> -Fkill

Keep in mind that this sets up a live proxy server running on localhost. While testing, I was able to access the DeleGate web interface through my browser. Make sure your firewall is set up accordingly, or read the docs on setting up auth/security, unless you want people like me finding it and using it.

So to make this integrate nicely with Scrapy, I wrote a middleware. Here's a watered-down version of it:

from my_scrapy_project.util.proxy_manager import Proxy, ProxyManager
import subprocess

class CustomProxyMiddleware(object):

    @staticmethod
    def start_delegate(proxy, localport):
        # Launch a DeleGate instance bridging HTTP on localport to the SOCKS proxy
        cmd = 'delegated ADMIN=nobody RESOLV="" -P:%s SERVER=http TIMEOUT=con:15 SOCKS=%s:%s' % (
            localport, proxy.address, proxy.port)
        subprocess.Popen(cmd, shell=True)
        # Rewrite the proxy object to point at the local DeleGate bridge
        proxy.address = 'localhost'
        proxy.scheme = 'http'
        proxy.port = localport
        return proxy

    @staticmethod
    def stop_delegate(localport):
        # Gracefully shut down the DeleGate instance and free the port
        cmd = 'delegated -P:%s -Fkill' % localport
        subprocess.Popen(cmd, shell=True)
        ProxyManager.release_delegate_port(localport)

    def process_request(self, request, spider):
        # For simplicity I'm not including code for Proxy or ProxyManager.
        # Should be self-explanatory.
        proxy = Proxy(ProxyManager.get_socks_proxy_params())
        localport = ProxyManager.reserve_delegate_port()
        socks_bridge_proxy = CustomProxyMiddleware.start_delegate(proxy, localport)
        request.meta['proxy'] = socks_bridge_proxy.to_string()
        request.meta['delegate_port'] = localport

    def process_response(self, request, response, spider):
        # handle response logic here

        # check if there is a DeleGate instance running for this request
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])
        return response

    def process_exception(self, request, exception, spider):
        # handle exceptions here; shut down the DeleGate instance as well
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])
@dchrostowski commented Jun 13, 2017

By the way, I should mention that this was written to accommodate the thousands of SOCKS proxies my bots have found. If you have a smaller number, it might make more sense to keep the DeleGate instances open and running all the time rather than allocating a port, starting an instance, and then stopping it for each request. In my real application, I'm cycling through a very large pool of proxies cached in memory, consisting of both SOCKS and HTTP proxies. I estimate the ratio to be around 1:10 SOCKS:HTTP, so this makes sense for my project, and I'm not rapid-firing DeleGate ports open and closed.

@pablohoffman (Member) commented Jun 13, 2017

Thanks for sharing, @dchrostowski, that's worth a blog post! :)

@tigercbc commented Sep 19, 2017

works for me, thanks! @cydu


@barravi commented Nov 27, 2018

Hello. I'd like to post my 2 cents here.

All of the proposed workarounds fall short because software like Privoxy does not support authenticated SOCKS proxies. This might seem like an edge case, but VPN providers such as IPVanish offer a proxy service with their VPN subscription that is authenticated and SOCKS only. It is very convenient, as it allows a crawler to spoof its IP from different countries around the world, but it's not supported out of the box by Scrapy.

So far I have tried Privoxy, Polipo, Ncat, and whatever else I could stumble upon to set up an HTTP-to-SOCKS proxy that would authenticate the proxy connection, without any luck.

The one solution I have found so far that works is to use Scrapinghub's Splash with a proxy profile set up; Splash actually supports authenticated SOCKS. However, it would be nice to have out-of-the-box support for SOCKS proxies the same way one has for HTTP proxies.
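(In case it helps: a Splash proxy profile is an INI file in the directory passed via --proxy-profiles-path. This is from memory of the Splash docs, so treat the exact field names as an assumption to verify; a SOCKS5 profile looks roughly like this.)

; proxy-profile.ini; values are placeholders, field names per the Splash docs
[proxy]
host=proxy.example.com
port=1080
username=user
password=pass
type=SOCKS5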

I found some code lying around for a download handler that supports SOCKS; I'll try to integrate it into my Scrapy project and post a pull request once it works.
