Use Splash with Tor #105

Closed
ceylanMeric opened this issue Feb 17, 2017 · 7 comments

@ceylanMeric

ceylanMeric commented Feb 17, 2017

I am trying to configure proxy settings for Splash. I pass the Tor or Polipo host and port to set_proxy, but it doesn't work; I get a 504 error:

    function main(splash)
        local host = "localhost"
        local port = 8123
        --local type = "SOCKS5"

        splash:on_request(function (request)
            request:set_proxy{host, port}
        end)

        splash:go(splash.args.url)
        splash:wait(0.5)
        local image = assert(splash:png{render_all=true})
        return {png=image}
    end
    """

    url = 'https://www.torproject.org/'

In polipo.config (9150 is the Tor port):

socksParentProxy = localhost:9150
diskCacheRoot=""
#socksProxyType = socks5

In settings.py:

HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'thefork.middlewares.RandomUserAgentMiddleware': 400,
    'thefork.middlewares.ProxyMiddleware': 410,
}

How can I fix this, or is there an easier way to use Splash with Tor? Please help :(

@kmike
Member

kmike commented Feb 17, 2017

Are you running Tor in the same Docker container as Splash? If not, Tor is probably not reachable at localhost, because inside the Splash container localhost refers to the container itself, not to your machine.

If you're running Tor in a separate Docker container, you can link the two containers.

https://github.com/TeamHG-Memex/aquarium has Tor set up; you can check the configs there.

See also: scrapinghub/splash#234 and scrapinghub/splash#268.
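
The localhost point above can be checked directly. Below is a minimal Python sketch, using only the standard library, that reports whether a TCP connection to a given host and port succeeds. The function name `can_connect` is hypothetical, and the check only shows that something is listening, not that it is Tor. Run from inside the Splash container, it would show whether a candidate address is reachable from there.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

For example, `can_connect("localhost", 8123)` tests the Polipo port from the first post; the result depends on where the code runs, which is exactly the difference between the container's localhost and the host machine's.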

@ceylanMeric
Author

ceylanMeric commented Feb 18, 2017

I have installed Aquarium and started it with "sudo docker-compose up", and Tor is 1. It looks fine, but should I use these numbers (172.18.0.6, 7562)?
[Image 1]

And my port status:
[Image 2]

Haproxy.cfg settings:

userlist users
    user ceylan_meric insecure-password XXX

defaults
    log global
    mode http

# visit 0.0.0.0:8036 to see HAProxy stats page
listen stats
    bind *:8036
    mode http
    stats enable
    stats hide-version
    stats show-legends
    stats show-desc Splash Cluster
    stats uri /
    stats refresh 10s
    stats realm Haproxy\ Statistics
    stats auth    XXX

# Splash Cluster configuration
frontend http-in
    bind *:8050

    # http basic auth
    acl auth_ok http_auth(users)
    http-request auth realm Splash if !auth_ok
    http-request allow if auth_ok
    http-request deny

backend splash-misc
    balance roundrobin
    server splash-0 splash0:8050 check fall 15
    server splash-1 splash1:8050 check fall 15
    server splash-2 splash2:8050 check fall 15

Tor.ini settings:

; enable tor for all requests
[proxy]
host = tor
;host = 192.168.99.100
port = 9150
type = socks5

Settings.py:

SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}

My Spider:

script = """
function main(splash)
    splash:on_request(function(request)
        request:set_proxy{
            host = "localhost",
            port = 9150,
        }
    end)
    assert(splash:go(splash.args.url))
"""

When I run the spider, I get a 401 error:

DEBUG: Crawled (401) <GET https://www.torproject.org/ via http://localhost:8050/execute> (referer: None) ['partial']
2017-02-18 16:10:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.torproject.org/>: HTTP status code is not handled or not allowed

Sorry for posting everything here. I looked at your links but couldn't solve the problem. I am confused: what am I doing wrong, or what should I check to get this running?

@ceylanMeric
Author

@kmike, this warning also appears when I run "docker-compose up":

tor_1 | WARNING: no logs are available with the 'none' log driver

I think the problem is that Tor is not running here. Does Aquarium have its own Tor? If it does, I don't have to change the host and port settings in tor.ini, because Aquarium's Tor is unrelated to my local Tor Browser (port 9150). How can I make sure Aquarium's Tor is running?

@kmike
Member

kmike commented Feb 19, 2017

@ceylanMeric

request:set_proxy{host = "localhost", port = 9150}

^^ this is still incorrect: Tor is not available at Splash's localhost, because it runs in a separate Docker container. It should be something like

request:set_proxy{host = "tor", port = 9050, type='socks'}

You can also just pass the 'proxy=tor' argument to the request; Aquarium provides a tor proxy profile by default: https://github.com/TeamHG-Memex/aquarium/blob/master/%7B%7Bcookiecutter.folder_name%7D%7D/proxy-profiles/tor.ini
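
As a rough illustration of the 'proxy=tor' suggestion, here is a Python sketch that builds a Splash render.html URL carrying the proxy profile name as a query argument. The helper name `splash_render_url`, the default `wait` value, and the base URL in the usage note are assumptions for illustration. It assumes a proxy profile named tor exists on the Splash side, as Aquarium provides.

```python
from urllib.parse import urlencode


def splash_render_url(splash_url, target_url, proxy="tor", wait=0.5):
    """Build a render.html URL that asks Splash to use a named proxy profile."""
    # 'url', 'proxy' and 'wait' are sent as query-string arguments;
    # urlencode percent-escapes the target URL.
    query = urlencode({"url": target_url, "proxy": proxy, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"
```

Calling `splash_render_url("http://localhost:8050/", "https://www.torproject.org/")` gives a URL that can be fetched with any HTTP client. With scrapy-splash, the same argument would presumably go into the request's Splash args, for example `args={'proxy': 'tor'}`.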

@ceylanMeric
Author

@kmike

I still get the 401 error:
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.torproject.org/>: HTTP status code is not handled or not allowed

Also, in haproxy.cfg I have a username and password:
user ceylan_meric insecure-password XXXXXX
So, do I need to send the username and password like this:

splash:on_request(function(request)
    request:set_proxy{
        host = "tor",
        port = 9050,
        type = 'socks'
        --username = "ceylan_meric",
        --password = "XXXXXXXX",
    }
end)

In settings.py:
SPLASH_URL = 'http://0.0.0.0:8050/'
So my spider talks to Splash, but the 401 error always occurs.

This warning might also be causing the problem:
tor_1 | WARNING: no logs are available with the 'none' log driver
I'm not sure that docker-compose up actually starts Tor.
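
One possible reading of the 401, not confirmed anywhere in this thread: the haproxy.cfg shown earlier requires HTTP basic auth on port 8050 (`http-request auth realm Splash if !auth_ok`), so HAProxy would reject any request to Splash that lacks credentials, regardless of the Tor proxy settings. If so, the username and password belong on the request to Splash, not in set_proxy. The Python sketch below builds such a header with the standard library; the function name `basic_auth_header` is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Return an HTTP Basic Authorization header for the given credentials."""
    # Basic auth is the base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

With scrapy-splash, a header like this could presumably be passed through the `splash_headers` argument of `SplashRequest`, so that it is sent to the Splash endpoint rather than to the target site. Treat this as an assumption to verify against the scrapy-splash documentation.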

@Gallaecio
Contributor

@ceylanMeric Did you ever manage to solve this issue?

@Gallaecio
Contributor

Closing due to a lack of feedback.

3 participants