Problem with TOR + Splash #268

Closed
andverb opened this issue Aug 6, 2015 · 9 comments


andverb commented Aug 6, 2015

Hello! I recently found out about Splash and Scrapyjs and started using them in my scrapers. In my current project I have run into a problem and I just can't make it work, so I kindly ask you for help.
I usually use Tor for my Scrapy crawlers when I have problems with IP blocking. In my current project I need to scrape a website that both doesn't show content without JavaScript and blocks my IP after a very moderate number of requests. So I decided to combine Tor and Splash for this.
My default.ini file in /etc/splash/proxy-profiles/:
[proxy]
host=localhost
port=9050
type=SOCKS5
I run Splash like this:
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/ scrapinghub/splash
I added the proxy to the request URL, as I understood from the docs:
yield scrapy.Request(url+"proxy=default", self.parse_wine, meta={
'splash':{'endpoint': 'render.html','args': {'wait': 0.5}}})
But I get errors in Splash:
2015-08-06 15:05:53.714796 [-] "172.17.42.1" - - [06/Aug/2015:15:05:52 +0000] "POST /render.html HTTP/1.1" 502 21 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10"
2015-08-06 15:05:57.150016 [render] [37545816] loadFinished: RenderErrorInfo(type='Network', code=1, text=u'Connection refused', url=u"http://www.vinopedia.com/wine/The+Winner's+Tank+Shiraz+2005proxy=default")
My OS is Elementary OS 0.3, which is basically Ubuntu 14.04.
I would be very grateful if someone could point me in the right direction.

Member

kmike commented Aug 6, 2015

How did you install Splash? There are a few caveats if you use Docker for this, e.g. 'localhost' inside a Docker container is not the same as your localhost.

This is incorrect:

yield scrapy.Request(url+"proxy=default", self.parse_wine, meta={
    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 0.5}
    }
})

Correct usage would be

yield scrapy.Request(url, self.parse_wine, meta={
    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 0.5, 'proxy': 'default'}
    }
})

but it shouldn't matter because if you use a "default" proxy profile there is no need to pass "proxy=default" argument.
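To make the difference concrete, here is a minimal sketch of a helper that builds the `meta` dict. The name `splash_meta` is hypothetical and not part of Scrapy or scrapyjs; it only illustrates that the proxy profile name goes into `args` and is never appended to the target URL.

```python
def splash_meta(wait=0.5, proxy=None):
    """Build the 'splash' meta dict for a Scrapy request (hypothetical helper)."""
    args = {'wait': wait}
    if proxy is not None:
        # The proxy profile name is a Splash argument, not part of the target URL.
        args['proxy'] = proxy
    return {'splash': {'endpoint': 'render.html', 'args': args}}


# Usage inside a spider callback (assumes scrapy is imported and parse_wine exists):
#   yield scrapy.Request(url, self.parse_wine, meta=splash_meta(proxy='default'))
```

Calling `splash_meta()` without a proxy leaves `args` as `{'wait': 0.5}`, which matches the case where the "default" profile is picked up automatically.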

Member

kmike commented Aug 6, 2015

I just tried using one of the SOCKS5 proxies from this list; it worked both with proxy profiles and request:set_proxy method in Lua script.

By the way, SOCKS5 proxying is a new feature in upcoming Splash release; are you using latest Splash master? In Splash 1.6 it won't work.

Author

andverb commented Aug 6, 2015

  • How did you install Splash?
    The usual way, as written in the docs; I haven't encountered any problems during installation.
  • 'localhost' inside Docker container is not the same as your localhost.
    I believe I should use 127.0.0.1 instead, right?
  • are you using latest Splash master? In Splash 1.6 it won't work.
    Ah, this is probably the case; I have Splash 1.6. I will get a fresh master and try again.

Thanks for the answer!

Member

kmike commented Aug 6, 2015

'localhost' inside Docker container is not the same as your localhost. I believe I should use 127.0.0.1 instead, right?

No, it is more complicated. It will be something like http://10.0.2.2. There is a discussion at #234.
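As a rough sketch of what changing the profile involves, the snippet below generates the proxy profile text for a given host address. `proxy_profile` is a hypothetical helper, and 10.0.2.2 is only the example address mentioned here; the right value depends on the Docker setup. Note also that Tor listens on 127.0.0.1 by default, so the address in the profile has to be one Tor actually accepts connections on.

```python
import configparser
import io


def proxy_profile(host, port=9050, proxy_type='SOCKS5'):
    """Return the text of a Splash proxy profile pointing at host:port."""
    config = configparser.ConfigParser()
    config['proxy'] = {'host': host, 'port': str(port), 'type': proxy_type}
    buf = io.StringIO()
    # space_around_delimiters=False produces 'host=...' style lines.
    config.write(buf, space_around_delimiters=False)
    return buf.getvalue()


# Example address from this thread; substitute the address that works for your setup.
print(proxy_profile('10.0.2.2'))
```

The output can be saved as default.ini in the directory that is mounted into the container as /etc/splash/proxy-profiles/.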

I'm releasing Splash 1.7 now; SOCKS5 support will be there.

Author

andverb commented Aug 7, 2015

Thanks! Still no luck with 10.0.2.2:9050. I guess the most time-saving option for me would be installing Splash without Docker. There is a section in the docs about installing on 12.04; I guess 14.04 should work the same.

Author

andverb commented Aug 8, 2015

Okay, so installing all dependencies and running Splash without Docker solved the issue.
Everything works, although I am getting a new error now:
2015-08-08 15:53:40.983302 [-] Qt says: 1 QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
Sometimes it appears a few times in a row.
Any idea what this error means?
Thanks for the answers!

Member

kmike commented Aug 8, 2015

@andverb cool! I've also seen these messages in logs, but never found any problem they can cause.

Member

kmike commented Aug 13, 2015

I'm closing this issue because it seems Tor works fine.

@kmike kmike closed this as completed Aug 13, 2015
@nordborn

Hello!

default.ini file in /etc/splash/proxy-profiles/
[proxy]
host=localhost
port=9050
type=SOCKS5

The solution is to run Docker with these args:
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/ --net="host" scrapinghub/splash

--net="host" makes the container share the host's network stack, so localhost inside the container is the host's localhost.
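A quick way to check whether the address in the proxy profile is reachable from where Splash runs is a plain TCP connection test. This is only a sketch: `can_connect` is a hypothetical helper, and a successful connection shows that something is listening on the port, not that it is a working SOCKS5 proxy.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers 'Connection refused', timeouts, and unreachable hosts.
        return False


# Example (9050 is the port from the proxy profile above):
#   print(can_connect('localhost', 9050))
```

If this returns False for the host and port in default.ini, Splash would be expected to report a "Connection refused" error like the one at the top of this issue.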
