
504 Gateway Time-out #28

Closed
odmaaa opened this issue Nov 25, 2015 · 5 comments


odmaaa commented Nov 25, 2015

Hello,
I am crawling a website with about 10K pages. At first every response is 200 and everything is fine, but after a few minutes 504 Gateway Time-out errors appear, and after 3 retries Scrapy gives up. My settings are:

    'CONCURRENT_REQUESTS':10,
    'HTTPCACHE_ENABLED':True,
    'DOWNLOAD_DELAY':5,
    'CONCURRENT_REQUESTS_PER_IP':10,

and the endpoint is render.html:

'splash' : {
    'endpoint' : 'render.html',
    'args' : {'wait':1},
}

I am using:
* Scrapy version: 1.0.3
* Python: 2.7
* Docker server

How can I optimize my crawler and avoid the 504 errors?

Member

kmike commented Nov 25, 2015

Hey @omkaaa,

Please check http://splash.readthedocs.org/en/stable/faq.html - does it help?

Author

odmaaa commented Nov 27, 2015

Hi @kmike ,

Yes, it helped, thank you. I disabled images and set the timeout to 720, and everything worked great.
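A minimal sketch of what that change might look like, assuming 720 was passed as the Splash `timeout` argument in the request `args` (the helper name `splash_args` is hypothetical and only for illustration). Splash rejects a `timeout` above the server's maximum, so a value this large would also need Splash to be started with a larger `--max-timeout`.

```python
def splash_args(wait=1, timeout=720, images=0):
    """Build the 'args' dict for a render.html request (hypothetical helper)."""
    return {
        'wait': wait,        # seconds to wait after the page has loaded
        'timeout': timeout,  # overall render timeout, in seconds
        'images': images,    # 0 tells Splash not to download images
    }


# Same 'splash' meta layout as in the original question.
meta = {
    'splash': {
        'endpoint': 'render.html',
        'args': splash_args(),
    },
}
```

The resulting `meta` dict would be passed to the request in the same place as the `'splash'` entry shown in the question.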

Member

kmike commented Nov 27, 2015

@omkaaa glad to hear that!

kmike closed this as completed Nov 27, 2015

yeszao commented Jul 4, 2019

Following @omkaaa, I changed the args to:

yield SplashRequest(url, self.parse, args={'wait': 0.5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0})

It works!

Besides, some websites are very quick when you use curl or a browser but very slow in Splash, because Splash cannot download some resources correctly.

This can also lead to 504 Gateway Time-out. The right way is to stop the slow resource downloads. In Splash, you can set resource_timeout in args:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.parse,
                                args={'wait': 0.5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0, 'resource_timeout': 10},
                                )
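One thing to keep in mind with these two values: `resource_timeout` applies to each individual network request, while `timeout` covers the whole render, so the per-resource limit only helps if it is smaller than the overall one. A small sanity check could look like the sketch below; `check_splash_timeouts` is a hypothetical helper, not part of Splash or scrapy-splash.

```python
def check_splash_timeouts(args):
    """Return args unchanged, or raise ValueError if resource_timeout
    is not smaller than timeout (hypothetical helper)."""
    timeout = args.get('timeout')
    resource_timeout = args.get('resource_timeout')
    if timeout is not None and resource_timeout is not None:
        if resource_timeout >= timeout:
            raise ValueError(
                'resource_timeout (%s) should be smaller than timeout (%s)'
                % (resource_timeout, timeout)
            )
    return args


# The args from the snippet above pass the check.
args = check_splash_timeouts({
    'wait': 0.5,
    'viewport': '1024x2480',
    'timeout': 90,
    'images': 0,
    'resource_timeout': 10,
})
```

The checked dict can then be passed as `args=args` to `SplashRequest`.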

kamrankausar commented

Thanks @yeszao it works
