Spider finished normally despite an error in start_requests #4182

Open
lopuhin opened this issue Nov 20, 2019 · 9 comments

Comments

lopuhin (Member) commented Nov 20, 2019

This is a usability issue, although I'm not sure it's a valid one. It is based on a real case we discovered with @whalebot-helmsman.

Consider a spider which crawls a large list of URLs: in its start_requests method it loops through the URLs and yields scrapy.Request objects. If one URL is invalid, creating the Request object fails, we get ERROR: Error while obtaining start requests caused by ValueError: Missing scheme in request url:, and the rest of the URLs are not scheduled, because start_requests crashed.

So far this is okay, but the problem is that the issue is not trivial to diagnose: the spider finish reason will be 'finish_reason': 'finished', and this particular error will often not be the last one in the log, because already scheduled requests will still be executed. So none of the URLs after the bad one will be scheduled, but that will be hard to find out.

I'm not sure it's a good solution, but maybe if errors in start_requests caused a different finish reason, or some message at the end of the log or in the stats, it would be easier to debug such an issue.

On the other hand, it's possible that a similar error could happen in the parse method, and we wouldn't be able to do anything about that.

Another potential solution is to delay URL validation, so that one faulty URL does not prevent the other URLs from being scheduled (while still producing an error in the log), but this probably has its own drawbacks.
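
In the meantime, a spider-level workaround can get a similar effect: catch the ValueError from scrapy.Request() per URL inside start_requests, so that a bad URL only produces a log entry and the remaining URLs are still scheduled. A minimal sketch (the spider name and URL list are made up), which only covers invalid-URL errors, not other crashes in start_requests:

import scrapy

class SafeStartSpider(scrapy.Spider):
    name = 'safe-start'

    def start_requests(self):
        urls = ['bad-url', 'http://good-url.org']  # made-up URL list
        for url in urls:
            try:
                request = scrapy.Request(url)
            except ValueError:
                # e.g. "Missing scheme in request url": log it and keep going
                self.logger.exception('Skipping invalid start URL %r', url)
                continue
            yield request

    def parse(self, response):
        return {'ok': True}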

lopuhin (Member, Author) commented Nov 20, 2019

Another potential solution is to delay URL validation

Although that won't help if an error in start_requests is a different one, not related to invalid URLs.

it's possible that a similar error could happen in the parse method

This case is a bit different, as we'd lose only the requests from one page, and this is something we half-expect. Although it could be used as a funny anti-bot protection: place an invalid, invisible URL first on the page :)

Nimit-Khurana commented:

@lopuhin Can you share the code that produces this error?

lopuhin (Member, Author) commented Nov 21, 2019

@Nimit-Khurana sure, here is an example spider:

import scrapy

class Spider(scrapy.Spider):
    name = 'spider'

    def start_requests(self):
        urls = ['bad-url', 'http://good-url.org']  # 'bad-url' has no scheme, so scrapy.Request() raises ValueError
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response):
        return {'ok': 'True'}
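
For reference, running this spider standalone (for example with scrapy runspider spider.py, assuming it is saved as spider.py) should reproduce the behaviour from the first comment: the ValueError for 'bad-url' is logged, http://good-url.org is never scheduled because it comes after the bad URL, and the stats still end with 'finish_reason': 'finished'.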

lopuhin (Member, Author) commented Nov 21, 2019

Re the original issue: another potential solution is to better distinguish between request errors and crashes in spider code. Right now, a failure to download a URL and a crash in the spider each add one more ERROR to the log (a download error also gets an entry under 'downloader/exception_type_count/X'), but this particular crash in the spider is not reflected in the stats in any way, except for 'log_count/ERROR' being higher by 1.
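
If I read the spider middleware interface correctly, the stats part could already be approximated with a small custom spider middleware that wraps the start requests iterator, bumps a custom stat (the 'start_requests/errors' key below is made up) and re-raises, so the existing error log is kept. A rough sketch:

class StartRequestsErrorStatsMiddleware:

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_start_requests(self, start_requests, spider):
        it = iter(start_requests)
        while True:
            try:
                request = next(it)
            except StopIteration:
                return
            except Exception:
                # reflect the crash in the stats, then re-raise so the usual
                # "Error while obtaining start requests" message is still logged
                self.stats.inc_value('start_requests/errors', spider=spider)
                raise
            yield request

It would have to be enabled through the SPIDER_MIDDLEWARES setting; the class name and stat key are placeholders.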

Nimit-Khurana commented Nov 23, 2019

@lopuhin test_log.txt

A plain spider with ['https://nsdkjvn.com','http://google.com'] as start_urls.

Both URLs were crawled. Is this what you are talking about?

Gallaecio (Member) commented:

Your start URLs do not match the log. In the log I see “http://fffgoogle.com/robots.txt”.

Nimit-Khurana commented:

@Gallaecio Sorry. The log is for the URL http://fffgoogle.com, the same as in the spider.

Prime-5 (Contributor) commented Jan 14, 2020

Re the original issue: another potential solution is to better distinguish between request errors and crashes in spider code. Right now, a failure to download a URL and a crash in the spider each add one more ERROR to the log (a download error also gets an entry under 'downloader/exception_type_count/X'), but this particular crash in the spider is not reflected in the stats in any way, except for 'log_count/ERROR' being higher by 1.

Displaying it as a different kind of error would really help with identification. @lopuhin @Gallaecio I'm new to the Scrapy codebase but would like to help with this. If possible, could you point me to the part where Scrapy errors are managed?

noviluni (Member) commented Nov 25, 2020

Related to: #4058 and #3463
