Spider finished normally despite an error in start_requests #4182

Open
lopuhin opened this issue Nov 20, 2019 · 9 comments

Comments

lopuhin (Member) commented Nov 20, 2019

This is a usability issue, although I'm not sure it's a valid one. It is based on a real case we discovered with @whalebot-helmsman.

Consider a spider which crawls a large list of URLs: in its start_requests method it loops through the URLs and yields scrapy.Request objects. If one URL is invalid, creating the Request object fails, we get ERROR: Error while obtaining start requests caused by ValueError: Missing scheme in request url:, and the rest of the URLs are not scheduled, because start_requests crashed.

So far this is okay, but the problem is that the issue is not trivial to diagnose: the spider finish reason will be 'finish_reason': 'finished', and this particular error will often not be the last one in the log, because already scheduled requests will still be executed. So none of the URLs after the bad one will be scheduled, but that will be hard to find out.

I'm not sure it's a good solution, but maybe if errors in start_requests caused a different finish reason, or some message at the end of the log or in the stats, it would be easier to debug such an issue.

On the other hand, it's possible that a similar error could happen in the parse method, and we wouldn't be able to do anything about that.

Another potential solution is to delay URL validation, so that one faulty URL does not prevent the other URLs from being scheduled (while still producing an error in the log), but this probably has its own drawbacks.
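
In the meantime, a spider-level workaround can get a similar effect: catch the ValueError from scrapy.Request() per URL inside start_requests, so that a bad URL only produces a log entry and the remaining URLs are still scheduled. A minimal sketch (the spider name and URL list are made up), which only covers invalid-URL errors, not other crashes in start_requests:

import scrapy

class SafeStartSpider(scrapy.Spider):
    name = 'safe-start'

    def start_requests(self):
        urls = ['bad-url', 'http://good-url.org']  # made-up URL list
        for url in urls:
            try:
                request = scrapy.Request(url)
            except ValueError:
                # e.g. "Missing scheme in request url": log it and keep going
                self.logger.exception('Skipping invalid start URL %r', url)
                continue
            yield request

    def parse(self, response):
        return {'ok': True}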

lopuhin (Member, Author) commented Nov 20, 2019

Another potential solution is to delay URL validation

Although that won't help if an error in start_requests is a different one, not related to invalid URLs.

it's possible that a similar error could happen in the parse method

This case is a bit different, as we'd lose only the requests from one page, and this is something we half-expect. Although it could be used as a funny anti-bot protection: place an invalid, invisible URL first on the page :)

Nimit-Khurana commented:

@lopuhin Can you share the code that produces this error?

lopuhin (Member, Author) commented Nov 21, 2019

@Nimit-Khurana sure, here is an example spider:

import scrapy

class Spider(scrapy.Spider):
    name = 'spider'

    def start_requests(self):
        urls = ['bad-url', 'http://good-url.org']  # 'bad-url' has no scheme, so scrapy.Request() raises ValueError
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response):
        return {'ok': 'True'}
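
For reference, running this spider standalone (for example with scrapy runspider spider.py, assuming it is saved as spider.py) should reproduce the behaviour from the first comment: the ValueError for 'bad-url' is logged, http://good-url.org is never scheduled because it comes after the bad URL, and the stats still end with 'finish_reason': 'finished'.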

lopuhin (Member, Author) commented Nov 21, 2019

Re the original issue: another potential solution is to better distinguish between request errors and crashes in spider code. Right now, a failure to download a URL and a crash in the spider each add one more ERROR to the log (a download error also gets an entry under 'downloader/exception_type_count/X'), but this particular crash in the spider is not reflected in the stats in any way, except for 'log_count/ERROR' being higher by 1.
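
If I read the spider middleware interface correctly, the stats part could already be approximated with a small custom spider middleware that wraps the start requests iterator, bumps a custom stat (the 'start_requests/errors' key below is made up) and re-raises, so the existing error log is kept. A rough sketch:

class StartRequestsErrorStatsMiddleware:

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_start_requests(self, start_requests, spider):
        it = iter(start_requests)
        while True:
            try:
                request = next(it)
            except StopIteration:
                return
            except Exception:
                # reflect the crash in the stats, then re-raise so the usual
                # "Error while obtaining start requests" message is still logged
                self.stats.inc_value('start_requests/errors', spider=spider)
                raise
            yield request

It would have to be enabled through the SPIDER_MIDDLEWARES setting; the class name and stat key are placeholders.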

Nimit-Khurana commented Nov 23, 2019

@lopuhin test_log.txt

A plain spider with ['https://nsdkjvn.com','http://google.com'] as start_urls.

Both URLs were crawled. Is this what you are talking about?

Gallaecio (Member) commented:

Your start URLs do not match the log. In the log I see “http://fffgoogle.com/robots.txt”.

Nimit-Khurana commented:

@Gallaecio Sorry. The log is for the URL http://fffgoogle.com, the same as in the spider.

Prime-5 (Contributor) commented Jan 14, 2020

Re the original issue: another potential solution is to better distinguish between request errors and crashes in spider code. Right now, a failure to download a URL and a crash in the spider each add one more ERROR to the log (a download error also gets an entry under 'downloader/exception_type_count/X'), but this particular crash in the spider is not reflected in the stats in any way, except for 'log_count/ERROR' being higher by 1.

Displaying it as a different kind of error would really help with identification. @lopuhin @Gallaecio I'm new to the Scrapy codebase but would like to help with this. If possible, could you point me to the part where Scrapy errors are managed?

noviluni (Member) commented Nov 25, 2020

Related to: #4058 and #3463
