Spider finished normally despite an error in start_requests #4182
Comments
Although that won't help if an error happens in `parse`. This case is a bit different, as we'd lose only requests from one page, and that is something we half-expect. Although this could be used as a funny anti-bot protection: place an invalid invisible URL first on the page :)
@lopuhin Can you share the code that produces this error?
@Nimit-Khurana sure, here is an example spider:
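(The code block from this comment is missing; the following is a minimal sketch that reproduces the behaviour, assuming a scheme-less entry in `start_urls`. The exact original spider may have differed.)

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    # The second entry has no scheme, so the default start_requests()
    # raises ValueError ("Missing scheme in request url: fffgoogle.com")
    # while iterating over start_urls; the first URL is still crawled,
    # the third is never scheduled, and the crawl still ends with
    # 'finish_reason': 'finished'.
    start_urls = [
        "http://fffgoogle.com",
        "fffgoogle.com",
        "http://example.com",
    ]

    def parse(self, response):
        pass
```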
Re original issue: another potential solution is to distinguish better between request errors and crashes in the code: right now, failing to download a URL and a crash in the spider code both add one more ERROR to the log (and a download error also gets an entry in the stats, e.g. under `downloader/exception_count`).
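For illustration (not part of the original comment): since the finish reason stays `'finished'`, one way to detect the problem today is to check the stats Scrapy already records after the crawl, such as the `log_count/ERROR` counter. A sketch, assuming the `ExampleSpider` above:

```python
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
crawler = process.create_crawler(ExampleSpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

stats = crawler.stats.get_stats()
# 'finish_reason' is 'finished' even if start_requests crashed,
# so look at the error counter instead of the finish reason.
if stats.get("log_count/ERROR", 0):
    print("errors were logged; look for 'Error while obtaining start requests'")
```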
I tried a plain spider with two start URLs, and both the URLs were crawled. Is this what you are talking about?
Your start URLs do not match the log. In the log I see “http://fffgoogle.com/robots.txt”. |
@Gallaecio Sorry, the log is for the URL http://fffgoogle.com, the same as in the spider.
Displaying it as another kind of error would really help with quick identification. @lopuhin @Gallaecio I'm new to the Scrapy codebase but would like to help with this. If possible, could you point me to the part where Scrapy errors are managed?
This is a usability issue, although I'm not sure it's a good one. This is based on a real case we discovered with @whalebot-helmsman.

Consider a spider which crawls a large list of URLs and, in its `start_requests` method, loops through the URLs and yields `scrapy.Request` objects. If one URL is invalid, creating the request object fails, we get `ERROR: Error while obtaining start requests` caused by `ValueError: Missing scheme in request url:`, and the rest of the URLs are never scheduled, because `start_requests` crashed. So far this is OK, but the problem is that the issue is not trivial to diagnose: the spider finish reason will be `'finish_reason': 'finished'`, and quite often this particular error will not be the last one in the log, because requests that were already scheduled are still executed. So the URLs after the bad one are never scheduled, but it is hard to find that out.

I'm not sure it's a good solution, but maybe if errors in `start_requests` caused a different finish reason, or some message at the end of the log or in the stats, it would be easier to debug such an issue. On the other hand, a similar error could happen in the `parse` method, and there we won't be able to do anything about it.

Another potential solution is to delay URL validation, so that one faulty URL does not prevent the other URLs from being scheduled (while still producing an error in the log), but this probably has its own drawbacks; see the sketch below.
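A spider-side sketch of that last idea (an illustration, not code from this thread): catching the `ValueError` per URL inside `start_requests` keeps one faulty URL from preventing the rest from being scheduled, while still producing an error in the log. The spider name and URL list are made up:

```python
import scrapy


class ManyUrlsSpider(scrapy.Spider):
    name = "many_urls"  # hypothetical spider, for illustration only

    def start_requests(self):
        urls = ["http://example.com/a", "no-scheme-url", "http://example.com/b"]
        for url in urls:
            try:
                # Request() validates the URL eagerly and raises
                # ValueError ("Missing scheme in request url: ...") here.
                request = scrapy.Request(url, callback=self.parse)
            except ValueError:
                # Log and move on, so the remaining URLs are still scheduled.
                self.logger.exception("Skipping invalid URL %r", url)
                continue
            yield request

    def parse(self, response):
        pass
```

This only works around the symptom in user code; the suggestions above are about making the core behaviour itself easier to diagnose.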