
[MRG+1] Fix for KeyError in robots.txt middleware #1735

Merged: 3 commits merged into scrapy:master on Feb 3, 2016

Conversation

@ArturGaspar (Contributor) commented Jan 27, 2016

Fixes #1733

@codecov-io commented Jan 27, 2016

Current coverage is 83.30%

Merging #1735 into master will increase coverage by +0.01% as of 380e480


@redapple redapple modified the milestone: v1.1 Jan 27, 2016
@kmike kmike changed the title Fix for KeyError in robots.txt middleware [MRG+1] Fix for KeyError in robots.txt middleware Jan 27, 2016
@kmike (Member) commented Jan 27, 2016

LGTM.

```python
# deferLater() ensures that the error callback isn't called
# immediately upon being added, so that it doesn't remove the key
# before we check for it.
dfd = deferLater(reactor, 0, self.crawler.engine.download,
```

@nramirezuy (Contributor) commented Jan 27, 2016

Don't you need to use .1 or greater to be really safe?

@kmike (Member) commented Jan 27, 2016

@nramirezuy what additional safety does 0.1 provide? Won't it just make the error message appear further down in the log?

@nramirezuy (Contributor) commented Jan 27, 2016

If 0.1 just pushes the error further down the log, how does 0 solve the issue?

@kmike (Member) commented Jan 27, 2016

It solves the KeyError issue because it ensures the dfd errbacks are not called while the robot_parser function is executing. The error message can still end up in the wrong place, but I don't see why we would want to push it further down by using 0.1 instead of 0.
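For readers following along, here is a minimal, self-contained sketch of the failure mode and the fix being discussed. It is not the middleware's actual code: parsers, forget_netloc and download_failing_synchronously are made-up stand-ins. The point is that an errback added to an already-failed Deferred runs synchronously, so it can delete the dict entry before the caller gets to it, whereas deferLater(reactor, 0, ...) pushes the download to a later reactor iteration:

```python
from twisted.internet import defer, reactor, task

parsers = {}

def download_failing_synchronously():
    # Stand-in for a download that fails right away (already-errbacked Deferred).
    return defer.fail(RuntimeError("robots.txt download failed"))

def forget_netloc(failure, netloc):
    # Behaviour under discussion: drop the key when the download fails.
    parsers.pop(netloc, None)

def buggy(netloc):
    parsers[netloc] = defer.Deferred()
    d = download_failing_synchronously()
    # The Deferred has already failed, so this errback fires synchronously
    # and removes the key right here...
    d.addErrback(forget_netloc, netloc)
    # ...and this lookup raises KeyError.
    parsers[netloc].addErrback(lambda f: None)

def fixed(netloc):
    parsers[netloc] = defer.Deferred()
    # deferLater(reactor, 0, ...) runs the download on a later reactor
    # iteration, so no errback can fire before this function returns.
    d = task.deferLater(reactor, 0, download_failing_synchronously)
    d.addErrback(forget_netloc, netloc)
    parsers[netloc].addErrback(lambda f: None)  # the key is still present here
```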

@nramirezuy (Contributor) commented Jan 27, 2016

I feel this component needs a slight rework.

  • Downloading the robots.txt request and appending other requests to its result should be two different methods/callbacks.
  • Also, why is the netloc removed? Shouldn't it just store some value, maybe the failure? Removing it causes the robots.txt request to be retried in the future, even though it has probably already been retried by the RetryMiddleware.
  • Why aren't you using Request.callback and Request.errback? That would make it easier to understand.

@kmike (Member) commented Jan 27, 2016

@nramirezuy can refactoring this middleware be a separate issue? It has a few other problems: it doesn't handle Crawl-Delay directives (#892), and it breaks on robots.txt files found in the wild (#754).

@kmike (Member) commented Jan 27, 2016

Or do you feel refactoring will help with this particular issue?

@nramirezuy (Contributor) commented Jan 27, 2016

@kmike Yes, the second point should fix the issue. Just avoid every request to a domain whose robots.txt failed to download. For example, what if the domain just blocked the user agent you are using, or the crawler's IP?

Storing the failure might be overkill; just storing None should be enough. You could also collect stats there and see the different failures you are getting on robots.txt requests.

@ArturGaspar (Contributor, Author) commented Jan 27, 2016

@nramirezuy I'm not sure why I was removing the netloc in the first place. All tests still pass when leaving it as None, and that is also what the middleware already did, so I'm restoring that behaviour.

I don't understand your first suggestion, and I'm not using callback and errback because the previous code didn't use them.
I think blocking every request when robots.txt fails to download is inconsistent with the behaviour on other errors (and incompatible with the previous behaviour on download errors), which is to treat it as allowing everything.
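As a reference for the "store None instead of removing the netloc" idea from the comments above, here is a minimal sketch. The names (_parsers, _robots_error, _allowed) are illustrative rather than the middleware's actual code; None simply means "no parser available, allow everything", and the parser object could be e.g. a RobotFileParser:

```python
class RobotsCacheSketch:
    """Illustrative only: remember a failed robots.txt fetch instead of
    forgetting it, so the download is not re-attempted for every request."""

    def __init__(self):
        self._parsers = {}  # netloc -> parser object, or None ("allow all")

    def _robots_error(self, failure, netloc):
        # On a download error, keep the key but store None: the failure is
        # remembered, and later requests to this netloc are simply allowed.
        self._parsers[netloc] = None

    def _allowed(self, netloc, useragent, url):
        parser = self._parsers.get(netloc)
        if parser is None:
            return True  # no robots.txt available: treat as allowing all
        return parser.can_fetch(useragent, url)
```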

@nramirezuy (Contributor) commented Jan 28, 2016

@ArturGaspar I just looked at the previous version, before the deferred. It looks like the robots.txt request was made only once; if it failed you were still allowed to download from the site, and downloads were also allowed while the robots.txt request had not been handled yet.

@nramirezuy (Contributor) commented Jan 28, 2016

Can we document the behaviour? With the deferred implementation the behaviour changed: now it waits for the robots.txt request to be downloaded.

@kmike (Member) commented Jan 28, 2016

@nramirezuy I think that not downloading when the robots.txt request fails is a bug, but waiting for robots.txt to be downloaded before proceeding with the main website is a bug fix.
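For context, a rough sketch of the waiting behaviour being described, assuming (as the thread says) that the middleware makes Scrapy wait by returning a Deferred from process_request; the class, the robot_parser stub, and _check_allowed are illustrative, not the exact code in this PR:

```python
from twisted.internet.defer import maybeDeferred
from scrapy.exceptions import IgnoreRequest

class WaitForRobotsSketch:
    def process_request(self, request, spider):
        if request.meta.get('dont_obey_robotstxt'):
            return  # never gate the robots.txt request on itself
        # Chain the permission check onto the robots.txt fetch: the request
        # proceeds only after robots.txt has been handled (or has failed).
        d = maybeDeferred(self.robot_parser, request, spider)
        d.addCallback(self._check_allowed, request, spider)
        return d

    def robot_parser(self, request, spider):
        # Stub for the sketch: the real middleware fetches and parses
        # robots.txt for the request's netloc, returning a Deferred.
        return None

    def _check_allowed(self, parser, request, spider):
        # parser is a parser object or None ("allow all"), as discussed above.
        if parser is not None and not parser.can_fetch('*', request.url):
            raise IgnoreRequest("Forbidden by robots.txt")
```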

@nramirezuy (Contributor) commented Jan 28, 2016

@kmike I agree, that is how I think it should be implemented. But we should also document it, because that will inform users of the middleware and will help us review it in the future.

@ArturGaspar (Contributor, Author) commented Jan 28, 2016

@nramirezuy The documentation mentioned it as a limitation, and #1473 removed that warning from the documentation.

@nramirezuy (Contributor) commented Jan 28, 2016

@ArturGaspar But we should add a warning about what happens when the robots.txt request fails (when it reaches the errback), and also mention that it goes through the downloader middlewares but doesn't use the spider middlewares.
Sorry, I'm just dumping everything while I'm at it; otherwise it gets lost in the void.

@ArturGaspar (Contributor, Author) commented Jan 28, 2016

@nramirezuy Isn't it better to make a separate issue for improving documentation on this middleware?

@kmike (Member) commented Jan 29, 2016

Sorry, I'm falling behind :) What currently happens if the robots.txt request fails?

  • Is it retried?
  • If robots.txt request fails, is the main request sent? If so, does it happen before or after retries?
  • Is an error with robots.txt logged?
  • Is the behaviour different in Scrapy 1.0.x?

@ArturGaspar (Contributor, Author) commented Jan 29, 2016

@kmike

> Is it retried?

By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

> If robots.txt request fails, is the main request sent? If so, does it happen before or after retries?

Yes. After the retries.

> Is an error with robots.txt logged?

Yes.

> Is the behaviour different in Scrapy 1.0.x?

With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it has been downloaded).

@kmike (Member) commented Jan 29, 2016

👍 that's perfect

@nramirezuy (Contributor) commented Jan 29, 2016

> Is it retried?
>
> By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

It depends on the position of RobotsTxtMiddleware in the chain; with the default position, the answer is valid.

> If robots.txt request fails, is the main request sent? If so, does it happen before or after retries?
>
> Yes. After the retries.

It depends on the position of RobotsTxtMiddleware in the chain; with the default position, the answer is valid.

> Is the behaviour different in Scrapy 1.0.x?
>
> With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it has been downloaded).

The 1.0.x behaviour was bogus, so it has to change. Maybe in another PR.

@ArturGaspar can you create a PR changing the behaviour (#1735 (comment)) along with its documentation?

@ArturGaspar (Contributor, Author) commented Jan 29, 2016

@nramirezuy I think the position of the middleware doesn't matter, because it calls engine.download() on a new request, which goes through the entire middleware chain. Isn't that so?

> @ArturGaspar can you create a PR changing the behaviour (#1735 (comment)) along with its documentation?

Sorry, I'm confused. Do you mean documenting only what was improved in #1473 and this PR, or the behaviour that was already present (what happens when robots.txt fails, etc.)?
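To illustrate the point about middleware position (a hedged sketch built around the crawler.engine.download() call quoted earlier in this thread; the class and method names here are made up): the robots.txt request is handed to the engine as a brand-new download, so it traverses the full downloader middleware chain no matter where RobotsTxtMiddleware itself sits:

```python
from scrapy import Request

class RobotsFetchSketch:
    def __init__(self, crawler):
        self.crawler = crawler

    def fetch_robots(self, scheme, netloc, spider):
        robotsreq = Request(
            "%s://%s/robots.txt" % (scheme, netloc),
            # dont_obey_robotstxt keeps the middleware from trying to check
            # robots.txt for the robots.txt request itself.
            meta={'dont_obey_robotstxt': True},
        )
        # engine.download() schedules a fresh download that passes through the
        # whole downloader middleware chain (retry, user agent, etc.),
        # regardless of this middleware's own position in that chain.
        return self.crawler.engine.download(robotsreq, spider)
```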

@nramirezuy (Contributor) commented Jan 29, 2016

@ArturGaspar Yeah, you are right, it will work regardless. What I'm not sure about is whether it gets propagated to the spider middlewares, and whether it behaves the same for errors and for normal responses.

We should first fix this (#1735 (comment)), then document the change. It would also be useful to document that it does pass through the downloader middlewares, and whether it gets propagated to the spider middlewares too, so we don't have to figure it out every time we want to use it.

@kmike (Member) commented Jan 29, 2016

@nramirezuy sorry, what exactly should we fix?

@nramirezuy (Contributor) commented Jan 29, 2016

> I think that not downloading when robots.txt request fails is a bug

This should be settable (#1735 (comment)); the site can even forbid you from accessing robots.txt.

  • #1735 (comment) This behaviour has to be documented; removing the older warning isn't enough. Adding a note instead might be the way to go.
  • Refactoring the middleware should be in the queue; so far it is our only example of using deferreds from a downloader middleware, and it is a mess. One method shouldn't be doing two things.

@eliasdorneles (Member) commented Feb 2, 2016

Hey fellows, any chance we can agree and wrap this up to include it in the next release?

I see this already has a merge vote and it already improves things; I would have merged it if it weren't for this additional discussion (bikeshedding? 🚲 🎪) :)

Should the additional work (refactoring + documenting) be a separate issue?

@eliasdorneles (Member) commented Feb 3, 2016

Hey @kmike @nramirezuy @ArturGaspar, I'll merge this one now in order to move forward with the 1.1 RC release.
@nramirezuy can you please open a ticket for the follow-up?

eliasdorneles added a commit that referenced this pull request Feb 3, 2016
[MRG+1] Fix for KeyError in robots.txt middleware
@eliasdorneles merged commit a8a6f05 into scrapy:master Feb 3, 2016
2 checks passed
codecov/patch: 100.00% of diff hit (target 100.00%)
continuous-integration/travis-ci/pr: The Travis CI build passed