[MRG+1] [GSoC 2019] Interface for robots.txt parsers #3796
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3796      +/-   ##
==========================================
- Coverage   85.42%   82.64%    -2.79%
==========================================
  Files         169      170       +1
  Lines        9635     9664      +29
  Branches     1433     1433
==========================================
- Hits         8231     7987     -244
- Misses       1156     1419     +263
- Partials      248      258      +10
Codecov Report
@@            Coverage Diff            @@
##           master    #3796     +/-   ##
=========================================
+ Coverage   85.42%   85.62%    +0.2%
=========================================
  Files         169      165       -4
  Lines        9635     9624      -11
  Branches     1433     1434       +1
=========================================
+ Hits         8231     8241      +10
+ Misses       1156     1135      -21
  Partials      248      248
https://stackoverflow.com/a/19328146/939364 looks like a good read on the subject. I think we should abstain from using them unless we find a good reason to do it.
I’ve dropped a few questions regarding the proposed API. Bear in mind that I’m barely familiar with robots.txt
at this point, so there may be stupid questions :)
I am thinking that we won't like to have
I think it is common in Scrapy to raise an import exception in case some class is missing (in middlewares or queues). Nothing special is required here.
To elaborate a bit more, you could use a
Looking great!
Is there any particular reason why we are using
I’m guessing that it has something to do with Twisted’s asynchronous nature, and that some tests may not work otherwise. But I don’t know for sure.
@@ -28,6 +29,7 @@ def __init__(self, crawler):
        self.crawler = crawler
        self._useragent = crawler.settings.get('USER_AGENT')
I know this is how it worked before, and this is a separate issue, but using crawler.settings.get('USER_AGENT') doesn't feel right: the user agent middleware can be disabled, and users may also override the user agent per request.
@anubhavp28 has made it so that per-request user agent strings are respected.
As for the user agent middleware, I think it makes sense for the robotstxt middleware to rely on the same setting and the request header regardless of whether the user agent middleware is enabled. That offers the most intuitive behavior when the robotstxt middleware is in use.
@anubhavp28: we do need to update the documentation of the USER_AGENT
setting to make it clear that both the user-agent middleware and this robotstxt middleware use it, and how (i.e. how it can be overridden in each case).
There is one use case I can think of that we are not covering with the current approach: a user wants to send one user-agent header value but have a different user agent string evaluated to decide which pages to crawl.
To support this, I think we could implement a ROBOTSTXT_USER_AGENT
setting with a higher priority than both the USER_AGENT
setting and the header of a specific request, and a robotstxt_user_agent
meta key that allows overriding the ROBOTSTXT_USER_AGENT
setting for specific requests.
@kmike @whalebot-helmsman Thoughts?
We already have crawler.settings.get('USER_AGENT') and request.headers['User-Agent']; we should support both and add corresponding documentation.
To support this, I think we could implement a ROBOTSTXT_USER_AGENT setting with a higher priority than both the USER_AGENT setting and the header of a specific request, and a robotstxt_user_agent meta key that allows overriding the ROBOTSTXT_USER_AGENT setting for specific requests.
I think request.meta['robotstxt_user_agent'] should be enough here; no need for ROBOTSTXT_USER_AGENT.
In this order of precedence:
- request.meta['robotstxt_user_agent']
- request.headers['User-Agent']
- crawler.settings.get('USER_AGENT')
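To make the proposed precedence concrete, here is a minimal sketch of a helper that resolves the agent string used for robots.txt checks. The function name and the default_useragent parameter are hypothetical, not part of this PR; the USER_AGENT setting value is assumed to be passed in by the caller.

    def robotstxt_useragent(request, default_useragent):
        # Resolve the agent string in the order proposed above:
        #   1. request.meta['robotstxt_user_agent']
        #   2. request.headers['User-Agent']
        #   3. the USER_AGENT setting (passed in here as default_useragent)
        useragent = request.meta.get('robotstxt_user_agent')
        if not useragent:
            useragent = request.headers.get('User-Agent')
        if not useragent:
            useragent = default_useragent
        if isinstance(useragent, bytes):
            # Scrapy header values are bytes; parsers typically expect text.
            useragent = useragent.decode('utf-8', errors='replace')
        return useragent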
@kmike What are your thoughts on this? Would request.meta['robotstxt_user_agent']
be enough?
I’m now wondering whether or not it makes sense to allow per-request user agents to affect the robots.txt parser.
The only situation I can think of is one where two sites ask crawlers to specify their user agent string in different formats, and a user wants to use the same spider for both sites. But that sounds a bit bizarre.
@kmike I don’t have strong feelings against @whalebot-helmsman’s proposal. If it looks OK to you, let’s go that route.
Hey! I shouldn't have asked to fix this issue in this PR in the first place, as it is unrelated to introducing the robots.txt parser interface.
Previously Scrapy only took the USER_AGENT option into account. Now support for the User-Agent header is added as well, which in most cases will be the same as USER_AGENT anyway. This can be backwards incompatible in some cases, e.g. it may affect crawlers when people use rotating user-agent middlewares.
At the same time, I think the current change is good. It is polite to follow robots.txt with the user agent you're sending, and I think it should be the default; using a separate user agent for robots.txt is a more aggressive option, probably similar to disabling the robots.txt middleware.
I'm fine with adding both request.meta['robotstxt_user_agent']
and ROBOTSTXT_USER_AGENT
options, or adding only one of them, but I see it as a separate discussion, which shouldn't block merging of the current changes. Users can now easily override the RobotParser.allowed method and implement any logic there; we have a public, documented interface for this, so this use case looks covered as well. Not in a super-convenient way, but that can be fine, as it is an advanced use case.
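For illustration, a minimal sketch of that advanced use case, assuming the PythonRobotParser implementation added in this PR and its allowed(url, user_agent) method; the subclass name and the fixed agent string are hypothetical.

    from scrapy.robotstxt import PythonRobotParser

    class FixedAgentRobotParser(PythonRobotParser):
        # Hypothetical subclass: ignore the per-request user agent and always
        # evaluate robots.txt rules against a fixed agent string.
        ROBOTSTXT_AGENT = 'examplebot'

        def allowed(self, url, user_agent):
            return super().allowed(url, self.ROBOTSTXT_AGENT)

If I read the PR right, the ROBOTSTXT_PARSER setting can then point at such a subclass so the middleware picks it up, without any changes to the middleware itself.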
@Gallaecio Based on a suggestion from @kmike, I am thinking of creating a new issue and pull request for discussing and resolving this. Merging this pull request will allow other work, such as integrating SitemapCrawler with this interface, to commence.
I think the only remaining point from the feedback so far is how to handle user agent strings. Let's wait for additional input regarding that (cc @kmike, @whalebot-helmsman), as well as any additional feedback there might be, but the patch is looking really good!
Looks good to me, thanks @anubhavp28! Also, thanks @Gallaecio and @whalebot-helmsman for the reviews. @Gallaecio feel free to merge; I'm not merging it myself in case you want to take another look, as the PR has changed quite a lot since the approval.
@kmike Actually, please see the link in my comment above. We still need to discuss how we want to allow users to define the user agent string that the robots.txt parser uses.
@Gallaecio To me it looks like an issue that can be solved separately (#3796 (comment)); I'm still fine with merging this PR as-is.
[For Google Summer of Code 2019] This pull request is not ready for merging. Looking for reviews and suggestions. Excited for the awesome summer ahead. :)
@Gallaecio @whalebot-helmsman
Will there be any benefit to using Python's abc library for creating the BaseRobotsTxtParser class here?
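For context on that question, here is a minimal sketch of what an abc-based interface could look like. The class and method names echo the BaseRobotsTxtParser idea from the question and the interface discussed in this PR, but they are illustrative only, not the final interface.

    from abc import ABCMeta, abstractmethod

    class BaseRobotsTxtParser(metaclass=ABCMeta):
        """Illustrative abstract interface for robots.txt parsers."""

        @classmethod
        @abstractmethod
        def from_crawler(cls, crawler, robotstxt_body):
            """Build a parser from the crawler and the raw robots.txt content."""

        @abstractmethod
        def allowed(self, url, user_agent):
            """Return True if user_agent may fetch url according to robots.txt."""

The main practical benefit of abc here would be that an incomplete parser implementation fails at instantiation time with a TypeError, rather than later when a missing method is first called.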