
[MRG+1] [GSoC 2019] Interface for robots.txt parsers #3796


Merged
23 commits merged on Aug 2, 2019

Conversation

anubhavp28
Contributor

@anubhavp28 anubhavp28 commented May 27, 2019

[For Google Summer of Code 2019] This pull request is not ready for merging. Looking for reviews and suggestions. Excited for the awesome summer ahead. :)

@Gallaecio @whalebot-helmsman

Will there be any benefit from using Python's abc library for creating the BaseRobotsTxtParser class here?

@codecov

codecov bot commented May 27, 2019

Codecov Report

Merging #3796 into master will decrease coverage by 2.78%.
The diff coverage is 79.31%.

@@            Coverage Diff             @@
##           master    #3796      +/-   ##
==========================================
- Coverage   85.42%   82.64%   -2.79%     
==========================================
  Files         169      170       +1     
  Lines        9635     9664      +29     
  Branches     1433     1433              
==========================================
- Hits         8231     7987     -244     
- Misses       1156     1419     +263     
- Partials      248      258      +10
Impacted Files Coverage Δ
scrapy/extensions/robotstxtparser.py 79.31% <79.31%> (ø)
scrapy/linkextractors/sgml.py 0% <0%> (-96.81%) ⬇️
scrapy/linkextractors/regex.py 0% <0%> (-95.66%) ⬇️
scrapy/linkextractors/htmlparser.py 0% <0%> (-92.07%) ⬇️
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) ⬇️
scrapy/extensions/statsmailer.py 0% <0%> (-30.44%) ⬇️
scrapy/utils/boto.py 46.66% <0%> (-26.67%) ⬇️
scrapy/_monkeypatches.py 42.85% <0%> (-14.29%) ⬇️
scrapy/link.py 86.36% <0%> (-13.64%) ⬇️
scrapy/core/downloader/tls.py 77.5% <0%> (-12.5%) ⬇️
... and 14 more

@codecov

codecov bot commented May 27, 2019

Codecov Report

Merging #3796 into master will increase coverage by 0.2%.
The diff coverage is 97.26%.

@@            Coverage Diff            @@
##           master    #3796     +/-   ##
=========================================
+ Coverage   85.42%   85.62%   +0.2%     
=========================================
  Files         169      165      -4     
  Lines        9635     9624     -11     
  Branches     1433     1434      +1     
=========================================
+ Hits         8231     8241     +10     
+ Misses       1156     1135     -21     
  Partials      248      248
Impacted Files Coverage Δ
scrapy/settings/default_settings.py 98.67% <ø> (+0.02%) ⬆️
scrapy/downloadermiddlewares/robotstxt.py 100% <100%> (ø) ⬆️
scrapy/robotstxt.py 96.72% <96.72%> (ø)
scrapy/utils/decorators.py 80% <0%> (-8%) ⬇️
scrapy/crawler.py 89.94% <0%> (-2.46%) ⬇️
scrapy/pipelines/files.py 66.53% <0%> (-1.71%) ⬇️
scrapy/extensions/feedexport.py 83.48% <0%> (-1.42%) ⬇️
scrapy/commands/parse.py 20.32% <0%> (-0.73%) ⬇️
scrapy/spiders/crawl.py 81.7% <0%> (-0.65%) ⬇️
scrapy/http/request/form.py 95.41% <0%> (-0.62%) ⬇️
... and 35 more

@Gallaecio
Member

Will there be any benefit from using Python's abc library for creating the BaseRobotsTxtParser class here?

https://stackoverflow.com/a/19328146/939364 looks like a good read on the subject. I think we should refrain from using them unless we find a good reason to do so.
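For reference, a minimal sketch of what an abc-based interface could look like; the class names and the stdlib-backed implementation below are placeholders to illustrate the pattern (Python 3 imports used for brevity), not necessarily the API this PR ends up with:

```python
# Illustrative sketch only: BaseRobotsTxtParser and PythonRobotParser are
# placeholder names, not necessarily the final API of this PR.
from abc import ABCMeta, abstractmethod
from urllib.robotparser import RobotFileParser


class BaseRobotsTxtParser(metaclass=ABCMeta):
    """Interface that every robots.txt parser wrapper must implement."""

    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if user_agent is allowed to fetch url."""


class PythonRobotParser(BaseRobotsTxtParser):
    """Wrapper around the standard library's urllib.robotparser."""

    def __init__(self, robotstxt_body):
        self._parser = RobotFileParser()
        self._parser.parse(robotstxt_body.splitlines())

    def allowed(self, url, user_agent):
        return self._parser.can_fetch(user_agent, url)
```

The main thing abc buys here is that instantiating an incomplete subclass fails early with a TypeError, instead of failing later with a NotImplementedError at call time.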

Member

@Gallaecio Gallaecio left a comment


I’ve dropped a few questions regarding the proposed API. Bear in mind that I’m barely familiar with robots.txt at this point, so there may be stupid questions :)

@anubhavp28
Contributor Author

I am thinking that we won't want reppy and Robotexclusionrulesparser as Scrapy dependencies, so we have no guarantee that these modules are available. Therefore, inside the interface classes we need to perform some kind of check for the availability of these modules. How should we handle a situation where the user sets the ROBOTSTXT_PARSER setting to the reppy interface class but does not have reppy installed?

@whalebot-helmsman
Contributor

How should we handle a situation where the user sets the ROBOTSTXT_PARSER setting to the reppy interface class but does not have reppy installed?

I think it is common in Scrapy to raise an import exception when some class is missing (e.g. in middlewares or queues). Nothing special is required here.

@Gallaecio
Member

Gallaecio commented May 31, 2019

To elaborate a bit more, you could use a try … except block to import the modules that require additional dependencies and, as @whalebot-helmsman suggests, raise a specific exception from the constructor if those imports did not work, similar to how it's done with middlewares when they are not enabled in the configuration.

The robots.txt middleware, which takes care of creating the parser instance, should catch this exception, log an error and possibly fall back to the default parser class.
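A rough sketch of the pattern being suggested; the names, the default parser path and the exception handling are assumptions for illustration, not the code of this PR, and the reppy calls are an assumption to be checked against that library's documentation:

```python
# Sketch only: ReppyRobotParser, make_parser and the default parser path are
# illustrative; the reppy API usage below is an assumption to be verified.
import logging

from scrapy.utils.misc import load_object

logger = logging.getLogger(__name__)


class ReppyRobotParser:
    def __init__(self, robotstxt_body):
        # Lazy import: reppy is only required when this backend is selected.
        # If reppy is not installed, this raises ImportError, which the
        # middleware can catch.
        from reppy.robots import Robots
        self._rp = Robots.parse('', robotstxt_body)

    def allowed(self, url, user_agent):
        return self._rp.allowed(url, user_agent)


def make_parser(settings, robotstxt_body):
    """Middleware-side helper: build the configured parser, falling back to
    the default parser if an optional dependency is missing."""
    parser_path = settings.get('ROBOTSTXT_PARSER')
    try:
        return load_object(parser_path)(robotstxt_body)
    except ImportError:
        logger.error('%s could not be instantiated (missing dependency?); '
                     'falling back to the default robots.txt parser.',
                     parser_path)
        return load_object('scrapy.robotstxt.PythonRobotParser')(robotstxt_body)
```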

Member

@Gallaecio Gallaecio left a comment


Looking great!

@anubhavp28
Contributor Author

Is there any particular reason why we are using the unittest module from twisted.trial rather than Python's built-in unittest module? twisted.trial.unittest seems to lack functionality such as expectedFailure (https://twistedmatrix.com/trac/ticket/6336), which we could use to reduce code duplication.

@Gallaecio
Member

Is there any particular reason why we are using the unittest module from twisted.trial rather than Python's built-in unittest module? twisted.trial.unittest seems to lack functionality such as expectedFailure (https://twistedmatrix.com/trac/ticket/6336), which we could use to reduce code duplication.

I'm guessing that it has something to do with Twisted's asynchronous nature, and that some tests may not work otherwise. But I don't know for sure.
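For context, this is the stdlib feature being referred to; a minimal, self-contained example rather than test code from this PR:

```python
import unittest


class ParserLimitationTest(unittest.TestCase):
    # A known shortcoming of a parser backend can be marked with
    # expectedFailure instead of duplicating the test with inverted assertions.
    @unittest.expectedFailure
    def test_known_limitation(self):
        self.assertTrue(False)  # stands in for a check the backend cannot pass yet


if __name__ == '__main__':
    unittest.main()
```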

@@ -28,6 +29,7 @@ def __init__(self, crawler):
self.crawler = crawler
self._useragent = crawler.settings.get('USER_AGENT')
Member


I know this is how it worked before, and this is a separate issue, but using crawler.settings.get('USER_AGENT') doesn't feel right: the user agent middleware can be disabled, and users may also override the user agent per request.

Member

@Gallaecio Gallaecio Jul 10, 2019


@anubhavp28 has made it so that per-request user agent strings are respected.

As for the user agent middleware, I think it makes sense for the robotstxt middleware to rely on the same setting and the request header regardless of whether or not the middleware is enabled. I think this offers the most intuitive behavior when the robotstxt middleware is enabled, regardless of whether or not the user agent middleware is enabled as well.

@anubhavp28: we do need to update the documentation of the USER_AGENT setting to make it clear that both the user-agent middleware and this robotstxt middleware use it, and how (i.e. how it can be overridden in each case).

There is one use case I can think of that we are not covering with the current approach: a user wants to send one user-agent header value but have a different user agent string evaluated to decide which pages to crawl.

To support this, I think we could implement a ROBOTSTXT_USER_AGENT setting with a higher priority than both the USER_AGENT setting and the header of a specific request, and a robotstxt_user_agent meta key that allows overriding the ROBOTSTXT_USER_AGENT setting for specific requests.

@kmike @whalebot-helmsman Thoughts?

Contributor

@whalebot-helmsman whalebot-helmsman Jul 12, 2019


We already have crawler.settings.get('USER_AGENT') and request.headers['User-Agent']; we should support both and add corresponding documentation.

To support this, I think we could implement a ROBOTSTXT_USER_AGENT setting with a higher priority than both the USER_AGENT setting and the header of a specific request, and a robotstxt_user_agent meta key that allows overriding the ROBOTSTXT_USER_AGENT setting for specific requests.

I think request.meta['robotstxt_user_agent'] should be enough here; no need for ROBOTSTXT_USER_AGENT.

In the following order of precedence (see the sketch after this list):

  • request.meta['robotstxt_user_agent']
  • request.headers['User-Agent']
  • crawler.settings.get('USER_AGENT')
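A minimal sketch of that lookup order; the helper name is hypothetical and robotstxt_user_agent is only a proposed meta key at this point:

```python
# Hypothetical helper illustrating the proposed precedence; not code from
# this PR, and robotstxt_user_agent is only a proposed meta key.
def robotstxt_useragent(request, crawler):
    # Header values are bytes in Scrapy; decoding is omitted for brevity.
    return (
        request.meta.get('robotstxt_user_agent')
        or request.headers.get('User-Agent')
        or crawler.settings.get('USER_AGENT')
    )
```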

Contributor Author


@kmike What are your thoughts on this? Would request.meta['robotstxt_user_agent'] be enough?

Member


I’m now wondering whether or not it makes sense to allow per-request user agents to affect the robots.txt parser.

The only situation I can think of is one where two sites ask crawlers to specify their user agent string in a different format, and a user wants to use the same spider for both sites. But that sounds a bit bizarre.

Member


@kmike I don't have strong feelings against @whalebot-helmsman's proposal. If it looks OK to you, let's go that route.

Member


Hey! I shouldn't have asked to fix this issue in this PR in the first place, as it is unrelated to introducing the robots.txt parser interface.

Previously, Scrapy only took the USER_AGENT option into account. Now support for the User-Agent header is added as well, which in most cases will be the same as USER_AGENT anyway. This can be backwards incompatible in some cases, e.g. it may affect crawlers when people use rotating user-agent middlewares.

At the same time, I think the current change is good. It is polite to follow robots.txt with the user agent you're sending, and I think it should be the default; using a separate user agent for robots.txt is a more aggressive option, probably similar to disabling the robots.txt middleware.

I'm fine with adding both request.meta['robotstxt_user_agent'] and ROBOTSTXT_USER_AGENT options, or adding only one of them, but I see that as a separate discussion, which shouldn't block merging of the current changes. Users can now easily override the RobotParser.allowed method and implement any logic there; we have a public, documented interface for this, so this use case looks covered as well: not in a super-convenient way, but that can be fine, as it is an advanced use case.
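For instance, a user who wants robots.txt evaluated against a fixed user agent string, regardless of the one being sent, could do something along these lines; the import path and constructor of PythonRobotParser are assumptions here, and the custom class would then be referenced from the ROBOTSTXT_PARSER setting:

```python
# Sketch of the override described above; treat the import path, constructor
# and the ROBOTS_USER_AGENT value as assumptions for illustration.
from scrapy.robotstxt import PythonRobotParser


class FixedUserAgentRobotParser(PythonRobotParser):
    ROBOTS_USER_AGENT = 'mybot'  # hypothetical, crawler-specific value

    def allowed(self, url, user_agent):
        # Ignore the user agent supplied by the middleware and always check
        # robots.txt rules against ROBOTS_USER_AGENT instead.
        return super().allowed(url, self.ROBOTS_USER_AGENT)
```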

Contributor Author


@Gallaecio Based on the suggestion from @kmike, I am thinking of creating a new issue and pull request for discussing and resolving this. Merging this pull request will allow other work, such as integrating SitemapCrawler with this interface, to commence.

@Gallaecio
Member

Gallaecio commented Jul 10, 2019

I think the only remaining point from the feedback so far is how to handle user agent strings. Let's wait for additional input regarding that (cc @kmike, @whalebot-helmsman), as well as any additional feedback there might be, but the patch is looking really good!

Member

@kmike kmike left a comment


Looks good to me, thanks @anubhavp28! Also, thanks @Gallaecio and @whalebot-helmsman for the reviews. @Gallaecio feel free to merge; I'm not merging it myself in case you wanted to take another look, as the PR has changed quite a lot since the approval.

Member

@Gallaecio Gallaecio left a comment


@kmike Actually, please see the link in my comment above. We still need to discuss how we want to allow users to define the user agent string that the robots.txt parser uses.

@kmike
Member

kmike commented Jul 31, 2019

@Gallaecio To me it looks like an issue that can be solved separately (#3796 (comment)); I'm still fine with merging this PR as-is.
