
Adds ROBOTSTXT_USER_AGENT setting #3966

Merged: 2 commits into scrapy:master from robotstxt_useragent on Aug 28, 2019

Conversation

anubhavp28 (Contributor) commented:

Fixes #3931
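
For illustration, a minimal sketch of how the new setting might be used in a project's settings.py (the setting names come from this PR; the example values and the fallback behavior described in the comments are assumptions):

```python
# settings.py (hypothetical example)
ROBOTSTXT_OBEY = True  # keep honoring robots.txt

# New in this PR: the user agent string to match against robots.txt rules.
# Assumption: when left unset (None), the regular USER_AGENT is used instead.
ROBOTSTXT_USER_AGENT = "MyBot"

# The general crawling User-Agent, used for robots.txt matching
# only as a fallback once ROBOTSTXT_USER_AGENT is set.
USER_AGENT = "MyBot/1.0 (+https://example.com)"
```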

@anubhavp28 anubhavp28 force-pushed the robotstxt_useragent branch from cffb38a to 00fe05e Compare August 19, 2019 03:54

codecov bot commented Aug 19, 2019

Codecov Report

Merging #3966 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3966      +/-   ##
==========================================
+ Coverage   85.35%   85.39%   +0.03%     
==========================================
  Files         167      167              
  Lines        9699     9724      +25     
  Branches     1453     1456       +3     
==========================================
+ Hits         8279     8304      +25     
  Misses       1162     1162              
  Partials      258      258
Impacted Files                               Coverage Δ
scrapy/downloadermiddlewares/robotstxt.py    100% <100%> (ø) ⬆️
scrapy/settings/default_settings.py          98.7% <100%> (+0.01%) ⬆️
scrapy/core/downloader/contextfactory.py     96.66% <0%> (+0.51%) ⬆️
scrapy/robotstxt.py                          97.36% <0%> (+0.64%) ⬆️
scrapy/utils/ssl.py                          53.65% <0%> (+1.15%) ⬆️

Gallaecio (Member) left a comment:

Looks great.

I’ve left just a couple of comments regarding documentation.

@@ -1074,6 +1074,21 @@ implementing the methods described below.
 .. autoclass:: RobotParser
    :members:
 
+RobotsTxtMiddleware Settings

Gallaecio (Member):

The setting should only be described in one page, probably the settings page. On this page you can instead mention that the user agent the middleware uses may be overridden with this setting, providing a link to the settings page entry. See how ROBOTSTXT_OBEY is referenced on this page while its documentation lives in the settings page only.

@@ -1409,7 +1421,9 @@ USER_AGENT
 
 Default: ``"Scrapy/VERSION (+https://scrapy.org)"``
 
-The default User-Agent to use when crawling, unless overridden.
+The default User-Agent to use when crawling, unless overridden. This user agent is
+also used in robots.txt if :setting:`ROBOTSTXT_USER_AGENT` setting is ``None`` and
+there is no overriding User-Agent header specified for the request.
Gallaecio (Member):

Instead of “in robots.txt”, I would say “by RobotsTxtMiddleware”, with a link to the middleware documentation.
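
To make the fallback order under discussion concrete, here is a minimal sketch of how the user agent matched against robots.txt could be resolved (assumed logic pieced together from the setting descriptions above, not the literal code in scrapy/downloadermiddlewares/robotstxt.py):

```python
def robotstxt_useragent(settings: dict, request_headers: dict) -> str:
    """Hypothetical resolution order for the user agent used in
    robots.txt matching, based on the settings described in this PR."""
    # 1. The new ROBOTSTXT_USER_AGENT setting wins when it is set.
    ua = settings.get("ROBOTSTXT_USER_AGENT")
    if ua is not None:
        return ua
    # 2. Otherwise fall back to the request's own User-Agent header,
    #    then to the global USER_AGENT setting.
    return request_headers.get("User-Agent") or settings.get("USER_AGENT", "")


# Usage example:
print(robotstxt_useragent({"ROBOTSTXT_USER_AGENT": "MyBot"}, {}))       # MyBot
print(robotstxt_useragent({"USER_AGENT": "Scrapy/VERSION"}, {}))        # Scrapy/VERSION
```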

Gallaecio (Member) left a comment:

Nice feature!

kmike merged commit ede9147 into scrapy:master on Aug 28, 2019

kmike (Member) commented Aug 28, 2019

Thanks @anubhavp28!

Linked issue: Using separate user agent for robots.txt (#3931)

3 participants