
[MRG+1] Enable robots.txt handling by default for new projects. #1724

Merged
merged 1 commit into from Jan 26, 2016

Conversation

@kmike
Member

@kmike kmike commented Jan 26, 2016

A proposed fix for #1668.
For backwards compatibility reasons the default value is not changed: settings for existing projects and settings used with CrawlerProcess won't change.
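A sketch of what the change means for newly generated projects (the file and bot name here are illustrative; the real template ships inside Scrapy's project templates):

```python
# Sketch of the settings.py generated for *new* projects after this PR.
# BOT_NAME is a hypothetical example; only the ROBOTSTXT_OBEY line is the
# behaviour this PR adds to the template.

BOT_NAME = 'myproject'

# Obey robots.txt rules (enabled by default for newly generated projects;
# the library-wide default remains False for backwards compatibility)
ROBOTSTXT_OBEY = True
```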

@codecov-io

@codecov-io codecov-io commented Jan 26, 2016

Current coverage is 83.34%

Merging #1724 into master will increase coverage by 0.08% as of fad8e42

Powered by Codecov. Updated on successful CI builds.

@eliasdorneles
Member

@eliasdorneles eliasdorneles commented Jan 26, 2016

Neat idea.
It won't help when running with scrapy runspider, though.
What about adding a deprecation warning?

@kmike
Member Author

@kmike kmike commented Jan 26, 2016

@eliasdorneles do you mean a deprecation warning when ROBOTSTXT_OBEY is not set? That could be inconvenient for CrawlerProcess users, because CrawlerProcess() would then raise warnings by default.

Changing scrapy runspider is backwards incompatible; to do a deprecation cycle we would have to require users to pass -s ROBOTSTXT_OBEY=1 on each runspider call, which makes the API much worse.

What about changing the default values in a future Scrapy 2.0?
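The compatibility situation described above can be sketched without importing Scrapy, by mimicking how later settings layers override earlier ones (the dicts and helper below are illustrative, not Scrapy's actual settings machinery):

```python
# Why only *new projects* change: the library-wide defaults ship with Scrapy,
# the project template only affects freshly generated settings.py files, and
# per-run -s flags sit on top of everything.

DEFAULTS = {'ROBOTSTXT_OBEY': False}          # library default (unchanged by this PR)
PROJECT_TEMPLATE = {'ROBOTSTXT_OBEY': True}   # written into new projects' settings.py

def effective(project_settings=None, cmdline=None):
    """Merge settings layers; later layers win, mimicking settings priorities."""
    merged = dict(DEFAULTS)
    merged.update(project_settings or {})
    merged.update(cmdline or {})
    return merged

# CrawlerProcess or runspider with no project: old behaviour preserved
assert effective()['ROBOTSTXT_OBEY'] is False
# Newly generated project: robots.txt is obeyed
assert effective(PROJECT_TEMPLATE)['ROBOTSTXT_OBEY'] is True
# runspider opt-in, e.g.: scrapy runspider spider.py -s ROBOTSTXT_OBEY=1
assert effective(cmdline={'ROBOTSTXT_OBEY': True})['ROBOTSTXT_OBEY'] is True
```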

@eliasdorneles
Member

@eliasdorneles eliasdorneles commented Jan 26, 2016

Hmm, right.
Agreed, this is good enough for most purposes and for Scrapy 2.0 we can break compatibility then. 👍

@eliasdorneles eliasdorneles changed the title Enable robots.txt handling by default for new projects. [MRG+1] Enable robots.txt handling by default for new projects. Jan 26, 2016
@redapple
Contributor

@redapple redapple commented Jan 26, 2016

👍

redapple added a commit that referenced this pull request Jan 26, 2016
[MRG+1] Enable robots.txt handling by default for new projects.
@redapple redapple merged commit 0d368c5 into master Jan 26, 2016
2 checks passed
codecov/patch: coverage not affected
continuous-integration/travis-ci/pr: The Travis CI build passed
@redapple
Contributor

@redapple redapple commented Jan 26, 2016

I also agree that we can make it the default setting in v2.0.

@Digenis
Member

@Digenis Digenis commented Jan 26, 2016

I have the impression that I suggested this already but can't find it.
How about checking whether the user agent was overridden?
Anyone overriding it is probably familiar with the implications of disregarding robots.txt.
Those who do could get some extra release cycles with deprecation warnings
before the default behaviour changes.
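A minimal sketch of the check @Digenis proposes (this was never implemented; the function, the default user-agent string, and the plain-dict settings are all assumptions for illustration):

```python
# Hypothetical check: warn users who customize USER_AGENT but leave
# ROBOTSTXT_OBEY unset, giving them a deprecation cycle before the
# default flips. Not part of Scrapy; a sketch of the idea only.
import warnings

DEFAULT_USER_AGENT = 'Scrapy (+https://scrapy.org)'  # assumed default value

def maybe_warn_robotstxt(settings):
    """Warn when USER_AGENT is overridden but ROBOTSTXT_OBEY is left unset."""
    ua_overridden = settings.get('USER_AGENT', DEFAULT_USER_AGENT) != DEFAULT_USER_AGENT
    if ua_overridden and 'ROBOTSTXT_OBEY' not in settings:
        warnings.warn(
            "USER_AGENT is overridden but ROBOTSTXT_OBEY is not set; "
            "its default may change in a future release",
            DeprecationWarning,
        )
```

Anyone who sets ROBOTSTXT_OBEY explicitly, either way, would see no warning.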

@kmike
Member Author

@kmike kmike commented Jan 26, 2016

@Digenis do you suggest enabling or disabling robots.txt handling when the user agent is overridden?

We encourage users to override the user agent in the generated settings.py (near the top), and provide an example: '$project_name (+http://www.yourdomain.com)'. Users who follow this suggestion will likely want robots.txt handling enabled by default as well. On the other hand, some users override the user agent with browser-like values; they are more likely to want robots.txt handling disabled. I'm not sure it is a good idea to tie robots.txt handling to user-agent overrides: it may be surprising and unwelcome regardless of which default we choose.

@Digenis
Member

@Digenis Digenis commented Jan 26, 2016

Not permanently tied to robots.txt handling, just for an extra release of deprecation warnings.
I'd also find such unpredictable configuration unwelcome (though I do look at warnings),
but it crossed my mind because Scrapy is about to embrace dynamic configuration
(if I understood the add-ons PR correctly).

@kmike kmike deleted the robotstxt-default branch Jan 27, 2016
5 participants