Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use protego as a default robots.txt parser #4006

Merged
merged 17 commits into from Oct 1, 2019

Conversation

whalebot-helmsman
Copy link
Contributor

@whalebot-helmsman whalebot-helmsman commented Sep 9, 2019

Implementation of proposal #3969

@codecov
Copy link

@codecov codecov bot commented Sep 9, 2019

Codecov Report

Merging #4006 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4006      +/-   ##
==========================================
+ Coverage   85.43%   85.52%   +0.08%     
==========================================
  Files         167      167              
  Lines        9732     9741       +9     
  Branches     1456     1461       +5     
==========================================
+ Hits         8315     8331      +16     
+ Misses       1159     1153       -6     
+ Partials      258      257       -1
Impacted Files Coverage Δ
scrapy/settings/default_settings.py 98.7% <100%> (ø) ⬆️
scrapy/logformatter.py 90.47% <0%> (ø) ⬆️
scrapy/core/engine.py 92.79% <0%> (+0.03%) ⬆️
scrapy/pipelines/files.py 66.16% <0%> (+0.78%) ⬆️
scrapy/core/scraper.py 89.33% <0%> (+2.84%) ⬆️
scrapy/extensions/corestats.py 100% <0%> (+10%) ⬆️

Copy link
Member

@elacuesta elacuesta left a comment

Thanks! Looks good to me, I just left a couple of minor comments.

docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
requirements-py2.txt Outdated Show resolved Hide resolved
whalebot-helmsman and others added 3 commits Sep 9, 2019
Co-Authored-By: elacuesta <elacuesta@users.noreply.github.com>
Co-Authored-By: elacuesta <elacuesta@users.noreply.github.com>
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
Co-Authored-By: Mikhail Korobov <kmike84@gmail.com>
Copy link
Member

@Gallaecio Gallaecio left a comment

I see a few minor style issues with the documentation changes. Mainly, inconsistent use of capitalization (paragraphs that start with lowercase, lists where some entries are capitalized while others are not), inconsistent ending of list entries (lists where some entries end in a comma), etc.

I’m thinking it may be better to keep documentation changes here to the minimum here (e.g. just change which parser is the default), and propose a separate pull request which works on improving the documentation of the different parsers.

docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
whalebot-helmsman and others added 2 commits Sep 17, 2019
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
@kmike
Copy link
Member

@kmike kmike commented Sep 17, 2019

Hey! The PR looks fine to me.

I'm on fence on it, but maybe it can be nice to review protego API a bit before making it default, to avoid churn in future - e.g. in Python's default parser (in upcoming 3.8) there is .site_maps() method, while in protego there is .sitemaps property. On the other hand, it doesn't affect Scrapy directly, so maybe it is not needed.

@whalebot-helmsman
Copy link
Contributor Author

@whalebot-helmsman whalebot-helmsman commented Sep 18, 2019

review protego API a bit before making it default, to avoid churn in future - e.g. in Python's default parser (in upcoming 3.8) there is .site_maps() method, while in protego there is .sitemaps property.

I see no problem here. We don't directly use parsers' API in scrapy. Every parser should provide proper adapter conforming common interface. We tied to interface's API, not parser's API.

@kmike kmike added this to the v1.8 milestone Oct 1, 2019
@kmike
Copy link
Member

@kmike kmike commented Oct 1, 2019

Ok, let's do it!

@kmike kmike merged commit 74b4a5c into scrapy:master Oct 1, 2019
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants