Modular sitemap spider #3543

Open
wants to merge 3 commits into master

@stranac stranac commented Dec 22, 2018

Currently, the structure of SitemapSpider makes extracting additional data from a sitemap unnecessarily complicated.
Doing so requires copy-pasting the 22-line _parse_sitemap method and then processing the created sitemap in a for loop.

This PR splits _parse_sitemap functionality into 3 additional methods, making this kind of processing simpler.
It also moves the link extraction inside the spider.
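
To illustrate the kind of split described above, here is a minimal, stdlib-only sketch. It is not Scrapy's code and not the code in this PR: the class names and the `iter_entries`, `should_follow`, and `process_entry` hooks are hypothetical, chosen only to show how a subclass could pull extra data (here `lastmod`) out of a sitemap by overriding small methods instead of copying the whole parsing routine.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2018-12-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2018-12-20</lastmod></url>
</urlset>"""


class SketchSitemapParser:
    """Stand-in for a spider whose sitemap handling is split into hooks.

    All method names are hypothetical; they are not Scrapy's API.
    """

    def parse_sitemap(self, body):
        # Orchestrator: stays short because each step can be overridden.
        for entry in self.iter_entries(body):
            if self.should_follow(entry):
                yield self.process_entry(entry)

    def iter_entries(self, body):
        # Yield one dict per <url> element, keyed by the child tag name
        # with the XML namespace prefix stripped.
        root = ET.fromstring(body)
        for url in root.findall("sm:url", NS):
            yield {
                child.tag.split("}", 1)[-1]: (child.text or "").strip()
                for child in url
            }

    def should_follow(self, entry):
        # Default: follow every entry.
        return True

    def process_entry(self, entry):
        # Default: only the URL is passed on.
        return entry["loc"]


class LastmodParser(SketchSitemapParser):
    """Subclass that uses extra sitemap data without copying parse_sitemap."""

    def should_follow(self, entry):
        # ISO dates compare correctly as strings.
        return entry.get("lastmod", "") >= "2018-12-15"

    def process_entry(self, entry):
        return (entry["loc"], entry["lastmod"])


results = list(LastmodParser().parse_sitemap(SAMPLE))
```

With the sample document, the base class yields both URLs, while `LastmodParser` yields only the entry modified on or after the cutoff, together with its `lastmod` value. In a real spider, the processing hook would presumably build requests rather than return tuples.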

I also added tests for some of the previously untested SitemapSpider functionality.

codecov bot commented Dec 22, 2018

Codecov Report

Merging #3543 into master will increase coverage by 0.04%.
The diff coverage is 84.61%.

@@            Coverage Diff             @@
##           master    #3543      +/-   ##
==========================================
+ Coverage   84.38%   84.43%   +0.04%     
==========================================
  Files         167      167              
  Lines        9376     9385       +9     
  Branches     1392     1393       +1     
==========================================
+ Hits         7912     7924      +12     
+ Misses       1206     1204       -2     
+ Partials      258      257       -1

| Impacted Files | Coverage Δ |
|---|---|
| scrapy/spiders/sitemap.py | 82.6% <84.61%> (+7.6%) ⬆️ |
