Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Add sitemap_filter function to SitemapSpider class #3512

Merged
merged 11 commits into from Dec 28, 2018

Conversation

@victor-torres
Copy link
Contributor

@victor-torres victor-torres commented Nov 29, 2018

it makes it possible to filter sitemap urls by any available attribute

for example, you can filter urls with lastmod greater than a given datetime

it can be helpful when the url loc itself does not aggregate that information

@codecov
Copy link

@codecov codecov bot commented Nov 29, 2018

Codecov Report

Merging #3512 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3512      +/-   ##
==========================================
+ Coverage   84.44%   84.48%   +0.03%     
==========================================
  Files         167      167              
  Lines        9401     9405       +4     
  Branches     1396     1397       +1     
==========================================
+ Hits         7939     7946       +7     
+ Misses       1204     1201       -3     
  Partials      258      258
Impacted Files Coverage Δ
scrapy/spiders/sitemap.py 81.25% <100%> (+6.25%) ⬆️
@raphapassini
Copy link
Contributor

@raphapassini raphapassini commented Nov 30, 2018

@victor-torres seems like a nice feature. But we'll need some tests

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Nov 30, 2018

scrapy/spiders/sitemap.py Outdated Show resolved Hide resolved
@kmike
Copy link
Member

@kmike kmike commented Nov 30, 2018

Hey @victor-torres! I think this feature makes sense, and implementation is fine, but the current API won't be clear for users; it is a matter of fixing names and adding a bit more docs.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Nov 30, 2018

Should we discuss it with someone else, @kmike?

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 20, 2018

I'll rename the url param to entries as suggested by @dangra.

Still need to fix the coverage problem stated by dodecov exten.

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2018

please rebase to get travis-ci fix #3538

@victor-torres victor-torres force-pushed the victor-torres:sitemap_filter branch from 7ca5908 to 64a126f Dec 21, 2018
@dangra
Copy link
Member

@dangra dangra commented Dec 21, 2018

@victor-torres do you plan to add a test case to cover the lines that are missing coverage?

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 21, 2018

@victor-torres do you plan to add a test case to cover the lines that are missing coverage?

I'm not sure about what additional tests I should write to increase the coverage on this case.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 22, 2018

All checks have passed :)

@kmike
Copy link
Member

@kmike kmike commented Dec 26, 2018

Hey @victor-torres! The implementation and argument name looks fine, but I still think we should document what's inside a single entry - what keys/values are available, what they mean.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 26, 2018

@kmike, thanks for your contribution here.

I think this is already documented in the docs/topics/spiders.rst file when we show this example:


.. method:: sitemap_filter(entries)

    Specifies a function to filter sitemap entries and their attributes.

    For example::

        <url>
            <loc>http://example.com/</loc>
            <lastmod>2005-01-01</lastmod>
        </url>

    We can define a ``sitemap_filter`` function to filter ``entries`` by date::

        def sitemap_filter(entries):
            from datetime import datetime
            for entry in entries:
                date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
                if date_time.year >= 2005:
                    yield entry

    This would retrieve only ``entries`` modified on 2005 and the following
    years.

    If you omit this method, all entries found in sitemaps will be
    processed, observing other attributes and their settings.

Given the context, it's clear to me that entries come from the sitemap document and the keys/values will depend upon the document structure itself. Explaining the sitemap XML format here doesn't seem to be in the scope.

Of course my opinion may be biased and I'd love to hear some suggestions from you guys.

@kmike
Copy link
Member

@kmike kmike commented Dec 26, 2018

@victor-torres I agree that documenting sitemap XML doesn't make sense, though entries we return don't match attributes exactly - see

def __iter__(self):

For example, namespaces seem to be removed, there is an extra "alternate" key, and rows without "loc" are dropped - this is what I think is nice to have documented.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 26, 2018

@kmike I have just pushed a commit adding more information as suggested by you. Do you believe it's enough or would you like me to give more details? 68690b9

docs/topics/spiders.rst Outdated Show resolved Hide resolved
@victor-torres victor-torres force-pushed the victor-torres:sitemap_filter branch 2 times, most recently from 4ec4540 to 45325f8 Dec 27, 2018
@victor-torres victor-torres force-pushed the victor-torres:sitemap_filter branch from 45325f8 to e1597f7 Dec 27, 2018
@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 27, 2018

Force pushed twice because I rebased from my fork instead of this repo. Hope the rebase fixes the tests.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 27, 2018

Tests are passing, comments were addressed. Anything missing, @dangra @kmike @raphapassini?

@dangra
dangra approved these changes Dec 27, 2018
@dangra dangra changed the title Add sitemap_filter attribute to SitemapSpider class [MRG+1] Add sitemap_filter attribute to SitemapSpider class Dec 27, 2018
docs/topics/spiders.rst Outdated Show resolved Hide resolved
docs/topics/spiders.rst Outdated Show resolved Hide resolved
@victor-torres victor-torres changed the title [MRG+1] Add sitemap_filter attribute to SitemapSpider class [MRG+1] Add sitemap_filter function to SitemapSpider class Dec 27, 2018
@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Dec 27, 2018

Done @kmike

docs/topics/spiders.rst Outdated Show resolved Hide resolved
docs/topics/spiders.rst Outdated Show resolved Hide resolved
@kmike kmike merged commit 094dde6 into scrapy:master Dec 28, 2018
3 checks passed
3 checks passed
@codecov
codecov/patch 100% of diff hit (target 84.44%)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@kmike
Copy link
Member

@kmike kmike commented Dec 28, 2018

Thanks @victor-torres!

@kmike kmike added this to the v1.6 milestone Jan 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants