[MRG+1] Add sitemap_filter function to SitemapSpider class #3512
    @@            Coverage Diff             @@
    ##           master    #3512      +/-   ##
    ==========================================
    + Coverage   84.44%   84.48%   +0.03%
    ==========================================
      Files         167      167
      Lines        9401     9405       +4
      Branches     1396     1397       +1
    ==========================================
    + Hits         7939     7946       +7
    + Misses       1204     1201       -3
      Partials      258      258
@kmike, thanks for your contribution here.
I think this is already documented in the
Given the context, it's clear to me that entries come from the sitemap document and that their keys and values depend on the document structure itself. Explaining the sitemap XML format here doesn't seem to be in scope.
Of course my opinion may be biased and I'd love to hear some suggestions from you guys.
@victor-torres I agree that documenting sitemap XML doesn't make sense, though entries we return don't match attributes exactly - see
For example, namespaces seem to be removed, there is an extra "alternate" key, and rows without "loc" are dropped - this is what I think is nice to have documented.
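To make those three differences concrete, here is a rough standard-library sketch of how a sitemap document could end up as a list of entry dicts. It is an illustration only, not Scrapy's actual parsing code: `SAMPLE`, `local_name` and `iter_entries` are hypothetical names, and the handling of `xhtml:link` elements is an assumption about what the "alternate" key holds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one complete <url> row, one row without <loc>.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://example.com/en</loc>
    <lastmod>2019-01-15</lastmod>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
  </url>
  <url>
    <lastmod>2019-01-16</lastmod>
  </url>
</urlset>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(xml_text):
    """Yield one dict per row, mimicking the behaviour described above."""
    root = ET.fromstring(xml_text)
    for node in root:
        entry = {}
        for child in node:
            name = local_name(child.tag)
            if name == "link":
                # Assumption: alternate-language links are collected under
                # an extra "alternate" key instead of a "link" key.
                href = child.get("href")
                if href:
                    entry.setdefault("alternate", []).append(href)
            else:
                entry[name] = child.text.strip() if child.text else ""
        # Rows without a "loc" are dropped.
        if "loc" in entry:
            yield entry


entries = list(iter_entries(SAMPLE))
```

With this sample, the second row is skipped because it has no `loc`, and the surviving entry has plain keys (`loc`, `lastmod`) plus the extra `alternate` list, which is the kind of detail the docs could spell out.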
It makes it possible to filter sitemap URLs by any available attribute. For example, you can keep only URLs whose lastmod is later than a given datetime. This can be helpful when the URL in loc does not carry that information itself.
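A minimal sketch of the lastmod use case, written as a plain generator so it runs without Scrapy installed. It assumes that entries are dicts with an optional `lastmod` key in `YYYY-MM-DD` form; `filter_by_lastmod`, `sample_entries` and the date format are hypothetical choices for illustration, not part of this PR.

```python
from datetime import datetime


def filter_by_lastmod(entries, cutoff, date_format="%Y-%m-%d"):
    """Yield only the entries whose lastmod is strictly later than cutoff."""
    for entry in entries:
        lastmod = entry.get("lastmod")
        if not lastmod:
            # Assumption: entries without a lastmod are skipped.
            continue
        try:
            modified = datetime.strptime(lastmod, date_format)
        except ValueError:
            # Assumption: entries with an unparseable date are skipped too.
            continue
        if modified > cutoff:
            yield entry


# Hypothetical entries shaped like the ones discussed in this thread.
sample_entries = [
    {"loc": "http://example.com/old", "lastmod": "2018-06-01"},
    {"loc": "http://example.com/new", "lastmod": "2019-02-10"},
    {"loc": "http://example.com/undated"},
    {"loc": "http://example.com/odd", "lastmod": "last week"},
]

recent = list(filter_by_lastmod(sample_entries, datetime(2019, 1, 1)))
```

In a spider, the same loop would presumably live in an overridden `sitemap_filter(self, entries)` method on a `SitemapSpider` subclass, assuming the hook receives the entries and yields the ones to follow. Whether missing or malformed dates should be skipped or kept is a design choice left to the spider author.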