Indexing: Crawl from web feed (RSS/Atom) #54

@m-i-l

Split from #34.

In its current form, the indexer will pick up further links to index from a sitemap and/or RSS feed if it finds one, but the chances are that it won't, because sites often don't link to their sitemap or feed from any page.

This would need a new field in the database, and corresponding updates to the admin interface, to allow site owners to specify their sitemap and/or RSS feed. Scrapy also has separate SitemapSpider and XMLFeedSpider classes that could potentially be used.
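
To illustrate what reading links from an owner-supplied feed involves, here is a minimal sketch using only the Python standard library. It is not the indexer's code and does not use Scrapy's XMLFeedSpider; the function name `extract_feed_links` is hypothetical, and it only handles RSS 2.0 and Atom (not RSS 1.0/RDF).

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace; RSS 2.0 elements have none.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_links(feed_xml: str) -> list[str]:
    """Return the page URLs of the items/entries in an RSS 2.0 or Atom feed.

    Hypothetical helper for illustration; a real indexer would also need to
    fetch the feed, handle malformed XML, and resolve relative URLs.
    """
    root = ET.fromstring(feed_xml)
    links = []

    # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())

    # Atom: <feed><entry><link href="URL" rel="alternate"/></entry></feed>
    for entry in root.iter(f"{ATOM_NS}entry"):
        for link in entry.findall(f"{ATOM_NS}link"):
            # A missing rel attribute means rel="alternate" (the page itself).
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break  # one page URL per entry is enough

    return links
```

The returned URLs could then be handed to the crawler as additional start URLs, alongside whatever it discovers by following links.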

Not sure how much more efficient this would make the indexing, but it could make it more predictable and targeted, i.e. make it more likely that the important pages are indexed before the indexing limits or timeout are hit.
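
As a rough sketch of the "important pages first" idea, the snippet below orders a crawl queue so that links taken from the sitemap or feed come before links discovered while crawling, and stops at a page limit. The function name `build_crawl_queue` and the `page_limit` parameter are hypothetical and only stand in for whatever limits the indexer actually applies.

```python
from itertools import chain
from typing import Iterable


def build_crawl_queue(
    feed_links: Iterable[str],
    discovered_links: Iterable[str],
    page_limit: int,
) -> list[str]:
    """Return up to page_limit unique URLs, feed/sitemap links first.

    Hypothetical helper: it shows the ordering only, not real scheduling.
    """
    seen = set()
    queue = []
    # Feed/sitemap links are visited first, so they win when the limit bites.
    for url in chain(feed_links, discovered_links):
        if len(queue) >= page_limit:
            break
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue


# With a limit of 3, both feed links are kept and only one discovered link fits.
queue = build_crawl_queue(
    ["https://example.com/a", "https://example.com/b"],
    ["https://example.com/c", "https://example.com/a", "https://example.com/d"],
    page_limit=3,
)
```

Whether this ordering helps in practice depends on how well a site's feed or sitemap reflects its important pages, which is the open question above.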

Labels: enhancement (New feature or request)
