Indexing: Crawl from web feed (RSS/Atom) #54
Comments
Now that content type is being indexed as per #63, you can search for the RSS feeds and sitemap.xml files via https://searchmysite.net/search/?q=content_type:*/xml , or for a specific site via https://searchmysite.net/search/?q=content_type%3A*%2Fxml+%2Bdomain%3Amichael-lewis.com . It looks like it finds the RSS feeds for many/most sites (presumably because most people include links to their RSS feeds), but far fewer sitemaps (presumably because few pages link to them). That means it could be possible to auto-discover and auto-populate these values for a site. It might not be bullet-proof, but the RSS feed for a site could be found via e.g. https://searchmysite.net/search/?q=content_type%3A*%2Fxml+AND+domain%3Akudadam.com+NOT+url%3A*sitemap.xml and the sitemap for a site could be found via e.g. https://searchmysite.net/search/?q=content_type%3A*%2Fxml+AND+domain%3Akudadam.com+AND+url%3A*sitemap.xml . Having these values for a site could also be useful for #34.
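As a rough sketch of how those two discovery queries could be built programmatically: the function name `discovery_query_url` and the `kind` parameter are hypothetical (not part of the project), and only the query syntax is taken from the example URLs above.

```python
from urllib.parse import urlencode

# Search endpoint as it appears in the example URLs above.
SEARCH_URL = "https://searchmysite.net/search/"


def discovery_query_url(domain: str, kind: str) -> str:
    """Build a search URL that looks for a site's web feed or sitemap.

    Hypothetical helper: `kind` is "feed" or "sitemap". The only signal
    used to tell the two apart is whether the URL ends in sitemap.xml,
    mirroring the example queries, so it is not bullet-proof.
    """
    if kind == "sitemap":
        url_clause = "AND url:*sitemap.xml"
    elif kind == "feed":
        url_clause = "NOT url:*sitemap.xml"
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    query = f"content_type:*/xml AND domain:{domain} {url_clause}"
    # safe="*" keeps the wildcard unescaped, matching the example URLs.
    return f"{SEARCH_URL}?{urlencode({'q': query}, safe='*')}"


print(discovery_query_url("kudadam.com", "feed"))
print(discovery_query_url("kudadam.com", "sitemap"))
```

The two printed URLs should match the kudadam.com examples above.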
…stem_generated, web_feed_user_entered, sitemap_system_generated and sitemap_user_entered, in preparation for #54
…new field to indicate if a page is included in a web feed, and added support for #34 Add an incremental reindex
…t which made some things more difficult), and added the ability to edit the web feed for #54 Crawl from sitemap and web feed (RSS/Atom)
Implemented. The process is:
Note that only 7 pages have a sitemap detected. This is largely because few sites link directly to their sitemap (and the spidering works by following links). Hence I'm not going to index from sitemaps for the time being (and have changed this issue's title accordingly). See also #71, which was implemented alongside this.
…ated the code to detect if it is a web feed to use data identified while parsing the item, for #54.
…t if it is a web feed while parsing the item for #54
…m 0.5Mb to 1Mb, to accommodate some of the large RSS feeds, for #54
Note that a number of RSS and Atom feeds were not being detected, so I've made the following changes to improve detection:
There are still web feeds being missed though:
Now that more feeds are being detected per site, there may also be a new requirement for some logic to pick the most appropriate one: some sites, for example, have a feed for each tag and category (which the site owner may not even be aware of).
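One possible shape for that selection logic, as a sketch only: the marker list, the ranking rule and the names `feed_rank` and `pick_main_feed` are all assumptions for illustration, not anything implemented here. It prefers feeds whose path has no tag/category-style segment, then the shallowest path.

```python
from urllib.parse import urlparse

# Assumed path segments that suggest a secondary (per-tag, per-category) feed.
SECONDARY_MARKERS = ("tag", "tags", "category", "categories", "comments", "author")


def feed_rank(url: str) -> tuple:
    """Sort key: main-looking feeds first, then shallower paths, then URL."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    is_secondary = any(s in SECONDARY_MARKERS for s in segments)
    return (is_secondary, len(segments), url)


def pick_main_feed(feed_urls):
    """Return the most likely site-wide feed, or None if there are none."""
    if not feed_urls:
        return None
    return min(feed_urls, key=feed_rank)


candidates = [
    "https://example.com/tag/python/feed.xml",
    "https://example.com/feed.xml",
    "https://example.com/category/news/index.xml",
]
print(pick_main_feed(candidates))
```

A heuristic like this would still get some sites wrong, so letting the site owner override the choice would remain useful.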
Split from #34.
In its current form, the indexer will pick up further links to index from a sitemap and/or RSS feed if it finds one, but the chances are that it won't find the sitemap and/or RSS feed, because often nothing links to them.
This would need a new field in the database, and appropriate updates to the admin interface to allow site owners to specify their sitemap and/or RSS. There is also a separate SitemapSpider and XMLFeedSpider in Scrapy that could potentially be used.
Not sure how much more efficient this would make the indexing, but it could make it more predictable/targeted, i.e. make it more likely that the important pages will be indexed before hitting the indexing limits or timeout.
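To illustrate the idea of seeding the indexer from a known feed or sitemap, here is a minimal standard-library sketch that pulls page URLs out of an RSS 2.0 feed, an Atom feed or a sitemap. It is not the project's code and does not use Scrapy's SitemapSpider or XMLFeedSpider; the function name `extract_urls` is hypothetical, and other formats (e.g. RSS 1.0/RDF, sitemap index files) are deliberately not handled.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(xml_text: str) -> list:
    """Return the page URLs listed in an RSS 2.0 feed, Atom feed or sitemap."""
    root = ET.fromstring(xml_text)

    if root.tag == "rss":
        # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
        links = root.iterfind("./channel/item/link")
        return [link.text.strip() for link in links if link.text]

    if root.tag == ATOM_NS + "feed":
        # Atom: <entry><link href="URL"/></entry>; rel defaults to "alternate".
        urls = []
        for entry in root.iterfind(ATOM_NS + "entry"):
            for link in entry.iterfind(ATOM_NS + "link"):
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    urls.append(link.get("href"))
        return urls

    if root.tag == SITEMAP_NS + "urlset":
        # Sitemap: <urlset><url><loc>URL</loc></url></urlset>
        locs = root.iterfind(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
        return [loc.text.strip() for loc in locs if loc.text]

    # Unrecognised document type: nothing to seed from.
    return []


rss_example = (
    '<rss version="2.0"><channel><title>Example</title>'
    "<item><link>https://example.com/a</link></item>"
    "<item><link>https://example.com/b</link></item>"
    "</channel></rss>"
)
print(extract_urls(rss_example))
```

URLs gathered this way could be queued ahead of the normal link-following crawl, so that the pages the site owner lists are reached before the indexing limits or timeout.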