
Indexing: Crawl from web feed (RSS/Atom) #54

Closed
m-i-l opened this issue May 5, 2022 · 3 comments
Labels
enhancement New feature or request


m-i-l commented May 5, 2022

Split from #34.

In its current form, the indexer will pick up extra links to index from a sitemap and/or RSS feed if it finds one, but in most cases it won't find them, because there are often no links to the sitemap or feed anywhere on the site.

This would need a new field in the database, and appropriate updates to the admin interface so that site owners can specify their sitemap and/or RSS feed. Scrapy also provides separate SitemapSpider and XMLFeedSpider classes that could potentially be used.

Not sure how much more efficient this would make the indexing, but it could make it more predictable and targeted, i.e. more likely that the important pages are indexed before hitting the indexing limits or timeout.


m-i-l commented Jun 13, 2022

Now that I'm indexing content type as per #63, you can search for the RSS feeds and sitemap.xml files via https://searchmysite.net/search/?q=content_type:*/xml , or for a specific site via https://searchmysite.net/search/?q=content_type%3A*%2Fxml+%2Bdomain%3Amichael-lewis.com .

It looks like it finds the RSS feeds for most sites (presumably because most people link to their RSS feed), but far fewer sitemaps (again presumably because few sites link to them).

That means it could be possible to auto-discover and auto-populate these values for a site. It might not be bullet-proof, but the RSS feed for a site could be found via e.g. https://searchmysite.net/search/?q=content_type%3A*%2Fxml+AND+domain%3Akudadam.com+NOT+url%3A*sitemap.xml and the sitemap for a site could be found via e.g. https://searchmysite.net/search/?q=content_type%3A*%2Fxml+AND+domain%3Akudadam.com+AND+url%3A*sitemap.xml .
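The query pattern above can be sketched as a small helper. This is illustrative only: the field names (content_type, domain, url) match those used in the searchmysite.net queries, but the `build_feed_query` function itself is a hypothetical convenience, not part of the codebase.

```python
# Hypothetical helper mirroring the queries above: build a Solr-style query
# string that finds a site's web feed or sitemap by content type and URL
# pattern. Feeds are "any XML that isn't the sitemap"; the sitemap is the
# XML whose URL ends in sitemap.xml.

def build_feed_query(domain: str, sitemap: bool = False) -> str:
    clauses = ["content_type:*/xml", f"domain:{domain}"]
    if sitemap:
        clauses.append("url:*sitemap.xml")                   # match the sitemap itself
        return " AND ".join(clauses)
    return " AND ".join(clauses) + " NOT url:*sitemap.xml"   # exclude sitemaps, leaving feeds

print(build_feed_query("kudadam.com"))
# content_type:*/xml AND domain:kudadam.com NOT url:*sitemap.xml
print(build_feed_query("kudadam.com", sitemap=True))
# content_type:*/xml AND domain:kudadam.com AND url:*sitemap.xml
```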

Having these values for a site could also be useful for #34 .

@m-i-l m-i-l changed the title Indexing: Crawl from sitemap and/or RSS Indexing: Crawl from sitemap and web feed (RSS/Atom) Sep 25, 2022
m-i-l added a commit that referenced this issue Oct 2, 2022
…stem_generated, web_feed_user_entered, sitemap_system_generated and sitemap_user_entered, in preparation for #54
m-i-l added a commit that referenced this issue Oct 2, 2022
…new field to indicate if a page is included in a web feed, and added support for #34 Add an incremental reindex
m-i-l added a commit that referenced this issue Oct 2, 2022
…t which made some things more difficult), and added the ability to edit the web feed for #54 Crawl from sitemap and web feed (RSS/Atom)
@m-i-l m-i-l changed the title Indexing: Crawl from sitemap and web feed (RSS/Atom) Indexing: Crawl from web feed (RSS/Atom) Oct 2, 2022

m-i-l commented Oct 2, 2022

Implemented. The process is:

  • The system tries to identify a site's RSS/Atom feed while indexing, using the logic described in the comments above. Note that only the content type and filename are available at this point (not the content itself), and only one value is saved (while some sites have more than one feed). Nevertheless, of the 1401 sites currently in the search index, 524 have feeds detected, and of these 497 have feeds which can be parsed and which contain links. Most of the 27 without links fail for the usual reasons (sites going offline, domains expiring, HTTPS certificates expiring, etc., which should be cleaned up by the "deindex site if indexing fails twice in a row" functionality), and a few simply have no links in their feed. So that should provide good coverage. As an aside, picking some of the sites without detected feeds at random, some say things like "Subscribe to my RSS feed" without providing a link, which suggests they do have one - in one case I was able to guess it by entering /rss.xml, but for others I couldn't.
  • If the web feed is missing or incorrect, site owners can use Manage Site / Site details / Web feed to specify their own. This is stored in the database in web_feed_user_entered, with the auto-discovered web feed stored in the database in web_feed_auto_discovered.
  • The web feed is stored in Solr in the web_feed attribute. This is populated with web_feed_user_entered (if present) or failing that web_feed_auto_discovered (also only if present).
  • Indexing begins with both web_feed (if present) and home_page in the start_urls, with special code in parse_start_url to parse the RSS/Atom using feedparser (Scrapy doesn't parse RSS/Atom natively), as described in Indexing: Add an incremental reindex (only indexing new items) #34 .
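The "extract links from the feed" step above can be sketched as follows. The actual implementation uses feedparser inside parse_start_url; this stdlib-only version (using ElementTree, an assumption made to keep the example dependency-free) shows the same idea for both RSS (`<item><link>text</link>`) and Atom (`<entry><link href="..."/>`) entries.

```python
# Sketch of extracting entry links from an RSS or Atom feed using only the
# standard library, in place of the feedparser call the real code makes.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def feed_links(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)
    links = []
    for item in root.iter("item"):                 # RSS 2.0 entries
        link = item.find("link")
        if link is not None and link.text:
            links.append(link.text.strip())
    for entry in root.iter(f"{ATOM_NS}entry"):     # Atom entries
        for link in entry.findall(f"{ATOM_NS}link"):
            href = link.get("href")
            if href:
                links.append(href)
    return links

rss = """<rss version="2.0"><channel>
<item><link>https://example.com/post-1</link></item>
</channel></rss>"""
print(feed_links(rss))  # ['https://example.com/post-1']
```

In the real spider these links would then be queued for crawling alongside the links found on the home page.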

Note that only 7 sites have a sitemap detected. This is largely because few sites link directly to their sitemap (and the spidering works by following links). Hence I'm not going to index from sitemaps for the time being (and have changed the issue title accordingly).

See also #71 which was implemented alongside this.

@m-i-l m-i-l closed this as completed Oct 2, 2022
m-i-l added a commit that referenced this issue Oct 8, 2022
…ated the code to detect if it is a web feed to use data identified while parsing the item, for #54.
m-i-l added a commit that referenced this issue Oct 8, 2022
…t if it is a web feed while parsing the item for #54
m-i-l added a commit that referenced this issue Oct 8, 2022
…m 0.5Mb to 1Mb, to accommodate some of the large RSS feeds, for #54

m-i-l commented Oct 8, 2022

Note that a number of RSS and Atom feeds were not being detected, so I've made the following changes to improve detection:

  • Changed the logic to detect whether a response is a web feed while actually parsing the item, rather than relying on the Content-Type HTTP header, given the number of sites which don't set the header or set it incorrectly (e.g. to text/html).
  • Now look for feed links in link rel="alternate" type="application/rss+xml" elements, given that a number of sites have no clickable links to their feed.
  • Increased the maximum download size, given the number of RSS feeds which weren't downloaded because they exceeded the limit.
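The first change above - classifying a response by its content rather than its Content-Type header - can be sketched like this. The function name `is_web_feed` is an assumption for illustration; checking the XML root element name is one simple way to recognise RSS and Atom regardless of what the server claims.

```python
# Sketch of detecting a web feed from the body itself rather than the
# Content-Type header (which sites often omit or set to text/html).
import xml.etree.ElementTree as ET

def is_web_feed(body: str) -> bool:
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False                      # not well-formed XML, so not a feed
    tag = root.tag.split("}")[-1]         # strip any XML namespace prefix
    return tag in ("rss", "feed", "RDF")  # RSS 2.0, Atom, RSS 1.0

print(is_web_feed("<rss version='2.0'><channel/></rss>"))  # True
print(is_web_feed("<html><body>hi</body></html>"))         # False
```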

There are still web feeds being missed though:

  • The feed link is simply not reached within the first 50 pages (but presumably would be if the indexing page limit were increased). Examples: cathydutton.co.uk and rosieleizrowice.com.
  • The RSS/Atom download is cancelled because the file exceeds the configured 2Mb file size limit. Example: laurakalbag.com (smallest feed 2.2Mb, largest feed 3.1Mb).
  • The user has explicitly set up an exclude path rule covering the feed. Example: sierdy.one (exclude path /posts/atom.xml).

Now that more feeds are being detected per site, there may also be a new requirement for some logic to pick the most appropriate one - some sites e.g. have a feed for each tag and category (which the site owner may not even be aware of).
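One possible heuristic for that "pick the most appropriate feed" requirement is sketched below. This is not part of the current implementation - the scoring rules (prefer feeds nearer the site root, penalise tag/category paths) are assumptions about what "most appropriate" might mean.

```python
# Hypothetical heuristic for choosing one feed when a site exposes several,
# e.g. per-tag and per-category feeds: lower score is better.
from urllib.parse import urlparse

def score(feed_url: str) -> int:
    path = urlparse(feed_url).path.lower()
    penalty = path.count("/")            # deeper paths rank lower
    if any(seg in path for seg in ("/tag/", "/tags/", "/category/", "/categories/")):
        penalty += 10                    # per-tag/category feeds rank last
    return penalty

feeds = [
    "https://example.com/category/python/feed.xml",
    "https://example.com/feed.xml",
]
print(min(feeds, key=score))  # https://example.com/feed.xml
```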
