Indexing: Improve heuristics for guessing the primary web feed #77

@m-i-l

Now that I've started building a feed reader into the search engine, as per #54 and #71, it often detects multiple feeds on a site, some of which the site owner might not even know exist. For example, my personal site (based on Hugo) turns out to have feeds for each of the /tags/ and /categories/ pages, which I wasn't aware of, e.g. https://www.michael-lewis.com/tags/artificial-intelligence/index.xml. In my case I've configured the search to exclude the /tags/ and /categories/ paths, so the indexer doesn't actually find them, but few people configure their search this way.

As it is right now, if it finds multiple feeds, it simply picks the most recently found one to use as the primary feed. If that turns out to be, for example, the feed for a rarely used tag, it isn't going to be very useful, e.g. for #34 or for some of the functionality that will build on #71.

The easiest option would be to use the path, with a list of paths in order of preference. With a bit more work, other data could factor in too, e.g. the location of the link (a link on the home page is more likely to be the main feed).

The list of multiple feeds can be obtained from the logs via:

docker logs src_indexing_1 2>&1 | grep "More than one potential web feed"

At the moment (with only around 1 day of data), a prioritised list could look like this:

  1. /posts/index.xml
  2. /feed/
  3. /feed
  4. /feed.xml
  5. /feed/rss
  6. /feeds/all.xml
  7. /feeds/all.atom.xml
  8. /blog/atom
  9. /blog/atom.xml
  10. /blog/index.xml
  11. /blog/feed.xml
  12. /blog?format=rss
  13. /blog/feeds/all.xml
  14. /notes/index.xml
  15. /index.xml
  16. /atom.xml
  17. /rss.xml
  18. /rss
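
As a rough sketch of how a path-preference list like this could be applied (not the actual indexing code): the names `PREFERRED_FEED_PATHS`, `feed_key` and `pick_primary_feed` are hypothetical, and falling back to the first feed found when nothing matches is an assumption rather than a decided rule.

```python
from urllib.parse import urlsplit

# Paths in order of preference, copied from the list above.
PREFERRED_FEED_PATHS = [
    "/posts/index.xml",
    "/feed/",
    "/feed",
    "/feed.xml",
    "/feed/rss",
    "/feeds/all.xml",
    "/feeds/all.atom.xml",
    "/blog/atom",
    "/blog/atom.xml",
    "/blog/index.xml",
    "/blog/feed.xml",
    "/blog?format=rss",
    "/blog/feeds/all.xml",
    "/notes/index.xml",
    "/index.xml",
    "/atom.xml",
    "/rss.xml",
    "/rss",
]


def feed_key(url):
    """Reduce a feed URL to its path plus query string, e.g. '/blog?format=rss'."""
    parts = urlsplit(url)
    key = parts.path or "/"
    if parts.query:
        key += "?" + parts.query
    return key


def pick_primary_feed(feed_urls):
    """Return the feed whose path ranks highest in the preference list.

    Assumption: feeds with unlisted paths (e.g. tag feeds) all rank last,
    and ties go to the first feed found, because min() keeps the first
    minimal element.
    """
    if not feed_urls:
        return None
    rank = {path: i for i, path in enumerate(PREFERRED_FEED_PATHS)}
    unlisted = len(rank)
    return min(feed_urls, key=lambda url: rank.get(feed_key(url), unlisted))
```

Exact matching keeps this simple, but it means `/feed/` and `/feed` need separate entries, as in the list above. Other signals, such as whether the link was found on the home page, could be added as a secondary sort key.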

Note that some sites only have feeds for tags, so it will probably just have to pick one of those. Some feeds look pretty site-specific, so again it will just have to pick one. And some are not for content at all (e.g. one site only has a feed of public keys), which is trickier, but I'm not sure I want to start maintaining an allow list, so those might have to slip through. As an aside, in many cases the preferred feed is the first one in the list, i.e. the first one found, although there are plenty of exceptions, so I'm not sure I'd want to make that a rule.
