Now that I've started building a feed reader into the search engine, as per #54 and #71, it is often detecting multiple feeds on a site, some of which the site owner might not even be aware of. For example, it turns out that my personal site (based on Hugo) has feeds for each of the /tags/ and /categories/ pages, which I wasn't aware existed, e.g. https://www.michael-lewis.com/tags/artificial-intelligence/index.xml. In my case I've configured the search to exclude the /tags/ and /categories/ paths so it doesn't actually find them, but few people configure their search in this way.
As it is right now, if it finds multiple feeds, it simply picks the most recently found one to use as the primary feed. If that turns out to be (for example) the feed for a rarely used tag, it isn't going to be very useful, e.g. for #34 or some of the functionality that will be based on #71.
The easiest choice would be to use the path, and have a list of paths in order of preference. With a bit more work, other data could factor into this too, e.g. the location of the link (a link on the home page is more likely to be the main feed).
The list of multiple feeds can be obtained from the logs via:
docker logs src_indexing_1 2>&1 | grep "More than one potential web feed"
At the moment (with only around 1 day of data), a prioritised list could look like this:
- /posts/index.xml
- /feed/
- /feed
- /feed.xml
- /feed/rss
- /feeds/all.xml
- /feeds/all.atom.xml
- /blog/atom
- /blog/atom.xml
- /blog/index.xml
- /blog/feed.xml
- /blog?format=rss
- /blog/feeds/all.xml
- /notes/index.xml
- /index.xml
- /atom.xml
- /rss.xml
- /rss
Note that some sites only have feeds for tags, so it would probably just have to pick one of those. Some feeds look pretty site-specific, so again it would just pick one. And some are not for content (e.g. one site only has a feed of public keys), which is a bit more tricky, but I'm not sure I want to start maintaining an allow list, so they might have to slip through. As an aside, in many cases the preferred one is the first one in the list, i.e. the first one found, although there are plenty of exceptions so I'm not sure I'd want to make that a rule.
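To make the path-preference idea concrete, here is a minimal sketch in Python. It is not the indexer's actual code: the names `PREFERRED_FEED_PATHS`, `feed_path` and `choose_primary_feed` are hypothetical, and it assumes the detected feeds are available as a list of absolute URLs in the order they were found. Falling back to the first feed found when no path matches is also just one possible choice.

```python
from urllib.parse import urlsplit

# Preference order copied from the prioritised list above; earlier entries win.
PREFERRED_FEED_PATHS = [
    "/posts/index.xml", "/feed/", "/feed", "/feed.xml", "/feed/rss",
    "/feeds/all.xml", "/feeds/all.atom.xml", "/blog/atom", "/blog/atom.xml",
    "/blog/index.xml", "/blog/feed.xml", "/blog?format=rss",
    "/blog/feeds/all.xml", "/notes/index.xml", "/index.xml", "/atom.xml",
    "/rss.xml", "/rss",
]


def feed_path(url):
    """Return the path of a feed URL, plus the query string if there is one."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def choose_primary_feed(feed_urls):
    """Pick a primary feed from feed_urls (assumed to be in the order found).

    Returns None if no feeds were found.
    """
    if not feed_urls:
        return None
    rank = {path: i for i, path in enumerate(PREFERRED_FEED_PATHS)}
    no_match = len(PREFERRED_FEED_PATHS)
    # min() returns the first minimal item, so if nothing matches the list
    # (e.g. a site with only tag feeds) the first feed found is used.
    return min(feed_urls, key=lambda url: rank.get(feed_path(url), no_match))
```

With this, a site exposing both /tags/artificial-intelligence/index.xml and /index.xml would get /index.xml as its primary feed regardless of the order they were found in. Non-content feeds such as the public keys one would still slip through whenever they are the only feed, or the first one found, on a site.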