Description
Because the crawler simply follows links, there is still quite a bit of noise in the index. For example, some people self-host git repositories on the same domain, or run various test sites, and these are all crawled and added to the index despite not containing any content that is likely to be useful in a search result. (Ideally site owners would edit their robots.txt, or use Manage Site to configure what is crawled, to help keep the index clean.)
There are other issues too, e.g. landing pages like /posts/ being indexed and returned in the results when a search matches a term in their title, as described in #66.
A partial solution is to boost results that are in a web feed, because feed membership is likely to be a useful signal. As per #71 there's now a flag on all content to indicate whether it is part of a web feed or not.
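
To make the idea concrete, here is a rough sketch of what a feed boost could look like if applied as a re-ranking step. It is only an illustration, not how the search is actually implemented: the `Result` class, the `in_web_feed` field name, the `FEED_BOOST` value and the example URLs are all assumptions, and in practice the boost might be better applied inside the search engine's own scoring.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    score: float        # base relevance score from the search engine
    in_web_feed: bool   # hypothetical name for the flag added in #71


# Assumed multiplier; the real value would need tuning against actual queries.
FEED_BOOST = 1.5


def boosted_score(result: Result, boost: float = FEED_BOOST) -> float:
    """Multiply the base score by the boost if the page is in a web feed."""
    return result.score * boost if result.in_web_feed else result.score


def rerank(results: list[Result], boost: float = FEED_BOOST) -> list[Result]:
    """Return results ordered by boosted score, highest first."""
    return sorted(results, key=lambda r: boosted_score(r, boost), reverse=True)


# Made-up example: a landing page, a post that is in the feed, and a git page.
results = [
    Result("https://example.com/posts/", 1.2, False),
    Result("https://example.com/posts/my-article/", 1.0, True),
    Result("https://example.com/git/repo/", 0.9, False),
]
ranked = rerank(results)
for r in ranked:
    print(f"{boosted_score(r):.2f}  {r.url}")
```

With these made-up scores the post that appears in the feed moves above the /posts/ landing page, while pages that are not in a feed keep their original relative order. A multiplicative boost keeps non-feed pages in the results rather than filtering them out, which seems safer given that not every useful page will be in a feed.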