Search: Change Newest Pages to Newest Posts, comprising only pages which appear in a web feed #74

Closed
m-i-l opened this issue Oct 2, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@m-i-l
Contributor

m-i-l commented Oct 2, 2022

Rename Newest Pages to Newest Posts, and change the filter which generates them from the current fq=published_date:* (which lets some non-article content through, e.g. the PostgreSQL home page whenever it is updated) to fq=in_web_feed:true.
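As a rough illustration of the proposed change, here is a minimal sketch of the two filter queries as Solr request parameters. The helper name `newest_query_params`, the `q=*:*` match-all query, the `rows` value and the `published_date desc` sort are assumptions for illustration only, not taken from the project's code; only the two `fq` values come from this issue.

```python
from urllib.parse import urlencode


def newest_query_params(use_web_feed_filter: bool, rows: int = 10) -> dict:
    """Build Solr select parameters for the newest listing (hypothetical helper)."""
    # The only difference between the current and proposed behaviour is the fq value.
    fq = "in_web_feed:true" if use_web_feed_filter else "published_date:*"
    return {
        "q": "*:*",                     # assumed match-all query
        "fq": fq,                       # filter query described in this issue
        "sort": "published_date desc",  # assumed sort order
        "rows": rows,                   # assumed page size
    }


current = newest_query_params(use_web_feed_filter=False)
proposed = newest_query_params(use_web_feed_filter=True)

# URL-encoded query strings that could be appended to a Solr /select URL.
print(urlencode(current))
print(urlencode(proposed))
```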

This aims to make Newest Posts more of an article feed, and therefore hopefully more useful, especially when combined with #34 which should mean full listings have new posts added daily.

Note that #71 has only just been implemented, and as per latest comment on #64 the web feed field has been renamed, so it could take up to 8 weeks for all the pages which are in a web feed to be identified, i.e. 4 weeks for the web feeds to be populated in the new web_feed field, and a further 4 weeks for all the in_web_feed fields to be populated.

@m-i-l
Contributor Author

m-i-l commented Oct 15, 2022

Note that the logic for picking the "best" web feed for a site has just been changed - see #77 . This means it'll be 12 Nov 2022 before all sites are updated with the new "best" feed, and 10 Dec 2022 before all sites are indexed with the in_web_feed flag for the new "best" feed. Should start getting enough data to start testing early/mid Nov and deploy around late Nov though.
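The dates above follow from the two 4-week passes described in the first comment (one for the web feed field, one for the in_web_feed flag). A quick check of the arithmetic, assuming both passes are counted from the date of this comment:

```python
from datetime import date, timedelta

comment_date = date(2022, 10, 15)   # date of this comment
cycle = timedelta(weeks=4)          # one pass, per the "4 weeks" in the first comment

feeds_updated = comment_date + cycle    # all sites updated with the new "best" feed
flags_updated = feeds_updated + cycle   # all pages indexed with in_web_feed for that feed

print(feeds_updated.isoformat())  # 2022-11-12
print(flags_updated.isoformat())  # 2022-12-10
```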

Note also that, as per #77 , a lot of sites only have feeds which aren't very useful, e.g. auto-generated feeds for specific tags which the site owner might not even be aware exist, so the in_web_feed flag might not be as useful as initially hoped. Will therefore need to do some side-by-side comparisons of the Newest Pages/Posts with and without the in_web_feed flag before confirming whether to make this change or not.

@m-i-l
Contributor Author

m-i-l commented Jan 21, 2023

Some stats:

  • of 1518 sites, 948 have web_feed
  • of 63084 pages, 9412 have published_date:* (i.e. current shortlist for newest), 7062 have in_web_feed:true

Replacing "published_date:*" with "in_web_feed:true" in mandatory_filter_queries_newest does shorten the list, perhaps too much for now. I've made another change to improve the web_feed detection (and therefore the number of pages identified as in_web_feed) as per #54, so will revisit once that has had time to take effect.
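To put numbers on "shorten the list", here is a small calculation using only the January 2023 figures quoted above. The variable names and the `pct` helper are illustrative.

```python
# January 2023 figures from the stats above.
sites_total, sites_with_feed = 1518, 948
pages_total = 63084
pages_published_date = 9412   # current shortlist (published_date:*)
pages_in_web_feed = 7062      # proposed shortlist (in_web_feed:true)


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("sites with web_feed (%):", pct(sites_with_feed, sites_total))
print("pages with published_date (%):", pct(pages_published_date, pages_total))
print("pages with in_web_feed (%):", pct(pages_in_web_feed, pages_total))

# How much shorter the shortlist would be after switching filters.
shrink = pages_published_date - pages_in_web_feed
print("pages dropped from shortlist:", shrink)  # 2350
print("proposed as % of current:", pct(pages_in_web_feed, pages_published_date))
```

In other words, the switch would have cut the shortlist by roughly a quarter at that point.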

@m-i-l
Contributor Author

m-i-l commented Aug 12, 2023

Latest stats:

  • of 1775 sites, 1214 have a web_feed
  • of 72494 pages, 11474 have published_date:*, and 14320 have in_web_feed:true

So in theory more articles should appear if switching from published_date to in_web_feed. However, there are a number of issues with this approach:

  • in_web_feed:true didn't turn out to be as useful a signal as I'd hoped, given feeds aren't just used for blog posts, but for all sorts of things which could be classed as noise, e.g. git commits.
  • How do you sort? You could make the filter published_date:*&in_web_feed:true and still sort on published_date, but that would make it a subset of the current list and so not as good. You could make page_last_modified mandatory and sort on that, but then you'd get lots of old articles which have been recently modified at the top, plus all sorts of non-blog entries like git commits.
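The sorting options discussed above can be sketched as alternative sets of Solr parameters. This is only an illustration: the option names, the `q=*:*` query, the `desc` sort direction, and the pairing of in_web_feed:true with page_last_modified:* in the third option are assumptions; the field names and filter values come from this issue.

```python
from urllib.parse import urlencode

# Each option lists its filter queries (sent as repeated fq parameters) and its sort.
OPTIONS = {
    # Current behaviour: anything with a published date, newest first.
    "current": {
        "fq": ["published_date:*"],
        "sort": "published_date desc",
    },
    # Both filters: sortable on published_date, but only a subset of "current".
    "intersection": {
        "fq": ["published_date:*", "in_web_feed:true"],
        "sort": "published_date desc",
    },
    # Mandatory page_last_modified: recently edited old articles rise to the top.
    "last_modified": {
        "fq": ["in_web_feed:true", "page_last_modified:*"],
        "sort": "page_last_modified desc",
    },
}


def to_query_string(option: dict) -> str:
    """Encode one option as a query string; doseq=True repeats fq for each filter."""
    params = {"q": "*:*", **option}
    return urlencode(params, doseq=True)


for name, option in OPTIONS.items():
    print(name, to_query_string(option))
```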
