Indexing: Add an incremental reindex (only indexing new items) #34

@m-i-l

As per the post "searchmysite.net: The delicate matter of the bill", one of the less desirable "features" of the searchmysite.net model is that it burns a lot of money indexing sites on a regular basis, even if no one is actually using the system. It would therefore be good to try to reduce indexing costs.

One idea is to reindex only the sites and/or pages which have been updated. There doesn't appear to be a reliable way of detecting this, though: only around 45% of pages in the system currently return a Last-Modified header, so there may need to be a "good enough" only-if-probably-modified approach.
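For the pages that do return a Last-Modified header, the check could look something like the sketch below. This is only an illustration, not the searchmysite.net code: the function name `modified_since` and the `last_indexed` parameter are hypothetical, and it assumes the indexer keeps a timezone-aware timestamp of when each page was last indexed. It returns `None` when the header is missing or unparseable, which is the case where an only-if-probably-modified fallback would be needed.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def modified_since(headers: Mapping[str, str], last_indexed: datetime) -> Optional[bool]:
    """Return True/False if Last-Modified settles the question, None if it cannot.

    `last_indexed` is assumed to be a timezone-aware datetime.
    """
    # Header names are case-insensitive, so look the value up without relying on case.
    value = next((v for k, v in headers.items() if k.lower() == "last-modified"), None)
    if value is None:
        return None  # header not returned: fall back to another heuristic
    try:
        last_modified = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # header present but not a valid HTTP date
    if last_modified.tzinfo is None:
        # Treat a date without an offset as UTC so the comparison is well defined.
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    return last_modified > last_indexed
```

A three-valued result keeps "definitely unchanged" separate from "don't know", so the roughly 55% of pages without the header are not silently skipped.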

For the only-if-probably-modified approach, one idea is to store the entire home page in Solr and, at the start of reindexing a site, compare the stored home page with the newly fetched one: if they differ, proceed with reindexing that site, and if they are the same, skip it. There are some issues with this. If the page contains auto-generated text which changes on each page load, e.g. a timestamp, it will always register as different even if the real content hasn't changed. Conversely, pages within the site may have been updated even if the home page hasn't changed at all. It might therefore be safest to have, e.g., a weekly only-if-probably-modified reindex, with a monthly reindex-everything-regardless (i.e. the current approach) as a fail-safe.
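A rough sketch of that decision is below, under several assumptions. It compares a SHA-256 fingerprint of the home page rather than the entire page (a substitution for brevity; storing the whole page in Solr would work the same way), it blanks out dates and clock times with a deliberately crude regex to reduce false positives from timestamps, and it forces a full reindex after a hypothetical `full_interval_days`. All names here are illustrative, not existing searchmysite.net functions.

```python
import hashlib
import re
from typing import Optional

# Assumption: crude patterns for volatile text such as ISO dates and clock times.
_VOLATILE = re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}:\d{2}(?::\d{2})?")
_WHITESPACE = re.compile(r"\s+")


def home_page_fingerprint(html: str) -> str:
    """Hash the home page after removing text that is likely to change on every load."""
    text = _VOLATILE.sub("", html)
    text = _WHITESPACE.sub(" ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_reindex(
    stored_fingerprint: Optional[str],
    new_html: str,
    days_since_full_reindex: int,
    full_interval_days: int = 28,
) -> bool:
    """Decide whether to reindex a site at the start of its scheduled run."""
    # Fail-safe: never indexed before, or the periodic full reindex is due.
    if stored_fingerprint is None or days_since_full_reindex >= full_interval_days:
        return True
    # Otherwise reindex only if the home page has probably changed.
    return home_page_fingerprint(new_html) != stored_fingerprint
```

The fail-safe branch is what covers the case where inner pages changed but the home page did not; the regex only handles the simplest kind of auto-generated text and would need tuning against real sites.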

    Labels

    enhancement: New feature or request
