
Indexing: Better handling of multiple failed reindexes #11

Closed
m-i-l opened this issue Dec 5, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@m-i-l
Contributor

m-i-l commented Dec 5, 2020

At the moment, tblIndexingLog contains the following messages for robots.txt forbidden and site timeout respectively:

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden 1, retry/max_reached None"

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden None, retry/max_reached 2"

However, the indexer will keep retrying every 3.5 or 7 days indefinitely. Leaving it this way isn't ideal because (i) the indexing log fills up with warnings and resources are potentially wasted on reindexing attempts, and, perhaps more importantly, (ii) if a site was previously indexed before consistently timing out or subsequently blocking indexing via robots.txt, stale content will be left in the search index, adversely impacting the quality of results.

At the moment, tblIndexingLog is checked manually for such issues, and one of two actions is taken manually (sketched in code after the list):

  • If the site looks like it is permanently offline, or robots.txt blocks indexing and it isn't a verified site, it is moved from tblIndexedDomains to tblExcludeDomains.
  • If robots.txt blocks indexing and it is a verified site, it is left in tblIndexedDomains but its indexing_frequency is increased from '3.5 days' to '30 days'.
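
For illustration, the two manual actions roughly correspond to SQL along the following lines. This is a hedged sketch only: it assumes PostgreSQL with psycopg2, a reason column on tblExcludeDomains, and placeholder connection details and domain names, none of which are confirmed in this issue.

```python
# Hedged sketch of the two manual actions, assuming PostgreSQL + psycopg2 and
# a 'reason' column on tblExcludeDomains (an assumption, not confirmed here).
import psycopg2

conn = psycopg2.connect("dbname=searchmysitedb")  # placeholder connection string

with conn, conn.cursor() as cur:
    # Action 1: permanently offline, or robots.txt-blocked and not owner verified:
    # move the domain from tblIndexedDomains to tblExcludeDomains.
    cur.execute("INSERT INTO tblExcludeDomains (domain, reason) VALUES (%s, %s);",
                ('example.com', 'robots.txt forbidden'))
    cur.execute("DELETE FROM tblIndexedDomains WHERE domain = %s;", ('example.com',))

    # Action 2: robots.txt-blocked but owner verified: keep it in tblIndexedDomains
    # and back the reindexing schedule off from 3.5 days to 30 days.
    cur.execute("UPDATE tblIndexedDomains SET indexing_frequency = '30 days' "
                "WHERE domain = %s;", ('example.org',))
```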

It would be good to automate this process. Indexing could keep a count of unsuccessful indexes and move a site to tblExcludeDomains after a certain number of failures, and conversely a maintenance job could (much less often) check whether certain reasons on tblExcludeDomains still hold (e.g. robots.txt forbidden or site timeout) and move the site back to tblIndexedDomains if they don't. A rough sketch of the failure-count side is below.
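
The sketch below is only an illustration of the idea, not a design decision: the tblIndexingLog columns (domain, message, timestamp), the threshold, and the exclusion reason are all assumptions.

```python
# Hedged sketch of the proposed automation. The tblIndexingLog columns
# (domain, message, timestamp), the threshold and the exclusion reason are all
# assumptions made for illustration.
import psycopg2

MAX_FAILED_INDEXES = 3  # assumed threshold before a site is excluded

def consecutive_failures(cur, domain):
    """Count how many of the most recent indexing runs found no documents."""
    cur.execute("SELECT message FROM tblIndexingLog WHERE domain = %s "
                "ORDER BY timestamp DESC LIMIT %s;", (domain, MAX_FAILED_INDEXES))
    failures = 0
    for (message,) in cur.fetchall():
        if message and message.startswith("WARNING: No documents found"):
            failures += 1
        else:
            break  # a successful index resets the streak
    return failures

def exclude_if_repeatedly_failing(conn, domain):
    """Move a domain to tblExcludeDomains once it has failed too many times in a row."""
    with conn, conn.cursor() as cur:
        if consecutive_failures(cur, domain) >= MAX_FAILED_INDEXES:
            cur.execute("INSERT INTO tblExcludeDomains (domain, reason) VALUES (%s, %s);",
                        (domain, 'repeated failed reindexes'))
            cur.execute("DELETE FROM tblIndexedDomains WHERE domain = %s;", (domain,))
```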

See also #14, which aims to prevent sites that block indexing via robots.txt from being submitted in the first place (although there have been cases, including with verified sites, where robots.txt allowed indexing at submission time but was subsequently changed to block indexing, leaving the data in the index to become stale).

m-i-l added the enhancement (New feature or request) label on Dec 5, 2020
m-i-l added a commit that referenced this issue Oct 10, 2021
…idden and likely site timeout, in preparation for #11 Better handling of multiple failed reindexes
m-i-l added a commit that referenced this issue Oct 16, 2021
…_disabled_reason, for #11 Better handling of multiple failed reindexes
m-i-l added a commit that referenced this issue Oct 16, 2021
…blPendingDomains and tblExcludeDomains into tblDomains (again), to simplify changes such as #3 Automate site expiry. Also updated schema for #20 and #11
m-i-l added a commit that referenced this issue Oct 16, 2021
…multiple failed reindexes, and updated SQL from tblIndexedDomains to tblDomains
@m-i-l
Contributor Author

m-i-l commented Oct 16, 2021

Added 3 new fields to tblDomains:

  • indexing_enabled BOOLEAN
  • indexing_disabled_date TIMESTAMPTZ
  • indexing_disabled_reason TEXT

A site is only indexed if indexing_enabled is TRUE (and moderator_approved is TRUE).

If indexing fails, i.e. no documents are indexed, there is now a check (against tblIndexingLog) to see whether it also failed on the previous index; if it did, any documents for that site are deleted from the search index and indexing_enabled is set to FALSE to prevent further reindexing (sketched below).

That should help keep the index cleaner.
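
For illustration, the check might be along these lines. This is a minimal sketch only: the tblIndexingLog columns, the assumption that the current failed run has already been logged, and the delete_domain_from_search_index() helper are all hypothetical.

```python
# Minimal sketch of the 'failed twice in a row' check. Column names and the
# delete_domain_from_search_index() helper are hypothetical.
import psycopg2
from datetime import datetime, timezone

def delete_domain_from_search_index(domain):
    # Placeholder: remove the domain's documents from the search index
    # (the real implementation is not shown in this issue).
    pass

def handle_failed_index(conn, domain, reason):
    """Called after an indexing run that produced no documents (already logged)."""
    with conn, conn.cursor() as cur:
        # Fetch the log message from the run before the current (failed) one.
        cur.execute("SELECT message FROM tblIndexingLog WHERE domain = %s "
                    "ORDER BY timestamp DESC LIMIT 1 OFFSET 1;", (domain,))
        row = cur.fetchone()
        if row and row[0] and row[0].startswith("WARNING: No documents found"):
            # Second consecutive failure: clean out stale documents and stop reindexing.
            delete_domain_from_search_index(domain)
            cur.execute("UPDATE tblDomains SET indexing_enabled = FALSE, "
                        "indexing_disabled_date = %s, indexing_disabled_reason = %s "
                        "WHERE domain = %s;",
                        (datetime.now(timezone.utc), reason, domain))
```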

However, I'll still need to check these manually from time to time:

  • If a site's indexing failed because it is blocked by robots.txt, the site owner may change that, and if they do change it to allow SearchMySiteBot, I'll need to manually re-enable indexing for the site.
  • If an owner-verified site is disabled, I should contact the owner to work out why.

@m-i-l
Contributor Author

m-i-l commented Sep 18, 2022

As per the commit comment, I'm now also emailing the site admin when a tier 3 site has indexing disabled, so I can investigate sooner.
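
Something along these lines, purely as a sketch: it assumes a fixed admin address and a local SMTP relay, neither of which is stated in the issue or the commit.

```python
# Hedged sketch of the notification step; addresses and SMTP host are placeholders.
import smtplib
from email.message import EmailMessage

ADMIN_EMAIL = "admin@example.com"  # placeholder admin address

def notify_indexing_disabled(domain, tier, reason):
    """Email the admin when indexing is disabled for a tier 3 site."""
    if tier != 3:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Indexing disabled for {domain}"
    msg["From"] = ADMIN_EMAIL
    msg["To"] = ADMIN_EMAIL
    msg.set_content(f"Indexing of {domain} (tier {tier}) has been disabled: {reason}")
    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP relay
        smtp.send_message(msg)
```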
