Indexing: Better handling of multiple failed reindexes #11
Referenced commits:
- …idden and likely site timeout, in preparation for #11 Better handling of multiple failed reindexes
- …_disabled_reason, for #11 Better handling of multiple failed reindexes
- …multiple failed reindexes, and updated SQL from tblIndexedDomains to tblDomains
Added 3 new fields to tblDomains, including indexing_enabled and indexing_disabled_reason. A site is only indexed if indexing_enabled is TRUE (and moderator_approved is TRUE). If indexing fails, i.e. no documents are indexed, there is now a check (in tblIndexingLog) to see whether it also failed on the last index, and if so, any documents in the index are deleted and indexing_enabled is set to FALSE to prevent further reindexing. That should help keep the index cleaner. However, I'll still need to periodically check these disabled sites manually.
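A minimal sketch of that double-failure check, assuming psycopg2 for the database and pysolr for the search index (the actual searchmysite code may differ); the message/timestamp column names, the Solr domain field, and the disabled-reason text are assumptions, while tblDomains, tblIndexingLog, indexing_enabled and indexing_disabled_reason come from the text above:

```python
import psycopg2  # assumed database driver
import pysolr    # assumed search client

# Hedged sketch of the double-failure check, not the actual searchmysite code.
# Assumes tblIndexingLog has 'domain', 'message' and 'timestamp' columns
# (the column names are guesses).
def disable_if_failed_twice(conn, solr, domain):
    with conn.cursor() as cur:
        # Look at the outcome of the two most recent indexing runs for this domain.
        cur.execute(
            "SELECT message FROM tblIndexingLog "
            "WHERE domain = %s ORDER BY timestamp DESC LIMIT 2",
            (domain,),
        )
        rows = cur.fetchall()
        # Only act when the last two runs both produced no documents.
        if len(rows) == 2 and all("No documents found" in r[0] for r in rows):
            # Delete any stale documents for this domain from the search index.
            solr.delete(q="domain:{}".format(domain))
            solr.commit()
            # Disable further reindexing and record why.
            cur.execute(
                "UPDATE tblDomains SET indexing_enabled = FALSE, "
                "indexing_disabled_reason = 'Consecutive failed reindexes' "
                "WHERE domain = %s",
                (domain,),
            )
    conn.commit()
```

Tolerating one failed run before disabling keeps a transient outage from knocking a site out of the index.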
As per the commit comment, I'm now also emailing the site admin when tier 3 sites have indexing disabled, so I can investigate sooner.
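A sketch of that notification as a hypothetical helper using the standard library's smtplib and email modules; the addresses, SMTP host, and function name are all placeholders, not the project's actual mail setup:

```python
import smtplib
from email.message import EmailMessage

# Hypothetical notification helper; addresses and SMTP host are placeholders.
def notify_admin_indexing_disabled(domain, tier, reason):
    msg = EmailMessage()
    msg["Subject"] = "Indexing disabled for {} (tier {})".format(domain, tier)
    msg["From"] = "indexing@example.com"
    msg["To"] = "admin@example.com"
    msg.set_content("Indexing was disabled for {}: {}".format(domain, reason))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```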
At the moment, tblIndexingLog contains the following messages for robots.txt forbidden and site timeout respectively:
"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden 1, retry/max_reached None"
"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden None, retry/max_reached 2"
However, it will keep on retrying every 3.5 days or 7 days indefinitely. Leaving it this way isn't ideal because (i) the indexing log fills up with warnings and resources are potentially wasted attempting reindexing, and, perhaps more importantly, (ii) if a site was previously indexed before consistently timing out or subsequently blocking indexing via robots.txt, then stale content will be left in the search index, adversely impacting the quality of results.
At the moment, tblIndexingLog is checked manually for such issues, and one of two actions is taken manually.
It would be good to automate this process. It might be worth having the indexing keep a count of unsuccessful runs and move a domain to tblExcludeDomains after a certain number of failed indexes, plus, conversely, a maintenance job that (much less often) checks whether certain reasons on tblExcludeDomains still hold (e.g. robots.txt forbidden or site timeout) and moves the domain back to tblIndexedDomains.
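A sketch of what that maintenance job could look like for the robots.txt case, using the standard library's urllib.robotparser; the table layout and the re-enable logic are assumptions built on the table names above:

```python
import urllib.robotparser

# Sketch of the proposed re-check; not actual project code. Assumes the
# domain was excluded for 'robots.txt forbidden' and that re-enabling means
# removing the tblExcludeDomains row and flipping indexing_enabled back on.
def recheck_excluded_domain(conn, domain):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://{}/robots.txt".format(domain))
    try:
        rp.read()
    except OSError:
        return  # Still unreachable or timing out: leave the domain excluded.
    if rp.can_fetch("*", "https://{}/".format(domain)):
        # robots.txt no longer forbids crawling, so re-enable indexing.
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM tblExcludeDomains WHERE domain = %s", (domain,)
            )
            cur.execute(
                "UPDATE tblDomains SET indexing_enabled = TRUE WHERE domain = %s",
                (domain,),
            )
        conn.commit()
```

Running this much less often than the indexer (say monthly) keeps the cost low while still letting wrongly excluded sites recover.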
See also #14 to try to prevent sites which block indexing via robots.txt from being submitted (although there have been cases, including with validated sites, where robots.txt allowed indexing on submission but was subsequently changed to block indexing, leaving the data in the index to become stale).