
Indexing: Better handling of multiple failed reindexes #11

Closed
m-i-l opened this issue Dec 5, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@m-i-l
Contributor

m-i-l commented Dec 5, 2020

At the moment, tblIndexingLog contains the following messages for robots.txt forbidden and site timeout respectively:

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden 1, retry/max_reached None"

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden None, retry/max_reached 2"

However, the indexer will keep retrying every 3.5 or 7 days indefinitely. Leaving it this way isn't ideal because (i) the indexing log fills up with warnings and resources are potentially wasted on reindexing attempts, and, perhaps more importantly, (ii) if a site was previously indexed before consistently timing out or subsequently blocking indexing via robots.txt, stale content will be left in the search index, adversely impacting the quality of results.

At the moment, tblIndexingLog is checked manually for such issues, and one of two actions is taken manually (sketched in code after the list):

  • If the site looks like it is permanently offline, or robots.txt blocks indexing and it isn't a verified site, it is moved from tblIndexedDomains to tblExcludeDomains.
  • If robots.txt blocks indexing and it is a verified site, it is left in tblIndexedDomains but its indexing_frequency is increased from '3.5 days' to '30 days'.
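
For illustration, the two manual actions roughly correspond to SQL along the following lines. This is a hedged sketch only: it assumes PostgreSQL with psycopg2, a reason column on tblExcludeDomains, and placeholder connection details and domain names, none of which are confirmed in this issue.

```python
# Hedged sketch of the two manual actions, assuming PostgreSQL + psycopg2 and
# a 'reason' column on tblExcludeDomains (an assumption, not confirmed here).
import psycopg2

conn = psycopg2.connect("dbname=searchmysitedb")  # placeholder connection string

with conn, conn.cursor() as cur:
    # Action 1: permanently offline, or robots.txt-blocked and not owner verified:
    # move the domain from tblIndexedDomains to tblExcludeDomains.
    cur.execute("INSERT INTO tblExcludeDomains (domain, reason) VALUES (%s, %s);",
                ('example.com', 'robots.txt forbidden'))
    cur.execute("DELETE FROM tblIndexedDomains WHERE domain = %s;", ('example.com',))

    # Action 2: robots.txt-blocked but owner verified: keep it in tblIndexedDomains
    # and back the reindexing schedule off from 3.5 days to 30 days.
    cur.execute("UPDATE tblIndexedDomains SET indexing_frequency = '30 days' "
                "WHERE domain = %s;", ('example.org',))
```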

It would be good to automate this process. Indexing could keep a count of unsuccessful indexes and move a site to tblExcludeDomains after a certain number of failures, and conversely a maintenance job could (much less often) check whether certain reasons on tblExcludeDomains still hold (e.g. robots.txt forbidden or site timeout) and move the site back to tblIndexedDomains if they don't. A rough sketch of the failure-count side is below.
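
The sketch below is only an illustration of the idea, not a design decision: the tblIndexingLog columns (domain, message, timestamp), the threshold, and the exclusion reason are all assumptions.

```python
# Hedged sketch of the proposed automation. The tblIndexingLog columns
# (domain, message, timestamp), the threshold and the exclusion reason are all
# assumptions made for illustration.
import psycopg2

MAX_FAILED_INDEXES = 3  # assumed threshold before a site is excluded

def consecutive_failures(cur, domain):
    """Count how many of the most recent indexing runs found no documents."""
    cur.execute("SELECT message FROM tblIndexingLog WHERE domain = %s "
                "ORDER BY timestamp DESC LIMIT %s;", (domain, MAX_FAILED_INDEXES))
    failures = 0
    for (message,) in cur.fetchall():
        if message and message.startswith("WARNING: No documents found"):
            failures += 1
        else:
            break  # a successful index resets the streak
    return failures

def exclude_if_repeatedly_failing(conn, domain):
    """Move a domain to tblExcludeDomains once it has failed too many times in a row."""
    with conn, conn.cursor() as cur:
        if consecutive_failures(cur, domain) >= MAX_FAILED_INDEXES:
            cur.execute("INSERT INTO tblExcludeDomains (domain, reason) VALUES (%s, %s);",
                        (domain, 'repeated failed reindexes'))
            cur.execute("DELETE FROM tblIndexedDomains WHERE domain = %s;", (domain,))
```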

See also #14, which aims to prevent sites that block indexing via robots.txt from being submitted in the first place (although there have been cases, including with verified sites, where robots.txt allowed indexing at submission time but was subsequently changed to block indexing, leaving the data in the index to become stale).

m-i-l added the enhancement (New feature or request) label on Dec 5, 2020
m-i-l added a commit that referenced this issue Oct 10, 2021
…idden and likely site timeout, in preparation for #11 Better handling of multiple failed reindexes
m-i-l added a commit that referenced this issue Oct 16, 2021
…_disabled_reason, for #11 Better handling of multiple failed reindexes
m-i-l added a commit that referenced this issue Oct 16, 2021
…blPendingDomains and tblExcludeDomains into tblDomains (again), to simplify changes such as #3 Automate site expiry. Also updated schema for #20 and #11
m-i-l added a commit that referenced this issue Oct 16, 2021
…multiple failed reindexes, and updated SQL from tblIndexedDomains to tblDomains
@m-i-l
Contributor Author

m-i-l commented Oct 16, 2021

Added 3 new fields to tblDomains:

  • indexing_enabled BOOLEAN
  • indexing_disabled_date TIMESTAMPTZ
  • indexing_disabled_reason TEXT

A site is only indexed if indexing_enabled is TRUE (and moderator_approved is TRUE).

If indexing fails, i.e. no documents are indexed, there is now a check (against tblIndexingLog) to see whether it also failed on the previous index; if it did, any documents for that site are deleted from the search index and indexing_enabled is set to FALSE to prevent further reindexing (sketched below).

That should help keep the index cleaner.
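
For illustration, the check might be along these lines. This is a minimal sketch only: the tblIndexingLog columns, the assumption that the current failed run has already been logged, and the delete_domain_from_search_index() helper are all hypothetical.

```python
# Minimal sketch of the 'failed twice in a row' check. Column names and the
# delete_domain_from_search_index() helper are hypothetical.
import psycopg2
from datetime import datetime, timezone

def delete_domain_from_search_index(domain):
    # Placeholder: remove the domain's documents from the search index
    # (the real implementation is not shown in this issue).
    pass

def handle_failed_index(conn, domain, reason):
    """Called after an indexing run that produced no documents (already logged)."""
    with conn, conn.cursor() as cur:
        # Fetch the log message from the run before the current (failed) one.
        cur.execute("SELECT message FROM tblIndexingLog WHERE domain = %s "
                    "ORDER BY timestamp DESC LIMIT 1 OFFSET 1;", (domain,))
        row = cur.fetchone()
        if row and row[0] and row[0].startswith("WARNING: No documents found"):
            # Second consecutive failure: clean out stale documents and stop reindexing.
            delete_domain_from_search_index(domain)
            cur.execute("UPDATE tblDomains SET indexing_enabled = FALSE, "
                        "indexing_disabled_date = %s, indexing_disabled_reason = %s "
                        "WHERE domain = %s;",
                        (datetime.now(timezone.utc), reason, domain))
```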

However, I'll still need to check these manually from time to time:

  • If a site's indexing failed because it is blocked by robots.txt, the site owner may change that, and if they do change it to allow SearchMySiteBot, I'll need to manually re-enable indexing for the site.
  • If an owner-verified site is disabled, I should contact the owner to work out why.

@m-i-l
Contributor Author

m-i-l commented Sep 18, 2022

As per the commit comment, I'm now also emailing the site admin when a tier 3 site has indexing disabled, so I can investigate sooner.
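
Something along these lines, purely as a sketch: it assumes a fixed admin address and a local SMTP relay, neither of which is stated in the issue or the commit.

```python
# Hedged sketch of the notification step; addresses and SMTP host are placeholders.
import smtplib
from email.message import EmailMessage

ADMIN_EMAIL = "admin@example.com"  # placeholder admin address

def notify_indexing_disabled(domain, tier, reason):
    """Email the admin when indexing is disabled for a tier 3 site."""
    if tier != 3:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Indexing disabled for {domain}"
    msg["From"] = ADMIN_EMAIL
    msg["To"] = ADMIN_EMAIL
    msg.set_content(f"Indexing of {domain} (tier {tier}) has been disabled: {reason}")
    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP relay
        smtp.send_message(msg)
```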
