As per #11 Better handling of multiple failed reindexes, sites which fail to index content two times in a row have their indexing disabled.
There are two sites which index fine on dev, but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. A dig <domain.com> NS +short (replacing domain.com with the actual domain), show they use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in the source which is related to Cloudflare's Bot Fight Mode.
To recreate, run the scrapy shell inside the docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.
Need to contact Cloudflare to see if they can address. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u there is a form to fill in.
As per #11 Better handling of multiple failed reindexes, sites which fail to index content two times in a row have their indexing disabled.
There are two sites which index fine on dev, but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. A
dig <domain.com> NS +short(replacing domain.com with the actual domain), show they use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in the source which is related to Cloudflare's Bot Fight Mode.To recreate, run the scrapy shell inside the docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.
Need to contact Cloudflare to see if they can address. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u there is a form to fill in.