As per #11 Better handling of multiple failed reindexes, sites which fail to index content two times in a row have their indexing disabled.
Two sites index fine on dev but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. Running dig <domain.com> NS +short (replacing domain.com with the actual domain) shows that they use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in its source, which is related to Cloudflare's Bot Fight Mode.
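If more sites need checking, the name server test can be scripted. The following is a minimal sketch, assuming the output of dig <domain.com> NS +short has already been captured as text; the function name uses_cloudflare_ns is hypothetical and not part of the project.

```python
def uses_cloudflare_ns(dig_output: str) -> bool:
    """Return True if any NS record in the dig output is under cloudflare.com.

    `dig_output` is assumed to be the raw text printed by
    `dig <domain.com> NS +short`, i.e. one name server per line.
    """
    for line in dig_output.splitlines():
        # dig prints fully qualified names with a trailing dot.
        name = line.strip().rstrip(".").lower()
        if name == "cloudflare.com" or name.endswith(".cloudflare.com"):
            return True
    return False
```

A result of True only shows that the domain's DNS is hosted by Cloudflare; it does not by itself prove that the requests are being blocked.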
To reproduce, run the scrapy shell inside the Docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.
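The signals described above (a 403 or 503 status combined with Cloudflare markers) could be combined into a rough check, for example to log a clearer reason when a site fails to index. This is only a sketch under stated assumptions: the function name looks_like_cloudflare_block is hypothetical, headers are assumed to be a plain dict of strings, and the Server: cloudflare and CF-RAY headers are assumed to be present on responses served by Cloudflare.

```python
def looks_like_cloudflare_block(status: int, headers: dict, body: str) -> bool:
    """Heuristically guess whether a response was blocked by Cloudflare.

    Assumes `headers` maps header names to string values and `body`
    is the decoded response text. This is a heuristic, not a guarantee.
    """
    # The statuses seen on prod in this issue.
    blocked_status = status in (403, 503)

    # Header names are case-insensitive, so normalise before comparing.
    normalized = {key.lower(): value.lower() for key, value in headers.items()}
    served_by_cloudflare = (
        normalized.get("server", "") == "cloudflare" or "cf-ray" in normalized
    )

    # Path prefix of the challenge script found in one site's source.
    has_challenge_script = "/cdn-cgi/challenge-platform/" in body

    return blocked_status and (served_by_cloudflare or has_challenge_script)
```

For instance, looks_like_cloudflare_block(503, {"Server": "cloudflare"}, "") evaluates to True, while a 200 response is never flagged.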
We need to contact Cloudflare to see whether they can address this. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u, there is a form to fill in.