Indexing of some sites is blocked by Cloudflare #46

Open
m-i-l opened this issue Oct 23, 2021 · 2 comments
Labels
bug Something isn't working

Comments


m-i-l commented Oct 23, 2021

As per #11 ("Better handling of multiple failed reindexes"), sites that fail to index content twice in a row have their indexing disabled.
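
The rule above can be sketched as a small failure-streak counter. This is only an illustration of the behaviour described in #11, not the project's actual code: `SiteIndexState`, `record_index_result` and `MAX_CONSECUTIVE_FAILURES` are hypothetical names, and the assumption that a successful index resets the streak is mine.

```python
from dataclasses import dataclass

# Threshold taken from the rule described above: two failures in a row.
MAX_CONSECUTIVE_FAILURES = 2


@dataclass
class SiteIndexState:
    """Hypothetical per-site record; not the project's real schema."""
    domain: str
    consecutive_failures: int = 0
    indexing_enabled: bool = True


def record_index_result(state: SiteIndexState, succeeded: bool) -> SiteIndexState:
    """Update the failure streak and disable indexing once it hits the threshold."""
    if succeeded:
        # Assumption: any successful index resets the streak.
        state.consecutive_failures = 0
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            state.indexing_enabled = False
    return state
```

Under this sketch, a site that is blocked on two successive prod runs ends up disabled even though nothing is wrong with the site itself, which is the situation this issue describes.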

There are two sites which index fine on dev but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. Running dig <domain.com> NS +short (replacing domain.com with the actual domain) shows that both use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in its page source, which is related to Cloudflare's Bot Fight Mode.
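
The two checks described above (name servers and the challenge script path) could be automated roughly as follows. This is a minimal sketch: the function names are hypothetical, `lookup_nameservers` assumes `dig` is installed and on the PATH, and matching on the `/cdn-cgi/challenge-platform/` prefix rather than the full script path is my simplification.

```python
import subprocess
from typing import Iterable, List

# Path prefix of the script mentioned above; matching only the prefix is an assumption.
CHALLENGE_PATH = "/cdn-cgi/challenge-platform/"


def lookup_nameservers(domain: str) -> List[str]:
    """Run `dig <domain> NS +short` and return the name servers it prints."""
    result = subprocess.run(
        ["dig", domain, "NS", "+short"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def uses_cloudflare_nameservers(nameservers: Iterable[str]) -> bool:
    """True if any name server is under cloudflare.com (trailing dot ignored)."""
    return any(
        ns.lower().rstrip(".").endswith(".cloudflare.com") for ns in nameservers
    )


def has_challenge_script(page_source: str) -> bool:
    """True if the page source references Cloudflare's challenge platform path."""
    return CHALLENGE_PATH in page_source
```

For example, `uses_cloudflare_nameservers(lookup_nameservers("domain.com"))` (with the real domain substituted) combines the lookup and the check. Neither signal proves that a block is in place; they only indicate that the site sits behind Cloudflare.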

To reproduce, run the Scrapy shell inside the Docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.
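
To tell a Cloudflare block apart from an ordinary 403 or 503 returned by the site itself, the status code can be combined with the response headers. The sketch below is a heuristic, not a definitive test: it assumes that Cloudflare-served responses carry a `Server: cloudflare` or `CF-RAY` header, and `looks_like_cloudflare_block` is a hypothetical helper name.

```python
from typing import Mapping

# The two statuses seen on prod in this issue.
BLOCK_STATUSES = {403, 503}


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: a blocking status on a response that was served by Cloudflare."""
    # Header names are case-insensitive, so normalise before looking them up.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = (
        normalized.get("server", "") == "cloudflare" or "cf-ray" in normalized
    )
    return status in BLOCK_STATUSES and served_by_cloudflare
```

Inside the Scrapy shell the status is available as `response.status`; the headers would need converting to a plain dictionary of strings before being passed in.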

Need to contact Cloudflare to see if they can address this. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u there is a form to fill in.


m-i-l commented May 10, 2022

Form submitted.


m-i-l commented May 18, 2022

Eight days after submitting the form, Cloudflare is still blocking some sites:

2022-05-18 13:30:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET ...> (referer: None)
2022-05-18 13:30:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 ...>: HTTP status code is not handled or not allowed
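
To track whether the blocking changes over time, log lines like the ones above could be tallied by status code. This is a minimal sketch that assumes the log format shown here; `tally_crawl_statuses` is a hypothetical helper and the regular expression only matches the `scrapy.core.engine` "Crawled" lines.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "[scrapy.core.engine] DEBUG: Crawled (403) <GET ...> (referer: None)"
CRAWLED_RE = re.compile(r"\[scrapy\.core\.engine\] DEBUG: Crawled \((\d{3})\)")


def tally_crawl_statuses(log_lines: Iterable[str]) -> Counter:
    """Count crawled responses per HTTP status code in Scrapy log output."""
    counts: Counter = Counter()
    for line in log_lines:
        match = CRAWLED_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Comparing the 403 and 503 counts before and after a Cloudflare allowlist decision would show whether the approval had any effect.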

I don't know whether searchmysite.net will appear on https://radar.cloudflare.com/verified-bots if and when it is approved by Cloudflare.
