Indexing of some sites is blocked by Cloudflare #46

@m-i-l

As per #11 (Better handling of multiple failed reindexes), sites that fail to index content twice in a row have their indexing disabled.

There are two sites which index fine on dev but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. Running dig <domain.com> NS +short (replacing domain.com with the actual domain) shows that they use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in its source, which is related to Cloudflare's Bot Fight Mode.
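
The two checks above could be scripted along these lines. This is only a sketch: the function names (lookup_nameservers, uses_cloudflare_nameservers, has_challenge_script) are hypothetical and not part of the project, lookup_nameservers assumes dig is available on the PATH, and matching on the /cdn-cgi/challenge-platform/ path prefix is an assumed heuristic based on the script path seen on one site.

```python
import subprocess

# Path prefix seen in the page source of one affected site (assumed marker).
CHALLENGE_MARKER = "/cdn-cgi/challenge-platform/"


def lookup_nameservers(domain):
    """Return the NS records for a domain via `dig <domain> NS +short`."""
    result = subprocess.run(
        ["dig", domain, "NS", "+short"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def uses_cloudflare_nameservers(nameservers):
    """True if any name server is cloudflare.com or a subdomain of it."""
    for ns in nameservers:
        host = ns.lower().rstrip(".")  # dig prints a trailing dot
        if host == "cloudflare.com" or host.endswith(".cloudflare.com"):
            return True
    return False


def has_challenge_script(html):
    """True if the page source references the challenge-platform path."""
    return CHALLENGE_MARKER in html
```

For example, uses_cloudflare_nameservers(lookup_nameservers("domain.com")) would combine the lookup and the check, with domain.com replaced by the actual domain.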

To reproduce, run the Scrapy shell inside the Docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.
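
To tell a Cloudflare block apart from an ordinary site error, a heuristic along the following lines might help. It is a sketch only: looks_like_cloudflare_block is a hypothetical helper, and it assumes that responses served through Cloudflare carry a Server: cloudflare header or a CF-RAY header, which should be confirmed against the actual responses seen on prod.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristic: True if a 403/503 response appears to come from Cloudflare.

    `headers` is a plain mapping of header names to string values.
    """
    # Header names are case-insensitive, so normalise before comparing.
    normalized = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in normalized.get("server", "") or "cf-ray" in normalized
    )
    return status in (403, 503) and served_by_cloudflare
```

Inside the Scrapy shell this could be called as looks_like_cloudflare_block(response.status, response.headers.to_unicode_dict()), assuming the installed Scrapy version provides to_unicode_dict() on the response headers.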

Need to contact Cloudflare to see whether they can address this. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u there is a form to fill in.

Labels: bug
