[Bug] Add an identifiable user agent so that I can prevent it from crawling my site #1169
Comments
Hi! We implement |
Watching my request logs, I saw no request to |
On
The check is made when a |
This is not sufficiently obvious: it is on the 13th page of the home page (according to the "print" dialog in Firefox) and is not mentioned anywhere in the documentation.
Then please make it happen for all scrape calls, and cache the response for 24 hours. I do not consent to having my site scraped for ingestion into any AI system, and I wish to block it. Additionally, |
Any legitimate bot crawler should set its own UA instead of pretending to be a normal browser. Don't be evil.
This. There is very little reason not to use an identifiable UA. If you feel you really need a browser-like UA, then your UA can be something like
Note: this link does not currently exist; it is a theoretical link to documentation explaining how to block the crawler and what the crawler does.
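For illustration, a browser-like yet identifiable UA could follow the `(compatible; Bot/Version; +url)` convention used by established crawlers such as Googlebot. The version number and documentation URL below are hypothetical:

```
Mozilla/5.0 (compatible; Firecrawl/1.4.3; +https://docs.firecrawl.dev/bot)
```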
I just checked my weblog; your IPs are from |, and the UA looks like a real human browser. If you persist in this behavior, you will be served with a court summons.
I don't think this is true; I am receiving a hit to |. Regardless, hitting robots.txt with a different user-agent than the one the bot would respect is bad behavior. Both the robots.txt check and the crawl should be clearly identifiable as FirecrawlAgent.
Yeah, when I tested it, I never saw any requests to |
Describe the Bug
There is no mechanism to prevent Firecrawl from crawling my site. I do not want my site to be crawled. Do not crawl my site.
To Reproduce
Steps to reproduce the issue:
Expected Behavior
It should not crawl a site that does not want to be crawled.
The easiest way to do this would be to add something like `Firecrawl/1.4.3` (or whatever the current version is) to the end of the user agent, and allow people to write an nginx rule to block it. Additionally, you could also implement support for `robots.txt`, looking for a rule that matches either `*` or `Firecrawl`.
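As a sketch of what such blocking could look like, assuming the crawler's UA actually contains an identifiable "Firecrawl" token, the nginx rule might be:

```nginx
# Return 403 to any client whose User-Agent mentions Firecrawl.
# Assumes the crawler sends an identifiable UA token, which it
# currently does not.
if ($http_user_agent ~* "firecrawl") {
    return 403;
}
```

and the corresponding robots.txt rule the crawler would be expected to honor:

```
# Disallow the Firecrawl crawler from the whole site.
User-agent: Firecrawl
Disallow: /
```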
Environment (please complete the following information):
Additional Context
Add any other context about the problem here, such as configuration specifics, network conditions, data volumes, etc.