
[Bug] Add an identifiable user agent so that I can prevent it from crawling my site #1169

Open
solonovamax opened this issue Feb 11, 2025 · 9 comments
Labels: bug (Something isn't working)

Comments

@solonovamax

Describe the Bug
There is no mechanism to prevent Firecrawl from crawling my site. I do not want my site to be crawled. Do not crawl my site.

To Reproduce
Steps to reproduce the issue:

  1. Crawl a site that does not want to be crawled.

Expected Behavior
It should not crawl the site that does not want to be crawled.
The easiest way to do this would be to add something like Firecrawl/1.4.3 (or whatever the current version is) to the end of the user agent, and allow people to write an nginx rule to block it:

# http context: flag any request whose User-Agent matches "Firecrawl"
# (~* is a case-insensitive regex match)
map $http_user_agent $llm_scraper_user_agent {
    default         0;
    ~*Firecrawl     1;
}

server {

    # ...

    # reject flagged requests with 403 Forbidden
    if ($llm_scraper_user_agent) {
        return 403;
    }

    # ...
}

Additionally, you could implement robots.txt support, honoring any rule that matches either * or Firecrawl.

Environment (please complete the following information):

  • OS: N/A
  • Firecrawl Version: N/A
  • Node.js Version: N/A


@solonovamax added the bug label on Feb 11, 2025
@mogery
Member

mogery commented Feb 20, 2025

Hi! We implement robots.txt; the User-Agent we check for is FirecrawlAgent (we also look for *). This may still cause an initial scrape to hit your site, but no crawling is done afterwards.
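
For a site that wants to opt out of crawling entirely, a robots.txt entry like this works (a minimal sketch; any standard path rules apply):

User-agent: FirecrawlAgent
Disallow: /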

@solonovamax
Author

Hi! We implement robots.txt; the User-Agent we check for is FirecrawlAgent (we also look for *). This may still cause an initial scrape to hit your site, but no crawling is done afterwards.

Watching my request logs, I saw no request to robots.txt. Furthermore, there is no documentation about this whatsoever.

@mogery
Member

mogery commented Feb 20, 2025

Furthermore, there is no documentation about this whatsoever.

On firecrawl.dev:

[screenshot of firecrawl.dev]

Watching my request logs, I saw no request to robots.txt.

The check is made when a /crawl operation is started on a site, to avoid potentially overloading sites that do not agree to being crawled. The check is not made for one-off /scrape calls.

@solonovamax
Author

On firecrawl.dev:

[screenshot of firecrawl.dev]

This is not sufficiently obvious. It is on the 13th page of the home page (according to the "print" dialogue in Firefox), and it is not present anywhere in the documentation.

The check is not made for one-off /scrape calls.

Then please perform the check for all scrape calls, and cache the response for 24 hours. I do not consent to having my site scraped and ingested into any AI system, and I wish to block it.
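
Something along these lines is all it would take (a rough sketch using Python's standard urllib.robotparser; the names and the TTL here are illustrative, not your actual code):

import time
import urllib.error
import urllib.request
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "FirecrawlAgent"  # the token the crawler honors and sends
CACHE_TTL = 24 * 60 * 60       # cache each origin's robots.txt for 24 hours

_cache: dict[str, tuple[float, RobotFileParser]] = {}

def can_fetch(url: str) -> bool:
    """Return True if the origin's robots.txt permits USER_AGENT to fetch url."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    entry = _cache.get(origin)
    if entry is None or time.time() - entry[0] > CACHE_TTL:
        parser = RobotFileParser()
        request = urllib.request.Request(
            origin + "/robots.txt",
            headers={"User-Agent": USER_AGENT},  # identify the bot here too
        )
        try:
            with urllib.request.urlopen(request) as response:
                parser.parse(response.read().decode("utf-8", "replace").splitlines())
        except urllib.error.URLError:
            parser.allow_all = True  # no robots.txt reachable: conventionally allow
        entry = (time.time(), parser)
        _cache[origin] = entry
    return entry[1].can_fetch(USER_AGENT, url)

Note that the robots.txt fetch itself carries the bot's UA, so even that request is identifiable in server logs.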

Additionally, Firecrawl should be included in your UA so that it can be identified when reading through logs. You should not pretend to be a normal browser via your user agent, because you are not a normal browser.

@pandayummy

Any legit bot crawler should set its own UA instead of pretending to be a normal browser. Don't be evil.

@solonovamax
Author

Any legit bot crawler should set its own UA instead of pretending to be a normal browser. Don't be evil.

This.

There is very little reason not to use an identifiable UA.
Your UA should include the name of the product as well as a link to the product documentation (including how to block it).

If you feel you really need to use a browser-like UA, then your UA can be something like:

Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0 Firecrawl/1.4.3 (https://docs.firecrawl.dev/crawler)

Note: this link does not currently exist; it is a hypothetical link to documentation explaining what the crawler does and how to block it.

@pandayummy

pandayummy commented Mar 5, 2025

[screenshots of request logs]

I just checked my weblog; your requests come from rotating IPs in Silver Spring, Maryland. No ASN.

The UA looks like a real human browser:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36

If you persist in this behavior, you will be served with a court summons.

@rossabaker

Hi! We implement robots.txt; the User-Agent we check for is FirecrawlAgent (we also look for *). This may still cause an initial scrape to hit your site, but no crawling is done afterwards.

I don't think this is true. I am receiving a hit to /robots.txt with User-Agent axios/1.7.2. I am explicitly disallowing FirecrawlAgent, but my site is still immediately crawled by a user-agent identical to what PandaYummy reports.

Regardless, hitting robots.txt with a different user-agent than what the bot would respect is bad behavior. Both the robots.txt check and the crawl should be clearly identifiable as FirecrawlAgent.

@solonovamax
Author

Hi! We implement robots.txt; the User-Agent we check for is FirecrawlAgent (we also look for *). This may still cause an initial scrape to hit your site, but no crawling is done afterwards.

I don't think this is true. I am receiving a hit to /robots.txt with User-Agent axios/1.7.2. I am explicitly disallowing FirecrawlAgent, but my site is still immediately crawled by a user-agent identical to what PandaYummy reports.

Regardless, hitting robots.txt with a different user-agent than what the bot would respect is bad behavior. Both the robots.txt check and the crawl should be clearly identifiable as FirecrawlAgent.

Yeah, when I tested it, I never saw any requests to /robots.txt either.
