Releases: mendableai/firecrawl
LLMs.txt API + Deep Research API - v1.6.0
Introducing LLMs.txt API
The /llmstxt endpoint allows you to transform any website into clean, LLM-ready text files. Simply provide a URL, and Firecrawl will crawl the site and generate both llms.txt and llms-full.txt files that can be used for training or analysis with any LLM.
Docs here: https://docs.firecrawl.dev/features/alpha/llmstxt
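A minimal sketch of calling the alpha endpoint over plain HTTP, assuming it accepts a target url plus a maxUrls cap and returns a generation job you can poll; treat the exact path, parameter names, and response shape as assumptions and defer to the docs above.

```python
import requests

API_KEY = "fc-YOUR_API_KEY"  # your Firecrawl API key
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Kick off llms.txt generation for a site (parameter names assumed; see the docs above).
resp = requests.post(
    "https://api.firecrawl.dev/v1/llmstxt",
    headers=headers,
    json={"url": "https://example.com", "maxUrls": 10, "showFullText": True},
)
print(resp.json())  # expected to include a job you can poll for llms.txt / llms-full.txt
```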
Introducing Deep Research API (Alpha)
The /deep-research endpoint enables AI-powered deep research and analysis on any topic. Simply provide a research query, and Firecrawl will autonomously explore the web, gather relevant information, and synthesize findings into comprehensive insights.
Join the waitlist here: https://www.firecrawl.dev/deep-research
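For when you're off the waitlist, here is a hedged sketch of what a request could look like, assuming the endpoint takes a research query (plus an optional maxUrls cap), returns a job id, and exposes a status route to poll; all of those details are assumptions until the alpha docs are published.

```python
import requests
import time

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Start a research job (parameter names assumed for this alpha endpoint).
start = requests.post(
    "https://api.firecrawl.dev/v1/deep-research",
    headers=headers,
    json={"query": "Emerging trends in open-source web crawlers", "maxUrls": 20},
).json()

# Poll until the synthesized findings are ready ("id" and "status" fields assumed).
while True:
    research = requests.get(
        f"https://api.firecrawl.dev/v1/deep-research/{start['id']}", headers=headers
    ).json()
    if research.get("status") in ("completed", "failed"):
        break
    time.sleep(5)
print(research)
```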
Official Firecrawl MCP Server
Introducing the Firecrawl MCP Server, which gives Cursor, Windsurf, and Claude enhanced web extraction capabilities. Big thanks to @vrknetha and @cawstudios for the initial implementation!
See here: https://github.com/mendableai/firecrawl-mcp-server
Fixes & Enhancements
- Improved charset detection and re-decoding.
- Fixed extract token limit issues.
- Addressed issues with includes/excludes handling.
- Fixed AI SDK handling of JSON responses.
New Features & Improvements
- AI-SDK Migration: transitioned to the AI-SDK.
- Auto-Recharge Emails: send an email suggesting an upgrade when auto-recharge triggers.
- Fire-Index Added: introduced a new indexing system.
- Self-Hosting Enhancements: OpenAI-compatible API & Ollama env support.
- Batch Billing: streamlined billing processes.
- Supabase Read Replica Routing: improved database performance.
Crawler & AI Improvements
- Implemented Claude 3.7 and GPT-4.5 web crawlers.
- Added Groq Web Crawler example.
- Updated crawl-status behavior for better error handling.
- Improved cross-origin redirect handling.
Documentation & Maintenance
- Updated Dockerfile.
- Fixed missing "required" field in docs.
Detailed breakdown
Deep Research API & LLMs.txt API
- (feat/deep-research-alpha) Added Max URLs, Sources, and Fixes by @nickscamara in #1271
- (feat/deep-research) Alpha prep + Improvements by @nickscamara in #1284
- Truncate llmstxt cache based on max URLs limit & improve max URLs handling by @ericciarla in #1285
Fixes & Enhancements
- fix(scrapeURL/engines/fetch): Discover charset and re-decode by @mogery in #1221
- fix(crawl-redis): Ignore empty includes/excludes by @mogery in #1223
- fix(token-slicer): Fix extract token limit issues by @nickscamara in #1236
- fix(scraper): Improve charset detection regex to accurately parse meta tags by @GrassH in #1265
- fix(crawl): Includes/excludes fixes (FIR-1300) by @mogery in #1303
- Fix AI SDK being unable to handle the AI returning a JSON code block (FIR-1277) by @mogery in #1280
- Fix/p token by @nickscamara in #1305
Features & Improvements
- (feat/ai-sdk) Migrate to AI-SDK by @nickscamara in #1220
- (feat/auto-recharge) Send email suggesting an upgrade when hitting auto recharges by @nickscamara in #1237
- feat(self-host/ai): Use any OpenAI-compatible API by @mogery in #1245
- feat(self-host/ai): Pass in the Ollama envs into Docker Compose by @brrock in #1269
- feat(v1/crawl-status-ws): Update behavior to ignore errors like regular crawl-status (FIR-1106) by @mogery in #1234
- feat(fire-index): Added new fire-index by @nickscamara in #1263
- feat(supabase): Add read replica routing by @mogery in #1274
- feat(crawler): Handle cross-origin redirects differently than same-origin redirects by @mogery in #1279
- (feat/batch-billing): Batch billing by @nickscamara in #1264
- feat(tests/snips): Add billing tests + misc billing fixes (FIR-1280) by @mogery in #1283
New Implementations
- Implemented GitHub analyzer by @aparupganguly in #1229
- Implemented Claude 3.7 web crawler by @aparupganguly in #1257
- examples/Add GPT-4.5 web crawler by @aparupganguly in #1276
- examples/Add Claude 3.7 web extractor by @aparupganguly in #1291
- Add groq_web_crawler example and dependencies by @ceewaigit in #1267
Documentation & Maintenance
- docs: Remove undefined "required" field by @jmporchet in #1282
- Update Dockerfile by @mogery in #1232
New Contributors
- @GrassH made their first contribution in #1265
- @brrock made their first contribution in #1269
- @jmporchet made their first contribution in #1282
- @ceewaigit made their first contribution in #1267
Full Changelog: v1.5.0...v1.6.0
What's Changed
- fix(scrapeURL/engines/fetch): discover charset and re-decode by @mogery in #1221
- fix(crawl-redis): ignore empty includes/excludes by @mogery in #1223
- Feat/added eval run after deploy workflow by @rafaelsideguide in #1224
- (feat/ai-sdk) Migrate to AI-SDK by @nickscamara in #1220
- Implemented github analyzer by @aparupganguly in #1229
- Update Dockerfile (#1231) by @mogery in #1232
- (fix/token-slicer) Fixes extract token limit issues by @nickscamara in #1236
- (feat/auto-recharge) Send email suggesting an upgrade when hitting auto recharges by @nickscamara in #1237
- feat(self-host/ai): use any OpenAI-compatible API by @mogery in #1245
- feat(v1/crawl-status-ws): update behavior to ignore errors like regular crawl-status (FIR-1106) by @mogery in #1234
- Implemented claude 3.7 web crawler by @aparupganguly in #1257
- (feat/fire-index) Added new fire-index by @nickscamara in #1263
- fix(scraper): improve charset detection regex to accurately parse meta tags by @GrassH in #1265
- feat(self-host/ai): pass in the ollama envs into docker compose by @brrock in #1269
- (feat/deep-research-alpha) Added Max Urls, Sources and Fixes by @nickscamara in #1271
- (feat/batch-billing) Batch billing by @nickscamara in #1264
- feat(supabase): add read replica routing by @mogery in #1274
- examples/Add GPT-4.5 web crawler by @aparupganguly in #1276
- feat(crawler): handle cross-origin redirects differently than same-origin redirects by @mogery in #1279
- docs: remove undefined "required" field by @jmporchet in #1282
Self-Host Overhaul - v1.5.0
Self-Host Fixes
- Reworked Guide: The SELF_HOST.md and docker-compose.yaml have been updated for clarity and compatibility
- Kubernetes Improvements: Updated self-hosted Kubernetes deployment examples for compatibility and consistency (#1177)
- Self-Host Fixes: Numerous fixes aimed at improving self-host performance and stability (#1207)
- Proxy Support: Added proxy support tailored for self-hosted environments (#1212)
- Playwright Integration: Added fixes and continuous integration for the Playwright microservice (#1210)
- Search Endpoint Upgrade: Added SearXNG support for the /search endpoint (#1193)
Core Fixes & Enhancements
- Crawl Status Fixes: Fixed various race conditions in the crawl status endpoint (#1184)
- Timeout Enforcement: Added timeout for scrapeURL engines to prevent hanging requests (#1183)
- Query Parameter Retention: Map function now preserves query parameters in results (#1191)
- Screenshot Action Order: Ensured screenshots execute after specified actions (#1192)
- PDF Scraping: Improved handling for PDFs behind anti-bot measures (#1198)
- Map/scrapeURL Abort Control: Integrated AbortController to stop scraping when the request times out (#1205)
- SDK Timeout Enforcement: Enforced request timeouts in the SDK (#1204)
New Features & Additions
- Proxy & Stealth Options: Introduced a proxy option and stealthProxy flag (#1196); see the sketch after this list
- Deep Research (Alpha): Launched an alpha implementation of deep research (#1202)
- LLM Text Generator: Added a new endpoint for llms.txt generation (#1201)
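A minimal sketch of the new proxy option on a v1 scrape request, assuming the flag accepts a "stealth" value along the lines of #1196; check the scrape docs for the exact accepted values and response shape.

```python
import requests

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Scrape through a stealth proxy; the "stealth" value and response layout are assumptions.
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers=headers,
    json={"url": "https://example.com", "formats": ["markdown"], "proxy": "stealth"},
)
data = resp.json()
print(data)  # the scraped markdown is expected under data["data"]["markdown"]
```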
Docker & Containerization
- Production-Ready Docker Image: A streamlined, production-ready Docker image is now available to simplify self-hosted deployments.
For the complete details, check out the full changelog.
What's Changed
- fix(crawl-status): consider concurrency limited jobs as prioritized (FIR-851) by @mogery in #1184
- fix(scrapeURL/sb): enforce timeout (FIR-980) by @mogery in #1183
- fix(map): do not remove query parameters from results (FIR-1015) by @mogery in #1191
- fix(scrapeURL/fire-engine): perform format screenshot after specified actions (FIR-985) by @mogery in #1192
- Update self-hosted Kubernetes deployments examples for compatibility and consistency by @tetuyoko in #1177
- fix(v1/types): fix extract -> json rename (FIR-1072) by @mogery in #1195
- feat(v1): proxy option / stealthProxy flag (FIR-1050) by @mogery in #1196
- fix(v1/types): fix extract -> json rename, ROUND II (FIR-1072) by @mogery in #1199
- (feat/deep-research) Alpha implementation of deep research by @nickscamara in #1202
- Add llmstxt generator endpoint by @ericciarla in #1201
- fix(concurrency-limit): move to renewing a lock on each active job instead of estimating time to complete (FIR-1075) by @mogery in #1197
- SELFHOST FIXES (FIR-1105) by @mogery in #1207
- feat(v1/map): stop mapping if timed out via AbortController (FIR-747) by @mogery in #1205
- Playwright page error schema by @makeiteasierapps in #1172
- feat(ci/self-host): add playwright microservice tests by @mogery in #1210
- feat(scrapeURL): handle PDFs behind anti-bot (FIR-722) by @mogery in #1198
- Use correct list typing for py 3.8 support by @niazarak in #931
- feat(map): mock support (FIR-1109) by @mogery in #1213
- Add searxng for search endpoint by @loorisr in #1193
- feat(sdk): enforce timeout on client-side if set (FIR-864) by @mogery in #1204
- feat(self-host): proxy support (FIR-1111) by @mogery in #1212
- temp by @mogery in #1218
- gemini extractor Implementation by @aparupganguly in #1206
New Contributors
- @tetuyoko made their first contribution in #1177
- @makeiteasierapps made their first contribution in #1172
- @niazarak made their first contribution in #931
- @loorisr made their first contribution in #1193
Full Changelog: v1.4.4...v1.5.0
v1.4.4
Features & Enhancements
- Scrape API: Added action & wait time validation (#1146)
- Extraction Improvements: Sources are now shown outside __experimental (#1180), and multi-entity prompts were improved (#1181)
- Environment Setup: Added Serper & Search API env vars to docker-compose (#1147)
- Credit System Update: Now displays "tokens" instead of "credits" when out of tokens (#1178)
Examples
- Gemini 2.0 Crawler: Implemented new crawling example (#1161)
- Gemini TrendFinder: https://github.com/mendableai/gemini-trendfinder
- Normal Search to Open Deep Research: https://github.com/nickscamara/open-deep-research
Fixes
- HTML Transformer: Updated free_string function parameter type (#1163)
- Gemini Crawler: Updated library & improved PDF link extraction (#1175)
- Crawl Queue Worker: Only reports successful page count in num_docs (#1179)
- Scraping & URLs: Fixed relative URLs being resolved against the wrong base URL (#584), and batch scrape now uses the scrape rate limit (#1182)
What's Changed
- [FIR-796] feat(api/types): Add action and wait time validation for scrape requests by @ftonato in #1146
- Implemented Gemini 2.0 crawler by @aparupganguly in #1161
- Add Serper and Search API env vars to docker-compose by @RealLukeMartin in #1147
- fix(html-transformer): Update free_string function parameter type by @carterlasalle in #1163
- Add detection of PDF/image sub-links and extract text via Gemini by @mayooear in #1173
- fix: update gemini library. extract pdf links from scraped content by @mayooear in #1175
- feat(v1/checkCredits): say "tokens" instead of "credits" if out of tokens by @mogery in #1178
- feat(v1/extract) Show sources out of __experimental by @nickscamara in #1180
- (feat/extract) Multi-entity prompt improvements by @nickscamara in #1181
- fix(queue-worker/crawl): only report successful page count in num_docs (FIR-960) by @mogery in #1179
- fix: relative url 2 full url use error base url by @dolonfly in #584
- fix(v1/batch/scrape): use scrape rate limit by @mogery in #1182
New Contributors
- @RealLukeMartin made their first contribution in #1147
- @carterlasalle made their first contribution in #1163
- @mayooear made their first contribution in #1173
- @dolonfly made their first contribution in #584
Full Changelog: v1.4.3...v1.4.4
Examples Week - v1.4.3
Summary of changes
- Open Deep Research: An open source version of OpenAI Deep Research. See https://github.com/nickscamara/open-deep-research
- R1 Web Extractor Feature: New extraction capability added.
- O3-Mini Web Crawler: Introduces a lightweight crawler for specific use cases.
- Updated Model Parameters: Enhancements to o3-mini_company_researcher.
- URL Deduplication: Fixes handling of URLs ending with /, index.html, index.php, etc.
- Improved URL Blocking: Uses tldts parsing for better blocklist management.
- Valid JSON via rawHtml in Scrape: Ensures valid JSON extraction.
- Product Reviews Summarizer: Implements summarization using o3-mini.
- Scrape Options for Extract: Adds more configuration options for extracting data.
- O3-Mini Job Resource Extractor: Extracts job-related resources using o3-mini.
- Cached Scrapes for Extract Evals: Improves performance by using cached data for extraction evals.
What's Changed
- You forgot an 'e' by @sami0596 in #1118
- added cached scrapes to extract by @rafaelsideguide in #1107
- Added R1 web extractor feature by @aparupganguly in #1115
- Feature o3-mini web crawler by @aparupganguly in #1120
- Updated Model Parameters (o3-mini_company_researcher) by @aparupganguly in #1130
- Fix corepack and self hosting setup by @rothnic in #1131
- fix(crawl-redis/generateURLPermutations): dedupe index.html/index.php/slash/bare URL ends (FIR-827) by @mogery in #1134
- feat(blocklist): Improve URL blocking with tldts parsing by @ftonato in #1117
- fix(scrape): allow getting valid JSON via rawHtml (FIR-852) by @mogery in #1138
- Implemented prodcut reviews summarizer using o3 mini by @aparupganguly in #1139
- [Feat] Added scrapeOptions to extract by @rafaelsideguide in #1133
- Feature/o3 mini job resource extractor by @aparupganguly in #1144
New Contributors
- @sami0596 made their first contribution in #1118
- @aparupganguly made their first contribution in #1115
- @rothnic made their first contribution in #1131
Full Changelog: v1.4.2...v1.4.3
Extract and API Improvements - v1.4.2
We're excited to announce several new features and improvements:
New Features
- Added web search capabilities to the extract endpoint via the enableWebSearch parameter (see the sketch after this list)
- Introduced source tracking with the __experimental_showSources parameter
- Added configurable webhook events for crawl and batch operations
- New timeout parameter for the map endpoint
- Optional ad blocking with the blockAds parameter (enabled by default)
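A rough sketch of the new parameters in practice, assuming request bodies like the ones below; field placement and response shapes may differ slightly, so treat this as illustrative rather than canonical.

```python
import requests

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Extract with web search enabled and experimental source tracking (async: returns a job to poll).
extract_job = requests.post(
    "https://api.firecrawl.dev/v1/extract",
    headers=headers,
    json={
        "urls": ["https://example.com"],
        "prompt": "Extract the company mission and pricing tiers.",
        "enableWebSearch": True,
        "__experimental_showSources": True,
    },
).json()

# Map with the new timeout parameter (milliseconds assumed).
site_map = requests.post(
    "https://api.firecrawl.dev/v1/map",
    headers=headers,
    json={"url": "https://example.com", "timeout": 15000},
).json()
print(extract_job, site_map)
```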
Infrastructure & UI
- Enhanced proxy selection and infrastructure reliability
- Added domain checker tool to cloud platform
- Redesigned LLMs.txt generator interface for better usability
What's Changed
- (feat/extract) Refactor and Reranker improvements by @nickscamara in #1100
- Fix bad WebSocket URL in CrawlWatcher by @ProfHercules in #1053
- (feat/extract) Add sources to the extraction by @nickscamara in #1101
- feat(v1/map): Timeout parameter (FIR-393) by @mogery in #1105
- fix(scrapeURL/fire-engine): default to separate US-generic proxy list if no location is specified (FIR-728) by @mogery in #1104
- feat(scrapeUrl/fire-engine): add blockAds flag (FIR-692) by @mogery in #1106
- (feat/extract) Logs analyzeSchemaAndPrompt output did not match the schema by @nickscamara in #1108
- (feat/extract) Improved completions to use model's limits by @nickscamara in #1109
- feat(v0): store v0 users (team ID) in Redis for collection (FIR-698) by @mogery in #1111
- feat(github/ci): connect to tailscale (FIR-748) by @mogery in #1112
- (feat/conc) Move fully to a concurrency limit system by @nickscamara in #1045
- Added instructions for empty string to extract prompts by @rafaelsideguide in #1114
New Contributors
- @ProfHercules made their first contribution in #1053
Full Changelog: 1.4.1...v1.4.2
Firecrawl website changelog: https://firecrawl.dev/changelog
Extract Improvements - v1.4.1
We've significantly enhanced our data extraction capabilities with several key updates:
- Extract now returns a lot more data due to a new re-ranker system
- Improved infrastructure reliability
- Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing
- Enhanced crawl cancellation functionality for better control over running jobs
What's Changed
- Added "today" to extract prompts by @rafaelsideguide in #1084
- docs: update cancel crawl response by @ftonato in #1087
- port most of cheerio stuff to rust by @mogery in #1089
- Re-ranker changes by @nickscamara in #1090
- Rerank with lower threshold + back to map if length = 0 by @rafaelsideguide in #1086
Full Changelog: v1.4.0...1.4.1
Introducing /extract - v.1.4.0
Get structured web data with /extract
We're excited to announce the release of /extract - get data from any website with just a prompt. With /extract, you can retrieve any information from anywhere on a website without being limited by scraping roadblocks or the typical context constraints of LLMs.
No more manual copy-pasting, broken scraping scripts, or debugging LLM calls - it's never been easier to enrich your data, create datasets, or power AI applications with clean, structured data from any website.
Companies are already using extract to:
- Enrich CRM data
- Streamline KYB processes
- Monitor competitors
- Supercharge onboarding experiences
- Build targeted prospecting lists
Instead of spending hours manually researching, fixing broken scrapers, or piecing together data from multiple sources, simply specify what information you need and the target website, and let Firecrawl handle the entire retrieval process.
Specifically, you can:
- Extract structured data from entire websites using URL wildcards (https://example.com/*)
- Define custom schemas to capture exactly what you need, from simple product details to complex organizational structures
- Guide the extraction with custom prompts to ensure the LLM focuses on your target information
- Deploy anywhere with comprehensive support for Python, Node, cURL, and other popular tools. For no-code workflows, just connect via Zapier or use our API to set up integrations with other tools.
This versatility translates into a wide range of real-world applications, enabling you to enrich web data for just about any use case.
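Here is a minimal Python sketch of that flow against the v1 API, assuming a JSON-Schema-style schema and an asynchronous job response you poll for results; the Python and Node SDKs wrap the same call and handle the polling for you.

```python
import requests

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Describe exactly what you want back with a schema and a guiding prompt.
schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "price": {"type": "string"}},
            },
        }
    },
}

# The wildcard URL lets Firecrawl pick the relevant pages across the whole site.
job = requests.post(
    "https://api.firecrawl.dev/v1/extract",
    headers=headers,
    json={
        "urls": ["https://example.com/*"],
        "prompt": "List every product with its price.",
        "schema": schema,
    },
).json()
print(job)  # expected to contain a job id you can poll for the structured result
```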
Limitations (and the road ahead)
Let's be honest - while /extract is pretty awesome at grabbing web data, it's not perfect yet. Here's what we're still working on:
- Big sites are tricky - it can't (yet!) grab every single product on Amazon in one go
- Complex searches need work - things like "find all posts from 2025" aren't quite there
- Sometimes, it's a bit quirky - results can vary between runs, though it usually gets what you need
But here's the exciting part: we're seeing the future of web scraping take shape.
Try it out
Curious to try /extract out for yourself?
- Visit our playground to try out /extract - you get 500,000 tokens for free
- Dive into our Extract Beta documentation for detailed technical guidance and API reference
- Want a no-code solution? Connect /extract to thousands of applications through our enhanced Zapier integration
That's all for now! Happy Extracting from the whole Firecrawl team 🔥
Full Changelog: v.1.3.0...v1.4.0
v1.3 - /extract improvements
What's Changed
- feat: new snips test framework (FIR-414) by @mogery in #1033
- (feat/extract) New re-ranker + multi entity extraction by @nickscamara in #1061
- __experimental_streamSteps by @nickscamara in #1063
Full Changelog: v1.2.1...v.1.3.0
v1.2.1 - /extract Beta Improvements
What's Changed
- Indexes, Caching for /extract, Improvements by @nickscamara in #1037
- [SDK] fixed none and undefined on response by @rafaelsideguide in #1034
- feat: use new random user agent instead of the old one by @1101-1 in #1038
- (feat/extract) Move extract to a queue system by @nickscamara in #1044
/extract (beta) changes
- We have updated the /extract endpoint to now be asynchronous. When you make a request to /extract, it will return an ID that you can use to check the status of your extract job (sketched below). If you are using our SDKs, there are no changes required to your code, but please make sure to update the SDKs to the latest versions as soon as possible.
- For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model.
- For more details about the parameters, refer to the docs sent to you.
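For direct API users, a minimal sketch of the asynchronous flow, assuming the start response carries an "id" and the status lives at /v1/extract/{id} with a "status" field; confirm the exact routes and field names against the docs you received.

```python
import requests
import time

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Start the extract job; the endpoint now returns immediately with a job ID.
start = requests.post(
    "https://api.firecrawl.dev/v1/extract",
    headers=headers,
    json={"urls": ["https://example.com"], "prompt": "Extract the page title and summary."},
).json()
job_id = start["id"]  # field name assumed

# Poll the job until it finishes ("status" values assumed).
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v1/extract/{job_id}", headers=headers
    ).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(2)
print(status)
```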
New Contributors
Full Changelog: v1.2.0...v1.2.1
Changelog: https://www.firecrawl.dev/changelog#/extract-changes
v1.2.0 - v1/search is now available!
/v1/search
The search endpoint combines web search with Firecrawl's scraping capabilities to return full page content for any query.
Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise it defaults to SERP results (url, title, description).
More info in the /v1/search docs.
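A short Python sketch of both modes, plain SERP results versus full markdown via scrapeOptions; the endpoint and parameters mirror the description above, while response field names may vary.

```python
import requests

headers = {"Authorization": "Bearer fc-YOUR_API_KEY", "Content-Type": "application/json"}

# Default: SERP-style results (url, title, description) for the query.
serp = requests.post(
    "https://api.firecrawl.dev/v1/search",
    headers=headers,
    json={"query": "open source web crawlers"},
).json()

# With scrapeOptions: each result also comes back with full markdown content.
full = requests.post(
    "https://api.firecrawl.dev/v1/search",
    headers=headers,
    json={"query": "open source web crawlers", "scrapeOptions": {"formats": ["markdown"]}},
).json()
print(serp, full)
```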
What's Changed
- /extract URL trace by @nickscamara in #1014
- (feat/v1) Search by @nickscamara in #1032
Full Changelog: v1.1.1...v1.2.0