A high-performance web service that converts JavaScript-heavy documentation sites into clean, LLM-optimized Markdown. Built specifically to solve the problem of AI agents being unable to parse modern documentation sites that rely heavily on client-side rendering.
Why llm.codes exists: Modern AI agents like Claude Code struggle with JavaScript-heavy documentation sites, particularly Apple's developer docs. This tool bridges that gap by converting dynamic content into clean, parseable Markdown that AI agents can actually use.
📖 Read the full story: How llm.codes Transforms Developer Documentation for AI Agents
Modern documentation sites (especially Apple's) use heavy JavaScript rendering that makes content invisible to AI agents. llm.codes solves this by:
- Using Firecrawl's headless browser to execute JavaScript and capture fully-rendered content
- Converting dynamic HTML to clean, semantic Markdown
- Removing noise (navigation, footers, duplicate content) that wastes AI context tokens
- Providing parallel URL processing for efficient multi-page documentation crawling
- Parallel Processing: Fetches up to 20 URLs concurrently using batched promises
- Smart Caching: Redis-backed 30-day cache reduces API calls and improves response times
- Content Filtering: Multiple filtering strategies to remove:
- Navigation elements and boilerplate
- Platform availability strings (iOS 14.0+, etc.)
- Duplicate content across pages
- Empty sections and formatting artifacts
- Recursive Crawling: Configurable depth-first crawling with intelligent link extraction
- Browser Notifications: Web Notifications API integration for background processing alerts
- URL State Management: Query parameter-based URL sharing for easy documentation links
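The batched parallel fetching described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `fetchMarkdown` is a hypothetical stand-in for the real Firecrawl call, and the batch size of 20 comes from the feature list.

```typescript
const BATCH_SIZE = 20;

// Split a list into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process URLs in concurrent batches: each batch runs in parallel,
// batches run sequentially to cap the number of in-flight requests.
async function processUrls(
  urls: string[],
  fetchMarkdown: (url: string) => Promise<string>
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(urls, BATCH_SIZE)) {
    const pages = await Promise.all(batch.map(fetchMarkdown));
    results.push(...pages);
  }
  return results;
}
```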
🚀 Try it now at llm.codes
Experience the tool instantly without any setup required.
- Node.js 20+
- npm or yarn
- Firecrawl API key
- Clone the repository:
```bash
git clone https://github.com/amantus-ai/llm-codes.git
cd llm-codes
```
- Install dependencies:
npm install
- Create a `.env.local` file:
cp .env.local.example .env.local
- Add your Firecrawl API key to `.env.local`:
```bash
# Required
FIRECRAWL_API_KEY=your_api_key_here

# Optional - Redis Cache (recommended for production)
UPSTASH_REDIS_REST_URL=https://your-redis-instance.upstash.io
UPSTASH_REDIS_REST_TOKEN=your_redis_token_here

# Optional - Cache Admin
CACHE_ADMIN_KEY=your_secure_admin_key_here
```
- Run the development server:
npm run dev
The easiest way to deploy is using Vercel:
- Click the button above
- Create a new repository
- Add your `FIRECRAWL_API_KEY` environment variable
- Deploy!
- Push to your GitHub repository
- Import project on Vercel
- Add environment variables:
  - `FIRECRAWL_API_KEY`: Your Firecrawl API key (required)
  - `UPSTASH_REDIS_REST_URL`: Your Upstash Redis URL (optional)
  - `UPSTASH_REDIS_REST_TOKEN`: Your Upstash Redis token (optional)
  - `CACHE_ADMIN_KEY`: Admin key for cache endpoints (optional)
- Deploy
- Enter URL: Paste any documentation URL
  - Most documentation sites are automatically supported through pattern matching
  - Click "Learn more" to see the supported URL patterns
- Configure Options (click "Show Options"):
  - Crawl Depth: How deep to follow links (0 = main page only, max 5)
  - Max URLs: Maximum number of pages to process (1-1000, default 200)
  - Filter URLs: Remove hyperlinks from content (recommended for LLMs)
  - Deduplicate Content: Remove duplicate paragraphs to save tokens
  - Filter Availability: Remove platform availability strings (iOS 14.0+, etc.)
- Process: Click "Process Documentation" and grant notification permissions if prompted
- Monitor Progress:
  - Real-time progress bar shows completion percentage
  - Activity log displays detailed processing information
  - Browser notifications alert you when complete
- Download: View statistics and download your clean Markdown file
llm.codes uses intelligent pattern matching to support most documentation sites automatically. Rather than maintaining a list of thousands of individual sites, we use regex patterns to match common documentation URL structures.
We support documentation sites that match these patterns:
- Documentation Subdomains (`docs.*`, `developer.*`, `learn.*`, etc.)
  - Examples: `docs.python.org`, `developer.apple.com`, `learn.microsoft.com`
  - Pattern: Any subdomain like docs, developer, dev, learn, help, api, guide, wiki, or devcenter
- Documentation Paths (`/docs`, `/guide`, `/learn`, etc.)
  - Examples: `angular.io/docs`, `redis.io/docs`, `react.dev/learn`
  - Pattern: URLs ending with paths like /docs, /documentation, /api-docs, /guides, /learn, /help, /stable, or /latest
- Programming Language Sites (`*js.org`, `*lang.org`, etc.)
  - Examples: `vuejs.org`, `kotlinlang.org`, `ruby-doc.org`
  - Pattern: Domains ending with js, lang, py, or -doc followed by .org or .com
- GitHub Pages (`*.github.io`)
  - Examples: Any GitHub Pages documentation site
  - Pattern: All subdomains of github.io
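The four pattern categories above can be approximated with regular expressions like these. The exact expressions live in `src/utils/url-utils.ts` and may differ in detail; this is an illustrative sketch.

```typescript
// Illustrative patterns, one per category described above.
const DOC_PATTERNS: RegExp[] = [
  // Documentation subdomains: docs.*, developer.*, learn.*, ...
  /^https?:\/\/(docs|developer|dev|learn|help|api|guide|wiki|devcenter)\./,
  // Documentation paths: /docs, /guides, /learn, ...
  /\/(docs|documentation|api-docs|guides|learn|help|stable|latest)(\/|$)/,
  // Language sites: domains ending in js, lang, py, or -doc before .org/.com
  /^https?:\/\/[^/]*(js|lang|py|-doc)\.(org|com)(\/|$)/,
  // GitHub Pages: any subdomain of github.io
  /^https?:\/\/[^/]+\.github\.io(\/|$)/,
];

function isSupportedDocUrl(url: string): boolean {
  return DOC_PATTERNS.some((pattern) => pattern.test(url));
}
```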
A small number of popular documentation sites don't follow standard patterns and are explicitly supported:
- Swift Package Index (`swiftpackageindex.com`)
- Flask (`flask.palletsprojects.com`)
- Material-UI (`mui.com/material-ui`)
- pip (`pip.pypa.io/en/stable`)
- PHP (`www.php.net/docs.php`)
Most documentation sites are automatically supported! If your site follows standard documentation URL patterns (like having `/docs` in the path or `docs.` as a subdomain), it should work without any changes.
If you find a documentation site that isn't supported, please open an issue and we'll either adjust our patterns or add it as an exception.
| Option | Description | Default | Range |
|---|---|---|---|
| Crawl Depth | How many levels deep to follow links | 2 | 0-5 |
| Max URLs | Maximum number of URLs to process | 200 | 1-1000 |
| Batch Size | URLs processed concurrently | 20 | N/A |
| Cache Duration | How long results are cached | 30 days | N/A |
The core API endpoint that handles documentation conversion.
Request Flow:
- URL validation against allowed domains whitelist
- Cache check (Redis/in-memory with 30-day TTL)
- Firecrawl API call with optimized scraping parameters
- Content post-processing and filtering
- Response with markdown and cache status
Request Body:
```json
{
  "url": "https://developer.apple.com/documentation/swiftui",
  "action": "scrape"
}
```
Response:
```json
{
  "success": true,
  "data": {
    "markdown": "# SwiftUI Documentation\n\n..."
  },
  "cached": false
}
```
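A minimal client for this endpoint might look like the following. The response-unwrapping helper assumes the response shape shown above; the `error` field on failures is an assumption, not documented here.

```typescript
// Shape of a /api/scrape response, per the example above.
// The `error` field on failures is an assumption.
interface ScrapeResponse {
  success: boolean;
  data?: { markdown: string };
  cached?: boolean;
  error?: string;
}

// Unwrap a response, throwing on failure.
function extractMarkdown(payload: ScrapeResponse): string {
  if (!payload.success || !payload.data) {
    throw new Error(payload.error ?? "Scrape failed");
  }
  return payload.data.markdown;
}

// Example client call (base URL assumed to be the deployed app).
async function scrape(url: string): Promise<string> {
  const res = await fetch("https://llm.codes/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, action: "scrape" }),
  });
  return extractMarkdown((await res.json()) as ScrapeResponse);
}
```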
Error Handling:
- Domain validation errors (400)
- Firecrawl API errors (500)
- Network timeouts (504)
- Rate limiting (429)
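One way the error cases above could map onto HTTP statuses is sketched below; this is illustrative, not the actual logic in `route.ts`, which may use typed errors instead of message matching.

```typescript
// Hedged sketch: classify an error into one of the HTTP statuses listed above.
function statusForError(err: Error): number {
  const msg = err.message.toLowerCase();
  if (msg.includes("domain")) return 400;     // domain validation errors
  if (msg.includes("rate limit")) return 429; // rate limiting
  if (msg.includes("timeout")) return 504;    // network timeouts
  return 500;                                 // Firecrawl API / unknown errors
}
```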
- Framework: Next.js 15 with App Router
- Language: TypeScript
- Styling: Tailwind CSS v4
- API: Firecrawl for web scraping
- Cache: Upstash Redis for distributed caching
- Deployment: Vercel
- Development: Turbopack for fast refreshes
```
llm-codes/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   └── scrape/
│   │   │       ├── route.ts          # API endpoint
│   │   │       └── __tests__/        # API tests
│   │   ├── globals.css               # Global styles & Tailwind
│   │   ├── layout.tsx                # Root layout
│   │   ├── page.tsx                  # Main page component
│   │   └── icon.tsx                  # Dynamic favicon
│   ├── constants.ts                  # Configuration constants
│   ├── utils/                        # Utility functions
│   │   ├── content-processing.ts     # Content cleaning logic
│   │   ├── file-utils.ts             # File handling
│   │   ├── notifications.ts          # Browser notifications
│   │   ├── scraping.ts               # Scraping utilities
│   │   ├── url-utils.ts              # URL validation & handling
│   │   └── __tests__/                # Utility tests
│   └── test/
│       └── setup.ts                  # Test configuration
├── public/
│   └── favicon.svg                   # Static favicon
├── next.config.js                    # Next.js configuration
├── postcss.config.js                 # PostCSS with Tailwind v4
├── tsconfig.json                     # TypeScript configuration
├── vitest.config.ts                  # Vitest test configuration
├── spec.md                           # Detailed specification
└── package.json                      # Dependencies
```
- URL Extraction: Custom regex patterns extract links from markdown and HTML
- Domain-Specific Filtering: Each documentation site has custom rules for link following
- Parallel Batch Processing: URLs processed in batches of 10 for optimal performance
- Content Deduplication: Hash-based paragraph and section deduplication
- Multi-Stage Filtering: Sequential filters for URLs, navigation, boilerplate, and platform strings
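The deduplication step can be sketched like this; a normalized-string `Set` stands in for whatever hashing the real `content-processing.ts` uses, and the function also drops empty sections as described in the feature list.

```typescript
// Deduplicate paragraphs: keep the first occurrence of each paragraph,
// comparing on a whitespace- and case-normalized key; drop empty paragraphs.
function deduplicateParagraphs(markdown: string): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const paragraph of markdown.split(/\n{2,}/)) {
    const key = paragraph.trim().toLowerCase().replace(/\s+/g, " ");
    if (key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    kept.push(paragraph.trim());
  }
  return kept.join("\n\n");
}
```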
- Batched API Calls: Reduces Firecrawl API latency by processing multiple URLs per request
- Progressive Loading: UI updates with real-time progress during long crawls
- Smart Link Extraction: Only follows relevant documentation links based on URL patterns
- Client-Side Caching: Browser-based result caching for repeat operations
```bash
# Run all tests
npm test

# Run tests with UI
npm run test:ui

# Run tests with coverage
npm run test:coverage

# Type checking
npm run type-check
```
Tests cover:
- URL validation and domain filtering
- Content processing and deduplication
- API error handling
- Cache behavior
- UI component interactions
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Check browser permissions for notifications
- Ensure you're using a supported browser (Chrome, Firefox, Safari 10.14+, Edge)
- Try resetting notification permissions in browser settings
The app includes a 30-day cache to minimize API calls. If you're hitting rate limits:
- Reduce crawl depth
- Lower maximum URLs
- Wait for cached results
- Consider setting up Redis cache for better performance
For production use, we recommend setting up Redis cache:
- Sign up for Upstash (free tier available)
- Create a Redis database
- Add the credentials to your environment variables
- The app will automatically use Redis for caching
Benefits:
- Cache persists across deployments
- Shared cache across all instances
- Automatic compression for large documents
- ~70% reduction in Firecrawl API calls
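The cache-aside flow behind these benefits can be sketched as follows. `CacheClient` abstracts the Upstash Redis client (its `get`/`set` methods mirror `@upstash/redis`); the key scheme is illustrative, not necessarily what production uses.

```typescript
// Minimal interface matching the parts of the Upstash Redis client we need.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, opts: { ex: number }): Promise<unknown>;
}

const CACHE_TTL_SECONDS = 30 * 24 * 60 * 60; // 30 days

// Cache-aside: return a cached result if present, otherwise scrape and store.
async function getCachedMarkdown(
  cache: CacheClient,
  url: string,
  scrape: (url: string) => Promise<string>
): Promise<{ markdown: string; cached: boolean }> {
  const key = `scrape:${url}`; // hypothetical key scheme
  const hit = await cache.get(key);
  if (hit !== null) return { markdown: hit, cached: true };
  const markdown = await scrape(url);
  await cache.set(key, markdown, { ex: CACHE_TTL_SECONDS });
  return { markdown, cached: false };
}
```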
- Ensure `FIRECRAWL_API_KEY` is set in environment variables
- Check Vercel function logs for errors
- Verify your API key is valid
This project is licensed under the MIT License - see the LICENSE file for details.
LLM Codes supports 69 documentation sites across multiple categories:
- Python, MDN Web Docs, TypeScript, Rust, Go, Java, Ruby, PHP, Swift, Kotlin
- React, Vue.js, Angular, Next.js, Nuxt, Svelte, Django, Flask, Express.js, Laravel
- AWS, Google Cloud, Azure, DigitalOcean, Heroku, Vercel, Netlify, Salesforce
- PostgreSQL, MongoDB, MySQL, Redis, Elasticsearch, Couchbase, Cassandra
- Docker, Kubernetes, Terraform, Ansible, GitHub, GitLab
- PyTorch, TensorFlow, Hugging Face, scikit-learn, LangChain, pandas, NumPy
- Tailwind CSS, Bootstrap, Material-UI, Chakra UI
- npm, webpack, Vite, pip, Cargo, Maven
- Jest, Cypress, Playwright, pytest, Mocha
- React Native, Flutter, Android, Apple Developer
If you need support for a documentation site that's not listed, please open an issue on GitHub!
- Handles JavaScript-heavy sites that traditional scrapers can't parse
- Built-in markdown conversion with semantic structure preservation
- Reliable headless browser automation at scale
- Server-side API key security
- Built-in caching with fetch()
- Streaming responses for large documentation sets
- Edge-ready deployment on Vercel
- Reduces server load for filtering operations
- Enables real-time UI updates during processing
- Allows users to customize output without re-fetching
- WebSocket support for real-time crawl progress
- Custom domain rule configuration
- Batch URL upload via CSV/JSON
- Export to multiple formats (PDF, EPUB, Docusaurus)
- LLM-specific formatting profiles
- Powered by Firecrawl for JavaScript rendering
- Inspired by the challenges of making documentation accessible to AI agents
- Built with Next.js 15, Tailwind CSS v4, and TypeScript
Built by Peter Steinberger | Blog Post | Twitter