Scrape entire websites and convert to clean markdown — perfect for LLM training data, RAG systems, and AI applications. Handles iframes, JavaScript navigation, and complex site structures.
Markdown is the ideal format for working with LLMs:
- ✅ Clean, structured text for training language models
- ✅ Perfect for RAG (Retrieval-Augmented Generation) pipelines
- ✅ Easy to process, chunk, and embed for vector databases
- ✅ Human-readable and Git-friendly for documentation

Features:
- 🕷️ Full site crawling with automatic link discovery
- 🖼️ Iframe support for embedded content
- 🧹 Smart cleanup removes navigation, boilerplate, and duplicates
- 📝 Clean markdown output with readable filenames
- 🚀 Headless browser powered by Playwright (handles JavaScript)
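
To illustrate the "easy to process, chunk, and embed" point above, here is a minimal sketch of splitting one markdown file into heading-based chunks before embedding. It is not part of scrape2md; the function name `chunk_markdown` and the chunk shape are assumptions made for this example.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split markdown text into chunks, one per heading section (hypothetical helper)."""
    chunks = []
    title, lines = "", []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the previous section, if it had any content.
            body = "\n".join(lines).strip()
            if title or body:
                chunks.append({"title": title, "body": body})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    body = "\n".join(lines).strip()
    if title or body:
        chunks.append({"title": title, "body": body})
    return chunks


sample = "intro\n# A\nalpha\n\n## B\nbeta\n"
for chunk in chunk_markdown(sample):
    print(chunk)
```

Text before the first heading becomes a chunk with an empty title. This sketch does not treat fenced code specially, so a `#` comment at the start of a line inside a code block would be read as a heading; a real pipeline would likely want a proper markdown parser.
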
pip install scrape2md
playwright install chromium # One-time browser setup

CLI:
scrape2md https://example.com
scrape2md https://site.com -o docs -m 50 -d 2.0

Python:
from scrape2md import WebScraper
scraper = WebScraper("https://example.com", "output", max_pages=50)
scraper.scrape_site()

scrape2md <url> [options]
-o, --output DIR Output directory (default: scraped_sites)
-m, --max-pages N Max pages to scrape (default: 100)
-d, --delay SECONDS Delay between requests (default: 1.0)
--download-images Download images (off by default)

- Discovers site structure from navigation menus
- Crawls pages breadth-first with Playwright (handles JavaScript)
- Extracts content from iframes and dynamic elements
- Strips boilerplate (nav, footer, ads, login forms)
- Converts to clean markdown with smart filenames
- Detects and skips duplicate content
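
The crawl order and duplicate detection in the list above can be sketched roughly as follows. This is an illustration only, not scrape2md's actual code: the `fetch` callback stands in for the Playwright page load, the in-memory `SITE` dictionary is a hypothetical site, and hashing the page text with SHA-256 is just one plausible way to spot duplicates.

```python
import hashlib
from collections import deque
from typing import Callable


def crawl_bfs(
    start: str,
    fetch: Callable[[str], tuple[str, list[str]]],
    max_pages: int = 100,
) -> dict[str, str]:
    """Visit pages breadth-first and return {url: content}, skipping duplicate content."""
    queue = deque([start])
    seen_urls = {start}
    seen_hashes = set()
    pages = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        content, links = fetch(url)  # assumed to return (page text, outgoing links)
        fetched += 1
        digest = hashlib.sha256(content.strip().encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            # First time this exact content is seen: keep it.
            seen_hashes.add(digest)
            pages[url] = content
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages


# Hypothetical four-page site; "/print/about" repeats the content of "/about".
SITE = {
    "/": ("Home", ["/about", "/docs"]),
    "/about": ("About us", ["/", "/print/about"]),
    "/docs": ("Docs", ["/about"]),
    "/print/about": ("About us", []),
}

pages = crawl_bfs("/", lambda url: SITE[url])
print(list(pages))  # ['/', '/about', '/docs']
```

In this sketch `max_pages` counts fetched pages, including ones later dropped as duplicates; whether the real tool counts fetched or saved pages is not stated here.
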
scraped_sites/
└── example_com/
    ├── Home.md
    ├── About Us.md
    ├── Documentation.md
    └── ...
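
One way names like `example_com/` and `About Us.md` could be derived from a URL and a page title is sketched below. The helpers `site_dir_name` and `page_filename` are hypothetical and only mirror the layout shown above; scrape2md's real naming rules may differ.

```python
import re
from urllib.parse import urlparse


def site_dir_name(url: str) -> str:
    """Turn a site URL into a directory name, e.g. example.com -> example_com."""
    host = urlparse(url).netloc.lower()
    return re.sub(r"[^a-z0-9]+", "_", host).strip("_")


def page_filename(title: str) -> str:
    """Turn a page title into a readable, filesystem-safe markdown filename."""
    # Replace characters that are invalid on common filesystems with spaces.
    cleaned = re.sub(r'[\\/:*?"<>|]+', " ", title)
    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return f"{cleaned or 'Untitled'}.md"


print(site_dir_name("https://example.com"))  # example_com
print(page_filename("About Us"))             # About Us.md
```

Keeping spaces and capitalization in the filename trades shell convenience for readability, which matches the example tree above.
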
- Requires Chromium browser (installed via Playwright)
- Doesn't handle login-protected content
- Google Docs embeds are linked but not downloaded
- Default limit: 100 pages per site (configurable)
git clone https://github.com/taralika/scrape2md.git
cd scrape2md
pip install -e ".[dev]" # Install with dev dependencies (quoted so shells like zsh don't expand the brackets)
playwright install chromium # One-time browser setup
pytest # Run tests
black . # Format code
ruff check . # Lint code
mypy src/ # Type checking

Pull requests welcome! Please open an issue first to discuss major changes.
MIT License - see LICENSE file for details
Anand Taralika - GitHub
0.1.0 (2025-11-19) — Initial release