A comprehensive web scraping tool for library documentation with multiple backend support. Designed to efficiently crawl, process, and organize technical documentation from various sources with intelligent content extraction and quality assurance.
- Crawl4AI: Advanced AI-powered crawling with JavaScript rendering
- Playwright: High-performance browser automation for modern web apps
- Lightpanda: Lightweight browser engine for efficient scraping
- Scrapy: High-throughput crawling for large-scale operations
- HTTP Backend: Simple HTTP requests for basic content
- File Backend: Local file system processing
- Smart Structure Detection: Automatic identification of documentation sections
- Format Support: HTML, Markdown, reStructuredText, and more
- Code Extraction: Syntax highlighting and code block preservation
- Metadata Extraction: Automatic title, description, and tag detection
- Asset Handling: Images, PDFs, and other media files
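
For illustration, the kind of extraction described above can be sketched with BeautifulSoup (already a core dependency). The selectors and returned fields here are illustrative assumptions, not the project's actual processing pipeline:

```python
from bs4 import BeautifulSoup

def extract_page(html: str) -> dict:
    """Illustrative extraction of title, description, and code blocks."""
    soup = BeautifulSoup(html, "html.parser")

    # Metadata extraction: page title and meta description
    title = soup.title.get_text(strip=True) if soup.title else None
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content") if meta else None

    # Code extraction: preserve code blocks and any language hint
    code_blocks = []
    for pre in soup.find_all("pre"):
        code = pre.find("code") or pre
        classes = code.get("class") or []
        language = next(
            (c.removeprefix("language-") for c in classes if c.startswith("language-")),
            None,
        )
        code_blocks.append({"language": language, "code": code.get_text()})

    return {"title": title, "description": description, "code_blocks": code_blocks}
```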
- Content Validation: Automated quality checks and scoring
- Link Verification: Broken link detection and reporting
- Duplicate Detection: Content deduplication and similarity analysis
- Metadata Verification: Required field validation
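
A minimal sketch of what these checks boil down to, assuming a simple page dictionary; the field names and thresholds are placeholders, not the API of the project's quality checker:

```python
# Hypothetical sketch of content validation; field names and thresholds
# are placeholders, not lib2docscrape's actual quality-checker API.
def check_quality(
    page: dict,
    min_content_length: int = 100,
    required_fields: tuple[str, ...] = ("title", "description"),
) -> list[str]:
    """Return a list of human-readable quality issues for a scraped page."""
    issues = []
    if len(page.get("content", "")) < min_content_length:
        issues.append(f"content shorter than {min_content_length} characters")
    for field in required_fields:
        if not page.get(field):
            issues.append(f"missing required metadata field: {field}")
    return issues

# Example: a page with thin content and no description yields two issues
print(check_quality({"title": "API Reference", "content": "TODO"}))
```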
- Intelligent Classification: Automatic URL type detection
- Security Validation: Path traversal and malicious URL detection
- Normalization: RFC-compliant URL standardization
- Domain Analysis: TLD extraction and subdomain classification
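
The same ideas can be illustrated with the standard library plus `tldextract` (a core dependency); this is a simplified sketch under those assumptions, not the `src/utils/url/` API:

```python
from urllib.parse import urlsplit, urlunsplit

import tldextract

def normalize_and_classify(url: str) -> dict:
    """Simplified URL normalization, safety check, and domain analysis."""
    parts = urlsplit(url)
    # Basic normalization: lowercase scheme and host, drop the fragment
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
    # Crude path-traversal check
    suspicious = ".." in parts.path

    ext = tldextract.extract(normalized)
    return {
        "normalized": normalized,
        "suspicious": suspicious,
        "subdomain": ext.subdomain,  # e.g. "docs"
        "domain": ext.domain,        # e.g. "example"
        "suffix": ext.suffix,        # e.g. "com"
    }

print(normalize_and_classify("HTTPS://Docs.Example.com/api/../guide#intro"))
```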
- Python 3.9 or higher
- uv (recommended) or pip for package management
```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/VespianRex/lib2docscrape.git
cd lib2docscrape

# Install with development dependencies
uv sync --extra dev

# Or install specific backend extras
uv sync --extra crawl4ai --extra playwright
```

Or install with pip:

```bash
pip install -e .[dev]  # Development installation
pip install -e .[all]  # All optional dependencies
```

```bash
# Run tests to verify installation
uv run pytest
```
```bash
# Start the web interface
uv run python src/main.py

# Run with specific backend
uv run python -m src.main --backend crawl4ai
```
```python
import asyncio

from src.crawler import DocumentationCrawler
from src.crawler.models import CrawlTarget, CrawlerConfig

# Create crawler configuration
config = CrawlerConfig(
    concurrent_requests=5,
    requests_per_second=2.0,
    max_retries=3,
)

# Define crawl target
target = CrawlTarget(
    url="https://docs.example.com",
    depth=2,
    follow_external=False,
)

async def main():
    # Initialize and run crawler
    crawler = DocumentationCrawler(config)
    result = await crawler.crawl(target)
    print(f"Crawled {len(result.pages)} pages")

asyncio.run(main())
```
- Start the server: `uv run python src/main.py`
- Access the web interface at http://localhost:8000
- Use the dashboard to:
  - Configure scraping targets
  - Monitor scraping progress
  - View and organize documentation
  - Export documentation
- API Documentation
  - OpenAPI/Swagger
  - API Blueprint
  - GraphQL schemas
- Technical Documentation
  - Product documentation
  - User guides
  - Developer guides
  - Reference documentation
- Knowledge Bases
  - Wiki pages
  - Help centers
  - FAQs
- Source Code Documentation
  - JSDoc
  - Python docstrings
  - JavaDoc
  - Doxygen
```yaml
crawler:
  concurrent_requests: 5
  requests_per_second: 10
  max_retries: 3
  request_timeout: 30

processing:
  allowed_tags:
    - p
    - h1
    - h2
    - code
  code_languages:
    - python
    - javascript
  max_content_length: 5000000

quality:
  min_content_length: 100
  required_metadata_fields:
    - title
    - description
```

Crawl targets are configured as a list of entries:

```yaml
- url: "https://docs.example.com/"
  depth: 2
  follow_external: false
  content_types:
    - "text/html"
    - "text/markdown"
  exclude_patterns:
    - "/downloads/"
    - "/community/"
  required_patterns:
    - "/docs/"
    - "/api/"
```
For detailed configuration options and examples, see the User Guide.
The project has comprehensive test coverage with 598+ tests:
```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test categories
uv run pytest tests/url/          # URL handling tests
uv run pytest tests/backends/     # Backend tests
uv run pytest tests/integration/  # Integration tests
```
- Python 3.9+ (tested on 3.9, 3.10, 3.11, 3.12)
- Core Libraries:
  - `aiohttp>=3.9.1` (async HTTP client)
  - `beautifulsoup4>=4.12.2` (HTML parsing)
  - `pydantic>=2.5.2` (data validation)
  - `scrapy>=2.11.0` (web crawling framework)
  - `tldextract>=3.1.0` (domain parsing)
- Crawl4AI: `crawl4ai>=0.2.0`
- Playwright: `playwright>=1.40.0`
- Development: `pytest`, `ruff`, `coverage`

Full dependency specifications are in `pyproject.toml`.
```
src/
├── backends/     # Pluggable scraping backends
├── crawler/      # Core crawling logic
├── processors/   # Content processing pipeline
├── utils/url/    # Advanced URL handling
├── organizers/   # Documentation organization
├── ui/           # Web interface
└── main.py       # Application entry point
```
- Backend System (`src/backends/`)
  - Pluggable architecture with automatic backend selection
  - Support for multiple scraping engines (Crawl4AI, Playwright, Scrapy)
  - Intelligent fallback and load balancing
- URL Processing (`src/utils/url/`)
  - RFC-compliant URL normalization and validation
  - Security checks (path traversal, malicious URLs)
  - Domain classification and TLD extraction
- Content Pipeline (`src/processors/`)
  - Structure-aware content extraction
  - Metadata enrichment and validation
  - Quality scoring and filtering
- Quality Assurance (`src/processors/quality_checker.py`)
  - Automated content validation
  - Link verification and health checks
  - Duplicate detection and deduplication
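
A minimal sketch of what a pluggable backend contract of this kind can look like; the names below are hypothetical, and the real interface in `src/backends/` may differ:

```python
# Hypothetical sketch of a pluggable backend interface; the real contract
# lives in src/backends/ and may differ.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class FetchResult:
    url: str
    html: str
    status: int

class Backend(Protocol):
    name: str

    async def fetch(self, url: str) -> FetchResult:
        """Fetch a single URL and return its raw content."""
        ...

def select_backend(backends: list[Backend], needs_javascript: bool) -> Backend:
    """Naive selection: prefer a JavaScript-capable backend when required."""
    js_capable = {"crawl4ai", "playwright", "lightpanda"}
    if needs_javascript:
        for backend in backends:
            if backend.name in js_capable:
                return backend
    return backends[0]
```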
```bash
# Clone and setup
git clone https://github.com/yourusername/lib2docscrape.git
cd lib2docscrape
uv sync --extra dev

# Run tests
uv run pytest

# Code quality checks
uv run ruff check src/
uv run ruff format src/
```
- 598 tests passing (100% success rate)
- Comprehensive backend support (6 different backends)
- Advanced URL handling with security validation
- Quality assurance pipeline with automated checks
- Active development with regular improvements
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Follow TDD: write tests first, then implement features
- Run quality checks: `uv run ruff check && uv run pytest`
- Commit changes using conventional commit messages
- Push to the branch: `git push origin feature/amazing-feature`
- Create a Pull Request
- Follow Test-Driven Development (TDD)
- Maintain 100% test pass rate
- Use type hints and docstrings
- Follow the existing code style (ruff formatting)
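
For example, a contribution might begin with a small, typed test like the one below; the imported helper and its behavior are hypothetical, shown only to illustrate the expected test-first style:

```python
# Hypothetical TDD-style test; the imported helper does not necessarily
# exist under this name or path.
import pytest

from src.utils.url import normalize_url  # hypothetical import

@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("HTTPS://Docs.Example.com/api/", "https://docs.example.com/api/"),
        ("https://docs.example.com/a/../b", "https://docs.example.com/b"),
    ],
)
def test_normalize_url(raw: str, expected: str) -> None:
    """normalize_url should lowercase the host and resolve dot segments."""
    assert normalize_url(raw) == expected
```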
This project is licensed under the MIT License - see the LICENSE file for details.
- User Guide - Comprehensive usage documentation
- API Documentation - Technical reference
- Issue Tracker - Bug reports and feature requests
- Discussions - Community support
lib2docScrape - Comprehensive documentation scraping made simple