lib2docScrape

A comprehensive web scraping tool for library documentation with multiple backend support. Designed to efficiently crawl, process, and organize technical documentation from various sources with intelligent content extraction and quality assurance.

Features

🚀 Multi-Backend Architecture

  • Crawl4AI: Advanced AI-powered crawling with JavaScript rendering
  • Playwright: High-performance browser automation for modern web apps
  • Lightpanda: Lightweight browser engine for efficient scraping
  • Scrapy: High-throughput crawling for large-scale operations
  • HTTP Backend: Simple HTTP requests for basic content
  • File Backend: Local file system processing
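
The Python-level API for picking a backend is not spelled out in this README (the CLI exposes it via --backend, see Quick Start). The sketch below only illustrates the general idea of a pluggable backend registry; the names BACKENDS, register_backend, and select_backend are hypothetical and are not lib2docScrape's actual interfaces.

# Illustrative sketch only: BACKENDS, register_backend, and select_backend
# are hypothetical names, not lib2docScrape's actual backend API.
BACKENDS: dict[str, type] = {}


def register_backend(name: str, backend_cls: type) -> None:
    """Register a backend implementation under a short name."""
    BACKENDS[name] = backend_cls


def select_backend(name: str, default: str = "http") -> type:
    """Return the requested backend class, falling back to the default."""
    return BACKENDS.get(name, BACKENDS[default])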

📄 Content Processing

  • Smart Structure Detection: Automatic identification of documentation sections
  • Format Support: HTML, Markdown, reStructuredText, and more
  • Code Extraction: Syntax highlighting and code block preservation
  • Metadata Extraction: Automatic title, description, and tag detection
  • Asset Handling: Images, PDFs, and other media files
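
The processor pipeline's own API is not documented in this README; the snippet below is a minimal sketch of the kind of structure and code extraction described above, using beautifulsoup4 (a listed core dependency) rather than the project's processors.

from bs4 import BeautifulSoup

html = """
<h1>Getting Started</h1>
<p>Install the package first.</p>
<pre><code class="language-python">pip install example</code></pre>
"""

soup = BeautifulSoup(html, "html.parser")

# Headings give the section structure; <code> elements are preserved verbatim.
sections = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]
code_blocks = [c.get_text() for c in soup.find_all("code")]

print(sections)     # ['Getting Started']
print(code_blocks)  # ['pip install example']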

πŸ” Quality Assurance

  • Content Validation: Automated quality checks and scoring
  • Link Verification: Broken link detection and reporting
  • Duplicate Detection: Content deduplication and similarity analysis
  • Metadata Verification: Required field validation
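
The project's quality checker is not shown here; the following is a minimal, self-contained sketch of the validation idea, mirroring the min_content_length and required_metadata_fields options that appear in config.yaml below.

def check_quality(content: str, metadata: dict,
                  min_content_length: int = 100,
                  required_fields: tuple = ("title", "description")) -> list[str]:
    """Return a list of quality issues; an empty list means the page passes."""
    issues = []
    if len(content) < min_content_length:
        issues.append(f"content shorter than {min_content_length} characters")
    for field in required_fields:
        if not metadata.get(field):
            issues.append(f"missing required metadata field: {field}")
    return issues


print(check_quality("Short page.", {"title": "Example"}))
# ['content shorter than 100 characters', 'missing required metadata field: description']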

🎯 Advanced URL Handling

  • Intelligent Classification: Automatic URL type detection
  • Security Validation: Path traversal and malicious URL detection
  • Normalization: RFC-compliant URL standardization
  • Domain Analysis: TLD extraction and subdomain classification
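
The internal src/utils/url helpers are not documented in this README; the domain analysis step can be illustrated with tldextract (a listed core dependency), which splits a URL into subdomain, registered domain, and public suffix.

import tldextract

parts = tldextract.extract("https://docs.example.co.uk/api/v1")
print(parts.subdomain)  # 'docs'
print(parts.domain)     # 'example'
print(parts.suffix)     # 'co.uk'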

Installation

Prerequisites

  • Python 3.9 or higher
  • uv (recommended) or pip for package management

Using uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/VespianRex/lib2docscrape.git
cd lib2docscrape

# Install with development dependencies
uv sync --extra dev

# Or install specific backend extras
uv sync --extra crawl4ai --extra playwright

Using pip

pip install -e .[dev]  # Development installation
pip install -e .[all]  # All optional dependencies

Quick Start

Development Setup

# Run tests to verify installation
uv run pytest

# Start the web interface
uv run python src/main.py

# Run with specific backend
uv run python -m src.main --backend crawl4ai

Basic Usage

import asyncio

from src.crawler import DocumentationCrawler
from src.crawler.models import CrawlTarget, CrawlerConfig


async def main():
    # Create crawler configuration
    config = CrawlerConfig(
        concurrent_requests=5,
        requests_per_second=2.0,
        max_retries=3
    )

    # Define crawl target
    target = CrawlTarget(
        url="https://docs.example.com",
        depth=2,
        follow_external=False
    )

    # Initialize and run the crawler (crawl is a coroutine, so it must be awaited)
    crawler = DocumentationCrawler(config)
    result = await crawler.crawl(target)

    print(f"Crawled {len(result.pages)} pages")


asyncio.run(main())

Web Interface

  1. Start the server: uv run python src/main.py
  2. Access the web interface at http://localhost:8000
  3. Use the dashboard to:
    • Configure scraping targets
    • Monitor scraping progress
    • View and organize documentation
    • Export documentation

Supported Documentation Types

  • API Documentation

    • OpenAPI/Swagger
    • API Blueprint
    • GraphQL schemas
  • Technical Documentation

    • Product documentation
    • User guides
    • Developer guides
    • Reference documentation
  • Knowledge Bases

    • Wiki pages
    • Help centers
    • FAQs
  • Source Code Documentation

    • JSDoc
    • Python docstrings
    • JavaDoc
    • Doxygen

Configuration

Main Configuration (config.yaml)

crawler:
  concurrent_requests: 5
  requests_per_second: 10
  max_retries: 3
  request_timeout: 30

processing:
  allowed_tags:
    - p
    - h1
    - h2
    - code
  code_languages:
    - python
    - javascript
  max_content_length: 5000000

quality:
  min_content_length: 100
  required_metadata_fields:
    - title
    - description
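
The README does not show how config.yaml is loaded in code; a minimal sketch, assuming PyYAML is available and that the crawler keys map onto the CrawlerConfig fields used in Basic Usage above (request_timeout is omitted because its constructor argument is not confirmed here):

import yaml
from src.crawler.models import CrawlerConfig

with open("config.yaml") as f:
    raw = yaml.safe_load(f)

# Only the fields demonstrated in Basic Usage are passed through.
config = CrawlerConfig(
    concurrent_requests=raw["crawler"]["concurrent_requests"],
    requests_per_second=raw["crawler"]["requests_per_second"],
    max_retries=raw["crawler"]["max_retries"],
)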

Target Configuration (targets.yaml)

- url: "https://docs.example.com/"
  depth: 2
  follow_external: false
  content_types:
    - "text/html"
    - "text/markdown"
  exclude_patterns:
    - "/downloads/"
    - "/community/"
  required_patterns:
    - "/docs/"
    - "/api/"

For detailed configuration options and examples, see the User Guide.

Testing

The project has comprehensive test coverage with 598+ tests:

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test categories
uv run pytest tests/url/          # URL handling tests
uv run pytest tests/backends/     # Backend tests
uv run pytest tests/integration/  # Integration tests

Requirements

Core Dependencies

  • Python 3.9+ (tested on 3.9, 3.10, 3.11, 3.12)
  • Core Libraries:
    aiohttp>=3.9.1          # Async HTTP client
    beautifulsoup4>=4.12.2  # HTML parsing
    pydantic>=2.5.2         # Data validation
    scrapy>=2.11.0          # Web crawling framework
    tldextract>=3.1.0       # Domain parsing
    

Optional Backend Dependencies

  • Crawl4AI: crawl4ai>=0.2.0
  • Playwright: playwright>=1.40.0
  • Development: pytest, ruff, coverage

Full dependency specifications are listed in pyproject.toml.

Architecture

πŸ—οΈ Modular Design

src/
├── backends/      # Pluggable scraping backends
├── crawler/       # Core crawling logic
├── processors/    # Content processing pipeline
├── utils/url/     # Advanced URL handling
├── organizers/    # Documentation organization
├── ui/            # Web interface
└── main.py        # Application entry point

🔧 Core Components

  1. Backend System (src/backends/)

    • Pluggable architecture with automatic backend selection
    • Support for multiple scraping engines (Crawl4AI, Playwright, Scrapy)
    • Intelligent fallback and load balancing (see the sketch after this list)
  2. URL Processing (src/utils/url/)

    • RFC-compliant URL normalization and validation
    • Security checks (path traversal, malicious URLs)
    • Domain classification and TLD extraction
  3. Content Pipeline (src/processors/)

    • Structure-aware content extraction
    • Metadata enrichment and validation
    • Quality scoring and filtering
  4. Quality Assurance (src/processors/quality_checker.py)

    • Automated content validation
    • Link verification and health checks
    • Duplicate detection and deduplication
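
The fallback behavior mentioned under the Backend System above is not detailed in this README. The helper below is a generic illustration of the idea, with a hypothetical signature that is not part of lib2docScrape:

# Generic illustration only: try_backends and its signature are hypothetical.
async def try_backends(url: str, backends: list) -> str:
    """Try each backend in order and return the first successful fetch."""
    last_error = None
    for backend in backends:
        try:
            return await backend.fetch(url)
        except Exception as exc:  # fall through to the next backend
            last_error = exc
    raise RuntimeError(f"all backends failed for {url}") from last_error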

Development

Setting Up Development Environment

# Clone and setup
git clone https://github.com/yourusername/lib2docscrape.git
cd lib2docscrape
uv sync --extra dev

# Run tests
uv run pytest

# Code quality checks
uv run ruff check src/
uv run ruff format src/

Project Status

  • ✅ 598 tests passing (100% success rate)
  • ✅ Comprehensive backend support (6 different backends)
  • ✅ Advanced URL handling with security validation
  • ✅ Quality assurance pipeline with automated checks
  • 🔄 Active development with regular improvements

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Follow TDD: Write tests first, then implement features
  4. Run quality checks: uv run ruff check && uv run pytest
  5. Commit changes: Use conventional commit messages
  6. Push to branch: git push origin feature/amazing-feature
  7. Create Pull Request

Development Guidelines

  • Follow Test-Driven Development (TDD)
  • Maintain 100% test pass rate
  • Use type hints and docstrings
  • Follow the existing code style (ruff formatting)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support & Documentation


lib2docScrape - Comprehensive documentation scraping made simple
