A high-performance, configurable web crawler for mirroring websites with parallel processing, intelligent link conversion, SQLite database tracking, and automatic Confluence API mode with full metadata extraction.
The crawler now automatically detects Confluence sites and switches to API-based crawling when credentials are available:
- 33+ Metadata Fields: Versions, history, authors, contributors, labels, restrictions
- Attachment Downloads: Automatic download and URL rewriting
- Accurate Content: Direct from the Confluence REST API
- Automatic Fallback: Uses HTML mode if no credentials are available
- YAML Metadata: Compatible with the bash script format
Quick Setup:
# 1. Create credentials file
cp config/.env.template config/.env
# 2. Add your Confluence API token
# Edit config/.env with your token
# 3. Run normally - API mode activates automatically!
python src/web_crawler.py
See the Full Confluence API Guide for details.
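For reference, API mode amounts to authenticated calls against the Confluence Cloud REST API. The following Python sketch shows the general shape of such a request; the environment variable names are assumptions for illustration, not necessarily what config/.env uses:

import os
import requests

BASE_URL = os.environ["CONFLUENCE_BASE_URL"]    # assumed name, e.g. https://company.atlassian.net/wiki
EMAIL = os.environ["CONFLUENCE_EMAIL"]          # assumed name
API_TOKEN = os.environ["CONFLUENCE_API_TOKEN"]  # assumed name

def fetch_page(page_id):
    """Fetch a page with its body, version, history, and labels expanded."""
    resp = requests.get(
        f"{BASE_URL}/rest/api/content/{page_id}",
        params={"expand": "body.storage,version,history,metadata.labels"},
        auth=(EMAIL, API_TOKEN),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

The real implementation in src/web_crawler.py additionally downloads attachments and writes the extracted metadata out as YAML.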
- Universal Web Crawling: Works with any website (Confluence, documentation sites, wikis, etc.)
- Parallel Processing: Configurable multi-threaded downloads for maximum speed (8+ workers)
- Multiple Output Formats: HTML and Markdown with intelligent content extraction
- Smart Link Conversion: Converts web links to local file references automatically
- Resource Management: Downloads and organizes CSS, images, and other assets
- SQLite Database: Robust progress tracking with atomic operations and concurrent access
- Advanced Reporting: Comprehensive statistics and progress tracking with a Rich UI
- Auto-Migration: Seamless migration from JSON to SQLite format
- Auto-Dependencies: Automatic package installation without user intervention
- Thread-Safe: Advanced locking mechanisms prevent race conditions
- Highly Configurable: YAML configuration files plus a command-line interface
- Cookie Authentication: Support for authenticated sessions
The crawler implements 6 layers of protection against duplicate downloads:
- In-Memory Verification: Fast checks using the downloaded_urls set
- Active Download Tracking: Prevents concurrent downloads of the same URL
- URL Normalization: Removes fragments (#section) and standardizes the format, so https://example.com/page#section1 and https://example.com/page#section2 both map to https://example.com/page (see the sketch below)
- Pre-Queue Filtering: Verifies URLs before adding them to the download queue
- Database Idempotency: SQLite INSERT OR REPLACE prevents duplicate records
- Unique Attachment IDs: Each attachment uses the {confluence_id}_{filename} format
Result: Zero duplicate downloads, even with:
- Multiple links to the same page
- Interrupted and resumed sessions
- Concurrent multi-threaded execution
- Complex Confluence space hierarchies
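As a rough illustration of the normalization and in-memory layers, the following Python sketch (not the crawler's actual method names) strips fragments and claims each URL exactly once across threads:

import threading
from urllib.parse import urlsplit, urlunsplit

_seen_lock = threading.Lock()
_downloaded_urls = set()

def normalize_url(url):
    """Drop the #fragment so page#section1 and page#section2 share one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

def claim_url(url):
    """Return True exactly once per normalized URL, even with many worker threads."""
    key = normalize_url(url)
    with _seen_lock:
        if key in _downloaded_urls:
            return False
        _downloaded_urls.add(key)
        return True

Calling claim_url for https://example.com/page#section1 and then for https://example.com/page#section2 resolves to the same key, so only the first call triggers a download.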
Special handling for Confluence spaces:
- Automatic Space Detection: Recognizes /wiki/spaces/{KEY}/overview URLs
- Complete Space Discovery: Uses the Confluence Search API (CQL) to find ALL pages
- Pagination Support: Handles spaces with 100+ pages automatically
- Intelligent Depth Reset: The space index doesn't consume depth budget
- URL Correction: Automatically adds the /wiki prefix to Confluence URLs
Example: Starting from space overview crawls entire space:
python src/web_crawler.py "https://company.atlassian.net/wiki/spaces/DOCS/overview" 2 DOCS markdown 8PythonHttpTracker/
PythonHttpTracker/
├── src/
│   ├── web_crawler.py           # Main crawler engine with SQLite integration
│   ├── database_manager.py      # SQLite database operations and management
│   ├── dependency_installer.py  # Automatic dependency installation
│   ├── json_migrator.py         # Migration utility from JSON to SQLite
│   └── db_reporter.py           # Database reporting and statistics
├── config/
│   ├── config.yml.example       # Configuration template with database settings
│   └── cookies.template.txt     # Cookie template
├── setup_wizard.sh              # Interactive setup wizard
└── README.md                    # This documentation
- WebCrawler Class: Main crawler engine with SQLite database integration
- DatabaseManager: Complete SQLite abstraction with thread-safe operations
- DependencyInstaller: Automatic installation of required Python packages
- JSONMigrator: Migration utility for existing JSON progress files
- CrawlerReporter: Advanced reporting and statistics generation
- Configuration System: YAML-based configuration with database settings
- Link Processing: Intelligent link extraction and local path conversion
- Resource Management: Shared resource handling with deduplication
- Python 3.8+
- pip (Python package manager)
- Clone the repository:
git clone https://github.com/sjseo298/PythonHttpTracker.git
cd PythonHttpTracker
- Automatic Installation:
python install.py
This intelligent installer will:
- Detect your environment (Dev Container, Codespace, local, etc.)
- Set up a virtual environment (if needed)
- Install all required dependencies automatically
- Handle externally managed environments
- Verify installation completeness
./setup_wizard.sh
The wizard guides you through:
- Installing dependencies automatically
- Creating configuration files
- Setting up authentication
- Running your first crawl
# Install dependencies from requirements.txt
pip install -r requirements.txt
# Or install individually
pip install requests>=2.25.0 beautifulsoup4>=4.9.0 markdownify>=0.11.0 PyYAML>=6.0 rich>=13.0.0
# The crawler will automatically install missing dependencies when run
python src/web_crawler.py
- Configure your target website:
cp config/config.yml.example config/config.yml
# Edit config.yml with your website details and database settings
- Set up authentication (if needed):
cp config/cookies.template.txt config/cookies.txt
# Add your authentication cookies to cookies.txt
- Run the crawler:
python src/web_crawler.py
The crawler will automatically:
- Install missing dependencies
- Create SQLite database
- Migrate existing JSON progress files
- Start crawling with robust progress tracking
Create your configuration from the example template:
cp config/config.yml.example config/config.yml
# Edit config.yml with your website details and database settings
- Website: Target site URL patterns and exclusions
- Crawling: Depth limits, workers, delays, and retry settings
- Database: SQLite settings, migration options, and performance tuning
- Output: Format selection (HTML/Markdown) and directory structure
- Content: Processing rules for HTML cleaning and resource handling
- Files: Cookie authentication and backup settings
The crawler now uses SQLite for robust progress tracking:
database:
db_path: "crawler_data.db" # SQLite database file
auto_migrate_json: true # Auto-migrate from JSON files
json_backup_dir: "json_backups" # Backup directory for JSON files
keep_json_backup: true # Keep JSON files after migration
enable_wal_mode: true # Better concurrency
cache_size: 10000 # Performance optimization
The crawler uses SQLite for reliable progress tracking with these benefits:
- Atomic Operations: No data corruption from interruptions
- Concurrent Access: Thread-safe operations with proper locking
- Efficient Queries: Fast lookups for URL status and statistics
- Data Integrity: Foreign key constraints and transaction safety
- Backup Support: JSON export functionality for data portability
The SQLite database includes these tables:
- discovered_urls: Track all discovered URLs with status and metadata
- downloaded_documents: Store information about downloaded HTML files
- downloaded_resources: Track CSS, images, and other resource files
- url_mappings: Map original URLs to local file paths
- crawler_stats: Store crawling statistics and metrics
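The authoritative schema lives in src/database_manager.py; the sketch below only illustrates the pattern these tables rely on, namely WAL mode plus INSERT OR REPLACE keyed on the URL, using a simplified, assumed column layout:

import sqlite3

conn = sqlite3.connect("crawler_data.db")
conn.execute("PRAGMA journal_mode=WAL")   # better concurrency, mirroring enable_wal_mode
conn.execute(
    "CREATE TABLE IF NOT EXISTS discovered_urls ("   # simplified, assumed columns
    " url TEXT PRIMARY KEY, depth INTEGER, status TEXT DEFAULT 'pending')"
)

def record_url(url, depth, status="pending"):
    """Re-running this for the same URL never creates a duplicate row."""
    with conn:   # one atomic transaction
        conn.execute(
            "INSERT OR REPLACE INTO discovered_urls (url, depth, status) VALUES (?, ?, ?)",
            (url, depth, status),
        )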
Existing users with JSON progress files get automatic migration:
- Seamless conversion from JSON to SQLite format
- Automatic backup of original JSON files
- Zero data loss during migration
- Continued operation without interruption
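The migration itself is handled by src/json_migrator.py; conceptually it backs the legacy file up and replays its entries into SQLite, roughly as in this sketch (the JSON key shown is hypothetical, not the real file format, and the table matches the simplified schema assumed above):

import json
import shutil
import sqlite3
from pathlib import Path

def migrate_json_progress(json_path="download_progress.json",
                          db_path="crawler_data.db",
                          backup_dir="json_backups"):
    src = Path(json_path)
    if not src.exists():
        return
    Path(backup_dir).mkdir(exist_ok=True)
    shutil.copy2(src, Path(backup_dir) / src.name)   # keep the JSON backup
    data = json.loads(src.read_text())
    with sqlite3.connect(db_path) as conn:
        for url in data.get("downloaded", []):       # "downloaded" key is an assumption
            conn.execute(
                "INSERT OR REPLACE INTO discovered_urls (url, status) VALUES (?, 'completed')",
                (url,),
            )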
Generate comprehensive crawling reports:
# Full report with statistics and breakdowns
python src/db_reporter.py
# Summary report only
python src/db_reporter.py --summary
# Progress overview
python src/db_reporter.py --progress
# Export completed URLs
python src/db_reporter.py --export-urls completed_urls.txt
- URL Status Summary: Breakdown by pending, downloading, completed, failed
- Download Statistics: Document and resource counts with sizes
- Resource Analysis: Breakdown by file types (CSS, images, etc.)
- Recent Activity: Daily download activity tracking
- Failed URLs: Analysis of failed downloads with retry counts
- Progress Tracking: Real-time progress bars and percentages
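These reports boil down to aggregate queries over the SQLite tables; for example, a status breakdown can be reproduced with a few lines of Python (using the simplified table assumed in the earlier sketch):

import sqlite3

conn = sqlite3.connect("crawler_data.db")
rows = conn.execute(
    "SELECT status, COUNT(*) FROM discovered_urls GROUP BY status ORDER BY COUNT(*) DESC"
)
for status, count in rows:
    print(f"{status:12s} {count}")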
The main configuration is done through config/config.yml:
# Website Configuration
website:
base_domain: "your-site.com"
base_url: "https://your-site.com"
start_url: "https://your-site.com/docs"
valid_url_patterns:
- "/docs/"
- "/wiki/"
exclude_patterns:
- "/admin"
- "/login"
# Crawling Parameters
crawling:
max_depth: 2
space_name: "DOCS"
max_workers: 8
request_delay: 0.5
request_timeout: 30
# Output Configuration
output:
format: "markdown" # or "html"
output_dir: "downloaded_content"
resources_dir: "shared_resources"
For quick usage, you can also use command-line arguments:
# Basic usage (automatic dependency installation)
python src/web_crawler.py "https://example.com/docs"
# Advanced usage with all parameters
python src/web_crawler.py "https://example.com/wiki" 3 WIKI markdown 10
# Arguments: URL, max depth, space name, output format, workers
# Generate database reports
python src/db_reporter.py # Full report
python src/db_reporter.py --summary # Summary only
python src/db_reporter.py --progress # Progress only
# Export URLs by status
python src/db_reporter.py --export-urls completed.txt --export-status completed
The crawler automatically installs missing dependencies:
# Just run the crawler - dependencies will be installed automatically
python src/web_crawler.py
# Or manually install if preferred
pip install requests beautifulsoup4 markdownify pyyaml
For an easier setup experience, use the interactive wizard:
./setup_wizard.sh
The wizard provides:
- Dependency checking and installation
- Interactive YAML configuration creation
- Step-by-step website setup
- Authentication configuration
- Example configurations for common sites
- Copy the template:
cp config/cookies.template.txt config/cookies.txt
- Get your cookies:
- Open your browser and login to the target website
- Open Developer Tools (F12) → Network tab
- Refresh the page and copy the Cookie header from any request
- Paste the cookie string into config/cookies.txt
Example format:
sessionid=abc123; csrftoken=def456; auth_token=ghi789
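Internally the crawler only needs those name=value pairs attached to every request; a minimal, illustrative Python sketch of turning the pasted string into a requests session:

import requests

def session_from_cookie_file(path="config/cookies.txt"):
    """Build a requests session carrying the cookies copied from the browser."""
    with open(path) as fh:
        raw = fh.read().strip()
    session = requests.Session()
    for pair in raw.split(";"):
        name, _, value = pair.strip().partition("=")
        if name:
            session.cookies.set(name, value)
    return session

session = session_from_cookie_file()
response = session.get("https://your-site.com/docs", timeout=30)   # now authenticated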
# config/config.yml
website:
base_domain: "company.atlassian.net"
base_url: "https://company.atlassian.net"
start_url: "https://company.atlassian.net/wiki/spaces/DOCS/overview"
valid_url_patterns:
- "/wiki/spaces/DOCS/"
exclude_patterns:
- "action="
- "/admin"
crawling:
max_depth: 2
max_workers: 5
output:
format: "markdown"# config/config.yml
website:
base_domain: "docs.example.com"
base_url: "https://docs.example.com"
start_url: "https://docs.example.com/v1/"
valid_url_patterns:
- "/v1/"
exclude_patterns:
- "/api/"
crawling:
max_depth: 3
max_workers: 10
output:
format: "html"# Download Confluence space with 8 parallel workers
python src/web_crawler.py "https://company.atlassian.net/wiki/spaces/TEAM" 2 TEAM markdown 8
# Download documentation site to HTML
python src/web_crawler.py "https://docs.example.com" 1 DOCS html 5The crawler automatically saves progress to download_progress.json and can resume interrupted downloads:
# If download is interrupted, simply re-run the same command
python src/web_crawler.py "https://example.com/docs"
# Progress loaded: 45 URLs already downloaded, 23 pending
- max_workers: Higher values = faster downloads, but may overload the server
- request_delay: Delay between requests to be respectful to the server
- request_timeout: Timeout for individual requests
Recommended settings:
- Small sites: max_workers: 3-5
- Large sites: max_workers: 8-15
- Slow servers: request_delay: 1.0
The crawler implements advanced thread-safe mechanisms:
- URL Locks: Prevent multiple threads from downloading the same page
- Resource Locks: Prevent duplicate resource downloads
- Progress Locks: Ensure safe progress file updates
- Queue Locks: Thread-safe queue management
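The per-URL and per-resource locks can be pictured as a registry of fine-grained locks guarded by one global lock; a compact Python sketch of that pattern (the names are illustrative, not the crawler's actual attributes):

import threading
from collections import defaultdict

_registry_lock = threading.Lock()
_url_locks = defaultdict(threading.Lock)   # one lock per URL, created on demand

def download_once(url, fetch):
    """Only one thread at a time may work on a given URL."""
    with _registry_lock:     # short critical section: look up the per-URL lock
        lock = _url_locks[url]
    with lock:               # long critical section: the actual download
        return fetch(url)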
- Clean, readable Markdown files
- Intelligent content extraction (removes navigation, headers, footers)
- Local link conversion
- Metadata headers with original URLs
- ATX-style headings
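Markdown conversion builds on BeautifulSoup and markdownify (both in the dependency list); a condensed Python sketch of the idea, with illustrative selectors and an assumed metadata-header format:

from bs4 import BeautifulSoup
from markdownify import ATX, markdownify

def html_to_markdown(html, original_url):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.select("nav, header, footer, script, style"):   # illustrative selectors
        tag.decompose()   # drop navigation and boilerplate before conversion
    body = markdownify(str(soup), heading_style=ATX)   # ATX-style headings (#, ##, ...)
    return f"<!-- Original URL: {original_url} -->\n\n{body}"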
- Clean HTML with JavaScript removed
- CSS and resources downloaded and linked locally
- Local link conversion maintained
- Original page structure preserved
- Authentication Errors:
  - Check if cookies are correctly formatted in config/cookies.txt
  - Ensure cookies are not expired
  - Verify you have access to the target URLs
- Permission Errors:
  - Check if you have write permissions in the output directory
  - Ensure the user running the script can create directories
- Network Errors:
  - Check internet connectivity
  - Verify the target website is accessible
  - Consider increasing request_timeout
- Memory Issues:
  - Reduce max_workers for large sites
  - Increase system memory or use swap space
Enable verbose logging by setting:
logging:
verbose: true
log_resources: true
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Commit your changes: git commit -am 'Add feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Repository: https://github.com/sjseo298/PythonHttpTracker
- Issues: https://github.com/sjseo298/PythonHttpTracker/issues
- Documentation: https://github.com/sjseo298/PythonHttpTracker/wiki
- v1.0.0 (December 2024): Initial release
  - High-performance web crawler with parallel processing
  - Universal web crawling support (Confluence, documentation sites, wikis)
  - Multi-threaded downloads with configurable workers
  - Multiple output formats (HTML and Markdown)
  - Smart link conversion to local references
  - Automatic resource management (CSS, images, assets)
  - Progress tracking and resume functionality
  - Thread-safe architecture with advanced locking
  - YAML configuration with CLI override support
  - Cookie authentication for protected sites
  - Integrated HTML cleaning and JavaScript removal
  - Interactive setup wizard
Made with ❤️ for the web scraping community