Production-grade system for extracting MQL5 trading articles with proper formatting, syntax highlighting, and comprehensive metadata.
Use Case: Training data collection for sequence-to-sequence (seq2seq) models in algorithmic trading
# 1. One-time setup (creates persistent virtual environment)
./setup.sh
# 2. Extract all articles
.venv/bin/python mql5_extract.py discover-and-extract

That's it! 🎉
- ✅ Elegant markdown formatting with MQL5 syntax highlighting
 - ✅ Local image downloads with descriptive filenames
 - ✅ Batch processing with checkpoint-based resume
 - ✅ Retry logic with exponential backoff
 - ✅ Quality validation (word count, code blocks)
 - ✅ Comprehensive logging (file + console)
 - ✅ Statistics generation (word count, images, code blocks)
 - ✅ Rate limiting (respectful to mql5.com servers)
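
The retry behaviour listed above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name `retry_with_backoff` and the defaults (`max_attempts=3`, `base_delay=1.0`, `max_delay=30.0`) are assumptions chosen for illustration, and the real values presumably come from config.yaml.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,      # assumed default, not taken from the project
    base_delay: float = 1.0,    # assumed default, in seconds
    max_delay: float = 30.0,    # assumed cap, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles after every failure (1s, 2s, 4s, ...), capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would typically wrap a single page fetch in `operation`.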
 
mql5_articles/
├── {user_id}/                  # User folder (numeric ID or username)
│   ├── article_{id}/
│   │   ├── article_{id}.md    # Clean markdown with syntax highlighting
│   │   ├── metadata.json      # Extraction metadata
│   │   ├── images/            # Local images
│   │   └── images_manifest.json
│   └── article_{id}/
├── extraction_summary.json     # Batch statistics
└── extraction.log             # Detailed logs
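
For downstream use, the layout above can be walked with a few lines of Python. This is a sketch that assumes only the directory and file names shown in the tree; `summarize_output` is a hypothetical helper, not part of the tool, and it makes no assumption about the keys inside metadata.json.

```python
from pathlib import Path


def summarize_output(root: Path) -> list[dict]:
    """Collect one summary row per extracted article under `root`."""
    rows = []
    # Matches {user_id}/article_{id}/article_{id}.md as shown in the tree.
    for md_path in sorted(root.glob("*/article_*/article_*.md")):
        article_dir = md_path.parent
        images_dir = article_dir / "images"
        image_count = (
            sum(1 for p in images_dir.iterdir() if p.is_file())
            if images_dir.is_dir()
            else 0
        )
        rows.append(
            {
                "user_id": article_dir.parent.name,
                "article": article_dir.name,
                "words": len(md_path.read_text(encoding="utf-8").split()),
                "images": image_count,
                "has_metadata": (article_dir / "metadata.json").exists(),
            }
        )
    return rows
```

Calling `summarize_output(Path("mql5_articles"))` would return a list of dictionaries that can be fed into a dataset-building step.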
- CLAUDE.md - Complete production guide
 - docs/dependency_management.md - UV vs UVX comparison
 - docs/ood_authors_research.md - OOD detection author research
 - docs/attachment_extraction_design.md - Attachment extraction architecture
 - docs/attachment_extraction_test_results.md - Test validation results
 
.venv/bin/python mql5_extract.py discover-and-extract

.venv/bin/python mql5_extract.py single https://www.mql5.com/en/articles/19625

# Create URL file
.venv/bin/python mql5_extract.py discover-and-extract --save-urls urls.txt --dry-run
# Extract all
.venv/bin/python mql5_extract.py batch urls.txt
# Resume interrupted extraction
.venv/bin/python mql5_extract.py batch urls.txt --resume

.venv/bin/python mql5_extract.py discover-and-extract --user-id jslopes

Edit config.yaml to customize:
- Output directory
 - Browser timeout
 - Retry attempts
 - Rate limiting
 - Quality validation thresholds
 - Logging levels
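
To make the quality validation thresholds concrete, here is a rough sketch of a word-count and code-block check on an extracted markdown file. The names `validate_article` and `QualityReport` and the default thresholds are assumptions for illustration; the tool's real checks and config keys may differ.

```python
from dataclasses import dataclass

# Built this way so the sketch can itself sit inside a fenced block.
FENCE = "`" * 3


@dataclass
class QualityReport:
    word_count: int
    code_blocks: int
    passed: bool


def validate_article(
    markdown: str,
    min_words: int = 500,       # assumed threshold
    min_code_blocks: int = 1,   # assumed threshold
) -> QualityReport:
    """Count prose words and fenced code blocks, then compare to thresholds."""
    words = 0
    code_blocks = 0
    in_code = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            if not in_code:
                code_blocks += 1  # count each opening fence once
            in_code = not in_code
            continue
        if not in_code:
            words += len(line.split())  # only prose counts toward the word total
    passed = words >= min_words and code_blocks >= min_code_blocks
    return QualityReport(words, code_blocks, passed)
```

Counting only prose words keeps long code listings from masking an article whose text failed to extract.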
 
uv pip install --python .venv/bin/python -r requirements.txt --upgrade

rm -rf .venv && ./setup.sh

Extracted content includes:
- 400,000+ words of technical articles
 - 100% accurate MQL5 syntax detection (fixed in v3.0.0)
 - Local images with automatic naming
 - Complete metadata (author, date, views, ratings)
 
We use uv venv (persistent) instead of uvx (temporary):
| Feature | uvx | uv venv |
|---|---|---|
| Setup time | Every run (~15s) | Once (~30s) |
| Playwright browsers | Reinstall each time | Cached |
| Command length | Very long | Short |
| Best for | One-off scripts | This project ✅ |

Result: 4x faster execution after initial setup.
See docs/dependency_management.md for details.
- uv (Python package manager)
 - Python 3.13+ (auto-installed by uv)
 - ~230 MB disk space for virtual environment
 - Internet connection for article downloads
 
curl -LsSf https://astral.sh/uv/install.sh | sh

# Increase timeout in config.yaml
extraction:
  timeout_ms: 60000  # 60 seconds

.venv/bin/python -m playwright install chromium

See CLAUDE.md for complete troubleshooting guide.
- v3.0.0 (2025-10-01) - Fixed MQL5 syntax detection (100% accuracy)
 - v2.0.0 (2025-09-30) - Production release with CLI, logging, retry logic
 - v1.0.0 (2025-09-29) - Initial release
 
See repository license for details.
Built with: Python, Playwright, BeautifulSoup, httpx, uv