MQL5 Article Extraction System - Production-ready tool for extracting and processing MQL5 trading articles for seq-2-seq model training

terrylica/mql5

MQL5 Article Extraction System

Production-grade system for extracting MQL5 trading articles with proper formatting, syntax highlighting, and comprehensive metadata.

Use Case: Training data collection for algorithmic trading seq-2-seq models


🚀 Quick Start

# 1. One-time setup (creates persistent virtual environment)
./setup.sh

# 2. Extract all articles
.venv/bin/python mql5_extract.py discover-and-extract

That's it! 🎉


✨ Features

  • Elegant markdown formatting with MQL5 syntax highlighting
  • Local image downloads with descriptive filenames
  • Batch processing with checkpoint-based resume
  • Retry logic with exponential backoff
  • Quality validation (word count, code blocks)
  • Comprehensive logging (file + console)
  • Statistics generation (word count, images, code blocks)
  • Rate limiting (respectful to mql5.com servers)
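
As an illustration of the retry feature listed above, here is a minimal sketch of retry with exponential backoff. The function names (`backoff_delays`, `retry`) and the default values are assumptions made for this example; they are not taken from `mql5_extract.py`, whose actual implementation may differ.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=30.0):
    """Return the wait in seconds before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def retry(func, attempts=3, base=1.0, sleep=time.sleep):
    """Call `func` up to `attempts` times, sleeping with exponential backoff between failures.

    Hypothetical helper for illustration only; `sleep` is injectable so the
    behaviour can be tested without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1, base=base)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
```

Capping the delay keeps a long outage from producing very long waits, and the same pause between requests also helps keep the load on mql5.com low.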

📊 Output Structure

mql5_articles/
├── {user_id}/                  # User folder (numeric ID or username)
│   ├── article_{id}/
│   │   ├── article_{id}.md    # Clean markdown with syntax highlighting
│   │   ├── metadata.json      # Extraction metadata
│   │   ├── images/            # Local images
│   │   └── images_manifest.json
│   └── article_{id}/
├── extraction_summary.json     # Batch statistics
└── extraction.log             # Detailed logs
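
For downstream scripts that consume the extracted data, the layout above can be expressed as a small path helper. This is a sketch only: `article_paths` is a hypothetical name, and the pairing of user and article ID in the usage note is purely illustrative.

```python
from pathlib import Path


def article_paths(root, user_id, article_id):
    """Build the expected per-article paths under the output directory.

    Hypothetical helper that mirrors the tree shown above; it does not
    touch the filesystem.
    """
    article_dir = Path(root) / str(user_id) / f"article_{article_id}"
    return {
        "dir": article_dir,
        "markdown": article_dir / f"article_{article_id}.md",
        "metadata": article_dir / "metadata.json",
        "images": article_dir / "images",
        "manifest": article_dir / "images_manifest.json",
    }
```

For example, `article_paths("mql5_articles", "jslopes", 19625)["metadata"]` would point at `mql5_articles/jslopes/article_19625/metadata.json`.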

📚 Documentation


🛠️ Usage Examples

Extract All Articles

.venv/bin/python mql5_extract.py discover-and-extract

Extract Single Article

.venv/bin/python mql5_extract.py single https://www.mql5.com/en/articles/19625

Batch Processing

# Create URL file
.venv/bin/python mql5_extract.py discover-and-extract --save-urls urls.txt --dry-run

# Extract all
.venv/bin/python mql5_extract.py batch urls.txt

# Resume interrupted extraction
.venv/bin/python mql5_extract.py batch urls.txt --resume
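
The idea behind checkpoint-based resume can be sketched as follows. This is an assumption about how such a feature might work, not the tool's actual code: the checkpoint file format (a JSON list of completed URLs) and the names `load_done` and `run_batch` are invented for this example.

```python
import json
from pathlib import Path


def load_done(checkpoint):
    """Return the set of URLs recorded as finished in the checkpoint file."""
    checkpoint = Path(checkpoint)
    if not checkpoint.exists():
        return set()
    return set(json.loads(checkpoint.read_text(encoding="utf-8")))


def run_batch(urls, extract, checkpoint, resume=False):
    """Extract each URL, saving progress after every success.

    With resume=True, URLs already listed in the checkpoint are skipped.
    Returns the URLs processed in this run.
    """
    checkpoint = Path(checkpoint)
    done = load_done(checkpoint) if resume else set()
    processed = []
    for url in urls:
        if url in done:
            continue
        extract(url)  # may raise; the checkpoint then still holds earlier successes
        done.add(url)
        checkpoint.write_text(json.dumps(sorted(done)), encoding="utf-8")
        processed.append(url)
    return processed
```

Writing the checkpoint after every article costs one small file write per URL, but it means an interrupted run loses at most the article that was in progress.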

Custom User Extraction

.venv/bin/python mql5_extract.py discover-and-extract --user-id jslopes

⚙️ Configuration

Edit config.yaml to customize:

  • Output directory
  • Browser timeout
  • Retry attempts
  • Rate limiting
  • Quality validation thresholds
  • Logging levels
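
To show what quality validation thresholds might control, here is a minimal sketch of a word-count and code-block check on extracted markdown. The parameter names `min_words` and `min_code_blocks`, their defaults, and the function itself are assumptions for illustration; the real keys are defined in config.yaml.

```python
FENCE = "`" * 3  # markdown code fence marker


def validate_article(markdown, min_words=500, min_code_blocks=1):
    """Check extracted markdown against simple quality thresholds.

    Hypothetical check: counts every whitespace-separated token as a word
    (code included) and treats each pair of fence lines as one code block.
    """
    words = len(markdown.split())
    fence_lines = sum(
        1 for line in markdown.splitlines() if line.lstrip().startswith(FENCE)
    )
    code_blocks = fence_lines // 2
    return {
        "words": words,
        "code_blocks": code_blocks,
        "passed": words >= min_words and code_blocks >= min_code_blocks,
    }
```

An article failing such a check could be logged and retried instead of being written to the training set.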

🔧 Maintenance

Update Dependencies

uv pip install --python .venv/bin/python -r requirements.txt --upgrade

Clean Reinstall

rm -rf .venv && ./setup.sh

📊 Quality Metrics

Extracted content includes:

  • More than 400,000 words of technical articles
  • 100% accurate MQL5 syntax detection (fixed in v3.0.0)
  • Local images with automatic naming
  • Complete metadata (author, date, views, ratings)

🎯 Why Persistent Virtual Environment?

We use uv venv (persistent) instead of uvx (temporary):

Feature               uvx                   uv venv
Setup time            Every run (~15s)      Once (~30s)
Playwright browsers   Reinstall each time   Cached
Command length        Very long             Short
Best for              One-off scripts       This project

Result: 4x faster execution after initial setup

See docs/dependency_management.md for details.


📦 System Requirements

  • uv (Python package manager)
  • Python 3.13+ (auto-installed by uv)
  • ~230 MB disk space for virtual environment
  • Internet connection for article downloads

🆘 Troubleshooting

Issue: uv: command not found

curl -LsSf https://astral.sh/uv/install.sh | sh

Issue: Extraction fails with timeout

# Increase timeout in config.yaml
extraction:
  timeout_ms: 60000  # 60 seconds

Issue: Browser not found

.venv/bin/python -m playwright install chromium

See CLAUDE.md for the complete troubleshooting guide.


📈 Version History

  • v3.0.0 (2025-10-01) - Fixed MQL5 syntax detection (100% accuracy)
  • v2.0.0 (2025-09-30) - Production release with CLI, logging, retry logic
  • v1.0.0 (2025-09-29) - Initial release

📄 License

See repository license for details.


Built with: Python, Playwright, BeautifulSoup, httpx, uv
