A production-grade Proximal Policy Optimization (PPO) agent for portfolio allocation and optimal execution in cryptocurrency markets, bringing together deep reinforcement learning, financial engineering, and MLOps best practices.

The project implements a complete pipeline for training, evaluating, and deploying a deep reinforcement learning agent for autonomous trading:
- Custom Gymnasium environment with realistic friction modeling (transaction costs, slippage, execution delay)
- Differential Sharpe Ratio reward (Moody & Saffell, 2001) for risk-adjusted optimization
- Feature engineering pipeline with log-returns, technical indicators (RSI, MACD, ATR, Bollinger), and rolling Z-score normalization
- Interactive Streamlit dashboard for real-time training visualization and backtesting
- Full MLOps infrastructure with MLflow tracking, Docker containerization, and TensorBoard logging
Disclaimer: This project is for educational and research purposes. It is NOT financial advice. Trading involves substantial risk of loss. Past performance does not guarantee future results.
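The custom environment mentioned above follows the standard Gymnasium `reset`/`step` interface. As a minimal sketch of that loop (class and attribute names here are illustrative, not the project's actual `trading_env.py`):

```python
import numpy as np

class MiniTradingEnv:
    """Toy sketch of a windowed trading environment (not the real implementation)."""
    HOLD, BUY, SELL = 0, 1, 2

    def __init__(self, prices, window_size=30, fee=0.001):
        self.prices = np.asarray(prices, dtype=float)
        self.window_size = window_size
        self.fee = fee  # proportional transaction cost

    def reset(self):
        self.t = self.window_size
        self.position = 0.0  # fraction of equity held in the asset
        return self._observation()

    def _observation(self):
        # log-returns over the trailing window, shape (window_size - 1,)
        window = self.prices[self.t - self.window_size : self.t]
        return np.log(window[1:] / window[:-1])

    def step(self, action):
        old_position = self.position
        if action == self.BUY:
            self.position = 1.0
        elif action == self.SELL:
            self.position = 0.0
        # reward: position-weighted log-return minus cost of changing position
        r = np.log(self.prices[self.t] / self.prices[self.t - 1])
        reward = self.position * r - self.fee * abs(self.position - old_position)
        self.t += 1
        done = self.t >= len(self.prices)
        return self._observation(), reward, done, {}
```

The real environment additionally models slippage and execution delay via the friction layer, and exposes proper `observation_space`/`action_space` definitions.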
```
┌────────────────────────────────────────────────────────────────────┐
│                            OBSERVATION                             │
│  State: (window_size, features) - Normalized OHLCV + Indicators    │
└────────────────────────────┬───────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────────┐
│                         FEATURE EXTRACTOR                          │
│                  MLP [64, 64] with ReLU activation                 │
└────────────────────────────┬───────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
┌──────────────────────────┐  ┌──────────────────────────┐
│       POLICY HEAD        │  │        VALUE HEAD        │
│     (Actor - π(a|s))     │  │     (Critic - V(s))      │
│    Softmax / Gaussian    │  │       Single Output      │
└────────────┬─────────────┘  └──────────────────────────┘
             │
             ▼
┌────────────────────────────────────────────────────────────────────┐
│                              ACTION                                │
│        Discrete: {HOLD, BUY, SELL} or Continuous: [-1, 1]          │
└────────────────────────────────────────────────────────────────────┘
```
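The diagram maps to a shared-trunk actor-critic. A numpy sketch of the forward pass (random weights stand in for learned parameters; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(x, in_dim, out_dim):
    # random weights stand in for learned parameters; biases omitted
    W = rng.normal(0, 0.1, size=(in_dim, out_dim))
    return np.maximum(x @ W, 0.0)

obs = rng.normal(size=(30 * 8,))   # flattened (window_size, features) observation
h = relu_layer(obs, 30 * 8, 64)    # shared trunk, layer 1
h = relu_layer(h, 64, 64)          # shared trunk, layer 2

logits = h @ rng.normal(0, 0.1, (64, 3))        # policy head: HOLD / BUY / SELL
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over actions
value = (h @ rng.normal(0, 0.1, (64, 1))).item()  # value head: scalar V(s)
```

In the continuous variant, the policy head instead parameterizes a Gaussian (mean and log-std) over the position in [-1, 1].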
```bash
# Clone repository
git clone https://github.com/timothim/trading-agent.git
cd trading-agent

# Install with pip
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

# Setup pre-commit hooks (optional)
pre-commit install
```

The easiest way to interact with the project is the Streamlit dashboard:

```bash
# Launch interactive Streamlit dashboard
make dashboard
# → http://localhost:8501

# Or run it directly
streamlit run app/dashboard.py
```

The dashboard lets you:
- Train a PPO agent with configurable hyperparameters and watch it learn in real-time
- Backtest the trained model on unseen data with step-by-step visualization
- Analyze performance metrics (Sharpe, Sortino, drawdown) and compare against buy-and-hold
- Verify that the training is real (model inspection, TensorBoard logs)
```bash
# Train with default configuration
make train

# Train with custom config
python scripts/train.py --config configs/training.yaml --timesteps 1000000

# Run hyperparameter optimization
python scripts/train.py --config configs/training.yaml --hpo --hpo-trials 100

# Evaluate trained model
python scripts/evaluate.py --model models/best_model --config configs/default.yaml

# Run walk-forward backtest
python scripts/backtest.py --model models/best_model --walk-forward --folds 5

# Build and run with Docker Compose
docker-compose up training

# Launch MLflow UI
docker-compose up mlflow
# → http://localhost:5000

# Start TensorBoard
docker-compose up tensorboard
# → http://localhost:6006
```

```
trading-agent/
├── app/
│   ├── dashboard.py            # Streamlit UI (interactive dashboard)
│   └── engine.py               # Training & backtest orchestration
│
├── src/trading_agent/
│   ├── data/                   # Data acquisition
│   │   ├── fetcher.py          # CCXT exchange integration
│   │   └── dataset.py          # Data management & caching
│   │
│   ├── preprocessing/          # Feature engineering
│   │   ├── transformers.py     # Log-returns, fractional differentiation
│   │   ├── normalizers.py      # Rolling Z-score normalization
│   │   ├── indicators.py       # RSI, MACD, ATR, Bollinger Bands
│   │   └── pipeline.py         # Orchestrated preprocessing pipeline
│   │
│   ├── simulation/             # Custom Gymnasium environment
│   │   ├── trading_env.py      # Core RL environment
│   │   ├── action_schemes.py   # Discrete / Continuous action spaces
│   │   ├── reward_functions.py # Differential Sharpe, Sortino, etc.
│   │   └── friction.py         # Transaction costs, slippage, latency
│   │
│   ├── model/                  # PPO agent
│   │   ├── networks.py         # Actor-Critic MLP networks
│   │   ├── agent.py            # PPO wrapper (Stable-Baselines3)
│   │   └── callbacks.py        # Training callbacks & checkpoints
│   │
│   ├── training/               # Training infrastructure
│   │   ├── trainer.py          # Training orchestration
│   │   ├── vectorized.py       # SubprocVecEnv parallel environments
│   │   └── hyperopt.py         # Optuna hyperparameter optimization
│   │
│   ├── evaluation/             # Validation & backtesting
│   │   ├── metrics.py          # Sharpe, Sortino, max drawdown, Calmar
│   │   ├── backtester.py       # Walk-forward validation
│   │   └── visualization.py    # Performance charts
│   │
│   └── deployment/             # Production inference
│       ├── inference.py        # Lightweight model serving
│       └── live_trader.py      # Live trading loop with safety controls
│
├── configs/                    # YAML configuration files
│   ├── default.yaml            # Default hyperparameters
│   ├── training.yaml           # Training overrides
│   └── inference.yaml          # Deployment configuration
│
├── scripts/                    # CLI entry points
│   ├── train.py                # Training script
│   ├── evaluate.py             # Evaluation script
│   ├── backtest.py             # Backtesting script
│   ├── live.py                 # Live trading script
│   └── quickstart.py           # Quick start demo
│
├── tests/                      # Test suite
├── notebooks/                  # Jupyter notebooks for exploration
├── models/                     # Saved models (gitignored)
├── data/                       # Data storage (gitignored)
└── logs/                       # Training logs (gitignored)
```
All hyperparameters are configured via YAML files:
```yaml
# configs/default.yaml
ppo:
  learning_rate: 0.0003
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  ent_coef: 0.01

environment:
  window_size: 30
  action_space_type: "discrete"  # or "continuous"
  initial_balance: 10000.0

reward:
  type: "differential_sharpe"
  risk_aversion: 0.5
```

Instead of rewarding raw returns (which encourages excessive risk), the agent optimizes the marginal contribution to the Sharpe ratio (Moody & Saffell, 2001):
$$D_t = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}}$$

Where $A_t = A_{t-1} + \eta\,\Delta A_t$ and $B_t = B_{t-1} + \eta\,\Delta B_t$ are exponential moving estimates of the first and second moments of the returns $R_t$, with $\Delta A_t = R_t - A_{t-1}$, $\Delta B_t = R_t^2 - B_{t-1}$, and adaptation rate $\eta$.
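The incremental update above can be sketched in a few lines of plain Python (a hypothetical helper for illustration, not the project's `reward_functions.py`):

```python
def differential_sharpe(returns, eta=0.01):
    """Stream of differential Sharpe ratio rewards (Moody & Saffell, 2001)."""
    A, B = 0.0, 0.0  # exponential moving estimates of E[R] and E[R^2]
    rewards = []
    for r in returns:
        dA, dB = r - A, r * r - B
        var = B - A * A  # moving variance estimate
        if var > 1e-12:
            rewards.append((B * dA - 0.5 * A * dB) / var ** 1.5)
        else:
            rewards.append(0.0)  # undefined at start-up; emit neutral reward
        A += eta * dA
        B += eta * dB
    return rewards
```

Each step's reward is the marginal effect of the latest return on the running Sharpe estimate, so the agent is penalized for volatility as it trades rather than only at episode end.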
Log-returns make price data stationary for neural network consumption: $r_t = \ln(P_t / P_{t-1})$.
Features are normalized using rolling statistics to maintain consistent scale: $z_t = (x_t - \mu_t^{(w)}) / \sigma_t^{(w)}$, where $\mu_t^{(w)}$ and $\sigma_t^{(w)}$ are the mean and standard deviation over a trailing window of length $w$.
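A self-contained sketch of these two preprocessing steps (plain Python for clarity; the project's pipeline presumably vectorizes the same logic with pandas/numpy):

```python
import math
from statistics import mean, pstdev

def log_returns(prices):
    # r_t = ln(P_t / P_{t-1}); one fewer element than the input series
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def rolling_zscore(series, window):
    # z-score of each point against the trailing window ending at that point,
    # so no future information leaks into the feature
    out = []
    for i in range(window, len(series) + 1):
        w = series[i - window : i]
        mu, sigma = mean(w), pstdev(w)
        out.append((w[-1] - mu) / sigma if sigma > 1e-12 else 0.0)
    return out
```

Using only trailing statistics is what keeps the normalization causal: the value at time $t$ never depends on data after $t$.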
| Metric | Formula | Good | Excellent |
|---|---|---|---|
| Sharpe Ratio | (Return − Risk-Free Rate) / Volatility | > 1.0 | > 2.0 |
| Sortino Ratio | (Return − Risk-Free Rate) / Downside Deviation | > 1.5 | > 2.5 |
| Max Drawdown | Largest peak-to-trough equity decline | < 20% | < 10% |
| Calmar Ratio | Annual Return / Max Drawdown | > 1.0 | > 2.0 |
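A minimal sketch of how these metrics can be computed from per-period returns and an equity curve (annualization factor and conventions are illustrative, not necessarily those of the project's `metrics.py`):

```python
import math

def sharpe(returns, periods_per_year=365):
    # mean return over total volatility, annualized; risk-free rate assumed 0
    mu = sum(returns) / len(returns)
    var = sum((r - mu) ** 2 for r in returns) / len(returns)
    return mu / math.sqrt(var) * math.sqrt(periods_per_year) if var > 0 else 0.0

def sortino(returns, periods_per_year=365):
    # like Sharpe, but only downside deviation in the denominator
    mu = sum(returns) / len(returns)
    dd = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    return mu / dd * math.sqrt(periods_per_year) if dd > 0 else 0.0

def max_drawdown(equity_curve):
    # largest peak-to-trough decline as a fraction of the peak
    peak, mdd = equity_curve[0], 0.0
    for x in equity_curve:
        peak = max(peak, x)
        mdd = max(mdd, (peak - x) / peak)
    return mdd
```

Calmar then follows directly as annualized return divided by `max_drawdown`.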
```bash
# Run tests
make test

# Run with coverage
make test-cov

# Lint and format
make lint
make format

# Type checking
make type-check

# All quality checks
make lint && make test
```

The live trading system includes multiple safety controls:
- Paper Trading Mode - Test without real money (enabled by default)
- Daily Loss Limits - Stop trading after exceeding configurable threshold
- Maximum Position Size - Limit exposure per asset
- Trade Cooldown - Minimum time between trades to prevent overtrading
- Graceful Shutdown - Handle SIGTERM/SIGINT signals cleanly
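The loss-limit, position-cap, and cooldown controls above amount to simple gating before each order. A hedged sketch of that gate (names and thresholds are illustrative, not the actual `live_trader.py` API):

```python
import time
from dataclasses import dataclass

@dataclass
class SafetyGate:
    daily_loss_limit: float = 0.05   # halt after losing 5% of starting equity in a day
    max_position: float = 0.25       # max fraction of equity per asset
    cooldown_seconds: float = 60.0   # minimum time between trades
    _last_trade: float = float("-inf")

    def allow_trade(self, daily_pnl_pct, proposed_position, now=None):
        """Return (allowed, reason); checks run from most to least severe."""
        now = time.monotonic() if now is None else now
        if daily_pnl_pct <= -self.daily_loss_limit:
            return False, "daily loss limit hit"
        if abs(proposed_position) > self.max_position:
            return False, "position size exceeds cap"
        if now - self._last_trade < self.cooldown_seconds:
            return False, "cooldown active"
        self._last_trade = now
        return True, "ok"
```

Graceful shutdown sits outside this gate: a SIGTERM/SIGINT handler sets a stop flag so the trading loop exits after completing (or cancelling) the in-flight order.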
| Topic | Paper |
|---|---|
| PPO | Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 |
| Differential Sharpe | Moody, J. & Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Trans. Neural Networks |
| Financial ML | Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley |
| GAE | Schulman, J., et al. (2015). High-Dimensional Continuous Control Using GAE. arXiv:1506.02438 |
MIT License - see LICENSE for details.
- Stable-Baselines3 - PPO implementation
- CCXT - Exchange connectivity
- Gymnasium - RL environment API
- Streamlit - Dashboard framework