# ML Trading Strategy - Demo Notebook

This notebook demonstrates the complete workflow of the ML-driven trading strategy.

## Overview

1. **Data Loading**: Fetch stock price data and engineer features
2. **Model Training**: Train multiple ML models
3. **Strategy Backtesting**: Run historical simulation
4. **Performance Analysis**: Evaluate results and generate plots
5. **Model Comparison**: Compare different approaches

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("📊 ML Trading Strategy Demo")
print("=" * 40)

## 1. Configuration and Setup

In [None]:
# Load configuration
with open('../config.json', 'r') as f:
    config = json.load(f)

# Modify config for demo (smaller dataset)
config['data']['tickers'] = ['AAPL', 'GOOGL']  # Use fewer tickers for demo
config['data']['start_date'] = '2022-01-01'
config['data']['end_date'] = '2024-01-01'
config['models']['models_to_train'] = ['random_forest', 'xgboost']  # Faster models

print("📋 Configuration loaded:")
print(f"Tickers: {config['data']['tickers']}")
print(f"Period: {config['data']['start_date']} to {config['data']['end_date']}")
print(f"Models: {config['models']['models_to_train']}")

## 2. Data Loading and Feature Engineering

In [None]:
# Import data loading modules
import sys
sys.path.append('..')

from data.data_loader import load_and_prepare_data

# Load data
print("📥 Loading stock data...")
raw_data, processed_data = load_and_prepare_data(config)

print(f"✅ Data loaded for {len(processed_data)} tickers")
for ticker, df in processed_data.items():
    print(f"  - {ticker}: {len(df)} records, {len(df.columns)} features")

In [None]:
# Examine sample data
sample_ticker = list(processed_data.keys())[0]
sample_df = processed_data[sample_ticker]

print(f"📊 Sample data for {sample_ticker}:")
print(f"Shape: {sample_df.shape}")
print(f"Date range: {sample_df['date'].min()} to {sample_df['date'].max()}")

# Display first few rows
display_cols = ['date', 'close', 'return_1d', 'sma_20', 'rsi_14', 'target']
available_cols = [col for col in display_cols if col in sample_df.columns]
sample_df[available_cols].head(10)
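
The columns shown above (`return_1d`, `sma_20`, `rsi_14`, `target`) come from `load_and_prepare_data`, whose implementation is not part of this notebook. As a rough sketch of how such features are commonly derived, the hypothetical helper `add_basic_features` below computes them from a `close` column. The RSI variant (simple rolling means) and the next-day direction target are assumptions, and the project's own definitions may differ.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add return_1d, sma_20, rsi_14 and a target column."""
    out = df.copy()
    out["return_1d"] = out["close"].pct_change()
    out["sma_20"] = out["close"].rolling(20).mean()

    # RSI from simple rolling means of gains and losses (an assumed variant).
    delta = out["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Assumed target: 1 if the next close is higher, else 0.
    # The last row has no next day, so its target stays NaN.
    next_close = out["close"].shift(-1)
    out["target"] = (next_close > out["close"]).astype(float)
    out.loc[next_close.isna(), "target"] = np.nan
    return out


# Synthetic prices, used only to exercise the helper.
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    "date": pd.bdate_range("2022-01-03", periods=60),
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 60))),
})
features = add_basic_features(prices)
```

The leading rows of `sma_20` and `rsi_14` are NaN until their windows fill, which is why the training cell later filters out rows with missing values.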

## 3. Exploratory Data Analysis

In [None]:
# Plot price and technical indicators
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Price and moving averages
axes[0, 0].plot(sample_df['date'], sample_df['close'], label='Close Price', linewidth=2)
if 'sma_20' in sample_df.columns:
    axes[0, 0].plot(sample_df['date'], sample_df['sma_20'], label='SMA 20', alpha=0.7)
if 'sma_50' in sample_df.columns:
    axes[0, 0].plot(sample_df['date'], sample_df['sma_50'], label='SMA 50', alpha=0.7)
axes[0, 0].set_title(f'{sample_ticker} Price and Moving Averages')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Returns distribution
if 'return_1d' in sample_df.columns:
    returns = sample_df['return_1d'].dropna()
    axes[0, 1].hist(returns, bins=50, alpha=0.7, edgecolor='black')
    axes[0, 1].axvline(returns.mean(), color='red', linestyle='--', label=f'Mean: {returns.mean():.4f}')
    axes[0, 1].set_title('Daily Returns Distribution')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

# RSI
if 'rsi_14' in sample_df.columns:
    axes[1, 0].plot(sample_df['date'], sample_df['rsi_14'], linewidth=1)
    axes[1, 0].axhline(70, color='red', linestyle='--', alpha=0.7, label='Overbought')
    axes[1, 0].axhline(30, color='green', linestyle='--', alpha=0.7, label='Oversold')
    axes[1, 0].set_title('RSI (14-day)')
    axes[1, 0].set_ylim(0, 100)
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

# Target distribution
if 'target' in sample_df.columns:
    target_counts = sample_df['target'].value_counts()
    axes[1, 1].bar(target_counts.index, target_counts.values, alpha=0.7)
    axes[1, 1].set_title('Target Distribution')
    axes[1, 1].set_xlabel('Target (0=Down, 1=Up)')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Model Training and Evaluation

In [None]:
from models.ml_models import MLModelManager
from data.data_loader import DataLoader

# Prepare training data
print("🎯 Preparing training data...")
loader = DataLoader(config)

# Combine data from all tickers
combined_features = []
combined_targets = []

for ticker, df in processed_data.items():
    X, y = loader.prepare_ml_data(df)
    
    # Remove rows with a NaN target or any NaN feature
    valid_mask = ~y.isna() & ~X.isna().any(axis=1)
    X_clean = X[valid_mask]
    y_clean = y[valid_mask]
    
    if len(X_clean) > 10:
        combined_features.append(X_clean)
        combined_targets.append(y_clean)

# Combine all data
X_train = pd.concat(combined_features, ignore_index=True)
y_train = pd.concat(combined_targets, ignore_index=True)

print(f"Training dataset: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Target distribution: {y_train.value_counts().to_dict()}")

In [None]:
# Train models
print("🤖 Training ML models...")

model_manager = MLModelManager(config)
model_manager.initialize_models()
trained_models = model_manager.train_models(X_train, y_train)

print("\n📈 Training Results:")
print("-" * 50)

for model_name, metrics in model_manager.performance_metrics.items():
    print(f"{model_name.upper()}:")
    print(f"  Training Accuracy: {metrics.get('train_accuracy', 0):.4f}")
    print(f"  Validation Accuracy: {metrics.get('val_accuracy', 0):.4f}")
    if 'roc_auc' in metrics:
        print(f"  ROC AUC: {metrics.get('roc_auc', 0):.4f}")
    print()

## 5. Feature Importance Analysis

In [None]:
# Analyze feature importance for tree-based models
for model_name in ['random_forest', 'xgboost']:
    if model_name in trained_models:
        try:
            importance_df = model_manager.get_feature_importance(model_name)
            if not importance_df.empty:
                # Plot top 15 features
                top_features = importance_df.head(15)
                
                plt.figure(figsize=(10, 8))
                plt.barh(range(len(top_features)), top_features['importance'], alpha=0.8)
                plt.yticks(range(len(top_features)), top_features['feature'])
                plt.xlabel('Importance Score')
                plt.title(f'Top 15 Feature Importance - {model_name.replace("_", " ").title()}')
                plt.gca().invert_yaxis()
                plt.grid(True, alpha=0.3)
                plt.tight_layout()
                plt.show()
                
                print(f"\n📊 Top 10 Features for {model_name}:")
                for i, row in top_features.head(10).iterrows():
                    print(f"  {row['feature']}: {row['importance']:.4f}")
        except Exception as e:
            print(f"Could not analyze feature importance for {model_name}: {e}")

## 6. Backtesting

In [None]:
from backtest.backtester import Backtester

# Run backtesting
print("📊 Running backtesting...")

# Adjust backtest parameters for demo
config['backtest']['walk_forward']['training_window'] = 126  # 6 months
config['backtest']['walk_forward']['validation_window'] = 21  # 1 month
config['backtest']['walk_forward']['step_size'] = 7  # 7 days (~1 calendar week; 5 would be 1 trading week)

backtester = Backtester(config)

backtest_results = backtester.run_backtest(
    data=processed_data,
    start_date=config['data']['start_date'],
    end_date=config['data']['end_date']
)

print("✅ Backtesting completed!")
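
The `training_window`, `validation_window`, and `step_size` settings describe a walk-forward scheme. How `Backtester` applies them internally is not shown here, so the following is only a minimal sketch that assumes fixed-length rolling windows over row positions. The function name `walk_forward_windows` is hypothetical.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_rows: int, training_window: int, validation_window: int, step_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_rows, validation_rows) position ranges, oldest first."""
    start = 0
    # Stop once a full training + validation span no longer fits.
    while start + training_window + validation_window <= n_rows:
        train_end = start + training_window
        yield (
            range(start, train_end),
            range(train_end, train_end + validation_window),
        )
        start += step_size


# Same window sizes as the demo configuration, on 500 hypothetical rows.
windows = list(
    walk_forward_windows(
        n_rows=500, training_window=126, validation_window=21, step_size=7
    )
)
first_train, first_val = windows[0]
```

Each validation range starts exactly where its training range ends, so no validation row is ever seen during the fit that precedes it.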

## 7. Performance Analysis

In [None]:
# Get results
portfolio_df = pd.DataFrame(backtester.portfolio_history)
trades_df = backtester.get_trade_analysis()

print(f"📊 Backtest Results Summary:")
print(f"Portfolio records: {len(portfolio_df)}")
print(f"Total trades: {len(trades_df)}")

if not portfolio_df.empty:
    initial_value = config['backtest']['initial_capital']
    final_value = portfolio_df['total_value'].iloc[-1] if 'total_value' in portfolio_df.columns else portfolio_df['portfolio_value'].iloc[-1]
    total_return = (final_value - initial_value) / initial_value
    
    print(f"\n💰 PERFORMANCE METRICS")
    print(f"Initial Capital: ${initial_value:,.2f}")
    print(f"Final Value: ${final_value:,.2f}")
    print(f"Total Return: {total_return:.2%}")
    
    if not trades_df.empty and 'pnl' in trades_df.columns:
        winning_trades = trades_df[trades_df['pnl'] > 0]
        losing_trades = trades_df[trades_df['pnl'] <= 0]
        win_rate = len(winning_trades) / len(trades_df)
        
        print(f"Win Rate: {win_rate:.2%}")
        if len(winning_trades) > 0:
            print(f"Average Win: ${winning_trades['pnl'].mean():.2f}")
        if len(losing_trades) > 0:
            print(f"Average Loss: ${losing_trades['pnl'].mean():.2f}")

## 8. Visualization

In [None]:
# Portfolio performance visualization
if not portfolio_df.empty:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Portfolio value over time
    dates = pd.to_datetime(portfolio_df['date'])
    values = portfolio_df['total_value'] if 'total_value' in portfolio_df.columns else portfolio_df['portfolio_value']
    
    axes[0, 0].plot(dates, values, linewidth=2, color='blue')
    axes[0, 0].set_title('Portfolio Value Over Time')
    axes[0, 0].set_ylabel('Portfolio Value ($)')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].tick_params(axis='x', rotation=45)
    
    # Drawdown analysis
    peak = values.expanding().max()
    drawdown = (values - peak) / peak * 100
    
    axes[0, 1].fill_between(dates, drawdown, 0, alpha=0.7, color='red')
    axes[0, 1].plot(dates, drawdown, linewidth=1, color='darkred')
    axes[0, 1].set_title('Drawdown Analysis')
    axes[0, 1].set_ylabel('Drawdown (%)')
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].tick_params(axis='x', rotation=45)
    
    # Trade analysis
    if not trades_df.empty and 'pnl' in trades_df.columns:
        # P&L distribution
        axes[1, 0].hist(trades_df['pnl'], bins=20, alpha=0.7, edgecolor='black')
        axes[1, 0].axvline(0, color='red', linestyle='--', linewidth=2)
        axes[1, 0].axvline(trades_df['pnl'].mean(), color='green', linestyle='--', 
                          label=f'Mean: ${trades_df["pnl"].mean():.2f}')
        axes[1, 0].set_title('Trade P&L Distribution')
        axes[1, 0].set_xlabel('P&L ($)')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Cumulative P&L
        cumulative_pnl = trades_df['pnl'].cumsum()
        trade_numbers = range(1, len(cumulative_pnl) + 1)
        axes[1, 1].plot(trade_numbers, cumulative_pnl, linewidth=2, color='green')
        axes[1, 1].fill_between(trade_numbers, cumulative_pnl, 0, alpha=0.3, color='green')
        axes[1, 1].set_title('Cumulative P&L')
        axes[1, 1].set_xlabel('Trade Number')
        axes[1, 1].set_ylabel('Cumulative P&L ($)')
        axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 9. Model Comparison

In [None]:
# Compare model performance
if model_manager.performance_metrics:
    metrics_df = pd.DataFrame(model_manager.performance_metrics).T
    
    # Plot model comparison
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Validation accuracy
    if 'val_accuracy' in metrics_df.columns:
        metrics_df['val_accuracy'].plot(kind='bar', ax=axes[0], color='skyblue', alpha=0.8)
        axes[0].set_title('Model Validation Accuracy')
        axes[0].set_ylabel('Accuracy')
        axes[0].tick_params(axis='x', rotation=45)
        axes[0].grid(True, alpha=0.3)
    
    # ROC AUC (if available)
    if 'roc_auc' in metrics_df.columns:
        metrics_df['roc_auc'].plot(kind='bar', ax=axes[1], color='lightcoral', alpha=0.8)
        axes[1].set_title('Model ROC AUC Score')
        axes[1].set_ylabel('ROC AUC')
        axes[1].tick_params(axis='x', rotation=45)
        axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("📊 Model Performance Summary:")
    display_cols = ['train_accuracy', 'val_accuracy', 'roc_auc']
    available_cols = [col for col in display_cols if col in metrics_df.columns]
    print(metrics_df[available_cols].round(4))

## 10. Risk Analysis

In [None]:
# Calculate additional risk metrics
if not portfolio_df.empty and 'total_value' in portfolio_df.columns:
    values = portfolio_df['total_value']
    returns = values.pct_change().dropna()
    
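    # Calculate risk metrics
    volatility = returns.std() * np.sqrt(252)
    # Drawdown as a fraction of the running peak, computed here so this cell
    # does not reuse the percent-scaled `drawdown` from the plotting cell
    running_peak = values.cummax()
    max_drawdown = ((values - running_peak) / running_peak).min()
    # Sharpe ratio with the risk-free rate assumed to be zero
    sharpe_ratio = (returns.mean() * 252) / volatility if volatility > 0 else 0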
    
    # Historical VaR at 95% and 99% confidence (5th and 1st percentiles of daily returns)
    var_95 = np.percentile(returns, 5)
    var_99 = np.percentile(returns, 1)
    
    print("📉 RISK ANALYSIS")
    print("-" * 30)
    print(f"Annualized Volatility: {volatility:.2%}")
    print(f"Maximum Drawdown: {abs(max_drawdown):.2%}")
    print(f"Sharpe Ratio: {sharpe_ratio:.2f}")
    print(f"VaR (95%): {var_95:.2%}")
    print(f"VaR (99%): {var_99:.2%}")
    
    # Plot return distribution with VaR
    plt.figure(figsize=(10, 6))
    plt.hist(returns, bins=50, alpha=0.7, density=True, edgecolor='black')
    plt.axvline(var_95, color='red', linestyle='--', label=f'VaR 95%: {var_95:.2%}')
    plt.axvline(var_99, color='darkred', linestyle='--', label=f'VaR 99%: {var_99:.2%}')
    plt.axvline(returns.mean(), color='green', linestyle='-', label=f'Mean: {returns.mean():.2%}')
    plt.title('Daily Returns Distribution with VaR')
    plt.xlabel('Daily Return')
    plt.ylabel('Density')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
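
As a sanity check on the metrics in this section, the sketch below applies the same formulas to a tiny hand-made value series where the maximum drawdown is easy to verify by eye (peak 110, trough 99, so 10%). The function `risk_summary` is a hypothetical helper, not part of the project, and it assumes a zero risk-free rate and 252 periods per year.

```python
import numpy as np
import pandas as pd


def risk_summary(values: pd.Series, periods_per_year: int = 252) -> dict:
    """Hypothetical helper: risk metrics from a portfolio value series."""
    returns = values.pct_change().dropna()
    volatility = returns.std() * np.sqrt(periods_per_year)
    # Drawdown is kept as a fraction (not percent) of the running peak.
    running_peak = values.cummax()
    max_drawdown = ((values - running_peak) / running_peak).min()
    # Risk-free rate assumed to be zero.
    sharpe = returns.mean() * periods_per_year / volatility if volatility > 0 else 0.0
    return {
        "volatility": volatility,
        "max_drawdown": max_drawdown,
        "sharpe_ratio": sharpe,
        "var_95": np.percentile(returns, 5),  # historical VaR, 95% confidence
        "var_99": np.percentile(returns, 1),  # historical VaR, 99% confidence
    }


toy_values = pd.Series([100.0, 110.0, 99.0, 104.5, 120.0])
summary = risk_summary(toy_values)
```

Keeping drawdown as a fraction means it can be formatted with `:.2%` directly, the same way the other metrics are printed.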

## 11. Export Results

In [None]:
import os

# Create results directory
results_dir = 'demo_results'
os.makedirs(results_dir, exist_ok=True)

# Export data
if not portfolio_df.empty:
    portfolio_df.to_csv(f'{results_dir}/portfolio_history.csv', index=False)
    print(f"✅ Portfolio history exported to {results_dir}/portfolio_history.csv")

if not trades_df.empty:
    trades_df.to_csv(f'{results_dir}/trades.csv', index=False)
    print(f"✅ Trades exported to {results_dir}/trades.csv")

# Export model metrics
if model_manager.performance_metrics:
    with open(f'{results_dir}/model_metrics.json', 'w') as f:
        json.dump(model_manager.performance_metrics, f, indent=4, default=str)
    print(f"✅ Model metrics exported to {results_dir}/model_metrics.json")

# Export configuration used
with open(f'{results_dir}/config_used.json', 'w') as f:
    json.dump(config, f, indent=4)
print(f"✅ Configuration exported to {results_dir}/config_used.json")

print(f"\n📁 All results saved to: {results_dir}/")

## 12. Summary and Interpretation

In [None]:
# Generate summary report
print("📋 TRADING STRATEGY PERFORMANCE SUMMARY")
print("=" * 50)

if not portfolio_df.empty:
    initial_capital = config['backtest']['initial_capital']
    final_value = portfolio_df['total_value'].iloc[-1] if 'total_value' in portfolio_df.columns else portfolio_df['portfolio_value'].iloc[-1]
    total_return = (final_value - initial_capital) / initial_capital
    
    print(f"📊 FINANCIAL PERFORMANCE")
    print(f"  Initial Capital: ${initial_capital:,.2f}")
    print(f"  Final Portfolio Value: ${final_value:,.2f}")
    print(f"  Total Return: {total_return:.2%}")
    print(f"  Annualized Return: {(1 + total_return) ** (252 / len(portfolio_df)) - 1:.2%}")
    
    if 'volatility' in locals():
        print(f"  Volatility: {volatility:.2%}")
        print(f"  Sharpe Ratio: {sharpe_ratio:.2f}")
        print(f"  Max Drawdown: {abs(max_drawdown):.2%}")

if not trades_df.empty and 'pnl' in trades_df.columns:
    winning_trades = len(trades_df[trades_df['pnl'] > 0])
    total_trades = len(trades_df)
    win_rate = winning_trades / total_trades
    
    print(f"\n🎯 TRADING STATISTICS")
    print(f"  Total Trades: {total_trades}")
    print(f"  Winning Trades: {winning_trades}")
    print(f"  Win Rate: {win_rate:.2%}")
    print(f"  Average Trade P&L: ${trades_df['pnl'].mean():.2f}")
    print(f"  Best Trade: ${trades_df['pnl'].max():.2f}")
    print(f"  Worst Trade: ${trades_df['pnl'].min():.2f}")

if model_manager.performance_metrics:
    print(f"\n🤖 MODEL PERFORMANCE")
    for model_name, metrics in model_manager.performance_metrics.items():
        print(f"  {model_name.upper()}:")
        print(f"    Validation Accuracy: {metrics.get('val_accuracy', 0):.2%}")
        if 'roc_auc' in metrics:
            print(f"    ROC AUC: {metrics.get('roc_auc', 0):.3f}")

print(f"\n⚠️ IMPORTANT DISCLAIMERS:")
print(f"  • This is a demonstration for educational purposes only")
print(f"  • Past performance does not guarantee future results")
print(f"  • All trading involves substantial risk of loss")
print(f"  • Thoroughly backtest any strategy before live deployment")
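
The annualized return printed in the summary cell uses `252 / len(portfolio_df)` as the exponent, which is only accurate if the portfolio history holds one record per trading day. When that is uncertain, one alternative is to annualize from the elapsed calendar time instead. The helper `annualized_return` below is a hypothetical sketch of that approach, not the project's method.

```python
import pandas as pd


def annualized_return(values: pd.Series, dates: pd.Series) -> float:
    """Hypothetical helper: compound annual growth rate from elapsed time."""
    total_return = values.iloc[-1] / values.iloc[0] - 1
    # Elapsed time in years, using the average calendar year length.
    years = (dates.iloc[-1] - dates.iloc[0]).days / 365.25
    return (1 + total_return) ** (1 / years) - 1


# Toy series: 10% growth per year over two years, sampled once a year.
dates = pd.Series(pd.to_datetime(["2022-01-03", "2023-01-03", "2024-01-03"]))
values = pd.Series([100_000.0, 110_000.0, 121_000.0])
cagr = annualized_return(values, dates)
```

With only three records the record-count formula would badly overstate the result, while the date-based version recovers roughly 10% per year.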

## Conclusion

This notebook demonstrated the complete ML trading strategy workflow:

✅ **Data Pipeline**: Automated data fetching and feature engineering  
✅ **Model Training**: Multiple ML models with performance evaluation  
✅ **Backtesting**: Realistic historical simulation with transaction costs  
✅ **Analysis**: Comprehensive performance metrics and visualizations  
✅ **Export**: Results saved for further analysis  

### Next Steps

1. **Hyperparameter Tuning**: Use `GridSearchCV` with a time-ordered splitter such as `TimeSeriesSplit`, so tuning never validates on data older than the training fold
2. **Extended Backtesting**: Test on longer periods and more assets
3. **Risk Analysis**: Implement stress testing and scenario analysis
4. **Live Trading**: Deploy strategy with paper trading first
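
For the hyperparameter tuning step, a minimal sketch with scikit-learn might look like the following. It runs on synthetic data rather than the notebook's `X_train` and `y_train`, the parameter grid is an arbitrary illustration, and `TimeSeriesSplit` is used so that each validation fold comes after its training fold in row order.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered stand-in for the real feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Illustrative grid only; sensible ranges depend on the real data.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # folds respect row order
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Note that the training cell earlier concatenates tickers with `ignore_index=True`, so rows would need to be sorted by date before a time-ordered split is meaningful on the real data.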

### Important Notes

⚠️ **This is for educational purposes only**  
⚠️ **Past performance does not guarantee future results**  
⚠️ **Always test strategies thoroughly before live deployment**  
⚠️ **Consider transaction costs, slippage, and market impact**