# LSTM Baseline Model Training - Day 2

**Objective**: Train a simple LSTM model to predict container CPU usage 15 minutes ahead

## Approach:
1. Load and prepare data with sliding windows
2. Train LSTM baseline model
3. Evaluate predictions
4. Save model for future use

This serves as proof of concept before building more complex models.

In [1]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os

from preprocessing.sliding_windows import create_features_and_windows
from models.lstm_baseline import (
    LSTMPredictor,
    ModelTrainer,
    prepare_data_for_training,
    evaluate_predictions
)

print("✅ Libraries imported")

ModuleNotFoundError: No module named 'torch'

## 1. Load Data

Load the most recent metrics export.

In [None]:
# Find and load latest metrics file
data_dir = '../data/raw'
metrics_files = glob.glob(os.path.join(data_dir, 'metrics_*.csv'))

if not metrics_files:
    # Stop here: every later cell depends on df
    raise FileNotFoundError("❌ No metrics files found! Run export script first.")

# Use modification time; getctime is not creation time on Unix
latest_file = max(metrics_files, key=os.path.getmtime)
print(f"📁 Loading: {os.path.basename(latest_file)}")

df = pd.read_csv(latest_file)
timestamps = pd.to_datetime(df['timestamp'])  # parse once, reuse below

print(f"✅ Loaded {len(df):,} records")
print(f"   Time range: {timestamps.min()} to {timestamps.max()}")

## 2. Create Sliding Windows

Create 60-minute input windows to predict 15 minutes ahead.
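
As a rough illustration of what the windowing step does, the sketch below slices a univariate series into input/target pairs using plain Python lists. The function name `make_windows` is hypothetical and is not the project's `create_features_and_windows`; the 15-second sampling interval is taken from the horizon plot later in this notebook, so 60 minutes corresponds to 240 steps and 15 minutes to 60 steps.

```python
def make_windows(series, window_steps, horizon_steps):
    """Slice a series into (input window, target window) pairs.

    Hypothetical helper for illustration; the real pipeline also adds
    temporal, lag, and rolling features.
    """
    X, y = [], []
    last_start = len(series) - window_steps - horizon_steps
    for start in range(last_start + 1):
        split = start + window_steps
        X.append(series[start:split])                  # history the model sees
        y.append(series[split:split + horizon_steps])  # values it must predict
    return X, y


STEP_SECONDS = 15                         # assumed sampling interval
window_steps = 60 * 60 // STEP_SECONDS    # 60 minutes -> 240 steps
horizon_steps = 15 * 60 // STEP_SECONDS   # 15 minutes -> 60 steps

series = [float(i) for i in range(400)]   # synthetic stand-in for CPU usage
X, y = make_windows(series, window_steps, horizon_steps)
print(len(X), len(X[0]), len(y[0]))
```

Under these assumptions a series of n points yields n - 240 - 60 + 1 samples, which is why a short collection run can leave too few windows to train on.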

In [None]:
# Create features and windows
print("üîÑ Creating sliding windows...")
print("   Input: 60 minutes of history")
print("   Output: Predict next 15 minutes")

try:
    X, y, feature_names, metadata = create_features_and_windows(
        df=df,
        container_name='metrics-webapp',
        metric_name='container_cpu',
        window_size_minutes=60,
        prediction_horizon_minutes=15,
        include_temporal=True,
        include_lags=True,
        include_rolling=True
    )
    
    print(f"\n✅ Created {len(X)} training samples")
    print(f"   X shape: {X.shape}")
    print(f"   y shape: {y.shape}")
    print(f"   Features: {len(feature_names)}")
    
except Exception as e:
    print(f"❌ Error creating windows: {e}")
    print("   This might happen if there's not enough data yet.")
    print("   Try running more load patterns and collecting more data.")
    raise  # later cells need X and y, so do not continue silently

## 3. Prepare Data for Training

Split into train/validation sets and normalize.
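
The internals of `prepare_data_for_training` are not shown here, so the following is only a sketch of one common way to do this step for time series: split chronologically (no shuffling) and compute normalization statistics from the training portion alone, so validation data does not leak into the scaler. The function name and the use of plain mean/standard-deviation scaling are assumptions.

```python
import numpy as np


def split_and_normalize(X, train_split=0.8):
    """Chronological split plus per-feature standardization (illustrative).

    X has shape (samples, window, features). Statistics come from the
    training portion only.
    """
    split_idx = int(len(X) * train_split)
    X_train, X_val = X[:split_idx], X[split_idx:]

    n_features = X.shape[2]
    flat_train = X_train.reshape(-1, n_features)   # pool all timesteps per feature
    mean = flat_train.mean(axis=0)
    std = flat_train.std(axis=0)
    std[std == 0] = 1.0                            # avoid dividing by zero

    return (X_train - mean) / std, (X_val - mean) / std, (mean, std)


rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 240, 3))  # synthetic data
X_train_n, X_val_n, (mean, std) = split_and_normalize(X_demo)
print(X_train_n.shape, X_val_n.shape)
```

The reshape to `(-1, n_features)` is the same trick the evaluation cell uses before calling `X_scaler.transform`.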

In [None]:
# Prepare data
train_loader, val_loader, scalers = prepare_data_for_training(
    X=X,
    y=y,
    train_split=0.8,
    batch_size=32,
    normalize=True
)

X_scaler, y_scaler = scalers

print("\n✅ Data preparation complete")

## 4. Build LSTM Model

Create a 2-layer LSTM model.
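
To get a feel for model size before building it, the parameter count of a stacked LSTM can be worked out by hand. The sketch below follows the layout of PyTorch's `nn.LSTM` (four gates, each with input weights, recurrent weights, and two bias vectors) and assumes a single linear output layer on top; the actual head in `LSTMPredictor` may differ, and the feature count of 10 is a made-up example value.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Trainable parameters in a unidirectional stacked LSTM (nn.LSTM layout)."""
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        total += 4 * hidden_size * (in_features + hidden_size + 2)
    return total


def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Example values: 10 features is hypothetical; 64 units, 2 layers, and a
# 60-step output (15 minutes at 15-second steps) follow this notebook.
lstm_params = lstm_param_count(input_size=10, hidden_size=64, num_layers=2)
head_params = linear_param_count(64, 60)
print(lstm_params, head_params, lstm_params + head_params)
```

If the number printed by the cell below differs from such an estimate, the difference points to extra layers in the real model rather than to the LSTM itself.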

In [None]:
# Model hyperparameters
input_size = X.shape[2]  # Number of features
hidden_size = 64
num_layers = 2
prediction_horizon = y.shape[1]  # Number of timesteps to predict
dropout = 0.2

# Create model
model = LSTMPredictor(
    input_size=input_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    prediction_horizon=prediction_horizon,
    dropout=dropout
)

# Count parameters
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("üß† LSTM Model Architecture:")
print("=" * 60)
print(model)
print("=" * 60)
print(f"Total trainable parameters: {n_params:,}")

## 5. Train Model

Train with early stopping to prevent overfitting.
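
The exact logic inside `ModelTrainer.train` is not shown in this notebook. As a minimal sketch of patience-based early stopping, assuming training halts once validation loss has failed to improve for `patience` consecutive epochs, the function below (a hypothetical helper) replays a list of validation losses and reports where training would stop:

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops after `patience` consecutive epochs without a new best
    validation loss; if that never happens, it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch


# Synthetic loss curve: improves until epoch 3, then drifts upward.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `patience=15` and `epochs=100`, as configured in the training cell, a run under this scheme would end early only after 15 consecutive epochs without a new best validation loss.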

In [None]:
# Create trainer
trainer = ModelTrainer(
    model=model,
    device='cpu',  # Change to 'cuda' if GPU available
    learning_rate=0.001
)

print("üöÄ Starting training...")
print("   This may take a few minutes...\n")

# Train model
history = trainer.train(
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=100,
    patience=15,
    verbose=True
)

print("\n✅ Training complete!")

## 6. Visualize Training

Plot training and validation loss curves.

In [None]:
# Plot training history
trainer.plot_training_history()

## 7. Evaluate Model

Make predictions on validation set and evaluate performance.

In [None]:
# Get validation data; 0.8 must match train_split passed to prepare_data_for_training
split_idx = int(len(X) * 0.8)
X_val = X[split_idx:]
y_val = y[split_idx:]

# Normalize validation data
n_val, window_size, n_features = X_val.shape
X_val_reshaped = X_val.reshape(-1, n_features)
X_val_scaled = X_scaler.transform(X_val_reshaped)
X_val_normalized = X_val_scaled.reshape(n_val, window_size, n_features)

# Make predictions
print("üîÆ Making predictions on validation set...")
y_pred_normalized = trainer.predict(X_val_normalized)

# Denormalize predictions
y_pred = y_scaler.inverse_transform(y_pred_normalized)

print(f"✅ Generated {len(y_pred)} predictions")

In [None]:
# Evaluate predictions
evaluate_predictions(
    y_true=y_val,
    y_pred=y_pred,
    n_examples=5
)

## 8. Analyze Prediction Horizon

How does accuracy degrade as we predict further into the future?
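
When reading per-step errors it helps to be explicit about which array index corresponds to which lead time. The helpers below are a sketch assuming 15-second steps and that index 0 of the prediction is the first step after the input window (15 seconds ahead); both function names are hypothetical.

```python
STEP_SECONDS = 15  # assumed sampling interval


def index_to_lead_minutes(index, step_seconds=STEP_SECONDS):
    """Lead time in minutes for a 0-based position in the prediction."""
    return (index + 1) * step_seconds / 60


def lead_minutes_to_index(minutes, step_seconds=STEP_SECONDS):
    """0-based position in the prediction for a lead time in minutes."""
    steps = round(minutes * 60 / step_seconds)
    if steps < 1:
        raise ValueError("lead time must be at least one step")
    return steps - 1


for minutes in (1, 5, 10, 15):
    print(minutes, "min -> index", lead_minutes_to_index(minutes))
```

Under this convention the last of 60 predicted steps (index 59) is the 15-minute mark, and the 1-minute mark is index 3.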

In [None]:
# Calculate error at each timestep in prediction horizon
timestep_errors = []

for t in range(y_val.shape[1]):
    mse_t = np.mean((y_val[:, t] - y_pred[:, t]) ** 2)
    timestep_errors.append(np.sqrt(mse_t))

# Plot
plt.figure(figsize=(12, 5))
plt.plot(timestep_errors, linewidth=2, marker='o')
plt.xlabel('Timestep (15-second intervals)', fontsize=12)
plt.ylabel('RMSE', fontsize=12)
plt.title('Prediction Error vs Time Horizon', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Add time labels (index t is (t + 1) * 15 seconds ahead, so 3 min is index 11)
timesteps_in_minutes = np.arange(3, len(timestep_errors) * 0.25 + 0.25, 3)
minute_ticks = (timesteps_in_minutes / 0.25).astype(int) - 1
plt.xticks(minute_ticks, [f'{int(m)} min' for m in timesteps_in_minutes])

plt.tight_layout()
plt.show()

print(f"\nüìä Error Analysis:")
print(f"   RMSE at 1 minute:  {timestep_errors[4]:.6f}")  # 4 timesteps = 1 min
print(f"   RMSE at 5 minutes: {timestep_errors[20]:.6f}")  # 20 timesteps = 5 min
print(f"   RMSE at 10 minutes: {timestep_errors[40]:.6f}")  # 40 timesteps = 10 min
print(f"   RMSE at 15 minutes: {timestep_errors[-1]:.6f}")

## 9. Save Model

Save the trained model and scalers for later use.
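
Because each artifact is written with the same timestamp suffix, the three files for a given run can be located together later. The sketch below is one hypothetical way to find the newest set; the helper name is an assumption, while the file name patterns mirror the ones used in the save cell. Timestamps in `%Y%m%d_%H%M%S` form sort chronologically as plain strings, so no date parsing is needed.

```python
import glob
import os
import tempfile


def latest_artifacts(model_dir):
    """Return paths for the newest saved run, or None if nothing is saved."""
    prefix, suffix = "model_config_", ".pkl"
    configs = glob.glob(os.path.join(model_dir, f"{prefix}*{suffix}"))
    if not configs:
        return None

    # Zero-padded timestamps compare correctly as strings.
    newest = max(os.path.basename(path) for path in configs)
    stamp = newest[len(prefix):-len(suffix)]
    return {
        "weights": os.path.join(model_dir, f"lstm_baseline_{stamp}.pth"),
        "scalers": os.path.join(model_dir, f"scalers_{stamp}.pkl"),
        "config": os.path.join(model_dir, f"model_config_{stamp}.pkl"),
    }


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for stamp in ("20240101_090000", "20240102_083000"):
        for name in (f"lstm_baseline_{stamp}.pth", f"scalers_{stamp}.pkl",
                     f"model_config_{stamp}.pkl"):
            open(os.path.join(demo_dir, name), "w").close()
    found = latest_artifacts(demo_dir)
    print(os.path.basename(found["weights"]))
```

The weights file holds only a `state_dict`, so loading it back requires rebuilding `LSTMPredictor` from the saved config before calling `load_state_dict`.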

In [None]:
import torch
import pickle
from datetime import datetime

# Create models directory
model_dir = '../src/models/saved'
os.makedirs(model_dir, exist_ok=True)

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save model
model_path = os.path.join(model_dir, f'lstm_baseline_{timestamp}.pth')
torch.save(model.state_dict(), model_path)

# Save scalers
scaler_path = os.path.join(model_dir, f'scalers_{timestamp}.pkl')
with open(scaler_path, 'wb') as f:
    pickle.dump({'X_scaler': X_scaler, 'y_scaler': y_scaler}, f)

# Save model config
config = {
    'input_size': input_size,
    'hidden_size': hidden_size,
    'num_layers': num_layers,
    'prediction_horizon': prediction_horizon,
    'dropout': dropout,
    'feature_names': feature_names,
    'window_size_minutes': 60,
    'prediction_horizon_minutes': 15
}

config_path = os.path.join(model_dir, f'model_config_{timestamp}.pkl')
with open(config_path, 'wb') as f:
    pickle.dump(config, f)

print("üíæ Model saved:")
print(f"   Model weights: {model_path}")
print(f"   Scalers: {scaler_path}")
print(f"   Config: {config_path}")

## 10. Summary

### Model Configuration:
- **Architecture**: 2-layer LSTM with 64 hidden units
- **Input**: 60 minutes of historical data
- **Output**: 15-minute CPU usage prediction
- **Training**: Early stopping with validation

### Next Steps:
1. ✅ Baseline LSTM model trained
2. Collect more diverse data (spikes, gradual changes, chaos)
3. Try more complex models (attention mechanisms, transformers)
4. Move to SageMaker for distributed training
5. Integrate with Kubernetes HPA

### Key Insights:
- Document what patterns the model captures well/poorly
- Note prediction accuracy at different time horizons
- Identify areas for improvement

In [None]:
# Final summary
print("=" * 60)
print("DAY 2 MODEL TRAINING - COMPLETE ✅")
print("=" * 60)
print(f"\nüìä Training Statistics:")
print(f"   Training samples: {split_idx:,}")
print(f"   Validation samples: {len(X_val):,}")
print(f"   Model parameters: {n_params:,}")
print(f"   Best validation loss: {min(history['val_loss']):.6f}")

print(f"\nüéØ Model Performance:")
val_mse = np.mean((y_val - y_pred) ** 2)
val_rmse = np.sqrt(val_mse)
val_mae = np.mean(np.abs(y_val - y_pred))
print(f"   RMSE: {val_rmse:.6f}")
print(f"   MAE:  {val_mae:.6f}")

print(f"\nüí° Next Actions:")
print("   1. Continue collecting data with varied load patterns")
print("   2. Experiment with different model architectures")
print("   3. Fine-tune hyperparameters")
print("   4. Prepare for SageMaker migration")

print(f"\nüéâ Day 2 objectives complete!")