# 03 - Training Models

Learn how to train the G-code fingerprinting model.

## Learning Objectives
- Understand model architecture
- Configure training hyperparameters
- Train a model from scratch
- Monitor training with W&B
- Manage checkpoints

## Model Components
1. **MM-DTAE-LSTM Backbone**: Processes sensor data
2. **Multi-Head Language Model**: Predicts G-code tokens
   - Command head
   - Parameter type head
   - Parameter value head
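
To make the three heads concrete, the sketch below splits one G-code line into a command plus (parameter type, parameter value) pairs. It is an illustration only: `split_gcode_line` is a hypothetical helper, and the project's `TokenDecomposer` may decompose tokens differently.

```python
import re

# Hypothetical helper for illustration; not part of the miracle package.
# A G-code "word" is a letter followed by a number, e.g. G1, X10.5, F1500.
_WORD = re.compile(r"([A-Za-z])\s*([-+]?\d*\.?\d+)")

def split_gcode_line(line):
    """Return (command, [(param_type, param_value), ...]) for one line."""
    code = line.split(";", 1)[0]  # drop a trailing ';' comment
    words = [(letter.upper(), value) for letter, value in _WORD.findall(code)]
    if not words:
        return None, []
    first_letter, first_value = words[0]
    command = f"{first_letter}{first_value}"  # target for the command head
    # Remaining words give the targets for the type and value heads.
    params = [(letter, float(value)) for letter, value in words[1:]]
    return command, params

command, params = split_gcode_line("G1 X10.5 Y-2 F1500 ; move")
print(command, params)
```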

In [None]:
# Setup
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

print(f"Project root: {project_root}")

In [None]:
# Imports
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from miracle.model.model import MM_DTAE_LSTM, ModelConfig
from miracle.model.multihead_lm import MultiHeadGCodeLM
from miracle.dataset.target_utils import TokenDecomposer
from miracle.utilities.device import get_device

print("✓ Imports successful")

## 1. Model Architecture Overview

In [None]:
# Create a sample model to inspect architecture
device = get_device()
print(f"Using device: {device}")

# Load vocabulary
vocab_path = project_root / 'data' / 'vocabulary.json'
decomposer = TokenDecomposer(str(vocab_path))

print("\nVocabulary stats:")
print(f"  Total tokens: {decomposer.vocab_size}")
print(f"  Commands: {decomposer.n_commands}")
print(f"  Param types: {decomposer.n_param_types}")
print(f"  Param values: {decomposer.n_param_values}")

In [None]:
# Create backbone model
backbone_config = ModelConfig(
    sensor_dims=[155, 4],  # 155 continuous, 4 categorical
    d_model=128,
    lstm_layers=2,
    gcode_vocab=decomposer.vocab_size,
    n_heads=4
)

backbone = MM_DTAE_LSTM(backbone_config).to(device)

print("Backbone Model:")
print(f"  Parameters: {sum(p.numel() for p in backbone.parameters()):,}")
print(f"  Hidden dim: {backbone_config.d_model}")
print(f"  LSTM layers: {backbone_config.lstm_layers}")

In [None]:
# Create multi-head language model
multihead_lm = MultiHeadGCodeLM(
    d_model=128,
    n_commands=decomposer.n_commands,
    n_param_types=decomposer.n_param_types,
    n_param_values=decomposer.n_param_values,
    nhead=4,
    num_layers=2,
    dropout=0.1,
    vocab_size=decomposer.vocab_size
).to(device)

print("Multi-Head LM:")
print(f"  Parameters: {sum(p.numel() for p in multihead_lm.parameters()):,}")
print("  Transformer layers: 2")
print("  Attention heads: 4")

## 2. Training Configuration

In [None]:
# Training hyperparameters
config = {
    'learning_rate': 0.001,
    'batch_size': 16,
    'max_epochs': 10,  # Small for demo
    'hidden_dim': 128,
    'num_layers': 2,
    'num_heads': 4,
    'dropout': 0.1,
    'weight_decay': 1e-5,
    'grad_clip': 1.0,
    'command_weight': 3.0  # Higher weight for command prediction
}

print("Training Configuration:")
for k, v in config.items():
    print(f"  {k}: {v}")
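
One plausible reading of `command_weight` is a weighted sum of the per-head losses in which only the command term is scaled up. The sketch below assumes that form; `combine_head_losses` is a hypothetical name, and the training script may combine the losses differently.

```python
def combine_head_losses(command_loss, param_type_loss, param_value_loss,
                        command_weight=3.0):
    """Weighted sum of per-head losses (assumed form, not the script's code).

    Only the command term is up-weighted; the two parameter heads keep
    weight 1.0.
    """
    return command_weight * command_loss + param_type_loss + param_value_loss

# With command_weight=3.0, a command error costs three times as much
# as an equally large parameter error: 3.0 * 0.5 + 0.2 + 0.3
total = combine_head_losses(0.5, 0.2, 0.3, command_weight=3.0)
print(f"total loss: {total:.3f}")
```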

## 3. Command-Line Training

The recommended way to train is using the training script:

In [None]:
# Training command (for reference)
print("To train from command line:")
print()
print("PYTORCH_ENABLE_MPS_FALLBACK=1 PYTHONPATH=src .venv/bin/python scripts/train_multihead.py \\")
print("    --data-dir outputs/processed_current \\")
print("    --vocab-path data/vocabulary.json \\")
print("    --output-dir outputs/training_demo \\")
print("    --max-epochs 10 \\")
print("    --batch_size 16 \\")
print("    --learning_rate 0.001 \\")
print("    --hidden_dim 128 \\")
print("    --use-wandb")
print()
print("This will train for 10 epochs with W&B logging.")
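
If you would rather launch the script from Python than paste the command into a shell, the same invocation can be assembled as an argument list. This is a sketch: `build_train_command` is a hypothetical helper, and the flags and environment variables are copied verbatim from the command above, not checked against the script.

```python
import os
import sys

def build_train_command(python=sys.executable, max_epochs=10, batch_size=16,
                        learning_rate=0.001, hidden_dim=128, use_wandb=True):
    """Return (argv, env) mirroring the shell command shown above."""
    argv = [
        python, "scripts/train_multihead.py",
        "--data-dir", "outputs/processed_current",
        "--vocab-path", "data/vocabulary.json",
        "--output-dir", "outputs/training_demo",
        "--max-epochs", str(max_epochs),
        "--batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
        "--hidden_dim", str(hidden_dim),
    ]
    if use_wandb:
        argv.append("--use-wandb")
    # Copy the current environment and add the two variables from the command.
    env = dict(os.environ, PYTORCH_ENABLE_MPS_FALLBACK="1", PYTHONPATH="src")
    return argv, env

argv, env = build_train_command()
print(" ".join(argv))
```

To actually start training, pass the result to `subprocess.run(argv, env=env, cwd=project_root, check=True)`.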

## 4. Monitoring Training

When using `--use-wandb`, metrics are logged to Weights & Biases:

- **Training loss**: Overall loss, command loss, param loss
- **Validation accuracy**: Command, param type, param value, overall
- **Learning rate**: Current LR (if using scheduler)
- **Gradient norm**: For monitoring stability
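
The validation accuracies listed above could be computed roughly as follows. This sketch assumes "overall" means that all three heads are correct for a token. `head_accuracies` and the metric names are hypothetical and may not match what the training script logs.

```python
def head_accuracies(predictions, targets):
    """Per-head and overall accuracy.

    predictions / targets: equal-length lists of
    (command, param_type, param_value) tuples.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    names = ("command", "param_type", "param_value")
    n = len(targets)
    if n == 0:
        return {**{f"{name}_acc": 0.0 for name in names}, "overall_acc": 0.0}
    correct = dict.fromkeys(names, 0)
    all_correct = 0
    for pred, tgt in zip(predictions, targets):
        matches = [p == t for p, t in zip(pred, tgt)]
        for name, ok in zip(names, matches):
            correct[name] += ok
        all_correct += all(matches)  # token counts only if every head is right
    acc = {f"{name}_acc": correct[name] / n for name in names}
    acc["overall_acc"] = all_correct / n
    return acc

metrics = head_accuracies(
    predictions=[(0, 1, 2), (0, 1, 3), (1, 1, 2)],
    targets=[(0, 1, 2), (0, 2, 3), (0, 1, 2)],
)
print(metrics)
```

A dictionary like this can be passed directly to `wandb.log` once per validation pass.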

## 5. Checkpoint Management

In [None]:
# Find available checkpoints
import glob

# Sort so the listing (and checkpoints[0] below) is deterministic
checkpoints = sorted(glob.glob(str(project_root / 'outputs' / '*' / 'checkpoint_*.pt')))

print(f"Found {len(checkpoints)} checkpoint(s):")
for cp in checkpoints[:10]:
    cp_path = Path(cp)
    print(f"  - {cp_path.parent.name}/{cp_path.name}")
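
When several runs have written checkpoints, you often just want the newest one. The sketch below picks it by file modification time, so it makes no assumption about what follows `checkpoint_` in the file name. `latest_checkpoint` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

def latest_checkpoint(outputs_dir):
    """Return the most recently modified outputs/*/checkpoint_*.pt, or None."""
    candidates = list(Path(outputs_dir).glob("*/checkpoint_*.pt"))
    if not candidates:
        return None
    # Modification time avoids parsing epoch numbers out of file names.
    return max(candidates, key=lambda p: p.stat().st_mtime)

newest = latest_checkpoint(Path.cwd().parent / "outputs")
print(newest if newest is not None else "No checkpoints found")
```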

In [None]:
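# Inspect checkpoint contents
if checkpoints:
    # Note: PyTorch 2.6+ defaults to weights_only=True, which can reject
    # checkpoints that store non-tensor objects. Only pass weights_only=False
    # for files you trust.
    checkpoint = torch.load(checkpoints[0], map_location='cpu')

    print("Checkpoint contents:")
    for key, value in checkpoint.items():
        if 'state_dict' in key:
            # A state dict maps names to tensors, so len() counts entries
            print(f"  {key}: {len(value)} entries")
        else:
            print(f"  {key}: {value}")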

## 6. Training Tips

### Hyperparameter Guidelines:
- **Learning rate**: Start with 1e-3, reduce if loss plateaus
- **Batch size**: 16-32 works well, adjust based on memory
- **Hidden dim**: 128-256 for most datasets
- **Layers**: 2-3 LSTM/Transformer layers
- **Command weight**: 3.0-5.0 to emphasize command prediction
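
The "reduce if loss plateaus" advice can be automated. The sketch below is a minimal pure-Python version of that rule; `PlateauLR` and its default values are illustrative assumptions. In a real training loop, PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau` does the same job.

```python
class PlateauLR:
    """Scale the learning rate by `factor` once the validation loss has
    failed to improve for more than `patience` consecutive epochs."""

    def __init__(self, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current LR."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauLR(lr=1e-3, factor=0.5, patience=2)
for epoch, loss in enumerate([1.0, 0.9, 0.9, 0.9, 0.9], start=1):
    print(f"epoch {epoch}: val_loss={loss} lr={scheduler.step(loss)}")
```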

### Common Issues:
- **Loss not decreasing**: Try a lower learning rate
- **Overfitting**: Increase dropout or weight decay
- **OOM errors**: Reduce the batch size or hidden dimension
- **NaN loss**: Reduce the learning rate and check that gradient clipping is enabled
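
To show what the `grad_clip` setting guards against, here is a minimal pure-Python sketch of clipping by global L2 norm. `clip_by_global_norm` is a hypothetical helper operating on plain lists. In a PyTorch loop you would call `torch.nn.utils.clip_grad_norm_`, which follows the same rule on the model's parameters.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    grads: list of per-parameter gradient lists (floats).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if not math.isfinite(total_norm):
        # A NaN or inf norm means the step should be skipped or investigated.
        raise ValueError("non-finite gradient norm")
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(f"norm before clipping: {norm:.1f}")
```

Logging the norm before clipping is a cheap way to spot instability early: a norm that keeps growing usually shows up before the loss turns into NaN.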

## Summary

You learned:
- Model architecture (backbone + multi-head LM)
- Training configuration
- How to run training and monitor it with W&B
- Checkpoint management
- Hyperparameter tuning tips

## Next Steps

Continue to **04_inference_prediction.ipynb** to use trained models for prediction.

## Troubleshooting

- **Import errors**: Set `PYTHONPATH=src`
- **Device errors**: Use CPU with `device='cpu'`, or enable the MPS fallback with `PYTORCH_ENABLE_MPS_FALLBACK=1`
- **W&B login**: Run `wandb login` in terminal