<a href="https://colab.research.google.com/github/sarahajbane/notebooks/blob/main/LSTM_V01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM Time Series Forecasting Tutorial: Apple Stock Price Prediction

This notebook is a step-by-step tutorial on using Long Short-Term Memory (LSTM) networks for time series forecasting. We'll predict Apple stock prices from historical closing prices and apply dropout, weight decay, early stopping, and learning rate scheduling to limit overfitting.

## Table of Contents
1. [Introduction to LSTM for Time Series](#introduction)
2. [Environment Setup](#setup)
3. [Data Preparation and Exploration](#data-preparation)
4. [Creating the LSTM Model Architecture](#lstm-architecture)
5. [Training and Validation Process](#training-process)
6. [Evaluation and Visualization](#evaluation)
7. [Model Improvements and Best Practices](#improvements)
8. [Conclusion](#conclusion)

<a id="introduction"></a>
## 1. Introduction to LSTM for Time Series

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture designed to work with sequential data. Unlike feed-forward networks, which treat each input independently, LSTMs carry a cell state across time steps and use internal mechanisms called gates to regulate the flow of information, allowing them to capture long-term dependencies in time series data.

### Why LSTM for Stock Prediction?

- **Memory capabilities**: LSTMs can "remember" patterns over long sequences
- **Resistance to vanishing gradients**: A common problem in traditional RNNs (exploding gradients can still occur and are usually handled with gradient clipping)
- **Ability to capture non-linear relationships**: Important for financial time series
- **Flexible architecture**: Can be configured for different prediction tasks

### LSTM Gates Explained

LSTMs contain three main gates that regulate information flow:

1. **Forget Gate**: Controls what information from the previous cell state should be discarded
   - Formula: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

2. **Input Gate**: Determines what new information should be stored in the cell state
   - Input gate layer: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
   - Candidate values: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

3. **Output Gate**: Controls what information from the cell state goes to the output
   - Formula: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

The cell state and hidden state updates combine these gates (`*` denotes element-wise multiplication):
- C_t = f_t * C_{t-1} + i_t * C̃_t
- h_t = o_t * tanh(C_t)
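
The gate equations above can be sketched as a single time step in NumPy. This is a minimal illustration, not how PyTorch implements `nn.LSTM` internally; `lstm_step`, the weight dictionaries `W` and `b`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W and b are dicts keyed by 'f', 'i', 'c', 'o'."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ concat + b['f'])       # forget gate
    i_t = sigmoid(W['i'] @ concat + b['i'])       # input gate
    c_tilde = np.tanh(W['c'] @ concat + b['c'])   # candidate values
    o_t = sigmoid(W['o'] @ concat + b['o'])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde            # cell state update
    h_t = o_t * np.tanh(c_t)                      # hidden state
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
hidden, features = 3, 1
W = {k: rng.normal(size=(hidden, hidden + features)) for k in 'fico'}
b = {k: np.zeros(hidden) for k in 'fico'}

# Run three time steps of a toy univariate sequence
h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([value]), h, c, W, b)
```

Because `h_t` is a sigmoid output multiplied by a `tanh` output, every entry of the hidden state stays strictly between -1 and 1.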

<a id="setup"></a>
## 2. Environment Setup

Let's import the necessary libraries for our implementation:

```python
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from copy import deepcopy as dc
import warnings
import os

# Suppress warnings
warnings.filterwarnings("ignore")

# Set device for computation
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Set random seeds for reproducibility
def set_seed(seed=42):
    """Set seeds for reproducibility."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seeds every visible GPU; safe to call without CUDA
    np.random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Set the seed
set_seed(42)
```

### Importance of Setting Seeds

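Setting random seeds makes training runs repeatable on the same hardware and library versions (some GPU operations can remain nondeterministic even with the cuDNN flags above). This is crucial for: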
- Debugging and troubleshooting
- Sharing results with others
- Comparing different model configurations fairly
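
As a quick check of what seeding buys you, the sketch below seeds Python's `random` module and NumPy, draws a few numbers, and repeats the process. The helper `draw` is hypothetical; `torch.manual_seed` plays the same role for PyTorch's generators.

```python
import random
import numpy as np

def draw(seed):
    """Seed Python and NumPy, then draw a few random numbers."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

py_a, np_a = draw(42)
py_b, np_b = draw(42)   # same seed: identical draws
py_c, np_c = draw(7)    # different seed: different draws

same_seed_matches = (py_a == py_b) and np.array_equal(np_a, np_b)
different_seed_matches = np.array_equal(np_a, np_c)
```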

<a id="data-preparation"></a>
## 3. Data Preparation and Exploration

Proper data preparation is critical for time series forecasting. We need to:
1. Load and visualize the time series data
2. Create lag features (using previous time steps to predict the next)
3. Normalize the data to a suitable range
4. Split into training and testing sets
5. Reshape for LSTM input requirements

```python
# Load and prepare the data
def load_and_prepare_data(n_steps=5, train_size=0.8):
    """
    Load and prepare time series data for LSTM model.
    
    Args:
        n_steps (int): Number of lag steps to use as features
        train_size (float): Proportion of data to use for training
        
    Returns:
        tuple: Prepared data and preprocessing objects
    """
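    # Load and process Apple stock data.
    # TimeConfig is a project-specific helper that is not defined or imported in
    # this notebook; it must be made available from the project's own code. The
    # rest of this function relies on it returning a date-indexed DataFrame with
    # a "Value" column.
    timec = TimeConfig()
    df = timec.process_chain(path='../Data/apple.csv', columns=['Close/Last'])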
    df = df.loc['2021':'2024'].copy()
    
    # Visualize the time series data
    plt.figure(figsize=(12, 6))
    df["Value"].plot(label="Apple Stock", color="orange")
    plt.title("Apple Stock Analysis")
    plt.xlabel("Date")
    plt.ylabel("Price")
    plt.gca().set_facecolor("#E8E8E4")
    plt.gcf().set_facecolor("#E8E8E4")
    plt.legend(facecolor="#E8E8E4")
    plt.savefig('apple_stock_data.png')  # Save the plot
    plt.close()
    
    # Create lag features (TransformLag is a project-specific helper that is
    # not defined or imported in this notebook)
    timef = TransformLag()
    data = timef.lag_transform(df, n_steps)
    
    # Convert to numpy array
    data = data.to_numpy()
    
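    # Chronological split point (rows before it are training data)
    split_index = int(len(data) * train_size)

    # Normalize to roughly [-1, 1]. The scaler is fit on the training rows only,
    # so test-set minima and maxima do not leak into training; test values may
    # therefore fall slightly outside [-1, 1].
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler.fit(data[:split_index])
    transformed = scaler.transform(data)

    # Split into features (X) and target (y)
    X = transformed[:, 1:]
    y = transformed[:, 0]

    # Reverse the lag columns so each sequence runs in chronological order
    # (this assumes lag_transform emits the most recent lag first)
    X = np.flip(X, axis=1).copy()

    # Split into training and testing sets without shuffling
    X_train = X[:split_index]
    X_test = X[split_index:]
    y_train = y[:split_index]
    y_test = y[split_index:]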
    
    # Reshape to match LSTM's expected input: [samples, sequence_length, features]
    X_train = X_train.reshape((-1, n_steps, 1))
    X_test = X_test.reshape((-1, n_steps, 1))
    y_train = y_train.reshape((-1, 1))
    y_test = y_test.reshape((-1, 1))
    
    # Convert to PyTorch tensors
    X_train = torch.tensor(X_train).float()
    X_test = torch.tensor(X_test).float()
    y_train = torch.tensor(y_train).float()
    y_test = torch.tensor(y_test).float()
    
    return (X_train, y_train, X_test, y_test, scaler, df)
```
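
To make the normalization step concrete, here is a minimal sketch of min-max scaling to [-1, 1] and its inverse, written out by hand with NumPy. The formula matches what `MinMaxScaler(feature_range=(-1, 1))` computes per column. The toy prices and the helpers `scale` and `unscale` are assumptions for illustration. The sketch takes the minimum and maximum from the training portion only, which is one way to keep test-set statistics out of training; test values can then land outside [-1, 1].

```python
import numpy as np

train = np.array([100.0, 120.0, 150.0, 200.0])
test = np.array([180.0, 220.0])

# Statistics come from the training portion only
lo, hi = train.min(), train.max()

def scale(x):
    """Map [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def unscale(z):
    """Inverse of scale: map [-1, 1] back onto [lo, hi]."""
    return (z + 1.0) / 2.0 * (hi - lo) + lo

train_scaled = scale(train)   # [-1.0, -0.6, 0.0, 1.0]
test_scaled = scale(test)     # [0.6, 1.4]; 220 exceeds the training maximum
recovered = unscale(test_scaled)
```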

### Understanding Time Series Feature Engineering

- **Lag Features**: Using previous time steps (t-1, t-2, ..., t-n) to predict the current value (t)
- **Number of Steps (n_steps)**: The lookback window or context window for prediction
- **Data Normalization**: Crucial for neural networks, as it:
  - Accelerates convergence
  - Prevents features with larger scales from dominating
  - Places all features on the same scale (-1 to 1 in our case)
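
A minimal sketch of how lag features can be built with NumPy. `make_lag_matrix` is a hypothetical helper, not the `TransformLag` class used above, whose implementation is not shown here. The column order (current value first, then t-1, t-2, ...) is an assumption chosen to match how the function above slices `X` and `y`.

```python
import numpy as np

def make_lag_matrix(values, n_steps):
    """Column 0 holds the value at t; columns 1..n_steps hold t-1 .. t-n_steps."""
    values = np.asarray(values, dtype=float)
    rows = [values[t - np.arange(n_steps + 1)] for t in range(n_steps, len(values))]
    return np.stack(rows)

prices = [10.0, 11.0, 12.0, 13.0, 14.0]
lagged = make_lag_matrix(prices, n_steps=2)
# lagged rows: [12, 11, 10], [13, 12, 11], [14, 13, 12]

y = lagged[:, 0]             # target: value at t
X = lagged[:, 1:][:, ::-1]   # lags reordered oldest first: [t-2, t-1]
```

The first `n_steps` observations have no complete history, so five prices with `n_steps=2` yield three samples.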

### Custom Dataset for PyTorch

To efficiently feed data to our LSTM model, we create a custom PyTorch Dataset:

```python
# Custom Dataset class for time series
class TimeSeries(Dataset):
    """Custom PyTorch Dataset for time series data."""
    def __init__(self, X, y):
        self.X = X
        self.y = y
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, i):
        return self.X[i], self.y[i]
```
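
The sketch below shows what a `DataLoader` yields when it wraps a dataset of this kind. `WindowDataset` repeats the pattern of the class above under an illustrative name so the block runs on its own, and the tensors are synthetic stand-ins for the real windows.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WindowDataset(Dataset):
    """Same pattern as the TimeSeries class; the name is illustrative."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

# Synthetic data: 10 windows of 5 time steps with 1 feature each
X = torch.arange(50, dtype=torch.float32).reshape(10, 5, 1)
y = torch.arange(10, dtype=torch.float32).reshape(10, 1)

loader = DataLoader(WindowDataset(X, y), batch_size=4, shuffle=False)
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
# Two full batches of 4 windows, then a final batch with the remaining 2
```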

<a id="lstm-architecture"></a>
## 4. Creating the LSTM Model Architecture

Now we'll define our LSTM model. The architecture consists of:
1. One or more LSTM layers
2. Dropout for regularization
3. A fully connected (Linear) output layer

```python
# Define the LSTM model
class LSTM(nn.Module):
    """
    Long Short-Term Memory (LSTM) network for time series forecasting.
    
    Implements a stacked LSTM with dropout for regularization and
    a fully connected layer as the output layer.
    """
    def __init__(self, input_size, hidden_size, num_stacked_layers, dropout_rate=0.2):
        """
        Initialize the LSTM model.
        
        Args:
            input_size (int): Number of input features (typically 1 for univariate time series)
            hidden_size (int): Number of features in the hidden state
            num_stacked_layers (int): Number of stacked LSTM layers
            dropout_rate (float): Dropout rate for regularization
        """
        super().__init__()
        self.hidden_size = hidden_size
        self.num_stacked_layers = num_stacked_layers
        
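        # nn.LSTM applies dropout between stacked layers only, so it has no
        # effect (and triggers a warning) when there is a single layer
        self.lstm = nn.LSTM(input_size, hidden_size, num_stacked_layers,
                            batch_first=True,
                            dropout=dropout_rate if num_stacked_layers > 1 else 0.0)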
        
        # Output fully connected layer
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        """
        Forward pass through the LSTM.
        
        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, sequence_length, features]
            
        Returns:
            torch.Tensor: Output predictions
        """
        batch_size = x.size(0)
        
        # Initialize hidden state and cell state with zeros on the input's device
        h0 = torch.zeros(self.num_stacked_layers, batch_size, self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_stacked_layers, batch_size, self.hidden_size, device=x.device)
        
        # Pass input through LSTM layers
        out, _ = self.lstm(x, (h0, c0))
        
        # Use the final time step's output for prediction
        out = self.fc(out[:, -1, :])
        return out
```

### Key Components of the LSTM Model

- **input_size**: The number of features per time step (1 for univariate time series)
- **hidden_size**: The number of features in the hidden state (higher = more capacity)
- **num_stacked_layers**: Number of LSTM layers stacked on top of each other
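- **dropout_rate**: Probability of zeroing activations during training (regularization); `nn.LSTM` applies it to the outputs of every layer except the last, so it only takes effect when `num_stacked_layers > 1`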
- **batch_first=True**: Input shape is [batch_size, sequence_length, features]
- **h0 and c0**: Initial hidden and cell states (initialized to zeros)
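
To see how these settings shape the tensors, the following sketch passes a random batch through a standalone `nn.LSTM` and a linear layer. The sizes (batch of 4, 5 steps, hidden size 8) are arbitrary choices for illustration, not the values used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=2,
               batch_first=True, dropout=0.3)
fc = nn.Linear(8, 1)
lstm.eval()  # disable dropout so the comparison below is deterministic

x = torch.randn(4, 5, 1)          # [batch_size, sequence_length, features]
out, (h_n, c_n) = lstm(x)         # omitting (h0, c0) defaults both to zeros

last_step = out[:, -1, :]         # top layer's output at the final time step
pred = fc(last_step)

shapes = {
    'out': tuple(out.shape),      # (4, 5, 8): one hidden vector per time step
    'h_n': tuple(h_n.shape),      # (2, 4, 8): final hidden state of each layer
    'pred': tuple(pred.shape),    # (4, 1): one prediction per sequence
}
```

For a unidirectional LSTM, `out[:, -1, :]` is the same tensor as the last layer's final hidden state `h_n[-1]`, which is why the model can feed either one to the output layer.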

<a id="training-process"></a>
## 5. Training and Validation Process

Training an LSTM model requires:
1. A training function for each epoch
2. A validation function to monitor performance
3. Mechanisms to prevent overfitting (early stopping, learning rate scheduling)

```python
# Training function
def train_one_epoch(model, train_loader, optimizer, loss_function, device):
    """
    Train the model for one epoch.
    
    Args:
        model (nn.Module): The neural network model
        train_loader (DataLoader): DataLoader for training data
        optimizer (Optimizer): Optimizer for updating model weights
        loss_function (Loss): Loss function for training
        device (str): Device to perform computations on ('cpu' or 'cuda')
        
    Returns:
        float: Average training loss for the epoch
    """
    model.train()  # Set model to training mode
    running_loss = 0.0
    total_samples = 0

    for batch in train_loader:
        x_batch, y_batch = batch[0].to(device), batch[1].to(device)
        batch_size = x_batch.size(0)
        
        # Forward pass
        optimizer.zero_grad()  # Clear gradients
        output = model(x_batch)
        loss = loss_function(output, y_batch)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        # Accumulate loss
        running_loss += loss.item() * batch_size
        total_samples += batch_size

    # Calculate average loss
    avg_train_loss = running_loss / total_samples
    return avg_train_loss

# Validation function
def validate(model, val_loader, loss_function, device):
    """
    Validate the model.
    
    Args:
        model (nn.Module): The neural network model
        val_loader (DataLoader): DataLoader for validation data
        loss_function (Loss): Loss function for evaluation
        device (str): Device to perform computations on ('cpu' or 'cuda')
        
    Returns:
        tuple: Average validation loss and model predictions
    """
    model.eval()  # Set model to evaluation mode
    running_loss = 0.0
    total_samples = 0
    all_predictions = []
    all_targets = []

    with torch.no_grad():  # Disable gradient calculation
        for batch in val_loader:
            x_batch, y_batch = batch[0].to(device), batch[1].to(device)
            batch_size = x_batch.size(0)
            
            # Forward pass
            output = model(x_batch)
            loss = loss_function(output, y_batch)
            
            # Accumulate loss
            running_loss += loss.item() * batch_size
            total_samples += batch_size
            
            # Store predictions and targets for metrics calculation
            all_predictions.append(output.cpu().numpy())
            all_targets.append(y_batch.cpu().numpy())

    # Concatenate all batches
    all_predictions = np.concatenate(all_predictions)
    all_targets = np.concatenate(all_targets)
    
    # Calculate average loss
    avg_val_loss = running_loss / total_samples
    return avg_val_loss, all_predictions, all_targets
```
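
Both functions above multiply each batch loss by the batch size before averaging. `nn.MSELoss` returns the mean over a batch, so a plain average of batch losses would over-weight a small final batch. The numbers below are made up to show the difference.

```python
# (batch_mean_loss, batch_size) pairs; the last batch is smaller than the rest
batches = [(0.10, 32), (0.20, 32), (0.90, 4)]

# Plain average of batch means treats the 4-sample batch like a full one
naive_mean = sum(loss for loss, _ in batches) / len(batches)

# Weighting by batch size recovers the true per-sample average
total_samples = sum(size for _, size in batches)
weighted_mean = sum(loss * size for loss, size in batches) / total_samples
```

Here the plain average is 0.4, while the per-sample average is 13.2 / 68, roughly 0.194.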

### Understanding torch.no_grad()

The `with torch.no_grad():` context manager is crucial during validation:

- **Purpose**: Temporarily disables gradient calculation
- **Benefits**:
  - Reduces memory consumption (no gradients stored)
  - Speeds up computation (no backward pass preparation)
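  - Ensures evaluation batches build no computation graph, so they cannot contribute gradients; it does not replace `model.eval()`, which is what switches off dropout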
- **When to use**: Always in validation and inference, never in training
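
A small sketch of the two switches used during validation, on a toy model that is an assumption for this example: `model.eval()` changes layer behavior (dropout off), while `torch.no_grad()` stops autograd from recording operations.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
x = torch.ones(2, 3)

# Training mode, gradients enabled: the output is attached to a graph
train_out = model(x)

# Evaluation mode, gradients disabled
model.eval()
with torch.no_grad():
    eval_out = model(x)
    eval_out_again = model(x)

tracks_grad_in_training = train_out.requires_grad   # True
tracks_grad_in_eval = eval_out.requires_grad        # False
```

With dropout disabled by `model.eval()`, the two evaluation passes return identical outputs.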

### Complete Training Loop with Early Stopping

Now we implement the full training loop with early stopping and learning rate scheduling:

```python
# Main training function with early stopping
def train_model(model, X_train, y_train, X_val, y_val, batch_size=32,
                learning_rate=0.001, max_epochs=100, patience=10,
                min_delta=0.001, enable_scheduler=True):
    """
    Train the LSTM model with early stopping and learning rate scheduling.
    
    Args:
        model (nn.Module): The neural network model
        X_train (torch.Tensor): Training features
        y_train (torch.Tensor): Training targets
        X_val (torch.Tensor): Validation features
        y_val (torch.Tensor): Validation targets
        batch_size (int): Batch size for training
        learning_rate (float): Initial learning rate
        max_epochs (int): Maximum number of epochs to train
        patience (int): Number of epochs to wait for improvement before stopping
        min_delta (float): Minimum change in validation loss to qualify as improvement
        enable_scheduler (bool): Whether to use learning rate scheduling
        
    Returns:
        tuple: Trained model, training history, best validation loss
    """
    # Create DataLoaders
    train_dataset = TimeSeries(X_train, y_train)
    val_dataset = TimeSeries(X_val, y_val)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    # Define loss function and optimizer
    loss_function = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
    
    # Learning rate scheduler (the `verbose` argument is deprecated in recent
    # PyTorch releases, so it is omitted here)
    scheduler = None
    if enable_scheduler:
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=patience // 3,
            min_lr=1e-6
        )
    
    # Initialize variables for early stopping
    best_val_loss = float('inf')
    best_model_state = None
    counter = 0
    train_losses = []
    val_losses = []
    
    # Training loop
    for epoch in range(max_epochs):
        # Train one epoch
        train_loss = train_one_epoch(model, train_loader, optimizer, loss_function, device)
        
        # Validate
        val_loss, _, _ = validate(model, val_loader, loss_function, device)
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        
        # Print progress
        print(f'Epoch {epoch+1}/{max_epochs}, Train Loss: {train_loss:.6f}, Val Loss: {val_loss:.6f}')
        
        # Update learning rate if scheduler is enabled
        if scheduler is not None:
            scheduler.step(val_loss)
        
        # Check for improvement
        if val_loss < best_val_loss - min_delta:
            best_val_loss = val_loss
            best_model_state = dc(model.state_dict())
            counter = 0
        else:
            counter += 1
        
        # Early stopping
        if counter >= patience:
            print(f'Early stopping triggered at epoch {epoch+1}')
            break
    
    # Load best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
        
    return model, {'train_losses': train_losses, 'val_losses': val_losses}, best_val_loss
```

### Anti-Overfitting Techniques

This implementation includes several techniques to prevent overfitting:

1. **Early Stopping**: Stops training when validation loss stops improving
   - **patience**: Number of epochs to wait for improvement
   - **min_delta**: Minimum improvement threshold

2. **Learning Rate Scheduling**: Reduces learning rate when progress plateaus
   - Uses ReduceLROnPlateau scheduler
   - Helps fine-tune model weights when nearing convergence

3. **Weight Decay**: L2 regularization penalty on model weights
   - Implemented through the optimizer (weight_decay=1e-5)
   - Discourages large weight values

4. **Dropout**: Randomly zeroes activations during training
   - Applied between the stacked LSTM layers through the `dropout` argument of `nn.LSTM`
   - Forces the network to learn redundant representations
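
The early stopping rule from the training loop can be isolated in a few lines of plain Python. `early_stopping_epoch` is a hypothetical helper that replays the same `patience` and `min_delta` logic over a list of validation losses; the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.001):
    """Return (stop_epoch, best_loss) under the training loop's rule:
    a loss counts as an improvement only if it beats the best by more than min_delta."""
    best = float('inf')
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            counter = 0
        else:
            counter += 1
        if counter >= patience:
            return epoch, best
    return len(val_losses), best

losses = [0.50, 0.30, 0.20, 0.1995, 0.1993, 0.21, 0.19]
stop_epoch, best = early_stopping_epoch(losses, patience=3, min_delta=0.001)
# Epochs 4 to 6 fail to beat 0.20 by more than 0.001, so training stops at epoch 6
```

Note that epochs 4 and 5 are lower than 0.20 but do not count, because the gain is smaller than `min_delta`. When losses are already small, a `min_delta` of 0.001 can be a demanding threshold.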

<a id="evaluation"></a>
## 6. Evaluation and Visualization

After training, we need to evaluate model performance and visualize results:

```python
# Inverse transform predictions back to original scale
def inverse_transform_predictions(predictions, targets, scaler):
    """
    Transform normalized predictions and targets back to original scale.
    
    Args:
        predictions (numpy.ndarray): Model predictions in normalized scale
        targets (numpy.ndarray): Actual targets in normalized scale
        scaler (MinMaxScaler): Scaler used for normalization
        
    Returns:
        tuple: Predictions and targets in original scale
    """
    # Flatten if needed
    pred_flat = predictions.flatten()
    target_flat = targets.flatten()
    
    # Create dummy arrays with as many columns as the scaler was fit on
    # (the target plus n_steps lag columns); only column 0, the target, is used
    n_features = scaler.n_features_in_
    pred_dummy = np.zeros((len(pred_flat), n_features))
    pred_dummy[:, 0] = pred_flat

    target_dummy = np.zeros((len(target_flat), n_features))
    target_dummy[:, 0] = target_flat

    # Inverse transform
    pred_inverted = scaler.inverse_transform(pred_dummy)[:, 0]
    target_inverted = scaler.inverse_transform(target_dummy)[:, 0]
    
    return pred_inverted, target_inverted

# Calculate performance metrics
def calculate_metrics(predictions, targets):
    """
    Calculate regression performance metrics.
    
    Args:
        predictions (numpy.ndarray): Model predictions
        targets (numpy.ndarray): Actual targets
        
    Returns:
        dict: Dictionary of performance metrics
    """
    mse = mean_squared_error(targets, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(targets, predictions)
    r2 = r2_score(targets, predictions)
    
    # Calculate MAPE (Mean Absolute Percentage Error); the small constant guards
    # against division by zero
    mape = np.mean(np.abs(targets - predictions) / (np.abs(targets) + 1e-10)) * 100
    
    return {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'MAPE': mape
    }

# Plotting function for actual vs predicted values
def plot_predictions(predictions, actual, title, filename):
    """
    Plot actual vs. predicted values.
    
    Args:
        predictions (numpy.ndarray): Model predictions
        actual (numpy.ndarray): Actual values
        title (str): Plot title
        filename (str): File to save the plot to
    """
    plt.figure(figsize=(12, 6))
    plt.plot(actual, label='Actual', color='blue')
    plt.plot(predictions, label='Predicted', color='red')
    plt.title(title)
    plt.xlabel('Time Steps')
    plt.ylabel('Stock Price')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(filename)
    plt.close()

# Plot learning curves
def plot_learning_curves(train_losses, val_losses, filename):
    """
    Plot training and validation loss curves.
    
    Args:
        train_losses (list): Training losses per epoch
        val_losses (list): Validation losses per epoch
        filename (str): File to save the plot to
    """
    plt.figure(figsize=(10, 6))
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, 'b-', label='Training Loss')
    plt.plot(epochs, val_losses, 'r-', label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(filename)
    plt.close()
```

### Key Evaluation Metrics for Time Series

For time series forecasting, these metrics provide valuable insights:

1. **MSE (Mean Squared Error)**: Average of squared differences between predictions and actual values
2. **RMSE (Root Mean Squared Error)**: Square root of MSE, in the same unit as the original data
3. **MAE (Mean Absolute Error)**: Average of absolute differences, less sensitive to outliers than MSE
4. **R² (R-squared)**: Proportion of variance explained by the model (higher is better)
5. **MAPE (Mean Absolute Percentage Error)**: Average percentage error, scale-independent
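
A worked example on four made-up price points shows how the metrics relate to one another. The values are chosen so the arithmetic is easy to follow by hand; they are not results from this model.

```python
import numpy as np

targets = np.array([100.0, 200.0, 300.0, 400.0])
predictions = np.array([110.0, 190.0, 310.0, 380.0])

errors = predictions - targets                          # [10, -10, 10, -20]
mse = np.mean(errors ** 2)                              # 175.0
rmse = np.sqrt(mse)                                     # about 13.23
mae = np.mean(np.abs(errors))                           # 12.5
mape = np.mean(np.abs(errors) / np.abs(targets)) * 100  # about 5.83 percent

ss_res = np.sum(errors ** 2)                            # 700.0
ss_tot = np.sum((targets - targets.mean()) ** 2)        # 50000.0
r2 = 1.0 - ss_res / ss_tot                              # 0.986
```

RMSE exceeds MAE here because the single error of 20 is squared before averaging, which is the outlier sensitivity mentioned above.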

### Visualizing Results

Visual analysis helps understand model performance:

1. **Learning Curves**: Plot training vs. validation loss to detect overfitting
2. **Predictions vs. Actual**: Compare predicted values with actual values
3. **Residual Analysis**: Examine the difference between predicted and actual values
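
Residual analysis is not implemented elsewhere in this notebook, so here is a minimal sketch on made-up values. The three summary statistics are one reasonable choice, not the only one.

```python
import numpy as np

actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 380.0])

residuals = actual - predicted     # [-10, 10, -10, 20]
bias = residuals.mean()            # 2.5: the model under-predicts on average
spread = residuals.std()           # typical size of the errors around the bias

# Correlation between consecutive residuals; values far from zero suggest
# structure in the series that the model has not captured
lag1_autocorr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
```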

<a id="improvements"></a>
## 7. Model Improvements and Best Practices

Here's the main execution function that brings everything together:

```python
# Main execution
def main():
    """Main execution function."""
    # Configuration
    n_steps = 5  # Increased from 1 to 5 for more context
    batch_size = 32
    hidden_size = 50  # Increased from 4
    num_stacked_layers = 2  # Increased from 1
    dropout_rate = 0.3  # Increased from 0.05
    learning_rate = 0.001
    max_epochs = 100
    patience = 10
    
    # Create output directory
    os.makedirs('outputs', exist_ok=True)
    
    # Load and prepare data
    X_train, y_train, X_test, y_test, scaler, df = load_and_prepare_data(n_steps=n_steps)
    
    # Print dataset information
    print("Dataset Info:")
    print(f"X_train shape: {X_train.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}")
    print(f"y_test shape: {y_test.shape}")
    
    # Initialize model
    model = LSTM(
        input_size=1,
        hidden_size=hidden_size,
        num_stacked_layers=num_stacked_layers,
        dropout_rate=dropout_rate
    ).to(device)
    
    # Print model architecture
    print("Model Architecture:")
    print(model)
    
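    # Hold out the most recent 10% of the training window for validation so that
    # early stopping and the scheduler never see the test set (the 90/10 ratio
    # is an assumption; adjust as needed)
    val_start = int(len(X_train) * 0.9)
    X_fit, y_fit = X_train[:val_start], y_train[:val_start]
    X_val, y_val = X_train[val_start:], y_train[val_start:]

    # Train model
    print("Training model...")
    trained_model, history, best_val_loss = train_model(
        model, X_fit, y_fit, X_val, y_val,
        batch_size=batch_size,
        learning_rate=learning_rate,
        max_epochs=max_epochs,
        patience=patience
    )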
    
    # Plot learning curves
    plot_learning_curves(
        history['train_losses'],
        history['val_losses'],
        'outputs/learning_curves.png'
    )
    
    # Evaluate model
    print("Evaluating model...")
    # Create DataLoaders for evaluation
    train_dataset = TimeSeries(X_train, y_train)
    test_dataset = TimeSeries(X_test, y_test)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Loss function
    loss_function = nn.MSELoss()
    
    # Evaluate on training set
    train_loss, train_pred, train_targets = validate(trained_model, train_loader, loss_function, device)
    
    # Evaluate on test set
    test_loss, test_pred, test_targets = validate(trained_model, test_loader, loss_function, device)
    
    # Transform predictions back to original scale
    train_pred_orig, train_targets_orig = inverse_transform_predictions(
        train_pred, train_targets, scaler)
    test_pred_orig, test_targets_orig = inverse_transform_predictions(
        test_pred, test_targets, scaler)
    
    # Calculate metrics
    train_metrics = calculate_metrics(train_pred_orig, train_targets_orig)
    test_metrics = calculate_metrics(test_pred_orig, test_targets_orig)
    
    # Plot predictions
    plot_predictions(train_pred_orig, train_targets_orig,
                    'Training: Actual vs Predicted', 'outputs/train_predictions.png')
    plot_predictions(test_pred_orig, test_targets_orig,
                    'Test: Actual vs Predicted', 'outputs/test_predictions.png')
    
    # Print evaluation metrics
    print("\nTraining Metrics:")
    for metric, value in train_metrics.items():
        print(f"{metric}: {value:.4f}")
    
    print("\nTest Metrics:")
    for metric, value in test_metrics.items():
        print(f"{metric}: {value:.4f}")
    
    # Save trained model
    torch.save(trained_model.state_dict(), 'outputs/lstm_model.pth')
    print("Model saved to outputs/lstm_model.pth")
    
    print("Done!")

if __name__ == "__main__":
    main()
```

### Best Practices for LSTM Time Series Models

1. **Increase Context Window**: Using n_steps=5 instead of just 1 provides more historical context
2. **Proper Model Sizing**:
   - Increased hidden_size from 4 to 50
   - Increased num_stacked_layers from 1 to 2
3. **Strong Regularization**:
   - Increased dropout from 0.05 to 0.3
   - Added weight decay in optimizer
4. **Smart Training Process**:
   - Early stopping with patience=10
   - Learning rate scheduling
   - Proper batch size (32)
5. **Comprehensive Evaluation**:
   - Multiple metrics (RMSE, MAE, R², MAPE)
   - Visualization tools

<a id="conclusion"></a>
## 8. Conclusion

### Key Takeaways

1. **LSTM Architecture**: Powerful for time series due to its memory capabilities and gate mechanisms
2. **Data Preparation**: Critical steps include normalization, lag features, and proper reshaping
3. **Overfitting Prevention**: Multiple techniques including dropout, early stopping, and L2 regularization
4. **Evaluation**: Always evaluate on both training and test sets with multiple metrics

### Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Overfitting | Increase dropout, add weight decay, implement early stopping |
| Underfitting | Increase model capacity (hidden size, layers), add more features |
| Vanishing Gradients | Use LSTM instead of basic RNN, ensure proper weight initialization |
| Training Instability | Reduce learning rate, clip gradients, normalize inputs |
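
Gradient clipping appears in the table but not in the training code above. The sketch below shows where it would go in a training step, using a toy linear model and deliberately large inputs. The model, the data, and `max_norm=1.0` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.MSELoss()

x = torch.randn(8, 5) * 100.0   # unnormalized inputs produce large gradients
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = loss_function(model(x), y)
loss.backward()

# Clip between backward() and step(); the call returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
```

In `train_one_epoch`, the equivalent change would be a single `clip_grad_norm_` call placed between `loss.backward()` and `optimizer.step()`.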
