# LSTM-Based Anomaly Detection

## Overview
This notebook implements an LSTM (Long Short-Term Memory) autoencoder for anomaly detection based on reconstruction error. The model learns to reconstruct normal patterns, and samples it reconstructs poorly are flagged as anomalies.

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- GPU access (recommended)
- PyTorch 2025.1 (TensorFlow/Keras is not required; the code below uses only `torch`)
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies takes months/years
- Different anomaly types are hard to capture
- Can't deliberately cause failures to collect data

### The Solution: Synthetic Anomalies
We generate synthetic anomalies because:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Control class balance (for example 50% normal, 50% anomaly)
- ✅ Reproducible and testable
- ✅ Models trained on synthetic data can generalize to real anomalies when the synthetic patterns resemble real failures
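
As a rough illustration of controlling anomaly count and severity, the sketch below shifts randomly chosen points by a multiple of the series' standard deviation and records a label for each. The function name `inject_point_anomalies`, its parameters, and the toy sine-wave metric are hypothetical and are not taken from `synthetic-anomaly-generation.ipynb`.

```python
import numpy as np

def inject_point_anomalies(values, n_anomalies, severity=3.0, seed=0):
    """Return a copy of `values` with shifted points and matching 0/1 labels."""
    rng = np.random.default_rng(seed)
    shifted = np.asarray(values, dtype=float).copy()
    labels = np.zeros(len(shifted), dtype=int)
    std = shifted.std()  # measured before any point is modified
    idx = rng.choice(len(shifted), size=n_anomalies, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_anomalies)
    shifted[idx] += severity * std * signs
    labels[idx] = 1
    return shifted, labels

# Hypothetical metric: a noisy sine wave with 200 points.
rng = np.random.default_rng(1)
clean = 50 + 10 * np.sin(np.linspace(0, 4 * np.pi, 200)) + rng.normal(0, 2, 200)
noisy, labels = inject_point_anomalies(clean, n_anomalies=10, severity=3.0)
```

Raising `severity` makes the injected points easier to detect, and changing `n_anomalies` sets the class balance.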

### Machine Learning Best Practice
The autoencoder itself trains without labels, but labeled data is needed to set the detection threshold and to evaluate the result. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Balanced Classes**: Equal normal and anomaly samples
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns rather than memorizing examples

## Learning Objectives
- Build LSTM autoencoder architecture trained on synthetic data
- Train on GPU for efficiency
- Use reconstruction error for anomaly detection
- Optimize hyperparameters
- Evaluate deep learning model performance with labeled data

## Key Concepts
- **LSTM**: Recurrent neural network for sequence learning
- **Autoencoder**: Learns compressed representation of normal data
- **Reconstruction Error**: Difference between input and reconstructed output
- **GPU Acceleration**: Training on NVIDIA GPUs for speed
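
To make the reconstruction-error idea concrete, here is a minimal NumPy sketch. The toy arrays and the threshold of `1.0` are hypothetical values chosen for illustration; the notebook derives its own threshold from the data.

```python
import numpy as np

def reconstruction_errors(x, x_hat):
    """Per-sample mean squared error between inputs and reconstructions."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return ((x - x_hat) ** 2).mean(axis=1)

# Toy data: three samples with two features each (hypothetical values).
x = np.array([[0.0, 1.0], [1.0, 1.0], [5.0, -5.0]])
x_hat = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]])

errors = reconstruction_errors(x, x_hat)  # one error value per sample
flags = errors > 1.0                      # hypothetical threshold
```

Only the third sample, which the toy reconstruction misses by a wide margin, exceeds the threshold.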

## References

### Why Synthetic Data for Training?
- **He & Garcia (2009)**: "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- **Nikolenko (2021)**: "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- **Goldstein & Uchida (2016)**: "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

### LSTM and Deep Learning for Anomaly Detection
- **Malhotra et al. (2016)**: "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection" - https://arxiv.org/abs/1607.00148
- **Hochreiter & Schmidhuber (1997)**: "Long Short-Term Memory" - Classic LSTM paper
- **Goodfellow et al. (2016)**: "Deep Learning" - Comprehensive deep learning reference

### Key Takeaway
Synthetic data provides labeled training examples that allow us to:
1. Train deep learning models with known ground truth
2. Evaluate performance with precision, recall, and F1 scores
3. Ensure reproducible and testable results
4. Build models that generalize to real-world anomalies

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import pickle
import logging
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger.info(f"Using device: {device}")
if torch.cuda.is_available():
    logger.info(f"GPU: {torch.cuda.get_device_name(0)}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'

# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')

# Create KServe-compatible subdirectory structure
MODEL_NAME = 'lstm-predictor'  # Separate model name from anomaly-detector
MODEL_DIR = MODELS_DIR / MODEL_NAME
MODEL_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Models directory: {MODEL_DIR}")

## Implementation Section

### 1. Load and Prepare Data

In [None]:
# Load or generate synthetic data
data_file = PROCESSED_DIR / 'synthetic_anomalies.parquet'

if data_file.exists():
    df = pd.read_parquet(data_file)
    logger.info(f"Loaded existing data: {df.shape}")
else:
    logger.info("Synthetic data not found - generating for validation...")
    from datetime import datetime, timedelta
    np.random.seed(42)
    n_points = 1000
    n_features = 5
    
    start_time = datetime.now() - timedelta(days=30)
    timestamps = [start_time + timedelta(minutes=i) for i in range(n_points)]
    
    data = {}
    for i in range(n_features):
        trend = np.linspace(50, 60, n_points)
        seasonal = 10 * np.sin(np.linspace(0, 4*np.pi, n_points))
        noise = np.random.normal(0, 2, n_points)
        data[f'metric_{i}'] = trend + seasonal + noise
    
    df = pd.DataFrame(data)
    df['timestamp'] = timestamps
    df['label'] = 0
    
    anomaly_indices = np.random.choice(len(df), 50, replace=False)
    for idx in anomaly_indices:
        features = np.random.choice(n_features, 2, replace=False)
        for feat in features:
            col = f'metric_{feat}'
            std = df[col].std()
            df.loc[idx, col] += 3.0 * std * np.random.choice([-1, 1])
        df.loc[idx, 'label'] = 1
    
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    df.to_parquet(data_file)
    logger.info(f"Generated and saved synthetic data: {df.shape}")

# Extract features (exclude timestamp and label)
feature_cols = [col for col in df.columns if col.startswith('metric_')]
X = df[feature_cols].values
y = df['label'].values

# Normalize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Note: the scaler is saved together with the model in a pipeline (see section 5, Save Model)

logger.info(f"Data shape: {X_scaled.shape}")
logger.info(f"Features: {feature_cols}")

### 2. Define LSTM Autoencoder

In [None]:
class LSTMAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size=32, num_layers=2):
        super(LSTMAutoencoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Encoder
        self.encoder = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        
        # Decoder
        self.decoder = nn.LSTM(hidden_size, input_size, num_layers, batch_first=True)
    
    def forward(self, x):
        # Encode
        encoded, _ = self.encoder(x)
        # Decode
        decoded, _ = self.decoder(encoded)
        return decoded

# Create model
model = LSTMAutoencoder(input_size=len(feature_cols), hidden_size=32, num_layers=2)
model = model.to(device)
logger.info(f"Model created: {model}")
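
One property of the decoder above is worth noting: an `nn.LSTM` emits hidden states of the form `o * tanh(c)`, so every output value lies strictly between -1 and 1, while standardized features regularly fall outside that range. A common remedy is a linear output layer after the decoder. The sketch below is a hypothetical variant, not the architecture this notebook trains or saves; the class name `LSTMAutoencoderWithHead` and the `output_layer` attribute are assumptions.

```python
import torch
import torch.nn as nn

class LSTMAutoencoderWithHead(nn.Module):
    """Hypothetical variant: LSTM encoder/decoder plus a linear output layer."""

    def __init__(self, input_size, hidden_size=32, num_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        # The linear layer maps hidden states back to feature space without a bound.
        self.output_layer = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        encoded, _ = self.encoder(x)
        decoded, _ = self.decoder(encoded)
        return self.output_layer(decoded)

torch.manual_seed(0)
x = torch.randn(4, 1, 5) * 3.0  # toy batch: (batch, seq_len, features)

# A bare LSTM used as the output stage, for comparison: its values stay inside (-1, 1).
plain_decoder = nn.LSTM(32, 5, 2, batch_first=True)
bounded, _ = plain_decoder(torch.randn(4, 1, 32))

variant = LSTMAutoencoderWithHead(input_size=5)
out = variant(x)  # same shape as x, values not restricted to (-1, 1)
```

Adopting a variant like this would change the saved `state_dict` keys, so the `model_config` and any previously saved pipeline would need to be regenerated.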

### 3. Train Model

In [None]:
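# Prepare data for training
# Each row becomes a one-step sequence: shape (n_samples, 1, n_features).
# Note: all rows are used here, including those labeled as anomalies.
X_tensor = torch.FloatTensor(X_scaled).unsqueeze(1)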
dataset = TensorDataset(X_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Training setup
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
epochs = 10

# Train
logger.info("Starting training...")
for epoch in range(epochs):
    total_loss = 0
    for batch in dataloader:
        X_batch = batch[0].to(device)
        
        # Forward pass
        output = model(X_batch)
        loss = criterion(output, X_batch)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(dataloader)
    if (epoch + 1) % 2 == 0:
        logger.info(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

logger.info("Training complete")
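
The training cell gives every sample a sequence length of 1, so the LSTM never sees more than one time step at once. If temporal context is wanted, one option is to feed overlapping windows of consecutive rows instead. The helper below is a minimal sketch; `make_windows` and `window_size` are hypothetical names and are not used elsewhere in this notebook.

```python
import numpy as np

def make_windows(data, window_size):
    """Stack overlapping windows.

    Input shape:  (n_samples, n_features)
    Output shape: (n_samples - window_size + 1, window_size, n_features)
    """
    data = np.asarray(data)
    n_windows = data.shape[0] - window_size + 1
    if n_windows <= 0:
        raise ValueError("window_size exceeds the number of samples")
    return np.stack([data[i:i + window_size] for i in range(n_windows)])

# Toy series: 6 time steps, 2 features.
series = np.arange(12, dtype=float).reshape(6, 2)
windows = make_windows(series, window_size=3)
```

The resulting array already has the `(batch, seq_len, features)` layout expected by an `nn.LSTM` created with `batch_first=True`, so it could replace the `unsqueeze(1)` step. Labels would then need to be aligned per window, for example by taking the label of each window's last row.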

### 4. Detect Anomalies

In [None]:
# Get reconstruction errors
model.eval()
with torch.no_grad():
    X_tensor_full = torch.FloatTensor(X_scaled).unsqueeze(1).to(device)
    reconstructed = model(X_tensor_full)
    errors = torch.mean((X_tensor_full - reconstructed) ** 2, dim=(1, 2)).cpu().numpy()

# Determine threshold (95th percentile of normal data)
normal_mask = y == 0
threshold = np.percentile(errors[normal_mask], 95)

# Predict anomalies
lstm_preds = (errors > threshold).astype(int)

logger.info(f"Threshold: {threshold:.4f}")
logger.info(f"Detected {lstm_preds.sum()} anomalies")

# Evaluate
precision = precision_score(y, lstm_preds, zero_division=0)
recall = recall_score(y, lstm_preds, zero_division=0)
f1 = f1_score(y, lstm_preds, zero_division=0)
print(f"LSTM Performance: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
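
The cell above picks the threshold and computes the metrics on the same samples, which tends to give optimistic scores. A common alternative is to choose the threshold on one split and score on another. The sketch below assumes simulated error values drawn from gamma distributions; they are hypothetical and are not results from this notebook. The helper names `precision_recall_f1` and `split_threshold_eval` are also assumptions.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for labels where 1 marks an anomaly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def split_threshold_eval(errors, labels, percentile=95.0, calib_fraction=0.5, seed=0):
    """Pick the threshold on a calibration split, then score on the remaining samples."""
    errors = np.asarray(errors, dtype=float)
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(errors))
    n_calib = int(len(errors) * calib_fraction)
    calib, held_out = order[:n_calib], order[n_calib:]
    # The threshold uses only normal samples from the calibration split.
    threshold = np.percentile(errors[calib][labels[calib] == 0], percentile)
    preds = (errors[held_out] > threshold).astype(int)
    return threshold, precision_recall_f1(labels[held_out], preds)

# Simulated errors (hypothetical distributions, not output of the model above).
rng = np.random.default_rng(42)
sim_errors = np.concatenate([rng.gamma(2.0, 0.05, 950), rng.gamma(2.0, 1.0, 50)])
sim_labels = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])
threshold, (precision, recall, f1) = split_threshold_eval(sim_errors, sim_labels)
```

With the notebook's own arrays, the call would be `split_threshold_eval(errors, y)`.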

### 5. Save Model

In [None]:
# ✨ Create wrapper class to combine scaler + model (KServe compatible)
class LSTMPipeline:
    """
    Wrapper class combining scaler + LSTM model for KServe deployment.
    
    This ensures KServe finds exactly ONE .pkl file in the model directory.
    """
    def __init__(self, scaler, model_state_dict, model_class, model_config, threshold, device='cpu'):
        self.scaler = scaler
        self.model_state_dict = model_state_dict
        self.model_class = model_class
        self.model_config = model_config
        self.threshold = threshold
        self.device = device
        self._model = None
    
    def _get_model(self):
        """Lazy load model from state dict"""
        if self._model is None:
            self._model = self.model_class(**self.model_config)
            self._model.load_state_dict(self.model_state_dict)
            self._model = self._model.to(self.device)
            self._model.eval()
        return self._model
    
    def predict(self, X):
        """
        Predict anomalies using reconstruction error.
        
        Args:
            X: Input features (numpy array)
        
        Returns:
            Predictions: 1 for normal, -1 for anomaly (sklearn convention)
        """
        # Scale input
        X_scaled = self.scaler.transform(X)
        
        # Get model
        model = self._get_model()
        
        # Convert to tensor
        X_tensor = torch.FloatTensor(X_scaled).unsqueeze(1).to(self.device)
        
        # Get reconstruction errors
        with torch.no_grad():
            reconstructed = model(X_tensor)
            errors = torch.mean((X_tensor - reconstructed) ** 2, dim=(1, 2)).cpu().numpy()
        
        # Predict: -1 for anomaly, 1 for normal (sklearn convention)
        predictions = np.where(errors > self.threshold, -1, 1)
        return predictions

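# Create pipeline with scaler + model
# Note: Save model state dict (not the full model object) for portability.
# pickle stores the LSTMPipeline and LSTMAutoencoder classes by reference, so both
# definitions must be importable in the environment that loads model.pkl.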
pipeline = LSTMPipeline(
    scaler=scaler,
    model_state_dict=model.state_dict(),
    model_class=LSTMAutoencoder,
    model_config={'input_size': len(feature_cols), 'hidden_size': 32, 'num_layers': 2},
    threshold=threshold,
    device='cpu'  # KServe will use CPU by default
)

# Save SINGLE pipeline file (KServe compatible)
# KServe expects model at: /mnt/models/lstm-predictor/model.pkl
pipeline_path = MODEL_DIR / 'model.pkl'  # Changed from lstm-predictor.pkl
with open(pipeline_path, 'wb') as f:
    pickle.dump(pipeline, f)

logger.info(f"💾 Saved LSTM pipeline to: {pipeline_path}")
logger.info(f"   ✅ KServe-compatible path: {MODEL_NAME}/model.pkl")
logger.info(f"   ✅ Single .pkl file (scaler + model combined)")

# Upload to S3 for persistent storage
try:
    from common_functions import upload_model_to_s3, test_s3_connection
    
    if test_s3_connection():
        upload_model_to_s3(
            str(pipeline_path),
            s3_key='models/anomaly-detection/lstm-predictor/model.pkl'
        )
        logger.info(f"☁️  Uploaded to S3: models/anomaly-detection/lstm-predictor/model.pkl")
    else:
        logger.info("⚠️ S3 not available - model saved locally only")
except ImportError:
    logger.info("⚠️ S3 functions not available - model saved locally only")
except Exception as e:
    logger.warning(f"⚠️ S3 upload failed (non-critical): {e}")

# Verify pipeline saved
assert pipeline_path.exists(), "Pipeline not saved"
logger.info(f"\n✅ LSTM pipeline saved successfully")
logger.info(f"   Path: {pipeline_path}")
logger.info(f"   Size: {pipeline_path.stat().st_size / 1024:.2f} KB")

# Clean up old separate files if they exist
old_files = [
    MODELS_DIR / 'lstm_autoencoder.pt',
    MODELS_DIR / 'lstm_scaler.pkl',
    MODELS_DIR / 'lstm-predictor.pkl'  # Old flat file
]
for old_file in old_files:
    if old_file.exists():
        old_file.unlink()
        logger.info(f"🗑️  Removed old file: {old_file.name}")

# Save predictions
results_df = pd.DataFrame({
    'actual': y,
    'lstm_pred': lstm_preds,
    'reconstruction_error': errors
})
results_df.to_parquet(PROCESSED_DIR / 'lstm_predictions.parquet')
logger.info("Saved predictions")

## Validation Section

In [None]:
# Verify outputs
assert (MODEL_DIR / 'model.pkl').exists(), "LSTM pipeline not saved"
assert (PROCESSED_DIR / 'lstm_predictions.parquet').exists(), "Predictions not saved"

logger.info("✅ All validations passed")
print(f"\nPipeline saved to: {MODEL_DIR / 'model.pkl'}")
print(f"Predictions saved to: {PROCESSED_DIR / 'lstm_predictions.parquet'}")

## Integration Section

This notebook integrates with:
- **Input**: Synthetic anomalies from `synthetic-anomaly-generation.ipynb`
- **Output**: LSTM model for `ensemble-anomaly-methods.ipynb`
- **Deployment**: Model can be exported to KServe for production
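
One detail matters when combining outputs downstream: the saved `LSTMPipeline.predict` returns sklearn-style values (1 for normal, -1 for anomaly), while `lstm_predictions.parquet` stores 0/1 labels with 1 marking an anomaly. A small conversion helper such as the hypothetical `sklearn_to_binary` below keeps the two consistent.

```python
import numpy as np

def sklearn_to_binary(predictions):
    """Map sklearn-style outputs (1 = normal, -1 = anomaly) to 0/1 anomaly labels."""
    predictions = np.asarray(predictions)
    return np.where(predictions == -1, 1, 0)

# Example: pipeline-style output converted to the label convention used in this notebook.
pipeline_output = np.array([1, -1, 1, -1])
binary_labels = sklearn_to_binary(pipeline_output)
```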

## Next Steps

1. Compare LSTM performance with ARIMA and Prophet
2. Proceed to `ensemble-anomaly-methods.ipynb`
3. Combine all methods for best performance
4. Deploy ensemble to coordination engine

## References

- ADR-012: Notebook Architecture for End-to-End Workflows
- [PyTorch LSTM Documentation](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
- [Autoencoder Anomaly Detection](https://en.wikipedia.org/wiki/Autoencoder)