# LSTM-Based Anomaly Detection

## Overview
This notebook trains an LSTM (Long Short-Term Memory) autoencoder for anomaly detection. The model learns to reconstruct normal metric sequences, and sequences it reconstructs poorly (high reconstruction error) are flagged as anomalies.

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- GPU access (recommended)
- PyTorch 2025.1 (the code in this notebook uses PyTorch only; TensorFlow/Keras is not imported)
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies can take months or years
- Different anomaly types are hard to capture
- Failures cannot be deliberately triggered in production just to collect data

### The Solution: Synthetic Anomalies
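We generate synthetic anomalies because they let us:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Set the class balance explicitly (for example 50% normal, 50% anomaly; the loader in this notebook injects anomalies into about 3% of time steps)
- ✅ Keep experiments reproducible and testable
- ✅ Train models that can generalize to real anomalies, provided the synthetic patterns resemble production behavior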
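
The injection idea can be sketched in plain Python. This is a minimal illustration, not the generator from Phase 1: the function name `inject_anomalies`, its defaults, and the toy series are assumptions made for the example. It shifts a random fraction of points by a multiple of the standard deviation and records a label for each one.

```python
import random
import statistics

def inject_anomalies(values, fraction=0.03, magnitude=3.0, seed=0):
    """Return (perturbed_values, labels) with point anomalies injected.

    A random `fraction` of positions is shifted by +/- `magnitude` standard
    deviations; labels are 1 at those positions and 0 elsewhere.
    """
    rng = random.Random(seed)  # local RNG keeps the result reproducible
    std = statistics.pstdev(values)
    n_anomalies = int(len(values) * fraction)
    positions = rng.sample(range(len(values)), n_anomalies)

    perturbed = list(values)
    labels = [0] * len(values)
    for pos in positions:
        perturbed[pos] += magnitude * std * rng.choice([-1, 1])
        labels[pos] = 1
    return perturbed, labels

# Toy series (hypothetical): a repeating ramp between 50 and 59.
series = [50.0 + (i % 10) for i in range(200)]
noisy, labels = inject_anomalies(series)
print(sum(labels), "anomalies injected into", len(series), "points")
```

Because the random generator is seeded, calling the function again with the same arguments returns the same anomalies, which is what makes the labeled data reproducible.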

### Machine Learning Best Practice
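Evaluating a detector requires labeled data. The LSTM autoencoder in this notebook is trained without labels; the labels are used only to score its predictions. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Controlled Class Balance**: A known, adjustable ratio of normal to anomaly samples
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns rather than memorizing examples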
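
To make the role of ground truth concrete, here is a small dependency-free sketch of the three metrics used later in the notebook. The helper name `precision_recall_f1` and the toy labels are assumptions for illustration; the notebook itself uses the scikit-learn functions. Empty denominators yield 0.0, mirroring `zero_division=0`.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): 3 true anomalies, 2 of them detected, 1 false alarm.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → precision=0.667 recall=0.667 f1=0.667
```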

## Learning Objectives
- Build an LSTM autoencoder and train it on synthetic data
- Train on GPU for efficiency
- Use reconstruction error for anomaly detection
- Optimize hyperparameters
- Evaluate deep learning model performance with labeled data

## Key Concepts
- **LSTM**: Recurrent neural network for sequence learning
- **Autoencoder**: Learns compressed representation of normal data
- **Reconstruction Error**: Difference between input and reconstructed output
- **GPU Acceleration**: Training on NVIDIA GPUs for speed
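
As a rough illustration of reconstruction error, the sketch below computes the mean squared error between one input window and its reconstruction, averaged over all time steps and features. The window and both reconstructions are made-up values; in the notebook the reconstruction comes from the trained autoencoder.

```python
def reconstruction_error(window, reconstruction):
    """Mean squared error over every time step and feature of one window."""
    squared = [
        (x - r) ** 2
        for row, rec_row in zip(window, reconstruction)
        for x, r in zip(row, rec_row)
    ]
    return sum(squared) / len(squared)

# Hypothetical 3-step, 2-feature window and two candidate reconstructions.
window = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
good_reconstruction = [[0.1, 1.0], [0.5, 0.9], [1.0, 1.0]]
poor_reconstruction = [[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]]

good_error = reconstruction_error(window, good_reconstruction)
poor_error = reconstruction_error(window, poor_reconstruction)
print(good_error, poor_error)
```

A window that resembles the training data should land near the first case; a window the model has never seen anything like should land nearer the second.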

## References

### Why Synthetic Data for Training?
- **He & Garcia (2009)**: "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- **Nikolenko (2021)**: "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- **Goldstein & Uchida (2016)**: "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

### LSTM and Deep Learning for Anomaly Detection
- **Malhotra et al. (2016)**: "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection" - https://arxiv.org/abs/1607.00148
- **Hochreiter & Schmidhuber (1997)**: "Long Short-Term Memory" - Classic LSTM paper
- **Goodfellow et al. (2016)**: "Deep Learning" - Comprehensive deep learning reference

### Key Takeaway
Synthetic data provides labeled training examples that allow us to:
1. Train deep learning models with known ground truth
2. Evaluate performance with precision, recall, and F1 scores
3. Ensure reproducible and testable results
4. Build models that generalize to real-world anomalies

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import pickle
import logging
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists only local names, so check globals() instead.
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger.info(f"Using device: {device}")
if torch.cuda.is_available():
    logger.info(f"GPU: {torch.cuda.get_device_name(0)}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'

# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')

# Create KServe-compatible subdirectory structure
MODEL_NAME = 'lstm-predictor'  # Separate model name from anomaly-detector
MODEL_DIR = MODELS_DIR / MODEL_NAME
MODEL_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Models directory: {MODEL_DIR}")

## Implementation Section

### 1. Load and Prepare Data
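
The loader in the next cell flattens a Prometheus range-query response into one value per timestamp by averaging across all returned series. The sketch below shows that step in isolation with the standard library only. The function name `average_series` and the sample payload are assumptions for illustration; the payload follows the documented `matrix` result shape, where each sample is a `[unix_timestamp, "string value"]` pair.

```python
from collections import defaultdict
from datetime import datetime, timezone

def average_series(response):
    """Average all series of a range-query response per timestamp.

    Returns a dict mapping UTC datetimes to the mean of the parseable
    sample values reported at that timestamp.
    """
    buckets = defaultdict(list)
    for series in response.get("data", {}).get("result", []):
        for ts, value in series.get("values", []):
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip samples that are not numeric
            if number != number:  # NaN is the only float not equal to itself
                continue
            buckets[datetime.fromtimestamp(float(ts), tz=timezone.utc)].append(number)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(buckets.items())}

# Hand-written payload in the shape returned by /api/v1/query_range.
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"pod": "a"}, "values": [[1700000000, "1.0"], [1700000060, "3.0"]]},
            {"metric": {"pod": "b"}, "values": [[1700000000, "2.0"], [1700000060, "NaN"]]},
        ],
    },
}
averaged = average_series(sample)
for ts, mean in averaged.items():
    print(ts.isoformat(), mean)
```

Averaging across pods and namespaces collapses per-pod detail into a single cluster-level signal per metric, which is what gives the LSTM one column per entry in `TARGET_METRICS`.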

In [None]:
# Cell 2 - Load Data with TARGET_METRICS

TARGET_METRICS = [
    # Resource Metrics (5)
    'node_memory_utilization',
    'pod_cpu_usage',
    'pod_memory_usage',
    'alt_cpu_usage',
    'alt_memory_usage',
    
    # Stability Metrics (3)
    'container_restart_count',
    'container_restart_rate_1h',
    'deployment_unavailable',
    
    # Pod Status Metrics (4)
    'namespace_pod_count',
    'pods_pending',
    'pods_running',
    'pods_failed',
    
    # Storage Metrics (2)
    'persistent_volume_usage',
    'cluster_resource_quota',
    
    # Control Plane Metrics (2)
    'apiserver_request_total',
    'apiserver_error_rate',
]

PROMETHEUS_QUERIES = {
    'node_memory_utilization': 'instance:node_memory_utilisation:ratio * 100',
    'pod_cpu_usage': 'sum by (pod, namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)',
    'pod_memory_usage': 'sum by (pod, namespace) (container_memory_working_set_bytes{container!="POD", container!=""})',
    'alt_cpu_usage': 'sum(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) by (pod, namespace)',
    'alt_memory_usage': 'sum(container_memory_rss{container!="POD", container!=""}) by (pod, namespace)',
    'container_restart_count': 'sum by (pod, namespace, container) (kube_pod_container_status_restarts_total)',
    'container_restart_rate_1h': 'sum by (pod, namespace) (increase(kube_pod_container_status_restarts_total[1h]))',
    'deployment_unavailable': 'sum by (deployment, namespace) (kube_deployment_status_replicas_unavailable)',
    'namespace_pod_count': 'sum by (namespace) (kube_pod_status_phase)',
    'pods_pending': 'sum by (namespace) (kube_pod_status_phase{phase="Pending"})',
    'pods_running': 'sum by (namespace) (kube_pod_status_phase{phase="Running"})',
    'pods_failed': 'sum by (namespace) (kube_pod_status_phase{phase="Failed"})',
    'persistent_volume_usage': 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100',
    'cluster_resource_quota': 'kube_resourcequota',
    'apiserver_request_total': 'sum(rate(apiserver_request_total[5m]))',
    'apiserver_error_rate': 'sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) * 100',
}

print(f"📊 Target metrics for LSTM: {len(TARGET_METRICS)}")

# =============================================================================
# PROMETHEUS CLIENT
# =============================================================================

import requests
from datetime import datetime, timedelta

class PrometheusClient:
    """Client for querying Prometheus in OpenShift."""
    
    def __init__(self):
        token_path = '/var/run/secrets/kubernetes.io/serviceaccount/token'
        self.token = None
        if os.path.exists(token_path):
            with open(token_path, 'r') as f:
                self.token = f.read().strip()
        
        self.base_url = 'https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091'
        self.session = requests.Session()
        if self.token:
            self.session.headers.update({'Authorization': f'Bearer {self.token}'})
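        # NOTE: TLS verification is disabled, so the server certificate is not checked.
        self.session.verify = False
        
        import urllib3
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        
        try:
            response = self.session.get(f"{self.base_url}/api/v1/status/config", timeout=5)
            self.connected = response.status_code == 200
        except requests.RequestException:
            # Connection errors and timeouts mean Prometheus is unreachable.
            self.connected = False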
    
    def query_range(self, query, start, end, step='1m'):
        if not self.connected:
            return None
        url = f"{self.base_url}/api/v1/query_range"
        params = {'query': query, 'start': start, 'end': end, 'step': step}
        try:
            response = self.session.get(url, params=params, timeout=60)
            response.raise_for_status()
            return response.json()
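        except (requests.RequestException, ValueError):
            # Network/HTTP failure or a non-JSON body: treat as "no data".
            return None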

# =============================================================================
# LOAD DATA
# =============================================================================

def load_data_for_lstm(duration_hours=24, use_real_data=True):
    """Load time series data for LSTM training."""
    
    print("=" * 70)
    print("🔄 LOADING DATA FOR LSTM")
    print("=" * 70)
    
    prometheus = None
    if use_real_data:
        prometheus = PrometheusClient()
        print(f"   Prometheus connected: {prometheus.connected}")
    
    end_time = datetime.now()
    start_time = end_time - timedelta(hours=duration_hours)
    time_index = pd.date_range(start=start_time, end=end_time, freq='1min')
    
    df = pd.DataFrame(index=time_index)
    df.index.name = 'timestamp'
    
    data_sources = {}
    
    print(f"\n📊 Loading {len(TARGET_METRICS)} metrics...")
    
    for i, metric in enumerate(TARGET_METRICS):
        real_data_loaded = False
        
        if prometheus and prometheus.connected and metric in PROMETHEUS_QUERIES:
            query = PROMETHEUS_QUERIES[metric]
            result = prometheus.query_range(
                query, start_time.timestamp(), end_time.timestamp(), step='1m'
            )
            
            if result and result.get('status') == 'success':
                data = result.get('data', {}).get('result', [])
                if data:
                    rows = []
                    for series in data:
                        for ts, value in series.get('values', []):
                            try:
                                rows.append({
                                    'timestamp': pd.to_datetime(ts, unit='s'),
                                    'value': float(value) if value != 'NaN' else np.nan
                                })
                            except (TypeError, ValueError):
                                # Skip samples whose timestamp or value cannot be parsed.
                                pass
                    
                    if rows:
                        metric_df = pd.DataFrame(rows)
                        metric_series = metric_df.groupby('timestamp')['value'].mean()
                        metric_series = metric_series.reindex(time_index, method='nearest')
                        df[metric] = metric_series
                        data_sources[metric] = 'REAL'
                        real_data_loaded = True
                        print(f"   ✅ [{i+1:2}/{len(TARGET_METRICS)}] {metric}: REAL")
        
        if not real_data_loaded:
            # Generate synthetic data with realistic patterns
            n_points = len(time_index)
            trend = np.linspace(0, 5, n_points)
            seasonal = np.sin(np.linspace(0, 4*np.pi, n_points))
            noise = np.random.normal(0, 1, n_points)
            
            if 'cpu' in metric.lower():
                base = np.clip(0.3 + 0.1 * seasonal + 0.02 * noise, 0, 2)
            elif 'memory_utilization' in metric.lower():
                base = np.clip(60 + 10 * seasonal + 2 * noise, 0, 100)
            elif 'memory' in metric.lower():
                base = np.clip(2.5e8 + 2e7 * seasonal + 5e6 * noise, 1e8, 5e8)
            elif 'restart' in metric.lower():
                base = np.clip(10 + 0.5 * np.abs(noise), 0, 50)
            elif 'error' in metric.lower() or 'failed' in metric.lower():
                base = np.clip(0.01 * np.abs(noise), 0, 0.1)
            elif 'pending' in metric.lower() or 'unavailable' in metric.lower():
                base = np.clip(0.01 * np.abs(noise), 0, 1)
            else:
                base = 50 + trend + 10 * seasonal + 2 * noise
            
            df[metric] = base
            data_sources[metric] = 'SYNTHETIC'
            print(f"   📊 [{i+1:2}/{len(TARGET_METRICS)}] {metric}: SYNTHETIC")
    
    # Add labels
    df['label'] = 0
    n_anomalies = int(len(df) * 0.03)
    anomaly_indices = np.random.choice(len(df), n_anomalies, replace=False)
    
    for idx in anomaly_indices:
        anomaly_metrics = np.random.choice(TARGET_METRICS, 2, replace=False)
        for metric in anomaly_metrics:
            if metric in df.columns:
                std = df[metric].std()
                if std > 0:
                    df.iloc[idx, df.columns.get_loc(metric)] += 3.0 * std * np.random.choice([-1, 1])
        df.iloc[idx, df.columns.get_loc('label')] = 1
    
    real_count = sum(1 for s in data_sources.values() if s == 'REAL')
    print(f"\n✅ Data loaded: {df.shape}")
    print(f"   REAL: {real_count} | SYNTHETIC: {len(data_sources) - real_count}")
    print(f"   Anomalies: {df['label'].sum()} ({df['label'].mean():.1%})")
    
    return df, data_sources

# Load the data
df, data_sources = load_data_for_lstm(duration_hours=24, use_real_data=True)

# =============================================================================
# PREPARE FEATURES
# =============================================================================

# Get feature columns (use TARGET_METRICS that exist in df)
feature_cols = [col for col in df.columns if col in TARGET_METRICS]
print(f"\n📊 Feature columns: {len(feature_cols)}")

# Extract features and labels
# fillna(method=...) is deprecated in pandas 2.x; ffill()/bfill() are the direct replacements.
X = df[feature_cols].ffill().bfill().values
y = df['label'].values

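# Normalize data. Fit the scaler on the first 80% of rows only (roughly the
# training portion of the chronological split used later) so that test data
# does not leak into the normalization statistics.
scaler = StandardScaler()
scaler.fit(X[:int(len(X) * 0.8)])
X_scaled = scaler.transform(X)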

logger.info(f"Data shape: {X_scaled.shape}")
logger.info(f"Features: {len(feature_cols)}")
print(f"\n📋 Sample features: {feature_cols[:5]}...")

### 2. Define LSTM Autoencoder
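
Before defining the model, it can help to estimate its size. The sketch below counts trainable parameters for the configuration used in this notebook (16 input features from `TARGET_METRICS`, hidden size 64, 2 layers), assuming the standard `torch.nn.LSTM` layout: four gates per layer, each with input weights, recurrent weights, and two bias vectors. The helper `lstm_param_count` is written for this example and is not part of PyTorch.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional torch.nn.LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights and two bias vectors.
        total += 4 * hidden_size * (layer_input + hidden_size + 2)
    return total

n_features, hidden_size, num_layers = 16, 64, 2

encoder = lstm_param_count(n_features, hidden_size, num_layers)
decoder = lstm_param_count(hidden_size, hidden_size, num_layers)
output_layer = hidden_size * n_features + n_features  # weights + bias

print(encoder, decoder, output_layer, encoder + decoder + output_layer)
# → 54272 66560 1040 121872
```

At roughly 122k parameters the model is small, so training time is dominated by the number of sequences rather than by model size.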

In [None]:
# Cell 3 - Define LSTM Autoencoder

class LSTMAutoencoder(nn.Module):
    """
    LSTM Autoencoder for anomaly detection.
    
    Learns to reconstruct normal patterns.
    High reconstruction error = anomaly.
    """
    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.2):
        super(LSTMAutoencoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Encoder
        self.encoder = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Decoder
        self.decoder = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Output layer
        self.output_layer = nn.Linear(hidden_size, input_size)
    
    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        batch_size, seq_len, _ = x.shape
        
        # Encode
        _, (hidden, cell) = self.encoder(x)
        
        # Create decoder input (repeat encoded representation)
        decoder_input = hidden[-1].unsqueeze(1).repeat(1, seq_len, 1)
        
        # Decode
        decoder_output, _ = self.decoder(decoder_input, (hidden, cell))
        
        # Reconstruct
        reconstruction = self.output_layer(decoder_output)
        
        return reconstruction

# Model configuration
HIDDEN_SIZE = 64
NUM_LAYERS = 2
DROPOUT = 0.2

# Create model
n_features = len(feature_cols)
model = LSTMAutoencoder(
    input_size=n_features,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYERS,
    dropout=DROPOUT
)
model = model.to(device)

print(f"✅ LSTM Autoencoder created")
print(f"   Input size: {n_features} features")
print(f"   Hidden size: {HIDDEN_SIZE}")
print(f"   Layers: {NUM_LAYERS}")
print(f"   Device: {device}")
print(f"\n   Model architecture:")
print(model)

### 3. Train Model
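
The training cell turns the time series into overlapping windows and labels each window by its last time step. The toy sketch below (plain lists, hypothetical data, helper name `create_windows` chosen for the example) shows how many windows result and how the labels line up.

```python
def create_windows(rows, seq_length):
    """Return every contiguous run of `seq_length` rows (stride 1)."""
    return [rows[i:i + seq_length] for i in range(len(rows) - seq_length + 1)]

rows = [[t] for t in range(6)]   # six time steps, one feature each
labels = [0, 0, 0, 1, 0, 0]      # anomaly at time step 3
seq_length = 3

windows = create_windows(rows, seq_length)
window_labels = labels[seq_length - 1:]  # label of the last step in each window

for window, label in zip(windows, window_labels):
    print([r[0] for r in window], "->", label)
```

Six rows with a window of three give `6 - 3 + 1 = 4` windows. Only the window ending at step 3 is labeled anomalous, although the two windows after it also contain the anomalous point. Because reconstruction error is averaged over the whole window, those later windows can still score high, which may appear as false positives in the evaluation.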

In [None]:
# Cell 4 - Create Sequences and Train Model

def create_sequences(data, seq_length):
    """Create sequences for LSTM training."""
    sequences = []
    for i in range(len(data) - seq_length + 1):
        sequences.append(data[i:i + seq_length])
    return np.array(sequences)

# =============================================================================
# PREPARE SEQUENCES
# =============================================================================

SEQUENCE_LENGTH = 30  # 30 time steps per sequence
BATCH_SIZE = 32
EPOCHS = 20
LEARNING_RATE = 0.001

print("=" * 70)
print("🔄 PREPARING SEQUENCES FOR LSTM")
print("=" * 70)

# Create sequences
X_sequences = create_sequences(X_scaled, SEQUENCE_LENGTH)
y_sequences = y[SEQUENCE_LENGTH - 1:]  # Label for last point in each sequence

print(f"\n📊 Sequence preparation:")
print(f"   Original data: {X_scaled.shape}")
print(f"   Sequence length: {SEQUENCE_LENGTH}")
print(f"   Sequences created: {X_sequences.shape}")
print(f"   Labels: {y_sequences.shape}")

# Train/test split (80/20)
split_idx = int(len(X_sequences) * 0.8)
X_train = X_sequences[:split_idx]
X_test = X_sequences[split_idx:]
y_train = y_sequences[:split_idx]
y_test = y_sequences[split_idx:]

print(f"\n   Train sequences: {X_train.shape[0]}")
print(f"   Test sequences: {X_test.shape[0]}")
print(f"   Train anomalies: {y_train.sum()} ({y_train.mean():.2%})")
print(f"   Test anomalies: {y_test.sum()} ({y_test.mean():.2%})")

# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train).to(device)
X_test_t = torch.FloatTensor(X_test).to(device)

# Create DataLoader
train_dataset = TensorDataset(X_train_t)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

print(f"\n   Batch size: {BATCH_SIZE}")
print(f"   Batches per epoch: {len(train_loader)}")

# =============================================================================
# TRAIN MODEL
# =============================================================================

print("\n" + "=" * 70)
print("🔄 TRAINING LSTM AUTOENCODER")
print("=" * 70)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

train_losses = []

for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0
    
    for batch in train_loader:
        X_batch = batch[0]
        
        # Forward pass
        reconstruction = model(X_batch)
        loss = criterion(reconstruction, X_batch)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"   Epoch [{epoch+1:3}/{EPOCHS}] Loss: {avg_loss:.6f}")

print(f"\n✅ Training complete!")
print(f"   Final loss: {train_losses[-1]:.6f}")

### 4. Detect Anomalies
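
The percentile-threshold step can be sketched in plain Python. In this sketch the threshold is taken from training errors, so the test set plays no part in choosing it. The error values are made up, and the `percentile` helper uses linear interpolation between closest ranks, which is also NumPy's default method.

```python
def percentile(values, q):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical reconstruction errors; in practice these come from the model.
train_errors = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.14, 0.10, 0.50]
test_errors = [0.11, 0.95, 0.10, 0.13, 0.60]

threshold = percentile(train_errors, 90)
predictions = [1 if e > threshold else 0 for e in test_errors]
print(threshold, predictions)
```

A percentile threshold fixes the share of flagged samples in whatever data it is computed on: the 95th percentile always flags about 5% of that data, regardless of how many true anomalies it contains.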

In [None]:
# Cell 5 - Detect Anomalies using Reconstruction Error

print("=" * 70)
print("🔍 DETECTING ANOMALIES")
print("=" * 70)

# Get reconstruction errors on test set
model.eval()
with torch.no_grad():
    # Reconstruct test sequences
    reconstructed = model(X_test_t)
    
    # Calculate MSE per sequence
    errors = torch.mean((X_test_t - reconstructed) ** 2, dim=(1, 2)).cpu().numpy()

print(f"\n📊 Reconstruction errors:")
print(f"   Min: {errors.min():.6f}")
print(f"   Max: {errors.max():.6f}")
print(f"   Mean: {errors.mean():.6f}")
print(f"   Std: {errors.std():.6f}")

# =============================================================================
# DETERMINE THRESHOLD
# =============================================================================

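# Use a high percentile of the training reconstruction errors (the 95th is
# common). Taking it from the training set keeps the test set out of the
# threshold choice, so the evaluation below is not tuned on its own data.
PERCENTILE = 95
with torch.no_grad():
    train_errors = torch.mean((X_train_t - model(X_train_t)) ** 2, dim=(1, 2)).cpu().numpy()
threshold = np.percentile(train_errors, PERCENTILE)

print(f"\n🎯 Threshold ({PERCENTILE}th percentile of training errors): {threshold:.6f}")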

# =============================================================================
# PREDICT ANOMALIES
# =============================================================================

lstm_preds = (errors > threshold).astype(int)

print(f"\n📈 Predictions:")
print(f"   Total test samples: {len(lstm_preds)}")
print(f"   Predicted anomalies: {lstm_preds.sum()} ({lstm_preds.mean():.2%})")
print(f"   Actual anomalies: {y_test.sum()} ({y_test.mean():.2%})")

# =============================================================================
# EVALUATE PERFORMANCE
# =============================================================================

precision = precision_score(y_test, lstm_preds, zero_division=0)
recall = recall_score(y_test, lstm_preds, zero_division=0)
f1 = f1_score(y_test, lstm_preds, zero_division=0)

print("\n" + "=" * 70)
print("📊 LSTM PERFORMANCE")
print("=" * 70)
print(f"\n   Precision: {precision:.3f}")
print(f"   Recall:    {recall:.3f}")
print(f"   F1 Score:  {f1:.3f}")

# Also get full dataset predictions for saving
with torch.no_grad():
    X_full_t = torch.FloatTensor(X_sequences).to(device)
    reconstructed_full = model(X_full_t)
    errors_full = torch.mean((X_full_t - reconstructed_full) ** 2, dim=(1, 2)).cpu().numpy()

lstm_preds_full = (errors_full > threshold).astype(int)

print(f"\n   Full dataset:")
print(f"   Total: {len(lstm_preds_full)} | Anomalies: {lstm_preds_full.sum()}")

### 5. Save Model
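
The pipeline's `predict` method in the next cell scores only the most recent window and left-pads short inputs with zeros. The sketch below isolates that windowing logic with plain lists. The name `prepare_window` is an assumption for this example, and it expects at least one input row. Because inputs are standardized first, a zero row corresponds to the per-feature mean seen by the scaler.

```python
def prepare_window(rows, seq_len):
    """Return the last `seq_len` rows, left-padded with zero rows if needed."""
    n_features = len(rows[0])
    missing = max(seq_len - len(rows), 0)
    padding = [[0.0] * n_features for _ in range(missing)]
    return (padding + list(rows))[-seq_len:]

short_history = [[0.2, -0.1], [0.4, 0.3]]
long_history = [[float(t), 0.0] for t in range(10)]

short_window = prepare_window(short_history, 4)
long_window = prepare_window(long_history, 4)
print(short_window)
print([row[0] for row in long_window])
```

Padding keeps the input shape fixed at `seq_len` rows. A heavily padded window is mostly average values, though, so its reconstruction error says little about the few real rows it contains.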

In [None]:
# Cell 6 - Save LSTM Model for KServe

import pickle

class LSTMPipeline:
    """
    Wrapper combining scaler + LSTM model for KServe deployment.
    KServe expects exactly ONE .pkl file in the model directory.
    """
    def __init__(self, scaler, model_state_dict, model_config, threshold, feature_names, device='cpu'):
        self.scaler = scaler
        self.model_state_dict = model_state_dict
        self.model_config = model_config
        self.threshold = threshold
        self.feature_names = feature_names
        self.device = device
        self._model = None
    
    def _get_model(self):
        """Lazy load model from state dict."""
        if self._model is None:
            self._model = LSTMAutoencoder(**self.model_config)
            self._model.load_state_dict(self.model_state_dict)
            self._model.to(self.device)
            self._model.eval()
        return self._model
    
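    def predict(self, X):
        """
        Predict whether the most recent window is anomalous.
        
        Args:
            X: Input features (numpy array, shape: n_samples x n_features),
               with rows ordered oldest to newest
        
        Returns:
            int: 1 for normal, -1 for anomaly (sklearn convention). Only the
            last `seq_len` rows are scored, so a single value is returned.
        """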
        # Scale input
        X_scaled = self.scaler.transform(X)
        
        # Create single sequence (use last SEQUENCE_LENGTH points)
        seq_len = 30  # Must match training
        if len(X_scaled) < seq_len:
            # Pad with zeros if not enough data
            padding = np.zeros((seq_len - len(X_scaled), X_scaled.shape[1]))
            X_scaled = np.vstack([padding, X_scaled])
        
        X_seq = X_scaled[-seq_len:].reshape(1, seq_len, -1)
        
        # Get model and predict
        model = self._get_model()
        X_tensor = torch.FloatTensor(X_seq).to(self.device)
        
        with torch.no_grad():
            reconstructed = model(X_tensor)
            error = torch.mean((X_tensor - reconstructed) ** 2).item()
        
        # Return -1 for anomaly, 1 for normal (sklearn convention)
        return -1 if error > self.threshold else 1
    
    def predict_batch(self, X_sequences):
        """Predict on batch of sequences."""
        model = self._get_model()
        X_tensor = torch.FloatTensor(X_sequences).to(self.device)
        
        with torch.no_grad():
            reconstructed = model(X_tensor)
            errors = torch.mean((X_tensor - reconstructed) ** 2, dim=(1, 2)).cpu().numpy()
        
        return np.where(errors > self.threshold, -1, 1)
    
    def get_reconstruction_error(self, X_sequences):
        """Get reconstruction errors for analysis."""
        model = self._get_model()
        X_tensor = torch.FloatTensor(X_sequences).to(self.device)
        
        with torch.no_grad():
            reconstructed = model(X_tensor)
            errors = torch.mean((X_tensor - reconstructed) ** 2, dim=(1, 2)).cpu().numpy()
        
        return errors

# =============================================================================
# CREATE AND SAVE PIPELINE
# =============================================================================

print("=" * 70)
print("💾 SAVING LSTM MODEL FOR KSERVE")
print("=" * 70)

# Create pipeline
pipeline = LSTMPipeline(
    scaler=scaler,
    model_state_dict={k: v.cpu() for k, v in model.state_dict().items()},  # CPU copies; `model` stays on its device
    model_config={
        'input_size': n_features,
        'hidden_size': HIDDEN_SIZE,
        'num_layers': NUM_LAYERS,
        'dropout': DROPOUT
    },
    threshold=threshold,
    feature_names=feature_cols,
    device='cpu'
)

print(f"\n📦 LSTMPipeline created:")
print(f"   Features: {len(feature_cols)}")
print(f"   Threshold: {threshold:.6f}")
print(f"   Hidden size: {HIDDEN_SIZE}")

# Clean up old files
for old_file in MODEL_DIR.glob('*.pkl'):
    old_file.unlink()
    print(f"   🗑️  Removed: {old_file.name}")
for old_file in MODEL_DIR.glob('*.pt'):
    old_file.unlink()
    print(f"   🗑️  Removed: {old_file.name}")

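# Save single .pkl file (KServe requirement).
# pickle stores the LSTMPipeline class by reference (module and class name), so
# the process that loads this file must be able to import LSTMPipeline, and
# LSTMAutoencoder must be available to it when the model is rebuilt.
model_path = MODEL_DIR / 'model.pkl'
with open(model_path, 'wb') as f:
    pickle.dump(pipeline, f)

print(f"\n✅ Model saved: {model_path}")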
print(f"   Size: {model_path.stat().st_size / 1024:.2f} KB")

# Verify single file
pkl_files = list(MODEL_DIR.glob('*.pkl'))
assert len(pkl_files) == 1, f"Expected 1 .pkl file, found {len(pkl_files)}"
print(f"   Files in directory: {len(pkl_files)} ✓")

# =============================================================================
# UPLOAD TO S3
# =============================================================================

try:
    from common_functions import upload_model_to_s3, test_s3_connection
    
    if test_s3_connection():
        upload_model_to_s3(
            str(model_path),
            s3_key=f'models/anomaly-detection/{MODEL_NAME}/model.pkl'
        )
        print(f"\n☁️  Uploaded to S3: models/anomaly-detection/{MODEL_NAME}/model.pkl")
except Exception as e:
    print(f"\n⚠️  S3 upload skipped: {e}")

# =============================================================================
# SAVE PREDICTIONS
# =============================================================================

results_df = pd.DataFrame({
    'actual': y_sequences,
    'lstm_pred': lstm_preds_full,
    'reconstruction_error': errors_full
})
results_df.to_parquet(PROCESSED_DIR / 'lstm_predictions.parquet')
print(f"\n💾 Predictions saved: {PROCESSED_DIR / 'lstm_predictions.parquet'}")

# =============================================================================
# SUMMARY
# =============================================================================

print("\n" + "=" * 70)
print("🎉 LSTM MODEL SAVE COMPLETE")
print("=" * 70)
print(f"\n   Model name: {MODEL_NAME}")
print(f"   Model path: {model_path}")
print(f"   Performance: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(f"\n   Deploy to KServe with: storageUri: pvc://model-storage-pvc/{MODEL_NAME}")

## Validation Section

In [None]:
# Cell 7 - Validation

print("=" * 70)
print("🔍 VALIDATION")
print("=" * 70)

# Verify model file
assert (MODEL_DIR / 'model.pkl').exists(), "Model not saved!"
print(f"   ✅ Model file exists")

# Verify predictions
assert (PROCESSED_DIR / 'lstm_predictions.parquet').exists(), "Predictions not saved!"
print(f"   ✅ Predictions saved")

# Test loading the pipeline
with open(MODEL_DIR / 'model.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)

print(f"   ✅ Pipeline loads correctly")
print(f"   ✅ Features: {len(loaded_pipeline.feature_names)}")
print(f"   ✅ Threshold: {loaded_pipeline.threshold:.6f}")

# Test prediction
test_errors = loaded_pipeline.get_reconstruction_error(X_sequences[:10])
print(f"   ✅ Prediction works: {len(test_errors)} errors computed")

print("\n" + "=" * 70)
print("✅ ALL VALIDATIONS PASSED")
print("=" * 70)

## Integration Section

This notebook integrates with:
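- **Input**: Prometheus metrics when the cluster is reachable, otherwise synthetic series generated in this notebook; synthetic anomalies are injected in both cases (Phase 1: `synthetic-anomaly-generation.ipynb`)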
- **Output**: LSTM model for `ensemble-anomaly-methods.ipynb`
- **Deployment**: Model can be exported to KServe for production
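
One practical detail when handing results to a downstream notebook: the evaluation code here uses `1` for anomaly and `0` for normal, while `LSTMPipeline.predict` returns `-1` for anomaly and `1` for normal. A small conversion such as the following sketch (the helper name is hypothetical) keeps the two conventions from being mixed up.

```python
def to_binary_labels(sklearn_style):
    """Map sklearn-style outputs (-1 = anomaly, 1 = normal) to 1 = anomaly, 0 = normal."""
    mapping = {-1: 1, 1: 0}
    return [mapping[int(p)] for p in sklearn_style]

converted = to_binary_labels([1, -1, 1, 1, -1])
print(converted)
# → [0, 1, 0, 0, 1]
```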

## Next Steps

1. Compare LSTM performance with ARIMA and Prophet
2. Proceed to `ensemble-anomaly-methods.ipynb`
3. Combine all methods for best performance
4. Deploy ensemble to coordination engine

## References

- ADR-012: Notebook Architecture for End-to-End Workflows
- [PyTorch LSTM Documentation](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
- [Autoencoder Anomaly Detection](https://en.wikipedia.org/wiki/Autoencoder)