# Time Series Anomaly Detection

## Overview
This notebook implements time series anomaly detection using ARIMA and Prophet forecasting methods. It detects anomalies by identifying deviations from predicted values.

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- Libraries: statsmodels, prophet, pandas, numpy
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies takes months/years
- Different anomaly types are hard to capture
- Can't deliberately cause failures to collect data

### The Solution: Synthetic Anomalies
We generate synthetic anomalies because:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Ensure balanced training data (50% normal, 50% anomaly)
- ✅ Reproducible and testable
- ✅ Models trained on synthetic data can generalize to real anomalies when the injected patterns resemble production behavior
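
As a rough illustration of the points above, the following sketch builds a small labeled dataset by injecting spikes into a noisy sine wave. It uses only the Python standard library. The helper name `make_labeled_series`, the 4-standard-deviation shift, and the 5% anomaly rate are assumptions chosen for the example; they are not the settings of `synthetic-anomaly-generation.ipynb`.

```python
import math
import random
import statistics


def make_labeled_series(n_points=500, anomaly_rate=0.05, seed=42):
    """Return (values, labels) for a noisy sine wave with injected spikes.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    values = [
        50 + 10 * math.sin(2 * math.pi * i / 100) + rng.gauss(0, 1)
        for i in range(n_points)
    ]
    labels = [0] * n_points

    spread = statistics.pstdev(values)
    n_anomalies = int(n_points * anomaly_rate)
    for idx in rng.sample(range(n_points), n_anomalies):
        # Shift the point by 4 standard deviations in a random direction
        values[idx] += 4 * spread * rng.choice([-1, 1])
        labels[idx] = 1  # ground-truth label is known because we injected it
    return values, labels


values, labels = make_labeled_series()
print(f"{sum(labels)} anomalies out of {len(labels)} points")
```

Because the seed is fixed, calling the helper twice returns identical data, which is what makes evaluation runs comparable.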

### Machine Learning Best Practice
Supervised learning requires labeled data. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Balanced Classes**: Equal normal and anomaly samples
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns, not memorize examples

## Learning Objectives
- Implement ARIMA forecasting on synthetic data
- Use Prophet for time series analysis
- Detect anomalies via forecast deviations
- Handle seasonal patterns
- Evaluate detection performance with labeled data

## Key Concepts
- **ARIMA**: AutoRegressive Integrated Moving Average
- **Prophet**: Facebook's time series forecasting tool
- **Forecast Error**: Deviation between actual and predicted values
- **Seasonality**: Repeating patterns in time series
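
To make the forecast-error idea concrete, here is a minimal sketch that flags points whose residual exceeds a multiple of the residual standard deviation. The constant forecast stands in for an ARIMA or Prophet prediction, and the function name `flag_forecast_errors` and the sample values are assumptions made for the example.

```python
import statistics


def flag_forecast_errors(actual, predicted, threshold_std=2.5):
    """Return 0/1 flags marking residuals larger than threshold_std * std."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    limit = threshold_std * statistics.pstdev(residuals)
    return [int(abs(r) > limit) for r in residuals]


# Toy data: a flat signal with one spike at position 4
actual = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8, 10.1, 10.0, 9.9]
predicted = [10.0] * len(actual)  # stand-in for a model forecast

flags = flag_forecast_errors(actual, predicted)
print(flags)  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Note that the spike itself inflates the standard deviation, so very short series or very high thresholds can hide an anomaly; a robust scale estimate is one possible alternative.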

## References

### Why Synthetic Data for Training?
- **He & Garcia (2009)**: "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- **Nikolenko (2021)**: "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- **Goldstein & Uchida (2016)**: "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

### Time Series Anomaly Detection
- **Malhotra et al. (2016)**: "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection" - https://arxiv.org/abs/1607.00148
- **Taylor & Letham (2018)**: "Forecasting at Scale (Prophet)" - https://peerj.com/articles/3190
- **Box & Jenkins (1970)**: "Time Series Analysis, Forecasting and Control (ARIMA)" - Classic reference

### Key Takeaway
Synthetic data provides labeled training examples that allow us to:
1. Train models with known ground truth
2. Evaluate performance with precision, recall, and F1 scores
3. Ensure reproducible and testable results
4. Build models that generalize to real-world anomalies
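
The evaluation step in point 2 can be sketched without any library. The function below computes precision, recall, and F1 from ground-truth labels and predicted flags, returning 0.0 when a denominator is zero (similar in spirit to passing `zero_division=0` to the scikit-learn metrics). The name `precision_recall_f1` and the sample labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


y_true = [0, 0, 1, 1, 0, 1, 0, 0]  # known labels from synthetic injection
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]  # flags produced by a detector

p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"P={p:.3f} | R={r:.3f} | F1={f1:.3f}")
```

Here the detector finds two of the three true anomalies and raises one false alarm, so precision and recall are both 2/3.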

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import pickle
import logging
from pathlib import Path
from datetime import datetime, timedelta
from sklearn.metrics import precision_score, recall_score, f1_score

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists local names only, so check module globals instead
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'

# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')

# Create KServe-compatible subdirectory structure
# CHANGED: Use unique model name to avoid conflict with isolation-forest's anomaly-detector model
MODEL_NAME = 'timeseries-predictor'
MODEL_DIR = MODELS_DIR / MODEL_NAME
MODEL_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Data directory: {DATA_DIR}")
logger.info(f"Models directory: {MODEL_DIR}")

## Implementation Section

### 1. Load Synthetic Data

In [None]:
TARGET_METRICS = [
    # Resource Metrics (5)
    'node_memory_utilization',
    'pod_cpu_usage',
    'pod_memory_usage',
    'alt_cpu_usage',
    'alt_memory_usage',
    
    # Stability Metrics (3)
    'container_restart_count',
    'container_restart_rate_1h',
    'deployment_unavailable',
    
    # Pod Status Metrics (4)
    'namespace_pod_count',
    'pods_pending',
    'pods_running',
    'pods_failed',
    
    # Storage Metrics (2)
    'persistent_volume_usage',
    'cluster_resource_quota',
    
    # Control Plane Metrics (2)
    'apiserver_request_total',
    'apiserver_error_rate',
]

# Prometheus queries for real data collection
PROMETHEUS_QUERIES = {
    'node_memory_utilization': 'instance:node_memory_utilisation:ratio * 100',
    'pod_cpu_usage': 'sum by (pod, namespace) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)',
    'pod_memory_usage': 'sum by (pod, namespace) (container_memory_working_set_bytes{container!="POD", container!=""})',
    'alt_cpu_usage': 'sum(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) by (pod, namespace)',
    'alt_memory_usage': 'sum(container_memory_rss{container!="POD", container!=""}) by (pod, namespace)',
    'container_restart_count': 'sum by (pod, namespace, container) (kube_pod_container_status_restarts_total)',
    'container_restart_rate_1h': 'sum by (pod, namespace) (increase(kube_pod_container_status_restarts_total[1h]))',
    'deployment_unavailable': 'sum by (deployment, namespace) (kube_deployment_status_replicas_unavailable)',
    'namespace_pod_count': 'sum by (namespace) (kube_pod_status_phase)',
    'pods_pending': 'sum by (namespace) (kube_pod_status_phase{phase="Pending"})',
    'pods_running': 'sum by (namespace) (kube_pod_status_phase{phase="Running"})',
    'pods_failed': 'sum by (namespace) (kube_pod_status_phase{phase="Failed"})',
    'persistent_volume_usage': 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100',
    'cluster_resource_quota': 'kube_resourcequota',
    'apiserver_request_total': 'sum(rate(apiserver_request_total[5m]))',
    'apiserver_error_rate': 'sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) * 100',
}

print(f"📊 Target metrics for time series analysis: {len(TARGET_METRICS)}")

# =============================================================================
# PROMETHEUS CLIENT (for loading real data)
# =============================================================================

import requests
import os

class PrometheusClient:
    """Client for querying Prometheus in OpenShift."""
    
    def __init__(self):
        token_path = '/var/run/secrets/kubernetes.io/serviceaccount/token'
        self.token = None
        if os.path.exists(token_path):
            with open(token_path, 'r') as f:
                self.token = f.read().strip()
        
        self.base_url = 'https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091'
        self.session = requests.Session()
        if self.token:
            self.session.headers.update({'Authorization': f'Bearer {self.token}'})
        self.session.verify = False
        
        import urllib3
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        
        # Test connection
        try:
            response = self.session.get(f"{self.base_url}/api/v1/status/config", timeout=5)
            self.connected = response.status_code == 200
        except requests.RequestException:
            self.connected = False
    
    def query_range(self, query, start, end, step='1m'):
        if not self.connected:
            return None
        
        url = f"{self.base_url}/api/v1/query_range"
        params = {'query': query, 'start': start, 'end': end, 'step': step}
        
        try:
            response = self.session.get(url, params=params, timeout=60)
            response.raise_for_status()
            return response.json()
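        except (requests.RequestException, ValueError):
            # Network/HTTP failure or a response body that is not valid JSON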
            return None

# =============================================================================
# DATA LOADING FUNCTION
# =============================================================================

def load_timeseries_data(duration_hours=24, use_real_data=True):
    """
    Load time series data for anomaly detection.
    
    For ARIMA/Prophet, we need univariate time series, so we'll create
    a DataFrame where each column is one metric's aggregated values over time.
    
    Args:
        duration_hours: Hours of historical data
        use_real_data: Try Prometheus first, fallback to synthetic
    
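    Returns:
        tuple: (df, data_sources), where df is a DataFrame with a timestamp index,
        one column per metric, and a 'label' column, and data_sources maps each
        metric name to 'REAL' or 'SYNTHETIC'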
    """
    print("=" * 70)
    print("🔄 LOADING TIME SERIES DATA")
    print("=" * 70)
    print(f"   Duration: {duration_hours} hours")
    print(f"   Metrics: {len(TARGET_METRICS)}")
    print(f"   Use real data: {use_real_data}")
    
    # Try to connect to Prometheus
    prometheus = None
    if use_real_data:
        prometheus = PrometheusClient()
        print(f"   Prometheus connected: {prometheus.connected}")
    
    end_time = datetime.now()
    start_time = end_time - timedelta(hours=duration_hours)
    
    # Create time index (1-minute intervals)
    time_index = pd.date_range(start=start_time, end=end_time, freq='1min')
    
    # Initialize DataFrame
    df = pd.DataFrame(index=time_index)
    df.index.name = 'timestamp'
    
    data_sources = {}
    
    print(f"\n📊 Loading {len(TARGET_METRICS)} metrics...")
    print("-" * 50)
    
    for i, metric in enumerate(TARGET_METRICS):
        real_data_loaded = False
        
        if prometheus and prometheus.connected and metric in PROMETHEUS_QUERIES:
            query = PROMETHEUS_QUERIES[metric]
            result = prometheus.query_range(
                query,
                start_time.timestamp(),
                end_time.timestamp(),
                step='1m'
            )
            
            if result and result.get('status') == 'success':
                data = result.get('data', {}).get('result', [])
                if data:
                    # Parse and aggregate Prometheus data
                    rows = []
                    for series in data:
                        for ts, value in series.get('values', []):
                            try:
                                rows.append({
                                    'timestamp': pd.to_datetime(ts, unit='s'),
                                    'value': float(value) if value != 'NaN' else np.nan
                                })
                            except (TypeError, ValueError):
                                # Skip samples whose timestamp or value cannot be parsed
                                pass
                    
                    if rows:
                        metric_df = pd.DataFrame(rows)
                        # Aggregate by timestamp (mean across all series)
                        metric_series = metric_df.groupby('timestamp')['value'].mean()
                        # Resample to our time index
                        metric_series = metric_series.reindex(time_index, method='nearest')
                        df[metric] = metric_series
                        data_sources[metric] = 'REAL'
                        real_data_loaded = True
                        print(f"   ✅ [{i+1:2}/{len(TARGET_METRICS)}] {metric}: REAL ({metric_series.notna().sum()} points)")
        
        if not real_data_loaded:
            # Generate synthetic time series with realistic patterns
            n_points = len(time_index)
            
            # Base pattern: trend + seasonality + noise
            trend = np.linspace(50, 55, n_points)  # Slight upward trend
            daily_seasonal = 10 * np.sin(np.linspace(0, 2*np.pi * (duration_hours/24), n_points))
            hourly_seasonal = 3 * np.sin(np.linspace(0, 2*np.pi * duration_hours, n_points))
            noise = np.random.normal(0, 2, n_points)
            
            # Customize based on metric type
            if 'cpu' in metric.lower():
                base = 30 + trend * 0.5 + daily_seasonal + noise
            elif 'memory' in metric.lower():
                base = 60 + trend * 0.3 + daily_seasonal * 0.5 + noise
            elif 'restart' in metric.lower():
                base = np.abs(noise * 0.5)  # Low values, occasional spikes
            elif 'pending' in metric.lower() or 'failed' in metric.lower():
                base = np.abs(noise * 0.2)  # Mostly zeros
            else:
                base = 50 + trend + daily_seasonal + noise
            
            df[metric] = base
            data_sources[metric] = 'SYNTHETIC'
            print(f"   📊 [{i+1:2}/{len(TARGET_METRICS)}] {metric}: SYNTHETIC ({len(base)} points)")
    
    # Add labels column (for synthetic anomalies)
    df['label'] = 0
    
    # Inject some anomalies for training/testing
    anomaly_rate = 0.03  # 3% anomalies
    n_anomalies = int(len(df) * anomaly_rate)
    anomaly_indices = np.random.choice(len(df), n_anomalies, replace=False)
    
    for idx in anomaly_indices:
        # Pick random metrics to make anomalous
        anomaly_metrics = np.random.choice(TARGET_METRICS, 2, replace=False)
        for metric in anomaly_metrics:
            if metric in df.columns:
                std = df[metric].std()
                df.iloc[idx, df.columns.get_loc(metric)] += 3.0 * std * np.random.choice([-1, 1])
        df.iloc[idx, df.columns.get_loc('label')] = 1
    
    # Summary
    real_count = sum(1 for s in data_sources.values() if s == 'REAL')
    synthetic_count = sum(1 for s in data_sources.values() if s == 'SYNTHETIC')
    
    print("\n" + "=" * 70)
    print("📊 DATA LOADING SUMMARY")
    print("=" * 70)
    print(f"   Total metrics: {len(TARGET_METRICS)}")
    print(f"   ✅ REAL data: {real_count} metrics")
    print(f"   📊 SYNTHETIC data: {synthetic_count} metrics")
    print(f"   Data points: {len(df):,}")
    print(f"   Anomalies injected: {df['label'].sum()} ({df['label'].mean():.1%})")
    print("=" * 70)
    
    return df, data_sources

# =============================================================================
# LOAD THE DATA
# =============================================================================

# Set to True to try loading real Prometheus data
USE_REAL_DATA = True

df, data_sources = load_timeseries_data(
    duration_hours=24,
    use_real_data=USE_REAL_DATA
)

# Display summary
print("\n📋 DATA LOADED:")
print(f"   Shape: {df.shape}")
print(f"   Columns: {list(df.columns[:5])}... + {len(df.columns)-5} more")
print(f"   Time range: {df.index.min()} to {df.index.max()}")
print(f"\n   Normal samples: {(df['label'] == 0).sum()}")
print(f"   Anomalous samples: {(df['label'] == 1).sum()}")

# Show first few rows
print("\n📊 Sample data:")
print(df.head())


### 2. ARIMA-Based Detection

In [None]:
# Cell 3 - Load or generate synthetic data with TARGET_METRICS names

data_file = PROCESSED_DIR / 'synthetic_anomalies.parquet'

# Check if we already have real data loaded with proper columns
real_data_loaded = (
    'df' in dir() and 
    not df.empty and 
    any(col in df.columns for col in TARGET_METRICS)
)

if real_data_loaded:
    # df may hold real Prometheus data or the synthetic fallback from the previous cell
    logger.info(f"✅ Using data already loaded in memory: {df.shape}")
    print(f"✅ Data already loaded with {len([c for c in df.columns if c in TARGET_METRICS])} metrics")
elif data_file.exists():
    # Check if existing file has TARGET_METRICS columns
    existing_df = pd.read_parquet(data_file)
    if any(col in existing_df.columns for col in TARGET_METRICS):
        df = existing_df
        logger.info(f"‚úÖ Loaded existing data with TARGET_METRICS: {df.shape}")
    else:
        logger.info("⚠️ Existing synthetic data uses old column names - regenerating...")
        # Will regenerate below
        data_file = None  # Force regeneration

if not real_data_loaded and (not data_file or not data_file.exists()):
    logger.info("📊 Generating synthetic data with TARGET_METRICS names...")
    
    from datetime import datetime, timedelta
    np.random.seed(42)
    n_points = 1000
    
    # Create timestamp index
    start_time = datetime.now() - timedelta(days=30)
    timestamps = [start_time + timedelta(minutes=i) for i in range(n_points)]
    
    data = {}
    
    # Generate realistic patterns for each metric type
    for metric in TARGET_METRICS:
        # Base components
        trend = np.linspace(0, 5, n_points)  # Slight upward trend
        daily_seasonal = np.sin(np.linspace(0, 4*np.pi, n_points))  # 2 slow cycles across the window
        hourly_seasonal = 0.3 * np.sin(np.linspace(0, 48*np.pi, n_points))  # 24 fast cycles across the window
        noise = np.random.normal(0, 1, n_points)
        
        # Customize based on metric type
        if metric == 'node_memory_utilization':
            # Memory utilization: 40-80% range with gradual changes
            base = 60 + trend + 10 * daily_seasonal + 2 * noise
            data[metric] = np.clip(base, 0, 100)
            
        elif metric == 'pod_cpu_usage':
            # CPU usage: 0-1 range (cores), spiky
            base = 0.3 + 0.1 * daily_seasonal + 0.05 * hourly_seasonal + 0.02 * np.abs(noise)
            data[metric] = np.clip(base, 0, 2)
            
        elif metric == 'pod_memory_usage':
            # Memory in bytes: ~200-300MB range
            base = 2.5e8 + 2e7 * daily_seasonal + 5e6 * noise
            data[metric] = np.clip(base, 1e8, 5e8)
            
        elif metric == 'alt_cpu_usage':
            # Alternative CPU metric: similar to pod_cpu_usage
            base = 0.25 + 0.08 * daily_seasonal + 0.03 * hourly_seasonal + 0.015 * np.abs(noise)
            data[metric] = np.clip(base, 0, 2)
            
        elif metric == 'alt_memory_usage':
            # Alternative memory: ~150MB range
            base = 1.5e8 + 1e7 * daily_seasonal + 3e6 * noise
            data[metric] = np.clip(base, 5e7, 3e8)
            
        elif metric == 'container_restart_count':
            # Restart count: low integers, mostly stable
            base = 10 + 0.5 * np.abs(noise)
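            # Keep whole-number values but a float dtype, so the anomaly injection below
            # can assign fractional values without an incompatible-dtype assignment
            data[metric] = np.floor(np.clip(base, 0, 50))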
            
        elif metric == 'container_restart_rate_1h':
            # Restart rate: very low, occasional spikes
            base = 0.002 + 0.001 * np.abs(noise)
            # Add occasional spikes
            spike_indices = np.random.choice(n_points, 20, replace=False)
            base[spike_indices] += np.random.uniform(0.01, 0.05, 20)
            data[metric] = np.clip(base, 0, 0.1)
            
        elif metric == 'deployment_unavailable':
            # Unavailable replicas: mostly 0, occasional non-zero
            base = np.zeros(n_points)
            unavail_indices = np.random.choice(n_points, 30, replace=False)
            base[unavail_indices] = np.random.randint(1, 3, 30)
            data[metric] = base
            
        elif metric == 'namespace_pod_count':
            # Pod count per namespace: 5-15 range
            base = 8 + 2 * daily_seasonal + 0.5 * noise
            data[metric] = np.clip(base, 1, 20)
            
        elif metric == 'pods_pending':
            # Pending pods: mostly 0, occasional small values
            base = 0.01 + 0.005 * np.abs(noise)
            data[metric] = np.clip(base, 0, 1)
            
        elif metric == 'pods_running':
            # Running pods: stable around 5-7
            base = 6 + 0.5 * daily_seasonal + 0.2 * noise
            data[metric] = np.clip(base, 1, 15)
            
        elif metric == 'pods_failed':
            # Failed pods: mostly 0, rare failures
            base = 0.1 + 0.05 * np.abs(noise)
            data[metric] = np.clip(base, 0, 2)
            
        elif metric == 'persistent_volume_usage':
            # PV usage: 5-15% range, slow growth
            base = 0.08 + 0.02 * (trend / 5) + 0.01 * daily_seasonal + 0.005 * noise
            data[metric] = np.clip(base, 0, 1)
            
        elif metric == 'cluster_resource_quota':
            # Resource quota: often 0 (not set)
            data[metric] = np.zeros(n_points)
            
        elif metric == 'apiserver_request_total':
            # API server requests: 80-150 range, variable
            base = 110 + 20 * daily_seasonal + 10 * hourly_seasonal + 5 * noise
            data[metric] = np.clip(base, 50, 200)
            
        elif metric == 'apiserver_error_rate':
            # Error rate: mostly 0, occasional spikes
            base = np.zeros(n_points)
            error_indices = np.random.choice(n_points, 15, replace=False)
            base[error_indices] = np.random.uniform(0.01, 0.05, 15)
            data[metric] = np.clip(base, 0, 0.1)
        
        else:
            # Default pattern for any unknown metrics
            base = 50 + trend + 10 * daily_seasonal + 2 * noise
            data[metric] = base
    
    # Create DataFrame
    df = pd.DataFrame(data)
    df['timestamp'] = timestamps
    df['label'] = 0
    
    # Set timestamp as index
    df = df.set_index('timestamp')
    df.index.name = 'timestamp'
    
    # Inject anomalies (5% of data)
    n_anomalies = int(len(df) * 0.05)
    anomaly_indices = np.random.choice(len(df), n_anomalies, replace=False)
    
    for idx in anomaly_indices:
        # Pick 2-3 random metrics to make anomalous
        n_affected = np.random.randint(2, 4)
        affected_metrics = np.random.choice(TARGET_METRICS, n_affected, replace=False)
        
        for metric in affected_metrics:
            if metric in df.columns:
                std = df[metric].std()
                mean = df[metric].mean()
                # Add anomaly: 3-5 std deviations
                multiplier = np.random.uniform(3.0, 5.0) * np.random.choice([-1, 1])
                df.iloc[idx, df.columns.get_loc(metric)] = mean + multiplier * std
        
        df.iloc[idx, df.columns.get_loc('label')] = 1
    
    # Save for downstream notebooks
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    df.to_parquet(PROCESSED_DIR / 'synthetic_anomalies.parquet')
    logger.info(f"✅ Generated and saved synthetic data: {df.shape}")

# Display summary
print("\n📋 DATA SUMMARY:")
print(f"   Shape: {df.shape}")
metric_cols = [c for c in df.columns if c in TARGET_METRICS]
print(f"   TARGET_METRICS columns: {len(metric_cols)}")
print(f"   Columns: {metric_cols[:5]}... + {max(0, len(metric_cols)-5)} more")

if 'label' in df.columns:
    print(f"\n   Normal samples: {(df['label'] == 0).sum()}")
    print(f"   Anomalous samples: {(df['label'] == 1).sum()}")

print("\n📊 Sample data:")
print(df.head())

In [None]:
# DIAGNOSTIC CELL - Run this after Cell 2 to verify data state
print("=" * 70)
print("🔍 DIAGNOSTIC: Checking data state")
print("=" * 70)

print(f"\n1. Is 'df' defined? {'df' in dir()}")
print(f"2. Is 'df' in globals? {'df' in globals()}")

if 'df' in dir():
    print(f"3. df is None? {df is None}")
    if df is not None:
        print(f"4. df.empty? {df.empty}")
        print(f"5. df.shape: {df.shape}")
        print(f"6. df.columns: {list(df.columns)[:8]}...")
        
        # Check for TARGET_METRICS
        if 'TARGET_METRICS' in dir():
            matches = [c for c in df.columns if c in TARGET_METRICS]
            print(f"7. TARGET_METRICS columns found: {len(matches)}")
            print(f"   Matching: {matches[:5]}...")
        else:
            print("7. TARGET_METRICS not defined!")
else:
    print("3. df is NOT defined - this is the problem!")

# Check the parquet file
parquet_path = PROCESSED_DIR / 'synthetic_anomalies.parquet'
print(f"\n8. Parquet file exists? {parquet_path.exists()}")
if parquet_path.exists():
    temp_df = pd.read_parquet(parquet_path)
    print(f"   Parquet columns: {list(temp_df.columns)[:5]}...")
    has_target = any(c in temp_df.columns for c in TARGET_METRICS)
    print(f"   Has TARGET_METRICS? {has_target}")

In [None]:
# Cell 4 - ARIMA-Based Detection (FIXED shape alignment)

from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

def detect_anomalies_arima(series, threshold_std=2.5, metric_name="unknown"):
    """
    Detect anomalies using ARIMA forecasting - FIXED for shape alignment.
    """
    try:
        if series is None:
            return None, None, None, "Series is None"
        
        series_clean = series.dropna()
        
        if len(series_clean) < 50:
            return None, None, None, f"Insufficient data: {len(series_clean)} points"
        
        std_val = series_clean.std()
        if std_val == 0 or np.isnan(std_val):
            return None, None, None, f"Constant values (std={std_val})"
        
        if np.isinf(series_clean).any():
            return None, None, None, "Contains infinity values"
        
        # Fit ARIMA model
        model = ARIMA(series_clean.values, order=(1, 1, 1))
        results = model.fit()
        
        # Get fitted values
        fitted = results.fittedvalues
        n_fitted = len(fitted)
        
        # FIXED: Align lengths properly - take last n_fitted values from original
        actual_values = series_clean.values[-n_fitted:]
        
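        # Calculate residuals (now same length)
        residuals = actual_values - fitted
        
        # With d=1 the first fitted value comes from the model's initial state rather
        # than from past observations, so the first residual is roughly the raw value.
        # Exclude it from the threshold estimate and never flag it.
        residual_std = np.std(residuals[1:])
        if residual_std == 0 or np.isnan(residual_std):
            return None, None, None, f"Residual std is {residual_std}"
        
        threshold = threshold_std * residual_std
        anomaly_flags = (np.abs(residuals) > threshold).astype(int)
        anomaly_flags[0] = 0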
        
        # Create full prediction series aligned with original index
        full_predictions = pd.Series(0, index=series.index)
        
        # Map flags back by index label so rows dropped as NaN keep a 0 flag
        # wherever they occur (assumes the index has no duplicate labels)
        full_predictions.loc[series_clean.index[-n_fitted:]] = anomaly_flags
        
        return full_predictions, results, residuals, None
    
    except Exception as e:
        return None, None, None, f"{type(e).__name__}: {str(e)}"


# =============================================================================
# VERIFY DATA STATE
# =============================================================================

print("=" * 70)
print("🔄 ARIMA ANOMALY DETECTION (FIXED)")
print("=" * 70)

metric_columns = [col for col in df.columns if col in TARGET_METRICS]

print("\n📊 Data verification:")
print(f"   DataFrame shape: {df.shape}")
print(f"   TARGET_METRICS found: {len(metric_columns)}")

# =============================================================================
# ANALYZE EACH METRIC
# =============================================================================

print(f"\n📈 Analyzing {len(metric_columns)} metrics...")
print("-" * 70)

arima_results = {}
arima_models = {}
arima_errors = {}
arima_skipped = {}

for i, metric in enumerate(metric_columns):
    series = df[metric]
    
    n_unique = len(series.unique())
    std_val = series.std()
    
    print(f"\n[{i+1:2}/{len(metric_columns)}] {metric}")
    print(f"      Points: {len(series)} | Unique: {n_unique} | Std: {std_val:.6f}")
    
    # Skip metrics with issues
    if std_val == 0:
        arima_skipped[metric] = "Constant values (std=0)"
        print("      ⏭️  SKIPPED: Constant values")
        continue
    
    if n_unique < 10:
        arima_skipped[metric] = f"Too few unique values ({n_unique})"
        print(f"      ⏭️  SKIPPED: Only {n_unique} unique values")
        continue
    
    # Run ARIMA
    predictions, model, residuals, error = detect_anomalies_arima(series, metric_name=metric)
    
    if predictions is not None:
        arima_results[metric] = predictions
        arima_models[metric] = model
        
        anomalies = predictions.sum()
        print(f"      ✅ Success! Detected {anomalies} anomalies")
        
        if 'label' in df.columns:
            p = precision_score(df['label'], predictions, zero_division=0)
            r = recall_score(df['label'], predictions, zero_division=0)
            f = f1_score(df['label'], predictions, zero_division=0)
            print(f"      📊 P={p:.3f} | R={r:.3f} | F1={f:.3f}")
    else:
        arima_errors[metric] = error
        print(f"      ❌ FAILED: {error}")

# =============================================================================
# RESULTS SUMMARY
# =============================================================================

print("\n" + "=" * 70)
print("📊 ARIMA RESULTS SUMMARY")
print("=" * 70)

print(f"\n   ✅ Successful: {len(arima_results)}/{len(metric_columns)} metrics")
print(f"   ⏭️  Skipped:    {len(arima_skipped)}/{len(metric_columns)} metrics")
print(f"   ❌ Failed:     {len(arima_errors)}/{len(metric_columns)} metrics")

if arima_skipped:
    print(f"\n   Skipped metrics (unsuitable for ARIMA):")
    for m, reason in arima_skipped.items():
        print(f"      - {m}: {reason}")

if arima_errors:
    print(f"\n   Failed metrics:")
    for m, error in arima_errors.items():
        print(f"      - {m}: {error}")

# =============================================================================
# ENSEMBLE PREDICTIONS
# =============================================================================

if arima_results:
    print("\n" + "-" * 70)
    print("🔹 ENSEMBLE PREDICTIONS")
    print("-" * 70)
    
    arima_ensemble = pd.DataFrame(arima_results)
    ensemble_any = (arima_ensemble.sum(axis=1) > 0).astype(int)
    ensemble_majority = (arima_ensemble.sum(axis=1) > len(arima_results) / 2).astype(int)
    
    print(f"\n   Metrics in ensemble: {len(arima_results)}")
    print(f"   ANY method:      {ensemble_any.sum()} anomalies ({ensemble_any.mean():.2%})")
    print(f"   MAJORITY method: {ensemble_majority.sum()} anomalies ({ensemble_majority.mean():.2%})")
    
    if 'label' in df.columns:
        print("\n   📈 Ensemble Performance (ANY method):")
        p = precision_score(df['label'], ensemble_any, zero_division=0)
        r = recall_score(df['label'], ensemble_any, zero_division=0)
        f = f1_score(df['label'], ensemble_any, zero_division=0)
        print(f"      Precision: {p:.3f} | Recall: {r:.3f} | F1: {f:.3f}")
    
    arima_preds = ensemble_any
    arima_model = arima_models
    
    print("\n" + "=" * 70)
    print("✅ ARIMA ANALYSIS COMPLETE")
    print("=" * 70)
else:
    print("\n❌ No ARIMA models trained - cannot create ensemble")
    arima_preds = None
    arima_model = None
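
The ANY and MAJORITY voting rules used above can be illustrated on a small made-up table of per-metric flags. This is only a sketch: the column names `metric_a`, `metric_b`, and `metric_c` and the flag values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical per-metric flags (1 = anomaly); names and values are illustrative only.
flags = pd.DataFrame({
    "metric_a": [0, 1, 1, 0],
    "metric_b": [0, 0, 1, 0],
    "metric_c": [0, 0, 1, 1],
})

# Number of metrics that flagged each timestamp.
votes = flags.sum(axis=1)

# ANY: flagged if at least one metric votes for an anomaly.
any_rule = (votes > 0).astype(int)

# MAJORITY: flagged only if more than half of the metrics vote for an anomaly.
majority_rule = (votes > flags.shape[1] / 2).astype(int)

print(any_rule.tolist())       # [0, 1, 1, 1]
print(majority_rule.tolist())  # [0, 0, 1, 0]
```

The ANY rule tends to favor recall at the cost of precision, since a single noisy metric is enough to raise a flag; the MAJORITY rule is stricter.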

### 3. Prophet-Based Detection

In [None]:
try:
    from prophet import Prophet
    PROPHET_AVAILABLE = True
except ImportError:
    print("⚠️ Prophet not installed. Run: pip install prophet")
    PROPHET_AVAILABLE = False

from sklearn.metrics import precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

def detect_anomalies_prophet(series, threshold_std=2.5):
    """
    Detect anomalies using Prophet forecasting.
    
    Args:
        series: Time series data (pandas Series with datetime index)
        threshold_std: Number of standard deviations for anomaly threshold
    
    Returns:
        tuple: (predictions, model, forecast)
    """
    if not PROPHET_AVAILABLE:
        return None, None, None
    
    try:
        # Prepare data for Prophet (requires 'ds' and 'y' columns)
        prophet_df = pd.DataFrame({
            'ds': series.index,
            'y': series.values
        }).dropna()
        
        if len(prophet_df) < 50:
            return None, None, None
        
        # Fit Prophet model (suppress logging)
        import logging
        logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
        
        model = Prophet(
            daily_seasonality=True,
            weekly_seasonality=True,
            yearly_seasonality=False,  # Not enough data
            changepoint_prior_scale=0.05
        )
        model.fit(prophet_df)
        
        # Make predictions
        forecast = model.predict(prophet_df[['ds']])
        
        # Calculate residuals
        residuals = prophet_df['y'].values - forecast['yhat'].values
        threshold = threshold_std * np.std(residuals)
        
        # Detect anomalies
        predictions = (np.abs(residuals) > threshold).astype(int)
        
        # Create full prediction series aligned with original index.
        # prophet_df keeps its original positional labels after dropna(),
        # so they map each prediction back to its row in `series`.
        full_predictions = pd.Series(0, index=series.index)
        full_predictions.iloc[prophet_df.index.to_numpy()] = predictions
        
        return full_predictions, model, forecast
    
    except Exception as e:
        logger.warning(f"Prophet error: {e}")
        return None, None, None

# =============================================================================
# ANALYZE ALL 16 METRICS WITH PROPHET
# =============================================================================

if PROPHET_AVAILABLE:
    print("=" * 70)
    print("🔄 PROPHET ANOMALY DETECTION ON ALL METRICS")
    print("=" * 70)
    
    prophet_results = {}
    prophet_models = {}
    
    # Get only metric columns (exclude 'label')
    metric_columns = [col for col in df.columns if col in TARGET_METRICS]
    
    # Prophet is slow, so we'll analyze a subset or all based on user choice
    ANALYZE_ALL_METRICS = True  # Set to False to only analyze top 5 metrics
    
    if not ANALYZE_ALL_METRICS:
        # Prioritize most important metrics for time series analysis
        priority_metrics = [
            'pod_cpu_usage',
            'pod_memory_usage', 
            'container_restart_rate_1h',
            'apiserver_error_rate',
            'deployment_unavailable'
        ]
        metric_columns = [m for m in priority_metrics if m in metric_columns]
        print(f"\n⚡ Fast mode: Analyzing {len(metric_columns)} priority metrics")
    else:
        print(f"\n📊 Full mode: Analyzing all {len(metric_columns)} metrics (this may take a few minutes)")
    
    print("-" * 70)
    
    for i, metric in enumerate(metric_columns):
        print(f"\n[{i+1:2}/{len(metric_columns)}] {metric}...", end=" ")
        
        # Get the time series for this metric
        series = df[metric]
        
        # Run Prophet detection
        predictions, model, forecast = detect_anomalies_prophet(series)
        
        if predictions is not None:
            prophet_results[metric] = predictions
            prophet_models[metric] = model
            
            anomalies_detected = predictions.sum()
            print(f"✅ {anomalies_detected} anomalies")
        else:
            print("❌ Failed")
    
    # =============================================================================
    # COMBINE RESULTS - ENSEMBLE ACROSS METRICS
    # =============================================================================
    
    print("\n" + "=" * 70)
    print("📊 PROPHET ENSEMBLE RESULTS")
    print("=" * 70)
    
    if prophet_results:
        # Combine all metric predictions
        prophet_ensemble = pd.DataFrame(prophet_results)
        
        # Ensemble prediction: anomaly if ANY metric flags it
        ensemble_any = (prophet_ensemble.sum(axis=1) > 0).astype(int)
        
        # Ensemble prediction: anomaly if MAJORITY of metrics flag it
        ensemble_majority = (prophet_ensemble.sum(axis=1) > len(prophet_results) / 2).astype(int)
        
        print(f"\n   Metrics analyzed: {len(prophet_results)}")
        print(f"   Total data points: {len(df)}")
        
        print(f"\n   🔹 ANY metric anomaly: {ensemble_any.sum()} anomalies ({ensemble_any.mean():.2%})")
        print(f"   🔹 MAJORITY metrics anomaly: {ensemble_majority.sum()} anomalies ({ensemble_majority.mean():.2%})")
        
        if 'label' in df.columns:
            print(f"\n   📈 Performance (ANY method):")
            precision = precision_score(df['label'], ensemble_any, zero_division=0)
            recall = recall_score(df['label'], ensemble_any, zero_division=0)
            f1 = f1_score(df['label'], ensemble_any, zero_division=0)
            print(f"      Precision: {precision:.3f} | Recall: {recall:.3f} | F1: {f1:.3f}")
        
        # Store for later use
        prophet_preds = ensemble_any
        prophet_model = prophet_models
        
        print("\n" + "=" * 70)
        print("✅ Prophet analysis complete!")
        print("=" * 70)
    else:
        print("❌ No Prophet models were successfully trained")
        prophet_preds = None
        prophet_model = None
else:
    print("⚠️ Prophet not available - skipping Prophet analysis")
    print("   Install with: pip install prophet")
    prophet_preds = None
    prophet_model = None
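
As a rough illustration of the residual rule inside `detect_anomalies_prophet`, the sketch below applies the same threshold (`threshold_std` times the standard deviation of the residuals) to a made-up forecast. The helper name `flag_residual_anomalies` and the toy arrays are assumptions for illustration; no Prophet model is fitted here.

```python
import numpy as np

def flag_residual_anomalies(actual, predicted, threshold_std=2.5):
    """Return a 0/1 array marking points whose absolute residual exceeds
    threshold_std times the standard deviation of all residuals."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    threshold = threshold_std * np.std(residuals)
    return (np.abs(residuals) > threshold).astype(int)

# Toy data (hypothetical): a flat forecast of 10.0 and one spike in the observations.
predicted = np.full(20, 10.0)
actual = predicted.copy()
actual[7] = 25.0

flags = flag_residual_anomalies(actual, predicted)
print(np.flatnonzero(flags))  # indices of flagged points
```

Because the standard deviation is computed over all residuals, including the anomalous ones, large spikes widen the threshold themselves. On data with many anomalies, a lower `threshold_std` may be worth trying.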


### 4. Save Models

In [None]:
# Cell - TimeSeriesEnsemble class and Save Models

from sklearn.base import BaseEstimator, TransformerMixin
import joblib

class TimeSeriesEnsemble(BaseEstimator, TransformerMixin):
    """
    Wrapper class that combines ARIMA and Prophet models for KServe compatibility.
    KServe requires a single .pkl file, not multiple model files.
    
    Now supports multiple models (one per metric) for multi-metric anomaly detection.
    """
    def __init__(self, arima_models=None, prophet_models=None):
        # Expect dicts mapping metric name -> fitted model; None becomes an empty dict
        self.arima_models = arima_models or {}
        self.prophet_models = prophet_models or {}
        
        # List of metrics we have models for
        self.metrics = list(set(
            list(self.arima_models.keys()) + 
            list(self.prophet_models.keys())
        ))
    
    def predict(self, X, periods=None):
        """
        Make predictions using both models and return ensemble result.
        
        Args:
            X: Input data (DataFrame with metric columns, or dict of series)
            periods: Number of periods to forecast (for time series)
        
        Returns:
            Dictionary with predictions from both models per metric
        """
        results = {
            'arima': {},
            'prophet': {},
            'ensemble': {},
            'anomalies': {}
        }
        
        # Handle DataFrame input
        if hasattr(X, 'columns'):
            metrics_to_predict = [col for col in X.columns if col in self.metrics]
        else:
            metrics_to_predict = self.metrics
        
        for metric in metrics_to_predict:
            # ARIMA predictions
            if metric in self.arima_models and self.arima_models[metric] is not None:
                try:
                    model = self.arima_models[metric]
                    forecast = model.forecast(steps=periods or 10)
                    results['arima'][metric] = forecast
                except Exception as e:
                    results['arima'][f'{metric}_error'] = str(e)
            
            # Prophet predictions
            if metric in self.prophet_models and self.prophet_models[metric] is not None:
                try:
                    model = self.prophet_models[metric]
                    steps = periods or 10
                    future = model.make_future_dataframe(periods=steps, freq='1min')
                    forecast = model.predict(future)
                    # Keep only the forecast horizon so the length matches the ARIMA output
                    results['prophet'][metric] = forecast['yhat'].values[-steps:]
                except Exception as e:
                    results['prophet'][f'{metric}_error'] = str(e)
        
        return results
    
    def get_model_summary(self):
        """Return summary of loaded models."""
        return {
            'arima_count': len(self.arima_models),
            'prophet_count': len(self.prophet_models),
            'arima_metrics': list(self.arima_models.keys()),
            'prophet_metrics': list(self.prophet_models.keys()),
            'total_metrics': len(self.metrics)
        }
    
    def get_params(self, deep=True):
        return {
            'arima_models': self.arima_models,
            'prophet_models': self.prophet_models
        }
    
    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

print("✅ TimeSeriesEnsemble class defined for KServe compatibility")

# =============================================================================
# CREATE AND SAVE ENSEMBLE MODEL
# =============================================================================

print("\n" + "=" * 70)
print("💾 SAVING MODELS FOR KSERVE DEPLOYMENT")
print("=" * 70)

# Get the model dictionaries from earlier cells
arima_models_dict = arima_models if 'arima_models' in dir() else {}
prophet_models_dict = prophet_models if 'prophet_models' in dir() else {}

print(f"\n📊 Models available:")
print(f"   ARIMA models:   {len(arima_models_dict)} metrics")
print(f"   Prophet models: {len(prophet_models_dict)} metrics")

# Create ensemble wrapper
ensemble_model = TimeSeriesEnsemble(
    arima_models=arima_models_dict,
    prophet_models=prophet_models_dict
)

# Verify
summary = ensemble_model.get_model_summary()
print(f"\n📦 TimeSeriesEnsemble created:")
print(f"   ARIMA metrics:   {summary['arima_metrics'][:3]}... ({summary['arima_count']} total)")
print(f"   Prophet metrics: {summary['prophet_metrics'][:3]}... ({summary['prophet_count']} total)")

# Clean up old files in model directory
for old_file in MODEL_DIR.glob('*.pkl'):
    old_file.unlink()
    print(f"   🗑️  Removed old: {old_file.name}")

# Save single .pkl file (KServe requirement)
model_path = MODEL_DIR / 'model.pkl'
joblib.dump(ensemble_model, model_path)

print(f"\n✅ Model saved: {model_path}")
print(f"   Size: {model_path.stat().st_size / 1024:.2f} KB")

# Verify only ONE .pkl file exists
pkl_files = list(MODEL_DIR.glob('*.pkl'))
if len(pkl_files) != 1:
    raise RuntimeError(f"Expected 1 .pkl file, found {len(pkl_files)}: {pkl_files}")
print(f"   Files in directory: {len(pkl_files)} ✓ (KServe requirement met)")

# =============================================================================
# SAVE PREDICTIONS
# =============================================================================

results_df = pd.DataFrame({
    'label': df['label'],
    'arima_pred': arima_preds if arima_preds is not None else 0,
    'prophet_pred': prophet_preds if prophet_preds is not None else 0
})

# Combined ensemble: anomaly if EITHER model flags it
results_df['combined_pred'] = (
    (results_df['arima_pred'] == 1) | (results_df['prophet_pred'] == 1)
).astype(int)

predictions_path = PROCESSED_DIR / 'timeseries_predictions.parquet'
results_df.to_parquet(predictions_path)
print(f"\n💾 Predictions saved: {predictions_path}")

# =============================================================================
# PERFORMANCE SUMMARY
# =============================================================================

print("\n" + "-" * 70)
print("📈 MODEL PERFORMANCE COMPARISON")
print("-" * 70)

from sklearn.metrics import precision_score, recall_score, f1_score

for name, preds in [('ARIMA', arima_preds), ('Prophet', prophet_preds), ('Combined', results_df['combined_pred'])]:
    if preds is not None:
        p = precision_score(df['label'], preds, zero_division=0)
        r = recall_score(df['label'], preds, zero_division=0)
        f = f1_score(df['label'], preds, zero_division=0)
        n_anom = preds.sum()
        print(f"   {name:10}: {n_anom:4} anomalies | P={p:.3f} R={r:.3f} F1={f:.3f}")

# =============================================================================
# S3 UPLOAD (OPTIONAL)
# =============================================================================

try:
    from common_functions import upload_model_to_s3, test_s3_connection
    if test_s3_connection():
        upload_model_to_s3(
            str(model_path), 
            s3_key=f'models/anomaly-detection/{MODEL_NAME}/model.pkl'
        )
        print(f"\n☁️  Uploaded to S3: s3://first.bucket/models/anomaly-detection/{MODEL_NAME}/model.pkl")
except Exception as e:
    print(f"\n⚠️  S3 upload skipped: {e}")

# =============================================================================
# VERIFICATION
# =============================================================================

print("\n" + "=" * 70)
print("🔍 VERIFICATION")
print("=" * 70)

# Reload and verify
loaded_model = joblib.load(model_path)
loaded_summary = loaded_model.get_model_summary()

print(f"\n   ✅ Model reloaded successfully")
print(f"   ✅ ARIMA models: {loaded_summary['arima_count']}")
print(f"   ✅ Prophet models: {loaded_summary['prophet_count']}")
print(f"   ✅ Total metrics: {loaded_summary['total_metrics']}")

print("\n" + "=" * 70)
print("🎉 MODEL SAVE COMPLETE")
print("=" * 70)
print(f"\n   Model name: {MODEL_NAME}")
print(f"   Model path: {model_path}")
print(f"   Deploy to KServe with: storageUri: pvc://model-storage-pvc/{MODEL_NAME}")

# Verify outputs - KServe compatible structure
model_path = MODEL_DIR / 'model.pkl'

# Check model file exists
assert model_path.exists(), f"Model not saved: {model_path}"

# Check predictions saved
assert (PROCESSED_DIR / 'timeseries_predictions.parquet').exists(), "Predictions not saved"

# Verify KServe requirement: EXACTLY ONE .pkl file
pkl_files = list(MODEL_DIR.glob('*.pkl'))
assert len(pkl_files) == 1, f"ERROR: Expected 1 .pkl file, found {len(pkl_files)}: {pkl_files}"

# Verify can load and use the model
loaded_model = joblib.load(model_path)
assert hasattr(loaded_model, 'predict'), "Model doesn't have predict method"
# TimeSeriesEnsemble stores its per-metric dicts as `arima_models` / `prophet_models`
assert hasattr(loaded_model, 'arima_models'), "Missing arima_models attribute"
assert hasattr(loaded_model, 'prophet_models'), "Missing prophet_models attribute"

logger.info("✅ All validations passed")
print(f"\n✅ KServe Validation Complete:")
print(f"   Model path: {model_path}")
print(f"   File count: {len(pkl_files)} (correct - must be 1)")
print(f"   Model type: {type(loaded_model).__name__}")
print(f"   Predictions: {PROCESSED_DIR / 'timeseries_predictions.parquet'}")
print(f"\n🎯 Ready for KServe deployment!")
print(f"   Deploy with: oc apply -f <inference-service.yaml>")
print(f"   storageUri: pvc://model-storage-pvc/{MODEL_NAME}")
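
The "exactly one `.pkl` file" invariant checked above can be sketched in isolation. This is a minimal sketch under stated assumptions: the helper `save_single_artifact` is hypothetical, the standard-library `pickle` stands in for `joblib`, and a plain dict stands in for the `TimeSeriesEnsemble` object so the example runs without the notebook's state.

```python
import pickle
import tempfile
from pathlib import Path

def save_single_artifact(obj, model_dir, filename="model.pkl"):
    """Remove stale .pkl files, write one artifact, and confirm it is the only one."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)

    # Clear leftovers from earlier runs so only the new file remains.
    for old_file in model_dir.glob("*.pkl"):
        old_file.unlink()

    model_path = model_dir / filename
    with model_path.open("wb") as fh:
        pickle.dump(obj, fh)

    pkl_files = list(model_dir.glob("*.pkl"))
    if len(pkl_files) != 1:
        raise RuntimeError(f"Expected 1 .pkl file, found {len(pkl_files)}: {pkl_files}")
    return model_path

# Usage on a throwaway directory that already contains a stale artifact.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "stale.pkl").write_bytes(b"old")
    path = save_single_artifact({"arima_models": {}, "prophet_models": {}}, tmp)
    remaining = sorted(p.name for p in Path(tmp).glob("*.pkl"))
    with path.open("rb") as fh:
        reloaded = pickle.load(fh)

print(remaining)  # ['model.pkl']
```

Wrapping cleanup, save, and check in one function keeps the directory from being left in a half-cleaned state between separate notebook cells.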

## Integration Section

This notebook integrates with:
- **Input**: Synthetic anomalies from `synthetic-anomaly-generation.ipynb`
- **Output**: Time series models for `ensemble-anomaly-methods.ipynb`
- **Coordination Engine**: Models can be deployed for real-time detection

## Next Steps

1. Review model performance metrics
2. Proceed to `lstm-based-prediction.ipynb` for the deep learning approach
3. Compare with ensemble methods
4. Deploy the best model to the coordination engine
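
For the first step, reviewing performance, one possible sketch computes precision, recall, and F1 per prediction column. The column names match those written to `timeseries_predictions.parquet`; the helper `precision_recall_f1` and the toy values are assumptions for illustration, and the metrics are computed by hand rather than with scikit-learn to keep the example self-contained.

```python
import pandas as pd

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary 0/1 labels (0.0 when undefined)."""
    y_true = pd.Series(y_true).astype(bool)
    y_pred = pd.Series(y_pred).astype(bool)
    tp = int((y_true & y_pred).sum())    # true positives
    fp = int((~y_true & y_pred).sum())   # false positives
    fn = int((y_true & ~y_pred).sum())   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy values (hypothetical) using the same column names as the saved predictions.
results = pd.DataFrame({
    "label":        [1, 0, 1, 0, 1, 0],
    "arima_pred":   [1, 0, 0, 0, 1, 1],
    "prophet_pred": [0, 0, 1, 0, 1, 0],
})
results["combined_pred"] = (
    (results["arima_pred"] == 1) | (results["prophet_pred"] == 1)
).astype(int)

for col in ["arima_pred", "prophet_pred", "combined_pred"]:
    p, r, f = precision_recall_f1(results["label"], results[col])
    print(f"{col:14} P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy case the combined (EITHER) rule reaches full recall but picks up the false positive from the ARIMA column, which is the trade-off to look for when comparing the real results.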

## References

- ADR-012: Notebook Architecture for End-to-End Workflows
- [ARIMA Documentation](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html)
- [Prophet Documentation](https://facebook.github.io/prophet/)