# Time Series Anomaly Detection

## Overview
This notebook implements time series anomaly detection using ARIMA and Prophet forecasting models. A point is flagged as an anomaly when its forecast residual (actual minus predicted value) exceeds a configurable multiple of the residual standard deviation.

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- Libraries: statsmodels, prophet, pandas, numpy
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies takes months/years
- Different anomaly types are hard to capture
- Can't deliberately cause failures to collect data

### The Solution: Synthetic Anomalies
We generate synthetic anomalies because:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Choose the class balance (for example, 50% normal, 50% anomaly)
- ✅ Reproducible and testable
- ✅ Models trained on synthetic data can generalize to real anomalies, provided the injected patterns resemble production behavior

### Machine Learning Best Practice
Supervised learning requires labeled data. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Balanced Classes**: Equal normal and anomaly samples
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns rather than memorizing examples

## Learning Objectives
- Implement ARIMA forecasting on synthetic data
- Use Prophet for time series analysis
- Detect anomalies via forecast deviations
- Handle seasonal patterns
- Evaluate detection performance with labeled data

## Key Concepts
- **ARIMA**: AutoRegressive Integrated Moving Average
- **Prophet**: Facebook's time series forecasting tool
- **Forecast Error**: Deviation between actual and predicted values
- **Seasonality**: Repeating patterns in time series
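
The forecast-error idea above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the notebook's implementation: the function name `flag_forecast_errors` and the toy series are assumptions made for the example, and it uses the population standard deviation, which can differ slightly from the pandas default used later.

```python
import statistics

def flag_forecast_errors(actual, predicted, threshold_std=2.5):
    """Return 1 where |actual - predicted| exceeds threshold_std * std of the errors."""
    errors = [a - p for a, p in zip(actual, predicted)]
    # Population standard deviation of the forecast errors (assumption for this sketch)
    threshold = threshold_std * statistics.pstdev(errors)
    return [1 if abs(e) > threshold else 0 for e in errors]

# Toy series: a flat forecast of 10.0 with one obvious spike at index 5
actual = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0, 10.1, 9.9, 10.0, 10.0]
predicted = [10.0] * len(actual)
flags = flag_forecast_errors(actual, predicted)
print(flags)
```

Only the spike is flagged here. Note that a large anomaly inflates the standard deviation itself, so very short series can hide outliers under this rule.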

## References

### Why Synthetic Data for Training?
- **He & Garcia (2009)**: "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- **Nikolenko (2021)**: "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- **Goldstein & Uchida (2016)**: "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

### Time Series Anomaly Detection
- **Malhotra et al. (2016)**: "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection" - https://arxiv.org/abs/1607.00148
- **Taylor & Letham (2018)**: "Forecasting at Scale (Prophet)" - https://peerj.com/articles/3190
- **Box & Jenkins (1970)**: "Time Series Analysis, Forecasting and Control (ARIMA)" - Classic reference

### Key Takeaway
Synthetic data provides labeled training examples that allow us to:
1. Train models with known ground truth
2. Evaluate performance with precision, recall, and F1 scores
3. Ensure reproducible and testable results
4. Build models that generalize to real-world anomalies
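
To make the evaluation step concrete, here is a small sketch of how precision, recall, and F1 follow from labels and predictions. The helper name `precision_recall_f1` and the example labels are hypothetical. The notebook itself uses scikit-learn's metric functions, and this version returns 0.0 when a denominator is zero, mirroring `zero_division=0`.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true anomalies, 3 predicted, 2 of them correct
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```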

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import pickle
import joblib  # used below to save and reload the ensemble model
import logging
from pathlib import Path
from sklearn.metrics import precision_score, recall_score, f1_score

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # __file__ is only defined when running as a script, not inside a notebook kernel
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'

# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')

# Create KServe-compatible subdirectory structure
MODEL_NAME = 'anomaly-detector'
MODEL_DIR = MODELS_DIR / MODEL_NAME
MODEL_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Data directory: {DATA_DIR}")
logger.info(f"Models directory: {MODEL_DIR}")

## Implementation Section

### 1. Load Synthetic Data

In [None]:
# Load or generate synthetic data
data_file = PROCESSED_DIR / 'synthetic_anomalies.parquet'

if data_file.exists():
    df = pd.read_parquet(data_file)
    logger.info(f"Loaded existing data: {df.shape}")
else:
    logger.info("Synthetic data not found - generating for validation...")
    # Generate synthetic data inline
    from datetime import datetime, timedelta
    np.random.seed(42)
    n_points = 1000
    n_features = 5
    
    start_time = datetime.now() - timedelta(days=30)
    timestamps = [start_time + timedelta(minutes=i) for i in range(n_points)]
    
    data = {}
    for i in range(n_features):
        trend = np.linspace(50, 60, n_points)
        seasonal = 10 * np.sin(np.linspace(0, 4*np.pi, n_points))
        noise = np.random.normal(0, 2, n_points)
        data[f'metric_{i}'] = trend + seasonal + noise
    
    df = pd.DataFrame(data)
    df['timestamp'] = timestamps
    df['label'] = 0
    
    # Inject anomalies
    anomaly_indices = np.random.choice(len(df), 50, replace=False)
    for idx in anomaly_indices:
        features = np.random.choice(n_features, 2, replace=False)
        for feat in features:
            col = f'metric_{feat}'
            std = df[col].std()
            df.loc[idx, col] += 3.0 * std * np.random.choice([-1, 1])
        df.loc[idx, 'label'] = 1
    
    # Save for downstream notebooks
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    df.to_parquet(data_file)
    logger.info(f"Generated and saved synthetic data: {df.shape}")

print(df.head())
print("\nData info:")
print(f"  Normal samples: {(df['label'] == 0).sum()}")
print(f"  Anomalous samples: {(df['label'] == 1).sum()}")

### 2. ARIMA-Based Detection

In [None]:
from statsmodels.tsa.arima.model import ARIMA

def detect_anomalies_arima(series, threshold_std=2.5):
    """
    Detect anomalies using ARIMA forecasting.
    
    Args:
        series: Time series data
        threshold_std: Number of standard deviations for anomaly threshold
    
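    Returns:
        Tuple of (predictions, fitted results), where predictions are
        0=normal, 1=anomaly; (None, None) if fitting fails
    """
    try:
        # Fit ARIMA model
        model = ARIMA(series, order=(1, 1, 1))
        results = model.fit()
        
        # Get residuals (in-sample one-step-ahead forecast errors)
        residuals = pd.Series(results.resid).copy()
        # With d=1 the first residual is typically the raw first observation
        # rather than a forecast error, so zero it out before thresholding
        residuals.iloc[0] = 0.0
        
        # Calculate threshold
        threshold = threshold_std * residuals.std()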
        
        # Detect anomalies
        predictions = (np.abs(residuals) > threshold).astype(int)
        
        return predictions, results
    except Exception as e:
        logger.error(f"ARIMA error: {e}")
        return None, None

# Apply ARIMA to first metric
metric_col = 'metric_0'
arima_preds, arima_model = detect_anomalies_arima(df[metric_col])

if arima_preds is not None:
    logger.info(f"ARIMA detected {arima_preds.sum()} anomalies")
    # Evaluate
    precision = precision_score(df['label'], arima_preds, zero_division=0)
    recall = recall_score(df['label'], arima_preds, zero_division=0)
    f1 = f1_score(df['label'], arima_preds, zero_division=0)
    print(f"ARIMA Performance: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
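
Because the anomalies themselves inflate the residual standard deviation, a robust scale estimate is sometimes used instead. The sketch below swaps in a different technique from the one in this notebook: a modified z-score based on the median absolute deviation (MAD). The function name `flag_by_mad`, the 3.5 cutoff, and the sample residuals are assumptions for illustration.

```python
import statistics

def flag_by_mad(residuals, threshold=3.5):
    """Flag residuals whose MAD-based modified z-score exceeds the threshold."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    if mad == 0:
        # Degenerate case: at least half the residuals equal the median
        return [0] * len(residuals)
    # 0.6745 rescales the MAD so scores are comparable to z-scores for normal data
    return [1 if 0.6745 * abs(r - med) / mad > threshold else 0 for r in residuals]

# Hypothetical residuals with a single large forecast error at index 5
residuals = [0.2, -0.1, 0.0, 0.3, -0.2, 8.0, 0.1, -0.3, 0.1, -0.1]
mad_flags = flag_by_mad(residuals)
print(mad_flags)
```

With the notebook's variables, this could be tried on `arima_model.resid` after a successful fit, though whether it improves precision or recall on this dataset is not something the sketch establishes.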

### 3. Prophet-Based Detection

In [None]:
from prophet import Prophet

def detect_anomalies_prophet(df_input, threshold_std=2.5):
    """
    Detect anomalies using Prophet forecasting.
    
    Args:
        df_input: DataFrame with 'timestamp' and 'metric_0' columns
        threshold_std: Number of standard deviations for anomaly threshold
    
    Returns:
        Tuple of (predictions, fitted model, forecast DataFrame);
        (None, None, None) if fitting fails
    """
    try:
        # Prepare data for Prophet
        prophet_df = pd.DataFrame({
            'ds': df_input['timestamp'],
            'y': df_input['metric_0']
        })
        
        # Fit Prophet model
        model = Prophet(yearly_seasonality=False, daily_seasonality=False)
        model.fit(prophet_df)
        
        # Make forecast
        forecast = model.predict(prophet_df[['ds']])
        
        # Calculate residuals
        residuals = df_input['metric_0'].values - forecast['yhat'].values
        threshold = threshold_std * residuals.std()
        
        # Detect anomalies
        predictions = (np.abs(residuals) > threshold).astype(int)
        
        return predictions, model, forecast
    except Exception as e:
        logger.error(f"Prophet error: {e}")
        return None, None, None

# Apply Prophet
prophet_preds, prophet_model, prophet_forecast = detect_anomalies_prophet(df)

if prophet_preds is not None:
    logger.info(f"Prophet detected {prophet_preds.sum()} anomalies")
    precision = precision_score(df['label'], prophet_preds, zero_division=0)
    recall = recall_score(df['label'], prophet_preds, zero_division=0)
    f1 = f1_score(df['label'], prophet_preds, zero_division=0)
    print(f"Prophet Performance: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
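
Prophet's forecast frame also carries `yhat_lower` and `yhat_upper` columns (an uncertainty interval whose width is set by the `interval_width` argument), so an alternative to the residual threshold is to flag points that fall outside that band. The helper below is a hypothetical sketch working on plain sequences; the name `flag_outside_interval` and the sample numbers are not part of the notebook.

```python
def flag_outside_interval(actual, lower, upper):
    """Return 1 where the observed value lies outside [lower, upper], else 0."""
    return [
        1 if (a < lo or a > hi) else 0
        for a, lo, hi in zip(actual, lower, upper)
    ]

# Hypothetical observations against a constant forecast band of [8, 14]
observed = [10.0, 12.0, 30.0, 9.0, 5.0]
band_low = [8.0] * len(observed)
band_high = [14.0] * len(observed)
interval_flags = flag_outside_interval(observed, band_low, band_high)
print(interval_flags)
```

Assuming the Prophet cell above ran successfully, a call along the lines of `flag_outside_interval(df['metric_0'], prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'])` would apply the same rule to the notebook's data.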

### 4. Save Models
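
The save cell below relies on a `TimeSeriesEnsemble` class whose definition does not appear in this notebook. As a rough idea of the interface the later validation expects (a `predict` method plus `arima_model` and `prophet_model` attributes), here is a hypothetical stand-in. The class name `TimeSeriesEnsembleSketch`, the `center`, `scale`, and `threshold_std` parameters, and the simple scaled-distance rule inside `predict` are all assumptions, not the project's actual logic.

```python
class TimeSeriesEnsembleSketch:
    """Hypothetical stand-in for TimeSeriesEnsemble; not the project's actual class."""

    def __init__(self, arima_model=None, prophet_model=None,
                 center=0.0, scale=1.0, threshold_std=2.5):
        # Fitted models are stored so they travel inside the single pickled file
        self.arima_model = arima_model
        self.prophet_model = prophet_model
        # Placeholder statistics for the illustrative scoring rule below
        self.center = center
        self.scale = scale
        self.threshold_std = threshold_std

    def predict(self, X):
        """Return 1 for each row with any value beyond threshold_std scaled units from center."""
        flags = []
        for row in X:
            values = row if isinstance(row, (list, tuple)) else [row]
            score = max(abs(v - self.center) / self.scale for v in values)
            flags.append(1 if score > self.threshold_std else 0)
        return flags

# Usage with placeholder statistics and no fitted models attached
sketch = TimeSeriesEnsembleSketch(center=55.0, scale=7.0)
sketch_flags = sketch.predict([[54.0], [90.0], [56.5]])
print(sketch_flags)
```

Whatever the real class looks like, a pickled instance can only be loaded where the class is importable, so it generally needs to live in a module available to the serving runtime rather than only in the notebook session.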

In [None]:
# Create ensemble wrapper combining both models (KServe compatible)
# KServe sklearn server requires EXACTLY ONE .pkl file per model directory
# NOTE: TimeSeriesEnsemble is not defined in this notebook; define or import it
# before running this cell.
ensemble_model = TimeSeriesEnsemble(
    arima_model=arima_model,
    prophet_model=prophet_model
)

print("✅ Created TimeSeriesEnsemble wrapper")
print("   ARIMA model:", "✓" if arima_model is not None else "✗")
print("   Prophet model:", "✓" if prophet_model is not None else "✗")

# Save as SINGLE .pkl file for KServe
# Path: /mnt/models/anomaly-detector/model.pkl
model_path = MODEL_DIR / 'model.pkl'

# Remove old separate model files if they exist (migration)
old_arima = MODEL_DIR / 'arima_model.pkl'
old_prophet = MODEL_DIR / 'prophet_model.pkl'
for old_file in [old_arima, old_prophet]:
    if old_file.exists():
        old_file.unlink()
        logger.info(f"🗑️  Removed old file: {old_file.name}")

# Save ensemble model
joblib.dump(ensemble_model, model_path)
logger.info(f"💾 Saved ensemble model to {model_path}")

# Verify only ONE .pkl file exists (KServe requirement)
pkl_files = list(MODEL_DIR.glob('*.pkl'))
if len(pkl_files) != 1:
    raise RuntimeError(
        f"❌ ERROR: Expected 1 .pkl file, found {len(pkl_files)}: {pkl_files}\n"
        f"KServe requires EXACTLY ONE .pkl file per model directory."
    )

print(f"✅ Model saved correctly for KServe:")
print(f"   Path: {pkl_files[0]}")
print(f"   Size: {pkl_files[0].stat().st_size / 1024:.2f} KB")
print(f"   Files in directory: {len(pkl_files)} (correct - must be 1)")

# Upload to S3 for persistent storage (optional)
try:
    from common_functions import upload_model_to_s3, test_s3_connection
    
    if test_s3_connection():
        upload_model_to_s3(
            str(model_path),
            s3_key='models/anomaly-detection/anomaly-detector/model.pkl'
        )
        logger.info("☁️  Uploaded to S3")
    else:
        logger.info("⚠️ S3 not available - model saved locally only")
except ImportError:
    logger.info("⚠️ S3 functions not available - model saved locally only")
except Exception as e:
    logger.warning(f"⚠️ S3 upload failed (non-critical): {e}")

# Save predictions
results_df = pd.DataFrame({
    'actual': df['label'],
    'arima_pred': arima_preds if arima_preds is not None else 0,
    'prophet_pred': prophet_preds if prophet_preds is not None else 0
})
results_df.to_parquet(PROCESSED_DIR / 'timeseries_predictions.parquet')
logger.info("💾 Saved predictions")

print(f"\n🎉 KServe model deployment ready!")
print(f"   Model name: {MODEL_NAME}")
print(f"   Model path: {model_path}")
print(f"   Deploy to KServe with: storageUri: pvc://model-storage-pvc/{MODEL_NAME}")
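
The saved predictions keep the ARIMA and Prophet flags in separate columns. One simple way to merge them into a single decision is a union or intersection vote, sketched below. The function name `combine_flags` and its `mode` parameter are hypothetical, and this is only an illustration, not necessarily how the project's ensemble combines the two models.

```python
def combine_flags(arima_flags, prophet_flags, mode="any"):
    """Combine two binary flag sequences: 'any' = union, 'all' = intersection."""
    if mode not in ("any", "all"):
        raise ValueError(f"Unknown mode: {mode}")
    if len(arima_flags) != len(prophet_flags):
        raise ValueError("Flag sequences must have the same length")
    if mode == "any":
        return [1 if (a or p) else 0 for a, p in zip(arima_flags, prophet_flags)]
    return [1 if (a and p) else 0 for a, p in zip(arima_flags, prophet_flags)]

# Hypothetical per-model flags for four time steps
arima_example = [0, 1, 1, 0]
prophet_example = [0, 0, 1, 1]
union_flags = combine_flags(arima_example, prophet_example, mode="any")
both_flags = combine_flags(arima_example, prophet_example, mode="all")
print(union_flags, both_flags)
```

The union tends to favor recall and the intersection tends to favor precision, so the choice depends on which kind of error is more costly for the deployment.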

In [None]:
# The ensemble and the predictions were already saved in the previous cell.
# Separate arima_model.pkl / prophet_model.pkl files are intentionally not
# written here: extra .pkl files in MODEL_DIR would break the single-file
# check below.

# Verify outputs - KServe compatible structure
model_path = MODEL_DIR / 'model.pkl'

# Check model file exists
assert model_path.exists(), f"Model not saved: {model_path}"

# Check predictions saved
assert (PROCESSED_DIR / 'timeseries_predictions.parquet').exists(), "Predictions not saved"

# Verify KServe requirement: EXACTLY ONE .pkl file
pkl_files = list(MODEL_DIR.glob('*.pkl'))
assert len(pkl_files) == 1, f"ERROR: Expected 1 .pkl file, found {len(pkl_files)}: {pkl_files}"

# Verify can load and use the model
loaded_model = joblib.load(model_path)
assert hasattr(loaded_model, 'predict'), "Model doesn't have predict method"
assert hasattr(loaded_model, 'arima_model'), "Missing arima_model attribute"
assert hasattr(loaded_model, 'prophet_model'), "Missing prophet_model attribute"

logger.info("✅ All validations passed")
print(f"\n✅ KServe Validation Complete:")
print(f"   Model path: {model_path}")
print(f"   File count: {len(pkl_files)} (correct - must be 1)")
print(f"   Model type: {type(loaded_model).__name__}")
print(f"   Predictions: {PROCESSED_DIR / 'timeseries_predictions.parquet'}")
print(f"\n🎯 Ready for KServe deployment!")
print(f"   Deploy with: oc apply -f <inference-service.yaml>")
print(f"   storageUri: pvc://model-storage-pvc/{MODEL_NAME}")

In [None]:
# Verify outputs
# The ensemble is stored as a single model.pkl under MODEL_DIR; the separate
# arima_model.pkl / prophet_model.pkl files are removed during the save step.
assert (MODEL_DIR / 'model.pkl').exists(), "Ensemble model not saved"
assert (PROCESSED_DIR / 'timeseries_predictions.parquet').exists(), "Predictions not saved"

logger.info("✅ All validations passed")
print(f"\nModels saved to: {MODEL_DIR}")
print(f"Predictions saved to: {PROCESSED_DIR}")

## Integration Section

This notebook integrates with:
- **Input**: Synthetic anomalies from `synthetic-anomaly-generation.ipynb`
- **Output**: Time series models for `ensemble-anomaly-methods.ipynb`
- **Coordination Engine**: Models can be deployed for real-time detection

## Next Steps

1. Review model performance metrics
2. Proceed to `lstm-based-prediction.ipynb` for deep learning approach
3. Compare with ensemble methods
4. Deploy best model to coordination engine

## References

- ADR-012: Notebook Architecture for End-to-End Workflows
- [ARIMA Documentation](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html)
- [Prophet Documentation](https://facebook.github.io/prophet/)