# Isolation Forest Anomaly Detection for Self-Healing Platform

## Overview
This notebook demonstrates implementing Isolation Forest for anomaly detection in OpenShift metrics. Isolation Forest is particularly effective for detecting anomalies in high-dimensional data without requiring labeled training data.
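
To make the idea concrete, here is a minimal sketch on toy data. The two-dimensional points and the names `X_demo` and `demo_forest` are illustrative assumptions, not part of the platform's dataset. It fits scikit-learn's `IsolationForest` without any labels and compares the average score of clustered points with that of a few far-away points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 "normal" points clustered near the origin plus 5 obvious outliers far away.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0], [-8.5, -9.0], [9.5, 0.0]])
X_demo = np.vstack([normal_points, outliers])

# No labels are passed to fit(): the forest only sees the feature matrix.
demo_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
demo_forest.fit(X_demo)

# score_samples(): lower values mean the point was isolated in fewer splits.
demo_scores = demo_forest.score_samples(X_demo)
print("mean score, normal points:", demo_scores[:500].mean())
print("mean score, outliers:     ", demo_scores[500:].mean())
```

Points that sit far from the bulk of the data need fewer random splits to be separated, which is why their scores come out lower.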

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- PyTorch workbench environment with scikit-learn
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies takes months/years
- Different anomaly types are hard to capture
- Can't deliberately cause failures to collect data

### The Solution: Synthetic Anomalies
We generate synthetic anomalies because:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Control the class ratio (this notebook injects roughly 3% anomalies per metric)
- ✅ Reproducible and testable
- ✅ Models tuned on synthetic data can generalize to real anomalies when the synthetic patterns resemble production behavior

### Machine Learning Best Practice
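Isolation Forest is unsupervised and never sees labels during training, but evaluating it still requires labeled data. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Controlled Class Ratio**: A known, adjustable proportion of normal and anomaly samples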
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns, not memorize examples

## Expected Outcomes
- Train Isolation Forest model on synthetic anomalies
- Evaluate model performance (Precision, Recall, F1)
- Save trained model for integration with coordination engine
- Generate anomaly detection pipeline for real-time use

## References
- ADR-002: Hybrid Deterministic-AI Self-Healing Approach
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Isolation Forest Paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf) - Liu, Ting & Zhou (2008)
- [Learning from Imbalanced Data](https://ieeexplore.ieee.org/document/5128907) - He & Garcia (2009)
- [Anomaly Detection with Robust Deep Autoencoders](https://arxiv.org/abs/1511.08747) - Goldstein & Uchida (2016)

## Setup and Configuration

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists only local names, so check module globals instead
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import joblib
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA

# Try to import common functions, with fallback
try:
    from common_functions import (
        setup_environment, print_environment_info,
        generate_synthetic_timeseries, validate_data_quality,
        plot_metric_overview, save_processed_data, load_processed_data
    )
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    print("   Using minimal fallback implementations")
    
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models/anomaly-detection', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}
    
    def print_environment_info(env_info):
        print(f"📁 Data dir: {env_info.get('data_dir', 'N/A')}")
    
    def generate_synthetic_timeseries(metric_name, duration_hours=24, interval_minutes=1, 
                                      add_anomalies=True, anomaly_probability=0.02):
        num_points = int(duration_hours * 60 / interval_minutes)
        timestamps = pd.date_range(end=datetime.now(), periods=num_points, freq=f'{interval_minutes}min')
        values = np.random.normal(50, 10, num_points)
        if add_anomalies:
            anomaly_idx = np.random.choice(num_points, int(num_points * anomaly_probability), replace=False)
            values[anomaly_idx] *= np.random.choice([0.3, 3.0], len(anomaly_idx))
        df = pd.DataFrame({'timestamp': timestamps, 'value': values, 'metric': metric_name, 'is_anomaly': False})
        if add_anomalies:
            df.loc[anomaly_idx, 'is_anomaly'] = True
        return df
    
    def save_processed_data(data, filename):
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        filepath = f'/opt/app-root/src/data/processed/{filename}'
        if hasattr(data, 'to_parquet'):
            data.to_parquet(filepath)
        print(f"💾 Saved: {filepath}")

print("✅ Libraries imported successfully")
import sklearn
print(f"🔬 Scikit-learn version: {sklearn.__version__}")
print(f"📊 Pandas version: {pd.__version__}")

In [None]:
# Set up environment
env_info = setup_environment()
print_environment_info(env_info)

# Configuration for Isolation Forest
ISOLATION_FOREST_CONFIG = {
    'contamination': 0.05,  # Expected proportion of anomalies; sets the decision threshold
    'n_estimators': 200,    # Number of trees
    'max_samples': 'auto',  # Samples drawn per tree; 'auto' means min(256, n_samples)
    'max_features': 1.0,    # Fraction of features drawn per tree (1.0 = all features)
    'random_state': 42      # For reproducibility
}

# Metrics to focus on for anomaly detection
TARGET_METRICS = [
    'node_cpu_utilization',
    'node_memory_utilization',
    'pod_cpu_usage',
    'pod_memory_usage',
    'container_restart_count'
]

print(f"🎯 Target metrics for anomaly detection: {len(TARGET_METRICS)}")
print(f"🌲 Isolation Forest configuration: {ISOLATION_FOREST_CONFIG['n_estimators']} trees")

## Data Preparation

### Load Synthetic Anomalies for Training

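The cell below generates fresh synthetic time series with injected anomalies via `generate_synthetic_timeseries`; it does not read the parquet file produced by Phase 1 (`synthetic-anomaly-generation.ipynb`).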
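
Reading the Phase 1 file itself could look like the sketch below. It assumes the path listed under Prerequisites and makes no assumption about column names; `load_phase1_dataset` is a hypothetical helper, not part of `common_functions`.

```python
from pathlib import Path
from typing import Optional

import pandas as pd

# Path taken from the Prerequisites section of this notebook.
SYNTHETIC_DATA_PATH = Path('/opt/app-root/src/data/processed/synthetic_anomalies.parquet')

def load_phase1_dataset(path: Path = SYNTHETIC_DATA_PATH) -> Optional[pd.DataFrame]:
    """Return the Phase 1 dataset, or None when the file is missing or unreadable."""
    if not path.exists():
        print(f"⚠️ {path} not found - falling back to freshly generated data")
        return None
    try:
        df = pd.read_parquet(path)
    except Exception as exc:  # e.g. no parquet engine installed, or a corrupt file
        print(f"⚠️ Could not read {path}: {exc}")
        return None
    print(f"✅ Loaded {len(df):,} rows, columns: {list(df.columns)}")
    return df

phase1_df = load_phase1_dataset()
```

When the helper returns `None`, the generation cell that follows remains the source of training data.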

**Why Synthetic Data?**
- Real anomalies are rare (<1% in production clusters)
- Synthetic data provides labeled examples for evaluation
- Models learn general patterns, not memorize specific examples
- A controlled anomaly rate (about 3% per metric here) keeps anomalies rare, which Isolation Forest assumes
- Reproducible and testable

**References:**
- He & Garcia (2009): "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- Nikolenko (2021): "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- Goldstein & Uchida (2016): "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

In [None]:
def prepare_anomaly_detection_data(duration_hours=48):
    """
    Generate and prepare data for anomaly detection training
    """
    print("🔄 Preparing anomaly detection dataset...")
    
    # Generate synthetic data for each target metric
    all_data = {}
    
    for metric in TARGET_METRICS:
        print(f"  📊 Generating {metric}...")
        df = generate_synthetic_timeseries(
            metric_name=metric,
            duration_hours=duration_hours,
            interval_minutes=1,
            add_anomalies=True,
            anomaly_probability=0.03  # 3% anomalies
        )
        all_data[metric] = df
        print(f"    ✅ {len(df)} points, {df['is_anomaly'].sum()} anomalies")
    
    return all_data

# Generate training data
training_data = prepare_anomaly_detection_data(duration_hours=48)

# Display summary
total_points = sum(len(df) for df in training_data.values())
total_anomalies = sum(df['is_anomaly'].sum() for df in training_data.values())
print(f"\n📈 Dataset Summary:")
print(f"  Total data points: {total_points:,}")
print(f"  Total anomalies: {total_anomalies:,} ({total_anomalies/total_points:.2%})")
print(f"  Metrics: {len(training_data)}")

In [None]:
def create_feature_matrix(data_dict):
    """
    Create feature matrix for anomaly detection
    """
    print("🔧 Creating feature matrix...")
    
    # Align all time series to common timestamps
    # Find common time range
    min_start = max(df['timestamp'].min() for df in data_dict.values())
    max_end = min(df['timestamp'].max() for df in data_dict.values())
    
    print(f"  📅 Time range: {min_start} to {max_end}")
    
    # Create common time index
    time_index = pd.date_range(start=min_start, end=max_end, freq='1min')
    
    # Build feature matrix
    features = pd.DataFrame(index=time_index)
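    # Initialise explicitly to False; a Series built with only an index and
    # dtype=bool has version-dependent contents
    labels = pd.Series(False, index=time_index, dtype=bool, name='is_anomaly')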
    
    for metric_name, df in data_dict.items():
        # Resample to common time index
        df_resampled = df.set_index('timestamp').reindex(time_index, method='nearest')
        
        # Add basic features
        features[f'{metric_name}_value'] = df_resampled['value']
        
        # Add rolling statistics (5-minute windows)
        features[f'{metric_name}_mean_5m'] = df_resampled['value'].rolling('5min').mean()
        features[f'{metric_name}_std_5m'] = df_resampled['value'].rolling('5min').std()
        features[f'{metric_name}_min_5m'] = df_resampled['value'].rolling('5min').min()
        features[f'{metric_name}_max_5m'] = df_resampled['value'].rolling('5min').max()
        
        # Add lag features
        features[f'{metric_name}_lag_1'] = df_resampled['value'].shift(1)
        features[f'{metric_name}_lag_5'] = df_resampled['value'].shift(5)
        
        # Add rate of change
        features[f'{metric_name}_diff'] = df_resampled['value'].diff()
        features[f'{metric_name}_pct_change'] = df_resampled['value'].pct_change()
        
        # Combine anomaly labels (any metric anomaly = overall anomaly)
        metric_anomalies = df_resampled['is_anomaly'].fillna(False)
        labels = labels | metric_anomalies
    
    # Fill missing values
    features = features.ffill().bfill()
    labels = labels.fillna(False)
    
    # Replace infinity values with 0 and remaining NaN with 0
    features = features.replace([np.inf, -np.inf], 0)
    features = features.fillna(0)
    
    print(f"  ✅ Feature matrix: {features.shape}")
    print(f"  🏷️ Anomaly labels: {labels.sum()} anomalies ({labels.mean():.2%})")
    
    return features, labels

# Create feature matrix
X, y = create_feature_matrix(training_data)

print(f"\n📊 Feature Engineering Complete:")
print(f"  Features: {X.shape[1]} columns")
print(f"  Samples: {X.shape[0]:,} rows")
print(f"  Anomaly rate: {y.mean():.2%}")

## Model Training and Evaluation

Train Isolation Forest model and evaluate its performance.
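
Before reading the evaluation, it helps to estimate what anomaly rate the labels will have. `create_feature_matrix` marks a timestamp as anomalous when any metric is anomalous, so the combined rate is higher than the 3% injected per metric. The sketch below works through the arithmetic under the assumption that anomalies are injected independently per metric (true for the fallback generator; the `common_functions` version may differ).

```python
per_metric_rate = 0.03   # anomaly_probability passed to generate_synthetic_timeseries
n_metrics = 5            # len(TARGET_METRICS)
contamination = 0.05     # ISOLATION_FOREST_CONFIG['contamination']

# P(at least one of n independent metrics is anomalous) = 1 - (1 - p)^n
combined_rate = 1 - (1 - per_metric_rate) ** n_metrics
print(f"Expected combined anomaly rate: {combined_rate:.2%}")  # about 14.13%

# If the model flags only `contamination` of all samples, recall cannot exceed
# contamination / combined_rate, even when every flagged sample is a true anomaly.
max_recall = contamination / combined_rate
print(f"Approximate upper bound on recall: {max_recall:.2f}")  # about 0.35
```

Under these assumptions, a `contamination` of 0.05 flags fewer samples than the labels mark as anomalous, so low recall in the reports that follow would not by itself indicate a broken model.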

In [None]:
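# Split data for training and testing
# Note: a shuffled split places neighbouring timestamps in both sets, so the
# lag and rolling-window features overlap between train and test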
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"📊 Data Split:")
print(f"  Training: {X_train.shape[0]:,} samples")
print(f"  Testing: {X_test.shape[0]:,} samples")
print(f"  Training anomalies: {y_train.sum()} ({y_train.mean():.2%})")
print(f"  Testing anomalies: {y_test.sum()} ({y_test.mean():.2%})")

# Scale features
print("\n🔧 Scaling features...")
scaler = RobustScaler()  # More robust to outliers than StandardScaler
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Feature scaling complete")

In [None]:
# Train Isolation Forest
print("🌲 Training Isolation Forest...")

isolation_forest = IsolationForest(**ISOLATION_FOREST_CONFIG)

# Fit on training data (unsupervised)
isolation_forest.fit(X_train_scaled)

print("✅ Training complete")

# Make predictions
print("🔮 Making predictions...")
y_pred_train = isolation_forest.predict(X_train_scaled)
y_pred_test = isolation_forest.predict(X_test_scaled)

# Get anomaly scores (decision_function: negative = anomaly, positive = normal)
train_scores = isolation_forest.decision_function(X_train_scaled)
test_scores = isolation_forest.decision_function(X_test_scaled)

# Convert predictions to binary (1 = normal, -1 = anomaly)
y_pred_train_binary = (y_pred_train == -1)
y_pred_test_binary = (y_pred_test == -1)

print(f"  Training predictions: {y_pred_train_binary.sum()} anomalies detected")
print(f"  Testing predictions: {y_pred_test_binary.sum()} anomalies detected")

In [None]:
# Evaluate model performance
print("📊 Model Evaluation")
print("=" * 50)

# Training set performance
print("\n🏋️ Training Set Performance:")
print(classification_report(y_train, y_pred_train_binary, 
                          target_names=['Normal', 'Anomaly']))

# Test set performance
print("\n🧪 Test Set Performance:")
print(classification_report(y_test, y_pred_test_binary, 
                          target_names=['Normal', 'Anomaly']))

# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Training confusion matrix
cm_train = confusion_matrix(y_train, y_pred_train_binary)
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Normal', 'Anomaly'], 
            yticklabels=['Normal', 'Anomaly'], ax=axes[0])
axes[0].set_title('Training Set Confusion Matrix')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Test confusion matrix
cm_test = confusion_matrix(y_test, y_pred_test_binary)
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Normal', 'Anomaly'], 
            yticklabels=['Normal', 'Anomaly'], ax=axes[1])
axes[1].set_title('Test Set Confusion Matrix')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

## Model Analysis and Visualization
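
The analysis cell computes AUC-ROC from `decision_function` scores, where lower means more anomalous. ROC AUC depends only on how the scores rank the samples, so any monotonic rescaling gives the same value. The sketch below illustrates this on made-up labels and scores (`demo_labels` and `demo_scores` are hypothetical, not model output).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical labels and decision_function-style scores (lower = more anomalous).
demo_labels = rng.random(1000) < 0.1
demo_scores = rng.normal(0.1, 0.05, 1000) - 0.08 * demo_labels

# Variant 1: negate the raw scores so that higher = more anomalous.
auc_negated = roc_auc_score(demo_labels, -demo_scores)

# Variant 2: min-max scale to [0, 1], then invert.
scaled = (demo_scores - demo_scores.min()) / (demo_scores.max() - demo_scores.min())
auc_scaled = roc_auc_score(demo_labels, 1 - scaled)

print(f"AUC from negated scores: {auc_negated:.6f}")
print(f"AUC from min-max scores: {auc_scaled:.6f}")
```

Both variants rank the samples identically, so the min-max step is a presentation choice rather than a conversion to probabilities.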

In [None]:
# Analyze anomaly scores distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Isolation Forest Analysis', fontsize=16, fontweight='bold')

# Score distribution
axes[0, 0].hist(train_scores[~y_train], bins=50, alpha=0.7, label='Normal', density=True)
axes[0, 0].hist(train_scores[y_train], bins=50, alpha=0.7, label='Anomaly', density=True)
axes[0, 0].set_title('Anomaly Score Distribution (Training)')
axes[0, 0].set_xlabel('Anomaly Score')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Score vs time (sample)
sample_size = min(1000, len(test_scores))
sample_indices = np.random.choice(len(test_scores), sample_size, replace=False)
sample_indices = np.sort(sample_indices)

axes[0, 1].plot(sample_indices, test_scores[sample_indices], 'b-', alpha=0.7, linewidth=1, label='Score')
anomaly_indices = sample_indices[y_test.iloc[sample_indices]]
if len(anomaly_indices) > 0:
    axes[0, 1].scatter(anomaly_indices, test_scores[anomaly_indices], 
                      color='red', s=30, alpha=0.8, label='True Anomalies')
axes[0, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
# The test split is shuffled, so the x-axis is a sample index rather than time
axes[0, 1].set_title('Anomaly Scores (Test Sample)')
axes[0, 1].set_xlabel('Sample Index')
axes[0, 1].set_ylabel('Anomaly Score')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 2D PCA projection of the test set (a visualization aid, not a feature-importance measure)
pca = PCA(n_components=2)
X_test_pca = pca.fit_transform(X_test_scaled)

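# Use plain NumPy boolean masks so indexing the PCA array does not depend on the pandas index
anomaly_mask = y_test.to_numpy(dtype=bool)
normal_mask = ~anomaly_mask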

axes[1, 0].scatter(X_test_pca[normal_mask, 0], X_test_pca[normal_mask, 1], 
                  c='blue', alpha=0.6, s=20, label='Normal')
axes[1, 0].scatter(X_test_pca[anomaly_mask, 0], X_test_pca[anomaly_mask, 1], 
                  c='red', alpha=0.8, s=30, label='Anomaly')
axes[1, 0].set_title('PCA Visualization (Test Set)')
axes[1, 0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[1, 0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Model performance metrics
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_test, y_pred_test_binary)
recall = recall_score(y_test, y_pred_test_binary)
f1 = f1_score(y_test, y_pred_test_binary)

# Min-max scale the scores to [0, 1]; these are not calibrated probabilities,
# and ROC AUC depends only on the ranking of the scores
test_scores_prob = (test_scores - test_scores.min()) / (test_scores.max() - test_scores.min())
auc = roc_auc_score(y_test, 1 - test_scores_prob)  # Invert because lower scores = more anomalous

metrics_text = f"""
Model Performance Metrics:

Precision: {precision:.3f}
Recall: {recall:.3f}
F1-Score: {f1:.3f}
AUC-ROC: {auc:.3f}

Configuration:
Trees: {ISOLATION_FOREST_CONFIG['n_estimators']}
Contamination: {ISOLATION_FOREST_CONFIG['contamination']}
Features: {X.shape[1]}

Data:
Training: {X_train.shape[0]:,}
Testing: {X_test.shape[0]:,}
"""

axes[1, 1].text(0.05, 0.95, metrics_text, transform=axes[1, 1].transAxes, 
               fontsize=10, verticalalignment='top',
               bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
axes[1, 1].set_title('Model Summary')
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print(f"\n🎯 Model Performance Summary:")
print(f"  Precision: {precision:.3f}")
print(f"  Recall: {recall:.3f}")
print(f"  F1-Score: {f1:.3f}")
print(f"  AUC-ROC: {auc:.3f}")

## Save Model and Upload to S3
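
The cell below persists the model and the scaler as two separate files, so a consumer has to load both and apply the scaler before the model. The following is a minimal sketch of that round trip on stand-in data; `detect_anomalies` is a hypothetical helper and the four-feature matrix is an assumption, not the notebook's real feature set.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(300, 4))  # stand-in for the training features

# Fit a scaler/model pair and persist them as two joblib files.
demo_scaler = RobustScaler().fit(X_fit)
demo_model = IsolationForest(n_estimators=50, random_state=0).fit(demo_scaler.transform(X_fit))

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(demo_model, Path(tmp) / 'isolation_forest_model.pkl')
    joblib.dump(demo_scaler, Path(tmp) / 'isolation_forest_scaler.pkl')

    # Consumer side: load both artifacts from disk.
    loaded_model = joblib.load(Path(tmp) / 'isolation_forest_model.pkl')
    loaded_scaler = joblib.load(Path(tmp) / 'isolation_forest_scaler.pkl')

def detect_anomalies(samples, model, scaler):
    """Return (is_anomaly, score) arrays for raw, unscaled feature rows."""
    scaled = scaler.transform(np.asarray(samples, dtype=float))
    scores = model.decision_function(scaled)  # negative = anomalous
    return model.predict(scaled) == -1, scores

# Three rows from the training data plus one row far outside its range.
new_rows = np.vstack([X_fit[:3], np.full((1, 4), 25.0)])
flags, scores = detect_anomalies(new_rows, loaded_model, loaded_scaler)
print(flags, scores)
```

Incoming rows must use the same feature columns, in the same order, as the matrix the scaler was fitted on.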

In [None]:
# Save model and scaler locally
# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models/predictive-analytics') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')
MODELS_DIR.mkdir(parents=True, exist_ok=True)

model_path = MODELS_DIR / 'isolation_forest_model.pkl'
scaler_path = MODELS_DIR / 'isolation_forest_scaler.pkl'

joblib.dump(isolation_forest, model_path)
joblib.dump(scaler, scaler_path)
print(f"💾 Saved Isolation Forest model to {model_path}")
print(f"💾 Saved scaler to {scaler_path}")

# Upload models to S3 for persistent storage
try:
    from common_functions import upload_model_to_s3, test_s3_connection
    
    if test_s3_connection():
        upload_model_to_s3(
            str(model_path),
            s3_key='models/anomaly-detection/isolation_forest_model.pkl'
        )
        upload_model_to_s3(
            str(scaler_path),
            s3_key='models/anomaly-detection/isolation_forest_scaler.pkl'
        )
    else:
        print("⚠️ S3 not available - models saved locally only")
except ImportError:
    print("⚠️ S3 functions not available - models saved locally only")
except Exception as e:
    print(f"⚠️ S3 upload failed (non-critical): {e}")

# Verify model saved
assert model_path.exists(), "Model not saved"
assert scaler_path.exists(), "Scaler not saved"
print("✅ Model artifacts saved successfully")