# Advanced Anomaly Detection - Demonstration

This notebook demonstrates the advanced anomaly detection system, which combines several models:
- LSTM Autoencoder for temporal patterns
- Prophet for seasonality detection
- Isolation Forest for multivariate outliers
- Changepoint detection for structural breaks
- Ensemble framework combining all models

## Setup

In [None]:
# Import required libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Set up plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Add advanced_anomaly_detection to path
sys.path.insert(0, os.path.join(os.getcwd(), '..', 'advanced_anomaly_detection'))

print("Libraries imported successfully!")

In [None]:
# Initialize Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Advanced Anomaly Detection Demo") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "10") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

## 1. Load Data

Load the gold funnel hourly brand data from the existing pipeline.

In [None]:
# Define paths
TABLES_DIR = "../tables"  # Local path
# For GCS: TABLES_DIR = "gs://funnelpulse-ss18851-data/tables"

GOLD_HOURLY_BRAND_PATH = f"{TABLES_DIR}/gold_funnel_hourly_brand"

# Load data
df = spark.read.parquet(GOLD_HOURLY_BRAND_PATH)

print(f"Total records: {df.count():,}")
print("\nSchema:")
df.printSchema()

# Show sample
print("\nSample data:")
df.show(5, truncate=False)
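
The later cells assume that the loaded table exposes `window_start`, `brand`, `views`, and `conversion_rate`. A small guard along the following lines could surface a schema mismatch before any filtering runs. This is only a sketch: the helper name `missing_columns` and the `REQUIRED_COLUMNS` list are illustrative and not part of the pipeline.

```python
# Hypothetical helper, not part of advanced_anomaly_pipeline.
REQUIRED_COLUMNS = ["window_start", "brand", "views", "conversion_rate"]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their listed order."""
    present = set(available)
    return [name for name in required if name not in present]


# Example with a plain list; in the notebook this would be df.columns.
example_columns = ["window_start", "brand", "views", "purchases"]
print(missing_columns(example_columns))
```

In the notebook, a non-empty result from `missing_columns(df.columns)` would be a reason to stop and inspect the source table.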

In [None]:
# Keep only windows with enough views for a stable conversion rate, and drop rows without a brand
from pyspark.sql.functions import col

MIN_VIEWS = 20

df_filtered = df.filter(
    (col("views") >= MIN_VIEWS) & 
    (col("brand").isNotNull())
)

print(f"Filtered records: {df_filtered.count():,}")
print(f"Number of brands: {df_filtered.select('brand').distinct().count()}")

# Basic statistics
print("\nConversion rate statistics:")
df_filtered.select('conversion_rate').describe().show()
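
The `MIN_VIEWS` threshold matters because a conversion rate computed from only a few views is noisy. As a rough illustration, assume a simple binomial model and a rate expressed as a fraction between 0 and 1 (both are assumptions made here, not statements about the pipeline). The standard error of a rate `p` observed over `n` views is then `sqrt(p * (1 - p) / n)`. The function name below is hypothetical.

```python
import math


def conversion_rate_std_error(rate, views):
    """Binomial standard error of a conversion rate observed over `views` views."""
    if views <= 0:
        raise ValueError("views must be positive")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be a fraction between 0 and 1")
    return math.sqrt(rate * (1.0 - rate) / views)


# Compare the noise level for the same underlying rate at different view counts.
for views in (5, 20, 500):
    std_error = conversion_rate_std_error(0.05, views)
    print(f"views={views:>3}  std_error={std_error:.4f}")
```

Under these assumptions, a 5% rate measured over 20 views still has a standard error close to the rate itself, so a higher threshold may be worth considering for brands with sparse traffic.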

## 2. Initialize Advanced Pipeline

Create the advanced anomaly detection pipeline with all models.

In [None]:
from advanced_anomaly_pipeline import create_pipeline

# Create pipeline
pipeline = create_pipeline(spark)

print("\nPipeline initialized successfully!")

## 3. Feature Engineering

Apply advanced feature engineering to create multi-dimensional features.

In [None]:
# Apply feature engineering
df_features = pipeline.prepare_features(df_filtered)

print(f"\nFeatures added. New column count: {len(df_features.columns)}")
print("\nNew columns:")
for col_name in df_features.columns:
    if col_name not in df.columns:
        print(f"  - {col_name}")

# Show sample with new features
print("\nSample with features:")
df_features.select(
    'window_start', 'brand', 'conversion_rate',
    'hour_of_day', 'day_of_week', 'is_weekend',
    'rolling_7d_conversion', 'conversion_z_score'
).show(5)
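
The column names shown above (`hour_of_day`, `day_of_week`, `is_weekend`, `rolling_7d_conversion`, `conversion_z_score`) come from the pipeline, but this notebook does not show how they are computed. The pandas sketch below is one plausible way to derive similar features. It assumes gap-free hourly rows per brand, so that 7 days is 168 rows. The function `add_sketch_features` and its parameters are hypothetical, and the pipeline's actual logic may differ.

```python
import numpy as np
import pandas as pd


def add_sketch_features(pdf, window_rows=7 * 24, min_periods=3):
    """Add calendar features and a per-brand rolling z-score (illustrative only)."""
    out = pdf.sort_values(["brand", "window_start"]).reset_index(drop=True)
    ts = pd.to_datetime(out["window_start"])
    out["hour_of_day"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["day_of_week"] >= 5

    # Rolling statistics per brand; each window includes the current row.
    grouped = out.groupby("brand")["conversion_rate"]
    roll_mean = grouped.transform(
        lambda s: s.rolling(window_rows, min_periods=min_periods).mean()
    )
    roll_std = grouped.transform(
        lambda s: s.rolling(window_rows, min_periods=min_periods).std()
    )
    out["rolling_7d_conversion"] = roll_mean
    # A constant window has zero spread; leave the z-score undefined there.
    out["conversion_z_score"] = (out["conversion_rate"] - roll_mean) / roll_std.replace(0.0, np.nan)
    return out


# Six hourly rows for one made-up brand, starting on a Saturday.
example = pd.DataFrame({
    "window_start": pd.Timestamp("2024-01-06") + pd.to_timedelta(np.arange(6), unit="h"),
    "brand": ["acme"] * 6,
    "conversion_rate": [0.05, 0.05, 0.06, 0.04, 0.05, 0.20],
})
features = add_sketch_features(example)
print(features[["window_start", "conversion_rate", "rolling_7d_conversion", "conversion_z_score"]])
```

The first `min_periods - 1` rows of each brand have no rolling value, so any detector that consumes these features needs to tolerate missing values at the start of a series.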

## 4. Train Models

Train all available anomaly detection models on historical data.

In [None]:
# Train models
training_stats = pipeline.train_models(df_features, train_split=0.8)

# Display training statistics
print("\nTraining Statistics:")
print("=" * 60)
for model_name, stats in training_stats.items():
    print(f"\n{model_name.upper()}:")
    if isinstance(stats, dict):
        for key, value in stats.items():
            if key != 'history':
                print(f"  {key}: {value}")
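
The call above passes `train_split=0.8`, but the notebook does not show how the pipeline partitions the data. For time series, a common choice is a chronological split, so that the model never trains on observations later than the ones it is evaluated on. The pandas sketch below illustrates that idea. The function `chronological_split` is hypothetical and is not claimed to match what `train_models` does internally.

```python
import pandas as pd


def chronological_split(pdf, train_split=0.8, time_col="window_start"):
    """Split rows into an earlier training part and a later hold-out part."""
    if not 0.0 < train_split < 1.0:
        raise ValueError("train_split must be strictly between 0 and 1")
    ordered = pdf.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * train_split)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Ten hourly rows of made-up data.
example = pd.DataFrame({
    "window_start": pd.Timestamp("2024-01-01") + pd.to_timedelta(list(range(10)), unit="h"),
    "conversion_rate": [0.05, 0.06, 0.05, 0.04, 0.05, 0.06, 0.05, 0.05, 0.07, 0.04],
})
train_part, holdout_part = chronological_split(example, train_split=0.8)
print(len(train_part), len(holdout_part))
```

With several brands sharing the same timestamps, a row-count cutoff can separate rows from the same hour. Splitting on a timestamp boundary instead would avoid that.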

## 5. Detect Anomalies

Run all trained models to detect anomalies in the data. The cells below treat the result as a pandas DataFrame.

In [None]:
# Detect anomalies
df_anomalies = pipeline.detect_anomalies(df_features)

print("Anomaly detection complete!")
print(f"\nTotal records analyzed: {len(df_anomalies):,}")

# Report how many anomalies each available model flagged
model_labels = {
    'lstm': 'LSTM',
    'prophet': 'Prophet',
    'isolation_forest': 'Isolation Forest',
    'changepoint': 'Changepoint',
}
for prefix, label in model_labels.items():
    flag_col = f"{prefix}_is_anomaly"
    if flag_col in df_anomalies.columns:
        print(f"{label} anomalies: {df_anomalies[flag_col].sum()}")

## 6. Apply Ensemble

Combine predictions from all models using weighted voting.

In [None]:
# Apply ensemble
df_ensemble = pipeline.apply_ensemble(df_anomalies)

# Get summary statistics
summary = pipeline.ensemble.get_summary_statistics(df_ensemble)

print("\nEnsemble Summary:")
print("=" * 60)
print(f"Total records: {summary['total_records']:,}")
print(f"Total anomalies: {summary['total_anomalies']:,}")
print(f"Anomaly rate: {summary['anomaly_rate']:.2%}")
print(f"\nBy severity: {summary['by_severity']}")
print(f"By type: {summary['by_type']}")

if 'total_estimated_revenue_impact' in summary:
    print(f"\nEstimated revenue impact: ${summary['total_estimated_revenue_impact']:,.2f}")
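
The weighted voting mentioned in this section can be pictured as a weighted average of the per-model scores, followed by a threshold. The sketch below shows that idea on a small pandas frame. The weights, the 0.5 threshold, the assumption that scores lie in [0, 1], and the function `weighted_vote` are all illustrative. The pipeline's own ensemble configuration is not shown in this notebook and may differ.

```python
import pandas as pd


def weighted_vote(scores, weights, threshold=0.5):
    """Combine per-model scores into one ensemble score and an anomaly flag."""
    used = [name for name in weights if name in scores.columns]
    if not used:
        raise ValueError("none of the weighted score columns are present")
    w = pd.Series({name: weights[name] for name in used}, dtype=float)

    # Only models that produced a score for a row contribute to that row.
    present = scores[used].notna().astype(float)
    weighted_sum = scores[used].fillna(0.0).mul(w, axis=1).sum(axis=1)
    weight_total = present.mul(w, axis=1).sum(axis=1)

    result = scores.copy()
    result["ensemble_anomaly_score"] = weighted_sum / weight_total
    result["is_anomaly"] = result["ensemble_anomaly_score"] >= threshold
    return result


# Made-up scores; the LSTM score is missing for the last row.
example_scores = pd.DataFrame({
    "lstm_anomaly_score": [0.9, 0.2, None],
    "prophet_anomaly_score": [0.7, 0.1, 0.8],
    "isolation_forest_anomaly_score": [0.4, 0.3, 0.6],
})
example_weights = {  # hypothetical weights
    "lstm_anomaly_score": 0.4,
    "prophet_anomaly_score": 0.3,
    "isolation_forest_anomaly_score": 0.3,
}
voted = weighted_vote(example_scores, example_weights)
print(voted[["ensemble_anomaly_score", "is_anomaly"]])
```

Renormalizing by the weights of the models that actually scored a row keeps a missing model from dragging the ensemble score toward zero.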

## 7. Visualize Results

Create visualizations to understand anomaly patterns.

In [None]:
# Filter to anomalies only
anomalies = df_ensemble[df_ensemble['is_anomaly']].copy()

if len(anomalies) > 0:
    # Plot 1: Anomaly severity distribution
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Severity distribution
    severity_counts = anomalies['anomaly_severity'].value_counts()
    axes[0, 0].bar(severity_counts.index, severity_counts.values, color='coral')
    axes[0, 0].set_title('Anomalies by Severity')
    axes[0, 0].set_xlabel('Severity')
    axes[0, 0].set_ylabel('Count')
    
    # Type distribution
    type_counts = anomalies['anomaly_type'].value_counts()
    axes[0, 1].bar(type_counts.index, type_counts.values, color='skyblue')
    axes[0, 1].set_title('Anomalies by Type')
    axes[0, 1].set_xlabel('Type')
    axes[0, 1].set_ylabel('Count')
    
    # Anomaly score distribution
    axes[1, 0].hist(anomalies['ensemble_anomaly_score'], bins=20, color='lightgreen', edgecolor='black')
    axes[1, 0].set_title('Ensemble Anomaly Score Distribution')
    axes[1, 0].set_xlabel('Anomaly Score')
    axes[1, 0].set_ylabel('Frequency')
    
    # Top affected brands
    top_brands = anomalies['brand'].value_counts().head(10)
    axes[1, 1].barh(top_brands.index, top_brands.values, color='mediumpurple')
    axes[1, 1].set_title('Top 10 Brands by Anomaly Count')
    axes[1, 1].set_xlabel('Anomaly Count')
    axes[1, 1].set_ylabel('Brand')
    
    plt.tight_layout()
    plt.show()
else:
    print("No anomalies to visualize")

In [None]:
# Plot 2: Time series with anomalies for a sample brand
if len(anomalies) > 0:
    sample_brand = anomalies['brand'].value_counts().index[0]
    brand_data = df_ensemble[df_ensemble['brand'] == sample_brand].sort_values('window_start')
    brand_anomalies = brand_data[brand_data['is_anomaly']]
    
    plt.figure(figsize=(15, 6))
    
    # Plot conversion rate
    plt.plot(brand_data['window_start'], brand_data['conversion_rate'], 
             label='Actual', color='blue', alpha=0.7, linewidth=1)
    
    # Plot expected value if available
    if 'expected_value' in brand_data.columns:
        plt.plot(brand_data['window_start'], brand_data['expected_value'], 
                 label='Expected', color='green', linestyle='--', alpha=0.5)
    
    # Highlight anomalies
    plt.scatter(brand_anomalies['window_start'], brand_anomalies['conversion_rate'],
                color='red', s=100, zorder=5, label='Anomalies', marker='X')
    
    plt.title(f'Conversion Rate Over Time - {sample_brand}')
    plt.xlabel('Time')
    plt.ylabel('Conversion Rate')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    print(f"\nShowing {len(brand_anomalies)} anomalies for brand: {sample_brand}")

## 8. Model Comparison

Compare the performance of individual models.

In [None]:
# Model score comparison
model_columns = [
    'lstm_anomaly_score',
    'prophet_anomaly_score', 
    'isolation_forest_anomaly_score',
    'changepoint_anomaly_score'
]

available_models = [col for col in model_columns if col in df_ensemble.columns]

if available_models:
    # Box plot of model scores
    fig, ax = plt.subplots(figsize=(12, 6))
    
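    data_to_plot = [df_ensemble[name].dropna() for name in available_models]
    labels = [name.replace('_anomaly_score', '').replace('_', ' ').title() for name in available_models]
    
    # Set tick labels separately: the boxplot `labels` keyword is deprecated in Matplotlib 3.9+
    ax.boxplot(data_to_plot)
    ax.set_xticklabels(labels)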
    ax.set_title('Model Score Distribution Comparison')
    ax.set_ylabel('Anomaly Score')
    ax.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Correlation between models
    if len(available_models) > 1:
        correlation = df_ensemble[available_models].corr()
        
        plt.figure(figsize=(10, 8))
        sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0,
                    xticklabels=labels, yticklabels=labels)
        plt.title('Model Score Correlation')
        plt.tight_layout()
        plt.show()
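
Score correlation shows whether models move together, but it does not show whether they flag the same rows. One complementary view is the pairwise overlap of the `*_is_anomaly` flags, measured here with the Jaccard index (rows flagged by both models divided by rows flagged by either). The helper `pairwise_flag_overlap` below is a hypothetical sketch, not part of the pipeline.

```python
from itertools import combinations

import pandas as pd


def pairwise_flag_overlap(pdf, flag_columns):
    """Return the Jaccard overlap of anomaly flags for every pair of models."""
    rows = []
    for left, right in combinations(flag_columns, 2):
        # Treat missing flags as "not flagged".
        a = pdf[left].eq(True)
        b = pdf[right].eq(True)
        both = int((a & b).sum())
        either = int((a | b).sum())
        rows.append({
            "model_a": left,
            "model_b": right,
            "both": both,
            "either": either,
            "jaccard": both / either if either else float("nan"),
        })
    return pd.DataFrame(rows)


# Made-up flags for two models over four rows.
example_flags = pd.DataFrame({
    "lstm_is_anomaly": [True, True, False, False],
    "prophet_is_anomaly": [True, False, False, True],
})
overlap = pairwise_flag_overlap(example_flags, list(example_flags.columns))
print(overlap)
```

A low overlap between two models is not necessarily a problem: an ensemble tends to benefit when its members catch different kinds of anomalies.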

## 9. Root Cause Analysis

Analyze root causes for critical anomalies.

In [None]:
# Get critical anomalies
critical_anomalies = df_ensemble[
    (df_ensemble['is_anomaly']) & 
    (df_ensemble['anomaly_severity'] == 'critical')
].copy()

if len(critical_anomalies) > 0:
    print(f"Found {len(critical_anomalies)} critical anomalies\n")
    
    # Perform RCA on top 5
    for idx, row in critical_anomalies.head(5).iterrows():
        rca = pipeline.analytics.perform_root_cause_analysis(row, df_ensemble)
        
        print("="*60)
        print(f"Brand: {rca['brand']}")
        print(f"Time: {rca['timestamp']}")
        print(f"Type: {rca['anomaly_type']} | Severity: {rca['severity']}")
        print(f"\nSummary: {rca['summary']}")
        print("\nTop Contributing Factors:")
        for factor in rca['primary_factors']:
            print(f"  - {factor['metric']}: {factor['change_percent']:.1f}%")
        print()
else:
    print("No critical anomalies found")
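
Each entry in `primary_factors` carries a `metric` and a `change_percent`. The notebook does not show how the analytics module derives them. A simple way to picture the idea is to compare each metric in the anomalous window against a baseline value and rank the metrics by the size of the percentage change. The function `rank_contributing_factors` and the sample numbers below are hypothetical.

```python
def rank_contributing_factors(current, baseline, top_n=3):
    """Rank metrics by absolute percent change between baseline and current values."""
    factors = []
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue  # percent change is undefined without a non-zero baseline
        change = (current[metric] - base_value) / abs(base_value) * 100.0
        factors.append({"metric": metric, "change_percent": change})
    factors.sort(key=lambda factor: abs(factor["change_percent"]), reverse=True)
    return factors[:top_n]


# Made-up baseline (for example, a trailing average) and anomalous window.
baseline_window = {"views": 1000, "purchases": 50, "revenue": 2500.0}
anomalous_window = {"views": 950, "purchases": 20, "revenue": 1100.0}

for factor in rank_contributing_factors(anomalous_window, baseline_window):
    print(f"  - {factor['metric']}: {factor['change_percent']:.1f}%")
```

In this made-up case, purchases and revenue fall far more than views, which would point toward a conversion problem rather than a traffic problem.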

## 10. Export Results

Export anomaly detection results for further analysis or reporting.

In [None]:
# Export anomalies to CSV
output_path = "../output/anomalies_advanced.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Select key columns for export
export_columns = [
    'window_start', 'brand', 'anomaly_type', 'anomaly_severity',
    'conversion_rate', 'expected_value', 'deviation_percentage',
    'views', 'purchases', 'revenue',
    'ensemble_anomaly_score', 'anomaly_confidence',
    'estimated_revenue_impact', 'estimated_conversion_loss'
]

available_export_cols = [col for col in export_columns if col in anomalies.columns]
export_df = anomalies[available_export_cols]

export_df.to_csv(output_path, index=False)
print(f"Exported {len(export_df)} anomalies to {output_path}")

# Display summary
print("\nExported Data Summary:")
print(export_df.describe())

## 11. Conclusion

### Key Findings

1. **Total Anomalies Detected**: Compare the ensemble's anomaly count against the baseline Z-score method.
2. **Model Performance**: Assess how much each individual model contributes to the ensemble.
3. **Business Impact**: Review the estimated revenue and conversion losses.
4. **Top Issues**: Identify the brands and time periods with the most anomalies.

### Next Steps

1. Review critical anomalies with business stakeholders
2. Set up automated alerting for high-severity anomalies
3. Tune model parameters based on false positive feedback
4. Deploy to production with real-time streaming
5. Monitor model performance over time

### Resources

- [Advanced Anomaly Detection README](../advanced_anomaly_detection/README_advanced_anomaly.md)
- [Configuration Guide](../advanced_anomaly_detection/config/anomaly_config.py)
- [Model Documentation](../advanced_anomaly_detection/advanced_models/)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Notebook complete!")