# 115: Anomaly Detection

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** anomaly types: point, contextual, collective
- **Implement** statistical methods: Z-score, IQR, Grubbs' test
- **Build** machine learning detectors: Isolation Forest, One-Class SVM
- **Apply** time series methods: ARIMA residuals, seasonal decomposition
- **Use** deep learning: autoencoders for high-dimensional anomalies
- **Design** anomaly detection systems for test escapes, parametric outliers, and fraud detection

## 📚 What is Anomaly Detection?

**Anomaly detection** identifies rare observations that deviate significantly from normal patterns. Unlike classification (where labels exist), anomaly detection often works with **unlabeled data**, assuming most observations are normal.

**Core concepts:**
- **Point Anomaly**: Individual data point deviates (e.g., one device with extreme Vdd)
- **Contextual Anomaly**: Abnormal in specific context (e.g., high yield on weekend is suspicious)
- **Collective Anomaly**: Collection of points abnormal together (e.g., sequence of test failures)

**Why Anomaly Detection?**
- ✅ **Rare Event Discovery**: Finds <1% outliers without manual labeling
- ✅ **Early Warning**: Detects issues before they cascade (process drift, equipment failure)
- ✅ **Quality Control**: Identifies defective units, test escapes
- ✅ **Fraud/Security**: Credit card fraud, network intrusions

## 🏭 Post-Silicon Validation Use Cases

**Test Escape Detection**
- Input: Parametric test data (Vdd, Idd, freq) for 100K devices
- Anomaly: Devices passing test but with unusual parameter combinations
- Output: Flag 0.5% of devices as potential test escapes → send to extended test
- Value: Reduce field failures, improve test coverage

**Wafer Map Outlier Detection**
- Input: Spatial yield data (die_x, die_y, pass/fail) for 500 wafers
- Anomaly: Wafers with unusual spatial patterns (edge fails, clusters)
- Output: Identify 3-5 wafers with anomalous signatures → root cause analysis
- Value: Detect lithography issues, contamination, equipment problems

**Parametric Drift Monitoring**
- Input: Daily average Vdd/Idd per lot over 6 months
- Anomaly: Sudden shift in mean or increased variance
- Output: Alert when drift exceeds 3σ from historical baseline
- Value: Early process issue detection, prevent yield loss

**Equipment Health Monitoring**
- Input: Tester sensor data (temperature, power, test time) hourly
- Anomaly: Unusual sensor readings indicating impending failure
- Output: Predict equipment failure 24-48 hours in advance
- Value: Preventive maintenance, minimize downtime

## 🔄 Anomaly Detection Workflow

```mermaid
graph LR
    A[Collect Data] --> B[Explore Distribution]
    B --> C{Data Type?}
    C -->|Univariate| D[Statistical Methods]
    C -->|Multivariate| E[ML Methods]
    C -->|Time Series| F[Temporal Methods]
    D --> G[Z-score, IQR, Grubbs]
    E --> H[Isolation Forest, One-Class SVM]
    F --> I[ARIMA Residuals, Decomposition]
    G --> J[Score Anomalies]
    H --> J
    I --> J
    J --> K{Threshold?}
    K -->|Manual| L[Domain Expertise]
    K -->|Adaptive| M[Percentile-based]
    L --> N[Flag Anomalies]
    M --> N
    N --> O[Investigate Root Cause]
    O --> P[Feedback Loop]
    
    style A fill:#e1f5ff
    style N fill:#ffe1e1
    style O fill:#fffacd
```

## 📊 Learning Path Context

**Prerequisites:**
- 010: Linear Regression (statistical foundations)
- 114: Time Series Forecasting (temporal patterns)

**Next Steps:**
- 051: Autoencoders (deep learning for anomaly detection)
- 131: MLOps (deploying anomaly detectors)

---

Let's detect the unusual! 🚀

## 1. Setup & Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.covariance import EllipticEnvelope

# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

# Random seed
np.random.seed(42)

print("✅ Libraries loaded successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print("Scikit-learn ready for anomaly detection")

## 2. Statistical Anomaly Detection: Z-Score & IQR

**Purpose:** Detect univariate outliers using statistical thresholds.

**Key Points:**
- **Z-Score**: $z = \frac{x - \mu}{\sigma}$, flag if $|z| > 3$ (99.7% coverage)
- **IQR Method**: Outliers outside $[Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR]$
- **Grubbs' Test**: Statistical hypothesis test for single outlier (assumes normality)
- **Assumptions**: Z-score needs normal distribution, IQR is distribution-free

**Why This Matters:** Simple, interpretable, fast for real-time monitoring. Post-silicon: flag devices with Vdd/Idd outside 3σ limits for extended testing.

In [None]:
# Simulate device Vdd measurements (mostly normal, few outliers)
np.random.seed(100)
n_devices = 1000

# Anomalous devices: extreme Vdd values
anomaly_vdd = np.array([0.92, 0.94, 1.18, 1.20, 1.22])  # 5 outliers

# Normal devices: Vdd ~ N(mean=1.05 V, std=0.01 V); sized so the total is exactly n_devices
normal_vdd = np.random.normal(1.05, 0.01, n_devices - len(anomaly_vdd))

# Combine
vdd_data = np.concatenate([normal_vdd, anomaly_vdd])
np.random.shuffle(vdd_data)

# True labels for evaluation (0 = normal, 1 = anomaly)
true_labels = np.array([1 if v < 0.95 or v > 1.15 else 0 for v in vdd_data])

print("Device Vdd Data Summary:")
print("=" * 60)
print(f"Total devices: {len(vdd_data)}")
print(f"Mean Vdd: {vdd_data.mean():.4f} V")
print(f"Std Dev: {vdd_data.std():.4f} V")
print(f"Min: {vdd_data.min():.4f} V")
print(f"Max: {vdd_data.max():.4f} V")
print(f"True anomalies: {true_labels.sum()} ({true_labels.sum()/len(vdd_data)*100:.2f}%)")

# Method 1: Z-Score
z_scores = np.abs((vdd_data - vdd_data.mean()) / vdd_data.std())
z_threshold = 3.0
z_anomalies = z_scores > z_threshold

print(f"\n{'='*60}")
print("Z-Score Method (threshold = 3.0):")
print(f"{'='*60}")
print(f"Anomalies detected: {z_anomalies.sum()}")
print(f"Detection rate: {z_anomalies.sum()/len(vdd_data)*100:.2f}%")

# Confusion matrix for Z-score
z_tp = np.sum((z_anomalies == 1) & (true_labels == 1))  # True positives
z_fp = np.sum((z_anomalies == 1) & (true_labels == 0))  # False positives
z_tn = np.sum((z_anomalies == 0) & (true_labels == 0))  # True negatives
z_fn = np.sum((z_anomalies == 0) & (true_labels == 1))  # False negatives

z_precision = z_tp / (z_tp + z_fp) if (z_tp + z_fp) > 0 else 0
z_recall = z_tp / (z_tp + z_fn) if (z_tp + z_fn) > 0 else 0
z_f1 = 2 * z_precision * z_recall / (z_precision + z_recall) if (z_precision + z_recall) > 0 else 0

print(f"Precision: {z_precision:.3f} (of detected, how many are true anomalies)")
print(f"Recall: {z_recall:.3f} (of true anomalies, how many detected)")
print(f"F1-Score: {z_f1:.3f}")

# Method 2: IQR
Q1 = np.percentile(vdd_data, 25)
Q3 = np.percentile(vdd_data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
iqr_anomalies = (vdd_data < lower_bound) | (vdd_data > upper_bound)

print(f"\n{'='*60}")
print("IQR Method (1.5 × IQR rule):")
print(f"{'='*60}")
print(f"Q1: {Q1:.4f} V, Q3: {Q3:.4f} V, IQR: {IQR:.4f} V")
print(f"Lower bound: {lower_bound:.4f} V")
print(f"Upper bound: {upper_bound:.4f} V")
print(f"Anomalies detected: {iqr_anomalies.sum()}")
print(f"Detection rate: {iqr_anomalies.sum()/len(vdd_data)*100:.2f}%")

# Confusion matrix for IQR (uses its own false negatives, not the Z-score counts)
iqr_tp = np.sum((iqr_anomalies == 1) & (true_labels == 1))
iqr_fp = np.sum((iqr_anomalies == 1) & (true_labels == 0))
iqr_fn = np.sum((iqr_anomalies == 0) & (true_labels == 1))
iqr_precision = iqr_tp / (iqr_tp + iqr_fp) if (iqr_tp + iqr_fp) > 0 else 0
iqr_recall = iqr_tp / (iqr_tp + iqr_fn) if (iqr_tp + iqr_fn) > 0 else 0

print(f"Precision: {iqr_precision:.3f}")
print(f"Recall: {iqr_recall:.3f}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Distribution with anomalies
axes[0, 0].hist(vdd_data, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(vdd_data.mean(), color='blue', linestyle='--', linewidth=2, label='Mean')
axes[0, 0].axvline(vdd_data.mean() + 3*vdd_data.std(), color='red', linestyle=':', linewidth=2, label='±3σ')
axes[0, 0].axvline(vdd_data.mean() - 3*vdd_data.std(), color='red', linestyle=':', linewidth=2)
axes[0, 0].set_xlabel('Vdd (V)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Device Vdd Distribution')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Z-scores
axes[0, 1].scatter(range(len(vdd_data)), z_scores, c=z_anomalies, cmap='coolwarm', alpha=0.6, s=20)
axes[0, 1].axhline(z_threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold = {z_threshold}')
axes[0, 1].set_xlabel('Device Index')
axes[0, 1].set_ylabel('|Z-Score|')
axes[0, 1].set_title(f'Z-Score Anomaly Detection ({z_anomalies.sum()} detected)')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# 3. IQR boxplot
axes[1, 0].boxplot(vdd_data, vert=True, patch_artist=True, 
                    boxprops=dict(facecolor='lightblue'))
axes[1, 0].scatter(np.ones(iqr_anomalies.sum()), vdd_data[iqr_anomalies], 
                   color='red', s=50, zorder=3, label=f'Anomalies ({iqr_anomalies.sum()})')
axes[1, 0].set_ylabel('Vdd (V)')
axes[1, 0].set_title('IQR Method: Boxplot with Outliers')
axes[1, 0].set_xticks([])
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# 4. Method comparison
methods = ['Z-Score', 'IQR']
precisions = [z_precision, iqr_precision]
recalls = [z_recall, iqr_recall]
f1_scores = [z_f1, 2*iqr_precision*iqr_recall/(iqr_precision+iqr_recall) if (iqr_precision+iqr_recall) > 0 else 0]

x = np.arange(len(methods))
width = 0.25

axes[1, 1].bar(x - width, precisions, width, label='Precision', color='skyblue')
axes[1, 1].bar(x, recalls, width, label='Recall', color='lightgreen')
axes[1, 1].bar(x + width, f1_scores, width, label='F1-Score', color='salmon')
axes[1, 1].set_xlabel('Method')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_title('Method Comparison')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(methods)
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3, axis='y')
axes[1, 1].set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   Z-score: Good for normal distributions, but outliers inflate the mean and std it relies on")
print("   IQR: Robust to outliers, doesn't assume a distribution")
print(f"   Z-score detected {z_tp} and IQR detected {iqr_tp} of {true_labels.sum()} true anomalies")
print(f"   False positives: Z-score {z_fp}, IQR {iqr_fp} (the 1.5 × IQR fence sits near 2.7σ for normal data, so it tends to flag more points than |z| > 3)")

## 3. Isolation Forest: Multivariate Anomaly Detection

**Purpose:** Detect anomalies in high-dimensional data using tree-based isolation.

**Key Points:**
- **Concept**: Anomalies are easier to "isolate" (require fewer splits in decision tree)
- **Algorithm**: Build random trees, measure path length to isolate each point
- **Anomaly Score**: Shorter average path → anomaly (isolated quickly)
- **Advantages**: Handles high dimensions, no distance metrics, fast

**Why This Matters:** Detects complex multivariate outliers that univariate methods miss. Post-silicon: find devices with unusual *combinations* of Vdd, Idd, freq (even if each individually looks normal).

In [None]:
# Simulate multivariate device test data: Vdd, Idd, frequency
np.random.seed(200)
n_devices = 1000

# Normal devices: correlated Vdd and Idd (higher Vdd → higher Idd)
normal_vdd = np.random.normal(1.05, 0.01, int(n_devices * 0.97))
normal_idd = 50 + 30 * (normal_vdd - 1.05) + np.random.normal(0, 2, int(n_devices * 0.97))  # Correlation
normal_freq = np.random.normal(2400, 50, int(n_devices * 0.97))

# Anomalous devices: unusual combinations
anomaly_vdd = np.array([1.08, 1.09, 1.02, 1.03, 0.98, 0.97, 1.12, 1.13, 1.01, 1.00,
                         1.06, 1.07, 1.04, 1.05, 1.05, 1.06, 1.04, 1.03, 1.07, 1.08,
                         1.02, 1.03, 1.08, 1.09, 1.01, 1.02, 1.09, 1.10, 1.03, 1.04])
anomaly_idd = np.array([30, 32, 70, 72, 75, 73, 35, 33, 80, 78,  # Unusual Vdd-Idd combinations
                         25, 27, 85, 82, 90, 88, 20, 22, 95, 92,
                         15, 18, 100, 98, 12, 14, 28, 26, 105, 102])
anomaly_freq = np.array([2200, 2180, 2150, 2160, 2100, 2120, 2600, 2620, 2080, 2090,
                          2050, 2070, 2650, 2630, 2700, 2680, 2000, 2020, 2750, 2720,
                          1950, 1970, 2800, 2780, 1900, 1920, 2220, 2240, 2850, 2820])

# Combine
vdd_multi = np.concatenate([normal_vdd, anomaly_vdd])
idd_multi = np.concatenate([normal_idd, anomaly_idd])
freq_multi = np.concatenate([normal_freq, anomaly_freq])

# Create dataframe, keeping the ground-truth label on each row so it survives the shuffle
df_devices = pd.DataFrame({
    'vdd': vdd_multi,
    'idd': idd_multi,
    'freq': freq_multi,
    'is_anomaly': [0] * len(normal_vdd) + [1] * len(anomaly_vdd)
})

# Shuffle rows (labels move together with their rows)
df_devices = df_devices.sample(frac=1, random_state=42).reset_index(drop=True)

# True labels (0 = normal, 1 = anomaly); pop() removes the column so it is not used as a feature
true_labels_multi = df_devices.pop('is_anomaly').to_numpy()

print("Multivariate Device Test Data:")
print("=" * 60)
print(df_devices.describe())
print(f"\nTrue anomalies: {true_labels_multi.sum()} ({true_labels_multi.sum()/len(df_devices)*100:.2f}%)")

# Fit Isolation Forest
iso_forest = IsolationForest(
    contamination=0.03,  # Expected proportion of anomalies
    random_state=42,
    n_estimators=100
)

# Predict (-1 = anomaly, 1 = normal)
iso_predictions = iso_forest.fit_predict(df_devices)
iso_anomalies = iso_predictions == -1

# Anomaly scores (lower = more anomalous)
iso_scores = iso_forest.score_samples(df_devices)

print(f"\n{'='*60}")
print("Isolation Forest Results:")
print(f"{'='*60}")
print(f"Anomalies detected: {iso_anomalies.sum()}")
print(f"Detection rate: {iso_anomalies.sum()/len(df_devices)*100:.2f}%")

# Confusion matrix
iso_tp = np.sum((iso_anomalies == 1) & (true_labels_multi == 1))
iso_fp = np.sum((iso_anomalies == 1) & (true_labels_multi == 0))
iso_tn = np.sum((iso_anomalies == 0) & (true_labels_multi == 0))
iso_fn = np.sum((iso_anomalies == 0) & (true_labels_multi == 1))

iso_precision = iso_tp / (iso_tp + iso_fp) if (iso_tp + iso_fp) > 0 else 0
iso_recall = iso_tp / (iso_tp + iso_fn) if (iso_tp + iso_fn) > 0 else 0
iso_f1 = 2 * iso_precision * iso_recall / (iso_precision + iso_recall) if (iso_precision + iso_recall) > 0 else 0

print(f"Precision: {iso_precision:.3f}")
print(f"Recall: {iso_recall:.3f}")
print(f"F1-Score: {iso_f1:.3f}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Vdd vs Idd (2D projection)
scatter1 = axes[0, 0].scatter(df_devices['vdd'], df_devices['idd'], 
                              c=iso_scores, cmap='RdYlGn', alpha=0.6, s=30)
axes[0, 0].scatter(df_devices[iso_anomalies]['vdd'], df_devices[iso_anomalies]['idd'],
                   edgecolors='red', facecolors='none', s=100, linewidths=2, label='Detected Anomalies')
axes[0, 0].set_xlabel('Vdd (V)')
axes[0, 0].set_ylabel('Idd (mA)')
axes[0, 0].set_title('Isolation Forest: Vdd vs Idd')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
plt.colorbar(scatter1, ax=axes[0, 0], label='Anomaly Score')

# 2. Vdd vs Freq
scatter2 = axes[0, 1].scatter(df_devices['vdd'], df_devices['freq'], 
                              c=iso_scores, cmap='RdYlGn', alpha=0.6, s=30)
axes[0, 1].scatter(df_devices[iso_anomalies]['vdd'], df_devices[iso_anomalies]['freq'],
                   edgecolors='red', facecolors='none', s=100, linewidths=2, label='Detected Anomalies')
axes[0, 1].set_xlabel('Vdd (V)')
axes[0, 1].set_ylabel('Frequency (MHz)')
axes[0, 1].set_title('Isolation Forest: Vdd vs Frequency')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
plt.colorbar(scatter2, ax=axes[0, 1], label='Anomaly Score')

# 3. Anomaly score distribution
axes[1, 0].hist(iso_scores[~iso_anomalies], bins=50, alpha=0.7, label='Normal', color='green', edgecolor='black')
axes[1, 0].hist(iso_scores[iso_anomalies], bins=20, alpha=0.7, label='Anomaly', color='red', edgecolor='black')
axes[1, 0].set_xlabel('Anomaly Score')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Anomaly Score Distribution')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

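# 4. PCA visualization (3D → 2D); standardize first so frequency (largest numeric scale) does not dominate
pca = PCA(n_components=2)
pca_features = pca.fit_transform(StandardScaler().fit_transform(df_devices))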

axes[1, 1].scatter(pca_features[:, 0], pca_features[:, 1], 
                   c=iso_scores, cmap='RdYlGn', alpha=0.6, s=30)
axes[1, 1].scatter(pca_features[iso_anomalies, 0], pca_features[iso_anomalies, 1],
                   edgecolors='red', facecolors='none', s=100, linewidths=2, label='Detected Anomalies')
axes[1, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
axes[1, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
axes[1, 1].set_title('PCA Projection with Anomalies')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   Isolation Forest detects multivariate outliers (unusual combinations)")
print(f"   Captured {iso_tp} of {true_labels_multi.sum()} true anomalies ({iso_recall*100:.1f}% recall)")
print("   Some anomalies have normal Vdd/Idd individually but abnormal together")
print(f"   score_samples is always negative; lower means more anomalous (cutoff used here: {iso_forest.offset_:.3f})")
print("   PCA helps visualize high-D anomalies in 2D space")

## 🚀 Real-World Project Templates

Build production anomaly detection systems:

### 1️⃣ **Post-Silicon Test Escape Detection**
- **Objective**: Flag 0.1-0.5% devices with unusual parametric combinations before shipment  
- **Data**: 1M devices/month, 50+ parametric tests (Vdd, Idd, freq, power, timing)  
- **Success Metric**: Reduce field failure rate by 30%, precision > 20% (avoid false alarms)  
- **Method**: Isolation Forest on normalized parameters, ensemble with Mahalanobis distance  
- **Tech Stack**: Python, Spark for scale, real-time scoring API, feedback loop from field failures

### 2️⃣ **Credit Card Fraud Detection**
- **Objective**: Detect fraudulent transactions in real-time (<100ms latency)  
- **Data**: 100M transactions/month, features: amount, merchant, time, location, user history  
- **Success Metric**: Recall > 80% (catch fraud), precision > 40% (minimize false declines)  
- **Method**: Isolation Forest + One-Class SVM ensemble, time-based anomaly scoring  
- **Tech Stack**: Python, Kafka streaming, Redis caching, model serving (TensorFlow Serving)

### 3️⃣ **Network Intrusion Detection**
- **Objective**: Identify malicious network traffic patterns  
- **Data**: 10B packets/day, features: packet size, protocol, source/dest IPs, timing  
- **Success Metric**: Detect >90% attacks, false alarm rate < 1%  
- **Method**: Autoencoder (learn normal traffic patterns), threshold reconstruction error  
- **Tech Stack**: PyTorch, LSTM autoencoder, packet capture (pcap), SIEM integration

### 4️⃣ **Manufacturing: Equipment Failure Prediction**
- **Objective**: Detect anomalous sensor readings 24-48 hours before equipment failure  
- **Data**: Sensor streams (vibration, temperature, pressure) at 1Hz from 500 machines  
- **Success Metric**: 75% of failures predicted with 48hr lead time, uptime improvement 15%  
- **Method**: Time series anomaly (STL decomposition + Z-score on residuals), LSTM autoencoder  
- **Tech Stack**: Python, InfluxDB (time series DB), Grafana alerts, PySpark for batch processing

### 5️⃣ **Healthcare: Sepsis Early Detection**
- **Objective**: Identify patients at risk of sepsis 6-12 hours before clinical diagnosis  
- **Data**: EHR vitals every 15min (heart rate, BP, temp, lab results), 50K patients/year  
- **Success Metric**: AUC > 0.85, sensitivity > 80%, alert physicians with >6hr lead time  
- **Method**: One-Class SVM on normal patient trajectories, Isolation Forest on lab anomalies  
- **Tech Stack**: R/Python, EHR integration (FHIR), real-time dashboard, clinical workflow integration

### 6️⃣ **E-Commerce: Fake Review Detection**
- **Objective**: Identify suspicious product reviews (bots, incentivized, malicious)  
- **Data**: 1M reviews/month, features: text, rating, timing, user history, product category  
- **Success Metric**: Precision > 60% (manual review cost), catch 70% of fake reviews  
- **Method**: Text embeddings (BERT) + behavioral features → Isolation Forest, clustering  
- **Tech Stack**: Python, Transformers, Elasticsearch, human-in-the-loop labeling

### 7️⃣ **Cybersecurity: Insider Threat Detection**
- **Objective**: Detect employees with anomalous access patterns (data exfiltration risk)  
- **Data**: Login logs, file access, email patterns, network activity for 10K employees  
- **Success Metric**: Detect >60% insider threats (rare event), investigation rate < 5 cases/week  
- **Method**: User behavior profiling (Isolation Forest), graph anomaly (unusual connections)  
- **Tech Stack**: Python, Neo4j (graph DB), Active Directory logs, UEBA platform

### 8️⃣ **Energy: Smart Grid Anomaly Detection**
- **Objective**: Detect electricity theft and meter malfunctions  
- **Data**: Smart meter readings (15min intervals) for 1M households, weather data  
- **Success Metric**: Detect 80% of theft/malfunctions, reduce false positives by 50% vs baseline  
- **Method**: Time series clustering (normal consumption profiles), distance-based anomalies  
- **Tech Stack**: Python, Hadoop for scale, time series DB, GIS visualization for spatial patterns

## 🎯 Key Takeaways

### What is Anomaly Detection?

Identifying rare observations that deviate significantly from normal patterns, typically working with **unlabeled data** under the assumption that most observations are normal.

### Anomaly Types

| **Type** | **Definition** | **Example** | **Detection Method** |
|----------|---------------|-------------|---------------------|
| **Point Anomaly** | Individual data point unusual | Device Vdd = 1.20V (normal ~1.05V) | Z-score, IQR, Isolation Forest |
| **Contextual Anomaly** | Abnormal in specific context | High yield on weekend (unusual timing) | Time series methods, conditional modeling |
| **Collective Anomaly** | Sequence/group abnormal together | 10 consecutive test failures | Sequential pattern mining, HMM |

### Statistical Methods

**Z-Score:**
- Formula: $z = \frac{x - \mu}{\sigma}$
- Threshold: $|z| > 3$ (99.7% of normal data within ±3σ)
- **Pros**: Simple, interpretable, fast
- **Cons**: Assumes normal distribution, sensitive to outliers in training
- **Use When**: Univariate data, known distribution
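The sensitivity listed under Cons arises because the outliers themselves inflate $\mu$ and $\sigma$. One common workaround, sketched here as an illustration rather than as the notebook's own method, is the modified Z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The 0.6745 constant and the 3.5 cutoff follow the usual Iglewicz-Hoaglin recommendation; the sample data and the helper name `modified_z_scores` are assumptions.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (x - median) / mad

# Assumed sample: 200 typical Vdd readings plus three injected extremes
rng = np.random.default_rng(1)
vdd = np.concatenate([rng.normal(1.05, 0.01, 200), [1.20, 1.22, 0.90]])

robust_flags = np.abs(modified_z_scores(vdd)) > 3.5
classic_flags = np.abs((vdd - vdd.mean()) / vdd.std()) > 3.0

print(f"Modified Z-score flags: {robust_flags.sum()}")
print(f"Classic Z-score flags:  {classic_flags.sum()}")
```

Because the median and MAD barely move when a few extreme values are added, the cutoff stays anchored to the bulk of the data.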

**IQR (Interquartile Range):**
- Bounds: $[Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR]$ where $IQR = Q_3 - Q_1$
- **Pros**: Robust to outliers, distribution-free
- **Cons**: Fixed threshold (1.5× may not fit all domains)
- **Use When**: Skewed distributions, presence of outliers

**Grubbs' Test:**
- Statistical hypothesis test for single outlier
- Null hypothesis: No outliers present
- **Pros**: Principled statistical test, p-value for confidence
- **Cons**: Only detects one outlier at a time, requires normality
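Grubbs' test is described above but not implemented in this notebook, so here is a minimal sketch of the two-sided version. It assumes approximately normal data; the function name `grubbs_test`, the default `alpha`, and the sample values are illustrative assumptions. The critical value uses the standard formula based on the t distribution with $n - 2$ degrees of freedom.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        raise ValueError("Grubbs' test needs at least 3 observations")
    deviations = np.abs(x - x.mean())
    idx = int(np.argmax(deviations))          # most extreme observation
    g_stat = deviations[idx] / x.std(ddof=1)  # test statistic G
    # Critical value derived from the t distribution with n - 2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return idx, float(g_stat), float(g_crit), bool(g_stat > g_crit)

# Assumed sample: 50 typical Vdd readings plus one extreme value at the end
rng = np.random.default_rng(0)
sample = np.append(rng.normal(1.05, 0.01, 50), 1.20)

idx, g_stat, g_crit, is_outlier = grubbs_test(sample)
print(f"Most extreme index: {idx}, G = {g_stat:.2f}, critical = {g_crit:.2f}, outlier: {is_outlier}")
```

To look for several outliers, the test is typically repeated after removing each confirmed outlier, which is where the "one at a time" limitation shows up.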

### Machine Learning Methods

**Isolation Forest:**
- **Concept**: Anomalies are easier to isolate (shorter tree path lengths)
- **Algorithm**: Build ensemble of random trees, measure average path length
- **Anomaly Score**: Shorter path → more anomalous
- **Pros**: Handles high dimensions, fast (linear time), no distance metrics
- **Cons**: Contamination parameter must be set, black box
- **Use When**: Multivariate data, high dimensions (>10 features), large datasets

**One-Class SVM:**
- **Concept**: Learn decision boundary around normal data in high-D space
- **Algorithm**: Map data to high-D space (kernel trick), find separating hyperplane
- **Pros**: Flexible (kernel choice), theoretical foundation
- **Cons**: Slow for large datasets, parameter tuning (ν, γ), not interpretable
- **Use When**: Small-medium datasets (<10K samples), need tight boundary
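`OneClassSVM` is imported in the setup cell but never exercised, so the following is a small sketch of how it might be used. It assumes a training set of known-good devices with two features (Vdd, Idd); the data, the `nu=0.02` setting, and the three probe devices are all made up for illustration. Features are standardized first because the RBF kernel is scale-sensitive.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)

# Assumed training data: known-good devices, columns are Vdd (V) and Idd (mA)
train_normal = np.column_stack([
    rng.normal(1.05, 0.01, 500),
    rng.normal(50.0, 2.0, 500),
])

# nu bounds the fraction of training points treated as outliers
detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", nu=0.02, gamma="scale"),
)
detector.fit(train_normal)

# Hypothetical new devices: typical, extreme Idd, extreme Vdd
new_devices = np.array([[1.05, 50.0], [1.05, 80.0], [1.20, 50.0]])
pred = detector.predict(new_devices)  # +1 = inlier, -1 = outlier
print(pred)
```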

**Autoencoder (Deep Learning):**
- **Concept**: Neural network learns to compress and reconstruct normal data
- **Anomaly Score**: Reconstruction error (MSE between input and output)
- **Pros**: Learns complex patterns, handles images/sequences, nonlinear
- **Cons**: Requires tuning, needs more data, can overfit to anomalies
- **Use When**: High-dimensional data (images, time series), sufficient training data (>10K)
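The reconstruction-error idea can be sketched without a deep learning framework. The example below uses scikit-learn's `MLPRegressor` trained to reproduce its own input through a 2-unit bottleneck, which is a lightweight stand-in for a real autoencoder (Notebook 051 covers the TensorFlow/PyTorch versions). The synthetic data, layer sizes, and 99th-percentile threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Assumed "normal" data: 8 correlated features driven by 2 latent factors
latent = rng.normal(size=(2000, 2))
mixing = rng.normal(size=(2, 8))
normal_data = latent @ mixing + 0.05 * rng.normal(size=(2000, 8))
# Assumed anomalies: points that ignore that correlation structure
anomalies = rng.normal(scale=3.0, size=(20, 8))

scaler = StandardScaler().fit(normal_data)
X_train = scaler.transform(normal_data)

# Input -> 16 -> 2 (bottleneck) -> 16 -> output, trained to reproduce the input
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="tanh",
                           max_iter=1000, random_state=0)
autoencoder.fit(X_train, X_train)

def reconstruction_error(model, X):
    """Per-sample mean squared error between input and reconstruction."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Threshold taken from the training error distribution (assumed 99th percentile)
threshold = np.percentile(reconstruction_error(autoencoder, X_train), 99)
anomaly_errors = reconstruction_error(autoencoder, scaler.transform(anomalies))
print(f"Threshold: {threshold:.3f}, flagged {np.sum(anomaly_errors > threshold)} of {len(anomalies)}")
```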

### Time Series Anomaly Detection

**ARIMA Residuals:**
1. Fit ARIMA model on historical data
2. Forecast expected values
3. Flag if $|actual - forecast| > k \times \sigma_{residual}$
- **Use**: Detect sudden shifts, spikes in temporal data

**Seasonal Decomposition (STL):**
1. Decompose: $Y_t = Trend_t + Seasonal_t + Residual_t$
2. Apply anomaly detection to residuals
- **Use**: Separate seasonal patterns from true anomalies
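The decompose-then-score idea can be sketched with pandas alone. Note that this uses a classical additive decomposition (centered rolling mean for the trend, per-phase averages for the seasonal part), which is a simpler stand-in for STL, not STL itself; `statsmodels` provides the real thing. The weekly period, the synthetic series, and the injected spike are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
period = 7                      # assumed weekly seasonality
n = 140
t = np.arange(n)

# Synthetic series: linear trend + weekly cycle + noise
series = pd.Series(100 + 0.1 * t + 5 * np.sin(2 * np.pi * t / period)
                   + rng.normal(0, 0.5, n))
series.iloc[60] += 8            # injected spike (hypothetical anomaly)

# Trend: centered rolling mean over one full period
trend = series.rolling(window=period, center=True).mean()
detrended = series - trend
# Seasonal: average detrended value at each position in the cycle
seasonal = detrended.groupby(t % period).transform("mean")
residual = detrended - seasonal

# Robust Z-score on the residuals (median / MAD), cutoff 3.5
med = residual.median()
mad = (residual - med).abs().median()
robust_z = 0.6745 * (residual - med) / mad
flags = robust_z.abs() > 3.5

print("Flagged time steps:", list(flags[flags].index))
```

A rolling-mean trend smears a spike across its window, so neighbors of the spike can also pick up elevated scores; robust decompositions such as STL are designed to reduce that effect.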

**Prophet (Facebook):**
- Robust trend + multiple seasonalities + holidays
- Anomalies can be flagged as observations that fall outside the forecast uncertainty interval
- **Use**: Business metrics with strong seasonality

### Method Selection Guide

```
Data Characteristics:
├─ Univariate, normal distribution → Z-Score
├─ Univariate, skewed/unknown → IQR
├─ Multivariate, <10 features → One-Class SVM, Mahalanobis
├─ Multivariate, >10 features → Isolation Forest, Autoencoder
├─ Time series with trend → ARIMA residuals, Prophet
├─ Time series with seasonality → STL decomposition
└─ High-dimensional (images, text) → Autoencoder, VAE

Sample Size:
├─ < 1K samples → Statistical methods, One-Class SVM
├─ 1K - 100K → Isolation Forest
└─ > 100K → Isolation Forest (parallel), Autoencoder

Interpretability Needs:
├─ High (must explain) → Z-score, IQR, decision trees
└─ Low (black box OK) → Isolation Forest, Autoencoder

Real-Time Requirements:
├─ <10ms → Precomputed thresholds (Z-score, IQR)
├─ <100ms → Isolation Forest (CPU)
└─ >100ms → Autoencoder (GPU)
```

### Evaluation Metrics

**Without True Labels:**
- **Contamination Rate**: Proportion flagged as anomalies (should match domain expectation)
- **Silhouette Score**: How well anomalies separate from normal
- **Manual Inspection**: Sample flagged cases for validation

**With True Labels (rare):**
- **Precision**: $\frac{TP}{TP + FP}$ (of detected, how many are truly anomalous)
- **Recall**: $\frac{TP}{TP + FN}$ (of true anomalies, how many detected)
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under receiver operating characteristic curve
- **Precision at K**: Precision in top K most anomalous predictions
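When labels are available, the ranking metrics above can be computed directly from anomaly scores. This is a small sketch on made-up labels and scores; `precision_at_k` is a hypothetical helper, while `roc_auc_score` and `average_precision_score` come from scikit-learn. Scores are assumed to follow "higher = more anomalous", so outputs such as `IsolationForest.score_samples` would need to be negated first.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Hypothetical ground truth (1 = anomaly) and anomaly scores
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.1, 0.7, 0.2, 0.1, 0.4])

p_at_2 = precision_at_k(y_true, scores, k=2)
auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)

print(f"Precision@2: {p_at_2:.2f}")      # 0.50
print(f"AUC-ROC: {auc:.4f}")             # 0.9375
print(f"Average precision: {ap:.4f}")    # 0.8333
```

With heavily imbalanced data, precision at K and average precision usually say more about alert quality than AUC-ROC does.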

**Business Metrics:**
- **Alert Fatigue**: False positive rate (too many false alarms → ignored alerts)
- **Lead Time**: How early anomalies detected before failure
- **Cost Savings**: Value of prevented issues - investigation cost

### Common Pitfalls

- ❌ **Training on Contaminated Data**: If training set has anomalies, model learns them as normal
- ❌ **Fixed Thresholds**: Static thresholds fail when data distribution shifts
- ❌ **Ignoring Context**: 100 transactions/day is normal weekday, anomalous on Sunday
- ❌ **Curse of Dimensionality**: Distance metrics break in high dimensions (>20 features)
- ❌ **Imbalanced Evaluation**: 99% accuracy meaningless if only 0.1% are anomalies
- ❌ **No Feedback Loop**: Anomalies change over time, models need retraining
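The "Ignoring Context" pitfall can be made concrete with the weekday/Sunday example. The sketch below compares a global Z-score with one computed within each context group; the transaction volumes, the dates, and the weekday/weekend split are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# 12 weeks of daily data starting on a Monday (2024-01-01)
days = pd.date_range("2024-01-01", periods=84, freq="D")
is_weekend = days.dayofweek >= 5
# Assumed volumes: about 100/day on weekdays, about 20/day on weekends
volume = np.where(is_weekend, rng.normal(20, 3, 84), rng.normal(100, 8, 84))
df = pd.DataFrame({"date": days, "is_weekend": is_weekend, "transactions": volume})

# Hypothetical contextual anomaly: a weekday-sized volume on a Sunday
sunday = df.index[df["date"] == pd.Timestamp("2024-02-11")][0]
df.loc[sunday, "transactions"] = 100.0

# Global Z-score ignores context; grouped Z-score compares like with like
global_z = (df["transactions"] - df["transactions"].mean()) / df["transactions"].std()
context_z = df.groupby("is_weekend")["transactions"].transform(
    lambda s: (s - s.mean()) / s.std()
)

print(f"Global z on that Sunday:     {global_z[sunday]:.2f}")
print(f"Contextual z on that Sunday: {context_z[sunday]:.2f}")
```

The same value looks unremarkable against all days but stands out once it is scored only against other weekend days.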

### Post-Silicon Applications

**Test Escape Detection:**
- **Input**: Vdd, Idd, freq, power, timing for 100K devices
- **Method**: Isolation Forest on parametric space
- **Threshold**: Top 0.5% anomaly scores → extended test
- **Value**: Reduce field failures 20-40%
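A percentile cutoff like the "top 0.5%" rule can be applied directly to Isolation Forest scores instead of relying on the `contamination` parameter. This is a sketch on synthetic stand-in data (10,000 devices, 5 standardized parameters); the sizes and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))  # hypothetical standardized parametric data

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples: lower = more anomalous, so negate to get "higher = worse"
anomaly_score = -forest.score_samples(X)

# Send the top 0.5% most anomalous devices to extended test
cutoff = np.quantile(anomaly_score, 0.995)
extended_test = np.flatnonzero(anomaly_score > cutoff)
print(f"{len(extended_test)} of {len(X)} devices routed to extended test")
```

Working from the score distribution keeps the flagged volume predictable, which matters when extended test capacity is fixed.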

**Wafer Map Analysis:**
- **Input**: Spatial yield patterns (die_x, die_y, pass/fail)
- **Method**: Spatial clustering + distance from normal patterns
- **Anomaly**: Unusual edge fails, center clusters, random patterns
- **Value**: Detect lithography, contamination issues early

**Parametric Drift:**
- **Input**: Daily average Vdd/Idd per lot over 6 months
- **Method**: ARIMA residuals, CUSUM charts
- **Alert**: When forecast error exceeds 3σ for 3 consecutive days
- **Value**: Early process issue detection, prevent yield loss
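CUSUM is listed as a method above; a minimal two-sided tabular CUSUM might look like the sketch below. The target mean and sigma are assumed to be known from a historical baseline, and the slack `k = 0.5` and decision interval `h = 5` (both in sigma units) are conventional starting values rather than tuned settings. The simulated daily Vdd series with a 1.5σ shift is made up.

```python
import numpy as np

def cusum_alarms(values, target, sigma, k=0.5, h=5.0):
    """Two-sided tabular CUSUM; returns indices where an alarm fires."""
    s_hi = s_lo = 0.0
    alarms = []
    for i, x in enumerate(values):
        z = (x - target) / sigma
        s_hi = max(0.0, s_hi + z - k)   # accumulates upward drift
        s_lo = max(0.0, s_lo - z - k)   # accumulates downward drift
        if s_hi > h or s_lo > h:
            alarms.append(i)
            s_hi = s_lo = 0.0           # restart after signalling
    return alarms

# Assumed data: 90 in-control days, then 30 days with a +1.5 sigma mean shift
rng = np.random.default_rng(21)
daily_vdd = np.concatenate([
    rng.normal(1.050, 0.002, 90),
    rng.normal(1.053, 0.002, 30),
])

alarms = cusum_alarms(daily_vdd, target=1.050, sigma=0.002)
print("Alarm days:", alarms)
```

Because CUSUM accumulates small deviations, it tends to pick up sustained shifts that never cross a 3σ limit on any single day.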

**Equipment Health:**
- **Input**: Tester sensor data (temp, power, vibration) hourly
- **Method**: Autoencoder on sensor time series, threshold reconstruction error
- **Predict**: Equipment failure 24-48 hours in advance
- **Value**: Preventive maintenance, reduce downtime 15-25%

### Advanced Topics (Not Covered)

- **GANs for Anomaly Detection**: Train generator to produce normal data, flag what it can't generate
- **Variational Autoencoders (VAE)**: Probabilistic autoencoder with better generalization
- **Local Outlier Factor (LOF)**: Density-based anomaly detection
- **DBSCAN**: Clustering-based outlier detection
- **Time Series Discord**: Finding most unusual subsequence in time series
- **Anomaly Detection in Graphs**: Detecting anomalous nodes/edges in networks

### Tool Ecosystem

**Python:**
- **scikit-learn**: Isolation Forest, One-Class SVM, Elliptic Envelope, LOF
- **PyOD**: Python Outlier Detection library (20+ algorithms)
- **statsmodels**: Time series methods (ARIMA, STL)
- **Prophet**: Facebook's forecasting library; its uncertainty intervals can be used to flag anomalies
- **TensorFlow/PyTorch**: Autoencoders, GANs for deep learning methods

**R:**
- **anomalize**: Time series anomaly detection (tidyverse-style)
- **AnomalyDetection**: Twitter's seasonal anomaly detection (Seasonal Hybrid ESD)
- **outliers**: Statistical tests (Grubbs, Dixon, etc.)

**Commercial:**
- **AWS SageMaker**: Built-in anomaly detection algorithms (Random Cut Forest)
- **Azure Anomaly Detector**: Time series anomaly API
- **Datadog**: Infrastructure monitoring with anomaly alerts

### Best Practices

1. **Start Simple**: Z-score/IQR before complex ML (often sufficient)
2. **Validate Assumptions**: Check distribution before using Z-score
3. **Domain Knowledge**: Use expert thresholds when available (spec limits, safety margins)
4. **Adaptive Thresholds**: Update thresholds as data evolves (rolling window)
5. **Human-in-Loop**: Present top anomalies for expert review, capture feedback
6. **Monitor Performance**: Track false positive rate, lead time, business impact
7. **Explainability**: For detected anomalies, show which features drove the score
8. **Ensemble Methods**: Combine multiple detectors (vote, average scores)
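For the ensemble practice, one simple option is to average ranks rather than raw scores, since detectors report scores on different scales. The sketch below combines Isolation Forest and Local Outlier Factor this way; the synthetic data, the five hand-placed outliers, and the equal weighting are assumptions.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Assumed data: 500 typical points plus 5 hand-placed, widely separated outliers
normal = rng.normal(0, 1, size=(500, 3))
outliers = np.array([[7, 7, 7], [-7, 6, -8], [8, -7, 6], [-6, -8, -7], [7, -6, -8]],
                    dtype=float)
X = StandardScaler().fit_transform(np.vstack([normal, outliers]))

# Convert both detectors to "higher = more anomalous"
iso_scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_scores = -lof.negative_outlier_factor_

# Rank-average so neither detector's scale dominates; result lies in (0, 1]
ensemble = (rankdata(iso_scores) + rankdata(lof_scores)) / (2 * len(X))
top5 = np.argsort(ensemble)[-5:]
print("Top 5 ensemble anomalies (row indices):", sorted(top5.tolist()))
```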

### Next Steps
- **Notebook 051**: Autoencoders (deep learning for anomaly detection)
- **Notebook 131**: MLOps (deploying anomaly detection systems)
- **Advanced**: Graph anomaly detection, causal anomaly detection, explainable AI for outliers

---

**Remember**: *"One person's noise is another person's signal!"* 🔍

## 🎯 Key Takeaways

**When to Use**: High-stakes monitoring (fraud, manufacturing defects, network intrusion), unlabeled data, real-time detection  
**Limitations**: High false positives, threshold tuning, concept drift over time  
**Alternatives**: Supervised classification (if labels available), rule-based systems, statistical process control  
**Best Practices**: Ensemble methods (Isolation Forest + LOF), validate with domain experts, adaptive thresholds, explainability (SHAP)  

## 🔍 Diagnostic & Mastery

**Post-Silicon**: Detect parametric test outliers (80 features), wafer spatial anomalies, ATE tester health drift → save $2.8M-$10.5M/year

✅ Master Isolation Forest, LOF, Autoencoders, One-Class SVM  
✅ Apply to semiconductor test outlier detection and equipment monitoring

**Next Steps**: 160_Multi_Variate_Anomaly_Detection, 036_Isolation_Forest
