# Part 6: Anomaly Detection Methods

Apply anomaly detection algorithms to OCSF embeddings.

**What you'll learn:**
1. Distance-based anomaly detection (k-NN)
2. Density-based detection (Local Outlier Factor)
3. Tree-based detection (Isolation Forest)
4. Evaluating detection performance
5. Comparing methods on a labeled evaluation subset

**Prerequisites:**
- Embeddings from [04-self-supervised-training.ipynb](04-self-supervised-training.ipynb)
- Labeled evaluation subset (optional; needed only for the method comparison in Section 5)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# For nicer plots
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

## 1. Load Embeddings and Labels

Load the embeddings from training and the labeled evaluation subset.

In [None]:
# Load embeddings
embeddings = np.load('../data/embeddings.npy')
print(f"Embeddings shape: {embeddings.shape}")

# Load labeled evaluation subset (if available)
try:
    eval_df = pd.read_parquet('../data/ocsf_eval_subset.parquet')
    print(f"Evaluation subset: {len(eval_df)} events")
    print(f"Anomaly rate: {eval_df['is_anomaly'].mean():.2%}")
    has_labels = True
except FileNotFoundError:
    print("No labeled evaluation subset found. Will use unsupervised evaluation.")
    has_labels = False

## 2. k-NN Distance-Based Detection

**Idea**: Anomalies are far from their nearest neighbors.

For each point:
1. Find k nearest neighbors
2. Compute average distance to neighbors
3. High average distance = likely anomaly

In [None]:
def detect_anomalies_knn_distance(embeddings, k=20, contamination=0.05):
    """
    Detect anomalies using k-NN average distance.
    
    Args:
        embeddings: (N, d) array of embeddings
        k: Number of neighbors
        contamination: Expected anomaly proportion
    
    Returns:
        predictions: 1 for anomaly, 0 for normal
        scores: Average cosine distance to k neighbors (higher = more anomalous)
        threshold: Score cutoff at the (1 - contamination) percentile
    """
    # Fit k-NN model
    nn = NearestNeighbors(n_neighbors=k+1, metric='cosine')  # +1 because point is its own neighbor
    nn.fit(embeddings)
    
    # Get distances to k nearest neighbors
    distances, _ = nn.kneighbors(embeddings)
    
    # Average distance (excluding self)
    scores = distances[:, 1:].mean(axis=1)
    
    # Threshold at percentile
    threshold = np.percentile(scores, 100 * (1 - contamination))
    predictions = (scores > threshold).astype(int)
    
    return predictions, scores, threshold

# Run k-NN detection
knn_preds, knn_scores, knn_threshold = detect_anomalies_knn_distance(
    embeddings, k=20, contamination=0.05
)

print(f"k-NN Distance Detection:")
print(f"  Threshold: {knn_threshold:.4f}")
print(f"  Anomalies detected: {knn_preds.sum()} ({knn_preds.mean():.2%})")
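
To make the percentile threshold concrete, here is a small sketch on made-up scores. The `toy_*` names are illustrative and not part of the pipeline. With `contamination=0.05` the cutoff is the 95th percentile, so roughly the top 5% of scores are flagged. Because the comparison is a strict `>`, scores tied with the threshold are not flagged.

```python
import numpy as np

# Toy scores 1..100 stand in for k-NN distance scores (illustrative only)
toy_scores = np.arange(1, 101, dtype=float)
toy_contamination = 0.05

# Same rule as detect_anomalies_knn_distance: cut at the (1 - contamination) percentile
toy_threshold = np.percentile(toy_scores, 100 * (1 - toy_contamination))
toy_flags = (toy_scores > toy_threshold).astype(int)

print(f"Threshold: {toy_threshold:.2f}")
print(f"Flagged: {toy_flags.sum()} of {len(toy_scores)}")
```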

In [None]:
# Plot score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of scores
axes[0].hist(knn_scores, bins=50, edgecolor='black', alpha=0.7)
axes[0].axvline(knn_threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold: {knn_threshold:.4f}')
axes[0].set_xlabel('Average k-NN Distance')
axes[0].set_ylabel('Count')
axes[0].set_title('k-NN Distance Score Distribution')
axes[0].legend()

# Sorted scores
sorted_scores = np.sort(knn_scores)[::-1]
axes[1].plot(sorted_scores, linewidth=1)
axes[1].axhline(knn_threshold, color='red', linestyle='--', linewidth=2, label='Threshold')
axes[1].set_xlabel('Rank')
axes[1].set_ylabel('k-NN Distance Score')
axes[1].set_title('Sorted Anomaly Scores')
axes[1].legend()

plt.tight_layout()
plt.show()

## 3. Local Outlier Factor (LOF)

**Idea**: Anomalies are in regions of lower density than their neighbors.

LOF compares each point's local density with that of its neighbors, so it adapts to regions of varying density: a point far from any cluster can still be scored as normal if its neighbors are similarly sparse.
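
The difference from a global distance score is easiest to see on synthetic data. The sketch below is an assumption-laden toy (2-D points, Euclidean distance, hand-picked cluster widths, hypothetical `toy_*` names), not the notebook's embeddings. It places a probe point a short absolute distance from a very tight cluster and compares how the two scores rank it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)

# One tight cluster, one loose cluster, and a probe just outside the tight one
toy_dense = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(200, 2))
toy_sparse = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(200, 2))
toy_probe = np.array([[0.4, 0.4]])
toy_X = np.vstack([toy_dense, toy_sparse, toy_probe])
probe_idx = len(toy_X) - 1

k = 20

# Global view: mean distance to the k nearest neighbors (excluding self)
toy_dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(toy_X).kneighbors(toy_X)
toy_knn_scores = toy_dist[:, 1:].mean(axis=1)

# Local view: LOF, sign-flipped so higher = more anomalous
toy_lof = LocalOutlierFactor(n_neighbors=k).fit(toy_X)
toy_lof_scores = -toy_lof.negative_outlier_factor_

def rank_of(scores, idx):
    """1-based rank of scores[idx], where rank 1 is the most anomalous."""
    return int((scores > scores[idx]).sum()) + 1

print(f"Probe rank by k-NN distance: {rank_of(toy_knn_scores, probe_idx)}")
print(f"Probe rank by LOF:           {rank_of(toy_lof_scores, probe_idx)}")
```

In a setup like this, the probe's absolute distance to its neighbors is smaller than that of points on the fringe of the loose cluster, so the k-NN score tends to rank it lower, while LOF ranks it near the top because its neighbors are far denser than it is.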

In [None]:
def detect_anomalies_lof(embeddings, n_neighbors=20, contamination=0.05):
    """
    Detect anomalies using Local Outlier Factor.
    
    Args:
        embeddings: (N, d) array of embeddings
        n_neighbors: Number of neighbors for density estimation
        contamination: Expected anomaly proportion
    
    Returns:
        predictions: 1 for anomaly, 0 for normal
        scores: LOF value, sign-flipped from negative_outlier_factor_ (higher = more anomalous)
    """
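    # Note: LOF uses scikit-learn's default Euclidean metric here, whereas the k-NN detector above uses cosine
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)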
    lof_predictions = lof.fit_predict(embeddings)
    
    # Convert: LOF returns -1 for anomalies, 1 for normal
    predictions = (lof_predictions == -1).astype(int)
    
    # Scores (negative_outlier_factor_ is more negative for anomalies)
    scores = -lof.negative_outlier_factor_  # Flip so higher = more anomalous
    
    return predictions, scores

# Run LOF detection
lof_preds, lof_scores = detect_anomalies_lof(embeddings, n_neighbors=20, contamination=0.05)

print(f"Local Outlier Factor (LOF) Detection:")
print(f"  Anomalies detected: {lof_preds.sum()} ({lof_preds.mean():.2%})")

## 4. Isolation Forest

**Idea**: Anomalies are easier to "isolate" with random splits.

Build random trees that recursively split data. Anomalies require fewer splits to isolate (shorter path length).
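
As a quick sanity check of the idea and of the score sign convention, the following sketch fits an Isolation Forest on synthetic Gaussian data with one obvious outlier appended. The data and the `toy_*` names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 Gaussian inliers in 8 dimensions plus one point far away in every dimension
toy_inliers = rng.normal(size=(500, 8))
toy_outlier = np.full((1, 8), 6.0)
toy_X = np.vstack([toy_inliers, toy_outlier])

toy_iso = IsolationForest(n_estimators=100, random_state=42).fit(toy_X)

# score_samples is negative, and lower means more abnormal, so flip the sign
toy_raw = toy_iso.score_samples(toy_X)
toy_iso_scores = -toy_raw

print(f"All raw scores negative: {(toy_raw < 0).all()}")
print(f"Index of highest anomaly score: {int(np.argmax(toy_iso_scores))}")
print(f"Index of appended outlier:      {len(toy_X) - 1}")
```

The appended point sits outside the range of the inliers on every feature, so random splits separate it after very few cuts and it should receive the highest flipped score.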

In [None]:
def detect_anomalies_isolation_forest(embeddings, contamination=0.05, n_estimators=100):
    """
    Detect anomalies using Isolation Forest.
    
    Args:
        embeddings: (N, d) array of embeddings
        contamination: Expected anomaly proportion
        n_estimators: Number of trees
    
    Returns:
        predictions: 1 for anomaly, 0 for normal
        scores: Anomaly score (higher = more anomalous)
    """
    iso = IsolationForest(contamination=contamination, n_estimators=n_estimators, random_state=42)
    iso_predictions = iso.fit_predict(embeddings)
    
    # Convert: Isolation Forest returns -1 for anomalies, 1 for normal
    predictions = (iso_predictions == -1).astype(int)
    
    # Scores (score_samples returns negative values, more negative = more anomalous)
    scores = -iso.score_samples(embeddings)  # Flip so higher = more anomalous
    
    return predictions, scores

# Run Isolation Forest detection
iso_preds, iso_scores = detect_anomalies_isolation_forest(embeddings, contamination=0.05)

print(f"Isolation Forest Detection:")
print(f"  Anomalies detected: {iso_preds.sum()} ({iso_preds.mean():.2%})")

## 5. Compare Detection Methods

If we have labeled evaluation data, we can compare method performance.

In [None]:
def evaluate_detector(true_labels, predictions, scores, method_name):
    """
    Evaluate detection performance.
    """
    precision = precision_score(true_labels, predictions, zero_division=0)
    recall = recall_score(true_labels, predictions, zero_division=0)
    f1 = f1_score(true_labels, predictions, zero_division=0)
    
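    # ROC AUC is undefined when true_labels contains only one class;
    # report NaN rather than a misleading 0.0 in that case
    try:
        auc = roc_auc_score(true_labels, scores)
    except ValueError:
        auc = float('nan')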
    
    return {
        'Method': method_name,
        'Precision': precision,
        'Recall': recall,
        'F1': f1,
        'AUC': auc
    }

# If we have labels, evaluate
if has_labels:
    # Caution: this assumes row i of eval_df corresponds to row i of embeddings.
    # If the evaluation subset was sampled or reordered, match indices or
    # generate embeddings for the evaluation subset before trusting these metrics.
    print("Using labeled evaluation subset for comparison...")
    
    # Truncate to the shorter of the two so labels and predictions have equal length
    n_eval = min(len(eval_df), len(embeddings))
    
    # Use the eval_df labels (only meaningful if the row order matches)
    if 'is_anomaly' in eval_df.columns:
        true_labels = eval_df['is_anomaly'].values[:n_eval]
        
        # Evaluate each method (on first n_eval samples)
        results = []
        results.append(evaluate_detector(true_labels, knn_preds[:n_eval], knn_scores[:n_eval], 'k-NN Distance'))
        results.append(evaluate_detector(true_labels, lof_preds[:n_eval], lof_scores[:n_eval], 'LOF'))
        results.append(evaluate_detector(true_labels, iso_preds[:n_eval], iso_scores[:n_eval], 'Isolation Forest'))
        
        results_df = pd.DataFrame(results)
        print("\nMethod Comparison:")
        print(results_df.to_string(index=False))
else:
    print("No labels available. Showing unsupervised comparison.")
    
    # Compare agreement between methods
    print(f"\nMethod Agreement:")
    print(f"  k-NN & LOF agree:      {(knn_preds == lof_preds).mean():.2%}")
    print(f"  k-NN & IsoForest agree: {(knn_preds == iso_preds).mean():.2%}")
    print(f"  LOF & IsoForest agree:  {(lof_preds == iso_preds).mean():.2%}")
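
The comparison above relies on the labeled rows lining up with the first rows of the embedding array. If the evaluation subset instead records where each event sits in the full dataset, the labels can be matched explicitly. The sketch below assumes a hypothetical `embedding_idx` column holding each event's row position in `embeddings`; that column name is an assumption for illustration and may not exist in your evaluation file.

```python
import numpy as np
import pandas as pd

def align_labels_to_scores(eval_df, scores, index_col="embedding_idx", label_col="is_anomaly"):
    """Return (labels, scores) restricted to the labeled rows.

    index_col is a HYPOTHETICAL column giving each labeled event's row
    position in the embeddings array that the scores were computed from.
    """
    positions = eval_df[index_col].to_numpy()
    if positions.min() < 0 or positions.max() >= len(scores):
        raise IndexError("evaluation positions fall outside the score array")
    labels = eval_df[label_col].to_numpy().astype(int)
    return labels, np.asarray(scores)[positions]

# Toy demonstration: three labeled events scattered through five scored events
toy_all_scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
toy_eval = pd.DataFrame({"embedding_idx": [3, 0, 1], "is_anomaly": [1, 0, 1]})

toy_labels, toy_eval_scores = align_labels_to_scores(toy_eval, toy_all_scores)
print(toy_labels, toy_eval_scores)
```

The same positions can be used to slice `knn_preds`, `lof_preds`, and `iso_preds` before passing them to `evaluate_detector`.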

## 6. Visualize Anomalies

Plot anomaly scores and highlight detected anomalies.

In [None]:
# Compare score distributions across methods
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

methods = [
    ('k-NN Distance', knn_scores, knn_preds),
    ('LOF', lof_scores, lof_preds),
    ('Isolation Forest', iso_scores, iso_preds)
]

for ax, (name, scores, preds) in zip(axes, methods):
    # Plot normal vs anomaly score distributions
    normal_scores = scores[preds == 0]
    anomaly_scores = scores[preds == 1]
    
    ax.hist(normal_scores, bins=30, alpha=0.7, label='Normal', color='blue')
    ax.hist(anomaly_scores, bins=30, alpha=0.7, label='Anomaly', color='red')
    ax.set_xlabel('Anomaly Score')
    ax.set_ylabel('Count')
    ax.set_title(name)
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Ensemble: combine multiple methods
def ensemble_detection(predictions_list, threshold=2):
    """
    Ensemble detection: flag as anomaly if >= threshold methods agree.
    """
    votes = np.sum(predictions_list, axis=0)
    return (votes >= threshold).astype(int)

# Combine all three methods
ensemble_preds = ensemble_detection([knn_preds, lof_preds, iso_preds], threshold=2)

print(f"Ensemble Detection (2/3 agreement):")
print(f"  Anomalies detected: {ensemble_preds.sum()} ({ensemble_preds.mean():.2%})")

# Overlap counts between methods (a text summary, not a plotted Venn diagram)
print(f"\nMethod Overlap:")
print(f"  Only k-NN:          {((knn_preds == 1) & (lof_preds == 0) & (iso_preds == 0)).sum()}")
print(f"  Only LOF:           {((knn_preds == 0) & (lof_preds == 1) & (iso_preds == 0)).sum()}")
print(f"  Only IsoForest:     {((knn_preds == 0) & (lof_preds == 0) & (iso_preds == 1)).sum()}")
print(f"  All three agree:    {((knn_preds == 1) & (lof_preds == 1) & (iso_preds == 1)).sum()}")
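
Voting discards how strongly each method scored a point. One alternative, sketched here as an option rather than as part of the pipeline above, is to combine the raw scores. Since the three scores live on different scales, this sketch converts each to a percentile rank before averaging. The function name `rank_average_scores` and the toy arrays are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average_scores(score_list):
    """Average per-method percentile ranks; higher = more anomalous."""
    n = len(score_list[0])
    # rankdata assigns 1 to the smallest score and n to the largest (ties share the mean rank)
    ranks = [rankdata(scores) / n for scores in score_list]
    return np.mean(ranks, axis=0)

# Two toy score vectors on very different scales
toy_a = np.array([0.1, 0.2, 0.9, 0.3])
toy_b = np.array([10.0, 30.0, 50.0, 20.0])

toy_combined = rank_average_scores([toy_a, toy_b])
print(toy_combined)
```

With the notebook's variables this would be called as `rank_average_scores([knn_scores, lof_scores, iso_scores])`, and the result can be thresholded with the same percentile rule used for the k-NN detector.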

## 7. Inspect Top Anomalies

Look at the events with highest anomaly scores.

In [None]:
# Load original data for inspection
df = pd.read_parquet('../data/ocsf_logs.parquet')

# Add anomaly scores
df = df.iloc[:len(knn_scores)].copy()  # Match lengths
df['knn_score'] = knn_scores[:len(df)]
df['lof_score'] = lof_scores[:len(df)]
df['iso_score'] = iso_scores[:len(df)]
df['ensemble_anomaly'] = ensemble_preds[:len(df)]

# Top anomalies by k-NN score
print("Top 10 Anomalies by k-NN Distance Score:")
top_cols = ['message', 'service', 'level', 'knn_score', 'lof_score', 'iso_score']
top_cols = [c for c in top_cols if c in df.columns]
display_df = df.nlargest(10, 'knn_score')[top_cols]
print(display_df.to_string())

## 8. Save Results

Save anomaly predictions for further analysis.

In [None]:
# Save anomaly predictions
results = pd.DataFrame({
    'knn_score': knn_scores,
    'knn_anomaly': knn_preds,
    'lof_score': lof_scores,
    'lof_anomaly': lof_preds,
    'iso_score': iso_scores,
    'iso_anomaly': iso_preds,
    'ensemble_anomaly': ensemble_preds
})

results.to_parquet('../data/anomaly_predictions.parquet')
print("Saved anomaly predictions to ../data/anomaly_predictions.parquet")

# Summary statistics
print("\nSummary:")
print(results.describe())

## Summary

In this notebook, we:

1. **k-NN Distance**: Detected anomalies based on average distance to neighbors
2. **LOF**: Used local density comparison for adaptive detection
3. **Isolation Forest**: Leveraged tree-based isolation for anomaly scoring
4. **Ensemble**: Combined methods for robust detection
5. **Evaluated**: Compared methods on labeled subset (if available)

**Key insights**:
- Different methods catch different types of anomalies
- Ensemble voting (2/3 agreement) can reduce false positives, at the cost of missing anomalies that only one method catches
- LOF adapts to varying local densities in the data
- k-NN distance is simple but effective for global outliers

**Production considerations**:
- Use an approximate nearest-neighbor library or vector database (e.g., FAISS, Milvus) for efficient k-NN at scale
- Tune `contamination` based on expected anomaly rate
- Monitor detection performance over time (concept drift)
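
For the monitoring point, one simple option is to compare the distribution of recent anomaly scores against a reference window with a two-sample Kolmogorov-Smirnov test. This is a minimal sketch under assumptions: the helper name `score_drift`, the `alpha` cutoff, and the synthetic windows are illustrative choices, and a real deployment would need its own windowing and alerting policy.

```python
import numpy as np
from scipy.stats import ks_2samp

def score_drift(reference_scores, recent_scores, alpha=0.01):
    """Flag drift when the two score samples differ significantly (KS test)."""
    stat, p_value = ks_2samp(reference_scores, recent_scores)
    return {
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drifted": bool(p_value < alpha),
    }

# Synthetic windows: one drawn from the reference distribution, one shifted
rng = np.random.default_rng(0)
toy_reference = rng.normal(0.0, 1.0, size=2000)
toy_shifted = rng.normal(1.0, 1.0, size=2000)

print(score_drift(toy_reference, toy_reference))
print(score_drift(toy_reference, toy_shifted))
```

A drift flag on the scores does not say whether the data or the model changed; it is only a prompt to re-inspect recent events and consider refitting the detectors.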