# Hidden Markov Model for Anomaly Detection

## Core Concept
A hidden Markov model (HMM) assumes:
- Hidden states (e.g., Normal/Anomalous) that we can't observe directly
- Observable values (e.g., sensor readings) whose distribution depends on the current hidden state
- Transitions between hidden states follow fixed probabilities that depend only on the current state

For anomaly detection: **Low probability sequences = Anomalies**
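
To make "low probability sequence" concrete, here is a minimal sketch of the forward algorithm, which computes the log-likelihood of a whole sequence under a Gaussian HMM. It uses only the standard library, and the start probabilities, transition matrix, means, and standard deviations are illustrative assumptions, not values fitted in this notebook.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_log_likelihood(obs, start, trans, means, stds):
    """Forward algorithm in log space: log P(obs) under a Gaussian HMM."""
    n_states = len(start)
    # Initialisation: start probability times emission of the first observation
    alpha = [math.log(start[s]) + gaussian_logpdf(obs[0], means[s], stds[s])
             for s in range(n_states)]
    # Recursion: sum over previous states, then emit the next observation
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[p] + math.log(trans[p][s]) for p in range(n_states)])
            + gaussian_logpdf(x, means[s], stds[s])
            for s in range(n_states)
        ]
    # Termination: sum over the final state
    return logsumexp(alpha)

# Illustrative parameters (assumptions, not fitted values)
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
means = [-0.5, 0.5]
stds = [1.0, 1.0]

typical = [0.1, -0.3, 0.4, 0.0, -0.2]
spiky = [0.1, -0.3, 5.0, 5.2, -0.2]

ll_typical = sequence_log_likelihood(typical, start, trans, means, stds)
ll_spiky = sequence_log_likelihood(spiky, start, trans, means, stds)
print(f"typical: {ll_typical:.2f}, spiky: {ll_spiky:.2f}")
```

The sequence containing the two spikes receives a much lower log-likelihood than the one that stays near zero, which is the signal the rest of the notebook thresholds on.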

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn import hmm

np.random.seed(42)

## Step 1: Generate Simple Data
- Normal behavior: values around 0
- Anomalous behavior: sudden spikes

In [None]:
# Normal data: Gaussian noise with mean 0 and standard deviation 1
normal_data = np.random.normal(0, 1, 200)

# Inject anomalies: sudden spikes
anomaly_indices = [50, 51, 120, 121, 122]
data = normal_data.copy()
data[anomaly_indices] = np.random.normal(5, 1, len(anomaly_indices))

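# Train on the first 150 clean points; test on the full series with injected anomalies
# (hmmlearn expects a 2D array of shape (n_samples, n_features))
X_train = normal_data[:150].reshape(-1, 1)
X_test = data.reshape(-1, 1)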

plt.figure(figsize=(12, 4))
plt.plot(data, label='Observations')
plt.scatter(anomaly_indices, data[anomaly_indices], color='red', s=100, label='True Anomalies')
plt.legend()
plt.title('Time Series with Anomalies')
plt.show()

## Step 2: Train HMM on Normal Data
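We use a Gaussian HMM with 2 hidden states. Because the model is trained only on clean data, both states describe normal behavior; anomalies show up later as observations that neither state explains well.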

In [None]:
# Create and train HMM
# random_state makes the EM initialization reproducible across runs
model = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=100, random_state=42)
model.fit(X_train)

print("Model trained on normal data")
print(f"Hidden state means: {model.means_.flatten()}")
# The square root of each variance is a standard deviation
print(f"Hidden state standard deviations: {np.sqrt(model.covars_.flatten())}")

## Step 3: Detect Anomalies
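Compute the log-likelihood of each point under the trained model. Points with low likelihood are flagged as anomalies. Each point is scored on its own here, so the score reflects how well the learned states explain that value rather than the transitions between points.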

In [None]:
# Score each observation on its own (log-likelihood of a length-1 sequence)
log_likelihood = np.array([model.score(X_test[i:i+1]) for i in range(len(X_test))])

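# Set threshold at the 5th percentile of the test scores; points below it are flagged,
# so roughly 5% of the test points are flagged regardless of how many are truly anomalous
threshold = np.percentile(log_likelihood, 5)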
anomalies = log_likelihood < threshold

print(f"Threshold: {threshold:.2f}")
print(f"Detected {anomalies.sum()} anomalies")
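
A percentile taken over the test scores fixes the fraction of flagged points in advance. One alternative, sketched here under stated assumptions, is to calibrate the threshold on scores from known-normal data and then apply it to new data. The scorer is a stand-in standard-normal log-density rather than the notebook's HMM, and `lower_percentile`, `calibration`, `new_data`, and the fixed spike value of 6.0 are hypothetical names and values chosen for illustration.

```python
import math
import random

def gaussian_logpdf(x, mean=0.0, std=1.0):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def lower_percentile(values, pct):
    """Value at the given lower percentile (nearest rank, no interpolation)."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

rng = random.Random(42)

# Stand-in data; in the notebook the scores would come from the trained HMM
calibration = [rng.gauss(0, 1) for _ in range(500)]  # known-normal points
new_data = [rng.gauss(0, 1) for _ in range(200)]
spike_indices = [50, 51, 120, 121, 122]
for i in spike_indices:
    new_data[i] = 6.0  # fixed spike value, assumed for a deterministic demo

calibration_scores = [gaussian_logpdf(x) for x in calibration]
new_scores = [gaussian_logpdf(x) for x in new_data]

# The threshold comes from normal data only, so the flag count is not fixed in advance
threshold = lower_percentile(calibration_scores, 1)
flagged = [i for i, s in enumerate(new_scores) if s < threshold]
print(f"Threshold: {threshold:.2f}, flagged {len(flagged)} of {len(new_data)} points")
```

With this design the number of flagged points depends on the data: a clean series yields few flags, while a series full of spikes yields many.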

## Step 4: Visualize Results

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Plot 1: Original data with detected anomalies
ax1.plot(data, label='Observations', alpha=0.7)
ax1.scatter(np.where(anomalies)[0], data[anomalies], 
            color='red', s=100, label='Detected Anomalies', zorder=5)
ax1.scatter(anomaly_indices, data[anomaly_indices], 
            color='orange', s=50, marker='x', label='True Anomalies', zorder=6)
ax1.set_title('Anomaly Detection Results')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Log-likelihood scores
ax2.plot(log_likelihood, label='Log-Likelihood', color='blue')
ax2.axhline(threshold, color='red', linestyle='--', label='Threshold')
ax2.fill_between(range(len(log_likelihood)), log_likelihood, threshold, 
                  where=anomalies, alpha=0.3, color='red', label='Anomaly Region')
ax2.set_title('Anomaly Scores (Log-Likelihood)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 5: Evaluation

In [None]:
# Create ground truth labels
true_labels = np.zeros(len(data), dtype=bool)
true_labels[anomaly_indices] = True

# Calculate metrics
true_positives = np.sum(anomalies & true_labels)
false_positives = np.sum(anomalies & ~true_labels)
false_negatives = np.sum(~anomalies & true_labels)

precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("\n=== Performance ===")
print(f"True Positives: {true_positives}")
print(f"False Positives: {false_positives}")
print(f"False Negatives: {false_negatives}")
print(f"\nPrecision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
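
When reading these metrics, it helps to know the best score a fixed-percentile threshold can reach. A 5th-percentile cut over 200 scores flags about 10 points, while only 5 are true anomalies. The sketch below works through that best case, assuming exactly 10 flags that include all 5 true anomalies. `precision_recall_f1` and the five extra flagged indices are hypothetical, chosen for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 from two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

n_points = 200
true_indices = {50, 51, 120, 121, 122}
# Best case for 10 flags: the 5 true anomalies plus 5 normal points (assumed indices)
flagged_indices = true_indices | {0, 1, 2, 3, 4}

actual = [i in true_indices for i in range(n_points)]
predicted = [i in flagged_indices for i in range(n_points)]

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# Precision: 0.50, Recall: 1.00, F1: 0.67
```

Even a perfect ranking cannot exceed a precision of 0.50 in this setup, so a low precision here reflects the threshold choice as much as the model.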

## Key Takeaways

1. **HMM learns normal patterns** from training data
2. **Low probability = Anomaly**: Points that don't fit the learned pattern
3. **Threshold matters**: A higher percentile flags more points (higher recall, lower precision); a lower one flags fewer
4. **Works well for sequential data** where temporal patterns matter
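
To let temporal context enter the score, one option is to score short windows instead of single points. The following is a minimal sketch: `window_scores` is a hypothetical helper, and `iid_gaussian_score` is a stand-in scorer that ignores ordering. With the trained model from this notebook, a scorer such as `lambda w: model.score(np.asarray(w).reshape(-1, 1))` could be passed instead, which is an assumption about how you would wire it up and not something done above.

```python
import math

def window_scores(values, width, score_fn):
    """Score every contiguous window of `width` values with `score_fn`."""
    return [score_fn(values[i:i + width]) for i in range(len(values) - width + 1)]

def iid_gaussian_score(window, mean=0.0, std=1.0):
    """Stand-in scorer: summed Gaussian log-density (ignores ordering)."""
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)
        for x in window
    )

# Small illustrative series with two adjacent spikes at indices 4 and 5
series = [0.1, -0.2, 0.3, 0.0, 5.1, 4.9, 0.2, -0.1, 0.1, 0.0]
scores = window_scores(series, width=3, score_fn=iid_gaussian_score)

# The lowest-scoring window is one that contains both spikes
worst_start = min(range(len(scores)), key=scores.__getitem__)
print(f"Lowest-scoring window starts at index {worst_start}: "
      f"{series[worst_start:worst_start + 3]}")
```

Wider windows smooth out isolated noise but blur the exact position of an anomaly, so the width is a tuning choice.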

## When to Use HMM for Anomaly Detection
- ✅ Time series with temporal dependencies
- ✅ When normal behavior has distinct states
- ✅ Sensor data, system logs, user behavior sequences
- ❌ Independent observations (a simpler method such as a z-score threshold is usually enough)
- ❌ Very high-dimensional data (each state's full covariance matrix adds too many parameters)
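
For the independent-observations case, a simpler baseline could look like the sketch below: estimate the mean and standard deviation from known-normal values, then flag new values whose z-score exceeds a cutoff. The function names, the sample values, and the cutoff of 3 are assumptions chosen for illustration.

```python
import statistics

def fit_zscore(reference):
    """Estimate mean and sample standard deviation from known-normal values."""
    return statistics.mean(reference), statistics.stdev(reference)

def flag_outliers(values, mean, std, z_max=3.0):
    """Indices of values whose absolute z-score exceeds z_max."""
    return [i for i, x in enumerate(values) if abs(x - mean) / std > z_max]

# Illustrative data: a clean reference sample and new values with two spikes
reference = [0.2, -0.5, 1.1, -0.3, 0.4, -1.2, 0.7, 0.0, -0.8, 0.6]
new_values = [0.3, -0.4, 5.2, 0.1, 4.8, -0.6]

mean, std = fit_zscore(reference)
flagged = flag_outliers(new_values, mean, std)
print(flagged)  # [2, 4]
```

If a baseline like this already separates the anomalies, the extra states and transition parameters of an HMM may not be needed.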