### Anomaly Detection with Isolation Forest

**Goal:** To detect unusual patterns (anomalies) in synthetic time-series data using the Isolation Forest algorithm.

**What is Anomaly Detection?**
Anomaly detection, also known as outlier detection, is the process of identifying data points that deviate significantly from the majority of the data. These "anomalies" can indicate critical events, errors, or interesting insights. In time-series data, an anomaly might be an unusually high sensor reading, a sudden drop in website traffic, or a burst of network activity.

**Why Isolation Forest?**
Isolation Forest is an unsupervised machine learning algorithm particularly well-suited for anomaly detection. Its core idea is that anomalies are "few and different" and thus easier to isolate than normal data points. It does this by randomly partitioning data and observing how many splits it takes to isolate a data point. Anomalies typically require fewer splits to be isolated.
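
The "fewer splits" idea can be illustrated with a minimal one-dimensional sketch. This is not how scikit-learn implements the algorithm; the helper `splits_to_isolate` and the toy data are assumptions made only for illustration.

```python
import numpy as np

def splits_to_isolate(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    current = values
    n_splits = 0
    while len(current) > 1:
        lo, hi = current.min(), current.max()
        if lo == hi:  # remaining points are identical; cannot split further
            break
        threshold = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < threshold:
            current = current[current < threshold]
        else:
            current = current[current >= threshold]
        n_splits += 1
    return n_splits

rng = np.random.default_rng(0)
# 200 typical points plus one hypothetical outlier at 8.0
values = np.append(rng.normal(0, 1, 200), 8.0)
inlier = values[0]
outlier = values[-1]

# Average over many random partitionings, as a forest averages over trees
avg_inlier = np.mean([splits_to_isolate(values, inlier, rng) for _ in range(200)])
avg_outlier = np.mean([splits_to_isolate(values, outlier, rng) for _ in range(200)])
print(f"Average splits to isolate inlier:  {avg_inlier:.1f}")
print(f"Average splits to isolate outlier: {avg_outlier:.1f}")
```

The outlier should need noticeably fewer splits on average, which is the signal Isolation Forest turns into an anomaly score.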

In [None]:
# Cell 1: Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import random # For injecting anomalies

**Explanation for Cell 1:**
* `numpy` (as `np`): Fundamental package for numerical computation in Python, especially for arrays.
* `pandas` (as `pd`): Used for data manipulation and analysis, especially with DataFrames.
* `matplotlib.pyplot` (as `plt`): For creating static, interactive, and animated visualizations in Python.
* `IsolationForest` from `sklearn.ensemble`: The core machine learning model we'll use for anomaly detection.
* `StandardScaler` from `sklearn.preprocessing`: Used to standardize features by removing the mean and scaling to unit variance. While Isolation Forest is less sensitive to scaling than some other algorithms, it's good practice, especially if you were to compare with other models later.
* `random`: Will be used to inject random anomalies into our synthetic data.

In [None]:
# Cell 2: Generate Synthetic Time-Series Data

# Set a random seed for reproducibility
np.random.seed(42)
random.seed(42)

# --- Generate Normal Data ---
# Number of data points
n_samples = 1000

# Create an hourly time index
# (lowercase 'h' is used because pandas 2.2+ deprecates the uppercase 'H' alias)
dates = pd.date_range(start='2023-01-01', periods=n_samples, freq='h')

# Simulate a "normal" sensor reading with a slight trend and seasonality
# Base value with a cyclical fluctuation (about 8 cycles over the whole series, not a daily cycle)
normal_data = np.sin(np.linspace(0, 50, n_samples)) * 5 + np.linspace(0, 20, n_samples) + np.random.normal(0, 1, n_samples)

# Create a DataFrame
df = pd.DataFrame({'timestamp': dates, 'value': normal_data})

# --- Inject Anomalies ---
# Inject some sudden spikes (point anomalies)
num_spikes = 10
spike_indices = random.sample(range(n_samples), num_spikes)
for idx in spike_indices:
    df.loc[idx, 'value'] += np.random.uniform(20, 50) # Add a large random value

# Inject some sudden drops (point anomalies)
num_drops = 5
drop_indices = random.sample(range(n_samples), num_drops)
for idx in drop_indices:
    df.loc[idx, 'value'] -= np.random.uniform(20, 50) # Subtract a large random value

# Inject a short "contextual anomaly" (a sustained unusual period)
# Note: label-based .loc slicing includes both endpoints, so this block covers 31 rows
contextual_start = 500
contextual_end = 530
df.loc[contextual_start:contextual_end, 'value'] = np.random.normal(70, 5, contextual_end - contextual_start + 1)

print("Synthetic data generated with normal patterns and injected anomalies.")
print(df.head())
print("\nDataFrame Info:")
df.info()

**Explanation for Cell 2:**
* **Reproducibility:** `np.random.seed()` and `random.seed()` ensure that every time you run this notebook, you get the exact same "random" data, which is crucial for debugging and comparing results.
* **Normal Data Generation:**
    * We create a `pd.date_range` to simulate 1,000 hourly sensor readings (about 42 days).
    * The `normal_data` is generated using a combination of:
        * `np.sin()`: To introduce a cyclical (seasonal) pattern.
        * `np.linspace()`: To create a gradual increasing trend.
        * `np.random.normal()`: To add realistic noise.
    * This combination gives us a baseline time-series that looks somewhat natural.
* **Anomaly Injection:**
    * **Spikes/Drops:** We randomly select indices and add/subtract large values to simulate sudden, isolated anomalous events. These are **point anomalies**.
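    * **Contextual Anomaly:** We select a continuous block of time (indices 500 through 530, 31 points) and replace the data with a sustained higher reading of about 70. Because the normal series stays roughly between -5 and 25, these points are also extreme in absolute terms; a sustained run of unusual values like this is often called a *collective* anomaly. The notebook keeps the label **contextual anomaly** for this block.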
* **DataFrame Creation:** The generated data is stored in a Pandas DataFrame, which is a standard way to handle tabular data in Python.
* **`df.head()` and `df.info()`:** These are used to inspect the first few rows of the data and get a summary of its structure (data types, non-null counts).
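
Since the injection indices are known, they can be recorded as labels for later checks. The following is a minimal sketch; the column name `is_injected`, the frame `df_demo`, and the short index lists are hypothetical stand-ins for the values produced in Cell 2.

```python
import numpy as np
import pandas as pd

n_samples = 1000
df_demo = pd.DataFrame({"value": np.zeros(n_samples)})

# Hypothetical index lists standing in for spike_indices / drop_indices
spike_indices = [12, 250, 731]
drop_indices = [88, 904]
contextual_start, contextual_end = 500, 530

# Mark every row that received an injected anomaly
df_demo["is_injected"] = False
df_demo.loc[spike_indices + drop_indices, "is_injected"] = True
# Label-based .loc slicing includes both endpoints, so this marks 31 rows
df_demo.loc[contextual_start:contextual_end, "is_injected"] = True

n_injected = int(df_demo["is_injected"].sum())
print(n_injected)  # 3 spikes + 2 drops + 31 block rows = 36
```

Keeping such a column alongside the data makes it possible to compare detections with the injected points instead of judging them by eye.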

In [None]:
# Cell 3: Visualize the Raw Time-Series Data

plt.figure(figsize=(15, 7))
plt.plot(df['timestamp'], df['value'], label='Sensor Value', alpha=0.8)
plt.title('Synthetic Time-Series Data with Injected Anomalies')
plt.xlabel('Time')
plt.ylabel('Value')
plt.grid(True)
plt.legend()
plt.show()

print("Notice the sudden spikes, drops, and the sustained unusual period. These are our injected anomalies.")

**Explanation for Cell 3:**
* This cell simply plots the `value` column against the `timestamp` to visually inspect the data.
* It's important to visualize your data to understand its patterns and to confirm that the anomalies you injected (or expect to find) are actually visible. This builds intuition for what the detector should flag.

In [None]:
# Cell 4: Prepare Data for Isolation Forest

# scikit-learn's Isolation Forest expects numerical features.
# We will focus on the 'value' column for anomaly detection.
# If you had multiple features (e.g., temperature, pressure, humidity),
# you would select all of them here.
data_for_model = df[['value']]

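# Optional: Scale the data. Isolation Forest splits each feature at a random value between
# its minimum and maximum, so rescaling a feature has essentially no effect on the result.
# Scaling is kept here for consistency with pipelines that include scale-sensitive models.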
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_model)

print(f"Original data shape: {data_for_model.shape}")
print(f"Scaled data shape: {scaled_data.shape}")
print("Data prepared for Isolation Forest.")

**Explanation for Cell 4:**
* **Feature Selection:** We select only the `'value'` column as our feature for anomaly detection. In a real-world scenario, you might have multiple sensor readings or metrics that collectively define "normal" behavior, and you would pass all of them to the model.
* **`StandardScaler`:**
    * `fit_transform()`: This method calculates the mean and standard deviation of the `data_for_model` (`fit`) and then applies the scaling transformation (`transform`). This results in data with a mean of 0 and a standard deviation of 1, which can make optimization easier for some algorithms and prevent features with larger scales from dominating.
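
To check how much scaling matters for Isolation Forest itself, one can fit the same configuration on raw and scaled versions of the data and compare labels. This is a rough sketch on hypothetical toy data; small differences can still arise from floating-point rounding, so it reports the share of matching labels rather than assuming a perfect match.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical single-feature data: a main cluster plus a small far-away group
X = np.concatenate([rng.normal(10, 3, 500), rng.normal(60, 2, 15)]).reshape(-1, 1)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and random_state for both fits
pred_raw = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=42
).fit_predict(X)
pred_scaled = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=42
).fit_predict(X_scaled)

agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Share of points with identical labels: {agreement:.3f}")
```

If the share is close to 1, scaling is a matter of pipeline consistency here rather than detection quality.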

In [None]:
# Cell 5: Train the Isolation Forest Model

# Initialize the Isolation Forest model
# key parameters:
# n_estimators: The number of trees in the forest. More trees generally lead to more robust results.
# contamination: The expected proportion of outliers in the data. This is an important parameter
#                as it defines the threshold for anomaly scores. If 'auto', the threshold is
#                determined as in the original paper. We're setting it explicitly based on our
#                injected anomalies (10 spikes + 5 drops + 31 contextual points = at most
#                46/1000 = 0.046; fewer if randomly chosen indices coincide or fall in the block).
#                Let's use 0.05 (5%) as a reasonable estimate.
# random_state: For reproducibility of the model's training process.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

# Fit the model to the scaled data
# The fit method 'learns' the normal behavior from the data.
model.fit(scaled_data)

print("Isolation Forest model trained successfully.")

**Explanation for Cell 5:**
* **`IsolationForest` Initialization:**
    * `n_estimators=100`: We're building 100 "isolation trees." More trees generally improve accuracy but increase computation time.
    * `contamination=0.05`: This is a crucial hyperparameter. It's our *estimate* of the proportion of anomalies in the dataset. Isolation Forest uses it to set the decision threshold: roughly the `contamination` fraction of training points with the lowest scores (the most anomalous) are labeled as outliers. If you don't know the contamination, you can leave it as `'auto'`, which applies the threshold from the original paper, or experiment with different values.
    * `random_state=42`: Again, for reproducibility of the model's internal random processes.
* **`model.fit(scaled_data)`:** This builds the isolation trees. Note that the model is fitted on the full dataset, injected anomalies included; it never sees a clean "normal" set, but relies on anomalies being rare and easy to isolate.
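
The effect of `contamination` on the number of flagged points can be seen by fitting the same data with several values. This is a small sketch on hypothetical Gaussian data, not the notebook's series; the name `flagged_share` is introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(1000, 1))  # hypothetical data with no injected anomalies

flagged_share = {}
for contamination in (0.01, 0.05, 0.10):
    demo_model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    labels = demo_model.fit_predict(X)  # -1 for flagged points, 1 otherwise
    flagged_share[contamination] = float(np.mean(labels == -1))
    print(f"contamination={contamination:.2f} -> flagged share={flagged_share[contamination]:.3f}")
```

The flagged share tracks the requested value even though this data contains no injected anomalies, which is why `contamination` should be treated as an assumption about the data rather than something the model discovers.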

In [None]:
# Cell 6: Predict Anomalies and Get Anomaly Scores

# `predict` method returns -1 for anomalies and 1 for normal points
df['anomaly_prediction'] = model.predict(scaled_data)

# `decision_function` returns the anomaly scores, shifted so the threshold sits at 0.
# Lower scores indicate a higher likelihood of being an anomaly; negative scores are flagged.
df['anomaly_score'] = model.decision_function(scaled_data)

# Convert predictions to a more intuitive label
# -1 -> Anomaly, 1 -> Normal
df['is_anomaly'] = df['anomaly_prediction'].apply(lambda x: 'Anomaly' if x == -1 else 'Normal')

print("Anomaly predictions and scores generated.")
print(df.head())
print("\nAnomaly counts:")
print(df['is_anomaly'].value_counts())

**Explanation for Cell 6:**
* **`model.predict(scaled_data)`:** This method applies the trained Isolation Forest to your data and returns a prediction for each data point:
    * `-1`: Indicates an anomaly.
    * `1`: Indicates a normal data point.
* **`model.decision_function(scaled_data)`:** This method returns the anomaly score for each data point, shifted so that the decision threshold sits at 0 (the unshifted scores are available from `score_samples`).
    * **Interpretation:** A *lower* (more negative) score means a higher likelihood of being an anomaly, and points with a score below 0 are the ones `predict` labels `-1`. Data points that are easier to isolate (requiring fewer splits) will have lower scores.
* **`df['is_anomaly']`:** We create a more human-readable column by mapping the `-1` and `1` predictions to 'Anomaly' and 'Normal' strings.
* **`value_counts()`:** Shows how many data points were classified as normal and how many as anomalous, based on the `contamination` parameter set during model initialization.
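
The relationship between `predict`, `decision_function`, and `score_samples` can be verified directly. The sketch below uses hypothetical toy data and a separate model named `forest`; `offset_` is the fitted attribute scikit-learn uses as the threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Hypothetical data: 500 typical points plus two obvious outliers
X = np.concatenate([rng.normal(0, 1, 500), [9.0, -8.5]]).reshape(-1, 1)

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

raw = forest.score_samples(X)          # unshifted scores, lower = more anomalous
shifted = forest.decision_function(X)  # raw scores minus the fitted threshold
labels = forest.predict(X)             # -1 for anomalies, 1 for normal points

scores_match = bool(np.allclose(shifted, raw - forest.offset_))
labels_match = bool(np.array_equal(labels == -1, shifted < 0))
print(scores_match, labels_match)
```

Both checks should hold, which makes it clear that `predict` is just a sign test on `decision_function`.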

In [None]:
# Cell 7: Visualize Detected Anomalies

plt.figure(figsize=(18, 8))

# Plot normal data points
normal_points = df[df['is_anomaly'] == 'Normal']
plt.plot(normal_points['timestamp'], normal_points['value'], 'b.', markersize=8, label='Normal Data', alpha=0.6)

# Plot anomalies
anomaly_points = df[df['is_anomaly'] == 'Anomaly']
plt.plot(anomaly_points['timestamp'], anomaly_points['value'], 'ro', markersize=6, label='Detected Anomaly', alpha=0.9)

plt.title('Time-Series Anomaly Detection with Isolation Forest')
plt.xlabel('Time')
plt.ylabel('Value')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

print("The red circles indicate data points detected as anomalies.")

**Explanation for Cell 7:**
* This cell visualizes the results of our anomaly detection.
* It separates the DataFrame into two parts: `normal_points` and `anomaly_points` based on the `is_anomaly` column.
* Normal points are plotted as blue dots, and detected anomalies are plotted as distinct red circles. This visual representation makes it very clear where the anomalies were detected and how they align with the injected anomalies.
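
Alignment with the injected anomalies can also be quantified rather than judged visually. The sketch below is self-contained and regenerates a simplified version of the series (spikes and the sustained block only), so its numbers will not match the notebook's; the names `truth` and `flagged` are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)
n = 1000
values = np.sin(np.linspace(0, 50, n)) * 5 + np.linspace(0, 20, n) + rng.normal(0, 1, n)

# Inject anomalies and remember where they are
truth = np.zeros(n, dtype=bool)
spikes = rng.choice(n, size=10, replace=False)
values[spikes] += rng.uniform(20, 50, size=10)
truth[spikes] = True
values[500:531] = rng.normal(70, 5, 31)  # sustained block of 31 points
truth[500:531] = True

pred = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=42
).fit_predict(values.reshape(-1, 1))
flagged = pred == -1

# Precision: share of flagged points that were injected
# Recall: share of injected points that were flagged
precision = precision_score(truth.astype(int), flagged.astype(int))
recall = recall_score(truth.astype(int), flagged.astype(int))
print(f"flagged={flagged.sum()}  precision={precision:.2f}  recall={recall:.2f}")
```

Because the model sees only the raw value and not the time index, spikes that stay inside the overall value range may be missed; the recall figure makes that visible.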

In [None]:
# Cell 8: Explore Anomaly Scores (Optional but Recommended)

plt.figure(figsize=(15, 6))
plt.hist(df['anomaly_score'], bins=50, density=True, alpha=0.7, color='c')
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Density')
plt.grid(True)
plt.show()

# You can also look at the most anomalous points
print("\nTop 10 most anomalous points (lowest scores):")
print(df.sort_values(by='anomaly_score').head(10))

print("\nInterpretation of scores: Lower (more negative) scores indicate higher likelihood of being an anomaly.")
print("The 'contamination' parameter in IsolationForest sets the threshold internally.")

**Explanation for Cell 8:**
* **Histogram of Scores:** This plot shows the distribution of anomaly scores. You'll typically see a cluster of scores around a higher value (for normal points) and a tail extending to lower (more negative) values where the anomalies lie.
* **Top Anomalous Points:** By sorting the DataFrame by `anomaly_score` in ascending order, you can easily identify the data points that the model considers most anomalous. This is useful for further investigation.
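
When the built-in threshold is not a good fit, the scores can be thresholded manually. This is a sketch under assumptions: the review budget `k`, the 2% quantile, and the toy data are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
# Hypothetical data: 990 typical points and 10 values far from the bulk
X = np.concatenate([rng.normal(0, 1, 990), rng.uniform(6, 10, 10)]).reshape(-1, 1)

forest = IsolationForest(n_estimators=100, contamination="auto", random_state=42).fit(X)
scores = forest.score_samples(X)  # lower means more anomalous

# Option 1: review a fixed number of the most anomalous points
k = 10  # hypothetical review budget
top_k_idx = np.argsort(scores)[:k]

# Option 2: flag everything below a chosen score quantile
threshold = np.quantile(scores, 0.02)
below_threshold = scores < threshold

print("Indices of the k lowest scores:", top_k_idx)
print("Points below the 2% quantile:", int(below_threshold.sum()))
```

Working from `score_samples` keeps the ranking independent of `contamination`, so the cutoff can be tuned after fitting.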

### Short README/Summary

This Jupyter Notebook demonstrates a simple **Anomaly Detection** task using the **Isolation Forest** algorithm.

**1. Data Generation:**
  * Synthetic time-series data is created with a gentle trend and seasonality to mimic real-world sensor data.
  * Specific anomalies (sudden spikes, drops, and a sustained unusual period) are deliberately injected into this data to serve as ground truth for our detection.

**2. Data Preparation:**
  * The `value` column (representing our time-series measurement) is selected as the feature for the model.
  * The data is optionally scaled using `StandardScaler`. With a single feature this has essentially no effect on Isolation Forest, which is robust to unscaled data; the step matters mainly for scale-sensitive models you might compare against.

**3. Model Training:**
  * An `IsolationForest` model from `scikit-learn` is initialized.
  * Key parameters like `n_estimators` (number of trees) and `contamination` (estimated proportion of anomalies) are set.
  * The model is trained on the prepared data. Being an unsupervised algorithm, it learns patterns of "normality" without needing explicit anomaly labels.

**4. Anomaly Prediction & Scoring:**
  * The trained model's `predict` method labels each data point as an anomaly (`-1`) or normal (`1`).
  * Its `decision_function` method provides scores, where lower (more negative) scores indicate a higher likelihood of being an anomaly.

**5. Visualization:**
  * The original time-series data is plotted, with detected anomalies highlighted in a distinct color (red circles), making the results visually intuitive.
  * A histogram of anomaly scores is provided to understand the distribution of anomaly likelihoods.

**How Isolation Forest Works:**
  
Isolation Forest builds multiple "isolation trees" by randomly selecting a feature and then a random split value for that feature. This process is repeated recursively. Anomalies, being "different" from the majority, tend to be isolated closer to the root of these trees (requiring fewer splits), resulting in shorter "path lengths." Normal data points, being more densely packed, require more splits to be isolated, leading to longer path lengths. The anomaly score is based on these path lengths: shorter paths indicate higher anomaly likelihood.
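
The mapping from path length to score can be sketched using the normalization from the original Isolation Forest paper, where the mean path length is divided by c(n), the average path length of an unsuccessful binary search tree lookup over n points. The function names below are assumptions for illustration, and the harmonic number is approximated with a logarithm. Note that scikit-learn's `score_samples` reports the negative of this quantity, so that lower means more anomalous.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # convention used for two points
    harmonic = np.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Paper-style score in (0, 1]: near 1 is anomalous, well below 0.5 is normal."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # sub-sample size per tree used by default for larger datasets
c_n = average_path_length(n)
print(f"c({n}) = {c_n:.2f}")
for h in (2.0, c_n, 14.0):
    print(f"mean path length {h:5.2f} -> score {anomaly_score(h, n):.2f}")
```

A point whose mean path length equals c(n) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.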