```{contents}
```

# Anomaly Detection


* **Anomaly detection** means identifying **outliers** in data — points that deviate significantly from normal patterns.
* Outliers are crucial in some problems (e.g., fraud detection, security breaches, disease detection) but may be irrelevant in others.
* **Examples**:

  * Bank login from unusual locations
  * An unusually high number of runs scored in a single IPL over
  * Rare disease detection in healthcare datasets

---

## Importance of Outliers

* Outliers indicate unique or abnormal events in a dataset.
* Detecting anomalies is often **unsupervised**, as labels for anomalies are usually not available.
* Outliers may represent critical events (fraud, disease) or noise, depending on the context.

---

## Isolation Forest Concept

* **Isolation Forest** is an **unsupervised anomaly detection algorithm**.
* Uses **isolation trees** (similar to decision trees) to separate individual points:

  * Outliers are **isolated faster**, requiring fewer splits.
  * Normal points require more splits to isolate.
* The **anomaly score** is calculated using the formula:

$$
s(x, m) = 2^{-\frac{E(h(x))}{c(m)}}
$$

Where:

* $h(x)$ = path length (number of splits) needed to isolate point $x$ in a single tree

* $E(h(x))$ = average of $h(x)$ over all trees in the forest

* $c(m)$ = expected path length for a sample of size $m$, used to normalize $E(h(x))$

* Points with scores close to 1 → likely anomalies; scores well below 0.5 → likely normal points.

* A threshold (e.g., 0.5) is set to classify points as outliers.
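
The score formula can be checked numerically with a small sketch. The source does not spell out $c(m)$; the version below assumes the usual normalization from the Isolation Forest paper, $c(m) = 2H(m-1) - \frac{2(m-1)}{m}$ with $H(i) \approx \ln(i) + 0.5772$, and the function names are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c(m: int) -> float:
    """Expected path length for a sample of size m (assumed normalization)."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    harmonic = math.log(m - 1) + EULER_GAMMA  # approximation of H(m - 1)
    return 2.0 * harmonic - 2.0 * (m - 1) / m


def anomaly_score(mean_path_length: float, m: int) -> float:
    """s(x, m) = 2 ** (-E(h(x)) / c(m))."""
    return 2.0 ** (-mean_path_length / c(m))


m = 256
print(anomaly_score(2.0, m))    # short average path: score moves toward 1
print(anomaly_score(c(m), m))   # average path equal to c(m): score is 0.5
print(anomaly_score(20.0, m))   # long average path: score moves toward 0
```

A point whose average path length equals $c(m)$ scores exactly 0.5, which is why 0.5 is a natural reference value for the threshold.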

---

## How Isolation Forest Works

1. Randomly select a feature, then pick a random split value between its min and max.
2. Recursively create nodes until each point is isolated in a leaf.
3. Points that are isolated in **shorter paths** → anomalies.
4. Multiple isolation trees are used for robustness.
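
The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the full algorithm: it does not subsample the data or cap the tree height the way library implementations do, and `path_length` and `mean_path_length` are hypothetical helper names.

```python
import numpy as np


def path_length(x, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until x is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(data.shape[1])            # step 1: random feature
    lo, hi = data[:, feature].min(), data[:, feature].max()
    if lo == hi:                                     # cannot split further here
        return depth
    split = rng.uniform(lo, hi)                      # step 1: random split value
    # Step 2: keep only the side of the split that contains x and recurse.
    if x[feature] < split:
        subset = data[data[:, feature] < split]
    else:
        subset = data[data[:, feature] >= split]
    return path_length(x, subset, rng, depth + 1, max_depth)


def mean_path_length(x, data, n_trees=100, seed=0):
    """Step 4: average the path length over many random trees."""
    rng = np.random.default_rng(seed)
    return float(np.mean([path_length(x, data, rng) for _ in range(n_trees)]))


# Synthetic data (an assumption for illustration): one cluster plus a far point.
rng = np.random.default_rng(42)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
data = np.vstack([cluster, [[8.0, 8.0]]])

inlier_depth = mean_path_length(data[0], data)
outlier_depth = mean_path_length(data[-1], data)
print(inlier_depth, outlier_depth)  # step 3: the far point needs fewer splits
```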

---

## Practical Example

* A healthcare dataset with 2 features (indicating disease) was used.

* Steps:

  1. Load dataset.
  2. Fit **Isolation Forest** (`contamination` parameter defines proportion of expected anomalies).
  3. Predict anomalies (`1` = normal, `-1` = outlier).
  4. Visualize outliers on a scatter plot (outliers highlighted in red).

* Results: Outliers were clearly separated from normal points, demonstrating Isolation Forest’s effectiveness.
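
A minimal sketch of these steps with scikit-learn's `IsolationForest`. The healthcare dataset itself is not included in these notes, so the code generates synthetic two-feature data as a stand-in; the `contamination` value and the `plot_outliers` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Step 1: "load" data. Synthetic stand-in for the two-feature dataset.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
unusual = rng.uniform(low=6.0, high=9.0, size=(6, 2))
X = np.vstack([normal, unusual])

# Step 2: fit; contamination is the expected proportion of anomalies.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)

# Step 3: predict; 1 = normal, -1 = outlier.
labels = model.fit_predict(X)
scores = model.decision_function(X)  # lower values = more anomalous

print(f"{(labels == -1).sum()} of {len(X)} points flagged as outliers")


def plot_outliers(X, labels, path="outliers.png"):
    """Step 4: scatter plot with outliers in red (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display needed
    import matplotlib.pyplot as plt

    is_outlier = labels == -1
    plt.scatter(X[~is_outlier, 0], X[~is_outlier, 1], c="tab:blue", label="normal")
    plt.scatter(X[is_outlier, 0], X[is_outlier, 1], c="red", label="outlier")
    plt.legend()
    plt.savefig(path)
    plt.close()
```

Calling `plot_outliers(X, labels)` writes the scatter plot to `outliers.png`.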

---

## Key Points

* Anomaly detection is usually framed as an **unsupervised** problem, because labeled anomalies are rarely available.
* Isolation Forest isolates data points rather than clustering.
* Outliers are detected based on how quickly they can be separated from the rest of the data.
* Useful for fraud detection, cybersecurity, healthcare, and other domains with rare events.


## Statistical / Classical Methods

* **Z-score / Standard Deviation**
  Detect points that deviate from the mean by more than n standard deviations (commonly n = 3).

* **Modified Z-score**
  More robust to outliers using median and MAD (Median Absolute Deviation).

* **Grubbs’ Test / Dixon’s Q Test**
  Statistical tests for single outliers.

* **Boxplot / IQR Method**
  Points below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`.
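
Three of these rules are short enough to sketch directly in NumPy. The thresholds (3 standard deviations, 3.5 for the modified Z-score with the conventional 0.6745 scaling, and 1.5 × IQR) are common defaults rather than values taken from these notes, and the sample values are made up.

```python
import numpy as np


def zscore_outliers(x, n_std=3.0):
    """Flag points more than n_std standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > n_std


def modified_zscore_outliers(x, threshold=3.5):
    """Flag points using the median and MAD (assumes MAD is non-zero)."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold


def iqr_outliers(x, k=1.5):
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


values = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 45.0])

print(np.flatnonzero(zscore_outliers(values)))           # nothing flagged
print(np.flatnonzero(modified_zscore_outliers(values)))  # index 7 (the 45.0)
print(np.flatnonzero(iqr_outliers(values)))              # index 7 (the 45.0)
```

The plain Z-score misses the obvious outlier here because the outlier itself inflates the mean and standard deviation in such a small sample. The median-based rules avoid this, which is the robustness the modified Z-score entry refers to.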

---

## Distance-based Methods

* **k-Nearest Neighbors (kNN) for anomaly detection**
  Anomalies have large distances to nearest neighbors.

* **Local Outlier Factor (LOF)**
  Measures how isolated a point is compared to its neighbors.

* **Mahalanobis Distance**
  Measures distance considering correlation between features.
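
A short sketch of two of these ideas on synthetic data, assuming scikit-learn is available. The data, the `n_neighbors` value, and the variable names are illustrative choices. The planted point has ordinary values on each feature separately but breaks the correlation between them, which is the case Mahalanobis distance is designed for.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two strongly correlated features plus one point that breaks the correlation.
rng = np.random.default_rng(1)
cov = [[1.0, 0.9], [0.9, 1.0]]
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200)
X = np.vstack([X, [[2.5, -2.5]]])  # planted anomaly at index 200

# Mahalanobis distance: sqrt((x - mu)^T Sigma^-1 (x - mu)) for every row.
mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Local Outlier Factor: compares each point's local density to its neighbors'.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)             # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_  # larger = more isolated

print(int(np.argmax(mahalanobis)), lof_labels[200])
```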

---

## Clustering-based Methods

* **K-Means-based anomaly detection**
  Points far from any cluster centroid are anomalies.

* **DBSCAN**
  Points labeled as noise (`-1`) are anomalies.

* **Hierarchical clustering**
  Small isolated clusters or singleton points can be anomalies.
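
The DBSCAN and K-Means variants can be sketched with scikit-learn as follows. The synthetic blobs, `eps`, `min_samples`, and the 99th-percentile cutoff are assumptions for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight blobs plus two stray points (indices 200 and 201).
rng = np.random.default_rng(2)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
stray = np.array([[2.5, 2.5], [8.0, 0.0]])
X = np.vstack([blob_a, blob_b, stray])

# DBSCAN: points that belong to no dense region get the label -1 (noise).
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
dbscan_anomalies = np.flatnonzero(db_labels == -1)

# K-Means: distance to the nearest centroid serves as the anomaly score.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = kmeans.transform(X).min(axis=1)
threshold = np.percentile(dist_to_centroid, 99)
kmeans_anomalies = np.flatnonzero(dist_to_centroid > threshold)

print(dbscan_anomalies, kmeans_anomalies)
```

DBSCAN may also mark a few points on the edge of a blob as noise, so the flagged set can be slightly larger than the two planted points.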

---

## Classification / Supervised Methods

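*(Supervised classifiers require labeled data: normal vs. anomaly. One-Class SVM and Isolation Forest are listed here for comparison but need no anomaly labels.)*

* **Support Vector Machine (SVM) – One-Class SVM**
  Learns the boundary of normal points from (mostly) normal training data; points outside are anomalies.

* **Random Forest / Isolation Forest**
  Random Forest classifies anomalies when labels are available; Isolation Forest is unsupervised and detects anomalies by isolating points that need fewer splits.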

* **Gradient Boosting / XGBoost** (for anomaly classification if labeled)
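
As a small illustration of the boundary-learning idea, the sketch below fits scikit-learn's `OneClassSVM` on synthetic data that is assumed to be mostly normal, without any anomaly labels, and then scores two new points. The `nu` and `gamma` settings are illustrative defaults, not recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Training data assumed to contain (mostly) normal points only.
rng = np.random.default_rng(3)
X_train = rng.normal(0.0, 1.0, size=(300, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# One point near the training data and one far away from it.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
predictions = model.predict(X_new)  # 1 = inside the boundary, -1 = outside
print(predictions)
```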

---

## Neural Network / Deep Learning Methods

* **Autoencoders**
  Reconstruct input; large reconstruction error → anomaly.

* **Variational Autoencoders (VAE)**
  Probabilistic reconstruction; inputs with low reconstruction probability → anomaly.

* **LSTM-based Autoencoders**
  For **time series anomaly detection**.

* **Generative Adversarial Networks (GANs)**
  Identify anomalies as points the generator fails to reproduce well.

---

## Probabilistic / Density-based Methods

* **Gaussian Mixture Models (GMM)**
  Low probability points under the model are anomalies.

* **Kernel Density Estimation (KDE)**
  Points in low-density regions → anomalies.

* **Bayesian Networks**
  Probabilistic modeling to detect unusual events.
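
The density-based idea can be sketched with scikit-learn's `GaussianMixture` and `KernelDensity`. Both expose `score_samples`, which returns a log-density per point. The data, the number of components, the bandwidth, and the 1% cutoff are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

# Two dense groups plus one point in an empty region (index 300).
rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
    rng.normal([4.0, 4.0], 0.5, size=(150, 2)),
    [[2.0, 9.0]],
])

# GMM: low log-likelihood under the fitted mixture suggests an anomaly.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_log_density = gmm.score_samples(X)

# KDE: low estimated log-density suggests an anomaly.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
kde_log_density = kde.score_samples(X)

# Flag the lowest 1% of GMM log-densities (hypothetical cutoff).
cutoff = np.percentile(gmm_log_density, 1)
gmm_anomalies = np.flatnonzero(gmm_log_density < cutoff)

print(int(np.argmin(gmm_log_density)), int(np.argmin(kde_log_density)))
```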

---

## Time-Series Specific Methods

* **ARIMA / SARIMA Residuals**
  Residuals beyond thresholds → anomaly.

* **Prophet / Facebook Prophet**
  Detect deviations from predicted trends.

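* **Twitter AnomalyDetection (R / Python port)**
  Seasonal Hybrid ESD (S-H-ESD): seasonal decomposition followed by a generalized ESD test on the residuals.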
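
The residual-thresholding idea behind these methods can be shown without a forecasting library. The sketch below swaps the ARIMA or Prophet forecast for a much simpler rolling-median baseline, so it illustrates the thresholding step only, not those models. The window size, the 4-sigma rule, and the function name are assumptions.

```python
import numpy as np


def residual_anomalies(series, window=15, n_sigmas=4.0):
    """Flag points whose residual from a rolling-median baseline is large."""
    half = window // 2
    padded = np.pad(series, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    baseline = np.median(windows, axis=1)  # stand-in for a model forecast
    residuals = series - baseline
    # Robust spread estimate: 1.4826 * MAD approximates the standard deviation.
    mad = np.median(np.abs(residuals - np.median(residuals)))
    sigma = 1.4826 * mad
    return np.flatnonzero(np.abs(residuals) > n_sigmas * sigma)


# Synthetic seasonal series with one injected spike at t = 120.
t = np.arange(200)
rng = np.random.default_rng(6)
series = 10.0 + 2.0 * np.sin(2.0 * np.pi * t / 50.0) + rng.normal(0.0, 0.2, size=200)
series[120] += 5.0

flagged = residual_anomalies(series)
print(flagged)
```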

---

## Ensemble Methods

* Combine multiple anomaly detection models:

  * Isolation Forest + LOF
  * Autoencoder + Statistical threshold
  * Voting / stacking ensemble
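
One simple way to combine Isolation Forest and LOF is to convert each detector's scores to ranks and average them, so neither score scale dominates. This is only one of several possible combination rules; the `to_rank` helper and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def to_rank(scores):
    """Map scores to ranks in [0, 1]; higher = more anomalous."""
    return np.argsort(np.argsort(scores)) / (len(scores) - 1)


# One cluster plus a single far-away point (index 300).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[7.0, -7.0]]])

# Negate both scores so that larger values mean "more anomalous".
iso_scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_scores = -lof.negative_outlier_factor_

# Average the two rank vectors and list the most anomalous points.
combined = (to_rank(iso_scores) + to_rank(lof_scores)) / 2.0
top = np.argsort(combined)[::-1][:3]
print(top)
```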

````{dropdown} Click here for Sections
```{tableofcontents}
```
````