[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wasim/Data-Science/blob/main/data-analyst-roadmap/08_data_cleaning_projects/03_outlier_detection.ipynb)

# Outlier Detection

Find and handle anomalies in your data.

## Why handle outliers?
- Can skew statistical measures such as the mean and standard deviation.
- Can negatively impact machine learning models.
- **Note:** Sometimes outliers are the most interesting part of the data (for example, fraud detection)!
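
To make the first point concrete, here is a minimal sketch on a made-up six-value sample (the numbers are hypothetical, chosen only for illustration). It compares the mean, standard deviation, and median before and after adding one extreme value.

```python
import numpy as np

# Hypothetical sample: five typical values plus one extreme value.
clean = np.array([48.0, 49.0, 50.0, 51.0, 52.0])
with_outlier = np.append(clean, 500.0)

# The mean and standard deviation shift substantially; the median barely moves.
print("Mean:  ", clean.mean(), "->", with_outlier.mean())
print("Std:   ", clean.std(), "->", with_outlier.std())
print("Median:", np.median(clean), "->", np.median(with_outlier))
```

The median moves from 50.0 to 50.5 while the mean jumps from 50.0 to 125.0, which is why median-based and quartile-based measures are called robust.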

## Methods
1. **Box Plot / IQR:** Robust statistical method.
2. **Z-Score:** For approximately normally distributed data.
3. **Isolation Forest:** Machine Learning approach.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Generate data with outliers
np.random.seed(42)
data = np.random.normal(50, 5, 100)
outliers = np.array([10, 100, 110]) # Extreme values
data = np.concatenate([data, outliers])

df = pd.DataFrame({'Value': data})

sns.boxplot(x=df['Value'])
plt.title('Box Plot showing outliers')
plt.show()

## 1. IQR Method (Interquartile Range)
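A common default for most datasets: flag any value more than 1.5 × IQR below Q1 or above Q3. Because quartiles are barely affected by extreme values, the method does not require normally distributed data.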

In [None]:
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Bounds: {lower_bound:.2f} to {upper_bound:.2f}")

# Identify outliers
outliers_iqr = df[
    (df['Value'] < lower_bound) | 
    (df['Value'] > upper_bound)
]
print("Outliers using IQR:")
# Select the column explicitly so the output stays correct if more columns are added later
print(outliers_iqr['Value'].values)
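
The same steps can be wrapped in a small reusable helper. This is only a sketch: the function name `iqr_bounds`, its multiplier parameter `k`, and the sample values are assumptions made for illustration, not part of pandas.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obvious extreme value.
sample = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
lower, upper = iqr_bounds(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print(f"Bounds: {lower:.2f} to {upper:.2f}")
print("Flagged:", flagged.tolist())
```

Raising `k` (for example to 3) widens the fences so that only more extreme values are flagged.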

## 2. Z-Score Method
Assumes the data is approximately normally distributed. Values more than 3 standard deviations from the mean are flagged as outliers.

In [None]:
df['Z_Score'] = np.abs(stats.zscore(df['Value']))

# Threshold = 3
outliers_z = df[df['Z_Score'] > 3]

print("Outliers using Z-Score > 3:")
print(outliers_z['Value'].values)
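
For reference, the Z-score can also be computed by hand. The sketch below uses a hypothetical six-value sample and checks the manual result against `stats.zscore`, which uses the population standard deviation (`ddof=0`) by default, whereas pandas' `Series.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample: five typical values plus one extreme value.
values = pd.Series([48.0, 50.0, 52.0, 49.0, 51.0, 90.0])

# Z-score by hand: distance from the mean in units of standard deviation.
# ddof=0 matches the scipy.stats.zscore default.
z_manual = (values - values.mean()) / values.std(ddof=0)
z_scipy = stats.zscore(values)

print("Manual matches SciPy:", np.allclose(z_manual, z_scipy))
print("Largest |z|:", z_manual.abs().max())
```

In this tiny sample the value 90 inflates the standard deviation enough that its own |z| stays below 3, so the threshold would miss it. That is one reason the Z-score rule is less reliable on small samples than the IQR method.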

## 3. Isolation Forest (Machine Learning)
Good for high-dimensional data or complex patterns.

In [None]:
from sklearn.ensemble import IsolationForest

# We expect ~3% outliers; random_state makes the result reproducible
iso = IsolationForest(contamination=0.03, random_state=42)
df['Anomaly'] = iso.fit_predict(df[['Value']])

# -1 is Outlier, 1 is Normal
outliers_iso = df[df['Anomaly'] == -1]

print("Outliers using Isolation Forest:")
print(outliers_iso['Value'].values)
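
Since the method is aimed at multi-feature data, here is a minimal sketch with two features. The column names `f1` and `f2`, the injected points, and the `contamination` value are all assumptions made for illustration. It also uses `decision_function` to rank rows, where lower scores mean more anomalous.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical two-feature data: 200 typical points plus 4 injected anomalies.
normal_points = rng.normal(loc=[50, 100], scale=[5, 10], size=(200, 2))
injected = np.array([[10, 100], [50, 200], [90, 20], [100, 180]])
X = pd.DataFrame(np.vstack([normal_points, injected]), columns=["f1", "f2"])

features = X[["f1", "f2"]]
model = IsolationForest(contamination=0.02, random_state=42)
X["Anomaly"] = model.fit_predict(features)      # -1 = outlier, 1 = normal
X["Score"] = model.decision_function(features)  # lower = more anomalous

# Show the five most anomalous rows.
print(X.sort_values("Score").head(5))
```

Sorting by score is often more useful than the hard -1/1 label, because the label depends entirely on the `contamination` guess.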

## 4. Handling Outliers

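Once outliers are identified, what should you do with them?
1. **Remove:** If the value is an error or irrelevant to the analysis.
2. **Cap (Winsorize):** Set values beyond the bounds to the upper or lower bound.
3. **Transform:** Apply a log transform to reduce their impact.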

In [None]:
# Capping example: clip values to the IQR bounds computed in section 1
df_capped = df.copy()
df_capped['Value'] = df_capped['Value'].clip(lower=lower_bound, upper=upper_bound)

print("Max value before cap:", df['Value'].max())
print("Max value after cap:", df_capped['Value'].max())
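
The other two options, removal and log transform, could be sketched as follows. The sample values are hypothetical, and `np.log1p` is just one possible transform (it assumes values greater than -1).

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (all values positive).
s = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 250.0], name="Value")

# IQR fences, computed as in section 1.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Remove: keep only the rows inside the fences (between() is inclusive).
s_removed = s[s.between(lower, upper)]

# Transform: log1p compresses large values while preserving their order.
s_log = np.log1p(s)

print("Rows before/after removal:", len(s), len(s_removed))
print("Max before/after log:", s.max(), round(s_log.max(), 3))
```

Removal discards information, while the transform keeps every row but changes the scale, so any results need to be interpreted on the log scale.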

## Practice Exercise
Find outliers in the 'fare' column of the Titanic dataset.

In [None]:
# Load titanic
# Plot Boxplot of 'fare'
# Calculate IQR bounds
# Count number of outliers
# Your code here

## Key Takeaways

✅ **IQR** - Best general-purpose method.
✅ **Visuals** - Always look at Box Plots and Histograms.
✅ **Context** - Never delete blindly. Ask "Why is this here?"

**Next:** [Text Cleaning](04_text_cleaning.ipynb) →