# Outliers 
## 1. What are Outliers?
- Definition: Data points that significantly differ from other observations in a dataset

- Characteristics:
  - Unusual values compared to the rest of the data
  - Can occur due to:
    - Measurement errors
    - Data entry errors
    - Natural variation in data
    - Fraudulent activity
    - Experimental errors

- Types:
  - Univariate outliers: Extreme values in a single feature/variable
  - Multivariate outliers: Unusual combinations of values across multiple features

## 2. When are Outliers Dangerous?
- Outliers become problematic when they:
  - Skew summary statistics (mean, standard deviation)
  - Bias machine learning model training
  - Reduce model accuracy and performance
  - Violate assumptions of statistical methods (normality, homoscedasticity)
  - Mask important patterns in the data
  - Cause overfitting when the model bends to fit them
- Situations where outliers matter most:
  - Small datasets
  - Algorithms sensitive to scale and distribution
  - When making predictions for typical cases
  - When model interpretability is important
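To make the first point concrete, here is a minimal sketch of how a single extreme value moves the mean and standard deviation while barely touching the median. The readings are invented for illustration and are not taken from any real dataset.

```python
import statistics

# Hypothetical sensor readings; the appended value is an illustrative outlier.
clean = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.0]
with_outlier = clean + [55.0]

def summarize(values):
    """Return (mean, sample standard deviation, median) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.stdev(values),
        statistics.median(values),
    )

mean_c, std_c, med_c = summarize(clean)
mean_o, std_o, med_o = summarize(with_outlier)

print(f"clean:        mean={mean_c:.2f} std={std_c:.2f} median={med_c:.2f}")
print(f"with outlier: mean={mean_o:.2f} std={std_o:.2f} median={med_o:.2f}")
```

The mean and standard deviation shift substantially, while the median stays close to 10. This is why median-based statistics are usually preferred when outliers are suspected.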

## 3. Effect of Outliers on ML Algorithms
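- Linear Regression: Heavily influenced, because squared error loss magnifies large residuals
- K-Means Clustering: Centroids are means, so outliers can pull them away from the true cluster centers
- PCA (Principal Component Analysis): Can skew principal components, because PCA follows the directions of maximum variance
- Tree-based algorithms (Random Forest, Decision Trees): Generally more robust, because splits depend on the ordering of feature values rather than their magnitude
- SVM (Support Vector Machines): Can affect margin and support vectors
- Neural Networks: May slow convergence or lead to poor generalization
- K-NN (K-Nearest Neighbors): An outlier among a point's nearest neighbors can sway the prediction, especially for small K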
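The sensitivity of linear regression can be sketched with a hand-rolled least-squares fit. The data and the helper name `fit_line` are hypothetical. The points lie exactly on y = 2x + 1 until one target value is corrupted.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data lying exactly on y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]

# Corrupt a single target value to simulate an outlier.
ys_outlier = ys.copy()
ys_outlier[-1] = 100  # the uncorrupted value would be 19

slope_clean, intercept_clean = fit_line(xs, ys)
slope_out, intercept_out = fit_line(xs, ys_outlier)

print(f"clean fit:    slope={slope_clean:.2f} intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_out:.2f} intercept={intercept_out:.2f}")
```

One corrupted point out of ten more than triples the fitted slope here. The effect is strongest when the outlier sits at the edge of the x range, where it has the most leverage.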

## 4. How to Treat Outliers
Options for handling outliers:

### A. Removal

- Delete outlier records (only if they're errors and dataset is large enough)

- Risk: May lose valuable information

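### B. Transformation

- Log transformation (requires positive values)
- Square root transformation (requires non-negative values)
- Box-Cox transformation (requires strictly positive values)
- Effect: Compresses large values, reducing the impact of extreme values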
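A small sketch of the log and square root transformations on invented right-skewed data. The helper `spread_ratio` is a hypothetical, deliberately crude measure (maximum divided by median) used only to show how far the largest value sits from the bulk. Box-Cox is left out because its power parameter is normally estimated from the data.

```python
import math
import statistics

# Hypothetical right-skewed, strictly positive data (e.g. incomes in thousands).
values = [20, 22, 25, 27, 30, 31, 35, 40, 45, 900]

def spread_ratio(data):
    """Maximum divided by median: how far the largest value sits from the bulk."""
    return max(data) / statistics.median(data)

log_values = [math.log(v) for v in values]    # requires v > 0
sqrt_values = [math.sqrt(v) for v in values]  # requires v >= 0

print(f"raw:  {spread_ratio(values):.2f}")
print(f"sqrt: {spread_ratio(sqrt_values):.2f}")
print(f"log:  {spread_ratio(log_values):.2f}")
```

The log transform compresses the extreme value more aggressively than the square root. For data containing zeros, `math.log1p` is a common substitute for `math.log`.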

### C. Imputation

- Replace with:
  - Mean (roughly symmetric data) or median (skewed data)
  - Mode (for categorical data)
  - Predicted values from a model
- Use winsorization (capping values at chosen percentiles)
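Winsorization and median imputation can be sketched with the standard library alone. The function names, the 5th/95th percentile defaults, and the `v > 100` flagging rule are illustrative assumptions, not recommendations.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values at the given percentiles (integers between 1 and 99)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

def impute_with_median(values, is_outlier):
    """Replace flagged values with the median of the unflagged ones."""
    kept = [v for v, flag in zip(values, is_outlier) if not flag]
    median = statistics.median(kept)
    return [median if flag else v for v, flag in zip(values, is_outlier)]

# Hypothetical data with one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 250]

capped = winsorize(data)
flags = [v > 100 for v in data]  # illustrative flagging rule only
imputed = impute_with_median(data, flags)

print(capped)
print(imputed)
```

Winsorization keeps every record and only limits how extreme it can be, whereas imputation discards the original value entirely. In a sample this small the upper percentile cap is itself pulled upward by the outlier, so the capped value remains far from the bulk.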

### D. Separate Modeling

- Build separate models for outliers vs. normal data

### E. Use Robust Algorithms

- Choose algorithms that are less sensitive to outliers, such as tree-based models (see Section 3)

### F. Binning/Discretization

- Convert continuous values to categories
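As a rough illustration of why binning blunts outliers, the sketch below maps values to labels using fixed edges. The edges, labels, and helper name `to_bin` are hypothetical.

```python
import bisect

def to_bin(value, edges, labels):
    """Map a value to a label; edges are the sorted lower bounds of every bin after the first."""
    return labels[bisect.bisect_right(edges, value)]

edges = [30, 60]                    # bins: < 30, 30 to < 60, >= 60
labels = ["low", "medium", "high"]

values = [5, 29, 30, 45, 61, 10_000]  # the last value is an extreme outlier
binned = [to_bin(v, edges, labels) for v in values]
print(binned)
```

After binning, 61 and 10,000 share the same category, so the extreme magnitude no longer reaches the model. The trade-off is that all within-bin variation is lost as well.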

## 5. How to Detect Outliers
- Visual Methods:
  - Box plots (show the median, Q1, Q3, IQR, and whiskers)
  - Scatter plots
  - Histograms
  - QQ plots (for normality checking)
- Statistical Methods:
  - Z-score method: Values more than 3 standard deviations from the mean
  - IQR method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
  - Modified Z-score: Uses the median and MAD (Median Absolute Deviation) instead of the mean and standard deviation
- ML-Based Methods:
  - Isolation Forest
  - Local Outlier Factor (LOF)
  - One-Class SVM
  - DBSCAN clustering (labels points in low-density regions as noise)
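The modified Z-score listed above is commonly written as 0.6745 × (x − median) / MAD, with 3.5 as a frequently cited cutoff. Both constants are conventions assumed here rather than stated in these notes. The sketch compares it with the classical Z-score on a small invented sample.

```python
import statistics

def modified_z_scores(values):
    """Modified Z-score based on the median and the median absolute deviation (MAD)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (v - median) / mad for v in values]

# Hypothetical small sample with one obvious outlier.
data = [10, 11, 12, 12, 13, 13, 14, 15, 90]

mod_flagged = [v for v, m in zip(data, modified_z_scores(data)) if abs(m) > 3.5]

# Classical Z-score on the same data, for comparison.
mean, std = statistics.mean(data), statistics.stdev(data)
classic_flagged = [v for v in data if abs(v - mean) / std > 3]

print("modified Z-score flags:", mod_flagged)
print("classical Z-score flags:", classic_flagged)
```

On this sample the classical rule flags nothing, because the outlier inflates the very mean and standard deviation it is measured against. With only nine points no value can reach |Z| > 3 at all. The median-based version does not have this problem.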

## 6. Techniques for Outlier Detection
### A. Univariate Methods
- Z-Score Method
  - `Z = (x - μ) / σ`
  - If `|Z| > 3` → potential outlier
- IQR Method
  - `IQR = Q3 - Q1`
  - `Lower bound = Q1 - 1.5 × IQR`
  - `Upper bound = Q3 + 1.5 × IQR`
- Percentile Method
  - Flag values below the 5th or above the 95th percentile
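The three univariate rules above can be sketched side by side using only the standard library. The function names and sample data are hypothetical. Quartiles and percentiles use `statistics.quantiles` with `method="inclusive"`, which is one of several common interpolation conventions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose |Z| exceeds the threshold, using the population σ."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def percentile_outliers(values, lower_pct=5, upper_pct=95):
    """Values below the lower or above the upper percentile."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one extreme value.
data = [48, 49, 50, 50, 51, 51, 52, 52, 53, 53,
        54, 54, 55, 55, 56, 56, 57, 58, 59, 120]

print("Z-score:   ", zscore_outliers(data))
print("IQR:       ", iqr_outliers(data))
print("Percentile:", percentile_outliers(data))
```

The Z-score and IQR rules flag only 120, while the percentile rule also flags 48. The percentile rule marks a fixed share of the data by construction, whether or not those values are unusual.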

### B. Multivariate Methods
- Mahalanobis Distance
  - Measures distance from the center of the data, accounting for correlations between features
- DBSCAN
  - Identifies core, border, and noise points
  - Noise points = potential outliers
- Isolation Forest
  - Builds random trees to isolate observations
  - Fewer splits needed to isolate outliers → higher anomaly score
- Local Outlier Factor (LOF)
  - Compares the local density of a point with that of its neighbors
  - Low density relative to neighbors = potential outlier
- One-Class SVM
  - Learns a boundary around normal data
  - Points outside boundary = outliers
- Autoencoders (Neural Networks)
  - Train on normal data
  - High reconstruction error = potential outlier
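As one worked instance of the multivariate methods above, here is a minimal two-feature Mahalanobis distance written without external libraries. The data and the helper name `mahalanobis_2d` are hypothetical. The last point lies inside the range of both features but breaks the strong correlation between them, so no univariate rule would catch it.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, expanded for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical, strongly correlated data plus one point that breaks the pattern.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (9, 9.1), (10, 9.9),
          (3, 8.0)]

distances = mahalanobis_2d(points)
most_unusual = points[distances.index(max(distances))]
print("most unusual point:", most_unusual)
```

This sketch only ranks points by distance. In practice the squared distance is often compared against a chi-square quantile with as many degrees of freedom as there are features, which assumes roughly normal data and a reasonably large sample.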

## 7. Quick Decision Guide
- When reviewing data:
  - Visualize first (box plots, scatter plots)
  - Check whether outliers are errors → remove or correct them
  - Check impact on model metrics
  - Try multiple detection methods (different methods catch different outliers)
  - Consider domain knowledge (some "outliers" might be important)
  - Document decisions made about outliers
- Rule of thumb:
  - Large dataset, clear errors: Remove
  - Small dataset, unclear if errors: Transform or use robust methods
  - Critical applications: Investigate each outlier thoroughly
  - Exploratory phase: Note them but don't automatically remove

## 8. Key Takeaways
- Not all outliers are bad; some represent important rare events
- Always investigate outliers before deciding on a treatment
- Choice of treatment depends on:
  - Dataset size
  - Problem context
  - Algorithm choice
  - Business implications
- When in doubt, compare model performance with and without outlier treatment

