# Handling Outliers Using Python

## Introduction

**Outliers** are data points that significantly differ from other observations in the dataset. They can be unusually high or low values that don't follow the general pattern of the data.

### What are Outliers?

An outlier is an observation that lies an abnormal distance from other values in a dataset. For example:
- In a dataset of house prices mostly between $200K-$500K, a $5M mansion is an outlier
- In student test scores mostly between 60-95, a score of 15 is an outlier
- In temperature readings between 20-30°C, a value of 100°C is likely an error

### Types of Outliers

**1. Univariate Outliers**
   - Extreme values in a single variable
   - Example: Age = 200 years in a health dataset

**2. Multivariate Outliers**
   - Normal individually but unusual in combination
   - Example: Age = 5 years with Income = $200K

**3. Point Outliers**
   - Individual data points far from others
   - Most common type

**4. Contextual Outliers**
   - Outliers in specific contexts
   - Example: 30°C is normal in summer but an outlier in winter

**5. Collective Outliers**
   - Collection of points that are outliers together
   - Example: Sudden spike in website traffic
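
To make the univariate versus multivariate distinction concrete, here is a minimal sketch. It uses Mahalanobis distance, a technique this notebook does not otherwise cover, purely as an illustration. The toy data and the cutoff of 3 are assumptions.

```python
import numpy as np

# Toy data (assumed): y tracks x closely, plus one point that breaks the pattern
x = np.linspace(-2, 2, 101)
y = x + 0.1 * np.sin(5 * x)
points = np.vstack([np.column_stack([x, y]), [1.5, -1.5]])  # last row: odd combination

# Univariate view: per-column z-scores of the odd point look unremarkable
z = np.abs((points - points.mean(axis=0)) / points.std(axis=0, ddof=1))
last_z = z[-1]

# Multivariate view: Mahalanobis distance takes the x-y correlation into account
diff = points - points.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(points, rowvar=False))
mahal = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

print("Per-column |z| of the odd point:", last_z.round(2))
print("Mahalanobis distance of the odd point:", round(float(mahal[-1]), 2))
print("Largest Mahalanobis distance among the rest:", round(float(mahal[:-1].max()), 2))
```

Each coordinate of the odd point sits well inside the usual |z| < 3 range, yet its Mahalanobis distance is far larger than that of any other row. This is the "normal individually but unusual in combination" case.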

### Causes of Outliers

**Natural Outliers (Legitimate):**
- Rare events (lottery winners, natural disasters)
- Exceptional performance (Olympic athletes)
- Genuine variability in data

**Artificial Outliers (Errors):**
- Data entry errors (typos: 150 instead of 15.0)
- Measurement errors (faulty sensors)
- Processing errors (unit conversions gone wrong)
- Sampling errors (wrong population sampled)

### Impact of Outliers

**Negative Effects:**
- Skew statistical measures (mean, standard deviation)
- Violate assumptions of statistical tests
- Reduce model accuracy
- Increase error in predictions
- Mask true patterns in data

**Positive Effects:**
- Can represent important rare events (fraud detection)
- May indicate new discoveries or patterns
- Sometimes the most interesting data points
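
As a small illustration of how one outlier skews the mean while barely moving the median, consider the sketch below. The scores are made up and echo the test-score example from earlier.

```python
from statistics import mean, median

scores = [70, 75, 80, 85, 90]   # typical test scores (made-up)
with_outlier = scores + [15]    # add one unusually low score

print(f"Mean without / with outlier:   {mean(scores):.2f} / {mean(with_outlier):.2f}")
print(f"Median without / with outlier: {median(scores):.2f} / {median(with_outlier):.2f}")
# Mean drops from 80.00 to 69.17; median only moves from 80.00 to 77.50
```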

### When to Remove vs Keep Outliers

**Remove When:**
- Data entry errors confirmed
- Measurement errors identified
- Outside the scope of analysis
- Breaking model assumptions

**Keep When:**
- Natural variability in data
- Rare but legitimate events
- Target of analysis (fraud, anomalies)
- Domain knowledge confirms validity

Let's explore various methods to detect and handle outliers!

## Step 1: Import Libraries and Create Dataset with Outliers

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Create a dataset with intentional outliers
# Normal data
normal_data = np.random.normal(100, 15, 200)

# Add outliers
outliers = np.array([50, 45, 180, 190, 200, 35, 185, 195])
data_with_outliers = np.concatenate([normal_data, outliers])

# Create DataFrame
df = pd.DataFrame({
    'Values': data_with_outliers,
    'Age': np.concatenate([np.random.randint(20, 60, 200), np.array([5, 95, 3, 100, 102, 8, 98, 105])]),
    'Salary': np.concatenate([np.random.normal(50000, 10000, 200), 
                              np.array([10000, 5000, 150000, 180000, 200000, 8000, 160000, 190000])])
})

print("=" * 100)
print("DATASET WITH OUTLIERS CREATED")
print("=" * 100)
print(f"\nDataset Shape: {df.shape}")
print(f"\nFirst 10 rows:")
print(df.head(10))
print(f"\nLast 10 rows (includes outliers):")
print(df.tail(10))

print("\n" + "-" * 100)
print("BASIC STATISTICS:")
print("-" * 100)
print(df.describe())

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Dataset Overview: Visualizing Outliers', fontsize=16, fontweight='bold')

# Histograms
for idx, col in enumerate(['Values', 'Age', 'Salary']):
    axes[0, idx].hist(df[col], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[0, idx].set_xlabel(col, fontweight='bold')
    axes[0, idx].set_ylabel('Frequency', fontweight='bold')
    axes[0, idx].set_title(f'{col} Distribution', fontweight='bold')
    axes[0, idx].grid(alpha=0.3)
    axes[0, idx].axvline(df[col].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[0, idx].axvline(df[col].median(), color='green', linestyle='--', linewidth=2, label='Median')
    axes[0, idx].legend()

# Box plots
for idx, col in enumerate(['Values', 'Age', 'Salary']):
    box = axes[1, idx].boxplot(df[col], vert=True, patch_artist=True)
    box['boxes'][0].set_facecolor('lightcoral')
    axes[1, idx].set_ylabel(col, fontweight='bold')
    axes[1, idx].set_title(f'{col} Box Plot', fontweight='bold')
    axes[1, idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("👁️ VISUAL INSPECTION:")
print("  • Box plots show points beyond whiskers = potential outliers")
print("  • Histograms show extreme values on the tails")
print("  • Notice how outliers affect the mean vs median")
print("=" * 100)

## Method 1: Interquartile Range (IQR) Method

The **IQR method** is one of the most popular statistical techniques for outlier detection.

### How IQR Works:

1. **Calculate Q1 (25th percentile)** and **Q3 (75th percentile)**
2. **Calculate IQR** = Q3 - Q1
3. **Define boundaries:**
   - Lower Bound = Q1 - 1.5 × IQR
   - Upper Bound = Q3 + 1.5 × IQR
4. **Outliers** = values outside these boundaries
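
The four steps above can be traced by hand on a tiny sample. This is a minimal sketch on made-up numbers. `np.percentile` uses linear interpolation by default, which matches the default of pandas `.quantile()`.

```python
import numpy as np

data = np.array([10, 12, 13, 14, 15, 16, 18, 50])  # made-up sample

# Step 1: quartiles
q1, q3 = np.percentile(data, [25, 75])
# Step 2: interquartile range
iqr = q3 - q1
# Step 3: boundaries
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
# Step 4: values outside the boundaries
flagged = data[(data < lower) | (data > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")       # Q1=12.75, Q3=16.5, IQR=3.75
print(f"Bounds: [{lower}, {upper}]")        # Bounds: [7.125, 22.125]
print("Flagged:", flagged)                  # Flagged: [50]
```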

### Why 1.5 × IQR?
- Standard convention in statistics
- Identifies values in the extreme tails
- Can be adjusted (1.5 for moderate, 3.0 for extreme outliers)
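
One way to see what the multiplier means is to check where the fences fall for a perfectly normal distribution. The sketch below assumes a standard normal and uses only the Python standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1
q1 = std_normal.inv_cdf(0.25)    # about -0.674
q3 = std_normal.inv_cdf(0.75)    # about +0.674
iqr = q3 - q1

fences = {}
for multiplier in (1.5, 3.0):
    upper = q3 + multiplier * iqr
    share_outside = 2 * (1 - std_normal.cdf(upper))  # both tails
    fences[multiplier] = (upper, share_outside)
    print(f"multiplier={multiplier}: fence at ±{upper:.2f} sigma, "
          f"{share_outside:.4%} of normal data flagged")
```

With a multiplier of 1.5 the fences sit at roughly 2.7 standard deviations, so about 0.7% of normally distributed data would be flagged. With 3.0 the fences move out to roughly 4.7 standard deviations.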

### Advantages:
✅ Simple and intuitive
✅ Robust to extreme values
✅ Works well with skewed distributions
✅ Visualized in box plots

### Disadvantages:
❌ Univariate only: checks one variable at a time
❌ Misses multivariate outliers (unusual combinations of values)
❌ Fixed 1.5 × IQR fences implicitly assume a roughly symmetric distribution

In [None]:
# IQR Method for Outlier Detection
def detect_outliers_iqr(data, column, multiplier=1.5):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    return outliers, lower_bound, upper_bound, Q1, Q3, IQR

print("IQR METHOD FOR OUTLIER DETECTION")
print("=" * 100)

# Apply IQR method to each column
results_iqr = {}
for column in ['Values', 'Age', 'Salary']:
    outliers, lower, upper, q1, q3, iqr = detect_outliers_iqr(df, column)
    results_iqr[column] = {
        'outliers': outliers,
        'count': len(outliers),
        'lower_bound': lower,
        'upper_bound': upper,
        'Q1': q1,
        'Q3': q3,
        'IQR': iqr
    }
    
    print(f"\n{column}:")
    print(f"  Q1 (25th percentile): {q1:.2f}")
    print(f"  Q3 (75th percentile): {q3:.2f}")
    print(f"  IQR: {iqr:.2f}")
    print(f"  Lower Bound: {lower:.2f}")
    print(f"  Upper Bound: {upper:.2f}")
    print(f"  Outliers Detected: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
    if len(outliers) > 0:
        print(f"  Outlier values: {sorted(outliers[column].values)[:10]}")  # Show first 10

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('IQR Method: Outlier Detection', fontsize=16, fontweight='bold')

for idx, column in enumerate(['Values', 'Age', 'Salary']):
    # Scatter plot with outliers highlighted
    normal_data = df[~df.index.isin(results_iqr[column]['outliers'].index)]
    outlier_data = results_iqr[column]['outliers']
    
    axes[idx].scatter(normal_data.index, normal_data[column], 
                     c='blue', label='Normal', alpha=0.6, s=30)
    axes[idx].scatter(outlier_data.index, outlier_data[column], 
                     c='red', label='Outliers', alpha=0.9, s=80, marker='x', linewidths=3)
    
    # Add boundary lines
    axes[idx].axhline(y=results_iqr[column]['lower_bound'], color='orange', 
                     linestyle='--', linewidth=2, label='Lower Bound')
    axes[idx].axhline(y=results_iqr[column]['upper_bound'], color='orange', 
                     linestyle='--', linewidth=2, label='Upper Bound')
    axes[idx].axhline(y=results_iqr[column]['Q1'], color='green', 
                     linestyle=':', linewidth=1, alpha=0.5, label='Q1')
    axes[idx].axhline(y=results_iqr[column]['Q3'], color='green', 
                     linestyle=':', linewidth=1, alpha=0.5, label='Q3')
    
    axes[idx].set_xlabel('Index', fontweight='bold')
    axes[idx].set_ylabel(column, fontweight='bold')
    axes[idx].set_title(f'{column}: {len(outlier_data)} Outliers', fontweight='bold')
    axes[idx].legend(loc='best', fontsize=8)
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Remove outliers: drop every row flagged in at least one column
df_no_outliers_iqr = df.copy()
for column in ['Values', 'Age', 'Salary']:
    outlier_indices = results_iqr[column]['outliers'].index
    # errors='ignore' skips rows already dropped for an earlier column
    # (without it, drop() raises KeyError on the second pass)
    df_no_outliers_iqr = df_no_outliers_iqr.drop(outlier_indices, errors='ignore')

print("\n" + "=" * 100)
print("DATASET AFTER REMOVING OUTLIERS (IQR Method):")
print(f"  Original size: {len(df)} rows")
print(f"  After removal: {len(df_no_outliers_iqr)} rows")
print(f"  Removed: {len(df) - len(df_no_outliers_iqr)} rows ({(len(df) - len(df_no_outliers_iqr))/len(df)*100:.1f}%)")
print("=" * 100)

## Method 2: Z-Score Method

The **Z-Score method** measures how many standard deviations a data point is from the mean.

### How Z-Score Works:

**Formula:** Z = (X - μ) / σ
- X = data point
- μ = mean
- σ = standard deviation

**Threshold:** Typically |Z| > 3 indicates an outlier
- |Z| > 2: Moderate outlier (about 95% of normally distributed data lies within 2σ)
- |Z| > 3: Extreme outlier (about 99.7% of normally distributed data lies within 3σ)
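
The percentages behind these thresholds can be checked directly. This short sketch assumes perfectly normal data and uses only the standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Share of a normal distribution lying beyond each threshold (both tails)
tail_share = {t: 2 * (1 - std_normal.cdf(t)) for t in (2, 3)}

for threshold, share in tail_share.items():
    print(f"|Z| > {threshold}: {share:.2%} of normal data")
# |Z| > 2: 4.55% of normal data
# |Z| > 3: 0.27% of normal data
```

So even clean, normally distributed data will produce a few |Z| > 3 values once the sample is large enough, for example about 27 in 10,000 rows.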

### When to Use:
- Data is normally distributed (or approximately)
- Want to use statistical significance
- Need standardized measure across features

### Advantages:
✅ Based on statistical theory
✅ Easy to interpret (in terms of standard deviations)
✅ Works well with normal distributions
✅ Standardized across different scales

### Disadvantages:
❌ Assumes normal distribution
❌ Sensitive to extreme outliers (affects mean and std)
❌ Not robust for skewed data
❌ Only for univariate analysis
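
The sensitivity to extreme values can hide an outlier entirely in a small sample. The sketch below shows this on made-up numbers and contrasts it with the modified z-score of Iglewicz and Hoaglin. That is a different technique from the one used in this notebook, shown only for comparison. Its constant 0.6745 and its usual cutoff of 3.5 come from that method, not from this notebook.

```python
import numpy as np

data = np.array([10, 11, 12, 13, 14, 15, 100], dtype=float)  # made-up sample

# Classic z-score: the outlier inflates both the mean and the standard deviation
z = np.abs(data - data.mean()) / data.std(ddof=1)

# Modified z-score: median and MAD (median absolute deviation) barely move
med = np.median(data)
mad = np.median(np.abs(data - med))
modified_z = 0.6745 * np.abs(data - med) / mad

print(f"Classic z-score of 100:  {z[-1]:.2f}")           # 2.26, below the cutoff of 3
print(f"Modified z-score of 100: {modified_z[-1]:.2f}")  # 29.34, far above 3.5
```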

In [None]:
# Z-Score Method
def detect_outliers_zscore(data, column, threshold=3):
    """Detect outliers using Z-score method"""
    mean = data[column].mean()
    std = data[column].std()
    
    z_scores = np.abs((data[column] - mean) / std)
    outliers = data[z_scores > threshold]
    
    return outliers, z_scores, mean, std

print("Z-SCORE METHOD FOR OUTLIER DETECTION")
print("=" * 100)

results_zscore = {}
for column in ['Values', 'Age', 'Salary']:
    outliers, z_scores, mean, std = detect_outliers_zscore(df, column, threshold=3)
    results_zscore[column] = {
        'outliers': outliers,
        'z_scores': z_scores,
        'count': len(outliers),
        'mean': mean,
        'std': std
    }
    
    print(f"\n{column}:")
    print(f"  Mean: {mean:.2f}")
    print(f"  Std Dev: {std:.2f}")
    print(f"  Threshold: 3 standard deviations")
    print(f"  Outliers Detected: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
    if len(outliers) > 0:
        print(f"  Outlier values: {sorted(outliers[column].values)[:10]}")
        print(f"  Max Z-score: {z_scores.max():.2f}")

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Z-Score Method: Outlier Detection', fontsize=16, fontweight='bold')

for idx, column in enumerate(['Values', 'Age', 'Salary']):
    # Z-score distribution
    z_scores = results_zscore[column]['z_scores']
    axes[0, idx].hist(z_scores, bins=30, color='lightblue', edgecolor='black', alpha=0.7)
    # z_scores holds absolute values, so only the +3 threshold is relevant
    axes[0, idx].axvline(x=3, color='red', linestyle='--', linewidth=2, label='Threshold (3σ)')
    axes[0, idx].set_xlabel('Z-Score', fontweight='bold')
    axes[0, idx].set_ylabel('Frequency', fontweight='bold')
    axes[0, idx].set_title(f'{column}: Z-Score Distribution', fontweight='bold')
    axes[0, idx].legend()
    axes[0, idx].grid(alpha=0.3)
    
    # Scatter plot with outliers
    normal_mask = z_scores <= 3
    axes[1, idx].scatter(df[normal_mask].index, df[normal_mask][column], 
                        c='blue', label='Normal', alpha=0.6, s=30)
    axes[1, idx].scatter(results_zscore[column]['outliers'].index, 
                        results_zscore[column]['outliers'][column], 
                        c='red', label='Outliers', alpha=0.9, s=80, marker='x', linewidths=3)
    axes[1, idx].axhline(y=results_zscore[column]['mean'], color='green', 
                        linestyle='--', linewidth=2, label='Mean')
    axes[1, idx].set_xlabel('Index', fontweight='bold')
    axes[1, idx].set_ylabel(column, fontweight='bold')
    axes[1, idx].set_title(f'{column}: {len(results_zscore[column]["outliers"])} Outliers', fontweight='bold')
    axes[1, idx].legend()
    axes[1, idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("💡 Z-SCORE INTERPRETATION:")
print("  • |Z| < 2: Within normal range (95% of data)")
print("  • 2 < |Z| < 3: Moderate outlier")
print("  • |Z| > 3: Extreme outlier (only 0.3% of normal data)")
print("=" * 100)

## Method 3: Isolation Forest

**Isolation Forest** is a machine learning algorithm specifically designed for anomaly/outlier detection.

### How Isolation Forest Works:

1. **Randomly select a feature** and a random split value between that feature's minimum and maximum
2. **Recursively partition data** (create decision trees)
3. **Outliers are isolated faster** (fewer splits needed)
4. **Anomaly score** based on path length to isolate the point

**Key Idea:** Outliers are "few and different", so they're easier to isolate than normal points.

### Why It's Effective:
- Outliers require fewer splits to isolate
- Normal points are clustered together (many splits needed)
- Uses ensemble of isolation trees

### When to Use:
- High-dimensional data
- Multivariate outlier detection
- No assumptions about data distribution
- Large datasets

### Advantages:
✅ Handles multivariate outliers
✅ No assumptions about distribution
✅ Efficient with large datasets
✅ Can handle high dimensions
✅ Based on machine learning

### Disadvantages:
❌ Less interpretable (black box)
❌ Requires parameter tuning
❌ May not work well with small datasets
❌ Computational overhead
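
Before the full example, a minimal sketch may help show how scikit-learn's three outputs relate. The toy data (one far-away point appended to a Gaussian cloud) and the `contamination=0.05` setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary points around the origin, plus one far-away point as the last row
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])

model = IsolationForest(contamination=0.05, random_state=0).fit(X)

scores = model.score_samples(X)        # lower = more abnormal
decision = model.decision_function(X)  # score_samples minus the fitted offset_
labels = model.predict(X)              # -1 = outlier, 1 = inlier

print("Index of the lowest score:", scores.argmin())
print("Label of the far-away point:", labels[-1])
print("Rows flagged as outliers:", int((labels == -1).sum()))
```

The `contamination` parameter only sets where the cutoff on the scores falls. The ranking of points by `score_samples` does not depend on it.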

In [None]:
# Isolation Forest Method
print("ISOLATION FOREST METHOD")
print("=" * 100)

# Apply Isolation Forest
features = ['Values', 'Age', 'Salary']
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # Expect 5% outliers
predictions = iso_forest.fit_predict(df[features])

# -1 for outliers, 1 for inliers
df['ISO_Outlier'] = predictions

# Anomaly scores (lower = more abnormal). Store them before splitting so that
# both subsets below carry the 'Anomaly_Score' column used in the histogram.
df['Anomaly_Score'] = iso_forest.score_samples(df[features])

outliers_iso = df[df['ISO_Outlier'] == -1]
inliers_iso = df[df['ISO_Outlier'] == 1]

print(f"\nContamination Parameter: 0.05 (expect ~5% outliers)")
print(f"Outliers Detected: {len(outliers_iso)} ({len(outliers_iso)/len(df)*100:.1f}%)")
print(f"Inliers: {len(inliers_iso)} ({len(inliers_iso)/len(df)*100:.1f}%)")

print("\n" + "-" * 100)
print("Sample Outliers Detected:")
print(outliers_iso[features].head(10))

# Visualization
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3)

# 2D scatter plots for each pair of features
feature_pairs = [('Values', 'Age'), ('Values', 'Salary'), ('Age', 'Salary')]

for idx, (feat1, feat2) in enumerate(feature_pairs):
    ax = fig.add_subplot(gs[0, idx])
    
    # Plot inliers and outliers
    ax.scatter(inliers_iso[feat1], inliers_iso[feat2], 
              c='blue', label='Inliers', alpha=0.6, s=30)
    ax.scatter(outliers_iso[feat1], outliers_iso[feat2], 
              c='red', label='Outliers', alpha=0.9, s=100, marker='x', linewidths=3)
    
    ax.set_xlabel(feat1, fontweight='bold')
    ax.set_ylabel(feat2, fontweight='bold')
    ax.set_title(f'Isolation Forest: {feat1} vs {feat2}', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)

# Anomaly score distribution
ax = fig.add_subplot(gs[1, :])
ax.hist(inliers_iso['Anomaly_Score'], bins=30, alpha=0.7, label='Inliers', 
        color='blue', edgecolor='black')
ax.hist(outliers_iso['Anomaly_Score'], bins=30, alpha=0.7, label='Outliers', 
        color='red', edgecolor='black')
ax.set_xlabel('Anomaly Score', fontweight='bold', fontsize=12)
ax.set_ylabel('Frequency', fontweight='bold', fontsize=12)
ax.set_title('Anomaly Score Distribution', fontweight='bold', fontsize=14)
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.suptitle('Isolation Forest: Multivariate Outlier Detection', 
             fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

# Compare with other methods
print("\n" + "=" * 100)
print("METHOD COMPARISON:")
print("=" * 100)
comparison_df = pd.DataFrame({
    'Method': ['IQR', 'Z-Score', 'Isolation Forest'],
    'Outliers_Detected': [
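        # Count unique rows flagged in at least one column, so the totals are
        # comparable with Isolation Forest (which counts rows, not cells)
        len(set().union(*(results_iqr[col]['outliers'].index for col in ['Values', 'Age', 'Salary']))),
        len(set().union(*(results_zscore[col]['outliers'].index for col in ['Values', 'Age', 'Salary']))),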
        len(outliers_iso)
    ],
    'Type': ['Univariate', 'Univariate', 'Multivariate']
})
print(comparison_df.to_string(index=False))

print("\n" + "=" * 100)
print("💡 KEY INSIGHTS:")
print("  • IQR & Z-Score: Detect outliers in each feature independently")
print("  • Isolation Forest: Detects multivariate outliers (unusual combinations)")
print("  • A point can be normal in each feature but outlier in combination")
print("=" * 100)

## Handling Strategies: What to Do with Outliers

Once outliers are detected, you have several options for handling them.

### Strategy 1: Remove Outliers
**When:** Confirmed errors, measurement mistakes
**Code:** `df_clean = df[~outlier_mask]`

### Strategy 2: Cap/Winsorize
**When:** Want to retain data but limit extreme values
**Code:** Set outliers to boundary values (e.g., 5th and 95th percentiles)
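
A minimal sketch of percentile-based capping, assuming the 5th and 95th percentiles as boundaries and a made-up sample:

```python
import numpy as np

values = np.array([3, 48, 50, 51, 52, 53, 55, 57, 60, 250], dtype=float)  # made-up

low, high = np.percentile(values, [5, 95])   # boundary values
capped = np.clip(values, low, high)          # values outside are set to the boundary

print(f"Boundaries: [{low}, {high}]")        # Boundaries: [23.25, 164.5]
print("Capped:", capped)
```

All ten rows are kept. Only the two extremes change, and because the sample is tiny the percentiles themselves are pulled toward the outliers, which is worth keeping in mind on small datasets.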

### Strategy 3: Transform Data
**When:** Reduce impact of outliers
**Methods:** Log transformation, square root, Box-Cox
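
To illustrate how these transformations compress a long right tail, the sketch below compares skewness before and after. The lognormal "income" data and the small `skewness` helper are assumptions made for this example.

```python
import numpy as np

def skewness(a):
    """Sample skewness: mean of the cubed standardized deviations."""
    a = np.asarray(a, dtype=float)
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # synthetic right-skewed data

results = {
    "raw": skewness(incomes),
    "sqrt": skewness(np.sqrt(incomes)),
    "log1p": skewness(np.log1p(incomes)),
}

for name, value in results.items():
    print(f"{name:<6} skewness: {value:.2f}")
```

The square root reduces the skew partially, while the log transform brings it close to zero for this kind of data. Box-Cox generalizes both by estimating the power from the data.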

### Strategy 4: Impute Outliers
**When:** Outliers are errors but removing loses info
**Code:** Replace with median, mean, or predicted values

### Strategy 5: Keep Them
**When:** Outliers are legitimate rare events
**Use:** Robust algorithms (tree-based models) or separate analysis

Let's implement these strategies!

In [None]:
# Demonstrate different handling strategies
print("OUTLIER HANDLING STRATEGIES")
print("=" * 100)

# Use 'Values' column for demonstration
column = 'Values'
data_original = df[column].copy()

# Strategy 1: Remove outliers (already shown)
outliers_iqr, lower, upper, _, _, _ = detect_outliers_iqr(df, column)
data_removed = df[~df.index.isin(outliers_iqr.index)][column]

# Strategy 2: Cap/Winsorize (clip to boundaries)
data_capped = data_original.copy()
data_capped = data_capped.clip(lower=lower, upper=upper)

# Strategy 3: Log transformation
# Shift so the minimum is 0; log1p(x) = log(1 + x) then handles the zero safely
data_log = np.log1p(data_original - data_original.min())

# Strategy 4: Impute with median
data_imputed = data_original.copy()
outlier_mask = (data_original < lower) | (data_original > upper)
data_imputed[outlier_mask] = data_original.median()

# Strategy 5: Keep as is
data_keep = data_original.copy()

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Outlier Handling Strategies Comparison', fontsize=16, fontweight='bold')

strategies = [
    ('Original Data', data_original),
    ('Removed', data_removed),
    ('Capped/Winsorized', data_capped),
    ('Log Transformed', data_log),
    ('Imputed (Median)', data_imputed),
    ('Keep Outliers', data_keep)
]

for idx, (title, data) in enumerate(strategies):
    row = idx // 3
    col = idx % 3
    
    axes[row, col].hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[row, col].axvline(data.mean(), color='red', linestyle='--', 
                           linewidth=2, label=f'Mean: {data.mean():.1f}')
    axes[row, col].axvline(data.median(), color='green', linestyle='--', 
                           linewidth=2, label=f'Median: {data.median():.1f}')
    axes[row, col].set_xlabel(column, fontweight='bold')
    axes[row, col].set_ylabel('Frequency', fontweight='bold')
    axes[row, col].set_title(f'{title} (n={len(data)})', fontweight='bold')
    axes[row, col].legend(fontsize=8)
    axes[row, col].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics comparison
print("\nSTATISTICS COMPARISON:")
print("=" * 100)
print(f"{'Strategy':<20} {'Count':<8} {'Mean':<10} {'Median':<10} {'Std':<10} {'Min':<10} {'Max':<10}")
print("-" * 100)

for title, data in strategies:
    print(f"{title:<20} {len(data):<8} {data.mean():<10.2f} {data.median():<10.2f} "
          f"{data.std():<10.2f} {data.min():<10.2f} {data.max():<10.2f}")

print("\n" + "=" * 100)
print("STRATEGY RECOMMENDATIONS:")
print("-" * 100)
print("✓ REMOVE: Best when outliers are confirmed errors")
print("✓ CAP: Retains all data, limits extreme values")
print("✓ TRANSFORM: Reduces skewness, compresses scale")
print("✓ IMPUTE: Replaces outliers but keeps sample size")
print("✓ KEEP: Use with robust models or when outliers are valid")
print("=" * 100)

## Summary: Outlier Detection and Handling Best Practices

### Quick Decision Guide

| Question | Answer → Method |
|----------|----------------|
| Is data normally distributed? | Yes → Z-Score, No → IQR |
| Need multivariate detection? | Yes → Isolation Forest |
| Small dataset? | IQR (more robust) |
| High dimensions? | Isolation Forest |
| Need interpretability? | IQR or Z-Score |
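
The first question in the guide can be approached with a normality test. The sketch below uses the Shapiro-Wilk test from SciPy. The helper name `suggest_method` and the 0.05 significance level are assumptions, and a non-significant result does not prove normality, so treat the output as a hint rather than a decision.

```python
import numpy as np
from scipy import stats

def suggest_method(values, alpha=0.05):
    """Hypothetical helper: suggest Z-Score unless Shapiro-Wilk rejects normality."""
    _, p_value = stats.shapiro(values)
    method = "Z-Score" if p_value > alpha else "IQR"
    return method, p_value

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)  # synthetic, clearly non-normal data

method, p_value = suggest_method(skewed)
print(f"Suggested method: {method} (Shapiro-Wilk p-value: {p_value:.3g})")
```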

### Method Comparison Summary

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **IQR** | General purpose, skewed data | Simple, robust, visual | Univariate only |
| **Z-Score** | Normal distribution | Statistical, standardized | Assumes normality |
| **Isolation Forest** | High-dim, multivariate | No assumptions, ML-based | Less interpretable |

### Best Practices

✅ **DO:**
- Visualize data first (box plots, histograms)
- Understand domain context before removing
- Document outlier handling decisions
- Try multiple detection methods
- Consider business impact
- Use appropriate method for data distribution
- Keep original data backup

❌ **DON'T:**
- Remove outliers blindly
- Use Z-score on non-normal data
- Forget to check for multivariate outliers
- Remove outliers before understanding them
- Use same threshold for all features
- Ignore domain knowledge

### Handling Strategy Selection

```
Is outlier a DATA ERROR? → REMOVE
Is outlier LEGITIMATE but extreme? → CAP or TRANSFORM
Is outlier RARE but VALID event? → KEEP (use robust model)
Is outlier pattern IMPORTANT? → SEPARATE ANALYSIS
Unsure? → IMPUTE or FLAG for review
```
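
For the "FLAG for review" branch, one possible approach is to add an indicator column instead of changing or dropping anything. The column name `is_outlier` and the toy data are assumptions.

```python
import pandas as pd

df_demo = pd.DataFrame({"amount": [48, 49, 50, 51, 52, 400]})  # made-up data

q1, q3 = df_demo["amount"].quantile([0.25, 0.75])
iqr = q3 - q1

# Mark suspicious rows; the original values stay untouched
df_demo["is_outlier"] = (
    (df_demo["amount"] < q1 - 1.5 * iqr) | (df_demo["amount"] > q3 + 1.5 * iqr)
)

print(df_demo)
print("Rows to review:", df_demo.index[df_demo["is_outlier"]].tolist())  # [5]
```

The flag can later be used as a filter, as a model feature, or as a worklist for a domain expert.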

### Real-World Tips

1. **Always investigate** outliers before removing
2. **Domain expertise** is crucial - what's an outlier in one context may be normal in another
3. **Multiple methods** - use 2-3 methods and compare results
4. **Document everything** - record which outliers removed and why
5. **Consider the goal** - fraud detection needs outliers, but regression modeling may not
6. **Robust algorithms** - Tree-based models handle outliers naturally
7. **Feature-specific thresholds** - different features may need different sensitivity
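
Tip 3 can be sketched as a simple vote between two detectors. The example below combines the IQR and Z-score rules on made-up data. Requiring agreement between the two is one possible policy, not a rule from this notebook.

```python
import numpy as np

values = np.array([48, 49, 50, 51, 52] * 4 + [120], dtype=float)  # made-up sample

# Detector 1: IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flag = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Detector 2: Z-score rule
z = np.abs(values - values.mean()) / values.std(ddof=1)
z_flag = z > 3

consensus = np.flatnonzero(iqr_flag & z_flag)  # flagged by both methods
either = np.flatnonzero(iqr_flag | z_flag)     # flagged by at least one

print("Flagged by both:  ", consensus.tolist())  # [20]
print("Flagged by either:", either.tolist())     # [20]
```

Rows flagged by only one method are good candidates for manual review rather than automatic removal.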

### Code Template

```python
import numpy as np
from scipy import stats

# 1. Detect outliers
def detect_outliers_comprehensive(df, column):
    # IQR method
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    iqr_outliers = df[(df[column] < Q1 - 1.5*IQR) | (df[column] > Q3 + 1.5*IQR)]
    
    # Z-score method
    z_scores = np.abs(stats.zscore(df[column]))
    z_outliers = df[z_scores > 3]
    
    return iqr_outliers, z_outliers

# 2. Handle based on strategy
def handle_outliers(df, column, strategy='cap'):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified

    # IQR boundaries shared by all strategies
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    mask = (df[column] < lower) | (df[column] > upper)

    if strategy == 'remove':
        # Remove outliers
        return df[~mask]

    elif strategy == 'cap':
        # Cap to boundaries
        df[column] = df[column].clip(lower=lower, upper=upper)
        return df

    elif strategy == 'impute':
        # Replace with median
        df.loc[mask, column] = df[column].median()
        return df

    raise ValueError(f"Unknown strategy: {strategy!r}")
```

### Final Recommendations

- **Start with visualization** - understand your data
- **Use IQR as default** - works well in most cases
- **Try Isolation Forest** for complex, high-dimensional data
- **Consider the context** - medical data vs sales data have different needs
- **Validate results** - check model performance with/without outliers
- **Be conservative** - when in doubt, keep the data

---

**Remember:** Outliers are not always bad! They can be the most interesting and valuable part of your data, especially in fraud detection, quality control, or discovering new patterns.