In [None]:
from pathlib import Path

%store -r

CONFIG_VARS = ['SLUG', 'DATA_DIR', 'FIG_DIR', 'REP_DIR', 'NOTEBOOK_DIR', 'DATASET_KEY']
missing_vars = [var for var in CONFIG_VARS if var not in globals()]

# Set default values if variables are not found in store or are empty.
# The `or` short-circuits, so SLUG is only read when it actually exists.
if missing_vars or not SLUG:
    print("Stored configuration is missing or empty, initializing everything explicitly")
    SLUG = 'customer-segmentation'
    DATASET_KEY = 'vjchoudhary7/customer-segmentation-tutorial-in-python'
    GIT_ROOT = Path.cwd().parent.parent
    DATA_DIR = GIT_ROOT / 'data' / SLUG
    FIG_DIR = GIT_ROOT / 'figures' / SLUG
    REP_DIR = GIT_ROOT / 'reports' / SLUG
    NOTEBOOK_DIR = GIT_ROOT / 'notebooks' / SLUG

print("Project configuration:")
print(f"SLUG = {SLUG}")
print(f"DATA_DIR = {DATA_DIR}")
print(f"DATASET_KEY = {DATASET_KEY}")
print(f"FIG_DIR = {FIG_DIR}")
print(f"REP_DIR = {REP_DIR}")
print(f"NOTEBOOK_DIR = {NOTEBOOK_DIR}")
print(f"Vars not found in globals: {missing_vars}")


Project configuration:
SLUG = customer-segmentation
DATA_DIR = /Users/ravisharma/workdir/eda_practice/data/customer-segmentation
DATASET_KEY = vjchoudhary7/customer-segmentation-tutorial-in-python
FIG_DIR = /Users/ravisharma/workdir/eda_practice/figures/customer-segmentation
REP_DIR = /Users/ravisharma/workdir/eda_practice/reports/customer-segmentation
NOTEBOOK_DIR = /Users/ravisharma/workdir/eda_practice/notebooks/customer-segmentation
Vars not found in globals: []


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display

In [None]:
# Load the data from the local CSV

base_df = pd.DataFrame()

CSV_PATH = Path(DATA_DIR) / "Mall_Customers.csv"
if not CSV_PATH.exists():  # exists is a method; without the call this check never fires
    print(f"CSV {CSV_PATH} does not exist. base_df will remain empty.")
else:
    base_df = pd.read_csv(CSV_PATH)
    print(f"CSV {CSV_PATH} loaded successfully.")

base_df.head()

CSV /Users/ravisharma/workdir/eda_practice/data/customer-segmentation/Mall_Customers.csv loaded successfully.


Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [None]:
base_df.describe()

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


In [None]:
# Use Z-Score to detect outliers
z_scores = np.abs(stats.zscore(base_df['Age']))
# print(z_scores)
print("method1")
potential_outliers = np.where(z_scores > 2)
print(len(potential_outliers[0]), potential_outliers)
print("method2")
potential_outliers_2 = base_df[z_scores > 3]
print(potential_outliers_2)
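# More robust modified z-score, built on the median absolute deviation (MAD)
median = base_df['Age'].median()
# Note: despite its name, this variable holds the *median* absolute deviation
mean_absolute_deviation = np.median(np.abs(base_df['Age'] - median))
# 0.6745 rescales MAD so the score is comparable to a standard z-score for normal data
modified_z_score = 0.6745 * (base_df['Age'] - median) / mean_absolute_deviation
outliers = base_df[np.abs(modified_z_score) > 3.5]  # cutoff suggested by Iglewicz & Hoaglin (1993)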
print(outliers)

## 📊 Modified Z-Score: Theory & Robustness Explained

Your code implements the **Modified Z-Score**, which replaces the mean and standard deviation with the median and the Median Absolute Deviation (MAD). That makes it far less sensitive to outliers than the standard Z-score. Here is the theory behind it:

### 🔍 Standard Z-Score Problems
```python
z_score = (x - mean) / standard_deviation
```
**Issues:**
- **Mean is sensitive** to extreme outliers
- **Standard deviation is sensitive** to extreme outliers  
- **One extreme outlier** can distort both mean and std, making other outliers invisible

### 💪 Modified Z-Score Solution
```python
# Your code breakdown:
median = base_df['Age'].median()                                                 # Step 1: robust center
mean_absolute_deviation = np.median(np.abs(base_df['Age'] - median))             # Step 2: MAD (median, not mean, absolute deviation)
modified_z_score = 0.6745 * (base_df['Age'] - median) / mean_absolute_deviation  # Step 3: rescale by 0.6745 / MAD
```

### 📈 Mathematical Theory

#### **Step 1: Median (Robust Center)**
- **Median** is the 50th percentile - unaffected by extreme values
- Unlike **mean**, adding extreme outliers doesn't shift the median significantly

#### **Step 2: Median Absolute Deviation (MAD)**
```python
# MAD = median(|x_i - median(x)|)
mad = np.median(np.abs(x - np.median(x)))
```
- **MAD** measures spread using median instead of mean
- **Robust alternative** to standard deviation
- **Outliers don't inflate** the measure of variability

#### **Step 3: The Magic Constant 0.6745**
```python
stats.norm.ppf(0.75)  # Φ^(-1)(0.75) ≈ 0.6745, the 75th percentile of the standard normal distribution
```
**Why this constant?**
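- **Rescales MAD** so that `MAD / 0.6745` estimates the standard deviation when the data are normal
- **Φ^(-1)(0.75) ≈ 0.6745** is the z-value below which 75% of the standard normal distribution lies, so half of normal data fall within ±0.6745σ of the median and `MAD ≈ 0.6745 × σ`
- **Conversion factor**: `1.4826 × MAD ≈ σ` for normal data, and `1/1.4826 ≈ 0.6745`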
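
The relationship above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `scaled_mad`, the seed, and the sample size are arbitrary choices made for illustration and are not part of the notebook.

```python
import random
import statistics
from statistics import NormalDist

# The constant is the 75th percentile of the standard normal distribution.
k = NormalDist().inv_cdf(0.75)


def scaled_mad(values):
    """Return MAD / k, which estimates sigma when the data are normal."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    return mad / k


# Simulated standard normal sample (true sigma = 1); the seed is arbitrary.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

print(f"k = {k:.4f}, 1/k = {1 / k:.4f}")
print(f"population std = {statistics.pstdev(sample):.3f}")
print(f"scaled MAD     = {scaled_mad(sample):.3f}")
```

On normal data both estimates should land close to 1, which is why dividing by MAD and multiplying by 0.6745 yields a score on roughly the same scale as an ordinary Z-score.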

### 🛡️ Why Modified Z-Score is MORE ROBUST

| **Aspect** | **Standard Z-Score** | **Modified Z-Score** |
|------------|---------------------|---------------------|
| **Center** | Mean (sensitive to outliers) | Median (robust to outliers) |
| **Spread** | Std Dev (inflated by outliers) | MAD (resistant to outliers) |
| **Breakdown Point** | ~0% (1 outlier affects it) | ~50% (needs >50% outliers to break) |
| **Effect of Extreme Values** | High sensitivity | Low sensitivity |

#### 🎯 Practical Example
Imagine ages: `[20, 21, 22, 23, 24, 150]` (150 is a data entry error)

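**Standard Z-Score:**
- Mean ≈ 43.3 (pulled up by 150!)
- Std ≈ 47.7 (population std, as used by `stats.zscore`; inflated by 150!)
- 150 scores only z ≈ 2.24, below the usual cutoff of 3, so the outlier masks itself; the normal ages (20-24) score about -0.4 to -0.5

**Modified Z-Score:**
- Median = 22.5 (barely affected)
- MAD = 1.5 (barely affected)
- 150 scores about 57.3 while the other ages stay below about 1.2, so only 150 is identified as an outlier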
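
The example can be recomputed directly. This is a small sketch using only the standard library; it uses the population standard deviation to match the default of `scipy.stats.zscore`, and the cutoffs 3 and 3.5 are the conventional ones discussed in this notebook.

```python
import statistics

ages = [20, 21, 22, 23, 24, 150]

# Standard z-score (population std, matching scipy.stats.zscore's default ddof=0)
mean = statistics.fmean(ages)
std = statistics.pstdev(ages)
z = [(a - mean) / std for a in ages]

# Modified z-score based on the median and the median absolute deviation
median = statistics.median(ages)
mad = statistics.median([abs(a - median) for a in ages])
modified_z = [0.6745 * (a - median) / mad for a in ages]

flag_z = [a for a, s in zip(ages, z) if abs(s) > 3]
flag_mod = [a for a, s in zip(ages, modified_z) if abs(s) > 3.5]

print(f"mean={mean:.1f} std={std:.1f} median={median} MAD={mad}")
print("z-scores:         ", [round(s, 2) for s in z])
print("modified z-scores:", [round(s, 2) for s in modified_z])
print("flagged by |z| > 3:           ", flag_z)
print("flagged by |modified z| > 3.5:", flag_mod)
```

With only six points, a single extreme value inflates the standard deviation so much that its own z-score cannot reach 3, while the modified score separates it clearly.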

### 🎯 The 3.5 Threshold

```python
outliers = base_df[np.abs(modified_z_score) > 3.5]
```

**Why 3.5?**
- **Conservative threshold** - because of the 0.6745 scaling, it corresponds to roughly 3.5 standard deviations for normal data
- **Iglewicz & Hoaglin (1993)** recommendation for modified Z-score
- **Lower false positive rate** compared to 2.0 threshold
- **Good balance** between sensitivity and specificity
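
To get a feel for these cutoffs, one can compute how often clean data would exceed them. The sketch below assumes exactly normal data, which the `Age` column only approximates, and uses n = 200 from the `describe()` output; the helper `two_sided_tail` is a hypothetical name introduced here.

```python
from statistics import NormalDist


def two_sided_tail(cutoff):
    """Probability that |Z| exceeds the cutoff for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(cutoff))


n = 200  # number of customers in the dataset
for cutoff in (2.0, 3.0, 3.5):
    p = two_sided_tail(cutoff)
    print(f"cutoff {cutoff}: P(|Z| > cutoff) = {p:.5f}, expected flags in {n} rows = {n * p:.2f}")
```

Under this assumption a cutoff of 2.0 flags several perfectly ordinary rows in a sample of 200, whereas 3.5 flags almost none, so anything it does flag deserves a closer look.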

### 📊 Statistical Properties

#### **Breakdown Point:**
- **Standard Z-Score**: 0% (any extreme value affects it)
- **Modified Z-Score**: 50% (robust until majority are outliers)

#### **Efficiency:**
- **Standard Z-Score**: 100% efficient for normal data
- **Modified Z-Score**: ~37% efficient for normal data, but much better for contaminated data

### 🔬 When to Use Each

**Use Standard Z-Score when:**
- ✅ Data is approximately normal
- ✅ No extreme outliers expected
- ✅ All data points are trustworthy

**Use Modified Z-Score when:**
- ✅ **Data may contain outliers** (your case!)
- ✅ **Robust detection needed**
- ✅ **Data quality uncertain**
- ✅ **Heavy-tailed or contaminated distributions** (the 0.6745 scaling still assumes a roughly normal, symmetric bulk)

### 🎯 Your Customer Segmentation Context

For customer data, Modified Z-Score is **excellent** because:
1. **Age data** often has data entry errors (e.g., age 999, age 0)
2. **Customer surveys** may have extreme responses
3. **Business context** requires reliable outlier detection
4. **Robust method** ensures real patterns aren't masked by bad data

Modified Z-Score is a sound, widely used default for outlier detection in real-world, potentially messy datasets. Keep in mind that the score is undefined when MAD is zero, and that the cutoff (3.5 vs. a looser value) controls how many points get flagged. 🏆



In [None]:
# Use Z-Score to detect outliers
z_scores = np.abs(stats.zscore(base_df['Age']))
# print(z_scores)
print("method1")
potential_outliers = np.where(z_scores > 2)
print(len(potential_outliers[0]), potential_outliers)
print("method2")
potential_outliers_2 = base_df[z_scores > 3]
print(potential_outliers_2)

method1
10 (array([ 10,  57,  60,  62,  67,  70,  82,  90, 102, 108]),)
method2
Empty DataFrame
Columns: [CustomerID, Gender, Age, Annual Income (k$), Spending Score (1-100)]
Index: []


In [None]:
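# More robust modified z-score, built on the median absolute deviation (MAD)
median = base_df['Age'].median()
# Note: despite its name, this variable holds the *median* absolute deviation
mean_absolute_deviation = np.median(np.abs(base_df['Age'] - median))
modified_z_score = 0.6745 * (base_df['Age'] - median) / mean_absolute_deviation
# Looser cutoff of 2.0 here; the commonly recommended cutoff of 3.5 is stricter
outliers = base_df[np.abs(modified_z_score) > 2.0]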
print(outliers)

    CustomerID Gender  Age  Annual Income (k$)  Spending Score (1-100)
57          58   Male   69                  44                      46
60          61   Male   70                  46                      56
70          71   Male   70                  49                      55
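
The same calculation could be wrapped in a small helper so it can be reused on other columns. This is only a sketch: `modified_z_scores` is a hypothetical function, it works on plain Python lists rather than a DataFrame, and raising an error when MAD is zero is one possible design choice among several.

```python
import statistics


def modified_z_scores(values, k=0.6745):
    """Return the modified Z-score of each value (hypothetical helper)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [k * (v - center) / mad for v in values]


# The five Age values shown by base_df.head()
head_ages = [19, 21, 20, 23, 31]
for age, score in zip(head_ages, modified_z_scores(head_ages)):
    print(age, round(score, 2))
```

In the notebook it could be called as `modified_z_scores(base_df['Age'].tolist())`. The zero check matters for columns where one value dominates, since the division would otherwise produce infinities or NaNs.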
