# Day 01 — Data Cleaning + EDA Fundamentals

Welcome to **Day 01 of the ML Track!** This is a comprehensive walkthrough of **Exploratory Data Analysis (EDA)** — the essential first step in every ML project.

## Why EDA is Non-Negotiable

| Reason | What happens if you skip it |
|---|---|
| **Data quality** | Garbage in, garbage out. Missing values or wrong types silently corrupt your model. |
| **Feature understanding** | You miss obvious transformations that would boost performance. |
| **Target leakage** | You accidentally include future information, getting unrealistically high scores. |
| **Model selection** | You pick the wrong model type because you never looked at the distributions. |

## The EDA Workflow

```
Raw Data -> 1. Inspect -> 2. Missing Values -> 3. Summary Stats
         -> 4. Distributions -> 5. Correlation -> 6. Feature Engineering
         -> Clean Data ready for Modeling (Day 02)
```

## What We Cover Today

1. **Theory**: Types of data (numerical, categorical, ordinal)
2. **Theory**: Summary statistics — mean, median, std, quartiles
3. **Theory**: Missing values — MCAR/MAR/MNAR and imputation strategies
4. **Theory**: Distributions, skewness, outlier detection
5. **Theory**: Correlation — Pearson r, heatmaps, caveats
6. **Theory**: Feature engineering — ratios and bucketing
7. **Practice**: All of the above on a small HR dataset

---
## Theory: Types of Data

### Numerical — quantities you can do arithmetic on

| Subtype | Description | Examples |
|---|---|---|
| **Continuous** | Any value within a range | Age (22.5), salary, temperature |
| **Discrete** | Only specific values (usually integers) | Number of children, floor number |

### Categorical — groups or labels (arithmetic is meaningless)

| Subtype | Description | Examples |
|---|---|---|
| **Nominal** | No natural order | Department (sales, marketing), country |
| **Ordinal** | Meaningful order | Education level, star ratings (1-5) |

### Why This Matters for ML

| Data Type | Summary Stats | Visualizations | Model Encoding |
|---|---|---|---|
| **Continuous** | Mean, median, std | Histogram, boxplot | Use directly (maybe scale) |
| **Nominal** | Mode, value_counts | Bar chart | One-hot encoding |
| **Ordinal** | Median, mode | Bar chart | Ordinal (integer) encoding with an explicitly declared order |

Pandas does not automatically know the correct type — it is your job to set it.
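
As a minimal sketch of setting types by hand, the snippet below declares an ordered categorical and one-hot encodes a nominal column. The `people` frame, the `education` levels, and their ordering are hypothetical illustrations, not part of the HR dataset used later.

```python
import pandas as pd

# Hypothetical toy data (not the HR dataset used in the practice cells).
people = pd.DataFrame({
    'department': ['sales', 'marketing', 'sales'],
    'education':  ['bachelor', 'high_school', 'master'],
})

# Ordinal: declare the order explicitly so sorting and comparisons respect it.
edu_order = ['high_school', 'bachelor', 'master']  # assumed ordering
people['education'] = pd.Categorical(people['education'],
                                     categories=edu_order, ordered=True)
# Integer codes follow the declared order: high_school=0, bachelor=1, master=2.
people['education_code'] = people['education'].cat.codes

# Nominal: one-hot encode, because departments have no natural order.
dept_dummies = pd.get_dummies(people['department'], prefix='dept')

print(people)
print(dept_dummies)
```

Without the explicit `categories` list, pandas would order the levels alphabetically, which is rarely the order you mean.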

---
## Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 100
pd.set_option('display.max_columns', 20)

print('All imports successful!')

---
## Creating a Sample Dataset

In real projects you load a CSV or SQL table. Here we create a small in-memory dataset
to focus on EDA concepts. It intentionally has **missing values** and mixed **data types**.

In [None]:
data = {
    'age':        [22, 35, 28, None, 40, 19, 50],
    'salary':     [48000, 54000, 50000, 62000, None, 41000, 80000],
    'department': ['sales', 'marketing', 'sales', 'engineering',
                   'engineering', 'sales', 'marketing'],
    'tenure':     [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
}
df = pd.DataFrame(data)
print(f'Shape: {df.shape[0]} rows x {df.shape[1]} columns')
df.head()

---
## Theory: The First Three Questions

1. **How big is the data?** -> `df.shape`
2. **What types do I have?** -> `df.info()` (shows dtypes and non-null counts)
3. **What does it look like?** -> `df.head()` / `df.tail()`

### Pandas Data Types

| dtype | What it represents |
|---|---|
| `int64` | Integers (no decimals) |
| `float64` | Decimals — **also used when integers have missing values** |
| `object` | Text strings or mixed types |
| `category` | Categorical (memory-efficient; correct for nominals/ordinals) |

**Important:** When an integer column has `None`, pandas converts it to `float64`
because `NaN` is a float. You will see this in the `age` column.
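
The upcast is easy to check in isolation. The sketch below also shows pandas' nullable `Int64` dtype (capital I), which this notebook does not use but which keeps integers as integers when values are missing.

```python
import pandas as pd

ints_only = pd.Series([22, 35, 28])
with_none = pd.Series([22, None, 28])
# Nullable integer dtype: an addition for illustration, not used elsewhere here.
nullable = pd.Series([22, None, 28], dtype='Int64')

print(ints_only.dtype)  # int64
print(with_none.dtype)  # float64 (None became NaN)
print(nullable.dtype)   # Int64 (missing value stored as <NA>)
```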

In [None]:
print('=' * 50)
print('DATASET SIZE')
print('=' * 50)
print(f'Rows:    {df.shape[0]}')
print(f'Columns: {df.shape[1]}')
print(f'Names:   {list(df.columns)}')
print()

print('=' * 50)
print('DATA TYPES AND NON-NULL COUNTS')
print('=' * 50)
df.info()
print()
print('OBSERVATIONS:')
print('- age is float64 (not int64) because it has a missing value (NaN is a float)')
print('- salary is float64 for the same reason')
print('- age and salary each have 6/7 non-null => 1 missing value each')

---
## Theory: Summary Statistics

### Central Tendency

| Statistic | What it tells you | Sensitive to outliers? |
|---|---|---|
| **Mean** | Center of gravity of the data | **Yes** — one extreme value pulls it strongly |
| **Median** | Typical value — 50% above, 50% below | **No** — robust to outliers |
| **Mode** | Most frequent value | No |

### Spread

| Statistic | What it tells you |
|---|---|
| **Std** | Typical distance of values from the mean (root-mean-square deviation) |
| **IQR** | Q3 - Q1 — spread of the middle 50% (robust to outliers) |

### Worked Example

Age values (excluding the missing one): 19, 22, 28, 35, 40, 50

- **Mean** = (19+22+28+35+40+50)/6 = **32.3**
- **Median** = average of 28 and 35 = **31.5**
- mean (32.3) > median (31.5) => slight **right skew**

**Rule of thumb:** mean > median = right-skewed | mean < median = left-skewed | mean ≈ median = symmetric (a heuristic that can mislead on small or multimodal samples)
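
The worked example can be reproduced with a few pandas calls. This is a small sketch on the six observed ages; note that `Series.std()` uses the sample formula (`ddof=1`) and `quantile` uses linear interpolation by default.

```python
import pandas as pd

# The six observed ages from the worked example (missing value excluded).
ages = pd.Series([19, 22, 28, 35, 40, 50])

mean = ages.mean()
median = ages.median()
std = ages.std()                      # sample standard deviation (ddof=1)
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

print(f'mean={mean:.1f}, median={median}')
print(f'std={std:.2f}, Q1={q1}, Q3={q3}, IQR={iqr}')
print(f'mean > median: {mean > median}')
```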

In [None]:
# .describe() gives count, mean, std, min, 25%, 50%, 75%, max for every numeric column
print('NUMERICAL SUMMARY STATISTICS')
print(df.describe().round(2))
print()
print('HOW TO READ THIS:')
print('- count < total rows => missing values')
print('- mean vs 50% (median): large gap = skewed distribution')
print('- min/max: check for impossible values (negative age? salary=0?)')
print()

# For categorical columns, value_counts() is the equivalent of describe()
print('CATEGORICAL: department')
print(df['department'].value_counts())
print()
for dept, pct in (df['department'].value_counts(normalize=True)*100).items():
    print(f'  {dept}: {pct:.1f}%')

---
## Theory: Missing Values — Types and Strategies

### Three Types of Missingness

| Type | Description | Implication |
|---|---|---|
| **MCAR** (Missing Completely At Random) | No relationship to any variable | Safe to drop or impute |
| **MAR** (Missing At Random) | Depends on observed variables, not the missing value | Imputation using other features works well |
| **MNAR** (Missing Not At Random) | Depends on the missing value itself | Any strategy may introduce bias |

**Example of MNAR:** High-income people refuse to report salary. Imputing with the mean
will systematically underestimate their salary.
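
To make the MNAR bias concrete, here is a small sketch with made-up salaries in which the two highest earners decline to report. The numbers and the 100,000 cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical true salaries; in practice we would never see the full column.
true_salary = pd.Series([40000, 45000, 50000, 55000, 60000, 120000, 150000])

# MNAR: a value is missing *because* it is high.
reported = true_salary.where(true_salary < 100000)

# Mean imputation only knows about the reported (lower) salaries.
imputed = reported.fillna(reported.mean())

print(f'True mean:    {true_salary.mean():,.0f}')
print(f'Imputed mean: {imputed.mean():,.0f}')
```

The imputed column's mean sits well below the true mean, and no choice of summary statistic computed from the reported values can recover the gap.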

### Strategies

| Strategy | When to use |
|---|---|
| **Drop rows** | Very few missing (<5%), MCAR |
| **Median imputation** | Continuous, skewed or with outliers — more robust than mean |
| **Mean imputation** | Continuous, roughly symmetric |
| **Mode imputation** | Categorical |
| **Indicator column** | When missingness itself carries signal (add BEFORE imputing) |

> Always **quantify first** before deciding on a strategy.
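
When missingness looks like MAR, other observed columns can inform the fill value. The sketch below imputes each missing salary with the median of its own department; the `staff` frame is a hypothetical example, and group-wise median is one simple option among several.

```python
import pandas as pd

# Hypothetical data: salary is missing in one row of each department.
staff = pd.DataFrame({
    'department': ['sales', 'sales', 'sales',
                   'engineering', 'engineering', 'engineering'],
    'salary':     [40000, None, 44000, 70000, 74000, None],
})

# Record missingness BEFORE imputing, so the signal is not lost.
staff['salary_missing'] = staff['salary'].isna().astype(int)

# Median salary of each row's department (NaN values are skipped).
dept_median = staff.groupby('department')['salary'].transform('median')
staff['salary'] = staff['salary'].fillna(dept_median)

print(staff)
```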

In [None]:
# Step 1: Quantify
print('MISSING VALUE ANALYSIS')
missing = pd.DataFrame({'count': df.isna().sum(), 'pct': (df.isna().sum()/len(df)*100).round(1)})
print(missing)
print()

# Step 2: Add indicator column BEFORE imputing
# Records WHERE data was originally missing — preserves missingness signal
df['age_missing'] = df['age'].isna().astype(int)
print('age_missing indicator (1=was missing):')
print(df[['age', 'age_missing']].to_string())
print()

# Step 3: Impute with median (robust to outliers)
age_med = df['age'].median()
sal_med = df['salary'].median()
print(f'Imputing age with median:    {age_med}')
print(f'Imputing salary with median: {sal_med}')
df['age'] = df['age'].fillna(age_med)
df['salary'] = df['salary'].fillna(sal_med)
print()
print(f'Missing after imputation: {df.isna().sum().sum()} values')

---
## Theory: Data Types and Casting

### Why Cast Dtypes Explicitly?

1. **Memory efficiency**: `category` can use far less memory than `object` for long
   columns with few unique values, because each distinct label is stored only once.
2. **Correct operations**: `.describe()` shows mean/std for numeric; count/unique for
   object/category.
3. **Prevents bugs**: A numeric column stored as `object` misbehaves in arithmetic: strings concatenate instead of adding, or the operation raises a `TypeError`.

| From | To | Method |
|---|---|---|
| `object` | `category` | `.astype('category')` |
| `object` | `float64` | `pd.to_numeric(col, errors='coerce')` |
| `object` | `datetime64` | `pd.to_datetime(col)` |
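
The three conversions in the table can be sketched on a small text-only frame. The `raw` frame and its `hired` column are hypothetical; the memory comparison uses a repeated label list purely for illustration, and the exact savings depend on the data and pandas version.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV: everything is text.
raw = pd.DataFrame({
    'salary':     ['48000', '54000', 'n/a'],
    'hired':      ['2021-03-01', '2019-07-15', '2023-01-09'],
    'department': ['sales', 'marketing', 'sales'],
})

raw['salary'] = pd.to_numeric(raw['salary'], errors='coerce')  # 'n/a' -> NaN
raw['hired'] = pd.to_datetime(raw['hired'])
raw['department'] = raw['department'].astype('category')
print(raw.dtypes)

# Memory comparison on a longer column with only three distinct labels.
labels = pd.Series(['sales', 'marketing', 'engineering'] * 10_000)
as_text = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)
print(f'text: {as_text:,} bytes, category: {as_category:,} bytes')
```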

In [None]:
print(f'Before: department dtype = {df["department"].dtype}')
df['department'] = df['department'].astype('category')
print(f'After:  department dtype = {df["department"].dtype}')
print(f'Categories: {list(df["department"].cat.categories)}')
print()
print('FINAL COLUMN TYPES')
print(df.dtypes)

---
## Theory: Distributions, Skewness, and Outliers

### Distribution Shapes

```
Symmetric         Right-Skewed      Left-Skewed
    @             @                        @
   @@@           @@@                      @@@
  @@@@@          @@@@@@                @@@@@@
 @@@@@@@        @@@@@@@@@@@@      @@@@@@@@@@@@
---------       -------------    -------------
mean=median     mean > median     mean < median
```

### The 1.5 x IQR Outlier Rule (Boxplot Whiskers)

- Lower fence: Q1 - 1.5 x IQR
- Upper fence: Q3 + 1.5 x IQR
- Values beyond fences are **potential outliers** (shown as dots in boxplots)

**Example:** Q1=25, Q3=45, IQR=20 => fences at -5 and 75.
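
A minimal sketch of the fence rule follows. The helper name `iqr_fences` is hypothetical, and the sample reuses the six observed salaries from the dataset plus one made-up extreme value (250000) so that something is actually flagged.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences using the k x IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# The example from the text, computed directly from the quartiles.
q1, q3 = 25, 45
print(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))  # -5.0 75.0

# Observed salaries plus one hypothetical extreme value.
salaries = pd.Series([48000, 54000, 50000, 62000, 41000, 80000, 250000])
low, high = iqr_fences(salaries)
outliers = salaries[(salaries < low) | (salaries > high)]

print(f'Fences: {low:,.0f} to {high:,.0f}')
print(f'Potential outliers: {outliers.tolist()}')
```

A flagged value is only a candidate for review; whether to keep, cap, or drop it depends on whether it is an error or a genuine observation.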

### KDE (Kernel Density Estimation)
A smooth version of a histogram. Use `kde=True` in seaborn's `histplot`.

In [None]:
# Histograms with KDE overlay and mean/median reference lines
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, col, color in zip(axes, ['age', 'salary'], ['steelblue', 'coral']):
    sns.histplot(df[col], kde=True, color=color, edgecolor='white', ax=ax)
    m, med = df[col].mean(), df[col].median()
    ax.axvline(m, color='red', linestyle='--', lw=1.5, label=f'Mean={m:.1f}')
    ax.axvline(med, color='green', linestyle='-.', lw=1.5, label=f'Median={med:.1f}')
    ax.set_title(f'{col.capitalize()} Distribution', fontsize=13, fontweight='bold')
    ax.legend(fontsize=9)

plt.suptitle('Distribution Analysis', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

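# Boxplot: Salary by department
# Note: seaborn 0.13+ warns when `palette` is passed without `hue`;
# on those versions, add hue='department' and legend=False to the boxplot call.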
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x='department', y='salary', data=df, palette='Set2', ax=ax)
sns.stripplot(x='department', y='salary', data=df, color='black', size=8, alpha=0.7, ax=ax)
ax.set_title('Salary by Department', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print('Salary by department:')
for dept in df['department'].cat.categories:
    s = df[df['department']==dept]['salary']
    print(f'  {dept}: median=${s.median():,.0f}, range=${s.min():,.0f}-${s.max():,.0f}')

---
## Theory: Correlation and Relationships

### Pearson r — Measures Linear Relationship Strength

| r value | Interpretation |
|---|---|
| +1.0 | Perfect positive linear relationship |
| +0.7 to +1.0 | Strong positive correlation |
| 0 | No linear relationship |
| -0.7 to -1.0 | Strong negative correlation |

### Critical Caveats

1. **Correlation does not imply causation.** Ice cream sales and drowning rates both
   peak in summer — ice cream does not cause drowning.

2. **Linear only.** A perfect parabola (y = x^2) has r = 0 when the x values are symmetric around zero.

3. **Sensitive to outliers.** One extreme point can drastically change r.
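
Caveats 2 and 3 can be checked numerically. This is a small sketch with made-up arrays: a parabola over x values symmetric around zero, and a positively correlated pair to which one hypothetical extreme point is appended.

```python
import numpy as np

# Caveat 2: a perfect (non-linear) dependence with zero Pearson correlation.
x = np.arange(-5, 6)              # symmetric around zero
y = x ** 2
r_parabola = np.corrcoef(x, y)[0, 1]

# Caveat 3: one extreme point can flip the sign of r.
a = np.array([1, 2, 3, 4, 5, 6], dtype=float)
b = np.array([2, 1, 4, 3, 6, 5], dtype=float)
r_clean = np.corrcoef(a, b)[0, 1]
r_outlier = np.corrcoef(np.append(a, 30.0), np.append(b, -30.0))[0, 1]

print(f'parabola:     r = {r_parabola:.2f}')
print(f'clean data:   r = {r_clean:.2f}')
print(f'with outlier: r = {r_outlier:.2f}')
```

This is why a scatter plot should accompany any correlation coefficient you plan to act on.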

In [None]:
corr = df.select_dtypes(include='number').corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, square=True, linewidths=0.5, ax=ax)
ax.set_title('Correlation Matrix (Pearson)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print('Notable correlations (|r| > 0.4):')
cols = corr.columns
found = False
for i in range(len(cols)):
    for j in range(i+1, len(cols)):
        r = corr.iloc[i, j]
        if abs(r) > 0.4:
            found = True
            s = 'strong' if abs(r) > 0.7 else 'moderate'
            d = 'positive' if r > 0 else 'negative'
            print(f'  {cols[i]} vs {cols[j]}: r={r:.2f} ({s} {d})')
if not found:
    print('  None found.')

---
## Theory: Feature Engineering — Ratios and Bucketing

Creating new features from existing ones is often one of the biggest levers for improving
model performance.

### Why Create New Features?

- A burnout risk model cares about **salary relative to tenure**, not each value alone.
- A career prediction model may care about **age group** more than exact age.

### Common Techniques

| Technique | Example | When to Use |
|---|---|---|
| **Ratio** | salary / tenure | When the ratio carries more signal than each value separately |
| **Bucketing** | age -> '<25', '25-35' | When the effect is non-linear across the value range |
| **Log transform** | log(salary) | When a feature is heavily right-skewed |
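
The log transform is not applied in the practice cell, so here is a brief sketch on hypothetical right-skewed salaries. It uses `np.log`, which requires strictly positive values; `np.log1p` is a common alternative when zeros can occur.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with a long right tail.
salary = pd.Series([30000, 35000, 40000, 50000, 70000, 120000, 300000])

log_salary = np.log(salary)  # assumes all values are > 0

print(f'skew before: {salary.skew():.2f}')
print(f'skew after:  {log_salary.skew():.2f}')
```

The transformed column is still right-skewed here, just less so; the transform compresses the tail rather than removing it.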

### Warning: Avoid Target Leakage

Never create features using information unavailable at prediction time.

In [None]:
# Feature 1: Ratio
# salary_per_year = pay per year of experience
# +0.1 prevents division by zero for very new hires
df['salary_per_year'] = df['salary'] / (df['tenure'] + 0.1)

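# Feature 2: Bucketing
# age_bucket = career stage categories
# Note: pd.cut bins are right-inclusive by default, so the intervals are
# (0, 25], (25, 35], (35, 45], (45, 100]; an age of exactly 25 lands in '<25'.
df['age_bucket'] = pd.cut(df['age'], bins=[0, 25, 35, 45, 100],
                          labels=['<25', '25-35', '35-45', '45+'])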

print('New features:')
print(df[['age', 'salary', 'tenure', 'salary_per_year', 'age_bucket']].to_string())
print()
print(f'salary_per_year: min=${df["salary_per_year"].min():,.0f}, max=${df["salary_per_year"].max():,.0f}')
print()
print('age_bucket distribution:')
print(df['age_bucket'].value_counts().sort_index())

---
## Summary of Findings

### Theory Recap

| Topic | Key Takeaway |
|---|---|
| **Data types** | Numerical vs categorical. Each needs different summary stats, visualizations, and encoding. |
| **Summary statistics** | Mean is sensitive to outliers; median is robust. Compare them to detect skew. |
| **Missing values** | Understand MCAR/MAR/MNAR. Add indicator columns before imputing. |
| **Distributions** | Histograms reveal shape; boxplots show quartiles and flag outliers. |
| **Correlation** | Pearson r is linear only. Correlation does not imply causation. |
| **Feature engineering** | Ratios and buckets create more informative signals than raw columns. |

### Practice Completed

1. Inspected structure: `df.shape`, `df.info()`, `df.head()`
2. Quantified missing values, added indicator column, imputed with median
3. Computed summary statistics with `df.describe()` and `value_counts()`
4. Cast `department` to `category` dtype
5. Visualized distributions: histograms with KDE and mean/median lines, boxplots by group
6. Built Pearson correlation heatmap
7. Engineered `salary_per_year` (ratio) and `age_bucket` (bucketing)

> EDA is not optional. It is the foundation of every reliable ML pipeline.

---
## Next Steps: Day 02 — Baseline Model

In **Day 02**, we build a **logistic regression baseline** on the Breast Cancer Wisconsin dataset.
Today's work connects directly:

- The **imputation** matters because most sklearn estimators, including logistic regression, reject NaN inputs.
- The **dtype casting** determines how we encode features (one-hot vs scaling).
- The **correlation analysis** helps identify useful vs redundant predictors.
- The **feature engineering** ideas can be applied before modeling for better inputs.

**See you in Day 02!**