# Day 01 — Data Cleaning + EDA Fundamentals

Welcome to **Day 01 of the ML Track!** This notebook is a comprehensive, beginner-friendly
walkthrough of **Exploratory Data Analysis (EDA)**, the essential first step in every
machine learning project.

## What Is EDA and Why Does It Matter?

**Exploratory Data Analysis** is the process of examining a dataset to understand its structure,
spot problems, and develop intuitions — all *before* building any models.

### Why EDA is non-negotiable

| Reason | What happens if you skip it |
|---|---|
| **Data quality** | Garbage in, garbage out. Undetected missing values or wrong types corrupt your model. |
| **Feature understanding** | You build features blindly, missing obvious transformations. |
| **Target leakage** | You accidentally include future information as a feature. |
| **Model selection** | You pick the wrong model because you never looked at the distributions. |

### The EDA Workflow



## What We Will Cover Today

1. **Theory**: Types of data (numerical, categorical, ordinal)
2. **Theory**: Summary statistics — mean, median, std, quartiles
3. **Theory**: Missing values — MCAR/MAR/MNAR and imputation strategies
4. **Theory**: Distributions, skewness, and outlier detection
5. **Theory**: Correlation and relationships between features
6. **Theory**: Feature engineering — ratios and bucketing
7. **Practice**: All of the above on a small HR dataset

**Goal:** By the end, you should feel comfortable exploring any new dataset before modeling.

---
## Theory: Types of Data

Every column in a dataset falls into one of these categories:

### Numerical Data — represents quantities you can do arithmetic on

| Subtype | Description | Examples |
|---|---|---|
| **Continuous** | Any value within a range | Age (22.5), salary (8,123), temperature |
| **Discrete** | Only specific values (usually integers) | Number of children (0,1,2), floor number |

### Categorical Data — represents groups or labels

| Subtype | Description | Examples |
|---|---|---|
| **Nominal** | No natural order | Department (sales, marketing), color, country |
| **Ordinal** | Meaningful order | Education level, star ratings (1-5) |

### Why This Matters for ML

| Data Type | Summary Stats | Visualizations | Model Encoding |
|---|---|---|---|
| **Continuous** | Mean, median, std | Histogram, boxplot | Use directly (maybe scale) |
| **Nominal** | Mode, value_counts | Bar chart | One-hot encoding |
| **Ordinal** | Median, mode | Bar chart | Ordinal encoding (integers that preserve order) |

**Key insight:** Pandas does not automatically know the correct type. It is your job
to understand and handle each column appropriately.
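
As a rough illustration of the encoding column in the table above, here is a minimal sketch. The `toy` frame, the `education` column, and the order in `education_order` are hypothetical and are not part of the HR dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy columns (assumption: not part of the notebook's HR data)
toy = pd.DataFrame({
    "department": ["sales", "marketing", "sales"],          # nominal
    "education":  ["bachelor", "high_school", "master"],    # ordinal
})

# Nominal: one-hot encode so no artificial order is introduced
one_hot = pd.get_dummies(toy["department"], prefix="dept")

# Ordinal: state the order explicitly, then take the integer codes
education_order = ["high_school", "bachelor", "master"]  # assumed ordering
toy["education"] = pd.Categorical(
    toy["education"], categories=education_order, ordered=True
)
toy["education_code"] = toy["education"].cat.codes

print(one_hot)
print(toy[["education", "education_code"]])
```

Declaring the category order by hand matters here: without it, integer codes would follow alphabetical order rather than the real ranking.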

---
## Setup and Imports

In [None]:
# Data manipulation
import pandas as pd    # DataFrames for tabular data
import numpy as np     # Numerical operations

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure plots appear inline in the notebook
%matplotlib inline

plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["figure.dpi"] = 100
pd.set_option("display.max_columns", 20)

print("All imports successful. Ready to explore!")

---
## Creating a Sample Dataset

In real projects you would load a CSV, SQL table, or API response. Here we create a
small in-memory dataset to keep the focus on **EDA concepts** rather than data loading.

Our dataset simulates an HR table with employee information. It intentionally includes
**missing values** and a mix of **data types** — just like real-world data.

In [None]:
# Build a small employee dataset with realistic imperfections:
#   - Missing values (None) in age and salary
#   - Mix of data types: numeric (age, salary, tenure), categorical (department)
#   - Very different scales: age is 19-50, salary is 41000-80000
data = {
    "age":        [22, 35, 28, None, 40, 19, 50],
    "salary":     [48000, 54000, 50000, 62000, None, 41000, 80000],
    "department": ["sales", "marketing", "sales", "engineering",
                   "engineering", "sales", "marketing"],
    "tenure":     [1.2, 3.4, 2.1, 5.0, 4.2, 0.8, 6.5],
}

df = pd.DataFrame(data)

print(f"Dataset shape: {df.shape[0]} rows x {df.shape[1]} columns")
print()
df.head()

---
## Theory: The First Three Questions

When you encounter a new dataset, always start with these three questions:

1. **How big is the data?** -> `df.shape` gives (rows, columns)
2. **What types of data do I have?** -> `df.info()` shows dtypes and non-null counts
3. **What does it look like?** -> `df.head()` / `df.tail()` give a visual sanity check

### Understanding Pandas Data Types

| Pandas dtype | What it represents |
|---|---|
| `int64` | Integer numbers (no decimals) |
| `float64` | Decimal numbers; **also used when integers have missing values** |
| `object` | Text strings or mixed types |
| `category` | Categorical data (memory-efficient; correct for nominals/ordinals) |
| `datetime64` | Dates and timestamps |

**Important:** When a column of integers contains `NaN`, pandas converts it to `float64`
because `NaN` (Not a Number) is a float. You will see this in our `age` column.
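
The integer-to-float conversion can be checked in a few lines. This is a small sketch on throwaway Series, not on the notebook's `df`. The last part uses pandas' nullable `"Int64"` dtype, shown here only as an optional alternative that the rest of this notebook does not rely on.

```python
import pandas as pd

ints_only = pd.Series([22, 35, 28])
with_missing = pd.Series([22, None, 28])

print(ints_only.dtype)      # an integer dtype
print(with_missing.dtype)   # float64, because None is stored as NaN

# Optional: the nullable integer dtype keeps whole numbers next to missing values
nullable = pd.Series([22, None, 28], dtype="Int64")
print(nullable.dtype, nullable.isna().sum())
```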

In [None]:
# ---- Question 1: How big is the data? ----
print("=" * 50)
print("DATASET SIZE")
print("=" * 50)
print(f"Rows:    {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
print(f"Names:   {list(df.columns)}")
print()

# ---- Question 2: What data types? ----
# .info() shows column names, non-null counts, and dtypes all at once.
# It is the single most useful method for initial dataset inspection.
print("=" * 50)
print("DATA TYPES AND NON-NULL COUNTS")
print("=" * 50)
df.info()
print()

print("OBSERVATIONS:")
print("- age is float64 (not int64) because it has a missing value")
print("- salary is float64 for the same reason")
print("- age has 6 non-null out of 7 => 1 missing value")
print("- salary has 6 non-null out of 7 => 1 missing value")

---
## Theory: Summary Statistics

### Measures of Central Tendency

| Statistic | What it tells you | Sensitive to outliers? |
|---|---|---|
| **Mean** | Center of gravity of the data | **Yes** — one extreme value pulls it strongly |
| **Median** | Typical value — 50% above, 50% below | **No** — robust to outliers |
| **Mode** | Most frequent value (useful for categorical) | No |

### Measures of Spread

| Statistic | What it tells you |
|---|---|
| **Std** | Typical distance of values from the mean (square root of the average squared deviation) |
| **Range** | max - min (sensitive to outliers) |
| **IQR** | Q3 - Q1 — spread of the middle 50% (robust to outliers) |

### Worked Example

Age values (after removing the missing one): 19, 22, 28, 35, 40, 50

- **Mean** = (19+22+28+35+40+50)/6 = **32.3**
- **Median** = average of 28 and 35 = **31.5** (middle two values, n=6 is even)
- Here mean (32.3) > median (31.5) => slight **right skew**

**Rule of thumb:**
- mean ≈ median => symmetric distribution
- mean > median => right-skewed (long tail to the right)
- mean < median => left-skewed (long tail to the left)
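
The worked example can be reproduced with pandas. The sketch below uses the six observed ages listed above. `skew()` is added as one possible numeric check of the rule of thumb; it is not the only way to measure skew.

```python
import pandas as pd

# The six observed ages from the worked example
ages = pd.Series([19, 22, 28, 35, 40, 50])

mean_age = ages.mean()
median_age = ages.median()
iqr_age = ages.quantile(0.75) - ages.quantile(0.25)
skewness = ages.skew()   # sample skewness; a positive value suggests a right tail

print(f"mean={mean_age:.1f}, median={median_age:.1f}")
print(f"std={ages.std():.1f}, range={ages.max() - ages.min()}, IQR={iqr_age:.2f}")
print(f"skew={skewness:.2f}")
```

With only six values, treat the sign of the skew as a hint rather than a firm conclusion.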

In [None]:
# .describe() computes count, mean, std, min, 25%, 50%, 75%, max
# for every numeric column. This is your EDA best friend.
print("=" * 55)
print("NUMERICAL SUMMARY STATISTICS")
print("=" * 55)
print(df.describe().round(2))
print()
print("HOW TO READ THIS:")
print("- count < total rows => missing values")
print("- mean vs 50% (median): large gap = skewed distribution")
print("- std large relative to mean = high variability")
print("- min/max: check for impossible values (negative age? salary=0?)")
print()

# For categorical columns, value_counts() is the equivalent of describe()
print("=" * 55)
print("CATEGORICAL SUMMARY: department")
print("=" * 55)
print(df["department"].value_counts())
print()
dept_pct = df["department"].value_counts(normalize=True) * 100
for dept, pct in dept_pct.items():
    print(f"  {dept}: {pct:.1f}%")

---
## Theory: Missing Values — Types and Strategies

### Three Types of Missingness

| Type | Description | Implication |
|---|---|---|
| **MCAR** (Missing Completely At Random) | No relationship to any variable | Safe to drop or impute |
| **MAR** (Missing At Random) | Depends on observed variables, not the missing value | Imputation using other features works well |
| **MNAR** (Missing Not At Random) | Depends on the missing value itself | Any strategy may introduce bias |

**Example of MNAR:** High-income people refuse to report salary. If you impute with the mean,
you are systematically underestimating salary for those people.

### Common Strategies

| Strategy | When to use | Risk |
|---|---|---|
| **Drop rows** | Very few missing (<5%), MCAR | Loses data |
| **Mean imputation** | Continuous, symmetric distribution | Underestimates variance |
| **Median imputation** | Continuous, skewed or with outliers | Also underestimates variance, but is more robust to outliers |
| **Mode imputation** | Categorical | Over-represents dominant category |
| **Indicator column** | When missingness itself is a signal | More features |

> Always **quantify first** — understand the scale of the problem before deciding.
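
To make the "underestimates variance" risk concrete, here is a small sketch. It takes the six observed salaries from this notebook and adds two hypothetical missing entries (an assumption made only to exaggerate the effect), then compares the spread before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Six observed salaries plus two hypothetical missing entries
s = pd.Series([41000, 48000, 50000, 54000, 62000, 80000, np.nan, np.nan])

std_observed = s.std()                 # NaN values are skipped by default
imputed = s.fillna(s.mean())           # mean imputation
std_after = imputed.std()

print(f"std of observed values:    {std_observed:,.0f}")
print(f"std after mean imputation: {std_after:,.0f}")
```

Every imputed value sits exactly at the mean, so it adds rows without adding spread. That is why the second number is smaller.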

In [None]:
# ---- Step 1: Quantify ----
print("=" * 50)
print("MISSING VALUE ANALYSIS")
print("=" * 50)
missing = pd.DataFrame({
    "count": df.isna().sum(),
    "pct": (df.isna().sum() / len(df) * 100).round(1)
})
print(missing)
print()

# ---- Step 2: Add indicator column BEFORE imputing ----
# Records WHERE data was originally missing.
# Preserves missingness signal in case the data is MNAR.
df["age_missing"] = df["age"].isna().astype(int)
print("Added age_missing indicator (1=was missing, 0=was present):")
print(df[["age", "age_missing"]].to_string())
print()

# ---- Step 3: Impute with median ----
# Median is preferred over mean because it is robust to outliers.
age_median = df["age"].median()
salary_median = df["salary"].median()
print(f"Imputing age with median:    {age_median}")
print(f"Imputing salary with median: {salary_median}")

df["age"] = df["age"].fillna(age_median)
df["salary"] = df["salary"].fillna(salary_median)

print()
print("Missing after imputation:", df.isna().sum().sum(), "values")
print("RESULT: All missing values handled.")

---
## Theory: Data Types and Casting

### Why Explicitly Cast Dtypes?

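1. **Memory efficiency**: `category` dtype can use far less memory than `object` for
   long columns with few unique values (like department with 3 values), because each
   label is stored once and the rows hold small integer codes.

2. **Correct operations**: `df.describe()` shows mean/std for numeric columns but
   count/unique/top/freq for object/category columns.

3. **Preventing bugs**: A numeric column accidentally stored as `object` does not behave
   like numbers: arithmetic either raises an error or does the wrong thing (for example,
   `+` concatenates strings).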

### Common Casts

| From | To | Method |
|---|---|---|
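| `object` | `category` | `.astype("category")` |
| `object` (numeric text) | `float64` | `pd.to_numeric()` |
| `object` (date text) | `datetime64` | `pd.to_datetime()` |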
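
The sketch below illustrates two typical casts and the memory point from the list above. The `raw_salary` and `raw_dept` Series are hypothetical stand-ins for messy CSV input, and the repeated 3,000-row column exists only to make the memory difference visible.

```python
import pandas as pd

# Hypothetical raw columns, as they might arrive from a messy CSV
raw_salary = pd.Series(["48000", "54000", "n/a"], dtype="object")
raw_dept = pd.Series(["sales", "marketing", "engineering"] * 1000, dtype="object")

# Text -> numbers; unparseable entries become NaN instead of raising
salary = pd.to_numeric(raw_salary, errors="coerce")

# Text labels -> category; compare memory on a long, repetitive column
dept = raw_dept.astype("category")
bytes_object = raw_dept.memory_usage(deep=True)
bytes_category = dept.memory_usage(deep=True)

print(salary.dtype, salary.isna().sum())
print(f"object: {bytes_object:,} bytes, category: {bytes_category:,} bytes")
```

On a seven-row table like ours the saving is negligible; the benefit shows up on long columns.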

In [None]:
print(f"Before: department dtype = {df['department'].dtype}")

# Cast to category: more memory-efficient and semantically correct
df["department"] = df["department"].astype("category")

print(f"After:  department dtype = {df['department'].dtype}")
print(f"Categories: {list(df['department'].cat.categories)}")
print()

print("=" * 45)
print("FINAL COLUMN TYPES")
print("=" * 45)
print(df.dtypes)
print()
print("All dtypes correct:")
print("- age, salary, tenure: float64 (numeric)")
print("- department: category (nominal categorical)")
print("- age_missing: int64 (binary indicator)")

---
## Theory: Distributions, Skewness, and Outliers

Numbers (mean, std) can hide patterns that plots reveal instantly.

### Distribution Shapes



### The 1.5 x IQR Outlier Rule

A value is a **potential outlier** if it is:
- Below Q1 - 1.5 x IQR, or
- Above Q3 + 1.5 x IQR

**Example:** Q1=25, Q3=45, IQR=20
- Lower fence: 25 - 30 = **-5**
- Upper fence: 45 + 30 = **75**
- Values outside [-5, 75] are flagged as outliers

Boxplots display this rule visually. Points beyond the whiskers are potential outliers.
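
The fence calculation can be sketched as a small helper. `iqr_fences` is a hypothetical name introduced here, and the input is the six observed salaries from this notebook (before imputation). Quartiles come from pandas' default linear interpolation, so other quartile conventions may give slightly different fences.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the k x IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Six observed salaries from the HR table
salaries = pd.Series([48000, 54000, 50000, 62000, 41000, 80000])

lower, upper = iqr_fences(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(f"fences: [{lower:,.0f}, {upper:,.0f}]")
print("flagged:", outliers.tolist())
```

A flagged value is a candidate for a closer look, not something to delete automatically.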

### KDE (Kernel Density Estimation)

KDE is a smooth version of a histogram. It places a small kernel (typically a Gaussian bump)
on each data point and sums them up, giving a continuous probability density estimate.
Use `kde=True` in seaborn's `histplot`.
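
To show what "one curve per point, then sum" means, here is a by-hand sketch with NumPy. The function name `gaussian_kde_by_hand` and the bandwidth of 5.0 are assumptions chosen for illustration. seaborn selects its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_by_hand(points, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on a grid."""
    points = np.asarray(points, dtype=float)
    # z has shape (n_points, n_grid): distance of each grid value to each point
    z = (grid[None, :] - points[:, None]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=0)

ages = [19, 22, 28, 35, 40, 50]
grid = np.linspace(-20, 90, 1101)      # wide enough to contain the tails
density = gaussian_kde_by_hand(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the curve: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve, and a larger one gives a smoother curve.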

In [None]:
# ---- Histograms with KDE and mean/median reference lines ----
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Age distribution
sns.histplot(df["age"], kde=True, color="steelblue", edgecolor="white", ax=axes[0])
axes[0].set_title("Age Distribution", fontsize=13, fontweight="bold")
axes[0].set_xlabel("Age (years)")
age_mean = df["age"].mean()
age_med = df["age"].median()
axes[0].axvline(age_mean, color="red", linestyle="--", linewidth=1.5, label=f"Mean = {age_mean:.1f}")
axes[0].axvline(age_med, color="green", linestyle="-.", linewidth=1.5, label=f"Median = {age_med:.1f}")
axes[0].legend(fontsize=9)

# Salary distribution
sns.histplot(df["salary"], kde=True, color="coral", edgecolor="white", ax=axes[1])
axes[1].set_title("Salary Distribution", fontsize=13, fontweight="bold")
axes[1].set_xlabel("Salary ($)")
sal_mean = df["salary"].mean()
sal_med = df["salary"].median()
axes[1].axvline(sal_mean, color="red", linestyle="--", linewidth=1.5, label=f"Mean = {sal_mean:,.0f}")
axes[1].axvline(sal_med, color="green", linestyle="-.", linewidth=1.5, label=f"Median = {sal_med:,.0f}")
axes[1].legend(fontsize=9)

plt.suptitle("Distribution Analysis", fontsize=15, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

print(f"Age:    mean={age_mean:.1f}, median={age_med:.1f}")
print(f"Salary: mean={sal_mean:,.0f}, median={sal_med:,.0f}")
print()

# ---- Boxplot: Salary by department ----
# Great for comparing distributions across groups and spotting outliers.
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="department", y="salary", data=df, palette="Set2", ax=ax)
# Overlay raw data points so we can see every individual value
sns.stripplot(x="department", y="salary", data=df, color="black", size=8, alpha=0.7, ax=ax)
ax.set_title("Salary by Department", fontsize=14, fontweight="bold")
ax.set_xlabel("Department")
ax.set_ylabel("Salary ($)")
plt.tight_layout()
plt.show()

print("HOW TO READ A BOXPLOT:")
print("- Box spans Q1 (bottom) to Q3 (top) = middle 50% of values")
print("- Line inside box = MEDIAN")
print("- Whiskers = farthest points within 1.5 x IQR")
print("- Dots beyond whiskers = potential OUTLIERS")
print()
for dept in df["department"].cat.categories:
    s = df[df["department"] == dept]["salary"]
    print(f"  {dept}: median=${s.median():,.0f}, range=${s.min():,.0f}-${s.max():,.0f}")

---
## Theory: Correlation and Relationships

### Pearson Correlation Coefficient (r)

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}$$

| r value | Interpretation |
|---|---|
| +1.0 | Perfect positive linear relationship |
| +0.7 to +0.9 | Strong positive correlation |
| +0.4 to +0.7 | Moderate positive correlation |
| 0 | No linear relationship |
| -0.4 to -0.7 | Moderate negative correlation |
| -1.0 | Perfect negative linear relationship |

### Critical Caveats

1. **Correlation does not imply causation.** Ice cream sales and drowning rates
   correlate (both peak in summer) — ice cream does not cause drowning.

2. **Linear relationships only.** A perfect parabola (y = x^2) has r = 0 when x is
   symmetric around zero, even though y is fully determined by x.

3. **Sensitive to outliers.** One extreme point can drastically change r.
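
The formula and the second caveat can be checked with a direct, plain-Python translation. `pearson_r` is a hypothetical helper written for illustration; in practice `DataFrame.corr()` does this for you.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, written term by term from the formula above."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator

xs = [-3, -2, -1, 0, 1, 2, 3]                       # symmetric around zero
r_line = pearson_r(xs, [2 * x + 1 for x in xs])     # perfectly linear
r_parabola = pearson_r(xs, [x ** 2 for x in xs])    # perfectly quadratic

print(f"line:     r = {r_line:.2f}")
print(f"parabola: r = {r_parabola:.2f}")
```

The parabola is a perfect relationship, yet r sees nothing because the left and right halves cancel out.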

In [None]:
# .corr() computes Pearson correlation between every pair of numeric columns.
# Result: square matrix where cell (i,j) = correlation between column i and j.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    corr,
    annot=True,     # Show values in each cell
    fmt=".2f",      # 2 decimal places
    cmap="RdBu_r",  # Red=positive, Blue=negative
    center=0,       # Neutral at 0
    vmin=-1, vmax=1,
    square=True,
    linewidths=0.5,
    ax=ax
)
ax.set_title("Correlation Matrix (Pearson)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

print("INTERPRETATION:")
print("- Diagonal is always 1.0 (each variable correlates perfectly with itself)")
print("- Off-diagonal: |r| > 0.4 = noteworthy linear relationship")
print()

# Flag notable correlations automatically
cols = corr.columns
found = False
for i in range(len(cols)):
    for j in range(i+1, len(cols)):
        r = corr.iloc[i, j]
        if abs(r) > 0.4:
            found = True
            s = "strong" if abs(r) > 0.7 else "moderate"
            d = "positive" if r > 0 else "negative"
            print(f"  {cols[i]} vs {cols[j]}: r={r:.2f} ({s} {d})")
if not found:
    print("  No correlations with |r| > 0.4 found.")

---
## Theory: Feature Engineering — Ratios and Bucketing

Feature engineering creates **new features** from existing ones. It is often
the single biggest lever for improving model performance.

### Why Create New Features?

Raw features sometimes do not capture what a model needs:
- A burnout risk model cares about **salary relative to tenure**, not each alone.
- A career prediction model may care about **age group** (junior/senior) more than exact age.

### Common Techniques

| Technique | Example | When to Use |
|---|---|---|
| **Ratio** | salary / tenure | When the ratio carries more signal than each value separately |
| **Bucketing** | age -> "<25", "25-35" | When the effect of a variable is non-linear across its range |
| **Log transform** | log(salary) | When a feature is heavily right-skewed |
| **Interaction** | age x tenure | When combined effect of two features matters |

### Warning: Avoid Target Leakage

Never create features using information unavailable at prediction time.
If predicting next month's salary, you cannot use next month's performance review because it does not exist yet.
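
The log transform and interaction rows of the table can be sketched as follows. The `toy` frame reuses three rows of values from this notebook's dataset, and the column names `log_salary` and `age_x_tenure` are hypothetical.

```python
import numpy as np
import pandas as pd

# Three rows borrowed from the HR table, kept separate from df
toy = pd.DataFrame({
    "age":    [22, 35, 28],
    "salary": [48000, 54000, 50000],
    "tenure": [1.2, 3.4, 2.1],
})

# Log transform: compresses large values, useful for right-skewed features
toy["log_salary"] = np.log(toy["salary"])

# Interaction: lets a model see the combined effect of two features
toy["age_x_tenure"] = toy["age"] * toy["tenure"]

print(toy.round(2))
```

`np.log` is undefined at zero, so a column that can contain zeros would need a different transform such as `np.log1p`.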

In [None]:
# ---- Feature 1: Ratio ----
# salary_per_year: pay per year of experience
# +0.1 avoids division by zero for brand-new hires with tenure near 0
df["salary_per_year"] = df["salary"] / (df["tenure"] + 0.1)

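# ---- Feature 2: Bucketing ----
# age_bucket: group employees into career stage categories
# pd.cut() converts continuous values into labeled bins.
# right=False makes bins left-closed: [0, 25), [25, 35), [35, 45), [45, 100),
# so the "<25" and "45+" labels are literally true (age 25 falls in "25-35").
df["age_bucket"] = pd.cut(
    df["age"],
    bins=[0, 25, 35, 45, 100],
    labels=["<25", "25-35", "35-45", "45+"],
    right=False,
)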

print("=" * 65)
print("NEW FEATURES")
print("=" * 65)
print(df[["age", "salary", "tenure", "salary_per_year", "age_bucket"]].to_string())
print()
print("salary_per_year summary:")
print(f"  Min:  ${df['salary_per_year'].min():,.0f}")
print(f"  Max:  ${df['salary_per_year'].max():,.0f}")
print(f"  Mean: ${df['salary_per_year'].mean():,.0f}")
print()
print("age_bucket distribution:")
print(df["age_bucket"].value_counts().sort_index())

---
## Summary of Findings

### Theory Recap

| Topic | Key Takeaway |
|---|---|
| **Data types** | Numerical (continuous/discrete) vs categorical (nominal/ordinal). Each needs different handling. |
| **Summary statistics** | Mean is sensitive to outliers; median is robust. Compare them to detect skew. |
| **Missing values** | Understand MCAR/MAR/MNAR. Add indicator columns before imputing. |
| **Distributions** | Histograms reveal shape; boxplots reveal quartiles and outliers. |
| **Correlation** | Pearson r measures linear relationships only. Correlation != causation. |
| **Feature engineering** | Ratios and buckets can create more informative signals than raw columns. |

### Practice Completed

1. **Inspected** structure: `df.shape`, `df.info()`, `df.head()`
2. **Quantified** missing values and applied median imputation with indicator column
3. **Computed** summary statistics: `describe()` and `value_counts()`
4. **Cast** department to `category` dtype
5. **Visualized** distributions: histograms with KDE + mean/median lines, boxplots by group
6. **Built** Pearson correlation heatmap
7. **Engineered** `salary_per_year` (ratio) and `age_bucket` (binning)

### Key Takeaway

> EDA is not optional. It is the foundation of every reliable ML pipeline.
> A model trained on poorly understood data is a model waiting to fail in production.

---
## Next Steps: Connection to Day 02

In **Day 02**, we build our first model — a **logistic regression baseline** on
the Breast Cancer Wisconsin dataset. Today's work connects directly:

- The **imputation** we did is required before passing data to most scikit-learn estimators
  (logistic regression, for example, raises an error on NaN input).
- The **dtype casting** influences how we encode features (one-hot for categoricals,
  scaling for numerics).
- The **correlation analysis** helps decide which features are likely useful predictors.
- The **feature engineering** ideas can be applied before modeling for better inputs.

**See you in Day 02!**