# UNIT I — Notebook 5: End-to-End Data Preprocessing Pipeline

## Objective

This notebook concludes **UNIT I**.

So far, we have learned:
- How to understand a dataset
- How to handle missing values
- How to analyze distributions
- How to reduce skewness
- How to scale features correctly

Now we will:
1. Combine all preprocessing steps
2. Apply them in the correct order
3. Clearly separate *what belongs to UNIT I* vs *what comes next*


## 1️⃣ The Golden Rule of Preprocessing (Very Important)

Preprocessing is NOT a list of steps.
It is a **pipeline with a strict order**.

Wrong order = wrong data = wrong model.

Correct order (UNIT I):
1. Load data
2. Identify feature types
3. Handle missing values
4. Diagnose distributions
5. Apply transformations
6. Scale numeric features

Only after this:
➡️ Model building


## 2️⃣ Reload Dataset (Clean State)

Each notebook must be executable independently.


In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

path = "healthcare-dataset-stroke-data.csv"
df = pd.read_csv(path)

TARGET_COL = "stroke"
df.shape


(5110, 12)

## 3️⃣ Step 1: Identify Feature Types

Before touching values, classify features.

Why?

Different features require different preprocessing.


In [2]:
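# Split columns by dtype. Note: the numeric list also picks up the row
# identifier (`id`) and the target (`stroke`). Neither is a model input,
# and both are left out of the final feature list selected later.
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()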

numeric_features, categorical_features


(['id',
  'age',
  'hypertension',
  'heart_disease',
  'avg_glucose_level',
  'bmi',
  'stroke'],
 ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'])
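
The dtype-based split above is only a first pass: `id` and `stroke` appear in the numeric list, and `hypertension` and `heart_disease` are 0/1 indicators rather than continuous measurements. A minimal sketch of a finer split follows, on two made-up rows that reuse the notebook's column names. The names `ID_COL`, `candidate_numeric`, `binary`, and `continuous` are hypothetical and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Two made-up rows using the notebook's column names.
toy = pd.DataFrame({
    "id": [1, 2],
    "age": [67.0, 61.0],
    "hypertension": [0, 1],
    "heart_disease": [1, 0],
    "avg_glucose_level": [228.69, 202.21],
    "bmi": [36.6, np.nan],
    "gender": ["Male", "Female"],
    "stroke": [1, 0],
})

ID_COL = "id"          # assumption: a pure row identifier with no predictive meaning
TARGET_COL = "stroke"  # same name as in the notebook

numeric = toy.select_dtypes(include=[np.number]).columns
# Drop the identifier and the target from the candidate inputs.
candidate_numeric = [c for c in numeric if c not in (ID_COL, TARGET_COL)]
# Treat columns whose observed values are all 0 or 1 as binary indicators.
binary = [c for c in candidate_numeric if set(toy[c].dropna().unique()) <= {0, 1}]
continuous = [c for c in candidate_numeric if c not in binary]

print(binary)      # ['hypertension', 'heart_disease']
print(continuous)  # ['age', 'avg_glucose_level', 'bmi']
```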

## 4️⃣ Step 2: Handle Missing Values

Decisions already justified earlier:
- BMI → median imputation
- BMI missingness → informative → keep flag
- Categorical NaN → "Unknown"

We now apply them systematically.


In [3]:
# BMI missing flag
df["bmi_missing_flag"] = df["bmi"].isna().astype(int)

# Median imputation
df["bmi_imputed"] = df["bmi"].fillna(df["bmi"].median())

# Categorical missing handling
for col in categorical_features:
    df[col] = df[col].fillna("Unknown")
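
The order inside this cell matters: the flag has to be computed before `fillna`, because imputation erases the information about which rows were missing. The toy example below (four made-up BMI values, not taken from the dataset) walks through the same two steps so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, 40.0]})

# 1) Record missingness first; after fillna this information is gone.
toy["bmi_missing_flag"] = toy["bmi"].isna().astype(int)

# 2) Impute with the median of the observed values (NaN is skipped by default).
bmi_median = toy["bmi"].median()
toy["bmi_imputed"] = toy["bmi"].fillna(bmi_median)

print(bmi_median)                        # 30.0
print(toy["bmi_missing_flag"].tolist())  # [0, 1, 0, 0]
print(toy["bmi_imputed"].tolist())       # [20.0, 30.0, 30.0, 40.0]
```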


## 5️⃣ Step 3: Distribution-Aware Transformations

From Notebook 3:
- avg_glucose_level → strongly right-skewed
- bmi_imputed → right-skewed
- age → acceptable as-is

We apply log transformation where justified.


In [4]:
df["log_avg_glucose_level"] = np.log1p(df["avg_glucose_level"])
df["log_bmi_imputed"] = np.log1p(df["bmi_imputed"])
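
To see what `np.log1p` does to a right-skewed column, the sketch below compares skewness before and after the transform. It uses synthetic log-normal values, not the stroke dataset, so the printed numbers are only illustrative; `raw` and `logged` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values (log-normal); not the stroke dataset.
raw = pd.Series(rng.lognormal(mean=4.5, sigma=0.5, size=5000))
logged = np.log1p(raw)  # log(1 + x), defined at x = 0 as well

skew_before = raw.skew()
skew_after = logged.skew()
print(round(skew_before, 2), round(skew_after, 2))

# log1p is invertible with expm1, so the original units can be recovered.
recovered = np.expm1(logged)
print(np.allclose(recovered, raw))  # True
```

The inverse matters for the "keep original features for audit" rule: a transformed value can always be mapped back to its original scale.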


## 6️⃣ Step 4: Select Final Numeric Features for Modeling

We now decide which numeric features move forward.

Rule:
- Keep original features for audit
- Use transformed features for modeling


In [5]:
final_numeric_features = [
    "age",
    "log_avg_glucose_level",
    "log_bmi_imputed",
    "bmi_missing_flag"
]

final_numeric_features


['age', 'log_avg_glucose_level', 'log_bmi_imputed', 'bmi_missing_flag']

## 7️⃣ Step 5: Feature Scaling (Standardization)

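Why StandardScaler?
- The selected features are continuous (plus one binary flag)
- Distance-based and gradient-based models, used later, are sensitive to feature scale
- Mean-centering and unit variance help gradient-based optimization converge

⚠️ Scaling is applied ONLY to the columns in `final_numeric_features`. This list includes the binary `bmi_missing_flag`, so the flag is standardized as well.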


In [6]:
scaler = StandardScaler()

df_scaled = df.copy()
df_scaled[final_numeric_features] = scaler.fit_transform(
    df[final_numeric_features]
)

df_scaled[final_numeric_features].head()


|   | age | log_avg_glucose_level | log_bmi_imputed | bmi_missing_flag |
|---|---|---|---|---|
| 0 | 1.051434 | 2.324024 | 1.045711 | -0.202349 |
| 1 | 0.78607 | 1.982522 | 0.022638 | 4.941952 |
| 2 | 1.62639 | 0.192196 | 0.584773 | -0.202349 |
| 3 | 0.255342 | 1.521365 | 0.80501 | -0.202349 |
| 4 | 1.582163 | 1.567759 | -0.583632 | -0.202349 |
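
The cell above fits the scaler on the full DataFrame, which is fine for a preprocessing walkthrough. Once a train/test split is introduced, a common practice is to learn the mean and standard deviation from the training rows only and reuse them on held-out rows. The sketch below shows that pattern on synthetic data; the split, the array `X_toy`, and the variable names are assumptions for illustration and are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features

X_train, X_test = train_test_split(X_toy, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; no refit

print(X_train_scaled.shape, X_test_scaled.shape)  # (150, 2) (50, 2)
```

The held-out rows will generally not have a mean of exactly 0 after this, which is expected: they are scaled with the training statistics, not their own.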


## 8️⃣ Verify Scaling (Sanity Check)

After Standardization:
- Mean ≈ 0
- Std ≈ 1


In [7]:
df_scaled[final_numeric_features].describe().loc[["mean", "std"]]

|   | age | log_avg_glucose_level | log_bmi_imputed | bmi_missing_flag |
|---|---|---|---|---|
| mean | 5.005781e-17 | -2.280411e-16 | -1.037309e-15 | -5.561978e-18 |
| std | 1.000098 | 1.000098 | 1.000098 | 1.000098 |
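
The `std` row reads 1.000098 rather than exactly 1. This comes from two different definitions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). With n = 5110 rows the ratio between them is sqrt(5110 / 5109) ≈ 1.000098. The sketch below reproduces this on synthetic data of the same length.

```python
import numpy as np
import pandas as pd

n = 5110  # same row count as the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=n)  # synthetic values

# Standardize the way StandardScaler does: population std (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)            # numpy default
sample_std = pd.Series(z).std()    # pandas default is ddof=1
expected = np.sqrt(n / (n - 1))

print(round(pop_std, 6))     # 1.0
print(round(sample_std, 6))  # 1.000098
print(round(expected, 6))    # 1.000098
```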


## 9️⃣ Final Dataset Snapshot (UNIT I Output)

At the end of UNIT I:
- Data is clean
- Missing values handled
- Skewness reduced
- Features scaled
- Ready for modeling

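Let’s separate features and target. Note that `X` contains only the columns in `final_numeric_features`; the categorical columns are not included here.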


In [8]:
X = df_scaled[final_numeric_features]
y = df_scaled[TARGET_COL]

X.shape, y.shape
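
Checking shapes is a start, but "ready for modeling" can be tested a little more explicitly. Below is a minimal sketch of such a check. The helper `check_model_ready` is hypothetical (it is not part of pandas or scikit-learn), and the particular checks it runs are an assumption about what "model-ready" should mean at this stage.

```python
import numpy as np
import pandas as pd

def check_model_ready(X: pd.DataFrame, y: pd.Series) -> list:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if len(X) != len(y):
        problems.append("X and y have different numbers of rows")
    if X.isna().any().any():
        problems.append("X contains missing values")
    non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns in X: {non_numeric}")
    if y.isna().any():
        problems.append("y contains missing values")
    return problems

# Made-up frames: one that passes, one that fails two checks.
X_ok = pd.DataFrame({"age": [0.1, -0.3], "flag": [0.0, 1.0]})
y_ok = pd.Series([0, 1])
print(check_model_ready(X_ok, y_ok))  # []

X_bad = pd.DataFrame({"age": [0.1, np.nan], "gender": ["Male", "Female"]})
print(check_model_ready(X_bad, y_ok))
```

In the notebook this would be called as `check_model_ready(X, y)` right after the cell above.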


## 10️⃣ What We Achieved in UNIT I (Explicit Summary)

UNIT I taught you HOW TO THINK about data.

You learned:
- Why data understanding comes first
- Why missing values are dangerous if ignored
- Why the median is preferred over the mean for imputing skewed data
- Why transformations come before scaling
- Why scaling does NOT fix skewness
- Why preprocessing order matters

You did NOT blindly apply functions.
You justified every decision.
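
The claim that scaling does not fix skewness can be verified directly: standardization is a shift and a positive rescale, and skewness is unchanged by both. The sketch below demonstrates this on synthetic exponential data (not the stroke dataset); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
raw = pd.Series(rng.exponential(scale=2.0, size=2000))  # synthetic, right-skewed

# Same formula StandardScaler applies: subtract the mean, divide by the population std.
standardized = (raw - raw.mean()) / raw.std(ddof=0)

skew_raw = raw.skew()
skew_scaled = standardized.skew()

print(np.isclose(skew_raw, skew_scaled))  # True: scaling leaves skewness unchanged
```

This is why the transformation step has to come before scaling: only the transformation changes the shape of the distribution.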


## 11️⃣ What We Explicitly Did NOT Do (By Design)

❌ Feature selection  
❌ Feature importance  
❌ Interaction features  
❌ Aggregation features  
❌ Class imbalance handling  
❌ Model training  

These belong to:
➡️ UNIT II  
➡️ Or a more complex dataset


## 12️⃣ UNIT I Mental Model (This Is the Takeaway)

Before building any ML model, always ask:

1. Do I understand my data?
2. Are missing values handled correctly?
3. Are distributions reasonable?
4. Did I transform before scaling?
5. Is my data model-ready?

If the answer is NO to any:

➡️ Stop. Fix preprocessing first.


## 13️⃣ Transition to UNIT II

UNIT I answered:
> “Is my data ready for modeling?”

UNIT II question:
> “Which model should I choose, and why?”

