Day 4: Pipelines & Preprocessing
🎯 Objective of the day

Learn how to build scikit-learn pipelines.

Automatically handle preprocessing + model training in one step.

Work with both numeric and categorical features.

📝 Notes

Pipelines help you avoid data leakage: preprocessing steps are fitted on the training data only and then applied unchanged to the test data.

Cleaner code → preprocessing + model in one object.

You can directly use pipelines with GridSearchCV later for hyperparameter tuning.
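To make the leakage note concrete, here is a minimal sketch on made-up synthetic data (not the Titanic set). The names `X_demo`, `y_demo` and `demo_pipe` are hypothetical and exist only for this illustration. It shows that a scaler inside a fitted pipeline has learned its mean from the training rows alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption, for illustration only): one numeric feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y_demo = (X_demo[:, 0] > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)  # the scaler is fitted on X_tr only

# The mean stored in the scaler matches the training rows, not the full data.
learned_mean = demo_pipe.named_steps["scaler"].mean_[0]
print("learned mean:   ", learned_mean)
print("train-only mean:", X_tr[:, 0].mean())
print("full-data mean: ", X_demo[:, 0].mean())
```

When `demo_pipe.predict(X_te)` or `demo_pipe.score(X_te, y_te)` is called, the test rows are scaled with that stored training mean, so no test-set statistics reach the model.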

One-Hot Encoding (OHE)

Machine learning models need numbers, but categories like "male" / "female" or "C" / "Q" / "S" are not numeric.

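If we just turned them into numbers (e.g., C=0, Q=1, S=2), the model would wrongly assume an order (as if S > Q > C). For a two-category feature like sex, a single 0/1 column is harmless, but the problem appears as soon as there are three or more categories.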

Instead, OneHotEncoder creates new binary columns, one for each category:
Example:

sex
-----
male    →  [1, 0]
female  →  [0, 1]


For 3 embarkation ports (C, Q, S):

embarked = "Q" → [0, 1, 0]


So OHE makes categorical data usable without implying false numerical order.
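The embarkation example above can be reproduced with a short sketch. The toy array `ports` is an assumption made for illustration. Note that scikit-learn stores categories in sorted order, so the column order follows that sorting rather than the order in which values first appear.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with the three embarkation ports (illustrative data).
ports = np.array([["C"], ["Q"], ["S"], ["S"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(ports)

# Categories are kept in sorted order, which fixes the column order.
print(encoder.categories_[0])   # ['C' 'Q' 'S']

# transform returns a sparse matrix by default; toarray() makes it dense.
q_row = encoder.transform(np.array([["Q"]], dtype=object)).toarray()
print(q_row)                    # [[0. 1. 0.]]

# With handle_unknown="ignore", a category not seen during fit
# is encoded as an all-zero row instead of raising an error.
unknown_row = encoder.transform(np.array([["X"]], dtype=object)).toarray()
print(unknown_row)              # [[0. 0. 0.]]
```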

Numeric pipeline: fill missing with median, scale.

Categorical pipeline: fill missing with most frequent, one-hot encode (convert categories to binary columns).

ColumnTransformer: applies the right pipeline to the right columns and merges results.

Final Pipeline: combines preprocessing + model into one, so train/test always get identical transformations automatically.

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

import seaborn as sns
df = sns.load_dataset("titanic")

# Select features
X = df[["pclass", "sex", "age", "fare", "embarked"]]
y = df["survived"]

# Numeric features: fill missing with median + scale
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical features: fill missing with most_frequent + one-hot encode
categorical_features = ["pclass", "sex", "embarked"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))



Accuracy: 0.7877094972067039


📊 Exercise of the Day

What accuracy did you get using the pipeline?

Compare this with your earlier Titanic Logistic Regression (without a pipeline). Did handling preprocessing properly improve results?

Why are pipelines useful in real-world projects?

1) Accuracy with pipeline: 0.79

2) Logistic no pipeline: 0.76

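The pipeline version scored slightly higher in this run, which may reflect the more complete preprocessing (imputation, scaling, one-hot encoding) rather than the Pipeline object itself.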

3) Pipelines combine data preprocessing and model training into a single object. Every time the model is trained or evaluated, the same preprocessing steps are applied, and they are fitted on the training data only, which avoids data leakage.

🌟 Mini-Challenge

Swap the Logistic Regression model for a Random Forest Classifier inside the pipeline.

Compare performance with Logistic Regression.
👉 Which one works better on Titanic? Why might that be?

In [14]:
# Same preprocessing as before; only the final step (the model) changes
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Same split as in the previous cell (same random_state), so the scores are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.7821229050279329


- Logistic Regression Accuracy: 0.7877094972067039

- Random Forest Accuracy: 0.7821229050279329

The accuracies are almost identical. Logistic regression is slightly better, by a margin that corresponds to a single passenger in the 179-row test set. This might be because many of the features carry a roughly linear signal. For example:

- Being female increases survival odds.
- Being in first class increases survival odds.
- Higher fare increases survival odds.

✨ Key Insight

Pipelines don’t just make your code cleaner → they’re essential for real-world ML because:

They make results easier to reproduce, since every preprocessing step is recorded in one object.

They integrate seamlessly with GridSearchCV (you’ll use that soon).

They allow you to deploy models without worrying about inconsistent preprocessing.
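As a preview of the GridSearchCV point, here is a minimal sketch on synthetic data. The data, the pipeline `demo_pipe` and the grid values are assumptions chosen for illustration, not tuned settings. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the label depends on the first two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# "classifier__C" targets the C parameter of the step named "classifier".
param_grid = {"classifier__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Because the whole pipeline is passed to the search, the scaler is refitted inside every cross-validation fold, so the validation rows of a fold never influence the preprocessing.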