# 04 — Pipelines and ColumnTransformer

In real-world machine learning, the preprocessing fitted on the training data must be applied, unchanged, to new and unseen data. Fitting imputers, scalers, or encoders on anything other than the training set leaks information into the evaluation.

We achieve this using:
- `Pipeline` to chain preprocessing + model steps
- `ColumnTransformer` to apply different transformations to numeric and categorical columns
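
As a minimal sketch of why this matters, the toy example below (made-up values, not the Titanic data; all `*_toy` names are hypothetical) wraps a `StandardScaler` and a `LogisticRegression` in a `Pipeline`. Calling `fit` learns the scaling statistics from the training rows only, and `predict` reuses them on new rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one feature, binary target.
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train_toy = np.array([0, 0, 1, 1])
X_new_toy = np.array([[100.0]])  # an "unseen" row far outside the training range

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
toy_pipe.fit(X_train_toy, y_train_toy)

# Predicting on new data only transforms it; the scaler is not refit.
new_pred = toy_pipe.predict(X_new_toy)

# The scaler's mean still reflects the four training rows.
learned_mean = toy_pipe.named_steps["scaler"].mean_[0]
print(learned_mean, new_pred)
```

The same holds for every step in a longer pipeline: each transformer is fitted once, inside `fit`, and only applied inside `predict`.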


Step 1 — Reload Data

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# Drop identifier-like and free-text columns that are not used as features here
df = df.drop(columns=["Cabin", "Ticket", "Name", "PassengerId"])

# Separate the features from the target
X = df.drop(columns=["Survived"])
y = df["Survived"]


Step 2 — Train Test Split

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


Step 3 — Identify Data Types

In [3]:
# Numeric columns will be imputed and scaled; text columns will be imputed and one-hot encoded
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns
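
To see what this selection returns, here is a small sketch on a hypothetical three-column frame. The column names mirror the Titanic data, but the values and the `toy*` names are made up. Dtypes are set explicitly because `include=['object']` relies on text columns being stored with the `object` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; dtypes are set explicitly so the result does not
# depend on platform or pandas defaults.
toy = pd.DataFrame({
    "Pclass": pd.Series([3, 1, 2], dtype="int64"),
    "Age": pd.Series([22.0, np.nan, 35.0], dtype="float64"),
    "Embarked": pd.Series(["S", "C", np.nan], dtype="object"),
})

# Same selection logic as above, converted to plain lists for printing.
toy_numeric = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
toy_categorical = toy.select_dtypes(include=["object"]).columns.tolist()

print(toy_numeric)      # numeric column names
print(toy_categorical)  # text column names
```

Note that a class code such as `Pclass` is picked up as numeric because of its integer dtype. Whether to scale it or to one-hot encode it instead is a modelling choice, not something `select_dtypes` can decide.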


Step 4 — Build Preprocessing Transformers

In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

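# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill missing values with the mode, then one-hot encode;
# categories unseen during fit are encoded as all zeros instead of raising
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])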


In [5]:
preprocessor

Step 5 — Create Final ML Pipeline

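Try three different classifiers behind the same preprocessor and compare their test accuracy. On this split the scores land within roughly one percentage point of each other, so treat any ranking with caution.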

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

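models = {
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    # probability=True is only needed for predict_proba, which is not used below
    "SVC": SVC(kernel="rbf", probability=True)
}

for name, clf in models.items():
    # Same preprocessing for every model; only the final estimator changes
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", clf)
    ])
    # fit() fits the preprocessor on X_train only, then trains the model
    pipe.fit(X_train, y_train)
    # predict() applies the already-fitted preprocessing to X_test
    pred = pipe.predict(X_test)
    print(f"{name} Accuracy:", accuracy_score(y_test, pred))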


LogisticRegression Accuracy: 0.8044692737430168
RandomForest Accuracy: 0.8156424581005587
SVC Accuracy: 0.8156424581005587
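
The gap between the lowest and highest score above is about one percentage point on a single test split, so the ordering may not be stable. One way to check, sketched below, is k-fold cross-validation. Because the whole `Pipeline` is passed to `cross_val_score`, the imputers, scaler, and encoder are refit inside each fold. The sketch runs on synthetic data so that it is self-contained; the frame, its values, and the `demo_*` names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (random values, not real data).
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "Age": rng.normal(30, 10, n),
    "Fare": rng.exponential(30, n),
    "Sex": pd.Series(rng.choice(["male", "female"], n), dtype="object"),
})
demo_X.loc[::10, "Age"] = np.nan  # inject some missing ages

# Target loosely tied to Sex, with about 20% of labels flipped as noise.
demo_y = (demo_X["Sex"] == "female").astype(int).to_numpy()
flip = rng.random(n) < 0.2
demo_y = np.where(flip, 1 - demo_y, demo_y)

demo_pre = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Fare"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Sex"]),
])
demo_pipe = Pipeline([
    ("preprocess", demo_pre),
    ("model", LogisticRegression(max_iter=2000)),
])

# Five folds; preprocessing is refit on each fold's training portion.
scores = cross_val_score(demo_pipe, demo_X, demo_y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

In the notebook itself, the same call would take each `pipe` from the loop together with `X_train` and `y_train`, keeping `X_test` aside for a final check.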
