# 03 — Feature Engineering and Scaling

Feature engineering often improves model performance more than switching to a different model.

In this notebook we:
- Identify and remove low-variance (uninformative) features
- Compare different feature scaling techniques
- Introduce polynomial and interaction features for non-linear relationships


Step 1 — Reload and Split Data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Drop columns we will not use as features, then separate the target
df = df.drop(columns=["Cabin", "Ticket", "Name", "PassengerId"])
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Stratify on y so train and test keep the same survival rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
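The effect of `stratify=y` in the split above can be checked on a small synthetic example. This is only a sketch: `X_toy`, `y_toy`, and the 60/40 class balance are made up for illustration and are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep (close to) the original class rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("train positive rate:", train_rate)
print("test positive rate:", test_rate)
```

Without `stratify`, a small test set can end up with a noticeably different class balance from the training set, which makes accuracy harder to interpret.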


Step 2 — Separate numerical and categorical columns

In [3]:
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns
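The dtype-based split above can be tried on a toy frame. This sketch uses the broader `"number"` selector, which is an alternative to listing `int64` and `float64` explicitly, not what the notebook itself does. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same kinds of columns as the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 8.05],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# "number" matches every numeric dtype; exclude="number" keeps the rest
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```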


Step 3 — Standard Scaling vs Min-Max Scaling

In [4]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler_std = StandardScaler()
scaler_minmax = MinMaxScaler()

X_train_std = scaler_std.fit_transform(X_train[numeric_features])
X_train_minmax = scaler_minmax.fit_transform(X_train[numeric_features])

pd.DataFrame(X_train_std).head(), pd.DataFrame(X_train_minmax).head()


(          0         1         2         3         4
 0  0.829568       NaN -0.465084 -0.466183  0.513812
 1 -0.370945       NaN -0.465084 -0.466183 -0.662563
 2 -1.571457       NaN -0.465084 -0.466183  3.955399
 3  0.829568 -0.815864 -0.465084  0.727782 -0.467874
 4 -0.370945  0.082384  0.478335  0.727782 -0.115977,
      0         1      2         3         4
 0  1.0       NaN  0.000  0.000000  0.110272
 1  0.5       NaN  0.000  0.000000  0.000000
 2  0.0       NaN  0.000  0.000000  0.432884
 3  1.0  0.220910  0.000  0.166667  0.018250
 4  0.5  0.384267  0.125  0.166667  0.051237)

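In the output above, you should be able to see that:

StandardScaler → centers each column around 0 (and scales it to unit variance)

MinMaxScaler → rescales each column to the range 0 to 1

The NaN entries in column 1 (Age) are missing values; both scalers leave them in place, so they still need imputing (done in Step 5).

The goal is not to memorize this.
The goal is to recognize the pattern in the numbers.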
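To make the two transformations concrete, here is a minimal sketch that reproduces both scalers by hand on a tiny made-up column. The arrays `train` and `test` are hypothetical. It also shows the usual practice of fitting on training data and only transforming test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

std_scaler = StandardScaler().fit(train)
mm_scaler = MinMaxScaler().fit(train)

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
manual_std = (train - train.mean(axis=0)) / train.std(axis=0)

# MinMaxScaler: (x - min) / (max - min)
manual_mm = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

matches_std = np.allclose(std_scaler.transform(train), manual_std)
matches_mm = np.allclose(mm_scaler.transform(train), manual_mm)

# Test data is transformed with statistics learned from train only,
# so a test value above the training max lands outside [0, 1]
test_mm = mm_scaler.transform(test)
print(matches_std, matches_mm, test_mm)
```

Because the scaler statistics come from the training set only, test values can fall outside the 0 to 1 range under min-max scaling. That is expected behavior rather than a bug.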

Step 4 — Low-Variance Feature Removal (optional, but good practice)

In [5]:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.0)
selector.fit(X_train[numeric_features])

numeric_features[selector.get_support()]


Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

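With threshold=0.0, only constant columns (zero variance) are removed. Here all five numeric features are kept.
If some columns get removed → good.
If none → also fine.
You are learning to analyze the data, not to force a result.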
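Since nothing is removed on the Titanic features, a toy matrix with one constant column shows what the selector does when it has something to drop. `X_toy` is hypothetical data for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: column 0 is constant, columns 1 and 2 vary
X_toy = np.array([
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_toy)

# Boolean mask of the columns that survived
kept = selector.get_support()
print(kept)
print(X_reduced.shape)
```

Raising `threshold` above 0 would also drop near-constant columns, but variance depends on the feature's scale, so the value has to be chosen with the units in mind.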

Step 5 — Polynomial Features (a possible performance booster)

In [7]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer

# First impute missing values
imputer = SimpleImputer(strategy="median")
X_train_num_imputed = imputer.fit_transform(X_train[numeric_features])

# Then apply polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_num_imputed)

X_train_poly.shape


(712, 20)

The point here is:

A linear model fit on polynomial features can capture nonlinear relationships in the original inputs

But the feature count grows fast (here 5 → 20 at degree 2) → must be used carefully

Step 6 — Compare Model Performance (Standard vs Poly)

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# 1) Without polynomial features
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer([
    ("num", num_pipe, numeric_features),
    ("cat", cat_pipe, categorical_features)
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=2000))
])

model.fit(X_train, y_train)
pred1 = model.predict(X_test)
print("Baseline Accuracy:", accuracy_score(y_test, pred1))


# 2) With degree-2 polynomial features (squares and interaction terms)
poly_num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler())
])

preprocess_poly = ColumnTransformer([
    ("num", poly_num_pipe, numeric_features),
    ("cat", cat_pipe, categorical_features)
])

model_poly = Pipeline([
    ("preprocess", preprocess_poly),
    ("clf", LogisticRegression(max_iter=4000))
])

model_poly.fit(X_train, y_train)
pred2 = model_poly.predict(X_test)
print("Polynomial Accuracy:", accuracy_score(y_test, pred2))


Baseline Accuracy: 0.8044692737430168
Polynomial Accuracy: 0.8100558659217877


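In this run the polynomial model is only marginally ahead (0.810 vs 0.804), a gap too small to count as a clear improvement on a test set this size. If polynomial accuracy decreases, that can be a sign of overfitting, which we will address later with regularization and cross-validation (CV).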
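To put the two printed accuracies in perspective, they can be converted back into counts of correct predictions. This sketch assumes the test set has 179 rows, which follows from an 891-row dataset with `test_size=0.2` and the 712 training rows shown in Step 5. `n_test` is not printed anywhere in the notebook.

```python
# Assumption: 891 total rows minus 712 training rows leaves 179 test rows
n_test = 179

baseline_acc = 0.8044692737430168
poly_acc = 0.8100558659217877

# Accuracy is correct / n_test, so multiplying back recovers the counts
baseline_correct = round(baseline_acc * n_test)
poly_correct = round(poly_acc * n_test)
extra_correct = poly_correct - baseline_correct

print(baseline_correct, poly_correct, extra_correct)  # 144 145 1
```

Under that assumption, the whole difference between the two models is a single test passenger, which a different `random_state` could easily reverse.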