# 🎯 Feature Engineering - Part A: Individual Concepts (Colab-Ready)

**Updated:** 2025-08-22

This notebook is designed for **first-time learners**. You will practice each feature engineering step **individually** (no pipelines yet), so you can clearly see *what each step does* and *why it matters*.

**What you'll practice:**
- Dataset loading & quick audit
- Handling missing values (drop, impute)
- Scaling & normalization (standardization, min-max, per-row normalization)
- Encoding categorical variables (ordinal vs one-hot)
- Feature transformations (log, power, polynomial)
- Simple dimensionality reduction (PCA) for visualization
- Short exercises after each section

> Use this Part A first. After you are comfortable, move to **Part B (Pipelines)** to automate and combine steps.

## 0) Setup

In [None]:
# If running in Google Colab, you can install optional packages here:
# !pip install -q statsmodels==0.14.2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, OrdinalEncoder
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

pd.set_option('display.max_columns', 100)

## 1) Dataset Setup & Quick Audit

In [None]:
# Option A: Load Titanic from a stable GitHub mirror (recommended for first run)
URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(URL)
print("Shape:", df.shape)
df.head()

In [None]:
# Option B: Upload your own CSV (uncomment to use in Colab)
# from google.colab import files
# up = files.upload()  # pick file
# import io
# df = pd.read_csv(io.BytesIO(up[list(up.keys())[0]]))
# print("Shape:", df.shape)
# df.head()

In [None]:
# Quick audit
print("\nInfo:")
df.info()
print("\nMissing values per column:")
print(df.isna().sum().sort_values(ascending=False))
print("\nNumeric describe:")
df.describe().T

## 2) Handling Missing Values (Individually)

**Goal:** Learn when to **drop** vs **impute**.

**Common choices**
- Numeric: mean/median
- Categorical: most frequent

We'll practice on Titanic columns like `Age`, `Embarked`, and `Cabin`.
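
The same "common choices" can also be expressed with scikit-learn's `SimpleImputer`, which the Setup cell imports. The snippet below is a minimal sketch on a made-up five-row frame; the `toy` data and its column names are illustrative assumptions, not Titanic columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 80.0, 25.0],
    "port": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object),
})

# Numeric: fill with the median of the observed values
num_imputer = SimpleImputer(strategy="median")
toy["age_filled"] = num_imputer.fit_transform(toy[["age"]]).ravel()

# Categorical: fill with the most frequent observed value
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["port_filled"] = cat_imputer.fit_transform(toy[["port"]]).ravel()

print(toy)
```

The fitted imputer stores the fill value it learned, which is why the same object can later be reused on new rows.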

In [None]:
# View null counts
df.isna().sum().sort_values(ascending=False).head(10)

In [None]:
# 2.1 DROP example (use cautiously)
df_drop_rows = df.dropna(subset=['Age', 'Embarked'])  # drop rows where these are null
print("Original:", df.shape, "After drop:", df_drop_rows.shape)

In [None]:
# 2.2 SIMPLE IMPUTE example
df_imp = df.copy()
# Numeric (Age): median
df_imp['Age'] = df_imp['Age'].fillna(df_imp['Age'].median())
# Categorical (Embarked): most frequent
df_imp['Embarked'] = df_imp['Embarked'].fillna(df_imp['Embarked'].mode()[0])

# 'Cabin' is very sparse; we can fill with "Unknown"
df_imp['Cabin'] = df_imp['Cabin'].fillna('Unknown')

df_imp.isna().sum().head(10)

In [None]:
# 2.3 KNN Imputation (numeric only demonstration)
num_cols = df.select_dtypes(include=['number']).columns.tolist()
knn_df = df[num_cols].copy()
imputer = KNNImputer(n_neighbors=3)
knn_imputed = imputer.fit_transform(knn_df)
knn_imputed_df = pd.DataFrame(knn_imputed, columns=num_cols)
knn_imputed_df.head()

**📝 Exercise 2**
1) Compare **mean vs median** imputation for `Age`. Which preserves the original distribution better?  
2) For `Embarked`, try filling with a new category (`'Unknown'`) vs mode. What changes in `value_counts()`?

**`Embarked`: mode vs `'Unknown'`**

- **Filling with the mode:** `df['Embarked_mode'] = df['Embarked'].fillna(df['Embarked'].mode()[0])` fills the gaps with the most frequent category (likely "S"). In `value_counts()`, the count for "S" increases, so the distribution is slightly biased toward the majority class.
- **Filling with `'Unknown'`:** `df['Embarked_unknown'] = df['Embarked'].fillna('Unknown')` keeps missingness visible as a separate category. In `value_counts()`, a new "Unknown" category appears and the original counts for "C", "Q", and "S" are preserved. This is useful for models that can handle categorical variables, since "Unknown" may carry information.

**`Age`: mean vs median**

- **Mean imputation:** `df['Age_mean'] = df['Age'].fillna(df['Age'].mean())` replaces missing ages with the average of all available ages. The mean is sensitive to outliers (e.g., very high ages like 80–90). Filling with it tends to shift the distribution towards the center, reducing variability, and the histogram shows a spike near the mean.
- **Median imputation:** `df['Age_median'] = df['Age'].fillna(df['Age'].median())` replaces missing ages with the middle value, which is more robust to skewness and outliers. The histogram shows a spike at the median, but with less distortion.

Median imputation preserves the original distribution better because it's less affected by skewness and extreme values.
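
The effect on variability can be checked numerically. The sketch below uses a small made-up sample (the `ages` values are hypothetical, chosen to include one outlier) and compares the fill values and the standard deviation before and after each kind of fill.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one outlier and two missing entries
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0, 80.0, np.nan, np.nan])

mean_fill = ages.fillna(ages.mean())
median_fill = ages.fillna(ages.median())

print("fill value (mean):  ", ages.mean())
print("fill value (median):", ages.median())
print("std of observed values:", ages.std())
print("std after mean fill:   ", mean_fill.std())
print("std after median fill: ", median_fill.std())
```

Any constant fill shrinks the spread, because the filled rows all share one value. The difference is where that value lands: the outlier pulls the mean upward, while the median stays with the bulk of the data.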

## 3) Scaling & Normalization (Individually)

- **Standardization**: z = (x - mean)/std, giving each feature mean 0 and unit variance (good for many ML models)
- **MinMax scaling**: x' = (x - min)/(max - min), mapping each feature to [0,1] (useful when features have different units)
- **Per-row Normalization**: scales each *row vector* to unit norm (useful for text-like frequency vectors)

We'll demonstrate on `Fare` and `Age`.
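
To make the three definitions concrete, here is a minimal sketch that computes each one by hand with NumPy and checks it against the corresponding scikit-learn class. The 3x2 matrix `X_demo` is a made-up example, not Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Hypothetical two-feature matrix (rows = samples, columns = features)
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [9.0, 0.0]])

# Standardization: per column, (x - mean) / std (population std, ddof=0)
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_demo)

# Min-max: per column, (x - min) / (max - min)
mm_manual = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(X_demo)

# Per-row normalization: divide each row by its own L2 norm
row_manual = X_demo / np.linalg.norm(X_demo, axis=1, keepdims=True)
row_sklearn = Normalizer(norm="l2").fit_transform(X_demo)

print("standardization matches:", np.allclose(z_manual, z_sklearn))
print("min-max matches:        ", np.allclose(mm_manual, mm_sklearn))
print("row norm matches:       ", np.allclose(row_manual, row_sklearn))
print(row_sklearn)
```

Note the axis: the first two scalers work column by column, while `Normalizer` works row by row, so rows `[3, 4]` and `[6, 8]` end up identical after normalization.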

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))
axes[0].hist(df_imp['Age'].dropna(), bins=30)
axes[0].set_title('Age - Raw')
axes[1].hist(df_imp['Fare'].dropna(), bins=30)
axes[1].set_title('Fare - Raw')
plt.show()

In [None]:
sc_std = StandardScaler()
sc_mm  = MinMaxScaler()

age_std = sc_std.fit_transform(df_imp[['Age']])
fare_mm = sc_mm.fit_transform(df_imp[['Fare']])

fig, axes = plt.subplots(1, 2, figsize=(10,4))
axes[0].hist(age_std.flatten(), bins=30)
axes[0].set_title('Age - Standardized')
axes[1].hist(fare_mm.flatten(), bins=30)
axes[1].set_title('Fare - MinMax [0,1]')
plt.show()

**📝 Exercise 3**
1) Standardize `Fare` and plot the histogram.  
2) Apply **Normalizer** on `[Age, Fare]` rows and check the first 5 normalized vectors.

In [None]:
# Exercise 3.2: scale each [Age, Fare] row to unit L2 norm
# (Normalizer is already imported in Setup)
df_small = df[['Age', 'Fare']].dropna().copy()

normalizer = Normalizer()  # default norm='l2'
norm_data = normalizer.fit_transform(df_small)

norm_df = pd.DataFrame(norm_data, columns=['Age_norm', 'Fare_norm'])
print(norm_df.head())


In [None]:
# Exercise 3.1: standardize Fare and plot its histogram
# Reuses the `df` loaded in section 1, so there is no need to re-read the CSV
scaler = StandardScaler()
df['Fare_std'] = scaler.fit_transform(df[['Fare']]).ravel()

plt.hist(df['Fare_std'], bins=30, edgecolor='black')
plt.title("Standardized Fare Distribution")
plt.xlabel("Standardized Fare")
plt.ylabel("Frequency")
plt.show()


## 4) Encoding Categorical Variables (Individually)

- **Ordinal/Label encoding**: map categories to integers (assumes order or used with tree models).  
- **One-Hot encoding**: binary column per category (no order assumption).

We'll use `Sex` and `Embarked` as examples.
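
For contrast, here is a small sketch with a column that does have a genuine order. The `sizes` frame and its categories are hypothetical; the point is that `OrdinalEncoder` sorts categories alphabetically unless the order is passed explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a real order: small < medium < large
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal: state the order explicitly so the integer codes respect it
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes["size_ord"] = ord_enc.fit_transform(sizes[["size"]]).ravel()

# One-hot: one indicator column per category, no order implied
size_ohe = pd.get_dummies(sizes["size"], prefix="size", dtype=int)

print(sizes)
print(size_ohe)
```

Ordinal encoding keeps a single column whose values carry the order, while one-hot produces one column per category with exactly one 1 in each row.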

In [None]:
# 4.1 Ordinal encoding demo (note: no real order in Sex/Embarked; this is just to illustrate)
enc = OrdinalEncoder()
ord_demo = df_imp[['Sex','Embarked']].copy()
ord_vals = enc.fit_transform(ord_demo)
pd.DataFrame(ord_vals, columns=['Sex_ord','Embarked_ord']).head()

In [None]:
# 4.2 One-Hot encoding demo with pandas
ohe_embarked = pd.get_dummies(df_imp['Embarked'], prefix='Embarked')
ohe_sex = pd.get_dummies(df_imp['Sex'], prefix='Sex')
encoded_df = pd.concat([df_imp[['Survived','Age','Fare']], ohe_sex, ohe_embarked], axis=1)
encoded_df.head()

**📝 Exercise 4**
1) Compare the **number of features** produced by ordinal vs one-hot for `Embarked`.  
2) Why might one-hot be safer for linear models?

**Why one-hot is safer for linear models**

Ordinal encoding introduces a false sense of order. With C=0, Q=1, S=2, "S" is "greater than" "Q", which is "greater than" "C". A linear regression or logistic regression will treat these codes as numeric distances (e.g., assume "S" is twice "Q"), which is meaningless and can distort coefficients and model interpretation.

One-hot avoids this by treating categories as independent, non-ordered indicators: "C" = [1,0,0], "Q" = [0,1,0], "S" = [0,0,1]. The model assigns separate weights, so there is no fake ordering. That's why one-hot is safer for linear models (and often for tree models too, though trees can sometimes handle ordinal codes fine).

**Number of features for `Embarked`**

- **Ordinal encoding:** `OrdinalEncoder().fit_transform(df_imp[['Embarked']])` maps categories to numbers, e.g. C=0, Q=1, S=2, and produces 1 feature.
- **One-hot encoding:** `OneHotEncoder(sparse_output=False).fit_transform(df_imp[['Embarked']])` (with `OneHotEncoder` imported from `sklearn.preprocessing`) creates a separate binary column for each category, `Embarked_C`, `Embarked_Q`, and `Embarked_S`, and produces 3 features.

Two practical notes: the `sparse_output` argument requires scikit-learn 1.2 or later (older versions call it `sparse`), and the imputed `df_imp` is used because missing values in the raw `df` would be encoded as an extra category.

## 5) Feature Transformation (Individually)

- **Log transform**: t = log1p(x) = log(1 + x) for right-skewed non-negative data (e.g., Fare, which includes zeros).
- **Power transform**: Yeo-Johnson can handle zero/negative values; stabilizes variance.
- **Polynomial features**: create interactions/quadratics for simple non-linear modeling.

We'll use `Fare` and `Age`.

In [None]:
# 5.1 Log transform on Fare (non-negative values; log1p handles zeros)
fare_raw = df_imp['Fare'].dropna().values.reshape(-1,1)
fare_log = np.log1p(fare_raw)

fig, axes = plt.subplots(1, 2, figsize=(10,4))
axes[0].hist(fare_raw.flatten(), bins=30)
axes[0].set_title('Fare - Raw')
axes[1].hist(fare_log.flatten(), bins=30)
axes[1].set_title('Fare - log1p')
plt.show()

In [None]:
# 5.2 Power transform (Yeo-Johnson) on [Age, Fare]
pt = PowerTransformer(method='yeo-johnson')
af = df_imp[['Age','Fare']].dropna()
af_pt = pt.fit_transform(af)

fig, axes = plt.subplots(1, 2, figsize=(10,4))
axes[0].hist(af['Age'].values, bins=30)
axes[0].set_title('Age - Raw')
axes[1].hist(af_pt[:,0], bins=30)
axes[1].set_title('Age - Yeo-Johnson')
plt.show()

In [None]:
# 5.3 Polynomial features on [Age, Fare] (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
af_poly = poly.fit_transform(af[['Age','Fare']])
print("Original shape:", af[['Age','Fare']].shape, " -> With poly:", af_poly.shape)
poly.get_feature_names_out(['Age','Fare'])[:6]

**📝 Exercise 5**
1) Identify one numeric column that is **skewed**. Try both **log** and **power** transforms and compare histograms.  
2) With `PolynomialFeatures(2)`, which new terms are created from `Age` and `Fare`?

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['Age','Fare']].dropna())

print(poly.get_feature_names_out(['Age','Fare']))


In [None]:
# Exercise 5.1: compare log1p and Yeo-Johnson on the right-skewed Fare column
# (np, plt and PowerTransformer are already imported in Setup)
fare = df['Fare'].dropna()

fare_log = np.log1p(fare)

pt = PowerTransformer(method='yeo-johnson')
fare_power = pt.fit_transform(fare.values.reshape(-1, 1)).ravel()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(fare, bins=30, edgecolor='black')
axes[0].set_title("Original Fare")
axes[1].hist(fare_log, bins=30, edgecolor='black')
axes[1].set_title("Log(Fare+1)")
axes[2].hist(fare_power, bins=30, edgecolor='black')
axes[2].set_title("Power Transform (Yeo-Johnson)")
plt.show()


## 6) Simple Dimensionality Reduction (PCA): Visualization Only

We will apply PCA to **numeric** features to reduce to 2D and make a scatter plot colored by `Survived` (if present).

> Note: This is for **intuition/visualization** only in Part A.
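
PCA looks for directions of largest variance, so a feature measured on a large scale can dominate the result. The sketch below illustrates this on synthetic data; the two random features and their scales are assumptions chosen only to make the effect obvious.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features on very different scales
X_demo = np.column_stack([
    rng.normal(0, 1, size=500),      # small-scale feature
    rng.normal(0, 1000, size=500),   # large-scale feature
])

# PCA on the raw values: the large-scale feature dominates the first component
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute on equal footing
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

On Titanic the same thing can happen with columns like `Fare`, or with an identifier such as `PassengerId` if it is left among the numeric features.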

In [None]:

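num_only = df_imp.select_dtypes(include=['number']).dropna()
y = num_only['Survived'] if 'Survived' in num_only.columns else None

# Exclude the target and the row identifier so they neither leak into
# nor dominate the projection, then standardize before PCA
X_num = num_only.drop(columns=['Survived', 'PassengerId'], errors='ignore')

pca = PCA(n_components=2, random_state=42)
Z = pca.fit_transform(StandardScaler().fit_transform(X_num))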

print("Explained variance ratios:", pca.explained_variance_ratio_)


plt.figure(figsize=(6,5))
if y is not None:
   
    idx0 = (y.values == 0)
    idx1 = (y.values == 1)
    plt.scatter(Z[idx0,0], Z[idx0,1], s=10, label='Survived=0')
    plt.scatter(Z[idx1,0], Z[idx1,1], s=10, label='Survived=1')
    plt.legend()
else:
    plt.scatter(Z[:,0], Z[:,1], s=10)
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.title('PCA (numeric only)')
plt.show()

**📝 Exercise 6**
1) Which **two numeric columns** contribute the most variance before PCA (use `df.var()`)?  
2) Try PCA with `n_components=3` and print the cumulative explained variance.

In [None]:
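from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Exercise 6.2: PCA with 3 components
# Build the numeric feature frame here so the cell does not depend on run order;
# drop the target and the row identifier, which are not real features
num_df = (df_imp.select_dtypes(include='number')
                .drop(columns=['Survived', 'PassengerId'], errors='ignore')
                .dropna())

# Scale first (important for PCA)
X_scaled = StandardScaler().fit_transform(num_df)

# PCA with 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Cumulative explained variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance:", pca.explained_variance_ratio_.cumsum())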


In [None]:

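# Exercise 6.1: per-column variance before PCA
# Exclude the row identifier and the target, which are not real features
num_df = (df.select_dtypes(include='number')
            .drop(columns=['PassengerId', 'Survived'], errors='ignore'))

variances = num_df.var().sort_values(ascending=False)
print(variances)

print("Top 2 columns by variance:", variances.index[:2].tolist())

# Fare and Age contribute the most variance.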

## 7) Consolidated Practice (No Pipelines Yet)

Using the operations you've learned, perform a **clean preprocessing** (manually):
1) Impute: `Age` (median), `Embarked` (mode), `Cabin` ('Unknown').  
2) Scale: standardize `Age` and min-max scale `Fare`.  
3) Encode: one-hot `Sex` and `Embarked`.  
4) Transform: log1p `Fare`.  
5) (Optional) PCA on numeric subset for 2D visualization.

Then, answer:
- Which step **changed the data distribution** the most?
- Which encoding produced **more features**, ordinal or one-hot? Why?
- If you trained a simple logistic regression on your manually processed features, what **accuracy** do you get on a 75/25 split? (Optional challenge)
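
One caveat for the optional challenge: if a scaler is fit on all rows before the 75/25 split, statistics from the test rows influence the training features. The sketch below shows the leakage-free order on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for your processed features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(50, 10, size=(200, 2))   # hypothetical features
y_demo = rng.integers(0, 2, size=200)        # hypothetical binary labels

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_s = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_s.shape, X_test_s.shape)
```

For a first manual pass the difference is usually small, but keeping fit and transform separate is the habit that pipelines automate.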

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Reload the raw data from the same mirror used in section 1
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# 1) Impute
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna('Unknown')

# 2) Scale
# Note: for simplicity the scalers and encoder are fit on all rows before the split,
# which leaks test-set statistics; Part B's pipelines address this.
scaler_std = StandardScaler()
df['Age_std'] = scaler_std.fit_transform(df[['Age']]).ravel()

scaler_mm = MinMaxScaler()
df['Fare_mm'] = scaler_mm.fit_transform(df[['Fare']]).ravel()

# 3) Encode (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(drop=None, sparse_output=False)
sex_embarked = ohe.fit_transform(df[['Sex', 'Embarked']])
ohe_cols = ohe.get_feature_names_out(['Sex', 'Embarked'])
df_ohe = pd.DataFrame(sex_embarked, columns=ohe_cols, index=df.index)
df = pd.concat([df, df_ohe], axis=1)

# 4) Transform
df['Fare_log'] = np.log1p(df['Fare'])

# 5) (Optional) PCA on the numeric subset
num_subset = df[['Age_std', 'Fare_mm', 'Fare_log']].dropna()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(num_subset)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Optional challenge: logistic regression on a 75/25 split
X = pd.concat([df[['Age_std', 'Fare_mm', 'Fare_log']], df_ohe], axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Logistic Regression Accuracy:", acc)



1. The log transform on Fare changed the distribution the most, because Fare is highly right-skewed and log1p compresses large values, making the histogram closer to normal.

2. One-hot encoding produced more features. Ordinal encoding uses only 1 column per variable, but one-hot creates a separate binary column for each category to avoid introducing a false order. For Sex (2 classes) and Embarked (3 classes), one-hot gave 5 features, versus 2 for ordinal.

3. Around 0.78–0.82 depending on random state (typical Titanic baseline).

## ✅ What You Should Take Away from Part A

- Each step (imputation, scaling, encoding, transforms) has a **clear purpose** and **visible effect**.  
- You can now apply them **manually** and reason about their impact.  
- Next: move to **Part B (Pipelines)** to **combine & automate** these steps safely (avoid leakage, enable cross-validation, and reproducibility).