Day 5: Feature Engineering
🎯 Objective of the day

Learn how to create, transform, and select features to improve model performance.

Apply these transformations to the Titanic dataset and measure their effect on accuracy.

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Titanic
df = sns.load_dataset("titanic")

# Base features (X is rebuilt further down once the engineered columns exist)
X = df[["pclass", "sex", "age", "fare", "sibsp", "parch", "embarked"]]
y = df["survived"]

# Create New features
# Feature engineering = create features with better signal

# Family size = sibsp + parch
# sibsp = Siblings / Spouses aboard the Titanic
# parch = Parents / Children aboard the Titanic
df["family_size"] = df["sibsp"] + df["parch"]

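# Is child? (under 12 years old)
# Note: a missing age compares as False, so those passengers get is_child = 0
df["is_child"] = (df["age"] < 12).astype(int)

# High fare? (above the median fare of the full dataset, test rows included)
df["high_fare"] = (df["fare"] > df["fare"].median()).astype(int)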

# Update X
X = df[["pclass", "sex", "age", "fare", "embarked", "family_size", "is_child", "high_fare"]]

# Preprocessing and model

# Numeric & categorical features
numeric_features = ["age", "fare", "family_size"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_features = ["pclass", "sex", "embarked", "is_child", "high_fare"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Model pipeline
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy with engineered features:", accuracy_score(y_test, y_pred))



Accuracy with engineered features: 0.8044692737430168


📊 Exercise of the Day

What accuracy did you get before adding new features?

What accuracy did you get after feature engineering?

Which engineered feature do you think contributes the most? Why?

1) Accuracy before adding features: 0.7877094972067039

2) Accuracy after adding features: 0.8044692737430168
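The notebook only shows the run with engineered features, so the baseline number has to come from a separate run. Below is a minimal sketch of a like-for-like comparison. The helper `evaluate` is hypothetical (it is not part of the original notebook); it rebuilds the same pipeline for any column lists and reuses the same split. The synthetic `toy` frame only stands in for the Titanic data so the snippet runs on its own, and treating `sibsp` and `parch` as numeric in the baseline is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def evaluate(df, numeric, categorical, target="survived"):
    """Fit the notebook's pipeline on the given columns and return test accuracy."""
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric_transformer, numeric),
        ("cat", categorical_transformer, categorical),
    ])
    clf = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    X = df[numeric + categorical]
    y = df[target]
    # Same split settings as the notebook, so both runs see the same test rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# Synthetic stand-in data (NOT the Titanic dataset), only to make this runnable
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "sex": rng.choice(["male", "female"], n),
    "age": rng.uniform(1, 70, n),
    "fare": rng.gamma(2.0, 15.0, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
})
toy["survived"] = ((toy["sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)
toy["family_size"] = toy["sibsp"] + toy["parch"]

baseline_acc = evaluate(toy, ["age", "fare", "sibsp", "parch"], ["pclass", "sex"])
engineered_acc = evaluate(toy, ["age", "fare", "family_size"], ["pclass", "sex"])
print("Baseline accuracy:  ", baseline_acc)
print("Engineered accuracy:", engineered_acc)
```

In the notebook, passing the real `df` with the two column lists from the cells above should reproduce the before/after numbers, as long as the baseline was built the same way.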

3) I think family size contributes the most, because it hints at how much help a passenger could get.

- Solo passengers might have had less help.
- Members of a moderate-sized family could have helped each other.
- A big group would have found it difficult to stay together, so its survival odds would be lower.
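This hypothesis can be checked directly by grouping passengers into family-size buckets and comparing survival rates. The sketch below assumes bucket edges of 0 (solo), 1 to 3 (moderate) and 4 or more (large); those cut-offs and the helper name `survival_by_family_group` are illustrative choices, and the rows are toy values, not real Titanic records.

```python
import pandas as pd


def survival_by_family_group(df):
    """Return survival rate and passenger count per family-size bucket."""
    groups = pd.cut(
        df["sibsp"] + df["parch"],
        bins=[-1, 0, 3, float("inf")],          # assumed cut-offs
        labels=["solo", "moderate", "large"],
    )
    return df.groupby(groups, observed=True)["survived"].agg(["mean", "count"])


# Toy rows for illustration only (NOT real Titanic records)
toy = pd.DataFrame({
    "sibsp":    [0, 0, 1, 1, 2, 4, 5, 0, 1, 3],
    "parch":    [0, 0, 0, 2, 1, 2, 2, 0, 1, 4],
    "survived": [0, 1, 1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_by_family_group(toy)
print(rates)
```

Calling `survival_by_family_group(df)` on the notebook's `df` would show whether the solo / moderate / large pattern actually holds in the data.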

The other two engineered features are also useful, but I think they are less necessary. High fare is related to class: the pclass feature already tells us which passengers paid higher fares, so high_fare adds little. is_child is related to age, and children were most likely accompanied by family members who helped them.
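One way to test the claim that high_fare mostly repeats pclass is a normalized cross-tabulation: if nearly all passengers in one class fall on the same side of the median fare, the flag carries little extra information. This is a minimal sketch on toy values (not real Titanic records); the variable name `share` is an illustrative choice.

```python
import pandas as pd

# Toy rows for illustration only (NOT real Titanic records)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "fare":   [80.0, 60.0, 52.0, 26.0, 13.0, 8.0, 7.9, 7.2, 15.5, 9.0],
})
toy["high_fare"] = (toy["fare"] > toy["fare"].median()).astype(int)

# Each row shows, for one class, the fraction of passengers with high_fare = 0 / 1
share = pd.crosstab(toy["pclass"], toy["high_fare"], normalize="index")
print(share)
```

Running the same `pd.crosstab` call on the notebook's `df` shows how strongly the two columns overlap in the real data.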

🌟 Mini-Challenge

Create your own feature (e.g., combine pclass with sex, or make a flag for “large family”).

Add it to the pipeline and check if accuracy improves.
👉 Which custom feature was most useful?

In [5]:
# Load Titanic
df = sns.load_dataset("titanic")

# Base features
X = df[["pclass", "sex", "age", "fare", "sibsp", "parch", "embarked"]]
y = df["survived"]

# Create New features
# Feature engineering = create features with better signal

# Family size = sibsp + parch
# sibsp = Siblings / Spouses aboard the Titanic
# parch = Parents / Children aboard the Titanic
df["family_size"] = df["sibsp"] + df["parch"]

# Is child? (under 12 years old)
df["is_child"] = (df["age"] < 12).astype(int)

# High fare?
df["high_fare"] = (df["fare"] > df["fare"].median()).astype(int)

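# Flag for large family
# Note: the threshold is the median family size. If more than half of the
# passengers travel alone, the median is 0 and this flag is simply the
# opposite of is_alone below.
df["large_family"] = (df["family_size"] > df["family_size"].median()).astype(int)

# Flag for travelling alone
df["is_alone"] = (df["family_size"] == 0).astype(int)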

# Update X
X = df[["pclass", "sex", "age", "fare", "embarked", "family_size", "is_child", "high_fare", "large_family", "is_alone"]]

# Preprocessing and model

# Numeric & categorical features
numeric_features = ["age", "fare", "family_size"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_features = ["pclass", "sex", "embarked", "is_child", "high_fare", "large_family", "is_alone"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Model pipeline
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy with engineered features:", accuracy_score(y_test, y_pred))

Accuracy with engineered features: 0.8044692737430168


I added a large_family flag and an is_alone flag. The flags don't change the accuracy on this split.
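One possible explanation, worth checking on the real data: both flags are derived from family_size, which the model already sees, and a median-based threshold can make them carry the same information. The sketch below uses toy family sizes in which more than half of the values are 0 (an assumption for illustration), so the median is 0 and large_family becomes the exact complement of is_alone. The fixed cut-off of 4 at the end is a hypothetical alternative, not something from the notebook.

```python
import pandas as pd

# Toy family sizes for illustration only; more than half are 0 by construction
toy = pd.DataFrame({"family_size": [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5]})

median_size = toy["family_size"].median()
toy["large_family"] = (toy["family_size"] > median_size).astype(int)
toy["is_alone"] = (toy["family_size"] == 0).astype(int)

# With a median of 0, "larger than the median" just means "not alone"
redundant = bool((toy["large_family"] == 1 - toy["is_alone"]).all())
print("median:", median_size, "| flags are complements:", redundant)

# A fixed cut-off (hypothetical choice: 4 or more relatives) is not tied to the median
toy["large_family_fixed"] = (toy["family_size"] >= 4).astype(int)
print(toy["large_family_fixed"].sum(), "passenger(s) flagged with the fixed cut-off")
```

Printing `df["family_size"].median()` in the notebook is a quick way to see whether this applies to the Titanic data.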

📝 Notes

Feature engineering often improves models more than trying different algorithms.

Good features capture real-world patterns in the data.

Pipelines let you test feature engineering without messing up preprocessing.
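As a sketch of that last point, the feature-engineering step itself can live inside the pipeline by wrapping a plain function in scikit-learn's `FunctionTransformer`. The function `add_family_features` and the synthetic data are illustrative assumptions, not part of the original notebook. With everything in one pipeline, `cross_val_score` can evaluate a feature idea over several splits instead of a single 80/20 split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def add_family_features(X):
    """Return a copy of X with family_size and is_alone columns added."""
    X = X.copy()
    X["family_size"] = X["sibsp"] + X["parch"]
    X["is_alone"] = (X["family_size"] == 0).astype(int)
    return X


# Synthetic stand-in data (NOT the Titanic dataset), only to make this runnable
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "sex": rng.choice(["male", "female"], n),
    "age": rng.uniform(1, 70, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
})
y = ((X["sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

# Columns not listed here (sibsp, parch) are dropped by default
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "family_size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "is_alone"]),
])

model = Pipeline(steps=[
    ("features", FunctionTransformer(add_family_features)),
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validation: every fold re-runs feature creation and preprocessing
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Swapping in a different function for `add_family_features` is then enough to try a new feature without touching the rest of the pipeline.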