# Feature Engineering — Deep Dive
This notebook explores techniques to create, transform, and select features that improve model performance.
Sections:
- Creating features
- Encoding strategies
- Interaction features
- Aggregations & group features
- Temporal features
- Feature selection & importance



## Setup
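Import the libraries and load the Titanic dataset through seaborn (`sns.load_dataset` downloads it on first use, so an internet connection is needed). It serves as the running example.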


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

sns.set_theme()
df = sns.load_dataset('titanic').copy()
df.shape


## 1. Creating simple features
Create `family_size` (relatives aboard, not counting the passenger), `is_alone`, and a bucketized version of `age`.


In [None]:
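# family_size = siblings/spouses + parents/children aboard (the passenger is not counted)
df['family_size'] = df['sibsp'].fillna(0) + df['parch'].fillna(0)
df['is_alone'] = (df['family_size'] == 0).astype(int)
# Bucketize age into right-closed intervals: (0, 12], (12, 18], (18, 35], (35, 60], (60, 200]
# Rows with a missing age get a missing age_group
bins = [0, 12, 18, 35, 60, 200]
labels = ['child', 'teen', 'young_adult', 'adult', 'senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)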
df[['sibsp','parch','family_size','is_alone','age','age_group']].head()
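
Fixed bin edges suit a feature like age, but skewed features are often bucketized by quantile instead. The sketch below contrasts `pd.qcut` with equal-width `pd.cut` on a small made-up `fare` column (not the Titanic data); the bucket count and labels are illustrative assumptions.

```python
import pandas as pd

# Toy fares (made up for illustration), heavily right-skewed
toy = pd.DataFrame({'fare': [7.25, 8.05, 13.0, 26.0, 31.3, 52.0, 71.3, 512.3]})

# qcut derives bin edges from the data so each bucket holds about the same number of rows
toy['fare_bucket'] = pd.qcut(toy['fare'], q=4, labels=['low', 'mid', 'high', 'top'])

# Equal-width bins on the same data: the single outlier stretches the range,
# so almost every row lands in the first bin
toy['fare_equal_width'] = pd.cut(toy['fare'], bins=4, labels=False)

print(toy)
```

Quantile buckets keep every bin populated, which matters if the buckets are later one-hot encoded.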


## 2. Encoding strategies
- One-hot encoding for nominal categories
- Target encoding / mean encoding for high-cardinality categories (compute the means out-of-fold; otherwise each row's own target leaks into its feature)


In [None]:
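# One-hot encoding example
# `columns=` is needed so the integer-coded pclass is expanded too;
# by default get_dummies only encodes object, string, and category columns
X = pd.get_dummies(df[['sex','pclass','age_group']], columns=['sex','pclass','age_group'], drop_first=True)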
X.head()
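
One detail worth checking with `pd.get_dummies` is how it treats integer-coded categories such as `pclass`: by default it expands only object, string, and category columns, and numeric columns pass through unchanged. A minimal sketch on a toy frame (values made up) shows the difference that `columns=` makes.

```python
import pandas as pd

# Toy frame (made up): one string category and one integer-coded category
toy = pd.DataFrame({'sex': ['male', 'female', 'female'], 'pclass': [3, 1, 2]})

# Default behavior: only the string column is expanded; pclass stays one integer column
default = pd.get_dummies(toy, drop_first=True)
print(default.columns.tolist())

# Listing the columns explicitly forces pclass to be one-hot encoded as well
explicit = pd.get_dummies(toy, columns=['sex', 'pclass'], drop_first=True)
print(explicit.columns.tolist())
```

Whether `pclass` should be one-hot encoded or kept as an ordered number is a modeling choice; tree models usually cope with either.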


### 2.1 Target (mean) encoding — with K-fold to reduce leakage
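We'll implement a simple K-fold target encoder for the 'embark_town' column. Each row is encoded with the target mean of its category computed on the other folds, so a row's own label never contributes to its encoded value.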


In [None]:
from sklearn.model_selection import KFold

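def mean_target_encoding(series, target, n_splits=5, seed=42):
    '''Return out-of-fold mean target encodings for `series`.

    `series` and `target` must be aligned row by row (same length, order, and index).
    '''
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    encoded = pd.Series(index=series.index, dtype=float)
    for train_idx, val_idx in kf.split(series):
        # Category means computed on the training folds only
        means = target.iloc[train_idx].groupby(series.iloc[train_idx]).mean()
        # Assign by position; .to_numpy() makes it explicit that no index alignment happens
        encoded.iloc[val_idx] = series.iloc[val_idx].map(means).to_numpy()
    # Categories absent from a training fold fall back to the global mean
    return encoded.fillna(target.mean())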

# Prepare data: drop rows with a missing town or target, then reset the index once
# so KFold's positional indices match the index labels
tmp = df[['embark_town', 'survived']].dropna().reset_index(drop=True)
encoded = mean_target_encoding(tmp['embark_town'], tmp['survived'])
# Show the average out-of-fold encoding per category
mapping = tmp[['embark_town']].join(encoded.rename('emb_enc'))
mapping.groupby('embark_town').agg({'emb_enc':'mean'})
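
Out-of-fold encoding covers the training rows. For held-out data, one common approach is to learn the category means on the training set only and fall back to the global mean for categories never seen in training. The sketch below uses made-up toy data, and `fit_target_means` is a hypothetical helper, not part of pandas or scikit-learn.

```python
import pandas as pd

def fit_target_means(categories, target):
    '''Return (per-category target means, global mean) learned from training rows.'''
    return target.groupby(categories).mean(), target.mean()

# Toy training and test data (made up for illustration)
train = pd.DataFrame({'town': ['A', 'A', 'B', 'B', 'B'], 'y': [1, 0, 1, 1, 0]})
test = pd.DataFrame({'town': ['A', 'B', 'C']})  # 'C' never appears in train

means, global_mean = fit_target_means(train['town'], train['y'])

# Map the learned means onto the test rows; unseen categories get the global mean
test['town_enc'] = test['town'].map(means).fillna(global_mean)
print(test)
```

No test-set labels are involved at any point, which is what keeps the encoding leak-free.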


## 3. Interaction features
Create product, ratio, or difference features, e.g., `fare_per_person = fare / (family_size + 1)`. The `+ 1` counts the passenger and also avoids division by zero for solo travelers.


In [None]:
df['fare_per_person'] = df['fare'] / (df['family_size'] + 1)
df[['fare','family_size','fare_per_person']].head()


## 4. Aggregations & group features
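Compute statistics per group (e.g., mean fare by embark town, survival rate per class) and join them back onto each row. Aggregates of the target, such as a survival rate, leak labels unless they are computed on training data only.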


In [None]:
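# Per-town fare statistics (only the mean is joined back below)
grp = (df.groupby('embark_town')['fare'].agg(['mean', 'median', 'count'])
         .reset_index().rename(columns={'mean': 'emb_fare_mean'}))
# Left merge keeps every passenger; rows with a missing embark_town get NaN
merged = df.merge(grp[['embark_town', 'emb_fare_mean']], on='embark_town', how='left')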
merged[['embark_town','fare','emb_fare_mean']].head()
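
An alternative to aggregate-then-merge is `groupby(...).transform(...)`, which returns one value per original row and keeps the original index. A minimal sketch on toy data (column names and values are made up):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'town': ['S', 'S', 'C', 'C', 'Q'],
    'fare': [10.0, 30.0, 50.0, 70.0, 8.0],
})

# transform returns one value per original row, already aligned to toy's index
toy['town_fare_mean'] = toy.groupby('town')['fare'].transform('mean')

# A row's value relative to its group is a common follow-up feature
toy['fare_vs_town_mean'] = toy['fare'] / toy['town_fare_mean']
print(toy)
```

Because no merge is involved, the row order and index of the frame are untouched.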


## 5. Temporal features (if applicable)
Extract the year, month, day, or hour from datetime columns, and compute the elapsed time between events.


In [None]:
# The Titanic dataset has no datetime columns; the snippet below shows the pattern for a hypothetical `date_column`
# df['date'] = pd.to_datetime(df['date_column'])
# df['year'] = df['date'].dt.year
# df['month'] = df['date'].dt.month
# df['age_days'] = (pd.to_datetime('today') - df['date']).dt.days
pass
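
For a version that actually runs, here is a minimal sketch on a hypothetical event log. The column names and timestamps are made up for illustration and have nothing to do with the Titanic data.

```python
import pandas as pd

# Hypothetical event log; column names and timestamps are made up
events = pd.DataFrame({
    'signup': ['2023-01-15 08:30', '2023-03-02 17:45', '2023-12-31 23:10'],
    'first_purchase': ['2023-01-20 09:00', '2023-03-02 18:15', '2024-01-05 10:00'],
})
events['signup'] = pd.to_datetime(events['signup'])
events['first_purchase'] = pd.to_datetime(events['first_purchase'])

# Calendar components via the .dt accessor
events['signup_year'] = events['signup'].dt.year
events['signup_month'] = events['signup'].dt.month
events['signup_hour'] = events['signup'].dt.hour
events['signup_dayofweek'] = events['signup'].dt.dayofweek  # Monday=0 ... Sunday=6

# Elapsed time between two events, in fractional hours
elapsed = events['first_purchase'] - events['signup']
events['hours_to_purchase'] = elapsed.dt.total_seconds() / 3600
print(events)
```

Using `total_seconds()` rather than `.dt.days` keeps sub-day differences, which `.dt.days` would truncate to whole days.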


## 6. Feature selection & importance
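Use model-based importance or recursive feature elimination (RFE). Here we'll use the impurity-based feature importances of a random forest. These tend to favor continuous and high-cardinality features, so treat the ranking as a rough guide rather than a definitive ordering.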


In [None]:
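# Prepare a simple feature set
fe_df = df.copy()
# Note: medians taken over the full data leak test-set information into training;
# fine for a demo, but fit imputers on the training split in real work
fe_df['age'] = fe_df['age'].fillna(fe_df['age'].median())
fe_df['fare'] = fe_df['fare'].fillna(fe_df['fare'].median())
fe_df['embark_town'] = fe_df['embark_town'].fillna('Unknown')
X = pd.get_dummies(fe_df[['age','fare','family_size','is_alone','fare_per_person','embark_town']], drop_first=True)
# Never impute the target; rows with a missing label should be dropped instead
y = fe_df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Held-out accuracy as a sanity check before reading the importances
print('test accuracy:', accuracy_score(y_test, rf.predict(X_test)))
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)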
importances.head(20)
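
Recursive feature elimination, the other option mentioned above, can be sketched with scikit-learn's `RFE`. The example uses synthetic data from `make_classification` rather than the Titanic features, and the sizes (8 features, 3 kept, 50 trees) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, 3 of them informative
# (shuffle=False keeps the informative ones in columns 0-2)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# RFE refits the model repeatedly, dropping the weakest feature each round (step=1 by default)
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0), n_features_to_select=3)
selector.fit(X_syn, y_syn)

print('selected columns:', selector.support_.nonzero()[0])
print('ranking (1 = kept):', selector.ranking_)
```

RFE is considerably slower than reading importances from a single fit, since it trains one model per elimination round.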


## Exercises
1. Create two new interaction features and test their impact on RandomForest accuracy.
2. Implement K-fold mean encoding for 'class' or 'sex' and compare.
3. Create group-level features by 'pclass' and evaluate their correlation with the target.
