# Feature-engine: 7 Essential Transformations for ML Students

This notebook demonstrates seven feature-engine transformations that every machine learning student should know. We use the Titanic dataset (loaded via seaborn) to illustrate transformations on both numerical and categorical variables. Each transformation comes with a short explanation and a runnable example.

Transformations covered:
1. Mean/Median imputation for numerical variables (MeanMedianImputer)
2. Categorical imputation for missing categorical values (CategoricalImputer)
3. Rare label encoding to group infrequent categories (RareLabelEncoder)
4. One-hot encoding for categorical expansion (OneHotEncoder)
5. Equal-frequency discretisation / binning (EqualFrequencyDiscretiser)
6. Log transformation to reduce skew (LogTransformer)
7. Scaling numerical features (MeanNormalizationScaler)

All transformers are from the feature-engine library and are sklearn-compatible, so they can be included inside sklearn Pipelines.

In [1]:
# Install dependencies (run once)
!uv pip install -q feature-engine seaborn

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import RareLabelEncoder, OneHotEncoder
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.transformation import LogTransformer
from feature_engine.scaling import MeanNormalizationScaler


In [6]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')
df = df[['survived','pclass','sex','age','fare','embarked','deck']].copy()
df.head()

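| | survived | pclass | sex | age | fare | embarked | deck |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 | S | NaN |
| 1 | 1 | 1 | female | 38.0 | 71.2833 | C | C |
| 2 | 1 | 3 | female | 26.0 | 7.925 | S | NaN |
| 3 | 1 | 1 | female | 35.0 | 53.1 | S | C |
| 4 | 0 | 3 | male | 35.0 | 8.05 | S | NaN |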


Quick look: the dataset contains numerical columns (age, fare), categorical columns (sex, embarked, deck), and missing values. We'll demonstrate transformations on a subset of these features.

## 1) Mean/Median imputation for numerical variables (MeanMedianImputer)

When numerical features have missing values, replacing them with the mean or median is a simple baseline strategy. The median is less sensitive to outliers than the mean, which makes it the safer default for skewed variables.

In [7]:
num_imputer = MeanMedianImputer(imputation_method='median', variables=['age'])
df_imputed = num_imputer.fit_transform(df)
# Summary of the original 'age' column, before imputation (count excludes missing values)
df[['age']].describe().loc[['count','mean','std','min','50%','max']]


| | age |
|---|---|
| count | 714.0 |
| mean | 29.699118 |
| std | 14.526497 |
| min | 0.42 |
| 50% | 28.0 |
| max | 80.0 |


Check that missing values in 'age' were replaced by the median:

In [8]:
df_imputed['age'].isna().sum(), df['age'].isna().sum()

(np.int64(0), np.int64(177))

## 2) Categorical imputation (CategoricalImputer)

Categorical features sometimes have missing values. One option is to replace them with a string such as 'Missing' so that the model can learn from the presence of missingness.

In [9]:
cat_imputer = CategoricalImputer(fill_value='Missing', variables=['embarked','deck'])
df_cat_imputed = cat_imputer.fit_transform(df)
df_cat_imputed[['embarked','deck']].isna().sum()


embarked    0
deck        0
dtype: int64
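To see why a 'Missing' label can carry signal, here is a small plain-pandas sketch on made-up data. It mimics what the imputer does with `fillna` and then compares the target rate per category; the column values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): None marks a missing deck
toy = pd.DataFrame({
    "deck": ["C", None, "B", None, None, "C"],
    "survived": [1, 0, 1, 0, 1, 1],
})

# Same idea as CategoricalImputer(fill_value='Missing'): missing becomes its own category
toy["deck_filled"] = toy["deck"].fillna("Missing")

# The new category gets its own target rate, which a model can pick up on
rate_by_deck = toy.groupby("deck_filled")["survived"].mean()
print(rate_by_deck)
```

If rows with a missing value behave differently from the rest, the explicit label preserves that information instead of hiding it behind an imputed value.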

## 3) Rare label encoding (RareLabelEncoder)

Rare categories (very infrequent) may create noisy features. RareLabelEncoder groups categories that appear in a small fraction of observations into a single label (e.g., 'Rare').

In [10]:
# Demonstrate with 'embarked' (though it has few categories) and 'deck' which has a lot of NAs and many categories
rare_enc = RareLabelEncoder(tol=0.05, n_categories=1, variables=['deck','embarked'])
df_rare = rare_enc.fit_transform(df_cat_imputed)
df_rare['deck'].value_counts(normalize=True).head(10)


deck
Missing    0.772166
Rare       0.108866
C          0.066218
B          0.052750
A          0.000000
D          0.000000
E          0.000000
F          0.000000
G          0.000000
Name: proportion, dtype: float64
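The grouping logic can be approximated in a few lines of plain pandas. This is a sketch of the idea, not feature-engine's implementation: categories whose relative frequency falls below a threshold (called `tol` here, mirroring the parameter name) are replaced by 'Rare'. The data is made up.

```python
import pandas as pd

# Toy data (hypothetical): C is common, A appears only once
deck = pd.Series(["C"] * 6 + ["B"] * 3 + ["A"], name="deck")
tol = 0.2

freq = deck.value_counts(normalize=True)   # relative frequency per category
frequent = freq[freq >= tol].index          # categories to keep as they are

# Keep frequent categories, replace everything else with 'Rare'
deck_grouped = deck.where(deck.isin(frequent), "Rare")
print(deck_grouped.value_counts())
```

The set of frequent categories is what has to be remembered from the training data, so that the same mapping can be applied to new rows.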

## 4) One-hot encoding (OneHotEncoder)

Convert categorical variables into numerical columns via one-hot encoding. feature-engine's OneHotEncoder returns a DataFrame and allows dropping the last level to avoid collinearity.

In [11]:
ohe = OneHotEncoder(drop_last=True, variables=['sex','embarked'])
df_ohe = ohe.fit_transform(df_rare)
df_ohe.head()


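| | survived | pclass | age | fare | deck | sex_male | embarked_S | embarked_C | embarked_Q |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.25 | Missing | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 38.0 | 71.2833 | C | 0 | 0 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 7.925 | Missing | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1 | C | 0 | 1 | 0 | 0 |
| 4 | 0 | 3 | 35.0 | 8.05 | Missing | 1 | 1 | 0 | 0 |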
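Why does dropping one level avoid collinearity? With a full set of dummy columns, each row sums to exactly 1, so any one column is determined by the others. The plain-pandas sketch below (using `pd.get_dummies` on made-up values rather than feature-engine) shows that the dropped column can be recovered from the remaining ones.

```python
import pandas as pd

# Toy data (hypothetical values)
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")

full = pd.get_dummies(embarked, dtype=int)  # one column per category: C, Q, S
reduced = full.iloc[:, :-1]                 # drop the last level

# The dropped column carries no extra information:
# it equals 1 minus the sum of the remaining dummy columns
recovered = 1 - reduced.sum(axis=1)
print(full)
print(recovered.tolist())
```

Linear models with an intercept are the main case where this redundancy causes trouble; tree-based models are usually unaffected.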


## 5) Equal-frequency discretisation (EqualFrequencyDiscretiser)

Discretisation (binning) transforms continuous variables into ordinal bins. Equal-frequency discretisation places the bin edges at quantiles, so each bin holds roughly the same number of observations.
Binning can help when a model benefits from ordinal inputs, and it can help simple models capture non-linear relationships.

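In [16]:
# The discretiser replaces 'fare' in place; it does not add a 'fare_bin' column
disc = EqualFrequencyDiscretiser(q=4, variables=['fare'], return_object=True)
df_disc = disc.fit_transform(df_ohe)
# Put the original values next to their assigned bin for comparison
pd.DataFrame({'fare': df_ohe['fare'], 'fare_bin': df_disc['fare']}).head(10)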

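Note: return_object=True returns the bin numbers with object dtype instead of integer dtype, which is useful when you want to one-hot encode the bins later. To get the interval limits as labels instead of bin numbers, set return_boundaries=True.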
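For intuition, the same idea can be sketched with `pd.qcut`, which also cuts at quantiles. This is a plain-pandas illustration on made-up values and not feature-engine's implementation.

```python
import pandas as pd

# Toy data (hypothetical values)
fare = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], name="fare")

bin_index = pd.qcut(fare, q=4, labels=False)  # integer bin number per row
intervals = pd.qcut(fare, q=4)                # the same bins shown as intervals

counts = bin_index.value_counts().sort_index()
print(intervals.cat.categories)
print(counts)
```

Each of the four bins receives two of the eight values. With heavily tied data the bins can end up with unequal sizes, which is why the counts are only "roughly" equal in practice.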

## 6) Log transformation (LogTransformer)

Apply a log transform to reduce skew in positive-valued variables. feature-engine's LogTransformer requires strictly positive values and raises an error otherwise. 'fare' contains zeros, so this example uses LogCpTransformer, which adds a constant C to the variable before taking the log.

In [15]:
from feature_engine.transformation import LogCpTransformer

# C=1 computes log(fare + 1), which is defined for the zero fares
log_t = LogCpTransformer(variables=['fare'], C=1)
df_log = log_t.fit_transform(df)
df_log[['fare']].describe().loc[['mean','std','min','max']]

Compare the original and log-transformed distributions using summary statistics:

In [None]:
orig = df['fare'].dropna()
trans = log_t.transform(df)['fare']
pd.DataFrame({'original_mean':orig.mean(), 'original_std':orig.std(),
              'log_mean':trans.mean(), 'log_std':trans.std()}, index=[0])
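The effect of the zero values and of the shift can be checked with plain NumPy. The sketch below uses made-up fare values: it shows that log(0) is not finite, that log(x + 1) is, and that the transformed values are less skewed.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one zero and one very large fare
fare = pd.Series([0.0, 7.25, 8.05, 71.28, 512.33], name="fare")

with np.errstate(divide="ignore"):
    plain_log = np.log(fare)      # log(0) evaluates to -inf

shifted_log = np.log(fare + 1)    # adding a constant keeps every value finite

skew_before = fare.skew()
skew_after = shifted_log.skew()
print(plain_log.iloc[0], skew_before, skew_after)
```

The shift changes the interpretation of the transformed values slightly, so the constant should be recorded alongside the model.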


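## 7) Scaling numerical features (MeanNormalizationScaler)

Scaling puts numerical features on a comparable range, which matters for algorithms that are distance-based or use regularisation. MeanNormalizationScaler subtracts the mean of each variable and divides by its range (max - min), so the result is centred at zero and spans an interval of width 1. This differs from standardisation (zero mean, unit variance); for that, sklearn's StandardScaler can be wrapped with feature-engine's SklearnTransformerWrapper.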

In [14]:
# Use the imputed data so that 'age' has no missing values
scaler = MeanNormalizationScaler(variables=['age','fare'])
df_scaled = scaler.fit_transform(df_imputed)
# After mean normalisation the mean is ~0; the std is not rescaled to 1
df_scaled[['age','fare']].describe().loc[['mean','std']]
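The two scaling formulas are easy to confuse, so here is a plain-pandas sketch of both on made-up values. Mean normalisation divides by the range, standardisation divides by the standard deviation (the population standard deviation, `ddof=0`, is used here to match sklearn's StandardScaler).

```python
import pandas as pd

# Toy data (hypothetical values)
age = pd.Series([20.0, 30.0, 40.0, 60.0], name="age")

# Mean normalisation: (x - mean) / (max - min)
mean_normalised = (age - age.mean()) / (age.max() - age.min())

# Standardisation: (x - mean) / std
standardised = (age - age.mean()) / age.std(ddof=0)

print(mean_normalised.tolist())
print(standardised.tolist())
```

Both results are centred at zero. The first spans an interval of width 1, while the second has unit variance.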

## Putting it together: a simple end-to-end Pipeline

Below is a single sklearn Pipeline that chains several feature-engine transformers. Feature-engine transformers are compatible with sklearn Pipelines and operate on pandas DataFrames (returning DataFrames), which makes it convenient to build sequential transformations that act on different variables.

Pipeline steps (example):
- Impute numeric missing values (age)
- Log transform fare
- Scale numerical features
- Impute categorical missing values
- Rare label encode
- One-hot encode selected categoricals
- Discretise fare into equal-frequency bins (optional)


In [None]:
from feature_engine.transformation import LogCpTransformer

pipeline = Pipeline([
    ('num_imputer', MeanMedianImputer(imputation_method='median', variables=['age'])),
    # 'fare' contains zeros, so add a constant before taking the log
    ('log_fare', LogCpTransformer(variables=['fare'], C=1)),
    ('scaler', MeanNormalizationScaler(variables=['age','fare'])),
    ('cat_imputer', CategoricalImputer(fill_value='Missing', variables=['embarked','deck'])),
    ('rare_encoder', RareLabelEncoder(tol=0.05, n_categories=1, variables=['deck','embarked'])),
    ('ohe', OneHotEncoder(drop_last=True, variables=['sex','embarked'])),
    # Optional: discretise fare to bins (uncomment if desired)
    # ('discretiser', EqualFrequencyDiscretiser(q=4, variables=['fare'], return_object=True)),
])

# These transformers learn their parameters from X alone, so no target is passed.
# Fitted on the full dataset for illustration; in practice, fit on the training split only.
pipeline.fit(df)
df_transformed = pipeline.transform(df)
df_transformed.head()

### Notes and teaching tips
- Show students how each transformer stores state (e.g., the median used for imputation, or the frequent categories that RareLabelEncoder keeps). Examine transformer attributes (e.g., num_imputer.imputer_dict_).
- Emphasize the difference between transformers that need the target (e.g., OrdinalEncoder with encoding_method='ordered', which orders categories by the target mean) and those that don't.
- Teach the importance of fitting transformers only on training data and applying to test data to avoid data leakage.
- Demonstrate how to save pipelines (joblib.dump) and reuse them in production.

This notebook introduced seven practical transformations from feature-engine. Students can extend the pipeline with additional feature-engine tools such as variable extractors, polynomial features (via sklearn), or decision-tree-based discretisers for supervised binning.


In [None]:
# Example: inspect what median was used for 'age'
pipeline.named_steps['num_imputer'].imputer_dict_


In [None]:
# Example: encoder_dict_ lists the frequent categories kept for each variable; every other category is mapped to 'Rare'
pipeline.named_steps['rare_encoder'].encoder_dict_
