# Feature-engine: Essential Transformations for ML Students

This notebook demonstrates 8 important feature-engine transformations that every machine learning student should know. We use the Titanic dataset (via seaborn) to illustrate transformations on both numerical and categorical variables. Each transformation includes a short explanation and a runnable example.

Transformations covered:
1. Mean/Median imputation for numerical variables (MeanMedianImputer)
2. Categorical imputation for missing categorical values (CategoricalImputer)
3. Rare label encoding to group infrequent categories (RareLabelEncoder)
4. One-hot encoding for categorical expansion (OneHotEncoder)
5. Equal-frequency discretisation / binning (EqualFrequencyDiscretiser)
6. Log transformation to reduce skew (LogTransformer)
7. Scaling numerical features (StandardScaler)
8. Outlier capping/winsorization (Winsorizer)

All transformers except `StandardScaler` (which comes from scikit-learn) are from the feature-engine library. They are sklearn-compatible, so they can be included inside sklearn Pipelines.

In [17]:
# Install dependencies (run once)
!uv pip install -q feature-engine seaborn

In [18]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import RareLabelEncoder, OneHotEncoder
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.transformation import LogTransformer
from feature_engine.outliers import Winsorizer
from sklearn.preprocessing import StandardScaler

In [19]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')
df = df[['survived','pclass','sex','age','fare','embarked','deck']].copy()
df.head()

Unnamed: 0,survived,pclass,sex,age,fare,embarked,deck
0,0,3,male,22.0,7.25,S,
1,1,1,female,38.0,71.2833,C,C
2,1,3,female,26.0,7.925,S,
3,1,1,female,35.0,53.1,S,C
4,0,3,male,35.0,8.05,S,


Quick look: the dataset contains numerical columns (age, fare), categorical columns (sex, embarked, deck), and missing values. We'll demonstrate transformations on a subset of these features.

## 1) Mean/Median imputation for numerical variables (MeanMedianImputer)

When numerical features have missing values, replacing them with the mean or median is a simple baseline strategy. The median is more robust to outliers than the mean.

**Key differences from sklearn.impute.SimpleImputer:**
- **Input**: Accepts pandas DataFrame (not numpy arrays)
- **Output**: Returns pandas DataFrame (preserves column names and index)
- **Variables parameter**: Specify which columns to impute; other columns pass through unchanged
- **Flexibility**: Can impute different columns with different strategies in same pipeline

**How it works:**
1. `.fit()` - Calculates and stores the median/mean from training data
2. `.transform()` - Replaces missing values using stored statistics
3. The learned statistics are stored in the `.imputer_dict_` attribute (e.g., `{'age': 28.0}`)
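
To make the fit/transform split above concrete, here is a plain-pandas sketch of what a median imputer does. It is not feature-engine's implementation; the toy data and the names `train`, `test`, and `learned` are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with gaps
train = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, 26.0]})
test = pd.DataFrame({"age": [np.nan, 54.0]})

# "fit": learn the statistic from the training data only
learned = {"age": train["age"].median()}

# "transform": reuse the stored statistic on any data set
train_imputed = train.fillna(value=learned)
test_imputed = test.fillna(value=learned)

print(learned)
print(test_imputed)
```

The test set is filled with the training median, not its own, which is the same behaviour you get when calling `.transform()` on new data after `.fit()` on the training data.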

In [36]:
# Create imputer - specify which columns to impute
num_imputer = MeanMedianImputer(
    imputation_method='median',  # Can be 'mean' or 'median'
    variables=['age']            # Only impute 'age', leave other columns as-is
)

# Fit on data (learns median=28.0) and transform in one step
df_imputed = num_imputer.fit_transform(df)

# Show statistics of the original 'age' column - count is 714 because 177 values are missing
df[['age']].describe().loc[['count','mean','std','min','50%','max']]

Unnamed: 0,age
count,714.0
mean,29.699118
std,14.526497
min,0.42
50%,28.0
max,80.0


Check that missing values in 'age' were replaced by the median:

In [21]:
df_imputed['age'].isna().sum(), df['age'].isna().sum()

(np.int64(0), np.int64(177))

## 2) Categorical imputation (CategoricalImputer)

Categorical features sometimes have missing values. One option is to replace them with a string such as 'Missing' so that the model can learn from the presence of missingness.

**Key points:**
- **Input/Output**: DataFrame → DataFrame (like all feature-engine transformers)
- **fill_value**: The string/value to replace NaN with (default='Missing')
- **variables**: List of categorical columns to impute
- **Why this matters**: Treating missing as a separate category can capture patterns (e.g., "missing deck info correlates with lower fare")

**Compared to sklearn:** sklearn's SimpleImputer can do this with `strategy='constant'`, but feature-engine's version is more explicit and DataFrame-friendly.
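
For intuition, a plain-pandas sketch of the same idea follows. It is only an approximation of what the transformer does, and the toy DataFrame is invented. One detail worth seeing: a column with pandas' `category` dtype (as `deck` appears to be in the seaborn dataset) only accepts values that are registered categories, so the new label has to be added first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one object column and one categorical column
toy = pd.DataFrame({
    "embarked": ["S", np.nan, "C"],
    "deck": pd.Categorical(["C", np.nan, np.nan]),
})

filled = toy.copy()

# Object/string column: fillna is enough
filled["embarked"] = filled["embarked"].fillna("Missing")

# Categorical column: register the new label before filling
filled["deck"] = filled["deck"].cat.add_categories("Missing").fillna("Missing")

print(filled)
print("share of rows with missing deck:", (filled["deck"] == "Missing").mean())
```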

In [None]:
# Create categorical imputer
cat_imputer = CategoricalImputer(
    fill_value='Missing',              # Replace NaN with this string
    variables=['embarked','deck']      # Apply to these columns only
)

# Transform data - replaces NaN with 'Missing'
df_cat_imputed = cat_imputer.fit_transform(df)

# Verify: should show 0 missing values
df_cat_imputed[['embarked','deck']].isna().sum()

embarked    0
deck        0
dtype: int64

## 3) Rare label encoding (RareLabelEncoder)

Rare categories (very infrequent) may create noisy features. RareLabelEncoder groups categories that appear in a small fraction of observations into a single label (e.g., 'Rare').

**Important parameters:**
- **tol**: Minimum frequency threshold (0.05 = 5%). Categories below this → 'Rare'
- **n_categories**: Minimum number of distinct categories a variable must have for grouping to be applied. Variables with fewer categories are left unchanged (all their categories are treated as frequent)
- **replace_with**: String to replace rare categories with (default='Rare')

**Why use this?**
- Reduces high-cardinality categorical features
- Prevents overfitting to rare categories with few samples
- Reduces dimensionality after one-hot encoding

**How it learns:**
- `.fit()` - Identifies which categories are rare based on training data frequency
- `.transform()` - Replaces rare categories with 'Rare'
- The frequent (kept) categories are stored in `.encoder_dict_` (e.g., `{'deck': ['Missing', 'C', 'B'], 'embarked': ['S', 'C', 'Q']}`); every other category is replaced with 'Rare'
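
The learning step can be sketched in plain pandas as below. This is an illustration, not the library's code: the toy data, the threshold, and the helper `group_rare` are hypothetical, and treating a category with frequency exactly equal to `tol` as frequent is an assumption of this sketch.

```python
import pandas as pd

tol = 0.08  # hypothetical threshold for this toy example

# Hypothetical training column and some new data containing an unseen label "G"
train = pd.Series(["Missing"] * 14 + ["C"] * 3 + ["B"] * 2 + ["A"], name="deck")
new_data = pd.Series(["C", "A", "G"], name="deck")

# "fit": keep the categories whose relative frequency reaches the threshold
freq = train.value_counts(normalize=True)
frequent = freq[freq >= tol].index.tolist()

def group_rare(s, keep, label="Rare"):
    """Replace every value not in `keep` with `label`."""
    return s.where(s.isin(keep), label)

# "transform": anything outside the learned list becomes 'Rare'
print(frequent)
print(group_rare(new_data, frequent).tolist())
```

Because only the frequent labels are stored, a category that never appeared during fitting (here "G") is grouped into 'Rare' as well.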

In [None]:
# Create rare label encoder
rare_enc = RareLabelEncoder(
    tol=0.05,                          # Categories appearing in <5% of rows → 'Rare'
    n_categories=1,                    # Apply grouping only to variables with more than 1 category
    variables=['deck','embarked']      # Apply to these categorical columns
)

# Fit and transform - groups infrequent deck values into 'Rare'
df_rare = rare_enc.fit_transform(df_cat_imputed)

# Show frequency distribution - notice 'Rare' category consolidates infrequent values
df_rare['deck'].value_counts(normalize=True).head(10)

deck
Missing    0.772166
Rare       0.108866
C          0.066218
B          0.052750
A          0.000000
D          0.000000
E          0.000000
F          0.000000
G          0.000000
Name: proportion, dtype: float64

## 4) One-hot encoding (OneHotEncoder)

Convert categorical variables into numerical columns via one-hot encoding. feature-engine's OneHotEncoder returns a DataFrame and allows dropping the last level to avoid collinearity.

**Key differences from sklearn.preprocessing.OneHotEncoder:**
- **Input/Output**: DataFrame → DataFrame (sklearn uses arrays)
- **Column names**: Automatically creates readable names like 'sex_male', 'embarked_S'
- **drop_last**: If True, drops one category per variable to avoid multicollinearity (important for linear models)
- **variables**: Specify which columns to encode; unspecified columns pass through unchanged

**How it works:**
1. `.fit()` - Learns all unique categories in specified columns
2. `.transform()` - Creates binary columns for each category, drops original categorical columns
3. New columns: 'variable_category' format (e.g., 'sex_male', 'sex_female')

**Important**: Always fit on training data only! If test data has unseen categories, they'll be ignored (all zeros).
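
The following plain-pandas sketch illustrates why an unseen category ends up as all zeros. The toy Series and the helper `one_hot` are hypothetical and only mimic the behaviour described above.

```python
import pandas as pd

train = pd.Series(["S", "C", "Q", "S"], name="embarked")
test = pd.Series(["S", "X"], name="embarked")  # "X" never appears in train

# "fit": remember the categories seen in training, then drop the last one (k-1 dummies)
learned = list(train.unique())
kept = learned[:-1]

def one_hot(s, categories):
    """Build one 0/1 column per learned category, named 'variable_category'."""
    return pd.DataFrame(
        {f"{s.name}_{c}": (s == c).astype(int) for c in categories},
        index=s.index,
    )

# "transform": only the learned columns are created
encoded_test = one_hot(test, kept)
print(encoded_test)
```

Note that with k-1 encoding the dropped category ("Q" here) is also represented by all zeros, so an unseen category becomes indistinguishable from it.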

In [None]:
# Create one-hot encoder
ohe = OneHotEncoder(
    drop_last=True,                    # Drop last category to avoid dummy variable trap
    variables=['sex','embarked']       # Encode these categorical columns
)

# Transform - creates new binary columns, removes original categorical columns
df_ohe = ohe.fit_transform(df_rare)

# Notice: 'sex' becomes 'sex_male' (female dropped), 'embarked' becomes 3 columns (one dropped)
df_ohe.head()

Unnamed: 0,survived,pclass,age,fare,deck,sex_male,embarked_S,embarked_C,embarked_Q
0,0,3,22.0,7.25,Missing,1,1,0,0
1,1,1,38.0,71.2833,C,0,0,1,0
2,1,3,26.0,7.925,Missing,0,1,0,0
3,1,1,35.0,53.1,C,0,1,0,0
4,0,3,35.0,8.05,Missing,1,1,0,0


## 5) Equal-frequency discretisation (EqualFrequencyDiscretiser)

Discretisation (binning) transforms continuous variables into ordinal bins. Equal-frequency bins try to make bins with (roughly) the same number of observations.
Binning can help models that benefit from ordinal categories or to capture non-linear relationships.

**Key parameters:**
- **q**: Number of quantiles/bins to create (e.g., q=4 creates quartiles)
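- **return_object**: 
  - `False` → Returns the bin numbers as integers (0, 1, 2, 3)
  - `True` → Returns the same bin numbers cast to object dtype, so that feature-engine's categorical encoders can process them
- **return_boundaries**: If `True`, returns the interval limits of each bin instead of the bin numbers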
- **variables**: Columns to discretize

**Compared to sklearn.preprocessing.KBinsDiscretizer:**
- feature-engine is DataFrame-friendly and allows specifying columns
- sklearn works with arrays and discretizes all features

**Why use discretization?**
- Captures non-linear patterns (e.g., fare 0-10, 10-30, 30-100, 100+)
- Makes models more robust to outliers
- Can improve tree-based models and linear models

**Fit behavior:** Learns bin edges from training data distribution
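
A plain-pandas sketch of "learn the edges, then reuse them" is shown below. The toy data is invented, and widening the outer edges to ±infinity is a choice made in this sketch so that out-of-range values in new data still fall into the first or last bin.

```python
import numpy as np
import pandas as pd

train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], name="fare")
new_data = pd.Series([0.5, 4.0, 100.0], name="fare")

# "fit": learn quartile edges from the training data
_, edges = pd.qcut(train, q=4, retbins=True)
edges = edges.astype(float)
edges[0], edges[-1] = -np.inf, np.inf  # open-ended outer bins

# "transform": assign bin numbers using the stored edges
train_bins = pd.cut(train, bins=edges, labels=False)
new_bins = pd.cut(new_data, bins=edges, labels=False)

print(edges)
print(train_bins.value_counts().sort_index())
print(new_bins.tolist())
```

Each training bin holds the same number of rows, while the bin widths differ; new values are binned with the training edges rather than their own quantiles.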

In [None]:
# Create equal-frequency discretiser
disc = EqualFrequencyDiscretiser(
    q=4,                               # Create 4 bins (quartiles)
    variables=['fare'],                # Discretize 'fare' column
    return_object=False                # Return bin numbers as integers (0,1,2,3), not as object dtype
)

# Transform - replaces continuous values with bin numbers
df_disc = disc.fit_transform(df_ohe)

# Show original and binned values side by side
df_temp = df_ohe[['fare']].copy()
df_temp['fare_binned'] = df_disc['fare']
df_temp.head(10)

Unnamed: 0,fare,fare_binned
0,7.25,0
1,71.2833,3
2,7.925,1
3,53.1,3
4,8.05,1
5,8.4583,1
6,51.8625,3
7,21.075,2
8,11.1333,1
9,30.0708,2


Note: return_object=False returns bins as integers (0, 1, 2, 3) representing quartiles. Set `return_object=True` to get the same bin numbers as object dtype, or `return_boundaries=True` to get the interval limits instead.

## 6) Log transformation (LogTransformer)

Apply a log transform to reduce skew in positive-valued variables (e.g., fare). LogTransformer requires strictly positive values. For data with zeros, use `YeoJohnsonTransformer` instead, or add a small constant before transforming.

**Important constraints:**
- **Requires**: All values must be > 0 (raises ValueError if zeros/negatives found)
- **Input/Output**: DataFrame → DataFrame
- **Effect**: Compresses large values, expands small values → reduces right skew

**Compared to sklearn:**
- sklearn doesn't have a dedicated log transformer
- The closest equivalent is `FunctionTransformer(np.log)`; `FunctionTransformer(np.log1p)` computes log(1 + x) instead, which tolerates zeros
- feature-engine's version is more explicit and checks for invalid values

**Mathematical transformation:**
```
new_value = log(original_value)
```

**When to use:**
- Right-skewed distributions (long tail to the right)
- Variables with exponential growth (prices, populations, web traffic)
- To stabilize variance in regression models
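
A small NumPy sketch compares the options for a column that contains a zero. The values are a hypothetical toy sample, not the real dataset, and the 0.01 shift mirrors the constant used in the cell that follows.

```python
import numpy as np
import pandas as pd

fare = pd.Series([0.0, 7.25, 8.05, 26.0, 71.28, 512.33], name="fare")

# Plain log: undefined at zero (NumPy returns -inf with a warning, silenced here)
with np.errstate(divide="ignore"):
    plain = np.log(fare)

shifted = np.log(fare + 0.01)  # small constant: zero maps to log(0.01), about -4.6
log1p = np.log1p(fare)         # log(1 + x): zero maps to exactly 0

print(pd.DataFrame({"fare": fare, "log": plain, "log(x+0.01)": shifted, "log1p": log1p}))
print("skew before:", fare.skew(), "after log1p:", log1p.skew())
```

The choice of constant matters: a very small shift sends zeros far to the left of the other values, whereas `log1p` keeps them close to the rest of the distribution.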

In [33]:
# Check for zeros/negatives and handle them
print(f"Fare has {(df['fare'] <= 0).sum()} zero or negative values")

# Create a copy and add small constant to avoid log(0)
df_for_log = df.copy()
df_for_log['fare'] = df_for_log['fare'] + 0.01  # Add small constant

log_t = LogTransformer(variables=['fare'])
df_log = log_t.fit_transform(df_for_log)

# Show the transformed data statistics
print("\nOriginal fare stats:")
print(df[['fare']].describe().loc[['mean','std','min','max']])
print("\nLog-transformed fare stats:")
print(df_log[['fare']].describe().loc[['mean','std','min','max']])

Fare has 15 zero or negative values

Original fare stats:
            fare
mean   32.204208
std    49.693429
min     0.000000
max   512.329200

Log-transformed fare stats:
          fare
mean  2.817036
std   1.343804
min  -4.605170
max   6.238987


Show original vs log transformed distribution (basic comparison using summary stats):

In [35]:
# Compare the raw fare (before the +0.01 shift) with the log-transformed fare
trans = df_log['fare'].dropna()

pd.DataFrame({
    'original_mean': df['fare'].mean(), 
    'original_std': df['fare'].std(),
    'log_mean': trans.mean(), 
    'log_std': trans.std()
}, index=[0])

Unnamed: 0,original_mean,original_std,log_mean,log_std
0,32.204208,49.693429,2.817036,1.343804


## 7) Scaling numerical features (StandardScaler)

Standardisation (zero mean, unit variance) is a common preprocessing step, especially for algorithms that are distance-based or use regularisation. We use sklearn's StandardScaler which works well with feature-engine pipelines.

**sklearn.preprocessing.StandardScaler:**
- **Input**: Numpy array or DataFrame
- **Output**: Numpy array (loses column names!)
- **Formula**: `z = (x - mean) / std`
- **Result**: Mean ≈ 0, Standard Deviation ≈ 1

**Important for pipelines:**
- By default, StandardScaler returns numpy arrays, breaking the DataFrame structure
- In production pipelines, place StandardScaler as the last step
- Or wrap it in feature-engine's `SklearnTransformerWrapper` (or, in scikit-learn ≥ 1.2, call `set_output(transform="pandas")`) to keep DataFrames

**When to use scaling:**
- Linear regression, Logistic regression (with regularization)
- KNN, SVM (distance-based algorithms)
- Neural networks and other models trained with gradient descent
- **Not needed** for tree-based models (Random Forest, XGBoost)
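
The formula can be checked by hand with pandas. The sketch below uses a hypothetical five-row sample and the population standard deviation (`ddof=0`), which is what StandardScaler uses; pandas' `describe()` and `.std()` default to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# "fit": learn mean and population std per column
mean = X.mean()
std = X.std(ddof=0)

# "transform": z = (x - mean) / std, column names are preserved
Z = (X - mean) / std

print(Z.std(ddof=0))  # exactly 1 up to rounding
print(Z.std(ddof=1))  # sqrt(n / (n - 1)), slightly above 1
```

This explains why a scaled column can report a standard deviation slightly above 1 in `describe()`: with n = 891 rows the factor is sqrt(891/890) ≈ 1.000562.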

In [None]:
# First impute age (scaler can't handle NaN values)
df_for_scaling = df.copy()
num_imputer = MeanMedianImputer(imputation_method='median', variables=['age'])
df_for_scaling = num_imputer.fit_transform(df_for_scaling)

# Create and apply StandardScaler
scaler = StandardScaler()

# Note: StandardScaler requires numeric data without NaN
# It learns mean and std from the data, then transforms: (x - mean) / std
df_for_scaling[['age','fare']] = scaler.fit_transform(df_for_scaling[['age','fare']])

# Verify: scaled features should have mean≈0, std≈1 (describe() uses the sample std, so it shows 1.000562)
df_for_scaling[['age','fare']].describe().loc[['mean','std']]

Unnamed: 0,age,fare
mean,2.27278e-16,3.9873330000000004e-18
std,1.000562,1.000562


## 8) Outlier capping with Winsorizer

Outliers can negatively impact model performance. The Winsorizer caps extreme values at specified quantiles (e.g., 5th and 95th percentiles), which is less aggressive than removing outliers entirely.

**Key parameters:**
- **capping_method**: 
  - `'quantiles'` - Cap at percentiles (e.g., 5th and 95th)
  - `'iqr'` - Cap using IQR method (Q1 - 1.5×IQR, Q3 + 1.5×IQR)
  - `'gaussian'` - Cap at mean ± n×std
- **tail**: `'both'`, `'left'`, or `'right'` - which tail(s) to cap
- **fold**: For quantiles, the percentile (0.05 = 5th/95th percentiles)

**How it works:**
1. `.fit()` - Calculates capping values from training data (e.g., 5th and 95th percentiles)
2. `.transform()` - Replaces values below 5th with 5th percentile, above 95th with 95th percentile
3. Stored in `.left_tail_caps_` and `.right_tail_caps_` attributes

**Why winsorize instead of removing outliers?**
- Preserves all data points (no sample loss)
- Limits the influence of extreme values without shrinking the training set
- Better for production: handles outliers in new data gracefully

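**Side effect on this dataset:** The lower cap (the 5th percentile of fare, 7.225) is positive, so zero fares are raised to 7.225 and the column becomes safe for a log transformation. This is not guaranteed in general: if the lower cap itself is zero or negative, the log transform will still fail.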
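
Quantile capping can be sketched with pandas' `quantile` and `clip`. The toy Series below are invented; the second one shows a case where the lower cap stays at zero, so capping alone would not make the column safe for a log transform.

```python
import numpy as np
import pandas as pd

train = pd.Series(np.arange(0, 101, dtype=float), name="fare")
new_data = pd.Series([-3.0, 50.0, 1000.0], name="fare")

# "fit": learn the caps from the training data
lower, upper = train.quantile(0.05), train.quantile(0.95)

# "transform": values outside the caps are replaced by the caps
capped = new_data.clip(lower=lower, upper=upper)
print(lower, upper, capped.tolist())

# Counter-example: with many zeros, the 5th percentile itself is zero
zero_heavy = pd.Series([0.0] * 10 + list(range(1, 91)), name="fare")
zero_heavy_lower = zero_heavy.quantile(0.05)
print("lower cap for zero-heavy data:", zero_heavy_lower)
```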

In [None]:
# Create winsorizer
winsorizer = Winsorizer(
    capping_method='quantiles',        # Cap at percentiles
    tail='both',                       # Cap both high and low values
    fold=0.05,                         # 5th and 95th percentiles (keeps middle 90%)
    variables=['fare']                 # Apply to 'fare' column
)

# Transform - caps extreme values at 5th and 95th percentiles
df_winsorized = winsorizer.fit_transform(df)

# Compare original vs winsorized - notice min/max are now bounded
print("Original fare range:", df['fare'].min(), "to", df['fare'].max())
print("Winsorized fare range:", df_winsorized['fare'].min(), "to", df_winsorized['fare'].max())
print("\nOriginal 95th percentile:", df['fare'].quantile(0.95))
print("Winsorized max (capped at 95th):", df_winsorized['fare'].max())

Original fare range: 0.0 to 512.3292
Winsorized fare range: 7.225 to 112.07915

Original 95th percentile: 112.07915
Winsorized max (capped at 95th): 112.07915


## Putting it together: a simple end-to-end Pipeline

Below is a single sklearn Pipeline that chains several feature-engine transformers. Feature-engine transformers are compatible with sklearn Pipelines and operate on pandas DataFrames (returning DataFrames), which makes it convenient to build sequential transformations that act on different variables.

Pipeline steps (example):
- Impute numeric missing values (age)
- Cap outliers in fare (on this dataset the lower cap is positive, so zero fares are raised too)
- Log transform fare (safe here because the capped fare is > 0)
- Impute categorical missing values
- Rare label encode
- One-hot encode selected categoricals
- Discretise fare into equal-frequency bins (optional)

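Note: The Winsorizer before LogTransformer matters here because the 5th-percentile cap (7.225) guarantees fare > 0, which the log transformation requires. On data where the lower cap is zero or negative, handle those values explicitly before the log step.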

In [None]:
# Create an end-to-end preprocessing pipeline
# Pipeline executes transformers in sequence, passing output of each to the next

pipeline = Pipeline([
    # Step 1: Impute missing numerical values (age has ~20% missing)
    ('num_imputer', MeanMedianImputer(imputation_method='median', variables=['age'])),
    
    # Step 2: Cap outliers in fare (zero fares are raised to the positive 5th-percentile cap)
    ('winsorizer', Winsorizer(capping_method='quantiles', tail='both', fold=0.05, variables=['fare'])),
    
    # Step 3: Log transform fare (safe here - the capped fare is > 0)
    ('log_fare', LogTransformer(variables=['fare'])),
    
    # Step 4: Impute missing categorical values (embarked, deck)
    ('cat_imputer', CategoricalImputer(fill_value='Missing', variables=['embarked','deck'])),
    
    # Step 5: Group rare categories (reduces cardinality of 'deck')
    ('rare_encoder', RareLabelEncoder(tol=0.05, n_categories=1, variables=['deck','embarked'])),
    
    # Step 6: One-hot encode categorical variables (creates binary columns)
    ('ohe', OneHotEncoder(drop_last=True, variables=['sex','embarked'])),
    
    # Optional: Discretise fare to bins (uncomment if desired)
    # ('discretiser', EqualFrequencyDiscretiser(q=4, variables=['fare'], return_object=False)),
])

# Fit the entire pipeline
# For brevity we fit on the full DataFrame here; in a real project, fit on the training split only
# Each transformer learns its parameters sequentially
pipeline.fit(df)

# Transform data - applies all transformations in order
df_transformed = pipeline.transform(df)

# Result: imputed, capped and partly encoded DataFrame ('deck' is still categorical because it was not one-hot encoded)
df_transformed.head()

Unnamed: 0,survived,pclass,age,fare,deck,sex_male,embarked_S,embarked_C,embarked_Q
0,0,3,22.0,1.981001,Missing,1,1,0,0
1,1,1,38.0,4.266662,C,0,0,1,0
2,1,3,26.0,2.070022,Missing,0,1,0,0
3,1,1,35.0,3.972177,C,0,1,0,0
4,0,3,35.0,2.085672,Missing,1,1,0,0


### Notes and teaching tips

**Understanding transformers (sklearn API):**
- All transformers follow the same pattern: `.fit()`, `.transform()`, `.fit_transform()`
- `.fit()` - Learn parameters from training data (e.g., mean, median, categories)
- `.transform()` - Apply learned parameters to new data
- `.fit_transform()` - Shortcut to fit and transform in one step

**Critical concept: Fit on training data only!**
```python
# ✅ CORRECT
transformer.fit(X_train)           # Learn from training data
X_train_transformed = transformer.transform(X_train)
X_test_transformed = transformer.transform(X_test)   # Apply to test

# ❌ WRONG - causes data leakage!
transformer.fit(X)                  # Don't fit on all data
```
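
A runnable version of the correct pattern is sketched below. To keep it dependent on scikit-learn only, it uses `SimpleImputer` as a stand-in for a feature-engine imputer; a feature-engine transformer would slot into the Pipeline the same way. The data is synthetic and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two numeric features, 20% of 'age' set to missing
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.normal(30, 10, 200), "fare": rng.lognormal(3, 1, 200)})
X.loc[X.sample(frac=0.2, random_state=0).index, "age"] = np.nan
y = rng.integers(0, 2, 200)

# Split first, then fit the preprocessing on the training part only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
prep.fit(X_train)

X_train_t = prep.transform(X_train)
X_test_t = prep.transform(X_test)  # uses statistics learned from X_train

# The stored median comes from the training split, not from all rows
train_median = prep.named_steps["impute"].statistics_[0]
print(train_median, X_train["age"].median())
```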

**feature-engine advantages over sklearn:**
1. **DataFrame-in, DataFrame-out** - Preserves column names and structure
2. **Selective transformation** - `variables` parameter lets you transform specific columns
3. **Explicit and readable** - Clear what each transformer does
4. **Pipeline-friendly** - All transformers work seamlessly in sklearn Pipelines

**Inspecting learned parameters:**
- Show students how each transformer stores state (e.g., median used for imputation, categories identified as rare)
- Examine transformer attributes: `.imputer_dict_`, `.encoder_dict_`, `.right_tail_caps_`, etc.
- This helps debug and understand what each transformer learned

**Additional feature-engine transformers to explore:**
- `ArbitraryOutlierCapper` - Custom outlier thresholds
- `DecisionTreeDiscretiser` - Supervised binning using decision trees
- `YeoJohnsonTransformer` - Handles both positive and negative values (unlike log)
- `DropFeatures` - Systematic feature removal
- `MathFeatures` - Combine variables with aggregations such as sum, product, or mean (ratios between variables are handled by `RelativeFeatures`)

In [31]:
# Example: inspect what median was used for 'age'
pipeline.named_steps['num_imputer'].imputer_dict_


{'age': 28.0}

In [32]:
# Example: show the frequent categories kept by the RareLabelEncoder (all others are mapped to 'Rare')
pipeline.named_steps['rare_encoder'].encoder_dict_


{'deck': ['Missing', 'C', 'B'], 'embarked': ['S', 'C', 'Q']}