# 02 - Feature Engineering

Objective: Transform raw data from the EDA phase into model-ready features for training Logistic Regression, Random Forest, and ANN models.

Tasks covered:
- Handle outliers identified in EDA
- Encode categorical variables (binary, ordinal, nominal)
- Engineer derived features (log transforms, ratios)
- Train-test split with stratification
- Feature scaling (StandardScaler)
- Handle class imbalance with SMOTE
- Save preprocessing artifacts for Phase 3


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Pinned to scikit-learn v1.7.2 because imblearn has a compatibility issue with the latest release (v1.8.0)
from imblearn.over_sampling import SMOTE
import joblib

sns.set_theme(style="whitegrid")

# Resolve project root whether the notebook is launched from repo root or notebooks/
PROJECT_ROOT = Path.cwd().resolve()
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent

data_path = PROJECT_ROOT / "data" / "Dataset - 2526.csv"
models_dir = PROJECT_ROOT / "models"
models_dir.mkdir(parents=True, exist_ok=True)

RANDOM_STATE = 38  # Universal random state for reproducibility

print(f"Project root: {PROJECT_ROOT}")
print(f"Data path: {data_path}")


Project root: /Users/ziadalwazzan/Documents/MSc/COMP0198-ML/ml-personal-loan-classification-models
Data path: /Users/ziadalwazzan/Documents/MSc/COMP0198-ML/ml-personal-loan-classification-models/data/Dataset - 2526.csv


In [2]:
# Load data
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"\nColumn dtypes:")
print(df.dtypes)
df.head()


Dataset shape: 45,000 rows x 16 columns

Column dtypes:
person_age                          int64
person_gender                      object
person_education                   object
person_income                       int64
person_emp_exp                      int64
person_home_ownership              object
loan_amnt                           int64
loan_intent                        object
loan_int_rate                     float64
loan_percent_income               float64
cb_person_cred_hist_length          int64
credit_score                        int64
previous_loan_defaults_on_file     object
empl_len                            int64
ppl_household                       int64
loan_status                         int64
dtype: object


| | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | empl_len | ppl_household | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | female | Master | 71948 | 0 | RENT | 35000 | PERSONAL | 16.02 | 0.49 | 3 | 561 | No | 0 | 3 | 1 |
| 1 | 21 | female | High School | 12282 | 0 | OWN | 1000 | EDUCATION | 11.14 | 0.08 | 2 | 504 | Yes | 0 | 0 | 0 |
| 2 | 25 | female | High School | 12438 | 3 | MORTGAGE | 5500 | MEDICAL | 12.87 | 0.44 | 3 | 635 | No | 3 | 1 | 1 |
| 3 | 23 | female | Bachelor | 79753 | 0 | RENT | 35000 | MEDICAL | 15.23 | 0.44 | 2 | 675 | No | 0 | 1 | 1 |
| 4 | 24 | male | Master | 66135 | 1 | RENT | 35000 | MEDICAL | 14.27 | 0.53 | 4 | 586 | No | 1 | 0 | 1 |


## Step 2: Outlier Handling

Address outliers identified in EDA:
- `person_age`: Cap at 100 (max was 144, clearly erroneous)
- `person_income`: Cap at 99th percentile (max 7.2M, highly skewed)
- `person_emp_exp`: Cap at 99th percentile
- `loan_amnt`: Cap at 99th percentile
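
The cell below computes the percentile caps on the full dataset, before the train-test split. If leakage from test rows into the thresholds is a concern, one option is to learn the caps from the training rows only and then apply them everywhere. The following is a minimal sketch of that idea on toy data; `fit_caps` and `apply_caps` are hypothetical helper names, and the values are made up for illustration.

```python
import pandas as pd


def fit_caps(train: pd.DataFrame, columns, q: float = 0.99) -> dict:
    """Learn one upper cap per column from the given rows only."""
    return {col: float(train[col].quantile(q)) for col in columns}


def apply_caps(frame: pd.DataFrame, caps: dict) -> pd.DataFrame:
    """Return a copy of `frame` with each listed column clipped to its cap."""
    out = frame.copy()
    for col, cap in caps.items():
        out[col] = out[col].clip(upper=cap)
    return out


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"person_income": [10_000, 20_000, 30_000, 40_000, 5_000_000]})

# q=0.75 is used here only so that the effect is visible on five rows
caps = fit_caps(toy, ["person_income"], q=0.75)
capped = apply_caps(toy, caps)

print(caps)
print(capped["person_income"].tolist())
```

The same `caps` dictionary can then be applied to the test split and stored alongside the other preprocessing artifacts.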


In [3]:
# Outlier handling

# Cap person_age at 100 (erroneous values like 144)
print(f"person_age before capping: min={df['person_age'].min()}, max={df['person_age'].max()}")
df['person_age'] = df['person_age'].clip(upper=100)
print(f"person_age after capping: min={df['person_age'].min()}, max={df['person_age'].max()}")

# Cap person_income at 99th percentile
income_99 = df['person_income'].quantile(0.99)
print(f"\nperson_income before capping: min={df['person_income'].min()}, max={df['person_income'].max()}")
print(f"99th percentile: {income_99:,.0f}")
df['person_income'] = df['person_income'].clip(upper=income_99)
print(f"person_income after capping: min={df['person_income'].min()}, max={df['person_income'].max()}")

# Cap person_emp_exp at 99th percentile
emp_exp_99 = df['person_emp_exp'].quantile(0.99)
print(f"\nperson_emp_exp before capping: min={df['person_emp_exp'].min()}, max={df['person_emp_exp'].max()}")
print(f"99th percentile: {emp_exp_99}")
df['person_emp_exp'] = df['person_emp_exp'].clip(upper=emp_exp_99)
print(f"person_emp_exp after capping: min={df['person_emp_exp'].min()}, max={df['person_emp_exp'].max()}")

# Cap loan_amnt at 99th percentile
loan_amnt_99 = df['loan_amnt'].quantile(0.99)
print(f"\nloan_amnt before capping: min={df['loan_amnt'].min()}, max={df['loan_amnt'].max()}")
print(f"99th percentile: {loan_amnt_99:,.0f}")
df['loan_amnt'] = df['loan_amnt'].clip(upper=loan_amnt_99)
print(f"loan_amnt after capping: min={df['loan_amnt'].min()}, max={df['loan_amnt'].max()}")


person_age before capping: min=20, max=144
person_age after capping: min=20, max=100

person_income before capping: min=8000, max=7200766
99th percentile: 271,450
person_income after capping: min=8000.0, max=271450.0600000004

person_emp_exp before capping: min=0, max=125
99th percentile: 26.0
person_emp_exp after capping: min=0, max=26

loan_amnt before capping: min=500, max=35000
99th percentile: 28,390
loan_amnt after capping: min=500.0, max=28390.34000000007


## Step 3: Categorical Encoding

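Encode categorical variables as integers. Binary and nominal columns use `LabelEncoder`; the ordinal column uses an explicit mapping so that the codes follow the education order: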
- **Binary:** `person_gender`, `previous_loan_defaults_on_file`
- **Ordinal:** `person_education` (High School < Associate < Bachelor < Master < Doctorate)
- **Nominal:** `person_home_ownership`, `loan_intent`
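
To see why the ordinal column gets an explicit mapping, the sketch below compares the two approaches on a toy column (values chosen for illustration). `LabelEncoder` assigns codes in sorted (alphabetical) order, which does not match the education hierarchy. The sketch also checks for labels missing from the mapping, since `Series.map` returns `NaN` for those rather than raising.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values)
education = pd.Series(["Master", "High School", "Bachelor", "High School"])

# LabelEncoder sorts its classes, so codes follow alphabetical order
le = LabelEncoder()
alphabetical_codes = le.fit_transform(education)
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Explicit mapping: codes follow the intended rank instead
education_order = ["High School", "Associate", "Bachelor", "Master", "Doctorate"]
education_mapping = {level: rank for rank, level in enumerate(education_order)}
ordinal_codes = education.map(education_mapping)
print(ordinal_codes.tolist())

# Labels absent from the mapping become NaN, so count them explicitly
unmapped = int(ordinal_codes.isna().sum())
print(f"Unmapped labels: {unmapped}")
```

Note that label-encoding the nominal columns also imposes an arbitrary integer order on them; whether that matters depends on the downstream model.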


In [4]:
# Categorical encoding

# Store encoders for later use
label_encoders = {}

# Binary encoding: person_gender
le_gender = LabelEncoder()
df['person_gender'] = le_gender.fit_transform(df['person_gender'])
label_encoders['person_gender'] = le_gender
print(f"person_gender mapping: {dict(zip(le_gender.classes_, le_gender.transform(le_gender.classes_)))}")

# Binary encoding: previous_loan_defaults_on_file
le_defaults = LabelEncoder()
df['previous_loan_defaults_on_file'] = le_defaults.fit_transform(df['previous_loan_defaults_on_file'])
label_encoders['previous_loan_defaults_on_file'] = le_defaults
print(f"previous_loan_defaults_on_file mapping: {dict(zip(le_defaults.classes_, le_defaults.transform(le_defaults.classes_)))}")

# Ordinal encoding: person_education (with explicit ordering)
education_order = ['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate']
education_mapping = {edu: i for i, edu in enumerate(education_order)}
df['person_education'] = df['person_education'].map(education_mapping)
print(f"person_education mapping: {education_mapping}")

# Nominal encoding: person_home_ownership
le_home = LabelEncoder()
df['person_home_ownership'] = le_home.fit_transform(df['person_home_ownership'])
label_encoders['person_home_ownership'] = le_home
print(f"person_home_ownership mapping: {dict(zip(le_home.classes_, le_home.transform(le_home.classes_)))}")

# Nominal encoding: loan_intent
le_intent = LabelEncoder()
df['loan_intent'] = le_intent.fit_transform(df['loan_intent'])
label_encoders['loan_intent'] = le_intent
print(f"loan_intent mapping: {dict(zip(le_intent.classes_, le_intent.transform(le_intent.classes_)))}")

print(f"\nDataset shape after encoding: {df.shape}")
df.head()


person_gender mapping: {'female': np.int64(0), 'male': np.int64(1)}
previous_loan_defaults_on_file mapping: {'No': np.int64(0), 'Yes': np.int64(1)}
person_education mapping: {'High School': 0, 'Associate': 1, 'Bachelor': 2, 'Master': 3, 'Doctorate': 4}
person_home_ownership mapping: {'MORTGAGE': np.int64(0), 'OTHER': np.int64(1), 'OWN': np.int64(2), 'RENT': np.int64(3)}
loan_intent mapping: {'DEBTCONSOLIDATION': np.int64(0), 'EDUCATION': np.int64(1), 'HOMEIMPROVEMENT': np.int64(2), 'MEDICAL': np.int64(3), 'PERSONAL': np.int64(4), 'VENTURE': np.int64(5)}

Dataset shape after encoding: (45000, 16)


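| | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | empl_len | ppl_household | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 0 | 3 | 71948.0 | 0 | 3 | 28390.34 | 4 | 16.02 | 0.49 | 3 | 561 | 0 | 0 | 3 | 1 |
| 1 | 21 | 0 | 0 | 12282.0 | 0 | 2 | 1000.0 | 1 | 11.14 | 0.08 | 2 | 504 | 1 | 0 | 0 | 0 |
| 2 | 25 | 0 | 0 | 12438.0 | 3 | 0 | 5500.0 | 3 | 12.87 | 0.44 | 3 | 635 | 0 | 3 | 1 | 1 |
| 3 | 23 | 0 | 2 | 79753.0 | 0 | 3 | 28390.34 | 3 | 15.23 | 0.44 | 2 | 675 | 0 | 0 | 1 | 1 |
| 4 | 24 | 1 | 3 | 66135.0 | 1 | 3 | 28390.34 | 3 | 14.27 | 0.53 | 4 | 586 | 0 | 1 | 0 | 1 |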


## Step 4: Feature Engineering

Create targeted derived features:
- `log_income` = log1p(person_income): reduces extreme skewness
- `log_loan_amnt` = log1p(loan_amnt): reduces skewness
- `income_per_household` = person_income / ppl_household (with ppl_household floored at 1 to avoid division by zero): per-capita affordability
- `credit_hist_age_ratio` = cb_person_cred_hist_length / person_age: credit maturity relative to age
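
As a quick worked example of these formulas, the sketch below applies them to two made-up rows. The first row has a household size of 0 to show how flooring the divisor at 1 avoids a division by zero; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Two hypothetical applicants (values are illustrative)
toy = pd.DataFrame({
    "person_income": [0.0, 50_000.0],
    "ppl_household": [0, 4],
    "cb_person_cred_hist_length": [2, 10],
    "person_age": [20, 40],
})

# log1p(x) = log(1 + x), so an income of 0 maps to 0 instead of -inf
toy["log_income"] = np.log1p(toy["person_income"])

# A household size of 0 is treated as 1 before dividing
toy["income_per_household"] = toy["person_income"] / toy["ppl_household"].clip(lower=1)

# Share of the applicant's life covered by credit history
toy["credit_hist_age_ratio"] = toy["cb_person_cred_hist_length"] / toy["person_age"]

print(toy[["log_income", "income_per_household", "credit_hist_age_ratio"]])
```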


In [5]:
# Feature engineering: create derived features

# Log transforms for skewed features
df['log_income'] = np.log1p(df['person_income'])
df['log_loan_amnt'] = np.log1p(df['loan_amnt'])

# Per-capita income (handle division by zero with clip)
df['income_per_household'] = df['person_income'] / df['ppl_household'].clip(lower=1)

# Credit history relative to age (how established is credit relative to life)
df['credit_hist_age_ratio'] = df['cb_person_cred_hist_length'] / df['person_age']

print("New features created:")
print(df[['log_income', 'log_loan_amnt', 'income_per_household', 'credit_hist_age_ratio']].describe())

print(f"\nDataset shape after feature engineering: {df.shape}")


New features created:
         log_income  log_loan_amnt  income_per_household  \
count  45000.000000   45000.000000          45000.000000   
mean      11.118137       8.939693          61222.633930   
std        0.542794       0.707922          43968.957359   
min        8.987322       6.216606           1333.333333   
25%       10.762255       8.517393          30488.458333   
50%       11.113179       8.987322          50352.000000   
75%       11.469916       9.412322          78981.250000   
max       12.511537      10.253839         271450.060000   

       credit_hist_age_ratio  
count           45000.000000  
mean                0.197969  
std                 0.089935  
min                 0.020000  
25%                 0.130435  
50%                 0.173913  
75%                 0.259259  
max                 0.588235  

Dataset shape after feature engineering: (45000, 20)


## Step 5: Train-Test Split

Split data 80/20 with stratification to preserve class proportions (77.8% rejected, 22.2% approved).
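
The effect of `stratify` can be checked on a small synthetic target with the same 7:2 class ratio. This is a sketch on made-up data (900 rows, `random_state=0` chosen arbitrarily), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with a 700:200 class ratio (about 77.8% / 22.2%)
y = np.array([0] * 700 + [1] * 200)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With stratification, each split keeps roughly the same positive rate
print(f"Overall positive rate: {y.mean():.4f}")
print(f"Train positive rate:   {y_tr.mean():.4f}")
print(f"Test positive rate:    {y_te.mean():.4f}")
```

Without `stratify=y`, the class proportions in each split would vary from one random seed to the next.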


In [6]:
# Separate features and target
target_col = 'loan_status'
X = df.drop(columns=[target_col])
y = df[target_col]

# Define feature columns (for reference)
feature_names = X.columns.tolist()
print(f"Feature columns ({len(feature_names)}):")
print(feature_names)

print(f"\nTarget distribution (before split):")
print(y.value_counts())
print(f"\nTarget proportions:")
print(y.value_counts(normalize=True).round(4))


Feature columns (19):
['person_age', 'person_gender', 'person_education', 'person_income', 'person_emp_exp', 'person_home_ownership', 'loan_amnt', 'loan_intent', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length', 'credit_score', 'previous_loan_defaults_on_file', 'empl_len', 'ppl_household', 'log_income', 'log_loan_amnt', 'income_per_household', 'credit_hist_age_ratio']

Target distribution (before split):
loan_status
0    35000
1    10000
Name: count, dtype: int64

Target proportions:
loan_status
0    0.7778
1    0.2222
Name: proportion, dtype: float64


In [7]:
# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")

print(f"\nTraining set class distribution:")
print(y_train.value_counts())
print(f"\nTest set class distribution:")
print(y_test.value_counts())


Training set: 36,000 samples
Test set: 9,000 samples

Training set class distribution:
loan_status
0    28000
1     8000
Name: count, dtype: int64

Test set class distribution:
loan_status
0    7000
1    2000
Name: count, dtype: int64


## Step 6: Feature Scaling

Apply StandardScaler for Logistic Regression and ANN (fit on train, transform both).
Keep unscaled versions available for Random Forest (scale-invariant).
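
The "fit on train, transform both" rule means the test set is standardized with the training mean and standard deviation, not its own. A minimal sketch on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

print(f"Learned mean: {scaler.mean_[0]}, learned std: {scaler.scale_[0]:.4f}")
print(f"Scaled test value: {X_test_scaled[0, 0]:.4f}")
```

`StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `describe()` reports the sample standard deviation (`ddof=1`). This is why the `std` row printed for the scaled training data can differ very slightly from exactly 1.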


In [8]:
# Feature scaling with StandardScaler

scaler = StandardScaler()

# Fit on training data only, then transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_names, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_names, index=X_test.index)

# Keep unscaled versions for Random Forest
X_train_unscaled = X_train.copy()
X_test_unscaled = X_test.copy()

print("Scaled training data (first 5 rows):")
print(X_train_scaled.head())
print(f"\nScaled data stats (should have mean≈0, std≈1):")
print(X_train_scaled.describe().loc[['mean', 'std']].round(4))


Scaled training data (first 5 rows):
       person_age  person_gender  person_education  person_income  \
25657    0.373712      -1.109645         -1.284542       0.663573   
37383   -0.796692      -1.109645         -1.284542       0.024500   
18047    0.540913       0.901189          0.568831       3.340855   
39088    0.540913      -1.109645         -0.357855       0.056550   
6789    -0.963893       0.901189          0.568831      -1.066504   

       person_emp_exp  person_home_ownership  loan_amnt  loan_intent  \
25657        0.650322              -1.180423   0.204494     0.854580   
37383       -0.237745              -1.180423  -0.688087    -1.460223   
18047        1.360775              -1.180423  -0.087411    -1.460223   
39088        1.005548               0.903017  -1.134054    -0.881522   
6789        -0.592972               0.903017  -0.411750     1.433281   

       loan_int_rate  loan_percent_income  cb_person_cred_hist_length  \
25657       0.330700            -0.457040 

## Step 7: Class Imbalance Handling

Apply SMOTE to training data only to balance classes (~50/50).
Test data remains untouched to preserve real-world distribution for evaluation.
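
For intuition, SMOTE creates each synthetic minority sample by interpolating between a minority row and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea only; it is not imblearn's implementation, and `smote_like_oversample` is a hypothetical name.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation, for illustration only.

    For each synthetic row: pick a minority row, pick one of its k nearest
    minority neighbours, and sample a point on the segment between them.
    """
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Four minority points on the corners of the unit square (toy data)
X_min = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=6)
print(synthetic.round(3))
```

Because new points are interpolated, integer-coded categorical columns can receive non-integer values after resampling. imblearn also provides `SMOTENC` for mixed numeric and categorical data, which may be worth considering.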


In [9]:
# Apply SMOTE to handle class imbalance (training data only)

smote = SMOTE(random_state=RANDOM_STATE)

# Apply to scaled data (for Logistic Regression and ANN)
X_train_scaled_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

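# Apply to unscaled data (for Random Forest); the returned labels are discarded
# because they match y_train_res (same y_train and same random_state)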
X_train_unscaled_res, _ = smote.fit_resample(X_train_unscaled, y_train)

print("Class distribution BEFORE SMOTE:")
print(y_train.value_counts())
print(f"Total: {len(y_train):,}")

print("\nClass distribution AFTER SMOTE:")
print(pd.Series(y_train_res).value_counts())
print(f"Total: {len(y_train_res):,}")

print(f"\nSynthetic samples added: {len(y_train_res) - len(y_train):,}")


Class distribution BEFORE SMOTE:
loan_status
0    28000
1     8000
Name: count, dtype: int64
Total: 36,000

Class distribution AFTER SMOTE:
loan_status
0    28000
1    28000
Name: count, dtype: int64
Total: 56,000

Synthetic samples added: 20,000


## Step 8: Save Artifacts

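Save the fitted preprocessors for use in Phase 3 (Model Training). The processed train/test splits are not written to disk in this notebook; they remain available as in-memory variables.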


In [10]:
# Save preprocessing artifacts

# Bundle all preprocessors together
preprocessors = {
    'scaler': scaler,
    'label_encoders': label_encoders,
    'education_mapping': education_mapping,
    'feature_names': feature_names,
    'random_state': RANDOM_STATE,
    'outlier_caps': {
        'person_age': 100,
        'person_income': income_99,
        'person_emp_exp': emp_exp_99,
        'loan_amnt': loan_amnt_99
    }
}

# Save to models directory
preprocess_path = models_dir / 'preprocess.pkl'
joblib.dump(preprocessors, preprocess_path)
print(f"Preprocessors saved to: {preprocess_path}")

# Verify saved file
loaded = joblib.load(preprocess_path)
print(f"\nSaved artifacts keys: {list(loaded.keys())}")


Preprocessors saved to: /Users/ziadalwazzan/Documents/MSc/COMP0198-ML/ml-personal-loan-classification-models/models/preprocess.pkl

Saved artifacts keys: ['scaler', 'label_encoders', 'education_mapping', 'feature_names', 'random_state', 'outlier_caps']


## Step 9: Documentation

### Summary of Feature Engineering Pipeline


In [11]:
# Final summary

print("=" * 60)
print("FEATURE ENGINEERING SUMMARY")
print("=" * 60)

print("\n📊 FINAL FEATURE SET:")
print("-" * 40)
for i, name in enumerate(feature_names, 1):
    print(f"  {i:2}. {name}")

print(f"\n📐 TRANSFORMATIONS APPLIED:")
print("-" * 40)
print("  • Outlier capping: person_age (≤100), person_income, person_emp_exp, loan_amnt (≤99th percentile)")
print("  • Categorical encoding: LabelEncoder for all categorical variables")
print("  • Ordinal encoding: person_education (High School=0 → Doctorate=4)")
print("  • Feature engineering: log_income, log_loan_amnt, income_per_household, credit_hist_age_ratio")
print("  • Feature scaling: StandardScaler (mean=0, std=1)")

print(f"\n⚖️ CLASS BALANCE:")
print("-" * 40)
print(f"  Before SMOTE: {y_train.value_counts()[0]:,} rejected, {y_train.value_counts()[1]:,} approved")
print(f"  After SMOTE:  {pd.Series(y_train_res).value_counts()[0]:,} rejected, {pd.Series(y_train_res).value_counts()[1]:,} approved")

print(f"\n📦 DATA SHAPES FOR PHASE 3:")
print("-" * 40)
print(f"  X_train_scaled_res: {X_train_scaled_res.shape} (balanced, scaled - for LR, ANN)")
print(f"  X_train_unscaled_res: {X_train_unscaled_res.shape} (balanced, unscaled - for RF)")
print(f"  X_test_scaled: {X_test_scaled.shape} (scaled test set)")
print(f"  X_test_unscaled: {X_test_unscaled.shape} (unscaled test set)")
print(f"  y_train_res: {y_train_res.shape} (balanced training labels)")
print(f"  y_test: {y_test.shape} (test labels)")

print(f"\n💾 SAVED ARTIFACTS:")
print("-" * 40)
print(f"  • {preprocess_path}")

print("\n" + "=" * 60)
print("Ready for Phase 3: Model Training")
print("=" * 60)


FEATURE ENGINEERING SUMMARY

📊 FINAL FEATURE SET:
----------------------------------------
   1. person_age
   2. person_gender
   3. person_education
   4. person_income
   5. person_emp_exp
   6. person_home_ownership
   7. loan_amnt
   8. loan_intent
   9. loan_int_rate
  10. loan_percent_income
  11. cb_person_cred_hist_length
  12. credit_score
  13. previous_loan_defaults_on_file
  14. empl_len
  15. ppl_household
  16. log_income
  17. log_loan_amnt
  18. income_per_household
  19. credit_hist_age_ratio

📐 TRANSFORMATIONS APPLIED:
----------------------------------------
  • Outlier capping: person_age (≤100), person_income, person_emp_exp, loan_amnt (≤99th percentile)
  • Categorical encoding: LabelEncoder for all categorical variables
  • Ordinal encoding: person_education (High School=0 → Doctorate=4)
  • Feature engineering: log_income, log_loan_amnt, income_per_household, credit_hist_age_ratio
  • Feature scaling: StandardScaler (mean=0, std=1)

⚖️

### Variables Ready for Phase 3 (Model Training)

| Variable | Description | Use For |
|----------|-------------|---------|
| `X_train_scaled_res` | Balanced, scaled training features | Logistic Regression, ANN |
| `X_train_unscaled_res` | Balanced, unscaled training features | Random Forest |
| `X_test_scaled` | Scaled test features | Logistic Regression, ANN |
| `X_test_unscaled` | Unscaled test features | Random Forest |
| `y_train_res` | Balanced training labels | All models |
| `y_test` | Test labels (original distribution) | All models |
| `feature_names` | List of feature column names | Feature importance, interpretation |

### Preprocessing Artifacts

Saved to `models/preprocess.pkl`:
- `scaler`: StandardScaler fitted on training data
- `label_encoders`: Dictionary of LabelEncoders for categorical columns
- `education_mapping`: Ordinal mapping for education levels
- `feature_names`: List of final feature names
- `random_state`: Random seed used for the split and SMOTE
- `outlier_caps`: Dictionary of capping thresholds
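
As a rough illustration of how such a bundle might be replayed on new raw rows, the sketch below rebuilds a small bundle with the same keys in memory and applies it in the notebook's order: cap, encode, then scale. It makes several assumptions: `transform_raw` is a hypothetical helper, the toy bundle covers only three columns, and the derived features (`log_income` and the others) are left out for brevity, so a real replay would need to recompute them before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def transform_raw(raw: pd.DataFrame, bundle: dict) -> pd.DataFrame:
    """Replay capping, encoding and scaling on raw rows (derived features omitted)."""
    out = raw.copy()
    for col, cap in bundle["outlier_caps"].items():
        if col in out:
            out[col] = out[col].clip(upper=cap)
    for col, encoder in bundle["label_encoders"].items():
        if col in out:
            out[col] = encoder.transform(out[col])
    if "person_education" in out:
        out["person_education"] = out["person_education"].map(bundle["education_mapping"])
    # Column order must match the order the scaler was fitted on
    out = out[bundle["feature_names"]]
    scaled = bundle["scaler"].transform(out)
    return pd.DataFrame(scaled, columns=bundle["feature_names"], index=out.index)


# Build a toy bundle (hypothetical training rows, three columns only)
train = pd.DataFrame({
    "person_age": [20, 30, 40],
    "person_gender": ["female", "male", "female"],
    "person_education": ["High School", "Master", "Bachelor"],
})
feature_names = ["person_age", "person_gender", "person_education"]
education_order = ["High School", "Associate", "Bachelor", "Master", "Doctorate"]
education_mapping = {level: rank for rank, level in enumerate(education_order)}
le_gender = LabelEncoder().fit(train["person_gender"])

encoded = train.copy()
encoded["person_gender"] = le_gender.transform(encoded["person_gender"])
encoded["person_education"] = encoded["person_education"].map(education_mapping)

bundle = {
    "scaler": StandardScaler().fit(encoded[feature_names]),
    "label_encoders": {"person_gender": le_gender},
    "education_mapping": education_mapping,
    "feature_names": feature_names,
    "outlier_caps": {"person_age": 100},
}

# A new raw row with an out-of-range age
new_row = pd.DataFrame({
    "person_age": [144],
    "person_gender": ["male"],
    "person_education": ["Master"],
})
result = transform_raw(new_row, bundle)
print(result)
```

In Phase 3 the bundle would come from `joblib.load` on `models/preprocess.pkl` rather than being built in memory.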
