# 02 - Feature Engineering
## Credit Scoring Model Project

**Learning Objectives:**
- Handle missing values systematically
- Create domain-based features using business knowledge
- Encode categorical variables appropriately
- Scale and normalize numerical features
- Perform feature selection
- Prepare data for modeling

**What is Feature Engineering?**
Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work better. It's often the difference between a mediocre model and a great one!

**Quote:** "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." - Andrew Ng

Let's build powerful features!

## 📦 Import Libraries and Utilities



In [1]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

# Sklearn preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Our custom utilities
import sys
sys.path.append('../')
from src.data_preprocessing import (
    load_data,
    analyze_missing_values,
    handle_missing_values,
    validate_data_quality
)
from src.domain_features import create_domain_features
from src.feature_engineering import encode_categorical_features, clean_column_names, scale_features
from src.feature_selection import select_features

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("[OK] All libraries imported successfully!")


[OK] All libraries imported successfully!



## 📂 Load Data

We'll load the data we explored in the EDA notebook.



In [2]:
# Load data using our utility
train_df, test_df = load_data()

print(f"Training set: {train_df.shape}")
print(f"Test set: {test_df.shape}")
print(f"\nTarget distribution:")
print(train_df['TARGET'].value_counts(normalize=True))


Loading data from: C:\Users\shahu\OPEN CLASSROOMS\PROJET 6\Scoring_Model\notebooks\..\data

LOADING AND AGGREGATING ALL DATA SOURCES

1. Loading main application tables...


   Train: (307511, 122), Test: (48744, 121)

2. Loading bureau data...


   Bureau: (1716428, 17)
   Processing bureau_balance.csv in chunks...


     Processed chunk 1 (1,000,000 rows)


     Processed chunk 2 (1,000,000 rows)


     Processed chunk 3 (1,000,000 rows)


     Processed chunk 4 (1,000,000 rows)


     Processed chunk 5 (1,000,000 rows)


     Processed chunk 6 (1,000,000 rows)


     Processed chunk 7 (1,000,000 rows)


     Processed chunk 8 (1,000,000 rows)


     Processed chunk 9 (1,000,000 rows)


     Processed chunk 10 (1,000,000 rows)


     Processed chunk 11 (1,000,000 rows)


     Processed chunk 12 (1,000,000 rows)


     Processed chunk 13 (1,000,000 rows)


     Processed chunk 14 (1,000,000 rows)


     Processed chunk 15 (1,000,000 rows)


     Processed chunk 16 (1,000,000 rows)


     Processed chunk 17 (1,000,000 rows)


     Processed chunk 18 (1,000,000 rows)


     Processed chunk 19 (1,000,000 rows)


     Processed chunk 20 (1,000,000 rows)


     Processed chunk 21 (1,000,000 rows)


     Processed chunk 22 (1,000,000 rows)


     Processed chunk 23 (1,000,000 rows)


     Processed chunk 24 (1,000,000 rows)


     Processed chunk 25 (1,000,000 rows)


     Processed chunk 26 (1,000,000 rows)


     Processed chunk 27 (1,000,000 rows)


     Processed chunk 28 (299,925 rows)
   Combining 28 chunks...


   Bureau Balance aggregated: (817395, 5)
  Aggregating bureau data...


    Created 37 bureau features

3. Loading previous applications...


   Previous applications: (1670214, 37)
  Aggregating previous applications...


    Created 56 previous application features

4. Loading POS/cash balances...


   POS/cash: (10001358, 8)
  Aggregating POS/cash balances...


    Created 20 POS/cash features

5. Loading credit card balances...


   Credit card: (3840312, 23)
  Aggregating credit card balances...


    Created 52 credit card features

6. Loading installment payments...
   Processing installments_payments.csv in chunks...


     Processed chunk 1 (2,000,000 rows)


     Processed chunk 2 (2,000,000 rows)


     Processed chunk 3 (2,000,000 rows)


     Processed chunk 4 (2,000,000 rows)


     Processed chunk 5 (2,000,000 rows)


     Processed chunk 6 (2,000,000 rows)


     Processed chunk 7 (1,605,401 rows)
   Combining 7 chunks...


   Installments aggregated: (339587, 32)
    Created 31 installment features

7. Merging all aggregated features...



[SUCCESS] Final shapes:
   Train: (307511, 318)
   Test: (48744, 317)
   Total features: 316 (excluding SK_ID_CURR and TARGET)

[OK] Data validation passed
Training set: (307511, 318)
Test set: (48744, 317)

Target distribution:
TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64



## 🔍 Handle Missing Values

**Strategy:**
1. Identify features with excessive missing values (>70%) → Consider dropping
2. Create missing indicators for important features
3. Impute remaining missing values appropriately

**Educational Note:**
Different imputation strategies work for different scenarios:
- **Median:** For numerical features with outliers (robust)
- **Mean:** For numerical features with a roughly symmetric distribution and no strong outliers
- **Mode:** For categorical features
- **Constant:** When missingness has meaning (e.g., 0 for no car)
- **Missing Indicator:** Preserve information about missingness
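
To make the last two strategies concrete, here is a minimal sketch of median imputation combined with missing indicators, using scikit-learn's `SimpleImputer` with `add_indicator=True`. The toy columns (`income`, `car_age`) and the `_missing` suffix are illustrative assumptions, not the project's schema or the behavior of `handle_missing_values`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; column names are illustrative assumptions.
toy = pd.DataFrame({
    "income": [40_000.0, np.nan, 55_000.0, 1_000_000.0],
    "car_age": [3.0, np.nan, np.nan, 10.0],
})

# Median imputation plus one binary indicator per column that had gaps.
imputer = SimpleImputer(strategy="median", add_indicator=True)
values = imputer.fit_transform(toy)

# Indicator columns are appended after the imputed columns, in the order
# given by imputer.indicator_.features_ (indices of columns with gaps).
indicator_names = [
    f"{toy.columns[i]}_missing" for i in imputer.indicator_.features_
]
imputed = pd.DataFrame(values, columns=list(toy.columns) + indicator_names)
print(imputed)
```

The median keeps the extreme income from dragging the fill value upward, and the indicator columns let a model learn from the fact that a value was missing at all.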



In [3]:
# Analyze missing values
missing_summary = analyze_missing_values(train_df, threshold=0)

# Separate features by missing percentage
high_missing = missing_summary[missing_summary['missing_percent'] > 70]['column'].tolist()
medium_missing = missing_summary[(missing_summary['missing_percent'] > 20) &
                                 (missing_summary['missing_percent'] <= 70)]['column'].tolist()
low_missing = missing_summary[(missing_summary['missing_percent'] > 0) &
                              (missing_summary['missing_percent'] <= 20)]['column'].tolist()

print(f"\n[DECISION SUMMARY]")
print(f"High missing (>70%): {len(high_missing)} features - CONSIDER DROPPING")
print(f"Medium missing (20-70%): {len(medium_missing)} features - CREATE INDICATORS + IMPUTE")
print(f"Low missing (<20%): {len(low_missing)} features - IMPUTE ONLY")

# Let's drop very sparse features
print(f"\nDropping {len(high_missing)} features with >70% missing...")
train_df = train_df.drop(columns=high_missing)
test_df = test_df.drop(columns=[col for col in high_missing if col in test_df.columns])

print(f"New training shape: {train_df.shape}")


MISSING VALUES ANALYSIS
Total features: 318
Features with missing > 0%: 263
\nTop features with missing values:
                            column  missing_count  missing_percent   dtype
 PREV_RATE_INTEREST_PRIVILEGED_MIN         302902        98.501192 float64
   PREV_RATE_INTEREST_PRIMARY_MEAN         302902        98.501192 float64
    PREV_RATE_INTEREST_PRIMARY_MIN         302902        98.501192 float64
    PREV_RATE_INTEREST_PRIMARY_MAX         302902        98.501192 float64
PREV_RATE_INTEREST_PRIVILEGED_MEAN         302902        98.501192 float64
 PREV_RATE_INTEREST_PRIVILEGED_MAX         302902        98.501192 float64
        CC_AMT_PAYMENT_CURRENT_MIN         246451        80.143800 float64
        CC_AMT_PAYMENT_CURRENT_MAX         246451        80.143800 float64
       CC_AMT_PAYMENT_CURRENT_MEAN         246451        80.143800 float64
  CC_AMT_DRAWINGS_POS_CURRENT_MEAN         246371        80.117784 float64
  CC_AMT_DRAWINGS_ATM_CURRENT_MEAN         246371        80.117

New training shape: (307511, 258)



## 🎯 Create Domain-Based Features

**Domain Knowledge in Credit Scoring:**
Key financial ratios and indicators matter:
- **Debt-to-Income Ratio:** How much debt relative to income?
- **Credit Utilization:** How much of available credit is used?
- **Payment Burden:** Can they afford the payments?
- **Age/Employment Stability:** Risk indicators

Let's create features that capture these concepts!
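
As a rough illustration of what such ratios can look like, here is a small sketch. The helper `add_ratio_features` is hypothetical and the formulas are assumptions; the actual `create_domain_features` in `src.domain_features` may define these features differently. The sketch also assumes the `DAYS_*` columns count days before the application as negative numbers and treats positive `DAYS_EMPLOYED` values as missing.

```python
import numpy as np
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a few ratio features (formulas are assumptions)."""
    out = df.copy()
    # Replace zeros with NaN so a zero denominator yields NaN instead of inf.
    income = out["AMT_INCOME_TOTAL"].replace(0, np.nan)
    family = out["CNT_FAM_MEMBERS"].replace(0, np.nan)

    out["DEBT_TO_INCOME_RATIO"] = out["AMT_CREDIT"] / income
    out["ANNUITY_TO_INCOME_RATIO"] = out["AMT_ANNUITY"] / income
    out["INCOME_PER_PERSON"] = out["AMT_INCOME_TOTAL"] / family

    # Assumption: DAYS_* values are negative day counts before application.
    out["AGE_YEARS"] = -out["DAYS_BIRTH"] / 365.25
    employed = out["DAYS_EMPLOYED"].where(out["DAYS_EMPLOYED"] <= 0)
    out["EMPLOYMENT_YEARS"] = -employed / 365.25
    return out

toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100_000.0, 0.0],
    "AMT_CREDIT": [400_000.0, 50_000.0],
    "AMT_ANNUITY": [20_000.0, 5_000.0],
    "CNT_FAM_MEMBERS": [2.0, 1.0],
    "DAYS_BIRTH": [-7305, -14610],
    "DAYS_EMPLOYED": [-730.5, 365243.0],
})
result = add_ratio_features(toy)
print(result.filter(regex="RATIO|YEARS|PER_PERSON"))
```

Guarding the denominators matters here: a zero income would otherwise produce infinite ratios, which break scaling and many models later on.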



In [4]:
# Use our new modular function
print("="*80)
train_df = create_domain_features(train_df)
test_df = create_domain_features(test_df)
print("="*80)

print(f"\nNew shape after feature creation:")
print(f"Training: {train_df.shape}")
print(f"Test: {test_df.shape}")




Creating domain features...
  [OK] Age features created
  [OK] Employment features created
  [OK] Income features created
  [OK] Credit features created
  [OK] Family features created
  [OK] Document features created


  [OK] External source features created
  [OK] Regional features created
Creating domain features...
  [OK] Age features created
  [OK] Employment features created
  [OK] Income features created
  [OK] Credit features created
  [OK] Family features created
  [OK] Document features created


  [OK] External source features created
  [OK] Regional features created

New shape after feature creation:
Training: (307511, 273)
Test: (48744, 272)



## 🏷️ Encode Categorical Variables

**Encoding Strategies:**

1. **Label Encoding:** For ordinal categories (order matters)
   - Example: Education level (Low → Medium → High)

2. **One-Hot Encoding:** For nominal categories (no order)
   - Example: Contract type (Cash, Revolving)
   - Creates binary columns for each category

3. **Target Encoding:** For high-cardinality features
   - Encode by mean target value per category
   - Use with caution (risk of overfitting!)

**Rule of Thumb:**
- Low cardinality (<10 categories) → One-Hot Encoding
- High cardinality (>10 categories) → Target Encoding or drop
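
The rule of thumb can be sketched as follows. This is only an illustration under stated assumptions: `one_hot_low_cardinality` is a hypothetical helper, not the project's `encode_categorical_features`, and it simply drops high-cardinality columns instead of target-encoding them. It expects feature columns only, so a target column should be set aside first.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def one_hot_low_cardinality(train, test, cardinality_limit=10):
    """Hypothetical helper: one-hot encode low-cardinality columns, drop the rest."""
    # Treat every non-numeric column as categorical (booleans count as numeric).
    cat_cols = [c for c in train.columns if not is_numeric_dtype(train[c])]
    low = [c for c in cat_cols if train[c].nunique() <= cardinality_limit]
    high = [c for c in cat_cols if c not in low]

    train_enc = pd.get_dummies(train.drop(columns=high), columns=low, dtype=int)
    test_enc = pd.get_dummies(test.drop(columns=high), columns=low, dtype=int)

    # Categories seen only in train become all-zero columns in test;
    # categories seen only in test are discarded.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

train = pd.DataFrame({
    "amount": [1, 2, 3],
    "contract": ["Cash", "Revolving", "Cash"],
    "org": ["a", "b", "c"],
})
test = pd.DataFrame({
    "amount": [4, 5],
    "contract": ["Cash", "Cash"],
    "org": ["a", "z"],
})
train_enc, test_enc = one_hot_low_cardinality(train, test, cardinality_limit=2)
print(train_enc.columns.tolist())
print(test_enc)
```

Aligning the test columns to the training columns is the step that is easiest to forget: without it, a category missing from one split leaves the two frames with different shapes.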



In [5]:
# Use our new modular function
train_df, test_df = encode_categorical_features(train_df, test_df, cardinality_limit=10)

# Clean column names for LightGBM compatibility
print("\nCleaning column names...")
train_df = clean_column_names(train_df)
test_df = clean_column_names(test_df)

print(f"\nFinal shape after handling all categoricals:")
print(f"Training: {train_df.shape}")
print(f"Test: {test_df.shape}")


Found 16 categorical columns



One-hot encoding 14 low-cardinality features...



Dropping these features as they have too many categories for one-hot encoding...


  [OK] Dropped 2 high-cardinality features

Cleaning column names...

Final shape after handling all categoricals:
Training: (307511, 307)
Test: (48744, 306)



## 🔧 Impute Remaining Missing Values

Now we'll impute the remaining missing values using appropriate strategies.



In [6]:
# Identify columns with missing values
missing_cols = train_df.columns[train_df.isnull().any()].tolist()
if 'TARGET' in missing_cols:
    missing_cols.remove('TARGET')

print(f"Columns with missing values: {len(missing_cols)}")

# Separate numerical and categorical
numerical_missing = [col for col in missing_cols
                     if train_df[col].dtype in ['int64', 'float64']]
categorical_missing = [col for col in missing_cols
                       if train_df[col].dtype == 'object']

print(f"  Numerical: {len(numerical_missing)}")
print(f"  Categorical: {len(categorical_missing)}")

# Impute numerical with median
if numerical_missing:
    print(f"\nImputing {len(numerical_missing)} numerical features with median...")
    imputer = SimpleImputer(strategy='median')
    train_df[numerical_missing] = imputer.fit_transform(train_df[numerical_missing])
    test_df[numerical_missing] = imputer.transform(test_df[numerical_missing])
    print("  [OK] Numerical imputation complete")

# Impute categorical with most frequent
if categorical_missing:
    print(f"\nImputing {len(categorical_missing)} categorical features with mode...")
    imputer = SimpleImputer(strategy='most_frequent')
    train_df[categorical_missing] = imputer.fit_transform(train_df[categorical_missing])
    test_df[categorical_missing] = imputer.transform(test_df[categorical_missing])
    print("  [OK] Categorical imputation complete")

# Verify no missing values remain
print(f"\n[VERIFICATION]")
print(f"Training missing: {train_df.isnull().sum().sum()}")
print(f"Test missing: {test_df.isnull().sum().sum()}")


Columns with missing values: 207
  Numerical: 207
  Categorical: 0

Imputing 207 numerical features with median...


  [OK] Numerical imputation complete

[VERIFICATION]


Training missing: 0
Test missing: 0



## 🎯 Feature Selection

**Why Feature Selection?**
1. Reduce overfitting (simpler models generalize better)
2. Improve model performance (remove noise)
3. Reduce training time
4. Improve interpretability

**Strategies:**
1. Remove low-variance features (constant or near-constant)
2. Remove highly correlated features (redundant information)
3. Use feature importance from baseline models
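
The first two strategies can be sketched in a few lines. The helper name and both thresholds (`0.01` and `0.95`) are assumptions chosen for illustration; the project's `select_features` may use different values and logic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def drop_low_variance_and_correlated(X, variance_threshold=0.01,
                                     correlation_threshold=0.95):
    """Hypothetical helper: variance filter followed by a correlation filter."""
    # 1. Keep only columns whose variance exceeds the threshold.
    selector = VarianceThreshold(threshold=variance_threshold)
    selector.fit(X)
    X = X[X.columns[selector.get_support()]]

    # 2. For each highly correlated pair, drop the later column.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns
               if (upper[c] > correlation_threshold).any()]
    return X.drop(columns=to_drop)

rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,              # perfectly correlated with "a"
    "c": rng.normal(size=200),   # independent noise
    "const": 1.0,                # zero variance
})
selected = drop_low_variance_and_correlated(toy)
print(selected.columns.tolist())
```

Looking only at the upper triangle of the correlation matrix means each pair is checked once, so one member of a correlated pair always survives.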



In [7]:
# Separate features and target
X_train = train_df.drop(columns=['SK_ID_CURR', 'TARGET'])
y_train = train_df['TARGET']
X_test = test_df.drop(columns=['SK_ID_CURR'])

# Use our new modular function
X_train, X_test = select_features(X_train, X_test)

print(f"[SUCCESS] Removed {train_df.shape[1] - X_train.shape[1] - 2} features")


Features before selection: 305

1. Removing low-variance features...


   Found 80 low-variance features to remove



2. Removing highly correlated features (>{correlation_threshold})...


   Found 36 highly correlated features to remove



Features after selection: 189
[SUCCESS] Removed 116 features



## ⚖️ Scale Features

**Why Scaling?**
- Features have different ranges (income vs children count)
- Many ML algorithms are sensitive to feature scales
- Required for: Logistic Regression, SVM, Neural Networks
- Optional for: Tree-based models (Random Forest, XGBoost)

**Scaling Methods:**
1. **StandardScaler:** Mean=0, Std=1 (a solid default; works best when features are roughly bell-shaped)
2. **MinMaxScaler:** Scale to [0, 1] (use when a bounded range is needed; sensitive to outliers)
3. **RobustScaler:** Uses median and IQR (robust to outliers)

We'll use StandardScaler as it works well for most cases.



In [8]:
# Use our new modular function
X_train_scaled, X_test_scaled = scale_features(X_train, X_test)

print(f"\nScaled feature statistics:")
print(X_train_scaled.describe().loc[['mean', 'std']].round(3))


Scaling features with StandardScaler...


[OK] Scaling complete

Scaled feature statistics:


      CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  DAYS_BIRTH  \
mean          -0.0              -0.0        -0.0         -0.0         0.0   
std            1.0               1.0         1.0          1.0         1.0   

      DAYS_EMPLOYED  DAYS_REGISTRATION  DAYS_ID_PUBLISH  OWN_CAR_AGE  \
mean            0.0               -0.0              0.0         -0.0   
std             1.0                1.0              1.0          1.0   

      FLAG_EMP_PHONE  FLAG_WORK_PHONE  FLAG_PHONE  FLAG_EMAIL  \
mean             0.0              0.0         0.0        -0.0   
std              1.0              1.0         1.0         1.0   

      CNT_FAM_MEMBERS  REGION_RATING_CLIENT  HOUR_APPR_PROCESS_START  \
mean              0.0                   0.0                     -0.0   
std               1.0                   1.0                      1.0   

      REG_REGION_NOT_LIVE_REGION  REG_REGION_NOT_WORK_REGION  \
mean                         0.0                         0.0   
std       


## 🔀 Create Train-Validation Split

**Important:** Use **STRATIFIED** split to preserve class distribution!

We'll create:
- Training set (70%): For training models
- Validation set (30%): For hyperparameter tuning and model selection
- Test set: Already separate (for final evaluation)
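
One point worth double-checking in a workflow like this one: when the imputer and scaler are fitted on all training rows before the split, the validation rows influence the medians, means, and standard deviations. A minimal sketch of the alternative, on synthetic data with made-up column names, splits first and fits the preprocessing on the training portion only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 features, about 8% positives, some missing values.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
X.iloc[::10, 0] = np.nan
y = pd.Series((rng.random(1000) < 0.08).astype(int), name="TARGET")

# Split first, keeping the class ratio identical in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit imputation and scaling on the training part only, then reuse them.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_tr_p = preprocess.fit_transform(X_tr)
X_val_p = preprocess.transform(X_val)

print(X_tr_p.shape, X_val_p.shape)
print(round(y_tr.mean(), 4), round(y_val.mean(), 4))
```

On a dataset of this size the difference in the fitted statistics is usually small, but keeping the validation rows out of every `fit` call makes the validation scores easier to trust.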



In [9]:
# Stratified split
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_scaled,
    y_train,
    test_size=0.3,
    stratify=y_train,
    random_state=RANDOM_STATE
)

print("Train-Validation Split:")
print(f"  Training: {X_train_split.shape}")
print(f"  Validation: {X_val_split.shape}")
print(f"  Test: {X_test_scaled.shape}")

print(f"\nClass distribution verification:")
print(f"  Original: {y_train.value_counts(normalize=True).to_dict()}")
print(f"  Training: {y_train_split.value_counts(normalize=True).to_dict()}")
print(f"  Validation: {y_val_split.value_counts(normalize=True).to_dict()}")

print("\n[OK] Class distribution preserved!")


Train-Validation Split:
  Training: (215257, 189)
  Validation: (92254, 189)
  Test: (48744, 189)

Class distribution verification:
  Original: {0: 0.9192711805431351, 1: 0.08072881945686496}
  Training: {0: 0.9192732408237595, 1: 0.08072675917624049}
  Validation: {0: 0.9192663732737876, 1: 0.08073362672621241}

[OK] Class distribution preserved!



## 💾 Save Processed Data

Save the processed data so we can use it in the next notebooks.



In [10]:
# Create processed data directory
processed_dir = Path('../data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)  # also create ../data if it is missing

# Save datasets
print("Saving processed datasets...")

X_train_split.to_csv(processed_dir / 'X_train.csv', index=False)
X_val_split.to_csv(processed_dir / 'X_val.csv', index=False)
X_test_scaled.to_csv(processed_dir / 'X_test.csv', index=False)

y_train_split.to_csv(processed_dir / 'y_train.csv', index=False, header=True)
y_val_split.to_csv(processed_dir / 'y_val.csv', index=False, header=True)

# Save feature names
pd.DataFrame({'feature': X_train_split.columns}).to_csv(
    processed_dir / 'feature_names.csv', index=False
)

# Save IDs
train_df[['SK_ID_CURR']].iloc[X_train_split.index].to_csv(
    processed_dir / 'train_ids.csv', index=False
)
train_df[['SK_ID_CURR']].iloc[X_val_split.index].to_csv(
    processed_dir / 'val_ids.csv', index=False
)
test_df[['SK_ID_CURR']].to_csv(processed_dir / 'test_ids.csv', index=False)

print("[OK] All datasets saved!")
print(f"\nSaved files:")
print(f"  - X_train.csv: {X_train_split.shape}")
print(f"  - X_val.csv: {X_val_split.shape}")
print(f"  - X_test.csv: {X_test_scaled.shape}")
print(f"  - y_train.csv, y_val.csv")
print(f"  - feature_names.csv: {len(X_train_split.columns)} features")
print(f"  - IDs for each split")


Saving processed datasets...


[OK] All datasets saved!

Saved files:
  - X_train.csv: (215257, 189)
  - X_val.csv: (92254, 189)
  - X_test.csv: (48744, 189)
  - y_train.csv, y_val.csv
  - feature_names.csv: 189 features
  - IDs for each split



## 📝 Feature Engineering Summary

### ✅ What We Accomplished

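1. **Handled Missing Values**
   - Dropped 60 features with >70% missing (318 → 258 columns)
   - Left missing indicators as a possible extension (no cell above creates them)
   - Imputed the remaining 207 numerical columns with the median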

2. **Created Domain Features** (15 new features)
   - Age and employment features
   - Income per person
   - Debt-to-income ratio (KEY!)
   - Credit utilization
   - Payment burden ratio
   - Family features
   - Document counts
   - External source aggregations

3. **Encoded Categorical Variables**
   - One-hot encoded low-cardinality features
   - Handled high-cardinality features

4. **Feature Selection**
   - Removed low-variance features
   - Removed highly correlated features
   - Reduced the feature count from 305 to 189

5. **Scaled Features**
   - StandardScaler (mean=0, std=1)
   - Ready for modeling!

6. **Created Train-Val Split**
   - Stratified sampling
   - 70% training, 30% validation

### 🎯 Key Features Created

Domain features we expect to matter most for credit scoring:
1. **DEBT_TO_INCOME_RATIO** - How much debt vs income
2. **ANNUITY_TO_INCOME_RATIO** - Can afford payments?
3. **EMPLOYMENT_YEARS** - Stability indicator
4. **AGE_YEARS** - Risk correlates with age
5. **INCOME_PER_PERSON** - Family financial situation
6. **CREDIT_UTILIZATION** - Credit usage patterns

### 📊 Final Dataset

- **Features:** 189 (after engineering and selection)
- **Training samples:** 215,257
- **Validation samples:** 92,254
- **Test samples:** 48,744
- **Class balance:** ~8% positive class (defaults)

### 🚀 Next Steps

In the next notebook ([03_baseline_models.ipynb](03_baseline_models.ipynb)), we will:

1. **Train Multiple Baseline Models**
   - Logistic Regression
   - Random Forest
   - XGBoost
   - LightGBM

2. **Set Up MLflow Tracking**
   - Log all experiments
   - Compare models
   - Save artifacts

3. **Evaluate Using Appropriate Metrics**
   - ROC-AUC
   - Precision-Recall AUC
   - F1-Score
   - Confusion Matrix

4. **Select Best Baseline Model**
   - Compare performance
   - Consider interpretability
   - Choose for optimization

---

**Excellent work on feature engineering! 🎉**

Your data is now ready for modeling. Remember: good features are often more important than complex models!

