## Notebook Overview

This notebook builds on the findings of the previous notebook (`03_exploratory_data_analysis.ipynb`) and focuses on **Feature Engineering**: creating new features and transforming existing ones, guided by our statistical findings and domain knowledge, to improve our credit risk model.

### 0.4.1 Objectives

The main objectives of this notebook are:

1. **Leverage EDA Insights:** Utilize the patterns and relationships uncovered in our exploratory data analysis to guide our feature engineering efforts.
2. **Create Non-Linear Transformations:** Develop binned versions of continuous variables to capture non-linear relationships with the target variable.
3. **Implement Domain-Specific Features:** Create new features based on financial domain knowledge that are known to be relevant in credit risk assessment.
4. **Enhance Model Input:** Generate features that can potentially improve our model's predictive power and interpretability.
5. **Prepare for Modeling:** Finalize the feature set that will be used in our subsequent modeling efforts.

### 0.4.2 Importance of Feature Engineering

Feature engineering plays a crucial role in improving machine learning model performance:

- **Capture Complex Relationships:** Engineered features can help models capture non-linear and intricate relationships that may not be apparent in the raw data.
- **Incorporate Domain Knowledge:** Engineered features let us encode financial domain expertise directly in the data, potentially improving model performance and interpretability.
- **Improve Model Generalization:** Well-engineered features can help models generalize better to unseen data.
- **Enhance Interpretability:** Carefully crafted features can make model predictions more understandable and actionable for stakeholders.

### 0.4.3 Our Approach

In this notebook, we will focus on the following feature engineering tasks:

1. **Age Binning:** Create age groups (18-25, 26-35, 36-45, 46-55, 56-65, 65+) to capture non-linear relationships with default risk.
2. **Financial Ratios:** Implement key financial ratios including:
   - Debt-to-income ratio
   - Credit-to-goods price ratio
   - Annuity-to-income ratio
3. **Stability Indicators:** Derive new features such as:
   - Employed-to-age ratio
   - Flag for when credit amount exceeds goods price
4. **Credit Score Aggregation:** Average the external source scores into a single feature (`ext_source_mean`) to obtain a more stable overall credit score indicator.
5. **Income and Credit Amount Binning:** Create bins for income and credit amount variables to identify potential threshold effects and improve model robustness to outliers.

By the end of this notebook, we will have encoded train and test sets that contain these engineered features, combining our data-driven insights with domain knowledge, ready for the subsequent modeling work.
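
To make the age-binning step concrete, here is a minimal sketch using `pandas.cut`. The function name `bin_age`, the input in whole or fractional years, and the exact bin edges are assumptions for illustration; the actual logic lives in `create_binned_features` and may differ.

```python
import pandas as pd

# Hypothetical bin edges chosen to match the age groups listed above.
AGE_EDGES = [18, 25, 35, 45, 55, 65, float("inf")]
AGE_LABELS = ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"]


def bin_age(age_years: pd.Series) -> pd.Series:
    """Assign each age (in years) to one of the labelled age groups."""
    return pd.cut(
        age_years,
        bins=AGE_EDGES,
        labels=AGE_LABELS,
        right=True,           # intervals are (left, right]
        include_lowest=True,  # so that exactly 18 falls into the first bin
    )


ages = pd.Series([18, 25, 26, 40, 65, 70])
age_groups = bin_age(ages)
print(age_groups.tolist())
```

With right-closed intervals, an applicant aged exactly 65 falls into `56-65`, and only ages above 65 fall into `65+`.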


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import warnings

import pandas as pd

from retail_bank_risk.feature_engineering_utils import (
    create_binned_features,
    create_derived_features,
    encode_categorical_features,
)

# Silence FutureWarning messages to keep the notebook output readable.
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
application_train = pd.read_parquet(
    "../data/processed/application_train_prepared.parquet"
)
application_test = pd.read_parquet(
    "../data/processed/application_test_prepared.parquet"
)

In [4]:
application_train_engineered = create_binned_features(application_train)
application_test_engineered = create_binned_features(application_test)
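
One thing to watch when binning train and test separately: if `create_binned_features` derives bin edges from the data it receives (for example, quantiles), the same income could land in different bins in the two sets. Whether that applies here depends on the helper's implementation, which is not shown. A minimal sketch of one way to avoid it, using the hypothetical helpers `fit_quantile_edges` and `apply_bins` and made-up income values, learns the edges on the training data and reuses them on the test data:

```python
import numpy as np
import pandas as pd


def fit_quantile_edges(values: pd.Series, n_bins: int = 4) -> np.ndarray:
    """Learn quantile-based bin edges from the training data only."""
    quantiles = np.linspace(0, 1, n_bins + 1)
    edges = np.array(values.quantile(quantiles), dtype=float)
    # Open the outer edges so more extreme test values still get a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    return np.unique(edges)  # drop duplicate edges caused by ties


def apply_bins(values: pd.Series, edges: np.ndarray) -> pd.Series:
    """Map values to integer bin codes using previously learned edges."""
    return pd.cut(values, bins=edges, labels=False)


# Made-up incomes, for illustration only.
train_income = pd.Series([50_000.0, 90_000.0, 120_000.0, 180_000.0, 400_000.0])
test_income = pd.Series([30_000.0, 120_000.0, 1_000_000.0])

income_edges = fit_quantile_edges(train_income, n_bins=4)
train_income_bin = apply_bins(train_income, income_edges)
test_income_bin = apply_bins(test_income, income_edges)
```

If the helper uses fixed, hand-picked thresholds instead, calling it independently on each set is already consistent.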

In [5]:
ohe_features = [
    "reg_city_not_work_city",       # Binary, 2 distinct values
    "name_contract_type",           # Binary, 2 distinct values
    "code_gender",                  # 3 distinct values (m, f, xna)
    "flag_own_car",                 # Binary, 2 distinct values
    "flag_own_realty",              # Binary, 2 distinct values
    "name_type_suite",              # 8 distinct values
    "name_income_type",             # 8 distinct values
    "name_education_type",          # 5 distinct values
    "name_family_status",           # 6 distinct values
    "name_housing_type",            # 6 distinct values
    "weekday_appr_process_start",   # 7 distinct values
    "housetype_mode",               # 4 distinct values
    "emergencystate_mode",          # 3 distinct values
    "is_anomaly"                    # Binary, 2 distinct values
]

target_encoded_features = [
    "region_rating_client_w_city",  # 3 distinct values
    "region_rating_client",         # 3 distinct values
    "occupation_type",              # 19 distinct values
    "organization_type"             # 58 distinct values
]

target_feature = "target"  # Binary target with 0 and 1 values

In [6]:
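# Build on the binned frames from In [4] so the binned features are kept
# rather than overwritten.
application_train_engineered = create_derived_features(
    application_train_engineered
)
application_test_engineered = create_derived_features(
    application_test_engineered
)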
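
For readers without access to `create_derived_features`, the sketch below shows one plausible way to compute the derived columns that appear in the outputs further down (`debt_to_income_ratio`, `credit_to_goods_ratio`, `annuity_to_income_ratio`, `ext_source_mean`, `credit_exceeds_goods`). The function name `add_ratio_features`, the input column names `amt_income_total`, `amt_credit`, and `ext_source_1` to `ext_source_3`, and the toy values are assumptions; the real helper may handle zeros and missing values differently.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ratio, mean-score, and flag features added."""
    out = df.copy()
    # Treat a zero denominator as missing to avoid infinite ratios.
    income = out["amt_income_total"].replace(0, np.nan)
    goods = out["amt_goods_price"].replace(0, np.nan)

    out["debt_to_income_ratio"] = out["amt_credit"] / income
    out["credit_to_goods_ratio"] = out["amt_credit"] / goods
    out["annuity_to_income_ratio"] = out["amt_annuity"] / income

    # Row-wise mean that skips missing external scores.
    ext_cols = ["ext_source_1", "ext_source_2", "ext_source_3"]
    out["ext_source_mean"] = out[ext_cols].mean(axis=1, skipna=True)

    # 1 when the loan is larger than the price of the financed goods.
    out["credit_exceeds_goods"] = (
        out["amt_credit"] > out["amt_goods_price"]
    ).astype(int)
    return out


# Made-up applicants, for illustration only.
toy = pd.DataFrame(
    {
        "amt_income_total": [100_000.0, 50_000.0],
        "amt_credit": [200_000.0, 150_000.0],
        "amt_annuity": [10_000.0, 5_000.0],
        "amt_goods_price": [200_000.0, 120_000.0],
        "ext_source_1": [0.5, np.nan],
        "ext_source_2": [0.7, 0.4],
        "ext_source_3": [0.6, 0.2],
    }
)
toy_derived = add_ratio_features(toy)
```

The flag uses a strict comparison, which is consistent with the displayed rows where `credit_to_goods_ratio` is exactly 1.0 and `credit_exceeds_goods` is 0.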

In [7]:
application_train_encoded, application_test_encoded = encode_categorical_features(
    application_train_engineered,
    application_test_engineered,
    ohe_features,
    target_encoded_features,
    target_feature
)
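
The internals of `encode_categorical_features` are not shown, so the following is only a pandas-based sketch of the general pattern the call suggests: fit both encodings on the training set, then apply them to the test set. The function name `encode_sketch` and the toy data are hypothetical. This naive target encoding uses plain category means without smoothing or cross-fitting, which the real helper may well add.

```python
import pandas as pd


def encode_sketch(
    train: pd.DataFrame,
    test: pd.DataFrame,
    ohe_cols: list[str],
    te_cols: list[str],
    target: str,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """One-hot and target encode train/test using training statistics only."""
    train_out, test_out = train.copy(), test.copy()

    # Target encoding: replace each category with its training-set target mean.
    global_mean = train[target].mean()
    for col in te_cols:
        category_means = train.groupby(col)[target].mean()
        train_out[col] = train[col].map(category_means)
        # Categories unseen in training fall back to the global mean.
        test_out[col] = test[col].map(category_means).fillna(global_mean)

    # One-hot encoding: build dummies per frame, then align test to train.
    train_out = pd.get_dummies(train_out, columns=ohe_cols, dtype=int)
    test_out = pd.get_dummies(test_out, columns=ohe_cols, dtype=int)
    feature_cols = [c for c in train_out.columns if c != target]
    # Missing dummy columns become 0; categories unseen in train are dropped.
    test_out = test_out.reindex(columns=feature_cols, fill_value=0)
    return train_out, test_out


# Made-up rows, for illustration only.
toy_train = pd.DataFrame(
    {
        "code_gender": ["m", "f", "f", "m"],
        "occupation_type": ["drivers", "laborers", "drivers", "laborers"],
        "target": [1, 0, 0, 0],
    }
)
toy_test = pd.DataFrame(
    {
        "code_gender": ["f", "xna"],
        "occupation_type": ["drivers", "managers"],
    }
)
train_enc, test_enc = encode_sketch(
    toy_train, toy_test, ["code_gender"], ["occupation_type"], "target"
)
```

Aligning the test columns to the training columns keeps the feature layout identical across both sets, with `target` present only in the training frame.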

In [8]:
application_test_encoded.head()

| | reg_city_not_work_city_0 | reg_city_not_work_city_1 | reg_city_not_work_city_-1 | name_contract_type_cash loans | name_contract_type_revolving loans | name_contract_type_-1 | code_gender_m | code_gender_f | code_gender_xna | code_gender_-1 | ... | amt_annuity | amt_goods_price | is_anomaly_false | is_anomaly_true | is_anomaly_-1 | debt_to_income_ratio | credit_to_goods_ratio | annuity_to_income_ratio | ext_source_mean | credit_exceeds_goods |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 20560.5 | 450000.0 | 1 | 0 | 0 | 4.213333 | 1.264 | 0.1523 | 0.474587 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 17370.0 | 180000.0 | 1 | 0 | 0 | 2.250182 | 1.2376 | 0.175455 | 0.362309 | 1 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 69777.0 | 630000.0 | 0 | 1 | 0 | 3.275378 | 1.0528 | 0.344578 | 0.655389 | 1 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 49018.5 | 1575000.0 | 0 | 1 | 0 | 5.0 | 1.0 | 0.155614 | 0.561191 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 32067.0 | 625500.0 | 1 | 0 | 0 | 3.475 | 1.0 | 0.17815 | 0.522519 | 0 |


In [9]:
application_train_encoded.head()

| | reg_city_not_work_city_0 | reg_city_not_work_city_1 | reg_city_not_work_city_-1 | name_contract_type_cash loans | name_contract_type_revolving loans | name_contract_type_-1 | code_gender_m | code_gender_f | code_gender_xna | code_gender_-1 | ... | amt_goods_price | is_anomaly_false | is_anomaly_true | is_anomaly_-1 | debt_to_income_ratio | credit_to_goods_ratio | annuity_to_income_ratio | ext_source_mean | credit_exceeds_goods | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 351000.0 | 1 | 0 | 0 | 2.007889 | 1.158397 | 0.121978 | 0.201162 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1129500.0 | 1 | 0 | 0 | 4.79075 | 1.145199 | 0.132217 | 0.588812 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 135000.0 | 1 | 0 | 0 | 2.0 | 1.0 | 0.1 | 0.642739 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 297000.0 | 1 | 0 | 0 | 2.316167 | 1.052803 | 0.2199 | 0.68046 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 513000.0 | 1 | 0 | 0 | 4.222222 | 1.0 | 0.179963 | 0.39776 | 0 | 0 |


In [10]:
# categorical_features = [
#     "reg_city_not_work_city",
#     "region_rating_client_w_city",
#     "region_rating_client",
#     "name_contract_type",
#     "code_gender",
#     "flag_own_car",
#     "flag_own_realty",
#     "name_type_suite",
#     "name_income_type",
#     "name_education_type",
#     "name_family_status",
#     "name_housing_type",
#     "occupation_type",
#     "weekday_appr_process_start",
#     "organization_type",
#     "housetype_mode",
#     "emergencystate_mode",
#     "is_anomaly",
# ]