## Notebook Overview

This notebook builds upon the insights gained from the previous notebook (`03_exploratory_data_analysis.ipynb`) and focuses on **Feature Engineering**. This stage involves creating new features and transforming existing ones based on our statistical findings and domain knowledge to enhance our credit risk model.

### 0.4.1 Objectives

The main objectives of this notebook are:

1. **Leverage EDA Insights:** Utilize the patterns and relationships uncovered in our exploratory data analysis to guide our feature engineering efforts.
2. **Create Non-Linear Transformations:** Develop binned versions of continuous variables to capture non-linear relationships with the target variable.
3. **Implement Domain-Specific Features:** Create new features based on financial domain knowledge that are known to be relevant in credit risk assessment.
4. **Enhance Model Input:** Generate features that can potentially improve our model's predictive power and interpretability.
5. **Prepare for Modeling:** Finalize the feature set that will be used in our subsequent modeling efforts.

### 0.4.2 Importance of Feature Engineering

Feature engineering plays a crucial role in improving machine learning model performance:

- **Capture Complex Relationships:** Engineered features can help models capture non-linear and intricate relationships that may not be apparent in the raw data.
- **Incorporate Domain Knowledge:** It allows us to inject domain expertise into our data, potentially improving model performance and interpretability.
- **Improve Model Generalization:** Well-engineered features can help models generalize better to unseen data.
- **Enhance Interpretability:** Carefully crafted features can make model predictions more understandable and actionable for stakeholders.

### 0.4.3 Our Approach

In this notebook, we will focus on the following feature engineering tasks:

1. **Age Binning:** Create age groups (18-25, 26-35, 36-45, 46-55, 56-65, 65+) to capture non-linear relationships with default risk.
2. **Financial Ratios:** Implement key financial ratios including:
   - Debt-to-income ratio
   - Credit-to-goods price ratio
   - Annuity-to-income ratio
3. **Stability Indicators:** Derive new features such as:
   - Employed-to-age ratio
   - Flag for when credit amount exceeds goods price
4. **Credit Score Aggregation:** Generate an average of external source scores for a more stable overall credit score indicator.
5. **Income and Credit Amount Binning:** Create bins for income and credit amount variables to identify potential threshold effects and improve model robustness to outliers.

By the end of this notebook, we will have a rich set of engineered features that leverage both our data-driven insights and domain knowledge, setting a strong foundation for our subsequent modeling efforts.


In [12]:
%load_ext autoreload
%autoreload 2



In [13]:
import pandas as pd

import warnings

from retail_bank_risk.feature_engineering_utils import (
    create_binned_features,
    create_derived_features,
    encode_categorical_features,
)

warnings.filterwarnings("ignore", category=FutureWarning)

In [14]:
application_train = pd.read_parquet(
    "../data/processed/application_train_prepared.parquet"
)
application_test = pd.read_parquet(
    "../data/processed/application_test_prepared.parquet"
)

We will use the `create_binned_features` function to transform continuous numerical variables into categorical bins, simplifying the dataset.

This function will bin the `days_birth` column into predefined age groups, converting age (expressed in days) into meaningful categories like "18-25" and "65+".

We will also use it to bin the `amt_income_total` and `amt_credit` columns into quantile-based groups, ensuring approximately equal representation in each bin.

This binning will help mitigate the impact of outliers and capture non-linear relationships between these variables and the target variable.

Ultimately, this process will enhance model interpretability and stability by grouping continuous data into manageable segments.
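The two binning strategies described above can be sketched with `pd.cut` (fixed age groups) and `pd.qcut` (quantile groups). This is a minimal illustration on toy data, not the actual implementation of `create_binned_features`: the bin edges, the number of quantiles, the output column names `age_group` and `income_group`, and the assumption that `days_birth` stores a day count (possibly negative) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; the column names mirror those mentioned in this notebook.
df = pd.DataFrame(
    {
        "days_birth": [-7000, -10000, -15000, -19000, -22000, -25000],
        "amt_income_total": [45e3, 90e3, 135e3, 180e3, 270e3, 900e3],
    }
)

# Assumption: days_birth is a day count relative to the application date,
# so the absolute value divided by 365.25 approximates age in years.
age_years = df["days_birth"].abs() / 365.25

# Fixed age groups. With right=False each bin is [lower, upper), so the
# "56-65" bin covers ages up to (but not including) 66. The "65+" label
# follows the notebook text; in this sketch it starts at 66.
age_edges = [18, 26, 36, 46, 56, 66, np.inf]
age_labels = ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"]
df["age_group"] = pd.cut(
    age_years, bins=age_edges, labels=age_labels, right=False
)

# Quantile bins: each bin receives roughly the same number of rows, so a
# single extreme income lands in the top bin instead of stretching a scale.
df["income_group"] = pd.qcut(
    df["amt_income_total"], q=3, labels=["low", "medium", "high"]
)

print(df[["age_group", "income_group"]])
```

Quantile bins are derived from the data, so in practice the edges would be learned on the training set and reused for the test set to keep the categories comparable.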


In [15]:
application_train_engineered = create_binned_features(application_train)
application_test_engineered = create_binned_features(application_test)

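After binning the data, we define how each categorical feature will be encoded. Features listed in `ohe_features` (binary and other low-cardinality columns) will be one-hot encoded, while features listed in `target_encoded_features`, including the high-cardinality `occupation_type` and `organization_type`, will be target encoded. We also name the target variable so that it can be separated from the encoded features.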


In [16]:
ohe_features = [
    "reg_city_not_work_city",  # Binary, 2 distinct values
    "name_contract_type",  # Binary, 2 distinct values
    "code_gender",  # Binary, 2 distinct values
    "flag_own_car",  # Binary, 2 distinct values
    "flag_own_realty",  # Binary, 2 distinct values
    "name_type_suite",  # 8 distinct values
    "name_income_type",  # 8 distinct values
    "name_education_type",  # 5 distinct values
    "name_family_status",  # 6 distinct values
    "name_housing_type",  # 6 distinct values
    "weekday_appr_process_start",  # 7 distinct values
    "housetype_mode",  # 4 distinct values
    "emergencystate_mode",  # 3 distinct values
    "is_anomaly",  # Binary, 2 distinct values
]

target_encoded_features = [
    "region_rating_client_w_city",  # 3 distinct values
    "region_rating_client",  # 3 distinct values
    "occupation_type",  # 19 distinct values
    "organization_type",  # 58 distinct values
]

target_feature = "target"  # Binary target with 0 and 1 values
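The distinct-value counts noted in the comments above can be checked with `DataFrame.nunique`. The helper below, `cardinality_report`, is a hypothetical name introduced for illustration, and the toy frame stands in for `application_train`.

```python
import pandas as pd


def cardinality_report(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return the number of distinct non-null values per listed column."""
    return df[columns].nunique().sort_values(ascending=False)


# Toy stand-in for the training frame.
toy = pd.DataFrame(
    {
        "code_gender": ["F", "M", "F", "M"],
        "organization_type": ["School", "Bank", "Police", "Bank"],
    }
)

report = cardinality_report(toy, ["code_gender", "organization_type"])
print(report)
```

On the real data, a call such as `cardinality_report(application_train, ohe_features + target_encoded_features)` would show whether the counts in the comments still hold.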

We will now generate new features using the `create_derived_features` function.

This function will calculate key financial ratios, including `debt_to_income_ratio`, `credit_to_goods_ratio`, and `annuity_to_income_ratio`, to provide insights into applicants' financial health.

We will also create `ext_source_mean` by averaging external source scores for a consolidated measure of external assessments.

Furthermore, we will generate a binary flag, `credit_exceeds_goods`, to indicate instances where the credit amount surpasses the value of the goods purchased.

These derived features will improve the dataset's predictive power by capturing important financial relationships.
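The derived features named above can be sketched as simple column arithmetic. This is an illustration under stated assumptions rather than the body of `create_derived_features`: the input column names (`amt_annuity`, `amt_goods_price`, `ext_source_1` to `ext_source_3`) and the definition of `debt_to_income_ratio` as credit amount over income are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; input column names are assumptions.
df = pd.DataFrame(
    {
        "amt_income_total": [100000.0, 200000.0],
        "amt_credit": [500000.0, 300000.0],
        "amt_annuity": [25000.0, 20000.0],
        "amt_goods_price": [450000.0, 300000.0],
        "ext_source_1": [0.5, np.nan],
        "ext_source_2": [0.7, 0.4],
        "ext_source_3": [0.6, 0.8],
    }
)

# Financial ratios (definitions assumed for this sketch).
df["debt_to_income_ratio"] = df["amt_credit"] / df["amt_income_total"]
df["credit_to_goods_ratio"] = df["amt_credit"] / df["amt_goods_price"]
df["annuity_to_income_ratio"] = df["amt_annuity"] / df["amt_income_total"]

# Row-wise mean of the external scores. DataFrame.mean skips NaN by
# default, so a row with one missing score still gets an average.
ext_cols = ["ext_source_1", "ext_source_2", "ext_source_3"]
df["ext_source_mean"] = df[ext_cols].mean(axis=1)

# Binary flag: 1 when the credit amount is larger than the goods price.
df["credit_exceeds_goods"] = (
    df["amt_credit"] > df["amt_goods_price"]
).astype(int)

print(df.filter(like="ratio"))
```

A production version would also need a policy for zero or missing denominators, which this sketch does not address.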


In [17]:
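# Build on the binned frames from the previous step; passing the raw frames
# here would discard the binned columns created above.
application_train_engineered = create_derived_features(
    application_train_engineered
)
application_test_engineered = create_derived_features(
    application_test_engineered
)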

Finally, we will use the `encode_categorical_features` function to transform categorical variables into a numerical format suitable for machine learning.

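This function will apply one-hot encoding to the features in `ohe_features`, creating binary columns for each category and effectively handling unknown categories. For the features in `target_encoded_features`, leave-one-out encoding will be used: each training row's category is replaced with the mean of the target variable over the other training rows in that category, which limits leakage of a row's own label into its feature.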

This approach captures the relationship between the categories and the target.

The function ensures consistency between the training and testing datasets by aligning the encoded columns and removing any unwanted or constant columns.

This final encoding step prepares the data for model training by preserving valuable information while managing dimensionality.
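The two encodings described above can be sketched in plain pandas. This is not the implementation of `encode_categorical_features`: the toy data, the `_loo` column suffix, and the fallback to the global target mean for singleton or unseen categories are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame(
    {
        "code_gender": ["F", "M", "F", "M"],
        "occupation_type": ["Drivers", "Drivers", "Drivers", "Managers"],
        "target": [1, 0, 0, 1],
    }
)
test = pd.DataFrame(
    {"code_gender": ["F", "XNA"], "occupation_type": ["Drivers", "Cooks"]}
)

# One-hot encoding: build dummies per frame, then align the test columns
# to the training columns. A category unseen in training becomes all zeros.
train_ohe = pd.get_dummies(train["code_gender"], prefix="code_gender", dtype=int)
test_ohe = pd.get_dummies(test["code_gender"], prefix="code_gender", dtype=int)
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

# Leave-one-out encoding on the training set: the category mean of the
# target, computed without the current row's own target value.
global_mean = train["target"].mean()
grouped = train.groupby("occupation_type")["target"]
cat_sum = grouped.transform("sum")
cat_count = grouped.transform("count")
loo = (cat_sum - train["target"]) / (cat_count - 1)
# Singleton categories have no other rows; fall back to the global mean.
train["occupation_type_loo"] = loo.where(cat_count > 1, global_mean)

# Test rows have no target, so they use the full training category mean,
# with the global mean for categories unseen in training.
test["occupation_type_loo"] = (
    test["occupation_type"].map(grouped.mean()).fillna(global_mean)
)

print(train["occupation_type_loo"].tolist())
print(test_ohe)
```

Excluding each row's own label is what separates leave-one-out encoding from plain target-mean encoding. Without that exclusion, rare categories would encode their own labels almost directly.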


In [18]:
application_train_encoded, application_test_encoded = (
    encode_categorical_features(
        application_train_engineered,
        application_test_engineered,
        ohe_features,
        target_encoded_features,
        target_feature,
    )
)

In summary, this notebook applied three feature engineering steps:

1. **Binning Continuous Variables:** Transformed age, income, and credit amount into categorical bins to simplify models and handle outliers.
2. **Creating Derived Features:** Generated financial ratios and indicators to capture complex relationships and domain-specific insights.
3. **Encoding Categorical Variables:** Applied one-hot and leave-one-out encoding to convert categorical features into numerical formats suitable for modeling, keeping the training and test columns consistent and handling high-cardinality features.

These steps improve the dataset's quality and relevance for more accurate and robust machine learning models.

We will save the processed data to parquet files for efficient storage and retrieval.

Next, we'll move on to modeling in the `05_model_training_and_evaluation.ipynb` notebook.

In [20]:
application_train_encoded.to_parquet("../data/processed/application_train_engineered.parquet")
application_test_encoded.to_parquet("../data/processed/application_test_engineered.parquet")