## Home Credit Default Risk Prediction

### Project Overview

This project develops several machine learning models to predict the likelihood of loan default for Home Credit Group clients. As a proof of concept (POC) for our startup's risk evaluation service, we'll build a suite of models that help banks make better-informed loan approval decisions. Our goal is to identify both creditworthy clients and likely defaulters as accurately as possible, giving prospective banking clients a robust and flexible solution.

### Why is this Important?

Home Credit specializes in lending to individuals with limited or no credit history (often described as "unbanked" or "underbanked"). It uses diverse data sources, including phone and transaction data, to estimate the probability of repayment. With accurate default prediction models, we can:

- **Expand financial inclusion:** Enable deserving clients who might be rejected by traditional methods to access credit.
- **Optimize risk management:** Help financial institutions make more informed lending decisions, reducing their exposure to potential losses.
- **Demonstrate our startup's capabilities:** Showcase our ability to translate complex business requirements into effective machine learning solutions.

### Dataset Description

We're using the Home Credit Default Risk dataset, which includes comprehensive information about loan applications and borrowers' credit histories. The data is structured across several interconnected tables:

1. **Application Data (`application_{train|test}.csv`):** The primary table containing loan application details such as age, income, employment history, etc.
2. **Bureau Data (`bureau.csv` and `bureau_balance.csv`):** Information about the applicants' previous loans from other financial institutions.
3. **Previous Applications (`previous_application.csv`):** Historical data on the applicants' past loan applications with Home Credit.
4. **Home Credit Loan Details:**
   - **POS and Cash Loans Balance (`pos_cash_balance.csv`):** Monthly balance snapshots of previous point-of-sale and cash loans.
   - **Credit Card Balance (`credit_card_balance.csv`):** Monthly balance snapshots of previous credit cards.
5. **Installments Payments (`installments_payments.csv`):** Payment history for previous loans at Home Credit.
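Because the secondary tables hold many rows per applicant, they are typically aggregated to one row per applicant before being joined to the application table. The following is a minimal sketch of that pattern, not the project's actual pipeline: the toy values are invented, and the `bureau` column names (`SK_ID_CURR`, `AMT_CREDIT_SUM`) are assumed to follow the Kaggle schema.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are invented).
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],
    }
)

# Collapse the one-to-many bureau table to one row per applicant.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        bureau_loan_count=("AMT_CREDIT_SUM", "size"),
        bureau_credit_sum=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# A left join keeps applicants with no bureau history;
# their aggregate columns become NaN.
features = applications.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(features)
```

The left join matters here: applicants without any prior bureau record are exactly the "limited credit history" clients the project cares about, so they should not be dropped.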

### Our Approach

We'll employ a comprehensive, multi-stage approach to develop our predictive models:

1. **Initial Data Exploration:**

   - Conduct thorough Exploratory Data Analysis (EDA) on each table.
   - Identify key variables and their distributions.
   - Investigate relationships between features and the target variable (loan default).
   - Check for data quality issues, missing values, and anomalies.

2. **Feature Engineering and Data Preprocessing:**

   - Create new features based on domain knowledge and initial insights.
   - Handle missing data and outliers.
   - Perform appropriate encoding for categorical variables and scaling for numerical features.

3. **Statistical Inference:**

   - Define the target population and formulate multiple statistical hypotheses.
   - Construct confidence intervals and conduct appropriate statistical tests.
   - Analyze correlations and other relationships between variables.

4. **Model Development:**

   - Implement multiple machine learning algorithms (e.g., Logistic Regression, Random Forests, Gradient Boosting).
   - Utilize cross-validation techniques to ensure model robustness.
   - Perform hyperparameter tuning to optimize model performance.

5. **Model Evaluation and Selection:**

   - Assess models using appropriate performance metrics (e.g., AUC-ROC, precision, recall).
   - Analyze feature importance and model interpretability.
   - Select the best-performing models for deployment.

6. **Model Deployment:**

   - Deploy the top-performing models to Google Cloud Platform.
   - Ensure models are accessible via HTTP requests for easy integration.

7. **Documentation and Presentation:**
   - Clearly document all steps, assumptions, and results.
   - Prepare visualizations and explanations of our findings.
   - Develop recommendations for potential clients based on our insights.
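The model development and evaluation steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the project's final setup: the data comes from scikit-learn's `make_classification` (not the Home Credit tables), the roughly 10% positive rate is an arbitrary stand-in for class imbalance, and logistic regression is used only because it is one of the algorithms listed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 10% positives).
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    random_state=42,
)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps validation-fold statistics out of the preprocessing step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {np.round(scores, 3)}")
print(f"Mean ROC AUC: {scores.mean():.3f}")
```

ROC AUC is used for scoring because plain accuracy can look high on imbalanced data even when a model never predicts a default.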

### Additional Notes

- **Data Source:** The dataset is from the [Home Credit Default Risk Kaggle Competition](https://www.kaggle.com/competitions/home-credit-default-risk/data).
- **Geographic Scope:** Home Credit operates primarily in CIS and Asian countries, including Kazakhstan, Russia, Vietnam, China, Indonesia, and the Philippines.
- **Product Categories:** Home Credit offers various loan products:
  - Revolving loans (credit cards)
  - Consumer loans (point-of-sale or POS loans)
  - Cash loans
- **Ethical Considerations:** We'll pay close attention to potential biases in our models and strive for fair lending practices.
- **Iterative Process:** Our approach will be flexible, allowing for iterations and refinements based on insights gained throughout the project.

**Home Credit Dataset Schema:**

![Home Credit Dataset Schema](../images/data-scheme.png)

This project will demonstrate our ability to handle complex, real-world data and to deliver useful insights and predictive capabilities to the financial sector.

By the end of this project, we aim to provide a reliable and easy-to-use tool for predicting loan default risk, allowing lenders such as Home Credit to make better lending decisions and potentially extend credit to more people who can manage it responsibly.


In [11]:
# Third-party libraries
import lightgbm as lgb
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl
import sweetviz as sv
import xgboost as xgb
from IPython.display import display
from matplotlib.colors import LinearSegmentedColormap, ListedColormap
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from ydata_profiling import ProfileReport
from ydata_profiling.report.structure.variables import render_real

# Project-specific helpers
from retail_bank_risk.data_preprocessing_utils import (
    handle_missing_values,
    simple_imputation,
    flag_anomalies,
)
from retail_bank_risk.basic_visualizations_utils import (
    plot_correlation_matrix,
    plot_combined_histograms,
    plot_categorical_features_by_target,
    plot_combined_bar_charts,
    plot_combined_boxplots,
    plot_single_bar_chart,
    plot_feature_importances,
)
from retail_bank_risk.model_utils import (
    evaluate_model,
    extract_feature_importances,
)
from retail_bank_risk.advanced_visualizations_utils import (
    plot_model_performance,
    plot_combined_confusion_matrices,
    plot_roc_curve,
    plot_precision_recall_curve,
    plot_confusion_matrix,
    plot_learning_curve,
)

For reproducibility, we set a random seed.


In [3]:
np.random.seed(42)
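
As a small illustration of what the seed does, the sketch below checks that re-seeding NumPy's global generator reproduces the same draws. The helper name `draw` is hypothetical. Note that `np.random.seed` only controls NumPy's legacy global state; scikit-learn, XGBoost, and LightGBM estimators also accept their own `random_state` argument, which is worth setting explicitly.

```python
import numpy as np


def draw(seed: int) -> np.ndarray:
    """Re-seed NumPy's legacy global generator and draw three uniform samples."""
    np.random.seed(seed)
    return np.random.rand(3)


first = draw(42)
second = draw(42)
print(np.array_equal(first, second))  # True: same seed, same draws

# A local Generator gives the same reproducibility without shared global state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print(np.array_equal(rng_a.random(3), rng_b.random(3)))  # True
```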

We then load the CSV files using Polars.


In [4]:
application_train = pl.read_csv("../data/application_train.csv")
application_test = pl.read_csv("../data/application_test.csv")

In [7]:
application_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,…,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
i64,i64,str,str,str,str,i64,f64,f64,f64,f64,str,str,str,str,str,f64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,str,f64,i64,i64,str,i64,i64,i64,i64,…,f64,str,str,f64,str,str,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,f64,f64,f64
100002,1,"""Cash loans""","""M""","""N""","""Y""",0,202500.0,406597.5,24700.5,351000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,"""Laborers""",1.0,2,2,"""WEDNESDAY""",10,0,0,0,…,0.0,"""reg oper account""","""block of flats""",0.0149,"""Stone, brick""","""No""",2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,"""Cash loans""","""F""","""N""","""N""",0,270000.0,1293502.5,35698.5,1129500.0,"""Family""","""State servant""","""Higher education""","""Married""","""House / apartment""",0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,"""Core staff""",2.0,1,1,"""MONDAY""",11,0,0,0,…,0.01,"""reg oper account""","""block of flats""",0.0714,"""Block""","""No""",1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,"""Revolving loans""","""M""","""Y""","""Y""",0,67500.0,135000.0,6750.0,135000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,"""Laborers""",1.0,2,2,"""MONDAY""",9,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,"""Cash loans""","""F""","""N""","""Y""",0,135000.0,312682.5,29686.5,297000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Civil marriage""","""House / apartment""",0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,"""Laborers""",2.0,2,2,"""WEDNESDAY""",17,0,0,0,…,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
100007,0,"""Cash loans""","""M""","""N""","""Y""",0,121500.0,513000.0,21865.5,513000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,"""Core staff""",1.0,2,2,"""THURSDAY""",11,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
application_train.dtypes

[Int64,
 Int64,
 String,
 String,
 String,
 String,
 Int64,
 Float64,
 Float64,
 Float64,
 Float64,
 String,
 String,
 String,
 String,
 String,
 Float64,
 Int64,
 Int64,
 Float64,
 Int64,
 Float64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 String,
 Float64,
 Int64,
 Int64,
 String,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 String,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 String,
 String,
 Float64,
 String,
 String,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 I

Next, we convert the Polars DataFrames to pandas DataFrames so that they can be passed to the AutoEDA tools.
