## Home Credit Default Risk Prediction

### Project Overview

This project develops a machine learning model that predicts the likelihood of loan default for Home Credit Group clients. Think of it as a tool that helps a lender make smarter decisions about who receives a loan. The goal is to be as accurate as possible, so that clients who are likely to repay get approved while the risk of unpaid loans stays low.

### Why is this Important?

Home Credit focuses on lending to people who might not have traditional credit histories (sometimes called the "unbanked"). They use a variety of information, like phone and transaction data, to help assess whether someone is likely to repay a loan. By accurately predicting loan defaults, we can:

* **Help more people access credit:** Deserving clients who might otherwise be rejected can get loans.
* **Reduce risk for Home Credit:** The company can make better decisions about who to lend to.

### Dataset Description

We're using data from Home Credit, which includes details about loan applications, past loans from other institutions, and Home Credit-specific loan history (credit cards, point-of-sale loans, and so on). The data is organized into several interconnected tables, each providing a different piece of the puzzle:

1. **Application Data (`application_{train|test}.csv`):** This is the main table with information about each loan application, including things like age, income, employment history, etc. 
2. **Bureau Data (`bureau.csv` and `bureau_balance.csv`):** This contains information about loans the person has taken from other banks or financial institutions.
3. **Previous Applications (Home Credit) (`previous_application.csv`):** This tells us about the person's history of applying for loans at Home Credit specifically.
4. **Home Credit Loan Details:** 
    * **POS and Cash Loans Balance (`pos_cash_balance.csv`):**  Provides details on point-of-sale and cash loans.
    * **Credit Card Balance (`credit_card_balance.csv`):** Contains credit card balance information. 
5. **Installments Payments (`installments_payments.csv`):** This shows how the person has paid back previous loans at Home Credit (were they on time, late, etc.). 
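
The tables connect through ID columns. In the Kaggle release, `SK_ID_CURR` identifies the current application, `SK_ID_BUREAU` a credit bureau record, and `SK_ID_PREV` a previous Home Credit application. The sketch below shows one common way to bring a child table up to the application level: aggregate it to one row per applicant, then left join. It uses pandas and toy frames with invented values; the aggregate names `bureau_loan_count` and `bureau_credit_sum` are hypothetical, not features defined by this project.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
application = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "TARGET": [0, 1, 0],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "SK_ID_BUREAU": [5001, 5002, 5003],
    "AMT_CREDIT_SUM": [1000.0, 2500.0, 400.0],
})

# Collapse bureau to one row per applicant before joining;
# joining the raw table would duplicate application rows.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        bureau_loan_count=("SK_ID_BUREAU", "count"),
        bureau_credit_sum=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# A left join keeps applicants with no bureau history (their aggregates are NaN).
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The same pattern would apply to the tables keyed by `SK_ID_PREV`, with one extra aggregation step to get from previous-loan level to applicant level.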

### Our Approach 

We'll take a multi-step approach to building our predictive model:

1. **Exploratory Data Analysis (EDA):** We'll dig deep into the data to understand it better, identify important features (like which ones are strongly related to default risk), and find any patterns or issues that need to be addressed. Think of it like detective work!
2. **Data Preparation:** We'll clean the data, handle missing information, and create new features (also called "feature engineering") that might be helpful for the model. 
3. **Model Selection:** We'll try out different machine learning models to find the one that works best for this task.
4. **Fine-tuning:** We'll tweak the settings of our chosen model to make it as accurate as possible.
5. **Presenting the Solution:** We'll clearly explain our findings and the logic behind our model.
6. **Model Deployment:** The final model will be made available so that Home Credit can easily use it.
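
Steps 2 and 3 can be sketched end to end on synthetic data. This is a minimal illustration, not the project's actual pipeline: the features are random numbers, and median imputation, standard scaling, and a class-weighted logistic regression are assumed choices for a baseline. ROC AUC is used for evaluation because it is the metric the Kaggle competition scores on.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a prepared feature matrix (not Home Credit data).
n_samples, n_features = 2000, 5
X = rng.normal(size=(n_samples, n_features))
# The minority positive class depends on the first two features.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.5
y = (rng.random(n_samples) < 1 / (1 + np.exp(-logits))).astype(int)
# Blank out roughly 10% of the values to mimic missing data.
X[rng.random(X.shape) < 0.10] = np.nan

# Stratify so both splits keep the same share of defaults.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imputation and scaling live inside the pipeline, so they are
# fitted on the training split only and cannot leak validation data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

valid_scores = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, valid_scores)
print(f"Validation ROC AUC: {auc:.3f}")
```

Swapping the last pipeline step for another estimator gives a like-for-like comparison on the same split, which is one way to approach the model selection step.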

### Additional Notes

* **Data Source:** The dataset used in this project comes from the [Home Credit Default Risk Kaggle Competition](https://www.kaggle.com/competitions/home-credit-default-risk/leaderboard). More background and potentially useful insights from previous Home Credit analyses are available [here](https://www.kaggle.com/competitions/home-credit-default-risk/discussion/63032).
* **Geographic Scope:** Home Credit primarily operates in the Commonwealth of Independent States (CIS) and Southeast Asia (SEA) regions.  The data might include information from countries like Kazakhstan, Russia, Vietnam, China, Indonesia, and the Philippines.
* **Product Categories:** Home Credit offers several types of loan products, including:
    * **Revolving loans (credit cards)**
    * **Consumer loans (point-of-sale loans or POS loans)**
    * **Cash loans**
* **Exploratory Data Analysis Focus:** During EDA, each table will be analyzed individually to understand its relationship with the target variable (loan default).
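
One simple way to relate a table to the target is to compare default rates across the values of a categorical column. The sketch below does this for contract type on a handful of invented rows; it assumes the label column is named `TARGET` (1 = default) and the contract type column `NAME_CONTRACT_TYPE`, as in the Kaggle application table.

```python
import pandas as pd

# Toy rows with invented values; the real table has one row per application.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": [
        "Cash loans", "Cash loans", "Revolving loans",
        "Cash loans", "Revolving loans", "Cash loans",
    ],
    "TARGET": [0, 1, 0, 0, 0, 1],
})

# TARGET is 0/1, so its mean within a group is that group's default rate.
default_rate = (
    df.groupby("NAME_CONTRACT_TYPE")["TARGET"]
    .agg(n_loans="count", default_rate="mean")
    .sort_values("default_rate", ascending=False)
)
print(default_rate)
```

Reporting the group size next to the rate helps avoid over-reading rates that rest on only a few loans.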

**Figure: schema of the Home Credit dataset.**

By the end of this project, we aim to provide Home Credit with a reliable and easy-to-use tool to predict loan default risk, allowing them to make better lending decisions and potentially extend credit to more people who can manage it responsibly. 

In [1]:
import pandas as pd
import polars as pl
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgb

from retail_bank_risk.data_preprocessing_utils import handle_missing_values, simple_imputation, flag_anomalies
from retail_bank_risk.basic_visualizations_utils import (
    plot_correlation_matrix, plot_combined_histograms, plot_categorical_features_by_target,
    plot_combined_bar_charts, plot_combined_boxplots, plot_single_bar_chart, plot_feature_importances,
)
from retail_bank_risk.model_utils import evaluate_model, extract_feature_importances
from retail_bank_risk.advanced_visualizations_utils import (
    plot_model_performance, plot_combined_confusion_matrices,
    plot_roc_curve, plot_precision_recall_curve, plot_confusion_matrix, plot_learning_curve
)

In [3]:
np.random.seed(42)

In [2]:
application_train = pl.read_csv('../data/application_train.csv')
application_test = pl.read_csv('../data/application_test.csv')

print(f"Training data shape: {application_train.shape}")
print(f"Testing data shape: {application_test.shape}")

Training data shape: (307511, 122)
Testing data shape: (48744, 121)
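
The training table has one more column than the test table (122 vs. 121). In the Kaggle release that extra column is the label, `TARGET`. A quick way to confirm which columns differ is a set difference on the column names, sketched here on toy pandas frames with invented values; the same expression should also work on the Polars frames loaded above, since `.columns` is iterable in both libraries.

```python
import pandas as pd

# Toy frames standing in for application_train / application_test.
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "AMT_INCOME_TOTAL": [1.0, 2.0],
    "TARGET": [0, 1],
})
test = pd.DataFrame({
    "SK_ID_CURR": [3],
    "AMT_INCOME_TOTAL": [3.0],
})

# Columns present in one table but not the other.
only_in_train = sorted(set(train.columns) - set(test.columns))
only_in_test = sorted(set(test.columns) - set(train.columns))
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```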


In [4]:
bureau = pl.read_csv('../data/bureau.csv')
bureau_balance = pl.read_csv('../data/bureau_balance.csv')
prev_application = pl.read_csv('../data/previous_application.csv')
pos_cash = pl.read_csv('../data/pos_cash_balance.csv')
credit_card = pl.read_csv('../data/credit_card_balance.csv')
installments = pl.read_csv('../data/installments_payments.csv')

In [8]:
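# Run the loading cell above first: this loop raises NameError if `bureau`
# and the other tables are not defined in the current kernel session.
for name, df in {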
    'Bureau': bureau,
    'Bureau Balance': bureau_balance,
    'Previous Applications': prev_application,
    'POS Cash Balance': pos_cash,
    'Credit Card Balance': credit_card,
    'Installments Payments': installments
}.items():
    print(f"\n{name} data shape: {df.shape}")

NameError: name 'bureau' is not defined