# 02_random_forest

This notebook trains a random forest classifier to predict the probability that a borrower will experience serious delinquency within the next two years, using the [GiveMeSomeCredit](https://www.kaggle.com/c/GiveMeSomeCredit/rules) dataset from Kaggle.


# Imports

In [1]:
import sys

import pandas as pd
import sklearn.ensemble
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

from IPython.display import display

sys.path.append("../../src")
import GiveMeSomeCredit

# Data Loading

## Loading the Training Dataset

This section loads the training data into a DataFrame and displays per-column summary statistics followed by a preview of the rows.


In [2]:
credit_data_df = GiveMeSomeCredit.load_training_data()

/Users/rina/llm-classification/data/GiveMeSomeCredit/raw/cs-training.csv Memory Usage: 13.73 MB


column,dtype,count,non_null,null_count,mean,std,min,25%,50%,75%,max
SeriousDlqin2yrs,int64,150000,150000,0,0.06684,0.249746,0.0,0.0,0.0,0.0,1.0
RevolvingUtilizationOfUnsecuredLines,float64,150000,150000,0,6.048438,249.755371,0.0,0.029867,0.154181,0.559046,50708.0
age,int64,150000,150000,0,52.295207,14.771866,0.0,41.0,52.0,63.0,109.0
NumberOfTime30-59DaysPastDueNotWorse,int64,150000,150000,0,0.421033,4.192781,0.0,0.0,0.0,0.0,98.0
DebtRatio,float64,150000,150000,0,353.005076,2037.818523,0.0,0.175074,0.366508,0.868254,329664.0
MonthlyIncome,float64,120269,120269,29731,6670.221237,14384.674215,0.0,3400.0,5400.0,8249.0,3008750.0
NumberOfOpenCreditLinesAndLoans,int64,150000,150000,0,8.45276,5.145951,0.0,5.0,8.0,11.0,58.0
NumberOfTimes90DaysLate,int64,150000,150000,0,0.265973,4.169304,0.0,0.0,0.0,0.0,98.0
NumberRealEstateLoansOrLines,int64,150000,150000,0,1.01824,1.129771,0.0,0.0,1.0,2.0,54.0
NumberOfTime60-89DaysPastDueNotWorse,int64,150000,150000,0,0.240387,4.155179,0.0,0.0,0.0,0.0,98.0


Row ID,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.658180,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.233810,30,0,0.036050,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
149996,0,0.040674,74,0,0.225131,2100.0,4,0,1,0,0.0
149997,0,0.299745,44,0,0.716562,5584.0,4,0,1,0,2.0
149998,0,0.246044,58,0,3870.000000,,18,0,1,0,0.0
149999,0,0.000000,30,0,0.000000,5716.0,4,0,0,0,0.0
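
Two figures in the summary above are worth reading off explicitly: the share of missing `MonthlyIncome` values and the positive rate of the target. The following is a minimal sketch of that arithmetic, with the counts copied by hand from the summary table; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Counts copied by hand from the summary table above.
n_rows = 150_000
monthly_income_nulls = 29_731
positive_rate = 0.06684  # mean of the 0/1 target SeriousDlqin2yrs

# Share of rows where MonthlyIncome is missing.
missing_share = monthly_income_nulls / n_rows

# The mean of a 0/1 column is the fraction of ones, so the number of
# delinquent borrowers is the mean times the row count.
n_positive = round(positive_rate * n_rows)

print(f"MonthlyIncome missing: {missing_share:.1%}")
print(f"Delinquent borrowers: {n_positive} of {n_rows}")
```

Roughly one in five rows lacks an income value and fewer than 7% of borrowers are delinquent, which is consistent with the imputation step and the balanced class weights used later in the notebook.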


## Loading the Training IDs

Retrieve the row IDs assigned to the training set. Rows of `credit_data_df` whose IDs appear in this list form the training set; all remaining rows serve as validation data.
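
The split described above can be sketched on a toy DataFrame. The data and the ID list below are hypothetical stand-ins; the real IDs come from `GiveMeSomeCredit.get_training_row_ids()`, whose implementation is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for credit_data_df: six rows indexed by row ID.
toy_df = pd.DataFrame(
    {"SeriousDlqin2yrs": [1, 0, 0, 0, 0, 1]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Row ID"),
)

# Hypothetical training IDs (assumption for illustration only).
toy_training_ids = [1, 2, 3, 5, 6]

# Rows whose ID is listed form the training set ...
toy_training_df = toy_df.loc[toy_training_ids]

# ... and every remaining row is validation data.
toy_validation_df = toy_df.loc[~toy_df.index.isin(toy_training_ids)]

print(toy_training_df.index.tolist())
print(toy_validation_df.index.tolist())
```

Selecting with `.loc` keeps the original row IDs as the index, so predictions can later be matched back to the same rows.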


In [3]:
training_ids = GiveMeSomeCredit.get_training_row_ids()

# Model Building

In this section, we will construct a random forest pipeline to generate predictions on the dataset and save the results for later evaluation.


## Prepare Training Data and Build Model Pipeline

- Select the training subset from the full dataset using `training_ids`.
- Separate features (`X`) and target variable (`y`) where `"SeriousDlqin2yrs"` is the target.
- Build a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that:
  - Imputes missing values with each feature's median: [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
  - Standardizes each feature to zero mean and unit variance (this rescales the features but does not make them normally distributed): [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
  - Trains a random forest classifier with balanced class weights to counter class imbalance: [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- Fit the model pipeline on the training data.
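
To make the "balanced class weights" step concrete, here is a small sketch of how scikit-learn derives those weights from the class counts. The toy target `y_toy` is an assumption chosen for easy arithmetic; it is not the notebook's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 9 negatives, 1 positive.
y_toy = np.array([0] * 9 + [1])
classes = np.unique(y_toy)

# scikit-learn's "balanced" heuristic.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the documented formula:
# n_samples / (n_classes * count_of_each_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class receives the larger weight, so errors on delinquent borrowers count for more when the trees are grown.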


In [4]:
# Keep only the rows assigned to the training set
training_df = credit_data_df.loc[training_ids]

# Split into features (X) and target (y)
X = training_df.drop("SeriousDlqin2yrs", axis=1)
y = training_df["SeriousDlqin2yrs"]

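# Define pipeline steps: impute, scale, then classify
model = sklearn.pipeline.Pipeline([
    # Replace missing values with each feature's median
    ("imputer", sklearn.impute.SimpleImputer(strategy="median")),
    # Standardize each feature to zero mean and unit variance
    ("scaler", sklearn.preprocessing.StandardScaler()),
    ("classifier", sklearn.ensemble.RandomForestClassifier(
        class_weight="balanced",  # weight classes inversely to their frequency
        n_estimators=100,         # number of trees (the scikit-learn default)
        max_depth=None,           # no depth limit (the scikit-learn default)
        random_state=0,           # fixed seed for reproducibility
    )),
])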

model.fit(X, y)
print(model)

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('classifier',
                 RandomForestClassifier(class_weight='balanced',
                                        random_state=0))])
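
As a rough illustration of what the two preprocessing steps in the printed pipeline do, the sketch below runs them on a single toy feature with one missing value. The array `X_toy` and the name `preprocess` are assumptions made for this example only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One toy feature with a missing value.
X_toy = np.array([[1.0], [2.0], [np.nan], [9.0]])

preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_out = preprocess.fit_transform(X_toy)

# The median of the observed values 1, 2 and 9 is 2, so NaN becomes 2.
imputed = preprocess.named_steps["imputer"].transform(X_toy)
print(imputed.ravel())

# After scaling, the feature has zero mean and unit variance.
print(X_out.mean(), X_out.std())
```

Tree-based models are generally insensitive to monotonic rescaling of features, so the scaler mainly keeps this pipeline structurally parallel to pipelines built for other model types.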


## Generate Predictions on Entire Dataset

- Prepare the feature set `X_all` by dropping the target column.
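- Use the trained model to predict the probability of the positive class (`pred_probs`).
- Generate binary class predictions (`pred`) with `predict`, which selects the class with the highest mean predicted probability across the trees (for two classes, the positive class when its probability exceeds 0.5).
- Combine the class predictions and probabilities into a DataFrame `predictions_df` that shares the index of the original dataset.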
- Display the predictions for review.
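
The relationship between `predict_proba` and `predict` can be checked on synthetic data. This is a sketch under stated assumptions: the data come from `make_classification`, and the alternative `threshold` value is hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced binary data (for illustration only).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X_toy, y_toy)

proba = forest.predict_proba(X_toy)  # shape (n_samples, 2)
pred_probs_toy = proba[:, 1]         # probability of the positive class
pred_toy = forest.predict(X_toy)

# predict() returns the class with the highest averaged probability.
pred_from_proba = forest.classes_[proba.argmax(axis=1)]
print(np.array_equal(pred_toy, pred_from_proba))

# A different decision threshold can be applied to the probabilities directly.
threshold = 0.3  # hypothetical value
pred_custom = (pred_probs_toy >= threshold).astype(int)
print(pred_toy.sum(), pred_custom.sum())
```

Keeping the probabilities alongside the class labels, as `predictions_df` does, leaves the choice of threshold open for later evaluation.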

In [5]:
X_all = credit_data_df.drop("SeriousDlqin2yrs", axis=1)

pred_probs = model.predict_proba(X_all)[:, 1]
pred = model.predict(X_all)

predictions_df = pd.DataFrame(
    {"pred": pred, "pred_probs": pred_probs},
    index=credit_data_df.index,
)

display(predictions_df)


Row ID,pred,pred_probs
1,1,0.71
2,0,0.03
3,0,0.10
4,0,0.00
5,0,0.08
...,...,...
149996,0,0.00
149997,0,0.00
149998,0,0.00
149999,0,0.00


## Save Results

Save the predictions under the name "Random Forest", then reload and display the stored result tables, which also hold the Logistic Regression predictions.

In [6]:
GiveMeSomeCredit.save_train_validation_results("Random Forest", predictions_df)

display(
    *GiveMeSomeCredit.load_training_validation_results()
)

column:  ('Random Forest', 'pred')
column:  ('Random Forest', 'pred_probs')
column:  ('Random Forest', 'pred')
column:  ('Random Forest', 'pred_probs')


2025-09-12 17:12:08,714 - INFO - Saved DataFrame to processed directory: /Users/rina/llm-classification/data/GiveMeSomeCredit/processed/training_results.csv
2025-09-12 17:12:08,787 - INFO - Saved DataFrame to processed directory: /Users/rina/llm-classification/data/GiveMeSomeCredit/processed/validation_results.csv


,Logistic Regression,Logistic Regression,Random Forest,Random Forest
Row ID,pred,pred_probs,pred,pred_probs
1,1,0.756797,1,0.71
2,0,0.464318,0,0.03
3,1,0.707312,0,0.10
5,0,0.209631,0,0.08
6,0,0.249206,0,0.01
...,...,...,...,...
149996,0,0.243774,0,0.00
149997,0,0.451480,0,0.00
149998,0,0.298296,0,0.00
149999,1,0.504252,0,0.00


,Logistic Regression,Logistic Regression,Random Forest,Random Forest
Row ID,pred,pred_probs,pred,pred_probs
4,1,0.518869,0,0.00
12,0,0.413919,0,0.00
14,1,0.947342,1,0.63
24,0,0.476525,0,0.00
25,0,0.340831,0,0.03
...,...,...,...,...
149984,0,0.180238,0,0.00
149985,0,0.170122,0,0.00
149987,0,0.386821,0,0.03
149992,0,0.309497,0,0.00
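
The result tables above use two-level column labels of the form (model name, column name). The internals of `save_train_validation_results` are not shown in this notebook, so the following is only a sketch of one way such a layout can be built and queried with pandas; the toy values are illustrative.

```python
import pandas as pd

# Toy predictions from two models, indexed by row ID (illustrative values).
index = pd.Index([1, 2, 3], name="Row ID")
logreg_df = pd.DataFrame(
    {"pred": [1, 0, 1], "pred_probs": [0.76, 0.46, 0.71]}, index=index
)
forest_df = pd.DataFrame(
    {"pred": [1, 0, 0], "pred_probs": [0.71, 0.03, 0.10]}, index=index
)

# Concatenating a dict of DataFrames along the columns produces
# two-level column labels: (model name, column name).
results_df = pd.concat(
    {"Logistic Regression": logreg_df, "Random Forest": forest_df}, axis=1
)
print(results_df.columns.tolist())

# Selecting the top level returns one model's columns.
forest_only = results_df["Random Forest"]
print(forest_only)
```

With this layout, each model's predictions stay aligned on the same row IDs, which makes side-by-side comparison straightforward.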
