# E402 Final Project

This notebook builds a machine learning model that predicts loan default risk from historical data provided by a financial institution. The model is a random forest classifier, preceded by several data preprocessing and feature selection steps.

The code begins by loading the training dataset, which contains applicants' personal and financial characteristics. The data is then cleaned and preprocessed: binary categorical variables are encoded as 0 and 1, missing values are imputed with the column median, and the features are standardized. The resulting dataset is split into training and validation sets, and a random forest classifier is fit on the training set. The trained model then makes predictions on the validation set, from which accuracy, precision, recall, F1 score, and a confusion matrix are calculated. Finally, the model predicts whether each applicant in a separate test dataset will default, and the results are returned as a pandas DataFrame.

In [3]:
import pandas as pd
import numpy as np
import math
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

In [24]:
class CreditDefaultRF():
    """
    Upon initialization, this class trains a Random Forest Classifier on a dataset of 300,000+ loan applicants from
    Home Credit Group. The model trains on 8 features: Number of Children, Income, Amount of Credit, Age (in days),
    Car Ownership, Realty Ownership, Gender and Goods Price.
    
    NOTE: Before using this class, you must download the dataset from the Home Credit Default Risk Kaggle competition
    (https://www.kaggle.com/c/home-credit-default-risk/data) and place the 'application_train.csv' file in a local
    directory. The path to this file must be specified in the 'pd.read_csv()' call in the code below.
    """
    
    def __init__(self):
        
        # -- LOAD IN TRAINING DATASET FROM HOME CREDIT GROUP -- #
        self.df_train = pd.read_csv('/Users/sampence/Documents/IU Bloom/E402 - Computational Methods in Macro/Final Project/home-credit-default-risk/application_train.csv')
        
        # Convert all binary values to 1 and 0 
        mapping = {'Y': 1, 'N': 0, 'M': 1, 'F': 0}
        cols = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
        self.df_train[cols] = self.df_train[cols].applymap(mapping.get)

        # Label for targeting
        self.label = np.where(self.df_train['TARGET'] == 1, 1, 0)
        
        # Feature Selection (copy so that the in-place edits below do not touch df_train)
        self.train_features = self.df_train[
            ['CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT', 'DAYS_BIRTH', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
            'CODE_GENDER', 'AMT_GOODS_PRICE']
        ].copy()

        # Fill null values in each column with median of the column.
        for col in self.train_features.columns:
            if self.train_features[col].isna().sum() > 0:
                self.train_features.loc[:, col] = self.train_features[col].fillna(self.train_features[col].median())
            else:
                pass
        
        # Double check for null values after their replacement
        for col in self.train_features.columns:
            if self.train_features[col].isnull().any():
                print(f"Column {col} contains NaN values.")
            else:
                pass
            
        # Scaling features to Mean = 0 and Std. Dev = 1
        scaler = StandardScaler()
        self.train_features = scaler.fit_transform(self.train_features)
        
         # Create a RandomForestClassifier to fit on the training data.
        self.model = RandomForestClassifier(n_estimators=100, random_state=40)
        
        # Split the data into training and validation sets
        X_train, X_val, y_train, self.y_val = train_test_split(
            self.train_features, self.label, test_size=0.2, random_state=42)

        # Train your model on the training set
        self.model.fit(X_train, y_train)

        # Make predictions on the validation set
        self.y_pred = self.model.predict(X_val)

        # Evaluate the performance of your model on the validation set
        self.accuracy = accuracy_score(self.y_val, self.y_pred)
        self.precision = precision_score(self.y_val, self.y_pred)
        self.recall = recall_score(self.y_val, self.y_pred)
        self.f1 = f1_score(self.y_val, self.y_pred)
        self.conf_matrix = confusion_matrix(self.y_val, self.y_pred)

        
#----------------------------#
#---TEST PERFORMANCE---------#
#----------------------------#       


    def get_validation_results(self):
        """
        Returns a dataframe comparing actual and predicted labels on the validation set.
        """
        results = pd.DataFrame({'actual': self.y_val, 'predicted': self.y_pred})
        results['is_correct'] = results['actual'] == results['predicted']
        return results


    def metrics(self):
        """
        Returns a dataframe with the model evaluation metrics.
        """
        # Unpack attributes of self
        accuracy, precision, recall, f1 = self.accuracy, self.precision, self.recall, self.f1
        
        # Create a dictionary with the metric names and their values
        metrics_dict = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }
    
        # Create a dataframe from the metrics dictionary
        metrics_df = pd.DataFrame.from_dict(metrics_dict, orient='index', columns=['Value'])
    
        # Set the index name
        metrics_df.index.name = 'Metric'
    
        # Return the dataframe
        return metrics_df  
    
    
    
#------------------------------------------------------------------------#
#--- EVERYTHING BELOW THIS LINE SHOULD BE USED TO TEST NEW APPLICANTS ---#
#------------------------------------------------------------------------#

    def predictions(self, df_test):
        """
        Predicts the loan repayment status of a set of loan applicants.

        Args:
            df_test (pandas.DataFrame): A DataFrame containing the loan applicant information to be predicted.
        
        Returns:
            pandas.DataFrame: A DataFrame with a 'Prediction' column holding a binary prediction (1 = default) for each loan applicant in df_test.
        """
        # Feature Selection (copy so that the caller's DataFrame is not modified)
        test_features = df_test[
            ['CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT', 'DAYS_BIRTH', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
            'CODE_GENDER', 'AMT_GOODS_PRICE']
        ].copy()

        # Convert all binary values to 1 and 0
        mapping = {'Y': 1, 'N': 0, 'M': 1, 'F': 0}
        cols = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
        test_features[cols] = test_features[cols].applymap(mapping.get)
        
        # Fill null values in each column with median of the column.
        for col in test_features.columns:
            if test_features[col].isna().sum() > 0:
                test_features.loc[:, col] = test_features[col].fillna(test_features[col].median())
            else:
                pass
        
        # Double check for null values after their replacement
        for col in test_features.columns:
            if test_features[col].isnull().any():
                print(f"Column {col} contains NaN values.")
            else:
                pass

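        # Scaling features to Mean = 0 and Std. Dev = 1.
        # NOTE: this fits a new scaler on the test data instead of reusing the one fitted
        # on the training data in __init__, so the test features are standardized with
        # different means and standard deviations than the ones the model was trained on.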
        scaler = StandardScaler()
        test_features = scaler.fit_transform(test_features)
        
        # Make predictions on test set
        predictions = self.model.predict(test_features)
        pred_df = pd.DataFrame(predictions, columns=['Prediction'])
        # Return predictions
        return pred_df
    
    def defaults(self, pred_df):
        """
        Return only the loans that are predicted to default.

        Args:
        -----------
        pred_df : pandas.DataFrame
            A DataFrame containing loan predictions (obtained by calling the 'predictions' method), where each row corresponds
            to a loan and has a 'Prediction' column with a value of 1 if the loan is predicted to default and 0 otherwise.

        Returns:
        --------
        pandas.DataFrame
            A new DataFrame containing only the rows from `pred_df` where 'Prediction' == 1, indicating that the
            loan is predicted to default.
        """

        # Check if the input dataframe contains the Prediction column
        if 'Prediction' not in pred_df.columns:
            raise ValueError("Input dataframe does not contain the Prediction column")
            
        # Keep only the loans that are predicted to default
        pred_df = pred_df[pred_df['Prediction'] == 1]
        
        # Return the dataframe of positive predictions
        return pred_df
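
One detail of the `predictions` method is worth illustrating: it fits a new `StandardScaler` on the test data, whereas the usual practice is to fit the scaler once on the training data and reuse it for new data. The sketch below uses small synthetic arrays (not the Home Credit data) to show how the two approaches differ; `X_train`, `X_new`, and the numbers in them are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for one training feature and the same feature in new data.
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_new = np.array([[30.0], [40.0], [50.0]])

# Fit once on the training data, then reuse the same mean and std for new data.
scaler = StandardScaler().fit(X_train)
reused = scaler.transform(X_new)

# Refit on the new data (what a fresh fit_transform call does).
refit = StandardScaler().fit_transform(X_new)

print("reused training scaler:", reused.ravel())
print("refit on new data:     ", refit.ravel())
```

With a reused scaler, a raw value of 30 maps to the same standardized number in training and in new data. After refitting it does not, so the split thresholds learned by the trees may no longer line up with the inputs. In the class above, one way to apply this would be to store the scaler created in `__init__` (for example as a hypothetical `self.scaler`) and call its `transform` method inside `predictions`.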
    

In [25]:
%%time
# Instantiate
obj = CreditDefaultRF()

CPU times: user 48.1 s, sys: 2.04 s, total: 50.1 s
Wall time: 50.6 s


In [8]:
# Evaluate performance metrics
metrics = obj.metrics()
metrics

Metric,Value
accuracy,0.912866
precision,0.133929
recall,0.015155
f1,0.027228


In [29]:
print(obj.y_val)
print(obj.y_pred)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


In [9]:
# Show Confusion Matrix
x = obj.conf_matrix
print(x)

[[56069   485]
 [ 4874    75]]
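
The metrics reported earlier follow directly from this confusion matrix. As a minimal sketch, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes (so the layout is `[[TN, FP], [FN, TP]]`), the four counts can be unpacked and the metrics recomputed by hand. The variable names are illustrative, and the majority-class baseline is added only for comparison.

```python
# Counts copied from the confusion matrix printed above.
# Assumed layout (scikit-learn convention): [[TN, FP], [FN, TP]].
tn, fp = 56069, 485
fn, tp = 4874, 75

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted defaults that were actual defaults
recall = tp / (tp + fn)      # share of actual defaults that the model caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial baseline that predicts "no default" for every applicant.
baseline_accuracy = (tn + fp) / total

print(f"accuracy  = {accuracy:.6f}")
print(f"precision = {precision:.6f}")
print(f"recall    = {recall:.6f}")
print(f"f1        = {f1:.6f}")
print(f"baseline  = {baseline_accuracy:.6f}")
```

Recomputed this way, the values should agree with the metrics table. The baseline comparison suggests that the high accuracy mostly reflects the class imbalance (about 8% of validation applicants defaulted) rather than the model's ability to identify defaults, which is what the low recall measures.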


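The `feature_importances_` attribute of the fitted random forest classifier shows how much each feature contributes to the model and is used here to identify viable features. The values are listed in the same order as the feature columns selected in `__init__`.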

In [10]:
obj.model.feature_importances_

array([0.02743781, 0.18620601, 0.18933601, 0.45450364, 0.00735648,
       0.01281266, 0.00671918, 0.11562822])
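
The raw array is easier to read once each value is paired with its column name. The sketch below copies the importances from the output above and the feature names from the selection list in `__init__`, then sorts them from most to least important; the variable names are illustrative.

```python
# Feature names in the order they are selected in CreditDefaultRF.__init__.
feature_names = [
    'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_BIRTH',
    'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CODE_GENDER', 'AMT_GOODS_PRICE',
]

# Importances copied from the output above, in the same order.
importances = [
    0.02743781, 0.18620601, 0.18933601, 0.45450364,
    0.00735648, 0.01281266, 0.00671918, 0.11562822,
]

# Pair each name with its importance and sort from largest to smallest.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name:<18}{value:.4f}")
```

Under this pairing, `DAYS_BIRTH` (the applicant's age in days) accounts for roughly 45% of the total importance, while the three binary flags together contribute under 3%.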

In [11]:
len(obj.y_pred)

61503

In [15]:
# Predict
df_test = pd.read_csv('/Users/sampence/Documents/IU Bloom/E402 - Computational Methods in Macro/Final Project/home-credit-default-risk/application_test.csv')
predictions = obj.predictions(df_test)
predictions

,Prediction
0,0
1,0
2,0
3,0
4,0
...,...
48739,0
48740,0
48741,0
48742,0


In [14]:
# Keep only the loans predicted to default
positives = obj.defaults(predictions)
len(positives)

157