# Script Assignment 4

## Group 4

## Assignment

You are working as a data scientist for a financial technology company specializing in credit risk assessment. Your task is to build and evaluate an ensemble model using historical loan data to predict the likelihood of default for new loan applicants.

- Load the historical loan dataset (`loan_data.csv`) containing features such as credit score, income, loan amount, and default status. Preprocess the data by handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.
- Implement three different ensemble models: Random Forest, Gradient Boosting, and Voting Classifier. Train each ensemble model on the training dataset and evaluate its performance using appropriate evaluation metrics for classification tasks (e.g., accuracy, precision, recall, F1-score, ROC-AUC).
- Compare the performance of the three ensemble models and select the best-performing model based on evaluation metrics. Provide insights into why the selected ensemble model might be well-suited for credit risk assessment in fintech.

## Links

- [Kaggle Loan Default Dataset](https://www.kaggle.com/datasets/yasserh/loan-default-dataset/data)
- [Data Cleaning and Preprocessing Tactics](https://www.kaggle.com/code/nkitgupta/advance-data-preprocessing)

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

## Load the dataset 

In [None]:
# Load the dataset
file_path = '../data/loan_data.csv'
df_loan_data = pd.read_csv(file_path)
df_loan_data.head()

## Review the dataframe and visualize missing data

In this section, we will review the dataframe's structure and visualize the missing data. We will also define a function that reports which columns have missing data and what percentage of each column is missing.

### Review the dataframe

In [None]:
df_loan_data.info()

### Calculate missing data percentages

In [None]:
def calculate_missing_percentages(df):
    """Calculate the percentage of missing data in each column of a dataframe."""
    total = df.shape[0]
    missing_columns = [col for col in df.columns if df[col].isnull().sum() > 0]
    miss_pct = {}
    for col in missing_columns:
        null_count = df[col].isnull().sum()
        per = (null_count/total) * 100
        miss_pct[col] = per
        print(f"{col}: {null_count} ({per:.3f}%)")
    return miss_pct

In [None]:
_ = calculate_missing_percentages(df_loan_data)
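
The same percentages can also be computed without an explicit loop, because the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a made-up toy frame; `missing_percentages` and `toy` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    pct = df.isnull().mean() * 100  # mean of a boolean mask = fraction missing
    # Keep only the columns that actually have missing values
    return pct[pct > 0].sort_values(ascending=False)

# Toy frame standing in for df_loan_data (values are invented)
toy = pd.DataFrame({
    "income": [1000.0, np.nan, 3000.0, np.nan],
    "age": ["25-34", "35-44", None, "45-54"],
    "Status": [0, 1, 0, 1],
})

result = missing_percentages(toy)
print(result)
```

Returning a sorted `Series` keeps the result easy to filter or plot later, whereas the loop version is convenient when a printed report is all that is needed.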

### Visualize missing data

The `missingno` library draws a matrix view of the missing data, rendered through `matplotlib`. Each row of the matrix is a record, and gaps in a column mark missing values.

In [None]:
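# Pass figsize to msno.matrix directly; calling plt.figure() afterwards
# would only open a second, empty figure
msno.matrix(df_loan_data, figsize=(15, 9))
plt.show()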

From both the chart and statistics above, there are a few columns with substantial missing data. We will need to address this missing data before we can proceed with building the ensemble models.

## Data Preprocessing

In this section, we will review features, handle the missing data, encode the categorical variables, and split the dataset into training and testing sets.

### Review features

We'll start by reviewing the features in the dataset and removing any that we expect to have little to no impact on the model.

According to an article on forbes.com, the following features are important for credit risk assessment:

- Credit Score and History
- Income
- Debt-to-income Ratio
- Collateral
- Origination Fee

To simplify the dataset, we'll drop the columns listed in the cell below:

In [None]:
df_loan_data.drop(['loan_limit','Gender', 'approv_in_adv','loan_type', 'loan_purpose', 'Credit_Worthiness','open_credit',
        'business_or_commercial', 'rate_of_interest', 'Interest_rate_spread', 'Neg_ammortization', 'interest_only',
        'lump_sum_payment', 'construction_type', 'occupancy_type', 'Secured_by', 'total_units', 'credit_type',
        'co-applicant_credit_type', 'submission_of_application', 'Region', 'Security_Type', 'ID', 'year'], axis = 1, inplace = True)

df_loan_data.head()

### Handle missing data


In [None]:
# Review the missing data percentages
_ = calculate_missing_percentages(df_loan_data)

#### Handle missing numerical data

We will use the `SimpleImputer` class to fill the missing values in each numerical column with that column's mean. Categorical columns are handled separately in the next subsection.

In [None]:
# Only numerical features
num_cols = [col for col in df_loan_data.columns if df_loan_data[col].dtype != 'object']
print(num_cols)

In [None]:
imputer = SimpleImputer(strategy='mean')
# Impute all numerical columns in one call; SimpleImputer computes a separate mean per column
df_loan_data[num_cols] = imputer.fit_transform(df_loan_data[num_cols])

df_loan_data.head()

In [None]:
df_loan_data.isnull().sum()
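
Fitting the imputer on the full dataframe lets the test rows influence the column means. A common alternative is to split first and fit the imputer on the training rows only. The following is a minimal sketch of that ordering on a made-up toy frame; the values and the `toy` name are assumptions for illustration, not part of the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data standing in for two numerical columns (values are invented)
toy = pd.DataFrame({
    "income": [1000.0, 2000.0, np.nan, 4000.0, 5000.0, np.nan, 7000.0, 8000.0],
    "Credit_Score": [600.0, np.nan, 700.0, 650.0, 720.0, 680.0, np.nan, 710.0],
})

# Split first so the test rows never contribute to the imputation statistics
train, test = train_test_split(toy, test_size=0.25, random_state=1)

mean_imputer = SimpleImputer(strategy="mean")
# fit_transform learns the column means from the training rows only
train_filled = pd.DataFrame(
    mean_imputer.fit_transform(train), columns=train.columns, index=train.index
)
# transform reuses those training means on the test rows
test_filled = pd.DataFrame(
    mean_imputer.transform(test), columns=test.columns, index=test.index
)

print(train_filled.isnull().sum().sum(), test_filled.isnull().sum().sum())
```

The learned means are available in `mean_imputer.statistics_`, which makes it easy to confirm they match the training columns.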

#### Handle missing categorical variables

Now we'll turn to the categorical variables. Rather than imputing the most frequent value, the cell below drops every row that still has a missing value.

In [None]:
# Numerical gaps were filled above, so any remaining nulls are in categorical columns
df_loan_data.dropna(inplace = True)
df_loan_data.isnull().sum()

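
If losing rows is a concern, filling categorical gaps with the most frequent value is an alternative to dropping them. The sketch below shows `SimpleImputer(strategy="most_frequent")` on a made-up column; the `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with two missing entries (values are invented)
toy = pd.DataFrame(
    {"age": ["35-44", "45-54", np.nan, "35-44", np.nan]}, dtype=object
)

cat_imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a 2D array; ravel() flattens it back to one column
toy["age"] = cat_imputer.fit_transform(toy[["age"]]).ravel()

print(toy["age"].tolist())
```

Whether to impute or drop depends on how many rows are affected: imputing keeps the sample size but pushes the filled rows toward the majority category.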
### Encode categorical variables

We will use the `LabelEncoder` class to convert the `age` variable to integer codes.

In [None]:
label_encoder = LabelEncoder()
df_loan_data['age'] = label_encoder.fit_transform(df_loan_data['age'])
df_loan_data['age']
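
`LabelEncoder` assigns codes in sorted label order, so if the age bands are strings such as `<25` and `25-34`, the codes may not follow the natural age order (for example, `<` sorts after digits). An explicit mapping is one way to preserve the order. This is a sketch only: the band labels in `AGE_ORDER` are an assumption and should be checked against `df_loan_data['age'].unique()`.

```python
import pandas as pd

# Assumed band labels, listed from youngest to oldest (verify against the real data)
AGE_ORDER = ["<25", "25-34", "35-44", "45-54", "55-64", "65-74", ">74"]
age_to_rank = {label: rank for rank, label in enumerate(AGE_ORDER)}

# Toy column standing in for df_loan_data['age']
toy = pd.DataFrame({"age": ["35-44", "<25", ">74", "25-34"]})

# map() replaces each label with its rank; unknown labels would become NaN
toy["age_encoded"] = toy["age"].map(age_to_rank)

print(toy)
```

Tree-based ensembles split on thresholds, so an encoding that respects the age order lets a single split separate younger from older applicants.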

### Split the dataset into training and testing sets

In [None]:
# Split the dataset into training and testing sets
X = df_loan_data.drop('Status', axis=1)
y = df_loan_data['Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
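
Default datasets are often imbalanced, and a purely random split can leave the training and testing sets with different default rates. Passing `stratify=y` to `train_test_split` keeps the class proportions aligned. The sketch below uses a made-up frame with a 25% default rate; the `toy` data and the `_toy` variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 25 of which are defaults (Status == 1)
toy = pd.DataFrame({
    "Credit_Score": range(100),
    "Status": [1] * 25 + [0] * 75,
})
X_toy = toy.drop("Status", axis=1)
y_toy = toy["Status"]

# stratify=y_toy preserves the default rate in both partitions
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_train_toy.mean(), y_test_toy.mean())
```

Matching class proportions make metrics such as recall and ROC-AUC on the test set more representative of the training distribution.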