# Use Case: Predicting Bank Loan Defaults
### A data science approach to predicting loan defaults and understanding applicant profiles, to minimize the risk of future defaults.

The dataset contains information about credit applicants. Banks around the world use this kind of data to build models that help decide which loan applications to accept or refuse.
After exploratory data analysis, cleansing, and handling the anomalies found along the way, the patterns that separate good applicants from bad ones are exposed for machine learning models to learn.
* Machine Learning issue and objectives
    This is a supervised binary classification problem. The goal is to train the model that best predicts loan defaults from past customers' profiles, so as to minimize the risk of future defaults.

* Performance Metric
    The models are evaluated with ROC AUC, because the data is highly imbalanced.

* Project structure
    The project is divided into three stages:
    * EDA: Exploratory Data Analysis
    * Data Wrangling: Cleansing and Feature Selection
    * Machine Learning: Predictive Modelling
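
To illustrate why ROC AUC is preferred over accuracy on imbalanced data, here is a minimal sketch. The labels are made up for illustration (95 good loans, 5 bad loans) and are not taken from the dataset. A "model" that scores every applicant identically looks accurate but has no ability to rank bad loans above good ones.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 95 good loans (0) and 5 bad loans (1).
y_true_demo = np.array([0] * 95 + [1] * 5)

# A "model" that gives every applicant the same score and always predicts "good".
constant_scores = np.zeros(len(y_true_demo))
constant_labels = np.zeros(len(y_true_demo), dtype=int)

demo_accuracy = accuracy_score(y_true_demo, constant_labels)
demo_auc = roc_auc_score(y_true_demo, constant_scores)

print(f"accuracy: {demo_accuracy:.2f}")  # 0.95, despite missing every bad loan
print(f"ROC AUC:  {demo_auc:.2f}")       # 0.50, i.e. no better than chance
```

Accuracy rewards the model for the majority class alone, while ROC AUC stays at 0.5 until the scores actually separate the two classes.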


In [1]:
import os
os.system("pip install katonic[ml]")





0

In [2]:
# Importing the Necessary packages

import warnings
warnings.filterwarnings("ignore")
import os
import pickle

import numpy as np
import pandas as pd
pd.set_option("display.max_columns",100)
from scipy import stats

# from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

from katonic.ml.client import set_exp
from katonic.ml.classification import Classifier

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec

In [4]:
# Loading the data.

clean_data = pd.read_csv("https://raw.githubusercontent.com/vinaynaman/miscellaneous/main/test_data_for_model_creation.csv")

In [5]:
clean_data.head()

Unnamed: 0.1,Unnamed: 0,id,event_timestamp,annual_inc,short_emp,emp_length_num,dti,last_delinq_none,revol_util,total_rec_late_fee,od_ratio,bad_loan,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,purpose_car,purpose_credit_card,purpose_debt_consolidation,purpose_home_improvement,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,0,11454641,2016-02-08 00:37:08+00:00,100000,1,1,26.27,1,43.2,0.0,0.160624,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,9604874,2016-02-08 05:56:20+00:00,83000,0,4,5.39,0,21.5,0.0,0.810777,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2,9684700,2016-02-08 06:15:39+00:00,78000,0,11,18.45,1,46.3,0.0,0.035147,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,3,9695736,2016-02-08 06:15:39+00:00,37536,0,6,12.28,0,10.7,0.0,0.534887,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,4,9795013,2016-02-08 06:51:45+00:00,65000,0,11,11.26,0,15.2,0.0,0.1665,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [6]:
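# Both "Unnamed" columns are leftover row indices from earlier CSV exports;
# drop them so that neither is used as a feature.
clean_data.drop(["Unnamed: 0.1", "Unnamed: 0"], axis=1, inplace=True)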

In [7]:
# Before training, we need to split the data for training and testing.
# First, separate the data into dependent and independent features.

X = clean_data.drop(["id","event_timestamp","bad_loan"],axis = 1)
y = clean_data["bad_loan"]
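
Since the choice of metric rests on the target being imbalanced, it can help to check the class shares before splitting. The sketch below uses a small made-up series in place of `clean_data["bad_loan"]`; in the notebook the same call would be `y.value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical target column standing in for clean_data["bad_loan"].
demo_target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="bad_loan")

# Fraction of rows in each class (0 = good loan, 1 = bad loan).
class_share = demo_target.value_counts(normalize=True)
print(class_share)
```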

#### Splitting the data in an 80:20 ratio for train and test respectively.

In [8]:
# Splitting the Dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101, shuffle=True)
# ros = RandomOverSampler()

# X_train, y_train = ros.fit_resample(X_train, y_train)
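
The split above shuffles the rows but does not stratify them. With an imbalanced target, one option (not used in this notebook) is to pass `stratify` so that both partitions keep the same share of bad loans. The sketch below uses synthetic stand-in data with made-up names (`demo_X`, `demo_y`), not the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loan data: 1,000 rows, 10% bad loans.
rng = np.random.default_rng(101)
demo_X = pd.DataFrame({
    "annual_inc": rng.normal(70_000, 20_000, 1_000),
    "dti": rng.uniform(0, 35, 1_000),
})
demo_y = pd.Series([1] * 100 + [0] * 900, name="bad_loan")

# stratify=demo_y keeps the class ratio the same in both partitions.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=101, shuffle=True, stratify=demo_y
)

print("bad-loan share in train:", ys_train.mean())
print("bad-loan share in test: ", ys_test.mean())
```

Without stratification, a small test set can end up with noticeably more or fewer bad loans than the training set, which makes the reported ROC AUC noisier.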

In [9]:
# Creating a new experiment using set_exp function from ml client.
exp_name = "test-exp"

set_exp(exp_name)

<Experiment: artifact_location='s3://models/21', experiment_id='21', lifecycle_stage='active', name='test-exp', tags={}>

In [10]:
# Initialize our AutoML Classifier with the training data, the testing data, and the experiment name.

classifier = Classifier(X_train, X_test, y_train, y_test, experiment_name=exp_name)

In [11]:
exp_id = classifier.id

In [12]:
print("experiment name : ",classifier.name)
print("experiment location : ",classifier.location)
print("experiment id : ",classifier.id)
print("experiment status : ",classifier.stage)

experiment name :  test-exp
experiment location :  s3://models/21
experiment id :  21
experiment status :  active


In [13]:
# We have now successfully set up our environment to experiment with all the models.

### Logistic Regression

In [14]:
# Re-initialize the AutoML Classifier, then train a Logistic Regression model with default settings.

classifier = Classifier(X_train, X_test, y_train, y_test, experiment_name=exp_name)
classifier.LogisticRegression()

# Search space for hyperparameter tuning.
lrparams = {
    'solver': {
        'values': ['liblinear', 'lbfgs', 'sag'],
        'type': 'categorical'
    },
    'C': {
        'low': 0.6,
        'high': 1.0,
        'type': 'float'
    }
}

# Train again, this time tuning over the search space above.
classifier.LogisticRegression(is_tune=True, params=lrparams)

[I 2023-03-10 08:06:20,669] A new study created in memory with name: no-name-6121aa66-58ec-4b37-bc64-3ddcc58b1523
[I 2023-03-10 08:06:23,161] Trial 0 finished with value: 0.5 and parameters: {'solver': 'sag', 'C': 0.6}. Best is trial 0 with value: 0.5.
[I 2023-03-10 08:06:25,057] Trial 1 finished with value: 0.5 and parameters: {'solver': 'lbfgs', 'C': 0.8}. Best is trial 0 with value: 0.5.
[I 2023-03-10 08:06:27,037] Trial 2 finished with value: 0.5 and parameters: {'solver': 'lbfgs', 'C': 0.8}. Best is trial 0 with value: 0.5.
[I 2023-03-10 08:06:28,941] Trial 3 finished with value: 0.5 and parameters: {'solver': 'liblinear', 'C': 0.7}. Best is trial 0 with value: 0.5.
[I 2023-03-10 08:06:31,370] Trial 4 finished with value: 0.5 and parameters: {'solver': 'sag', 'C': 0.9}. Best is trial 0 with value: 0.5.


Number of finished trials:  5
Best trial:
  auc_roc_score:  0.5
  Params: 
    solver: sag
    C: 0.6
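
Every trial above reports an `auc_roc_score` of exactly 0.5, the value a classifier with no ranking ability produces. How katonic computes this score is not shown in the notebook, so the following is only a diagnostic sketch using scikit-learn directly on synthetic data (all names here are made up). It compares ROC AUC computed from hard class predictions with ROC AUC computed from predicted probabilities, since on imbalanced data the two can differ widely.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 5% positives, one feature shifted by the label.
rng = np.random.default_rng(0)
demo_labels = np.array([1] * 100 + [0] * 1900)
demo_feature = (2.0 * demo_labels + rng.normal(size=demo_labels.size)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    demo_feature, demo_labels, test_size=0.20, random_state=0, stratify=demo_labels
)

model = LogisticRegression().fit(f_train, l_train)

# AUC from hard 0/1 predictions uses a single threshold (0.5).
auc_from_labels = roc_auc_score(l_test, model.predict(f_test))
# AUC from probabilities uses the full ranking of applicants.
auc_from_probs = roc_auc_score(l_test, model.predict_proba(f_test)[:, 1])

print(f"ROC AUC from predicted labels:        {auc_from_labels:.3f}")
print(f"ROC AUC from predicted probabilities: {auc_from_probs:.3f}")
```

If a model predicts only the majority class, the label-based score collapses to 0.5 even when the probabilities carry signal. Running a comparison like this on the real `X_test` and `y_test` is one way to tell whether the 0.5 above reflects the model or the way the score is computed.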
