### Models for prediction on credit card applications


In this notebook, we build a supervised machine learning model that predicts the outcome of a credit card application from a set of applicant features.

In [164]:
# Import modules 
import numpy as np 
import pandas as pd

#### 1. Method of dropping rows with missing values
We reload the original credit card dataset and handle the missing data by simply dropping the affected rows.

In [165]:
cc_original = pd.read_csv('../datasets/crx.data_named.csv', index_col=[0])
cc_original.head()

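| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefaulter | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |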


In [166]:
cc_original.describe(include='all')

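| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefaulter | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690.0 | 690 | 690 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690.0 | 690 |
| unique | 3 | 350 | | 4 | 4 | 15 | 10 | | 2 | 2 | | 2 | 3 | 171.0 | | 2 |
| top | b | ? | | u | g | c | v | | t | f | | f | g | 0.0 | | - |
| freq | 468 | 12 | | 519 | 519 | 137 | 399 | | 361 | 395 | | 374 | 625 | 132.0 | | 383 |
| mean | | | 4.758725 | | | | | 2.223406 | | | 2.4 | | | | 1017.385507 | |
| std | | | 4.978163 | | | | | 3.346513 | | | 4.86294 | | | | 5210.102598 | |
| min | | | 0.0 | | | | | 0.0 | | | 0.0 | | | | 0.0 | |
| 25% | | | 1.0 | | | | | 0.165 | | | 0.0 | | | | 0.0 | |
| 50% | | | 2.75 | | | | | 1.0 | | | 0.0 | | | | 5.0 | |
| 75% | | | 7.2075 | | | | | 2.625 | | | 3.0 | | | | 395.5 | |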


In [167]:
# Load helper functions from src/utils-*-*.py to run on jupyter-notebook
%run ../src/utils-gather-assess.py
%run ../src/utils-explore-clean.py
%run ../src/utils-models.py

In [168]:
cc_drop = cc_original.copy()

# Replace all '?' placeholders with NaN values
for icol in cc_drop.columns:
    replace_feature_missingvalues(cc_drop, icol, '?', np.nan)

In [169]:
# Drop all rows that contain NaN values
cc_drop.dropna(inplace=True)

# Change the datatype of the Age feature to float
change_feature_datatype(cc_drop, 'Age', float)

 Details of column: Age
        - dtype(o): object
        - dtype(n): float64
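
The helpers `replace_feature_missingvalues` and `change_feature_datatype` live in `../src/` and are not shown here. As a rough sketch of what the cleaning step amounts to, the same three operations can be written in plain pandas. The toy dataframe and its values below are made up for illustration, and this is not necessarily how the helpers are implemented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit card data (values are made up)
toy = pd.DataFrame({
    "Gender": ["b", "a", "?", "b"],
    "Age": ["30.83", "?", "24.5", "27.83"],
    "Debt": [0.0, 4.46, 0.5, 1.54],
})

# Replace the '?' placeholder with NaN in every column at once
toy_clean = toy.replace("?", np.nan)

# Drop rows with at least one NaN, then cast Age from object to float
toy_clean = toy_clean.dropna().astype({"Age": float})

print(toy_clean.dtypes)
print(len(toy), "->", len(toy_clean), "rows")
```

Because `Age` contains the string `?`, pandas reads the whole column as `object`, which is why the cast to `float` is only possible after the placeholders are gone.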


In [170]:
cc_drop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          653 non-null    object 
 1   Age             653 non-null    float64
 2   Debt            653 non-null    float64
 3   Married         653 non-null    object 
 4   BankCustomer    653 non-null    object 
 5   EducationLevel  653 non-null    object 
 6   Ethnicity       653 non-null    object 
 7   YearsEmployed   653 non-null    float64
 8   PriorDefaulter  653 non-null    object 
 9   Employed        653 non-null    object 
 10  CreditScore     653 non-null    int64  
 11  DriversLicense  653 non-null    object 
 12  Citizen         653 non-null    object 
 13  ZipCode         653 non-null    object 
 14  Income          653 non-null    int64  
 15  Approved        653 non-null    object 
dtypes: float64(3), int64(2), object(11)
memory usage: 86.7+ KB


After dropping rows with missing values, we are left with 653 of the original 690 rows (instances), each with 15 features and 1 target variable.

In [171]:
cc_drop.to_csv('../datasets/crx.data_drop.csv')

In [172]:
show_files_datasets('../datasets')

 datasets/
       - crx.data
       - crx.names
       - crx.data_named.csv
       - crx.data_drop.csv
       - crx.data_clean.csv


The cleaned credit card data now lives in the `cc_drop` dataframe. Next, we split it into numerical features, categorical features, and the target, and then apply **one-hot encoding** to the categorical features before recombining them with the numerical ones.

In [183]:
# Split dataframe into num_df, cat_df and target_df
cc_drop_num, cc_drop_cat, cc_drop_target = split_dataframe_datatypes(cc_drop, 'Approved')

In [174]:
# Perform OneHotEncoding on Categorical df
cc_drop_cat_ohe = one_hot_encode(cc_drop_cat)

# Combine dataframes
cc_drop_features_combined = combine_two_dfs(cc_drop_cat_ohe, cc_drop_num)
acc_drop_features_combined = cc_drop_features_combined.to_numpy()
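
The helpers `split_dataframe_datatypes`, `one_hot_encode`, and `combine_two_dfs` are defined in `../src/` and are not reproduced in this notebook. A minimal plain-pandas sketch of the same idea, on a made-up toy dataframe, could look like this. Using `select_dtypes`, `pd.get_dummies`, and `pd.concat` is an assumption about one reasonable way to do it, not a description of the helpers themselves.

```python
import pandas as pd

# Toy frame with two categorical features, one numerical feature and a target
toy = pd.DataFrame({
    "Gender": ["b", "a", "a", "b"],
    "Citizen": ["g", "g", "s", "g"],
    "Age": [30.83, 58.67, 24.5, 27.83],
    "Approved": ["+", "+", "-", "-"],
})

# Separate the target column from the features
target = toy[["Approved"]]
features = toy.drop(columns="Approved")

# Split the features by datatype
num_df = features.select_dtypes(include="number")
cat_df = features.select_dtypes(exclude="number")

# One-hot encode the categorical part and put the two parts back together
cat_ohe = pd.get_dummies(cat_df)
combined = pd.concat([cat_ohe, num_df], axis=1)

print(combined.columns.tolist())
```

Each categorical column is expanded into one indicator column per distinct value, while the numerical columns pass through unchanged.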

Let's take a quick look at the shapes of the arrays that are now ready to go into a supervised ML model.

In [175]:
# Check size of features df
acc_drop_features_combined.shape

(653, 209)
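
The jump from 15 features to 209 columns comes from one-hot encoding, which creates one column per distinct value of each categorical feature. Note that `ZipCode` is listed as `object` in the `info()` output above, so it is presumably treated as categorical and contributes one column per distinct zip code. The sketch below, on a made-up toy dataframe, shows how the encoded width can be estimated before encoding. It assumes an encoder that keeps every level, as `pd.get_dummies` does by default.

```python
import pandas as pd

# Toy frame: ZipCode is stored as text, so it counts as categorical
toy = pd.DataFrame({
    "Gender": ["b", "a", "a", "b"],
    "ZipCode": ["202", "43", "280", "100"],
    "Age": [30.83, 58.67, 24.5, 27.83],
})

cat_cols = toy.select_dtypes(exclude="number")
num_cols = toy.select_dtypes(include="number")

# Number of distinct levels per categorical column
levels = cat_cols.nunique()
print(levels)

# Expected width: one column per level plus the untouched numerical columns
expected_width = int(levels.sum()) + num_cols.shape[1]
actual_width = pd.get_dummies(toy).shape[1]
print(expected_width, actual_width)
```

A high-cardinality column such as a zip code can dominate the feature matrix this way, so it is worth checking `nunique()` before encoding.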

In [176]:
# Check size of target df
acc_drop_target = cc_drop_target.to_numpy()
acc_drop_target.shape

(653, 1)

Before building an ML model, we first initialize the model list and the dictionaries that will hold the results.

In [177]:
# initialize models list and dicts
models = []
mean_mse = {}
cv_std = {}
res = {}

In [178]:
# Set the number of processors
nproc = 2

In [184]:
# Create a Linear Regression model
lr = LinearRegression()


# Update models list
models.extend([lr])

# Split dataframe into training and testing datasets
#ftrain, ftest, ttrain, ttest = get_train_test_dfs(acc_drop_features_combined, acc_drop_target, 0.3)

In [186]:
# Cross-validate each model in the list
for model in models:
    train_model(model, cc_drop_features_combined, cc_drop_target, nproc, mean_mse, cv_std)
    print_model_summary(model, mean_mse, cv_std)


 Model:
  LinearRegression()
 Average MSE:
 nan
 Standard deviation during CV:
 nan

 Model:
  LinearRegression()
 Average MSE:
 nan
 Standard deviation during CV:
 nan
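
Both the average MSE and its standard deviation come out as `nan`. Since `train_model` is not shown, the cause cannot be confirmed from this notebook, but one plausible explanation is that the target passed in is still the raw `Approved` column, which holds the strings `+` and `-`. A regressor cannot be fitted on string labels, and scikit-learn's cross-validation routines can report failed fits as `nan` scores. The sketch below uses synthetic data (not the credit card dataset) and shows the target being encoded as 0/1 and flattened to one dimension before cross-validation. The choice of `cv=5` and `neg_mean_squared_error` is an assumption, not necessarily what `train_model` does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 3 numerical features and a '+'/'-' target
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
labels = pd.DataFrame({"Approved": np.where(X[:, 0] > 0, "+", "-")})

# Map '+' -> 1 and '-' -> 0, and flatten the (n, 1) column to shape (n,)
y = (labels["Approved"] == "+").astype(int).to_numpy()

# scikit-learn reports negated MSE so that higher is always better
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=5, scoring="neg_mean_squared_error", n_jobs=1,
)
mse_per_fold = -scores

print("Average MSE:", mse_per_fold.mean())
print("Standard deviation during CV:", mse_per_fold.std())
```

With a numeric target the scores are finite. Because approval is a yes/no outcome, a classifier such as `LogisticRegression` scored by accuracy may be a more natural fit than a regression model scored by MSE.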
