# Predicting Credit Card Application Approvals

Banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low credit scores or low income levels and etc for example.
The task I wish to achieve from this notebook is to build an automatic credit card approval predictor using Data Analysis and Machine Learning. For this, I have: 
1) Load and read the data

2) Perform data cleaning- deal with missing values, duplicate values

3) Data Preprocessing- converting non-numeric values to numeric, scaling the dataset values to best fit a Machine Learning algorithm and finally split the dataset into train and test data

4) Exploratory data analysis to build an intuition about model needed

5) Build a Machine Learning model that is able to predict if an individual credit card application is approved or reject

The dataset that I have picked is the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# load dataset
df = pd.read_csv("datasets/cc_approvals.data", header=None)
df.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
5,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+
6,b,33.17,1.04,u,g,r,h,6.5,t,f,0,t,g,164,31285,+
7,a,22.92,11.585,u,g,cc,v,0.04,t,f,0,f,g,80,1349,+
8,b,54.42,0.5,y,p,k,h,3.96,t,f,0,f,g,180,314,+
9,b,42.5,4.915,y,p,w,v,3.165,t,f,0,t,g,52,1442,+


As seen, the column names are anonymized by the contributor since this data is confidential.

<a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html"> This blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <i>Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus.</i> This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.

In [3]:
# dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [4]:
df.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [5]:
# summary statistics
df.describe(include = 'O') 
#notice the '?' there, these need to be removed/replaced

Unnamed: 0,0,1,3,4,5,6,8,9,11,12,13,15
count,690,690,690,690,690,690,690,690,690,690,690,690
unique,3,350,4,4,15,10,2,2,2,3,171,2
top,b,?,u,g,c,v,t,f,f,g,0,-
freq,468,12,519,519,137,399,361,395,374,625,132,383


In [6]:
df.tail(20) 
# notice the '?' there, these need to be removed/replaced

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


In [7]:
df.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

### Data Imputation

In [8]:
# replace the '?'s with NaN
df.replace('?', np.nan, inplace=True)

# inspect the missing values again
df.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


In [9]:
# impute the missing values with mean imputation
df.fillna(df.mean(), inplace=True)

  df.fillna(df.mean(), inplace=True)


In [10]:
# count the number of NaNs in the dataset and print the counts to verify
df.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [11]:
# use backfill method to fill nan values in object columns
for cname in df:
    if df[cname].dtypes == "object":
        df[cname].fillna(method = 'backfill', inplace = True)

In [12]:
# finally check for any duplicate rows
df.duplicated().sum()

0

### Data Preprocessing

In [13]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# extract columns having data type as object (i.e non numeric)
for col in df:
    if df[col].dtypes =='object':
        df[col]=le.fit_transform(df[col])  # use LabelEncoder to transform values into numeric

In [14]:
df.head(20)
# all values converted to numeric

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,156,0.0,1,0,12,7,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,1,0,10,3,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,1,0,10,3,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,1,0,12,7,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,1,0,12,7,1.71,1,0,0,0,2,37,0,0
5,1,168,4.0,1,0,9,7,2.5,1,0,0,1,0,115,0,0
6,1,179,1.04,1,0,11,3,6.5,1,0,0,1,0,54,31285,0
7,0,74,11.585,1,0,2,7,0.04,1,0,0,0,0,23,1349,0
8,1,310,0.5,2,2,8,3,3.96,1,0,0,0,0,62,314,0
9,1,255,4.915,2,2,12,7,3.165,1,0,0,1,0,15,1442,0


In [15]:
df.nunique()

0       2
1     349
2     215
3       3
4       3
5      14
6       9
7     132
8       2
9       2
10     23
11      2
12      3
13    170
14    240
15      2
dtype: int64

In [16]:
# drop the features 11 and 13 because feature 11 corresponds to DriversLicencse and 13 to ZipCode which are both unimportant for us
df = df.drop([11, 13], axis=1)

In [17]:
# view the df to verify
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
0,1,156,0.0,1,0,12,7,1.25,1,1,1,0,0,0
1,0,328,4.46,1,0,10,3,3.04,1,1,6,0,560,0
2,0,89,0.5,1,0,10,3,1.5,1,0,0,0,824,0
3,1,125,1.54,1,0,12,7,3.75,1,1,5,0,3,0
4,1,43,5.625,1,0,12,7,1.71,1,0,0,2,0,0


In [18]:
# segregate features and labels into separate variables
X = df.drop([15], axis =1)
y = df[15]

In [19]:
# perform one hot encoding on columns that have less than 5 unique values so that all values are considered of equal weight
X = pd.get_dummies(X, columns=[3,4,8,9,12]) 

In [20]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [22]:
# data scaling

scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)



### Model Building

This is a classifcation problem and the possible models I would consider for this are- Logistic Regression, Random Forrest Classifer and KNeighbors Classifer. Lets build these models and check accuraries

In [23]:
# Logistic Regression Model
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state = 15)
lr_model.fit(rescaledX_train, y_train)

LogisticRegression(random_state=15)

In [24]:
y_pred = lr_model.predict(rescaledX_test)

In [25]:
# model evaluation
print("Accuracy of logistic regression classifier: ", lr_model.score(rescaledX_test, y_test))

Accuracy of logistic regression classifier:  0.8596491228070176


In [26]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier(random_state = 36)
rfc_model.fit(rescaledX_train, y_train)

RandomForestClassifier(random_state=36)

In [27]:
y_pred = rfc_model.predict(rescaledX_test)

In [28]:
# model evaluation
print("Accuracy of random forest classifier: ", rfc_model.score(rescaledX_test, y_test))

Accuracy of random forest classifier:  0.8947368421052632


In [29]:
#KNeightboursClassifier Model
from sklearn.neighbors import KNeighborsClassifier

kn_model = KNeighborsClassifier()
kn_model.fit(rescaledX_train, y_train)

KNeighborsClassifier()

In [30]:
y_pred = kn_model.predict(rescaledX_test)

In [31]:
# model evaluation
print("Accuracy of KNeighbors classifier: ", kn_model.score(rescaledX_test, y_test))

Accuracy of KNeighbors classifier:  0.9122807017543859


The Random Forest and KNeighbors Classifiers have the best accuracy. Lets hypertune parameters for these and check our best model for Credit Card Approval predictions!

### Hypertuning the model

In [32]:
# hyperparameter tuning for RandomForestClassifier
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
parameters = {'n_estimators': [20, 50, 60, 80, 90, 100, 120, 150, 200], 'max_features': ["auto", "sqrt", "log2"]}
cls = GridSearchCV(estimator = rfc_model, param_grid = parameters)
cls.fit(rescaledX_train, y_train)

# displaying the best params 
cls.best_params_

{'max_features': 'auto', 'n_estimators': 150}

In [33]:
rfc_model2 = RandomForestClassifier(random_state = 36, n_estimators = 150, max_features = 'auto')
rfc_model2.fit(rescaledX_train, y_train)
y_pred = rfc_model2.predict(rescaledX_test)
rfc_model2.score(rescaledX_test, y_test)

0.9035087719298246

In [34]:
# hyperparameter tuning for RandomForestClassifier

n_neighbors = range(1, 21, 2)
parameters = {'n_neighbors': n_neighbors, 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan', 'minkowski']}
cls = GridSearchCV(estimator = kn_model, param_grid = parameters)
cls.fit(rescaledX_train, y_train)

# displaying the best params 
cls.best_params_

{'metric': 'euclidean', 'n_neighbors': 15, 'weights': 'distance'}

In [35]:
kn_model2 = KNeighborsClassifier(n_neighbors=15, weights = 'distance', metric='euclidean')
kn_model2.fit(rescaledX_train, y_train)
y_pred = kn_model2.predict(rescaledX_test)
kn_model2.score(rescaledX_test, y_test)

0.8947368421052632

The best score is using the Random Forest Classifier model with parameters obtained from GridSearchCV(). Hence, our final model for Credit Card Aprovals is now ready!

In [36]:
model = RandomForestClassifier(random_state = 36, n_estimators = 150, max_features = 'auto')
model.fit(rescaledX_train, y_train)
y_pred = model.predict(rescaledX_test)
model.score(rescaledX_test, y_test)

0.9035087719298246

### <b>The Machine Learning Model is able to predict Credit Card Approvals with an enhanced accuracy of 90.36%</b>