# Best Classifier for Loan Status Prediction

In this notebook, we use several classification algorithms to predict the loan status of customers, then compare them with standard evaluation metrics to find the best one for this dataset.

## Preprocessing

First, we import the packages that we will use later. Then, we load the dataset into a pandas dataframe.

In [None]:
import itertools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

In [None]:
data = pd.read_csv('/kaggle/input/loandata/Loan payments data.csv')

We look at the first 5 rows of the dataframe to see what it looks like. Then, we display the data type of each column.

In [None]:
data.head()

In [None]:
data.dtypes

We convert **effective_date** and **due_date** to datetime objects. Then, we drop **paid_off_time**, **past_due_days**, and **Loan_ID** since we will not need them for the prediction.

In [None]:
data['effective_date'] = pd.to_datetime(data['effective_date'])
data['due_date'] = pd.to_datetime(data['due_date'])
# Pass the column names by keyword; a positional axis argument is rejected by pandas 2.0+
data = data.drop(columns=['paid_off_time', 'past_due_days', 'Loan_ID'])
data.head()

We check whether there are missing values in our data.

In [None]:
data.isnull().sum()

We check how many samples of each class are in our dataset.

In [None]:
data['loan_status'].value_counts()

We plot some columns to better understand the data.

In [None]:
bins = np.linspace(data.Principal.min(), data.Principal.max(), 10)
g = sns.FacetGrid(data, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
bins = np.linspace(data.age.min(), data.age.max(), 10)
g = sns.FacetGrid(data, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
data['day'] = data['effective_date'].dt.dayofweek
bins = np.linspace(data.day.min(), data.day.max(), 10)
g = sns.FacetGrid(data, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'day', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

From the graph above, customers who get their loan toward the end of the week appear less likely to pay it off. Based on this, we create a new binary **weekend** column: **1** if the loan took effect on day 4 or later (Friday to Sunday, since `dayofweek` counts Monday as 0), and **0** otherwise.

In [None]:
data['weekend'] = data['day'].apply(lambda x: 1 if (x>3)  else 0)
data.head()
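The threshold above depends on how weekdays are numbered, so here is a small standalone check. It is only a sketch: it uses the standard library's `date.weekday()`, which follows the same Monday = 0 convention as pandas' `dt.dayofweek`, and the dates and the helper name `is_weekend` are illustrative rather than taken from the dataset.

```python
from datetime import date

def is_weekend(d: date) -> int:
    """Return 1 for Friday through Sunday (weekday index > 3), else 0."""
    return 1 if d.weekday() > 3 else 0

# Illustrative dates only: 2016-09-08 is a Thursday, 2016-09-09 a Friday.
samples = [date(2016, 9, 8), date(2016, 9, 9), date(2016, 9, 11), date(2016, 9, 12)]
flags = [(d.isoformat(), d.weekday(), is_weekend(d)) for d in samples]
for row in flags:
    print(row)
```

Note that under this rule Friday counts as part of the "weekend", which is broader than the usual Saturday/Sunday meaning.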

Since **Gender** is a categorical feature, we convert its values to numbers: **male** to **0** and **female** to **1**.

In [None]:
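# Assign the result back; calling replace(inplace=True) on a selected column may not update the dataframe
data['Gender'] = data['Gender'].replace(to_replace=['male','female'], value=[0,1])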

Now, we inspect **education**.

In [None]:
data.groupby(['education'])['loan_status'].value_counts()

We create a new dataframe containing the features that we will need for the prediction.

In [None]:
features = data[['Principal','terms','age','Gender','weekend','education']]
features.head()

We use one-hot encoding to convert the **education** categories into binary indicator columns and append them to the **features** dataframe. Then, we drop the original **education** column and the **Master or Above** indicator column.

In [None]:
features = pd.concat([features,pd.get_dummies(data['education'])], axis=1)
features = features.drop(['education'], axis=1).drop(['Master or Above'], axis = 1)
features.head()
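To make the encoding step concrete, here is a minimal sketch on a toy dataframe. The category values are illustrative assumptions (only `Master or Above` is named in this notebook), and `dtype=int` is added here only so the output shows 0/1 instead of True/False.

```python
import pandas as pd

# Toy values for illustration; only 'Master or Above' is named in the notebook.
toy = pd.DataFrame({'education': ['college', 'Master or Above', 'High School', 'college']})

# One indicator column per category (dtype=int gives 0/1 instead of True/False).
dummies = pd.get_dummies(toy['education'], dtype=int)

# Append the indicators, then drop the original column and one indicator.
encoded = pd.concat([toy, dummies], axis=1)
encoded = encoded.drop(columns=['education', 'Master or Above'])
print(encoded)
```

After the drop, a row whose category was `Master or Above` is represented by zeros in all remaining indicator columns.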

Now, we define the feature set as the predictors (X) and **loan_status** as our criterion variable (y).

In [None]:
X = features
X.head()

In [None]:
y = data['loan_status'].values
y[0:5]

Then, we standardize the features so that each one has zero mean and unit variance.

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
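As a rough illustration of what the scaler computes, the sketch below applies the same column-wise z-score with plain NumPy. The matrix `X_toy` and its values are made up for the example.

```python
import numpy as np

# Toy feature matrix (rows = samples, columns = features); values are illustrative.
X_toy = np.array([[1000.0, 30.0],
                  [800.0, 15.0],
                  [1000.0, 7.0],
                  [300.0, 30.0]])

# Column-wise z-score: subtract the mean and divide by the population
# standard deviation (ddof=0), which matches StandardScaler's defaults.
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)
X_scaled = (X_toy - mu) / sigma

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```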

## Classification

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

### K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Find the best K by test-set accuracy
Ks = 11
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train the model and predict
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    # Standard error of the accuracy estimate
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

mean_acc
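The `std_acc` values computed in the loop are the standard error of the accuracy: the standard deviation of the 0/1 "prediction was correct" indicator divided by the square root of the number of test samples. For a 0/1 indicator with mean p this equals sqrt(p(1 - p) / n). The sketch below checks that identity in plain Python on a made-up correctness vector.

```python
import math

# Hypothetical per-sample correctness (1 = prediction matched the label).
correct = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
n = len(correct)
p = sum(correct) / n  # accuracy

# Population standard deviation of the 0/1 indicator, divided by sqrt(n).
std = math.sqrt(sum((c - p) ** 2 for c in correct) / n)
std_err = std / math.sqrt(n)

# Closed form for a 0/1 variable.
closed_form = math.sqrt(p * (1 - p) / n)
print(p, std_err, closed_form)
```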

In [None]:
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
print( "The best accuracy was with", mean_acc.max(), "with k =", mean_acc.argmax()+1)

In [None]:
k = 8

knn = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
yhat = knn.predict(X_test)

In [None]:
print("Train Set Accuracy: ", metrics.accuracy_score(y_train, knn.predict(X_train)))
print("Test Set Accuracy: ", metrics.accuracy_score(y_test, yhat))

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
tree.fit(X_train,y_train)
tree

In [None]:
yhat = tree.predict(X_test)
yhat

In [None]:
print("Train Set Accuracy: ", metrics.accuracy_score(y_train, tree.predict(X_train)))
print("Test Set Accuracy: ", metrics.accuracy_score(y_test, yhat))

### Support Vector Machine (SVM)

In [None]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)

In [None]:
yhat = clf.predict(X_test)
yhat

In [None]:
print("Train Set Accuracy: ", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test Set Accuracy: ", metrics.accuracy_score(y_test, yhat))

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [None]:
yhat = LR.predict(X_test)
yhat

In [None]:
print("Train Set Accuracy: ", metrics.accuracy_score(y_train, LR.predict(X_train)))
print("Test Set Accuracy: ", metrics.accuracy_score(y_test, yhat))

## Model Evaluation

In [None]:
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

In [None]:
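# Note: this is the same file used for training above, so the scores below are not measured on unseen data
test_df = pd.read_csv('/kaggle/input/loandata/Loan payments data.csv')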
test_df.head()

In [None]:
# Preprocessing

test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df['day'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['day'].apply(lambda x: 1 if (x>3)  else 0)
test_df['Gender'] = test_df['Gender'].replace(to_replace=['male','female'], value=[0,1])
test_feature = test_df[['Principal','terms','age','Gender','weekend']]
test_feature = pd.concat([test_feature,pd.get_dummies(test_df['education'])], axis=1)
test_feature.drop(['Master or Above'], axis = 1,inplace=True)
test_X = preprocessing.StandardScaler().fit(test_feature).transform(test_feature)
test_y = test_df['loan_status'].values

### Jaccard Index

In [None]:
# KNN
knn_yhat = knn.predict(test_X)
ji1 = round(jaccard_score(test_y, knn_yhat, average='weighted'),2)
# Decision Tree
dt_yhat = tree.predict(test_X)
ji2 = round(jaccard_score(test_y, dt_yhat, average='weighted'),2)
# SVM
svm_yhat = clf.predict(test_X)
ji3 = round(jaccard_score(test_y, svm_yhat, average='weighted'),2)
# Logistic Regression
lr_yhat = LR.predict(test_X)
ji4 = round(jaccard_score(test_y, lr_yhat, average='weighted'),2)

list_ji = [ji1, ji2, ji3, ji4]
list_ji
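For reference, the weighted Jaccard index can be reproduced by hand: compute TP / (TP + FP + FN) for each class, then average the per-class scores weighted by how often each class occurs in the true labels. The following is a minimal pure-Python sketch of that definition; the function name `weighted_jaccard` and the toy labels are illustrative assumptions, not values from the dataset.

```python
from collections import Counter

def weighted_jaccard(y_true, y_pred):
    """Support-weighted mean of per-class Jaccard scores TP / (TP + FP + FN)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        total += count * tp / (tp + fp + fn)
    return total / len(y_true)

# Toy labels for illustration only.
y_true_toy = ['paid', 'paid', 'paid', 'late']
y_pred_toy = ['paid', 'paid', 'late', 'late']
score = weighted_jaccard(y_true_toy, y_pred_toy)
print(round(score, 3))  # 0.625
```

Here the `paid` class scores 2/3 and the `late` class scores 1/2; weighting by supports 3 and 1 gives 0.625.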

### F1-score

In [None]:
# KNN
fs1 = round(f1_score(test_y, knn_yhat, average='weighted'),2)
# Decision Tree
fs2 = round(f1_score(test_y, dt_yhat, average='weighted'),2)
# SVM
fs3 = round(f1_score(test_y, svm_yhat, average='weighted'),2)
# Logistic Regression
fs4 = round(f1_score(test_y, lr_yhat, average='weighted'),2)

list_fs = [fs1, fs2, fs3, fs4]
list_fs
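The weighted F1-score follows the same pattern: a per-class F1 of 2·TP / (2·TP + FP + FN), averaged with weights equal to each class's count in the true labels. The sketch below spells this out in pure Python; `weighted_f1` and the toy labels are hypothetical names and values used only for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores 2TP / (2TP + FP + FN)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        total += count * 2 * tp / (2 * tp + fp + fn)
    return total / len(y_true)

# Toy labels for illustration only.
y_true_toy = ['paid', 'paid', 'paid', 'late']
y_pred_toy = ['paid', 'paid', 'late', 'late']
f1_toy = weighted_f1(y_true_toy, y_pred_toy)
print(round(f1_toy, 3))
```

On the same predictions, F1 is never lower than the Jaccard index for a class, so the two columns in the report should be read on their own scales rather than compared to each other.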

## Accuracy Report

In [None]:
accuracy = pd.DataFrame(list_ji, index=['KNN','Decision Tree','SVM','Logistic Regression'])
accuracy.columns = ['Jaccard']
accuracy.insert(loc=1, column='F1-score', value=list_fs)
accuracy.columns.name = 'Algorithm'
accuracy

Based on their Jaccard index and F1-score, the best classification algorithm for this dataset is **K-Nearest Neighbors (KNN)**.