# Loan Classifier

In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.

You load a historical dataset from previous loan applications, clean the data, and apply different classification algorithms to it. You are expected to use the following algorithms to build your models:

- k-Nearest Neighbour
- Decision Tree
- Support Vector Machine
- Logistic Regression

The results are reported as the accuracy of each classifier, using the following metrics where applicable:

- Jaccard index
- F1-score
- LogLoss

## Import Necessary Libraries

In [21]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

## About Dataset

This dataset is about past loans. The __Loan_train.csv__ data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount at the                                                    |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it’s a one-time payoff schedule, each loan has a single due date                |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

### Let's download the dataset

In [22]:
!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv

--2020-03-10 23:50:04--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’


2020-03-10 23:50:06 (90.9 KB/s) - ‘loan_train.csv’ saved [23101/23101]



In [23]:
df = pd.read_csv('loan_train.csv')
print(df.shape)
df.head()


(346, 10)


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,loan_status,Principal,terms,effective_date,due_date,age,education,Gender
0,0,0,PAIDOFF,1000,30,9/8/2016,10/7/2016,45,High School or Below,male
1,2,2,PAIDOFF,1000,30,9/8/2016,10/7/2016,33,Bechalor,female
2,3,3,PAIDOFF,1000,15,9/8/2016,9/22/2016,27,college,male
3,4,4,PAIDOFF,1000,30,9/9/2016,10/8/2016,28,college,female
4,6,6,PAIDOFF,1000,30,9/9/2016,10/8/2016,29,college,male


# Data visualization and pre-processing

## Convert Categorical features to numerical values

### Look at gender:


In [24]:
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)

Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64

__86% of females pay off their loans, while only 73% of males do.__

Let's convert male to 0 and female to 1:
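The same per-group rates can be laid out as a small table. The following is a minimal sketch, assuming a made-up frame (`toy` and its six rows are hypothetical, not taken from loan_train.csv); it uses `pd.crosstab` with `normalize="index"` so that each row sums to 1.

```python
import pandas as pd

# Hypothetical toy data, not rows from loan_train.csv
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female"],
    "loan_status": ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION",
                    "PAIDOFF", "COLLECTION"],
})

# Row-normalised contingency table: share of each loan_status within each gender
rates = pd.crosstab(toy["Gender"], toy["loan_status"], normalize="index")
print(rates)
```

Each row of `rates` is one gender and each column one loan status, which makes the two groups easy to compare side by side.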

In [25]:
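# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions may not propagate to df
df['Gender'] = df['Gender'].replace(to_replace=['male','female'], value=[0,1])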
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,loan_status,Principal,terms,effective_date,due_date,age,education,Gender
0,0,0,PAIDOFF,1000,30,9/8/2016,10/7/2016,45,High School or Below,0
1,2,2,PAIDOFF,1000,30,9/8/2016,10/7/2016,33,Bechalor,1
2,3,3,PAIDOFF,1000,15,9/8/2016,9/22/2016,27,college,0
3,4,4,PAIDOFF,1000,30,9/9/2016,10/8/2016,28,college,1
4,6,6,PAIDOFF,1000,30,9/9/2016,10/8/2016,29,college,0


### Look at Education:

In [26]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)

education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64

Because __Master or Above is split 50/50__, it tells our model little about the outcome, whereas the other education levels do lean one way, so we omit Master or Above from our training features.

We also omit __Unnamed: 0__ and __Unnamed: 0.1__, as they don't carry any meaning.

### So our feature dataset will be

In [27]:
df = df[['Principal','terms','age','Gender','education', 'loan_status']]

In [28]:
df.head()

Unnamed: 0,Principal,terms,age,Gender,education,loan_status
0,1000,30,45,0,High School or Below,PAIDOFF
1,1000,30,33,1,Bechalor,PAIDOFF
2,1000,15,27,0,college,PAIDOFF
3,1000,30,28,1,college,PAIDOFF
4,1000,30,29,0,college,PAIDOFF


### Now we convert categorical variables to binary variables and append them to the feature DataFrame

In [29]:
Feature = df[['Principal','terms','age','Gender']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()

Unnamed: 0,Principal,terms,age,Gender,Bechalor,High School or Below,college
0,1000,30,45,0,0,1,0
1,1000,30,33,1,1,0,0
2,1000,15,27,0,0,0,1
3,1000,30,28,1,0,0,1
4,1000,30,29,0,0,0,1
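To make the one-hot step concrete, here is a minimal sketch on a made-up four-row column (the `education` Series below is hypothetical; only the category spellings are copied from the dataset). It assumes that 0/1 columns are wanted, so it passes `dtype=int`; recent pandas versions otherwise return True/False.

```python
import pandas as pd

# Hypothetical toy column using the same category spellings as the dataset
education = pd.Series(
    ["High School or Below", "Bechalor", "college", "Master or Above"],
    name="education",
)

# dtype=int forces 0/1 columns; recent pandas versions default to True/False
dummies = pd.get_dummies(education, dtype=int)

# Drop the category we decided not to model, as in the cell above
dummies = dummies.drop(columns=["Master or Above"])
print(dummies)
```

Note that a row whose education is Master or Above ends up with zeros in all three remaining columns, so those rows are still kept in the feature set; only the indicator column is removed.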


# Feature selection


In [30]:
# Let's define the feature set, X:
X = Feature
X[0:5]

Unnamed: 0,Principal,terms,age,Gender,Bechalor,High School or Below,college
0,1000,30,45,0,0,1,0
1,1000,30,33,1,1,0,0
2,1000,15,27,0,0,0,1
3,1000,30,28,1,0,0,1
4,1000,30,29,0,0,0,1


In [31]:
# What are our labels?
y = df['loan_status'].values
y[:5]

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)

__Now our dataset is ready, so we can move on to training our models.__

# Train Test Split

In [96]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=40)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (294, 7) (294,)
Test set: (52, 7) (52,)
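The 294/52 split follows directly from the row count and `test_size`. A short worked sketch, assuming (as far as I know) that scikit-learn rounds a fractional test size up to the next whole sample and gives the remainder to the training set:

```python
import math

n_samples = 346      # rows in loan_train.csv, per df.shape above
test_size = 0.15     # fraction passed to train_test_split

# For a fractional test_size, the test count is rounded up (assumption stated above)
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 294 52
```

With only 52 test rows, each misclassified loan moves the reported accuracy by almost two percentage points, which is worth keeping in mind when comparing classifiers below.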


# Classifications

Now use the training set to build an accurate model, then use the test set to report the accuracy of the model.
We use the following algorithms:

- __K Nearest Neighbor (KNN)__
- __Decision Tree__
- __Support Vector Machine__
- __Logistic Regression__


# KNN

### Find the best K

In [97]:
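# Classifier implementing the k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23,
# so this import assumes an older release
from sklearn.metrics import jaccard_similarity_score, log_loss, classification_report, confusion_matrix,jaccard_score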


In [98]:
Ks = 100
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train a model with k = n and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n - 1] = metrics.accuracy_score(y_test, yhat)
    # Standard error of the accuracy estimate for this k
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])


In [99]:
print("The best accuracy on test data:", mean_acc.max(), "with k=", mean_acc.argmax()+1)

The best accuracy on test data: 0.8653846153846154 with k= 19
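The loop above also stores a standard error for each k. As a worked sketch of what that number is for the best k, assuming 45 correct predictions out of 52 (the count implied by 0.8653846 on a 52-row test set):

```python
import math

n_test = 52          # size of the test set
n_correct = 45       # 45/52 = 0.8653846..., the best accuracy reported above

p = n_correct / n_test
# The standard deviation of a 0/1 "correct?" vector with mean p is sqrt(p * (1 - p));
# dividing by sqrt(n) gives the standard error of the accuracy estimate
std_err = math.sqrt(p * (1 - p)) / math.sqrt(n_test)

print(round(p, 4), round(std_err, 4))
```

A standard error of a few percentage points means that small accuracy differences between neighbouring values of k are probably within noise on a test set this small.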


In [101]:
# Jaccard Similarity
neigh = KNeighborsClassifier(n_neighbors = 19).fit(X_train,y_train)

yhat_knn=neigh.predict(X_test)
jac = jaccard_similarity_score(y_test, yhat_knn)
jac



0.8653846153846154
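The value above is identical to the accuracy found for k = 19. That is expected: for binary or multiclass labels, the old `jaccard_similarity_score` returned plain accuracy, which is one reason it was later replaced by `jaccard_score`. The sketch below contrasts the two quantities on made-up labels (the helper names `accuracy` and `jaccard_index` and the toy labels are hypothetical, written only for illustration).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def jaccard_index(y_true, y_pred, positive):
    """Per-class Jaccard index: TP / (TP + FP + FN) for the class `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp + fn)


# Hypothetical toy labels, not taken from the test set
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION"]
y_pred = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION"]

acc = accuracy(y_true, y_pred)
jac_paidoff = jaccard_index(y_true, y_pred, "PAIDOFF")
jac_collection = jaccard_index(y_true, y_pred, "COLLECTION")
print(acc, jac_paidoff, jac_collection)  # 0.8 0.75 0.5
```

The per-class index ignores true negatives, so it is lower than accuracy whenever the classifier makes mistakes, and it differs between the two classes.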

In [102]:
print(classification_report(y_test, yhat_knn))

              precision    recall  f1-score   support

  COLLECTION       1.00      0.22      0.36         9
     PAIDOFF       0.86      1.00      0.92        43

    accuracy                           0.87        52
   macro avg       0.93      0.61      0.64        52
weighted avg       0.88      0.87      0.83        52
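The F1 figures in this report can be reproduced by hand. The sketch below assumes the confusion counts implied by the report: with 9 COLLECTION cases, recall 0.22 means 2 were caught and 7 missed, and precision 1.00 means no PAIDOFF case was labelled COLLECTION.

```python
# Counts implied by the report above (assumption: 2 of 9 COLLECTION cases caught)
tp_c, fp_c, fn_c = 2, 0, 7      # COLLECTION as the positive class
tp_p, fp_p, fn_p = 43, 7, 0     # PAIDOFF as the positive class


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


f1_c = f1(tp_c, fp_c, fn_c)
f1_p = f1(tp_p, fp_p, fn_p)
macro_f1 = (f1_c + f1_p) / 2                 # unweighted mean over classes
weighted_f1 = (9 * f1_c + 43 * f1_p) / 52    # mean weighted by class support

print(round(f1_c, 2), round(f1_p, 2), round(macro_f1, 2), round(weighted_f1, 2))
# 0.36 0.92 0.64 0.83
```

The gap between the macro average (0.64) and the weighted average (0.83) shows how much the majority PAIDOFF class dominates the headline numbers.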



# Logistic Regression

In [113]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.02, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [114]:
yhat_lr = LR.predict(X_test)
yhat_lr

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

In [115]:
# Jaccard Similarity
jac = jaccard_similarity_score(y_test, yhat_lr)
jac



0.8269230769230769

In [116]:
# F1 Score
print(classification_report(y_test, yhat_lr))

              precision    recall  f1-score   support

  COLLECTION       0.00      0.00      0.00         9
     PAIDOFF       0.83      1.00      0.91        43

    accuracy                           0.83        52
   macro avg       0.41      0.50      0.45        52
weighted avg       0.68      0.83      0.75        52
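Since the logistic regression model predicts PAIDOFF for every test case, its scores are exactly those of a majority-class baseline. A short worked sketch using only the class counts from the support column:

```python
# Test-set class counts, read from the "support" column above
n_collection, n_paidoff = 9, 43
n_total = n_collection + n_paidoff

# Always answering PAIDOFF gets every PAIDOFF case right and every COLLECTION case wrong
accuracy = n_paidoff / n_total

# PAIDOFF as the positive class: precision = 43/52, recall = 43/43
precision = n_paidoff / n_total
recall = 1.0
f1_paidoff = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(f1_paidoff, 2))  # 0.8269 0.91
```

For COLLECTION there are no positive predictions at all, so precision is 0/0; the report shows it as 0.00, and the F1-score for that class is 0.00 as well.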





In [117]:
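# Log Loss (computed from predicted class probabilities, not from hard labels)
yhat_prob = LR.predict_proba(X_test)
log_loss(y_test, yhat_prob)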

0.4708351326237112
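Unlike accuracy or F1, log loss scores the predicted probabilities themselves. Here is a minimal sketch of the binary form, assuming 0/1 labels and a made-up list of probabilities (`binary_log_loss`, `y_true`, and `p_pos` are hypothetical names and values, not the notebook's data):

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood; y_true holds 0/1, p_positive holds P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities of the positive class
y_true = [1, 1, 0, 1]
p_pos = [0.9, 0.8, 0.3, 0.6]
loss = binary_log_loss(y_true, p_pos)
print(round(loss, 3))

# A confident wrong prediction is penalised far more than a hesitant one
confident_wrong = binary_log_loss([1], [0.01])
hesitant_wrong = binary_log_loss([1], [0.4])
```

This is why a model that predicts the same label for every case can still obtain a moderate log loss: what matters is how much probability it assigns to the true class of each case.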