# Objective:

* Solve Classification Problem


* Use different algorithms


* Use different evaluation metrics

### Questions to be asked:






* What is the goal in solving the classification problem? 


* What are we attempting to predict? 


* What algorithms are to be used to build our model?


* What evaluation metrics are we using?

### Answers:
* The goal: Predict whether a loan case will be paid off or not, as well as making sure the data is cleaned i.e. fill/clear NULL, clear duplicates, format datatypes, remove unneccessary features.


* We are attempting to predict whether a customer will pay off there loan and vise-versa.


* We will use ML algorithms: **KNN, Decision Tree, SVM, Logistic Regression.**


* We will. use evaluation metrics: **Jaccard Index, F1-Score, Log-Loss, Confusion Matrix**

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv'
df = pd.read_csv(path)

In [4]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,loan_status,Principal,terms,effective_date,due_date,age,education,Gender
0,0,0,PAIDOFF,1000,30,9/8/2016,10/7/2016,45,High School or Below,male
1,2,2,PAIDOFF,1000,30,9/8/2016,10/7/2016,33,Bechalor,female
2,3,3,PAIDOFF,1000,15,9/8/2016,9/22/2016,27,college,male
3,4,4,PAIDOFF,1000,30,9/9/2016,10/8/2016,28,college,female
4,6,6,PAIDOFF,1000,30,9/9/2016,10/8/2016,29,college,male


In [5]:
df['loan_status'].unique()

array(['PAIDOFF', 'COLLECTION'], dtype=object)

In [6]:
df = df.drop(labels = ['Unnamed: 0.1', 'Unnamed: 0'], axis = 1)

# dfTree will a variable used later, while testing the accuracy of your Decision Tree 
dfTree = df

df.dtypes

loan_status       object
Principal          int64
terms              int64
effective_date    object
due_date          object
age                int64
education         object
Gender            object
dtype: object

In [7]:
dfTree.head()
dfTree = dfTree.drop(['effective_date', 'due_date'], axis = 1)

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

x = dfTree.iloc[:, 1:5]
y = dfTree.iloc[:, 0]
print(x.nunique())
print("\n", x['education'].unique())

Principal     5
terms         3
age          32
education     4
dtype: int64

 ['High School or Below' 'Bechalor' 'college' 'Master or Above']


In [30]:
y = y.values
x = x.values

yy = preprocessing.LabelEncoder().fit(["PAIDOFF", "COLLECTION"])
xx = preprocessing.LabelEncoder().fit(["High School or Below", "Bechalor", "college", "Master or Above"])
x[:, 3] = xx.transform(x[:, 3])
y = yy.transform(y)

x = preprocessing.StandardScaler().fit_transform(x)

### Model Creation and Evaluation:

In [57]:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = .2, random_state = 5)

treeNumModel = DecisionTreeClassifier(criterion = 'entropy').fit(train_x, train_y)
trainNumYhat = treeNumModel.predict(train_x)

print("Decision Tree Evaluation:\nTrain Score:% .3f" % accuracy_score(train_y, trainNumYhat))

treeNumYhat = treeNumModel.predict(test_x)
print("\nJaccard-Simialirty:\nTest Score:% .3f" % jaccard_score(test_y, treeNumYhat))
print("\nF1-Score:\nTest Score:% .3f" % f1_score(test_y, treeNumYhat))

Decision Tree Evaluation:
Train Score: 0.841

Jaccard-Simialirty:
Test Score: 0.603

F1-Score:
Test Score: 0.752


In [58]:
from sklearn.svm import SVC

svcModel = SVC(kernel = 'linear').fit(train_x, train_y)
trainYhat = svcModel.predict(train_x)

print("SVM Evaluation:\nTrain Score:% .3f" %(accuracy_score(train_y, trainYhat)))

svcYhat = svcModel.predict(test_x)
print("\nJaccard-Similarity:\nTest Score:% .3f" % jaccard_score(test_y, svcYhat))
print("\nF-Score:\nTest Score:% .3f" % f1_score(test_y, svcYhat))

SVM Evaluation:
Train Score: 0.764

Jaccard-Similarity:
Test Score: 0.700

F-Score:
Test Score: 0.824


In [60]:
from sklearn.neighbors import KNeighborsClassifier as kNN
    
knnModel = kNN(9).fit(train_x, train_y)

trainYhat = knnModel.predict(train_x)
print("KNN Evaluation:\nTrain Score:% .3f" % jaccard_score(train_y, trainYhat))
    
knnYhat = knnModel.predict(test_x)
print("\nJaccard-Similarity:\nTest Score:% .3f" % jaccard_score(test_y, knnYhat))
print("\nF1-Score:\nTest Score:% .3f" % f1_score(test_y, knnYhat))

KNN Evaluation:
Train Score: 0.768

Jaccard-Similarity:
Test Score: 0.701

F1-Score:
Test Score: 0.825


In [65]:
import sklearn.linear_model
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import log_loss

lrModel = LR().fit(train_x, train_y)
trainLrYhat = lrModel.predict(train_x)
print("Logistic Regression Evaluation:\nTrain Score:% .3f" % log_loss(train_y, trainLrYhat))
    
lrYhat = lrModel.predict(test_x)
print("\nLog-Loss:\nTest Score:% .3f" % log_loss(test_y, lrYhat))

Logistic Regression Evaluation:
Train Score: 0.537

Log-Loss:
Test Score: 0.627


In [45]:
from sklearn.neural_network import MLPClassifier as MLP

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = .20, random_state = 30)

neuralModel = MLP(solver = 'adam').fit(train_x, train_y)
trainYhatNeural = neuralModel.predict(train_x)
print("MLP:\nTrain Score: % .3f" % accuracy_score(train_y, trainNumYhat))

neuralYhat = neuralModel.predict(test_x)
print("\nTest Score: % .3f" % f1_score(test_y, neuralYhat))

MLP:
Train Score:  0.634

Test Score:  0.923


