<a href="https://colab.research.google.com/github/vikniksor/DataScience/blob/main/comparing_some_clf_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

### About dataset

This dataset is about past loans. The __Loan_train.csv__ data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount                                                           |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it’s a one-time payoff schedule, each loan has a single due date                |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

Let's download the dataset:

In [None]:
!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv

### Load Data From CSV File  

In [None]:
df = pd.read_csv('loan_train.csv')
df.head()

In [None]:
df.shape

### Convert to date time object 

In [None]:
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

# Data visualization and pre-processing



Let’s see how many of each class are in our data set:

In [None]:
df['loan_status'].value_counts()

260 people have paid off the loan on time, while 86 have gone into collection.
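
As a quick sanity check on these counts, the short sketch below works out the class balance and the accuracy a trivial "always predict PAIDOFF" model would reach. The variable names are illustrative only; the two counts are the ones quoted above.

```python
# Class counts quoted above (from df['loan_status'].value_counts()).
paid_off = 260
collection = 86
total = paid_off + collection

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = paid_off / total
print(f"{total} loans, majority-class baseline accuracy = {baseline_accuracy:.3f}")
# prints: 346 loans, majority-class baseline accuracy = 0.751
```

Any classifier scored on similarly imbalanced data is worth comparing against a baseline like this one.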


Let's plot some columns to understand the data better:

In [None]:
import seaborn as sns

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

# Pre-processing:  Feature selection/extraction

In [None]:
df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()


We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization with a threshold: days with `dayofweek` greater than 3 (Friday through Sunday) are flagged as `weekend`.

In [None]:
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
df.head()

**Convert Categorical features to numerical values**

Lets look at gender:

In [None]:
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)

86% of females pay off their loans, while only 73% of males do.


Let's convert male to 0 and female to 1:


In [None]:
# assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})
df.head()

**One Hot Encoding** 

How about education?

In [None]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)

Features before one-hot encoding:

In [None]:
df[['Principal','terms','age','Gender','education']].head()

Use one-hot encoding to convert categorical variables to binary variables and append them to the feature DataFrame:

In [None]:
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()


Feature selection

Let's define the feature set, X:

In [None]:
X = Feature
X[0:5]

What are our labels?

In [None]:
y = df['loan_status'].values
y[0:5]

**Normalize Data**

Data standardization gives the data zero mean and unit variance (technically, this should be done after the train/test split):

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
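
To illustrate the caveat about the train/test split, here is a minimal pure-Python sketch of the "fit on train, apply to both" order. The values are hypothetical (not taken from `loan_train.csv`), and the names `train_ages`, `test_ages`, `mu` and `sigma` are made up for this example. It mirrors what `StandardScaler` does with `fit` and `transform`, using the population standard deviation.

```python
from statistics import fmean, pstdev

# Hypothetical single-feature values (e.g. ages); not taken from loan_train.csv.
train_ages = [25.0, 30.0, 35.0, 40.0, 45.0]
test_ages = [28.0, 50.0]

# "Fit": learn the mean and population standard deviation from the training data only.
mu = fmean(train_ages)
sigma = pstdev(train_ages)

# "Transform": apply the same training statistics to both splits.
train_scaled = [(a - mu) / sigma for a in train_ages]
test_scaled = [(a - mu) / sigma for a in test_ages]

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

The scaled training values have zero mean and unit variance by construction. The test values generally do not, which is expected: they are measured against the training distribution.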

# Classification 

We will use:
- K Nearest Neighbor (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression




# K Nearest Neighbor (KNN)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
X_knn, y_knn = X, y
X_knn_train, X_knn_valid, y_knn_train, y_knn_valid = train_test_split(X_knn, y_knn, test_size = 0.4, random_state = 6)

In [None]:
scores = {}
for k in range(1, 10):
    # fit on the training split and record the validation accuracy for this k
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_knn_train, y_knn_train)
    scores[k] = knn_clf.score(X_knn_valid, y_knn_valid)
print(scores)
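
The loop above fills `scores` with one validation accuracy per `k`. One possible way to pick a value from such a dictionary is sketched below. The accuracies are made-up placeholders, not results from this notebook, and `demo_scores` is a hypothetical name.

```python
# Hypothetical validation accuracies keyed by k (placeholder values only).
demo_scores = {1: 0.66, 3: 0.71, 5: 0.72, 7: 0.74, 9: 0.74}

# Highest accuracy wins; ties are broken in favour of the smaller k.
best_k = min(demo_scores, key=lambda k: (-demo_scores[k], k))
print(best_k, demo_scores[best_k])
```

Preferring the smaller `k` on ties is just one convention. With a validation set this small, differences of a point or two between neighbouring values of `k` may not be meaningful.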

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors = 5)
knn_clf.fit(X, y)

KNeighborsClassifier()

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt_clf = DecisionTreeClassifier()

In [None]:
dt_clf.fit(X, y)

DecisionTreeClassifier()

# Support Vector Machine

In [None]:
from sklearn.svm import SVC

In [None]:
svc_clf = SVC()

In [None]:
svc_clf.fit(X, y)

SVC()

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression(C = 0.1, class_weight = "balanced")

In [None]:
logreg.fit(X, y)

LogisticRegression(C=0.1, class_weight='balanced')

# Model Evaluation using Test set

In [None]:
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

First, download and load the test set:

In [None]:
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

### Load Test set for evaluation 

In [None]:
test_df = pd.read_csv('loan_test.csv')
test_df.head()

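| Unnamed: 0.2 | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date  | age | education            | Gender |
|--------------|------------|--------------|-------------|-----------|-------|----------------|-----------|-----|----------------------|--------|
| 0            | 1          | 1            | PAIDOFF     | 1000      | 30    | 9/8/2016       | 10/7/2016 | 50  | Bechalor             | female |
| 1            | 5          | 5            | PAIDOFF     | 300       | 7     | 9/9/2016       | 9/15/2016 | 35  | Master or Above      | male   |
| 2            | 21         | 21           | PAIDOFF     | 1000      | 30    | 9/10/2016      | 10/9/2016 | 43  | High School or Below | female |
| 3            | 24         | 24           | PAIDOFF     | 1000      | 30    | 9/10/2016      | 10/9/2016 | 26  | college              | male   |
| 4            | 35         | 35           | PAIDOFF     | 800       | 15    | 9/11/2016      | 9/25/2016 | 29  | Bechalor             | male   |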


In [None]:
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
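# evaluate weekend field
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
# encode gender as in the training set (male -> 0, female -> 1) so the scaler receives numeric input
test_df['Gender'] = test_df['Gender'].map({'male': 0, 'female': 1})
# work out education level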
test_feature = test_df[['Principal','terms','age','Gender','weekend']]
test_feature = pd.concat([test_feature,pd.get_dummies(test_df['education'])], axis=1)
test_feature.drop(['Master or Above'], axis = 1,inplace=True)
test_feature.tail()
# normalize the test data
test_X = preprocessing.StandardScaler().fit(test_feature).transform(test_feature)

In [None]:
y_test = test_df["loan_status"].values

In [None]:
knn_pred = knn_clf.predict(test_X)
#jk1 = jaccard_score(y_test, knn_pred, pos_label = "PAIDOFF")
fs1 = f1_score(y_test, knn_pred, average = "weighted")
dt_pred = dt_clf.predict(test_X)
#jk2 = jaccard_score(y_test, dt_pred, pos_label = "PAIDOFF")
fs2 = f1_score(y_test, dt_pred, average = "weighted")
svm_pred = svc_clf.predict(test_X)
#jk3 = jaccard_score(y_test, svm_pred, pos_label = "PAIDOFF")
fs3 = f1_score(y_test, svm_pred, average = "weighted")
logreg_pred = logreg.predict(test_X)
#jk4 = jaccard_score(y_test, logreg_pred, pos_label = "PAIDOFF")
fs4 = f1_score(y_test, logreg_pred, average = "weighted")

In [None]:
logreg_proba = logreg.predict_proba(test_X)
ll4 = log_loss(y_test, logreg_proba)

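# Jaccard index for the "PAIDOFF" class, in the same model order as the report index
list_jk = [jaccard_score(y_test, pred, pos_label="PAIDOFF")
           for pred in (knn_pred, dt_pred, svm_pred, logreg_pred)]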
list_fs = [fs1, fs2, fs3, fs4]
list_ll = ['NA', 'NA', 'NA', ll4]


report_df = pd.DataFrame(list_jk, index=['KNN','Decision Tree','SVM','Logistic Regression'])
report_df.columns = ['Jaccard']
report_df.insert(loc=1, column='F1-score', value=list_fs)
report_df.insert(loc=2, column='LogLoss', value=list_ll)
report_df.columns.name = 'Algorithm'
report_df
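
To make the report columns easier to interpret, the sketch below computes the three metrics by hand on a toy example. The labels and probabilities are invented for illustration and are not model output from this notebook. It computes the F1-score for the `PAIDOFF` class only, whereas the cells above use `average="weighted"`, which averages the per-class F1-scores weighted by class support.

```python
from math import log

# Toy labels and predicted P(PAIDOFF), invented for illustration.
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION", "PAIDOFF"]
y_pred = ["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF", "COLLECTION", "PAIDOFF"]
p_paidoff = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]

pos = "PAIDOFF"
pairs = list(zip(y_true, y_pred))
tp = sum(t == pos and p == pos for t, p in pairs)  # true positives
fp = sum(t != pos and p == pos for t, p in pairs)  # false positives
fn = sum(t == pos and p != pos for t, p in pairs)  # false negatives

# Jaccard index: overlap between predicted and true positives.
jaccard = tp / (tp + fp + fn)
# F1-score: harmonic mean of precision and recall for the positive class.
f1 = 2 * tp / (2 * tp + fp + fn)
# Log loss: mean negative log of the probability assigned to the true class.
logloss = -sum(log(p if t == pos else 1 - p)
               for t, p in zip(y_true, p_paidoff)) / len(y_true)

print(tp, fp, fn)   # 3 1 1
print(jaccard, f1)  # 0.6 0.75
print(logloss)
```

Higher is better for Jaccard and F1, while lower is better for log loss. Log loss needs predicted probabilities, which is why the report fills it in only for logistic regression.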