## About This Notebook
In this notebook we practice several classification algorithms.

We load a dataset with the Pandas library, apply each algorithm to it, and use accuracy-based evaluation to find the one that works best for this specific dataset.

### Required Libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

This dataset is about past loans. The **loan.csv** data set includes details of 346 customers whose loans have already been paid off or defaulted. It includes the following fields:

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount at the                                                    |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it is a one-time payoff schedule, each loan has a single due date               |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

## Load DataFrame from CSV

In [None]:
df = pd.read_csv('../input/loan-dataset/loan.csv')
df.head()

In [None]:
# Drop the two unnamed columns (most likely saved row indices) in one call
df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

In [None]:
df.head()

### Shape of Data

In [None]:
df.shape

In [None]:
df.info()

## Convert to datetime objects
The date columns are read as plain strings, so convert them to datetime objects.


In [None]:
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()
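
As a small illustration of the conversion, the sketch below parses a few date strings with `pd.to_datetime`. The sample values and the `loan_days` column are made up for illustration and are not taken from loan.csv.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from loan.csv.
sample = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/9/2016", "9/11/2016"],
    "due_date": ["10/7/2016", "10/8/2016", "10/10/2016"],
})

# Before conversion both columns hold plain strings.
for col in ["effective_date", "due_date"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

print(sample.dtypes)

# Datetime columns support date arithmetic, e.g. the loan length in days.
sample["loan_days"] = (sample["due_date"] - sample["effective_date"]).dt.days
print(sample["loan_days"].tolist())  # [29, 29, 29]
```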

## Data visualization and pre-processing

Let's see how many loans of each class are in our data set:

In [None]:
df['loan_status'].value_counts()

260 people have paid off the loan on time, while 86 have gone into collection.
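
These two counts also give a reference point for any accuracy figure: a trivial model that always predicts the majority class is already right for every paid-off loan. The short calculation below uses only the counts quoted above; note that it refers to the full data set, so the share in a particular train/test split may differ slightly.

```python
# Class counts reported above.
counts = {"PAIDOFF": 260, "COLLECTION": 86}
total = sum(counts.values())

# Share of each class in the data set.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                               # 346
print(majority_label)                      # PAIDOFF
print(round(baseline_accuracy * 100, 2))   # 75.14
```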

Let's plot some columns to understand the data better:

In [None]:
df.columns

In [None]:
import seaborn as sns

bins = np.linspace(df.Principal.min(), df.Principal.max(), 7)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

## Pre-processing: Feature selection/extraction

### Let's look at the day of the week people get the loan

In [None]:
df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

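We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization with a threshold at day 4: days 4 and later are set to 1, and days before 4 are set to 0. (pandas numbers the days from Monday = 0, so day 4 is Friday.)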

In [None]:
df.columns

In [None]:
df['week'] = (df['dayofweek'] > 3).astype(int)  # 1 = Friday to Sunday, 0 = Monday to Thursday
df.head()
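
To make the threshold concrete, here is a minimal sketch on four made-up dates (not rows from loan.csv). pandas numbers the days Monday = 0 through Sunday = 6, so `dayofweek > 3` flags Friday, Saturday and Sunday.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Friday, a Sunday and a Monday.
dates = pd.to_datetime(
    pd.Series(["2016-09-08", "2016-09-09", "2016-09-11", "2016-09-12"])
)

demo = pd.DataFrame({"effective_date": dates})
demo["dayofweek"] = demo["effective_date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Binarize: 1 for Friday to Sunday, 0 for Monday to Thursday.
demo["week"] = (demo["dayofweek"] > 3).astype(int)

print(demo["dayofweek"].tolist())  # [3, 4, 6, 0]
print(demo["week"].tolist())       # [0, 1, 1, 0]
```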

## Convert Categorical features to numerical values

In [None]:
df.info()

Let's look at the Gender column.


In [None]:
collection = df.groupby(['Gender'])['loan_status'].value_counts()
collection

86% of females pay off their loans, while only 73% of males do.
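
Percentages like these can be read off directly by passing `normalize=True` to `value_counts`. The sketch below uses a tiny made-up frame rather than the loan data, so its numbers differ from the ones quoted above.

```python
import pandas as pd

# Hypothetical toy data, only to show the mechanics.
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female"],
    "loan_status": ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION",
                    "PAIDOFF", "COLLECTION"],
})

# normalize=True turns the per-group counts into fractions
# that sum to 1 within each gender.
rates = toy.groupby("Gender")["loan_status"].value_counts(normalize=True)

paid_male = rates.loc[("male", "PAIDOFF")]
paid_female = rates.loc[("female", "PAIDOFF")]
print(f"{paid_male:.0%} {paid_female:.0%}")  # 75% 50%
```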

### Convert categorical values to numerical values
Let's convert male to 0 and female to 1:

In [None]:
# Assign the result back instead of calling replace(inplace=True) on a column selection.
# map() assumes the only values are 'male' and 'female'; anything else becomes NaN.
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})
df.head()

## One Hot Encoding

Let's look at the education column:

In [None]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)*100

## Features before One Hot Encoding

In [None]:
df[['Principal','terms','age','Gender','education']].head()

In [None]:
features = df[['Principal','terms','age','Gender','week']]
features = pd.concat([features, pd.get_dummies(df['education'])],axis=1)
features.drop(['Master or Above'], axis=1,inplace=True)
features.head()
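
For readers new to one-hot encoding, the sketch below shows what `pd.get_dummies` produces on a few made-up education values (the real levels come from loan.csv). Dropping one indicator column, as the cell above does with 'Master or Above', loses no information, because that level is implied whenever all remaining indicators are 0.

```python
import pandas as pd

# Hypothetical education values, for illustration only.
education = pd.Series(
    ["college", "High School or Below", "Master or Above", "college"],
    name="education",
)

# One indicator column per level.
dummies = pd.get_dummies(education).astype(int)
print(sorted(dummies.columns))

# Drop one level; a row of all zeros now stands for 'Master or Above'.
reduced = dummies.drop(columns=["Master or Above"])
print(reduced.sum(axis=1).tolist())  # [1, 1, 0, 1]
```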

## Feature selection

In [None]:
X = features
X[0:5]

In [None]:
y = df['loan_status'].values
y[0:5]

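## Normalize Data
Data standardization gives the data zero mean and unit variance. Strictly speaking, the scaler should be fit on the training set only, after the train/test split, so that no information from the test set leaks into the preprocessing.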

In [None]:
X = preprocessing.StandardScaler().fit_transform(X)
X[0:5]
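
The caveat in the note above can be addressed by splitting first and fitting the scaler on the training rows only. Here is a minimal sketch on random stand-in data; `X_raw` and `y_raw` are placeholders, not the loan features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # placeholder features
y_raw = rng.integers(0, 2, size=200)                     # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=0
)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training set only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test set reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_tr_scaled.std(axis=0).round(6))   # approximately 1 for every column
```

The test set will generally not have exactly zero mean and unit variance after this transformation, which is expected: it is scaled with the training statistics.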

## Classification
The following classification algorithms will be used here:
1. K Nearest Neighbor (KNN)
2. Decision Tree
3. Support Vector Machine (SVM)
4. Random Forest

## K Nearest Neighbor (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
print("Training accuracy:", round(knn.score(X_train, y_train)*100, 2), "%")
print("Test accuracy:", round(knn.score(X_test, y_test)*100, 2), "%")
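
`n_neighbors=5` is only one possible choice. A common way to pick k is to compare cross-validated accuracy over a small range of values. The sketch below does this on synthetic data from `make_classification`, which stands in for the loan features; the range 1 to 15 and the 5 folds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the loan data set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 16))
mean_scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this k.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
print("best k:", best_k, "mean accuracy:", round(max(mean_scores), 3))
```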

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(random_state = 25)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print("Training accuracy:", round(tree.score(X_train, y_train)*100, 2), "%")

In [None]:
from sklearn import metrics
print("Train set accuracy:", metrics.accuracy_score(y_train, tree.predict(X_train)))
print("Test set accuracy:", metrics.accuracy_score(y_test, tree.predict(X_test)))
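
An unconstrained decision tree usually fits the training set almost perfectly, which is why the training score alone says little about how well it generalizes. The sketch below compares an unconstrained tree with a depth-limited one on synthetic data; `max_depth=3` is an arbitrary value chosen for illustration, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y), not the loan data set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for name, depth in [("unconstrained", None), ("max_depth=3", 3)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=25).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each variant.
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```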

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
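
The precision and recall values in the report are derived from the counts in a confusion matrix. The sketch below works through a tiny hand-made example; the label lists are hypothetical and unrelated to the model output above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels, for illustration only.
y_true = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION"]
y_hat = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF", "COLLECTION"]

labels = ["COLLECTION", "PAIDOFF"]
# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm.tolist())  # [[1, 1], [1, 3]]

# Precision and recall for the minority class.
precision = precision_score(y_true, y_hat, pos_label="COLLECTION")
recall = recall_score(y_true, y_hat, pos_label="COLLECTION")
print(precision, recall)  # 0.5 0.5
```

Here overall accuracy is 4 out of 6, yet only half of the COLLECTION loans are caught, which is the kind of gap accuracy alone can hide when the classes are imbalanced.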

## Support Vector Machine (SVM)

In [None]:
from sklearn import svm

In [None]:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Training accuracy:", round(clf.score(X_train, y_train)*100, 2), "%")

In [None]:
from sklearn import metrics
print("Train set Accuracy", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test set Accuracy", metrics.accuracy_score(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))
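
The `rbf` kernel used above is one of several kernels that `svm.SVC` accepts. As a rough sketch of how they could be compared, the loop below fits each kernel on synthetic stand-in data (not the loan data set) and records its test accuracy; all other hyperparameters are left at their defaults.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan data set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = svm.SVC(kernel=kernel).fit(X_tr, y_tr)
    kernel_scores[kernel] = model.score(X_te, y_te)  # test accuracy

for kernel, acc in kernel_scores.items():
    print(f"{kernel}: {acc:.2f}")
```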

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rm = RandomForestClassifier()
rm.fit(X_train, y_train)
y_pred = rm.predict(X_test)
print("Training accuracy:", round(rm.score(X_train, y_train)*100, 2), "%")
print("Test accuracy:", round(rm.score(X_test, y_test)*100, 2), "%")

## Performance Comparison

In [None]:
from yellowbrick.classifier import ClassificationReport

### K Nearest Neighbor (KNN)

In [None]:
# Class names must follow the sorted label order found during fit(): COLLECTION, then PAIDOFF
visualizer = ClassificationReport(knn, classes=['COLLECTION','PAIDOFF'])
visualizer.fit(X_train, y_train)   # Fit the visualizer on the training data
visualizer.score(X_test, y_test)   # Evaluate the model on the test data
g = visualizer.show()              # Draw the report; show() replaces the deprecated poof()

### Decision Tree

In [None]:
# Class names must follow the sorted label order found during fit(): COLLECTION, then PAIDOFF
visualizer = ClassificationReport(tree, classes=['COLLECTION','PAIDOFF'])
visualizer.fit(X_train, y_train)   # Fit the visualizer on the training data
visualizer.score(X_test, y_test)   # Evaluate the model on the test data
g = visualizer.show()              # Draw the report; show() replaces the deprecated poof()

### Support Vector Machine (SVM)

In [None]:
# Class names must follow the sorted label order found during fit(): COLLECTION, then PAIDOFF
visualizer = ClassificationReport(clf, classes=['COLLECTION','PAIDOFF'])
visualizer.fit(X_train, y_train)   # Fit the visualizer on the training data
visualizer.score(X_test, y_test)   # Evaluate the model on the test data
g = visualizer.show()              # Draw the report; show() replaces the deprecated poof()

### Random Forest

In [None]:
# Class names must follow the sorted label order found during fit(): COLLECTION, then PAIDOFF
visualizer = ClassificationReport(rm, classes=['COLLECTION','PAIDOFF'])
visualizer.fit(X_train, y_train)   # Fit the visualizer on the training data
visualizer.score(X_test, y_test)   # Evaluate the model on the test data
g = visualizer.show()              # Draw the report; show() replaces the deprecated poof()
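
The per-model reports can also be condensed into a single table, which makes the final comparison easier to read. The sketch below is one possible way to do that; it runs on synthetic data whose class weights only roughly mimic the imbalance of the loan data, and the choice of macro F1 as a second metric is an assumption, not something the notebook prescribes. In the notebook itself, `X_train`, `X_test`, `y_train` and `y_test` would take the place of the stand-in arrays.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 25/75 class split.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.25, 0.75], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=25),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "test_accuracy": accuracy_score(y_te, pred),
        # Macro F1 weights both classes equally, which matters under class imbalance.
        "macro_f1": f1_score(y_te, pred, average="macro"),
    })

summary = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("test_accuracy", ascending=False)
)
print(summary.round(3))
```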