# Bank Note Authentication

The purpose of this project is to build a classification model that predicts, from a bank note's features, whether the note is genuine or not.


The data comes from the UCI Machine Learning Repository and can also be found on [Kaggle](https://www.kaggle.com/ritesaluja/bank-note-authentication-uci-data).


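**In this notebook we will perform the following steps:**
* Importing the libraries and reading the data
* Splitting the data into training and testing datasets
* Creating a training model
* Saving the model to file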

## Importing the libraries and reading the data

In [7]:
# Importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
# Get the file names in data folder
!ls data/*

data/bank_note_authentication.csv


In [5]:
# Reading the data into Pandas DataFrame
df = pd.read_csv("data/bank_note_authentication.csv")

In [6]:
df.head()

Unnamed: 0,variance,skewness,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [8]:
# See the label counts
df["class"].value_counts()

0    762
1    610
Name: class, dtype: int64
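
The label counts above also give a useful reference point for any accuracy figure: a model that always predicts the most frequent class already gets a certain share of the notes right. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {0: 762, 1: 610}
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = max(counts.values()) / total

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(baseline_accuracy, 3))
```

The two classes are fairly balanced, so plain accuracy is a reasonable metric here, and any trained model should clearly beat the baseline printed last.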

Converting the data into features and labels

In [9]:
# Creating the labels and features
X = df.drop(["class"], axis=1)
y = df[["class"]]
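
Selecting the label with double brackets produces a one-column DataFrame rather than a 1-D Series. scikit-learn classifiers expect a 1-D target and emit a `DataConversionWarning` when given a column vector. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the real data) to show the difference in shape:

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({"variance": [3.6, 4.5, -1.4], "class": [0, 0, 1]})

y_frame = toy[["class"]]   # double brackets -> DataFrame with one column
y_series = toy["class"]    # single brackets -> 1-D Series

print(y_frame.shape)   # (3, 1)
print(y_series.shape)  # (3,)
```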

In [11]:
X.head()

Unnamed: 0,variance,skewness,curtosis,entropy
0,3.6216,8.6661,-2.8073,-0.44699
1,4.5459,8.1674,-2.4586,-1.4621
2,3.866,-2.6383,1.9242,0.10645
3,3.4566,9.5228,-4.0112,-3.5944
4,0.32924,-4.4552,4.5718,-0.9888


In [12]:
y.head()

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0


## Splitting the data into training and testing datasets


We will be splitting the data into 2 parts, namely
1. Training data 
2. Testing data

The test size is set to 30% of the complete dataset.
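
The 30% fraction translates into whole rows as sketched below. This assumes the 1372 rows of this dataset and relies on scikit-learn rounding a fractional test size up to the next whole row:

```python
import math

n_samples = 1372   # number of rows in this dataset
test_size = 0.3    # fraction passed to train_test_split

# scikit-learn rounds a fractional test size up to the next whole row
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 960 412
```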

In [19]:
# Getting the shape of raw data
X.shape, y.shape

((1372, 4), (1372, 1))

In [16]:
# Performing a train test split on the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [20]:
# Checking the shape of the split data
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((960, 4), (412, 4), (960, 1), (412, 1))
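
The split above is random, so the exact rows in each part (and the score computed later) can change from run to run. One possible way to make it repeatable is to pass `random_state`, and `stratify` keeps the class proportions similar in both parts. The sketch below is not the split used in this notebook: it runs on synthetic data from `make_classification`, and the seed value 42 is an arbitrary choice.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, roughly 55/45 class mix
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.55, 0.45], random_state=0,
)

# Two calls with the same seed; each returns [X_train, X_test, y_train, y_test]
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Same seed -> identical test rows on every run
print((split_a[1] == split_b[1]).all())

# Stratification keeps the class mix close in the training and testing parts
print(Counter(split_a[2]), Counter(split_a[3]))
```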

## Creating a training model

Since this is a binary classification problem (each note gets one of two labels, 0 or 1), we will use `RandomForestClassifier`, an ensemble method that often gives strong accuracy with its default settings.

In [23]:
from sklearn.ensemble import RandomForestClassifier

# 1. Create a classifier
classifier = RandomForestClassifier()

# 2. Train the model (ravel() turns the single-column y into the 1-D array scikit-learn expects)
classifier.fit(X_train, y_train.values.ravel())


RandomForestClassifier()

Using the trained model to predict the outcomes to the testing dataset

In [24]:
# 3. Predict the results
y_pred = classifier.predict(X_test)

In [27]:
# 4. Evaluating the score
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

0.9975728155339806

In [38]:
# Score of the classifier
classifier.score(X_test, y_test)

0.9975728155339806
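
Accuracy is simply the fraction of test rows whose predicted label equals the true label. The hand-written `accuracy` function below is an illustrative helper, not part of scikit-learn. The last line shows that the score above is what a single mistake among the 412 test rows produces.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels: 3 of 4 predictions match
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# With 412 test rows, one misclassified note gives the score reported above
print(411 / 412)
```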

Since the default RandomForestClassifier already reaches an accuracy of about 99.8% on the held-out test set, we will not optimize it any further.
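
A single train/test split gives only one estimate of accuracy. If more confidence in that number is wanted, k-fold cross-validation is one option. The sketch below is an assumption-laden illustration rather than a step of this notebook: it uses synthetic data from `make_classification`, 5 folds, and a smaller forest to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the bank note features (4 columns, 2 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Five folds: each fold is held out once while the model trains on the rest
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

A small spread across folds suggests the single-split score is not a lucky draw.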

## Saving the model to file

Once the model has been built, it can be used in any other ecosystem by loading it from a file. There are 2 ways we can save the model:

1. Pickle
2. Joblib

First, we will dump the model into a pickle file, then load it back and check that the test results are the same.

### Pickle file approach

In [33]:
# Saving the model to the file
import pickle

# Write the trained model to a pickle file; the with block closes the file for us
with open("pickle_model.pkl", "wb") as pickle_file:
    pickle.dump(classifier, pickle_file)

In [36]:
# Reading the saved model
with open("pickle_model.pkl", "rb") as pickle_file:  # ensures the file handle is closed
    pickle_loaded_model = pickle.load(pickle_file)
pickle_loaded_model.score(X_test, y_test)

0.9975728155339806

### Joblib approach

In [39]:
# Saving the model using joblib
import joblib
joblib_file = "joblib_model.sav"
joblib.dump(classifier, joblib_file)

['joblib_model.sav']

In [40]:
# Reading the saved joblib model
joblib_saved_model = joblib.load(joblib_file)
joblib_saved_model.score(X_test, y_test)

0.9975728155339806

### Viewing all the model scores

In [43]:
print(f"Base classifier model score: {classifier.score(X_test, y_test)}")
print(f"Pickle approach model score: {pickle_loaded_model.score(X_test, y_test)}")
print(f"Joblib approach model score: {joblib_saved_model.score(X_test, y_test)}")

Base classifier model score: 0.9975728155339806
Pickle approach model score: 0.9975728155339806
Joblib approach model score: 0.9975728155339806


In [44]:
classifier.score(X_test, y_test) == pickle_loaded_model.score(X_test, y_test)

True

In [46]:
classifier.score(X_test, y_test) == joblib_saved_model.score(X_test, y_test)

True
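
Equal scores show that the models agree on how many rows they get right, not necessarily on which ones. A stricter check compares the predictions element by element. The sketch below assumes a small stand-in forest trained on synthetic data and writes to a temporary directory (the `demo_model.*` file names are hypothetical), so it does not touch the files saved above.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data and model, not the notebook's classifier
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib_path = os.path.join(tmp_dir, "demo_model.joblib")

    # Save with both approaches
    with open(pickle_path, "wb") as fh:
        pickle.dump(model, fh)
    joblib.dump(model, joblib_path)

    # Load both copies back
    with open(pickle_path, "rb") as fh:
        from_pickle = pickle.load(fh)
    from_joblib = joblib.load(joblib_path)

# Compare predictions row by row instead of comparing a single score
expected = model.predict(X_demo)
same_pickle = np.array_equal(expected, from_pickle.predict(X_demo))
same_joblib = np.array_equal(expected, from_joblib.predict(X_demo))
print(same_pickle, same_joblib)
```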

Since all the model scores are the same, we will be using the model saved in `pickle_model.pkl`.
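
Using a loaded model on a single note could look roughly like the sketch below. It is illustrative only: the model is a stand-in trained on synthetic data (so its output says nothing about real notes), the pickle round trip happens in memory rather than through `pickle_model.pkl`, and the feature values are copied from the first class-1 row of the dataset.

```python
import io
import pickle

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["variance", "skewness", "curtosis", "entropy"]

# Stand-in model trained on synthetic data; NOT the notebook's model
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=feature_names)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Round-trip through pickle in memory (a file opened in "rb" mode works the same way)
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)
loaded_model = pickle.load(buffer)

# One note described by its four features, with the same column names used in training
note = pd.DataFrame([[-1.3971, 3.3191, -1.3927, -1.9948]], columns=feature_names)

prediction = loaded_model.predict(note)           # array with one label (0 or 1)
probabilities = loaded_model.predict_proba(note)  # one row: probability per class
print(prediction, probabilities)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a trusted source.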

In [47]:
df[df['class'] == 1]

Unnamed: 0,variance,skewness,curtosis,entropy,class
762,-1.39710,3.31910,-1.392700,-1.99480,1
763,0.39012,-0.14279,-0.031994,0.35084,1
764,-1.66770,-7.15350,7.892900,0.96765,1
765,-3.84830,-12.80470,15.682400,-1.28100,1
766,-3.56810,-8.21300,10.083000,0.96765,1
...,...,...,...,...,...
1367,0.40614,1.34920,-1.450100,-0.55949,1
1368,-1.38870,-4.87730,6.477400,0.34179,1
1369,-3.75030,-13.45860,17.593200,-2.77710,1
1370,-3.56370,-8.38270,12.393000,-1.28230,1
