<a href="https://www.kaggle.com/code/tusharaggarwal27/banknote-authentication-project?scriptVersionId=113095571" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Have you ever handed cash to a supermarket cashier, with a long line of people waiting behind you, only to discover that the money was fake? Or, even more embarrassing, found that you had no other banknotes on you? I was in this predicament once, and the shame of being taken for a dishonest cheapskate stayed with me for a very long time. That experience inspired this project: **a K-Means clustering model to detect whether a banknote is real or fake.**

github.com/tushar2704, kaggle.com/tusharaggarwal27, linkedin.com/in/tusharaggarwalinseec

This project uses Python (Pandas, NumPy) to gather and assess the data, and scikit-learn to train a K-Means model that detects whether a banknote is genuine or forged.

**Steps used for the model**

Step 1: Gather the data and perform exploratory data analysis (EDA)

Step 2: K-Means model fitting

Step 3: Re-run K-Means several times to see if we get similar results, which tells us whether the K-Means model is stable

Step 4: Analyze the K-Means computing results

Step 5: Calculate the accuracy of the result!

In [None]:
# Data manipulation imports
import numpy as np
import pandas as pd
from scipy.io import arff

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Modeling imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

In [None]:
# Reading the data as a DataFrame

data = arff.loadarff('/kaggle/input/banknoteauthentication/php50jXam.arff')

bank_note = pd.DataFrame(data[0])
bank_note.head()

This dataset is about distinguishing genuine and forged banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from these images. (Source: https://www.openml.org/d/1462)
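
One detail worth checking when an ARFF file is read with `scipy.io.arff.loadarff`: nominal attributes such as `Class` usually come back as bytes (`b'1'`, `b'2'`) rather than integers, which matters later when the labels are compared with cluster ids. The sketch below uses a tiny in-memory ARFF with made-up values (`toy_arff` and `toy_df` are hypothetical names, not part of the notebook) to show one way to decode the column.

```python
import io

import pandas as pd
from scipy.io import arff

# Toy ARFF content with made-up values; only the layout mirrors the real file.
toy_arff = io.StringIO("""@relation toy_banknote
@attribute V1 numeric
@attribute V2 numeric
@attribute Class {1,2}
@data
3.6,8.6,1
-1.4,-3.3,2
""")

records, meta = arff.loadarff(toy_arff)
toy_df = pd.DataFrame(records)

# Nominal values arrive as bytes, so decode them before comparing with ints.
toy_df["Class"] = toy_df["Class"].str.decode("utf-8").astype(int)
print(toy_df.dtypes)
```

The same two-step decode could be applied to `bank_note['Class']` right after loading, assuming the real file stores the class as a nominal attribute.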

# EDA

In [None]:
# Get the metadata: column names, dtypes, and non-null counts
bank_note.info()


# Attributes Information
1. V1: variance of Wavelet Transformed image (continuous)
2. V2: skewness of Wavelet Transformed image (continuous)
3. V3: kurtosis of Wavelet Transformed image (continuous)
4. V4: entropy of image (continuous)


In [None]:
bank_note['Class'].unique()
# The dataset has 2 classes; WCSS can be used to check how many clusters (k) the data supports

In [None]:
#exploring the basic statistical info
bank_note.describe(include='all')

**For the K-Means model itself, I picked only two variables: V1 (variance of Wavelet Transformed image) and V2 (skewness of Wavelet Transformed image).**

In [None]:
#displaying scatter 
plt.figure(figsize = [8, 8])
plt.scatter(bank_note.V1, bank_note.V2)

The first step in building a K-Means model is to assess whether the dataset is suitable for K-Means; if not, we should choose another clustering model. In this plot, the data is neither spread too widely nor concentrated in one place, so it is worth trying K-Means on this dataset. However, there are no obvious spherical clusters, so we should expect that the K-Means model won’t work perfectly here.
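
The WCSS (within-cluster sum of squares) check mentioned in the EDA comment can be sketched as an elbow curve. This is a minimal sketch on synthetic two-blob data (`X_demo` is a hypothetical stand-in for the V1/V2 array, and `n_init=10` and `random_state=0` are my assumptions for reproducibility); scikit-learn exposes WCSS as `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the V1/V2 features (NOT the banknote data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[-2.0, -3.0], scale=1.5, size=(200, 2)),
    rng.normal(loc=[2.0, 5.0], scale=1.5, size=(200, 2)),
])

# Fit K-Means for k = 1..6 and record the within-cluster sum of squares.
k_values = list(range(1, 7))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss.append(model.inertia_)

for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against `k_values` and looking for the point where the curve flattens gives a rough guide to a sensible k; on the real data the shape of the curve may be less clear-cut.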

# Step 2: K-Means model fitting

In [None]:

# Use only V1 and V2 as features
bank_note_1 = np.column_stack((bank_note.V1, bank_note.V2))

# Fit K-Means with two clusters
km_pic = KMeans(n_clusters=2).fit(bank_note_1)

# Coordinates of the two cluster centers
clusters = km_pic.cluster_centers_

In [None]:
# put the assigned labels to the original dataset
bank_note['KMeans'] = km_pic.labels_

#plot out the result
g = sns.FacetGrid(data = bank_note, hue = 'KMeans', height = 5)  # 'size' was renamed to 'height' in seaborn
g.map(plt.scatter, 'V1', 'V2')
g.add_legend();
plt.scatter(clusters[:,0], clusters[:,1], s=500, marker='*', c='r')

# Step 3: Re-run K-Means several times to see if we get similar results, which checks whether the K-Means model is stable on this dataset

In [None]:
n_iter = 9
fig, ax = plt.subplots(3, 3, figsize=(16, 16))
ax = np.ravel(ax)
centers = []
for i in range(n_iter):
    # Refit scikit-learn's KMeans on the V1/V2 array; each run starts from a new random initialization
    km = KMeans(n_clusters=2,
                max_iter=3)
    km.fit(bank_note_1)
    centroids = km.cluster_centers_
    centers.append(centroids)
    # bank_note_1 is a NumPy array, so boolean masks on labels_ select the rows of each cluster
    ax[i].scatter(bank_note_1[km.labels_ == 0, 0], bank_note_1[km.labels_ == 0, 1],
                   label='cluster 1')
    ax[i].scatter(bank_note_1[km.labels_ == 1, 0], bank_note_1[km.labels_ == 1, 1],
                   label='cluster 2')
    ax[i].scatter(centroids[:, 0], centroids[:, 1],
                  c='r', marker='*', s=300, label='centroid')
    ax[i].legend(loc='lower right')
    ax[i].set_aspect('equal')
plt.tight_layout();

After running K-Means nine times, the results are very similar, which suggests that K-Means is stable on this dataset.
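
Stability can also be checked numerically rather than by eye. The sketch below is one possible approach, not the notebook's method: it refits K-Means with nine different seeds on synthetic data (`X_demo` is a hypothetical stand-in for the V1/V2 array), sorts each run's centroids so that arbitrary cluster numbering does not matter, and reports how far the centroids move between runs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the V1/V2 features (NOT the banknote data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[-2.0, -3.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[2.0, 5.0], scale=1.0, size=(200, 2)),
])

runs = []
for seed in range(9):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X_demo)
    c = km.cluster_centers_
    # Sort centroids by their first coordinate so runs are comparable
    # even when K-Means swaps the cluster ids.
    runs.append(c[np.argsort(c[:, 0])])

runs = np.array(runs)  # shape: (runs, clusters, features)
# Largest change in any centroid coordinate across the nine runs.
max_spread = (runs.max(axis=0) - runs.min(axis=0)).max()
print(f"largest centroid movement across runs: {max_spread:.4f}")
```

A spread close to zero means every run converged to essentially the same centroids; a large spread would suggest the result depends on the initialization.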

# Step 4: Analyze the K-Means computing results

In [None]:
bank_note.groupby('KMeans').describe()

About 574 data points are clustered in group 0, and about 798 in group 1. For group 1, V1’s mean is about -0.20 and V2’s mean is about -3.67, while for group 2, V1’s mean is about 0.88 and V2’s mean is about 5.95.
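
The group means reported above are, for a converged K-Means fit, approximately the cluster centroids, and with two clusters every point is assigned to the nearer centroid. The boundary between the groups is therefore the perpendicular bisector of the segment joining the two centroids. The sketch below works this out from the rounded means quoted above (so the result is only approximate).

```python
import numpy as np

# Approximate centroids (V1, V2), taken from the rounded group means above.
c_a = np.array([-0.20, -3.67])
c_b = np.array([0.88, 5.95])

# Points equidistant from both centroids lie on the line through the
# midpoint that is perpendicular to (c_b - c_a).
midpoint = (c_a + c_b) / 2
normal = c_b - c_a

# Rewrite normal . (x - midpoint) = 0 as V2 = slope * V1 + intercept.
slope = -normal[0] / normal[1]
intercept = midpoint[1] - slope * midpoint[0]

print(f"midpoint: V1={midpoint[0]:.2f}, V2={midpoint[1]:.2f}")
print(f"boundary: V2 = {slope:.3f} * V1 + {intercept:.3f}")
```

Because the centroids differ much more in V2 than in V1, the slope is small and the boundary is close to horizontal, sitting near the midpoint's V2 value.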

# Step 5: Calculate the accuracy of the result

In [None]:
# Plot the data with the correct labels
g = sns.FacetGrid(data = bank_note, hue = 'Class', height = 5)  # 'size' was renamed to 'height' in seaborn
g.map(plt.scatter, 'V1', 'V2')
g.add_legend()
plt.title("Data With Correct Labels")


# Plot the cluster assignments computed by K-Means
g = sns.FacetGrid(data = bank_note, hue = 'KMeans', height = 5)
g.map(plt.scatter, 'V1', 'V2')
g.add_legend()
plt.title("K-Means Result");

The K-Means result tends to be divided by a roughly horizontal line at V2 = 1, whereas the correctly labeled data tends to be divided by a slightly slanted vertical line at V1 = 0. This shows one drawback of K-Means: it gives more weight to the bigger clusters. (Group 1 in the K-Means result tends to include the bigger cluster in the lower part of the plot.) Let’s calculate the accuracy of this K-Means clustering model:



In [None]:
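# Map the K-Means cluster ids (0/1) onto the class labels (1/2).
# K-Means numbers its clusters arbitrarily, so this mapping may need to be swapped in a given run.
bank_note["KMeans"] = bank_note["KMeans"].map({0: 1, 1: 2})

# loadarff typically returns the nominal Class column as bytes (b'1', b'2'); convert it to int
true_class = bank_note["Class"].map(int)

# Accuracy = share of rows where the mapped cluster id equals the true class
correct = (true_class == bank_note["KMeans"]).sum()
print(correct / len(bank_note))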

K-Means Result: The accuracy of this K-Means Model is 65.3%.

Nice! Since I didn’t do any data cleaning or pre-analysis here, an accuracy of 65.3% is quite reasonable. However, if you are interested in improving the accuracy, you can consider conducting factor analysis to find the most influential variables to put into the model, which may give a better result. (For readers who are interested in factor analysis, DataCamp has a good introductory article on factor analysis in Python.)
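
One caveat on the accuracy calculation: K-Means assigns the ids 0 and 1 arbitrarily, so a fixed mapping of 0 to class 1 and 1 to class 2 can be reversed from one run to the next. A minimal sketch of one way to handle this is to score both possible mappings and keep the better one. The helper `clustering_accuracy` and the toy label lists are hypothetical, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over the two ways of mapping cluster ids {0, 1} to classes {1, 2}."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    direct = np.where(cluster_ids == 0, 1, 2)   # 0 -> class 1, 1 -> class 2
    swapped = np.where(cluster_ids == 0, 2, 1)  # 0 -> class 2, 1 -> class 1
    return max(accuracy_score(true_labels, direct),
               accuracy_score(true_labels, swapped))


# Toy labels (made up for illustration): five of the six points agree once ids are swapped.
true_demo = [1, 1, 1, 2, 2, 2]
cluster_demo = [1, 1, 0, 0, 0, 0]
demo_accuracy = clustering_accuracy(true_demo, cluster_demo)
print(demo_accuracy)
```

On the notebook's data this could be called with the integer class labels and `km_pic.labels_`, assuming the class column has been converted from bytes to integers first.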