# Customer Clustering (KNN Algorithm)

Hi! This article gives a brief description of KNN (K-Nearest Neighbours) and then walks through segmenting a customer base with pandas and scikit-learn's KMeans.

Outcomes by the end of the article:
* What is KNN
* Diving into Customer data
  * Importing Libraries
  * Load Dataset
  * Data Filtering (Calculating RMF Values)
  * Data Preprocessing
  * Data Analysis (Applying the K-Means Algorithm)
  * Data Visualization
* Observation

# What is KNN
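K Nearest Neighbours is a simple algorithm that stores all the available cases and classifies a new data point based on a similarity measure. It is mostly used to classify a data point according to how its neighbours are classified. Note that KNN is a supervised classifier that needs labelled examples; the customer data used here has no labels, so the analysis itself relies on K-Means clustering.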
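
To make the idea concrete, here is a minimal sketch of a KNN classifier written with the Python standard library only. The function name `knn_predict`, the toy points and the "low"/"high" labels are all made up for illustration; they are not part of the customer dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per point, with made-up labels
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (9.0, 7.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(points, labels, query=(2.0, 2.0), k=3)
print(prediction)
```

The query point sits next to the three "low" examples, so the vote goes to that label. In practice the choice of `k` and of the distance measure both affect the result.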


# Let's dive into work

# 1. Import required Python libraries

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

# 2. Load the Dataset

In [None]:
import pandas as pd
data = pd.read_csv("../input/customer-segmentation/Online Retail.csv")
data.head()

In [None]:
data.shape

In [None]:
data.keys()

# We are going to calculate RMF
* R stands for recency (how recently the customer last purchased)
* M stands for monetary (amount spent)
* F stands for frequency (how often the customer purchases in a span of time)
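
As a small worked example, the sketch below computes the three values for a handful of made-up transactions. The DataFrame `toy`, its values and the choice of "one day after the latest purchase" as the reference date are assumptions for illustration only; the column names mirror those of the dataset loaded above.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the Online Retail data)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-10", "2021-01-05", "2021-01-08"]
    ),
    "Quantity": [2, 1, 3, 5, 1],
    "UnitPrice": [10.0, 5.0, 2.0, 1.0, 20.0],
})
toy["Sale"] = toy["Quantity"] * toy["UnitPrice"]

# Assumed reference date: one day after the latest purchase in the data
reference = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rmf = toy.groupby("CustomerID").agg(
    # Days between the reference date and the customer's latest purchase
    Recency=("InvoiceDate", lambda d: (reference - d.max()).days),
    # Total amount spent
    Monetary=("Sale", "sum"),
    # Number of distinct invoices
    Frequency=("InvoiceNo", "nunique"),
)
print(toy_rmf)
```

Customer 1 bought most recently (recency of 1 day) and spent 31.0 over two invoices; customer 2 last bought 3 days before the reference date and spent 25.0 over two invoices.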

In [None]:
data.describe()

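Removing unnecessary data: rows whose Quantity is not positive are dropped

In [None]:
# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
data = data.loc[data["Quantity"]>0].copy()
data.shape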

Checking the data type of each field for analysis

In [None]:
data.info()

InvoiceDate has the object data type; for analysis purposes we need to convert it into date-time format

In [None]:
data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"])
data.info()  #verifying the result

We calculate the sale amount to use as the monetary value
* Sale is the money spent on a line item: the quantity purchased multiplied by the unit price of the product,
i.e., Monetary Value = Quantity * UnitPrice

In [None]:
data["Sale"] = data.Quantity * data.UnitPrice
data.head()

CustomerID is the key value for every customer, so we will use it as the key for all three measures

In [None]:
monetary = data.groupby("CustomerID").Sale.sum()
monetary = monetary.reset_index()
monetary.head()

Frequency of Customer
For every purchase the customer gets a unique invoice number, so to calculate frequency we count the number of distinct invoices per CustomerID

In [None]:
# nunique() counts distinct invoices; count() would count every line item on an invoice
frequency = data.groupby("CustomerID").InvoiceNo.nunique()
frequency = frequency.reset_index()
frequency.head()
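
One detail worth checking at this step is whether rows or invoices are being counted, because an invoice with several products occupies several rows. The sketch below uses a made-up frame (`lines` and its values are hypothetical) to show how `count()` and `nunique()` differ on the same column.

```python
import pandas as pd

# Hypothetical toy invoices: customer 1 has two invoices spread over three rows
lines = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

# count() counts rows, i.e. line items
rows_per_customer = lines.groupby("CustomerID")["InvoiceNo"].count()
# nunique() counts distinct invoice numbers
invoices_per_customer = lines.groupby("CustomerID")["InvoiceNo"].nunique()

print(rows_per_customer.to_dict())
print(invoices_per_customer.to_dict())
```

For customer 1 the row count is 3 while the number of distinct invoices is 2, so the two definitions of frequency can rank customers differently.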

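Recency of Customer
We use the number of days between a reference date (one day after the last purchase date in the dataset) and each invoice date, and keep the smallest value per customer

In [None]:
LastDate = data.InvoiceDate.max()
LastDate = LastDate + pd.DateOffset(days = 1)  # reference date: one day after the last purchase
data["Diff"] = LastDate - data.InvoiceDate
data["Diff"] = data["Diff"].dt.days
recency = data.groupby("CustomerID").Diff.min()  # days since the customer's most recent purchase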
recency = recency.reset_index()
recency.head()

# RMF Combined DataFrame
Now we create a DataFrame of recency, monetary and frequency values to perform analytics on it

In [None]:
RMF = monetary.merge(frequency, on = "CustomerID")
RMF = RMF.merge(recency, on = "CustomerID")
RMF.columns = ["CustomerID", "Monetary", "Frequency", "Recency"]
RMF

In [None]:
RMF1 = pd.DataFrame(RMF,columns= ["Monetary","Frequency","Recency"])
RMF1 


# **Data Filtering ends here**

# Preprocess Data 

In [None]:
'''from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
scaled = ss.fit_transform(RMF[["Monetary", "Frequency", "Recency"]])  # scale the three features only, not CustomerID
RMF1 = pd.DataFrame(scaled, columns=["Monetary", "Frequency", "Recency"])
RMF1'''
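
The scaling step is left disabled in this notebook. For reference, the sketch below shows what standardization does, using plain pandas on made-up numbers (the frame `features` and its values are hypothetical). K-Means relies on Euclidean distance, so a column with a much larger range, such as a monetary total, tends to dominate the clustering when the features are left unscaled.

```python
import pandas as pd

# Hypothetical toy RMF values on very different scales
features = pd.DataFrame({
    "Monetary": [100.0, 2000.0, 50000.0],
    "Frequency": [1.0, 5.0, 30.0],
    "Recency": [300.0, 40.0, 2.0],
})

# z-score: subtract the column mean and divide by the population standard
# deviation (ddof=0), which is the convention StandardScaler follows
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(2))
```

After this transformation every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale.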

# Now let's get started with analysis
With the KMeans clustering algorithm we calculate the inertia (the sum of squared distances from each sample to its closest cluster centre) for a range of k values, in order to choose k

In [None]:
from sklearn.cluster import KMeans
ssd = []  # sum of squared distances (inertia) for each k
for k in range(1,20):
    km = KMeans(n_clusters = k)
    km.fit(RMF1)
    ssd.append(km.inertia_)  # inertia of the fitted model with k clusters
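
To show what `KMeans` and `inertia_` compute, here is a minimal from-scratch sketch of Lloyd's algorithm using NumPy only. It is an illustration under simplifying assumptions (random initial centroids, a fixed seed, a made-up two-blob dataset named `toy`), not a reimplementation of scikit-learn's estimator, which uses a smarter initialization.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia: sum of squared distances to the nearest centroid
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    inertia = float(sq_dist.min(axis=1).sum())
    return labels, centroids, inertia

# Hypothetical toy data: two well-separated groups of three points
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

for k in (1, 2, 3):
    _, _, inertia = kmeans(toy, k)
    print(k, round(inertia, 2))
```

On data like this the inertia falls sharply when going from one cluster to two and only slightly afterwards, which is the pattern the elbow plot looks for.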


# Now we are going to plot inertia for determining the value of k

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.plot(np.arange(1,20), ssd, color = "blue")
plt.scatter(np.arange(1,20), ssd, color = "red")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
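
Reading the elbow off a plot is a judgment call. As a rough cross-check, one crude heuristic is to pick the k where the curve bends most sharply, that is, where the drop in inertia before k is much larger than the drop after it. The function name and the inertia values below are made up for illustration; this is a sketch of one possible heuristic, not the method used to choose k in this notebook.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Drop gained by reaching ks[i], minus the drop gained by the next step
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical inertia values for k = 1..6 (made up for illustration)
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 520.0, 180.0, 150.0, 130.0, 120.0]

print(elbow_by_second_difference(ks, inertias))
```

With these made-up numbers the largest bend is at k = 3. On real data the curve is often smoother, so the result should be compared against the plot rather than trusted on its own.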

# Making 5 Clusters and Allotting ClusterID

In [None]:

model = KMeans(n_clusters = 5)
ClusterID = model.fit_predict(RMF1)
RMF1["ClusterID"] = ClusterID
RMF1

# Compute the per-cluster mean of each of the three features

In [None]:
km_cluster_Sale = RMF1.groupby("ClusterID").Monetary.mean()
km_cluster_Recency = RMF1.groupby("ClusterID").Recency.mean()
km_cluster_Frequency = RMF1.groupby("ClusterID").Frequency.mean()

# Plotting Graph(Visualization)

In [None]:

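import seaborn as sns
fig, axs = plt.subplots(1,3, figsize = (15,5))
# One bar per cluster for each feature mean
sns.barplot(x = [0,1,2,3,4], y= km_cluster_Sale,ax = axs[0])
sns.barplot(x = [0,1,2,3,4], y= km_cluster_Frequency,ax = axs[1])
sns.barplot(x = [0,1,2,3,4], y= km_cluster_Recency,ax = axs[2])
axs[0].set_title("Monetary Mean")
axs[1].set_title("Frequency Mean")
axs[2].set_title("Recency Mean")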


In [None]:
fig, axs = plt.subplots(1,3, figsize = (15, 5))
# Draw each pie on the axes returned by plt.subplots instead of adding overlapping subplots
axs[0].pie(km_cluster_Sale,labels = [0,1,2,3,4])
axs[0].set_title("Monetary Mean")

axs[1].pie(km_cluster_Frequency,labels = [0,1,2,3,4])
axs[1].set_title("Frequency Mean")

axs[2].pie(km_cluster_Recency,labels = [0,1,2,3,4])
axs[2].set_title("Recency Mean")
plt.show()



# Observation
From the plots we can easily observe that the clusters with high frequency, high monetary value and low recency contain the potential customers, i.e.,
* Customers in clusters 1 and 3 spend the most and purchase most frequently.
* Customers in cluster 2 show moderate sales: they purchase more frequently but spend less.
* Customers in clusters 0 and 4 contribute the least of all.
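
Cluster IDs assigned by K-Means are arbitrary labels and can differ from one run to the next unless a random seed is fixed, so it can help to rank clusters by their profile rather than by number. The sketch below does this on a made-up frame (`clustered` and all its values are hypothetical, with three clusters instead of five to keep it short).

```python
import pandas as pd

# Hypothetical clustered RMF rows (values and labels made up for illustration)
clustered = pd.DataFrame({
    "Monetary": [5000.0, 5200.0, 300.0, 250.0, 1200.0, 1100.0],
    "Frequency": [40, 45, 3, 2, 12, 10],
    "Recency": [2, 5, 200, 250, 30, 45],
    "ClusterID": [2, 2, 0, 0, 1, 1],
})

# Mean of each feature per cluster, highest spenders first
profile = (
    clustered.groupby("ClusterID")[["Monetary", "Frequency", "Recency"]]
    .mean()
    .sort_values("Monetary", ascending=False)
)
print(profile)
```

Sorting by mean monetary value puts the highest-spending cluster in the first row whatever number it happened to receive.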