## Clustering
Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate observations with similar traits and assign them to clusters.

### K-Means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. You define a target number k, which is the number of centroids (and therefore clusters) you want in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is allocated to exactly one cluster so as to reduce the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster sum of squared distances as small as possible. The ‘means’ in K-means refers to averaging the data, that is, finding the centroid.
https://www.kaggle.com/code/heeraldedhia/kmeans-clustering-for-customer-data/notebook
https://ganpat-patel-012.github.io/Customer-Segmentation-Exopsys-Data-Labs-Internship/
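
The assign-then-average loop described above can be sketched in plain NumPy. This is only a minimal illustration, not how scikit-learn implements `KMeans`: the function name `kmeans_sketch`, the random initialisation (instead of k-means++), and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (simple init, not k-means++)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of the points assigned to it (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer move the centroids
        centroids = new_centroids
    return labels, centroids

# Toy data (made up for illustration): two well-separated blobs
toy_rng = np.random.default_rng(42)
toy = np.vstack([toy_rng.normal(0, 0.5, (20, 2)), toy_rng.normal(5, 0.5, (20, 2))])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_centroids)
```

Each pass can only lower (or keep) the within-cluster sum of squares, which is why the loop settles after a few iterations on data like this.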

dataLink = 'https://raw.githubusercontent.com/DUanalytics/datasets/master/csv/clusteringMallCustomers.csv'
This input file contains basic information (ID, gender, age, annual income, spending score) about the customers of a mall. Spending Score is something you assign to the customer based on your defined parameters, such as customer behavior and purchasing data.
You own the mall and want to understand which customers can be easily converted [Target Customers], so that this insight can be given to the marketing team to plan its strategy accordingly.

In [1]:
# libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

import plotly as py
import plotly.graph_objs as go

from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'plotly'

In [None]:
!pip install plotly

Collecting plotly
  Downloading plotly-5.13.1-py2.py3-none-any.whl (15.2 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly


In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/DUanalytics/datasets/master/csv/clusteringMallCustomers.csv')
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.columns = ['customerID','gender','age','annualIncome','spendingScore']

In [None]:
df.describe(include='all')

In [None]:
# Null Values
df.isnull().sum()

In [None]:
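# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['age'], kde=True, stat='density')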

In [None]:
plt.figure(1, figsize=(10,6))
n = 0
for x in ['age','annualIncome', 'spendingScore']:
    n = n + 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace = 0.5, wspace = .5)
    sns.histplot(df[x], bins=5, kde=True, stat='density')
    plt.title('Distribution Plot of {}'.format(x))
plt.show();

In [None]:
sns.pairplot(df, vars = ['age','annualIncome', 'spendingScore'], hue='gender')

### 2D Clustering based on Age and Spending Score

In [None]:
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'age', y = 'spendingScore', data = df, s = 100)
plt.show();

### k value
Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its assigned centroid, squaring this distance, and summing these squares over all points in all clusters. Inertia tends to fall as K grows, so a good model is one with low inertia AND a low number of clusters (K).
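
As a quick check of that definition, inertia can be recomputed by hand and compared with the `inertia_` attribute of a fitted `KMeans`. The data below is made up for illustration, and names such as `inertia_pts` and `manual_inertia` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two blobs in 2D
inertia_rng = np.random.default_rng(0)
inertia_pts = np.vstack([
    inertia_rng.normal(0, 1, (30, 2)),
    inertia_rng.normal(6, 1, (30, 2)),
])

km_demo = KMeans(n_clusters=2, n_init=10, random_state=111).fit(inertia_pts)

# Inertia by hand: squared distance of each point to its own centroid,
# summed over all points in all clusters
own_centroid = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = ((inertia_pts - own_centroid) ** 2).sum()

print(manual_inertia, km_demo.inertia_)
```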

In [None]:
X1 = df[['age', 'spendingScore']].values
X1

In [None]:
n=2
Aks = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
Aks.fit(X1)
round(Aks.inertia_)

In [None]:
n=4
Aks = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
Aks.fit(X1)
round(Aks.inertia_)

In [None]:
# Inertia changes as we change n, the number of clusters
# What is the optimal number of clusters?

In [None]:
inertia = []
for n in range(1, 15):
    Aks = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
    Aks.fit(X1)
    inertia.append(round(Aks.inertia_))
print(inertia)

In [None]:
#plot and check inertia and n values
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 15) , inertia , 'o')
plt.plot(np.arange(1 , 15) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show();
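
Reading the elbow off the plot is a judgment call. One common heuristic, sketched here under assumptions, picks the point on the inertia curve that lies farthest from the straight line joining its first and last points. The synthetic blobs stand in for the customer data, and `elbow_k` is a hypothetical name; treat the result as a suggestion to compare against the plot, not a definitive answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer data: 4 blobs in 2D
X_demo, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.0, random_state=111)

ks = np.arange(1, 11)
inertias = np.array([
    KMeans(n_clusters=int(k), n_init=10, random_state=111).fit(X_demo).inertia_
    for k in ks
])

# Rescale both axes to [0, 1] so neither dominates the distance computation
xs = (ks - ks[0]) / (ks[-1] - ks[0])
ys = (inertias - inertias.min()) / (inertias.max() - inertias.min())

# Perpendicular distance of each point from the line through the first and last points
x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
dist = np.abs((y1 - y0) * xs - (x1 - x0) * ys + x1 * y0 - y1 * x0) / np.hypot(x1 - x0, y1 - y0)

elbow_k = int(ks[dist.argmax()])
print("suggested k:", elbow_k)
```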

In [None]:
# elbow points - k = 4
n=4
Aks4 = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
Aks4.fit(X1)
round(Aks4.inertia_)

In [None]:
labels4 = Aks4.labels_
centeriods4 = Aks4.cluster_centers_

In [None]:
print(labels4)
# to which cluster each row of data belongs

In [None]:
print(centeriods4)
# center point of each cluster

In [None]:
df['cluster1'] = labels4  # one cluster label per row, in row order
df.sort_values(by='cluster1')

In [None]:
df.cluster1.value_counts()
#clusters =['0','1','2','3']

In [None]:
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'age', y = 'spendingScore', data = df, s = 100, c='cluster1')
plt.show();

plt.figure(figsize=(15,8))
# Pass column names as keywords and colour by the 'cluster1' column
sns.scatterplot(x='age', y='spendingScore', hue='cluster1', data=df, alpha=0.6)
plt.title('Cluster Wise Colors : Age vs Spending Score', fontsize = 15)
plt.xlabel('Age', fontsize = 12)
plt.ylabel('Spending Score', fontsize = 12)
plt.show()

In [None]:
### Use all numeric Columns
X2 = df[['age', 'annualIncome','spendingScore']].values
X2[1:5]

In [None]:
inertia = []
for n in range(1, 15):
    Aks = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
    Aks.fit(X2)
    inertia.append(round(Aks.inertia_))
print(inertia)

In [None]:
#plot and check inertia and n values
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 15) , inertia , 'o')
plt.plot(np.arange(1 , 15) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show();
#n=6

In [None]:
n=6
Aks6 = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300, tol = .001, random_state=111, algorithm ='elkan')
Aks6.fit(X2)
round(Aks6.inertia_)

In [None]:
labels6 = Aks6.labels_
centeriods6 = Aks6.cluster_centers_

In [None]:
df['cluster2'] = labels6  # one cluster label per row, in row order
df.sort_values(by='cluster2')

In [None]:
df.cluster2.value_counts()

## Plotly 3D

In [None]:
import plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter3d (
    x= df['age'],   y= df['spendingScore'], z= df['annualIncome'],
    mode='markers',marker=dict( color = df['cluster2'],  size= 10, 
    line=dict( color= df['cluster2'], width= 12),  opacity=0.8  )
)
data = [trace1]
layout = go.Layout(
    title= 'Clusters wrt Age, Income and Spending Scores',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Spending Score'),
            zaxis = dict(title  = 'Annual Income')
        )
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

We analysed customer data and performed 2D and 3D clustering using the K-Means algorithm. This kind of cluster analysis helps design better customer acquisition strategies and supports business growth.

More to explore: meshgrid, scaling of data.

## Hierarchical Clustering
In hierarchical clustering we merge the most similar points or clusters. The question is: how do we decide which points are similar and which are not?
One way to calculate similarity is to take the distance between the centroids of the clusters. The clusters with the least distance between them are considered the most similar, and we merge them. We can refer to this as a distance-based algorithm as well (since we are calculating the distances between the clusters).

In hierarchical clustering, we have a concept called a proximity matrix. This stores the distance between each pair of points. Let’s take an example to understand this matrix as well as the steps to perform hierarchical clustering.
https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/
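
A proximity matrix can be built with SciPy's `pdist` and `squareform`. The four rows below are made-up values on a scale similar to the (age, annualIncome, spendingScore) columns; they are not taken from the dataset.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four made-up customers: (age, annualIncome, spendingScore)
prox_sample = np.array([
    [25, 40, 60],
    [27, 42, 58],
    [50, 90, 20],
    [52, 88, 25],
])

# pdist returns the condensed pairwise distances;
# squareform expands them into a symmetric n x n matrix with zeros on the diagonal
proximity = squareform(pdist(prox_sample, metric='euclidean'))
print(np.round(proximity, 1))
```

The smallest off-diagonal entry identifies the pair that an agglomerative method would merge first.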

In [None]:
XH = df[['age','annualIncome', 'spendingScore']].values
XH[1:5]

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

### Dendrogram
A dendrogram is a tree-like diagram that records the sequences of merges or splits.

In [None]:
linkage_data = linkage(XH[1:5], method='ward', metric='euclidean')

In [None]:
dendrogram(linkage_data)
plt.show()

Each vertical line represents the distance between the samples or clusters being merged. Plotting every merge step in this way produces the dendrogram.
The longer the vertical lines in the dendrogram, the greater the distance between those clusters.
We can set a threshold distance and draw a horizontal line (generally, we try to set the threshold in such a way that it cuts the tallest vertical line).
The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold.
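
Cutting at a threshold can also be done without drawing anything, using SciPy's `fcluster` with `criterion='distance'`. The points and the threshold value below are assumptions chosen so that the cut falls between the small and large merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up points: two tight pairs that are far apart from each other
cut_pts = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.3, 7.9]])
cut_Z = linkage(cut_pts, method='ward', metric='euclidean')

# Merges below the threshold are kept; merges above it are cut
cut_threshold = 3.0  # hypothetical value for this toy data
flat_labels = fcluster(cut_Z, t=cut_threshold, criterion='distance')

print(cut_Z[:, 2])    # merge heights, one per merge step
print(flat_labels)    # flat cluster id for each point
```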

### Agglomerative Clustering (n to 1) : join
In this technique we start by assigning each point to its own cluster. Suppose there are 4 data points. We assign each of these points to a cluster and hence have 4 clusters in the beginning.
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster is left.
Because we are merging (or adding) the clusters at each step, this type of clustering is also known as additive hierarchical clustering.
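
The merge sequence can be inspected directly in the matrix returned by SciPy's `linkage`, where each row records one merge. The sketch below uses four made-up 1-D points and single linkage (rather than the Ward linkage used elsewhere in this notebook) only because the resulting distances are easy to read.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up 1-D points, each starting as its own cluster (ids 0..3)
four_pts = np.array([[1.0], [2.0], [6.0], [9.0]])
merges = linkage(four_pts, method='single', metric='euclidean')

n_pts = len(four_pts)
for step, (a, b, dist, size) in enumerate(merges):
    # Each row merges clusters a and b; the newly formed cluster gets id n_pts + step
    print(f"step {step + 1}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.1f} -> cluster {n_pts + step} with {int(size)} points")
```

With n points there are always n - 1 merges, and the last row contains all points.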

In [None]:
# The 'affinity' argument was removed in newer scikit-learn; ward linkage always uses Euclidean distance
HC = AgglomerativeClustering(n_clusters=2, linkage='ward')
HC_labels = HC.fit_predict(XH[1:5])

In [None]:
HC_labels

In [None]:
#1st and 3rd in 1 cluster
#2nd and 4th in 2nd cluster
XH[1:5]

### Divisive Clustering ( 1 to n ) : separate
Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters (in case of n observations), we start with a single cluster and assign all the points to that cluster.

So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to the same cluster at the beginning.
Now, at each iteration, we split off the farthest point in the cluster and repeat this process until each cluster contains only a single point.
We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical clustering.
Agglomerative clustering is widely used in the industry.
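
A top-down split can be sketched as follows. Note the swapped-in technique: instead of peeling off the single farthest point as described above, this sketch splits the largest cluster in two with 2-means at each step, which is one common way to make divisive clustering practical. The function name `divisive_sketch` and the toy blobs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_sketch(X, n_clusters, seed=0):
    """Top-down clustering: start with one cluster, repeatedly split the largest in two."""
    labels = np.zeros(len(X), dtype=int)   # every point starts in cluster 0
    for new_id in range(1, n_clusters):
        # Pick the cluster with the most points as the one to split
        target = np.bincount(labels).argmax()
        idx = np.where(labels == target)[0]
        # Split it with 2-means (a stand-in for the splitting rule)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_id
    return labels

# Made-up data: three separated blobs of 15 points each
div_rng = np.random.default_rng(1)
div_blobs = np.vstack([div_rng.normal(c, 0.3, (15, 2)) for c in (0, 4, 8)])
div_labels = divisive_sketch(div_blobs, n_clusters=3)
print(div_labels)
```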

### Scale Data

In [None]:
dfH = pd.read_csv('https://raw.githubusercontent.com/DUanalytics/datasets/master/csv/clusteringMallCustomers.csv')

In [None]:
dfH.columns = ['customerID','gender','age','annualIncome','spendingScore']
dfH.head()

In [None]:
dfH.drop(['gender'], axis=1, inplace=True)
dfH.head()

In [None]:
from sklearn.preprocessing import normalize
data_scaled = normalize(dfH)
data_scaled = pd.DataFrame(data_scaled, columns=dfH.columns)
data_scaled.head()
# normalize() rescales each row to unit L2 norm, so all variables end up on a similar scale (between 0 and 1)

In [None]:
data_scaled.describe()
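
It may help to see that `normalize` works row by row, whereas `StandardScaler` works column by column. The sketch below contrasts the two on made-up rows; which one suits a given clustering task is a design choice this notebook does not settle.

```python
import numpy as np
from sklearn.preprocessing import normalize, StandardScaler

# Made-up rows: (age, annualIncome, spendingScore)
scale_raw = np.array([
    [20.0, 15.0, 80.0],
    [35.0, 60.0, 50.0],
    [60.0, 120.0, 10.0],
])

row_normalized = normalize(scale_raw)                         # each ROW rescaled to unit L2 norm
col_standardized = StandardScaler().fit_transform(scale_raw)  # each COLUMN to mean 0, std 1

print(np.linalg.norm(row_normalized, axis=1))  # row norms
print(col_standardized.mean(axis=0))           # column means
print(col_standardized.std(axis=0))            # column standard deviations
```

Row normalisation preserves the proportions within a customer's record, while column standardisation puts every feature on a comparable scale across customers.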

In [None]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))  
plt.title("Dendrograms")  
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

The x-axis contains the samples and the y-axis represents the distance between these samples. The vertical line with the maximum distance is the blue line, and hence we can decide a threshold of x and cut the dendrogram:
https://youtu.be/ijUMKMC4f9I

In [None]:
plt.figure(figsize=(10, 7))  
plt.title("Dendrograms")  
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=2, color='r', linestyle='--')

In [None]:
# height is max when joining two cluster.. 
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
#print(dend)
print(set(dend['color_list']))

In [None]:
from sklearn.cluster import AgglomerativeClustering
# The 'affinity' argument was removed in newer scikit-learn; ward linkage always uses Euclidean distance
cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')
cluster.fit_predict(data_scaled)

In [None]:
plt.figure(figsize=(10, 7))  
plt.scatter(data_scaled['age'], data_scaled['annualIncome'], c=cluster.labels_) 

In [None]:
plt.figure(figsize=(10, 7))  
plt.scatter(dfH['age'], dfH['annualIncome'], c=cluster.labels_) 
plt.show();

https://plotly.com/python/dendrogram/
https://www.youtube.com/watch?v=4DInt3H2UNE    