## Table of Contents

1. **[Header Files](#lib)**
2. **[About Data Set](#about)**
3. **[Data Preparation](#prep)**
    - 3.1 - **[Read Data](#read)**
    - 3.2 - **[Analysing Missing Values](#miss)**
    - 3.3 - **[Analysing Outliers](#outliers)**
    - 3.4 - **[Analysing the data set](#dt)**
    - 3.5 - **[Scaling](#scale)**   
    - 3.6 - **[Encoding](#encode)** 
    
4. **[Determining Optimal Linkage Method](#ol)**
5. **[Visualizing the clusters](#vis)**
6. **[Agglomerative Clustering](#ag)**
7. **[KMeans Clustering](#kmeans)**
8. **[Principal Component Analysis](#PCA)**
9. **[Kernel PCA](#kpca)**
10. **[Density Based Clustering](#dbscan)**

<a id='lib'></a>
## 1. Header Files

In [None]:
import pandas as pd 
import numpy as np 

import seaborn as sns
import matplotlib.pyplot as plt

# Header Files for Data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# Header Files for finding optimal linkage for clustering
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Header files for visualizing the clusters
from scipy.cluster.hierarchy import linkage,dendrogram,cut_tree
from yellowbrick.cluster import SilhouetteVisualizer

# Header Files for Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

# Header Files for KMeans Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Header files for dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA

# Header files for DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN


<a id='about'></a>
## 2. About Data Set


Customer ID - Unique identification of customer

Gender - Sex of the customer

Age - Age of customer

Annual Income - Annual income of the customer, in thousands of dollars (k$)

Spending Score - Score from 1 to 100 reflecting the customer's readiness to spend money

<a id='prep'></a>
## 3. Data Preparation

<a id='read'></a>
### 3.1 Read the data

In [None]:
df=pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()

<a id='miss'></a>
### 3.2 Analysing Missing Values

In [None]:
df.isnull().sum()

Note: The data has no missing values.

<a id='outliers'></a>
### 3.3 Analysing Outliers

In [None]:
for x in df.select_dtypes(np.number).columns:
    sns.boxplot(x=df[x])
    plt.show()

Note: The box plots show very few outliers.

<a id='dt'></a>
### 3.4 Analysing the data set

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
# Analysing range of numerical columns
df.describe().T[['min','max']]

In [None]:
# Analysing categorical Variables
for x in df.select_dtypes(exclude=np.number):
    print(df[x].value_counts())

In [None]:
df=df.drop('CustomerID',axis=1)

In [None]:
# Maintaining a copy of the data
data=df.copy()
df_num=df.select_dtypes(np.number)
df_cat=df.select_dtypes(exclude=np.number)

<a id='scale'></a>
### 3.5 Scaling

In [None]:
ss=StandardScaler()
df_nums=pd.DataFrame(ss.fit_transform(df_num),columns=df_num.columns)
df_nums
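As a rough illustration of what the scaling step computes, the sketch below standardizes a small made-up matrix by hand with NumPy. The toy values are hypothetical and are not taken from the data set. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np

# Hypothetical toy matrix: rows are customers, columns are two numeric features.
X = np.array([[19.0, 15.0],
              [21.0, 15.0],
              [20.0, 16.0],
              [23.0, 16.0],
              [31.0, 17.0]])

# Column-wise mean and population standard deviation (ddof=0).
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# z-score: every column ends up with mean 0 and standard deviation 1.
Z = (X - mu) / sigma
print(Z.round(3))
```

Scaling matters here because distance-based methods would otherwise be dominated by whichever feature has the largest numeric range.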

<a id='encode'></a>
### 3.6 Encoding

In [None]:
df1=pd.concat([df_nums,df_cat],axis=1)
df1.head(2)

In [None]:
df_processed=pd.get_dummies(df1,columns=['Gender'])
df_processed.head()

<a id='ol'></a>
## 4. Determining Optimal Linkage Method

In [None]:
# Method with highest cophenetic score is the optimal linkage method

coph=dict()
for method in ['ward','average','complete','single']:
    mergings=linkage(df_processed,method=method)
    c,d=cophenet(mergings,pdist(df_processed))
    coph[method]=c
print(coph)

# max(coph) alone would compare the method names; key=coph.get compares the scores
print('\nOptimal Linkage Method:',max(coph,key=coph.get))
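When the best method is picked from a dictionary of scores, the comparison has to be made on the values rather than on the keys. A minimal sketch, using made-up scores that are purely illustrative and not results from this data set:

```python
# Hypothetical cophenetic scores, for illustration only.
scores = {'ward': 0.61, 'average': 0.72, 'complete': 0.66, 'single': 0.58}

# max() over a dict iterates its keys, so this returns the alphabetically last name.
by_key = max(scores)

# Passing key=scores.get compares the stored values instead.
by_value = max(scores, key=scores.get)

print(by_key, by_value)
```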

<a id='vis'></a>
## 5. Visualizing the clusters

In [None]:
df_cluster=df_processed.copy()

In [None]:
# Dendrogram used to find an approximate value of k
mergings=linkage(df_processed,method='ward',metric='euclidean')
dendrogram(mergings,truncate_mode='lastp')
plt.show()

Note:

1. Reading the optimal number of clusters off a dendrogram can be ambiguous
    
2. Hierarchical clustering has a high time complexity

In [None]:
# cut_tree returns an (n, 1) array, so flatten it before assigning it to a column
df_cluster['cluster']=cut_tree(mergings,n_clusters=4).ravel()
df_cluster.head()

In [None]:
#Analysing The Cluster
df_cluster.cluster.value_counts()

<a id='ag'></a>
## 6. Agglomerative Clustering

Logic - Each observation starts as its own cluster. At every step the two most similar clusters are merged, and this continues until all observations are fused into a single cluster (or the requested number of clusters is reached).

Note - Does not work well with very large data sets, because the computational cost is very high.
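The merge loop described above can be sketched in a few lines of NumPy. This is a naive illustration written for readability, not the scikit-learn implementation. It uses single linkage for simplicity, whereas `AgglomerativeClustering` defaults to Ward linkage. The function name and the toy data are hypothetical.

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Single-linkage agglomerative clustering, written for clarity rather than speed."""
    clusters = [[i] for i in range(len(X))]          # every point starts alone
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical toy data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = naive_agglomerative(X, n_clusters=2)
print(labels)
```

The nested search over all cluster pairs at every merge is what makes the approach expensive on large data sets.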

In [None]:
model=AgglomerativeClustering(n_clusters=4)
model.fit(df_processed)
df_cluster['ag_cluster']=model.labels_
model.labels_

In [None]:
df_cluster.head()

In [None]:
df_cluster.ag_cluster.value_counts()

In [None]:
df_processed.columns

In [None]:
# Visualizing the cluster
sns.scatterplot(x=df['Annual Income (k$)'],y=df['Spending Score (1-100)'],hue=df_cluster.ag_cluster)
plt.show()

<a id='kmeans'></a>
## 7. KMeans Clustering (Lloyd's Algorithm)

Logic: Clusters data by separating it into groups of equal variance, minimizing the within-cluster sum of squares.

Note: A clustering is considered good when

1. Clusters are tightly packed

2. Clusters are well separated
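For intuition, Lloyd's algorithm alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a bare-bones NumPy version under simple assumptions (random initialisation from the data, Euclidean distance). `lloyd_kmeans` and the toy data are hypothetical, and scikit-learn's `KMeans` adds refinements such as k-means++ initialisation.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points in each cluster (keep old centroid if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                                    # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
print(centroids.round(3))
```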

### 7.1 Optimal Value of K for KMeans clustering

Two methods are used here to choose the optimal value of K

1. Elbow Plot

2. Silhouette Method

### 7.1.1 Elbow Plot

The aim of plotting an elbow plot is to find a value of k such that the variance within clusters is low and the number of clusters is not too large to interpret.

In [None]:
# wcsse is the within-cluster sum of squared errors
wcsse=[]

for k in np.arange(2,8):
    model=KMeans(n_clusters=k,random_state=5)
    model.fit(df_processed)
    wcsse.append(model.inertia_)
    
plt.plot(np.arange(2,8),wcsse)
plt.axvline(4,c='red')
plt.xlabel('No. Of Clusters')
plt.ylabel('wcsse')
plt.show()
    

The value of k is selected at the point where the curve bends to form an elbow, hence the name elbow plot (here, k = 4).
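The quantity plotted on the y-axis can be computed by hand. The sketch below is a small worked example on hypothetical one-dimensional points. It assumes each centroid is the mean of its cluster, which is the case for a converged KMeans fit, so the result matches what `inertia_` reports there.

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical toy data: two pairs of points far apart on a line.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
one_cluster = wcss(X, np.array([0, 0, 0, 0]))    # deviations -6, -4, 4, 6 -> 104
two_clusters = wcss(X, np.array([0, 0, 1, 1]))   # deviations of 1 each -> 4
print(one_cluster, two_clusters)
```

The sharp drop from one to two clusters, followed by much smaller gains for larger k, is what produces the elbow shape.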

### 7.1.2 Silhouette Method

In [None]:
score=[]
for k in np.arange(2,6):
    model=KMeans(n_clusters=k,random_state=5)
    cluster=model.fit_predict(df_processed)
    score.append(silhouette_score(df_processed,cluster))

plt.plot(np.arange(2,6),score)
plt.axvline(4,c='red')
plt.xlabel('No. Of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

The number of clusters at which the silhouette score is highest is taken as the optimal value.
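The silhouette of a single point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The overall score is the mean of s over all points. A minimal NumPy sketch on hypothetical toy data (the function name is illustrative):

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                      # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == j].mean()
                for j in np.unique(labels) if j != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two pairs of points far apart on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s.round(3), s.mean().round(3))
```

Values close to 1 indicate tight, well separated clusters, while values near 0 indicate points sitting between clusters.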

### 7.1.3 Silhouette Visualizer

In [None]:
clust_mod=KMeans(n_clusters=3,random_state=5)
viz=SilhouetteVisualizer(clust_mod)
viz.fit(df_processed)
plt.show()

In [None]:
clust_mod=KMeans(n_clusters=4,random_state=5)
viz=SilhouetteVisualizer(clust_mod)
viz.fit(df_processed)
plt.show()

In [None]:
clust_mod=KMeans(n_clusters=5,random_state=5)
viz=SilhouetteVisualizer(clust_mod)
viz.fit(df_processed)
plt.show()

### 7.1.4 KMeans Model

In [None]:
model=KMeans(n_clusters=4,random_state=5)
cluster=model.fit_predict(df_processed)

In [None]:
# Visualizing the cluster
sns.scatterplot(x=df['Annual Income (k$)'],y=df['Spending Score (1-100)'],hue=cluster)
plt.show()


In the middle there is no clear separation between the clusters. This could be because the two features chosen for the plot do not explain the maximum variance.

Dimensionality reduction techniques like PCA or KPCA can be used to find the best vectors to represent clusters in lower dimensions.

<a id='PCA'></a>
## 8. Principal Component Analysis

PCA is a method used to represent the data in lower dimensions by creating new features that capture maximum variance.

In [None]:
pca=PCA()
pca.fit(df_processed)
np.cumsum(pca.explained_variance_ratio_)*100
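To show where the explained variance ratio comes from, the sketch below performs PCA by hand with a singular value decomposition. It runs on randomly generated toy data (hypothetical, not the mall data) and is only an illustration of the idea, not a replacement for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points stretched mostly along the first axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)                     # PCA works on mean-centred data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S ** 2 / (len(X) - 1)  # variance along each component
ratio = explained_variance / explained_variance.sum()
scores = Xc @ Vt.T                          # data expressed in the new axes

print(np.cumsum(ratio) * 100)
```

The rows of `Vt` are the principal directions, ordered from the largest variance to the smallest, and the columns of `scores` are uncorrelated with each other.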

In [None]:
df_pca=pd.DataFrame(pca.transform(df_processed))
col=['PC'+str(x) for x in np.arange(1,df_pca.shape[1]+1)]
df_pca.columns=col
df_pca.head()

In [None]:
# Finding optimal number of clusters
wcsse=[]

for k in np.arange(2,8):
    model=KMeans(n_clusters=k,random_state=5)
    model.fit(df_pca)
    wcsse.append(model.inertia_)
    
plt.plot(np.arange(2,8),wcsse)
plt.axvline(4,c='red')
plt.xlabel('No. Of Clusters')
plt.ylabel('wcsse')
plt.show()

In [None]:
model=KMeans(n_clusters=4,random_state=5)
cluster=model.fit_predict(df_pca)

# Visualizing the cluster

sns.scatterplot(x=df_pca['PC1'],y=df_pca['PC2'],hue=cluster)
plt.show()

<a id='kpca'></a>
## 9. Kernel PCA

Kernel PCA uses a kernel function to implicitly project non-linear data into a higher-dimensional space, in order to make it linearly separable, and then applies PCA there.
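The steps behind kernel PCA can be sketched directly in NumPy. The version below assumes an RBF kernel with a hand-picked `gamma`; both are illustrative choices and not settings taken from this notebook. The function name and the random toy data are hypothetical.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel, following the usual textbook steps."""
    # 1. Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq_dists)
    # 2. Centre the kernel matrix in feature space.
    n = len(X)
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # 3. Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # 4. Projections of the training points: eigenvector * sqrt(eigenvalue).
    return eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))        # hypothetical toy data
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
print(Z[:3].round(3))
```

The data are never mapped to the higher-dimensional space explicitly; only the kernel matrix of pairwise similarities is needed.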

In [None]:
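# Note: scikit-learn's default kernel is 'linear'; pass e.g. kernel='rbf' for a non-linear projection
kpca= KernelPCA(n_components=2)
# fit_transform fits the model and projects the data in one step
df_kpca=pd.DataFrame(kpca.fit_transform(df_processed),columns=['PC1','PC2'])
df_kpca.head()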

In [None]:
# Finding optimal number of clusters
wcsse=[]

for k in np.arange(2,8):
    model=KMeans(n_clusters=k,random_state=5)
    model.fit(df_kpca)
    wcsse.append(model.inertia_)
    
plt.plot(np.arange(2,8),wcsse)
plt.axvline(4,c='red')
plt.xlabel('No. Of Clusters')
plt.ylabel('wcsse')
plt.show()

In [None]:
model=KMeans(n_clusters=4,random_state=5)
cluster=model.fit_predict(df_kpca)
sns.scatterplot(x=df_kpca['PC1'],y=df_kpca['PC2'],hue=cluster)
plt.show()

Compared with the PCA results, the clusters do not overlap at the edges. However, KPCA is mainly useful when the points are not linearly separable.

<a id='dbscan'></a>
## 10. Density Based Clustering (DBSCAN)

DBSCAN forms clusters of arbitrary, non-linear shapes. It is also widely used for outlier detection: points in regions that are not densely populated are considered to be outliers.

DBSCAN has 2 main parameters to be considered:

1. eps - Radius of the neighbourhood of a data point

2. min_samples - Number of points required inside the eps neighbourhood for a point to be considered a core point
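How the two parameters interact can be seen in a short NumPy sketch that flags core points. It follows the scikit-learn convention that a point counts as its own neighbour. The function name and toy data are hypothetical.

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """True for points with at least min_samples points (itself included) within eps."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbour_counts = (dist <= eps).sum(axis=1)
    return neighbour_counts >= min_samples

# Hypothetical toy data: a dense group of four points and one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
mask = core_point_mask(X, eps=0.5, min_samples=3)
print(mask)
```

A point that is not a core point is only treated as noise if it also has no core point within `eps`; otherwise it becomes a border point of that cluster. In this toy example the isolated point has no neighbours at all, so it would be labelled noise.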

### 10.1 Finding optimal value of epsilon

In [None]:
nn=NearestNeighbors(n_neighbors=4)
nn.fit(df_processed)

In [None]:
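# Sorted distance from every point to its farthest returned neighbour;
# the "knee" of this curve is a common starting guess for eps
distance,index=nn.kneighbors(df_processed)
plt.plot(np.sort(distance[:,3]))
plt.show()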
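The curve plotted above is often called a k-distance plot. A minimal NumPy sketch of the same computation on hypothetical toy data shows why isolated points end up at the far right of the sorted curve:

```python
import numpy as np

def k_distance(X, k):
    """Sorted distance from every point to its k-th nearest neighbour (self excluded)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # After sorting each row, column 0 is the point itself (distance 0).
    return np.sort(np.sort(dist, axis=1)[:, k])

# Hypothetical toy data: four points on a unit square and one far-away point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
kd = k_distance(X, k=2)
print(kd.round(3))
```

Points in dense regions have small k-distances, so the flat part of the curve describes the clusters and the sudden rise marks the distance beyond which points start to look like outliers.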

### 10.2 DBSCAN

In [None]:
df_db=df_kpca.copy()

In [None]:
model=DBSCAN(eps=0.9,min_samples=5)
df_db['cluster']=model.fit_predict(df_processed)
df_db.cluster.value_counts()

The label -1 marks outliers (noise points).

In [None]:
# Visualizing the cluster to identify outliers
sns.scatterplot(x=df_db['PC1'],y=df_db['PC2'],hue=df_db['cluster'],palette=['Red','Yellow','Pink'])
plt.show()

Points in red are the outliers.

In [None]:
# Removing the identified outliers
df_db=df_db[df_db.cluster != -1]
# Take both axes from the filtered frame so that x, y and hue have the same rows
sns.scatterplot(x=df_db['PC1'],y=df_db['PC2'],hue=df_db['cluster'],palette=['Yellow','Pink'])
plt.show()