### <b>Clustering Guide</b>
<br>



> <center><img src="https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif"></center>



##### <b> What is Clustering? </b>
Clustering is the task of dividing a population or set of data points into groups, so that the data points in one group are more similar to each other than to the data points in other groups. In simple words, the goal is to separate the data into groups with similar traits and assign each group to a cluster.



Cluster analysis can be feature-based, where we look for subgroups of samples according to their features, or sample-based, where we look for subgroups of features according to the samples. This notebook covers feature-based clustering. Clustering is used in market segmentation, where we try to find customers who are similar to each other in behavior or attributes; in image segmentation and compression, where we try to group similar regions; in grouping documents by topic; and so on.

<br>



<b> Context </b>
This data set was created solely for learning the concepts of customer segmentation (the data set description also calls this market basket analysis, although that term usually refers to analyzing which items are bought together). I will demonstrate this using an unsupervised ML technique (the KMeans clustering algorithm) in the simplest way.


<br>

<b> Problem statement </b>
  The mall owner wants to understand which customers are easiest to convert [target customers], so that this insight can be passed to the marketing team, who can then plan their strategy accordingly.



<b> Inspiration </b>
At the end of this case study, you will be able to answer the questions below.
* 1- How can customer segmentation be achieved with a machine learning algorithm (KMeans clustering) in Python in the simplest way?
* 2- Who are the target customers with whom the marketing strategy can start [easy to talk to]?
* 3- How does the marketing strategy work in the real world?

<br>
<hr>
<br>



<p align=center>
<img src="https://miro.medium.com/max/1280/1*5UHmgCaTD8EegsPuKcxC1Q.png" width="50%"></p>





<br>
<hr>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

%matplotlib inline 
import warnings
warnings.filterwarnings('ignore')

from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN

In [None]:
path = '../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv'
data = pd.read_csv(path)
data.head()

In [None]:
data.describe()

In [None]:
data.dtypes

### Data visualization 


<br>
<hr>

In [None]:
plt.figure(figsize = (15 , 6))
count = 0 
for feature in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    count += 1
    plt.subplot(1,3,count)
    plt.subplots_adjust(hspace=0.5 , wspace=0.5)
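    # histplot replaces distplot, which is deprecated in recent seaborn versions
    sns.histplot(data[feature], bins=20, kde=True, stat='density')
    plt.title('Distribution of {}'.format(feature))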
plt.show()

In [None]:
plt.figure(figsize=(15,7))
count = 0 
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    for y in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
        count += 1
        plt.subplot(3,3, count)
        plt.subplots_adjust(hspace=0.5, wspace=0.5)
        sns.regplot(x=x, y=y, data=data)
        plt.ylabel(y.split()[0]+' '+y.split()[1] if len(y.split()) > 1 else y )
plt.show()

<br>
<hr>
<br>
<br>


### <b>K-Means</b> 


<br>

The K-means algorithm is an iterative algorithm that attempts to partition the data set into K predefined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters so that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all data points that belong to that cluster) is minimal. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
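
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `nearest_centroid` and `lloyd_kmeans` are made up for this example, the starting centroids are random data points (not the `k-means++` initialization used elsewhere in this notebook), and the toy data is synthetic rather than the mall data set.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) for every point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: plain assign/update iteration, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: every point joins its nearest centroid.
        labels = nearest_centroid(points, centroids)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
toy_rng = np.random.default_rng(42)
toy = np.vstack([toy_rng.normal(0, 0.5, size=(50, 2)),
                 toy_rng.normal(5, 0.5, size=(50, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

On data like this the two centroids should settle near the blob centers; on real data the result depends on the starting centroids, which is why `KMeans` runs several initializations (`n_init`).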


<br>


The objective function is: 
<br>

<p align=center>
<img src="https://miro.medium.com/max/548/1*myXqNCTZH80uvO2QyU6F5Q.png" width="50%"></p>
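
Assuming the objective shown is the usual within-cluster sum of squares, $J = \sum_i \sum_k w_{ik} \lVert x_i - \mu_k \rVert^2$ with $w_{ik} = 1$ only when point $x_i$ belongs to cluster $k$, it can be checked by hand against the `inertia_` attribute of a fitted `KMeans` model. The data and the names `toy`, `km` and `wcss` below are made up for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
toy = rng.normal(size=(60, 2))  # synthetic points, not the mall data set

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# Look up the centroid of the cluster each point was assigned to,
# then sum the squared distances between points and their own centroid.
own_centroid = km.cluster_centers_[km.labels_]
wcss = ((toy - own_centroid) ** 2).sum()

print(wcss, km.inertia_)
```

The two printed values should agree up to floating-point error, which is why inertia is used as the quantity to inspect when comparing values of K.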






<hr>
<br>

In [None]:
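# features (CustomerID is only a row identifier, so it is left out)
X = data[['Gender', 'Age', 'Annual Income (k$)',
       'Spending Score (1-100)']].copy()

# One hot encoding: Gender is binary, so a single 0/1 column is enough
encoding = OneHotEncoder(drop='if_binary')
X['Gender'] = encoding.fit_transform(X[['Gender']]).toarray().ravel()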


# StandardScaler 
scaler = StandardScaler()
X = scaler.fit_transform(X)


# K-Means 
model = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=100, random_state=42)
model.fit(X)


# labels 
labels = model.labels_


# centroids 
centroids = model.cluster_centers_


# number of clusters
print('Number of clusters: ', model.n_clusters)


# metrics 
print('Silhouette:', silhouette_score(X, labels))
print('Davies-Bouldin:', davies_bouldin_score(X, labels))

<br>
<br>


### Inertia vs. Number of Clusters

Plotting inertia against the number of clusters is a technique for identifying a good value of K: one that makes the clusters compact, with little dispersion of the points around each centroid.

<br>

In [None]:
# set of features 
X1 = data[['Age' , 'Spending Score (1-100)']]
inercia = []
for k in range(1 , 11):
    kmeans = KMeans(n_clusters=k,
                     init='k-means++',
                     n_init=10,
                     max_iter=100, 
                     random_state=42,
                     algorithm='elkan')
    kmeans.fit(X1)
    inercia.append(kmeans.inertia_)

In [None]:
# Curve: inertia vs. number of clusters

plt.figure(figsize = (15 ,6))
plt.plot(np.arange(1 , 11), inercia, 'o')
plt.plot(np.arange(1 , 11) , inercia, '-', color='green',alpha = 0.5, lw=2)
plt.title('Inertia vs. Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
# Defined cluster centroids
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
print('Silhouette:', silhouette_score(X, labels))
print('Davies-Bouldin:', davies_bouldin_score(X, labels))

In [None]:
# Silhouette coefficient of each point

fig, ax = plt.subplots(figsize=(10,5))
plt.title('Silhouette coefficient', fontsize=15)
visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick', ax=ax)
visualizer.fit(X)
plt.tight_layout()

<br>


##### <b> Elbow method </b> 

The “elbow” method helps data scientists select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. In the visualizer, the “elbow” is annotated with a dashed line.
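
One simple way to locate such an elbow numerically is to pick the point of the curve that lies farthest from the straight line joining its first and last points. The sketch below is just that heuristic, written for illustration: the function name `elbow_by_chord_distance` and the inertia values are made up, and this is not necessarily the rule the visualizer applies.

```python
import numpy as np

def elbow_by_chord_distance(ks, inertias):
    """Hypothetical helper: K whose point is farthest from the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so that neither one dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance of every point from the chord (x[0], y[0]) -> (x[-1], y[-1]).
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[dist.argmax()])

example_ks = [1, 2, 3, 4, 5, 6]
example_inertias = [1000, 400, 150, 120, 100, 90]  # made-up values with a bend at K = 3
elbow_k = elbow_by_chord_distance(example_ks, example_inertias)
print(elbow_k)
```

With real inertia curves the bend is often less pronounced, so it is worth checking the suggested K against the silhouette and Davies-Bouldin scores as well.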

<br>

In [None]:
# Elbow method

fig, ax = plt.subplots(figsize=(10,5))
plt.title('KElbow', fontsize=15)
visualizer = KElbowVisualizer(KMeans(), k=(1,11), ax=ax)  # draw on the axes created above
visualizer.fit(X)
plt.tight_layout()

<br>
With the elbow method we look for a low K value that already gives a much reduced inertia. Inertia alone cannot be the criterion, because it keeps decreasing as K grows and reaches zero when K equals the number of instances; the central goal is therefore a low K with a low inertia.
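
A small experiment makes the limiting case concrete: with as many clusters as points, every point becomes its own centroid and the inertia drops to zero. The eight random points and the names `tiny` and `tiny_inertias` below are invented for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
tiny = rng.normal(size=(8, 2))  # eight synthetic points

# Fit K-means for every K from 1 up to the number of points and record the inertia.
tiny_inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(tiny).inertia_
    for k in range(1, len(tiny) + 1)
]
print(np.round(tiny_inertias, 3))
```

The last value is (numerically) zero, yet a model with one cluster per customer tells us nothing, which is why the bend of the curve matters more than its minimum.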

<br>
<hr>

### Hierarchical Clustering
<br>

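Hierarchical clustering is another unsupervised clustering method. It builds a hierarchy of clusters either by successively merging smaller clusters (agglomerative) or by successively splitting larger ones (divisive).
In general, merges and splits are determined in a greedy way. The results of hierarchical clustering are usually presented in a dendrogram.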
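
To see the greedy merging at work, here is a minimal sketch on five made-up one-dimensional points, using SciPy's `linkage` with single linkage. Each row of the returned linkage matrix records one merge: the two cluster ids joined, the distance at which they were joined, and the size of the new cluster. The names `pts`, `merges` and `flat` are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs and one distant point (synthetic data).
pts = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])

# Single linkage: at every step, merge the two clusters whose closest members are nearest.
merges = linkage(pts, method='single', metric='euclidean')
for a, b, dist, size in merges:
    print(f"merge {int(a)} + {int(b)} at distance {dist:.1f} -> cluster of size {int(size)}")

# Cut the hierarchy so that at most three flat clusters remain.
flat = fcluster(merges, 3, criterion='maxclust')
print(flat)
```

The closest pairs are merged first and the isolated point joins last; the dendrogram is a drawing of exactly this sequence of merges and their distances.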


<br>


<p align=center>
<img src="https://miro.medium.com/max/1254/1*KpYv1mhEaJbbafiC5udHPA.png" width="50%"></p>


<br>

In [None]:
# with scikit-learn (Ward linkage, the default, always uses Euclidean distance)
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
hc.fit(X)
labels_hc = hc.labels_


# metrics
print('Silhouette:', silhouette_score(X, labels_hc))
print('Davies-Bouldin:', davies_bouldin_score(X, labels_hc))

In [None]:
# with scipy (linkage defaults to method='single'; the result is the linkage matrix)
distance_matrix = linkage(X, metric='euclidean')

# dendrogram 
fig, ax = plt.subplots(figsize=(14,7))
dendograma = dendrogram(distance_matrix, ax=ax)
plt.show()

In [None]:
hierarquico = fcluster(distance_matrix, 4, criterion='maxclust')
print('Silhouette: {} '.format(silhouette_score(X,hierarquico)))

<br>
<hr>
<br>


### DBSCAN 
<br>


DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It finds core samples of high density and expands clusters from them. It works well for data containing clusters of similar density.

##### <b> How DBSCAN works </b>

Consider a set of points in some space to be clustered. Let ε be a parameter that specifies the radius of a neighborhood around a point. For the purposes of DBSCAN clustering, points are classified as core points, (density-)reachable points, and outliers, as follows:


<br>

<p align=center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/DBSCAN-Illustration.svg/400px-DBSCAN-Illustration.svg.png" width="50%"></p>

<br>


* A point p is a core point if at least minPts points are within distance ε of it (including p).
* A point q is directly reachable from p if q is within distance ε of the core point p. Points are only said to be directly reachable from core points.
* A point q is reachable from p if there is a path p<sub>1</sub>, ..., p<sub>n</sub> with p<sub>1</sub> = p and p<sub>n</sub> = q, where each p<sub>i+1</sub> is directly reachable from p<sub>i</sub>. Note that this implies that the starting point and all points on the path must be core points, with the possible exception of q.
* All points not reachable from any other point are outliers or noise.

Now, if p is a core point, it forms a cluster together with all the points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points may be part of a cluster, but they form its "edge", as they cannot be used to reach more points.
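
The definitions above translate almost directly into NumPy. The sketch below labels each point as core, border (a non-core point within ε of a core point, i.e. on the cluster "edge") or noise. The function `classify_points`, the label strings and the six demo points are assumptions made for this illustration; they are not part of scikit-learn's `DBSCAN` API.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Hypothetical helper: tag every point as 'core', 'border' or 'noise'."""
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # neighbors[i, j] is True when j lies within eps of i (a point is its own neighbor).
    neighbors = dist <= eps
    # Core point: at least min_pts points in its eps-neighborhood, itself included.
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border point: not core, but directly reachable from at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(near_core, 'border', 'noise'))

demo = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],  # dense square
    [1.2, 0.5],                                       # close to one corner only
    [5.0, 5.0],                                       # isolated point
])
kinds = classify_points(demo, eps=0.8, min_pts=4)
print(kinds)
```

The four corners of the square are core points, the fifth point sits on the edge of their cluster, and the isolated point is noise.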

<br>
<br>

##### <b> Steps of the algorithm </b>

The DBSCAN algorithm can be abstracted into the following steps:

  1. Find the points in the ε (eps) neighborhood of every point and identify the core points, those with more than minPts neighbors.
  2. Find the connected components of the core points on the neighbor graph, ignoring all non-core points.
  3. Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor; otherwise, assign it to noise.

A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory.
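
The three steps can be written out as a naive sketch that stores the full neighborhood matrix, exactly the memory-hungry approach mentioned above. Everything here is illustrative: `dbscan_three_steps` is a made-up name, the core test uses "at least `min_pts` points, the point itself included" (the convention of the definitions earlier in this section), and a border point that touches several clusters is simply given to the first one found.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def dbscan_three_steps(points, eps, min_pts):
    """Hypothetical naive DBSCAN; returns one label per point, -1 meaning noise."""
    # Step 1: eps-neighborhoods of every point and the resulting core points.
    diffs = points[:, None, :] - points[None, :, :]
    neighbors = np.sqrt((diffs ** 2).sum(axis=2)) <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = np.full(len(points), -1)
    core_idx = np.flatnonzero(is_core)
    if core_idx.size == 0:
        return labels  # no dense region: everything is noise
    # Step 2: connected components of the neighbor graph restricted to core points.
    core_graph = csr_matrix(neighbors[np.ix_(core_idx, core_idx)].astype(np.int8))
    _, component = connected_components(core_graph, directed=False)
    labels[core_idx] = component
    # Step 3: attach each non-core point to the cluster of a core neighbor, if it has one.
    for i in np.flatnonzero(~is_core):
        core_neighbors = np.flatnonzero(neighbors[i] & is_core)
        if core_neighbors.size:
            labels[i] = labels[core_neighbors[0]]
    return labels

demo = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],          # first dense square
    [1.2, 0.5],                                               # edge of the first cluster
    [5.0, 5.0],                                               # isolated point
    [10.0, 10.0], [10.5, 10.0], [10.0, 10.5], [10.5, 10.5],  # second dense square
])
demo_labels = dbscan_three_steps(demo, eps=0.8, min_pts=4)
print(demo_labels)
```

The pairwise matrix grows quadratically with the number of points, so this is only practical for small data; library implementations typically query neighborhoods through a spatial index instead.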







<br>

In [None]:
# DBSCAN 
dbscan = DBSCAN(eps=1, min_samples=5, metric='euclidean', algorithm='auto')
dbscan.fit(X)
labels_db = dbscan.labels_


# note: noise points (label -1) are scored as if they formed one more cluster
print('Silhouette:', silhouette_score(X, labels_db))
print('Davies-Bouldin:', davies_bouldin_score(X, labels_db))

<br>



#### <b> I hope this notebook helps with your studies! Thank you </b> 

<br>
<hr>