### Kmeans Clustering 
* K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids. [Wikipedia] 
<br><br>
* Clustering is a type of unsupervised machine learning used to separate data points into groups, or clusters. The goal is to increase the similarity between members of the same cluster while decreasing the similarity between members of different clusters. 
<br><br>
* Problems tackled with clustering techniques such as Kmeans involve an input space of data (features) without any predefined target; these are known as unsupervised learning problems. In such problems we have only the independent variables and no target variable.
<br><br>
* The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) once the centroids converge, that is, once the assignments match those of the previous iteration. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of k-means is to minimize this value. 
<br><br>
* Clustering process with Kmeans: <br>  <br>
 1. Specify the number of clusters, K.
 2. Randomly choose K centroids.
 3. Repeat:<br>
  3.1 : Assign each data point to its nearest centroid. <br>
  3.2 : Recompute the centroid of each cluster (the mean of its data points).
 4. Stop when the cluster assignments no longer change. 
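
The steps above, together with the SSE definition, can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the synthetic two-blob data are assumptions made for the example, and the initial centroids are drawn uniformly from the data points rather than with `k-means++`.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper); returns (labels, centroids, sse)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3.1: squared Euclidean distance of every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3.2: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # SSE: sum of squared distances of each point to its assigned centroid.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Synthetic data (an assumption for the demo): two well-separated blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids, sse = kmeans_sketch(points, k=2)
print(centroids.round(2))
print(f'SSE: {sse:.2f}')
```

The returned `sse` is the same quantity that scikit-learn exposes as `inertia_` on a fitted `KMeans` model.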


### Imports

In [None]:
import os
import numpy as np 
import pandas as pd 
from kneed import KneeLocator
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt 
import seaborn as sb
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")
print(os.listdir("../input"))
print(os.listdir("../input/customer-segmentation-tutorial-in-python"))

### Loading data

* We are working with a mall customers dataset (for market basket analysis) offered on Kaggle, with features such as age, gender, annual income and spending score. Spending Score is something you assign to the customer based on your own defined parameters, such as customer behavior and purchasing data.
* Our first task is to explore and analyze the dataset (EDA) to get some meaningful insights out of it. This helps us understand the relations between variables and select only the important features before feeding them to the kmeans algorithm. 
* Our second task is to apply the Kmeans technique to segment customers into different clusters. 

In [None]:
#get data 
data=pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
data.head()

In [None]:
#get some information on our data 
data.info()

In [None]:
#show the shape of the dataframe 
data.shape

In [None]:
#describe data
data[data.columns.difference(['CustomerID'])].describe().transpose()

In [None]:
#check if there are any null values 
data.isnull().any()

* No Null Values. Let's proceed.. 

### Data Visualization

In [None]:
male=data[data.Gender=='Male']
female=data[data.Gender=='Female']
print(f'Percentage of Male Customers : {len(male)/data.shape[0]*100} %')
print(f'Percentage of Female Customers : {len(female)/data.shape[0]*100} %')

In [None]:
#count customers by gender 
plt.figure(2 , figsize = (8 , 5))
sb.countplot(x = 'Gender' , data = data)
plt.show()

In [None]:
# Let's plot the pairplot for the dataset scattered by Gender
sb.set_style('whitegrid')
plt.figure(1 , figsize = (15 , 6))
sb.pairplot(data.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
plt.show()

* As the pairplot above shows, the data does not exhibit any clear pattern by customer gender, so we leave gender out of the clustering features. 
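
One way to back up the visual impression with numbers would be to compare per-gender summary statistics. The sketch below is only an illustration: the helper `summarize_by_gender` is hypothetical, and the tiny frame is made up purely to show the call. In this notebook you would pass `data` and the real column names instead.

```python
import pandas as pd

def summarize_by_gender(df, columns):
    """Per-gender mean and standard deviation of the given columns (hypothetical helper)."""
    return df.groupby('Gender')[columns].agg(['mean', 'std'])

# Tiny made-up frame, only to show the call; it is not the mall customers data.
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Age': [20, 40, 30, 50],
    'Spending Score (1-100)': [10, 90, 40, 60],
})

summary = summarize_by_gender(toy, ['Age', 'Spending Score (1-100)'])
print(summary)
```

If the group means differ by much less than the within-group standard deviations, that is consistent with gender carrying little information for the clustering.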

In [None]:
#so let's select only the important features 
features=['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']
sb.set_style('whitegrid')
plt.figure(1 , figsize = (15 , 6))
n = 0 
for i in features:
    n += 1
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace =0.5 , wspace = 0.5)
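    # distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
    sb.histplot(data[i] , bins = 20 , kde = True , stat = 'density')
    plt.title(f'{i} Distribution')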
plt.show()

In [None]:
# plotting regression plot to show relation within variables of the dataset.
plt.figure(1 , figsize = (15 , 7))
n = 0 
for i in features:
    for j in features:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
        sb.regplot(x = i , y = j , data = data)
        plt.ylabel(j.split()[0]+' '+j.split()[1] if len(j.split()) > 1 else j )
plt.show()

In [None]:
plt.figure(1 , figsize = (15 , 6))
for gender in ['Male' , 'Female']:
    plt.scatter(x = 'Age' , y = 'Annual Income (k$)' , data = data[data['Gender'] == gender] ,
                s = 200 , alpha = 0.5 , label = gender)
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)') 
plt.title('Age vs Annual Income grouped by Gender')
plt.legend()
plt.show()

In [None]:
plt.figure(1 , figsize = (15 , 6))
for gender in ['Male' , 'Female']:
    plt.scatter(x = 'Annual Income (k$)',y = 'Spending Score (1-100)' ,
                data = data[data['Gender'] == gender] ,s = 200 , alpha = 0.5 , label = gender)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)') 
plt.title('Annual Income vs Spending Score grouped by Gender')
plt.legend()
plt.show()

In [None]:
# Distribution of Age , Annual Income and Spending Score by Gender
plt.figure(1 , figsize = (15 , 7))
n = 0 
for i in features:
    n += 1 
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
    sb.violinplot(x = i , y = 'Gender' , data = data , palette = 'vlag')
    sb.swarmplot(x = i , y = 'Gender' , data = data)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Violin plots & Swarmplots' if n == 2 else '')
plt.show()

### Clustering using Kmeans

#### Clustering with Annual Income and Spending Score

#### The Elbow Method :
* The Elbow Method is one of the most popular ways to determine a suitable value of k, the number of clusters. To perform it, run k-means several times, incrementing k with each run, and record the SSE (Sum of Squared Errors) each time. 
* When you plot SSE as a function of the number of clusters, SSE keeps decreasing as k increases: as more centroids are added, the distance from each point to its closest centroid shrinks. The sweet spot where the SSE curve starts to bend is known as the elbow point.
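
To make the idea of a "bend" concrete, here is a simple heuristic sketch: normalize both axes, draw the straight line from the first to the last point of the curve, and take the k whose SSE lies farthest below that line. This is an assumption-laden illustration, not the algorithm implemented by `kneed`; the function `elbow_by_chord` and the SSE values are hypothetical, and the curve is assumed to decrease monotonically.

```python
import numpy as np

def elbow_by_chord(ks, sse):
    """Return the k whose point lies farthest below the chord of the SSE curve.

    Hypothetical helper; assumes sse decreases monotonically with k.
    """
    ks = np.asarray(ks, dtype=float)
    sse = np.asarray(sse, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (sse - sse.min()) / (sse.max() - sse.min())
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x.
    gap = (1.0 - x) - y
    return int(ks[np.argmax(gap)])

# Made-up SSE values for k = 1..10, shaped like a typical elbow curve.
ks = range(1, 11)
sse_values = [100, 60, 30, 12, 10, 9, 8, 7.5, 7, 6.5]
elbow = elbow_by_chord(ks, sse_values)
print(f'Elbow found at k = {elbow}')
```

On real data the bend is often less sharp than in this toy curve, so it is worth checking the plot by eye as well.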

In [None]:
# Clustering with Annual Income and spending Score
X = data[['Annual Income (k$)' , 'Spending Score (1-100)']].iloc[: , :].values
#scale the features 
scaler=StandardScaler()
X_scaled= scaler.fit_transform(X)
#begin searching for the best k 
sse = []
for n in range(1 , 11):
    model = KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') 
    model.fit(X_scaled)
    sse.append(model.inertia_)


In [None]:
# Selecting k clusters based on the SSE (Sum of Squared Errors) using the elbow method
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , sse , 'o')
plt.plot(np.arange(1 , 11) , sse , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.title('The Elbow Method')
plt.show()

In [None]:
#to detect the elbow point 
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
#get the elbow point  
print(f'The Elbow Point matches : {kl.elbow}')
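
`silhouette_score` is imported at the top of the notebook but not used afterwards. One way it could serve as a cross-check on the elbow point is sketched below. The synthetic blobs from `make_blobs` are an assumption used as a stand-in so the snippet runs on its own; in this notebook you would score `X_scaled` instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features: four well-separated blobs.
X_demo, _ = make_blobs(n_samples=200,
                       centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, demo_labels)

# A higher silhouette score means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(f'Best k by silhouette score: {best_k}')
```

If the silhouette score and the elbow point agree, that gives some extra confidence in the chosen k; if they disagree, both candidates are worth inspecting on a plot.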

In [None]:
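#the elbow point indicates that the number of clusters is 5
#note: the elbow search above used X_scaled; here the model is fitted on the unscaled X
#so that the centroids and the plot below stay in the original units
model = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
model.fit(X)
labels = model.labels_
centroids = model.cluster_centers_
#these are the final (fitted) centroids, one row per cluster
print(f'Cluster Centroids:\n {centroids}')
print(f'Clustering Labels: \n{np.unique(labels)}')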

In [None]:
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max,0.02 ), np.arange(y_min, y_max, 0.02))

In [None]:
#plot the clusters
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
X_preds = model.predict(np.c_[xx.ravel(), yy.ravel()])
X_preds = X_preds.reshape(xx.shape)
plt.imshow( X_preds, interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = data , c = labels , 
            s = 200 )
plt.scatter(x = centroids[: , 0] , y =  centroids[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score')  
plt.xlabel('Annual Income')
plt.title('Plotting the Clusters')
plt.show()

#### Clustering with Annual Income and Age

In [None]:
# Clustering with age and Annual Income
Z = data[['Annual Income (k$)' , 'Age']].iloc[: , :].values
#scale data 
Z_scaled=scaler.fit_transform(Z)
sse=[]
for i in range(1,11):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=0)
    kmeans.fit(Z_scaled)
    sse.append(kmeans.inertia_)

In [None]:
#Visualizing the ELBOW method to show the elbow point 
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , sse , 'o')
plt.plot(np.arange(1 , 11) , sse , '-' , alpha = 0.5)
plt.title('The Elbow Method')
plt.xlabel('# of clusters')
plt.ylabel('SSE')
plt.show()

In [None]:
#to detect the elbow point 
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
#get the elbow point  
print(f'The Elbow Point matches : {kl.elbow}')

In [None]:
#as we can see, the elbow point matches k=3 
kmeans_model = KMeans(n_clusters= 3, init='k-means++', random_state=0)
Z_preds= kmeans_model.fit_predict(Z_scaled)
#the model was fitted on scaled data, so map the centroids back to the original units
Z_centers = scaler.inverse_transform(kmeans_model.cluster_centers_)
#plot the clusters 
plt.figure(1 , figsize = (15 ,6))
plt.scatter(Z[Z_preds == 0, 0], Z[Z_preds == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(Z[Z_preds == 1, 0], Z[Z_preds == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(Z[Z_preds == 2, 0], Z[Z_preds == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(Z_centers[:, 0], Z_centers[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Age')
plt.legend(loc='upper right')
plt.show()
