# Unsupervised learning with K-Means for mall customer data segmentation.

Unsupervised learning refers to machine learning algorithms that infer structure from unlabelled data, learning the patterns in the data on their own. K-Means is a popular unsupervised learning algorithm that separates n observations into k clusters, with each observation belonging to the cluster with the nearest mean (centroid). First, import all the necessary modules.

In [None]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import silhouette_score
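
As a rough illustration of the idea described above (assign each observation to the nearest mean, then recompute the means), here is a minimal NumPy sketch. The names `kmeans_sketch` and `X_toy` are hypothetical and the data is made up; this is not scikit-learn's implementation, which by default uses k-means++ initialisation among other refinements.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assignment/update loop; a sketch, not a production implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# Toy data: two blobs (values made up for illustration).
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(5, 0.5, size=(20, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_centroids)
```

Because the result depends on the random starting centroids, library implementations typically run several initialisations and keep the best one.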

In [None]:
df=pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()

Now let's look at some basic details of the dataset, such as its dimensions and the count, mean, and other summary statistics of its features. Let's also rename the two columns with long names to ones that are easier to access.

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.rename(columns={'Spending Score (1-100)':'SpendingScore','Annual Income (k$)':'AnnualIncome'},inplace=True)
df.head()

# Visualization

Plot some basic graphs with the help of the matplotlib and seaborn libraries.
The countplot shows how many examples fall into each group of a categorical feature, which here is gender.

In [None]:
sns.countplot(data=df,x='Gender',palette='Set2')

Now let's look at histograms showing the range and distribution of the numeric features: age, annual income, and spending score.

In [None]:
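plt.figure(figsize=(15,4))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
plt.subplot(1,3,1)
sns.histplot(df['Age'],kde=True)
plt.subplot(1,3,2)
sns.histplot(df['AnnualIncome'],kde=True,color='red')
plt.subplot(1,3,3)
sns.histplot(df['SpendingScore'],kde=True,color='green')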

A heatmap is a two-dimensional representation of a matrix in which each value is shown as a color. Applied to the correlation matrix, it helps us analyze the relationships between the different features in our dataset.

In [None]:
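# Gender is still a string column at this point, so restrict the correlation to the numeric features
sns.heatmap(df[['Age','AnnualIncome','SpendingScore']].corr(),annot=True,linewidths=0.2)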
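
The values annotated in the heatmap are Pearson correlation coefficients, the default method of pandas `DataFrame.corr`. As a small sketch of what each cell holds, the coefficient can be computed by hand and checked against NumPy. The arrays `age` and `score` below are made-up values for illustration, not rows from the dataset.

```python
import numpy as np

# Made-up values for illustration only.
age = np.array([25, 32, 47, 51, 38, 29], dtype=float)
score = np.array([80, 60, 35, 20, 55, 70], dtype=float)

# Pearson r: sum of products of centred values, divided by the
# square root of the product of the sums of squared centred values.
age_c = age - age.mean()
score_c = score - score.mean()
r = (age_c * score_c).sum() / np.sqrt((age_c ** 2).sum() * (score_c ** 2).sum())

print(r)
print(np.isclose(r, np.corrcoef(age, score)[0, 1]))
```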

The line plots below show how the spending score varies with annual income and with age, with a separate line for each gender.

In [None]:
plt.figure(figsize=(10,7))
# errorbar=None replaces the deprecated ci argument (seaborn 0.12+)
sns.lineplot(x='AnnualIncome',y='SpendingScore',hue='Gender',data=df,errorbar=None,style='Gender',markers=True)

In [None]:
plt.figure(figsize=(10,7))
# errorbar=None replaces the deprecated ci argument (seaborn 0.12+)
sns.lineplot(x='Age',y='SpendingScore',hue='Gender',data=df,errorbar=None,style='Gender',markers=True)

A boxplot can also be used to visualize distributions.

In [None]:
plt.figure(figsize=(20,10))
# One boxplot per feature, split by gender
for x,col in enumerate(['AnnualIncome','SpendingScore'],start=1):
    plt.subplot(2,2,x)
    sns.boxplot(data=df,x=col,y='Gender',palette='Set'+str(x))
plt.show()

The next few steps prepare the data for clustering: encode the categorical Gender column as integers, check that no values are missing, and drop the CustomerID column, which is only an identifier.

In [None]:
lenc=LabelEncoder()
df['Gender']=lenc.fit_transform(df['Gender'])
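
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a toy list, `toy_gender`, and assumes the column holds the strings 'Male' and 'Female'; `LabelEncoder` sorts the classes, so codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration (assumed categories, not read from the file).
toy_gender = ['Male', 'Female', 'Female', 'Male']
enc = LabelEncoder()
codes = enc.fit_transform(toy_gender)

# classes_ is sorted, and each class's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
print(mapping)                                  # {'Female': 0, 'Male': 1}
print(codes.tolist())                           # [1, 0, 0, 1]
print(enc.inverse_transform([0, 1]).tolist())   # ['Female', 'Male']
```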

In [None]:
df.isna().sum()

In [None]:
df.drop('CustomerID',axis=1,inplace=True)

In [None]:
df.head()

# K-Means

First, let's plot the elbow curve for different numbers of clusters, using a for loop to append the inertia of each fitted K-Means model to a list and plotting the values for 1 to 10 clusters.
The K-Means algorithm clusters data by trying to separate the samples into k groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. Inertia measures how far the points in each cluster are from their cluster centroid.
The value of inertia decreases as the number of clusters increases.
The elbow point is the point in the graph where the curve bends and adding further clusters yields only small reductions in inertia.

In [None]:
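inertias=list()
for i in range(1,11):
    # n_init=10 runs each fit 10 times with different centroid seeds and keeps the best
    kmns=KMeans(n_clusters=i,n_init=10)
    kmns.fit(df)
    inertias.append(kmns.inertia_)  # within-cluster sum of squares
plt.figure(figsize=(10,7))
sns.lineplot(x=list(range(1,11)),y=inertias)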
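
The inertia plotted above is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand on synthetic data and compares it with the `inertia_` attribute; `X_demo` is a made-up stand-in for the customer table, and `random_state=0` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer table (values are made up).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc, 1.0, size=(30, 2)) for loc in (0, 6, 12)])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to every point, then sum the squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```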

We notice two potential elbow points or "bends": one at approximately 3 and another at around 5. We therefore run K-Means with both values to form the required clusters, which we'll visualize below.

# 3 clusters

In [None]:
n=3
kmeans3=KMeans(n_clusters=n,n_init=10,max_iter=500)
kmeans3.fit(df)

In [None]:
df['clusters']=kmeans3.labels_
kmeans3.cluster_centers_

In [None]:
df.head()

The silhouette score is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) of each sample, as (b - a) / max(a, b), averaged over all samples. Its values range from -1 to 1. Values close to 0 indicate overlapping clusters, values close to 1 indicate dense, well-separated clusters, and negative values generally indicate that a sample has been assigned to the wrong cluster.

In [None]:
print(silhouette_score(df.iloc[:,0:4],kmeans3.labels_))
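
To make the score less of a black box, the following sketch computes it by hand on a tiny made-up example and compares the result with `silhouette_score`. For each sample, a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The helper `silhouette_by_hand` and the arrays `X_demo` and `labels_demo` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Tiny made-up 1-D example with two obvious clusters.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels):
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(silhouette_by_hand(X_demo, labels_demo))
print(silhouette_score(X_demo, labels_demo))
```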

In the plot below, we can clearly see the 3 clusters, distinguished by color, that K-Means has assigned to the data points. The marker style indicates gender.

In [None]:
plt.figure(figsize=(12, 8))
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
sns.scatterplot(x=df['AnnualIncome'], y=df['SpendingScore'], hue=df['clusters'], palette='Set1',style=df['Gender'])

# 5 clusters

In [None]:
n=5
kmeans5=KMeans(n_clusters=n,n_init=10,max_iter=500)
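# Fit on the four feature columns only; df now also holds the 'clusters' labels from the 3-cluster run
kmeans5.fit(df.iloc[:,0:4])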

In [None]:
df['clusters']=kmeans5.labels_
kmeans5.cluster_centers_

In [None]:
df.head()

In [None]:
print(silhouette_score(df.iloc[:,0:4],kmeans5.labels_))

We see an improvement in the silhouette score. Plotting the clusters, we can distinguish the 5 groups.

In [None]:
plt.figure(figsize=(12, 8))
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
sns.scatterplot(x=df['AnnualIncome'], y=df['SpendingScore'], hue=df['clusters'], palette='Set1',style=df['Gender'])
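
Rather than comparing only two candidate values, the silhouette score can be computed over a range of cluster counts. The sketch below does this on synthetic data with four blobs; `X_demo` is a made-up stand-in for the feature columns of `df`, and `random_state=0` is an assumption added for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in with four blobs (values are made up).
rng = np.random.default_rng(1)
centers = [(0, 0), (0, 10), (10, 0), (10, 10)]
X_demo = np.vstack([rng.normal(c, 1.0, size=(40, 2)) for c in centers])

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette is a reasonable candidate.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print('best k:', best_k)
```

On real data the peak is often less pronounced than on this toy example, so the result is best read alongside the elbow curve and the cluster plots.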