## Introduction

This kernel demonstrates K-Means clustering for customer segmentation.

## Import Modules

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Global plot styling
sns.set(rc={'figure.figsize':(8,8)})
sns.set_style("whitegrid")

# Silence library warnings to keep the notebook output clean
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv', index_col=False)
df.head()

In [None]:
df.info()

## EDA

In [None]:
## Gender Distribution
df.Gender.value_counts().plot.barh()

In [None]:
## Age Distribution
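# distplot is deprecated in recent seaborn releases; histplot + rugplot give the same view
sns.histplot(df.Age, kde=True, stat="density", color='c')
sns.rugplot(df.Age, color='c')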

In [None]:
## Income Based on Gender
ax = sns.boxplot(x="Gender", y="Annual Income (k$)",
                 data=df, palette="Set2")

In [None]:
## Spending Score based on Gender
ax = sns.boxplot(x="Gender", y="Spending Score (1-100)",
                 data=df, palette="rainbow")

In [None]:
## Correlation of Annual Income and Age by Gender
ax = sns.scatterplot(x="Annual Income (k$)", y="Age", hue='Gender',
                 data=df, palette="jet_r")

In [None]:
## Correlation of Spending Score and Age by Gender
ax = sns.scatterplot(x="Spending Score (1-100)", y="Age", hue='Gender',
                 data=df, palette="YlOrBr_r")

In [None]:
## Correlation of Annual Income and Spending Score by Gender
ax = sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", hue='Gender',
                 data=df, palette="prism")

## Feature Engineering

In [None]:
df = pd.get_dummies(df)
df.head()

In [None]:
## heatmap
sns.heatmap(df.corr(), annot=True,cmap='seismic')

## Modelling

In [None]:
from sklearn.cluster import KMeans 
from sklearn import metrics 
from scipy.spatial.distance import cdist

In [None]:
## Clustering based on Annual Income and Spending Score
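# Select the two features by name rather than by position, so column order changes cannot break it
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].to_numpy()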

In [None]:
distortions = []
mapping1 = {}
K = range(1,10) 

for k in K:
    # Fit K-Means once for this value of k
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1).mean()

    distortions.append(distortion)
    mapping1[k] = distortion

In [None]:
for key,val in mapping1.items(): 
    print(str(key)+' : '+str(val)) 

In [None]:
plt.plot(K, distortions, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Distortion') 
plt.title('The Elbow Method using Distortion') 
plt.show() 

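The elbow method chooses k for K-Means by plotting distortion against k. Distortion keeps decreasing as k grows, so we look for the "elbow": the value of k after which the curve levels off and extra clusters bring only small improvements. In this case the curve is roughly stable from k = 5 onward, so we use 5 clusters.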
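As a cross-check on the elbow reading, here is a minimal sketch that uses scikit-learn's built-in `inertia_` attribute (within-cluster sum of squared distances) instead of the mean-distance distortion computed above. The synthetic `make_blobs` data, the `random_state` and `n_init` values, and the helper name `inertia_curve` are assumptions for illustration and are not part of the original analysis; passing `X` from the earlier cell would apply the same idea to the mall data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(data, k_values, random_state=0):
    """Return the K-Means inertia (within-cluster sum of squares) for each k."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in data with 5 groups (an assumption, not the mall data)
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)

k_values = list(range(1, 10))
inertias = inertia_curve(X_demo, k_values)

# Relative improvement gained by each additional cluster;
# small values indicate that the curve has flattened
drops = -np.diff(inertias) / np.array(inertias[:-1])
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1%}")
```

Printing the relative drop per added cluster makes the "levels off" judgment less dependent on eyeballing the plot, although the final choice of k remains a judgment call.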

In [None]:
kmeans = KMeans(n_clusters = 5)
kmeans.fit(X)
y_pred = kmeans.predict(X)
print(kmeans.cluster_centers_)

In [None]:
df["label"] = kmeans.labels_
df.head()

In [None]:
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap='prism')
plt.title('K-Means Clusters')
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.show()

## Summary

The analysis suggests that this company has five kinds of customers:

A. Green customers have low annual income but high spending. Most likely they are satisfied with the company's service.

B. Yellow customers have high income and high spending. They appear to work hard and spend freely.

C. Red customers have average income and average spending. Most likely they have only recently heard about the company's products or services.

D. Purple customers have low annual income and low spending. It is unclear whether they dislike the company's products or simply prefer to save their money.

E. Blue customers are the main problem: despite a high annual income, they spend little with this company. Most likely they do not like the products or services, and customers in this segment tend to leave.
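The segment descriptions can be backed by numbers by summarizing each cluster's mean income and spending score. The following is a minimal sketch under stated assumptions: the `toy` DataFrame is hypothetical data invented for illustration (only the column names match the mall dataset), and the `random_state` and `n_init` values are arbitrary choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

features = ["Annual Income (k$)", "Spending Score (1-100)"]

# Hypothetical toy data (assumption): five loose groups of three customers each
toy = pd.DataFrame({
    features[0]: [15, 16, 17, 18, 19, 20, 50, 52, 48, 80, 85, 90, 82, 88, 86],
    features[1]: [80, 85, 90, 10, 15, 12, 50, 48, 52, 15, 10, 20, 90, 85, 88],
})

# Cluster on the two features and attach the labels, as in the notebook
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(toy[features].to_numpy())
toy["label"] = km.labels_

# Mean income, mean spending score, and size of each cluster
profile = toy.groupby("label")[features].mean()
profile["size"] = toy.groupby("label").size()
print(profile.round(1))
```

On the notebook's own data, the same `groupby("label")` summary on `df` (after the labeling cell) would tie each description to measured cluster means rather than to colors in the scatter plot.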