# Mall Customers Segmentation using K-Means Clustering 


![](https://zonretail.com/wp-content/uploads/2018/04/shopping-malls.jpg)

Malls and shopping complexes compete with one another to grow their customer base and, in turn, their profits. Many stores already apply machine learning to this task. AI and ML have been closely involved in online shopping since its beginning: you can't use Amazon or any other shopping service without getting recommendations, which are often personalized based on the vendor's understanding of your traits, such as your purchase history, your browsing history, and possibly much more. Shopping complexes likewise use their customers' data to build ML models that target the right customers. This not only increases sales but also makes the business more efficient.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
df.head()

In [None]:
df.describe()

**Observations:**
* Customer ages range from 18 to 70, which suggests that the mall has shops that suit all age groups.
* The average age of customers is about 39.
* The average annual income of customers is about 60 k$.
* The average spending score of customers is about 50.


An early step in any data science problem is to check for missing/null values. Let's check that now.

In [None]:
df.isnull().sum()

As we can see, there are no missing values in this dataset.

In [None]:
df.shape #To check the number of rows and columns in the dataset.

In [None]:
df.info() # To check for the data types in the dataset.

# Data Visualisation

## Univariate Analysis

### Gender

In [None]:
sns.countplot(x='Gender', data=df)  # keyword arguments work across seaborn versions

There are more female customers than male customers.
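
To put numbers on the bar chart, the counts and shares per gender can be read off with `value_counts`. The snippet below is a minimal sketch on a made-up `Gender` series (not the real data); in the notebook the same calls would be applied to `df['Gender']`.

```python
import pandas as pd

# Made-up stand-in for df['Gender'], used only so the snippet runs on its own.
genders = pd.Series(["Female", "Male", "Female", "Female", "Male"], name="Gender")

counts = genders.value_counts()                # absolute counts per category
shares = genders.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```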

### Age

In [None]:
sns.histplot(df['Age'], kde=True)  # histplot (seaborn >= 0.11) replaces the deprecated distplot

The age of the customers follows a right-skewed distribution.

### Annual Income

In [None]:
sns.histplot(df['Annual Income (k$)'], kde=True)  # histplot (seaborn >= 0.11) replaces the deprecated distplot

The annual income of the customers also follows a right-skewed distribution.

### Spending score

In [None]:
sns.histplot(df['Spending Score (1-100)'], kde=True)  # histplot (seaborn >= 0.11) replaces the deprecated distplot

The spending score of the customers roughly follows a normal distribution.
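
The skewness read off the plots above can also be checked numerically. The sketch below uses pandas' `skew()`, where a positive value indicates a right-skewed column and a value near zero a roughly symmetric one. The `toy` frame and the `skew_report` helper are hypothetical and exist only to keep the snippet self-contained; in the notebook you would pass `df` instead.

```python
import pandas as pd

def skew_report(frame, columns):
    """Return the sample skewness of each column (positive = right-skewed)."""
    return frame[columns].skew()

# Made-up values standing in for the mall data, not the real dataset.
toy = pd.DataFrame({
    "Age": [19, 21, 23, 25, 30, 31, 35, 40, 52, 68],
    "Spending Score (1-100)": [10, 25, 40, 45, 50, 50, 55, 60, 75, 90],
})

report = skew_report(toy, ["Age", "Spending Score (1-100)"])
print(report)
```

On the real data, `skew_report(df, ['Age', 'Annual Income (k$)', 'Spending Score (1-100)'])` would give one number per feature to compare against the plots.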

## Bivariate Analysis

Let's now check the relationships between the features using a pairplot.

In [None]:
sns.pairplot(df[[ 'Gender', 'Age', 'Annual Income (k$)','Spending Score (1-100)']])

**Observations:**
* Most of the customers are in the 20-40 age group.
* Spending score is high for the customers in the age group of 20-40.
* Customers with very low or very high income show the widest range of spending scores, including the highest ones.



### Heatmap

In [None]:
plt.rcParams['figure.figsize'] = (14, 8)
# Gender is stored as text, so only the numeric columns enter the correlation matrix
sns.heatmap(df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].corr(), cmap = 'magma_r', annot = True, linewidths=.5)
plt.title('Heatmap', fontsize = 20)
plt.show()

As we can see, there is not much correlation between the features.
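
Gender is a text column, so it cannot take part in a Pearson correlation as it stands. If we want it in the matrix, one option is to encode it as 0/1 first. The sketch below assumes the labels are exactly `'Male'` and `'Female'` and runs on a made-up `toy` frame; the 0/1 assignment is arbitrary and only affects the sign of the Gender correlations.

```python
import pandas as pd

# Made-up rows standing in for the mall data, not the real dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 35, 52, 40, 68],
    "Annual Income (k$)": [15, 16, 60, 78, 54, 120],
    "Spending Score (1-100)": [39, 81, 50, 17, 48, 20],
})

encoded = toy.copy()
# Assumed label set; any value outside this mapping would become NaN.
encoded["Gender"] = encoded["Gender"].map({"Male": 0, "Female": 1})

corr = encoded.corr()  # every column is numeric now
print(corr.round(2))
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as above.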

### Gender vs Spending Score

In [None]:
plt.rcParams['figure.figsize'] = (16, 7)
g = sns.catplot(x="Gender", y="Spending Score (1-100)", kind="violin", inner=None, data=df)
sns.swarmplot(x="Gender", y="Spending Score (1-100)", color="k", size=3, data=df, ax=g.ax);
plt.title('Gender vs Spending Score', fontsize = 16)
plt.xlabel('Gender')
plt.ylabel('Spending Score (1-100)')


### Gender vs Annual Income

In [None]:
plt.rcParams['figure.figsize'] = (16, 7)
g = sns.catplot(x="Gender", y="Annual Income (k$)", kind="violin", inner=None, data=df)
sns.swarmplot(x="Gender", y="Annual Income (k$)", color="k", size=3, data=df, ax=g.ax);
plt.title('Gender vs Annual Income', fontsize = 16)
plt.xlabel('Gender')
plt.ylabel('Annual Income (k$)')

### Gender vs Age

In [None]:
sns.catplot(x="Gender", y="Age", kind="box", data=df);

# K-Means Clustering

## k-means clustering based on annual income

#### Elbow method to find the optimal number of Clusters

In [None]:
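from sklearn.cluster import KMeans

# Columns 3 and 4: annual income (k$) and spending score
data = df.iloc[:, [3, 4]].values
wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    # n_init is set explicitly so the result does not depend on the scikit-learn default
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)  # inertia_ holds the WCSS of the fitted model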

plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
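
To make the WCSS on the y-axis concrete: it is the sum of squared distances from every point to the centre of the cluster it was assigned to, which is what scikit-learn reports as `inertia_`. The sketch below recomputes it by hand on made-up 2-D points (a stand-in for the income and spending-score pairs) and compares the two values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming two loose groups; not the real mall data.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(30, 2)),
    rng.normal(loc=(80, 80), scale=3, size=(30, 2)),
])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(points)

# Look up each point's own cluster centre, then sum the squared distances.
assigned_centres = model.cluster_centers_[model.labels_]
manual_wcss = np.sum((points - assigned_centres) ** 2)

print(manual_wcss, model.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why `inertia_` can be appended directly in the elbow loop.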

From the above figure, we can see that the elbow (the point after which WCSS stops dropping sharply) occurs at k = 5, so we will use 5 clusters in this case.

In [None]:

kmeans=KMeans(n_clusters=5,init='k-means++',n_init=10,random_state=0)
y_kmeans=kmeans.fit_predict(data)

# Plotting the clusters
fig,ax = plt.subplots(figsize=(14,6))
ax.scatter(data[y_kmeans==0,0],data[y_kmeans==0,1],s=100,c='red',label='Cluster 1')
ax.scatter(data[y_kmeans==1,0],data[y_kmeans==1,1],s=100,c='blue',label='Cluster 2')
ax.scatter(data[y_kmeans==2,0],data[y_kmeans==2,1],s=100,c='green',label='Cluster 3')
ax.scatter(data[y_kmeans==3,0],data[y_kmeans==3,1],s=100,c='cyan',label='Cluster 4')
ax.scatter(data[y_kmeans==4,0],data[y_kmeans==4,1],s=100,c='magenta',label='Cluster 5')

ax.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=400,c='yellow',label='Centroid')
plt.title('Cluster Segmentation of Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

## k-means clustering based on Age

In [None]:
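from sklearn.cluster import KMeans

# Columns 2 and 4: age and spending score
data = df.iloc[:, [2, 4]].values
wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    # n_init is set explicitly so the result does not depend on the scikit-learn default
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)  # inertia_ holds the WCSS of the fitted model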

plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

From the above figure, we can see that the elbow (the point after which WCSS stops dropping sharply) occurs at k = 4, so we will use 4 clusters in this case.
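
Reading the elbow by eye is somewhat subjective. One common cross-check, which is not part of the analysis above, is the silhouette score: it is computed for each candidate k, and higher values indicate tighter, better-separated clusters. The sketch below runs on made-up blobs generated with `make_blobs` (four groups by construction), so the data and the `centres` array are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Made-up data: four well-separated groups, not the real mall customers.
centres = np.array([[20, 20], [20, 80], [80, 20], [80, 80]])
points, _ = make_blobs(n_samples=200, centers=centres, cluster_std=4.0, random_state=0)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(scores)
print("best k:", best_k)
```

On the notebook's own `data` array the same loop gives a second opinion on the k chosen from the elbow plot; the two criteria do not always agree.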

In [None]:
kmeans=KMeans(n_clusters=4,init='k-means++',n_init=10,random_state=0)
y_kmeans=kmeans.fit_predict(data)

#Plotting the clusters
fig,ax = plt.subplots(figsize=(14,6))
ax.scatter(data[y_kmeans==0,0],data[y_kmeans==0,1],s=100,c='red',label='Cluster 1')
ax.scatter(data[y_kmeans==1,0],data[y_kmeans==1,1],s=100,c='blue',label='Cluster 2')
ax.scatter(data[y_kmeans==2,0],data[y_kmeans==2,1],s=100,c='green',label='Cluster 3')
ax.scatter(data[y_kmeans==3,0],data[y_kmeans==3,1],s=100,c='cyan',label='Cluster 4')

ax.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=400,c='yellow',label='Centroid')
plt.title('Cluster Segmentation of Customers')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

### Conclusion:

Using k-means clustering, we have formed different customer clusters based on different features. Mall management can target the clusters with an average spending score to increase profit and should maintain a good relationship with premium customers who have a high spending score. They should also come up with new ideas to raise the spending of customers with a low spending score.
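
To act on these recommendations, it helps to see each cluster as a row of summary numbers instead of a colour on a plot. The sketch below attaches the cluster labels to a frame and reports the size, mean income, and mean spending score per cluster, sorted so the highest-spending segment comes first. The `toy` frame is made up for illustration; with the real data you would use `df` and the fitted labels `y_kmeans`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up customers standing in for the mall data, not the real dataset.
toy = pd.DataFrame({
    "Annual Income (k$)":     [15, 16, 17, 18, 54, 55, 57, 60, 85, 87, 88, 90],
    "Spending Score (1-100)": [80, 85, 10, 12, 48, 50, 52, 49, 90, 92, 12, 15],
})

model = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(toy.values)

# One row per cluster: how many customers, and what they look like on average.
profile = (
    toy.assign(cluster=labels)
       .groupby("cluster")
       .agg(customers=("Annual Income (k$)", "size"),
            mean_income=("Annual Income (k$)", "mean"),
            mean_spending=("Spending Score (1-100)", "mean"))
       .sort_values("mean_spending", ascending=False)
)
print(profile)
```

The top rows of `profile` correspond to the premium customers mentioned above, and the bottom rows to the low-spending segments that new ideas would need to reach.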

### Please do upvote if you like this Notebook!