# Customer Segmentation with K-means

## Overview

Customers are the source of a business's revenue, so its success depends on satisfying them and on finding the right potential customers. Customer data can be used to build models that predict future customer behavior and identify the customers most likely to increase revenue.

In this project, I analyze a mall customer dataset to answer the following question:
* Which customers should we target?

To answer this question, I build a K-means clustering model that groups the customers by their characteristics. The goal is to understand each segment and to target the right customers so that they stay with the business.



### Dataset

* **Mall_Customers.csv** - Contains information about the customers in five columns:
    * CustomerID: Unique ID assigned to the customer.
    * Genre: Gender of the customer.
    * Age: Age of the customer.
    * Annual_Income_(k$): Annual income of the customer, in thousands of dollars.
    * Spending_Score_(1-100): Score assigned by the mall based on customer behavior and spending nature.
    
****

The project is organized as follows. Section 1 explores and visualizes the data. Section 2 covers data preparation, and Section 3 implements the solution. Finally, Section 4 presents the summary.

## 1. Data Understanding

This step explores the dataset, its features, and their data types using `head()`, `shape`, and `describe()`, and checks for null values using `isnull()`. It also visualizes the data to better understand it.

In [None]:
# Import needed libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns


In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Read data in the csv file

df = pd.read_csv('../input/mall-customers/Mall_Customers.csv')
df.head()


In [None]:
df.shape

In [None]:
df.describe()

In [None]:
# Check null values
df.isnull().sum()

In [None]:
# Rename Columns 

df.rename(columns = {'Annual Income (k$)':'Annual_Income_(k$)' , 'Spending Score (1-100)': 'Spending_Score'}, inplace = True)
df.head()

In [None]:
# Display the distribution of Spending_Score in a chart
plt.hist(df['Spending_Score'])
plt.show()

In [None]:
# Display the distribution of Annual_Income_(k$) in a chart
plt.hist(df['Annual_Income_(k$)'])
plt.show()

In [None]:
# Display the distribution of Age in a chart
plt.hist(df['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in a chart

gender_df = df.groupby(['Genre']).count()['CustomerID']
gender_df.plot(kind='bar', title ='Gender Distribution');

In [None]:
# Spending_Score mean with respect to Genre
df[['Spending_Score',  'Genre']].groupby(['Genre']).mean()

In [None]:
# Display the distribution of Spending_Score by gender
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df[df['Genre'] =='Male']['Spending_Score'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df[df['Genre'] =='Female']['Spending_Score'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Spending Score')
plt.ylabel('Count')
plt.title('Spending Score Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Spending_Score max with respect to Genre
df[['Spending_Score', 'Genre']].groupby(['Genre']).max()

In [None]:
# Spending_Score min with respect to Genre
df[['Spending_Score', 'Genre']].groupby(['Genre']).min()

In [None]:
# Display the relationship between the features of the dataset
sns.pairplot(df[[ 'Age', 'Annual_Income_(k$)', 'Spending_Score']])

**Observations:**

From the data, we can see that:
* All features are numerical except `Genre`, which is categorical.
* The dataset has no null values.
* There are more female customers than male customers.
* There is no clear linear relationship between annual income and spending score.
* The average spending score of female customers is higher than that of male customers.
* The customers can be grouped based on their spending scores and annual incomes.
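
The observation about annual income and spending score can also be checked numerically rather than only by eye. The following is a minimal sketch, assuming the renamed columns `Annual_Income_(k$)` and `Spending_Score`; `income_spending_correlation` is a hypothetical helper name, and the `sample` frame is made-up illustration data, not the mall dataset.

```python
import pandas as pd

def income_spending_correlation(frame: pd.DataFrame) -> float:
    """Return the Pearson correlation between income and spending score."""
    return frame['Annual_Income_(k$)'].corr(frame['Spending_Score'])

# Made-up illustration data (NOT the mall dataset).
sample = pd.DataFrame({
    'Annual_Income_(k$)': [15, 20, 40, 60, 80, 100],
    'Spending_Score': [40, 80, 50, 45, 90, 20],
})

r = income_spending_correlation(sample)
print(f'Pearson r = {r:.2f}')
```

A value near 0 indicates no linear relationship. It does not rule out non-linear structure, such as the groups visible in the pair plot.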


## 2. Data Preparation

This step covers selecting the appropriate data and cleaning it.

As the previous section shows, the dataset has no null values, so no cleaning is required. For feature selection, I use the annual income and the spending score.

In [None]:
X = df[['Annual_Income_(k$)', 'Spending_Score']]
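
`StandardScaler` is imported at the top of the notebook, but the analysis clusters the raw features. K-means relies on Euclidean distance, so a feature with a larger numeric range carries more weight. If that matters, the features can be standardized first. The following is a minimal sketch of that optional step on made-up data (`X_demo` is not the mall dataset).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up illustration data (NOT the mall dataset).
X_demo = pd.DataFrame({
    'Annual_Income_(k$)': [15, 20, 40, 60, 80, 100],
    'Spending_Score': [40, 80, 50, 45, 90, 20],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it to keep the column names.
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)
print(X_scaled.round(2))
```

The rest of this notebook keeps the unscaled `X`. Cluster centers from a model fitted on scaled data would be in standardized units; `scaler.inverse_transform` maps them back to the original units.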

## 3. Modeling

This step builds the clustering model that answers the question and discusses the results.

The clustering model is K-means. Before building it, the number of clusters must be determined.

### 3.1. Selecting the Right Number of Clusters using Elbow Method

In [None]:
# Within-cluster sum of squared distances (WSS) for k = 1..7
WSS = []
index = range(1,8)
for i in index:
    kmeans = KMeans(n_clusters = i, random_state = 42)
    kmeans.fit(X)
    WSS.append(kmeans.inertia_)
    print(kmeans.inertia_)
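
The `inertia_` values collected above are the within-cluster sum of squares (WSS): the sum of squared distances from each point to the center of its assigned cluster. The following is a minimal sketch that recomputes the value by hand to make the definition concrete. `manual_wss` is a hypothetical helper, `points` is made-up data, and `n_init=10` is set explicitly as an assumption so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

def manual_wss(points: np.ndarray, model: KMeans) -> float:
    """Sum of squared distances from each point to its assigned center."""
    assigned_centers = model.cluster_centers_[model.labels_]
    return float(((points - assigned_centers) ** 2).sum())

# Made-up illustration data (NOT the mall dataset): two loose groups.
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# The two numbers should agree up to floating-point error.
print(manual_wss(points, model), model.inertia_)
```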

In [None]:
# Use the Elbow method to determine the optimum number of clusters
plt.plot(index, WSS)
plt.xlabel('K')
plt.ylabel('WSS')
plt.show()

The chart shows that the "elbow" is at k = 5, so the model will use 5 clusters.
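
Reading an elbow off a chart is a judgment call, so a second criterion can help confirm the choice. One common option, which the original analysis does not use, is the silhouette score, where higher values indicate better-separated clusters. The following is a minimal sketch on synthetic data; the blob centers, `cluster_std`, and `n_init=10` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 5 well-separated groups (NOT the mall dataset).
demo_centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
X_demo, _ = make_blobs(n_samples=200, centers=demo_centers,
                       cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

Applied to the notebook's `X`, the same loop would give a score per candidate k that can be compared with the elbow chart.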

### 3.2. Building the K-means Model

In [None]:
# Create a K-means model with 5 clusters
kmeans = KMeans(n_clusters = 5, random_state = 42)
kmeans.fit(X)

In [None]:
# The values of cluster centers
center = kmeans.cluster_centers_
center

In [None]:
# The clusters of the customers
kmeans.labels_

In [None]:
# Add the cluster column to a copy of the features, leaving X unchanged
X_cls = X.copy()
X_cls['Cluster'] = kmeans.labels_
X_cls.head()

In [None]:
# Display the five clusters and their centroids
plt.scatter(df['Annual_Income_(k$)'], df['Spending_Score'], c = kmeans.labels_)
plt.scatter(center[:, 0], center[:, 1], c = 'red', marker = '*')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.title('Clusters and Centroids')
plt.show()
plt.show()

**Observations:**

There are five clusters:
* Cluster 0: medium spending score and medium annual income.
* Cluster 1: low spending score and high annual income.
* Cluster 2: low spending score and low annual income.
* Cluster 3: high spending score and low annual income.
* Cluster 4: high spending score and high annual income.

Higher spending scores indicate more favorable customer behavior and spending nature, so the business should focus on the customers with high spending scores. These are the customers in Cluster 3 and Cluster 4.
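
The integer labels that K-means assigns are arbitrary and can change with the random seed or library version, so the cluster numbers listed above apply to this particular run. The following is a minimal sketch of deriving a descriptive name from each centroid instead. The helpers `level` and `describe_centers`, the cut points, and the `demo_centers` values are all assumptions for illustration, not the fitted results.

```python
import numpy as np

def level(value: float, low_cut: float, high_cut: float) -> str:
    """Bucket a value as 'low', 'medium', or 'high' using two cut points."""
    if value < low_cut:
        return 'low'
    if value > high_cut:
        return 'high'
    return 'medium'

def describe_centers(centers: np.ndarray,
                     income_cuts=(40, 70),   # assumed cut points in k$
                     score_cuts=(40, 60)):   # assumed cut points on the 1-100 scale
    """Map each cluster index to a readable '<x> income, <y> spending' name."""
    return {
        i: f'{level(inc, *income_cuts)} income, {level(score, *score_cuts)} spending'
        for i, (inc, score) in enumerate(centers)
    }

# Made-up centers in (income, spending score) order, NOT the fitted values.
demo_centers = np.array([[55, 50], [88, 17], [26, 21], [26, 79], [87, 82]])
names = describe_centers(demo_centers)
for cluster, name in names.items():
    print(cluster, name)
```

With the fitted model, `describe_centers(kmeans.cluster_centers_)` would produce the names for the actual run, whatever numbering K-means happens to choose.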


### 3.3. Results Discussion

Now, I will explore the characteristics of each cluster.

In [None]:
# Merge the two dataframes on their index
new_df = df.merge(X_cls['Cluster'], left_index = True, right_index = True, how = 'inner')
new_df.head()

In [None]:
# Create a dataframe for Cluster 0.
# Cluster 0: medium spending score and medium annual income.
df_c0 = new_df[new_df['Cluster'] == 0]
df_c0.head()

In [None]:
# Create a dataframe for Cluster 1.
# Cluster 1: low spending score and high annual income.
df_c1 = new_df[new_df['Cluster'] == 1]
df_c1.head()

In [None]:
# Create a dataframe for Cluster 2.
# Cluster 2: low spending score and low annual income.
df_c2 = new_df[new_df['Cluster'] == 2]
df_c2.head()

In [None]:
# Create a dataframe for Cluster 3.
# Cluster 3: high spending score and low annual income.
df_c3 = new_df[new_df['Cluster'] == 3]
df_c3.head()

In [None]:
# Create a dataframe for Cluster 4.
# Cluster 4: high spending score and high annual income.
df_c4 = new_df[new_df['Cluster'] == 4]
df_c4.head()

In [None]:
# Create a dataframe for customers with a very high spending score (Spending_Score > 60)
df_high_sp = new_df[new_df['Spending_Score'] > 60]
df_high_sp.head()

In [None]:
# Display the distribution of Age in Cluster 0
plt.hist(df_c0['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in Cluster 0
gender_df_c0 = df_c0.groupby(['Genre']).count()['CustomerID']
gender_df_c0.plot(kind='bar', title ='Gender Distribution in Cluster 0');

In [None]:
# Display the distribution of Age in Cluster 1
plt.hist(df_c1['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in Cluster 1
gender_df_c1 = df_c1.groupby(['Genre']).count()['CustomerID']
gender_df_c1.plot(kind='bar', title ='Gender Distribution in Cluster 1');

In [None]:
# Display the distribution of Age in Cluster 2
plt.hist(df_c2['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in Cluster 2
gender_df_c2 = df_c2.groupby(['Genre']).count()['CustomerID']
gender_df_c2.plot(kind='bar', title ='Gender Distribution in Cluster 2');

In [None]:
# Display the distribution of Age in Cluster 3
plt.hist(df_c3['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in Cluster 3
gender_df_c3 = df_c3.groupby(['Genre']).count()['CustomerID']
gender_df_c3.plot(kind='bar', title ='Gender Distribution in Cluster 3');

In [None]:
# Display the distribution of Age in Cluster 4
plt.hist(df_c4['Age'])
plt.show()

In [None]:
# Display the number of male and female customers in Cluster 4
gender_df_c4 = df_c4.groupby(['Genre']).count()['CustomerID']
gender_df_c4.plot(kind='bar', title ='Gender Distribution in Cluster 4');

In [None]:
# Display the distribution of Age of the customers with a very high spending score
plt.hist(df_high_sp['Age'])
plt.show()

In [None]:
# Display the number of male and female customers with a very high spending score
gender_df_high_sp = df_high_sp.groupby(['Genre']).count()['CustomerID']
gender_df_high_sp.plot(kind='bar', title ='Gender Distribution (Spending Score > 60)');

In [None]:
# Display the distribution of Age by gender in Cluster 0
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_c0[df_c0['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_c0[df_c0['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Display the distribution of Age by gender in Cluster 1
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_c1[df_c1['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_c1[df_c1['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Display the distribution of Age by gender in Cluster 2
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_c2[df_c2['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_c2[df_c2['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Display the distribution of Age by gender in Cluster 3
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_c3[df_c3['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_c3[df_c3['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Display the distribution of Age by gender in Cluster 4
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_c4[df_c4['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_c4[df_c4['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

In [None]:
# Display the distribution of Age by gender for the customers with a very high spending score
figure, ax = plt.subplots(figsize = (10, 5))
ax.hist(df_high_sp[df_high_sp['Genre'] =='Male']['Age'], color = 'blue', alpha = 0.7, label = 'Male')
ax.hist(df_high_sp[df_high_sp['Genre'] =='Female']['Age'], color ='red', alpha = 0.7, label = 'Female')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of the Customers based on their Gender')
plt.legend()
plt.show()

**Observations:**

From the charts above, we can see that:
* Most of the customers in cluster 1 and cluster 2 are older than 35.
* Most of the customers in cluster 3 and cluster 4 are women between 20 and 30 years old.
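
The per-cluster charts above can also be condensed into a single table, which makes age and gender differences between clusters easier to compare. The following is a minimal sketch, assuming a frame shaped like `new_df` with `Genre`, `Age`, and `Cluster` columns; the rows in `demo` are made up and the output column names are hypothetical.

```python
import pandas as pd

# Made-up rows shaped like new_df (NOT the mall dataset).
demo = pd.DataFrame({
    'Genre': ['Female', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Age': [23, 45, 31, 27, 52, 38],
    'Cluster': [3, 1, 4, 3, 1, 4],
})

# One row per cluster: size, age statistics, and share of female customers.
summary = demo.groupby('Cluster').agg(
    customers=('Age', 'size'),
    median_age=('Age', 'median'),
    min_age=('Age', 'min'),
    max_age=('Age', 'max'),
    female_share=('Genre', lambda g: (g == 'Female').mean()),
)
print(summary)
```

Running the same aggregation on `new_df` would give the actual figures behind the observations.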


## 4. Summary

Clustering customer data helps us understand the customers’ characteristics. Here are some suggestions for each cluster:

* The customers in cluster 0, who have medium annual incomes and medium spending scores, can be attracted by marketing campaigns.

* The customers in cluster 1, who have high annual incomes and low spending scores, are usually older than 35. We need to find ways to attract older customers, for example special offers on the products they usually buy, and to investigate the causes of the low spending scores.

* The customers in cluster 2, who have low annual incomes and low spending scores, are usually males or older females. We need to find ways to attract these customers, for example special offers on the products they usually buy, and to investigate the causes of the low spending scores.

* The customers in cluster 3, who have low annual incomes and high spending scores, are special customers because their spending scores are high despite their low incomes. We should keep these customers by designing a loyalty program that offers discounts to members.

* The customers in cluster 4, who have high annual incomes and high spending scores, are the most valuable customers. We should keep these customers through appropriate pricing or special offers.

The clustering shows that the customers in cluster 3 and cluster 4 are usually women between 20 and 30 years old. To acquire new customers, advertising campaigns can therefore focus on women in that age range.