# MALL SEGMENTATION

In this kernel, I used k-Means clustering to find structure in data on mall customers based on their age, annual income, and spending score.

Instead of using sklearn's built-in methods, I have implemented the k-Means algorithm myself to understand how this simple algorithm works.

I have also defined functions for the Average Silhouette method. 

Using Plotly, I visualised the data and clusters in 3D interactive plots.

This is a great dataset to begin understanding unsupervised learning.

<div id='back'></div>
**Contents**

1. [Data Exploration and Preparation](#prep) - visualise data in interactive 3D plots, examine descriptive statistics, and scale data
2. [Running K-Means](#train) - define functions for [k-Means algorithm](#kmeans), define functions for calculating the [average silhouette score](#avg), find optimal number of clusters using average silhouette method, and run kMeans algorithm 
3. [Evaluation](#eval) - visualise clusters and evaluate result


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
#load dataset
customers = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

<div id='prep'></div>**Data Exploration and Preparation**

In [None]:
#first few entries of dataset
customers.head()

In [None]:
#dataset info
customers.info()

There are 200 training examples, and 5 features. There are no missing values.

Features:
1. Customer ID - Nominal data used to identify customers. This feature is unnecessary for our model as it has no information beyond identifying a customer.
2. Gender - Dichotomous categorical variable to label gender of customer. Will map Male/Female labels to 0/1.
3. Age - Ratio scale denoting age of customer.
4. Annual Income (in 000's dollars) - Ratio scale denoting average income of customer.
5. Spending Score (1-100) - Score assigned by the mall based on customer behaviour and spending nature. The higher the score, the more likely the customer is to purchase something, or to spend highly.

We will first explore the data to see if we can spot patterns/relationships.

What is the distribution of the Gender feature?

In [None]:
plt.figure()
plt.hist(customers['Gender'])
plt.title('Distribution of Gender')
plt.show()

There are more female than male customers in the data set.

Are there any patterns in male/female spending such that Gender will be an important factor in clustering? For example, are males or females of a particular age group and income group likely to spend/purchase more? Will we see a cluster of males in the 40-60 age group with income between 80,000 and 100,000 dollars who show the same high spending behaviour? Or perhaps women under 21 or over 60 with income over 80,000 dollars share the same high spending behaviour?

We will plot a 3D scatter plot with Age, Annual Income, and Spending Score on the x, y, and z axes, respectively. Gender will be the fourth feature, denoted by colour.

In [None]:
import plotly.io as pio
pio.renderers.default='notebook'

In [None]:
import plotly.express as px

fig = px.scatter_3d(customers, x='Age', y='Annual Income (k$)', z='Spending Score (1-100)',
              color='Gender')
fig.show()

Keeping in mind that females outnumber males in this dataset, there does not seem to be a striking pattern of behaviour in which gender is a feature. 

We will map Male/Female in Gender to 0/1 (0 indicating Male, 1 indicating Female).

In [None]:
#check unique values in Gender
print(customers['Gender'].unique())

In [None]:
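#map Male/Female to 0/1 and assign the result back to the column
#(assigning back avoids relying on in-place modification of a column selection)
customers['Gender'] = customers['Gender'].map({'Male':0, 'Female':1})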
#check
print(customers['Gender'].unique())


Let's look at the distributions of Age, Annual Income, and Spending Score (1-100).

In [None]:
#descriptive statistics
customers[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].describe()

Things to note:
* All three of the features have very similar values for each of their mean and median. This suggests that their respective distributions are fairly symmetrical.
* According to the [US Bureau of Labour Statistics](https://www.thestreet.com/personal-finance/average-income-in-us-14852178), the median annual income for full-time workers is 48,672 dollars. The median annual income in our dataset is 61,500 dollars. National income distributions are typically right-skewed, with the mean well above the median, whereas the mean and median annual income in our dataset are very similar, so the income distribution of this dataset does not represent the population distribution very well (population here literally meaning the country's population). If this dataset is a sample of frequent mall customers, then it may very well be a representative sample.
* The [median age](https://www.statista.com/statistics/241494/median-age-of-the-us-population/) of the US population as of 2018 was 38.2. The median age of this dataset is 36, so the dataset population, in terms of age, is a good representation of the general population.
* The features have different ranges, so they will need to be scaled before training the model.
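
The mean-versus-median comparison in the first bullet can be sketched on made-up numbers. The arrays and the helper name `mean_median_gap` below are illustrative assumptions, not values or functions from this notebook: a gap near zero is consistent with a symmetrical distribution, while a clearly positive gap points to a right skew.

```python
import numpy as np

def mean_median_gap(values):
    """Return (mean - median) / standard deviation as a rough symmetry check."""
    values = np.asarray(values, dtype=float)
    return (values.mean() - np.median(values)) / values.std()

# made-up samples, for illustration only
symmetric = np.array([10, 20, 30, 40, 50])       # mean 30, median 30
right_skewed = np.array([10, 20, 30, 40, 400])   # mean 100, median 30

gap_symmetric = mean_median_gap(symmetric)
gap_skewed = mean_median_gap(right_skewed)
print(gap_symmetric, gap_skewed)
```

On the real data the same idea would be applied to each column of `customers`; this is only a quick heuristic, not a formal test of symmetry.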


Are any of the continuous variables linearly correlated with each other?

In [None]:
#linear correlation coefficients between Age, Annual Income, and Spending Score
corr_customers = customers[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].corr()
mask = np.triu(np.ones_like(corr_customers, dtype=bool))
sns.heatmap(corr_customers, mask=mask,annot=True, cmap='BuPu')
plt.show()

The features Age, Annual Income, and Spending Score are not strongly correlated with each other. The strongest correlation is between Age and Spending Score, which have a negative correlation coefficient of -0.33.

We will be clustering the data using a k-means algorithm that calculates clusters using Euclidean distance. This type of algorithm only works with continuous variables, so we will use Age, Annual Income, and Spending Score to cluster the data. The binary feature, Gender, can safely be left out, as we saw above that there is no distinct behaviour in which Gender is a prominent factor.

**3D scatter plot of Customers**

In [None]:
import plotly.io as pio
pio.renderers.default='notebook'

import plotly.graph_objs as go

data = go.Scatter3d(x = customers['Age'], y = customers['Annual Income (k$)'], z = customers['Spending Score (1-100)'], 
                    mode ='markers', marker = dict(size = 4, color = 'crimson',  line=dict(width=2, color='DarkSlateGrey')))

layout =  dict(title = 'Customers',
              scene = dict(xaxis= dict(title= 'Age',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Annual Income',ticklen= 5,zeroline= False),
              zaxis= dict(title= 'Spending Score',ticklen= 5,zeroline= False))
             )

fig = go.Figure(dict(data = data, layout = layout))

fig.show()

Points to note from 3D plot:

* People who earn roughly the median income are much more likely to have a spending score around 50, and much less likely to have a high or low spending score, when compared to low income and high income earners.

* Median earners across all age groups spend in moderation.

* Of the low income earners, people under 30 are more likely to have a high spending score.

* High income earners tend to either have a very high spending score or a low spending score. Moderation in spending is not a feature in their behaviour.


**Scale data to be between 0 and 1.**
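
As a minimal sketch of what min-max scaling does, the formula is x' = (x - min) / (max - min), applied per feature. The helper `min_max` and the ages below are made up for illustration; `minmax_scale` with its default range of 0 to 1 should give the same result.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

ages = np.array([18, 30, 44, 70])   # made-up ages, not from the dataset
scaled = min_max(ages)
print(scaled)   # 18 maps to 0, 44 to 0.5, 70 to 1
```

Scaling matters here because Euclidean distance would otherwise be dominated by whichever feature has the largest numeric range.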

In [None]:
#scaling the features
from sklearn.preprocessing import minmax_scale

for col in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    customers[col + '_scaled'] = minmax_scale(customers[col])
    
#check
customers.columns

[back to Contents](#back)

<div id='train'></div>**K-Means Clustering**

<div id = 'kmeans'></div>
The k-means clustering algorithm:

**1.** Initialise cluster centroids (randomly pick training examples to be initial cluster centers)  
**2.** Iterate over a) and b) for a specified number of iterations or until some precision is reached:   
         a) Assign each point to the nearest centroid as measured by Euclidean distance  
         b) Assign new cluster centroids, which are the mean points of each of the clusters
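
One pass of steps a) and b) can be sketched compactly with NumPy broadcasting. This is a vectorised illustration on a toy array, not the loop-based implementation used in this kernel; the name `kmeans_step` and the toy points are assumptions made for the example.

```python
import numpy as np

def kmeans_step(X, centroids):
    """Run one assignment step and one update step of k-means."""
    # a) distance from every point to every centroid, shape (m, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # b) each new centroid is the mean of the points assigned to it
    #    (an empty cluster keeps its previous centroid)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(centroids.shape[0])
    ])
    return labels, new_centroids

# toy data: two obvious groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = X_toy[[0, 2]]                      # initial centroids picked from the data
labels, new_centroids = kmeans_step(X_toy, start)
print(labels)          # [0 0 1 1]
print(new_centroids)   # [[ 0.   0.5] [10.  10.5]]
```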

We will first define some functions.

In [None]:
def findClosestCentroids(X, centroids):
    
    '''
    Calculates the closest centroid in centroids for each training example in X.
    Returns vector of centroid assignments for each training example.
    '''
    
    #set K
    K = centroids.shape[0]
    
    #vector of cluster assignments
    idx = np.zeros(X.shape[0])
    
    dist = np.zeros(K)
    for i in range(X.shape[0]):
        for k in range(K):
            dist[k] = np.sum((X[i,:] - centroids[k,:])**2)**0.5
        idx[i] = np.argmin(dist)
        
    return idx

In [None]:
def computeCentroids(X, idx, K):
    
    '''
    Calculates new centroids by computing the mean of the 
    data points assigned to each centroid. Returns matrix
    where each row is a new centroid's point.
    '''
    
    #no. of data points
    m = X.shape[0]
    #dimension of points
    n = X.shape[1]
    
    centroids = np.zeros((K,n))
    
    for k in range(K):
        count = 0
        s = np.zeros((1,n))
        for i in range(m):
            if idx[i] == k:
                s = s + X[i,:]
                count += 1
        centroids[k,:] = s/count
        
    return centroids

In [None]:
def RandInitialCentroids(X, K):
    
    '''
    Initializes K centroids by randomly selecting K points in X.
    '''
    
    centroids = np.zeros((K, X.shape[1]))
    
    #Randomly reorder the indices of examples
    randidx = np.random.permutation(range(X.shape[0]))
    #Take the first K examples
    centroids = X[randidx[0:K],:]
    
    return centroids

Distortion is the average squared Euclidean distance between each training example and the centroid of the cluster to which it has been assigned.
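
As a small worked example, the distortion function in this kernel sums squared Euclidean distances and divides by the number of points. The three points, two centroids and assignments below are made up so that the arithmetic can be checked by hand.

```python
import numpy as np

# made-up points, centroids and assignments
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
centroids_toy = np.array([[0.0, 1.0], [4.0, 0.0]])
idx_toy = np.array([0, 0, 1])   # cluster index of each point

# squared distance from each point to its own centroid: [1, 1, 0]
sq_dist = np.sum((X_toy - centroids_toy[idx_toy]) ** 2, axis=1)
distortion_toy = sq_dist.mean()   # (1 + 1 + 0) / 3
print(distortion_toy)
```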

In [None]:
def kMeansDistortion(X, idx, centroids):
    
    '''
    Calculates the average squared Euclidean distance between the examples 
    and the centroid of the cluster to which each example has been assigned.
    '''
    
    #no. of data points
    m = X.shape[0]
    
    distortion = 0
    
    for i in range(X.shape[0]):
        closest = int(idx[i])
        distance = np.sum((X[i,:] - centroids[closest])**2)
        distortion = distortion + distance
        
    distortion = distortion/m
    
    return distortion

**The k-Means algorithm function:**

In [None]:
def kMeans(X, K, max_iters):
           
    '''
    Run the kmeans algorithm for specified number of iterations 
    and returns final centroids, index of closest centroids for 
    each example (idx), final distortion, and distortion history.
    '''
    distortion_history = []
    distortion = 0
    centroids = RandInitialCentroids(X, K)       

    for i in range(max_iters):
        idx = findClosestCentroids(X, centroids)
        distortion = kMeansDistortion(X, idx, centroids)
        distortion_history.append(distortion)
        centroids = computeCentroids(X, idx, K)
        
    return centroids, idx, distortion, distortion_history           
           

Let's test the algorithm and plot the graph of distortion_history to make sure everything is working properly. Distortion should decrease with the number of iterations and eventually converge to some value. We will run a test with 5 clusters and 30 iterations.

In [None]:
centroids, idx, distortion, distortion_history = kMeans(np.array(customers[['Age_scaled', 'Annual Income (k$)_scaled', 'Spending Score (1-100)_scaled']]), 5, 30)

In [None]:
plt.figure()
plt.plot(distortion_history)
plt.title('Distortion History')
plt.xlabel('Iteration')
plt.ylabel('Distortion')
plt.show()

In [None]:
print(distortion_history)

In this model, k-Means converges before 10 iterations have completed. 

**K-Means with multiple random initialisations**:

The centroids at which the algorithm converges depend on the starting centroids. In order to avoid ending up at a local optimum, the following function runs the k-Means algorithm a specified number of times, and picks the result with the lowest distortion.
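
To see why the starting centroids matter, here is a minimal sketch on four made-up points at the corners of a wide rectangle. The helper `lloyd` is a hypothetical, stripped-down k-means written only for this illustration: one starting pair converges to a poor solution, the other to a much better one, and keeping the lowest distortion selects the better run.

```python
import numpy as np

def lloyd(X, centroids, n_iters=10):
    """Plain k-means from fixed starting centroids; returns centroids, labels, distortion."""
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    distortion = np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))
    return centroids, labels, distortion

# corners of a 10 x 1 rectangle
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# two different choices of starting points (row indices into X_toy)
starts = [(0, 1), (0, 2)]
results = [lloyd(X_toy, X_toy[list(s)]) for s in starts]
distortions = [r[2] for r in results]
best = min(results, key=lambda r: r[2])

print(distortions)   # the first start splits top/bottom, the second left/right
print(best[2])
```

Starting from the two left-hand corners, the algorithm settles on a top/bottom split with distortion 25, whereas the left/right split found from the second start has distortion 0.25.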

In [None]:
#run kmeans with specified different random initialisations and pick one with lowest distortion
def kMeansRuns(X, K, max_iters, init_runs):
    '''
    Run the kMeans algorithm for specified number of random initialisations, init_runs, 
    and return the result with the lowest distortion.
    '''
    for r in range(init_runs):
        if r == 0:
            centroids, index, distortion, distortion_hist = kMeans(X, K, max_iters)
            distortion_lowest = distortion
        else:
            current_centroids, current_index, distortion, current_distortion_hist = kMeans(X, K, max_iters)
            if distortion_lowest > distortion:
                centroids= current_centroids
                index = current_index
                distortion_lowest = distortion
                
    return centroids, index, distortion_lowest


**Number of Clusters**:

Since there are not any absolute answers to choosing the number of clusters, we can manually pick the number of clusters or we can utilise methods like the Elbow method to help guide our selection. I have chosen to use the average silhouette method, and have defined the functions for it below.

<div id='avg'></div>
**Average silhouette** measures the quality of clustering. We calculate the average silhouette for a range of numbers of clusters, and pick the number of clusters which has the highest average silhouette score.



First, we calculate the silhouette score for each training example, i. 

The silhouette score is     <font size = '4'> $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$ </font>   where a(i) is the average Euclidean distance between training example i and all other points in its cluster, and b(i) is the smallest average distance between training example i and the points of any other cluster (that is, the average distance to its nearest neighbouring cluster).

A silhouette score close to 1 indicates that the training example is much closer to its own cluster than to its nearest other cluster, a score of 0 indicates that the training example is halfway between the cluster it is assigned to and its nearest other cluster, and a score of -1 indicates that the training example is nowhere near its own cluster in comparison with the nearest other cluster.


After we have calculated the silhouette score for each training example, we average it. 
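
A small worked example may make the definition concrete. The sketch below uses four made-up one-dimensional points in two clusters and follows the standard convention that a point is excluded from its own a(i); the function name `silhouette_values` is an assumption for this illustration and is separate from the functions defined in this kernel.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette score of every point, using Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                  # exclude the point itself
        if not same.any():
            continue                     # singleton cluster: score stays 0
        a = D[i, same].mean()
        b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# made-up data: cluster 0 = {0, 1}, cluster 1 = {4, 6}
X_toy = np.array([[0.0], [1.0], [4.0], [6.0]])
labels_toy = np.array([0, 0, 1, 1])

scores = silhouette_values(X_toy, labels_toy)
print(scores)          # first point: a = 1, b = (4 + 6) / 2 = 5, s = 4 / 5 = 0.8
print(scores.mean())   # the average silhouette
```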



In [None]:
from scipy.spatial import distance

def SilhouetteScore(x_i, X, idx, K):
    '''
    For a given training example (with index x_i), calculates and returns its silhouette score.
    
    First, the function calculates the average distance, a_i, between the given training example
    and all other points in the cluster it belongs to. Then, for every other cluster, it calculates 
    the average distance between the training example and the points in that cluster, and takes the 
    smallest of these averages as b_i. Using a_i and b_i, it calculates the silhouette score of the 
    given training example.
    '''
    #calculate average distance between x_i and all points in its cluster
    
    #training example 
    point = X[x_i]
    #cluster index of training example
    idx_point = idx[x_i]
    
    #list of distances between point and other points in own cluster
    own_cluster_distances = np.empty(0)
    #loop over training examples' assigned cluster index, find points in 
    #own cluster and calculate euclidean distance
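    for i in range(idx.shape[0]):
        #skip the point itself so that a_i averages over the *other* points only
        if idx[i] == idx_point and i != x_i:
            own_cluster_distances = np.append(own_cluster_distances, distance.euclidean(point, X[i]))
    
    #a point that is alone in its cluster has no a_i; its silhouette score is 0 by convention
    if own_cluster_distances.shape[0] == 0:
        return 0.0
    
    #average distance between point and all other points in own cluster
    a_i = np.sum(own_cluster_distances)/(own_cluster_distances.shape[0])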
    
    #for each k in range K, calculate average distance between point and all other points in cluster k
    avg_cluster_distances = np.empty(0)
    #range of K without given training example's own cluster
    other_clusters = [r for r in range(K) if r != idx_point]
    
    for k in other_clusters:
        #distances between point and all points in cluster k
        k_distances = np.empty(0)
        #all points in cluster k
        k_cluster = X[idx==k]
        #number of points in cluster k
        k_len = k_cluster.shape[0]
        for n in range(k_len):
            k_distances = np.append(k_distances, distance.euclidean(point, k_cluster[n]))
        #average distance between point and all points in cluster k appended to
        #avg_cluster_distances array
        #empty clusters are skipped: recording a distance of 0 for them would
        #wrongly make an empty cluster look like the closest one
        if k_len != 0:
            avg_cluster_distances = np.append(avg_cluster_distances, np.sum(k_distances)/k_len)
        
        
    #find closest cluster in avg_cluster_distances
    b_i = np.min(avg_cluster_distances)
    
    silhouette_score = (b_i - a_i)/np.max([a_i, b_i])
    
    
    return silhouette_score 

In [None]:
def AverageSilhouette(X, idx,  K):
    '''
    Calculates and returns the average silhouette for given number of clusters, K.
    
    Average silhouette is the average of the silhouette scores of all the training examples.
    '''
    silhouette_scores = np.empty(0)
    #loop over all training examples and calculate their silhouette score
    for i in range(X.shape[0]):
        silhouette_i = SilhouetteScore(i, X, idx, K)
        silhouette_scores = np.append(silhouette_scores, silhouette_i)
    
    #calculate average of all scores
    avg_silhouette = np.sum(silhouette_scores)/len(silhouette_scores)
    
    return avg_silhouette
        


The following function runs k-Means for each K in the range 2 - K_range, calculates the average silhouette, and plots a graph of the average silhouette score for each K.

In [None]:
def PlotAvgSilhouettes(X, K_range, max_iters, init_runs):
    
    '''
    Runs kMeans and plots the average silhouette scores for 
    each number of clusters in K_range.
    
    '''
    clusters_avg_sil = np.empty(0)
    #minimum of K_range must be 2
    for K in range(2, K_range+1):
        centroids, idx, distortion_lowest = kMeansRuns(X, K, max_iters, init_runs)
        k_avg_sil = AverageSilhouette(X, idx,  K)
        clusters_avg_sil = np.append(clusters_avg_sil, k_avg_sil)
        
    
    plt.figure(figsize = (12.8, 9.6))
    plt.plot(np.arange(2,K_range+1,1), clusters_avg_sil)
    plt.title('Average Silhouette for number of clusters K')
    plt.xlabel('K')
    plt.ylabel('Average Silhouette')
    plt.show()
    
    
    return None

In [None]:
#features as an array, X
X = np.array(customers[['Age_scaled', 'Annual Income (k$)_scaled', 'Spending Score (1-100)_scaled']])

We will calculate and plot the average silhouette score for each K number of clusters in the range 2-15 (inclusive). We will run K-Means with 50 random initialisations, and for 20 iterations per random initialisation (we saw above that, for this model, the function converges before 10 iterations).

In [None]:
import warnings
warnings.simplefilter('error')

#calculate and plot graph of average silhouettes for 2-15 clusters
PlotAvgSilhouettes(X, 15, 20, 50)

From the plot above, the maximum average silhouette score is for the model with nine clusters. 

We will now run k-Means with nine clusters, using 50 random initialisations and 20 iterations for each random initialisation.

In [None]:
centroids, ind, distortion = kMeansRuns(X, 9, 20, 50)

In [None]:
#extract unscaled features into variable C so we can plot and understand the results
C = np.array(customers[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']])

#get clusters
cluster0 = C[ind==0]
cluster1 = C[ind==1]
cluster2 = C[ind==2]
cluster3 = C[ind==3]
cluster4 = C[ind==4]
cluster5 = C[ind==5]
cluster6 = C[ind==6]
cluster7 = C[ind==7]
cluster8 = C[ind==8]

In [None]:
print(cluster0.shape)
print(cluster1.shape)
print(cluster2.shape)
print(cluster3.shape)
print(cluster4.shape)
print(cluster5.shape)
print(cluster6.shape)
print(cluster7.shape)
print(cluster8.shape)

<div id='eval'></div>**Evaluation**

We will now plot the clusters in a 3D scatter plot.

In [None]:
#cluster 3d scatter plot

import plotly.graph_objs as go


trace0 = go.Scatter3d(x = cluster0[:,0], y = cluster0[:,1], z = cluster0[:,2], 
                      mode = 'markers', name='Cluster0', marker = dict(size = 4, color = 'black'))

trace1 = go.Scatter3d(x = cluster1[:,0], y = cluster1[:,1], z = cluster1[:,2], 
                      mode = 'markers', name='Cluster1', marker = dict(size = 4, color = 'green'))

trace2 = go.Scatter3d(x = cluster2[:,0], y = cluster2[:,1], z = cluster2[:,2], 
                      mode = 'markers', name='Cluster2', marker = dict(size = 4, color =  'chartreuse'))

trace3 = go.Scatter3d(x = cluster3[:,0], y = cluster3[:,1], z = cluster3[:,2], 
                      mode = 'markers', name='Cluster3', marker = dict(size = 4, color =  'maroon'))

trace4 = go.Scatter3d(x = cluster4[:,0], y = cluster4[:,1], z = cluster4[:,2], 
                      mode = 'markers', name='Cluster4', marker = dict(size = 4, color =  'hotpink'))

trace5 = go.Scatter3d(x = cluster5[:,0], y = cluster5[:,1], z = cluster5[:,2], 
                      mode = 'markers', name='Cluster5', marker = dict(size = 4, color =  'crimson'))

trace6 = go.Scatter3d(x = cluster6[:,0], y = cluster6[:,1], z = cluster6[:,2], 
                      mode = 'markers', name='Cluster6', marker = dict(size = 4, color =  'cyan'))


trace7 = go.Scatter3d(x = cluster7[:,0], y = cluster7[:,1], z = cluster7[:,2], 
                      mode = 'markers', name='Cluster7', marker = dict(size = 4, color =  'darkblue'))


trace8 = go.Scatter3d(x = cluster8[:,0], y = cluster8[:,1], z = cluster8[:,2], 
                      mode = 'markers', name='Cluster8', marker = dict(size = 4, color =  'chocolate'))

data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8]


layout = dict(title = 'Mall Segmentation Clusters',
              scene = dict(xaxis= dict(title= 'Age',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Annual Income',ticklen= 5,zeroline= False),
              zaxis= dict(title= 'Spending Score',ticklen= 5,zeroline= False))
             )

fig = go.Figure(dict(data = data, layout = layout))

fig.show()


Things to note from above plot:

* We can broadly see the clusters we identified in the data before training, e.g. the high spending young people with a low income. 
* The moderate spenders with around median income are split into three age groups.
* There is a cluster of high income earners with a higher than median spending score.
* There are two clusters of high income earners with a lower than median spending score.
* There is a cluster of low income, young to middle age people with low spending scores.
* There is a cluster of older, low income earners with low spending scores.








In [None]:
#average silhouette score
print('The average silhouette score of this model with 9 clusters is {:0.2f}.'.format(AverageSilhouette(X, ind,  9)))

The average silhouette score shows that this model is reasonably well clustered. As this is an average score, it doesn't reflect the fact that some clusters have a better clustering quality than others. For example, the three clusters in the middle of the plot are 'tighter', with all their points relatively closer to each other, than the other clusters.
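
One way to look past the overall average is to average the silhouette scores within each cluster. The sketch below uses `silhouette_samples` from `sklearn.metrics` on made-up points, one tight cluster and one loose cluster, rather than the functions defined in this kernel; the variable names and data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# made-up data: cluster 0 is tight, cluster 1 is spread out
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 6.0]])
labels_demo = np.array([0, 0, 1, 1])

# silhouette score of every point, then the mean within each cluster
per_point = silhouette_samples(X_demo, labels_demo)
per_cluster = {
    int(k): float(per_point[labels_demo == k].mean())
    for k in np.unique(labels_demo)
}
print(per_cluster)   # the tight cluster scores noticeably higher
```

For this notebook, the same call could presumably be made with `X` and the cluster index array cast to integers, to see which of the nine clusters pull the average down.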

We can use the descriptive statistics of each cluster, and procure additional information, to identify the demographics and spending patterns of each group. We can then target products and services accordingly.
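
A minimal sketch of such a per-cluster summary with pandas is shown below. The small data frame and its `cluster` column are made up for illustration; in this notebook the labels would come from the index array returned by the final k-Means run, which is an assumption about how one might wire it up rather than something done above.

```python
import pandas as pd

# made-up customers with hypothetical cluster labels
demo = pd.DataFrame({
    'Age': [22, 25, 60, 64],
    'Annual Income (k$)': [20, 24, 80, 90],
    'Spending Score (1-100)': [80, 76, 15, 25],
    'cluster': [0, 0, 1, 1],
})

# mean, minimum and maximum of every feature within each cluster
summary = demo.groupby('cluster').agg(['mean', 'min', 'max'])
# number of customers in each cluster
sizes = demo['cluster'].value_counts().sort_index()

print(summary)
print(sizes)
```

Reading the rows side by side gives a quick profile of each segment, for example young, low income, high spending customers versus older, high income, low spending ones.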

[back to Contents](#back)