## Introduction

The purpose of this notebook is to identify the characteristics of customer segments in order to develop an optimal marketing strategy.

K-means clustering groups data points that are similar to one another. We then treat these groups as customer segments, examine their characteristics, and consider how they affect the overall marketing strategy.

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import chain
from collections import Counter
import warnings

warnings.filterwarnings("ignore")

## Data

The dataset contains the following information:
* Customer gender
* Age
* Annual income ($000)
* Spending score (1-100)

In [None]:
df = pd.read_csv("../input/mall-customers/Mall_Customers.csv")

Overview of dataframe

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# Set index to Customer ID

df = df.set_index("CustomerID")

## Clustering 

In [None]:
# Create new dataframe for clustering
# The features used for clustering are Annual Income and Spending Score

df_precluster = df[["Annual Income (k$)", "Spending Score (1-100)"]]
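
Before clustering, it may be worth confirming that the two feature columns have no missing values, since scikit-learn's `KMeans` does not accept NaN input. The sketch below is a minimal check; `check_features` is a hypothetical helper and the `toy` frame holds made-up values that only reuse the column names of the mall data.

```python
import pandas as pd

def check_features(frame, columns):
    """Return per-column missing-value counts and the number of duplicated rows."""
    missing = frame[columns].isna().sum()
    n_duplicates = int(frame.duplicated().sum())
    return missing, n_duplicates

# Toy stand-in for the mall data (made-up values, one deliberate gap)
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 16, None, 18],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

missing, n_duplicates = check_features(
    toy, ["Annual Income (k$)", "Spending Score (1-100)"]
)
print(missing)
print("Duplicated rows:", n_duplicates)
```

In the notebook itself the same call would take `df` in place of `toy`.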

In [None]:
# Visualise a scatterplot

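# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df)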
plt.title("Scatterplot of Annual Income and Spending Score")
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")

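To choose the number of clusters, we plot an elbow curve: the K-means score for each k from 1 to 9, where the score is the negative of the within-cluster sum of squared distances (inertia).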

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

ks = range(1, 10)
score = []

# Create the scaler once; the pipeline refits it on every iteration
scaler = StandardScaler()
for k in ks:
    # Create a KMeans instance with k clusters
    kmeans = KMeans(n_clusters=k)
    # Chain scaling and clustering so the data is standardised before fitting
    pipeline = make_pipeline(scaler, kmeans)
    model = pipeline.fit(df_precluster)
    # Append the score (negative inertia) to the list of scores
    score.append(model.score(df_precluster))

# Plot the ks vs score
plt.plot(ks, score, '-o')
plt.xlabel("Number of Clusters, k")
plt.ylabel("Score (negative inertia)")
plt.show()

Based on the elbow curve above, the optimum number of clusters lies between 5 and 7. Here we choose 5 clusters to fit the model.
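
Because the elbow leaves a range of plausible values, a second criterion can help narrow the choice. The sketch below computes the silhouette score for several values of k. It is not part of the original analysis: `silhouette_by_k` is a hypothetical helper, the fixed `random_state` and `n_init` are assumptions, and the data is a synthetic stand-in with three well-separated groups rather than the mall dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(features, ks, random_state=0):
    """Fit K-means on standardised features for each k and return {k: silhouette}."""
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    return scores

# Synthetic stand-in: three tight, well-separated groups (not the mall data)
rng = np.random.default_rng(0)
centres = np.array([[20, 20], [60, 50], [100, 80]])
points = np.vstack([c + rng.normal(scale=2.0, size=(30, 2)) for c in centres])
toy = pd.DataFrame(points, columns=["Annual Income (k$)", "Spending Score (1-100)"])

# The silhouette score is only defined for k >= 2
scores = silhouette_by_k(toy, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On the notebook's data the call would be `silhouette_by_k(df_precluster, range(2, 10))`; a higher score suggests tighter, better separated clusters.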

In [None]:
# Scale the data
scaler = StandardScaler().fit(df_precluster)
scaleddfpreC = scaler.transform(df_precluster)
# Define the K-means model, fit it once and get a cluster label for each row
kmeans = KMeans(n_clusters=5)
labels = kmeans.fit_predict(scaleddfpreC)
# Plot the customers in their original units, coloured by cluster
plt.scatter(df_precluster['Annual Income (k$)'], df_precluster['Spending Score (1-100)'], c=labels)
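
Since the model is fitted on standardised data, its cluster centres are expressed in standard deviations rather than in k$ and score points. A minimal sketch of converting them back with `StandardScaler.inverse_transform` follows; `centroids_in_original_units` is a hypothetical helper, the `random_state` is an assumption, and the `toy` values are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def centroids_in_original_units(features, n_clusters, random_state=0):
    """Fit K-means on standardised features and return centroids in the input units."""
    scaler = StandardScaler().fit(features)
    scaled = scaler.transform(features)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(scaled)
    # Undo the standardisation so the centres are readable as income and score
    centres = scaler.inverse_transform(kmeans.cluster_centers_)
    return pd.DataFrame(centres, columns=features.columns).rename_axis("cluster_group")

# Toy stand-in with two obvious groups (made-up values)
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 17, 19, 85, 87, 89],
    "Spending Score (1-100)": [10, 12, 14, 80, 82, 84],
})

centroids = centroids_in_original_units(toy, n_clusters=2)
print(centroids.round(1))
```

With the objects already defined in this notebook, the equivalent one-liner would be `scaler.inverse_transform(kmeans.cluster_centers_)`.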

In [None]:
kmeans.labels_

In [None]:
# Create new column to identify each row to its cluster group

df['cluster_group'] = kmeans.labels_
df.head()

## Visualizations

In [None]:
# Bar Chart of Cluster Group counts

sns.countplot(y='cluster_group', hue='Genre', data=df)
plt.title("Cluster counts")

Key points:
* Cluster Group 2 is the largest, with 81 customers
* Cluster Group 1 and Cluster Group 4 are the smallest, with 22 and 23 customers respectively
* Female customers dominate all Cluster Groups except Cluster Group 0, which has a higher share of male customers
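
The counts and gender shares read off the chart can also be tabulated directly. The sketch below uses `pd.crosstab`; `gender_share_by_cluster` is a hypothetical helper and the `toy` frame contains made-up rows that only mimic the `cluster_group` and `Genre` columns.

```python
import pandas as pd

def gender_share_by_cluster(frame, cluster_col="cluster_group", gender_col="Genre"):
    """Return (counts, row percentages) of gender within each cluster."""
    counts = pd.crosstab(frame[cluster_col], frame[gender_col])
    shares = (
        pd.crosstab(frame[cluster_col], frame[gender_col], normalize="index")
        .mul(100)
        .round(1)
    )
    return counts, shares

# Toy stand-in (made-up rows, not the mall data)
toy = pd.DataFrame({
    "cluster_group": [0, 0, 0, 1, 1, 1, 1],
    "Genre": ["Male", "Male", "Female", "Female", "Female", "Female", "Male"],
})

counts, shares = gender_share_by_cluster(toy)
print(counts)
print(shares)
```

Calling `gender_share_by_cluster(df)` after the `cluster_group` column has been added would give the same tables for the real clusters.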

In [None]:
# Count the customers of each gender in cluster 0
cts = df.loc[df['cluster_group'] == 0, 'Genre'].value_counts()
_ = plt.pie(cts.values, labels=cts.index, autopct='%1.0f%%')
_ = plt.title('Gender breakdown for cluster 0')

In [None]:
ax = sns.boxplot(x='cluster_group', y="Spending Score (1-100)",
                 data=df)

In terms of spending score:
* Cluster Group 1 and Cluster Group 3 have the highest spending scores
* Cluster Group 3 has a slightly higher mean of 82, compared to 79 for Cluster Group 1
* Cluster Group 0 and Cluster Group 4 show the lowest spending scores, with means of 17 and 21 respectively

In [None]:
ax = sns.boxplot(x='cluster_group', y="Age",
                 data=df)

In terms of age range:
* Cluster Group 3 has a narrow age range of 27 to 40, with a mean of 33
* Cluster Group 1 is the youngest group, with ages from 18 to 35 and a mean of 25
* Cluster Group 2 has the widest age range, from 18 to 70, with a mean of 43

In [None]:
ax = sns.boxplot(x='cluster_group', y="Annual Income (k$)",
                 data=df)

In terms of annual income:
* Cluster Group 0 has the highest mean annual income, at 88k
* Cluster Group 1 has the lowest mean annual income, at 26k

## Summary statistics for each cluster

### Cluster 0

In [None]:
df.loc[df['cluster_group'] == 0].describe()

### Cluster 1

In [None]:
df.loc[df['cluster_group'] == 1].describe()

### Cluster 2

In [None]:
df.loc[df['cluster_group'] == 2].describe()

### Cluster 3

In [None]:
df.loc[df['cluster_group'] == 3].describe()

### Cluster 4

In [None]:
df.loc[df['cluster_group'] == 4].describe()
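
The five `describe()` tables above can be hard to compare side by side. As an alternative view, the sketch below builds a single table with one row per cluster. `cluster_profile` is a hypothetical helper, and the `toy` frame uses made-up rows with the same column names as the notebook's `df`.

```python
import pandas as pd

def cluster_profile(frame, cluster_col="cluster_group"):
    """Return one row per cluster: customer count and mean of each numeric feature."""
    numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
    profile = frame.groupby(cluster_col)[numeric_cols].mean().round(1)
    # Put the cluster size first so it reads alongside the means
    profile.insert(0, "customers", frame.groupby(cluster_col).size())
    return profile

# Toy stand-in (made-up rows, not the mall data)
toy = pd.DataFrame({
    "cluster_group": [0, 0, 1, 1, 1],
    "Age": [40, 44, 25, 23, 27],
    "Annual Income (k$)": [85, 91, 25, 27, 26],
    "Spending Score (1-100)": [15, 19, 80, 78, 79],
})

profile = cluster_profile(toy)
print(profile)
```

Running `cluster_profile(df)` in this notebook would give the per-cluster sizes and means in one place.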

## Conclusion

Using K-means clustering, the mall customers in the dataset can be grouped into 5 clusters based on their spending score and annual income.

The characteristics of each Cluster Group can be summarised as follows.

### Cluster Group 0
* Majority male customers (54%)
* Tends to spend less, with the lowest average spending score of 17 out of 100
* This is despite having the highest average income of all the Cluster Groups
* Ages vary, with a higher concentration between the mid-30s and mid-40s

### Cluster Group 1
* Smallest Cluster Group by count, with a female majority
* Second highest average spending score, at 79 out of 100
* This is despite having the lowest average annual income of all the Cluster Groups
* Youngest Cluster Group, with an average age of 25

### Cluster Group 2
* Largest Cluster Group, comprising 81 customers
* Moderate level of spending, with an average spending score of 50 out of 100
* Middle income category, with an average of 55k
* Covers the full age range from 18 to 70

### Cluster Group 3
* Second largest Cluster Group, comprising 39 customers with a female majority
* Highest average spending score, at 82 out of 100
* Second highest average annual income of all the Cluster Groups
* Ages concentrated in the early 30s

### Cluster Group 4
* Second smallest Cluster Group, comprising 23 customers
* Second lowest average spending score, at 21 out of 100
* Low average income group
* Covers a wide age range, from 19 to 67 years old


## Recommendations
* Develop a marketing strategy that focuses on customers from Cluster Group 3, based on the characteristics above
* There is also a potential opportunity in Cluster Group 0, which has the highest income but a low spending score. The marketing strategy could therefore also look at how to penetrate this segment.

### Further analysis
Other data that could support further analysis:
* A larger dataset that better represents the population
* Further information such as household size, occupation and purchase categories, to better understand spending patterns