In [None]:
# Import the data and check the columns

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv', usecols=['Gender', 'Age', 'Annual Income (k$)','Spending Score (1-100)'])
df.columns = ['Gender', 'Age', 'Income', 'Score']
df.head()

## Let's do some EDA

First, I would like an overview of the features using the simplest summary statistics.

In [None]:
df.describe()

###### Now let's try to relate the problem statement to the summary statistics before going deeper into the EDA.

# Problem Statement

##### You own the mall and want to understand which customers can be easily converted (the target customers), so that this insight can be given to the marketing team and they can plan their strategy accordingly.

So we want to tell the marketing team which group of customers can be converted most easily, so they can plan a strategy for that group.

The easiest answer would be: target the ones with the highest score. But who are the ones with the highest score? What if there are too few of them to give a decent return on the investment? What if their incomes are not very high?

I would argue that we should target the group with the largest number of customers that still has a decent score. Let's assume a decent score is between 50 and 80, based on the mean score from the summary statistics.

For this task we will pre-cluster our customers by age distribution, find the age group with the most customers, then check whether they have a decent score and whether their income is sufficient to sustain that score.
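
The pre-clustering by age described here can be sketched with `pd.cut`. This is only a rough illustration: the bin edges, the helper name `summarize_age_groups` and the small demo frame are assumptions, not part of the original analysis. On the real data you would pass `df` instead of `demo`.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this notebook (not the real data).
demo = pd.DataFrame({
    'Age':    [19, 23, 31, 35, 38, 42, 47, 49, 55, 63],
    'Income': [15, 40, 78, 60, 85, 54, 71, 62, 48, 30],
    'Score':  [39, 81, 75, 55, 20, 60, 52, 58, 45, 14],
})

def summarize_age_groups(data, edges=(17, 30, 50, 70)):
    """Count customers and average Income/Score per age bin (edges are assumed)."""
    # Bins are right-closed: (17, 30], (30, 50], (50, 70]
    labels = [f'{lo + 1}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]
    groups = pd.cut(data['Age'], bins=list(edges), labels=labels).rename('age_group')
    return (data.groupby(groups, observed=True)
                .agg(customers=('Age', 'count'),
                     mean_income=('Income', 'mean'),
                     mean_score=('Score', 'mean')))

age_summary = summarize_age_groups(demo)
print(age_summary)
```

The row with the largest `customers` value is the candidate age group; its `mean_score` and `mean_income` can then be compared with the 50 to 80 score range and the overall mean income.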

Now we can go ahead with the EDA with this in mind.

## Checking the correlation between the features

In [None]:
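# Gender is a string column, so restrict the correlation matrix to the numeric features
_ = sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm', fmt='.1g')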

It does not look like we have a strong correlation between any pair of features.

Income and score do have a positive correlation, but it is too weak to be meaningful, so we could guess that most people can keep their score high regardless of their income.

They could rely on their credit card to keep spending.

That would not be a problem if their income is sufficient to sustain the score; we could say that an income above the mean would be fine. That's one reason we should not simply target the ones with the highest score.
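
One way to check the "income above the mean" idea is to measure, among high-score customers, the share whose income is above the overall mean. The sketch below is an assumption-laden illustration: the cutoff of 70, the helper name and the demo rows are all made up. On the notebook data you would pass `df`.

```python
import pandas as pd

# Made-up rows (not the real data).
demo = pd.DataFrame({
    'Income': [15, 20, 45, 60, 62, 75, 88, 100],
    'Score':  [80, 10, 55, 50, 90, 48, 85, 12],
})

def share_high_score_above_mean_income(data, score_cutoff=70):
    """Among customers scoring above `score_cutoff`, the share earning above the mean income."""
    high = data[data['Score'] > score_cutoff]
    return float((high['Income'] > data['Income'].mean()).mean())

share = share_high_score_above_mean_income(demo)
print(share)
```

A low share would support the concern that many high scores are not backed by income.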

Let's keep going, now checking the distributions of the features with some CDFs.

In [None]:
fig, axes = plt.subplots(1,3, figsize=(15,5))

ax = sns.ecdfplot(data=df, x='Age', ax=axes[0])
ax = sns.ecdfplot(data=df, x='Income', ax=axes[1])
ax = sns.ecdfplot(data=df, x='Score', ax=axes[2])

plt.show()

## What the distribution and the correlation tell us:

Since we want the age group with the most customers, we can read from the distribution that approximately 50% of customers are between 30 and 50 years old.

The income ECDF tells us that 80% of the customers have an income of 80k or less.

The score ECDF shows that 80% of the customers have a score of 70 or less.
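
Values read off an ECDF plot by eye are approximate, so it can help to compute them directly. The helpers `ecdf_at` and `share_between` below are hypothetical names, and the income list is made up for illustration. On the notebook data you would pass `df['Income']`, `df['Score']` or `df['Age']`.

```python
import numpy as np

def ecdf_at(values, threshold):
    """Empirical CDF at `threshold`: the share of observations <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))

def share_between(values, low, high):
    """Share of observations with low <= value <= high."""
    values = np.asarray(values)
    return float(np.mean((values >= low) & (values <= high)))

# Made-up incomes in k$ (not the real data).
demo_income = [15, 28, 40, 54, 60, 62, 71, 78, 85, 120]

print(ecdf_at(demo_income, 80))            # share earning 80k or less
print(share_between(demo_income, 40, 80))  # share earning between 40k and 80k
```

For example, `share_between(df['Age'], 30, 50)` would give the exact proportion of customers in the 30 to 50 age range.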

In [None]:
g = sns.pairplot(df)
#g.map_upper(sns.histplot)
#g.map_lower(sns.kdeplot, fill=True)
g.map_diag(sns.histplot, kde=True)

plt.show()

Here we can see that score and income form natural clusters:

The cluster with the most customers has average income and average score.
We also have customers with high income and low score, who could be the target as they could be easy to convert.
We also have groups with low income and low score, low income and high score, and high income and high score, but for now I would say they are not the target for the marketing team.

Let's keep going and try to find a sweet spot between our age group, income and score.

In [None]:
fig, axes = plt.subplots(1,3, figsize=(15,5))

_= sns.histplot(data=df, x='Age', y = 'Income', ax=axes[2])
_= sns.histplot(data=df, x='Age', y = 'Score',  ax=axes[1])
_= sns.histplot(data=df, x='Score', y = 'Income',  ax=axes[0])
plt.show()

The first plot shows more clearly that most customers have an average score and an average income.

Customers in our 30 to 50 age group appear to have above-average scores: most of them score above 40, and a large number score around 70.

Their income is also above the average, although the high-income customers in this group are usually around 30 years old.

### Conclusions of the EDA before clustering

We saw that we cannot rely on the score alone to plan a marketing strategy: it has no strong positive correlation with income, so customers can keep their score high regardless of their income. Their score may therefore not be sustainable, and the conversion rate for this group could be unsatisfying.

Although income seems to be the easiest feature for the marketing team to target, the low proportion of customers with high income (above 80k) makes the return on investment of a plan targeting high-income people unattractive.

So, if the marketing team's conversion rate is 50% both for customers between 30 and 50 years old and for customers with income above 80k, they would convert more customers by targeting the 30 to 50 age group than by targeting the high-income customers.
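
The arithmetic behind this comparison can be written out as a small sketch. The group shares come from the ECDF readings earlier (about 50% aged 30 to 50, about 20% earning above 80k); the total of 200 customers and the helper name are assumptions for illustration.

```python
def expected_conversions(n_customers, group_share, conversion_rate):
    """Expected converted customers = total customers * group share * conversion rate."""
    return n_customers * group_share * conversion_rate

n_customers = 200   # assumed total number of customers, for illustration
rate = 0.5          # the 50% conversion rate assumed in the text

by_age = expected_conversions(n_customers, 0.50, rate)     # about 50% are aged 30 to 50
by_income = expected_conversions(n_customers, 0.20, rate)  # about 20% earn above 80k

print(by_age, by_income)
```

With equal conversion rates, the comparison reduces to the group sizes, which is why the larger age group wins.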

As the largest share of customers is between 30 and 50 years old and has an income that can sustain their score, I would recommend, based on the EDA alone, targeting this group.

## Now let's head to the clustering.

First, I scaled the data because the features have very different variances, and in KMeans a feature's variance determines its influence on the distances. The elbow plot of the inertia suggests that 4 is a good number of clusters, because beyond that number the inertia decreases at a slower rate, so it is a good tradeoff. After running KMeans I stored the cluster labels in a column named Classification, to see what the labels tell us about the customers.
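
To make the scaling argument concrete, here is a minimal sketch of standardization on made-up rows (not the real data). It applies z = (x - mean) / std per column, which is the transformation `StandardScaler` performs with its default settings.

```python
import numpy as np

# Made-up [Age, Income, Score] rows (not the real data).
demo = np.array([
    [25.0, 20.0, 80.0],
    [30.0, 120.0, 78.0],
    [55.0, 22.0, 15.0],
    [60.0, 118.0, 12.0],
])

# Standardize each column: subtract its mean, divide by its (population) standard deviation.
z = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(demo.std(axis=0))  # raw spreads differ between columns
print(z.std(axis=0))     # after scaling, every column has unit spread
```

Because KMeans works with Euclidean distances, a column with a larger raw spread would otherwise contribute more to every distance than the others.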



In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

samples = df[['Age', 'Income', 'Score']]

# Standardize the features so each one contributes equally to the distances
scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# Fit KMeans for k = 1..9 and record the inertia (within-cluster sum of squares)
inertia = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(samples_scaled)
    inertia.append(km.inertia_)

sns.lineplot(x=list(range(1, 10)), y=inertia)
plt.title('Elbow method inertia')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

In [None]:
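# Cluster label numbers are arbitrary; fixing random_state keeps them stable between runs
Scaledmodel = KMeans(n_clusters=4, n_init=10, random_state=42)
Scaledmodel.fit(samples_scaled)
scaled_prediction = Scaledmodel.predict(samples_scaled)
df['Classification'] = scaled_prediction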


## EDA with the classification labels

To see where our clusters lie in terms of Income and Score relative to Age, we have two scatter plots. The first one is Age versus Income with Score as the marker size, and the second is Age versus Score with Income as the marker size.

Both scatter plots give an idea of where the clusters are. We can see that one group has a good balance between Income and Score, while another group has a below-average score but an average income above 80k.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15,5))
_ = sns.scatterplot(data = df, x='Age', y='Income', hue='Classification',palette='deep', size='Score', sizes=(10,500), alpha=0.5, ax=axes[0])
_ = sns.scatterplot(data = df, x='Age', y='Score', hue='Classification',palette='deep', size='Income', sizes=(10,500), alpha=0.5, ax=axes[1])

In [None]:
fig, axes = plt.subplots(1,3, figsize=(15,5))
_ = sns.boxplot(data = df, x='Classification', y='Income',palette='deep',ax=axes[0])
_ = sns.boxplot(data = df, x='Classification', y='Score',palette='deep',ax=axes[1])
_ = sns.boxplot(data = df, x='Classification', y='Age',palette='deep',ax=axes[2])

# Conclusions from clustering and advice for the marketing team.
In terms of target groups, I would recommend targeting the groups with income above 80k. We have two groups in this situation: one of them has a high score, while the other has the lowest score, which can indicate some potential for conversion. Customers in both groups are between 30 and 47 years old.

The other two groups have incomes below the average, although the score of one of them is the second highest. I would not recommend them as a target, because their score does not seem to be sustainable, so a marketing strategy targeting groups with these characteristics would have little room to convert customers. One of these groups is below 30 years old and the other is above 50 years old.
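
The group descriptions here are read from the box plots; a per-cluster summary table is one way to check them numerically. The sketch below uses a made-up frame with a Classification column (not the real data), and `profile_clusters` is a hypothetical helper name. On the notebook data you would pass `df`.

```python
import pandas as pd

# Made-up rows with a cluster label column (not the real data).
demo = pd.DataFrame({
    'Age':            [22, 25, 33, 36, 41, 44, 58, 65],
    'Income':         [25, 30, 88, 95, 86, 99, 45, 50],
    'Score':          [75, 80, 85, 90, 15, 20, 45, 50],
    'Classification': [0, 0, 1, 1, 2, 2, 3, 3],
})

def profile_clusters(data, label_col='Classification'):
    """Cluster size plus the mean Age, Income and Score of each cluster."""
    profile = data.groupby(label_col)[['Age', 'Income', 'Score']].mean()
    profile.insert(0, 'customers', data.groupby(label_col).size())
    return profile

cluster_profile = profile_clusters(demo)
print(cluster_profile)

# Clusters whose mean income exceeds 80 (k$), the threshold used in the text.
high_income = cluster_profile[cluster_profile['Income'] > 80].index.tolist()
print(high_income)
```

Looking at the `customers` column alongside the means also shows how many people each recommendation would actually reach.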

If we want, we could push this further and label each group: groups 2 and 3 as high priority, group 1 as medium priority and group 0 as low priority. We could then customize the advertising for each group and predict the label of new customers, so that existing customers receive ads tailored to their group and new customers receive ads tailored to their predicted label.
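
Predicting the label of a new customer could look like the sketch below. It is an illustration under assumptions, not the notebook's fitted model: the training rows are made up, and the scaler and KMeans are wrapped in a scikit-learn pipeline so that new rows get the same scaling as the training rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up [Age, Income, Score] rows forming four obvious groups (not the real data).
train = np.array([
    [22, 25, 75], [25, 30, 80], [24, 28, 78],
    [33, 88, 85], [36, 95, 90], [35, 90, 88],
    [41, 86, 15], [44, 99, 20], [43, 92, 18],
    [58, 45, 45], [65, 50, 50], [62, 48, 47],
], dtype=float)

# Scaler and KMeans in one pipeline; the seed is fixed only to make reruns repeatable.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
pipeline.fit(train)

# Two hypothetical new customers: [Age, Income, Score]
new_customers = np.array([[34.0, 91.0, 87.0], [42.0, 90.0, 17.0]])
new_labels = pipeline.predict(new_customers)
print(new_labels)
```

Cluster label numbers are arbitrary and depend on the initialization, so any mapping from label to priority should be set after inspecting the fitted clusters rather than hard-coded in advance.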

