# Know your customers

One of the most common applications of KMeans is customer segmentation. Here we use a very simple dataset, Mall Customers, to try to discover customer segments.

0. Import the usual libraries

In [17]:
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

1. Import the ```Mall_Customers.csv``` dataset

In [18]:
dataset = pd.read_csv("/Users/qxzjy/vscworkspace/dsfs-ft-34/06_UNSUPERVISED_MACHINE_LEARNING/01_KMEANS/02_EXERCICES/data/Mall_Customers.csv")
dataset

,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


In [19]:
print("Number of rows : {}".format(dataset.shape[0]))
print("Number of columns : {}".format(dataset.shape[1]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include="all")
display(data_desc)
print()

print("Data types: ")
display(dataset.dtypes)

print("Percentage of missing values: ")
display(100 * dataset.isnull().sum() / dataset.shape[0])

Number of rows : 200
Number of columns : 5

Display of dataset: 


,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40



Basics statistics: 


,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200,200.0,200.0,200.0
unique,,2,,,
top,,Female,,,
freq,,112,,,
mean,100.5,,38.85,60.56,50.2
std,57.879185,,13.969007,26.264721,25.823522
min,1.0,,18.0,15.0,1.0
25%,50.75,,28.75,41.5,34.75
50%,100.5,,36.0,61.5,50.0
75%,150.25,,49.0,78.0,73.0



Data types: 


CustomerID                 int64
Genre                     object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object

Percentage of missing values: 


CustomerID                0.0
Genre                     0.0
Age                       0.0
Annual Income (k$)        0.0
Spending Score (1-100)    0.0
dtype: float64

2. Remove the "CustomerID" variable from your dataset. 

In [20]:
dataset.drop(labels="CustomerID", axis=1, inplace=True)

3. Apply the preprocessing steps: standardize the numerical features and one-hot encode the categorical ones.

In [21]:
numerical_features = dataset.select_dtypes(exclude="object").columns
categorical_features = dataset.select_dtypes(include="object").columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)

dataset_processed = preprocessor.fit_transform(dataset)

4. We are going to build our clusters, but first we need to know the optimal number of clusters. Start with the ```Elbow``` method to see which value of ```k``` looks appropriate.

In [22]:
wcss = []
k = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init="auto")
    kmeans.fit(dataset_processed)
    wcss.append(kmeans.inertia_)
    k.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))

WCSS for K=1 --> 698.56
WCSS for K=2 --> 487.6586341571177
WCSS for K=3 --> 394.8558959531979
WCSS for K=4 --> 349.53644994533875
WCSS for K=5 --> 268.685437707417
WCSS for K=6 --> 235.33126125116934
WCSS for K=7 --> 215.85705119709655
WCSS for K=8 --> 193.20422724918876
WCSS for K=9 --> 167.69728553570377
WCSS for K=10 --> 151.3344802902286
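
As a sanity check, the value for K=1 can be recomputed by hand: with a single cluster, the WCSS is the total sum of squared distances to the overall mean. The sketch below assumes that `StandardScaler` gives each numerical column unit variance (so each contributes one unit per row) and that `OneHotEncoder` keeps both `Genre` columns, as in the preprocessing above. The counts come from the dataset summary shown earlier.

```python
n_rows = 200     # number of rows in the dataset
n_female = 112   # frequency of "Female" in the describe() output
n_numeric = 3    # Age, Annual Income, Spending Score

# Each standardized column has variance 1, so its sum of squares is n_rows
numeric_part = n_numeric * n_rows

# A 0/1 column with proportion p has variance p * (1 - p)
p = n_female / n_rows
one_hot_part = 2 * n_rows * p * (1 - p)  # the Female and Male columns

total_sum_of_squares = numeric_part + one_hot_part
print(round(total_sum_of_squares, 2))
```

The result agrees with the WCSS printed for K=1, which also shows that the three scaled numerical features account for most of the inertia and the one-hot `Genre` columns for a smaller share.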


In [23]:
wcss_frame = pd.DataFrame(wcss)
k_frame = pd.Series(k)

fig = px.line(
    wcss_frame,
    x=k_frame,
    y=wcss_frame.iloc[:, -1]
)

fig.update_layout(
    yaxis_title="Inertia",
    xaxis_title="# Clusters",
    title="Inertia per number of clusters"
)

fig.show()
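
When the elbow is hard to spot on the curve, it can help to look at how much the WCSS decreases at each step. This is a small sketch using the WCSS values printed above, rounded to two decimals (the `wcss_values` and `drops` names are only for illustration).

```python
# WCSS values copied from the output above, rounded to two decimals
wcss_values = [698.56, 487.66, 394.86, 349.54, 268.69,
               235.33, 215.86, 193.20, 167.70, 151.33]

# Decrease in WCSS when moving from k to k + 1 clusters
drops = [previous - current
         for previous, current in zip(wcss_values, wcss_values[1:])]

for k_value, drop in enumerate(drops, start=1):
    print("K={} -> K={}: WCSS decreases by {:.2f}".format(k_value, k_value + 1, drop))
```

On these values the decrease is not monotonic: the step from K=4 to K=5 is larger than the step from K=3 to K=4. The elbow is therefore not clear-cut, which is one reason to cross-check with another criterion.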

5. Then use the _Silhouette_ method to see if we can refine our hypothesis for ```k```.

In [24]:
sil = []
k = []

## Careful: start at i=2, because the silhouette score requires at least 2 distinct labels
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init="auto")
    kmeans.fit(dataset_processed)
    sil.append(silhouette_score(dataset_processed, kmeans.predict(dataset_processed)))
    k.append(i)
    print("Silhouette score for K={} is {}".format(i, sil[-1]))

Silhouette score for K=2 is 0.28206497092786603
Silhouette score for K=3 is 0.2879173501021572
Silhouette score for K=4 is 0.239741474822587
Silhouette score for K=5 is 0.29360072363002754
Silhouette score for K=6 is 0.29813958615424574
Silhouette score for K=7 is 0.3022120865019646
Silhouette score for K=8 is 0.3324323416761952
Silhouette score for K=9 is 0.3428636086983562
Silhouette score for K=10 is 0.3487088742578655
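
To make these numbers easier to interpret, here is a minimal NumPy sketch of the silhouette coefficient on made-up data. For each point, `a` is the mean distance to the other points of its own cluster, `b` is the smallest mean distance to another cluster, and the score is `(b - a) / max(a, b)`. The helper name `silhouette_by_hand` and the toy points are hypothetical, and the sketch assumes every cluster contains at least two points.

```python
import numpy as np

# Hypothetical 1-D toy data: two tight, well-separated clusters
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient, assuming every cluster has at least 2 points."""
    # Pairwise Euclidean distances between all points
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = distances[i, same].mean()  # mean distance within its own cluster
        # Smallest mean distance to the points of any other cluster
        b = min(
            distances[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

result = silhouette_by_hand(points, labels)
print(result)
```

Scores close to 1 indicate compact, well-separated clusters, as in this toy example. The scores of about 0.3 obtained on the mall data point to clusters that overlap noticeably.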


In [25]:
cluster_scores = pd.DataFrame(sil)
k_frame = pd.Series(k)

fig = px.bar(
    data_frame=cluster_scores,
    x=k,
    y=cluster_scores.iloc[:, -1]
)

fig.update_layout(
    yaxis_title="Silhouette Score",
    xaxis_title="# Clusters",
    title="Silhouette Score per number of clusters"
)

fig.show()

6. Next, we will use $K=6$ clusters. Apply KMeans to your dataset.

In [26]:
kmeans = KMeans(n_clusters=6, random_state=0, n_init="auto")

kmeans.fit(dataset_processed)

In [27]:
dataset["Cluster"] = kmeans.labels_
# dataset["Cluster"] = kmeans.predict(dataset_processed)
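
Once each customer has a label, a simple way to get to know the segments is to average the features per cluster. The sketch below shows the idea on a small made-up DataFrame (the `toy` data, `profile` and `sizes` are hypothetical names, and the values are not taken from the mall dataset).

```python
import pandas as pd

# Hypothetical toy data with cluster labels already assigned
toy = pd.DataFrame({
    "Age": [20, 24, 50, 54],
    "Annual Income (k$)": [20, 30, 80, 90],
    "Spending Score (1-100)": [80, 70, 20, 10],
    "Cluster": [0, 0, 1, 1],
})

# Average value of each feature per cluster
profile = toy.groupby("Cluster").mean()

# Number of customers per cluster
sizes = toy["Cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

On the real `dataset`, the `Genre` column is text, so something like `dataset.groupby("Cluster").mean(numeric_only=True)` would be needed to restrict the averages to the numerical columns.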

7. Let's create a graph to visualize each of the clusters. First, put the ```Spending Score``` on the y-axis and the ```Annual Income``` on the x-axis.

In [28]:
px.scatter(dataset, x="Annual Income (k$)", y="Spending Score (1-100)", color="Cluster")

8. We get a clear visualization with well-separated clusters. This time, look at the variable ```Age``` in relation to the ```Spending Score```. What do you notice?

In [29]:
px.scatter(dataset, x="Age", y="Spending Score (1-100)", color="Cluster")

----> This time, the clusters are much less clearly separated.

9. Finally, make a 3D scatter plot of your clusters using all the quantitative features.

In [30]:
fig = px.scatter_3d(dataset, x="Annual Income (k$)", y="Spending Score (1-100)", z="Age", color="Cluster")
fig.show()