# Consumer segmentation -- mixed datatypes

The k-means algorithm does not work well for segmenting consumers when the data contain both numerical and categorical variables. A modified method, the **k-prototypes algorithm**, addresses this issue. Given the number of clusters $k$, k-prototypes minimizes a "clustering cost" that measures the clustering misfit for mixed datatypes, rather than the SSE that k-means minimizes. You need to install the `kmodes` package from your Anaconda prompt/cmd using `conda install -c conda-forge kmodes`.
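
To make the "clustering cost" more concrete, here is a minimal sketch of the mixed dissimilarity that k-prototypes is commonly described as summing over all consumers: squared Euclidean distance on the numerical variables plus a weight times the number of categorical mismatches. The function name `mixed_dissimilarity`, the weight `gamma = 0.5`, and the toy values are assumptions for illustration; this is not the internal code of the `kmodes` package.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=0.5):
    """Distance between one consumer and one cluster prototype.

    x_num / proto_num: numerical values (already on a comparable scale)
    x_cat / proto_cat: categorical values
    gamma: assumed weight of the categorical part relative to the numerical part
    """
    # numerical part: squared Euclidean distance
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # categorical part: number of variables whose categories differ
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


# Same numerical values, prototype with the same vs. a different gender
d_same = mixed_dissimilarity([0.2, 0.5], ["M"], [0.1, 0.5], ["M"])
d_diff = mixed_dissimilarity([0.2, 0.5], ["M"], [0.1, 0.5], ["F"])
print(round(d_same, 4), round(d_diff, 4))
```

The two calls differ only in the prototype's gender, so the second distance exceeds the first by exactly `gamma`. A larger `gamma` makes a categorical mismatch count for more relative to numerical differences.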



## Importing packages and dataset

In [1]:
import numpy as np
import pandas as pd
from  kmodes.kprototypes import KPrototypes   # We will use the k-prototypes algorithm

We will use "MallCustomersAllVariables.csv" for the analysis.

In [5]:
df = pd.read_csv("C:/Users/zoutianxin/Dropbox/Teach/Marketing analytics/2021/2021 Analytics/clustering/datasets/MallCustomersAllVariables.csv",index_col = 0) # use the first column (customer id) as index
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Gender                  200 non-null    object
 1   Age                     200 non-null    int64 
 2   Annual Income (k$)      200 non-null    int64 
 3   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 7.8+ KB


Rename the variables to follow the naming conventions.

In [6]:
df = df.rename(columns = {"Gender":"gender",
                          "Age":"age",
                          "Annual Income (k$)":"annual_income",
                          "Spending Score (1-100)":"spending_score"})
df.head()

| CustomerID | gender | age | annual_income | spending_score |
|---|---|---|---|---|
| 1 | M | 21 | 15 | 81 |
| 2 | M | 19 | 15 | 39 |
| 3 | F | 23 | 16 | 77 |
| 4 | F | 20 | 16 | 6 |
| 5 | F | 22 | 17 | 76 |


Note that gender (F, M) is a categorical variable, so the k-means algorithm should not be applied here. We need the k-prototypes algorithm to accommodate categorical variables.

## Segmenting consumers into three segments

### Normalize the variables to a 0-1 scale (only for **numerical** variables)

$$
X_{transform} = \frac {X_{original} - X_{min}} {X_{max} - X_{min}}
$$

Since it makes no sense to normalize categorical variables to a 0-1 scale, the normalization should be applied only to the numerical variables: `age`, `annual_income`, and `spending_score`.
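
As a quick check of the formula, the sketch below rescales a few ages by hand. The helper `min_max_scale` and the four-value list are hypothetical; 18 and 70 are the smallest and largest ages reported in the segment summary of this notebook, so the value for age 21 should agree with the normalized age of customer 1.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0-1 range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 21, 19, 70]   # hypothetical sample that includes the min (18) and max (70) age
scaled_ages = min_max_scale(ages)
print([round(a, 6) for a in scaled_ages])   # [0.0, 0.057692, 0.019231, 1.0]
```

Age 21 maps to (21 - 18) / (70 - 18) = 3/52, roughly 0.0577, and age 19 maps to 1/52, roughly 0.0192.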

In [10]:
df_normalized = df.copy() # create a copy of the original dataset
num_cols = ['age','annual_income','spending_score']   # numerical columns only
# min-max normalization, applied column by column
df_normalized[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
df_normalized.head()


| CustomerID | gender | age | annual_income | spending_score |
|---|---|---|---|---|
| 1 | M | 0.057692 | 0.0 | 0.816327 |
| 2 | M | 0.019231 | 0.0 | 0.387755 |
| 3 | F | 0.096154 | 0.008197 | 0.77551 |
| 4 | F | 0.038462 | 0.008197 | 0.05102 |
| 5 | F | 0.076923 | 0.016393 | 0.765306 |


### Applying k-prototypes algorithm to normalized data

In [None]:
# your code goes here
# creating 3 segments
# Set the k-prototypes model specs, specifying we need 3 clusters
# apply the model specs to the normalized dataset using 

In [15]:
kprotoSpec = KPrototypes(n_clusters = 3)  # setup the k-prototypes model specs
# apply the above method to normalized dataset
kproto_result3 = kprotoSpec.fit(df_normalized,categorical = [0])    # The categorical variable, gender, is in column 0
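
The `categorical = [0]` argument refers to column positions rather than column names, so it has to be updated whenever the column order changes. As a hedged sketch, the positions can be derived from the dtypes instead of being typed by hand. The small frame `toy` and the name `categorical_positions` are hypothetical stand-ins for `df_normalized` and for the list passed to `fit`.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as df_normalized
toy = pd.DataFrame(
    {
        "gender": ["M", "F", "F"],
        "age": [0.06, 0.02, 0.10],
        "annual_income": [0.0, 0.0, 0.01],
        "spending_score": [0.82, 0.39, 0.78],
    }
)

# Integer positions of every non-numeric column
categorical_positions = [
    toy.columns.get_loc(col)
    for col in toy.select_dtypes(exclude="number").columns
]
print(categorical_positions)   # [0]
```

With the real data, the resulting list could be passed as the `categorical` argument in place of the literal `[0]`.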


## Post-segmentation analysis

### Which segment does each consumer belong to?

Create a new column in the **original** dataframe for which segment a consumer belongs to.

In [7]:
# your code goes here
# the segmentation result can be accessed by "[your segmentation result]".labels_

In [16]:
df["segment"] = kproto_result3.labels_ 
df.head()

| CustomerID | gender | age | annual_income | spending_score | segment |
|---|---|---|---|---|---|
| 1 | M | 21 | 15 | 81 | 0 |
| 2 | M | 19 | 15 | 39 | 0 |
| 3 | F | 23 | 16 | 77 | 0 |
| 4 | F | 20 | 16 | 6 | 2 |
| 5 | F | 22 | 17 | 76 | 0 |


### Summarizing segment characteristics

For each segment, summarize the mean, min, and max of every **numerical** variable. (These statistics make no sense for **categorical** variables.)
For the categorical variable, summarize the percentage of male and female customers in each segment.
Also count the number of consumers in each segment.


In [34]:
# summarize numerical variables
summary_table_numerical = df.groupby("segment").aggregate({
    "age": ["mean","min","max"],                       # calculate the mean/min/max of age for each segment
    "annual_income": ["mean","min","max"],             # calculate the mean/min/max of annual income for each segment
    "spending_score": ["mean","min","max"],            # calculate the mean/min/max of spending score for each segment
    "segment": "count"                                 # count how many consumers are in each segment
})
summary_table_numerical


| segment | age mean | age min | age max | annual_income mean | annual_income min | annual_income max | spending_score mean | spending_score min | spending_score max | segment count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.260417 | 18 | 40 | 60.697917 | 15 | 137 | 69.479167 | 29 | 99 | 96 |
| 1 | 49.204082 | 19 | 70 | 62.244898 | 19 | 137 | 29.734694 | 1 | 60 | 49 |
| 2 | 48.109091 | 20 | 68 | 58.818182 | 16 | 126 | 34.781818 | 5 | 59 | 55 |
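
The dictionary form of `aggregate` returns two-level column labels such as `("age", "mean")`, which can be awkward to select or export. One possible alternative, sketched here on hypothetical toy data, is pandas named aggregation, which yields a single level of column names. The output names (`age_mean`, `n_customers`, and so on) are illustrative choices.

```python
import pandas as pd

# Hypothetical four-customer frame standing in for df with its "segment" column
toy = pd.DataFrame(
    {
        "age": [21, 19, 45, 50],
        "annual_income": [15, 15, 60, 62],
        "spending_score": [81, 39, 30, 28],
        "segment": [0, 0, 1, 1],
    }
)

# Named aggregation: output_name=(input_column, function)
flat_summary = toy.groupby("segment").agg(
    age_mean=("age", "mean"),
    age_min=("age", "min"),
    age_max=("age", "max"),
    income_mean=("annual_income", "mean"),
    spending_mean=("spending_score", "mean"),
    n_customers=("age", "size"),   # number of consumers in each segment
)
print(flat_summary)
```

Each column can then be selected by a plain name, for example `flat_summary["age_mean"]`.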


In [37]:
# summarize categorical variables (gender)

summary_table_categorical = (df.groupby("segment"))["gender"].value_counts(normalize = True)
summary_table_categorical

segment  gender
0        F         0.59375
         M         0.40625
1        M         1.00000
2        F         1.00000
Name: gender, dtype: float64
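
The same shares can also be laid out as a segment-by-gender table, which shows explicitly when a gender is absent from a segment (a share of 0) instead of omitting the row. This is a sketch on hypothetical toy data using `pd.crosstab`; the name `gender_share` is illustrative.

```python
import pandas as pd

# Hypothetical five-customer frame standing in for df with its "segment" column
toy = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M", "F"],
        "segment": [0, 0, 0, 1, 2],
    }
)

# normalize="index" turns the counts into shares within each segment (each row sums to 1)
gender_share = pd.crosstab(toy["segment"], toy["gender"], normalize="index")
print(gender_share)
```

In this toy example, segment 0 is two-thirds female, segment 1 is entirely male, and segment 2 is entirely female.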
