### Segmenting Customers with K-means Clustering

#### Import Python modules which we will use in our clustering

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

#### Import data 

Read the HealthyFoodStore dataset into a pandas dataframe

In [3]:
filename = 'HealthyFoodStore2017.xlsx'
df_healthy_store = pd.read_excel(filename,'Data')   

#### Prepare the data 

Let's sum the sales amount for each customer per item

In [4]:
df_grouped = df_healthy_store.groupby(['Customer_ID','Item'],as_index = False).sum()

Let's look at the first five rows of the df_grouped dataframe. We see that Customer ID AA-1 has purchased several different items (but mostly Aloe Vera)

In [7]:
df_grouped.head()

Unnamed: 0,Customer_ID,Item,Sales
0,AA-1,Aloe Vera,65
1,AA-1,Broccoli Powder,55
2,AA-1,Detox Green Tea,20
3,AA-1,Energy bar White Chocolate and Macadamia Nut,5
4,AA-1,Fusion Spice Red Tea,10


Prior to clustering we will pivot the dataset so that each row represents one unique customer and each column one item. A customer who has never purchased an item gets NaN in that column. Since our cluster model needs numerical values, we use the fillna method to replace those NaN values with 0.

In [5]:
df_pivoted = df_grouped.pivot_table(index='Customer_ID', columns='Item', values='Sales').fillna(0)

Let's take a look at the pivoted dataframe. We can see each customer's spending per item

In [152]:
df_pivoted.head()

Customer_ID,Aloe Vera,Broccoli Powder,Detox Green Tea,Energy bar White Chocolate and Macadamia Nut,Fusion Spice Red Tea,Ginger Lemon Tea,Grounded Garlic & Ginger,"HealthSmart Foods Chocolite Protein, French Vanilla",Muscle Combat crunch (Chocolate chip),"Oh Yeah!, Nutritional Shake, Chocolate Milkshake",Power bar - Banana Strawberry,Sprirulina,Wheat Grass
AA-1,65,55,20,5,10,15,35,5,5,0,0,35,35
AA-10,40,50,10,5,20,15,45,0,5,0,0,30,60
AA-11,55,50,15,5,25,15,50,5,0,5,0,45,55
AA-12,55,85,15,5,15,10,35,5,0,0,0,60,30
AA-13,10,10,35,5,45,65,20,15,10,0,10,15,15
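
The group-then-pivot step can be tried on a tiny example. This is a minimal sketch on made-up transactions (the `toy_` names and values are assumptions, only the column names mirror the dataset above):

```python
import pandas as pd

# Hypothetical toy transactions; only the column names mirror the notebook's dataset.
toy_sales = pd.DataFrame({
    'Customer_ID': ['C-1', 'C-1', 'C-1', 'C-2'],
    'Item': ['Aloe Vera', 'Aloe Vera', 'Wheat Grass', 'Aloe Vera'],
    'Sales': [10, 5, 20, 30],
})

# Step 1: total sales per customer and item (long format).
toy_grouped = toy_sales.groupby(['Customer_ID', 'Item'], as_index=False).sum()

# Step 2: one row per customer, one column per item.
# Customer/item pairs that never occur become NaN and are then filled with 0.
toy_pivoted = toy_grouped.pivot_table(
    index='Customer_ID', columns='Item', values='Sales').fillna(0)

print(toy_pivoted)
```

Customer C-2 never bought Wheat Grass, so that cell is the one filled with 0.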


#### Variable selection

Let's select all columns except the Customer ID as input variables for the cluster model. After the pivot, Customer_ID is the index rather than a column, so this keeps every item column.

In [6]:
Xcols = [col for col in df_pivoted.columns if 'CUSTOMER_ID' not in col.upper()]

Here are our input variables

In [10]:
Xcols

[u'Aloe Vera',
 u'Broccoli Powder',
 u'Detox Green Tea',
 u'Energy bar White Chocolate and Macadamia Nut',
 u'Fusion Spice Red Tea',
 u'Ginger Lemon Tea',
 u'Grounded Garlic & Ginger',
 u'HealthSmart Foods Chocolite Protein, French Vanilla',
 u'Muscle Combat crunch (Chocolate chip)',
 u'Oh Yeah!, Nutritional Shake, Chocolate Milkshake',
 u'Power bar - Banana Strawberry',
 u'Sprirulina',
 u'Wheat Grass']

### Transforming variables

#### Dummy variables

If we had categorical variables in our dataset we would have to transform them into dummy/indicator variables. The pandas library has the get_dummies function, which greatly simplifies this. Since all our variables are continuous, there is no need to transform any features into dummy/indicator variables.
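
For illustration only, here is a minimal sketch of what that transformation could look like. The `Membership` column and all values are made up and are not part of the HealthyFoodStore data:

```python
import pandas as pd

# Hypothetical customer attributes; 'Membership' is a made-up categorical column.
toy_customers = pd.DataFrame({
    'Customer_ID': ['C-1', 'C-2', 'C-3'],
    'Membership': ['Gold', 'Basic', 'Gold'],
    'Total_Sales': [120, 40, 95],
})

# Replace the categorical column with one 0/1 indicator column per category.
toy_encoded = pd.get_dummies(toy_customers, columns=['Membership'], dtype=int)

print(toy_encoded)
```

Each category becomes its own column (`Membership_Basic`, `Membership_Gold`), so the clustering algorithm only ever sees numbers.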

#### Scaling 

Since we are using a distance measure (the Euclidean distance) to cluster our observations, it is very important that the variables are on the same scale. Otherwise the variables measured on a larger scale will have a disproportionate influence on the distances compared to the other variables in the model.
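
A small sketch of the effect, using three made-up customers described by dollars spent and number of store visits (both features and all values are assumptions for illustration):

```python
import numpy as np

# Hypothetical customers described by (sales in dollars, number of store visits).
customer_a = np.array([1000.0, 2.0])
customer_b = np.array([1010.0, 20.0])   # similar spend, very different visit count
customer_c = np.array([1100.0, 2.0])    # different spend, same visit count

# Raw Euclidean distances: the dollar column dominates because its numbers are larger.
dist_ab = np.linalg.norm(customer_a - customer_b)
dist_ac = np.linalg.norm(customer_a - customer_c)

# After putting both columns on the same scale, the visit difference counts as well.
points = np.vstack([customer_a, customer_b, customer_c])
standardized = (points - points.mean(axis=0)) / points.std(axis=0)
z_ab = np.linalg.norm(standardized[0] - standardized[1])
z_ac = np.linalg.norm(standardized[0] - standardized[2])

print("raw distances:          a-b %.2f, a-c %.2f" % (dist_ab, dist_ac))
print("standardized distances: a-b %.2f, a-c %.2f" % (z_ab, z_ac))
```

On the raw values customer c looks several times farther from a than b does, purely because dollars are measured in larger numbers. After scaling, the two distances are of similar size.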

#### Standardizing

We will transform all continuous variables by standardizing (Z-score normalizing) them. This is done by subtracting their mean and dividing by their standard deviation: $z =\frac{x - \mu}{\sigma}$ 
<br>
The variables (features) will then be standardized with a mean of 0 and a standard deviation of 1. To perform the standardizing we will use the zscore function from the scipy.stats module.

In [7]:
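X = df_pivoted[Xcols].copy()         #work on a copy so df_pivoted keeps the original values
from scipy.stats import zscore
for col in list(X.columns):          #iterate over a fixed list, since the loop adds and deletes columns
    col_zscore = 'z_' + col          #prefix the column name with z_ for the standardized values
    X[col_zscore] = zscore(X[col])
    del X[col]                       #delete the column with original/non-standardized values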

Let's take a look at the dataframe with standardized features

In [12]:
X.head()

Customer_ID,z_Aloe Vera,z_Broccoli Powder,z_Detox Green Tea,z_Energy bar White Chocolate and Macadamia Nut,z_Fusion Spice Red Tea,z_Ginger Lemon Tea,z_Grounded Garlic & Ginger,"z_HealthSmart Foods Chocolite Protein, French Vanilla",z_Muscle Combat crunch (Chocolate chip),"z_Oh Yeah!, Nutritional Shake, Chocolate Milkshake",z_Power bar - Banana Strawberry,z_Sprirulina,z_Wheat Grass
AA-1,2.312569,1.681356,0.259865,-0.943962,-0.34367,-0.08766,0.965364,-0.840973,-0.966408,-1.081828,-1.018187,0.940438,0.943414
AA-10,1.093288,1.455164,-0.307113,-0.943962,0.213633,-0.08766,1.501677,-1.069292,-0.966408,-1.081828,-1.018187,0.673015,2.241689
AA-11,1.824857,1.455164,-0.023624,-0.943962,0.492284,-0.08766,1.769834,-0.840973,-1.188571,-0.871082,-1.018187,1.475284,1.982034
AA-12,1.824857,3.038504,-0.023624,-0.943962,-0.065019,-0.364481,0.965364,-0.840973,-1.188571,-1.081828,-1.018187,2.277554,0.683758
AA-13,-0.369848,-0.354366,1.110333,-0.943962,1.606888,2.680552,0.160894,-0.384336,-0.744245,-1.081828,-0.616271,-0.129255,-0.095207
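
To make the formula concrete, the sketch below standardizes a small made-up table with `zscore` and again by hand (the `toy` values are assumptions). Note that `scipy.stats.zscore` divides by the population standard deviation (`ddof=0`), while the pandas `std()` method defaults to `ddof=1`, so the hand-written version has to pass `ddof=0` to match:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical spend values for two items and four customers.
toy = pd.DataFrame({'Aloe Vera': [65.0, 40.0, 55.0, 10.0],
                    'Wheat Grass': [35.0, 60.0, 55.0, 15.0]})

# Column-wise z-scores with scipy (population standard deviation, ddof=0).
z_scipy = pd.DataFrame(zscore(toy.values, axis=0), columns=toy.columns)

# The same transformation written out: subtract the mean, divide by the std.
z_manual = (toy - toy.mean()) / toy.std(ddof=0)

print(z_manual.round(3))
```

Any mean and standard deviation stored for standardizing new data later should follow the same `ddof=0` convention, otherwise new samples end up on a slightly different scale than the training data.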


#### Choosing the optimal number of clusters in the model

An inappropriate choice for the number of clusters (k) can result in poor model performance. We will use the Silhouette Coefficient to determine the optimal number of clusters. 

In our example we will test values of k from 2 through 10 in a for loop and then choose the k whose model has the highest silhouette_score. The silhouette coefficient measures how compact and how well separated the clusters are, which makes it a good criterion for selecting the number of clusters in the model.

In [8]:
from sklearn.metrics import silhouette_score

# convert to a NumPy array to use in the KMeans clustering class
# (DataFrame.as_matrix() was removed in pandas 1.0; .values works in old and new versions)
data_for_clustering_matrix = X.values
number_of_clusters = []
silhouette_coef = []

# Fit a cluster model with 2 to 10 clusters (in a for-loop) and identify which has the highest silhouette score
k = range(2, 11)
for i in k:
    clustering_method = KMeans(n_clusters = i)
    clustering_method.fit(data_for_clustering_matrix)
    labels = clustering_method.predict(data_for_clustering_matrix)
    silhouette_average = silhouette_score(data_for_clustering_matrix, labels)
    silhouette_coef.append(silhouette_average)
    print("Number of clusters: %d has a silhouette coefficient of %.3f" % (i, silhouette_average))
    number_of_clusters.append(i)

max_silhouette_score = max(silhouette_coef)
# get the list index of the element with the highest value (the highest element in the list)
index_of_max_score = silhouette_coef.index(max_silhouette_score)

Number of clusters: 2 has a silhouette coefficient of 0.573
Number of clusters: 3 has a silhouette coefficient of 0.600
Number of clusters: 4 has a silhouette coefficient of 0.502
Number of clusters: 5 has a silhouette coefficient of 0.395
Number of clusters: 6 has a silhouette coefficient of 0.329
Number of clusters: 7 has a silhouette coefficient of 0.338
Number of clusters: 8 has a silhouette coefficient of 0.343
Number of clusters: 9 has a silhouette coefficient of 0.326
Number of clusters: 10 has a silhouette coefficient of 0.336


In [9]:
print("The number of clusters with the highest silhouette_coef is %d" % (number_of_clusters[index_of_max_score]))

The number of clusters with the highest silhouette_coef is 3
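
To show what the silhouette coefficient measures, here is a minimal sketch that computes it by hand on made-up 2-D points and compares the result with scikit-learn. For each sample, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the sample's score is (b - a) / max(a, b). The helper `silhouette_by_hand` and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data: two tight, well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(points, labels):
    """Mean of (b - a) / max(a, b) over all samples.

    Assumes at least two clusters and at least two points per cluster.
    """
    # Pairwise Euclidean distances between all points.
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        a = distances[i, same].mean()        # compactness: mean distance within own cluster
        b = min(distances[i, labels == other].mean()   # separation: nearest other cluster
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = silhouette_by_hand(points, labels)
reference = silhouette_score(points, labels)
print("by hand: %.3f, scikit-learn: %.3f" % (manual, reference))
```

Scores close to 1 mean tight, well-separated clusters, and scores near 0 mean overlapping clusters.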


So let's segment our customers with a cluster model with 3 clusters

In [12]:
# FIT KMEANS CLUSTER MODEL WITH NUMBER OF CLUSTERS WITH HIGHEST SILHOUETTE SCORE
cluster_model = KMeans(n_clusters = number_of_clusters[index_of_max_score])
cluster_model.fit(data_for_clustering_matrix)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

For model persistence we will use the pickle module, which allows us to serialize and de-serialize Python object structures to a byte stream. This lets us save our cluster model and load it later to assign new samples to clusters without having to fit the model on the training data all over again.


We also need to save the arrays containing the means and standard deviations of all variables in the cluster model. This is to be able to standardize the input data in real time.

In [10]:
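#transform the pandas Series of per-column statistics to NumPy arrays
#ddof=0 matches scipy.stats.zscore, which divides by the population standard deviation
mean_arr = np.array(df_pivoted[Xcols].mean(), dtype=float)
std_arr = np.array(df_pivoted[Xcols].std(ddof=0), dtype=float)

#Save the pickled arrays to disk (the savedmodels directory must already exist)
mean_arr.dump('savedmodels/mean_arr.pkl')
std_arr.dump('savedmodels/std_arr.pkl')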

In [13]:
#Save the model to disk
import pickle
# save the fitted cluster model
with open('savedmodels/cluster_model.pkl', 'wb') as fid:
    pickle.dump(cluster_model, fid) 

#OR
#filename = 'savedmodels/cluster_model.sav'
#pickle.dump(cluster_model, open(filename, 'wb'))
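
The loading side is not shown in this notebook. Below is a minimal, self-contained sketch of how the saved pieces might be used to score a new customer. It fits a toy model on made-up data and writes to a temporary directory as a stand-in for `savedmodels/`. The `load_pickle` helper, the toy values and `random_state=0` are assumptions. Arrays written with `ndarray.dump` are ordinary pickles, so `pickle.load` reads them as well:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: spend on two items for six customers.
raw = np.array([[60.0, 5.0], [55.0, 10.0], [65.0, 0.0],
                [5.0, 50.0], [10.0, 55.0], [0.0, 60.0]])
mean_arr = raw.mean(axis=0)
std_arr = raw.std(axis=0)            # ddof=0, the same convention as scipy.stats.zscore
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit((raw - mean_arr) / std_arr)

# Save everything needed for scoring (temporary directory as a stand-in for savedmodels/).
model_dir = tempfile.mkdtemp()
for name, obj in [('cluster_model.pkl', model),
                  ('mean_arr.pkl', mean_arr),
                  ('std_arr.pkl', std_arr)]:
    with open(os.path.join(model_dir, name), 'wb') as fid:
        pickle.dump(obj, fid)

def load_pickle(name):
    """Hypothetical helper: read one pickled object back from model_dir."""
    with open(os.path.join(model_dir, name), 'rb') as fid:
        return pickle.load(fid)

# Later, for example in a scoring script: reload the model and the statistics.
loaded_model = load_pickle('cluster_model.pkl')
loaded_mean = load_pickle('mean_arr.pkl')
loaded_std = load_pickle('std_arr.pkl')

# Standardize the new customer with the TRAINING mean and std before predicting.
new_customer = np.array([[58.0, 3.0]])       # raw, unstandardized spend
new_label = loaded_model.predict((new_customer - loaded_mean) / loaded_std)[0]
print("New customer assigned to cluster %d" % new_label)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.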

Let's score the dataframe (cluster our customers) with our cluster model 

In [26]:
cluster_labels = cluster_model.predict(data_for_clustering_matrix)
df_pivoted['cluster'] = cluster_labels

Let's see the three clusters' average spending on each item in our store 

In [27]:
df_pivoted.groupby(['cluster']).mean() 

cluster,Aloe Vera,Broccoli Powder,Detox Green Tea,Energy bar White Chocolate and Macadamia Nut,Fusion Spice Red Tea,Ginger Lemon Tea,Grounded Garlic & Ginger,"HealthSmart Foods Chocolite Protein, French Vanilla",Muscle Combat crunch (Chocolate chip),"Oh Yeah!, Nutritional Shake, Chocolate Milkshake",Power bar - Banana Strawberry,Sprirulina,Wheat Grass
0,2.333333,2.0,2.5,52.0,1.833333,2.666667,3.0,41.166667,47.0,46.0,47.166667,2.666667,2.333333
1,17.352941,17.352941,37.941176,5.882353,40.294118,41.176471,18.529412,7.941176,9.705882,7.647059,4.117647,20.294118,17.647059
2,53.076923,55.0,15.769231,5.0,17.692308,16.538462,47.307692,2.692308,2.307692,2.307692,2.692308,47.692308,49.230769


Customers in cluster 0 enjoy energy bars and protein shakes. Let's call them Energizers (in need of breaks). <br>
Customers in cluster 1 like their healthy herbal tea. Let's call them Tea sippers. <br>
Customers in cluster 2 enjoy greens, grass and algae (although they don't seem to mind some healthy afternoon tea). Let's call them our Healthy Herbs.
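
If the segment names should travel with the data, one option is to map the numeric labels to names. This is a minimal sketch on a made-up scored table (`toy_scored` and `segment_names` are assumptions). KMeans numbers its clusters arbitrarily, so the mapping has to be re-checked against the cluster averages after every refit:

```python
import pandas as pd

# Label-to-name mapping taken from the cluster profiles in this run.
# KMeans label numbers are arbitrary, so verify this mapping after each refit.
segment_names = {0: 'Energizers', 1: 'Tea sippers', 2: 'Healthy Herbs'}

# Hypothetical scored customers with their cluster labels.
toy_scored = pd.DataFrame({'cluster': [2, 1, 0, 2]},
                          index=['C-1', 'C-2', 'C-3', 'C-4'])

# Attach a readable segment name to every customer and count the segment sizes.
toy_scored['segment'] = toy_scored['cluster'].map(segment_names)
segment_sizes = toy_scored['segment'].value_counts()

print(toy_scored)
print(segment_sizes)
```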