# K-Means Clustering with Scikit-Learn Assignment

In this assignment we will use scikit-learn to identify client segments in a database of customers from a wholesale retailer.

The original dataset (which can be found here) contains the following features, which we will use for clustering. Spending is measured in monetary units (m.u.).

FRESH: annual spending (m.u.) on fresh products (Continuous);
MILK: annual spending (m.u.) on milk products (Continuous);
GROCERY: annual spending (m.u.) on grocery products (Continuous);
FROZEN: annual spending (m.u.) on frozen products (Continuous);
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous).

First, we load the libraries needed to clean the data and run the k-means clustering algorithm. Again, we also set a random seed so that the results can be replicated.

In [1]:
from sklearn import cluster
import pandas as pd
import numpy as np
np.random.seed(100)

Here we use the read_csv function from pandas to read in our dataset and keep only the six spending columns used for clustering.

In [2]:
df = pd.read_csv('Wholesale customers data.csv')
df = df[['Fresh','Milk','Grocery','Frozen','Detergents_Paper','Delicassen']]

Next, we use the KMeans class to group the data into three clusters. We can then print the label assigned to each observation using the .labels_ attribute.

In [3]:
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) 
k_means.labels_

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       2, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0, 2, 0, 2,
       2, 2, 0, 2, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 0, 1, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 2, 0,
       0, 0, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,
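To make concrete what the KMeans call above does conceptually, here is a minimal plain-Python sketch of Lloyd's algorithm, the standard assign-then-update iteration behind k-means. It is not scikit-learn's implementation: the toy points and the hand-picked starting centroids are made up for illustration, whereas scikit-learn chooses its starting centroids with k-means++ by default and is heavily optimized.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def lloyd_kmeans(points, init_centroids, n_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest_centroid(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up 2-D points forming three well-separated groups.
points = [(1, 1), (1, 2), (2, 1),
          (10, 10), (10, 11), (11, 10),
          (20, 1), (21, 1), (20, 2)]
# Hypothetical starting centroids: one point taken from each group.
init = [points[0], points[3], points[6]]

labels, centroids = lloyd_kmeans(points, init)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With a different starting choice the same iteration can settle on a different partition, which is why the notebook fixes a random seed before fitting.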

Here we take the subset of observations in our wholesale customer database that were assigned to cluster 0 and print the average amount spent per category. The value printed under 'Total:' is the mean of those six category averages (their sum divided by 6), not a total. We can inspect the observations in cluster 0 by using: df.loc[i,]

In [4]:
i, = np.where(k_means.labels_==0)
print('Mean:')
print(df.loc[i,].mean())
print('\nTotal:')
print(df.loc[i,].mean().sum()/6)

df.loc[i,]

Mean:
Fresh               8253.469697
Milk                3824.603030
Grocery             5280.454545
Frozen              2572.660606
Detergents_Paper    1773.057576
Delicassen          1137.496970
dtype: float64

Total:
3806.95707071


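|  | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| 0 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 5 | 9413 | 8259 | 5126 | 666 | 1795 | 1451 |
| 6 | 12126 | 3199 | 6975 | 480 | 3140 | 545 |
| 7 | 7579 | 4956 | 9426 | 1669 | 3321 | 2566 |
| 8 | 5963 | 3648 | 6192 | 425 | 1716 | 750 |
| 10 | 3366 | 5403 | 12974 | 4400 | 5977 | 1744 |
| 11 | 13146 | 1124 | 4523 | 1420 | 549 | 497 |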
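The arithmetic behind the figure printed under 'Total:' for cluster 0 can be checked by hand. The sketch below copies the six category means from the output above; the variable names are illustrative and not part of the notebook.

```python
# Category means for cluster 0, copied from the printed output.
category_means = {
    "Fresh": 8253.469697,
    "Milk": 3824.603030,
    "Grocery": 5280.454545,
    "Frozen": 2572.660606,
    "Detergents_Paper": 1773.057576,
    "Delicassen": 1137.496970,
}

# Sum of the six category means.
total_spend = sum(category_means.values())

# What the notebook prints under 'Total:' is that sum divided by 6,
# i.e. the average of the category means.
mean_per_category = total_spend / len(category_means)

print(round(total_spend, 2))        # about 22841.74
print(round(mean_per_category, 2))  # about 3806.96
```

Because there are no missing values in these columns, the sum of the category means equals the average total annual spend per customer in the cluster, so roughly 22,842 m.u. is the figure to quote if a true total is wanted.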


In [5]:
i, = np.where(k_means.labels_==1)
print('Mean:')
print(df.loc[i,].mean())
print('\nTotal:')
print(df.loc[i,].mean().sum()/6)

df.loc[i,]

Mean:
Fresh               35941.400000
Milk                 6044.450000
Grocery              6288.616667
Frozen               6713.966667
Detergents_Paper     1039.666667
Delicassen           3049.466667
dtype: float64

Total:
9846.26111111


|  | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| 4 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
| 12 | 31714 | 12319 | 11757 | 287 | 3881 | 2931 |
| 14 | 24653 | 9465 | 12091 | 294 | 5058 | 2168 |
| 22 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 24 | 22647 | 9776 | 13792 | 2915 | 4482 | 5778 |
| 29 | 43088 | 2100 | 2609 | 1200 | 1107 | 823 |
| 33 | 29729 | 4786 | 7326 | 6130 | 361 | 1083 |
| 36 | 29955 | 4362 | 5428 | 1729 | 862 | 4626 |
| 39 | 56159 | 555 | 902 | 10002 | 212 | 2916 |
| 40 | 24025 | 4332 | 4757 | 9510 | 1145 | 5864 |


In [6]:
i, = np.where(k_means.labels_==2)
print('Mean:')
print(df.loc[i,].mean())
print('\nTotal:')
print(df.loc[i,].mean().sum()/6)

df.loc[i,]

Mean:
Fresh                8000.04
Milk                18511.42
Grocery             27573.90
Frozen               1996.68
Detergents_Paper    12407.36
Delicassen           2252.02
dtype: float64

Total:
11790.2366667


|  | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| 9 | 6006 | 11093 | 18881 | 1159 | 7425 | 2098 |
| 23 | 26373 | 36423 | 22019 | 5154 | 4337 | 16523 |
| 28 | 4113 | 20484 | 25957 | 1158 | 8604 | 5206 |
| 38 | 4591 | 15729 | 16709 | 33 | 6956 | 433 |
| 43 | 630 | 11095 | 23998 | 787 | 9529 | 72 |
| 45 | 5181 | 22044 | 21531 | 1740 | 7353 | 4985 |
| 46 | 3103 | 14069 | 21955 | 1668 | 6792 | 1452 |
| 47 | 44466 | 54259 | 55571 | 7782 | 24171 | 6465 |
| 49 | 4967 | 21412 | 28921 | 1798 | 13583 | 1163 |
| 56 | 4098 | 29892 | 26866 | 2616 | 17740 | 1340 |
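The three cells above repeat the same steps with a different label each time. As a rough sketch of doing the summary in one pass, the plain-Python function below groups rows by label and computes the per-category means for each cluster. It uses only six rows copied from the tables above, so its numbers are not the full cluster means; the function name summarize_by_cluster is hypothetical. With the notebook's own objects, df.groupby(k_means.labels_).mean() should give the per-cluster category means directly.

```python
from collections import defaultdict

columns = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]

# Two rows per cluster, copied from the tables above (row index in comments).
rows = [
    (12669, 9656, 7561, 214, 2674, 1338),      # index 0, cluster 0
    (7057, 9810, 9568, 1762, 3293, 1776),      # index 1, cluster 0
    (22615, 5410, 7198, 3915, 1777, 5185),     # index 4, cluster 1
    (31714, 12319, 11757, 287, 3881, 2931),    # index 12, cluster 1
    (6006, 11093, 18881, 1159, 7425, 2098),    # index 9, cluster 2
    (26373, 36423, 22019, 5154, 4337, 16523),  # index 23, cluster 2
]
labels = [0, 0, 1, 1, 2, 2]


def summarize_by_cluster(rows, labels, columns):
    """Return size, per-category means, and mean total spend for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    summary = {}
    for label, members in sorted(grouped.items()):
        # zip(*members) transposes the rows into one tuple per column.
        means = {name: sum(col) / len(members)
                 for name, col in zip(columns, zip(*members))}
        summary[label] = {
            "size": len(members),
            "means": means,
            "mean_total_spend": sum(means.values()),
        }
    return summary


summary = summarize_by_cluster(rows, labels, columns)
for label, stats in summary.items():
    print(label, stats["size"], stats["mean_total_spend"])
```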


It looks like the clusters indicate three customer segments: small (~3,800), medium (~9,846), and high (~11,790) spending groups, where each figure is the cluster's average annual spend per product category.
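One way to check this reading is to compare not only how much each cluster spends but also where the money goes. The sketch below copies the per-cluster category means from the outputs above and reports, for each cluster, the sum of the category means, the largest category, and that category's share of the sum. The names cluster_means and profile are illustrative only.

```python
# Per-cluster category means, copied from the printed outputs above.
cluster_means = {
    0: {"Fresh": 8253.469697, "Milk": 3824.603030, "Grocery": 5280.454545,
        "Frozen": 2572.660606, "Detergents_Paper": 1773.057576,
        "Delicassen": 1137.496970},
    1: {"Fresh": 35941.400000, "Milk": 6044.450000, "Grocery": 6288.616667,
        "Frozen": 6713.966667, "Detergents_Paper": 1039.666667,
        "Delicassen": 3049.466667},
    2: {"Fresh": 8000.04, "Milk": 18511.42, "Grocery": 27573.90,
        "Frozen": 1996.68, "Detergents_Paper": 12407.36,
        "Delicassen": 2252.02},
}

profile = {}
for label, means in cluster_means.items():
    total = sum(means.values())        # sum of the six category means
    top = max(means, key=means.get)    # category with the largest mean
    profile[label] = (total, top, means[top] / total)

for label, (total, top, share) in profile.items():
    print(f"cluster {label}: total ~{total:,.0f}, top category {top} ({share:.0%})")
```

On these figures the ordering by spend holds, and the two larger groups also differ in mix: cluster 1 spends mostly on Fresh, while cluster 2 is led by Grocery, Milk, and Detergents_Paper.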