![](images/person-holds-a-basket-full-of-groceries-in-a-supermarket.jpg)

In this project I used the K-means algorithm to cluster grocery items based on their transaction data.

**Items that are often purchased together can be placed in the same aisle, or in aisles close to each other,
which can increase sales!**

This used to be done by human experts, who needed **many years of experience** in the industry to
narrow things down. However, with the rise of big data and machine learning, why not
let AI do the hard work for you?

OK, let's get started. The first step is to gather the dataset. Here I downloaded
[the Instacart dataset](https://www.kaggle.com/c/instacart-market-basket-analysis/data)
from Kaggle and **filtered out items that don't yet have a sufficient purchase history** in the dataset
(those purchased fewer than 100 times), because they may not carry enough information to be clustered
reliably (i.e. they may end up forming weird 1-item clusters).

In [42]:
import pandas as pd
from scipy import sparse
from sklearn import metrics
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

In [43]:
# Load the Instacart data and keep only items that have been purchased at least 100 times
df = pd.read_csv('data/order_products__train.csv')
df = df.drop(['add_to_cart_order', 'reordered'], axis=1)
df = df.groupby('product_id').filter(lambda x:len(x) >= 100)

df_products = pd.read_csv('data/products.csv')
df_aisles = pd.read_csv('data/aisles.csv')
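
The `groupby(...).filter(lambda ...)` call above runs a Python function once per product, which can be slow on large tables. As a side note, here is a minimal sketch of the same "keep items with enough purchases" idea using `value_counts` and `map`. The toy `orders` table and the `MIN_PURCHASES` name are hypothetical stand-ins, not part of the original notebook, and the threshold is lowered to 2 so the toy data shows the effect.

```python
import pandas as pd

# Toy stand-in for the order/product table (hypothetical values)
orders = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3, 3],
    'product_id': [10, 20, 10, 30, 10, 20, 40],
})

MIN_PURCHASES = 2  # hypothetical name; the post uses a threshold of 100 on the real data

# Look up, for every row, how many times that row's product was purchased overall
purchase_counts = orders['product_id'].map(orders['product_id'].value_counts())

# Keep only the rows whose product meets the threshold
filtered = orders[purchase_counts >= MIN_PURCHASES]

kept_products = sorted(filtered['product_id'].unique().tolist())
print(kept_products)
```

On this toy table, products 30 and 40 are each bought once and are dropped, while products 10 and 20 survive.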

In [44]:
df.head()

,order_id,product_id
1,1,11109
2,1,10246
3,1,49683
5,1,13176
6,1,47209


In [5]:
# Transform order data into a matrix where
# each row is an order, each column is a product, and each value 1 indicates a purchase
df['filler'] = 1
df = df.pivot(index='order_id', columns='product_id', values='filler').fillna(0)
df.head()

order_id / product_id,10,34,45,79,95,116,117,130,141,160,...,49481,49517,49520,49533,49585,49605,49610,49621,49628,49683
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
print('Number of Transactions:', df.shape[0])
print('Number of Items:', df.shape[1])

Number of Transactions: 125956
Number of Items: 2457


In [9]:
# Save memory by converting the order-by-item matrix to a sparse matrix
data = df.to_numpy()
data_sparse = sparse.csr_matrix(data)
# Item-item cosine similarity: transpose so that each row represents a product
data_clustering = metrics.pairwise.cosine_similarity(data_sparse.T)
data_clustering = sparse.csr_matrix(data_clustering)
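
The cell above still goes through a dense pivot table before converting to a sparse matrix. If memory is tight, one possible alternative is to build the sparse order-by-item matrix directly from the `(order_id, product_id)` pairs. The sketch below shows this on a toy table; the variable names (`orders`, `basket`, `item_similarity`) are hypothetical, and it assumes each product appears at most once per order, since repeated pairs would be summed rather than kept at 1.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the filtered order/product table (hypothetical values)
orders = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3],
    'product_id': [10, 20, 10, 20, 30],
})

# Map ids to consecutive row/column positions
order_codes, order_ids = pd.factorize(orders['order_id'])
product_codes, product_ids = pd.factorize(orders['product_id'], sort=True)

# One stored entry per purchase; the dense order-by-item table is never built.
# Assumes no (order, product) pair is repeated, otherwise entries are summed.
basket = sparse.csr_matrix(
    (np.ones(len(orders)), (order_codes, product_codes)),
    shape=(len(order_ids), len(product_ids)),
)

# Transpose so each row is a product, then compare products by the orders they share
item_similarity = cosine_similarity(basket.T)
print(item_similarity.shape)
```

Here `product_ids` keeps the mapping from column position back to the original product id, which is what you would need when attaching cluster labels to products later.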

In [15]:
# Elbow method: fit K-means for a range of cluster counts and record the inertia of each fit
cluster_inertia = []
for i in range(20, 30):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data_clustering)
    cluster_inertia.append(kmeans.inertia_)

plt.plot(range(20, 30), cluster_inertia)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
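
K-means starts from random centroids, so the inertia curve can shift a little between runs. A minimal sketch of the same elbow loop with the randomness pinned down is shown below. It runs on synthetic blobs from `make_blobs` rather than the Instacart similarity matrix, and the `random_state=0`, `n_init=10`, and range of `k` values are illustrative assumptions, not settings taken from this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the item feature matrix (hypothetical data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

candidate_k = range(2, 9)
inertias = []
for k in candidate_k:
    # Fixing random_state and n_init makes repeated runs produce the same curve
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertias.append(model.inertia_)

# How much inertia is saved by adding one more cluster; the "elbow" is
# roughly where these savings become small
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, inertia in zip(candidate_k, inertias):
    print(k, round(inertia, 1))
```

Looking at the successive drops alongside the plot can make it easier to judge where adding clusters stops paying off.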

In [25]:
# Fit the final model with 15 clusters
kmeans = KMeans(n_clusters=15)

# Assign each product to a cluster, then attach product names and aisle ids
clusters = kmeans.fit_predict(data_clustering)
final_clusters = pd.DataFrame({'cluster':clusters,
                               'product_id':df.columns})
df_cluster = final_clusters.sort_values('cluster')
df_cluster = pd.merge(df_cluster, df_products, how='left', on='product_id')

In [27]:
df_cluster['cluster'].value_counts()


5     348
8     327
9     291
6     235
11    218
4     202
3     192
7     188
0     181
2     140
10     63
14     29
13     19
1      15
12      9
Name: cluster, dtype: int64

In [41]:
df_cluster.query('cluster==0')

,cluster,product_id,product_name,aisle_id,department_id
0,0,36189,Key Lime Yoghurt,120,16
1,0,20876,Crunchy Peanut Butter Energy Bar,3,19
2,0,20754,Mediterranean Mint Gelato,37,1
3,0,36076,Everything Deli Style Pretzel Crisps Crackers,107,19
4,0,20670,Organic Lentil Vegetable Soup,69,15
...,...,...,...,...,...
176,0,31883,Vegetable Lasagna,38,1
177,0,49273,Light and Lean Quinoa Black Beans with Buttern...,38,1
178,0,49247,Coconut Yoghurt,120,16
179,0,2962,"Milk, Reduced Fat, 2% Milkfat",84,16
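
Scrolling through product names is one way to inspect a cluster. Another option is to count how many of its products come from each existing aisle. The sketch below illustrates that on toy tables; `df_cluster_demo` and `df_aisles_demo` are hypothetical stand-ins for `df_cluster` and `df_aisles`, the rows are made up, and the `aisle` column name is an assumption about the layout of `aisles.csv`.

```python
import pandas as pd

# Toy stand-ins for df_cluster and df_aisles (hypothetical rows)
df_cluster_demo = pd.DataFrame({
    'cluster':    [0, 0, 0, 1, 1],
    'product_id': [1, 2, 3, 4, 5],
    'aisle_id':   [120, 120, 38, 24, 24],
})
df_aisles_demo = pd.DataFrame({
    'aisle_id': [24, 38, 120],
    'aisle':    ['fresh fruits', 'frozen meals', 'yogurt'],  # assumed column name
})

# Attach a readable aisle name to every clustered product
labelled = df_cluster_demo.merge(df_aisles_demo, how='left', on='aisle_id')

# Count products per (cluster, aisle) and list the largest aisles first within each cluster
aisle_counts = (
    labelled.groupby(['cluster', 'aisle'])
    .size()
    .reset_index(name='n_products')
    .sort_values(['cluster', 'n_products'], ascending=[True, False])
)
print(aisle_counts)
```

A cluster dominated by one or two aisles roughly mirrors the current store layout, while a cluster spread across many aisles points to items that are bought together but shelved apart.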
