We are going to build a segmentation of Instacart users and then use the segment as a feature in a basic classifier that predicts whether a user will buy a product in their next order. Another option is to use the segments to identify products that are likely to be purchased in the next order but have not been purchased before, which would enlarge the sample for the general model. Finally, we could build a separate model for each cluster.

I decided to do this to try a couple of algorithms in practice. Specifically, I am going to use TruncatedSVD for dimensionality reduction, IsolationForest for excluding outliers, and KMeans for the clustering itself.

I thought it would be interesting to share my results with the community. I would be very happy to get feedback on my approach, so please comment if you have any questions or suggestions. Let's start!

First we load the data.

In [None]:
import pandas as pd
import numpy as np

input_folder = '../input/'

products = pd.read_csv(input_folder + 'products.csv', index_col='product_id')
orders = pd.read_csv(input_folder + 'orders.csv', usecols=['order_id','user_id','eval_set'], index_col='order_id')
item_prior = pd.read_csv(input_folder + 'order_products__prior.csv', usecols=['order_id','product_id'], index_col=['order_id','product_id'])

Now let's extract what we need: which customers have bought which products, and in how many prior orders.

In [None]:
# basic prior products table
user_product = orders.join(item_prior, how='inner').reset_index().groupby(['user_id','product_id']).count()
user_product = user_product.reset_index().rename(columns={'order_id':'prior_order_count'})

I am going to convert it into a sparse matrix so that we can then apply a dimensionality reduction algorithm.

In [None]:
from scipy.sparse import csr_matrix
user_product_sparse = csr_matrix((user_product['prior_order_count'], (user_product['user_id'], user_product['product_id'])), shape=(user_product['user_id'].max()+1, user_product['product_id'].max()+1), dtype=np.uint16)
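
To make the layout of this matrix concrete, here is a minimal sketch on a made-up toy table (the `toy` names and values are illustrative only, not Instacart data). It assumes, as the code above does, that ids are used directly as row and column indices and that they start at 1, so row 0 and column 0 stay empty.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for user_product: (user_id, product_id, prior_order_count) triplets.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'product_id': [2, 4, 2, 1],
    'prior_order_count': [3, 1, 5, 2],
})

# Rows are users, columns are products. Ids double as indices,
# hence the "+ 1" in the shape and the unused row 0 / column 0.
toy_sparse = csr_matrix(
    (toy['prior_order_count'], (toy['user_id'], toy['product_id'])),
    shape=(toy['user_id'].max() + 1, toy['product_id'].max() + 1),
)

print(toy_sparse.shape)
print(toy_sparse.toarray())
```

Only the non-zero counts are stored, which is what makes a users-by-products matrix of this size fit in memory.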

Now let's apply a truncated singular value decomposition with 10 components. In short, this algorithm reduces the dimension of our problem: instead of dealing with ~50k features (products), it finds a new, much smaller feature space that explains as much of the variance in our data as possible.

In [None]:
from sklearn.decomposition import TruncatedSVD
decomp = TruncatedSVD(n_components=10, random_state=101)
user_reduced = decomp.fit_transform(user_product_sparse)

print(decomp.explained_variance_ratio_[:10], decomp.explained_variance_ratio_.sum())

As we can see in the output, our 10 new factors explain ~16% of the total variance, and the most important factor alone explains around 6%. Not bad for a reduction from 50k to 10 variables.
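
For readers who want to try the decomposition step without loading the full dataset, here is a small sketch on a random sparse matrix. The matrix size, density, and the `toy_*` names are assumptions made for illustration; the numbers it prints say nothing about the Instacart data.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in: 200 "users" x 50 "products", about 10% non-zero entries.
rng = np.random.RandomState(0)
toy_counts = sparse_random(200, 50, density=0.1, format='csr', random_state=rng)

toy_svd = TruncatedSVD(n_components=5, random_state=101)
toy_reduced = toy_svd.fit_transform(toy_counts)

# One row per user, one column per component.
print(toy_reduced.shape)
# Share of variance captured by each component, and by all of them together.
print(toy_svd.explained_variance_ratio_, toy_svd.explained_variance_ratio_.sum())
```

TruncatedSVD works directly on the sparse input without densifying it, which is why it suits a matrix like this one.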

The next step is to cluster the data, but first we need to make a couple of adjustments. First of all, let's scale it with StandardScaler so that every component contributes on the same scale to the Euclidean distances KMeans relies on.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
user_reduced_scaled = scaler.fit_transform(user_reduced)

It's also a good idea to get rid of outliers before clustering; otherwise there is a good chance we end up with a separate cluster for each outlier and all the remaining users in a single cluster, which is definitely not what we want. I am going to use the IsolationForest algorithm with contamination set to 5%, the share of observations I'd like to exclude when training the KMeans model.

In [None]:
from sklearn.ensemble import IsolationForest
clf = IsolationForest(contamination=0.05, random_state=101)
clf.fit(user_reduced_scaled)
outliers = clf.predict(user_reduced_scaled)

unique, counts = np.unique(outliers, return_counts=True)
dict(zip(unique, counts))
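
As a rough illustration of what the contamination setting does, here is a sketch on synthetic 2D points (the data and the `toy_*` names are made up). It shows that the flagged share follows the contamination value, and that `score_samples` exposes the underlying anomaly scores, which can be inspected when comparing candidate thresholds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Synthetic stand-in for the scaled features: one dense blob plus scattered points.
toy_X = np.vstack([rng.normal(0, 1, size=(950, 2)),
                   rng.uniform(-8, 8, size=(50, 2))])

toy_forest = IsolationForest(contamination=0.05, random_state=101).fit(toy_X)
toy_labels = toy_forest.predict(toy_X)  # -1 = outlier, 1 = regular observation

# contamination sets the share of training points labelled as outliers.
toy_outlier_share = (toy_labels == -1).mean()
print(toy_outlier_share)

# Anomaly scores: lower means more abnormal. Their low quantiles show
# where alternative contamination values would put the cut-off.
toy_scores = toy_forest.score_samples(toy_X)
print(np.quantile(toy_scores, [0.01, 0.05, 0.10]))
```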

I found this contamination value empirically, making sure we don't exclude too much data while the resulting classes stay more or less balanced. If someone has a good idea how to determine this parameter algorithmically, please let me know! Let's check what it looks like in the 2D space of the first two factors (the most important ones).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# red is an outlier, green is a regular observation
color_map = np.vectorize({ -1: 'r', 1: 'g'}.get)
plt.scatter(user_reduced_scaled[:,0], user_reduced_scaled[:,1], c=color_map(outliers), alpha=0.1)

It might look like we have excluded too many observations, but in reality the density varies a lot across our space, and most of the points remain untouched.

Now let's train the KMeans algorithm and check if we can get some meaningful clusters out.

In [None]:
from sklearn.cluster import KMeans

clusters_count = 10

kmc = KMeans(n_clusters=clusters_count, init='random', n_init=10, random_state=101)
kmc.fit(user_reduced_scaled[outliers == 1,:])
clusters = kmc.predict(user_reduced_scaled)

unique, counts = np.unique(clusters, return_counts=True)
dict(zip(unique, counts))

The vast majority of users fall into a single class. That might be a problem if we are failing to capture the actual differences between our users, but it might also be fine if most users really are very similar. There is really no way to know that for sure.
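
One common heuristic for judging a clustering, which is not part of the analysis in this kernel, is the silhouette score. The sketch below runs it on synthetic blobs for several values of k; the data, the range of k, and the `toy_*` names are assumptions for illustration. On the real data the score is expensive to compute over all users, so the `sample_size` argument of `silhouette_score` would likely be needed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with 4 well-separated groups.
toy_X, _ = make_blobs(n_samples=600,
                      centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                      cluster_std=1.0, random_state=0)

toy_silhouette = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=101).fit_predict(toy_X)
    # Values near 1 mean compact, well-separated clusters.
    toy_silhouette[k] = silhouette_score(toy_X, labels)

best_k = max(toy_silhouette, key=toy_silhouette.get)
print(toy_silhouette, best_k)
```

A high score does not prove the clusters are useful for the downstream model; it only says they are geometrically well separated.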

Let's check what the clusters look like on the same 2D plane.

In [None]:
plt.scatter(user_reduced_scaled[:,0], user_reduced_scaled[:,1], c=clusters / (clusters_count-1), cmap='tab10', alpha=0.1)

It looks like a mess, but don't forget that we are clustering in 10 dimensions. Still, some clear structure is visible.

Now let's check what the actual differences between the clusters are. We will look for products that are more popular within a cluster than in the whole sample. It does not make much sense to just look at the most popular products by cluster, because the same products, like bananas or avocados, will likely top the list in every cluster. Instead, we will compare product ranks within each cluster (ranked by number of purchases) with the ranks in the total population.

In [None]:
# dataframe with overall product ranks
top_products_overall = user_product[['product_id','prior_order_count']].groupby('product_id').sum().reset_index().sort_values('prior_order_count', ascending=False)
top_products_overall['rank_overall'] = top_products_overall['prior_order_count'].rank(ascending=False)

# packing clusters we found into dataframe
usersdf = pd.DataFrame(clusters[1:], columns=['cluster'], index=np.arange(1, user_product['user_id'].max()+1))

# dataframe with product ranks across clusters
top_products = user_product.merge(usersdf, left_on='user_id', right_index=True)[['product_id','cluster','prior_order_count']].groupby(['product_id','cluster']).sum().reset_index().sort_values(['cluster','prior_order_count'], ascending=False)
top_products['rank'] = top_products.groupby('cluster')['prior_order_count'].rank(ascending=False)

# merging with overall top products
top_products = top_products.merge(top_products_overall[['product_id','rank_overall']], left_on='product_id', right_on='product_id')
# calculating differences between ranks
top_products['rank_diff'] = top_products['rank'] - top_products['rank_overall']
# leaving top products in each cluster: 2 with largest and 2 with smallest difference in ranks
top_products_asc_diff = top_products.sort_values(['cluster','rank_diff'], ascending=False).groupby('cluster').head(2).reset_index(drop=True)
top_products_desc_diff = top_products.sort_values(['cluster','rank_diff'], ascending=True).groupby('cluster').head(2).reset_index(drop=True)
top_products_diff = pd.concat([top_products_asc_diff,top_products_desc_diff], axis=0)

# printing results
top_products_diff.merge(products[['product_name']], left_on='product_id', right_index=True)[['cluster','product_name','rank','rank_overall','rank_diff']].sort_values(['cluster','rank_diff'])
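
The rank comparison is easier to follow on a tiny example. The sketch below uses made-up products and counts (all names and numbers are illustrative) and reproduces only the core idea: rank within the cluster minus rank overall.

```python
import pandas as pd

# Toy purchase counts per (cluster, product).
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1],
    'product': ['banana', 'olives', 'nuts', 'banana', 'olives', 'nuts'],
    'count':   [50, 5, 20, 10, 30, 1],
})

# Rank over the whole population (1 = most purchased).
overall = (toy.groupby('product')['count'].sum()
              .rank(ascending=False)
              .rename('rank_overall'))

# Rank within each cluster.
toy['rank'] = toy.groupby('cluster')['count'].rank(ascending=False)

toy = toy.merge(overall, left_on='product', right_index=True)
# Negative rank_diff: the product ranks higher in the cluster than overall.
toy['rank_diff'] = toy['rank'] - toy['rank_overall']
print(toy.sort_values(['cluster', 'rank_diff']))
```

In this toy data, olives rank higher in cluster 1 than overall, while bananas rank lower there.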

It's not always easy to describe each cluster in words and get a feel for it. We can see that users from cluster 5 tend to buy *California Ripe Pitted Extra Large Olives* and *Vanilla Spiru-Tein High Protein Energy Shake* significantly more frequently than average, but they don't like *Extra Fancy Unsalted Mixed Nuts* and *Baby Cucumbers*, which are among the most popular products overall.

This absence of a clear interpretation does not mean the clusters are meaningless, though. In data science, the only real proof is actually using these clusters in one of the ways I mentioned at the beginning of the kernel. I will soon add this variable to my basic XGBoost classifier and post my results.