# Analysis of Item Associations
We will cluster the item representations using HDBSCAN.
The following segment reads the item representations generated by StarSpace. The first column is the item code; the remaining columns hold the item's embedding vector.

In [2]:
import numpy as np
import time
from hdbscan import HDBSCAN
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import os
import pandas as pd
from sklearn.decomposition import PCA
fpc = "/home/admin123/Starspace/"+ "data/pagespace.tsv"
df = pd.read_csv(fpc, sep = "\t", header = None)
item_codes = df.loc[:, 0]
df = df.loc[:, 1:]

In [3]:
# Compute HDBSCAN
hdb_t1 = time.time()
hdb = HDBSCAN(min_cluster_size=3).fit(df)
hdb_labels = hdb.labels_
hdb_elapsed_time = time.time() - hdb_t1

In [4]:
# Number of clusters in labels, ignoring noise if present.
n_clusters_hdb_ = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
print('\n\n++ HDBSCAN Results')
print('Estimated number of clusters: %d' % n_clusters_hdb_)
print('Elapsed time to cluster: %.4f s' % hdb_elapsed_time)



++ HDBSCAN Results
Estimated number of clusters: 64
Elapsed time to cluster: 2.2547 s
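The cluster count above excludes the noise label `-1`, but it does not show how many items were left unclustered or how large each cluster is. A minimal sketch of such a summary follows; `example_labels` is a made-up stand-in for `hdb_labels`, and `summarize_labels` is a hypothetical helper, not part of hdbscan.

```python
from collections import Counter

# Hypothetical label array standing in for hdb_labels; -1 marks noise points.
example_labels = [0, 0, 0, 1, 1, 1, 1, -1, -1, 2, 2, 2]


def summarize_labels(labels):
    """Return (number of clusters, number of noise points, size of each cluster)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove the noise bucket, defaulting to 0 if absent
    return len(counts), n_noise, dict(counts)


n_clusters, n_noise, sizes = summarize_labels(example_labels)
print("clusters:", n_clusters, "noise points:", n_noise, "sizes:", sizes)
```

Passing `hdb_labels` instead of the toy list should give the same kind of summary for the real clustering.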


## Create Item Code Description Lookup
This makes it easier to interpret the results of clustering. When we see which items cluster together, it is useful to have their descriptions.

In [5]:
fp = "/home/admin123/Starspace/"+ "data/Online_Retail.csv"
dfrd = pd.read_csv(fp)
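# Drop cancelled invoices (InvoiceNo starts with "C"); use ~ to negate a boolean mask
dfrd = dfrd[~dfrd['InvoiceNo'].str.startswith("C")]
# Drop bank-charge rows, which are not real items
dfrd = dfrd[~dfrd['StockCode'].str.startswith("BANK")]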
dfrd = dfrd.dropna(how ='any')
dfrd["StockCode"] = dfrd["StockCode"].astype(str)
dfrd["StockCode"] = "itemcode_" + dfrd["StockCode"]
req_cols = ["StockCode", "Description"]
dfrd = dfrd[req_cols]

## Implementation Note
Some items have multiple descriptions. For convenience, we will use just one of them; the variants appear to be data coding errors.
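The idea can be illustrated on a toy frame before applying it to the full data set. This is only a sketch: the item codes `itemcode_A` and `itemcode_B` are hypothetical, and keeping the first description seen is one simple choice among several (the most frequent description would be another).

```python
import pandas as pd

# Toy data: itemcode_A has two spellings of the same description.
toy = pd.DataFrame({
    "StockCode": ["itemcode_A", "itemcode_A", "itemcode_B", "itemcode_B", "itemcode_B"],
    "Description": ["WRAP, CAROUSEL", "WRAP CAROUSEL",
                    "DOGGY RUBBER", "DOGGY RUBBER", "DOGGY RUBBER"],
})

# Count distinct descriptions per item code and list the conflicting codes.
n_desc = toy.groupby("StockCode")["Description"].nunique()
codes_with_conflicts = n_desc[n_desc > 1].index.tolist()

# Keep the first description encountered for each item code.
first_desc = (
    toy.drop_duplicates(subset="StockCode", keep="first")
       .set_index("StockCode")["Description"]
)

print(codes_with_conflicts)
print(first_desc)
```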

In [6]:
len(dfrd["StockCode"].unique())

3664

In [7]:
len(dfrd["Description"].unique())

3876

In [8]:
desc_ct = dfrd.groupby("StockCode")["Description"].nunique()


In [9]:
desc_df = pd.DataFrame()
desc_df["itemcode"] = desc_ct.index
desc_df["num_description"] = desc_ct.tolist()

In [10]:
multiple_desc_df = desc_df.query("num_description > 1")

In [11]:
dfrd[dfrd["StockCode"] == multiple_desc_df.iloc[0,0]]["Description"].unique()

array(['WRAP, CAROUSEL', 'WRAP CAROUSEL'], dtype=object)

In [34]:
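# Keep one description per item code, indexed by StockCode.
# first() is used because nth(0) acts as a row filter in pandas 2.x and no longer indexes by group key.
df_lookup = dfrd.groupby("StockCode").first()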

In [13]:
df_lookup.head()

| StockCode | Description |
| --- | --- |
| itemcode_10002 | INFLATABLE POLITICAL GLOBE |
| itemcode_10080 | GROOVY CACTUS INFLATABLE |
| itemcode_10120 | DOGGY RUBBER |
| itemcode_10123C | HEARTS WRAPPING TAPE |
| itemcode_10124A | SPOTS ON RED BOOKCOVER TAPE |


In [24]:
df_lookup["item_code"] = df_lookup.index

In [37]:
# Map each item code directly to its description string
lookup_desc = df_lookup["Description"].to_dict()

In [41]:
df_cluster = pd.DataFrame()
df_cluster["itemcode"] = item_codes
df_cluster["Description"] = df_cluster["itemcode"].apply(lambda item: lookup_desc[item])
df_cluster["Cluster"] = hdb_labels
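The dictionary lookup in the cell above raises a `KeyError` if an embedded item code has no entry in the lookup, which could happen if the filters applied to the retail data removed an item that still has an embedding. A hedged alternative is `Series.map`, which returns `NaN` for missing keys. The sketch below uses a small hand-made lookup; `itemcode_99999` and the `"UNKNOWN"` placeholder are hypothetical.

```python
import pandas as pd

# Small stand-in for lookup_desc; the real lookup has one entry per item code.
lookup = {
    "itemcode_22431": "WATERING CAN BLUE ELEPHANT",
    "itemcode_22432": "WATERING CAN PINK BUNNY",
}

# itemcode_99999 is a hypothetical code with no description in the lookup.
codes = pd.Series(["itemcode_22431", "itemcode_22432", "itemcode_99999"])
labels = [1, 1, -1]

clusters = pd.DataFrame({"itemcode": codes, "Cluster": labels})
# map() yields NaN for codes missing from the lookup; replace those with a placeholder.
clusters["Description"] = clusters["itemcode"].map(lookup).fillna("UNKNOWN")

print(clusters)
```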

## Analysis of Clustering 
Samples of the items in cluster 1 and cluster 5 are shown below. The tables suggest that the item representations obtained from StarSpace capture meaningful similarity. As with word2vec in NLP, the representations are learned from items that are purchased together (analogous to words co-occurring within a context window), so items bought in similar contexts end up close together and tend to fall into the same cluster. Cluster 1 contains children's watering cans, and Cluster 5 contains children's garden tools.

In [48]:
df_cluster.query("Cluster == 1")

| | itemcode | Description | Cluster |
| --- | --- | --- | --- |
| 496 | itemcode_22431 | WATERING CAN BLUE ELEPHANT | 1 |
| 501 | itemcode_22432 | WATERING CAN PINK BUNNY | 1 |
| 727 | itemcode_22433 | WATERING CAN GREEN DINOSAUR | 1 |


In [49]:
df_cluster.query("Cluster == 5")

| | itemcode | Description | Cluster |
| --- | --- | --- | --- |
| 918 | itemcode_22522 | CHILDS GARDEN FORK BLUE | 5 |
| 933 | itemcode_22523 | CHILDS GARDEN FORK PINK | 5 |
| 1024 | itemcode_22520 | CHILDS GARDEN TROWEL BLUE | 5 |
| 1129 | itemcode_22521 | CHILDS GARDEN TROWEL PINK | 5 |
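The word2vec analogy in the analysis above treats each invoice as a "sentence" whose "words" are the item codes bought together. The notebook does not show how the embedding input was prepared, so the following is only a sketch of one plausible layout, assuming one basket per line with space-separated item codes. The invoice numbers and the way items are grouped into invoices are made up for illustration.

```python
import pandas as pd

# Hypothetical purchases: two invoices, each listing the items bought together.
purchases = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536366", "536366"],
    "StockCode": ["itemcode_22431", "itemcode_22432",
                  "itemcode_22522", "itemcode_22523", "itemcode_22520"],
})

# Join the item codes of each invoice into one space-separated "sentence".
baskets = purchases.groupby("InvoiceNo", sort=True)["StockCode"].apply(" ".join)
basket_lines = baskets.tolist()

for line in basket_lines:
    print(line)
```

Items that repeatedly share a basket then play the same role as words that share a context window, which is consistent with colour variants of the same product landing in one cluster.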
