## Instacart data frequent itemsets test

This notebook tests frequent itemset mining on the Instacart grocery shopping dataset: https://www.instacart.com/datasets/grocery-shopping-2017

In [12]:
import pandas as pd
import seaborn as sns
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
warnings.simplefilter('ignore')

In [6]:
order_train = pd.read_csv("/Users/lingfeizeng/order_products__train.csv")
order_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [7]:
order_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             1384617 non-null int64
product_id           1384617 non-null int64
add_to_cart_order    1384617 non-null int64
reordered            1384617 non-null int64
dtypes: int64(4)
memory usage: 42.3 MB


In [8]:
products = pd.read_csv("/Users/lingfeizeng/products.csv")
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [9]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [5]:
order_train_products = pd.merge(order_train, products, on="product_id", how="inner")
order_train_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,1,49302,1,1,Bulgarian Yogurt,120,16
1,816049,49302,7,1,Bulgarian Yogurt,120,16
2,1242203,49302,1,1,Bulgarian Yogurt,120,16
3,1383349,49302,11,1,Bulgarian Yogurt,120,16
4,1787378,49302,8,0,Bulgarian Yogurt,120,16
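
An inner join silently drops order lines whose product_id has no match in the products table. The sketch below shows one way to check that nothing was lost. It uses small made-up frames (`orders_toy` and `products_toy` are hypothetical stand-ins, not the real data) and the `validate` argument of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for order_train and products (values are made up for illustration).
orders_toy = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products_toy = pd.DataFrame({"product_id": [10, 11],
                             "product_name": ["Banana", "Limes"]})

# validate="many_to_one" raises an error if product_id is duplicated in products_toy.
merged_toy = pd.merge(orders_toy, products_toy, on="product_id", how="inner",
                      validate="many_to_one")

# Compare row counts before and after the join to detect dropped order lines.
rows_dropped = len(orders_toy) - len(merged_toy)
print(rows_dropped)
```

If `rows_dropped` were non-zero on the real data, some orders would be missing items in the baskets built later.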


The data is grouped by order_id so that each order becomes one "basket" of product names.

In [10]:
train = order_train_products[["order_id","product_id", "product_name" ]]
train.head()

Unnamed: 0,order_id,product_id,product_name
0,1,49302,Bulgarian Yogurt
1,816049,49302,Bulgarian Yogurt
2,1242203,49302,Bulgarian Yogurt
3,1383349,49302,Bulgarian Yogurt
4,1787378,49302,Bulgarian Yogurt


In [11]:
basket = train.groupby(["order_id"])["product_name"].apply(list) 
basket.head() 

order_id
1     [Bulgarian Yogurt, Organic 4% Milk Fat Whole M...
36    [Grated Pecorino Romano Cheese, Spring Water, ...
38    [Shelled Pistachios, Organic Biologique Limes,...
96    [Roasted Turkey, Organic Cucumber, Organic Gra...
98    [Bag of Organic Bananas, Organic Raspberries, ...
Name: product_name, dtype: object

In [15]:
te = TransactionEncoder()
te_ary = te.fit(basket).transform(basket)
df1 = pd.DataFrame(te_ary, columns=te.columns_)
itemsets = apriori(df1, min_support=0.01, use_colnames=True)
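
To make the cell above concrete, here is a minimal pure-Python sketch of the two ideas it relies on: the one-hot (boolean) encoding of baskets and the support threshold. The baskets, the `support` helper and the 0.5 threshold are illustrative assumptions, not the notebook's data. The pair search is brute force, whereas Apriori prunes candidates, but the definition of support is the same.

```python
from itertools import combinations

# Toy baskets (made up); each inner list is one order.
toy_baskets = [
    ["Banana", "Limes"],
    ["Banana", "Organic Avocado"],
    ["Banana", "Limes", "Organic Avocado"],
    ["Strawberries"],
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for b in baskets if itemset.issubset(b))
    return hits / len(baskets)

# One-hot view: one row per basket, one boolean column per distinct item.
items = sorted({item for b in toy_baskets for item in b})
one_hot = [[item in b for item in items] for b in toy_baskets]

# Keep every pair whose support reaches the threshold.
min_support = 0.5
frequent_pairs = {}
for pair in combinations(items, 2):
    s = support(pair, toy_baskets)
    if s >= min_support:
        frequent_pairs[pair] = s
```

With `min_support=0.01` as in the notebook, an itemset is kept only if it appears in at least 1% of all orders.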

In [16]:
itemsets.iloc[104:]  # show only the itemsets that consist of more than one item (rows 104 onward in this run)


Unnamed: 0,support,itemsets
104,0.017042,"(Organic Baby Spinach, Bag of Organic Bananas)"
105,0.018444,"(Organic Hass Avocado, Bag of Organic Bananas)"
106,0.013566,"(Bag of Organic Bananas, Organic Raspberries)"
107,0.023428,"(Organic Strawberries, Bag of Organic Bananas)"
108,0.016447,"(Large Lemon, Banana)"
109,0.010144,"(Banana, Limes)"
110,0.016889,"(Banana, Organic Avocado)"
111,0.015243,"(Organic Baby Spinach, Banana)"
112,0.016569,"(Organic Strawberries, Banana)"
113,0.014847,"(Strawberries, Banana)"
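
Selecting rows by position only works while the single-item itemsets happen to occupy the first 104 rows. A sketch of a selection that does not depend on row position is to filter on itemset length instead. The frame below is a made-up stand-in shaped like the apriori output (a `support` column and an `itemsets` column of frozensets), and the `length` column is a helper added for illustration.

```python
import pandas as pd

# Toy frame shaped like the apriori output (values are made up).
toy_itemsets = pd.DataFrame({
    "support": [0.30, 0.20, 0.12],
    "itemsets": [frozenset({"Banana"}), frozenset({"Limes"}),
                 frozenset({"Banana", "Limes"})],
})

# Keep itemsets with more than one item, regardless of where they sit in the frame.
toy_itemsets["length"] = toy_itemsets["itemsets"].apply(len)
multi = toy_itemsets[toy_itemsets["length"] > 1].sort_values("support", ascending=False)
```

The same two lines should apply to the real `itemsets` frame, assuming its `itemsets` column holds set-like objects.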


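Surprisingly, every frequent itemset with more than one item is a pair of produce items, and every pair contains bananas (either Banana or Bag of Organic Bananas). Bananas are bought most often with other fruits such as avocados, strawberries, raspberries, lemons and limes, and also with baby spinach. Organic items tend to be bought together: every pair containing Bag of Organic Bananas pairs it with another organic product.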
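
Support alone favors popular items, so a pair can be frequent simply because one of its items is in many baskets. Lift compares the observed pair support with what independence would predict, and it is one of the metrics reported by the `association_rules` function imported at the top. The sketch below computes lift by hand on made-up baskets; the `support` and `lift` helpers are hypothetical, not mlxtend functions.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset.issubset(b)) / len(baskets)

def lift(a, b, baskets):
    """support(a and b) / (support(a) * support(b)); 1.0 means independence."""
    return support([a, b], baskets) / (support([a], baskets) * support([b], baskets))

# Toy baskets (made up): Banana is the most popular single item.
toy_baskets = [
    ["Banana", "Limes"],
    ["Banana", "Strawberries"],
    ["Banana"],
    ["Limes", "Strawberries"],
]

# Both pairs appear in exactly one of four baskets, so their supports are equal.
banana_limes = lift("Banana", "Limes", toy_baskets)
limes_strawberries = lift("Limes", "Strawberries", toy_baskets)
```

In this toy data the two pairs have the same support, yet the Banana pair has the lower lift because Banana is common on its own. Running the same check on the real itemsets would show whether the banana pairs reflect a genuine association or only banana's popularity.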
