# SD201 Project 

## Dataset (from a Kaggle competition): Instacart Market Basket Analysis

Link: https://www.kaggle.com/c/instacart-market-basket-analysis/data

Blog post about the competition: https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

Key points from the dataset:

- 3M grocery store orders
- 200,000+ Instacart users
- 4 to 100 orders for each user, timestamped

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 10/12/2021

## Introduction

In this project, we use association rule mining algorithms to make recommendations based on items that are frequently bought together on the Instacart online store platform.

We will use and compare the following algorithms:

- Apriori algorithm
- Frequent Pattern Growth Algorithm (FP-Growth)
- ECLAT algorithm

In [9]:
'''Python libraries'''

# Utility libraries
import pandas as pd
import scipy.stats as s
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# Association rule preprocessing libraries
from mlxtend.preprocessing import TransactionEncoder

# Association rule mining libraries
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
from pyECLAT import ECLAT

# Metrics
from sklearn.metrics import f1_score

# Pretty charts
import seaborn as sns
sns.set_theme(style="ticks")

In [3]:
# Open the data
op_prior = pd.read_csv('./instacart/order_products__prior.csv')
orders = pd.read_csv('./instacart/orders.csv')
products = pd.read_csv('./instacart/products.csv')

### Data cleaning

For association rule mining we only need `order_id` and the ordered products for each order. Since there are no null entries in the carts data, we do not have to deal with `NaN` values here.

However, there are some null entries in the `days_since_prior_order` column that we have to deal with for multi-label classification (see the `SD201-Instacart-MultiLabelClassification` notebook).

#### Formatting the data 

We cannot exploit our relational data directly: we need to perform merges using the keys in the data, and then perform an aggregation over the ordered products to get arrays of ordered products for each order.

Moreover, instead of keeping all the items (which causes memory problems when applying the mining algorithms), we keep only the most frequent items, following what was done in the EDA.

In [42]:
# Minimum relative frequency for a product to be kept (10e-5 == 1e-4)
threshold = 10e-5
# Note: this is the number of rows in op_prior (ordered items), not the number of distinct orders
order_count = len(op_prior)

# Create the DataFrame of ordered products with their frequencies
# (rename_axis + reset_index(name=...) yields the same columns on pandas 1.x and 2.x)
item_freq = (
    op_prior.product_id.value_counts()
    .rename_axis('product_id')
    .reset_index(name='n_occ')
)
item_freq['frequency'] = item_freq['n_occ'] / order_count
item_freq = item_freq.merge(products[['product_id', 'product_name']], on='product_id')

# Compare the number of products before and after the drop
bf_size = len(item_freq)
item_freq = item_freq[item_freq.frequency>threshold]
af_size = len(item_freq)
print('Number of products before :', bf_size, 'after:', af_size)

Number of products before : 49677 after: 1756
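
To make the filtering step concrete, here is a minimal sketch of the same idea on toy data, using only the standard library instead of pandas. The order lines, the `frequent_products` helper and the `0.2` threshold are all made up for illustration. As in the cell above, the frequency is relative to the total number of ordered items.

```python
from collections import Counter

# Toy order lines as (order_id, product) pairs; all values are made up.
order_lines = [
    (1, "banana"), (1, "milk"),
    (2, "banana"),
    (3, "banana"), (3, "saffron"),
    (4, "milk"),
]

def frequent_products(lines, threshold):
    """Return {product: frequency} for products whose frequency exceeds threshold.

    The frequency is the number of occurrences of the product divided by
    the total number of order lines.
    """
    counts = Counter(product for _, product in lines)
    total = len(lines)
    return {p: n / total for p, n in counts.items() if n / total > threshold}

kept = frequent_products(order_lines, threshold=0.2)
print(sorted(kept))
```

Here `saffron` appears in only 1 of the 6 order lines, so it falls below the threshold and is dropped.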


In [46]:
# Drop all rows with infrequently bought products
op_prior = op_prior[op_prior.product_id.isin(item_freq.product_id)]

In [47]:
def arrange_data(op_data):
    '''
    Format the data so that each order is mapped to the list of its products (the cart).
    op_data can be either op_train or op_prior.
    '''

    # Attach the product names to each (order, product) row
    data = op_data[['order_id', 'product_id']]
    data = data.merge(products[['product_id', 'product_name']], on='product_id')

    # Aggregate the products of each order into lists
    data = data.groupby('order_id').agg(list)

    # Rename the aggregated columns to 'cart' and 'cart_names'
    data = data.rename(columns={'product_id': 'cart', 'product_name': 'cart_names'})

    # Reset the index that was changed by the aggregation
    return data.reset_index()

In [48]:
# Create the DataFrame with aggregated carts for each order
data = arrange_data(op_prior)
data

### Association rule mining 

#### Definitions and metrics 
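
The metrics most commonly used to evaluate association rules are support, confidence and lift. As a rough sketch (the toy transactions and the helper names below are assumptions made for illustration, not part of the project code), they can be computed as follows for a rule `antecedent -> consequent`:

```python
# Toy transactions; all values are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent.

    A lift above 1 suggests the items co-occur more often than if they
    were independent; a lift below 1 suggests the opposite.
    """
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```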

#### Apriori algorithm

Because the Apriori algorithm is highly inefficient for large datasets (complexity in ...), we use the Apriori property to reduce the size of the set of products on which we apply the Apriori algorithm.

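The Apriori property states that every subset of a frequent itemset is itself frequent. Equivalently, any itemset that contains an infrequent itemset cannot be frequent.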
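
In practice, the property is used to prune candidates: a candidate itemset of size k can be discarded as soon as one of its subsets of size k-1 is not frequent, without counting its support. Below is a minimal sketch of this pruning step, using only the standard library. The helper names `has_infrequent_subset` and `generate_candidates` and the toy itemsets are hypothetical, and this is not how `mlxtend` implements it.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_prev
        for subset in combinations(candidate, k - 1)
    )

def generate_candidates(frequent_prev, k):
    """Build size-k candidates from frequent (k-1)-itemsets, then prune them."""
    items = sorted({item for itemset in frequent_prev for item in itemset})
    candidates = (frozenset(c) for c in combinations(items, k))
    return {c for c in candidates if not has_infrequent_subset(c, frequent_prev)}

# Toy frequent pairs; all values are made up for illustration.
frequent_pairs = {
    frozenset(pair)
    for pair in [("bread", "milk"), ("bread", "butter"),
                 ("milk", "butter"), ("milk", "eggs")]
}

triples = generate_candidates(frequent_pairs, 3)
print(triples)
```

Only `{bread, butter, milk}` survives: every other triple contains a pair such as `{bread, eggs}` that is not frequent.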


#### FP-Growth algorithm 

In [None]:
# Preprocess the transactions by one-hot-encoding them
carts = list(data.cart)
te = TransactionEncoder()
ohe_transactions = te.fit(carts).transform(carts)
ohe_transactions_df = pd.DataFrame(ohe_transactions, columns=te.columns_)
print(ohe_transactions_df)
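
To show what the one-hot encoding produces, here is a small pure-Python sketch that builds the same kind of boolean matrix, with one row per cart and one column per distinct item. The `one_hot_encode` helper and the toy carts are assumptions made for illustration; the sketch mimics the shape of the encoder's output, not its implementation.

```python
def one_hot_encode(carts):
    """Return (columns, rows): sorted distinct items and one boolean row per cart."""
    columns = sorted({item for cart in carts for item in cart})
    rows = []
    for cart in carts:
        present = set(cart)
        # True where the cart contains the item of the corresponding column
        rows.append([item in present for item in columns])
    return columns, rows

# Toy carts of product ids; all values are made up for illustration.
toy_carts = [[30, 10], [10], [20, 30]]

columns, rows = one_hot_encode(toy_carts)
print(columns)
for row in rows:
    print(row)
```

Each row is as wide as the number of distinct products, which is why reducing the number of products beforehand matters for memory.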