# SD201 Project 

## Dataset (from a Kaggle competition) : Instacart Market Basket Analysis

Link : https://www.kaggle.com/c/instacart-market-basket-analysis/data

Blog post about the competition : https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

Key points from the dataset:

- 3M grocery store orders
- 200,000+ Instacart users
- 4 to 100 orders for each user, timestamped

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 10/12/2021"

## Introduction

In this notebook, we seek to optimise the k-NearestNeighbors classifier that we have chosen previously.

### Setup (run all)

In [1]:
# # Run cell if using Google Colab
# # Mount the private Google Drive folder to access the .csv files
# from google.colab import drive
# drive.mount('/gdrive')
# %cd /gdrive

In [2]:
'''Python librairies''' 

# Utility librairies
import pandas as pd
import scipy.stats as s
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# Preprocessing and pipeline librairies
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

# Wrapper to convert regular classifiers to multi-label classifiers
from sklearn.multioutput import MultiOutputClassifier

# Classifiers that support multi-label output
from sklearn.neighbors import KNeighborsClassifier

# To optimize hyperparameters
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import jaccard_score

# Plotting (unused)
from mlxtend.plotting import plot_decision_regions

In [3]:
# Open the data
path_to_csv = './instacart/'

op_prior = pd.read_csv(path_to_csv + 'order_products__prior.csv')
op_train = pd.read_csv(path_to_csv + 'order_products__train.csv')
orders   = pd.read_csv(path_to_csv + 'orders.csv')
products = pd.read_csv(path_to_csv + 'products.csv')

## Data cleaning (run all)

We cannot exploit our relational data directly: we need to perform merges using the keys in the data, and then perform an aggregation over the ordered products to get arrays of ordered products for each order.

Moreover, instead of keeping all the items (which poses memory problems when applying the mining algorithms), we can keep only the most frequent items according to what was done in EDA.

As we will see, this poses some limitations but gives preliminary insights for the Instacart platform.

In [4]:
# Set product_id as index to avoid problems when using loc
products.set_index('product_id', inplace=True)

In [5]:
threshold = 5e-4
order_count = len(op_prior)

# Create the DataFrame of ordered products with their frequencies
item_freq = op_prior.product_id.value_counts(ascending=False)
item_freq = pd.DataFrame(item_freq.reset_index())
item_freq.rename(columns={'product_id':'n_occ', 'index':'product_id'}, inplace= True)
item_freq['frequency'] = item_freq['n_occ']/order_count

In [6]:
# Compare the number of products before and after the drop
bf_size = len(item_freq)
item_freq = item_freq[item_freq.frequency>threshold]
af_size = len(item_freq)
print('Number of products before :', bf_size, 'after:', af_size)

Number of products before : 49677 after: 250


In [7]:
# # Keep first item in cart
# first_item_prior = op_prior[op_prior.add_to_cart_order == 1]

# Drop all rows with unfrequently bought products
op_prior = op_prior[op_prior.product_id.isin(item_freq.product_id)]

In [8]:
def arrange_data(op_data):
    '''
    Format the data so that to each order corresponds an array of product_id (the cart),
    and an array indicating whether an item was reordered or not.
    op_data can be either op_train or op_prior.
    '''
    data = orders.merge(op_data[['order_id', 'product_id']], on='order_id')
    
    # Aggregate the carts into arrays
    groupby_cols = ['order_id',
                    'user_id',
                    'eval_set',
                    'order_number',
                    'order_dow',
                    'order_hour_of_day',
                    'days_since_prior_order']
    
    data = data.groupby(groupby_cols).aggregate(list)
    
    # Rename the product_id column to 'cart'
    data.rename(columns = {'product_id':'cart'}, inplace = True)
    
    # Reset the index that was changed by the aggregation
    data = data.reset_index()
    
    return data

In [9]:
# Create the DataFrame with aggregated carts for each order
train_data = arrange_data(op_prior)

In [10]:
# Free the RAM for Google Colab
op_prior = None

## Data mining 

### Defining the model (run all)



#### Models and pipelines definition

Same as before.

In [11]:
numerical_cols = ['order_dow', 'order_hour_of_day', 'days_since_prior_order']
# Impute the average over all orders
avg_imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Imputing the average for a given client in a pipeline necessitates writing a custom imputer.
# This is optional and will be done if there is enough time.

# Min-max normalization 
mm_scaler = MinMaxScaler()
std_scaler = StandardScaler()

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', avg_imp, numerical_cols),
        ('norm', std_scaler, numerical_cols)
    ])

In [12]:
# kNN Classifier
# Use GridSearchCV to tune k
kNN_model = KNeighborsClassifier()
multi_kNN_model = MultiOutputClassifier(kNN_model, n_jobs=-1)

In [13]:
# kNN Classifier
kNN_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('kNN model', multi_kNN_model)
                             ],
                        verbose=True)

We define functions for fitting, scoring, and comparing all models at once.

In [14]:
def fit_score(model, X_train, X_test, y_train, y_test):
    """Fits the given multi-label model and scores it
    using the four metrics below"""
    # Fit the model
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Score the model
    f1 = f1_score(y_test, y_pred, average='samples')
    accuracy = accuracy_score(y_test, y_pred)
    hamming = hamming_loss(y_test, y_pred)
    jaccard = jaccard_score(y_test, y_pred, average='samples')

    return model, y_pred, f1, hamming, accuracy, jaccard
  
def compare_models(pipelines):
    """Fit and compare models based on speed and four metrics"""
    models = []
    predictions = []
    for p in pipelines:
        model, y_pred, f1, hamming, accuracy, jaccard = fit_score(p,
                                                              X_train,
                                                              X_test,
                                                              y_train_bm,
                                                              y_test_bm)
        models.append(model)
        predictions.append(y_pred)
        print('f1-score:', f1)
        print('accuracy score:', accuracy)
        print('hamming loss:', hamming)
        print('jaccard score:', jaccard)
    return models, predictions

#### Utility functions for fitting and comparing models

We also define functions that make the predictions human readable.

In [15]:
def convert_to_carts(y_pred):
    """Convert back binary matrix prediction outputs to human readable carts"""
    arr = mlb.inverse_transform(y_pred)
    carts = [[] for i in range (len(arr))]
    for i in range(len(arr)):
        for id in arr[i]:
            carts[i].append(products.loc[id].product_name)
    return carts

def print_carts(y_pred):
    """Print carts and number of empty carts.
     y_ pred must be in binary matrix format."""
    carts = pd.Series(convert_to_carts(y_pred))
    empty = 0
    for cart in carts:
        if cart == []:
            empty+=1
        else:
            print(cart)
    print('Number of empty carts:', empty)

### Model optimisation

We can now optimize our selected model.

We start by redefining the dataset by taking more data for the training step, and using the sparse format for our binary matrixes since the kNN model supports it.



In [16]:
# Fraction of data to keep
frac = 1/30

# Train and validate the models

features = ['order_dow', 'order_hour_of_day', 'days_since_prior_order']
target = 'cart'

# Take more data for training
train_sample = train_data.sample(axis=0, frac=frac)

# Define target and features
X = train_sample[features]
y = train_sample[target]

# Fit the MultiLabelBinarizer with sparse output
mlb = MultiLabelBinarizer()
mlb.fit(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Convert the target data into binary matrix
y_train_bm = mlb.transform(y_train)
y_test_bm = mlb.transform(y_test)

In [17]:
# Do n fold cross-validation
n=3
scores = cross_val_score(kNN_pipeline, X_train, y_train_bm, cv=n,
                         scoring='jaccard_samples', error_score='raise')
scores

[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=  10.6s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   8.9s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   8.8s


array([0.0048834 , 0.00388358, 0.00418727])

The scores for all folds are very similar, so we do not need to use cross-validation for our model. There is enough data.

In [18]:
# Perform GridSearch to optimize for k with n folds cross-validation

n=3
param_grid = {
    'kNN model__estimator__n_neighbors':[3,5,10]
}
score = 'jaccard_samples'

clf = GridSearchCV(kNN_pipeline, param_grid=param_grid, scoring=score, cv=n,
                   verbose=1)
clf.fit(X_train, y_train_bm)

print (clf.best_params_, clf.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Pipeline] ......... (step 2 of 2) Processing kNN model, total=  13.9s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=  10.8s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   9.4s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   9.5s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   9.6s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   9.1s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=   9.8s
[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipel

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 12.6min finished


[Pipeline] ...... (step 1 of 2) Processing preprocessor, total=   0.0s
[Pipeline] ......... (step 2 of 2) Processing kNN model, total=  17.7s
{'kNN model__estimator__n_neighbors': 3} 0.009784911053101522


In [19]:
# To get keys use the commented code below
# kNN_pipeline.get_params().keys()

Our best parameter is k=3.

## Next 

See `SD201_Instacart_AssociationRuleMining.ipynb` for the next step.