
# **[Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis/)**

---
- Given an `order_id`, predict all the products that the user will reorder in that order.

---








## Importing libraries

In [None]:
import plotly
import plotly.express as px
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import re
import gc   # garbage collector
import pickle
# https://pypi.org/project/tqdm/#:~:text=jupyter%20console.%20Use-,auto,-instead%20of%20autonotebook
from tqdm.auto import tqdm
import time
from joblib import Parallel, delayed
from sklearn.metrics import f1_score,confusion_matrix,\
                            precision_recall_fscore_support,classification_report,\
                            accuracy_score,log_loss
from sklearn.model_selection import train_test_split

bold = lambda string: '\033[1m'+string+'\033[0m'    # for bold text
printb = lambda string: print('\033[1m'+string+'\033[0m')
# https://stackoverflow.com/questions/8924173/how-do-i-print-bold-text-in-python/8930747

## Kaggle file uploader utility:
- To upload intermediate tables.

In [None]:
def kaggle_file_uploader(files,id='shubhamscifi/instacart',title='instacart',folder='data',msg='',first_time=False,del_after_upload=True):
    '''Uploads list of files to kaggle.
    Note: make sure to run after kaggle authentication.
    id : must be between 6-50 chars after "username/".
    title : must be between 6-50 chars.
    files : list of path of files that are to be uploaded.
    first_time: True if the data is being uploaded for the first time.
    del_after_upload: True if given folder needs to be deleted after file upload finishes.'''
    # https://github.com/Kaggle/kaggle-api

    # create data package json file
    !mkdir {folder}
    !kaggle datasets init -p {folder}

    # preparing metadata json file
    import json,os
    metadata = open(os.path.join(folder,'dataset-metadata.json'),'r+')
    meta = json.load(metadata)
    meta['id'] = id
    meta['title']= title
    metadata.seek(0)
    json.dump(meta,metadata)
    metadata.truncate()
    metadata.close()

    for file in set(files):
        !cp {file} {folder}

    # upload dataset to kaggle
    if (first_time):
        !kaggle datasets create -p {folder}
    else:
        # Create a New Dataset Version
        !kaggle datasets version -p {folder} -m '{msg}'

    if (del_after_upload):
        !rm -rf {folder}
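
The metadata step inside the uploader can also be written with context managers, so the file is closed even if an error occurs. The following is a minimal sketch of only that step, not a replacement for the function above. The helper name `update_dataset_metadata` and the placeholder JSON content are assumptions made for illustration; the real file is generated by `kaggle datasets init`.

```python
import json
import os
import tempfile


def update_dataset_metadata(folder, dataset_id, title):
    """Rewrite id and title in <folder>/dataset-metadata.json; return the new dict."""
    path = os.path.join(folder, 'dataset-metadata.json')
    with open(path) as fh:
        meta = json.load(fh)
    meta['id'] = dataset_id
    meta['title'] = title
    with open(path, 'w') as fh:
        json.dump(meta, fh, indent=2)
    return meta


# Demonstration on a temporary folder. The placeholder file below only
# stands in for the one that `kaggle datasets init` would create.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'dataset-metadata.json'), 'w') as fh:
        json.dump({'id': 'PLACEHOLDER', 'title': 'PLACEHOLDER', 'licenses': []}, fh)
    updated = update_dataset_metadata(tmp, 'shubhamscifi/instacart', 'instacart')
    print(updated['id'], updated['title'])
```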

## Loading Data

In [None]:
# Kaggle authentication
from getpass import getpass
import os

os.environ['KAGGLE_USERNAME'] = "shubhamscifi" #input('Enter kaggle username: ') # kaggle username
os.environ['KAGGLE_KEY'] = getpass('Enter Token: ') # kaggle api key

Enter Token: ··········


**Download intermediate prepared tables.**

In [None]:
!kaggle datasets download -d shubhamscifi/instacart --unzip

Downloading instacart.zip to /content
 99% 1.46G/1.47G [00:13<00:00, 189MB/s]
100% 1.47G/1.47G [00:13<00:00, 113MB/s]


In [None]:
%%time
dataset = pd.read_feather('dataset.feather')

CPU times: user 2.92 s, sys: 19.8 s, total: 22.7 s
Wall time: 1.35 s


In [None]:
# loading data into pandas dataframe
orders = pd.read_csv('/content/orders.csv',dtype={'order_id':np.uint32,
                                                  'user_id' :np.uint32,
                                                  'order_number':'uint8',
                                                  'order_hour_of_day':'uint8',
                                                  'order_dow':'uint8',
                                                  'days_since_prior_order':'float16'})
dep = pd.read_csv('/content/departments.csv', dtype={'department_id':'uint8',
                                                     'department': str})
aisles = pd.read_csv('/content/aisles.csv', dtype={'aisle_id':'uint8',
                                                     'aisle': str})
products = pd.read_csv('/content/products.csv', dtype={'aisle_id':'uint8',
                                                     'department_id':'uint8',
                                                     'product_name': str,
                                                     'product_id': np.uint16})
order_products_prior = pd.read_csv('/content/order_products__prior.csv',
                                   dtype={'add_to_cart_order':'uint8',
                                          'reordered':'uint8',
                                          'order_id':np.uint32,
                                          'product_id':np.uint16})
order_products_train = pd.read_csv('/content/order_products__train.csv',
                                   dtype={'add_to_cart_order':'uint8',
                                          'reordered':'uint8',
                                          'order_id':np.uint32,
                                          'product_id':np.uint16})

## Merging Tables.

In [None]:
# Merging relational tables
# joining orders and order_products_prior table to get whole prior data table.
prior_data = orders.merge(order_products_prior, how='inner', on='order_id')

# sorting prior_data to get a more structured data so that we can analyse well.
prior_data.sort_values(['user_id','order_number','add_to_cart_order'],inplace=True, axis='index',\
                 ignore_index=True)
prior_data

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2539329,1,prior,1,2,8,,196,1,0
1,2539329,1,prior,1,2,8,,14084,2,0
2,2539329,1,prior,1,2,8,,12427,3,0
3,2539329,1,prior,1,2,8,,26088,4,0
4,2539329,1,prior,1,2,8,,26405,5,0
...,...,...,...,...,...,...,...,...,...,...
32434484,2977660,206209,prior,13,1,12,7.0,14197,5,1
32434485,2977660,206209,prior,13,1,12,7.0,38730,6,0
32434486,2977660,206209,prior,13,1,12,7.0,31477,7,0
32434487,2977660,206209,prior,13,1,12,7.0,6567,8,0


In [None]:
prod_info = products.merge(dep,on='department_id').merge(aisles,on='aisle_id')
prod_info

Unnamed: 0,product_id,product_name,aisle_id,department_id,department,aisle
0,1,Chocolate Sandwich Cookies,61,19,snacks,cookies cakes
1,78,Nutter Butter Cookie Bites Go-Pak,61,19,snacks,cookies cakes
2,102,Danish Butter Cookies,61,19,snacks,cookies cakes
3,172,Gluten Free All Natural Chocolate Chip Cookies,61,19,snacks,cookies cakes
4,285,Mini Nilla Wafers Munch Pack,61,19,snacks,cookies cakes
...,...,...,...,...,...,...
49683,22827,Organic Black Mission Figs,18,10,bulk,bulk dried fruits vegetables
49684,28655,Crystallized Ginger Chunks,18,10,bulk,bulk dried fruits vegetables
49685,30365,Vegetable Chips,18,10,bulk,bulk dried fruits vegetables
49686,38007,Naturally Sweet Plantain Chips,18,10,bulk,bulk dried fruits vegetables




---



## Baseline (non-ML) Models:


### Utility Functions:

**Submission Function for test data:**

In [None]:
def get_reordered_prod_string(df,target='reordered'):
    '''Returns string: space delimited product_ids of reordered products.'''
    reordered_prod_list = ' '.join(list(map(str,df.loc[df[target]==1,'product_id'])))
    if len(reordered_prod_list)!=0:
        return reordered_prod_list
    else:   # if no reordered products
        return 'None'

def submission(prediction,target='reordered',msg='',sub_file_name='submission.csv'):
    '''Submits the prediction and prints the result.'''

    # creating submission dataframe as mentioned here
    # https://www.kaggle.com/c/instacart-market-basket-analysis/overview/evaluation
    # 'reordered' is not needed here; selecting it twice would break when target=='reordered'
    sub = prediction[['order_id','product_id',target]].groupby(['order_id'])\
                    .apply(lambda df: get_reordered_prod_string(df,target))\
                    .reset_index()

    sub.columns = ['order_id','products']
    sub.to_csv(sub_file_name,index=False)

    # Submit a competition
    # https://github.com/Kaggle/kaggle-api#submit-to-a-competition
    !kaggle competitions submit instacart-market-basket-analysis -f {sub_file_name} -m '{msg}'

    time.sleep(2)
    
    # List my competition submissions
    # https://github.com/Kaggle/kaggle-api#list-competition-submissions
    result = !kaggle competitions submissions instacart-market-basket-analysis
    result[3] = "\033[1;31;47m"+result[3]+"\033[0m" # changing string color & background. https://ozzmaker.com/add-colour-to-text-in-python/
    print('\n'.join(result[:-37]))
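
To make the submission format concrete, here is a small sketch that builds the `products` column for two toy orders without calling the Kaggle CLI. The order and product ids, the column name `prediction`, and the helper `products_string` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy predictions: order 17 has two predicted reorders, order 34 has none.
toy = pd.DataFrame({
    'order_id':   [17, 17, 17, 34, 34],
    'product_id': [196, 13176, 21903, 196, 47766],
    'prediction': [1, 1, 0, 0, 0],
})


def products_string(group, target='prediction'):
    """Space-delimited predicted product_ids, or 'None' when nothing is predicted."""
    ids = group.loc[group[target] == 1, 'product_id'].astype(str)
    return ' '.join(ids) if len(ids) else 'None'


# One row per order, in ascending order_id.
rows = [(order_id, products_string(group))
        for order_id, group in toy.groupby('order_id')]
toy_sub = pd.DataFrame(rows, columns=['order_id', 'products'])
print(toy_sub.to_csv(index=False))
```

An order with no predicted reorders still gets a row, with the literal string `None` as its product list.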

**F1-score calculation function for train and CV datasets:**

In [None]:
def f_score_helper(df,target):
    '''
    Returns pd.Series of f1-score, precision and recall for given order_id.
    '''
    TP = df['reordered'] @ df[target] # true positive(numerator)
    den_pr = df[target].sum() # denominator for precision
    den_re = df['reordered'].sum()  # denominator for recall
    if (den_pr==0 and den_re==0):
        # if both the actual and prediction is None.
        TP+=1
    if den_pr==0:
        # if prediction is None.
        den_pr+=1
    if den_re==0:
        # if actual is None.
        den_re+=1
    
    return pd.Series({'f_score':2*TP/(den_re+den_pr),
                      'precision' :TP/(den_pr),
                      'recall': TP/(den_re)})

def f_score(dataset,target='reordered',pr_re=False):
    '''
    Returns Samples F1-score.

    pr_re : if True return (f-score, precision & recall) otherwise only f-score.
    '''
    # dataframe to contain contribution of each order_id to precision and recall.
    # dict.fromkeys drops the duplicate column name when target=='reordered'
    f_score,pr,re = dataset[list(dict.fromkeys(['order_id','reordered',target]))]\
                    .groupby('order_id')\
                    .apply(f_score_helper,target)\
                    .mean()
    
    if (pr_re==True):
        return f_score,pr,re
    return f_score
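
The edge cases in `f_score_helper` are easier to see on a worked example. The sketch below recomputes the per-order F1 with plain Python sets for three toy orders (the product ids are illustrative) and then averages them, which is what the samples F1-score does. The helper name `order_f1` is hypothetical.

```python
def order_f1(actual, predicted):
    """F1 for one order; both arguments are sets of product_ids."""
    if not actual and not predicted:
        return 1.0  # 'None' predicted and nothing was actually reordered
    true_positives = len(actual & predicted)
    return 2 * true_positives / (len(actual) + len(predicted))


orders_toy = {
    'A': ({196, 13176}, {196, 13176, 21903}),  # 2 hits, 1 false positive
    'B': (set(), {196}),                       # nothing reordered, 1 predicted
    'C': (set(), set()),                       # nothing reordered, nothing predicted
}

per_order = {name: order_f1(actual, predicted)
             for name, (actual, predicted) in orders_toy.items()}
mean_f1 = sum(per_order.values()) / len(per_order)
print(per_order)
print('mean F1:', mean_f1)
```

Order B shows why predicting anything for an order with no reorders is costly: its F1 is 0, whereas predicting `None` for the same order would have scored 1.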

### 1.Dumb Model:
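- This model predicts every product a user has previously purchased as a reorder. It gives very high recall but low precision. Recall stays below 1 because an order with no actual reorders scores a recall of 0 whenever anything is predicted for it.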

In [None]:
dataset['all_one_pred'] = 1

In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'all_one_pred',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.13203255435741582
Recall   : 0.934520828343096
F1-score : 0.21561404547484328


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'all_one_pred',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.13150145762540513
Recall   : 0.9342529786855677
F1-score : 0.21490249759132454


In [None]:
# for test datapoints prediction.
submission(dataset[dataset.eval_set=='test'],
           target='all_one_pred',
           msg='Dumb Baseline Model',
           sub_file_name='all_prior_prod.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
all_prior_prod.csv          2021-07-01 15:37:03  Dumb Baseline Model                             complete  0.21648      0.21527


In [None]:
del dataset['all_one_pred']

Data | F1-score
--- | ---
Train | 0.2156
CV | 0.2149
Test | 0.2153

**Any sensible ML model must achieve a test F1-score above 0.2153.**

### 2.Predict k random prior purchased products as reorder.
k = average number of reordered products per order, over all orders.

In [None]:
print('Avg no. of reordered products in an order:')
prior_data[prior_data.order_number!=1].groupby('order_id')['reordered'].sum().mean()

Avg no. of reordered products in an order:


6.357150430506554

In [None]:
print('Median no. of reordered products in an order:')
prior_data[prior_data.order_number!=1].groupby('order_id')['reordered'].sum().median()

Median no. of reordered products in an order:


5.0

An order contains about **6** reordered products on average (mean 6.36, median 5).

In [None]:
avg = 6

In [None]:
def get_products_dataframe(prod,avg):
    '''Returns DataFrame: up to `avg` product_ids drawn uniformly at random (without replacement) as prediction.'''
    prod = np.random.choice(prod,
                            size=min(len(prod),avg),
                            replace=False)
    return pd.DataFrame(data={'product_id':prod})

prediction = dataset.groupby('user_id')['product_id']\
                    .apply(lambda prod: get_products_dataframe(prod,avg))
prediction = prediction.droplevel(level=1)
prediction = prediction.reset_index()
prediction['prediction'] = 1
prediction.head(3)

Unnamed: 0,user_id,product_id,prediction
0,1,35951,1
1,1,26088,1
2,1,41787,1


In [None]:
dataset = dataset.merge(prediction,on=['user_id','product_id'],how='left')
dataset['prediction'] = dataset['prediction'].fillna(0)
dataset.head(3)

Unnamed: 0,user_id,product_id,order_id,eval_set,reordered,#reorders_u,#purchases_u,#first_purchases_u,p(reorder|user)_u,mean_#reorders_u,median_#reorders_u,min_#reorders_u,max_#reorders_u,mean_#purchases_u,median_#purchases_u,min_#purchases_u,max_#purchases_u,mean_#first_purchases_u,median_#first_purchases_u,min_#first_purchases_u,max_#first_purchases_u,"mean_p(reorder|user,order)_u","median_p(reorder|user,order)_u","min_p(reorder|user,order)_u","max_p(reorder|user,order)_u",dep_target_enc,aisle_target_enc,eatable,#avg_reorders_dep,p(reorder|dep_of_prod),#avg_reorders_aisle,p(reorder|aisle_of_prod),#reorders_p,#purchases_p,#first_purchases_p,p(reorder|product)_p,#reorders_up,"p(reorder|user,product)_up",reordered_in_last_order,reordered_in_2ndlast_order,reordered_in_3rdlast_order,prediction
0,2,13176,1492625,train,0.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,315913.0,379450.0,63537.0,0.832555,0,0.0,0,0,0,0.0
1,2,41787,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,23015.0,35413.0,12398.0,0.649903,1,0.076923,0,0,0,1.0
2,2,32792,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.088779,0.098166,1,264.682791,0.57418,306.341772,0.591986,791.0,1370.0,579.0,0.577372,8,0.615385,0,0,1,0.0


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'prediction',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.1315992712438728
Recall   : 0.17775653719766146
F1-score : 0.12954642884935152


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'prediction',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.13176629491316202
Recall   : 0.17693998410726286
F1-score : 0.12942676394294883


In [None]:
submission(dataset[dataset.eval_set=='test'],
           target='prediction',
           msg='Random 6 products per order',
           sub_file_name='random_6_prod.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
random_6_prod.csv           2021-07-01 15:45:50  Random 6 products per order                     complete  0.13114      0.13044
all_prior_prod.csv          2021-07-01 15:37:03  Dumb Baseline Model                             complete  0.21648      0.21527       


In [None]:
del dataset['prediction']

Data | F1-score
--- | ---
Train | 0.1295
CV | 0.1294
Test | 0.1304

**Worse than the Dumb model.**
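
The precision of this baseline is almost identical to the dumb model's (about 0.13 in both cases), while recall collapses. One plausible reading is that a uniformly random subset keeps the same expected hit rate as the full candidate list and simply covers fewer of the true reorders. The simulation below is a sketch of that argument for a single toy user; the counts (40 candidates, 5 true reorders, k = 6) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_candidates, n_reordered, k = 40, 5, 6   # toy user, illustrative numbers
labels = np.zeros(n_candidates, dtype=int)
labels[:n_reordered] = 1                  # 1 = product is actually reordered

precisions, recalls = [], []
for _ in range(20000):
    # Pick k candidate positions uniformly at random, without replacement.
    picked = rng.choice(n_candidates, size=k, replace=False)
    hits = labels[picked].sum()
    precisions.append(hits / k)
    recalls.append(hits / n_reordered)

mean_precision = float(np.mean(precisions))
mean_recall = float(np.mean(recalls))
print('predict-all precision:', n_reordered / n_candidates)
print('random-k precision   :', mean_precision)
print('random-k recall      :', mean_recall)
```

In this toy setting the random-k precision stays close to the predict-all precision, while recall falls to roughly k divided by the number of candidates.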

### 3.Predict k random prior purchased products as reorder, giving more weight to products that have been reordered more by all users.
k = average number of reordered products per order for the user.
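
As a standalone illustration of the weighted draw, the sketch below samples product ids without replacement, with probability proportional to a weight. The helper name `weighted_pick`, the ids, and the weights are assumptions for illustration. It also caps k at the number of non-zero weights, a defensive choice, because NumPy cannot draw more distinct items than there are entries with non-zero probability.

```python
import numpy as np


def weighted_pick(product_ids, weights, k, rng):
    """Draw up to k distinct product_ids with probability proportional to weights."""
    product_ids = np.asarray(product_ids)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        return product_ids[:0]            # no usable weights: predict nothing
    # Sampling without replacement needs at least k non-zero weights.
    k = min(k, int(np.count_nonzero(weights)))
    return rng.choice(product_ids, size=k, replace=False, p=weights / total)


rng = np.random.default_rng(42)
picked = weighted_pick([196, 13176, 38928, 46149], [50, 300, 0, 10], k=3, rng=rng)
nothing = weighted_pick([196, 13176], [0, 0], k=1, rng=rng)
print(sorted(picked.tolist()))
print(len(nothing))
```

With three non-zero weights and k = 3, every product with a non-zero weight is selected and the zero-weight product never is.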

In [None]:
def get_products_dataframe(prod):
    '''
    Returns DataFrame: random k no. of product_ids as prediction.
    k = avg no. of reorders over all orders of a user.
    '''
    user_mean_reorders = round(prod['mean_#reorders_u'].iloc[0])
    prob = np.asarray(prod['#reorders_p'])
    sum_= prob.sum()
    if(sum_==0):
        prod = []
    else:
        prob= prob/sum_
        prod = np.random.choice(prod['product_id'],
                                size=user_mean_reorders,
                                p=prob,
                                replace=False,)

    return pd.DataFrame(data={'product_id':prod})

prediction = dataset.groupby('user_id')\
                    .apply(lambda prod: get_products_dataframe(prod))
prediction = prediction.droplevel(level=1)
prediction = prediction.reset_index()
prediction['prediction'] = 1
prediction.head(3)

Unnamed: 0,user_id,product_id,prediction
0,1,13176,1
1,1,196,1
2,1,38928,1


In [None]:
dataset = dataset.merge(prediction,on=['user_id','product_id'],how='left')
dataset['prediction'] = dataset['prediction'].fillna(0)
dataset.head(3)

Unnamed: 0,user_id,product_id,order_id,eval_set,reordered,#reorders_u,#purchases_u,#first_purchases_u,p(reorder|user)_u,mean_#reorders_u,median_#reorders_u,min_#reorders_u,max_#reorders_u,mean_#purchases_u,median_#purchases_u,min_#purchases_u,max_#purchases_u,mean_#first_purchases_u,median_#first_purchases_u,min_#first_purchases_u,max_#first_purchases_u,"mean_p(reorder|user,order)_u","median_p(reorder|user,order)_u","min_p(reorder|user,order)_u","max_p(reorder|user,order)_u",dep_target_enc,aisle_target_enc,eatable,#avg_reorders_dep,p(reorder|dep_of_prod),#avg_reorders_aisle,p(reorder|aisle_of_prod),#reorders_p,#purchases_p,#first_purchases_p,p(reorder|product)_p,#reorders_up,"p(reorder|user,product)_up",reordered_in_last_order,reordered_in_2ndlast_order,reordered_in_3rdlast_order,prediction
0,2,13176,1492625,train,0.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,315913.0,379450.0,63537.0,0.832555,0,0.0,0,0,0,1.0
1,2,41787,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,23015.0,35413.0,12398.0,0.649903,1,0.076923,0,0,0,0.0
2,2,32792,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.088779,0.098166,1,264.682791,0.57418,306.341772,0.591986,791.0,1370.0,579.0,0.577372,8,0.615385,0,0,1,0.0


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'prediction',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.20281477926181793
Recall   : 0.16366686054301055
F1-score : 0.16830368494685785


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'prediction',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.2018252999490733
Recall   : 0.16265559340249897
F1-score : 0.1674412047461378


In [None]:
submission(dataset[dataset.eval_set=='test'],
           'prediction',
           msg='k most reordered prod by user',
           sub_file_name='random_k_prod.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
random_k_prod.csv           2021-07-01 16:10:04  k most reordered prod by user                   complete  0.16905      0.16757
random_6_prod.csv           2021-07-01 15:45:50  Random 6 products per order                     complete  0.13114      0.13044       
all_prior_prod.csv          2021-07-01 15:37:03  Dumb Baseline Model                             complete  0.21648      0.21527       


In [None]:
del dataset['prediction']

Data | F1-score
--- | ---
Train | 0.1683
CV | 0.1674
Test | 0.1676

**Worse than the Dumb model.**

### 4.Predict k random prior purchased products as reorder, giving more weight to products that the user has reordered more often.
k = average number of reordered products per order for the user.

In [None]:
def get_products_dataframe(prod):
    '''
    Returns DataFrame: random k no. of product_ids as prediction.
    k = avg no. of reorders over all orders of a user.
    '''
    user_mean_reorders = round(prod['mean_#reorders_u'].iloc[0])
    prob = np.asarray(prod['#reorders_up'])
    sum_= prob.sum()
    if(sum_==0):
        prod = []
    else:
        prob= prob/sum_
        prod = np.random.choice(prod['product_id'],
                                size=user_mean_reorders,
                                p=prob,
                                replace=False,)
    return pd.DataFrame(data={'product_id':prod})

prediction = dataset.groupby('user_id')\
                    .apply(lambda prod: get_products_dataframe(prod))
prediction = prediction.droplevel(level=1)
prediction = prediction.reset_index()
prediction['prediction'] = 1
prediction.head(3)

Unnamed: 0,user_id,product_id,prediction
0,1,13176.0,1
1,1,196.0,1
2,1,12427.0,1


In [None]:
dataset = dataset.merge(prediction,on=['user_id','product_id'],how='left')
dataset['prediction'] = dataset['prediction'].fillna(0)
dataset.head(3)

Unnamed: 0,user_id,product_id,order_id,eval_set,reordered,#reorders_u,#purchases_u,#first_purchases_u,p(reorder|user)_u,mean_#reorders_u,median_#reorders_u,min_#reorders_u,max_#reorders_u,mean_#purchases_u,median_#purchases_u,min_#purchases_u,max_#purchases_u,mean_#first_purchases_u,median_#first_purchases_u,min_#first_purchases_u,max_#first_purchases_u,"mean_p(reorder|user,order)_u","median_p(reorder|user,order)_u","min_p(reorder|user,order)_u","max_p(reorder|user,order)_u",dep_target_enc,aisle_target_enc,eatable,#avg_reorders_dep,p(reorder|dep_of_prod),#avg_reorders_aisle,p(reorder|aisle_of_prod),#reorders_p,#purchases_p,#first_purchases_p,p(reorder|product)_p,#reorders_up,"p(reorder|user,product)_up",reordered_in_last_order,reordered_in_2ndlast_order,reordered_in_3rdlast_order,prediction
0,2,13176,1492625,train,0.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,315913.0,379450.0,63537.0,0.832555,0,0.0,0,0,0,0.0
1,2,41787,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,23015.0,35413.0,12398.0,0.649903,1,0.076923,0,0,0,0.0
2,2,32792,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.088779,0.098166,1,264.682791,0.57418,306.341772,0.591986,791.0,1370.0,579.0,0.577372,8,0.615385,0,0,1,1.0


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'prediction',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.29788501750751745
Recall   : 0.24004032968391323
F1-score : 0.24712396386151955


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'prediction',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.29705093555004275
Recall   : 0.23930457982169814
F1-score : 0.24627466478839993


In [None]:
submission(dataset[dataset.eval_set=='test'],
           'prediction',
           msg='k most reordered prod(weighted) by user',
           sub_file_name='random_k_prod.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
random_k_prod.csv           2021-07-01 16:18:53  k most reordered prod(weighted) by user         complete  0.24865      0.24653
random_k_prod.csv           2021-07-01 16:10:04  k most reordered prod by user                   complete  0.16905      0.16757       
random_6_prod.csv           2021-07-01 15:45:50  Random 6 products per order                     complete  0.13114      0.13044       
all_prior_prod.csv          2021-07-01 15:37:03  Dumb Baseline Model                             complete  0.21648      0.21527       


In [None]:
del dataset['prediction']

Data | F1-score
--- | ---
Train | 0.2471
CV | 0.2463
Test | 0.2465

**Any sensible ML model must achieve a test F1-score above 0.2465.**

### 5.Predict purchases of last order as reorder:

In [None]:
def get_products_dataframe(df):
    """Returns DataFrame: product_ids of all products purchased in the user's last prior order."""
    last_order_number = df.order_number.max()
    reordered_prod = df.loc[df.order_number==last_order_number,['product_id']]
    return reordered_prod
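
The same selection can be expressed without `groupby(...).apply`, which tends to be slow on tens of millions of rows. The sketch below uses `transform('max')` to mark each user's last order on a toy stand-in for `prior_data` (the rows are illustrative). Adding a `reordered == 1` condition to the same mask keeps only the products that were reorders in that last order.

```python
import pandas as pd

# Toy stand-in for prior_data with only the columns this baseline needs.
toy_prior = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2, 2],
    'order_number': [1, 1, 2, 2, 1, 2, 2],
    'product_id':   [196, 14084, 196, 46149, 13176, 13176, 32792],
    'reordered':    [0, 0, 1, 0, 0, 1, 0],
})

# Broadcast each user's highest order_number to all of that user's rows.
last_number = toy_prior.groupby('user_id')['order_number'].transform('max')
in_last_order = toy_prior['order_number'] == last_number

# Everything bought in the last order.
last_purchases = toy_prior.loc[in_last_order, ['user_id', 'product_id']]
# Only the products that were reorders in the last order.
last_reorders = toy_prior.loc[in_last_order & (toy_prior['reordered'] == 1),
                              ['user_id', 'product_id']]
print(last_purchases)
print(last_reorders)
```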

In [None]:
prediction = prior_data.groupby('user_id')\
                        .apply(get_products_dataframe)
prediction = prediction.droplevel(level=1)
prediction = prediction.reset_index()
prediction['prediction'] = 1
prediction.head(3)

Unnamed: 0,user_id,product_id,prediction
0,1,196,1
1,1,46149,1
2,1,39657,1


In [None]:
dataset = dataset.merge(prediction,on=['user_id','product_id'],how='left')
dataset['prediction'] = dataset['prediction'].fillna(0)
dataset.head(3)

Unnamed: 0,user_id,product_id,order_id,eval_set,reordered,#reorders_u,#purchases_u,#first_purchases_u,p(reorder|user)_u,mean_#reorders_u,median_#reorders_u,min_#reorders_u,max_#reorders_u,mean_#purchases_u,median_#purchases_u,min_#purchases_u,max_#purchases_u,mean_#first_purchases_u,median_#first_purchases_u,min_#first_purchases_u,max_#first_purchases_u,"mean_p(reorder|user,order)_u","median_p(reorder|user,order)_u","min_p(reorder|user,order)_u","max_p(reorder|user,order)_u",dep_target_enc,aisle_target_enc,eatable,#avg_reorders_dep,p(reorder|dep_of_prod),#avg_reorders_aisle,p(reorder|aisle_of_prod),#reorders_p,#purchases_p,#first_purchases_p,p(reorder|product)_p,#reorders_up,"p(reorder|user,product)_up",reordered_in_last_order,reordered_in_2ndlast_order,reordered_in_3rdlast_order,prediction
0,2,13176,1492625,train,0.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,315913.0,379450.0,63537.0,0.832555,0,0.0,0,0,0,0.0
1,2,41787,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,23015.0,35413.0,12398.0,0.649903,1,0.076923,0,0,0,0.0
2,2,32792,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.088779,0.098166,1,264.682791,0.57418,306.341772,0.591986,791.0,1370.0,579.0,0.577372,8,0.615385,0,0,1,0.0


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'prediction',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.28617361269170094
Recall   : 0.42907766220462307
F1-score : 0.31153070081836953


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'prediction',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.28414623956748736
Recall   : 0.4276941325218089
F1-score : 0.3095193525012776


In [None]:
submission(dataset[dataset.eval_set=='test'],
           'prediction',
           msg='purchases of last order as reorder',
           sub_file_name='last_order.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
last_order.csv              2021-07-01 16:34:08  purchases of last order as reorder              complete  0.31180      0.31202
random_k_prod.csv           2021-07-01 16:18:53  k most reordered prod(weighted) by user         complete  0.24865      0.24653       
random_k_prod.csv           2021-07-01 16:10:04  k most reordered prod by user                   complete  0.16905      0.16757       
random_6_prod.csv           2021-07-01 15:45:50  Random 6 products per order                     complete  0.13114      0.13044       
all_prior_prod.csv          2021-07-01 15:37:03  Dumb Baseline Model                             complete  0.21648      0.21527       


In [None]:
del dataset['prediction']

Data | F1-score
--- | ---
Train | 0.3115
CV | 0.3095
Test | 0.3120

**Any sensible ML model must achieve a test F1-score above 0.3120.**

### 6.Predict reorders of last order as a reorder:

In [None]:
def get_products_dataframe(df):
    """Returns DataFrame: product_ids of the products that were reorders in the user's last prior order."""
    last_order_number = df.order_number.max()
    reordered_prod = df.loc[(df.order_number==last_order_number) & (df.reordered==1),['product_id']]
    return reordered_prod

In [None]:
prediction = prior_data.groupby('user_id')\
                        .apply(get_products_dataframe)
prediction = prediction.droplevel(level=1)
prediction = prediction.reset_index()
prediction['prediction'] = 1
prediction.head(3)

Unnamed: 0,user_id,product_id,prediction
0,1,196,1
1,1,46149,1
2,1,25133,1


In [None]:
dataset = dataset.merge(prediction,on=['user_id','product_id'],how='left')
dataset['prediction'] = dataset['prediction'].fillna(0)
dataset.head(2)

Unnamed: 0,user_id,product_id,order_id,eval_set,reordered,#reorders_u,#purchases_u,#first_purchases_u,p(reorder|user)_u,mean_#reorders_u,median_#reorders_u,min_#reorders_u,max_#reorders_u,mean_#purchases_u,median_#purchases_u,min_#purchases_u,max_#purchases_u,mean_#first_purchases_u,median_#first_purchases_u,min_#first_purchases_u,max_#first_purchases_u,"mean_p(reorder|user,order)_u","median_p(reorder|user,order)_u","min_p(reorder|user,order)_u","max_p(reorder|user,order)_u",dep_target_enc,aisle_target_enc,eatable,#avg_reorders_dep,p(reorder|dep_of_prod),#avg_reorders_aisle,p(reorder|aisle_of_prod),#reorders_p,#purchases_p,#first_purchases_p,p(reorder|product)_p,#reorders_up,"p(reorder|user,product)_up",reordered_in_last_order,reordered_in_2ndlast_order,reordered_in_3rdlast_order,prediction
0,2,13176,1492625,train,0.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,315913.0,379450.0,63537.0,0.832555,0,0.0,0,0,0,0.0
1,2,41787,1492625,train,1.0,93.0,195,102.0,0.476923,7.153846,9.0,0,14,14.0,14.0,5,26,6.846154,7.0,1,12,0.482419,0.571429,0.0,0.888889,0.128464,0.169311,1,3658.37886,0.649913,6846.777487,0.718104,23015.0,35413.0,12398.0,0.649903,1,0.076923,0,0,0,0.0


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='train'],'prediction',True)
printb('\tTrain')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	Train
------------------------------
Precision: 0.36659992740179564
Recall   : 0.3478140704088926
F1-score : 0.32665589705750153


In [None]:
f,pr,re = f_score(dataset[dataset.eval_set=='cv'],'prediction',True)
printb('\tCV')
print('------------------------------')
print('Precision:',pr)
print('Recall   :',re)
print('F1-score :',f)

	CV
------------------------------
Precision: 0.36370208855045355
Recall   : 0.3445139459523401
F1-score : 0.3230913244011057


In [None]:
submission(dataset[dataset.eval_set=='test'],
           'prediction',
           msg='reorders of last order as prediction',
           sub_file_name='last_order.csv')

fileName                    date                 description                                     status    publicScore  privateScore  
--------------------------  -------------------  ----------------------------------------------  --------  -----------  ------------  
last_order.csv              2021-07-01 16:47:38  reorders of last order as prediction            complete  0.32768      0.32763
last_order.csv              2021-07-01 16:34:08  purchases of last order as reorder              complete  0.31180      0.31202       
random_k_prod.csv           2021-07-01 16:18:53  k most reordered prod(weighted) by user         complete  0.24865      0.24653       
random_k_prod.csv           2021-07-01 16:10:04  k most reordered prod by user                   complete  0.16905      0.16757       
random_6_prod.csv           2021-07-01 15:45:50  Random 6 products per order                     complete  0.13114      0.13044       
all_prior_prod.csv          2021-07-01 15

In [None]:
del prediction

In [None]:
del dataset['prediction']

Data | F1-score
--- | ---
Train | 0.3267
CV | 0.3231
Test | 0.3276

**Any sensible ML model must achieve a test F1-score above 0.3276.**

### Summary of Baseline models:
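Baseline-Model | Description | Test F1-score | Rank
--- | --- | --- | ---
1 | All prior purchased products | 0.2153 | 4th
2 | 6 random prior purchased products | 0.1304 | 6th
3 | k random prior products, weighted by reorders across all users | 0.1676 | 5th
4 | k random prior products, weighted by the user's own reorders | 0.2465 | 3rd
5 | Purchases of last order | 0.3120 | 2nd
6 | Reorders of last order | 0.3276 | 1st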

**The best baseline predicts the products that were reorders in the customer's last order (test F1-score 0.3276).
Any sensible ML model must perform better than this.**

---

