Instacart Kaggle Competition 

This notebook was developed for a Kaggle competition on predicting repeat product purchases on Instacart, an online grocery ordering site.

Here is a link to details on the Instacart competition, including core datasets:
https://www.kaggle.com/c/instacart-market-basket-analysis

Data: The datasets included information on 3.4M orders and 49.7K unique products. The orders were split into “prior”, “train” and “test” sets. The prior orders covered the five (or more, as available) most recent orders leading up to the final order in the dataset. The train set covered the final order. The test set included user IDs and a few temporal attributes for the final order, but not the ordered products, which is what we were predicting.

Approach: I first conducted exploratory data analysis (EDA), using code posted by other Kaggle competitors as well as my own. I engineered additional features based on products (e.g., how often the item was reordered), users (e.g., how many items they typically ordered) and product/user combinations (e.g., how often a user bought a specific product). I then added those features to the train dataset, fitted a classifier on it, and applied the fitted classifier to the test dataset.
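To make the three feature families concrete, here is a minimal sketch on a toy table. The data values and the names `product_reorder_rate`, `avg_basket_size` and `times_bought` are illustrative assumptions, not the column names used in the earlier notebooks.

```python
import pandas as pd

# Toy order lines (hypothetical values): one row per product in an order.
lines = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2],
    "order_id":   [10, 10, 11, 11, 20, 20],
    "product_id": [196, 500, 196, 777, 196, 500],
    "reordered":  [0, 0, 1, 0, 0, 0],
})

# Product-level feature: share of purchases that were reorders.
product_feats = (
    lines.groupby("product_id")["reordered"].mean()
    .rename("product_reorder_rate").reset_index()
)

# User-level feature: average number of items per order.
user_feats = (
    lines.groupby("user_id")
    .agg(n_items=("product_id", "count"), n_orders=("order_id", "nunique"))
    .assign(avg_basket_size=lambda d: d["n_items"] / d["n_orders"])
    .reset_index()
)

# User/product feature: how many times this user bought this product.
pair_feats = (
    lines.groupby(["user_id", "product_id"]).size()
    .rename("times_bought").reset_index()
)

# One row per user/product pair, carrying all three kinds of feature.
features = (
    pair_feats
    .merge(product_feats, on="product_id", how="left")
    .merge(user_feats[["user_id", "avg_basket_size"]], on="user_id", how="left")
)
```

Each feature table is keyed on the level it describes, so the final merge only needs `product_id` and `user_id` as join keys.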

I tested a variety of classifiers and regression models, ultimately using a Random Forest classifier for my final submission.  

In [1]:
import numpy as np
import pandas as pd
import gc

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV

# pyplot is the supported plotting interface; pylab is discouraged
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 4

In [36]:
# Read the input -- these were the core files supplied by Instacart 

# orders with products identified - train dataset; 1.4M records
orders_train = pd.read_csv('Instacart_order_products__train.csv')

# orders with products identified - prior dataset; 32.4M records  
orders_prior = pd.read_csv('Instacart_order_products__prior.csv')

# all orders (without products); prior, train and test; test incl. NaNs; 3.4M records
orders = pd.read_csv('Instacart_orders.csv')

# product info; 49.7K records
products = pd.read_csv('Instacart_products.csv')

# aisle info 
aisles = pd.read_csv('Instacart_aisles.csv')

# dept info 
departments = pd.read_csv('Instacart_departments.csv')

This notebook builds on extensive data preparation and feature processing work conducted in earlier notebooks. It focuses on some final features that flag whether a product was in a user's last three orders (rather than in any of their orders).
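As a compact preview of the idea, the sketch below computes the three recency flags on a toy table. It is an alternative formulation, not the cell-by-cell code that follows: the helper column `orders_back` is hypothetical, and the flags are collapsed with `max` rather than `sum`, which gives the same result as long as a product appears at most once per order.

```python
import pandas as pd

# Toy prior-order lines (hypothetical): user 1 has orders 1..4.
prior = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1, 1],
    "product_id":   [196, 196, 500, 196, 777, 500],
    "order_number": [1, 2, 2, 3, 4, 4],
})

# Distance of each line's order from the user's most recent prior order:
# 0 = last order, 1 = second-to-last, 2 = third-to-last.
last = prior.groupby("user_id")["order_number"].transform("max")
prior["orders_back"] = last - prior["order_number"]

# One 0/1 flag per recency slot.
flag_cols = ["last_order", "2nd_last_order", "3rd_last_order"]
for k, name in enumerate(flag_cols):
    prior[name] = (prior["orders_back"] == k).astype(int)

# Collapse to one row per user/product pair.
flags = prior.groupby(["user_id", "product_id"])[flag_cols].max().reset_index()
```

In this toy data, product 196 was bought in the second- and third-to-last orders but not the last one, so its flags come out as 0, 1, 1.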

In [3]:
# Read in previously processed feature data
dfA=pd.read_csv('master_all_products_ordered.csv')

In [4]:
dfA.shape

(33894106, 11)

In [None]:
dfA.head(2)

Unnamed: 0.1,Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,0,2.0,33120.0,1.0,1.0,202279,prior,3,5,9,8.0
1,1,2.0,28985.0,2.0,1.0,202279,prior,3,5,9,8.0


In [None]:
dfB=dfA.loc[dfA['eval_set']=='prior'].reset_index()

In [7]:
dfB.shape

(32434489, 12)

In [8]:
dfB.columns.values

array(['index', 'Unnamed: 0', 'order_id', 'product_id',
       'add_to_cart_order', 'reordered', 'user_id', 'eval_set',
       'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'], dtype=object)

In [9]:
dfB= dfB.drop(['index', 'Unnamed: 0', 'add_to_cart_order', 'reordered', 'eval_set', 'order_dow', 'order_hour_of_day'],axis=1)

In [10]:
dfB.head()

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order
0,2.0,33120.0,202279,3,8.0
1,2.0,28985.0,202279,3,8.0
2,2.0,9327.0,202279,3,8.0
3,2.0,45918.0,202279,3,8.0
4,2.0,30035.0,202279,3,8.0


In [11]:
dfB['order_number_max']=dfB.groupby(['user_id'])['order_number'].transform('max')

In [12]:
dfB.head()

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max
0,2.0,33120.0,202279,3,8.0,8
1,2.0,28985.0,202279,3,8.0,8
2,2.0,9327.0,202279,3,8.0,8
3,2.0,45918.0,202279,3,8.0,8
4,2.0,30035.0,202279,3,8.0,8


In [13]:
dfB.shape

(32434489, 6)

In [14]:
dfB['order_number_max_minus_1']=dfB['order_number_max']- 1
dfB['order_number_max_minus_2']=dfB['order_number_max']- 2

In [15]:
dfB.head(2)

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2
0,2.0,33120.0,202279,3,8.0,8,7,6
1,2.0,28985.0,202279,3,8.0,8,7,6


In [16]:
dfB['in_last_order']=np.where( dfB['order_number']==dfB['order_number_max'], 1, 0 )

In [17]:
dfB.head(2)

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order
0,2.0,33120.0,202279,3,8.0,8,7,6,0
1,2.0,28985.0,202279,3,8.0,8,7,6,0


In [18]:
dfB.loc[dfB['in_last_order']==1][:5]

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order
219,25.0,9755.0,59897,19,25.0,19,18,17,1
220,25.0,31487.0,59897,19,25.0,19,18,17,1
221,25.0,37510.0,59897,19,25.0,19,18,17,1
222,25.0,14576.0,59897,19,25.0,19,18,17,1
223,25.0,22105.0,59897,19,25.0,19,18,17,1


In [19]:
dfB['in_2nd_last_order']=np.where( dfB['order_number']==dfB['order_number_max_minus_1'], 1, 0 )
dfB['in_3rd_last_order']=np.where( dfB['order_number']==dfB['order_number_max_minus_2'], 1, 0 )

In [20]:
dfB.loc[dfB['in_2nd_last_order']==1][:5]

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order,in_2nd_last_order,in_3rd_last_order
59,7.0,34050.0,142903,11,30.0,12,11,10,0,1,0
60,7.0,46802.0,142903,11,30.0,12,11,10,0,1,0
141,16.0,9755.0,174840,18,13.0,19,18,17,0,1,0
142,16.0,25466.0,174840,18,13.0,19,18,17,0,1,0
143,16.0,45437.0,174840,18,13.0,19,18,17,0,1,0


In [21]:
dfB.loc[dfB['in_3rd_last_order']==1][:5]

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order,in_2nd_last_order,in_3rd_last_order
125,14.0,20392.0,18194,49,3.0,51,50,49,0,0,1
126,14.0,27845.0,18194,49,3.0,51,50,49,0,0,1
127,14.0,162.0,18194,49,3.0,51,50,49,0,0,1
128,14.0,2452.0,18194,49,3.0,51,50,49,0,0,1
129,14.0,8575.0,18194,49,3.0,51,50,49,0,0,1


In [None]:
# # Save/read in as needed  
# dfB.to_csv('master_last_3_orders', index=False)

In [22]:
dfC=dfB.groupby(['user_id', 'product_id'])['in_last_order'].transform('count')

In [23]:
dfB['last_order']=dfB.groupby(['user_id', 'product_id'])['in_last_order'].transform('sum')

In [24]:
dfB.head()

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order,in_2nd_last_order,in_3rd_last_order,last_order
0,2.0,33120.0,202279,3,8.0,8,7,6,0,0,0,1
1,2.0,28985.0,202279,3,8.0,8,7,6,0,0,0,0
2,2.0,9327.0,202279,3,8.0,8,7,6,0,0,0,0
3,2.0,45918.0,202279,3,8.0,8,7,6,0,0,0,0
4,2.0,30035.0,202279,3,8.0,8,7,6,0,0,0,0


In [25]:
dfB['2nd_last_order']=dfB.groupby(['user_id', 'product_id'])['in_2nd_last_order'].transform('sum')
dfB['3rd_last_order']=dfB.groupby(['user_id', 'product_id'])['in_3rd_last_order'].transform('sum')

In [26]:
dfB.head(25)

Unnamed: 0,order_id,product_id,user_id,order_number,days_since_prior_order,order_number_max,order_number_max_minus_1,order_number_max_minus_2,in_last_order,in_2nd_last_order,in_3rd_last_order,last_order,2nd_last_order,3rd_last_order
0,2.0,33120.0,202279,3,8.0,8,7,6,0,0,0,1,0,1
1,2.0,28985.0,202279,3,8.0,8,7,6,0,0,0,0,0,1
2,2.0,9327.0,202279,3,8.0,8,7,6,0,0,0,0,0,0
3,2.0,45918.0,202279,3,8.0,8,7,6,0,0,0,0,1,1
4,2.0,30035.0,202279,3,8.0,8,7,6,0,0,0,0,1,0
5,2.0,17794.0,202279,3,8.0,8,7,6,0,0,0,0,1,1
6,2.0,40141.0,202279,3,8.0,8,7,6,0,0,0,0,0,1
7,2.0,1819.0,202279,3,8.0,8,7,6,0,0,0,0,0,0
8,2.0,43668.0,202279,3,8.0,8,7,6,0,0,0,0,0,1
9,3.0,33754.0,205970,16,12.0,25,24,23,0,0,0,1,1,1


In [27]:
dfB.shape

(32434489, 14)

In [28]:
dfB.columns.values

array(['order_id', 'product_id', 'user_id', 'order_number',
       'days_since_prior_order', 'order_number_max',
       'order_number_max_minus_1', 'order_number_max_minus_2',
       'in_last_order', 'in_2nd_last_order', 'in_3rd_last_order',
       'last_order', '2nd_last_order', '3rd_last_order'], dtype=object)

In [None]:
# Build datasets of products in the last 2 and last 3 orders (we already have all ordered and last ordered); we then need a feature set
# of last, 2nd-last and 3rd-last order flags for each product/user pair

In [29]:
dfLast2=dfB[(dfB['last_order']==1) | (dfB['2nd_last_order']==1)]

In [30]:
dfLast2.shape

(12840758, 14)

In [31]:
dfLast3=dfB[(dfB['last_order']==1) | (dfB['2nd_last_order']==1)  | (dfB['3rd_last_order']==1) ]

In [32]:
dfLast3.shape

(15869574, 14)

In [33]:
dfLast2rev=dfLast2[['product_id', 'user_id', 'last_order', '2nd_last_order', '3rd_last_order']]

In [34]:
dfLast2rev.shape

(12840758, 5)

In [None]:
dfLast2rev.reset_index()[:5]

In [36]:
dfLast2rev.columns.values

array(['product_id', 'user_id', 'last_order', '2nd_last_order',
       '3rd_last_order'], dtype=object)

In [None]:
# Reassign rather than use inplace=True: dfLast2rev is a slice of dfLast2
dfLast2rev=dfLast2rev.drop_duplicates()

In [38]:
dfLast2rev.shape

(3661801, 5)

In [39]:
dfLast2rev.head()

Unnamed: 0,product_id,user_id,last_order,2nd_last_order,3rd_last_order
0,33120.0,202279,1,0,1
3,45918.0,202279,0,1,1
4,30035.0,202279,0,1,0
5,17794.0,202279,0,1,1
9,33754.0,205970,1,1,1


In [None]:
# # Save/read in as needed
# dfLast2rev.to_csv('master_products_by_user_last2_orders', index=False)

In [40]:
dfLast3rev=dfLast3[['product_id', 'user_id', 'last_order', '2nd_last_order', '3rd_last_order']].reset_index()

In [41]:
dfLast3rev.shape

(15869574, 6)

In [42]:
dfLast3rev.columns.values

array(['index', 'product_id', 'user_id', 'last_order', '2nd_last_order',
       '3rd_last_order'], dtype=object)

In [43]:
dfLast3rev=dfLast3rev.drop(columns='index')

In [44]:
dfLast3rev.head()

Unnamed: 0,product_id,user_id,last_order,2nd_last_order,3rd_last_order
0,33120.0,202279,1,0,1
1,28985.0,202279,0,0,1
2,45918.0,202279,0,1,1
3,30035.0,202279,0,1,0
4,17794.0,202279,0,1,1


In [45]:
dfLast3rev.drop_duplicates(inplace=True)

In [46]:
dfLast3rev.shape

(4925215, 5)

In [47]:
dfLast3rev.head()

Unnamed: 0,product_id,user_id,last_order,2nd_last_order,3rd_last_order
0,33120.0,202279,1,0,1
1,28985.0,202279,0,0,1
2,45918.0,202279,0,1,1
3,30035.0,202279,0,1,0
4,17794.0,202279,0,1,1


In [None]:
# Save/read in as needed 
# dfLast3rev.to_csv('master_products_by_user_last3_orders', index=False)

In [48]:
dfB.shape

(32434489, 14)

In [49]:
dfB=dfB[['product_id', 'user_id', 'last_order', '2nd_last_order', '3rd_last_order']].reset_index()

In [50]:
dfB.columns.values

array(['index', 'product_id', 'user_id', 'last_order', '2nd_last_order',
       '3rd_last_order'], dtype=object)

In [51]:
dfB=dfB.drop(columns='index')

In [52]:
dfB.drop_duplicates(inplace=True)

In [53]:
dfB.shape

(13307953, 5)

In [None]:
# Save/read in as needed 
# dfB.to_csv('master_products_by_user_all_orders_with_last3_orders_flagged', index=False)

In [None]:
# Adding additional features -- "add to cart" order and avg order size. 
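The two features named above can be sketched on a toy table as follows. The data values are hypothetical; the column names mirror the ones used in the cells below.

```python
import pandas as pd

# Toy order lines (hypothetical values) for a single user with two orders.
lines = pd.DataFrame({
    "user_id":           [1, 1, 1, 1, 1],
    "order_id":          [10, 10, 10, 11, 11],
    "product_id":        [196, 500, 777, 500, 196],
    "add_to_cart_order": [1, 2, 3, 1, 2],
})

# Mean cart position of each product for each user, broadcast to every row.
lines["avg_add_to_cart_order"] = (
    lines.groupby(["user_id", "product_id"])["add_to_cart_order"].transform("mean")
)

# Per-user basket statistics via named aggregation.
per_user = (
    lines.groupby("user_id")["order_id"]
    .agg(num_orders="nunique", num_products="count")
    .reset_index()
)
per_user["avg_order_size"] = per_user["num_products"] / per_user["num_orders"]

# Attach the user-level statistics to every order line.
enriched = lines.merge(per_user, on="user_id", how="left")
```

`transform` keeps the original row count, while `agg` reduces to one row per user, which is why the user-level table has to be merged back.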

In [54]:
gc.collect()

135

In [None]:
# # Read in as needed 
# dfB=pd.read_csv("master_all_products_ordered.csv")

In [75]:
dfB.shape

(33894106, 5)

In [76]:
dfB.columns.values

array(['order_id', 'product_id', 'user_id', 'eval_set', 'add_to_cart_order'], dtype=object)

In [77]:
dfB=dfB[['order_id', 'product_id', 'user_id','eval_set', 'add_to_cart_order']]      

In [78]:
dfB.shape

(33894106, 5)

In [79]:
dfB.head()

Unnamed: 0,order_id,product_id,user_id,eval_set,add_to_cart_order
0,2.0,33120.0,202279,prior,1.0
1,2.0,28985.0,202279,prior,2.0
2,2.0,9327.0,202279,prior,3.0
3,2.0,45918.0,202279,prior,4.0
4,2.0,30035.0,202279,prior,5.0


In [80]:
dfC=dfB.loc[dfB['eval_set']!='test'].reset_index()

In [81]:
dfC.shape

(33819106, 6)

In [82]:
dfC['avg_add_to_cart_order']=dfC.groupby(['product_id', 'user_id'])['add_to_cart_order'].transform('mean')

In [84]:
del dfA, dfB; gc.collect()

61

In [85]:
dfC.head()

Unnamed: 0,index,order_id,product_id,user_id,eval_set,add_to_cart_order,avg_add_to_cart_order
0,0,2.0,33120.0,202279,prior,1.0,1.833333
1,1,2.0,28985.0,202279,prior,2.0,3.2
2,2,2.0,9327.0,202279,prior,3.0,3.0
3,3,2.0,45918.0,202279,prior,4.0,4.8
4,4,2.0,30035.0,202279,prior,5.0,4.666667


In [86]:
dfC.loc[(dfC['user_id']==202279) & (dfC['product_id']==28985)]

Unnamed: 0,index,order_id,product_id,user_id,eval_set,add_to_cart_order,avg_add_to_cart_order
1,1,2.0,28985.0,202279,prior,2.0,3.2
1254955,1254955,132412.0,28985.0,202279,prior,7.0,3.2
14214768,14214768,1500071.0,28985.0,202279,prior,1.0,3.2
26635933,26635933,2808715.0,28985.0,202279,prior,2.0,3.2
27453972,27453972,2894949.0,28985.0,202279,prior,4.0,3.2


In [87]:
dfC.shape

(33819106, 7)

In [88]:
dfC['order_count']=dfC.groupby('order_id')['product_id'].transform('count')

In [89]:
dfC.head()

Unnamed: 0,index,order_id,product_id,user_id,eval_set,add_to_cart_order,avg_add_to_cart_order,order_count
0,0,2.0,33120.0,202279,prior,1.0,1.833333,9.0
1,1,2.0,28985.0,202279,prior,2.0,3.2,9.0
2,2,2.0,9327.0,202279,prior,3.0,3.0,9.0
3,3,2.0,45918.0,202279,prior,4.0,4.8,9.0
4,4,2.0,30035.0,202279,prior,5.0,4.666667,9.0


In [90]:
dfC.drop(columns='index', inplace=True)

In [None]:
# Named aggregation; the dict-of-renames form is rejected by pandas 1.0 and later
grouped=dfC.groupby('user_id')['order_id'].agg(num_orders='nunique', num_products='count').reset_index()

In [92]:
grouped.head()

Unnamed: 0,user_id,num_orders,num_products
0,1,11.0,70
1,2,15.0,226
2,3,12.0,88
3,4,5.0,18
4,5,5.0,46


In [93]:
grouped['avg_order_size']=grouped['num_products']/grouped['num_orders']
grouped.head()

Unnamed: 0,user_id,num_orders,num_products,avg_order_size
0,1,11.0,70,6.363636
1,2,15.0,226,15.066667
2,3,12.0,88,7.333333
3,4,5.0,18,3.6
4,5,5.0,46,9.2


In [94]:
dfD=pd.merge(dfC, grouped, how='left', on=['user_id'])

In [95]:
dfD.head()

Unnamed: 0,order_id,product_id,user_id,eval_set,add_to_cart_order,avg_add_to_cart_order,order_count,num_orders,num_products,avg_order_size
0,2.0,33120.0,202279,prior,1.0,1.833333,9.0,9.0,100,11.111111
1,2.0,28985.0,202279,prior,2.0,3.2,9.0,9.0,100,11.111111
2,2.0,9327.0,202279,prior,3.0,3.0,9.0,9.0,100,11.111111
3,2.0,45918.0,202279,prior,4.0,4.8,9.0,9.0,100,11.111111
4,2.0,30035.0,202279,prior,5.0,4.666667,9.0,9.0,100,11.111111


In [None]:
# Save
# dfD.to_csv('master_addl_features_addtocart_ordersize', index=False)

In [99]:
del grouped; gc.collect()

294

In [106]:
del dfLast2, dfLast2rev, dfLast3, dfLast3rev; gc.collect()

366

In [None]:
# Read in 
df8=pd.read_csv('master_features_prior_and_train_incl_boughtornot_v1.csv',index_col=0)

In [4]:
df8.shape

(13863746, 44)

In [5]:
# Read in
df9=pd.read_csv('master_products_by_user_all_orders_with_last3_orders_flagged')

In [6]:
df9.shape

(13307953, 5)

In [7]:
df10=pd.merge(df8, df9, how='left', on=['product_id', 'user_id'])

In [8]:
del df8, df9; gc.collect()

21

In [9]:
df10.shape

(13863746, 47)
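The unchanged row count above (13,863,746 before and after) is what a left merge should produce when the right-hand table has one row per key. A minimal sketch of checking that explicitly is shown below. The toy frames are hypothetical; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Hypothetical feature rows and per-pair flags; (1, 500) has no flag record.
features = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "product_id": [196, 500, 196],
    "bought":     [1, 0, 0],
})
flags = pd.DataFrame({
    "user_id":    [1, 2],
    "product_id": [196, 196],
    "last_order": [1, 0],
})

# validate="many_to_one" raises MergeError if the right-hand keys are not
# unique, which is what would otherwise duplicate rows on the left.
merged = features.merge(
    flags, how="left", on=["product_id", "user_id"],
    validate="many_to_one", indicator=True,
)

assert len(merged) == len(features)  # a left join must not add rows

# Rows that found no match on the right end up with NaN in the new columns.
unmatched = int((merged["_merge"] == "left_only").sum())
```

Counting `left_only` rows up front shows how many NaNs the merge introduced before any later filtering or imputation.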

In [10]:
df10.columns

Index(['order_id', 'product_id', 'product_name', 'user_id', 'eval_set',
       'reordered', 'reorder_probability', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order',
       'cust_reorder_propensity', 'bought', 'department_id_1',
       'department_id_2', 'department_id_3', 'department_id_4',
       'department_id_5', 'department_id_6', 'department_id_7',
       'department_id_8', 'department_id_9', 'department_id_10',
       'department_id_11', 'department_id_12', 'department_id_13',
       'department_id_14', 'department_id_15', 'department_id_16',
       'department_id_17', 'department_id_18', 'department_id_19',
       'department_id_20', 'department_id_21', 'aisle_id', 'order_dow_0',
       'order_dow_1', 'order_dow_2', 'order_dow_3', 'order_dow_4',
       'order_dow_5', 'order_dow_6', 'product_reorder_propensity_by_user',
       'in_last_order', 'last_order', '2nd_last_order', '3rd_last_order'],
      dtype='object')

In [11]:
# Reducing dataset to those orders/products in last 3 orders 
df10=df10.loc[(df10['last_order']==1) | (df10['2nd_last_order']==1) | (df10['3rd_last_order']==1)]

In [12]:
df10.shape

(4925215, 47)

In [None]:
# adding the new features in 

In [13]:
dfD=pd.read_csv('master_addl_features_addtocart_ordersize')

In [14]:
dfD.columns

Index(['order_id', 'product_id', 'user_id', 'eval_set', 'add_to_cart_order',
       'avg_add_to_cart_order', 'order_count', 'num_products', 'num_orders',
       'avg_order_size'],
      dtype='object')

In [15]:
dfD.head(1)

Unnamed: 0,order_id,product_id,user_id,eval_set,add_to_cart_order,avg_add_to_cart_order,order_count,num_products,num_orders,avg_order_size
0,2.0,33120.0,202279,prior,1.0,1.833333,9,100,9.0,11.111111


In [16]:
dfD=dfD[['order_id', 'product_id', 'add_to_cart_order', 'avg_add_to_cart_order', 'order_count',
       'num_products', 'num_orders', 'avg_order_size']]

In [17]:
df11=pd.merge(df10, dfD, how='left', on=['order_id','product_id'])

In [18]:
df11.shape

(4925215, 53)

In [None]:
# Save
# df11.to_csv('master_features_prior_and_train_incl_boughtornot_v2.csv', index=False)

In [None]:
# adding in avg elapsed time b/w product purchases feature 

In [19]:
df12=df11

In [20]:
del df11; gc.collect()

12

In [21]:
df12.head()

Unnamed: 0,order_id,product_id,product_name,user_id,eval_set,reordered,reorder_probability,order_number,order_dow,order_hour_of_day,...,in_last_order,last_order,2nd_last_order,3rd_last_order,add_to_cart_order,avg_add_to_cart_order,order_count,num_products,num_orders,avg_order_size
0,1187899.0,196.0,Soda,1,train,1.0,0.777843,11,4,8,...,1.0,1.0,1.0,1.0,1.0,1.363636,11,70,11.0,6.363636
1,487368.0,196.0,Soda,15,prior,1.0,0.777843,22,1,10,...,1.0,1.0,1.0,0.0,1.0,2.2,2,72,22.0,3.272727
2,532817.0,196.0,Soda,19,prior,1.0,0.777843,7,4,17,...,0.0,0.0,0.0,1.0,1.0,6.333333,42,204,9.0,22.666667
3,2187180.0,196.0,Soda,43,prior,1.0,0.777843,9,4,12,...,0.0,0.0,0.0,1.0,1.0,5.0,19,180,12.0,15.0
4,2757217.0,196.0,Soda,67,train,1.0,0.777843,25,0,11,...,0.0,0.0,1.0,1.0,1.0,1.4,3,84,25.0,3.36


In [22]:
# Read in 
df13=pd.read_csv('master_elapsedavgordertime_by_product_id_ONLY')

In [23]:
df14=pd.merge(df12, df13, how='left', on=['product_id'])

In [24]:
df14.shape

(4925215, 54)

In [25]:
del df12, df13

In [None]:
# Save/read in as needed 
# df14.to_csv('master_features_prior_and_train_incl_boughtornot_v3.csv', index=False)

In [None]:
# Handling NaNs

In [26]:
df14.isnull().sum()

order_id                                 0
product_id                               0
product_name                             0
user_id                                  0
eval_set                                 0
reordered                                0
reorder_probability                      0
order_number                             0
order_dow                                0
order_hour_of_day                        0
days_since_prior_order                   0
cust_reorder_propensity                  0
bought                                   0
department_id_1                          0
department_id_2                          0
department_id_3                          0
department_id_4                          0
department_id_5                          0
department_id_6                          0
department_id_7                          0
department_id_8                          0
department_id_9                          0
department_id_10                         0
department_

In [27]:
df14.elapsed_avg.max()

356.0

In [28]:
df14.elapsed_avg.min()

0.0

In [29]:
# Fill missing values with 365, a sentinel above the observed maximum of 356
df14['elapsed_avg']=df14['elapsed_avg'].fillna(365)

In [30]:
df14.isnull().sum()

order_id                              0
product_id                            0
product_name                          0
user_id                               0
eval_set                              0
reordered                             0
reorder_probability                   0
order_number                          0
order_dow                             0
order_hour_of_day                     0
days_since_prior_order                0
cust_reorder_propensity               0
bought                                0
department_id_1                       0
department_id_2                       0
department_id_3                       0
department_id_4                       0
department_id_5                       0
department_id_6                       0
department_id_7                       0
department_id_8                       0
department_id_9                       0
department_id_10                      0
department_id_11                      0
department_id_12                      0


In [None]:
# Save/read in as needed 
# df14.to_csv('master_features_prior_and_train_incl_boughtornot_v3.csv')
# df8=pd.read_csv('master_features_prior_and_train_incl_boughtornot_v3.csv')

In [33]:
# df14 is the final core dataset; now we move on to preparing the training and test sets 

In [None]:
df8=df14
del  df14
gc.collect()

In [35]:
df8.shape

(4925215, 54)

In [37]:
featureset3=['reordered', 'reorder_probability', 'order_number',
       'order_hour_of_day', 'days_since_prior_order',
       'cust_reorder_propensity', 'department_id_1',
       'department_id_2', 'department_id_3', 'department_id_4',
       'department_id_5', 'department_id_6', 'department_id_7',
       'department_id_8', 'department_id_9', 'department_id_10',
       'department_id_11', 'department_id_12', 'department_id_13',
       'department_id_14', 'department_id_15', 'department_id_16',
       'department_id_17', 'department_id_18', 'department_id_19',
       'department_id_20', 'department_id_21','product_reorder_propensity_by_user',
       'last_order', '2nd_last_order', '3rd_last_order', 'avg_add_to_cart_order', 'avg_order_size', 'elapsed_avg']

In [38]:
# Note: dropped aisle_id from feature set (among others)
X=df8[featureset3].values
#X = df8.drop(drop_cols, axis=1).values
y = df8['bought'].values

In [40]:
#Splitting training and testing data... 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20)

In [41]:
len(X_test)

985043

Modeling....
Experimented with a variety of models (Decision Tree, Naive Bayes, etc.). Ultimately, I used a Random Forest classifier for the final submission. The classifier predicted which products from a user's last three orders would make it into their final basket. F1 score was used to evaluate models, but the internal F1 scores here did not always align precisely with the competition's scoring. Thus, I relied more on the competition's scoring mechanism for parameter tuning.
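One plausible reason for the mismatch is the level at which F1 is averaged. The sketch below assumes the competition scored the mean of per-order F1 values (my understanding of the metric, not something confirmed in this notebook), whereas `classification_report` pools all rows. The toy data and the helper `f1` are hypothetical.

```python
import pandas as pd

def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Hypothetical candidate rows: one per (order, product) pair.
rows = pd.DataFrame({
    "order_id": [1, 1, 1, 1, 2, 2],
    "truth":    [1, 1, 1, 0, 1, 0],
    "pred":     [1, 1, 1, 0, 0, 1],
})
rows["tp"] = (rows["truth"] == 1) & (rows["pred"] == 1)
rows["fp"] = (rows["truth"] == 0) & (rows["pred"] == 1)
rows["fn"] = (rows["truth"] == 1) & (rows["pred"] == 0)

# Pooled F1: every row counts equally, as in classification_report.
pooled_f1 = f1(rows["tp"].sum(), rows["fp"].sum(), rows["fn"].sum())

# Mean per-order F1: every order counts equally (assumed competition metric).
counts = rows.groupby("order_id")[["tp", "fp", "fn"]].sum()
per_order = [f1(tp, fp, fn) for tp, fp, fn in counts.itertuples(index=False)]
mean_order_f1 = sum(per_order) / len(per_order)
```

Here the pooled score is 0.75 while the per-order mean is 0.5, because the small order that was predicted entirely wrong weighs as much as the large order that was predicted perfectly.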

In [46]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

infoGain_clf = DecisionTreeClassifier(criterion='entropy', class_weight='balanced')
infoGain_clf.fit (X_train, y_train) 
predicted = infoGain_clf.predict(X_test)
print(classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.99      0.99      0.99    860177
          1       0.91      0.91      0.91    124866

avg / total       0.98      0.98      0.98    985043



In [47]:
# Cross validate 
from sklearn.model_selection import cross_val_score
scores = cross_val_score(infoGain_clf, X, y, cv=5)
print(scores)
print(scores.mean())

[ 0.94134069  0.97277682  0.97943643  0.97978565  0.98387378]
0.971442676621


In [48]:
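# Experiment with lower classification thresholds to boost recall
# (a fully grown tree tends to output probabilities of 0 or 1, which is likely why the report below is unchanged)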
y_test_probabilities= infoGain_clf.predict_proba(X_test)
y_predicted_high_recall = y_test_probabilities[:,1]>0.40
print (classification_report(y_test, y_predicted_high_recall))

             precision    recall  f1-score   support

          0       0.99      0.99      0.99    860177
          1       0.91      0.91      0.91    124866

avg / total       0.98      0.98      0.98    985043



In [49]:
importances = infoGain_clf.feature_importances_
list(zip(featureset3,importances))

[('reordered', 0.33842022374037106),
 ('reorder_probability', 0.007926402154797705),
 ('order_number', 0.3990679469166401),
 ('order_hour_of_day', 0.0059569200353608819),
 ('days_since_prior_order', 0.0056338563932641535),
 ('cust_reorder_propensity', 0.0099138879741721506),
 ('department_id_1', 0.00043001476787072468),
 ('department_id_2', 2.7224831635541749e-06),
 ('department_id_3', 0.0004282100423332214),
 ('department_id_4', 0.00070947952878459863),
 ('department_id_5', 6.4529449249652837e-05),
 ('department_id_6', 5.3405092511311479e-05),
 ('department_id_7', 0.0005963637239594714),
 ('department_id_8', 3.8443279043575462e-05),
 ('department_id_9', 0.00015565598134049423),
 ('department_id_10', 9.4506811589607493e-06),
 ('department_id_11', 5.5675904821559777e-05),
 ('department_id_12', 0.00021635132560019743),
 ('department_id_13', 0.00014097113763138504),
 ('department_id_14', 0.00018754613610071316),
 ('department_id_15', 0.00017369081284980729),
 ('department_id_16', 0.000695

In [55]:
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
# defaults = 100 trees, max depth 3, learning rate 0.1
GradientBoosting_clf=GradientBoostingClassifier()
GradientBoosting_clf.fit (X_train, y_train) 
predicted = GradientBoosting_clf.predict(X_test)
print (classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.90      0.99      0.94    860177
          1       0.74      0.25      0.38    124866

avg / total       0.88      0.89      0.87    985043



In [51]:
print ("Accuracy on training set: {:.3f}".format(GradientBoosting_clf.score(X_train, y_train)))
print ("Accuracy on test set: {:.3f}".format(GradientBoosting_clf.score(X_test, y_test)))

Accuracy on training set: 0.894
Accuracy on test set: 0.894


In [52]:
importances = GradientBoosting_clf.feature_importances_
list(zip(featureset3,importances))

[('reordered', 0.083815040222933956),
 ('reorder_probability', 0.0),
 ('order_number', 0.29945442044030485),
 ('order_hour_of_day', 0.0),
 ('days_since_prior_order', 0.049672948208404996),
 ('cust_reorder_propensity', 0.0045496107257348766),
 ('department_id_1', 0.0),
 ('department_id_2', 0.0),
 ('department_id_3', 0.0),
 ('department_id_4', 0.0017062644663406483),
 ('department_id_5', 0.0),
 ('department_id_6', 0.0),
 ('department_id_7', 0.0),
 ('department_id_8', 0.0),
 ('department_id_9', 0.0),
 ('department_id_10', 0.0),
 ('department_id_11', 0.0),
 ('department_id_12', 0.0),
 ('department_id_13', 0.0),
 ('department_id_14', 0.0),
 ('department_id_15', 0.0),
 ('department_id_16', 0.0),
 ('department_id_17', 0.0),
 ('department_id_18', 0.0),
 ('department_id_19', 0.0),
 ('department_id_20', 0.0),
 ('department_id_21', 0.0015974366535915488),
 ('product_reorder_propensity_by_user', 0.35289471286434554),
 ('last_order', 0.055689386716361193),
 ('2nd_last_order', 0.061866991754224616),

In [53]:
# Some tuning on Gradient Boosting model 
# Only non-default settings are passed; presort and loss='deviance' are no longer accepted by recent scikit-learn
GradientBoosting_clf=GradientBoostingClassifier(learning_rate=0.1, max_depth=2,
              max_features='sqrt', min_samples_leaf=100, min_samples_split=35000,
              n_estimators=60, random_state=10, subsample=0.8)
GradientBoosting_clf.fit (X_train, y_train) 
predicted = GradientBoosting_clf.predict(X_test)
print(classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.89      0.99      0.94    860177
          1       0.69      0.12      0.20    124866

avg / total       0.86      0.88      0.84    985043



In [56]:
# Light GBM Model 

import lightgbm as lgb

params = {'boosting_type': 'gbdt',
          'max_depth' : 10,
          'objective': 'binary', 
          'nthread': 5, 
          'silent': True,
          'num_leaves': 96, 
          'learning_rate': 0.1, 
          'max_bin': 250, 
          'subsample_for_bin': 200,
          'subsample': 1, 
          'subsample_freq': 1, 
          'colsample_bytree': 0.8, 
          'reg_alpha': 5, 
          'reg_lambda': 10,
          'min_split_gain': 0.5, 
          'min_child_weight': 1, 
          'min_child_samples': 5, 
          'scale_pos_weight': 1,
          'num_class' : 1,
          'metric' : 'auc', 
          'task': 'train',
          'feature_fraction': 0.9,
          'bagging_fraction': 0.95,
          'bagging_freq': 5
          }

lgb_clf = lgb.LGBMClassifier(boosting_type= 'gbdt', 
          objective = 'binary', 
          nthread = 5, 
          silent = True,
          min_data_in_leaf= 200,                   
          max_depth = params['max_depth'],
          max_bin = params['max_bin'], 
          subsample_for_bin = params['subsample_for_bin'],
          subsample = params['subsample'], 
          subsample_freq = params['subsample_freq'], 
          min_split_gain = params['min_split_gain'], 
          min_child_weight = params['min_child_weight'], 
          min_child_samples = params['min_child_samples'], 
          scale_pos_weight = params['scale_pos_weight'])
lgb_clf.fit (X_train, y_train) 
predicted = lgb_clf.predict(X_test)
print (classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.91      0.99      0.95    860177
          1       0.82      0.31      0.45    124866

avg / total       0.90      0.90      0.89    985043



In [58]:
importances = lgb_clf.feature_importances_
from operator import itemgetter
sorted(list(zip(featureset3,importances)),key=itemgetter(1), reverse=True )

[('order_number', 145),
 ('product_reorder_propensity_by_user', 105),
 ('3rd_last_order', 17),
 ('2nd_last_order', 11),
 ('reordered', 10),
 ('last_order', 9),
 ('elapsed_avg', 3),
 ('reorder_probability', 0),
 ('order_hour_of_day', 0),
 ('days_since_prior_order', 0),
 ('cust_reorder_propensity', 0),
 ('department_id_1', 0),
 ('department_id_2', 0),
 ('department_id_3', 0),
 ('department_id_4', 0),
 ('department_id_5', 0),
 ('department_id_6', 0),
 ('department_id_7', 0),
 ('department_id_8', 0),
 ('department_id_9', 0),
 ('department_id_10', 0),
 ('department_id_11', 0),
 ('department_id_12', 0),
 ('department_id_13', 0),
 ('department_id_14', 0),
 ('department_id_15', 0),
 ('department_id_16', 0),
 ('department_id_17', 0),
 ('department_id_18', 0),
 ('department_id_19', 0),
 ('department_id_20', 0),
 ('department_id_21', 0),
 ('avg_add_to_cart_order', 0),
 ('avg_order_size', 0)]

In [59]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
bayes_clf = GaussianNB()
bayes_clf.fit (X_train, y_train)
predicted = bayes_clf.predict(X_test)
print (classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.98      0.60      0.75    860177
          1       0.25      0.91      0.39    124866

avg / total       0.89      0.64      0.70    985043



In [None]:
#Random Forest Classifier Model 
from sklearn.ensemble import RandomForestClassifier
RandomForest_clf = RandomForestClassifier( max_depth=9, min_samples_leaf=100, max_features='sqrt',
                                          class_weight='balanced_subsample')
RandomForest_clf.fit(X_train, y_train)

In [71]:
# Mean squared error; with 0/1 labels this equals the misclassification rate
print ("Mean squared error: %.2f" % np.mean((RandomForest_clf.predict(X_test) - y_test) ** 2))

# For a classifier, score() returns the mean accuracy on the given test data and labels
print ('Accuracy: %.2f' % RandomForest_clf.score(X_test, y_test))
print (classification_report(y_test,RandomForest_clf.predict(X_test)))

# Added
print ("predicted:", RandomForest_clf.predict(X_test[-10:,:]))
print ("truth:", y_test[-10:])

Mean squared error: 0.30
Accuracy: 0.70
             precision    recall  f1-score   support

          0       0.99      0.66      0.79    860177
          1       0.29      0.95      0.44    124866

avg / total       0.90      0.70      0.75    985043

predicted: [1 1 1 1 1 0 1 0 1 1]
truth: [0 0 0 0 0 0 1 0 1 0]
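The first two numbers printed above are complements of each other: with 0/1 labels, the mean squared error of hard predictions is the share of rows that disagree, and a classifier's score() is the share that agree. A minimal sketch of that relationship, reusing only the ten-row spot check printed above (the variable names are mine):

```python
import numpy as np

# The ten predictions and labels from the spot check above
predicted = np.array([1, 1, 1, 1, 1, 0, 1, 0, 1, 1])
truth = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

mse = np.mean((predicted - truth) ** 2)   # fraction of rows that disagree
accuracy = np.mean(predicted == truth)    # fraction of rows that agree

print("mse: %.2f, accuracy: %.2f, sum: %.2f" % (mse, accuracy, mse + accuracy))
```

On these ten rows six predictions are wrong, all of them false positives, which matches the high-recall, low-precision pattern in the classification report.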


In [77]:
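# predict() uses a 0.5 cutoff; lowering the cutoff on the class-1 probability to 0.35
# trades precision for recall on the "reordered" class
y_test_probabilities = RandomForest_clf.predict_proba(X_test)
y_predicted_high_recall = y_test_probabilities[:,1] > 0.35
print (classification_report(y_test, y_predicted_high_recall))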

             precision    recall  f1-score   support

          0       1.00      0.57      0.73    860177
          1       0.25      1.00      0.40    124866

avg / total       0.91      0.62      0.69    985043
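The 0.35 cutoff above was set by hand. One way to choose it more systematically is to sweep a few candidate cutoffs and keep the one with the best F1 on the positive class. The sketch below is an illustration only: the helper names and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def f1_at_threshold(y_true, proba_positive, threshold):
    """F1 for the positive class when predicting 1 where proba > threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(proba_positive) > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, proba_positive, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    scores = [f1_at_threshold(y_true, proba_positive, t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy data, made up for illustration
toy_truth = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_proba = np.array([0.10, 0.40, 0.35, 0.55, 0.80, 0.65, 0.20, 0.90])
threshold, f1 = best_threshold(toy_truth, toy_proba, [0.3, 0.5, 0.6, 0.7])
print(threshold, round(f1, 3))
```

In the notebook this could be called as best_threshold(y_test, y_test_probabilities[:, 1], candidates). The cutoff is best chosen on held-out data that is not also used to report the final score.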



Final Submission...
- Take the same core basket as before
- Merge it with the test orders, applying the new episodic data from those orders
- Run predictions using the classifier fitted on the prior/train data
- Append the predictions and submit

In [79]:
# Reread 
dfZ=pd.read_csv('master_final_test_baskets_v4')

In [80]:
dfZ.shape

(4833292, 40)

In [81]:
dfZ.columns

Index(['order_id_x', 'user_id', 'product_id', 'reordered',
       'reorder_probability', 'order_number', 'order_dow_0', 'order_dow_1',
       'order_dow_2', 'order_dow_3', 'order_dow_4', 'order_dow_5',
       'order_dow_6', 'order_hour_of_day', 'days_since_prior_order',
       'department_id_1', 'department_id_2', 'department_id_3',
       'department_id_4', 'department_id_5', 'department_id_6',
       'department_id_7', 'department_id_8', 'department_id_9',
       'department_id_10', 'department_id_11', 'department_id_12',
       'department_id_13', 'department_id_14', 'department_id_15',
       'department_id_16', 'department_id_17', 'department_id_18',
       'department_id_19', 'department_id_20', 'department_id_21',
       'cust_reorder_propensity', 'product_reorder_propensity_by_user',
       'order_id_y', 'in_last_order'],
      dtype='object')

In [82]:
# Read in 
df11=pd.read_csv('master_features_prior_and_train_incl_boughtornot_v2.csv')

In [83]:
df11.shape

(4925215, 53)

In [84]:
# Keep only the merge keys plus the recency flags and order-size features missing from dfZ
df11 = df11[['order_id', 'product_id', 'user_id',
             'last_order', '2nd_last_order', '3rd_last_order',
             'avg_add_to_cart_order', 'num_products', 'num_orders', 'avg_order_size']]

In [85]:
dfZ=pd.merge(dfZ, df11, how='left', on=['product_id', 'user_id'])

In [86]:
del df11; gc.collect()

13629

In [87]:
# Remove products not ordered in user's last 3 orders 
dfZ=dfZ.loc[(dfZ['last_order']==1) | (dfZ['2nd_last_order']==1) | (dfZ['3rd_last_order']==1)]

In [88]:
dfZ.shape 

(1793764, 48)
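The merge-then-filter step above depends on two things: the left merge must not duplicate basket rows, and products with no recent purchase (or no match at all) must drop out. A minimal sketch of both checks on toy frames follows. The frames are made-up stand-ins for dfZ and df11, and the validate argument is an extra safeguard I am adding, not something the original cells use.

```python
import pandas as pd

# Toy stand-ins for dfZ (candidate baskets) and df11 (recency flags); values are made up
baskets = pd.DataFrame({"user_id": [1, 1, 2, 2],
                        "product_id": [10, 11, 10, 12]})
flags = pd.DataFrame({"user_id": [1, 1, 2],
                      "product_id": [10, 11, 10],
                      "last_order": [1, 0, 0],
                      "2nd_last_order": [0, 0, 1],
                      "3rd_last_order": [0, 0, 0]})

# validate raises MergeError if a (user, product) key repeats on the right side,
# which would otherwise silently duplicate basket rows
merged = pd.merge(baskets, flags, how="left", on=["product_id", "user_id"],
                  validate="many_to_one")
assert len(merged) == len(baskets)

recent_cols = ["last_order", "2nd_last_order", "3rd_last_order"]
# Unmatched rows carry NaN flags; NaN == 1 is False, so they are dropped as well
recent = merged.loc[merged[recent_cols].eq(1).any(axis=1)]
print(recent)
```

Here the pair (1, 11) is dropped because none of its flags are set, and (2, 12) is dropped because it has no match in the flags table.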

In [89]:
# Save as needed 
# dfZ.to_csv('master_final_test_baskets_v5_addl_features_appended', index=False)

In [None]:
# appends elapsed_avg to test set 

In [90]:
# Read in 
dfA=pd.read_csv('master_elapsedavgordertime_by_product_id_ONLY')

In [91]:
# Read in as needed 
# dfX=pd.read_csv('master_final_test_baskets_v5_addl_features_appended')
dfX=dfZ

In [92]:
dfY=pd.merge(dfX, dfA, on='product_id', how='left')

In [93]:
dfY.shape

(1793764, 49)

In [94]:
dfY.elapsed_avg.isnull().sum()

2900

In [95]:
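# Products with no elapsed_avg get a 365-day placeholder. Assign the result back:
# fillna(inplace=True) on an attribute-accessed column may not update dfY in newer pandas.
dfY['elapsed_avg'] = dfY['elapsed_avg'].fillna(365)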

In [96]:
dfY.elapsed_avg.isnull().sum()

0
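The count of missing elapsed_avg values above can also be read straight off the merge. The sketch below uses the indicator option of pd.merge to count basket rows whose product has no entry in the lookup table, then fills them with the same 365-day placeholder. The toy frames are made up and stand in for dfX and dfA.

```python
import pandas as pd

# Toy stand-ins for dfX (baskets) and dfA (per-product elapsed_avg); values are made up
baskets = pd.DataFrame({"product_id": [10, 11, 12]})
elapsed = pd.DataFrame({"product_id": [10, 11], "elapsed_avg": [7.5, 30.0]})

# indicator=True adds a _merge column marking rows found only on the left side
merged = pd.merge(baskets, elapsed, on="product_id", how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Fill the gaps with the placeholder and assign the column back
merged["elapsed_avg"] = merged["elapsed_avg"].fillna(365)
merged = merged.drop(columns="_merge")
print(n_unmatched, merged["elapsed_avg"].tolist())
```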

In [97]:
# Save
# dfY.to_csv('master_final_test_baskets_v6_addl_features_appended', index=False)

In [98]:
# This is the final dataset for use in the submission; aligns with the training dataset 
dfZ=dfY
del dfA, dfY

In [None]:
# Read in, as needed
# dfZ=pd.read_csv('master_final_test_baskets_v6_addl_features_appended')

In [None]:
Apply Model to Test (Submission) Dataset...

In [99]:
X_final=dfZ[featureset3].values

In [101]:
# Apply RF Model 
predicted = RandomForest_clf.predict(X_final) 

In [102]:
# Append predicted column
dfZ['predicted']= predicted

In [103]:
# Selects only the predicted as "bought" items
dfZ=dfZ.loc[dfZ['predicted']==1]

In [104]:
def products_concat(vet):
    """Turn a user's predicted product IDs (a pd.Series) into the space-delimited
    string required for submission; return 'None' if there are none."""
    # Keep valid (positive) IDs and cast to int so that 196.0 becomes "196"
    products = [str(int(prod)) for prod in vet if prod > 0]
    return ' '.join(products) if products else 'None'

In [105]:
# Concat all product by user in space delimited column 
user_products = dfZ.groupby('user_id').product_id.apply(products_concat)
user_products.head()

user_id
3     47766 17668 18599 21903 23650 24810 32402 3919...
4                                                 35469
6                                           21903 38293
11    8309 14947 20383 27959 28465 33572 34658 35948...
12    13176 14992 21616 5746 7076 8239 10863 20350 2...
Name: product_id, dtype: object

In [None]:
# Retrieves the test set 
all_products_ordered_deduped_by_userID = pd.read_csv('master_all_products_ordered_deduped_by_usedID.csv', index_col=0) 

In [107]:
# Build the test set: one row per test order, with its user_id and the space-delimited
# products predicted for that user (NaN where nothing was predicted).
test_set = all_products_ordered_deduped_by_userID.loc[all_products_ordered_deduped_by_userID.eval_set == 'test'][['order_id', 'user_id']]
test_set = test_set.join(user_products, on='user_id')
test_set.head()

Unnamed: 0,order_id,user_id,product_id
33819106,2774568.0,3,47766 17668 18599 21903 23650 24810 32402 3919...
33819107,329954.0,4,35469
33819108,1528013.0,6,21903 38293
33819109,1376945.0,11,8309 14947 20383 27959 28465 33572 34658 35948...
33819110,1356845.0,12,13176 14992 21616 5746 7076 8239 10863 20350 2...


In [108]:
# Users with no predicted products get the literal string 'None'
test_set['product_id'] = test_set['product_id'].fillna('None')
test_set.head(50)

Unnamed: 0,order_id,user_id,product_id
33819106,2774568.0,3,47766 17668 18599 21903 23650 24810 32402 3919...
33819107,329954.0,4,35469
33819108,1528013.0,6,21903 38293
33819109,1376945.0,11,8309 14947 20383 27959 28465 33572 34658 35948...
33819110,1356845.0,12,13176 14992 21616 5746 7076 8239 10863 20350 2...
33819111,2161313.0,15,196 12427 10441 14715 27839 48142
33819112,1416320.0,16,24852 21137 21903 651 43014 4086 5134 17948 41...
33819113,1735923.0,19,196 2192 12108 15131 15599 17008 31487 33122 3...
33819114,1980631.0,20,9387 6184 13575 13914 22362 41400 46061
33819115,139655.0,22,13176 27845 22963 21903 17794 22935 35221 2496...


In [109]:
test_set.shape

(75000, 3)

In [110]:
#Only need these columns for the submission.
submission = pd.DataFrame({'order_id': test_set.order_id, 'products': test_set.product_id})
submission.tail()

Unnamed: 0,order_id,products
33894101,2728930.0,16797 21709 24852 49683 18926 17038 432 41177 ...
33894102,350108.0,21137 43961 42828 5646 49075 48287 30561 15649...
33894103,1043943.0,11520 23029 42623
33894104,2821651.0,13176 25133 27845 27344 27966 33754 47209 2203...
33894105,803273.0,13176 27845 24838 21137 43961 31717 45007 2099...


In [111]:
# order_id was read in as float (e.g. 2774568.0); the submission needs integers
submission.order_id = pd.to_numeric(submission.order_id, errors='coerce')
submission.order_id = submission.order_id.astype(int)
submission.head()

Unnamed: 0,order_id,products
33819106,2774568,47766 17668 18599 21903 23650 24810 32402 3919...
33819107,329954,35469
33819108,1528013,21903 38293
33819109,1376945,8309 14947 20383 27959 28465 33572 34658 35948...
33819110,1356845,13176 14992 21616 5746 7076 8239 10863 20350 2...


In [112]:
# Save the submission, then read it back to confirm what was written
submission.to_csv('master_submission82.csv', index=False)
submitted_data = pd.read_csv('master_submission82.csv') 
submitted_data.head()

Unnamed: 0,order_id,products
0,2774568,47766 17668 18599 21903 23650 24810 32402 3919...
1,329954,35469
2,1528013,21903 38293
3,1376945,8309 14947 20383 27959 28465 33572 34658 35948...
4,1356845,13176 14992 21616 5746 7076 8239 10863 20350 2...
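Before uploading, a few structural checks on the submission frame can catch formatting slips. The helper below is a hypothetical sketch (check_submission is my own name, not part of the notebook). It assumes the required format is one row per test order, an integer order_id column, and a products column holding either 'None' or space-delimited integer IDs.

```python
import pandas as pd

def _valid_products(value):
    """True for the literal 'None' or a non-empty run of space-delimited integers."""
    tokens = str(value).split()
    return value == "None" or (len(tokens) > 0 and all(t.isdigit() for t in tokens))

def check_submission(frame, expected_rows):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(frame.columns) != ["order_id", "products"]:
        return ["unexpected columns: %s" % list(frame.columns)]
    if len(frame) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(frame)))
    if frame["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if frame["products"].isnull().any():
        problems.append("missing products (should be the string 'None')")
    elif not frame["products"].map(_valid_products).all():
        problems.append("products must be 'None' or space-delimited integers")
    return problems

# Toy frames, made up for illustration
good = pd.DataFrame({"order_id": [1, 2], "products": ["196 12427", "None"]})
bad = pd.DataFrame({"order_id": [1, 1], "products": ["196 x", ""]})
print(check_submission(good, 2))
print(check_submission(bad, 2))
```

In the notebook the call would be check_submission(submission, 75000) on the in-memory frame. If the check is run on a file read back from disk instead, note that some pandas versions may parse the literal 'None' as a missing value; passing keep_default_na=False to pd.read_csv should avoid that.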


In [None]:
Notes and Kaggle scores from selected predictions 

GB - defaults  
Your submission scored 0.1478085, which is not an improvement of your best score.

RF (min_samples_leaf=100, n_estimators=15, max_features='sqrt')
Your submission scored 0.3539836, which is not an improvement of your best score. Keep trying!

RF
Your submission scored 0.3509929, which is not an improvement of your best score. 

RF, elapsed_avg added, no max_depth
Your submission scored 0.3577902, which is not an improvement of your best score. 

RF, max_depth=15
Your submission scored 0.3624220, which is an improvement of your previous score of 0.3592294.

RF, max_depth=14
Your submission scored 0.3653651, which is an improvement of your previous score of 0.3624220.

RF, max_depth=11
Your submission scored 0.3666596, which is an improvement of your previous score of 0.3653651. 

RF, max_depth=9
Your submission scored 0.3686281, which is an improvement of your previous score of 0.3666596.
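The Kaggle scores above are on a different scale from the row-level f1-score in the classification reports. As I understand the competition's evaluation, the leaderboard averages an F1 score computed per order, so the two are not directly comparable. The sketch below shows how such a per-order score could be estimated locally. The function names and toy data are hypothetical, and treating 'None' as an ordinary token is an assumption about how empty orders are scored.

```python
def order_f1(predicted, actual):
    """F1 between two space-delimited product strings for one order."""
    pred, act = set(predicted.split()), set(actual.split())
    hits = len(pred & act)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(act)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predicted_by_order, actual_by_order):
    """Average the per-order F1 over every order in actual_by_order."""
    scores = [order_f1(predicted_by_order[o], actual_by_order[o])
              for o in actual_by_order]
    return sum(scores) / len(scores)

# Toy orders, made up for illustration
toy_predicted = {1: "196 12427 10441", 2: "None"}
toy_actual = {1: "196 10441", 2: "None"}
print(round(mean_f1(toy_predicted, toy_actual), 3))
```

Order 1 scores 0.8 (two hits from three predictions, both actual items found) and order 2 scores 1.0, for a mean of 0.9. Running this on held-out train orders would allow max_depth settings to be compared without spending submissions.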
