# Machine Learning - Feature Creation

We will use all the prior order information to generate features, and use the train data to create the target variable.

From the exploratory analysis, we learned that the purchase pattern at the department level is fairly consistent, and that reorders resemble the orders users placed before. Also, reorders on average fall under 15 days for new clients, and the gap gradually decreases over time to 2-4 days. Hence, we will test the following features:
- product purchase frequency
- product purchase recency

### Import modules and data

In [1]:
import pandas as pd
import numpy as np
import gc
from pycaret.classification import *

In [2]:
# aisles = pd.read_csv('./data/aisles.csv')
# dept = pd.read_csv('./data/departments.csv')
orders = pd.read_csv('./data/orders.csv')
products = pd.read_csv('./data/products.csv')
orders_p = pd.read_csv('./data/order_products__prior.csv')
orders_tr = pd.read_csv('./data/order_products__train.csv')

In [3]:
prior_order = orders.query('eval_set == "prior"')
train_order = orders.query('eval_set == "train"')
test_order = orders.query('eval_set == "test"')

In [4]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [5]:
print(orders.shape)
display(orders.head())
print(orders_p.shape)
print(orders_tr.shape)
display(orders_p.head(), orders_tr.head())

(3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


(32434489, 4)
(1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


### Feature Creation

- last purchase: whether a particular product was purchased in the user's latest order
- last 3 purchases: number of times a product appears in the user's last 3 orders
- the average number of days between purchases of items in a given department
- how many times an item has been purchased by the user
- how often the item appears in the user's purchase history, as a proportion

##### Obtain a list of each user's last prior order

In [6]:
last_purchase = prior_order[prior_order['order_number'] == prior_order.groupby(
    ['user_id'])['order_number'].transform('max')]
last_purchase_list = last_purchase['order_id'].tolist()
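
A minimal sketch of the `transform('max')` filter used in the cell above, on made-up data (`toy_orders` is a hypothetical frame, not the Instacart file). `transform` returns a series aligned with the original rows, so the comparison keeps exactly the rows holding each user's highest `order_number`.

```python
import pandas as pd

# Hypothetical toy data: user 1 has three prior orders, user 2 has two.
toy_orders = pd.DataFrame({
    'order_id': [11, 12, 13, 21, 22],
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# For every row, the maximum order_number of that row's user.
max_per_user = toy_orders.groupby('user_id')['order_number'].transform('max')

# Keep only each user's latest order and collect the order ids.
toy_last = toy_orders[toy_orders['order_number'] == max_per_user]
toy_last_ids = toy_last['order_id'].tolist()
print(toy_last_ids)
```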

##### Generating last purchase feature

In [7]:
order_detail = orders_p.merge(prior_order[['order_id', 'user_id',
                            'order_number', 'days_since_prior_order']],
                              on=['order_id'], how='left')
order_detail.loc[order_detail['order_id'].isin(last_purchase_list),
                 'last_purchase'] = 1
order_detail.loc[~order_detail['order_id'].isin(last_purchase_list),
                 'last_purchase'] = 0
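
The two `.loc` assignments above could also be written as a single vectorised expression. The sketch below uses hypothetical toy data and assumes an integer 0/1 flag is acceptable (the cell above yields floats, 1.0 and 0.0).

```python
import pandas as pd

# Hypothetical toy data: order lines plus the ids of each user's last order.
toy_detail = pd.DataFrame({
    'order_id': [11, 13, 13, 22],
    'product_id': [5, 5, 7, 9],
})
toy_last_ids = [13, 22]

# True where the line belongs to a "last" order, cast to 0/1.
toy_detail['last_purchase'] = (
    toy_detail['order_id'].isin(toy_last_ids).astype(int)
)
print(toy_detail)
```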

##### Generating average days of department/item reordering

This feature captures a user's purchase pattern at the department level, since a user may switch between different brands of a product (e.g. snacks) across their purchase history.

Steps to get the average number of days between reorders:

- compute the cumulative sum of days since the prior order for each user's orders, and store it in a new column (cumsum)
- apply the order-level cumsum to the products within each order
- get the department id from the product id, and map it back to the cumsum of each order
- use groupby then .diff() at the user/department level to get the gap in days between consecutive orders containing items from the same department
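
The steps above can be sketched on a small example. All frames and values here are hypothetical toy data; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one user, four orders; the first order has no prior gap.
toy_orders = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'order_number': [1, 2, 3, 4],
    'days_since_prior_order': [np.nan, 10.0, 5.0, 7.0],
})

# Step 1: days elapsed since the user's first order (the NaN row stays NaN).
toy_orders['cumsum'] = (
    toy_orders.groupby('user_id')['days_since_prior_order'].cumsum()
)

# Steps 2-3: one row per (user, department, order) in which the department
# was bought, with the order-level cumsum mapped onto it.
toy_dept_orders = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'department_id': [4, 4, 7, 7, 7],
    'order_number': [2, 4, 2, 3, 4],
})
toy_dept_orders = toy_dept_orders.merge(
    toy_orders[['user_id', 'order_number', 'cumsum']],
    on=['user_id', 'order_number'], how='left')

# Step 4: gap in days between consecutive orders containing the same department.
toy_dept_orders['lag'] = toy_dept_orders.groupby(
    ['user_id', 'department_id'])['cumsum'].diff()
print(toy_dept_orders)
```

The first order of each user/department group has no earlier order to compare against, so its `lag` is NaN.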

In [8]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
ord_duration = orders[['user_id', 'order_number', 'days_since_prior_order']].copy()
# days elapsed since each user's first order; the first order has no prior gap and stays NaN
ord_duration['cumsum'] = ord_duration.groupby(['user_id'])['days_since_prior_order'].cumsum()
ord_duration = ord_duration.drop('days_since_prior_order', axis=1)
ord_duration = ord_duration.dropna()


In [10]:
gc.collect()
# .copy() avoids pandas' SettingWithCopyWarning when adding the dummy column
get_dept = order_detail[['product_id', 'user_id', 'order_number']].copy()
get_dept['dummy'] = 1  # dummy column to count items in the groupby below
get_dept = pd.merge(get_dept, products[['product_id', 'department_id']],
                    on='product_id', how='left')
# one row per user/department/order, with the number of items bought
get_dept = get_dept.groupby(['user_id', 'department_id',
                             'order_number']).agg({'dummy': 'count'})
get_dept = get_dept.reset_index()
get_dept = pd.merge(get_dept, ord_duration,
                    on=['user_id', 'order_number'], how='left')


In [11]:
get_dept.head()

Unnamed: 0,user_id,department_id,order_number,dummy,cumsum
0,1,4,2,1,15.0
1,1,4,5,4,93.0
2,1,7,1,1,
3,1,7,2,1,15.0
4,1,7,3,1,36.0


In [16]:
dept_lag = get_dept.copy()
dept_lag['lag'] = dept_lag.groupby(['user_id',
                                      'department_id'])['cumsum'].diff()

In [17]:
final_lag = dept_lag[['user_id', 'department_id',
                       'order_number', 'lag']]

Add lag features back to the order_detail dataframe.

In [18]:
order_detail = pd.merge(order_detail,
                        products[['product_id', 'department_id']],
                        on='product_id', how='left')
order_detail = pd.merge(order_detail, final_lag,
                        on=['user_id', 'department_id', 'order_number'], how='left')

In [19]:
order_detail.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,days_since_prior_order,last_purchase,department_id,lag
0,2,33120,1,1,202279,3,8.0,0.0,16,8.0
1,2,28985,2,1,202279,3,8.0,0.0,4,8.0
2,2,9327,3,0,202279,3,8.0,0.0,13,8.0
3,2,45918,4,1,202279,3,8.0,0.0,13,8.0
4,2,30035,5,0,202279,3,8.0,0.0,13,8.0


##### Generate item purchase count and the share of the user's purchase history in which the item appears

In [20]:
user_df = order_detail.groupby(['user_id', 'product_id']).agg({
    'order_number': 'max', 'last_purchase': 'max', 'reordered': 'sum', 'lag': 'mean'}).fillna(0)
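
A small sketch of what this aggregation produces, using hypothetical toy rows. Note that `'mean'` skips NaN lags, and the trailing `.fillna(0)` turns a group with no observed lag into 0, which a model cannot tell apart from a genuine 0-day gap; whether that matters is a modeling judgment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy order lines for one user and two products.
toy_detail = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'product_id': [196, 196, 196, 300],
    'order_number': [1, 2, 3, 2],
    'last_purchase': [0, 0, 1, 0],
    'reordered': [0, 1, 1, 0],
    'lag': [np.nan, 10.0, 6.0, np.nan],
})

# One row per (user, product) with a different aggregate per column.
toy_user_df = toy_detail.groupby(['user_id', 'product_id']).agg({
    'order_number': 'max',   # latest order containing the product
    'last_purchase': 'max',  # 1 if the product was in the last order
    'reordered': 'sum',      # number of repeat purchases
    'lag': 'mean',           # mean gap in days, ignoring NaN
}).fillna(0)
print(toy_user_df)
```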

In [21]:
user_df.head()

user_id,product_id,order_number,last_purchase,reordered,lag
1,196,10,1.0,9,20.125
1,10258,10,1.0,8,20.125
1,10326,5,0.0,0,78.0
1,12427,10,1.0,9,20.125
1,13032,10,1.0,2,80.5


In [22]:
user_df['product_appear'] = (user_df['reordered']+1)/user_df['order_number']
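
A worked example of this ratio. Since `reordered` is 0 on a product's first purchase and 1 on each later one, `reordered + 1` is the purchase count; the denominator is the highest order number in which the product appears, because `order_number` was aggregated with `max` in `user_df`. The helper name `product_appear` below is hypothetical, and the inputs are taken from the `user_df.head()` output shown earlier.

```python
def product_appear(reordered_sum, last_order_number):
    """Purchases of the product divided by the order number of its latest purchase."""
    return (reordered_sum + 1) / last_order_number


# user 1, product 10258: reordered 8 times, last bought in order 10
print(product_appear(8, 10))
# user 1, product 10326: never reordered, last bought in order 5
print(product_appear(0, 5))
```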

In [23]:
user_order_cnt = orders.groupby(['user_id', 'order_number']).agg({
    'days_since_prior_order': 'sum'})

In [24]:
user_df.head()

user_id,product_id,order_number,last_purchase,reordered,lag,product_appear
1,196,10,1.0,9,20.125,1.0
1,10258,10,1.0,8,20.125,0.9
1,10326,5,0.0,0,78.0,0.2
1,12427,10,1.0,9,20.125,1.0
1,13032,10,1.0,2,80.5,0.3


##### Generate consecutive purchase features for the last 3 orders, by user and by item

In [25]:
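# Take each user's last 4 orders: `orders` also contains the train/test order,
# so the inner merge with prior-order details below leaves the last 3 prior orders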
user_order_cnt_1 = user_order_cnt.groupby(level=0).apply(lambda df: df[-4:])
user_order_cnt_1.index = user_order_cnt_1.index.droplevel(0)

In [26]:
last_three = user_order_cnt_1.reset_index().drop(
    'days_since_prior_order', axis=1)

In [27]:
product_last_three = pd.merge(last_three, order_detail,
                              on=['user_id', 'order_number'], how='inner')
product_cnt = product_last_three.groupby(['user_id', 'product_id']).agg(
    {'product_id': 'count'})
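
A minimal sketch of the same idea on hypothetical toy data. It swaps in `groupby(...).tail(4)` for the `apply`/`droplevel` pair and `size()` for the dict aggregation; these are alternatives for illustration, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: user 1 has five prior orders and one train order.
toy_orders = pd.DataFrame({
    'user_id': [1] * 6,
    'order_number': [1, 2, 3, 4, 5, 6],
    'eval_set': ['prior'] * 5 + ['train'],
})
# Product lines exist only for prior orders.
toy_detail = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1],
    'order_number': [1, 3, 4, 5, 5, 2],
    'product_id': [196, 196, 196, 196, 300, 300],
})

# Last 4 orders per user (the train order plus the last 3 prior orders).
toy_last_four = (toy_orders.sort_values(['user_id', 'order_number'])
                 .groupby('user_id').tail(4)[['user_id', 'order_number']])

# The inner merge drops the train order, which has no prior-order lines.
toy_cnt = (toy_last_four
           .merge(toy_detail, on=['user_id', 'order_number'], how='inner')
           .groupby(['user_id', 'product_id']).size().rename('buy_cnt'))
print(toy_cnt)
```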

##### Forming final training dataframe

In [28]:
temp_df = pd.merge(user_df, product_cnt, left_index=True,
                   right_index=True, how='left')

In [29]:
temp_df = temp_df.rename(columns={'product_id': 'buy_cnt'})

In [30]:
temp_df.reset_index(inplace=True)

In [34]:
# save a copy for ease of access in other model training scenarios
temp_df.to_csv('./data/df.csv', index=True)

In [32]:
# user ids that appear in the train set
train_id = train_order.user_id.unique().tolist()

In [33]:
# keep only the rows of temp_df that belong to training-set users
temp_train_df = temp_df.loc[temp_df['user_id'].isin(train_id)]

Generate the target variable from the train data

In [35]:
train_target = pd.merge(train_order[['order_id', 'user_id']], orders_tr,
                        on=['order_id'], how='left')
train_train = train_target.drop(['order_id', 'add_to_cart_order'], axis=1)
train_target = train_target.rename(columns={'reordered': 'target'})

# set to 1 as these are the products bought in the next purchase
train_target['target'] = 1

Add target feature to the training dataframe, and drop order_number column

In [36]:
train_df = pd.merge(temp_train_df, train_target, on=['user_id', 'product_id'],
                    how='outer').fillna(0)
train_df = train_df.set_index(['user_id', 'product_id'])

# columns that won't be needed for final features
train_df.drop(['order_number', 'order_id', 'add_to_cart_order'],
              axis=1, inplace=True)
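
A small sketch of how the outer merge and `fillna(0)` behave, using hypothetical toy frames. Products seen only in prior orders get `target` 0, while products that appear only in the train order (never bought before) are kept with all features set to 0; a `how='left'` merge would drop those rows instead, and which is preferable is a modeling assumption.

```python
import pandas as pd

# Hypothetical features built from prior orders.
toy_features = pd.DataFrame({
    'user_id': [1, 1],
    'product_id': [196, 300],
    'buy_cnt': [3.0, 1.0],
})
# Hypothetical train-order products; 999 was never bought in prior orders.
toy_target = pd.DataFrame({
    'user_id': [1, 1],
    'product_id': [196, 999],
    'target': [1, 1],
})

toy_train = toy_features.merge(
    toy_target, on=['user_id', 'product_id'], how='outer').fillna(0)
toy_train = toy_train.set_index(['user_id', 'product_id'])
toy_train = toy_train.astype({'target': 'int'})
print(toy_train)
```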

Cast target to integer so that the model recognizes it as a binary variable

In [37]:
train_df = train_df.astype({'target': 'int'})

In [38]:
train_df.head()

user_id,product_id,last_purchase,reordered,lag,product_appear,buy_cnt,target
1,196,1.0,9.0,20.125,1.0,3.0,1
1,10258,1.0,8.0,20.125,0.9,3.0,1
1,10326,0.0,0.0,78.0,0.2,0.0,0
1,12427,1.0,9.0,20.125,1.0,3.0,0
1,13032,1.0,2.0,80.5,0.3,1.0,1


In [39]:
train_df.to_csv('./data/df.csv', index=True)

##### Create separate formatted test dataset for ease of access later.

In [40]:
test_id = test_order.user_id.unique().tolist()
df_for_submit = temp_df.loc[temp_df['user_id'].isin(test_id)].set_index(
    ['user_id', 'product_id']).fillna(0)

In [41]:
df_for_submit.to_csv('./data/df_submit.csv')

[Modeling Part 1](https://github.com/sittingman/instacart_product_repurchase/blob/master/4_ML_model_p1.ipynb)
<br>
<br>
[Modeling Part 2](https://github.com/sittingman/instacart_product_repurchase/blob/master/4_ML_model_p2.ipynb)
- repeats Modeling Part 1 with the 'lag' feature dropped and reassesses the best-fitting model