# Machine Learning

We will use all the prior order information to generate features, and use the train data to create the target variables.

From the exploratory analysis, we learned that purchase patterns at the department level are fairly consistent, and that reorders resemble the orders users placed before. Also, the average time between reorders is under 15 days for new clients and gradually decreases to 2-4 days over time. Hence, we will test the following features:
- product purchase frequency
- duration from last purchase

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pycaret.classification import *
sns.set_style('ticks')
%matplotlib inline 


In [2]:
# aisles = pd.read_csv('./data/aisles.csv')
# dept = pd.read_csv('./data/departments.csv')
orders = pd.read_csv('./data/orders.csv')
# products = pd.read_csv('./data/products.csv')
orders_p = pd.read_csv('./data/order_products__prior.csv')
orders_tr = pd.read_csv('./data/order_products__train.csv')

In [3]:
prior_order = orders.query('eval_set == "prior"')
train_order = orders.query('eval_set == "train"')
test_order = orders.query('eval_set == "test"')

In [4]:
print(orders.shape)
display(orders.head())
print(orders_p.shape)
print(orders_tr.shape)

(3421083, 7)


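| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |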


(32434489, 4)
(1384617, 4)


#### Getting product details

In [6]:
# product_full_detail = pd.merge(products, dept, on='department_id').merge(aisles, on='aisle_id')

### Feature Creations

- last purchase: whether the product was in the user's latest order
- last 3 purchases: whether the product appears in each of the user's last 3 orders
- the number of days since the item was last purchased (zero if the product is not in the latest orders)
- how many times the user has purchased the item
- how often the item appears in the user's purchase history, as a proportion of orders

##### Obtain the list of each user's last purchase order

In [7]:
last_purchase = prior_order[prior_order['order_number'] == prior_order.groupby(['user_id'])['order_number'].transform('max')]
last_purchase_list = last_purchase['order_id'].tolist()
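
To make the selection above concrete, here is a minimal sketch on a toy frame (the data and the `toy_*` names are invented for illustration). `transform('max')` broadcasts each user's highest `order_number` back to every row of that user, so the comparison keeps exactly one row per user.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_orders = pd.DataFrame({
    'order_id':     [10, 11, 12, 20, 21],
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# Broadcast each user's maximum order_number to all of that user's rows
max_per_user = toy_orders.groupby('user_id')['order_number'].transform('max')

# Keep only the rows where the order is the user's latest one
toy_last = toy_orders[toy_orders['order_number'] == max_per_user]
toy_last_ids = toy_last['order_id'].tolist()
print(toy_last_ids)
```

Unlike `agg('max')`, `transform` returns a series aligned with the original rows, which is what allows the boolean mask.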

##### Generating last purchase feature

In [8]:
order_p_detail = orders_p.merge(prior_order[['order_id','user_id','order_number']], on=['order_id'], how='left')
order_p_detail.loc[order_p_detail['order_id'].isin(last_purchase_list), 'last_purchase'] = 1
order_p_detail.loc[~order_p_detail['order_id'].isin(last_purchase_list), 'last_purchase'] = 0

##### Generating purchase counts and product appearance frequency

In [10]:
user_df = order_p_detail.groupby(['user_id', 'product_id']).agg({'order_number':'max', 'last_purchase':'max', 'reordered':'sum'}) #'days_since_prior_order': 'max', 'order_dow':'max'

In [12]:
user_df['product_appear'] = (user_df['reordered']+1)/user_df['order_number']
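
As a worked example of this ratio, assume a hypothetical user who bought product 100 in orders 1, 3 and 4 and product 200 only in order 2 (toy data, invented for illustration). Because `reordered` is 0 on a first purchase and 1 afterwards, `reordered.sum() + 1` recovers the purchase count. Note that the denominator is the number of the latest order containing the product, not the user's total order count.

```python
import pandas as pd

# Toy purchase history for one user, invented for illustration
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1],
    'product_id':   [100, 200, 100, 100],
    'order_number': [1, 2, 3, 4],
    'reordered':    [0, 0, 1, 1],
})

# Same aggregation as above, minus the last_purchase column
toy_user_df = toy.groupby(['user_id', 'product_id']).agg(
    {'order_number': 'max', 'reordered': 'sum'}
)

# Purchases so far divided by the latest order number containing the product
toy_user_df['product_appear'] = (toy_user_df['reordered'] + 1) / toy_user_df['order_number']
print(toy_user_df)
```

Here product 100 gets 3 / 4 and product 200 gets 1 / 2.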

In [13]:
user_order_cnt = orders.groupby(['user_id', 'order_number']).agg({'days_since_prior_order':'sum'})

##### Generate 3 consecutive purchase features for the last 3 orders, by user and by item

In [14]:
user_order_cnt_1 = user_order_cnt.groupby(level=0).apply(lambda df: df[-4:]) # last 4 orders per user: the final train/test order plus the last 3 prior orders
user_order_cnt_1.index = user_order_cnt_1.index.droplevel(0)

In [15]:
last_three = user_order_cnt_1.reset_index().drop('days_since_prior_order', axis=1)

In [16]:
product_last_three = pd.merge(last_three, order_p_detail, on=['user_id', 'order_number'], how='inner')
product_last_three = product_last_three.groupby(['user_id', 'product_id']).agg({'product_id': 'count'})

In [17]:
buy_3_time = product_last_three.query('product_id ==3').rename(columns={'product_id': '3x'})
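
The steps above can be traced on a toy user with four prior orders and one final train order (all data and `toy_*` names are invented for illustration). This sketch assumes `groupby(...).tail(4)` as a shortcut for the `apply` call; it selects the same rows when each user's orders are sorted by `order_number`. The inner merge then discards the final order, which has no rows in the prior products table, leaving the last three prior orders.

```python
import pandas as pd

# Toy data: user 1 has four prior orders and one final (train) order, number 5
toy_orders = pd.DataFrame({'user_id': [1] * 5, 'order_number': [1, 2, 3, 4, 5]})

# Products exist only for prior orders 1-4
toy_products = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 1, 1, 1],
    'order_number': [1, 2, 2, 3, 3, 4, 4],
    'product_id':   [300, 100, 200, 100, 300, 100, 200],
})

# Last four orders per user: the final order plus the three prior ones before it
toy_last_four = toy_orders.groupby('user_id').tail(4)

# The inner merge drops order 5, which has no prior products
merged = toy_last_four.merge(toy_products, on=['user_id', 'order_number'], how='inner')

# Count how many of the last three prior orders contain each product
counts = merged.groupby(['user_id', 'product_id']).size()
in_all_three = counts[counts == 3].index.tolist()
print(in_all_three)
```

Only product 100 appears in orders 2, 3 and 4, so it is the only one that would receive the `3x` flag.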

##### Forming final training dataframe

In [18]:
temp_df = pd.merge(user_df, buy_3_time, left_index=True, right_index=True, how='left')
temp_df['3x'] = temp_df['3x'].apply(lambda x: 1 if x ==3 else 0) # set to 1 if the item was bought in each of the last 3 orders

In [19]:
temp_df.reset_index(inplace=True)

In [20]:
# getting user id within the train dataset section
train_id = train_order.user_id.unique().tolist() 

In [21]:
# keep only the rows for users in the training set
temp_train_df = temp_df.loc[temp_df['user_id'].isin(train_id)]

Generate the target variable from the train data

In [22]:
train_target = pd.merge(train_order[['order_id', 'user_id']], orders_tr, on=['order_id'], how='left').drop(['order_id','add_to_cart_order'], axis=1)
train_target = train_target.rename(columns={'reordered': 'target'})

Add target feature to the training dataframe, and drop order_number column

In [23]:
train_df = pd.merge(temp_train_df, train_target, on=['user_id', 'product_id'], how='outer').fillna(0).set_index(['user_id', 'product_id'])
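
The effect of the outer merge followed by `fillna(0)` is easiest to see on a toy case (data and `toy_*` names invented for illustration, with a single feature column standing in for the full set). A product the user bought before but left out of the train order ends up with target 0, while a product that is new in the train order ends up with all-zero features.

```python
import pandas as pd

# Features from prior orders (toy values, invented for illustration)
toy_features = pd.DataFrame({
    'user_id':       [1, 1],
    'product_id':    [100, 200],
    'last_purchase': [1.0, 0.0],
})

# Products in the user's train order: 100 is a reorder, 300 is new to the user
toy_target = pd.DataFrame({
    'user_id':    [1, 1],
    'product_id': [100, 300],
    'target':     [1, 0],
})

# Outer merge keeps products from both sides; missing values become 0
toy_train = (
    pd.merge(toy_features, toy_target, on=['user_id', 'product_id'], how='outer')
    .fillna(0)
    .set_index(['user_id', 'product_id'])
)
print(toy_train)
```

Product 200 has no match in the train order, so its missing target is filled with 0; product 300 has no prior history, so its missing feature is filled with 0.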

In [24]:
train_df.drop(['order_number'], axis=1, inplace=True)

Set the target to integer so that the model recognizes it as a binary variable

In [25]:
train_df = train_df.astype({'target': 'int'})

##### Create separate formatted test dataset for ease of access later.

In [27]:
test_id = test_order.user_id.unique().tolist()
df_for_submit = temp_df.loc[temp_df['user_id'].isin(test_id)].set_index(['user_id','product_id'])

In [28]:
df_for_submit.to_csv('./data/df_submit.csv')

### Modeling

##### Setup

In [29]:
exp = setup(train_df, target='target', categorical_features=['last_purchase', '3x'], train_size=.8)

 
Setup Succesfully Completed!


| | Description | Value |
|---|---|---|
| 0 | session_id | 8656 |
| 1 | Target Type | Binary |
| 2 | Label Encoded | |
| 3 | Original Data | (9030454, 5) |
| 4 | Missing Values | False |
| 5 | Numeric Features | 2 |
| 6 | Categorical Features | 2 |
| 7 | Ordinal Features | False |
| 8 | High Cardinality Features | False |
| 9 | High Cardinality Method | |


##### Comparing models to see whether any stand out in terms of F1 score

In [30]:
compare_models(fold=2,blacklist=['knn', 'ridge', 'svm', 'lda', 'nb', 'qda', 'et'], round=2, sort='F1')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Ada Boost Classifier | 0.91 | 0.79 | 0.16 | 0.6 | 0.26 | 0.23 |
| 1 | Gradient Boosting Classifier | 0.91 | 0.79 | 0.15 | 0.62 | 0.25 | 0.22 |
| 2 | Light Gradient Boosting Machine | 0.91 | 0.79 | 0.15 | 0.62 | 0.25 | 0.22 |
| 3 | CatBoost Classifier | 0.91 | 0.79 | 0.15 | 0.62 | 0.25 | 0.22 |
| 4 | Decision Tree Classifier | 0.91 | 0.79 | 0.15 | 0.62 | 0.24 | 0.21 |
| 5 | Random Forest Classifier | 0.91 | 0.79 | 0.15 | 0.61 | 0.24 | 0.21 |
| 6 | Extreme Gradient Boosting | 0.91 | 0.79 | 0.15 | 0.62 | 0.24 | 0.22 |
| 7 | Logistic Regression | 0.91 | 0.78 | 0.14 | 0.61 | 0.23 | 0.2 |


**Observation**: It turns out that the major classification models have similar F1 scores. Note that the base models have low recall across the board, which means that only a small portion of the items that were actually reordered is being identified. This could be because the default binary cutoff of 0.5 is too strict and eliminates many qualifying records.
<br>
To increase recall, we will lower the binary cutoff to 0.3, knowing that this will reduce precision; if it improves the overall F1 score, it meets the problem's objective.
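
The trade-off can be illustrated with a small pure-Python sketch. The helper `precision_recall_f1` and the toy labels and scores are hypothetical and are not part of the notebook or of PyCaret; they only show how moving the cutoff changes the three metrics.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Return (precision, recall, F1) for labels obtained by cutting y_prob at threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels and predicted probabilities, invented for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.45, 0.35, 0.1, 0.4, 0.2, 0.1, 0.1, 0.05, 0.05]

for cutoff in (0.5, 0.3):
    p, r, f1 = precision_recall_f1(y_true, y_prob, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On this toy data, lowering the cutoff from 0.5 to 0.3 raises recall from 0.25 to 0.75 and F1 from 0.40 to 0.75, while precision falls from 1.00 to 0.75.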

In [1]:
def model_f1(model, prob):
    """Create a model and generate predictions at a given probability threshold.
    
    Args:
        model: the abbreviated string ID of the estimator
        prob: probability threshold that determines whether an output is 0 or 1
        
    Returns:
        name: the created model
        var: predictions generated from the model
    """
    name = create_model(model, fold=2)
    var = predict_model(name, probability_threshold=prob)
    return name, var

In [55]:
%%time
lr, lr_pred = model_f1('lr', .3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.9101 | 0.7766 | 0.236 | 0.5221 | 0.3251 | 0.2841 |


Wall time: 48 s


In [56]:
gbc, gbc_pred = model_f1('gbc', .3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting Classifier | 0.9061 | 0.7925 | 0.2889 | 0.4808 | 0.3609 | 0.3137 |


In [57]:
%%time
lightgbm, lg_pred = model_f1('lightgbm', .3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Light Gradient Boosting Machine | 0.9058 | 0.7926 | 0.2919 | 0.4783 | 0.3625 | 0.3149 |


Wall time: 2min 1s


In [58]:
%%time
xgb, xgb_pred = model_f1('xgboost', .3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Extreme Gradient Boosting | 0.9062 | 0.7925 | 0.287 | 0.4819 | 0.3598 | 0.3127 |


Wall time: 4min 17s


In [59]:
%%time
rfc, rfc_pred = model_f1('rf', .3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.9057 | 0.7912 | 0.2874 | 0.4772 | 0.3588 | 0.3112 |


Wall time: 3min 17s


##### Save all the models for final test data submission

In [60]:
path = './data/save_model/'
save_model(lr, model_name=path+'lr')
save_model(gbc, model_name=path+'gbc')
save_model(lightgbm, model_name=path+'lgbm')
save_model(rfc, model_name=path+'rfc')
save_model(xgb, model_name=path+'xgb')

Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved


We also tested blending the models; the performance is inferior to that of the individual models, so we won't proceed with that strategy.

In [63]:
%%time
# blend = blend_models(estimator_list=[rfc, xgb, lightgbm, gbc], fold=2)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|
| 0 | 0.9138 | 0.0 | 0.1476 | 0.6301 | 0.2392 | 0.2118 |
| 1 | 0.9138 | 0.0 | 0.1492 | 0.6288 | 0.2412 | 0.2135 |
| Mean | 0.9138 | 0.0 | 0.1484 | 0.6294 | 0.2402 | 0.2126 |
| SD | 0.0 | 0.0 | 0.0008 | 0.0006 | 0.001 | 0.0009 |


Wall time: 22min 27s


In [67]:
# pred_blend = predict_model(blend, probability_threshold=.2)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Voting Classifier | 0.914 | 0 | 0.1483 | 0.6337 | 0.2403 | 0.2129 |
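
A possible explanation for the zero AUC and the unchanged recall, offered as an assumption rather than something the outputs confirm: if the blend used hard (majority) voting, the ensemble emits class labels only, so there are no probabilities to rank for AUC or to re-threshold. The sketch below shows majority voting on hypothetical 0/1 predictions; `majority_vote` and the three model outputs are invented for illustration.

```python
def majority_vote(label_rows):
    """Return the majority label per sample, given one list of 0/1 labels per model.

    Ties (possible with an even number of models) resolve to 0 in this sketch.
    """
    n_models = len(label_rows)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*label_rows)]

# Hypothetical 0/1 predictions from three models on four samples
model_a = [1, 0, 0, 1]
model_b = [1, 0, 1, 0]
model_c = [0, 0, 1, 1]

blended = majority_vote([model_a, model_b, model_c])
print(blended)
```

Since each base model here would vote with its own default cutoff, a blend of this kind inherits the low recall seen in the untuned models.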


In [66]:
# save_model(blend, model_name=path+'blend')

Transformation Pipeline and Model Succesfully Saved
