# Machine Learning

We will use all the prior order information to generate features, and use the train data to create the target variables.

From the exploratory analysis, we learned that purchase patterns at the department level are fairly consistent, and that reorders tend to resemble the orders a user has placed before. Also, the gap between orders averages under 15 days for new clients and gradually decreases over time to 2-4 days. Hence, we will test the following features:
- product purchase frequency
- duration from last purchase

### Import modules and data

In [1]:
import pandas as pd
import gc
from pycaret.classification import * 

In [2]:
# aisles = pd.read_csv('./data/aisles.csv')
# dept = pd.read_csv('./data/departments.csv')
orders = pd.read_csv('./data/orders.csv')
# products = pd.read_csv('./data/products.csv')
orders_p = pd.read_csv('./data/order_products__prior.csv')
orders_tr = pd.read_csv('./data/order_products__train.csv')

In [3]:
prior_order = orders.query('eval_set == "prior"')
train_order = orders.query('eval_set == "train"')
test_order = orders.query('eval_set == "test"')

In [4]:
print(orders.shape)
display(orders.head())
print(orders_p.shape)
print(orders_tr.shape)

(3421083, 7)


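| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |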


(32434489, 4)
(1384617, 4)


### Feature Creations

- last purchase: whether a particular product was in the user's latest order
- last 3 purchases: how many of the user's last 3 orders contain the product
- the number of days since the item was last purchased (zero if the product does not appear in the latest orders)
- how many times the user has purchased the item
- how often the item appears in the user's purchase history, as a fraction
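The duration feature in the list does not appear to be built in the cells that follow. As a rough sketch of one possible reading, the toy example below accumulates `days_since_prior_order` into a running day count per user and subtracts the day on which each product was last bought. The frames `orders_toy` and `items_toy`, their values, and the column name `days_since_last_purchase` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data: one user with four prior orders.
orders_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'order_number': [1, 2, 3, 4],
    'days_since_prior_order': [float('nan'), 15.0, 21.0, 29.0],
})
# Hypothetical order contents: which product was in which order.
items_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'order_number': [1, 2, 4, 1, 3],
    'product_id': [196, 196, 196, 10326, 10326],
})

# Days elapsed between the user's first order and each later order.
orders_toy['day'] = (orders_toy['days_since_prior_order'].fillna(0)
                     .groupby(orders_toy['user_id']).cumsum())
# Day of the user's latest order, repeated on every row of that user.
orders_toy['last_day'] = orders_toy.groupby('user_id')['day'].transform('max')

merged = items_toy.merge(orders_toy, on=['user_id', 'order_number'])
grouped = merged.groupby(['user_id', 'product_id'])
# Latest order day minus the day the product was last bought.
days_since_last = (grouped['last_day'].max() - grouped['day'].max()).rename(
    'days_since_last_purchase')
print(days_since_last)
```

In this sketch a value of 0 means the product was in the user's latest order; the zero convention described in the list would need an additional rule.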

##### Obtain the list of each user's last prior order

In [4]:
last_purchase = prior_order[prior_order['order_number'] == prior_order.groupby(
    ['user_id'])['order_number'].transform('max')]
last_purchase_list = last_purchase['order_id'].tolist()

##### Generating last purchase feature

In [22]:
order_detail = orders_p.merge(prior_order[['order_id', 'user_id', 'order_number']],
                              on=['order_id'], how='left')
order_detail.loc[order_detail['order_id'].isin(last_purchase_list),
                 'last_purchase'] = 1
order_detail.loc[~order_detail['order_id'].isin(last_purchase_list),
                 'last_purchase'] = 0

##### Generating purchase counts and product appearance frequency

In [14]:
user_df = order_detail.groupby(['user_id', 'product_id']).agg({
    'order_number': 'max', 'last_purchase': 'max', 'reordered': 'sum'})

In [15]:
# times bought (reorders + first purchase) / latest order number containing the product
user_df['product_appear'] = (user_df['reordered']+1)/user_df['order_number']
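To make the ratio concrete, here is a small worked check using the `reordered` and `order_number` values for user 1 that appear in the `temp_df.head()` output further down. The helper name `product_appear` is only illustrative.

```python
def product_appear(reordered_sum, max_order_number):
    # 'reordered' counts repeat purchases only, so +1 adds back the first purchase.
    return (reordered_sum + 1) / max_order_number

# product_id -> (sum of reordered, latest order number containing the product)
rows = {196: (9, 10), 10258: (8, 10), 10326: (0, 5), 13032: (2, 10)}
ratios = {pid: round(product_appear(r, n), 2) for pid, (r, n) in rows.items()}
print(ratios)
```

Note that the denominator is the latest order number in which the product appeared, not the user's total number of orders, which is why product 10326 (bought once, in order 5) scores 0.2. If the share of all of a user's orders were wanted instead, the denominator would be the user's highest order number; that is a possible variation, not what the cell above computes.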

In [16]:
user_order_cnt = orders.groupby(['user_id', 'order_number']).agg({
    'days_since_prior_order': 'sum'})

##### Generate 3 consecutive purchase features for the last 3 orders, by user and by item

In [17]:
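# Take the last 4 orders per user: `orders` also holds the final train/test
# order, which drops out in the inner merge with order_detail (prior only)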
user_order_cnt_1 = user_order_cnt.groupby(level=0).apply(lambda df: df[-4:])
user_order_cnt_1.index = user_order_cnt_1.index.droplevel(0)
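As an alternative sketch, the last three orders per user can also be selected with `groupby(...).tail(3)`, which avoids `apply` and the index clean-up. This assumes only prior orders are passed in, so three rows per user are enough (the cell above slices four rows from `orders`, which also appears to contain each user's train or test order). The frame `prior_toy` is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy orders: user 1 has five prior orders, user 2 has two.
prior_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 4, 5, 1, 2],
})

# Sort so the latest orders come last, then keep up to 3 rows per user.
last_three_toy = (prior_toy.sort_values(['user_id', 'order_number'])
                  .groupby('user_id')
                  .tail(3)
                  .reset_index(drop=True))
print(last_three_toy)
```

Users with fewer than three orders simply keep all of their rows.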

In [19]:
last_three = user_order_cnt_1.reset_index().drop(
    'days_since_prior_order', axis=1)

In [24]:
product_last_three = pd.merge(last_three, order_detail,
                              on=['user_id', 'order_number'], how='inner')
product_cnt = product_last_three.groupby(['user_id', 'product_id']).agg(
    {'product_id': 'count'})

In [25]:
# items bought in each of the last 3 orders (not used further below)
buy_3_time = product_cnt[product_cnt['product_id'] == 3].rename(
    columns={'product_id': '3x'})

##### Forming final training dataframe

In [30]:
temp_df = pd.merge(user_df, product_cnt, left_index=True,
                   right_index=True, how='left')
# temp_df['3x'] = temp_df['3x'].apply(lambda x: 1 if x ==3 else 0)
# set to 1 if the item was bought in each of the last 3 orders

In [34]:
temp_df = temp_df.rename(columns={'product_id': 'buy_cnt'})

In [35]:
temp_df.head()

| user_id | product_id | order_number | last_purchase | reordered | product_appear | buy_cnt |
|---|---|---|---|---|---|---|
| 1 | 196 | 10 | 1.0 | 9 | 1.0 | 3.0 |
| 1 | 10258 | 10 | 1.0 | 8 | 0.9 | 3.0 |
| 1 | 10326 | 5 | 0.0 | 0 | 0.2 | NaN |
| 1 | 12427 | 10 | 1.0 | 9 | 1.0 | 3.0 |
| 1 | 13032 | 10 | 1.0 | 2 | 0.3 | 1.0 |


In [36]:
temp_df.reset_index(inplace=True)

In [37]:
# get the user ids that belong to the train set
train_id = train_order.user_id.unique().tolist()

In [42]:
# keep only the rows of temp_df that belong to train-set users
temp_train_df = temp_df.loc[temp_df['user_id'].isin(train_id)]

In [43]:
temp_train_df.head()

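| | user_id | product_id | order_number | last_purchase | reordered | product_appear | buy_cnt |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1.0 | 9 | 1.0 | 3.0 |
| 1 | 1 | 10258 | 10 | 1.0 | 8 | 0.9 | 3.0 |
| 2 | 1 | 10326 | 5 | 0.0 | 0 | 0.2 | NaN |
| 3 | 1 | 12427 | 10 | 1.0 | 9 | 1.0 | 3.0 |
| 4 | 1 | 13032 | 10 | 1.0 | 2 | 0.3 | 1.0 |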


Generate the target variable from the train data

In [40]:
train_target = pd.merge(train_order[['order_id', 'user_id']], orders_tr,
                        on=['order_id'], how='left')
train_train = train_target.drop(['order_id', 'add_to_cart_order'], axis=1)
train_target = train_target.rename(columns={'reordered': 'target'})
# set to 1 as these are the products bought in the next purchase
train_target['target'] = 1

Add the target to the training dataframe, and drop the columns that are not features (order_number, order_id, add_to_cart_order)

In [55]:
train_df = pd.merge(temp_train_df, train_target, on=['user_id', 'product_id'],
                    how='outer').fillna(0)
train_df = train_df.set_index(['user_id', 'product_id'])
# drop columns that won't be needed for the final features
train_df.drop(['order_number', 'order_id', 'add_to_cart_order'],
              axis=1, inplace=True)

Cast the target to integer to ensure the model recognizes it as a binary variable

In [56]:
train_df = train_df.astype({'target': 'int'})
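The labelling steps above can be sketched on toy data. The frames `features_toy` and `bought_next` and their values are hypothetical; the point is only to show what the outer merge followed by `fillna(0)` does to each kind of row.

```python
import pandas as pd

# Hypothetical features built from prior orders.
features_toy = pd.DataFrame({
    'user_id': [1, 1, 1],
    'product_id': [196, 10326, 12427],
    'buy_cnt': [3.0, None, 3.0],
})
# Hypothetical contents of the user's train order, all labelled 1.
bought_next = pd.DataFrame({
    'user_id': [1, 1],
    'product_id': [196, 39657],
    'target': [1, 1],
})

labelled = (features_toy
            .merge(bought_next, on=['user_id', 'product_id'], how='outer')
            .fillna(0)                    # missing features and missing labels become 0
            .astype({'target': 'int'})    # binary label as integer
            .set_index(['user_id', 'product_id']))
print(labelled)
```

Because the merge is an outer join, a product that appears only in the train order (39657 here) is kept with all features filled with 0 and a target of 1. With `how='left'` those rows would be dropped and only previously purchased products would remain.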

In [57]:
train_df.head()

| user_id | product_id | last_purchase | reordered | product_appear | buy_cnt | target |
|---|---|---|---|---|---|---|
| 1 | 196 | 1.0 | 9.0 | 1.0 | 3.0 | 1 |
| 1 | 10258 | 1.0 | 8.0 | 0.9 | 3.0 | 1 |
| 1 | 10326 | 0.0 | 0.0 | 0.2 | 0.0 | 0 |
| 1 | 12427 | 1.0 | 9.0 | 1.0 | 3.0 | 0 |
| 1 | 13032 | 1.0 | 2.0 | 0.3 | 1.0 | 1 |


##### Create separate formatted test dataset for ease of access later.

In [58]:
test_id = test_order.user_id.unique().tolist()
df_for_submit = temp_df.loc[temp_df['user_id'].isin(test_id)].set_index(
    ['user_id', 'product_id'])

In [59]:
df_for_submit.to_csv('./data/df_submit.csv')

### Modeling

##### Setup

In [60]:
gc.collect()
exp = setup(train_df, target='target', categorical_features=['last_purchase'],
            train_size=.8)

 
Setup Succesfully Completed!


| | Description | Value |
|---|---|---|
| 0 | session_id | 2408 |
| 1 | Target Type | Binary |
| 2 | Label Encoded | |
| 3 | Original Data | (9030454, 5) |
| 4 | Missing Values | False |
| 5 | Numeric Features | 3 |
| 6 | Categorical Features | 1 |
| 7 | Ordinal Features | False |
| 8 | High Cardinality Features | False |
| 9 | High Cardinality Method | |


##### Comparing models to see if any stand out from the F1 score perspective

In [61]:
compare_models(blacklist=['knn', 'ridge', 'svm', 'lda', 'nb', 'qda', 'et'],
               fold=2, round=3, sort='F1')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Extreme Gradient Boosting | 0.914 | 0.882 | 0.5 | 0.891 | 0.641 | 0.596 |
| 1 | Light Gradient Boosting Machine | 0.914 | 0.882 | 0.5 | 0.893 | 0.641 | 0.596 |
| 2 | CatBoost Classifier | 0.914 | 0.882 | 0.5 | 0.893 | 0.641 | 0.596 |
| 3 | Ada Boost Classifier | 0.914 | 0.881 | 0.501 | 0.887 | 0.64 | 0.595 |
| 4 | Gradient Boosting Classifier | 0.914 | 0.882 | 0.499 | 0.893 | 0.64 | 0.596 |
| 5 | Random Forest Classifier | 0.913 | 0.879 | 0.499 | 0.885 | 0.638 | 0.593 |
| 6 | Decision Tree Classifier | 0.913 | 0.877 | 0.497 | 0.889 | 0.637 | 0.592 |
| 7 | Logistic Regression | 0.848 | 0.705 | 0.035 | 0.586 | 0.066 | 0.05 |


**Observation**: It turns out that the major classification models have similar F1 scores, with the exception of Logistic Regression. Most models reach a recall of about 0.5, which means that half of all truly reordered items are identified. Precision is about 0.89, which is decent given that no tuning has been done yet.
<br>
There could be room to increase recall in order to maximize the F1 score (the objective of the problem), assuming the trade-off in precision is worth it. We will lower the probability cut-off to 0.4 to see whether F1 improves.
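Since F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, which here is recall. The short check below reproduces the F1 of the top model from its reported precision and recall; the "what-if" pair is a hypothetical illustration, not a measured result.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for Extreme Gradient Boosting in the table above.
baseline = f1_score(precision=0.891, recall=0.500)
print(round(baseline, 3))

# Hypothetical trade: give up some precision to gain some recall.
what_if = f1_score(precision=0.830, recall=0.550)
print(round(what_if, 3))
```

The trade only pays off if the recall gained outweighs the precision lost in the harmonic mean, which is what the threshold experiment is meant to check.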

Based on the results above, we will test Extreme Gradient Boosting, Light Gradient Boosting Machine and Random Forest to get a range of performance when applying different probability thresholds.
<br>

To save time, we will create the function below to automate the steps.

In [62]:

def model_f1(model, prob):
    """Create a model and generate predictions at a given threshold.

    Args:
        model: the abbreviated string ID of the estimator
        prob: probability threshold that determines whether an output is 0 or 1

    Returns:
        name: the created model
        var: predictions generated from the model
    """
    name = create_model(model, fold=2)
    var = predict_model(name, probability_threshold=prob)
    return name, var

In [63]:
%%time
lightgbm, lg_pred = model_f1('lightgbm', .4)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Light Gradient Boosting Machine | 0.9125 | 0.8822 | 0.5433 | 0.8269 | 0.6557 | 0.6081 |


Wall time: 1min 56s


In [64]:
%%time
xgb, xgb_pred = model_f1('xgboost', .4)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Extreme Gradient Boosting | 0.9124 | 0.8814 | 0.5439 | 0.8253 | 0.6556 | 0.6079 |


Wall time: 3min 55s


In [65]:
%%time
rfc, rfc_pred = model_f1('rf', .4)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.912 | 0.8805 | 0.5402 | 0.8257 | 0.6531 | 0.6052 |


Wall time: 3min 58s


**Observation**: Lowering the threshold to 0.4 raises recall by about 0.04 (from roughly 0.50 to 0.54) but lowers precision by about 0.06 (from roughly 0.89 to 0.83), so the two changes largely offset each other. The net gain in F1 score is about 0.015, not significant enough to justify the adjustment, at least on the training data. However, the test set is much smaller, and in that case it may be worth lowering the probability threshold in order to minimize the penalty from low recall.
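To compare more than one cut-off, a small threshold sweep can be sketched as below. The labels and scores are hypothetical toy values, not the notebook's predictions, and the sketch assumes a row is predicted positive when its score is at or above the threshold.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when scores >= threshold are predicted as 1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and predicted probabilities for ten items.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.2, 0.6, 0.42, 0.3, 0.2, 0.1, 0.05]

results = {th: f1_at_threshold(y_true, scores, th) for th in (0.5, 0.4, 0.3)}
for th, f1 in results.items():
    print(th, round(f1, 3))
```

In this toy data F1 peaks at 0.4 and falls again at 0.3, which illustrates why the threshold is worth checking at several values rather than assuming that lower is always better.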

#### Save all the models for final test data submission

In [66]:
path = './data/save_model/'

save_model(lightgbm, model_name=path+'lgbm')
save_model(rfc, model_name=path+'rfc')
save_model(xgb, model_name=path+'xgb')

Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved
Transformation Pipeline and Model Succesfully Saved


#### Blend Models

We also test blending the three models. The blended F1 score is no better than that of the individual models, so we will not proceed with this strategy.

In [67]:
%%time
blend = blend_models(estimator_list=[rfc, xgb, lightgbm], fold=2)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|
| 0 | 0.9143 | 0.0 | 0.5003 | 0.894 | 0.6416 | 0.5972 |
| 1 | 0.9139 | 0.0 | 0.4981 | 0.8935 | 0.6396 | 0.5952 |
| Mean | 0.9141 | 0.0 | 0.4992 | 0.8938 | 0.6406 | 0.5962 |
| SD | 0.0002 | 0.0 | 0.0011 | 0.0003 | 0.001 | 0.001 |


Wall time: 8min 59s


**Observation**: Blending the models gives no major improvement in F1 score, so we won't pursue this strategy.
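One plausible explanation is that the three tree-based models have nearly identical metrics and likely agree on most rows, so a vote among them changes few predictions. The sketch below shows majority (hard) voting on hypothetical predictions. It assumes the blend combines predicted labels by vote; the AUC of 0.0 in the table is consistent with that, but this is an inference rather than something the notebook confirms.

```python
def majority_vote(*model_predictions):
    """Hard voting: each row gets the label that most models predicted."""
    blended = []
    for votes in zip(*model_predictions):
        # Label 1 wins only if more than half of the models voted for it.
        blended.append(1 if sum(votes) * 2 > len(votes) else 0)
    return blended

# Hypothetical 0/1 predictions from three models on five rows.
rf_pred = [1, 0, 1, 0, 1]
xgb_pred = [1, 0, 0, 0, 1]
lgbm_pred = [1, 1, 0, 0, 1]

blended_pred = majority_vote(rf_pred, xgb_pred, lgbm_pred)
print(blended_pred)
```

The blended output differs from any single model only on rows where the models disagree, so highly correlated models leave little for the vote to improve.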