# Machine Learning - Modeling (Part 2)

This section repeats the machine learning process from the previous part, this time excluding the 'lag' feature.

In [1]:
import pandas as pd
from pycaret.classification import *

In [2]:
orders = pd.read_csv('./data/orders.csv')
prior_order = orders.query('eval_set == "prior"')
train_order = orders.query('eval_set == "train"')
test_order = orders.query('eval_set == "test"')

In [3]:
df = pd.read_csv('./data/df.csv')

In [4]:
df.head()

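| | user_id | product_id | last_purchase | reordered | lag | product_appear | buy_cnt | target |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 1.0 | 9.0 | 20.125 | 1.0 | 3.0 | 1 |
| 1 | 1 | 10258 | 1.0 | 8.0 | 20.125 | 0.9 | 3.0 | 1 |
| 2 | 1 | 10326 | 0.0 | 0.0 | 78.0 | 0.2 | 0.0 | 0 |
| 3 | 1 | 12427 | 1.0 | 9.0 | 20.125 | 1.0 | 3.0 | 0 |
| 4 | 1 | 13032 | 1.0 | 2.0 | 80.5 | 0.3 | 1.0 | 1 |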


In [5]:
df = df.set_index(['user_id', 'product_id'])

#### Setup

In [6]:
exp = setup(df, target='target', categorical_features=['last_purchase'],
            ignore_features=['lag'], train_size=.8)

 
Setup Succesfully Completed!


| | Description | Value |
|---|---|---|
| 0 | session_id | 251 |
| 1 | Target Type | Binary |
| 2 | Label Encoded | |
| 3 | Original Data | (9030454, 6) |
| 4 | Missing Values | False |
| 5 | Numeric Features | 4 |
| 6 | Categorical Features | 1 |
| 7 | Ordinal Features | False |
| 8 | High Cardinality Features | False |
| 9 | High Cardinality Method | |
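To make the effect of `ignore_features=['lag']` concrete, here is a minimal pandas sketch built from the five rows printed by `df.head()`. It only illustrates dropping the column by hand; it is an assumption-level illustration, not a description of how PyCaret handles ignored features internally. The names `sample`, `features`, and `labels` are hypothetical.

```python
import pandas as pd

# Toy frame copied from the five rows shown by df.head() above.
sample = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 1],
        "product_id": [196, 10258, 10326, 12427, 13032],
        "last_purchase": [1.0, 1.0, 0.0, 1.0, 1.0],
        "reordered": [9.0, 8.0, 0.0, 9.0, 2.0],
        "lag": [20.125, 20.125, 78.0, 20.125, 80.5],
        "product_appear": [1.0, 0.9, 0.2, 1.0, 0.3],
        "buy_cnt": [3.0, 3.0, 0.0, 3.0, 1.0],
        "target": [1, 1, 0, 0, 1],
    }
).set_index(["user_id", "product_id"])

# Hand-rolled stand-in for ignore_features=['lag']: drop the column,
# then separate the remaining predictors from the target.
features = sample.drop(columns=["lag", "target"])
labels = sample["target"]

print(list(features.columns))
```

Because `user_id` and `product_id` live in the index, they are carried along for bookkeeping but never reach the model as predictors.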


#### Compare models

Check whether any model stands out in terms of F1 score.

In [7]:
compare_models(blacklist=['knn', 'ridge', 'svm', 'lda', 'nb', 'qda', 'et', 'catboost'],
               fold=2, round=3, sort='F1')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Extreme Gradient Boosting | 0.914 | 0.882 | 0.501 | 0.89 | 0.641 | 0.597 |
| 1 | Light Gradient Boosting Machine | 0.914 | 0.882 | 0.5 | 0.893 | 0.641 | 0.596 |
| 2 | Ada Boost Classifier | 0.914 | 0.881 | 0.501 | 0.886 | 0.64 | 0.595 |
| 3 | Gradient Boosting Classifier | 0.914 | 0.882 | 0.499 | 0.893 | 0.64 | 0.596 |
| 4 | Random Forest Classifier | 0.913 | 0.878 | 0.499 | 0.883 | 0.638 | 0.592 |
| 5 | Decision Tree Classifier | 0.913 | 0.876 | 0.495 | 0.889 | 0.636 | 0.591 |
| 6 | Logistic Regression | 0.848 | 0.705 | 0.035 | 0.588 | 0.066 | 0.05 |


**Observation**: With the 'lag' feature removed, the overall F1 scores stay relatively the same. Extreme Gradient Boosting (XGBoost) becomes the best-performing model. Tree-based models such as Random Forest see a higher performance boost. We will tune XGBoost and Random Forest to see whether they can achieve the same level of performance without the 'lag' feature.
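Since the comparison is sorted by F1, it can help to see how F1 relates to the precision and recall columns. The sketch below recomputes F1 as the harmonic mean of the two, using values copied from the comparison table. The helper name `f1_from_precision_recall` is hypothetical, and because the inputs are already rounded to three decimals, the recomputed value could in general differ from a reported one in the last digit.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table above.
rows = {
    "Extreme Gradient Boosting": (0.890, 0.501),
    "Random Forest Classifier": (0.883, 0.499),
    "Logistic Regression": (0.588, 0.035),
}

f1_scores = {
    name: round(f1_from_precision_recall(p, r), 3)
    for name, (p, r) in rows.items()
}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score}")
```

The harmonic mean is dominated by the smaller of the two inputs, which is why Logistic Regression, with a recall of 0.035, ends up with a very low F1 despite an accuracy of 0.848.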

### Save all the models for final test data submission

In [8]:
path = './data/save_model/'

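def final_save(model, name):
    """Finalize a model and save it as a pkl file.

    Args:
        model: model created with the sample data
        name: file name to save the model under `path`

    Returns:
        None. The model is finalized using all the data and saved to disk.
    """

    # Refit the model on all the data, then save it together with the pipeline
    final = finalize_model(model)
    save_model(final, model_name=path+name)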

### Tune Model

We go straight to model tuning, optimizing for the best F1 score.

In [11]:
rfc_tune = tune_model('rf', fold=2, optimize='F1')
final_save(rfc_tune, 'rfc_tune')

Transformation Pipeline and Model Succesfully Saved


In [12]:
xbc_tune = tune_model('xgboost', fold=2, n_iter=5, optimize='F1')

Transformation Pipeline and Model Succesfully Saved


In [13]:
final_save(xbc_tune, 'xbc_tune')

Transformation Pipeline and Model Succesfully Saved


#### Kaggle Score (tuned model)

| Model | F1 |
|-------|----|
| XGBoost | 0.36049 |
| Random Forest | 0.36025 |

Both models achieved a higher F1 score once the 'lag' feature was dropped.

### Data Submission

Go to the [data_submit](https://github.com/sittingman/instacart_product_repurchase/blob/master/5_data_submit.ipynb) notebook.