# Model Type Selection with PyCaret

This notebook explains how to use `PyCaret` to build multiple models and compare them to pick the best model for your data.

It builds and evaluates a model that predicts arrival delay for flights departing New York City (NYC) in 2013.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [numpy](https://numpy.org/doc/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [pycaret](https://pycaret.readthedocs.io/en/latest/index.html)
    * [pycaret.regression](https://pycaret.readthedocs.io/en/latest/api/regression.html)
    * [pycaret.utils](https://pycaret.readthedocs.io/en/latest/index.html)

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import pycaret.regression as pycr
import pycaret.utils as pycu

## Reading the data

The data is from `rdatasets` imported using the Python package `statsmodels`.

In [2]:
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object

## Feature Engineering

### Handle null values

In [3]:
df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

As this model will predict arrival delay, the `Null` values in `arr_delay` correspond to flights that were cancelled or diverted. These can be excluded from this analysis.

In [4]:
df.dropna(inplace=True)
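
Note that `dropna()` without arguments removes a row if any column is missing, not only the target. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the dataset) shows how this differs from restricting the check to `arr_delay` with the `subset` parameter:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one complete row, one row missing only
# tailnum, and one row missing the target arr_delay.
toy = pd.DataFrame({
    "arr_delay": [12.0, -3.0, np.nan],
    "tailnum": ["AAA111", None, "CCC333"],
})

dropped_any = toy.dropna()                          # removes the last two rows
dropped_target = toy.dropna(subset=["arr_delay"])   # removes only the last row

print(len(dropped_any), len(dropped_target))
```

Which variant is appropriate depends on whether the other columns with missing values are used as features.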

### Convert the times from floats or ints to hours and minutes

In [5]:
# Times are stored as HHMM numbers (e.g. 517 means 05:17), so integer
# division by 100 gives the hour and the remainder gives the minute.
df['arr_hour'] = (df.arr_time // 100).astype(int)
df['arr_minute'] = (df.arr_time % 100).astype(int)
df['sched_arr_hour'] = (df.sched_arr_time // 100).astype(int)
df['sched_arr_minute'] = (df.sched_arr_time % 100).astype(int)
df['sched_dep_hour'] = (df.sched_dep_time // 100).astype(int)
df['sched_dep_minute'] = (df.sched_dep_time % 100).astype(int)
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
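
The time columns encode clock times as HHMM numbers, so `517.0` stands for 05:17. As a small worked example on made-up values, the split into hours and minutes can be checked like this:

```python
import pandas as pd

# Clock times encoded as HHMM numbers, in the same style as arr_time.
# The values are made up for illustration.
hhmm = pd.Series([517.0, 1004.0, 2359.0, 5.0])

hours = (hhmm // 100).astype(int)     # 517 -> 5
minutes = (hhmm % 100).astype(int)    # 517 -> 17

print(pd.DataFrame({"hhmm": hhmm, "hour": hours, "minute": minutes}))
```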

As `PyCaret` can use large amounts of memory, we will randomly select **100,000** rows for this comparison, reserving the remaining rows as a test set.

In [6]:
dftrain = df.sample(n=100000, random_state=1066)
dftest = df.drop(dftrain.index)
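
The `sample` and `drop` pair gives two disjoint sets as long as the index labels are unique, which still holds after `dropna` because it removes rows without relabelling the rest. A minimal sketch on a toy frame (the name `toy` and the sizes are illustrative):

```python
import pandas as pd

# Toy frame standing in for the cleaned flights data.
toy = pd.DataFrame({"x": range(10)})

train = toy.sample(n=6, random_state=1066)   # rows drawn without replacement
test = toy.drop(train.index)                 # every row that was not drawn

# The two parts do not overlap and together cover the original frame.
assert train.index.intersection(test.index).empty
assert len(train) + len(test) == len(toy)
```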

## Fit the models

Set up the `PyCaret` environment. **session_id** is equivalent to **random_state** in `scikit-learn` and allows the experiment to be repeated.

In [7]:
pycaret_experiment = pycr.setup(
    data=dftrain,
    target="arr_delay",
    session_id=1066,
    ignore_features=['flight', 'tailnum', 'time_hour', 'year', 'dep_time',
                     'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'],
)

| | Description | Value |
|---|---|---|
| 0 | session_id | 1066 |
| 1 | Target | arr_delay |
| 2 | Original Data | (100000, 25) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 9 |
| 5 | Categorical Features | 6 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | |
| 9 | Transformed Train Set | (69999, 162) |
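
The transformed train set has far more columns (162) than the original data (25), and about 70% of its rows. This is consistent with `setup` holding out roughly 30% of the rows and one-hot encoding categorical features such as `carrier`, `origin` and `dest`. Both points are assumptions about PyCaret's defaults rather than something the output states. The column growth can be sketched with `pandas.get_dummies` on a made-up frame:

```python
import pandas as pd

# Toy frame with two categorical columns; the values are illustrative.
toy = pd.DataFrame({
    "carrier": ["UA", "AA", "B6", "UA"],
    "origin": ["EWR", "JFK", "JFK", "LGA"],
    "distance": [1400, 1089, 1576, 733],
})

# One indicator column per distinct category value.
encoded = pd.get_dummies(toy, columns=["carrier", "origin"])

print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

Three carriers and three origins turn two categorical columns into six indicator columns, so a column like `dest` with many distinct airports accounts for a large share of the growth.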


Calling `compare_models` trains about 20 models and shows their **MAE**, **MSE**, **RMSE**, **R^2**, **RMSLE**, and **MAPE**, highlighting the best-performing model on each metric. Passing `sort='RMSE'` ranks the table by RMSE, and the function returns the top-ranked model.

In [8]:
best = pycr.compare_models(sort='RMSE')

| ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Regressor | 1.9788 | 36.7907 | 5.5109 | 0.9818 | 0.2145 | 0.1533 | 4.07 |
| rf | Random Forest Regressor | 1.0965 | 38.4911 | 5.629 | 0.9809 | 0.1149 | 0.0655 | 25.03 |
| et | Extra Trees Regressor | 1.3357 | 42.8113 | 6.1431 | 0.9788 | 0.151 | 0.0817 | 35.599 |
| xgboost | Extreme Gradient Boosting | 3.2131 | 40.0363 | 6.1862 | 0.98 | 0.3423 | 0.2805 | 17.131 |
| dt | Decision Tree Regressor | 1.7987 | 54.8896 | 6.8635 | 0.9728 | 0.2031 | 0.1283 | 0.377 |
| lightgbm | Light Gradient Boosting Machine | 6.272 | 165.4157 | 12.5736 | 0.9176 | 0.5248 | 0.4355 | 1.281 |
| gbr | Gradient Boosting Regressor | 14.3364 | 561.705 | 23.6655 | 0.7183 | 1.1623 | 0.8105 | 5.796 |
| br | Bayesian Ridge | 25.1405 | 1707.8873 | 41.3148 | 0.1422 | 1.3191 | 1.8122 | 0.935 |
| ridge | Ridge Regression | 25.1445 | 1708.1058 | 41.3174 | 0.1421 | 1.3159 | 1.8226 | 0.065 |
| lr | Linear Regression | 25.1495 | 1708.5206 | 41.3225 | 0.1418 | 1.3159 | 1.8245 | 0.437 |
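
For reference, the six metrics in the table can be sketched with their textbook definitions in `numpy`. The arrays below are made up, and all values are positive so that the logarithm in RMSLE and the division in MAPE are well defined. How PyCaret itself treats negative or zero delays in those two metrics is not covered by this sketch.

```python
import numpy as np

# Made-up actual and predicted delays in minutes.
y_true = np.array([10.0, 20.0, 40.0, 80.0])
y_pred = np.array([12.0, 18.0, 44.0, 70.0])

err = y_true - y_pred

mae = np.mean(np.abs(err))                   # mean absolute error
mse = np.mean(err ** 2)                      # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
mape = np.mean(np.abs(err / y_true))         # as a fraction, not a percent

for name, value in [("MAE", mae), ("MSE", mse), ("RMSE", rmse),
                    ("R2", r2), ("RMSLE", rmsle), ("MAPE", mape)]:
    print(f"{name}: {value:.4f}")
```

Because MSE and RMSE square the errors, a few badly predicted flights move them much more than they move MAE, which is one reason the models can rank differently on different columns.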


Calling `create_model` with one of the model IDs above trains that model, which can then be used like any other model. The output shows the metrics for each of the 10 cross-validation folds.

In [9]:
catboost = pycr.create_model('catboost')

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 2.0394 | 95.0439 | 9.749 | 0.9536 | 0.2097 | 0.1461 |
| 1 | 1.9653 | 13.1927 | 3.6322 | 0.9936 | 0.2171 | 0.1557 |
| 2 | 2.0037 | 12.6137 | 3.5516 | 0.9933 | 0.2239 | 0.1662 |
| 3 | 2.0038 | 17.1477 | 4.141 | 0.9914 | 0.2165 | 0.1585 |
| 4 | 1.8106 | 9.8252 | 3.1345 | 0.9948 | 0.209 | 0.1422 |
| 5 | 1.9573 | 37.8786 | 6.1546 | 0.9814 | 0.2032 | 0.153 |
| 6 | 2.0721 | 93.5713 | 9.6732 | 0.9533 | 0.208 | 0.1536 |
| 7 | 1.9391 | 15.2617 | 3.9066 | 0.9926 | 0.2142 | 0.1474 |
| 8 | 2.0233 | 62.9052 | 7.9313 | 0.9699 | 0.2108 | 0.1473 |
| 9 | 1.9735 | 10.4671 | 3.2353 | 0.9942 | 0.2328 | 0.1628 |
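
The per-fold rows can be tied back to the `compare_models` table: averaging the ten fold RMSE values gives the 5.5109 reported for CatBoost there. The sketch below assumes the comparison table reports the simple mean across folds, which these numbers are consistent with:

```python
from statistics import mean

# RMSE of each of the 10 folds, copied from the create_model output above.
fold_rmse = [9.749, 3.6322, 3.5516, 4.141, 3.1345,
             6.1546, 9.6732, 3.9066, 7.9313, 3.2353]

mean_rmse = mean(fold_rmse)

print(round(mean_rmse, 4))
print(min(fold_rmse), max(fold_rmse))
```

The spread between the best and worst fold (3.1345 versus 9.749) is a reminder that a single RMSE figure hides considerable variation between folds.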


Evaluate this model on the hold-out sample.

In [10]:
predict_result = pycr.predict_model(catboost)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | CatBoost Regressor | 1.9158 | 30.7283 | 5.5433 | 0.9849 | 0.2069 | 0.1498 |


This model was built with the default hyperparameters. `tune_model` searches for a better set of hyperparameters and returns the tuned model.

In [11]:
catboost = pycr.tune_model(catboost)

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 2.3507 | 116.4995 | 10.7935 | 0.9431 | 0.2424 | 0.1784 |
| 1 | 2.5108 | 22.7701 | 4.7718 | 0.989 | 0.2792 | 0.2172 |
| 2 | 2.2563 | 26.8256 | 5.1793 | 0.9858 | 0.2707 | 0.1974 |
| 3 | 2.4275 | 47.8128 | 6.9147 | 0.9761 | 0.2687 | 0.2114 |
| 4 | 2.2783 | 13.9257 | 3.7317 | 0.9927 | 0.2747 | 0.2036 |
| 5 | 2.3571 | 50.3552 | 7.0961 | 0.9752 | 0.2491 | 0.1948 |
| 6 | 2.5718 | 173.7631 | 13.1819 | 0.9133 | 0.2706 | 0.2126 |
| 7 | 2.4011 | 53.7416 | 7.3309 | 0.974 | 0.2814 | 0.2061 |
| 8 | 2.6607 | 53.3262 | 7.3025 | 0.9744 | 0.278 | 0.2101 |
| 9 | 2.2036 | 14.4401 | 3.8 | 0.992 | 0.2585 | 0.1939 |


Evaluate the tuned model on the hold-out sample.

In [12]:
tuned_result = pycr.predict_model(catboost)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | CatBoost Regressor | 2.3154 | 48.661 | 6.9757 | 0.9761 | 0.2608 | 0.1978 |
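
Tuning is not guaranteed to help. In this run the tuned model scores worse on the hold-out sample than the default one, which a quick comparison of the two `predict_model` result rows makes explicit. The dictionary names below are illustrative:

```python
# Hold-out metrics copied from the two predict_model outputs above.
default_scores = {"MAE": 1.9158, "RMSE": 5.5433, "R2": 0.9849}
tuned_scores = {"MAE": 2.3154, "RMSE": 6.9757, "R2": 0.9761}

# Relative change of each metric after tuning.
changes = {
    name: (tuned_scores[name] - default_scores[name]) / default_scores[name]
    for name in default_scores
}

for name, change in changes.items():
    print(f"{name}: {default_scores[name]} -> {tuned_scores[name]} ({change:+.1%})")
```

When this happens, one reasonable option is to keep the untuned model under a separate variable name so that either one can be finalized.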


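Finalize the model for deployment by refitting it on all of the data, including the hold-out sample. Note that `catboost` now refers to the tuned model, because the `tune_model` cell reassigned the name.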

In [13]:
final_catboost = pycr.finalize_model(catboost)

Use this final model to predict on the observations that were not sampled above.

In [14]:
predictions = pycr.predict_model(final_catboost, data=dftest)

Check the **R^2** for these predictions.

In [15]:
pycu.check_metric(predictions.arr_delay, predictions.Label, 'R2')

0.9822