# 2023-08-25: AutoML Regression Experiments

## Authors

* Kevin Chu (kevin@velexi.com)

## Overview

This Jupyter notebook demonstrates the use of AutoML to quickly assess multiple common ML regression models for a toy problem. It is based on the tutorial provided by PyCaret at https://www.pycaret.org/tutorials/html/REG101.html.

## History

### 2023-08-25

- Initial version of notebook.

## Experimentation & Development

### Imports

In [1]:
# --- Imports

# External packages
from pycaret import regression
from pycaret.datasets import get_data

### Parameters

In [2]:
# Dataset
dataset_name = "diamond"

# AutoML
experiment_name = "automl-regression-test"
num_best_models = 7
random_seed = 123  # seed for random number generators (for reproducible results)

### Prepare Data

In [3]:
# --- Load dataset

data_df = get_data(dataset_name)

# --- Check DataFrame

print(f"Number of records: {len(data_df.index)}")
print(f"Columns: {list(data_df.columns)}")

,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.1,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171


Number of records: 6000
Columns: ['Carat Weight', 'Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report', 'Price']


In [4]:
# --- Construct hold-out dataset

# Sample 90% of the records for modeling; the rest is held out as unseen data.
# The hold-out rows must be taken from the full dataset before the index is reset.
modeling_df = data_df.sample(frac=0.9, random_state=random_seed)
data_unseen_df = data_df.drop(modeling_df.index).reset_index(drop=True)
data_df = modeling_df.reset_index(drop=True)

print(f"Data for Modeling: {data_df.shape}")
print(f"Unseen Data For Evaluation of Tuned Model: {data_unseen_df.shape}")

Data for Modeling: (5400, 8)
Unseen Data For Evaluation of Tuned Model: (600, 8)
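
The hold-out construction relies on index labels: rows are sampled from the full DataFrame, and the hold-out set is whatever remains after the sampled labels are dropped. The following is a minimal sketch on a toy DataFrame. The helper `split_holdout` and the toy columns `x` and `y` are hypothetical names introduced only for illustration.

```python
import pandas as pd


def split_holdout(df, frac, seed):
    """Split df into a modeling set and a disjoint hold-out set."""
    # Sample first so that the sampled rows keep their original index labels
    modeling = df.sample(frac=frac, random_state=seed)
    # Drop the sampled labels from the full frame to obtain the hold-out rows
    holdout = df.drop(modeling.index)
    # Reset indices only after both subsets have been extracted
    return modeling.reset_index(drop=True), holdout.reset_index(drop=True)


# Toy data (hypothetical values, not the diamond dataset)
toy_df = pd.DataFrame({"x": range(10), "y": [2 * v for v in range(10)]})
modeling_df, holdout_df = split_holdout(toy_df, frac=0.9, seed=123)

print(f"Modeling: {modeling_df.shape}, hold-out: {holdout_df.shape}")
```

Resetting the index before calling `drop` would replace the original labels with 0..n-1, so the sampled rows could no longer be matched against the full frame.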


### Perform AutoML Evaluation

In [5]:
# --- Perform AutoML Evaluation

# Set up the dataset for AutoML regression
regression.setup(
    data=data_df,
    target="Price",
    log_experiment=True,
    experiment_name=experiment_name,
    session_id=random_seed,
)

# Automatically train, test, and evaluate models
best_models = regression.compare_models(n_select=num_best_models, verbose=False)

,Description,Value
0,Session id,123
1,Target,Price
2,Target type,Regression
3,Original data shape,"(5400, 8)"
4,Transformed data shape,"(5400, 29)"
5,Transformed train set shape,"(3779, 29)"
6,Transformed test set shape,"(1621, 29)"
7,Numeric features,1
8,Categorical features,6
9,Preprocess,True


### Analyze Results

In [6]:
# Best models
for model in best_models:
    print(model)
    print()

ExtraTreesRegressor(n_jobs=-1, random_state=123)

RandomForestRegressor(n_jobs=-1, random_state=123)

GradientBoostingRegressor(random_state=123)

LGBMRegressor(n_jobs=-1, random_state=123)

DecisionTreeRegressor(random_state=123)

Ridge(random_state=123)

Lasso(random_state=123)



In [7]:
# Display the score table produced by compare_models()
regression.pull()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,757.9822,2272753.0,1476.2494,0.9783,0.0798,0.06,0.085
rf,Random Forest Regressor,763.173,2656370.0,1582.1759,0.9747,0.0793,0.0592,0.072
gbr,Gradient Boosting Regressor,941.4675,3516412.0,1844.9039,0.9664,0.1026,0.0783,0.027
lightgbm,Light Gradient Boosting Machine,780.8216,3561541.0,1834.3033,0.966,0.0788,0.0569,0.415
dt,Decision Tree Regressor,980.8632,4869726.0,2151.924,0.953,0.1046,0.0761,0.009
ridge,Ridge Regression,2416.2896,14233900.0,3757.7832,0.8622,0.611,0.2797,0.011
lasso,Lasso Regression,2411.9718,14248230.0,3757.7641,0.862,0.6103,0.2788,0.101
llar,Lasso Least Angle Regression,2411.9918,14248210.0,3757.7623,0.862,0.6103,0.2788,0.009
br,Bayesian Ridge,2415.4019,14263390.0,3760.1952,0.8618,0.6133,0.2795,0.011
lr,Linear Regression,2532.9422,15515990.0,3908.8662,0.8494,0.6744,0.294,0.148
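
For reference, the metric columns in the score table can be sketched with their standard textbook definitions. This is an illustrative, standard-library-only sketch, not PyCaret's implementation, which may differ in details such as the handling of zero or negative values. The function name `regression_metrics` and the predicted prices are hypothetical. Note that the per-fold values in the tuning tables average to the `et` row of the score table (for example, the ten `et` fold MAE values average to 757.98), so the table reports fold means. That is why its RMSE column is not exactly the square root of its MSE column.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from paired true/predicted values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2: 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # RMSLE: RMSE computed on log(1 + value); requires non-negative values
    log_errors = [
        math.log1p(pred) - math.log1p(true) for true, pred in zip(y_true, y_pred)
    ]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)

    # MAPE: mean absolute error relative to the true value (as a fraction)
    mape = sum(abs(e) / abs(true) for e, true in zip(errors, y_true)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# True prices are the first five rows of the dataset preview;
# the predictions are made-up values for illustration only.
y_true = [5169, 3470, 3183, 4370, 3171]
y_pred = [5000, 3600, 3100, 4500, 3200]

for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```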


### Tune Promising Models

In [8]:
# --- Extra Trees Regressor

et_model = regression.create_model('et')
et_model_tuned = regression.tune_model(et_model)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,656.8496,1385351.4689,1177.0095,0.9843,0.0735,0.0568
1,695.6708,1482894.3498,1217.7415,0.984,0.0824,0.0599
2,756.1282,1648361.4215,1283.8853,0.986,0.0763,0.059
3,746.1383,1798134.3896,1340.9453,0.982,0.083,0.061
4,832.3255,3552837.2282,1884.8971,0.966,0.0815,0.0614
5,872.6804,4307289.3929,2075.401,0.9645,0.0825,0.0616
6,797.776,2998419.4551,1731.5945,0.9709,0.0846,0.063
7,636.9389,1252903.4129,1119.3317,0.9865,0.0752,0.0574
8,792.4113,2020576.8931,1421.47,0.9805,0.0785,0.0594
9,792.9027,2280759.4279,1510.2183,0.9785,0.0801,0.0603


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,811.572,2562415.8994,1600.7548,0.9709,0.0897,0.0704
1,810.8486,1610032.2983,1268.8705,0.9826,0.095,0.0748
2,901.0224,2448794.0849,1564.8623,0.9792,0.0907,0.0712
3,848.4694,2419126.7919,1555.3542,0.9758,0.0934,0.0695
4,1050.1278,7002519.0515,2646.2273,0.933,0.1037,0.0789
5,994.0563,6696958.3855,2587.8482,0.9448,0.099,0.0723
6,938.738,4533073.7783,2129.1016,0.956,0.1008,0.0766
7,881.9795,5697680.962,2386.9816,0.9386,0.0992,0.0745
8,928.9357,2535956.9188,1592.4688,0.9755,0.1003,0.0774
9,1010.7798,3883245.394,1970.5952,0.9634,0.1028,0.0773


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
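
The message above can be checked against the two fold tables for the Extra Trees Regressor. The sketch below averages the per-fold R2 values of the original model (first table) and the tuned model (second table). It assumes that R2 is the metric used to decide which model is kept; the output does not state this explicitly. The variable names are hypothetical.

```python
from statistics import mean, stdev

# Per-fold R2 values copied from the two tables above
r2_original = [0.9843, 0.984, 0.986, 0.982, 0.966, 0.9645, 0.9709, 0.9865, 0.9805, 0.9785]
r2_tuned = [0.9709, 0.9826, 0.9792, 0.9758, 0.933, 0.9448, 0.956, 0.9386, 0.9755, 0.9634]

mean_original = mean(r2_original)
mean_tuned = mean(r2_tuned)

print(f"Original: mean R2 = {mean_original:.4f}, std = {stdev(r2_original):.4f}")
print(f"Tuned:    mean R2 = {mean_tuned:.4f}, std = {stdev(r2_tuned):.4f}")

# Keep whichever model has the higher mean cross-validated R2
# (assumed selection rule, for illustration only)
kept = "original" if mean_original >= mean_tuned else "tuned"
print(f"Model kept: {kept}")
```

The mean R2 of the original model matches the `et` row of the score table (0.9783), and it exceeds the tuned model's mean, which is consistent with the original model being returned.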


In [9]:
# --- Random Forest Regressor

rf_model = regression.create_model('rf')
rf_model_tuned = regression.tune_model(rf_model)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,656.5914,1314274.657,1146.4182,0.9851,0.0717,0.0554
1,608.9327,1119582.7694,1058.1034,0.9879,0.0741,0.0548
2,726.1218,1624885.2422,1274.7099,0.9862,0.0747,0.0573
3,773.2848,1922794.5468,1386.6487,0.9808,0.0822,0.061
4,846.1857,3810255.1626,1951.9875,0.9635,0.0808,0.0617
5,895.751,5616621.1352,2369.9412,0.9537,0.0874,0.0615
6,801.7685,3874840.9475,1968.4616,0.9624,0.0833,0.0615
7,702.3567,2556332.883,1598.8536,0.9724,0.0772,0.0588
8,803.4151,2040523.4085,1428.4689,0.9803,0.0808,0.0604
9,817.3222,2683589.078,1638.1664,0.9747,0.0811,0.0599


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,803.0107,2236428.7769,1495.4694,0.9746,0.09,0.0687
1,749.1237,1297440.7356,1139.0526,0.986,0.0916,0.0711
2,901.1614,2521887.2106,1588.0451,0.9786,0.0941,0.0722
3,897.0724,2688493.1703,1639.6625,0.9731,0.0985,0.0718
4,1054.8961,6336389.1464,2517.2185,0.9394,0.1037,0.08
5,1028.1158,7024226.8276,2650.3258,0.9421,0.1031,0.073
6,975.1932,4218399.031,2053.8742,0.9591,0.1037,0.0775
7,932.5143,5456801.841,2335.9798,0.9412,0.1059,0.0791
8,949.6886,3002411.2887,1732.7467,0.971,0.106,0.0771
9,989.3482,3720420.2118,1928.8391,0.9649,0.1031,0.076


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [10]:
# --- Gradient Boosting Regressor

gbr_model = regression.create_model('gbr')
gbr_model_tuned = regression.tune_model(gbr_model)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,860.5037,2117421.9211,1455.1364,0.976,0.103,0.0819
1,806.7037,2565487.5621,1601.7139,0.9723,0.0969,0.0715
2,1049.0769,3303350.5342,1817.5122,0.972,0.1034,0.0813
3,885.7624,2184437.4004,1477.9842,0.9782,0.105,0.0792
4,985.2669,5901222.9487,2429.2433,0.9435,0.1064,0.0792
5,1075.709,6140920.631,2478.0881,0.9494,0.102,0.076
6,976.6967,3523560.4822,1877.1149,0.9658,0.1109,0.0875
7,888.8486,2732049.6972,1652.8913,0.9705,0.0998,0.077
8,950.4537,3387457.8659,1840.5048,0.9673,0.1004,0.0745
9,935.6538,3308215.7371,1818.8501,0.9688,0.0982,0.0744


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,681.506,1866890.9087,1366.3422,0.9788,0.0748,0.0565
1,757.7444,2141974.4265,1463.5486,0.9768,0.0843,0.0614
2,841.8786,2530079.8457,1590.6225,0.9785,0.0809,0.0597
3,779.0007,2013067.4534,1418.8261,0.9799,0.0916,0.0644
4,813.1483,2278217.2158,1509.3764,0.9782,0.0833,0.063
5,846.4676,4110585.6319,2027.4579,0.9661,0.0836,0.0598
6,898.9549,3612130.3711,1900.5605,0.965,0.0915,0.0711
7,682.5619,1717493.6279,1310.5318,0.9815,0.0714,0.0545
8,882.923,2731389.7272,1652.6917,0.9736,0.083,0.0632
9,784.4956,2852546.8459,1688.9484,0.9731,0.0777,0.0567


Fitting 10 folds for each of 10 candidates, totalling 100 fits
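
Unlike the two tree ensembles above, the Gradient Boosting Regressor output contains no message saying that the original model was returned. A fold-by-fold comparison of the two tables gives a rough sense of how much tuning helped. The sketch below assumes that fold `i` refers to the same data split in both tables (plausible given the fixed `session_id`, but not confirmed by the output). The variable names are hypothetical.

```python
# Per-fold MAE values copied from the two Gradient Boosting tables above
mae_original = [860.5037, 806.7037, 1049.0769, 885.7624, 985.2669,
                1075.709, 976.6967, 888.8486, 950.4537, 935.6538]
mae_tuned = [681.506, 757.7444, 841.8786, 779.0007, 813.1483,
             846.4676, 898.9549, 682.5619, 882.923, 784.4956]

# Positive values mean that tuning reduced the error on that fold
reductions = [orig - tuned for orig, tuned in zip(mae_original, mae_tuned)]

folds_improved = sum(1 for r in reductions if r > 0)
mean_reduction = sum(reductions) / len(reductions)

print(f"Folds where tuning reduced MAE: {folds_improved} of {len(reductions)}")
print(f"Mean MAE reduction per fold: {mean_reduction:.2f}")
```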
