# PyCaret2 AutoML Regression Example

- https://pycaret.org/

> PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment.

This notebook is a modified version of the PyCaret 2.0 example notebook
- https://github.com/pycaret/pycaret/blob/master/examples/PyCaret%202%20Regression.ipynb


PyCaret v2.2+ supports GPU training
- https://towardsdatascience.com/pycaret-2-2-is-here-whats-new-ad7612ca63b

In [None]:
%%html
<style> 
table { float:left; margin-bottom: 1em; } 
table + * { content: ""; clear: both; display: table; }
</style>

In [None]:
%%time
# BUGFIX: "Cannot uninstall 'llvmlite'" -> install with --ignore-installed
# BUGFIX: "AttributeError: module 'PIL.Image' has no attribute 'Resampling'" -> pin Pillow==9.0.0
!pip3 install pycaret[full] Pillow==9.0.0 --quiet --ignore-installed 2> /dev/null

In [None]:
# check version
from pycaret.utils import version
version()

# Auto EDA

The use of `pandas_profiling.ProfileReport` was inspired by:
- https://www.kaggle.com/rushikeshlavate/perform-eda-automatically

In [None]:
import pandas as pd

data = pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv', index_col='id')
data

In [None]:
%%time
from pandas_profiling import ProfileReport

auto_eda = ProfileReport(data, title="Tabular Data Series - Feb 2021", explorative=True, minimal=False, progress_bar=False)
auto_eda.to_notebook_iframe()  # BUGFIX: https://github.com/pandas-profiling/pandas-profiling/issues/493

# 1. Loading Dataset

- https://pycaret.org/get-data/

> All modules in PyCaret can work directly with a pandas DataFrame. It can consume the DataFrame irrespective of how it is loaded in the environment. See the below example of loading a csv file into the notebook using pandas native functionality.
>
> PyCaret also hosts a repository of open source datasets that were used throughout the documentation for demonstration purposes. These are hosted on PyCaret’s github and can also be directly loaded using the `pycaret.datasets` module.


In [None]:
# from pycaret.datasets import get_data
# data = get_data('insurance')

# 2. Initialize Setup

- https://pycaret.org/setup/

> Depending on the type of experiment you want to perform, one of the six available modules currently supported must be imported in your python environment. Importing a module prepares an environment for a specific task. For example, if you have imported the Classification module, the environment will be set up accordingly to perform classification tasks only.

| S.No | Module | How to Import |
|------|:-------|:--------------|
| 1 | Classification | `from pycaret.classification import *` |
| 2 | Regression | `from pycaret.regression import *` |
| 3 | Clustering | `from pycaret.clustering import *` |
| 4 | Anomaly Detection | `from pycaret.anomaly import *` |
| 5 | Natural Language Processing | `from pycaret.nlp import *` |
| 6 | Association Rule Mining | `from pycaret.arules import *` |



> Note: If you don’t want PyCaret to display the dialogue for confirmation of data types, you may pass `silent=True` within `setup` to perform an unattended run of the experiment. We don’t recommend that unless you are absolutely sure the inference is correct, you have performed the experiment before, or you are overwriting data types using the `numeric_features` and `categorical_features` parameters.

In [None]:
%%time
from pycaret.regression import *
reg1 = setup(data, target='target', session_id=42, log_experiment=False, experiment_name='tabular-playground-feb-2021', silent=True, use_gpu=True)

# 3. Compare Baseline

- https://pycaret.org/compare-models/

> This is the first step we recommend in the workflow of any supervised experiment. This function trains all the models in the model library using default hyperparameters and evaluates performance metrics using cross-validation. It returns the trained model object. 
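
To make "evaluates performance metrics using cross-validation" concrete, here is a minimal plain-Python sketch of how rows can be split into k folds. The helper `kfold_indices` is hypothetical and is not part of PyCaret; it only illustrates the idea behind the `fold=5` argument used below.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(n_rows=10, k=5))
for train_idx, valid_idx in folds:
    # Each row is used for validation exactly once and for training k-1 times
    print(len(train_idx), valid_idx)
```

Each candidate model is fitted once per fold, so a smaller `fold` value trades some reliability of the score for a shorter runtime.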


CatBoost, LGBM and XGBoost are the top three contenders.
- CatBoost has slightly better results, but LGBM is an order of magnitude faster on CPU

In [None]:
%%time
# Slowest models on CPU (CPU time | RMSE | model ID = name | GPU time)
# 139s | RMSE = 0.8593 | rf  = Random Forest Regressor | 14s on GPU
# 67s  | RMSE = 0.8673 | et  = Extra Trees Regressor   | 158s on GPU
# 234s | RMSE = 0.9447 | knn = K Neighbors Regressor   | 0.5s on GPU

# compared_models = compare_models(fold=5, n_select=3, exclude=['rf', 'et', 'knn'])            # CPU
# compared_models = compare_models(fold=5, n_select=5, exclude=['et', 'huber', 'gbr', 'ada'])  # GPU
compared_models = compare_models(fold=5, n_select=10) 

# 4. Create Model

- https://pycaret.org/create-model/


> Creating a model in any module is as simple as writing create_model. It takes only one parameter i.e. the Model ID as a string. For supervised modules (classification and regression) this function returns a table with k-fold cross validated performance metrics along with the trained model object. For the unsupervised clustering module, it returns performance metrics along with the trained model object, and for the remaining unsupervised modules (anomaly detection, natural language processing and association rule mining) it only returns the trained model object. The evaluation metrics used are:
>
> - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
> - Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
>
> The number of folds can be defined using the fold parameter within the create_model function. By default, the fold is set to 10. All the metrics are rounded to 4 decimals by default but this can be changed using the round parameter within create_model. Although there is a separate function to ensemble the trained model, there is also a quick way to ensemble the model while creating it, by using the ensemble parameter along with the method parameter within create_model.
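
For reference, the regression metrics listed above can be sketched in plain Python as follows. This assumes the standard textbook definitions (RMSLE computed with `log1p`, MAPE as a fraction rather than a percentage); the helper `regression_metrics` is hypothetical and PyCaret's own implementation may differ in details.

```python
import math


def regression_metrics(y_true, y_pred, ndigits=4):
    """Compute MAE, MSE, RMSE, R2, RMSLE and MAPE, rounded to ndigits decimals."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - (mse * n) / ss_tot
    # RMSLE requires non-negative targets and predictions
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    # MAPE is undefined when a true value is zero
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    values = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}
    return {name: round(value, ndigits) for name, value in values.items()}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

RMSE is the metric optimized throughout this notebook; lower is better for every metric here except R2.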

In [None]:
%%time
# Settings from a previous tune_model() run
lightgbm = create_model('lightgbm', 
    bagging_fraction=0.7, bagging_freq=7, boosting_type='gbdt',
    class_weight=None, colsample_bytree=1.0, device='gpu',
    feature_fraction=0.5, importance_type='split',
    learning_rate=0.071, max_depth=-1, min_child_samples=70,
    min_child_weight=0.001, min_split_gain=0.8, n_estimators=150,
    n_jobs=-1, num_leaves=100, objective=None, random_state=42,
    reg_alpha=5, reg_lambda=0.15, silent=True, subsample=1.0,
    subsample_for_bin=200000, subsample_freq=0
)

In [None]:
%%time
catboost = create_model('catboost', 
 depth = 8,
 l2_leaf_reg = 4,
 loss_function = 'RMSE',
 border_count = 32,
 random_strength = 0.7,
 task_type = 'GPU',
 n_estimators = 290,
)

In [None]:
%%time
xgboost = create_model('xgboost',
    base_score=0.5, booster='gbtree', colsample_bylevel=1,
    colsample_bynode=1, colsample_bytree=0.5, gamma=0, gpu_id=0,
    importance_type='gain', interaction_constraints='',
    learning_rate=0.274, max_delta_step=0, max_depth=3,
    min_child_weight=3, monotone_constraints='()',
    n_estimators=200, n_jobs=-1, num_parallel_tree=1,
    objective='reg:squarederror', reg_alpha=0.7,
    reg_lambda=0.15, scale_pos_weight=48.30000000000001, subsample=0.7,
    tree_method='gpu_hist', validate_parameters=1, verbosity=0                      
)

In [None]:
%%time
bayesian_ridge = create_model('br',
    alpha_1=0.05, alpha_2=0.0005, alpha_init=None,
    compute_score=False, copy_X=True, fit_intercept=True,
    lambda_1=0.2, lambda_2=0.0005, lambda_init=None, n_iter=300,
    normalize=False, tol=0.001, verbose=False
)

In [None]:
# %%time
# import numpy as np
# lgbms = [create_model('lightgbm', learning_rate=i) for i in np.arange(0.1,1,0.2)]
# # print('len(lgbms)', len(lgbms))

In [None]:
# %%time
# catboosts = [create_model('catboost', learning_rate=i) for i in np.arange(0.1,1,0.2)]
# # print('len(catboosts)', len(catboosts))

# 5. Tune Hyperparameters

- https://pycaret.org/tune-model/


> Tuning hyperparameters of a machine learning model in any module is as simple as writing tune_model. It tunes the hyperparameters of the model passed as an estimator using a random grid search with pre-defined grids that are fully customizable. Optimizing the hyperparameters of a model requires an objective function, which is linked to the target variable automatically in supervised experiments such as Classification or Regression. However, for unsupervised experiments such as Clustering, Anomaly Detection and Natural Language Processing, PyCaret allows you to define a custom objective function by specifying a supervised target variable using the supervised_target parameter within tune_model (see examples below). For supervised learning, this function returns a table with k-fold cross validated scores of common evaluation metrics along with the trained model object. For unsupervised learning, this function only returns the trained model object. The evaluation metrics used for supervised learning are:
>
> - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
> - Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
>
> The number of folds can be defined using the fold parameter within the tune_model function. By default, the fold is set to 10. All the metrics are rounded to 4 decimals by default but this can be changed using the round parameter. The tune_model function in PyCaret is a randomized grid search of a pre-defined search space, hence it relies on the number of iterations over the search space. By default, this function performs 10 random iterations over the search space, which can be changed using the n_iter parameter within tune_model. Increasing the n_iter parameter may increase the training time but often results in a highly optimized model. The metric to be optimized can be defined using the optimize parameter. By default, Regression tasks will optimize R2 and Classification tasks will optimize Accuracy.
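
The effect of `n_iter` can be illustrated with a small random-search sketch in plain Python. Everything here is hypothetical: `random_search` is not a PyCaret function, and `toy_rmse` is a made-up stand-in for a cross-validated RMSE, not a result from this dataset.

```python
import random


def random_search(objective, search_space, n_iter=10, seed=42):
    """Sample n_iter random configurations and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_rmse(params):
    """Made-up objective with its minimum at learning_rate=0.1, num_leaves=100."""
    return 0.84 + abs(params["learning_rate"] - 0.1) + abs(params["num_leaves"] - 100) / 1000


space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.5],
    "num_leaves": [10, 50, 100, 200],
}
few = random_search(toy_rmse, space, n_iter=5)
many = random_search(toy_rmse, space, n_iter=50)
print(few, many)
```

With the same seed, the longer search revisits the first five draws and then keeps sampling, so its best score can only match or improve on the shorter one. That is why the cells below raise `n_iter` to 50 at the cost of runtime.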


In [None]:
%%time
lightgbm = tune_model(lightgbm, n_iter=50, optimize='RMSE')
lightgbm

In [None]:
# %%time
# catboost = tune_model(catboost, n_iter=50, optimize='RMSE')
# catboost.get_params()

In [None]:
# %%time
# xgboost = tune_model(xgboost, n_iter=50, optimize='RMSE')
# xgboost

In [None]:
%%time
bayesian_ridge = tune_model(bayesian_ridge, n_iter=50, optimize='RMSE')
bayesian_ridge

# 6. Ensemble Model

- https://pycaret.org/ensemble-model/

> Ensembling a trained model is as simple as writing ensemble_model. It takes only one mandatory parameter i.e. the trained model object. This function returns a table with k-fold cross validated scores of common evaluation metrics along with the trained model object. The evaluation metrics used are:

> - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
> - Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

> The number of folds can be defined using the fold parameter within the ensemble_model function. By default, the fold is set to 10. All the metrics are rounded to 4 decimals by default but this can be changed using the round parameter. There are two methods available for ensembling that can be set using the method parameter within ensemble_model. Both methods require re-sampling of the data and fitting multiple estimators, hence the number of estimators can be controlled using the n_estimators parameter. By default, n_estimators is set to 10.

> This function is only available in pycaret.classification and pycaret.regression modules.


> **Bagging:**
> Bagging, also known as Bootstrap aggregating, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

 
> **Boosting:**
> Boosting is an ensemble meta-algorithm for primarily reducing bias and variance in supervised learning. Boosting is in the family of machine learning algorithms that convert weak learners to strong ones. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
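
As a rough sketch of the bagging idea, the plain-Python example below fits a simple least-squares line on several bootstrap resamples and averages the predictions. The helpers `fit_line` and `bagged_predict` and the toy data are assumptions for illustration only; `ensemble_model` wraps a real estimator rather than this toy learner.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # degenerate resample with a single distinct x: use a constant
        return mean_y, 0.0
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y - b * mean_x, b


def bagged_predict(xs, ys, x_new, n_estimators=10, seed=42):
    """Average the predictions of n_estimators lines fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_estimators):
        # Bootstrap: draw n row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(a + b * x_new)
    return sum(predictions) / len(predictions)


# Toy data: y = 2x + 1 with a small alternating disturbance
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

single = fit_line(xs, ys)
bagged = bagged_predict(xs, ys, x_new=25.0)
print(single, bagged)
```

Each resample yields a slightly different line; averaging them smooths out that variation, which is the variance reduction described above.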

In [None]:
%%time
dt        = create_model('dt')
bagged_dt = ensemble_model(dt, n_estimators=10, optimize='RMSE')

In [None]:
%%time
bagged_lightgbm = ensemble_model(lightgbm, method='Bagging', n_estimators=10, optimize='RMSE')

In [None]:
%%time
bagged_catboost = ensemble_model(catboost, method='Bagging', n_estimators=10, optimize='RMSE')

In [None]:
%%time
bagged_xgboost = ensemble_model(xgboost, method='Bagging', n_estimators=10, optimize='RMSE')

In [None]:
%%time
bagged_br = ensemble_model(bayesian_ridge, method='Bagging', n_estimators=10, optimize='RMSE')

# 7. Blend Models

- https://pycaret.org/blend-models/


> Blending models is a method of ensembling which uses consensus among estimators to generate final predictions. The idea behind blending is to combine different machine learning algorithms and use a majority vote or the average predicted probabilities in case of classification to predict the final outcome. Blending models in PyCaret is as simple as writing blend_models. This function can be used to blend specific trained models that can be passed using the estimator_list parameter within blend_models or, if no list is passed, it will use all the models in the model library. In case of Classification, the method parameter can be used to define 'soft' or 'hard', where soft uses predicted probabilities for voting and hard uses predicted labels. This function returns a table with k-fold cross validated scores of common evaluation metrics along with the trained model object. The evaluation metrics used are:

> - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
> - Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

> The number of folds can be defined using the fold parameter within the blend_models function. By default, the fold is set to 10. All the metrics are rounded to 4 decimals by default but this can be changed using the round parameter within blend_models.

> This function is only available in pycaret.classification and pycaret.regression modules.
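
For a regression task such as this one there are no labels to vote on, so consensus amounts to averaging the predictions of the individual models. A minimal sketch, with a hypothetical helper `blend_predictions` and made-up prediction values:

```python
def blend_predictions(predictions_by_model, weights=None):
    """Average per-row predictions from several regressors, optionally weighted."""
    names = list(predictions_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # plain mean
    total = sum(weights[name] for name in names)
    n_rows = len(predictions_by_model[names[0]])
    return [
        sum(weights[name] * predictions_by_model[name][row] for name in names) / total
        for row in range(n_rows)
    ]


# Illustrative numbers only, not output from the models trained in this notebook
preds = {
    "lightgbm": [7.0, 8.0, 9.0],
    "catboost": [7.5, 8.5, 8.5],
    "xgboost": [6.5, 7.5, 9.5],
}
blended = blend_predictions(preds)
weighted = blend_predictions(preds, weights={"lightgbm": 2.0, "catboost": 1.0, "xgboost": 1.0})
print(blended, weighted)
```

Averaging helps most when the blended models make different kinds of errors, which is one reason to mix model families rather than several variants of the same one.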

In [None]:
# %%time
# blender = blend_models(optimize='RMSE', estimator_list=[
#     bayesian_ridge,
#     lightgbm,
#     catboost,
#     xgboost,
#     bagged_lightgbm,
#     bagged_catboost,
#     bagged_xgboost,
#     bagged_dt,
# ])

# 8. Stack Models

- https://pycaret.org/stack-models/

> Stacking models is a method of ensembling that uses meta learning. The idea behind stacking is to build a meta model that generates the final prediction using the predictions of multiple base estimators. Stacking models in PyCaret is as simple as writing stack_models. This function takes a list of trained models using the estimator_list parameter. All these models form the base layer of stacking and their predictions are used as an input for a meta model that can be passed using the meta_model parameter. If no meta model is passed, a linear model is used by default. In case of Classification, the method parameter can be used to define 'soft' or 'hard', where soft uses predicted probabilities for voting and hard uses predicted labels. This function returns a table with k-fold cross validated scores of common evaluation metrics along with the trained model object. The evaluation metrics used are:

> - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
> - Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

> The number of folds can be defined using the fold parameter within the stack_models function. By default, the fold is set to 10. All the metrics are rounded to 4 decimals by default but this can be changed using the round parameter within stack_models. The restack parameter controls the ability to expose the raw data to the meta model. By default, it is set to True. When changed to False, the meta-model will only use predictions of base models to generate final predictions.

 
> **Multiple Layer Stacking**
> Base models can be in a single layer or in multiple layers, in which case the predictions from each preceding layer are passed to the next layer as an input until they reach the meta-model, where predictions from all the layers including the base layer are used as an input to generate the final prediction. To stack models in multiple layers, the create_stacknet function accepts the estimator_list parameter as a list within a list. All other parameters are the same. See the below regression example use of the create_stacknet function.

> This function is only available in pycaret.classification and pycaret.regression modules.

> WARNING : This function will be deprecated in a future release of PyCaret 2.x.
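
The role of the meta model can be sketched with two base models and a linear meta model fitted by least squares. The helper `fit_meta_weights` and the toy numbers are hypothetical. A real stacker also fits the meta model on out-of-fold predictions to avoid leakage, which this sketch skips for brevity.

```python
def fit_meta_weights(p1, p2, y):
    """Least-squares weights (w1, w2) for y ~ w1 * p1 + w2 * p2 (no intercept).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    s11 = sum(a * a for a in p1)
    s22 = sum(b * b for b in p2)
    s12 = sum(a * b for a, b in zip(p1, p2))
    t1 = sum(a * t for a, t in zip(p1, y))
    t2 = sum(b * t for b, t in zip(p2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("base predictions are collinear; weights are not unique")
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (t2 * s11 - t1 * s12) / det
    return w1, w2


# Toy base-model predictions; the target is built as 0.25 * p1 + 0.75 * p2
p1 = [1.0, 2.0, 3.0, 4.0]
p2 = [2.0, 1.0, 4.0, 3.0]
y = [0.25 * a + 0.75 * b for a, b in zip(p1, p2)]

w1, w2 = fit_meta_weights(p1, p2, y)
stacked = [w1 * a + w2 * b for a, b in zip(p1, p2)]
print(w1, w2)
```

Unlike blending, the weights are learned from the data, so a base model that is consistently more accurate ends up with more influence on the final prediction.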

In [None]:
%%time
# stacker = stack_models(estimator_list = compare_models(n_select=5, fold=5, include=models(type='ensemble').index.tolist()))
stacker = stack_models(estimator_list=[
    *compared_models,  # compare_models(n_select=10) returns a list of models, so unpack it
    lightgbm,
    catboost,
    xgboost,
    bayesian_ridge,
    bagged_lightgbm,
    bagged_catboost,
    bagged_xgboost,
    bagged_dt,
    bagged_br,
])

# 11. AutoML()

- https://pycaret.org/automl/

> This function returns the best model out of all models created in the current active environment based on the metric defined in the optimize parameter. Run this code at the end of your script.
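
Conceptually, this is a selection over the scores of every model trained so far, taking the metric's direction into account. A small sketch using the CPU RMSE values recorded in the comments of the `compare_models` cell; `pick_best` is a hypothetical helper, not a PyCaret function.

```python
# RMSE values copied from the comments in the compare_models cell of this notebook
cv_rmse = {"rf": 0.8593, "et": 0.8673, "knn": 0.9447}


def pick_best(scores, higher_is_better=False):
    """Return the model ID with the best score for the chosen metric direction."""
    chooser = max if higher_is_better else min
    return chooser(scores, key=scores.get)


best_id = pick_best(cv_rmse)  # RMSE is an error metric, so lower is better
print(best_id)
```

For a metric such as R2 the direction flips, which is why the `optimize` argument matters when calling `automl`.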




In [None]:
%%time
best = automl(optimize='RMSE')
best

# 9. Analyze Model

- https://pycaret.org/plot-model/

> Analyzing performance of trained machine learning model is an integral step in any machine learning workflow. Analyzing model performance in PyCaret is as simple as writing plot_model. The function takes trained model object and type of plot as string within plot_model function.

| Name | Plot |
|:-----|-----:|
| Area Under the Curve | `'auc'` |
| Discrimination Threshold | `'threshold'` |
| Precision Recall Curve | `'pr'` |
| Confusion Matrix | `'confusion_matrix'` |
| Class Prediction Error | `'error'` |
| Classification Report | `'class_report'` |
| Decision Boundary | `'boundary'` |
| Recursive Feature Selection | `'rfe'` |
| Learning Curve | `'learning'` |
| Manifold Learning | `'manifold'` |
| Calibration Curve | `'calibration'` |
| Validation Curve | `'vc'` |
| Dimension Learning | `'dimension'` |
| Feature Importance | `'feature'` |
| Model Hyperparameter | `'parameter'` |

In [None]:
%%time
plot_model(best)
plot_model(lightgbm)

In [None]:
%%time
plot_model(best, plot='error')
plot_model(lightgbm, plot='error')

In [None]:
%%time
plot_model(lightgbm, plot='feature')

In [None]:
%%time
evaluate_model(best)

# 10. Interpret Model

- https://pycaret.org/interpret-model/

> Interpreting complex models is of fundamental importance in machine learning. Model interpretability helps debug the model by analyzing what the model really thinks is important. Interpreting models in PyCaret is as simple as writing interpret_model. The function takes a trained model object and the type of plot as a string. Interpretations are implemented based on SHAP (SHapley Additive exPlanations) and are only available for tree-based models.
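
The quantity behind SHAP plots is the Shapley value: the average marginal contribution of a feature over all orders in which features can be switched from a baseline to their actual values. The brute-force sketch below computes it exactly for a made-up three-feature function. It is for illustration only; the SHAP library relies on much faster model-specific algorithms, and `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for feature in order:
            # Switch this feature from its baseline value to its actual value
            current[feature] = x[feature]
            output = predict(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    """Made-up model with one interaction term between features a and c."""
    a, b, c = features
    return 2.0 * a + 3.0 * b + a * c


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions always sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term `a * c` is shared equally between the two features involved.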

In [None]:
%%time
interpret_model(lightgbm)

In [None]:
%%time
interpret_model(lightgbm, plot='correlation')

In [None]:
%%time
interpret_model(lightgbm, plot='reason', observation=12)

# 12. Predict Model

- https://pycaret.org/predict-model/

> Once a model is successfully deployed either on cloud using deploy_model or locally using save_model, it can be used to predict on unseen data using the predict_model function. This function takes a trained model object and the dataset to predict. It will automatically apply the entire transformation pipeline created during the experiment. For classification, predicted labels are created based on 50% probability, but if you choose to use a different threshold that you may have obtained using optimize_threshold, you can pass the probability_threshold parameter within predict_model. This function can also be used to generate predictions on the hold-out / test set.

In [None]:
%%time
pred_holdouts = predict_model(best)
pred_holdouts.head()

In [None]:
%%time
# new_data = data.copy()
# new_data.drop(['charges'], axis=1, inplace=True)

new_data = pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv', index_col='id')
predict_new = predict_model(best, data=new_data)
predict_new.head()

In [None]:
submission_df = pd.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv', index_col='id')
submission_df['target'] = predict_new['Label']
submission_df.to_csv('submission.csv')
!head submission.csv

# 13. Save / Load Model

- https://pycaret.org/save-model/

> Saving a trained model in PyCaret is as simple as writing save_model. The function takes a trained model object and saves the entire transformation pipeline and trained model object as a transferable binary pickle file for later use.
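
The point of saving the pipeline together with the model is that the restored object reproduces the same predictions without repeating the preprocessing by hand. A minimal round-trip sketch with the standard `pickle` module, where the dictionary `pipeline` and the function `predict` are hypothetical stand-ins for a fitted PyCaret pipeline:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline: scaling parameters plus a linear model
pipeline = {"scale": {"mean": 3.0, "std": 2.0}, "coef": 1.5, "intercept": 0.5}


def predict(pipe, x):
    """Apply the stored preprocessing, then the stored model."""
    z = (x - pipe["scale"]["mean"]) / pipe["scale"]["std"]
    return pipe["coef"] * z + pipe["intercept"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best-model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(pipeline, handle)  # serialize preprocessing and model together
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(predict(restored, 5.0))
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.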

In [None]:
save_model(best, model_name='best-model')

In [None]:
loaded_bestmodel = load_model('best-model')
print(loaded_bestmodel)

In [None]:
from sklearn import set_config
set_config(display='diagram')
loaded_bestmodel[0]

In [None]:
from sklearn import set_config
set_config(display='text')

# 14. Deploy Model

- https://pycaret.org/deploy-model/

> Once a model is finalized using finalize_model, it’s ready for deployment. A trained model can be consumed locally using the save_model functionality, which saves the transformation pipeline and trained model as a binary pickle file that can be consumed by end user applications. Alternatively, models can be deployed on cloud using PyCaret. Deploying a model on cloud is as simple as writing deploy_model.

In [None]:
# deploy_model(best, model_name = 'best-aws', authentication = {'bucket' : 'pycaret-test'})

# 15. Get Config / Set Config

- https://pycaret.org/get-config/
- https://pycaret.org/set-config/

> These functions are used to access global environment variables.

In [None]:
# X_train = get_config('X_train')
# X_train.head()

In [None]:
# get_config('seed')

In [None]:
# from pycaret.regression import set_config
# set_config('seed', 999)

In [None]:
# get_config('seed')

# 16. MLFlow UI

- https://pycaret.org/mlflow/

> PyCaret 2.0 embeds the MLflow Tracking component as a backend API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. To start logging your experiments, set the log_experiment parameter within setup to True and define the experiment name using the experiment_name parameter.

In [None]:
# !mlflow ui

# End
> Thank you. For more information / tutorials on PyCaret, please visit https://www.pycaret.org

# Further Reading

This notebook is part of a series exploring:

[Tabular Playground - Jan 2021](https://www.kaggle.com/c/tabular-playground-series-jan-2021)
- 0.72746 / 0.72935 - [scikit-learn Ensemble](https://www.kaggle.com/jamesmcguigan/tabular-playground-scikit-learn-ensemble)
- 0.71552 / 0.71659 - [Fast.ai Tabular Solver](https://www.kaggle.com/jamesmcguigan/fast-ai-tabular-solver)
- 0.70317 / 0.70426 - [XGBoost](https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost)
- 0.70011 / 0.70181 - [LightGBM](https://www.kaggle.com/jamesmcguigan/tabular-playground-lightgbm)

[Tabular Playground - Feb 2021](https://www.kaggle.com/c/tabular-playground-series-feb-2021)
- 0.84452 - [PyCaret2 AutoML Regression](https://www.kaggle.com/jamesmcguigan/tps-pycaret2-automl-regression)

If you enjoyed this notebook, please upvote!