# PyCaret — the library for low-code ML

Train, visualize, evaluate, interpret, and deploy models with minimal code

When we approach supervised machine learning problems, it can be tempting to just see how a random forest or gradient boosting model performs and stop experimenting if we are satisfied with the results. What if you could compare many different models with just one line of code? What if you could reduce each step of the data science process from feature engineering to model deployment to just a few lines of code?

This is exactly where PyCaret comes into play. PyCaret is a high-level, low-code Python library that makes it easy to compare, train, evaluate, tune, and deploy machine learning models with only a few lines of code. At its core, PyCaret is a large wrapper around many data science libraries such as scikit-learn, Yellowbrick, SHAP, Optuna, and spaCy. You could use these libraries directly for the same tasks, but if you don’t want to write a lot of code, PyCaret can save you a lot of time.

# Installing PyCaret

To install the default, smaller version of PyCaret with only the required dependencies, you can run the following command.

In [None]:
!pip install pycaret

# Import Libraries

In the code below, we import NumPy and pandas to handle the data for this demonstration.

In [None]:
import numpy as np
import pandas as pd

# Read the Data

For this example, we use the House Prices: Advanced Regression Techniques dataset available on Kaggle. The code below reads the training file into a dataframe and displays its first five rows.

In [None]:
housing_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
housing_data.head()

In [None]:
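# Collect the columns that are stored as strings (dtype object).
categorical = []
for i in housing_data.columns:
    if housing_data[i].dtype == 'object':
        categorical.append(i)
# MSSubClass is stored as a number but identifies the dwelling type,
# so it is treated as categorical and added before counting.
categorical.append('MSSubClass')
print("Categorical attributes: {}\n".format(len(categorical)))
for name in categorical:
    print(name)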


In [None]:
(housing_data[categorical].nunique()).sort_values(ascending=False)

In [None]:
for i in categorical:
    print(i)
    print(housing_data[i].value_counts())
    print()


In [None]:
housing_data.shape

The output above gives us an idea of what the data looks like. The data contains a mix of numerical and categorical features. The target column that we are trying to predict is the SalePrice column. The training file contains a total of 1460 observations.
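The split into categorical and numerical columns can also be done without an explicit loop. The following is a minimal sketch on a made-up three-row frame: the column names come from the housing data, but the values and the names toy, categorical_cols, and numeric_cols are invented for illustration.

```python
import pandas as pd

# Toy frame: column names mirror the housing data, values are invented.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "LotArea": [8450, 9600, 11250],
})

# Every column that is not numeric is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# MSSubClass is a numeric code for the dwelling type, so add it by hand.
categorical_cols.append("MSSubClass")

# Whatever remains is treated as numerical.
numeric_cols = [c for c in toy.columns if c not in categorical_cols]

print("categorical:", categorical_cols)
print("numerical:", numeric_cols)
```

On the real dataframe the same two lines would replace the loop, with housing_data in place of toy.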

# Initialize Experiment

Now that we have the data, we can initialize a PyCaret experiment, which will preprocess the data and enable logging for all of the models that we will train on this dataset.

In [None]:
from pycaret.regression import *
# The argument names below follow the PyCaret 2.x API; later releases
# renamed or removed some of them (for example ignore_low_variance).
reg_experiment = setup(
    housing_data,
    target='SalePrice',
    session_id=42,
    experiment_name='me_housing',
    ignore_features=['Id'],
    normalize=True,
    transformation=True,
    remove_multicollinearity=True,  # drop one of two features that are highly correlated with each other
    ignore_low_variance=True,  # remove categorical features with statistically insignificant variance
    combine_rare_levels=True,  # merge categorical levels rarer than rare_level_threshold into a single level
    transform_target=True,
    categorical_features=categorical,
    ordinal_features={
        'Utilities': ['AllPub', 'NoSeWa'],
        'LandSlope': ['Gtl', 'Mod', 'Sev'],
        'OverallQual': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
        'MoSold': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
    },
    high_cardinality_features=['Neighborhood', 'Exterior2nd', 'MSSubClass', 'Exterior1st'],
)
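To make the remove_multicollinearity option more concrete, here is a rough sketch of the idea using pandas only. It is not PyCaret's implementation: the 0.9 threshold, the synthetic data, and the rule of dropping the later column of each correlated pair are assumptions made for this example, and PyCaret's own rule for choosing which feature to drop may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
area = rng.normal(1500, 300, size=200)

# Synthetic data: the second column repeats the first in different units.
toy = pd.DataFrame({
    "GrLivArea": area,
    "GrLivArea_m2": area * 0.0929,
    "YearBuilt": rng.integers(1900, 2010, size=200).astype(float),
})

threshold = 0.9  # assumed cut-off for this sketch

# Absolute pairwise correlations between all columns.
corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Drop the later column of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```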

# Compare Baseline Models

We can compare different baseline models at once to find the model that achieves the best K-fold cross-validation performance with the compare_models function as shown in the code below. 

In [None]:
best_model = compare_models()

The function produces a data frame with the performance statistics for each model and highlights the metrics for the best performing model, which in this case was the CatBoost regressor.
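Conceptually, this comparison amounts to cross-validating each candidate model and ranking the models by their mean score. The sketch below shows that idea with plain scikit-learn on synthetic data. The three candidate models, the 10 folds, and the R² scoring are choices made for this example and are not taken from PyCaret's internals.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for the housing dataframe.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A small, hand-picked set of candidates (PyCaret compares many more).
candidates = {
    "lr": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean R² over 10 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="r2").mean()
    for name, model in candidates.items()
}

# Rank the candidates from best to worst mean score.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: mean R2 = {scores[name]:.3f}")
```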

# Creating a Model

We can also train a model in just a single line of code with PyCaret. The create_model function simply requires a string corresponding to the type of model that you want to train. 

In [None]:
catboost = create_model('catboost')

The create_model function produces the dataframe above with cross-validation metrics for the trained CatBoost model.

# Hyperparameter Tuning

Now that we have a trained model, we can optimize it even further with hyperparameter tuning. With just one line of code, we can tune the hyperparameters of this model.

In [None]:
tuned_catboost = tune_model(catboost, optimize = 'MSE')

The most important results, in this case, the average metrics, are highlighted in yellow.
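To give a feel for what a tuning step like this involves, the sketch below runs a random search over a small hyperparameter grid with scikit-learn. It is an illustration under stated assumptions, not PyCaret's code: the gradient boosting model stands in for CatBoost, and the search space, the 10 iterations, and the 5 folds are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data stands in for the housing dataframe.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hypothetical search space; each iteration samples one combination.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,                         # number of sampled combinations
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=5,
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params)
print(f"cross-validated MSE: {best_mse:.2f}")
```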

# Visualizing the Model’s Performance

There are many plots that we can create with PyCaret to visualize a model’s performance. PyCaret uses another high-level library called Yellowbrick for building these visualizations.

## Residual Plot

The plot_model function will produce a residual plot by default for a regression model as demonstrated below.

In [None]:
plot_model(tuned_catboost)

## Prediction Error

We can also visualize the predicted values against the actual target values by creating a prediction error plot.

In [None]:
plot_model(tuned_catboost, plot = 'error')

The plot above is particularly useful because it gives us a visual representation of the R² coefficient for the CatBoost model. In a perfect scenario (R² = 1), where the predicted values exactly matched the actual target values, this plot would simply contain points along the dashed identity line.
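The R² coefficient itself is easy to compute by hand: it is one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a small sketch with made-up numbers (r_squared, actual, perfect, and mean_only are names invented for this example). It shows the two reference cases: a perfect prediction, and a model that always predicts the mean.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


actual = [100.0, 150.0, 200.0, 250.0]  # made-up target values

# Predictions equal to the targets lie exactly on the identity line.
perfect = r_squared(actual, actual)

# Always predicting the mean of the targets (175.0) explains no variance.
mean_only = r_squared(actual, [175.0] * 4)

print(perfect, mean_only)
```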

## Feature Importances

We can also visualize the feature importances for a model as shown below.

In [None]:
plot_model(tuned_catboost, plot = 'feature')

The plot above ranks the features by how much each one contributes to the model’s predictions of the sale price. The names on the axis come from the preprocessed data, so they may differ from the raw column names.

## Evaluating the Model Using All Plots

We can also create multiple plots for evaluating a model with the evaluate_model function.

In [None]:
evaluate_model(tuned_catboost)

# Interpreting the Model

In [None]:
interpret_model(tuned_catboost)

The interpret_model function is a useful tool for explaining the predictions of a model. It uses SHAP, a library for explainable machine learning.

With just one line of code, we can create a SHAP beeswarm plot for the model.

Based on the plot above, we can see that the GrLivArea field has the greatest impact on the predicted house value.

# AutoML

PyCaret also has a function for running automated machine learning (AutoML). We can specify the loss function or metric that we want to optimize and then just let the library take over as demonstrated below.

In [None]:
automl_model = automl(optimize = 'MSE')

The AutoML model also happens to be a CatBoost regressor, which we can confirm by printing out the model.

In [None]:
automl_model

# Generating Predictions

The predict_model function allows us to generate predictions by either using data from the experiment or new unseen data.

In [None]:
pred_holdouts = predict_model(automl_model)
pred_holdouts.head()

Called without new data, the predict_model function above produces predictions for the hold-out set that setup split off from the training data. The code also gives us a dataframe with performance statistics for the predictions generated by the AutoML model.
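The idea of scoring a model on data it has never seen can be sketched with scikit-learn alone. This is not how PyCaret is implemented internally. The 70/30 split, the linear model, and the synthetic data are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the housing dataframe.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Set part of the data aside before any training happens.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Fit on the training portion only.
model = LinearRegression().fit(X_train, y_train)

# Score on the hold-out rows, which the model has never seen.
holdout_pred = model.predict(X_holdout)
holdout_mse = mean_squared_error(y_holdout, holdout_pred)
holdout_r2 = r2_score(y_holdout, holdout_pred)

print(f"hold-out MSE: {holdout_mse:.2f}")
print(f"hold-out R2: {holdout_r2:.3f}")
```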

# Saving the Model

PyCaret also allows us to save trained models with the save_model function. This function saves the transformation pipeline together with the trained model to a pickle file.

In [None]:
save_model(automl_model, model_name='./automl-model')

We can also load the saved AutoML model with the load_model function.

In [None]:
loaded_model = load_model('./automl-model')
print(loaded_model)

Printing out the loaded model displays the full pipeline, that is, the preprocessing steps followed by the trained estimator.
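The save-and-load round trip can be illustrated with a plain scikit-learn pipeline and Python's pickle module. This is a sketch of the general idea rather than PyCaret's own code. The toy pipeline, the synthetic data, and the file name toy-pipeline.pkl are invented for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Preprocessing and model live in one object, so they are saved together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

# Write the fitted pipeline to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(pipeline.predict(X), restored.predict(X))
print(restored)
print("same predictions:", same_predictions)
```

Because the scaler is stored inside the pipeline, new data can be passed to the restored object without repeating the preprocessing by hand.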

# Pros and Cons of Using PyCaret

While PyCaret is a great tool, it comes with its own pros and cons that you should be aware of if you plan to use it for your data science projects.

**Pros**
- Low-code library.
- Great for simple, standard tasks and general-purpose machine learning.
- Provides support for regression, classification, natural language processing, clustering, anomaly detection, and association rule mining.
- Makes it easy to create and save complex transformation pipelines for models.
- Makes it easy to visualize the performance of your model.

**Cons**
- As of now, PyCaret is not ideal for text classification because the NLP utilities are limited to topic modeling algorithms.
- PyCaret is not ideal for deep learning and doesn’t use Keras or PyTorch models.
- We can’t perform more complex machine learning tasks such as image classification and text generation with PyCaret.
- By using PyCaret, we are sacrificing a certain degree of control for simple and high-level code.