# Low Code ML Workshop with PyCaret

### Vanderlei Munhoz - IBM Hybrid Cloud Build Team

<hr>

# About PyCaret

PyCaret is an open source, low-code machine learning library in Python that aims to **reduce cycle time from hypothesis to insights**.

Reference: https://pycaret.readthedocs.io/en/latest/index.html

<hr>

# Installing Python packages

In [None]:
!pip install pycaret shap

<hr>

# Loading Datasets

1. PyCaret ships sample datasets that can be loaded with the **pycaret.datasets.get_data** function.
2. All PyCaret modules work directly with a pandas DataFrame, irrespective of how it was loaded into the environment.

In [None]:
# Using the PyCaret sample data repository
from pycaret.datasets import get_data

dataset = get_data('credit')

In [None]:
# Importing data using Pandas:
import pandas as pd

!wget https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/credit.csv
dataset = pd.read_csv('./credit.csv')
dataset.head()

<hr>

# Analyzing Datasets

In [None]:
# Using standard Pandas methods on the dataframes:
dataset.info()

In [None]:
# Using standard Pandas methods on the dataframes:
dataset.describe()

<hr>

# Plotting Variables

In [None]:
# Using seaborn (imported as sns) for plotting:
import seaborn as sns

sns.pairplot(dataset[['SEX', 'AGE', 'EDUCATION', 'default']])

<hr>

# Visualizing Correlation

In [None]:
# Using matplotlib and sns for plotting correlation between variables:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

corr = dataset.corr()

sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.figure(figsize=(20,10))

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask, 1)] = True

a = sns.heatmap(corr, mask=mask, annot=True, fmt='.2f')

<hr>

# Dataset Splitting for Training & Testing

In [None]:
# Split into training and testing sets according to the frac value; random_state makes the split reproducible
train_data = dataset.sample(frac=0.8, random_state=2130912)
test_data = dataset.drop(train_data.index)

In [None]:
# Reset DataFrame index values
train_data.reset_index(inplace=True, drop=True)
test_data.reset_index(inplace=True, drop=True)

In [None]:
train_data

In [None]:
test_data

<hr>

# PyCaret Setup

Before calling any PyCaret function, first we need to run the **setup** method.

This function initializes the training environment and creates the transformation pipeline. 

It takes two mandatory parameters: **data** and **target**. All the other parameters are optional.

Reference: https://pycaret.readthedocs.io/en/latest/api/classification.html

In [None]:
!pip install numpy

In [None]:
from pycaret.classification import *

clf = setup(
    data=train_data, 
    target='default', 
    session_id=123,
    normalize=True,
    normalize_method='robust',
    pca=False,  # If set to True, dimensionality reduction is performed
    pca_method='linear',  # 'linear' performs Singular Value Decomposition
    pca_components=0.8,  # A float is a retention target (about 80% of the information), not a share of the original features
    remove_multicollinearity=True,  # remove features with inter-correlations higher than the defined threshold below
    multicollinearity_threshold=0.92,
    remove_outliers=False,
    outliers_threshold=0.05,
    fix_imbalance=True,  # When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for minority class
    fold_strategy='stratifiedkfold',
    fold_shuffle=True,
    fold=5,  # Number of folds for cross validation
    use_gpu=True,
)

<hr>

# Train and Compare classifiers automatically with PyCaret

**Accuracy**: Fraction of correct predictions (accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set)

**AUC**: [0, 1.0] - Bigger is better (derived from ROC). Reference: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

**Recall**: Answers the question: "What proportion of actual positives was identified correctly?"

**Precision**: Answers the question: "What proportion of positive identifications was actually correct?"

**F1**: The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.

**Kappa**: Cohen's kappa measures agreement between predictions and true labels, corrected for the agreement expected by chance. This makes it more informative than accuracy for multi-class and imbalanced problems.

**MCC**: Matthews correlation coefficient (measures the quality of binary classifiers). The coefficient takes into account true and false positives and negatives. It can also be used for imbalanced problems.
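
To make the definitions above concrete, here is a minimal sketch that computes each metric by hand from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not results from the credit dataset, and PyCaret computes these metrics for you in the comparison table.

```python
import math

# Hypothetical confusion-matrix counts for a binary classifier (illustrative only)
tp, fp, fn, tn = 40, 10, 20, 130
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of positive predictions that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1),
                    ("Kappa", kappa), ("MCC", mcc)]:
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.85 while Kappa and MCC are noticeably lower, which shows why accuracy alone can look optimistic on imbalanced data.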

In [None]:
top5 = compare_models(n_select=5, exclude=['xgboost', 'lightgbm', 'gbc'])

In [None]:
# The top5 variable is a list of the 5 best models, ranked by the sort metric (Accuracy by default)
print(top5[0])

In [None]:
plot_model(top5[0])

In [None]:
plot_model(top5[0], 'class_report')

In [None]:
plot_model(top5[0], 'feature')

Reference for plotting models: https://pycaret.org/plot-model/

<hr>

# Tuning the best model parameters automatically with PyCaret

In [None]:
# tune_model optimizes Accuracy by default; here we tune for a different metric (AUC):
top_model_tuned_for_AUC = tune_model(top5[0], optimize='AUC')

Below we inspect the tuned model: its default plot, its classification report, and its hyperparameters.

In [None]:
plot_model(top_model_tuned_for_AUC)

In [None]:
plot_model(top_model_tuned_for_AUC, 'class_report')

In [None]:
plot_model(top_model_tuned_for_AUC, 'parameter')

<hr>

# Blending Models with PyCaret

Blending is an ensembling method that uses consensus among estimators to generate final predictions. The idea is to combine different machine learning algorithms and, for classification, use either a majority vote or the average of the predicted probabilities to predict the final outcome.

In PyCaret, blending is done with the **blend_models** function. It can blend specific trained models passed through the **estimator_list** parameter or, if no list is passed, it uses all the models in the model library. For classification, the **method** parameter can be set to 'soft' or 'hard': soft voting uses predicted probabilities and hard voting uses predicted labels.

The function returns a table of k-fold cross-validated scores for common evaluation metrics, along with the trained model object. The evaluation metrics used are:

    Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
    Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

Reference: https://pycaret.org/blend-models/
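
The difference between hard and soft voting described above can be sketched in a few lines of plain Python. The probabilities below are hypothetical values for a single sample, not PyCaret output, and the 0.5 threshold is an assumption made for the illustration.

```python
from collections import Counter

# Hypothetical positive-class probabilities from three models for one sample
probabilities = [0.90, 0.45, 0.40]
threshold = 0.5  # assumed decision threshold

# Hard voting: each model casts one vote with its predicted label
labels = [int(p >= threshold) for p in probabilities]
hard_prediction = Counter(labels).most_common(1)[0][0]

# Soft voting: average the probabilities first, then apply the threshold
mean_probability = sum(probabilities) / len(probabilities)
soft_prediction = int(mean_probability >= threshold)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Here two of the three models vote for class 0, so hard voting returns 0, while the one very confident model pulls the average above the threshold, so soft voting returns 1.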

In [None]:
# Here we'll train a `blended` model based on the top5 models trained before
top5_models_blended = blend_models(estimator_list=top5, method='hard')

In [None]:
plot_model(top5_models_blended, 'class_report')

<hr>

# Ensembling models with PyCaret

Ensembling a trained model is done with the **ensemble_model** function. It takes only one mandatory parameter: the trained model object. The function returns a table of k-fold cross-validated scores for common evaluation metrics, along with the trained model object. The evaluation metrics used are:

    Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
    Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

Reference: https://pycaret.org/ensemble-model/

### Bagging:

Bagging, also known as Bootstrap aggregating, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. 

It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. 

Bagging is a special case of the model averaging approach.
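
As a rough illustration of the idea (not PyCaret's implementation), the sketch below bags simple one-dimensional decision stumps: each stump is trained on a bootstrap sample drawn with replacement, and the ensemble predicts by majority vote. The toy data and the helper names `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold t minimizing errors for the rule: predict 1 if x > t."""
    best_t, best_errors = None, None
    for t, _ in points:
        errors = sum((x > t) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def bagging_fit(points, n_estimators, rng):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    thresholds = []
    for _ in range(n_estimators):
        sample = [rng.choice(points) for _ in points]
        thresholds.append(fit_stump(sample))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(11)]  # toy data: label is 1 when x > 5
thresholds = bagging_fit(data, n_estimators=25, rng=rng)

print(bagging_predict(thresholds, 0), bagging_predict(thresholds, 10))
```

Individual stumps vary because each sees a different bootstrap sample; the vote averages that variation out, which is the variance reduction mentioned above.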

In [None]:
# Train a bagging classifier on the top model trained before
bagged_top0 = ensemble_model(top5[0], method='Bagging')

In [None]:
plot_model(bagged_top0, 'class_report')

### Boosting:

Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning. 

It belongs to a family of machine learning algorithms that convert weak learners into strong ones. 

A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). 

In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
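
To show how weak learners combine into a strong one, here is a minimal AdaBoost-style sketch in plain Python. It is an illustration under stated assumptions, not the algorithm PyCaret runs internally: the weak learners are one-dimensional threshold stumps, the toy data is hypothetical, and the helper names are invented for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    # polarity=+1: predict +1 when x < threshold; polarity=-1: the reverse
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best weak learner."""
    best = None
    for threshold in [v + 0.5 for v in xs]:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = max(error, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # learner's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # no single stump separates this

first_error, _, _ = best_stump(xs, ys, [0.1] * 10)
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print("best single stump error:", round(first_error, 2))
print("ensemble matches labels:", predictions == ys)
```

The best single stump misclassifies 3 of the 10 points, while the weighted vote of three stumps classifies all of them correctly on this toy data.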