# AutoML tools: PyCaret

In this notebook, we will explore a powerful AutoML library:
[**PyCaret**](https://pycaret.gitbook.io/docs).
It provides a user-friendly interface for automating various steps in the machine learning workflow, making it easier for both beginners and experienced data scientists to build and evaluate machine learning models.

We will be using this tool for regression (Boston dataset) and classification (Titanic dataset) problems.  
First, we install the library.

In [0]:
pip install -q pycaret

In [0]:
# You only need to run this cell after installing the pycaret package on Databricks
dbutils.library.restartPython()

Then we load the Boston dataset using Pandas.

In [0]:
import pandas as pd

boston_df = pd.read_csv('../../../../Data/Boston.csv')

Before using AutoML tools, let's take a quick look at our dataset and its structure:

In [0]:
boston_df.head()

In [0]:
boston_df.describe()

In [0]:
from sklearn.model_selection import train_test_split

X = boston_df.iloc[:, 1:14]
y = boston_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Regression with PyCaret

[PyCaret](https://pycaret.gitbook.io/docs)
is an open-source, low-code machine learning library for Python. It wraps machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, and a few more.
It was inspired by the emerging role of citizen data scientists, individuals who are not necessarily trained in data science or analytics but have the skills and tools to work with data and extract insights.

[PyCaret](https://pycaret.gitbook.io/docs) supports regression, classification, and clustering problems, speeds up experiments, and integrates with BI tools.

In this part of the notebook we will explore some of the key features of PyCaret.

Let's import the regression module and set up an experiment with [`setup()`](https://pycaret.gitbook.io/docs/get-started/functions/initialize#setting-up-environment).

Note: PyCaret can automatically handle common preprocessing tasks, such as handling missing values, feature scaling, and categorical encoding, so we don't need to worry about them.

In [0]:
from pycaret.regression import *
 
s = setup(boston_df, target = 'target')
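
To make the preprocessing note above more concrete, here is a minimal plain-Python sketch of the three steps it names: imputing a missing value, scaling a numeric column, and one-hot encoding a categorical column. The rows and column names are made up for illustration, and this is not PyCaret's implementation. Which steps actually run depends on the arguments passed to `setup()`, so check the documentation for your PyCaret version.

```python
from statistics import mean, pstdev

# Toy rows with a missing numeric value and a categorical column (made-up data).
rows = [
    {"rooms": 6.0, "river": "yes"},
    {"rooms": None, "river": "no"},
    {"rooms": 8.0, "river": "no"},
]

# 1) Impute missing numeric values with the mean of the observed values.
observed = [r["rooms"] for r in rows if r["rooms"] is not None]
fill_value = mean(observed)
rooms = [r["rooms"] if r["rooms"] is not None else fill_value for r in rows]

# 2) Standardize the numeric column (z-score scaling).
mu, sigma = mean(rooms), pstdev(rooms)
rooms_scaled = [(x - mu) / sigma for x in rooms]

# 3) One-hot encode the categorical column (one 0/1 flag per category).
categories = sorted({r["river"] for r in rows})
river_encoded = [[int(r["river"] == c) for c in categories] for r in rows]

# Reassemble the processed rows.
processed = [
    {"rooms": z, **{f"river_{c}": flag for c, flag in zip(categories, flags)}}
    for z, flags in zip(rooms_scaled, river_encoded)
]
```

After these steps every column is numeric, which is the form most estimators expect.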

Now that the data is preprocessed, we can use the
[`compare_models()`](https://pycaret.gitbook.io/docs/get-started/functions/train#compare_models)
function, which trains all available estimators and evaluates their performance using cross-validation.

In [0]:
best = compare_models()

With PyCaret, we got a very similar list of best regressors.
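
Conceptually, comparing models means scoring every candidate with the same cross-validation scheme and sorting the results. The sketch below shows that idea in plain Python with two hand-written toy models and made-up data. The helper names (`fit_mean`, `fit_line`, `cv_mae`) are hypothetical, and this is only an illustration of the principle, not how `compare_models()` is implemented.

```python
from statistics import mean

# Made-up data: y is roughly 2 * x + 1 with a small deterministic wobble.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

def fit_mean(train_x, train_y):
    """Baseline: always predict the mean of the training targets."""
    m = mean(train_y)
    return lambda x: m

def fit_line(train_x, train_y):
    """Ordinary least squares with a single feature."""
    mx, my = mean(train_x), mean(train_y)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sum(
        (x - mx) ** 2 for x in train_x
    )
    return lambda x: my + slope * (x - mx)

def cv_mae(fit, k=5):
    """Mean absolute error averaged over k folds (fold i holds out every k-th row)."""
    fold_scores = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        predict = fit([x for x, _ in train], [y for _, y in train])
        fold_scores.append(mean(abs(y - predict(x)) for x, y in test))
    return mean(fold_scores)

candidates = {"mean_baseline": fit_mean, "linear": fit_line}
leaderboard = sorted(
    ((name, cv_mae(fit)) for name, fit in candidates.items()),
    key=lambda item: item[1],  # lower MAE is better
)
best_name = leaderboard[0][0]
```

Here `leaderboard` plays the role of the comparison table, and `best_name` corresponds to the top-ranked entry.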

#### Optimization

PyCaret makes it easy to tune hyperparameters of the selected model using the [`tune_model()`](https://pycaret.gitbook.io/docs/get-started/functions/optimize#tune_model) function. 

You can increase the number of iterations (`n_iter` parameter) depending on how much time and resources you have. By default, it is set to 10.

You can also choose which metric to optimize for (`optimize` parameter). By default, it is set to R2 for regression problems.

In [0]:
tuned_model = tune_model(best, n_iter = 10, optimize='MAE')
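
To illustrate what the `n_iter` and `optimize` arguments in the call above control, here is a small plain-Python sketch of a random search: it samples `n_iter` hyperparameter combinations and keeps the one with the lowest MAE on a validation set. The data, the toy k-nearest-neighbours regressor, and the search space are all made up for this example. It is a simplified illustration of the idea, not PyCaret's tuner.

```python
import random

# Made-up 1-D data, split into training and validation parts.
train = [(float(x), float(x) ** 2) for x in range(0, 20, 2)]
valid = [(float(x), float(x) ** 2) for x in range(1, 20, 2)]

def knn_predict(x, k, weights):
    """Predict with a tiny k-nearest-neighbours regressor fitted on `train`."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    if weights == "uniform":
        return sum(y for _, y in neighbours) / k
    # Distance weighting: closer neighbours count more.
    w = [1.0 / (abs(nx - x) + 1e-9) for nx, _ in neighbours]
    return sum(wi * y for wi, (_, y) in zip(w, neighbours)) / sum(w)

def validation_mae(params):
    """Metric to optimize: mean absolute error on the validation set."""
    errors = [abs(y - knn_predict(x, **params)) for x, y in valid]
    return sum(errors) / len(errors)

search_space = {"k": [1, 2, 3, 5, 7], "weights": ["uniform", "distance"]}

def random_search(n_iter, seed=0):
    """Try `n_iter` random combinations and keep the one with the lowest MAE.

    Simplification: combinations are drawn with replacement, so repeats can occur.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = validation_mae(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_iter=10)
```

With a fixed seed, a larger `n_iter` can only match or improve the best score found, at the cost of more evaluations.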

More advanced features:
- You can customize the search space: define it and pass it to the `custom_grid` parameter.
- You can change the search algorithm. By default, a random grid search is used, but you can change it by setting the `search_library` and `search_algorithm` parameters.
- You can get access to the tuner object. Normally, [`tune_model()`](https://pycaret.gitbook.io/docs/get-started/functions/optimize#tune_model) only returns the best model. The sample code below shows how it can be done:

In [0]:
#tuned_model, tuner = tune_model(best, return_tuner=True)
#print(tuner)

We can look at how the hyperparameters have changed:

In [0]:
# default model
print(best)

# tuned model
print(tuned_model)

Sometimes [`tune_model()`](https://pycaret.gitbook.io/docs/get-started/functions/optimize#tune_model) doesn't improve on the default model, or even gives a worse result. When we experiment in a notebook, that's fine, because we can pick the better option manually. But in a Python script that first creates models, then tunes them, and then uses the tuned model, it can be a problem.

To solve this, we can set the **choose_better** parameter to True, so that the better model (default or tuned) is chosen automatically:

In [0]:
#tuned_model = tune_model(best, n_iter = 10, optimize='MAE', choose_better=True)
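
The selection rule behind **choose_better** can be pictured as a simple comparison that respects the direction of the metric: lower is better for error metrics such as MAE, higher is better for R2. The function below is a hypothetical sketch of that rule with made-up scores, not PyCaret's code.

```python
# Metrics where a larger value means a better model; for all others, smaller is better.
HIGHER_IS_BETTER = {"R2"}

def choose_better(base_score, tuned_score, metric):
    """Return 'tuned' only if tuning strictly improved the metric, else 'base'."""
    if metric in HIGHER_IS_BETTER:
        improved = tuned_score > base_score
    else:
        improved = tuned_score < base_score
    return "tuned" if improved else "base"

# Made-up scores: tuning made MAE worse in the first case and better in the second.
kept_when_worse = choose_better(base_score=2.1, tuned_score=2.4, metric="MAE")
kept_when_better = choose_better(base_score=2.1, tuned_score=1.9, metric="MAE")
```

In a script, a rule like this guarantees that the model passed on to later steps is never worse than the default one on the chosen metric.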

#### Analysis
Note that we can easily see the hyperparameters of the model and the whole pipeline, in contrast to the LazyPredict library.
We also have many other visualizations provided by the [`evaluate_model()`](https://pycaret.gitbook.io/docs/get-started/functions/analyze#evaluate_model) function.

In [0]:
evaluate_model(best)

In [0]:
# Note: interpret_model() relies on the shap package and supports tree-based models
interpret_model(best)

*There are many other analysis tools implemented in PyCaret, such as Morris sensitivity analysis, reason plot, dashboard, etc. You can read more here: https://pycaret.gitbook.io/docs/get-started/functions/analyze.*

#### Deployment
Let us demonstrate some useful functions:

- [`predict_model()`](https://pycaret.gitbook.io/docs/get-started/functions/deploy#predict_model)

You can pass a new, unseen dataset to the **data** parameter. In the example below we didn't specify this parameter, so the predictions are made for the hold-out set:

In [0]:
predict_model(tuned_model)

- [`finalize_model()`](https://pycaret.gitbook.io/docs/get-started/functions/deploy#finalize_model)

Refits the model on the entire dataset, including the hold-out set.

In [0]:
finalize_model(tuned_model)

- [`save_model()`](https://pycaret.gitbook.io/docs/get-started/functions/deploy#save_model)

Saves the transformation pipeline and the trained model as a pickle file in the working directory.

In [0]:
save_model(tuned_model, 'my_best_model')

- [`load_model()`](https://pycaret.gitbook.io/docs/get-started/functions/deploy#load_model)

Loads a previously saved model.

In [0]:
load_model('my_best_model')
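
Conceptually, saving and loading a model is a serialize/deserialize round trip. The sketch below shows that idea with the standard-library `pickle` module and a stand-in dictionary instead of a real pipeline. The object contents and the temporary directory are assumptions made for this example, and PyCaret's own functions may work differently internally.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted pipeline: any picklable Python object works for this sketch.
pipeline = {"steps": ["impute", "encode", "model"], "coefficients": [0.5, -1.2]}

with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "my_best_model.pkl"
    with path.open("wb") as f:
        pickle.dump(pipeline, f)   # serialize the object to disk
    with path.open("rb") as f:
        restored = pickle.load(f)  # rebuild an equal object from the file
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.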

## Your turn!

Now, it's time to take your newly acquired knowledge and skills to the next level by trying this powerful AutoML library on a classification problem.

In [0]:
# Task: Import titanic.csv dataset

titanic_df = ...

In [0]:
X = titanic_df[['Sex', 'Embarked', 'Pclass', 'Age', 'Survived']]
y = titanic_df[['Survived']]

In [0]:
# Task: split the dataset into train and test sets

...

## Classification with PyCaret

*For this new challenge, we encourage you to consult the PyCaret library's documentation to effectively handle the following task: https://pycaret.gitbook.io/docs/get-started/quickstart#classification.*

In [0]:
# Task: Initialize the environment

...

In [0]:
# Task: Compare models

...

In [0]:
# Task: Optimize the best default model. Set parameters in such a way that the function will return the most efficient model among the default and tuned models.

...

In [0]:
# Task: plot confusion matrix

...

# What does the confusion matrix tell us? 

In [0]:
# Task: get visualization of the pipeline. Hint: use evaluate_model()

...

# What is the most important feature? 
# Task: Let's take a look at survival rate by sex. Hint: use seaborn barplot() function. Don't forget to import seaborn!

...

# What conclusion can we make?

In [0]:
# Task: save the model as 'my_best_classifier'

...

Congratulations! You've completed the study notebook on automating machine learning workflows with PyCaret.
By automating repetitive tasks, AutoML libraries such as PyCaret enable us to iterate faster, experiment with various algorithms, and gain valuable insights from our data more efficiently.

While we explored a wide range of PyCaret's capabilities, it's essential to note that we haven't covered every single function and feature it provides.
As you continue your journey in machine learning, we encourage you to dive deeper into the documentation to discover its full range of capabilities.

**Documentation:**
- PyCaret: https://pycaret.gitbook.io/docs