# AutoML tools: PyCaret

In this notebook, we will explore a powerful AutoML library, [**PyCaret**](https://pycaret.gitbook.io/docs). PyCaret provides a user-friendly interface for automating various steps in the machine learning workflow, making it easier for both beginners and experienced data scientists to build and evaluate machine learning models.

We will apply this tool to the Bank Marketing dataset (`bank-additional-full.csv`), whose target column `y` takes the values `yes` and `no`.
First, we install the library.

In [1]:
!pip install pycaret






In [2]:
!pip install shap









In [3]:
%pip install "pycaret[analysis]"

Note: you may need to restart the kernel to use updated packages.





In [4]:
import pandas as pd

df = pd.read_csv(r"C:\Users\mmartinovic\06-DataScienceMasterclass\bank-additional-full\bank-additional-full.csv", sep=";")

Before using AutoML tools, let's take a quick look at our dataset and its structure:

In [5]:
df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [6]:
df.describe()

| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [7]:
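# Drop two columns we will not model on; `duration` is only known once a call has ended, so it would leak the outcome.
df = df.drop(columns=["default", "duration"])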

In [8]:
from sklearn.model_selection import train_test_split

# Separate the features from the target and hold out 20% of the rows for testing.
X = df.drop(columns="y")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Regression with PyCaret

In [9]:
from pycaret.regression import *

s = setup(df, target='y')

| | Description | Value |
|---|---|---|
| 0 | Session id | 2899 |
| 1 | Target | y |
| 2 | Target type | Regression |
| 3 | Original data shape | (41188, 19) |
| 4 | Transformed data shape | (41188, 59) |
| 5 | Transformed train set shape | (28831, 59) |
| 6 | Transformed test set shape | (12357, 59) |
| 7 | Numeric features | 9 |
| 8 | Categorical features | 9 |
| 9 | Preprocess | True |


Now that the data is preprocessed, we can use the
[`compare_models()`](https://pycaret.gitbook.io/docs/get-started/functions/train#compare_models)
function, which trains all the estimators available in the model library and evaluates their performance using cross-validation.

In [10]:
best = compare_models()

With PyCaret, we get a very similar list of best regressors.
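To make the idea behind `compare_models()` more concrete, here is a minimal sketch of the same pattern written directly with scikit-learn. It is not PyCaret's implementation: the synthetic data, the four candidate estimators, and names such as `X_demo`, `candidates`, and `leaderboard` are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for a real dataset (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small, hand-picked set of candidate estimators (hypothetical selection).
candidates = {
    "lr": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "dt": DecisionTreeRegressor(random_state=0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R2 for each candidate.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Rank the candidates from best to worst score.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name = leaderboard[0][0]

for name, score in leaderboard:
    print(f"{name:>6}: R2 = {score:.3f}")
```

The first entry of `leaderboard` plays the role of the `best` object returned by `compare_models()`, with the difference that PyCaret returns the fitted estimator itself rather than its name.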

#### Optimization

PyCaret makes it easy to tune hyperparameters of the selected model using the [`tune_model()`](https://pycaret.gitbook.io/docs/get-started/functions/optimize#tune_model) function. 

You can increase the number of iterations (the `n_iter` parameter) depending on how much time and resources you have. By default, it is set to 10.

You can also choose which metric to optimize for (the `optimize` parameter). By default, it is set to R2 for regression problems.
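The roles of the iteration count and the optimization metric can be illustrated outside PyCaret with scikit-learn's `RandomizedSearchCV`. The sketch below is only an analogy under stated assumptions: the synthetic data, the random forest estimator, and the `param_distributions` search space are all hypothetical choices, and PyCaret defines its own search spaces internally.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for a real dataset (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical search space with 3 * 4 * 3 = 36 possible combinations.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                           # number of sampled combinations
    scoring="neg_mean_absolute_error",   # optimize MAE (negated, so higher is better)
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Cross-validated MAE: {-search.best_score_:.2f}")
```

Only 10 of the 36 combinations are evaluated here, so raising `n_iter` explores more of the space at a proportional cost in training time.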

In [11]:
tuned_model = tune_model(best, n_iter = 10, optimize='MAE')

ValueError: Estimator [] does not have the required fit() method.
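The error above is consistent with `compare_models()` having returned an empty list, which can happen when none of the estimators trains successfully. One plausible cause here is that the target column `y` holds the strings `yes` and `no`, as the `df.head()` output shows, while the regression module expects a numeric target. A minimal sketch of a pre-check follows; the helper name `suggest_task` is hypothetical, the toy frame only mimics the structure of the bank data, and the dtype test is a crude heuristic (a numeric column with a few distinct labels would still be reported as regression).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_task(frame: pd.DataFrame, target: str) -> str:
    """Return 'regression' for a numeric target, otherwise 'classification'."""
    return "regression" if is_numeric_dtype(frame[target]) else "classification"


# Toy rows that mimic the structure of the bank data (values are illustrative).
toy = pd.DataFrame({"age": [56, 57, 37, 40], "y": ["no", "no", "yes", "no"]})

task = suggest_task(toy, "y")
print(task)  # classification
```

If the check suggests classification, one option is to import the same functions from `pycaret.classification` instead of `pycaret.regression` and rerun `setup()` and `compare_models()`. It may also help to confirm that `best` is not an empty list before passing it to `tune_model()`.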

In [None]:
evaluate_model(best)

In [None]:
interpret_model(best)