<a href="https://colab.research.google.com/github/ribeirod/publico/blob/main/14_AutomacaoMLcomPyCaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Automating an ML project with PyCaret**

In this project, we use PyCaret to recognize text, more precisely handwritten digits scanned from envelopes by the U.S. Postal Service (the USPS dataset). The goal is to walk through the main steps of the workflow with PyCaret. In real-world projects, however, data collection and preprocessing are usually the most important steps and the ones that take up most of the project's time.

PyCaret documentation: https://pycaret.gitbook.io/docs/

Dataset (no manual download needed): https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps



In [None]:
# install the library
!pip install pycaret

In [2]:
# import libraries
import pycaret.classification as pyclf
from torchvision import datasets, transforms
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import torch
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### **Load the USPS dataset**



In [3]:
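# USPS images are single-channel (grayscale), so Normalize takes one mean and one std.
# Note: the cells below read `.data` directly, so this transform is not applied there.
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])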
usps_train = datasets.USPS(root='./data', train=True, download=True, transform=transform)
usps_test = datasets.USPS(root='./data', train=False, download=True, transform=transform)

Downloading https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/usps.bz2 to ./data/usps.bz2


100%|██████████| 6579383/6579383 [00:01<00:00, 5351853.30it/s]


Downloading https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/usps.t.bz2 to ./data/usps.t.bz2


100%|██████████| 1831726/1831726 [00:01<00:00, 1825019.95it/s]


#### **Convert the dataset to DataFrames**

In [4]:
# Flatten each 16x16 image into a row of 256 pixel columns
X_train = pd.DataFrame(usps_train.data.reshape((len(usps_train), -1)))
y_train = pd.DataFrame(usps_train.targets, columns=['targets'])
X_test = pd.DataFrame(usps_test.data.reshape((len(usps_test), -1)))
y_test = pd.DataFrame(usps_test.targets, columns=['targets'])
# shapes of the data
print("Training data shape:", X_train.shape)
print("Training labels shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test labels shape:", y_test.shape)
# join features and labels column-wise, separately for the training and test sets
train_set = pd.concat([X_train, y_train], axis=1)
test_set = pd.concat([X_test, y_test], axis=1)

Training data shape: (7291, 256)
Training labels shape: (7291, 1)
Test data shape: (2007, 256)
Test labels shape: (2007, 1)
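
The reshape used in this cell can be checked on synthetic data. The sketch below assumes a small random array standing in for `usps_train.data` (the `images`, `labels`, and `data` names are made up for illustration) and shows how `reshape((n, -1))` turns each 16x16 image into one row of 256 pixel columns before the label column is appended.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for usps_train.data: 5 grayscale images of 16x16 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 16, 16), dtype=np.uint8)
labels = [3, 1, 4, 1, 5]  # made-up digit labels

# -1 lets NumPy infer the second dimension: 16 * 16 = 256
X = pd.DataFrame(images.reshape((len(images), -1)))
y = pd.DataFrame(labels, columns=['targets'])

# Column-wise concatenation keeps one row per image and adds the label column
data = pd.concat([X, y], axis=1)
print(X.shape, y.shape, data.shape)
```

The flattening is row-major, so column 17 of a row holds the pixel at row 1, column 1 of the original image.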


#### **Initialize the PyCaret classification environment**

In this step, we configure some of PyCaret's many parameters; in other words, we tell PyCaret what we want it to do. In my view, this is the most important step in an automation project. Because data science involves a lot of experimentation, this is where you can try different parameter combinations until you find the one that works best for your case.
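
One way to organize that experimentation is to enumerate the combinations first and then run `setup` once per combination. The sketch below only builds the candidate keyword arguments; the chosen values and the `search_space` and `candidates` names are illustrative assumptions, while the parameter names are the ones passed to `setup` in the next cell.

```python
from itertools import product

# Illustrative values to try; not recommendations
search_space = {
    'normalize_method': ['minmax', 'zscore'],
    'multicollinearity_threshold': [0.90, 0.95],
    'remove_outliers': [True, False],
}

# Cartesian product of all values -> one dict of keyword arguments per experiment
keys = list(search_space)
candidates = [dict(zip(keys, values)) for values in product(*search_space.values())]

for i, kwargs in enumerate(candidates):
    print(i, kwargs)
```

Each dict could then be unpacked into the call, for example `pyclf.setup(data=train_set, target='targets', session_id=42, **kwargs)`, keeping `session_id` fixed so that runs stay comparable.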

In [5]:
clf_setup = pyclf.setup(
    data = train_set, # dataset used for training
    target = 'targets', # column holding the labels (target) to predict
    normalize = True, # whether to normalize the data; useful when features have different scales
    normalize_method = 'minmax', # normalization method, e.g. 'zscore' for z-score or 'minmax' for min-max scaling
    remove_multicollinearity = True, # whether to drop highly correlated features
    multicollinearity_threshold = 0.95, # correlation threshold above which features are dropped
    remove_outliers = True, # whether to remove outliers from the data
    outliers_threshold = 0.05, # proportion of rows treated as outliers and removed (0.05 = 5%)
    feature_selection = True, # whether to perform feature selection
    feature_selection_method = 'classic', # feature selection algorithm
    fix_imbalance = False, # whether to handle class imbalance
    data_split_shuffle = True, # whether to shuffle the data before the train/test split
    session_id = 42 # session ID (random seed) for reproducibility
)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.037991 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 52556
[LightGBM] [Info] Number of data points in the train set: 4847, number of used features: 253
[LightGBM] [Info] Start training from score -1.883527
[LightGBM] [Info] Start training from score -1.930758
[LightGBM] [Info] Start training from score -2.337647
[LightGBM] [Info] Start training from score -2.401616
[LightGBM] [Info] Start training from score -2.406182
[LightGBM] [Info] Start training from score -2.600011
[LightGBM] [Info] Start training from score -2.370223
[LightGBM] [Info] Start training from score -2.399341
[LightGBM] [Info] Start training from score -2.583482
[LightGBM] [Info] Start training from score -2.390291


| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | targets |
| 2 | Target type | Multiclass |
| 3 | Original data shape | (7291, 257) |
| 4 | Transformed data shape | (7035, 52) |
| 5 | Transformed train set shape | (4847, 52) |
| 6 | Transformed test set shape | (2188, 52) |
| 7 | Numeric features | 256 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
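
The row counts in this table can be roughly reconstructed by hand. The sketch below assumes that `setup` kept its default 70/30 train/test split (no `train_size` is passed in the call) and that outlier removal is applied to the training rows only. Both are assumptions about PyCaret's behavior rather than something the output states, but they are consistent with the reported shapes.

```python
import math

n_rows = 7291          # rows in train_set ("Original data shape")
train_size = 0.7       # assumed default split fraction

n_train = math.floor(n_rows * train_size)  # rows assigned to the training split
n_test = n_rows - n_train                  # remaining rows form the hold-out split

n_train_after = 4847   # "Transformed train set shape" reported by setup
removed = n_train - n_train_after          # training rows dropped as outliers
removed_fraction = removed / n_train

print(n_train, n_test)
print(removed, round(removed_fraction, 4))
print(n_train_after + n_test)
```

Under these assumptions the hold-out split has 2188 rows and about 5% of the 5103 training rows are dropped, which agrees with `outliers_threshold = 0.05` and with the transformed shape of 7035 rows.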


#### **Compare different models**

In [6]:
pyclf.compare_models(sort = 'Accuracy')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.9534 | 0.0 | 0.9534 | 0.9545 | 0.9533 | 0.9477 | 0.9479 | 35.978 |
| knn | K Neighbors Classifier | 0.9508 | 0.0 | 0.9508 | 0.9515 | 0.9506 | 0.9449 | 0.945 | 19.335 |
| et | Extra Trees Classifier | 0.9496 | 0.0 | 0.9496 | 0.9506 | 0.9496 | 0.9436 | 0.9437 | 20.573 |
| xgboost | Extreme Gradient Boosting | 0.9461 | 0.0 | 0.9461 | 0.9473 | 0.946 | 0.9396 | 0.9398 | 22.93 |
| qda | Quadratic Discriminant Analysis | 0.9443 | 0.0 | 0.9443 | 0.9488 | 0.945 | 0.9377 | 0.938 | 19.828 |
| rf | Random Forest Classifier | 0.9434 | 0.0 | 0.9434 | 0.9447 | 0.9432 | 0.9365 | 0.9367 | 21.557 |
| gbc | Gradient Boosting Classifier | 0.9347 | 0.0 | 0.9347 | 0.9362 | 0.9347 | 0.9269 | 0.9271 | 55.026 |
| lr | Logistic Regression | 0.9228 | 0.0 | 0.9228 | 0.9237 | 0.9225 | 0.9135 | 0.9137 | 20.429 |
| svm | SVM - Linear Kernel | 0.9085 | 0.0 | 0.9085 | 0.9115 | 0.9078 | 0.8975 | 0.8979 | 20.167 |
| lda | Linear Discriminant Analysis | 0.8961 | 0.0 | 0.8961 | 0.898 | 0.896 | 0.8837 | 0.8839 | 19.602 |



#### **Select the best models based on accuracy**

In [7]:
models = pyclf.pull() # export the comparison table above to a DataFrame
listaModels = list(models.index) # model IDs, in the order of the comparison table

# train the three best-ranked models
model1 = pyclf.create_model(listaModels[0])
model2 = pyclf.create_model(listaModels[1])
model3 = pyclf.create_model(listaModels[2])

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9648 | 0.0 | 0.9648 | 0.9652 | 0.9648 | 0.9605 | 0.9606 |
| 1 | 0.953 | 0.0 | 0.953 | 0.9548 | 0.9528 | 0.9474 | 0.9476 |
| 2 | 0.9335 | 0.0 | 0.9335 | 0.9343 | 0.9333 | 0.9255 | 0.9256 |
| 3 | 0.9588 | 0.0 | 0.9588 | 0.9594 | 0.959 | 0.9539 | 0.9539 |
| 4 | 0.949 | 0.0 | 0.949 | 0.9497 | 0.9487 | 0.9428 | 0.943 |
| 5 | 0.9529 | 0.0 | 0.9529 | 0.9535 | 0.9529 | 0.9473 | 0.9473 |
| 6 | 0.9471 | 0.0 | 0.9471 | 0.9497 | 0.9473 | 0.9407 | 0.941 |
| 7 | 0.9569 | 0.0 | 0.9569 | 0.9583 | 0.9563 | 0.9517 | 0.9519 |
| 8 | 0.9725 | 0.0 | 0.9725 | 0.9728 | 0.9725 | 0.9692 | 0.9693 |
| 9 | 0.9451 | 0.0 | 0.9451 | 0.947 | 0.9452 | 0.9385 | 0.9387 |



| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.953 | 0.0 | 0.953 | 0.9542 | 0.9531 | 0.9474 | 0.9475 |
| 1 | 0.953 | 0.0 | 0.953 | 0.9533 | 0.9526 | 0.9474 | 0.9475 |
| 2 | 0.9472 | 0.0 | 0.9472 | 0.9474 | 0.9469 | 0.9408 | 0.9409 |
| 3 | 0.9647 | 0.0 | 0.9647 | 0.9651 | 0.9645 | 0.9604 | 0.9605 |
| 4 | 0.9471 | 0.0 | 0.9471 | 0.9482 | 0.9464 | 0.9406 | 0.9408 |
| 5 | 0.9412 | 0.0 | 0.9412 | 0.942 | 0.941 | 0.9341 | 0.9342 |
| 6 | 0.9569 | 0.0 | 0.9569 | 0.958 | 0.9566 | 0.9516 | 0.9518 |
| 7 | 0.951 | 0.0 | 0.951 | 0.9514 | 0.9509 | 0.9451 | 0.9452 |
| 8 | 0.949 | 0.0 | 0.949 | 0.9496 | 0.9487 | 0.9429 | 0.943 |
| 9 | 0.9451 | 0.0 | 0.9451 | 0.9459 | 0.9452 | 0.9385 | 0.9386 |



| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.955 | 0.0 | 0.955 | 0.9557 | 0.955 | 0.9496 | 0.9496 |
| 1 | 0.9354 | 0.0 | 0.9354 | 0.939 | 0.9356 | 0.9276 | 0.9279 |
| 2 | 0.9374 | 0.0 | 0.9374 | 0.9381 | 0.9373 | 0.9298 | 0.9299 |
| 3 | 0.9627 | 0.0 | 0.9627 | 0.9642 | 0.9629 | 0.9583 | 0.9584 |
| 4 | 0.9373 | 0.0 | 0.9373 | 0.9379 | 0.9366 | 0.9296 | 0.9298 |
| 5 | 0.949 | 0.0 | 0.949 | 0.9493 | 0.9489 | 0.9429 | 0.9429 |
| 6 | 0.9569 | 0.0 | 0.9569 | 0.9572 | 0.9566 | 0.9517 | 0.9518 |
| 7 | 0.9549 | 0.0 | 0.9549 | 0.9555 | 0.9548 | 0.9495 | 0.9496 |
| 8 | 0.9569 | 0.0 | 0.9569 | 0.9574 | 0.9566 | 0.9517 | 0.9518 |
| 9 | 0.951 | 0.0 | 0.951 | 0.9518 | 0.9511 | 0.9451 | 0.9452 |


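
The selection in this cell relies on `pull()` returning the comparison table with the model IDs as its index, ordered by the chosen metric. A minimal sketch of that indexing, using a hand-built DataFrame with four rows copied from the comparison table (the `leaderboard` and `top3_ids` names are made up for illustration):

```python
import pandas as pd

# Four rows copied from the compare_models output; the index holds the model IDs
leaderboard = pd.DataFrame(
    {
        'Model': ['Light Gradient Boosting Machine', 'K Neighbors Classifier',
                  'Extra Trees Classifier', 'Extreme Gradient Boosting'],
        'Accuracy': [0.9534, 0.9508, 0.9496, 0.9461],
    },
    index=['lightgbm', 'knn', 'et', 'xgboost'],
)

# Sort explicitly so the sketch does not depend on the input order
leaderboard = leaderboard.sort_values('Accuracy', ascending=False)

# The first three index entries are the IDs of the best-ranked models
top3_ids = list(leaderboard.index)[:3]
print(top3_ids)
```

These string IDs are what `create_model` receives in the cell above.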

#### **Optimize the model**

The `tune_model` function adjusts the model's hyperparameters to improve the chosen metric (accuracy by default for classification, but you can change it). The call would be `tune_model(best_model)`, where `best_model` stands for a model you have already trained.
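
To give an idea of what such tuning involves, the sketch below runs a plain random search: sample hyperparameter combinations, score each one, keep the best. It is not PyCaret's implementation. The search space, the `toy_score` function (a stand-in for a cross-validated accuracy), and all names are hypothetical.

```python
import random

# Hypothetical search space for a tree-based model
search_space = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
}

def toy_score(params):
    """Stand-in for cross-validated accuracy; highest at max_depth=7,
    learning_rate=0.1, n_estimators=200."""
    return (1.0
            - 0.01 * abs(params['max_depth'] - 7)
            - 0.1 * abs(params['learning_rate'] - 0.1)
            - 0.00001 * abs(params['n_estimators'] - 200))

rng = random.Random(42)  # fixed seed so the run is repeatable
trials = []
for _ in range(10):  # number of sampled combinations
    params = {name: rng.choice(values) for name, values in search_space.items()}
    trials.append((toy_score(params), params))

# Keep the combination with the highest score
best_score, best_params = max(trials, key=lambda t: t[0])
print(best_params, round(best_score, 4))
```

In this notebook, the corresponding step would be a call such as `pyclf.tune_model(model1)`.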

Here, however, we will combine models into an ensemble. An ensemble model combines the predictions of several individual models to produce a more robust and usually more accurate prediction.

Specifically, blending is a technique that builds a single model from the predictions of several base models. In PyCaret, `blend_models` does this with a voting classifier: the final prediction is either a majority vote over the labels predicted by the base models (`method='hard'`, used here) or an average of their predicted probabilities (`method='soft'`). This differs from stacking, in which the base models' predictions are fed as inputs to a higher-level model that learns how to weight them.
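
The two ways of combining predictions, voting and averaging, can be sketched with NumPy. The predictions and probabilities below are made up for three hypothetical models; this is an illustration of the idea, not PyCaret's code.

```python
import numpy as np

# Made-up class predictions from three models for four samples
preds = np.array([
    [7, 2, 1, 0],   # model A
    [7, 2, 7, 0],   # model B
    [9, 2, 1, 6],   # model C
])

# Hard voting: for each sample (column), pick the most frequent label
hard = np.array([np.bincount(col).argmax() for col in preds.T])
print(hard)

# Made-up class probabilities for one sample over three classes
probs = np.array([
    [0.6, 0.3, 0.1],   # model A
    [0.2, 0.5, 0.3],   # model B
    [0.3, 0.4, 0.3],   # model C
])

# Soft voting: average the probabilities, then take the most likely class
mean_probs = probs.mean(axis=0)
soft = int(mean_probs.argmax())
print(mean_probs.round(3), soft)
```

In the hard-voting part, a single dissenting model (for example model C on the first sample) is simply outvoted by the other two.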

In [8]:
# create the ensemble model
blend_model = pyclf.blend_models(estimator_list=[model1, model2, model3], method='hard')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9589 | 0.0 | 0.9589 | 0.9598 | 0.9589 | 0.9539 | 0.954 |
| 1 | 0.9452 | 0.0 | 0.9452 | 0.9473 | 0.9451 | 0.9386 | 0.9388 |
| 2 | 0.9413 | 0.0 | 0.9413 | 0.9423 | 0.9413 | 0.9342 | 0.9343 |
| 3 | 0.9725 | 0.0 | 0.9725 | 0.9735 | 0.9726 | 0.9692 | 0.9693 |
| 4 | 0.949 | 0.0 | 0.949 | 0.9494 | 0.9485 | 0.9428 | 0.943 |
| 5 | 0.9569 | 0.0 | 0.9569 | 0.9572 | 0.9568 | 0.9517 | 0.9517 |
| 6 | 0.9608 | 0.0 | 0.9608 | 0.9613 | 0.9606 | 0.9561 | 0.9562 |
| 7 | 0.9569 | 0.0 | 0.9569 | 0.9571 | 0.9566 | 0.9517 | 0.9518 |
| 8 | 0.9608 | 0.0 | 0.9608 | 0.9613 | 0.9606 | 0.956 | 0.9561 |
| 9 | 0.9549 | 0.0 | 0.9549 | 0.9557 | 0.955 | 0.9495 | 0.9496 |



#### **Predict with the trained model**

In [9]:
predictions = pyclf.predict_model(blend_model, data=test_set)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Voting Classifier | 0.9193 | 0 | 0.9193 | 0.9204 | 0.9192 | 0.9093 | 0.9095 |
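
It is worth comparing this hold-out result with the cross-validation scores of the blended model. The sketch below averages the ten fold accuracies printed for `blend_models` and subtracts the accuracy reported on `test_set`. The numbers are copied from the outputs in this notebook, and the variable names are made up for illustration.

```python
from statistics import mean

# Fold accuracies of the blended model, copied from the blend_models output
fold_accuracy = [0.9589, 0.9452, 0.9413, 0.9725, 0.9490,
                 0.9569, 0.9608, 0.9569, 0.9608, 0.9549]
test_accuracy = 0.9193  # accuracy reported by predict_model on test_set

cv_mean = mean(fold_accuracy)   # average cross-validated accuracy
gap = cv_mean - test_accuracy   # how much lower the hold-out score is
print(round(cv_mean, 4), round(gap, 4))
```

A gap of roughly 3.6 percentage points may indicate that the USPS test split is somewhat harder than the folds drawn from the training split, so the hold-out figure is the more cautious estimate of real-world performance.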


#### **Save a trained model**

In [10]:
pyclf.save_model(blend_model,'Blend Model USPS')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['0', '1', '2', '3', '4', '5', '6',
                                              '7', '8', '9', '10', '11', '12',
                                              '13', '14', '15', '16', '17', '18',
                                              '19', '20', '21', '22', '23', '24',
                                              '25', '26', '27', '28', '29', ...],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,...
                                                                     max_features='sqrt',
                                           

#### **Load a previously trained model**

In [11]:
saved_model = pyclf.load_model('/content/Blend Model USPS')

Transformation Pipeline and Model Successfully Loaded


In [12]:
predictions_saved_model = pyclf.predict_model(saved_model, data=test_set)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Voting Classifier | 0.9193 | 0 | 0.9193 | 0.9204 | 0.9192 | 0.9093 | 0.9095 |
