#Introduction to Artificial Intelligence with PyCaret
##Data Scientist: Sam Faraday
###August 2020
###Faculdade EASE

###Installing PyCaret

In [None]:
pip install pycaret

###Importing Pandas for Data Manipulation

In [None]:
import pandas as pd

###Loading the Diabetes Dataset Bundled with PyCaret

In [18]:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


###Importing PyCaret's Classification Module

In [None]:
from pycaret.classification import *

###Initializing the Model with setup
Parameters:
  0. clf1 = the name of the variable that receives the result of setup. Here we call it clf1, but it could be classification1 or any other name you like

  1. data = the dataset you want to process
  
  2. target = the name of the column we want to predict


In [None]:
clf1 = setup(data = diabetes, target = 'Class variable')
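
Behind the scenes, setup also reserves part of the data as a hold-out set before any model is trained. The sketch below illustrates that idea in plain Python and is not PyCaret's own code. The helper name `holdout_split`, the seed, the 70% training share (PyCaret's default `train_size`, as far as I know) and the 768-row count for the diabetes dataset are assumptions made for illustration.

```python
import random

def holdout_split(n_rows, train_size=0.7, seed=123):
    """Shuffle row positions and split them into train and hold-out parts.

    Hypothetical helper: it only mimics the idea of a hold-out split.
    """
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # reproducible shuffle
    n_train = int(n_rows * train_size)      # rows kept for training
    return positions[:n_train], positions[n_train:]

# 768 rows is assumed here as the size of the diabetes dataset
train_rows, holdout_rows = holdout_split(768)
print(len(train_rows), len(holdout_rows))  # 537 231
```

With these assumptions, 537 rows go to training and cross-validation and 231 rows are kept aside for the hold-out evaluation.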

###Now Let's Compare Some Models

In [11]:
# Get the best model
# best is just a variable name; it could be called anything, e.g.:
# top_model = compare_models()

best = compare_models()

# Get only the top 3 models, ranked by Accuracy (the default)
top3 = compare_models(n_select = 3)

# Now get the best model ranked by AUC
best = compare_models(sort = 'AUC')

# We could compare only specific models with the command below
# (PyCaret 2.1 and later use include; older releases called it whitelist)
#best_specific = compare_models(include = ['dt','rf','xgboost'])

# We can also leave some models out, if we want
# (PyCaret 2.1 and later use exclude; older releases called it blacklist)
#best_specific = compare_models(exclude = ['catboost', 'svm'])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7728,0.8494,0.5944,0.7311,0.6358,0.4769,0.4963,0.261
rf,Random Forest Classifier,0.7637,0.8452,0.5787,0.705,0.6126,0.4513,0.4693,0.518
lda,Linear Discriminant Analysis,0.7654,0.8439,0.5614,0.7247,0.6131,0.4524,0.4731,0.019
gbc,Gradient Boosting Classifier,0.7506,0.838,0.6044,0.6641,0.6211,0.4383,0.4473,0.132
et,Extra Trees Classifier,0.7468,0.8207,0.5342,0.6972,0.5801,0.4079,0.4326,0.471
lightgbm,Light Gradient Boosting Machine,0.7264,0.8177,0.5567,0.6321,0.5719,0.3779,0.3925,0.053
ada,Ada Boost Classifier,0.7283,0.7973,0.588,0.6191,0.5952,0.3927,0.3988,0.113
knn,K Neighbors Classifier,0.7226,0.7771,0.545,0.627,0.5674,0.3679,0.3813,0.119
nb,Naive Bayes,0.6647,0.748,0.1629,0.4587,0.237,0.1037,0.1171,0.016
dt,Decision Tree Classifier,0.7636,0.7403,0.6646,0.6741,0.66,0.4807,0.4889,0.018
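
To make the columns of this table less abstract, here is a minimal sketch of how most of them can be derived from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration; AUC is left out because it needs the ranked prediction scores rather than the counts alone.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)       # share of actual positives that were found
    precision = tp / (tp + fp)    # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision,
            "F1": f1, "Kappa": kappa, "MCC": mcc}

# Hypothetical counts for a 54-row validation fold
metrics = binary_metrics(tp=9, fp=2, fn=9, tn=34)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note how a model can have decent Accuracy and Precision while its Recall is only 0.5: it misses half of the positive cases.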


### If you need to understand a function, use help(function_name)
You will need this to look up the abbreviated model names, among other things

In [12]:
help(create_model)

Help on function create_model in module pycaret.classification:

create_model(estimator: Union[str, Any], fold: Union[int, Any, NoneType] = None, round: int = 4, cross_validation: bool = True, fit_kwargs: Union[dict, NoneType] = None, groups: Union[str, Any, NoneType] = None, verbose: bool = True, **kwargs) -> Any
    This function trains and evaluates the performance of a given estimator 
    using cross validation. The output of this function is a score grid with 
    CV scores by fold. Metrics evaluated during CV can be accessed using the 
    ``get_metrics`` function. Custom metrics can be added or removed using 
    ``add_metric`` and ``remove_metric`` function. All the available models
    can be accessed using the ``models`` function.
    
    Example
    -------
    >>> from pycaret.datasets import get_data
    >>> juice = get_data('juice')
    >>> from pycaret.classification import *
    >>> exp_name = setup(data = juice,  target = 'Purchase')
    >>> lr = create_model('lr')
 

###Since 'lr' (Logistic Regression) was the best model, let's create it for use in our analyses

Note that it now reports several metrics for this model, one row per cross-validation fold

In [14]:
lr = create_model('lr')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7963,0.8812,0.5,0.8182,0.6207,0.4923,0.5202
1,0.8333,0.8611,0.6667,0.8,0.7273,0.6087,0.6139
2,0.7407,0.8009,0.7778,0.5833,0.6667,0.4615,0.4743
3,0.7778,0.8045,0.4737,0.8182,0.6,0.4609,0.4939
4,0.7593,0.8932,0.3684,0.875,0.5185,0.3917,0.4569
5,0.6852,0.818,0.4737,0.5625,0.5143,0.2839,0.2862
6,0.8333,0.8857,0.6842,0.8125,0.7429,0.6209,0.6259
7,0.7925,0.8683,0.8333,0.6522,0.7317,0.5665,0.5779
8,0.7925,0.8603,0.5,0.8182,0.6207,0.489,0.5171
9,0.717,0.8206,0.6667,0.5714,0.6154,0.3936,0.3965
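
The per-fold rows above connect directly to the comparison table: averaging a column over the 10 folds gives the single value reported for 'lr' by compare_models. A small check using the numbers from the output above (the use of `pstdev` for the spread is an assumption for illustration, not necessarily the formula PyCaret uses):

```python
from statistics import mean, pstdev

# Accuracy and AUC columns copied from the create_model('lr') output
fold_accuracy = [0.7963, 0.8333, 0.7407, 0.7778, 0.7593,
                 0.6852, 0.8333, 0.7925, 0.7925, 0.7170]
fold_auc = [0.8812, 0.8611, 0.8009, 0.8045, 0.8932,
            0.8180, 0.8857, 0.8683, 0.8603, 0.8206]

mean_accuracy = round(mean(fold_accuracy), 4)
mean_auc = round(mean(fold_auc), 4)
print(mean_accuracy, mean_auc)  # 0.7728 0.8494

# Spread across folds: a large value means the score depends on the split
print(round(pstdev(fold_accuracy), 4))
```

Both means match the Accuracy and AUC shown for Logistic Regression in the comparison table.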


###Let's create a subset of the data to use as a test. We'll call it diabetes2 and take only 5 rows, 200 through 204 (iloc[200:205] excludes the end position)

In [16]:
diabetes2 = diabetes.iloc[200:205]

In [19]:
diabetes2

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
200,0,113,80,16,0,31.0,0.874,21,0
201,1,138,82,0,0,40.1,0.236,28,0
202,0,108,68,20,0,27.3,0.787,32,0
203,2,99,70,16,44,20.4,0.235,27,0
204,6,103,72,32,190,37.7,0.324,55,0
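
The slice used above follows Python's half-open convention: the start position is included and the end position is not. A minimal sketch with a toy DataFrame (the frame and its `value` column are made up for illustration):

```python
import pandas as pd

# Toy frame with 300 rows, standing in for the diabetes data
frame = pd.DataFrame({"value": range(300)})

subset = frame.iloc[200:205]   # positions 200, 201, 202, 203, 204
print(list(subset.index))      # [200, 201, 202, 203, 204]
print(len(subset))             # 5
```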


###Now let's make a prediction on this test subset
To do this, we call the predict_model function, passing the lr model created above and the diabetes2 subset

In [20]:
predict_model(lr, data=diabetes2)

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,Label,Score
200,0,113,80,16,0,31.0,0.874,21,0,0,0.8287
201,1,138,82,0,0,40.1,0.236,28,0,1,0.5138
202,0,108,68,20,0,27.3,0.787,32,0,0,0.8619
203,2,99,70,16,44,20.4,0.235,27,0,0,0.9845
204,6,103,72,32,190,37.7,0.324,55,0,0,0.6011
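
The Label and Score columns can be read as follows: Label is the predicted class, and Score is the probability of that predicted class (not always the probability of class 1). The sketch below shows this relationship for a binary problem. The helper `label_and_score` is hypothetical, and the 0.5 default threshold with a `>=` comparison is an assumption, not PyCaret's actual code.

```python
def label_and_score(p_positive, threshold=0.5):
    """Turn the probability of class 1 into a (Label, Score) pair.

    Hypothetical helper for illustration only.
    """
    label = 1 if p_positive >= threshold else 0
    # Score is the probability of whichever class was predicted
    score = p_positive if label == 1 else 1 - p_positive
    return label, round(score, 4)

# Probabilities of class 1 consistent with rows 200, 201 and 203 above
for p in (0.1713, 0.5138, 0.0155):
    print(p, label_and_score(p))
```

Row 201 is a borderline case: with a Score of 0.5138, the model is only slightly more confident in class 1 than in class 0.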


### When in doubt, call help

In [21]:
help(predict_model)

Help on function predict_model in module pycaret.classification:

predict_model(estimator, data: Union[pandas.core.frame.DataFrame, NoneType] = None, probability_threshold: Union[float, NoneType] = None, encoded_labels: bool = False, raw_score: bool = False, round: int = 4, verbose: bool = True) -> pandas.core.frame.DataFrame
    This function predicts ``Label`` and ``Score`` (probability of predicted 
    class) using a trained model. When ``data`` is None, it predicts label and 
    score on the holdout set.
    
    
    Example
    -------
    >>> from pycaret.datasets import get_data
    >>> juice = get_data('juice')
    >>> from pycaret.classification import *
    >>> exp_name = setup(data = juice,  target = 'Purchase')
    >>> lr = create_model('lr')
    >>> pred_holdout = predict_model(lr)
    >>> pred_unseen = predict_model(lr, data = unseen_dataframe)
        
    
    estimator: scikit-learn compatible object
        Trained model object
    
    
    data: pandas.DataFrame


###Now we need to finalize the model so it can be saved to disk and reused.

Note that, once it is saved, you no longer need the previous steps when you want to classify new data. finalize_model retrains the model on the full dataset, including the hold-out set, so its scores can differ slightly from those of lr.
Keywords: finalize_model and save_model

In [25]:
final_logistic_regression = finalize_model(lr)

In [26]:
save_model(final_logistic_regression,'Final Logistic Regression 09 Agosto 2021')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[],
                                       target='Class variable',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeri...
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  LogisticRegression(C=1.0, class_we

### Now that we have a saved model, we can load it

In practice, this would be done in another notebook or Python file. For teaching purposes, we are using a single one :)

In [34]:
meu_modelo_salvo = load_model('Final Logistic Regression 09 Agosto 2021')

Transformation Pipeline and Model Successfully Loaded
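
The save and load steps boil down to writing an object to a file and reading it back later, possibly in a different program. PyCaret handles this internally; the sketch below only illustrates the round trip with the standard-library pickle module as a stand-in. The toy model (a dictionary of weights) and the `predict` helper are made up for illustration.

```python
import os
import pickle
import tempfile

# Toy "model": just a few numbers, standing in for a trained pipeline
toy_model = {"weights": [0.8, -0.3], "bias": 0.1}

def predict(model, features):
    """Return 1 when the weighted sum is positive, otherwise 0."""
    total = model["bias"] + sum(
        w * x for w, x in zip(model["weights"], features)
    )
    return 1 if total > 0 else 0

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model.pkl")
    with open(path, "wb") as handle:   # save to disk
        pickle.dump(toy_model, handle)
    with open(path, "rb") as handle:   # load it back
        restored = pickle.load(handle)

print(restored == toy_model)           # True
print(predict(restored, [1.0, 2.0]))   # 1
```

The restored object behaves exactly like the original, which is why the training steps do not need to be repeated after loading.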


### Ta-da! Now just make the predictions / classifications
Just call the **predict_model** function, passing the saved model and the data subset, **data**, that you want to use

In [29]:
predict_model(meu_modelo_salvo, data=diabetes2)

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,Label,Score
200,0,113,80,16,0,31.0,0.874,21,0,0,0.7967
201,1,138,82,0,0,40.1,0.236,28,0,0,0.6073
202,0,108,68,20,0,27.3,0.787,32,0,0,0.8282
203,2,99,70,16,44,20.4,0.235,27,0,0,0.9672
204,6,103,72,32,190,37.7,0.324,55,0,0,0.7091
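
These scores are not identical to the ones obtained earlier with lr, because the saved model went through finalize_model and was refit before being saved. A quick way to see what changed is to compare the two outputs row by row. The sketch below copies the Label and Score values from the two prediction tables in this notebook; the variable names are made up for illustration.

```python
# (Label, Score) per row, copied from the two predict_model outputs
before = {200: (0, 0.8287), 201: (1, 0.5138), 202: (0, 0.8619),
          203: (0, 0.9845), 204: (0, 0.6011)}
after = {200: (0, 0.7967), 201: (0, 0.6073), 202: (0, 0.8282),
         203: (0, 0.9672), 204: (0, 0.7091)}

# Rows where the predicted class itself changed
changed = [row for row in before if before[row][0] != after[row][0]]
print(changed)  # [201]
```

Only row 201 changes class, and it was the borderline case in the first prediction, with a Score barely above 0.5.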
