# PyCaret

<img src="https://pycaret.org/wp-content/uploads/2020/07/1.png" width="60%">

# Index

## 0. Introduction & Installation of PyCaret

### 0-0. What is PyCaret?

<img src="https://miro.medium.com/max/875/1*wT0m1kx8WjY_P7hrM6KDbA.png" width="50%">

##### Concepts
- An open source, low-code machine learning library in Python
- Automates the machine learning workflow
- An end-to-end machine learning and model management tool that speeds up the machine learning experiment cycle
- Aims to reduce the cycle time from hypothesis to insights
- Makes practitioners more productive

- Low-code
    - Software that provides a development environment for building application software through graphical user interfaces and configuration instead of traditional hand-written code

### 0-1. Installing PyCaret

In [None]:
!pip install pycaret==2.0

## 1. Initialize

### 1-1. Packages

In [1]:
## Basic module
import pandas as pd
import numpy as np

## Pycaret classification module
from pycaret.classification import *

### 1-2. Load Data

#### About the data and target
    - Data : Safe Driver Prediction
    - Shape : (30240, 17)
    - Target : whether the company pays out on the auto insurance (0 / 1)
    - In this dataset the target is already encoded as 0 and 1
    - 0 : the company pays out / 1 : the company does not pay out
    - From a cost perspective, the company is better off when it does not have to pay out
    - Binary Classification data

#### 1-2-1. What is Binary Classification?
    - A supervised machine learning technique where the goal is to predict categorical class labels
    - The labels are discrete and unordered, such as Pass/Fail or Positive/Negative



In [3]:
import os
os.chdir("/home/advice/HYJ/")

In [4]:
driver = pd.read_csv("./data/drvier.csv")
driver.head()

Unnamed: 0,ID,target,Gender,EngineHP,credit_history,Years_Experience,annual_claims,Marital_Status,Vehical_type,Miles_driven_annually,size_of_family,Age_bucket,EngineHP_bucket,Years_Experience_bucket,Miles_driven_annually_bucket,credit_history_bucket,State
0,1,1,F,522,656,1,0,Married,Car,14749.0,5,<18,>350,<3,<15k,Fair,IL
1,2,1,F,691,704,16,0,Married,Car,15389.0,6,28-34,>350,15-30,15k-25k,Good,NJ
2,3,1,M,133,691,15,0,Married,Van,9956.0,3,>40,90-160,15-30,<15k,Good,CT
3,4,1,M,146,720,9,0,Married,Van,77323.0,3,18-27,90-160,9-14',>25k,Good,CT
4,5,1,M,128,771,33,1,Married,Van,14183.0,4,>40,90-160,>30,<15k,Very Good,WY


### 1-3. Processing Data
    - Preprocessing step that handles special characters in the data
    - Special characters that cause problems during modeling are replaced with other characters
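
The replacement step can also be written as a single lookup table instead of chained calls. The following is a minimal sketch in plain Python; `BUCKET_MAP` and `clean_labels` are hypothetical names used only for illustration, and the mapping lists just a few of the labels handled in this notebook.

```python
# Hypothetical mapping from raw bucket labels to labels without '<', '>' or a stray quote.
BUCKET_MAP = {
    ">40": "40~",
    "<18": "~18",
    "9-14'": "9-14",
}


def clean_labels(values, mapping):
    """Return a new list where mapped labels are replaced and all others are kept."""
    return [mapping.get(value, value) for value in values]


raw = ["<18", "28-34", ">40", "9-14'"]
cleaned = clean_labels(raw, BUCKET_MAP)
print(cleaned)
```

With pandas, the same dictionary can be passed to `Series.replace`, which keeps every replacement for a column in one place.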

In [5]:
col1 = driver['Age_bucket']
col2 = driver['EngineHP_bucket']
col3 = driver['Years_Experience_bucket']
col4 = driver['Miles_driven_annually_bucket']

In [6]:
col1_new = col1.replace('>40', '40~')
col1_new = col1_new.replace('<18', '~18')

col2_new = col2.replace('>350', '350~')
col2_new = col2_new.replace('<90', '~90')

col3_new = col3.replace('<3', '3~')
col3_new = col3_new.replace('>30', '30~')
col3_new = col3_new.replace("3-8'", '3-8')
col3_new = col3_new.replace("9-14'", '9-14')

col4_new = col4.replace('<15k', '~15k')
col4_new = col4_new.replace('>25k', '25k~')

In [7]:
driver['Age_bucket'] = col1_new
driver['EngineHP_bucket'] = col2_new
driver['Years_Experience_bucket'] = col3_new
driver['Miles_driven_annually_bucket'] = col4_new

driver.head()

Unnamed: 0,ID,target,Gender,EngineHP,credit_history,Years_Experience,annual_claims,Marital_Status,Vehical_type,Miles_driven_annually,size_of_family,Age_bucket,EngineHP_bucket,Years_Experience_bucket,Miles_driven_annually_bucket,credit_history_bucket,State
0,1,1,F,522,656,1,0,Married,Car,14749.0,5,~18,350~,3~,~15k,Fair,IL
1,2,1,F,691,704,16,0,Married,Car,15389.0,6,28-34,350~,15-30,15k-25k,Good,NJ
2,3,1,M,133,691,15,0,Married,Van,9956.0,3,40~,90-160,15-30,~15k,Good,CT
3,4,1,M,146,720,9,0,Married,Van,77323.0,3,18-27,90-160,9-14,25k~,Good,CT
4,5,1,M,128,771,33,1,Married,Van,14183.0,4,40~,90-160,30~,~15k,Very Good,WY


### 1-4. Setting up

#### - Creates the transformation pipeline that prepares the data for modeling
#### - Mandatory Parameter
    - Pandas Dataframe
    - Name of the "target" column
#### - setup(data, target = "target_name")
#### - Initialize the setup in other environment
    - (In Non-Notebook env) : setup(data, target, html = False)
    - (Remote runs like Kaggle or GitHub Actions) : setup(data, target, silent = True)
#### - For the other options and parameters available in setup()
    - See pycaret.org
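
Among other things, setup() holds out part of the data as a test set, and session_id makes that split reproducible. The sketch below illustrates the idea in plain Python. It assumes a 70/30 split, which is consistent with the 9,072-row hold-out output later in this notebook; `train_test_split_indices` is a hypothetical helper, not PyCaret's own code.

```python
import random


def train_test_split_indices(n_rows, train_size=0.7, seed=1004):
    """Shuffle row indices with a fixed seed and split them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle every run
    cut = round(n_rows * train_size)       # number of training rows
    return indices[:cut], indices[cut:]


train_idx, test_idx = train_test_split_indices(30240)
print(len(train_idx), len(test_idx))
```

Calling the helper again with the same seed returns the same split, which is the role session_id plays in the setup() call below.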

In [25]:
# initialize the setup (in Notebook env)
exp_clf = setup(driver, target = 'target', ignore_features=["ID"], session_id=1004)

Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,1004
1,Target Type,Binary
2,Label Encoded,
3,Original Data,"(30240, 17)"
4,Missing Values,True
5,Numeric Features,5
6,Categorical Features,11
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,


## 2. Model Training

### 2-1. Compare models

#### - Compares all models in the model library before a specific model is created

#### - Trains every model in the model library and scores each one with cross-validation (CV)

#### - The output is a score grid showing Accuracy, AUC, Recall, Precision, F1, Kappa, and so on

#### - compare_models()

#### - Other options
    - Return best model based on specific score(ex. Recall)
        - compare_models(sort = 'Recall') ## default is 'Accuracy'
    - Compare all models on 5 fold cross validation
        - compare_models(fold = 5)
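
The fold option controls how the training data is partitioned for cross-validation. Below is a minimal sketch of plain k-fold index splitting. It is only an illustration under simplifying assumptions: `k_fold_indices` is a hypothetical helper, and it does no shuffling or class stratification, which a real classification workflow would normally add.

```python
def k_fold_indices(n_rows, n_folds=10):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(23, n_folds=5))
for train, validation in folds:
    print(len(train), len(validation))
```

Each row lands in exactly one validation fold, and the reported score for a model is the average over the folds.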

In [11]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
0,Logistic Regression,0.7075,0.4976,1.0,0.7075,0.8287,0.0,0.0,0.0963
1,Ridge Classifier,0.7075,0.0,1.0,0.7075,0.8287,0.0,0.0,0.0805
2,Linear Discriminant Analysis,0.7075,0.5046,1.0,0.7075,0.8287,0.0,0.0,0.28
3,Ada Boost Classifier,0.7072,0.5045,0.9989,0.7076,0.8284,0.0008,0.0072,1.168
4,Naive Bayes,0.7069,0.5082,0.9981,0.7076,0.8281,0.0005,0.0025,0.0225
5,Gradient Boosting Classifier,0.7069,0.5062,0.9971,0.708,0.828,0.003,0.0167,4.259
6,CatBoost Classifier,0.7055,0.5067,0.9929,0.7082,0.8267,0.0045,0.0166,14.73
7,Light Gradient Boosting Machine,0.7048,0.5039,0.9933,0.7076,0.8264,0.0005,0.0007,0.2966
8,Extreme Gradient Boosting,0.6898,0.5063,0.9556,0.708,0.8134,0.003,0.0052,5.901
9,Extra Trees Classifier,0.6606,0.4992,0.8913,0.7061,0.7879,-0.0075,-0.0088,0.5056


#### 2-1-1. Compare specific models

##### * Blacklist certain models
    - compare_models(blacklist = ['model1', 'model2'])

In [None]:
black_list = compare_models(blacklist = ['dt', 'svm'])

##### * Compare specific models
    - compare_models(whitelist = ['model1', 'model2'])

In [None]:
best_specific = compare_models(whitelist = ['lr', 'lda', 'ridge'])

##### * Return the top 3 models (ranked by Accuracy unless sort is set)
    - compare_models(n_select = 3)

In [None]:
top3 = compare_models(sort = 'AUC', n_select = 3) # sort defaults to 'Accuracy'; here the top 3 are ranked by AUC

### 2-2. Create model

#### - Creates a specific model of your choice

#### - create_model('model_name')

#### - Other options
    - train model using 5 fold CV
        - create_model('model', fold = 5)
    - train model without CV
        - create_model('model', cross_validation = False)
    - train xgboost model with max depth = 10
        - create_model('xgboost', max_depth = 10)
    - train multiple lightgbm models with different learning_rate values
        - lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1,1,0.1)]

In [12]:
# check the model library to see all models
models()

Unnamed: 0_level_0,Name,Reference,Turbo
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
lr,Logistic Regression,sklearn.linear_model.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors.KNeighborsClassifier,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model.SGDClassifier,True
rbfsvm,SVM - Radial Kernel,sklearn.svm.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process.GPC,False
mlp,MLP Classifier,sklearn.neural_network.MLPClassifier,False
ridge,Ridge Classifier,sklearn.linear_model.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble.RandomForestClassifier,True


In [26]:
lightgbm = create_model('lightgbm')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7062,0.5001,0.996,0.7078,0.8275,0.0012,0.0059
1,0.7048,0.4961,0.9947,0.7072,0.8266,-0.003,-0.014
2,0.7071,0.5126,0.9967,0.7083,0.8281,0.0044,0.0218
3,0.7057,0.5146,0.992,0.7086,0.8267,0.0069,0.0231
4,0.7067,0.5048,0.996,0.7081,0.8277,0.0035,0.0163
5,0.7052,0.5111,0.9953,0.7073,0.827,-0.002,-0.0101
6,0.7029,0.5063,0.992,0.7065,0.8252,-0.0044,-0.0172
7,0.7018,0.5047,0.99,0.7064,0.8245,-0.0072,-0.0256
8,0.7089,0.5262,0.996,0.7097,0.8288,0.0148,0.0571
9,0.7027,0.5001,0.9927,0.7063,0.8253,-0.008,-0.0347


### 2-3. Tune model

#### - Tunes the hyperparameters of a model that has already been created

#### - tune_model(model)

#### - Other options
    - tune hyperparameters with increased n_iter
        - tune_model(model, n_iter = 50)
    - tune hyperparameters to optimize AUC
        - tune_model(rf, optimize = 'AUC') ## default is 'Accuracy'
    - tune hyperparameters with custom_grid
        - params = {"max_depth": np.random.randint(1, int(len(data.columns)*.85), 20),
                    "max_features": np.random.randint(1, len(data.columns), 20),
                    "min_samples_leaf": [2,3,4,5,6],
                    "criterion": ["gini", "entropy"]
                    }
        - tuned_dt_custom = tune_model(model, custom_grid = params)
    - tune multiple models dynamically
        - top3 = compare_models(n_select = 3)
        - tuned_top3 = [tune_model(i) for i in top3]
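
The n_iter and custom_grid options are easier to reason about with a picture of a random search in mind: draw n_iter parameter combinations from the grid, score each one, and keep the best. The sketch below assumes that strategy; `random_search` and `toy_score` are hypothetical stand-ins, with `toy_score` replacing a real cross-validated metric.

```python
import random


def random_search(param_grid, score_fn, n_iter=10, seed=0):
    """Sample n_iter parameter combinations at random and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [2, 3, 4, 5, 6]}


def toy_score(params):
    # Hypothetical score that peaks at max_depth=6, min_samples_leaf=3.
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 3)


best_params, best_score = random_search(grid, toy_score, n_iter=50, seed=1004)
print(best_params, best_score)
```

Raising n_iter tries more combinations and so improves the odds of landing near the best setting, at the cost of a longer run.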

In [14]:
tuned_model = tune_model(lightgbm)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7057,0.5032,0.9933,0.7081,0.8268,0.0064,0.0235
1,0.7062,0.5064,0.996,0.7078,0.8275,0.0012,0.0059
2,0.7085,0.5195,0.9967,0.7093,0.8288,0.0112,0.0483
3,0.7052,0.5005,0.9933,0.7079,0.8267,0.002,0.0076
4,0.7081,0.5045,0.9947,0.7095,0.8282,0.0129,0.0469
5,0.7071,0.4911,0.9947,0.7088,0.8278,0.0084,0.0324
6,0.7071,0.5118,0.9947,0.7088,0.8278,0.0084,0.0324
7,0.7027,0.4827,0.9913,0.7067,0.8251,-0.0054,-0.0202
8,0.7051,0.5038,0.994,0.7076,0.8267,0.0006,0.0026
9,0.7046,0.4925,0.9927,0.7076,0.8262,0.001,0.0038


## 3. Model Analysis

### 3-1. Plot model

#### - Analyzes model performance from several angles through plots such as AUC, confusion matrix, and decision boundary

#### - Takes a trained model object and returns a plot based on the test / hold-out set

#### - plot_model(model, plot = 'parameter')
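
As background for reading the AUC plot, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. `roc_auc` is a hypothetical helper written for illustration; it is not what plot_model calls internally.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

An AUC close to 0.5 means the scores rank positives above negatives about as often as chance would.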

In [None]:
# AUC plot
plot_model(tuned_model, plot = 'auc')

In [None]:
# Confusion matrix plot
plot_model(tuned_model, plot = 'confusion_matrix')

In [None]:
# Decision Boundary plot
plot_model(tuned_model, plot = 'boundary')

In [None]:
# Feature importance plot
plot_model(tuned_model, plot = 'feature')

### 3-2. Evaluate model

#### - Built on plot_model(); shows every plot available for the given model in one place

#### - evaluate_model(model)

In [None]:
evaluate_model(lightgbm)

### 3-3. Optimize Threshold

#### - Optimizes the probability threshold

#### - optimize_threshold(model, true_positive = , true_negative = , false_positive = , false_negative = )

#### - Takes a weight for each element of the confusion matrix

#### - The weights determine which elements matter most

#### - Example) the code below puts weights on TN and FN
    >> FN gets a negative weight (-9) because the company ends up paying when it did not have to, which is a loss for the company
    >> TN gets a positive weight (5) because paying when payment is due is the correct outcome
    
#### - Because reducing the company's losses matters more, FN gets the negative weight with the larger magnitude
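
The weighting idea can be sketched as a small cost function: for each candidate threshold, count TP, TN, FP, and FN, multiply each count by its weight, and keep the threshold with the highest total. The code below is an illustration under assumptions, not PyCaret's implementation: `cost_at_threshold` and `best_threshold` are hypothetical helpers, unspecified weights default to 0, and the toy labels and scores are made up.

```python
def cost_at_threshold(y_true, scores, threshold, tp=0, tn=0, fp=0, fn=0):
    """Sum the weights of the confusion-matrix cells produced by a given threshold."""
    total = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            total += tp
        elif predicted == 0 and truth == 0:
            total += tn
        elif predicted == 1 and truth == 0:
            total += fp
        else:  # predicted 0 but truth is 1
            total += fn
    return total


def best_threshold(y_true, scores, thresholds, **weights):
    """Return the first threshold that maximizes the weighted total."""
    return max(thresholds,
               key=lambda th: cost_at_threshold(y_true, scores, th, **weights))


y_true = [0, 0, 1, 1, 1, 0]
scores = [0.30, 0.55, 0.48, 0.70, 0.90, 0.20]
thresholds = [i / 100 for i in range(101)]
best = best_threshold(y_true, scores, thresholds, tn=5, fn=-9)
print(best, cost_at_threshold(y_true, scores, best, tn=5, fn=-9))
```

Because FN carries the larger penalty, the search avoids thresholds that push true 1s below the cut, even when a higher threshold would gain one more TN.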

In [56]:
optimize_threshold(lightgbm, true_negative = 5, false_negative = -9)

Optimized Probability Threshold: 0.44 | Optimized Cost Function: 3


## 4. Model Deployment

### 4-1. Predict model

#### - predict_model(model)

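#### (See the plot above) With the threshold (0.44) as the boundary, a Score to the left is 0 (must pay) and a Score to the right is 1 (need not pay)

#### The Label follows from the Score: 0 if it falls to the left of the threshold, 1 if it falls to the right

#### - Accuracy and the other scores are computed from the target and Label values

#### - Paying out when payment is not required is a loss, so choosing a lower threshold works in the company's favor

#### - predict_model() uses a default threshold of 0.5; the 0.44 optimized above is used here instead
    - This reduces the chance of a loss for the company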
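
How a threshold turns a Score into a Label can be shown in a few lines. This sketch assumes the Score is the predicted probability of class 1; `label_from_score` and the example scores are hypothetical.

```python
def label_from_score(scores, threshold=0.5):
    """Assign label 1 when the score reaches the threshold, otherwise 0."""
    return [1 if score >= threshold else 0 for score in scores]


scores = [0.42, 0.46, 0.51, 0.73]
default_labels = label_from_score(scores)        # threshold 0.5
lowered_labels = label_from_score(scores, 0.44)  # threshold 0.44
print(default_labels, lowered_labels)
```

Only rows whose score lies between the two thresholds change label, which is why the two predict_model() runs below differ only slightly in their metrics.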

In [57]:
predict_model(lightgbm, probability_threshold=0.44)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Light Gradient Boosting Machine,0.7079,0.5006,0.9995,0.708,0.8289,0.0025,0.0259


Unnamed: 0,EngineHP,credit_history,Years_Experience,Miles_driven_annually,Gender_F,Gender_M,annual_claims_0,annual_claims_1,annual_claims_2,annual_claims_3,...,State_UT,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,target,Label,Score
0,122.0,703.0,1.0,16670.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.7076
1,133.0,797.0,10.0,11605.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,1,0.7301
2,126.0,769.0,16.0,6440.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7777
3,82.0,735.0,20.0,6418.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7983
4,160.0,756.0,16.0,9970.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7426
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9068,147.0,683.0,22.0,8798.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.6273
9069,147.0,681.0,15.0,6937.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.6961
9070,125.0,710.0,25.0,6493.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.7803
9071,116.0,709.0,21.0,7129.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,1,0.7050


In [58]:
predict_model(lightgbm, probability_threshold=0.5)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Light Gradient Boosting Machine,0.7076,0.5006,0.9977,0.7083,0.8284,0.0047,0.0263


Unnamed: 0,EngineHP,credit_history,Years_Experience,Miles_driven_annually,Gender_F,Gender_M,annual_claims_0,annual_claims_1,annual_claims_2,annual_claims_3,...,State_UT,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,target,Label,Score
0,122.0,703.0,1.0,16670.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.7076
1,133.0,797.0,10.0,11605.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,1,0.7301
2,126.0,769.0,16.0,6440.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7777
3,82.0,735.0,20.0,6418.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7983
4,160.0,756.0,16.0,9970.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.7426
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9068,147.0,683.0,22.0,8798.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.6273
9069,147.0,681.0,15.0,6937.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0.6961
9070,125.0,710.0,25.0,6493.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.7803
9071,116.0,709.0,21.0,7129.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,1,0.7050


### 4-2. Finalize model

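#### - The last step of the experiment; shows a summary of the chosen final model

#### - Before the model is exported, it is retrained on the complete dataset, including the hold-out set, using the steps from the earlier stages (setup through tune)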
    
#### - finalize_model(model)

In [None]:
finalize_model(lightgbm)

### 4-3. Save & Load model

#### - Saves the final model and loads it again for later use

#### - save_model(model, 'file_name')
    - Saves the final model as a pickle file in the current working directory
    
#### - load_model('file_name')
    - Loads a previously saved model
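
The save and load round trip can be illustrated with the standard library's pickle module. This is only a sketch of the idea: the dictionary stands in for a trained model, the file name is hypothetical, and PyCaret's own save_model() may store additional pipeline information.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model object (hypothetical contents).
model = {"name": "toy_model", "threshold": 0.44}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Save: serialize the object to a binary file.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: read the file back into an equivalent object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

As with any pickle file, a saved model should only be loaded from a source you trust.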

In [None]:
save_model(lightgbm, 'lightgbm_model_0824')

In [None]:
load_model('lightgbm_model_0824')

## 5. Pros & Cons

<img src="https://miro.medium.com/max/4180/1*d7n1wOYUE3e-fE6NVqTryg.png" width="50%">

### 5-1. Pros

#### - Compared with the other AutoML libraries shown in the figure above

#### - It is fast, and the progress and elapsed time from data processing through model training are easy to follow

#### - Handles missing values well

### 5-2. Cons

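#### - Data containing special characters breaks certain models, so it has to be cleaned beforehand

#### - Limited when it comes to handling big data with sophisticated models (a limitation of AutoML in general)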