<a href="https://colab.research.google.com/github/ymuto0302/ML/blob/main/MachineLearning_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Making Classification Problems Easier with PyCaret

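When training and test data are drawn randomly and independently from a dataset, each trial ends up with
- different contents in the training and test data, and
- as a result, a different accuracy.

Cross validation is a technique for addressing this problem.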

<strike>
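Objectives:
- Learn that once a model is defined, the subsequent training and prediction can be run in the same form.
- Learn that, for data that is hard to classify, classification performance differs from model to model.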
</strike>

Explanation of cross validation in sklearn:  
https://scikit-learn.org/stable/modules/cross_validation.html
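
The idea behind cross validation can be sketched without any library: the shuffled sample indices are cut into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set. The helper `k_fold_indices` below is a hypothetical name introduced for illustration only; it is not part of scikit-learn or PyCaret.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={sorted(test_idx)} train={sorted(train_idx)}")
```

Because every sample is tested exactly once, averaging the per-fold accuracies gives an estimate that depends far less on one lucky or unlucky split.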

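## Preparing to Use PyCaret
PyCaret depends on many libraries, so installing it takes a fair amount of time.
In addition, the Jinja2 version preinstalled in the Google Colab environment is incompatible, so the latest Jinja2 must be installed as well.

**(Note) After installing PyCaret and Jinja2, you must restart the kernel (the runtime, in Google Colab).**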

In [None]:
# Install PyCaret
# (Note) It has many dependencies, so this takes a while
!pip install pycaret

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret
  Downloading pycaret-2.3.10-py3-none-any.whl (320 kB)
Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Collecting umap-learn
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
Collecting kmodes>=0.10.1
  Downloading kmodes-0.12.1-py2.py3-none-any.whl (20 kB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting mlflow
  Downloading mlflow-1.26.0-py3-none-any.whl (17.8 MB)
Collecting Boruta
  Downloading Boruta-0.3-py3-none-any.whl (56 kB)
Collecting lightg

In [None]:
# To avoid the "jinja2 import error" at run time on Google Colab,
# install the latest Jinja2 library
!pip install -U Jinja2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Jinja2
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Installing collected packages: Jinja2
  Attempting uninstall: Jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed Jinja2-3.1.2


**(IMPORTANT) Restart the kernel (runtime) at this point!!**
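
After the restart, it can be useful to confirm which versions were actually installed. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library; the install log above shows a Python 3.7 wheel, so an older runtime would need a different approach). The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the packages discussed above without failing if one is absent.
for name in ("pycaret", "Jinja2", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```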

---
## Loading the Data
We use the well-known Titanic dataset (the Titanic survival prediction problem).

In [1]:
# Load the Titanic dataset
from pycaret.datasets import get_data
titanic = get_data('titanic') # load the data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [2]:
# Drop "PassengerId", "Name", and "Ticket", which clearly have no bearing on the classification
titanic = titanic.drop(columns=['PassengerId', 'Name', 'Ticket'])
titanic.head()

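| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | | S |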


In [3]:
# Load what we need from PyCaret
from pycaret.classification import *  # wildcard import: brings in every public name



---
## Importing So That All PyCaret Functions Are Available
Because this notebook deals with a classification problem, the functions are loaded from `pycaret.classification`.

---
## (Reference) Inspecting the Data with pandas-profiling's profile_report()

In [4]:
# (Reference) Inspect the contents of the data with profile_report() from pandas-profiling
import pandas_profiling
titanic.profile_report()




---
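## Preprocessing
Preprocessing (`setup()`) takes care of data type inference, missing-value handling, the train/test split, and more.  
The arguments to `setup()` are the dataset and the name of the target variable.

**(Note) After `setup()` runs, a white input box appears; press Enter in it to confirm the inferred data types.**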


In [5]:
# Preprocessing
exp = setup(titanic, target='Survived') # an input box appears; press Enter

| | Description | Value |
|---|---|---|
| 0 | session_id | 2617 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (891, 9) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 2 |
| 7 | Categorical Features | 6 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
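
To make the kind of work that `setup()` automates more concrete, here is a minimal sketch of two of the steps: missing-value imputation and a hold-out split. It is not PyCaret's implementation. The rows are toy values, the helper names `impute` and `hold_out_split` are hypothetical, and the choices of mean imputation, a `not_available` marker, and a 70/30 split are assumptions that appear consistent with the outputs in this notebook.

```python
import random
from statistics import mean

# Toy rows in the spirit of the Titanic columns (values are illustrative only).
rows = [
    {"Age": 22.0, "Embarked": "S", "Survived": 0},
    {"Age": None, "Embarked": "C", "Survived": 1},
    {"Age": 26.0, "Embarked": None, "Survived": 1},
    {"Age": 35.0, "Embarked": "S", "Survived": 1},
    {"Age": None, "Embarked": "Q", "Survived": 0},
    {"Age": 54.0, "Embarked": "S", "Survived": 0},
    {"Age": 2.0, "Embarked": "S", "Survived": 0},
    {"Age": 27.0, "Embarked": "S", "Survived": 1},
    {"Age": 14.0, "Embarked": "C", "Survived": 1},
    {"Age": 4.0, "Embarked": "S", "Survived": 1},
]


def impute(rows, numeric_cols, categorical_cols):
    """Fill numeric gaps with the column mean and categorical gaps with a marker."""
    filled = [dict(r) for r in rows]  # copy so the input is left untouched
    for col in numeric_cols:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    for col in categorical_cols:
        for r in filled:
            if r[col] is None:
                r[col] = "not_available"
    return filled


def hold_out_split(rows, train_size=0.7, seed=0):
    """Shuffle the rows and cut them into a training part and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


clean = impute(rows, numeric_cols=["Age"], categorical_cols=["Embarked"])
train, test = hold_out_split(clean, train_size=0.7, seed=0)
print(len(train), "training rows,", len(test), "test rows")
```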


---
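## Comparison of Several Models
All classifiers available in PyCaret are fitted and their performance is evaluated. The resulting table lists the models in descending order of performance.

Note, however, that the hyperparameters of the individual models are not optimized.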


In [6]:
# Compare several models
compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.8235 | 0.86 | 0.7361 | 0.7887 | 0.7592 | 0.6207 | 0.6233 | 0.334 |
| gbc | Gradient Boosting Classifier | 0.8235 | 0.8681 | 0.694 | 0.8186 | 0.7481 | 0.6144 | 0.6217 | 0.166 |
| lightgbm | Light Gradient Boosting Machine | 0.8155 | 0.8715 | 0.7317 | 0.778 | 0.7505 | 0.6048 | 0.6089 | 0.089 |
| ridge | Ridge Classifier | 0.809 | 0.0 | 0.7067 | 0.7708 | 0.7355 | 0.5871 | 0.5897 | 0.018 |
| ada | Ada Boost Classifier | 0.8059 | 0.8458 | 0.7487 | 0.7508 | 0.7472 | 0.5902 | 0.5928 | 0.125 |
| lda | Linear Discriminant Analysis | 0.8041 | 0.8503 | 0.6859 | 0.7739 | 0.7252 | 0.5746 | 0.5786 | 0.034 |
| rf | Random Forest Classifier | 0.7865 | 0.8576 | 0.7107 | 0.7366 | 0.7183 | 0.5474 | 0.5523 | 0.512 |
| et | Extra Trees Classifier | 0.7784 | 0.8413 | 0.69 | 0.7184 | 0.7 | 0.5254 | 0.529 | 0.485 |
| dt | Decision Tree Classifier | 0.7753 | 0.7601 | 0.7024 | 0.714 | 0.7045 | 0.5239 | 0.5277 | 0.02 |
| knn | K Neighbors Classifier | 0.6998 | 0.7222 | 0.531 | 0.638 | 0.5756 | 0.3467 | 0.3529 | 0.123 |


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=2617, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
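
The ranking depends on the metric used for sorting. As a small illustration, the sketch below re-sorts the Accuracy and AUC values copied from the table above by AUC instead of Accuracy, using plain Python. (PyCaret's `compare_models()` is understood to accept a `sort` argument for the same purpose; treat that as an assumption to check against the documentation of your installed version.)

```python
# (model id, Accuracy, AUC) copied from the compare_models() table above.
results = [
    ("lr", 0.8235, 0.8600),
    ("gbc", 0.8235, 0.8681),
    ("lightgbm", 0.8155, 0.8715),
    ("ridge", 0.8090, 0.0),
    ("ada", 0.8059, 0.8458),
    ("lda", 0.8041, 0.8503),
    ("rf", 0.7865, 0.8576),
    ("et", 0.7784, 0.8413),
    ("dt", 0.7753, 0.7601),
    ("knn", 0.6998, 0.7222),
]

# Sort by the third field (AUC), best first.
by_auc = sorted(results, key=lambda row: row[2], reverse=True)
for rank, (model_id, accuracy, auc) in enumerate(by_auc, start=1):
    print(f"{rank:2d}. {model_id:<9} AUC={auc:.4f} Accuracy={accuracy:.4f}")
```

Sorted this way, `lightgbm` moves to the top and `rf` rises from seventh to fourth. The Ridge Classifier falls to the bottom because its AUC is reported as 0.0, presumably because that model does not provide class probabilities.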

### Building the Model
With the results of `compare_models()` in mind, we build a model using the Random Forest Classifier (`rf`).

In [7]:
# Use Random Forest as the classification model
model = create_model('rf')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7619 | 0.8579 | 0.7917 | 0.6552 | 0.717 | 0.5146 | 0.5215 |
| 1 | 0.8889 | 0.9599 | 0.8333 | 0.8696 | 0.8511 | 0.7625 | 0.763 |
| 2 | 0.6984 | 0.8109 | 0.75 | 0.5806 | 0.6545 | 0.3945 | 0.4047 |
| 3 | 0.7419 | 0.8166 | 0.5652 | 0.6842 | 0.619 | 0.4266 | 0.431 |
| 4 | 0.8065 | 0.8728 | 0.6667 | 0.8 | 0.7273 | 0.5792 | 0.585 |
| 5 | 0.8387 | 0.8871 | 0.6667 | 0.8889 | 0.7619 | 0.6437 | 0.6589 |
| 6 | 0.7581 | 0.8388 | 0.6667 | 0.6957 | 0.6809 | 0.4862 | 0.4865 |
| 7 | 0.7097 | 0.7538 | 0.5417 | 0.65 | 0.5909 | 0.3688 | 0.3725 |
| 8 | 0.9032 | 0.9578 | 0.875 | 0.875 | 0.875 | 0.7961 | 0.7961 |
| 9 | 0.7581 | 0.8207 | 0.75 | 0.6667 | 0.7059 | 0.5016 | 0.5041 |
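
The ten rows above are the per-fold results of a 10-fold cross validation. A short sketch using only the standard library shows how they can be summarized; the Accuracy values are copied from the table, and their mean reproduces the 0.7865 that `compare_models()` reported for `rf`.

```python
from statistics import mean, stdev

# Per-fold Accuracy of the Random Forest, copied from the table above.
rf_fold_accuracy = [0.7619, 0.8889, 0.6984, 0.7419, 0.8065,
                    0.8387, 0.7581, 0.7097, 0.9032, 0.7581]

print(f"mean accuracy      : {mean(rf_fold_accuracy):.4f}")
print(f"sample std. dev.   : {stdev(rf_fold_accuracy):.4f}")
print(f"worst / best fold  : {min(rf_fold_accuracy):.4f} / {max(rf_fold_accuracy):.4f}")
```

The spread between the worst fold (0.6984) and the best fold (0.9032) shows how much a single train/test split can mislead, which is the motivation for averaging over folds.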


### Hyperparameter Optimization
A model includes hyperparameters, which the designer must specify.

In PyCaret, hyperparameter optimization is carried out by random grid search. Here we specify the number of candidate points (`n_iter`) and the optimization criterion (`optimize`).
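
The principle of random grid search can be sketched in a few lines: draw `n_iter` random combinations from a grid of candidate values, score each one, and keep the best. This is a generic illustration, not PyCaret's internal code. The grid values are made up, and `toy_cv_score` is a hypothetical stand-in for a cross-validated accuracy.

```python
import random

# Illustrative grid; these are not the values PyCaret searches over.
param_grid = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_leaf": [1, 2, 4],
}


def toy_cv_score(params):
    """Hypothetical stand-in for a cross-validated accuracy (not a real model)."""
    depth = params["max_depth"] if params["max_depth"] is not None else 12
    penalty = 0.004 * abs(depth - 7) + 0.01 * abs(params["min_samples_leaf"] - 2)
    bonus = 0.00005 * params["n_estimators"]
    return 0.80 - penalty + bonus


def random_search(param_grid, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


best_params, best_score = random_search(param_grid, toy_cv_score, n_iter=20)
print("best parameters:", best_params)
print(f"best score     : {best_score:.4f}")
```

A larger `n_iter` covers more of the grid at the price of a longer run time, which is the trade-off behind the choice of 300 iterations in the cell that follows.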

In [8]:
# Hyperparameter optimization
# Here we run 300 iterations of random grid search, optimizing for Accuracy
tuned_model = tune_model(model, n_iter=300, optimize='Accuracy')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7778 | 0.8024 | 0.625 | 0.75 | 0.6818 | 0.5132 | 0.5183 |
| 1 | 0.873 | 0.8638 | 0.7917 | 0.8636 | 0.8261 | 0.7264 | 0.7281 |
| 2 | 0.8571 | 0.8665 | 0.7917 | 0.8261 | 0.8085 | 0.6947 | 0.6951 |
| 3 | 0.7419 | 0.8094 | 0.5652 | 0.6842 | 0.619 | 0.4266 | 0.431 |
| 4 | 0.8548 | 0.9002 | 0.625 | 1.0 | 0.7692 | 0.6714 | 0.7109 |
| 5 | 0.8548 | 0.9008 | 0.8333 | 0.8 | 0.8163 | 0.6964 | 0.6968 |
| 6 | 0.7742 | 0.8388 | 0.5833 | 0.7778 | 0.6667 | 0.5011 | 0.513 |
| 7 | 0.7742 | 0.7555 | 0.625 | 0.75 | 0.6818 | 0.509 | 0.5141 |
| 8 | 0.9355 | 0.9594 | 0.9167 | 0.9167 | 0.9167 | 0.864 | 0.864 |
| 9 | 0.8387 | 0.8503 | 0.7917 | 0.7917 | 0.7917 | 0.6601 | 0.6601 |
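
To see what the tuning achieved, the per-fold Accuracy before and after tuning can be compared directly. The sketch below copies both columns from the `create_model()` and `tune_model()` outputs in this notebook; pairing them fold by fold assumes that both calls used the same folds, which seems likely here but is not stated in the outputs.

```python
from statistics import mean

# Per-fold Accuracy copied from the create_model() and tune_model() outputs.
base_accuracy = [0.7619, 0.8889, 0.6984, 0.7419, 0.8065,
                 0.8387, 0.7581, 0.7097, 0.9032, 0.7581]
tuned_accuracy = [0.7778, 0.8730, 0.8571, 0.7419, 0.8548,
                  0.8548, 0.7742, 0.7742, 0.9355, 0.8387]

improved = sum(t > b for b, t in zip(base_accuracy, tuned_accuracy))
worsened = sum(t < b for b, t in zip(base_accuracy, tuned_accuracy))

print(f"mean accuracy before tuning: {mean(base_accuracy):.4f}")
print(f"mean accuracy after tuning : {mean(tuned_accuracy):.4f}")
print(f"folds improved / worsened  : {improved} / {worsened}")
```

The mean rises from 0.7865 to 0.8282. Since the hyperparameters were selected on these same folds, the gain is probably somewhat optimistic; the hold-out data gives a more independent check.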


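### Model Evaluation
We evaluate `tuned_model`, the model with optimized hyperparameters.
The model can be evaluated from many angles; only the items that are easy to interpret are listed below.

- AUC: the ROC curve
- Confusion Matrix: the confusion matrix
- Prediction Error: the prediction error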
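
As a reminder of what a confusion matrix contains, the following sketch counts its four cells for a binary problem and derives accuracy, precision, recall, and F1 from them. The labels are toy values chosen for illustration, and `confusion_counts` is a hypothetical helper, not a PyCaret or scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


# Toy labels (illustrative only): 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```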

In [9]:
# Evaluate the model
evaluate_model(tuned_model)


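## Finalizing the Model and Making Predictions
What matters in the end is prediction with the model. We predict on the test data set aside beforehand (by the hold-out method) and compare the predictions with the true values.

Note that `finalize_model()` refits the model on the entire dataset, hold-out portion included, so the scores reported by a subsequent `predict_model(final_model)` on that hold-out data may be optimistic. Calling `predict_model(tuned_model)` before finalizing gives a cleaner hold-out estimate.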

In [10]:
# Finalize the model
final_model = finalize_model(tuned_model)

# Predict
predict_model(final_model)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.806 | 0.8449 | 0.7087 | 0.7684 | 0.7374 | 0.5839 | 0.5851 |


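| | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | SibSp_0 | SibSp_1 | SibSp_2 | SibSp_3 | ... | Cabin_G6 | Cabin_T | Cabin_not_available | Embarked_C | Embarked_Q | Embarked_S | Embarked_not_available | Survived | Label | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44.00000 | 27.720800 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.9471 |
| 1 | 29.49818 | 7.750000 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0.8851 |
| 2 | 44.00000 | 8.050000 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.8851 |
| 3 | 24.00000 | 15.850000 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.5897 |
| 4 | 4.00000 | 13.416700 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.5897 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 263 | 32.00000 | 56.495800 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 0 | 0.8851 |
| 264 | 63.00000 | 77.958298 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.9471 |
| 265 | 41.00000 | 39.687500 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.8889 |
| 266 | 24.00000 | 26.000000 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 | 1 | 0.9471 |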
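
The comparison with the true values amounts to checking the `Label` column (the prediction) against the `Survived` column (the truth). The sketch below does this for the nine rows that are visible in the output above; the full hold-out set is larger, so the resulting ratio is only an illustration and is not the hold-out accuracy. The reading of `Score` as the confidence of the predicted label is an assumption.

```python
# (Survived, Label, Score) for the nine rows displayed above.
shown_rows = [
    (1, 1, 0.9471),
    (0, 0, 0.8851),
    (0, 0, 0.8851),
    (1, 1, 0.5897),
    (1, 1, 0.5897),
    (1, 0, 0.8851),
    (1, 1, 0.9471),
    (0, 0, 0.8889),
    (1, 1, 0.9471),
]

# A prediction is correct when the predicted label equals the true label.
correct = sum(survived == label for survived, label, _ in shown_rows)
misses = [row for row in shown_rows if row[0] != row[1]]

print(f"{correct} of {len(shown_rows)} displayed rows are predicted correctly")
print("misclassified rows (Survived, Label, Score):", misses)
```

Eight of the nine displayed rows match. The single miss (row 263) is a passenger who survived but was predicted not to, with a fairly high score of 0.8851, a reminder that a high score does not guarantee a correct prediction.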
