# Finding the Best Algorithm and Parameters
We use the iris classification program as a running example.  

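## Concerns when using iris.ipynb in production
1. Algorithm selection  
Is there another algorithm that achieves higher accuracy?  
→ Compare the accuracy of each algorithm.

2. Algorithm evaluation  
Are the results robust to the data used, or do they depend on one particular split?  
→ Cross-validation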

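### Cross-validation
Evaluate each algorithm on multiple data splits and choose one whose accuracy stays consistently high (that is, one that is robust).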

In [8]:
# Compare the accuracy of every classifier

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import all_estimators

# Load the iris data
iris_data = pd.read_csv("csv/iris.csv", encoding="utf-8")

# Split the iris data into labels (y) and input features (x)
y = iris_data.loc[:, "Name"]
x = iris_data.loc[:, ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, shuffle=True)

# Get every estimator of type "classifier" as (name, class) pairs
all_algorithms = all_estimators(type_filter="classifier")

for (name, algorithm) in all_algorithms:
    # Instantiate with default arguments
    clf = algorithm()

    # Train, predict on the test set, and print the accuracy
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    print("{0} accuracy = {1}".format(name, accuracy_score(y_test, y_predict)))


AdaBoostClassifier accuracy = 0.9333333333333333
BaggingClassifier accuracy = 0.9333333333333333
BernoulliNB accuracy = 0.26666666666666666
CalibratedClassifierCV accuracy = 0.7333333333333333
CategoricalNB accuracy = 0.9666666666666667


TypeError: _BaseChain.__init__() missing 1 required positional argument: 'base_estimator'

ClassifierChain requires the base_estimator argument.
→ Skip it for now.

In [12]:
# Compare the accuracy of every classifier

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import all_estimators

# Load the iris data
iris_data = pd.read_csv("csv/iris.csv", encoding="utf-8")

# Split the iris data into labels (y) and input features (x)
y = iris_data.loc[:, "Name"]
x = iris_data.loc[:, ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, shuffle=True)

# Get every estimator of type "classifier" as (name, class) pairs
all_algorithms = all_estimators(type_filter="classifier")

for (name, algorithm) in all_algorithms:
    # ClassifierChain cannot be built without arguments, so skip it
    if name == 'ClassifierChain':
        continue

    # Train, predict on the test set, and print the accuracy
    clf = algorithm()
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    print("{0} accuracy = {1}".format(name, accuracy_score(y_test, y_predict)))


AdaBoostClassifier accuracy = 0.9333333333333333
BaggingClassifier accuracy = 0.9333333333333333
BernoulliNB accuracy = 0.26666666666666666
CalibratedClassifierCV accuracy = 0.8666666666666667
CategoricalNB accuracy = 0.9333333333333333
ComplementNB accuracy = 0.7
DecisionTreeClassifier accuracy = 0.9333333333333333
DummyClassifier accuracy = 0.26666666666666666
ExtraTreeClassifier accuracy = 0.9333333333333333
ExtraTreesClassifier accuracy = 0.9333333333333333
GaussianNB accuracy = 0.9333333333333333
GaussianProcessClassifier accuracy = 0.9666666666666667
GradientBoostingClassifier accuracy = 0.9333333333333333
HistGradientBoostingClassifier accuracy = 0.9333333333333333
KNeighborsClassifier accuracy = 0.9666666666666667
LabelPropagation accuracy = 0.9333333333333333
LabelSpreading accuracy = 0.9333333333333333
LinearDiscriminantAnalysis accuracy = 0.9666666666666667
LinearSVC accuracy = 0.9
LogisticRegression accuracy = 0.9333333333333333
LogisticRegressionCV accuracy = 0.9666666666666667
MLPClassifier accuracy = 0.9333333333333333


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

TypeError: MultiOutputClassifier.__init__() missing 1 required positional argument: 'estimator'

Other estimators also require constructor arguments.
→ Use a try statement to skip any estimator that raises an error.

In [18]:
# Compare the accuracy of every classifier

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import all_estimators

# Load the iris data
iris_data = pd.read_csv("csv/iris.csv", encoding="utf-8")

# Split the iris data into labels (y) and input features (x)
y = iris_data.loc[:, "Name"]
x = iris_data.loc[:, ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, shuffle=True)

# Get every estimator of type "classifier" as (name, class) pairs
all_algorithms = all_estimators(type_filter="classifier")

for (name, algorithm) in all_algorithms:
    try:
        # Train, predict on the test set, and print the accuracy
        clf = algorithm()
        clf.fit(x_train, y_train)
        y_predict = clf.predict(x_test)

        print("{0} accuracy = {1}".format(name, accuracy_score(y_test, y_predict)))
    except Exception:
        # Skip estimators that cannot be built or fitted with default arguments
        pass


AdaBoostClassifier accuracy = 1.0
BaggingClassifier accuracy = 1.0
BernoulliNB accuracy = 0.3
CalibratedClassifierCV accuracy = 0.9333333333333333
CategoricalNB accuracy = 0.9
ComplementNB accuracy = 0.6
DecisionTreeClassifier accuracy = 0.9
DummyClassifier accuracy = 0.3
ExtraTreeClassifier accuracy = 0.9666666666666667
ExtraTreesClassifier accuracy = 1.0
GaussianNB accuracy = 1.0
GaussianProcessClassifier accuracy = 1.0
GradientBoostingClassifier accuracy = 1.0
HistGradientBoostingClassifier accuracy = 0.9
KNeighborsClassifier accuracy = 1.0
LabelPropagation accuracy = 1.0
LabelSpreading accuracy = 1.0
LinearDiscriminantAnalysis accuracy = 1.0
LinearSVC accuracy = 0.9666666666666667
LogisticRegression accuracy = 1.0
LogisticRegressionCV accuracy = 1.0
MLPClassifier accuracy = 0.9666666666666667
MultinomialNB accuracy = 0.8666666666666667
NearestCentroid accuracy = 0.9
NuSVC accuracy = 1.0
PassiveAggressiveClassifier accuracy = 0.9333333333333333
Perceptron accuracy = 0.9666666666666667
QuadraticDiscriminantAnalysis accuracy = 1.0
RadiusNeighborsClassifier accuracy = 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

RandomForestClassifier accuracy = 1.0
RidgeClassifier accuracy = 0.8666666666666667
RidgeClassifierCV accuracy = 0.8666666666666667
SGDClassifier accuracy = 0.6
SVC accuracy = 1.0

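### Running k-fold cross-validation

Split the data into k groups, train on k-1 of them, and evaluate on the remaining one. Repeat this k times so that each group is used for evaluation exactly once.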
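
To make the splitting concrete, here is a small sketch that prints the train and test indices `KFold` produces for ten hypothetical samples with k = 5. Shuffling is turned off here only so the pattern is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified here only by their index.
samples = np.arange(10)

# shuffle=False keeps the folds contiguous so the pattern is easy to read.
kfold = KFold(n_splits=5, shuffle=False)
folds = list(kfold.split(samples))

for i, (train_idx, test_idx) in enumerate(folds):
    # Each sample index appears in the test part of exactly one fold.
    print("fold", i, "train =", train_idx, "test =", test_idx)
```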

In [23]:
# 5-fold cross-validation
import pandas as pd
import numpy as np
from sklearn.utils import all_estimators
from sklearn.model_selection import KFold
import warnings
from sklearn.model_selection import cross_val_score

# Load the iris data
iris_data = pd.read_csv("csv/iris.csv", encoding="utf-8")

# Split the data into labels (y) and input features (x)
y = iris_data.loc[:, "Name"]
x = iris_data.loc[:, ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]

# Suppress warnings, then get every estimator of type "classifier"
warnings.filterwarnings('ignore')
all_algorithms = all_estimators(type_filter="classifier")

# Splitter object for k-fold cross-validation
kfold_cv = KFold(n_splits=5, shuffle=True)

for (name, algorithm) in all_algorithms:
    try:
        clf = algorithm()

        # Only evaluate algorithms that provide a score method
        if hasattr(clf, "score"):

            # One accuracy value per fold
            scores = cross_val_score(clf, x, y, cv=kfold_cv)
            print('{} accuracy='.format(name))
            print(scores)
    except Exception:
        # Skip estimators that cannot be built or fitted with default arguments
        pass

AdaBoostClassifier accuracy=
[0.93333333 0.93333333 0.93333333 0.93333333 1.        ]
BaggingClassifier accuracy=
[0.96666667 0.9        0.9        0.96666667 1.        ]
BernoulliNB accuracy=
[0.3        0.33333333 0.26666667 0.3        0.23333333]
CalibratedClassifierCV accuracy=
[1.         0.86666667 0.96666667 0.76666667 0.86666667]
CategoricalNB accuracy=
[0.96666667 0.9        0.93333333 0.9        0.9       ]
ComplementNB accuracy=
[0.6        0.73333333 0.66666667 0.73333333 0.6       ]
DecisionTreeClassifier accuracy=
[0.96666667 0.93333333 0.9        0.96666667 0.93333333]
DummyClassifier accuracy=
[0.3        0.26666667 0.2        0.26666667 0.23333333]
ExtraTreeClassifier accuracy=
[0.93333333 0.9        1.         0.96666667 0.9       ]
ExtraTreesClassifier accuracy=
[0.96666667 0.93333333 0.9        1.         0.93333333]
GaussianNB accuracy=
[0.96666667 1.         0.9        0.96666667 0.93333333]
GaussianProcessClassifier accuracy=
[0.93333333 0.93333333 0.9        1.         0.93333333]
GradientBoostingClassifier accuracy=
[0.96666
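
Five numbers per algorithm are hard to compare by eye. One option, sketched here under the assumption that the bundled `load_iris` data stands in for `csv/iris.csv`, is to reduce each score array to its mean and standard deviation. The three classifiers are an arbitrary selection for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for csv/iris.csv.
x, y = load_iris(return_X_y=True)

# A fixed random_state gives every classifier the same five folds.
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

summary = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, x, y, cv=kfold_cv)
    # A high mean with a low standard deviation suggests a robust algorithm.
    summary[name] = (scores.mean(), scores.std())
    print("{0}: mean = {1:.3f}, std = {2:.3f}".format(name, scores.mean(), scores.std()))
```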

These results use each algorithm's default parameters.
→ We also want to find the optimal parameters.

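### Finding the optimal hyperparameters with grid search

#### Grid search
One method for tuning hyperparameters.  
For every combination of the specified parameter values, measure the accuracy and select the combination with the highest accuracy.  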
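
The procedure can be written out by hand to show what the search does. This is a rough sketch, not how `GridSearchCV` is implemented: it assumes the bundled `load_iris` data in place of `csv/iris.csv`, and the small grid below is hypothetical, chosen only to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for csv/iris.csv.
x, y = load_iris(return_X_y=True)

# A deliberately small, hypothetical grid.
param_grid = [
    {"C": [1, 10], "kernel": ["linear"]},
    {"C": [1, 10], "kernel": ["rbf"], "gamma": [0.001, 0.0001]},
]

# A fixed random_state scores every candidate on the same five folds.
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for params in ParameterGrid(param_grid):
    # Mean cross-validated accuracy for this parameter combination.
    scores = cross_val_score(SVC(**params), x, y, cv=kfold_cv)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean accuracy.
best_score, best_params = max(results, key=lambda item: item[0])
print("candidates tried:", len(results))
print("best mean accuracy:", best_score)
print("best parameters:", best_params)
```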

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# Load the iris data
iris_data = pd.read_csv("csv/iris.csv", encoding="utf-8")

# Split the data into labels (y) and input features (x)
y = iris_data.loc[:, "Name"]
x = iris_data.loc[:, ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, shuffle=True)

# Parameter combinations to try in the grid search
parameters = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "kernel": ["rbf"], "gamma": [0.001, 0.0001]},
    {"C": [1, 10, 100, 1000], "kernel": ["sigmoid"], "gamma": [0.001, 0.0001]}
]

# Grid search with 5-fold cross-validation on the training set
kfold_cv = KFold(n_splits=5, shuffle=True)
clf = GridSearchCV(SVC(), parameters, cv=kfold_cv)
clf.fit(x_train, y_train)
print("Best parameters = ", clf.best_estimator_)

# Evaluate the best parameters on the held-out test set
y_predict = clf.predict(x_test)
print("Test accuracy = ", accuracy_score(y_test, y_predict))

Best parameters =  SVC(C=1000, gamma=0.001)
Test accuracy =  0.9666666666666667
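
The cost of a grid search grows with the number of combinations. As a quick check, this sketch counts the candidates in the grid used above with `ParameterGrid`. The variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The same grid as in the cell above.
parameters = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "kernel": ["rbf"], "gamma": [0.001, 0.0001]},
    {"C": [1, 10, 100, 1000], "kernel": ["sigmoid"], "gamma": [0.001, 0.0001]},
]

# 4 linear + 4*2 rbf + 4*2 sigmoid combinations
n_candidates = len(ParameterGrid(parameters))
n_splits = 5

# Each candidate is trained once per fold during the search.
n_fits = n_candidates * n_splits
print("candidates:", n_candidates)  # candidates: 20
print("cross-validation fits:", n_fits)  # cross-validation fits: 100
```

After fitting, `clf.best_params_` returns the winning combination as a dictionary, which can be easier to reuse than the printed estimator.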
