<font size=8>The right way of using SMOTE with Cross-validation</font>

Keywords: **SMOTE**, **Imbalanced Sample**, **Cross-validation**, **stratified_kfold**, **Pipeline**

References:

1. https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7

Import packages:

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.datasets import load_breast_cancer


Define global variables and constants:

In [18]:
random_state = 42

cv_skf = StratifiedKFold(n_splits=5,
                         shuffle=True,
                         random_state=random_state)

# Report both AUC and accuracy for every candidate in the grid search
scoring_aucc = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
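
As a quick check of what `StratifiedKFold` contributes, the sketch below compares the share of label 1 in each validation fold with the share in the whole dataset. It is a minimal illustration, assuming the same breast cancer data and the same splitter settings as this notebook; the variable names `overall_ratio` and `fold_ratios` are introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

x, y = load_breast_cancer(return_X_y=True)
overall_ratio = y.mean()  # fraction of samples with label 1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Fraction of label 1 inside each validation fold
fold_ratios = [y[val_idx].mean() for _, val_idx in skf.split(x, y)]

print(f"Overall share of label 1: {overall_ratio:.3f}")
for i, ratio in enumerate(fold_ratios):
    print(f"Fold {i}: {ratio:.3f}")
```

Each fold keeps roughly the same class proportions as the full dataset, which is what makes per-fold scores comparable on imbalanced data.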


Prepare the dataset:

In [19]:
df = load_breast_cancer()

In [20]:
x = df['data'].copy()
y = df['target'].copy()
# In load_breast_cancer, target 0 is malignant and target 1 is benign
label_decoder = {
    0: 'malignant',
    1: 'benign',
}

x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=random_state
                                                    )

Check whether the classes are balanced:

In [21]:
# y_train is a NumPy array, so count labels with np.unique (pandas value_counts is not available on it)
unique, counts = np.unique(y_train, return_counts=True)
target_counts = dict(zip(unique, counts))

In [23]:
for key, val in target_counts.items():
    print(f"Class : {label_decoder[key]}, Count : {val}")

Class : malignant, Count : 170
Class : benign, Count : 285
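
With 170 rows in the minority class and 285 in the majority class, balancing the training set requires 115 extra minority rows. The sketch below illustrates the interpolation idea behind SMOTE on random 2-D data. It is a simplified illustration, not imbalanced-learn's implementation: the helper `smote_like_samples` and its parameters are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(x_minority, n_new, k=5, random_state=42):
    """Hypothetical helper: build n_new synthetic rows by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours: querying the fitted points returns each point itself
    # as its closest neighbour (assuming no duplicate rows), so drop column 0
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x_minority)
    neighbours = nn.kneighbors(x_minority, return_distance=False)[:, 1:]

    base_idx = rng.integers(0, len(x_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    base = x_minority[base_idx]
    return base + gap * (x_minority[neigh_idx] - base)


# Toy minority class with the same row count as in the training set above
rng = np.random.default_rng(0)
x_minority = rng.normal(size=(170, 2))
x_new = smote_like_samples(x_minority, n_new=285 - 170)
print(x_new.shape)
```

Every synthetic row lies on the line segment between two real minority rows, so the new points stay inside the region the minority class already occupies.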


Train the model:

In [None]:
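classifier = LogisticRegression(random_state=random_state, max_iter=1000)
# SMOTE is a pipeline step, so it is refit on the training part of each CV fold
# and is never applied to the validation part or to the test set
pipeline = imbpipeline(steps=[['smote', SMOTE(random_state=random_state)],
                              ['scaler', MinMaxScaler()],
                              ['classifier', classifier]])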


param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           cv=cv_skf,
                           scoring=scoring_aucc,
                           refit='AUC',
                           n_jobs=-1)

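grid_search.fit(x_train, y_train)
# Mean cross-validated AUC of the best C (refit='AUC' selects the model by AUC)
cv_score = grid_search.best_score_

# With multi-metric scoring, score() uses the metric named in refit, so this is the test-set AUC
test_score = grid_search.score(x_test, y_test)

print(f'Cross-validation AUC: {cv_score}\nTest AUC: {test_score}')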
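
For intuition on what resampling inside a pipeline means during cross-validation, the loop below writes the same idea out by hand: the training part of each fold is oversampled, the scaler and model are fit on it, and the validation part is scored untouched. This is a minimal sketch under stated assumptions. To depend only on scikit-learn and NumPy, plain random oversampling via `sklearn.utils.resample` stands in for SMOTE, and `oversample_minority` is a hypothetical helper, not a library function.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample


def oversample_minority(x, y, random_state=42):
    """Hypothetical helper: duplicate random minority rows until classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return x, y
    x_extra = resample(x[y == minority], replace=True,
                       n_samples=n_extra, random_state=random_state)
    y_extra = np.full(n_extra, minority)
    return np.vstack([x, x_extra]), np.concatenate([y, y_extra])


x, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_aucs = []
for train_idx, val_idx in skf.split(x, y):
    # Resample the training part only; the validation part keeps its real distribution
    x_tr, y_tr = oversample_minority(x[train_idx], y[train_idx])
    scaler = MinMaxScaler().fit(x_tr)
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(scaler.transform(x_tr), y_tr)
    proba = model.predict_proba(scaler.transform(x[val_idx]))[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], proba))

mean_auc = float(np.mean(fold_aucs))
print(f"Mean validation AUC over {len(fold_aucs)} folds: {mean_auc:.3f}")
```

Oversampling the whole dataset before splitting would let copies (or, with SMOTE, interpolations) of validation rows reach the training data and inflate the cross-validation score; resampling inside each fold avoids that leak.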
