This code is a machine learning workflow that trains a decision tree classifier and evaluates its performance. Each part is explained below:

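1. Import the required libraries: pandas, numpy, and scikit-learn's DecisionTreeClassifier, GridSearchCV, train_test_split, Pipeline, and the evaluation metrics (accuracy_score, confusion_matrix, roc_auc_score, and f1_score).

2. Define the function read_and_split_data(file_path, test_size=0.2, random_state=40), which reads the data file, preprocesses the target variable (original labels 1 and 3 become 2, and 2 becomes 1, giving a binary target), and splits the dataset into a stratified training set and test set.

3. Define the function evaluate_model_performance(model, X_test, y_test), which evaluates the model on the test set and reports accuracy, the confusion matrix, AUC, and the F1 score.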

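4. Define the Pipeline object pipeline, which contains the decision tree classifier (DecisionTreeClassifier).

5. Define the hyperparameter space param_grid, covering the classifier's maximum depth (max_depth), minimum samples required to split a node (min_samples_split), and minimum samples per leaf (min_samples_leaf).

6. Create the GridSearchCV object grid_search, which searches the grid for the best hyperparameter combination using 5-fold cross-validation, with accuracy as the scoring metric.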

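7. Call read_and_split_data to read and split the dataset, storing the result in X_train, X_test, y_train, and y_test.

8. Fit grid_search on the training set to find the best hyperparameter combination.

9. Print the best hyperparameter combination.

10. Take grid_search.best_estimator_ as best_model. With the default refit=True, GridSearchCV has already retrained this estimator on the full training set using the best hyperparameters.

11. Evaluate best_model on the test set.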

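The goal of this code is to improve the decision tree classifier by tuning its hyperparameters. GridSearchCV runs a grid search over the hyperparameter space and selects the combination with the best cross-validated accuracy on the training set; the test set is used only for the final evaluation. Pipeline is the object that chains data preprocessing and model training together, although in the decision tree code it contains only the classifier step. The resulting best model can then be used to predict the target variable for new data.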
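
Two of the steps above are easy to sanity-check in isolation: the label remapping in step 2 and the size of the search in steps 5 and 6. The sketch below uses a made-up toy label series (not values from the dataset) and the same param_grid as the code further down; `ParameterGrid` is scikit-learn's helper for enumerating a grid.

```python
import pandas as pd
from sklearn.model_selection import ParameterGrid

# Step 2: the dict passed to replace() is applied to the original values,
# so 1 -> 2, 2 -> 1, 3 -> 2 yields a binary target with labels {1, 2}.
# toy_labels is illustrative only; it is not taken from the dataset.
toy_labels = pd.Series([1, 2, 3, 2, 1])
remapped = toy_labels.replace({1: 2, 2: 1, 3: 2})
print(remapped.tolist())  # [2, 1, 2, 1, 2]

# Steps 5-6: number of candidates in the grid and number of CV fits.
param_grid = {
    'classifier__max_depth': [2, 5, 10, 15],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}
n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 3
n_fits = n_candidates * 5                      # 5-fold cross-validation
print(n_candidates, n_fits)  # 36 180
```

With the default refit=True, GridSearchCV performs one additional fit of the winning candidate on the full training set after these 180 cross-validation fits.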

## Pipeline

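In machine learning practice, data preprocessing and model training are both necessary steps, and a pipeline (Pipeline) is the tool that combines them. A pipeline packs several preprocessing steps and a final model into a single object, which makes the whole workflow more standardized, more concise, and easier to manage. Some benefits of building a pipeline: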

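Efficiency: a pipeline reduces repeated code, which speeds up both coding and experimentation.

Simpler code: once the steps are combined into a pipeline, the code for the whole workflow lives in one place.

Avoiding data leakage: during cross-validation, preprocessing must be fit on the training folds only and then applied to the held-out fold. If preprocessing is done by hand on the full dataset beforehand, information from the held-out fold can leak into training. A pipeline refits every preprocessing step inside each fold, so the folds stay independent.

Reproducibility: a pipeline makes it easy to repeat an experiment and to apply exactly the same code and steps when predicting on new data.

Ease of management: with all steps in one object, the whole process is easier to manage, which simplifies experimentation and model tuning.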

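In short, a pipeline improves efficiency, simplifies code, prevents data leakage, and supports reproducibility and manageability, which makes it a very practical tool in machine learning work.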
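
The data-leakage point can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification` (an assumption; it is not the dataset used in this notebook): because the scaler sits inside the pipeline, `cross_val_score` refits it on the training folds of every split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling and the classifier are chained into one estimator.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC()),
])

# For each of the 5 splits, the scaler is fit on the training folds only
# and then applied to the held-out fold, so no statistics leak across folds.
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.shape)  # (5,)

# The same object handles new data: predict() applies the fitted scaler first.
pipe.fit(X, y)
preds = pipe.predict(X[:3])
```

Scaling X once before cross-validation would instead compute the mean and variance from all rows, including those later used for validation.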

# Decision Tree

In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, f1_score

# Define a function to read and split the dataset
def read_and_split_data(file_path, test_size=0.2, random_state=40):
    data = pd.read_csv(file_path)
    # Copy target variable
    y = data['session'].copy()
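    # Collapse to a binary target: 1 and 3 become 2, and 2 becomes 1
    y.replace({1: 2, 2: 1, 3: 2}, inplace=True)
    # Split into a training set and a test set, stratified on y.
    # Features are all columns except the last, which assumes 'session' is the last column.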
    X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], y, test_size=test_size, 
                                                        stratify=y, random_state=random_state)
    return X_train, X_test, y_train, y_test

# Define a function to evaluate model performance
def evaluate_model_performance(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred, normalize='true'))
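    # Computed from hard predictions, so this value equals balanced accuracy
    # rather than a score-based (ranking) AUC
    print('AUC: ', roc_auc_score(y_test, y_pred))
    # f1_score defaults to pos_label=1, so this is the F1 score of label 1
    print('F1-score: ', f1_score(y_test, y_pred))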

# Define a pipeline; it holds only the classifier here (no preprocessing step)
pipeline = Pipeline([
    ('classifier', DecisionTreeClassifier(random_state=40))
])

# Define hyperparameter space
param_grid = {
    'classifier__max_depth': [2, 5, 10, 15],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5, # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
)

# Read and split the dataset
X_train, X_test, y_train, y_test = read_and_split_data('HRV_ECG_step60.csv')

# Fit GridSearchCV object on the training dataset
grid_search.fit(X_train, y_train)

# Print best hyperparameters
print('Best parameters:', grid_search.best_params_)

# GridSearchCV (refit=True by default) has already refit the best pipeline
# on the full training set, so no further fit is needed
best_model = grid_search.best_estimator_

# Evaluate model performance
evaluate_model_performance(best_model, X_test, y_test)

Best parameters: {'classifier__max_depth': 15, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}
Accuracy:  0.7953964194373402
Confusion Matrix: 
 [[0.8128655  0.1871345 ]
 [0.21818182 0.78181818]]
AUC:  0.7973418394471026
F1-score:  0.7765363128491621
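
The AUC printed above is computed from hard class predictions, which is why it equals the mean of the diagonal of the normalized confusion matrix: (0.8128655 + 0.78181818) / 2 ≈ 0.7973. The sketch below, on synthetic data with labels shifted to {1, 2} (an assumption made to mirror the remapped target; it is not the notebook's dataset), contrasts that value with an AUC computed from `predict_proba` scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with labels {1, 2}.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = y + 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# AUC on hard labels: the ROC curve has a single interior point,
# so the area reduces to balanced accuracy.
y_pred = tree.predict(X_te)
auc_hard = roc_auc_score(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
print(np.isclose(auc_hard, bal_acc))  # True

# AUC on scores: use the predicted probability of the larger label (2),
# which roc_auc_score treats as the positive class for binary targets.
pos_col = list(tree.classes_).index(2)
y_score = tree.predict_proba(X_te)[:, pos_col]
auc_score = roc_auc_score(y_te, y_score)
print(auc_score)
```

Whether the score-based value is higher or lower than the hard-label value depends on the data, so the two should not be compared across models as if they were the same metric.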


# SVC

In [3]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, f1_score
from sklearn.preprocessing import StandardScaler


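# Uses randomized search instead of grid search, together with kernel functions
# and feature scaling, to improve the model's efficiency and performance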

# Define a function to read and split the dataset
def read_and_split_data(file_path, test_size=0.2, random_state=40):
    data = pd.read_csv(file_path)
    # Copy target variable
    y = data['session'].copy()
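    # Collapse to a binary target: 1 and 3 become 2, and 2 becomes 1
    y.replace({1: 2, 2: 1, 3: 2}, inplace=True)
    # Split into a training set and a test set, stratified on y.
    # Features are all columns except the last, which assumes 'session' is the last column.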
    X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], y, test_size=test_size, 
                                                        stratify=y, random_state=random_state)
    return X_train, X_test, y_train, y_test

# Define a function to evaluate model performance
def evaluate_model_performance(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
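    # Computed from hard predictions, so this value equals balanced accuracy
    # rather than a score-based (ranking) AUC
    print('AUC: ', roc_auc_score(y_test, y_pred))
    # f1_score defaults to pos_label=1, so this is the F1 score of label 1
    print('F1-score: ', f1_score(y_test, y_pred))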

# Define a pipeline to link data preprocessing and model training and evaluation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(random_state=40))
])

# Define hyperparameter distributions for random search
param_distributions = {
    'classifier__C': np.logspace(-3, 3, 7),
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__gamma': ['scale', 'auto'] + list(np.logspace(-3, 3, 7)),
}

# Use RandomizedSearchCV to find the best hyperparameters
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    cv=5, # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
    n_iter=50, # number of random search iterations
    random_state=40,
)

# Read and split the dataset
X_train, X_test, y_train, y_test = read_and_split_data('HRV_ECG_step60.csv')

# Fit RandomizedSearchCV object on the training dataset
random_search.fit(X_train, y_train)

# Print best hyperparameters
print('Best parameters:', random_search.best_params_)

# RandomizedSearchCV (refit=True by default) has already refit the best pipeline
# on the full training set, so no further fit is needed
best_model = random_search.best_estimator_

# Evaluate model performance
evaluate_model_performance(best_model, X_test, y_test)

Best parameters: {'classifier__kernel': 'rbf', 'classifier__gamma': 0.1, 'classifier__C': 1000.0}
Accuracy:  0.8567774936061381
Confusion Matrix: 
 [[139  32]
 [ 24 196]]
AUC:  0.8518872939925571
F1-score:  0.8323353293413173
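
The reported SVC metrics can be reproduced by hand from the confusion matrix above. The sketch assumes the usual scikit-learn layout, rows as true labels and columns as predicted labels in sorted order (label 1 first, then label 2).

```python
import numpy as np

# Confusion matrix copied from the SVC output above.
cm = np.array([[139, 32],
               [24, 196]])

# Accuracy: correct predictions over all test samples (335 / 391).
accuracy = float(np.trace(cm) / cm.sum())

# Per-class recall; their mean is the balanced accuracy, which is what
# roc_auc_score returns when it is given hard predictions.
recall_per_class = np.diag(cm) / cm.sum(axis=1)
balanced_accuracy = float(recall_per_class.mean())

# F1 for label 1 (the default pos_label of f1_score).
tp, fn, fp = cm[0, 0], cm[0, 1], cm[1, 0]
f1_label_1 = float(2 * tp / (2 * tp + fp + fn))

print(round(accuracy, 4))           # 0.8568
print(round(balanced_accuracy, 4))  # 0.8519
print(round(f1_label_1, 4))         # 0.8323
```

All three values match the printed Accuracy, AUC, and F1-score, which confirms that the AUC line is a balanced accuracy and that the F1 score refers to label 1.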
