## Task 3: Building a High-Glucose Prediction Classifier

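After the previous two days of study and experiments, the data preprocessing and analysis are complete. With the data prepared, the next step is model training. The work is divided into the following three task points: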

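### Task Point 3.1: Splitting the Training and Test Sets
Split the data into a training set and a test set, with the requirement that data from the same person must not appear in both sets. The data from the 16 users can be split at a ratio of roughly 4:1 or 3:2.
Because high-glucose events are very rare, samples labeled high-glucose are far fewer than samples labeled low-glucose, which creates a class imbalance problem. It can be handled with methods such as undersampling or oversampling (for example, SMOTE).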
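
One way to enforce the per-user constraint is scikit-learn's `GroupShuffleSplit`, which assigns whole groups to one side of the split. The sketch below runs on synthetic data; the `user_id`, `feature`, and `label` columns and the choice of 3 held-out users are illustrative assumptions, not part of the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the real data: 16 hypothetical users, 50 rows each
rng = np.random.default_rng(0)
n_users, rows_per_user = 16, 50
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(1, n_users + 1), rows_per_user),
    "feature": rng.normal(size=n_users * rows_per_user),
    "label": rng.integers(0, 2, size=n_users * rows_per_user),
})

# Hold out 3 whole users (13:3, close to a 4:1 split of 16 users)
splitter = GroupShuffleSplit(n_splits=1, test_size=3, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["label"], groups=df["user_id"]))

train_users = set(df.iloc[train_idx]["user_id"])
test_users = set(df.iloc[test_idx]["user_id"])
print("train users:", sorted(train_users))
print("test users:", sorted(test_users))
```

Because the split is done on groups rather than rows, no user can end up in both sets, whatever the random seed.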

In [1]:
!pip install imbalanced-learn

Looking in indexes: https://mirror.baidu.com/pypi/simple/, https://mirrors.aliyun.com/pypi/simple/


In [16]:
import os

import pandas as pd
from imblearn.over_sampling import SMOTE

# File IDs 002 through 016, one CSV per user
file_nums = [f'{i:03d}' for i in range(2, 17)]

# Users held out for testing; all other users go to the training set,
# so no user contributes rows to both sets
test_users = ['009', '011']

# Collect each user's data, then concatenate once per set
train_frames, test_frames = [], []
for file_num in file_nums:
    data = pd.read_csv(f'final_data/{file_num}.csv')
    if file_num in test_users:
        test_frames.append(data)
    else:
        train_frames.append(data)

train_data = pd.concat(train_frames, ignore_index=True)
test_data = pd.concat(test_frames, ignore_index=True)

# The first column holds the target variable
first_column_name = train_data.columns[0]

# Binarize the target: 1 if the value is > 0, otherwise 0
train_data[first_column_name] = (train_data[first_column_name] > 0).astype(int)
test_data[first_column_name] = (test_data[first_column_name] > 0).astype(int)

# Separate features and target
X_train = train_data.drop(first_column_name, axis=1)
y_train = train_data[first_column_name]

X_test = test_data.drop(first_column_name, axis=1)
y_test = test_data[first_column_name]

# Oversample class 1 with SMOTE until it matches the number of class 0 samples
sm = SMOTE(sampling_strategy={1: len(y_train[y_train == 0])}, random_state=42)
# Only the training set is resampled; the test set keeps its original distribution
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Recombine the resampled features and target
train_resampled = pd.concat([X_train_res, y_train_res], axis=1)

# Inspect the first few rows of the resampled training set
print(train_resampled.head())

# Save the processed training and test sets (create the output directory if needed)
os.makedirs('train_test_data_binary', exist_ok=True)
train_resampled.to_csv('train_test_data_binary/train_resampled.csv', index=False)
X_test.to_csv('train_test_data_binary/X_test.csv', index=False)
y_test.to_csv('train_test_data_binary/y_test.csv', index=False)

print("Train and test sets have been processed and saved.")

    eda_max   eda_std  eda_mean  sugar_sum  acc_x_min  acc_z_mean  \
0 -0.544867 -0.485389 -0.514385  -0.450338   0.488968    1.091837   
1 -0.553896 -0.483379 -0.519125   0.900017  -0.218853    0.812283   
2 -0.554717 -0.482878 -0.521699   0.900017  -0.218853    0.547707   
3 -0.554717 -0.480062 -0.524683   0.900017  -0.218853    0.258676   
4 -0.554717 -0.472443 -0.529481   0.900017  -0.218853    0.116224   

   glucose_freq  
0             1  
1             1  
2             1  
3             1  
4             1  
Train and test sets have been processed and saved.


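### Task Point 3.2: Model Building and Hyperparameter Tuning

Use at least two classification algorithms to build the high-glucose prediction classifier, and tune the hyperparameters with methods such as regularization, cross-validation, and grid search.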

In [10]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    make_scorer
)
from tqdm import tqdm

# Load the resampled training set (reading the file once) and the test set
train_resampled = pd.read_csv('train_test_data_binary/train_resampled.csv')
X_train = train_resampled.drop('glucose_freq', axis=1)
y_train = train_resampled['glucose_freq']
X_test = pd.read_csv('train_test_data_binary/X_test.csv')
y_test = pd.read_csv('train_test_data_binary/y_test.csv')['glucose_freq']

# Standardize the features: fit the scaler on the training set, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the classifiers and their parameter grids
classifiers = {
    'Logistic Regression': {
        'model': LogisticRegression(),
        'params': {'C': [0.001, 0.01], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
    },
    'Support Vector Machine': {
        'model': SVC(),
        'params': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
    },
    'Random Forest': {
        'model': RandomForestClassifier(),
        'params': {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5]}
    }
}

# Define the evaluation metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'recall': make_scorer(recall_score),
    'precision': make_scorer(precision_score),
    'f1': make_scorer(f1_score)
}

# Cross-validation and grid search
for name, classifier_info in tqdm(classifiers.items(), desc="Classifiers"):
    model = classifier_info['model']
    params = classifier_info['params']

    # Create the GridSearchCV instance
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=params,
        cv=5,  # 5-fold cross-validation
        scoring=scoring,  # track several evaluation metrics
        refit='recall'  # select and refit the best model by recall
    )

    # Fit the model
    grid_search.fit(X_train_scaled, y_train)

    # Get the best model and its parameters
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Evaluate the best model on the test set
    y_pred = best_model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print the results
    print(f"Classifier: {name}")
    print(f"Best Parameters: {best_params}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Classification Report:\n{report}\n")

Classifiers:  33%|███▎      | 1/3 [00:00<00:01,  1.63it/s]

Classifier: Logistic Regression
Best Parameters: {'C': 0.001, 'penalty': 'l1', 'solver': 'liblinear'}
Accuracy: 0.4989
Recall: 0.6324
Precision: 0.3580
F1 Score: 0.4572
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.43      0.53      2814
           1       0.36      0.63      0.46      1409

    accuracy                           0.50      4223
   macro avg       0.53      0.53      0.50      4223
weighted avg       0.59      0.50      0.51      4223




Classifiers:  67%|██████▋   | 2/3 [14:49<08:43, 523.40s/it]

Classifier: Support Vector Machine
Best Parameters: {'C': 0.1, 'kernel': 'linear'}
Accuracy: 0.4625
Recall: 0.7282
Precision: 0.3522
F1 Score: 0.4748
Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.33      0.45      2814
           1       0.35      0.73      0.47      1409

    accuracy                           0.46      4223
   macro avg       0.53      0.53      0.46      4223
weighted avg       0.59      0.46      0.46      4223




Classifiers: 100%|██████████| 3/3 [18:42<00:00, 374.06s/it]

Classifier: Random Forest
Best Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.5326
Recall: 0.5415
Precision: 0.3649
F1 Score: 0.4360
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.53      0.60      2814
           1       0.36      0.54      0.44      1409

    accuracy                           0.53      4223
   macro avg       0.53      0.53      0.52      4223
weighted avg       0.59      0.53      0.55      4223
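
To help read the reports above, the following sketch recomputes the logistic regression F1 score from its reported precision and recall, and derives two trivial baselines from the support column (2814 negative and 1409 positive test samples). The baselines are an added point of comparison, not something computed in the original notebook.

```python
# Reported test-set metrics for logistic regression
precision, recall = 0.3580, 0.6324

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Class counts taken from the "support" column of the classification report
n_neg, n_pos = 2814, 1409
total = n_neg + n_pos

# Baseline 1: always predict class 1 -> recall is 1.0, precision equals prevalence
always_positive_precision = n_pos / total

# Baseline 2: always predict class 0 -> accuracy equals the negative share
always_negative_accuracy = n_neg / total

print(f"F1 from precision and recall: {f1:.4f}")
print(f"Always-positive precision:    {always_positive_precision:.4f}")
print(f"Always-negative accuracy:     {always_negative_accuracy:.4f}")
```

Comparing the reported precision and accuracy values against these two baselines gives a rough sense of how much signal each model extracts beyond the class prevalence.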







In [17]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    make_scorer
)
from tqdm import tqdm

# Load the resampled training set (reading the file once) and the test set
train_resampled = pd.read_csv('train_test_data_binary/train_resampled.csv')
X_train = train_resampled.drop('glucose_freq', axis=1)
y_train = train_resampled['glucose_freq']
X_test = pd.read_csv('train_test_data_binary/X_test.csv')
y_test = pd.read_csv('train_test_data_binary/y_test.csv')['glucose_freq']

# Standardize the features: fit the scaler on the training set, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the classifiers and their parameter grids
classifiers = {
    'Gradient Boosting': {
        'model': GradientBoostingClassifier(),
        'params': {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1], 'max_depth': [3, 5]}
    },
    'XGBoost': {
        # use_label_encoder is ignored by recent XGBoost releases (it only triggers a "not used" warning)
        'model': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
        'params': {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1], 'max_depth': [3, 5]}
    }
}

# Define the evaluation metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'recall': make_scorer(recall_score),
    'precision': make_scorer(precision_score),
    'f1': make_scorer(f1_score)
}

# Cross-validation and grid search
for name, classifier_info in tqdm(classifiers.items(), desc="Classifiers"):
    model = classifier_info['model']
    params = classifier_info['params']

    # Create the GridSearchCV instance
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=params,
        cv=5,  # 5-fold cross-validation
        scoring=scoring,  # track several evaluation metrics
        refit='recall'  # select and refit the best model by recall
    )

    # Fit the model
    grid_search.fit(X_train_scaled, y_train)

    # Get the best model and its parameters
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Evaluate the best model on the test set
    y_pred = best_model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print the results
    print(f"Classifier: {name}")
    print(f"Best Parameters: {best_params}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Classification Report:\n{report}\n")

Classifiers:  50%|█████     | 1/2 [03:47<03:47, 227.02s/it]

Classifier: Gradient Boosting
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Accuracy: 0.4430
Recall: 0.6714
Precision: 0.3337
F1 Score: 0.4458
Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.33      0.44      2814
           1       0.33      0.67      0.45      1409

    accuracy                           0.44      4223
   macro avg       0.50      0.50      0.44      4223
weighted avg       0.56      0.44      0.44      4223




Parameters: { "use_label_encoder" } are not used.


Classifier: XGBoost
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Accuracy: 0.4449
Recall: 0.7048
Precision: 0.3400
F1 Score: 0.4587
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.31      0.43      2814
           1       0.34      0.70      0.46      1409

    accuracy                           0.44      4223
   macro avg       0.51      0.51      0.44      4223
weighted avg       0.57      0.44      0.44      4223
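
For a side-by-side view, the test-set metrics printed by the two cells above can be collected into one table. The sketch below only re-enters the reported numbers and sorts them by recall, the metric used for `refit`; it does not retrain anything.

```python
import pandas as pd

# Test-set metrics copied from the outputs of the two training cells
results = pd.DataFrame(
    [
        ("Logistic Regression", 0.4989, 0.6324, 0.3580, 0.4572),
        ("Support Vector Machine", 0.4625, 0.7282, 0.3522, 0.4748),
        ("Random Forest", 0.5326, 0.5415, 0.3649, 0.4360),
        ("Gradient Boosting", 0.4430, 0.6714, 0.3337, 0.4458),
        ("XGBoost", 0.4449, 0.7048, 0.3400, 0.4587),
    ],
    columns=["model", "accuracy", "recall", "precision", "f1"],
)

# Rank the models by recall, highest first
summary = results.sort_values("recall", ascending=False).reset_index(drop=True)
print(summary.to_string(index=False))
```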





