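`XGBClassifier` is the XGBoost library's scikit-learn-style classifier for gradient-boosted tree models. Its key parameters are:

1. **`random_state`**:
   - Description: Seed for the random number generator. It makes results reproducible: runs that use the same `random_state` value on the same data and settings produce the same results.

2. **`n_estimators`**:
   - Default: 100
   - Description: Number of trees (boosting rounds). Adding trees can improve performance but may also lead to overfitting.

3. **`learning_rate`**:
   - Default: 0.3
   - Description: Shrinks the contribution of each tree. Smaller values need more trees to fit the model but can improve generalization.

4. **`max_depth`**:
   - Default: 6
   - Description: Maximum depth of each tree. Greater depth makes the model more complex but may also lead to overfitting.

5. **`min_child_weight`**:
   - Default: 1
   - Description: Minimum sum of instance weights required in a child node. Larger values stop the model from fitting overly specific samples, which helps prevent overfitting.

6. **`subsample`**:
   - Default: 1
   - Description: Fraction of the training samples used to build each tree. Lower values can reduce overfitting, but values that are too low may cause underfitting.

7. **`colsample_bytree`**:
   - Default: 1
   - Description: Fraction of the features sampled when building each tree. Similar in spirit to `max_features` in a random forest, although that setting samples features per split rather than per tree.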


In [1]:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import numpy as np

In [2]:
# Load the data (raw string so the backslashes in the Windows path are not treated as escapes)
file_path = r'E:\竞赛\服创赛-A08\data\欺诈手段特征集.xlsx'
data = pd.read_excel(file_path)

In [3]:
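# Function to perform undersampling
def perform_undersampling(data, target_column, sampling_strategy=1.0):
    """Randomly undersample the majority class (label 0).

    sampling_strategy is the desired minority-to-majority ratio, so the
    majority class is cut to about len(minority) / sampling_strategy rows.
    """
    majority_class = data[data[target_column] == 0]
    minority_class = data[data[target_column] == 1]
    # Never request more rows than exist: resample(replace=False) raises otherwise
    n_majority = min(len(majority_class), int(len(minority_class) / sampling_strategy))
    majority_class_undersampled = resample(majority_class,
                                           replace=False,
                                           n_samples=n_majority,
                                           random_state=42)
    undersampled_data = pd.concat([majority_class_undersampled, minority_class])
    return undersampled_data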
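
To make the meaning of `sampling_strategy` concrete, here is a small sketch of the arithmetic the function above applies, `int(len(minority_class) / sampling_strategy)`. The minority count of 1000 and the helper name `majority_rows_kept` are hypothetical, chosen only for illustration.

```python
def majority_rows_kept(n_minority, sampling_strategy):
    """Majority rows kept for a given minority-to-majority ratio."""
    return int(n_minority / sampling_strategy)


n_minority = 1000  # hypothetical minority-class count, not taken from the dataset

for ratio in [0.45, 0.5, 0.75, 1.0]:
    kept = majority_rows_kept(n_minority, ratio)
    minority_share = n_minority / (n_minority + kept)
    print(f"ratio={ratio}: keep {kept} majority rows, "
          f"minority share of the result = {minority_share:.1%}")
```

A smaller ratio keeps more majority rows, so the resampled data stays closer to the original imbalance; a ratio of 1.0 gives a fully balanced set.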

In [4]:
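# Function to evaluate model with undersampling and report
def evaluate_model_with_undersampling_and_report(data, target_column, sampling_strategies, model, test_size=0.3, n_splits=5):
    """Fit and score `model` once per undersampling ratio.

    Returns a dict mapping each ratio to its classification report.
    Note: n_splits is accepted but not used; each ratio is scored on a
    single hold-out split.
    """
    reports = {}
    for strategy in sampling_strategies:
        # Undersample first, then split, so the test set has the same
        # class ratio as the training set
        balanced_data = perform_undersampling(data, target_column, sampling_strategy=strategy)
        X = balanced_data.drop(target_column, axis=1)
        y = balanced_data[target_column]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        # Refit the same model object from scratch for this ratio
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        report = classification_report(y_test, y_pred)
        reports[strategy] = report
    return reports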
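
The function above accepts `n_splits`, and the notebook imports `StratifiedKFold` and `cross_val_score`, but neither is used. The following is a sketch of how a stratified cross-validation score might be computed instead of a single hold-out split. The helper name `cross_validated_recall` is hypothetical, and the demo uses synthetic data with a decision tree as a stand-in so that it runs without the original spreadsheet. Any scikit-learn-compatible estimator, such as `XGBClassifier`, could be passed in its place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cross_validated_recall(model, X, y, n_splits=5):
    """Recall of the positive class on each stratified fold (hypothetical helper)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="recall")


# Synthetic, imbalanced stand-in data (roughly 30% positives)
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.7, 0.3], random_state=42)

scores = cross_validated_recall(
    DecisionTreeClassifier(max_depth=4, random_state=42), X, y
)
print(f"recall per fold: {scores.round(2)}, mean: {scores.mean():.2f}")
```

Averaging over several folds gives a less noisy estimate than one split, which matters when comparing sampling ratios whose reports differ by only a few points.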

In [5]:
# Define the model - XGBoost Classifier
xgb_model = XGBClassifier(random_state=42)

# Define sampling strategies to evaluate
sampling_strategies = [0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.75]  # Different ratios for undersampling

# Evaluate the model with different undersampling strategies and get classification reports
classification_reports = evaluate_model_with_undersampling_and_report(data, 'RES', sampling_strategies, xgb_model)

# Print classification reports for each undersampling strategy
for strategy, report in classification_reports.items():
    print(f"Sampling ratio {strategy}:\n{report}\n")

Sampling ratio 0.4:
              precision    recall  f1-score   support

           0       0.83      0.92      0.87       596
           1       0.72      0.53      0.61       237

    accuracy                           0.81       833
   macro avg       0.78      0.72      0.74       833
weighted avg       0.80      0.81      0.80       833
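
As a reading aid, the summary rows of a report can be recomputed from its per-class rows. The sketch below uses the rounded precision, recall, and support values from the 0.4 report above; because the inputs are already rounded, the results can only be expected to agree with the report to about two decimals.

```python
def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# label: (precision, recall, support), copied from the 0.4 report
rows = {0: (0.83, 0.92, 596), 1: (0.72, 0.53, 237)}

f1_scores = {label: f1(p, r) for label, (p, r, _) in rows.items()}
total = sum(n for _, _, n in rows.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1_scores[label] * n for label, (_, _, n) in rows.items()) / total

print({label: round(score, 2) for label, score in f1_scores.items()})
print(f"macro f1: {macro_f1:.2f}, weighted f1: {weighted_f1:.2f}")
```

The macro average treats both classes equally, so it sits below the weighted average whenever the smaller class scores lower, as it does here.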


Sampling ratio 0.45:
              precision    recall  f1-score   support

           0       0.83      0.88      0.86       538
           1       0.68      0.57      0.62       229

    accuracy                           0.79       767
   macro avg       0.75      0.73      0.74       767
weighted avg       0.78      0.79      0.79       767


Sampling ratio 0.5:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       489
           1       0.69      0.57      0.62       225

    accuracy                           0.78       714
   macro avg       0.75      0.73      0.74       714
weighted avg       0.78      0.78      0.78 