## GBDT Model

In [1]:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import numpy as np

In [2]:
# Load the data (raw string so the backslashes in the Windows path are not treated as escapes)
file_path = r'E:\竞赛\服创赛-A08\data\欺诈手段特征集.xlsx'
data = pd.read_excel(file_path)

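### Parameters:

- data: a pandas DataFrame containing the feature columns and the target column.
- target_column: the name of the target column, used in this function to separate the majority class from the minority class.
- sampling_strategy: a float giving the ratio of minority-class samples to majority-class samples after undersampling. For example, a value of 1.0 means the two classes contain the same number of samples after undersampling.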
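
To make the ratio concrete, the sketch below computes how many majority-class rows remain for a given `sampling_strategy`. The helper name `majority_count_after_undersampling` and the class sizes are hypothetical, and the cap at the available majority rows is a safeguard added for this sketch.

```python
def majority_count_after_undersampling(n_minority, n_majority, sampling_strategy):
    """Number of majority rows to keep so that minority / majority is about sampling_strategy."""
    requested = int(n_minority / sampling_strategy)
    # Sampling without replacement cannot return more rows than exist.
    return min(requested, n_majority)


# Hypothetical class sizes, for illustration only.
n_minority, n_majority = 100, 1000
for ratio in (0.25, 0.5, 1.0):
    kept = majority_count_after_undersampling(n_minority, n_majority, ratio)
    print(f"ratio={ratio}: keep {kept} majority rows, {n_minority} minority rows")
```

A smaller ratio keeps more of the majority class, so the data stays closer to its original imbalance; a ratio of 1.0 discards the most majority rows.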

In [10]:
def perform_undersampling(data, target_column, sampling_strategy=0.4):  # the default sampling ratio is 0.4
    """
    Perform undersampling on the dataset to handle class imbalance.

    Parameters:
    - data: pandas DataFrame, the dataset.
    - target_column: str, the name of the target column.
    - sampling_strategy: float, the desired ratio of the number of samples in the minority class
      over the number of samples in the majority class after resampling. Default is 0.4.

    Returns:
    - Balanced DataFrame after undersampling.
    """
    majority_class = data[data[target_column] == 0]
    minority_class = data[data[target_column] == 1]

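    # Cap the request at the available majority rows: sampling without replacement cannot exceed it
    n_majority = min(len(majority_class), int(len(minority_class) / sampling_strategy))
    majority_class_undersampled = resample(majority_class,
                                           replace=False,
                                           n_samples=n_majority,
                                           random_state=42)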
    undersampled_data = pd.concat([majority_class_undersampled, minority_class])
    return undersampled_data

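The `evaluate_model_with_undersampling_and_report` function evaluates how a machine learning model performs under different undersampling strategies. For each strategy it trains the model on a hold-out split of the undersampled data and reports the results on the held-out test portion, which shows how well the model copes with class imbalance. The workflow is as follows:

**Parameters**:
   - `data`: DataFrame, the dataset containing the feature columns and the target column.
   - `target_column`: string, the name of the target column.
   - `sampling_strategies`: list of floats, the undersampling strategies to evaluate.
   - `model`: the machine learning model to evaluate.
   - `test_size`: float, the proportion of the undersampled data held out as the test set.
   - `n_splits`: integer, the number of cross-validation folds. The function accepts it, but the current implementation uses a single train/test split and does not read it.

1. **Iterate over the undersampling strategies**:
   - The function receives a list of undersampling strategies (for example 0.5, 0.75, 1.0), each giving the ratio of minority-class to majority-class samples after undersampling.

2. **Undersample for each strategy**:
   - Call `perform_undersampling` with the current strategy to obtain a more balanced class distribution.

3. **Split the data**:
   - Split the undersampled dataset into a training set and a test set.

4. **Train and predict**:
   - Train the given model on the training set.
   - Predict on the test set.

5. **Generate the classification report**:
   - Use `classification_report` to produce a performance report for each undersampling strategy.

6. **Store and return the results**:
   - Store the classification report for each strategy in a dictionary and return that dictionary.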

In [11]:
def evaluate_model_with_undersampling_and_report(data, target_column, sampling_strategies, model, test_size=0.3, n_splits=5):
    """
    Evaluate the model on a hold-out split under different undersampling strategies and provide classification reports.

    Parameters:
    - data: pandas DataFrame, the dataset.
    - target_column: str, the name of the target column.
    - sampling_strategies: list of floats, different sampling strategies to be evaluated.
    - model: the machine learning model to be evaluated.
    - test_size: float, the proportion of the dataset to include in the test split.
    - n_splits: int, number of folds for cross-validation (currently unused).

    Returns:
    - Dictionary mapping each sampling strategy to its classification report.
    """
    reports = {}
    for strategy in sampling_strategies:
        # Undersampling the dataset
        balanced_data = perform_undersampling(data, target_column, sampling_strategy=strategy)
        X = balanced_data.drop(target_column, axis=1)
        y = balanced_data[target_column]

        # Splitting the dataset into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

        # Train the model
        model.fit(X_train, y_train)

        # Predictions on the test set
        y_pred = model.predict(X_test)

        # Classification report
        report = classification_report(y_test, y_pred)

        # Storing the results
        reports[strategy] = report

    return reports
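
The function above accepts `n_splits`, and the notebook imports `StratifiedKFold` and `cross_val_score`, but the evaluation itself relies on a single train/test split. As a rough sketch of how stratified cross-validation could be used instead, the example below scores a small GBDT on synthetic data. The data, the names `X_demo`, `y_demo`, `demo_model`, and the choice of F1 as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for an undersampled dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=42
)

n_splits = 5
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
demo_model = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=42)

# One F1 score (for class 1) per fold; each fold keeps the class proportions.
fold_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {fold_scores.round(3)}")
print(f"mean F1: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Averaging over folds gives a less split-dependent estimate than one hold-out report, at the cost of training the model `n_splits` times per strategy.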

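### Parameters

1. **`loss`**:
   - Default: 'deviance'
   - Description: the type of loss function. 'deviance' is used for classification and corresponds to logistic regression. (scikit-learn 1.1 renamed this option to 'log_loss', and later releases no longer accept 'deviance'.)

2. **`learning_rate`**:
   - Default: 0.1
   - Description: the learning rate shrinks the contribution of each tree. Smaller values usually need more trees to build the model but can improve generalization.

3. **`n_estimators`**:
   - Default: 100
   - Description: the number of boosting stages (trees) to run. More trees increase model complexity and may lead to overfitting.

4. **`subsample`**:
   - Default: 1.0
   - Description: the fraction of samples used to fit each base learner. If it is below 1.0, each tree is trained on a random subset of the samples, which helps reduce overfitting.

5. **`criterion`**:
   - Default: 'friedman_mse'
   - Description: the function used to measure the quality of a split. 'friedman_mse' is an improved variant of mean squared error.

6. **`min_samples_split`**:
   - Default: 2
   - Description: the minimum number of samples required to split an internal node.

7. **`min_samples_leaf`**:
   - Default: 1
   - Description: the minimum number of samples required at a leaf node. This helps smooth the model, especially for regression.

8. **`max_depth`**:
   - Default: 3
   - Description: the maximum depth of the individual regression estimators. The deeper the trees, the more complex the model.

9. **`max_features`**:
   - Default: None
   - Description: the number of features to consider when looking for the best split. 'None' means all features are used.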
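
As a small illustration of the parameters listed above, the sketch below builds one classifier with the defaults written out and one with more conservative settings. The values in `regularized` are illustrative assumptions, not tuned results from this notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Defaults from the list above, written out explicitly (loss is left at its default
# because its accepted names differ between scikit-learn versions).
default_like = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=100, subsample=1.0, criterion="friedman_mse",
    min_samples_split=2, min_samples_leaf=1, max_depth=3, max_features=None,
)

# Illustrative alternative: smaller steps, more trees, row subsampling, larger leaves.
regularized = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=300, subsample=0.8,
    min_samples_leaf=20, max_depth=3, random_state=42,
)

for name, clf in [("default_like", default_like), ("regularized", regularized)]:
    params = clf.get_params()
    shown = {k: params[k] for k in ("learning_rate", "n_estimators", "subsample", "max_depth")}
    print(name, shown)
```

Lowering `learning_rate` while raising `n_estimators` is a common trade: each tree contributes less, so more trees are needed to reach the same fit.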

In [14]:
# Define the model - Gradient Boosting Classifier
gbdt_model = GradientBoostingClassifier(
    max_depth = 5,
    random_state = 42
)

# Define sampling strategies to evaluate
sampling_strategies = [0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.75]  # Different ratios for undersampling

# Evaluate the model with different undersampling strategies and get classification reports
classification_reports = evaluate_model_with_undersampling_and_report(data, 'RES', sampling_strategies, gbdt_model)

# Print classification reports for each undersampling strategy
for strategy, report in classification_reports.items():
    print(f"Sampling ratio {strategy}:\n{report}\n")

Sampling ratio 0.4:
              precision    recall  f1-score   support

           0       0.83      0.94      0.88       596
           1       0.76      0.52      0.62       237

    accuracy                           0.82       833
   macro avg       0.80      0.73      0.75       833
weighted avg       0.81      0.82      0.81       833


Sampling ratio 0.45:
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       538
           1       0.70      0.57      0.63       229

    accuracy                           0.80       767
   macro avg       0.76      0.73      0.74       767
weighted avg       0.79      0.80      0.79       767


Sampling ratio 0.5:
              precision    recall  f1-score   support

           0       0.82      0.89      0.85       489
           1       0.71      0.56      0.63       225

    accuracy                           0.79       714
   macro avg       0.76      0.73      0.74       714
weighted avg       0.78      0.79      0.78 

## Threshold Moving

In [15]:
def evaluate_performance_at_different_thresholds(model, X_test, y_test, thresholds):
    """
    Evaluate the performance of a model at different thresholds and provide classification reports.

    Parameters:
    - model: trained model
    - X_test: Test features
    - y_test: True values for test set
    - thresholds: list of thresholds to evaluate

    Returns:
    - Dictionary of classification reports for each threshold
    """
    y_scores = model.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class
    reports = {}

    for threshold in thresholds:
        # Apply threshold
        y_pred_threshold = (y_scores >= threshold).astype(int)

        # Generate classification report
        report = classification_report(y_test, y_pred_threshold)

        reports[threshold] = report

    return reports
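
Instead of scanning a fixed list of thresholds, a threshold can also be picked from the precision-recall curve. The sketch below chooses the threshold that maximizes F1 on a separate validation split. The synthetic data, the names `X_val`, `val_scores`, `best_threshold`, and the choice of F1 as the selection criterion are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

demo_model = GradientBoostingClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_fit, y_fit)
val_scores = demo_model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the final point.
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
precision, recall = precision[:-1], recall[:-1]

# F1 at every candidate threshold, guarding against 0 / 0.
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)

best_threshold = thresholds[np.argmax(f1)]
print(f"best threshold on validation data: {best_threshold:.3f}, F1 = {f1.max():.3f}")
```

Selecting the threshold on validation data rather than on the test set keeps the final test report an honest estimate of performance.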


In [17]:
# Choose a sampling strategy, e.g. 0.4
balanced_data = perform_undersampling(data, 'RES', sampling_strategy=0.4)
X_balanced = balanced_data.drop('RES', axis=1)
y_balanced = balanced_data['RES']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3, random_state=42)

# Train the model (GBDT_1 refers to the same estimator object as gbdt_model)
GBDT_1 = gbdt_model
GBDT_1.fit(X_train, y_train)

In [19]:
# Assuming you have already trained your GBDT model and have X_test and y_test
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]  # Define a range of thresholds to evaluate
classification_reports = evaluate_performance_at_different_thresholds(GBDT_1, X_test, y_test, thresholds)

# Print classification reports for each threshold
for threshold, report in classification_reports.items():
    print(f"Threshold: {threshold}\n{report}\n")

Threshold: 0.3
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       596
           1       0.63      0.66      0.64       237

    accuracy                           0.79       833
   macro avg       0.74      0.75      0.75       833
weighted avg       0.80      0.79      0.79       833


Threshold: 0.4
              precision    recall  f1-score   support

           0       0.84      0.90      0.87       596
           1       0.69      0.57      0.63       237

    accuracy                           0.81       833
   macro avg       0.77      0.74      0.75       833
weighted avg       0.80      0.81      0.80       833


Threshold: 0.5
              precision    recall  f1-score   support

           0       0.83      0.94      0.88       596
           1       0.76      0.52      0.62       237

    accuracy                           0.82       833
   macro avg       0.80      0.73      0.75       833
weighted avg       0.81    

## Cost-Sensitive Learning


In [20]:
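# Define a function that computes sample weights
def calculate_sample_weights(y):
    # Balanced class weights: n_samples / (n_classes * per-class count), with n_classes = 2
    class_weights = len(y) / (2 * np.bincount(y))
    sample_weights = y.map(lambda x: class_weights[x])
    return sample_weights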
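
As a quick check of the weighting formula, the sketch below applies it to a tiny hypothetical label vector (8 negatives, 2 positives) and compares the result with scikit-learn's `compute_sample_weight` in `"balanced"` mode, which uses the same formula. The names `y_demo`, `manual_weights`, and `sklearn_weights` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels: 8 negatives and 2 positives.
y_demo = pd.Series([0] * 8 + [1] * 2)

# Same formula as calculate_sample_weights: n_samples / (2 * per-class count).
class_weights = len(y_demo) / (2 * np.bincount(y_demo))
manual_weights = y_demo.map(lambda label: class_weights[label])

# scikit-learn's "balanced" mode uses n_samples / (n_classes * bincount(y)).
sklearn_weights = compute_sample_weight("balanced", y_demo)

print(class_weights)                         # per-class weights
print(manual_weights.groupby(y_demo).sum())  # total weight carried by each class
print(np.allclose(manual_weights, sklearn_weights))
```

Here class 0 gets weight 0.625 and class 1 gets weight 2.5, so each class carries a total weight of 5 and contributes equally to the loss despite the 8:2 imbalance.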

In [25]:
# Split the full (non-undersampled) dataset; the report below is computed on its 4800 test rows
X = data.drop('RES', axis=1)
y = data['RES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compute sample weights for the training set
train_weights = calculate_sample_weights(y_train)

# Create and train the GBDT model
gbdt_model = GradientBoostingClassifier(max_depth=5, random_state=42)
gbdt_model.fit(X_train, y_train, sample_weight=train_weights)

# Predict and evaluate
y_pred = gbdt_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.90      0.94      4563
           1       0.23      0.56      0.33       237

    accuracy                           0.89      4800
   macro avg       0.60      0.73      0.63      4800
weighted avg       0.94      0.89      0.91      4800



## Combining the Three Methods

In [22]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [23]:
# Split the dataset
X = data.drop('RES', axis=1)
y = data['RES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Combine oversampling and undersampling
over = SMOTE(sampling_strategy=0.1, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
steps = [('over', over), ('under', under)]
pipeline = Pipeline(steps=steps)

# Apply oversampling and undersampling to the training data only
X_train_resampled, y_train_resampled = pipeline.fit_resample(X_train, y_train)

# Cost-sensitive sample weights (adjust the values as needed)
weights = {0: 1, 1: 10}

# Create and train the cost-sensitive GBDT model
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train_resampled, y_train_resampled, sample_weight=y_train_resampled.map(weights))
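
To see what the two `sampling_strategy` values (0.1 for SMOTE, then 0.5 for the undersampler) do to the class counts, the sketch below works through the arithmetic without calling imblearn. The helper `resampled_class_counts` and the training-set sizes are hypothetical, and the helper leaves a class unchanged when a ratio is already met, a case the real samplers may reject.

```python
def resampled_class_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Expected class counts after oversampling the minority, then undersampling the majority.

    Both ratios mean minority / majority after the respective step.
    """
    # Oversampling step: grow the minority class to over_ratio * majority.
    minority_after = max(n_minority, int(over_ratio * n_majority))
    # Undersampling step: shrink the majority class to minority / under_ratio.
    majority_after = min(n_majority, int(minority_after / under_ratio))
    return majority_after, minority_after


# Hypothetical training-set sizes, for illustration only.
majority_after, minority_after = resampled_class_counts(
    n_majority=10_000, n_minority=500, over_ratio=0.1, under_ratio=0.5
)
print(majority_after, minority_after)
```

With these hypothetical sizes the minority class grows from 500 to 1000 and the majority class shrinks from 10,000 to 2000. The class weights `{0: 1, 1: 10}` are then applied on top of that 2:1 ratio, so the effective emphasis on class 1 is considerably stronger than the weight of 10 alone suggests.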


In [24]:
# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.49      0.65      4563
           1       0.08      0.89      0.15       237

    accuracy                           0.51      4800
   macro avg       0.54      0.69      0.40      4800
weighted avg       0.94      0.51      0.63      4800

