# The Bagging Algorithm
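The core idea of Bagging is to draw multiple subsets from the original dataset by bootstrap sampling (sampling with replacement), train one base learner independently on each subset, and then combine the learners' predictions: classification tasks typically use voting, and regression tasks typically use averaging. Because the base learners do not depend on one another, they can be trained in parallel, and aggregating their outputs reduces the variance of the final model and with it the risk of overfitting.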
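The procedure described above can be sketched by hand in a few lines. This is a minimal illustration, not how scikit-learn implements it: the helper names `fit_bagged_trees` and `predict_majority` are hypothetical, the choice of 25 trees is arbitrary, and the hard majority vote used here differs from `BaggingClassifier`, which averages predicted class probabilities when the base learner provides `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=25, max_depth=3, seed=0):
    """Train one decision tree per bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        # Bootstrap sample: n_samples row indices drawn with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X, classes):
    """Combine the trees by hard majority vote (ties go to the lowest class)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_rows)
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])
    return classes[counts.argmax(axis=0)]


X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

classes = np.unique(y_train)
trees = fit_bagged_trees(X_train, y_train)
y_pred = predict_majority(trees, X_test, classes)
acc = accuracy_score(y_test, y_pred)
print(f"Hand-rolled bagging accuracy: {acc:.4f}")
```

Each tree sees a different resampled view of the training rows, so the trees make different mistakes, and the vote smooths those mistakes out.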

## 1 Bagging for Classification

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data
wine = load_wine()

X, y = wine.data, wine.target
# Print the shapes of the feature matrix and the label vector
print(X.shape, y.shape)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Bagging classifier (the base model is a decision tree)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)

# Train and evaluate
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

(178, 13) (178,)
Accuracy: 0.9630
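To see what the aggregation step contributes, one option is to score each fitted tree on its own and compare the spread with the ensemble. The sketch below refits the same classifier so that it runs standalone; `member_acc` and `ensemble_acc` are names introduced here for illustration, and the exact numbers may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42,
).fit(X_train, y_train)

# Score every fitted tree individually. estimators_features_ holds the column
# indices each tree was trained on (all columns with the settings above).
member_acc = np.array([
    accuracy_score(y_test, tree.predict(X_test[:, cols]))
    for tree, cols in zip(bag_clf.estimators_, bag_clf.estimators_features_)
])
ensemble_acc = accuracy_score(y_test, bag_clf.predict(X_test))

print(f"Single trees: min={member_acc.min():.4f}, "
      f"mean={member_acc.mean():.4f}, max={member_acc.max():.4f}")
print(f"Ensemble:     {ensemble_acc:.4f}")
```

The fitted base learners are exposed through the `estimators_` attribute, which is what makes this kind of inspection possible.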


## 2 Bagging for Regression

In [6]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

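# Create the Bagging regressor (the base model is an unpruned decision tree)
bag_reg = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,
    max_samples=0.7,
    bootstrap_features=True,  # also draw the feature columns with replacement
    random_state=42
)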

# Train and evaluate
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
print(f"Mean squared error: {mean_squared_error(y_test, y_pred):.4f}")

Mean squared error: 2761.5917
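For regression, the aggregation step is a plain average, and this can be checked directly against the fitted model. The sketch below refits the same regressor so that it runs standalone, then rebuilds the ensemble prediction by hand; `manual_pred` is a name introduced here. Because `bootstrap_features=True`, each tree was trained on its own resampled set of columns, so each tree has to be given the matching columns from `estimators_features_`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

bag_reg = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,
    max_samples=0.7,
    bootstrap_features=True,
    random_state=42,
).fit(X_train, y_train)

# Predict with every tree on the columns it was trained on, then average.
per_tree = np.stack([
    tree.predict(X_test[:, cols])
    for tree, cols in zip(bag_reg.estimators_, bag_reg.estimators_features_)
])  # shape: (n_estimators, n_test_rows)
manual_pred = per_tree.mean(axis=0)

# The hand-computed average should match the ensemble's own prediction.
matches = np.allclose(manual_pred, bag_reg.predict(X_test))
print(f"Manual average matches bag_reg.predict: {matches}")
```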


## 3 A Single Decision Tree for Comparison

In [8]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

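# A single unpruned tree; no random_state is set, so the score can vary slightly between runs
decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
print(f"Mean squared error: {mean_squared_error(y_test, y_pred):.4f}")

Mean squared error: 5718.3759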
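On this split the single tree's error (5718.38) is roughly twice that of the bagged ensemble from section 2 (2761.59). A single split can be lucky or unlucky, so the sketch below repeats the comparison over several splits. The number of repetitions (five) and the use of the split seed as the model seed are arbitrary choices made here, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

tree_mse, bag_mse = [], []
for seed in range(5):  # five different train/test splits (arbitrary choice)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # Baseline: one unpruned decision tree.
    tree = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    tree_mse.append(mean_squared_error(y_test, tree.predict(X_test)))

    # Bagged ensemble with the same settings as in section 2.
    bag = BaggingRegressor(
        DecisionTreeRegressor(),
        n_estimators=50,
        max_samples=0.7,
        bootstrap_features=True,
        random_state=seed,
    ).fit(X_train, y_train)
    bag_mse.append(mean_squared_error(y_test, bag.predict(X_test)))

print(f"Single tree, mean MSE over splits: {np.mean(tree_mse):.1f}")
print(f"Bagging,     mean MSE over splits: {np.mean(bag_mse):.1f}")
```

An unpruned tree fits the training data almost perfectly and therefore has high variance, which is exactly the kind of base learner that benefits most from averaging.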
