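# Decision Tree Regression

This notebook gives a systematic introduction to decision trees for regression tasks. It covers:

1. **Theory**: how decision tree regression works and the MSE splitting criterion
2. **One-dimensional regression**: visualizing the step-shaped predictions
3. **Multi-dimensional regression**: application to a real dataset
4. **Regularization**: controlling model complexity through hyperparameters
5. **Model evaluation**: cross-validation and learning-curve analysis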

---

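## Key Concepts

- A regression tree outputs a **constant value** at each leaf node (the mean of the target values of the training samples in that node)
- Splits are chosen to minimize the **mean squared error (MSE)** of the resulting child nodes
- The prediction function is **piecewise constant** (step-shaped), so it cannot extrapolate the way a linear model can
- Regularization (for example `max_depth`) is needed to prevent overfitting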
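
The MSE criterion above can be made concrete with a brute-force search over candidate thresholds on a single feature. The following is a minimal sketch, not scikit-learn's actual implementation: the helper `best_mse_split` and the toy arrays `x_toy` and `y_toy` are hypothetical names introduced only for illustration. It scores each threshold by the sample-weighted MSE of the two children and compares the result with a depth-1 `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def best_mse_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on a 1-D feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean, so its MSE equals its variance
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score


# Toy data with two clearly separated groups (values chosen for illustration)
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

threshold, score = best_mse_split(x_toy, y_toy)

# A depth-1 tree (a "stump") should choose the same threshold
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(x_toy.reshape(-1, 1), y_toy)

print("manual threshold:", threshold)
print("sklearn threshold:", stump.tree_.threshold[0])
print("leaf predictions:", stump.predict([[2.0], [11.0]]))
```

Each leaf prediction is the mean of the targets that fall on its side of the threshold, which is the "constant value per leaf" behavior described in the list above.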

## 1. Environment Setup and Data Preparation

In [None]:
# Standard library
import warnings
warnings.filterwarnings('ignore')

# Numerical computing
import numpy as np

# Visualization (the font list includes CJK-capable fonts as fallbacks)
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('seaborn-v0_8-whitegrid')

# scikit-learn modules
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import fetch_california_housing

# Fix the random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment setup complete")

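## 2. One-Dimensional Regression: Understanding Step-Shaped Predictions

The defining characteristic of decision tree regression is its **step-shaped prediction**:
- Each leaf node outputs a single constant value
- The overall prediction function is made up of horizontal line segments
- The larger `max_depth` is, the finer the steps and the closer the fit to the training data

The cells below generate data from a sine function to show this behavior: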

In [None]:
# Generate noisy sine data
n_samples = 200
X_1d = np.sort(5 * np.random.rand(n_samples, 1), axis=0)
y_1d = np.sin(X_1d).ravel() + np.random.normal(0, 0.1, n_samples)

# Dense grid of points for drawing the prediction curve
X_plot = np.linspace(0, 5, 500).reshape(-1, 1)

print(f"Data shape: X={X_1d.shape}, y={y_1d.shape}")
print(f"Target range: [{y_1d.min():.3f}, {y_1d.max():.3f}]")

In [None]:
# Compare decision tree regressors of different depths
depths = [2, 4, 6, 10]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, depth in zip(axes.ravel(), depths):
    # Fit the model
    reg = DecisionTreeRegressor(max_depth=depth, random_state=RANDOM_STATE)
    reg.fit(X_1d, y_1d)
    y_pred = reg.predict(X_plot)

    # Compute the training-set MSE
    train_mse = mean_squared_error(y_1d, reg.predict(X_1d))

    # Plot data, true function, and prediction
    ax.scatter(X_1d, y_1d, c='steelblue', alpha=0.5, s=20, label='Training data')
    ax.plot(X_plot, np.sin(X_plot), 'g--', linewidth=2, label='True function')
    ax.plot(X_plot, y_pred, 'r-', linewidth=2, label='Prediction')
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.set_title(f'max_depth={depth}, MSE={train_mse:.4f}, leaves={reg.get_n_leaves()}')
    ax.legend(loc='upper right', fontsize=8)

plt.suptitle('Decision Tree Regression: Effect of max_depth on Prediction Shape', fontsize=14)
plt.tight_layout()
plt.show()
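
The step shape and the lack of extrapolation can also be checked numerically. The sketch below is illustrative only: it regenerates similar sine data with its own random generator, and the names `X_demo`, `y_demo`, and `tree_demo` are assumptions introduced here. It counts the distinct predicted values on a dense grid (there can be at most one per leaf) and then queries points far outside the training range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data on [0, 5), similar to the data used above
rng = np.random.default_rng(0)
X_demo = np.sort(5 * rng.random((200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Inside the training range: at most one distinct value per leaf
grid = np.linspace(0, 5, 500).reshape(-1, 1)
n_levels = len(np.unique(tree_demo.predict(grid)))
print("distinct predictions:", n_levels, "| leaves:", tree_demo.get_n_leaves())

# Outside the training range: every query falls into the same outermost leaf
far_right = tree_demo.predict([[6.0], [50.0], [1000.0]])
print("predictions beyond the data:", far_right)
```

All three out-of-range queries return the same value because no split threshold lies beyond the largest training input, so the tree has no way to follow a trend past the data.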

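## 3. Multi-Dimensional Regression: California Housing Dataset

The California Housing dataset is used to demonstrate decision tree regression on a real-world problem.

**Dataset characteristics:**
- 8 features, including median income, house age, average number of rooms, population, average occupancy, and geographic location
- Target: median house value (in units of $100,000)
- Number of samples: 20,640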

In [None]:
# Load the dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print("\nFeatures:")
for i, name in enumerate(feature_names):
    print(f"  {i+1}. {name}: mean={X[:, i].mean():.2f}, std={X[:, i].std():.2f}")

print("\nTarget statistics:")
print(f"  mean={y.mean():.3f}, std={y.std():.3f}")
print(f"  range=[{y.min():.3f}, {y.max():.3f}]")

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## 4. Model Training and Evaluation

In [None]:
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """
    Evaluate a fitted regression model.

    Parameters
    ----------
    model : fitted regressor
    X_train, y_train : training data
    X_test, y_test : test data

    Returns
    -------
    dict : evaluation metrics for the training and test sets
    """
    # Predict on both splits
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Compute MSE once and derive RMSE from it
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    metrics = {
        'train_mse': train_mse,
        'test_mse': test_mse,
        'train_rmse': np.sqrt(train_mse),
        'test_rmse': np.sqrt(test_mse),
        'train_mae': mean_absolute_error(y_train, y_train_pred),
        'test_mae': mean_absolute_error(y_test, y_test_pred),
        'train_r2': r2_score(y_train, y_train_pred),
        'test_r2': r2_score(y_test, y_test_pred)
    }

    return metrics


# Train the baseline model (no regularization)
tree_full = DecisionTreeRegressor(random_state=RANDOM_STATE)
tree_full.fit(X_train, y_train)

metrics_full = evaluate_model(tree_full, X_train, y_train, X_test, y_test)

print("Unregularized decision tree (full tree):")
print(f"  Tree depth: {tree_full.get_depth()}")
print(f"  Leaves:     {tree_full.get_n_leaves()}")
print(f"  Train RMSE: {metrics_full['train_rmse']:.4f}")
print(f"  Test RMSE:  {metrics_full['test_rmse']:.4f}")
print(f"  Train R²:   {metrics_full['train_r2']:.4f}")
print(f"  Test R²:    {metrics_full['test_r2']:.4f}")
# A ratio far below 1 means the training error is much smaller than the test error
print(f"\nOverfitting indicator: Train/Test RMSE ratio = {metrics_full['train_rmse']/metrics_full['test_rmse']:.2f}")

## 5. Regularization: Hyperparameter Tuning

### 5.1 Effect of max_depth

Limiting the maximum depth of the tree is the most common way to control model complexity.

In [None]:
# Analyze how max_depth affects performance
depths = range(1, 25)
train_scores, test_scores = [], []

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=RANDOM_STATE)
    tree.fit(X_train, y_train)
    train_scores.append(-mean_squared_error(y_train, tree.predict(X_train)))
    test_scores.append(-mean_squared_error(y_test, tree.predict(X_test)))

# Convert the negated MSE scores to RMSE
train_rmse = np.sqrt(-np.array(train_scores))
test_rmse = np.sqrt(-np.array(test_scores))

# Plot the validation curve
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(depths, train_rmse, 'b-o', label='Training RMSE', markersize=4)
ax.plot(depths, test_rmse, 'r-o', label='Test RMSE', markersize=4)

# Mark the best depth (lowest test RMSE).
# Note: the depth is selected on the test set here, so test metrics
# reported later for this depth are optimistic estimates.
best_idx = int(np.argmin(test_rmse))
best_depth = depths[best_idx]
ax.axvline(x=best_depth, color='green', linestyle='--', alpha=0.7, label=f'Best depth={best_depth}')

ax.set_xlabel('max_depth')
ax.set_ylabel('RMSE')
ax.set_title('Validation Curve: max_depth vs RMSE')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Best depth: {best_depth}")
print(f"Test RMSE at best depth: {test_rmse[best_idx]:.4f}")
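
The sweep above scores each depth on the held-out test set. A common alternative is to pick the depth by cross-validation on the training data only, which leaves the test set untouched for the final evaluation. The following is a sketch of that approach using `sklearn.model_selection.validation_curve`. To stay self-contained and fast it runs on synthetic data from `make_friedman1` rather than the housing data, and the names `X_syn`, `y_syn`, `depth_grid`, and `cv_best_depth` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a training set (not the housing data)
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=0)

depth_grid = np.arange(1, 11)
train_sc, val_sc = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X_syn, y_syn,
    param_name="max_depth",
    param_range=depth_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores have shape (n_depths, n_folds); average over folds, then take RMSE
train_rmse_cv = np.sqrt(-train_sc.mean(axis=1))
val_rmse_cv = np.sqrt(-val_sc.mean(axis=1))

cv_best_depth = int(depth_grid[np.argmin(val_rmse_cv)])
print("depth chosen by 5-fold CV:", cv_best_depth)
print("validation RMSE at that depth:", val_rmse_cv.min())
```

With this variant, the test set is used once, after the depth has been fixed, so the reported test RMSE is not biased by the selection step.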

### 5.2 Cross-Validation with Multiple Hyperparameters

In [None]:
# Cross-validate the model that uses the best depth
tree_optimized = DecisionTreeRegressor(
    max_depth=best_depth,
    min_samples_leaf=5,
    random_state=RANDOM_STATE
)

# 5-fold cross-validation
cv_scores = cross_val_score(
    tree_optimized, X_train, y_train,
    cv=5, scoring='neg_mean_squared_error'
)

cv_rmse = np.sqrt(-cv_scores)

print("5-fold cross-validation results:")
print(f"  RMSE (per fold): {cv_rmse}")
print(f"  Mean RMSE:       {cv_rmse.mean():.4f}")
print(f"  Std RMSE:        {cv_rmse.std():.4f}")
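
The cell above cross-validates one fixed combination of `max_depth` and `min_samples_leaf`. To search over several combinations at once, `GridSearchCV` can be used. The sketch below is one possible setup rather than a recommended grid: the parameter values are arbitrary, and it again uses synthetic `make_friedman1` data (`X_syn`, `y_syn`) so that it runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a training set (not the housing data)
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# Illustrative grid: 4 depths x 3 leaf sizes = 12 candidates
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_syn, y_syn)

# best_score_ is the mean negated MSE of the best candidate
best_cv_rmse = np.sqrt(-search.best_score_)
print("best parameters:", search.best_params_)
print("best mean CV RMSE:", best_cv_rmse)
```

After fitting, `search.best_estimator_` holds a tree refit on all of the supplied data with the selected parameters.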

In [None]:
# Train the final model and evaluate it on the test set
tree_optimized.fit(X_train, y_train)
metrics_opt = evaluate_model(tree_optimized, X_train, y_train, X_test, y_test)

print("Tuned decision tree:")
print(f"  Tree depth: {tree_optimized.get_depth()}")
print(f"  Leaves:     {tree_optimized.get_n_leaves()}")
print(f"  Train RMSE: {metrics_opt['train_rmse']:.4f}")
print(f"  Test RMSE:  {metrics_opt['test_rmse']:.4f}")
print(f"  Train R²:   {metrics_opt['train_r2']:.4f}")
print(f"  Test R²:    {metrics_opt['test_r2']:.4f}")

print("\nComparison with the full tree:")
print(f"  Test RMSE reduction: {(metrics_full['test_rmse'] - metrics_opt['test_rmse']) / metrics_full['test_rmse'] * 100:.1f}%")
print(f"  Model complexity: {tree_full.get_n_leaves()} -> {tree_optimized.get_n_leaves()} leaves")

## 6. Learning Curve Analysis

In [None]:
# Compute and plot the learning curve
train_sizes, train_scores_lc, test_scores_lc = learning_curve(
    DecisionTreeRegressor(max_depth=best_depth, min_samples_leaf=5, random_state=RANDOM_STATE),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Per-fold RMSE first, so the mean and the std band use the same quantity
train_rmse_lc = np.sqrt(-train_scores_lc)
test_rmse_lc = np.sqrt(-test_scores_lc)
train_rmse_mean = train_rmse_lc.mean(axis=1)
train_rmse_std = train_rmse_lc.std(axis=1)
test_rmse_mean = test_rmse_lc.mean(axis=1)
test_rmse_std = test_rmse_lc.std(axis=1)

fig, ax = plt.subplots(figsize=(10, 6))
ax.fill_between(train_sizes, train_rmse_mean - train_rmse_std, train_rmse_mean + train_rmse_std, alpha=0.1, color='blue')
ax.fill_between(train_sizes, test_rmse_mean - test_rmse_std, test_rmse_mean + test_rmse_std, alpha=0.1, color='red')
ax.plot(train_sizes, train_rmse_mean, 'b-o', label='Training RMSE')
ax.plot(train_sizes, test_rmse_mean, 'r-o', label='Cross-validation RMSE')

ax.set_xlabel('Training Set Size')
ax.set_ylabel('RMSE')
ax.set_title('Learning Curve: Decision Tree Regressor')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Feature Importance Analysis

In [None]:
# Visualize feature importances
importances = tree_optimized.feature_importances_
indices = np.argsort(importances)[::-1]

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(range(len(importances)), importances[indices], color='steelblue')
ax.set_yticks(range(len(importances)))
ax.set_yticklabels([feature_names[i] for i in indices])
ax.set_xlabel('Feature Importance')
ax.set_title('Decision Tree Regressor: Feature Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print("Feature importance ranking:")
for i, idx in enumerate(indices):
    print(f"  {i+1}. {feature_names[idx]}: {importances[idx]:.4f}")
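
To see where the numbers in `feature_importances_` come from, the sketch below recomputes them from the fitted tree's internal arrays. It assumes the importance of a feature is the total sample-weighted impurity decrease over all splits on that feature, normalized to sum to 1. The data is synthetic (`make_friedman1`), and the names `X_syn`, `y_syn`, `model`, and `manual` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the example runs on its own
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=0)
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_syn, y_syn)

t = model.tree_
manual = np.zeros(X_syn.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf nodes have no children and contribute nothing
        continue
    # Impurity decrease of this split, weighted by the number of samples
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    manual[t.feature[node]] += decrease

# Normalize so the importances sum to 1
manual /= manual.sum()

print("matches feature_importances_:", np.allclose(manual, model.feature_importances_))
```

Because the measure only counts splits the tree actually made, a feature that is never used for splitting gets an importance of exactly 0, even if it is correlated with the target.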

## 8. Visualizing the Predictions

In [None]:
# Scatter plot of actual vs. predicted values
y_pred = tree_optimized.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
ax1 = axes[0]
ax1.scatter(y_test, y_pred, alpha=0.3, s=10)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect prediction')
ax1.set_xlabel('Actual Values')
ax1.set_ylabel('Predicted Values')
ax1.set_title('Actual vs Predicted')
ax1.legend()

# Residual distribution
ax2 = axes[1]
residuals = y_test - y_pred
ax2.hist(residuals, bins=50, edgecolor='black', alpha=0.7)
ax2.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Residual (Actual - Predicted)')
ax2.set_ylabel('Frequency')
ax2.set_title(f'Residual Distribution (mean={residuals.mean():.3f}, std={residuals.std():.3f})')

plt.tight_layout()
plt.show()

## 9. Visualizing the Tree Structure (Simplified)

In [None]:
# Train a shallow tree for visualization
tree_simple = DecisionTreeRegressor(max_depth=3, random_state=RANDOM_STATE)
tree_simple.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    tree_simple,
    feature_names=feature_names,
    filled=True,
    rounded=True,
    fontsize=10,
    ax=ax
)
ax.set_title('Decision Tree Structure (max_depth=3)', fontsize=14)
plt.tight_layout()
plt.show()

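print("\nHow to read the tree:")
print("  - Each node shows: split condition, squared error (MSE), sample count, predicted value")
print("  - Color intensity reflects the magnitude of the predicted value")
print("  - The left branch is taken when the condition is true, the right branch when it is false")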
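
When a figure is inconvenient (for example in logs or a terminal), the same structure can be printed as indented text rules with `sklearn.tree.export_text`. This is a small sketch on made-up two-feature data, where the target mostly depends on whether the first feature exceeds 0.5. The names `X_small`, `y_small`, `small_tree`, and the feature labels `x0` and `x1` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: the target jumps by 3 when x0 crosses 0.5; x1 is irrelevant
rng = np.random.default_rng(0)
X_small = rng.random((200, 2))
y_small = 3.0 * (X_small[:, 0] > 0.5) + rng.normal(0, 0.1, 200)

small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_small, y_small)

# Each indented line is either a split condition or a leaf value
rules = export_text(small_tree, feature_names=["x0", "x1"])
print(rules)
```

Each `<=` line corresponds to the left (condition true) branch and the matching `>` line to the right branch, mirroring the layout of the plotted tree.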

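## 10. Summary

### Key Findings

1. **Step-shaped predictions**: a regression tree produces piecewise-constant predictions, and the number of leaves determines how fine the steps are

2. **Overfitting risk**: an unregularized tree overfits easily; its training error is close to 0 while its generalization is poor

3. **Effect of regularization**: `max_depth` and `min_samples_leaf` control overfitting effectively

4. **Feature importance**: decision trees provide a built-in feature importance measure, which helps in understanding the data

### Practical Recommendations

- Tune `max_depth` first (values between 5 and 15 are a typical starting range, depending on the dataset)
- Set `min_samples_leaf` to avoid leaves that contain too few samples
- Use cross-validation to select the best hyperparameters
- Consider ensemble methods (random forests, gradient boosting) for better performance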
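
The last recommendation can be tried with very little extra code. The sketch below compares a regularized single tree with a `RandomForestRegressor` using 5-fold cross-validated R². It is illustrative only: the data is synthetic (`make_friedman1`), the hyperparameters are arbitrary, and the names `candidates` and `cv_r2` are introduced for this example. Results on the housing data may differ.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the comparison runs quickly on its own
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=0)

candidates = {
    "single tree": DecisionTreeRegressor(max_depth=8, min_samples_leaf=5, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R² over 5 folds for each candidate
cv_r2 = {
    name: cross_val_score(est, X_syn, y_syn, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean CV R2 = {score:.3f}")
```

Averaging many trees trained on bootstrap samples typically smooths out the step-shaped predictions of a single tree, which is why ensembles usually generalize better at the cost of interpretability.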

In [None]:
# ============================================================
# Unit tests: basic sanity checks for the code above
# ============================================================

def run_tests():
    """Run basic functional tests."""
    print("="*50)
    print("Running unit tests...")
    print("="*50)

    # Test 1: the model trains and respects max_depth
    try:
        model = DecisionTreeRegressor(max_depth=5, random_state=42)
        model.fit(X_train[:100], y_train[:100])  # small subset for speed
        assert model.get_depth() <= 5
        print("[PASS] Test 1: model trained successfully")
    except Exception as e:
        print(f"[FAIL] Test 1: {e}")

    # Test 2: predictions have the expected shape
    try:
        pred = model.predict(X_test[:10])
        assert pred.shape == (10,)
        print("[PASS] Test 2: prediction shape is correct")
    except Exception as e:
        print(f"[FAIL] Test 2: {e}")

    # Test 3: feature importances have the right length and sum to 1
    try:
        imp = model.feature_importances_
        assert len(imp) == X_train.shape[1]
        assert abs(sum(imp) - 1.0) < 1e-6
        print("[PASS] Test 3: feature importances are valid")
    except Exception as e:
        print(f"[FAIL] Test 3: {e}")

    # Test 4: cross-validation runs
    try:
        scores = cross_val_score(model, X_train[:200], y_train[:200], cv=3)
        assert len(scores) == 3
        print("[PASS] Test 4: cross-validation ran successfully")
    except Exception as e:
        print(f"[FAIL] Test 4: {e}")

    # Test 5: the evaluation helper returns the expected metrics
    try:
        metrics = evaluate_model(model, X_train[:100], y_train[:100], X_test[:50], y_test[:50])
        assert all(key in metrics for key in ['train_mse', 'test_mse', 'train_r2', 'test_r2'])
        print("[PASS] Test 5: evaluation helper works")
    except Exception as e:
        print(f"[FAIL] Test 5: {e}")

    print("="*50)
    print("Tests complete!")
    print("="*50)

run_tests()