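# Inverse Reinforcement Learning: An In-Depth Tutorial

## Table of Contents
1. [Problem Definition and Motivation](#1-problem-definition-and-motivation)
2. [Theoretical Foundations](#2-theoretical-foundations)
3. [Max-Margin IRL](#3-max-margin-irl)
4. [Maximum Entropy IRL](#4-maximum-entropy-irl)
5. [Deep IRL and GAIL](#5-deep-irl-and-gail)
6. [Experiments and Comparison](#6-experiments-and-comparison)
7. [Summary and Best Practices](#7-summary-and-best-practices)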

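## 1. Problem Definition and Motivation

### 1.1 From Forward RL to Inverse RL

**Forward reinforcement learning (Forward RL)**:
- Input: MDP = (S, A, P, R, γ)
- Output: an optimal policy π*
- Problem: given a reward, find the optimal behavior

**Inverse reinforcement learning (Inverse RL)**:
- Input: MDP\R = (S, A, P, ?, γ) plus expert demonstrations D = {τ₁, τ₂, ...}
- Output: a reward function R under which the expert's behavior is optimal
- Problem: given behavior, infer the reward (the intent)

### 1.2 Why Do We Need IRL?

1. **Reward functions are hard to design by hand**
   - The reward for a complex task may have hundreds of components
   - Reward engineering easily leads to "reward hacking"

2. **Learning from demonstrations**
   - Human experts can demonstrate correct behavior but struggle to describe the reward precisely
   - "I know what good driving looks like, but I cannot write down the reward formula."

3. **Understanding an agent's intent**
   - Infer goals by observing behavior
   - Applications: human-robot interaction, AI safety, behavior prediction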

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple, Optional, Callable
from dataclasses import dataclass
import sys
sys.path.append('..')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

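## 2. Theoretical Foundations

### 2.1 The Linear Reward Assumption

IRL commonly assumes that the reward is a linear combination of features:

$$R(s) = \theta^T \phi(s) = \sum_{i=1}^{d} \theta_i \phi_i(s)$$

where:
- $\phi: S \rightarrow \mathbb{R}^d$ is the feature map
- $\theta \in \mathbb{R}^d$ is the vector of reward weights to be learned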
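As a minimal sketch of the linear reward above, the snippet below evaluates $\theta^T \phi(s)$ with radial basis function features. The helper names `rbf_features` and `linear_reward`, the centers, and the bandwidth are illustrative assumptions; they are not the API of the `LinearFeatureExtractor` class used later in this notebook.

```python
import numpy as np

def rbf_features(state, centers, bandwidth=1.0):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 * bandwidth^2)) for each center c_i."""
    sq_dist = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def linear_reward(state, theta, centers, bandwidth=1.0):
    """R(s) = theta^T phi(s)."""
    return float(theta @ rbf_features(state, centers, bandwidth))

# Hypothetical setup: two RBF centers, with a penalty on the first
# and a bonus on the second.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
theta = np.array([-1.0, 2.0])

r_goal = linear_reward(np.array([1.0, 1.0]), theta, centers)
r_origin = linear_reward(np.array([0.0, 0.0]), theta, centers)
print(f"R(goal) = {r_goal:.3f}, R(origin) = {r_origin:.3f}")
```

Because the reward is linear in $\theta$, changing the weights re-ranks states without touching the features, which is what the IRL algorithms below exploit.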

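### 2.2 Feature Expectations

For a policy $\pi$, the feature expectations are defined as:

$$\mu_\pi = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right]$$

This vector acts as the policy's "signature": it describes which kinds of states the policy tends to visit. Under the linear reward assumption, the expected return of $\pi$ is exactly $\theta^T \mu_\pi$.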
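In practice $\mu_\pi$ is estimated from sampled trajectories. The sketch below averages discounted feature sums over a list of state sequences; the function name `empirical_feature_expectations` and the identity feature map are assumptions made for illustration only.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu = E[sum_t gamma^t phi(s_t)].

    trajectories: list of arrays, each of shape (T, state_dim)
    phi: callable mapping a state to a feature vector
    """
    total = None
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))        # (T,)
        feats = np.array([phi(s) for s in states])         # (T, d)
        contrib = discounts @ feats                        # discounted sum, (d,)
        total = contrib if total is None else total + contrib
    return total / len(trajectories)

# Two toy trajectories with the identity feature map phi(s) = s.
trajectories = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [0.0, 1.0]]),
]
mu = empirical_feature_expectations(trajectories, phi=lambda s: s, gamma=0.5)
print(mu)  # mean of (1, 0.5) and (0, 1.5)
```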

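### 2.3 The Ambiguity of IRL

**Key insight**: infinitely many reward functions are consistent with the same policy.

- $R(s) = 0$ for all $s$: every policy is optimal
- $R(s) = c$ (a constant): same as above
- Any other $R$ under which the expert policy is optimal, for example any positive rescaling $aR$ with $a > 0$

Additional constraints are needed to single out a "correct" reward function.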
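The ambiguity can be checked numerically. The sketch below runs value iteration on a hypothetical 3-state chain (infinite horizon, no terminal state, deterministic left/right moves) and compares the greedy policies under $R$ and under $3R + 5$, then confirms that a constant reward makes all actions tie. The MDP and the helper `q_values` are assumptions for illustration.

```python
import numpy as np

def q_values(P, R, gamma, n_iter=500):
    """Value iteration. P: (A, S, S) transitions, R: (S,) state rewards."""
    V = np.zeros(P.shape[1])
    for _ in range(n_iter):
        Q = R[None, :] + gamma * (P @ V)   # Q[a, s] = R(s) + gamma * E[V(s')]
        V = Q.max(axis=0)
    return Q

# Chain MDP: action 0 moves left, action 1 moves right, walls at both ends.
n_states = 3
P = np.zeros((2, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0

R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

pi_base = q_values(P, R, gamma).argmax(axis=0)
pi_affine = q_values(P, 3.0 * R + 5.0, gamma).argmax(axis=0)
Q_const = q_values(P, np.full(n_states, 2.0), gamma)

print("greedy policy under R:      ", pi_base)
print("greedy policy under 3R + 5: ", pi_affine)
print("all actions tie under constant R:", np.allclose(Q_const[0], Q_const[1]))
```

Note that the constant shift is harmless here only because the MDP never terminates; in episodic tasks a constant offset can change the preferred episode length.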

In [None]:
from inverse_rl import (
    IRLConfig,
    Demonstration,
    LinearFeatureExtractor,
    MaxMarginIRL,
    MaxEntropyIRL,
    DeepIRL,
    GAILDiscriminator,
    compute_feature_matching_loss,
)

def visualize_irl_ambiguity():
    """Visualize the ambiguity of IRL."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # A simple 2D state space
    x = np.linspace(-2, 2, 50)
    y = np.linspace(-2, 2, 50)
    X, Y = np.meshgrid(x, y)
    
    # Three different reward functions, each maximized at the goal (1, 1)
    goal = np.array([1.0, 1.0])
    
    # R1: negative distance
    R1 = -np.sqrt((X - goal[0])**2 + (Y - goal[1])**2)
    
    # R2: negative squared distance
    R2 = -((X - goal[0])**2 + (Y - goal[1])**2)
    
    # R3: exponential decay
    R3 = np.exp(-np.sqrt((X - goal[0])**2 + (Y - goal[1])**2))
    
    rewards = [R1, R2, R3]
    titles = ['$R_1 = -||s - g||$', '$R_2 = -||s - g||^2$', '$R_3 = e^{-||s - g||}$']
    
    for ax, R, title in zip(axes, rewards, titles):
        contour = ax.contourf(X, Y, R, levels=20, cmap='viridis')
        ax.plot(goal[0], goal[1], 'r*', markersize=15, label='Goal')
        ax.set_xlabel('$s_1$')
        ax.set_ylabel('$s_2$')
        ax.set_title(title)
        plt.colorbar(contour, ax=ax)
    
    plt.suptitle('IRL Ambiguity: Different Rewards, Same Optimal Policy', y=1.02)
    plt.tight_layout()
    plt.show()
    
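    print("Key insight: all three reward functions make moving toward (1, 1) optimal.")
    print("IRL needs an extra criterion (e.g. max margin or max entropy) to pick a single solution.")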

visualize_irl_ambiguity()

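## 3. Max-Margin IRL

### 3.1 Core Idea

Find reward weights $\theta$ such that the expert's expected return exceeds that of every other policy by the largest possible margin:

$$\max_\theta \min_\pi \left[ \theta^T (\mu_E - \mu_\pi) \right]$$

subject to $||\theta||_2 \leq 1$.

### 3.2 Geometric Interpretation

In feature-expectation space, this amounts to finding the hyperplane with normal $\theta$ that best separates the expert's feature expectations $\mu_E$ from those of all other policies.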
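When the competing policies form a fixed finite set, the max-margin direction points from the closest point of their convex hull to $\mu_E$. The sketch below finds that point with a Frank-Wolfe style iteration (exact line search, step clipped to $[0, 1]$), in the spirit of the projection method of Abbeel & Ng (2004). It is an illustration under stated assumptions, not the `MaxMarginIRL` implementation: the function `max_margin_weights` is hypothetical, the "RL step" is replaced by an argmax over the given candidates, and $\mu_E$ is assumed to lie outside the hull.

```python
import numpy as np

def max_margin_weights(mu_expert, mu_candidates, n_iter=100, tol=1e-9):
    """Return unit-norm theta and the margin min_i theta^T (mu_E - mu_i)."""
    mu_bar = mu_candidates[0].astype(float).copy()   # current point in the hull
    for _ in range(n_iter):
        theta = mu_expert - mu_bar
        # Stand-in for the RL step: best candidate under the current reward.
        best = mu_candidates[np.argmax(mu_candidates @ theta)]
        direction = best - mu_bar
        gap = theta @ direction
        if gap <= tol:                               # no candidate improves mu_bar
            break
        step = np.clip(gap / (direction @ direction), 0.0, 1.0)
        mu_bar = mu_bar + step * direction
    theta = mu_expert - mu_bar
    theta = theta / np.linalg.norm(theta)            # assumes mu_E is outside the hull
    margin = np.min((mu_expert - mu_candidates) @ theta)
    return theta, margin

mu_expert = np.array([3.0, 4.0])
mu_candidates = np.array([
    [1.0, 1.5], [2.0, 1.0], [1.5, 2.5],
    [0.5, 3.0], [2.5, 2.0], [1.0, 0.5],
])
theta, margin = max_margin_weights(mu_expert, mu_candidates)
print(f"theta = {theta}, margin = {margin:.4f}")
```

The returned margin equals the distance from $\mu_E$ to the hull once the iteration has converged, which is why the hyperplane picture and the optimization problem describe the same solution.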

In [None]:
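def demonstrate_max_margin_irl():
    """Illustrate the geometric intuition behind max-margin IRL."""
    
    # Simulated feature expectations (2D for visualization)
    np.random.seed(42)
    
    # Feature expectations of the expert
    expert_mu = np.array([3.0, 4.0])
    
    # Feature expectations of other policies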
    other_policies_mu = np.array([
        [1.0, 1.5],
        [2.0, 1.0],
        [1.5, 2.5],
        [0.5, 3.0],
        [2.5, 2.0],
        [1.0, 0.5],
    ])
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left panel: feature-expectation space
    ax = axes[0]
    ax.scatter(other_policies_mu[:, 0], other_policies_mu[:, 1],
               c='blue', s=100, label=r'Other Policies $\mu_\pi$', alpha=0.6)
    ax.scatter(expert_mu[0], expert_mu[1], c='red', s=200,
               marker='*', label=r'Expert $\mu_E$', zorder=5)
    
    # Simplification: use the mean of the other policies as a stand-in for
    # the point of their convex hull that is closest to the expert
    closest_point = np.mean(other_policies_mu, axis=0)
    
    # Normal direction of the separating hyperplane
    theta = expert_mu - closest_point
    theta = theta / np.linalg.norm(theta)
    
    # Draw the separating line
    midpoint = (expert_mu + closest_point) / 2
    perp = np.array([-theta[1], theta[0]])
    line_points = np.array([midpoint - 3*perp, midpoint + 3*perp])
    ax.plot(line_points[:, 0], line_points[:, 1], 'g--', linewidth=2,
            label='Separating Hyperplane')
    
    # Draw the θ direction
    ax.arrow(midpoint[0], midpoint[1], theta[0], theta[1],
             head_width=0.15, head_length=0.1, fc='green', ec='green')
    ax.annotate(r'$\theta$', midpoint + theta + 0.2, fontsize=14)
    
    ax.set_xlabel(r'Feature Expectation $\mu_1$')
    ax.set_ylabel(r'Feature Expectation $\mu_2$')
    ax.set_title('Max-Margin IRL: Finding Separating Hyperplane')
    ax.legend()
    ax.set_xlim(-0.5, 4.5)
    ax.set_ylim(-0.5, 5)
    ax.grid(True, alpha=0.3)
    
    # Right panel: the iterative algorithm
    ax = axes[1]
    
    # Simulated margins (illustrative values, not the output of a real run)
    margins = [0.5, 1.2, 1.8, 2.1, 2.3, 2.35, 2.38]
    ax.plot(margins, 'b-o', linewidth=2, markersize=8)
    ax.axhline(y=2.4, color='r', linestyle='--', label='Optimal Margin')
    ax.set_xlabel('Iteration')
    ax.set_ylabel(r'Margin $||\mu_E - \mu_\pi||$')
    ax.set_title('Max-Margin IRL Convergence')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nLearned reward weights θ: [{theta[0]:.3f}, {theta[1]:.3f}]")
    print(f"After normalization: ||θ|| = {np.linalg.norm(theta):.3f}")

demonstrate_max_margin_irl()

### 3.3 Implementation and Testing

In [None]:
# Create simulated demonstration data
state_dim = 4
feature_dim = 10

# Feature extractor
feature_extractor = LinearFeatureExtractor(
    state_dim=state_dim,
    feature_type='rbf',
    num_features=feature_dim,
)

# Generate expert demonstrations (the simulated expert drifts toward a fixed target region)
def generate_expert_demonstrations(n_demos: int, traj_length: int) -> List[Demonstration]:
    """Generate simulated expert demonstrations."""
    demos = []
    
    for _ in range(n_demos):
        # The expert tends to visit a particular region
        states = []
        state = np.random.randn(state_dim) * 0.5  # start from a random position
        
        for t in range(traj_length):
            states.append(state.copy())
            # The expert moves toward the target region
            target = np.array([1.0, 1.0, 0.5, 0.5])
            state = state + 0.1 * (target - state) + 0.05 * np.random.randn(state_dim)
        
        demos.append(Demonstration(
            states=np.array(states),
            actions=np.zeros((traj_length, 1)),  # placeholder: actions are not used here
        ))
    
    return demos

# Generate the demonstrations
demonstrations = generate_expert_demonstrations(n_demos=20, traj_length=50)
print(f"Generated {len(demonstrations)} expert demonstrations")
print(f"Length of each trajectory: {len(demonstrations[0].states)}")

In [None]:
# Configure IRL
config = IRLConfig(
    discount_factor=0.99,
    learning_rate=0.1,
    max_iterations=50,
    convergence_threshold=0.01,
    feature_dim=feature_dim,
)

# Train Max-Margin IRL
maxmargin_irl = MaxMarginIRL(
    config=config,
    feature_extractor=feature_extractor,
)

print("Training Max-Margin IRL...")
reward_weights = maxmargin_irl.fit(demonstrations)

print("\nLearned reward weights:")
print(f"  Shape: {reward_weights.shape}")
print(f"  Range: [{reward_weights.min():.4f}, {reward_weights.max():.4f}]")
print(f"  L2 norm: {np.linalg.norm(reward_weights):.4f}")

# Visualize training progress
if maxmargin_irl._iteration_history:
    margins = [h['margin'] for h in maxmargin_irl._iteration_history]
    
    plt.figure(figsize=(10, 4))
    plt.plot(margins, 'b-o')
    plt.xlabel('Iteration')
    plt.ylabel('Margin')
    plt.title('Max-Margin IRL Training Progress')
    plt.grid(True, alpha=0.3)
    plt.show()

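## 4. Maximum Entropy IRL

### 4.1 Core Idea

Model the expert's behavior as a Boltzmann distribution over trajectories:

$$P(\tau | \theta) = \frac{1}{Z(\theta)} \exp\left(\sum_{t=0}^{T} R_\theta(s_t, a_t)\right)$$

where $Z(\theta)$ is the partition function that normalizes over all trajectories. The weights are fit by maximizing the log-likelihood of the demonstrations:

$$\max_\theta \sum_{\tau \in D} \log P(\tau | \theta)$$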
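For a linear reward and a small, enumerable set of trajectories, the likelihood can be maximized directly: the gradient of the average log-likelihood is the expert feature expectation minus the model feature expectation. The sketch below assumes each trajectory is summarized by its summed feature vector; `maxent_irl` and the toy data are hypothetical and unrelated to the `MaxEntropyIRL` class used in this notebook.

```python
import numpy as np

def maxent_irl(traj_features, demo_indices, lr=0.5, n_iter=500):
    """Gradient ascent on the MaxEnt log-likelihood over enumerated trajectories.

    traj_features: (n_traj, d) summed features of every possible trajectory
    demo_indices:  indices of the trajectories the expert demonstrated
    """
    mu_expert = traj_features[demo_indices].mean(axis=0)
    theta = np.zeros(traj_features.shape[1])
    for _ in range(n_iter):
        logits = traj_features @ theta             # trajectory rewards
        probs = np.exp(logits - logits.max())      # stable softmax = P(tau | theta)
        probs /= probs.sum()
        mu_model = probs @ traj_features           # model feature expectations
        theta += lr * (mu_expert - mu_model)       # likelihood gradient
    return theta, mu_expert, mu_model

# Four possible trajectories with binary features; the expert mostly picks the last one.
traj_features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_indices = [3, 3, 3, 1, 2, 0]

theta, mu_expert, mu_model = maxent_irl(traj_features, demo_indices)
print("theta:    ", theta)
print("mu_expert:", mu_expert)
print("mu_model: ", mu_model)
```

At convergence the model's feature expectations match the expert's, which is the feature-matching property that MaxEnt IRL shares with max-margin methods.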

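### 4.2 Advantages

1. **Handles suboptimal experts**: does not assume the expert is perfectly optimal
2. **Probabilistic framework**: yields a full distribution over trajectories, and with it a measure of uncertainty, rather than a single optimal behavior
3. **Maximum-entropy principle**: among all distributions that satisfy the constraints, it picks the one with the highest entropy (the least biased one)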

In [None]:
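def visualize_maxent_irl_concept():
    """Visualize the concepts behind maximum entropy IRL."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Left panel: probability distribution over trajectories
    ax = axes[0]
    
    # Simulated trajectory probabilities as a function of reward, at several temperatures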
    trajectory_rewards = np.linspace(-5, 5, 100)
    temperatures = [0.5, 1.0, 2.0]
    
    for temp in temperatures:
        probs = np.exp(trajectory_rewards / temp)
        probs = probs / probs.sum()
        ax.plot(trajectory_rewards, probs, label=f'T = {temp}')
    
    ax.set_xlabel('Trajectory Reward $R(\\tau)$')
    ax.set_ylabel('$P(\\tau | \\theta)$')
    ax.set_title('Boltzmann Distribution over Trajectories')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Middle panel: feature matching
    ax = axes[1]
    
    iterations = range(50)
    expert_feature = 2.5
    model_features = [0.5 + 2 * (1 - np.exp(-i/10)) for i in iterations]
    
    ax.axhline(y=expert_feature, color='r', linestyle='--', label=r'Expert $\mu_E$')
    ax.plot(iterations, model_features, 'b-', label=r'Model $\mu_\theta$')
    ax.fill_between(iterations, model_features, expert_feature, alpha=0.3)
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Feature Expectation')
    ax.set_title('Feature Matching in MaxEnt IRL')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Right panel: gradient updates
    ax = axes[2]
    
    # Gradient = μ_E - μ_θ
    gradients = [expert_feature - mf for mf in model_features]
    ax.plot(iterations, gradients, 'g-', linewidth=2)
    ax.axhline(y=0, color='k', linestyle='-', alpha=0.3)
    ax.set_xlabel('Iteration')
    ax.set_ylabel(r'Gradient $\nabla_\theta \mathcal{L}$')
    ax.set_title(r'Gradient: $\mu_E - \mu_\theta$')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
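    print("MaxEnt IRL key insights:")
    print("1. Higher-reward trajectories are more likely, but the choice is not deterministic")
    print("2. Reward weights are learned by matching feature expectations")
    print("3. The gradient is simple: expert feature expectations minus model feature expectations")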

visualize_maxent_irl_concept()

In [None]:
# Train Max-Entropy IRL
maxent_irl = MaxEntropyIRL(
    config=config,
    feature_extractor=feature_extractor,
    temperature=1.0,
)

print("Training Max-Entropy IRL...")
maxent_weights = maxent_irl.fit(demonstrations)

print("\nLearned reward weights:")
print(f"  Range: [{maxent_weights.min():.4f}, {maxent_weights.max():.4f}]")

# Compare the weights learned by the two methods
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].bar(range(len(reward_weights)), reward_weights, alpha=0.7)
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Weight')
axes[0].set_title('Max-Margin IRL Weights')

axes[1].bar(range(len(maxent_weights)), maxent_weights, alpha=0.7, color='orange')
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Weight')
axes[1].set_title('Max-Entropy IRL Weights')

plt.tight_layout()
plt.show()

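## 5. Deep IRL and GAIL

### 5.1 Deep IRL

Replace the linear reward function with a neural network:

$$R_\theta(s) = f_\theta(s)$$

This lets the model represent rewards that are nonlinear in the state, so it no longer depends on hand-designed features.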
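As a rough picture of what $f_\theta$ can look like, the sketch below builds a small ReLU multilayer perceptron in NumPy that maps a batch of states to scalar rewards. Only the forward pass is shown. The helpers `init_mlp` and `reward_mlp`, the layer sizes, and the initialization are assumptions; the `DeepIRL` class in this notebook may be structured quite differently.

```python
import numpy as np

def init_mlp(sizes, rng):
    """He-style initialization for a fully connected network."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def reward_mlp(params, states):
    """R_theta(s) = f_theta(s): ReLU hidden layers, linear scalar output."""
    h = np.atleast_2d(states)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)

rng = np.random.default_rng(0)
params = init_mlp([4, 32, 32, 1], rng)     # state_dim = 4, two hidden layers
states = rng.normal(size=(5, 4))
rewards = reward_mlp(params, states)
print(rewards.shape)                        # one scalar reward per state
```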

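### 5.2 GAIL (Generative Adversarial Imitation Learning)

GAIL casts imitation learning as a generative adversarial game:

- **Discriminator D**: distinguishes expert state-action pairs from those generated by the policy
- **Generator π**: the policy network, which tries to fool the discriminator

Objective (using the convention that $D \to 1$ on expert data; the causal-entropy regularizer of Ho & Ermon (2016) is omitted):
$$\min_\pi \max_D \mathbb{E}_{\pi_E}[\log D(s,a)] + \mathbb{E}_\pi[\log(1-D(s,a))]$$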
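The inner maximization over $D$ is ordinary binary classification. The sketch below trains a logistic-regression discriminator by gradient ascent on the objective above, then derives the surrogate reward $-\log(1 - D)$ that the policy update would maximize. The Gaussian toy data and the functions `gail_objective` and `train_discriminator` are assumptions for illustration, not the `GAILDiscriminator` class used in this notebook, and the policy update itself is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gail_objective(w, b, expert_x, policy_x, eps=1e-12):
    """E_expert[log D] + E_policy[log(1 - D)] with D(x) = sigmoid(w^T x + b)."""
    d_expert = sigmoid(expert_x @ w + b)
    d_policy = sigmoid(policy_x @ w + b)
    return np.mean(np.log(d_expert + eps)) + np.mean(np.log(1.0 - d_policy + eps))

def train_discriminator(expert_x, policy_x, lr=0.1, n_steps=500):
    """Gradient ascent on the discriminator objective."""
    w, b = np.zeros(expert_x.shape[1]), 0.0
    for _ in range(n_steps):
        d_expert = sigmoid(expert_x @ w + b)
        d_policy = sigmoid(policy_x @ w + b)
        # d/dz log(sigmoid(z)) = 1 - sigmoid(z); d/dz log(1 - sigmoid(z)) = -sigmoid(z)
        grad_w = ((1.0 - d_expert)[:, None] * expert_x).mean(axis=0) \
                 - (d_policy[:, None] * policy_x).mean(axis=0)
        grad_b = (1.0 - d_expert).mean() - d_policy.mean()
        w += lr * grad_w
        b += lr * grad_b
    return w, b

rng = np.random.default_rng(0)
expert_x = rng.normal(loc=2.0, scale=0.5, size=(200, 2))   # stand-in for expert (s, a)
policy_x = rng.normal(loc=0.0, scale=0.5, size=(200, 2))   # stand-in for policy (s, a)

obj_before = gail_objective(np.zeros(2), 0.0, expert_x, policy_x)
w, b = train_discriminator(expert_x, policy_x)
obj_after = gail_objective(w, b, expert_x, policy_x)

d_expert_mean = sigmoid(expert_x @ w + b).mean()
d_policy_mean = sigmoid(policy_x @ w + b).mean()
policy_reward = -np.log(1.0 - sigmoid(policy_x @ w + b) + 1e-12)  # surrogate reward

print(f"objective: {obj_before:.3f} -> {obj_after:.3f}")
print(f"mean D on expert: {d_expert_mean:.3f}, mean D on policy: {d_policy_mean:.3f}")
```

As the policy improves and its samples overlap with the expert's, the best achievable discriminator output moves toward 0.5 everywhere and the surrogate reward flattens out.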

In [None]:
def visualize_gail_concept():
    """Visualize the adversarial training process of GAIL."""
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    
    # Simulated training process
    np.random.seed(42)
    
    # Expert data distribution (fixed)
    expert_mean = np.array([2.0, 2.0])
    expert_cov = np.array([[0.3, 0.1], [0.1, 0.3]])
    expert_data = np.random.multivariate_normal(expert_mean, expert_cov, 200)
    
    # Policy data distribution (gradually approaches the expert)
    policy_means = [
        np.array([0.0, 0.0]),
        np.array([1.0, 1.0]),
        np.array([1.8, 1.8]),
    ]
    
    for idx, (ax_row, policy_mean) in enumerate(zip(axes.T, policy_means)):
        policy_data = np.random.multivariate_normal(policy_mean, expert_cov, 200)
        
        # Top row: data distributions
        ax = ax_row[0]
        ax.scatter(expert_data[:, 0], expert_data[:, 1], c='red', alpha=0.5, 
                   label='Expert', s=20)
        ax.scatter(policy_data[:, 0], policy_data[:, 1], c='blue', alpha=0.5, 
                   label='Policy', s=20)
        ax.set_xlabel('$s_1$')
        ax.set_ylabel('$s_2$')
        ax.set_title(f'Iteration {idx * 50}')
        ax.legend()
        ax.set_xlim(-2, 4)
        ax.set_ylim(-2, 4)
        
        # Bottom row: discriminator decision boundary
        ax = ax_row[1]
        
        # Simulated discriminator output
        x = np.linspace(-2, 4, 50)
        y = np.linspace(-2, 4, 50)
        X, Y = np.meshgrid(x, y)
        
        # Simplified discriminator: D is close to 1 near the expert mean
        # and close to 0 near the policy mean
        dist_expert = np.sqrt((X - expert_mean[0])**2 + (Y - expert_mean[1])**2)
        dist_policy = np.sqrt((X - policy_mean[0])**2 + (Y - policy_mean[1])**2)
        D = 1 / (1 + np.exp(-2 * (dist_policy - dist_expert)))
        
        contour = ax.contourf(X, Y, D, levels=20, cmap='RdBu_r', alpha=0.7)
        ax.contour(X, Y, D, levels=[0.5], colors='black', linewidths=2)
        ax.set_xlabel('$s_1$')
        ax.set_ylabel('$s_2$')
        ax.set_title('Discriminator $D(s)$')
    
    plt.tight_layout()
    plt.show()
    
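    print("GAIL training process:")
    print("1. The discriminator learns to tell expert (red) from policy (blue)")
    print("2. The policy learns to fool the discriminator by producing expert-like behavior")
    print("3. Eventually the two distributions overlap and the discriminator cannot tell them apart")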

visualize_gail_concept()

In [None]:
# Test the GAIL discriminator
action_dim = 2

discriminator = GAILDiscriminator(
    state_dim=state_dim,
    action_dim=action_dim,
    hidden_dims=(64, 64),
    learning_rate=0.001,
)

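# Build expert and policy batches
batch_size = 64

# Expert states are pooled across demonstrations, because a single trajectory
# has only 50 steps. The demonstrations carry no meaningful actions, so random
# placeholder actions are used.
expert_states = np.concatenate([d.states for d in demonstrations])[:batch_size]
expert_actions = np.random.randn(batch_size, action_dim)

# Policy data (random)
policy_states = np.random.randn(batch_size, state_dim)
policy_actions = np.random.randn(batch_size, action_dim)

# Train the discriminator
print("Training the GAIL discriminator...")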
losses = []
for i in range(100):
    stats = discriminator.update(
        expert_states, expert_actions,
        policy_states, policy_actions,
    )
    losses.append(stats['total_loss'])

plt.figure(figsize=(10, 4))
plt.plot(losses)
plt.xlabel('Update Step')
plt.ylabel('Discriminator Loss')
plt.title('GAIL Discriminator Training')
plt.grid(True, alpha=0.3)
plt.show()

# Check the discriminator outputs
expert_d = discriminator.forward(expert_states[:10], expert_actions[:10])
policy_d = discriminator.forward(policy_states[:10], policy_actions[:10])

print(f"\nDiscriminator output on expert data (should be close to 1): {expert_d.mean():.4f}")
print(f"Discriminator output on policy data (should be close to 0): {policy_d.mean():.4f}")

## 6. Experiments and Comparison

### 6.1 Method Comparison

In [None]:
def compare_irl_methods():
    """Compare different IRL methods."""
    
    # All methods use the same demonstration data
    methods = {
        'Max-Margin': MaxMarginIRL(config=config, feature_extractor=feature_extractor),
        'Max-Entropy': MaxEntropyIRL(config=config, feature_extractor=feature_extractor),
        'Deep IRL': DeepIRL(state_dim=state_dim, hidden_dims=(32, 32), config=config),
    }
    
    results = {}
    
    for name, irl in methods.items():
        print(f"Training {name}...")
        weights = irl.fit(demonstrations)
        
        # Rewards assigned to expert states
        expert_rewards = []
        for demo in demonstrations[:5]:
            for state in demo.states[:10]:
                r = irl.compute_reward(state)
                expert_rewards.append(r)
        
        # Rewards assigned to random states
        random_rewards = []
        for _ in range(50):
            state = np.random.randn(state_dim) * 2
            r = irl.compute_reward(state)
            random_rewards.append(r)
        
        results[name] = {
            'expert_mean': np.mean(expert_rewards),
            'expert_std': np.std(expert_rewards),
            'random_mean': np.mean(random_rewards),
            'random_std': np.std(random_rewards),
            'gap': np.mean(expert_rewards) - np.mean(random_rewards),
        }
    
    # Visualize the results
    fig, ax = plt.subplots(figsize=(10, 6))
    
    x = np.arange(len(methods))
    width = 0.35
    
    expert_means = [results[m]['expert_mean'] for m in methods]
    expert_stds = [results[m]['expert_std'] for m in methods]
    random_means = [results[m]['random_mean'] for m in methods]
    random_stds = [results[m]['random_std'] for m in methods]
    
    bars1 = ax.bar(x - width/2, expert_means, width, yerr=expert_stds, 
                   label='Expert States', capsize=5)
    bars2 = ax.bar(x + width/2, random_means, width, yerr=random_stds,
                   label='Random States', capsize=5)
    
    ax.set_ylabel('Learned Reward')
    ax.set_title('IRL Methods: Expert vs Random State Rewards')
    ax.set_xticks(x)
    ax.set_xticklabels(list(methods.keys()))
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed results
    print("\n" + "=" * 60)
    print("IRL method comparison")
    print("=" * 60)
    print(f"{'Method':<15} {'Expert reward':<15} {'Random reward':<15} {'Gap':<10}")
    print("-" * 60)
    for name, res in results.items():
        print(f"{name:<15} {res['expert_mean']:>+.4f}±{res['expert_std']:.4f}  "
              f"{res['random_mean']:>+.4f}±{res['random_std']:.4f}  {res['gap']:>+.4f}")
    
    return results

results = compare_irl_methods()

### 6.2 Feature Matching Analysis

In [None]:
def analyze_feature_matching():
    """Analyze how well feature expectations are matched."""
    
    # Expert feature expectations
    expert_features = maxent_irl.compute_feature_expectations(demonstrations)
    
    # Simulated feature expectations for different policies
    np.random.seed(42)
    
    policy_features_list = []
    policy_names = ['Random', 'Semi-Optimal', 'Near-Optimal']
    
    # Random policy
    random_features = np.random.randn(feature_dim) * 0.5
    policy_features_list.append(random_features)
    
    # Semi-optimal policy
    semi_features = expert_features * 0.6 + np.random.randn(feature_dim) * 0.2
    policy_features_list.append(semi_features)
    
    # Near-optimal policy
    near_features = expert_features * 0.95 + np.random.randn(feature_dim) * 0.05
    policy_features_list.append(near_features)
    
    # Feature matching loss for each policy
    losses = []
    for pf in policy_features_list:
        loss = compute_feature_matching_loss(expert_features, pf, 'l2')
        losses.append(loss)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left panel: feature expectations compared
    ax = axes[0]
    x = np.arange(feature_dim)
    width = 0.2
    
    ax.bar(x - 1.5*width, expert_features, width, label='Expert', alpha=0.8)
    for i, (pf, name) in enumerate(zip(policy_features_list, policy_names)):
        ax.bar(x + (i-0.5)*width, pf, width, label=name, alpha=0.8)
    
    ax.set_xlabel('Feature Index')
    ax.set_ylabel('Feature Expectation')
    ax.set_title('Feature Expectations: Expert vs Policies')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    # Right panel: feature matching loss
    ax = axes[1]
    bars = ax.bar(policy_names, losses, color=['red', 'orange', 'green'])
    ax.set_ylabel('Feature Matching Loss')
    ax.set_title('Feature Matching Loss (Lower = Better)')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for bar, loss in zip(bars, losses):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                f'{loss:.2f}', ha='center')
    
    plt.tight_layout()
    plt.show()

analyze_feature_matching()

## 7. Summary and Best Practices

In [None]:
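summary = """
╔══════════════════════════════════════════════════════════════════════════════╗
║                Inverse Reinforcement Learning: Key Takeaways                 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  1. Nature of the problem                                                    ║
║     • Infer intent/reward from behavior                                      ║
║     • Sidesteps hand-designing the reward function                           ║
║     • Inherent ambiguity requires extra constraints                          ║
║                                                                              ║
║  2. Choosing a method                                                        ║
║     • Max-Margin: assumes an optimal expert; strong geometric intuition      ║
║     • Max-Entropy: probabilistic; handles suboptimal experts                 ║
║     • Deep IRL: complex rewards; needs more data                             ║
║     • GAIL: imitates directly, no explicit reward recovery                   ║
║                                                                              ║
║  3. Practical advice                                                         ║
║     • Feature design matters (domain knowledge)                              ║
║     • Demonstration quality > demonstration quantity                         ║
║     • Check that the learned reward is sensible                              ║
║     • Consider interpretability requirements                                 ║
║                                                                              ║
║  4. Common pitfalls                                                          ║
║     • Degenerate rewards (R=0 everywhere)                                    ║
║     • Overfitting to noise in the demonstrations                             ║
║     • Distribution shift (covariate shift)                                   ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
"""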

print(summary)

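## Exercises

1. **Theory**: Show that when the discount factor γ → 0 (a myopic expert) and the reward depends on state-action pairs, the MaxEnt IRL solution reduces to behavioral cloning with a softmax policy.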

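2. **Implementation**: Implement a simple Bayesian IRL that samples the reward posterior with MCMC.

3. **Experiment**: Compare the performance of GAIL and behavioral cloning for different numbers of demonstrations.

4. **Analysis**: Design an experiment to study how robust IRL is to noisy demonstrations (suboptimal experts).

5. **Application**: Use IRL to learn a reward function from driving data, and analyze the learned reward components.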

## References

1. Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. ICML.

2. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. ICML.

3. Ziebart, B. D., et al. (2008). Maximum entropy inverse reinforcement learning. AAAI.

4. Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. NeurIPS.

5. Fu, J., Luo, K., & Levine, S. (2018). Learning robust rewards with adversarial inverse reinforcement learning. ICLR.