# Reward Shaping: An In-Depth Tutorial

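## Contents
1. [Background and Motivation](#1-background-and-motivation)
2. [Theory: Potential Functions and Policy Invariance](#2-theory-potential-functions-and-policy-invariance)
3. [Core Implementation](#3-core-implementation)
4. [Experiment: GridWorld Navigation](#4-experiment-gridworld-navigation)
5. [Adaptive Reward Shaping](#5-adaptive-reward-shaping)
6. [Comparison with Other Methods](#6-comparison-with-other-methods)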

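## 1. Background and Motivation

### 1.1 The Sparse Reward Problem

In many reinforcement learning tasks, the agent receives a reward only when it reaches the final goal. For example:
- **Robot navigation**: a reward of +1 only on reaching the target position
- **Beating a game**: a reward only for defeating the boss
- **Robotic manipulation**: a reward only for a successful grasp

This creates a **credit assignment problem**: the agent has difficulty judging which actions led to the eventual success or failure.

### 1.2 The Pitfall of Naive Reward Shaping

A natural idea is to hand-design additional reward signals. This can cause **policy drift**: the agent learns to maximize the shaping reward rather than the real objective.

**Classic example**: In a boat-racing game, the agent was given extra reward for collecting bonus items. It learned to circle in place collecting them instead of finishing the race.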
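
The same failure mode can be reproduced in a three-state toy problem. The sketch below is a hypothetical example: the states, the transition table `NEXT`, and the 0.5 bonus are assumptions made for illustration and are not taken from the boat-racing game. It solves the MDP by value iteration, once with the sparse task reward and once with a naive bonus for stepping onto a "coin" tile.

```python
import numpy as np

GAMMA = 0.9

# Hypothetical toy MDP: state 0 = start, 1 = coin tile, 2 = goal (absorbing).
# NEXT[s, a] is the deterministic successor; REWARD[s, a] is the sparse task reward.
NEXT = np.array([[1, 2],    # start: action 0 -> coin tile, action 1 -> goal
                 [0, 2],    # coin tile: action 0 -> back to start, action 1 -> goal
                 [2, 2]])   # goal: absorbing
REWARD = np.array([[0.0, 1.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])

def solve(reward, n_iters=500):
    """Value iteration; returns (greedy policy, Q-table)."""
    q = np.zeros_like(reward)
    for _ in range(n_iters):
        q = reward + GAMMA * q.max(axis=1)[NEXT]
    return q.argmax(axis=1), q

# Naive shaping: +0.5 every time the agent steps onto the coin tile.
naive_bonus = np.where(NEXT == 1, 0.5, 0.0)

policy_task, _ = solve(REWARD)
policy_naive, q_naive = solve(REWARD + naive_bonus)

print("optimal actions under task reward:", policy_task[:2])
print("optimal actions under naive bonus:", policy_naive[:2])
```

Under the task reward the optimal action in both non-terminal states is to go to the goal. With the naive bonus, shuttling between the start and the coin tile forever is worth more than the one-off goal reward, so the optimal policy never finishes the task.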

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict, Optional
import sys
sys.path.append('..')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (10, 6)

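## 2. Theory: Potential Functions and Policy Invariance

### 2.1 The Ng-Harada-Russell Theorem (1999)

**Core question**: How can a shaping function be designed so that the optimal policy stays the same?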

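**Theorem (Policy Invariance)**: For an MDP $M = (S, A, P, R, \gamma)$, let $M' = (S, A, P, R', \gamma)$ be the shaped MDP, where

$$R'(s, a, s') = R(s, a, s') + F(s, a, s')$$

If there exists a potential function $\Phi: S \rightarrow \mathbb{R}$ such that

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

then $M$ and $M'$ have the same set of optimal policies (sufficiency). Conversely, if $F$ is not of this form, there exist transition and reward functions for which no optimal policy of $M'$ is optimal in $M$ (necessity). Potential-based shaping is therefore the only choice of $F(s, a, s')$ that is safe without knowledge of $P$ and $R$.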
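
The sufficiency direction can be checked numerically. The sketch below is an illustration under assumptions, not part of the original notebook: it draws a small random MDP (the sizes, the seed, and the helper `optimal_q` are hypothetical), solves it by value iteration with and without a potential-based term, and compares the results. The shaped optimal Q-function should equal the original one shifted by $-\Phi(s)$, which leaves the greedy action in every state unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random MDP: P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

# Arbitrary potential and the induced term F(s, a, s') = gamma * Phi(s') - Phi(s).
phi = rng.normal(size=n_states)
F = gamma * phi[None, None, :] - phi[:, None, None]

def optimal_q(reward, n_iters=2000):
    """Value iteration on Q for the transition model P defined above."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v = q.max(axis=1)
        q = np.einsum("ijk,ijk->ij", P, reward + gamma * v[None, None, :])
    return q

q_original = optimal_q(R)
q_shaped = optimal_q(R + F)

print("same greedy policy:", np.array_equal(q_original.argmax(axis=1), q_shaped.argmax(axis=1)))
print("Q' == Q - Phi(s):  ", np.allclose(q_shaped, q_original - phi[:, None]))
```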

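### 2.2 Intuition

The potential function $\Phi(s)$ measures how "promising" a state is; it acts as a rough estimate of the state's value. In essence, the shaping reward means:

> **Entering a high-potential state is rewarded; leaving a high-potential state is penalized**

Key insight: along any trajectory the shaping terms telescope, so their discounted sum is $\gamma^T \Phi(s_T) - \Phi(s_0)$. It depends only on the start state, the end state, and the trajectory length, not on the path taken in between.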
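
A quick numerical check of the telescoping identity, as a minimal sketch: the ten-state potential and the two paths are arbitrary values chosen for illustration. Both paths share the same start state, end state, and length, so their discounted shaping sums should coincide with the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95
phi = rng.normal(size=10)  # arbitrary potential over 10 discrete states

def discounted_shaping_sum(trajectory):
    """Sum of gamma^t * (gamma * Phi(s_{t+1}) - Phi(s_t)) along a state sequence."""
    total = 0.0
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        total += gamma**t * (gamma * phi[s_next] - phi[s])
    return total

# Two different paths with the same start (0), end (9), and length (4 steps).
path_a = [0, 3, 7, 2, 9]
path_b = [0, 5, 5, 1, 9]
T = len(path_a) - 1
closed_form = gamma**T * phi[9] - phi[0]

print(discounted_shaping_sum(path_a), discounted_shaping_sum(path_b), closed_form)
```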

In [None]:
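def demonstrate_pbrs_theory():
    """Visualize how the potential function relates to the shaping reward."""
    
    # One-dimensional state space example
    states = np.linspace(0, 10, 100)
    goal = 10.0
    
    # Potential: negative distance (higher potential closer to the goal)
    potential = -np.abs(states - goal)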
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Panel 1: potential function
    axes[0].plot(states, potential, 'b-', linewidth=2)
    axes[0].axvline(x=goal, color='r', linestyle='--', label='Goal')
    axes[0].set_xlabel('State (Position)')
    axes[0].set_ylabel(r'$\Phi(s)$')  # raw string avoids the invalid "\P" escape
    axes[0].set_title('Potential Function')
    axes[0].legend()
    
    # Panel 2: shaping reward when moving toward the goal
    gamma = 0.99
    # Assume one grid step to the right from each state
    shaping_bonus = gamma * potential[1:] - potential[:-1]
    axes[1].plot(states[:-1], shaping_bonus, 'g-', linewidth=2)
    axes[1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
    axes[1].set_xlabel('Current State')
    axes[1].set_ylabel('$F(s, s\')$')
    axes[1].set_title('Shaping Bonus (Moving Right)')
    
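    # Panel 3: cumulative discounted shaping reward along a trajectory
    trajectory_lengths = range(1, 50)
    cumulative_shaping = []
    
    for T in trajectory_lengths:
        # Start at grid index 0 and move one grid step right per time step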
        total = sum(gamma**t * (gamma * potential[min(t+1, 99)] - potential[min(t, 99)]) 
                   for t in range(T))
        cumulative_shaping.append(total)
    
    axes[2].plot(trajectory_lengths, cumulative_shaping, 'purple', linewidth=2)
    axes[2].set_xlabel('Trajectory Length')
    axes[2].set_ylabel('Cumulative Shaping')
    axes[2].set_title('Total Shaping Reward (Bounded!)')
    
    plt.tight_layout()
    plt.show()
    
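    # Theoretical analysis
    print("=" * 60)
    print("Theory: why does PBRS guarantee policy invariance?")
    print("=" * 60)
    print("""
For any trajectory τ = (s₀, a₀, s₁, a₁, ..., s_T), the discounted sum of shaping rewards is:

  Σₜ γᵗ F(sₜ, aₜ, sₜ₊₁) = Σₜ γᵗ [γΦ(sₜ₊₁) - Φ(sₜ)]
                         = [γΦ(s₁) - Φ(s₀)] + [γ²Φ(s₂) - γΦ(s₁)] + ...
                         = γᵀΦ(s_T) - Φ(s₀)   (telescoping sum)

The sum depends only on the start and end states, not on the actions taken.
With bounded Φ, γᵀΦ(s_T) → 0 as T → ∞, so every policy's return from s₀ shifts
by the same constant -Φ(s₀), and the ranking of policies is unchanged.
    """)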

demonstrate_pbrs_theory()

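## 3. Core Implementation

### 3.1 Distance-Based Potential Function

For goal-reaching tasks, the most natural potential function is the negative distance to the goal:

$$\Phi(s) = -\|s_{pos} - g\|_p$$

where $g$ is the goal position and $\|\cdot\|_p$ is the $L_p$ norm.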
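
The formula translates almost directly into code. The following is a minimal standalone sketch, assuming only NumPy; `distance_potential` and `shaping_bonus` are hypothetical helper names and are not the API of the `reward_shaping` module used in this notebook.

```python
import numpy as np

def distance_potential(state, goal, norm_order=2, scale=1.0):
    """Phi(s) = -scale * ||s - g||_p (hypothetical helper)."""
    diff = np.asarray(state, dtype=float) - np.asarray(goal, dtype=float)
    return -scale * np.linalg.norm(diff, ord=norm_order)

def shaping_bonus(state, next_state, goal, gamma=0.99, **potential_kwargs):
    """F(s, s') = gamma * Phi(s') - Phi(s)."""
    phi_next = distance_potential(next_state, goal, **potential_kwargs)
    phi_now = distance_potential(state, goal, **potential_kwargs)
    return gamma * phi_next - phi_now

goal = np.array([10.0, 10.0])
toward = shaping_bonus([0.0, 0.0], [1.0, 1.0], goal)  # step that reduces the distance
away = shaping_bonus([1.0, 1.0], [0.0, 0.0], goal)    # step that increases the distance
print(f"step toward goal: {toward:+.3f}, step away: {away:+.3f}")
```

A step toward the goal earns a positive bonus and a step away a negative one. The potential is zero at the goal itself, so no extra handling is needed there in this sketch.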

In [None]:
from reward_shaping import (
    ShapedRewardConfig,
    DistanceBasedShaper,
    SubgoalBasedShaper,
    AdaptiveRewardShaper,
    DynamicShapingConfig,
)

# Create a distance-based reward shaper
goal = np.array([10.0, 10.0])
config = ShapedRewardConfig(
    discount_factor=0.99,
    shaping_weight=1.0,
)

shaper = DistanceBasedShaper(
    goal_position=goal,
    norm_order=2,  # Euclidean distance
    scale=1.0,
    config=config,
)

# Demonstrate the potential function
test_states = [
    np.array([0.0, 0.0]),    # far from the goal
    np.array([5.0, 5.0]),    # midway
    np.array([9.0, 9.0]),    # close to the goal
    np.array([10.0, 10.0]),  # at the goal
]

print("Example potential values:")
print("-" * 50)
for state in test_states:
    potential = shaper.potential(state)
    distance = np.linalg.norm(state - goal)
    print(f"state {state} -> distance: {distance:.2f}, potential: {potential:.2f}")

### 3.2 Visualizing the Potential Field

In [None]:
def visualize_potential_field(shaper, goal, grid_size=50):
    """Visualize the 2D potential field."""
    
    x = np.linspace(-2, 12, grid_size)
    y = np.linspace(-2, 12, grid_size)
    X, Y = np.meshgrid(x, y)
    
    # Evaluate the potential at every grid point
    Z = np.zeros_like(X)
    for i in range(grid_size):
        for j in range(grid_size):
            state = np.array([X[i, j], Y[i, j]])
            Z[i, j] = shaper.potential(state)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Filled contour plot
    contour = axes[0].contourf(X, Y, Z, levels=20, cmap='viridis')
    axes[0].plot(goal[0], goal[1], 'r*', markersize=20, label='Goal')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].set_title(r'Potential Function $\Phi(s)$')
    axes[0].legend()
    plt.colorbar(contour, ax=axes[0])
    
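    # Gradient field (points in the direction of increasing potential)
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x)
    grad_y, grad_x = np.gradient(Z, y[1]-y[0], x[1]-x[0])
    
    # Subsample for a readable quiver plot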
    skip = 3
    axes[1].quiver(
        X[::skip, ::skip], Y[::skip, ::skip],
        grad_x[::skip, ::skip], grad_y[::skip, ::skip],
        scale=50, alpha=0.7
    )
    axes[1].plot(goal[0], goal[1], 'r*', markersize=20, label='Goal')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].set_title(r'Gradient Field $\nabla\Phi(s)$ (Direction of Positive Shaping)')
    axes[1].legend()
    axes[1].set_xlim(-2, 12)
    axes[1].set_ylim(-2, 12)
    
    plt.tight_layout()
    plt.show()

visualize_potential_field(shaper, goal)

## 4. Experiment: GridWorld Navigation

### 4.1 Environment Definition

In [None]:
class GridWorldEnv:
    """A simple GridWorld environment for demonstrating reward shaping.
    
    The agent navigates from a start cell to a goal cell; obstacles are optional.
    """
    
    ACTIONS = {
        0: np.array([-1, 0]),   # up
        1: np.array([1, 0]),    # down
        2: np.array([0, -1]),   # left
        3: np.array([0, 1]),    # right
    }
    
    def __init__(
        self,
        size: int = 10,
        start: Tuple[int, int] = (0, 0),
        goal: Tuple[int, int] = (9, 9),
        obstacles: Optional[List[Tuple[int, int]]] = None,
    ):
        self.size = size
        self.start = np.array(start)
        self.goal = np.array(goal)
        self.obstacles = set(obstacles) if obstacles else set()
        
        self.state = self.start.copy()
        
    def reset(self) -> np.ndarray:
        self.state = self.start.copy()
        return self.state.copy()
    
    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
        """Take an action and return (next_state, reward, done, info)."""
        
        # Compute the new position
        new_state = self.state + self.ACTIONS[action]
        
        # Keep the agent inside the grid
        new_state = np.clip(new_state, 0, self.size - 1)
        
        # A move into an obstacle leaves the agent where it is
        if tuple(new_state) not in self.obstacles:
            self.state = new_state
        
        # Check whether the goal has been reached
        done = np.array_equal(self.state, self.goal)
        
        # Sparse reward: nonzero only on reaching the goal
        reward = 1.0 if done else 0.0
        
        return self.state.copy(), reward, done, {}
    
    def render(self, ax=None, trajectory=None):
        """Draw the environment."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(8, 8))
        
        # Draw the grid
        ax.set_xlim(-0.5, self.size - 0.5)
        ax.set_ylim(-0.5, self.size - 0.5)
        ax.set_xticks(range(self.size))
        ax.set_yticks(range(self.size))
        ax.grid(True)
        ax.set_aspect('equal')
        
        # Draw obstacles
        for obs in self.obstacles:
            ax.add_patch(plt.Rectangle(
                (obs[1] - 0.5, obs[0] - 0.5), 1, 1,
                facecolor='gray', edgecolor='black'
            ))
        
        # Draw the start and goal cells
        ax.plot(self.start[1], self.start[0], 'go', markersize=15, label='Start')
        ax.plot(self.goal[1], self.goal[0], 'r*', markersize=20, label='Goal')
        
        # Draw the trajectory
        if trajectory is not None:
            traj = np.array(trajectory)
            ax.plot(traj[:, 1], traj[:, 0], 'b-', linewidth=2, alpha=0.7)
            ax.plot(traj[-1, 1], traj[-1, 0], 'bs', markersize=10, label='Agent')
        else:
            ax.plot(self.state[1], self.state[0], 'bs', markersize=10, label='Agent')
        
        ax.legend(loc='upper left')
        ax.invert_yaxis()  # put (0, 0) in the top-left corner
        return ax

# Create an environment with obstacles
obstacles = [(3, i) for i in range(7)] + [(6, i) for i in range(3, 10)]
env = GridWorldEnv(size=10, obstacles=obstacles)

fig, ax = plt.subplots(figsize=(8, 8))
env.render(ax)
ax.set_title('GridWorld Environment with Obstacles')
plt.show()

### 4.2 Q-Learning with and without Reward Shaping

In [None]:
class QLearningAgent:
    """Q-Learning agent with optional reward shaping."""
    
    def __init__(
        self,
        state_size: int,
        n_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1,
        reward_shaper: Optional[DistanceBasedShaper] = None,
    ):
        self.state_size = state_size
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.reward_shaper = reward_shaper
        
        # Q-table indexed by (row, col, action)
        self.q_table = np.zeros((state_size, state_size, n_actions))
        
    def get_action(self, state: np.ndarray) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state[0], state[1]])
    
    def update(
        self,
        state: np.ndarray,
        action: int,
        reward: float,
        next_state: np.ndarray,
        done: bool,
    ) -> float:
        """Q-Learning update with optional reward shaping."""
        
        # Apply reward shaping
        shaped_reward = reward
        if self.reward_shaper is not None:
            shaping_bonus = self.reward_shaper.compute_shaping_bonus(
                state.astype(float),
                next_state.astype(float),
                done,
            )
            shaped_reward = reward + shaping_bonus
        
        # Q-Learning update
        current_q = self.q_table[state[0], state[1], action]
        
        if done:
            target = shaped_reward
        else:
            next_q_max = np.max(self.q_table[next_state[0], next_state[1]])
            target = shaped_reward + self.gamma * next_q_max
        
        self.q_table[state[0], state[1], action] += self.lr * (target - current_q)
        
        return shaped_reward


def train_agent(
    env: GridWorldEnv,
    agent: QLearningAgent,
    n_episodes: int = 500,
    max_steps: int = 200,
) -> Dict[str, List]:
    """Train the agent and record statistics."""
    
    history = {
        'episode_rewards': [],
        'episode_lengths': [],
        'success_rate': [],
    }
    
    successes = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        
        for step in range(max_steps):
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            
            shaped_reward = agent.update(state, action, reward, next_state, done)
            total_reward += reward  # track the original (unshaped) reward
            
            state = next_state
            
            if done:
                break
        
        history['episode_rewards'].append(total_reward)
        history['episode_lengths'].append(step + 1)
        successes.append(1 if done else 0)
        
        # Success rate over a sliding window
        window = 50
        if len(successes) >= window:
            history['success_rate'].append(np.mean(successes[-window:]))
        else:
            history['success_rate'].append(np.mean(successes))
    
    return history

In [None]:
# Train two agents: with and without reward shaping
env = GridWorldEnv(size=10, obstacles=obstacles)

# Without reward shaping
agent_no_shaping = QLearningAgent(
    state_size=10,
    n_actions=4,
    learning_rate=0.2,
    discount_factor=0.99,
    epsilon=0.2,
    reward_shaper=None,
)

# With distance-based reward shaping
shaper = DistanceBasedShaper(
    goal_position=np.array([9.0, 9.0]),
    scale=0.1,  # scale down the shaping reward
    config=ShapedRewardConfig(discount_factor=0.99, shaping_weight=1.0),
)

agent_with_shaping = QLearningAgent(
    state_size=10,
    n_actions=4,
    learning_rate=0.2,
    discount_factor=0.99,
    epsilon=0.2,
    reward_shaper=shaper,
)

print("Training the agent without reward shaping...")
history_no_shaping = train_agent(env, agent_no_shaping, n_episodes=500)

print("Training the agent with reward shaping...")
history_with_shaping = train_agent(env, agent_with_shaping, n_episodes=500)

print("Training complete!")

In [None]:
def plot_training_comparison(history_no_shaping, history_with_shaping):
    """Compare the training curves of the two agents."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Success rate
    axes[0].plot(history_no_shaping['success_rate'], label='No Shaping', alpha=0.8)
    axes[0].plot(history_with_shaping['success_rate'], label='With PBRS', alpha=0.8)
    axes[0].set_xlabel('Episode')
    axes[0].set_ylabel('Success Rate')
    axes[0].set_title('Learning Curve: Success Rate')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Episode length
    window = 20
    lengths_no = np.convolve(history_no_shaping['episode_lengths'], 
                             np.ones(window)/window, mode='valid')
    lengths_with = np.convolve(history_with_shaping['episode_lengths'],
                               np.ones(window)/window, mode='valid')
    
    axes[1].plot(lengths_no, label='No Shaping', alpha=0.8)
    axes[1].plot(lengths_with, label='With PBRS', alpha=0.8)
    axes[1].set_xlabel('Episode')
    axes[1].set_ylabel('Episode Length')
    axes[1].set_title('Learning Curve: Episode Length (Smoothed)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # Cumulative reward
    cumsum_no = np.cumsum(history_no_shaping['episode_rewards'])
    cumsum_with = np.cumsum(history_with_shaping['episode_rewards'])
    
    axes[2].plot(cumsum_no, label='No Shaping', alpha=0.8)
    axes[2].plot(cumsum_with, label='With PBRS', alpha=0.8)
    axes[2].set_xlabel('Episode')
    axes[2].set_ylabel('Cumulative Reward')
    axes[2].set_title('Cumulative Original Reward')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\n" + "=" * 60)
    print("Training results")
    print("=" * 60)
    print("\nWithout reward shaping:")
    print(f"  Final success rate: {history_no_shaping['success_rate'][-1]:.2%}")
    print("  First episode reaching an 80% success rate: ", end="")
    idx = next((i for i, r in enumerate(history_no_shaping['success_rate']) if r >= 0.8), None)
    # Compare against None explicitly: index 0 is a valid result but is falsy
    print(f"{idx}" if idx is not None else "not reached")
    
    print("\nWith reward shaping (PBRS):")
    print(f"  Final success rate: {history_with_shaping['success_rate'][-1]:.2%}")
    print("  First episode reaching an 80% success rate: ", end="")
    idx = next((i for i, r in enumerate(history_with_shaping['success_rate']) if r >= 0.8), None)
    print(f"{idx}" if idx is not None else "not reached")

plot_training_comparison(history_no_shaping, history_with_shaping)

### 4.3 Visualizing the Learned Policy

In [None]:
def visualize_policy(agent: QLearningAgent, env: GridWorldEnv, title: str):
    """Visualize the greedy policy implied by the Q-table."""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Policy arrow plot
    ax = axes[0]
    env.render(ax)
    
    # Map actions to arrow offsets (render() inverts the y-axis)
    action_to_arrow = {
        0: (0, -0.3),   # up (row index decreases)
        1: (0, 0.3),    # down (row index increases)
        2: (-0.3, 0),   # left
        3: (0.3, 0),    # right
    }
    
    for i in range(env.size):
        for j in range(env.size):
            if (i, j) in env.obstacles:
                continue
            if np.array_equal([i, j], env.goal):
                continue
            
            best_action = np.argmax(agent.q_table[i, j])
            dx, dy = action_to_arrow[best_action]
            ax.arrow(j, i, dx, dy, head_width=0.15, head_length=0.1,
                    fc='blue', ec='blue', alpha=0.6)
    
    ax.set_title(f'{title}: Learned Policy')
    
    # Value-function heatmap
    ax = axes[1]
    value_map = np.max(agent.q_table, axis=2)
    
    # Mask out obstacle cells
    masked_value = np.ma.array(value_map)
    for obs in env.obstacles:
        masked_value[obs] = np.ma.masked
    
    im = ax.imshow(masked_value, cmap='hot', origin='upper')
    ax.plot(env.goal[1], env.goal[0], 'g*', markersize=20)
    ax.set_title(f'{title}: Value Function V(s) = max_a Q(s,a)')
    plt.colorbar(im, ax=ax)
    
    plt.tight_layout()
    plt.show()

visualize_policy(agent_no_shaping, env, "Without Reward Shaping")
visualize_policy(agent_with_shaping, env, "With PBRS")

### 4.4 Trajectory Demonstration

In [None]:
def run_episode(env: GridWorldEnv, agent: QLearningAgent, max_steps: int = 100):
    """Run one episode and return the trajectory."""
    state = env.reset()
    trajectory = [state.copy()]
    
    # Act greedily during evaluation
    original_epsilon = agent.epsilon
    agent.epsilon = 0
    
    for _ in range(max_steps):
        action = agent.get_action(state)
        next_state, _, done, _ = env.step(action)
        trajectory.append(next_state.copy())
        state = next_state
        
        if done:
            break
    
    agent.epsilon = original_epsilon
    return trajectory

# Compare the trajectories of the two agents
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

env_test = GridWorldEnv(size=10, obstacles=obstacles)

traj_no_shaping = run_episode(env_test, agent_no_shaping)
env_test.render(axes[0], trajectory=traj_no_shaping)
axes[0].set_title(f'Without Shaping (Steps: {len(traj_no_shaping)-1})')

traj_with_shaping = run_episode(env_test, agent_with_shaping)
env_test.render(axes[1], trajectory=traj_with_shaping)
axes[1].set_title(f'With PBRS (Steps: {len(traj_with_shaping)-1})')

plt.tight_layout()
plt.show()

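## 5. Adaptive Reward Shaping

### 5.1 Motivation

A fixed shaping weight can cause problems:
- Weight too large: the shaping term dominates the learning signal and the agent may rely on it too heavily
- Weight too small: shaping has little noticeable effect

**Solution**: gradually reduce the shaping weight as training progresses, so that the agent ends up relying almost entirely on the original reward. A time-varying weight goes beyond the static theorem in Section 2; Devlin & Kudenko (2012) show that the invariance guarantee still holds when the shaping term takes the dynamic form $F = \gamma \Phi(s', t') - \Phi(s, t)$.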
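
An exponential schedule with a floor can be sketched in a few lines. This is an illustrative assumption about what such a schedule typically looks like, using the same parameter values as the configuration in the next cell; `exponential_weight` is a hypothetical helper, and the actual update rule inside `AdaptiveRewardShaper` may differ.

```python
import math

def exponential_weight(step, initial_weight=1.0, decay_rate=0.999, min_weight=0.01):
    """Shaping weight after `step` updates, clipped from below at min_weight."""
    return max(min_weight, initial_weight * decay_rate**step)

# First step at which the floor becomes active:
# decay_rate**t <= min_weight / initial_weight.
floor_step = math.ceil(math.log(0.01 / 1.0) / math.log(0.999))

for step in (0, 1000, floor_step, 5000):
    print(step, round(exponential_weight(step), 4))
```

With these values the weight reaches its floor after roughly 4,600 steps, so over a 5,000-step run the final stretch uses the constant minimum weight.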

In [None]:
# Demonstrate adaptive weight decay
base_shaper = DistanceBasedShaper(
    goal_position=np.array([9.0, 9.0]),
    scale=0.1,
)

adaptive_shaper = AdaptiveRewardShaper(
    base_shaper=base_shaper,
    dynamic_config=DynamicShapingConfig(
        initial_weight=1.0,
        decay_rate=0.999,
        min_weight=0.01,
        adaptation_method='exponential',
    ),
)

# Simulate how the weight changes during training
steps = 5000
weights = []
bonuses = []

for _ in range(steps):
    state = np.random.rand(2) * 10
    next_state = state + np.random.randn(2) * 0.5
    
    bonus = adaptive_shaper.compute_shaping_bonus(state, next_state)
    bonuses.append(bonus)
    weights.append(adaptive_shaper.current_weight)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(weights)
axes[0].set_xlabel('Training Step')
axes[0].set_ylabel('Shaping Weight λ')
axes[0].set_title('Adaptive Weight Decay')
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Moving average
window = 100
smoothed_bonuses = np.convolve(np.abs(bonuses), np.ones(window)/window, mode='valid')
axes[1].plot(smoothed_bonuses)
axes[1].set_xlabel('Training Step')
axes[1].set_ylabel('|Shaping Bonus|')
axes[1].set_title('Absolute Shaping Bonus Over Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Initial weight: {weights[0]:.4f}")
print(f"Final weight: {weights[-1]:.4f}")
print(f"Final/initial ratio: {weights[-1]/weights[0]:.4f}")

## 6. Comparison with Other Methods

### 6.1 Comparison Table

In [None]:
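comparison_data = {
    'Method': ['PBRS', 'Naive Shaping', 'Curiosity (ICM)', 'HER', 'Curriculum'],
    'Policy invariance': ['✓ Guaranteed', '✗ Not guaranteed', '✗ Not guaranteed', 'Reward unchanged (relabels goals)', 'Partial'],
    'Domain knowledge needed': ['Medium', 'High', 'Low', 'Low', 'Medium'],
    'Applicable setting': ['Any', 'Any', 'High-dimensional/continuous', 'Goal-conditioned', 'Any'],
    'Compute cost': ['Low', 'Low', 'High', 'Medium', 'Medium'],
    'Typical use': ['Navigation/control', 'Simple tasks', 'Vision-based RL', 'Robotic manipulation', 'Complex sequential tasks'],
}

import pandas as pd
df = pd.DataFrame(comparison_data)
df.set_index('Method', inplace=True)

print("Comparison of reward design methods")
print("=" * 80)
print(df.to_string())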

### 6.2 Summary of Key Insights

In [None]:
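insights = """
╔══════════════════════════════════════════════════════════════════════════════╗
║                        Key Insights on Reward Shaping                        ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  1. Theoretical guarantee                                                    ║
║     • PBRS preserves the set of optimal policies in every MDP                ║
║     • Other state-based shaping terms can change the optimal policy          ║
║                                                                              ║
║  2. Potential function design                                                ║
║     • A good potential approximates the true value function                  ║
║     • Common choices: negative distance, subgoal progress, demonstrations    ║
║                                                                              ║
║  3. Practical advice                                                         ║
║     • Start with a simple distance-based potential                           ║
║     • Use adaptive weights to avoid over-reliance on shaping                 ║
║     • Evaluate with the original reward, not the shaped reward               ║
║                                                                              ║
║  4. Limitations                                                              ║
║     • Designing the potential requires domain knowledge                      ║
║     • A misleading potential (e.g. in mazes) can slow learning               ║
║     • Shaping does not replace a good exploration strategy                   ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
"""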

print(insights)
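
The second insight, that a good potential approximates the true value function, can be made concrete. The sketch below is an illustration on an assumed random MDP (sizes and seed are arbitrary): it sets $\Phi = V^*$ and checks that the expected one-step shaped reward equals the advantage $Q^*(s, a) - V^*(s)$. In that ideal case the best action in each state has expected shaped reward 0 and every other action is negative, so even a myopic agent acts optimally.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

# Solve for Q* and V* by value iteration.
q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    v = q_star.max(axis=1)
    q_star = np.einsum("ijk,ijk->ij", P, R + gamma * v[None, None, :])
v_star = q_star.max(axis=1)

# Use the true value function as the potential.
F = gamma * v_star[None, None, :] - v_star[:, None, None]
expected_shaped_reward = np.einsum("ijk,ijk->ij", P, R + F)

print("one-step shaped reward equals the advantage:",
      np.allclose(expected_shaped_reward, q_star - v_star[:, None]))
print("myopic greedy action is optimal:",
      np.array_equal(expected_shaped_reward.argmax(axis=1), q_star.argmax(axis=1)))
```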

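## Exercises

1. **Theory**: Prove that for an infinite-horizon discounted MDP, PBRS shifts $Q^*(s, a)$ by $-\Phi(s)$ and therefore does not change the ranking of actions within any state.

2. **Implementation**: Implement a subgoal-based potential function for a multi-stage navigation task.

3. **Experiment**: Compare how different decay schedules (exponential, linear, step) affect learning efficiency.

4. **Analysis**: Design an experiment showing how a shaping term that is not potential-based can change the optimal policy.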

## References

1. Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. ICML.

2. Wiewiora, E. (2003). Potential-based shaping and Q-value initialization are equivalent. JAIR.

3. Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. AAMAS.

4. Brys, T., et al. (2015). Reinforcement learning from demonstration through shaping. IJCAI.