# Summary of Notation
---


(full notation in Appendix A)

> Whenever you see the word "value", think "action's Expected Value of reward" (AEVOR). This is usually a $P(\text{win})$ or (in the bandit case) the probability-weighted average of $R_t$, given that the action is taken. It can be either a "true" value (i.e., the population mean of the action's reward distribution) or an estimate (i.e., a sample statistic) of it.

| Symbol | Meaning |
|--------|---------|
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ is drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\alpha$, $\beta$, $\epsilon$ | Step-size parameters ($\alpha$, $\beta$) and exploration probability ($\epsilon$) |
| $s, s'$ | States |
| $a$ | Action |
| $r$ | Reward |
| $\mathcal{S}, \mathcal{A}(s), \mathcal{R}$ | Sets of states, actions available in state $s$, and rewards |
| $t, T$ | Discrete time step, final step |
| $S_t, A_t, R_t$ | State, action, and reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under _deterministic_ $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under _stochastic_ $\pi$ |
| $G_t$ | Return from time $t$ |
| $h$ | Horizon (the time step one looks up to in a forward view) |
| $v_\pi(s)$ | Value of state $s$ under policy $\pi$ |
| $q_\pi(s, a)$ | Value of state-action pair $(s,a)$ under $\pi$ |
| $p(s', r \mid s, a)$ | Transition dynamics |
| $\hat{v}(s, \mathbf{w})$ | Approximate value function using weight vector $\mathbf{w}$ |
| $\delta_t$ | Temporal-difference error at time $t$ |

# Part I: Tabular Solution Methods
---

<img src="../img/6.png" alt="tabularsolutionmethods" width="40%"/>

---
---

# Chapter 8: Planning and Learning with Tabular Methods

**Chapter Overview**: This chapter unifies model-based methods (like dynamic programming, which use planning with an environment model) and model-free methods (like Monte Carlo and TD learning, which rely directly on experience). It introduces integrated architectures like Dyna that combine learning and planning. Key themes include using models to simulate experience, the trade-offs between expected and sample updates, focusing backups efficiently, and decision-time planning techniques like MCTS. As the capstone of Part I (tabular methods), it recaps core dimensions of RL methods and bridges to Part II's focus on scaling via function approximation.

**Key Unification Insight**: All RL methods compute value functions via backups (looking ahead to future events). Planning uses simulated experience from a model; learning uses real experience. This allows seamless integration: apply learning algorithms to model-generated data for planning.

**Sidenote on Planning vs. Learning**: Planning is "deliberative" (model-based foresight), learning is "reactive" (experience-based adaptation). But they're not opposites—many methods blend them. Intuitively, planning is like mentally rehearsing scenarios; learning is trial-and-error in the real world.

---

## 8.1 Models and Planning

**Models**: Anything an agent uses to predict environment responses to actions.

- **Given**: State $s$, action $a$
- **Predicts**: Next state $s'$ and reward $r$

**Types of Models**:
- **Distribution model**: Gives all possible $s', r$ and their probabilities (e.g., DP's $p(s', r|s, a)$ from Equation 3.2).
- **Sample model**: Samples one possible $s', r$ according to probabilities (e.g., blackjack simulation in Chapter 5).

**Intuition**: Distribution models are "stronger" (can generate samples and probabilities) but harder to obtain. Sample models are easier (e.g., simulate dice rolls) but may introduce sampling variance.
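
To make the distinction concrete, here is a minimal sketch of the two model types. The state names, the `DISTRIBUTION_MODEL` table, and all three function names are hypothetical and chosen only for illustration; the point is that a sample model can be derived from a distribution model, but not the other way around.

```python
import random

# Hypothetical distribution model:
# (state, action) -> list of ((next_state, reward), probability).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(("s1", 0.0), 0.8), (("s0", -1.0), 0.2)],
}

def outcome_distribution(state, action):
    """Distribution model: every possible (s', r) with its probability."""
    return DISTRIBUTION_MODEL[(state, action)]

def sample_outcome(state, action, rng=random):
    """Sample model: draw a single (s', r) according to those probabilities."""
    outcomes, probs = zip(*DISTRIBUTION_MODEL[(state, action)])
    return rng.choices(outcomes, weights=probs, k=1)[0]

def expected_reward(state, action):
    """Exact expectation; this needs the full distribution, not just samples."""
    return sum(p * r for (_, r), p in outcome_distribution(state, action))
```

With only `sample_outcome` available, `expected_reward` could be estimated by averaging many draws, which is where the sampling variance mentioned above comes from.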

**Sidenote**: Models simulate experience. From a state and policy, generate episodes (trajectories). This "simulated experience" can train RL methods just like real experience.

**Planning**: Computation to improve policy/value function using a model.
- **State-space planning**: Focus of RL (vs. plan-space planning like evolutionary methods).
- **Core Structure**: Compute values via backups on simulated experience.

**Diagram Insight**: All state-space planning fits: (1) Compute values as intermediate step; (2) Use backups on simulated experience.

![Common Structure of Planning and Learning](../img/fig8_1.png)

**Unification with Learning**: Replace real experience in learning methods (e.g., TD) with model-simulated experience for planning. Both estimate values via backups; difference is experience source.

---

## 8.2 Dyna: Integrated Planning, Acting, and Learning

**Dyna Architecture**: Online agent integrating planning, acting, model-learning, and direct RL (all in parallel).

- **Direct RL**: Improve value/policy from real experience (e.g., Q-learning).
- **Model-learning**: Update model from real experience.
- **Planning**: Indirect RL—improve value/policy from simulated experience.

**Diagram**:

![Dyna Architecture](../img/fig8_1.png)  <!-- Note: This is Figure 8.1 from the text -->

**Relationships** (see diagram on page 163):
- Experience → Model (model-learning)
- Experience → Value/Policy (direct RL)
- Model → Simulated Experience → Value/Policy (planning)

**Search Control**: Process selecting starting states/actions for simulated experiences (e.g., random from experienced pairs).

**Tabular Dyna-Q Algorithm** (pseudocode on page 164):
- Initialize $Q(s,a)$ and Model$(s,a)$
- Loop forever:
  - (a) $S$ ← current state
  - (b) $A$ ← ε-greedy$(S, Q)$
  - (c) Execute $A$, observe $R, S'$
  - (d) $Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_{a'} Q(S',a') - Q(S,A)]$  (direct RL, one-step tabular Q-learning)
  - (e) Model$(S,A) \leftarrow R, S'$  (assuming deterministic env)
  - (f) For $n$ planning steps: Random experienced $S_{rand}, A_{rand}$; Simulate $R, S'$ from Model; Update $Q$ as in (d)
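
The steps (a) to (f) can be sketched in Python as follows. This is an illustrative sketch, not the book's code: the corridor environment (states `0..n_states-1`, reward 1 on reaching the right end), the function name `dyna_q`, and all default parameter values are assumptions made for the example.

```python
import random
from collections import defaultdict

def dyna_q(n_planning, episodes=20, n_states=6, alpha=0.5, gamma=0.95,
           epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical deterministic corridor.

    Returns the learned Q table and the number of real steps per episode.
    """
    rng = random.Random(seed)
    actions = (-1, +1)                      # move left / move right
    goal = n_states - 1
    Q = defaultdict(float)                  # Q[(s, a)], initialised to 0
    model = {}                              # model[(s, a)] = (r, s')

    def step(s, a):                         # the "real" environment
        s2 = min(max(s + a, 0), goal)
        return (1.0 if s2 == goal else 0.0), s2

    def greedy(s):                          # greedy action, ties broken randomly
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def update(s, a, r, s2):                # one-step tabular Q-learning
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    steps_per_episode = []
    for _ in range(episodes):
        s, steps = 0, 0                     # (a) current state
        while s != goal:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)  # (b)
            r, s2 = step(s, a)              # (c) act, observe R and S'
            update(s, a, r, s2)             # (d) direct RL
            model[(s, a)] = (r, s2)         # (e) model learning (deterministic)
            for _ in range(n_planning):     # (f) planning on simulated experience
                ps, pa = rng.choice(list(model))   # random previously seen pair
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode
```

Calling `dyna_q(n_planning=0)` and `dyna_q(n_planning=10)` and comparing the returned step counts is one way to see the effect of planning; note that the same `update` function serves both real and simulated transitions.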

**Key Parameters**:
- $n$: Planning steps per real step (balances computation).
- Assumes deterministic model; queries only experienced pairs.

**Example: Dyna Maze** (Figure 8.2):
- Task: From S to G quickly.
- Dyna-Q with larger $n$ learns faster (e.g., with $n=50$ an extensive policy is already in place during the second episode).

![Dyna Maze Learning Curves](../img/fig8_2.png)

![Policies During Learning](../img/fig8_3.png)

**Intuition**: Planning builds "mental models" to accelerate learning. Without planning ($n=0$), policy grows one step per episode; with planning, it expands rapidly via backups.

**Sidenote**: Dyna-Q unifies: Same Q-update for real/simulated experience. Agent is always acting, planning, learning—model improves as experience grows.

**Exercise Insight**: Multi-step methods (Ch. 7) might match Dyna if model accurate, but Dyna's one-step simplicity + many simulations often wins.

---

## 8.3 When the Model Is Wrong

**Issue**: Models can be inaccurate (initially empty or environment changes).

**Example: Blocking Maze** (Figure 8.4):
- Path blocks after 1000 steps; new path opens.
- Dyna-Q recovers but wanders initially.

**Worse Case: Shortcut Maze** (Figure 8.5):
- Shortcut opens; Dyna-Q misses it (model says impossible, ε-greedy rarely explores enough).

**Solution: Dyna-Q+**:
- Track time $\tau$ since last real try of $(s,a)$.
- Add exploration bonus: Simulated reward $r + \kappa\sqrt{\tau}$ (encourages testing stale transitions).
- Solves shortcut by exploring long-untried actions.
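
A minimal sketch of the bookkeeping behind the bonus is shown below. The class `LastTried`, the function `bonus_reward`, and the default `kappa` are hypothetical names and values for illustration; treating never-tried pairs as untried since time 0 is an assumption of this sketch.

```python
import math

def bonus_reward(r, tau, kappa=1e-3):
    """Reward used in a planning update: modelled reward plus kappa * sqrt(tau)."""
    return r + kappa * math.sqrt(tau)

class LastTried:
    """Tracks, per (state, action), the time step of the most recent real try."""

    def __init__(self):
        self.t = 0            # number of real steps taken so far
        self.last = {}        # (s, a) -> time step of last real try

    def record(self, s, a):
        """Call once per real step, with the pair that was actually tried."""
        self.t += 1
        self.last[(s, a)] = self.t

    def tau(self, s, a):
        """Real steps elapsed since (s, a) was last tried."""
        # Assumption: pairs never tried count as untried since time 0.
        return self.t - self.last.get((s, a), 0)
```

In a planning step, the simulated reward for a pair `(s, a)` would then be `bonus_reward(r, tracker.tau(s, a))`, so the bonus grows the longer a transition goes untested.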

![Blocking Maze Performance](../img/fig8_4.png)

![Shortcut Maze Performance](../img/fig8_5.png)

**Intuition**: Wrong models cause "model traps." Add curiosity (bonus for uncertainty) to detect changes.

**Sidenote**: Stochastic envs: Average model over experiences. Changing envs: Weight recent experiences more.

---

## 8.4 Prioritized Sweeping

**Issue**: Uniform random planning (Dyna-Q) wastes updates on low-impact states.

**Solution**: Focus backups backward from high-change states (like reverse BFS).

**Prioritized Sweeping Algorithm** (for deterministic env, pseudocode on page 170):
- Use priority queue: State-action pairs prioritized by update magnitude $P = |R + \gamma \max_a Q(S',a) - Q(S,A)|$.
- If $P > \theta$ (small threshold), insert/update queue.
- Update top pair; propagate changes to predecessors.
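
The planning loop can be sketched with Python's `heapq` as a priority queue. This is a simplified illustration for a deterministic model, not the book's pseudocode verbatim: the function names `priority` and `sweep` are hypothetical, duplicate queue entries are tolerated instead of being updated in place, and states and actions are assumed to be mutually comparable (e.g., integers) so that heap ties can be broken.

```python
import heapq
from collections import defaultdict

def priority(Q, model, actions, s, a, gamma):
    """|R + gamma * max_a' Q(S', a') - Q(S, A)| for a deterministic model entry."""
    r, s2 = model[(s, a)]
    return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

def sweep(Q, model, actions, start, n_updates, gamma=0.95, alpha=1.0, theta=1e-4):
    """Run up to n_updates prioritized planning updates, seeded from `start`.

    Q is a defaultdict(float); model maps (s, a) -> (r, s').
    Returns the number of updates actually performed.
    """
    # Predecessor index: s' -> set of (s, a) pairs the model says lead to s'.
    preds = defaultdict(set)
    for (s, a), (_, s2) in model.items():
        preds[s2].add((s, a))

    queue = []                                # max-heap via negated priorities
    p = priority(Q, model, actions, *start, gamma)
    if p > theta:
        heapq.heappush(queue, (-p, start))

    done = 0
    while queue and done < n_updates:
        _, (s, a) = heapq.heappop(queue)      # highest-priority pair first
        r, s2 = model[(s, a)]
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        done += 1
        for ps, pa in preds[s]:               # propagate to predecessors of s
            p = priority(Q, model, actions, ps, pa, gamma)
            if p > theta:
                heapq.heappush(queue, (-p, (ps, pa)))
    return done
```

On a chain `0 → 1 → 2 → goal` with reward only on the last transition, seeding the queue with the final pair causes the value to flow backward one predecessor at a time, with no updates spent elsewhere.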

**Intuition**: Chain reactions—big changes in one state affect predecessors. Queue ensures efficient backward focus.

**Advantages**:
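- Faster than uniform planning (e.g., in maze tasks the text reports speedups often by a factor of 5 to 10).
- Stochastic extension: Keep counts of observed transitions in the model and use expected updates over all possible next states, weighted by their estimated probabilities.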

**Example: Rod Maneuvering** (Figure 8.6): Solves large (14k states) deterministic task efficiently.

![Prioritized Sweeping on Maze](../img/fig8_6.png)  <!-- Assuming this is the rod figure or maze comparison -->

**Sidenote**: Like heuristic search but retains values long-term. "Small backups" (van Seijen & Sutton, 2013) take this further: Probability-weighted single transitions without sampling.

---

## 8.5 Expected vs. Sample Updates

**Dimensions of Updates**:
- State vs. action values.
- Optimal vs. arbitrary policy.
- Expected (all successors) vs. sample (one successor).

**Trade-off**: Expected: Accurate but compute-heavy (branching factor $b$). Sample: Noisy but cheap.

**Efficiency Insight** (Figure 8.7): Sample often better for large $b$ (reduces error faster per computation).
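
The comparison behind this figure can be reproduced with a few lines of arithmetic. Under the text's idealized assumptions ($b$ equally likely successors, initial error 1, correct successor values, sample averaging), the error after $t$ sample updates is $\sqrt{(b-1)/(bt)}$, while one expected update costs $b$ computations and removes the error entirely. The function names below are hypothetical.

```python
import math

def sample_update_rms_error(b, t):
    """RMS error after t sample updates (initial error 1), idealized setting."""
    return math.sqrt((b - 1) / (b * t))

def fraction_of_expected_cost(b, target_error):
    """Smallest t/b at which sample updates reach target_error.

    One expected update costs b computations, so t/b is the share of that
    cost the sample updates needed.
    """
    t = 1
    while sample_update_rms_error(b, t) > target_error:
        t += 1
    return t / b
```

For example, `fraction_of_expected_cost(1000, 0.1)` evaluates to `0.1`: with a branching factor of 1000, sample updates cut the error to a tenth of its initial size using a tenth of the computation of a single expected update.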

![Expected vs Sample Efficiency](../img/fig8_7.png)

**General Rule**: Sample updates tend to win when $b$ is large and there are too many states to solve exactly; expected updates are preferable when $b$ is small and computation is plentiful.

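**Sidenote**: Expected updates need a distribution model, so they are available only when planning with one. Sample updates work with a sample model or with real experience, which only ever yields one sampled transition at a time.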

---

## 8.6 Trajectory Sampling

**Update Distribution**: Where to focus backups?
- **Uniform**: All states equally (like DP sweeps)—wastes on irrelevant states.
- **On-policy**: Sample trajectories following current policy—focuses on likely states.

**Advantages of On-Policy** (Figure 8.8): Faster initial learning (relevant states); but uniform may win long-term.

![Uniform vs On-Policy](../img/fig8_8.png)

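**Intuition**: Spend updates where the agent actually goes. States the current policy rarely or never visits contribute little to its performance, so updating them is mostly wasted effort.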

---

## 8.7 Real-time Dynamic Programming (RTDP)

**RTDP**: Asynchronous value iteration with on-policy trajectory sampling.
- Update only visited states greedily.
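- Converges to an optimal policy for discounted finite MDPs when trajectories begin from exploring starts; for certain undiscounted goal-reaching tasks it converges on the relevant states without that requirement.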
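
A minimal sketch of RTDP on an undiscounted goal-reaching task follows. The function name `rtdp`, the `transitions` data layout, and the trial count are assumptions for illustration. The sketch also assumes strictly negative step rewards, zero initial values, and a reachable goal, so that greedy trajectories terminate.

```python
import random

def rtdp(transitions, start, goal, trials=50, seed=0):
    """RTDP sketch: value-iteration backups applied only along greedy trajectories.

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    Returns the state-value table V (undiscounted).
    """
    rng = random.Random(seed)
    V = {s: 0.0 for s in transitions}       # optimistic start for negative rewards
    V[goal] = 0.0

    def backup(s):
        """Expected (value-iteration) update of V[s]; returns the greedy action."""
        q = {a: sum(p * (r + V[s2]) for p, s2, r in outs)
             for a, outs in transitions[s].items()}
        best = max(q, key=q.get)
        V[s] = q[best]
        return best

    for _ in range(trials):
        s = start
        while s != goal:
            a = backup(s)                   # update only the state being visited
            outs = transitions[s][a]
            s = rng.choices([s2 for _, s2, _ in outs],
                            weights=[p for p, _, _ in outs])[0]
    return V
```

States that no greedy trajectory ever enters keep their initial value, which illustrates how RTDP can skip irrelevant parts of the state space.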

**Advantages**: Ignores irrelevant states; policy improves during computation.

**Example: Racetrack** (Exercise 5.12): RTDP updates fewer states, finds near-optimal policy faster than full sweeps.

**Intuition**: "Relevant states" are reachable under optimal policy—RTDP naturally focuses there without visiting all.

**Sidenote**: Partial policies (optimal only on relevant states) suffice; RTDP guarantees this without infinite visits everywhere.

---

## 8.8 Planning at Decision Time

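- **Background Planning**: Gradually improve the overall policy/value function across many states, not just the current one (e.g., Dyna).
- **Decision-time Planning**: Starting from the current state, search deeply to select a single action; the backed-up values are typically discarded afterward.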

**Intuition**: Focus computation on now; useful when fast responses aren't needed (e.g., games).

---

## 8.9 Heuristic Search

**Classical AI Method**: From current state (root), build search tree; evaluate leaves with heuristic $v$; backup maxes to root; pick best action.
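
As a rough sketch of this procedure for a deterministic model, the functions below build the search tree implicitly by recursion. The names `lookahead_value` and `choose_action` and the callable interfaces (`actions(state)` returning a list, `model(state, action)` returning `(reward, next_state)`, `v_hat(state)` as the heuristic) are assumptions for illustration.

```python
def lookahead_value(state, depth, actions, model, v_hat, gamma=1.0):
    """Backed-up value of `state` from a depth-limited search tree."""
    if depth == 0 or not actions(state):
        return v_hat(state)                  # leaf: fall back on the heuristic
    return max(
        r + gamma * lookahead_value(s2, depth - 1, actions, model, v_hat, gamma)
        for r, s2 in (model(state, a) for a in actions(state))
    )

def choose_action(state, depth, actions, model, v_hat, gamma=1.0):
    """Pick the root action with the largest backed-up value (depth >= 1)."""
    def backed_up(a):
        r, s2 = model(state, a)
        return r + gamma * lookahead_value(s2, depth - 1, actions, model,
                                           v_hat, gamma)
    return max(actions(state), key=backed_up)
```

With a poor heuristic, a shallow search may prefer a small nearby reward while a deeper search finds a larger, more distant one; increasing `depth` trades computation for a better decision.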

**RL Twist**: Improve heuristic $v$ over time via backups—blends with learning.

![Heuristic Search as Backups](../img/fig8_9.png)

**Intuition**: Deep search = many small backups focused downstream. Better than unfocused for current decision.

---

## 8.10 Rollout Algorithms

**Method**: Monte Carlo control from current state: Simulate trajectories per action using rollout policy $\pi$; average returns for $q(s,a)$; pick $\arg\max_a$.

**No Storage**: Discard estimates after action choice.
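
A rollout decision can be sketched as below. The function name and the callable interfaces are hypothetical: `sample_model(state, action, rng)` is assumed to return `(reward, next_state, done)`, and `rollout_policy(state, rng)` to return an action. The defaults for `n_rollouts` and `max_steps` are arbitrary.

```python
import random

def rollout_action(state, actions, sample_model, rollout_policy,
                   n_rollouts=100, max_steps=50, gamma=1.0, rng=None):
    """Estimate q(state, a) for each action by simulated returns; pick the best.

    Each simulation takes `a` first and follows rollout_policy afterwards.
    Returns (best_action, estimates); the caller may discard the estimates.
    """
    rng = rng or random.Random(0)

    def simulate(a):
        s, g, discount = state, 0.0, 1.0
        for _ in range(max_steps):           # truncate overly long rollouts
            r, s, done = sample_model(s, a, rng)
            g += discount * r
            discount *= gamma
            if done:
                break
            a = rollout_policy(s, rng)       # follow the rollout policy afterwards
        return g

    estimates = {a: sum(simulate(a) for _ in range(n_rollouts)) / n_rollouts
                 for a in actions(state)}
    return max(estimates, key=estimates.get), estimates
```

Because the simulations for different actions and trials are independent, the calls to `simulate` could be distributed across workers.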

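**Improvement**: Acts like one step of policy iteration. By the policy improvement theorem, acting greedily with respect to the rollout estimates of $q_\pi$ yields a policy at least as good as $\pi$ (the better $\pi$ is, the better the result).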

**Intuition**: Like MC but decision-time; balances trials vs. depth (truncate + bootstrap if needed).

**Sidenote**: Parallelizable; prune poor actions. Enhances weak $\pi$ via lookahead.

---

## 8.11 Monte Carlo Tree Search (MCTS)

**Method**: Build asymmetric tree from current state via iterations:
1. **Selection**: Tree policy (e.g., UCB) to leaf.
2. **Expansion**: Add child(ren).
3. **Simulation**: Rollout policy to end; get return.
4. **Backup**: Update $Q$ along path (average returns).

Pick action with best root $Q$.
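
The four phases can be sketched compactly for a deterministic, undiscounted, single-agent problem. This is an illustrative sketch under stated assumptions, not a reference implementation: the names `Node` and `mcts`, the exploration constant `c`, the uniformly random rollout policy, and the interfaces `actions(state)` and `step(state, action) -> (reward, next_state, done)` are all hypothetical, and every non-terminal state (including the root) is assumed to have at least one action.

```python
import math
import random

class Node:
    def __init__(self, state, done=False):
        self.state, self.done = state, done
        self.children = {}    # action -> (reward, child Node)
        self.N = {}           # action -> visit count
        self.Q = {}           # action -> mean return

def mcts(root_state, actions, step, n_iter=500, c=1.4, max_depth=50, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)

    def rollout(state, done, depth):
        """Simulation phase: random actions until termination or depth limit."""
        g = 0.0
        while not done and depth < max_depth:
            r, state, done = step(state, rng.choice(actions(state)))
            g += r
            depth += 1
        return g

    def ucb(node, a):
        total = sum(node.N.values())
        return node.Q[a] + c * math.sqrt(math.log(total) / node.N[a])

    for _ in range(n_iter):
        node, path, depth = root, [], 0
        # 1. Selection: descend through fully expanded nodes using UCB.
        while (not node.done and depth < max_depth
               and len(node.children) == len(actions(node.state))):
            a = max(node.children, key=lambda b: ucb(node, b))
            r, child = node.children[a]
            path.append((node, a, r))
            node, depth = child, depth + 1
        g = 0.0
        if not node.done and depth < max_depth:
            # 2. Expansion: add one untried action as a new child.
            a = rng.choice([b for b in actions(node.state)
                            if b not in node.children])
            r, s2, done = step(node.state, a)
            node.children[a] = (r, Node(s2, done))
            node.N[a], node.Q[a] = 0, 0.0
            path.append((node, a, r))
            # 3. Simulation: rollout from the new child.
            g = rollout(s2, done, depth + 1)
        # 4. Backup: update running averages along the visited path.
        for parent, a, r in reversed(path):
            g = r + g
            parent.N[a] += 1
            parent.Q[a] += (g - parent.Q[a]) / parent.N[a]
    return max(root.Q, key=root.Q.get), root
```

The returned `root` exposes `root.N` and `root.Q`, so a caller could instead choose the most-visited root action, another common selection rule.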

![MCTS Process](../img/fig8_10.png)

**Intuition**: Tree focuses on promising branches; rollouts explore beyond. Balances exploration/exploitation via tree policy.

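**RL View**: MC control applied to simulated episodes from the root, with a partial $q$ table (the tree). Between moves, the subtree below the chosen action is often kept and the rest discarded.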

**Sidenote**: AlphaGo (Ch. 16) extends with neural nets. MCTS revolutionized AI games by smart sampling.

---

## 8.12 Summary of the Chapter

**Recap**: Models enable planning (simulated backups). Dyna integrates all; prioritized/trajectory sampling focuses efficiently. Expected/sample, background/decision-time are key axes. Rollout/MCTS: Powerful decision-time methods.

**Bridge to Part II**: Part I assumed tabular (exact values per state)—limits to small problems. Part II scales via function approximation: Approximate values over large spaces, enabling real-world RL.

---

## 8.13 Summary of Part I: Dimensions

**Core Dimensions** (Figure 8.11):
- **Depth**: Shallow (one-step) to deep (full episodes).
- **Width**: Expected (all branches) to sample (one branch).
- **On/Off-Policy**: Learn the value of the policy currently being followed vs. that of a different (often optimal) target policy.

![Dimensions of RL Methods](../img/fig8_11.png)

**Other Dimensions**:
- Episodic/continuing, discounted/undiscounted.
- State/action/afterstate values.
- Exploration: ε-greedy, optimistic, UCB.
- Synchronous/asynchronous.
- Real/simulated experience.
- Update location/timing/memory.

**Intuition**: Vast design space—mix for task. Function approximation (Part II) adds orthogonal scaling dimension.

**Part I Takeaway**: Tabular methods build foundations; understand backups, models, integration for intuition.

---

## Exercises

### Exercise 8.1 Solution
**Q**: Why might multi-step bootstrapping match Dyna?

**A**: Multi-step uses real multi-transitions; Dyna simulates many one-steps. If model accurate, similar. But Dyna flexible (arbitrary simulations); multi-step limited by real data—Dyna often better for planning.

### Exercise 8.2 Solution
**Q**: Why does nonplanning look poor in Fig 8.3?

**A**: It's one-step. Multi-step could propagate faster, potentially matching Dyna. But Dyna's simulations allow more flexible/backward focus.

### Exercise 8.3 Solution
**Q**: Alternate Dyna-Q+: Use the bonus $\kappa\sqrt{\tau}$ only in action selection, not in updates.

**A**: 
- **Strength**: Direct exploration in real actions.
- **Weakness**: Less in planning (may miss long chains). Gridworld: Good for simple changes; worse for deep dependencies.

### Exercise 8.4 Solution
**Q**: Modify Dyna-Q for stochastic envs; issues in changing envs; fix.

**A**: Model averages experiences (e.g., count-based probabilities). Changing envs: Forgets slowly—use recency weighting (e.g., higher α for model updates) or detect changes/reset.
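
One way to sketch the count-based model mentioned in this answer is shown below. The class name `CountModel` and its methods are hypothetical; the sketch covers only the stationary case and does not implement recency weighting or change detection.

```python
import random
from collections import Counter, defaultdict

class CountModel:
    """Tabular model for stochastic environments, built from outcome counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (s, a) -> Counter of (r, s')

    def update(self, s, a, r, s2):
        """Record one real transition."""
        self.counts[(s, a)][(r, s2)] += 1

    def probabilities(self, s, a):
        """Empirical distribution over observed (r, s') outcomes."""
        c = self.counts[(s, a)]
        total = sum(c.values())
        return {outcome: n / total for outcome, n in c.items()}

    def sample(self, s, a, rng=random):
        """Draw one (r, s') in proportion to how often it was observed."""
        c = self.counts[(s, a)]
        return rng.choices(list(c), weights=list(c.values()), k=1)[0]
```

Planning can then use either `sample` (sample updates) or `probabilities` (expected updates); both should only be queried for pairs that have been experienced at least once.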

### Exercise 8.5 Solution
**Q**: Prioritized sweeping on rod maneuvering.

**A**: (Descriptive; see text example—efficient for large deterministic tasks by focusing backups.)

(Note: Some exercises are thought experiments; solutions are conceptual.)

---
---

# Appendix A: Full Notation
---

### General Notation

> Capital letters are used for random variables.

> Lower case letters are used for the values of random variables and for scalar functions.

> Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables). 

> Matrices are bold capitals.

| Symbol | Meaning |
|--------|---------|
| **Standard Operators** |  |
| $\doteq$ | Equality relationship that is true by definition |
| $\approx$ | Approximately equal |
| $\propto$ | Proportional to |
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\ln x$ | Natural logarithm of $x$ |
| $e^x$, $\exp(x)$ | The base of the natural logarithm, $e \approx 2.71828$, carried to power $x$ |
| $\mathbb{R}$ | Set of real numbers |
| $f: \mathcal{X} \to \mathcal{Y}$ | Function from elements of set $\mathcal{X}$ to elements of set $\mathcal{Y}$ |
| $\leftarrow$ | Assignment |
| $(a, b]$ | Real interval between $a$ and $b$ including $b$ but not $a$ |
| **Standard RL Algorithm Parameters** |  |
| $\epsilon$ | Probability of taking a random action in an $\epsilon$-greedy policy |
| $\alpha, \beta$ | Step-size parameters |
| $\gamma$ | Discount-rate parameter |
| $\lambda$ | Decay-rate parameter for eligibility traces |
| $\mathbf{1}_\text{predicate}$ | Indicator function (1 if predicate is true, else 0) |

---

### Multi-Armed Bandit Notation

| Symbol | Meaning |
|--------|---------|
| $k$ | Number of actions (arms) |
| $t$ | Discrete time step or play number |
| $q_*(a)$ | True value (expected reward) of action $a$ |
| $Q_t(a)$ | Estimate at time $t$ of $q_*(a)$ |
| $N_t(a)$ | Number of times action $a$ has been selected up to time $t$ |
| $H_t(a)$ | Learned preference for selecting action $a$ at time $t$ |
| $\pi_t(a)$ | Probability of selecting action $a$ at time $t$ |
| $\bar{R}_t$ | Estimate at time $t$ of the expected reward given $\pi_t$ |

---

### Markov Decision Process Notation

| Symbol | Meaning |
|--------|---------|
| **State & Action Sets** |  |
| $s, s'$ | States |
| $a$ | An action |
| $r$ | A reward |
| $\mathcal{S}$ | Set of all nonterminal states |
| $\mathcal{S}^+$ | Set of all states, incl. terminal state |
| $\mathcal{A}(s)$ | Set of all actions available in state $s$ |
| $\mathcal{R}$ | Set of all possible rewards, a finite subset of $\mathbb{R}$ |
| $\subset$ | Subset of (e.g., $\mathcal{R} \subset \mathbb{R}$) |
| $\in$ | Is an element of (e.g. $s \in \mathcal{S}$, $r \in \mathcal{R}$) |
| $\lvert\mathcal{S}\rvert$ | Number of elements in set $\mathcal{S}$ |
| **Time & Policy** |  |
| $t$ | Discrete time step |
| $T, T(t)$ | Final time step of an episode, or of the episode that includes time step $t$ |
| $A_t$ | Action at time $t$ |
| $S_t$ | State at time $t$ |
| $R_t$ | Reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under deterministic $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under stochastic $\pi$ |
| **Returns** |  |
| $G_t$ | Return following time $t$ |
| $h$ | Horizon (timestep looked up to in forward view) |
| $G_{t:t+n}, G_{t:h}$ | $n$-step return from $t+1$ to $t+n$ or $h$ (discounted/corrected) |
| $\bar{G}_{t:h}$ | Flat return (undiscounted/unadjusted) from $t+1$ to $h$ |
| $G^\lambda_t$ | $\lambda$-return |
| $G^\lambda_{t:h}$ | Truncated, corrected $\lambda$-return |
| $G^{\lambda s}_t$, $G^{\lambda a}_t$ | $\lambda$-return corrected by estimated state or action values |
| **Transition & Reward Probabilities** |  |
| $p(s', r \mid s, a)$ | Probability of transition to $s'$ with reward $r$ from $s, a$ |
| $p(s' \mid s, a)$ | Probability of transition to $s'$ from $s$ taking $a$ |
| $r(s, a)$ | Expected immediate reward from $s$ after $a$ |
| $r(s, a, s')$ | Expected reward on transition $s \to s'$ under $a$ |
| **Value Functions** |  |
| $v_\pi(s)$ | Value of $s$ under policy $\pi$ (expected return) |
| $v_*(s)$ | Value of $s$ under optimal policy |
| $q_\pi(s, a)$ | Value of taking $a$ in $s$ under $\pi$ |
| $q_*(s, a)$ | Value of taking $a$ in $s$ under optimal policy |
| **Estimators & TD Error** |  |
| $V, V_t$ | Array estimates of $v_\pi$ or $v_*$ |
| $Q, Q_t$ | Array estimates of $q_\pi$ or $q_*$ |
| $\bar{V}_t(s)$ | Expected approximate action value; e.g. $\bar{V}_t(s) \doteq \sum_a \pi(a \mid s) Q_t(s, a)$ |
| $U_t$ | Target for estimate at time $t$ |
| $\delta_t$ | Temporal-difference (TD) error at $t$ (a random variable) |
| $\delta_t^s$, $\delta_t^a$ | State- and action-specific forms of TD error |
| $n$ | In $n$-step methods, $n$ is number of steps of bootstrapping |

---

### Function Approximation, Policy Gradient, Advanced Notation

| Symbol | Meaning |
|--------|---------|
| **Weight & Parameter Notation** |  |
| $d$ | Dimensionality—number of components of $\mathbf{w}$ |
| $d'$ | Alternate dimensionality—number of components of $\theta$ |
| $\mathbf{w}, \mathbf{w}_t$ | $d$-vector of weights underlying approximate value function |
| $w_i, w_{t,i}$ | $i$-th component of learnable weight vector |
| $\hat{v}(s, \mathbf{w})$ | Approximate value of $s$ given $\mathbf{w}$ |
| $v_\mathbf{w}(s)$ | Alternate notation for $\hat{v}(s, \mathbf{w})$ |
| $\hat{q}(s, a, \mathbf{w})$ | Approximate value of $(s, a)$ given $\mathbf{w}$ |
| $\nabla \hat{v}(s, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| $\nabla \hat{q}(s, a, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| **Feature Vectors & Eligibility Traces** |  |
| $\mathbf{x}(s)$ | Feature vector visible in $s$ |
| $\mathbf{x}(s, a)$ | Feature vector visible in $s$ taking $a$ |
| $x_i(s), x_i(s, a)$ | $i$-th component of $\mathbf{x}(s)$ or $\mathbf{x}(s, a)$ |
| $\mathbf{x}_t$ | Shorthand for $\mathbf{x}(S_t)$ or $\mathbf{x}(S_t, A_t)$ |
| $\mathbf{w}^\top \mathbf{x}$ | Inner product of vectors |
| $\mathbf{v}, \mathbf{v}_t$ | Secondary $d$-vector of weights, used to learn $\mathbf{w}$ |
| $\mathbf{z}_t$ | $d$-vector of eligibility traces at $t$ |
| **Policy Gradient Notation** |  |
| $\theta, \theta_t$ | Parameter vector of target policy |
| $\pi(a \mid s, \theta)$ | Probability of taking $a$ in $s$ given $\theta$ |
| $\pi_\theta$ | Policy corresponding to parameter $\theta$ |
| $\nabla \pi(a \mid s, \theta)$ | Partial derivatives of $\pi(a \mid s, \theta)$ w.r.t. $\theta$ |
| $J(\theta)$ | Performance measure for policy $\pi_\theta$ |
| $\nabla J(\theta)$ | Partial derivatives of $J(\theta)$ w.r.t. $\theta$ |
| $h(s, a, \theta)$ | Preference for $a$ in $s$ based on $\theta$ |
| **Behavior Policy, Baselines, Importance Sampling** |  |
| $b(a \mid s)$ | Behavior policy used to select actions while learning target $\pi$ |
| $b(s)$ | Baseline function $b: \mathcal{S} \to \mathbb{R}$ for policy-gradient methods |
| $b$ | Branching factor for MDP/search tree |
| $\rho_{t:h}$ | Importance sampling ratio for $t$ through $h$ |
| $\rho_t$ | Importance sampling ratio for time $t$ alone, $\rho_t \doteq \rho_{t:t}$ |
| $r(\pi)$ | Average reward (reward rate) for policy $\pi$ |
| $\bar{R}_t$ | Estimate of $r(\pi)$ at time $t$ |
| **State Distributions & Operators** |  |
| $\mu(s)$ | On-policy distribution over states |
| $\mu$ | $\lvert\mathcal{S}\rvert$-vector of the $\mu(s)$ for $s \in \mathcal{S}$ |
| $\|v\|^2_\mu$ | $\mu$-weighted squared norm of $v$, i.e., $\|v\|^2_\mu \doteq \sum_{s \in \mathcal{S}} \mu(s)v(s)^2$ |
| $\eta(s)$ | Expected number of visits to $s$ per episode |
| $\Pi$ | Projection operator for value functions |
| $B_\pi$ | Bellman operator for value functions |

---

### Matrices, Bellman Error, & Error Metrics

| Symbol | Meaning |
|--------|---------|
| **Matrix Notation & Linear Algebra** |  |
| $\mathbf{A}$ | $d \times d$ matrix: $\mathbf{A} \doteq \mathbb{E}\left[ \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \right]$ |
| $\mathbf{b}$ | $d$-dimensional vector: $\mathbf{b} \doteq \mathbb{E}[R_{t+1} \mathbf{x}_t]$ |
| $\mathbf{w}_{TD}$ | TD fixed point: $\mathbf{w}_{TD} \doteq \mathbf{A}^{-1}\mathbf{b}$ |
| $\mathbf{I}$ | Identity matrix |
| $\mathbf{P}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ matrix of state-transition probabilities under $\pi$ |
| $\mathbf{D}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ diagonal matrix with $\mu$ on its diagonal |
| $\mathbf{X}$ | $\lvert\mathcal{S}\rvert \times d$ matrix with the $\mathbf{x}(s)$ as its rows |
| **Bellman Error & Value Error Metrics** |  |
| $\bar{\delta}_\mathbf{w}(s)$ | Bellman error (expected TD error) for $v_\mathbf{w}$ at $s$ |
| $\bar{\delta}_\mathbf{w}$, BE | Bellman error vector (with components $\bar{\delta}_\mathbf{w}(s)$) |
| $\text{VE}(\mathbf{w})$ | Mean square value error: $\text{VE}(\mathbf{w}) \doteq \|v_\mathbf{w} - v_\pi\|^2_\mu$ |
| $\text{BE}(\mathbf{w})$ | Mean square Bellman error: $\text{BE}(\mathbf{w}) \doteq \|\bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{PBE}(\mathbf{w})$ | Mean square projected Bellman error: $\text{PBE}(\mathbf{w}) \doteq \|\Pi \bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{TDE}(\mathbf{w})$ | Mean square temporal-difference error: $\text{TDE}(\mathbf{w}) \doteq \mathbb{E}_b[\rho_t \delta_t^2]$ |
| $\text{RE}(\mathbf{w})$ | Mean square return error |