# Summary of Notation
---


(full notation in Appendix A)

> Whenever you see the word "value", think "action's Expected Value of reward" (AEVOR). This is usually a $P(win)$ or (in the bandit case) probability-weighted-avg $R_t$, given that that action is taken. It can be either a "True" value (i.e. the population mean of the action's distribution) or an estimate (i.e. sample statistic) of it.

| Symbol | Meaning |
|--------|---------|
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ is drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\alpha$, $\beta$, $\epsilon$ | Step-size, decay-rate, and exploration parameters |
| $s, s'$ | States |
| $a$ | Action |
| $r$ | Reward |
| $S, A(s), R$ | Set of states, available actions, rewards |
| $t, T$ | Discrete time step, final step |
| $S_t, A_t, R_t$ | State, action, and reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under _deterministic_ $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under _stochastic_ $\pi$ |
| $G_t$ | Return from time $t$ |
| $h$ | Horizon (the timestep one looks up to in a forward view) |
| $v_\pi(s)$ | Value of state $s$ under policy $\pi$ |
| $q_\pi(s, a)$ | Value of state-action pair $(s,a)$ under $\pi$ |
| $p(s', r \mid s, a)$ | Transition dynamics |
| $\hat{v}(s, \mathbf{w})$ | Approximate value function using weight vector $\mathbf{w}$ |
| $\delta_t$ | Temporal-difference error at time $t$ |

---
---

# Part I: Tabular Solution Methods
---

<img src="../img/6.png" alt="tabularsolutionmethods" width="40%"/>

# Chapter 2: Multi-armed Bandits
---

> The most important feature distinguishing reinforcement learning from other types of learning is that it **uses training information that evaluates the actions taken rather than instructs by giving correct actions**. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken....Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken

**Bandit problems** model decision-making under uncertainty with *no state transitions*.  
Bandit problems in general are a special case of RL where there is only a single state.

Bandit problems model a **nonassociative setting**, in which the agent must learn (from evaluative feedback) in just one situation.

## 2.1 A $k$-Armed Bandit Problem
---

Unlike supervised learning, RL uses *evaluative* feedback—it tells you how good your chosen action was, but not what the best action would have been.

- **Setting**: At each time $t$, choose one of $k$ actions (“arms”), then observe reward $R_t$ drawn from a stationary distribution for that arm. So:
	- $k$: The number of actions (arms) to choose from.
	- $A_t$: The action (arm) selected at timestep $t$.
		- This is a RV. The observation of $A_t$ (i.e. the actual arm chosen at $t$) is denoted $a$.
	- $R_t$: Reward received from $A_t$.
- **Objective**: Maximize expected cumulative reward over time (e.g., $T=1000$ rounds).
	- In the "k-armed slot machine" example, the objective is to maximize your winnings by concentrating your actions on the best levers.
- **Approach**: For each arm $a$ of the $k$ arms, maintain a "$\text{value}$ estimate" $Q_t (a)$ equal to your estimate of the expected reward from choosing that arm.
	- $Q_t (a) = \text{value}_{A_t = a} = \hat{\mathbb{E}}[R_t \mid A_t = a]$
- **True action value** (_Equation 2.0_): For an arbitrary action, the <u>ACTUAL</u> expected reward given that $a$ is selected. We don't know this (population parameter).
  $$
  q_*(a) = \mathbb{E}[R_t \mid A_t = a]
  $$
	- If we knew the true action values, we'd just choose the highest $q_*(a)$ every time.
	- We don't. So we want to make $Q_t (a)$ as close to $q_*(a)$ as possible.
- **Exploration vs. exploitation**:
  - *Exploitation*: Choose the action you *currently* think is best.
	- Good for this step.
  - *Exploration*: Try other actions to discover if they might be better.
	- Good for the long run.
  - Balancing these is a *central challenge*—no single action achieves both at once.
	- E.g. say you have certainty on a greedy action's $value$ but there are several other actions with close-but-uncertain $value$s. Say the uncertainty is such that at least one of these other actions is probably actually better than the greedy action, but you don't know which one. 
		- In this scenario, given the nature of the uncertainty - 
			- If you have many timesteps left, you should probably explore.
			- If you don't, you should probably exploit.
	- In general, balancing explore vs. exploit = f($\quad Q_t (a)\quad \forall \quad a \in K$, uncertainties associated with each $Q_t (a)$, and remaining timesteps)
	- <u>HOWEVER, most sophisticated methods for balancing explore vs. exploit make strong, unrealistic assumptions around distribution stationarity and prior knowledge, SO this book focuses only on balancing them <i>at all</i>. We now look at simple methods for balancing explore vs. exploit with the point of showing that they are all better than "always exploit".</u>


---

## 2.2 Action-Value Methods

**Action-Value Methods**: Methods for estimating the values of actions and using those action value estimates to make action selection decisions.

### Value Estimation

- **Goal**: Estimate the "true value" $q_*(a)$ for each arm $a$.
- **Sample-average estimator** (_Equation 2.1_): Estimate "true value" of each arm as the average reward actually received from selecting that arm so far.
  $$
  Q_t(a) \doteq \frac{\text { sum of rewards when } a \text { taken prior to } t}{\text { number of times } a \text { taken prior to } t}=\frac{\displaystyle\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i=a}}{\displaystyle\sum_{i=1}^{t-1} \mathbf{1}_{A_i=a}}
  $$

  - where $\mathbf{1}_\text{predicate}$ is the indicator function (= 1 if predicate is True, else 0)
  - If denominator is 0, define $Q_t(a)$ to be some default value (e.g. 0)
  - As denominator $\to \infty$, $Q_t(a) \to q_*(a)$.
	- Note that this is simply a value estimation. So this only holds assuming your action selection strategy actually selects $a$ with some non-zero probability. 
	- i.o.w. this formula never even "runs" for arms that just get ignored (e.g. by purely greedy strategies).

Note that the sample-average method for estimating values is simple & just one of many value estimation methods. But we stick with it for the next sections.
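
As a minimal sketch of Equation 2.1, the function below computes the sample average for one arm directly from a recorded history. The names `sample_average`, `actions`, `rewards`, and `default` are illustrative assumptions, and the five-step history is made up for the example.

```python
def sample_average(actions, rewards, arm, default=0.0):
    """Equation 2.1: mean reward over the steps on which `arm` was chosen."""
    # Keep only the rewards observed when this arm was the selected action.
    hits = [r for a, r in zip(actions, rewards) if a == arm]
    # An arm that was never tried has no data, so fall back to the default.
    return sum(hits) / len(hits) if hits else default


# Hypothetical history of five steps: arm indices and the rewards they produced.
actions = [0, 1, 1, 0, 1]
rewards = [1.0, 0.0, 2.0, 3.0, 4.0]

q_arm0 = sample_average(actions, rewards, arm=0)  # averages 1.0 and 3.0
q_arm2 = sample_average(actions, rewards, arm=2)  # never chosen, uses default
```

Arm 2 never appears in the history, so its estimate stays at the default, which is the "formula never runs" situation described in the list above.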

### Action Selection Strategies

- **Greedy Action Selection method** (_Equation 2.2_): Always pick the arm with the highest value estimate.
  $$
  A_t \doteq \arg\max_a Q_t(a)
  $$
  - where $\arg\max_a$ denotes "select action $a$ for which the expression that follows is maximized (with ties broken arbitrarily, perhaps randomly)"
  - *Limitation*: Purely greedy strategies may permanently overlook arms with higher true values. 
  	- i.o.w. No guarantee that all $Q_t(a)$ actually converge to their respective $q_*(a)$. May get stuck exploiting a suboptimal action (resulting in a non-maximal cumulative reward) if your initial samples (and thus value estimates) of the highest true value arms happened to produce lower rewards.
- **$\epsilon$-greedy**: Be greedy most of the time, but with some small probability $\epsilon$ select an action uniformly at random instead.
  - With probability $1 - \epsilon$, pick greedy action.
  - With probability $\epsilon$, pick a random action.
  - *Guarantees*: As $t \to \infty$, all actions will be sampled infinitely often, ensuring $Q_t(a) \to q_*(a)$ by Law of Large Numbers.
  	- > This of course implies that the probability of selecting the optimal action converges to greater than $1 - \epsilon$, that is, to near certainty. 

**Key Point**: Simple random exploration (via $\epsilon$-greedy) is enough to outperform pure exploitation in most nontrivial problems.
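
The two selection rules can be sketched in a few lines. This is one possible implementation, not the book's: the function name `epsilon_greedy` and the `rng` parameter are assumptions made for illustration, and setting `epsilon=0` recovers pure greedy selection (Equation 2.2) with random tie-breaking.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an arm index: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        # Exploration step: uniform over ALL arms, the greedy one included.
        return rng.randrange(len(q_values))
    # Exploitation step: break ties among maximal estimates at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

Because the exploration branch can also land on the greedy arm, the greedy arm is chosen with probability $1 - \epsilon + \epsilon / k$ overall, not $1 - \epsilon$.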

### Exercise 2.1 Solution
**Q**: In $\epsilon$-greedy action selection with 2 actions and $\epsilon = 0.5$, what's the probability the greedy action is selected?

**A**: 
- With probability $1 - \epsilon = 0.5$, select greedy action deterministically.
- With probability $\epsilon = 0.5$, select uniformly at random between the 2 actions, so the greedy action is also picked on this branch with probability $0.5 \times 0.5 = 0.25$.
- **Total probability of greedy action** = $0.5 + 0.5 \times 0.5 = 0.75$

---

## 2.3 The 10-Armed Testbed (how to test different action selection strategies)

**Purpose**: Create a standardized empirical benchmark for comparing bandit algorithms and understanding the exploration-exploitation tradeoff.

### Experimental Setup

- **Standard empirical benchmark** for comparing bandit algorithms.
- **Test suite**: 2000 randomly generated 10-armed bandit problems.
- **For each individual bandit problem**:
  - $k = 10$ arms (actions available).
  - **True action values** $q_*(a)$ for each arm $a = 1, 2, ..., 10$ sampled from $\mathcal{N}(0, 1)$.
    - This means the "best" arm varies randomly across problems.
    - Action values are distributed around 0 with unit variance.
  - **Reward generation**: When action $A_t$ is selected at time $t$, reward $R_t \sim \mathcal{N}(q_*(A_t), 1)$.
    - Rewards are noisy: normally distributed around the true action value with unit variance.
    - This noise makes it challenging to identify the best action quickly.

<img src="../img/fig2_1.png" alt="true value & reward distribution" width="50%"/>

### Methodology

- **One run**: Apply a learning algorithm to one bandit problem for 1000 time steps.
- **Performance measurement**: Track reward and optimal action selection over time.
- **Statistical reliability**: Average results over 2000 independent runs (each with different bandit problems).
- **Algorithms compared**: 
  - Pure greedy ($\epsilon = 0$)
  - $\epsilon$-greedy with $\epsilon = 0.01$ 
  - $\epsilon$-greedy with $\epsilon = 0.1$
- **Value estimation**: All methods use sample-average technique with initial estimates $Q_1(a) = 0$.
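
The setup and methodology above can be sketched with the standard library alone. This is a rough sketch under stated assumptions, not the book's code: the names `make_bandit`, `run`, and `run_testbed` are hypothetical, the seed is arbitrary, and the sample average is kept as a running mean rather than recomputed from the full history.

```python
import random


def make_bandit(k=10, rng=random):
    """True action values q_*(a) ~ N(0, 1) for one bandit problem."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


def run(q_true, epsilon, steps, rng):
    """One run of epsilon-greedy; returns per-step rewards and optimal-arm flags."""
    k = len(q_true)
    optimal = max(range(k), key=lambda a: q_true[a])
    q_est, counts = [0.0] * k, [0] * k  # Q_1(a) = 0 for every arm
    rewards, hits = [], []
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore
        else:
            best = max(q_est)  # exploit, random tie-break
            arm = rng.choice([a for a in range(k) if q_est[a] == best])
        reward = rng.gauss(q_true[arm], 1.0)  # R_t ~ N(q_*(A_t), 1)
        counts[arm] += 1
        q_est[arm] += (reward - q_est[arm]) / counts[arm]  # running sample mean
        rewards.append(reward)
        hits.append(arm == optimal)
    return rewards, hits


def run_testbed(epsilon, n_problems=2000, steps=1000, seed=0):
    """Average reward and fraction of optimal choices per step, over many problems."""
    rng = random.Random(seed)
    avg_reward = [0.0] * steps
    pct_optimal = [0.0] * steps
    for _ in range(n_problems):
        rewards, hits = run(make_bandit(rng=rng), epsilon, steps, rng)
        for t in range(steps):
            avg_reward[t] += rewards[t] / n_problems
            pct_optimal[t] += hits[t] / n_problems
    return avg_reward, pct_optimal
```

Calling `run_testbed(0.0)`, `run_testbed(0.01)`, and `run_testbed(0.1)` gives the three curves to compare. Each call draws a fresh set of problems from its own seeded generator.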

### Results (Figure 2.2)

<img src="../img/fig2_2.png" alt="testbed results" width="50%"/>

**Upper graph (Average Reward over Time)**:
- **Greedy method**: 
  - Improves slightly faster initially (exploits early good estimates).
  - Plateaus at lower performance (~1.0 average reward per step vs. the best possible of ~1.55 on this testbed).
  - Gets "stuck" performing suboptimal actions.
- **$\epsilon$-greedy methods**:
  - Initially perform worse due to random exploration.
  - Eventually surpass greedy method due to continued learning.
  - $\epsilon = 0.1$: Faster initial improvement, higher exploration.
  - $\epsilon = 0.01$: Slower but more precise convergence.

**Lower graph (% Optimal Action Selection)**:
- **Greedy**: Finds optimal action only ~33% of the time.
  - In 2/3 of problems, early disappointing samples from optimal action cause permanent abandonment.
- **$\epsilon = 0.1$**: Explores more, finds optimal action earlier, but caps at ~91% (due to 10% random selection).
- **$\epsilon = 0.01$**: Slower improvement but eventually achieves higher optimal action percentage.

### Key Insights and Implications

**Why Greedy Fails**:
- **Early sampling bias**: If the truly optimal action gives disappointing early rewards (due to noise), greedy method abandons it permanently.
- **No recovery mechanism**: Once a suboptimal action appears best, greedy never reconsiders abandoned actions.
- **Premature convergence**: Settles on locally optimal choice based on limited early experience.

**Why $\epsilon$-greedy Succeeds**:
- **Guaranteed exploration**: Every action will be sampled infinitely often as $t \to \infty$.
- **Convergence guarantee**: By Law of Large Numbers, $Q_t(a) \to q_*(a)$ for all actions.
- **Probability of optimal selection**: In the limit the greedy action is the optimal one, so $\lim_{t \to \infty} P(\text{select optimal action}) = 1 - \epsilon + \epsilon/k > 1 - \epsilon$.

**Parameter Tradeoffs**:
- **Larger $\epsilon$ (0.1)**: 
  - Faster exploration and discovery of optimal action.
  - Lower asymptotic performance due to continued random actions.
- **Smaller $\epsilon$ (0.01)**:
  - Slower initial learning.
  - Better long-term performance with less "wasted" exploration.

### When Different Methods Excel

**Favor more exploration when**:
- **Higher reward variance**: Noisier rewards require more samples to identify true values.
- **Nonstationary environments**: True action values change over time, requiring ongoing exploration.
- **Uncertain initial conditions**: When starting knowledge is poor.

**Favor less exploration when**:
- **Deterministic rewards**: True values can be learned quickly (in extreme case, greedy optimal after one sample per action).
- **High confidence in estimates**: When you have good prior knowledge.
- **Short time horizons**: When there's insufficient time to benefit from exploration.

### Key Takeaway
The 10-armed testbed demonstrates that **even simple exploration can dramatically outperform pure exploitation** in realistic noisy environments. The optimal balance between exploration and exploitation depends on problem characteristics, but some exploration is almost always beneficial.

---
---
### Exercise 2.2 Solution
**Q:** Consider a 4-armed bandit with ε-greedy action selection, sample-average estimates, and initial estimates Q₁(a) = 0 for all a. Given the sequence: A₁ = 1, R₁ = -1; A₂ = 2, R₂ = 1; A₃ = 2, R₃ = -2; A₄ = 2, R₄ = 2; A₅ = 3, R₅ = 0. On which time steps did exploration definitely occur? On which could it have occurred?

**A:** Let's trace through the Q-value updates:

**Initial:** Q₁(1) = Q₁(2) = Q₁(3) = Q₁(4) = 0

**Step 1:** A₁ = 1, R₁ = -1
- All actions tied at Q = 0, so action 1 **could be greedy or exploratory**
- Update: Q₂(1) = -1, others remain 0

**Step 2:** A₂ = 2, R₂ = 1  
- Q-values: Q(1) = -1, Q(2) = Q(3) = Q(4) = 0
- Action 2 tied for greedy (highest at 0), so **could be greedy or exploratory**
- Update: Q₃(2) = 1, others unchanged

**Step 3:** A₃ = 2, R₃ = -2
- Q-values: Q(1) = -1, Q(2) = 1, Q(3) = Q(4) = 0
- Action 2 is greedy (highest at 1), so **could be greedy or exploratory**
- Update: Q₄(2) = (1 + (-2))/2 = -0.5

**Step 4:** A₄ = 2, R₄ = 2
- Q-values: Q(1) = -1, Q(2) = -0.5, Q(3) = Q(4) = 0
- Actions 3 and 4 tied for greedy (0), action 2 is not greedy (-0.5)
- **Definitely exploratory**
- Update: Q₅(2) = (1 + (-2) + 2)/3 = 1/3

**Step 5:** A₅ = 3, R₅ = 0
- Q-values: Q(1) = -1, Q(2) = 1/3, Q(3) = Q(4) = 0
- Action 2 is greedy (1/3), action 3 is not greedy (0)
- **Definitely exploratory**

**Answer:** 
- **Definitely exploratory:** Steps 4 and 5
- **Could have been exploratory:** Steps 1, 2, and 3
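
The trace above can be replayed mechanically. In this sketch the function name `classify_steps` and the label strings are illustrative assumptions; the history is the one given in the exercise.

```python
def classify_steps(history, k=4):
    """Label each (action, reward) step by whether the action was greedy
    with respect to the sample-average estimates held before that step."""
    q, n = [0.0] * k, [0] * k
    labels = []
    for action, reward in history:
        arm = action - 1  # the exercise numbers arms from 1
        greedy = q[arm] == max(q)
        # A non-greedy pick can only come from the exploration branch;
        # a greedy pick could come from either branch.
        labels.append("possibly exploratory" if greedy else "definitely exploratory")
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]  # running sample mean
    return labels


history = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]
labels = classify_steps(history)
```

The labels for steps 4 and 5 come out as definitely exploratory, matching the hand trace.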

---

### Exercise 2.3 Solution
**Q:** In Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be?

**A:** The **ε = 0.01 method** will perform best in the long run on both measures.

**Probability of selecting optimal action:**
- **ε = 0.1:** Asymptotes at ~91% (due to 10% random exploration)
- **ε = 0.01:** Will asymptote at ~99% (due to only 1% random exploration)
- **Improvement:** ~8 percentage points higher optimal action selection

**Average reward:**
- **ε = 0.1:** Limited by continued exploration reducing average reward
- **ε = 0.01:** Higher asymptotic reward due to more exploitation of learned optimal action

**Quantitative reasoning:**
- As t → ∞, both methods' estimates converge to the true action values, so the greedy action is the optimal one
- The optimal action is then chosen on every greedy step plus 1/10 of the exploratory steps, i.e. with probability 1 - ε + ε/10
- ε = 0.01: 1 - 0.01 + 0.001 = 0.991
- ε = 0.1: 1 - 0.1 + 0.01 = 0.91
- So ε = 0.01 selects the optimal action about 8 percentage points more often (0.991 vs. 0.91)
- This translates into a higher long-run average reward, by an amount proportional to the gap between the optimal action's value and the average value over all actions (which is what an exploratory step earns in expectation)

**Trade-off:** ε = 0.01 learns more slowly initially but achieves better asymptotic performance.

---
---


## 2.4 Incremental Implementation (how to actually implement action-value methods efficiently)

**Core Problem**: Naive implementation of sample averaging requires storing all past rewards and recomputing the full average each time—this scales poorly in both memory and computation.

**Relationship to episodic learning**: Unlike the tic-tac-toe example (which used batch updates after episodes), bandit problems typically update **after each action selection**. However, the incremental computation method shown here works for both cases—it's simply a more efficient way to compute sample averages whenever updates occur.

### The Memory and Computation Problem

**Naive sample-average implementation**:
- Store all rewards: $R_1, R_2, ..., R_{n-1}$ for each action
- Compute: $Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$
- **Memory**: $O(n)$ - grows linearly with number of samples
- **Computation**: $O(n)$ per update - must sum all rewards each time

This becomes impractical for long-running systems or problems with many actions.

### The Incremental Solution

**Key insight**: We can update estimates incrementally using only the new reward and current estimate.

**Mathematical derivation** for the $n$-th reward $R_n$ for action $a$:

$$
\begin{align}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i \\
&= \frac{1}{n}\left[R_n + \sum_{i=1}^{n-1} R_i\right] \\
&= \frac{1}{n}\left[R_n + (n-1) \frac{1}{n-1}\sum_{i=1}^{n-1} R_i\right] \\
&= \frac{1}{n}\left[R_n + (n-1)Q_n\right] \\
&= \frac{1}{n}\left[R_n + nQ_n - Q_n\right] \\
&= Q_n + \frac{1}{n}\left[R_n - Q_n\right]
\end{align}
$$

**Incremental update rule** (_Equation 2.3_):
$$
Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\left[R_n - Q_n(a)\right]
$$

**Efficiency gains**:
- **Memory**: $O(1)$ - only store $Q_n(a)$ and count $n$ for each action
- **Computation**: $O(1)$ per update - a constant number of arithmetic operations, independent of $n$
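
A small sketch of Equation 2.3 for a single arm, checked against the naive average. The class name `RunningMean` and the reward values are made up for illustration.

```python
class RunningMean:
    """Incremental sample average: stores only the estimate and the count."""

    def __init__(self, initial=0.0):
        self.value = initial  # Q_n
        self.count = 0        # number of rewards seen so far

    def update(self, reward):
        self.count += 1
        # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.value += (reward - self.value) / self.count
        return self.value


rewards = [2.0, -1.0, 4.0, 0.5]  # hypothetical rewards for one arm
estimate = RunningMean()
for r in rewards:
    estimate.update(r)

naive = sum(rewards) / len(rewards)  # same number, but needs the whole history
```

Note that with step size $\frac{1}{n}$ the first update overwrites the initial value entirely ($Q_2 = R_1$), so `initial` only matters for arms that have never been tried.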

### The General Update Form

**Universal pattern** (_Equation 2.4_):
$$
\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \cdot \left[\text{Target} - \text{OldEstimate}\right]
$$

**Components**:
- **Target**: $R_n$ (the new reward observed)
- **StepSize**: $\frac{1}{n}$ (decreases as more samples are collected)
- **Error**: $\left[\text{Target} - \text{OldEstimate}\right]$ (how far off our current estimate is)

**Interpretation**:
- We "step toward" the target by an amount proportional to the error
- Step size determines how much we trust the new information vs. old estimate
- This pattern appears throughout RL, not just in bandits

### Bandit vs. Episodic Update Timing

**Bandit context (this chapter)**:
- Updates happen **immediately after each action** (online/continuing learning)
- No natural episode boundaries—learning continues indefinitely
- Incremental updates enable real-time learning without memory growth

**Episodic context (like tic-tac-toe)**:
- Updates might happen **after episodes complete** (batch updates)
- Same incremental formula applies, just applied at different times
- Whether online or batch, the computational efficiency benefits remain

**Key point**: Section 2.4 focuses on **computational efficiency of the update calculation itself**, regardless of when those updates occur in the learning process.

### Implementation Details

**Step-size parameter**: In the incremental method, step-size $\alpha_t(a) = \frac{1}{n}$ where $n$ is the number of times action $a$ has been selected.

**Pseudocode** (from the book):

<img src="../img/banditalgo.png" alt="bandit algo" width="40%"/>

See [Bandit Implementation in python, here](./code_examples/bandit_implementation.py).

---

## 2.5 Tracking a Nonstationary Problem

**Context shift**: Previous methods assume **stationary** bandit problems, where each arm's reward distribution doesn't change over time. In practice, many RL problems are **nonstationary**: the true action values (and therefore possibly the optimal action) change over time.

**Key insight**: For nonstationary problems, recent rewards should matter more than old rewards since old information may be outdated.

### The Problem with Sample Averages in Nonstationary Environments

**Sample-average method limitations**:
- Treats all rewards equally: $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i$
- Old rewards from 1000 steps ago get same weight as recent rewards
- **Cannot adapt** when true action values change over time
- **Slow response** to changes in optimal action

**Solution**: Use **constant step-size parameter** to emphasize recent experience.

### Constant Step-Size Method

**Modified incremental update rule** (_Equation 2.5_):
$$Q_{n+1} = Q_n + \alpha[R_n - Q_n]$$

where **step-size parameter** $\alpha \in (0, 1]$ is constant.

**Key differences from sample-average**:
- Step-size doesn't decrease with more samples ($\alpha$ vs. $\frac{1}{n}$)
- More recent rewards get higher effective weight
- Enables tracking of changing environments

### Exponential Recency-Weighted Average

**Mathematical expansion** (_Equation 2.6_):
$$
\begin{align}
Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\
&= \alpha R_n + (1-\alpha)Q_n \\
&= \alpha R_n + (1-\alpha)[\alpha R_{n-1} + (1-\alpha)Q_{n-1}] \\
&= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
&= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2\alpha R_{n-2} + \cdots \\
&\quad + (1-\alpha)^{n-1}\alpha R_1 + (1-\alpha)^n Q_1 \\
&= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i
\end{align}
$$

**Interpretation**:
- **Sum of weights**: $(1-\alpha)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1$ ✓
- **Exponential decay**: Weight for reward $R_i$ is $\alpha(1-\alpha)^{n-i}$
- **Recency bias**: More recent rewards get exponentially higher weight
- **Decay rate**: $(1-\alpha)$ controls how quickly old information is forgotten

**Special cases**:
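- $\alpha = 1$: All weight goes on the most recent reward ($Q_{n+1} = R_n$); equivalently, the decay factor $1-\alpha$ is 0
- $\alpha \to 0$: The estimate barely moves: almost all of the weight stays on the initial estimate $Q_1$, and each reward receives a nearly equal but vanishingly small weight. This is not the sample average, which requires the decreasing step size $\frac{1}{n}$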
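
To make Equation 2.6 concrete, the sketch below runs the recursive update and compares it with the closed-form weighted sum. The function names and the sample numbers are assumptions chosen for illustration.

```python
def constant_alpha_estimate(rewards, alpha, q1=0.0):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from Q_1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def recency_weights(n, alpha):
    """Weight on Q_1, followed by the weights on R_1..R_n from Equation 2.6."""
    w_q1 = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_q1, w_rewards


rewards = [1.0, 3.0, -2.0, 5.0]  # hypothetical rewards R_1..R_4
alpha, q1 = 0.1, 2.0
w_q1, w_rewards = recency_weights(len(rewards), alpha)

recursive = constant_alpha_estimate(rewards, alpha, q1)
closed_form = w_q1 * q1 + sum(w * r for w, r in zip(w_rewards, rewards))
```

The two values agree up to floating-point error, the weights sum to 1, and the weight list increases from the oldest reward to the newest.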

### Step-Size Sequence Convergence Conditions

**General step-size sequence** $\{\alpha_n(a)\}$ **convergence conditions** (_Equation 2.7_):
$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

**Condition interpretations**:
1. **First condition**: Steps large enough to overcome initial conditions and random fluctuations
2. **Second condition**: Steps eventually become small enough to ensure convergence

**Method comparisons**:
- **Sample-average** ($\alpha_n = \frac{1}{n}$): ✓ Both conditions satisfied → **guaranteed convergence**
- **Constant step-size** ($\alpha_n = \alpha$): ✓ First condition, ✗ Second condition → **no convergence guarantee**
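
As a worked check of the two comparisons above (standard results on infinite series, added here for illustration), the sums evaluate as follows:

$$
\begin{align}
\alpha_n = \tfrac{1}{n}: &\quad \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty \\
\alpha_n = \alpha: &\quad \sum_{n=1}^{\infty} \alpha = \infty, \qquad \sum_{n=1}^{\infty} \alpha^2 = \infty
\end{align}
$$

The harmonic series diverges while the series of its squares converges, so $\frac{1}{n}$ passes both tests. Any constant $\alpha > 0$ fails the second.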

### Why Non-Convergence Can Be Good

**For nonstationary problems**:
- **Lack of convergence is desirable**—estimates should track changing values
- **Continued adaptation** to most recent rewards
- **Responsiveness** to environmental changes

**Trade-offs**:
- **Larger $\alpha$**: Faster adaptation, higher variance in estimates
- **Smaller $\alpha$**: Slower adaptation, lower variance, more stable

**Practical note**: Theoretical convergence guarantees often require impractically slow learning rates. Constant step-sizes with good empirical performance are preferred in practice.

### When to Use Each Method

**Sample-average ($\alpha = \frac{1}{n}$)**:
- **Stationary environments** where true values don't change
- When you want **guaranteed convergence** to true values
- **Long-term accuracy** more important than adaptation speed

**Constant step-size ($\alpha = $ constant)**:
- **Nonstationary environments** where values change over time
- When **responsiveness** to recent changes is crucial
- **Online learning** scenarios with ongoing environmental shifts

### Exercise 2.4 Solution
**Q:** If step-size parameters αₙ are not constant, what is the weighting on each prior reward for the general case?

**A:** For general step-size sequence {α₁, α₂, ..., αₙ}, the weight on reward Rᵢ in estimate Qₙ₊₁ is:

$$w_i = \alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j)$$

**Derivation:**
- $Q_{n+1} = Q_n + \alpha_n[R_n - Q_n] = \alpha_n R_n + (1-\alpha_n)Q_n$
- $Q_n = \alpha_{n-1} R_{n-1} + (1-\alpha_{n-1})Q_{n-1}$
- Substituting recursively gives the weight formula above

**Verification:** For constant α, this reduces to $w_i = \alpha(1-\alpha)^{n-i}$, matching equation (2.6).
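
The weight formula can also be checked numerically. In this sketch the names `general_weights` and `recursive_estimate` are illustrative assumptions. The weight on the initial estimate, $\prod_{j=1}^{n}(1-\alpha_j)$, is not stated in the answer above; it follows from the same recursion.

```python
def general_weights(alphas):
    """alphas[i-1] holds alpha_i. Returns (weight on Q_1, [w_1, ..., w_n])."""
    n = len(alphas)
    weights = []
    for i in range(1, n + 1):
        w = alphas[i - 1]              # alpha_i
        for j in range(i + 1, n + 1):
            w *= 1 - alphas[j - 1]     # times (1 - alpha_j) for every later step
        weights.append(w)
    w_q1 = 1.0
    for a in alphas:
        w_q1 *= 1 - a                  # Q_1 is discounted by every step
    return w_q1, weights


def recursive_estimate(alphas, rewards, q1):
    """Direct application of Q <- Q + alpha_n * (R_n - Q)."""
    q = q1
    for a, r in zip(alphas, rewards):
        q += a * (r - q)
    return q
```

With `alphas = [1, 1/2, 1/3, 1/4]` every reward weight comes out as 1/4 and the weight on $Q_1$ is 0, which recovers the sample average.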

### See [nonstationary testbed implementation, here](./code_examples/nonstationary_testbed.py) for Exercise 2.5.

### Key Takeaways

1. **Environment type determines method choice**: Stationary → sample-average, Nonstationary → constant step-size
2. **Step-size controls adaptation speed**: Larger α = faster adaptation but higher variance
3. **Exponential weighting emerges naturally** from constant step-size updates
4. **Non-convergence can be beneficial** when tracking changing environments
5. **This pattern generalizes** beyond bandits to full RL problems

---

## 2.6 Optimistic Initial Values

**Core idea**: Set initial action-value estimates $Q_1(a)$ to optimistically high values to encourage exploration, even with greedy action selection.

### The Technique

- **Standard approach**: Initialize $Q_1(a) = 0$ for all actions
- **Optimistic approach**: Initialize $Q_1(a) = +5$ (when true values $q_*(a) \sim \mathcal{N}(0,1)$)

**Why it works**:
- All actions start with unrealistically high value estimates
- Whichever action is selected first will yield disappointing reward (less than +5)
- Agent becomes "disappointed" and tries other actions
- Continues until all actions tried multiple times
- **Automatic exploration** even with purely greedy selection

### Mechanism and Limitations

**How disappointment drives exploration**:
1. Agent selects action with highest $Q(a)$ (initially all tied at +5)
2. Receives actual reward (much less than +5 in expectation)
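3. Updates $Q(a)$ downward via the incremental update (the book's experiment uses a constant step size $\alpha = 0.1$, so the optimism fades gradually; with a $\frac{1}{n}$ step size the first sample would erase it entirely)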
4. Other actions still have optimistic values, so agent tries them
5. Process continues until realistic estimates emerge
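
The five steps above can be watched in a toy run. This is a sketch under simplifying assumptions: rewards are deterministic (`0.1 * arm`) so the trace is reproducible, ties go to the lowest index, and `optimistic_greedy_picks` is a hypothetical name. $Q_1 = 5$ and $\alpha = 0.1$ follow the book's experiment.

```python
def optimistic_greedy_picks(reward_fn, k, steps, q1=5.0, alpha=0.1):
    """Purely greedy selection starting from optimistic estimates."""
    q = [q1] * k
    picks = []
    for _ in range(steps):
        # max() returns the first maximal index, so ties go to the lowest arm.
        arm = max(range(k), key=lambda a: q[a])
        # Constant step-size update: the estimate drifts down toward the reward.
        q[arm] += alpha * (reward_fn(arm) - q[arm])
        picks.append(arm)
    return picks


# Toy problem (an assumption for illustration): arm a always pays 0.1 * a.
picks = optimistic_greedy_picks(lambda arm: 0.1 * arm, k=4, steps=8)
```

Although selection is purely greedy, the first four picks cover all four arms, and so do the next four, because each pull makes the chosen arm look worse than the ones not yet revisited.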

**When it works well**:
- **Stationary problems** where true values don't change
- **Short-term exploration boost** at beginning of learning
- Problems where initial exploration is most valuable

**Limitations**:
- **Not suitable for nonstationary problems**: Drive for exploration is temporary
- **Beginning-focused**: Only encourages exploration early in learning
- **Task-dependent**: Requires setting appropriate optimistic level
- **One-time effect**: Once estimates converge, no further exploration benefit

**Key insight**: Simple tricks can be surprisingly effective, but they're not universally applicable.

---

## 2.7 Upper-Confidence-Bound Action Selection

**Motivation**: $\epsilon$-greedy explores indiscriminately. Better approach: explore actions based on their **potential** to be optimal, considering both current estimates and uncertainty.

### The UCB Algorithm

**UCB action selection** (_Equation 2.10_):
$$A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$$

**Components**:
- $Q_t(a)$: Current value estimate (exploitation term)
- $c > 0$: Confidence level parameter  
- $\sqrt{\frac{\ln t}{N_t(a)}}$: Uncertainty/confidence bound (exploration term)
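- $N_t(a)$: Number of times action $a$ has been selected prior to time $t$ (if $N_t(a) = 0$, then $a$ is treated as a maximizing action, so every arm is tried once first)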

**Interpretation**: Select action with highest **upper confidence bound** on its true value.
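
A minimal sketch of Equation 2.10 follows. The function name `ucb_select` is an assumption; the default `c=2.0` matches the value used in the book's UCB experiment, and untried arms are returned first, in index order.

```python
import math


def ucb_select(q_values, counts, t, c=2.0):
    """Equation 2.10. counts[a] is N_t(a); t is the current time step (>= 1)."""
    # An untried arm has an unbounded bonus, so it counts as maximizing.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        q + c * math.sqrt(math.log(t) / n)  # estimate plus exploration bonus
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

For example, with estimates `[1.0, 0.9]` and counts `[100, 1]` at `t=101`, the rarely tried second arm wins despite its lower estimate. With `c=0` the rule reduces to greedy selection.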

### Why the Natural Logarithm?

**Mathematical origin**: The $\ln t$ term comes from **concentration inequalities** (particularly Hoeffding bounds) that provide probabilistic guarantees on estimation error.

**Key properties of $\ln t$**:
- **Grows unbounded**: Ensures all actions eventually selected
- **Grows slowly**: Doesn't dominate value estimates too quickly  
- **Time-dependent**: Accounts for overall experience level
- **Theoretical optimality**: Provides optimal regret bounds in certain settings

**Intuitive behavior**:
- As $t$ increases, pressure to explore all actions grows (but slowly)
- As $N_t(a)$ increases, confidence in estimate grows (exploration bonus decreases)
- Actions with fewer samples get higher exploration bonuses

### UCB Mechanism

**How uncertainty drives exploration**:
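1. **Rarely-selected actions**: Large $\sqrt{\frac{\ln t}{N_t(a)}}$ term → high selection priority (UCB is deterministic given the estimates and counts, so this is a priority rather than a probability)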
2. **Frequently-selected actions**: Small uncertainty term → selection based mainly on $Q_t(a)$  
3. **Optimal actions**: Eventually dominate due to both high value and sufficient confidence

**Adaptive exploration**: UCB balances exploitation and exploration automatically as the counts $N_t(a)$ grow, with no explicit exploration rate; the constant $c$ still sets how strongly uncertainty is weighted.

### Performance and Limitations

**Advantages**:
- **Strong empirical performance** on stationary bandits
- **Principled approach** based on statistical confidence
- **Parameter-insensitive**: Less tuning required than $\epsilon$-greedy

**Limitations**:
- **Difficult to extend** beyond bandits to full RL settings
- **Nonstationary problems**: More complex methods needed
- **Large state spaces**: Not practical with function approximation
- **Computational complexity**: More complex than $\epsilon$-greedy

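**Exercise 2.8 Solution**: During the first 10 steps UCB tries each action exactly once, because an action with $N_t(a) = 0$ is treated as maximizing. At step 11 every action has $N_t(a) = 1$, so all exploration bonuses are equal and the choice is purely greedy with respect to the one-sample estimates. That action tends to be a good one, which produces the spike in average reward. Once it has been selected its count rises to 2, its bonus drops below the others', and over the following steps UCB moves on to the remaining actions, so the average reward falls back. With a smaller $c$ the bonus differences matter less and the spike is less pronounced.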

---

## 2.8 Gradient Bandit Algorithms

**Paradigm shift**: Instead of estimating action **values**, learn action **preferences** and select actions probabilistically.

### Preference-Based Action Selection

**Action preferences**: $H_t(a) \in \mathbb{R}$ for each action $a$
- **Not interpretable as values**: Only relative preferences matter
- **Shift invariant**: Adding the same constant to all $H_t(a)$ doesn't change the action probabilities
- **Learned through experience**: Start with $H_1(a) = 0$ (equal preferences)

**Action selection via softmax** (_Equation 2.11_):
$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

### Softmax Function Deep Dive

**Mathematical formulation**: 
$$\text{softmax}(\mathbf{h})_i = \frac{e^{h_i}}{\sum_{j=1}^{k} e^{h_j}}$$

**Why exponential function?**:
- **Always positive**: $e^x > 0$ for all $x$, ensuring valid probabilities
- **Monotonic**: Higher preferences → higher probabilities
- **Differentiable**: Enables gradient-based learning
- **Amplifies differences**: Small preference differences become larger probability differences

**Temperature interpretation**: Implicit temperature $T = 1$. General form:
$$\pi_t(a) = \frac{e^{H_t(a)/T}}{\sum_{b=1}^{k} e^{H_t(b)/T}}$$
- **High T**: More uniform (exploratory)
- **Low T**: More concentrated on best action (exploitative)
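
A small sketch of softmax with a temperature parameter. The function name and the `temperature` argument are illustrative; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(prefs, temperature=1.0):
    """Action probabilities from preferences H(a), with optional temperature T."""
    scaled = [h / temperature for h in prefs]
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


prefs = [1.0, 2.0, 3.0]            # hypothetical preferences
p_default = softmax(prefs)          # T = 1
p_hot = softmax(prefs, 10.0)        # high T: closer to uniform
p_cold = softmax(prefs, 0.1)        # low T: concentrated on the best action
```

Adding the same constant to every preference, e.g. `softmax([101.0, 102.0, 103.0])`, gives the same probabilities as `softmax(prefs)`.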

**Usage in ML**:
- **Neural networks**: Final layer for classification
- **Multinomial logistic regression**: Link function
- **Reinforcement learning**: Policy parameterization
- **Attention mechanisms**: Weighting scheme

**Why appropriate for bandits**:
- **Smooth selection**: Avoids discrete exploration decisions
- **Differentiable**: Enables policy gradient methods
- **Probabilistic**: Natural for stochastic policies
- **Shift-invariant**: Only differences between preferences matter, not their absolute values

### Gradient Bandit Update Rules

**Preference updates** (_Equation 2.12_):

**For selected action** $A_t$:
$$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R}_t)[1 - \pi_t(A_t)]$$

**For non-selected actions** $a \neq A_t$:
$$H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a)$$

**Parameters**:
- $\alpha > 0$: Step-size parameter
- $\bar{R}_t$: Average reward baseline (computed incrementally)
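
Both cases of Equation 2.12 can be written as one expression using an indicator. The sketch below applies a single update; `gradient_bandit_step` is a hypothetical name, and the caller is assumed to maintain the `baseline` (for example as a running mean of past rewards).

```python
import math


def gradient_bandit_step(prefs, arm, reward, baseline, alpha):
    """Apply Equation 2.12 once and return the new list of preferences."""
    # Softmax probabilities pi_t(a), computed stably.
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    pi = [e / total for e in exps]

    advantage = reward - baseline  # R_t minus the baseline
    # Selected arm: +alpha * advantage * (1 - pi); others: -alpha * advantage * pi.
    return [
        h + alpha * advantage * ((1.0 if a == arm else 0.0) - pi[a])
        for a, h in enumerate(prefs)
    ]


# Three arms with equal preferences; arm 1 pays 1.0 against a baseline of 0.
new_prefs = gradient_bandit_step([0.0, 0.0, 0.0], arm=1, reward=1.0,
                                 baseline=0.0, alpha=0.3)
```

The increments always sum to zero, so an update only redistributes preference between the selected arm and the rest.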

### Understanding the Updates

**Reward above baseline** ($R_t > \bar{R}_t$):
- **Selected action**: Preference increases by $\alpha(R_t - \bar{R}_t)[1 - \pi_t(A_t)]$
- **Non-selected actions**: Preferences decrease by $\alpha(R_t - \bar{R}_t)\pi_t(a)$
- **Effect**: Higher reward → increase probability of repeating selected action

**Reward below baseline** ($R_t < \bar{R}_t$):
- **Selected action**: Preference decreases
- **Non-selected actions**: Preferences increase  
- **Effect**: Poor reward → decrease probability of repeating selected action

**Why baseline matters**:
- **Variance reduction**: Reduces update noise, faster learning
- **Reference point**: Provides context for "good" vs "bad" rewards
- **Figure 2.5 evidence**: On the testbed variant whose true expected rewards are centered near +4 instead of 0, omitting the baseline degrades performance significantly

### Theoretical Foundation: Stochastic Gradient Ascent

**Objective**: Maximize expected reward $\mathbb{E}[R_t] = \sum_a \pi_t(a)q_*(a)$

**True gradient**:
$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x q_*(x) \frac{\partial \pi_t(x)}{\partial H_t(a)}$$

**Key mathematical result**:
$$\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)[\mathbf{1}_{a=x} - \pi_t(a)]$$

where $\mathbf{1}_{a=x}$ is the indicator function (1 if $a = x$, else 0).

**Sample-based approximation**: The update rules (2.12) are unbiased estimates of the true gradient, making this algorithm an instance of **stochastic gradient ascent**.
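
The step from the true gradient to the sampled update can be spelled out as follows. This is a sketch of the standard argument; $B_t$ is a symbol introduced here for any scalar that does not depend on $x$. Subtracting it is allowed because $\sum_x \partial \pi_t(x) / \partial H_t(a) = 0$. The third line chooses $B_t = \bar{R}_t$ and uses $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$; the last line substitutes the softmax derivative.

$$\begin{aligned}
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
&= \sum_x \big(q_*(x) - B_t\big)\frac{\partial \pi_t(x)}{\partial H_t(a)} \\
&= \sum_x \pi_t(x)\,\big(q_*(x) - B_t\big)\frac{\partial \pi_t(x)}{\partial H_t(a)}\Big/\pi_t(x) \\
&= \mathbb{E}\!\left[\big(R_t - \bar{R}_t\big)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\right] \\
&= \mathbb{E}\!\left[\big(R_t - \bar{R}_t\big)\big(\mathbf{1}_{a=A_t} - \pi_t(a)\big)\right]
\end{aligned}$$

The quantity inside the final expectation is exactly the increment in (2.12) divided by $\alpha$, so each update is a sample of the gradient.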

**Convergence guarantee**: Because the algorithm is an instance of stochastic gradient ascent, it inherits that method's robust convergence properties.

**Exercise 2.9 Solution**: For two actions with preferences $H_t(1)$ and $H_t(2)$:
$$\pi_t(1) = \frac{e^{H_t(1)}}{e^{H_t(1)} + e^{H_t(2)}} = \frac{1}{1 + e^{H_t(2) - H_t(1)}}$$

This is the **logistic (sigmoid) function** with input $H_t(1) - H_t(2)$, commonly used in binary classification and neural networks.
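
The identity is easy to check numerically. A small sketch; the function names and test preferences are illustrative assumptions.

```python
import math

def softmax2(h1, h2):
    """Probability of action 1 under a two-action softmax."""
    m = max(h1, h2)  # shift for numerical stability
    e1, e2 = math.exp(h1 - m), math.exp(h2 - m)
    return e1 / (e1 + e2)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical preference pairs (H(1), H(2)).
pairs = [(0.0, 0.0), (1.5, -0.5), (-2.0, 3.0)]
max_gap = max(abs(softmax2(h1, h2) - sigmoid(h1 - h2)) for h1, h2 in pairs)
```

`max_gap` should be at the level of floating-point rounding, since both functions compute the same quantity.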

---

## 2.9 Associative Search (Contextual Bandits)

**Bridge to full RL**: Introduces the concept of **state** while maintaining the single-step reward structure of bandits.

### The Associative Search Problem

**Setup**: Multiple bandit problems, with observable context indicating which problem you're facing.

**Example scenario**:
- Multiple slot machines with different colored displays
- Color indicates machine's current reward mapping
- Goal: Learn policy mapping colors (contexts) to optimal actions

**Formal definition**:
- **Context**: Observable state $s_t$ that indicates current bandit problem
- **Policy**: Mapping $\pi : \mathcal{S} \to \mathcal{A}$ from contexts to actions
- **Objective**: Learn optimal action for each context

### Comparison to Other Problems

**vs. Standard k-armed bandit**:
- **Bandit**: Single stationary problem, learn one optimal action
- **Associative search**: Multiple problems, learn optimal action for each context

**vs. Full RL**:
- **Similar**: Both involve learning policies mapping states to actions
- **Different**: Actions only affect immediate reward, not next state
- **Associative search**: $s_{t+1}$ independent of $A_t$
- **Full RL**: $s_{t+1}$ depends on both $s_t$ and $A_t$

### Why Context Matters

**Without context information**:
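- Appears as a single nonstationary bandit
- Best available approach: Use the nonstationary tracking methods from Section 2.5 (e.g., a constant step size)
- Performance is limited by the need to keep re-adapting as the optimal action changes; this works well only if the true action values change slowly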

**With context information**:
- Can learn separate policy for each context
- Much better performance possible
- Exploit structure in the environment

**Key insight**: **Information about state** dramatically improves learning performance when environment has structure.

### Solution Approaches

**Context-specific learning**:
- Maintain separate value estimates $Q_t(s,a)$ for each state-action pair
- Apply bandit algorithms within each context
- Use any method: $\epsilon$-greedy, UCB, gradient bandit, etc.

**Function approximation**:
- Learn general mapping from (context, action) to values
- Important when context space is large or continuous
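
The context-specific approach can be sketched as one sample-average $\epsilon$-greedy learner per context. The class name, its methods, and the toy environment below are hypothetical; the deterministic rewards reuse the numbers from Exercise 2.10 purely for illustration.

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Keeps a separate table of estimates Q(s, a) for every context s."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * n_actions)  # value estimates
        self.n = defaultdict(lambda: [0] * n_actions)    # selection counts

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)    # explore
        values = self.q[context]
        best = max(values)
        # Break ties among greedy actions at random.
        return self.rng.choice([a for a, v in enumerate(values) if v == best])

    def update(self, context, action, reward):
        self.n[context][action] += 1
        step = 1.0 / self.n[context][action]             # sample-average step size
        self.q[context][action] += step * (reward - self.q[context][action])

# Hypothetical environment: the context is drawn at random each step.
rewards = {"A": [10.0, 20.0], "B": [90.0, 80.0]}
agent = ContextualEpsilonGreedy(n_actions=2, epsilon=0.1, seed=0)
env_rng = random.Random(1)
for _ in range(500):
    context = env_rng.choice(["A", "B"])
    action = agent.select(context)
    agent.update(context, action, rewards[context][action])
```

A dictionary keyed by context works only while the set of contexts is small and discrete; for a large or continuous context space a function approximator would have to replace the table.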

**Exercise 2.10 Solution**:

**Without context knowledge**:
- Expected values: Action 1 gives $0.5 \times 10 + 0.5 \times 90 = 50$
- Action 2 gives $0.5 \times 20 + 0.5 \times 80 = 50$
- **Best strategy**: Either action, expected reward = 50

**With context knowledge**:
- **Case A**: Choose action 2 (20 > 10), expected reward = 20
- **Case B**: Choose action 1 (90 > 80), expected reward = 90  
- **Overall**: $0.5 \times 20 + 0.5 \times 90 = 55$
- **Improvement**: 10% higher expected reward (55 vs. 50) just from observing context
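
The arithmetic can be reproduced in a few lines. The variable names are illustrative; the numbers are the ones given in the exercise.

```python
case_prob = {"A": 0.5, "B": 0.5}
true_values = {"A": [10, 20], "B": [90, 80]}   # values of actions 1 and 2 per case

# Without context: commit to one action for both cases.
blind = [
    sum(case_prob[c] * true_values[c][a] for c in case_prob)
    for a in range(2)
]
best_blind = max(blind)

# With context: pick the best action separately in each case.
informed = sum(case_prob[c] * max(true_values[c]) for c in case_prob)

relative_gain = (informed - best_blind) / best_blind
```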

---

## 2.10 Summary

**Chapter overview**: Exploration vs. exploitation in the simplest RL setting—bandits provide foundation for understanding this fundamental tradeoff.

### Method Comparison and Performance

**Key algorithms covered**:

| Method | Core Idea | Best For | Limitations |
|--------|-----------|----------|-------------|
| **Greedy** | Always exploit current best estimate | Deterministic environments | Gets stuck in suboptimal choices |
| **$\epsilon$-greedy** | Random exploration with probability $\epsilon$ | General-purpose, robust | Indiscriminate exploration |
| **Optimistic initialization** | Start with overly high value estimates | Stationary problems | Temporary exploration only |
| **UCB** | Confidence-based exploration | Stationary bandits | Hard to extend to full RL |
| **Gradient bandit** | Learn action preferences, not values | When relative preferences matter | More complex than value-based |

**Performance insights from Figure 2.6**:
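- **Inverted-U shape**: Each method performs best at an intermediate value of its parameter, neither too large nor too small
- **UCB**: Generally best performance on 10-armed testbed
- **Parameter sensitivity**: All methods are fairly insensitive, performing well over parameter ranges spanning about an order of magnitude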
- **No universal winner**: Performance depends on problem characteristics

### Advanced Topics and Future Directions

**Bayesian approaches**:
- **Gittins indices**: Theoretically optimal for specific cases
- **Thompson sampling**: Sample from posterior distributions
- **Computational complexity**: Often intractable for large problems

**Theoretical guarantees**:
- **Regret bounds**: How quickly algorithms approach optimal performance
- **Sample complexity**: Number of steps needed to learn near-optimal policy
- **Exploration efficiency**: Formal measures of exploration quality

**Limitations of current methods**:
- **Distributional assumptions**: Most methods assume specific reward distributions
- **Prior knowledge**: Often require domain-specific parameter tuning
- **Scalability**: Simple methods may not extend to complex RL problems

### Comprehensive Symbol Reference

| Symbol | Meaning | Context |
|--------|---------|---------|
| **Problem Setup** | | |
| $k$ | Number of actions (arms) | Multi-armed bandit |
| $A_t$ | Action selected at time $t$ | Random variable |
| $R_t$ | Reward received at time $t$ | Random variable |
| $q_*(a)$ | True value of action $a$ | $\mathbb{E}[R_t \mid A_t = a]$ |
| **Value Estimation** | | |
| $Q_t(a)$ | Estimated value of action $a$ at time $t$ | Sample average or other estimate |
| $N_t(a)$ | Number of times action $a$ selected by time $t$ | Action count |
| $\alpha$ | Step-size parameter | Learning rate |
| $\bar{R}_t$ | Average reward baseline up to time $t$ | Variance reduction |
| **Action Selection** | | |
| $\epsilon$ | Exploration probability | $\epsilon$-greedy parameter |
| $\pi_t(a)$ | Probability of selecting action $a$ at time $t$ | Stochastic policy |
| $c$ | Confidence level parameter | UCB exploration control |
| **Gradient Bandits** | | |
| $H_t(a)$ | Preference for action $a$ at time $t$ | Not interpretable as value |
| **Mathematical Operators** | | |
| $\arg\max_a f(a)$ | Action $a$ that maximizes $f(a)$ | Greedy selection |
| $\mathbf{1}_{\text{condition}}$ | Indicator function | 1 if condition true, 0 otherwise |
| $\ln t$ | Natural logarithm of $t$ | UCB confidence bound |
| **Update Patterns** | | |
| Target $-$ OldEstimate | Error signal | Universal RL update pattern |
| StepSize $\times$ Error | Update magnitude | Learning rate control |
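
The update pattern in the last two rows can be sketched as a single function. The function name and the reward sequence are illustrative assumptions; a step size of $1/n$ gives the sample average, and a constant step size gives a recency-weighted average.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards for one action
q_avg = 0.0     # sample-average estimate
q_const = 0.0   # constant step-size estimate

for n, r in enumerate(rewards, start=1):
    q_avg = update(q_avg, r, 1.0 / n)
    q_const = update(q_const, r, 0.5)
```

After the loop `q_avg` equals the plain mean of the rewards, while `q_const` weights the most recent rewards more heavily.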

### Key Insights for Full RL

**Foundational concepts**:
1. **Exploration-exploitation tradeoff** appears in all RL problems
2. **Incremental updates** are computationally essential
3. **Step-size choice** affects adaptation vs. stability
4. **Baseline techniques** reduce variance in learning
5. **Context information** dramatically improves performance

**Limitations requiring more advanced methods**:
- **Sequential decision making**: Actions affect future states
- **Credit assignment**: Rewards may be delayed
- **Large state spaces**: Function approximation needed
- **Continuous actions**: Discrete methods don't apply

**Bridge to Chapter 3**: Associative search introduces states but maintains immediate rewards. Next: full RL where actions affect both immediate rewards and future states.

---
---

# Appendix A: Full Notation
---

### General Notation

> Capital letters are used for random variables.

> Lower case letters are used for the values of random variables and for scalar functions.

> Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables). 

> Matrices are bold capitals.

| Symbol | Meaning |
|--------|---------|
| **Standard Operators** |  |
| $\doteq$ | Equality relationship that is true by definition |
| $\approx$ | Approximately equal |
| $\propto$ | Proportional to |
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\ln x$ | Natural logarithm of $x$ |
| $e^x$, $\exp(x)$ | The base of the natural logarithm, $e \approx 2.71828$, carried to power $x$ |
| $\mathbb{R}$ | Set of real numbers |
| $f: \mathcal{X} \to \mathcal{Y}$ | Function from elements of set $\mathcal{X}$ to elements of set $\mathcal{Y}$ |
| $\leftarrow$ | Assignment |
| $(a, b]$ | Real interval between $a$ and $b$ including $b$ but not $a$ |
| **Standard RL Algorithm Parameters** |  |
| $\epsilon$ | Probability of taking a random action in an $\epsilon$-greedy policy |
| $\alpha, \beta$ | Step-size parameters |
| $\gamma$ | Discount-rate parameter |
| $\lambda$ | Decay-rate parameter for eligibility traces |
| $\mathbf{1}_\text{predicate}$ | Indicator function (1 if predicate is true, else 0) |

---

### Multi-Armed Bandit Notation

| Symbol | Meaning |
|--------|---------|
| $k$ | Number of actions (arms) |
| $t$ | Discrete time step or play number |
| $q_*(a)$ | True value (expected reward) of action $a$ |
| $Q_t(a)$ | Estimate at time $t$ of $q_*(a)$ |
| $N_t(a)$ | Number of times action $a$ has been selected up to time $t$ |
| $H_t(a)$ | Learned preference for selecting action $a$ at time $t$ |
| $\pi_t(a)$ | Probability of selecting action $a$ at time $t$ |
| $\bar{R}_t$ | Estimate at time $t$ of the expected reward given $\pi_t$ |

---

### Markov Decision Process Notation

| Symbol | Meaning |
|--------|---------|
| **State & Action Sets** |  |
| $s, s'$ | States |
| $a$ | An action |
| $r$ | A reward |
| $\mathcal{S}$ | Set of all nonterminal states |
| $\mathcal{S}^+$ | Set of all states, incl. terminal state |
| $\mathcal{A}(s)$ | Set of all actions available in state $s$ |
| $\mathcal{R}$ | Set of all possible rewards, a finite subset of $\mathbb{R}$ |
| $\subset$ | Subset of (e.g., $\mathcal{R} \subset \mathbb{R}$) |
| $\in$ | Is an element of (e.g. $s \in \mathcal{S}$, $r \in \mathcal{R}$) |
| $\lvert\mathcal{S}\rvert$ | Number of elements in set $\mathcal{S}$ |
| **Time & Policy** |  |
| $t$ | Discrete time step |
| $T, T(t)$ | Final time step of an episode, or of the episode that includes time step $t$ |
| $A_t$ | Action at time $t$ |
| $S_t$ | State at time $t$ |
| $R_t$ | Reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under deterministic $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under stochastic $\pi$ |
| **Returns** |  |
| $G_t$ | Return following time $t$ |
| $h$ | Horizon (timestep looked up to in forward view) |
| $G_{t:t+n}, G_{t:h}$ | $n$-step return from $t+1$ to $t+n$, or to $h$ (discounted and corrected) |
| $\bar{G}_{t:h}$ | Flat return (undiscounted and uncorrected) from $t+1$ to $h$ |
| $G^\lambda_t$ | $\lambda$-return |
| $G^\lambda_{t:h}$ | Truncated, corrected $\lambda$-return |
| $G^{\lambda s}_t$, $G^{\lambda a}_t$ | $\lambda$-return corrected by estimated state values or action values |
| **Transition & Reward Probabilities** |  |
| $p(s', r \mid s, a)$ | Probability of transition to $s'$ with reward $r$ from $s, a$ |
| $p(s' \mid s, a)$ | Probability of transition to $s'$ from $s$ taking $a$ |
| $r(s, a)$ | Expected immediate reward from $s$ after $a$ |
| $r(s, a, s')$ | Expected reward on transition $s \to s'$ under $a$ |
| **Value Functions** |  |
| $v_\pi(s)$ | Value of $s$ under policy $\pi$ (expected return) |
| $v_*(s)$ | Value of $s$ under optimal policy |
| $q_\pi(s, a)$ | Value of taking $a$ in $s$ under $\pi$ |
| $q_*(s, a)$ | Value of taking $a$ in $s$ under optimal policy |
| **Estimators & TD Error** |  |
| $V, V_t$ | Array estimates of $v_\pi$ or $v_*$ |
| $Q, Q_t$ | Array estimates of $q_\pi$ or $q_*$ |
| $\bar{V}_t(s)$ | Expected approximate action value; e.g. $\bar{V}_t(s) \doteq \sum_a \pi(a \mid s) Q_t(s, a)$ |
| $U_t$ | Target for estimate at time $t$ |
| $\delta_t$ | Temporal-difference (TD) error at $t$ (a random variable) |
| $\delta_t^s$, $\delta_t^a$ | State- and action-specific forms of TD error |
| $n$ | In $n$-step methods, $n$ is number of steps of bootstrapping |

---

### Function Approximation, Policy Gradient, Advanced Notation

| Symbol | Meaning |
|--------|---------|
| **Weight & Parameter Notation** |  |
| $d$ | Dimensionality—number of components of $\mathbf{w}$ |
| $d'$ | Alternate dimensionality—number of components of $\theta$ |
| $\mathbf{w}, \mathbf{w}_t$ | $d$-vector of weights underlying approximate value function |
| $w_i, w_{t,i}$ | $i$-th component of learnable weight vector |
| $\hat{v}(s, \mathbf{w})$ | Approximate value of $s$ given $\mathbf{w}$ |
| $v_\mathbf{w}(s)$ | Alternate notation for $\hat{v}(s, \mathbf{w})$ |
| $\hat{q}(s, a, \mathbf{w})$ | Approximate value of $(s, a)$ given $\mathbf{w}$ |
| $\nabla \hat{v}(s, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| $\nabla \hat{q}(s, a, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| **Feature Vectors & Eligibility Traces** |  |
| $\mathbf{x}(s)$ | Feature vector visible in $s$ |
| $\mathbf{x}(s, a)$ | Feature vector visible in $s$ taking $a$ |
| $x_i(s), x_i(s, a)$ | $i$-th component of $\mathbf{x}(s)$ or $\mathbf{x}(s, a)$ |
| $\mathbf{x}_t$ | Shorthand for $\mathbf{x}(S_t)$ or $\mathbf{x}(S_t, A_t)$ |
| $\mathbf{w}^\top \mathbf{x}$ | Inner product of vectors |
| $\mathbf{v}, \mathbf{v}_t$ | Secondary $d$-vector of weights, used to learn $\mathbf{w}$ |
| $\mathbf{z}_t$ | $d$-vector of eligibility traces at $t$ |
| **Policy Gradient Notation** |  |
| $\theta, \theta_t$ | Parameter vector of target policy |
| $\pi(a \mid s, \theta)$ | Probability of taking $a$ in $s$ given $\theta$ |
| $\pi_\theta$ | Policy corresponding to parameter $\theta$ |
| $\nabla \pi(a \mid s, \theta)$ | Partial derivatives of $\pi(a \mid s, \theta)$ w.r.t. $\theta$ |
| $J(\theta)$ | Performance measure for policy $\pi_\theta$ |
| $\nabla J(\theta)$ | Partial derivatives of $J(\theta)$ w.r.t. $\theta$ |
| $h(s, a, \theta)$ | Preference for $a$ in $s$ based on $\theta$ |
| **Behavior Policy, Baselines, Importance Sampling** |  |
| $b(a \mid s)$ | Behavior policy used to select actions while learning target $\pi$ |
| $b(s)$ | Baseline function $b: \mathcal{S} \to \mathbb{R}$ for policy-gradient methods |
| $b$ | Branching factor for MDP/search tree |
| $\rho_{t:h}$ | Importance sampling ratio for $t$ through $h$ |
| $\rho_t$ | Importance sampling ratio for time $t$ alone, $\rho_t \doteq \rho_{t:t}$ |
| $r(\pi)$ | Average reward (reward rate) for policy $\pi$ |
| $\bar{R}_t$ | Estimate of $r(\pi)$ at time $t$ |
| **State Distributions & Operators** |  |
| $\mu(s)$ | On-policy distribution over states |
| $\mu$ | $\lvert\mathcal{S}\rvert$-vector of the $\mu(s)$ for $s \in \mathcal{S}$ |
| $\|v\|^2_\mu$ | $\mu$-weighted squared norm of $v$, i.e., $\|v\|^2_\mu \doteq \sum_{s \in \mathcal{S}} \mu(s)v(s)^2$ |
| $\eta(s)$ | Expected number of visits to $s$ per episode |
| $\Pi$ | Projection operator for value functions |
| $B_\pi$ | Bellman operator for value functions |

---

### Matrices, Bellman Error, & Error Metrics

| Symbol | Meaning |
|--------|---------|
| **Matrix Notation & Linear Algebra** |  |
| $\mathbf{A}$ | $d \times d$ matrix: $\mathbf{A} \doteq \mathbb{E}\left[ \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \right]$ |
| $\mathbf{b}$ | $d$-dimensional vector: $\mathbf{b} \doteq \mathbb{E}[R_{t+1} \mathbf{x}_t]$ |
| $\mathbf{w}_{TD}$ | TD fixed point: $\mathbf{w}_{TD} \doteq \mathbf{A}^{-1}\mathbf{b}$ |
| $\mathbf{I}$ | Identity matrix |
| $\mathbf{P}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ matrix of state-transition probabilities under $\pi$ |
| $\mathbf{D}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ diagonal matrix with $\mu$ on its diagonal |
| $\mathbf{X}$ | $\lvert\mathcal{S}\rvert \times d$ matrix with the $\mathbf{x}(s)$ as its rows |
| **Bellman Error & Value Error Metrics** |  |
| $\bar{\delta}_\mathbf{w}(s)$ | Bellman error (expected TD error) for $v_\mathbf{w}$ at $s$ |
| $\bar{\delta}_\mathbf{w}$, BE | Bellman error vector (with components $\bar{\delta}_\mathbf{w}(s)$) |
| $\text{VE}(\mathbf{w})$ | Mean square value error: $\text{VE}(\mathbf{w}) \doteq \|v_\mathbf{w} - v_\pi\|^2_\mu$ |
| $\text{BE}(\mathbf{w})$ | Mean square Bellman error: $\text{BE}(\mathbf{w}) \doteq \|\bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{PBE}(\mathbf{w})$ | Mean square projected Bellman error: $\text{PBE}(\mathbf{w}) \doteq \|\Pi \bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{TDE}(\mathbf{w})$ | Mean square temporal-difference error: $\text{TDE}(\mathbf{w}) \doteq \mathbb{E}_b[\rho_t \delta_t^2]$ |
| $\text{RE}(\mathbf{w})$ | Mean square return error |
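
As a small worked example of the $\mu$-weighted norm behind $\text{VE}(\mathbf{w})$, the sketch below uses made-up numbers for a three-state problem. The distribution, both value vectors, and the function name are hypothetical.

```python
def weighted_sq_norm(v, mu):
    """mu-weighted squared norm: sum over states of mu(s) * v(s)^2."""
    return sum(m * x * x for x, m in zip(v, mu))

mu = [0.5, 0.3, 0.2]      # hypothetical on-policy state distribution
v_pi = [1.0, 2.0, 3.0]    # hypothetical true values
v_w = [1.5, 2.0, 2.0]     # hypothetical approximate values

# VE(w) = || v_w - v_pi ||^2 weighted by mu
ve = weighted_sq_norm([a - b for a, b in zip(v_w, v_pi)], mu)
```

Errors in frequently visited states (large $\mu(s)$) contribute more to the total than errors in rarely visited ones.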