# Summary of Notation
---


(full notation in Appendix A)

> Whenever you see the word "value", think "action's Expected Value of reward" (AEVOR). This is usually a $P(win)$ or (in the bandit case) the probability-weighted average of $R_t$, given that the action is taken. It can be either a "true" value (i.e., the population mean of the action's reward distribution) or an estimate of it (i.e., a sample statistic).

| Symbol | Meaning |
|--------|---------|
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ is drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\alpha$, $\beta$, $\epsilon$ | Step-size parameters ($\alpha$, $\beta$) and exploration probability ($\epsilon$) |
| $s, s'$ | States |
| $a$ | Action |
| $r$ | Reward |
| $\mathcal{S}, \mathcal{A}(s), \mathcal{R}$ | Sets of states, actions available in state $s$, and rewards |
| $t, T$ | Discrete time step, final step |
| $S_t, A_t, R_t$ | State, action, and reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under _deterministic_ $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under _stochastic_ $\pi$ |
| $G_t$ | Return from time $t$ |
| $h$ | Horizon (the timestep one looks up to in a forward view) |
| $v_\pi(s)$ | Value of state $s$ under policy $\pi$ |
| $q_\pi(s, a)$ | Value of state-action pair $(s,a)$ under $\pi$ |
| $p(s', r \mid s, a)$ | Transition dynamics |
| $\hat{v}(s, \mathbf{w})$ | Approximate value function using weight vector $\mathbf{w}$ |
| $\delta_t$ | Temporal-difference error at time $t$ |

# Chapter 1: Introduction
---

## 1.1 Reinforcement Learning

- **Definition**: Learning how to map situations to actions to maximize cumulative reward.
- **Core elements**:
  - **Trial-and-error search**: The agent explores to learn.
  - **Delayed reward**: Actions impact long-term outcomes, not just immediate feedback.
- **Distinguishing features**:
  - The learner is not told *what* actions to take; it only gets feedback via rewards.
  - Must balance **exploration vs. exploitation**.
- **Compared to other paradigms**:
  - **Supervised learning**: Learns from labeled examples.
  - **Unsupervised learning**: Learns hidden structure from unlabeled data.
  - **RL**: Maximizes rewards via interaction; often needs to explore unknown state spaces.

---
## 1.2 Examples

Real-world analogies include:
- **Chess**: Intuition + planning, learning from position values.
- **Adaptive control**: Tuning refinery settings in real time.
- **Animals**: Gazelle calf learning to run soon after birth.
- **Robots**: Battery-aware navigation decisions.
- **Daily tasks**: Preparing breakfast involves conditional behavior, goals, and sensory feedback.

**Key Takeaway**: RL applies broadly wherever an agent interacts with an environment, learns over time, and must adapt to uncertainty.

---
## 1.3 Elements of Reinforcement Learning

Four main components:

1. **Policy ($\pi$)**:
   - Mapping from states to actions: $\pi(s)$ or $\pi(a|s)$.
   - Can be deterministic or stochastic.

2. **Reward Signal ($r$)**:
   - Defines the goal.
   - Immediate signal from the environment that indicates what is good or bad right now.

3. **Value Function ($v(s)$, $q(s,a)$)**:
   - Estimates long-term reward.
   - **$v_\pi(s)$**: Expected return from state $s$ under policy $\pi$.
   - Guides decision-making more than immediate rewards.

4. **Model (optional)**:
   - Simulates environment behavior: $p(s', r \mid s, a)$.
   - Enables **planning** (model-based methods); without a model, the agent learns purely from **direct interaction** (model-free methods).
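The two policy forms listed above, $\pi(s)$ and $\pi(a \mid s)$, can be illustrated with a small Python sketch. The two-state problem, the state names `"A"` and `"B"`, and all function names here are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy problem: two states ("A", "B") and two actions.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """pi(s): returns exactly one action for each state."""
    return {"A": "left", "B": "right"}[state]

def stochastic_policy(state):
    """pi(a|s): returns a probability for every action in the given state."""
    table = {
        "A": {"left": 0.9, "right": 0.1},
        "B": {"left": 0.5, "right": 0.5},
    }
    return table[state]

def sample_action(state, rng):
    """Draw one action according to pi(.|state)."""
    probs = stochastic_policy(state)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]

rng = random.Random(0)
print(deterministic_policy("A"))   # always the same action for state "A"
print(sample_action("A", rng))     # usually "left", occasionally "right"
```

A deterministic policy is the special case of a stochastic one in which a single action has probability 1 in every state.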

---

## 1.4 Limitations and Scope

- Assumes a **state signal** is given (e.g., as preprocessed input).
- Reinforcement learning does not inherently solve **state representation**.
- Evolutionary methods can solve RL problems without value functions, but they are often less efficient.
- Focus of this book: **model-free and model-based RL** with value estimation.

---

## 1.5 An Extended Example: Tic-Tac-Toe

Goal: Learn to beat an imperfect opponent without knowing its strategy a priori.

> About the best one can do on this problem is first to learn a model of the opponent’s behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

KEY: In this example, "value" is the estimated probability of winning in a given board state. The "value function" is the learned $P(win)$ estimates for ALL board states.

### **Approach _using a value function_**:
- Initialize a "value table" $V(s)$ that maps board states to win probabilities:
   - Each "cell" in the value table corresponds to a possible board state $s$. 
   - Each cell's value is the latest estimate of the probability of winning from that state (that state's _value_ $V(s_i)$).
   - The whole table is the learned _value function_.
   - Initialize the table with:
      - 1s for all states that have 3 Xs in a row.
      - 0s for all states that have 3 Os in a row.
      - 0.5 for all other states.
- Play a bunch of games against the opponent and update the values.
- Use mostly greedy moves (choose the move that leads to the state with the highest $V(s)$), but occasionally explore (choose a move that leads to a lower-valued state).
- Once the game is finished, **after greedy moves only**, update the value of the previous state, $V(S_t)$, to be closer to the value of the state that followed it, $V(S_{t+1})$:
  $$
  V(S_t) \leftarrow V(S_t) + \alpha \left[V(S_{t+1}) - V(S_t)\right]
  $$
   - where $\alpha$ is a step-size parameter (small positive fraction).
   - **START AT THE TERMINAL STATE AND MOVE BACKWARDS, BACKPROPAGATING THE UPDATES TO $S_0$ FOR GREEDY MOVES ONLY**.
   - Since value updates occur **only after greedy moves**, exploratory moves never trigger an update themselves; they influence learning only indirectly, by changing which states are visited.
- This is an example of a temporal-difference update rule.
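The backward sweep described above can be sketched in Python. This is a minimal illustration that follows these notes' assumption that updates are applied after the game ends; the function name `backup_game`, the `was_greedy` flags, and the opaque string state keys are hypothetical, not taken from the book.

```python
def backup_game(values, states, was_greedy, alpha=0.1, default=0.5):
    """Sweep backwards over one finished game and apply the update rule.

    values:     dict mapping state -> current estimate V(s); terminal states
                are expected to be present already (1.0 for a win, 0.0 for a loss)
    states:     [S_0, S_1, ..., S_T] in the order they occurred
    was_greedy: was_greedy[t] is True if the move from S_t to S_{t+1} was greedy
    """
    for t in range(len(states) - 2, -1, -1):
        if not was_greedy[t]:
            continue  # exploratory move: no update for S_t
        v_t = values.get(states[t], default)
        v_next = values.get(states[t + 1], default)
        # V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)]
        values[states[t]] = v_t + alpha * (v_next - v_t)
    return values

# A won game with one exploratory move (from "s1" to "s2").
values = {"win": 1.0}
backup_game(values, ["s0", "s1", "s2", "win"], [True, False, True], alpha=0.5)
print(values)  # "s2" moves toward 1.0; "s0" stays at 0.5; "s1" is never written
```

Unvisited states fall back to the default of 0.5 here, which matches the initialization described in the list above.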

<img src="../img/3.png" alt="tictactoe-backprop" width="50%"/>

#### **Note on mechanics**:
- It can (I think) be assumed that Sutton & Barto's tic-tac-toe example is "episodic TD(0)", where the update rule is applied after the game concludes.
   - This means the post-game sweep applies the one-step update in reverse order ("one-step bootstrapping"), from $S_{T-1}$ back to $S_0$, skipping any exploratory moves made along the way. Because later states are updated first, the final outcome can propagate back over multiple timesteps in a single sweep.
- "Online TD(0)": each update happens immediately after a step. 
   - In the first few games, the only signal comes from terminal states, because the values of intermediate states do not yet differ from one another and so cannot drive updates.
      - In game 1, since all non-final board states are initialized to 0.5, $V(S_{t+1})$ does not change the previous state's value for any intermediate timestep (because $V(S_{t+1}) - V(S_t) = 0$). However, the second-to-last board state does get updated by either $\alpha \left[1 - 0.5\right]$ or $\alpha \left[0 - 0.5\right]$, depending on a win or a loss. In later games, the third-from-last board state _may_ (or may not) get updated, because the value of the second-to-last state now differs from 0.5.
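As a worked example of the first-game arithmetic above, assume a step size of $\alpha = 0.1$ (an illustrative value, not one given in the text):

$$
\begin{aligned}
\text{intermediate step:}\quad & V(S_t) \leftarrow 0.5 + 0.1\left[0.5 - 0.5\right] = 0.5 \\
\text{final step, win:}\quad & V(S_{T-1}) \leftarrow 0.5 + 0.1\left[1 - 0.5\right] = 0.55 \\
\text{final step, loss:}\quad & V(S_{T-1}) \leftarrow 0.5 + 0.1\left[0 - 0.5\right] = 0.45
\end{aligned}
$$

Only the state just before the terminal state moves away from 0.5 in game 1; earlier states can start to change only once a later state on the same path has changed.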

#### **Why Exploration Helps Even Without Direct Updates**:

Exploration moves don’t directly cause value updates, but they indirectly shape which states you’ll encounter later in the game. By choosing exploratory actions, the learner encounters and evaluates board states that would never appear if playing purely greedily. This leads to two key impacts on learning:
1. Indirect discovery of better strategies:
   - Exploration moves lead the agent into unfamiliar board states. While these states themselves aren’t updated immediately, the greedy moves taken afterwards (once back on the greedy path) do get updated based on these newly discovered outcomes.
   - Thus, exploratory moves can uncover states with higher true values than currently known, indirectly reshaping the value function.
2. Indirect correction of overly optimistic values:
   - Conversely, exploration can also reveal states that initially appeared promising (due to limited experience), but actually lead to losses.
   - Even though the exploratory moves themselves aren’t updated directly, they still “set up” future greedy moves to be evaluated realistically, correcting inflated values indirectly.

### **Takeaways**:
- RL learns from interaction without a model of the opponent.
- It backs up values from future to past states.
- Unlike evolutionary methods, RL learns online, and credit assignment is finer-grained.

### **Model-Free Nature**:
- Doesn’t require modeling opponent or future states.
- Works via **temporal-difference (TD) learning**.
- Can generalize using function approximators (e.g., Tesauro’s neural network for Backgammon).

---

## 1.6 Summary

- RL is a third paradigm alongside supervised and unsupervised learning.
- Uses **Markov Decision Processes (MDPs)** to formalize interaction.
- Central tools:
  - **Value functions**
  - **Trial-and-error updates**
- RL = learning by *interacting*, *adapting*, and *optimizing long-term outcomes*.

---

## 1.7 Early History of Reinforcement Learning

**Two foundational threads**:
1. **Trial-and-error learning**:
   - Roots in psychology: Thorndike’s *Law of Effect*, Skinner’s reinforcement, Pavlov’s conditioning.
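   - Turing (1948) described a learning machine driven by a "pleasure-pain system".
   - Shannon’s *Theseus* (an electromechanical maze-running mouse), Michie’s *MENACE* (built from matchboxes), and Samuel’s *checkers program* were early examples of learning machines and programs.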

2. **Optimal control and dynamic programming**:
   - Bellman’s **DP** and **Bellman equations** in the 1950s.
   - MDPs (Bellman, 1957) formalized stochastic control problems; Howard (1960) devised policy iteration for them.
   - Combined with function approximation: "neurodynamic programming".

**Modern RL** = Integration of:
- Dynamic programming theory (optimal control)
- Psychological theories of learning
- Computational models like TD-learning

---

## Chapter 1 Exercises

<img src="../img/4.png" alt="4" width="50%"/>
<br>
<img src="../img/5.png" alt="5" width="50%"/>

### Answers & Intuition

---

#### **Exercise 1.1: Self-Play**
**Q:** What happens if the RL algorithm plays against itself, both sides learning? Would it learn a different policy for selecting moves?

**A:**  
If both sides learn via self-play, the algorithm can converge to an optimal (or near-optimal) policy for both players, potentially leading to a policy that is robust to strong opponents, not just imperfect ones. The learned policy may differ from what is learned against a fixed, imperfect opponent, because the opponent keeps improving and the distribution of experiences becomes more challenging.

**Intuition:**  
Self-play is foundational in RL for learning robust strategies, as seen in AlphaGo/AlphaZero. This question builds intuition for learning from interaction with non-stationary or improving opponents.

---

#### **Exercise 1.2: Symmetries**
**Q:** How might we amend the learning process to exploit board symmetries? How would this help? Should we still do this if the opponent does not exploit symmetries? Do symmetric positions necessarily have the same value?

**A:**  
We can exploit symmetries by treating symmetric board positions as equivalent, updating their values together, which reduces the state space and improves data efficiency. This accelerates learning because knowledge from one symmetric state transfers to others.  
If the opponent does *not* exploit symmetries, we may still benefit, but symmetric positions could end up having different practical values due to asymmetric play. Symmetric positions *should* have the same value only if the opponent's policy is itself symmetric.

**Intuition:**  
This introduces the idea of state abstraction and function approximation—core to generalization and sample efficiency in RL.
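One way to treat symmetric positions as equivalent is to map every board to a canonical representative before looking it up in the value table. The sketch below assumes a board stored as a 9-character row-major string with `.` for empty cells; that representation and the helper names are hypothetical. As the answer above notes, merging symmetric states is only fully justified when the opponent's play is itself symmetric.

```python
def rotate(board):
    """Rotate a 3x3 board (row-major string of 9 chars) 90 degrees clockwise."""
    return "".join(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    """Reflect the board left-to-right."""
    return "".join(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def symmetries(board):
    """Return the 8 rotations and reflections of the board (may contain repeats)."""
    out = []
    b = board
    for _ in range(4):
        out.append(b)
        out.append(mirror(b))
        b = rotate(b)
    return out

def canonical(board):
    """Pick one fixed representative of the symmetry class (the smallest string)."""
    return min(symmetries(board))

# All four corner openings share a single entry in the value table.
corners = ["X........", "..X......", "......X..", "........X"]
print({canonical(b) for b in corners})  # a set with exactly one element
```

Looking up `values[canonical(board)]` instead of `values[board]` means an update to one position is shared with all of its rotations and reflections.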

---

#### **Exercise 1.3: Greedy Play**
**Q:** What happens if the player always acts greedily (never explores)? Might it learn to play better or worse than a non-greedy player? What problems can occur?

**A:**  
A purely greedy player risks converging to suboptimal policies due to lack of exploration ("getting stuck" in local optima), and may never discover better moves. In practice, this can lead to worse performance compared to a player who explores and updates values more broadly. Problems include incomplete learning and overfitting to initial experiences.

**Intuition:**  
This exercise motivates the importance of exploration in RL (vs. exploitation), foundational for concepts like $\epsilon$-greedy, softmax, and exploration-exploitation tradeoff.
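A minimal sketch of $\epsilon$-greedy move selection follows. The function name, the `candidate_values` dictionary (move mapped to the estimated value of the resulting state), and the returned `was_greedy` flag are assumptions made for illustration. Here an exploratory move is always drawn from the non-greedy moves; other variants draw uniformly from all moves.

```python
import random

def epsilon_greedy(candidate_values, epsilon, rng):
    """Return (move, was_greedy) for one decision.

    candidate_values: dict mapping each legal move to the estimated value
                      of the state that move leads to
    epsilon:          probability of making an exploratory move
    """
    best = max(candidate_values, key=candidate_values.get)
    others = [m for m in candidate_values if m != best]
    if others and rng.random() < epsilon:
        return rng.choice(others), False  # exploratory move
    return best, True                     # greedy move

rng = random.Random(0)
moves = {"a": 0.5, "b": 0.9, "c": 0.1}
print(epsilon_greedy(moves, 0.0, rng))  # epsilon = 0 is the purely greedy player
print(epsilon_greedy(moves, 0.1, rng))  # mostly greedy, occasionally exploratory
```

Setting `epsilon` to 0 reproduces the purely greedy player from the exercise: a move whose estimate starts too low is never tried again, so its estimate can never be corrected.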

---

#### **Exercise 1.4: Learning from Exploration**
**Q:** What if learning updates occur after all moves (including exploratory)? What are the two sets of probabilities? Which method is better for learning/winning?

**A:**  
- **If we update after all moves (including exploratory):** The value estimates reflect the actual probabilities with which moves are taken, blending greedy and exploratory choices.
- **If we update only after greedy moves:** The values reflect the policy if only the best-known moves are always chosen.
- Continuing to make exploratory moves but learning only from greedy moves focuses learning on the policy we intend to follow for winning, which usually results in better performance than learning from all moves (which blends exploration into the policy).

**Intuition:**  
This foreshadows "on-policy" vs. "off-policy" learning distinctions and highlights how the target of updates matters in RL algorithms.

---

#### **Exercise 1.5: Other Improvements**
**Q:** What other ways can we improve the RL player? Any better way to solve the tic-tac-toe problem?

**A:**  
Possible improvements:
- Use deeper lookahead (planning) or tree search.
- Incorporate opponent modeling.
- Use value function approximation (e.g., neural nets).
- Exploit more domain knowledge (e.g., forced moves).
- Enhance exploration strategy (e.g., UCB, optimism).

For tic-tac-toe, full minimax search can solve the game exactly, but the RL approach generalizes to much harder problems.

**Intuition:**  
This encourages thinking about RL's flexibility and limitations, foreshadowing the use of model-based planning, opponent modeling, and advanced function approximation.
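To make the minimax remark in the answer concrete, here is a short sketch of an exact solver. The 9-character string board, the `.` empty marker, and the function names are assumptions for illustration. Note that minimax assumes a perfect opponent, whereas the RL player in Section 1.5 learns against an imperfect one.

```python
from functools import lru_cache

# The eight winning lines on a row-major 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value for X under perfect play by both sides.

    Returns +1 if X wins, 0 for a draw, -1 if O wins; `player` is to move.
    """
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    other = "O" if player == "X" else "X"
    results = [minimax(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == "."]
    return max(results) if player == "X" else min(results)

print(minimax("." * 9, "X"))  # 0: a draw under perfect play
```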

---


# Appendix A: Full Notation
---

### General Notation

> Capital letters are used for random variables.

> Lower case letters are used for the values of random variables and for scalar functions.

> Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables). 

> Matrices are bold capitals.

| Symbol | Meaning |
|--------|---------|
| **Standard Operators** |  |
| $\doteq$ | Equality relationship that is true by definition |
| $\approx$ | Approximately equal |
| $\propto$ | Proportional to |
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\ln x$ | Natural logarithm of $x$ |
| $e^x$, $\exp(x)$ | The base of the natural logarithm, $e \approx 2.71828$, carried to power $x$ |
| $\mathbb{R}$ | Set of real numbers |
| $f: \mathcal{X} \to \mathcal{Y}$ | Function from elements of set $\mathcal{X}$ to elements of set $\mathcal{Y}$ |
| $\leftarrow$ | Assignment |
| $(a, b]$ | Real interval between $a$ and $b$ including $b$ but not $a$ |
| **Standard RL Algorithm Parameters** |  |
| $\epsilon$ | Probability of taking a random action in an $\epsilon$-greedy policy |
| $\alpha, \beta$ | Step-size parameters |
| $\gamma$ | Discount-rate parameter |
| $\lambda$ | Decay-rate parameter for eligibility traces |
| $\mathbf{1}_\text{predicate}$ | Indicator function (1 if predicate is true, else 0) |

---

### Multi-Armed Bandit Notation

| Symbol | Meaning |
|--------|---------|
| $k$ | Number of actions (arms) |
| $t$ | Discrete time step or play number |
| $q_*(a)$ | True value (expected reward) of action $a$ |
| $Q_t(a)$ | Estimate at time $t$ of $q_*(a)$ |
| $N_t(a)$ | Number of times action $a$ has been selected up to time $t$ |
| $H_t(a)$ | Learned preference for selecting action $a$ at time $t$ |
| $\pi_t(a)$ | Probability of selecting action $a$ at time $t$ |
| $\bar{R}_t$ | Estimate at time $t$ of the expected reward given $\pi_t$ |

---

### Markov Decision Process Notation

| Symbol | Meaning |
|--------|---------|
| **State & Action Sets** |  |
| $s, s'$ | States |
| $a$ | An action |
| $r$ | A reward |
| $\mathcal{S}$ | Set of all nonterminal states |
| $\mathcal{S}^+$ | Set of all states, incl. terminal state |
| $\mathcal{A}(s)$ | Set of all actions available in state $s$ |
| $\mathcal{R}$ | Set of all possible rewards, a finite subset of $\mathbb{R}$ |
| $\subset$ | Subset of (e.g., $\mathcal{R} \subset \mathbb{R}$) |
| $\in$ | Is an element of (e.g. $s \in \mathcal{S}$, $r \in \mathcal{R}$) |
| $\lvert\mathcal{S}\rvert$ | Number of elements in set $\mathcal{S}$ |
| **Time & Policy** |  |
| $t$ | Discrete time step |
| $T, T(t)$ | Final time step of an episode, or of the episode that includes time step $t$ |
| $A_t$ | Action at time $t$ |
| $S_t$ | State at time $t$ |
| $R_t$ | Reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under deterministic $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under stochastic $\pi$ |
| **Returns** |  |
| $G_t$ | Return following time $t$ |
| $h$ | Horizon (timestep looked up to in forward view) |
| $G_{t:t+n}, G_{t:h}$ | $n$-step return from $t+1$ to $t+n$, or to $h$ (discounted and corrected) |
| $\bar{G}_{t:h}$ | Flat return (undiscounted and uncorrected) from $t+1$ to $h$ |
| $G^\lambda_t$ | $\lambda$-return |
| $G^\lambda_{t:h}$ | Truncated, corrected $\lambda$-return |
| $G^{\lambda s}_t$, $G^{\lambda a}_t$ | $\lambda$-return, corrected by estimated state or action values |
| **Transition & Reward Probabilities** |  |
| $p(s', r \mid s, a)$ | Probability of transition to $s'$ with reward $r$ from $s, a$ |
| $p(s' \mid s, a)$ | Probability of transition to $s'$ from $s$ taking $a$ |
| $r(s, a)$ | Expected immediate reward from $s$ after $a$ |
| $r(s, a, s')$ | Expected reward on transition $s \to s'$ under $a$ |
| **Value Functions** |  |
| $v_\pi(s)$ | Value of $s$ under policy $\pi$ (expected return) |
| $v_*(s)$ | Value of $s$ under optimal policy |
| $q_\pi(s, a)$ | Value of taking $a$ in $s$ under $\pi$ |
| $q_*(s, a)$ | Value of taking $a$ in $s$ under optimal policy |
| **Estimators & TD Error** |  |
| $V, V_t$ | Array estimates of $v_\pi$ or $v_*$ |
| $Q, Q_t$ | Array estimates of $q_\pi$ or $q_*$ |
| $\bar{V}_t(s)$ | Expected approximate action value; e.g. $\bar{V}_t(s) \doteq \sum_a \pi(a \mid s) Q_t(s, a)$ |
| $U_t$ | Target for estimate at time $t$ |
| $\delta_t$ | Temporal-difference (TD) error at $t$ (a random variable) |
| $\delta_t^s$, $\delta_t^a$ | State- and action-specific forms of TD error |
| $n$ | In $n$-step methods, $n$ is number of steps of bootstrapping |

---
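As a small illustration of how $G_t$, $\gamma$, and $\delta_t$ from the tables above fit together, the sketch below uses the standard definitions $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ and $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. The function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """G_t for rewards = [R_{t+1}, R_{t+2}, ...] of a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

def td_error(reward, v_s, v_next, gamma):
    """delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return reward + gamma * v_next - v_s

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(td_error(0.0, 0.5, 1.0, 1.0))             # 0.5
```

With zero reward and $\gamma = 1$, the TD error reduces to $V(S_{t+1}) - V(S_t)$, the bracketed term in the tic-tac-toe update rule from Chapter 1.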

### Function Approximation, Policy Gradient, Advanced Notation

| Symbol | Meaning |
|--------|---------|
| **Weight & Parameter Notation** |  |
| $d$ | Dimensionality—number of components of $\mathbf{w}$ |
| $d'$ | Alternate dimensionality—number of components of $\theta$ |
| $\mathbf{w}, \mathbf{w}_t$ | $d$-vector of weights underlying approximate value function |
| $w_i, w_{t,i}$ | $i$-th component of learnable weight vector |
| $\hat{v}(s, \mathbf{w})$ | Approximate value of $s$ given $\mathbf{w}$ |
| $v_\mathbf{w}(s)$ | Alternate notation for $\hat{v}(s, \mathbf{w})$ |
| $\hat{q}(s, a, \mathbf{w})$ | Approximate value of $(s, a)$ given $\mathbf{w}$ |
| $\nabla \hat{v}(s, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| $\nabla \hat{q}(s, a, \mathbf{w})$ | Column vector of partial derivatives w.r.t. $\mathbf{w}$ |
| **Feature Vectors & Eligibility Traces** |  |
| $\mathbf{x}(s)$ | Feature vector visible in $s$ |
| $\mathbf{x}(s, a)$ | Feature vector visible in $s$ taking $a$ |
| $x_i(s), x_i(s, a)$ | $i$-th component of $\mathbf{x}(s)$ or $\mathbf{x}(s, a)$ |
| $\mathbf{x}_t$ | Shorthand for $\mathbf{x}(S_t)$ or $\mathbf{x}(S_t, A_t)$ |
| $\mathbf{w}^\top \mathbf{x}$ | Inner product of vectors |
| $\mathbf{v}, \mathbf{v}_t$ | Secondary $d$-vector of weights, used to learn $\mathbf{w}$ |
| $\mathbf{z}_t$ | $d$-vector of eligibility traces at $t$ |
| **Policy Gradient Notation** |  |
| $\theta, \theta_t$ | Parameter vector of target policy |
| $\pi(a \mid s, \theta)$ | Probability of taking $a$ in $s$ given $\theta$ |
| $\pi_\theta$ | Policy corresponding to parameter $\theta$ |
| $\nabla \pi(a \mid s, \theta)$ | Partial derivatives of $\pi(a \mid s, \theta)$ w.r.t. $\theta$ |
| $J(\theta)$ | Performance measure for policy $\pi_\theta$ |
| $\nabla J(\theta)$ | Partial derivatives of $J(\theta)$ w.r.t. $\theta$ |
| $h(s, a, \theta)$ | Preference for $a$ in $s$ based on $\theta$ |
| **Behavior Policy, Baselines, Importance Sampling** |  |
| $b(a \mid s)$ | Behavior policy used to select actions while learning target $\pi$ |
| $b(s)$ | Baseline function $b: \mathcal{S} \to \mathbb{R}$ for policy-gradient methods |
| $b$ | Branching factor for MDP/search tree |
| $\rho_{t:h}$ | Importance sampling ratio for $t$ through $h$ |
| $\rho_t$ | Importance sampling ratio for time $t$ alone, $\rho_t \doteq \rho_{t:t}$ |
| $r(\pi)$ | Average reward (reward rate) for policy $\pi$ |
| $\bar{R}_t$ | Estimate of $r(\pi)$ at time $t$ |
| **State Distributions & Operators** |  |
| $\mu(s)$ | On-policy distribution over states |
| $\mu$ | $\lvert\mathcal{S}\rvert$-vector of the $\mu(s)$ for $s \in \mathcal{S}$ |
| $\|v\|^2_\mu$ | $\mu$-weighted squared norm of $v$, i.e., $\|v\|^2_\mu \doteq \sum_{s \in \mathcal{S}} \mu(s)v(s)^2$ |
| $\eta(s)$ | Expected number of visits to $s$ per episode |
| $\Pi$ | Projection operator for value functions |
| $B_\pi$ | Bellman operator for value functions |

---

### Matrices, Bellman Error, & Error Metrics

| Symbol | Meaning |
|--------|---------|
| **Matrix Notation & Linear Algebra** |  |
| $\mathbf{A}$ | $d \times d$ matrix: $\mathbf{A} \doteq \mathbb{E}\left[ \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \right]$ |
| $\mathbf{b}$ | $d$-dimensional vector: $\mathbf{b} \doteq \mathbb{E}[R_{t+1} \mathbf{x}_t]$ |
| $\mathbf{w}_{TD}$ | TD fixed point: $\mathbf{w}_{TD} \doteq \mathbf{A}^{-1}\mathbf{b}$ |
| $\mathbf{I}$ | Identity matrix |
| $\mathbf{P}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ matrix of state-transition probabilities under $\pi$ |
| $\mathbf{D}$ | $\lvert\mathcal{S}\rvert \times \lvert\mathcal{S}\rvert$ diagonal matrix with $\mu$ on its diagonal |
| $\mathbf{X}$ | $\lvert\mathcal{S}\rvert \times d$ matrix with the $\mathbf{x}(s)$ as its rows |
| **Bellman Error & Value Error Metrics** |  |
| $\bar{\delta}_\mathbf{w}(s)$ | Bellman error (expected TD error) for $v_\mathbf{w}$ at $s$ |
| $\bar{\delta}_\mathbf{w}$, BE | Bellman error vector (with components $\bar{\delta}_\mathbf{w}(s)$) |
| $\text{VE}(\mathbf{w})$ | Mean square value error: $\text{VE}(\mathbf{w}) \doteq \|v_\mathbf{w} - v_\pi\|^2_\mu$ |
| $\text{BE}(\mathbf{w})$ | Mean square Bellman error: $\text{BE}(\mathbf{w}) \doteq \|\bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{PBE}(\mathbf{w})$ | Mean square projected Bellman error: $\text{PBE}(\mathbf{w}) \doteq \|\Pi \bar{\delta}_\mathbf{w}\|^2_\mu$ |
| $\text{TDE}(\mathbf{w})$ | Mean square temporal-difference error: $\text{TDE}(\mathbf{w}) \doteq \mathbb{E}_b[\rho_t \delta_t^2]$ |
| $\text{RE}(\mathbf{w})$ | Mean square return error |