# Chapter 1: Introduction
---
---

## Key Realization:
---

> The point in RL is generally to find an optimal policy $\pi^*$ that maximizes total rewards across timesteps (a.k.a. "**Return**").  
> __**BUT**__  
> **KEY REALIZATION**: 🏆In Part I, you're not trying to directly estimate the policy. You're trying to adjust the value function (your estimate of $\mathbb{E}[PV(r^{\pi}_{t+})]$ for all states). 🏆   
> (i.e. Part I is "Value-based methods". Part II is "Policy-based methods".)  
> In TD cases like tic-tac-toe, the next state's Value estimate is used as a proxy for the true Return.  
>
> ---
>
> "Return" ($G_t \approx PV(r_{t+})$) is the cumulative reward from time $t$ onward (often discounted):
> $$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$$
> "Value" ($v_\pi(s)$) is the _expected_ $G_t$ (i.e., expected cumulative future reward) starting from $s$ following policy $\pi$.  
> A "Value Function" ($v_\pi$) is the "function" (or table) containing "Values" at each state.
>
> ---
>
> The point in RL is generally to find an optimal policy $\pi^*$ that maximizes total rewards across timesteps (a.k.a. "**Return**").    
> So:  
> The point in RL is generally to find an optimal policy $\pi^*$ that maximizes **Return** $G_t$.  
> Return can be thought of as $PV(r_{t+})$.
> 
> More precisely,  
> 
> The point in RL training is to make policy $\pi$ approximate optimal policy $\pi^*$ by maximizing **_Expected_ Return** $\mathbb{E}[G_t]$ (a.k.a. "**Value**").  
> So:  
> The point in RL training is to make policy $\pi$ approximate optimal policy $\pi^*$ by maximizing **Value** $v_\pi(s)$ **at each state $s$** (the maximum attainable value is $v_{\pi^*}(s)$).  
> Value can be thought of as $\mathbb{E}[PV(r^{\pi}_{t+})]$.  
> So:  
> The point in RL training is to make policy $\pi$ approximate optimal policy $\pi^*$ by maximizing  **Value** $\mathbb{E}[PV(r^{\pi}_{t+})]$ **at each state $s$**.  
>
> ---
>
> Part I focuses on methods for estimating "**value functions**" ($v_\pi$), which output "values" ($v_\pi(s)$) for all states; a policy can then be derived from those values.  
> So:  
> ***🏆RL training in Part I is estimating "Value Functions" that, at each state, output "Values" that accurately reflect $\mathbb{E}[PV(\text{future rewards yielded by following }\pi)]$🏆.  
> $\pi$ is taken as a given and the policy's Value Function (i.e. mapping of states to values) is what is actually learned.  
> For example, the policy can simply be pre-specified as "pick next state as the one with the highest available Value". So the choice of a policy (like $\epsilon$-greedy) can result in a better Value Function.  
> But "inference" is just an algorithm that references the learned Value Table.***  

## 1.1 Reinforcement Learning
---


- **Definition**: Learning how to map situations to actions to maximize cumulative reward.
- **Core elements**:
  - **Trial-and-error search**: The agent explores to learn.
  - **Delayed reward**: Actions impact long-term outcomes, not just immediate feedback.
- **Distinguishing features**:
  - Not told *what* actions to take, only gets feedback via rewards.
  - Must balance **exploration vs. exploitation**.
- **Compared to other paradigms**:
  - **Supervised learning**: Learns from labeled examples.
  - **Unsupervised learning**: Learns hidden structure from unlabeled data.
  - **RL**: Maximizes rewards via interaction; often needs to explore unknown state spaces.



## 1.2 Examples
---


Real-world analogies include:
- **Chess**: Intuition + planning, learning from position values.
- **Adaptive control**: Tuning refinery settings in real time.
- **Animals**: Gazelle calf learning to run soon after birth.
- **Robots**: Battery-aware navigation decisions.
- **Daily tasks**: Preparing breakfast involves conditional behavior, goals, and sensory feedback.

**Key Takeaway**: RL applies broadly wherever an agent interacts with an environment, learns over time, and must adapt to uncertainty.


## 1.3 The 4 Elements of Reinforcement Learning (policy, reward signal, value function, and model)
---



Four main components:

1. **Policy ($\pi$)**: The agent's way of behaving at a given time by mapping from perceived environment states to actions to be taken (or action probabilities) when in those states. 
   - $\pi$ is the agent's decision-making algorithm (akin to "stimulus-response rules")
   - $\pi(s)$ is the action that policy $\pi$ outputs when given state $s$ as input
   - So $\pi$ is the agent's "mapping" from environment states to actions $\pi(s)$ (deterministic) (or $\pi(a|s)$ (stochastic))
      - In the stochastic case ($\pi(a|s)$), a policy specifies probabilities for each action
   - Can be a lookup table or simple function or an extensive computation (like search)
   - Policy alone is sufficient to determine agent behavior
   - Notation:
      - $\pi(s)$: Action taken in state $s$ under deterministic policy $\pi$
         - Example:
            - If $\pi(s_1) = a_3$, this means "when in state $s_1$, the policy chooses action $a_3$"
               - The policy $\pi$ is the overall mapping/function, but $\pi(s)$ is the concrete action value it returns
         - This is analogous to how $f(x)$ represents the output value of function $f$ when given input $x$, not the function itself.
      - $\pi(a \mid s)$: Probability of taking action $a$ in state $s$ under stochastic policy $\pi$
         - Example:
            - If $\pi(left \mid s_1) = 0.7$ and $\pi(right \mid s_1) = 0.3$, this means "when in state $s_1$, choose left with 70% probability and right with 30% probability"
            - The agent samples from this probability distribution to decide which action to take

2. **Reward Signal ($r$)**: The immediate environmental feedback signal received by the agent at each time step. It is a single number. The agent's sole objective is to maximize the _total_ reward it receives over the long term.
   - $r_t$ is the reward received at time step $t$ (akin to pain/pleasure from taking an action).
   - Reward $r$ is a key variable in defining **Return ($G_t$)**, which is the **cumulative future reward, often discounted**: 
      $$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$$
      - To compare to financial theory, $r_t$ is like a future cashflow and $G_t$ is like the present value of future cash flows.
      - Note that Return $G_t$ is a random variable (since future rewards are uncertain) whose true expected value we want to estimate.
         - **In RL, the point is generally to find an optimal policy $\pi^*$ that maximizes the _EXPECTED_ Return from each state.**
         - The _EXPECTED_ Return is what the value functions estimate. Your estimate of expected long-term reward is called the "value function".
            - So your overall goal is to find the optimal policy $\pi^*$. You do this by estimating value functions (expected cumulative reward) for your current policy $\pi$, then using those estimates to improve $\pi$ toward $\pi^*$.
   - See the note below on learnable parameters

3. **Value Function ($v(s)$, $q(s,a)$)**:
   - Estimates long-term reward (expected return) from states or state-action pairs.
   - The value function is the agent's "learned intuition" about how good different situations are
   - **$v_\pi(s)$**: Expected return from state $s$ under policy $\pi$ - "How good is this state if I follow my current policy?"
      - V-value: $V(s)$ denotes the learned estimate of $v_\pi(s)$
   - **$q_\pi(s, a)$**: Expected return from taking action $a$ in state $s$, then following policy $\pi$ - "How good is this specific action in this state?"
      - Q-value: $Q(s,a)$ denotes the learned estimate of $q_\pi(s, a)$
   - Value functions guide decision-making more than immediate rewards because they capture long-term consequences
   - Unlike rewards (which are given by the environment), value functions are learned estimates that improve over time
   - Think of value functions as the agent's "experience-based predictions" about future success
   - In practice, value functions help answer: "Should I be optimistic or pessimistic about my current situation?"
   - Notation:
      - $v_\pi(s)$: State value function - expected cumulative reward starting from state $s$
      - $q_\pi(s, a)$: Action-value function (Q-function) - expected cumulative reward for taking action $a$ in state $s$
      - Both represent the same underlying concept (expected return) but from different starting points

4. **Model (optional)**:
   - Simulates environment behavior: $p(s', r \mid s, a)$ - "If I take action $a$ in state $s$, what state will I end up in and what reward will I get?"
   - The model is the agent's internal representation of how the world works
   - Enables **planning** (model-based) vs. **direct interaction** (model-free)
   - **Model-based RL**: Agent learns a model of the environment, then uses it to plan optimal actions (like playing out scenarios in your head)
   - **Model-free RL**: Agent learns directly from experience without explicitly modeling the environment (like learning to ride a bike through practice)
   - Models can be:
      - **Perfect**: Complete knowledge of environment dynamics (rare in practice)
      - **Learned**: Estimated from experience (common in model-based RL)
      - **Absent**: No model used (model-free methods like Q-learning)
   - Trade-off: Models enable faster learning through planning, but require additional computational resources and can be inaccurate
   - Example: A chess program might model "if I move my queen here, my opponent will likely respond with these moves"
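
The two policy notations from item 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the state and action labels (`s1`, `a3`, `left`, `right`) come from the examples above, while the dictionaries `deterministic_policy` and `stochastic_policy` and the helper `sample_action` are hypothetical names introduced here.

```python
import random

# Deterministic policy pi(s): a lookup table from state to action.
deterministic_policy = {"s1": "a3"}

# Stochastic policy pi(a|s): for each state, a probability for each action.
stochastic_policy = {"s1": {"left": 0.7, "right": 0.3}}

def sample_action(policy, state, rng):
    """Draw one action from pi(.|state) using the supplied random generator."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_action(stochastic_policy, "s1", rng) for _ in range(10_000)]
left_fraction = samples.count("left") / len(samples)

print(deterministic_policy["s1"])  # the action pi(s1)
print(round(left_fraction, 2))     # empirical frequency of "left", close to 0.7
```

The deterministic policy is a plain lookup, while the stochastic policy has to be sampled each time the agent acts.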


## An Aside: What are the "weights" (learnable parameters) in RL -- the Policy or the Value Function?
---
 


> #### Value-Based vs. Policy-Based Methods
> 
> **Value-based methods**: Learn a value function (estimates of expected returns) and derive the policy from those values (e.g., act greedily with respect to learned values).
> - The value function contains the learnable parameters
> - Policy is implicit - derived from value estimates
> - Examples: 
>   - Choose action with highest Q-value: $\pi(s) = \arg\max_a Q(s,a)$
>   - ε-greedy based on V(s): estimate Q-values from V(s) using one-step lookahead, then act ε-greedily
> 
> **Policy-based methods**: Directly learn a parameterized policy without necessarily learning explicit value functions.
> - The policy contains the learnable parameters  
> - Value functions may be used as auxiliary tools but aren't the primary learned component
> - Examples: Neural network that directly outputs action probabilities
> 
> **Actor-Critic methods**: Hybrid approach that learns both value functions (critic) and policy (actor) with separate parameters.
> 
> #### Classification of Sutton & Barto Methods
>
> **<u>NOTE: Part I methods are all value-based because they focus on learning value tables, with policies derived from those values.</u>**
> 
> **Value-Based Methods (Part I - Tabular):**
> - Multi-armed Bandits: Learn action-value estimates $Q_t(a)$
> - Dynamic Programming: Learn $v_\pi(s)$ or $q_\pi(s,a)$ tables
> - Monte Carlo Methods: Learn value tables $V(s)$ or $Q(s,a)$
> - Temporal Difference Methods: Learn $V(s)$ or $Q(s,a)$ tables
> - Hybrid Methods (n-step TD): Learn value tables with multi-step targets that sit between TD and Monte Carlo (TD(λ) with eligibility traces is covered in Part II)
> 
> **Value-Based Methods (Part II - Function Approximation):**
> - Deep Q-Networks (DQN)
> - Value function approximation with neural networks
> 
> **Policy-Based Methods (Part II):**
> - REINFORCE (policy gradient)
> - Actor-only policy gradient methods
> 
> **Actor-Critic Methods (Part II):**
> - Actor-Critic with function approximation
> - Advanced methods like A3C, PPO, SAC (though these are beyond the book's scope)

---
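
As a rough sketch of how a value-based method derives its policy from learned values, the snippet below implements greedy and ε-greedy action selection over a Q-table. The table contents, the state `s1`, and the function names `greedy_action` and `epsilon_greedy_action` are assumptions made for illustration only.

```python
import random

def greedy_action(Q, state, actions):
    """pi(s) = argmax_a Q(s, a); ties go to the first action listed."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy_action(Q, state, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(Q, state, actions)

# Hypothetical Q-table for a single state with two actions.
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.8}
actions = ["left", "right"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [epsilon_greedy_action(Q, "s1", actions, 0.1, rng) for _ in range(10_000)]
right_fraction = picks.count("right") / len(picks)

print(greedy_action(Q, "s1", actions))  # right
print(round(right_fraction, 2))         # close to 0.9 + 0.1 * 0.5 = 0.95
```

Only the Q-table would change during training; the selection rule itself stays fixed, which is the sense in which the policy is implicit.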

#### Tic-Tac-Toe Example from Chapter 1
In Sutton & Barto's tic-tac-toe example:

* State ($s$): A board configuration (arrangement of X's and O's)
* Action ($a$): A move (placing X in an available square)
* Policy ($\pi$): Rule for selecting moves, mostly greedy (choose move leading to highest-value board) with occasional exploration
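* Reward ($r_t$): The book defines no explicit reward signal for this example; the outcome is encoded in the terminal values (1 for a win, 0 for a loss or draw), which is equivalent to a reward of +1 for a win and 0 otherwise, received only at game end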
* Return ($G_t$): Same as final reward since it's an episodic task with rewards only at termination
   * See **KEY** note below
* Value function ($V(s)$): Estimated probability of winning from each board state
   * See **KEY** note below
* Model: Not explicitly used (model-free approach)
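
To make the Return entry concrete, here is a small sketch that computes $G_t$ from a list of rewards. The function name `discounted_return` and the sample episodes are hypothetical; the only thing taken from the notes is the definition $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$.

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

    `rewards` lists r_{t+1}, r_{t+2}, ... in the order they are received.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # recursive form: G_t = r_{t+1} + gamma * G_{t+1}
    return g

# Tic-tac-toe style episode: zero reward on every move, outcome only at the end.
win_episode = [0, 0, 0, 1]
print(discounted_return(win_episode))           # 1.0, the final reward
print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With no intermediate rewards and $\gamma = 1$, the return from every earlier time step is the same number as the final reward.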

**Learnable parameters**: The value table $V(s)$ containing win probability estimates for each possible board configuration. These values are updated using the temporal difference rule: $$V(S_t) \leftarrow V(S_t) + \alpha [V(S_{t+1}) - V(S_t)]$$

The value function serves the same role as model weights in supervised learning - it's what gets updated during training to improve performance.
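
A worked instance of the update rule may help. The sketch below applies it twice to a hypothetical three-entry value table; the keys `early`, `late`, and `won` and the function name `td_update` are made up for illustration.

```python
def td_update(V, s_t, s_next, alpha=0.1):
    """Move V[s_t] a fraction alpha of the way toward V[s_next]."""
    V[s_t] += alpha * (V[s_next] - V[s_t])
    return V[s_t]

# Hypothetical fragment of a game; "won" stands for a terminal winning board.
V = {"early": 0.5, "late": 0.5, "won": 1.0}

td_update(V, "late", "won")    # 0.5 + 0.1 * (1.0 - 0.5)  = 0.55
td_update(V, "early", "late")  # 0.5 + 0.1 * (0.55 - 0.5) = 0.505
print(V)
```

Each update changes exactly one table entry, and the terminal entry is never modified.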

> <u>**KEY**</u>
> 
> The Return ($G_t$) IS meaningful, but what gets "backpropagated" are value estimate updates, not the return itself. Here's the distinction:
> * Return ($G_t$): The actual cumulative reward from time $t$ onward. In tic-tac-toe:
>     * From any non-terminal state: $G_t$ = final outcome (1 for a win, 0 for a loss or draw, so that $\mathbb{E}[G_t]$ is the probability of winning)
>     * So yes, $G_t$ equals the final reward since there are no intermediate rewards
> * What gets updated: The value estimates $V(S_t)$ using the TD rule: $$V(S_t) \leftarrow V(S_t) + \alpha [V(S_{t+1}) - V(S_t)]$$
> 
> Key insight: The TD update uses the next state's value estimate $V(S_{t+1})$ as a proxy for the true return $G_t$. This is called "bootstrapping†" - using one estimate to update another estimate. 
>
> So the algorithm is trying to make $V(S_t)$ approximate $\mathbb{E}[G_t]$ (the expected return), but it updates using $V(S_{t+1})$ rather than waiting for the actual return.
>
> ---
>
> Aside: The tic-tac-toe example is 1-step bootstrapping‡ (using $V(S_{t+1})$) within the general framework of bootstrapping† (using estimates rather than true returns).
>  
> † "bootstrapping" here is used in the general statistical/RL sense (sometimes seen in finance, specifically in MC methods):
>     - Definition: Using estimates to update other estimates, rather than waiting for true/final values.
>
> ‡ "bootstrapping" here is used as per "n-step bootstrapping":
>     - Definition: How many steps ahead you look before using an estimate as your target.
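
For reference, the two senses of bootstrapping can be lined up as update targets. This is a sketch in the notation of these notes, where $T$ denotes the final time step of the episode and $n$ the number of real rewards used before substituting an estimate:

$$
\begin{aligned}
\text{Monte Carlo target (no bootstrapping):}\quad & G_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T \\
\text{one-step TD target:}\quad & r_{t+1} + \gamma V(S_{t+1}) \\
n\text{-step target:}\quad & r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(S_{t+n})
\end{aligned}
$$

In tic-tac-toe, with zero intermediate rewards and $\gamma = 1$, the one-step target reduces to $V(S_{t+1})$, which is the target used in the TD rule above.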


## 1.4 Limitations and Scope
---


- Assumes a **state signal** is given (e.g., as preprocessed input).
- Reinforcement learning does not inherently solve **state representation**.
- Evolutionary methods can solve RL problems without value functions, but are often less efficient.
- Focus of this book: **model-free and model-based RL** with value estimation.


## 1.5 An Extended Example: Tic-Tac-Toe
---

Goal: Learn to beat an imperfect opponent without knowing its strategy a priori.

> About the best one can do on this problem is first to learn a model of the opponent’s behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

KEY: In this example, "value" is the estimated probability of winning in a given board state. The "value function" is the learned $P(win)$ estimates for ALL board states.

### **Approach _using a value function_**:
- Initialize a "value table" $V(s)$ that maps board states to win probabilities:
   - Each "cell" in the value table corresponds to a possible board state $s$. 
   - Each cell's value is the latest estimate of the probability of winning from that state (that state's _value_ $V(s_i)$).
   - The whole table is the learned _value function_.
   - Initialize the table with:
      - 1s for all states that have 3 Xs in a row.
      - 0s for all states that have 3 Os in a row, or that are filled up without a winner (a draw), since we cannot win from them.
      - 0.5 for all other states.
- Play a bunch of games against the opponent and update the values.
- Use mostly greedy moves (choose highest available $V(s)$), but occasionally explore (choose non-highest $V(s)$).
- Once the game is finished, **after greedy moves only**, update the value of the earlier state ($V(S_t)$) to be closer to the value of the later state ($V(S_{t+1})$):
  $$
  V(S_t) \leftarrow V(S_t) + \alpha \left[V(S_{t+1}) - V(S_t)\right]
  $$
   - where $\alpha$ is a step-size parameter (small positive fraction).
   - **START AT THE TERMINAL STATE AND MOVE BACKWARDS, BACKPROPAGATING THE UPDATES TO $S_0$ FOR GREEDY MOVES ONLY**.
   - Since value updates occur **only after greedy moves**, exploration is not used for learning.
- This is an example of a temporal-difference update rule.
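
The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions that are not in the notes: boards are 9-character strings, the agent plays X and moves first, the opponent picks uniformly random moves, drawn boards are valued at 0, and the stored states are the positions reached after the agent's moves plus the terminal position. All function names are hypothetical. The post-game backward sweep follows the reading used in these notes.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that mark has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def is_over(board):
    return winner(board) is not None or " " not in board

def value(V, board):
    """Look up V(s); unseen boards start at 1 (X won), 0 (lost or drawn), else 0.5."""
    if board not in V:
        if winner(board) == "X":
            V[board] = 1.0
        elif is_over(board):
            V[board] = 0.0   # assumption: a draw counts as "not a win"
        else:
            V[board] = 0.5
    return V[board]

def moves(board, mark):
    """All boards reachable by placing `mark` on an empty square."""
    return [board[:i] + mark + board[i + 1:] for i in range(9) if board[i] == " "]

def play_game(V, rng, epsilon=0.1):
    """Play X against a random O; return [(board, reached_by_non_exploratory_move), ...]."""
    board, history = " " * 9, []
    while True:
        options = moves(board, "X")
        explore = rng.random() < epsilon
        board = rng.choice(options) if explore else max(options, key=lambda b: value(V, b))
        history.append((board, not explore))
        if is_over(board):
            return history
        board = rng.choice(moves(board, "O"))  # assumption: uniformly random opponent
        if is_over(board):
            history.append((board, True))      # terminal board reached by the opponent
            return history

def sweep_updates(V, history, alpha=0.1):
    """Backward sweep of V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)]."""
    for t in range(len(history) - 2, -1, -1):
        s_t, (s_next, greedy) = history[t][0], history[t + 1]
        if greedy:  # skip the backup when S_{t+1} was reached by an exploratory move
            V[s_t] = value(V, s_t) + alpha * (value(V, s_next) - value(V, s_t))

V, rng = {}, random.Random(0)
for _ in range(5000):
    sweep_updates(V, play_game(V, rng))
```

Because every update is a weighted average of two values in $[0, 1]$, the table entries stay interpretable as win-probability estimates throughout training.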

<img src="../img/3.png" alt="tictactoe-backprop" width="50%"/>

#### **Note on mechanics**:
- It can be assumed that Sutton & Barto's tic-tac-toe example is "episodic TD(0)", where the update rule is applied after the game concludes.
   - This means that, after the game, the update rule is applied in reverse order, sweeping from $S_{T-1}$ back to $S_0$, so the final outcome propagates back over multiple timesteps in one pass (updates are skipped for any exploration moves made along the way).
      - "One-step bootstrapping": Each update only uses the immediate next state's value

#### **Why Exploration Helps Even Without Direct Updates**:

Exploration moves don't get updated directly, but they indirectly improve learning by leading the agent into new board states. When the agent eventually makes greedy moves from these newly discovered states, those greedy moves do get updated based on the outcomes. This allows exploration to:
- **Discover better strategies**: Find high-value states that pure greedy play would never encounter
- **Correct overconfident estimates**: Reveal that seemingly good states actually lead to losses

So exploration expands the "training data" that the value function learns from, even though only greedy moves directly update the table.

### **Takeaways**:
- RL learns from interaction without a model of the opponent.
- It backs up values from future to past states.
- Unlike evolutionary methods, RL learns online, and credit assignment is finer-grained.

### **Model-Free Nature**:
- Doesn’t require modeling opponent or future states.
- Works via **temporal-difference (TD) learning**.
- Can generalize using function approximators (e.g., Tesauro’s neural network for Backgammon).


## 1.6 Summary
---


- RL is a third paradigm alongside supervised and unsupervised learning.
- Uses **Markov Decision Processes (MDPs)** to formalize interaction.
- Central tools:
  - **Value functions**
  - **Trial-and-error updates**
- RL = learning by *interacting*, *adapting*, and *optimizing long-term outcomes*.


## 1.7 Early History of Reinforcement Learning
---


**Two foundational threads**:
1. **Trial-and-error learning**:
   - Roots in psychology: Thorndike’s *Law of Effect*, Skinner’s reinforcement, Pavlov’s conditioning.
   - Turing (1948) described pleasure/pain driven learning machines.
   - Shannon’s *Theseus*, Michie’s *MENACE*, and Samuel’s *checkers program* were early examples of learning machines (electromechanical, matchbox-based, and a computer program, respectively).

2. **Optimal control and dynamic programming**:
   - Bellman’s **DP** and **Bellman equations** in 1950s.
   - MDPs (Bellman, 1957) formalized stochastic control problems; Howard (1960) devised policy iteration for them.
   - Combined with function approximation: "neurodynamic programming".

**Modern RL** = Integration of:
- Dynamic programming theory (optimal control)
- Psychological theories of learning
- Computational models like TD-learning

---
---


## Chapter 1 Exercises
---



<img src="../img/4.png" alt="4" width="50%"/>
<br>
<img src="../img/5.png" alt="5" width="50%"/>

### Answers & Intuition

---

#### **Exercise 1.1: Self-Play**
**Q:** What happens if the RL algorithm plays against itself, both sides learning? Would it learn a different policy for selecting moves?

**A:**  
If both sides learn via self-play, the algorithm can converge toward an optimal (or near-optimal) policy for both players, yielding a policy that is robust to strong opponents rather than tuned to one fixed, imperfect opponent. The learned policy may therefore differ from the one learned against a fixed opponent, because the opponent keeps improving and the distribution of experiences becomes more challenging.

**Intuition:**  
Self-play is foundational in RL for learning robust strategies, as seen in AlphaGo/AlphaZero. This question builds intuition for learning from interaction with non-stationary or improving opponents.

---

#### **Exercise 1.2: Symmetries**
**Q:** How might we amend the learning process to exploit board symmetries? How would this help? Should we still do this if the opponent does not exploit symmetries? Do symmetric positions necessarily have the same value?

**A:**  
We can exploit symmetries by treating symmetric board positions as equivalent, updating their values together, which reduces the state space and improves data efficiency. This accelerates learning because knowledge from one symmetric state transfers to others.  
If the opponent does *not* exploit symmetries, we may still benefit, but symmetric positions could end up having different practical values due to asymmetric play. Symmetric positions *should* have the same value only if the opponent's policy is itself symmetric.

**Intuition:**  
This introduces the idea of state abstraction and function approximation—core to generalization and sample efficiency in RL.
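
One way to treat symmetric positions as equivalent is to map every board to a canonical representative before looking it up in the value table. The sketch below assumes boards are 9-character strings read row by row; the names `ROTATE`, `MIRROR`, `symmetric_boards`, and `canonical` are hypothetical. As the answer above notes, sharing values this way is only justified if the opponent also plays symmetrically.

```python
ROTATE = [6, 3, 0, 7, 4, 1, 8, 5, 2]  # new[i] = old[ROTATE[i]]: 90-degree clockwise turn
MIRROR = [2, 1, 0, 5, 4, 3, 8, 7, 6]  # new[i] = old[MIRROR[i]]: left-right flip

def transform(board, mapping):
    """Rearrange the 9 squares of `board` according to `mapping`."""
    return "".join(board[j] for j in mapping)

def symmetric_boards(board):
    """The 8 images of a board under rotations and reflections (may repeat)."""
    images = []
    for _ in range(4):
        images.extend([board, transform(board, MIRROR)])
        board = transform(board, ROTATE)
    return images

def canonical(board):
    """Pick one representative per symmetry class to use as the value-table key."""
    return min(symmetric_boards(board))

corners = ["X" + " " * 8, "  X" + " " * 6, " " * 6 + "X  ", " " * 8 + "X"]
print(len({canonical(b) for b in corners}))  # 1: all four corner openings share a key
```

Indexing the table with `canonical(board)` instead of `board` means an update to one position is shared by all of its mirror images and rotations.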

---

#### **Exercise 1.3: Greedy Play**
**Q:** What happens if the player always acts greedily (never explores)? Might it learn to play better or worse than a non-greedy player? What problems can occur?

**A:**  
A purely greedy player risks converging to suboptimal policies due to lack of exploration ("getting stuck" in local optima), and may never discover better moves. In practice, this can lead to worse performance compared to a player who explores and updates values more broadly. Problems include incomplete learning and overfitting to initial experiences.

**Intuition:**  
This exercise motivates the importance of exploration in RL (vs. exploitation), foundational for concepts like $\epsilon$-greedy, softmax, and exploration-exploitation tradeoff.

---

#### **Exercise 1.4: Learning from Exploration**
**Q:** What if learning updates occur after all moves (including exploratory)? What are the two sets of probabilities? Which method is better for learning/winning?

**A:**  
- **If we update after all moves (including exploratory):** The value estimates reflect the actual probabilities with which moves are taken, blending greedy and exploratory choices.
- **If we update only after greedy moves:** The values reflect the policy if only the best-known moves are always chosen.
- If we continue to make exploratory moves, values learned from all moves describe the behavior actually executed, so acting on them tends to yield more wins while exploration continues. Values learned only from greedy moves describe the greedy policy, which is the better target if exploration is eventually switched off.

**Intuition:**  
This foreshadows "on-policy" vs. "off-policy" learning distinctions and highlights how the target of updates matters in RL algorithms.

---

#### **Exercise 1.5: Other Improvements**
**Q:** What other ways can we improve the RL player? Any better way to solve the tic-tac-toe problem?

**A:**  
Possible improvements:
- Use deeper lookahead (planning) or tree search.
- Incorporate opponent modeling.
- Use value function approximation (e.g., neural nets).
- Exploit more domain knowledge (e.g., forced moves).
- Enhance exploration strategy (e.g., UCB, optimism).

For tic-tac-toe, full minimax search can solve the game exactly, but the RL approach generalizes to much harder problems.

**Intuition:**  
This encourages thinking about RL's flexibility and limitations, foreshadowing the use of model-based planning, opponent modeling, and advanced function approximation.
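
To illustrate the remark that minimax can solve tic-tac-toe exactly, here is a small memoized search. It assumes boards are 9-character strings and scores outcomes from X's point of view as +1, 0, or -1; the function names are hypothetical, and this is a sketch rather than a tuned solver.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that mark has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, to_move):
    """Game value for X under perfect play by both sides: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    other = "O" if to_move == "X" else "X"
    results = [minimax(board[:i] + to_move + board[i + 1:], other)
               for i in range(9) if board[i] == " "]
    return max(results) if to_move == "X" else min(results)

print(minimax(" " * 9, "X"))  # 0: perfect play from the empty board is a draw
```

Unlike the learned value table, this search assumes a perfect opponent, so it would not exploit the mistakes of the imperfect opponent described in Section 1.5.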

---

---
---