## Summary of Notation

| Symbol | Meaning |
|--------|---------|
| $\Pr\{X = x\}$ | Probability that random variable $X$ takes value $x$ |
| $X \sim p$ | $X$ is drawn from distribution $p$ |
| $\mathbb{E}[X]$ | Expectation of random variable $X$ |
| $\arg\max_a f(a)$ | Value of $a$ that maximizes $f(a)$ |
| $\alpha$, $\beta$, $\epsilon$ | Step-size, decay-rate, and exploration parameters |
| $s, s'$ | States |
| $a$ | Action |
| $r$ | Reward |
| $\mathcal{S}, \mathcal{A}(s), \mathcal{R}$ | Sets of all states, of actions available in state $s$, and of possible rewards |
| $t, T$ | Discrete time step, final step |
| $S_t, A_t, R_t$ | State, action, and reward at time $t$ |
| $\pi$ | Policy (decision-making rule) |
| $\pi(s)$ | Action taken in state $s$ under _deterministic_ $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under _stochastic_ $\pi$ |
| $G_t$ | Return from time $t$ |
| $h$ | Horizon (the time step up to which one looks ahead in a forward view) |
| $v_\pi(s)$ | Value of state $s$ under policy $\pi$ |
| $q_\pi(s, a)$ | Value of state-action pair $(s,a)$ under $\pi$ |
| $p(s', r \mid s, a)$ | Transition dynamics: probability of next state $s'$ and reward $r$, given state $s$ and action $a$ |
| $\hat{v}(s, \mathbf{w})$ | Approximate value function using weight vector $\mathbf{w}$ |
| $\delta_t$ | Temporal-difference error at time $t$ |

---

# Chapter 1: Introduction

---

## 1.1 Reinforcement Learning

- **Definition**: Learning how to map situations to actions to maximize cumulative reward.
- **Core elements**:
  - **Trial-and-error search**: The agent explores to learn.
  - **Delayed reward**: Actions impact long-term outcomes, not just immediate feedback.
- **Distinguishing features**:
  - The learner is not told *what* actions to take; it only receives feedback via rewards.
  - Must balance **exploration vs. exploitation**.
- **Compared to other paradigms**:
  - **Supervised learning**: Learns from labeled examples.
  - **Unsupervised learning**: Learns hidden structure from unlabeled data.
  - **RL**: Maximizes rewards via interaction; often needs to explore unknown state spaces.
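
One common way to balance exploration and exploitation is an $\epsilon$-greedy rule: with probability $\epsilon$ pick a random action, otherwise pick the action with the highest current estimate. The sketch below is only an illustration under assumptions; the function name `epsilon_greedy`, the action names, and the estimate values are all hypothetical and do not come from the notes.

```python
import random

def epsilon_greedy(action_values, epsilon, rng):
    """Pick an action from a dict mapping action -> estimated value.

    With probability epsilon, explore (uniformly random action);
    otherwise exploit (an action with the highest estimate).
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    best = max(action_values.values())
    # Break ties randomly among the best-valued actions.
    return rng.choice([a for a in actions if action_values[a] == best])

# Hypothetical value estimates for three actions (made up for illustration).
estimates = {"left": 0.2, "stay": 0.5, "right": 0.1}
rng = random.Random(0)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count("stay") / len(picks)
```

With `epsilon=0.0` the rule is purely greedy and never tries the other actions; a small positive `epsilon` keeps occasional exploration so that poor early estimates can still be corrected.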

---

## 1.2 Examples

Real-world analogies include:
- **Chess**: Intuition + planning, learning from position values.
- **Adaptive control**: Tuning refinery settings in real time.
- **Animals**: Gazelle calf learning to run soon after birth.
- **Robots**: Battery-aware navigation decisions.
- **Daily tasks**: Preparing breakfast involves conditional behavior, goals, and sensory feedback.

**Key Takeaway**: RL applies broadly wherever an agent interacts with an environment, learns over time, and must adapt to uncertainty.

---

## 1.3 Elements of Reinforcement Learning

Four main components:

1. **Policy ($\pi$)**:
   - Mapping from states to actions: $\pi(s)$ or $\pi(a \mid s)$.
   - Can be deterministic or stochastic.

2. **Reward Signal ($r$)**:
   - Defines the goal.
   - Immediate signal from the environment that indicates what is good or bad in the short term.

3. **Value Function ($v(s)$, $q(s,a)$)**:
   - Estimates long-term reward.
   - **$v_\pi(s)$**: Expected return from state $s$ under policy $\pi$.
   - Guides decision-making more than immediate rewards do: actions are chosen for their long-term value, not their immediate reward.

4. **Model (optional)**:
   - Simulates environment behavior: $p(s', r \mid s, a)$.
   - Enables **planning** (model-based methods); without a model, the agent learns from **direct interaction** (model-free methods).
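
To make these components concrete, here is a minimal sketch of how a deterministic policy, a stochastic policy, and a model could be stored as plain Python dictionaries. Everything specific in it is an assumption for illustration: the battery states, the action names, the probabilities, the rewards, and the helper names `sample_action` and `expected_reward` are hypothetical.

```python
import random

# Deterministic policy pi(s): each state maps to exactly one action.
deterministic_pi = {"battery_low": "recharge", "battery_high": "search"}

# Stochastic policy pi(a | s): each state maps to a distribution over actions.
stochastic_pi = {
    "battery_low": {"recharge": 0.9, "search": 0.1},
    "battery_high": {"recharge": 0.2, "search": 0.8},
}

def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Model p(s', r | s, a): (state, action) -> {(next_state, reward): probability}.
# All numbers below are made up.
model = {
    ("battery_high", "search"): {
        ("battery_high", 1.0): 0.7,
        ("battery_low", 1.0): 0.3,
    },
    ("battery_low", "search"): {
        ("battery_low", 1.0): 0.6,
        ("battery_high", -3.0): 0.4,  # ran flat and had to be rescued
    },
}

def expected_reward(model, state, action):
    """Expected immediate reward: sum over (s', r) of p(s', r | s, a) * r."""
    return sum(p * r for (_, r), p in model[(state, action)].items())

risky = expected_reward(model, "battery_low", "search")   # 0.6*1.0 + 0.4*(-3.0)
safe = expected_reward(model, "battery_high", "search")   # 0.7*1.0 + 0.3*1.0
```

A model-based agent could query `model` like this to plan ahead without acting; a model-free agent would have no such table and would have to estimate values from sampled experience instead.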

---

## 1.4 Limitations and Scope

- Assumes a **state signal** is given (e.g., as preprocessed input).
- Reinforcement learning does not inherently solve **state representation**.
- Evolutionary methods can solve RL problems without value functions, but they are often less efficient because they ignore what happens within each individual interaction.
- Focus of this book: **model-free and model-based RL** with value estimation.

---

## 1.5 An Extended Example: Tic-Tac-Toe

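Goal: Learn to beat an imperfect opponent without knowing its strategy a priori. Classical methods such as dynamic programming need a complete specification of the opponent, which is unavailable here; as the book notes: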

> About the best one can do on this problem is first to learn a model of the opponent’s behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

**Approach _using a value function_**:
- Maintain a value table with one entry $V(s)$ per board state. Each entry is the latest estimate of the probability of winning from that state (that state's _value_).
   - The whole table is the learned _value function_.
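   - Initialize the table (assuming we play Xs) with:
      - 1 for all states that have three Xs in a row (we have already won).
      - 0 for all states that have three Os in a row or a full board, since we cannot win from them (draws count as losses here).
      - 0.5 for all other states (an initial guess of a 50% chance of winning).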
- Update values using:
  $$
  V(S_t) \leftarrow V(S_t) + \alpha \left[V(S_{t+1}) - V(S_t)\right]
  $$
- Select mostly greedy moves (the move leading to the state with the highest $V(s)$), with occasional exploratory moves.
- Value updates occur **only after greedy moves**; exploratory moves do not result in any learning.
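
The table initialization and the update rule above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the book's implementation: boards are encoded as 9-character strings with `-` for an empty cell, the learner plays X, a full board without three Xs is valued at 0 (treating a draw as not winning), and the helper names (`has_three`, `initial_value`, `value`, `td_update`) are hypothetical.

```python
# Index triples for the eight winning lines of a 3x3 board stored row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def has_three(board, mark):
    """True if `mark` occupies all three cells of any line."""
    return any(all(board[i] == mark for i in line) for line in LINES)

def initial_value(board):
    """Initial estimate of the probability that X wins from `board`."""
    if has_three(board, "X"):
        return 1.0
    if has_three(board, "O") or "-" not in board:
        return 0.0  # loss, or (assumption) a draw counted as not winning
    return 0.5

def value(V, board):
    """Look up V(board), creating the entry lazily on first visit."""
    if board not in V:
        V[board] = initial_value(board)
    return V[board]

def td_update(V, s, s_next, alpha):
    """Apply V(s) <- V(s) + alpha * [V(s_next) - V(s)] and return the new V(s)."""
    v_s = value(V, s)
    V[s] = v_s + alpha * (value(V, s_next) - v_s)
    return V[s]

V = {}
earlier = "XX-OO----"  # X to move, value starts at 0.5
later = "XXXOO----"    # X completes the top row, value 1.0
new_value = td_update(V, earlier, later, alpha=0.1)  # 0.5 + 0.1 * (1.0 - 0.5)
```

Here the earlier state's value moves from 0.5 to 0.55, a fraction $\alpha$ of the way toward the later state's value. In a full player, `td_update` would be called only when the move that led to the later state was greedy.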

<div style="text-align: center;">
   <img src="img/1.png" alt="1" width="60%"/>
   <img src="img/2.png" alt="2" width="60%"/>
   <img src="img/3.png" alt="3" width="60%"/>
</div>

**Takeaways**:
- RL learns from interaction without a model of the opponent.
- It backs up values from future to past states.
- Unlike evolutionary methods, RL learns online, and credit assignment is finer-grained.

**Model-Free Nature**:
- Does not require a model of the opponent or an explicit search over future states.
- Works via **temporal-difference (TD) learning**.
- Can generalize across states using function approximators (e.g., the neural network in Tesauro’s TD-Gammon backgammon player).

---

## 1.6 Summary

- RL is a third paradigm alongside supervised and unsupervised learning.
- Uses **Markov Decision Processes (MDPs)** to formalize the interaction between agent and environment.
- Central tools:
  - **Value functions**
  - **Trial-and-error updates**
- RL = learning by *interacting*, *adapting*, and *optimizing long-term outcomes*.

---

## 1.7 Early History of Reinforcement Learning

**Two foundational threads**:
1. **Trial-and-error learning**:
   - Roots in psychology: Thorndike’s *Law of Effect*, Skinner’s reinforcement, Pavlov’s conditioning.
   - Turing (1948) described a learning machine driven by a pleasure-pain system.
   - Shannon’s *Theseus* (a relay-driven maze-running mouse), Michie’s *MENACE* (built from matchboxes), and Samuel’s *checkers program* were early mechanical and computational examples.

2. **Optimal control and dynamic programming**:
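   - Bellman’s **DP** and **Bellman equations** in the 1950s.
   - MDPs (Bellman, 1957) formalized stochastic control problems; Howard (1960) devised the policy iteration method for them.
   - Later combined with neural-network function approximation under the name "neurodynamic programming" (Bertsekas and Tsitsiklis, 1996).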

**Modern RL** = Integration of:
- Dynamic programming theory (optimal control)
- Psychological theories of learning
- Computational models like TD-learning

---

## Chapter 1 Exercises

<img src="img/4.png" alt="4" width="60%"/>
<img src="img/5.png" alt="5" width="60%"/>