# Chapter 3: Finite Markov Decision Processes
---

**Chapter Overview**: This chapter introduces the formal mathematical framework that underlies most of reinforcement learning. We transition from simple bandits (Chapter 2) to the full RL problem where actions affect not just immediate rewards, but also future states and opportunities.

**Key Distinction from Bandits**: 
- **Bandits**: Actions only affect immediate rewards
- **MDPs**: Actions affect both immediate rewards AND future states (and thus future rewards)

![Agent-Environment Interface](../img/fig3_1.png)

---

## 3.1 The Agent–Environment Interface

**Core Framework**: The MDP framework models interaction between an **agent** (learner/decision-maker) and **environment** (everything outside the agent).

### The Interaction Loop

**Basic sequence** at discrete time steps $t = 0, 1, 2, 3, ...$:
1. Agent observes state $S_t$
2. Agent selects action $A_t$ 
3. Environment responds with reward $R_{t+1}$ and new state $S_{t+1}$

**Trajectory**: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...$
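
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state environment, its reward values, and the names `env_step` and `run` are all hypothetical, and the agent simply picks actions at random.

```python
import random

def env_step(state, action):
    """Hypothetical deterministic environment: returns (reward, next_state)."""
    if action == "move":
        next_state = 1 - state                   # switch between states 0 and 1
        reward = 1.0 if next_state == 1 else 0.0
    else:                                        # "stay"
        next_state = state
        reward = 2.0 if state == 1 else 0.0
    return reward, next_state

def run(num_steps, seed=0):
    """Generate the trajectory S_0, A_0, R_1, S_1, A_1, R_2, ... as a flat list."""
    rng = random.Random(seed)
    state = 0                                    # S_0
    trajectory = [state]
    for _ in range(num_steps):
        action = rng.choice(["stay", "move"])    # agent selects A_t
        reward, state = env_step(state, action)  # environment returns R_{t+1}, S_{t+1}
        trajectory += [action, reward, state]
    return trajectory

print(run(3))
```

Note that the reward produced at step $t$ is stored as $R_{t+1}$, matching the indexing convention of the trajectory.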

### MDP Dynamics Function

**Four-argument dynamics function** (_Equation 3.2_):

$$p(s', r|s, a) \doteq \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$

**Components**:
- $s, s'$: Current and next states
- $a$: Action taken  
- $r$: Reward received
- $p(s', r|s, a)$: Probability of transitioning to state $s'$ and receiving reward $r$

**Mathematical intuition**: This function completely characterizes the environment's behavior. Given current state and action, it tells us the probability of every possible outcome.

**Probability constraint** (_Equation 3.3_):
$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r|s, a) = 1, \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$
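
One possible way to hold the four-argument function in code is a nested dictionary, which makes the constraint in Equation 3.3 easy to check. The states, actions, rewards, and probabilities below are made up for illustration, and `is_normalized` is a hypothetical helper name.

```python
# Hypothetical two-state MDP used only for illustration.
# p[(s, a)] maps each (next_state, reward) outcome to its probability p(s', r | s, a).
p = {
    ("low", "search"):  {("low", 1.0): 0.6, ("high", -3.0): 0.4},
    ("low", "wait"):    {("low", 0.5): 1.0},
    ("high", "search"): {("high", 1.0): 0.7, ("low", 1.0): 0.3},
    ("high", "wait"):   {("high", 0.5): 1.0},
}

def is_normalized(p, tol=1e-12):
    """Check Equation 3.3: outcome probabilities sum to 1 for every (s, a) pair."""
    return all(abs(sum(outcomes.values()) - 1.0) <= tol for outcomes in p.values())

print(is_normalized(p))
```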

### Derived Functions

From the four-argument $p$ function, we can compute:

**State transition probabilities** (_Equation 3.4_):
$$p(s'|s, a) = \sum_{r \in \mathcal{R}} p(s', r|s, a)$$

**Expected rewards for state-action pairs** (_Equation 3.5_):
$$r(s, a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r|s, a)$$

**Expected rewards for state-action-next-state triples** (_Equation 3.6_):
$$r(s, a, s') = \sum_{r \in \mathcal{R}} r \frac{p(s', r|s, a)}{p(s'|s, a)}$$
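
As a rough sketch of how Equations 3.4 to 3.6 follow from $p$, the snippet below computes all three quantities for a single state-action pair. The outcome table and function names are hypothetical, chosen so the arithmetic is easy to verify by hand.

```python
from collections import defaultdict

# Hypothetical dynamics for one fixed state-action pair (s, a).
# Keys are (next_state, reward); values are p(s', r | s, a).
outcomes = {("s1", 0.0): 0.25, ("s1", 4.0): 0.25, ("s2", 1.0): 0.5}

def transition_probs(outcomes):
    """Equation 3.4: p(s'|s,a) is the sum over r of p(s',r|s,a)."""
    probs = defaultdict(float)
    for (s_next, _r), prob in outcomes.items():
        probs[s_next] += prob
    return dict(probs)

def expected_reward(outcomes):
    """Equation 3.5: r(s,a) is the sum over s' and r of r * p(s',r|s,a)."""
    return sum(r * prob for (_s_next, r), prob in outcomes.items())

def expected_reward_given_next(outcomes, s_next):
    """Equation 3.6: r(s,a,s') is the sum over r of r * p(s',r|s,a) / p(s'|s,a)."""
    p_next = transition_probs(outcomes)[s_next]
    total = sum(r * prob for (s2, r), prob in outcomes.items() if s2 == s_next)
    return total / p_next

print(transition_probs(outcomes))                  # {'s1': 0.5, 's2': 0.5}
print(expected_reward(outcomes))                   # 1.5
print(expected_reward_given_next(outcomes, "s1"))  # 2.0
```

Here $r(s, a, s_1) = (0 \cdot 0.25 + 4 \cdot 0.25) / 0.5 = 2$, which differs from the unconditional $r(s, a) = 1.5$ because conditioning on the next state renormalizes the probabilities.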

### The Markov Property

**Definition**: The state must include all information about the past that affects the future. 

**Mathematical statement**: $P(S_{t+1}, R_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, ...) = P(S_{t+1}, R_{t+1} | S_t, A_t)$

**Intuition**: If you know the current state, the past doesn't matter for predicting the future.

### Agent-Environment Boundary

**Key principle**: The boundary represents the limit of the agent's **absolute control**, not its knowledge.

**Examples**:
- **Robot arm**: Motors and sensors are part of environment, not agent
- **Chess program**: Board position is state, but opponent's strategy is part of environment
- **Human**: Muscles and sensory organs are environment

**Practical guideline**: Everything the agent cannot arbitrarily change belongs to the environment.

---

## 3.2 Goals and Rewards

### The Reward Hypothesis

> **Reward Hypothesis**: All goals and purposes can be thought of as maximizing the expected value of the cumulative sum of a received scalar signal (reward).

**Key insights**:
- Rewards define the goal, not how to achieve it
- Reward the actual objective rather than intermediate subgoals
- Example: A chess agent should be rewarded for winning, not for taking pieces

**Bad example**: Rewarding a chess agent for taking opponent pieces might lead it to take pieces while losing the game.

### Reward Signal Design

**Critical principle**: Use rewards to communicate **what** you want achieved, not **how** to achieve it.

**Examples of good reward design**:
- **Walking robot**: +1 for each step forward
- **Maze escape**: -1 per time step until escape (encourages speed)
- **Game playing**: +1 win, -1 loss, 0 draw
- **Can collection**: +1 per can collected

---

## 3.3 Returns and Episodes

### Episodes vs. Continuing Tasks

**Episodic tasks**: Natural breaking points (episodes)
- Examples: Games, maze runs, conversations
- Each episode ends in **terminal state**
- Each new episode begins independently of how the previous one ended

**Continuing tasks**: No natural endpoints
- Examples: Process control, life-long learning
- Interaction continues indefinitely

### Return Definitions

**Simple return for episodic tasks** (_Equation 3.7_):
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$$

**Components**:
- $G_t$: Return starting from time $t$
- $T$: Final time step of episode
- Simple sum of all future rewards in episode

### Discounted Return

**Discounted return** (_Equation 3.8_):
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

**Components**:
- $\gamma$: Discount rate, $0 \leq \gamma \leq 1$
- $\gamma^k$: Discount factor for reward $k$ steps in future

**Mathematical intuition**: Future rewards are worth less than immediate rewards. The discount rate $\gamma$ controls how much we value the future.

**Special cases**:
- $\gamma = 0$: Only immediate reward matters (myopic)
- $\gamma = 1$: All rewards equally important (far-sighted)
- $\gamma < 1$: Ensures a finite return even for infinite sequences, provided the rewards are bounded

### Recursive Return Relationship

**Fundamental recursion** (_Equation 3.9_):
$$G_t = R_{t+1} + \gamma G_{t+1}$$

**Mathematical intuition**: Today's return equals the immediate reward plus the discounted future return. This recursion underlies the Bellman equations and, through them, most RL algorithms.

**Example calculation**: If rewards are constant +1 and $\gamma < 1$:
$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$
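
The recursion and the geometric-series result can be checked numerically. The sketch below assumes a long but finite run of +1 rewards as a stand-in for the infinite sequence; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards [R_{t+1}, R_{t+2}, ...] via G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):   # work backwards from the last reward
        g = r + gamma * g
    return g

gamma = 0.9
rewards = [1.0] * 500             # finite truncation of the constant +1 sequence
approx = discounted_return(rewards, gamma)
closed_form = 1.0 / (1.0 - gamma)
print(approx, closed_form)        # the two values agree to within rounding error
```

The truncation error is on the order of $\gamma^{500}$, which is negligible for $\gamma = 0.9$.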

---

## 3.4 Unified Notation for Episodic and Continuing Tasks

**Key insight**: We can treat episodic tasks as continuing tasks with absorbing terminal states that give zero reward.

![State Transition Diagram](../img/fig3_2.png)

**Unified return formula** (_Equation 3.11_):
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$

This formula allows either $T = \infty$ or $\gamma = 1$, but not both at once.
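
To see why the absorbing-state view changes nothing, the sketch below evaluates Equation 3.11 on a short episode and on the same episode padded with zero rewards. The reward values and the name `unified_return` are illustrative assumptions.

```python
def unified_return(rewards, gamma, t=0):
    """Equation 3.11, with rewards[k - 1] holding R_k for k = 1..T."""
    T = len(rewards)
    return sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))

episode = [-1.0, 2.0, 6.0]          # hypothetical rewards R_1, R_2, R_3
padded = episode + [0.0] * 50       # absorbing terminal state: zero reward afterwards

print(unified_return(episode, 0.5))  # 1.5
print(unified_return(padded, 0.5))   # 1.5
print(unified_return(episode, 1.0))  # 7.0 (undiscounted, finite T)
```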

---

## 3.5 Policies and Value Functions

### Policies

**Definition**: A policy $\pi$ is a mapping from states to probabilities of selecting each action.

**Stochastic policy**: $\pi(a|s) = $ probability of taking action $a$ in state $s$

**Mathematical constraint**: $\sum_{a} \pi(a|s) = 1$ for all $s$

### State-Value Functions

**State-value function** (_Equation 3.12_):
$$v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t = s\right]$$

**Components**:
- $v_\pi(s)$: Expected return starting from state $s$ following policy $\pi$
- $\mathbb{E}_\pi[\cdot]$: Expectation when following policy $\pi$

**Mathematical intuition**: How good is it to be in state $s$ if we follow policy $\pi$ from here on?

### Action-Value Functions

**Action-value function** (_Equation 3.13_):
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t = s, A_t = a\right]$$

**Mathematical intuition**: How good is it to take action $a$ in state $s$, then follow policy $\pi$?

### The Bellman Equation for $v_\pi$

**🏆 FOUNDATIONAL FORMULA** (_Equation 3.14_):
$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$

**Components**:
- $\pi(a|s)$: Probability of taking action $a$ in state $s$
- $p(s', r|s, a)$: Environment dynamics
- $r$: Immediate reward
- $\gamma v_\pi(s')$: Discounted future value

**Mathematical intuition**: The value of a state equals the expected immediate reward plus the expected discounted value of the next state. This captures the recursive nature of value.
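
Because Equation 3.14 expresses $v_\pi$ in terms of itself, one simple way to compute it is to apply the right-hand side repeatedly as an update until the values stop changing. The sketch below does this for a made-up two-state MDP under an equiprobable random policy; the dynamics, the discount rate, and all names are assumptions for illustration only.

```python
GAMMA = 0.5

# Hypothetical deterministic MDP: dynamics[(s, a)] = [(prob, next_state, reward), ...]
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 1.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}
states = [0, 1]
actions = ["stay", "move"]
policy = {s: {a: 0.5 for a in actions} for s in states}  # equiprobable random policy

def bellman_rhs(v, s):
    """Right-hand side of Equation 3.14 evaluated at state s."""
    return sum(
        policy[s][a] * sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in dynamics[(s, a)])
        for a in actions
    )

v = {s: 0.0 for s in states}
for _ in range(100):                 # repeated sweeps; the error shrinks because GAMMA < 1
    v = {s: bellman_rhs(v, s) for s in states}

print(v)                             # approximately {0: 1.25, 1: 1.75}
```

Solving the two linear equations by hand gives $v_\pi(0) = 1.25$ and $v_\pi(1) = 1.75$, which the iteration reproduces.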

**Derivation insight**:
1. Start with definition: $v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$
2. Use recursion: $G_t = R_{t+1} + \gamma G_{t+1}$
3. Apply the law of total expectation over actions, next states, and rewards
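
Written out, the three steps give the following chain of equalities (a sketch of the standard derivation, using only the definitions already introduced in this section):

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a|s) \sum_{s',r} p(s', r|s, a)\left[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\right] \\
&= \sum_a \pi(a|s) \sum_{s',r} p(s', r|s, a)[r + \gamma v_\pi(s')]
\end{aligned}$$

The third line relies on the Markov property: once $S_{t+1} = s'$ is known, the expected return from $t+1$ onward does not depend on $s$, $a$, or $r$.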

![Backup Diagram for v_π](../img/fig3_3.png)

### Backup Diagrams

**Backup diagrams** show the relationship between a state (or state-action pair) and its successors:
- **Open circles**: States
- **Solid circles**: State-action pairs
- **Arrows**: Possible transitions

---

## 3.6 Optimal Policies and Optimal Value Functions

### Optimal Value Functions

**Optimal state-value function** (_Equation 3.15_):
$$v_*(s) = \max_\pi v_\pi(s)$$

**Optimal action-value function** (_Equation 3.16_):
$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

**Mathematical intuition**: These represent the best possible performance achievable from each state or state-action pair.

**Relationship** (_Equation 3.17_):
$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$

### Bellman Optimality Equations

**🏆 FOUNDATIONAL FORMULA - Bellman Optimality Equation for $v_*$** (_Equations 3.18-3.19_):
$$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$
$$v_*(s) = \max_a \sum_{s',r} p(s', r|s, a)[r + \gamma v_*(s')]$$

**Components**:
- $\max_a$: Choose the best action
- Rest is same as Bellman equation, but now we optimize over actions

**Mathematical intuition**: The value of a state under optimal policy equals the value of the best action from that state. We replace expectation over policy with maximization over actions.

**🏆 FOUNDATIONAL FORMULA - Bellman Optimality Equation for $q_*$** (_Equation 3.20_):
$$q_*(s, a) = \sum_{s',r} p(s', r|s, a)[r + \gamma \max_{a'} q_*(s', a')]$$

**Mathematical intuition**: The value of taking action $a$ in state $s$ equals the expected immediate reward plus the discounted value of the best action in the next state.

![Backup Diagrams for Optimal Value Functions](../img/fig3_4.png)

### Finding Optimal Policies

**Key insight**: Once you have $v_*$, finding optimal policy is easy:

**Greedy policy extraction**:
$$\pi_*(s) = \arg\max_a \sum_{s',r} p(s', r|s, a)[r + \gamma v_*(s')]$$

**Why this works**: $v_*$ already accounts for all future consequences, so a greedy one-step lookahead gives the optimal action.

**With $q_*$, it's even easier**:
$$\pi_*(s) = \arg\max_a q_*(s, a)$$
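
Both extraction rules can be tried on a small example. The sketch below first approximates $v_*$ by repeatedly applying the Bellman optimality update (an assumed, convenient way to obtain it here), then reads off the greedy policy from $v_*$ and from $q_*$. The two-state MDP, the discount rate, and all names are hypothetical.

```python
GAMMA = 0.5

# Hypothetical deterministic MDP: dynamics[(s, a)] = [(prob, next_state, reward), ...]
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 1.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}
states = [0, 1]
actions = ["stay", "move"]

def lookahead(v, s, a):
    """One-step lookahead: sum over s', r of p(s',r|s,a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in dynamics[(s, a)])

# Approximate v_* by iterating the Bellman optimality equation as an update.
v_star = {s: 0.0 for s in states}
for _ in range(200):
    v_star = {s: max(lookahead(v_star, s, a) for a in actions) for s in states}

# q_* follows from v_* by one more lookahead.
q_star = {(s, a): lookahead(v_star, s, a) for s in states for a in actions}

# Greedy extraction: from v_* (needs the dynamics) and from q_* (does not).
pi_from_v = {s: max(actions, key=lambda a: lookahead(v_star, s, a)) for s in states}
pi_from_q = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}

print(v_star)      # approximately {0: 3.0, 1: 4.0}
print(pi_from_v)   # {0: 'move', 1: 'stay'}
print(pi_from_q)   # {0: 'move', 1: 'stay'}
```

Extracting from $q_*$ needs only a table lookup, whereas extracting from $v_*$ requires the dynamics $p$ to perform the lookahead.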

### Solving Bellman Optimality Equations

**In principle**: Can solve the system of Bellman optimality equations directly
- For $n$ states, there are $n$ equations in $n$ unknowns (nonlinear because of the $\max$)
- Requires knowing environment dynamics $p(s', r|s, a)$

**In practice**: Usually computationally intractable
- Example: Backgammon has ~$10^{20}$ states
- Most RL methods approximate the solution

---

## 3.7 Optimality and Approximation

### Computational Reality

**The fundamental challenge**: Computing optimal policies exactly is usually impossible due to:
1. **Computational limits**: Not enough computation per time step
2. **Memory constraints**: Cannot store values for all states  
3. **Unknown dynamics**: Don't know $p(s', r|s, a)$

### Tabular vs. Function Approximation

**Tabular methods**: Store separate value for each state
- Feasible only for small state spaces
- Can find exact solutions

**Function approximation**: Use parameterized functions to approximate values
- Necessary for large state spaces
- Can only find approximate solutions

### Approximation Opportunities

**Key insight**: RL allows focusing computational resources on frequently encountered states

**Example**: Tesauro's backgammon player
- Exceptional performance despite potentially poor decisions on rare board positions
- Focuses learning on states that actually occur in expert play

---

## 3.8 Summary

### Key Concepts Introduced

**MDPs provide the mathematical foundation for RL**:
- **States**: Basis for decision-making
- **Actions**: Choices available to agent  
- **Rewards**: Basis for evaluation
- **Policy**: Rule for selecting actions
- **Value functions**: Expected return (cumulative, possibly discounted, future reward)

**Value functions are central to RL**:
- $v_\pi(s)$: Expected return from state $s$ under policy $\pi$
- $q_\pi(s,a)$: Expected return from taking action $a$ in state $s$ under policy $\pi$
- $v_*(s)$: Best possible expected return from state $s$
- $q_*(s,a)$: Best possible expected return from taking action $a$ in state $s$

**Bellman equations provide recursive structure**:
- Connect value of state to values of successor states
- Foundation for most RL algorithms
- Optimality equations characterize optimal behavior

### Fundamental Takeaways

1. **Actions affect both immediate rewards and future opportunities**
2. **Value functions capture long-term consequences of decisions**  
3. **Optimal policies are greedy with respect to optimal value functions**
4. **Exact solutions usually impossible; approximation necessary**
5. **RL can focus learning on frequently encountered states**

---

## Exercises

### Exercise 3.1 Solution
**Q**: Devise three example tasks that fit the MDP framework.

**A**: 
1. **Autonomous driving**:
   - States: (position, speed, traffic conditions, weather)
   - Actions: (accelerate, brake, steer left/right, change lanes)
   - Rewards: +1 for progress toward destination, -100 for accidents, -1 for traffic violations

2. **Stock trading**:
   - States: (portfolio value, market indicators, time of day, news sentiment)
   - Actions: (buy stock X, sell stock Y, hold, set stop-loss)
   - Rewards: Change in portfolio value each day

3. **Chatbot conversation**:
   - States: (conversation history, user sentiment, topic, user profile)
   - Actions: (different response templates, ask question, provide information, end conversation)
   - Rewards: +1 for positive user feedback, -1 for user ending conversation early

### Exercise 3.5 Solution
**Q**: Modify equation (3.3) for episodic tasks.

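**A**: For episodic tasks, the next state may be the terminal state, so the sum over $s'$ must range over $\mathcal{S}^+$, while $s$ still ranges over nonterminal states only:
$$\sum_{s' \in \mathcal{S}^+} \sum_{r \in \mathcal{R}} p(s', r|s, a) = 1, \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$
where $\mathcal{S}^+$ is the set of all states including the terminal state, and $\mathcal{S}$ is the set of nonterminal states.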

### Exercise 3.8 Solution  
**Q**: Given $\gamma = 0.5$ and rewards $R_1 = -1, R_2 = 2, R_3 = 6, R_4 = 3, R_5 = 2$ with $T = 5$, find $G_0, G_1, ..., G_5$.

**A**: Working backwards:
- $G_5 = 0$ (terminal)
- $G_4 = R_5 = 2$  
- $G_3 = R_4 + \gamma G_4 = 3 + 0.5(2) = 4$
- $G_2 = R_3 + \gamma G_3 = 6 + 0.5(4) = 8$
- $G_1 = R_2 + \gamma G_2 = 2 + 0.5(8) = 6$ 
- $G_0 = R_1 + \gamma G_1 = -1 + 0.5(6) = 2$
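
The backward calculation can be double-checked with a short script. This is only a verification sketch; the dictionary layout is an arbitrary choice.

```python
gamma = 0.5
rewards = {1: -1, 2: 2, 3: 6, 4: 3, 5: 2}   # R_1..R_5 from the exercise
T = 5

G = {T: 0.0}                                 # G_T = 0 at the terminal step
for t in range(T - 1, -1, -1):               # t = 4, 3, 2, 1, 0
    G[t] = rewards[t + 1] + gamma * G[t + 1]

print([G[t] for t in range(T + 1)])          # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```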

### Exercise 3.12 Solution
**Q**: Give equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

**A**: 
$$v_\pi(s) = \sum_a \pi(a|s) q_\pi(s, a)$$

**Intuition**: State value is weighted average of action values under the policy.

### Exercise 3.13 Solution
**Q**: Give equation for $q_\pi$ in terms of $v_\pi$ and $p$.

**A**:
$$q_\pi(s, a) = \sum_{s',r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$

**Intuition**: Action value equals expected immediate reward plus discounted next state value.
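
Substituting the expression for $q_\pi$ from Exercise 3.13 into the expression for $v_\pi$ from Exercise 3.12 recovers the Bellman equation for $v_\pi$ (Equation 3.14), which is a useful consistency check on both answers:

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a) = \sum_a \pi(a|s) \sum_{s',r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$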

### Exercise 3.17 Solution
**Q**: What is the Bellman equation for action values $q_\pi$?

**A**:
$$q_\pi(s, a) = \sum_{s',r} p(s', r|s, a)\left[r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')\right]$$

**Components**:
- Immediate reward: $r$, averaged over outcomes by $p(s', r|s, a)$
- Expected future value: $\gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')$

### Exercise 3.25 Solution
**Q**: Give equation for $v_*$ in terms of $q_*$.

**A**:
$$v_*(s) = \max_a q_*(s, a)$$

### Exercise 3.27 Solution  
**Q**: Give equation for $\pi_*$ in terms of $q_*$.

**A**:
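$$\pi_*(a|s) = \begin{cases} 
\dfrac{1}{|\mathcal{A}_*(s)|} & \text{if } a \in \mathcal{A}_*(s) \\
0 & \text{otherwise}
\end{cases} \qquad \text{where } \mathcal{A}_*(s) = \arg\max_{a'} q_*(s, a')$$

**Intuition**: An optimal policy puts all probability on the action(s) with the highest $q_*$ value. When several actions tie, the probabilities must still sum to 1; the uniform split shown here is one valid choice, and any other distribution over the tied actions is equally optimal.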
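
When several actions tie for the maximum, the probabilities still have to sum to 1, and one valid choice is to split the probability equally among the tied actions. The sketch below does this for a single state; the action names, the values in `q_row`, and the tolerance are hypothetical.

```python
def greedy_policy(q_row, tol=1e-12):
    """Spread probability uniformly over all actions whose value is within tol of the max."""
    best = max(q_row.values())
    winners = [a for a, q in q_row.items() if best - q <= tol]
    return {a: (1.0 / len(winners) if a in winners else 0.0) for a in q_row}

q_row = {"left": 1.0, "right": 3.0, "up": 3.0}   # hypothetical q_*(s, .) values
print(greedy_policy(q_row))                      # {'left': 0.0, 'right': 0.5, 'up': 0.5}
```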

---