# Approaches to Solve the Bellman Equation

- We have established the basic setup for Reinforcement Learning by introducing the Bellman Equations for the State-Value and Action-Value functions

- Now we look into **how** the Bellman Equations can be solved in practice

\begin{aligned}
    &\text{State-Value Function:} \\
    &\quad v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma v_{\pi}(s')] \\ \\
    &\text{Action-Value Function:} \\
    &\quad q_{\pi}(s,a) = \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
\end{aligned}

## Dynamic Programming (DP) - Model Based

- The simplest case of solving the Bellman Equation is when you have **both** the transition probabilities $P(s'|s,a)$ and the rewards $R(s,a,s')$

- That is, you know the full Markov Decision Process (MDP). Hence the term "model based", because we are using the MDP model

- This is, of course, exceedingly rare. But it is instructive to see how to solve for $\pi^*$ when you have all the available information

- We will go through 2 ways of implementing the DP solver
    - Policy Iteration
    - Value Iteration

- To do this, let's assume the following set up:

In [29]:
import numpy as np

np.random.seed(123)

N_STATES = 20
N_ACTIONS = 20
GAMMA = 0.9

# Shape corresponding to State, Action, Next State
TRANSITION_PROBABILITIES = np.random.rand(N_STATES, N_ACTIONS, N_STATES)
# Normalise so that, for each (state, action) pair, the probabilities over next states sum to 1
TRANSITION_PROBABILITIES /= np.sum(TRANSITION_PROBABILITIES, axis=2, keepdims=True)

REWARDS = np.random.rand(N_STATES, N_ACTIONS, N_STATES) * 5

### Policy Iteration

- In policy iteration, we create a loop that does (i) policy evaluation, and (ii) policy improvement based on the evaluated outcomes of the policy. The loop runs until some stopping criterion is reached

- **Policy Evaluation:** 
    - Since we know the transition probabilities and rewards with full certainty, we can update the State-Value function by computing the Bellman Equation
    
    - Start with a seed array representing the State-Value function $V$, and a policy $\pi$ that deterministically tells us what action to take in a given state $S$
    
    - Compute 1 pass of the Bellman State Value function update for all states. This is just a transition-probability-weighted average of the immediate reward plus the discounted state value of the next states $S'$

    - Check whether the largest change in the state value estimates across all $N$ states exceeds some threshold $\theta$. If even the largest change falls below the threshold, we assume convergence, and we return the state values $V$

    - Otherwise, keep looping

- **Policy Improvement**
    - We've previously updated our estimate of the State-Value function $V$
    - Now, we need to update our policy based on the new estimate of the State-Value function. That is, given a state $S$, I want to know the value of taking an action $A$. 
    - That is; the **Q-Value**

    - Looping over every state $S$, we init an array to hold the new Q values

    - Given $S$, loop over every action available, and update the Action-Value function $Q$. This is simply the transition-probability-weighted average of rewards from the transition $S \to S'$, plus the discounted state-value of $S'$

    - Finally, the new policy for state $S$ is simply to take the argmax of all the Q values

- **Policy Iteration**
    - Bringing these 2 together, we simply create an infinite loop of `Evaluation --> Improvement --> Evaluation ...`

    - This will keep going until some convergence criterion is met. An easy one is to say that if our policy does not change for some $N$ consecutive loops, we have reached convergence
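
Because the policy is held fixed during evaluation, the Bellman equation for $v_\pi$ is linear in $v_\pi$, so the evaluation step can also be solved directly rather than iterated. The sketch below shows one way to do this on a small made-up MDP. The helper `evaluate_policy_exact`, the arrays `P` and `R`, and the sizes are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

# Hypothetical small MDP, built the same way as the setup above
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # each P[s, a, :] sums to 1
R = rng.random((n_states, n_actions, n_states)) * 5

# Deterministic policy: one action index per state
policy = rng.integers(n_actions, size=n_states)

def evaluate_policy_exact(P, R, policy, gamma):
    """Solve v = r_pi + gamma * P_pi @ v as a linear system (illustrative helper)."""
    idx = np.arange(P.shape[0])
    P_pi = P[idx, policy]                          # (S, S'): transitions under the policy
    r_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected one-step reward per state
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

v_pi = evaluate_policy_exact(P, R, policy, gamma)
print(v_pi)
```

For a handful of states the direct solve is convenient; the iterative sweep is the usual choice when the state space is large, since it avoids building and solving the full linear system.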

In [44]:
def policy_evaluation(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    curr_state_value_function: np.ndarray,
    policy: np.ndarray,
    gamma: float = 0.9,
    theta: float = 1e-6
) -> np.ndarray:
    updated_state_value_function = curr_state_value_function.copy()
    while True:
        delta = 0
        for curr_state in range(N_STATES):
            v_updated = sum([
                transition_probabilities[curr_state, policy[curr_state], s_prime] * 
                (rewards[curr_state, policy[curr_state], s_prime] + gamma * updated_state_value_function[s_prime])
                for s_prime in range(N_STATES)
            ])
            delta = max(delta, abs(v_updated - updated_state_value_function[curr_state]))
            updated_state_value_function[curr_state] = v_updated
        if delta < theta:
            break
    
    return updated_state_value_function

def policy_improvement(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    curr_state_value_function: np.ndarray,
    curr_action_value_function: np.ndarray,
    curr_policy: np.ndarray,
    gamma: float = 0.9
) -> tuple[np.ndarray, np.ndarray]:
    updated_policy = curr_policy.copy()
    updated_action_value_function = curr_action_value_function.copy()
    for s in range(N_STATES):
        action_value_function_for_state_s = updated_action_value_function[s].copy()
        for a in range(N_ACTIONS):
            new_action_value = sum([
                transition_probabilities[s, a, s_prime] * 
                (rewards[s, a, s_prime] + gamma * curr_state_value_function[s_prime])
                for s_prime in range(N_STATES)
            ])
            action_value_function_for_state_s[a] = new_action_value
        
        updated_policy[s] = np.argmax(action_value_function_for_state_s)
        updated_action_value_function[s] = action_value_function_for_state_s
    return updated_policy, updated_action_value_function

def policy_iteration(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    gamma: float = 0.9,
    theta: float = 1e-6
):
    curr_state_value_function: np.ndarray = np.zeros(N_STATES)
    curr_action_value_function: np.ndarray = np.zeros((N_STATES, N_ACTIONS))
    curr_policy: np.ndarray = np.zeros(N_STATES).astype(int)
    
    count_no_change = 0
    count_iters = 0
    while True:
        print(count_iters)
        updated_state_value_function = policy_evaluation(
            transition_probabilities,
            rewards,
            curr_state_value_function,
            curr_policy,
            gamma,
            theta
        )
        updated_policy, updated_action_value_function = policy_improvement(
            transition_probabilities,
            rewards,
            updated_state_value_function,
            curr_action_value_function,
            curr_policy,
            gamma
        )
        if np.array_equal(curr_policy, updated_policy):
            count_no_change += 1
            if count_no_change >= 5:
                break
        else:
            count_no_change = 0

        curr_policy = updated_policy.copy()
        curr_action_value_function = updated_action_value_function.copy()
        curr_state_value_function = updated_state_value_function.copy()

        count_iters += 1
    
    return curr_policy, curr_action_value_function, curr_state_value_function

pi_star, q_star, v_star = policy_iteration(TRANSITION_PROBABILITIES, REWARDS)

print(f"""
    Optimal Policy: {pi_star}
    Action-Value: {q_star}
    State-Value: {v_star}
""")

0
1
2
3
4
5
6

    Optimal Policy: [12 16 18 19 17  4 16  6 19 17 17  2 10 14 18 19 17  5  7  9]
    Action-Value: [[30.94841478 31.32490423 32.19788999 31.9444741  31.79106206 31.84517046
  31.290961   31.28625888 31.67130119 31.75888851 31.64210072 31.40075868
  32.21000994 31.70431886 32.00087219 30.9218291  31.23823039 31.00114137
  31.35150682 31.38025686]
 [31.09281456 31.19694478 31.74030897 31.97479618 31.94372851 31.00817106
  31.06199313 31.43872014 31.59123965 31.17103483 31.83600553 31.7249459
  31.14808322 31.37639636 31.84617734 30.88066009 32.05620933 31.25515259
  31.48655157 31.54256258]
 [31.61237654 31.29259542 31.17586312 32.2678108  32.01706964 31.44388102
  32.08529495 31.07045071 31.49056397 31.71061504 31.02807655 30.90594336
  31.68019248 31.24952417 31.73166221 31.61723427 32.19383784 31.65359456
  32.53945021 31.81073227]
 [32.05617899 31.82333462 31.68052992 31.05107583 32.03523231 31.60465759
  31.81101442 32.20996754 31.45421427 31.10691636 31.48853139 31.

- To ascertain whether your policy iteration has converged correctly, check that the action values computed from the transition probabilities, rewards, and optimal state values match the values in your optimal Q-value array

In [None]:
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        q = sum([
            TRANSITION_PROBABILITIES[s, a, s_prime] *
            (REWARDS[s, a, s_prime] + GAMMA * v_star[s_prime])
            for s_prime in range(N_STATES)
        ])
        assert np.isclose(q, q_star[s, a])
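
The nested loops in the check above compute, for every state-action pair, a transition-probability-weighted sum over next states. As a hedged sketch, the same quantity can be written as a single `np.einsum` call. The arrays below are a small made-up MDP with an arbitrary value estimate `v`, used only to show that the two forms agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)       # each P[s, a, :] sums to 1
R = rng.random((n_states, n_actions, n_states)) * 5
v = rng.random(n_states)                # any state-value estimate

# q[s, a] = sum over s' of P[s, a, s'] * (R[s, a, s'] + gamma * v[s'])
# v broadcasts along the last (next-state) axis of R
q_vectorised = np.einsum("ijk,ijk->ij", P, R + gamma * v)

# Same quantity with explicit loops, for comparison
q_loops = np.zeros((n_states, n_actions))
for s in range(n_states):
    for a in range(n_actions):
        q_loops[s, a] = sum(
            P[s, a, s2] * (R[s, a, s2] + gamma * v[s2]) for s2 in range(n_states)
        )

print(np.allclose(q_vectorised, q_loops))
```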

### Value Iteration

- In policy iteration, notice we have 2 distinct steps
    - There is policy evaluation, which exhaustively uses given policy $\pi$ to update the state value function $v$ until convergence
    - Following which there is policy improvement, which uses the updated state value function $v$ to pick the best action $a$ for every state $s$

- The key difference between policy iteration and value iteration is how these steps are combined; the underlying Bellman update is the same. For value iteration:
    - We don't maintain an explicit policy. Instead, we focus on updating the state-value function $v$
    - Each sweep sets $v(s)$ to the maximum of the action values computed using the current $v$; the greedy policy (the argmax) is only read off once $v$ has converged

- Both approaches should converge to the same optimal state-value function. The cost profile typically differs, though: policy iteration usually needs fewer outer iterations, but each one contains a full policy evaluation, whereas value iteration runs many cheap sweeps

- TLDR; 
    - Policy Iteration does Loop(evaluate policy --> improve policy)
    - Value Iteration does Loop(update state value function) 
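
To make the "Loop(update state value function)" idea concrete, here is a minimal vectorised sketch of value iteration. It assumes a small made-up MDP (`P`, `R`) and a tolerance `theta`; these names and sizes are illustrative only. Each pass applies $v(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma v(s')]$ to all states at once.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, theta = 4, 3, 0.9, 1e-8
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)       # each P[s, a, :] sums to 1
R = rng.random((n_states, n_actions, n_states)) * 5

v = np.zeros(n_states)
while True:
    # Action values implied by the current state-value estimate
    q = np.einsum("ijk,ijk->ij", P, R + gamma * v)
    v_new = q.max(axis=1)               # greedy backup; no explicit policy is stored
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < theta:
        break

# The policy is only extracted at the end, from the converged values
greedy_policy = np.einsum("ijk,ijk->ij", P, R + gamma * v).argmax(axis=1)
print(greedy_policy, v)
```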

In [42]:
def value_iteration(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    gamma: float = 0.9,
    theta: float = 1e-6
) -> tuple[np.ndarray, np.ndarray]:
    curr_state_value_function = np.zeros(N_STATES)
    updated_state_value_function = np.zeros(N_STATES)
    while True:
        delta = 0
        for s in range(N_STATES):
            new_action_values = [
                sum([
                    transition_probabilities[s, a, s_prime] * 
                    (rewards[s, a, s_prime] + gamma * curr_state_value_function[s_prime])
                    for s_prime in range(N_STATES)
                ])
                for a in range(N_ACTIONS)
            ]
            updated_state_value_function[s] = max(new_action_values)
            delta = max(delta, abs(curr_state_value_function[s] - updated_state_value_function[s]))
        if delta < theta:
            break
            
        curr_state_value_function = updated_state_value_function.copy()
    
    policy = np.zeros(N_STATES).astype(int)
    for s in range(N_STATES):
        action_values = [
            sum(transition_probabilities[s, a, s_prime] * 
            (rewards[s, a, s_prime] + gamma * updated_state_value_function[s_prime])
            for s_prime in range(N_STATES))
            for a in range(N_ACTIONS)
        ]
        policy[s] = np.argmax(action_values)
    
    return policy, updated_state_value_function

In [None]:
value_iteration(TRANSITION_PROBABILITIES, REWARDS)

(array([12, 16, 18, 19, 17,  4, 16,  6, 19, 17, 17,  2, 10, 14, 18, 19, 17,
         5,  7,  9]),
 array([32.21000379, 32.0562032 , 32.53944408, 32.33846026, 32.17409677,
        32.48933021, 32.14463111, 32.2494927 , 32.26843013, 32.42455566,
        32.26554727, 32.31171317, 32.43795555, 32.0959939 , 32.64112159,
        32.31968229, 32.60391839, 32.1939142 , 32.53185021, 32.59173977]))

## Monte Carlo - Sampling-Based

- We've covered the DP approach, which looks at the fairly restrictive case of solving the Bellman Equations when the full MDP is known; i.e. you know both the transition probability, and the reward. This is typically unrealistic. Very few processes in reality will be kind enough to provide you both pieces of information perfectly. And even if you estimate both using data, the estimates carry some uncertainty

- Therefore, we need more general methods to solve for the optimal policy when the transition probabilities and rewards are not known

- Here, we consider the Monte-Carlo approach. 

- Note that the traditional Monte-Carlo approach assumes an **episodic task**. That is, the iteration will run for some time, and reach a terminal point.
    - You can manually impose a terminal point for Monte-Carlo, but note that this isn't a true Monte Carlo approach, but a **truncated** Monte Carlo

- Idea:
    - The idea of Monte Carlo is to simply let entire episodes play out multiple times, to find the long term average state-value function and hence the optimal policy
    - So you have a `step()` method that determines 2 things; (i) what is the next state you are in, and (ii) is this next state a terminal one?
    - Having done this, you now have 2 series recorded (i) a series of states, ending with a terminal state, and (ii) a series of rewards, indicating the reward you get for landing on that state
        - Note that the series of rewards is NOT the sum of your future rewards (i.e. not the true Q-value). It is the specific reward of reaching some state $S$
    - Now, for every point we land on in our array, we want to compute the discounted future rewards observed from that point, so we can derive a proper state value
    - Init an array to hold the state-value function $V$
    - We'll compute this in reverse order of the rewards array;
        - Init $G$ as the running total of the discounted rewards. This starts at 0
        - We know that the last state is terminal. Suppose this is the rightmost terminal state. 
            - We know that reaching this state gives a reward of 1. 
            - We also know that, this being the terminal state, there is no future reward. 
            - So the cumulative future reward $G = 1$
            - Since we know the rightmost state $N$ has value $G=1$, update its state-value function: $V[N] \leftarrow V[N] + \alpha[G - V[N]]$
        - Moving to the second last state.
            - We know that this state has reward 0. Therefore, the value of this state comes only from the discounted value of transitioning to the last state and receiving the terminal reward 1.
            - Thus, update $G \leftarrow reward[N-1] + \gamma \cdot G$
            - Then, update the state value function: $V[N-1] \leftarrow V[N-1] + \alpha[G - V[N-1]]$
        - Iterate backwards until you reach the start of the episode
    - Continue to the next episode for a fixed number of episodes
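
The backward pass described above can be sketched on a single hand-written episode. The episode, the number of states, and the variable names below are made up for illustration; rewards are recorded against each state as in the description, with only the terminal state paying 1.

```python
gamma, alpha = 0.9, 0.1

# Hypothetical 3-step episode on a number line
visited_states = [2, 3, 4]      # state 4 is the rightmost, terminal state
rewards = [0.0, 0.0, 1.0]       # reward recorded for each visited state

V = [0.0] * 5                   # state-value estimates, initialised to 0
G = 0.0                         # running discounted return
returns = []

# Walk the episode in reverse so each return reuses the one computed after it
for state, reward in zip(reversed(visited_states), reversed(rewards)):
    G = reward + gamma * G
    returns.append(G)
    V[state] += alpha * (G - V[state])

returns.reverse()               # back to chronological order
print(returns)                  # approximately [0.81, 0.9, 1.0]
print(V)
```

After this single episode, each visited state's value has moved a fraction `alpha` of the way from 0 towards its observed return; repeating over many episodes averages out the randomness in the returns.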

In [None]:
import numpy as np
np.random.seed(0)

# Environment setup; a small random MDP with 5 states and 5 actions
# Every transition into the right-most state is given the maximum reward of 5
# Terminate if you reach the left or rightmost states
N_STATES = 5
N_ACTIONS = 5
GAMMA = 0.9 ## Discounting factor for future reward
ALPHA = 0.1 ## Learning rate
EPSILON = 0.5 ## Epsilon Greedy parameter

# Shape corresponding to State, Action, Next State
TRANSITION_PROBABILITIES = np.random.rand(N_STATES, N_ACTIONS, N_STATES)
# Normalise so that, for each (state, action) pair, the probabilities over next states sum to 1
TRANSITION_PROBABILITIES /= np.sum(TRANSITION_PROBABILITIES, axis=2, keepdims=True)

REWARDS = np.random.rand(N_STATES, N_ACTIONS, N_STATES) * 5
REWARDS[:, :, -1] = 5

EPISODES = 1000

TERMINAL_STATES = [0, N_STATES - 1]

In [245]:
from collections import namedtuple

episode_step = namedtuple('episode_step', ['curr_state', 'action', 'new_state', 'reward'])

def take_action(curr_state: int, action_values: np.ndarray) -> int:
    action_values_for_curr_state = action_values[curr_state, :]
    
    if np.random.rand() >= EPSILON:
        highest_action_value_action = np.argmax(action_values_for_curr_state)
        return highest_action_value_action
    else:
        random_action = np.random.choice(N_ACTIONS)
        return random_action

def monte_carlo_epsilon_greedy():
    action_values = np.zeros((N_STATES, N_ACTIONS))
    state_values = np.zeros(N_STATES)

    rewards_intermediate_store = {
        (s,a): [] for s in range(N_STATES) for a in range(N_ACTIONS)
    }
    for _ in range(EPISODES):
        curr_state = np.random.choice(list(range(1,N_STATES-1)))
        episode_history = []

        terminal_state_reached = curr_state in [0, N_STATES-1]
        while not terminal_state_reached:
            action = take_action(curr_state, action_values)
            new_state = np.random.choice(
                N_STATES, p=TRANSITION_PROBABILITIES[curr_state, action]
            )
            terminal_state_reached = new_state in [0, N_STATES-1]
            reward = REWARDS[curr_state, action, new_state]
            episode_history.append(
                episode_step(curr_state=curr_state, action=action, new_state=new_state, reward=reward)
            )
            curr_state = new_state
        
        return_val = 0
        for step in reversed(episode_history):
            return_val = step.reward + GAMMA * return_val
            rewards_intermediate_store[(step.curr_state, step.action)].append(return_val)
        
        for curr_state, action in rewards_intermediate_store.keys():
            action_values[curr_state, action] = np.mean(rewards_intermediate_store.get((curr_state, action))) if rewards_intermediate_store.get((curr_state, action)) != [] else 0
        
        state_values = np.max(action_values, axis=1)
    
    return state_values, action_values


In [246]:
monte_carlo_epsilon_greedy()

(array([ 0.        , 11.06340833,  9.65237593, 11.93171671,  0.        ]),
 array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 7.31191006, 11.06340833,  9.40697801,  7.43068361,  5.76089819],
        [ 8.37778159,  9.59896629,  9.65237593,  8.6572607 ,  7.6457071 ],
        [ 9.9728158 ,  8.30667043,  7.33269837,  9.40128791, 11.93171671],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]]))

## Temporal-Difference (TD) - Bootstrapped Learning

- The motivation for this is similar to the Monte Carlo approach; this is used when we can't solve the Bellman Equation because transition probabilities and/or rewards are not known

- Unlike Monte Carlo, which relies on the assumption that the task is episodic (i.e. terminates at some point), TD lets you update state value functions and policy incrementally, after each step. Therefore, TD can also be applied to tasks with no natural termination state

- We'll cover 2 forms of TD;
    - **TD(N)**: N-Step TD Learning
    - **TD($\lambda$)**: $\lambda$ Return TD Learning

### N-Step TD

- In the N-Step TD, we want to update our state value function $V(S_t)$ based on some observed reward plus the estimated value of the next state.
    - The $N$ simply tells you how many observed rewards we use in updating our current state value $V(S_t)$

- For example, in `TD(0)`, the updated state-value is the current state-value plus some learning rate multiplied by the observed deviation from your current state-value. The observed deviation is the return you experience, $R_t + \gamma V(S_{t+1})$, minus the current estimated expected return $V(S_t)$
\begin{aligned}
    V(S_t) &= V(S_t) + \alpha (\hat{G_t^{(1)}} - V(S_t)) \\
    &= V(S_t) + \alpha (R_t + \gamma V(S_{t+1}) - V(S_t))
\end{aligned}
    - Here, we update $V(S_t)$ at every step. The true return $G_t$ is not yet known at that point, so $\hat{G_t^{(1)}}$ is our best guess given one observed reward
    - This is why some tutorials mention that the `TD(N)` approach requires bootstrapping; it comes from the fact that $\hat{G_t^{(1)}} = R_t + \gamma V(S_{t+1})$ is an estimate based on $V(S_{t+1})$, which is itself an estimate
    - This differs from something like Monte Carlo, for example, where episodes are played out till the end, and we do not need to bootstrap because we use the actual rewards $R_t, R_{t+1}, ...$ to compute our state-value function instead of some intermediate state-value estimate $V(S_{t+1})$

- Extending this idea, for `TD(N)`, we accumulate $N$ steps of observed rewards $R_t, R_{t+1}, ..., R_{t+N-1}$ plus the discounted estimated state value $V(S_{t+N})$ to form the N-step target $\hat{G_t^{(N)}}$. The **TD error** is the gap between this target and the current estimate $V(S_t)$
\begin{aligned}
    V(S_t) &= V(S_t) + \alpha (\hat{G_t^{(N)}} - V(S_t)) \\
    &= V(S_t) + \alpha (\sum_{i=0}^{N-1} \gamma^{i} R_{t+i} +  \gamma^N V(S_{t+N}) - V(S_t))
\end{aligned}
    - Fun fact; if N is the episode length, then this is simply Monte Carlo!

- The intuition here is: if $V(S_t)$ has converged, the TD error should be zero on average, because the expected N-step target equals the current state value. So $V(S_t)$ stays unchanged in expectation
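
As a small worked example of the targets above, the sketch below computes a 1-step and a 3-step target from the same hand-written reward sequence. The helper `n_step_return` and all the numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of the observed rewards plus the discounted bootstrap value."""
    G = bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G
    return G

gamma, alpha = 0.9, 0.1
rewards = [1.0, 0.0, 2.0]     # hypothetical R_t, R_{t+1}, R_{t+2}
V_t = 0.5                     # current estimate of V(S_t)
V_t_plus_1 = 3.0              # current estimate of V(S_{t+1})
V_t_plus_3 = 0.0              # S_{t+3} is assumed terminal, so its value is 0

# 1-step target: R_t + gamma * V(S_{t+1}) = 1 + 0.9 * 3.0
G1 = n_step_return(rewards[:1], V_t_plus_1, gamma)

# 3-step target: 1 + 0.9 * 0 + 0.81 * 2 + 0.729 * V(S_{t+3})
G3 = n_step_return(rewards, V_t_plus_3, gamma)

# TD(0)-style update of V(S_t) towards the 1-step target
V_t_updated = V_t + alpha * (G1 - V_t)
print(G1, G3, V_t_updated)    # approximately 3.7, 2.62, 0.82
```

Because the 3-step target here runs to a terminal state with zero bootstrap value, it coincides with the Monte Carlo return for this episode.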

In [1]:
import numpy as np
np.random.seed(0)

# Environment setup; a small random MDP with 5 states and 5 actions
# Transition probabilities and rewards are drawn at random
# There are no terminal states here; each trajectory is cut off after a fixed number of steps
N_STATES = 5
N_ACTIONS = 5
GAMMA = 0.9 ## Discounting factor for future reward
ALPHA = 0.1 ## Learning rate
EPSILON = 0.5 ## Epsilon Greedy parameter

# Shape corresponding to State, Action, Next State
TRANSITION_PROBABILITIES = np.random.rand(N_STATES, N_ACTIONS, N_STATES)
# Normalise so that, for each (state, action) pair, the probabilities over next states sum to 1
TRANSITION_PROBABILITIES /= np.sum(TRANSITION_PROBABILITIES, axis=2, keepdims=True)

REWARDS = np.random.rand(N_STATES, N_ACTIONS, N_STATES) * 5

EPISODES = 1000

In [20]:
import numpy as np
from collections import namedtuple

episode_step = namedtuple('episode_step', ['state', 'reward', 'next_state'])

def n_step_td(TD_N: int):
    state_values = np.zeros(N_STATES)

    for _ in range(EPISODES):
        # start new episode
        state = np.random.choice(N_STATES)
        episode = []

        # generate a trajectory of exactly TD_N steps
        for _ in range(TD_N):
            # sample an action uniformly at random, then sample next_state from
            # the transition probabilities (for state-value prediction, we assume a random policy)
            action = np.random.randint(N_ACTIONS)
            next_state = np.random.choice(
                N_STATES, p=TRANSITION_PROBABILITIES[state, action]
            )
            reward = REWARDS[state, action, next_state]
            episode.append(episode_step(state, reward, next_state))
            state = next_state

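        # --- compute N-step TD update ---
        # bootstrap from final state's current value estimate; walking backwards, the first
        # state in the trajectory gets the full N-step return, later states get shorter ones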
        G = state_values[state]  # V(S_{t+n})
        for step in reversed(episode):
            G = step.reward + GAMMA * G
            s = step.state
            state_values[s] += ALPHA * (G - state_values[s])

    return state_values


In [21]:
n_step_td(5)

array([24.66800009, 23.44499357, 24.20170967, 24.79193746, 24.14574764])

### $\lambda$ Return TD

- 

# Solution Methods
---

## 3. Temporal-Difference (TD) – Bootstrapped Learning

- Combines DP and MC:
  - Like DP: updates use existing estimates (bootstrapping).  
  - Like MC: learns from sampled experience, not full model.
- **TD(0) update rule (state-value)**:

\begin{aligned}
    V(s_t) \leftarrow V(s_t) + \alpha [ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) ]
\end{aligned}


- **Pros**: can learn online, before episode ends; lower variance than MC.  
- **Cons**: introduces bias due to bootstrapping; sensitive to step-size α.

- **TD control methods**: SARSA (on-policy), Q-learning (off-policy).
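
To illustrate the online nature of the TD(0) update, here is a minimal sketch on a made-up deterministic chain `0 -> 1 -> 2`, where state 2 is terminal and only the last transition pays a reward of 1. The chain, the `transitions` table, and the episode count are assumptions for illustration. With $\gamma = 0.9$ the true values are $V(1) = 1$ and $V(0) = 0.9$.

```python
gamma, alpha = 0.9, 0.1

# Hypothetical deterministic chain: state -> (next_state, reward)
transitions = {0: (1, 0.0), 1: (2, 1.0)}
terminal = 2
V = [0.0, 0.0, 0.0]            # V[terminal] is never updated and stays 0

for _ in range(500):           # episodes
    s = 0
    while s != terminal:
        s_next, r = transitions[s]
        td_target = r + gamma * V[s_next]     # bootstraps from the current estimate
        V[s] += alpha * (td_target - V[s])    # update happens before the episode ends
        s = s_next

print(V)                       # close to [0.9, 1.0, 0.0]
```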

---

## 4. Policy Gradient – Direct Optimization

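- Instead of computing value functions, **directly parameterize the policy** $\pi_\theta(a|s)$ and optimize expected return:

\begin{aligned}
    J(\theta) = \mathbb{E}_{\pi_\theta} [ G_t ]
\end{aligned}

- **Update rule (gradient ascent)**:

\begin{aligned}
    \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
\end{aligned}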


- **Pros**: handles continuous action spaces naturally; can represent stochastic policies.  
- **Cons**: high variance in gradients; slower convergence.
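
The gradient-ascent rule can be sketched on the simplest possible case: a one-step problem with three actions and a softmax policy. This is an assumption-heavy toy, not a full policy gradient algorithm. The expected rewards in `arm_rewards` are made up, and the gradient is computed exactly from those known rewards rather than estimated from samples, which sidesteps the variance issue noted above.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))          # shift for numerical stability
    return z / z.sum()

arm_rewards = np.array([1.0, 2.0, 0.5])   # hypothetical expected reward per action
theta = np.zeros(3)                        # policy parameters: one logit per action
alpha = 0.5                                # step size

for _ in range(200):
    pi = softmax(theta)
    J = pi @ arm_rewards                   # expected return under the current policy
    grad_J = pi * (arm_rewards - J)        # exact gradient of J w.r.t. the logits
    theta = theta + alpha * grad_J         # gradient ascent step

final_policy = softmax(theta)
print(final_policy)                        # most of the probability sits on action 1
```

In practice the expected rewards are unknown, so the gradient is estimated from sampled returns, which is where the high variance comes from.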

---

## 5. Actor–Critic Overview

- Combines **value-based** and **policy-based** methods:
  - **Actor**: updates the policy (like policy gradient).  
  - **Critic**: estimates value function (like TD) to reduce variance of updates.
- Intuition: the critic **guides** the actor by telling it whether actions are good or bad.

---

## 6. Key Takeaways

| Method | Model Needed | Online Learning | Bias vs Variance | Action Space |
|--------|--------------|----------------|----------------|--------------|
| Dynamic Programming | Yes | No | None | Discrete |
| Monte Carlo | No | Episodic | Low bias, high variance | Discrete/Continuous |
| Temporal-Difference | No | Yes | Some bias, lower variance | Discrete/Continuous |
| Policy Gradient | No | Yes | High variance | Continuous-friendly |
| Actor–Critic | No | Yes | Balanced bias/variance | Continuous-friendly |

- Choice depends on **model availability**, **task type**, and **action space**.
- These methods form the backbone for **practical RL algorithms** like DQN, PPO, A3C, etc.

---

*Next: Practical Implementation — coding examples, environments, and debugging tips.*
