# Approaches to Solve the Bellman Equation

- We have established the basic setup for Reinforcement Learning by introducing the Bellman Equations, the fixed-point (equilibrium) equations satisfied by the State-Value and Action-Value functions

- Now we look into **how** the Bellman Equations can be solved in practice

$$
\begin{aligned}
    &\text{State-Value Function:} \\
    &\quad v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma v_{\pi}(s') \right] \\[1ex]
    &\text{Action-Value Function:} \\
    &\quad q_{\pi}(s,a) = \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a') \right]
\end{aligned}
$$

## Dynamic Programming (DP) - Model Based

- The simplest case of solving the Bellman Equation is when you have **both** the transition probabilities $P(s'|s,a)$ and the rewards $R(s,a,s')$

- That is, you know the full Markov Decision Process (MDP)

- This is, of course, exceedingly rare. But it is instructive to see how to solve for the optimal policy $\pi^*$ when all the relevant information is available

- We will go through 2 ways of implementing the DP solver
    - Policy Iteration
    - Value Iteration

- To do this, let's assume the following set up:
    - States $S = \{0, 1, \dots, 19\}$ (`N_STATES = 20`)
    - Actions $A = \{0, 1, \dots, 19\}$ (`N_ACTIONS = 20`)
    - Transitions and rewards are known (here they are randomly generated)
    - Discount Factor $\gamma = 0.9$


In [29]:
import numpy as np

np.random.seed(123)

N_STATES = 20
N_ACTIONS = 20
GAMMA = 0.9

# Shape corresponding to State, Action, Next State
TRANSITION_PROBABILITIES = np.random.rand(N_STATES, N_ACTIONS, N_STATES)
# Normalise so that, for each (state, action) pair, the probabilities over next states sum to 1
TRANSITION_PROBABILITIES /= np.sum(TRANSITION_PROBABILITIES, axis=2, keepdims=True)

REWARDS = np.random.rand(N_STATES, N_ACTIONS, N_STATES) * 5

### Policy Iteration

- In policy iteration, we create a loop that does (i) policy evaluation, and (ii) policy improvement based on the evaluated outcomes of the policy. The loop runs until a stopping criterion is reached

- **Policy Evaluation:** 
    - Since we know the transition probabilities and rewards with full certainty, we can update the State-Value function by computing the Bellman Equation
    
    - Start with a seed array representing the State-Value function $V$, and a policy $\pi$ that deterministically tells us what action to take in a given state $S$
    
    - Compute 1 pass of the Bellman State-Value update for all states. For each state, this is just a transition-probability-weighted average of the immediate reward plus the discounted state value of the next state $S'$, under the action chosen by $\pi$

    - Check whether the largest change in the state value estimates across all $N$ states exceeds some threshold $\theta$. If even the largest change falls below the threshold, we assume convergence, and we return the state values $V$

    - Otherwise, keep looping

- **Policy Improvement**
    - We've previously updated our estimate of the State-Value function $V$
    - Now, we need to update our policy based on the new estimate of the State-Value function. That is, given a state $S$, we want to know the value of taking each action $A$: the **Q-Value**

    - Looping over every state $S$, we init an array to hold the new Q values

    - Given $S$, loop over every available action and update the Action-Value function $Q$. This is simply the transition-probability-weighted average of the rewards from the transition $S \to S'$, plus the discounted state-value of $S'$

    - Finally, the new policy for state $S$ is simply the action with the largest Q value (the argmax over actions)

- **Policy Iteration**
    - Bringing these 2 together, we simply create an infinite loop of `Evaluation --> Improvement --> Evaluation ...`

    - This will keep going until some convergence criterion is met. An easy one is: if the policy does not change for some $N$ consecutive loops, we have reached convergence
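
Because the policy is held fixed during evaluation, the Bellman equation for $v_\pi$ is linear in the unknown state values, so for a small MDP the evaluation step described above can also be done with a single linear solve instead of repeated sweeps. The following is only a sketch under stated assumptions: the function name `evaluate_policy_exactly` and the 3-state, 2-action random MDP are hypothetical and chosen for illustration, and the arrays follow the `(state, action, next_state)` layout used in this section.

```python
import numpy as np

def evaluate_policy_exactly(P, R, policy, gamma=0.9):
    """Solve v = r_pi + gamma * P_pi @ v for a deterministic policy.

    P, R: arrays of shape (S, A, S'); policy: integer array of shape (S,).
    """
    n_states = P.shape[0]
    states = np.arange(n_states)
    P_pi = P[states, policy]                         # (S, S') transitions under the policy
    r_pi = np.sum(P_pi * R[states, policy], axis=1)  # expected one-step reward per state
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical toy MDP: 3 states, 2 actions, random dynamics and rewards
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)  # each (state, action) row sums to 1
R = rng.random((3, 2, 3)) * 5
policy = np.array([0, 1, 0])
v = evaluate_policy_exactly(P, R, policy)
print(v)
```

The linear solve costs roughly cubic time in the number of states, so the iterative sweep is usually the better choice once the state space grows.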

In [None]:
def policy_evaluation(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    curr_state_value_function: np.ndarray,
    policy: np.ndarray,
    gamma: float = 0.9,
    theta: float = 1e-6
) -> np.ndarray:
    n_states = transition_probabilities.shape[0]
    updated_state_value_function = curr_state_value_function.copy()
    while True:
        # Largest change in any state's value during this sweep
        delta = 0.0
        for curr_state in range(n_states):
            action = policy[curr_state]
            # Bellman expectation backup under the (deterministic) policy
            v_updated = sum(
                transition_probabilities[curr_state, action, s_prime]
                * (rewards[curr_state, action, s_prime] + gamma * updated_state_value_function[s_prime])
                for s_prime in range(n_states)
            )
            delta = max(delta, abs(v_updated - updated_state_value_function[curr_state]))
            # In-place update: later states in this sweep already see the new value
            updated_state_value_function[curr_state] = v_updated
        if delta < theta:
            break

    return updated_state_value_function

def policy_improvement(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    curr_state_value_function: np.ndarray,
    curr_action_value_function: np.ndarray,
    curr_policy: np.ndarray,
    gamma: float = 0.9
) -> tuple[np.ndarray, np.ndarray]:
    n_states, n_actions = transition_probabilities.shape[:2]
    updated_policy = curr_policy.copy()
    updated_action_value_function = curr_action_value_function.copy()
    for s in range(n_states):
        for a in range(n_actions):
            # Q(s, a): expected reward plus discounted value of the next state
            updated_action_value_function[s, a] = sum(
                transition_probabilities[s, a, s_prime]
                * (rewards[s, a, s_prime] + gamma * curr_state_value_function[s_prime])
                for s_prime in range(n_states)
            )
        # Greedy improvement: pick the action with the largest Q value
        updated_policy[s] = np.argmax(updated_action_value_function[s])
    return updated_policy, updated_action_value_function

def policy_iteration(
    transition_probabilities: np.ndarray,
    rewards: np.ndarray,
    gamma: float = 0.9,
    theta: float = 1e-6
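) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    # Infer the problem size from the inputs rather than from module-level constants
    n_states, n_actions = transition_probabilities.shape[:2]
    curr_state_value_function: np.ndarray = np.zeros(n_states)
    curr_action_value_function: np.ndarray = np.zeros((n_states, n_actions))
    curr_policy: np.ndarray = np.zeros(n_states, dtype=int)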
    
    count_no_change = 0
    count_iters = 0
    while True:
        print(count_iters)
        updated_state_value_function = policy_evaluation(
            transition_probabilities,
            rewards,
            curr_state_value_function,
            curr_policy,
            gamma,
            theta
        )
        updated_policy, updated_action_value_function = policy_improvement(
            transition_probabilities,
            rewards,
            updated_state_value_function,
            curr_action_value_function,
            curr_policy,
            gamma
        )
        if np.array_equal(curr_policy, updated_policy):
            count_no_change += 1
            if count_no_change >= 5:
                break
        else:
            count_no_change = 0

        curr_policy = updated_policy.copy()
        curr_action_value_function = updated_action_value_function.copy()
        curr_state_value_function = updated_state_value_function.copy()

        count_iters += 1
    
    return curr_policy, curr_action_value_function, curr_state_value_function

pi_star, q_star, v_star = policy_iteration(TRANSITION_PROBABILITIES, REWARDS)

print(f"""
    Optimal Policy: {pi_star}
    Action-Value: {q_star}
    State-Value: {v_star}
""")

- As a sanity check on the converged result, verify that the action values recomputed from the transition probabilities, rewards, and optimal state values match the values in your optimal Q-value array

In [None]:
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        q = sum([
            TRANSITION_PROBABILITIES[s, a, s_prime] *
            (REWARDS[s, a, s_prime] + GAMMA * v_star[s_prime])
            for s_prime in range(N_STATES)
        ])
        assert np.isclose(q, q_star[s, a])
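
A complementary check is the Bellman optimality condition itself: at the optimum, each state value should equal the largest action value computed from it, $v(s) = \max_a q(s, a)$. The sketch below is an assumption-laden illustration rather than part of the original notebook: `bellman_optimality_gap` is a hypothetical helper, and the one-state MDP is chosen only because its optimal value, $1 / (1 - 0.9) = 10$, is easy to verify by hand.

```python
import numpy as np

def bellman_optimality_gap(P, R, v, gamma=0.9):
    """Largest violation of v(s) = max_a q(s, a), with q computed from v.

    P, R: arrays of shape (S, A, S'); v: array of shape (S,).
    """
    # q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * v[s'])
    q = np.einsum("ijk,ijk->ij", P, R + gamma * v[None, None, :])
    return float(np.max(np.abs(q.max(axis=1) - v)))

# Hypothetical one-state, one-action MDP that always pays reward 1
P = np.ones((1, 1, 1))
R = np.ones((1, 1, 1))
gap_at_optimum = bellman_optimality_gap(P, R, np.array([10.0]))  # close to 0
gap_at_zero = bellman_optimality_gap(P, R, np.array([0.0]))      # 1.0
print(gap_at_optimum, gap_at_zero)
```

Applied to the arrays in this section, `bellman_optimality_gap(TRANSITION_PROBABILITIES, REWARDS, v_star, GAMMA)` should come out small, comparable to the evaluation threshold `theta`.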

### Value Iteration

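- In value iteration, the idea is almost identical. We are doing the same update of state value, but instead of following the action chosen by a fixed policy, each update takes the maximum over actions. Evaluation and improvement are thereby folded into a single sweep, and the greedy policy is read off once the values have converged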
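
A minimal sketch of value iteration, assuming the same `(state, action, next_state)` array layout as the setup earlier in this section. Each sweep backs up every state with the maximum over actions, and the greedy policy is extracted only once the values stop changing. The function name `value_iteration`, the vectorised `einsum` backup, and the toy 3-state MDP are illustrative choices, not the notebook's own implementation.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """P, R: arrays of shape (S, A, S'). Returns (greedy policy, Q, V)."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * v[s'])
        q = np.einsum("ijk,ijk->ij", P, R + gamma * v[None, None, :])
        v_new = q.max(axis=1)               # Bellman optimality backup
        delta = np.max(np.abs(v_new - v))   # largest change in this sweep
        v = v_new
        if delta < theta:
            break
    # Recompute Q from the final values so that Q and V are consistent
    q = np.einsum("ijk,ijk->ij", P, R + gamma * v[None, None, :])
    return q.argmax(axis=1), q, v

# Hypothetical toy MDP: 3 states, 2 actions, random dynamics and rewards
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2, 3)) * 5
pi_vi, q_vi, v_vi = value_iteration(P, R)
print(pi_vi, v_vi)
```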

# Solution Methods
---

## 2. Monte Carlo (MC) – Sampling-Based, Episodic

- **Learns from experience** rather than knowing the model.
- **Key idea**: estimate value functions by averaging returns from multiple episodes.
- Only works well for **episodic tasks** (episodes must terminate).  
- **Pros**: no need for environment model; simple conceptually.  
- **Cons**: high variance; inefficient for long episodes; must wait until episode ends.
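
To make the averaging idea concrete, here is a hedged sketch of first-visit Monte Carlo prediction. The environment is an assumption made for illustration: a 5-state random walk that starts in the middle, moves left or right with equal probability, terminates when it steps off either end, and pays reward 1 only when it exits on the right. With $\gamma = 1$ the true state values of this walk are $1/6, 2/6, \dots, 5/6$, which gives something to compare the estimates against. The names `run_episode` and `first_visit_mc` are hypothetical.

```python
import numpy as np

def run_episode(rng, n_states=5):
    """Return a list of (state, reward) pairs for one random-walk episode."""
    state = n_states // 2
    trajectory = []
    while True:
        next_state = state + (1 if rng.random() < 0.5 else -1)
        reward = 1.0 if next_state == n_states else 0.0  # pay only on the right exit
        trajectory.append((state, reward))
        if next_state < 0 or next_state >= n_states:
            return trajectory
        state = next_state

def first_visit_mc(n_episodes=5000, n_states=5, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    returns_sum = np.zeros(n_states)
    visit_count = np.zeros(n_states)
    for _ in range(n_episodes):
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g accumulates the discounted return from each step;
        # the last write for a state corresponds to its earliest visit.
        for state, reward in reversed(run_episode(rng, n_states)):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            visit_count[state] += 1
    # Average the first-visit returns (guarding against unvisited states)
    return returns_sum / np.maximum(visit_count, 1)

estimates = first_visit_mc()
print(estimates)  # should be close to [1/6, 2/6, 3/6, 4/6, 5/6]
```

Note that no value is updated until the episode has finished, which is exactly the "must wait until episode ends" limitation listed above.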

---

## 3. Temporal-Difference (TD) – Bootstrapped Learning

- Combines DP and MC:
  - Like DP: updates use existing estimates (bootstrapping).  
  - Like MC: learns from sampled experience, not full model.
- **TD(0) update rule (state-value)**:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$


- **Pros**: can learn online, before episode ends; lower variance than MC.  
- **Cons**: introduces bias due to bootstrapping; sensitive to step-size α.

- **TD control methods**: SARSA (on-policy), Q-learning (off-policy).
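
The TD(0) rule above can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `td0_update` and `td0_random_walk` are hypothetical names, and the environment is an assumed 5-state random walk (start in the middle, step left or right with equal probability, reward 1 only when exiting on the right, $\gamma = 1$), whose true values are $1/6, \dots, 5/6$.

```python
import numpy as np

def td0_update(v, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to v[s] in place and return the TD error."""
    # A terminal next state has value 0, so the target is just the reward
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]
    v[s] += alpha * td_error
    return td_error

def td0_random_walk(n_episodes=5000, n_states=5, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        s = n_states // 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next == n_states else 0.0
            td0_update(v, s, r, s_next, done, alpha)  # update after every single step
            if done:
                break
            s = s_next
    return v

values = td0_random_walk()
print(values)  # should be roughly [1/6, 2/6, 3/6, 4/6, 5/6]
```

Unlike a Monte Carlo estimate, each value is adjusted immediately after every step using the current estimate of the next state, which is the bootstrapping described above.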

---

## 4. Policy Gradient – Direct Optimization

- Instead of computing value functions, **directly parameterize the policy** $\pi_\theta(a|s)$ and optimize expected return:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ G_t \right]$$

- **Update rule (gradient ascent)**:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$


- **Pros**: handles continuous action spaces naturally; can represent stochastic policies.  
- **Cons**: high variance in gradients; slower convergence.

---

## 5. Actor–Critic Overview

- Combines **value-based** and **policy-based** methods:
  - **Actor**: updates the policy (like policy gradient).  
  - **Critic**: estimates value function (like TD) to reduce variance of updates.
- Intuition: the critic **guides** the actor by telling it whether actions are good or bad.
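
A tabular one-step actor-critic update can be sketched as follows. This is an assumption-heavy illustration: `actor_critic_step` and `run_actor_critic` are hypothetical names, the actor is a per-state softmax over preferences `theta`, the critic is a table `v` updated by TD(0), and the demo task is a single-state episodic problem with two actions paying 0 and 1.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=0.9):
    """One in-place update. theta: (S, A) preferences, v: (S,) state values."""
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]            # the critic's judgement of the action
    v[s] += beta * td_error             # critic: TD(0) update
    grad_log_pi = -softmax(theta[s])    # gradient of log softmax: one-hot(a) - pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * td_error * grad_log_pi  # actor: step scaled by the TD error
    return td_error

def run_actor_critic(rewards=(0.0, 1.0), n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((1, len(rewards)))
    v = np.zeros(1)
    for _ in range(n_steps):
        a = rng.choice(len(rewards), p=softmax(theta[0]))
        # Every episode ends after one step, so done=True and s_next is unused
        actor_critic_step(theta, v, 0, a, rewards[a], 0, True)
    return softmax(theta[0]), v

policy, values = run_actor_critic()
print(policy, values)
```

The only difference from a plain policy-gradient step is that the raw return is replaced by the TD error, so an action is reinforced only to the extent that it did better than the critic expected.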

---

## 6. Key Takeaways

| Method | Model Needed | Online Learning | Bias vs Variance | Action Space |
|--------|--------------|----------------|----------------|--------------|
| Dynamic Programming | Yes | No | None | Discrete |
| Monte Carlo | No | Episodic | Unbiased, high variance | Discrete/Continuous |
| Temporal-Difference | No | Yes | Some bias, lower variance | Discrete/Continuous |
| Policy Gradient | No | Yes | High variance | Continuous-friendly |
| Actor–Critic | No | Yes | Balanced bias/variance | Continuous-friendly |

- Choice depends on **model availability**, **task type**, and **action space**.
- These methods form the backbone for **practical RL algorithms** like DQN, PPO, A3C, etc.

---

*Next: Practical Implementation — coding examples, environments, and debugging tips.*
