# Dynamic Programming in Reinforcement Learning

This notebook provides an introduction to Dynamic Programming (DP) and its relevance to Reinforcement Learning (RL).

## Topics Covered:
- What is DP?
- Policy Evaluation
- Policy Improvement
- Value Iteration
- Implementation with simple examples

## 1. Introduction to Dynamic Programming
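Dynamic Programming (DP) is a family of algorithms for computing value functions and optimal policies when a complete model of the environment (its transition probabilities and rewards) is known, typically in the form of a Markov decision process (MDP). In Reinforcement Learning, DP is the foundation for methods in which an agent interacts with an environment to maximize rewards.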

### Bellman Equation
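The value of a state is the expected return when starting from that state and following a given policy. The Bellman equation expresses this value recursively, as the expected immediate reward plus the discounted value of the next state:

$ V(s) = \mathbb{E} [ R + \gamma V(s') ] $

Here $R$ is the immediate reward, $s'$ is the next state, and $\gamma \in [0, 1]$ is the discount factor.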
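
To make the expectation concrete, here is a minimal sketch of a single Bellman backup for one state. The outcome list, the state names `"A"` and `"B"`, and the value estimates are made-up numbers chosen for illustration only.

```python
# Hypothetical one-step model for a single state.
# Each entry is (probability, reward, next_state); the numbers are assumptions.
outcomes = [(0.8, 1.0, "A"), (0.2, 0.0, "B")]
V = {"A": 5.0, "B": 2.0}  # assumed current value estimates of the next states
gamma = 0.9  # discount factor

def bellman_backup(outcomes, V, gamma):
    """Return the expectation of r + gamma * V(s') over the listed outcomes."""
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in outcomes)

backup = bellman_backup(outcomes, V, gamma)
print("Backed-up value:", backup)
```

Each term weights "reward plus discounted next-state value" by the probability of that outcome, which is exactly what the expectation in the equation does.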

## 2. Policy Evaluation
Policy evaluation calculates the value function for a given policy. It iteratively updates state values using the Bellman expectation equation:

$ V(s) = \sum_a \pi(a|s) \sum_{s',r} P(s', r | s, a) [ r + \gamma V(s') ]$


In [None]:
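# Policy Evaluation Example (1-D gridworld with 4 states)
import numpy as np

gamma = 1.0  # Discount factor (undiscounted; safe here because every path ends at state 0)
V = np.zeros(4)  # Value estimate for each of the 4 states
rewards = np.array([0, -1, -1, 10])  # Reward collected in each state

# Fixed deterministic policy: from state s, always move to state s-1.
# State 0 has no successor, so its future value is taken as 0.
for _ in range(10):  # Sweep repeatedly; 10 sweeps is more than enough for 4 states
    for s in range(4):
        V[s] = rewards[s] + gamma * (V[s-1] if s > 0 else 0)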

print("State Values:", V)
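
The cell above hard-codes a single deterministic transition. As a sketch of the full expectation equation, the version below sums over actions weighted by $\pi(a|s)$ and stops once the largest update falls below a threshold. The chain dynamics (`step`), the terminal state, the "reward on entering a state" convention, `gamma=0.9`, and the threshold `theta` are all assumptions made for this illustration, not part of the example above.

```python
import numpy as np

N_STATES, TERMINAL = 4, 3
# Assumed convention: reward is received on entering a state.
ENTER_REWARD = np.array([0.0, -1.0, -1.0, 10.0])

def step(s, a):
    """Deterministic chain (assumed): action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s_next, ENTER_REWARD[s_next]

def evaluate_policy(pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation; pi[s, a] is the probability of action a in state s."""
    V = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # the terminal state keeps value 0
            v_new = 0.0
            for a in (0, 1):
                s_next, r = step(s, a)
                v_new += pi[s, a] * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop when no state changed by more than theta
            return V

# Policy that always picks action 1 (move right) in every state.
always_right = np.tile([0.0, 1.0], (N_STATES, 1))
V_right = evaluate_policy(always_right)
print("Values under 'always right':", V_right)
```

Stopping on a threshold rather than after a fixed number of sweeps makes the loop independent of the size of the state space.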

## 3. Policy Improvement
Once we have evaluated a policy, we can improve it by acting greedily with respect to its value function: in each state, choose the action with the highest expected one-step return.

$ \pi'(s) = \arg\max_a \sum_{s',r} P(s', r | s, a) [ r + \gamma V(s') ]$

In [None]:
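# Policy Improvement Example
policy = [0, 1, 1, 0]  # Example policy (action 0 = move to s-1, action 1 = stay in s)

def improve_policy(V):
    # For each state, pick the action whose resulting state has the higher value.
    # From state 0 there is no state to the left, so action 0 is scored as 0.
    return [int(np.argmax([V[s-1] if s > 0 else 0, V[s]])) for s in range(4)]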

new_policy = improve_policy(V)
print("Improved Policy:", new_policy)
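
The improvement step can also be sketched with an explicit one-step lookahead that includes the reward, rather than comparing state values alone. The chain dynamics, the terminal state, the reward convention, `gamma=0.9`, and the input values `V_given` below are assumptions made for this illustration.

```python
import numpy as np

N_STATES, TERMINAL = 4, 3
# Assumed convention: reward is received on entering a state.
ENTER_REWARD = np.array([0.0, -1.0, -1.0, 10.0])

def step(s, a):
    """Deterministic chain (assumed): action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s_next, ENTER_REWARD[s_next]

def greedy_policy(V, gamma=0.9):
    """For each state, return the action with the largest one-step lookahead value."""
    policy = np.zeros(N_STATES, dtype=int)
    for s in range(N_STATES):
        if s == TERMINAL:
            continue  # no decision is made in the terminal state
        q = []
        for a in (0, 1):
            s_next, r = step(s, a)
            q.append(r + gamma * V[s_next])
        policy[s] = int(np.argmax(q))
    return policy

# Assumed value estimates to improve against.
V_given = np.array([6.2, 8.0, 10.0, 0.0])
improved = greedy_policy(V_given)
print("Greedy policy:", improved)
```

Ties are broken toward the lower action index because `np.argmax` returns the first maximum.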

## 4. Value Iteration
Instead of alternating between full policy evaluation and policy improvement, value iteration computes the optimal value function directly by repeatedly applying the Bellman optimality update:

$ V(s) \leftarrow \max_a \sum_{s',r} P(s', r | s, a) [ r + \gamma V(s') ]$

In [None]:
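# Value Iteration Example
V = np.zeros(4)

# Assumed toy dynamics: two deterministic actions per state, move left (s-1)
# or right (s+1), clipped at the left wall. State 3 is terminal.
for _ in range(10):  # Sweep repeatedly; 10 sweeps is enough for 4 states
    for s in range(4):
        if s == 3:
            V[s] = rewards[s]  # Terminal: collect the reward and stop
        else:
            # Bellman optimality backup: best one-step lookahead over both actions
            V[s] = rewards[s] + gamma * max(V[max(s-1, 0)], V[s+1])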

print("Optimal State Values:", V)
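
As a more complete sketch, value iteration can run until the values stop changing and then read off a greedy policy from the result. The chain dynamics, the terminal state, the "reward on entering a state" convention, `gamma=0.9`, and the threshold `theta` are assumptions made for this illustration.

```python
import numpy as np

N_STATES, TERMINAL = 4, 3
# Assumed convention: reward is received on entering a state.
ENTER_REWARD = np.array([0.0, -1.0, -1.0, 10.0])

def step(s, a):
    """Deterministic chain (assumed): action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s_next, ENTER_REWARD[s_next]

def lookahead(s, V, gamma):
    """One-step lookahead value of each action in state s."""
    values = []
    for a in (0, 1):
        s_next, r = step(s, a)
        values.append(r + gamma * V[s_next])
    return values

def value_iteration(gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality update until the largest change is below theta."""
    V = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # the terminal state keeps value 0
            v_new = max(lookahead(s, V, gamma))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off a greedy policy from the converged values.
    policy = np.zeros(N_STATES, dtype=int)
    for s in range(N_STATES):
        if s != TERMINAL:
            policy[s] = int(np.argmax(lookahead(s, V, gamma)))
    return V, policy

V_star, pi_star = value_iteration()
print("Optimal values:", V_star)
print("Greedy policy (0 = left, 1 = right):", pi_star)
```

Unlike policy iteration, no explicit policy is stored during the sweeps; the policy is only extracted once at the end.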