## DP

Dynamic programming (DP) exploits the Markov property to find an optimal policy. The main idea is to store computed values in memory and reuse them instead of recomputing them over and over. For a finite MDP, DP methods are guaranteed to find the optimal policy and values. However, these methods require full knowledge of the MDP, that is, its transition probabilities and rewards.

Recall the MazeWorld environment from the MDP notebook. It is fully deterministic, but this is not the case for most problems. In this notebook, we will be working with StochasticMaze, which has the following properties:

- The agent does not move if the chosen action leads to a non-passable state.
- Transitions in the environment are stochastic.
- The intended action is executed 70% of the time, and each of the remaining 3 actions is executed with 10% probability.
- The environment terminates when the state with the goal is reached.
- The reward obtained during a transition depends only on the next state.
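
The stochastic action rule in the list above can be sketched as follows. This is only an illustration, not the `StochasticMaze` implementation: the function name `sample_executed_action`, the action indexing 0..3, and the reading that each of the three other actions gets 10% are assumptions.

```python
import numpy as np

N_ACTIONS = 4  # assumption: four movement actions, indexed 0..3


def sample_executed_action(intended, rng, p_intended=0.7):
    """Sample the action that is actually executed (hypothetical helper)."""
    # The non-intended actions share the remaining probability mass equally.
    probs = np.full(N_ACTIONS, (1.0 - p_intended) / (N_ACTIONS - 1))
    probs[intended] = p_intended
    return int(rng.choice(N_ACTIONS, p=probs))


rng = np.random.default_rng(0)
samples = [sample_executed_action(2, rng) for _ in range(10_000)]
# Empirical frequency of each executed action when action 2 is intended.
freq = np.bincount(samples, minlength=N_ACTIONS) / len(samples)
print(freq)
```

With enough samples, the frequency of the intended action should be close to 0.7 and the others close to 0.1 each.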

In order to use DP methods, we need to build a map of transitions. This map has a structure as given below:


```
transition_map[state][action] -> [(probability, neighbor_state, reward, termination), ...]
```

**Note** that each state-action pair points to a list of possible transitions, and each transition is a 4-tuple of probability, next state, reward, and termination (binary) values. Since the agent can transition into at most 4 different states, the list has at most 4 entries for each state-action pair.
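
To make the structure concrete, here is a minimal sketch that builds a transition map for a made-up one-dimensional corridor. It is not the maze from this notebook and not a solution for `make_transition_map`: the function `build_corridor_map`, the two-action layout, and the 80/20 slip probability are all hypothetical.

```python
def build_corridor_map(n_states=3, p_intended=0.8, goal_reward=1.0):
    """Transition map for a hypothetical corridor; the last state is the goal.

    Actions: 0 = left, 1 = right. With probability 1 - p_intended the
    opposite action is executed instead of the intended one.
    """
    goal = n_states - 1
    tmap = {}
    for s in range(goal):  # only non-terminal states need outgoing transitions
        tmap[s] = {}
        for a in (0, 1):
            transitions = []
            for executed, prob in ((a, p_intended), (1 - a, 1.0 - p_intended)):
                step = -1 if executed == 0 else 1
                nxt = min(max(s + step, 0), goal)  # the left wall keeps the agent in place
                done = nxt == goal
                reward = goal_reward if done else 0.0  # reward depends only on the next state
                transitions.append((prob, nxt, reward, done))
            tmap[s][a] = transitions
    return tmap


tmap = build_corridor_map()
# The probabilities of each state-action pair should sum to one.
totals = {(s, a): sum(p for p, *_ in tmap[s][a]) for s in tmap for a in tmap[s]}
print(tmap[1][1])
print(totals)
```

Each entry has the same `(probability, next state, reward, termination)` layout as described above; in this toy example every list has two entries rather than up to four.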

Let's render the initial board of the environment.
<span style="color:#989898">Dark gray cells</span> are impassable, while
<span style="color:#DADADA">light gray cells</span> are passable empty cells.
The <span style="color:#00B8FA">blue cell</span> represents the agent and
the <span style="color:#DADA22">golden cell</span> represents the goal.


In [9]:
%load_ext autoreload
%autoreload 2

import time
import numpy as np
import matplotlib.cm as cm

CELL_SIZE = 30



In [10]:
from rl_hw1.env.mazeworld import StochasticMaze

env = StochasticMaze(cell_size=CELL_SIZE)
state = env.reset()
board = env.board.copy()
state

env.step(1)

# To display the canvas, this must be the last line of the cell
env.init_render()


You can access the board of the game using ```env.board```. Note that the environment must be initialized at least once before the board is available.

**Question 1)** In the ```rl_hw1/dp/transition.py``` module, implement ```make_transition_map``` as described above.

We stated that values in a Markov chain can be evaluated iteratively. This procedure is called **Policy Evaluation**. To improve a policy, we first need to know the value of each state under that policy. Since we know everything about the MDP, we can iteratively compute the value function of a fixed policy.

The value of a state $s$ under a policy $\pi$ can be written as follows:
$$ V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$$ 
$$ V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s]$$ 
$$ V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) | S_t = s]$$
$$ V^\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)\big[r + \gamma V^\pi(s')\big]$$

Uppercase letters denote random variables, while lowercase letters denote their realized values.
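
The last equation can be turned directly into one synchronous update sweep over all states. The sketch below applies it to a hypothetical two-state toy MDP written in the transition-map format. The names `TMAP` and `policy_eval_sweep` and the toy numbers are assumptions for illustration, not the `DPAgent` interface.

```python
# Hypothetical toy MDP: states 0 and 1 are non-terminal, state 2 is the goal.
# Actions: 0 = left, 1 = right. Entries are (probability, next_state, reward, done).
TMAP = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(0.8, 0, 0.0, False), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, False)]},
}


def policy_eval_sweep(tmap, policy, values, gamma):
    """Apply the Bellman expectation backup once to every state."""
    new_values = {}
    for s, actions in tmap.items():
        total = 0.0
        for a, transitions in actions.items():
            for prob, nxt, reward, done in transitions:
                # A terminal next state contributes no future return.
                future = 0.0 if done else values[nxt]
                total += policy[s][a] * prob * (reward + gamma * future)
        new_values[s] = total
    return new_values


uniform = {s: {a: 0.5 for a in TMAP[s]} for s in TMAP}  # random policy
values = {s: 0.0 for s in TMAP}
first_sweep = policy_eval_sweep(TMAP, uniform, values, gamma=0.9)
for _ in range(200):
    values = policy_eval_sweep(TMAP, uniform, values, gamma=0.9)
print(first_sweep)
print(values)
```

After the first sweep only the state next to the goal has a non-zero value; repeated sweeps let the value spread to the other state, which is the diffusion effect visualized later in this notebook.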

**Question 2)** In the third equation, $G_{t+1}$ is replaced with $V^\pi(S_{t+1})$. Why is it not replaced with $V^\pi(s')$?

Because the third equation is still an expectation over the next state. $S_{t+1}$ is a random variable whose value is not known at time $t$, whereas $s'$ denotes one particular realization of it. The realization $s'$ appears only once the expectation is written out explicitly, as in the fourth equation, where we sum over all actions $a$, next states $s'$, and rewards $r$, weighting each term by $\pi(a|s)\, p(s', r|s, a)$. Inside the expectation, the next state therefore has to be written as a random variable.

**Question 3)** In the ```rl_hw1/dp/methods.py``` module, implement ```one_step_policy_eval``` in the ```DPAgent``` class.

Let's test your implementation to see if we can calculate the values. The values will be drawn on the right side of the canvas. Remember that the initial policy is random, so we expect to see some sort of diffusion. Let's run ```one_step_policy_eval``` 40 times.

**Question 4)** To test your policy evaluation code, we need a policy. So, in the ```rl_hw1/dp/methods.py``` module, implement ```policy``` in the ```DPAgent``` class.

In [11]:
env = StochasticMaze(cell_size=CELL_SIZE)

# To display the canvas, this must be the last line of the cell
env.init_render()


In [13]:
from rl_hw1.dp import make_transition_map, DPAgent

tmap = make_transition_map(board)
agent = DPAgent(4, tmap)

# color map for values
cmap = cm.get_cmap("viridis", 100)

# initiating a 2D value image
values = np.zeros_like(board)

for i in range(40):
    agent.one_step_policy_eval(gamma=0.95)
    
    # filling the value image
    for key, val in agent.values.items():
        values[key] = val * 100
    
    # painting new values
    env._renderer.draw(values, 0, CELL_SIZE * (board.shape[1] + 1), cmap)
    time.sleep(1 / 10)

Now we can calculate updated values. The next step is to improve the policy according to these values. We call this procedure **policy improvement**. At each state, we change the action probabilities so that the policy favors the actions with the highest expected return.
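
One common way to do this is to make the policy greedy with respect to a one-step lookahead on the current values. The sketch below shows that variant on a hypothetical toy MDP. The names `TMAP`, `q_value`, and `improve_policy` are assumptions, and the greedy rule is one possible choice rather than the required `DPAgent` behavior.

```python
# Hypothetical toy MDP: states 0 and 1 are non-terminal, state 2 is the goal.
# Actions: 0 = left, 1 = right. Entries are (probability, next_state, reward, done).
TMAP = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(0.8, 0, 0.0, False), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, False)]},
}


def q_value(tmap, values, s, a, gamma):
    """Expected return of taking action a in state s, then following the values."""
    return sum(prob * (reward + gamma * (0.0 if done else values[nxt]))
               for prob, nxt, reward, done in tmap[s][a])


def improve_policy(tmap, values, gamma):
    """Return a deterministic policy that is greedy with respect to the values."""
    policy = {}
    for s, actions in tmap.items():
        q = {a: q_value(tmap, values, s, a, gamma) for a in actions}
        best = max(q, key=q.get)  # ties are broken by the first maximal action
        policy[s] = {a: 1.0 if a == best else 0.0 for a in actions}
    return policy


values = {0: 0.0, 1: 0.5}  # values after one evaluation sweep of a random policy
policy = improve_policy(TMAP, values, gamma=0.9)
print(policy)
```

Here both states end up preferring action 1 (right), since it has the larger one-step lookahead value in each state.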

**Question 5)** In the ```rl_hw1/dp/methods.py``` module, implement ```policy_improvement``` in the ```DPAgent``` class.

**Question 6)** Why do we need to re-evaluate the values whenever we improve the policy? Is it possible to find the optimal values (the values of the optimal policy) without using policy improvement? Explain your answer.

We now have everything necessary to move on to the Policy Iteration and Value Iteration methods. The idea behind them is simple. But before defining them, let's look at what we have. On one hand, we can evaluate the values of the current policy iteratively; on the other hand, we can immediately improve the policy so that it follows the high-value path. The first approach that comes to mind is to alternate between the two: compute the exact values of the current policy by calling ```policy_evaluation``` until convergence, improve the policy with ```policy_improvement```, and repeat until the policy stops changing. This method is called **Policy Iteration**.
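
The alternation described above can be sketched on a hypothetical toy MDP as follows. The names `TMAP`, `evaluate`, `improve`, and `policy_iteration`, the tolerance, and the stopping rule are assumptions for illustration and do not reflect the `PolicyIteration` class interface.

```python
GAMMA = 0.9
# Hypothetical toy MDP: states 0 and 1 are non-terminal, state 2 is the goal.
# Actions: 0 = left, 1 = right. Entries are (probability, next_state, reward, done).
TMAP = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(0.8, 0, 0.0, False), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, False)]},
}


def q_value(values, s, a):
    return sum(p * (r + GAMMA * (0.0 if done else values[nxt]))
               for p, nxt, r, done in TMAP[s][a])


def evaluate(policy, tol=1e-10):
    """Sweep until the largest value change falls below tol."""
    values = {s: 0.0 for s in TMAP}
    while True:
        new_values = {s: sum(policy[s][a] * q_value(values, s, a) for a in TMAP[s])
                      for s in TMAP}
        delta = max(abs(new_values[s] - values[s]) for s in TMAP)
        values = new_values
        if delta < tol:
            return values


def improve(values):
    """Greedy deterministic policy with respect to the given values."""
    policy = {}
    for s in TMAP:
        best = max(TMAP[s], key=lambda a: q_value(values, s, a))
        policy[s] = {a: float(a == best) for a in TMAP[s]}
    return policy


def policy_iteration():
    policy = {s: {a: 1.0 / len(TMAP[s]) for a in TMAP[s]} for s in TMAP}
    while True:
        values = evaluate(policy)      # full evaluation of the current policy
        new_policy = improve(values)   # greedy improvement
        if new_policy == policy:       # stable policy: stop
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy)
print(values)
```

On this toy problem the loop stops after the second improvement step, because the greedy policy no longer changes.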

**Question 7)** In the ```rl_hw1/dp/methods.py``` module, implement the ```PolicyIteration``` class.

One drawback of **Policy Iteration** is that we need to evaluate the values until convergence after each ```policy_improvement``` call. However, exact values are not needed to improve the policy. Instead of evaluating the values until convergence, we can call ```policy_improvement``` after every single value evaluation step. This strategy is called **Value Iteration**.
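
When the improvement step is greedy, one evaluation sweep under the freshly improved policy is the same as taking the maximum over actions in each backup. The sketch below uses that form on a hypothetical toy MDP. The names `TMAP`, `q_value`, and `value_iteration` and the tolerance are assumptions and do not reflect the `ValueIteration` class interface.

```python
GAMMA = 0.9
# Hypothetical toy MDP: states 0 and 1 are non-terminal, state 2 is the goal.
# Actions: 0 = left, 1 = right. Entries are (probability, next_state, reward, done).
TMAP = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(0.8, 0, 0.0, False), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, False)]},
}


def q_value(values, s, a):
    return sum(p * (r + GAMMA * (0.0 if done else values[nxt]))
               for p, nxt, r, done in TMAP[s][a])


def value_iteration(tol=1e-10):
    values = {s: 0.0 for s in TMAP}
    while True:
        # One backup per state, taking the best action instead of averaging.
        new_values = {s: max(q_value(values, s, a) for a in TMAP[s]) for s in TMAP}
        delta = max(abs(new_values[s] - values[s]) for s in TMAP)
        values = new_values
        if delta < tol:
            break
    # Read off the greedy action for each state from the final values.
    policy = {s: max(TMAP[s], key=lambda a: q_value(values, s, a)) for s in TMAP}
    return policy, values


policy, values = value_iteration()
print(policy)
print(values)
```

No inner evaluation loop is needed here: every sweep both updates the values and implicitly improves the policy.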

**Question 8)** In the ```rl_hw1/dp/methods.py``` module, implement the ```ValueIteration``` class.

Now, let's compare these two.

In [None]:
def training_loop(agent, env, x_offset, y_offset):
    # Scroll up before it starts
    time.sleep(2)

    # Initiating a 2D value image
    values = np.zeros_like(board)
    time.sleep(1.0)
    for i in range(20):
        agent.optimize(0.95)
        # Filling the value image
        for key, val in agent.values.items():
            values[key] = val * 100
        # Painting new values
        env._renderer.draw(values, x_offset, y_offset, cmap)
        time.sleep(0.2)

In [None]:
def running_loop(agent, env):
    state = env.reset()
    done = False
    while not done:
        action = agent.policy(state)
        state, reward, done, info = env.step(action)
        time.sleep(1/10)
        env.render()

In [None]:
from rl_hw1.dp import make_transition_map, PolicyIteration, ValueIteration

CELL_SIZE = 21
tmap = make_transition_map(board)
pi_agent = PolicyIteration(tmap)
vi_agent = ValueIteration(tmap)
# Color map for values
cmap = cm.get_cmap("viridis", 100)
env = StochasticMaze(cell_size=CELL_SIZE)
env.init_render() # In order to visualize this must be the last line of the cell

In [None]:
# Policy Iteration
training_loop(pi_agent, env, 0, CELL_SIZE*(board.shape[1]+1))
running_loop(pi_agent, env)

In [None]:
# Value Iteration
training_loop(vi_agent, env, 0, CELL_SIZE*(board.shape[1]*2+2))
running_loop(vi_agent, env)

If both agents show reasonable value diffusion and near-optimal policies (that is, they can reach the goal), then **well done!**