# Exercise 2: Dynamic Programming

## 1) Policy Evaluation

On Reunification Day you decide to go on a pub crawl with your friends.
You have to drink one beer in each of three different pubs.
There are six good pubs in town; you start at the Schloss and will (hopefully) end up on the banks of the Neckar. The problem is depicted in the following picture:

![](mannheim_pub_crawl.png)

In our first example we follow the 50/50 policy.
After drinking in a pub, e.g. Cafe Vienna, there is a $50 \, \%$ probability of going "east" to the Blau and a $50 \, \%$ probability of going "west" to the Kombinat.
Evaluate the state values using policy evaluation ($v_\mathcal{X} = \mathcal{R}_\mathcal{X} + \gamma \mathcal{P}_{xx'} v_\mathcal{X}$):

\begin{align*}
\begin{bmatrix}
v^{50/50}_{1}\\
\vdots\\
v^{50/50}_{n}\\
\end{bmatrix}
=
\begin{bmatrix}
\mathcal{R}^{50/50}_{1}\\
\vdots\\
\mathcal{R}^{50/50}_{n}\\
\end{bmatrix}
+
\gamma
\begin{bmatrix}
{p}^{50/50}_{11}&\cdots&{p}^{50/50}_{1n}\\
\vdots&\ddots&\vdots\\
{p}^{50/50}_{n1}&\cdots&{p}^{50/50}_{nn}\\
\end{bmatrix}
\begin{bmatrix}
v^{50/50}_{1}\\
\vdots\\
v^{50/50}_{n}\\
\end{bmatrix}
\end{align*}
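
Because this equation is linear in $v_\mathcal{X}$, it can be rearranged into a closed-form solution. The short derivation below is a sketch that assumes $\mathcal{I} - \gamma \mathcal{P}_{xx'}$ is invertible, which holds for $\gamma < 1$; the symbol $\mathcal{I}$ (the $n \times n$ identity matrix) is introduced here and does not appear in the exercise text.

\begin{align*}
v_\mathcal{X} &= \mathcal{R}_\mathcal{X} + \gamma \mathcal{P}_{xx'} v_\mathcal{X}\\
(\mathcal{I} - \gamma \mathcal{P}_{xx'}) v_\mathcal{X} &= \mathcal{R}_\mathcal{X}\\
v_\mathcal{X} &= (\mathcal{I} - \gamma \mathcal{P}_{xx'})^{-1} \mathcal{R}_\mathcal{X}
\end{align*}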

The rewards are given as negative numbers next to the arrows and represent the distances between two bars as a penalty.
In this exercise we will set $\gamma = 0.9$. 
In the problem shown we have $n = 8$ states (the six pubs plus the start at the Schloss and the end at the Neckar), ordered as given by the state space:

\begin{align*}
\mathcal{X} =
\left\lbrace \begin{matrix}
\text{Start: Schloss}\\
\text{Hagestolz}\\
\text{Cafe Vienna}\\
\text{Blau}\\
\text{Kombinat}\\
\text{Kazzwoo}\\
\text{Römer}\\
\text{End: Neckar}\\
\end{matrix}
\right\rbrace
\end{align*}

Use a short Python script to calculate the state values!

(Hint: First calculate the expected reward for each state.)
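
One possible way to follow the hint is sketched below: the expected reward of a state is the probability-weighted sum of the rewards on its outgoing arrows. The names `P`, `R_edge` and `expected_reward` are illustrative, and the per-arrow rewards are an assumption read off the path totals and value updates given later in this exercise rather than from the picture itself.

```python
import numpy as np

# State order: Schloss, Hagestolz, Cafe Vienna, Blau, Kombinat, Kazzwoo, Römer, Neckar
P = np.zeros((8, 8))       # transition probabilities under the 50/50 policy
P[0, [1, 2]] = 0.5         # Schloss -> Hagestolz / Cafe Vienna
P[1, [3, 4]] = 0.5         # Hagestolz -> Blau / Kombinat
P[2, [3, 4]] = 0.5         # Cafe Vienna -> Blau / Kombinat
P[3, [5, 6]] = 0.5         # Blau -> Kazzwoo / Römer
P[4, [5, 6]] = 0.5         # Kombinat -> Kazzwoo / Römer
P[5, 7] = P[6, 7] = 1.0    # Kazzwoo, Römer -> Neckar
P[7, 7] = 1.0              # Neckar is absorbing

# Assumed reward (negative distance) on each arrow, indexed [from, to]
R_edge = np.zeros((8, 8))
R_edge[0, 1], R_edge[0, 2] = -4, -3
R_edge[1, 3], R_edge[1, 4] = -1, -6
R_edge[2, 3], R_edge[2, 4] = -4, -3
R_edge[3, 5], R_edge[3, 6] = -1, -4
R_edge[4, 5], R_edge[4, 6] = -3, -2
R_edge[5, 7], R_edge[6, 7] = -6, -2

# Expected immediate reward per state: sum over next states of p * r
expected_reward = (P * R_edge).sum(axis=1)
print(expected_reward)
```

Under these assumed arrow values the result agrees with the reward vector used in the solution script.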

YOUR ANSWER HERE

```python
import numpy as np

# Given parameter
gamma = 0.9  # discount factor

# State transition matrix under the 50/50 policy
# (rows: current state, columns: next state, ordered as in the state space)
P_xx = np.array([[0, .5, .5, 0, 0, 0, 0, 0],
                 [0, 0, 0, .5, .5, 0, 0, 0],
                 [0, 0, 0, .5, .5, 0, 0, 0],
                 [0, 0, 0, 0, 0, .5, .5, 0],
                 [0, 0, 0, 0, 0, .5, .5, 0],
                 [0, 0, 0, 0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1]])

# Expected immediate reward per state (mean of the two outgoing arrow rewards)
R_x = np.array([-3.5, -3.5, -3.5, -2.5, -2.5, -6, -2, 0]).reshape(-1, 1)

# Policy evaluation: solve (I - gamma * P_xx) v_X = R_x for v_X
v_X = np.linalg.solve(np.eye(8) - gamma * P_xx, R_x)

print(v_X)
```


```
[[-11.591]
 [ -8.99 ]
 [ -8.99 ]
 [ -6.1  ]
 [ -6.1  ]
 [ -6.   ]
 [ -2.   ]
 [  0.   ]]
```
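
As a cross-check of these values, the same equation can also be solved by repeatedly applying the update $v \leftarrow \mathcal{R}_\mathcal{X} + \gamma \mathcal{P}_{xx'} v$ until it stops changing. The following is a minimal sketch of that iterative approach, not part of the original solution; the tolerance `1e-12` and the sweep limit are arbitrary assumptions.

```python
import numpy as np

gamma = 0.9  # discount factor from the exercise

P_xx = np.array([[0, .5, .5, 0, 0, 0, 0, 0],
                 [0, 0, 0, .5, .5, 0, 0, 0],
                 [0, 0, 0, .5, .5, 0, 0, 0],
                 [0, 0, 0, 0, 0, .5, .5, 0],
                 [0, 0, 0, 0, 0, .5, .5, 0],
                 [0, 0, 0, 0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1]])
R_x = np.array([-3.5, -3.5, -3.5, -2.5, -2.5, -6, -2, 0])

v = np.zeros(8)  # initial guess for the state values
for sweep in range(1000):  # assumed upper bound on the number of sweeps
    v_new = R_x + gamma * P_xx @ v  # one application of the Bellman expectation update
    converged = np.max(np.abs(v_new - v)) < 1e-12  # assumed tolerance
    v = v_new
    if converged:
        break

print(np.round(v, 3))
```

Because every path reaches the Neckar after at most four moves, the iteration settles after only a handful of sweeps in this problem.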


## 2) Exhaustive Policy Search 

From now on use $\gamma = 1$.

Drawing on the prior knowledge from your master's degree, you try to minimize the total distance of your tour in order to have more time in the pubs. To that end, you perform the following exhaustive search:

1. Write down all possible paths and calculate their total distances.
2. Which path is best in terms of most beer per distance?
3. Derive the formula for the number of necessary path comparisons.



YOUR ANSWER HERE

Hagestolz -> Blau -> Kazzwoo = -12

Hagestolz -> Blau -> Römer = -11

Hagestolz -> Kombinat -> Kazzwoo = -19

Hagestolz -> Kombinat -> Römer = -14

Cafe Vienna -> Blau -> Kazzwoo = -14

Cafe Vienna -> Blau -> Römer = -13

Cafe Vienna -> Kombinat -> Kazzwoo = -15

Cafe Vienna -> Kombinat -> Römer = -10

Every path includes the same three beers, so the best path is the shortest one: Cafe Vienna -> Kombinat -> Römer with a total of -10.

At each step we choose between two actions (up, down). With $N = 2$ actions per step and $k = 3$ steps, the number of different paths is $N^k = 2^3 = 8$. Finding the best of them therefore requires $N^k - 1 = 2^3 - 1 = 7$ path comparisons.
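
The enumeration above can also be reproduced with a few lines of Python. This is only a sketch: the dictionary `rewards` and the list `stages` are illustrative names, and the arrow rewards are assumed values consistent with the path totals listed above, since the picture itself is not available here.

```python
from itertools import product

# Assumed arrow rewards (negative distances) between consecutive stops
rewards = {
    ("Schloss", "Hagestolz"): -4, ("Schloss", "Cafe Vienna"): -3,
    ("Hagestolz", "Blau"): -1, ("Hagestolz", "Kombinat"): -6,
    ("Cafe Vienna", "Blau"): -4, ("Cafe Vienna", "Kombinat"): -3,
    ("Blau", "Kazzwoo"): -1, ("Blau", "Römer"): -4,
    ("Kombinat", "Kazzwoo"): -3, ("Kombinat", "Römer"): -2,
    ("Kazzwoo", "Neckar"): -6, ("Römer", "Neckar"): -2,
}

# The two pubs to choose from at each of the three steps
stages = [["Hagestolz", "Cafe Vienna"], ["Blau", "Kombinat"], ["Kazzwoo", "Römer"]]

totals = {}
for pubs in product(*stages):  # all 2^3 combinations of pubs
    path = ("Schloss", *pubs, "Neckar")
    totals[pubs] = sum(rewards[a, b] for a, b in zip(path, path[1:]))

best = max(totals, key=totals.get)  # least negative total = shortest tour
for pubs, total in totals.items():
    print(" -> ".join(pubs), "=", total)
print("best:", " -> ".join(best), "with", totals[best])
```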

## 3) Dynamic Programming - The Idea

Trying out all combinations might not be best for your liver, so you want to solve the problem above using dynamic programming. 

Using value iteration, derive the values that result from the optimal policy: $v_{i+1}^*(x_k) = \max_u \left(r_{k+1} + v_{i}^*(x_{k+1})\right)$.



How many value comparisons have to be made?

YOUR ANSWER HERE

We start by initializing the value for the end state, which is Neckar, to 0 because there is no further distance after reaching the goal:

v(Neckar) = 0


We then propagate the values backwards from the end state through the other states using the Bellman equation.

- **For Kazzwoo**:  
  The only possible path is to Neckar with a distance of -6, so:
  
  v(Kazzwoo) = -6 + v(Neckar) = -6

- **For Römer**:  
  The only possible path is to Neckar with a distance of -2, so:
  
  v(Römer) = -2 + v(Neckar) = -2
  

- **For Blau**:  
  There are two possible paths:
  1. To Kazzwoo with a distance of -1:  
     v(Blau) = -1 + v(Kazzwoo) = -7
     
  2. To Römer with a distance of -4:  
     v(Blau) = -4 + v(Römer) = -6 **(optimal)**
     
- **For Kombinat**:  
  There are two possible paths:
  1. To Kazzwoo with a distance of -3:  
     v(Kombinat) = -3 + v(Kazzwoo) = -9
     
  2. To Römer with a distance of -2:  
     v(Kombinat) = -2 + v(Römer) = -4 **(optimal)**


- **For Hagestolz**:  
  There are two possible paths:
  1. To Blau with a distance of -1:  
     v(Hagestolz) = -1 + v(Blau) = -7 **(optimal)**
     
  2. To Kombinat with a distance of -6:  
     v(Hagestolz) = -6 + v(Kombinat) = -10
     
- **For Cafe Vienna**:  
  There are two possible paths:
  1. To Blau with a distance of -4:  
     v(Cafe Vienna) = -4 + v(Blau) = -10
     
  2. To Kombinat with a distance of -3:  
     v(Cafe Vienna) = -3 + v(Kombinat) = -7 **(optimal)**


- **For Schloss**:  
  There are two possible paths:
  1. To Hagestolz with a distance of -4:  
     v(Schloss) = -4 + v(Hagestolz) = -11
     
  2. To Cafe Vienna with a distance of -3:  
     v(Schloss) = -3 + v(Cafe Vienna) = -10 **(optimal)**

As there is no decision to make at Kazzwoo or Römer, we end up with 5 comparisons, one for each of the five states that have two outgoing paths.
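
The backward pass above can be sketched in Python as follows, with $\gamma = 1$. The names `edges`, `order`, `policy` and `comparisons` are illustrative, and the arrow rewards are again assumed values consistent with the updates written out above.

```python
# Assumed successors and arrow rewards (negative distances) for each state
edges = {
    "Schloss": {"Hagestolz": -4, "Cafe Vienna": -3},
    "Hagestolz": {"Blau": -1, "Kombinat": -6},
    "Cafe Vienna": {"Blau": -4, "Kombinat": -3},
    "Blau": {"Kazzwoo": -1, "Römer": -4},
    "Kombinat": {"Kazzwoo": -3, "Römer": -2},
    "Kazzwoo": {"Neckar": -6},
    "Römer": {"Neckar": -2},
}

# Visit states from the last step to the first, so that every successor
# already has a value when it is needed
order = ["Kazzwoo", "Römer", "Blau", "Kombinat", "Hagestolz", "Cafe Vienna", "Schloss"]

v = {"Neckar": 0}  # terminal state: no distance left to walk
policy = {}
comparisons = 0
for state in order:
    candidates = {nxt: r + v[nxt] for nxt, r in edges[state].items()}
    policy[state] = max(candidates, key=candidates.get)  # greedy choice
    v[state] = candidates[policy[state]]
    comparisons += len(candidates) - 1  # picking the best of m values takes m - 1 comparisons

print(v)
print(policy)
print("comparisons:", comparisons)
```

Following `policy` from the Schloss gives the same tour as the exhaustive search, but each state is valued only once instead of once per path.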
