# Planning by Dynamic Programming

In this tutorial we use dynamic programming to solve a simple MDP task, represented by a **Grid world** environment. The outline is as follows :

* RL refresher
* DP methods
* Setting up the environment
* Solution methods :
    * Policy Evaluation
    * Policy Iteration
    * Value Iteration

## RL refresher

### RL formulation
Recall the interaction loop of an agent with an environment :

![rl loop](imgs/img_rl_interaction_loop.png)

* The agent is in a certain state $s_{t} \in S$ and it interacts with the environment via its actions $a_{t} \in A$.

* Because of this interaction, the environment returns some information about how well the interaction went, namely the reward signal $r_{t+1}$ (a random variable drawn from a distribution $R$), and the next state the agent landed in, $s_{t+1}$, which is determined by the dynamics of the environment (modeled as $P(s^{'} | s, a)$).

* The dynamics model and the reward distribution are usually thought of as a single joint distribution :

\begin{equation}
    T(s^{'},r,s,a) = P( s^{'}, r | s, a )
\end{equation}

* Then the process repeats until we get to a terminal state (episodic tasks) or it continues indefinitely (continuing tasks).

* The agent's objective is to learn from this interaction in order to collect the largest possible (discounted) sum of rewards, which is called the **Return**, $G_{t}$ :

\begin{equation}
    G_{t} = \sum_{k=0}^{\infty} \gamma^{k} r_{t + k + 1}
\end{equation}

* Here we discount with a factor $\gamma \in [0, 1]$; choosing $\gamma < 1$ keeps the infinite sum bounded when the rewards are bounded, which also keeps the math tractable. Also, as described earlier, $r$ is a random variable, so the return is a random variable too. Because of this, the objective of the agent is to maximize the **Expected Return** :

\begin{equation}
    \mathbb{E} \lbrace G_{t} | s_t \rbrace
\end{equation}
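As a minimal sketch of the two equations above, the snippet below computes the discounted return of a finite reward sequence and estimates the expected return by averaging over many sampled sequences. The reward process `sample_rewards` (a coin flip paying 0 or 1 for three steps) and the value `gamma = 0.9` are assumptions made only for illustration; they are not part of the Grid world used later.

```python
import random

def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sample_rewards(rng, horizon=3):
    # Hypothetical reward process: each step pays 1.0 or 0.0 with equal probability.
    return [float(rng.random() < 0.5) for _ in range(horizon)]

rng = random.Random(0)
gamma = 0.9  # assumed discount factor

# The return of one fixed reward sequence.
single_return = discounted_return([1.0, 1.0, 1.0], gamma)

# Monte Carlo estimate of the expected return: average over sampled sequences.
returns = [discounted_return(sample_rewards(rng), gamma) for _ in range(10_000)]
expected_return_estimate = sum(returns) / len(returns)
```

Because each sampled sequence yields a different return, only the average over many sequences approximates the quantity the agent tries to maximize.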

So, we can formulate the RL problem mathematically by collecting these components $(S,A,T,\gamma)$ into what is called a **Markov Decision Process (MDP)**, which is defined by that 4-tuple and works as described earlier :

* The agent in state $s$ from $S$ picks an action $a$ from $A$.
* The environment takes one step using the dynamics $T$ and returns a reward $r$ and a new state $s^{'}$.
* The process then repeats.
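The loop above can be sketched with a small tabular model. This is only an illustration under stated assumptions: the two states `s0` and `s1`, the actions `stay` and `go`, the probabilities, and the helper `step` are all hypothetical and unrelated to the Grid world environment created later.

```python
import random

# Hypothetical two-state MDP. dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. a tabular version of T(s', r, s, a) = P(s', r | s, a).
dynamics = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def step(state, action, rng):
    """Sample (next_state, reward) from P(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

rng = random.Random(0)
state = "s0"
trajectory = []
for _ in range(5):
    action = rng.choice(["stay", "go"])             # the agent picks an action from A
    next_state, reward = step(state, action, rng)   # the environment takes one step
    trajectory.append((state, action, reward, next_state))
    state = next_state                              # the process then repeats
```

Note that for every state-action pair the listed probabilities sum to one, as required for a valid conditional distribution.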

**Notes**

* The dynamics of the environment only need the current state (and the action taken) to determine the distribution of the next state. This is called the **Markov Property**, and the state (environment state) satisfies it.
* Usually the environment state is not visible to the agent. Instead, only some observations $o_{t} \in O$ are returned (which are a sort of state/configuration). If these observations are not sufficient to predict the future states, then the environment is said to be **Partially observable**, and we are in the case of a POMDP (Partially Observable Markov Decision Process).
* To avoid this, the state representation given to the agent is usually built so that it carries enough information to satisfy the Markov property (state augmentation), and this tends to work well in Deep RL.
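One common form of state augmentation is to stack the most recent observations. The sketch below is a hypothetical example, not something taken from this tutorial's environment: a point moves at constant velocity but only its position is observed, so a single observation cannot predict the next one, while a stack of two can. The class name `StackedObservation` and the numbers are assumptions.

```python
from collections import deque

class StackedObservation:
    """Keeps the last `k` observations and exposes them as one augmented state."""

    def __init__(self, k, initial_obs):
        # Fill the buffer with the first observation so the state always has k entries.
        self.frames = deque([initial_obs] * k, maxlen=k)

    def push(self, obs):
        # Appending to a full deque drops the oldest observation automatically.
        self.frames.append(obs)
        return tuple(self.frames)

# Hypothetical positions of a point moving with constant velocity.
positions = [0.0, 1.0, 2.0, 3.0]
stack = StackedObservation(k=2, initial_obs=positions[0])

for pos in positions[1:]:
    augmented = stack.push(pos)

# The augmented state holds the two latest positions, which is enough to
# recover the velocity and therefore predict the next position.
prev_pos, curr_pos = augmented
velocity = curr_pos - prev_pos
predicted_next = curr_pos + velocity
```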


### RL solution

As described earlier, the objective of the agent is to get the largest expected return, and the only way it can influence the return is through the actions it takes. The agent has to pick an action at every environment step, so we can formulate this decision as a mapping : for a given state/configuration $s_{t}$ the agent is currently in, the agent chooses an action $a_{t}$. This mapping is called a **policy**, and can be :

* Deterministic function mapping (Deterministic policy $\pi(s)$) :
    \begin{equation}
        a_{t} = \pi(s_{t})
    \end{equation}
    
* Sampling from a distribution (Stochastic policy $\pi(a|s)$) :
    \begin{equation}
        a_{t} \sim \pi(a | s_{t})
    \end{equation}
    
So, the objective of the agent is to find a **policy** $\pi$ (deterministic or stochastic) such that if the agent follows this policy it will **maximize the expected return** it gets from its interaction with the environment.
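The two kinds of policy can be sketched as lookup tables. The states, actions, and probabilities below are hypothetical and chosen only to illustrate the difference between $a_{t} = \pi(s_{t})$ and $a_{t} \sim \pi(a | s_{t})$.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_table = {"s0": "go", "s1": "stay"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_table = {
    "s0": {"go": 0.7, "stay": 0.3},
    "s1": {"go": 0.1, "stay": 0.9},
}

def act_deterministic(state):
    """a_t = pi(s_t): a plain function evaluation."""
    return deterministic_table[state]

def act_stochastic(state, rng):
    """a_t ~ pi(a | s_t): sample an action from the state's distribution."""
    dist = stochastic_table[state]
    actions = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [act_stochastic("s0", rng) for _ in range(10_000)]
go_fraction = samples.count("go") / len(samples)  # should be close to 0.7
```

A deterministic policy can be seen as the special case of a stochastic policy that puts probability one on a single action in every state.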

```python
# configure paths
import sys
sys.path.insert( 0, '../' )
```

## Creating the environment