## Chapter 3: Finite Markov Decision Processes

In bandit problems we estimate the value $q_*(a)$ of each action $a$. In MDPs we estimate the value $q_*(s,a)$ of each action $a$ in each state $s$, or the value $v_*(s)$ of each state given **<span style="color:blue">optimal action selections</span>**.
We focus on assigning **<span style="color:blue">credit for long-term consequences to individual action selections</span>**.

For particular values $s^\prime$ and $r$ of the random variables $S_t$ and $R_t$, there is a probability of these values occurring at time $t$, given the values of the preceding state $S_{t-1}$ and action $A_{t-1}$: <br/> 
$p(s^\prime, r\:|\:s, a) =\text{Pr}\{S_t=s^\prime, R_t=r\:|\: S_{t-1} = s, A_{t-1}=a\}$, <br/> <br/>
$p:\mathcal{S}\times\mathcal{R}\times\mathcal{S}\times\mathcal{A}\rightarrow [0,1]$ is an ordinary deterministic function of four arguments that defines the joint distribution of the next state and reward given the preceding state and action; thus <br/>
$\sum_{s^\prime \in \mathcal{S}}\sum_{r\in \mathcal{R}} p(s^\prime,r\: |\: s,a)=1, \forall s\in\mathcal{S}, a\in\mathcal{A} \quad (3.3)$ <br/><br/>
This joint distribution completely describes the dynamics of a **finite MDP**. Summing the four-argument $p$ over rewards gives **the state-transition probabilities** $p:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, defined as <br/>
$p(s^\prime\:|\:s,a)=\mbox{Pr}\{S_t = s^\prime \:|\: S_{t-1} = s, A_{t-1} = a\}=\sum_{r\in\mathcal{R}} p(s^\prime, r\:|\:s,a) \quad (3.4)$ <br/> <br/>
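
The normalization (3.3) and the marginalization (3.4) can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state MDP stored as a nested dictionary; the names `dynamics`, `check_normalized`, and `state_transition_probs` are illustrative and do not come from the text.

```python
from collections import defaultdict

# Hypothetical toy MDP (an assumption for illustration):
# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a)
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"): {("s1", 1.0): 0.75, ("s0", 0.0): 0.25},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s1", 0.0): 0.5},
    ("s1", "go"): {("s0", -1.0): 1.0},
}


def check_normalized(dynamics, tol=1e-12):
    """Eq. (3.3): for every (s, a), the probabilities over (s', r) sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) <= tol for dist in dynamics.values())


def state_transition_probs(dynamics, s, a):
    """Eq. (3.4): p(s' | s, a) is the sum over r of p(s', r | s, a)."""
    probs = defaultdict(float)
    for (s_next, _reward), prob in dynamics[(s, a)].items():
        probs[s_next] += prob
    return dict(probs)


print(check_normalized(dynamics))
print(state_transition_probs(dynamics, "s1", "stay"))
```

In this toy example, the two reward outcomes of `("s1", "stay")` collapse into a single transition probability for `"s1"`.
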
Likewise, the **expected reward** for a state-action pair is a two-argument function, $r:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}$: <br/>
$r(s,a) = \mathbb{E}[R_t \:|\: S_{t-1} = s, A_{t-1} = a]=\sum_{r\in \mathcal{R}}r\sum_{s^\prime}p(s^\prime,r\:|\:s,a)  \quad (3.5)$ <br/> <br/>
The expected reward for a state-action-next-state triple is a three-argument function, $r:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$: <br/>
$r(s,a,s^\prime)=\mathbb{E}[R_t \:|\: S_{t-1}=s, A_{t-1} = a, S_{t}=s^\prime] \\
=\sum_{r\in\mathcal{R}} r\, p(r\:|\: s, a, s^\prime) \\
= \sum_{r\in\mathcal{R}} r\, \frac{p(s^\prime, r\:|\: s,a)}{p( s^\prime\:|\:s,a)} \quad (3.6) $ <br/> 
where $p(r\:|\: s, a, s^\prime) = \frac{p(s^\prime, r\:|\: s,a)}{p( s^\prime\:|\:s,a)}$ follows from the definition of conditional probability, <br/>
$p(x\:|\:y) = \frac{p(x,y)}{p(y)}$, applied with every term additionally conditioned on $s$ and $a$. Note that there is no sum over $s^\prime$ here, because $s^\prime$ is given.
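
The two expected-reward functions can be computed directly from the four-argument dynamics. Below is a rough sketch, again assuming a hypothetical toy MDP in a nested dictionary; the function names are illustrative, and the three-argument version takes its arguments in the order state, action, next state.

```python
# Hypothetical toy MDP (an assumption for illustration):
# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a)
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"): {("s1", 1.0): 0.75, ("s0", 0.0): 0.25},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s1", 0.0): 0.5},
    ("s1", "go"): {("s0", -1.0): 1.0},
}


def expected_reward(dynamics, s, a):
    """Eq. (3.5): sum over r and s' of r * p(s', r | s, a)."""
    return sum(r * prob for (_s_next, r), prob in dynamics[(s, a)].items())


def expected_reward_given_next(dynamics, s, a, s_next):
    """Eq. (3.6): sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    outcomes = dynamics[(s, a)].items()
    # Denominator: state-transition probability p(s' | s, a), as in (3.4).
    p_next = sum(prob for (sn, _r), prob in outcomes if sn == s_next)
    if p_next == 0.0:
        raise ValueError("s_next is unreachable from (s, a), so the value is undefined")
    # Numerator: rewards weighted by the joint probability, with s' held fixed.
    weighted = sum(r * prob for (sn, r), prob in outcomes if sn == s_next)
    return weighted / p_next


print(expected_reward(dynamics, "s0", "go"))
print(expected_reward_given_next(dynamics, "s0", "go", "s1"))
```

Conditioning on the next state changes the answer: for `("s0", "go")` the unconditional expectation averages over both outcomes, while the conditional one only looks at transitions that actually reach the given next state.
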

### Expected Return 
The agent's goal is to maximize the cumulative reward it receives in the long run. If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \dots$, then the **return** $G_t$ is, in the simplest case, the sum of these rewards, and the agent seeks to maximize the **expected return**: <br/>
$G_t=R_{t+1}+R_{t+2}+R_{t+3}+...+R_T$ <br/>
where $T$ is a final time step. <br/><br/>
For a continuing task, where there is no final time step, we use the **discounted return** with a **discount rate** $\gamma\in[0,1]$: <br/>
$G_t = R_{t+1}+\gamma R_{t+2} +\gamma^2 R_{t+3}+...\\
= \sum_{i=1}^{\infty} \gamma^{i-1}R_{t+i} = \sum_{i=0}^{\infty}\gamma^iR_{t+i+1}$ <br/><br/>

The discounted return can be written recursively: <br/>
$G_t=R_{t+1}+\gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4}+\dots =R_{t+1}+ \gamma G_{t+1}$ <br/>
since $G_{t+1}=R_{t+2}+\gamma R_{t+3} + \gamma^2 R_{t+4} + \gamma^3 R_{t+5}+\dots$. <br/> <br/>

This holds for all time steps $t<T$, including the case where termination occurs at $t+1$, provided we define $G_T = 0$. Episodic and continuing tasks can be covered by a single definition of the return: <br/>
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1}R_k$ <br/>
including the possibility that $T=\infty$ or $\gamma = 1$, but not both.
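
For a finite reward sequence, the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ with $G_T = 0$ can be evaluated by working backward from the end of the episode. The sketch below assumes the rewards are given as a Python list `[R_{t+1}, ..., R_T]`; the function names are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, ..., R_T] using the backward recursion."""
    g = 0.0  # G_T = 0 by convention
    for reward in reversed(rewards):
        g = reward + gamma * g  # G_k = R_{k+1} + gamma * G_{k+1}
    return g


def discounted_return_direct(rewards, gamma):
    """Return the same quantity from the explicit sum of gamma**i * R_{t+i+1}."""
    return sum(gamma ** i * reward for i, reward in enumerate(rewards))


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))         # 1.75
print(discounted_return_direct(rewards, 0.5))  # 1.75
```

With $\gamma = 1$ the same function reduces to the plain sum of rewards used for episodic tasks.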

### Policies

A policy is a mapping from states to probabilities of selecting each possible action. If the agent follows policy $\pi$ at time step $t$, then <br/>
$\pi(a\:|\:s)=\mbox{Pr}\{A_t=a\:|\:S_t=s\}$ <br/>
It defines **a probability distribution over $a\in\mathcal{A}(s)$ for each $s\in\mathcal{S}$**. <br/><br/>
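
A stochastic policy of this kind can be represented as one probability table per state. The following is a small sketch under that assumption; the `policy` dictionary, the state and action names, and both helper functions are hypothetical.

```python
import random

# Hypothetical policy (an assumption for illustration):
# policy[s][a] = pi(a | s)
policy = {
    "s0": {"stay": 0.25, "go": 0.75},
    "s1": {"stay": 1.0, "go": 0.0},
}


def is_valid_policy(policy, tol=1e-12):
    """Check that each state's action probabilities are non-negative and sum to 1."""
    return all(
        all(p >= 0.0 for p in action_probs.values())
        and abs(sum(action_probs.values()) - 1.0) <= tol
        for action_probs in policy.values()
    )


def sample_action(policy, s, rng):
    """Draw A_t from pi(. | S_t = s) using the supplied random generator."""
    actions = list(policy[s])
    weights = [policy[s][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]


rng = random.Random(0)  # seeded only to make runs repeatable
print(is_valid_policy(policy))
print(sample_action(policy, "s0", rng))
```

A deterministic policy is the special case in which one action in each state has probability 1.
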
The value of a state $s$ under a policy $\pi$, denoted $v_{\pi}(s)$, is the **expected return** when starting from $s$ and following $\pi$ thereafter. <br/><br/>
For MDPs, the function $v_{\pi}$, called the **state-value function for policy $\pi$**, is defined as <br/>
$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t\:|\: S_t=s]=\mathbb{E}_{\pi}[\sum_{i=0}^{\infty}\gamma^iR_{t+i+1}\:|\: S_t=s]$ <br/>where $t$ is any time step.<br/>
**The value of the terminal state is always 0**. <br/><br/>
The value of taking an action $a$ in a state $s$ under a policy $\pi$, called the **action-value function for policy $\pi$**, is the expected return starting from $s$, taking the action $a$, and thereafter following the policy $\pi$ <br/>
$q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \:|\:S_t = s, A_t = a]$ <br/><br/>

An important property of value functions (both the state-value function and the action-value function) is that they satisfy recursive relations. <br/>
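$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t\:|\:S_t=s] \\
=\mathbb{E}_{\pi}[R_{t+1}+\gamma G_{t+1}\:|\:S_t=s] \\
=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{s^\prime\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s^\prime, r\:|\:s,a)\left[r+\gamma \mathbb{E}_{\pi}[G_{t+1}\:|\:S_{t+1}=s^\prime]\right] \\
=\sum_{s^\prime\in\mathcal{S}}\sum_{r\in\mathcal{R}}\sum_{a\in\mathcal{A}(s)}p(s^\prime, r\:|\:s,a)\pi(a|s)[r+\gamma v_{\pi}(s^\prime)] \quad (3.14)$ <br/>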
Note that $a$ is taken from $\mathcal{A}(s)$. The final expression can be read easily as an expected value with the probability $p(s^\prime, r\:|\:s,a)\pi(a|s)$ weighting $[r+\gamma v_{\pi}(s^\prime)]$. <br/><br/>
Equation (3.14) is the **Bellman equation for $v_{\pi}$**. ![alt text](backup_diagram_v_pi.png "Backup diagram for a state-value function" )
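
One way to see (3.14) in action is to treat its right-hand side as an update and apply it repeatedly until the values stop changing. This fixed-point iteration is an assumption of the sketch below, not something the derivation above prescribes, and the toy MDP, the policy, and all function names are hypothetical.

```python
# Hypothetical toy MDP and policy (assumptions for illustration):
# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a); policy[s][a] = pi(a | s)
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"): {("s1", 1.0): 0.75, ("s0", 0.0): 0.25},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s1", 0.0): 0.5},
    ("s1", "go"): {("s0", -1.0): 1.0},
}
policy = {
    "s0": {"stay": 0.25, "go": 0.75},
    "s1": {"stay": 0.5, "go": 0.5},
}


def bellman_backup_v(v, dynamics, policy, gamma):
    """Evaluate the right-hand side of (3.14) for every state."""
    new_v = {}
    for s, action_probs in policy.items():
        total = 0.0
        for a, pi_a in action_probs.items():
            for (s_next, r), prob in dynamics[(s, a)].items():
                # Weight [r + gamma * v(s')] by p(s', r | s, a) * pi(a | s).
                total += prob * pi_a * (r + gamma * v[s_next])
        new_v[s] = total
    return new_v


def evaluate_policy(dynamics, policy, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate the backup from v = 0 until successive sweeps differ by less than tol."""
    v = {s: 0.0 for s in policy}
    for _ in range(max_sweeps):
        new_v = bellman_backup_v(v, dynamics, policy, gamma)
        if max(abs(new_v[s] - v[s]) for s in v) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy(dynamics, policy, gamma=0.9)
print(v_pi)
```

The returned values approximately satisfy the Bellman equation: applying `bellman_backup_v` to `v_pi` once more leaves it essentially unchanged. The sketch assumes $\gamma < 1$ so that the iteration settles.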

The backup diagram for the **Bellman equation for $q_{\pi}$** is <br/>
![alt text](backup_diagram_q_pi.png "Backup diagram for an action-value function" )

and the **Bellman equation for $q_{\pi}$** is <br/>
$q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \:|\: S_t=s, A_t = a] \\
=\mathbb{E}_{\pi}[R_{t+1}+\gamma G_{t+1} \:|\: S_t=s, A_t = a] \\
=\sum_{r\in\mathcal{R}}\sum_{s^\prime\in \mathcal{S}}p(s^\prime, r\:|\:s,a)[r+\gamma\sum_{a^\prime\in \mathcal{A}(s^\prime)}\pi(a^\prime\:|\:s^\prime)q_{\pi}(s^\prime, a^\prime)]$
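
The Bellman equation for $q_{\pi}$ can be exercised the same way, by using its right-hand side as a repeated update. As before, this is only a sketch: the iteration scheme, the toy MDP, the policy, and the function names are assumptions made for illustration.

```python
# Hypothetical toy MDP and policy (assumptions for illustration):
# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a); policy[s][a] = pi(a | s)
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"): {("s1", 1.0): 0.75, ("s0", 0.0): 0.25},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s1", 0.0): 0.5},
    ("s1", "go"): {("s0", -1.0): 1.0},
}
policy = {
    "s0": {"stay": 0.25, "go": 0.75},
    "s1": {"stay": 0.5, "go": 0.5},
}


def bellman_backup_q(q, dynamics, policy, gamma):
    """Evaluate the right-hand side of the Bellman equation for q_pi."""
    new_q = {}
    for (s, a), outcomes in dynamics.items():
        total = 0.0
        for (s_next, r), prob in outcomes.items():
            # Inner sum over a': pi(a' | s') * q(s', a').
            next_value = sum(
                pi_next * q[(s_next, a_next)]
                for a_next, pi_next in policy[s_next].items()
            )
            total += prob * (r + gamma * next_value)
        new_q[(s, a)] = total
    return new_q


def evaluate_q(dynamics, policy, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate the backup from q = 0 until successive sweeps differ by less than tol."""
    q = {key: 0.0 for key in dynamics}
    for _ in range(max_sweeps):
        new_q = bellman_backup_q(q, dynamics, policy, gamma)
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q
    return q


q_pi = evaluate_q(dynamics, policy, gamma=0.9)
print(q_pi)
```

Averaging `q_pi[(s, a)]` over actions with weights `policy[s][a]` gives the state value of `s` under the same policy, which is the inner sum that appears in the equation.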
