A (finite) Markov Decision Process (MDP) is defined by:
* A finite set of states S
* A finite set of actions A
* A set of rewards R

**The one-step dynamics give the joint probability of the next state and the reward, given the current state and the action taken in that state.**

> $p(s^{'},r|s,a) = P(S_{t+1}=s^{'},R_{t+1} = r|S_t =  s, A_t=a)$

**The reward given to the agent is drawn together with the next state, so it depends on the current state and the action taken.**
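To make the one-step dynamics concrete, here is a minimal sketch that stores $p(s', r|s, a)$ as a lookup table and samples from it. The two-state MDP, its state and action names, and the `step` helper are all made up for illustration; they are not part of the original notes.

```python
import random

# Hypothetical MDP: dynamics[(s, a)] maps each (next_state, reward) pair
# to the probability p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}

# Each conditional distribution must sum to one over all (s', r) pairs.
for distribution in dynamics.values():
    assert abs(sum(distribution.values()) - 1.0) < 1e-12


def step(state, action, rng=random):
    """Sample (next_state, reward) jointly from p(., . | state, action)."""
    outcomes = dynamics[(state, action)]
    pairs = list(outcomes.keys())
    probabilities = list(outcomes.values())
    return rng.choices(pairs, weights=probabilities, k=1)[0]


next_state, reward = step("s0", "go", random.Random(0))
```

The next state and the reward are sampled as one pair, which mirrors the joint distribution in the formula above.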


## Policies

#### Deterministic Policy
* The simplest kind of policy is a __deterministic policy__.
    * It is a **mapping from states to actions**.
    * $\pi:S \to A$
    * **Deterministic** means that exactly one definite action is chosen in each state.
    
#### Stochastic Policy
* A __stochastic policy__ is another type of policy.
    * It allows the agent to choose actions **randomly**.
    * It is a mapping $\pi:S\times A \to [0,1]$
    * We define a stochastic policy as a mapping that accepts an environment state $s$ and an action $a$ and returns the **probability that the agent takes action $a$ while in state $s$** (a probability is given for every **possible** state-action pair).
    * $\pi(a|s) = P(A_t = a|S_t = s)$
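The two kinds of policy can be sketched as plain lookup tables. The states, actions, and probabilities below are hypothetical, as are the helper names `act_deterministic` and `act_stochastic`.

```python
import random

# Deterministic policy pi: S -> A, one action per state (made-up values).
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy pi: S x A -> [0, 1], storing pi(a|s) for every pair.
stochastic_policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Return the single action pi(s)."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action a with probability pi(a|s)."""
    action_probs = stochastic_policy[state]
    actions = list(action_probs.keys())
    weights = list(action_probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


chosen = act_stochastic("s0", random.Random(1))
```

For each state, the probabilities $\pi(a|s)$ sum to one over the available actions.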

<img src = "images/a1.png">

### State-value Function
* For each state, the **state-value function** yields the __expected return__ if the agent starts in that state and then __follows the policy__ for all time steps.

#### Definition
We call $v_\pi$ the state-value function for policy $\pi$. It is defined as
>$v_\pi(s) = E_\pi[G_t|S_t =s]$

**For each state $s$, it yields the `expected return` if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps.**

### Note
The notation $E_\pi [\cdot]$ denotes the expected value of a random variable, given that the agent follows policy $\pi$.
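As a rough illustration of what this expectation means, the sketch below averages sampled returns. It assumes that $G_t$ is the discounted return with discount rate $\gamma$. The reward sequences are made up, and `discounted_return` and `estimate_value` are hypothetical helper names.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for one reward sequence."""
    g = 0.0
    # Work backwards so that each step computes G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_value(episodes, gamma):
    """Average the returns of episodes that all start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Made-up rewards observed after starting in some state s and following pi.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
estimate = estimate_value(episodes, gamma=0.5)
```

With more sampled episodes, such an average approaches $v_\pi(s)$ for the starting state.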

## Bellman Equations
* Instead of summing all subsequent rewards after a state to find its state value (which repeats the same work for every state), we can use the Bellman equation.

* The value of any state can be expressed as the immediate reward plus the discounted value of the state that follows.

> $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]$

<img src = "images/a2.png">

## Expected Value
$E[X] = \sum_x x \cdot P(X = x)$

In this simple example, we saw that the value of any state can be calculated as the sum of the immediate reward and the discounted value of the next state.

For a general MDP, we instead have to work in terms of an *expectation*, since the immediate reward and the next state usually cannot be predicted with certainty. Indeed, **we saw in an earlier lesson that the reward and next state are chosen according to the ONE-STEP DYNAMICS of the MDP**: the reward $r$ and next state $s'$ are drawn from a (conditional) probability distribution $p(s^{'},r|s,a)$.
The Bellman Expectation Equation (for $v_\pi$) expresses the value of any state $s$ in terms of the *expected* immediate reward and the *expected* value of the next state:
> $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]$

### Calculating the Expectation

If the agent's policy $\pi$ is **deterministic**, the agent selects action $\pi(s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as a sum over **two** variables ($s'$ and $r$):

> $v_\pi(s) = \sum_{s^{'} \in S^+, r \in R} p(s^{'}, r|s, \pi(s))(r + \gamma v_\pi(s^{'}))$

This is because a deterministic policy maps each state $s$ to a single action $a = \pi(s)$.
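The sum over $s'$ and $r$ can be sketched in a few lines. The MDP, the policy, and the value guesses below are invented for illustration, and `bellman_backup` is a hypothetical helper name. The function evaluates the right-hand side of the equation once for a given table of values; it does not by itself solve for $v_\pi$.

```python
# Hypothetical MDP: dynamics[(s, a)] maps (s', r) -> p(s', r | s, a).
dynamics = {
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 1.0},
}
policy = {"s0": "go", "s1": "stay"}  # deterministic pi(s), made up
gamma = 0.9                          # assumed discount rate


def bellman_backup(state, values):
    """Sum p(s', r | s, pi(s)) * (r + gamma * v(s')) over all (s', r)."""
    action = policy[state]
    return sum(
        prob * (reward + gamma * values[next_state])
        for (next_state, reward), prob in dynamics[(state, action)].items()
    )


# Arbitrary value guesses, only to show the arithmetic:
# 0.8 * (1 + 0.9 * 10) + 0.2 * (0 + 0.9 * 0) = 8.0
values = {"s0": 0.0, "s1": 10.0}
backup = bellman_backup("s0", values)
```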


If the agent's policy $\pi$ is **stochastic**, the agent selects action $a$ with probability $\pi(a|s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as a sum over three variables ($s'$, $r$, and $a$):

>$v_\pi(s) = \sum_{s^{'} \in S^+, r \in R, a \in A(s)} \pi(a|s) p(s^{'}, r|s, a)(r + \gamma v_\pi(s^{'}))$

In this case, we multiply the sum of the reward and discounted value of the next state $(r + \gamma v_\pi(s^{'}))$ by its corresponding probability $\pi(a|s)p(s^{'},r|s,a)$ and **sum over all possibilities** to yield the *expected value*.
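One way to see this equation at work is to apply it repeatedly as an update until the values stop changing. This procedure is commonly called iterative policy evaluation and is not described in these notes, so treat the following as a sketch under assumptions: the MDP, the uniformly random policy, the discount rate, and the function names are all hypothetical.

```python
# Hypothetical MDP: dynamics[(s, a)] maps (s', r) -> p(s', r | s, a).
dynamics = {
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 1.0},
}
# Stochastic policy pi(a|s): here, uniformly random in every state.
policy = {
    "s0": {"go": 0.5, "stay": 0.5},
    "s1": {"go": 0.5, "stay": 0.5},
}
states = ["s0", "s1"]
gamma = 0.9  # assumed discount rate


def expected_backup(state, values):
    """Sum pi(a|s) * p(s', r | s, a) * (r + gamma * v(s')) over a, s', r."""
    total = 0.0
    for action, action_prob in policy[state].items():
        for (next_state, reward), prob in dynamics[(state, action)].items():
            total += action_prob * prob * (reward + gamma * values[next_state])
    return total


def evaluate_policy(tolerance=1e-10, max_sweeps=10_000):
    """Repeat the backup for every state until the largest change is tiny."""
    values = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        new_values = {s: expected_backup(s, values) for s in states}
        largest_change = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if largest_change < tolerance:
            break
    return values


v = evaluate_policy()
```

At convergence, `v[s]` approximately equals `expected_backup(s, v)` for every state, which is the recursive relationship the equation expresses.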

> All of the Bellman equations attest to the fact that value functions satisfy recursive relationships.

## Optimality
### Definition

**$\pi^{'} \geq \pi$ if and only if $v_{\pi^{'}}(s) \geq v_{\pi}(s)$ for all $s \in S$**

By definition, a policy $\pi^{'}$ is better than or equal to a policy $\pi$ if its state-value function is greater than or equal to that of $\pi$ for **all states**.

#### Note 
It is often possible to find two policies that cannot be compared.

But there will always be at least one policy that is better than or equal to all other policies: the **optimal policy**.

**It is guaranteed to exist BUT may not be unique.**

An **optimal policy** is denoted $\pi_*$; all optimal policies share the same **optimal state-value function** $v_*$.
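The comparison rule can be sketched as a state-by-state check. The three value tables below are made-up numbers standing in for the state-value functions of three hypothetical policies, and `better_or_equal` is an illustrative helper name.

```python
def better_or_equal(v_new, v_old):
    """True if v_new(s) >= v_old(s) holds for every state s."""
    return all(v_new[s] >= v_old[s] for s in v_old)


# Made-up state values for three hypothetical policies.
v_a = {"s0": 1.0, "s1": 5.0}
v_b = {"s0": 2.0, "s1": 3.0}
v_c = {"s0": 2.0, "s1": 5.0}

# Policies a and b cannot be compared: each wins in one state.
a_vs_b = (better_or_equal(v_a, v_b), better_or_equal(v_b, v_a))

# Policy c is better than or equal to both of the others.
c_dominates = better_or_equal(v_c, v_a) and better_or_equal(v_c, v_b)
```

Here `a_vs_b` is `(False, False)`, which is the incomparable case, while the third table dominates both.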

## Action Value Function


<img src = "images/a3.png">

### Action Value function
* For each **state** $s$ and **action** $a$ it yields the **expected return** if the agent starts in state $s$ then chooses action $a$ and then uses *the policy* to choose its action for all time steps.

> $q_\pi (s,a) = E_\pi[G_t|S_t = s, A_t = a ]$


With the action-value function, each state needs $n$ values, where $n$ is the number of actions available in that state.

The **optimal action-value** function is denoted by $q_*$


For a deterministic policy $\pi$, the two value functions are related by $v_\pi(s) = q_\pi(s, \pi(s))$.
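A small sketch of this relationship follows. For a stochastic policy it uses the analogous average $v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a)$, which is not stated in these notes. The action values are invented and do not come from a real MDP, so only the lookup and the averaging are illustrated.

```python
# Made-up action values q(s, a); not the true q of any particular policy.
q = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 2.0, "right": 0.0},
}
deterministic_policy = {"s0": "right", "s1": "left"}
stochastic_policy = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 0.5, "right": 0.5},
}


def v_from_q_deterministic(state):
    """v(s) = q(s, pi(s)): look up the value of the chosen action."""
    return q[state][deterministic_policy[state]]


def v_from_q_stochastic(state):
    """v(s) = sum over a of pi(a|s) * q(s, a)."""
    return sum(
        action_prob * q[state][action]
        for action, action_prob in stochastic_policy[state].items()
    )


v_det = v_from_q_deterministic("s0")   # q("s0", "right")
v_sto = v_from_q_stochastic("s0")      # 0.25 * 1.0 + 0.75 * 3.0
```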

## Optimal Policies

$\text{Interaction} \to q_* \to \pi_*$

* If an agent has the optimal action-value function $q_*$, it can easily obtain an optimal policy $\pi_*$.

## Summary

### Policies
***

* A **deterministic policy** is a mapping $\pi: S \to A$. For each state $s \in S$, it yields the action $a \in A$ that the agent will choose while in state s.
* A **stochastic policy** is a mapping $\pi : S \times A \to [0,1]$. For each state $s \in S$ and action $a \in A$, it yields the probability $\pi(a|s)$ that the agent chooses action a while in state s.



### State-Value Functions
***

* The **state-value function** for a policy $\pi$ is denoted $v_\pi$. For each state $s \in S$, it yields the expected return if the agent starts in state s and then uses the policy to choose its action for all time steps. That is, $v_\pi (s) = E_\pi[G_t|S_t = s]$. We refer to $v_\pi(s)$ as the **value of state $s$ under policy** $\pi$.


### Bellman Equations
***

* The **Bellman expectation equation for $v_\pi$** is:
> $v_\pi (s) = E_\pi[R_{t+1} +\gamma v_\pi(S_{t+1})|S_t = s]$

### Optimality
***

* A policy $\pi ^{'}$ is defined to be better than or equal to a policy $\pi$ if and only if $v_{\pi^{'}}(s) \geq v_{\pi}(s)$ for all $s \in S$.
* An **optimal policy** $\pi_*$ satisfies $\pi_* \geq \pi$ for all policies $\pi$. An optimal policy is guaranteed to exist but _may not be unique_.
* All optimal policies have the same state-value function $v_*$, called the **optimal state-value function**.


## Action-value Functions
***

* The **action-value function** for a policy $\pi$ is denoted $q_\pi$. For each state $s \in S$ and action $a \in A$, it yields the expected return if the agent starts in state s, takes action a, and __then follows the policy for all future time steps__. That is, $q_\pi (s,a) = E_\pi [G_t | S_t =s, A_t = a]$. We refer to $q_\pi(s,a)$ as the **value of taking action** a **in state** s **under policy** $\pi$ (or alternatively as the **value of the state-action pair** s,a).

* All optimal policies have the same action-value function $q_*$, called the **optimal action value function**.


## Optimal Policies
***

* Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a \in A(s)} q_*(s,a)$.
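The argmax step can be sketched as follows. The table `q_star` holds made-up numbers standing in for $q_*$, and the helper names are hypothetical. The second helper uses the standard identity $v_*(s) = \max_a q_*(s,a)$, which is not stated in these notes.

```python
# Made-up stand-in for the optimal action-value function q_*(s, a).
q_star = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 2.0, "right": 0.5},
}


def greedy_policy(q):
    """For each state, pick the action with the largest action value."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q.items()
    }


def optimal_state_values(q):
    """v_*(s) = max over a of q_*(s, a)."""
    return {state: max(action_values.values()) for state, action_values in q.items()}


pi_star = greedy_policy(q_star)
v_star = optimal_state_values(q_star)
```

If two actions tie for the largest value in a state, `max` returns the first one it encounters. Any choice among the tied actions gives an optimal policy, which is one reason the optimal policy may not be unique.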