# Policy

Deterministic Policy

$\pi : \mathcal{S} \to \mathcal{A}$

Stochastic Policy

$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$

![policyDeterStoca.png](attachment:policyDeterStoca.png)

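As a rough illustration of the two kinds of policy, the sketch below stores a deterministic policy as a state-to-action lookup and a stochastic policy as a table of probabilities per state. The state names, action names, and probabilities are made up for the example and do not come from the gridworld figures.

```python
import random

# Hypothetical two-state, two-action example (names are illustrative only).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability pi(a | s) for every (state, action) pair.
stochastic_policy = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(policy, state):
    """Return the single action the policy prescribes in `state`."""
    return policy[state]


def act_stochastic(policy, state, rng=random):
    """Sample an action according to the probabilities pi(a | s)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A deterministic policy can always be written as a stochastic one that puts probability 1 on a single action.
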
## Gridworld

3x3 grid, mountains in the middle, grass making an M

-1 reward for grass
-3 reward for mountain

![gridworldrewards.png](attachment:gridworldrewards.png)

![discountedReturnMathGridworld.png](attachment:discountedReturnMathGridworld.png)

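The discounted return can also be computed by hand for a short sequence of rewards. The sketch below assumes a made-up path through grass and mountain cells (using the -1 and -3 rewards listed above) and ignores any reward for reaching a goal; the function name `discounted_return` is an illustrative choice.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical path: grass, grass, mountain, grass.
rewards = [-1, -1, -3, -1]
print(discounted_return(rewards, gamma=1.0))  # -6.0
print(discounted_return(rewards, gamma=0.5))  # -2.375
```

With `gamma=0.5` the later mountain penalty counts for less than it does in the undiscounted case.
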
### State Value Function

![stateValueFunctionGridworld.png](attachment:stateValueFunctionGridworld.png)

### Bellman Equations

Value functions have a nice recursive property.

In this gridworld example, once the agent selects an action:

- it always moves in the chosen direction (in contrast to general MDPs, where the agent doesn't always have complete control over what the next state will be), and
- the reward can be predicted with complete certainty (in contrast to general MDPs, where the reward is a random draw from a probability distribution).

In this simple example, we saw that the value of any state can be calculated as the sum of the immediate reward and the (discounted) value of the next state.

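The recursive relationship for the deterministic case can be sketched in a few lines: starting from a terminal state with value 0, each earlier state's value is its immediate reward plus the discounted value of the state that follows. The path and the function name `values_along_path` are assumptions made for illustration, not the exact route shown in the figures.

```python
def values_along_path(rewards, gamma=1.0):
    """Values of the states visited on a deterministic path to a terminal state.

    rewards[t] is the reward received when leaving the t-th state on the path.
    The returned list has one extra entry for the terminal state (value 0).
    """
    values = [0.0] * (len(rewards) + 1)
    # Work backwards: v(s) = r + gamma * v(s').
    for t in range(len(rewards) - 1, -1, -1):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values


# Hypothetical path: grass, mountain, grass, then the terminal state.
print(values_along_path([-1, -3, -1], gamma=1.0))  # [-5.0, -4.0, -1.0, 0.0]
```
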
Alexis mentioned that for a general MDP, we instead have to work in terms of an expectation, since it is rarely the case that the immediate reward and next state can be predicted with certainty. Indeed, we saw in an earlier lesson that the reward and next state are chosen according to the one-step dynamics of the MDP. In this case, where the reward $r$ and next state $s'$ are drawn from a (conditional) probability distribution $p(s', r \mid s, a)$, the Bellman Expectation Equation (for $v_\pi$) expresses the value of any state $s$ in terms of the expected immediate reward and the expected value of the next state:

![bellmanExpectation.png](attachment:bellmanExpectation.png)

### Calculating the Expectation

In the event that the agent's policy $\pi$ is deterministic, the agent selects action $\pi(s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over two variables ($s'$ and $r$):

![BellmanDeterministic.png](attachment:BellmanDeterministic.png)

In this case, we multiply the sum of the reward and discounted value of the next state $(r + \gamma v_\pi(s'))$ by its corresponding probability $p(s', r \mid s, \pi(s))$ and sum over all possibilities to yield the expected value.

If the agent's policy $\pi$ is stochastic, the agent selects action $a$ with probability $\pi(a \mid s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over three variables ($s'$, $r$, and $a$):

![BellmanStochastic.png](attachment:BellmanStochastic.png)

In this case, we multiply the sum of the reward and discounted value of the next state $(r + \gamma v_\pi(s'))$ by its corresponding probability $\pi(a \mid s)\, p(s', r \mid s, a)$ and sum over all possibilities to yield the expected value.

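The weighted sum described above can be written out directly. The sketch below is one possible encoding, not the course's implementation: `dynamics[s][a]` is assumed to hold the entries of $p(s', r \mid s, a)$ as `(probability, next_state, reward)` tuples, `policy[s][a]` holds $\pi(a \mid s)$, and the three-state MDP is invented for the example. Repeating the backup until the values stop changing is usually called iterative policy evaluation, which these notes do not cover; it is used here only to obtain numbers that satisfy the equation.

```python
def bellman_backup(state, policy, dynamics, values, gamma):
    """Right-hand side of the Bellman expectation equation for one state.

    policy[state][action]   -> pi(a | s)
    dynamics[state][action] -> list of (probability, next_state, reward),
                               i.e. the entries of p(s', r | s, a)
    """
    total = 0.0
    for action, action_prob in policy[state].items():
        for prob, next_state, reward in dynamics[state][action]:
            total += action_prob * prob * (reward + gamma * values[next_state])
    return total


def evaluate_policy(policy, dynamics, gamma=0.9, tol=1e-10):
    """Repeat the backup for every non-terminal state until the values settle."""
    values = {state: 0.0 for state in dynamics}
    while True:
        delta = 0.0
        for state in policy:  # terminal states have no policy entry
            new_value = bellman_backup(state, policy, dynamics, values, gamma)
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            return values


# Hypothetical three-state MDP; "T" is terminal and keeps value 0.
dynamics = {
    "A": {
        "stay": [(1.0, "A", -1.0)],
        "go": [(0.8, "B", -1.0), (0.2, "A", -1.0)],
    },
    "B": {"go": [(1.0, "T", -3.0)]},
    "T": {},
}
policy = {
    "A": {"stay": 0.5, "go": 0.5},  # stochastic choice in A
    "B": {"go": 1.0},               # deterministic choice in B
}

values = evaluate_policy(policy, dynamics, gamma=0.9)
print(values)
```

A deterministic policy is the special case in which each `policy[s]` puts probability 1 on a single action, so the outer loop over actions collapses to one term.
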
(Read the reinforcement learning book!)


## Optimality

![optimalPolicies.png](attachment:optimalPolicies.png)

![optimalGridWorlds.png](attachment:optimalGridWorlds.png)

## Action-Value Function

The action-value function can be defined as:

![actionValueFdefinition.png](attachment:actionValueFdefinition.png)

![gridWorldActionValueFuction.png](attachment:gridWorldActionValueFuction.png)

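If the state values under a policy are already known, action values can be obtained with a one-step lookahead: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,(r + \gamma v_\pi(s'))$. This is a standard identity rather than something stated in these notes. In the sketch below the dynamics, state names, and state values are all hypothetical.

```python
def action_value(state, action, dynamics, values, gamma=1.0):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(
        prob * (reward + gamma * values[next_state])
        for prob, next_state, reward in dynamics[state][action]
    )


# Hypothetical deterministic moves out of a single state.
# Each entry is a list of (probability, next_state, reward) tuples.
dynamics = {
    "start": {
        "right": [(1.0, "grass", -1.0)],
        "down": [(1.0, "mountain", -3.0)],
    }
}
# Assumed state values under some policy (made up for the example).
values = {"grass": -2.0, "mountain": -1.0}

print(action_value("start", "right", dynamics, values))  # -3.0
print(action_value("start", "down", dynamics, values))   # -4.0
```
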
## Optimal Policies

The agent interacts with the environment to estimate the optimal value function, which it then uses to find the optimal policy.

Interaction -> q* (action value) -> pi* (policy)

Pick the actions that yield the highest expected return.

![interactionForOptimalValueFunction.png](attachment:interactionForOptimalValueFunction.png)

# Summary

## Policies

- A deterministic policy is a mapping $\pi : \mathcal{S} \to \mathcal{A}$. For each state $s \in \mathcal{S}$, it yields the action $a \in \mathcal{A}$ that the agent will choose while in state $s$.
- A stochastic policy is a mapping $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$. For each state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, it yields the probability $\pi(a \mid s)$ that the agent chooses action $a$ while in state $s$.

## State-Value Functions

- The state-value function for a policy $\pi$ is denoted $v_\pi$. For each state $s \in \mathcal{S}$, it yields the expected return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps. That is, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$. We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$.
- The notation $\mathbb{E}_\pi$ is borrowed from the suggested textbook, where $\mathbb{E}_\pi[\cdot]$ is defined as the expected value of a random variable, given that the agent follows policy $\pi$.

## Bellman Equations

The Bellman expectation equation for $v_\pi$ is: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$.

## Optimality


- A policy $\pi'$ is defined to be better than or equal to a policy $\pi$ (written $\pi' \geq \pi$) if and only if $v_{\pi'}(s) \geq v_\pi(s)$ for all $s \in \mathcal{S}$.
- An optimal policy $\pi_*$ satisfies $\pi_* \geq \pi$ for all policies $\pi$. An optimal policy is guaranteed to exist but may not be unique.
- All optimal policies have the same state-value function $v_*$, called the optimal state-value function.

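The "better than or equal to" comparison is a state-by-state check, which means two policies can be incomparable when each is better in a different state. The sketch below illustrates this with made-up value functions for three hypothetical policies; the function name `is_better_or_equal` is an illustrative choice.

```python
def is_better_or_equal(v_new, v_old):
    """True if v_new(s) >= v_old(s) for every state s."""
    return all(v_new[s] >= v_old[s] for s in v_old)


# Hypothetical state-value functions for three policies over two states.
v_a = {"s0": -3.0, "s1": -1.0}
v_b = {"s0": -4.0, "s1": -1.0}
v_c = {"s0": -2.0, "s1": -5.0}

print(is_better_or_equal(v_a, v_b))  # True: a is at least as good in every state
print(is_better_or_equal(v_b, v_a))  # False
print(is_better_or_equal(v_a, v_c))  # False: c is better in s0
print(is_better_or_equal(v_c, v_a))  # False: a is better in s1
```
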
## Action-Value Functions


- The action-value function for a policy $\pi$ is denoted $q_\pi$. For each state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, it yields the expected return if the agent starts in state $s$, takes action $a$, and then follows the policy for all future time steps. That is, $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. We refer to $q_\pi(s, a)$ as the value of taking action $a$ in state $s$ under a policy $\pi$ (or alternatively as the value of the state-action pair $s, a$).
- All optimal policies have the same action-value function $q_*$, called the optimal action-value function.


## Optimal Policies

Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a \in \mathcal{A}(s)} q_*(s, a)$.

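The argmax step can be sketched as a lookup over a table of action values. The table `q_star` below is invented for illustration; in practice it would be the agent's estimate of $q_*$.

```python
def greedy_policy(q_values):
    """For each state, pick an action with the largest action value.

    q_values[state][action] -> estimated q*(state, action)
    Ties are broken in favour of the action listed first.
    """
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q_values.items()
    }


# Hypothetical optimal action values for two states.
q_star = {
    "s0": {"right": -3.0, "down": -4.0},
    "s1": {"right": -2.0, "down": -1.0},
}

print(greedy_policy(q_star))  # {'s0': 'right', 's1': 'down'}
```

When several actions tie for the maximum, any of them gives an optimal policy, which is one reason an optimal policy need not be unique.
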