# More on Bellman equations

## MDP

We define our MDP differently from the Introduction to RL book:

<img src='./pngs/mdp.png'>

Here $M(X)$ is the space of all probability distributions defined over the state space $X$.

So we can describe the process as a sequence of states, actions, and rewards:

$X_0, A_0, R_1, X_1, A_1, R_2, ...$

A deterministic dynamical system always behaves exactly the same way given the same starting state and action. That is, it can be described by a transition function $f$ instead of a transition distribution $\rho$:

$x_{t+1} = f(x_t, a_t)$
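
As a minimal sketch of deterministic dynamics, the snippet below rolls out a made-up transition function `f` on a line of five cells. The function `f`, the helper `rollout`, and the grid itself are assumptions for illustration only; the point is that the same start state and action sequence always produce the same trajectory.

```python
def f(x, a):
    """Hypothetical deterministic transition: move by a on cells 0..4, clipped at the ends."""
    return min(max(x + a, 0), 4)


def rollout(x0, actions):
    """Apply x_{t+1} = f(x_t, a_t) repeatedly and return the visited states."""
    xs = [x0]
    for a in actions:
        xs.append(f(xs[-1], a))
    return xs


actions = [1, 1, 1, -1]
trajectory = rollout(2, actions)        # 2 -> 3 -> 4 -> 4 (clipped) -> 3
same_again = rollout(2, actions)        # identical, since f has no randomness
```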


## Policy

<img src='pngs/policy_1.png'>
<img src='pngs/policy.png'>

## Policy induced Transition Kernels

<img src='pngs/transition_kernel.png'>

## Rewards and Returns

Let's define the expected reward as

$r(x, a) = E[R | X=x, A=a]$. The expected reward when following policy $\pi$ is then defined as

$r^{\pi} (x) = E_{a \sim \pi(\cdot | x)} [r(x, a)]$

Define the return following policy $\pi$ as the discounted sum of rewards starting from $X_0$:

$G^{\pi} = R_1 + \gamma R_2 + \gamma^{2} R_3 + \dots = \sum^{\infty}_{t=1} \gamma^{t-1}R_{t}$

or, starting at any step $t$ from $X_{t}$:

$G^{\pi}_{t} = \sum^{\infty}_{i=t+1} \gamma^{i - t - 1}R_{i}$

Note that $\gamma$ is part of the problem setting; it is usually not a hyperparameter.
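
The discounted return can be sketched in a few lines, assuming a reward sequence truncated to a finite list (the infinite sum cannot be evaluated directly). The helper `discounted_return` and the reward values are made up for illustration. Accumulating from the last reward backwards also shows the recursion $G^{\pi}_{t} = R_{t+1} + \gamma G^{\pi}_{t+1}$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * rewards[k]; rewards[0] plays the role of R_{t+1}."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
    return g


rewards = [1.0, 0.0, 2.0]               # hypothetical R_1, R_2, R_3
gamma = 0.5

g0 = discounted_return(rewards, gamma)      # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
g1 = discounted_return(rewards[1:], gamma)  # 0 + 0.5 * 2 = 1.0
```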

## Value functions

Next, we define the value functions.

The state-value function, starting from state $x$ and following policy $\pi$:

$V^{\pi}(x) = E[G^{\pi} | X_{0} = x] = E[G^{\pi}_{t} | X_{t} = x]$ (Markov property)

The action-value function, starting from state $x$, taking action $a$, and following policy $\pi$ afterwards:

$Q^{\pi} (x, a) = E[G^{\pi} | X_{0} = x, A_{0} = a] = E[G^{\pi}_{t} | X_{t} = x, A_{t} = a]$ (Markov property)

## Bellman equations

By expanding the value function and using $G^{\pi}_{t} = R_{t+1} + \gamma G^{\pi}_{t+1}$, we reveal the recursive structure of the return:

\begin{aligned}
V^{\pi} (x) &= E[G^{\pi}_{t} | X_{t} = x]\\
& = E[R_{t+1} + \gamma G^{\pi}_{t+1} | X_{t} = x]\\
& = E_{a \sim \pi(\cdot | x)}[R_{t+1} | X_{t} = x] + \gamma E[G^{\pi}_{t+1} | X_{t} = x]\\
& = r^{\pi} (x) + \gamma E[V^{\pi} (X_{t+1}) | X_{t} = x]
\end{aligned}

Here the last step uses the tower property together with the Markov property:

$E[V^{\pi} (X_{t+1}) | X_{t} = x] = E[E[G^{\pi}_{t+1} | X_{t+1}] | X_{t} = x] = E[G^{\pi}_{t+1} | X_{t} = x]$

Neither side of this equation is random: both are deterministic functions of $x$. Expanding $E[V^{\pi} (X_{t+1}) | X_{t} = x]$, we get

$E[V^{\pi} (X_{t+1}) | X_{t} = x] = \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)$

Thus, the value function can be expressed as:

$V^{\pi} (x) = r^{\pi} (x) + \gamma \int P(dx^\prime | x, a) \pi (da | x) V^{\pi} (x^\prime)$

This is the **Bellman equation** for a policy $\pi$. It can be interpreted as follows: the value of following policy $\pi$ from state $x$ is the expected immediate reward that the $\pi$-following agent receives at that state, plus the discounted expected value that the agent receives at the next state.
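
For a finite MDP the integral becomes a sum, and the Bellman equation for $\pi$ is a linear system that can be solved directly. The sketch below assumes a made-up 2-state, 2-action MDP; the arrays `P`, `r`, `pi` and all their numbers are hypothetical, chosen only to illustrate the equation.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x'] = P(x' | x, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[x, a] = pi(a | x), a uniform policy
gamma = 0.9

r_pi = (pi * r).sum(axis=1)                # r^pi(x) = sum_a pi(a|x) r(x, a)
P_pi = np.einsum("xa,xay->xy", pi, P)      # P^pi(x'|x) = sum_a pi(a|x) P(x'|x, a)

# V = r_pi + gamma * P_pi V   <=>   (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Plugging V_pi back into the Bellman equation should leave a zero residual.
residual = V_pi - (r_pi + gamma * P_pi @ V_pi)
```

Solving the linear system is only practical for small state spaces; it is used here because it makes the fixed-point structure of the equation explicit.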

The same holds for the action-value function, whose Bellman equation is:

$Q^{\pi} (x, a) = r(x, a) + \gamma \int P(dx^\prime | x, a) V^{\pi}(x^\prime) = r(x, a) + \gamma \int P(dx^\prime | x, a) \pi (da^\prime | x^\prime) Q^{\pi} (x^\prime, a^\prime)$

Compared with the state-value function, the first action is fixed: we take the expected reward of action $a$ in state $x$, plus the discounted expected value that the agent receives at the next state when following $\pi$ from there on (in the action-value function, the action at state $x$ is given rather than drawn from $\pi$).
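
The relationship between $Q^{\pi}$ and $V^{\pi}$ can be checked numerically on a finite MDP. This is a sketch on a made-up 2-state, 2-action MDP; `P`, `r`, `pi` and their values are hypothetical. It builds $Q^{\pi}$ from $V^{\pi}$, then confirms that averaging $Q^{\pi}$ over the policy recovers $V^{\pi}$ and that $Q^{\pi}$ satisfies its own Bellman equation.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[x, a]
gamma = 0.9

# V^pi from the linear system (I - gamma * P_pi) V = r_pi.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("xa,xay->xy", pi, P)
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q^pi(x, a) = r(x, a) + gamma * sum_x' P(x'|x, a) V^pi(x')
Q_pi = r + gamma * P @ V_pi

# V^pi(x) = sum_a pi(a|x) Q^pi(x, a)
V_from_Q = (pi * Q_pi).sum(axis=1)

# Right-hand side of the Bellman equation written purely in terms of Q^pi.
Q_rhs = r + gamma * P @ (pi * Q_pi).sum(axis=1)
```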

## Bellman Equations for Optimal Value Functions

It is natural to ask whether the value function of the optimal policy, $V^{\pi^*}$, has a similar structure. The answer is yes, but this needs a proof.

The proof rests on three claims, which we will prove later:

1. There exists a value function $V^*$ such that, for all $x \in X$, we have:
   $V^{*}(x) = \max_{a} \{Q^{*}(x, a)\} = \max_{a} \{r(x, a) + \gamma \int P(dx^\prime | x, a) V^{*}(x^\prime)\}$.

   There exists a value function $Q^*$ such that, for all $(x, a) \in X \times A$, we have:
   $Q^{*}(x, a) = r(x, a) + \gamma \int P(dx^\prime | x, a) \max_{a^\prime} \{Q^{*} (x^\prime, a^\prime)\}$.

   These equations are called the **Bellman optimality equations for the value functions**.

2. $V^{*}$ is the same as ${V^{\pi^*}}$, the optimal value function when $\pi$ is restricted to be within the space of all stationary and non-stationary policies.

3. For discounted continuing MDPs, we can always find a stationary policy that is optimal within the space of all stationary and non-stationary policies.

In summary, we claim that $V^{*}$ exists, is unique, and satisfies $V^{*} = V^{\pi^{*}}$. The same holds for $Q^{*}$.
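
Although the claims are proved later, the Bellman optimality equation for $V^{*}$ can be illustrated numerically. The sketch below runs value iteration (repeatedly applying the right-hand side of the optimality equation) on a made-up 2-state, 2-action MDP; `P`, `r`, and the helper `optimality_backup` are hypothetical names, and value iteration is one standard way to approximate $V^{*}$, not necessarily the route the proof takes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
gamma = 0.9


def optimality_backup(V):
    """max_a { r(x, a) + gamma * sum_x' P(x'|x, a) V(x') } for every state x."""
    return (r + gamma * P @ V).max(axis=1)


V_star = np.zeros(2)
for _ in range(1000):                      # plenty of sweeps for gamma = 0.9
    V_star = optimality_backup(V_star)

# At convergence, V_star satisfies the Bellman optimality equation.
optimality_gap = np.abs(V_star - optimality_backup(V_star)).max()
```

In this toy MDP, staying in state 1 with action 1 yields reward 2 forever, so $V^{*}(1) = 2 / (1 - 0.9) = 20$, which the iteration reproduces.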

## Bellman Operators

To prove these claims, we need several concepts:

<img src='pngs/bellman-operator.png'>
<img src='pngs/bellman-opt-operator.png'>

Note that $T^{\pi}$ is affine rather than linear (because of the reward term), and $T^{*}$ is nonlinear (because of the max). Recall that:

$Q^{\pi} (x, a) = r(x, a) + \gamma \int P(dx^\prime | x, a) V^{\pi}(x^\prime) = r(x, a) + \gamma \int P(dx^\prime | x, a) \pi (da^\prime | x^\prime) Q^{\pi} (x^\prime, a^\prime)$

This implies $T^{\pi} Q^{\pi} = Q^{\pi}$. The same argument applies to the state-value function and to the Bellman optimality operators. Thus, we can conclude that:

$V^{\pi} = T^{\pi} V^{\pi}$

$Q^{\pi} = T^{\pi} Q^{\pi}$

$V^{*} = T^{*} V^{*}$

$Q^{*} = T^{*} Q^{*}$
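
The fixed-point relations can be checked numerically by writing the operators as functions on Q-tables. This sketch assumes the standard finite-MDP forms $(T^{\pi} Q)(x, a) = r(x, a) + \gamma \sum_{x^\prime} P(x^\prime | x, a) \sum_{a^\prime} \pi(a^\prime | x^\prime) Q(x^\prime, a^\prime)$ and $(T^{*} Q)(x, a) = r(x, a) + \gamma \sum_{x^\prime} P(x^\prime | x, a) \max_{a^\prime} Q(x^\prime, a^\prime)$. The MDP, the function names `T_pi`, `T_star`, and the helper `fixed_point` are all hypothetical.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[x, a]
gamma = 0.9


def T_pi(Q):
    """Bellman operator for policy pi, acting on a Q-table of shape (states, actions)."""
    return r + gamma * P @ (pi * Q).sum(axis=1)


def T_star(Q):
    """Bellman optimality operator, acting on a Q-table."""
    return r + gamma * P @ Q.max(axis=1)


def fixed_point(T, n_iters=1000):
    """Approximate the fixed point of T by repeated application, starting from zero."""
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        Q = T(Q)
    return Q


Q_pi = fixed_point(T_pi)       # satisfies Q_pi = T_pi(Q_pi) up to rounding
Q_star = fixed_point(T_star)   # satisfies Q_star = T_star(Q_star) up to rounding
```

Repeated application converges here because both operators are contractions for $\gamma < 1$, the second of the two properties listed in the next subsection.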

### Properties of Bellman operators

Bellman operators have several important properties. The ones that matter most to us are:

1. Monotonicity
2. Contraction

#### Monotonicity

<img src='pngs/monotonicity.png'>
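
Monotonicity can be spot-checked numerically. The sketch below assumes the usual statement of the property, namely that $V_1 \le V_2$ componentwise implies $T^{\pi} V_1 \le T^{\pi} V_2$ and $T^{*} V_1 \le T^{*} V_2$, and tests it on random pairs for a made-up 2-state, 2-action MDP. All names and numbers are hypothetical, and a random check is an illustration, not a proof.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[x, a]
gamma = 0.9


def T_pi(V):
    """Bellman operator for pi on state-value functions."""
    return (pi * (r + gamma * P @ V)).sum(axis=1)


def T_star(V):
    """Bellman optimality operator on state-value functions."""
    return (r + gamma * P @ V).max(axis=1)


rng = np.random.default_rng(0)
pi_monotone = True
star_monotone = True
for _ in range(100):
    V1 = rng.normal(size=2)
    V2 = V1 + rng.uniform(0.0, 1.0, size=2)    # V2 >= V1 componentwise
    # A small tolerance guards against floating-point rounding.
    pi_monotone &= bool(np.all(T_pi(V1) <= T_pi(V2) + 1e-12))
    star_monotone &= bool(np.all(T_star(V1) <= T_star(V2) + 1e-12))
```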



Given $V^{*}$ or $Q^{*}$, we can easily find $\pi^{*}$.

For any $x \in X$, the optimal policy is a greedy policy:

$\pi_g^{*} (x) = \arg\max_{a} Q^{*} (x, a)$
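
Greedy policy extraction can be sketched as follows, on a made-up 2-state, 2-action MDP (`P`, `r`, and all numbers are hypothetical). The code approximates $Q^{*}$ by iterating the optimality backup, takes the argmax per state, and then evaluates the resulting deterministic policy to confirm that its value matches $\max_a Q^{*}(x, a)$.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, a, x']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[x, a]
gamma = 0.9

# Approximate Q* by repeatedly applying the Bellman optimality backup.
Q_star = np.zeros((2, 2))
for _ in range(1000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

# Greedy policy: one action index per state.
greedy = Q_star.argmax(axis=1)

# Evaluate the greedy (deterministic, stationary) policy exactly.
states = np.arange(2)
P_g = P[states, greedy]                    # P_g[x, x'] = P(x' | x, greedy[x])
r_g = r[states, greedy]                    # r_g[x] = r(x, greedy[x])
V_greedy = np.linalg.solve(np.eye(2) - gamma * P_g, r_g)
```

In this toy MDP the greedy policy picks action 1 in both states, and its value coincides with $V^{*}$.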

