# Value Functions and Bellman Equations

- In the previous set of notes, we proved that the discounted return under a given policy always converges, provided the rewards are bounded and the discount factor satisfies $0 \le \gamma \lt 1$

\begin{aligned}
    G_t &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots & 0 \le \gamma \lt 1
\end{aligned}
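As a minimal sketch of the return defined above, the helper below (a hypothetical name, `discounted_return`) computes $G_t$ for a finite list of rewards. It works backwards through the list using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$. The reward sequences are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards only: a short sequence and a long constant one.
short = discounted_return([1.0, 2.0, 3.0], 0.5)
long_run = discounted_return([1.0] * 1000, 0.9)
print(short, long_run)
```

For a constant reward of 1, the infinite sum is the geometric series $1 / (1 - \gamma)$, so the long run with $\gamma = 0.9$ should sit very close to 10.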

- Because the return is guaranteed to be finite, its expectation is well defined, so we can now define the long-term value of both states and actions

- **State-Value Function**: This is the expected return when starting from state $s$ and following policy $\pi$ thereafter
    
    \begin{aligned}
        v_{\pi}(s) &= E_{\pi}[G_t | s_t = s] \\
        &\text{where} \\
        &\quad G_t = \text{Total discounted reward from time } t \\
        &\quad E_{\pi} = \text{Expectation over trajectories generated by policy } \pi
    \end{aligned}

- **Action-Value Function**: This is the expected return when taking action $a$ in state $s$ and then following policy $\pi$
    
    \begin{aligned}
        q_{\pi}(s, a) &= E_{\pi}[G_t | s_t = s, a_t = a] \\
        &\text{where} \\
        &\quad G_t = \text{Total discounted reward from time } t \\
        &\quad E_{\pi} = \text{Expectation over trajectories generated by policy } \pi
    \end{aligned}
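Both definitions are expectations over sampled trajectories, so they can be approximated by averaging sampled returns. The sketch below assumes a deliberately tiny, hypothetical problem: a single state with two actions, where action 1 pays reward 1, action 0 pays reward 0, and the policy picks each with probability 0.5. Episodes are truncated at a fixed horizon, which is an approximation of the infinite sum.

```python
import random

GAMMA = 0.5    # discount factor, assumed for illustration
HORIZON = 50   # truncation length; GAMMA ** 50 is negligible


def sample_return(rng, first_action=None):
    """Sample one truncated discounted return from the single state."""
    g, discount = 0.0, 1.0
    for t in range(HORIZON):
        if t == 0 and first_action is not None:
            action = first_action        # forced first action, as in q_pi(s, a)
        else:
            action = rng.choice([0, 1])  # otherwise follow the policy pi
        g += discount * float(action)    # in this toy problem, reward == action
        discount *= GAMMA
    return g


def estimate(n_episodes, first_action=None, seed=0):
    """Average sampled returns: a Monte Carlo estimate of v_pi or q_pi."""
    rng = random.Random(seed)
    total = sum(sample_return(rng, first_action) for _ in range(n_episodes))
    return total / n_episodes


v_hat = estimate(5000)                  # estimates v_pi(s)
q_hat = estimate(5000, first_action=1)  # estimates q_pi(s, a=1)
print(v_hat, q_hat)
```

In this toy problem the exact values are $v_{\pi}(s) = 0.5 / (1 - 0.5) = 1$ and $q_{\pi}(s, 1) = 1 + 0.5 \cdot v_{\pi}(s) = 1.5$, so the estimates should land near those numbers.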

## Bellman Equations

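- The Bellman equations express the value functions above recursively. They follow from the fact that the return itself is recursive, $G_t = r_{t+1} + \gamma G_{t+1}$, so the value of a state (or action) can be written in terms of the values of its successors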

\begin{aligned}
    &\text{State-Value Function:} \\
    &\quad v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma v_{\pi}(s')] \\ \\
    &\text{Action-Value Function:} \\
    &\quad q_{\pi}(s,a) = \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
\end{aligned}

- Intuition
    - **State-Value Function**: The value of state $s$ is the expected one-step reward plus the discounted value of the next state $s'$. The expectation is taken over the actions $a$ the policy may choose in $s$, weighted by $\pi(a|s)$, and over the next states each action may lead to, weighted by $P(s'|s,a)$
    - **Action-Value Function**: The value of taking action $a$ in state $s$ is the expected reward for the transition from $s$ to $s'$, plus the discounted value of continuing from $s'$. That continuation value averages $q_{\pi}(s', a')$ over next actions $a'$, weighted by $\pi(a'|s')$; the whole bracket is then averaged over next states $s'$, weighted by $P(s'|s,a)$

- What happens under an optimal policy $\pi^*$?

- State-Value Function
    - For the state-value function, if we follow $\pi^*$, there is no need to average over all possible actions, because we always take the best action $a^*$. The outer summation over actions is therefore replaced by a maximisation. The inner summation remains, because an action $a$ still moves us to a new state $s'$ probabilistically

    \begin{aligned}
        &\quad v^*(s) = \max_a \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma v^{*}(s')]
    \end{aligned} 
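The optimal state-value equation can be solved numerically by value iteration, which repeatedly applies its right-hand side as an update. The sketch below assumes a hypothetical 2-state, 2-action MDP stored as `P[s][a] = [(prob, next_state, reward), ...]`. The transition numbers are invented for illustration.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9


def value_iteration(P, gamma, tol=1e-10):
    """Iterate v(s) <- max_a sum_s' P(s'|s,a) [R + gamma * v(s')] to convergence."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {
            s: max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        if max(abs(new_v[s] - v[s]) for s in P) < tol:
            return new_v
        v = new_v


v_star = value_iteration(P, gamma)
print(v_star)
```

In this toy MDP, staying in state 1 with action 1 earns reward 2 forever, so $v^*(1) = 2 / (1 - 0.9) = 20$, which gives a quick sanity check on the result.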

- Action-Value Function
    - For the action-value function, if we follow $\pi^*$, we are no longer uncertain about which action we will take in the next state $s'$. The policy-weighted sum over next actions $a'$ is therefore replaced by a maximisation: in $s'$ we simply take the action with the highest action value

    \begin{aligned}    
        &\quad q^{*}(s,a) = \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma \max_{a'} q^{*}(s', a')]
    \end{aligned} 
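The same fixed-point idea applies to action values. The sketch below iterates the optimal action-value equation on a hypothetical 2-state, 2-action MDP stored as `P[s][a] = [(prob, next_state, reward), ...]`. Both the MDP and the function name are assumptions made for illustration.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9


def q_value_iteration(P, gamma, tol=1e-10):
    """Iterate q(s,a) <- sum_s' P(s'|s,a) [R + gamma * max_a' q(s',a')]."""
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        new_q = {
            s: {
                a: sum(
                    p * (r + gamma * max(q[s2].values()))
                    for p, s2, r in P[s][a]
                )
                for a in P[s]
            }
            for s in P
        }
        delta = max(abs(new_q[s][a] - q[s][a]) for s in P for a in P[s])
        q = new_q
        if delta < tol:
            return q


q_star = q_value_iteration(P, gamma)
print(q_star)
```

Note that the maximisation sits inside the sum over $s'$: the best next action is chosen separately for each next state the transition might reach.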

- Optimal Policy
    - The optimal policy, therefore, is simply to greedily pick the action $a$ with the highest optimal action-value in each state

    \begin{aligned}    
        &\quad \pi^*(s) = \arg\max_a q^*(s,a)
    \end{aligned} 
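Given a table of action values, extracting the greedy policy is a one-line lookup per state. The sketch below assumes `q[s][a]` is a nested dict. The numbers in `q_table` are placeholders, not values computed from any particular MDP.

```python
def greedy_policy(q):
    """Return pi(s) = argmax_a q[s][a] for every state s in the table."""
    # max over the action keys, ranked by their q-value; ties go to the
    # first action encountered.
    return {s: max(q[s], key=q[s].get) for s in q}


# Placeholder action values for two states and two actions.
q_table = {
    0: {0: 1.0, 1: 2.0},
    1: {0: 3.0, 1: 0.5},
}
policy = greedy_policy(q_table)
print(policy)
```

The policy is optimal only when the table holds $q^*$. Acting greedily on any other table gives the greedy policy for that table, nothing more.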

- Taken together:

\begin{aligned}
q^*(s,a) &= \mathbb{E}[r + \gamma \max_{a'} q^*(s',a') \mid s, a] \\
v^*(s) &= \max_a q^*(s,a) \\
\pi^*(s) &= \arg\max_a q^*(s,a)
\end{aligned}
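
- As a short check that these three lines agree with the optimality equations earlier in this section, write the expectation out over next states (same $P$ and $R$ as before) and substitute $v^*(s') = \max_{a'} q^*(s', a')$:

\begin{aligned}
q^*(s,a) &= \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma \max_{a'} q^*(s', a')] \\
&= \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma v^*(s')] \\
v^*(s) &= \max_a q^*(s,a) = \max_a \sum_{s'} P(s' | s, a) [R(s, a, s') + \gamma v^*(s')]
\end{aligned}

- The last line is the optimal state-value equation, so $v^*$ and $q^*$ carry the same information whenever $P$ and $R$ are known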
