title	date
Class 8	January 31, 2023

MDPs

$$ V_t^{\pi_t}(s_t) := E\bs{V_{t+1}^{\pi_{t+1}} (s')} + r_t(s_t, \pi_t) $$

Here the expectation is taken only over the next state $s'$, not over the reward, because we have assumed the reward to be a deterministic function of the state. (With the policy fixed, the MDP reduces to a Markov reward process.)

$$ Q_t(s,a) = r_t(s,a) + E_{s,a} \bs{V_{t+1} (f_t(s,a,W_t))} $$
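
As a sketch of how the two equations connect, assume $V_{t+1}$ above denotes the optimal value function and that the terminal value is zero at the horizon $N$ (both are assumptions, not stated in these notes). The optimal value and a greedy policy then follow from $Q_t$ by maximising over actions:

$$ V_t(s) = \max_a Q_t(s,a), \qquad \pi_t^*(s) \in \arg\max_a Q_t(s,a), \qquad V_N(s) = 0 $$

Substituting the definition of $Q_t$ gives the recursion that is solved backwards from $t = N-1$ to $t = 0$:

$$ V_t(s) = \max_a \left\{ r_t(s,a) + E_{s,a} \bs{V_{t+1} (f_t(s,a,W_t))} \right\} $$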

Example

Two-state MDP

Notation: each transition is labeled (reward, probability).

Write the Bellman Optimality Equation and find the optimal policy.

(Assume a finite horizon $N$ and solve the problem for that horizon.)
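
The exercise can be checked numerically with backward induction. The sketch below is a minimal illustration, not the solution to the example from class: the states `A` and `B`, the actions `stay` and `move`, and every reward and probability are hypothetical placeholders. It also assumes a zero terminal value, no discounting, and time-independent rewards and transitions. Each outcome is written as (reward, probability, next state), extending the (reward, probability) notation with the destination state.

```python
# Hypothetical two-state, two-action MDP. All numbers are illustrative
# placeholders, not the values from the class example.
# TRANSITIONS[state][action] is a list of (reward, probability, next_state).
TRANSITIONS = {
    "A": {
        "stay": [(1.0, 1.0, "A")],
        "move": [(0.0, 0.5, "A"), (0.0, 0.5, "B")],
    },
    "B": {
        "stay": [(4.0, 1.0, "B")],
        "move": [(0.0, 1.0, "A")],
    },
}


def backward_induction(transitions, horizon):
    """Solve a finite-horizon MDP by stepping backwards from t = N-1 to t = 0.

    Returns the optimal value at t = 0 and a list of greedy policies,
    one dict per time step, ordered from t = 0 to t = N-1.
    """
    states = list(transitions)
    value = {s: 0.0 for s in states}  # assumed terminal condition V_N(s) = 0
    policy = []
    for _ in range(horizon):
        # Q_t(s, a) = expected reward + expected value of the next state
        q = {
            s: {
                a: sum(p * (r + value[s2]) for r, p, s2 in outcomes)
                for a, outcomes in transitions[s].items()
            }
            for s in states
        }
        # Bellman optimality: pick the maximising action (ties go to the first)
        greedy = {s: max(q[s], key=q[s].get) for s in states}
        value = {s: q[s][greedy[s]] for s in states}
        policy.insert(0, greedy)  # earlier time steps go to the front
    return value, policy


if __name__ == "__main__":
    value, policy = backward_induction(TRANSITIONS, horizon=2)
    print("V_0:", value)
    for t, pi_t in enumerate(policy):
        print(f"pi_{t}:", pi_t)
```

With these placeholder numbers the greedy action can change with the time remaining, which is why the policy is stored per time step rather than as a single stationary rule.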