# Lecture 2: Markov Decision Process

### Introduction to MDPs

- MDPs formally describe the environment for RL
- The environment is fully observable
- i.e. the current state completely characterizes the process (how the environment unfolds depends only on the current state, and that state is known)
- Almost all RL problems can be formalized as MDPs, e.g.
    - Optimal control primarily deals with continuous MDPs (e.g. an octopus swimming through a fluid)
    - Partially observable problems can be converted into MDPs
    - Bandits are MDPs with one state

### State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is given by
$$P_{ss'} = \mathbb P[S_{t+1} = s' | S_{t} = s]$$  

Since we can observe all the states, we can build the state transition matrix containing the transition probability from all states $s$ to all successor states $s'$.
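
As a minimal sketch, the matrix can be stored as a nested list in which row $s$ holds the distribution over successor states $s'$, so every row must sum to 1. The three states and their probabilities below are made up for illustration, and the helper names `is_row_stochastic` and `transition_prob` are hypothetical.

```python
# Hypothetical 3-state chain; the state names and probabilities are illustrative only.
states = ["sunny", "cloudy", "rainy"]

# P[i][j] = probability of moving from states[i] to states[j].
P = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]

def is_row_stochastic(matrix, tol=1e-9):
    """Return True if every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def transition_prob(matrix, names, s, s_next):
    """Look up P_{ss'} by state name."""
    return matrix[names.index(s)][names.index(s_next)]

print(is_row_stochastic(P))
print(transition_prob(P, states, "sunny", "rainy"))
```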

### Markov Process

A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states $S_{1}, S_{2}, ...$ with the Markov property. It is denoted by a tuple $(S, P)$.
* $S$ is a finite set of states 
* $P$ is the transition probability matrix, $P_{ss'} = \mathbb P[S_{t+1} = s' | S_{t} = s]$
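
To make the definition concrete, here is a small sketch that samples a sequence of states from a transition matrix. The three-state chain is hypothetical and `sample_chain` is an illustrative helper, not part of any library. Each step looks only at the current state, which is the Markov property in code form.

```python
import random

def sample_chain(P, start, steps, rng):
    """Sample a trajectory of state indices from transition matrix P."""
    state = start
    trajectory = [state]
    for _ in range(steps):
        # The next state is drawn using only the current state's row of P.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        trajectory.append(state)
    return trajectory

# Hypothetical 3-state chain; state 2 is absorbing (it only transitions to itself).
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.0, 1.0],
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
trajectory = sample_chain(P, start=0, steps=10, rng=rng)
print(trajectory)
```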


### Markov Reward Process

A Markov reward process (MRP) is a Markov chain with values. It is denoted by a tuple $(S, P, R, \gamma)$.
* $S$ is a finite set of states 
* $P$ is the transition probability matrix, $P_{ss'} = \mathbb P[S_{t+1} = s' | S_{t} = s]$
* $R$ is the immediate reward function, $R_{s} = \mathbb E[R_{t+1}|S_{t} = s]$
* $\gamma$ is a discount factor, $\gamma \in [0,1]$

#### Return

The return $G_{t}$ is the total discounted reward from time step $t$. It is a random variable: each sampled trajectory yields one value of it.
$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$$
* $\gamma$ close to 0 leads to myopic evaluation
* $\gamma$ close to 1 leads to far-sighted evaluation
* Discounting is important because our model of the environment might not be perfect, so rewards far in the future are less certain
* Discounting also avoids infinite returns in cyclic Markov processes
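
The formula for $G_{t}$ can be sketched for a finite reward sequence as follows. The reward values are hypothetical, and `discounted_return` is an illustrative helper name.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 2.0, 3.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.0))  # 1.0  (myopic: only the first reward counts)
print(discounted_return(rewards, 0.5))  # 2.75
print(discounted_return(rewards, 1.0))  # 6.0  (undiscounted sum)
```

With a constant reward of 1 on every step and $\gamma = 0.9$, the return stays bounded and approaches $1/(1-\gamma) = 10$ however long the sequence runs. This is how discounting keeps returns finite in cyclic processes.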

#### Value Function

The value function $v(s)$ gives the long-term value of state $s$. The state value function of an MRP is the expected return starting from state $s$:
$$v(s) = \mathbb E[G_{t} | S_{t} = s]$$
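
One way to make the expectation concrete is to average many sampled returns. The sketch below does this for a hypothetical one-state MRP. The MRP, the discount factor of 0.9 and the helper names are assumptions chosen for illustration. In state A the reward is 1, after which the process stays in A or terminates with equal probability.

```python
import random

GAMMA = 0.9  # assumed discount factor for this sketch

def sample_return(rng):
    """Sample one return G_t from the hypothetical MRP, starting in state A."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * 1.0      # reward of 1 received on leaving state A
        discount *= GAMMA
        if rng.random() < 0.5:   # move to the terminal state (no further reward)
            return g

def estimate_value(num_samples, seed=0):
    """Approximate v(A) = E[G_t | S_t = A] by averaging sampled returns."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(num_samples)) / num_samples

print(estimate_value(20_000))
```

For this particular MRP the exact value satisfies $v(A) = 1 + 0.9 \cdot 0.5 \cdot v(A)$, so $v(A) = 1/0.55 \approx 1.82$, and the sample average should land close to that.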

### Bellman Equation for MRPs

The value function can be decomposed into two parts:
* immediate reward $R_{t+1}$
* discounted value of successor state $\gamma v(S_{t+1})$

This is helpful in formulating RL problems as dynamic programming problems.
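
Spelled out, the decomposition follows from the definitions of $G_{t}$ and $v(s)$ above:

$$
\begin{aligned}
v(s) &= \mathbb E[G_{t} | S_{t} = s] \\
&= \mathbb E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_{t} = s] \\
&= \mathbb E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_{t} = s] \\
&= \mathbb E[R_{t+1} + \gamma G_{t+1} | S_{t} = s] \\
&= \mathbb E[R_{t+1} + \gamma v(S_{t+1}) | S_{t} = s]
\end{aligned}
$$

Writing the expectation out with $R_{s}$ and $P_{ss'}$ gives

$$v(s) = R_{s} + \gamma \sum_{s' \in S} P_{ss'} v(s')$$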

<img src="Figures/02-bellman-equation.png" style="width: 550px;"/>

<img src="Figures/02-value-function-computation.png" style="width: 550px;"/>

<img src="Figures/02-bellman-matrix.png" style="width: 550px;"/>

<img src="Figures/02-solving-bellman-matrix.png" style="width: 550px;"/>
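
In matrix form the Bellman equation reads $v = R + \gamma P v$, which for small MRPs can be solved directly as $v = (I - \gamma P)^{-1} R$. The sketch below instead applies the Bellman equation repeatedly until the values stop changing. This iterative scheme is chosen here for illustration because it needs only the standard library. The three-state MRP, its rewards and the helper names are hypothetical.

```python
def bellman_backup(v, P, R, gamma):
    """Apply v(s) <- R_s + gamma * sum_{s'} P_{ss'} v(s') to every state."""
    n = len(v)
    return [R[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]

def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the Bellman backup until the largest change drops below tol.

    Assumes gamma < 1 so that the iteration converges.
    """
    v = [0.0] * len(R)
    for _ in range(max_iters):
        new_v = bellman_backup(v, P, R, gamma)
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v

# Hypothetical MRP: state 2 is absorbing with zero reward.
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.0, 1.0],
]
R = [1.0, 2.0, 0.0]

v = evaluate_mrp(P, R, gamma=0.9)
print(v)
```

The result can be checked by substituting it back into $v = R + \gamma P v$: the two sides should agree up to the tolerance.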