# Markov Decision Processes (MDPs)

- This notebook will focus on the mathematical framework for modelling RL problems, using the Markov Decision Process

## MDP Basics

- Formally, an MDP has 5 components
    - **States**: The possible situations the agent can be in. A state contains all the contextual information the agent needs
    - **Actions**: All possible moves the agent can make
    - **Transitions**: How the environment responds to actions
    - **Rewards**: The feedback signal from the environment
    - **Discount factor**: How much future rewards are weighted relative to immediate ones

    \begin{aligned}
        &(S, A, P, R, \gamma) \\
        & \text{where} \\
        &\quad S = \text{set of states} \\
        &\quad A = \text{set of actions} \\
        &\quad P(S' | S, A) = \text{probability that taking action A in state S brings you to S'} \\
        &\quad R(S, A, S') = \text{reward received for transitioning from S to S' by taking action A} \\
        &\quad \gamma = \text{discount factor to weight future rewards} \\
    \end{aligned}
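As a rough illustration of the tuple above, here is a minimal sketch of a tabular MDP stored as NumPy arrays. The class name `TabularMDP`, the array layout `P[s, a, s_next]`, and the 2-state toy numbers are all assumptions made for this example, not part of the formal definition.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularMDP:
    """Hypothetical container for a finite MDP (S, A, P, R, gamma)."""

    P: np.ndarray  # shape (n_states, n_actions, n_states): P[s, a, s_next] = P(s_next | s, a)
    R: np.ndarray  # same shape as P: R[s, a, s_next] = reward for that transition
    gamma: float   # discount factor

    def __post_init__(self):
        n_states, _, n_next = self.P.shape
        assert n_states == n_next, "P must map states to states"
        assert self.R.shape == self.P.shape, "R must match the shape of P"
        # Each P[s, a, :] is a probability distribution over next states.
        assert np.allclose(self.P.sum(axis=2), 1.0)
        assert 0.0 <= self.gamma < 1.0

    @property
    def n_states(self) -> int:
        return self.P.shape[0]

    @property
    def n_actions(self) -> int:
        return self.P.shape[1]


# Illustrative toy problem: 2 states, 2 actions, reward 1 for landing in state 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1 under actions 0 and 1
])
R = np.zeros_like(P)
R[:, :, 1] = 1.0
toy_mdp = TabularMDP(P=P, R=R, gamma=0.9)
```

The sets $S$ and $A$ are implicit here: they are just the index ranges of the first two array axes.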

- All that's missing from this setup is how an agent will decide on the action to take given $S$, or the **policy**

\begin{aligned}
    \pi(a|s) &= \text{Probability of taking action } a \text{ in state } s
\end{aligned} 
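For a finite problem, one way to picture $\pi(a|s)$ is as a table with one row per state, where each row is a probability distribution over actions. The sketch below assumes that representation; `pi`, `sample_action`, and the numbers in the table are hypothetical.

```python
import numpy as np

# Hypothetical stochastic policy for 2 states and 2 actions: pi[s, a] = pi(a | s).
pi = np.array([
    [0.7, 0.3],  # in state 0: action 0 with prob 0.7, action 1 with prob 0.3
    [0.0, 1.0],  # in state 1: always action 1 (a deterministic row)
])
# Each row must sum to 1 to be a valid distribution over actions.
assert np.allclose(pi.sum(axis=1), 1.0)


def sample_action(pi, s, rng):
    """Draw an action index from the distribution pi(. | s)."""
    return int(rng.choice(pi.shape[1], p=pi[s]))


rng = np.random.default_rng(0)
action = sample_action(pi, 1, rng)  # state 1 puts all its mass on action 1
```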

- Thus, the goal of RL is to find a policy $\pi(a|s)$ that maximises the expected cumulative discounted future reward (the return) $G_t$, defined as:

\begin{aligned}
    G_t &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \lt 1
\end{aligned}
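For a finite list of observed rewards, the return can be computed directly from its definition. This is a minimal sketch; the function name `discounted_return` and the sample rewards are assumptions for illustration. It accumulates from the last reward backwards, using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ that follows from factoring $\gamma$ out of the sum.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on.
    """
    g = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
g = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```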

### Prove that if $\gamma < 1$, then $G_t$ converges

- Let's assume the rewards are bounded in magnitude by some constant $R_{\max}$. This is reasonable; no single step of a process should give an infinite reward

\begin{aligned}
    |r_t| \le R_{\max} \quad \forall t
\end{aligned}

- Then, since $\gamma \ge 0$, every discounted reward term is bounded as well

\begin{aligned}
    |\gamma^k r_{t+k+1}| \le \gamma^k R_{\max} \quad \forall t, \; k \ge 0
\end{aligned}

- Let's consider the partial discounted sum $S_n$ of the first $n + 1$ rewards, for an arbitrary $n$ (here $S_n$ denotes a partial sum, not a state)

\begin{aligned}
    S_n &= \sum_{k=0}^{n} \gamma^k r_{t+k+1}
\end{aligned}

- This must be bounded by

\begin{aligned}
    |S_n| &\le \sum_{k=0}^{n} |\gamma^k r_{t+k+1}| \\
    &\le \sum_{k=0}^{n} \gamma^k R_{\max} \\
    &= R_{\max} \sum_{k=0}^{n} \gamma^k
\end{aligned}

- But the last factor is just a finite geometric series in $\gamma$. Since $0 \le \gamma \lt 1$ (so $\gamma \ne 1$), it has the closed form

\begin{aligned}
    \sum_{k=0}^{n} \gamma^k &= \frac{1 - \gamma^{n+1}}{1 - \gamma}
\end{aligned}

- Therefore

\begin{aligned}
    |S_n| &\le R_{\max} \sum_{k=0}^{n} \gamma^k \\
    &= R_{\max} \frac{1 - \gamma^{n+1}}{1 - \gamma} \\
    &\le \frac{R_{\max}}{1 - \gamma} 
\end{aligned}

- The last inequality holds because $0 \lt 1 - \gamma^{n+1} \le 1$ when $0 \le \gamma \lt 1$

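- This bound does not depend on $n$. Applying the same argument to $\sum_{k=0}^{n} |\gamma^k r_{t+k+1}|$ shows that these partial sums are non-decreasing in $n$ and bounded above by $\frac{R_{\max}}{1 - \gamma}$, so they converge. The series is therefore absolutely convergent, which implies that $S_n$ itself converges as $n \to \infty$. Hence $G_t = \lim_{n \to \infty} S_n$ exists and $|G_t| \le \frac{R_{\max}}{1 - \gamma}$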
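The bound can also be checked numerically. The sketch below is only a sanity check under assumed values ($\gamma = 0.9$, $R_{\max} = 1$, and 500 rewards drawn uniformly at random from $[-R_{\max}, R_{\max}]$); it does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r_max = 0.9, 1.0  # assumed values for this check

# Random bounded rewards: |r| <= r_max for every step.
rewards = rng.uniform(-r_max, r_max, size=500)

# partial_sums[n] = sum_{k=0}^{n} gamma**k * rewards[k]
discounts = gamma ** np.arange(rewards.size)
partial_sums = np.cumsum(discounts * rewards)

bound = r_max / (1.0 - gamma)
# Every partial sum respects |S_n| <= R_max / (1 - gamma).
within_bound = bool(np.all(np.abs(partial_sums) <= bound))

# Late terms barely move the sum, which is what convergence looks like numerically.
tail_gap = abs(partial_sums[-1] - partial_sums[-2])
```

Increasing `gamma` towards 1 makes `bound` grow without limit, which matches the requirement $\gamma \lt 1$ in the proof.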

## What is the meaning of Markov?

- Why do we consider this a Markov process?

- The core assumption here is that the next state $S_{t+1}$ depends ONLY on the current state $S_t$ and the current action $A_t$, not on the earlier history; or formally

\begin{aligned}
    P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \dots) &= P(S_{t+1} | S_t, A_t)
\end{aligned}

- The current state is a sufficient statistic of the past for decision making. Under this assumption, Dynamic Programming and RL solution algorithms can work
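One practical consequence of the Markov property is that a simulator can advance using nothing but the current state and action. The sketch below assumes a small tabular problem; `step`, `rollout`, the arrays `P`, `R`, `pi`, and all the numbers are hypothetical. Note that neither function keeps or reads any history.

```python
import numpy as np

# Hypothetical toy problem: 2 states, 2 actions (all numbers are illustrative).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # P[s, a, s_next] = P(s_next | s, a)
    [[0.0, 1.0], [0.5, 0.5]],
])
R = np.zeros_like(P)
R[:, :, 1] = 1.0               # reward 1 for landing in state 1
pi = np.full((2, 2), 0.5)      # uniform random policy: pi[s, a] = pi(a | s)


def step(s, a, rng):
    """Sample (next state, reward) from the current state and action only."""
    s_next = int(rng.choice(P.shape[2], p=P[s, a]))
    return s_next, float(R[s, a, s_next])


def rollout(s0, n_steps, rng):
    """Follow pi for n_steps; the loop carries forward only the current state."""
    s, rewards = s0, []
    for _ in range(n_steps):
        a = int(rng.choice(pi.shape[1], p=pi[s]))
        s, r = step(s, a, rng)
        rewards.append(r)
    return rewards


rewards = rollout(s0=0, n_steps=10, rng=np.random.default_rng(0))
```

If the transition depended on earlier states, `step` would need the whole trajectory as an argument, and the per-state tables `P` and `pi` would no longer be enough.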