# Markov Decision Processes (MDPs)
A Markov Decision Process extends the basic Markov chain framework by introducing actions and rewards, enabling agents to make optimal decisions in stochastic environments. While standard Markov chains evolve passively according to fixed transition probabilities, MDPs allow an agent to choose actions that influence state transitions and generate rewards, forming the foundation for sequential decision-making under uncertainty.

> __Learning Objectives:__
> 
> By the end of this module, you will be able to:
>
> * __MDP Components:__ Understand how states, actions, rewards, transition probabilities, and discount factors work together to define a sequential decision problem. Learn how policies map states to actions and how the discount factor balances immediate versus long-term rewards.
> * __Value Iteration Algorithm:__ Apply the Bellman backup operation to compute optimal value functions and extract optimal policies. Understand how iterative updates converge to the maximum expected cumulative discounted reward from each state.
> * __Random Rollout Sampling:__ Estimate state values through Monte Carlo simulation by generating random trajectories and averaging cumulative discounted rewards. Recognize how model-free approaches enable learning from direct experience without requiring transition probabilities.

Let's get started!
___

## MDP Components
Formally, an MDP is defined by the tuple $\left(\mathcal{S}, \mathcal{A}, R\left(s, a\right), T\left(s^{\prime}\,|\,s,a\right), \gamma\right)$:

* **State space** $\mathcal{S}$: The set of all possible states $s$ the system can occupy.
* **Action space** $\mathcal{A}$: The set of all possible actions $a$ available to the agent. For a given state $s$, the available actions are denoted as $\mathcal{A}_{s} \subseteq \mathcal{A}$.
* **Reward function** $R\left(s, a\right)$: The immediate reward received when taking action $a$ in state $s$.
* **Transition model** $T\left(s^{\prime}\,|\,s,a\right) = P(s_{t+1} = s^{\prime}\,|\,s_{t}=s,a_{t} = a)$: The probability that action $a$ in state $s$ at time $t$ results in state $s^{\prime}$ at time $t+1$.
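* **Discount factor** $\gamma \in [0,1]$: A parameter that weights future rewards relative to immediate rewards. A reward received $k$ steps in the future is scaled by $\gamma^{k}$, so $\gamma = 0$ considers only the immediate reward, while $\gamma \to 1$ weights long-term returns almost as heavily as immediate ones. For infinite-horizon problems, $\gamma < 1$ keeps the cumulative discounted reward finite when rewards are bounded.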

The agent's goal is to find a policy $\pi: \mathcal{S} \to \mathcal{A}$ that maps states to actions ($\pi(s) = a$) while maximizing the expected cumulative discounted reward over time. 
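
To make the tuple concrete, here is a minimal sketch of how an MDP and a policy might be stored in Python. The `MDP` class, its field names, and the two-state example with its numbers are illustrative assumptions, not part of the formal definition.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Hypothetical container for the tuple (S, A, R, T, gamma)."""
    states: list       # state space S
    actions: dict      # actions[s] -> list of actions available in s (A_s)
    reward: dict       # reward[(s, a)] -> immediate reward R(s, a)
    transition: dict   # transition[(s, a)] -> {s_next: probability}
    gamma: float       # discount factor

# Illustrative two-state MDP (made-up numbers).
toy = MDP(
    states=["a", "b"],
    actions={"a": ["stay", "go"], "b": ["stay"]},
    reward={("a", "stay"): 0.0, ("a", "go"): 1.0, ("b", "stay"): 2.0},
    transition={
        ("a", "stay"): {"a": 1.0},
        ("a", "go"): {"a": 0.2, "b": 0.8},
        ("b", "stay"): {"b": 1.0},
    },
    gamma=0.9,
)

# A deterministic policy is a mapping from each state to one available action.
policy = {"a": "go", "b": "stay"}
```

Storing each transition row as a dictionary of next states keeps sparse models compact, and each row should sum to one.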

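This module covers two approaches: __Value Iteration__, which computes the optimal value function from the model, and the __Random Rollout__ algorithm, which estimates state values by sampling trajectories.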

### Example: Grid World Navigation

Consider a robot navigating a $3\times 3$ grid world where the goal is to reach a target location while avoiding obstacles and minimizing travel time. This concrete example illustrates how the abstract MDP components work together:

* **State space** $\mathcal{S}$: The set of 9 grid positions, represented as $(i,j)$ coordinates where $i,j \in \{1,2,3\}$.
* **Action space** $\mathcal{A}$: The set $\{\text{up}, \text{down}, \text{left}, \text{right}\}$. At boundary positions, actions that would move the agent outside the grid are unavailable, so $\mathcal{A}_s \subsetneq \mathcal{A}$ for edge and corner states.
* **Reward function** $R(s,a)$: The agent receives $-1$ for each step (encouraging shorter paths), $+10$ for reaching the goal at position $(3,3)$, and $-5$ for attempting to move into an obstacle at position $(2,2)$.
* **Transition model** $T(s'|s,a)$: Actions are stochastic due to slippery terrain. The intended direction succeeds with probability $0.8$, while the two perpendicular directions each occur with probability $0.1$. For example, choosing "up" results in moving up with probability $0.8$, and moving left or right with probability $0.1$ each.
* **Discount factor** $\gamma = 0.9$: Rewards lose 10% of their weight with each additional time step, so the $+10$ goal reward stays large relative to the $-1$ step costs paid along the way, encouraging the agent to pursue the goal despite accumulated penalties.

In this setting, an optimal policy $\pi^{*}$ would map each grid position to the action that maximizes expected cumulative discounted reward, accounting for both the stochastic transitions and the spatial layout of rewards and obstacles.
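
The slippery transition model can be sketched in a few lines of Python. Several details are assumptions that the description above leaves open: coordinates are read as (row, column), "up" decreases the row index, and any move that would leave the grid or enter the obstacle leaves the robot in place. The names `step` and `transition` are hypothetical.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERPENDICULAR = {
    "up": ("left", "right"), "down": ("left", "right"),
    "left": ("up", "down"), "right": ("up", "down"),
}
OBSTACLE = (2, 2)

def step(state, direction):
    """Deterministic result of moving one cell in `direction`.

    Assumption: blocked moves (off-grid or into the obstacle) keep the state.
    """
    i, j = state
    di, dj = MOVES[direction]
    nxt = (i + di, j + dj)
    inside = 1 <= nxt[0] <= 3 and 1 <= nxt[1] <= 3
    return nxt if inside and nxt != OBSTACLE else state

def transition(state, action):
    """Return {next_state: probability}: 0.8 intended, 0.1 per side slip."""
    probs = {}
    side_a, side_b = PERPENDICULAR[action]
    for direction, p in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((1, 1), "right"))
```

From the corner $(1,1)$, choosing "right" reaches $(1,2)$ with probability $0.8$, slips down to $(2,1)$ with probability $0.1$, and under the stay-in-place assumption remains at $(1,1)$ with probability $0.1$.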
___

## Value Iteration
Value iteration is a dynamic programming algorithm that computes the optimal value function $U^{*}(s)$ by iteratively applying the Bellman backup operation:

$$
\begin{equation*}
U_{k+1}(s) = \max_{a\in\mathcal{A}_s}\left(R(s,a) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T\left(s^{\prime}\,|\,s,a\right)\cdot{U}_{k}(s^{\prime})\right)
\end{equation*}
$$

For $\gamma < 1$, the Bellman backup is a contraction in the max norm, so $U_k(s) \to U^{*}(s)$ as $k \to \infty$ from any initial $U_0$. The optimal value function represents the maximum expected cumulative discounted reward achievable from each state under the best possible policy.
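
A single backup is easy to trace by hand. The sketch below applies it twice to a made-up two-state MDP; the states, rewards, probabilities, and the helper names `q_value` and `bellman_backup` are illustrative assumptions.

```python
# Illustrative two-state MDP (made-up numbers, not from the text).
actions = {"a": ["stay", "go"], "b": ["stay"]}
R = {("a", "stay"): 0.0, ("a", "go"): 1.0, ("b", "stay"): 2.0}
T = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"a": 0.2, "b": 0.8},
    ("b", "stay"): {"b": 1.0},
}
gamma = 0.9

def q_value(U, s, a):
    """R(s, a) + gamma * sum over s' of T(s'|s, a) * U(s')."""
    return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())

def bellman_backup(U, s):
    """Maximize the one-step lookahead over the actions available in s."""
    return max(q_value(U, s, a) for a in actions[s])

U0 = {"a": 0.0, "b": 0.0}
U1 = {s: bellman_backup(U0, s) for s in U0}  # only immediate rewards count
U2 = {s: bellman_backup(U1, s) for s in U1}  # one step of lookahead added
```

Starting from $U_0 = 0$, the first sweep picks up only the best immediate reward in each state. In the second sweep, state `a` evaluates "go" as $1 + 0.9\,(0.2 \cdot 1 + 0.8 \cdot 2) = 2.62$, which beats "stay" at $0.9$.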

#### Algorithm
The value iteration algorithm computes the optimal value function $U^{*}(s)$ and policy $\pi^{*}(s)$.

__Initialize__: Given an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s,a)$, transition model $T\left(s^{\prime}\,|\,s,a\right)$, discount factor $\gamma$, tolerance parameter $\epsilon$, and maximum number of iterations $K_{\max}$ (written this way to avoid a clash with the transition model $T$), initialize the iteration counter $k\gets 0$, the initial value function $U_{0}(s) \gets 0$ for all $s \in \mathcal{S}$, and $\texttt{converged}\gets\texttt{false}$.

While $\texttt{converged}$ is $\texttt{false}$ __do__:
1. For each state $s \in \mathcal{S}$, compute the updated value:
   $$U_{k+1}(s) \gets \max_{a\in\mathcal{A}_s}\left(R(s,a) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T\left(s^{\prime}\,|\,s,a\right)\cdot{U}_{k}(s^{\prime})\right)$$
2. Check for convergence:
    - If $\max_{s\in\mathcal{S}} \left|U_{k+1}(s) - U_{k}(s)\right| \leq \epsilon$, then set $\texttt{converged}\gets\texttt{true}$ and $U^{*}\gets{U}_{k+1}$.
    - Otherwise, update $k\gets{k+1}$, so that $U_{k}$ now denotes the values just computed.
3. Enforce the iteration limit:
    - If $\texttt{converged}$ is still $\texttt{false}$ and $k\geq{K_{\max}}$, then set $\texttt{converged}\gets\texttt{true}$ and $U^{*}\gets{U}_{k}$.

__Extract Policy__: For each state $s \in \mathcal{S}$, compute:
$$\pi^{*}(s) \gets \arg\max_{a\in\mathcal{A}_s}\left(R(s,a) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T\left(s^{\prime}\,|\,s,a\right)\cdot{U^{*}}(s^{\prime})\right)$$
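
The steps above can be sketched compactly in Python. This is a minimal dictionary-based version, not a tuned implementation: the function name `value_iteration`, the argument layout, and the two-state example MDP are assumptions for illustration, and `max_iter` plays the role of the iteration cap.

```python
def value_iteration(states, actions, R, T, gamma, eps=1e-6, max_iter=10_000):
    """Return (U, policy) after repeated Bellman backups."""
    def lookahead(U, s, a):
        # R(s, a) + gamma * sum over s' of T(s'|s, a) * U(s')
        return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())

    U = {s: 0.0 for s in states}                      # U_0(s) = 0
    for _ in range(max_iter):
        U_new = {s: max(lookahead(U, s, a) for a in actions[s]) for s in states}
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta <= eps:                              # convergence check
            break
    # Policy extraction: argmax of the same one-step lookahead.
    policy = {s: max(actions[s], key=lambda a: lookahead(U, s, a)) for s in states}
    return U, policy

# Illustrative two-state MDP (made-up numbers).
states = ["a", "b"]
actions = {"a": ["stay", "go"], "b": ["stay"]}
R = {("a", "stay"): 0.0, ("a", "go"): 1.0, ("b", "stay"): 2.0}
T = {
    ("a", "stay"): {"a": 1.0},
    ("a", "go"): {"a": 0.2, "b": 0.8},
    ("b", "stay"): {"b": 1.0},
}
U_star, pi_star = value_iteration(states, actions, R, T, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: state `b` satisfies $U = 2 + 0.9\,U$, giving $U^{*}(b) = 20$, and state `a` prefers "go" with $U^{*}(a) = 15.4 / 0.82 \approx 18.78$.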

#### Computational Complexity

Value iteration requires $O(|\mathcal{S}|^2 \cdot |\mathcal{A}|)$ operations per iteration. For each of the $|\mathcal{S}|$ states, we must evaluate $|\mathcal{A}|$ actions, and each action evaluation requires summing over $|\mathcal{S}|$ possible next states. The number of iterations until convergence depends on the discount factor $\gamma$ and the tolerance $\epsilon$: each backup shrinks the maximum error $\max_{s}\left|U_k(s) - U^{*}(s)\right|$ by at least a factor of $\gamma$, so the error decays geometrically and more iterations are needed as $\gamma \to 1$ or $\epsilon \to 0$.
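
The iteration count can be bounded explicitly with a standard contraction argument. The following is a sketch under stated assumptions: $\gamma < 1$, $U_0 = 0$, and rewards bounded as $|R(s,a)| \leq R_{\max}$, where $R_{\max}$ is a symbol introduced here for illustration. Since $\|U^{*}\|_{\infty} \leq R_{\max}/(1-\gamma)$,

$$
\|U_{k} - U^{*}\|_{\infty} \;\leq\; \gamma^{k}\,\|U_{0} - U^{*}\|_{\infty} \;\leq\; \frac{\gamma^{k}\,R_{\max}}{1-\gamma}
$$

so the distance to $U^{*}$ is at most $\epsilon$ once

$$
k \;\geq\; \frac{\ln\left(\dfrac{R_{\max}}{\epsilon\,(1-\gamma)}\right)}{\ln\left(1/\gamma\right)}
$$

For instance, taking $\gamma = 0.9$, $R_{\max} = 10$, and $\epsilon = 10^{-3}$ gives $\ln(10^{5})/\ln(1/0.9) \approx 109.3$, so about 110 sweeps suffice. Note that this bounds the distance to $U^{*}$, which is a slightly different criterion from the change between successive iterates used as the stopping test.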

This complexity makes value iteration tractable for problems with thousands of states but challenging for continuous or very large discrete state spaces. In such cases, function approximation or sampling-based methods become necessary.

___

## Random Rollout Algorithm

The random rollout algorithm estimates state values by simulating random trajectories from a starting state and computing the cumulative discounted reward obtained along each path. Unlike value iteration, which requires knowledge of the transition model $T(s^{\prime}\,|\,s,a)$ and computes values for all states simultaneously, random rollout is a model-free sampling approach that explores the state space through direct interaction with the environment.

The algorithm generates trajectories by selecting random actions at each state, transitioning according to the environment dynamics, and accumulating discounted rewards. By averaging the returns from multiple rollouts starting from a given state $s$, we obtain an empirical estimate of the value function $\hat{U}(s)$.
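
As a small worked illustration of the averaging step, the sketch below computes the discounted return of three reward sequences and averages them. The sequences are made up for illustration rather than sampled from a real environment, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical reward sequences from three rollouts that start in the same state.
rollouts = [
    [-1, -1, 10],       # short path to the goal
    [-1, -5, -1, 10],   # bumps into the obstacle once
    [-1, 10],           # very short path
]
returns = [discounted_return(r, gamma=0.9) for r in rollouts]
estimate = sum(returns) / len(returns)   # empirical value estimate
```

With $\gamma = 0.9$, the first sequence yields $-1 - 0.9 + 8.1 = 6.2$, the second $0.98$, and the third $8.0$, for an average of $5.06$.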

#### Algorithm

The random rollout algorithm estimates the value function $\hat{U}(s)$ through Monte Carlo sampling:

__Initialize__: Given an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s,a)$, discount factor $\gamma$, maximum depth $d$, and number of rollouts $N$, initialize the value estimates $\hat{U}(s) \gets 0$ and visit counts $n(s) \gets 0$ for all $s \in \mathcal{S}$.

For $i = 1$ to $N$ __do__:

1. Initialize the rollout:
    - Set starting state $s_{0}$ (chosen uniformly or from a start distribution).
    - Set depth counter $t\gets 0$, cumulative return $G\gets 0$, and visited states $\mathcal{V}\gets\{s_{0}\}$.
    - Set current state $s \gets s_{0}$ and $\texttt{terminated}\gets\texttt{false}$.

2. While $\texttt{terminated}$ is $\texttt{false}$ __do__:
    - Select action $a$ uniformly at random from $\mathcal{A}_{s}$.
    - Observe reward $r = R(s, a)$ and update cumulative return: $G \gets G + \gamma^{t} \cdot r$.
    - Execute action $a$ and observe next state $s^{\prime}$.
    - Update depth: $t \gets t + 1$.
    - Check termination conditions:
        - If $t \geq d$ (maximum depth reached), set $\texttt{terminated}\gets\texttt{true}$.
        - If $s^{\prime} \in \mathcal{V}$ (cycle detected), set $\texttt{terminated}\gets\texttt{true}$.
        - If $s^{\prime}$ is an absorbing state (terminal state), set $\texttt{terminated}\gets\texttt{true}$.
        - Otherwise, update $\mathcal{V} \gets \mathcal{V} \cup \{s^{\prime}\}$ and $s\gets s^{\prime}$.

3. Update value estimates:
    - Increment visit count: $n(s_{0}) \gets n(s_{0}) + 1$.
    - Update value estimate using incremental mean:
       $$\hat{U}(s_{0}) \gets \hat{U}(s_{0}) + \frac{1}{n(s_{0})}\left(G - \hat{U}(s_{0})\right)$$

__Output__: The estimated value function $\hat{U}(s)$ for every state that served as a starting state $s_{0}$ in at least one of the $N$ rollouts.
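
The procedure above might be sketched in Python as follows. The function name `random_rollout`, the `sample_next(s, a, rng)` environment interface, and the three-state chain used to exercise it are all assumptions for illustration. The environment is treated as a black box that returns a next state, which is what makes the method model-free.

```python
import random

def random_rollout(start_states, actions, R, sample_next, gamma, d, N,
                   terminal=frozenset(), seed=0):
    """Estimate U_hat(s0) for each start state by averaging rollout returns."""
    rng = random.Random(seed)
    U_hat = {s: 0.0 for s in start_states}
    n = {s: 0 for s in start_states}
    for _ in range(N):
        s0 = rng.choice(start_states)             # uniform start state
        s, t, G, visited = s0, 0, 0.0, {s0}
        terminated = False
        while not terminated:
            a = rng.choice(actions[s])            # uniform random action
            G += gamma**t * R[(s, a)]             # accumulate discounted reward
            s_next = sample_next(s, a, rng)       # query the environment
            t += 1
            # Stop on depth limit, revisited state, or absorbing state.
            if t >= d or s_next in visited or s_next in terminal:
                terminated = True
            else:
                visited.add(s_next)
                s = s_next
        n[s0] += 1
        U_hat[s0] += (G - U_hat[s0]) / n[s0]      # incremental mean
    return U_hat, n

# Hypothetical deterministic chain 0 -> 1 -> 2, with state 2 absorbing.
actions = {0: ["go"], 1: ["go"]}
R = {(0, "go"): 1.0, (1, "go"): 1.0}

def sample_next(s, a, rng):
    return s + 1   # deterministic here; a stochastic environment would use rng

U_hat, n = random_rollout([0, 1], actions, R, sample_next,
                          gamma=0.9, d=10, N=200, terminal={2})
```

Because the chain is deterministic, every rollout from state 0 returns $1 + 0.9 = 1.9$ and every rollout from state 1 returns $1$, which makes the incremental mean easy to verify.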

#### Convergence Properties

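Because actions are selected uniformly at random, $\hat{U}(s)$ estimates the value of the uniform random policy, not the optimal value function $U^{*}(s)$. As $N \to \infty$, the sample mean of the rollout returns converges (by the law of large numbers) to the expected cumulative discounted reward of that policy under the depth limit and termination rules above, without requiring knowledge of the transition model.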

This model-free property forms the foundation of Monte Carlo methods in reinforcement learning, enabling agents to learn from direct experience in stochastic environments.

#### Computational Complexity

Random rollout requires $O(N \cdot d \cdot |\mathcal{A}|)$ operations, where $N$ is the number of rollouts, $d$ is the maximum depth per rollout, and $|\mathcal{A}|$ represents the cost of selecting and executing actions. Unlike value iteration, the complexity does not depend on the state space size $|\mathcal{S}|$, making random rollout particularly attractive for large or continuous state spaces.

The variance of the value estimates decreases as $O(1/N)$, so the standard error decreases only as $O(1/\sqrt{N})$: halving the error requires four times as many rollouts. However, the algorithm can provide useful estimates with relatively few rollouts and scales naturally to high-dimensional problems where model-based approaches become intractable.
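
The slow improvement with $N$ can be checked numerically. In the sketch below, uniform random numbers stand in for rollout returns; this is an assumption chosen only to keep the example self-contained, and `standard_error` is a hypothetical helper.

```python
import math
import random
import statistics

def standard_error(samples):
    """Sample standard deviation divided by sqrt(N)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

rng = random.Random(0)
# Stand-in for rollout returns: made-up uniform noise, not a real MDP.
small = [rng.uniform(-1.0, 1.0) for _ in range(100)]
large = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

# 100x more samples should shrink the standard error by about sqrt(100) = 10.
ratio = standard_error(small) / standard_error(large)
print(f"standard error shrinks by a factor of about {ratio:.1f}")
```

The printed ratio fluctuates with the random seed but stays near 10, illustrating that a hundredfold increase in rollouts buys only about one extra digit of accuracy.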
___

## Summary

In this module, we explored how Markov Decision Processes extend Markov chains by adding actions and rewards to enable optimal sequential decision-making:

> __Key Takeaways:__
>
> * **MDPs add agency to Markov chains**: By introducing actions that influence state transitions and generate rewards, MDPs transform passive stochastic processes into active decision-making frameworks. The discount factor controls how much we value future rewards relative to immediate ones, while policies define how agents choose actions based on current states.
> * **Value iteration finds optimal policies through dynamic programming**: The Bellman backup operation iteratively computes the maximum expected cumulative discounted reward for each state by considering all possible actions and their consequences. Convergence guarantees that repeated updates produce the optimal value function and corresponding policy.
> * **Random rollout estimates values through sampling**: Monte Carlo simulation generates trajectories by taking random actions and accumulating discounted rewards, providing model-free value estimates that converge as the number of rollouts increases. This sampling approach enables learning from experience without requiring knowledge of transition probabilities.

These foundational concepts enable practical applications in robotics, game playing, resource allocation, and autonomous systems where agents must make sequential decisions under uncertainty.
___