# Markov Decision Processes (MDP)

In this notebook, we will cover the fundamental concepts of Markov Decision Processes (MDPs) and Reinforcement Learning, including Agent, States, Actions, Rewards, Policy, types of RL, model types, transition probabilities, discount factor, and the MDP framework.

## Agent, States, Actions, Rewards, Policy

### Agent
An agent is an entity that makes decisions by interacting with an environment to achieve a goal. It observes the state of the environment, takes actions, and receives rewards as feedback.

### States
A state represents the current situation or configuration of the environment. It is a snapshot of all relevant information needed by the agent to make a decision.

### Actions
Actions are the choices available to the agent at each state. The agent selects actions based on its policy to influence the environment.

### Rewards
Rewards are feedback signals received from the environment in response to the agent's actions. The goal of the agent is to maximize the cumulative reward over time.

### Policy
A policy defines the strategy used by the agent to decide which actions to take in each state. It can be deterministic (specific action for each state) or stochastic (probabilities assigned to actions).
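
As a minimal sketch of the two kinds of policy, a deterministic policy can be stored as a state-to-action mapping and a stochastic policy as a state-to-distribution mapping. The state and action labels and the helper `select_action` are hypothetical names chosen only for illustration.

```python
import random

# Hypothetical labels used only for illustration.
deterministic_policy = {"S0": "A0", "S1": "A1"}
stochastic_policy = {
    "S0": {"A0": 0.5, "A1": 0.5},
    "S1": {"A0": 0.9, "A1": 0.1},
}

def select_action(policy, state, rng=random):
    """Return an action for `state` under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):
        # Stochastic: sample an action according to its probability.
        actions = list(choice.keys())
        weights = list(choice.values())
        return rng.choices(actions, weights=weights, k=1)[0]
    # Deterministic: the mapping already names the action.
    return choice

print(select_action(deterministic_policy, "S1"))
print(select_action(stochastic_policy, "S0"))
```

The deterministic call always returns the same action for a given state, while the stochastic call can return a different action each time it is invoked.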

## Types of Reinforcement Learning

### Episodic Tasks
In episodic reinforcement learning, the task is divided into episodes, each with a clear starting point and a terminal state. The environment is reset at the end of each episode, while what the agent has learned carries over to the next one.

### Continuing Tasks
In continuing reinforcement learning, the task has no terminal state, and the agent interacts with the environment indefinitely.

Whether a task is episodic or continuing is independent of whether the learning method is model-based or model-free. Both kinds of method, described in the next section, apply to both kinds of task.

## Types of Models

### Model-Free
Model-free methods learn the value of actions and states directly from experience without modeling the environment's dynamics. Examples include Q-learning and SARSA.

### Model-Based
Model-based methods build a model of the environment's dynamics and use this model to plan actions. They can simulate the environment to predict future states and rewards.

## Transition Probabilities

Transition probabilities represent the likelihood of transitioning from one state to another given a specific action. They are denoted as $P(s'|s, a)$, where $s$ is the current state, $a$ is the action taken, and $s'$ is the next state.
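
The sketch below shows one possible way to store $P(s'|s, a)$ and to check that each conditional distribution is valid. The weather-style states, the single `wait` action, and both helper functions are assumptions made for this example only.

```python
import numpy as np

# Hypothetical two-state, one-action transition model P(s' | s, a).
P = {
    ("sunny", "wait"): {"sunny": 0.8, "rainy": 0.2},
    ("rainy", "wait"): {"sunny": 0.4, "rainy": 0.6},
}

def is_valid_transition_model(P, tol=1e-9):
    """Check that every P(. | s, a) is non-negative and sums to 1."""
    return all(
        min(dist.values()) >= 0 and abs(sum(dist.values()) - 1.0) <= tol
        for dist in P.values()
    )

def sample_next_state(P, state, action, rng):
    """Draw s' from P(. | state, action)."""
    dist = P[(state, action)]
    return str(rng.choice(list(dist.keys()), p=list(dist.values())))

rng = np.random.default_rng(0)  # arbitrary seed
print(is_valid_transition_model(P))
print(sample_next_state(P, "sunny", "wait", rng))
```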

## Discount Factor

The discount factor (gamma, $\gamma$) is used to weigh future rewards compared to immediate rewards. It ranges from 0 to 1. A discount factor close to 0 prioritizes immediate rewards, while a factor close to 1 values future rewards more.
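
To make the effect of $\gamma$ concrete, the following sketch lists the weight $\gamma^k$ applied to a reward received $k$ steps in the future. The helper name `discount_weights` is hypothetical.

```python
def discount_weights(gamma, n_steps):
    """Weights gamma**k applied to rewards 0, 1, ..., n_steps - 1 steps ahead."""
    return [gamma ** k for k in range(n_steps)]

# With a small gamma the weights vanish almost immediately;
# with a gamma near 1 distant rewards still count for a lot.
print(discount_weights(0.1, 4))
print(discount_weights(0.9, 4))
```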

## MDP Framework

The MDP framework provides a mathematical model for decision-making problems where outcomes are partly random and partly under the control of the agent. An MDP is defined by:

- A set of states ($S$)
- A set of actions ($A$)
- A transition model ($P$): $P(s'|s, a)$
- A reward function ($R$): $R(s, a, s')$
- A discount factor ($\gamma$)


## Markov Decision Processes

### Markov Decision Processes Formally Describe an Environment for Reinforcement Learning

- The environment is fully observable.
- The current state completely characterizes the process.
- Almost all RL problems can be formalized as an MDP.
- Partially observable problems can also be converted to MDPs.

### Markov Property

The future is independent of the past given the present.
- All necessary information from the past is encapsulated in the present state, so there is no need to reconsider the past.

A state $S_t$ is Markov if and only if:

---


>$$P(S_{t+1} | S_t) = P(S_{t+1} | S_1, \ldots, S_t)$$


---
The state captures all relevant information from the history, so once the state is known, the history can be discarded.

### State Transition Matrix

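The state transition matrix $P$ collects the transition probabilities between every pair of states, so each of its rows sums to 1. For a Markov state $s$ and a successor state $s'$, the entry $P_{ss'}$ is defined as: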

---


>$$P_{ss'} = P(S_{t+1} = s' | S_t = s)$$


---
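
A small numerical sketch of a transition matrix follows. The three-state chain is hypothetical; the code checks that each row is a probability distribution and uses the standard fact that the state distribution after $n$ steps is the start distribution multiplied by $P^n$.

```python
import numpy as np

# Hypothetical 3-state chain; entry [i, j] is P(S_{t+1} = j | S_t = i).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
])

# Each row is a probability distribution over successor states.
rows_are_distributions = bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0))
print(rows_are_distributions)

# Start in state 0 with certainty and look two steps ahead.
start = np.array([1.0, 0.0, 0.0])
after_two_steps = start @ np.linalg.matrix_power(P, 2)
print(after_two_steps)
```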

### Markov Process

A stochastic process is one whose outcomes involve randomness. A Markov process (or Markov chain) is a stochastic process over the states $S$ of the environment: a sequence of random states $S_1, S_2, \ldots$ for which the Markov property holds.

A Markov process is defined by the tuple $(S, P)$, where:
- $S$ is a finite set of states.
- $P$ is the state transition probability matrix.
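
Given a tuple $(S, P)$, a sample sequence can be generated by repeatedly drawing the next state from the row of $P$ for the current state. This is a sketch with a hypothetical three-state chain (states are indexed 0 to 2) and a hypothetical helper `sample_chain`.

```python
import numpy as np

def sample_chain(P, start, n_steps, rng):
    """Sample a state sequence; each step depends only on the current state."""
    states = [start]
    for _ in range(n_steps):
        current = states[-1]
        states.append(int(rng.choice(len(P), p=P[current])))
    return states

# Hypothetical chain in which state 2 is absorbing.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
])

rng = np.random.default_rng(42)  # arbitrary seed
print(sample_chain(P, start=0, n_steps=10, rng=rng))
```

Only `states[-1]` is consulted when drawing the next state, which is the Markov property expressed in code.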

### Markov Reward Process

A Markov reward process (MRP) is a Markov chain with values. Adding a reward function and a discount factor to the Markov process gives the tuple $(S, P, R, \gamma)$:
- $S$ is a finite set of states.
- $P$ is the state transition probability matrix.
- $R$ is a reward function, $R_s = \mathbb{E}[R_{t+1} | S_t = s]$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

### Return

The return $G_t$ is the total discounted reward from time $t$:

---


>$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

- $\gamma \in [0, 1]$
- $\gamma$ close to 0 prioritizes immediate rewards.
- $\gamma$ close to 1 prioritizes future rewards.


---
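
The return can be computed for a finite reward sequence as sketched below. The function name is hypothetical, and the sketch assumes that all rewards after the end of the list are zero.

```python
def discounted_return(rewards, gamma):
    """G_t for rewards [R_{t+1}, R_{t+2}, ...], assuming zero reward afterwards."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```

With `gamma=0` only the first reward counts, and with `gamma=1` the return is the plain sum of the rewards.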

### Value Function

The value function $V(s)$ gives the long-term value of state $s$.
The state value function $V(s)$ of a Markov reward process (MRP) is the expected return starting from state $s$:

---


>$$V(s) = \mathbb{E}[G_t | S_t = s]$$


---
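
Because $V(s)$ is an expectation, it can be approximated by averaging sampled returns. The sketch below does this for a hypothetical three-state MRP; the matrices, the truncation horizon, the episode count, and the function names are all assumptions for illustration.

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is absorbing with zero reward.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 2.0, 0.0])  # R_s: expected immediate reward in state s
gamma = 0.9

def sample_return(start, rng, horizon=100):
    """One sampled return from `start`, truncated after `horizon` steps."""
    state, g, discount = start, 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[state]
        discount *= gamma
        state = rng.choice(len(P), p=P[state])
    return g

def mc_value(start, n_episodes=500, seed=0):
    """Monte Carlo estimate of V(start): the mean of sampled returns."""
    rng = np.random.default_rng(seed)
    return float(np.mean([sample_return(start, rng) for _ in range(n_episodes)]))

print(mc_value(0))
```

The estimate is noisy and changes with the seed and the number of episodes; more episodes give a tighter estimate.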

### Bellman Equation

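The value function can be decomposed into two parts:
- Immediate reward: $R_{t+1}$
- Discounted value of the next state: $\gamma V(S_{t+1})$

Since $G_t = R_{t+1} + \gamma G_{t+1}$, taking the expectation gives:

>$$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$$

Expanding the expectation over successor states introduces the state transition matrix $P_{ss'}$: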

---


>$$V(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s')$$

### Matrix Form

The Bellman equation can also be written in matrix form:
>$$V = R + \gamma P V$$

Because the Bellman equation is linear in $V$, it can be solved directly, where $I$ is the identity matrix:
>$$V = (I - \gamma P)^{-1} R$$

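### Solution Techniques

Solving the linear system directly costs $O(n^3)$ for $n$ states, so it is practical only for small MRPs. For larger problems, iterative methods are used:

- Dynamic Programming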
- Monte Carlo Methods
- Temporal Difference Methods
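
For a small MRP, the direct matrix solution and a dynamic-programming style iteration can be compared side by side. This is a sketch on a hypothetical three-state MRP; `iterative_evaluation` is an illustrative name, and the tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is absorbing with zero reward.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Direct solution: solve (I - gamma * P) V = R without forming the inverse.
V_direct = np.linalg.solve(np.eye(len(P)) - gamma * P, R)

def iterative_evaluation(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Apply the Bellman backup V <- R + gamma * P V until it stops changing."""
    V = np.zeros(len(R))
    for _ in range(max_iters):
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

V_iterative = iterative_evaluation(P, R, gamma)
print(V_direct)     # approximately [6.4, 5.6, 0.0]
print(V_iterative)  # agrees with the direct solution up to the tolerance
```

The direct solve is exact up to floating-point error but scales cubically with the number of states, whereas each sweep of the iteration costs only one matrix-vector product.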

### Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Introducing decisions into the Markov reward process forms the tuple $\langle S, A, P, R, \gamma \rangle$:
- $S$ is a finite set of states.
- $A$ is a finite set of actions.
- $P$ is the state transition probability matrix, $P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)$.
- $R$ is a reward function, $R^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
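
One possible way to hold the five components of the tuple in code is sketched below. The `MDP` container, the array layout, the toy two-state problem, and the `is_valid_mdp` check are all assumptions made for illustration, not a standard API.

```python
from typing import NamedTuple

import numpy as np

class MDP(NamedTuple):
    states: list
    actions: list
    P: np.ndarray   # shape (|A|, |S|, |S|); P[a, s, s2] = P(s2 | s, a)
    R: np.ndarray   # shape (|S|, |A|); R[s, a] = expected immediate reward
    gamma: float

def is_valid_mdp(mdp):
    """Check array shapes, that each P[a, s, :] is a distribution, and gamma's range."""
    n_s, n_a = len(mdp.states), len(mdp.actions)
    return bool(
        mdp.P.shape == (n_a, n_s, n_s)
        and mdp.R.shape == (n_s, n_a)
        and np.all(mdp.P >= 0)
        and np.allclose(mdp.P.sum(axis=2), 1.0)
        and 0.0 <= mdp.gamma <= 1.0
    )

# Hypothetical toy problem with two states and two actions.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    P=np.array([
        [[1.0, 0.0], [0.0, 1.0]],   # "stay" keeps the current state
        [[0.2, 0.8], [0.9, 0.1]],   # "move" usually switches state
    ]),
    R=np.array([[0.0, 1.0], [0.0, 2.0]]),
    gamma=0.9,
)
print(is_valid_mdp(toy))
```

Compared with the MRP, the only structural change is the extra action index on both `P` and `R`.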

Now, let's explore these concepts by simulating a small MDP in Python.

## Deterministic Policy Example

In [3]:
import numpy as np

# Define the states and actions for a simple MDP
states = ['S0', 'S1', 'S2']
actions = ['A0', 'A1']

In [4]:
# Transition probabilities P(s'|s, a)
transition_probabilities = {
    'S0': {'A0': {'S0': 0.1, 'S1': 0.9},
           'A1': {'S0': 0.7, 'S1': 0.3}},
    'S1': {'A0': {'S1': 0.8, 'S2': 0.2},
           'A1': {'S1': 0.4, 'S2': 0.6}},
    'S2': {'A0': {'S0': 0.3, 'S2': 0.7},
           'A1': {'S0': 0.4, 'S2': 0.6}}
}

In [5]:
# Rewards R(s, a, s')
rewards = {
    'S0': {'A0': {'S0': 0, 'S1': 10},
           'A1': {'S0': 0, 'S1': 0}},
    'S1': {'A0': {'S1': 0, 'S2': 10},
           'A1': {'S1': 0, 'S2': 10}},
    'S2': {'A0': {'S0': 0, 'S2': 10},
           'A1': {'S0': 0, 'S2': 10}}
}

In [6]:
# Define a policy (deterministic for simplicity)
policy = {
    'S0': 'A0',
    'S1': 'A0',
    'S2': 'A1'
}

In [7]:
# Simulate one step in the MDP: sample s' from P(.|s, a), then look up R(s, a, s')
def step(state, action):
    outcomes = transition_probabilities[state][action]
    next_state = np.random.choice(list(outcomes.keys()), p=list(outcomes.values()))
    reward = rewards[state][action][next_state]
    return next_state, reward

# Simulate an episode: follow the policy until S2 (treated as terminal) or max_steps is reached
def simulate_episode(start_state, policy, max_steps=10):
    state = start_state
    total_reward = 0
    for _ in range(max_steps):
        action = policy[state]
        next_state, reward = step(state, action)
        print(f"State: {state}, Action: {action}, Reward: {reward}, Next State: {next_state}")
        total_reward += reward
        state = next_state
        if state == 'S2':  # End of episode
            break
    print(f"Total Reward: {total_reward}")

In [21]:
# Run a simulation
simulate_episode('S0', policy)

State: S0, Action: A0, Reward: 10, Next State: S1
State: S1, Action: A0, Reward: 10, Next State: S2
Total Reward: 20
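
A single episode is only one noisy sample. As a rough sketch, the expected total reward under the deterministic policy can be estimated by averaging over many episodes. Here `run_episode` is a hypothetical silent variant of `simulate_episode` that returns the total instead of printing it, and the seed and episode count are arbitrary choices.

```python
import numpy as np

# Transition model, rewards, and deterministic policy copied from the cells above.
transition_probabilities = {
    'S0': {'A0': {'S0': 0.1, 'S1': 0.9}, 'A1': {'S0': 0.7, 'S1': 0.3}},
    'S1': {'A0': {'S1': 0.8, 'S2': 0.2}, 'A1': {'S1': 0.4, 'S2': 0.6}},
    'S2': {'A0': {'S0': 0.3, 'S2': 0.7}, 'A1': {'S0': 0.4, 'S2': 0.6}},
}
rewards = {
    'S0': {'A0': {'S0': 0, 'S1': 10}, 'A1': {'S0': 0, 'S1': 0}},
    'S1': {'A0': {'S1': 0, 'S2': 10}, 'A1': {'S1': 0, 'S2': 10}},
    'S2': {'A0': {'S0': 0, 'S2': 10}, 'A1': {'S0': 0, 'S2': 10}},
}
policy = {'S0': 'A0', 'S1': 'A0', 'S2': 'A1'}

def run_episode(start_state, policy, rng, max_steps=10):
    """Same loop as simulate_episode, but silent and returning the total reward."""
    state, total_reward = start_state, 0
    for _ in range(max_steps):
        action = policy[state]
        outcomes = transition_probabilities[state][action]
        next_state = str(rng.choice(list(outcomes.keys()), p=list(outcomes.values())))
        total_reward += rewards[state][action][next_state]
        state = next_state
        if state == 'S2':  # S2 is treated as terminal, as in simulate_episode
            break
    return total_reward

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
totals = [run_episode('S0', policy, rng) for _ in range(1000)]
print(f"Average total reward over {len(totals)} episodes: {np.mean(totals):.2f}")
```

Under this policy an episode collects 10 on reaching S1 and another 10 on reaching S2, so every total is 0, 10, or 20; totals below 20 occur only when `max_steps` runs out first.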


## Stochastic Policy Example

In [23]:
# Define a policy (stochastic for this example)
policy = {
    'S0': {'A0': 0.5, 'A1': 0.5},
    'S1': {'A0': 0.7, 'A1': 0.3},
    'S2': {'A0': 0.4, 'A1': 0.6}
}

# Simulate one step in the MDP: sample s' from P(.|s, a), then look up R(s, a, s')
def step(state, action):
    outcomes = transition_probabilities[state][action]
    next_state = np.random.choice(list(outcomes.keys()), p=list(outcomes.values()))
    reward = rewards[state][action][next_state]
    return next_state, reward

# Simulate an episode with a stochastic policy: sample each action from policy[state]
def simulate_episode(start_state, policy, max_steps=10):
    state = start_state
    total_reward = 0
    for _ in range(max_steps):
        action = np.random.choice(list(policy[state].keys()), p=list(policy[state].values()))
        next_state, reward = step(state, action)
        print(f"State: {state}, Action: {action}, Reward: {reward}, Next State: {next_state}")
        total_reward += reward
        state = next_state
        if state == 'S2':  # End of episode
            break
    print(f"Total Reward: {total_reward}")

# Run a simulation with a stochastic policy
simulate_episode('S0', policy)

State: S0, Action: A1, Reward: 0, Next State: S1
State: S1, Action: A1, Reward: 10, Next State: S2
Total Reward: 10
