# 1 | Markov Decision Processes: A Model of Sequential Decision Making

## 1.1 MDP (semi-)Formalism

In reinforcement learning (RL), an *agent* takes *actions* in an *environment* to change its state, with the goal of maximising the expected sum of future *rewards*. We formalise this interaction as an agent-environment loop, mathematically described as a Markov Decision Process (MDP). 

<img src='https://github.com/enjeeneer/sutton_and_barto/blob/main/images/chapter3_1.png?raw=true' width='700'>

MDPs break the I.I.D. data assumption of supervised and unsupervised learning: the agent *causally influences* the data it sees through its choice of actions. The one assumption we do make is the *Markov property*, which says that the state representation captures *all relevant information* from the past. Formally, state transitions depend only on the most recent state and action,
$$
\mathbb{P}[S_{t+1} \mid S_t,A_t] = \mathbb{P}[S_{t+1} \mid S_0,A_0,\ldots,S_t,A_t],
$$
and rewards depend only on the most recent transition,
$$
\mathbb{P}[R_{t+1} \mid S_t,A_t,S_{t+1}] = \mathbb{P}[R_{t+1} \mid S_0,A_0,\ldots,S_t,A_t,S_{t+1}].
$$
- Note: different sources use different notation here, but this form, in which the reward may depend on the full transition $(S_t, A_t, S_{t+1})$, is the most general.

In some MDPs, a subset of states are designated as *terminal* (or *absorbing*). The agent-environment interaction loop ceases once a terminal state is reached.

The goal of an RL agent is to pick actions that maximise the discounted cumulative sum of future rewards, also known as the *return* $G_t$:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1}R_{T},
$$
where $\gamma\in[0,1]$ is a discount factor and $T$ is the time of termination. $T$ may be $\infty$, in which case $\gamma<1$ is needed to keep the sum finite for bounded rewards.
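As a small illustration of the return formula, here is a minimal sketch that computes $G_t$ from a finite list of rewards. The function name `discounted_return` and the reward values are made up for this example; they are not part of the course code.

```python
def discounted_return(rewards, gamma):
    """Compute R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite reward list."""
    g = 0.0
    # Work backwards through the rewards, applying G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [-1.0, -2.0, 10.0]  # illustrative values for R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # -1 + 0.5 * (-2) + 0.25 * 10 = 0.5
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum = 7.0
```

Working backwards avoids computing powers of $\gamma$ explicitly and mirrors the recursive structure of the return.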

To maximise the return, the agent needs the ability to forecast how taking each action $A$ in each state $S$ affects the rewards it receives, potentially many timesteps into the future. This *temporal credit assignment* problem is one of the key factors that makes RL so challenging.

## 1.2 MDP Example

Here's a simple MDP (courtesy of David Silver @ DeepMind/UCL), which we'll be using throughout this course.
- White circle: non-terminal state
- White square: terminal state
- Black circle: action
- <span style="color:green">Green:</span> reward (depends only on $S_{t+1}$ here)
- <span style="color:blue">Blue:</span> state transition probability
- <span style="color:red">Red:</span> action probability for an exemplar policy
- Note: edges with probability $1$ are unlabelled

<img src='https://github.com/tombewley/one-hour-rl/blob/main/images/student-mdp.svg?raw=true' width='700'>

## 1.3 OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) provides a unified framework for testing and comparing RL algorithms, and offers a suite of MDPs that researchers can use to benchmark their work. It's important to be familiar with the conventions of Gym, because a large share of modern RL code is built to work with it. Gym environment classes have two key methods:

- `mdp.reset()`: resets the MDP to an initial state $S_0$ according to an initialisation distribution.
- `mdp.step(action)`: takes an action $A_t$, combines it with the current environment state $S_t$ to produce the next state $S_{t+1}$, and delivers a scalar reward $R_{t+1}$ to the agent.
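To make the two methods concrete, here is a minimal sketch of a class that follows the same `reset`/`step` convention, with `step` returning a `(state, reward, done, info)` tuple. `TinyMDP`, its states and its rewards are hypothetical and chosen only for illustration; this is not the `StudentMDP` implementation and it does not import Gym itself.

```python
class TinyMDP:
    """Hypothetical two-state MDP exposing a Gym-style reset/step interface."""

    def __init__(self):
        self.state = None

    def reset(self):
        # Initialisation distribution: always start in state "A".
        self.state = "A"
        return self.state

    def step(self, action):
        # The next state and reward depend only on the current state and action.
        if self.state != "A":
            raise RuntimeError("Episode has ended; call reset() first.")
        if action == "stay":
            next_state, reward = "A", -1.0
        elif action == "leave":
            next_state, reward = "End", 0.0
        else:
            raise KeyError(action)
        self.state = next_state
        done = next_state == "End"  # "End" is the only terminal state
        return next_state, reward, done, {}


env = TinyMDP()
print(env.reset())        # A
print(env.step("stay"))   # ('A', -1.0, False, {})
print(env.step("leave"))  # ('End', 0.0, True, {})
```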

A Gym-compatible class for the student MDP shown above can be found in `mdp.py` in this repository. Let's import it now and explore what it can do!

In [1]:
from mdp import StudentMDP
mdp = StudentMDP()

Firstly, we'll have a look at the initialisation probabilities and the behaviour of `mdp.reset()`.

In [2]:
print(mdp.initial_probs())
mdp.reset()
print(mdp.state)

{'Class 1': 1.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 0.0}
Class 1


Next, let's check which actions are available in this initial state, and the action-dependent transition probabilities.
- Reminder: the Markov property dictates that transition probabilities depend *only* on the current state and action.

In [3]:
print(mdp.action_space(mdp.state))
print(mdp.transition_probs(mdp.state, "Study"))

{'Go on Facebook', 'Study'}
{'Class 2': 1.0}


Calling `mdp.step(action)` samples and returns the next state $S_{t+1}$, alongside a scalar reward $R_{t+1}$.

Let's try calling this method repeatedly. What's happening here?

In [7]:
state, reward, _, _ = mdp.step("Study") 
print(state, reward)

KeyError: 'Study'

Transitions out of the `Pub` state are *non-deterministic*; they go to one of the three classes with specified probabilities.

In [8]:
mdp.state = "Pub"
print(mdp.action_space(mdp.state))
print(mdp.transition_probs(mdp.state, "Have a pint"))

{'Have a pint'}
{'Class 1': 0.2, 'Class 2': 0.4, 'Class 3': 0.4}


In this state, the behaviour of `mdp.step(action)` is stochastic, even for a constant action.

In [9]:
mdp.state = "Pub" # Note that we're resetting the state to Pub each time
state, reward, _, _ = mdp.step("Have a pint")
print(state, reward)

Class 1 -2.0
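One way to see this stochasticity is to sample the next state many times and compare the empirical frequencies with the transition probabilities printed above. The sketch below uses only the standard library and copies the `Pub` probabilities by hand; `sample_next_state` is a hypothetical helper, and how `StudentMDP` samples internally is not shown here.

```python
import random
from collections import Counter

# Transition probabilities for ("Pub", "Have a pint"), copied from the output above.
pub_probs = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.4}


def sample_next_state(probs, rng):
    """Draw one next state according to a {state: probability} dictionary."""
    states = list(probs)
    weights = list(probs.values())
    return rng.choices(states, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000
counts = Counter(sample_next_state(pub_probs, rng) for _ in range(n_samples))
freqs = {s: counts[s] / n_samples for s in pub_probs}
print(freqs)  # each frequency should be close to the matching probability
```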


This MDP has just one terminal state.

In [12]:
print(mdp.terminal_states())

{'Asleep'}


`mdp.step(action)` also returns a binary `done` flag, which is set to `True` if $S_{t+1}$ is a terminal state.

In [13]:
mdp.state = "Class 2" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

mdp.state = "Pass" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

Asleep 0.0 True
Asleep 0.0 True


Now let's bring an agent into the mix, and give it the exemplar policy shown in the diagram above.

In [2]:
from agent import Agent
agent = Agent(mdp) 
agent.policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}

We can query the policy in a similar way to the MDP's properties, and observe its stochastic behaviour.

In [3]:
print(agent.policy["Class 1"])
print([agent.act("Class 1") for _ in range(20)])

{'Study': 0.5, 'Go on Facebook': 0.5}
['Go on Facebook', 'Go on Facebook', 'Study', 'Study', 'Go on Facebook', 'Study', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study', 'Study']
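For intuition, sampling an action from a stochastic policy of this dictionary form can be sketched as below. The helper `sample_action` is an assumption made for illustration and is not necessarily how `agent.act` is written in `agent.py`.

```python
import random

# A two-state excerpt of the exemplar policy defined above.
policy = {
    "Class 1": {"Study": 0.5, "Go on Facebook": 0.5},
    "Pub": {"Have a pint": 1.0},
}


def sample_action(policy, state, rng=random):
    """Sample an action for `state` from a {state: {action: probability}} policy."""
    action_probs = policy[state]
    actions = list(action_probs)
    weights = list(action_probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(sample_action(policy, "Pub"))      # always 'Have a pint', since its probability is 1
print(sample_action(policy, "Class 1"))  # 'Study' or 'Go on Facebook', each with probability 0.5
```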


Bringing it all together, we can run a full episode in which the agent acts until a terminal state is reached:

In [4]:
mdp.verbose = True
state = mdp.reset()
done = False
while not done:
    state, reward, done, info = mdp.step(agent.act(state))

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Go on Facebook | -1.0   | Facebook   | False |
| 1     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 2     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 3     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 4     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 5     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 6     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 7     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 8     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 9     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 10    | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 11    | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 12    | Facebook | Keep scrolling | -1.0   | F

How "good" is this policy? To answer this, we need to calculate its expected return.
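One hedged way to make "expected return" concrete is a Monte Carlo estimate: run many episodes and average their returns. The sketch below uses a deliberately tiny, hypothetical one-state loop rather than the student MDP: at each step the episode continues with probability `p_continue` (reward $-1$) and otherwise terminates (reward $0$). With `p_continue = 0.9` and $\gamma = 1$, the exact expected return is $-0.9 / 0.1 = -9$, so the estimate can be checked against it.

```python
import random


def run_episode(rng, p_continue=0.9, gamma=1.0):
    """Sample the return of one episode in a hypothetical one-state loop."""
    g, discount = 0.0, 1.0
    # Each continuation yields reward -1; termination yields reward 0.
    while rng.random() < p_continue:
        g += discount * -1.0
        discount *= gamma
    return g


def estimate_expected_return(n_episodes, seed=0):
    """Average the sampled returns over n_episodes independent episodes."""
    rng = random.Random(seed)
    return sum(run_episode(rng) for _ in range(n_episodes)) / n_episodes


estimate = estimate_expected_return(20_000)
print(estimate)  # should be close to the exact value of -9
```

Sampling is simple but noisy; the estimate only approaches the exact value as the number of episodes grows.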