# 1 | Markov Decision Processes: A Model of Sequential Decision Making

<img src='https://github.com/enjeeneer/sutton_and_barto/blob/main/images/chapter3_1.png?raw=true' width='700'>

## 1.1 MDP (semi-)Formalism

- In RL, an *agent* takes *actions* in an *environment* to maximise its cumulative *reward*.
- Unlike supervised and unsupervised learning, we lose the I.I.D. data assumption; the RL agent *affects* its dataset by taking actions that change its observations.
- We formalise this interaction as the agent-environment loop, mathematically described as a Markov Decision Process (MDP).
- In MDPs, we assume the state representation is *Markov*, i.e. it captures all relevant information from the history:
$$
\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, \ldots, S_t]
$$
- The goal of an RL agent is to pick actions that maximise the cumulative sum of future rewards, also known as the *return* $G_t$ (written here without a discount factor):
$$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_{T}
$$
where $T$ is the final time step of the episode, i.e. the step at which a terminal state is reached.
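As a rough illustration of the return, the sketch below sums a list of rewards $[R_{t+1}, \ldots, R_T]$. The function name `compute_return`, the optional `gamma` argument (a discount factor, with `gamma=1.0` giving the plain sum), and the reward values are all assumptions made for this example rather than part of the `mdp` module used later.

```python
def compute_return(rewards, gamma=1.0):
    """Return G_t for rewards [R_{t+1}, ..., R_T].

    gamma is a hypothetical discount factor; gamma=1.0 gives the plain sum.
    """
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards for a four-step episode (assumed values).
episode_rewards = [-2.0, -2.0, 10.0, 0.0]
undiscounted = compute_return(episode_rewards)
discounted = compute_return(episode_rewards, gamma=0.9)
print(undiscounted, discounted)
```

With `gamma=1.0` the result is just the sum of the rewards; a smaller `gamma` weights later rewards less.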

## 1.2 MDP Example

TODO: Update this diagram. It's a bit misleading because it represents the stochastic *policy* and stochastic *dynamics* in the same way.

![The student MDP](https://github.com/tombewley/one-hour-rl/blob/main/images/student-mdp.png?raw=true "The student MDP")

## 1.3 OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) is a widely used toolkit for testing and comparing RL algorithms. It offers a suite of MDPs that RL researchers use to benchmark their work. It's worth being familiar with its conventions because many RL environments are built atop its interface. *Gym* environment classes have two key methods:

- `reset()`: resets the MDP to an initial state, which may or may not be random.
- `step()`: takes an action $a_t$, combines it with the current environment state $s_t$ to produce the next state $s_{t+1}$, and delivers the agent a reward $r_{t+1}$.
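To make the two methods concrete, here is a minimal sketch of an environment that follows the same `reset()` / `step()` convention. It does not import or subclass anything from Gym; the class `CorridorEnv`, its rewards, and its action encoding are hypothetical and chosen only for illustration. `step()` returns a `(state, reward, done, info)` tuple, matching the four values unpacked in the cells below.

```python
class CorridorEnv:
    """Hypothetical toy MDP: walk right along a corridor until the end."""

    def __init__(self, length=3):
        self.length = length
        self.state = None

    def reset(self):
        # Always start at the left-hand end (a deterministic initial state).
        self.state = 0
        return self.state

    def step(self, action):
        # action: 1 = move right, 0 = stay put.
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action!r}")
        self.state = min(self.state + action, self.length)
        done = self.state == self.length
        reward = 10.0 if done else -1.0
        return self.state, reward, done, {}


env = CorridorEnv(length=3)
state = env.reset()
done = False
total_reward = 0.0
while not done:
    state, reward, done, info = env.step(1)
    total_reward += reward
print(state, total_reward)
```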

The core *Gym* environment class can be found 

In [1]:
from mdp import StudentMDP
mdp = StudentMDP()

The MDP assigns each state a probability of being the initial state; `reset()` samples from this distribution.

In [2]:
print(mdp.initial_probs())
mdp.reset()
print(mdp.state)

{'Class 1': 1.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 0.0}
Class 1


Each state has its own action space and transition probabilities. By the Markov property, these depend only on the current state.

In [3]:
print(mdp.action_space(mdp.state))
print(mdp.transition_probs(mdp.state, "Study"))

{'Go on Facebook', 'Study'}
{'Class 2': 1.0}


`mdp.step(action)` returns the next state and the reward.

Run repeatedly... what's happening here?

In [4]:
state, reward, _, _ = mdp.step("Study") 
print(state, reward)

Class 2 -2.0


Non-deterministic transitions

In [5]:
mdp.state = "Pub"
print(mdp.action_space(mdp.state))
print(mdp.transition_probs(mdp.state, "Have a pint"))

{'Have a pint'}
{'Class 1': 0.2, 'Class 2': 0.4, 'Class 3': 0.4}


Run repeatedly... what's happening here?

In [6]:
mdp.state = "Pub" # Note that we're resetting the state to "Pub" each time
state, reward, _, _ = mdp.step("Have a pint")
print(state, reward)

Class 3
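Repeated runs give different next states because the next state is sampled from the transition distribution. The sketch below shows one way such sampling could work, using the standard library's `random.choices`; the helper `sample_from` is hypothetical and is not necessarily how `StudentMDP.step` is implemented.

```python
import random
from collections import Counter


def sample_from(probs, rng):
    """Draw one key from a {outcome: probability} dict (hypothetical helper)."""
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Transition probabilities for "Have a pint" in the "Pub" state, as printed above.
pub_transitions = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.4}

rng = random.Random(0)  # seeded so the run is repeatable
n_samples = 10_000
counts = Counter(sample_from(pub_transitions, rng) for _ in range(n_samples))
frequencies = {s: counts[s] / n_samples for s in pub_transitions}
print(frequencies)
```

Over many samples, the empirical frequencies should approach the listed probabilities.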


Termination probabilities: the probability that the episode ends on reaching each state.

In [7]:
print(mdp.done_probs())

{'Class 1': 0.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 1.0}


`mdp.step(action)` also returns a "done" flag.

In [10]:
mdp.state = "Class 2" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

mdp.state = "Pass" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

Asleep 0.0 True
Asleep 0.0 True


Default policy shown in images/student-mdp.png

In [6]:
from agent import Agent
agent = Agent(mdp) 
agent.policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}
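A stochastic policy like this one is a probability distribution over actions for each state, so each inner dictionary should be non-negative and sum to 1. The check below is a small sketch of that sanity test; `validate_policy` is a hypothetical helper and not part of the `agent` module.

```python
import math


def validate_policy(policy, tol=1e-9):
    """Raise ValueError unless every state's action probabilities form a distribution."""
    for state, action_probs in policy.items():
        if any(p < 0 for p in action_probs.values()):
            raise ValueError(f"negative probability in state {state!r}")
        total = sum(action_probs.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"probabilities in state {state!r} sum to {total}")
    return True


# The policy from the cell above, copied so the example is self-contained.
policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.},
}
is_valid = validate_policy(policy)
print(is_valid)
```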

In [7]:
print(agent.policy["Class 1"])
print([agent.act("Class 1") for _ in range(20)])

{'Study': 0.5, 'Go on Facebook': 0.5}
['Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study']


Bringing it all together

In [13]:
mdp.verbose = True
state = mdp.reset()
done = False
while not done:
    state, reward, done, info = mdp.step(agent.act(state))

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
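The rollout loop can also be wrapped in a helper that records the rewards, so the episode's return can be computed afterwards. The sketch below is self-contained and therefore uses a hypothetical `ScriptedEnv` that simply replays the transitions from the table above; `run_episode` and the `max_steps` guard are likewise assumptions for illustration, not part of the `mdp` or `agent` modules.

```python
class ScriptedEnv:
    """Hypothetical stand-in that replays a fixed list of transitions."""

    def __init__(self, initial_state, transitions):
        self.initial_state = initial_state
        self.transitions = transitions  # list of (next_state, reward, done)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.initial_state

    def step(self, action):
        # The action is ignored: the outcome is read from the script.
        next_state, reward, done = self.transitions[self.t]
        self.t += 1
        return next_state, reward, done, {}


def run_episode(env, act, max_steps=1000):
    """Roll out one episode and return the list of rewards received."""
    state = env.reset()
    rewards = []
    done = False
    while not done and len(rewards) < max_steps:
        state, reward, done, _ = env.step(act(state))
        rewards.append(reward)
    return rewards


# Actions and transitions taken from the table above.
table_actions = {"Class 1": "Study", "Class 2": "Study",
                 "Class 3": "Study", "Pass": "Fall asleep"}
env = ScriptedEnv("Class 1", [("Class 2", -2.0, False), ("Class 3", -2.0, False),
                              ("Pass", 10.0, False), ("Asleep", 0.0, True)])
rewards = run_episode(env, table_actions.get)
episode_return = sum(rewards)
print(rewards, episode_return)
```

With a stochastic policy and stochastic dynamics, a single episode gives only one sample of the return, so different runs will generally give different values.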


How "good" is this policy?