# 1 | Markov Decision Processes: A Model of Sequential Decision Making

<img src='https://github.com/enjeeneer/sutton_and_barto/blob/main/images/chapter3_1.png?raw=true' width='700'>

## 1.1 MDP (semi-)Formalism

- In RL, an *agent* takes *actions* in an *environment* to maximise its cumulative *reward*.
- Unlike supervised and unsupervised learning, we lose the i.i.d. data assumption: the RL agent *affects* its own dataset by taking actions that change its future observations.
- We formalise this interaction as the agent-environment loop, mathematically described as a Markov Decision Process (MDP).
- In MDPs, we assume the state representation is *Markov*, i.e. it captures all relevant information from the history:

$
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]
$

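- The goal of an RL agent is to pick actions that maximise the cumulative sum of future rewards, also known as the *return* $G_t$:

$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_{T}
$

where $T$ is the final time step of the episode, i.e. the step at which a terminal state is reached. The *discounted return* weights each later reward by a further factor of $\gamma \in [0, 1]$, giving $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$; the sum above is the special case $\gamma = 1$.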
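As a rough illustration of the return, the sketch below sums a list of rewards directly. The function name `discounted_return`, the `gamma` parameter, and the reward values are illustrative assumptions rather than part of the notebook's code; with `gamma=1.0` the function reduces to the plain sum shown above.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ..., R_T].

    gamma is a hypothetical discount factor; gamma=1.0 gives the plain sum.
    """
    # The k-th reward after time t is weighted by gamma**k.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [1.0, 0.0, 2.0]  # illustrative values, not taken from any environment
undiscounted = discounted_return(rewards)            # 1 + 0 + 2
discounted = discounted_return(rewards, gamma=0.5)   # 1 + 0.5*0 + 0.25*2
print(undiscounted, discounted)
```

Lowering `gamma` makes the agent care less about rewards far in the future, which is the usual motivation for discounting.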

## 1.2 MDP Example

![The student MDP](images/student-mdp.png?raw=true "The student MDP")

In [1]:
from mdp import StudentMDP
mdp = StudentMDP(verbose=True)

In [2]:
mdp.state = "Class 2"
mdp.print_header()
mdp.step("Study");

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|


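AttributeError: 'StudentMDP' object has no attribute 't'

Note: this call fails because `step()` is invoked before `mdp.reset()`, which presumably initialises the time counter `t`; the rollout in cell 19 calls `mdp.reset()` first and runs without error.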

The default policy is the one shown in `images/student-mdp.png`:

In [14]:
from agent import Agent
agent = Agent(mdp.action_space) 
agent.pi = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}

In [15]:
print(agent.pi["Class 1"])
print([agent.act("Class 1") for _ in range(20)])

{'Study': 0.5, 'Go on Facebook': 0.5}
['Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Study', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Study']
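The sampled actions above are consistent with drawing from the per-state probability dictionary. The following is a minimal sketch of how such sampling could work; `sample_action` is a hypothetical helper and is not necessarily how `Agent.act` is implemented.

```python
import random


def sample_action(policy, state):
    """Sample one action from policy[state], a dict of action -> probability."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    # random.choices draws with replacement according to the given weights.
    return random.choices(actions, weights=probs, k=1)[0]


# A two-state excerpt of the policy dictionary, for illustration only.
pi = {
    "Class 1": {"Study": 0.5, "Go on Facebook": 0.5},
    "Pub": {"Have a pint": 1.0},
}

random.seed(0)  # fixed seed so repeated runs give the same draws
samples = [sample_action(pi, "Class 1") for _ in range(1000)]
study_fraction = samples.count("Study") / len(samples)
print(study_fraction)              # expected to be close to 0.5
print(sample_action(pi, "Pub"))    # only one action is available here
```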


In [19]:
state = mdp.reset()
done = False
while not done:
    state, _, done, _ = mdp.step(agent.act(state))

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
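To connect this episode back to the return $G_t$, the sketch below computes the undiscounted return from every time step using the Reward column of the table above. The helper `returns_per_step` and its `gamma` argument are illustrative assumptions; it relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
def returns_per_step(rewards, gamma=1.0):
    """Compute the return from each time step, working backwards from the end."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        # The return at step t is the reward received there plus the
        # (discounted) return from the following step.
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


episode_rewards = [-2.0, -2.0, 10.0, 0.0]  # Reward column of the table above
episode_returns = returns_per_step(episode_rewards)
print(episode_returns)  # [6.0, 8.0, 10.0, 0.0]
```

The first entry, 6.0, is the total return of the episode when starting from Class 1.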


## 1.3 OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) is a widely used toolkit for testing and comparing RL algorithms. It offers a suite of MDPs that RL researchers use to benchmark their work. It is worth being familiar with its conventions because many modern RL environments are built atop its interface. *Gym* environment classes have two key methods:

- `reset()`: resets the MDP to an initial state, which may or may not be random, and returns that state.
- `step()`: takes an action $a_t$, combines it with the current environment state $s_t$ to produce the next state $s_{t+1}$, and delivers a reward $r_{t+1}$ to the agent.
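The two methods can be illustrated with a toy environment written in plain Python. `CorridorEnv` is a hypothetical example that does not depend on the `gym` package; it only mimics the `reset()`/`step()` convention, returning the same four values that are unpacked from `mdp.step` earlier in this notebook. Real Gym environments subclass `gym.Env`, and newer releases return an extra value from `step()`.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to the goal."""

    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def reset(self):
        # Always start in the leftmost cell (a deterministic initial state).
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clip the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Each step costs -1 until the goal cell is reached.
        reward = 0.0 if done else -1.0
        return self.state, reward, done, {}


env = CorridorEnv(length=4)
state = env.reset()
done = False
total_reward = 0.0
while not done:
    state, reward, done, _ = env.step(1)  # fixed policy: always move right
    total_reward += reward
print(state, total_reward)
```

The interaction loop has the same shape as the rollout of the student MDP: reset once, then call `step()` until `done` is true.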

The core *Gym* environment class can be found 