# 1 | Markov Decision Processes: A Model of Sequential Decision Making

- High-level motivation: sequential decision-making
- The MDP formalism 
    - Mathematical notation *and* Gym as an implementation convention
    - Introduce the Student MDP using a diagram and a Gym class

![The student MDP](https://github.com/tombewley/one-hour-rl/blob/main/images/student-mdp.png?raw=true "The student MDP")
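
As a reference for the notation, the standard textbook definition of an MDP is sketched below. The symbols are an assumption, since this notebook does not fix a notation of its own, and the exact form of the reward function $R$ is likewise assumed.

$$
\mathcal{M} = \big(\mathcal{S},\ \mathcal{A},\ P_0,\ P,\ R\big), \qquad
s_0 \sim P_0(\cdot), \quad
a_t \sim \pi(\cdot \mid s_t), \quad
s_{t+1} \sim P(\cdot \mid s_t, a_t), \quad
r_t = R(s_t, a_t, s_{t+1})
$$

In the cells that follow, $P_0$ corresponds to `mdp.initial_probs()`, the actions available in a state to `mdp.action_space(state)`, $P$ to `mdp.transition_probs(state, action)`, and the policy $\pi$ to `agent.pi`.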

In [1]:
from mdp import StudentMDP
from agent import Agent
mdp = StudentMDP(verbose=True)
agent = Agent(mdp) 

In [2]:
print(mdp.initial_probs())
state = mdp.sample_initial_state()
print(state)
print(mdp.action_space(state))
print(mdp.transition_probs(state, "Study"))

{'Class 1': 1.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 0.0}
Class 1
{'Go on Facebook', 'Study'}
{'Class 2': 1.0}


In [4]:
print(mdp.sample_next_state(state, "Study"))

Class 2


In [5]:
state = "Pub"
print(mdp.action_space(state))
print(mdp.transition_probs(state, "Have a pint"))

{'Have a pint'}
{'Class 1': 0.2, 'Class 2': 0.4, 'Class 3': 0.4}


In [7]:
print(mdp.sample_next_state(state, "Have a pint"))

Class 1
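
To make the sampling step concrete, here is a minimal sketch of how a next state could be drawn from a transition dictionary like the one printed above. The helper `sample_from` is hypothetical and is not necessarily how `StudentMDP.sample_next_state` is implemented; it only uses the standard library.

```python
import random

def sample_from(probs, rng=random):
    """Draw one outcome from a {outcome: probability} dictionary.

    Hypothetical helper for illustration; not part of the mdp module.
    """
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Transition probabilities for ("Pub", "Have a pint"), copied from the output above.
pub_probs = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.4}

rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [sample_from(pub_probs, rng) for _ in range(10_000)]

# Empirical frequencies should be close to the stated probabilities.
freqs = {s: samples.count(s) / len(samples) for s in pub_probs}
print(freqs)
```

With many samples, the empirical frequencies approach the probabilities in the dictionary, which is what it means for the transition to be stochastic.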


The default policy, as shown in the diagram above (images/student-mdp.png), assigns a probability to each available action in every state:

In [8]:
agent.pi = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}
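
A policy of this form is only meaningful if each state's action probabilities are non-negative and sum to one. The sketch below checks that for the dictionary above; `is_valid_policy` is a hypothetical helper written for this illustration, not part of the `agent` module.

```python
import math

# The default policy, copied from the cell above.
pi = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.},
}

def is_valid_policy(pi, tol=1e-9):
    """Return True if every state's action probabilities form a distribution."""
    for action_probs in pi.values():
        if any(p < 0 for p in action_probs.values()):
            return False
        if not math.isclose(sum(action_probs.values()), 1.0, abs_tol=tol):
            return False
    return True

print(is_valid_policy(pi))
```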

In [9]:
print(agent.pi["Class 1"])
print([agent.act("Class 1") for _ in range(20)])

{'Study': 0.5, 'Go on Facebook': 0.5}
['Go on Facebook', 'Study', 'Study', 'Study', 'Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study', 'Go on Facebook', 'Go on Facebook', 'Study', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study', 'Go on Facebook']


In [11]:
state = mdp.reset()
done = False
while not done:
    state, _, done, _ = mdp.step(agent.act(state))

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Go to the pub  |  1.0   | Pub        | False |
| 3     | Pub      | Have a pint    | -2.0   | Class 1    | False |
| 4     | Class 1  | Study          | -2.0   | Class 2    | False |
| 5     | Class 2  | Study          | -2.0   | Class 3    | False |
| 6     | Class 3  | Go to the pub  |  1.0   | Pub        | False |
| 7     | Pub      | Have a pint    | -2.0   | Class 2    | False |
| 8     | Class 2  | Study          | -2.0   | Class 3    | False |
| 9     | Class 3  | Go to the pub  |  1.0   | Pub        | False |
| 10    | Pub      | Have a pint    | -2.0   | Class 2    | False |
| 11    | Class 2  | Study          | -2.0   | Class 3    | False |
| 12    | Class 3  | Go to the pub  |  1.0   | P

How "good" is this policy?
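
One common way to start answering this is to add up the rewards collected along a trajectory, optionally discounting later rewards by a factor `gamma`. The sketch below assumes that measure (the discounted return) and applies it to the first four rewards in the table above; the function name and the choice of `gamma` are illustrative, not taken from the `mdp` or `agent` modules.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    g = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards at times 0 to 3 from the trajectory table above.
rewards = [-2.0, -2.0, 1.0, -2.0]

print(discounted_return(rewards))             # undiscounted: -5.0
print(discounted_return(rewards, gamma=0.5))  # discounted:   -3.0
```

A single trajectory gives only one noisy sample of this quantity, so averaging over many episodes would be needed to judge the policy itself.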