# 2 | Policy Evaluation: The Temporal Difference Method

In RL, the agent selects actions from its policy $\pi$, which is a mapping from states to actions, $\mathcal{S} \rightarrow \mathcal{A}$. In the previous example the policy was formalised as a set of action probabilities for each state. An important aspect of RL is quantifying how good a policy is at achieving our goal. We do this using *value functions*, the most common being the state-action value function $Q(s_t, a_t)$.

$Q$ estimates the expected future return from a state $s_t$ if action $a_t$ is chosen and the policy $\pi$ is followed in all subsequent states. By sampling episodes from our MDP under the current policy, we can collect rewards and update our $Q$-function accordingly. This procedure is called policy evaluation. Here it uses the Bellman back-up, which has two hyperparameters, $\gamma$ and $\alpha$: $\gamma$ is the discount factor that weights future rewards relative to immediate ones, and $\alpha$ is the learning rate that sets the size of each update.
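
For reference, the quantity that $Q$ estimates and the update applied at each step can be written out explicitly. The reward indexing $r_{t+k+1}$ (the reward received after acting at time $t+k$) is a notational assumption made here for illustration. The second line has the same form as the `bellman_backup` function used in this notebook, where $a_{t+1}$ is sampled from $\pi$ and $Q(s_{t+1}, a_{t+1})$ is taken to be $0$ when the episode has terminated.

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

The bracketed term is the temporal difference error: the gap between the one-step target $r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1})$ and the current estimate.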

We import the same MDP and policy as last time.

In [152]:
from mdp import StudentMDP
from agent import Agent
mdp = StudentMDP(verbose=True)
agent = Agent(mdp.action_space) 
agent.policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}

Next we introduce the agent's $Q$ values, which are all initialised to zero, together with $\gamma$ and $\alpha$.

In [153]:
agent.Q

{'Class 1': {'Study': 0.0, 'Go on Facebook': 0.0},
 'Class 2': {'Study': 0.0, 'Fall asleep': 0.0},
 'Class 3': {'Study': 0.0, 'Go to the pub': 0.0},
 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 0.0},
 'Pub': {'Have a pint': 0.0},
 'Pass': {'Fall asleep': 0.0},
 'Asleep': {'Stay asleep': 0.0}}

In [154]:
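GAMMA = 0.9   # discount factor
ALPHA = 0.01  # learning rate (step size)

def bellman_backup(agent, state, action, reward, next_state, done):
    # Bootstrap from the Q-value of an action sampled from the policy at next_state;
    # a terminal transition has no future return.
    Q_next = 0. if done else agent.Q[next_state][agent.act(next_state)]

    # Move Q(state, action) a fraction ALPHA towards the target reward + GAMMA * Q_next.
    agent.Q[state][action] += ALPHA * (reward + GAMMA * Q_next - agent.Q[state][action])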
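
To make the arithmetic of a single update concrete, here is a minimal standalone sketch that applies the same rule to a plain dictionary. The helper `td_update` and the dictionary `Q` are hypothetical names introduced for illustration and are not part of the `agent` module. The two calls assume a zero bootstrap value, as is the case when every $Q$ value starts at zero.

```python
GAMMA = 0.9   # discount factor, same value as in the notebook
ALPHA = 0.01  # learning rate, same value as in the notebook


def td_update(Q, state, action, reward, q_next):
    """Apply one temporal-difference update in place and return the new value."""
    td_target = reward + GAMMA * q_next      # one-step bootstrapped target
    td_error = td_target - Q[state][action]  # gap between target and estimate
    Q[state][action] += ALPHA * td_error
    return Q[state][action]


# Hypothetical Q-table holding only the two entries used below.
Q = {"Class 1": {"Study": 0.0}, "Class 3": {"Study": 0.0}}

# Reward -2 with zero bootstrap: 0 + 0.01 * (-2 + 0.9 * 0 - 0)
print(td_update(Q, "Class 1", "Study", reward=-2.0, q_next=0.0))

# Reward +10 with zero bootstrap: 0 + 0.01 * (10 + 0.9 * 0 - 0)
print(td_update(Q, "Class 3", "Study", reward=10.0, q_next=0.0))
```

Because $\alpha$ is small, each update moves the estimate only 1% of the way towards its target, so a single visit changes a value very little.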

In [148]:
mdp.verbose = True

In [155]:
for _ in range(1):
    state = mdp.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = mdp.step(action)

        print(agent.Q[state][action])  # Q-value before the update
        bellman_backup(agent, state, action, reward, next_state, done)
        print(agent.Q[state][action])  # Q-value after the update

        state = next_state

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
0.0
-0.02
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
0.0
-0.02
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
0.0
0.1
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
0.0
0.0
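
After one episode each visited value has moved only a small step, and no information has yet propagated backwards from later states to earlier ones. The sketch below illustrates what repeating the loop over many episodes does. It assumes a hypothetical two-state deterministic chain (`CHAIN`, with states `A` and `B` and a single action `go`) rather than the `StudentMDP`, so that the true values can be checked by hand: $Q(B) = 10$ and $Q(A) = -2 + 0.9 \times 10 = 7$.

```python
GAMMA = 0.9   # discount factor, same value as in the notebook
ALPHA = 0.01  # learning rate, same value as in the notebook

# Hypothetical deterministic chain: A -> B -> terminal, one action per state.
# Each entry maps state -> (action, reward, next_state, done).
CHAIN = {
    "A": ("go", -2.0, "B", False),
    "B": ("go", 10.0, None, True),
}

# All Q-values start at zero.
Q = {state: {"go": 0.0} for state in CHAIN}

for _ in range(2000):
    state, done = "A", False
    while not done:
        action, reward, next_state, done = CHAIN[state]
        # Bootstrap from the next state's only action, or 0 at termination.
        q_next = 0.0 if done else Q[next_state][CHAIN[next_state][0]]
        Q[state][action] += ALPHA * (reward + GAMMA * q_next - Q[state][action])
        state = next_state

print(Q)
```

The value of `B` converges first, since its target is just the reward, and the value of `A` follows as it bootstraps from an increasingly accurate estimate for `B`.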
