In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

## Debugging RL Algorithms

Debugging RL algorithms can be challenging. If you implement an algorithm and run it on a challenging benchmark, you may see no sign of learning. There are several possible explanations:
1. There is a bug in the algorithm.
2. It's learning extremely slowly, so you can't tell.
3. The algorithm is correct but it isn't learning because of the hyperparameter choices.

**A quick sanity check is to run the algorithm on a very simple problem.** If you can't get it to work on the simple problem, then there is probably a bug in the algorithm. If it does work on the simple problem, then the problem may be with the hyperparameter choices or the speed at which the algorithm is learning.

Below is an example test environment in which the reward is -1 if the action is 0 and 1 if the action is 1. The reward completely ignores the state of the environment, so the optimal policy is to always take action 1.

In [None]:
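class TestEnvironment1(object):
    """Test environment whose reward depends only on the action."""

    def __init__(self):
        self.reset()

    def reset(self):
        # The state is a constant vector of zeros and is never updated.
        self.state = np.zeros(4)
        self.iter = 0
        return self.state

    def step(self, action):
        if action not in [0, 1]:
            raise ValueError("The action must be either 0 or 1.")
        self.iter += 1
        # Action 0 is penalized and action 1 is rewarded.
        reward = -1 if action == 0 else 1
        # Each episode lasts exactly 20 time steps.
        done = self.iter == 20
        info = {}
        return self.state, reward, done, info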

**Exercise:** Implement a policy by hand that is optimal for this test environment.

In [None]:
def optimal_policy1(state):
    # This policy should return an action that is optimal for this
    # environment.
    raise NotImplementedError


def do_rollout(env, policy):
    # Run one full episode with the given policy and return the total
    # (undiscounted) reward.
    state = env.reset()
    done = False
    cumulative_reward = 0
    while not done:
        state, reward, done, info = env.step(policy(state))
        cumulative_reward += reward
    return cumulative_reward

env1 = TestEnvironment1()
# The optimal policy should achieve the maximum reward.
assert do_rollout(env1, optimal_policy1) == 20

**Exercise:** Implement the following test environment:
- The state is an integer initialized at 0.
- The possible actions are 0 and 1.
- An action of 0 decrements the state by 1. An action of 1 increments the state by 1.
- After taking an action, the reward is 1 if the new state is greater than 5. Otherwise it is 0.
- The environment terminates after 20 time steps.
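
When checking an implementation against a specification like the one above, it can help to replay a fixed sequence of actions and inspect what the environment returns at each step. The helper below is a minimal sketch of that idea. The names `record_trace` and `PlaceholderEnv` are hypothetical and are not part of this notebook, and the placeholder environment exists only so that the snippet runs on its own. It is not a solution to the exercise.

```python
def record_trace(env, actions):
    """Apply a fixed action sequence and record (state, reward, done) per step."""
    trace = []
    env.reset()
    for action in actions:
        state, reward, done, _ = env.step(action)
        trace.append((state, reward, done))
        if done:
            # Stop as soon as the environment signals termination.
            break
    return trace


class PlaceholderEnv(object):
    """Hypothetical stand-in environment: constant state, zero reward, 3 steps."""

    def reset(self):
        self.steps = 0
        return 0

    def step(self, action):
        self.steps += 1
        return 0, 0, self.steps == 3, {}


# Four actions are supplied, but the episode ends after the third step.
trace = record_trace(PlaceholderEnv(), [1, 1, 1, 1])
```

Once `TestEnvironment2` is implemented, calling `record_trace(TestEnvironment2(), [1] * 20)` should show the state changing by 1 at every step, the reward switching from 0 to 1 once the new state exceeds 5, and `done` becoming true on the 20th step.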

In [None]:
class TestEnvironment2(object):
    def __init__(self):
        raise NotImplementedError
        
    def reset(self):
        raise NotImplementedError
    
    def step(self, action):
        raise NotImplementedError

test_env = TestEnvironment2()

**Exercise:** Implement a policy by hand that is optimal for this test environment. 

In [None]:
def optimal_policy2(state):
    # This policy should return an action that is optimal for this
    # environment.
    raise NotImplementedError


env2 = TestEnvironment2()
# The optimal policy should achieve the maximum reward.
assert do_rollout(env2, optimal_policy2) == 15

These environments will come in handy when debugging the RL algorithms in the next exercises.
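
As an illustration of the sanity check described earlier, the sketch below runs a very small learner on an environment with the same interface and reward rule as `TestEnvironment1`. Everything in it is an assumption made for illustration: `SimpleRewardEnv` is a hypothetical stand-in so the snippet is self-contained, and `run_epsilon_greedy` is a simple epsilon-greedy estimate of each action's mean reward, not one of the algorithms from the next exercises.

```python
import numpy as np


class SimpleRewardEnv(object):
    """Hypothetical stand-in with the same interface as TestEnvironment1."""

    def reset(self):
        self.state = np.zeros(4)
        self.iter = 0
        return self.state

    def step(self, action):
        self.iter += 1
        reward = -1 if action == 0 else 1
        done = self.iter == 20
        return self.state, reward, done, {}


def run_epsilon_greedy(env, num_episodes=50, epsilon=0.1, seed=0):
    """Estimate each action's mean reward while acting epsilon-greedily."""
    rng = np.random.RandomState(seed)
    value_estimates = np.zeros(2)
    action_counts = np.zeros(2)
    for _ in range(num_episodes):
        env.reset()
        done = False
        while not done:
            if rng.rand() < epsilon:
                # Explore: pick one of the two actions uniformly at random.
                action = int(rng.randint(2))
            else:
                # Exploit: pick the action with the highest estimate so far.
                action = int(np.argmax(value_estimates))
            _, reward, done, _ = env.step(action)
            action_counts[action] += 1
            # Incremental update of the running mean reward for this action.
            step_size = 1.0 / action_counts[action]
            value_estimates[action] += step_size * (reward - value_estimates[action])
    return value_estimates


estimates = run_epsilon_greedy(SimpleRewardEnv())
greedy_action = int(np.argmax(estimates))
```

If a learner this simple did not end up preferring action 1 on this environment, that would point to a bug rather than to hyperparameters or slow learning.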