# RL Exercise 3 - Custom Environments and Reward Shaping

**GOAL:** The goal of this exercise is to demonstrate how to adapt your own problem to use RLlib.

To understand how to use **RLlib**, see the documentation at http://rllib.io.

RLlib is easy to use not only in simulated benchmarks but also in the real world. Here, we will cover two important concepts: how to create your own Markov Decision Process abstraction, and how to shape the reward of your environment to make your agent more effective.

In [None]:
! pip install -U ray[rllib]

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
from gym import spaces
import numpy as np
import test_exercises

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

ray.init(ignore_reinit_error=True, log_to_driver=False)

lz4 not available, disabling sample compression. This will significantly impact RLlib performance. To install lz4, run `pip install lz4`.
2020-01-24 14:42:09,225	INFO resource_spec.py:212 -- Starting Ray with 4.0 GiB memory available for workers and up to 2.01 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-01-24 14:

{'node_ip_address': '192.168.1.27',
 'redis_address': '192.168.1.27:22733',
 'object_store_address': '/tmp/ray/session_2020-01-24_14-42-09_215023_11778/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-01-24_14-42-09_215023_11778/sockets/raylet',
 'webui_url': 'localhost:8270',
 'session_dir': '/tmp/ray/session_2020-01-24_14-42-09_215023_11778'}

## 1. Different Spaces

The first thing to do when formulating an RL problem is to specify the dimensions of your observation space and action space. Abstractions for these are provided in ``gym``. 

### **Exercise 1:** Match different actions to their corresponding space.

The purpose of this exercise is to familiarize you with different Gym spaces. For example:

    discrete = spaces.Discrete(10)
    print("Random sample of this space: ", [discrete.sample() for i in range(4)])

Use `help(spaces)` or `help([specific space])` (e.g., `help(spaces.Discrete)`) for more info.
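
As a complement to `help`, the short sketch below shows the two methods used most often on a space: `sample()` draws a random element and `contains()` checks membership. It assumes the classic `gym` package imported at the top of this notebook; the `dtype=np.float32` argument and the float32 test values are choices made here so that the membership checks do not depend on dtype-casting rules.

```python
import numpy as np
from gym import spaces

# Three spaces of the kinds used in Exercise 1.
discrete = spaces.Discrete(10)                      # integers 0..9
box = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
multi = spaces.MultiDiscrete([5, 2, 2, 4])          # one integer per slot

# contains() answers "is this value an element of the space?"
print(discrete.contains(3))                               # inside 0..9
print(discrete.contains(10))                              # out of range
print(box.contains(np.array([0.5], dtype=np.float32)))    # within [0, 1]
print(box.contains(np.array([1.5], dtype=np.float32)))    # above the upper bound
print(multi.contains(np.array([4, 1, 1, 3])))             # every entry below its slot size
print(multi.contains(np.array([5, 0, 0, 0])))             # first slot only allows 0..4

# sample() always returns an element of the space it was drawn from.
print(all(discrete.contains(discrete.sample()) for _ in range(100)))
```

The same `contains` check is what the exercise below uses to decide whether a value has been matched to the right space.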

In [5]:
action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,)),
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1)),
    "multi_discrete": spaces.MultiDiscrete([ 5, 2, 2, 4 ])
}

action_space_jumble = {
    "discrete_10": 1,
    "multi_discrete": np.array([0, 0, 0, 2]),
    "box_3x1": np.array([[-1.2657754], [-1.6528835], [ 0.5982418]]),
    "box_1": np.array([0.89089584]),
}


for space_id, state in action_space_jumble.items():
    assert action_space_map[space_id].contains(state), (
        "Looks like {} to {} is matched incorrectly.".format(space_id, state))
    
print("Success!")

Success!


## **Exercise 2:** Setting up a custom environment with rewards

We'll set up an `n-Chain` environment, in which the agent moves along a linear chain of states using two actions:

     (0) forward, which moves along the chain but returns no reward
     (1) backward, which returns to the beginning and has a small reward

The end of the chain, however, offers a large reward: taking 'forward' while at the end of the chain collects this large reward again on every step.

#### Step 1: Implement ``ChainEnv._setup_spaces``

We'll use a `spaces.Discrete` action space and observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper gym spaces.
  
1. The observation space is an integer in ``[0, n-1]``.
2. The action space is an integer in ``{0, 1}``.

For example:

```python
    self.action_space = spaces.Discrete(2)
    self.observation_space = ...
```

You should see a message indicating tests passing when done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of ``(state, reward, done, info)``. Right now, the reward is always 0. 

Implement it so that 

1. ``action == 1`` will return `self.small_reward`.
2. ``action == 0`` will return 0 if `self.state < self.n - 1`.
3. ``action == 0`` will return `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating tests passing when done correctly. 

In [8]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO: Implement this so that it passes tests
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.small_reward
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = 0
            ##############
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.large_reward
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_exercises.test_chain_env_spaces(ChainEnv)
test_exercises.test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...
Success! You've setup the spaces correctly.
Testing if reward has been setup correctly...
Success! You've setup the rewards correctly.


### Let's now train a policy on the environment and evaluate it.

You'll see that despite the large reward available at the end of the chain, the trained policy barely explores the state space.

In [9]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

In [10]:
trainer = PPOTrainer(trainer_config, ChainEnv)
for i in range(20):
    print("Training iteration {}...".format(i))
    trainer.train()

2020-01-24 14:45:27,438	INFO trainer.py:377 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-01-24 14:45:27,444	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.
2020-01-24 14:45:27,446	INFO trainer.py:524 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Training iteration 0...
Training iteration 1...
Training iteration 2...
Training iteration 3...
Training iteration 4...
Training iteration 5...
Training iteration 6...
Training iteration 7...
Training iteration 8...
Training iteration 9...
Training iteration 10...
Training iteration 11...
Training iteration 12...
Training iteration 13...
Training iteration 14...
Training iteration 15...
Training iteration 16...
Training iteration 17...
Training iteration 18...
Training iteration 19...


In [11]:
env = ChainEnv({})
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, info = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print("Cumulative reward you've received is: {}. Congratulations!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(max_state, env.n))

Cumulative reward you've received is: 40. Congratulations!
Max state you've visited is: 0. This is out of 20 states.
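
The result above is easier to interpret once the episode length is taken into account: `ChainEnv` ends an episode after `n` steps, so an agent that walks to the end of the chain has time to collect the large reward only once. The sketch below re-implements the transition and reward rules from `ChainEnv.step` in plain Python and compares two fixed policies. The helper `chain_return` and its `horizon` parameter are hypothetical names introduced for illustration; `ChainEnv` itself always uses a horizon of `n`.

```python
def chain_return(policy, n=20, small=2, large=10, horizon=None):
    """Total reward collected by a fixed policy over one episode.

    Mirrors the rules of ChainEnv.step; `horizon` is an illustrative
    parameter that defaults to n, as in ChainEnv.
    """
    horizon = n if horizon is None else horizon
    state, total = 0, 0
    for _ in range(horizon):
        action = policy(state)
        if action == 1:           # backward: small reward, return to the start
            total += small
            state = 0
        elif state < n - 1:       # forward: advance one state, no reward
            state += 1
        else:                     # forward at the end: large reward
            total += large
    return total

always_backward = lambda s: 1
always_forward = lambda s: 0

# Default setting (horizon == n == 20).
print(chain_return(always_backward), chain_return(always_forward))  # 40 10

# A hypothetical longer episode gives the forward policy time to pay off.
print(chain_return(always_backward, horizon=100),
      chain_return(always_forward, horizon=100))                    # 200 810
```

With the default settings, always moving backward yields 40 while always moving forward yields 10, so the behavior PPO found is consistent with the reward as defined, even though it never leaves state 0.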


## **Exercise 3:** Shaping the reward to encourage proper behavior

You'll see that despite an extremely high reward, the policy has barely explored the state space. This is a common situation: the reward designed to encourage a particular solution is suboptimal, and the behavior it produces is unintended.

#### Modify `ShapedChainEnv.step` to provide a reward that encourages the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment (the action -> state behavior should be the same).

You can change the reward to be whatever you wish.

In [12]:
class ShapedChainEnv(ChainEnv):
    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning
            reward = -1
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = -1
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = -1
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}
    
test_exercises.test_chain_env_behavior(ShapedChainEnv)

Testing if behavior has been changed...
Success! Behavior of environment is correct.


### Evaluate `ShapedChainEnv` by running the cell below.

This trains PPO on the new env and counts the number of states seen.

In [13]:
trainer = PPOTrainer(trainer_config, ShapedChainEnv)
for i in range(20):
    print("Training iteration {}...".format(i))
    trainer.train()

env = ShapedChainEnv({})

max_states = []

for i in range(5):
    state = env.reset()
    done = False
    max_state = -1
    cumulative_reward = 0
    while not done:
        action = trainer.compute_action(state)
        state, reward, done, info = env.step(action)
        max_state = max(max_state, state)
        cumulative_reward += reward
    max_states.append(max_state)  # record the furthest state reached this episode

print("Cumulative reward you've received is: {}!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(np.mean(max_states), env.n))
assert (env.n - np.mean(max_states)) / env.n < 0.2, "This policy did not traverse many states."

2020-01-24 14:46:04,206	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.


Training iteration 0...
Training iteration 1...
Training iteration 2...
Training iteration 3...
Training iteration 4...
Training iteration 5...
Training iteration 6...
Training iteration 7...
Training iteration 8...
Training iteration 9...
Training iteration 10...
Training iteration 11...
Training iteration 12...
Training iteration 13...
Training iteration 14...
Training iteration 15...
Training iteration 16...
Training iteration 17...
Training iteration 18...
Training iteration 19...
Cumulative reward you've received is: -20!
Max state you've visited is: 3.2. This is out of 20 states.


AssertionError: This policy did not traverse many states.
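
The assertion fails with the reward used above because every action in every state returns -1: over a fixed horizon of 20 steps, every policy earns exactly -20, so the reward gives PPO nothing to prefer. One possible direction, shown here only as a sketch and not as the exercise's reference solution, is to reward progress along the chain, i.e. the change in the state index at each step. The names `progress_reward` and `shaped_return` are hypothetical, and whether PPO trained on such a reward passes the assertion has not been verified here.

```python
def progress_reward(prev_state, next_state):
    """Hypothetical shaped reward: change in position along the chain."""
    return next_state - prev_state

def shaped_return(policy, n=20):
    """Return of a fixed policy under progress_reward.

    Transitions are the same as in ChainEnv.step; only the reward differs.
    """
    state, total = 0, 0
    for _ in range(n):
        prev = state
        if policy(state) == 1:    # backward: jump to the start
            state = 0
        elif state < n - 1:       # forward: advance one state
            state += 1
        # forward at the end of the chain leaves the state unchanged
        total += progress_reward(prev, state)
    return total

print(shaped_return(lambda s: 0))                   # always forward: 19
print(shaped_return(lambda s: 1))                   # always backward: 0
print(shaped_return(lambda s: 1 if s >= 3 else 0))  # turns back at state 3: 0
```

Because the per-step rewards telescope, the return equals the final state index, so policies that end far along the chain score higher than those that stay near state 0. To try this in `ShapedChainEnv.step`, one would store `self.state` before applying the action and return the difference afterwards, leaving the action-to-state behavior untouched.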