# RL Exercise 3 - Custom Environments and Reward Shaping

**GOAL:** The goal of this exercise is to demonstrate how to adapt your own problem to use RLlib.

To understand how to use **RLlib**, see the documentation at http://rllib.io.

RLlib is easy to use not only in simulated benchmarks but also in the real world. Here, we will cover two important concepts: how to create your own Markov Decision Process abstraction, and how to shape the reward of your environment to make your agent more effective.

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
from gym import spaces
import ray
import numpy as np
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
import test_exercises
ray.init(ignore_reinit_error=True, log_to_driver=False)

2019-06-16 19:45:45,546	ERROR worker.py:1346 -- Calling ray.init() again after it has already been called.


## 1. Different Spaces

The first thing to do when formulating an RL problem is to specify the dimensions of your observation space and action space. Abstractions for these are provided in ``gym``. 

In [4]:
num_states = 10
discrete = spaces.Discrete(num_states)

print("Number of states: ", discrete.n)
print("Random sample of this space: ", [discrete.sample() for i in range(4)])

Number of states:  10
Random sample of this space:  [0, 5, 0, 2]


### **Exercise 1:** Match different actions to their corresponding space.

The purpose of this exercise is to familiarize you with the different `gym` spaces.


Use `help(spaces)` or `help([specific space])` (e.g., `help(spaces.Discrete)`) for more info.

In [5]:
action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,)),
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1)),
    "multi_discrete": spaces.MultiDiscrete([5, 2, 2, 4])
}

action_space_jumble = {
    "discrete_10": 1,
    "box_3x1": np.array([[-1.2657754], [-1.6528835], [ 0.5982418]]),
    "box_1": np.array([0.89089584]),
    "multi_discrete": np.array([0, 0, 0, 2])
}


for space_id, state in action_space_jumble.items():
    assert action_space_map[space_id].contains(state), (
        "Looks like {} to {} is matched incorrectly.".format(space_id, state))
    
print("Success!")

Success!


## **Exercise 2**: Setting up a custom environment with rewards

We'll set up an `n-Chain` environment.

This environment presents moves along a linear chain of states, with two actions:

     (0) forward, which moves along the chain but returns no reward
     (1) backward, which returns to the beginning and has a small reward

The end of the chain, however, presents a large reward, and moving 'forward' at the end of the chain collects this large reward repeatedly.

#### Step 1: Implement ``ChainEnv._setup_spaces``

We'll use a `spaces.Discrete` action space and observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper gym spaces.
  
1. Observation space corresponds to the current state in the chain and is an integer in ``[0 to n-1]``.
2. Action space corresponds to the two actions and is an integer in ``[0, 1]``.

You should see a message indicating tests passing when done correctly. 

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of ``(state, reward, done, info)``. Right now, the reward is always 0. 

Implement it so that 

1. ``action == 1`` will return `self.small_reward`. This corresponds to the `backward` action, which provides a small reward but returns the agent to the beginning.
2. ``action == 0`` will return 0 if `self.state < self.n - 1`. This corresponds to the `forward` action, which moves the agent one step along the chain but provides no reward.
3. ``action == 0`` will return `self.large_reward` if `self.state == self.n - 1`. This corresponds to the `forward` action at the end of the chain, where the agent stays in place and collects the large reward.

You should see a message indicating tests passing when done correctly. 
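
The three reward rules above can be read as a pure function of the current state and the action. The sketch below is only an illustration: `chain_reward` is a hypothetical helper (not part of the exercise code or of RLlib), and its default values are assumed to match the `ChainEnv` defaults used in this notebook.

```python
def chain_reward(state, action, n=50, small_reward=2, large_reward=10):
    """Reward for taking `action` in `state`, following rules 1-3 above.

    Hypothetical helper for illustration; the defaults are assumed values.
    """
    if action == 1:
        # Backward: small reward (the agent is sent back to state 0).
        return small_reward
    if state < n - 1:
        # Forward before the end of the chain: no reward.
        return 0
    # Forward at the last state: large reward.
    return large_reward


print(chain_reward(state=0, action=1))   # backward from the start
print(chain_reward(state=10, action=0))  # forward in the middle of the chain
print(chain_reward(state=49, action=0))  # forward at the end of the chain
```

Keeping the reward logic separate from the state update, as here, makes it easy to check each rule on its own before wiring it into `step`.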

In [6]:
import gym

class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 50)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = 200
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        # TODO: Implement this so that it passes tests
        # self.action_space = None
        # self.observation_space = None

        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            # reward = -1
            reward = self.small_reward
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            # reward = -1
            reward = 0
            ##############
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            # reward = -1
            reward = self.large_reward
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
test_exercises.test_chain_env_spaces(ChainEnv)
test_exercises.test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...
Success! You've setup the spaces correctly.
Testing if reward has been setup correctly...
Success! You've setup the rewards correctly.


### Let's now train a policy on the environment and evaluate it.

You'll see that despite an extremely high reward, the policy has barely explored the state space.

In [7]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1

trainer = PPOTrainer(trainer_config, ChainEnv);
for i in range(5):
    trainer.train()

2019-06-16 19:45:56,202	INFO rollout_worker.py:301 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)


Instructions for updating:
Use keras.layers.dense instead.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.random.categorical instead.


2019-06-16 19:45:56,468	INFO dynamic_tf_policy.py:313 -- Initializing loss function with dummy input:

{ 'action_prob': <tf.Tensor 'default_policy/action_prob:0' shape=(?,) dtype=float32>,
  'actions': <tf.Tensor 'default_policy/actions:0' shape=(?,) dtype=int64>,
  'advantages': <tf.Tensor 'default_policy/advantages:0' shape=(?,) dtype=float32>,
  'behaviour_logits': <tf.Tensor 'default_policy/behaviour_logits:0' shape=(?, 2) dtype=float32>,
  'dones': <tf.Tensor 'default_policy/dones:0' shape=(?,) dtype=bool>,
  'new_obs': <tf.Tensor 'default_policy/new_obs:0' shape=(?, 100) dtype=float32>,
  'obs': <tf.Tensor 'default_policy/observation:0' shape=(?, 100) dtype=float32>,
  'prev_actions': <tf.Tensor 'default_policy/action:0' shape=(?,) dtype=int64>,
  'prev_rewards': <tf.Tensor 'default_policy/prev_reward:0' shape=(?,) dtype=float32>,
  'rewards': <tf.Tensor 'default_policy/rewards:0' shape=(?,) dtype=float32>,
  'value_targets': <tf.Tensor 'default_policy/value_targets:0' shape=(?,)

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.


2019-06-16 19:45:57,395	INFO rollout_worker.py:719 -- Built policy map: {'default_policy': <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x10b057b00>}
2019-06-16 19:45:57,396	INFO rollout_worker.py:720 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.OneHotPreprocessor object at 0x14ce1ceb8>}
2019-06-16 19:45:57,397	INFO rollout_worker.py:333 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x10b0571d0>}
2019-06-16 19:45:57,431	INFO multi_gpu_optimizer.py:79 -- LocalMultiGPUOptimizer devices ['/cpu:0']
2019-06-16 19:46:09,077	INFO multi_gpu_impl.py:146 -- Training on concatenated sample batches:

{ 'inputs': [ np.ndarray((4000,), dtype=int64, min=0.0, max=1.0, mean=0.496),
              np.ndarray((4000,), dtype=float32, min=0.0, max=2.0, mean=0.993),
              np.ndarray((4000, 100), dtype=float32, min=0.0, max=1.0, mean=0.01),
              np.ndarray((4000,), dtype=int64, min=0.0, max=1.0, mean=0.5),
   

In [9]:
from ray.rllib.models import ModelCatalog
env = ChainEnv({})
prepare = ModelCatalog.get_preprocessor(env)
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print("Cumulative reward you've received is: {}. Congratulations!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(max_state, env.n))

Cumulative reward you've received is: 336. Congratulations!
Max state you've visited is: 4. This is out of 100 states.


## Exercise 3: Shaping the reward to encourage proper behavior.

Despite the large reward available at the end of the chain, the policy has barely explored the state space. This is a common situation: the reward designed to encourage a particular solution turns out to be suboptimal, and the resulting behavior is unintended.
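
A quick back-of-the-envelope check shows how much reward the policy leaves on the table. The sketch below replays the chain dynamics in plain Python for two fixed policies. `chain_return` is a hypothetical helper, and the parameter values (`n=50`, small reward 2, large reward 10, horizon 200) are assumed from the `ChainEnv` defaults above.

```python
def chain_return(policy, n=50, small=2, large=10, horizon=200):
    """Total reward of a fixed policy over one episode of the chain.

    Hypothetical helper; parameter defaults are assumed values.
    """
    state, total = 0, 0
    for _ in range(horizon):
        action = policy(state)
        if action == 1:
            # Backward: small reward, return to the start.
            total += small
            state = 0
        elif state < n - 1:
            # Forward along the chain: no reward.
            state += 1
        else:
            # Forward at the end of the chain: large reward.
            total += large
    return total


always_backward = chain_return(lambda state: 1)
always_forward = chain_return(lambda state: 0)
print("always backward:", always_backward)
print("always forward:", always_forward)
```

Under these assumptions, always moving forward collects far more reward over an episode than always moving backward, even though the first forward steps pay nothing. The immediate small reward is what makes the backward action attractive early in training.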

#### Modify `ShapedChainEnv.step` to provide a reward that encourages the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment (the action -> state behavior should be the same).

You can change the reward to be whatever you wish.

In [13]:
class ShapedChainEnv(ChainEnv):
    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning
            reward = -1 
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = self.state / self.n
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.state / self.n
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}
    
test_exercises.test_chain_env_behavior(ShapedChainEnv)

Testing if behavior has been changed...
Success! Behavior of environment is correct.
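
To check that a shaped reward ranks the desired behavior above the undesired one, it can help to compare the returns of two fixed policies under it. The sketch below does this in plain Python for the reward used in `ShapedChainEnv` above. `shaped_return` is a hypothetical helper, and `n=50` and a horizon of 200 are assumed from the `ChainEnv` defaults. It says nothing about whether PPO will actually find the better policy within a few training iterations.

```python
def shaped_return(policy, n=50, horizon=200):
    """Total shaped reward of a fixed policy over one episode.

    Hypothetical helper mirroring the reward in ShapedChainEnv.step;
    n and horizon are assumed values.
    """
    state, total = 0, 0.0
    for _ in range(horizon):
        if policy(state) == 1:
            # Backward: penalty of -1, return to the start.
            total += -1
            state = 0
        elif state < n - 1:
            # Forward: reward grows with the distance already covered.
            total += state / n
            state += 1
        else:
            # Forward at the end of the chain: keep collecting state / n.
            total += state / n
    return total


forward_return = shaped_return(lambda state: 0)
backward_return = shaped_return(lambda state: 1)
print("always forward:", forward_return)
print("always backward:", backward_return)
```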


### Evaluate `ShapedChainEnv` by running the cell below.

This trains PPO on the new env and counts the number of states seen.

In [15]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 2

trainer = PPOTrainer(trainer_config, ShapedChainEnv);
for i in range(5):
    trainer.train()

from ray.rllib.models import ModelCatalog
env = ShapedChainEnv({})
prepare = ModelCatalog.get_preprocessor(env)
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print("Cumulative reward you've received is: {}!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(max_state, env.n))
assert (env.n - max_state) < 5, "This policy did not traverse many states."

2019-06-16 19:49:56,645	INFO rollout_worker.py:301 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
2019-06-16 19:49:58,020	INFO rollout_worker.py:719 -- Built policy map: {'default_policy': <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x2823fd208>}
2019-06-16 19:49:58,021	INFO rollout_worker.py:720 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.OneHotPreprocessor object at 0x2834cb588>}
2019-06-16 19:49:58,022	INFO rollout_worker.py:333 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x2834cb668>}
2019-06-16 19:49:58,063	INFO multi_gpu_optimizer.py:79 -- LocalMultiGPUOptimizer devices ['/cpu:0']


Cumulative reward you've received is: -41.129999999999995!
Max state you've visited is: 13. This is out of 100 states.


AssertionError: This policy did not traverse many states.