<a href="https://colab.research.google.com/github/rlworkgroup/garage/blob/master/examples/jupyter/custom_env.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Demonstrate usage of a custom openai/gym environment with rlworkgroup/garage

This notebook demonstrates how to use [garage](https://github.com/rlworkgroup/garage) with a custom `openai/gym` environment in a Jupyter notebook.

## Install pre-requisites

In [3]:
%%shell

echo "abcd" > mujoco_fake_key


git clone --depth 1 https://github.com/rlworkgroup/garage/

cd garage
bash scripts/setup_colab.sh --mjkey ../mujoco_fake_key --no-modify-bashrc > /dev/null

fatal: destination path 'garage' already exists and is not an empty directory.
start of setup_colab.sh






Cloning into '/tmp/tmp.4slNT9c4V6/glfw'...
remote: Enumerating objects: 23479, done.
remote: Total 23479 (delta 0), reused 0 (delta 0), pack-reused 23479
Receiving objects: 100% (23479/23479), 11.64 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (16458/16458), done.
Note: checking out '0be4f3f75aebd9d24583ee86590a38e741db0904'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 0be4f3f7 Add GLFW_FOCUS_ON_SHOW window hint and attribute
Cloning into '/tmp/tmp.Xq2ekFYVDa/mujoco_150'...
remote: Enumerating objects:



In [0]:
raise Exception("Please restart your runtime so that the installed dependencies for 'garage' can be loaded, and then resume running the notebook")



---



# Custom gym environment

In [0]:
# Create a gym env that simulates a coin-flip guessing game
# Based on https://github.com/openai/gym/blob/master/gym/envs/toy_text/nchain.py

import gym
from gym import spaces
import numpy as np



# Gym env
class MyEnv(gym.Env):
    """Custom gym environment.

    Observation: outcome of the latest coin flip (Discrete binary: 0/1)

    Actions: guess of the coin flip outcome (Discrete binary: 0/1)

    Reward: 1 if the guess matches the coin flip, otherwise 0

    Episode termination: 5 correct guesses reached, or 20 attempts used
    """
    def __init__(self):
        # set action/observation spaces
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(2)
        self.reset()

    def step(self, action):
        assert self.action_space.contains(action), "action not in action space!"

        # flip a coin (stored as an int so that it lies in the Discrete(2) space)
        self.state = int(np.random.rand() < 0.5)

        # increment number of attempts
        self.attempt += 1

        # reward is 1 for a correct guess, 0 otherwise
        reward = int(action == self.state)
        self.score += reward

        # stop after the maximum number of attempts or once the target score is reached
        done = (self.attempt >= 20) or (self.score >= 5)

        return self.state, reward, done, {}

    def reset(self):
        # accumulated score
        self.score = 0
        # number of attempts so far
        self.attempt = 0
        # initial observation
        self.state = 0

        return self.state

In [2]:
# some smoke testing
env_test = MyEnv()
observation = env_test.reset()

for step in range(40):
    # random guess, cast to a plain int
    action = int(np.random.rand() < 0.5)
    observation, reward, done, _ = env_test.step(action)
    print("step %i: action=%i, observation=%i => reward = %i, done = %s" % (step, action, observation, reward, done))
    if done:
        break

step 0: action=0, observation=1 => reward = 0, done = False
step 1: action=1, observation=0 => reward = 0, done = False
step 2: action=0, observation=1 => reward = 0, done = False
step 3: action=0, observation=0 => reward = 1, done = False
step 4: action=1, observation=0 => reward = 0, done = False
step 5: action=1, observation=0 => reward = 0, done = False
step 6: action=1, observation=0 => reward = 0, done = False
step 7: action=0, observation=1 => reward = 0, done = False
step 8: action=0, observation=1 => reward = 0, done = False
step 9: action=1, observation=1 => reward = 1, done = False
step 10: action=1, observation=1 => reward = 1, done = False
step 11: action=1, observation=0 => reward = 0, done = False
step 12: action=0, observation=0 => reward = 1, done = False
step 13: action=1, observation=1 => reward = 1, done = True
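
As a rough reference point for the smoke test above, the episode rule can be replayed without gym. The sketch below is a standalone, pure-Python approximation of `MyEnv` driven by uniformly random guesses; the helper name `run_random_episode`, the seed, and the number of episodes are arbitrary choices made for illustration and are not part of the original notebook.

```python
import random


def run_random_episode(rng, max_attempts=20, target_score=5):
    """Play one episode with uniformly random guesses.

    Returns (attempts, score) under the same stopping rule as MyEnv:
    stop after `target_score` correct guesses or `max_attempts` attempts.
    """
    attempts = 0
    score = 0
    while attempts < max_attempts and score < target_score:
        coin = rng.randint(0, 1)   # outcome of the coin flip
        guess = rng.randint(0, 1)  # uniformly random guess
        attempts += 1
        score += int(guess == coin)
    return attempts, score


rng = random.Random(0)  # fixed seed (arbitrary), only for repeatability
episodes = [run_random_episode(rng) for _ in range(1000)]
mean_attempts = sum(a for a, _ in episodes) / len(episodes)
mean_score = sum(s for _, s in episodes) / len(episodes)
print("mean attempts: %.2f, mean score: %.2f" % (mean_attempts, mean_score))
```

The averages printed here give a baseline against which the episode lengths of a trained policy can be compared.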


# Prepare training

In [0]:
# The contents of this cell are mostly copied from garage/examples/...

# NOTE: importing garage requires restarting the Colab runtime after the first dependency installation above
from garage.np.baselines import LinearFeatureBaseline
from garage.envs import normalize
from garage.experiment import LocalRunner
from garage.logger import logger, StdOutput
from garage.tf.algos import TRPO
from garage.tf.envs import TfEnv
from garage.tf.policies import CategoricalMLPPolicy

In [0]:
# register the env with gym
# https://github.com/openai/gym/tree/master/gym/envs#how-to-create-new-environments-for-gym
from gym.envs.registration import register

register(
    id='MyEnv-v0',
    entry_point=MyEnv,
)

# test registration was successful
env = gym.make("MyEnv-v0")
# env = TfEnv(normalize(gym.make("MyEnv-v0")))
# env = TfEnv(env_name='MyEnv-v0') 

In [0]:
# Wrap the environment so that each observation is converted to a numpy integer array
# (the reason this is needed is not yet clear)
# Based on https://github.com/openai/gym/blob/5404b39d06f72012f562ec41f60734bd4b5ceb4b/gym/wrappers/dict.py

class NpWrapper(gym.ObservationWrapper):
    def observation(self, observation):
        obs = np.array(observation).astype('int')
        return obs


env = NpWrapper(env)
env = TfEnv(normalize(env))
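
To see what the conversion inside `NpWrapper.observation` produces, the same expression can be applied to a few scalar values outside of gym. This is only an illustration; the helper name `to_int_array` is hypothetical and does not appear in the notebook.

```python
import numpy as np


def to_int_array(raw):
    """Same conversion as in NpWrapper.observation: scalar -> 0-d integer array."""
    return np.array(raw).astype('int')


# a Python bool, a numpy bool and a plain int all end up as 0-d integer arrays
for raw in (True, np.bool_(False), 1):
    obs = to_int_array(raw)
    print(repr(raw), '->', repr(obs), '| shape:', obs.shape, '| dtype kind:', obs.dtype.kind)
```

Whatever scalar type the environment returns, the wrapped observation is a zero-dimensional numpy array with an integer dtype.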

In [0]:
policy = CategoricalMLPPolicy(
    name="policy", env_spec=env.spec, hidden_sizes=(32, 32))

baseline = LinearFeatureBaseline(env_spec=env.spec)


algo = TRPO(
    env_spec=env.spec,
    policy=policy,
    baseline=baseline,
    max_path_length=50,
    n_itr=50,
    discount=0.99,
    max_kl_step=0.01
)
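
For intuition about the `discount=0.99` setting, the sketch below applies the textbook definition of a discounted return to the rewards printed by the smoke test earlier. It is a standalone illustration, not garage's internal implementation, and the function name `discounted_return` is an assumption made for this example.

```python
def discounted_return(rewards, discount):
    """Sum of rewards weighted by discount**t, accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# rewards copied from the smoke-test output above (steps 0 to 13)
rewards = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1]

print("undiscounted return:", discounted_return(rewards, 1.0))
print("discounted return (0.99):", discounted_return(rewards, 0.99))
```

With a discount this close to 1 and episodes of at most 20 steps, later rewards are weighted only slightly less than earlier ones.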


## Start training

In [0]:
# log to stdout
logger.add_output(StdOutput())

In [0]:
# Start a TensorFlow session explicitly so that it stays open after training
# and the trained network can be used afterwards to see it performing
import tensorflow as tf
sess = tf.InteractiveSession()

# explicit variable initialization (may not be strictly necessary)
sess.run(tf.compat.v1.global_variables_initializer())


In [9]:
# Train the policy (neural network) on the environment
runner = LocalRunner()

runner.setup(algo=algo, env=env)

# use n_epochs = 2 for quick demo
runner.train(n_epochs=2, batch_size=10000, plot=False)

2019-04-01 06:42:42 | epoch #0 | Obtaining samples...
2019-04-01 06:42:42 | epoch #0 | Obtaining samples for iteration 0...


  logger.log(self.sess.graph)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


2019-04-01 06:42:49 | epoch #0 | Logging diagnostics...
2019-04-01 06:42:49 | epoch #0 | Optimizing policy...
2019-04-01 06:42:49 | epoch #0 | Computing loss before
2019-04-01 06:42:49 | epoch #0 | Computing KL before
2019-04-01 06:42:49 | epoch #0 | Optimizing
2019-04-01 06:42:49 | epoch #0 | Start CG optimization: #parameters: 1218, #inputs: 1013, #subsample_inputs: 1013
2019-04-01 06:42:49 | epoch #0 | computing loss before
2019-04-01 06:42:49 | epoch #0 | performing update
2019-04-01 06:42:49 | epoch #0 | computing gradient
2019-04-01 06:42:49 | epoch #0 | gradient computed
2019-04-01 06:42:49 | epoch #0 | computing descent direction
2019-04-01 06:42:50 | epoch #0 | descent direction computed
2019-04-01 06:42:50 | epoch #0 | backtrack iters: 0
2019-04-01 06:42:50 | epoch #0 | computing loss after
2019-04-01 06:42:50 | epoch #0 | optimization finished
2019-04-01 06:42:50 | epoch #0 | Computing KL after
2019-04-01 06:42:50 | epoch #0 | Computing loss after
2019-04-01 06:42:50 | epoch

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:07


2019-04-01 06:42:57 | epoch #1 | Logging diagnostics...
2019-04-01 06:42:57 | epoch #1 | Optimizing policy...
2019-04-01 06:42:57 | epoch #1 | Computing loss before
2019-04-01 06:42:57 | epoch #1 | Computing KL before
2019-04-01 06:42:57 | epoch #1 | Optimizing
2019-04-01 06:42:57 | epoch #1 | Start CG optimization: #parameters: 1218, #inputs: 1018, #subsample_inputs: 1018
2019-04-01 06:42:57 | epoch #1 | computing loss before
2019-04-01 06:42:57 | epoch #1 | performing update
2019-04-01 06:42:57 | epoch #1 | computing gradient
2019-04-01 06:42:57 | epoch #1 | gradient computed
2019-04-01 06:42:57 | epoch #1 | computing descent direction
2019-04-01 06:42:58 | epoch #1 | descent direction computed
2019-04-01 06:42:58 | epoch #1 | backtrack iters: 0
2019-04-01 06:42:58 | epoch #1 | computing loss after
2019-04-01 06:42:58 | epoch #1 | optimization finished
2019-04-01 06:42:58 | epoch #1 | Computing KL after
2019-04-01 06:42:58 | epoch #1 | Computing loss after
2019-04-01 06:42:58 | epoch

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:07


2019-04-01 06:43:05 | epoch #2 | Logging diagnostics...
2019-04-01 06:43:05 | epoch #2 | Optimizing policy...
2019-04-01 06:43:05 | epoch #2 | Computing loss before
2019-04-01 06:43:05 | epoch #2 | Computing KL before
2019-04-01 06:43:05 | epoch #2 | Optimizing
2019-04-01 06:43:05 | epoch #2 | Start CG optimization: #parameters: 1218, #inputs: 1000, #subsample_inputs: 1000
2019-04-01 06:43:05 | epoch #2 | computing loss before
2019-04-01 06:43:05 | epoch #2 | performing update
2019-04-01 06:43:05 | epoch #2 | computing gradient
2019-04-01 06:43:05 | epoch #2 | gradient computed
2019-04-01 06:43:05 | epoch #2 | computing descent direction
2019-04-01 06:43:06 | epoch #2 | descent direction computed
2019-04-01 06:43:06 | epoch #2 | backtrack iters: 0
2019-04-01 06:43:06 | epoch #2 | computing loss after
2019-04-01 06:43:06 | epoch #2 | optimization finished
2019-04-01 06:43:06 | epoch #2 | Computing KL after
2019-04-01 06:43:06 | epoch #2 | Computing loss after
2019-04-01 06:43:06 | epoch

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


2019-04-01 06:43:13 | epoch #3 | Logging diagnostics...
2019-04-01 06:43:13 | epoch #3 | Optimizing policy...
2019-04-01 06:43:13 | epoch #3 | Computing loss before
2019-04-01 06:43:13 | epoch #3 | Computing KL before
2019-04-01 06:43:13 | epoch #3 | Optimizing
2019-04-01 06:43:13 | epoch #3 | Start CG optimization: #parameters: 1218, #inputs: 1006, #subsample_inputs: 1006
2019-04-01 06:43:13 | epoch #3 | computing loss before
2019-04-01 06:43:13 | epoch #3 | performing update
2019-04-01 06:43:13 | epoch #3 | computing gradient
2019-04-01 06:43:13 | epoch #3 | gradient computed
2019-04-01 06:43:13 | epoch #3 | computing descent direction
2019-04-01 06:43:13 | epoch #3 | descent direction computed
2019-04-01 06:43:13 | epoch #3 | backtrack iters: 1
2019-04-01 06:43:13 | epoch #3 | computing loss after
2019-04-01 06:43:13 | epoch #3 | optimization finished
2019-04-01 06:43:13 | epoch #3 | Computing KL after
2019-04-01 06:43:13 | epoch #3 | Computing loss after
2019-04-01 06:43:13 | epoch

0% [######                        ] 100% | ETA: 00:00:05

KeyboardInterrupt: ignored

In [20]:
# test results
import pandas as pd

n_experiments = 10
row_all = []

for i in range(n_experiments):
    # reset the environment and get the initial observation
    obs_i = env.reset()

    # run one episode with the trained policy
    done = False
    while not done:
        row_i = {}
        row_i['exp'] = i + 1
        row_i['obs'] = obs_i
        act_i, _ = policy.get_action(obs_i)
        row_i['act'] = act_i
        obs_i, rew_i, done, _ = env.step(act_i)
        row_i['next_obs'] = obs_i
        row_i['rew'] = rew_i
        row_all.append(row_i)

# summarize the score and the number of steps per experiment
df = pd.DataFrame(row_all)
pd.DataFrame({
    'score': df.groupby('exp')['rew'].sum(),
    'nstep': df.groupby('exp')['rew'].count()
})

| exp | nstep | score |
|-----|-------|-------|
| 1   | 7     | 5.0   |
| 2   | 8     | 5.0   |
| 3   | 7     | 5.0   |
| 4   | 13    | 5.0   |
| 5   | 8     | 5.0   |
| 6   | 7     | 5.0   |
| 7   | 8     | 5.0   |
| 8   | 10    | 5.0   |
| 9   | 7     | 5.0   |
| 10  | 10    | 5.0   |
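
The result table above can be condensed into two summary numbers. The sketch below hard-codes the ten rows exactly as shown and uses only the standard library; the variable names are chosen for this example, and with only ten episodes the figures are a small sample rather than a reliable estimate.

```python
from statistics import mean

# (exp, nstep, score) rows copied from the result table above
results = [
    (1, 7, 5.0), (2, 8, 5.0), (3, 7, 5.0), (4, 13, 5.0), (5, 8, 5.0),
    (6, 7, 5.0), (7, 8, 5.0), (8, 10, 5.0), (9, 7, 5.0), (10, 10, 5.0),
]

steps = [nstep for _, nstep, _ in results]
scores = [score for _, _, score in results]

mean_steps = mean(steps)
hit_rate = sum(scores) / sum(steps)  # fraction of steps that earned a reward

print("mean steps per episode: %.1f" % mean_steps)
print("overall fraction of rewarded steps: %.2f" % hit_rate)
```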
