<a href="https://colab.research.google.com/github/rlworkgroup/garage/blob/master/examples/jupyter/custom_env.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Demonstrate usage of a custom openai/gym environment with rlworkgroup/garage

Demonstrate usage of [garage](https://github.com/rlworkgroup/garage) with a custom `openai/gym` environment in a jupyter notebook.

## Install pre-requisites

In [1]:
!echo "abcd" > mujoco_fake_key


!git clone --depth 1 https://github.com/rlworkgroup/garage/

!cd garage
!bash scripts/setup_colab.sh --mjkey ../mujoco_fake_key --no-modify-bashrc > /dev/null

fatal: destination path 'garage' already exists and is not an empty directory.
bash: scripts/setup_colab.sh: No such file or directory


In [2]:
raise Exception("Please restart your runtime so that the installed dependencies for 'garage' can be loaded, and then resume running the notebook")

Exception: Please restart your runtime so that the installed dependencies for 'garage' can be loaded, and then resume running the notebook



---



---



# custom gym environment

In [7]:
# Create a gym env that simulates the current water treatment plant
# Based on https://github.com/openai/gym/blob/master/gym/envs/toy_text/nchain.py
import os
import gym
from gym import spaces
import numpy as np
import random



# Gym env
class MyEnv(gym.Env):
    """Custom gym environment
    
    Observation: Coin flip (Discrete binary: 0/1)
      
    Actions: Guess of coin flip outcome (Discrete binary: 0/1)
      
    Reward: Guess the coin flip correctly
      
    Episode termination: Make 5 correct guesses within 20 attempts
    """
    def __init__(self):
        # set action/observation spaces
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(2)
        self.reset()

    def step(self, action):
        assert self.action_space.contains(action), "action not in action space!"
        
        # flip a coin
        self.state = np.random.rand() < 0.5

        # increment number of attempts
        self.attempt += 1
        
        # calculate reward of this element
        reward = (action == self.state)
        self.score += reward
          
        # allow a maximum number of attempts or reach max score
        done = (self.attempt >= 20) | (self.score >= 5)
          
        return self.state, reward, done, {}
      
    def reset(self):
      # accumulate score
      self.score = 0
      # count number of attempts
      self.attempt = 0
      
      return 0

In [8]:
# some smoke testing
env_test = MyEnv()
observation = env_test.reset()

for step in range(40):
  action = np.random.rand() < 0.5
  observation, reward, done, _ = env_test.step(action)
  print("step %i: action=%i, observation=%i => reward = %i, done = %s" % (step, action, observation, reward, done))
  if done: break

step 0: action=0, observation=1 => reward = 0, done = False
step 1: action=1, observation=0 => reward = 0, done = False
step 2: action=0, observation=1 => reward = 0, done = False
step 3: action=0, observation=1 => reward = 0, done = False
step 4: action=1, observation=0 => reward = 0, done = False
step 5: action=1, observation=0 => reward = 0, done = False
step 6: action=1, observation=1 => reward = 1, done = False
step 7: action=0, observation=0 => reward = 1, done = False
step 8: action=0, observation=0 => reward = 1, done = False
step 9: action=0, observation=1 => reward = 0, done = False
step 10: action=1, observation=0 => reward = 0, done = False
step 11: action=0, observation=0 => reward = 1, done = False
step 12: action=1, observation=0 => reward = 0, done = False
step 13: action=0, observation=1 => reward = 0, done = False
step 14: action=0, observation=1 => reward = 0, done = False
step 15: action=1, observation=1 => reward = 1, done = True


## Import dependencies 

In [9]:
# The contents of this cell are mostly copied from garage/examples/...

from garage.np.baselines import LinearFeatureBaseline # <<<<<< requires restarting the runtime in colab after the 1st dependency installation above
from garage.envs import GymEnv
from garage.envs import normalize
from garage.tf.algos import TRPO
from garage.tf.policies import GaussianMLPPolicy
from garage.tf.policies import CategoricalMLPPolicy

import gym # already imported before


from garage.experiment import TFTrainer
from garage.experiment.deterministic import set_seed
from dowel import logger, StdOutput

## Prepare the logger

In [10]:
import os
import garage
# set up the dowel logger
log_dir = os.path.join(os.getcwd(), 'data')
ctxt=garage.experiment.SnapshotConfig(snapshot_dir=log_dir,
                                      snapshot_mode='last',
                                      snapshot_gap=1)

# log to stdout
logger.add_output(StdOutput())

## Register Env

In [11]:
# register the env with gym
# https://github.com/openai/gym/tree/master/gym/envs#how-to-create-new-environments-for-gym
from gym.envs.registration import register

register(
    id='MyEnv-v0',
    entry_point=MyEnv,
)

# test registration was successful
env = gym.make("MyEnv-v0")
# env = normalize(GymEnv(gym.make("MyEnv-v0")))
# env = GymEnv('MyEnv-v0') 
# env = GymEnv(gym.make('MyEnv-v0'))

In [12]:
# Wrap the environment to convert the observation to numpy array
# Not sure why this is necessary ATM
# Based on https://github.com/openai/gym/blob/5404b39d06f72012f562ec41f60734bd4b5ceb4b/gym/wrappers/dict.py

      
from gym import wrappers

class NpWrapper(gym.ObservationWrapper):
    def observation(self, observation):
        obs = np.array(observation).astype('int')
        return obs
      
env = NpWrapper(env)
env = normalize(GymEnv(env))

## Define result analytic function

In [13]:
def display_test_result(env, policy, obs_var):

    # test results
    n_experiments = 10
    row_all = []

    for i in range(n_experiments):
      #print("experiment ", i+1)

      policy.build(obs_var)
      # reset
      obs_initial = env.reset()

      # start
      done = False
      obs_i = obs_initial
      while not done:
        row_i = {}
        row_i['exp'] = i + 1
        row_i['obs'] = obs_i
        act_i, _ = policy.get_action(obs_i.flatten())
        row_i['act'] = act_i
        obs_i, rew_i, done, _ = env.step(act_i)
        row_i['obs'] = obs_i
        row_i['rew'] = rew_i
        row_all.append(row_i)

        if done: break

    env.close()
    
    return row_all



## Define and train the algorithm

In [14]:
import tensorflow as tf

hyper_parameters = {
    'hidden_sizes': [32, 32],
    'max_kl': 0.01,
    'gae_lambda': 0.97,
    'discount': 0.99,
    'max_episode_length': 100,
    'n_epochs': 2,
    'batch_size': 10000,
}


def test_gym_environment(env, ctxt=None, seed=1):
    set_seed(seed)
    with TFTrainer(snapshot_config=ctxt) as trainer:
        policy = CategoricalMLPPolicy(
            name="policy", env_spec=env.spec, hidden_sizes=(32, 32))

        obs_var = tf.compat.v1.placeholder(
                    tf.float32,
                    shape=[None, None, env.observation_space.flat_dim],
                    name='obs')


        baseline = LinearFeatureBaseline(env_spec=env.spec)

        algo = TRPO(env_spec=env.spec,
                    policy=policy,
                    baseline=baseline,
                    max_episode_length=hyper_parameters['max_episode_length'],
                    discount=hyper_parameters['discount'],
                    gae_lambda=hyper_parameters['gae_lambda'],
                    max_kl_step=hyper_parameters['max_kl'])

        # train the algorithm
        trainer.setup(algo, env)
        trainer.train(n_epochs=hyper_parameters['n_epochs'],
                     batch_size=hyper_parameters['batch_size'])
        
        # display test results
#         result = display_test_result(env, policy, obs_var)
        
       
#     return result

In [15]:
# result = test_gym_environment(env, ctxt)
test_gym_environment(env, ctxt)

2020-06-17 23:21:19 | Setting seed to 1
2020-06-17 23:21:19 | Setting seed to 1
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.


  logger.log(self.sess.graph)


2020-06-17 23:21:19 | Obtaining samples...
2020-06-17 23:21:19 | Obtaining samples...
2020-06-17 23:21:19 | epoch #0 | Obtaining samples for iteration 0...
2020-06-17 23:21:19 | epoch #0 | Obtaining samples for iteration 0...
2020-06-17 23:21:21 | epoch #0 | Logging diagnostics...
2020-06-17 23:21:21 | epoch #0 | Logging diagnostics...
2020-06-17 23:21:21 | epoch #0 | Optimizing policy...
2020-06-17 23:21:21 | epoch #0 | Optimizing policy...
2020-06-17 23:21:21 | epoch #0 | Computing loss before
2020-06-17 23:21:21 | epoch #0 | Computing loss before
2020-06-17 23:21:21 | epoch #0 | Computing KL before
2020-06-17 23:21:21 | epoch #0 | Computing KL before
2020-06-17 23:21:21 | epoch #0 | Optimizing
2020-06-17 23:21:21 | epoch #0 | Optimizing
2020-06-17 23:21:21 | epoch #0 | Start CG optimization: #parameters: 1218, #inputs: 992, #subsample_inputs: 992
2020-06-17 23:21:21 | epoch #0 | Start CG optimization: #parameters: 1218, #inputs: 992, #subsample_inputs: 992
2020-06-17 23:21:21 | epoc

## Display test result

In [None]:
# # pandas test results
# ! pip install pandas
# import pandas as pd
# df = pd.DataFrame(result)
# pd.DataFrame({
#     'score': df.groupby('exp')['rew'].sum(),
#     'nstep': df.groupby('exp')['rew'].count()
# })