<a href="https://colab.research.google.com/github/rlworkgroup/garage/blob/master/examples/jupyter/trpo_gym_tf_cartpole.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a jupyter notebook demonstrating usage of [garage](https://github.com/rlworkgroup/garage) in a jupyter notebook.

In particular, it demonstrates the example `trpo_gym_tf_cartpole.py` file already available in [garage/examples/tf/](https://github.com/rlworkgroup/garage/blob/master/examples/tf/)

## Install pre-requisites

In [1]:
!echo "abcd" > mujoco_fake_key


!git clone --depth 1 https://github.com/rlworkgroup/garage/

!cd garage
!bash scripts/setup_colab.sh --mjkey ../mujoco_fake_key --no-modify-bashrc > /dev/null!

fatal: destination path 'garage' already exists and is not an empty directory.
/bin/sh: /dev/null!: Operation not permitted


In [2]:
raise Exception("Please restart your runtime so that the installed dependencies for 'garage' can be loaded, and then resume running the notebook")

Exception: Please restart your runtime so that the installed dependencies for 'garage' can be loaded, and then resume running the notebook



---



---



---



## Import dependencies 

In [8]:
# The contents of this cell and the one after are mostly copied from garage/examples/...
# Note that these need to be run twice if for the first time on a colab.research.google.com instance
# 1st time is to create the "personal config from template"
# 2nd time is the charm
import os
import garage
from garage.np.baselines import LinearFeatureBaseline
from garage.envs import GymEnv
from garage.envs import normalize
from garage.tf.algos import TRPO
#from garage.tf.policies import GaussianMLPPolicy
from garage.tf.policies import CategoricalMLPPolicy

import gym

from garage.experiment import TFTrainer
from garage.experiment.deterministic import set_seed
from dowel import logger, StdOutput


## Prepare the logger

In [9]:
# set up the dowel logger
log_dir = os.path.join(os.getcwd(), 'data')
ctxt=garage.experiment.SnapshotConfig(snapshot_dir=log_dir,
                                      snapshot_mode='last',
                                      snapshot_gap=1)

# log to stdout
logger.add_output(StdOutput())

## Set up video visualization of the algorithm

In [None]:
# tested and worked for linux os
# Prepare display for seeing a video of the policy in action in the jupyter notebook
# Note that this doesn't require a runtime restart
# https://stackoverflow.com/a/51183488/4126114

# Linux installation
!apt-get install python-opengl ffmpeg xvfb
!pip3 install pyvirtualdisplay 

In [None]:
# Virtual display (for Linux only)
from pyvirtualdisplay import Display

# launch stuff inside virtual display here

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

In [12]:
# MacOS installation
!pip3 install pyvirtualdisplay 
! pip install PyOpenGL ffmpeg

[33mYou are using pip version 18.1, however version 20.2b1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[33mYou are using pip version 18.1, however version 20.2b1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [13]:
def visualize_algorithm_as_video(env, policy):
            env.spec.id = 1
            env = wrappers.Monitor(env, "./gym-results", force=True)
            obs = env.reset()
            for i in range(1000):
                #action = env.action_space.sample()
                action, _ = policy.get_action(obs)
                obs, reward, done, info = env.step(action)
                if done: break

            print("done at step %i"%i)
            env.close()

## Define and train the algorithm

In [14]:
import gym
from gym import wrappers
def trpo_gym_tf_cartpole(ctxt=None, seed=1):
    """Train TRPO with CartPole-v0 environment.
    
    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    with TFTrainer(snapshot_config=ctxt) as trainer:
        # garage version of CartPole environment, has Discrete action space instead of Box
        ###########################################
        # env = normalize(GymEnv(CartpoleEnv()))
        # policy = GaussianMLPPolicy(
        #    name="policy", env_spec=env.spec, hidden_sizes=(32, 32))
        
        # gym version of CartPole. Check note above
        ###########################################
        # env = normalize(GymEnv(gym.make("CartPole-v0")))
        
        # garage's updated method of getting env
        env = GymEnv(gym.make('CartPole-v1'))

        policy = CategoricalMLPPolicy(name='policy',
                                      env_spec=env.spec,
                                      hidden_sizes=(32, 32))

         # create baseline
        baseline = LinearFeatureBaseline(env_spec=env.spec)

        # specify an algorithm
        algo = TRPO(
            env_spec=env.spec,
            policy=policy,
            baseline=baseline,
            max_episode_length=200,
            discount=0.99,
            max_kl_step=0.01,
        )

        # train the algorithm
        trainer.setup(algo, env)
        trainer.train(n_epochs=10, batch_size=10000, plot=False)
        
        visualize_algorithm_as_video(env, policy)




In [15]:
trpo_gym_tf_cartpole(ctxt=ctxt)

2020-06-17 23:39:28 | Setting seed to 1
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.


  logger.log(self.sess.graph)


2020-06-17 23:39:28 | Obtaining samples...
2020-06-17 23:39:28 | epoch #0 | Obtaining samples for iteration 0...
2020-06-17 23:39:30 | epoch #0 | Logging diagnostics...
2020-06-17 23:39:30 | epoch #0 | Optimizing policy...
2020-06-17 23:39:30 | epoch #0 | Computing loss before
2020-06-17 23:39:30 | epoch #0 | Computing KL before
2020-06-17 23:39:30 | epoch #0 | Optimizing
2020-06-17 23:39:30 | epoch #0 | Start CG optimization: #parameters: 1282, #inputs: 286, #subsample_inputs: 286
2020-06-17 23:39:30 | epoch #0 | computing loss before
2020-06-17 23:39:30 | epoch #0 | computing gradient
2020-06-17 23:39:30 | epoch #0 | gradient computed
2020-06-17 23:39:30 | epoch #0 | computing descent direction
2020-06-17 23:39:32 | epoch #0 | descent direction computed
2020-06-17 23:39:32 | epoch #0 | backtrack iters: 8
2020-06-17 23:39:32 | epoch #0 | optimization finished
2020-06-17 23:39:32 | epoch #0 | Computing KL after
2020-06-17 23:39:32 | epoch #0 | Computing loss after
2020-06-17 23:39:32 |

2020-06-17 23:39:39 | epoch #4 | Obtaining samples for iteration 4...
2020-06-17 23:39:40 | epoch #4 | Logging diagnostics...
2020-06-17 23:39:40 | epoch #4 | Optimizing policy...
2020-06-17 23:39:40 | epoch #4 | Computing loss before
2020-06-17 23:39:40 | epoch #4 | Computing KL before
2020-06-17 23:39:40 | epoch #4 | Optimizing
2020-06-17 23:39:40 | epoch #4 | Start CG optimization: #parameters: 1282, #inputs: 57, #subsample_inputs: 57
2020-06-17 23:39:40 | epoch #4 | computing loss before
2020-06-17 23:39:40 | epoch #4 | computing gradient
2020-06-17 23:39:40 | epoch #4 | gradient computed
2020-06-17 23:39:40 | epoch #4 | computing descent direction
2020-06-17 23:39:41 | epoch #4 | descent direction computed
2020-06-17 23:39:41 | epoch #4 | backtrack iters: 0
2020-06-17 23:39:41 | epoch #4 | optimization finished
2020-06-17 23:39:41 | epoch #4 | Computing KL after
2020-06-17 23:39:41 | epoch #4 | Computing loss after
2020-06-17 23:39:41 | epoch #4 | Fitting baseline...
2020-06-17 23

2020-06-17 23:39:46 | epoch #8 | Obtaining samples for iteration 8...
2020-06-17 23:39:48 | epoch #8 | Logging diagnostics...
2020-06-17 23:39:48 | epoch #8 | Optimizing policy...
2020-06-17 23:39:48 | epoch #8 | Computing loss before
2020-06-17 23:39:48 | epoch #8 | Computing KL before
2020-06-17 23:39:48 | epoch #8 | Optimizing
2020-06-17 23:39:48 | epoch #8 | Start CG optimization: #parameters: 1282, #inputs: 52, #subsample_inputs: 52
2020-06-17 23:39:48 | epoch #8 | computing loss before
2020-06-17 23:39:48 | epoch #8 | computing gradient
2020-06-17 23:39:48 | epoch #8 | gradient computed
2020-06-17 23:39:48 | epoch #8 | computing descent direction
2020-06-17 23:39:48 | epoch #8 | descent direction computed
2020-06-17 23:39:48 | epoch #8 | backtrack iters: 1
2020-06-17 23:39:48 | epoch #8 | optimization finished
2020-06-17 23:39:48 | epoch #8 | Computing KL after
2020-06-17 23:39:48 | epoch #8 | Computing loss after
2020-06-17 23:39:48 | epoch #8 | Fitting baseline...
2020-06-17 23

In [16]:
# Display the video in the jupyter notebook
# Click the play button below to watch the video
!pip install glob2

import io
import base64
import glob2
from IPython.display import HTML

latest_video_name = glob2.iglob('./gym-results/*.mp4')
video = io.open(list(latest_video_name)[-1], 'r+b').read()
encoded = base64.b64encode(video)


vid_ascii = encoded.decode('ascii')

# play the last video
HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(vid_ascii))

[33mYou are using pip version 18.1, however version 20.2b1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
