# Using Gym and other RL frameworks

[Gym](https://github.com/openai/gym/) provides a set of environments to test reinforcement learning algorithms. In this quick intro we will discuss:

* [How to use it](#1-how-to-use-openai-gym)
* [How to create our own wrappers](#2-creating-gym-environment-wrappers)
<!--* [How to use some other environments from different libraries](#3-using-other-environments)-->

In [None]:
# some other functionality we might need
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline

# some fixes required to run some environments
def loadDynamicDeps():
    """Hacky fix to load the appropriate GLEW library

    Based on this fix from the dm_control repo:
    https://github.com/deepmind/dm_control/blob/978230f1376de1826c430dd3dfc0e3c7f742f5fe/dm_control/mujoco/wrapper/util.py#L100
    """
    import ctypes
    import ctypes.util  # find_library lives in the ctypes.util submodule

    _libGlewLibraryPath = ctypes.util.find_library( 'GLEW' )
    ctypes.CDLL( _libGlewLibraryPath, ctypes.RTLD_GLOBAL )
    
loadDynamicDeps()

## 1. How to use OpenAI Gym


In [None]:
# just a single-line import does the trick
import gym

Gym provides various environments, each accessible through its own environment ID. The full list can be found [here](https://github.com/openai/gym/blob/master/docs/environments.md).

In [None]:
# create an environment through its ID
env = gym.make( 'CartPole-v0' )
# inspect it a little bit
print( 'type: %s' % type(env) )
print( 'components: \n\r %s' % str( dir( env ) ) )

### Components of an environment

Let's check some of the main components of the environment interface provided by gym. It implements most
of the agent-environment interface discussed in the slides, as shown in the figure below:

<img src='./imgs/img_rl_loop.png' width='50%'/>

#### State-space $S=\lbrace s \rbrace$ 

This is defined by the `observation_space` attribute of the `env` wrapper, which can be either a `gym.spaces.Discrete` or a `gym.spaces.Box`, for **discrete** and **continuous** state/observation spaces respectively.

#### Action-space $A=\lbrace a \rbrace$ 

This one is defined by the `action_space` attribute of the `env` wrapper, which can also be either a `gym.spaces.Discrete` or a `gym.spaces.Box`, for **discrete** and **continuous** action spaces respectively.

#### Initial state-distribution $\rho_{0}(s)$

We don't have full access to this distribution. Instead, it's already encoded in each environment implementation and
by calling the `env.reset` method we get a sample from this distribution.

```python
# sample initial state from initial state distribution
state = env.reset()
```

#### Reward function $R(s,a,s')$ and Transition model $p(s'|s,a)$

We don't have complete access to these two. Instead, we only have limited access via samples obtained by
stepping through the environment. These samples are returned by the `env.step` method, which accepts the
action to be taken in the environment.

```python
# decide some action to take in state 'state'
action = some_policy( state )
# take a step in the environment (samples from transition model and reward function)
next_state, reward, done, info = env.step( action )
```

### Seeding

Recall that most of the components above are probability distributions, which at their core are implemented using random number generators. We can tell the environment to use a specific seed so that we can later reproduce
our experiments, or study the variability across different random seeds. Just call the `env.seed` method and pass an integer as the seed.

```python
# create a registered environment
env = gym.make( 'SOME-TOTALLY-AWESOME-ENVIRONMENT' )
# seed the environment
env.seed( 0 )
```
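
To see why seeding gives reproducibility, here is a minimal sketch that uses only Python's standard `random` module rather than a real gym environment. The `sample_initial_state` helper and its uniform range are hypothetical stand-ins for whatever an environment's `reset` does internally.

```python
import random

def sample_initial_state( rng, n_dims = 4 ) :
    # hypothetical stand-in for env.reset(): draw each state component
    # uniformly from [-0.05, 0.05] using the given generator
    return [ rng.uniform( -0.05, 0.05 ) for _ in range( n_dims ) ]

# two generators created with the same seed produce the same samples
rng_a = random.Random( 0 )
rng_b = random.Random( 0 )
states_a = [ sample_initial_state( rng_a ) for _ in range( 3 ) ]
states_b = [ sample_initial_state( rng_b ) for _ in range( 3 ) ]
print( 'same seed, same states: ', states_a == states_b )

# a generator created with a different seed produces different samples
rng_c = random.Random( 1 )
states_c = [ sample_initial_state( rng_c ) for _ in range( 3 ) ]
print( 'different seed, same states: ', states_a == states_c )
```

Running an experiment over several seeds and comparing the results is the usual way to see how much of an outcome is due to randomness.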

In [None]:
##---------- Let's play a bit with those components ----------------

def runner( env_name ) :
    # create the environment given the name
    env = gym.make( env_name )
    # grab some information about the environment type
    _isObsContinuous = isinstance( env.observation_space, gym.spaces.Box )
    _isActContinuous = isinstance( env.action_space, gym.spaces.Box )
    
    print( 'Environment: %s' % env_name )
    print( 'observation-space: ', env.observation_space )
    print( 'action_space: ', env.action_space )
    
    if _isObsContinuous :
        print( 'Continuous state|observation space' )
        print( 'obs-space-shape: ', env.observation_space.shape )
    else :
        print( 'Discrete state|observation space' )
        print( 'obs-space-n-states: ', env.observation_space.n )
        
    if _isActContinuous :
        print( 'Continuous action space' )
        print( 'act-space-shape: ', env.action_space.shape )
    else :
        print( 'Discrete action space' )
        print( 'act-space-n-actions: ', env.action_space.n )
    
    print( '-------------------------------------' )

# runner only prints information and returns nothing, so there is no result to keep
# Frozen-lake: discrete state-space, discrete action-space
runner( 'FrozenLake-v0' )
# Lunar-lander: continuous state-space, discrete action-space
runner( 'LunarLander-v2' )
# Pendulum: continuous state-space, continuous action-space
runner( 'Pendulum-v0' )

Let's create a simple RL loop that runs a fixed policy for one episode:

In [None]:
env = gym.make( 'FrozenLake-v0', is_slippery = False )
# env = gym.make( 'LunarLander-v2' )
# env = gym.make( 'Pendulum-v0' )
# env = gym.make( 'Humanoid-v2' )

# env.seed( 0 )

# reset environment and grab state from initial distribution
s = env.reset()

aGamma = 0.99    # discount factor
aScore = 0.      # score obtained in the whole episode
aReturn = 0.     # Return 'G' from starting step
aTrajectory = [] # trajectory taken by the agent [(s,a)]

def random_policy( s ) :
    global env
    return env.action_space.sample()

def naive_discrete_policy( s ) :
    return 0

def naive_continuous_policy( s ) :
    global env
    return np.zeros( env.action_space.shape )

# policy_fn = random_policy
policy_fn = naive_discrete_policy if isinstance( env.action_space, gym.spaces.Discrete ) \
                                  else naive_continuous_policy

# loop for a whole episode
for i in tqdm( range( env.spec.max_episode_steps ) ) :
    a = policy_fn( s )
    snext, r, done, _ = env.step( a )
    env.render()
    
    # some book-keeping
    aTrajectory.append( (s, a) )
    aScore += r
    aReturn += r * ( aGamma ** i )
    s = snext
    
    if done :
        break
        
print( 'score: ', aScore  )
print( 'return: ', aReturn )
print( 'trajectory: ', aTrajectory )
        
env.close()
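
The score and the return tracked in the loop above differ only in the discounting. As a small self-contained sketch (the `discounted_return` helper is an illustrative name, not part of gym, and the rewards are made up), both can be computed from a list of rewards:

```python
def discounted_return( rewards, gamma ) :
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    g = 0.
    # accumulate backwards, so each step applies G_t = r_t + gamma * G_{t+1}
    for r in reversed( rewards ) :
        g = r + gamma * g
    return g

rewards = [ 0., 0., 1. ]  # made-up episode: only the last step is rewarded
score = sum( rewards )                   # undiscounted sum of rewards
ret = discounted_return( rewards, 0.5 )  # same rewards, discounted
print( 'score: ', score )
print( 'return: ', ret )
```

With a discount factor below 1, a reward that arrives later contributes less to the return, which is why the return can be smaller than the score for the same episode.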

## 2. Creating Gym environment wrappers

Most of the publicly available baselines (like [this](https://github.com/hill-a/stable-baselines) one) require a gym-like interface. So, if you want to test an off-the-shelf implementation of an RL algorithm, then it's a good idea to wrap whatever environment you're currently working on with a gym-like interface.

Also, this interface is more or less a standard, and other researchers who want to run their own experiments will have an easier time if your environment is nicely wrapped and ready to use with their own baselines. So, let's see how we can standardize an environment using a gym-like interface.

### Just implement the gym.Env interface

All you need to do is implement the "abstract" methods of the `gym.Env` class and define your observation and action spaces. Below is a skeleton of what to implement; the constructor arguments and the upper-case names in the comments are placeholders for your own values:

```python
import gym
import numpy as np

class MyEnv( gym.Env ) :
    
    def __init__( self, n_states, n_actions ) : # placeholder arguments, use whatever your environment needs
        super( MyEnv, self ).__init__()
        
        # ... initialize your own stuff ...
        
        # define your observation space (pick one of the two)
        self.observation_space = gym.spaces.Discrete( n_states ) # in case of a discrete state-space
        # self.observation_space = gym.spaces.Box( low = -np.inf, high = np.inf,
        #                                          shape = (STATES_DIM,), dtype = np.float32 ) # in case of a continuous state-space
        
        # define your action space (pick one of the two)
        self.action_space = gym.spaces.Discrete( n_actions ) # in case of a discrete action-space
        # self.action_space = gym.spaces.Box( low = -1., high = 1.,
        #                                     shape = (ACTIONS_DIM,), dtype = np.float32 ) # in case of a continuous action-space
        
        # ... some more stuff ...
        
    def reset( self ) :
        # > define here your own initial state|observation distribution
        # > and return a sample from it
        raise NotImplementedError # replace with: return INITIAL_STATE
    
    def seed( self, seed = None ) :
        # > seed your own generators here
        # e.g. torch.manual_seed( seed ), np.random.seed( seed )
        return [seed]
    
    def step( self, action ) :
        # > define the dynamics of your environment
        # > use the action given to take a step in the environment
        # > return the next state, reward, a termination flag, and some other info (dict, ...)
        raise NotImplementedError # replace with: return NEXT_STATE, REWARD, TERMINATION_FLAG, EXTRA_INFO
    
    def render( self, mode = 'human' ) :
        # > use the visualizer that comes with your environment and render in the supported form
        pass
```
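
As a concrete illustration, here is a minimal sketch that wraps a toy simulator behind the same four methods. Everything in it is hypothetical: `CorridorSim` stands in for whatever simulator you already have, and `CorridorEnv` is the wrapper around it. To keep the snippet free of dependencies it does not subclass `gym.Env` or define the two spaces; a real wrapper would add both, as in the template.

```python
import random

class CorridorSim :
    """Hypothetical simulator with its own (non-gym) API: a 1D corridor of cells"""

    def __init__( self, length = 5 ) :
        self.length = length
        self.position = 0

    def restart( self, position ) :
        self.position = position

    def advance( self, direction ) :
        # direction is -1 (left) or +1 (right); the walls clamp the position
        self.position = min( max( self.position + direction, 0 ), self.length - 1 )


class CorridorEnv :
    """Gym-like wrapper exposing reset, step, seed and render"""

    def __init__( self, length = 5 ) :
        self._sim = CorridorSim( length )
        self._rng = random.Random()

    def seed( self, seed = None ) :
        self._rng.seed( seed )
        return [seed]

    def reset( self ) :
        # initial state distribution: uniform over all cells except the goal
        self._sim.restart( self._rng.randrange( self._sim.length - 1 ) )
        return self._sim.position

    def step( self, action ) :
        # actions: 0 -> move left, 1 -> move right
        self._sim.advance( -1 if action == 0 else 1 )
        # the episode ends when the agent reaches the right-most cell
        done = ( self._sim.position == self._sim.length - 1 )
        reward = 1. if done else 0.
        return self._sim.position, reward, done, {}

    def render( self ) :
        cells = [ 'A' if i == self._sim.position else '.' for i in range( self._sim.length ) ]
        print( ''.join( cells ) )


env = CorridorEnv( length = 5 )
env.seed( 0 )
s = env.reset()
done = False
score = 0.
n_steps = 0
while not done :
    s, r, done, _ = env.step( 1 ) # naive policy: always move right
    score += r
    n_steps += 1
env.render()
print( 'score: ', score, ' steps: ', n_steps )
```

The wrapper keeps the simulator's own API private and translates between the two. The same pattern applies when the underlying environment comes from another library.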

In [3]:
from env.gridworld.environment import GridWorldEnv
GridWorldEnv??

In [4]:
from env.mlagents.environment import UnityEnvWrapper
UnityEnvWrapper??

<!--## 3. Using other environments-->

<!--Finally, just for the heck of it, let's check some other environments that are available online, either from competitions or other benchmarks that you might end up using. This isn't at all a comprehensive list, and some
of the benchmarks that I've picked are the ones I'm most interested of. Nevertheless, it's very likely that other environments not covered here will follow a similar interface to the one provided by gym.Env.-->