# Intro to OpenAI Gym
In this demo, we will explore OpenAI Gym, a platform for training reinforcement learning agents.  
You can find documentation and more information here: https://gym.openai.com/  
You can see the source code and more details on how it works in the git repository: https://github.com/openai/gym  
Extending the collection of environments that gym already ships, OpenAI Universe https://github.com/openai/universe (deprecated as of April 2020) and Retro https://github.com/openai/retro provide platforms for converting existing games into gym environments.
Note that reinforcement learning tooling is evolving rapidly and is mostly experimental, so it is not uncommon for packages to be updated without backward compatibility within a few months to a year. Currently, this demo uses gym version 0.17.1.

## High-level overview

Gym is developed and maintained by OpenAI. `gym` provides a rich collection of environments for reinforcement learning experiments behind a unified interface. Although environments look different and do different things, the main structure of a gym environment includes:  

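- Action space: the set of valid actions, either discrete, continuous, or a combination of both.
- Observation space: the set of observations the environment can return, for example the bounds on position and velocity.
- `step` method: executes an action and returns the next observation, the reward, whether the episode is done, and an info dictionary.
- `reset` method: initializes the environment and returns the first observation.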

These core functions are included in the class `Env` defined in https://github.com/openai/gym/blob/master/gym/core.py
Note that the detailed implementation of what values to return (e.g. in the `step` method) is defined in each environment, while the `Env` class in `core.py` defines the overall structure.
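
The interface described above can be illustrated with a minimal sketch in plain Python. `ToyCorridorEnv` is a hypothetical class invented for this example; it does not subclass `gym.Env` or use gym's space classes, and only mirrors the `reset`/`step` contract, with `step` returning an `(observation, reward, done, info)` tuple as in gym 0.17.

```python
class ToyCorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.n_actions = 2      # stand-in for an action space: 0 = left, 1 = right
        self.position = 0

    def reset(self):
        # Initialize the environment and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Apply the action, then report observation, reward, done flag, and info.
        if action not in range(self.n_actions):
            raise ValueError("invalid action: %r" % (action,))
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 0.0 if done else -1.0
        return self.position, reward, done, {}


env = ToyCorridorEnv(length=3)
obs = env.reset()
print(env.step(1))  # one move to the right
print(env.step(1))  # reaches the last cell, so done is True
```

Each real gym environment fills in the same two methods with its own dynamics and reward.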

Gym has a few [environment groups](https://github.com/openai/gym/blob/master/docs/environments.md) that ship with the standard gym package. It also supports third-party environments and lets you build your own.
Among its many environments, we will use `MountainCar-v0` from a basic group called "Classic control" in this demo.


In [0]:
import gym

In [2]:
gym.__version__

'0.17.1'

In [0]:
e = gym.make('MountainCar-v0') 

In [0]:
obs=e.reset()

In [5]:
obs

array([-0.42452761,  0.        ])

In [0]:
d = e.action_space  # Discrete(3): 0 = push left, 1 = no push, 2 = push right
# the action space classes are defined in gym.spaces


In [0]:
from gym.spaces.discrete import Discrete

In [0]:
d = Discrete(3) #the Discrete is a class that has methods .sample and .contains

In [9]:
[d.sample() for x in range(10)] #sample generates an action output

[0, 0, 1, 0, 2, 0, 0, 2, 0, 1]

In [10]:
print(d.contains(0), d.contains(3), d.contains(2)) #with .contains method, you can check whether an integer is a valid action

True False True
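
As a rough picture of what `sample` and `contains` do, here is a plain-Python stand-in. `ToyDiscrete` is a hypothetical class written for this sketch, not gym's `Discrete` implementation; it only assumes that valid actions are the integers `0` to `n - 1`.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a discrete space with actions 0 .. n-1."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)  # private generator so sampling can be seeded

    def sample(self):
        # Draw one valid action uniformly at random.
        return self._rng.randrange(self.n)

    def contains(self, x):
        # A valid action is an integer in the half-open range [0, n).
        return isinstance(x, int) and 0 <= x < self.n


space = ToyDiscrete(3, seed=0)
print([space.sample() for _ in range(10)])
print(space.contains(0), space.contains(3), space.contains(2))
```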


In [11]:
e.observation_space #returns a Box instance, which describes an n-dimensional array with lower and upper bounds

Box(2,)
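
`Box(2,)` means each observation is a vector of two bounded numbers. For MountainCar these are the car's position, between -1.2 and 0.6, and its velocity, between -0.07 and 0.07. The check a box-shaped space performs can be sketched in plain Python; `ToyBox` is a hypothetical helper for illustration, not gym's `Box` class.

```python
class ToyBox:
    """Hypothetical stand-in for a box-shaped space with per-dimension bounds."""

    def __init__(self, low, high):
        if len(low) != len(high):
            raise ValueError("low and high must have the same length")
        self.low = list(low)
        self.high = list(high)
        self.shape = (len(self.low),)

    def contains(self, x):
        # x is valid if it has the right length and every entry is within bounds.
        if len(x) != len(self.low):
            return False
        return all(lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high))


# MountainCar bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
obs_space = ToyBox(low=[-1.2, -0.07], high=[0.6, 0.07])
print(obs_space.shape)
print(obs_space.contains([-0.42452761, 0.0]))  # the observation returned by reset above
print(obs_space.contains([0.7, 0.0]))          # position beyond the right bound
```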

In [12]:
e.step(0)

(array([-0.42625975, -0.00173214]), -1.0, False, {})

In [13]:
e.step(1)
# see https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py
# velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)

(array([-0.4287116 , -0.00245185]), -1.0, False, {})

In [14]:
e.step(0)

(array([-0.43286554, -0.00415394]), -1.0, False, {})

In [15]:
e.step(0)

(array([-0.4386916 , -0.00582606]), -1.0, False, {})
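
The numbers returned by `step` follow from MountainCar's simple physics. The sketch below re-implements one transition in plain Python, using constants taken from gym's `mountain_car.py` as we understand them (force 0.001, gravity 0.0025, speed limit 0.07, position range -1.2 to 0.6, goal at position 0.5 with non-negative velocity). Treat it as an approximation for illustration; `mountain_car_step` is a hypothetical helper, not part of gym.

```python
import math

FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED = -1.2, 0.6, 0.07
GOAL_POS, GOAL_VEL = 0.5, 0.0


def mountain_car_step(position, velocity, action):
    """One transition; action 0 pushes left, 1 does nothing, 2 pushes right."""
    velocity += (action - 1) * FORCE + math.cos(3 * position) * (-GRAVITY)
    velocity = min(max(velocity, -MAX_SPEED), MAX_SPEED)
    position = min(max(position + velocity, MIN_POS), MAX_POS)
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    done = position >= GOAL_POS and velocity >= GOAL_VEL
    reward = -1.0       # every step costs -1, so shorter episodes score higher
    return position, velocity, reward, done


# Replay the first two calls above: e.step(0), then e.step(1)
pos, vel, reward, done = mountain_car_step(-0.42452761, 0.0, 0)
print(round(pos, 8), round(vel, 8), reward, done)
pos, vel, reward, done = mountain_car_step(pos, vel, 1)
print(round(pos, 8), round(vel, 8), reward, done)
```

Starting from the observation returned by `reset`, these two calls should reproduce the observations printed above up to rounding.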

## Making an agent
Since `gym` provides the environment, it's our job to make an agent (policy) that can interact with it. Here is an example of a random agent.

In [16]:
import gym 

env = gym.make("MountainCar-v0")

total_reward = 0.0
total_steps = 0
obs = env.reset()

while True:
    action = env.action_space.sample() #.sample method gives a random action sample
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    total_steps += 1
    if done:
        break

print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 200 steps, total reward -200.00
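
The random agent above stops at exactly 200 steps because `MountainCar-v0` is registered with a limit of 200 steps per episode, and a random policy practically never reaches the goal before that. A time limit of this kind can be sketched as a wrapper in plain Python. `ToyTimeLimit` and `NeverEndingEnv` are hypothetical names for this example and only imitate the `reset`/`step` interface.

```python
class NeverEndingEnv:
    """Hypothetical environment that never terminates on its own."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, -1.0, False, {}  # observation, reward, done, info


class ToyTimeLimit:
    """Hypothetical wrapper that ends an episode after max_steps steps."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            done = True  # cut the episode off even if the goal was not reached
        return obs, reward, done, info


env = ToyTimeLimit(NeverEndingEnv(), max_steps=200)
obs = env.reset()
total_reward, total_steps, done = 0.0, 0, False
while not done:
    obs, reward, done, _ = env.step(0)
    total_reward += reward
    total_steps += 1
print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
```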


## Monitoring the agent
`gym` provides Monitors through the [wrappers](https://github.com/openai/gym/tree/master/gym/wrappers) module.

**Caution:** the snippet below won't work in Jupyter. On a Linux or Unix machine you can run it in python or ipython, which will pop up a display window.  
On Windows, getting the Monitor wrapper to work is troublesome: it requires ffmpeg, which is tricky to install on Windows, and the alternative, running ffmpeg under Windows bash, has a display issue because the Linux subsystem in Windows 10 does not connect to a display by default.  
Regardless of your OS, if you still want to display in Jupyter, you can try using `IPython.display` (skip the part about ssh tunneling to the Jupyter server if you're running Jupyter locally).

In [0]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [18]:
import io
import glob
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!


<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1009'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1009'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
import gym
from gym import wrappers
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    with io.open(mp4, 'rb') as fh:  # read-only binary mode; the file is closed afterwards
      video = fh.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = wrappers.Monitor(env, './video', force=True)
  return env

In [0]:
from gym.spaces.discrete import Discrete
import numpy as np
from numpy import random

# get the greedy action, i.e. the one with the highest Q-value
def getAction(obs_new, obs, w):
    Qmax = -np.inf  # -np.inf works across NumPy versions (np.NINF was removed in NumPy 2.0)

    actions = [0, 1, 2]  # MountainCar's three actions; generalize this for other environments
    for action in actions:
        f = get_f(obs_new, obs, action)
        Qtemp = np.dot(w, f)  # Q = w1 * fpos(s,a) + w2 * fvel(s, a)
        if Qtemp > Qmax:
            Qmax = Qtemp
            Amax = action
            fnew = np.copy(f)  # keep a copy of the features of the best action
    return Amax, Qmax, fnew

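# features f(s, a) for the approximate Q-function; returns the vector f
# (action - 1) maps actions 0, 1, 2 to push directions -1, 0, +1
def get_f(obs_new, obs, action):
    fpos = (obs_new[0] - obs[0]) * (action - 1)  # position change along the push direction
    fvel = obs_new[1] * (action - 1)             # velocity along the push direction
    f = np.array([fpos, fvel])
    return f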

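# update weights
# note: this demo subtracts diff * f and uses no step size; the textbook
# approximate Q-learning update is w + learning_rate * diff * f
def update_w(diff, w, f):
    w = w - diff * f  # diff is a scalar and f is a vector, so every entry of f is scaled
    return w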


def getActionFixedW(obs_new, obs, w):
    Qmax = -np.inf  # start below any Q so an action is always chosen (starting at 0 could return None)
    Amax = None
    actions = [0, 1, 2]  # MountainCar's three actions; generalize this for other environments
    for action in actions:
        f = get_f(obs_new, obs, action)
        Qtemp = np.dot(w, f)  # Q = w1 * fpos(s,a) + w2 * fvel(s, a)
        if Qtemp > Qmax:
            Qmax = Qtemp
            Amax = action
    return Amax

def runNewWeights(w):
    env = gym.make("MountainCar-v0")
    env = wrap_env(env)

    total_reward = 0.0
    total_steps = 0
    obs = env.reset()
    action = 0
    obs_new, r, done, _ = env.step(action)
    action = getActionFixedW(obs_new, obs, w)

    while True:
        obs = obs_new
        obs_new, reward, done, _ = env.step(action)  #action here is "a"
        #print('obs_new = ', obs_new, ' total_steps = ', total_steps)
        total_reward += reward
        total_steps += 1
        #print('total_reward ', total_reward)    
        action = getActionFixedW(obs_new, obs, w)  #action is "a prime" 
            
        
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
    env.close()
    show_video()
    env.env.close()    
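
To see what policy the weights encode, the feature and action-selection logic above can be restated in plain Python without NumPy. The function names below are hypothetical rewrites of `get_f` and `getActionFixedW`, and the weights and observations are made-up values chosen for illustration.

```python
def features(obs_new, obs, action):
    # (action - 1) maps actions 0, 1, 2 to push directions -1, 0, +1
    direction = action - 1
    f_pos = (obs_new[0] - obs[0]) * direction  # position change along the push
    f_vel = obs_new[1] * direction             # velocity along the push
    return [f_pos, f_vel]


def q_value(w, f):
    # Linear approximation: Q(s, a) = w1 * f_pos + w2 * f_vel
    return sum(wi * fi for wi, fi in zip(w, f))


def greedy_action(obs_new, obs, w, actions=(0, 1, 2)):
    # Pick the action with the highest approximate Q-value.
    return max(actions, key=lambda a: q_value(w, features(obs_new, obs, a)))


w = [1.0, 1.0]  # made-up positive weights
moving_right = greedy_action(obs_new=(-0.49, 0.01), obs=(-0.50, 0.0), w=w)
moving_left = greedy_action(obs_new=(-0.51, -0.01), obs=(-0.50, 0.0), w=w)
print(moving_right, moving_left)  # prints "2 0"
```

With positive weights, the greedy choice is to push in the direction the car is already moving, which builds momentum over successive swings.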
    

In [21]:

env = gym.make("MountainCar-v0")
env = wrap_env(env)
obs = env.reset()

total_reward = 0.0
total_steps = 0
alpha = 0.5  # used below as the discount factor in the TD target (often written gamma)
num_weights = 2
best_reward = -np.inf
best_steps = np.inf

w = np.random.rand(num_weights)  # random initial weights in [0, 1)

action = 0
obs_new, r, done, _ = env.step(action)
#print('obs_new = ', obs_new, 'r = ', r, 'done = ', done)
action, Q0, f0 = getAction(obs_new, obs, w)
#print('action = ', action, 'Q0 = ', Q0, 'f0 = ', f0)

for i in range(5):
    print('i = ', i)
    while True:
        obs = obs_new
        obs_new, reward, done, _ = env.step(action)  #action here is "a"
        #print('obs_new = ', obs_new, ' total_steps = ', total_steps)
        total_reward += reward
        total_steps += 1
        #print('total_reward ', total_reward)    
        action, Qmax, fnew = getAction(obs_new, obs, w)  #action is "a prime" 
        sample = reward + alpha * Qmax  # TD target: reward plus discounted Q of the next action
        diff = sample - Q0  # TD error; the result should be a scalar
        #print('sample = ', sample)
        #print('Q0 = ', Q0)
        #print('diff = ', diff)
        w = update_w(diff, w, f0)  # use the features of the previous action (old f)
        Q0 = np.dot(w, fnew)  #calculate the new Q0 for "a prime" - use fnew
        #print('Q0 = ', Q0)
        f0 = np.copy(fnew)  # copy so f0 does not alias fnew
        
        if done:
            break
    print('total_reward = ', total_reward)        
    if total_reward > best_reward:
        best_reward = total_reward
        best_steps = total_steps
        best_w = np.copy(w)

    #w = np.random.rand(num_weights)
    total_steps = 0    
    total_reward = 0
    action = 0
    obs = env.reset()           
    obs_new, r, done, _ = env.step(action)
    action, Q0, f0 = getAction(obs_new, obs, w)

print("Episode done in %d steps, total reward %.2f" % (best_steps, best_reward))

#print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
#env.close()
#show_video()
#env.env.close()

i =  0
total_reward =  -86.0
i =  1
total_reward =  -153.0
i =  2
total_reward =  -164.0
i =  3
total_reward =  -90.0
i =  4
total_reward =  -86.0
Episode done in 86 steps, total reward -86.00
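
For reference, a single weight update in textbook approximate Q-learning can be worked through by hand. This is a sketch under stated assumptions rather than the notebook's exact method: it introduces a step size `learning_rate` and adds the correction, whereas `update_w` above subtracts `diff * f` with no step size. All numbers are invented for the example.

```python
def td_update(w, f_old, q_old, reward, q_next, gamma=0.5, learning_rate=0.1):
    """Return updated weights and the TD error for one transition."""
    target = reward + gamma * q_next  # reward plus discounted value of the next action
    diff = target - q_old             # TD error (a scalar)
    new_w = [wi + learning_rate * diff * fi for wi, fi in zip(w, f_old)]
    return new_w, diff


w = [0.5, 0.5]
f_old = [0.01, 0.01]                              # features of the action just taken
q_old = sum(wi * fi for wi, fi in zip(w, f_old))  # about 0.01
new_w, diff = td_update(w, f_old, q_old, reward=-1.0, q_next=0.02)
print(diff, new_w)
```

Here the target is about -0.99, the TD error is about -1.0, and each weight moves from 0.5 to roughly 0.499.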


In [22]:
best_w

array([2.14714473, 2.28041181])

In [25]:
runNewWeights(best_w)

Episode done in 92 steps, total reward -92.00
