# Intro to OpenAI Gym
In this demo, we will explore OpenAI Gym, a platform for training reinforcement learning agents.  
You can find documentation and more information here: https://gym.openai.com/  
You can see the source code and more details on how it works in the git repository: https://github.com/openai/gym  
Beyond the environments that gym already ships, OpenAI Universe https://github.com/openai/universe (deprecated as of April 2020) and Retro https://github.com/openai/retro provide platforms for converting existing games into gym environments.
Note that reinforcement learning tooling is evolving rapidly and is mostly experimental, so it is not uncommon for packages to break backward compatibility within a few months to a year. This demo uses gym version 0.17.1.

## High-level overview

Gym is developed and maintained by OpenAI. `gym` provides a rich collection of environments for reinforcement learning experiments behind a unified interface. Although each environment may look different and do different things, the main structure of a gym environment includes:

- Action space: the set of valid actions, which can be discrete, continuous, or both.
- Observation space: the bounds within which the agent's observations lie.
- `step` method: executes an action and returns the new observation, the reward, whether the state is terminal, and an info dict.
- `reset` method: initializes the environment and returns the first observation.

These core functions are included in the class `Env` defined in https://github.com/openai/gym/blob/master/gym/core.py
Note that the `Env` class in `core.py` only defines the overall structure; the detailed implementation of what values to return (e.g. in the `step` method) is defined in each environment.
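
To make the structure above concrete, here is a minimal sketch of an environment with the same four parts. `CorridorEnv` is a hypothetical toy class written in plain Python, not part of gym; it only mimics the interface that gym 0.17 uses, where `step` returns an `(observation, reward, done, info)` tuple.

```python
class CorridorEnv:
    """Hypothetical toy environment that mirrors the gym 0.17-style interface.

    The agent starts at position 0 and must reach position `length - 1`.
    """

    def __init__(self, length=5):
        self.length = length
        self.n_actions = 2  # action space: 0 = move left, 1 = move right
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply one action and return (observation, reward, done, info)."""
        if action not in (0, 1):
            raise ValueError("invalid action: %r" % (action,))
        move = 1 if action == 1 else -1
        # Clamp so the agent stays inside the observation bounds [0, length - 1].
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 0.0 if done else -1.0  # -1 per step until the goal is reached
        return self.position, reward, done, {}


env = CorridorEnv(length=3)
first_obs = env.reset()
transition_1 = env.step(1)  # move right: position 1, episode continues
transition_2 = env.step(1)  # move right again: reaches the goal at position 2
print(first_obs, transition_1, transition_2)
```

A real gym environment additionally exposes `action_space` and `observation_space` objects instead of the plain `n_actions` attribute used in this sketch.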

Gym ships with a few [environment groups](https://github.com/openai/gym/blob/master/docs/environments.md) in the standard package. It also supports third-party environments and lets you build your own.
Among its many environments, we will use one from the basic "Classic control" group in this demo.


In [0]:
import gym

In [2]:
gym.__version__

'0.17.1'

In [0]:
e = gym.make('Acrobot-v1') 

In [0]:
obs=e.reset()

In [5]:
obs

array([ 0.99634804, -0.08538492,  0.99979755, -0.02012134, -0.08352754,
       -0.03452268])

In [6]:
d = e.action_space  # three discrete actions: torque of -1, 0 or +1 on the joint between the two links
# space classes such as Discrete are defined in gym.spaces
d

Discrete(3)

In [0]:
from gym.spaces.discrete import Discrete

In [0]:
d = Discrete(3)  # Discrete is a class with .sample and .contains methods

In [10]:
[d.sample() for x in range(10)] #sample generates an action output

[2, 2, 0, 0, 2, 1, 0, 2, 1, 2]

In [9]:
print(d.contains(0), d.contains(3), d.contains(2)) #with .contains method, you can check whether an integer is a valid action

True False True


In [11]:
e.observation_space  # returns a Box, which represents a bounded n-dimensional array (here 6 values)

Box(6,)
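
According to the docstring in the Acrobot source, the six values are `[cos(theta1), sin(theta1), cos(theta2), sin(theta2), thetaDot1, thetaDot2]`, where `theta2` is measured relative to the first link. The sketch below recovers the joint angles from the first observation printed earlier. The helper name `decode_acrobot_obs` is a hypothetical name chosen for illustration.

```python
import math

# First observation printed above, copied from the notebook output.
obs = [0.99634804, -0.08538492, 0.99979755, -0.02012134, -0.08352754, -0.03452268]


def decode_acrobot_obs(obs):
    """Turn the 6-value Acrobot observation into angles and angular velocities."""
    cos1, sin1, cos2, sin2, vel1, vel2 = obs
    return {
        "theta1": math.atan2(sin1, cos1),  # first joint angle in radians
        "theta2": math.atan2(sin2, cos2),  # second joint angle, relative to the first link
        "theta1_dot": vel1,                # angular velocity of the first joint
        "theta2_dot": vel2,                # angular velocity of the second joint
    }


state = decode_acrobot_obs(obs)
print(state)
```

Both angles come out close to zero, which fits a freshly reset episode where both links hang almost straight down.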

In [12]:
e.step(0)

(array([ 0.99693675, -0.07821201,  0.99773058, -0.06733267,  0.1522195 ,
        -0.42821611]), -1.0, False, {})

In [13]:
e.step(0)

(array([ 0.99960235, -0.02819839,  0.98321708, -0.18243949,  0.33449925,
        -0.70401511]), -1.0, False, {})
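
Calling `step` repeatedly until `done` is `True` gives one full episode. The loop below is a minimal sketch of that pattern, assuming the gym 0.17-style 4-tuple return shown above. `run_episode` and `CountdownEnv` are hypothetical names; `CountdownEnv` is only a stand-in so the example runs without gym installed.

```python
import random


def run_episode(env, policy, max_steps=500):
    """Run one episode on an env that has gym 0.17-style reset/step methods."""
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


class CountdownEnv:
    """Hypothetical stand-in environment: the episode ends after `n` steps."""

    def __init__(self, n=3):
        self.n = n
        self.remaining = n

    def reset(self):
        self.remaining = self.n
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, -1.0, done, {}


def random_policy(obs):
    """Ignore the observation and pick one of three actions, like Discrete(3).sample()."""
    return random.randrange(3)


total_reward, steps = run_episode(CountdownEnv(n=3), random_policy)
print(total_reward, steps)
```

With gym 0.17.1 installed, the same function should accept `gym.make('Acrobot-v1')` in place of the stand-in, since it relies only on `reset` and `step`.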

## Monitoring the agent
`gym` provides Monitors through the [wrappers](https://github.com/openai/gym/tree/master/gym/wrappers) module.

**Caution:** the snippet below won't work in Jupyter as is. On a Linux or Unix machine you can run it in python or ipython, which will pop up a display window.  
If you use Windows, getting this Monitor wrapper to work is troublesome. For example, it requires ffmpeg, which is tricky to install on Windows. Alternatively, Windows bash can run ffmpeg but has a display issue: the Linux subsystem in Windows 10 won't connect to a display by default.  
Regardless of your OS, if you still want to display in Jupyter, you can try using `IPython.display` (skip the part about ssh tunneling to the jupyter server if you're running jupyter locally).

In [0]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [18]:
import io
import glob
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!


<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
import gym
from gym import wrappers
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = wrappers.Monitor(env, './video', force=True)
  return env

In [0]:
from gym.spaces.discrete import Discrete
import numpy as np
from numpy import random

# Return the action with the max Q-value, plus that Q-value and its feature vector
def getAction(obs_new, obs, w):
    Qmax = -np.inf  # start below any real Q-value
    actions = [0, 1, 2]  # the three discrete Acrobot actions
    for action in actions:
        f = get_f(obs_new, obs, action)
        Qtemp = np.dot(w, f)  # Q = w1 * f1(s,a) + w2 * f2(s, a) + ...
        if Qtemp > Qmax:
            Qmax = Qtemp
            Amax = action
            fnew = np.copy(f)  # copy so fnew never aliases a later f
    return Amax, Qmax, fnew

# get features in approximate RL equation f(s,a), returns vector f
def get_f(obs_new, obs, action):

    if (obs_new[4]*obs_new[5]) < 0:
        signVel = -1
    else:
        signVel = 1 

    f0 = signVel*(obs_new[2]*obs_new[3])*(action-1)  #when both links have same angular vel direction
    f1 = signVel* obs[5]*(action-1) #when links have opposite angular vel
    #f2 = abs(obs_new[2]) - abs(action-1)
    f2 = obs_new[2]
    #f2 = 1
    #f3 = 1
    #f4 = 1
    f5 = obs_new[5]
    
    
    #f = np.array([f0, f1, f2, f3, f4, f5])
    f = np.array([f0, f1, f2, f5])
    #print('f = ', f)
    return f  #f is an array

# update weights
def update_w(diff, w, f):
    w = w - diff*f  # diff is a scalar, so diff*f scales f element-wise; no learning rate is applied
    return w
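
For comparison with `update_w`, the textbook approximate Q-learning update adds the temporal-difference error to the weights, scaled by a learning rate, and discounts the next Q-value with a separate factor. The sketch below shows that standard form in plain Python. The names `q_value`, `td_update`, `learning_rate` and `discount`, as well as all numeric values, are illustrative assumptions and are not taken from the notebook.

```python
def q_value(w, f):
    """Linear Q-value: Q(s, a) = w1*f1(s, a) + w2*f2(s, a) + ..."""
    return sum(wi * fi for wi, fi in zip(w, f))


def td_update(w, f, reward, max_next_q, learning_rate=0.1, discount=0.9):
    """One textbook approximate Q-learning step on the weight vector.

    sample = r + discount * max_a' Q(s', a')
    diff   = sample - Q(s, a)
    w_i   <- w_i + learning_rate * diff * f_i(s, a)
    """
    sample = reward + discount * max_next_q
    diff = sample - q_value(w, f)
    return [wi + learning_rate * diff * fi for wi, fi in zip(w, f)]


# Illustrative numbers only.
w = [0.5, -0.2]
f = [1.0, 2.0]
new_w = td_update(w, f, reward=-1.0, max_next_q=0.3)
print(new_w)
```

Here the current estimate is Q = 0.1 and the sample is -0.73, so the error of -0.83 pulls each weight down in proportion to its feature value. A small learning rate keeps a single large error from making the weights grow without bound.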
 

    

In [51]:

env = gym.make("Acrobot-v1")
env = wrap_env(env)
obs = env.reset()

total_reward = 0.0
total_steps = 0
alpha = 0.0005
num_weights = 4
best_reward = -np.inf
best_steps = np.inf
w = np.random.sample(num_weights)  # random initial weights in [0, 1)
print('w = ', w)
action = 0
obs_new, r, done, _ = env.step(action)
action, Q0, f0 = getAction(obs_new, obs, w)

print('obs_new = ', obs_new, 'r = ', r, 'done = ', done)
print('action = ', action, ', Q0 = ', Q0, ', f0 = ', f0)

w =  [0.9571516  0.99702155 0.73424886 0.36312148]
obs_new =  [ 0.99806075 -0.06224735  0.99934867 -0.03608647  0.19365789 -0.50314686] r =  -1.0 done =  False
action =  2 , Q0 =  0.6658413632581168 , f0 =  [ 0.03606296  0.08049621  0.99934867 -0.50314686]


In [52]:
for i in range(1):
    
    #print('i= ', i)
    

    while True:
        obs = obs_new
        print('action = ', action)
        obs_new, reward, done, _ = env.step(action)  #action here is "a"
        #print('obs_new = ', obs_new, ' total_steps = ', total_steps)
        #print('reward = ', reward)
        total_reward += reward
        total_steps += 1
        print('total_reward ', total_reward)    
        action, Qmax, fnew = getAction(obs_new, obs, w)  #action is "a prime" 
        sample = reward + alpha * Qmax  # TD target; alpha (0.0005) acts as the discount factor here

        diff = sample - Q0
        w = update_w(diff, w, f0)  #keep old f here
        print('w = ', w)
        print('fnew = ', fnew)
        print('sample = ', sample, 'diff = ', diff)
        print('Q0 = ', Q0, 'Qmax = ', Qmax)
      
        Q0 = np.dot(w,fnew)  #calculate the new Q0 for "a prime" - use fnew
        f0 = np.copy(fnew)

        if done:
            break


    #print('weights = ', w)        
    action = 0
    obs = env.reset()
    obs_new, r, done, _ = env.step(action)
    action, Q0, f0 = getAction(obs_new, obs, w)        
    
print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
env.close()
show_video()
env.env.close()

action =  2
total_reward  -1.0
w =  [ 1.01720362  1.13106378  2.39836352 -0.4747183 ]
fnew =  [ 0.09904391  0.50314686  0.99503374 -0.11794906]
sample =  -0.9993578895741915 diff =  -1.6651992528323083
Q0 =  0.6658413632581168 Qmax =  1.2842208516171547
action =  2
total_reward  -2.0
w =  [ 1.42437858  3.19952815  6.48900184 -0.95961336]
fnew =  [0.07998908 0.11794906 0.99677494 0.30921159]
sample =  -0.9987706933275512 diff =  -4.111054899813087
Q0 =  3.1122842064855365 Qmax =  2.4586133448976777
action =  2
total_reward  -3.0
w =  [ 2.03703381  4.1029278  14.12353602  1.40871105]
fnew =  [0.0195348  0.30921159 0.9998091  0.66951767]
sample =  -0.9965687794324831 diff =  -7.6592356909145956
Q0 =  6.662666911482113 Qmax =  6.862441135033727
action =  0
total_reward  -4.0
w =  [ 2.37623251  9.47202188 31.48404061 13.03409487]
fnew =  [0.10708497 0.66951767 0.99418218 0.19903443]
sample =  -0.991356565900078 diff =  -17.363819262052946
Q0 =  16.37246269615287 Qmax =  17.28686819984403
ac