# [ Workshop 1 ] - NCU AI Math Group
# Intro to Reinforcement Learning - Cartpole
2019/10/17

**[ Reference ]**
1. Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, 2018. http://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf
2. UC Berkeley, CS188 Spring 2014, "Lecture 10 Reinforcement Learning I". https://youtu.be/IXuHxkpO5E8
3. Denny Britz, "Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course." https://github.com/dennybritz/reinforcement-learning
4. OpenAI Gym, "A toolkit for developing and comparing reinforcement learning algorithms." https://github.com/openai/gym

## < CONTENTS >
#### 1. [Gym from OpenAI](#GymfromOpenAI)
#### 2. [CartPole-v0](#CartPole-v0)

--------------------------------------
<a id='GymfromOpenAI'></a>
## 1. Gym from OpenAI 
+ https://gym.openai.com/docs/
+ **`Gym` is a toolkit for developing and comparing reinforcement learning algorithms.** It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

+ The `gym` library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

+ **[ Installation ]**
    + To get started, you’ll need to have Python 3.5+ installed. Simply install gym using pip:

    + `pip install gym`

In [3]:
 !pip install gym

Collecting gym
  Downloading https://files.pythonhosted.org/packages/d2/88/a7186ffe1f33570ad3b8cd635996e5a3e3e155736e180ae6a2ad5e826a60/gym-0.15.3.tar.gz (1.6MB)
Collecting pyglet<=1.3.2,>=1.2.0 (from gym)
  Downloading https://files.pythonhosted.org/packages/1c/fc/dad5eaaab68f0c21e2f906a94ddb98175662cc5a654eee404d59554ce0fa/pyglet-1.3.2-py2.py3-none-any.whl (1.0MB)
Collecting cloudpickle~=1.2.0 (from gym)
  Downloading https://files.pythonhosted.org/packages/c1/49/334e279caa3231255725c8e860fa93e72083567625573421db8875846c14/cloudpickle-1.2.2-py2.py3-none-any.whl
Building wheels for collected packages: gym
  Running setup.py bdist_wheel for gym ... done
  Stored in directory: /Users/willsu/Library/Caches/pip/wheels/8a/71/10/30f9b16332ecfd6318ac290445c696fe809bcbe40a05f9a799
Successfully built gym

### Available Environments from Gym
> + `Classic control` (https://gym.openai.com/envs/#classic_control)
+ `Toy text` (https://gym.openai.com/envs/#toy_text)
+ `Algorithmic` (https://gym.openai.com/envs/#algorithmic) 
+ `Atari` (https://gym.openai.com/envs/#atari) 
+ `2D and 3D robots` (https://gym.openai.com/envs/#mujoco)

--------------------------------------
<a id='CartPole-v0'></a>
## 2. CartPole-v0
> + **cartpole.py** : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L75
+ **CartPole-v0** (https://gym.openai.com/envs/CartPole-v0/#barto83)
    + A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. 
    + The pendulum starts upright, and the goal is to prevent it from falling over. 
    + A reward of +1 is provided for every timestep that the pole remains upright. 
    + The episode ends when the pole is more than 12 degrees from vertical (the threshold used in `cartpole.py`; the Gym environment page states 15 degrees), or the cart moves more than 2.4 units from the center.
    + CartPole-v0 defines "solving" as getting an average reward of at least 195.0 over 100 consecutive trials.
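
The "solving" criterion above amounts to a rolling average over per-episode total rewards. A minimal sketch, assuming a hypothetical helper named `is_solved` (not part of Gym) that takes a plain list of episode returns; the defaults `window=100` and `threshold=195.0` come from the description above.

```python
from collections import deque


def is_solved(episode_returns, window=100, threshold=195.0):
    """Return True once any `window` consecutive episodes average >= threshold.

    `is_solved` is a hypothetical helper for illustration, not a Gym function.
    """
    recent = deque(maxlen=window)      # holds only the last `window` returns
    running_sum = 0.0
    for ret in episode_returns:
        if len(recent) == window:
            running_sum -= recent[0]   # this value is evicted by the append below
        recent.append(ret)
        running_sum += ret
        if len(recent) == window and running_sum / window >= threshold:
            return True
    return False
```

Since every timestep yields a reward of +1, an episode's return equals its length, so a list of episode lengths can be passed directly.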

**This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson (1983):** 
+ A. G. Barto, R. S. Sutton, and C. W. Anderson, "**Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems**", IEEE Transactions on Systems, Man, and Cybernetics, 1983.
+ http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf

The environment’s `step` function takes an action and returns four values:

>+ `observation` (**object**): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
+ `reward` (**float**): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
+ `done` (**boolean**): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
+ `info` (**dict**): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
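
The loop can be sketched without Gym at all. The example below assumes a hypothetical toy environment, `CoinFlipEnv`, that only mimics the `reset`/`step` interface described above (observation, reward, done, info); the helper `run_episode` is likewise illustrative and not a Gym function.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment mimicking the reset/step interface.

    The episode ends after `max_steps` steps. The reward is 1.0 when the
    action matches a hidden coin flip and 0.0 otherwise.
    """

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                      # observation: current step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.max_steps
        info = {"coin": coin}              # diagnostic information only
        return self.t, reward, done, info


def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(observation)       # the agent chooses an action
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total, steps = run_episode(CoinFlipEnv(), policy=lambda obs: 0)
print(total, steps)
```

Because `run_episode` relies only on `reset` and `step`, the same function would work with any environment exposing that four-value interface.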



> **[NOTE]**:
+ **The general-purpose agents don't need to know the semantics of the observations: they can learn how to map observations to actions to maximize reward without any prior knowledge.**


> + 4 numbers in `observation`:
**[position of cart, velocity of cart, angle of pole, rotation rate of pole]**. 
+ Defined at https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L75
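
To make the four observation entries concrete, the sketch below unpacks them into named fields and applies a hand-written rule: push the cart toward the side the pole is leaning. The names `CartPoleObs` and `lean_policy` are hypothetical, and the rule assumes that a positive angle means the pole leans toward positive x and that action 1 pushes right. A learning agent would not need this knowledge; the rule only illustrates what the numbers mean.

```python
from collections import namedtuple

# Field names are illustrative; Gym itself returns a plain array of 4 numbers.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)


def lean_policy(observation):
    """Return 1 (push right) if the pole leans right, else 0 (push left)."""
    obs = CartPoleObs(*observation)
    return 1 if obs.pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))    # pole leans right -> 1
print(lean_policy([0.0, 0.0, -0.05, 0.0]))   # pole leans left  -> 0
```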

In [4]:
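import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()                # start a new episode
    for t in range(100):                     # cap each episode at 100 steps
        env.render()                         # draw the current state in a window
        print(observation)
        action = env.action_space.sample()   # pick a random action (0 or 1)
        observation, reward, done, info = env.step(action)
        if done:
            # `reward` is the reward of the final step only, not the episode total
            print("Episode {} finished after {} timesteps : Reward {}".format(i_episode, t+1, reward))
            break
env.close()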

[-0.00974793  0.01805022 -0.01929365 -0.00355016]
[-0.00938693 -0.1767898  -0.01936465  0.28298349]
[-0.01292272  0.01860293 -0.01370498 -0.0157435 ]
[-0.01255066  0.21391872 -0.01401985 -0.31271881]
[-0.00827229  0.40923756 -0.02027423 -0.60978998]
[-8.75389532e-05  6.04636962e-01 -3.24700305e-02 -9.08789067e-01]
[ 0.0120052   0.40996921 -0.05064581 -0.62648586]
[ 0.02020458  0.21558941 -0.06317553 -0.35017353]
[ 0.02451637  0.02142016 -0.070179   -0.07806193]
[ 0.02494478  0.21747433 -0.07174024 -0.39203498]
[ 0.02929426  0.0234399  -0.07958094 -0.12280555]
[ 0.02976306 -0.17045706 -0.08203705  0.1437473 ]
[ 0.02635392  0.0257381  -0.0791621  -0.17364754]
[ 0.02686868 -0.16816689 -0.08263505  0.09304971]
[ 0.02350534  0.02803628 -0.08077406 -0.22451783]
[ 0.02406607  0.22421419 -0.08526442 -0.54154703]
[ 0.02855035  0.42042456 -0.09609536 -0.85983075]
[ 0.03695884  0.61671476 -0.11329197 -1.18111622]
[ 0.04929314  0.42323079 -0.1369143  -0.92598744]
[ 0.05775776  0.23019688 -0.155434

[ 0.05129308  0.36453281  0.01160987 -0.51756433]
[ 0.05858373  0.55948938  0.00125859 -0.80656625]
[ 0.06977352  0.3643502  -0.01487274 -0.51348769]
[ 0.07706052  0.55967843 -0.02514249 -0.81082006]
[ 0.08825409  0.36490979 -0.04135889 -0.52615055]
[ 0.09555229  0.56058857 -0.0518819  -0.83157376]
[ 0.10676406  0.75637974 -0.06851338 -1.14011177]
[ 0.12189166  0.95232718 -0.09131561 -1.45347054]
[ 0.1409382   0.75843759 -0.12038503 -1.19065737]
[ 0.15610695  0.95489598 -0.14419817 -1.51851992]
[ 0.17520487  1.15143716 -0.17456857 -1.85251947]
Episode 8 finished after 15 timesteps : Reward 1.0
[ 0.0026991  -0.02615602  0.03792886  0.01853285]
[ 0.00217598 -0.22180081  0.03829952  0.32293738]
[-0.00226003 -0.02724457  0.04475827  0.04257434]
[-0.00280492  0.16720794  0.04560976 -0.23565777]
[ 0.00053924 -0.02853498  0.0408966   0.07105588]
[-3.14643807e-05  1.65977523e-01  4.23177176e-02 -2.08448659e-01]
[ 0.00328809  0.36046962  0.03814874 -0.48748793]
[ 0.01049748  0.16483076  0.02839

[-0.03280672 -0.22233441  0.02989641  0.35387864]
[-0.03725341 -0.02765003  0.03697398  0.07077082]
[-0.03780641  0.16692286  0.0383894  -0.21002119]
[-0.03446795  0.36147548  0.03418898 -0.49035138]
[-0.02723844  0.16588833  0.02438195 -0.1870925 ]
[-0.02392067  0.36065311  0.0206401  -0.47198526]
[-0.01670761  0.55547755  0.01120039 -0.75809191]
[-0.00559806  0.75044337 -0.00396144 -1.04722946]
[ 0.00941081  0.55537422 -0.02490603 -0.75579269]
[ 0.02051829  0.75083046 -0.04002189 -1.05620774]
[ 0.0355349   0.55626112 -0.06114604 -0.77635072]
[ 0.04666012  0.36203104 -0.07667306 -0.50351582]
[ 0.05390074  0.16806872 -0.08674337 -0.23594691]
[ 0.05726212 -0.02571372 -0.09146231  0.02816328]
[ 0.05674784 -0.21941311 -0.09089905  0.29064549]
[ 0.05235958 -0.02312053 -0.08508614 -0.02926532]
[ 0.05189717 -0.21692576 -0.08567144  0.23540569]
[ 0.04755866 -0.02069091 -0.08096333 -0.08302363]
[ 0.04714484 -0.21456456 -0.0826238   0.18305699]
[ 0.04285355 -0.01836353 -0.07896266 -0.13450434]


### Spaces
> + In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? 
+ Every environment comes with an `action_space` and an `observation_space`. 
+ These attributes are of type `Space`, and they describe the format of valid actions and observations:

In [8]:
import gym
env = gym.make('CartPole-v0')
print(env.action_space)         #> Discrete(2)
print(env.observation_space)    #> Box(4,)

Discrete(2)
Box(4,)


> + The `Discrete` space allows a fixed range of non-negative integers, so in this case valid `actions` are either 0 or 1. 
+ The `Box` space represents an n-dimensional box, so valid `observations` will be an array of 4 numbers. We can also check the `Box`’s bounds:

In [9]:
print(env.observation_space.high)
#> [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
print(env.observation_space.low)
#> [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]

[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
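
The printed bounds are wider than the failure limits: in `cartpole.py` the position and angle bounds are set to twice the termination thresholds, and the velocity bounds are the largest finite float32 value. A small arithmetic check, using only the numbers printed above (variable names are illustrative):

```python
import math

angle_bound_rad = 0.41887903      # third entry of observation_space.high
position_bound = 4.8              # first entry of observation_space.high

print(round(math.degrees(angle_bound_rad), 3))       # 24.0
print(round(math.degrees(angle_bound_rad / 2), 3))   # 12.0
print(position_bound / 2)                            # 2.4

# Largest finite float32, computed without NumPy: (2 - 2**-23) * 2**127
float32_max = (2 - 2 ** -23) * 2.0 ** 127
print(float32_max)
```

So an observation can still lie inside the `Box` on the step where the episode terminates.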


> + This introspection can be helpful to write generic code that works for many different environments. `Box` and `Discrete` are the most common `Spaces`. You can sample from a `Space` or check that something belongs to it:

In [10]:
from gym import spaces
space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8

> + For `CartPole-v0`, action 0 pushes the cart to the left and action 1 pushes it to the right.

### [NOTE]: 
+ **Watch the CartPole experiment in the visualization window opened by `env.render()`.**
+ **The episode fails when the pole tilts more than 12 degrees from vertical (the threshold in `cartpole.py`) or the cart moves more than 2.4 units from the center!**
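
The failure condition can be written as a check on a single observation. This is a sketch, not Gym's own code: `episode_failed` is a hypothetical helper, and the angle limit is assumed to be the 12 degrees used in `cartpole.py`, which matches the printed angle bound of 0.41887903 rad (twice 12 degrees).

```python
import math

X_THRESHOLD = 2.4                       # cart position limit, in track units
THETA_THRESHOLD = math.radians(12)      # assumption: 12 degrees, as in cartpole.py


def episode_failed(observation):
    """Return True if the cart or the pole is outside the allowed range."""
    cart_position, _, pole_angle, _ = observation
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD


# Last observation printed before "Episode 8 finished": still inside the limits,
# so the failure happened on the step that followed it.
print(episode_failed([0.17520487, 1.15143716, -0.17456857, -1.85251947]))  # False
print(episode_failed([0.0, 0.0, 0.25, 0.0]))                               # True
```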