# This is basics tutorial on GYM - a Reinforcement Learning framework with Python

In [1]:
import gym

In [2]:
"""Get your environment"""
env = gym.make('CartPole-v0')

[2017-09-22 12:18:51,866] Making new env: CartPole-v0


For more information about environments, visits:  <br>
https://gym.openai.com/envs/

** For reinforcement learning, we're facing a scenario that we do not have any information about future reward or the transistion function T(s,a,s'), thus it is important to get samples (episodes)** <br>


In [3]:
#put ourselves in start states
#also return a state
env.reset()

array([-0.03443529, -0.0317131 , -0.01125142,  0.04994234])

You might wonder what do the states represent? Let's look at the images:<br>
<img src="cartpole.png"> 

<font color = blue >**System (short) description** </font> : a pole is attached to a cart, when the cart moves, the pole will have a swinging motion follow, our input to the cart is either force, accleration,etc... For more details about the system and what we are about to do, I encourage you to visit this link: https://www.youtube.com/watch?v=Lt-KLtkDlh8  <br>

<font color = blue >**Goal (short) description** </font> : Keep the pole stand upward by applying correct force to the cart. <br>

For the state we've been shown, here's a details: <br>
index: 0 - cart position | 1 - cart velocity | 2 - pole angle | 3 - pole velocity <br>
**For almost dynamical systems, position and velocity are defined as the state of them.**
What about acceleration? For short and simple answer, dynamical systems are often represented with the following differential equations, given state vector (postion, velocity) q and input (force, acceleration, etc...) u: <br>
\begin{align}
\dot{q} & = A(q,u) \\
\end{align}

If we recall, derivative of velocity of acceleration, and that is represented in the RHS of our equation. <br>

For the control theory (if you known something about this), we know the function dynamics function A (which is somehow related to the transistion function), but in reinforcement learning scenario, little to none we have any information about it.



In [4]:
box = env.observation_space

In [7]:
#To observe, press tab after box. for more information

In [9]:
env.action_space # 2 types of action: push left, push right

Discrete(2)

In [12]:
env.action_space.n # Seems like the input is not continuous....

2

In [15]:
"""Let's play an episode:"""
obser, reward, done,info = env.step(action=0)

In [16]:
"""Let's see them"""
obser

array([-0.03960299, -0.4216465 , -0.00347149,  0.62848651])

In [17]:
reward

1.0

In [18]:
done

False

In [19]:
info

{}

In [22]:
done = False
env.reset()
while not done:
    action = env.action_space.sample()
    obser, reward, done,info = env.step(env.action_space.sample())
    print 'state: ', obser,' actions: ', action

state:  [-0.03741105 -0.18824982  0.03783748  0.31403223]  actions:  1
state:  [-0.04117604  0.00631327  0.04411813  0.03351805]  actions:  0
state:  [-0.04104978  0.2007757   0.04478849 -0.24492523]  actions:  0
state:  [-0.03703426  0.39523027  0.03988998 -0.52315117]  actions:  1
state:  [-0.02912966  0.58976876  0.02942696 -0.80300223]  actions:  1
state:  [-0.01733428  0.7844751   0.01336691 -1.08628502]  actions:  0
state:  [-0.00164478  0.97941822 -0.00835879 -1.37474381]  actions:  0
state:  [ 0.01794358  1.17464363 -0.03585366 -1.67002916]  actions:  1
state:  [ 0.04143646  0.9799562  -0.06925424 -1.38872441]  actions:  1
state:  [ 0.06103558  0.78576213 -0.09702873 -1.11847603]  actions:  1
state:  [ 0.07675082  0.98201389 -0.11939825 -1.43995075]  actions:  0
state:  [ 0.0963911   0.78854795 -0.14819727 -1.18683766]  actions:  1
state:  [ 0.11216206  0.59562524 -0.17193402 -0.94403625]  actions:  1
state:  [ 0.12407456  0.40318418 -0.19081475 -0.70993069]  actions:  1
state:

# Random Search:

Oftenly, we would like to find optimal weights to obtain the maximum (minimum) values. Now, we will not talk about gradient descent, instead, we are going to implement a random search algorithm. The pseudocode for random search for finding maxima is as follows: <br>

--------------
1. Initialize x (weights) & calculate f(x)<br>
2. While not satisfied: <br>
          -Randomly pick new weight values y
          -If (f(y) > f(x))
                set: x = y
---------------

<font color = green > **As we may figured out, doing this way is extremly inefficient (space complexity, no heuristic, etc...). However, as introduction to gym, this is a good thing to try out.** </font>


In [23]:
import numpy as np #our old friends


In [26]:
mu = 0
sigma = 1
w = np.random.normal(mu, sigma, 4)
print w

[ 0.68716042 -0.18711006  1.45110201 -1.09870933]


In [None]:
#Our policy is simple, if the state * w >0, take action 1, else 0
#Let's "train" our model
