# Introduction to Reinforcement Learning with OpenAI gym

In [1]:
import tensorflow as tf
import gym
import numpy as np
import time
from scipy.stats import zscore as z_transform
tf.config.set_visible_devices([], 'GPU')

To get started, create a new environment! CartPole is a game where a cart moves left or right along a frictionless track to try to balance a pole placed on top.

In [2]:
env = gym.make('CartPole-v1')

The environment is the system in which our AI will learn. We recieve information from the environment by making observations, and influence the environment by taking actions. What do these observations look like?

In [3]:
print(env.observation_space.shape)
print(env.observation_space.high)
print(env.observation_space.low)

(4,)
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


The observation is 4 dimensional with the upper and lower bounds show. What are these dimensions? We will try to find this out later!

In [4]:
print(env.action_space)
print([env.action_space.sample() for _ in range(10)])

Discrete(2)
[0, 1, 1, 0, 0, 0, 1, 0, 1, 0]


The action space allows us to take two actions: applying a momentary force to the left or right. But which is which? Let's take a look at the environment:

In [5]:
# reset the environment to start a new session, with the pole near vertical and the cart in the middle.
env.reset()
for _ in range(30):
    env.render() # prints the environment to another window
    #observation, reward, done, info = env.step(0)
    #observation, reward, done, info = env.step(1)
    observation, reward, done, info = env.step(env.action_space.sample()) # choose a random action and send it to the env!
    print(observation, reward, done, info)
    time.sleep(0.5)
env.close() # this is the only way to close the window!

[ 0.02723134  0.19014457  0.04552725 -0.25160991] 1.0 False {}
[ 0.03103423  0.38458783  0.04049505 -0.5295922 ] 1.0 False {}
[ 0.03872599  0.1889203   0.02990321 -0.22442923] 1.0 False {}
[ 0.0425044  -0.00661599  0.02541462  0.07753432] 1.0 False {}
[ 0.04237208 -0.20209288  0.02696531  0.37812606] 1.0 False {}
[ 0.03833022 -0.00736406  0.03452783  0.09406586] 1.0 False {}
[ 0.03818294  0.18724643  0.03640915 -0.18752673] 1.0 False {}
[ 0.04192787  0.38182907  0.03265861 -0.46850532] 1.0 False {}
[ 0.04956445  0.18626132  0.02328851 -0.16571021] 1.0 False {}
[ 0.05328967  0.3810423   0.0199743  -0.45095624] 1.0 False {}
[ 0.06091052  0.18564363  0.01095518 -0.15204465] 1.0 False {}
[ 0.06462339 -0.00963346  0.00791429  0.14407416] 1.0 False {}
[ 0.06443072 -0.20486785  0.01079577  0.43924336] 1.0 False {}
[ 0.06033337 -0.40014092  0.01958064  0.73530979] 1.0 False {}
[ 0.05233055 -0.20529485  0.03428683  0.44885305] 1.0 False {}
[ 0.04822465 -0.01067424  0.04326389  0.16717206] 1.0 F



[-0.06049362 -0.99994047  0.26230139  1.99032628] 0.0 True {}
[-0.08049243 -1.19668391  0.30210791  2.3515826 ] 0.0 True {}
[-0.10442611 -1.3930028   0.34913956  2.72019924] 0.0 True {}
[-0.13228617 -1.20165981  0.40354355  2.55107536] 0.0 True {}


Maybe if you're brighter than me you can figure out exactly what each of these dimensions mean. Luckily, we don't care! Our machine will figure it out. What's important is that we see that we are getting a reward of 1 at each time step that done = False, and after done = True, we get no more reward. This occurs when the angle of the pole is too large, or the cart has drifted too far from the center. This is the end of a "session".

To solve the problem of what action to take given an observation, we will use a neural network. For our current purposes, you can think of it as a magical black box that converts an input (4-dimensional observation) to an output (probabilities of taking each action in the 2-dimensional action space).

In [6]:
model = tf.keras.Sequential([tf.keras.layers.Input(shape=env.observation_space.shape),
                             tf.keras.layers.Dense(4, activation='sigmoid'),
                             tf.keras.layers.Dense(env.action_space.n, activation=tf.nn.softmax)])

Congratulations, you just made a neural network! The code above uses a high-level interface called keras to easily generate models. This neural network has an input layer that takes in the observations, a hidden middle layer which has magical unknown abilities, and an output layer that corresponds to probabilities of actions. Unfortunately, it is not smart:

In [7]:
# reset the environment to start a new session, with the pole near vertical and the cart in the middle.
observation = env.reset()
for _ in range(30):
    env.render()
    # convert the observation into a form that the model can use as an input
    action_dist = model.predict(tf.convert_to_tensor(tf.expand_dims(observation, 0)))[0]
    # sample from the action space
    action = int(np.random.choice(np.arange(len(action_dist)), p=action_dist))
    # send that action to the environment to see what happens
    observation, reward, done, info = env.step(action) 
    time.sleep(0.5)
env.close() # this is the only way to close the window!

Because our model (that is, our AI, or our neural network) has been randomly initialized, its behaviour should also be pretty random. It could also be biased and only move the cart one direction. Our goal is to have it make educated choices about what action to take given an observation. The first step in this goal is to run a bunch of sessions and see how the model is performing. Then we will know how to adjust its behaviour. Here's what a function to run a session looks like:

In [8]:
def run(render=False):
    all_obs = []
    all_ac = []
    rw = 0
    observation = env.reset()

    while True:
        if render:
            env.render()
        # get the distribution over action space from the model
        action_dist = model.predict(tf.convert_to_tensor(tf.expand_dims(observation, 0)))[0]
        # sample from the action space
        action = int(np.random.choice(np.arange(len(action_dist)), p=action_dist))
        # get the information about the state of the system following the action
        observation, reward, done, info = env.step(action)
        all_obs.append(observation)
        all_ac.append(action)
        rw += reward
        # done specifies that the session is over, usually due to a win or loss
        if done:
            break
    env.close()

    # return the observations and rewards (the reward will be discounted later)
    return all_obs, all_ac, rw

In [9]:
run()

([array([-0.0082659 ,  0.19769832,  0.03488744, -0.2875753 ]),
  array([-0.00431194,  0.00209666,  0.02913593,  0.01590347]),
  array([-0.00427   , -0.19343076,  0.029454  ,  0.31763487]),
  array([-0.00813862,  0.00125957,  0.0358067 ,  0.03438427]),
  array([-0.00811343,  0.19585023,  0.03649438, -0.2467897 ]),
  array([-0.00419642,  0.39043248,  0.03155859, -0.52774177]),
  array([ 0.00361223,  0.19488105,  0.02100375, -0.225284  ]),
  array([ 0.00750985, -0.00053469,  0.01649807,  0.07394961]),
  array([ 0.00749916,  0.19434691,  0.01797707, -0.21348279]),
  array([ 0.01138609, -0.00102739,  0.01370741,  0.08481627]),
  array([ 0.01136555,  0.19389542,  0.01540374, -0.20351062]),
  array([ 0.01524345,  0.38879373,  0.01133352, -0.49129489]),
  array([ 0.02301933,  0.19351376,  0.00150762, -0.19506176]),
  array([ 0.0268896 , -0.00162972, -0.00239361,  0.09809638]),
  array([ 0.02685701,  0.19352645, -0.00043168, -0.19534076]),
  array([ 0.03072754,  0.38865457, -0.0043385 , -0.4881

Essentially, we're just collecting information in a bunch of lists to use for training. We have made the decision to count the reward for the session as the sum of rewards recieved throughout the session: the higher the final reward, the longer the pole was balanced. Also, we stop getting rewards when 'done' (when the pole has rotated a certain angle from normal), so we will stop the session at that point.

A problem we now encounter is that there is no real association between rewards recieved and specific actions taken; a reward is given based on the collection of all actions taken in a single session, but which actions contributed to success? Here we will make another simplifying assumption: the actions taken closest to the end of the session matter the most. This isn't the case for every problem, but in this case you can imagine that the decisions made when the pole is tipping are the most important. How do we account for this? We make the rewards smaller the further away from the end of the session they occur, 'discounting' them.

In [10]:
def discount(rw, gamma=0.9):
    # weight individual rewards with an exponential decay function
    # since the magnitude should be largest close to the end of the session, apply the weights in reverse order
    weights = np.array([gamma**(rw.shape[0]-i-1) for i in range(rw.shape[0])])
    discounted = tf.convert_to_tensor(weights, dtype=tf.float32) * rw
    return discounted

Here, rw is a list of the same length as the number of frames in a corresponding session. These represent the reward given for the actions taken in each frame. We discount the rewards by weighting them with an exponential decay function, so that the action in the last frame of the session gets a full reward, but actions that happened early on get almost no reward.

In our current framework, we are handing out all rewards and no punishments. This might work, but it will stabilize training to cause the rewards to have a mean of zero. That means some of the rewards will be negative, or more like punishments.

In [11]:
def normalize(rw_list):
    rw_norm = [r - np.mean(rw_list) for r in rw_list]
    return rw_norm

So how do these collected observations, actions, and discounted rewards apply to the model? They are connected by a loss function, which determines how the parameters of the model should change to better react to its environment:

In [12]:
def compute_loss(obs, ac, rw):
    y_pred = model(obs)
    y_true = ac
    # binary cross-entropy is just one kind of loss function that is larger when y_true and y_pred are more different
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    # element-wise multiplication
    rw_weighted_loss = tf.math.multiply(loss, rw)
    return rw_weighted_loss

The 'loss' is a function of an observation, an associated action, and the received reward for that action in that context. y_pred is the probability distribution (over the two actions) that the model predicts, while y_true is the action that it actually chose. y_pred is binomial and sums to 1, something like [0.25, 0.75], while y_true is binomial and discrete, like [0, 1]. If y_pred was [0.25, 0.75], that means the force-right action has a 75% possibility of being sampled, and we see that in the case of [0, 1], it was indeed chosen.

So what is the point of this loss function? Well, let's say that we are looking at one single frame of a session, where we think the chosen action led to a high reward, such that rw is positive and large. We want to reinforce the selected action, to make it more probable to occur in the future. So we want y_pred to closer match y_true (increasing its probability), and we tell our model to make it so by changing its parameters to minimize this loss. Ultimately, this is just a fancy optimization task. This loss function is computed for every frame of every section, each with its own action, observation, and discounted reward. That's a lot of data to use to improve!

How do you think we would want to change the probability distribution (with respect to the chosen action) if the reward was negative?

Speaking of optimization and optimization of optimizers, one of the central hyperparameters we can control in our algorithm is something called the "learning rate" of the optimizer. A hyperparameter is a parameter that controls other parameters, such as the parameters used in the neural network (the weights and biases). 

The job of the optimizer is to use the collected information from the loss function to determine how to update the parameters of the model. You may have heard that neural networks use something called 'gradient decent' to optimize their mapping from inputs to outputs. The optimizer we will used, called 'Adam' is a fancier version of gradient decent. The hyperparameter "learning rate" controls how quickly the model parameters change in response to new training data.

In [13]:
optimizer = tf.keras.optimizers.Adam(0.01)

What do you think might happen if the learning rate is very large or small? Are there consequences to quickly changing the parameters in response to new data? What is the ideal balance? It turns out that for some tasks (this one included) performance is quite dependent on this hyperparameter.

At this point, we are surprisingly close to being finished! The last step is to put everything together in a training loop. This loop will take our existing functions and tie them all together. 

    -First, we will run a number of sessions and collect all of the data: observations, actions, and rewards.

    -Then, we will normalize the rewards. If you have been paying attention, you may have noticed that until now, the 
    rewards are all positive! We want to punish failure and reward success for stable training. We then discount the rewards 
    for each section.

    -Then, we will compute the loss. This code is vectorized, enabling us to use matrix math to compute the loss for a whole 
    session all at once! The gradient tape tool will keep track of how all the parameters need to change to reduce the loss. 
    All we need to do is calculate the loss "with" the gradient tape, and this information will be saved.

    -Finally, we apply the gradients to our neural network, changing its parameters to minimize the loss, and hopefully, 
    respond more appropriately to its environment!

In [16]:
def batch_train(model, batch_size, render):
    obs_list = []
    ac_list = []
    rw_list = []

    for _ in range(batch_size):
        obs, ac, rw = run(render)
        obs_list.append(obs)
        ac_list.append(ac)
        rw_list.append(rw)
    
    # track how the model is doing!
    print(np.mean(rw_list))

    # normalize rewards. this will cause the worst performing sessions of the batch to recieve negative rewards
    # while the best performing ones recieve positive rewards
    rw_norm = normalize(rw_list)
    # convert to tensors to discount. each frame of a session is initially assigned the same reward
    # which is the sum of the rewards obtained at each frame of the session
    rw_tensors = [tf.ones(shape=len(obs_list[i])) * rw for i, rw in enumerate(rw_norm)]
    # list of discounted rewards in session
    rw_discount = [discount(rw) for rw in rw_tensors]
    # list of observations
    obs_tensors = [tf.convert_to_tensor(obs, dtype=tf.float32) for obs in obs_list]
    # list of actions
    ac_tensors = [tf.one_hot(ac, depth=env.action_space.n) for ac in ac_list]

    gradients = []

    # for each session of observations and discounted rewards
    for obs, act, rw in zip(obs_tensors, ac_tensors, rw_discount):
        # compute the gradient of the selected actions with respect to the observations of the environment
        with tf.GradientTape() as tape:
            loss = compute_loss(obs, act, rw)
            loss = tf.convert_to_tensor(loss, dtype=tf.float32)
        g = tape.gradient(loss, model.trainable_variables)
        # collect all the gradients, instead of applying them at each step which would give inaccurate rewards
        gradients.append(g)
    avg_gradients = []

    # each k represents a kernel's gradients
    # can't do this as a tensor because shapes change
    for k in range(len(gradients[0])):
        # get all of the gradients associated to one kernel
        t = tf.convert_to_tensor([grad[k] for grad in gradients])
        t = tf.reduce_mean(t, axis=0)
        avg_gradients.append(t)

    # apply the gradients to their respective variables
    # can't do this until all gradients have been calculated
    optimizer.apply_gradients(zip(avg_gradients, model.trainable_variables))
    
    # track how the model weights change over time
    # print(model.weights)

    return 1

The majority of steps not discussed just involve reshaping the data into a form that can be used by tensorflow.

In [17]:
for _ in range(10):
    batch_train(model=model, batch_size=10, render=False)

21.8
20.6
17.2
18.2
23.1
23.2
18.8
19.8
22.5
23.8


Why don't you think the reward goes up consistently? You can render the model if it helps.

How could the reward system be changed so that the algorithm was discouraged from moving the cart so much?