
The code below is from the OpenAI "Spinning Up" tutorial:

https://spinningup.openai.com/en/latest/index.html

Go through the background information first:
https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

We won't go into Part 2, but we will look at Part 3:
https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html

This covers some RL background and walks through the code for REINFORCE. Your homework will then be to add "advantage" to it, going from REINFORCE to Vanilla PG.

The section below is the code for the simplest policy gradient, which Spinning Up builds from "the simplest equation describing the gradient of policy performance with respect to policy parameters" (this is the REINFORCE algorithm):

https://github.com/openai/spinningup/blob/master/spinup/examples/pytorch/pg_math/1_simple_pg.py
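
For reference, here is the equation that quote refers to, as I read it in Part 3 of the Spinning Up introduction. The symbols are theirs: $\tau$ is a trajectory sampled by running the policy $\pi_\theta$, $R(\tau)$ is the total return of that trajectory, and $\mathcal{D}$ is the batch of collected trajectories. The second line is the sample estimate that the code approximates.

```latex
\nabla_\theta J(\pi_\theta)
  = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]

\hat{g}
  = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)
```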


In [2]:
import torch
import torch.nn as nn
from torch.distributions.categorical import Categorical
from torch.optim import Adam
import numpy as np
import gym
from gym.spaces import Discrete, Box


After the imports, the function below dynamically builds a fully connected MLP (a stack of nn.Linear layers) from the list of sizes passed in. It uses tanh as the activation on the hidden layers and no activation (nn.Identity) on the output layer.

In [3]:
def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
    # Build a feedforward neural network.
    # Note it's fully connected, with however many layers `sizes` describes:
    # we take the sizes in consecutive pairs as (in_features, out_features).
    layers = []
    for j in range(len(sizes)-1):
        act = activation if j < len(sizes)-2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j+1]), act()]
    return nn.Sequential(*layers)
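
To make the layer layout concrete, here is a small sketch that builds the network with CartPole-like sizes and pushes a fake batch through it. The sizes `[4, 32, 2]` and the all-zeros batch are just illustrative values I picked. `mlp` is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
    # Same function as above, repeated so this snippet is self-contained.
    layers = []
    for j in range(len(sizes)-1):
        act = activation if j < len(sizes)-2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j+1]), act()]
    return nn.Sequential(*layers)

# Illustrative sizes: 4 observation dims, one hidden layer of 32 units, 2 actions.
net = mlp([4, 32, 2])
print(net)  # Linear(4, 32) -> Tanh -> Linear(32, 2) -> Identity

# A batch of 5 fake observations gives one row of 2 logits per observation.
logits = net(torch.zeros(5, 4))
print(logits.shape)  # torch.Size([5, 2])
```

The output layer is left without an activation because the policy code later feeds these raw values to `Categorical(logits=...)`, which does the normalization itself.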


The "simplest" code below doesn't use advantage; it does just the policy gradient part. You can add the advantage for homework. The full code for Vanilla PG (with advantage) is here: https://github.com/jachiam/rl-intro/blob/master/pg_cartpole.py
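
As a possible first step toward the homework, Part 3 of Spinning Up describes "reward-to-go": weight each action only by the rewards that came after it, instead of by the whole episode return. The sketch below is a minimal version of that idea, not the homework solution. The helper name `reward_to_go` and the toy reward list are my own assumptions and are not defined anywhere in this notebook.

```python
import numpy as np

def reward_to_go(rews):
    """For each step t, return the sum of rewards from t to the end of the episode."""
    n = len(rews)
    rtgs = np.zeros(n, dtype=np.float32)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed for later steps.
    for t in reversed(range(n)):
        running += rews[t]
        rtgs[t] = running
    return rtgs

# Toy episode: 4 steps, reward 1.0 per step (as in CartPole).
ep_rews = [1.0, 1.0, 1.0, 1.0]

# Weights as in the "simplest" code: every step gets the full episode return.
print([sum(ep_rews)] * len(ep_rews))      # [4.0, 4.0, 4.0, 4.0]

# Reward-to-go weights: later steps get credit only for what followed them.
print(reward_to_go(ep_rews).tolist())     # [4.0, 3.0, 2.0, 1.0]
```

In the training code this would mean replacing `batch_weights += [ep_ret] * ep_len` with something like `batch_weights += list(reward_to_go(ep_rews))`.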



In [5]:
def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2, 
          epochs=50, batch_size=5000, render=False):

    # make environment, check spaces, get obs / act dims
    env = gym.make(env_name)
    assert isinstance(env.observation_space, Box), \
        "This example only works for envs with continuous state spaces."
    assert isinstance(env.action_space, Discrete), \
        "This example only works for envs with discrete action spaces."

    obs_dim = env.observation_space.shape[0]
    n_acts = env.action_space.n

    # make core of policy network
    logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

    # make function to compute action distribution
    # (does the forward pass on the NN)
    def get_policy(obs):
        logits = logits_net(obs)
        return Categorical(logits=logits)

    # make action selection function (outputs int actions, sampled from policy)
    # note this gets an actual action by sampling from the distribution
    def get_action(obs):
        return get_policy(obs).sample().item()

    # make loss function whose gradient, for the right data, is policy gradient
    def compute_loss(obs, act, weights):
        logp = get_policy(obs).log_prob(act)
        return -(logp * weights).mean()

    # make optimizer
    optimizer = Adam(logits_net.parameters(), lr=lr)

    # for training policy
    def train_one_epoch():
        # make some empty lists for logging.
        batch_obs = []          # for observations
        batch_acts = []         # for actions
        batch_weights = []      # for R(tau) weighting in policy gradient
        batch_rets = []         # for measuring episode returns
        batch_lens = []         # for measuring episode lengths

        # reset episode-specific variables
        obs = env.reset()       # first obs comes from starting distribution
        done = False            # signal from environment that episode is over
        ep_rews = []            # list for rewards accrued throughout ep

        # render first episode of each epoch
        finished_rendering_this_epoch = False

        # collect experience by acting in the environment with current policy
        # Notice below, almost all the code is for running episodes, and collecting the states, actions, rewards
        while True:

            # rendering
            if (not finished_rendering_this_epoch) and render:
                env.render()

            # save obs
            batch_obs.append(obs.copy())

            # act in the environment
            # calls the methods defined above to get an actual action from the policy
            act = get_action(torch.as_tensor(obs, dtype=torch.float32))
            obs, rew, done, _ = env.step(act)

            # save action, reward
            batch_acts.append(act)
            ep_rews.append(rew)

            if done:
                # if episode is over, record info about episode
                ep_ret, ep_len = sum(ep_rews), len(ep_rews)
                batch_rets.append(ep_ret)
                batch_lens.append(ep_len)

                # the weight for each logprob(a|s) is R(tau)
                batch_weights += [ep_ret] * ep_len

                # reset episode-specific variables
                obs, done, ep_rews = env.reset(), False, []

                # won't render again this epoch
                finished_rendering_this_epoch = True

                # end experience loop if we have enough of it
                if len(batch_obs) > batch_size:
                    break

        # take a single policy gradient update step
        optimizer.zero_grad()
        batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                                  act=torch.as_tensor(batch_acts, dtype=torch.int32),
                                  weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                                  )
        batch_loss.backward()
        optimizer.step()
        return batch_loss, batch_rets, batch_lens

    # training loop
    for i in range(epochs):
        batch_loss, batch_rets, batch_lens = train_one_epoch()
        print('epoch: %3d \t loss: %.3f \t return: %.3f \t ep_len: %.3f'%
                (i, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))
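
To see what `compute_loss` is doing, here is a small check with made-up numbers. `Categorical(logits=...).log_prob(act)` is just the log-softmax of the logits, read off at the action that was taken. The logits, actions, and weights below are invented for illustration.

```python
import torch
from torch.distributions.categorical import Categorical

# Made-up logits for 3 states and 2 actions, the actions taken,
# and the weight (episode return) attached to each step.
logits = torch.tensor([[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]])
act = torch.tensor([1, 0, 0])
weights = torch.tensor([10.0, 10.0, 5.0])

# What compute_loss does: log-probability of each taken action, times its weight.
logp = Categorical(logits=logits).log_prob(act)
loss = -(logp * weights).mean()

# The same thing by hand: log-softmax, then pick the column of the taken action.
logp_manual = torch.log_softmax(logits, dim=1)[torch.arange(3), act]
loss_manual = -(logp_manual * weights).mean()

print(torch.allclose(loss, loss_manual))  # True
```

Spinning Up also points out that this "loss" is not a performance measure in the usual supervised-learning sense; it is only built so that its gradient is the policy gradient. That is why the printed loss can go up while the return improves.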


The code at the link also has a main() section, but it causes an error in the notebook, so I removed it. train() is set up with default args, so we can just call train() below and it will use CartPole-v0 (see the train() default args above).

In [6]:
train()

epoch:   0 	 loss: 18.784 	 return: 21.586 	 ep_len: 21.586
epoch:   1 	 loss: 19.365 	 return: 22.719 	 ep_len: 22.719
epoch:   2 	 loss: 25.177 	 return: 25.431 	 ep_len: 25.431
epoch:   3 	 loss: 25.423 	 return: 29.057 	 ep_len: 29.057
epoch:   4 	 loss: 27.824 	 return: 31.604 	 ep_len: 31.604
epoch:   5 	 loss: 30.288 	 return: 34.854 	 ep_len: 34.854
epoch:   6 	 loss: 37.806 	 return: 39.945 	 ep_len: 39.945
epoch:   7 	 loss: 34.039 	 return: 41.106 	 ep_len: 41.106
epoch:   8 	 loss: 37.616 	 return: 44.257 	 ep_len: 44.257
epoch:   9 	 loss: 45.801 	 return: 51.296 	 ep_len: 51.296
epoch:  10 	 loss: 46.574 	 return: 56.843 	 ep_len: 56.843
epoch:  11 	 loss: 50.739 	 return: 56.494 	 ep_len: 56.494
epoch:  12 	 loss: 48.301 	 return: 61.171 	 ep_len: 61.171
epoch:  13 	 loss: 59.522 	 return: 70.746 	 ep_len: 70.746
epoch:  14 	 loss: 62.610 	 return: 73.087 	 ep_len: 73.087
epoch:  15 	 loss: 59.381 	 return: 75.194 	 ep_len: 75.194
epoch:  16 	 loss: 69.299 	 return: 86.4
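
One thing to watch for if you run this yourself: the training code uses the older gym API, where `env.reset()` returns just the observation and `env.step()` returns four values. In gym 0.26 and later (and in gymnasium), `reset()` returns `(obs, info)` and `step()` returns five values, so `train()` may fail there with an unpacking error. The sketch below shows one way to paper over the difference. The helper names `reset_env` and `step_env` and the two fake environment classes are hypothetical, made up for this example.

```python
def reset_env(env):
    """Return only the observation, for both old-style and new-style reset()."""
    out = env.reset()
    # New-style reset returns (obs, info); old-style returns obs alone.
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out

def step_env(env, act):
    """Return (obs, rew, done, info) for both old-style and new-style step()."""
    out = env.step(act)
    if len(out) == 5:
        obs, rew, terminated, truncated, info = out
        # The episode is over if it either terminated or was cut off.
        return obs, rew, terminated or truncated, info
    return out

# Fake environments (hypothetical) that mimic the two return conventions.
class OldEnv:
    def reset(self): return [0.0]
    def step(self, act): return [1.0], 1.0, True, {}

class NewEnv:
    def reset(self): return [0.0], {}
    def step(self, act): return [1.0], 1.0, False, True, {}

print(reset_env(OldEnv()), reset_env(NewEnv()))  # [0.0] [0.0]
print(step_env(NewEnv(), 0))                     # ([1.0], 1.0, True, {})
```

With helpers like these, the calls in `train_one_epoch` would become `obs = reset_env(env)` and `obs, rew, done, _ = step_env(env, act)`.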

Below is the Vanilla PG code (with advantage) mentioned earlier: https://github.com/jachiam/rl-intro/blob/master/pg_cartpole.py

In [7]:
import tensorflow as tf
import numpy as np
import gym

def mlp(x, hidden_sizes=(32,32), activation=tf.tanh):
    for size in hidden_sizes:
        x = tf.layers.dense(x, units=size, activation=activation)
    return x

def discount_cumsum(x, gamma):
    n = len(x)
    x = np.array(x)
    y = gamma**np.arange(n)
    z = np.zeros_like(x, dtype=np.float32)
    for j in range(n):
        z[j] = sum(x[j:] * y[:n-j])
    return z

def train(env_name='CartPole-v0', hidden_dim=32, n_layers=1,
          lr=1e-2, gamma=0.99, n_iters=50, batch_size=5000
          ):

    env = gym.make(env_name)
    obs_dim = env.observation_space.shape[0]
    n_acts = env.action_space.n

    # make model
    with tf.variable_scope('model'):
        obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
        net = mlp(obs_ph, hidden_sizes=[hidden_dim]*n_layers)
        logits = tf.layers.dense(net, units=n_acts, activation=None)
        actions = tf.squeeze(tf.multinomial(logits=logits,num_samples=1), axis=1)

    # make loss
    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)
    action_one_hots = tf.one_hot(act_ph, n_acts)
    log_probs = tf.reduce_sum(action_one_hots * tf.nn.log_softmax(logits), axis=1)
    loss = -tf.reduce_mean(adv_ph * log_probs)

    # make train op
    train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())

    # train model
    def train_one_iteration():
        batch_obs, batch_acts, batch_rtgs, batch_rets, batch_lens = [], [], [], [], []

        obs, rew, done, ep_rews = env.reset(), 0, False, []
        while True:
            batch_obs.append(obs.copy())
            act = sess.run(actions, {obs_ph: obs.reshape(1,-1)})[0]
            obs, rew, done, _ = env.step(act)
            batch_acts.append(act)
            ep_rews.append(rew)
            if done:
                batch_rets.append(sum(ep_rews))
                batch_lens.append(len(ep_rews))
                batch_rtgs += list(discount_cumsum(ep_rews, gamma))
                obs, rew, done, ep_rews = env.reset(), 0, False, []
                if len(batch_obs) > batch_size:
                    break

        # normalize advs trick:
        batch_advs = np.array(batch_rtgs)
        batch_advs = (batch_advs - np.mean(batch_advs))/(np.std(batch_advs) + 1e-8)
        batch_loss, _ = sess.run([loss, train_op], feed_dict={obs_ph: np.array(batch_obs),
                                                              act_ph: np.array(batch_acts),
                                                              adv_ph: batch_advs})
        return batch_loss, batch_rets, batch_lens

    for i in range(n_iters):
        batch_loss, batch_rets, batch_lens = train_one_iteration()
        print('itr: %d \t loss: %.3f \t return: %.3f \t ep_len: %.3f'%
                (i, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))



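The code above uses the TF 1.x API (tf.placeholder, tf.layers, tf.InteractiveSession), so I don't expect it to run as-is under TensorFlow 2.x. But you can read it to get the idea for adding advantage to the REINFORCE code from earlier to turn it into VPG. The two pieces to look at are discount_cumsum, which computes the discounted reward-to-go for each step, and the "normalize advs trick", which rescales those values before they are fed in as adv_ph.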
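
Here is a minimal NumPy sketch of those two pieces, which you could adapt for the PyTorch code. Treat it as a starting point under my own assumptions, not as the reference solution. The names `discounted_rtg` and `normalize` are hypothetical. `discounted_rtg` computes the same values as `discount_cumsum` above, but with a single backward pass instead of a sum per step.

```python
import numpy as np

def discounted_rtg(rews, gamma=0.99):
    """Discounted reward-to-go: z[t] = sum over k >= t of gamma**(k-t) * rews[k]."""
    n = len(rews)
    rtgs = np.zeros(n, dtype=np.float32)
    running = 0.0
    # Walk backwards: each step is its own reward plus gamma times what follows.
    for t in reversed(range(n)):
        running = rews[t] + gamma * running
        rtgs[t] = running
    return rtgs

def normalize(x, eps=1e-8):
    """Shift to zero mean and scale to unit std, as in the 'normalize advs trick'."""
    x = np.asarray(x, dtype=np.float32)
    return (x - x.mean()) / (x.std() + eps)

# Toy episode with an easy-to-check discount factor.
ep_rews = [1.0, 1.0, 1.0]
rtgs = discounted_rtg(ep_rews, gamma=0.5)
print(rtgs.tolist())  # [1.75, 1.5, 1.0]

# Normalized values are what would be used as the weights in the loss.
advs = normalize(rtgs)
print(advs)
```

In the PyTorch `train_one_epoch`, one way to use this would be to append `list(discounted_rtg(ep_rews, gamma))` to `batch_weights` at the end of each episode, then pass `normalize(batch_weights)` to `compute_loss`. As in the TF code, the normalization is applied to the whole batch, not to each episode separately. Note that the TF code does not learn a value-function baseline; the normalized reward-to-go is what it uses as the advantage estimate.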