## POLICY GRADIENT on CartPole

Policy Gradient algorithms find an optimal behavior strategy optimizing directly the policy. 
The policy is a parametrized function respect to $\theta$ $\pi_\theta(a|s)$

The reward function is defined as 
$$J(\theta) = \sum_{s}d^\pi(s)\sum_{a}\pi_\theta(a|s)Q^\pi(s,a)$$

In Vanilla Policy Gradient, we estimate the return $R_t$ (REINFORCE algorithm) and update the policy subtracting a baseline value from $R_t$ to reduce the variance.

<img src="https://github.com/stevearonson/Reinforcement-Learning/blob/master/Week4/imgs/Vanilla_policy_gradient.png?raw=1" alt="drawing" width="500"/>
Credit: John Schulman

In [2]:
!pip install tensorboardX

Collecting tensorboardX
[?25l  Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)
[K     |█                               | 10kB 14.2MB/s eta 0:00:01[K     |██▏                             | 20kB 1.7MB/s eta 0:00:01[K     |███▏                            | 30kB 2.3MB/s eta 0:00:01[K     |████▎                           | 40kB 2.6MB/s eta 0:00:01[K     |█████▎                          | 51kB 2.0MB/s eta 0:00:01[K     |██████▍                         | 61kB 2.2MB/s eta 0:00:01[K     |███████▍                        | 71kB 2.5MB/s eta 0:00:01[K     |████████▌                       | 81kB 2.7MB/s eta 0:00:01[K     |█████████▌                      | 92kB 2.9MB/s eta 0:00:01[K     |██████████▋                     | 102kB 2.8MB/s eta 0:00:01[K     |███████████▊                    | 112kB 2.8MB/s eta 0:00:01[K     |████████████▊                   | 122kB 2.

In [3]:
import numpy as np
import gym
from tensorboardX import SummaryWriter

import time
from collections import namedtuple
from collections import deque
import datetime

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [4]:
class PG_nn(nn.Module):
    '''
    Policy neural net
    '''
    def __init__(self, input_shape, n_actions):
        super(PG_nn, self).__init__()

        self.mlp = nn.Sequential(
            nn.Linear(input_shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.mlp(x.float())

In [5]:
def discounted_rewards(memories, gamma):
    '''
    Compute the discounted reward backward
    '''

    disc_rew = np.zeros(len(memories))
    run_add = 0

    for t in reversed(range(len(memories))):
        if memories[t].done: run_add = 0
        run_add = run_add * gamma + memories[t].reward
        disc_rew[t] = run_add

    return disc_rew

In [6]:
Memory = namedtuple('Memory', ['obs', 'action', 'new_obs', 'reward', 'done'], verbose=False, rename=False)

GAMMA = 0.99
LEARNING_RATE = 0.002
ENTROPY_BETA = 0.01
ENV_NAME = 'CartPole-v0'

MAX_N_GAMES = 10000
n_games = 0

device = 'cpu'

now = datetime.datetime.now()
date_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)

In [7]:
env = gym.make(ENV_NAME)
obs = env.reset()

# Initialize the writer
writer = SummaryWriter(log_dir='content/runs/A2C'+ENV_NAME+'_'+date_time)

# create the agent neural net
action_n = env.action_space.n
agent_nn = PG_nn(env.observation_space.shape, action_n).to(device)

# Adam optimizer
optimizer = optim.Adam(agent_nn.parameters(), lr=LEARNING_RATE)

experience = []
tot_reward = 0
n_iter = 0
# deque list to keep the baseline
baseline = deque(maxlen=30000)
game_rew = 0

## MAIN BODY
while n_games < MAX_N_GAMES:

    n_iter += 1

    # execute the agent
    act = agent_nn(torch.tensor(obs))
    act_soft = F.softmax(act)
    # get an action following the policy distribution
    action = int(np.random.choice(np.arange(action_n), p=act_soft.detach().numpy(), size=1))

    # make a step in the env
    new_obs, reward, done, _ = env.step(action)

    game_rew += reward
    # update the experience list with the last memory
    experience.append(Memory(obs=obs, action=action, new_obs=new_obs, reward=reward, done=done))

    obs = new_obs

    if done:
        # Calculate the discounted rewards
        disc_rewards = discounted_rewards(experience, GAMMA)

        # update the baseline
        baseline.extend(disc_rewards)
        # subtract the baseline mean from the discounted reward.
        disc_rewards -= np.mean(baseline)

        # run the agent NN on the obs in the experience list
        acts = agent_nn(torch.tensor([e.obs for e in experience]))

        # take the log softmax of the action taken previously
        game_act_log_softmax_t = F.log_softmax(acts, dim=1)[:,[e.action for e in experience]]

        disc_rewards_t = torch.tensor(disc_rewards, dtype=torch.float32).to(device)

        # compute the loss entropy
        l_entropy = ENTROPY_BETA * torch.mean(torch.sum(F.softmax(acts, dim=1) * F.log_softmax(acts, dim=1), dim=1))

        # compute the loss
        loss = - torch.mean(disc_rewards_t * game_act_log_softmax_t)
        loss = loss + l_entropy

        # optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # print the stats
        writer.add_scalar('loss', loss, n_iter)
        writer.add_scalar('reward', game_rew, n_iter)

        print(n_games, loss.detach().numpy(), game_rew, np.mean(disc_rewards), np.mean(baseline))

        # reset the variables and env
        experience = []
        game_rew = 0
        obs = env.reset()
        n_games += 1


writer.close()



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
5001 0.0 8.0 -2.7563691308100795 7.152928485361203
5002 0.0 9.0 -2.2833709881674613 7.15234322017245
5003 -1448.177 11.0 -1.3458932067186604 7.150322039563461
5004 0.0 10.0 -1.8103545642088874 7.148608822925333
5005 0.0 10.0 -1.8070520763541544 7.145306335070599
5006 -1466.5908 11.0 -1.3408775022257986 7.145306335070599
5007 0.0 8.0 -2.744528032843519 7.141087387394642
5008 0.0 10.0 -1.8003954213137852 7.13864968003023
5009 0.0 10.0 -1.799495305256158 7.137749563972603
5010 0.0 10.0 -1.795353721756444 7.133607980472889
5011 0.0 9.0 -2.2632370383907396 7.1322092703957285
5012 0.0 9.0 -2.2614001615046373 7.130372393509626
5013 0.0 8.0 -2.7306905291963597 7.127249883747483
5014 0.0 10.0 -1.7880584421942856 7.126312700910731
5015 0.0 10.0 -1.7853942078060938 7.123648466522539
5016 0.0 10.0 -1.7835476774000205 7.121801936116466
5017 0.0 9.0 -2.2514589924249133 7.120431224429902
5018 0.0 10.0 -1.7785919635848013 7.1168462223012

![Reward](https://github.com/stevearonson/Reinforcement-Learning/blob/master/Week4/imgs/reward_pg.png?raw=1)