# Reinforce & Actor-Advantage Critic (A2C)

[You can find the original paper here](https://arxiv.org/pdf/1602.01783.pdf).

## Imports

In [None]:
!pip install pyvirtualdisplay

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F 
import numpy as np

from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from IPython.display import clear_output
from pathlib import Path

import random, os.path, math, glob, csv, base64
import gym
from gym.wrappers import Monitor

In [2]:
def show_video(directory):
    html = []
    for mp4 in Path(directory).glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [3]:
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1021'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1021'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

## Intro

In this tutorial we will focus on Deep Reinforcement Learning with **Reinforce** and the **Actor-Advantage Critic** algorithm. This tutorial is composed of:
* A quick reminder of the RL setting,
* A theoritical approch of Reinforce
* A theoritical approch of A2C,
* An introduction to the deep learning framework: **PyTorch**, 
* A coding part with experiments.


## Introduction to PyTorch

*If you already know PyTorch you can skip this part. From this part on we assume that you have some experience with Python and Numpy.*

PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system

At a granular level, PyTorch is a library that consists of the following components:

| Component | Description |
| ---- | --- |
| [**torch**](https://pytorch.org/docs/stable/torch.html) | a Tensor library like NumPy, with strong GPU support |
| [**torch.autograd**](https://pytorch.org/docs/stable/autograd.html) | a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch |
| [**torch.jit**](https://pytorch.org/docs/stable/jit.html) | a compilation stack (TorchScript) to create serializable and optimizable models from PyTorch code  |
| [**torch.nn**](https://pytorch.org/docs/stable/nn.html) | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| [**torch.multiprocessing**](https://pytorch.org/docs/stable/multiprocessing.html) | Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training |
| [**torch.utils**](https://pytorch.org/docs/stable/data.html) | DataLoader and other utility functions for convenience |



PyTorch works in a very similar way as Numpy and PyTorch's Tensors are the equivalent of Numpy's Arrays.

You can initialize an zero filled tensor just like in numpy.

In [4]:
torch.zeros(5,3)

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

In [5]:
torch.eye(3)

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

You can also convert an array to a tensor.

In [6]:
torch.tensor(np.eye(3))

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], dtype=torch.float64)

And you can transform a tensor to an array.

In [7]:
torch.tensor(np.eye(3)).numpy()

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

You can sum, substract, multiply arrays just like in numpy.

In [8]:
a = torch.randint(0,10,(2,3))
print(a)

tensor([[2, 2, 3],
        [9, 6, 8]])


In [9]:
b = torch.randint(0,10,(2,3))
print(b)

tensor([[7, 3, 7],
        [3, 1, 9]])


In [10]:
print(f'a + b = {a + b}')
print(f'a * b = {a * b}')

a + b = tensor([[ 9,  5, 10],
        [12,  7, 17]])
a * b = tensor([[14,  6, 21],
        [27,  6, 72]])


You can make matrix products.

In [11]:
a @ b.t()

tensor([[ 41,  35],
        [137, 105]])

### AUTOGRAD: automatic differentiation

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

``torch.Tensor`` is the central class of the package. If you set its attribute
``.requires_grad`` as ``True``, it starts to track all operations on it. When
you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically. The gradient for this tensor will be
accumulated into ``.grad`` attribute.

To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. This can be particularly helpful when evaluating a
model because the model may have trainable parameters with
``requires_grad=True``, but for which we don't need the gradients.

There’s one more class which is very important for autograd
implementation - a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic
graph, that encodes a complete history of computation. Each tensor has
a ``.grad_fn`` attribute that references a ``Function`` that has created
the ``Tensor`` (except for Tensors created by the user - their
``grad_fn is None``).

If you want to compute the derivatives, you can call ``.backward()`` on
a ``Tensor``. If ``Tensor`` is a scalar (i.e. it holds a one element
data), you don’t need to specify any arguments to ``backward()``,
however if it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.

## Reminder of the RL setting

As always we will consider a MDP $M = (\mathcal{X}, \mathcal{A}, p, r, \gamma)$ with:
* $\mathcal{X}$ the state space,
* $\mathcal{A}$ the action space,
* $p(x^\prime \mid x, a)$ the transition probability,
* $r(x, a, x^\prime)$ the reward of the transition $(x, a, x^\prime)$,
* $\gamma \in [0,1)$ is the discount factor.

A policy $\pi$ is a mapping from the state space $\mathcal{X}$ to the probability of selecting each action.

The action value function of a policy is the overall expected reward from a state action. $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0=s, a_0=a\big]$ where $R(\tau)$ is the random variable defined as the sum of the discounted reward.

The goal is to maximize the agent's reward.

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t R_t \mid x_0, \pi \big]$$

# Gym Environnement

In this lab and also the next one we are going to use the [OpenAI's Gym library](https://gym.openai.com/envs/). This library provides a large number of environnements to test RL algorithm.

We will focus on three different environnements in this lab but we encourage you to test other ones.
* Acrobot-v1
* CartPole-v1
* MountainCar-v0

| Env Info          	| CartPole-v1 	| Acrobot-v1                	| MountainCar-v0 	|
|-------------------	|-------------	|---------------------------	|----------------	|
| **Observation Space** 	| Box(4)      	| Box(6)                    	| Box(2)         	|
| **Action Space**      	| Discrete(2) 	| Discrete(3)               	| Discrete(3)    	|
| **Rewards**           	| 1 per step  	| -1 if not terminal else 0 	| -1 per step    	|

A gym environnement is loaded with the command `env = gym.make(env_id)`. Once the environnement is created, you need to reset it with `observation = env.reset()` and then you can interact with it using the method step: `observation, reward, done, info = env.step(action)`.

### Carpole

In [12]:
# We load CartPole-v1
env = gym.make('CartPole-v1')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [13]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

### Acrobot-v1

In [19]:
# We load Acrobot-v1
env = gym.make('Acrobot-v1')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [20]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

### MountainCar-v0

In [21]:
# We load Acrobot-v1
env = gym.make('MountainCar-v0')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [22]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

## REINFORCE

### Introduction

Reinforce is an actor-based **on policy** method. The policy $\pi_{\theta}$ is parametrized by a function approximator (e.g. a neural network).

Recall: $$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t R_t \mid x_0, \pi \big].$$

To update the parameters $\theta$ of the policy, one has to do gradient ascent: $\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta}J(\pi_{\theta})|_{\theta_{k}}$.

Advantages of this approach:
- Compared to a Q-learning approach, here the policy is directly parametrized so a small change of the parameters will not dramatically change the policy whereas this is not the case for Q-learning approaches.
- The stochasticity of the policy allows exploration. In off policy learning, one has to deal with both a behaviour policy and an exploration policy.

### Policy Gradient Theorem

**Q.1: Prove the Policy Gradient Theorem:** $$ \displaystyle \nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)}\right]$$

Hint 1 - The probability of a trajectory $\tau = (s_{0}, a_{0},\dots, s_{T+1}$) with action chosen from $\displaystyle \pi_{\theta}$ is $P(\tau|\theta) = \rho_{0}(s_{0})\prod_{t=0}^{T}P\left(s_{t+1}|s_{t}, a_{t}\right) \pi_{\theta}(a_{t}|s_{t})$

Hint 2 - Gradient-log trick: $  \nabla_{\theta}P(\tau|\theta)= P(\tau|\theta)\nabla_{\theta}\log P(\tau|\theta). $

The policy gradient can therefore be approximated with:
$$ \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau) $$

**Q.2: Implement the REINFORCE algorithm**

First we will define our model:

In [14]:
class Model(nn.Module):
    def __init__(self, dim_observation, n_actions):
        super(Model, self).__init__()
        
        self.n_actions = n_actions
        self.dim_observation = dim_observation
        
        self.net = nn.Sequential(
            nn.Linear(in_features=self.dim_observation, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=8),
            nn.ReLU(),
            nn.Linear(in_features=8, out_features=self.n_actions),
            nn.Softmax(dim=0)
        )
        
    def forward(self, state):
        return self.net(state)
    
    def select_action(self, state):
        action = torch.multinomial(self.forward(state), 1)
        return action

It is always nice to visualize the differents layers of our model.

In [15]:
env_id = 'CartPole-v1'
env = gym.make(env_id)
model = Model(env.observation_space.shape[0], env.action_space.n)
print(f'The model we created correspond to:\n{model}')

The model we created correspond to:
Model(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=2, bias=True)
    (5): Softmax()
  )
)


In [16]:
class BaseAgent:
    
    def __init__(self, dim_observation, n_actions, gamma):
        self.model = Model(dim_observation=dim_observation, n_actions=n_actions)
        self.gamma = gamma
        self.optimizer = torch.optim.Adam(self.model.net.parameters(), lr=0.01)
    
    def _make_returns(self, rewards):
        returns = np.zeros_like(rewards)
        returns[-1] = rewards[-1]
        for t in reversed(range(len(rewards) - 1)):
            returns[t] = rewards[t] + self.gamma * returns[t + 1]
        return returns
    
    def optimize_model(self):
        raise NotImplementedError
    
    def train(self, env, n_trajectories, n_update):
        for episode in range(n_update):
            mean_reward, std_reward, min_reward, max_reward = self.optimize_model(env, n_trajectories)
            print("Episode {}".format(episode+1))
            print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(mean_reward, std_reward, min_reward, max_reward))

    def evaluate(self, env, n_trajectories):
        reward_trajectories = np.zeros(n_trajectories)
        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            observation = torch.tensor(observation, dtype=torch.float)
            reward_episode = 0
            done = False
            
            while not done:
                env.render()
                action = self.model.select_action(observation)
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                reward_episode += reward
            
            reward_trajectories[i] = reward_episode
        env.close()
        print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(reward_trajectories.mean(), reward_trajectories.std(), 
                      reward_trajectories.min(), reward_trajectories.max()))
        

In [17]:
class SimpleAgent(BaseAgent):
    
    def optimize_model(self, env, n_trajectories):
        weighted_logproba = torch.zeros(n_trajectories)
        reward_trajectories = np.zeros(n_trajectories)

        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            rewards_episode = []
            logproba_episode = []
            discount_factor = 1
            observation = torch.tensor(observation, dtype=torch.float)
            done = False
            
            while not done:
                action = self.model.select_action(observation)
                logproba_episode.append(torch.log(self.model.forward(observation))[action])
                # Interaction with the environment
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                rewards_episode.append(discount_factor * reward)
                discount_factor *= self.gamma
            
            cum_rewards_episode = np.sum(rewards_episode)
            weighted_logproba[i]= cum_rewards_episode * torch.cat(logproba_episode).sum()
            reward_trajectories[i] = cum_rewards_episode
            
        loss = - weighted_logproba.mean()
        
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()


In [18]:
# Make the environment
env = gym.make("CartPole-v1")

# Seeds
seed = 1
env.seed(seed=seed)
np.random.seed(seed=seed)
torch.manual_seed(seed=seed)

observation = env.reset()

n_actions = env.action_space.n
dim_observation = observation.shape[0]

Agent = SimpleAgent(dim_observation=dim_observation, n_actions=n_actions, gamma=1)
Agent.train(env=env, n_trajectories=50, n_update=50)

Episode 1
Reward:μσmM 19.44 9.94 8.00 48.00
Episode 2
Reward:μσmM 21.26 10.15 8.00 58.00
Episode 3
Reward:μσmM 20.42 9.13 8.00 56.00
Episode 4
Reward:μσmM 19.28 8.99 8.00 55.00
Episode 5
Reward:μσmM 20.62 7.93 10.00 37.00
Episode 6
Reward:μσmM 22.54 12.58 8.00 77.00
Episode 7
Reward:μσmM 21.64 10.05 9.00 54.00
Episode 8
Reward:μσmM 26.70 13.61 11.00 89.00
Episode 9
Reward:μσmM 24.00 13.84 10.00 64.00
Episode 10
Reward:μσmM 22.54 12.84 9.00 64.00
Episode 11
Reward:μσmM 23.64 12.61 9.00 89.00
Episode 12
Reward:μσmM 25.46 12.30 10.00 69.00
Episode 13
Reward:μσmM 23.20 8.82 10.00 49.00
Episode 14
Reward:μσmM 23.56 11.03 9.00 49.00
Episode 15
Reward:μσmM 23.94 18.52 10.00 132.00
Episode 16
Reward:μσmM 31.60 20.62 10.00 93.00
Episode 17
Reward:μσmM 29.34 18.09 10.00 93.00
Episode 18
Reward:μσmM 29.42 19.27 9.00 112.00
Episode 19
Reward:μσmM 31.10 16.00 12.00 88.00
Episode 20
Reward:μσmM 29.92 15.94 9.00 77.00
Episode 21
Reward:μσmM 28.12 13.73 11.00 69.00
Episode 22
Reward:μσmM 35.40 24.27 1

### Don't let the past distract you

- The sum of rewards during one episode has a high variance which affects the performance of this version of REINFORCE.
- To assess the quality of an action, it make more sens to take into consideration only the rewards obtained after taking this action.
- It can be proven that $$  \nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})}\right].$$
- Bonus: proof of this claim.
- This has for effect to reduce the variance. Past rewards have zero mean but nonzero variance so they just add noise.  

**Q3: Implement this enhance version of REINFORCE**

In [19]:
class EnhanceAgent(BaseAgent):
    
    def optimize_model(self, env, n_trajectories):
        weighted_logproba = torch.zeros(n_trajectories)
        reward_trajectories = np.zeros(n_trajectories)

        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            rewards_episode = []
            logproba_episode = []
            discount_factor = 1
            observation = torch.tensor(observation, dtype=torch.float)
            done = False
            
            while not done:
                action = self.model.select_action(observation)
                logproba_episode.append(torch.log(self.model.forward(observation))[action])
                # Interaction with the environment
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                rewards_episode.append(discount_factor * reward)
                discount_factor *= self.gamma
            
            inverse_cum_rewards = self._make_returns(rewards_episode)
            reward_trajectories[i] = inverse_cum_rewards[0]
            inverse_cum_rewards = torch.tensor(inverse_cum_rewards, dtype=torch.float)
            weighted_logproba[i]= torch.sum(inverse_cum_rewards * torch.cat(logproba_episode))
            
        loss = - weighted_logproba.mean()
        
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()
   

In [20]:
Agent = EnhanceAgent(dim_observation=dim_observation, n_actions=n_actions, gamma=1)
Agent.train(env=env, n_trajectories=50, n_update=50)

Episode 1
Reward:μσmM 20.72 10.90 9.00 60.00
Episode 2
Reward:μσmM 21.54 9.81 10.00 54.00
Episode 3
Reward:μσmM 23.74 13.46 10.00 69.00
Episode 4
Reward:μσmM 25.54 13.12 9.00 72.00
Episode 5
Reward:μσmM 27.58 15.44 11.00 80.00
Episode 6
Reward:μσmM 25.26 12.04 12.00 76.00
Episode 7
Reward:μσmM 26.74 15.13 8.00 67.00
Episode 8
Reward:μσmM 29.32 13.61 11.00 68.00
Episode 9
Reward:μσmM 28.84 15.98 11.00 84.00
Episode 10
Reward:μσmM 30.92 19.93 10.00 106.00
Episode 11
Reward:μσmM 29.54 15.78 10.00 84.00
Episode 12
Reward:μσmM 35.36 16.17 13.00 80.00
Episode 13
Reward:μσmM 36.32 21.94 9.00 123.00
Episode 14
Reward:μσmM 40.06 19.07 14.00 93.00
Episode 15
Reward:μσmM 41.20 28.30 11.00 139.00
Episode 16
Reward:μσmM 45.24 25.57 11.00 148.00
Episode 17
Reward:μσmM 47.00 22.18 11.00 124.00
Episode 18
Reward:μσmM 49.04 24.80 17.00 115.00
Episode 19
Reward:μσmM 46.60 21.34 13.00 131.00
Episode 20
Reward:μσmM 48.74 19.05 21.00 91.00
Episode 21
Reward:μσmM 52.80 34.78 17.00 254.00
Episode 22
Reward:μ

## From REINFORCE to A2C

### The idea behind A2C

The need of a critic.

### A2C

In [42]:
class A2CModel(nn.Module):
    def __init__(self, dim_observation, n_actions):
        super().__init__()
        self.dim_observation = dim_observation
        self.n_actions = n_actions
        
        self.embedding = nn.Sequential(
            nn.Linear(in_features=self.dim_observation, out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=64),
            nn.ReLU()
        )
        self.policy = nn.Linear(in_features=64, out_features=n_actions)
        self.value = nn.Linear(in_features=64, out_features=1)
        
    def forward(self, x):
        embedded_obs = self.embedding(x)
        policy = F.softmax(self.policy(embedded_obs))
        value = self.value(embedded_obs)
        return value, policy
    
    def value_action(self, state):
        policy, value = self.forward(state)
        action = torch.multinomial(policy, 1)
        return value, action

In [137]:
env_id = 'CartPole-v1'
env = gym.make(env_id)
model = A2CModel(env.observation_space.shape[0], env.action_space.n)
print(f'The model we created correspond to:\n{model}')

The model we created correspond to:
A2CModel(
  (embedding): Sequential(
    (0): Linear(in_features=4, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
  )
  (policy): Linear(in_features=64, out_features=2, bias=True)
  (value): Linear(in_features=64, out_features=1, bias=True)
)


In [17]:
class SimpleAgent(BaseAgent):
    
    def optimize_model(self, env, n_trajectories):
        weighted_logproba = torch.zeros(n_trajectories)
        reward_trajectories = np.zeros(n_trajectories)

        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            rewards_episode = []
            logproba_episode = []
            discount_factor = 1
            observation = torch.tensor(observation, dtype=torch.float)
            done = False
            
            while not done:
                action = self.model.select_action(observation)
                logproba_episode.append(torch.log(self.model.forward(observation))[action])
                # Interaction with the environment
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                rewards_episode.append(discount_factor * reward)
                discount_factor *= self.gamma
            
            cum_rewards_episode = np.sum(rewards_episode)
            weighted_logproba[i]= cum_rewards_episode * torch.cat(logproba_episode).sum()
            reward_trajectories[i] = cum_rewards_episode
            
        loss = - weighted_logproba.mean()
        
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()


In [141]:
class A2CAgent:
    
    def __init__(self, dim_observation, n_actions, gamma):
        self.model = A2CModel(dim_observation=dim_observation, n_actions=n_actions)
        self.gamma = gamma
        self.optimizer = torch.optim.RMSprop(self.model.parameters(), lr=0.0007)
    
    def _make_returns(self, rewards):
        returns = np.zeros_like(rewards)
        returns[-1] = rewards[-1]
        for t in reversed(range(len(rewards) - 1)):
            returns[t] = rewards[t] + self.gamma * returns[t + 1]
        return returns
    
    def optimize_model(self, env, n_trajectories):
        loss = 0
        reward_trajectories = []
        for i in range(n_trajectories):
            rewards_episode, logproba_episode, values_episode, entropy_episode = self.collect_trajectory(env)
            reward_trajectories.append(rewards_episode.sum())
            returns = self._make_returns(rewards_episode)
            returns = torch.tensor(returns, dtype=torch.float32)
            value_loss = 0.5 * F.mse_loss(values_episode, returns)
            policy_loss = torch.mean(logproba_episode * (returns - values_episode.detach()))
            loss += value_loss - policy_loss - 0.01 * entropy_episode
        print(loss)
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        
        reward_trajectories = np.array(reward_trajectories)
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()
        
            
    def collect_trajectory(self, env):
        # New episode
        observation = env.reset()
        rewards_episode = []
        values_episode = []
        logproba_episode = []
        entropy_episode = []
        observation = torch.tensor(observation, dtype=torch.float)
        done = False

        while not done:
            value, policy = self.model(observation)
            values_episode.append(value)
            entropy_episode.append(- torch.sum(policy * torch.log(policy)))
            logproba_episode.append(torch.log(policy)[action])
          
            # Interaction with the environment
            observation, reward, done, info = env.step(int(action))
            observation = torch.tensor(observation, dtype=torch.float)
            rewards_episode.append(reward)
            
        rewards_episode =  np.array(rewards_episode)
        logproba_episode = torch.stack(logproba_episode)
        values_episode = torch.cat(values_episode)
        entropy_episode = torch.sum(torch.stack(entropy_episode))
        
        return rewards_episode, logproba_episode, values_episode, entropy_episode
    
    def train(self, env, n_trajectories, n_update):
        for episode in range(n_update):
            mean_reward, std_reward, min_reward, max_reward = self.optimize_model(env, n_trajectories)
            print("Episode {}".format(episode+1))
            print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(mean_reward, std_reward, min_reward, max_reward))

    def evaluate(self, env, n_trajectories):
        reward_trajectories = np.zeros(n_trajectories)
        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            observation = torch.tensor(observation, dtype=torch.float)
            reward_episode = 0
            done = False
            
            while not done:
                env.render()
                action = self.model.value_action(observation)[1]
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                reward_episode += reward
            
            reward_trajectories[i] = reward_episode
        env.close()
        print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(reward_trajectories.mean(), reward_trajectories.std(), 
                      reward_trajectories.min(), reward_trajectories.max()))
        

In [142]:
agent = A2CAgent(env.observation_space.shape[0], env.action_space.n, 0.99)

In [143]:
agent.train(env, 100, 10)



tensor(1267.4742, grad_fn=<AddBackward0>)
Episode 1
Reward:μσmM 9.44 0.75 8.00 11.00
tensor(1099.6851, grad_fn=<AddBackward0>)
Episode 2
Reward:μσmM 9.31 0.69 8.00 11.00
tensor(997.5997, grad_fn=<AddBackward0>)
Episode 3
Reward:μσmM 9.37 0.86 8.00 11.00
tensor(861.1323, grad_fn=<AddBackward0>)
Episode 4
Reward:μσmM 9.33 0.71 8.00 11.00
tensor(730.2023, grad_fn=<AddBackward0>)
Episode 5
Reward:μσmM 9.32 0.69 8.00 11.00
tensor(592.4122, grad_fn=<AddBackward0>)
Episode 6
Reward:μσmM 9.25 0.77 8.00 11.00
tensor(492.8332, grad_fn=<AddBackward0>)
Episode 7
Reward:μσmM 9.38 0.80 8.00 11.00
tensor(388.9459, grad_fn=<AddBackward0>)
Episode 8
Reward:μσmM 9.42 0.84 8.00 11.00
tensor(273.0960, grad_fn=<AddBackward0>)
Episode 9
Reward:μσmM 9.28 0.74 8.00 11.00
tensor(203.6145, grad_fn=<AddBackward0>)
Episode 10
Reward:μσmM 9.31 0.76 8.00 11.00


In [None]:
class A2CAgent(A2CBaseAgent):
    
    def optimize_model(self, n_trajectories):
        