# Reinforce & Advantage Actor-Critic (A2C)

[You can find the original paper here](https://arxiv.org/pdf/1602.01783.pdf).

## Intro

In this tutorial we will focus on Deep Reinforcement Learning with **Reinforce** and the **Advantage Actor-Critic** algorithm. This tutorial is composed of:
* A quick reminder of the RL setting,
* A theoretical approach to Reinforce,
* A theoretical approach to A2C,
* An introduction to the deep learning framework **PyTorch**,
* A coding part with experiments.


## Introduction to PyTorch

*If you already know PyTorch you can skip this part. From this part on we assume that you have some experience with Python and NumPy.*

PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system

At a granular level, PyTorch is a library that consists of the following components:

| Component | Description |
| ---- | --- |
| [**torch**](https://pytorch.org/docs/stable/torch.html) | a Tensor library like NumPy, with strong GPU support |
| [**torch.autograd**](https://pytorch.org/docs/stable/autograd.html) | a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch |
| [**torch.jit**](https://pytorch.org/docs/stable/jit.html) | a compilation stack (TorchScript) to create serializable and optimizable models from PyTorch code  |
| [**torch.nn**](https://pytorch.org/docs/stable/nn.html) | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| [**torch.multiprocessing**](https://pytorch.org/docs/stable/multiprocessing.html) | Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training |
| [**torch.utils**](https://pytorch.org/docs/stable/data.html) | DataLoader and other utility functions for convenience |



PyTorch works much like NumPy: PyTorch tensors are the equivalent of NumPy arrays.

In [7]:
import torch
import numpy as np

You can initialize a zero-filled tensor just like in NumPy.

In [28]:
torch.zeros(5,3)

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

In [30]:
torch.eye(3)

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

You can also convert an array to a tensor.

In [32]:
torch.tensor(np.eye(3))

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], dtype=torch.float64)

And you can transform a tensor to an array.

In [33]:
a_tensor = torch.tensor(np.eye(3))
a_tensor.numpy()

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

You can sum, subtract, and multiply tensors just like in NumPy.

In [35]:
a = torch.randint(0,10,(2,3))
print(a)

tensor([[7, 4, 3],
        [7, 3, 1]])


In [36]:
b = torch.randint(0,10,(2,3))
print(b)

tensor([[4, 9, 1],
        [1, 4, 0]])


In [43]:
print(f'a + b = {a + b}')
print(f'a * b = {a * b}')

a + b = tensor([[11, 13,  4],
        [ 8,  7,  1]])
a * b = tensor([[28, 36,  3],
        [ 7, 12,  0]])


You can also compute matrix products.

In [46]:
a @ b.t()

tensor([[67, 23],
        [56, 19]])

### AUTOGRAD: automatic differentiation

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

``torch.Tensor`` is the central class of the package. If you set its attribute
``.requires_grad`` as ``True``, it starts to track all operations on it. When
you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically. The gradient for this tensor will be
accumulated into its ``.grad`` attribute.

To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. This can be particularly helpful when evaluating a
model because the model may have trainable parameters with
``requires_grad=True``, but for which we don't need the gradients.

There’s one more class which is very important for autograd
implementation - a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic
graph, that encodes a complete history of computation. Each tensor has
a ``.grad_fn`` attribute that references a ``Function`` that has created
the ``Tensor`` (except for Tensors created by the user - their
``grad_fn is None``).

If you want to compute the derivatives, you can call ``.backward()`` on
a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single
element), you don’t need to specify any arguments to ``backward()``.
If it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.
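The mechanisms described in this section can be seen together in a small example. This is a minimal sketch: the tensor values and the variable names (`x`, `y`, `first_grad`, `z`, `w`, `v`) are chosen for illustration only.

```python
import torch

# A tensor created by the user: it is tracked, but its grad_fn is None
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y results from operations on x, so it references the Function that created it
y = (x ** 2).sum()

# y is a scalar, so backward() needs no argument
y.backward()
first_grad = x.grad.clone()  # dy/dx = 2 * x
print(first_grad)

# detach() gives a tensor that is cut from the computation history
z = x.detach()

# Operations inside no_grad() are not tracked
with torch.no_grad():
    w = x * 2

# Gradients accumulate in .grad, so reset them before a new backward pass
x.grad.zero_()

# v is not a scalar: backward() needs a gradient tensor of matching shape
v = x * 3
v.backward(gradient=torch.ones_like(v))
print(x.grad)
```

Note the explicit `x.grad.zero_()`: without it, the second call to `backward()` would add to the gradient left by the first one.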

## Reminder of the RL setting

As always we will consider an MDP $M = (\mathcal{X}, \mathcal{A}, p, r, \gamma)$ with:
* $\mathcal{X}$ the state space,
* $\mathcal{A}$ the action space,
* $p(x^\prime \mid x, a)$ the transition probability,
* $r(x, a, x^\prime)$ the reward of the transition $(x, a, x^\prime)$,
* $\gamma \in [0,1)$ the discount factor.

A policy $\pi$ maps each state in $\mathcal{X}$ to a probability distribution over the actions in $\mathcal{A}$.

The action-value function of a policy is the expected discounted return obtained by taking action $a$ in state $s$ and following $\pi$ afterwards: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0=s, a_0=a\big]$, where $R(\tau)$ is the random variable defined as the sum of the discounted rewards along the trajectory $\tau$. (States are written $x$ or $s$ interchangeably in this tutorial.)

The goal is to find a policy that maximizes the expected discounted return:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t R_t \mid x_0, \pi \big]$$
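As a concrete illustration of the quantity inside the expectation, the discounted return of a single trajectory can be computed as follows. This is a minimal sketch: the function name `discounted_return` and the reward values are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over one trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Three steps with reward 1 each: 1 + 0.9 + 0.81
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))
```

$J(\pi)$ is the average of this quantity over trajectories sampled by following $\pi$.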

# Gym + Random agent

# REINFORCE

## Introduction

Reinforce is an actor-based, **on-policy** method. The policy $\pi_{\theta}$ is parametrized by a function approximator (e.g. a neural network).

Recall: $$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t R_t \mid x_0, \pi \big].$$

To update the parameters $\theta$ of the policy, one has to do gradient ascent: $\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta}J(\pi_{\theta})|_{\theta_{k}}$.

Advantages of this approach:
- Compared to a Q-learning approach, the policy is parametrized directly, so a small change in the parameters leads to a small change in the policy. With Q-learning, the policy is derived from the value estimates (e.g. greedily), so a small change in those estimates can change the selected action abruptly.
- The stochasticity of the policy allows exploration. In off-policy learning, one has to deal with both a target policy and a separate behaviour policy used for exploration.
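The first point can be illustrated with a toy comparison. This is only a sketch with made-up numbers: the same two scores are read once as Q-values followed greedily and once as the logits of a softmax policy, and the helper names `softmax` and `greedy` are illustrative.

```python
import math


def softmax(values):
    # Subtract the max for numerical stability
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def greedy(values):
    # Deterministic policy: all the probability on the best-scoring action
    best = max(range(len(values)), key=lambda k: values[k])
    return [1.0 if k == best else 0.0 for k in range(len(values))]


before = [1.00, 1.01]
after = [1.01, 1.00]  # a tiny change in the scores

# Change in the probability of selecting action 0
greedy_change = abs(greedy(before)[0] - greedy(after)[0])
softmax_change = abs(softmax(before)[0] - softmax(after)[0])
print(greedy_change, softmax_change)
```

The greedy policy switches entirely from one action to the other, while the softmax policy barely moves.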

## Policy Gradient Theorem

Q.1: Prove the Policy Gradient Theorem: $$ \displaystyle \nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)}\right]$$

Hint 1 - The probability of a trajectory $\tau = (s_{0}, a_{0},\dots, s_{T+1})$ with actions chosen from $\displaystyle \pi_{\theta}$ is $P(\tau|\theta) = \rho_{0}(s_{0})\prod_{t=0}^{T}P\left(s_{t+1}|s_{t}, a_{t}\right) \pi_{\theta}(a_{t}|s_{t})$, where $\rho_{0}$ is the initial state distribution.

Hint 2 - Gradient-log trick: $  \nabla_{\theta}P(\tau|\theta)= P(\tau|\theta)\nabla_{\theta}\log P(\tau|\theta). $
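Combining the two hints gives a short derivation. This is a sketch, not a complete proof: it writes $J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[R(\tau)\right]$ and assumes that the gradient and the integral over trajectories can be interchanged.

$$
\begin{aligned}
\nabla_{\theta} J(\pi_{\theta})
&= \nabla_{\theta} \int P(\tau|\theta) R(\tau) \, d\tau \\
&= \int \nabla_{\theta} P(\tau|\theta) R(\tau) \, d\tau \\
&= \int P(\tau|\theta) \nabla_{\theta} \log P(\tau|\theta) R(\tau) \, d\tau \\
&= E_{\tau \sim \pi_{\theta}}\left[ \nabla_{\theta} \log P(\tau|\theta) R(\tau) \right] \\
&= E_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau) \right]
\end{aligned}
$$

The third line uses Hint 2. The last line uses Hint 1: taking the logarithm turns the product into a sum, and the terms $\log \rho_{0}(s_{0})$ and $\log P(s_{t+1}|s_{t}, a_{t})$ do not depend on $\theta$, so their gradient vanishes.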

The policy gradient can therefore be approximated with:
$$ \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau) $$
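To see this estimator at work, here is a sketch on a deliberately tiny problem where the exact gradient is known. Everything in it is an assumption made for illustration: a one-step problem with two actions, a softmax policy with one parameter per action, fixed rewards, and the helper names.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, action):
    # For a softmax policy: d/d theta_k log pi(action) = 1[k == action] - pi(k)
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def estimate_gradient(theta, rewards, n_samples, rng):
    # Monte Carlo average of grad log pi(a) * R over sampled one-step trajectories
    probs = softmax(theta)
    g_hat = [0.0] * len(theta)
    for _ in range(n_samples):
        action = rng.choices(range(len(theta)), weights=probs)[0]
        score = grad_log_pi(theta, action)
        for k in range(len(theta)):
            g_hat[k] += score[k] * rewards[action] / n_samples
    return g_hat


def exact_gradient(theta, rewards):
    # Same expectation, computed by summing over the actions
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for action, p in enumerate(probs):
        score = grad_log_pi(theta, action)
        for k in range(len(theta)):
            grad[k] += p * score[k] * rewards[action]
    return grad


rng = random.Random(0)
theta = [0.0, 0.0]    # uniform policy over two actions
rewards = [1.0, 0.0]  # only action 0 is rewarded
g_hat = estimate_gradient(theta, rewards, 20000, rng)
g = exact_gradient(theta, rewards)
print(g_hat, g)
```

The estimate points toward increasing the parameter of the rewarded action, in agreement with the exact gradient.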

Q.2: Implement the REINFORCE algorithm

### REINFORCE code

In [8]:
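# Note: the code below uses the classic Gym API (gym < 0.26), where env.reset()
# returns only the observation and env.step() returns 4 values.
import gym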
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F 

In [52]:
class Model(nn.Module):
    def __init__(self, dim_observation, n_actions):
        super(Model, self).__init__()
        
        self.n_actions = n_actions
        self.dim_observation = dim_observation
        
        self.net = nn.Sequential(
            nn.Linear(in_features=self.dim_observation, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=8),
            nn.ReLU(),
            nn.Linear(in_features=8, out_features=self.n_actions),
            nn.Softmax(dim=-1)  # normalize over the action dimension (also valid for batched inputs)
        )
        
    def forward(self, state):
        return self.net(state)
    
    def select_action(self, state):
        # Sample one action index from the action probabilities given by the policy
        action = torch.multinomial(self.forward(state), 1)
        return action

In [61]:
class BaseAgent:
    
    def __init__(self, dim_observation, n_actions, gamma):
        self.model = Model(dim_observation=dim_observation, n_actions=n_actions)
        self.gamma = gamma
        self.optimizer = torch.optim.Adam(self.model.net.parameters(), lr=0.01)
    
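    def _make_returns(self, rewards):
        # Discounted reward-to-go: returns[t] = rewards[t] + gamma * returns[t + 1]
        # A float array avoids truncation if the rewards happen to be integers
        returns = np.zeros(len(rewards))
        returns[-1] = rewards[-1]
        for t in reversed(range(len(rewards) - 1)):
            returns[t] = rewards[t] + self.gamma * returns[t + 1]
        return returns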
    
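    def optimize_model(self, env, n_trajectories):
        # Implemented by subclasses; must return (mean, std, min, max) of the episode returns
        raise NotImplementedError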
    
    def train(self, env, n_trajectories, n_update):
        for episode in range(n_update):
            mean_reward, std_reward, min_reward, max_reward = self.optimize_model(env, n_trajectories)
            print("Episode {}".format(episode+1))
            print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(mean_reward, std_reward, min_reward, max_reward))

    def evaluate(self, env, n_trajectories):
        reward_trajectories = np.zeros(n_trajectories)
        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            observation = torch.tensor(observation, dtype=torch.float)
            reward_episode = 0
            done = False
            
            while not done:
                env.render()
                action = self.model.select_action(observation)
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                reward_episode += reward
            
            reward_trajectories[i] = reward_episode
        env.close()
        print("Reward:μσmM {:.2f} {:.2f} {:.2f} {:.2f}"
              .format(reward_trajectories.mean(), reward_trajectories.std(), 
                      reward_trajectories.min(), reward_trajectories.max()))
        

In [62]:
class SimpleAgent(BaseAgent):
    def __init__(self, dim_observation, n_actions, gamma):
        super(SimpleAgent, self).__init__(dim_observation=dim_observation, n_actions=n_actions, gamma=gamma)
    
    def optimize_model(self, env, n_trajectories):
        weighted_logproba = torch.zeros(n_trajectories)
        reward_trajectories = np.zeros(n_trajectories)

        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            rewards_episode = []
            logproba_episode = []
            discount_factor = 1
            observation = torch.tensor(observation, dtype=torch.float)
            done = False
            
            while not done:
                action = self.model.select_action(observation)
                logproba_episode.append(torch.log(self.model.forward(observation))[action])
                # Interaction with the environment
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                rewards_episode.append(discount_factor * reward)
                discount_factor *= self.gamma
            
            cum_rewards_episode = np.sum(rewards_episode)
            weighted_logproba[i]= cum_rewards_episode * torch.cat(logproba_episode).sum()
            reward_trajectories[i] = cum_rewards_episode
            
        loss = - weighted_logproba.mean()
        
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()


## Don't let the past distract you

- The sum of rewards during one episode has a high variance, which affects the performance of this version of REINFORCE.
- To assess the quality of an action, it makes more sense to take into consideration only the rewards obtained after taking this action.
- It can be proven that $$  \nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})}\right].$$
- Bonus: prove this claim.
- This reduces the variance of the gradient estimate: the terms involving rewards obtained before the action have zero mean but nonzero variance, so they only add noise.
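The inner sum in the formula above is often called the reward-to-go. The sketch below contrasts it with the full episode return used by the first version of REINFORCE. The function name `rewards_to_go` and the reward values are made up for the example, and the `gamma` argument is an addition: `gamma=1.0` matches the undiscounted formula.

```python
def rewards_to_go(rewards, gamma):
    """returns[t] = rewards[t] + gamma * rewards[t + 1] + gamma**2 * rewards[t + 2] + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]

# Plain REINFORCE weights every log-probability by the same full return
full_return = sum(rewards)

# The reward-to-go version only counts rewards from step t onwards
to_go = rewards_to_go(rewards, gamma=1.0)
print(full_return, to_go)
```

The first action is still weighted by the full return, but later actions no longer get credit for rewards collected before they were taken.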

Q3: Implement this enhanced version of REINFORCE

In [63]:
class EnhanceAgent(BaseAgent):
    def __init__(self, dim_observation, n_actions, gamma):
        super(EnhanceAgent, self).__init__(dim_observation=dim_observation, n_actions=n_actions, gamma=gamma)
    
    def optimize_model(self, env, n_trajectories):
        weighted_logproba = torch.zeros(n_trajectories)
        reward_trajectories = np.zeros(n_trajectories)

        for i in range(n_trajectories):
            # New episode
            observation = env.reset()
            rewards_episode = []
            logproba_episode = []
            observation = torch.tensor(observation, dtype=torch.float)
            done = False
            
            while not done:
                action = self.model.select_action(observation)
                logproba_episode.append(torch.log(self.model.forward(observation))[action])
                # Interaction with the environment
                observation, reward, done, info = env.step(int(action))
                observation = torch.tensor(observation, dtype=torch.float)
                # Store the raw reward: _make_returns applies the discount itself,
                # so discounting here as well would discount twice when gamma < 1
                rewards_episode.append(reward)
            
            inverse_cum_rewards = self._make_returns(rewards_episode)
            reward_trajectories[i] = inverse_cum_rewards[0]
            inverse_cum_rewards = torch.tensor(inverse_cum_rewards, dtype=torch.float)
            weighted_logproba[i]= torch.sum(inverse_cum_rewards * torch.cat(logproba_episode))
            
        loss = - weighted_logproba.mean()
        
        self.optimizer.zero_grad()
        # Compute the gradient 
        loss.backward()
        # Do the gradient descent step
        self.optimizer.step()
        return reward_trajectories.mean(), reward_trajectories.std(), reward_trajectories.min(), reward_trajectories.max()
   

In [65]:
# Make the environment
env = gym.make("CartPole-v1")

# Seeds
seed = 1
env.seed(seed=seed)
np.random.seed(seed=seed)
torch.manual_seed(seed=seed)

observation = env.reset()

n_actions = env.action_space.n
dim_observation = observation.shape[0]

Agent = SimpleAgent(dim_observation=dim_observation, n_actions=n_actions, gamma=1)
Agent.train(env=env, n_trajectories=50, n_update=50)

Episode 1
Reward:μσmM 19.44 9.94 8.00 48.00
Episode 2
Reward:μσmM 21.26 10.15 8.00 58.00
Episode 3
Reward:μσmM 20.42 9.13 8.00 56.00
Episode 4
Reward:μσmM 19.28 8.99 8.00 55.00
Episode 5
Reward:μσmM 20.62 7.93 10.00 37.00
Episode 6
Reward:μσmM 22.54 12.58 8.00 77.00
Episode 7
Reward:μσmM 21.64 10.05 9.00 54.00
Episode 8
Reward:μσmM 26.70 13.61 11.00 89.00
Episode 9
Reward:μσmM 24.00 13.84 10.00 64.00
Episode 10
Reward:μσmM 22.54 12.84 9.00 64.00
Episode 11
Reward:μσmM 23.64 12.61 9.00 89.00
Episode 12
Reward:μσmM 25.46 12.30 10.00 69.00
Episode 13
Reward:μσmM 23.20 8.82 10.00 49.00
Episode 14
Reward:μσmM 23.56 11.03 9.00 49.00
Episode 15
Reward:μσmM 23.94 18.52 10.00 132.00
Episode 16
Reward:μσmM 31.60 20.62 10.00 93.00
Episode 17
Reward:μσmM 29.34 18.09 10.00 93.00
Episode 18
Reward:μσmM 29.42 19.27 9.00 112.00
Episode 19
Reward:μσmM 31.10 16.00 12.00 88.00
Episode 20
Reward:μσmM 29.92 15.94 9.00 77.00
Episode 21
Reward:μσmM 28.12 13.73 11.00 69.00
Episode 22
Reward:μσmM 35.40 24.27 1

In [64]:
Agent = EnhanceAgent(dim_observation=dim_observation, n_actions=n_actions, gamma=1)
Agent.train(env=env, n_trajectories=50, n_update=50)

Episode 1
Reward:μσmM 16.20 8.98 8.00 61.00
Episode 2
Reward:μσmM 17.10 8.74 10.00 48.00
Episode 3
Reward:μσmM 19.70 12.42 9.00 80.00
Episode 4
Reward:μσmM 20.12 8.60 10.00 39.00
Episode 5
Reward:μσmM 18.44 7.33 10.00 45.00
Episode 6
Reward:μσmM 20.18 9.06 9.00 52.00
Episode 7
Reward:μσmM 18.46 8.05 9.00 53.00
Episode 8
Reward:μσmM 23.22 11.99 9.00 64.00
Episode 9
Reward:μσmM 22.10 9.45 11.00 57.00
Episode 10
Reward:μσmM 24.42 12.38 10.00 82.00
Episode 11
Reward:μσmM 24.46 14.12 10.00 77.00
Episode 12
Reward:μσmM 23.92 10.57 9.00 51.00
Episode 13
Reward:μσmM 28.94 16.70 9.00 86.00
Episode 14
Reward:μσmM 27.22 12.61 10.00 68.00
Episode 15
Reward:μσmM 28.68 20.64 11.00 118.00
Episode 16
Reward:μσmM 35.28 18.55 10.00 93.00
Episode 17
Reward:μσmM 29.28 17.24 11.00 83.00
Episode 18
Reward:μσmM 28.38 16.87 10.00 92.00
Episode 19
Reward:μσmM 35.26 24.87 9.00 120.00
Episode 20
Reward:μσmM 34.96 21.66 9.00 123.00
Episode 21
Reward:μσmM 40.34 23.07 12.00 100.00
Episode 22
Reward:μσmM 33.68 22.48

## From REINFORCE to A2C

### The idea behind A2C

The need for a critic.

### A2C