# REINFORCE & Advantage Actor-Critic (A2C)

[You can find the original paper here](https://arxiv.org/pdf/1602.01783.pdf).

## Intro

In this tutorial we will focus on deep reinforcement learning with **REINFORCE** and the **Advantage Actor-Critic** (A2C) algorithm. This tutorial is composed of:
* A quick reminder of the RL setting,
* A theoretical approach to REINFORCE,
* A theoretical approach to A2C,
* An introduction to the deep learning framework **PyTorch**,
* A coding part with experiments.


## Introduction to PyTorch

*If you already know PyTorch, you can skip this part. From here on, we assume that you have some experience with Python and NumPy.*

PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system

At a granular level, PyTorch is a library that consists of the following components:

| Component | Description |
| ---- | --- |
| [**torch**](https://pytorch.org/docs/stable/torch.html) | a Tensor library like NumPy, with strong GPU support |
| [**torch.autograd**](https://pytorch.org/docs/stable/autograd.html) | a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch |
| [**torch.jit**](https://pytorch.org/docs/stable/jit.html) | a compilation stack (TorchScript) to create serializable and optimizable models from PyTorch code  |
| [**torch.nn**](https://pytorch.org/docs/stable/nn.html) | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| [**torch.multiprocessing**](https://pytorch.org/docs/stable/multiprocessing.html) | Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training |
| [**torch.utils**](https://pytorch.org/docs/stable/data.html) | DataLoader and other utility functions for convenience |



PyTorch works much like NumPy, and PyTorch tensors are the equivalent of NumPy arrays.

In [1]:
import torch
import numpy as np

You can initialize a zero-filled tensor just like in NumPy.

In [28]:
torch.zeros(5,3)

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

In [30]:
torch.eye(3)

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

You can also convert an array to a tensor.

In [32]:
torch.tensor(np.eye(3))

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], dtype=torch.float64)

And you can transform a tensor to an array.

In [33]:
a_tensor = torch.tensor(np.eye(3))  # define the tensor so this cell runs on its own
a_tensor.numpy()

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

You can add, subtract, and multiply tensors element-wise, just like in NumPy.

In [35]:
a = torch.randint(0,10,(2,3))
print(a)

tensor([[7, 4, 3],
        [7, 3, 1]])


In [36]:
b = torch.randint(0,10,(2,3))
print(b)

tensor([[4, 9, 1],
        [1, 4, 0]])


In [43]:
print(f'a + b = {a + b}')
print(f'a * b = {a * b}')

a + b = tensor([[11, 13,  4],
        [ 8,  7,  1]])
a * b = tensor([[28, 36,  3],
        [ 7, 12,  0]])


You can also compute matrix products.

In [46]:
a @ b.t()

tensor([[67, 23],
        [56, 19]])

### AUTOGRAD: automatic differentiation

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

``torch.Tensor`` is the central class of the package. If you set its attribute
``.requires_grad`` to ``True``, it starts tracking all operations on it. When
you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically. The gradient for this tensor will be
accumulated into the ``.grad`` attribute.
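
As a minimal sketch of this workflow (the variable names `x`, `y` and `z` are illustrative only), the cell below tracks a small tensor, calls ``.backward()`` on a scalar result, and shows that a second backward pass adds to ``.grad`` rather than overwriting it.

```python
import torch

# A tensor created by the user, with operation tracking switched on.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = sum(x_i ** 2) is a scalar, so backward() needs no argument.
y = (x ** 2).sum()
y.backward()

# dy/dx_i = 2 * x_i
print(x.grad)  # tensor([2., 4., 6.])

# Gradients accumulate: this second backward pass adds 3 to each entry.
z = (3 * x).sum()
z.backward()
print(x.grad)  # tensor([5., 7., 9.])
```

Because of this accumulation, training loops usually reset the gradients (for example with ``x.grad.zero_()`` or an optimizer's ``zero_grad()``) before each new backward pass.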

To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.
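
A small sketch of what ``.detach()`` does in practice, using made-up values: the detached copy holds the same numbers but is treated as a constant during backpropagation.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
h = w * 3              # tracked: h depends on w
h_const = h.detach()   # same value (6.0), but cut off from the graph

# Only the path through h contributes to the gradient:
# out = h * 6 = 18 * w, so d(out)/dw = 18.
out = (h * h_const).sum()
out.backward()

print(h_const.requires_grad)  # False
print(w.grad)                 # tensor([18.])
```

Without the ``.detach()`` call, both factors would depend on `w` and the gradient would be 36 instead of 18.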

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. This can be particularly helpful when evaluating a
model because the model may have trainable parameters with
``requires_grad=True``, but for which we don't need the gradients.
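
For illustration, here is a sketch with a stand-in model (a single ``torch.nn.Linear`` layer and random inputs, both chosen arbitrarily): the same forward pass is run once inside ``torch.no_grad()`` and once outside.

```python
import torch

model = torch.nn.Linear(4, 2)   # its parameters have requires_grad=True
inputs = torch.randn(8, 4)

with torch.no_grad():
    eval_out = model(inputs)    # no computation history is recorded

train_out = model(inputs)       # history is recorded as usual

print(eval_out.requires_grad)   # False
print(train_out.requires_grad)  # True
```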

There’s one more class which is very important for autograd
implementation - a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic
graph that encodes a complete history of computation. Each tensor has
a ``.grad_fn`` attribute that references the ``Function`` that created
the ``Tensor`` (except for tensors created by the user, whose
``grad_fn`` is ``None``).
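
The difference between user-created tensors and tensors produced by an operation can be inspected directly. The names `u`, `v` and `m` below are illustrative only.

```python
import torch

u = torch.ones(2, 2, requires_grad=True)  # created by the user
v = u + 2                                 # created by an addition
m = v.mean()                              # created by a mean

print(u.grad_fn)                 # None
print(v.grad_fn is not None)     # True
print(type(m.grad_fn).__name__)  # name of the Function that produced m
```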

If you want to compute the derivatives, you can call ``.backward()`` on
a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single
element), you don’t need to specify any arguments to ``backward()``;
however, if it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.
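
A sketch of the non-scalar case, with an arbitrary weight vector chosen only for the example: the ``gradient`` argument weights each output element, so ``.backward()`` computes a vector-Jacobian product.

```python
import torch

p = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
q = p ** 2   # non-scalar output, shape (3,)

# One weight per element of q; the shape must match q.
weights = torch.tensor([1.0, 0.1, 0.01])
q.backward(gradient=weights)

# p.grad[i] = weights[i] * dq_i/dp_i = weights[i] * 2 * p[i],
# i.e. approximately [2.0, 0.4, 0.06].
print(p.grad)
```

Passing a tensor of ones as ``gradient`` gives the same result as calling ``q.sum().backward()``.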

## Reminder of the RL setting

As always, we consider an MDP $M = (\mathcal{X}, \mathcal{A}, p, r, \gamma)$ with:
* $\mathcal{X}$ the state space,
* $\mathcal{A}$ the action space,
* $p(x^\prime \mid x, a)$ the transition probability,
* $r(x, a, x^\prime)$ the reward of the transition $(x, a, x^\prime)$,
* $\gamma \in [0,1)$ is the discount factor.

A policy $\pi$ maps each state $x \in \mathcal{X}$ to a probability distribution over the actions in $\mathcal{A}$.

The action-value function of a policy is the expected discounted return obtained by starting from a state-action pair and following the policy afterwards: $Q^\pi(x, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid x_0=x, a_0=a\big]$, where $R(\tau)$ is the random variable defined as the sum of the discounted rewards along the trajectory $\tau$.

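The goal is to find a policy that maximizes the agent's expected discounted return, where $R_t = r(x_t, a_t, x_{t+1})$ is the reward received at step $t$: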

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t R_t \mid x_0, \pi \big]$$
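
To make the quantity inside the expectation concrete, here is a small sketch that computes $\sum_t \gamma^t R_t$ for a single finite trajectory. The function name `discounted_return` and the reward values are hypothetical and only serve the example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Walk backwards so each step needs one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return(rewards, gamma=0.5))  # 1.75
```

Averaging this value over many trajectories sampled with the policy $\pi$ gives a Monte Carlo estimate of $J(\pi)$.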

## Gym + Random agent

## REINFORCE

REINFORCE is an on-policy method **describe adv of on policy**.
* Policy gradient theorem with two hints (the likelihood of $\tau$ and the log-derivative trick) and the final formula.
* Test with "don't let the past distract you" (bonus demo).
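
The two hints can be combined into the following derivation sketch. It assumes a policy $\pi_\theta$ with parameters $\theta$ and an initial state distribution $\rho_0$; neither symbol is defined elsewhere in this tutorial, so treat them as notation introduced for this sketch only.

Hint 1, the likelihood of a trajectory $\tau = (x_0, a_0, x_1, a_1, \dots)$:

$$ P(\tau \mid \theta) = \rho_0(x_0) \prod_{t} p(x_{t+1} \mid x_t, a_t) \, \pi_\theta(a_t \mid x_t) $$

Hint 2, the log-derivative trick:

$$ \nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta) $$

Since $\rho_0$ and $p$ do not depend on $\theta$, only the policy terms survive in the gradient of the log-likelihood:

$$ \nabla_\theta \log P(\tau \mid \theta) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid x_t) $$

Writing $J(\pi_\theta) = \int P(\tau \mid \theta) R(\tau) \, d\tau$ and applying both hints gives the final formula:

$$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid x_t) \, R(\tau) \Big] $$

"Don't let the past distract you" refers to the fact that rewards received before step $t$ contribute nothing in expectation to the term for action $a_t$, so $R(\tau)$ can be replaced by the reward-to-go:

$$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid x_t) \sum_{t^\prime \geq t} \gamma^{t^\prime} R_{t^\prime} \Big] $$

Implementations often use $\gamma^{t^\prime - t}$ instead of $\gamma^{t^\prime}$ in the inner sum, which drops a factor $\gamma^t$ and is a common practical approximation rather than the exact gradient.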

### Code for REINFORCE

In [2]:
class ReinforceModel(torch.nn.Module):
    pass

In [None]:
class ReinforceAgent:
    pass
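
The two classes above are left as placeholders. As a hedged starting point, the sketch below shows one possible shape for the core pieces: a small policy network, the reward-to-go computation, and a surrogate loss. All names (`PolicySketch`, `rewards_to_go`, `reinforce_loss`), the layer sizes and the fake batch are assumptions made for this example, not the tutorial's final implementation. The reward-to-go uses the common $\gamma^{t^\prime - t}$ convention.

```python
import torch


class PolicySketch(torch.nn.Module):
    """Small MLP mapping a state vector to action logits (hypothetical)."""

    def __init__(self, state_dim, n_actions, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_actions),
        )

    def forward(self, states):
        return self.net(states)


def rewards_to_go(rewards, gamma):
    """Discounted return from each time step to the end of the episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns, dtype=torch.float32)


def reinforce_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is minus the policy-gradient estimate."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    return -(log_probs * returns).sum()


# Fake one-episode batch, only to show the shapes involved.
torch.manual_seed(0)
policy = PolicySketch(state_dim=4, n_actions=2)
states = torch.randn(3, 4)
actions = torch.tensor([0, 1, 1])
returns = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)  # tensor([1.75, 1.50, 1.00])

loss = reinforce_loss(policy, states, actions, returns)
loss.backward()  # fills .grad for every policy parameter
```

In a full agent, the states, actions and rewards would come from rolling out the current policy in an environment, and an optimizer step would follow the backward pass.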

## From REINFORCE to A2C

### The idea behind A2C

The need for a critic.
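
One way to picture the critic's role is as a learned baseline: instead of weighting log-probabilities by raw returns, the actor weights them by the advantage, that is, the return minus the critic's value estimate. The sketch below uses hand-written stand-in tensors in place of real network outputs; every number and name is an assumption for illustration.

```python
import torch

# Stand-ins for quantities a real agent would compute from rollouts.
returns = torch.tensor([1.75, 1.5, 1.0])                          # sampled returns
values = torch.tensor([1.5, 1.0, 1.2], requires_grad=True)        # critic outputs V(x_t)
log_probs = torch.tensor([-0.7, -0.2, -1.1], requires_grad=True)  # log pi(a_t | x_t)

# Advantage: how much better the sampled return was than the critic expected.
# detach() keeps the actor loss from back-propagating into the critic.
advantages = (returns - values).detach()

actor_loss = -(log_probs * advantages).mean()
critic_loss = torch.nn.functional.mse_loss(values, returns)
(actor_loss + critic_loss).backward()

print(advantages)  # approximately [0.25, 0.5, -0.2]
```

Subtracting the value estimate leaves the expected policy gradient unchanged while typically reducing its variance, which is the usual motivation for adding a critic to REINFORCE.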

### A2C