<a href="https://colab.research.google.com/github/rlberry-py/tutorials/blob/main/A2C/(Solution)_Tutorial_Advantage_Actor_Critic_(A2C).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial - Advantage Actor Critic (A2C)

A2C keeps two neural networks:
*   One network with parameters $\theta$ to represent the policy $\pi_\theta$.
*   One network with parameters $\omega$ to represent a value function $V_\omega$ that approximates $V^{\pi_\theta}$.


At each iteration, A2C collects $M$ transitions $(s_i, a_i, r_i, s_i')_{i=1}^M$ by following the policy $\pi_\theta$. If a terminal state is reached, we simply go back to the initial state and continue to play $\pi_\theta$ until we gather the $M$ transitions.
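
The collection step can be sketched in plain Python as follows. This is only a minimal illustration: `ToyChainEnv` and `collect_transitions` are hypothetical names that do not appear in the tutorial, and the environment is assumed to follow the older Gym convention where `step` returns `(next_state, reward, done, info)`.

```python
import random


class ToyChainEnv:
    """Hypothetical environment: states 0..length, reaching `length` is terminal."""

    def __init__(self, length=3):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves one step to the right, action 0 stays in place
        self.state = min(self.state + action, self.length)
        done = self.state == self.length
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


def collect_transitions(env, policy, M):
    """Collect exactly M transitions (s, a, r, s', done), resetting at terminal states."""
    transitions = []
    state = env.reset()
    for _ in range(M):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        # go back to the initial state when the episode ends
        state = env.reset() if done else next_state
    return transitions


random.seed(0)
batch = collect_transitions(ToyChainEnv(), lambda s: random.choice([0, 1]), M=10)
print(len(batch))  # 10
```

Note that a batch may contain several (possibly truncated) episodes, which is why the `done` flag of each transition is stored alongside it.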

Consider the following quantities, defined based on the collected transitions:

$$
\widehat{V}(s_i) = \widehat{Q}(s_i, a_i) = \sum_{t=i}^{\tau_i \wedge M} \gamma^{t-i} r_t + \gamma^{M-i+1} V_\omega(s_M')\mathbb{I}\{\tau_i>M\}
$$

where $\tau_i = \min\{t\geq i: s_t' \text{ is a terminal state}\}$, and

$$
\mathbf{A}_\omega(s_i, a_i) = \widehat{Q}(s_i, a_i) -  V_\omega(s_i)  
$$
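
The two definitions above can be computed with a single backward pass over the batch. The sketch below is one possible way to do it in plain Python; the function names `bootstrapped_returns` and `advantages` are illustrative, and `next_value` stands for $V_\omega(s_M')$.

```python
def bootstrapped_returns(rewards, dones, next_value, gamma):
    """Backward recursion for the targets Q_hat(s_i, a_i).

    rewards[i] and dones[i] describe transition i; next_value plays the role
    of V_omega(s'_M) and only matters if the last transition is not terminal.
    """
    M = len(rewards)
    returns = [0.0] * M
    running = next_value  # value of the state following the last transition
    for i in reversed(range(M)):
        if dones[i]:
            running = 0.0  # nothing is accumulated past a terminal state
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def advantages(returns, values):
    """A_omega(s_i, a_i) = Q_hat(s_i, a_i) - V_omega(s_i)."""
    return [q - v for q, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0, 1.0]
dones = [False, True, False, False]
returns = bootstrapped_returns(rewards, dones, next_value=10.0, gamma=0.5)
print(returns)  # [1.5, 1.0, 4.0, 6.0]
```

In this example the second transition ends an episode, so its target is just its own reward, while the last two transitions bootstrap from `next_value`.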


A2C then takes a gradient step to minimize the policy "loss" (keeping $\omega$ fixed):

$$
L_\pi(\theta) =
-\frac{1}{M} \sum_{i=1}^M \mathbf{A}_\omega(s_i, a_i) \log \pi_\theta(a_i|s_i)
- \frac{\alpha}{M}\sum_{i=1}^M \sum_a  \pi_\theta(a|s_i) \log \frac{1}{\pi_\theta(a|s_i)}
$$
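
To make the two terms of $L_\pi$ concrete, here is a small numeric sketch in plain Python. It only evaluates the loss for given numbers (no gradients are computed; in practice this would be done on tensors so that automatic differentiation applies), and the function name `policy_loss` is an illustrative choice.

```python
import math


def policy_loss(advantages, log_probs, action_probs, alpha):
    """L_pi = -mean(A_i * log pi(a_i|s_i)) - alpha * mean(entropy(pi(.|s_i)))."""
    M = len(advantages)
    pg_term = -sum(a * lp for a, lp in zip(advantages, log_probs)) / M
    entropy = sum(
        -sum(p * math.log(p) for p in probs if p > 0.0) for probs in action_probs
    ) / M
    return pg_term - alpha * entropy


# Two transitions, two actions; the policy is uniform in both states.
action_probs = [[0.5, 0.5], [0.5, 0.5]]
log_probs = [math.log(0.5), math.log(0.5)]  # log pi(a_i | s_i) of the taken actions
advantages = [1.0, -1.0]
loss = policy_loss(advantages, log_probs, action_probs, alpha=0.1)
print(loss)  # equals -0.1 * log(2): the advantage terms cancel here
```

Because the entropy enters with a negative sign, minimizing the loss pushes the policy toward higher entropy, which encourages exploration.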

and a gradient step to minimize the value loss (keeping $\theta$ fixed):

$$
L_v(\omega) = \frac{1}{M} \sum_{i=1}^M \left( \widehat{V}(s_i) - V_\omega(s_i)   \right)^2
$$
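
As a short worked step, assuming the targets $\widehat{V}(s_i)$ are held fixed during the update (they are computed from the collected batch beforehand, with the bootstrap value detached), the gradient of the value loss is:

$$
\nabla_\omega L_v(\omega) = -\frac{2}{M} \sum_{i=1}^M \left( \widehat{V}(s_i) - V_\omega(s_i) \right) \nabla_\omega V_\omega(s_i)
$$

so each gradient step moves $V_\omega(s_i)$ toward its target $\widehat{V}(s_i)$.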
 


# Reminders


Objective function:

$$
J(\theta) = \mathbb{E}_{\pi_\theta}
\left[ 
  \sum_{t=0}^\infty \gamma^t r(S_t, A_t)
\right]
$$

Policy gradient:

$$
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}
\left[ 
  \sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(S_t, A_t) 
  \nabla_\theta \log \pi_\theta(A_t|S_t)
\right]
$$
where $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ is the advantage function.
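
Subtracting $V^{\pi_\theta}(s)$ from $Q^{\pi_\theta}(s, a)$ does not change the expected gradient, because any term that does not depend on the action is multiplied by $\mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$. The sketch below checks this numerically for a softmax policy over three actions, differentiating with respect to the logits; the logits and the baseline value are arbitrary illustrative numbers.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


logits = [0.2, -1.0, 0.7]  # arbitrary illustrative values
probs = softmax(logits)
baseline = 3.0  # any quantity that does not depend on the action

# E_{a ~ pi}[ baseline * grad log pi(a) ], computed exactly by summing over actions
expected = [
    sum(probs[a] * baseline * grad_log_softmax(logits, a)[k] for a in range(3))
    for k in range(3)
]
print(all(abs(g) < 1e-12 for g in expected))  # True
```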

# Colab setup

In [None]:
# After installing, restart the kernel

# install rlberry library
!git clone https://github.com/rlberry-py/rlberry.git 
!cd rlberry && git pull && pip install -e .[full] > /dev/null 2>&1
!pip install ffmpeg-python > /dev/null 2>&1

# gym
!pip install 'gym[all]' > /dev/null 2>&1

# packages required to show video
!pip install pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

# ask to restart runtime
print("")
print(" ~~~  Libraries installed, please restart the runtime! ~~~ ")
print("")

Cloning into 'rlberry'...
remote: Enumerating objects: 491, done.
remote: Counting objects: 100% (491/491), done.
remote: Compressing objects: 100% (309/309), done.
remote: Total 3560 (delta 296), reused 335 (delta 179), pack-reused 3069
Receiving objects: 100% (3560/3560), 889.06 KiB | 1.22 MiB/s, done.
Resolving deltas: 100% (2290/2290), done.
Already up to date.

 ~~~  Libraries installed, please restart the runtime! ~~~ 



In [None]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40)  # error only

import torch
import torch.nn as nn
import torch.nn.functional as F 
from torch import optim

import numpy as np


# for videos
import rlberry.colab_utils.display_setup
from rlberry.colab_utils.display_setup import show_video

In [None]:
class ActorNetwork(nn.Module):
    """
     This network represents the policy
    """

    def __init__(self, input_size, hidden_size, action_size):
        super(ActorNetwork, self).__init__()
        self.n_actions = action_size
        self.dim_observation = input_size
        
        self.net = nn.Sequential(
            nn.Linear(in_features=self.dim_observation, out_features=hidden_size),
            nn.ReLU(),
            nn.Linear(in_features=hidden_size, out_features=hidden_size),
            nn.ReLU(),
            nn.Linear(in_features=hidden_size, out_features=self.n_actions),
            nn.Softmax(dim=-1)
        )
        
    def policy(self, state):
        # as_tensor accepts NumPy arrays as well as tensors, without a copy warning
        state = torch.as_tensor(state, dtype=torch.float)
        return self.net(state)
    
    def sample_action(self, state):
        # policy() handles the conversion to a tensor
        action = torch.multinomial(self.policy(state), 1)
        return action.item()

In [None]:
class ValueNetwork(nn.Module):
  """
   This class represents the value function
  """

  def __init__(self, input_size, hidden_size, output_size):
      super(ValueNetwork, self).__init__()
      self.fc1 = nn.Linear(input_size, hidden_size)
      self.fc2 = nn.Linear(hidden_size, hidden_size)
      self.fc3 = nn.Linear(hidden_size, output_size)

  def forward(self, x):
      out = F.relu(self.fc1(x))
      out = F.relu(self.fc2(out))
      out = self.fc3(out)
      return out
  
  def value(self, state):
      state = torch.tensor(state, dtype=torch.float)
      return self.forward(state)

In [None]:
# You can select your environment here
env_id = 'CartPole-v1'  # @param ["CartPole-v1", "LunarLander-v2", "MountainCar-v0"]
env = gym.make(env_id)
eval_env = gym.make(env_id) # environment to evaluate the policy

INFO: Making new env: CartPole-v1
INFO: Making new env: CartPole-v1


In [None]:
# Define your networks
value_network = ValueNetwork(env.observation_space.shape[0], 16, 1)
actor_network = ActorNetwork(env.observation_space.shape[0], 16, env.action_space.n)
print(value_network)
print(actor_network)

# Define your optimizers
value_network_optimizer = torch.optim.RMSprop(value_network.parameters(), lr=0.01)
actor_network_optimizer = torch.optim.RMSprop(actor_network.parameters(), lr=0.01)

# --------------------------------------------------------------
# Parameters
# --------------------------------------------------------------
num_iterations = 300     # Number of iterations
batch_size = 512         # How many samples to collect (value of M)
gamma = 1                # Discount factor
alpha = 0.001            # Entropy term coefficient
reward_threshold = 495   # Stop training when the policy achieves this amount of rewards


# --------------------------------------------------------------
# Train
# --------------------------------------------------------------
for iteration in range(num_iterations):
    # Initialize batch storage
    batch_losses = torch.zeros(batch_size)
    batch_returns = np.zeros(batch_size)


    # np.float and np.bool were removed in NumPy 1.24; use explicit dtypes instead
    states = np.empty((batch_size,) + env.observation_space.shape, dtype=np.float64)        # shape (batch_size, state_dim)
    rewards = np.empty((batch_size,), dtype=np.float64)                                     # shape (batch_size, )
    next_states = np.empty((batch_size,) + env.observation_space.shape, dtype=np.float64)   # shape (batch_size, state_dim)
    dones = np.empty((batch_size,), dtype=bool)                                             # shape (batch_size, )
    proba = torch.empty((batch_size,), dtype=torch.float)                                   # shape (batch_size, ), store pi(a_t|s_t)
    next_value = 0                               # bootstrap value V(s'_M); stays 0 if the batch ends in a terminal state
  
    # Initialize environment
    state = env.reset()

    # Generate batch
    for i in range(batch_size):
        action = actor_network.sample_action(state)
        next_state, reward, done, _ = env.step(action)

        states[i] = state
        rewards[i] = reward
        next_states[i] = next_state
        dones[i] = done
        proba[i] = actor_network.policy(state)[action]

        state = next_state
        if done:
            state = env.reset()

    if not done:
        next_value = value_network.value(next_states[-1]).detach().numpy()[0]

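    # compute returns (backward recursion over the batch)
    returns = np.zeros((batch_size,), dtype=np.float64)
    T = batch_size
    for j in range(T):
        returns[T-j-1] = rewards[T-j-1]
        if j > 0:
            # bootstrap from the next return unless transition T-j-1 ended its episode
            returns[T-j-1] += gamma * returns[T-j] * (1 - dones[T-j-1])
        else:
            # last transition: next_value is 0 if it was terminal
            returns[T-j-1] += gamma * next_value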

    # compute advantage
    values = value_network.value(states)
    advantages = returns - values.detach().numpy().squeeze()

    # Compute MSE
    value_network_optimizer.zero_grad()
    loss_value = F.mse_loss(values, torch.tensor(returns, dtype=torch.float).unsqueeze(1)) 
    loss_value.backward()
    value_network_optimizer.step()

    # compute entropy term
    dist = actor_network.policy(states)
    entropy_term = -(dist*dist.log()).sum(-1).mean()

    # Compute Actor Gradient
    actor_network_optimizer.zero_grad()
    loss_policy = -torch.mean(torch.log(proba) * torch.tensor(advantages, dtype=torch.float))
    loss_policy += -alpha * entropy_term
    loss_policy.backward()
    actor_network_optimizer.step()

    # evaluate the current policy every 10 iterations
    if (iteration + 1) % 10 == 0:
        eval_rewards = np.zeros(5)
        for sim in range(5):
            eval_done = False
            eval_state = eval_env.reset()
            while not eval_done:
                eval_action = actor_network.sample_action(eval_state)
                eval_next_state, eval_reward, eval_done, _ = eval_env.step(eval_action)
                eval_rewards[sim] += eval_reward
                eval_state = eval_next_state
        print("Iteration = {}, loss_value = {:0.3f}, loss_policy = {:0.3f}, rewards = {:0.2f}"
              .format(iteration +1, loss_value.item(), loss_policy.item(), eval_rewards.mean()))
        if (eval_rewards.mean() > reward_threshold):
            break

ValueNetwork(
  (fc1): Linear(in_features=4, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=16, bias=True)
  (fc3): Linear(in_features=16, out_features=1, bias=True)
)
ActorNetwork(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=2, bias=True)
    (5): Softmax(dim=-1)
  )
)




Iteration = 10, loss_value = 1631.849, loss_policy = 11.716, rewards = 65.00
Iteration = 20, loss_value = 413.188, loss_policy = -5.621, rewards = 61.80
Iteration = 30, loss_value = 572.310, loss_policy = -7.139, rewards = 62.20
Iteration = 40, loss_value = 3875.631, loss_policy = 13.241, rewards = 106.20
Iteration = 50, loss_value = 2029.424, loss_policy = 4.542, rewards = 195.60
Iteration = 60, loss_value = 1659.521, loss_policy = -9.202, rewards = 57.40
Iteration = 70, loss_value = 22123.277, loss_policy = 24.466, rewards = 253.40
Iteration = 80, loss_value = 320.646, loss_policy = -0.344, rewards = 127.40
Iteration = 90, loss_value = 30235.520, loss_policy = 34.976, rewards = 165.20
Iteration = 100, loss_value = 6986.101, loss_policy = 8.136, rewards = 392.60
Iteration = 110, loss_value = 17869.715, loss_policy = 12.621, rewards = 445.40
Iteration = 120, loss_value = 19477.117, loss_policy = 4.367, rewards = 500.00


In [None]:
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)
for episode in range(1):
    done = False
    state = env.reset()
    while not done:
        action = actor_network.sample_action(state)
        state, reward, done, info = env.step(action)
env.close()
show_video(directory="./gym-results")

INFO: Creating monitor directory ./gym-results
INFO: Starting new video recorder writing to /content/gym-results/openaigym.video.0.532.video000000.mp4




INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/content/gym-results')


# Test other environments!

Try some of the other environments available in OpenAI Gym ([link](https://gym.openai.com/envs/#classic_control)). Suggestion: use the `classic control` or `Box2D` environments.