# Assignment 4

# Advantage Actor Critic (A2C)


- Here, we're going **to train a robotic arm** (the Franka Emika Panda robot) to perform a task:
- `Reach`: the robot must place its end-effector at a target position.

### Environments:

- [Panda-Gym](https://github.com/qgallouedec/panda-gym)

### RL-Library:

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)

## Create a virtual display 🔽

- During the notebook, we'll need to generate a replay video. To do so in Colab, **we need a virtual screen to render the environment** (and thus record the frames).

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay
!pip install stable-baselines3[extra]
!pip install gymnasium
!pip install huggingface_sb3
!pip install huggingface_hub
!pip install panda_gym

## Initialising the Display class

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

### Install dependencies 🔽

The following dependencies were installed in the cell above:
- `gymnasium`: Maintained fork of OpenAI's Gym, providing a common interface for different environments.
- `panda-gym`: Contains the robotics arm environments.
- `stable-baselines3`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.


## Import the packages 📦

In [None]:
import os
import gymnasium as gym
import panda_gym
from huggingface_sb3 import load_from_hub, package_to_hub
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env
from huggingface_hub import notebook_login

## PandaReachDense-v3 🦾

The agent we're going to train is a robotic arm that must be controlled (moving the arm and using the end-effector).

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **provides a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This contrasts with a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
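
To make the contrast concrete, here is a minimal sketch of the two reward styles in plain Python. The exact reward formulas and the `DISTANCE_THRESHOLD` value are assumptions chosen for illustration; they are not taken from the panda-gym source.

```python
import math

# Hypothetical success threshold (in metres), chosen for illustration only.
DISTANCE_THRESHOLD = 0.05


def dense_reward(achieved, desired):
    """Negative Euclidean distance: the value improves at every step as the arm approaches."""
    return -math.dist(achieved, desired)


def sparse_reward(achieved, desired, threshold=DISTANCE_THRESHOLD):
    """0.0 once the target is reached, -1.0 otherwise (an assumed convention)."""
    return 0.0 if math.dist(achieved, desired) <= threshold else -1.0


target = (0.1, 0.0, 0.2)
for position in [(0.4, 0.0, 0.2), (0.2, 0.0, 0.2), (0.12, 0.0, 0.2)]:
    print(position, round(dense_reward(position, target), 3), sparse_reward(position, target))
```

With the dense version every step carries a learning signal, whereas the sparse version only changes value once the end-effector is within the threshold.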

We're also going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics"/>


This way **the training will be easier**.
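
As a rough sketch of what displacement control means, the snippet below turns an action into a new end-effector target by adding a scaled offset to the current position. The function name, the `[-1, 1]` clipping and the `MAX_STEP` scale are hypothetical choices for illustration, not panda-gym's actual implementation.

```python
# Hypothetical scale: maximum displacement per step along each axis (metres).
MAX_STEP = 0.05


def apply_displacement(position, action, max_step=MAX_STEP):
    """Return the new target position of the end-effector after one action."""
    clipped = [max(-1.0, min(1.0, a)) for a in action]  # keep each component in [-1, 1]
    return tuple(p + max_step * a for p, a in zip(position, clipped))


position = (0.0, 0.0, 0.2)
action = (1.0, -0.5, 2.0)  # the last component exceeds the range and is clipped to 1.0
print(apply_displacement(position, action))
```

The agent only has to choose a 3D offset; working out the joint angles that realize it is left to the simulator's controller.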



### Create the environment

#### The environment 🎮

In `PandaReachDense-v3` the robotic arm must place its end-effector at a target position (green ball).

In [None]:
env_id = "PandaReachDense-v3"

# Create the env
env = gym.make(env_id)
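# The observation space is a Dict, so its own `.shape` is None; collect the shape of each entry instead
s_size = {key: space.shape for key, space in env.observation_space.spaces.items()}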
a_size = env.action_space.shape

## Showing a random sample for the state and action space

In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample())

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample())

The observation space **is a dictionary with 3 different elements**:
- `achieved_goal`: (x,y,z) the current position of the end-effector.
- `desired_goal`: (x,y,z) the target position for the end-effector.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).
---
- Because the observation is a dictionary, we will use `MultiInputPolicy` instead of `MlpPolicy`.
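
One way to picture how a dictionary observation reaches a network is to flatten every entry and concatenate the results into a single feature vector. The helper below is a plain-Python sketch of that idea with made-up values; it is not the code Stable-Baselines3 uses internally.

```python
def flatten_observation(obs):
    """Concatenate every entry of a dictionary observation into one flat list."""
    features = []
    for key in sorted(obs):  # fixed key order so the layout is reproducible
        features.extend(obs[key])
    return features


# Illustrative values with the shapes described above (3 + 3 + 6 numbers).
obs = {
    "achieved_goal": [0.04, 0.00, 0.20],
    "desired_goal": [0.10, -0.05, 0.15],
    "observation": [0.04, 0.00, 0.20, 0.0, 0.0, 0.0],
}
flat = flatten_observation(obs)
print(len(flat))
```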

### Normalize observation and rewards

A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).

For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.

We also normalize rewards with this same wrapper by setting `norm_reward=True`.

[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)

In [None]:
env = make_vec_env(env_id, n_envs=4)

# Adding this wrapper to normalize the observation and the reward
new_env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
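
To illustrate the idea behind observation normalization, here is a minimal sketch of a running mean and variance tracker that standardizes and clips a scalar input. The `RunningNormalizer` class is hypothetical and written for this example; `VecNormalize` differs in its details (it works on batches of arrays, for instance).

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance, then standardizes and clips inputs."""

    def __init__(self, clip=10.0, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean (Welford's algorithm)
        self.clip = clip
        self.epsilon = epsilon

    def update(self, x):
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.count if self.count else 1.0

    def normalize(self, x):
        """Standardize x with the current statistics and clip the result."""
        z = (x - self.mean) / math.sqrt(self.variance + self.epsilon)
        return max(-self.clip, min(self.clip, z))


normalizer = RunningNormalizer(clip=10.0)
for value in [2.0, 4.0, 6.0, 8.0]:
    normalizer.update(value)

print(normalizer.mean, normalizer.variance)
print(normalizer.normalize(5.0), normalizer.normalize(1000.0))
```

The clipping step plays the same role as `clip_obs`: an outlier cannot produce an arbitrarily large input to the network.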

### Creating the A2C Model

For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes


In [None]:
# Train on the normalized environment so the VecNormalize statistics are used and updated
model = A2C("MultiInputPolicy", new_env, verbose=1)

### Train the A2C agent 🏃
- Training our agent for 1,000,000 timesteps.

In [None]:
model.learn(1_000_000)

In [None]:
# Save the model together with the VecNormalize statistics
model.save("a2c-PandaReachDense-v3")
new_env.save("vec_normalize.pkl")

### Evaluate the agent
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the running statistics at test time
eval_env.training = False
# Report the original (unnormalized) rewards
eval_env.norm_reward = False

# Load and evaluate the agent
model = A2C.load("a2c-PandaReachDense-v3")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
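
The two numbers printed above summarize the total reward collected in each evaluation episode. As a small sketch of that summary step, the helper below computes the mean and the population standard deviation of a list of episode returns; the returns are made up for illustration and are not results from this notebook.

```python
import statistics


def summarize_returns(episode_returns):
    """Return the mean and population standard deviation of episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Made-up returns for illustration; these are not results from this notebook.
example_returns = [-0.30, -0.25, -0.20, -0.35, -0.15]
mean_return, std_return = summarize_returns(example_returns)
print(f"Mean reward = {mean_return:.2f} +/- {std_return:.2f}")
```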

Actor-Critic (AC) algorithms have an advantage over DQN mainly in complex environments with continuous action spaces and where the variance of learning needs to be kept low.
1. Continuous action space limitation
  - DQNs are typically limited to discrete action spaces because they choose actions by maximizing Q-values over a finite set of possible actions. Using them with continuous actions requires modifications, such as discretizing the action space.
  - Actor-Critic algorithms handle continuous action spaces because a separate actor network outputs the action directly instead of evaluating each candidate action individually.

2. Reduced Overestimation Bias in Q-Value Estimation
  - DQNs can suffer from overestimation bias because the Q-learning target takes a maximum over estimated Q-values, which tends to pick up positive estimation errors.
  - Actor-Critic methods combine value-based and policy-based learning: the critic evaluates the behaviour of the current policy rather than maximizing over all actions, which helps reduce this bias and can lead to more stable learning.
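
The overestimation effect can be reproduced with a small simulation. In this sketch every action has a true value of 0 and the estimates are corrupted by zero-mean Gaussian noise; the number of actions, the noise level and the seed are arbitrary choices for illustration.

```python
import random

random.seed(0)

N_ACTIONS = 4
N_TRIALS = 10_000
NOISE_STD = 1.0  # illustrative noise level for the value estimates

max_total = 0.0
single_total = 0.0
for _ in range(N_TRIALS):
    # Every action has a true value of 0; each estimate is the true value plus noise.
    estimates = [random.gauss(0.0, NOISE_STD) for _ in range(N_ACTIONS)]
    max_total += max(estimates)   # what a max-based target would use
    single_total += estimates[0]  # a single estimate, with no max taken

max_bias = max_total / N_TRIALS
single_bias = single_total / N_TRIALS
print(f"mean of max over actions: {max_bias:.3f}")
print(f"mean of a single estimate: {single_bias:.3f}")
```

Although each individual estimate is unbiased, the average of the maximum is clearly positive, which is the bias a max-based target inherits.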

- However, DQN is generally more data-efficient than Actor-Critic methods. When the action space is small, DQN's simple exploration mechanism is often good enough to get results, but DQN does not extend well to continuous action spaces.
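
The difference in how the two families choose actions can be sketched as follows. Epsilon-greedy is used here as an example of a simple exploration rule over a finite set of Q-values, and a Gaussian sample around an actor's output stands in for a continuous policy; both functions and all the numbers are hypothetical illustrations, not code from Stable-Baselines3.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def gaussian_policy_sample(mean, std, rng):
    """Sample a continuous action vector around the mean proposed by an actor."""
    return [rng.gauss(m, std) for m in mean]


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]     # one estimate per discrete action (made up)
actor_mean = [0.2, -0.1, 0.5]  # one value per action dimension (made up)

print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # greedy choice
print(gaussian_policy_sample(actor_mean, std=0.1, rng=rng))
```

The first function needs one Q-value per possible action, which is only feasible for a finite action set; the second produces a real-valued vector directly.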
