# Training for Reinforcement Learning of a Lunar Lander
This is the first tutorial of Hugging Face's Deep Reinforcement Learning course. I wanted to run everything on my PC, but struggled with the installation. The problem seems to be that the Python versions in my installations are too new for some of the RL libraries.

The following steps solved these installation problems.

## Installing libraries in WSL2
WSL2 can now run Linux GUI applications without a separate X server.
### Install Anaconda and python-opengl
The libraries require python-opengl and xvfb. I am installing the other packages as well because this combination works for now.
1. Install Anaconda3 on WSL2
2. ```sudo apt-get install python-opengl```
3. ```sudo apt-get install cmake zlib1g-dev xorg-dev libgtk2.0-0 swig python-opengl xvfb``` - note that ```python-matplotlib``` is not installed

### Create the environment
Initially I created the conda environment with python=3.5, but the more recent python=3.9.12 seems to work just fine.

Don't use python=3.10.

You can set the Python version explicitly using
```
conda create --name gym python=3.5
```
or use conda's default Python version with
```
conda create --name gym
```
Activate the environment

```
source activate gym
```
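Once the environment is active, a quick check can confirm that the interpreter is older than 3.10. This is only a sketch: `python_version_ok` is a hypothetical helper, and the cutoff reflects what worked in this setup rather than an official compatibility list.

```python
import sys


def python_version_ok(version_info=sys.version_info):
    """Return True if the interpreter is older than Python 3.10.

    Hypothetical helper: the cutoff encodes this tutorial's experience
    (3.9.12 worked, 3.10 did not), not an official requirement.
    """
    major, minor = version_info[0], version_info[1]
    return (major, minor) < (3, 10)


status = "OK" if python_version_ok() else "too new for this setup"
print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {status}")
```

Run it inside the activated environment; if it reports that the version is too new, recreate the environment with an older `python=` value.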
Now install the OpenAI Gym package

```
pip install gym

pip install gym[atari]

pip install gym[box2d]

pip install box2d-py

pip install box2d

pip install box2d-kengz
```
I had huge issues with gym[box2d], but it now installs just fine with this method. I couldn't run PacMan because I don't have a license.

I am sure this method will work for my Linux problems too. But I am having issues with the CUDA installation after the update to 22.04. I might need to install 20.04 again to solve these.


### Install Python Virtual Display
The next problem was with the virtual display.


```pip install pyvirtualdisplay```

I used PyCharm to install this.

The final issue that kills the WSL setup is that Jupyter doesn't work.

### Install PyTorch
I installed PyTorch using the pip command from pytorch.org.



In [33]:
from pyvirtualdisplay import Display
virtual_display = Display(visible=True, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7fee30a37d60>

In [2]:
import torch
torch.cuda.is_available()

True

## Installing Huggingface Libraries
```
pip install huggingface-sb

pip install huggingface-hub

pip install stable-baselines3

```

## What is Gym and how does it work
The library containing our environment is called Gym. Gym is widely used in Deep Reinforcement Learning.

The Gym library provides two things:
* An interface that allows you to create RL environments
* A collection of environments (gym-control, atari, box2d)

### The reinforcement learning loop
A recap on the RL loop:

1. The agent receives state S0 from the environment: the first frame of the game.
2. The agent takes action A0: for example, it moves to the right.
3. The environment creates a new state S1: a new frame of the game.
4. The environment gives a reward R1 to the agent: if it is not dead, a positive reward of +1.
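The four steps can be sketched without any RL library. `ToyEnv` below is a hypothetical stand-in for a game, not a Gym environment: action 1 moves the agent one cell to the right, and every step survived earns +1 until the step limit is reached.

```python
import random


class ToyEnv:
    """Hypothetical toy game used only to illustrate the loop."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial state S0
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, action 0 stays in place
        if action == 1:
            self.position += 1
        self.steps += 1
        done = self.steps >= self.max_steps
        reward = 1.0  # +1 for every step the agent is still alive
        return self.position, reward, done


def run_episode(env, rng):
    state = env.reset()                          # 1. agent receives S0
    total_reward = 0.0
    done = False
    while not done:
        action = rng.choice([0, 1])              # 2. agent takes an action (random here)
        state, reward, done = env.step(action)   # 3. new state, 4. reward
        total_reward += reward
    return total_reward


total = run_episode(ToyEnv(max_steps=5), random.Random(0))
print(f"Episode return: {total}")
```

A learning agent would replace the random choice with a policy that uses `state` and is updated from the rewards it collects.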

### The RL loop in Gym
1. The environment is created by ```gym.make()```
2. Reset the environment to its initial state with ```observation = env.reset()```
3. Using ```env.step(action)``` we perform an action in the environment (here a random action) and receive:
    * ```observation```: The new state S1
    * ```reward```: Reward for the action
    * ```done```: Indicates if the episode terminated
    * ```info```: A dictionary that provides additional information (depends on the environment)

If the episode is done, we reset the environment to its initial state with ```observation = env.reset()```.

In [3]:
import gym
# Create the environment called LunarLander V2 from box2d

env = gym.make("LunarLander-v2")

# Reset the environment to S0
observation = env.reset()

for _ in range(20):
    action = env.action_space.sample() # take a random action
    print(f"Action taken {action}")

    # Perform the action and receive the next_state, reward, done and info
    observation, reward, done, info = env.step(action)

    if done:
        print("Environment is reset") # reset the environment
        observation = env.reset()

Action taken 3
Action taken 1
Action taken 1
Action taken 1
Action taken 3
Action taken 2
Action taken 0
Action taken 1
Action taken 0
Action taken 2
Action taken 0
Action taken 0
Action taken 1
Action taken 0
Action taken 3
Action taken 2
Action taken 3
Action taken 2
Action taken 2
Action taken 0


## Creating the LunarLander Environment and understanding how it works
We are going to train a Lunar Lander to land correctly on the moon. We need the agent to learn to adapt its speed and position (horizontal, vertical and angular) to land correctly.

### Lunar Lander Documentation
[lunar_lander](https://www.gymlibrary.ml/environments/box2d/lunar_lander/)

In [4]:
# create the environment with gym.make()
env = gym.make("LunarLander-v2")
env.reset()
print("----------------Observation Space-------------------\n")
print(f'Observation Space Shape {env.observation_space.shape}')
print(f'Sample observation {env.observation_space.sample()}') # get a random observation

----------------Observation Space-------------------

Observation Space Shape (8,)
Sample observation [-0.86316216  0.08773206 -0.4644909  -0.90330577 -0.5794051  -0.6585583
  1.1527103   0.08829654]


The observation shape ```(8,)``` tells us that the observation is a vector of size 8, where each value is a piece of information about the lander:

1. Horizontal pad coordinate (x)
2. Vertical pad coordinate (y)
3. Horizontal speed (x)
4. Vertical speed (y)
5. Angle
6. Angular speed
7. Whether the left leg contact point touches the ground
8. Whether the right leg contact point touches the ground
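To make the eight positions easier to read in code, the vector can be unpacked into named fields. This is a minimal sketch: `LanderObservation` and `unpack_observation` are hypothetical names, the example values are made up, and treating a leg flag above 0.5 as "in contact" is an assumption.

```python
from typing import NamedTuple


class LanderObservation(NamedTuple):
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_speed: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_observation(obs):
    """Turn an 8-element observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_speed, left, right = (float(v) for v in obs)
    # Assumption: the leg flags are 0.0 or 1.0, so 0.5 is used as the cutoff
    return LanderObservation(x, y, vx, vy, angle, angular_speed,
                             left > 0.5, right > 0.5)


# Made-up observation: slightly right of the pad, descending, left leg down
example = unpack_observation([0.05, 1.4, 0.1, -0.5, 0.02, 0.0, 1.0, 0.0])
print(example)
```

Note that `env.observation_space.sample()` draws random values from the space, so a sampled vector does not have to look like a state the lander can actually reach.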

In [5]:
print("\n___________________Action Space_______________\n")
print(f"Action Space Shape {env.action_space.n}")
print(f'Action Space Sample {env.action_space.sample()}') # takes a random action


___________________Action Space_______________

Action Space Shape 4
Action Space Sample 1


The action space is the set of possible actions the agent can make. It is discrete with 4 actions available:
1. Do nothing
2. Fire left orientation engine
3. Fire the main engine
4. Fire right orientation engine
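In code the actions are numbered from 0, as the `Action taken` output earlier shows, so the list above corresponds to the integers 0 to 3. A small sketch of that mapping, where `LanderAction` and `describe` are hypothetical names chosen for illustration:

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete actions."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action):
    """Return a readable label for an integer action."""
    return LanderAction(int(action)).name.replace("_", " ").lower()


for action in LanderAction:
    print(int(action), describe(action))
```

Using it in the random loop would look like `print(describe(env.action_space.sample()))`.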

The reward function gives a reward at each time step:
1. Moving from the top of the screen to the landing pad and coming to zero speed is worth about 100 to 140 points
2. Firing the main engine costs 0.3 points each frame
3. Each leg ground contact is worth +10 points
4. The episode ends with -100 points if the lander crashes or +100 points if it comes to rest
5. The game is considered solved if the agent scores 200 points
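As a rough worked example, the rules above can be added up for one episode. This is a simplified tally under stated assumptions, not the environment's actual reward code: `episode_score` is a hypothetical helper and the input numbers are invented.

```python
SOLVED_THRESHOLD = 200  # from the last rule in the list


def episode_score(approach_points, main_engine_frames, leg_contacts, outcome):
    """Simplified episode tally based on the listed rules.

    approach_points: points for reaching the pad (roughly 100 to 140)
    main_engine_frames: frames during which the main engine fired
    leg_contacts: number of legs touching the ground (0, 1 or 2)
    outcome: "crash", "rest", or None if neither happened
    """
    score = approach_points
    score -= 0.3 * main_engine_frames
    score += 10 * leg_contacts
    if outcome == "crash":
        score -= 100
    elif outcome == "rest":
        score += 100
    return score


def is_solved(score):
    return score >= SOLVED_THRESHOLD


good_landing = episode_score(120, 100, 2, "rest")
crash = episode_score(120, 100, 0, "crash")
print(good_landing, is_solved(good_landing))
print(crash, is_solved(crash))
```

With these invented numbers the clean landing ends at about 210 points and counts as solved, while the crash ends slightly below zero.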

### Vectorize the Environment
A vectorized environment is a way to stack multiple independent environments into a single environment.
We create a vectorized environment of 16 environments, so that we will have a more diverse experience during training.

In [6]:
from stable_baselines3.common.env_util import  make_vec_env
# create the environment with 16 independent environment scenarios
env = make_vec_env('LunarLander-v2', n_envs=16)

## Creating the Model
We have created an environment in which the Lunar Lander can land on a landing pad by controlling the left and right orientation engines and the main engine.

We now need to build the algorithm that will solve the problem.

We use the Deep RL library Stable-Baselines3 (SB3) to do this.
SB3 is a set of reliable implementations of reinforcement learning algorithms in PyTorch.

### Stable-Baselines3
[documentation and tutorials](https://stable-baselines3.readthedocs.io/en/master/)

### Solving the problem with SB3
We are going to use SB3's PPO. PPO (Proximal Policy Optimization) is one of the state-of-the-art Deep Reinforcement Learning algorithms that will be studied in this course.

PPO is a combination of:
* Value-based reinforcement learning: learning an action-value function that tells us the most valuable action to take in a given state
* Policy-based reinforcement learning: learning a policy that gives us a probability distribution over actions
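The policy-based half can be pictured as turning one preference score per action into probabilities and then sampling from them. The sketch below uses a softmax for that step; the preference numbers are invented, and in PPO they would come from the neural network rather than a hard-coded list.

```python
import math
import random


def softmax(preferences):
    """Convert raw preference scores into probabilities that sum to 1."""
    highest = max(preferences)
    exps = [math.exp(p - highest) for p in preferences]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probabilities, rng):
    """Draw one action index according to the given probabilities."""
    indices = range(len(probabilities))
    return rng.choices(indices, weights=probabilities, k=1)[0]


# Invented scores for the four Lunar Lander actions in some state
action_preferences = [0.1, 0.5, 2.0, 0.5]
probs = softmax(action_preferences)
action = sample_action(probs, random.Random(0))

print([round(p, 3) for p in probs])
print("sampled action:", action)
```

Because actions are sampled rather than always taking the highest score, the agent keeps exploring during training; `deterministic=True` at evaluation time asks SB3 to pick the most likely action instead.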

### Setting up SB3
1. Create the environment
2. Define the model you want to use and instantiate the model with ```model = PPO('MlpPolicy', env)```
3. Train the agent with ```model.learn``` and define the number of training time steps

In [7]:
# Define a PPO MlpPolicy architecture
# MultilayerPerceptron Policy
# We use the MlpPolicy because we are using as input a vector
# If we use frames as input we will use CnnPolicy

In [8]:
from stable_baselines3 import PPO
model  = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1
)

Using cuda device


## Train the PPO agent
* Train the agent for 1,000,000 time steps, the value passed to ```total_timesteps``` in the cell below.



In [16]:
model.learn(total_timesteps=1000000)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 289      |
|    ep_rew_mean     | 272      |
| time/              |          |
|    fps             | 3663     |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 16384    |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 285          |
|    ep_rew_mean          | 269          |
| time/                   |              |
|    fps                  | 2633         |
|    iterations           | 2            |
|    time_elapsed         | 12           |
|    total_timesteps      | 32768        |
| train/                  |              |
|    approx_kl            | 0.0050933086 |
|    clip_fraction        | 0.051        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.781       |
|    explained_variance   | 0.946        |
|    learning_r

<stable_baselines3.ppo.ppo.PPO at 0x7feef9249850>
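The `total_timesteps` values in the log (16384, then 32768) follow from the settings used earlier: each iteration collects `n_steps` steps from each of the `n_envs` environments. The arithmetic below is a sketch of how I understand SB3's PPO to use these parameters, so treat the derived numbers as an estimate rather than a guarantee.

```python
import math

# Values taken from the cells above
n_envs = 16
n_steps = 1024
batch_size = 64
n_epochs = 4
total_timesteps = 1_000_000

# Steps collected per iteration across all environments
rollout_size = n_envs * n_steps

# Each rollout is split into minibatches and reused for n_epochs passes
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs

# Training continues in whole rollouts until the target is reached
iterations = math.ceil(total_timesteps / rollout_size)
timesteps_actually_run = iterations * rollout_size

print(f"rollout size: {rollout_size}")
print(f"gradient steps per iteration: {gradient_steps_per_iteration}")
print(f"iterations: {iterations} ({timesteps_actually_run} timesteps)")
```

This also explains why the final timestep count of a run is usually a little larger than the requested `total_timesteps`.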

## Evaluate the agent
* Stable-Baselines3 provides a method called ```evaluate_policy```
* You can find the documentation [here](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading)

When evaluating the agent, we create the evaluation environment, different from the training environment.


In [18]:
from stable_baselines3.common.evaluation import evaluate_policy
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")



mean_reward = 285.37 +/- 16.61
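The two numbers summarize the total reward of each evaluation episode as a mean and a standard deviation. A minimal sketch of that summary, using invented episode returns rather than the ones from this run, and assuming a population standard deviation:

```python
import statistics


def summarize_returns(episode_returns):
    """Return the mean and population standard deviation of episode returns."""
    mean_reward = statistics.fmean(episode_returns)
    std_reward = statistics.pstdev(episode_returns)
    return mean_reward, std_reward


# Invented returns for three evaluation episodes
example_returns = [260.0, 280.0, 300.0]
example_mean, example_std = summarize_returns(example_returns)
print(f"mean_reward = {example_mean:.2f} +/- {example_std:.2f}")
```

A small standard deviation relative to the mean suggests the agent lands consistently rather than alternating between good landings and crashes.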


## Save the model

In [21]:
model.save("ppo-LunarLander-v2")

After training for 10 million steps I got a mean reward of ```285.37 +/- 16.61```

## Publish the trained model on the Hub
We can publish the trained model to the Hub with one line of code. By using ```package_to_hub``` you evaluate the agent, record a replay, and generate a model card for it, and everything is then pushed to the Hub.

* You can showcase the work
* Visualize the agent playing
* Share with the community an agent others can use
* Access a leaderboard to see how well the agent is performing compared to the classmates at [the leaderboard](https://huggingface.co/spaces/ThomasSimonini/Lunar-Lander-Leaderboard)

To enable sharing the model you need to:
1. Create an account at HF
2. Sign in and then get the authentication token from the HF website:
    * Create a new token with write role
    * Copy the token
    * Run the cell with the pasted token

I can't use the notebook login, so I used ```huggingface-cli login``` instead and pasted the token into the command line.

In [None]:
from huggingface_hub import notebook_login
# notebook_login()
# !git config --global credential.helper store

3. Push the trained agent to the Hub using ```package_to_hub``` function


### Package to hub
```package_to_hub``` needs:
* ```model```: the trained model
* ```model_name```: the name of the trained model that we defined in ```model_save```
* ```model_architecture```: the model architecture we used (in our case PPO)
* ```env_id```: the name of the environment, in our case ```LunarLander-v2```
* ```eval_env```: the evaluation environment defined in ```eval_env```
* ```repo_id```: the name of the HF Hub repository that will be created or updated (```repo_id = {username}/{repo_name}```); a good name is ```{username}/{model_architecture}-{env_id}```
* ```commit_message```: message of the commit
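The suggested naming pattern for `repo_id` can be written as a one-line helper. `build_repo_id` is a hypothetical function, and lowercasing the architecture is my own choice so that the result matches the repository name used later in this notebook.

```python
def build_repo_id(username, model_architecture, env_id):
    """Build a repo id of the form {username}/{model_architecture}-{env_id}.

    Hypothetical helper; the architecture is lowercased by choice.
    """
    return f"{username}/{model_architecture.lower()}-{env_id}"


repo_id_example = build_repo_id("wesleywt", "PPO", "LunarLander-v2")
print(repo_id_example)
```

Keeping the algorithm and environment in the name makes it easier to tell models apart once several of them are on the Hub.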

### Git Large File Storage
Uploading requires Git LFS.

Download the package from [git-lfs](https://git-lfs.github.com/).

Go to the download directory, then unpack and install it:

```
tar -xf git-lfs-linux-amd64-v2.9.0.tar.gz

chmod 755 install.sh

sudo ./install.sh
```

Go to repository directory (gym)

```
git lfs install
```
You only need to do this once.

## Uploading to HF
Note that this notebook cannot start the video of the lander. If you run the same code from a Python file it all works, and the model is then uploaded to HF. Future tutorials should be done in a Python script.

In [32]:
# HF imports
import gym
from huggingface_sb3 import package_to_hub
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

# Define the name of the environment
env_id = "LunarLander-v2"

# Define the name of the trained model that we defined in model_save
model_name = "ppo-LunarLander-v2_test"

# Define the model architecture
model_architecture = "PPO"

# repo_id is the id of the model repository on the HF Hub (repo_id = {username}/{repo_name})

repo_id = "wesleywt/ppo-LunarLander-v2"

# Commit message
commit_message = "First commit of model"


# Create evaluation env
eval_env = DummyVecEnv([lambda : gym.make(env_id)])

# Fill in the package_to_hub
package_to_hub(model=model,
               model_name=model_name,
               model_architecture=model_architecture,
               env_id=env_id,
               eval_env=eval_env,
               repo_id=repo_id,
               commit_message=commit_message)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: If you encounter a bug, please open an issue and use
push_to_hub instead.


/home/wesley/PycharmProjects/deep-rl-class/unit1/hub/ppo-LunarLander-v2 is already a clone of https://huggingface.co/wesleywt/ppo-LunarLander-v2. Make sure you pull the latest changes with `repo.git_pull()`.


ContextException: Could not create GL context

We have now trained the Deep RL agent, but the upload from the notebook fails because Jupyter cannot create the GL context needed to record the video of the agent.

Maybe use JupyterLab?

It works in a normal Python script: ```train_upload_lunar_lander_v2.py```



## Additional Challenges
We can tune the hyperparameters to get better training results, for example by increasing the number of training steps.

Some ideas:
1. Train more steps
2. Hyperparameter optimization. You can find the parameters [here](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters)
3. Try other model architectures such as DQN. You can find them [here](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html)
4. Push and compare the results on the [leaderboard](https://huggingface.co/spaces/ThomasSimonini/Lunar-Lander-Leaderboard)
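One simple way to start on the hyperparameter idea is a random search: draw a few configurations, train each one, and keep the best. The sketch below only does the drawing. The candidate values in `SEARCH_SPACE` are my own guesses for illustration, not recommended settings, although the key names are the PPO parameters already used in this notebook.

```python
import random

# Hypothetical candidate values; adjust them to your own budget and hardware
SEARCH_SPACE = {
    "n_steps": [512, 1024, 2048],
    "batch_size": [32, 64, 128],
    "n_epochs": [4, 10],
    "gamma": [0.99, 0.995, 0.999],
    "gae_lambda": [0.95, 0.98],
    "ent_coef": [0.0, 0.01],
}


def sample_config(rng):
    """Pick one value per hyperparameter at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


rng = random.Random(42)  # fixed seed so the same candidates are drawn each run
candidates = [sample_config(rng) for _ in range(3)]
for config in candidates:
    print(config)
```

Each dictionary could then be passed on as `PPO('MlpPolicy', env, **config)`, followed by `model.learn` and `evaluate_policy` to compare the candidates.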

## Other environments
Try other environments such as CartPole-v1 or MountainCar-v0. Check out how they work [here](https://www.gymlibrary.ml/).

## Weights & Biases
I am thinking of using W&B for hyperparameter optimization.