-----
# Using OpenAI Gym
-----

### Notebook Overview

In this notebook, I will practice implementing basic concepts of Reinforcement Learning using Gymnasium (the maintained successor to OpenAI Gym) and the Stable Baselines3 library.

I will be using **Lunar Lander**. Below is an introduction to the environment based on the Gymnasium documentation:

1. Action Space

    The agent can take 4 discrete (fixed) actions:
    
        - 0: Do nothing
        - 1: Fire left engine
        - 2: Fire main engine
        - 3: Fire right engine

2. Observation Space

    - Defined as `Box([ -2.5 -2.5 -10. -10. -6.2831855 -10. -0. -0. ], [ 2.5 2.5 10. 10. 6.2831855 10. 1. 1. ], (8,), float32)`
    - Observation is an 8-dimensional vector:
        1. X position
        2. Y position
        3. X velocity
        4. Y velocity
        5. Angle of lander - how tilted the lander is
        6. Angular velocity - how fast the lander is rotating
        7. Left leg - 1 if the left leg is in contact with the ground, 0 if not
        8. Right leg - 1 if the right leg is in contact with the ground, 0 if not
    - The `Box` contains two arrays: the first is the lower bound and the second is the upper bound of each component
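
To make the ordering of the observation vector easier to remember, here is a minimal sketch that unpacks an 8-value observation into named fields. The `LanderObservation` type, the `unpack_observation` helper and the example values are my own illustrative names and numbers, not part of Gymnasium.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical container naming the 8 observation components in order."""
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_observation(obs: Sequence[float]) -> LanderObservation:
    """Convert a raw 8-value observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = (float(v) for v in obs)
    # Leg contact is reported as 1.0 (contact) or 0.0 (no contact)
    return LanderObservation(x, y, vx, vy, angle, ang_vel, left > 0.5, right > 0.5)


# Made-up example values, only to show the field order
example = unpack_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example)
```

The same helper should work on the array returned by `env.reset()` or `env.step()`, since it only relies on the observation having 8 values in the order listed above.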

3. Reward

    - Goal is to land between the two flags
    - Rewards for each step:
        - increases the closer the lander is to the landing pad
        - increases the slower the lander is moving
        - decreases the more tilted the lander is
        - increases by 10 points for each leg in contact with the ground
        - decreases by 0.03 points each frame a side engine fires (seen as red dots on rendering)
        - decreases by 0.3 points each frame the main engine fires
        - additional -100/+100 points for crashing/safe landing
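
As a rough illustration of the two engine penalties listed above, the sketch below adds up the fuel cost of a sequence of actions. It only covers the engine terms (not the position, speed, tilt or landing terms), and the function names are hypothetical helpers of my own, not part of the environment.

```python
from typing import Iterable

# Penalties per frame, taken from the reward description above
SIDE_ENGINE_PENALTY = 0.03
MAIN_ENGINE_PENALTY = 0.3


def engine_penalty(action: int) -> float:
    """Return the fuel penalty for one action (ids follow the Action Space list)."""
    if action == 2:          # fire main engine
        return MAIN_ENGINE_PENALTY
    if action in (1, 3):     # fire left or right engine
        return SIDE_ENGINE_PENALTY
    return 0.0               # do nothing


def total_engine_penalty(actions: Iterable[int]) -> float:
    """Sum the fuel penalties over a sequence of actions."""
    return sum(engine_penalty(a) for a in actions)


# One of each action: 0 + 0.03 + 0.3 + 0.03
print(total_engine_penalty([0, 1, 2, 3]))
```

This makes it clear why the main engine is ten times more expensive to use than a side engine.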

4. Episode End

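    - Truncation: reached when the episode hits the time limit (1000 steps by default). A score of 200 points is the threshold at which an episode is considered solved; it does not end the episode.
    - Termination: if the lander crashes, goes out of bounds, or is asleep (has come to rest and is no longer moving).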

To establish a performance baseline, I will first implement a random policy that selects actions at random from the action space.

I will then progress to training the agent using the Proximal Policy Optimization (PPO) algorithm from Stable Baselines3.

## Imports
-----

In [1]:
import os
import numpy as np

# Gymnasium import
import gymnasium as gym

# Stable Baselines3 imports
from stable_baselines3 import PPO # ALGORITHM: Proximal Policy Optimization
from stable_baselines3.common.vec_env import DummyVecEnv # wrapper around env: trains RL agent on multiple envs at the same time to increase speed
from stable_baselines3.common.evaluation import evaluate_policy # average reward over a number of episodes

## Lunar Lander : Random Action Selection
---

In [2]:
# Initialising environment using the gym.make() function
# Setting render_mode to 'human' here to visualise the actions occurring during each step; no need to render env later if set here
env = gym.make('LunarLander-v2', render_mode = 'human')

In [3]:
# Set the number of episodes and limiting steps per episode for quicker runtime
episodes = 5 
max_steps = 1_000

# Loop through each episode, resetting env at the start of each one
for episode in range(1, episodes+1):
    obs, info = env.reset(seed=1)
    score = 0

    # Loop through each step in max_steps
    for step in range(max_steps): 
       
        # Select rand action from action_space using .sample() 
        r_action = env.action_space.sample()

        # Get info from environment after taking rand action
        obs, reward, terminated, truncated, info = env.step(r_action)  
        # adding reward to score
        score+=reward

        # Episode is over if either of flags are set, break if true
        if terminated or truncated:
            break
    
    # Returning Episode number and score
    print(f'Episode {episode} Score:{score}')

# Closing environment once done
env.close() 

Episode 1 Score:-187.4799284113057
Episode 2 Score:-427.21278868013303
Episode 3 Score:-341.04416222268674
Episode 4 Score:-127.43072586064358
Episode 5 Score:-125.8051098987291


----
**Comment:**

Each episode's score is negative.

From this, it is clear that the agent was unable to achieve the goal by taking random actions.

This shows the need for training, so that the agent understands its environment and makes informed action choices based on its observations.
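
To have a single baseline number to compare against later, the five random-policy scores printed above can be averaged. This is just a small sketch using the standard library; the variable names are my own.

```python
from statistics import mean

# Scores copied from the five random-policy episodes printed above
random_scores = [
    -187.4799284113057,
    -427.21278868013303,
    -341.04416222268674,
    -127.43072586064358,
    -125.8051098987291,
]

baseline_mean = mean(random_scores)
print(f"Random-policy mean score: {baseline_mean:.1f}")
print(f"Best random episode: {max(random_scores):.1f}")
```

With only five episodes this is a rough estimate, but it is enough to show how far random actions are from the 200-point mark.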

## Lunar Lander : Training using PPO Algorithm
---

### PPO

**What is PPO?**
 
PPO stands for Proximal Policy Optimisation.

PPO is a policy-based algorithm, meaning it learns a policy directly by increasing the probability of taking actions that lead to high rewards.

The agent first collects a batch of experience (observations, actions and rewards) using its current policy. For each action, PPO estimates an advantage: how much better or worse the outcome was than what the agent expected in that state. PPO uses these values to update the action probabilities, making actions with a positive advantage more likely and actions with a negative advantage less likely. If an update would move the new policy too far from the policy that collected the data, PPO 'clips' the change to ensure stable learning.
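
To make the clipping idea concrete, here is a minimal sketch of the clipped surrogate objective for a single action, written in plain Python. `ratio` is the probability of the action under the new policy divided by its probability under the old policy. The function name is my own, and the default `clip_range` of 0.2 is assumed to match the Stable Baselines3 default; this is an illustration of the idea, not the library's implementation.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO-style clipped objective for one (ratio, advantage) pair.

    ratio      -- new_policy_prob / old_policy_prob for the action taken
    advantage  -- how much better (+) or worse (-) the action was than expected
    clip_range -- how far the ratio may move away from 1 before being clipped
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to push the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already much more likely to take it: the gain is capped
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# Bad action, policy already much less likely to take it: no extra credit for going further
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Policy unchanged (ratio 1): the objective is just the advantage
print(clipped_surrogate(ratio=1.0, advantage=3.0))
```

In training, this value is averaged over the whole batch and maximised, which is what keeps each policy update small.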

**Why PPO?**

I chose PPO as my starting point in Reinforcement Learning because it’s both relatively simple and stable. PPO makes gradual, controlled updates to the policy, so the agent doesn’t make drastic changes all at once. 

The Stable Baselines3 documentation also has a table to help choose which algorithm to use based on the environment and action space you are working with: https://stable-baselines.readthedocs.io/en/master/guide/algos.html.

### Vectorising Env 

Before I start training, I need to vectorise the environment.

Vectorising the environment lets the agent interact with multiple instances of an environment in the same training run. So instead of training an agent on one Lunar Lander environment, I can now train it on 4 environments at once. This means each rollout gathers experience from several environments, which can speed up the training process!

In this notebook I use `DummyVecEnv`, as it's a simple wrapper to add to the environment. It steps the environments one after another in a single process rather than in parallel.
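
To show what a wrapper like this does conceptually, here is a toy sketch that steps several environment copies one after another and returns the results as a batch. `CounterEnv` and `SequentialVecEnv` are hypothetical stand-ins written for illustration; they are not Gymnasium or Stable Baselines3 classes and leave out details such as automatic resets.

```python
from typing import Callable, List, Tuple


class CounterEnv:
    """Hypothetical toy environment: the observation is a running counter."""

    def __init__(self, start: int) -> None:
        self.start = start
        self.state = start

    def reset(self) -> int:
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[int, float]:
        self.state += action
        return self.state, float(action)  # (observation, reward)


class SequentialVecEnv:
    """Toy vectorised wrapper: steps each env in turn and batches the results."""

    def __init__(self, env_fns: List[Callable[[], CounterEnv]]) -> None:
        self.envs = [fn() for fn in env_fns]

    def reset(self) -> List[int]:
        return [env.reset() for env in self.envs]

    def step(self, actions: List[int]) -> Tuple[List[int], List[float]]:
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        observations, rewards = zip(*results)
        return list(observations), list(rewards)


# Four copies, each created by its own factory function (start=i avoids late binding)
vec_env = SequentialVecEnv([lambda start=i: CounterEnv(start) for i in range(4)])
first_obs = vec_env.reset()
obs, rewards = vec_env.step([1, 1, 2, 0])
print(first_obs, obs, rewards)
```

The key point is that the agent passes in one action per environment and gets back one observation and reward per environment, so every step produces several samples instead of one.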

In [4]:
# Vectorising environment using DummyVecEnv to create 4 instances
# Not rendering the env during training (no render_mode is set)
v_env = DummyVecEnv([lambda: gym.make('LunarLander-v2') for _ in range(4)])

### Calling PPO

In [5]:
# File path to save logs to
log_path = '../../Training/Logs/'
# Using MlpPolicy as the environment is relatively simple (the observation is a small vector, not an image)
# tensorboard_log saves logs to the log path above, to be used later for TensorBoard
model = PPO('MlpPolicy', v_env, verbose = 3, tensorboard_log= log_path)

Using cpu device


In [6]:
# Using the .learn() method to train the model for a total of 1_000_000 timesteps (summed across all envs)
# The number of episodes is not set directly; it follows from total_timesteps
model.learn(total_timesteps= 1_000_000)

Logging to ../../Training/Logs/PPO_16
------------------------------
| time/              |       |
|    fps             | 14104 |
|    iterations      | 1     |
|    time_elapsed    | 0     |
|    total_timesteps | 8192  |
------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 7052         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0065644067 |
|    clip_fraction        | 0.0462       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.38        |
|    explained_variance   | 0.005244136  |
|    learning_rate        | 0.0003       |
|    loss                 | 497          |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.00566     |
|    value_loss           | 1.45e+03     |
--------------

<stable_baselines3.ppo.ppo.PPO at 0x152ee7760>
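
The `total_timesteps` column in the log grows by 8192 per iteration. A small sketch of where that number comes from, assuming the PPO rollout length `n_steps` was left at its Stable Baselines3 default of 2048 (I did not set it in this notebook):

```python
import math

n_envs = 4                   # number of environments in the DummyVecEnv
n_steps = 2048               # assumed default rollout length per environment
total_timesteps = 1_000_000  # value passed to model.learn()

# Each iteration collects n_steps from every environment before updating the policy
steps_per_iteration = n_envs * n_steps

# Training stops after the first iteration that reaches total_timesteps
iterations = math.ceil(total_timesteps / steps_per_iteration)

print(f"Timesteps per iteration: {steps_per_iteration}")
print(f"Approximate number of iterations: {iterations}")
```

Because whole rollouts are always collected, the final timestep count can end up slightly above the requested `total_timesteps`.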

### Saving of Logs and Model

In [7]:
model.save('../../Training/Saved Models/PPO_lunar_lander')

In [8]:
# Loading model back using same vectorised env as above
final_model = PPO.load('../../Training/Saved Models/PPO_lunar_lander', env = v_env)

In [9]:
# Evaluating the policy across 10 episodes
evaluate_policy(model, v_env, n_eval_episodes = 10, render = False)



(np.float64(270.83103408833716), np.float64(15.382679669192557))

------
**Comment:**

`evaluate_policy` returns the mean reward per episode and the standard deviation of the reward.

**Mean Reward: 270.8**

- Average reward across all ten episodes.
- Greater than 200, showing the agent has learned the task successfully.

**Standard Deviation: 15.4**

- Shows some variability in the agent's performance across episodes.


### Testing

In [10]:
env = gym.make('LunarLander-v2', render_mode = 'human')

In [11]:
episodes = 10

for episode in range(1, episodes+1):
    # Resetting env to initial state, ready for testing
    obs, _ = env.reset(seed = 1)
    score = 0
    done = False

    while not done:
        # Using the model.predict for action selection using policy
        action, states = model.predict(obs) 
        # Take step in the env using chosen action
        obs, reward, terminated, truncated, info = env.step(action) 
        # Update to the score using reward
        score+=reward

        done = truncated or terminated

    print(f'Episode Number {episode} Score:{score}')

# Closing after env, no longer needed
env.close()

Episode Number 1 Score:124.10840970246511
Episode Number 2 Score:124.47667406187728
Episode Number 3 Score:260.2095870624055
Episode Number 4 Score:247.21193783183256
Episode Number 5 Score:259.46422672810206
Episode Number 6 Score:255.22591452127557
Episode Number 7 Score:262.5630435909743
Episode Number 8 Score:261.2563348120568
Episode Number 9 Score:261.0228806009323
Episode Number 10 Score:260.60661251834915


-----
**Comment:**

Overall, the agent shows a strong performance, with 8 of the 10 scores above 200.

The two lowest scores (about 124) fall below 200, which may suggest the agent struggled in certain scenarios.

Perhaps worth investigating environment states for poor performing episodes to see what went wrong. Fine-tuning certain hyperparameters could also help improve consistency. However, since Lunar Lander is mainly for gaining experience with OpenAI and Stable Baselines3, I may soon shift to developing my own environment, where I can focus more on fine-tuning and tailoring the agent’s performance. 
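
To put numbers on the test run, here is a short sketch that summarises the ten scores printed above. The 200-point threshold is the score I have been treating as a successful landing; the variable names are my own.

```python
from statistics import mean, pstdev

# Scores copied from the ten test episodes printed above
test_scores = [
    124.10840970246511, 124.47667406187728, 260.2095870624055,
    247.21193783183256, 259.46422672810206, 255.22591452127557,
    262.5630435909743, 261.2563348120568, 261.0228806009323,
    260.60661251834915,
]
SOLVED_THRESHOLD = 200  # score treated as a successful episode

solved = [s for s in test_scores if s >= SOLVED_THRESHOLD]

print(f"Mean: {mean(test_scores):.1f}, std: {pstdev(test_scores):.1f}")
print(f"Episodes at or above {SOLVED_THRESHOLD}: {len(solved)}/{len(test_scores)}")
```

The two weaker episodes pull the mean down and the spread up, which is why checking the individual scores is more informative than the average alone.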

### Viewing the logs in TensorBoard

In [13]:
# To run from the command line:
# navigate to the PPO logs directory
# run: tensorboard --logdir=.

TensorBoard then gives an overview of the training logs.

## Summary
----

Training the agent has clearly produced better results than random action selection. 

Next, I plan to build a custom environment, implement PPO and potentially explore Q-learning and DQN.