# GE2340 Course Group Project Summary
##### In this project, our primary objective is to evaluate and compare the performance of the Proximal Policy Optimization (PPO) algorithm within the LunarLander environment provided by the Gymnasium library. We selected the LunarLander environment due to its comparatively low computational demands, especially when contrasted with more resource-intensive environments such as Atari games available within the same framework. This choice allows us to conduct experiments efficiently without the need for extensive computational resources.

##### To facilitate the implementation of the PPO algorithm and streamline the process of adjusting hyperparameters, we utilized the Stable Baselines 3 (SB3) framework. SB3 offers a comprehensive collection of pre-implemented reinforcement learning algorithms, which simplifies our experimentation process and reduces the time required for development. By leveraging SB3, we can focus on analyzing the algorithm's performance and make informed adjustments to optimize outcomes effectively.

-- Code comments are purely for team coordination and following work, you can ignore them

## Download and Importing all dependencies

In [4]:
!pip install gymnasium\[box2d\] swig stable-baselines3 tensorboard



In [82]:
import gymnasium as gym # To use the environment

from stable_baselines3 import PPO # These are the algorithms from SB3
from stable_baselines3.common.evaluation import evaluate_policy # This is needed for evaluation
from stable_baselines3.common.monitor import Monitor # for evaluation purposes too

# Following two imports are to vectorize multiple environments that are running in parallel
# Basically, to speed up the process of training
from stable_baselines3.common.env_util import make_vec_env 
from stable_baselines3.common.vec_env import SubprocVecEnv

import pandas as pd # This is to save results in a tabular format

### To view the statistics of the algorithms, we will use Tensorboard.

In [15]:
# This is a log path to store the statistics of every training.
log_path = 'data/logs'

In [24]:
# Entering this command in terminal is preferred (without the exclamation mark in the beginning)
# As it is a bit troublesome inside the jupyter notebook

!tensorboard --logdir 'data/logs'

# Copy & paste this to the terminal:
# tensorboard --logdir 'data/logs'

^C
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/opt/anaconda3/lib/python3.12/site-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
             ^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/tensorboard/main.py", line 38, in run_main
    main_lib.global_init()
  File "/opt/anaconda3/lib/python3.12/site-packages/tensorboard/main_lib.py", line 50, in global_init
    if getattr(tf, "__version__", "stub") == "stub":
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/tensorboard/lazy.py", 

## First let's understand an environment, action space and the observation space

All of these can be found in the documentation of the environment: https://gymnasium.farama.org/environments/box2d/lunar_lander/

In [6]:
env = gym.make('LunarLander-v2') # Initializing the LunarLander environment

In [7]:
# The action space is discrete with 4 actions available:
# 0 - do nothing
# 1 - fire left orientation engine
# 2 - fire main engine
# 3 - fire right orientation engine

env.action_space

Discrete(4)

In [8]:
# The observation space is box (8-dimensional vector), meaning that you receive values bounded within specific values.
# The first list below contains lower bounds, the second represent the upper bounds for each value within a list
# The values themselves represent:
# x and y coordinates of the spacecraft
# its linear velocities in x and y
# its angle and angular velocity
# and two boolean values 0 or 1 representing whether each leg is in contact with the ground or not

env.observation_space

Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)

In [9]:
# This is an example of one observation that an agent receives

env.reset()

(array([ 1.3456345e-03,  1.4103701e+00,  1.3627613e-01, -2.4438359e-02,
        -1.5524103e-03, -3.0868551e-02,  0.0000000e+00,  0.0000000e+00],
       dtype=float32),
 {})

### For administrative use only, ignore this.

Just functions for training the model, visualizing it, recording the gameplay, evaluating, etc.
They are going to be broken down later, while explaining the baseline model

In [132]:
def train_agent(algo, timesteps, log_name, policy="MlpPolicy", log_path="data/logs", lr = 0.0003):
    env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = algo(policy, env, learning_rate=lr, device="cpu", tensorboard_log=log_path)
    model.learn(total_timesteps=timesteps, tb_log_name=log_name)
    
    return model

In [62]:
def visualize_agent(model_path, episodes=5, algo=PPO):
    model = algo.load(model_path)
    env = gym.make('LunarLander-v2', render_mode='human')
    
    for episode in range(episodes):
        obs, info = env.reset()
        episode_reward = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated:
            action, _state = model.predict(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            env.render()
        
        print(f'Episode: {episode + 1}, Reward: {episode_reward}')
    
    env.close()

In [80]:
def evaluate_rl_model(algo, model_path, n_eval_episodes=100, deterministic=True):
    env = Monitor(gym.make('LunarLander-v2'))
    model = algo.load(model_path)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=n_eval_episodes, deterministic=deterministic)
    
    print(f"{algo.__name__} - Mean Reward: {mean_reward} +/- {std_reward}")
    
    return mean_reward, std_reward

In [85]:
def log_evaluation(name, timesteps, mean_reward=mean_reward, std_reward=std_reward, results_df=results):
    temp_df = pd.DataFrame({
        'Algorithm': [name],
        'Mean Reward': [mean_reward],
        'Standard Deviation': [std_reward],
        'Timesteps': [timesteps]
    })
    results_df = pd.concat([results_df, temp_df], ignore_index=True)
    return results_df

In [140]:
def log_evaluation_lr(name, lr, results_df, mean_reward=mean_reward, std_reward=std_reward):
    temp_df = pd.DataFrame({
        'Algorithm': [name],
        'Mean Reward': [mean_reward],
        'Standard Deviation': [std_reward],
        'Learning Rate': [lr]
    })
    results_df = pd.concat([results_df, temp_df], ignore_index=True)
    return results_df

In [163]:
def log_evaluation_arch(name, arch, results_df, mean_reward=mean_reward, std_reward=std_reward):
    temp_df = pd.DataFrame({
        'Algorithm': [name],
        'Mean Reward': [mean_reward],
        'Standard Deviation': [std_reward],
        'Architecture': [arch]
    })
    results_df = pd.concat([results_df, temp_df], ignore_index=True)
    return results_df

### Let's initialize the environment and visualize the performance of an agent that performs random actions in the environment

#### For local usage only, since Gymnasium code visualizes the game in an external window of PyGame

In [187]:
env = gym.make('LunarLander-v2', render_mode='human') # this render mode renders the environment using PyGame to visualize the gameplay

episodes = 5 # episode is a single run of the game

for episode in range(episodes):
    obs, info = env.reset() # resetting the environment is necessary
    episode_reward = 0 # cumulative reward for the whole episode
    terminated = False # for the termination of the episode because of the game rules (e.g. the lander crushes)
    truncated = False # to finish the episode if the time limit is reached

    while not terminated and not truncated: # if any of those flags becomes true, then the episode will finish; to avoid the infinite loop
        action = env.action_space.sample() # take a random action from the action_space
        obs, reward, terminated, truncated, info = env.step(action) # env.step produces the reward and observation; that will be the agent input
        episode_reward += reward # sum up rewards for every step
        env.render() # to visualize the environment

    print(f'Episode: {episode + 1}, reward: {episode_reward}') # this prints the episode number and reward

env.close()


Episode: 1, reward: -111.10050402559793
Episode: 2, reward: -113.05190105473527
Episode: 3, reward: -179.09979614175077
Episode: 4, reward: -211.21555214185832
Episode: 5, reward: -174.2443513547509


### We plan to conduct several tests to gain insights into the PPO algorithm. 
### But let's make a default model to use as a baseline to other models

In [19]:
# To check the standard parameters for the PPO algorithm

PPO??

# PPO(
#     policy: Union[str, Type[stable_baselines3.common.policies.ActorCriticPolicy]],
#     env: Union[gymnasium.core.Env, ForwardRef('VecEnv'), str],
#     learning_rate: Union[float, Callable[[float], float]] = 0.0003,
#     n_steps: int = 2048,
#     batch_size: int = 64,
#     n_epochs: int = 10,
#     gamma: float = 0.99,
#     gae_lambda: float = 0.95,
#     clip_range: Union[float, Callable[[float], float]] = 0.2,
#     clip_range_vf: Union[NoneType, float, Callable[[float], float]] = None,
#     normalize_advantage: bool = True,
#     ent_coef: float = 0.0,
#     vf_coef: float = 0.5,
#     max_grad_norm: float = 0.5,
#     use_sde: bool = False,
#     sde_sample_freq: int = -1,
#     rollout_buffer_class: Optional[Type[stable_baselines3.common.buffers.RolloutBuffer]] = None,
#     rollout_buffer_kwargs: Optional[Dict[str, Any]] = None,
#     target_kl: Optional[float] = None,
#     stats_window_size: int = 100,
#     tensorboard_log: Optional[str] = None,
#     policy_kwargs: Optional[Dict[str, Any]] = None,
#     verbose: int = 0,
#     seed: Optional[int] = None,
#     device: Union[torch.device, str] = 'auto',
#     _init_setup_model: bool = True,
# )

In [22]:
# This will be our stock model, training it with default parameters mentioned above on 1mil timesteps

# Training here is done using 8 cores of the CPU in parallel.
# That's why we need SubprocVecEnv, make_vec_env

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)

# The CPU is used here, since PPO was meant to be used on CPU primarily (SB3 docs)
model = PPO("MlpPolicy", env, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000,  tb_log_name='PPO_default')

<stable_baselines3.ppo.ppo.PPO at 0x16c79d190>

In [175]:
# SB3 allows us to use MLPs to implement the Actor-Critic Policy
# The MLP architecture consists of 2 networks for both the value function (critic) and the policy (actor)
# Each of them consist of 2-fully connected hidden layers with 64 units per layer
# Tanh is used as the activation function for the hidden layers
# The model also involves the Feature Extractor
# Read more about it here: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#sb3-policy

model.policy

ActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (pi_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (vf_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (policy_net): Sequential(
      (0): Linear(in_features=8, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=8, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=64, out_features=4, bias=True)
  (value_net): Linear(in_features=64, out_features=1, bias=True)
)

In [25]:
model.save('models/PPO_default')

In [26]:
del model # for administrative use only

#### Visualizing the model's behaviour

In [28]:
model = PPO.load('models/PPO_default')

env = gym.make('LunarLander-v2', render_mode='human')

episodes = 5

for episode in range(episodes):
    obs, info = env.reset()
    episode_reward = 0
    terminated = False
    truncated = False
    
    while not terminated and not truncated:
        action, _state = model.predict(obs) #
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        env.render()
        
    print(f'Episode: {episode + 1}, reward: {episode_reward}')

env.close()

Episode: 1, reward: 277.15495825688726
Episode: 2, reward: 291.3077068253734
Episode: 3, reward: 270.0817612896867
Episode: 4, reward: 272.894404198532
Episode: 5, reward: 283.8284480196621


#### Saving the mp4 files of the gameplay of an agent

# Check this later bro

#### Evaluating this model over 100 episodes

In [243]:
# No rendering here is needed
env = gym.make("LunarLander-v2")

model = PPO.load('models/PPO_default')

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)

print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")



PPO - Mean Reward: 261.0436890006603 +/- 41.42890211445548


In [49]:
results = pd.DataFrame(columns=['Algorithm', 'Mean Reward', 'Standard Deviation', 'Timesteps'])

In [50]:
name = 'PPO_default'

temp_df = pd.DataFrame({
    'Algorithm': [name],
    'Mean Reward': [mean_reward],
    'Standard Deviation': [std_reward],
    'Timesteps': 1_000_000})

results = pd.concat([results, temp_df])

  results = pd.concat([results, temp_df])


## Firstly, we want to examine how the performance changes based on the number of timesteps the agent is trained for.

Training the model for 500k timesteps, reducing the baseline amount twice

In [53]:
env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=500_000,  tb_log_name='PPO_500k_timesteps')

<stable_baselines3.ppo.ppo.PPO at 0x32019b470>

In [60]:
model.save('models/PPO_500k_timesteps')

In [None]:
del model # for administrative use

Visualizing the model

In [268]:
visualize_agent('models/PPO_500k_timesteps', episodes=5)

Episode: 1, Reward: 258.1017631212074
Episode: 2, Reward: 266.22540669030803
Episode: 3, Reward: 282.04932361448215
Episode: 4, Reward: 247.1627704224718
Episode: 5, Reward: 267.9915426733965


Recording

Evaluating and logging the evaluation

In [116]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_500k_timesteps')

PPO - Mean Reward: 226.76535850999997 +/- 64.11821598336066


In [117]:
print(mean_reward, std_reward)

226.76535850999997 64.11821598336066


In [118]:
results = log_evaluation(name='PPO_500k', timesteps=500_000, mean_reward=mean_reward, std_reward=std_reward)

### Let's increase the amount of timesteps twice:

In [71]:
model = train_agent(algo=PPO, timesteps=2_000_000, log_name='PPO_2mil_timesteps')

In [72]:
model.save('models/PPO_2mil_timesteps')

In [123]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_2mil_timesteps')

PPO - Mean Reward: 267.35464206 +/- 22.856307370573735


In [130]:
results = log_evaluation(name='PPO_2mil', timesteps=2_000_000, mean_reward=mean_reward, std_reward=std_reward, results_df=results)

In [246]:
results

Unnamed: 0,Algorithm,Mean Reward,Standard Deviation,Timesteps
0,PPO_default,261.043689,41.428902,1000000
1,PPO_500k,226.765359,64.118216,500000
2,PPO_2mil,267.354642,22.856307,2000000


## A second experiment is to change the learning rate of the PPO algorithm
The default learning rate is 0.0003, let's increase it 10 times

In [133]:
model = train_agent(algo=PPO, timesteps=1_000_000, log_name='PPO_lr_x10', lr=0.003)

In [135]:
model.save('models/PPO_lr_x10')

In [137]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_lr_x10')

PPO - Mean Reward: 235.93047351 +/- 83.43652698272338


In [142]:
results_lr = pd.DataFrame(columns=['Algorithm', 'Mean Reward', 'Standard Deviation', 'Learning Rate'])

results_lr.loc[0] = ['PPO_default', 266.884301, 21.949487, 0.0003]

In [269]:
visualize_agent('models/PPO_lr_x10.zip')

Episode: 1, Reward: 294.2391201373126
Episode: 2, Reward: 308.02606720363895
Episode: 3, Reward: 281.2610729355911
Episode: 4, Reward: 55.90130036096838
Episode: 5, Reward: 267.85304492537927


In [144]:
results_lr = log_evaluation_lr(name='PPO_LR_x10', lr=0.003, results_df=results_lr, mean_reward=mean_reward, std_reward=std_reward)

Training another model, but decreasing a learning rate 10 times

In [146]:
model = train_agent(algo=PPO, timesteps=1_000_000, log_name='PPO_lr_x0_1')

In [147]:
model.save('models/PPO_lr_x0_1')

In [154]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_lr_x0_1', n_eval_episodes=1000)

PPO - Mean Reward: 271.72410653700007 +/- 24.651670297485083


In [151]:
visualize_agent('models/PPO_lr_x0_1')

Episode: 1, Reward: 277.1073003018405
Episode: 2, Reward: 283.54802500154074
Episode: 3, Reward: 271.323135518571
Episode: 4, Reward: 264.53156146977983
Episode: 5, Reward: 273.9498880889795


In [152]:
results_lr = log_evaluation_lr(name='PPO_LR_x0.1', lr=0.00003, results_df=results_lr, mean_reward=mean_reward, std_reward=std_reward) 

In [247]:
results_lr

Unnamed: 0,Algorithm,Mean Reward,Standard Deviation,Learning Rate
0,PPO_default,261.043689,41.428902,0.0003
1,PPO_LR_x10,235.930474,83.436527,0.003
2,PPO_LR_x0.1,272.309767,22.056692,3e-05


Very stable agent, a bit better than the default one in terms of the reward

### Another experiment: changing the architecture of the model

In [195]:
# Increasing the amount of layers to 3 and neurons to 128

policy_kwargs = dict(net_arch=[128, 128, 128])

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000, tb_log_name='PPO_3_layer_128')

<stable_baselines3.ppo.ppo.PPO at 0x30acc8c20>

In [196]:
model.save('models/PPO_3_layer_128')

In [199]:
visualize_agent(model_path='models/PPO_3_layer_128')

Episode: 1, Reward: 320.50967258667
Episode: 2, Reward: 263.53044420536514
Episode: 3, Reward: 253.91983949692423
Episode: 4, Reward: 252.7634606809969
Episode: 5, Reward: 268.32590577701535


In [207]:
## FINISH IT
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_3_layer_128')

PPO - Mean Reward: 279.24673498 +/- 19.962474193817183


In [208]:
results_arch = pd.DataFrame(columns=['Algorithm', 'Mean Reward', 'Standard Deviation', 'Architecture'])

results_arch.loc[0] = ['PPO_default', 266.884301, 21.949487, '[64, 64]']

In [209]:
results_arch = log_evaluation_arch(name='PPO 3-layered', arch='[128, 128, 128]', results_df=results_arch, mean_reward=mean_reward, std_reward=std_reward)

#### Let's play with it more

In [228]:
policy_kwargs = dict(net_arch=[128, 128])

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000, tb_log_name='PPO_2_layer_128')

<stable_baselines3.ppo.ppo.PPO at 0x30ad60c80>

In [172]:
model.policy

ActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (pi_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (vf_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (policy_net): Sequential(
      (0): Linear(in_features=8, out_features=128, bias=True)
      (1): Tanh()
      (2): Linear(in_features=128, out_features=128, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=8, out_features=128, bias=True)
      (1): Tanh()
      (2): Linear(in_features=128, out_features=128, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=128, out_features=4, bias=True)
  (value_net): Linear(in_features=128, out_features=1, bias=True)
)

In [230]:
model.save('models/PPO_2_layer_128')

In [231]:
visualize_agent(model_path='models/PPO_2_layer_128')

Episode: 1, Reward: 247.20306813978377
Episode: 2, Reward: 281.9463705270159
Episode: 3, Reward: 324.4132904522078
Episode: 4, Reward: 291.4043412706661
Episode: 5, Reward: 296.50902780526906


In [232]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_2_layer_128')

PPO - Mean Reward: 275.68694847 +/- 26.82860779678868


In [241]:
results_arch = log_evaluation_arch(name='PPO 2-layered', arch='[128, 128]', results_df=results_arch, mean_reward=mean_reward, std_reward=std_reward)

Making a policy function more complex by increasing the numbers of neurons twice in each layer

Very successful model, low amount of useless moves

In [204]:
policy_kwargs = dict(net_arch=dict(pi=[128, 128], vf=[64, 64]))

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000, tb_log_name='PPO_2_layer_128_64')

<stable_baselines3.ppo.ppo.PPO at 0x300f5b710>

In [205]:
model.policy

ActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (pi_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (vf_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (policy_net): Sequential(
      (0): Linear(in_features=8, out_features=128, bias=True)
      (1): Tanh()
      (2): Linear(in_features=128, out_features=128, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=8, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=128, out_features=4, bias=True)
  (value_net): Linear(in_features=64, out_features=1, bias=True)
)

In [206]:
model.save('models/PPO_2_layer_128_64')

In [216]:
visualize_agent('models/PPO_2_layer_128_64')

Episode: 1, Reward: 266.62139467787006
Episode: 2, Reward: 273.6814372671445
Episode: 3, Reward: 282.77987151886714
Episode: 4, Reward: 265.1764128666954
Episode: 5, Reward: 304.6802897341968


In [218]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_2_layer_128_64')

PPO - Mean Reward: 257.00967935999995 +/- 25.95230886212065


In [219]:
results_arch = log_evaluation_arch('PPO 2-layered 128/64', arch='Pi[128, 128], Vf[64, 64]', results_df=results_arch, mean_reward=mean_reward, std_reward=std_reward)

Making the value function more complex

In [221]:
policy_kwargs = dict(net_arch=dict(pi=[64, 64], vf=[128, 128]))

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000, tb_log_name='PPO_2_layer_64_128')

<stable_baselines3.ppo.ppo.PPO at 0x300f596a0>

In [222]:
model.save('models/PPO_2_layer_64_128')

In [223]:
visualize_agent('models/PPO_2_layer_64_128')

Episode: 1, Reward: 243.15817723485623
Episode: 2, Reward: 281.09322463872337
Episode: 3, Reward: 291.40914278406206
Episode: 4, Reward: 290.37787597840645
Episode: 5, Reward: 244.91018521978071


In [225]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_2_layer_64_128')

PPO - Mean Reward: 244.96941969 +/- 46.34079186873261


In [226]:
results_arch = log_evaluation_arch('PPO 2-layered 64/128', arch='Pi[64, 64], Vf[128,128]', results_df=results_arch, mean_reward=mean_reward, std_reward=std_reward)

In [248]:
results_arch

Unnamed: 0,Algorithm,Mean Reward,Standard Deviation,Architecture
0,PPO_default,261.043689,41.428902,"[64, 64]"
1,PPO 3-layered,279.246735,19.962474,"[128, 128, 128]"
2,PPO 2-layered 128/64,257.009679,25.952309,"Pi[128, 128], Vf[64, 64]"
3,PPO 2-layered 64/128,244.96942,46.340792,"Pi[64, 64], Vf[128,128]"
4,PPO 2-layered,275.686948,26.828608,"[128, 128]"


Changing the activation function?

In [250]:
import torch as th

policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=[64, 64])

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=1_000_000, tb_log_name='PPO_ReLU')

<stable_baselines3.ppo.ppo.PPO at 0x306c55d60>

In [251]:
model.save('models/PPO_ReLU')

In [252]:
visualize_agent('models/PPO_ReLU')

Episode: 1, Reward: 284.62028903926443
Episode: 2, Reward: 276.3256555585462
Episode: 3, Reward: 271.42252015274016
Episode: 4, Reward: 268.90678769184956
Episode: 5, Reward: 269.53705624587974


In [253]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_ReLU')

PPO - Mean Reward: 279.56737877 +/- 19.755591984854483


Making the 'perfect' model

In [255]:
policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=[128, 128, 128])

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, learning_rate=0.00003, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=2_000_000, tb_log_name='PPO_possibly_best')

<stable_baselines3.ppo.ppo.PPO at 0x309180e00>

In [256]:
model.save('models/PPO_possibly_best')

In [270]:
visualize_agent('models/PPO_possibly_best')

Episode: 1, Reward: 291.04192792533775
Episode: 2, Reward: 254.48729856005542
Episode: 3, Reward: 322.00864637392283
Episode: 4, Reward: 295.51798064943887
Episode: 5, Reward: 50.146137351463835


In [263]:
mean_reward, std_reward = evaluate_rl_model(algo=PPO, model_path='models/PPO_possibly_best', n_eval_episodes=1000)

PPO - Mean Reward: 270.819484418 +/- 43.2421512093431


Making a LR default

In [264]:
policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=[128, 128, 128])

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, device="cpu", tensorboard_log=log_path)
model.learn(total_timesteps=2_000_000, tb_log_name='PPO_2nd_possibly_best')

<stable_baselines3.ppo.ppo.PPO at 0x30916af60>

In [266]:
model.save('models/PPO_2nd_possibly_best')

In [275]:
visualize_agent('models/PPO_2nd_possibly_best')

Episode: 1, Reward: 296.19584472653185
Episode: 2, Reward: 307.1673237229345
Episode: 3, Reward: 297.9091820490997
Episode: 4, Reward: 266.8851508065559
Episode: 5, Reward: 284.9845581341762


In [276]:
mean_reward, std_reward = evaluate_rl_model(PPO, model_path='models/PPO_2nd_possibly_best', n_eval_episodes=1000)

PPO - Mean Reward: 280.86061939500007 +/- 37.38880397932653
