### Importing dependencies

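#### Importing Weights & Biases for the training callback (to monitor training progress)

### Never mind: W&B glitched and ended up showing nothing
#### So please just ignore it

The imports are commented out below. To re-run the training cells, either uncomment them or remove the `callback=WandbCallback()` argument, since `WandbCallback` is otherwise undefined.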

In [3]:
#import wandb
#from wandb.integration.sb3 import WandbCallback

#wandb.init(project="lunarlander_experiments")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/maksat/.netrc


Gymnasium is used to create the environment, and Stable Baselines3 (SB3) to train and compare different algorithms on it.

In [4]:
import gymnasium as gym

from stable_baselines3 import PPO, DQN, A2C
from stable_baselines3.common.evaluation import evaluate_policy

import pandas as pd

In [9]:
# Making a log directory to save logs
logs = 'data/logs'

### Understanding the Lunar Lander environment
See the environment documentation: https://gymnasium.farama.org/environments/box2d/lunar_lander/

In [74]:
environment = gym.make('LunarLander-v2')

In [75]:
# The action space is discrete: at each step the agent picks one of 4 actions
# 0 - do nothing
# 1 - fire left orientation engine
# 2 - fire main engine
# 3 - fire right orientation engine

environment.action_space

Discrete(4)
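To keep the action indices above readable in later code, the four actions can be given names. This is a minimal sketch: the `LanderAction` enum and the `describe` helper are hypothetical names used for illustration and are not part of Gymnasium.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Names for the four discrete actions listed above (hypothetical helper)."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action: int) -> str:
    """Return a readable name for an action index; raises ValueError if it is out of range."""
    return LanderAction(int(action)).name.replace("_", " ").lower()


print(describe(2))
```

Because `IntEnum` members compare equal to plain integers, `LanderAction.FIRE_MAIN_ENGINE` can be passed anywhere the integer `2` is expected.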

In [79]:
# This is the observation, i.e., the state that is fed to the agent (8 values)
# Values 1-2: the (x, y) coordinates of the lander at the given moment
# Values 3-4: its linear velocities in x and y
# Values 5-6: its angle and angular velocity

# Values 7-8: booleans stored as floats.
# They indicate whether each leg of the lander is in contact with the ground

environment.reset()

(array([ 1.1892319e-03,  1.4023979e+00,  1.2044747e-01, -3.7876463e-01,
        -1.3713022e-03, -2.7283153e-02,  0.0000000e+00,  0.0000000e+00],
       dtype=float32),
 {})
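The eight values can be unpacked into named fields so that later code does not depend on positions. This is a sketch only: `LanderState` and `decode` are hypothetical names, and the leg fields are numbered rather than labeled left/right because the order is not stated in this notebook.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the 8-value observation described above (hypothetical helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    leg_1_contact: bool
    leg_2_contact: bool


def decode(obs: Sequence[float]) -> LanderState:
    """Convert a raw observation into a LanderState."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    continuous = [float(v) for v in obs[:6]]
    # The last two entries are 0.0 / 1.0 flags, so cast them to bool
    return LanderState(*continuous, bool(obs[6]), bool(obs[7]))


# Values copied from the reset() output above
first_obs = [1.1892319e-03, 1.4023979e+00, 1.2044747e-01, -3.7876463e-01,
             -1.3713022e-03, -2.7283153e-02, 0.0, 0.0]
state = decode(first_obs)
print(state.y, state.leg_1_contact, state.leg_2_contact)
```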

Rewards and other details, such as the conditions that terminate an episode (one attempt at playing the game), are described on the page linked above.

### Important note
To check the logs (they are important for the comparisons), run this cell, then open localhost:6006 to view TensorBoard.

In [None]:
!tensorboard --logdir 'data/logs'

### Comparing the number of training timesteps

#### First run of PPO algorithm for 1_000_000 timesteps

Later on I trained a few more models using 8 CPU cores. So do not compare the time efficiency of the 3 agents below (PPO_1mil and the two PPO_500000 runs) with those later runs, since these were trained on a single CPU core.
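The single-environment PPO logs in this notebook advance by 2048 timesteps per iteration. Assuming training proceeds in whole rollouts of `n_steps * n_envs` timesteps (here `n_steps=2048` is read off the logs and `n_envs=8` matches the later multi-core runs), the number of iterations for a given budget can be estimated as follows. `ppo_iterations` is a hypothetical helper, not an SB3 function.

```python
import math


def ppo_iterations(total_timesteps: int, n_steps: int = 2048, n_envs: int = 1) -> tuple[int, int]:
    """Return (iterations, timesteps actually collected) for a timestep budget.

    Assumes whole rollouts are collected until the budget is reached,
    so the collected total can slightly exceed the requested one.
    """
    rollout_size = n_steps * n_envs
    iterations = math.ceil(total_timesteps / rollout_size)
    return iterations, iterations * rollout_size


print(ppo_iterations(1_000_000))            # one environment
print(ppo_iterations(1_000_000, n_envs=8))  # eight parallel environments
```

With 8 environments each iteration gathers 8 times more data, so far fewer policy updates happen for the same timestep budget. That is one more reason the single-core and multi-core runs are not directly comparable.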

In [7]:
env = gym.make("LunarLander-v2") # Initializing an environment

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logs) # Creating a PPO agent with an MLP policy; training logs go to the TensorBoard directory
model.learn(total_timesteps=1_000_000, callback=WandbCallback())

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to data/logs/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 86.9     |
|    ep_rew_mean     | -142     |
| time/              |          |
|    fps             | 2093     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 89.1         |
|    ep_rew_mean          | -139         |
| time/                   |              |
|    fps                  | 1602         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0074320715 |
|    clip_fraction        | 0.0274       |
|    clip_range        

<stable_baselines3.ppo.ppo.PPO at 0x31c220da0>

In [10]:
model.save('models/PPO_1mil_model')

In [22]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)

print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")

PPO - Mean Reward: 161.4443761205236 +/- 103.32166450740453
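`evaluate_policy` reports the mean and standard deviation of the per-episode returns. With only 10 episodes and a spread this wide, the mean itself is a noisy estimate, which is why the 100-episode evaluations are used for the comparison table. The sketch below assumes the population standard deviation (NumPy's default); `summarize` is a hypothetical helper and the three-reward list is made up for illustration.

```python
import math
import statistics


def summarize(episode_rewards):
    """Return (mean, population std, standard error of the mean) of episode returns."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    stderr = std / math.sqrt(len(episode_rewards))
    return mean, std, stderr


# Made-up returns, only to show the call
print(summarize([100.0, 200.0, 300.0]))

# Standard error implied by the 10-episode result printed above
print(103.32166450740453 / math.sqrt(10))
```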


#### Storing an evaluation of the model

In [29]:
results = pd.DataFrame(columns=['Algorithm', 'Mean Reward', 'Std Reward', 'Timesteps'])
name = 'PPO_1mil'

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)
print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")

temp_df = pd.DataFrame({
    'Algorithm' : [name],
    'Mean Reward' : [mean_reward],
    'Std Reward' : [std_reward],
    'Timesteps' : 1_000_000})

results = pd.concat([results, temp_df])



PPO - Mean Reward: 165.55157471745926 +/- 119.14903230671855




#### Running the same PPO algorithm for fewer timesteps

In [30]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logs)
model.learn(total_timesteps=500000, callback=WandbCallback(), tb_log_name='PPO_500000')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to data/logs/PPO_500000_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89       |
|    ep_rew_mean     | -166     |
| time/              |          |
|    fps             | 1595     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 90.3         |
|    ep_rew_mean          | -155         |
| time/                   |              |
|    fps                  | 1375         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0071888836 |
|    clip_fraction        | 0.0642       |
|    clip_range 

<stable_baselines3.ppo.ppo.PPO at 0x32f927740>

#### This attempt was not successful

From the very beginning of training the reward went down while the episode length went up, which suggests that the agent didn't explore well.

In [31]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logs)
model.learn(total_timesteps=500000, callback=WandbCallback(), tb_log_name='PPO_500000_2nd')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to data/logs/PPO_500000_2nd_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 88.3     |
|    ep_rew_mean     | -173     |
| time/              |          |
|    fps             | 2002     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 92.7         |
|    ep_rew_mean          | -177         |
| time/                   |              |
|    fps                  | 1665         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0072733155 |
|    clip_fraction        | 0.0104       |
|    clip_ra

<stable_baselines3.ppo.ppo.PPO at 0x32f8b9ee0>

In [32]:
model.save('models/PPO_500000_2nd_attempt')

In [35]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)

print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")



PPO - Mean Reward: 208.16804920145992 +/- 62.315576606305754


In [36]:
name = 'PPO_500k'

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)
print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")

temp_df = pd.DataFrame({
    'Algorithm': [name],
    'Mean Reward': [mean_reward],
    'Std Reward': [std_reward],
    'Timesteps': 500000})

results = pd.concat([results, temp_df])



PPO - Mean Reward: 188.9723823407851 +/- 101.74879156113954


In [37]:
results

| | Algorithm | Mean Reward | Std Reward | Timesteps |
|---|---|---|---|---|
| 0 | PPO_1mil | 165.551575 | 119.149032 | 1000000 |
| 0 | PPO_500k | 188.972382 | 101.748792 | 500000 |
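Both rows above carry index 0 because each one-row frame was concatenated with its own index. One alternative, sketched here in plain Python, is to collect one dict per evaluated model and build the table once at the end, for example by passing the list to `pd.DataFrame`. The field names are copied from the table; `rows`, `record` and `best_run` are hypothetical names.

```python
rows = []  # one dict per evaluated model


def record(algorithm, mean_reward, std_reward, timesteps):
    """Append one evaluation result to the running list."""
    rows.append({
        "Algorithm": algorithm,
        "Mean Reward": mean_reward,
        "Std Reward": std_reward,
        "Timesteps": timesteps,
    })


def best_run(results):
    """Return the row with the highest mean reward."""
    return max(results, key=lambda row: row["Mean Reward"])


# Values copied from the results table above
record("PPO_1mil", 165.551575, 119.149032, 1_000_000)
record("PPO_500k", 188.972382, 101.748792, 500_000)

print(best_run(rows)["Algorithm"])
```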


#### Testing DQN algorithm
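`exploration_fraction` is the fraction of the total training timesteps over which the exploration rate (epsilon, not gamma) is gradually reduced from its initial to its final value. After that point the agent keeps exploring at the final rate.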
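The effect of `exploration_fraction` can be sketched as a linear schedule for the exploration rate. This is an illustration, not SB3's code: `exploration_rate` is a hypothetical helper, and the start and end values of 1.0 and 0.05 are assumed to match the SB3 defaults for DQN.

```python
def exploration_rate(step, total_timesteps, exploration_fraction=0.5,
                     initial_eps=1.0, final_eps=0.05):
    """Exploration rate after `step` timesteps under a linear decay.

    The rate falls linearly from initial_eps to final_eps during the first
    `exploration_fraction` of training and stays at final_eps afterwards.
    """
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


for step in (0, 250_000, 500_000, 1_000_000):
    print(step, exploration_rate(step, 1_000_000))
```

With `exploration_fraction=0.5` and 1_000_000 timesteps, the agent acts mostly at random for a large part of the first 500_000 steps, which is worth keeping in mind when reading the early reward curve.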

In [40]:
env = gym.make('LunarLander-v2', render_mode='rgb_array')

model = DQN('MlpPolicy', env, exploration_fraction=0.5, verbose=1, tensorboard_log=logs)
model.learn(total_timesteps=1_000_000, callback=WandbCallback(), tb_log_name='DQN_1mil')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to data/logs/DQN_1mil_1
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 92.2     |
|    ep_rew_mean      | -146     |
|    exploration_rate | 0.999    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 1260     |
|    time_elapsed     | 0        |
|    total_timesteps  | 369      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.75     |
|    n_updates        | 67       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 94       |
|    ep_rew_mean      | -160     |
|    exploration_rate | 0.999    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 932      |
|    time_elapsed     | 0        |
|    total_timesteps  | 752      |

<stable_baselines3.dqn.dqn.DQN at 0x144dd4d40>

#### This training attempt was not successful either. It didn't explore enough.
I wasn't impressed by DQN: it didn't perform well here.

In [41]:
model.save('models/DQN_1mil_model')

In [44]:
name = 'DQN_1mil'

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)
print(f"DQN - Mean Reward: {mean_reward} +/- {std_reward}")

temp_df = pd.DataFrame({
    'Algorithm': [name],
    'Mean Reward': [mean_reward],
    'Std Reward': [std_reward],
    'Timesteps': 1_000_000})  # this model was trained for 1_000_000 timesteps

results = pd.concat([results, temp_df])



KeyboardInterrupt: 

### Testing A2C algorithm

In [46]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # 8 copies of the environment, each running in its own subprocess
    env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = A2C("MlpPolicy", env, device="cpu", tensorboard_log=logs)
    model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='A2C_1mil')

### Very unstable run

The reward dropped very quickly: the agent shortened its episodes but did not adapt its behavior to the shorter episodes.

In [47]:
model.save('models/A2C_1mil_model')

Trying a lower learning rate, since the agent is overreacting to changes (the episode reward graph fluctuates a lot). The default learning rate for A2C in SB3 is 0.0007.

In [50]:
env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = A2C("MlpPolicy", env, learning_rate=0.0001, device="cpu", tensorboard_log=logs)
model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='A2C_1mil_adjusted')

<stable_baselines3.a2c.a2c.A2C at 0x146c234d0>

In [51]:
model.save('models/A2C_1mil_adjusted')

Again, the mean score is -36. Trying a higher learning rate.

In [53]:
env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = A2C("MlpPolicy", env, learning_rate=0.00045, device="cpu", tensorboard_log=logs)
model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='A2C_1mil_adjusted2')

<stable_baselines3.a2c.a2c.A2C at 0x141072720>

In [54]:
model.save('models/A2C_1mil_adjusted2')

### Trying different learning rates to see the performance of the agents

In [63]:
# This command shows the information about the function. 
# I used it to look up default values of the parameters.

PPO??

Trying a learning rate 10 times higher than the default

In [64]:
# default learning rate of PPO is 0.0003

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, learning_rate=0.003, device="cpu", tensorboard_log=logs)
model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='PPO_10times_LR')

<stable_baselines3.ppo.ppo.PPO at 0x143272300>

In [65]:
model.save('models/PPO_10times_LR')

In [67]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)
print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")



PPO - Mean Reward: 210.33921440428344 +/- 59.028505439107256


Trying a learning rate 10 times lower than the default

In [68]:
# default learning rate of PPO is 0.0003

env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, learning_rate=0.00003, device="cpu", tensorboard_log=logs)
model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='PPO_0.1times_LR')

<stable_baselines3.ppo.ppo.PPO at 0x142f14a70>

In [69]:
model.save('models/PPO_0_1times_LR')

Default learning rate

In [70]:
model = PPO('MlpPolicy', env, tensorboard_log=logs)
model.learn(total_timesteps=1000000, callback=WandbCallback(), tb_log_name='PPO_default_LR')

<stable_baselines3.ppo.ppo.PPO at 0x142f80a70>

In [71]:
model.save('models/PPO_default_LR')

In [73]:
# Run this cell to visualize the agent's gameplay locally
# load() is a class method that returns a new model, so its result must be assigned
model = PPO.load('models/PPO_default_LR')
env = gym.make("LunarLander-v2", render_mode='human')

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"PPO - Mean Reward: {mean_reward} +/- {std_reward}")



PPO - Mean Reward: 193.9048228531026 +/- 83.52647337219422
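In SB3, `load` is a class method that returns a new model object, so calling it without assigning the result leaves the existing model untouched. The toy class below is not the SB3 API; `ToyModel` is a hypothetical stand-in that only mimics the shape of `load` to show the difference.

```python
class ToyModel:
    """Stand-in for a model class; NOT the SB3 API, just the same load() shape."""

    def __init__(self, weights):
        self.weights = weights

    @classmethod
    def load(cls, saved_weights):
        # Builds and returns a brand-new object; no existing instance is modified
        return cls(saved_weights)


model = ToyModel(weights="freshly trained")

model.load("from disk")              # result discarded: `model` is unchanged
unchanged = model.weights

model = ToyModel.load("from disk")   # result assigned: `model` now holds the loaded weights
print(unchanged, "->", model.weights)
```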


### To visualize any model locally:

In [80]:
name = 'models/put the name of the model here'
model = PPO.load(name)  # use the class that trained the model (PPO, DQN or A2C)

env = gym.make('LunarLander-v2', render_mode='human')

episodes = 5 # Number of episodes

for episode in range(episodes):
    obs, info = env.reset()
    episode_reward = 0
    terminated = False
    truncated = False
    
    while not terminated and not truncated:
        action, state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        env.render()
        
    print(f"Episode {episode + 1} reward: {episode_reward}")

env.close()
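The loop above ends an episode when either `terminated` or `truncated` becomes true. The difference between the two can be seen without Box2D by using a tiny stand-in environment. Everything here is hypothetical: `StubEnv` only imitates the `reset`/`step` return shapes used in the cell above, and `run_episode` is an illustrative helper.

```python
class StubEnv:
    """Hypothetical stand-in for a Gymnasium env: gives +1 reward per step."""

    def __init__(self, terminate_at=None, max_steps=5):
        self.terminate_at = terminate_at  # step at which the task itself ends, if any
        self.max_steps = max_steps        # time limit that cuts the episode short
        self.t = 0

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        terminated = self.terminate_at is not None and self.t >= self.terminate_at
        truncated = not terminated and self.t >= self.max_steps
        return self.t, 1.0, terminated, truncated, {}


def run_episode(env, policy):
    """Play one episode and return (total reward, terminated, truncated)."""
    obs, info = env.reset()
    total, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
    return total, terminated, truncated


print(run_episode(StubEnv(terminate_at=3), lambda obs: 0))  # ends because the task ended
print(run_episode(StubEnv(), lambda obs: 0))                # ends because of the time limit
```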


KeyboardInterrupt: 

