<a href="https://colab.research.google.com/github/tombackert/rl-stuff/blob/main/first-notebook-on-reinforcement-learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First Notebook on Reinforcement Learning


## Solving the Lunar Lander Problem

Sources:
1. [Gymnasium](https://gymnasium.farama.org/)
2. [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/#)
3. [DeepMind paper](https://arxiv.org/abs/1312.5602)
4. [Nature article](https://www.nature.com/articles/nature14236)
5. [Tuned Hyperparameters for DQN Model](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/dqn.yml)

Reinforcement Learning Resources:
- [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/)
- [Hugging Face's Deep RL Course](https://huggingface.co/learn/deep-rl-course/unit0/introduction)
- [Lilian Weng's Blog](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/)
- [Berkeley's Deep RL Bootcamp](https://sites.google.com/view/deep-rl-bootcamp/lectures)


*more updates coming...*

### 0. Dependencies and Imports

In [None]:
# dependencies
!apt-get update && apt-get install -y swig cmake
!pip install box2d-py

!pip install "stable-baselines3[extra]>=2.0.0a4"

In [None]:
# imports
import gymnasium as gym
import imageio
import numpy as np

from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

### 1. Building and Training the Agent

In [None]:
# Deep RL Model using the DQN Algorithm

# hyperparameters (tuned values taken from source 5)
policy = "MlpPolicy"
env = "LunarLander-v2"
learning_rate = 6.3e-4
buffer_size = 50000
learning_starts = 0
batch_size = 128
target_update_interval = 250
gamma = 0.99
train_freq = 4
gradient_steps = -1
exploration_fraction = 0.12
exploration_final_eps = 0.1
policy_kwargs = dict(net_arch=[256, 256])


# model
model = DQN(policy=policy,
            env=env,
            learning_rate=learning_rate,
            buffer_size=buffer_size,
            learning_starts=learning_starts,
            batch_size=batch_size,
            target_update_interval=target_update_interval,
            gamma=gamma,
            train_freq=train_freq,
            gradient_steps=gradient_steps,
            exploration_fraction=exploration_fraction,
            exploration_final_eps=exploration_final_eps,
            policy_kwargs=policy_kwargs
            ).learn(total_timesteps=100_000, progress_bar=True)

# training takes approximately 8 minutes on a GPU
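
The two exploration hyperparameters above are easier to read as a schedule. The sketch below is a plain-Python illustration, not code from Stable Baselines3: `linear_epsilon` is a hypothetical helper, and the starting value of 1.0 is an assumption based on the library's default `exploration_initial_eps`. It assumes epsilon decays linearly over the first `exploration_fraction * total_timesteps` steps and then stays at `exploration_final_eps`.

```python
def linear_epsilon(step, total_timesteps=100_000, exploration_fraction=0.12,
                   initial_eps=1.0, final_eps=0.1):
    """Epsilon after `step` environment steps under a linear decay schedule.

    Hypothetical helper for illustration; initial_eps=1.0 is an assumed default.
    """
    # Number of steps over which epsilon decays (about 12,000 with these values).
    decay_steps = exploration_fraction * total_timesteps
    if step >= decay_steps:
        return final_eps
    # Interpolate linearly between the initial and final values.
    return initial_eps + (step / decay_steps) * (final_eps - initial_eps)


for step in (0, 6_000, 12_000, 50_000):
    print(step, round(linear_epsilon(step), 3))
```

Under these assumptions the agent acts almost entirely at random at the start, and after roughly the first 12% of training it still picks a random action about 10% of the time.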

### 2. Testing the Agent

In [None]:
# record a GIF of the trained agent
images = []
obs = model.env.reset()
img = model.env.render(mode="rgb_array")
for _ in range(5000):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)  # the vectorized env resets itself when an episode ends
    img = model.env.render(mode="rgb_array")

# keep every second frame to reduce the file size
imageio.mimsave("lander_dqn.gif", [np.array(img) for img in images[::2]], fps=29)

In [None]:
# evaluate the model
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(mean_reward, std_reward)  # output from the original run: 245.4761378 72.87695287287264
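
One way to read these two numbers is against the 200-point score that the Gymnasium documentation gives as the bar for a solved Lunar Lander episode. The sketch below is only an illustration: `summarize_evaluation` and `SOLVED_THRESHOLD` are hypothetical names, and the inputs are the values printed above, hard-coded so the snippet runs without a trained model.

```python
SOLVED_THRESHOLD = 200.0  # assumed bar: the score Gymnasium's docs cite for a solved episode


def summarize_evaluation(mean_reward, std_reward, threshold=SOLVED_THRESHOLD):
    """Compare an evaluation result with a reward threshold (hypothetical helper)."""
    return {
        "mean_above_threshold": mean_reward >= threshold,
        "margin": mean_reward - threshold,          # how far the mean clears the bar
        "mean_minus_std": mean_reward - std_reward,  # rough pessimistic estimate
    }


# Values copied from the evaluation output above.
report = summarize_evaluation(245.4761378, 72.87695287287264)
for key, value in report.items():
    print(key, value)
```

The mean clears the threshold, but the mean minus one standard deviation does not. With only 10 evaluation episodes, this suggests some individual landings probably still score below 200.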