# Deep Q-Learning for Lunar Landing

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]


### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

Fully connected layers (or linear layers) are basic layers in neural networks where each input node is connected to each output node.
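To make "each input node is connected to each output node" concrete, here is a minimal pure-Python sketch of one fully connected layer. It is only an illustration: the helper name `linear` and the random weights are assumptions, not what `nn.Linear` does internally.

```python
import random

def linear(inputs, weights, biases):
    """Fully connected layer: every input contributes to every output."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

rng = random.Random(0)
state_size, hidden_size = 8, 64  # 8 inputs, 64 output features (illustrative sizes)

# One row of weights per output node, one weight per input node.
weights = [[rng.uniform(-1, 1) for _ in range(state_size)] for _ in range(hidden_size)]
biases = [0.0] * hidden_size
state = [0.1] * state_size

features = linear(state, weights, biases)
n_params = hidden_size * state_size + hidden_size  # weights plus biases

print(len(features))  # 64
print(n_params)       # 576
```

A layer mapping 8 inputs to 64 outputs therefore carries 8 × 64 weights plus 64 biases.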



In [None]:
class Network(nn.Module):

  def __init__(self, state_size, action_size, seed = 42): # The seed parameter is used for reproducibility (ensuring that results can be repeated).
    super(Network, self).__init__()  #initialize the inherited properties.
    self.seed = torch.manual_seed(seed) #This sets the random seed for PyTorch to ensure that the random operations are reproducible.
    self.fc1 = nn.Linear(state_size, 64) #A fully connected layer that takes state_size inputs and outputs 64 features.
    self.fc2 = nn.Linear(64, 64) #Another fully connected layer that takes 64 inputs and outputs 64 features.
    self.fc3 = nn.Linear(64, action_size) #A fully connected layer that takes 64 inputs and outputs action_size features.

  def forward(self, state): #This method defines the forward pass of the network, which is how the input data passes through the network to produce an output.
    x = self.fc1(state) #Passes the input state through the first fully connected layer.
    x = F.relu(x) # Applies the ReLU (Rectified Linear Unit) activation function to the output of the first layer.
    x = self.fc2(x) # Passes the result through the second fully connected layer.
    x = F.relu(x)
    return self.fc3(x)

**Why is this Important?**

This class defines the architecture of the neural network, specifying the number and types of layers.

The forward method specifies how the data flows through the network, which is crucial for both training and inference.

The use of ReLU activation functions introduces non-linearity, allowing the network to learn more complex functions.
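A small scalar sketch shows why the non-linearity matters. Two stacked linear layers without an activation collapse into a single linear function, while inserting ReLU between them produces a function with a kink. The weights below are arbitrary values chosen for illustration.

```python
def relu(x):
    return max(0.0, x)

# Arbitrary illustrative parameters for two one-dimensional "layers".
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def two_linear(x):
    # Layer 2 applied to layer 1, with no activation in between.
    return w2 * (w1 * x + b1) + b2

def collapsed(x):
    # The same function written as a single linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)

def with_relu(x):
    # ReLU between the layers: flat where w1 * x + b1 < 0, sloped elsewhere.
    return w2 * relu(w1 * x + b1) + b2

xs = [-1.0, 0.0, 0.5, 1.0, 2.0]
same = all(abs(two_linear(x) - collapsed(x)) < 1e-12 for x in xs)
print(same)                        # True
print([with_relu(x) for x in xs])  # [0.5, 0.5, 0.5, -2.5, -8.5]
```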

## Part 2 - Training the AI

### Setting up the environment

In [None]:
import gymnasium as gym
env = gym.make('LunarLander-v2') # an instance of the LunarLander-v2 environment.
state_shape = env.observation_space.shape #Shape of a single observation (state) vector returned by the environment.
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)

State shape:  (8,)
State size:  8
Number of actions:  4


### Initializing the hyperparameters

**Learning Rate:** This controls how much the model's weights are adjusted with respect to the loss gradient during training. A smaller learning rate makes training more stable by making smaller updates, but it might take longer to converge. On the other hand, a larger learning rate speeds up training but can overshoot the optimal solution. 5e-4 (or 0.0005) is a relatively small learning rate, which helps in making fine updates to the model parameters.
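The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is an assumption made for illustration: the notebook itself uses the Adam optimizer on a neural network, where the behaviour is less clear-cut.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimise f(w) = w**2 starting from w0; the gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= lr * grad
    return w

small = gradient_descent(0.01)  # stable but still far from the minimum at 0
medium = gradient_descent(0.4)  # converges quickly
large = gradient_descent(1.1)   # every step overshoots, so |w| grows

print(abs(small) < 1.0, abs(medium) < abs(small), abs(large) > 1.0)  # True True True
```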

**Minibatch Size:** This is the number of samples from the replay buffer that the model trains on at each training step. Instead of updating the model weights after every single sample, minibatches of samples are used to provide a more stable estimate of the gradient. A minibatch size of 100 means that 100 experiences (state, action, reward, next state, done flag) will be sampled and used to update the model in each training iteration.


**Discount Factor (γ):** This determines how much future rewards are worth compared to immediate rewards. A discount factor close to 1 (like 0.99) means that future rewards are almost as valuable as immediate rewards, encouraging the agent to consider long-term gains. The discount factor helps the agent balance short-term and long-term rewards.
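As a rough illustration of what the discount factor does, the sketch below computes a discounted return for a short reward sequence and the weight that γ = 0.99 gives to a reward 100 steps in the future. The helper name `discounted_return` and the reward values are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75

# Weight of a reward 100 steps ahead when gamma = 0.99 (roughly 0.37).
weight_100 = 0.99 ** 100
print(round(weight_100, 2))  # 0.37
```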


**Replay Buffer Size:** The replay buffer stores past experiences for training. Using a replay buffer helps in breaking the correlation between consecutive experiences by sampling randomly from the buffer. A size of 1e5 (or 100,000) means that the buffer can store up to 100,000 experiences. This size is large enough to provide a diverse set of experiences for training while keeping the memory usage manageable.
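One way to picture a fixed-capacity buffer with random sampling is a `collections.deque` with `maxlen`. This is an alternative sketch, not the `ReplayMemory` class used in this notebook, and the tiny capacity of 3 is chosen only to make the eviction visible.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # the oldest item is dropped automatically when full
for step in range(5):
    buffer.append(step)   # stand-in for an experience tuple

print(list(buffer))  # [2, 3, 4]

# Uniform random sampling breaks the ordering of consecutive experiences.
rng = random.Random(0)
batch = rng.sample(list(buffer), k=2)
```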


**Interpolation Parameter (τ):** This is used in the soft update of the target network in DQN (Deep Q-Network). Instead of copying the weights directly from the main (local) network to the target network, each target weight is moved a small step toward the corresponding local weight: the new target weight is τ times the local weight plus (1 − τ) times the old target weight. This soft update stabilizes the learning process by changing the target network gradually. An interpolation parameter of 1e-3 (or 0.001) means that the target network is updated slowly and smoothly.
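The soft update can be sketched on plain lists of numbers. The weights below are invented for illustration, and the half-life estimate assumes the local weights stay fixed, which is not the case during real training.

```python
import math

def soft_update(target, local, tau):
    """Move each target weight a fraction tau of the way toward the local weight."""
    return [tau * l + (1.0 - tau) * t for t, l in zip(target, local)]

target = [0.0, 0.0]
local = [1.0, -2.0]
tau = 1e-3

updated = soft_update(target, local, tau)
print(updated)  # [0.001, -0.002]

# With fixed local weights the remaining gap shrinks by (1 - tau) per update,
# so halving the gap takes about log(0.5) / log(1 - tau) updates.
half_life = math.log(0.5) / math.log(1.0 - tau)
print(round(half_life))  # 693
```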


In [None]:
learning_rate = 5e-4
minibatch_size = 100
discount_factor = 0.99
replay_buffer_size = int(1e5)
interpolation_parameter = 1e-3


### Implementing Experience Replay

The replay memory has two attributes:

**Capacity:** This sets the maximum number of experiences the replay buffer can hold.

**Memory:** This is a list, initially empty, that stores the experiences.

Each stored experience is a tuple of (state, action, reward, next state, done flag). As an example, consider a single step of the lander:

**Initial State:** The lunar lander is hovering above the ground at a certain position and velocity.

**Action:** The agent decides to fire the main engine. This is one of the four discrete actions available in the environment:



*   Do nothing.
*   Fire left orientation engine.
*   Fire main engine.
*   Fire right orientation engine.


**Next State:** After firing the main engine, the new position and velocity of the lander are updated.

**Reward:** The agent receives a small negative reward for using fuel but might get a positive reward if it moved closer to the landing pad.

**Done Flag:** The flag is false and the episode continues, because the lander has not crashed or landed yet.
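The example step above could be represented as a named tuple. This is only a sketch: the `Experience` type, the `ACTIONS` table, and all numeric values are hypothetical, and the notebook's own code stores plain tuples instead.

```python
from collections import namedtuple

# Hypothetical container for one transition; field order mirrors the text above.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

# The four discrete actions, in the order listed above.
ACTIONS = (
    "do nothing",
    "fire left orientation engine",
    "fire main engine",
    "fire right orientation engine",
)

# Made-up values: 8-dimensional states, a small fuel penalty, episode not over.
exp = Experience(
    state=[0.0] * 8,
    action=2,
    reward=-0.3,
    next_state=[0.0] * 8,
    done=False,
)

print(ACTIONS[exp.action])  # fire main engine
```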




In [None]:
class ReplayMemory(object): #This line defines a new class ReplayMemory which will manage the storage and sampling of experiences.

  def __init__(self, capacity):
    # This line checks if a GPU (CUDA) is available. If it is, it sets the device to GPU (cuda:0); otherwise, it defaults to the CPU. This ensures that tensor operations can leverage GPU acceleration if available, speeding up computations.
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.capacity = capacity
    self.memory = []

  def push(self, event): #This method adds new experiences to the replay memory.
    self.memory.append(event) #Adds the new experience (event) to the memory list.
    if len(self.memory) > self.capacity:
      del self.memory[0] #If the memory exceeds its capacity, it removes the oldest experience (at index 0).

  def sample(self, batch_size):
    experiences = random.sample(self.memory, k = batch_size) #Randomly selects batch_size experiences from the memory. Random sampling breaks the correlation between consecutive experiences.
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
    return states, next_states, actions, rewards, dones

### Implementing the DQN class

In [None]:
class Agent():

  def __init__(self, state_size, action_size):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.state_size = state_size
    self.action_size = action_size
    #This neural network is used to estimate Q-values for the current states and actions. It is updated frequently during training.
    self.local_qnetwork = Network(state_size, action_size).to(self.device)
    #This network is used to estimate the target Q-values during the update step. It is updated less frequently to provide stable target values.
    self.target_qnetwork = Network(state_size, action_size).to(self.device)
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
    #This initializes the replay buffer with a specified capacity to store and sample past experiences for training.
    self.memory = ReplayMemory(replay_buffer_size)
    self.t_step = 0

  def step(self, state, action, reward, next_state, done):
    self.memory.push((state, action, reward, next_state, done))
    self.t_step = (self.t_step + 1) % 4
    if self.t_step == 0:
      if len(self.memory.memory) > minibatch_size:
        experiences = self.memory.sample(minibatch_size) #Use the hyperparameter instead of a hard-coded 100.
        self.learn(experiences, discount_factor)

  def act(self, state, epsilon = 0.):
    state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
    self.local_qnetwork.eval()
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
    self.local_qnetwork.train()
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    states, next_states, actions, rewards, dones = experiences
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
    q_expected = self.local_qnetwork(states).gather(1, actions)
    loss = F.mse_loss(q_expected, q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    self.soft_update(self.local_qnetwork, self.target_qnetwork, interpolation_parameter)

  def soft_update(self, local_model, target_model, interpolation_parameter):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
      target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)
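The target computed in `learn` is the reward plus the discounted maximum target-network Q-value of the next state, with the bootstrap term zeroed when the episode has ended. A scalar sketch of that rule, using a hypothetical helper `td_target` and made-up Q-values, might look like this:

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target for a single transition."""
    # No future value is added once the episode has terminated.
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap

next_q = [0.5, 2.0, -1.0, 0.0]  # illustrative Q-values for the four actions

ongoing = td_target(1.0, next_q, done=False)    # 1.0 + 0.99 * 2.0
terminal = td_target(-100.0, next_q, done=True) # just the final reward

print(round(ongoing, 2), terminal)  # 2.98 -100.0
```

The local network's prediction for the action actually taken is then regressed toward this target with a mean squared error loss.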

### Initializing the DQN agent

In [None]:
agent = Agent(state_size, number_actions)

### Training the DQN agent

In [None]:
number_episodes = 2000
maximum_number_timesteps_per_episode = 1000
epsilon_starting_value  = 1.0
epsilon_ending_value  = 0.01
epsilon_decay_value  = 0.995
epsilon = epsilon_starting_value
scores_on_100_episodes = deque(maxlen = 100)

for episode in range(1, number_episodes + 1):
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    action = agent.act(state, epsilon)
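    # Gymnasium returns separate flags: terminated (landed or crashed) and truncated (time limit).
    next_state, reward, terminated, truncated, _ = env.step(action)
    # Only true termination is stored, so truncated episodes still bootstrap from the next state.
    agent.step(state, action, reward, next_state, terminated)
    state = next_state
    score += reward
    if terminated or truncated:
      break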
  scores_on_100_episodes.append(score)
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 200.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break


Episode 100	Average Score: -162.27
Episode 200	Average Score: -118.46
Episode 300	Average Score: -25.03
Episode 400	Average Score: -0.05
Episode 500	Average Score: 128.99
Episode 600	Average Score: 133.90
Episode 700	Average Score: 152.36
Episode 800	Average Score: 176.00
Episode 888	Average Score: 200.21
Environment solved in 788 episodes!	Average Score: 200.21
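The exploration schedule in the training loop multiplies epsilon by 0.995 after every episode and clips it at 0.01. The sketch below works out how long that decay takes; the helper name `epsilon_after` is an assumption introduced for the example.

```python
import math

eps_start, eps_end, decay = 1.0, 0.01, 0.995  # values from the training cell

def epsilon_after(n_episodes):
    """Epsilon after n_episodes multiplicative decays, clipped at the floor."""
    return max(eps_end, eps_start * decay ** n_episodes)

# Smallest number of decays for which eps_start * decay**n <= eps_end.
episodes_to_floor = math.ceil(math.log(eps_end / eps_start) / math.log(decay))

print(episodes_to_floor)              # 919
print(round(epsilon_after(100), 3))   # 0.606
print(epsilon_after(2000))            # 0.01
```

Under this schedule the agent is still mostly exploring after 100 episodes, and in the run logged above the 200-point average was reached at episode 888, shortly before epsilon would have hit its floor.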


## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated  # stop on landing/crash or when the time limit is hit
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v2')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

