*DELE CA2 Part B Submission*

# Lunar Lander: Learning to land a rocket! 

|          Name        |      Class    | Admin No. |
|----------------------|---------------|-----------|
| Timothy Chia Kai Lun | DAAA/FT/2B/02 | P2106911  |
|      Lim Jun Jie     | DAAA/FT/2B/02 | P2100788  |

![](https://i.imgur.com/2pzRsfx.jpg)

**<u>Objectives</u>**

We are tasked with training an agent to land the lunar lander safely on a landing pad, [as described in the documentation](https://gymnasium.farama.org/environments/box2d/lunar_lander/), using reinforcement learning techniques. The environment is provided through the [Gymnasium library](https://gymnasium.farama.org/).

We aim to study the behaviours that agents exhibit in the Lunar Lander environment and to optimize them so as to maximize the rewards obtained.

---

## 1. About The Environment

Lunar Lander is a classic rocket trajectory optimization problem. The environment comes in two versions, discrete and continuous, which differ in their action spaces. In both, the goal is to land the lunar lander safely on the landing pad located at (0, 0).

For this group assignment, we apply different RL algorithms to the discrete version of the environment, which has 4 actions in its action space and represents each state as an 8-dimensional vector.

### 1.1 States and Actions

$$
\text{State Vector}
\begin{cases}
    s_0=\text{x-axis coord of agent} \\
    s_1=\text{y-axis coord of agent} \\
    s_2=\text{x-axis linear velocity} \\
    s_3=\text{y-axis linear velocity} \\
    s_4=\text{Agent's angle} \\
    s_5=\text{Agent's angular velocity} \\
    s_6=\text{Right leg touched ground} \\
    s_7=\text{Left leg touched ground}
\end{cases}
$$

$$
\text{Action Space}
\begin{cases}
    a_0=\text{Do nothing} \\
    a_1=\text{Fire left engine} \\
    a_2=\text{Fire main engine} \\
    a_3=\text{Fire right engine}
\end{cases}
$$
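To make the indices above easier to work with in code, the state vector and action space can be given names. The following is a minimal sketch, not part of the project's modules: `Action`, `LanderState` and `decode_state` are hypothetical helper names, and the field order simply follows the listing above.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    """Discrete actions, numbered as in the action space above."""
    NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class LanderState(NamedTuple):
    """Named view of the 8-dimensional state vector (hypothetical helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    right_leg_contact: bool
    left_leg_contact: bool


def decode_state(obs: Sequence[float]) -> LanderState:
    """Convert a raw 8-element observation into a LanderState."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 state components, got {len(obs)}")
    # The first six entries are continuous; the last two are contact flags.
    return LanderState(*(float(v) for v in obs[:6]), bool(obs[6]), bool(obs[7]))


# Illustrative observation (made-up numbers, not taken from a real episode).
state = decode_state([0.1, 1.4, -0.2, -0.5, 0.05, 0.0, 1.0, 0.0])
print(state.right_leg_contact, Action.FIRE_MAIN.value)
```

Naming the components this way keeps later analysis code readable, for example when checking whether both legs are in contact at the end of an episode.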

### 1.2 Reward Scheme

- Moving from the top of the screen to the landing pad and coming to rest earns about 100-140 points.
- Coming to rest yields an additional 100 points.
- Each leg with ground contact earns 10 points.
- Moving away from the landing pad reduces the reward.
- Crashing incurs an additional penalty of 100 points.
- Firing the main engine costs 0.3 points per frame.
- Firing a side engine costs 0.03 points per frame.

An episode is considered solved when its total reward is at least 200 points.
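The 200-point criterion applies to the sum of the per-step rewards in one episode. A minimal sketch of that check is given below; `episode_return`, `is_solved` and `SOLVED_THRESHOLD` are hypothetical names introduced here for illustration, and the reward values in the usage lines are made up.

```python
from typing import Iterable

# Threshold stated in the text above: an episode scoring at least 200 is solved.
SOLVED_THRESHOLD = 200.0


def episode_return(step_rewards: Iterable[float]) -> float:
    """Sum the per-step rewards collected over a single episode."""
    return float(sum(step_rewards))


def is_solved(step_rewards: Iterable[float]) -> bool:
    """Return True when the episode's total reward reaches the threshold."""
    return episode_return(step_rewards) >= SOLVED_THRESHOLD


# Illustrative (made-up) per-step rewards for two episodes.
good_episode = [120.0, 100.0, 10.0, 10.0, -40.0]
crashed_episode = [60.0, -100.0]
print(is_solved(good_episode), is_solved(crashed_episode))
```

In practice, a single episode above the threshold can be a lucky run, so averaging the return over many episodes gives a steadier picture of an agent's performance.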

---

## 2. Project Setup

We have written the algorithms used in this assignment, along with utility classes, as Python modules. These scripts are located in the `.\models` directory. Rendered episodes are saved as .gif files in the `.\gifs` directory.

For the deep learning aspect of the assignment, we use TensorFlow 2. The model weights saved on completion of training and the history of metrics can be found in `.\assets` and `.\history` respectively.

```python
import os
import json
import numpy as np
import gymnasium as gym
from matplotlib import pyplot as plt

from models.dqn import DQN
from models.sarsa import SARSA
from models.dueling_dql import DuelingDQL
from models.utils import ReplayBuffer, EpisodeSaver
```

---

## 3. Reinforcement Learning Algorithms

In this assignment, we compare and analyse the differences in application and performance of several reinforcement learning algorithms.

Reinforcement learning involves several key elements:

1. An agent
2. A policy
3. A reward signal
4. A value function 

The goal of an agent is to maximize cumulative future reward. Its actions are determined by a policy, which maps each state to a probability distribution over actions. We explore two ways of learning a policy, off-policy and on-policy:

source: https://analyticsindiamag.com/reinforcement-learning-policy/

**<u>Off-policy</u>**



Q-learning is an off-policy algorithm: the policy it learns about (the target policy, which is greedy with respect to the current Q-values) differs from the policy it uses to select actions (the behaviour policy, for example epsilon-greedy). Its update bootstraps from the highest-valued action in the next state, regardless of which action the agent actually takes next.

**<u>On-policy</u>**

SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm: it estimates the value of the policy that the agent is actually following. Its update bootstraps from the action the agent actually takes in the next state, so the policy being updated and the policy used to act are the same, unlike in Q-learning.
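The difference between the two approaches shows up most clearly in the target each algorithm bootstraps from. The sketch below uses the standard one-step targets; the function names and the discount factor `gamma` are assumptions introduced here and are not taken from the project's `models` package.

```python
from typing import Sequence


def q_learning_target(reward: float, next_q_values: Sequence[float],
                      gamma: float, done: bool) -> float:
    """Off-policy target: bootstrap from the best action in the next state."""
    if done:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(next_q_values)


def sarsa_target(reward: float, next_q_values: Sequence[float],
                 next_action: int, gamma: float, done: bool) -> float:
    """On-policy target: bootstrap from the action actually taken next."""
    if done:
        return reward
    return reward + gamma * next_q_values[next_action]


# Illustrative values: four Q-values for the next state, one per action.
next_q = [1.0, 3.0, 2.0, 0.5]
print(q_learning_target(1.0, next_q, gamma=0.5, done=False))  # uses max -> 2.5
print(sarsa_target(1.0, next_q, next_action=2, gamma=0.5, done=False))  # -> 2.0
```

The two targets coincide whenever the behaviour policy happens to pick the greedy action; they differ on exploratory steps, which is where SARSA accounts for the cost of exploration and Q-learning does not.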

### 3.1 Baseline: Deep Q-Learning (DQN)

#### 3.1.1 Epsilon Greedy

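```python
import numpy as np

def select_action(q_values, epsilon):  # illustrative helper name
    if np.random.random() < epsilon:
        # explore: pick a uniformly random action
        return np.random.randint(len(q_values))
    # exploit: pick the action with the highest estimated Q-value
    return int(np.argmax(q_values))
```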

#### 3.1.2 Experience Replay
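As a rough illustration of the idea, experience replay stores past transitions in a fixed-size buffer and trains on random mini-batches drawn from it, rather than on consecutive steps. The sketch below is an assumption-laden stand-in: `SimpleReplayBuffer` is a hypothetical class written for this example and is not the `ReplayBuffer` from `models.utils`, whose interface may differ.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[Sequence[float], int, float, Sequence[float], bool]


class SimpleReplayBuffer:
    """Fixed-capacity buffer of transitions (hypothetical, for illustration)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards the oldest transition once full.
        self._storage: Deque[Transition] = deque(maxlen=capacity)

    def add(self, state: Sequence[float], action: int, reward: float,
            next_state: Sequence[float], done: bool) -> None:
        """Append one transition to the buffer."""
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a mini-batch uniformly at random, without replacement."""
        return random.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


# Usage with made-up one-dimensional states.
buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.add([float(step)], 0, 1.0, [float(step + 1)], False)
print(len(buffer), len(buffer.sample(2)))
```

Sampling uniformly from the buffer breaks the correlation between consecutive steps of an episode, which tends to stabilise training of the Q-network.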

#### 3.1.3 Network Architecture

### 3.2 State-Action-Reward-State-Action (SARSA)

### 3.3 Dueling Deep Q-Learning (DDQN)

---
## 4. Experiments

### 4.1 Random Engine Failure

---
## 5. Evaluation

## 6. Conclusion

---

## 7. References

In-text citations:

(Sutton & Barto, 2018)

(Plaat, 2022)

- Sutton, R.S. and Barto, A.G. (2018) Reinforcement Learning: An Introduction. 2nd edn. Available at: http://www.incompleteideas.net/book/the-book-2nd.html (Accessed: January 31, 2023).

- Plaat, A. (2022) Deep Reinforcement Learning, a textbook, arXiv.org. Available at: https://arxiv.org/abs/2201.02135 (Accessed: January 31, 2023).