# **Homework 12 - Reinforcement Learning**

If you have any problem, e-mail us at mlta-2023-spring@googlegroups.com



## Preliminary work

First, we need to install all necessary packages.
One of them, gym, builded by OpenAI, is a toolkit for developing Reinforcement Learning algorithm. Other packages are for visualization in colab.

In [1]:
!apt update
!apt install python-opengl xvfb -y
!pip install -q swig
!pip install box2d==2.3.2 gym[box2d]==0.25.2 box2d-py pyvirtualdisplay tqdm numpy==1.22.4 
!pip install box2d==2.3.2 box2d-kengz
!pip freeze > requirements.txt


[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                                                               Hit:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                                                               Hit:3 http://archive.ubuntu.com/ubuntu focal-backports InRelease
[33m0% [Connecting to security.ubuntu.com (185.125.190.39)] [Connecting to cloud.r-[0m                                                                               Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
[33m0% [Waiting for headers] [Connected to cloud.r-project.org (108.138.128.85)] [C[0m                                                                               Hit:5 htt


Next, set up virtual display，and import all necessaary packages.

In [2]:
%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm.notebook import tqdm

import random
from collections import namedtuple, deque

_exp = 'model1'

# Warning ! Do not revise random seed !!!
# Your submission on JudgeBoi will not reproduce your result !!!
Make your HW result to be reproducible.


In [3]:
seed = 2023 # Do not change this
def fix(env, seed):
  env.seed(seed)
  env.action_space.seed(seed)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
      torch.cuda.manual_seed_all(seed)

Last, call gym and build an [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) environment.

In [4]:
%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment Do not revise this !!!

## What Lunar Lander？

“LunarLander-v2”is to simulate the situation when the craft lands on the surface of the moon.

This task is to enable the craft to land "safely" at the pad between the two yellow flags.
> Landing pad is always at coordinates (0,0).
> Coordinates are the first two numbers in state vector.

![](https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg)

"LunarLander-v2" actually includes "Agent" and "Environment". 

In this homework, we will utilize the function `step()` to control the action of "Agent". 

Then `step()` will return the observation/state and reward given by the "Environment".

### Observation / State

First, we can take a look at what an Observation / State looks like.

In [5]:
print(env.observation_space)

Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)


  and should_run_async(code)



`Box(8,)`means that observation is an 8-dim vector
### Action

Actions can be taken by looks like

In [6]:
print(env.action_space)

Discrete(4)


`Discrete(4)` implies that there are four kinds of actions can be taken by agent.
- 0 implies the agent will not take any actions
- 2 implies the agent will accelerate downward
- 1, 3 implies the agent will accelerate left and right

Next, we will try to make the agent interact with the environment. 
Before taking any actions, we recommend to call `reset()` function to reset the environment. Also, this function will return the initial state of the environment.

In [7]:
initial_state = env.reset()
print(initial_state)

[-0.00506535  1.413064   -0.5130838   0.09527162  0.00587628  0.11622101
  0.          0.        ]


Then, we try to get a random action from the agent's action space.

In [8]:
random_action = env.action_space.sample()
print(random_action)

1


More, we can utilize `step()` to make agent act according to the randomly-selected `random_action`.
The `step()` function will return four values:
- observation / state
- reward
- done (True/ False)
- Other information

In [9]:
observation, reward, done, info = env.step(random_action)

In [10]:
print(done)

False


### Reward


> Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. 

In [11]:
print(reward)

-1.4981841929643156


### Random Agent
In the end, before we start training, we can see whether a random agent can successfully land the moon or not.

## DQN
Now, we can build a Q network. The network will return one of action in the action space.

In [12]:
#references: https://github.com/Singyuan/Machine-Learning-NTUEE-2022/blob/master/hw12/hw12.ipynb
#references: https://github.com/yujunkuo/ML2022-Homework/blob/main/hw12/hw12_strong.ipynb
#https://medium.com/pyladies-taiwan/reinforcement-learning-%E9%80%B2%E9%9A%8E%E7%AF%87-deep-q-learning-26b10935a745
#https://ithelp.ithome.com.tw/articles/10208668


class QNetwork(nn.Module):

    def __init__(self, state_size=8, action_size=4):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

Then, we need to build a simple agent. The agent will acts according to the output of the policy network above. There are a few things can be done by agent:
- `learn()`：update the policy network from log probabilities and rewards.
- `sample()`：After receiving observation from the environment, utilize policy network to tell which action to take. The return values of this function includes action and log probabilities. 

In [13]:
#references: https://github.com/Singyuan/Machine-Learning-NTUEE-2022/blob/master/hw12/hw12.ipynb
#references: https://github.com/yujunkuo/ML2022-Homework/blob/main/hw12/hw12_strong.ipynb
#https://medium.com/pyladies-taiwan/reinforcement-learning-%E9%80%B2%E9%9A%8E%E7%AF%87-deep-q-learning-26b10935a745
#https://ithelp.ithome.com.tw/articles/10208668


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

#Parameters
BUFFER_SIZE = int(1e5)
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate 
UPDATE_EVERY = 2        # how often to update the network
CLIP_GRAD_NORM = 0.8

class DQNAgent():

    def __init__(self, state_size=8, action_size=4):
        
        self.state_size = state_size
        self.action_size = action_size
        self.qnetwork_local = QNetwork(state_size, action_size).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE)
        self.t_step = 0
    
    def step(self, state, action, reward, next_state, done):
        self.memory.add(state, action, reward, next_state, done)
        
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            if len(self.memory) > BATCH_SIZE:
                idxs, is_weight, experiences = self.memory.sample()
                self.learn(experiences, GAMMA, idxs, is_weight)

    def act(self, state, eps=0.):
        
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma, idxs, is_weights):      
        states, actions, rewards, next_states, dones = experiences
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        Q_expected = self.qnetwork_local(states).gather(1, actions)
        loss = (torch.FloatTensor(is_weights) * F.mse_loss(Q_expected, Q_targets)).mean()
        errors = torch.abs(Q_expected.detach() - Q_targets).data.squeeze().tolist()
        self.memory.update(idxs, errors)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.qnetwork_local.parameters(), CLIP_GRAD_NORM)
        self.optimizer.step()
        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)                     

    def soft_update(self, local_model, target_model, tau):
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

    def get_mem_parms(self):
        alpha, beta = self.memory.get_parm()
        return alpha, beta

In [14]:
#references: https://github.com/Singyuan/Machine-Learning-NTUEE-2022/blob/master/hw12/hw12.ipynb
#references: https://github.com/yujunkuo/ML2022-Homework/blob/main/hw12/hw12_strong.ipynb
#https://medium.com/pyladies-taiwan/reinforcement-learning-%E9%80%B2%E9%9A%8E%E7%AF%87-deep-q-learning-26b10935a745
#https://ithelp.ithome.com.tw/articles/10208668


# a binary tree data structure to represent that the parent’s value is the sum of its children
class SumTree:
    write = 0

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)
        self.data = np.zeros(capacity, dtype=object) 
        self.n_entries = 0

    def _propagate(self, idx, change):
        parent = (idx - 1) // 2

        self.tree[parent] += change

        if parent != 0:
            self._propagate(parent, change)

    def _retrieve(self, idx, s):
        left = 2 * idx + 1
        right = left + 1

        if left >= len(self.tree):
            return idx

        if s <= self.tree[left]:
            return self._retrieve(left, s)
        else:
            return self._retrieve(right, s - self.tree[left])

    def total(self):
        return self.tree[0]

    def add(self, p, data):
        idx = self.write + self.capacity - 1

        self.data[self.write] = data
        self.update(idx, p)

        self.write += 1
        if self.write >= self.capacity:
            self.write = 0

        if self.n_entries < self.capacity:
            self.n_entries += 1

    def update(self, idx, p):
        change = p - self.tree[idx]

        self.tree[idx] = p
        self._propagate(idx, change)

    def get(self, s):
        
        idx = self._retrieve(0, s)
        dataIdx = idx - self.capacity + 1

        return (idx, self.tree[idx], self.data[dataIdx])

In [15]:
#references: https://github.com/Singyuan/Machine-Learning-NTUEE-2022/blob/master/hw12/hw12.ipynb
#references: https://github.com/yujunkuo/ML2022-Homework/blob/main/hw12/hw12_strong.ipynb
#https://medium.com/pyladies-taiwan/reinforcement-learning-%E9%80%B2%E9%9A%8E%E7%AF%87-deep-q-learning-26b10935a745
#https://ithelp.ithome.com.tw/articles/10208668


class ReplayBuffer:

    def __init__(self, action_size, buffer_size, batch_size):
        
        self.action_size = action_size
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.tree =  SumTree(buffer_size)

        self.buffer_size = buffer_size
        self.prio_max = 0.1
        self.alpha = 1.0
        self.e = 0.01
        self.beta = 0.5
        self.beta_growth_rate = 1.005
        self.alpha_decay_rate = 0.995

        self.update_cnt = 0
    
    def add(self, state, action, reward, next_state, done):
        data = self.experience(state, action, reward, next_state, done)
        p = (np.abs(self.prio_max) + self.e) ** self.alpha 
        self.tree.add(p, data)
    
    def sample(self):
        states_lst, actions_lst, rewards_lst, next_states_lst, dones_lst = [], [], [], [], []
        idxs = []
        segment = self.tree.total() / self.batch_size
        priorities = []

        for i in range(self.batch_size):
            a = segment * i
            b = segment * (i + 1)
            s = random.uniform(a, b)
            idx, p, ex = self.tree.get(s)
            if ex is not None:
                states_lst.append(ex.state)
                actions_lst.append(ex.action)
                rewards_lst.append(ex.reward)
                next_states_lst.append(ex.next_state)
                dones_lst.append(ex.done)
                priorities.append(p)
                idxs.append(idx)
                
            
        states = torch.from_numpy(np.vstack(states_lst)).float().to(device)
        actions = torch.from_numpy(np.vstack(actions_lst)).long().to(device)
        rewards = torch.from_numpy(np.vstack(rewards_lst)).float().to(device)
        next_states = torch.from_numpy(np.vstack(next_states_lst)).float().to(device)
        dones = torch.from_numpy(np.vstack(dones_lst).astype(np.uint8)).float().to(device)

        if self.update_cnt % 200 == 0:
            self.beta = np.min([1., self.beta*self.beta_growth_rate])
            self.alpha = np.max([0.5, self.alpha*self.alpha_decay_rate])
        self.update_cnt += 1

        sampling_probabilities = priorities / self.tree.total()
        is_weight = np.power(self.tree.n_entries * sampling_probabilities, -self.beta)
        is_weight /= is_weight.max()

        return idxs, is_weight, (states, actions, rewards, next_states, dones)

    def update(self, idxs, errors):
        self.prio_max = max(self.prio_max, max(np.abs(errors)))
        for i, idx in enumerate(idxs):
            p = (np.abs(errors[i]) + self.e) ** self.alpha
            self.tree.update(idx, p) 

    def __len__(self):
        return self.tree.n_entries

    def get_parm(self):
        return self.alpha, self.beta

## Training Agent

Now let's start to train our agent.
Through taking all the interactions between agent and environment as training data, the policy network can learn from all these attempts,

In [16]:
#references: https://github.com/Singyuan/Machine-Learning-NTUEE-2022/blob/master/hw12/hw12.ipynb
#references: https://github.com/yujunkuo/ML2022-Homework/blob/main/hw12/hw12_strong.ipynb
#https://medium.com/pyladies-taiwan/reinforcement-learning-%E9%80%B2%E9%9A%8E%E7%AF%87-deep-q-learning-26b10935a745
#https://ithelp.ithome.com.tw/articles/10208668


class DQN(object):
  def __init__(self, env, n_episodes=3500, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    self.agent = DQNAgent()
    self.env = env
    self.n_episodes = n_episodes
    self.max_t = max_t
    self.eps_start = eps_start
    self.eps_end = eps_end
    self.eps_decay = eps_decay

  def train(self):
    scores, final_rewards = [], []
    scores_window = deque(maxlen=100)
    eps = self.eps_start                    
    score_best = 0.0

    for i_episode in range(1, self.n_episodes+1):
      state = self.env.reset()
      score = 0
      for t in range(self.max_t):
        action = self.agent.act(state, eps)
        next_state, reward, done, _ = self.env.step(action)
        self.agent.step(state, action, reward, next_state, done)
        state = next_state
        score += reward
        if done:
            final_rewards.append(reward)
            break 
      scores_window.append(score)       
      scores.append(score)              
      eps = max(self.eps_end, self.eps_decay*eps)

      alpha, beta = self.agent.get_mem_parms()

      if i_episode % 100 == 0:
        print('\rEpisode {}\tAverage Score: {:.2f}\tepsilon-greedy: {:.4f}\talpha: {:.3f}, \tbeta: {:.3f}'.format(i_episode, np.mean(scores_window), eps, alpha, beta))

      if np.mean(scores_window)>=score_best+5:
        print('Environment saved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        torch.save(self.agent.qnetwork_local.state_dict(), f'checkpoint_{_exp}.pth')
        score_best = np.mean(scores_window).item()
        if np.mean(scores_window)>=280.0:
            print('Environment saved in {:d} episodes!\tAverage Score: {:.2f} <-- exceed baseline'.format(i_episode, np.mean(scores_window)))
            torch.save(self.agent.qnetwork_local.state_dict(), f'checkpoint_{_exp}.pth')
            break
    # final step
    if np.mean(scores_window)>=score_best:
      print('Environment saved in {:d} episodes!\tAverage Score: {:.2f} <-- final step'.format(i_episode, np.mean(scores_window)))
      torch.save(self.agent.qnetwork_local.state_dict(), f'checkpoint_{_exp}.pth')

    return scores, final_rewards


In [None]:
DQN = DQN(env)
scores, final_rewards = DQN.train()

Episode 100	Average Score: -180.35	epsilon-greedy: 0.6058	alpha: 0.882, 	beta: 0.566
Episode 200	Average Score: -115.47	epsilon-greedy: 0.3670	alpha: 0.711, 	beta: 0.702


### Training Result
During the training process, we recorded `avg_total_reward`, which represents the average total reward of episodes before updating the policy network.

Theoretically, if the agent becomes better, the `avg_total_reward` will increase.
The visualization of the training process is shown below:  


In [None]:
plt.plot(scores)
plt.title("Total Rewards")
plt.show()

In addition, `avg_final_reward` represents average final rewards of episodes. To be specific, final rewards is the last reward received in one episode, indicating whether the craft lands successfully or not.


In [None]:
plt.plot(final_rewards)
plt.title("Final Rewards")
plt.show()

## Testing
The testing result will be the average reward of 5 testing

In [None]:
fix(env, seed)
DQN.agent.qnetwork_local.load_state_dict(torch.load(f'checkpoint_{_exp}.pth'))
#DQN.agent.qnetwork_local.load_state_dict(torch.load(f'checkpoint.pth'))
# agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5 # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
  actions = []
  state = env.reset()

  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False
  while not done:
      action = DQN.agent.act(state)
      actions.append(action)
      state, reward, done, _ = env.step(action)

      total_reward += reward
      
  print(total_reward)
  test_total_reward.append(total_reward)

  action_list.append(actions) # save the result of testing 


In [None]:
print(np.mean(test_total_reward))

Action list

In [None]:
print("Action list looks like ", action_list)
print("Action list's shape looks like ", np.shape(action_list))

Analysis of actions taken by agent

In [None]:
distribution = {}
for actions in action_list:
  for action in actions:
    if action not in distribution.keys():
      distribution[action] = 1
    else:
      distribution[action] += 1
print(distribution)

Saving the result of Model Testing


In [None]:
PATH = "d11948002_hw12.npy" # Can be modified into the name or path you want
np.save(PATH ,np.array(action_list)) 

### This is the file you need to submit !!!
Download the testing result to your device



In [None]:
from google.colab import files
files.download(PATH)

# Server 
The code below simulate the environment on the judge server. Can be used for testing.

In [None]:
action_list = np.load(PATH,allow_pickle=True) # The action list you upload
seed = 2023 # Do not revise this
fix(env, seed)

#agent.network.eval()  # set network to evaluation mode
DQN.agent.qnetwork_local.load_state_dict(torch.load(f'checkpoint_{_exp}.pth'))
#DQN.agent.qnetwork_local.load_state_dict(torch.load(f'checkpoint.pth'))
#agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))

test_total_reward = []
if len(action_list) != 5:
  print("Wrong format of file !!!")
  exit(0)
for actions in action_list:
  state = env.reset()
  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False

  for action in actions:
  
      state, reward, done, _ = env.step(action)
      total_reward += reward
      if done:
        break

  print(f"Your reward is : %.2f"%total_reward)
  test_total_reward.append(total_reward)

# Your score

In [None]:
print(f"Your final reward is : %.2f"%np.mean(test_total_reward))

## Reference

Below are some useful tips for you to get high score.

- [DRL Lecture 1: Policy Gradient (Review)](https://youtu.be/z95ZYgPgXOY)
- [ML Lecture 23-3: Reinforcement Learning (including Q-learning) start at 30:00](https://youtu.be/2-JNBzCq77c?t=1800)
- [Lecture 7: Policy Gradient, David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf)
