# **Homework 12 - Reinforcement Learning**

If you have any problems, email us at ntu-ml-2022spring-ta@googlegroups.com



* [Rainbow: a deep reinforcement learning method that combines six DQN improvements](https://www.jianshu.com/p/1dfd84cd2e69)
* [ericyangyu/PPO-for-Beginners](https://github.com/ericyangyu/PPO-for-Beginners)

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Preliminary work

First, we need to install all necessary packages.
One of them, gym, built by OpenAI, is a toolkit for developing reinforcement learning algorithms. The other packages are for visualization in Colab.

In [None]:
!apt update
!apt install python-opengl xvfb -y
!pip install gym[box2d]==0.18.3 pyvirtualdisplay numpy==1.19.5 torch==1.8.1

[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
[33m0% [Connecting to security.ubuntu.com (91.189.91.39)] [Connected to cloud.r-pro[0m                                                                               Hit:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
                                                                               Hit:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
[33m0% [Waiting for headers] [Connected to cloud.r-project.org (13.225.213.82)] [Co[0m[33m0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Connected to cloud.r-projec[0m                                                                               Hit:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
                                                                               Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:6 https://cloud.r-project


Next, set up the virtual display and import all necessary packages.

In [None]:
%%capture
from pyvirtualdisplay import Display

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import math
import numpy as np
from tqdm.auto import tqdm
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd 
import torch.nn.functional as F
from torch.distributions import Categorical

In [None]:
cuda = True if torch.cuda.is_available() else False
device = torch.device('cuda:0' if cuda else 'cpu')
FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
Variable = lambda *args, **kwargs: autograd.Variable(*args, **kwargs).cuda() if cuda else autograd.Variable(*args, **kwargs)
device

device(type='cpu')

# Warning! Do not change the random seed!
# Otherwise your submission on JudgeBoi will not reproduce your result!
Fixing the seed makes your homework result reproducible.


In [None]:
seed = 543 # Do not change this
def fix(env, seed):
    env.seed(seed)
    env.action_space.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.set_deterministic(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
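To illustrate why every random source has to be seeded, here is a minimal sketch that uses only Python's standard `random` module. The helper name `draw` is hypothetical and not part of the homework code; the point is simply that the same seed yields the same sequence, which is what `fix` arranges for gym, NumPy, and PyTorch as well.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random action indices from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 3) for _ in range(n)]

first = draw(543)
second = draw(543)

# Same seed, same sequence: this is the property the grader relies on.
print(first == second)
```

If any one source of randomness is left unseeded, two runs can diverge even though all the others agree.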

Last, call gym and build a [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) environment.

In [None]:
%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment Do not revise this !!!

## What is Lunar Lander?

"LunarLander-v2" simulates a craft landing on the surface of the moon.

This task is to enable the craft to land "safely" at the pad between the two yellow flags.
> Landing pad is always at coordinates (0,0).
> Coordinates are the first two numbers in state vector.

![](https://user-images.githubusercontent.com/15806078/153222406-af5ce6f0-4696-4a24-a683-46ad4939170c.gif)

"LunarLander-v2" actually includes "Agent" and "Environment". 

In this homework, we will utilize the function `step()` to control the action of "Agent". 

Then `step()` will return the observation/state and reward given by the "Environment".

### Observation / State

First, we can take a look at what an Observation / State looks like.

In [None]:
print(env.observation_space)

Box(-inf, inf, (8,), float32)



`Box(-inf, inf, (8,), float32)` means that each observation is an 8-dimensional vector.

### Action

The actions that the agent can take look like this:

In [None]:
print(env.action_space)

Discrete(4)


`Discrete(4)` means that the agent can take four kinds of actions:
- 0: do nothing
- 2: fire the main engine, which pushes the craft upward against gravity
- 1, 3: fire the left or right orientation engine

Next, we will try to make the agent interact with the environment. 
Before taking any action, we recommend calling the `reset()` function to reset the environment. This function also returns the initial state of the environment.

In [None]:
initial_state = env.reset()
print(initial_state)

[ 0.00396109  1.4083536   0.40119505 -0.11407257 -0.00458307 -0.09087662
  0.          0.        ]


Then, we try to get a random action from the agent's action space.

In [None]:
random_action = env.action_space.sample()
print(random_action)

0


Furthermore, we can use `step()` to make the agent act according to the randomly selected `random_action`.
The `step()` function will return four values:
- observation / state
- reward
- done (True/ False)
- Other information

In [None]:
observation, reward, done, info = env.step(random_action)

In [None]:
print(done)

False
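The `reset()` / `step()` cycle described above can be sketched without gym at all. `ToyEnv` below is a hypothetical stand-in that only mimics the four-value return of `step()` in the gym version pinned here; its reward rule is made up for illustration and is not the Lunar Lander reward.

```python
import random

class ToyEnv:
    """Hypothetical stand-in that mimics the reset()/step() shape used in this notebook."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 8  # 8-dim observation, like LunarLander-v2

    def step(self, action):
        self.t += 1
        observation = [float(self.t)] * 8
        reward = -0.3 if action == 2 else 0.0  # toy rule, not the real reward
        done = self.t >= self.horizon          # episode ends after `horizon` steps
        info = {}
        return observation, reward, done, info

env_toy = ToyEnv()
rng = random.Random(0)
state = env_toy.reset()
done, steps, total = False, 0, 0.0
while not done:
    action = rng.randrange(4)  # random action in {0, 1, 2, 3}
    state, reward, done, info = env_toy.step(action)
    total += reward
    steps += 1
print(steps, done)
```

The loop has the same shape as the random-agent cell later in this notebook: act, receive `(observation, reward, done, info)`, and stop once `done` is `True`.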


### Reward


> Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. 

In [None]:
print(reward)

-0.8588900517154912
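As a rough worked example of the numbers in the quote, the sketch below adds up only the terms that are spelled out explicitly: the terminal bonus of -100 or +100, +10 per leg contact, and -0.3 per frame of main-engine firing. The function name `listed_reward_terms` and its arguments are hypothetical, and the position/velocity shaping term is deliberately ignored, so this is not the environment's actual reward function.

```python
def listed_reward_terms(main_engine_frames, leg_contacts, crashed):
    """Sum only the terms listed in the quote; the shaping term is ignored."""
    terminal = -100.0 if crashed else 100.0
    return terminal + 10.0 * leg_contacts - 0.3 * main_engine_frames

# A landing at rest with both legs down after 100 frames of main-engine use.
soft_landing = listed_reward_terms(main_engine_frames=100, leg_contacts=2, crashed=False)
# A crash after 20 frames of main-engine use and no leg contact.
crash = listed_reward_terms(main_engine_frames=20, leg_contacts=0, crashed=True)
print(soft_landing, crash)
```

Under these assumptions the landing collects about 90 points from the listed terms and the crash about -106, which shows how strongly the terminal bonus dominates the per-frame fuel cost.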


### Random Agent
Finally, before we start training, we can check whether a random agent can land on the moon successfully.

In [None]:
%%script false --no-raise-error

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
rewards = []
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)
    
    rewards.append(reward)
    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
print(np.mean(rewards))
virtual_display.stop()

## Model

In [None]:
class Actor(nn.Module):

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.model(state)
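The final `nn.Softmax(dim=-1)` layer turns the network's four raw outputs into a probability distribution over actions, from which the agents in this notebook sample through `Categorical`. The following is a minimal pure-Python sketch of that idea; the logits and the helper names `softmax` and `sample_action` are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an index according to `probs` and return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

probs = softmax([2.0, 1.0, 0.5, -1.0])  # hypothetical logits for 4 actions
action, log_prob = sample_action(probs, random.Random(0))
print(sum(probs), action, log_prob)
```

The log-probability of the sampled action is the quantity that the policy gradient update later multiplies by the return.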

In [None]:
class Critic(nn.Module):

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, state):
        return self.model(state)

## utils

### Test Env 
Used for cheating: only for this homework, not possible in reality.

In [None]:
def test_agent(agent, env, seed=543):
    fix(env, seed)
    agent.eval()  # set the network into evaluation mode
    NUM_OF_TEST = 5 # Do not revise this !!!
    test_total_reward = []
    for i in range(NUM_OF_TEST):
        state = env.reset()

        total_reward = 0

        done = False
        while not done:
            action, _ = agent.action(np.expand_dims(state, axis=0))
            state, reward, done, _ = env.step(action.squeeze())

            total_reward += reward

        test_total_reward.append(total_reward)
    
    agent.train()
    return np.mean(test_total_reward)

### Play one and collect data

In [None]:
def play_one_episode(env, agent, gamma, max_ep_len):
    state = env.reset()
    
    states = []
    actions = []
    log_probs = []
    rewards = []
    rewards_acc = []
    next_states = []
    dones = []

    for _ in range(max_ep_len):
        # take action
        action, log_prob = agent.action(np.expand_dims(state,axis=0))

        # interact
        next_state, reward, done, _ = env.step(action.squeeze())

        # store trajectory
        states.append(state)
        actions.append(action)
        log_probs.append([log_prob]) # [log(a1|s1), log(a2|s2), ...., log(at|st)]
        rewards.append([reward])
        next_states.append(next_state)
        dones.append([done])
        
        state = next_state
        if done:
            break

    # ! IMPORTANT !
    # Baseline: use the immediate reward of each action.
    #     actions: a1, a2, a3, ...
    #     rewards: r1, r2, r3, ...
    # Medium: replace "rewards" with the discounted cumulative reward (gamma = 0.99 here).
    #     actions: a1,                             a2,                             a3, ...
    #     rewards: r1 + 0.99*r2 + 0.99^2*r3 + ..., r2 + 0.99*r3 + 0.99^2*r4 + ..., r3 + 0.99*r4 + 0.99^2*r5 + ...
    # Boss: implement Actor-Critic.
    # The loop below computes the discounted version and stores it in rewards_acc.
    R = 0
    for r in rewards[::-1]:
        R = r[0] + gamma * R
        rewards_acc.insert(0, [R])
    
    # same batch size
    batch = len(states)
    assert np.shape(states) == (batch, 8)
    assert np.shape(actions) == (batch, 1)
    assert torch.Tensor(log_probs).shape == (batch, 1)
    assert np.shape(rewards) == (batch, 1)
    assert np.shape(rewards_acc) == (batch, 1)
    assert np.shape(next_states) == (batch, 8)
    assert np.shape(dones) == (batch, 1)
    
    return states, actions, log_probs, rewards, rewards_acc, next_states, dones
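The backward loop that fills `rewards_acc` can be checked by hand on a tiny example. The sketch below is a standalone re-statement of that loop on plain floats; the name `discounted_returns` is hypothetical, and `gamma=0.5` is chosen only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):  # walk backward so each step reuses the next one
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Walking backward makes the computation linear in the episode length, because each return is one multiply-add on top of the return of the following step.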

## Policy Gradient
Now we can build a simple policy network. Given a state, the agent samples one action from the action space according to the network's output probabilities.

In [None]:
class PolicyGradientAgent(nn.Module):
    def __init__(self, actor, optimizer):
        super().__init__()
        self.actor = actor
        self.optimizer = optimizer
        
    def forward(self, x):
        return self.actor(x)
    
    def action(self, state):
        action_prob = self(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        
        return action.cpu().numpy(), log_prob
    
    def learn(self, log_probs, rewards):
        # use torch.stack to keep the gradient
        log_probs = torch.stack([log_prob[0] for log_prob in log_probs])  # (batch, 1)
        rewards = FloatTensor(rewards)                                    # (batch, 1)

        # train
        rewards_norm = (rewards - rewards.mean()) / (rewards.std() + 1e-9)  # important: normalize the returns so that both negative and positive values appear
        loss = -log_probs * rewards_norm
        loss = loss.sum() # must be sum
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        return loss.item()
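The effect of the normalization step in `learn` is easier to see on a few numbers. The sketch below reproduces the same arithmetic with plain floats; the helper names are hypothetical, and `statistics.stdev` is used because it is the sample standard deviation (dividing by n - 1), which matches the default behaviour of `torch.std`.

```python
import statistics

def normalize(returns, eps=1e-9):
    """Shift to zero mean and scale by the sample standard deviation (n - 1)."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)
    return [(g - mean) / (std + eps) for g in returns]

def reinforce_loss(log_probs, returns):
    """Sum of -log pi(a_t | s_t) * normalized return: the scalar that is minimized."""
    return sum(-lp * g for lp, g in zip(log_probs, normalize(returns)))

norm = normalize([1.0, 2.0, 3.0])                            # roughly [-1, 0, 1]
loss = reinforce_loss([-0.5, -1.0, -2.0], [1.0, 2.0, 3.0])   # roughly 1.5
print(norm, loss)
```

After normalization, actions with below-average returns get a negative weight and are made less likely, while above-average ones are reinforced, even when every raw return has the same sign.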

Lastly, build a network and agent to start training.

In [None]:
epochs = 2000
episodes = 5
max_ep_len = 300
gamma = 0.99

learning_rate = 0.001

In [None]:
actor = Actor(8, 4)
optimizer = optim.SGD(actor.parameters(), lr=learning_rate) # !important SGD
agent_policygradient = PolicyGradientAgent(actor, optimizer)
# agent_policygradient.load_state_dict(torch.load("../input/hw12tmp/agent_policygradient.pt"))

### Training Agent

Now let's start to train our agent.
By taking all the interactions between the agent and the environment as training data, the policy network can learn from all these attempts.

In [None]:
%%script false --no-raise-error

agent_policygradient.train()
best_avg_total_reward = -np.inf
best_test_total_reward = -np.inf
avg_total_rewards, avg_final_rewards = [], []
total_rewards, final_rewards, total_loss = deque(maxlen=100), deque(maxlen=100), deque(maxlen=100)
progress_bar = tqdm(range(epochs))
for epoch in progress_bar:
    log_probs_epoch, rewards_epoch = [], []
    for episode in range(episodes):  # Don't infinite loop while learning
        states, actions, log_probs, rewards, rewards_acc, next_states, dones = play_one_episode(env, agent_policygradient, gamma, max_ep_len)
        
        log_probs_epoch.extend(log_probs)
        rewards_epoch.extend(rewards_acc)
        
        total_rewards.append(np.sum(rewards))
        final_rewards.append(rewards[-1])
    
    loss = agent_policygradient.learn(log_probs_epoch, rewards_epoch)
    total_loss.append(loss)
    
    avg_total_reward = np.mean(total_rewards)
    avg_final_reward = np.mean(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    
    if best_avg_total_reward < avg_total_reward:
#         torch.save(agent_policygradient.state_dict(), "agent_policygradient.pt")
        best_avg_total_reward = avg_total_reward
    
    test_total_reward = test_agent(agent_policygradient, env)
    if best_test_total_reward < test_total_reward:
        torch.save(agent_policygradient.state_dict(), "agent_policygradient.pt")
        best_test_total_reward = test_total_reward
        print(f"{epoch:4d}: best test {best_test_total_reward:.4f}, avg:{avg_total_reward:.4f}, best avg:{best_avg_total_reward:.4f}")
    
    progress_bar.set_postfix(Total=avg_total_reward, 
                             Total_best=best_avg_total_reward,
                             Test=test_total_reward,
                             Test_best=best_test_total_reward,
                             Final=avg_final_reward, 
                             loss=np.mean(total_loss))

### Training Result
During the training process, we recorded `avg_total_reward`, which represents the average total reward of episodes before updating the policy network.

Theoretically, if the agent becomes better, the `avg_total_reward` will increase.
The visualization of the training process is shown below:  

In addition, `avg_final_reward` represents the average final reward of the episodes. Specifically, the final reward is the last reward received in an episode, which indicates whether the craft landed successfully.

In [None]:
%%script false --no-raise-error

plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()

plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()

## PPO (Proximal Policy Optimization)
* https://huggingface.co/ThomasSimonini/ppo-LunarLander-v2
* https://github.com/nikhilbarhate99/PPO-PyTorch
* https://hackmd.io/@shaoeChen/Bywb8YLKS/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FSyez2AmFr

### Reference model to confirm that PPO can work

In [None]:
%%script false --no-raise-error
!pip install stable-baselines3

In [None]:
%%script false --no-raise-error
from stable_baselines3 import PPO


model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100e3)
        
obs = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))

done = False
rewards = []
while not done:
    action, _states = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    
    rewards.append(reward)
    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
print(np.mean(rewards))

### Implementation

In [None]:
class PPOAgent(nn.Module):
    def __init__(self, actor, critic, optimizer_actor, optimizer_critic, gamma, train_epochs=5, clip=0.2):
        super().__init__()
        self.actor = actor
        self.critic = critic
        self.optimizer_actor = optimizer_actor
        self.optimizer_critic = optimizer_critic
        self.gamma = gamma
        self.train_epochs = train_epochs
        self.clip = clip
        
    def forward(self, x):
        return self.actor(x)
    
    def action(self, state):
        action_prob = self(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        
        return action.cpu().numpy(), log_prob
    
    def evaluate_action(self, state, action):
        action_prob = self(state)
        action_dist = Categorical(action_prob)

        log_prob = action_dist.log_prob(action.squeeze()).unsqueeze(1)
        entropy = action_dist.entropy().unsqueeze(1)
        
        return log_prob, entropy
    
    def learn(self, states, actions, next_states, log_probs, rewards, rewards_acc):
        # use FloatTensor instead of torch.from_numpy to avoid RuntimeError: Found dtype Double but expected Float
        states = FloatTensor(states)                      # (batch, 8)
        actions = torch.LongTensor(actions)               # (batch, 1)
        next_states = FloatTensor(next_states)            # (batch, 8)
        old_log_probs = FloatTensor(log_probs).detach()   # (batch, 1)
        rewards = FloatTensor(rewards)                    # (batch, 1)
        rewards_acc = FloatTensor(rewards_acc)            # (batch, 1)        
        
        loss_actors = []
        loss_critics = []
        loss_entropys = []
        for _ in range(self.train_epochs): 
            # critic ========================================================
            V = self.critic(states)                       # (batch, 1)
#             V_next = self.critic(next_states)             # (batch, 1)
            # A = r + gamma * V_s_t+1 - V_s_t
#             A = rewards + self.gamma * V_next.detach() - V.detach() 
            A = rewards_acc - V
        
            loss_critic = A.pow(2).mean()
            self.optimizer_critic.zero_grad()
            loss_critic.backward()
            self.optimizer_critic.step()
            
            loss_critics.append(loss_critic.item())
        
        V = self.critic(states)
        A = rewards_acc - V
        for _ in range(self.train_epochs): 
            # actor ========================================================      
            new_log_probs, entropies = self.evaluate_action(states, actions) # same action to calculate right ratios
            # (batch, 1)   (batch, 1)
            
            # importance sampling
            # Calculate the ratio pi_theta(a_t | s_t) / pi_theta_k(a_t | s_t)
            ratios = torch.exp(new_log_probs - old_log_probs)                            # (batch, 1)
            
            # losses.
            loss1 = ratios * A.detach()                                                  # (batch, 1)
            loss_limit = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A.detach()  # (batch, 1)
            # larger entropies more exploration
            loss_actor = (-torch.min(loss1, loss_limit)) - 0.01 * entropies              # (batch, 1)
            loss_actor = loss_actor.mean()
            self.optimizer_actor.zero_grad()
            loss_actor.backward()
            self.optimizer_actor.step()
            
            loss_actors.append(loss_actor.item())
            loss_entropys.append(entropies.mean().item())
        
        return loss_actors, loss_critics, loss_entropys
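The clipped objective used in the actor update can be traced on single numbers. The following is a scalar sketch of `torch.min(ratios * A, clamp(ratios) * A)` with the default `clip=0.2`; the function name `clipped_surrogate` is hypothetical, and the entropy bonus is left out to keep the example short.

```python
import math

def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip=0.2):
    """PPO clipped objective for one sample (the quantity to be maximized)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage: the more pessimistic value 0.8 * A is kept.
pessimistic = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
print(capped, pessimistic)
```

In both cases the policy has already moved more than 20% away from the one that collected the data, so the objective stops rewarding further movement in that direction. This is what makes it reasonable to reuse the same batch for several update steps.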

In [None]:
epochs = 2000
episodes = 3
max_ep_len = 300
gamma = 0.99

In [None]:
actor = Actor(8, 4)
critic = Critic(8, 1)
optimizer_actor = optim.Adam(actor.parameters(), lr=0.0003)
optimizer_critic = optim.Adam(critic.parameters(), lr=0.001)
agent_PPO = PPOAgent(actor, critic, optimizer_actor, optimizer_critic, gamma, train_epochs=30)
# agent_PPO.load_state_dict(torch.load("../input/hw12tmp/agent_PPO.pt"))

In [None]:
%%script false --no-raise-error

agent_PPO.train()
best_avg_total_reward = -np.inf
best_test_total_reward = -np.inf
avg_total_rewards, avg_final_rewards = [], []
total_rewards, final_rewards = deque(maxlen=100), deque(maxlen=100)
total_loss_actors, total_loss_critics, total_loss_entropys = deque(maxlen=100), deque(maxlen=100), deque(maxlen=100)
progress_bar = tqdm(range(epochs))
for epoch in progress_bar:
    states_epoch, actions_epoch, log_probs_epoch, rewards_epoch, rewards_acc_epoch, next_states_epoch = [], [], [], [], [], []
    for episode in range(episodes):  # Don't infinite loop while learning
        states, actions, log_probs, rewards, rewards_acc, next_states, dones = play_one_episode(env, agent_PPO, gamma, max_ep_len)
        
        states_epoch.extend(states)
        actions_epoch.extend(actions)
        log_probs_epoch.extend(log_probs)
        rewards_epoch.extend(rewards)
        rewards_acc_epoch.extend(rewards_acc)
        next_states_epoch.extend(next_states)
        
        total_rewards.append(np.sum(rewards))
        final_rewards.append(rewards[-1])
    
    loss_actors, loss_critics, loss_entropys = agent_PPO.learn(states_epoch, actions_epoch, next_states_epoch, log_probs_epoch, rewards_epoch, rewards_acc_epoch)
    
    avg_total_reward = np.mean(total_rewards)
    avg_final_reward = np.mean(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    total_loss_actors.extend(loss_actors)
    total_loss_critics.extend(loss_critics)
    total_loss_entropys.extend(loss_entropys)
    
    if best_avg_total_reward < avg_total_reward:
#         torch.save(agent_PPO.state_dict(), "agent_PPO.pt")
        best_avg_total_reward = avg_total_reward
      
    test_total_reward = test_agent(agent_PPO, env)
    if best_test_total_reward < test_total_reward:
        torch.save(agent_PPO.state_dict(), "agent_PPO.pt")
        best_test_total_reward = test_total_reward
        print(f"{epoch:4d}: best test {best_test_total_reward:.4f}, avg:{avg_total_reward:.4f}, best avg:{best_avg_total_reward:.4f}")
    
    progress_bar.set_postfix(Total=avg_total_reward, 
                             Total_best=best_avg_total_reward,
                             Test=test_total_reward,
                             Test_best=best_test_total_reward,
                             Final=avg_final_reward, 
                             loss_actors=np.mean(total_loss_actors),
                             loss_critics=np.mean(total_loss_critics),
                             loss_entropys=np.mean(total_loss_entropys))

In [None]:
%%script false --no-raise-error

plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()

plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()

## DQN
* https://github.com/higgsfield/RL-Adventure
* https://goodboychan.github.io/python/reinforcement_learning/pytorch/udacity/2021/05/07/DQN-LunarLander.html
* https://github.com/Curt-Park/rainbow-is-all-you-need

### Replay Buffer

In [None]:
class ReplayBuffer(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):            
        self.buffer.append((state, action, [reward], next_state, [done]))
    
    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        assert np.shape(state) == (batch_size, 8)
        assert np.shape(action) == (batch_size, 1)
        assert np.shape(reward) == (batch_size, 1)
        assert np.shape(next_state) == (batch_size, 8)
        assert np.shape(done) == (batch_size, 1)
        
        return state, action, reward, next_state, done
    
    def __len__(self):
        return len(self.buffer)
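The buffer relies on two standard-library behaviours: a `deque` with `maxlen` silently evicts its oldest entry when full, and `zip(*batch)` regroups sampled transitions into per-field tuples. Here is a minimal sketch with toy transitions; the capacity of 3 is chosen only so that eviction is visible.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity to make eviction visible
for step in range(5):
    # (state, action, reward, next_state, done) with toy values
    buffer.append((step, 0, 1.0, step + 1, False))

oldest_state = buffer[0][0]  # transitions 0 and 1 have been evicted
batch = random.Random(0).sample(list(buffer), 2)
states, actions, rewards, next_states, dones = zip(*batch)
print(len(buffer), oldest_state, states)
```

Sampling uniformly from the buffer breaks the correlation between consecutive steps of one episode, at the cost of training on data produced by slightly older versions of the network.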

In [None]:
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, std_init=0.4):
        super(NoisyLinear, self).__init__()
        
        self.in_features  = in_features
        self.out_features = out_features
        self.std_init     = std_init
        
        self.weight_mu    = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.register_buffer('weight_epsilon', torch.FloatTensor(out_features, in_features))
        
        self.bias_mu    = nn.Parameter(torch.FloatTensor(out_features))
        self.bias_sigma = nn.Parameter(torch.FloatTensor(out_features))
        self.register_buffer('bias_epsilon', torch.FloatTensor(out_features))
        
        self.reset_parameters()
        self.reset_noise()
    
    def forward(self, x):
        if self.training: 
            weight = self.weight_mu + self.weight_sigma  * self.weight_epsilon
            bias   = self.bias_mu   + self.bias_sigma * self.bias_epsilon
        else:
            weight = self.weight_mu
            bias   = self.bias_mu
        
        return F.linear(x, weight, bias)
    
    def reset_parameters(self):
        mu_range = 1 / math.sqrt(self.weight_mu.size(1))
        
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.weight_sigma.size(1))) # 1/sqrt(in_features)
        
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.bias_sigma.size(0)))     # 1/sqrt(out_features)
    
    def reset_noise(self):
        epsilon_in  = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        
        self.weight_epsilon.copy_(epsilon_out.outer(epsilon_in))
        self.bias_epsilon.copy_(self._scale_noise(self.out_features))
    
    def _scale_noise(self, size):
        x = torch.randn(size)
        x = x.sign().mul(x.abs().sqrt())
        return x
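`reset_noise` builds the weight noise from two short vectors instead of drawing one random number per weight. The sketch below shows the same construction in pure Python: each vector is passed through f(x) = sign(x) * sqrt(|x|), and the outer product gives the noise matrix. The function names are hypothetical and `random.gauss` stands in for `torch.randn`.

```python
import math
import random

def scale_noise(size, rng):
    """Apply f(x) = sign(x) * sqrt(|x|) to standard normal samples."""
    out = []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        out.append(math.copysign(math.sqrt(abs(x)), x))
    return out

def factorized_noise(in_features, out_features, rng):
    """Build an (out_features x in_features) noise matrix from two vectors."""
    eps_in = scale_noise(in_features, rng)
    eps_out = scale_noise(out_features, rng)
    weight_eps = [[o * i for i in eps_in] for o in eps_out]  # outer product
    return weight_eps, eps_in, eps_out

weight_eps, eps_in, eps_out = factorized_noise(3, 2, random.Random(0))
print(len(weight_eps), len(weight_eps[0]))  # 2 3
```

For a layer with 64 inputs and 64 outputs this needs 128 random draws for the weights rather than 4096, which is the main reason for factorizing the noise.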

In [None]:
class QNet(nn.Module):

    def __init__(self, in_dim, out_dim):
        super().__init__()
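        # Dueling-DQN. Note that the names are swapped relative to the usual convention:
        # `adv` (one output) plays the role of the state value V(s), and
        # `value` (out_dim outputs) plays the role of the advantages A(s, a).
        # forward() therefore computes Q = V + (A - mean(A)).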
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
        )
        self.adv = nn.Sequential(
            nn.Linear(64, 1),
        )
        self.value = nn.Sequential(
            nn.Linear(64, out_dim),
        )        

    def forward(self, state):
        x = self.net(state)
        adv = self.adv(x)
        value = self.value(x)
        value = value - value.mean(dim=1).unsqueeze(1)
        return value + adv
    
class QNetNoise(nn.Module):

    def __init__(self, in_dim, out_dim):
        super().__init__()        
        # NoisyNet & Dueling-DQN
        self.linear = nn.Linear(in_dim, 64)
        self.noise_layer = NoisyLinear(64, 64)
        self.noise_layer_adv = NoisyLinear(64, 1)
        self.noise_layer_value = NoisyLinear(64, out_dim)
        

    def forward(self, state):
        x = F.relu(self.linear(state))
        x = F.relu(self.noise_layer(x))
        
        adv = self.noise_layer_adv(x)
        value = self.noise_layer_value(x)
        value = value - value.mean(dim=1).unsqueeze(1)
        return value + adv
    
    def reset_noise(self):
        self.noise_layer.reset_noise()
        self.noise_layer_adv.reset_noise()
        self.noise_layer_value.reset_noise()
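Both networks above combine a one-output head with a per-action head whose mean is subtracted, which is the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A). In the code the one-output head happens to be called `adv` and the per-action head `value`; the sketch below uses the conventional names instead, and the function `dueling_q` is hypothetical.

```python
def dueling_q(state_value, advantages):
    """Combine a scalar V(s) and per-action A(s, a) into Q(s, a) = V + A - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

q_values = dueling_q(state_value=2.0, advantages=[1.0, 0.0, -1.0, 4.0])
print(q_values)  # [2.0, 1.0, 0.0, 5.0]
```

Subtracting the mean pins the average of the Q-values to V(s), so the two heads cannot trade a constant offset back and forth.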

In [None]:
class DQNAgent(nn.Module):
    def __init__(self, qNet_eval, qNet_target, optimizer_qNet, gamma, tau, epsilon=0.01, train_epochs=5):
        super().__init__()
        self.qNet_eval = qNet_eval
        self.optimizer_qNet_eval = optimizer_qNet
        self.qNet_target = qNet_target
        self.qNet_target.load_state_dict(self.qNet_eval.state_dict())
        self.gamma = gamma
        self.tau = tau
        self.epsilon = epsilon
        self.train_epochs = train_epochs
        
    def forward(self, x):
        return self.qNet_eval(x)
    
    def action(self, state):
#         # Epsilon-greedy action selection
#         # use simple epsilon to do better work
#         if random.random() > self.epsilon:
#             action_prob = self.qNet_eval(torch.FloatTensor(state))
#             log_prob, action = torch.max(action_prob, dim=1)

#             return action.cpu().numpy(), log_prob
#         else:
#             return np.array([random.choice(np.arange(4))]), None

        # NoisyNet: no epsilon greedy action selection
        action_prob = self.qNet_eval(torch.FloatTensor(state))
        log_prob, action = torch.max(action_prob, dim=1)

        return action.cpu().numpy(), log_prob
    
    def learn(self, replay_buffer):
        state, action, reward, next_state, done = replay_buffer.sample(batch_size)
        
        # use FloatTensor instead of torch.from_numpy to avoid RuntimeError: Found dtype Double but expected Float
        states      = FloatTensor(state)
        next_states = FloatTensor(next_state)
        actions     = torch.LongTensor(action)
        rewards     = FloatTensor(reward)
        dones       = torch.LongTensor(done)
        
        loss_qNet_evals = []
        for _ in range(self.train_epochs): 
            Q = self.qNet_eval(states)                     # (batch, 4)
            Q_next = self.qNet_eval(next_states)           # (batch, 4)
            Q_next_target = self.qNet_target(next_states)  # (batch, 4)
            
            # G_t   = r + gamma * v(s_{t+1})  if state != Terminal
            #       = r                       otherwise
            # Double-DQN
            # gather along dim=1, using the action as the index
            # ref:https://zhuanlan.zhihu.com/p/352877584
            Q = Q.gather(dim=1, index=actions)                               # (batch, 1)
            actions_next = torch.argmax(Q_next, dim=1, keepdim=True)         # (batch, 1)
            Q_next_target = Q_next_target.gather(dim=1, index=actions_next)  # (batch, 1)
            Q_expect = rewards + self.gamma * Q_next_target * (1 - dones)    # (batch, 1)
            
#             loss_qNet_eval = F.mse_loss(Q, Q_expect.detach()) # TD Error
            loss_qNet_eval = F.smooth_l1_loss(Q, Q_expect.detach()) # TD Error
            self.optimizer_qNet_eval.zero_grad()
            loss_qNet_eval.backward()
            self.optimizer_qNet_eval.step()
            
            loss_qNet_evals.append(loss_qNet_eval.item())
        
        self.qNet_eval.reset_noise()
        self.qNet_target.reset_noise()
#         self.qNet_target.load_state_dict(self.qNet_eval.state_dict())
        self.soft_update(self.qNet_eval, self.qNet_target, self.tau)
        
        return loss_qNet_evals
    
    def soft_update(self, eval_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_eval + (1 - τ)*θ_target

        Params
        ======
            eval_model   (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter 
        """
        for target_param, eval_param in zip(target_model.parameters(), eval_model.parameters()):
            target_param.data.copy_(tau*eval_param.data + (1.0-tau)*target_param.data)
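The two update rules inside `learn` can be followed on plain numbers. The sketch below restates the Double-DQN target (the online network picks the next action, the target network scores it) and the soft update of the target parameters. The function names and the toy Q-values are assumptions for illustration only.

```python
def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Pick the next action with the online net, score it with the target net."""
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    bootstrap = 0.0 if done else gamma * q_next_target[best_action]
    return reward + bootstrap

def soft_update(target_params, eval_params, tau):
    """theta_target <- tau * theta_eval + (1 - tau) * theta_target, element-wise."""
    return [tau * e + (1.0 - tau) * t for t, e in zip(target_params, eval_params)]

target = double_dqn_target(
    reward=1.0, done=False,
    q_next_online=[0.1, 0.9, 0.3, 0.2],  # online net prefers action 1
    q_next_target=[5.0, 2.0, 7.0, 1.0],  # target net's value for action 1 is 2.0
    gamma=0.5,
)
terminal = double_dqn_target(1.0, True, [0.1, 0.9, 0.3, 0.2], [5.0, 2.0, 7.0, 1.0])
blended = soft_update(target_params=[0.0, 10.0], eval_params=[1.0, 0.0], tau=0.1)
print(target, terminal, blended)
```

Note that the target uses 2.0 (the target network's value for the action chosen by the online network) rather than 7.0, the target network's own maximum. Decoupling selection from evaluation in this way is what Double-DQN uses to reduce over-estimation.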

In [None]:
epochs = 2000
episodes = 5
max_ep_len = 300
gamma = 0.99
batch_size = 64
TAU = 1e-3              
learning_rate = 0.0001 # 5e-4   
replay_size = 10000 #1000

In [None]:
replay_buffer = ReplayBuffer(replay_size)

qNet_eval = QNetNoise(8, 4)
qNet_target = QNetNoise(8, 4)
optimizer_qNet = optim.Adam(qNet_eval.parameters(), lr=learning_rate)
agent_DQN = DQNAgent(qNet_eval, qNet_target, optimizer_qNet, gamma, TAU, train_epochs=1)
# agent_DQN.load_state_dict(torch.load("../input/hw12tmp/agent_DQN.pt"))

In [None]:
%%script false --no-raise-error

agent_DQN.train()
avg_total_rewards, avg_final_rewards = [], []
best_avg_total_reward = -np.inf
best_test_total_reward = -np.inf
total_rewards, final_rewards, total_loss = deque(maxlen=100), deque(maxlen=100), deque(maxlen=100)
progress_bar = tqdm(range(epochs))
for epoch in progress_bar:
    for episode in range(episodes):  # Don't infinite loop while learning
        state = env.reset()
        total_reward = 0
        for i in range(max_ep_len):
            # take action
            action, log_prob = agent_DQN.action(np.expand_dims(state, axis=0))
            # interact
            next_state, reward, done, _ = env.step(action.squeeze())

            replay_buffer.push(state, action, reward, next_state, done)
            
            # Important: updating inside the episode (every 3 steps) trains much better;
            # when the update is done only in the outer loop, the loss does not converge.
            if i%3 == 0 and len(replay_buffer) > batch_size:
                loss = agent_DQN.learn(replay_buffer)
                total_loss.extend(loss)

            # add the reward before checking `done`, so the terminal reward is counted
            total_reward += reward
            state = next_state
            if done:
                break

        total_rewards.append(total_reward)
        final_rewards.append(reward)
        
    avg_total_reward = np.mean(total_rewards)
    avg_final_reward = np.mean(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    
    if best_avg_total_reward < avg_total_reward:
#         torch.save(agent_DQN.state_dict(), "agent_DQN.pt")
        best_avg_total_reward = avg_total_reward
    
    test_total_reward = test_agent(agent_DQN, env)
    if best_test_total_reward < test_total_reward:
        torch.save(agent_DQN.state_dict(), "agent_DQN.pt")
        best_test_total_reward = test_total_reward
        print(f"{epoch:4d}: best test {best_test_total_reward:.4f}, avg:{avg_total_reward:.4f}, best avg:{best_avg_total_reward:.4f}")
    
    progress_bar.set_postfix(Total=avg_total_reward, 
                             Total_best=best_avg_total_reward,
                             Test=test_total_reward,
                             Test_best=best_test_total_reward,
                             Final=avg_final_reward, 
                             loss=np.mean(total_loss))

In [None]:
%%script false --no-raise-error

plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()

plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()

## Advantage Actor-Critic

In [None]:
class A2CAgent(nn.Module):
    def __init__(self, actor, critic, optimizer_actor, optimizer_critic, gamma, train_actor_epochs, train_critic_epochs):
        super().__init__()
        self.actor = actor
        self.critic = critic
        self.optimizer_actor = optimizer_actor
        self.optimizer_critic = optimizer_critic
        self.gamma = gamma
        self.train_actor_epochs = train_actor_epochs
        self.train_critic_epochs = train_critic_epochs
    
    def action(self, state):
        action_prob = self.actor(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        
        return action.cpu().numpy(), log_prob
    
    def evaluate_action(self, state, action):
        action_prob = self.actor(state)
        action_dist = Categorical(action_prob)

        log_prob = action_dist.log_prob(action.squeeze()).unsqueeze(1)
        entropy = action_dist.entropy().unsqueeze(1)
        
        return log_prob, entropy
    
    def learn(self, states, actions, next_states, rewards, rewards_acc, dones):
        # use FloatTensor instead of torch.from_numpy to avoid "RuntimeError: Found dtype Double but expected Float"
        states = FloatTensor(states)                                      # (batch, 8)
        actions = torch.LongTensor(actions)                               # (batch, 1)
        next_states = FloatTensor(next_states)                            # (batch, 8)
        rewards = FloatTensor(rewards)                                    # (batch, 1)
        rewards_acc = FloatTensor(rewards_acc)                            # (batch, 1)        
        dones = torch.LongTensor(dones)                                   # (batch, 1)        
        
        loss_actors = []
        loss_critics = []
        loss_entropys = []
        
        for _ in range(self.train_critic_epochs): 
            # critic ===============================================================
            V = self.critic(states)                  # (batch, 1)
            # V_next = self.critic(next_states)        # (batch, 1)
            # TD error alternative: A = r + gamma * V(s_{t+1}) - V(s_t)
            # A = rewards + self.gamma * V_next * (1-dones) - V
            A = rewards_acc - V                    # MC Error

            loss_critic = A.pow(2).mean()
            self.optimizer_critic.zero_grad()
            loss_critic.backward()
            self.optimizer_critic.step()
            
            loss_critics.append(loss_critic.item())
        
        V = self.critic(states)                  # (batch, 1)
        # V_next = self.critic(next_states)        # (batch, 1)
        # TD error alternative: A = r + gamma * V(s_{t+1}) - V(s_t)
        # A = rewards + self.gamma * V_next * (1-dones) - V
        A = rewards_acc - V                    # MC Error
        for _ in range(self.train_actor_epochs): 
            # actor ===============================================================
            log_probs, entropies = self.evaluate_action(states, actions) # same action to calculate right ratios
            #(batch, 1) (batch, 1)

            # losses.
            loss = -A.detach() * log_probs           # (batch, 1) 
            # larger entropies more exploration
            loss_actor = loss - 0.01 * entropies     # (batch, 1)
            loss_actor = loss_actor.mean()
            self.optimizer_actor.zero_grad()
            loss_actor.backward()
            self.optimizer_actor.step()

            loss_actors.append(loss_actor.item())
            loss_entropys.append(entropies.mean().item())

        # return only after all actor epochs have run
        return loss_actors, loss_critics, loss_entropys
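
The two advantage estimates mentioned in `learn` (the Monte Carlo error that is used, and the TD error that is commented out) can be compared on a toy episode. This is a sketch under stated assumptions: `play_one_episode` is not shown here, so treating `rewards_acc` as the discounted return-to-go is an assumption, and the critic values below are made-up numbers.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


gamma = 0.5                                                # small gamma keeps the arithmetic easy to follow
rewards = np.array([1.0, 0.0, 2.0], dtype=np.float32)
values = np.array([0.5, 0.5, 0.5], dtype=np.float32)       # hypothetical V(s_t)
next_values = np.array([0.5, 0.5, 0.0], dtype=np.float32)  # hypothetical V(s_{t+1})
dones = np.array([0.0, 0.0, 1.0], dtype=np.float32)        # last step ends the episode

returns = discounted_returns(rewards, gamma)                       # assumed meaning of rewards_acc
mc_advantage = returns - values                                    # A = G_t - V(s_t)
td_advantage = rewards + gamma * next_values * (1 - dones) - values  # A = r + gamma * V(s') - V(s)
```

The Monte Carlo estimate uses the full observed return and does not rely on the critic for bootstrapping, while the TD estimate looks only one step ahead and relies on `V(s_{t+1})`.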

In [None]:
epochs = 2000
episodes = 5
max_ep_len = 300
gamma = 0.99

In [None]:
actor = Actor(8, 4)
critic = Critic(8, 1)
optimizer_actor = optim.Adam(actor.parameters(), lr=0.0003)
optimizer_critic = optim.Adam(critic.parameters(), lr=0.001)
agent_A2C = A2CAgent(actor, critic, optimizer_actor, optimizer_critic, gamma, 
                     train_actor_epochs=3, 
                     train_critic_epochs=30)
# agent_A2C.load_state_dict(torch.load("../input/hw12tmp/agent_A2C.pt"))

In [None]:
%%script false --no-raise-error

agent_A2C.train()
best_avg_total_reward = -np.inf
best_test_total_reward = -np.inf
avg_total_rewards, avg_final_rewards = [], []
total_rewards, final_rewards = deque(maxlen=100), deque(maxlen=100)
total_loss_actors, total_loss_critics, total_loss_entropys = deque(maxlen=100), deque(maxlen=100), deque(maxlen=100)
progress_bar = tqdm(range(epochs))
for epoch in progress_bar:
    states_epoch, actions_epoch, log_probs_epoch, rewards_epoch, rewards_acc_epoch, next_states_epoch, dones_epoch = [], [], [], [], [], [], []
    for episode in range(episodes):  # collect a fixed number of episodes per epoch
        states, actions, log_probs, rewards, rewards_acc, next_states, dones = play_one_episode(env, agent_A2C, gamma, max_ep_len)
        
        states_epoch.extend(states)
        actions_epoch.extend(actions)
        log_probs_epoch.extend(log_probs)
        rewards_epoch.extend(rewards)
        rewards_acc_epoch.extend(rewards_acc)
        next_states_epoch.extend(next_states)
        dones_epoch.extend(dones)
        
        total_rewards.append(np.sum(rewards))
        final_rewards.append(rewards[-1])
    
    loss_actors, loss_critics, loss_entropys = agent_A2C.learn(states_epoch, 
                                                               actions_epoch, 
                                                               next_states_epoch, 
                                                               rewards_epoch, 
                                                               rewards_acc_epoch,
                                                               dones_epoch)
    
    avg_total_reward = np.mean(total_rewards)
    avg_final_reward = np.mean(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    total_loss_actors.extend(loss_actors)
    total_loss_critics.extend(loss_critics)
    total_loss_entropys.extend(loss_entropys)
    
    if best_avg_total_reward < avg_total_reward:
#         torch.save(agent_A2C.state_dict(), "agent_A2C.pt")
        best_avg_total_reward = avg_total_reward
    
    test_total_reward = test_agent(agent_A2C, env)
    if best_test_total_reward < test_total_reward:
        torch.save(agent_A2C.state_dict(), "agent_A2C.pt")
        best_test_total_reward = test_total_reward
        print(f"{epoch:4d}: best test {best_test_total_reward:.4f}, avg:{avg_total_reward:.4f}, best avg:{best_avg_total_reward:.4f}")
    
    progress_bar.set_postfix(Total=avg_total_reward, 
                             Total_best=best_avg_total_reward,
                             Test=test_total_reward,
                             Test_best=best_test_total_reward,
                             Final=avg_final_reward, 
                             loss_actors=np.mean(total_loss_actors),
                             loss_critics=np.mean(total_loss_critics),
                             loss_entropys=np.mean(total_loss_entropys))

In [None]:
%%script false --no-raise-error

plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()

plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()

## Testing
The test result is the average total reward over 5 test episodes.

In [None]:
# %%script false --no-raise-error

# agent = agent_policygradient
# agent.load_state_dict(torch.load("../input/hw12tmp/agent_policygradient.pt"))

# agent = agent_PPO
# agent.load_state_dict(torch.load("./agent_PPO.pt"))

# agent = agent_DQN
# agent.load_state_dict(torch.load("../input/hw12tmp/agent_DQN.pt"))

# agent = agent_A2C
# agent.load_state_dict(torch.load("./agent_A2C.pt"))

In [None]:
%%script false --no-raise-error

fix(env, seed)
agent.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5 # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()

    total_reward = 0

    done = False
    while not done:
        action, _ = agent.action(np.expand_dims(state, axis=0))
        actions.append(action.item())
        state, reward, done, _ = env.step(action.squeeze())

        total_reward += reward

    print(total_reward)
    test_total_reward.append(total_reward)

    action_list.append(actions) # save the result of testing 
    
print("mean", np.mean(test_total_reward))

In [None]:
%%script false --no-raise-error

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

fix(env, seed)
agent.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5 # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()

    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False
    while not done:
        action, _ = agent.action(np.expand_dims(state, axis=0))
        actions.append(action.item())
        state, reward, done, _ = env.step(action.squeeze())

        total_reward += reward

        img.set_data(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)

    print(total_reward)
    test_total_reward.append(total_reward)

    action_list.append(actions) # save the result of testing 

virtual_display.stop()

In [None]:
%%script false --no-raise-error

print(np.mean(test_total_reward))

Action list

In [None]:
%%script false --no-raise-error

print("Action list looks like ", action_list)
print("Action list's shape looks like ", np.shape(action_list))

Analysis of actions taken by agent

In [None]:
%%script false --no-raise-error

distribution = {}
for actions in action_list:
    for action in actions:
        if action not in distribution.keys():
            distribution[action] = 1
        else:
            distribution[action] += 1
print(distribution)
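
Raw counts per action index are easier to read as named fractions. The sketch below assumes the environment is LunarLander-v2 with its four discrete actions; the `ACTION_NAMES` mapping, the helper `summarize_actions`, and the example data are illustrative and not part of the original notebook.

```python
from collections import Counter

# Assumed meaning of the 4 discrete actions in LunarLander-v2.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def summarize_actions(action_list):
    """Map each action that occurs to the fraction of all steps it was chosen."""
    counts = Counter(a for actions in action_list for a in actions)
    total = sum(counts.values())
    return {ACTION_NAMES.get(a, str(a)): counts[a] / total for a in sorted(counts)}


example_action_list = [[0, 2, 2, 1], [2, 3, 2, 2]]  # made-up episodes, not real results
summary = summarize_actions(example_action_list)
for name, fraction in summary.items():
    print(f"{name:>18s}: {fraction:.1%}")
```

An agent that almost never fires the main engine, or that picks a single action nearly all the time, is usually worth inspecting before submitting.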

# Server 
The code below simulates the environment on the judge server. You can use it to check your action list before submitting.

In [None]:
%%script false --no-raise-error

# action_list = np.load(PATH,allow_pickle=True) # The action list you upload
seed = 543 # Do not revise this
fix(env, seed)

agent.eval()  # set network to evaluation mode

test_total_reward = []
if len(action_list) != 5:
    print("Wrong format of file !!!")
    exit(0)
for actions in action_list:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False

    for action in actions:
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

    print(f"Your reward is : {total_reward:.2f}")
    test_total_reward.append(total_reward)
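
The commented-out `np.load(PATH, allow_pickle=True)` implies that the action list is stored as a NumPy file. Below is a minimal round-trip sketch; the file name, the temporary directory, and the example data are assumptions for illustration. Because episodes have different lengths, the list is stored as an object array, which is why `allow_pickle=True` is needed on load.

```python
import os
import tempfile

import numpy as np

# Made-up action list: 5 episodes of different lengths.
action_list = [[0, 2, 2], [1, 3], [2], [0, 0, 1, 2], [3, 3]]

# Hypothetical output location; replace with the path you actually submit.
path = os.path.join(tempfile.mkdtemp(), "Action_List.npy")

# Build a 1-D object array explicitly so ragged episodes are stored as-is.
ragged = np.empty(len(action_list), dtype=object)
for i, actions in enumerate(action_list):
    ragged[i] = list(actions)
np.save(path, ragged)

# Loading an object array requires allow_pickle=True.
loaded = np.load(path, allow_pickle=True)
restored = [list(actions) for actions in loaded]
```

Checking `len(restored) == 5` before uploading mirrors the format check performed in the cell above.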

# Your score

In [None]:
%%script false --no-raise-error

print(f"Your final reward is : {np.mean(test_total_reward):.2f}")

* 4pt Baseline 270
* 3pt Baseline 170
* 2pt Baseline 100
* 1pt Baseline 0

## Reference

Below are some useful resources to help you get a high score.

- [DRL Lecture 1: Policy Gradient (Review)](https://youtu.be/z95ZYgPgXOY)
- [ML Lecture 23-3: Reinforcement Learning (including Q-learning) start at 30:00](https://youtu.be/2-JNBzCq77c?t=1800)
- [Lecture 7: Policy Gradient, David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf)
