<a href="https://colab.research.google.com/github/pabair/rl-course-ss21/blob/main/6_LunarLander_PolicyBased.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lunar Lander with Cross-Entropy Method

In this notebook we look at the lunar lander environment and solve it with the cross-entropy method.

In [1]:
!pip3 install box2d-py



In [2]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import matplotlib.pyplot as plt
%matplotlib inline
from collections import deque

torch.manual_seed(1)
np.random.seed(1)

# Neural Network

We define a simple neural network that generates the action scores based on a given state.

In [3]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(obs_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, n_actions)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Generate Episodes

We generate a batch of episodes and record the visited states, the actions taken, and the total reward of each episode. To select the next action, we use the output of the network: we first pass the scores through a softmax to get probabilities, then sample from this distribution to get the next action to execute.

In [4]:
def generate_batch(env, batch_size, t_max=5000):
    # note: uses the global `net` that is created in the training cell
    activation = nn.Softmax(dim=1)
    batch_actions, batch_states, batch_rewards = [], [], []

    for b in range(batch_size):
        states, actions = [], []
        total_reward = 0
        s = env.reset()
        for t in range(t_max):
            # action scores -> probabilities -> sampled action
            s_v = torch.FloatTensor([s])
            with torch.no_grad():  # no gradients needed while collecting data
                act_probs_v = activation(net(s_v))
            act_probs = act_probs_v.numpy()[0]
            a = np.random.choice(len(act_probs), p=act_probs)

            new_s, r, done, info = env.step(a)

            # record the visited state, the chosen action and the reward
            states.append(s)
            actions.append(a)
            total_reward += r

            s = new_s
            if done:
                # only episodes that end within t_max steps are stored
                batch_actions.append(actions)
                batch_states.append(states)
                batch_rewards.append(total_reward)
                break

    return batch_states, batch_actions, batch_rewards
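
The two-step action selection described above (softmax, then sampling) can be checked in isolation. The following is a minimal NumPy-only sketch, not the notebook's own code: the helper names `softmax` and `sample_action` and the example scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of action scores into probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def sample_action(scores, rng):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(np.asarray(scores, dtype=np.float64))
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(1)
scores = np.array([2.0, 1.0, 0.1, -1.0])  # made-up scores for four actions
probs = softmax(scores)

# Sample many times; the empirical frequencies should approach `probs`.
samples = [sample_action(scores, rng) for _ in range(10_000)]
counts = np.bincount(samples, minlength=len(scores))
print(probs.round(3))
print(counts / counts.sum())
```

Sampling instead of always taking the highest-scoring action keeps some exploration in the generated episodes, which matters early on when the scores are still close to uniform.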

# Training

In the training step, we first use the neural network to generate a batch of episodes and then use the state-action pairs to improve the neural network.

In [5]:
batch_size = 100
session_size = 100
percentile = 80
hidden_size = 200
completion_score = 100
learning_rate = 0.01

env = gym.make("LunarLander-v2")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

#neural network
net = Net(n_states, hidden_size, n_actions)
#loss function
objective = nn.CrossEntropyLoss()
#optimisation function
optimizer = optim.Adam(params=net.parameters(), lr=learning_rate)

for i in range(session_size):
    #generate new sessions
    states, actions, rewards = generate_batch(env, batch_size, t_max=500)

    # TODO-1: here we need to filter out episodes that are not good


    # train on the states using actions as targets
    for s_i in range(len(states)):
        optimizer.zero_grad()
        tensor_states = torch.FloatTensor(states[s_i])
        tensor_actions = torch.LongTensor(actions[s_i])
        action_scores_v = net(tensor_states)
        loss_v = objective(action_scores_v, tensor_actions)
        loss_v.backward()
        optimizer.step()

    #show results
    mean_reward = np.mean(rewards)
    print("%d: loss=%.3f, reward_mean=%.1f" % (
            i, loss_v.item(), mean_reward))
    
    #check if the environment is solved
    if mean_reward > completion_score:
        print("Environment has been successfully completed!")
        break

0: loss=1.386, reward_mean=-177.3
1: loss=1.381, reward_mean=-175.4
2: loss=1.326, reward_mean=-209.5
3: loss=1.375, reward_mean=-173.4
4: loss=1.394, reward_mean=-181.5
5: loss=1.395, reward_mean=-193.7
6: loss=1.372, reward_mean=-144.9
7: loss=1.375, reward_mean=-168.2
8: loss=1.351, reward_mean=-168.4
9: loss=1.236, reward_mean=-178.6
10: loss=1.353, reward_mean=-185.9
11: loss=1.341, reward_mean=-192.2
12: loss=1.399, reward_mean=-190.1
13: loss=1.367, reward_mean=-169.7
14: loss=1.357, reward_mean=-198.1
15: loss=1.354, reward_mean=-172.0
16: loss=0.800, reward_mean=-195.9
17: loss=1.392, reward_mean=-172.7
18: loss=1.345, reward_mean=-169.3
19: loss=1.389, reward_mean=-178.3
20: loss=1.358, reward_mean=-195.7
21: loss=1.365, reward_mean=-170.6
22: loss=1.330, reward_mean=-192.1
23: loss=1.358, reward_mean=-183.1
24: loss=1.379, reward_mean=-177.9
25: loss=1.378, reward_mean=-183.5
26: loss=1.387, reward_mean=-189.3
27: loss=1.357, reward_mean=-190.0
28: loss=1.393, reward_mean=-1
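
The filtering step marked as TODO-1 in the training cell is left open in the notebook. One possible way to approach it, assuming "good" means an episode whose total reward reaches the `percentile`-th percentile of the batch, is sketched below. The function name `filter_batch` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def filter_batch(batch_states, batch_actions, batch_rewards, percentile):
    """Keep only episodes whose total reward reaches the given percentile."""
    reward_threshold = np.percentile(batch_rewards, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(batch_states, batch_actions, batch_rewards):
        if reward >= reward_threshold:
            elite_states.append(states)
            elite_actions.append(actions)
    return elite_states, elite_actions, reward_threshold

# Toy batch of five episodes with a single (state, action) step each.
toy_states = [[[0.0]], [[1.0]], [[2.0]], [[3.0]], [[4.0]]]
toy_actions = [[0], [1], [2], [3], [0]]
toy_rewards = [-200.0, -150.0, -100.0, -50.0, 10.0]

elite_states, elite_actions, threshold = filter_batch(
    toy_states, toy_actions, toy_rewards, percentile=80)
print(threshold, elite_actions)
```

In the training loop this could be called right after `generate_batch`, for example `states, actions, _ = filter_batch(states, actions, rewards, percentile)`, leaving `rewards` untouched so that the printed mean still covers all episodes of the batch.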

# Evaluation

The following code enables Colab to record sessions (not needed when executing the code locally).



In [6]:
!apt-get install -y xvfb x11-utils

!pip install pyvirtualdisplay==0.2.* \
             PyOpenGL==3.1.* \
             PyOpenGL-accelerate==3.1.*

!pip install gym[box2d]==0.17.*

import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
_ = _display.start()


Use the trained model to play and record one episode. The recorded video will be stored in the `video` subfolder on disk.



In [7]:
# TODO-2: Play episode with agent and record it!
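
As a starting point for TODO-2, the episode loop itself can be sketched independently of the recording. The code below is a sketch under assumptions, not the notebook's solution: `play_episode` and `CountdownEnv` are hypothetical names, `policy` is assumed to be any callable that maps a state to action probabilities, and choosing the most probable action (instead of sampling) is a design choice made here. It relies on the older gym API used throughout this notebook, where `step` returns four values.

```python
import numpy as np

def play_episode(env, policy, t_max=1000):
    """Run one episode and return its total reward.

    Assumes the old gym API: reset() returns the state,
    step() returns (state, reward, done, info).
    """
    total_reward = 0.0
    s = env.reset()
    for _ in range(t_max):
        a = int(np.argmax(policy(s)))  # greedy choice (assumption)
        s, r, done, _ = env.step(a)
        total_reward += r
        if done:
            break
    return total_reward

class CountdownEnv:
    """Stand-in environment for a dry run: ends after three steps, reward 1 per step."""
    def reset(self):
        self.steps_left = 3
        return np.zeros(8)

    def step(self, action):
        self.steps_left -= 1
        return np.zeros(8), 1.0, self.steps_left == 0, {}

# Dry run with a fixed, made-up probability vector over four actions.
fixed_policy = lambda s: np.array([0.1, 0.7, 0.1, 0.1])
print(play_episode(CountdownEnv(), fixed_policy))
```

With the gym 0.17 release pinned above, the real environment could be wrapped as `gym.wrappers.Monitor(gym.make("LunarLander-v2"), directory="video", force=True)` before being passed in, and the policy could apply a softmax to the output of `net`. Closing the wrapped environment afterwards should finalize the video file.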