# Imports/tools

In [1]:
!./containers_run.sh

Starting container 1
0f137ef7faacec3cb819da25403630cac50d0147bf4219b7e8fc68668e225ccf


In [43]:
import gym
import ptan
import time
import copy
import numpy as np
import universe
from typing import List, Optional, Tuple
from universe import vectorized
from universe.wrappers.experimental import SoftmaxClickMouse

from PIL import Image
import matplotlib.pylab as plt

%matplotlib inline

In [3]:
DOCKER_IMAGE = "shmuma/miniwob:latest"
ENV_NAME = "wob.mini.ClickDialog-v0"

In [4]:
# Builds the connection endpoints for a set of containers.
# Tweak its arguments if you're not using the standalone installation.
def remotes_url(port_ofs=0, hostname='localhost', count=8):
    hosts = ["%s:%d+%d" % (hostname, 5900 + ofs, 15900 + ofs) for ofs in range(port_ofs, port_ofs+count)]
    return "vnc://" + ",".join(hosts)
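
To make the endpoint format concrete, here is a small sketch that repeats the function above and prints what it returns. The hostname `10.0.0.5` and the port offset are made-up values, used only for illustration.

```python
def remotes_url(port_ofs=0, hostname='localhost', count=8):
    # Each container exposes a VNC port (5900 + offset) and a rewarder port (15900 + offset)
    hosts = ["%s:%d+%d" % (hostname, 5900 + ofs, 15900 + ofs) for ofs in range(port_ofs, port_ofs + count)]
    return "vnc://" + ",".join(hosts)

# Single local container, as used in this notebook
single = remotes_url(count=1)
print(single)

# Two containers on a hypothetical remote host, starting from the third port pair
remote = remotes_url(port_ofs=2, hostname='10.0.0.5', count=2)
print(remote)
```

The first call gives the same URL that `make_env` prints when it is called with `count=1`.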

In [5]:
def make_env(wrapper_func = lambda env: env, count: int = 1, fps: float = 5) -> universe.envs.VNCEnv:
    """
    Builds the vectorized env
    """
    env = gym.make(ENV_NAME)
    env = wrapper_func(env)
    url = remotes_url(count=count)
    print("Remotes URL: %s" % url)

    env.configure(remotes=url, docker_image=DOCKER_IMAGE, fps=fps, vnc_kwargs={
            'encoding': 'tight', 'compress_level': 0,
            'fine_quality_level': 100, 'subsample_level': 0
        })
    return env

In [6]:
def join_env(env: universe.envs.VNCEnv):
    """
    Function performs initial reset of the env and waits for observations to become ready
    """
    obs_n = env.reset()
    while any(map(lambda o: o is None, obs_n)):
        a = [env.action_space.sample() for _ in obs_n]
        obs_n, reward, is_done, info = env.step(a)
    return obs_n

In [7]:
class MiniWoBCropper(vectorized.ObservationWrapper):
    """
    Crops the WoB area and converts the observation into PyTorch (C, H, W) format.
    """
    # Area of interest
    WIDTH = 160
    HEIGHT = 210
    X_OFS = 10
    Y_OFS = 75
    
    def __init__(self, env, keep_text=False):
        super(MiniWoBCropper, self).__init__(env)
        self.keep_text = keep_text
        img_space = gym.spaces.Box(low=0, high=255, shape=(3, self.HEIGHT, self.WIDTH))
        if keep_text:
            self.observation_space = gym.spaces.Tuple(spaces=(img_space, gym.spaces.Space))
        else:
            self.observation_space = img_space

    def _observation(self, observation_n):
        res = []
        for obs in observation_n:
            if obs is None:
                res.append(obs)
                continue
            img = obs['vision'][self.Y_OFS:self.Y_OFS+self.HEIGHT, self.X_OFS:self.X_OFS+self.WIDTH, :]
            img = np.transpose(img, (2, 0, 1))
            if self.keep_text:
                text = " ".join(map(lambda d: d.get('instruction', ''), obs.get('text', [{}])))
                res.append((img, text))
            else:
                res.append(img)
        return res
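
The crop and transpose done in `_observation` can be checked on a dummy frame without connecting to a container. This is only a sketch: the 768x1024 frame size is an assumption standing in for the real `obs['vision']` array.

```python
import numpy as np

# Same crop constants as in MiniWoBCropper
WIDTH, HEIGHT, X_OFS, Y_OFS = 160, 210, 10, 75

# Assumption: a blank 768x1024 frame with 3 channels replaces the real screen capture
frame = np.zeros((768, 1024, 3), dtype=np.uint8)

crop = frame[Y_OFS:Y_OFS + HEIGHT, X_OFS:X_OFS + WIDTH, :]   # (H, W, C) slice of the screen
chw = np.transpose(crop, (2, 0, 1))                          # reorder to (C, H, W) for PyTorch
print(crop.shape, chw.shape)
```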

# Basic DQN baseline

With all those simplifications done, we can implement a basic DQN agent to solve some environments.

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

## Model 

Below is the model we're going to use. Nothing fancy, just two convolution layers followed by a single FC layer that returns Q-values for our 256 click locations.

In [13]:
class Model(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(Model, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 64, 5, stride=5),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_out(input_shape)

        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, n_actions),
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        fx = x.float() / 256
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.fc(conv_out)
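
`_get_conv_out` finds the flattened feature size by pushing a zero tensor through the convolutions. The same number can be worked out by hand. The sketch below assumes the usual output-size formula for a convolution without padding or dilation, `(size - kernel) // stride + 1`; the helper name `conv_out` is made up for this example.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of one spatial dimension for a convolution without padding or dilation."""
    return (size - kernel) // stride + 1

h, w = 210, 160                                   # cropped observation height and width
h, w = conv_out(h, 5, 5), conv_out(w, 5, 5)       # first layer: kernel 5, stride 5
h, w = conv_out(h, 3, 2), conv_out(w, 3, 2)       # second layer: kernel 3, stride 2
features = 64 * h * w                             # 64 output channels, flattened
print(h, w, features)
```

The result should match the `in_features` value PyTorch reports for the linear layer when the model is printed.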

## Batch preparation (practice)

For data gathering we'll use the [`ptan` library](https://github.com/shmuma/ptan/), a very thin RL-specific wrapper around Gym. It implements replay buffers, agent logic, and other functions frequently required for RL training.

Our first warm-up problem is to write a function that converts a batch sampled from the replay buffer into PyTorch tensors suitable for training.

The input to the function is a list of `namedtuple`s with the following fields ([source code](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L155)):
* `state`: state `s` in the trajectory, a 160x210 image stored as a byte array with shape \[3, 210, 160\] (channels, height, width)
* `action`: the action executed, an integer from 0 to 255 indicating the square in our click grid
* `reward`: the immediate reward obtained after executing the action
* `last_state`: the state we got after executing the action. It is `None` if the episode finished after the action.

We'll use 1-step DQN, but Ptan can do N-step calculation for you. You can experiment with the difference.

The output should be a tuple of three tensors containing the batch data:
* `states`: a byte tensor with shape \[batch_size, 3, 210, 160\]
* `actions`: a long tensor with shape \[batch_size\]
* `ref values`: a float tensor with shape \[batch_size\] holding Q-values approximated with the Bellman equation

$Q_{s,a} = r_{s,a} + \gamma \cdot \max_{a'} Q_{s',a'}$
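
To see the equation in numbers, here is a minimal NumPy sketch on a toy batch. All values, including the discount factor, are made up for illustration. When a transition ends the episode there is no next state, so the target reduces to the reward alone.

```python
import numpy as np

GAMMA = 0.9  # toy discount factor, chosen only for this sketch

# Toy batch of three transitions with two possible actions each
rewards = np.array([0.0, 1.0, -1.0], dtype=np.float32)
# Hypothetical Q-values of the next states s' (rows: samples, columns: actions a')
next_q = np.array([[0.5, 2.0],
                   [1.0, 0.0],
                   [3.0, 4.0]], dtype=np.float32)
# True where the episode ended after the action, so there is no next state
done = np.array([False, False, True])

best_next_q = next_q.max(axis=1)      # max over a' of Q(s', a')
best_next_q[done] = 0.0               # a finished episode has no future reward
ref_q = rewards + GAMMA * best_next_q
print(ref_q)
```

The third sample ends its episode, so its reference value is just the reward of -1, even though the toy next-state Q-values are large.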


In [11]:
@torch.no_grad()
def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast], net: nn.Module, gamma: float, device="cpu"):
    states = []
    actions = []
    rewards = []
    done_masks = []    # list of booleans, True if this is the last step in the episode
    last_states = []
    for exp in batch:
        # unpack every experience entry into individual lists
        raise NotImplementedError

    # convert everything to tensors (do not forget to put them into proper devices!)
    # apply network and find best Q values of every action (torch.max() is the right function to use (c: )
    # zero out q-values according to done_masks
    # return the result as tuple of three things: states tensor, actions tensor and 
    # Q-values approximation tensor

### Tests for your solution

There are several test cases you can use to check your solution.

In [11]:
env = make_env(wrapper_func = lambda env: SoftmaxClickMouse(MiniWoBCropper(env)))
join_env(env);

[2019-07-10 11:13:21,662] Making new env: wob.mini.ClickDialog-v0
  result = entry_point.load(False)
[2019-07-10 11:13:21,671] Using SoftmaxClickMouse with action_region=(10, 125, 170, 285), noclick_regions=[]
[2019-07-10 11:13:21,672] SoftmaxClickMouse noclick regions removed 0 of 256 actions
[2019-07-10 11:13:21,673] Writing logs to file: /tmp/universe-8120.log
[2019-07-10 11:13:21,678] Using the golang VNC implementation
[2019-07-10 11:13:21,678] Using VNCSession arguments: {'encoding': 'tight', 'compress_level': 0, 'fine_quality_level': 100, 'subsample_level': 0, 'start_timeout': 7}. (Customize by running "env.configure(vnc_kwargs={...})"
[2019-07-10 11:13:21,698] [0] Connecting to environment: vnc://localhost:5900 password=openai. If desired, you can manually connect a VNC viewer, such as TurboVNC. Most environments provide a convenient in-browser VNC client: http://localhost:15900/viewer/?password=openai


Remotes URL: vnc://localhost:5900+15900


[2019-07-10 11:13:38,037] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0
[2019-07-10 11:13:39,360] [0:localhost:5900] Initial reset complete: episode_id=24


In [14]:
# this code should also help you understand what is expected of your solution
net = Model(env.observation_space.shape, env.action_space.n)
# with eps=1, we don't need parent action selector, as all actions will be random, but that's a corner case!
action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1) 
agent = ptan.agent.DQNAgent(net, action_selector)
net

Model(
  (conv): Sequential(
    (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(5, 5))
    (1): ReLU()
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2))
    (3): ReLU()
  )
  (fc): Sequential(
    (0): Linear(in_features=19200, out_features=256, bias=True)
  )
)

In [15]:
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, vectorized=True)
batch = [e for _, e in zip(range(5), exp_source)]
# add one final state with known reward
batch.append(ptan.experience.ExperienceFirstLast(state=batch[0].state, action=0, reward=10.0, last_state=None))

[2019-07-10 11:14:06,213] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


In [16]:
r = unpack_batch(batch, net, gamma=1.0)

In [17]:
assert isinstance(r, tuple)
assert len(r) == 3

In [18]:
states, actions, next_q = r

In [19]:
assert isinstance(states, torch.Tensor)
assert states.dtype == torch.uint8
assert states.size() == (6, 3, 210, 160)

assert isinstance(actions, torch.Tensor)
assert actions.dtype == torch.long
assert actions.size() == (6, )

assert isinstance(next_q, torch.Tensor)
assert next_q.dtype == torch.float32
assert next_q.size() == (6, )

assert actions[5] == 0
assert next_q[5] == 10.0

## Batch preparation (solution)

Please do not peek inside this, as you'll spoil all the fun :)

In [10]:
@torch.no_grad()
def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast], net: nn.Module, gamma: float, device="cpu"):
    states = []
    actions = []
    rewards = []
    done_masks = []
    last_states = []
    for exp in batch:
        states.append(exp.state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        done_masks.append(exp.last_state is None)
        if exp.last_state is None:
            last_states.append(exp.state)
        else:
            last_states.append(exp.last_state)

    states_v = torch.tensor(states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    last_states_v = torch.tensor(last_states).to(device)
    last_state_q_v = net(last_states_v)
    best_last_q_v = torch.max(last_state_q_v, dim=1)[0]
    best_last_q_v[done_masks] = 0.0
    return states_v, actions_v, best_last_q_v + rewards_v

## Training loop

Now we're ready to implement the training loop.

In [58]:
GAMMA = 0.9
REPLAY_SIZE = 10000
MIN_REPLAY = 100
TGT_SYNC = 100
BATCH_SIZE = 16
LR = 1e-4
DEVICE = "cuda"  
#DEVICE = "cpu"

INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.2
STEPS_EPSILON = 1000
LIMIT_STEPS = 1500
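
The epsilon constants above describe a linear decay. The sketch below evaluates the same expression the training loop uses to update epsilon, for a few step counts. The helper name `epsilon_at` is made up for this example, and it is only meaningful for steps at or after `MIN_REPLAY`, because the loop skips the update until the buffer holds that many samples.

```python
INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.2
STEPS_EPSILON = 1000
MIN_REPLAY = 100

def epsilon_at(step: int) -> float:
    """Epsilon after the update done at the given training step (step >= MIN_REPLAY)."""
    return max(FINAL_EPSILON, INITIAL_EPSILON - (step - MIN_REPLAY) / STEPS_EPSILON)

for step in (100, 300, 700, 900, 1500):
    print(step, round(epsilon_at(step), 3))
```

Epsilon stays at 1.0 until step 100, drops to 0.4 around step 700 and reaches its floor of 0.2 at step 900, so with `LIMIT_STEPS = 1500` the last 600 steps run at the final value.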

In [59]:
env = make_env(wrapper_func = lambda env: SoftmaxClickMouse(MiniWoBCropper(env)))
join_env(env);

[2019-07-10 13:26:03,110] Making new env: wob.mini.ClickDialog-v0
[2019-07-10 13:26:03,115] Using SoftmaxClickMouse with action_region=(10, 125, 170, 285), noclick_regions=[]
[2019-07-10 13:26:03,115] SoftmaxClickMouse noclick regions removed 0 of 256 actions
[2019-07-10 13:26:03,116] Using the golang VNC implementation
[2019-07-10 13:26:03,117] Using VNCSession arguments: {'encoding': 'tight', 'compress_level': 0, 'fine_quality_level': 100, 'subsample_level': 0, 'start_timeout': 7}. (Customize by running "env.configure(vnc_kwargs={...})"
[2019-07-10 13:26:03,137] [0] Connecting to environment: vnc://localhost:5900 password=openai. If desired, you can manually connect a VNC viewer, such as TurboVNC. Most environments provide a convenient in-browser VNC client: http://localhost:15900/viewer/?password=openai


Remotes URL: vnc://localhost:5900+15900


[2019-07-10 13:26:19,536] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0
[2019-07-10 13:26:19,569] [0:localhost:5900] Initial reset complete: episode_id=930


In [60]:
net = Model(env.observation_space.shape, env.action_space.n).to(DEVICE)
tgt_net = ptan.agent.TargetNet(net)

parent_selector = ptan.actions.ArgmaxActionSelector()
action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=INITIAL_EPSILON, selector=parent_selector) 
agent = ptan.agent.DQNAgent(net, action_selector, device=DEVICE)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA, vectorized=True)

buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
optimizer = optim.Adam(net.parameters(), LR)

In [61]:
best_weights = None
best_mean10_reward = None
episode_rewards = []
episode_steps = []
losses = []
steps = 0
last_ts = time.time()
last_steps = 0

while steps < LIMIT_STEPS:
    steps += 1
    buffer.populate(1)
    r = exp_source.pop_rewards_steps()
    if r:
        for rw, st in r:
            episode_rewards.append(rw)
            episode_steps.append(st)
        speed = (steps - last_steps) / (time.time() - last_ts)
        print("%d: Done %d episodes, last 10 means: reward=%.3f, steps=%.3f, speed=%.2f steps/s, eps=%.2f" % (
            steps, len(episode_rewards), np.mean(episode_rewards[-10:]), 
            np.mean(episode_steps[-10:]), speed, action_selector.epsilon
        ))
        last_ts = time.time()
        last_steps = steps
        if np.mean(episode_rewards[-10:]) > 0.9:
            print("You solved the env! Congrats!")
            break
    if len(buffer) < MIN_REPLAY:
        continue
    batch = buffer.sample(BATCH_SIZE)
    state_v, actions_v, ref_q_v = unpack_batch(batch, tgt_net.target_model, gamma=GAMMA, device=DEVICE)
    optimizer.zero_grad()
    q_v = net(state_v)
    q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    loss_v = F.mse_loss(q_v, ref_q_v)
    loss_v.backward()
    optimizer.step()
    losses.append(loss_v.item())
    
    if steps % TGT_SYNC == 0:
        print("%d: nets synced, mean loss for last 10 steps = %.3f" % (
            steps, np.mean(losses[-10:])))
        tgt_net.sync()
    action_selector.epsilon = max(FINAL_EPSILON, INITIAL_EPSILON - (steps-MIN_REPLAY) / STEPS_EPSILON)
    
    if action_selector.epsilon < 0.4:
        m10_reward = np.mean(episode_rewards[-10:])
        if best_mean10_reward is None or m10_reward > best_mean10_reward:
            print("%d: best reward updated: %s -> %.3f, weights saved" % (
                steps, best_mean10_reward, m10_reward
            ))
            best_mean10_reward = m10_reward
            # deep copy is required as state_dict() is just reference to tensors which are being kept on GPU
            best_weights = copy.deepcopy(net.state_dict())

[2019-07-10 13:26:19,902] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


30: Done 1 episodes, last 10 means: reward=-1.000, steps=30.000, speed=4.54 steps/s, eps=1.00
82: Done 2 episodes, last 10 means: reward=-1.000, steps=41.500, speed=4.90 steps/s, eps=1.00
100: nets synced, mean loss for last 10 steps = 0.163
127: Done 3 episodes, last 10 means: reward=-0.650, steps=43.000, speed=4.56 steps/s, eps=0.97
144: Done 4 episodes, last 10 means: reward=-0.328, steps=36.750, speed=3.98 steps/s, eps=0.96
152: Done 5 episodes, last 10 means: reward=-0.096, steps=31.200, speed=3.89 steps/s, eps=0.95
198: Done 6 episodes, last 10 means: reward=-0.246, steps=33.833, speed=4.40 steps/s, eps=0.90
200: nets synced, mean loss for last 10 steps = 0.031
245: Done 7 episodes, last 10 means: reward=-0.354, steps=35.857, speed=4.45 steps/s, eps=0.86
291: Done 8 episodes, last 10 means: reward=-0.435, steps=37.250, speed=4.45 steps/s, eps=0.81
300: nets synced, mean loss for last 10 steps = 0.037
317: Done 9 episodes, last 10 means: reward=-0.334, steps=36.111, speed=4.26 ste

This doesn't look good:
* convergence is bad
* loss is growing

Let's test the best weights.

In [62]:
# testing the best weights
net.load_state_dict(best_weights)
TEST_EPISODES = 10

rewards = []
steps = []
while len(rewards) < TEST_EPISODES:
    cur_reward = 0
    cur_steps = 0
    obs_n = join_env(env)
    while True:
        obs_t = torch.tensor(obs_n).to(DEVICE)
        q_t = net(obs_t)
        action = q_t.max(dim=1).indices[0].item()
        obs_n, reward_n, is_done_n, _ = env.step([action])
        cur_reward += reward_n[0]
        cur_steps += 1
        if is_done_n[0]:
            break
    rewards.append(cur_reward)
    steps.append(cur_steps)
    print("Episode done in %d steps with %.3f reward" % (cur_steps, cur_reward))
print("Mean reward for %d episodes is %.3f, steps %.3f" % (TEST_EPISODES, np.mean(rewards), np.mean(steps)))

[2019-07-10 13:33:49,821] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0
[2019-07-10 13:33:58,236] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 36 steps with -1.000 reward


[2019-07-10 13:34:08,653] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 51 steps with -1.000 reward


[2019-07-10 13:34:19,269] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 52 steps with -1.000 reward


[2019-07-10 13:34:29,686] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 51 steps with -1.000 reward


[2019-07-10 13:34:30,487] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 3 steps with 0.982 reward


[2019-07-10 13:34:40,901] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 51 steps with -1.000 reward


[2019-07-10 13:34:51,518] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 52 steps with -1.000 reward


[2019-07-10 13:35:01,934] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 51 steps with -1.000 reward


[2019-07-10 13:35:12,351] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Episode done in 51 steps with -1.000 reward
Episode done in 52 steps with -1.000 reward
Mean reward for 10 episodes is -0.802, steps 45.000


# Troubleshooting of Baseline DQN

> "You love Linux, probably you're experienced in troubleshooting..."
> RHCE Exam Instructor @ 2005

The code above doesn't converge. Usually, in such situations, the tutorial author tweaks the parameters to make it work, so attendees see only the final, polished version. This gives the false impression that ML is a smooth and fairly obvious process, which is very far from the truth. I don't know about others, but my code doesn't work 90% of the time :).

To learn how to deal with such situations, you're asked to troubleshoot the code above. There are several directions you could explore (of course, you can have your own ideas; this is not a complete list):

* check the training samples. Do they make sense? Is the reward assigned to the correct observations?
* check the training gradients. How large are they? What is the ratio of gradients to weight values?
* are end-of-episode steps handled properly? The final steps of episodes act as an anchor that prevents Q-values from growing infinitely, so it is critical to handle them properly in the Bellman equation.
* explore the Q-values produced by the network during training. Are they growing over time? It is good practice to keep a small set of states (including the final step and the steps before it) and track their Q-values during training.
* how large is the difference between the trained network and the target network used for the next-step Bellman approximation? How does it change over time?
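
As a starting point for the gradient check in the list above, here is a minimal sketch that reports the ratio of gradient norm to weight norm for each parameter. The helper `grad_to_weight_ratios` is hypothetical, not part of `ptan` or PyTorch, and the tiny linear layer is only a stand-in for the real model.

```python
import torch
import torch.nn as nn

def grad_to_weight_ratios(net: nn.Module) -> dict:
    """Return {parameter name: ||grad|| / ||weight||} for every parameter that has a gradient."""
    ratios = {}
    for name, p in net.named_parameters():
        if p.grad is None:
            continue  # parameter did not take part in the last backward pass
        w_norm = p.detach().norm().item()
        g_norm = p.grad.norm().item()
        ratios[name] = g_norm / (w_norm + 1e-12)  # small constant avoids division by zero
    return ratios

# Toy stand-in for the real network and loss
toy_net = nn.Linear(4, 2)
loss = toy_net(torch.ones(3, 4)).pow(2).mean()
loss.backward()

ratios = grad_to_weight_ratios(toy_net)
for name, ratio in ratios.items():
    print("%s: grad/weight = %.4f" % (name, ratio))
```

In the training loop, one option would be to call it on `net` right after `loss_v.backward()` and log the values every few hundred steps.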

A full check might take a long time, especially if you haven't done it before. In the next notebook (03_miniwob_troubleshooting) you will be given detailed troubleshooting steps.

In [None]:
# scratchplace for you to enjoy!

# Do not forget to stop the container

In [28]:
env.close()

In [29]:
!./containers_stop.sh

24f0ebbed615
