As we found during the long "Troubleshooting session", asynchronous observations turn even a simple environment into a POMDP: from a single image we cannot predict the future trajectory, which is a fundamental obstacle for value-iteration methods.

Funnily enough, PG-based methods are able to converge in the same situation. Since they don't need to predict Q-values to get the policy, their task is much simpler: they only need to learn where to click, without bothering much about the future reward. So they just click and don't think much.

Can we use the same approach to make DQN work in MiniWoB? Yes, why not. There are two ways of doing this:
1. turn the environment into a bandit-style one -- the agent is allowed to click once and then needs to wait for the outcome. It won't be very efficient, but it will work.
2. stop pretending our env is an MDP and deal with its POMDP nature. The usual way is to extend the observations with some kind of "state" or "history" to make it an MDP again.
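
The second option can be sketched in a few lines: keep the last few frames and stack them along the channel axis, so that a single observation carries some history. `FrameHistory` and its `n_frames` parameter are hypothetical names introduced for this illustration, not part of MiniWoB or universe, and whether a short frame stack is enough to restore the Markov property depends on the task.

```python
import collections
import numpy as np


class FrameHistory:
    """Keeps the last n_frames observations and stacks them along the channel axis."""

    def __init__(self, n_frames=4):
        self.n_frames = n_frames
        self.frames = collections.deque(maxlen=n_frames)

    def reset(self, first_obs):
        # Fill the whole history with the first frame so the shape stays constant
        self.frames.clear()
        for _ in range(self.n_frames):
            self.frames.append(first_obs)
        return self.stacked()

    def push(self, obs):
        # The oldest frame is dropped automatically by the bounded deque
        self.frames.append(obs)
        return self.stacked()

    def stacked(self):
        # n_frames arrays of shape (C, H, W) -> one array of shape (n_frames * C, H, W)
        return np.concatenate(list(self.frames), axis=0)


history = FrameHistory(n_frames=4)
state = history.reset(np.zeros((3, 210, 160), dtype=np.uint8))
state = history.push(np.ones((3, 210, 160), dtype=np.uint8))
print(state.shape)
```

The network's first convolution would then have to accept `n_frames * 3` input channels instead of 3.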

# Bandit approach to ClickDialog

Let's explore the first path while keeping the modifications minimal: implement an environment wrapper that lets the agent execute a single action and then wait for the final reward (which arrives within 10 seconds in any case, due to the MiniWoB timeout).

Then we'll be able to plug this wrapper into the code we have already written; hopefully, this will make our DQN converge.

Of course, this is overkill -- you're more than welcome to implement normal bandit methods and compare them on this environment.
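
For comparison, a "normal" bandit method can be as small as the epsilon-greedy sketch below. It is context-free: it ignores the image and only tracks a running mean reward per action, so it is meant as an illustration of the idea rather than a solution for MiniWoB (if the target moves between episodes, the image has to be taken into account). The class name, the toy reward and the choice of 8 actions are all assumptions made for the example.

```python
import random


class EpsilonGreedyBandit:
    """Context-free bandit: tracks a running mean reward for every action."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_actions
        self.values = [0.0] * n_actions
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental mean: Q <- Q + (r - Q) / n
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


# Toy reward: only action 3 counts as a "successful click"
bandit = EpsilonGreedyBandit(n_actions=8, epsilon=0.2)
for _ in range(500):
    a = bandit.select()
    bandit.update(a, 1.0 if a == 3 else -1.0)
best = max(range(8), key=lambda a: bandit.values[a])
print(best, bandit.values[best])
```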

## Preparations and imports

In [2]:
!./containers_stop.sh

56916832e628


In [3]:
!./containers_run.sh

Starting container 1
e671bd7f02069d9714c29eebc6dab685c9ed6151a9a9fec8b9f5fb03f5d4fe00


In [4]:
DOCKER_IMAGE = "shmuma/miniwob:latest"

In [5]:
import gym
import ptan
import time
import numpy as np
import universe
from typing import List, Optional, Tuple
from universe import vectorized
from universe.wrappers.experimental import SoftmaxClickMouse

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from PIL import Image



In [6]:
# function to build connection endpoints for a set of containers
def remotes_url(port_ofs=0, hostname='localhost', count=8):
    hosts = ["%s:%d+%d" % (hostname, 5900 + ofs, 15900 + ofs) for ofs in range(port_ofs, port_ofs+count)]
    return "vnc://" + ",".join(hosts)
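
To see what the helper produces, here is the same function called for two containers. Each entry has the form `host:port+port`, with both port numbers counted up from 5900 and 15900 respectively; the function body is copied from the cell above, only the call and the variable name `url_two` are added for illustration.

```python
def remotes_url(port_ofs=0, hostname='localhost', count=8):
    hosts = ["%s:%d+%d" % (hostname, 5900 + ofs, 15900 + ofs)
             for ofs in range(port_ofs, port_ofs + count)]
    return "vnc://" + ",".join(hosts)


url_two = remotes_url(count=2)
print(url_two)
# vnc://localhost:5900+15900,localhost:5901+15901
```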

In [7]:
class MiniWoBCropper(vectorized.ObservationWrapper):
    """
    Crops the WoB area and converts the observation into PyTorch (C, H, W) format.
    """
    # Area of interest
    WIDTH = 160
    HEIGHT = 210
    X_OFS = 10
    Y_OFS = 75
    
    def __init__(self, env, keep_text=False):
        super(MiniWoBCropper, self).__init__(env)
        self.keep_text = keep_text
        img_space = gym.spaces.Box(low=0, high=255, shape=(3, self.HEIGHT, self.WIDTH))
        if keep_text:
            self.observation_space = gym.spaces.Tuple(spaces=(img_space, gym.spaces.Space))
        else:
            self.observation_space = img_space

    def _observation(self, observation_n):
        res = []
        for obs in observation_n:
            if obs is None:
                res.append(obs)
                continue
            img = obs['vision'][self.Y_OFS:self.Y_OFS+self.HEIGHT, self.X_OFS:self.X_OFS+self.WIDTH, :]
            img = np.transpose(img, (2, 0, 1))
            if self.keep_text:
                text = " ".join(map(lambda d: d.get('instruction', ''), obs.get('text', [{}])))
                res.append((img, text))
            else:
                res.append(img)
        return res
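
The crop and the axis reordering can be checked on a dummy frame without any running container. The constants are taken from `MiniWoBCropper`; the raw frame size of 768x1024 is an assumption made for this sketch, the real size is whatever the VNC session delivers.

```python
import numpy as np

# Constants copied from MiniWoBCropper
WIDTH, HEIGHT, X_OFS, Y_OFS = 160, 210, 10, 75

# Assumed raw frame in (height, width, channels) layout
frame = np.zeros((768, 1024, 3), dtype=np.uint8)

crop = frame[Y_OFS:Y_OFS + HEIGHT, X_OFS:X_OFS + WIDTH, :]   # still (H, W, C)
chw = np.transpose(crop, (2, 0, 1))                          # PyTorch-style (C, H, W)
print(crop.shape, chw.shape)
```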


In [8]:
class Model(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(Model, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 64, 5, stride=5),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=2),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_out(input_shape)

        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        fx = x.float() / 256
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.fc(conv_out)
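
`_get_conv_out` finds the size of the flattened convolution output by pushing a zero tensor through the layers. The same number can be derived by hand from the kernel sizes and strides, which is a handy sanity check. The helper names below are hypothetical, and the sketch assumes unpadded convolutions, as in `Model`.

```python
def conv_out(size, kernel, stride):
    # Output length of an unpadded convolution along one axis
    return (size - kernel) // stride + 1


def conv_stack_out(h, w, layers, out_channels):
    # Apply every (kernel, stride) pair to both spatial axes in turn
    for kernel, stride in layers:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return out_channels * h * w, (h, w)


# (kernel, stride) pairs of the three Conv2d layers in Model; the last layer has 32 channels
layers = [(5, 5), (3, 2), (3, 2)]
flat_size, spatial = conv_stack_out(210, 160, layers, out_channels=32)
print(flat_size, spatial)
# 2016 (9, 7)
```

So for the cropped 210x160 image, the first linear layer sees 2016 input features.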

In [9]:
@torch.no_grad()
def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast], net: nn.Module, gamma: float, device="cpu"):
    states = []
    actions = []
    rewards = []
    done_masks = []
    last_states = []
    for exp in batch:
        states.append(exp.state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        done_masks.append(exp.last_state is None)
        if exp.last_state is None:
            last_states.append(exp.state)
        else:
            last_states.append(exp.last_state)

    states_v = torch.tensor(states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    last_states_v = torch.tensor(last_states).to(device)
    last_state_q_v = net(last_states_v)
    best_last_q_v = torch.max(last_state_q_v, dim=1)[0]
    best_last_q_v[done_masks] = 0.0
    return states_v, actions_v, rewards_v + gamma * best_last_q_v
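
The quantity this kind of function is meant to produce is the one-step target r + gamma * max_a Q(s', a), with the bootstrap term dropped for terminal transitions. Here is a small NumPy sketch of that arithmetic on made-up numbers; `td_targets` is a hypothetical helper, not part of ptan. With `GAMMA = 1.0`, as used in this notebook, the discount has no effect.

```python
import numpy as np


def td_targets(rewards, next_q, done, gamma):
    # Best Q-value in the next state, one entry per transition
    best_next = next_q.max(axis=1)
    # No bootstrapping from terminal transitions
    best_next[done] = 0.0
    return rewards + gamma * best_next


rewards = np.array([0.0, -1.0, 1.0])
next_q = np.array([[0.5, 0.2], [0.9, 0.1], [0.3, 0.4]])
done = np.array([False, True, True])
targets = td_targets(rewards, next_q, done, gamma=0.5)
print(targets)
```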

## Wrapper class (practice)

Below you need to implement the wrapper. Do not forget that everything is vectorized! You don't need to implement it in a generic way -- handling a single environment is enough.

In [6]:
class SingleShotWrapper(vectorized.Wrapper):
    def reset(self):
        return self.env.reset()
    
    def step(self, action_n):
        raise NotImplementedError
        return obs_n, reward_n, [True], {}

### Tests

In [9]:
env = gym.make("wob.mini.ClickDialog-v0")

env = SoftmaxClickMouse(env)
env = MiniWoBCropper(env)
env = SingleShotWrapper(env)
url = remotes_url(count=1)

# Note: fps=1
env.configure(remotes=url, docker_image=DOCKER_IMAGE, fps=1, vnc_kwargs={
        'encoding': 'tight', 'compress_level': 0,
        'fine_quality_level': 100, 'subsample_level': 0
    })
obs = env.reset()

while obs[0] is None:
    a = env.action_space.sample()
    obs, reward, is_done, info = env.step([a])
    time.sleep(1)

[2019-07-01 21:22:46,195] Making new env: wob.mini.ClickDialog-v0
[2019-07-01 21:22:46,196] Using SoftmaxClickMouse with action_region=(10, 125, 170, 285), noclick_regions=[]
[2019-07-01 21:22:46,197] SoftmaxClickMouse noclick regions removed 0 of 256 actions
[2019-07-01 21:22:46,200] Using the golang VNC implementation
[2019-07-01 21:22:46,201] Using VNCSession arguments: {'encoding': 'tight', 'compress_level': 0, 'fine_quality_level': 100, 'subsample_level': 0, 'start_timeout': 7}. (Customize by running "env.configure(vnc_kwargs={...})"
[2019-07-01 21:22:46,271] [0] Connecting to environment: vnc://localhost:5900 password=openai. If desired, you can manually connect a VNC viewer, such as TurboVNC. Most environments provide a convenient in-browser VNC client: http://localhost:15900/viewer/?password=openai


Init: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]


[2019-07-01 21:22:51,442] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


Loop: [0.0], [False]


[2019-07-01 21:22:51,516] [0:localhost:5900] Initial reset complete: episode_id=326


Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]


[2019-07-01 21:23:02,615] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0
[2019-07-01 21:23:02,678] [0:localhost:5900] Initial reset complete: episode_id=328


Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [-1.0], [True]


In [10]:
o = env.reset()
print(o)
env.step([5])

[2019-07-01 21:23:11,374] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


[None]
Init: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [0.0], [False]
Loop: [-1.0], [True]


([array([[[181, 182, 182, ..., 182, 182, 182],
          [255, 255, 255, ..., 255, 255, 255],
          [254, 255, 255, ..., 255, 255, 255],
          ...,
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255]],
  
         [[179, 180, 180, ..., 180, 180, 180],
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255],
          ...,
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255]],
  
         [[182, 183, 183, ..., 183, 183, 183],
          [  0,   0,   0, ...,   0,   0,   0],
          [  0,   0,   0, ...,   0,   0,   0],
          ...,
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255],
          [255, 255, 255, ..., 255, 255, 255]]], dtype=uint8)],
 [-1.0],
 [True],
 {})

## Wrapper class (solution)

**Spoiler alert!**

In [15]:
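class SingleShotWrapper(vectorized.Wrapper):
    """
    Passes the first step after a reset through unchanged, then repeats the
    next action until the episode ends and returns the accumulated reward.
    Only the single-environment case is handled.
    """
    def __init__(self, env):
        super(SingleShotWrapper, self).__init__(env)
        self.fresh = True
        
    def reset(self):
        self.fresh = True
        return self.env.reset()
    
    def step(self, action):
        if self.fresh:
            # first step of the episode: return the result as-is
            self.fresh = False
            return self.env.step(action)  
        # keep stepping with the same action until the episode is done
        obs_n, reward_n, is_done_n, _ = self.env.step(action)
        while not is_done_n[0]:            
            o_n, r_n, is_done_n, _ = self.env.step(action)
            reward_n[0] += r_n[0]
        self.fresh = True
        return obs_n, reward_n, [True], {}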
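
The control flow of the wrapper can be tried out without universe or a running container. The sketch below uses a fake one-environment vectorized env that ends an episode after a fixed number of steps with reward -1 (both `FakeVecEnv` and its auto-reset behaviour are assumptions made for the example) and a copy of the wrapper logic without the `vectorized.Wrapper` base class.

```python
class FakeVecEnv:
    """Stand-in for a one-environment vectorized env: episode ends after `length` steps."""

    def __init__(self, length=10):
        self.length = length
        self.t = 0
        self.calls = 0

    def reset(self):
        self.t = 0
        return [0]

    def step(self, action_n):
        self.t += 1
        self.calls += 1
        done = self.t >= self.length
        reward = -1.0 if done else 0.0
        obs = self.t
        if done:
            self.t = 0  # assumed auto-reset at the end of an episode
        return [obs], [reward], [done], {}


class SingleShot:
    """Same control flow as SingleShotWrapper, minus the universe base class."""

    def __init__(self, env):
        self.env = env
        self.fresh = True

    def reset(self):
        self.fresh = True
        return self.env.reset()

    def step(self, action_n):
        if self.fresh:
            self.fresh = False
            return self.env.step(action_n)
        obs_n, reward_n, done_n, _ = self.env.step(action_n)
        while not done_n[0]:
            _, r_n, done_n, _ = self.env.step(action_n)
            reward_n[0] += r_n[0]
        self.fresh = True
        return obs_n, reward_n, [True], {}


raw_env = FakeVecEnv(length=10)
toy_env = SingleShot(raw_env)
toy_env.reset()
first = toy_env.step([5])
second = toy_env.step([5])
print(first[1:3], second[1:3], raw_env.calls)
```

From the agent's point of view every episode takes two steps, while the underlying environment is stepped until the episode ends, which matches the `steps=2.000` reported in the training log below.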

## Training

In [20]:
# params are tweaked to reflect our one episode -> one sample rate
GAMMA = 1.0              # gamma is irrelevant in the one-step case
REPLAY_SIZE = 1000
MIN_REPLAY = 20
TGT_SYNC = 10
BATCH_SIZE = 16
LR = 1e-4
DEVICE = "cuda"  
#DEVICE = "cpu"

INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.2
STEPS_EPSILON = 20
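
To get a feeling for these epsilon constants, here is the linear decay formula from the training loop below, evaluated at a few step counts. The constants are copied from this cell; `epsilon_at` is a hypothetical helper added for the illustration.

```python
INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.2
STEPS_EPSILON = 20
MIN_REPLAY = 20


def epsilon_at(step):
    # Linear decay counted from the end of the replay warm-up, clipped at FINAL_EPSILON
    return max(FINAL_EPSILON, INITIAL_EPSILON - (step - MIN_REPLAY) / STEPS_EPSILON)


schedule = {step: epsilon_at(step) for step in (20, 30, 36, 100)}
print(schedule)
# {20: 1.0, 30: 0.5, 36: 0.2, 100: 0.2}
```

With these values epsilon reaches its floor of 0.2 about 16 steps after training starts, so most of the run is spent with 20% random clicks.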

In [21]:
env = gym.make("wob.mini.ClickDialog-v0")

env = SoftmaxClickMouse(env)
env = MiniWoBCropper(env)
env = SingleShotWrapper(env)
url = remotes_url(count=1)

# Note: fps=1
env.configure(remotes=url, docker_image=DOCKER_IMAGE, fps=1, vnc_kwargs={
        'encoding': 'tight', 'compress_level': 0,
        'fine_quality_level': 100, 'subsample_level': 0
    })
obs = env.reset()

while obs[0] is None:
    a = env.action_space.sample()
    obs, reward, is_done, info = env.step([a])
    time.sleep(1)

[2019-07-02 12:04:54,650] Making new env: wob.mini.ClickDialog-v0
[2019-07-02 12:04:54,653] Using SoftmaxClickMouse with action_region=(10, 125, 170, 285), noclick_regions=[]
[2019-07-02 12:04:54,654] SoftmaxClickMouse noclick regions removed 0 of 256 actions
[2019-07-02 12:04:54,656] Using the golang VNC implementation
[2019-07-02 12:04:54,656] Using VNCSession arguments: {'encoding': 'tight', 'compress_level': 0, 'fine_quality_level': 100, 'subsample_level': 0, 'start_timeout': 7}. (Customize by running "env.configure(vnc_kwargs={...})"
[2019-07-02 12:04:54,681] [0] Connecting to environment: vnc://localhost:5900 password=openai. If desired, you can manually connect a VNC viewer, such as TurboVNC. Most environments provide a convenient in-browser VNC client: http://localhost:15900/viewer/?password=openai
[2019-07-02 12:05:11,106] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0
[2019-07-02 12:05:11,119] [0:localhost:5900] Initial reset complete:

In [22]:
net = Model(env.observation_space.shape, env.action_space.n).to(DEVICE)
tgt_net = ptan.agent.TargetNet(net)

parent_selector = ptan.actions.ArgmaxActionSelector()
action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=INITIAL_EPSILON, selector=parent_selector) 
agent = ptan.agent.DQNAgent(net, action_selector, device=DEVICE)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA, vectorized=True)

buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
optimizer = optim.Adam(net.parameters(), LR)

In [23]:
episode_rewards = []
episode_steps = []
losses = []
steps = 0
last_ts = time.time()
last_steps = 0

while True:
    steps += 1
    buffer.populate(1)
    r = exp_source.pop_rewards_steps()
    if r:
        for rw, st in r:
            episode_rewards.append(rw)
            episode_steps.append(st)
        speed = (steps - last_steps) / (time.time() - last_ts)
        print("%d: Done %d episodes, last 10 means: reward=%.3f, steps=%.3f, speed=%.2f steps/s, eps=%.2f" % (
            steps, len(episode_rewards), np.mean(episode_rewards[-10:]), 
            np.mean(episode_steps[-10:]), speed, action_selector.epsilon
        ))
        last_ts = time.time()
        last_steps = steps
        if np.mean(episode_rewards[-10:]) > 0.9:
            print("You solved the env! Congrats!")
            break
    if len(buffer) < MIN_REPLAY:
        continue
    batch = buffer.sample(BATCH_SIZE)
    state_v, actions_v, ref_q_v = unpack_batch(batch, tgt_net.target_model, gamma=GAMMA, device=DEVICE)
    optimizer.zero_grad()
    q_v = net(state_v)
    q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    loss_v = F.mse_loss(q_v, ref_q_v)
    loss_v.backward()
    optimizer.step()
    losses.append(loss_v.item())
    
    if steps % TGT_SYNC == 0:
        print("%d: nets synced, mean loss for last 10 steps = %.3f" % (
            steps, np.mean(losses[-10:])))
        tgt_net.sync()
    action_selector.epsilon = max(FINAL_EPSILON, INITIAL_EPSILON - (steps - MIN_REPLAY) / STEPS_EPSILON)

[2019-07-02 12:05:16,784] [0:localhost:5900] Sending reset for env_id=wob.mini.ClickDialog-v0 fps=60 episode_id=0


2: Done 1 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.11 steps/s, eps=1.00
3: Done 2 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=1.00
4: Done 3 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=1.00
5: Done 4 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=1.00
6: Done 5 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=1.00
7: Done 6 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=1.00
8: Done 7 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=1.00
9: Done 8 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=1.00
10: Done 9 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=1.00
11: Done 10 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=1.00
12: Done 11 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 s

86: Done 85 episodes, last 10 means: reward=-0.901, steps=2.000, speed=0.10 steps/s, eps=0.20
87: Done 86 episodes, last 10 means: reward=-0.901, steps=2.000, speed=0.09 steps/s, eps=0.20
88: Done 87 episodes, last 10 means: reward=-0.901, steps=2.000, speed=0.10 steps/s, eps=0.20
89: Done 88 episodes, last 10 means: reward=-0.901, steps=2.000, speed=0.09 steps/s, eps=0.20
90: Done 89 episodes, last 10 means: reward=-0.901, steps=2.000, speed=0.10 steps/s, eps=0.20
90: nets synced, mean loss for last 10 steps = 0.772
91: Done 90 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=0.20
92: Done 91 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=0.20
93: Done 92 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=0.20
94: Done 93 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.10 steps/s, eps=0.20
95: Done 94 episodes, last 10 means: reward=-1.000, steps=2.000, speed=0.09 steps/s, eps=0.20
96: Don

KeyboardInterrupt: 