In [24]:
from common_imports import *

# GridWorld with Keys

In this game, we make a few modifications to the classical gridworld environment.

Instead of aiming at a single goal, the agent must visit several goals in sequential order. The final goal can be thought of as a door the agent intends to open, while the preceding goals are the keys. A reward of `1` is given only if the agent reaches the final goal.

This task is *challenging* because the reward horizon is extremely long. In particular, the agent does not get credit for collecting keys.
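To make the reward structure concrete, here is a minimal sketch of such an environment. It is not the implementation used in this notebook: the class name `KeyGridWorld`, its constructor arguments, and the return signature of `step` are all hypothetical, and walls are omitted for brevity.

```python
class KeyGridWorld:
    """Hypothetical grid world with sequential goals (illustration only)."""

    # Moves as (row delta, column delta): up, down, left, right.
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, height, width, goals, start):
        self.height, self.width = height, width
        self.goals = list(goals)  # goals[:-1] act as keys, goals[-1] as the door
        self.pos = start
        self.phase = 0            # index of the next goal to reach

    def step(self, action):
        if self.phase == len(self.goals):
            raise RuntimeError("episode is over; create a new environment")
        dr, dc = self.ACTIONS[action]
        # Clamp the move so the agent stays inside the grid.
        row = min(max(self.pos[0] + dr, 0), self.height - 1)
        col = min(max(self.pos[1] + dc, 0), self.width - 1)
        self.pos = (row, col)
        reward = 0.0
        if self.pos == self.goals[self.phase]:
            self.phase += 1       # reaching a key only advances the phase
            if self.phase == len(self.goals):
                reward = 1.0      # the only reward: reaching the final goal
        done = self.phase == len(self.goals)
        return self.pos, self.phase, reward, done
```

In this sketch, picking up a key changes only `phase`, so every transition before the last one carries a reward of `0`.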


## Baseline Setting

Here is the basic setting in which we run the baseline algorithm.

```python
H = W = 10      # height and width of the maze.
gamma = 0.99        # reward discount factor.
lr = 1e-4           # learning rate.
memory_size = 1024      # memory size of experience replay buffer.
minibatch_size = 64     # minibatch size for training.
epsilon = 0.05          # probability of taking a random action.
nn_num_batch = 1        # how many minibatches to run per step.
nn_num_iter = 3         # how many backprop iterations per minibatch.
```
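As an illustration of how `epsilon`, `memory_size`, and `minibatch_size` typically enter a DQN loop, the sketch below shows epsilon-greedy action selection and a fixed-size replay buffer. It assumes a first-in, first-out buffer and uniform minibatch sampling; the notebook does not state either, and the helper names `select_action` and `sample_minibatch` are hypothetical.

```python
import random
from collections import deque

memory_size = 1024
minibatch_size = 64
epsilon = 0.05

# Assumption: the oldest transitions are dropped once the buffer is full.
replay = deque(maxlen=memory_size)

def select_action(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sample_minibatch(replay, minibatch_size, rng=random):
    """Uniformly sample stored transitions without replacement."""
    return rng.sample(list(replay), min(minibatch_size, len(replay)))
```

With `epsilon = 0.05`, roughly one action in twenty is random, so once the greedy policy stops moving usefully, exploration depends almost entirely on those rare random steps.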

We use the term `phase` to refer to the process of reaching a goal.

The start phase is set to `0`, and there are a total of `4` phases.

**Training Process**

* For each epoch, 
    * sample a task with `phase = 0` and start location uniformly chosen at random from free positions.
    * train DQN for `num_episodes`.
* For every `video_lag` epochs, 
    * test DQN on randomly sampled tasks for `num_trials` and compute the average reward.
    * record a video of DQN playing one episode.
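The schedule above can be sketched as a plain loop. The function `run_training` and the four callables it receives (`sample_task`, `train_episode`, `test_episode`, `record_video`) are hypothetical placeholders, not functions from this notebook's codebase; testing on epochs that are multiples of `video_lag` is also an assumption.

```python
def run_training(num_epochs, num_episodes, num_trials, video_lag,
                 sample_task, train_episode, test_episode, record_video):
    """Run the epoch loop and return the average test reward per evaluation."""
    rewards = []
    for epoch in range(num_epochs):
        # One task per epoch, always starting from phase 0.
        task = sample_task(phase=0)
        for _ in range(num_episodes):
            train_episode(task)
        # Assumption: evaluation happens on epochs 0, video_lag, 2 * video_lag, ...
        if epoch % video_lag == 0:
            scores = [test_episode(sample_task(phase=0)) for _ in range(num_trials)]
            rewards.append(sum(scores) / num_trials)
            record_video(epoch)
    return rewards
```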



## Examples of Random Exploration

At the beginning, the agent starts by exploring at random. We can visualize this process and see how inefficient it is on large, complex mazes.
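A rough way to quantify this inefficiency is to simulate a uniform random walk and count the steps it needs to reach a single goal. The sketch below is only an illustration under simplifying assumptions: an empty square grid without walls, a start in one corner, one goal in the opposite corner, and no keys. The function names are hypothetical.

```python
import random

def random_walk_steps(size, start, goal, rng, max_steps=100_000):
    """Count the steps a uniform random walk takes to reach `goal`."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    row, col = start
    for t in range(1, max_steps + 1):
        dr, dc = rng.choice(moves)
        # Moves into the border leave the agent in place.
        row = min(max(row + dr, 0), size - 1)
        col = min(max(col + dc, 0), size - 1)
        if (row, col) == goal:
            return t
    return max_steps  # give up after max_steps

def mean_steps(size, trials=200, seed=0):
    """Average corner-to-corner hitting time over several random walks."""
    rng = random.Random(seed)
    total = sum(
        random_walk_steps(size, (0, 0), (size - 1, size - 1), rng)
        for _ in range(trials)
    )
    return total / trials
```

Comparing `mean_steps(5)` with `mean_steps(10)` gives a feel for how quickly random exploration degrades as the grid grows, even before walls and multiple sequential goals are added.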

**On a $5 \times 5$ empty maze**, the agent succeeds in solving the maze with a random policy.

In [9]:
HTML(html_embed_mp4('result/02-02-16-11-12-02.802473/video/0.m4v'))

**On a $10 \times 10$ four-room maze**, the agent gets stuck and fails to solve the maze.

In [5]:
HTML(html_embed_mp4('result/02-02-16-13-11-51.711726/video/0.m4v'))

In [6]:
HTML(html_embed_mp4('result/02-02-16-13-11-51.711726/video/5.m4v'))

## Baseline Performance

While the baseline is able to solve small and simple mazes, it completely fails on the four-room example.

In [19]:
import json
from os import path

from pyrl.visualize.plot import *

def plot_result(resultdir):
    """Plot the test reward and training error stored in `resultdir/result.json`."""
    # Load the recorded metrics, then plot once the file has been closed.
    with open(path.join(resultdir, 'result.json'), 'r') as f:
        result = json.load(f)
    reward = result['reward']
    train_error = result['train_error']
    epochs = range(len(reward))
    plot_xy([epochs], [reward], names=['reward'], xlabel='test', ylabel='score', title='reward')
    plot_xy([epochs], [train_error], names=['train_error'], xlabel='epochs', ylabel='error', title='training error')

In [20]:
plot_result('result/02-02-16-13-11-51.711726')

**Observation**

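* The training error stays very small throughout, which indicates that the agent receives almost no reward signal. When nearly every observed reward is zero, the learning targets stay near zero and are easy to fit.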
* The baseline agent learns nothing on this task.
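
The link between a small training error and a missing reward signal can be illustrated with the standard one-step Q-learning error. This is a sketch under an assumption: that the reported `train_error` is derived from a temporal-difference error of this form, which the notebook does not confirm. The function `td_error` is hypothetical.

```python
gamma = 0.99  # discount factor from the baseline setting

def td_error(q_sa, reward, next_q_values, done, gamma=gamma):
    """One-step Q-learning error: target minus the current estimate."""
    if done:
        target = reward
    else:
        target = reward + gamma * max(next_q_values)
    return target - q_sa

# No reward observed and all estimates near zero: nothing to correct.
no_signal = td_error(0.0, 0.0, [0.0, 0.0, 0.0, 0.0], done=False)

# A single rewarded terminal transition would produce a large error.
with_signal = td_error(0.0, 1.0, [0.0, 0.0, 0.0, 0.0], done=True)
```

If the replay buffer never contains a rewarded transition, a network that outputs values near zero everywhere already satisfies every target, so a low error in this setting reflects a lack of signal and not successful learning.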

The following video shows the agent playing after `50` epochs.

In [25]:
HTML(html_embed_mp4('result/02-02-16-13-11-51.711726/video/50.m4v'))