## Introduction

### Basics
For this experiment, you will be providing instructions to a machine learner to help it learn a reward grid. Every box on the grid is assigned some reward value, and the machine learner can move around the grid, collecting rewards. Some boxes may have a negative reward value (punishment), so they should be avoided. Ideally, you should guide the learner toward boxes with high reward values. Boxes with negative reward values are color-coded in a cold blue, while boxes with high rewards are colored in a warm red.
                          
<img style="float: middle;" src="img/colorbar.png" width=500 height=500>

                          min_reward                       0                         max_reward
                              
The image below shows a ground truth reward map with one punishment box and one reward box.

<img style="float: middle;" src="img/image1.png" width=500 height=500>

The learner does not know the ground truth rewards, but will do its best to learn them from the instructions you will provide. 

### Experiment flow
The process of teaching a machine learner a reward map consists of multiple timesteps. At each timestep, you will be presented with three maps. On the left, we show the ground truth reward map the machine learner is trying to estimate. In the middle, we show the learner's <font color= blue> current estimation </font> of the ground truth rewards. This starts out random, though we hope to see it improve during the teaching process. On the right, we show the learner's current policy, which represents the <font color= blue> most likely action </font> to be taken from each box.

<img style="float: middle;" src="img/image2.png" width=750 height=750>

A set of arrows, which you can select from, is displayed on the left and middle maps. By selecting an arrow, you instruct the learner to <font color= blue> follow the arrow's direction if it were in the box the arrow originates from</font>. If an arrow is in a box on an edge of the grid and points towards that edge, such as the arrow in B1 in the image above, it indicates that the learner should stay in the same box. After you select an arrow and confirm it, the learner's current estimation of the ground truth reward map is updated based on your selection. Your goal is to pick the arrow that makes the learner's updated estimation as close to the ground truth reward map as possible.
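The edge rule described above can be sketched in a few lines. The helper below is purely illustrative and is not part of the experiment code: the name `destination`, the default 5x5 grid size, and the convention that letters label columns and numbers label rows (with row 1 at the top) are all assumptions made for this example.

```python
# Hypothetical helper for illustration only; it is not part of play.py.
# Assumptions: letters label columns (A, B, ...), numbers label rows (1, 2, ...),
# row 1 is at the top, and the grid is 5x5 unless stated otherwise.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def destination(box, direction, n_cols=5, n_rows=5):
    """Return the box the learner ends up in after following an arrow."""
    col = ord(box[0]) - ord("A")
    row = int(box[1:]) - 1
    d_col, d_row = MOVES[direction]
    new_col, new_row = col + d_col, row + d_row
    # An arrow pointing off the grid means "stay in the same box".
    if not (0 <= new_col < n_cols and 0 <= new_row < n_rows):
        return box
    return f"{chr(ord('A') + new_col)}{new_row + 1}"
```

Under these assumptions, `destination("D3", "right")` gives `"E3"`, while an arrow in B1 that points off the grid, `destination("B1", "up")`, leaves the learner in `"B1"`.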



You are free to pick whichever arrow you think instructs the learner best, but we propose two heuristics that may help.

1. A local heuristic. 

    When you select an arrow, you can expect the learner's estimated reward for the destination box to increase. On the other hand, the learner will decrease its reward estimates for the source box and all of the source box's other neighbors. This is best illustrated below. If the arrow at D3 is selected, we can expect the estimate for box E3 to increase, while the estimates for boxes C3, D2, D3, and D4 decrease.

<img style="float: middle;" src="img/local.jpeg" width=650 height=650>

2. A global heuristic. 

    A selected arrow can also teach the learner to increase its estimated values in the arrow's direction, and decrease them in the opposite direction. In the image below, we illustrate this with a rightward arrow that increases the learner's estimated rewards for boxes on the right, but decreases them for boxes on the left. Note that boxes close to the selected arrow change more than those far away.


For both of these heuristics, the extent to which values increase and decrease may differ between iterations.  


<img style="float: middle;" src="img/global.jpeg" width=650 height=650>
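To make the two heuristics concrete, here is a toy sketch of both. It is not the learner's actual update rule, which these instructions do not specify: the function names, the fixed `step` size, the `step / distance` decay in the global case, and the `(column, row)` indexing are all assumptions chosen for illustration. Arrows that point off the grid are left out of the local sketch.

```python
# Toy illustration of the two heuristics; NOT the learner's real update rule.
# Assumptions: estimate[row][col] is a list of lists, boxes are (col, row)
# tuples, and the size of each change (step) is an arbitrary constant.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def local_update(estimate, source, direction, step=1.0):
    """Raise the destination box; lower the source and its other neighbors."""
    n_rows, n_cols = len(estimate), len(estimate[0])
    col, row = source
    d_col, d_row = MOVES[direction]
    dest = (col + d_col, row + d_row)
    if not (0 <= dest[0] < n_cols and 0 <= dest[1] < n_rows):
        raise ValueError("arrows pointing off the grid are not covered here")
    updated = [r[:] for r in estimate]  # copy so the input stays unchanged
    affected = [(col, row)] + [(col + dc, row + dr) for dc, dr in MOVES.values()]
    for c, r in affected:
        if 0 <= c < n_cols and 0 <= r < n_rows:
            updated[r][c] += step if (c, r) == dest else -step
    return updated


def global_update(estimate, source, direction, step=1.0):
    """Raise boxes ahead of the arrow, lower boxes behind it, fading with distance."""
    col, row = source
    d_col, d_row = MOVES[direction]
    updated = [r[:] for r in estimate]
    for r, values in enumerate(estimate):
        for c, _ in enumerate(values):
            along = (c - col) * d_col + (r - row) * d_row  # > 0 means "ahead"
            distance = abs(c - col) + abs(r - row)  # Manhattan distance
            if along > 0:
                updated[r][c] += step / distance
            elif along < 0:
                updated[r][c] -= step / distance
    return updated
```

With letters as columns and numbers as rows, box D3 is `(3, 2)`, so `local_update(grid, (3, 2), "right")` raises E3 and lowers C3, D2, D3, and D4, matching the local example above. In the real experiment the size of each change may differ between iterations, so treat `step` as a placeholder.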

### Warmup round
We will begin with a sample warmup round. 

The warmup round consists of 10 iterations. In each iteration, click an arrow on either reward map to select it, then press 'c' to confirm the selected arrow (shown in green).
At the end of each iteration, after 'c' is pressed, one of the arrows shown on the reward maps might turn yellow. The yellow arrow is the one determined to be optimal among the provided set of arrows.

After the selected arrow is confirmed, click the 'Run' button in the menubar to run the next iteration. The number of iterations to be completed is shown above the reward maps.

Begin the warmup round by running the next cell.

In [None]:
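# Use the interactive notebook backend so the plots respond to clicks and key presses
%matplotlib notebook
from play import run
# Set up the warmup round
lfh, _ = run(0, intro=True)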

Run the cell below to start the first iteration. Then follow the instructions on the plot and run the cell again until all iterations are finished.

In [None]:
lfh.iteration()

Once the message "All iterations are completed" appears, instruction collection is finished and the warmup round is over.

Next, you will be presented with 5 different ground truth reward maps. For each map, repeat the same teaching process as in the warmup round. After you finish providing instructions to the machine learner, remember to save your instructions (more details will be provided later).

Now you can proceed to the first map by opening the notebook <font color= blue> 'test_script1'</font>.