# Intro

This is a demo for the paper *Teachable Reinforcement Learning via Advice Distillation*. It shows how humans can use advice to coach agents through new tasks.

For more details, check out our NeurIPS paper and video: https://neurips.cc/Conferences/2021/Schedule?showEvent=27834

In [None]:
import sys
# Remove any import-path entries that mention 'babyai'
sys.path = [p for p in sys.path if 'babyai' not in p]

In [None]:
!echo $DISPLAY
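
As a rough Python equivalent of the shell check above, the sketch below tests whether an X display appears to be configured. `has_display` is a hypothetical helper, not part of the `teachable` repo, and it assumes a Linux/X11 setup where the Tk backend relies on the `DISPLAY` variable; other platforms may not use this variable at all.

```python
import os


def has_display(environ=None):
    """Return True if a non-empty DISPLAY variable is set.

    Hypothetical helper for illustration; `environ` defaults to the
    current process environment and can be replaced for testing.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("DISPLAY", "").strip())


if __name__ == "__main__":
    if has_display():
        print("DISPLAY is set:", os.environ["DISPLAY"])
    else:
        print("No DISPLAY found; a Tk window may fail to open.")
```

Calling `has_display()` with no arguments inspects the current environment. An empty or missing value suggests the collection window opened later in this notebook may not be able to appear.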

# Setup - do this once

To avoid version conflicts, we recommend running this in a conda environment with Python 3.7.

    conda create --name teachable_rl python=3.7
    conda activate teachable_rl
    pip install notebook
    
You need to run this on a machine with a display. If you're running on a machine without one, use port forwarding:

    ssh -L 9999:localhost:9999 INSERT_SERVER_NAME
    jupyter notebook --no-browser --port 9999


We use two environments: [BabyAI](https://github.com/mila-iqia/babyai) and [AntMaze](https://github.com/rail-berkeley/d4rl). If you would like to use AntMaze, please [install MuJoCo](https://github.com/openai/mujoco-py).

In [None]:
# !git clone https://github.com/aliengirlliv/teachable 1> /dev/null

In [None]:
# cd teachable

In [None]:
# !pip install -r reqs.txt 1> /dev/null

In [None]:
# cd ..

# Setup - do this each time you reload the notebook

In [None]:
%matplotlib tk

In [None]:
cd teachable

In [None]:
from final_demo import *
from IPython.display import HTML
import pathlib
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

# Setup

## Instructions
1. Select the collection mode.
    - "Advice" runs the "Improvement" phase of our method, allowing you to coach an agent using waypoint advice.
    - "Demos" lets you collect trajectories by providing an action at each timestep.
    - "Precollected" loads the included buffer of human-collected advice data instead of collecting new data.
2. Select a save name (any string describing this experiment).
3. Collect data in the Collection section below.


# Collection
To collect data, run the block below. A window will open which lets you collect data.  

In our human experiments, we found you can reach reasonable performance (> 50% success) on this environment with about 30 minutes of human collection time.

## Task

**BabyAI**:  The agent's task is to unlock a door by collecting a matching-colored key and using it on the corresponding door. (To speed up training time, we always spawn the agent in the same room as the locked door, and the key is always in the same spot.)

**Ant**: The agent's task is to reach the pink target.

### Using Advice

Note: Collecting advice requires a mouse.

**BabyAI**: The agent you will be coaching has been pretrained to understand Waypoint advice. It has never seen this particular environment/task before, and has never had to unlock a door. Click on a square to tell the agent to head there and manipulate any item present. Use the scrollwheel to advance.

**Ant**: The agent you will be coaching has been pretrained to understand Waypoint advice. It has never seen an environment this large before. Click on a point to tell the agent to head there. Use the scrollwheel to advance.

### Providing Demos (BabyAI env only)
Use the arrow keys to navigate, Page Up/Down to manipulate objects, and Space to open doors.

## Using Pre-collected data
We include a buffer of data collected with Advice over 30 minutes of human time. Loading this data requires CUDA.
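
The constraints described in this section (demos are BabyAI-only, pre-collected data needs CUDA) can be checked before launching a collection window. The sketch below is one way to do that; `check_config` and its `cuda_available` parameter are hypothetical names introduced for illustration and are not part of the `teachable` repo. The option strings match the ones used in the configuration cell.

```python
VALID_ENV_TYPES = {"BabyAI", "Ant"}
VALID_COLLECT_TYPES = {"Advice", "Demos", "Precollected"}


def check_config(env_type, collect_type, save_path, cuda_available=False):
    """Return a list of problems with the settings; an empty list means OK.

    Hypothetical helper: it only encodes the constraints stated in the
    notebook text, not anything the repo itself enforces.
    """
    problems = []
    if env_type not in VALID_ENV_TYPES:
        problems.append(f"unknown env_type: {env_type!r}")
    if collect_type not in VALID_COLLECT_TYPES:
        problems.append(f"unknown collect_type: {collect_type!r}")
    # The notebook states that demos are available for BabyAI only.
    if collect_type == "Demos" and env_type != "BabyAI":
        problems.append("'Demos' is only available for the BabyAI env")
    # The notebook states that pre-collected data needs CUDA to load.
    if collect_type == "Precollected" and not cuda_available:
        problems.append("'Precollected' requires CUDA")
    if not isinstance(save_path, str) or not save_path:
        problems.append("save_path must be a non-empty string")
    return problems


if __name__ == "__main__":
    for problem in check_config("Ant", "Demos", "my_run"):
        print("Config problem:", problem)
```

If PyTorch is installed, `torch.cuda.is_available()` is a natural value to pass as `cuda_available`.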

In [None]:
env_type = 'Ant'  # Options are 'BabyAI', 'Ant'
collect_type = 'Advice'  # Options are 'Advice', 'Demos', or 'Precollected'
save_path = 'whatever2'  # Any string

In [None]:
assert False

In [None]:
collector = HumanFeedback(env_type=env_type, collect_type=collect_type, 
                          save_path=save_path, seed=124)

In [None]:
# import pickle as pkl
# import joblib
# with open('saved_models/babyai_offset_advice/latest.pkl', 'rb') as f:
#     q = joblib.load(f)

In [None]:
# env.render('rgb')

# Train

Here, we train an advice-free policy on the buffer of collected trajectories.


Training runs for 20 iterations, but feel free to pause it before then if you'd like to see the trained model.
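
To picture what "advice-free" means here, the toy sketch below clones actions from coached trajectories while discarding the advice, so the resulting policy depends on observations alone. This is a tabular illustration under strong assumptions (discrete, hashable observations and a majority-vote policy); it is not what `run_experiment` does internally, and `clone_policy` and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def clone_policy(trajectories):
    """Build an observation -> action table from coached episodes.

    Each episode is a list of (observation, advice, action) tuples.
    The advice is deliberately ignored, so the cloned policy never
    needs advice at test time.
    """
    counts = defaultdict(Counter)
    for episode in trajectories:
        for observation, _advice, action in episode:
            counts[observation][action] += 1
    # For each observation, pick the action taken most often.
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


if __name__ == "__main__":
    # Made-up coached episodes, for illustration only.
    coached = [
        [("near_key", "go to key", "pickup"), ("near_door", "go to door", "toggle")],
        [("near_key", "go to key", "pickup"), ("near_door", "go to door", "forward")],
        [("near_door", "go to door", "toggle")],
    ]
    policy = clone_policy(coached)
    print(policy["near_key"], policy["near_door"])
```

The key design point is that advice appears in the data but not in the policy's inputs: the coached rollouts supply good actions, and the learner maps observations directly to those actions.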

In [None]:
args = make_args(collector, save_path)  
run_experiment(args)

# Visualize

Play a video of the agent you trained. This agent was trained using the coached rollouts you provided.  This agent does **not** receive advice.

In [None]:
display_trained_model(save_path)

Plot the agent's success rate during training.

In [None]:
plot(save_path)
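
For readers who want to compute a success rate from their own episode outcomes, the sketch below shows one simple approach: an overall fraction and a running average over a sliding window. The function names and the window size are assumptions made for illustration; the repo's `plot` function may compute its curve differently.

```python
def success_rate(outcomes):
    """Fraction of successful episodes; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o) / len(outcomes)


def running_success_rate(outcomes, window=10):
    """Success rate over the last `window` episodes, at each episode."""
    rates = []
    for i in range(len(outcomes)):
        start = max(0, i - window + 1)
        chunk = outcomes[start:i + 1]
        rates.append(success_rate(chunk))
    return rates


if __name__ == "__main__":
    # Made-up outcomes (True = task solved), for illustration only.
    episodes = [False, False, True, False, True, True, True]
    print("overall:", success_rate(episodes))
    print("running:", running_success_rate(episodes, window=3))
```

A sliding window smooths out episode-to-episode noise, which makes it easier to see whether the policy has crossed a threshold such as 50% success.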