# Intro

This is a demo for the paper *Teachable Reinforcement Learning via Advice Distillation*. It shows how humans can use advice to coach agents through new tasks.

For more details, check out our NeurIPS paper and video: https://neurips.cc/Conferences/2021/Schedule?showEvent=27834

In [None]:
import sys
# Drop any sys.path entries that mention 'babyai'
sys.path = [p for p in sys.path if 'babyai' not in p]

# Setup - do this once

To avoid version conflicts, we recommend running this in a conda env with Python 3.7.

    conda create --name teachable_rl python=3.7
    conda activate teachable_rl
    pip install notebook
    
You need to run this on a device with a display. If you're running on a machine without one, use port forwarding:

    ssh -L 9999:localhost:9999 INSERT_SERVER_NAME
    jupyter notebook --no-browser --port 9999


We use two environments: [BabyAI](https://github.com/mila-iqia/babyai) and [AntMaze](https://github.com/rail-berkeley/d4rl). If you would like to use AntMaze, please [install MuJoCo](https://github.com/openai/mujoco-py).

In [None]:
# !git clone https://github.com/aliengirlliv/teachable 1> /dev/null

In [None]:
# cd teachable

In [None]:
# !pip install -r reqs.txt 1> /dev/null

In [None]:
# cd ..

# Setup - Do this each time you reload the notebook

In [None]:
%matplotlib tk

In [None]:
cd teachable

In [None]:
from final_demo import *
from IPython.display import HTML
import pathlib
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

# Setup

## Instructions
1. Select the collection mode.
    - "Advice" runs the "Improvement" phase of our method, allowing you to coach an agent through a new task.
    - "Demos" lets you collect trajectories by providing an action at each timestep.
2. Select a save name (any string describing this experiment).
3. Collect data in the Collection section below.


# Collection
To collect data, run the block below. A window will open which lets you collect data.  

In our human experiments, we found you can reach good performance on this environment with about 30 minutes of human collection time.

## Task
The agent's task is to unlock a door by collecting a matching-colored key and using it on the corresponding door. (To speed up training time, we always spawn the agent in the same room as the locked door.)

### Using Advice

The agent you will be coaching has been pretrained to understand Waypoint advice. It has never seen this particular environment/task before, and has never had to unlock a door. Click on a square to tell the agent to head there and manipulate any item present. Use the scrollbar to advance.

### Providing Demos
Use the arrow keys to navigate, Page Up/Down to manipulate objects, and Space to open doors.

## Using Pre-collected data
We include a buffer of data collected with Advice over about 30 minutes of human time. You can only load this data if you have CUDA enabled. If you choose this option, skip the second cell below.
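Because the pre-collected buffer needs CUDA, it can help to check for it before setting `collect_type`. The sketch below assumes the code runs on PyTorch (an assumption, not stated in this notebook) and uses `torch.cuda.is_available()`. The helper names `cuda_available` and `choose_collect_type` are hypothetical and are not part of the repository.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # assumption: the project runs on PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def choose_collect_type(preferred="Precollected", fallback="Advice"):
    """Hypothetical helper: fall back when 'Precollected' is requested without CUDA."""
    if preferred == "Precollected" and not cuda_available():
        return fallback
    return preferred


print(choose_collect_type("Precollected"))
```

The result could then be assigned to `collect_type` in the settings cell instead of hard-coding the string.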


In [None]:
env_type = 'BabyAI'  # Options are 'BabyAI', 'Ant'
collect_type = 'Advice'  # Options are 'Advice', 'Demos', or 'Precollected'
save_path = 'babyai_10_v0'  # Any string

In [None]:
collector = HumanFeedback(env_type=env_type, collect_type=collect_type, 
                          save_path=save_path, seed=124)

# Train

Here, we train an advice-free policy on the buffer of collected trajectories.


Training runs for 20 iterations, but feel free to pause before then.
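To give a feel for what "training an advice-free policy on a buffer" means, here is a toy tabular imitation sketch. It is not the repository's training code: `run_experiment` may work quite differently, and the state names, action names, and `fit_tabular_policy` are all hypothetical. The point is only that the learned policy maps states to actions without taking advice as input.

```python
from collections import Counter, defaultdict


def fit_tabular_policy(buffer):
    """Map each observed state to the action taken most often in that state."""
    counts = defaultdict(Counter)
    for state, action in buffer:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Toy buffer of (state, action) pairs. Advice is not stored as a policy input.
buffer = [
    ("facing_key", "pickup"),
    ("facing_key", "pickup"),
    ("facing_key", "forward"),
    ("facing_door_with_key", "toggle"),
]

policy = fit_tabular_policy(buffer)
print(policy)
```

A real agent would replace the lookup table with a learned function of the observation, but the supervision signal (recorded state-action pairs) plays the same role.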

In [None]:
args = make_args(collector, save_path)  
run_experiment(args)

# Visualize

Play the video saved during training. This agent does not receive advice.

In [None]:
html_str = display_trained_model(save_path)
HTML(html_str)

In [None]:
plot(save_path)

In [None]:

def load_data(name, file_name='progress.csv'):
    csv_name = pathlib.Path.cwd().joinpath('teachable', 'logs', name, file_name)
    data = pd.read_csv(csv_name)
    data.columns = [c.strip() for c in data.columns]
    return data

def plot(run_name, metric='success_rate', x_label='Itrs'):
    # Plot a smoothed metric against training iterations or the 'num_feedback' column.
    use_itrs = x_label in ['Itrs', 'Samples']
    data = load_data(run_name, file_name='results.csv')
    # Assign explicit column names to results.csv.
    data.columns = ['policy_env','policy','env','success_rate','stoch_accuracy','itr','num_feedback','time','reward']
    # Smooth the metric with an exponentially weighted moving average (span of 5 rows).
    y = data[metric].ewm(span=5).mean().to_numpy()
    # Choose the x-axis column based on x_label.
    x = data['itr' if use_itrs else 'num_feedback'].to_numpy()
    plt.title(run_name)
    plt.plot(x, y)
    plt.ylabel('Success', fontsize=15)
    plt.xlabel(x_label, fontsize=15)
    plt.show()
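
The `plot` function smooths the metric with `ewm(span=5).mean()`. With pandas' default settings (`adjust=True`), this is a weighted average in which a point `i` steps in the past gets weight `(1 - alpha)**i`, where `alpha = 2 / (span + 1)`. The pure-Python sketch below reproduces that calculation so the effect of the smoothing is easier to see. `ewm_mean` is an illustrative name, not a function from the repository.

```python
def ewm_mean(values, span):
    """Exponentially weighted mean using pandas' default adjust=True weighting."""
    alpha = 2.0 / (span + 1.0)
    smoothed = []
    numerator = 0.0
    denominator = 0.0
    for x in values:
        # Decay all earlier contributions by (1 - alpha), then add the newest
        # point with weight 1.
        numerator = (1.0 - alpha) * numerator + x
        denominator = (1.0 - alpha) * denominator + 1.0
        smoothed.append(numerator / denominator)
    return smoothed


# A success signal that flips from 0 to 1 partway through training.
success = [0.0, 0.0, 1.0, 1.0, 1.0]
print(ewm_mean(success, span=5))
```

A larger `span` gives a smoother but slower-reacting curve, which matters when only 20 iterations are plotted.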
    



In [None]:
!pip install seaborn