# Imitation Learning: Behavioural Cloning and the DAGGER Algorithm


<br>

In this exercise you'll use imitation learning to teach a student policy to mimic an expert demonstrator. This is an important technique in robotics research.

We'll first try the behavioural cloning technique, which is a simple baseline for imitation learning. It can generate good policies, but they typically can't recover after making mistakes.

We'll then try the DAGGER algorithm, which results in policies that can recover from their mistakes!

You can then try the exercises at the end.


1. Setup
2. View the student and expert policies
3. Run Behavioural Cloning
4. Run the DAGGER algorithm
5. Exercises!

<br>
<img src="flagrun_adv_fallover.gif" alt="Drawing" style="width: 400px;"/>
<br>

**This notebook doesn't need a GPU! You should be able to run it on a laptop CPU.**

## 1.1 Import the Necessary Packages

In [None]:
# Add the parent directory to the path so the package can be found.
# Only needed for a source build; a pip install doesn't need it.
import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(0, parentdir)

import gym
import numpy as np
import pybullet_envs
import pybullet as p
import os.path
import time
import torch
torch.manual_seed(0)


## 1.2 Instantiate the Environment and Expert Demonstrator

We have two versions of the environment:

- `env_flagrun_with_rendering` which has a GUI for visualization
- `env_flagrun_without_rendering` which runs very quickly but doesn't show any visualization.

These will be useful later!

**If you ever get an error from the physics server, you'll have to run this cell again.**

In [None]:
from utils import rollout_for_one_episode, rollout_for_n_episodes

# shutdown any physics clients that already exist
try:
    p.disconnect()
except p.error:
    # no physics client was connected, so there is nothing to shut down
    pass

# build the two versions of the environment
env_flagrun_with_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")
env_flagrun_with_rendering.render(mode="human")
env_flagrun_without_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")


# instantiate the expert
from flagrun_expert_demonstrator import ExpertPolicy
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                              env_flagrun_with_rendering.action_space)

## 2.1 Evaluate the Untrained Student Policy

Our student policy is a two-layer neural net.

Let's run the untrained student for 30 episodes, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.

In [None]:
# instantiate an untrained student policy
from model import StudentPolicy as StudentPolicy
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  


rollout_data = rollout_for_n_episodes(n = 30,
                                      policy = student_policy,
                                      env = env_flagrun_without_rendering,
                                      render=False)

mean_student_score = np.mean(rollout_data['scores'])

print('Average Untrained Student Score:', mean_student_score)

import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(rollout_data['scores'])
plt.title('Histogram of Untrained Student Scores over 30 Episodes')

We ran the untrained student policy for 1000 episodes so you can get a less noisy idea of the distribution of scores:

![](student_score_histogram.png)
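More episodes give a less noisy estimate because the standard error of the mean shrinks roughly like 1/sqrt(n). The sketch below is one way to summarise a list of episode scores; `summarise_scores` is a hypothetical helper (not part of `utils`), and the scores are made up for illustration.

```python
import math
import statistics

def summarise_scores(scores):
    """Return the mean, sample standard deviation and standard error of the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two episode scores")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)       # sample standard deviation (divides by n - 1)
    sem = std / math.sqrt(len(scores))   # shrinks like 1 / sqrt(n)
    return {"n": len(scores), "mean": mean, "std": std, "sem": sem}

# Made-up scores, for illustration only (not real results from this notebook).
summary = summarise_scores([10.0, 20.0, 30.0, 40.0])
print(summary)
```

In the notebook you could try `summarise_scores(list(rollout_data['scores']))`, assuming `rollout_data['scores']` holds one number per episode.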

## 2.2 Watch the Untrained Student Policy


Now we'll use the first version of the environment, `env_flagrun_with_rendering`, which runs in real-time and creates a GUI for visualization.

Watch a few rollouts in the GUI: the humanoid will fall to the floor because the policy hasn't been trained yet.

In [None]:
# view the untrained student policy (it should flop on the floor!)
rollout_data = rollout_for_n_episodes(n = 3,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

## 2.3 Evaluate the Expert Demonstrator

We have a pretrained expert policy from the pybullet gym. Let's run this expert ten times, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.

In [None]:
expert_policy = ExpertPolicy(env_flagrun_without_rendering.observation_space,
                             env_flagrun_without_rendering.action_space)
rollout_data = rollout_for_n_episodes(n = 10,
                                      policy = expert_policy,
                                      env = env_flagrun_without_rendering,
                                      render=False)


import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(rollout_data['scores'])
plt.title('Histogram of Expert Scores over 10 episodes')

We ran the expert policy for 1000 episodes so you get a less noisy idea of the score:

![](expert_score_histogram.png)

## 2.4 Watch the Expert Demonstrator

Plotting the agent's score can be useful for a quantitative comparison, but it's always important to visualize a rollout so you know what's actually happening.

Run the next cell to visualize the expert demonstrator!

You can click+drag on the expert to knock it over and see how it recovers.

You should be able to observe three distinct behaviours:

- Running towards a target
- Changing direction
- Getting up after a fall

Later, we'll train the student policy to imitate the expert.

CTRL+drag in the GUI to rotate the view.


In [None]:
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                                    env_flagrun_with_rendering.action_space)
           
rollout_data = rollout_for_n_episodes(1,
                       flagrun_expert,
                       env = env_flagrun_with_rendering,
                       render=True)

We'll aim to hit an average score of about 500 with our student policy.

## 3.1 Train the Student Policy with Behavioural Cloning

In behavioural cloning, we run the expert policy and record all the *[state, action]* pairs. We then train a student policy (with supervised learning!) to directly imitate the expert's actions.

We've provided a helper function, `train_model`, which will train the student policy with the recorded expert *[state, action]* pairs.
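To give a rough picture of what this kind of supervised training involves, here is a minimal sketch that regresses expert actions from states with a mean-squared-error loss. It is not the implementation of `train_model`: the data is synthetic, the linear "expert" is hypothetical, and the network sizes, optimiser and learning rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for recorded expert data: 256 states of dimension 4,
# with 2-D actions produced by a hypothetical linear "expert".
states = torch.randn(256, 4)
expert_weights = torch.tensor([[1.0, -2.0], [0.5, 0.0], [0.0, 1.5], [-1.0, 1.0]])
expert_actions = states @ expert_weights

# A small two-layer student network (sizes chosen arbitrarily).
student = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimiser = torch.optim.Adam(student.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

with torch.no_grad():
    initial_loss = loss_fn(student(states), expert_actions).item()

# Supervised learning: make the student's actions match the expert's.
for epoch in range(200):
    optimiser.zero_grad()
    loss = loss_fn(student(states), expert_actions)
    loss.backward()
    optimiser.step()

final_loss = loss.item()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```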

In [None]:
from utils import train_model as train_model
from utils import Dataset as Dataset

def behavioural_cloning(expert_policy, student_policy):
    '''
    Given an expert demonstrator and a student policy, record expert
    rollouts and train the student to imitate the expert's actions.
    '''
    # collect expert demonstrations
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(30,
                       expert_policy,
                       env = env_flagrun_without_rendering,
                       render=False)
    
    # train student policy with supervised learning
    print('Training Student Model')
    student_policy = train_model(student_policy, expert_rollout_data, num_epochs = 20)
    return student_policy


student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

# Now run behavioural cloning
student_policy = behavioural_cloning(flagrun_expert, student_policy)



## 3.2 Evaluate the Trained Student Policy

Let's record the average score so we can quantitatively compare the student to the expert.

In [None]:
student_rollout_data = rollout_for_n_episodes(n=10,
                                            policy = student_policy,
                                            env = env_flagrun_without_rendering,
                                            render=False)

import matplotlib.pyplot as plt
%matplotlib inline
# plot the student's scores (not the expert's scores left over in rollout_data)
plt.hist(student_rollout_data['scores'])
plt.title('Histogram of Behavioural Cloning-Trained Student Scores over 10 episodes')

We ran the behavioural cloning-trained student policy for 1000 episodes so you get a less noisy idea of the score:

![](student_bc_score_histogram.png)

Compare this to the expert!

## 3.3 Watch the Trained Student Policy

Visualize the student: how does it compare qualitatively to the expert?

Remember that you can click+drag on the humanoid in the GUI to knock it over and see how it recovers.

Note that behavioural cloning works surprisingly well! The student policy should be able to run and turn.

However, if the student falls over, it has trouble getting back up, because the expert doesn't fall often enough to produce much training data for recovery!

The expert is too good to fail so the student never learns how to recover.

In [None]:
rollout_for_n_episodes(n = 10,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

## 4.1 Train the Agent with the DAGGER Algorithm

How can we train the student to get back up when it falls?

One way is to ask the expert *what it would have done if it had fallen*, and then train on that data.

This is the essence of the DAGGER algorithm!

The DAGGER algorithm works as follows:

1. Train a student policy with behavioural cloning.
2. Roll out the trained student policy and record all the states it visits (it'll fall over a lot!).
3. Run these states through the expert policy (asking the expert what it *would have done*) to generate expert actions.
4. Use the new *[state, expert_action]* pairs as extra training data for another iteration of behavioural cloning.

Steps 2 to 4 can be repeated many times.

As before, the helper function `train_model` trains the student policy on the recorded *[state, action]* pairs.

The first few iterations should give a similar result to behavioural cloning.

By the final iteration, the agent should be able to stand up when it falls!
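The data flow of DAGGER can be sketched in a toy 1-D world using only the standard library. Everything here is a hypothetical stand-in: the expert is a simple linear rule, the "student" is a single gain fitted by least squares, and the dynamics are invented. The sketch only illustrates the roll out, relabel, aggregate and retrain loop; it is not the humanoid setup used in this notebook.

```python
import random

def expert_action(state):
    # Hypothetical expert: push the state back toward zero.
    return -0.5 * state

def rollout(policy, n_steps=20, seed=0):
    """Run a policy in a toy 1-D world; return the visited states and chosen actions."""
    rng = random.Random(seed)
    state = rng.uniform(-1.0, 1.0)
    states, actions = [], []
    for _ in range(n_steps):
        action = policy(state)
        states.append(state)
        actions.append(action)
        state = state + action + rng.gauss(0.0, 0.05)  # invented noisy dynamics
    return states, actions

def fit_linear_policy(states, actions):
    """Least-squares fit of a gain so that action is approximately gain * state."""
    gain = sum(s * a for s, a in zip(states, actions)) / sum(s * s for s in states)
    return (lambda state: gain * state), gain

def dagger_toy(n_iterations=3, n_steps=20):
    # 1. Behavioural cloning on expert demonstrations.
    states, actions = rollout(expert_action, n_steps, seed=0)
    student, gain = fit_linear_policy(states, actions)
    for i in range(n_iterations):
        # 2. Roll out the student and record the states it visits.
        visited, _ = rollout(student, n_steps, seed=i + 1)
        # 3. Ask the expert what it would have done in those states, and aggregate.
        states = states + visited
        actions = actions + [expert_action(s) for s in visited]
        # 4. Retrain the student on the aggregated dataset.
        student, gain = fit_linear_policy(states, actions)
    return gain, len(states)

gain, dataset_size = dagger_toy()
print(gain, dataset_size)
```

Note that the student's own actions from step 2 are discarded: only the visited states are kept, and the labels always come from the expert.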

In [None]:
import torch
from utils import train_model

def evaluate_expert(policy, data):
    '''
    Evaluate an expert policy on a list of recorded observations.
    What would it have done?
    '''
    actions = []
    with torch.no_grad():  # no gradients are needed when only querying the expert
        for obs in data:
            obs = torch.as_tensor(obs, dtype=torch.float32)
            predicted_action = policy(obs)
            actions.append(predicted_action.numpy())
    return actions

def dagger(expert_policy, student_policy, n_dagger_iterations, env=env_flagrun_without_rendering):
    '''
    Given an expert demonstrator and a student policy, perform
    n_dagger_iterations iterations of DAGGER.
    '''
    # collect initial expert demonstrations
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(n=10,
                                            policy = expert_policy,
                                            env = env,
                                            render=False)

    # train initial student model with behavioural cloning
    print('Training Student')
    training_data = expert_rollout_data
    student_policy = train_model(student_policy, training_data, num_epochs = 10)

    for i in range(n_dagger_iterations):
        print('Iteration', i, 'of DAGGER')

        # roll out the current student model
        student_rollout_data = rollout_for_n_episodes(n=10,
                                            policy = student_policy,
                                            env = env,
                                            render=False)

        # ask the expert what it would have done in the states the student visited
        expert_corrections = evaluate_expert(expert_policy, student_rollout_data['observations'])

        # aggregate the relabelled student states with the existing dataset
        training_data = {'observations': training_data['observations'] + student_rollout_data['observations'],
                         'actions':      training_data['actions']      + expert_corrections}
        assert len(training_data['observations']) == len(training_data['actions'])

        # retrain the student with behavioural cloning on the aggregated dataset
        student_policy = train_model(student_policy, training_data, num_epochs = 30)

    return student_policy


# instantiate a new untrained student policy
torch.manual_seed(0)
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

student_policy = dagger(flagrun_expert,
                        student_policy,
                        n_dagger_iterations = 10,
                        env = env_flagrun_without_rendering)


**Note: we've provided you with a student model pre-trained with DAGGER. Run the following cell to load it:**

In [None]:
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  
student_policy.load_state_dict(torch.load('trained_dagger_agent.pt'))


## 4.2 Evaluate the DAGGER-Trained Student Policy

Let's record the average score so we can quantitatively compare the student to the expert and to the behavioural-cloning student.

In [None]:
rollout_data = rollout_for_n_episodes(n = 100,
                                       policy = student_policy,
                                       env = env_flagrun_without_rendering,
                                       render=False)
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(rollout_data['scores'], bins = 30)
plt.title('Histogram of DAGGER-trained student scores over 100 episodes')

We ran the DAGGER-trained student policy for 1000 episodes so you get a less noisy idea of the score:

![](student_dagger_score_histogram.png)

Compare this to the expert! The shape of the curve doesn't particularly matter (you'll get a different curve each time you run).

## 4.3 Watch the DAGGER-Trained Student Policy

You should be able to see the agent stand up after falling! Remember you can click+drag on the humanoid in the GUI to knock it over and see how it recovers.

In [None]:
rollout_for_n_episodes(n = 5,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

## 5. Explore

In this exercise, we have implemented the behavioural cloning and DAGGER algorithms, and demonstrated how imitation learning can train policies that *recover from their mistakes!*

To continue your learning, you are encouraged to complete any (or all!) of the following tasks:

### 5.1 Adversarial Environments

Another way to teach student policies to recover is to make an environment so difficult that it *forces the expert to fail!*

You can then see what the expert does to recover, and use this as training data.

Try this out! The `HumanoidFlagrunHarderBulletEnv-v0` environment is the same as before, except fast blocks are thrown directly at the humanoid to knock it over. Gather expert data in this environment, and then test in the original `HumanoidFlagrunBulletEnv-v0` environment.

Now you should be able to train a flagrun agent that can recover from a fall with basic behavioural cloning, without needing DAGGER! Try it!
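One possible way to organise this experiment is sketched below. `clone_from_harder_env` is a hypothetical helper, not part of the provided `utils`; it takes the environment factory, rollout function and training function as arguments, and assumes they accept the same keyword arguments used for `rollout_for_n_episodes` and `train_model` earlier in this notebook.

```python
def clone_from_harder_env(make_env, rollout_fn, train_fn, expert, student,
                          n_episodes=30, num_epochs=20):
    """Behavioural cloning with expert data gathered in the harder environment."""
    # Gather expert demonstrations where blocks knock the expert over.
    hard_env = make_env("HumanoidFlagrunHarderBulletEnv-v0")
    expert_data = rollout_fn(n=n_episodes, policy=expert, env=hard_env, render=False)

    # Plain behavioural cloning on that data (no DAGGER iterations).
    student = train_fn(student, expert_data, num_epochs=num_epochs)

    # Test the trained student in the original environment.
    test_env = make_env("HumanoidFlagrunBulletEnv-v0")
    eval_data = rollout_fn(n=n_episodes, policy=student, env=test_env, render=False)
    return student, eval_data
```

In the notebook, a call might look like `clone_from_harder_env(gym.make, rollout_for_n_episodes, train_model, flagrun_expert, student_policy)`, with a freshly instantiated `student_policy`.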

### 5.2 Imitation as an objective during RL training

We've considered two basic approaches to imitation learning that don't require intensive RL algorithms like those you have used in previous sections. Instead, they rely on direct access to the observations and actions of an expert demonstrator (e.g. recorded commands from a car's data bus as a human demonstrator drives around!).

These approaches won't work in other situations. For instance, imagine you're trying to train a [humanoid robot](https://www.youtube.com/watch?v=LikxFZZO2sk) with imitation learning. Where is the expert data? We can watch a person running, but we can't record their joint torques! (And even if we could, they wouldn't transfer to a robot.)

More powerful imitation learning approaches can be used in this situation to *guide* the reinforcement learning process.

One example is [DEEPMIMIC](https://bair.berkeley.edu/blog/2018/04/10/virtual-stuntman/), which combines a motion-imitation objective with the task objective, and ends up producing motions that are far more natural-looking than with [traditional RL approaches](https://www.youtube.com/watch?v=hx_bgoTF7bs).
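The idea of combining the two objectives can be sketched as a weighted sum of an imitation reward and a task reward. The function names, the exponential form of the imitation term, and the weights below are illustrative assumptions, not values or code taken from the DeepMimic repository.

```python
import math

def imitation_reward(agent_pose, reference_pose, scale=2.0):
    """Reward close to 1 when the agent's pose matches the reference motion."""
    squared_error = sum((a - r) ** 2 for a, r in zip(agent_pose, reference_pose))
    return math.exp(-scale * squared_error)

def combined_reward(imitation, task, imitation_weight=0.7, task_weight=0.3):
    """Weighted sum of the imitation and task rewards (weights are illustrative)."""
    return imitation_weight * imitation + task_weight * task

# Made-up poses: a perfect match gives the maximum imitation reward of 1.0.
perfect = imitation_reward([0.1, 0.2], [0.1, 0.2])
print(combined_reward(perfect, task=0.5))
```

The RL algorithm then maximises this combined reward, so the agent is pulled toward the reference motion while still being rewarded for completing the task.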

Read about this approach. Can you get the [code](https://github.com/xbpeng/DeepMimic) working?
