# Imitation Learning: Behavioural Cloning and the DAGGER Algorithm


In this lesson you'll use imitation learning to teach a student policy to mimic an expert demonstrator! This is an important technique in robotics research.

We'll consider two basic approaches to imitation learning that don't require intensive RL algorithms like those you have used in previous sections. These only work when you have direct access to the observation space and action space of an expert demonstrator (e.g. recorded commands from a car's data bus as a human demonstrator drives!).

We'll first try the behavioural cloning technique, which is a simple baseline for imitation learning. It can generate good policies, but they typically can't recover after making mistakes.

We'll then try the DAGGER algorithm, which results in policies that can recover from their mistakes!

You can then try the exercises at the end.


1. Setup
2. View the student and expert policies
3. Run Behavioural Cloning
4. Run the DAGGER algorithm
5. Exercises!

**This notebook doesn't need a GPU! You should be able to run it on a laptop CPU.**

## 1.1 Import the Necessary Packages

In [1]:
#add parent dir to find package. Only needed for source code build, pip install doesn't need it.
import os, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
os.sys.path.insert(0,parentdir)

import gym
import numpy as np
import pybullet_envs
import pybullet as p
import os.path
import time
import torch
torch.manual_seed(0)


<torch._C.Generator at 0x113395850>

## 1.2 Instantiate the Environment and Expert Demonstrator

We have two versions of the environment:

- `env_flagrun_with_rendering` which has a GUI for visualization
- `env_flagrun_without_rendering` which runs very quickly but doesn't show any visualization.

These will be useful later!

**If you ever get an error from the physics server, you'll have to run this cell again.**

In [4]:
from utils import rollout_for_one_episode, rollout_for_n_episodes

# shut down any physics clients that already exist
try:
    p.disconnect()
except p.error:
    # no client was connected, so there is nothing to shut down
    pass

# build the two versions of the environment
env_flagrun_with_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")
env_flagrun_with_rendering.render(mode="human")
env_flagrun_without_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")


# instantiate the expert
from flagrun_expert_demonstrator import *
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                                    env_flagrun_with_rendering.action_space)
           

WalkerBase::__init__ start
WalkerBase::__init__ start




## 2.1 Watch the Untrained Student Policy

Our student policy is a two-layer neural net.
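The real `StudentPolicy` lives in `model.py` and isn't shown in this notebook. As a rough illustration only, a two-layer network that maps an observation vector to an action vector might look like the NumPy sketch below. The class name `TwoLayerPolicy`, the hidden size, the `tanh` activations and the 44/17 dimensions are all assumptions made for this example, not the notebook's actual implementation.

```python
import numpy as np

class TwoLayerPolicy:
    """Hypothetical two-layer MLP: obs -> hidden (tanh) -> action (tanh)."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights and zero biases (illustrative initialisation).
        self.w1 = rng.normal(0.0, 0.1, size=(obs_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden_dim, act_dim))
        self.b2 = np.zeros(act_dim)

    def __call__(self, obs):
        hidden = np.tanh(obs @ self.w1 + self.b1)
        # The final tanh keeps every action component inside [-1, 1].
        return np.tanh(hidden @ self.w2 + self.b2)

# Dimensions chosen for illustration only.
policy = TwoLayerPolicy(obs_dim=44, act_dim=17)
action = policy(np.zeros(44))
print(action.shape)
```

An untrained network like this outputs near-arbitrary actions, which is why the humanoid flops over before training.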

We'll use the first version of the environment, `env_flagrun_with_rendering`, which runs in real-time and creates a GUI for visualization.

Watch a few rollouts in the GUI: the humanoid will fall to the floor because the policy hasn't been trained yet.

In [5]:
# instantiate an untrained student policy
from model import StudentPolicy as StudentPolicy
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  


# view the untrained student policy (it should flop on the floor!)
rollout_data = rollout_for_n_episodes(n = 3,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

episode 0
score=67.09 in 39 frames


KeyboardInterrupt: 

## 2.2 Evaluate the Untrained Student Policy


Let's run the untrained student ten more times, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.
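The helper `rollout_for_n_episodes` comes from `utils` and its source isn't shown here. The sketch below is one plausible shape for such a loop, assuming the classic gym `reset()`/`step()` interface and a returned dictionary with `'observations'`, `'actions'` and `'scores'` keys. `CountdownEnv` and `rollout_sketch` are hypothetical names invented for this example.

```python
class CountdownEnv:
    """Tiny stand-in environment using the classic gym reset/step API."""

    def reset(self):
        self.steps_left = 5
        return [float(self.steps_left)]

    def step(self, action):
        self.steps_left -= 1
        obs = [float(self.steps_left)]
        reward = 1.0 if action == 0 else 0.0
        done = self.steps_left == 0
        return obs, reward, done, {}

def rollout_sketch(n, policy, env):
    """Hypothetical rollout loop: run n episodes and log what happened."""
    data = {'observations': [], 'actions': [], 'scores': []}
    for _ in range(n):
        obs, done, score = env.reset(), False, 0.0
        while not done:
            action = policy(obs)
            # Record the pair *before* stepping, so obs matches the action.
            data['observations'].append(obs)
            data['actions'].append(action)
            obs, reward, done, _ = env.step(action)
            score += reward
        data['scores'].append(score)
    return data

data = rollout_sketch(n=3, policy=lambda obs: 0, env=CountdownEnv())
print(data['scores'])
```

The per-episode score is just the sum of rewards, which is what we average in the cell below.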

In [None]:
rollout_data = rollout_for_n_episodes(n = 10,
                                      policy = student_policy,
                                      env = env_flagrun_without_rendering,
                                      render=False)

mean_student_score = np.mean(rollout_data['scores'])

print('Average Student Score:', mean_student_score)

We ran the untrained student policy for 1000 iterations and recorded the scores so you can get a less-noisy idea of the score:

![](student_score_histogram.png)
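To see why ten episodes give a noisy estimate, it helps to look at the standard error of the mean, which shrinks roughly with the square root of the number of episodes. The sketch below uses synthetic scores drawn from a made-up Gaussian. They are not results from this notebook, and `mean_and_standard_error` is a hypothetical helper.

```python
import random
import statistics

def mean_and_standard_error(scores):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem

# Synthetic scores for illustration only (NOT results from the notebook).
rng = random.Random(0)
fake_scores = [rng.gauss(50.0, 40.0) for _ in range(1000)]

mean_10, sem_10 = mean_and_standard_error(fake_scores[:10])
mean_1000, sem_1000 = mean_and_standard_error(fake_scores)
print(f"n=10:   mean={mean_10:.1f} +/- {sem_10:.1f}")
print(f"n=1000: mean={mean_1000:.1f} +/- {sem_1000:.1f}")
```

With 100 times more episodes, the uncertainty on the mean drops by about a factor of ten.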

## 2.3 Watch the Expert Demonstrator

Now we'll visualize the expert demonstrator!

You should be able to observe three distinct behaviours:

- Running towards a target
- Changing direction
- Getting up after a fall

Later, we'll train the student policy to imitate the expert.

CTRL+drag in the GUI to rotate the view. You can click+drag on the expert to knock it over and see how it recovers.


In [None]:
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                                    env_flagrun_with_rendering.action_space)
           
rollout_data = rollout_for_n_episodes(1,
                       flagrun_expert,
                       env = env_flagrun_with_rendering,
                       render=True)

## 2.4 Evaluate the Expert Demonstrator

Let's run the expert ten more times, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.

In [None]:
rollout_data = rollout_for_n_episodes(n = 10,
                                      policy = ExpertPolicy(env_flagrun_without_rendering.observation_space,
                                                            env_flagrun_without_rendering.action_space),
                                      env = env_flagrun_without_rendering,
                                      render=False)

mean_expert_score = np.mean(rollout_data['scores'])
std_expert_score = np.std(rollout_data['scores'])

print('Average Expert Score:', mean_expert_score, 'Standard Deviation in Expert Score:', std_expert_score)

We ran the expert policy for 1000 iterations so you get a clearer picture:


![](expert_score_histogram.png)

We'll aim to hit an average score of about 500 with our student policy.

## 3.1 Train the Student Policy with Behavioural Cloning

In behavioural cloning, we run the expert policy and record all the *[state, action]* pairs. We then train a student policy (with supervised learning!) to directly imitate the expert's actions.

We've provided a helper function, `train_model`, which will train the student policy with the recorded expert *[state, action]* pairs.
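The source of `train_model` isn't shown in this notebook. To make the "supervised learning" step concrete, here is a minimal sketch that fits a linear policy to recorded pairs by gradient descent on the mean squared error. The function `fit_linear_policy`, the toy linear expert and all hyperparameters are assumptions for illustration, and the real helper presumably trains the neural-network student instead.

```python
import numpy as np

def fit_linear_policy(observations, actions, num_epochs=500, lr=0.1):
    """Hypothetical stand-in for `train_model`: minimise the squared error
    between predicted and expert actions with plain gradient descent."""
    obs = np.asarray(observations, dtype=float)
    act = np.asarray(actions, dtype=float)
    weights = np.zeros((obs.shape[1], act.shape[1]))
    for _ in range(num_epochs):
        error = obs @ weights - act             # shape (N, act_dim)
        grad = 2.0 * obs.T @ error / len(obs)   # gradient of the averaged squared error
        weights -= lr * grad
    return weights

# Toy "expert": a fixed linear map from 3-D observations to 2-D actions.
rng = np.random.default_rng(0)
expert_weights = np.array([[1.0, -0.5], [0.0, 2.0], [0.5, 0.5]])
demo_obs = rng.normal(size=(200, 3))
demo_act = demo_obs @ expert_weights            # the recorded [state, action] pairs

student_weights = fit_linear_policy(demo_obs, demo_act)
mse = np.mean((demo_obs @ student_weights - demo_act) ** 2)
print("final MSE:", mse)
```

Note that the student only ever sees states the expert visited. Nothing in this loss tells it what to do anywhere else.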

In [None]:
from utils import train_model as train_model
from utils import Dataset as Dataset

def behavioural_cloning(expert_policy, student_policy, n=10):
    '''
    Given an expert demonstrator and a student policy, roll out the
    expert for n episodes and train the student on the recorded
    [state, action] pairs with supervised learning.
    '''
    # collect expert demonstrations
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(n,
                       expert_policy,
                       env = env_flagrun_without_rendering,
                       render=False)
    
    # train student policy with supervised learning
    print('Training Student Model')
    student_policy = train_model(student_policy, expert_rollout_data, num_epochs = 300)
    return student_policy


student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

# Now run behavioural cloning and keep the trained policy that is returned
student_policy = behavioural_cloning(flagrun_expert, student_policy)



## 3.2 Watch the Trained Student Policy

Note that behavioural cloning works surprisingly well! The student policy should be able to run and turn.

However, if the student falls over, it can't get back up, because the expert doesn't fall often enough to produce much training data!

The expert is too good to fail so the student never learns how to recover.

In [None]:
rollout_for_n_episodes(n = 5,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

## 3.3 Evaluate the Trained Student Policy

In [None]:
# evaluate the trained student policy without rendering (it should run, but fall over!)

student_rollout_data = rollout_for_n_episodes(n=10,
                                            policy = student_policy,
                                            env = env_flagrun_without_rendering,
                                            render=False)

mean_student_score = np.mean(student_rollout_data['scores'])
std_student_score = np.std(student_rollout_data['scores'])
print('Average Student Score:', mean_student_score, 'Standard Deviation in Student Score:', std_student_score)

We ran the behavioural cloning-trained student policy for 1000 iterations so you get a less noisy idea of the score:

![](student_bc_score_histogram.png)

Compare this to the expert!

## 4.1 Train the Agent with the DAGGER Algorithm

How can we train the student to get back up when it falls?

One way is to ask the expert *what it would have done*, and then train on that data.

This is the essence of the DAGGER algorithm!

In the DAGGER algorithm, we first train a student policy with behavioural cloning. We then roll out this trained student policy and record all the states it visits. Next, we run these states through the expert policy (asking the expert what it *would have done*) to generate expert actions, and use the new [state, expert_action] pairs as extra training data for another iteration of behavioural cloning. We can repeat this loop many times.
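The loop described above can be sketched on a toy problem before running it on the humanoid. Everything here is hypothetical: a 1-D world with random pushes, an expert that always steps back towards the origin, and a tabular student that memorises labels and does nothing (action 0) in states it has never seen. It is meant to show the data flow (roll out student, relabel with expert, aggregate, retrain), not the notebook's implementation.

```python
import random

def expert_action(x):
    """Toy expert: always step back towards the origin."""
    return (x < 0) - (x > 0)   # +1 if x < 0, -1 if x > 0, 0 at the origin

def rollout(policy, rng, n_steps=50):
    """Run one episode from x = 0 and return the list of visited states."""
    x, visited = 0, []
    for _ in range(n_steps):
        visited.append(x)
        push = rng.choice([-2, -1, 0, 0, 0, 1, 2])   # random disturbance
        x = max(-10, min(10, x + policy(x) + push))
    return visited

def fit_table(states, actions):
    """'Supervised learning' for a tabular student: memorise the labels.
    Unseen states fall back to action 0 (the student does nothing)."""
    table = dict(zip(states, actions))
    return lambda x: table.get(x, 0)

rng = random.Random(0)

# Step 1: behavioural cloning on expert demonstrations.
states = rollout(expert_action, rng)
actions = [expert_action(x) for x in states]
student = fit_table(states, actions)

# Repeat: roll out the student, ask the expert what it would have done
# in those states, add the pairs to the dataset, and retrain.
for iteration in range(5):
    visited = rollout(student, rng)
    states += visited
    actions += [expert_action(x) for x in visited]
    student = fit_table(states, actions)
    print(iteration, "distinct states in dataset:", len(set(states)))
```

The key detail is that the states come from the student's rollouts while the labels come from the expert, so the dataset grows to cover exactly the situations the student gets itself into.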

We've provided a helper function, `train_model`, which will train the student policy with the recorded expert *[state, action]* pairs.

The first iteration should give the same result as behavioural cloning.

By the final iteration, the agent should be able to stand up when it falls!

In [None]:
import torch
from utils import train_model

def evaluate_expert(policy, data):
    '''
    Evaluate an expert policy on a list of recorded observations.
    What would it have done?
    '''
    actions = []
    for obs in data:
        # plain tensors are enough; the Variable wrapper is no longer needed
        obs = torch.Tensor(obs)
        predicted_action = policy(obs)
        # detach from the autograd graph before converting to numpy
        actions.append(predicted_action.detach().numpy())
    return actions

def dagger(expert_policy, student_policy, n_dagger_iterations, env=env_flagrun_without_rendering, n=10):
    '''
    Given an expert demonstrator and a student policy, perform
    n_dagger_iterations iterations of DAGGER, rolling out n episodes
    each time data is collected.
    '''
    # collect initial expert demonstrations
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(n=n,
                                            policy = expert_policy,
                                            env = env,
                                            render=False)
    print('Training Student')
    # train initial student model with behavioural cloning
    training_data = expert_rollout_data
    student_policy = train_model(student_policy, expert_rollout_data)
    
    for i in range(n_dagger_iterations):
        print('Iteration', i, 'of DAGGER')
        
        # roll out the current student model (without rendering, for speed)
        student_rollout_data = rollout_for_n_episodes(n=n,
                                            policy = student_policy,
                                            env = env,
                                            render=False)
        
        # evaluate expert actions on student's trajectories and add to dataset
        expert_corrections = evaluate_expert(expert_policy, student_rollout_data['observations'])
        training_data = {'observations': training_data['observations'] + student_rollout_data['observations'],
                         'actions':      training_data['actions']      + expert_corrections}
        # retrain a fresh student on the aggregated dataset with behavioural cloning
        student_policy = StudentPolicy(env.observation_space,
                                       env.action_space)
        student_policy = train_model(student_policy, training_data, num_epochs = 500)
        
    return student_policy


# instantiate a new untrained student policy
torch.manual_seed(0)
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

# keep the returned policy: dagger() builds a fresh student on each iteration
student_policy = dagger(flagrun_expert,
       student_policy,
       n_dagger_iterations = 2,
       env = env_flagrun_without_rendering)


Rolling Out Expert
episode 0
score=1301.41 in 1000 frames
episode 1
score=629.01 in 1000 frames
episode 2
score=543.10 in 1000 frames
episode 3
score=676.28 in 1000 frames
episode 4
score=904.47 in 1000 frames
episode 5
score=416.10 in 1000 frames
episode 6
score=633.28 in 1000 frames
episode 7
score=822.85 in 1000 frames
episode 8
score=499.05 in 1000 frames
episode 9
score=539.21 in 1000 frames
Training Student
Epoch: 50, Total loss: 0.02372540906071663
Iteration 0 of DAGGER
episode 0
score=148.70 in 89 frames
episode 1
score=-40.43 in 24 frames
episode 2
score=94.32 in 52 frames
episode 3
score=-602.38 in 117 frames
episode 4
score=80.15 in 57 frames
episode 5
score=575.75 in 412 frames
episode 6
score=169.31 in 91 frames
episode 7
score=129.13 in 75 frames
episode 8
score=30.42 in 68 frames
episode 9
score=832.12 in 258 frames
Epoch: 50, Total loss: 0.36704450845718384
Epoch: 100, Total loss: 0.1790788620710373
Epoch: 150, Total loss: 0.1599975973367691
Epoch: 200, Total loss: 0.15

## 4.2 Watch the DAGGER-trained student policy

You should be able to see the agent stand up! Remember you can click+drag on the humanoid in the GUI to knock it over and see how it recovers.

In [None]:
# view the DAGGER-trained student policy (it should run, and get back up after a fall!)
rollout_for_n_episodes(n = 5,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)


## 5. Explore

In this exercise, we implemented the behavioural cloning and DAGGER algorithms, and demonstrated how to use imitation learning to train policies that can *recover from their mistakes!*

To continue your learning, you are encouraged to complete any (or all!) of the following tasks:

### 5.1 Adversarial Environments

Another way to teach student policies to recover is to make an environment so difficult that it *forces the expert to fail!*

You can then see what they do to recover, and use this as training data.

Try this out! The `HumanoidFlagrunHarderBulletEnv-v0` environment is the same as before, except fast blocks are thrown directly at the humanoid to knock it over. Replace the environment in section 1.2. The expert will fall over a lot!

Now you should be able to train a flagrun agent that can recover from a fall with basic behavioural cloning, without needing DAGGER! Try it!

### 5.2 Imitation as an objective during RL training

We've considered two basic approaches to imitation learning that don't require intensive RL algorithms like those you have used in previous sections, instead relying upon having direct access to the observation space and action space of an expert demonstrator (e.g. recorded commands from a car's data bus as a human demonstrator drives around!).

These approaches won't work in other situations. For instance, imagine you're trying to train a [humanoid robot](https://www.youtube.com/watch?v=LikxFZZO2sk) with imitation learning. Where is the expert data? We can watch a person running, but we can't record their joint torques! (And even if we could, they wouldn't transfer to a robot.)

More powerful imitation learning approaches can be used in this situation to *guide* the reinforcement learning process.

One example is [DEEPMIMIC](https://bair.berkeley.edu/blog/2018/04/10/virtual-stuntman/), which combines a motion-imitation objective with the task objective, and ends up producing motions that are far more natural-looking than with [traditional RL approaches](https://www.youtube.com/watch?v=hx_bgoTF7bs).
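As a rough illustration of what "combining" the two objectives can mean, the sketch below forms a per-step reward as a weighted sum of an imitation term and a task term. This is a heavily simplified sketch, not the DEEPMIMIC implementation: the function names, the pose-only imitation term, the `scale` factor and the weights are all assumptions chosen for this example.

```python
import math

def imitation_reward(pose, reference_pose, scale=2.0):
    """Simplified imitation term: 1.0 for a perfect match, decaying
    towards 0 as the squared pose error grows."""
    squared_error = sum((p - r) ** 2 for p, r in zip(pose, reference_pose))
    return math.exp(-scale * squared_error)

def combined_reward(task_reward, pose, reference_pose,
                    w_imitation=0.7, w_task=0.3):
    """Weighted sum of imitation and task objectives (illustrative weights)."""
    return (w_imitation * imitation_reward(pose, reference_pose)
            + w_task * task_reward)

# Same task reward, but one pose matches the reference motion and one doesn't.
perfect = combined_reward(1.0, [0.1, 0.2], [0.1, 0.2])
sloppy = combined_reward(1.0, [0.9, -0.4], [0.1, 0.2])
print(perfect, sloppy)
```

Because the imitation term only needs a reference motion (for example, motion-capture poses) rather than expert actions, the RL algorithm is still responsible for discovering the actions that achieve it.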

Read about this approach. Can you get the 


### 5.3 (VERY HARD) Hierarchical Imitation Learning

The flagrun environment is *multitask*, which means that .

Can you use the expert data to train an agent ? This has important implications for robotics research.
