# Imitation Learning: Behavioural Cloning and the DAGGER Algorithm


In this lesson you'll use imitation learning to teach a student policy to mimic an expert demonstrator! This is an important technique in robotics research.

We'll compare two standard imitation learning approaches (ending up with a policy that can control a humanoid to run towards moving targets!) and then you can try the exercises at the end.

1. Setup
2. View the student and expert policies
3. Run Behavioural Cloning
4. Run the DAGGER algorithm
5. Exercises!


## 1.1 Import the Necessary Packages

In [1]:
#add parent dir to find package. Only needed for source code build, pip install doesn't need it.
import os, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
os.sys.path.insert(0,parentdir)

import gym
import numpy as np
import pybullet_envs
import pybullet as p
import os.path
import time
import torch
torch.manual_seed(0)


<torch._C.Generator at 0x1139bb850>

## 1.2 Instantiate the Environment and Expert Demonstrator

We have two versions of the environment:

- `env_flagrun_with_rendering` which has a GUI for visualization
- `env_flagrun_without_rendering` which runs very quickly but doesn't show any visualization.

These will be useful later!

**If you ever get an error from the physics server, you'll have to run this cell again.**

In [2]:
from utils import rollout_for_one_episode as rollout_for_one_episode
from utils import rollout_for_n_episodes as rollout_for_n_episodes

# shut down any physics client that already exists
try: p.disconnect()
except p.error: pass  # raised when no client is connected

# build the two versions of the environment
env_flagrun_with_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")
env_flagrun_with_rendering.render(mode="human")
env_flagrun_without_rendering = gym.make("HumanoidFlagrunBulletEnv-v0")


# instantiate the expert
from flagrun_expert_demonstrator import ExpertPolicy
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                              env_flagrun_with_rendering.action_space)
           

WalkerBase::__init__ start
WalkerBase::__init__ start




## 2.1 Watch the Untrained Student Policy

Our student policy is a small fully connected neural net with two hidden layers (256 and 128 units) followed by a linear output layer.

We'll use the first version of the environment, `env_flagrun_with_rendering`, which runs in real-time and creates a GUI for visualization.

Watch a few rollouts in the GUI: the humanoid will fall to the floor because the policy hasn't been trained yet.

In [3]:
# instantiate an untrained student policy
from model import StudentPolicy as StudentPolicy
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  


# view the untrained student policy (it should flop on the floor!)
rollout_data = rollout_for_n_episodes(n = 3,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

episode 0
score=44.07 in 38 frames
episode 1
score=12.65 in 20 frames
episode 2
score=7.78 in 18 frames


## 2.2 Record Data from the Untrained Student Policy


Let's run the untrained student ten more times, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.

In [4]:
rollout_data = rollout_for_n_episodes(n = 10,
                                      policy = student_policy,
                                      env = env_flagrun_without_rendering,
                                      render=False)

mean_student_score = np.mean(rollout_data['scores'])

print('Average Student Score:', mean_student_score)

episode 0
score=8.78 in 18 frames
episode 1
score=25.30 in 31 frames
episode 2
score=75.41 in 48 frames
episode 3
score=66.22 in 51 frames
episode 4
score=62.71 in 54 frames
episode 5
score=51.54 in 43 frames
episode 6
score=27.42 in 26 frames
episode 7
score=12.34 in 18 frames
episode 8
score=16.54 in 29 frames
episode 9
score=17.75 in 33 frames
Average Student Score: -203.04842826009218


We ran the untrained student policy for 1000 iterations and recorded the scores so you can get a clearer picture:

![](student_score_histogram.png)
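
As a rough illustration of how recorded scores can be summarised, here is a minimal sketch that computes the mean, the standard deviation and a coarse text histogram. The helper name `summarize_scores` and the bin width are assumptions made for this example and are not part of the notebook's `utils` module. The scores are the ten values printed in the episode log above.

```python
import statistics

BIN_WIDTH = 50.0  # assumed bin width, chosen only for illustration

def summarize_scores(scores, bin_width=BIN_WIDTH):
    """Return (mean, std, histogram); histogram maps bin start -> count."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population std, like np.std's default
    histogram = {}
    for score in scores:
        bin_start = (score // bin_width) * bin_width  # floor to the bin edge
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return mean, std, dict(sorted(histogram.items()))

# Scores copied from the ten untrained-student episodes logged above.
student_scores = [8.78, 25.30, 75.41, 66.22, 62.71,
                  51.54, 27.42, 12.34, 16.54, 17.75]

mean, std, histogram = summarize_scores(student_scores)
print(f"mean={mean:.2f} std={std:.2f}")
for bin_start, count in histogram.items():
    print(f"[{bin_start:6.0f}, {bin_start + BIN_WIDTH:6.0f}): {'#' * count}")
```

In the notebook itself, the same function could be called on `rollout_data['scores']` after any rollout.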

## 2.3 Watch the Expert Demonstrator

Now we'll visualize the expert demonstrator!

You should be able to observe three distinct behaviours:

- Running towards a target
- Changing direction
- Getting up after a fall

Later, we'll train the student policy to imitate the expert.

CTRL+drag in the GUI to rotate the view.


In [6]:
flagrun_expert = ExpertPolicy(env_flagrun_with_rendering.observation_space,
                                    env_flagrun_with_rendering.action_space)
           
rollout_data = rollout_for_n_episodes(1,
                       flagrun_expert,
                       env = env_flagrun_with_rendering,
                       render=True)

episode 0


KeyboardInterrupt: 

## 2.4 Record Data from the Expert Demonstrator

Let's run the expert ten more times, recording the reward so that we have a baseline for later.

We'll use the second version of the environment, `env_flagrun_without_rendering`, which runs very quickly but doesn't show any visualization.

In [7]:
rollout_data = rollout_for_n_episodes(n = 10,
                                      policy = ExpertPolicy(env_flagrun_without_rendering.observation_space,
                                                            env_flagrun_without_rendering.action_space),
                                      env = env_flagrun_without_rendering,
                                      render=False)

mean_expert_score = np.mean(rollout_data['scores'])
std_expert_score = np.std(rollout_data['scores'])

print('Average Expert Score:', mean_expert_score, 'Standard Deviation in Expert Score:', std_expert_score)

episode 0
score=598.72 in 1000 frames
episode 1
score=885.19 in 1000 frames
episode 2
score=460.91 in 1000 frames
episode 3
score=693.41 in 1000 frames
episode 4
score=1122.50 in 1000 frames
episode 5


KeyboardInterrupt: 

We ran the expert policy for 1000 iterations so you get a clearer picture:

![](expert_score_histogram.png)

We'll aim to hit an average score of about 500 with our student policy.

## 3.1 Train the Student Policy with Behavioural Cloning

In behavioural cloning, we run the expert policy and record all the *[state, action]* pairs. We then train a student policy (with supervised learning!) to directly imitate the expert's actions.

We've provided a helper function, `train_model`, which will train the student policy with the recorded expert *[state, action]* pairs.
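
To make the idea concrete without the simulator, here is a minimal sketch of behavioural cloning on a toy one-dimensional problem. Everything in it is an assumption made for illustration: the linear `expert_action`, the linear student and the plain gradient-descent loop stand in for the real expert, the `StudentPolicy` network and `train_model`, whose internals are not shown in this notebook.

```python
import random

def expert_action(state):
    # Hypothetical stand-in expert: a fixed linear controller.
    return -2.0 * state + 0.5

def collect_demonstrations(n_samples, rng):
    """Record (state, action) pairs by querying the expert."""
    states = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    return [(s, expert_action(s)) for s in states]

def fit_student(pairs, n_epochs=500, lr=0.1):
    """Fit a linear student a = w*s + b by gradient descent on the
    mean squared error between student and expert actions."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(n_epochs):
        grad_w = sum(2.0 * (w * s + b - a) * s for s, a in pairs) / n
        grad_b = sum(2.0 * (w * s + b - a) for s, a in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(0)
pairs = collect_demonstrations(200, rng)
w, b = fit_student(pairs)
print(f"student: a = {w:.3f} * s + {b:.3f}")
```

The student only ever sees states that the expert visited, which is the limitation that DAGGER addresses later in the lesson.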

In [17]:
from utils import train_model as train_model
from utils import Dataset as Dataset

def behavioural_cloning(expert_policy, student_policy, n=10):
    '''
    Given an expert demonstrator and a student policy, roll out the
    expert for n episodes and train the student on the recorded
    [state, action] pairs with supervised learning.
    
    '''
    # collect expert demonstrations
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(n = n,
                       policy = expert_policy,
                       env = env_flagrun_without_rendering,
                       render=False)
    
    # train student policy with supervised learning
    print('Training Student Model')
    student_policy = train_model(student_policy, expert_rollout_data, num_epochs = 300)
    return student_policy


student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

# Now run behavioural cloning
behavioural_cloning(flagrun_expert, student_policy)



Rolling Out Expert
episode 0
score=-179.61 in 1000 frames
episode 1
score=1057.79 in 1000 frames
episode 2
score=796.59 in 1000 frames
episode 3
score=36.93 in 95 frames
episode 4
score=789.77 in 1000 frames
episode 5
score=1316.14 in 1000 frames
episode 6
score=-214.79 in 1000 frames
episode 7
score=978.54 in 1000 frames
episode 8
score=894.30 in 1000 frames
episode 9
score=541.26 in 1000 frames
Training Student Model
Epoch: 0, Total loss: 0.13469792902469635
Epoch: 10, Total loss: 0.06485603749752045
Epoch: 20, Total loss: 0.053863126784563065
Epoch: 30, Total loss: 0.04694744572043419
Epoch: 40, Total loss: 0.043945662677288055
Epoch: 50, Total loss: 0.040944360196590424
Epoch: 60, Total loss: 0.04058591648936272
Epoch: 70, Total loss: 0.03734274208545685
Epoch: 80, Total loss: 0.03642949089407921
Epoch: 90, Total loss: 0.03638412430882454
Epoch: 100, Total loss: 0.03598912060260773
Epoch: 110, Total loss: 0.03455420956015587
Epoch: 120, Total loss: 0.03302914649248123
Epoch: 130, T

StudentPolicy(
  (weights_dense1): Linear(in_features=44, out_features=256, bias=True)
  (weights_dense2): Linear(in_features=256, out_features=128, bias=True)
  (weights_dense_final): Linear(in_features=128, out_features=17, bias=True)
)

## 3.2 Watch the Trained Student Policy

Note that behavioural cloning works surprisingly well!

The agent can run, but it tends to fall over.

In [18]:
# view the trained student policy (it should run, but fall over!)
rollout_for_n_episodes(n = 5,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

episode 0
score=95.85 in 200 frames
episode 1
score=76.04 in 89 frames
episode 2
score=-171.23 in 319 frames
episode 3
score=29.39 in 40 frames
episode 4
score=15.10 in 39 frames


## 3.3 Record Data from the Trained Student Policy

In [None]:
# record rollouts from the trained student policy (no rendering, so this runs quickly)

student_rollout_data = rollout_for_n_episodes(n=10,
                                            policy = student_policy,
                                            env = env_flagrun_without_rendering,
                                            render=False)

## 4.1 Train the Agent with the DAGGER Algorithm

The first iteration should give the same result as behavioural cloning.

By the second iteration, the agent should be able to turn reliably.

By iteration 4, the agent should start to get up if it falls over.
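
The loop behind these iterations can be sketched on a toy problem before reading the full implementation. The version below is a simplified illustration under stated assumptions: a one-dimensional state with dynamics `x <- x + a`, a hypothetical clipped expert, and a linear student fitted by closed-form least squares. None of these names come from the notebook's `utils` module. It keeps the essential steps: roll out the student, ask the expert to label the states the student visited, add them to the dataset, and retrain.

```python
def expert_action(x):
    # Hypothetical expert: step towards the origin, at most 1 unit per step.
    return max(-1.0, min(1.0, -x))

def fit_linear(states, actions):
    """Closed-form least squares for a = w * x + b."""
    n = len(states)
    mean_x = sum(states) / n
    mean_a = sum(actions) / n
    var_x = sum((x - mean_x) ** 2 for x in states)
    if var_x == 0.0:
        return 0.0, mean_a
    cov = sum((x - mean_x) * (a - mean_a) for x, a in zip(states, actions))
    w = cov / var_x
    return w, mean_a - w * mean_x

def rollout(policy, x0, n_steps=10):
    """Run a policy from x0 and return the states it visits."""
    states, x = [], x0
    for _ in range(n_steps):
        states.append(x)
        x = x + policy(x)
    return states

def toy_dagger(start_states, n_iterations):
    # Initial dataset: states visited by the expert, labelled by the expert.
    states = [x for x0 in start_states for x in rollout(expert_action, x0)]
    actions = [expert_action(x) for x in states]
    w, b = fit_linear(states, actions)
    sizes = [len(states)]
    for _ in range(n_iterations):
        student = lambda x, w=w, b=b: w * x + b
        # Roll out the student, then ask the expert what it would have done.
        visited = [x for x0 in start_states for x in rollout(student, x0)]
        states += visited
        actions += [expert_action(x) for x in visited]
        w, b = fit_linear(states, actions)  # retrain on the aggregated data
        sizes.append(len(states))
    return (w, b), sizes

(w, b), sizes = toy_dagger(start_states=[-3.0, -1.0, 2.0, 4.0], n_iterations=3)
print("dataset size per iteration:", sizes)
print(f"student: a = {w:.3f} * x + {b:.3f}")
```

The key difference from behavioural cloning is where the states come from: after the first round they are the student's own states, while the labels still come from the expert.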

In [24]:
import torch
from utils import train_model as train_model

def evaluate_expert(policy, data):
    '''
    Query a policy (here, the expert) for its action on each
    observation in a list of recorded observations.
    '''
    actions = []
    with torch.no_grad():  # inference only, no gradients needed
        for obs in data:
            obs = torch.as_tensor(obs, dtype=torch.float32)
            predicted_action = policy(obs)
            actions.append(predicted_action.numpy())
    return actions

def dagger(expert_policy, student_policy, n_dagger_iterations, env=env_flagrun_without_rendering):
    '''
    Given an expert demonstrator and a student policy, perform
    n_dagger_iterations iterations of DAGGER and return the
    trained student.
    
    '''
    # collect initial expert demonstrations
    n = 10
    print('Rolling Out Expert')
    expert_rollout_data = rollout_for_n_episodes(n = n,
                                            policy = expert_policy,
                                            env = env,
                                            render=False)
    print('Training Student')
    # train initial student model with behavioural cloning
    training_data = expert_rollout_data
    student_policy = train_model(student_policy, expert_rollout_data)
    
    for i in range(n_dagger_iterations):
        print('Iteration', i, 'of DAGGER')
        
        # rollout student model (renders by default! you can change this if you don't want to render)
        student_rollout_data = rollout_for_n_episodes(n = n,
                                            policy = student_policy,
                                            env = env_flagrun_with_rendering,
                                            render=True)
        
        # evaluate expert actions on student's trajectories and add to dataset
        expert_corrections = evaluate_expert(expert_policy, student_rollout_data['observations'])
        training_data = {'observations': training_data['observations'] + student_rollout_data['observations'],
                         'actions':      training_data['actions']      + expert_corrections}
        # retrain a fresh student model on the aggregated dataset
        student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  
        student_policy =  train_model(student_policy, training_data, num_epochs = 300)
        
    return student_policy


# instantiate a new untrained student policy
torch.manual_seed(0)
student_policy = StudentPolicy(env_flagrun_with_rendering.observation_space,
                               env_flagrun_with_rendering.action_space)  

# keep the returned policy: dagger() builds a fresh student on every iteration
student_policy = dagger(flagrun_expert,
                        student_policy,
                        n_dagger_iterations = 5,
                        env = env_flagrun_without_rendering)


Rolling Out Expert
episode 0
score=985.57 in 1000 frames
episode 1
score=-148.85 in 1000 frames
episode 2
score=112.08 in 1000 frames
episode 3
score=269.63 in 906 frames
episode 4
score=-4.21 in 42 frames
episode 5
score=676.83 in 1000 frames
episode 6
score=438.05 in 1000 frames
episode 7


KeyboardInterrupt: 

## 4.2 Watch the DAGGER-trained student policy

In [23]:
# view the DAGGER-trained student policy (it should turn towards targets and start to get up after a fall)
rollout_for_n_episodes(n = 5,
                       policy = student_policy,
                       env = env_flagrun_with_rendering,
                       render=True)

episode 0
score=53.28 in 107 frames
episode 1
score=-27.88 in 232 frames
episode 2


KeyboardInterrupt: 

## 5. Explore

In this exercise, we have implemented the behavioural cloning and DAGGER algorithms, and demonstrated how to use them to solve a pybullet Gym environment. To continue your learning, you are encouraged to complete any (or all!) of the following tasks:

- Repeat the experiments with the 'HumanoidFlagrunHarderBulletEnv-v0' environment. Does behavioural cloning work better for this environment than for the previous, 'HumanoidFlagrunBulletEnv-v0'? Why?
- Show how the results of behavioural cloning improve with data: plot the behavioural cloning student policy's average reward for a variety of numbers of episodes of expert data, and compare to the expert.
- Try to reduce the amount of expert data needed for DAGGER to work on 'HumanoidFlagrunHarderBulletEnv-v0'. Can you reach an average reward of 500 over ten episodes, using only a total of 100 frames of expert data?
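
For the second task, one possible way to organise the experiment is sketched below. The `sweep_expert_episodes` helper and the `fake_train_and_evaluate` placeholder are hypothetical names invented for this example. In the notebook, the callable would instead train a fresh student on `n_episodes` of expert data and return the `'scores'` list from an evaluation rollout.

```python
def sweep_expert_episodes(train_and_evaluate, episode_counts):
    """Return [(n_episodes, mean_score)] for each amount of expert data.

    train_and_evaluate(n_episodes) must return a list of evaluation scores.
    """
    results = []
    for n_episodes in episode_counts:
        scores = train_and_evaluate(n_episodes)
        results.append((n_episodes, sum(scores) / len(scores)))
    return results

def fake_train_and_evaluate(n_episodes):
    # Placeholder so the sketch runs without the simulator; the numbers
    # are made up and say nothing about the real environment.
    return [10.0 * n_episodes, 10.0 * n_episodes + 2.0]

results = sweep_expert_episodes(fake_train_and_evaluate, [1, 2, 5, 10])
for n_episodes, mean_score in results:
    print(n_episodes, mean_score)
```

The resulting pairs can then be plotted against the expert's mean score as a horizontal reference line.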

Solutions:


- Behavioural cloning works better on the harder environment: the blocks push the expert off its usual trajectories, which widens the distribution of states in the expert data, so the student also sees how the expert corrects itself.
- The reward increases slowly...
- The trick is to (1) run loads of iterations of DAGGER, and (2) use temporally-distant examples.
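
One possible reading of "temporally-distant examples" is to keep only frames that are spread evenly through each rollout, since consecutive frames are highly correlated. The sketch below shows that interpretation. The helper `subsample_evenly` is a hypothetical name, and the integer lists stand in for real observation and action arrays.

```python
def subsample_evenly(observations, actions, budget):
    """Keep at most `budget` (observation, action) pairs, evenly spaced in time."""
    n = len(observations)
    if budget <= 0:
        return [], []
    if budget >= n:
        return list(observations), list(actions)
    # Integer arithmetic keeps the indices exact and strictly increasing.
    indices = [(i * n) // budget for i in range(budget)]
    return [observations[i] for i in indices], [actions[i] for i in indices]

# Stand-in data: frame indices play the role of observations and actions.
frames = list(range(1000))
kept_obs, kept_actions = subsample_evenly(frames, frames, budget=100)
print(len(kept_obs), kept_obs[:5])
```

With a rollout dictionary from the notebook, the same function could be applied to its `'observations'` and `'actions'` lists before training.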