## Imitation Learning

This notebook applies imitation learning (behavioral cloning) to several OpenAI Gym environments.  In behavioral cloning, you run an expert policy, record the observations it receives and the actions it outputs, and then train a supervised model to "imitate" that mapping.  As you will see, the results vary by task and are often poor.  The main problem is that errors accumulate: a small mistake moves the agent into states the expert never visited, where the model's predictions are even less reliable.  See http://rll.berkeley.edu/deeprlcourse/docs/week_2_lecture_1_behavior_cloning.pdf
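
To make the accumulation concrete, here is a toy sketch that is not one of the Gym tasks. It assumes hypothetical dynamics `y[t+1] = y[t] + a[t]`, an expert that always outputs `a = 0` (and so stays at `y = 0`), and a clone with a constant action error `eps` that never corrects its drift. The name `rollout_deviation` is illustrative only.

```python
import numpy as np

def rollout_deviation(action_error, horizon):
    """Return the clone's distance from the expert's path at each step.

    Toy assumption: the state simply integrates the actions, the expert
    outputs 0 at every step, and the clone outputs `action_error` instead.
    """
    actions = np.full(horizon, action_error)
    return np.cumsum(actions)

eps = 0.01  # hypothetical per-step action error of the clone
for horizon in (10, 50, 100):
    deviation = rollout_deviation(eps, horizon)
    # Final drift grows linearly with the horizon; summed drift grows
    # roughly quadratically.
    print(horizon, deviation[-1], np.abs(deviation).sum())
```

Under these assumptions a per-step error that looks negligible in the training loss still produces a large total deviation over a long rollout.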

The code in this notebook is largely based on code posted at https://github.com/ghostFaceKillah/deep-rl-berkeley/tree/master/hw1, and I also drew on https://github.com/favetelinguis/DeepReinforcementLearning

This notebook is set up to give you an idea of how the algorithms work.  GhostFaceKillah has a nice set of scripts that run all the environments under different conditions and report the results.

My suggestion is to pick one task.  Set the epochs to 1, then go through the notebook and create the data from the expert policy.  Next, build a model to imitate the policy and save the rollouts it produces.  Finally, you can view both the expert policy (replayed from the saved data) and your model.

In [1]:
import gym
import load_policy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import tensorflow as tf
import tf_util
import tqdm

In [2]:
task = 'Reacher-v1'
#task = 'Ant-v1'
num_rollouts = 5  # important: how many rollouts to collect from the expert and to run with the model
use_cached_data_for_training = True
cached_data_path = "data/" + task + "-their.p"
their_data_path = "data/" + task + "-their.p"
our_data_path = "data/" + task + "-our.p"
expert_policy_file = "experts/" + task + ".pkl"

env = gym.make(task)
max_steps = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
envname = task
render_them =  False
render_us = False

# neural net params
learning_rate = 0.001
epochs = 50   ###important parameter for how long you are training your network

[2017-02-14 11:49:33,806] Making new env: Reacher-v1


In [3]:
def one_data_table_stats(data):
    mean = data['returns'].mean()
    std = data['returns'].std()
    x = data['steps']
    pct_full_steps =  (x / x.max()).mean()

    return pd.Series({
        'mean reward': mean,
        'std reward': std,
        'pct full rollout': pct_full_steps
    })

def view_data(data,rollouts):
    returns = []
    observations = []
    actions = []
    print ("Total rollouts from data: ", rollouts)
    env = gym.make(envname)
    for i in range(rollouts):
        print ("Start rollout ", i)
        observation = env.reset()
        steps = 0
        for t in range(2000):
            env.render()
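            # Index into the flat action array; this assumes every recorded
            # rollout ran for exactly max_steps steps.
            x = t + i*max_steps
            action = data['actions'][x,:,:]
            observations.append(observation)
            actions.append(action)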
            observation, reward, done, info = env.step(action)
            steps += 1
            if steps >= max_steps:
                print("Max timestep reached")
                break
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                #env.render(close=True)
                break
                
def view_model(model,rollouts):
    returns = []
    observations = []
    actions = []
    env = gym.make(envname)
    print ("Total rollouts from model: ", rollouts)
    for i in range(rollouts):
        #observation = env.reset()
        print ("Start rollout ", i)
        obs = env.reset()
        steps = 0
        for t in range(2000):
            env.render()
            #print(observation)
            action = model.predict(obs[None, :])
            observations.append(obs)
            actions.append(action)
            obs, reward, done, info = env.step(action)
            steps += 1
            if steps >= max_steps:
                print("Max timestep reached")
                break
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break

## Get data from expert policy (only need to run this once per environment)

The goal is to get a list of all states and the corresponding actions that the expert performed. You will want plenty of data for training.

In [4]:
print('Gathering expert data')
print('loading and building expert policy')
policy_fn = load_policy.load_policy(expert_policy_file)
print('loaded and built')

with tf.Session():
    tf_util.initialize()

    max_steps = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
    print ("Total rollouts for building policy: ", num_rollouts)
    returns = []
    observations = []
    actions = []
    steps_numbers = []

    for i in tqdm.tqdm(range(num_rollouts)):
        obs = env.reset()
        done = False
        totalr = 0.
        steps = 0
        while not done:
            action = policy_fn(obs[None,:])
            observations.append(obs)
            actions.append(action)
            obs, r, done, _ = env.step(action)
            totalr += r
            steps += 1
            if render_them:
                env.render()
            if steps >= max_steps:
                break
        steps_numbers.append(steps)
        returns.append(totalr)

    expert_data = {'observations': np.array(observations),
                   'actions': np.array(actions),
                   'returns': np.array(returns),
                   'steps': np.array(steps_numbers)}

# Use a context manager so the file is flushed and closed after writing.
with open(their_data_path, 'wb') as f:
    pickle.dump(expert_data, f)

100%|██████████| 5/5 [00:00<00:00, 35.29it/s]

Gathering expert data
loading and building expert policy
obs (1, 11) (1, 11)
loaded and built
Total rollouts for building policy:  5
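
The shapes of the collected arrays matter for training. The sketch below uses placeholder zeros instead of a real environment. The observation size of 11 comes from the `obs (1, 11)` line printed above; the action size of 2 and the 50 steps per rollout are assumptions made for illustration. Because the policy is called on `obs[None, :]`, each recorded action keeps a leading batch axis of length 1.

```python
import numpy as np

num_rollouts, steps_per_rollout = 5, 50  # assumed rollout length
obs_dim, act_dim = 11, 2                 # act_dim is an assumption

observations, actions = [], []
for _ in range(num_rollouts * steps_per_rollout):
    obs = np.zeros(obs_dim)            # stand-in for an environment observation
    action = np.zeros((1, act_dim))    # stand-in for policy_fn(obs[None, :])
    observations.append(obs)
    actions.append(action)

observations = np.array(observations)  # shape (250, 11)
actions = np.array(actions)            # shape (250, 1, 2)

# Drop the batch axis before using the actions as regression targets.
flat_actions = actions.reshape(-1, act_dim)
print(observations.shape, actions.shape, flat_actions.shape)
```

This is why the training cell reshapes `data['actions']` with `reshape(-1, actions_dim)` before fitting.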





## Load data from expert policy

In [5]:
##load data from file
with open(cached_data_path, 'rb') as f:
    data = pickle.load(f)

## Train your model -- there are different models available

The model takes the states (observations) as inputs and predicts the expert's actions as outputs, so training is an ordinary supervised regression problem.
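
As a minimal sketch of that regression view, the example below clones a hypothetical linear expert with ordinary least squares in NumPy. The data is synthetic, the dimensions (11 inputs, 2 outputs) are illustrative, and a linear fit stands in for the Keras networks defined in this notebook; it is not the notebook's own training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert": actions are a fixed linear function of the state.
states = rng.normal(size=(500, 11))
expert_weights = rng.normal(size=(11, 2))
expert_actions = states @ expert_weights

# Behavioral cloning as regression: find the weights that minimize the
# mean squared error between predicted and expert actions.
cloned_weights, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)

mse = np.mean((states @ cloned_weights - expert_actions) ** 2)
print("training MSE:", mse)
```

A low training error here only says the clone matches the expert on states the expert visited; it says nothing about states the clone reaches on its own.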

In [6]:
##Alternative model designs:
def baseline_model():
    model = Sequential()
    # Integer division: a layer size must be an int, and num_inputs/2 is a float in Python 3.
    model.add(Dense(num_inputs // 2, input_dim=num_inputs, init='normal', activation='relu'))
    model.add(Dense(num_outputs, init='normal'))
    
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    return model
def regularized_model():
    model = Sequential()
    model.add(Dense(64, input_dim=num_inputs, init='normal', activation='relu',W_regularizer=l2(0.01), activity_regularizer=activity_l2(0.01),b_regularizer=l2(0.01)))
    model.add(Dense(64, input_dim=num_inputs, init='normal', activation='relu',W_regularizer=l2(0.01), activity_regularizer=activity_l2(0.01),b_regularizer=l2(0.01)))
    model.add(Dense(num_outputs, init='normal',W_regularizer=l2(0.01), activity_regularizer=activity_l2(0.01),b_regularizer=l2(0.01)))
    
    model.compile(loss='mse', optimizer='adam')
    return model
def wide_model():
    model = Sequential()
    model.add(Dense(128, input_dim=num_inputs, init='normal', activation='relu'))
    model.add(Dense(num_outputs, init='normal'))

    model.compile(loss='mse', optimizer='adam')
    return model

def awesome_model():
    model = Sequential([
    Lambda(lambda x: (x - mean) / std, batch_input_shape=(None, observations_dim)),
    Dense(64, activation='tanh'),
    Dense(64, activation='tanh'),
    Dense(actions_dim)])

    opt = Adam(lr=learning_rate)
    model.compile(optimizer=opt, loss='mse', metrics=['mse'])
    return model

In [7]:
###Train model
from sklearn.utils import shuffle

from keras.models import Sequential
from keras.layers import Dense, Lambda
from keras.optimizers import Adam
from keras.regularizers import l2, activity_l2  # used by regularized_model (Keras 1.x names)

mean, std = np.mean(data['observations'], axis=0), np.std(data['observations'], axis=0) + 1e-6

observations_dim = env.observation_space.shape[0]
actions_dim = env.action_space.shape[0]
num_inputs = observations_dim
num_outputs = actions_dim

###Pick out the model here that you will use##
model = baseline_model()
#model = awesome_model()

x, y = shuffle(data['observations'], data['actions'].reshape(-1, actions_dim))
model.fit(x, y,
          validation_split=0.1,
          batch_size=256,
          nb_epoch=epochs,
          verbose=2)

Using TensorFlow backend.


Train on 225 samples, validate on 25 samples
Epoch 1/50
0s - loss: 0.0080 - mean_absolute_error: 0.0476 - val_loss: 0.0087 - val_mean_absolute_error: 0.0472
Epoch 2/50
0s - loss: 0.0080 - mean_absolute_error: 0.0476 - val_loss: 0.0087 - val_mean_absolute_error: 0.0470
Epoch 3/50
0s - loss: 0.0080 - mean_absolute_error: 0.0477 - val_loss: 0.0086 - val_mean_absolute_error: 0.0471
Epoch 4/50
0s - loss: 0.0079 - mean_absolute_error: 0.0478 - val_loss: 0.0085 - val_mean_absolute_error: 0.0473
Epoch 5/50
0s - loss: 0.0079 - mean_absolute_error: 0.0480 - val_loss: 0.0085 - val_mean_absolute_error: 0.0475
Epoch 6/50
0s - loss: 0.0079 - mean_absolute_error: 0.0482 - val_loss: 0.0084 - val_mean_absolute_error: 0.0477
Epoch 7/50
0s - loss: 0.0078 - mean_absolute_error: 0.0484 - val_loss: 0.0084 - val_mean_absolute_error: 0.0479
Epoch 8/50
0s - loss: 0.0078 - mean_absolute_error: 0.0486 - val_loss: 0.0084 - val_mean_absolute_error: 0.0480
Epoch 9/50
0s - loss: 0.0078 - mean_absolute_error: 0.0487 

<keras.callbacks.History at 0x11ec03da0>

## Run the model and save the data

In [8]:
returns = []
observations = []
actions = []
steps_numbers = []

for i in tqdm.tqdm(range(num_rollouts)):
    obs = env.reset()
    done = False
    totalr = 0.
    steps = 0
    while not done:
        action = model.predict(obs[None, :])
        observations.append(obs)
        actions.append(action)
        obs, r, done, _ = env.step(action)
        totalr += r
        steps += 1
        if render_us:
            env.render()
        if steps >= max_steps:
            break
    steps_numbers.append(steps)
    returns.append(totalr)

our_net_data = {'observations': np.array(observations),
                'actions': np.array(actions),
                'returns': np.array(returns),
                'steps': np.array(steps_numbers)}

with open(our_data_path, 'wb') as f:
    pickle.dump(our_net_data, f)

100%|██████████| 5/5 [00:00<00:00, 29.87it/s]


## Compare the expert policy and the imitation model

In [9]:
###analyze single
with open(their_data_path, 'rb') as f:
    their = pickle.load(f)
with open(our_data_path, 'rb') as f:
    our = pickle.load(f)

df = pd.DataFrame({
    'expert': one_data_table_stats(their),
    'imitation': one_data_table_stats(our)
})

print ("Analyzing experiment " + envname)
print (df)

Analyzing experiment Reacher-v1
                    expert  imitation
mean reward      -3.521681 -12.831075
pct full rollout  1.000000   1.000000
std reward        1.508024   5.372770
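
The `pct full rollout` row is easy to misread. The sketch below uses made-up step counts, not results from this notebook, to show what `one_data_table_stats` computes: each rollout's length divided by the longest rollout in the same dataset, then averaged. The helper name `pct_full_rollout` is illustrative.

```python
import numpy as np

def pct_full_rollout(steps):
    """Mean rollout length as a fraction of the longest rollout in `steps`."""
    steps = np.asarray(steps, dtype=float)
    return (steps / steps.max()).mean()

print(pct_full_rollout([50, 50, 50]))  # every rollout ran equally long
print(pct_full_rollout([50, 25, 50]))  # one rollout ended early
print(pct_full_rollout([25, 25, 25]))  # all ended early, but equally
```

Because the normalization uses the dataset's own maximum rather than `max_steps`, a value of 1.0 only means all rollouts had the same length, not that they reached the time limit.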


## View examples of the expert policy

In [10]:
view_data(data,5)

[2017-02-14 11:49:35,336] Making new env: Reacher-v1


Total rollouts from data:  5
Start rollout  0
Max timestep reached
Start rollout  1
Max timestep reached
Start rollout  2
Max timestep reached
Start rollout  3
Max timestep reached
Start rollout  4
Max timestep reached


## View examples of the imitated policy

In [11]:
view_model(model,5)

[2017-02-14 11:49:39,655] Making new env: Reacher-v1


Total rollouts from model:  5
Start rollout  0
Max timestep reached
Start rollout  1
Max timestep reached
Start rollout  2
Max timestep reached
Start rollout  3
Max timestep reached
Start rollout  4
Max timestep reached
