# Dagger RL Simple Implementation  
Dagger is a reinforcement learning algorithm for imitation learning/behaviour cloning. Introdiced in paper https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf  
It uses initial expert knowledge (usually human labeled data) to perform surprevised learning of agent's policy (mapping from observations to actions). Main trick in Dagger is that after agent learns initial policy (from expert data), it uses that policy to act in real environemnt and stores those experiences (observations). These real observations are then passed to expert to be labeled (add expert's actions), and are added to the training set. Then agent policy is trained again, this time on the new augmented data set, and cycle is repeated.        
  
Main trick in Dagger is dataset augmentation from agent's own experince. Inital expert dataset is limited and it is very likely that agent will diverge from the expert's path and encounter new states. Initial policy is almost useless in those new situations. By obtaining expert labels for those new observations and retraining the policy, agent becomes more robust to path perturbations.  
  
This implementation is made for UC Berkely course CS 294 Deep Reinforcement Learning. It is a naive implementation (still unfinished curentlly) by extending previous ordinary imitation learning technique, uses provided expert policy for gathering expert's dataset, and acts in MuJoCo environment.     

In [1]:
import cPickle as pickle
import tf_util
import load_policy
import numpy as np
import random
import math
import gym
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Netowrk definition 

class Network():
    def __init__(self, input_dim, action_dim, hidden1_units, hidden2_units, regularization = False, beta = 0.01):
        """Network definition"""
        
        self.input_dim = input_dim
        self.action_dim = action_dim
        self.h1_num = hidden1_units
        self.h2_num = hidden2_units
        self.input_observations = tf.placeholder(tf.float32, shape=(None,self.input_dim))
        self.action_labels = tf.placeholder(tf.float32, shape=(None,self.action_dim))

        self.w1 = tf.Variable(tf.truncated_normal([self.input_dim, self.h1_num],
                                                  stddev=1.0 / math.sqrt(float(self.input_dim))),name='w1')
        self.b1 = tf.Variable(tf.zeros(self.h1_num),name='b1')
        self.h1 = tf.nn.relu(tf.matmul(self.input_observations,self.w1) + self.b1)

        self.w2 = tf.Variable(tf.truncated_normal([self.h1_num, self.h2_num],
                                             stddev=1.0 / math.sqrt(float(self.h1_num))),name='w2')
        self.b2 = tf.Variable(tf.zeros(self.h2_num),name='b2')
        self.h2 = tf.nn.relu(tf.matmul(self.h1,self.w2) + self.b2)

        self.w3 = tf.Variable(tf.truncated_normal([self.h2_num, self.action_dim],
                                             stddev=1.0 / math.sqrt(float(self.h2_num))),name='w3')
        self.b3 = tf.Variable(tf.zeros(self.action_dim),name='b3')
        self.output = tf.matmul(self.h2,self.w3) + self.b3

        self.error = tf.reduce_mean(tf.pow(tf.subtract(self.output,self.action_labels),2))
        if regularization:
            self.regularizers = tf.nn.l2_loss(self.w1) + tf.nn.l2_loss(self.w2) + tf.nn.l2_loss(self.w3)
            self.error = self.error + beta * self.regularizers

        self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.error)

    def train(self, sess, saver, train_data, training_epochs, batch_size):
        """Supervised training of agent network"""
        
        total_batch = int(len(train_data)/batch_size)
        for epoch in xrange(training_epochs):
            batch_count = 0
            avg_cost = 0.
      
            # Loop over all batches
            for i in xrange(total_batch):
                next_batch = random.sample(train_data, batch_size)
                next_batch = zip(*next_batch)
                batch_x = next_batch[0]
                batch_y = np.asarray(next_batch[1])
                batch_y = batch_y.reshape((batch_size,self.action_dim))

                # Run optimization op (backprop) and cost op (to get loss value)
                _, c = sess.run([self.optimizer, self.error], 
                                feed_dict={self.input_observations: batch_x, self.action_labels: batch_y})
                
                # Compute average loss
                avg_cost += c / total_batch
                if i % 20000 == 0:
                    print("Batch number {:d}".format(i))
                    #print("Step {} | Average cost {}".format(i, avg_cost))
            
            # Display logs per epoch step
            print("Epoch: {:04d}, cost = {:.9f}".format(epoch+1, avg_cost))
            
        print "Optimization Finished!"
        saver.save(sess, path + '/' + environment + '.cptk')
        print ("Model Saved")
        
    def run(self, sess, saver, env, data_mean, num_rollouts = 20, render = False, load_model = True):
        """Run policy on real observations"""
       
        returns = []
        observations = []
        max_step = env.spec.timestep_limit
        """
        if load_model:
            print('Loading Model...')
            ckpt = tf.train.get_checkpoint_state(path)
            saver.restore(sess,ckpt.model_checkpoint_path)
        else: 
            sess.close()
        """
        for i in xrange(num_rollouts):
            if i % 10 == 0:
                print('iter', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0
            while not done:
                observations.append(obs)
                obs = obs.reshape((1,self.input_dim))
                obs -= data_mean
                action = sess.run(self.output, feed_dict={self.input_observations: obs})
                obs, r, done, _ = env.step(action)
                totalr += r
                steps += 1
                if render:
                    env.render()
                if steps >= max_step:
                    break
            returns.append(totalr)

        print('returns', returns)
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))
        
        self.agent_data = {'observations': np.array(observations)}
        with open("Hopper-v1" + '_agent_data.pickle', 'wb') as handle:
            pickle.dump(self.agent_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
            print("Agent data pickled successfully")

In [3]:
def expert(expert_policy_file):
    print('loading and building expert policy')
    policy_fn = load_policy.load_policy(expert_policy_file)
    print('loaded and built')
    
    with open("Hopper-v1" + '_agent_data.pickle', 'rb') as handle:
        agent_data = pickle.load(handle)
        print(agent_data['observations'].shape)
        
    actions = []    
    observations = agent_data['observations']
    #print(type(observations[0]), len(observations), observations[0])
    with tf.Session():
        tf_util.initialize()

        for i in xrange(len(observations)):
            action = policy_fn(np.asarray(observations[i][None,:]))
            #print(action.shape)
            actions.append(action)
    print("Done")        
    return actions, observations

In [4]:
#actions, observations = expert("experts/Hopper-v1.pkl")

In [5]:
#print(np.asarray(actions).shape)

In [6]:
#with open("Hopper-v1" + '_expert_data.pickle', 'rb') as handle:
#        train_data = pickle.load(handle)
#print(train_data['observations'].shape)

In [7]:
def data_preprocessing(train_data, data_mean = []):
    """Data preprocessing - mean substraction and normalization"""
    
    if not any(data_mean):
        data_mean = np.mean(train_data['observations'], axis = 0)
    train_data['observations'] -= data_mean
    input_dim = train_data['observations'].shape[1]
    action_dim = train_data['actions'].shape[2]
    data_combined = zip(train_data['observations'], train_data['actions'])
    return data_mean, input_dim, action_dim, data_combined

In [8]:
def main(num_cycles, environment):    
    with open("Hopper-v1" + '_expert_data.pickle', 'rb') as handle:
        train_data = pickle.load(handle)
    data_mean, input_dim, action_dim, data_combined = data_preprocessing(train_data)
    print("Initial data mean is: {}".format(data_mean))
    
    with tf.Session() as sess:
        agent = Network(input_dim,action_dim, hidden1_units, hidden2_units)
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        
        for cycle in xrange(num_cycles):            
            agent.train(sess, saver, data_combined, training_epochs, batch_size)
            agent.run(sess, saver, environment, data_mean, num_rollouts, load_model = False)
            #break
            expert_actions, observations = expert("experts/Hopper-v1.pkl")
            train_data['actions'] = np.concatenate((train_data['actions'], expert_actions), axis=0)
            print(train_data['actions'].shape)
            train_data['observations'] = np.concatenate((train_data['observations'], observations), axis=0)
            print(train_data['observations'].shape)
            _, input_dim, action_dim, data_combined = data_preprocessing(train_data, data_mean)
            print("Data mean is: {}".format(data_mean))
        

In [9]:
# Parameters
learning_rate = 0.001
training_epochs = 1
batch_size = 100
display_step = 1
num_rollouts = 200
num_cycles = 3
beta = 0.001
path = './dagger_policy'

# Network Parameters
hidden1_units = 128 # 1st layer number of features
hidden2_units = 128 # 2nd layer number of features

environments = {1: "Ant-v1", 2: "HalfCheetah-v1", 3: "Hopper-v1", 
                4: "Humanoid-v1", 5: "Reacher-v1", 6: "Walker2d-v1"}
environment = environments[3]
env = gym.make(environment)

tf.reset_default_graph()

main(num_cycles,env)

[2017-03-28 17:10:10,929] Making new env: Hopper-v1


Initial data mean is: [  1.40310727e+00  -6.88182004e-03  -1.11049251e-01  -5.71447489e-01
   2.88128503e-01   2.78367951e+00   4.64378410e-02  -4.14444474e-04
  -8.67950539e-03  -1.02232249e-01   1.68187284e-01]
Batch number 0
Batch number 20000
Batch number 40000
Batch number 60000
Batch number 80000
Epoch: 0001, cost = 0.001099784
Optimization Finished!
Model Saved
('iter', 0)
('iter', 10)
('iter', 20)
('iter', 30)
('iter', 40)
('iter', 50)
('iter', 60)
('iter', 70)
('iter', 80)
('iter', 90)
('iter', 100)
('iter', 110)
('iter', 120)
('iter', 130)
('iter', 140)
('iter', 150)
('iter', 160)
('iter', 170)
('iter', 180)
('iter', 190)
('returns', [3776.6603685111959, 3780.8849023811154, 3771.8051972639805, 3773.1690911097544, 3779.2478158351205, 3772.015315135995, 3769.8493120141893, 3774.2872844376388, 3779.4212488159301, 3779.3366561863263, 3773.0865570657356, 3776.135907079818, 3778.2653662720154, 3779.6122731312958, 3778.5714278417317, 3782.6171420786277, 3784.2834198316159, 3778.1222

[2017-03-28 17:13:05,905] From tf_util.py:91: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.


Instructions for updating:
Use `tf.variables_initializer` instead.


[2017-03-28 17:13:05,906] From tf_util.py:92: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.


Done
(10200000, 1, 3)
(10200000, 11)
Data mean is: [  1.40310727e+00  -6.88182004e-03  -1.11049251e-01  -5.71447489e-01
   2.88128503e-01   2.78367951e+00   4.64378410e-02  -4.14444474e-04
  -8.67950539e-03  -1.02232249e-01   1.68187284e-01]
Batch number 0
Batch number 20000
Batch number 40000
Batch number 60000
Batch number 80000
Batch number 100000
Epoch: 0001, cost = 0.052559627
Optimization Finished!
Model Saved
('iter', 0)
('iter', 10)
('iter', 20)
('iter', 30)
('iter', 40)
('iter', 50)
('iter', 60)
('iter', 70)
('iter', 80)
('iter', 90)
('iter', 100)
('iter', 110)
('iter', 120)
('iter', 130)
('iter', 140)
('iter', 150)
('iter', 160)
('iter', 170)
('iter', 180)
('iter', 190)
('returns', [223.7509269659931, 226.17714696447288, 226.09183705022397, 226.37806549148792, 225.44352345466734, 224.59023337809555, 228.14787647097046, 227.81240717691847, 229.42359590551769, 224.21406051083761, 226.28215050850656, 228.35925916018132, 228.83346152820204, 228.54278026169598, 228.51933067484399,

[2017-03-28 17:15:27,479] From tf_util.py:91: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.


Instructions for updating:
Use `tf.variables_initializer` instead.


[2017-03-28 17:15:27,480] From tf_util.py:92: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.


Done
(10221183, 1, 3)
(10221183, 11)
Data mean is: [  1.40310727e+00  -6.88182004e-03  -1.11049251e-01  -5.71447489e-01
   2.88128503e-01   2.78367951e+00   4.64378410e-02  -4.14444474e-04
  -8.67950539e-03  -1.02232249e-01   1.68187284e-01]
Batch number 0
Batch number 20000
Batch number 40000
Batch number 60000
Batch number 80000
Batch number 100000
Epoch: 0001, cost = 0.051604189
Optimization Finished!
Model Saved
('iter', 0)
('iter', 10)
('iter', 20)
('iter', 30)
('iter', 40)
('iter', 50)
('iter', 60)
('iter', 70)
('iter', 80)
('iter', 90)
('iter', 100)
('iter', 110)
('iter', 120)
('iter', 130)
('iter', 140)
('iter', 150)
('iter', 160)
('iter', 170)
('iter', 180)
('iter', 190)
('returns', [3.4499662302094256, 3.5130058702196427, 3.3650542097684699, 3.4195950043830678, 3.7021478837842117, 3.1092135676855301, 3.2557265803445681, 3.4006629664225612, 3.4926670461639735, 3.4613310346768658, 3.2172893112159877, 3.3917597525308087, 3.3582751652860905, 3.4599359128248284, 3.1996029117521414

[2017-03-28 17:17:05,341] From tf_util.py:91: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.


Instructions for updating:
Use `tf.variables_initializer` instead.


[2017-03-28 17:17:05,342] From tf_util.py:92: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.


Done
(10224932, 1, 3)
(10224932, 11)
Data mean is: [  1.40310727e+00  -6.88182004e-03  -1.11049251e-01  -5.71447489e-01
   2.88128503e-01   2.78367951e+00   4.64378410e-02  -4.14444474e-04
  -8.67950539e-03  -1.02232249e-01   1.68187284e-01]
