## Assignments
1. Homework 1: Imitation learning (control via supervised learning)
2. Homework 2: Policy gradients (“REINFORCE”)
3. Homework 3: Q learning and actor-critic algorithms
4. Homework 4: Model-based reinforcement learning
5. Homework 5: Advanced model-free RL algorithms
6. Final project: Research-level project of your choice (form a group of up to 2-3 students, you’re welcome to start early!)

##### Emacs IPython Notebook Commands/Keybinds
* http://millejoh.github.io/emacs-ipython-notebook/#commands-keybinds


# Homework 1 Imitation Learning

Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's?

If this were then subjected to an appropriate course of education one would obtain the adult brain.

\- Alan Turing



## Behavioral Cloning

1. The starter code provides an expert policy for each of the MuJoCo tasks in OpenAI Gym (see run_expert.py). Generate roll-outs from the provided policies, and implement behavioral cloning. The roll-outs are saved to expert_data/XXX.pkl.

2. Run behavioral cloning (BC) and report results on two tasks
 – one task where a behavioral cloning agent achieves comparable performance to the expert,
 and one task where it does not.
 When providing results, report the mean and standard deviation of the return over multiple rollouts in a table, and state which task was used.
 Be sure to set up a fair comparison, in terms of network size, amount of data, and number of training iterations, and provide these details (and any others you feel are appropriate) in the table caption.

3. Experiment with one hyperparameter that affects the performance of the behavioral cloning agent, such as
* the number of demonstrations,
* the number of training epochs,
* the variance of the expert policy, or
* something that you come up with yourself.
 For one of the tasks used in the previous question, show a graph of how the BC agent’s performance varies with the value of this hyperparameter, and state the hyperparameter and a brief rationale for why you chose it in the caption for the graph.
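
A minimal sketch of how such a hyperparameter sweep could be organized, assuming a callable that trains a BC agent for one hyperparameter value and returns the episode returns of its rollouts. The names `summarize_returns`, `sweep`, and `fake_evaluate` are hypothetical and not part of the starter code; `fake_evaluate` is a stand-in that only makes the example runnable.

```python
import numpy as np

def summarize_returns(returns):
    """Return (mean, std) of a list of episode returns."""
    returns = np.asarray(returns, dtype=np.float64)
    return float(returns.mean()), float(returns.std())

def sweep(values, evaluate):
    """Call evaluate(value) for each hyperparameter value.

    `evaluate` must return a list of episode returns for that value.
    """
    rows = []
    for value in values:
        mean, std = summarize_returns(evaluate(value))
        rows.append({'value': value, 'mean': mean, 'std': std})
    return rows

def fake_evaluate(num_demos):
    """Hypothetical stand-in for 'train a BC agent, then roll it out'."""
    return [float(num_demos), float(num_demos) + 2.0]

rows = sweep([10, 20], fake_evaluate)
for row in rows:
    print(row)
```

Each row holds one point of the graph: the hyperparameter value on the x-axis, the mean return on the y-axis, and the standard deviation as the error bar.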



In [1]:

gym_envs = ('Ant-v2', 'Hopper-v2', 'Reacher-v2', 'HalfCheetah-v2', 'Humanoid-v2', 'Walker2d-v2')


## running experts

Run the expert policy on each Gym environment.

In [None]:
# run expert

import sys, os
import datetime as dt
import run_expert

# https://ipython.readthedocs.io/en/stable/interactive/magics.html
# path=os.environ['PATH']
# %env PATH='/usr/local/bin:'+path

RENDER = False
NUM_ROLLOUTS = 20

for gym_env in gym_envs :
    sys.argv = ['run_expert.py', 'experts/' + gym_env + '.pkl', gym_env, '--num_rollouts', str(NUM_ROLLOUTS) ]
    if RENDER :
        sys.argv.append('--render')
    run_expert.main()
    print('finished run_expert ', gym_env, 'at', dt.datetime.now())

print('finished run_expert on all gym_envs at', dt.datetime.now())


## train on each envs

Train a policy for each environment using the data gathered by the expert policy.

* environment details : https://github.com/openai/gym/tree/master/gym/envs/mujoco/assets
* source code of each environment : https://github.com/openai/gym/blob/master/gym/envs/mujoco/
* reference for an HW1 implementation :  https://hollygrimm.com/rl_bc

The regressor takes an observation as input and outputs an action.
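
The training cell later in this notebook reshapes the actions when their second axis has length 1. A small sketch of that shape handling, assuming the expert data stores actions as (N, 1, A) while the regressor expects (N, A). The helper name `flatten_actions` and the array sizes are made up for illustration.

```python
import numpy as np

def flatten_actions(actions):
    """Turn (N, 1, A) action arrays into (N, A); leave (N, A) arrays unchanged."""
    actions = np.asarray(actions)
    if actions.ndim == 3 and actions.shape[1] == 1:
        return actions.reshape(actions.shape[0], actions.shape[2])
    return actions

observations = np.zeros((5, 11))  # made-up sizes: 5 samples, 11 observation features
actions = np.zeros((5, 1, 3))     # 5 samples, 3 action dimensions, singleton middle axis

X, Y = observations, flatten_actions(actions)
print(X.shape, Y.shape)
```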


In [2]:
from load_policy import load_policy
import pickle as pk
import numpy as np
import os

def load_expert_data(gymenv) -> (np.array, np.array) : # observations, actions
    with open(os.path.join('expert_data', gymenv + '.pkl'), 'rb') as f :
        expert_data = pk.load(f)
        return expert_data['observations'], expert_data['actions']

def load_expert_policy_fn(gymenv) :
    return load_policy('experts/' + gymenv + '.pkl')


## Behavior Cloning

1. Generate rollouts (the expert data) with the expert policy, and record the returns.
2. Train on the rollouts while varying some settings (network size, amount of data, number of training iterations, ...).
3. Generate rollouts several times with each of the policies learned above, and show the returns in a table.
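
The core of these steps is plain supervised regression from observations to actions. A toy sketch under loud assumptions: the "expert" is a made-up linear map `W_expert`, the learner is a least-squares fit instead of the neural network used in this notebook, and step 3 compares predicted actions on held-out observations rather than running environment rollouts.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical linear "expert": action = observation @ W_expert
W_expert = np.array([[1.0, -2.0],
                     [0.5,  0.0],
                     [0.0,  3.0]])  # 3 observation features -> 2 action dimensions

# Step 1: expert data (observations paired with the expert's actions)
observations = rng.randn(200, 3)
actions = observations @ W_expert

# Step 2: fit a policy by supervised regression (least squares stands in for the network)
W_learned, *_ = np.linalg.lstsq(observations, actions, rcond=None)

# Step 3 (simplified): compare learned and expert actions on unseen observations
test_obs = rng.randn(50, 3)
mse = np.mean((test_obs @ W_learned - test_obs @ W_expert) ** 2)
print('held-out action MSE:', mse)
```

A low action error on held-out data does not guarantee a high return in the environment, which is why the assignment asks for rollout returns rather than regression error.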


In [3]:
import tensorflow as tf
from enum import Enum, IntEnum
import numpy as np
import random
import datetime as dt
import time
import math

default_model_config = dict(neurons = [400, 200, 100],
                            activation = tf.nn.elu, # ReLU is continuous but not differentiable at 0, which may cause issues. Other options include tanh or sigmoid.
                            last_activation = None, # final layer activation function. default is no activation
                            optimizer = tf.train.AdadeltaOptimizer, # tf.train.AdamOptimizer, tf.train.ProximalAdagradOptimizer
                            cost_function = tf.losses.mean_squared_error, # tf.losses.huber_loss (robust to outlier)
                            measure_function = 'r_squared', # 'smape' means symmetric_mean_absolute_percentage_error
)

default_train_config = dict(start_learning_rate = 0.001,
                            # minimum_learning_rate = 0.000001,
                            num_epochs = 1000,
                            batch_size = 100, # 500,
                            keep_prob = 0.9, # for training only (dropout)
                            validationset_percent = 0.2, # by default 20 percent is validation set
                            break_accuracy = -1.0, # 0.999, # -1.0
                            early_stopping_epoch_on_max_no_decrease = 20, # 100,
                            shuffle_samples_epochs = 10, # shuffle samples per given epochs considering performance. -1 means no shuffling
                            check_accuracy_epochs = 200, # 5000,
                            use_tboard = True,
                            print_cost_interval = 500,
                            print_trained_model = False,
                            )

class BehavioralCloning(object) :
    default_random_seed = 777

    def __init__(self,
                 X_shape = None, # X shape as list
                 Y_shape = None, # Y shape as list
                 model_config = default_model_config,
                 scope_name = '',
                 restore_mode=False,
                 session=None) :
        self.model_config = model_config
        self.restore_mode = restore_mode
        self.scope_name = scope_name
        self.X_shape = list(X_shape)
        self.X_shape[0] = None
        self.Y_shape = list(Y_shape)
        self.Y_shape[0] = None

        tf.set_random_seed(BehavioralCloning.default_random_seed)  # reproducibility
        np.random.seed(BehavioralCloning.default_random_seed)

        # Launch new session before graph init
        # an interactive session installs itself as the default session and is not closed when the context is destroyed (so sess.close() must be called explicitly)
        if session is None :
            tf.reset_default_graph()
            self.session = tf.InteractiveSession()
        else :
            self.session = session
        self._build_network()

    def _build_network(self) :
        g = tf.get_default_graph()

        # build the network
        with g.as_default(), self.session.as_default() :
            self.X = tf.placeholder(tf.float32, shape=self.X_shape, name='X')
            self.Y = tf.placeholder(tf.float32, shape=self.Y_shape, name='Y')
            self.p_keep_prob = tf.placeholder(tf.float32, name='p_keep_prob')
            self.p_training = tf.placeholder(tf.bool, name='p_training')
            self.p_lr = tf.placeholder(tf.float32, name='learning_rate')

            with tf.variable_scope(self.scope_name + '-dnn', reuse=tf.AUTO_REUSE) as scope:
                neurons = self.model_config['neurons']
                layer = self.X
                for i in range(len(neurons)) :
                    neuron = neurons[i]

                    layer = tf.layers.dense(layer, neuron,
                                            kernel_initializer = tf.contrib.layers.xavier_initializer(seed=BehavioralCloning.default_random_seed),
                                            activation=self.model_config['activation'],
                                            name = 'layer-' + str(i))
                    layer = tf.layers.dropout(layer, rate=1-self.p_keep_prob, training=self.p_training)
                n_output = self.Y_shape[1]
                layer = tf.layers.dense(layer, n_output,
                                        kernel_initializer = tf.contrib.layers.xavier_initializer(seed=BehavioralCloning.default_random_seed),
                                        activation=self.model_config['last_activation'],
                                        name = 'layer-last')
                    

                self.hypothesis = layer
                cost_fn = self.model_config['cost_function']
                self.cost = cost_fn(self.Y, self.hypothesis)
                tf.summary.scalar("cost", self.cost)
                measure_alg = self.model_config['measure_function']
                if measure_alg == 'r_squared' :
                    self.measure = self.r_squared(self.Y, self.hypothesis)
                elif measure_alg == 'smape' :
                    self.measure = self.smape(self.Y, self.hypothesis)
                else :
                    self.measure = None
                optimizer_fn = self.model_config['optimizer']
                opt = optimizer_fn(learning_rate=self.p_lr)
                self.objective_tensor = opt.minimize(self.cost)

            if not self.restore_mode :
                self.session.run(tf.global_variables_initializer())


    def train(self, X, Y, train_config = default_train_config) :
        learning_rate = train_config['start_learning_rate']
        num_epochs = train_config['num_epochs']
        keep_prob = train_config['keep_prob']
        batch_size = train_config['batch_size']
        vset_percent = train_config['validationset_percent']
        break_accuracy = train_config['break_accuracy']
        check_accuracy_epochs = train_config['check_accuracy_epochs']
        early_stopping_epoch_on_max_no_decrease = train_config['early_stopping_epoch_on_max_no_decrease']
        print_cost_interval = train_config['print_cost_interval']
        shuffle_samples_epochs = train_config['shuffle_samples_epochs']
        use_tboard = train_config['use_tboard']

        training_costs = np.zeros(num_epochs, dtype=np.float32)
        validation_costs = np.zeros(num_epochs, dtype=np.float32)
        validation_measures = np.zeros(num_epochs, dtype=np.float32)
        min_cost = np.inf
        no_cost_decrease_epochs = 0

        n_samples = X.shape[0]
        n_features = X.shape[1]
        n_output = Y.shape[1]
        n_train = int(n_samples * (1 - vset_percent))
        n_validate = n_samples - n_train

        batch_loop = (n_train - 1) // batch_size + 1

        sess = self.session
        if use_tboard :
            merged_summary = tf.summary.merge_all()
            writer = tf.summary.FileWriter("./tboard_logs")
            writer.add_graph(sess.graph)  # Show the graph
        else :
            merged_summary = None

        current_X = train_X = X[:n_train]
        current_Y = train_Y = Y[:n_train]
        validate_X = X[n_train:]
        validate_Y = Y[n_train:]

        if shuffle_samples_epochs > 0 :
            current_XY = np.hstack((train_X, train_Y))

        start_time = dt.datetime.now()
        print('Learning starts. It will take some time...', start_time)
        for epoch in range(num_epochs):
            shuffle_samples = shuffle_samples_epochs > 0 and epoch % shuffle_samples_epochs == 0 # shuffle on 0th epoch
            
            if shuffle_samples :
                np.random.shuffle(current_XY) # this will shuffle current_XY in place.
                _, current_X, current_Y = np.split(current_XY, (0, n_features), axis=-1)

            epoch_hyps = np.zeros(Y.shape, dtype=np.float32)
            epoch_costs = np.zeros(batch_loop, dtype=np.float32)

            for m in range(batch_loop) :
                if m == batch_loop - 1 :
                    m_X = current_X[batch_size * m :]
                    m_Y = current_Y[batch_size * m :]
                else :
                    m_X = current_X[batch_size * m : batch_size * (m + 1)]
                    m_Y = current_Y[batch_size * m : batch_size * (m + 1)]

                feed_dict = {self.X:m_X, self.Y:m_Y,
                             self.p_keep_prob:keep_prob,
                             self.p_lr:learning_rate,
                             self.p_training:True}
                targets = [ self.hypothesis, self.cost, self.objective_tensor ]
                if use_tboard :
                    targets.append(merged_summary)
                # print('m:', m, ', m_X:', np.shape(m_X), ', m_Y:', np.shape(m_Y), ', feed_dict:', feed_dict)
                results = sess.run(targets, feed_dict = feed_dict)
                if use_tboard :
                    writer.add_summary(results[-1], global_step = epoch * batch_loop + m)

                h_value = results[0]
                epoch_hyps[batch_size * m : batch_size * m + m_Y.shape[0]] = h_value
                cost_value = results[1]
                epoch_costs[m] = cost_value

            training_costs[epoch] = avg_cost = np.mean(epoch_costs)

            validate_feed_dict = {self.X: validate_X, self.Y: validate_Y,
                                  self.p_keep_prob:1.0, self.p_training:False}
            validate_targets = [ self.hypothesis, self.cost, self.measure ]
            vs_hyps, vs_cost, vs_measure = sess.run(validate_targets, feed_dict=validate_feed_dict)
            validation_costs[epoch] = vs_cost
            validation_measures[epoch] = vs_measure

            if epoch % print_cost_interval == 0 :
                print('Epoch:', '%04d' % epoch, 'average training cost =', '{:.9f}'.format(avg_cost),
                      'validation cost =', '{:.9f}'.format(vs_cost), 'validation measure =', '{:.9f}'.format(vs_measure), dt.datetime.now())

            if epoch % check_accuracy_epochs == 0 :
                print('Epoch:', '%04d' % epoch, 'validation cost =', '{:.9f}'.format(vs_cost),
                      'validation measure =', '{:.9f}'.format(vs_measure), dt.datetime.now())

                if break_accuracy > 0 and break_accuracy < vs_cost :
                    print('Stops the training due to validation loss', vs_cost, ' exceeded the criteria', break_accuracy)
                    training_costs = training_costs[:epoch + 1] # strip un-run epochs
                    validation_costs = validation_costs[:epoch + 1] # strip un-run epochs
                    validation_measures = validation_measures[:epoch + 1] # strip un-run epochs
                    break

            if early_stopping_epoch_on_max_no_decrease > 0 :
                if vs_cost < min_cost :
                    min_cost = vs_cost
                    no_cost_decrease_epochs = 0
                else :
                    no_cost_decrease_epochs = no_cost_decrease_epochs + 1
                    if no_cost_decrease_epochs >= early_stopping_epoch_on_max_no_decrease :
                        # FIXME : in reality, i need to restore variables saved when it was not decreasing but i do not. maybe in the future ..
                        print('Stops the training since cost is not reduced during ', no_cost_decrease_epochs, ' epochs.')
                        training_costs = training_costs[:epoch + 1] # strip un-run epochs
                        validation_costs = validation_costs[:epoch + 1] # strip un-run epochs
                        validation_measures = validation_measures[:epoch + 1] # strip un-run epochs
                        break

        end_time = dt.datetime.now()
        print('Training(learning) Finished!', end_time)
        print('Training took ', '%10d' % ((end_time - start_time).total_seconds()),
              ' seconds.')
   
        return training_costs, validation_costs, validation_measures
                

    def test(self, X, Y) :
        start_time = dt.datetime.now()
        g = tf.get_default_graph()

        with g.as_default() :
            vals = self._test_model(X, Y)
            end_time = dt.datetime.now()
            print('Prediction took ', '%10d' % ((end_time - start_time).total_seconds()),
                  ' seconds.')
            print('Started at ', start_time, ' and finished at ', end_time)
            return vals

    def _test_model(self, X, Y) :
        test_feed_dict = {self.X: X, self.Y: Y,
                          self.p_keep_prob:1.0, self.p_training:False}
        test_targets = [ self.hypothesis, self.cost, self.measure ]

        sess = self.session
        hyps, cost, measure = sess.run(test_targets, feed_dict=test_feed_dict)
        return hyps, cost, measure

    def infer(self, X) :
        g = tf.get_default_graph()

        with g.as_default() :
            vals = self._infer_model(X)
            return vals

    def _infer_model(self, X) :
        test_feed_dict = {self.X: X,
                          self.p_keep_prob:1.0, self.p_training:False}
        test_targets = [ self.hypothesis ]

        sess = self.session
        hyps = sess.run(test_targets, feed_dict=test_feed_dict)
        return hyps
    
    def r_squared(self, y, h) :
        # in tf.reduce_mean, if axis has no entries, all dimensions are reduced, and a tensor with a single element is returned
        total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y, 0))))  # reduce_mean by 0-axis maintains vector dimension
        unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, h)))
        r_squared = tf.subtract(1.0, tf.div(unexplained_error, total_error))
        return r_squared

    def smape(self, y, h) :
        return tf.reduce_mean(2.0 * tf.abs(tf.subtract(y, h)) / tf.maximum(1e-7, (tf.abs(y) + tf.abs(h)))) # tf.maximum is used to avoid nan
        
    def check_nan(self, value) :
        return value is None or math.isnan(value)

    def save_model(self, save_file_name) :
        # self._dump_graph('save_model(' + save_file_name + ')')
        
        tf.train.Saver().save(self.session, save_file_name)

    def _dump_graph(self, where) :
        print('')

        print('--- dumping tensorflow graph [', where, '] ---')
        g = tf.get_default_graph()
        print('default tf graph :', g)

        # debug graphs
        keys = g.get_all_collection_keys()
        print('current name scope :', g.get_name_scope())
        for key in keys :
            print('all graph (', key, ')  :', g.get_collection(key))
        print('') 
        print('')

       
    def restore_model(self, saved_dir) :
        print('saved dir:', saved_dir)

        with self.session.as_default() :
            # self._dump_graph('restore_model(' + saved_dir + ')')
            
            reader = tf.train.NewCheckpointReader(saved_dir)
            # for var_name in reader.get_variable_to_shape_map() :
            #     print(var_name)
        
            tf.train.Saver().restore(self.session, saved_dir)


In [4]:
# util

TEST_PERCENT = 0.2

def shuffle_XY(X, Y) :
    hstacked = np.hstack((X, Y))
    np.random.shuffle(hstacked)
    _, new_X, new_Y = np.split(hstacked, (0, X.shape[1]), axis=-1)
    return new_X, new_Y


In [4]:
# train behavior cloning policies

for gym_env in gym_envs :
    print('starting ', gym_env, dt.datetime.now())
    observations, actions = load_expert_data(gym_env)
    obs_shape, action_shape = np.shape(observations), np.shape(actions)
    if action_shape[1] == 1 :
        actions = np.reshape(actions, (action_shape[0], action_shape[2]))
        action_shape = np.shape(actions)
    print(gym_env, ' observation shape: ', obs_shape, ', actions shape:', action_shape)
    # for i in range(2) :
    #     print('observation:', observations[i])
    #     print('actions:', actions[i])
    cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env)
    
    n_samples = observations.shape[0]
    n_train = int(n_samples * (1 - TEST_PERCENT))

    observations, actions = shuffle_XY(observations, actions)
    training_costs, validation_costs, validation_measures = cloning.train(observations[:n_train], actions[:n_train])
    
    gym_env_model = 'model_' + gym_env
    cloning.save_model(gym_env_model)
    
    cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env, restore_mode=True)
    cloning.restore_model(gym_env_model)
    
    test_hyps, test_costs, test_measures = cloning.test(observations[n_train:], actions[n_train:])
    print('ending ', gym_env, dt.datetime.now())
    
    

starting  Ant-v2 2019-07-07 22:42:44.316212
Ant-v2  observation shape:  (19992, 111) , actions shape: (19992, 8)


W0707 22:42:47.032993 50204 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0707 22:42:47.034992 50204 deprecation.py:323] From <ipython-input-3-fc99626fbdc6>:79: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
W0707 22:42:47.263379 50204 deprecation.py:323] From <ipython-input-3-fc99626fbdc6>:80: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W0707 22:42:47.366063 50204 deprecation.py:323] From C:\Works\tensorflow\lib\site-packages\ten

Learning starts. It will take some time... 2019-07-07 22:42:47.639329
Epoch: 0000 average training cost = 0.802247047 validation cost = 0.619904339 validation measure = -7.028655052 2019-07-07 22:42:48.765364
Epoch: 0500 average training cost = 0.060373325 validation cost = 0.018596660 validation measure = 0.759146452 2019-07-07 22:46:46.832457
Training(learning) Finished! 2019-07-07 22:50:53.710971
Training took         486  seconds.


W0707 22:50:54.270516 50204 deprecation.py:323] From C:\Works\tensorflow\lib\site-packages\tensorflow\python\training\saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


saved dir: model_Ant-v2
Ant-v2-dnn/Ant-v2-dnn/layer-0/kernel/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-0/bias/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-0/kernel/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-0/bias/Adadelta_1
Ant-v2-dnn/layer-last/kernel
Ant-v2-dnn/Ant-v2-dnn/layer-1/kernel/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-1/bias/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-1/kernel/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-1/bias/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-last/kernel/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-last/kernel/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-2/bias/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-2/bias/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-last/bias/Adadelta_1
Ant-v2-dnn/Ant-v2-dnn/layer-2/kernel/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-last/bias/Adadelta
Ant-v2-dnn/Ant-v2-dnn/layer-2/kernel/Adadelta_1
Ant-v2-dnn/layer-0/bias
Ant-v2-dnn/layer-0/kernel
Ant-v2-dnn/layer-1/bias
Ant-v2-dnn/layer-1/kernel
Ant-v2-dnn/layer-2/bias
Ant-v2-dnn/layer-2/kernel
Ant-v2-dnn/layer-last/bias
Prediction took       

In [5]:
# run_expert source code for reference

import tf_util
import pickle as pk
import traceback

def load_learned_policy_fn(gym_env, session=None) :
    observations, actions = load_expert_data(gym_env)
    obs_shape, action_shape = list(np.shape(observations)), list(np.shape(actions))
    if action_shape[1] == 1 :
        actions = np.reshape(actions, (action_shape[0], action_shape[2]))
        action_shape = list(np.shape(actions))

    gym_env_model = 'model_' + gym_env
    cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env, restore_mode=True, session=session)
    cloning.restore_model(gym_env_model)

    return lambda x : cloning.infer(x)
    
def rollout_by_policy(gym_env, max_timesteps, num_rollouts, policy_fn=None, render=False) :
    policy_type = 'learned'
    
    if policy_fn is None : # default policy_fn is expert policy
        print('loading and building expert policy')
        policy_fn = load_expert_policy_fn(gym_env)
        print('loaded and built')
        policy_type = 'expert'

    with tf.Session():
        tf_util.initialize()

        import gym
        env = gym.make(gym_env)
        max_steps = max_timesteps or env.spec.timestep_limit

        returns = []
        observations = []
        actions = []
        for i in range(num_rollouts):
            # print('iter', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0
            while not done:
                action = policy_fn(obs[None,:])
                observations.append(obs)
                actions.append(action)
                try :
                    if np.shape(action)[1] == 1 :
                        action_shape = np.shape(action)
                        action = np.reshape(action, (action_shape[0], action_shape[2]))
                    obs, r, done, _ = env.step(action) # observation, reward, done
                except ValueError as e :
                    print('action:', action, ', shape:', np.shape(action), ', policy_type:', policy_type)
                    print('actions:', actions, ', shape:', np.shape(actions))
                    traceback.print_exc()     

                totalr += r
                steps += 1
                if render:
                    env.render()
                # if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            returns.append(totalr)

        print('returns', returns)
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))

        rollout_data = {'observations': np.array(observations),
                        'actions': np.array(actions),
                        'returns': np.array(returns)}

        if not os.path.exists('rollout_data') :
            os.mkdir('rollout_data')
        with open(os.path.join('rollout_data', policy_type + '-' + gym_env + '.pkl'), 'wb') as f:
            pk.dump(rollout_data, f, pk.HIGHEST_PROTOCOL)
            
        return rollout_data, policy_type, env


In [11]:
# rollout and check

import pandas as pd

max_timesteps = None
num_rollouts = 200

df = pd.DataFrame(columns=['gymenv', 'policy_type', 'rollouts', 'mean', 'std'])

for gym_env in gym_envs :
    for expert_policy in [ True, False] :
        start_time = dt.datetime.now()
        rollout_data, policy_type, _ = rollout_by_policy(gym_env, max_timesteps, num_rollouts,
                                                      policy_fn=None if expert_policy else load_learned_policy_fn(gym_env),
                                                      render=False)
        returns = rollout_data['returns']
        end_time = dt.datetime.now()
        print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
        df = df.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
        print('It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
        


loading and building expert policy
obs (1, 111) (1, 111)
loaded and built
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
returns [4747.680326602361, 4741.33141203345, 4849.193871008658, 4833.752898733907, 4998.299597073863, 4865.028149508911, 4710.76965776187, 5002.620061411658, 4801.008480298248, 4884.905360052358, 4859.259532020084, 4851.707784397738, 4992.636609725978, 5082.002053765846, 5025.357253313969, 4740.592035374321, 4940.539756066792, 4740.010348598251, 4695.18690206555, 4909.017572131628, 4747.75311241958, 4817.893227044679, 4766.990389490045, 4644.289659625638, 4740.790944634136, 4722.294687296413, 4683.6788184581865, 4665.8446391249645, 4928.847069560956, 4820.923032515179, 4879.911593685575, 4624.070944791383, 4942.153330334369, 4755.480397902153, 4684.802168880333, 4826.7330093160335, 4794.9830279908965

In [12]:
df.head(20)

| | gymenv | policy_type | rollouts | mean | std |
|---|---|---|---|---|---|
| 0 | Ant-v2 | expert | 200 | 4772.544028 | 422.163975 |
| 1 | Ant-v2 | learned | 200 | 926.068148 | 140.414594 |
| 2 | Hopper-v2 | expert | 200 | 3778.515927 | 3.775791 |
| 3 | Hopper-v2 | learned | 200 | 148.669662 | 6.481258 |
| 4 | Reacher-v2 | expert | 200 | -4.125223 | 1.780242 |
| 5 | Reacher-v2 | learned | 200 | -12.930317 | 4.21162 |
| 6 | HalfCheetah-v2 | expert | 200 | 4144.304682 | 75.581012 |
| 7 | HalfCheetah-v2 | learned | 200 | 2244.335064 | 180.081137 |
| 8 | Humanoid-v2 | expert | 200 | 10357.171722 | 698.728618 |
| 9 | Humanoid-v2 | learned | 200 | 446.594071 | 56.643694 |


## DAgger

1. Implement DAgger. See the code provided in run_expert.py to see how to query the expert policy and perform roll-outs in the environment.

2. Run DAgger and report results on one task in which DAgger can learn a better policy than behavioral cloning.
Report your results in the form of a learning curve, plotting the number of DAgger iterations vs. the policy’s mean return,
with error bars to show the standard deviation.

Include the performance of the expert policy and the behavioral cloning agent on the same plot.
In the caption, state which task you used, and any details regarding network architecture, amount of data, etc. (as in the previous section).

### note
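1. DAgger needs an expert to label the visited states with actions. In general this is a human; here the provided expert policy plays that role.
1. The main idea is that the trajectories are collected by the learned policy, but the actions are relabeled by the expert policy.
1. DAgger addresses the problem of distributional “drift”: the learned policy visits states that the expert data never covered.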
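
The loop described in these notes can be sketched on a toy problem. Everything here is an assumption made for illustration: a 1-D state with dynamics s' = s + a, a hand-written `expert_policy`, and a linear least-squares learner in place of the neural network. It is not the implementation used in this notebook, only the collect / relabel / aggregate / retrain pattern.

```python
import numpy as np

def expert_policy(state):
    """Hypothetical expert: push the state toward zero with a bounded action."""
    return -np.clip(state, -1.0, 1.0)

def fit_linear_policy(states, actions):
    """Least-squares fit of action = w * state + b; returns a policy function."""
    A = np.stack([states, np.ones_like(states)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, actions, rcond=None)
    return lambda s: w * s + b

def rollout(policy, start, horizon=20):
    """Run `policy` in the toy dynamics s' = s + a and return the visited states."""
    states, s = [], start
    for _ in range(horizon):
        states.append(s)
        s = s + policy(s)
    return np.array(states)

rng = np.random.RandomState(0)

# Initial dataset: states visited by the expert, labeled by the expert (plain BC)
data_states = rollout(expert_policy, start=3.0)
data_actions = expert_policy(data_states)
policy = fit_linear_policy(data_states, data_actions)

for iteration in range(5):
    # 1. collect states with the *learned* policy
    visited = rollout(policy, start=rng.uniform(-3.0, 3.0))
    # 2. relabel those states with the expert's actions
    labels = expert_policy(visited)
    # 3. aggregate with all earlier data and retrain
    data_states = np.concatenate([data_states, visited])
    data_actions = np.concatenate([data_actions, labels])
    policy = fit_linear_policy(data_states, data_actions)

print('dataset size after DAgger:', len(data_states))
```

The key design point is that the action executed in the environment comes from the learned policy, while the action stored for training comes from the expert.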

In [6]:
import gym

def rollout_by_dagger(gym_env, max_timesteps, num_rollouts, num_epochs=50, render=False) :
    policy_type = 'dagger'
    
    print('loading and building learned policy')
    policy_fn = load_learned_policy_fn(gym_env)
    print('loaded and built')

    print('starting dagger ', gym_env, dt.datetime.now())
    train_observations, train_actions = load_expert_data(gym_env)
    obs_shape, action_shape = np.shape(train_observations), np.shape(train_actions)
    if action_shape[1] == 1 :
        train_actions = np.reshape(train_actions, (action_shape[0], action_shape[2]))
        action_shape = np.shape(train_actions)
    print(gym_env, ' observation shape: ', obs_shape, ', actions shape:', action_shape)
    
    with tf.Session(graph=tf.Graph()) as session, session.graph.as_default() : # for session nesting, the graphs should be isolated for each tf sessions
        print('loading and building expert policy for DAgger')
        expert_policy_fn = load_expert_policy_fn(gym_env)
        print('loaded and built for DAgger')

        tf_util.initialize()

        gym_env_model = 'model_' + gym_env
        gym_env_dagger_model = 'model_dagger_' + gym_env # new model file to save after lite training
        light_train_config = default_train_config.copy()
        light_train_config['num_epochs'] = num_epochs
        cloning_model = None

        env = gym.make(gym_env)
        max_steps = max_timesteps or env.spec.timestep_limit

        returns = []
        observations = []
        actions = []

        for i in range(num_rollouts):
            # print('iter', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0

            while not done:
                action = policy_fn(obs[None,:])
                # print('before append>>>> observations shape:', np.shape(observations), ', train_observations shape:', np.shape(train_observations), ', obs shape:', np.shape(obs))
                observations.append(obs)
                train_observations = np.append(train_observations, obs[None, :], axis=0)
                # print('after append>>>> observations shape:', np.shape(observations), ', train_observations shape:', np.shape(train_observations), ', obs shape:', np.shape(obs))

                expert_action = expert_policy_fn(obs[None,:]) # obs[None,:] adds a batch axis, so expert_action has shape (1, act_dim)
                actions.append(expert_action)
                # print('before append>>>> actions shape:', np.shape(actions), ', train_actions shape:', np.shape(train_actions), ', expert_action shape:', np.shape(expert_action))
                train_actions = np.append(train_actions, expert_action, axis=0)
                # print('after append>>>> actions shape:', np.shape(actions), ', train_actions shape:', np.shape(train_actions), ', expert_action shape:', np.shape(expert_action))

                try :
                    # the policy may return shape (1, 1, act_dim); drop the extra axis.
                    # a separate name keeps the dataset-level action_shape intact for BehavioralCloning
                    step_action_shape = np.shape(action)
                    if step_action_shape[1] == 1 :
                        action = np.reshape(action, (step_action_shape[0], step_action_shape[2]))
                    obs, r, done, _ = env.step(action) # observation, reward, done
                except ValueError as e :
                    print('action:', action, ', shape:', np.shape(action), ', policy_type:', policy_type)
                    print('actions:', actions, ', shape:', np.shape(actions))
                    print('expert_action:', expert_action, ', shape:', np.shape(expert_action))
                    traceback.print_exc()
                    raise # without a successful step, r and obs would be undefined or stale

                totalr += r
                steps += 1
                if render:
                    env.render()
                # if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            returns.append(totalr)

            # retrain after every new rollout
            n_samples = train_observations.shape[0]
            n_train = int(n_samples * (1 - TEST_PERCENT))

            print('train_observations shape:', train_observations.shape, ', train_actions shape:', train_actions.shape)
            print('observations shape:', np.shape(observations), ', actions shape:', np.shape(actions))
            try :
                train_observations, train_actions = shuffle_XY(train_observations, train_actions)
            except IndexError as e :
                print('train_observations:', train_observations, ', train_actions:', train_actions)
                traceback.print_exc() 

            print('train input : train_observations shape:', train_observations[:n_train].shape, ', train_actions shape:', train_actions[:n_train].shape)

            with tf.Session(graph=tf.Graph()) as nested_session, nested_session.graph.as_default():
                saved_model = gym_env_model if cloning_model is None else gym_env_dagger_model
                cloning_model = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape,
                                                  scope_name=gym_env, restore_mode=True, session=nested_session)
                cloning_model.restore_model(saved_model)        
                training_costs, validation_costs, validation_measures = cloning_model.train(train_observations[:n_train], train_actions[:n_train],
                                                                                           train_config = light_train_config)
                cloning_model.save_model(gym_env_dagger_model)

        with tf.Session(graph=tf.Graph()) as nested_session, nested_session.graph.as_default():
            cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape,
                                        scope_name=gym_env, restore_mode=True, session=nested_session)
            cloning.restore_model(gym_env_dagger_model)
            test_hyps, test_costs, test_measures = cloning.test(train_observations[n_train:], train_actions[n_train:])

        print('returns', returns)
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))

        rollout_data = {'observations': np.array(observations),
                        'actions': np.array(actions),
                        'returns': np.array(returns)}

        if not os.path.exists('rollout_data') :
            os.mkdir('rollout_data')
        with open(os.path.join('rollout_data', policy_type + '-' + gym_env + '.pkl'), 'wb') as f:
            pk.dump(rollout_data, f, pk.HIGHEST_PROTOCOL)

        return rollout_data, policy_type, env
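
The inner loop above calls `np.append` once per timestep, and each call copies the whole training array. A possible alternative, sketched here under the assumption that the shapes match the log output (observations of shape `(obs_dim,)`, expert actions of shape `(1, act_dim)`), is to buffer one rollout in Python lists and concatenate once per rollout. The helper name `aggregate_rollout` is hypothetical and not part of the starter code.

```python
import numpy as np

def aggregate_rollout(train_observations, train_actions,
                      rollout_observations, rollout_expert_actions):
    """Append one rollout's (observation, expert action) pairs with a single copy."""
    new_obs = np.stack(rollout_observations)                   # T x (obs_dim,)   -> (T, obs_dim)
    new_act = np.concatenate(rollout_expert_actions, axis=0)   # T x (1, act_dim) -> (T, act_dim)
    return (np.concatenate([train_observations, new_obs], axis=0),
            np.concatenate([train_actions, new_act], axis=0))

# Small stand-in arrays; the real data would come from load_expert_data and the rollout loop.
train_obs = np.zeros((10, 111))
train_act = np.zeros((10, 8))
rollout_obs = [np.ones(111) for _ in range(5)]
rollout_act = [np.ones((1, 8)) for _ in range(5)]
train_obs, train_act = aggregate_rollout(train_obs, train_act, rollout_obs, rollout_act)
print(train_obs.shape, train_act.shape)
```

This does not change what is trained on; it only reduces the number of array copies from one per timestep to one per rollout.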


In [14]:
max_timesteps = None
num_rollouts = 10 # kept small because retraining after every rollout is slow

for gym_env in gym_envs :
    start_time = dt.datetime.now()
    rollout_data, policy_type, _ = rollout_by_dagger(gym_env, max_timesteps, num_rollouts,
                                                    render=False)
    returns = rollout_data['returns']
    end_time = dt.datetime.now()
    print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
    df = df.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
    print('It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')


loading and building learned policy
saved dir: model_Ant-v2
loaded and built
starting dagger  Ant-v2 2019-07-09 18:54:32.707609
Ant-v2  observation shape:  (19992, 111) , actions shape: (19992, 8)
loading and building expert policy for DAgger
obs (1, 111) (1, 111)
loaded and built for DAgger
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
train_observations shape: (20992, 111) , train_actions shape: (20992, 8)
observations shape: (1000, 111) , actions shape: (1000, 1, 8)
train input : train_observations shape: (16793, 111) , train_actions shape: (16793, 8)
saved dir: model_Ant-v2
Learning starts. It will take some time... 2019-07-09 18:54:40.918082
Epoch: 0000 average training cost = 0.040187810 validation cost = 0.018603407 validation measure = 0.763998151 2019-07-09 18:54:41.572289
Training(learning) Finished! 2019-07-

In [15]:
df.head(30)

| | gymenv | policy_type | rollouts | mean | std |
|---|---|---|---|---|---|
| 0 | Ant-v2 | expert | 200 | 4772.544028 | 422.163975 |
| 1 | Ant-v2 | learned | 200 | 926.068148 | 140.414594 |
| 2 | Hopper-v2 | expert | 200 | 3778.515927 | 3.775791 |
| 3 | Hopper-v2 | learned | 200 | 148.669662 | 6.481258 |
| 4 | Reacher-v2 | expert | 200 | -4.125223 | 1.780242 |
| 5 | Reacher-v2 | learned | 200 | -12.930317 | 4.21162 |
| 6 | HalfCheetah-v2 | expert | 200 | 4144.304682 | 75.581012 |
| 7 | HalfCheetah-v2 | learned | 200 | 2244.335064 | 180.081137 |
| 8 | Humanoid-v2 | expert | 200 | 10357.171722 | 698.728618 |
| 9 | Humanoid-v2 | learned | 200 | 446.594071 | 56.643694 |
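
One way to read the table above is to express each learned policy's mean return as a fraction of the expert's. The sketch below copies the mean values from the table; the helper `fraction_of_expert` is a hypothetical name, and Reacher-v2 is left out because its returns are negative, so a simple ratio is not meaningful there.

```python
# Mean returns copied from the table above: (expert, learned).
mean_returns = {
    "Ant-v2": (4772.544028, 926.068148),
    "Hopper-v2": (3778.515927, 148.669662),
    "HalfCheetah-v2": (4144.304682, 2244.335064),
    "Humanoid-v2": (10357.171722, 446.594071),
}

def fraction_of_expert(expert_mean, learned_mean):
    # Only meaningful when both returns are positive.
    return learned_mean / expert_mean

ratios = {env: fraction_of_expert(*pair) for env, pair in mean_returns.items()}
best_env = max(ratios, key=ratios.get)

for env, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{env:15s} {ratio:.3f}")
print("closest to expert:", best_env)
```

A ratio like this is only a rough summary; the standard deviations in the table still matter when judging whether two policies are comparable.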


## Roll out and render the policies (enjoy the visuals)

In [None]:
# rollout and check
from time import sleep
import pandas as pd
import glfw

max_timesteps = None
num_rollouts = 5

df2 = pd.DataFrame(columns=['gymenv', 'policy_type', 'rollouts', 'mean', 'std'])

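def close_mujoco_window(env) :
    # takes the gym environment returned by a rollout and destroys its MuJoCo viewer window, if one is open
    if env.unwrapped.viewer is not None :
        glfw.destroy_window(env.unwrapped.viewer.window)
        env.unwrapped.viewer = None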
    
for gym_env in gym_envs :
    for expert_policy in [ True, False] :
        start_time = dt.datetime.now()
        rollout_data, policy_type, opengym_win = rollout_by_policy(gym_env, max_timesteps, num_rollouts,
                                                                  policy_fn=None if expert_policy else load_learned_policy_fn(gym_env),
                                                                  render=True)
        returns = rollout_data['returns']
        end_time = dt.datetime.now()
        print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
        df2 = df2.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
        print(gym_env, '-', policy_type, '. It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
        input("Press Enter to continue...")
        close_mujoco_window(opengym_win)

    start_time = dt.datetime.now()
    rollout_data, policy_type, opengym_win = rollout_by_dagger(gym_env, max_timesteps, num_rollouts, num_epochs=0,
                                                              render=True)
    returns = rollout_data['returns']
    end_time = dt.datetime.now()
    print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
    df2 = df2.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
    print(gym_env, '-', policy_type, '. It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
    input("Press Enter to continue...")
    close_mujoco_window(opengym_win)
        

loading and building expert policy
obs (1, 111) (1, 111)
loaded and built


Creating window glfw
returns [4784.474920335864, 4713.279314667821, 4809.302924908928, 4914.444299500961, 4809.060126767605]
mean return 4806.112317236235
std of return 64.5613371147942
Rollout result. env: Ant-v2 , policy_type: expert , returns: 5 / 4806.112317236235 / 64.5613371147942
Ant-v2 - expert . It took          66  seconds.


saved dir: model_Ant-v2


Creating window glfw
returns [896.5271234364158, 886.5595075282029, 1044.8450915467165, 934.9606952685215, 1033.3468939267186]
mean return 959.247862341315
std of return 67.26798142975883
Rollout result. env: Ant-v2 , policy_type: learned , returns: 5 / 959.247862341315 / 67.26798142975883
Ant-v2 - learned . It took          63  seconds.


loading and building learned policy




saved dir: model_Ant-v2
loaded and built
starting dagger  Ant-v2 2019-07-09 19:48:43.320826
Ant-v2  observation shape:  (19992, 111) , actions shape: (19992, 8)
loading and building expert policy for DAgger
obs (1, 111) (1, 111)
loaded and built for DAgger
Creating window glfw
train_observations shape: (20992, 111) , train_actions shape: (20992, 8)
observations shape: (1000, 111) , actions shape: (1000, 1, 8)
train input : train_observations shape: (16793, 111) , train_actions shape: (16793, 8)
saved dir: model_Ant-v2
Learning starts. It will take some time... 2019-07-09 19:49:01.376691
Training(learning) Finished! 2019-07-09 19:49:01.376691
Training took           0  seconds.
train_observations shape: (21992, 111) , train_actions shape: (21992, 8)
observations shape: (2000

In [None]:
df2.head(30)
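
The "std of return" values in the log above come from `np.std`, which by default computes the population standard deviation (`ddof=0`). With only 5 rollouts the sample standard deviation (`ddof=1`) is noticeably larger, so it may be worth stating which one a table reports. The sketch below rechecks the Ant-v2 expert numbers from the log using only the standard library; the variable names are illustrative.

```python
import statistics

# Returns of the 5 rendered Ant-v2 expert rollouts, copied from the log above.
returns = [4784.474920335864, 4713.279314667821, 4809.302924908928,
           4914.444299500961, 4809.060126767605]

mean_return = statistics.mean(returns)
population_std = statistics.pstdev(returns)  # same definition as np.std(returns)
sample_std = statistics.stdev(returns)       # same definition as np.std(returns, ddof=1)

print('mean return', mean_return)
print('population std', population_std)
print('sample std', sample_std)
```

For the 200-rollout runs in the earlier table the two definitions differ far less, since the correction factor shrinks as the number of rollouts grows.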

##  Bonus: Alternative Policy Architectures

1. (Optional) Experiment with a different policy architecture, e.g. using recurrence or changing the size or nonlinearities used.

Compare performance between your new and original policy architectures using behavioral cloning and/or DAgger,
and report your results in the same form as above, with a caption describing what you did.
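
When comparing architectures, it can help to make the hidden sizes and the nonlinearity explicit parameters and to report the parameter count next to each result. The following is only a NumPy sketch of a feed-forward policy under that assumption; `init_mlp`, `mlp_policy`, and `count_parameters` are hypothetical helpers, and this is not the `BehavioralCloning` network used above. The dimensions 111 and 8 are the Ant-v2 observation and action sizes from the log.

```python
import numpy as np

ACTIVATIONS = {
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(x, 0.0),
}

def init_mlp(obs_dim, act_dim, hidden_sizes, rng):
    # One (weights, bias) pair per layer, with fan-in scaled random weights.
    sizes = [obs_dim, *hidden_sizes, act_dim]
    return [(rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_policy(params, obs, activation="tanh"):
    # Hidden layers use the chosen nonlinearity; the output layer stays linear
    # because the actions are continuous.
    act_fn = ACTIVATIONS[activation]
    h = obs
    for W, b in params[:-1]:
        h = act_fn(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def count_parameters(params):
    return sum(W.size + b.size for W, b in params)

rng = np.random.default_rng(0)
small = init_mlp(111, 8, (64, 64), rng)
wide = init_mlp(111, 8, (256, 256), rng)
obs = np.zeros((1, 111))
print(mlp_policy(small, obs, "relu").shape, count_parameters(small), count_parameters(wide))
```

Reporting the parameter count alongside the mean and standard deviation of the return makes it easier to judge whether a comparison between two architectures is fair.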
