## Assignments
1. Homework 1: Imitation learning (control via supervised learning)
2. Homework 2: Policy gradients (“REINFORCE”)
3. Homework 3: Q learning and actor-critic algorithms
4. Homework 4: Model-based reinforcement learning
5. Homework 5: Advanced model-free RL algorithms
6. Final project: Research-level project of your choice (form a group of up to 2-3 students, you’re welcome to start early!)

##### Emacs IPython Notebook Commands/Keybinds
* http://millejoh.github.io/emacs-ipython-notebook/#commands-keybinds


# Homework 1 Imitation Learning

Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's?

If this were then subjected to an appropriate course of education one would obtain the adult brain.

\- Alan Turing



## Behavioral Cloning

1. The starter code provides an expert policy for each of the MuJoCo tasks in OpenAI Gym (See run expert.py). Generate roll-outs from the provided policies, and implement behavioral cloning. => expert_data/XXX.pkl

2. Run behavioral cloning (BC) and report results on two tasks
 – one task where a behavioral cloning agent achieves comparable performance to the expert,
 and one task where it does not.
 When providing results, report the mean and standard deviation of the return over multiple rollouts in a table, and state which task was used.
 Be sure to set up a fair comparison, in terms of network size, amount of data, and number of training iterations, and provide these details (and any others you feel are appropriate) in the table caption.

3. Experiment with one hyperparameter that affects the performance of the behavioral cloning agent, such as
* the number of demonstrations,
* the number of training epochs,
* the variance of the expert policy, or
* something that you come up with yourself.
 For one of the tasks used in the previous question, show a graph of how the BC agent’s performance varies with the value of this hyperparameter, and state the hyperparameter and a brief rationale for why you chose it in the caption for the graph.



In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '1' # 0 is default GPU

In [None]:

gym_envs = ('Ant-v2', 'Hopper-v2', 'Reacher-v2', 'HalfCheetah-v2', 'Humanoid-v2', 'Walker2d-v2')

num_epochs_per_envs = {'Ant-v2' : 3000, 'Hopper-v2': 2000, 'Reacher-v2' : 3000, 'HalfCheetah-v2' : 2000, 'Humanoid-v2' : 3000, 'Walker2d-v2' : 3000 }
num_rollouts_per_envs = {'Ant-v2' : 500, 'Hopper-v2': 500, 'Reacher-v2' : 2500, 'HalfCheetah-v2' : 500, 'Humanoid-v2' : 500, 'Walker2d-v2' : 500 }

NUM_ROLLOUTS = 500
MAX_TIMESTEPS = None # no check on max timesteps



## running experts

run experts of each gym environments

In [None]:
# run expert

import sys, os
import datetime as dt
import run_expert

# https://ipython.readthedocs.io/en/stable/interactive/magics.html
# path=os.environ['PATH']
# %env PATH='/usr/local/bin:'+path

RENDER = False

for gym_env in gym_envs :
    sys.argv = ['run_expert.py', 'experts/' + gym_env + '.pkl', gym_env, '--num_rollouts', str(num_rollouts_per_envs[gym_env]) ]
    if RENDER :
        sys.argv.append('--render')
    run_expert.main()
    print('finished run_expert ', gym_env, 'at', dt.datetime.now())

print('finished run_expert on all gym_envs at', dt.datetime.now())


## train on each envs

using the data gathered by expert policy

* environment details : https://github.com/openai/gym/tree/master/gym/envs/mujoco/assets
* source codes of each environments : https://github.com/openai/gym/blob/master/gym/envs/mujoco/
* reference for an HW1 implementation :  https://hollygrimm.com/rl_bc

for the regressor
input : observation
output : action


In [3]:
from load_policy import load_policy
import pickle as pk
import numpy as np
import os

def load_expert_data(gymenv) -> (np.array, np.array) : # observations, actions
    with open(os.path.join('expert_data', gymenv + '.pkl'), 'rb') as f :
        expert_data = pk.load(f)
        return expert_data['observations'], expert_data['actions']

def load_expert_policy_fn(gymenv) :
    return load_policy('experts/' + gymenv + '.pkl')


In [4]:
# scale (1) minmax (2) normal_dist (3) preserve_sign

def scale_minmax(data, minv=None, maxv=None):
    if minv == -1 and maxv == -1 : # no scaling
        return data, -1, -1
    
    if minv is None :
        minv = np.min(data, 0)
    if maxv is None :
        maxv = np.max(data, 0)
    ''' Min Max Normalization

    Parameters
        ----------
        data : numpy.ndarray
        input data to be normalized
        shape: [Batch size, dimension]

    Returns
        ----------
        data : numpy.ndarry
        normalized data
        shape: [Batch size, dimension]

    References
        ----------
        .. [1] http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

    '''
    numerator = data - minv
    denominator = maxv - minv
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7), minv, maxv

def descale_minmax(data, minv, maxv):
    if minv is None or maxv is None :
        return data
    if minv == -1 and maxv == -1 : # no scaling
        return data
    
    # noise term prevents the zero division
    return data * (maxv - minv + 1e-7) + minv

def scale_signed(data, minv=None, maxv=None): # value 0 is preserved even after rescale
    if minv == -1 and maxv == -1 : # no scaling
        return data, -1, -1

    if maxv is None :
        maxv = np.max(np.abs(data), 0)
    
    numerator = data
    denominator = maxv
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7), 0.0, maxv

def descale_signed(data, minv, maxv): # value 0 is preserved even after rescale
    if minv is None or maxv is None :
        return data
    if minv == -1 and maxv == -1 : # no scaling
        return data
    
    return data * (maxv + 1e-7)


## Behavior Cloning

1. generate rollouts(= expert data) with expert policy (and record the returns)
2. learn the rollouts changing some environments (network size, amount of data, and number of training iterations, ...)
3. generate rollouts several times according to each policies learned above and show the returns in a table


In [5]:
import tensorflow as tf
from enum import Enum, IntEnum
import numpy as np
import random
import datetime as dt
import time
import math
import pickle as pk
import os

default_model_config = dict(neurons = [400, 200],
                            activation = tf.nn.elu, # Using ReLu, which is a discontinuous function, may raise issues. Try using other activation functions, such as tanh or sigmoid.
                            last_activation = None, # final layer activation function. default is no activation
                            optimizer = tf.train.AdadeltaOptimizer, # tf.train.AdamOptimizer, tf.train.ProximalAdagradOptimizer
                            cost_function = tf.losses.mean_squared_error, # tf.losses.huber_loss (robust to outlier)
                            measure_function = 'r_squared', # 'smape' means symmetric_mean_absolute_percentage_error
)

default_train_config = dict(start_learning_rate = 0.001,
                            # minimum_learning_rate = 0.000001,
                            num_epochs = 2000,
                            batch_size = 100,
                            keep_prob = 0.5, # for training only (dropout for hidden layer)
                            keep_prob_input = 0.8, # for training only (dropout for input layer)
                            validationset_percent = 0.2, # by default 20 percent is validation set
                            break_accuracy = 0.99, # -1.0, # 0.999, # -1.0
                            early_stopping_epoch_on_max_no_decrease = 500,
                            shuffle_samples_epochs = 200, # shuffle samples per given epochs considering performance. -1 means no shuffling
                            check_accuracy_epochs = 200, # 5000,
                            use_tboard = True,
                            print_cost_interval = 500,
                            print_trained_model = False,
                            )

class BehavioralCloning(object) :
    default_random_seed = 777

    def __init__(self,
                 X_shape = None, # X shape as list
                 Y_shape = None, # Y shape as list
                 model_config = default_model_config,
                 scope_name = '',
                 restore_mode=False,
                 session=None) :
        self.model_config = model_config
        self.restore_mode = restore_mode
        self.scope_name = scope_name
        self.X_shape = list(X_shape)
        self.X_shape[0] = None
        self.Y_shape = list(Y_shape)
        self.Y_shape[0] = None

        tf.set_random_seed(BehavioralCloning.default_random_seed)  # reproducibility
        np.random.seed(BehavioralCloning.default_random_seed)

        # Launch new session before graph init
        # interactive session will declare itself as a default session and won't be closed on context destroy (so, should explicity call sess.close()
        if session is None :
            tf.reset_default_graph()
            self.session = tf.InteractiveSession()
        else :
            self.session = session
        self._build_network()

    def _build_network(self) :
        g = tf.get_default_graph()

        # build the network
        with g.as_default(), self.session.as_default() :
            self.X = tf.placeholder(tf.float32, shape=self.X_shape, name='X')
            self.Y = tf.placeholder(tf.float32, shape=self.Y_shape, name='Y')
            self.p_keep_prob = tf.placeholder(tf.float32, name='p_keep_prob')
            self.p_keep_prob_input = tf.placeholder(tf.float32, name='p_keep_prob_input')
            self.p_training = tf.placeholder(tf.bool, name='p_training')
            self.p_lr = tf.placeholder(tf.float32, name='learning_rate')

            with tf.variable_scope(self.scope_name + '-dnn', reuse=tf.AUTO_REUSE) as scope:
                neurons = self.model_config['neurons']
                layer = self.X
                layer = tf.layers.dropout(layer, rate=1-self.p_keep_prob_input, training=self.p_training)
                for i in range(len(neurons)) :
                    neuron = neurons[i]

                    layer = tf.layers.dense(layer, neuron,
                                            kernel_initializer = tf.contrib.layers.xavier_initializer(seed=BehavioralCloning.default_random_seed),
                                            activation=self.model_config['activation'],
                                            name = 'layer-' + str(i))
                    layer = tf.layers.dropout(layer, rate=1-self.p_keep_prob, training=self.p_training)
                n_output = self.Y_shape[1]
                layer = tf.layers.dense(layer, n_output,
                                        kernel_initializer = tf.contrib.layers.xavier_initializer(seed=BehavioralCloning.default_random_seed),
                                        activation=self.model_config['last_activation'],
                                        name = 'layer-last')
                    

                self.hypothesis = layer
                cost_fn = self.model_config['cost_function']
                self.cost = cost_fn(self.Y, self.hypothesis)
                tf.summary.scalar("cost", self.cost)
                measure_alg = self.model_config['measure_function']
                if measure_alg == 'r_squared' :
                    self.measure = self.r_squared(self.Y, self.hypothesis)
                elif measure_alg == 'smape' :
                    self.measure = self.smape(self.Y, self.hypothesis)
                else :
                    self.measure = None
                optimizer_fn = self.model_config['optimizer']
                opt = optimizer_fn(learning_rate=self.p_lr)
                self.objective_tensor = opt.minimize(self.cost)

            if not self.restore_mode :
                self.session.run(tf.global_variables_initializer())


    def train(self, X, Y, rescale_factor=None, train_config = default_train_config, scale_fn=scale_minmax) :
        learning_rate = train_config['start_learning_rate']
        num_epochs = train_config['num_epochs']
        keep_prob = train_config['keep_prob']
        keep_prob_input = train_config['keep_prob_input']
        batch_size = train_config['batch_size']
        vset_percent = train_config['validationset_percent']
        break_accuracy = train_config['break_accuracy']
        check_accuracy_epochs = train_config['check_accuracy_epochs']
        early_stopping_epoch_on_max_no_decrease = train_config['early_stopping_epoch_on_max_no_decrease']
        print_cost_interval = train_config['print_cost_interval']
        shuffle_samples_epochs = train_config['shuffle_samples_epochs']
        use_tboard = train_config['use_tboard']

        training_costs = np.zeros(num_epochs, dtype=np.float32)
        validation_costs = np.zeros(num_epochs, dtype=np.float32)
        validation_measures = np.zeros(num_epochs, dtype=np.float32)
        min_cost = np.inf
        no_cost_decrease_epochs = 0

        if rescale_factor is None :
            minx, maxx, miny, maxy = None, None, None, None
        else :
            minx, maxx, miny, maxy = rescale_factor['minx'], rescale_factor['maxx'], rescale_factor['miny'], rescale_factor['maxy']
            
        X, minx, maxx = scale_fn(X, minv=minx, maxv=maxx) # rescale X
        Y, miny, maxy = scale_fn(Y, minv=miny, maxv=maxy) # rescale Y
        
        n_samples = X.shape[0]
        n_features = X.shape[1]
        n_output = Y.shape[1]
        n_train = int(n_samples * (1 - vset_percent))
        n_validate = n_samples - n_train

        batch_loop = (n_train - 1) // batch_size + 1

        sess = self.session
        if use_tboard :
            merged_summary = tf.summary.merge_all()
            writer = tf.summary.FileWriter("./tboard_logs")
            writer.add_graph(sess.graph)  # Show the graph
        else :
            merged_summary = None

        train_X = X[:n_train]
        train_Y = Y[:n_train]
        validate_X = X[n_train:]
        validate_Y = Y[n_train:]

        if shuffle_samples_epochs > 0 :
            current_XY = np.hstack((X, Y))

        start_time = dt.datetime.now()
        print('Learning starts. It will take some time...', start_time)
        for epoch in range(num_epochs):
            shuffle_samples = shuffle_samples_epochs > 0 and epoch % shuffle_samples_epochs == 0 # shuffle on 0th epoch
            
            if shuffle_samples :
                np.random.shuffle(current_XY) # this will shuffle current_XY in place.
                _, shuffled_X, shuffled_Y = np.split(current_XY, (0, n_features), axis=-1)
                train_X = shuffled_X[:n_train]
                train_Y = shuffled_Y[:n_train]
                validate_X = shuffled_X[n_train:]
                validate_Y = shuffled_Y[n_train:]

            epoch_hyps = np.zeros(Y.shape, dtype=np.float32)
            epoch_costs = np.zeros(batch_loop, dtype=np.float32)

            for m in range(batch_loop) :
                if m == batch_loop - 1 :
                    m_X = train_X[batch_size * m :]
                    m_Y = train_Y[batch_size * m :]
                else :
                    m_X = train_X[batch_size * m : batch_size * (m + 1)]
                    m_Y = train_Y[batch_size * m : batch_size * (m + 1)]

                feed_dict = {self.X:m_X, self.Y:m_Y,
                             self.p_keep_prob:keep_prob,
                             self.p_keep_prob_input:keep_prob_input,
                             self.p_lr:learning_rate,
                             self.p_training:True}
                targets = [ self.hypothesis, self.cost, self.objective_tensor ]
                if use_tboard :
                    targets.append(merged_summary)
                # print('m:', m, ', m_X:', np.shape(m_X), ', m_Y:', np.shape(m_Y), ', feed_dict:', feed_dict)
                results = sess.run(targets, feed_dict = feed_dict)
                if use_tboard :
                    writer.add_summary(results[-1], global_step = epoch * batch_loop + m)

                h_value = results[0]
                epoch_hyps[batch_size * m : batch_size * m + m_Y.shape[0]] = h_value
                cost_value = results[1]
                epoch_costs[m] = cost_value

            training_costs[epoch] = avg_cost = np.mean(epoch_costs)

            validate_feed_dict = {self.X: validate_X, self.Y: validate_Y,
                                  self.p_keep_prob:1.0, self.p_keep_prob_input:1.0, self.p_training:False}
            validate_targets = [ self.hypothesis, self.cost, self.measure ]
            vs_hyps, vs_cost, vs_measure = sess.run(validate_targets, feed_dict=validate_feed_dict)
            validation_costs[epoch] = vs_cost
            validation_measures[epoch] = vs_measure

            if epoch % print_cost_interval == 0 or epoch == num_epochs - 1:
                print('Epoch:', '%04d' % epoch, 'average training cost =', '{:.9f}'.format(avg_cost),
                      'validation cost =', '{:.9f}'.format(vs_cost), 'validation measure =', '{:.9f}'.format(vs_measure), dt.datetime.now())

            if epoch % check_accuracy_epochs == check_accuracy_epochs :
                print('Epoch:', '%04d' % epoch, 'average training cost =', '{:.9f}'.format(avg_cost),
                      'validation cost =', '{:.9f}'.format(vs_cost),
                      'validation measure =', '{:.9f}'.format(vs_measure), dt.datetime.now())

                if break_accuracy > 0 and break_accuracy >= vs_measure :
                    print('Stops the training due to high validation measure', vs_measure, ' exceeded the criteria', break_accuracy)
                    training_costs = training_costs[:epoch + 1] # strip un-run epochs
                    validation_costs = validation_costs[:epoch + 1] # strip un-run epochs
                    validation_measures = validation_measures[:epoch + 1] # strip un-run epochs
                    break

            if early_stopping_epoch_on_max_no_decrease > 0 :
                if vs_cost < min_cost :
                    min_cost = vs_cost
                    no_cost_decrease_epochs = 0
                else :
                    no_cost_decrease_epochs = no_cost_decrease_epochs + 1
                    if no_cost_decrease_epochs >= early_stopping_epoch_on_max_no_decrease :
                        print('Epoch:', '%04d' % epoch, 'average training cost =', '{:.9f}'.format(avg_cost),
                              'validation cost =', '{:.9f}'.format(vs_cost),
                              'validation measure =', '{:.9f}'.format(vs_measure), dt.datetime.now())
                        # FIXME : in reality, i need to restore variables saved when it was not decreasing but i do not. maybe in the future ..
                        print('Stops the training since cost is not reduced during ', no_cost_decrease_epochs, ' epochs.')
                        training_costs = training_costs[:epoch + 1] # strip un-run epochs
                        validation_costs = validation_costs[:epoch + 1] # strip un-run epochs
                        validation_measures = validation_measures[:epoch + 1] # strip un-run epochs
                        break

        end_time = dt.datetime.now()
        print('Training(learning) Finished!', end_time)
        print('Training took ', '%10d' % ((end_time - start_time).total_seconds()),
              ' seconds.')
   
        rescale_factor = { 'minx':minx, 'maxx':maxx, 'miny':miny, 'maxy':maxy }
        return training_costs, validation_costs, validation_measures, rescale_factor               

    def test(self, X, Y, rescale_factor=None, scale_fn=scale_minmax, descale_fn=descale_minmax) :
        start_time = dt.datetime.now()
        g = tf.get_default_graph()

        if rescale_factor is not None :
            X, _, _ = scale_fn(X, rescale_factor['minx'], rescale_factor['maxx'])
            Y, _, _ = scale_fn(Y, rescale_factor['miny'], rescale_factor['maxy'])
            
        with g.as_default() :
            hyps, cost, measure = self._test_model(X, Y)
            if rescale_factor is not None :
                hyps = descale_fn(hyps, rescale_factor['miny'], rescale_factor['maxy'])
            end_time = dt.datetime.now()
            print('Prediction took ', '%10d' % ((end_time - start_time).total_seconds()),
                  ' seconds.')
            print('Started at ', start_time, ' and finished at ', end_time)
            return hyps, cost, measure

    def _test_model(self, X, Y) :
        test_feed_dict = {self.X: X, self.Y: Y,
                          self.p_keep_prob:1.0, self.p_keep_prob_input:1.0, self.p_training:False}
        test_targets = [ self.hypothesis, self.cost, self.measure ]

        sess = self.session
        hyps, cost, measure = sess.run(test_targets, feed_dict=test_feed_dict)
        return hyps, cost, measure

    def infer(self, X, rescale_factor=None, scale_fn=scale_minmax, descale_fn=descale_minmax) :
        g = tf.get_default_graph()

        if rescale_factor is not None :
            X, _, _ = scale_fn(X, rescale_factor['minx'], rescale_factor['maxx'])
        
        with g.as_default() :
            hyps = self._infer_model(X)
            if rescale_factor is not None :
                hyps = descale_fn(hyps, rescale_factor['miny'], rescale_factor['maxy'])
                
            return hyps

    def _infer_model(self, X) :
        test_feed_dict = {self.X: X,
                          self.p_keep_prob:1.0, self.p_keep_prob_input:1.0, self.p_training:False}
        test_targets = [ self.hypothesis ]

        sess = self.session
        hyps = sess.run(test_targets, feed_dict=test_feed_dict)
        return hyps
    
    def r_squared(self, y, h) :
        # in tf.reduce_mean, if axis has no entries, all dimensions are reduced, and a tensor with a single element is returned
        total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y, 0))))  # reduce_mean by 0-axis maintains vector dimension
        unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, h)))
        r_squared = tf.subtract(1.0, tf.div(unexplained_error, total_error))
        return r_squared

    def smape(self, y, h) :
        return tf.reduce_mean(2.0 * tf.abs(tf.subtract(y, h)) / tf.maximum(1e-7, (tf.abs(y) + tf.abs(h)))) # tf.maximum is used to avoid nan
        
    def check_nan(self, value) :
        return value is None or math.isnan(value)

    def save_model(self, save_file_name) :
        # self._dump_graph('save_model(' + save_file_name + ')')
        
        tf.train.Saver().save(self.session, save_file_name)

    def save_scale(self, save_model_name, rescale_factor) :
        with open(save_model_name + '.scale.pkl', 'wb') as f :
            pk.dump(rescale_factor, f, pk.HIGHEST_PROTOCOL)
        
    def read_scale(self, save_model_name) :
        with open(save_model_name + '.scale.pkl', 'rb') as f :
            rescale_factor = pk.load(f)
            return rescale_factor
        
    def _dump_graph(self, where) :
        print('')

        print('--- dumping tensorflow graph [', where, '] ---')
        g = tf.get_default_graph()
        print('default tf graph :', g)

        # debug graphs
        keys = g.get_all_collection_keys()
        print('current name scope :', g.get_name_scope())
        for key in keys :
            print('all graph (', key, ')  :', g.get_collection(key))
        print('') 
        print('')

       
    def restore_model(self, saved_dir) :
        print('saved dir:', saved_dir)

        with self.session.as_default() :
            # self._dump_graph('restore_model(' + saved_dir + ')')
            
            reader = tf.train.NewCheckpointReader(saved_dir)
            # for var_name in reader.get_variable_to_shape_map() :
            #     print(var_name)
        
            tf.train.Saver().restore(self.session, saved_dir)


In [6]:
# util

NO_RESCALE = { 'minx':-1, 'maxx':-1, 'miny':-1, 'maxy':-1 } 
RESCALE_X = { 'minx':None, 'maxx':None, 'miny':-1, 'maxy':-1 } 
RESCALE_XY = None

TEST_PERCENT = 0.2

def shuffle_XY(X, Y) :
    hstacked = np.hstack((X, Y))
    np.random.shuffle(hstacked)
    _, new_X, new_Y = np.split(hstacked, (0, X.shape[1]), axis=-1)
    return new_X, new_Y


In [7]:

def train_gym(gym_env, num_epochs=None, no_early_stop=False, retrain=False, no_rescale=False) :
    gym_env_model = 'model_' + gym_env
    train_config = default_train_config.copy()
    if num_epochs is not None :
        train_config['num_epochs'] = num_epochs
    if no_early_stop :
        train_config['early_stopping_epoch_on_max_no_decrease'] = -1
    
    print('starting ', gym_env, dt.datetime.now())
    observations, actions = load_expert_data(gym_env)
    obs_shape, action_shape = np.shape(observations), np.shape(actions)
    if action_shape[1] == 1 :
        actions = np.reshape(actions, (action_shape[0], action_shape[2]))
        action_shape = np.shape(actions)
    print(gym_env, ' observation shape: ', obs_shape, ', actions shape:', action_shape)

    if retrain :
        cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env, restore_mode=True, session=None)
        cloning.restore_model(gym_env_model)
        rescale_factor = cloning.read_scale(gym_env_model)
    else :
        cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env)
        if no_rescale :
            rescale_factor = NO_RESCALE
        else :
            rescale_factor = RESCALE_X # do not rescale Y
            # rescale_factor = RESCALE_XY # rescale both X and Y
    
    n_samples = observations.shape[0]
    n_train = int(n_samples * (1 - TEST_PERCENT))

    observations, actions = shuffle_XY(observations, actions)
    training_costs, validation_costs, validation_measures, rescale_factor = cloning.train(observations[:n_train], actions[:n_train],
                                                                                          train_config=train_config, rescale_factor=rescale_factor,
                                                                                          scale_fn=scale_signed)
    
    cloning.save_model(gym_env_model)
    if not retrain :
        cloning.save_scale(gym_env_model, rescale_factor)
    
    cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env, restore_mode=True)
    cloning.restore_model(gym_env_model)
    
    test_hyps, test_costs, test_measures = cloning.test(observations[n_train:], actions[n_train:],
                                                        rescale_factor=rescale_factor, scale_fn=scale_signed, descale_fn=descale_signed)
    print('ending ', gym_env, dt.datetime.now())

def retrain_gym(gym_env, num_epochs=None, no_early_stop=False) :
    train_gym(gym_env, num_epochs=num_epochs, no_early_stop=no_early_stop, retrain=True)
    

In [124]:
# train behavior cloning policies

for gym_env in gym_envs :
    train_gym(gym_env, num_epochs=num_epochs_per_envs[gym_env], no_early_stop=False, no_rescale=False)

starting  HalfCheetah-v2 2019-07-20 19:16:00.466183
HalfCheetah-v2  observation shape:  (500000, 17) , actions shape: (500000, 6)
Learning starts. It will take some time... 2019-07-20 19:16:02.603910
Epoch: 0000 average training cost = 0.579295337 validation cost = 0.385019094 validation measure = 0.295401573 2019-07-20 19:16:15.728855
Epoch: 0500 average training cost = 0.149796918 validation cost = 0.092154197 validation measure = 0.831848323 2019-07-20 21:07:56.335238
Epoch: 1000 average training cost = 0.127625644 validation cost = 0.071496472 validation measure = 0.869776368 2019-07-20 23:00:05.789216
Epoch: 1500 average training cost = 0.117151290 validation cost = 0.062932454 validation measure = 0.884375155 2019-07-21 00:52:16.901180
Epoch: 1999 average training cost = 0.110436104 validation cost = 0.058209136 validation measure = 0.893877625 2019-07-21 02:44:49.686552
Training(learning) Finished! 2019-07-21 02:44:49.687549
Training took       26927  seconds.
saved dir: model_H

In [None]:
for gym_env in [ 'Reacher-v2', 'Ant-v2', 'Walker2d-v2'] :
    retrain_gym(gym_env, num_epochs=num_epochs_per_envs[gym_env], no_early_stop=True)

In [19]:
# run_expert source code for reference

import tf_util
import pickle as pk
import traceback

def load_learned_policy_fn(gym_env, session=None, dagger=False) :
    observations, actions = load_expert_data(gym_env)
    obs_shape, action_shape = list(np.shape(observations)), list(np.shape(actions))
    if action_shape[1] == 1 :
        actions = np.reshape(actions, (action_shape[0], action_shape[2]))
        action_shape = list(np.shape(actions))

    gym_env_model = 'model_' + gym_env
    gym_env_dagger_model = 'model_dagger_' + gym_env
    
    cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape, scope_name=gym_env, restore_mode=True, session=session)
    cloning.restore_model(gym_env_model if not dagger else gym_env_dagger_model)
    rescale_factor = cloning.read_scale(gym_env_model)

    return lambda x : cloning.infer(x, rescale_factor=rescale_factor, scale_fn=scale_signed, descale_fn=descale_signed)
    
def rollout_by_policy(gym_env, max_timesteps, num_rollouts, policy_fn=None, render=False) :
    policy_type = 'learned'
    
    if policy_fn is None : # default policy_fn is expert policy
        print('loading and building expert policy')
        policy_fn = load_expert_policy_fn(gym_env)
        print('loaded and built')
        policy_type = 'expert'

    with tf.Session():
        tf_util.initialize()

        import gym
        env = gym.make(gym_env)
        max_steps = max_timesteps or env.spec.timestep_limit

        returns = []
        observations = []
        actions = []
        for i in range(num_rollouts):
            # print('iter', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0
            while not done:
                action = policy_fn(obs[None,:])
                observations.append(obs)
                actions.append(action)
                try :
                    if np.shape(action)[1] == 1 :
                        action_shape = np.shape(action)
                        action = np.reshape(action, (action_shape[0], action_shape[2]))
                    obs, r, done, _ = env.step(action) # observation, reward, done
                except ValueError as e :
                    print('action:', action, ', shape:', np.shape(action), ', policy_type:', policy_type)
                    print('actions:', actions, ', shape:', np.shape(actions))
                    traceback.print_exc()     

                totalr += r
                steps += 1
                if render:
                    env.render()
                # if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            returns.append(totalr)

        print('shape of returns', np.shape(returns))
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))

        rollout_data = {'observations': np.array(observations),
                        'actions': np.array(actions),
                        'returns': np.array(returns)}

        if not os.path.exists('rollout_data') :
            os.mkdir('rollout_data')
        with open(os.path.join('rollout_data', policy_type + '-' + gym_env + '.pkl'), 'wb') as f:
            pk.dump(rollout_data, f, pk.HIGHEST_PROTOCOL)
            
        return rollout_data, policy_type, env


In [11]:
# rollout and check

import pandas as pd

df = pd.DataFrame(columns=['gymenv', 'policy_type', 'rollouts', 'mean', 'std'])

for gym_env in gym_envs :
    for expert_policy in [ True, False] :
        start_time = dt.datetime.now()
        rollout_data, policy_type, _ = rollout_by_policy(gym_env, MAX_TIMESTEPS, NUM_ROLLOUTS,
                                                      policy_fn=None if expert_policy else load_learned_policy_fn(gym_env),
                                                      render=False)
        returns = rollout_data['returns']
        end_time = dt.datetime.now()
        print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
        df = df.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
        print('It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
        


W0723 19:38:12.873242 69560 deprecation_wrapper.py:119] From C:\Exception\Works\GitHub\cs294\yoonforh\hw1\load_policy.py:55: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0723 19:38:13.030773 69560 deprecation.py:323] From C:\Exception\Works\GitHub\cs294\yoonforh\hw1\tf_util.py:91: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
W0723 19:38:13.031772 69560 deprecation_wrapper.py:119] From C:\Exception\Works\GitHub\cs294\yoonforh\hw1\tf_util.py:74: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

W0723 19:38:13.032768 69560 deprecation.py:323] From C:\Works\tensorflow\lib\site-packages\tensorflow\python\util\tf_should_use.py:193: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.

loading and building expert policy
obs (1, 111) (1, 111)
loaded and built


  result = entry_point.load(False)


[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
shape of returns (500,)
mean return 4792.043355892032
std of return 260.3806693532324
Rollout result. env: Ant-v2 , policy_type: expert , returns: 500 / 4792.043355892032 / 260.3806693532324
It took        1424  seconds.




saved dir: model_Ant-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
shape of returns (500,)
mean return 1509.1139724415493
std of return 104.04023142009902
Rollout result. env: Ant-v2 , policy_type: learned , returns: 500 / 1509.1139724415493 / 104.04023142009902
It took        1568  seconds.
loading and building expert policy
obs (1, 11) (1, 11)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
shape of returns (500,)
mean return 3778.0636147566383
std of return 3.640208992602978
Rollout result. env: Hopper-v2 , policy_type: expert , returns: 500 / 3778.0636147566383 / 3.640208992602978
It took         941  seconds.
saved dir: model_Hop

In [12]:
df.head(20)

Unnamed: 0,gymenv,policy_type,rollouts,mean,std
0,Ant-v2,expert,500,4792.043356,260.380669
1,Ant-v2,learned,500,1509.113972,104.040231
2,Hopper-v2,expert,500,3778.063615,3.640209
3,Hopper-v2,learned,500,788.532313,335.400783
4,Reacher-v2,expert,500,-3.833945,1.632588
5,Reacher-v2,learned,500,-10.479362,4.560073
6,HalfCheetah-v2,expert,500,4142.096107,81.876911
7,HalfCheetah-v2,learned,500,2628.466742,195.237611
8,Humanoid-v2,expert,500,10366.597295,565.231565
9,Humanoid-v2,learned,500,295.178255,24.592578


## DAgger

1. Implement DAgger. See the code provided in run expert.py to see how to query the expert policy and perform roll-outs in the environment.

2. Run DAgger and report results on one task in which DAgger can learn a better policy than behavioral cloning.
Report your results in the form of a learning curve, plotting the number of DAgger iterations vs. the policy’s mean return,
with error bars to show the standard deviation.

Include the performance of the expert policy and the behavioral cloning agent on the same plot.
In the caption, state which task you used, and any details regarding network architecture, amount of data, etc. (as in the previous section).

### note
1. DAgger needs labeling by human experts.
1. The main idea is that the trajectories are collected by the learned policy. but the action is relabeled by the expert policy.
1. DAgger addresses the problem of distributional “drift”

In [28]:
import gym

def rollout_by_dagger(gym_env, max_timesteps, num_rollouts, num_epochs=50, render=False) :
    policy_type = 'dagger'
    
    print('loading and building learned policy')
    policy_fn = load_learned_policy_fn(gym_env)
    print('loaded and built')

    print('starting dagger ', gym_env, dt.datetime.now())
    train_observations, train_actions = load_expert_data(gym_env)
    obs_shape, action_shape = np.shape(train_observations), np.shape(train_actions)
    if action_shape[1] == 1 :
        train_actions = np.reshape(train_actions, (action_shape[0], action_shape[2]))
        action_shape = np.shape(train_actions)
    print(gym_env, ' observation shape: ', obs_shape, ', actions shape:', action_shape)
    
    with tf.Session(graph=tf.Graph()) as session, session.graph.as_default() : # for session nesting, the graphs should be isolated for each tf sessions
        print('loading and building expert policy for DAgger')
        expert_policy_fn = load_expert_policy_fn(gym_env)
        print('loaded and built for DAgger')

        tf_util.initialize()

        gym_env_model = 'model_' + gym_env
        gym_env_dagger_model = 'model_dagger_' + gym_env # new model file to save after lite training
        light_train_config = default_train_config.copy()
        light_train_config['num_epochs'] = num_epochs
        light_train_config['early_stopping_epoch_on_max_no_decrease'] = -1
        cloning_model = None

        env = gym.make(gym_env)
        max_steps = max_timesteps or env.spec.timestep_limit

        returns = []
        observations = []
        actions = []

        for i in range(num_rollouts):
            # print('iter', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0

            while not done:
                action = policy_fn(obs[None,:])
                # print('before append>>>> observations shape:', np.shape(observations), ', train_observations shape:', np.shape(train_observations), ', obs shape:', np.shape(obs))
                observations.append(obs)
                train_observations = np.append(train_observations, obs[None, :], axis=0)
                # print('after append>>>> observations shape:', np.shape(observations), ', train_observations shape:', np.shape(train_observations), ', obs shape:', np.shape(obs))

                expert_action = expert_policy_fn(obs[None,:]) # None makes additional dimension. to reduce, use np.hstack
                actions.append(expert_action)
                # print('before append>>>> actions shape:', np.shape(actions), ', train_actions shape:', np.shape(train_actions), ', expert_action shape:', np.shape(expert_action))
                train_actions = np.append(train_actions, expert_action, axis=0)
                # print('after append>>>> actions shape:', np.shape(actions), ', train_actions shape:', np.shape(train_actions), ', expert_action shape:', np.shape(expert_action))

                try :
                    if np.shape(action)[1] == 1 :
                        action_shape = np.shape(action)
                        action = np.reshape(action, (action_shape[0], action_shape[2]))
                        action_shape = np.shape(action)
                    obs, r, done, _ = env.step(action) # observation, reward, done
                except ValueError as e :
                    print('action:', action, ', shape:', np.shape(action), ', policy_type:', policy_type)
                    print('actions:', actions, ', shape:', np.shape(actions))
                    print('expert_action:', expert_action, ', shape:', np.shape(expert_action))
                    traceback.print_exc()     

                totalr += r
                steps += 1
                #if render: # this is just for dagger training
                #    env.render()
                # if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            returns.append(totalr)

            # retrain on every new rollouts
            n_samples = train_observations.shape[0]
            n_train = int(n_samples * (1 - TEST_PERCENT))

            print('train_observations shape:', train_observations.shape, ', train_actions shape:', train_actions.shape)
            print('observations shape:', np.shape(observations), ', actions shape:', np.shape(actions))
            try :
                train_observations, train_actions = shuffle_XY(train_observations, train_actions)
            except IndexError as e :
                print('train_observations:', train_observations, ', train_actions:', train_actions)
                traceback.print_exc() 

            print('train input : train_observations shape:', train_observations[:n_train].shape, ', train_actions shape:', train_actions[:n_train].shape)

            with tf.Session(graph=tf.Graph()) as nested_session, nested_session.graph.as_default():
                saved_model = gym_env_model if cloning_model is None else gym_env_dagger_model
                cloning_model = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape,
                                                  scope_name=gym_env, restore_mode=True, session=nested_session)
                cloning_model.restore_model(saved_model)        
                rescale_factor = cloning_model.read_scale(gym_env_model)
                training_costs, validation_costs, validation_measures, rescale_factor = cloning_model.train(train_observations[:n_train], train_actions[:n_train],
                                                                                                            train_config = light_train_config, rescale_factor = rescale_factor,
                                                                                                            scale_fn=scale_signed)
                cloning_model.save_model(gym_env_dagger_model)

        with tf.Session(graph=tf.Graph()) as nested_session, nested_session.graph.as_default():
            cloning = BehavioralCloning(X_shape=obs_shape, Y_shape=action_shape,
                                        scope_name=gym_env, restore_mode=True, session=nested_session)
            cloning.restore_model(gym_env_dagger_model)
            rescale_factor = cloning_model.read_scale(gym_env_model)
            test_hyps, test_costs, test_measures = cloning.test(train_observations[n_train:], train_actions[n_train:],
                                                                rescale_factor = rescale_factor, scale_fn=scale_signed, descale_fn=descale_signed)

        print('shape of returns', np.shape(returns))
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))

        rollout_data = {'observations': np.array(observations),
                        'actions': np.array(actions),
                        'returns': np.array(returns)}

        if not os.path.exists('rollout_data') :
            os.mkdir('rollout_data')
        with open(os.path.join('rollout_data', policy_type + '-' + gym_env + '.pkl'), 'wb') as f:
            pk.dump(rollout_data, f, pk.HIGHEST_PROTOCOL)

    rollout_data, policy_type, win_env = rollout_by_policy(gym_env, max_timesteps, num_rollouts,
                                    policy_fn=load_learned_policy_fn(gym_env, dagger=True),
                                    render=render)
    return rollout_data, 'dagger', win_env



In [29]:
max_timesteps = None
num_rollouts = 10 # incremental learning is too slow 

for gym_env in gym_envs :
    start_time = dt.datetime.now()
    rollout_data, policy_type, _ = rollout_by_dagger(gym_env, max_timesteps, num_rollouts,
                                                    render=False)
    returns = rollout_data['returns']
    end_time = dt.datetime.now()
    print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
    df = df.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
    print('It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')


loading and building learned policy
saved dir: model_Ant-v2
loaded and built
starting dagger  Ant-v2 2019-07-24 14:57:11.272614
Ant-v2  observation shape:  (496130, 111) , actions shape: (496130, 8)
loading and building expert policy for DAgger
obs (1, 111) (1, 111)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (497130, 111) , train_actions shape: (497130, 8)
observations shape: (1000, 111) , actions shape: (1000, 1, 8)
train input : train_observations shape: (397704, 111) , train_actions shape: (397704, 8)
saved dir: model_Ant-v2
Learning starts. It will take some time... 2019-07-24 14:59:22.308587
Epoch: 0000 average training cost = 0.026412237 validation cost = 0.017820382 validation measure = 0.768059015 2019-07-24 14:59:35.407298
Epoch: 0049 average training co

In [30]:
df.head(30)

Unnamed: 0,gymenv,policy_type,rollouts,mean,std
0,Ant-v2,expert,500,4792.043356,260.380669
1,Ant-v2,learned,500,1509.113972,104.040231
2,Hopper-v2,expert,500,3778.063615,3.640209
3,Hopper-v2,learned,500,788.532313,335.400783
4,Reacher-v2,expert,500,-3.833945,1.632588
5,Reacher-v2,learned,500,-10.479362,4.560073
6,HalfCheetah-v2,expert,500,4142.096107,81.876911
7,HalfCheetah-v2,learned,500,2628.466742,195.237611
8,Humanoid-v2,expert,500,10366.597295,565.231565
9,Humanoid-v2,learned,500,295.178255,24.592578


## just rollout and render using the policies. enjoy the visuals

In [32]:
# rollout and check
from time import sleep
import pandas as pd
import glfw

max_timesteps = None
num_rollouts = 5

df2 = pd.DataFrame(columns=['gymenv', 'policy_type', 'rollouts', 'mean', 'std'])

def close_mujoco_window(win) :
    if win.unwrapped.viewer is not None :
        glfw.destroy_window(win.unwrapped.viewer.window)
        win.unwrapped.viewer = None
    
for gym_env in gym_envs :
    for expert_policy in [ True, False] :
        start_time = dt.datetime.now()
        rollout_data, policy_type, opengym_win = rollout_by_policy(gym_env, max_timesteps, num_rollouts,
                                                                  policy_fn=None if expert_policy else load_learned_policy_fn(gym_env),
                                                                  render=True)
        returns = rollout_data['returns']
        end_time = dt.datetime.now()
        print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
        df2 = df2.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
        print(gym_env, '-', policy_type, '. It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
        input("Press Enter to continue...")
        close_mujoco_window(opengym_win)

    start_time = dt.datetime.now()
    rollout_data, policy_type, opengym_win = rollout_by_dagger(gym_env, max_timesteps, num_rollouts, num_epochs=5,
                                                              render=True)
    returns = rollout_data['returns']
    end_time = dt.datetime.now()
    print('Rollout result. env:', gym_env, ', policy_type:', policy_type, ', returns:', len(returns), '/', np.mean(returns), '/', np.std(returns))
    df2 = df2.append({'gymenv':gym_env, 'policy_type':policy_type, 'rollouts': len(returns), 'mean': np.mean(returns), 'std': np.std(returns)}, ignore_index=True)
    print(gym_env, '-', policy_type, '. It took ', '%10d' % ((end_time - start_time).total_seconds()), ' seconds.')
    input("Press Enter to continue...")
    close_mujoco_window(opengym_win)
        

loading and building expert policy
obs (1, 111) (1, 111)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 4236.738439673416
std of return 1082.6651409019382
Rollout result. env: Ant-v2 , policy_type: expert , returns: 5 / 4236.738439673416 / 1082.6651409019382
Ant-v2 - expert . It took          61  seconds.


Press Enter to continue... 


saved dir: model_Ant-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 1456.4291966516435
std of return 20.799161409530495
Rollout result. env: Ant-v2 , policy_type: learned , returns: 5 / 1456.4291966516435 / 20.799161409530495
Ant-v2 - learned . It took          73  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_Ant-v2
loaded and built
starting dagger  Ant-v2 2019-07-25 08:57:33.322723
Ant-v2  observation shape:  (496130, 111) , actions shape: (496130, 8)
loading and building expert policy for DAgger
obs (1, 111) (1, 111)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (497130, 111) , train_actions shape: (497130, 8)
observations shape: (1000, 111) , actions shape: (1000, 1, 8)
train input : train_observations shape: (397704, 111) , train_actions shape: (397704, 8)
saved dir: model_Ant-v2
Learning starts. It will take some time... 2019-07-25 08:59:48.727330
Epoch: 0000 average training cost = 0.026391279 validation cost = 0.017780432 validation measure = 0.768477559 2019-07-25 09:00:01.987326
Epoch: 0004 average training co

Press Enter to continue... 


loading and building expert policy
obs (1, 11) (1, 11)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 3778.744256532997
std of return 3.2459491355292407
Rollout result. env: Hopper-v2 , policy_type: expert , returns: 5 / 3778.744256532997 / 3.2459491355292407
Hopper-v2 - expert . It took          22  seconds.


Press Enter to continue... 


saved dir: model_Hopper-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 753.8643438327778
std of return 319.4460042712088
Rollout result. env: Hopper-v2 , policy_type: learned , returns: 5 / 753.8643438327778 / 319.4460042712088
Hopper-v2 - learned . It took           7  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_Hopper-v2
loaded and built
starting dagger  Hopper-v2 2019-07-25 09:16:04.689659
Hopper-v2  observation shape:  (500000, 11) , actions shape: (500000, 3)
loading and building expert policy for DAgger
obs (1, 11) (1, 11)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (500236, 11) , train_actions shape: (500236, 3)
observations shape: (236, 11) , actions shape: (236, 1, 3)
train input : train_observations shape: (400188, 11) , train_actions shape: (400188, 3)
saved dir: model_Hopper-v2
Learning starts. It will take some time... 2019-07-25 09:16:09.902176
Epoch: 0000 average training cost = 0.386448085 validation cost = 0.151185811 validation measure = 0.929909945 2019-07-25 09:16:23.027072
Epoch: 0004 average trainin

Press Enter to continue... 


loading and building expert policy
obs (1, 11) (1, 11)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return -4.224733309005781
std of return 1.484114109365414
Rollout result. env: Reacher-v2 , policy_type: expert , returns: 5 / -4.224733309005781 / 1.484114109365414
Reacher-v2 - expert . It took           3  seconds.


Press Enter to continue... 


saved dir: model_Reacher-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return -10.555401295578095
std of return 2.8969758059010666
Rollout result. env: Reacher-v2 , policy_type: learned , returns: 5 / -10.555401295578095 / 2.8969758059010666
Reacher-v2 - learned . It took           4  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_Reacher-v2
loaded and built
starting dagger  Reacher-v2 2019-07-25 09:22:05.693346
Reacher-v2  observation shape:  (500000, 11) , actions shape: (500000, 2)
loading and building expert policy for DAgger
obs (1, 11) (1, 11)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (500050, 11) , train_actions shape: (500050, 2)
observations shape: (50, 11) , actions shape: (50, 1, 2)
train input : train_observations shape: (400040, 11) , train_actions shape: (400040, 2)
saved dir: model_Reacher-v2
Learning starts. It will take some time... 2019-07-25 09:22:07.830201
Epoch: 0000 average training cost = 0.005705089 validation cost = 0.004972897 validation measure = 0.366930246 2019-07-25 09:22:20.233093
Epoch: 0004 average train

Press Enter to continue... 


loading and building expert policy
obs (1, 17) (1, 17)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 4129.324865018089
std of return 48.39784988679189
Rollout result. env: HalfCheetah-v2 , policy_type: expert , returns: 5 / 4129.324865018089 / 48.39784988679189
HalfCheetah-v2 - expert . It took          62  seconds.


Press Enter to continue... 


saved dir: model_HalfCheetah-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 2551.8196935074275
std of return 183.52685814516974
Rollout result. env: HalfCheetah-v2 , policy_type: learned , returns: 5 / 2551.8196935074275 / 183.52685814516974
HalfCheetah-v2 - learned . It took          65  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_HalfCheetah-v2
loaded and built
starting dagger  HalfCheetah-v2 2019-07-25 09:29:53.089264
HalfCheetah-v2  observation shape:  (500000, 17) , actions shape: (500000, 6)
loading and building expert policy for DAgger
obs (1, 17) (1, 17)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (501000, 17) , train_actions shape: (501000, 6)
observations shape: (1000, 17) , actions shape: (1000, 1, 6)
train input : train_observations shape: (400800, 17) , train_actions shape: (400800, 6)
saved dir: model_HalfCheetah-v2
Learning starts. It will take some time... 2019-07-25 09:30:20.509413
Epoch: 0000 average training cost = 0.110685512 validation cost = 0.058326729 validation measure = 0.893301785 2019-07-25 09:30:33.796974
Epoch

Press Enter to continue... 


loading and building expert policy
obs (1, 376) (1, 376)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 10401.182975388374
std of return 22.169635619860017
Rollout result. env: Humanoid-v2 , policy_type: expert , returns: 5 / 10401.182975388374 / 22.169635619860017
Humanoid-v2 - expert . It took          34  seconds.


Press Enter to continue... 


saved dir: model_Humanoid-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 284.027115344265
std of return 12.980992706710655
Rollout result. env: Humanoid-v2 , policy_type: learned , returns: 5 / 284.027115344265 / 12.980992706710655
Humanoid-v2 - learned . It took           4  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_Humanoid-v2
loaded and built
starting dagger  Humanoid-v2 2019-07-25 09:39:38.169911
Humanoid-v2  observation shape:  (499084, 376) , actions shape: (499084, 17)
loading and building expert policy for DAgger
obs (1, 376) (1, 376)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (499142, 376) , train_actions shape: (499142, 17)
observations shape: (58, 376) , actions shape: (58, 1, 17)
train input : train_observations shape: (399313, 376) , train_actions shape: (399313, 17)
saved dir: model_Humanoid-v2
Learning starts. It will take some time... 2019-07-25 09:40:08.528859
Epoch: 0000 average training cost = 0.181445137 validation cost = 0.104215570 validation measure = 0.891152859 2019-07-25 09:40:22.868708
Epoch: 0004

Press Enter to continue... 


loading and building expert policy
obs (1, 17) (1, 17)
loaded and built
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 5519.144750657926
std of return 62.62180384508529
Rollout result. env: Walker2d-v2 , policy_type: expert , returns: 5 / 5519.144750657926 / 62.62180384508529
Walker2d-v2 - expert . It took          23  seconds.


Press Enter to continue... 


saved dir: model_Walker2d-v2
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
Creating window glfw
shape of returns (5,)
mean return 552.7175979122823
std of return 409.13333953431726
Rollout result. env: Walker2d-v2 , policy_type: learned , returns: 5 / 552.7175979122823 / 409.13333953431726
Walker2d-v2 - learned . It took           9  seconds.


Press Enter to continue... 


loading and building learned policy
saved dir: model_Walker2d-v2
loaded and built
starting dagger  Walker2d-v2 2019-07-25 09:56:26.375748
Walker2d-v2  observation shape:  (499696, 17) , actions shape: (499696, 6)
loading and building expert policy for DAgger
obs (1, 17) (1, 17)
loaded and built for DAgger
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
train_observations shape: (500696, 17) , train_actions shape: (500696, 6)
observations shape: (1000, 17) , actions shape: (1000, 1, 6)
train input : train_observations shape: (400556, 17) , train_actions shape: (400556, 6)
saved dir: model_Walker2d-v2
Learning starts. It will take some time... 2019-07-25 09:56:54.025758
Epoch: 0000 average training cost = 0.319572836 validation cost = 0.194618776 validation measure = 0.803772926 2019-07-25 09:57:08.080753
Epoch: 0004 avera

Press Enter to continue... 


In [33]:
df2.head(30)

Unnamed: 0,gymenv,policy_type,rollouts,mean,std
0,Ant-v2,expert,5,4236.73844,1082.665141
1,Ant-v2,learned,5,1456.429197,20.799161
2,Ant-v2,dagger,5,1559.178639,84.840651
3,Hopper-v2,expert,5,3778.744257,3.245949
4,Hopper-v2,learned,5,753.864344,319.446004
5,Hopper-v2,dagger,5,644.931829,276.00433
6,Reacher-v2,expert,5,-4.224733,1.484114
7,Reacher-v2,learned,5,-10.555401,2.896976
8,Reacher-v2,dagger,5,-12.051469,2.349063
9,HalfCheetah-v2,expert,5,4129.324865,48.39785


##  Bonus: Alternative Policy Architectures

1. (Optional) Experiment with a different policy architecture, e.g. using recurrence or changing the size or nonlinearities used.

Compare performance between your new and original policy architectures using behavioral cloning and/or DAgger,
and report your results in the same form as above, with a caption describing what you did.
