# Policy Gradients
The goal in policy gradient algorithms is to maximize the expected returns of a policy $\pi_\theta$ with parameters $\theta$. Letting $\tau=((s_0, a_0, r_0), \ldots, (s_T, a_T, r_T) )$ denote a trajectory and $R(\tau)$ the return of $\tau$, this objective can be written as
$$\max_{\theta} \mathbb E_{\tau \sim \pi_{\theta}}[R(\tau)].$$

Using the REINFORCE trick, we can compute the policy gradient (the gradient of expected policy returns) as
$$\sum_{t=0}^T \mathbb E_{s_t, a_t \sim \pi(\tau)} \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t) R(\tau).$$
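
For reference, the identity behind this expression can be sketched as follows. The symbols $p_\theta(\tau)$ (the density of a trajectory under $\pi_\theta$), $p(s_0)$ (the initial-state density) and $p(s_{t+1} \vert s_t, a_t)$ (the transition density) are notation introduced here for illustration, and the sketch assumes the gradient and the integral can be interchanged:
$$\nabla_\theta \mathbb E_{\tau \sim \pi_\theta}[R(\tau)] = \int \nabla_\theta p_\theta(\tau) R(\tau) \, d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau = \mathbb E_{\tau \sim \pi_\theta}[\nabla_\theta \log p_\theta(\tau) R(\tau)].$$
Since
$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^T \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \vert s_t, a_t),$$
only the middle sum depends on $\theta$, which gives $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \vert s_t)$ and hence the sum over timesteps above.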

We can estimate this with a very simple scheme.
We first sample a trajectory $\tau = ((s_t, a_t, r_t))_{t=0}^T$ from our current policy, compute its return $R(\tau)$, and then form a stochastic estimate of the policy gradient as

$$\sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t) R(\tau).$$
We can then sample more trajectories and average the estimate over them.
In practice, we will often use _discounted_ returns $\tilde R(\tau) = \sum_{t=0}^T \gamma^t r_t$, where $\gamma$ is the discount factor; the policy gradient estimate then simply replaces the undiscounted return $R(\tau)$ with $\tilde R(\tau)$.
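
As a rough illustration of this sampling scheme, here is a minimal sketch in plain Python. It assumes a tabular softmax policy whose parameters are a table of logits `theta[s][a]`, which is not the neural-network policy used later in this notebook. The helper names `softmax`, `trajectory_return`, `reinforce_gradient` and `average_gradients` are hypothetical and chosen only for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def trajectory_return(rewards, gamma):
    """Discounted return sum_t gamma^t r_t (gamma=1.0 gives the undiscounted return)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def reinforce_gradient(theta, trajectory, gamma=1.0):
    """Single-trajectory estimate sum_t grad log pi(a_t | s_t) * R(tau).

    theta is a list of per-state logit lists (an assumed tabular policy);
    trajectory is a list of (state, action, reward) tuples.
    """
    ret = trajectory_return([r for _, _, r in trajectory], gamma)
    grad = [[0.0] * len(row) for row in theta]
    for s, a, _ in trajectory:
        probs = softmax(theta[s])
        for b, p in enumerate(probs):
            # d/d theta[s][b] of log softmax(theta[s])[a] is 1[b == a] - p_b
            grad[s][b] += ((1.0 if b == a else 0.0) - p) * ret
    return grad

def average_gradients(grads):
    """Average several single-trajectory estimates entry by entry."""
    n = len(grads)
    return [[sum(g[s][b] for g in grads) / n for b in range(len(grads[0][s]))]
            for s in range(len(grads[0]))]

# Two states, two actions, uniform initial policy.
theta = [[0.0, 0.0], [0.0, 0.0]]
tau = [(0, 1, 1.0), (1, 0, 1.0)]  # (state, action, reward) per timestep
grad = reinforce_gradient(theta, tau, gamma=0.5)
avg = average_gradients([grad, grad])
```

Here the return is $1 + 0.5 \cdot 1 = 1.5$, so the logit of each action that was taken is pushed up by $0.5 \cdot 1.5$ and the logit of the other action is pushed down by the same amount.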



In [1]:
#@title imports
# As usual, a bit of setup
import os
import shutil
import time
import numpy as np
import gym
import torch

import deeprl.infrastructure.pytorch_util as ptu

from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import PG_Trainer
from deeprl.infrastructure.trainers import BC_Trainer

from deeprl.agents.pg_agent import PGAgent
from deeprl.policies.MLP_policy import MLPPolicyPG

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))

In [2]:
pg_base_args_dict = dict(
    env_name = 'Hopper-v2', #@param ['Ant-v2', 'Humanoid-v2', 'Walker2d-v2', 'HalfCheetah-v2', 'Hopper-v2']
    exp_name = 'test_pg', #@param
    save_params = False, #@param {type: "boolean"}
    
    ep_len = 200, #@param {type: "integer"}
    discount = 0.95, #@param {type: "number"}

    reward_to_go = True, #@param {type: "boolean"}
    nn_baseline = False, #@param {type: "boolean"}
    dont_standardize_advantages = True, #@param {type: "boolean"}

    # Training
    num_agent_train_steps_per_iter = 1, #@param {type: "integer"}
    n_iter = 100, #@param {type: "integer"}

    # batches & buffers
    batch_size = 1000, #@param {type: "integer"}
    eval_batch_size = 1000, #@param {type: "integer"}
    train_batch_size = 1000, #@param {type: "integer"}
    max_replay_buffer_size = 1000000, #@param {type: "integer"}

    #@markdown network
    n_layers = 2, #@param {type: "integer"}
    size = 64, #@param {type: "integer"}
    learning_rate = 5e-3, #@param {type: "number"}

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

## Implementing policy gradients
We will first compute a naive policy gradient estimate that weights every timestep by the full discounted return of the trajectory. Fill out the method <code>_discounted_return</code> in <code>pg_agent.py</code>. Your error should be 1e-6 or lower.
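
To make the target quantity concrete, here is a minimal standalone sketch of the full discounted return. The function name `full_discounted_return`, its explicit `gamma` argument, and the choice to return one copy of the total per timestep are assumptions made for illustration; the actual method in <code>pg_agent.py</code> may have a different signature and return type.

```python
def full_discounted_return(rewards, gamma):
    """Return a list of length T in which every entry is sum_t gamma^t * r_t.

    Repeating the total once per timestep is an assumption of this sketch;
    it makes the result line up with the per-timestep log-probabilities.
    """
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)

# Every entry is 1 + 0.5 + 0.25 = 1.75.
example = full_discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```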

In [8]:
### Test return computation
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

T = 10
np.random.seed(0)
rewards = np.random.normal(size=T)
discounted_returns = pgagent._discounted_return(rewards)

expected_return = 6.49674307
return_error = rel_error(discounted_returns, expected_return)
print("Error in return estimate is", return_error)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0
Error in return estimate is 1.3988372177636252e-10


Next, we'll consider a return estimate with lower variance by taking the discounted reward-to-go at each timestep instead of the entire discounted return. More precisely, instead of taking $\sum_{t'=0}^T \gamma^{t'} r_{t'}$ as the return estimate for all timesteps $t$, we will use $\sum_{t'=t}^T \gamma^{t' - t} r_{t'}$ as the return estimate at timestep $t$. Fill out the method <code>_discounted_cumsum</code> in <code>pg_agent.py</code>. Your error should be 1e-6 or lower.
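
The reward-to-go satisfies the backward recursion $G_t = r_t + \gamma G_{t+1}$ with $G_{T+1} = 0$, where $G_t$ is shorthand introduced here for $\sum_{t'=t}^T \gamma^{t' - t} r_{t'}$. A minimal standalone sketch using that recursion is shown below. The name `discounted_reward_to_go` and the explicit `gamma` argument are assumptions for illustration, not the signature used in <code>pg_agent.py</code>.

```python
def discounted_reward_to_go(rewards, gamma):
    """Return [G_0, ..., G_T] with G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the one after it: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# With rewards [1, 1, 1] and gamma = 0.5 this gives [1.75, 1.5, 1.0].
example = discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5)
```

The backward pass costs $O(T)$, whereas evaluating each sum independently would cost $O(T^2)$. The first entry equals the full discounted return of the trajectory.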

In [46]:
### Test reward to go computations
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

T = 10
np.random.seed(0)
rewards = np.random.normal(size=T)
discounted_cumsum = pgagent._discounted_cumsum(rewards)
expected_cumsum = np.array([6.49674307, 4.98177971, 4.82276053, 4.04633952, 1.90046981, 0.03464402,
 1.06518095, 0.12115003, 0.28684973, 0.4105985])

return_error = rel_error(discounted_cumsum, expected_cumsum)
print("Error in return estimate is", return_error)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0
Error in return estimate is 1.0143655677664199e-08


Finally, we'll use our return estimates to compute a policy gradient. Fill out the surrogate loss computation in the <code>update</code> method of the <code>MLPPolicyPG</code> class in <code>policies/MLP_policy.py</code>.
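
A surrogate loss is a scalar whose gradient with respect to the policy parameters is the negated policy gradient estimate, so that minimizing it with a standard optimizer performs gradient ascent on expected return. The sketch below illustrates the idea in plain Python for a one-dimensional Gaussian policy with mean `w * s` and fixed standard deviation. That policy, the helper names, and the choice of averaging over the batch are all assumptions for illustration; the sign and normalisation that <code>update</code> must use are fixed by the expected values in the test cell, not by this sketch.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at action."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2 * math.pi))

def surrogate_loss(w, obs, acts, advs, std=1.0):
    """-(1/N) sum_i log pi_w(a_i | s_i) * A_i, with advantages held constant."""
    terms = [gaussian_log_prob(a, w * s, std) * adv
             for s, a, adv in zip(obs, acts, advs)]
    return -sum(terms) / len(terms)

def policy_gradient_estimate(w, obs, acts, advs, std=1.0):
    """(1/N) sum_i grad_w log pi_w(a_i | s_i) * A_i, computed analytically."""
    terms = [(a - w * s) / std ** 2 * s * adv
             for s, a, adv in zip(obs, acts, advs)]
    return sum(terms) / len(terms)

obs = [0.5, -1.0, 2.0]
acts = [0.2, 0.3, -0.4]
advs = [1.0, -2.0, 0.5]
w, eps = 0.1, 1e-6

# Central finite difference of the surrogate loss with respect to w.
loss_grad = (surrogate_loss(w + eps, obs, acts, advs)
             - surrogate_loss(w - eps, obs, acts, advs)) / (2 * eps)
pg = policy_gradient_estimate(w, obs, acts, advs)
# loss_grad should agree with -pg up to floating-point error.
```

In an autograd framework the same effect comes from differentiating the loss directly, provided the advantages are treated as constants rather than as functions of the parameters.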

In [68]:
### Test policy gradient (check gradients match what we expect)
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

policy = MLPPolicyPG(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25)

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))
acts = np.random.normal(size=(batch_size, ac_dim))
advs = 1000 * np.random.normal(size=(batch_size,))

first_weight_before = np.array(ptu.to_numpy(next(policy.mean_net.parameters())))
print("Weight before update", first_weight_before)

for i in range(5):
    loss = policy.update(obs, acts, advs)['Training Loss']

print(loss)
expected_loss = -6142.9116
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")

first_weight_after = ptu.to_numpy(next(policy.mean_net.parameters()))
print('Weight after update', first_weight_after)

weight_change = first_weight_after - first_weight_before
print("Change in weights", weight_change)

expected_change = np.array([[ 1.035012, 1.0455959, 0.11085394],
                            [-1.1532364, -0.5915445, 0.557522]])
updated_weight_error = rel_error(weight_change, expected_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

Weight before update [[-0.00432253  0.30971587 -0.47518533]
 [-0.42489457 -0.22236899  0.15482074]]
-6142.9126
Loss Error 8.120385198316438e-08 should be on the order of 1e-6 or lower
Weight after update [[ 1.0306895   1.3553118  -0.36433142]
 [-1.5781313  -0.81391037  0.7123427 ]]
Change in weights [[ 1.035012    1.0455959   0.11085391]
 [-1.1532367  -0.5915414   0.55752194]]
Weight Update Error 2.6122426803972157e-06 should be on the order of 1e-6 or lower


We can now compare how well the two return estimators do on a simple environment.

In [69]:
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['reward_to_go'] = False
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/full_returns/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/full_returns/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/CartPole/full_returns/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/CartPole/full_returns/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     1012 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 29.285715103149414
Eval_StdReturn : 16.32451629638672
Eval_MaxReturn : 86.0
Eval_MinReturn : 10.0
Eval_AverageEpLen : 29.285714285714285
Train_AverageReturn : 22.0
Train_StdReturn : 10.642409324645996
Train_MaxReturn : 51.0
Train_MinReturn : 9.0
Train_AverageEpLen : 22.0
Train_EnvstepsSoFar : 1012
TimeSinceStart : 0.915996789932251
Training Loss 

At timestep:     1027 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 24.9761905670166
Eval_StdReturn : 15.09255313873291
Eval_MaxReturn : 81.0
Eval_MinReturn : 11.0
Eval_AverageEpLen : 24.976190476190474
Train_AverageReturn : 24.452381134033203
Train_StdReturn : 11.658232688903809
Train_MaxReturn : 64.0
Train_MinReturn : 10.0
Train_AverageEpLen : 24.452380952380953
Train_EnvstepsSoFar : 13396
TimeSinceStart : 14.014611959457397
Training Loss : 9.670610427856445
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 13 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 31.176469802856445
Eval_StdReturn : 17.550716400146484
Eval_MaxReturn : 90.0
Eval_MinReturn : 11.0
Eval_AverageEpLen : 31.176470588235293
Train_AverageReturn : 28.714284896850586
Tra

At timestep:     1008 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 44.130435943603516
Eval_StdReturn : 15.255232810974121
Eval_MaxReturn : 71.0
Eval_MinReturn : 16.0
Eval_AverageEpLen : 44.130434782608695
Train_AverageReturn : 37.33333206176758
Train_StdReturn : 17.68657112121582
Train_MaxReturn : 103.0
Train_MinReturn : 11.0
Train_AverageEpLen : 37.333333333333336
Train_EnvstepsSoFar : 25720
TimeSinceStart : 27.592190980911255
Training Loss : 10.478879928588867
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 25 ************

Collecting data to be used for training...
At timestep:     1037 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 43.54166793823242
Eval_StdReturn : 17.387685775756836
Eval_MaxReturn : 83.0
Eval_MinReturn : 15.0
Eval_AverageEpLen : 43.541666666666664
Train_AverageReturn : 51.849998474121094
T

Eval_AverageReturn : 101.69999694824219
Eval_StdReturn : 32.4839973449707
Eval_MaxReturn : 154.0
Eval_MinReturn : 45.0
Eval_AverageEpLen : 101.7
Train_AverageReturn : 72.92857360839844
Train_StdReturn : 23.057270050048828
Train_MaxReturn : 106.0
Train_MinReturn : 25.0
Train_AverageEpLen : 72.92857142857143
Train_EnvstepsSoFar : 38208
TimeSinceStart : 41.09536099433899
Training Loss : 11.56374740600586
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 37 ************

Collecting data to be used for training...
At timestep:     1053 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 84.66666412353516
Eval_StdReturn : 29.519296646118164
Eval_MaxReturn : 154.0
Eval_MinReturn : 36.0
Eval_AverageEpLen : 84.66666666666667
Train_AverageReturn : 87.75
Train_StdReturn : 51.410316467285156
Train_MaxReturn : 181.0
Train_MinReturn : 33.0
Train_AverageEpLen : 87.75
Train_EnvstepsSoFar : 39261

At timestep:     1081 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 83.41666412353516
Eval_StdReturn : 31.645057678222656
Eval_MaxReturn : 162.0
Eval_MinReturn : 35.0
Eval_AverageEpLen : 83.41666666666667
Train_AverageReturn : 83.15384674072266
Train_StdReturn : 32.58797836303711
Train_MaxReturn : 146.0
Train_MinReturn : 33.0
Train_AverageEpLen : 83.15384615384616
Train_EnvstepsSoFar : 52107
TimeSinceStart : 54.21835899353027
Training Loss : 10.874619483947754
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 50 ************

Collecting data to be used for training...
At timestep:     1029 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 103.80000305175781
Eval_StdReturn : 39.974491119384766
Eval_MaxReturn : 176.0
Eval_MinReturn : 56.0
Eval_AverageEpLen : 103.8
Train_AverageReturn : 93.54545593261719
Train_StdReturn 

Eval_AverageReturn : 91.38461303710938
Eval_StdReturn : 37.446449279785156
Eval_MaxReturn : 190.0
Eval_MinReturn : 31.0
Eval_AverageEpLen : 91.38461538461539
Train_AverageReturn : 96.45454406738281
Train_StdReturn : 37.23967361450195
Train_MaxReturn : 200.0
Train_MinReturn : 66.0
Train_AverageEpLen : 96.45454545454545
Train_EnvstepsSoFar : 64630
TimeSinceStart : 64.61874389648438
Training Loss : 10.252517700195312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 62 ************

Collecting data to be used for training...
At timestep:     1073 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 97.09091186523438
Eval_StdReturn : 47.607208251953125
Eval_MaxReturn : 200.0
Eval_MinReturn : 56.0
Eval_AverageEpLen : 97.0909090909091
Train_AverageReturn : 89.41666412353516
Train_StdReturn : 49.6763801574707
Train_MaxReturn : 200.0
Train_MinReturn : 42.0
Train_AverageEpLen : 89.41666666

At timestep:     1086 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 187.6666717529297
Eval_StdReturn : 24.17758560180664
Eval_MaxReturn : 200.0
Eval_MinReturn : 134.0
Eval_AverageEpLen : 187.66666666666666
Train_AverageReturn : 181.0
Train_StdReturn : 28.23118782043457
Train_MaxReturn : 200.0
Train_MinReturn : 128.0
Train_AverageEpLen : 181.0
Train_EnvstepsSoFar : 78596
TimeSinceStart : 76.68314003944397
Training Loss : 10.735507011413574
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 75 ************

Collecting data to be used for training...
At timestep:     1144 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 193.1666717529297
Eval_StdReturn : 13.170886039733887
Eval_MaxReturn : 200.0
Eval_MinReturn : 164.0
Eval_AverageEpLen : 193.16666666666666
Train_AverageReturn : 190.6666717529297
Train_StdReturn : 15.1840

Eval_AverageReturn : 133.250
Eval_StdReturn : 25.48406410217285
Eval_MaxReturn : 189.0
Eval_MinReturn : 108.0
Eval_AverageEpLen : 133.25
Train_AverageReturn : 116.0
Train_StdReturn : 26.255157470703125
Train_MaxReturn : 156.0
Train_MinReturn : 60.0
Train_AverageEpLen : 116.0
Train_EnvstepsSoFar : 91501
TimeSinceStart : 86.81208491325378
Training Loss : 10.229460716247559
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 87 ************

Collecting data to be used for training...
At timestep:     1039 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 123.88888549804688
Eval_StdReturn : 15.293022155761719
Eval_MaxReturn : 148.0
Eval_MinReturn : 103.0
Eval_AverageEpLen : 123.88888888888889
Train_AverageReturn : 129.875
Train_StdReturn : 23.745723724365234
Train_MaxReturn : 178.0
Train_MinReturn : 109.0
Train_AverageEpLen : 129.875
Train_EnvstepsSoFar : 92540
TimeSinceStart : 87.62

At timestep:     1127 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 92.41666412353516
Eval_StdReturn : 33.018829345703125
Eval_MaxReturn : 138.0
Eval_MinReturn : 29.0
Eval_AverageEpLen : 92.41666666666667
Train_AverageReturn : 93.91666412353516
Train_StdReturn : 31.571239471435547
Train_MaxReturn : 132.0
Train_MinReturn : 34.0
Train_AverageEpLen : 93.91666666666667
Train_EnvstepsSoFar : 105253
TimeSinceStart : 97.5392439365387
Training Loss : 9.574260711669922
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...


Running policy gradient experiment with seed 1
########################
logging outputs to  logs/policy_gradient/CartPole/full_returns/seed1
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


********** Iteration 0 ************

Collecting

At timestep:     1033 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 46.95454406738281
Eval_StdReturn : 16.438528060913086
Eval_MaxReturn : 80.0
Eval_MinReturn : 18.0
Eval_AverageEpLen : 46.95454545454545
Train_AverageReturn : 44.91304397583008
Train_StdReturn : 16.287933349609375
Train_MaxReturn : 101.0
Train_MinReturn : 26.0
Train_AverageEpLen : 44.91304347826087
Train_EnvstepsSoFar : 12333
TimeSinceStart : 10.237333059310913
Training Loss : 10.015583992004395
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 12 ************

Collecting data to be used for training...
At timestep:     1018 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 46.04545593261719
Eval_StdReturn : 17.4394474029541
Eval_MaxReturn : 89.0
Eval_MinReturn : 21.0
Eval_AverageEpLen : 46.04545454545455
Train_AverageReturn : 42.41666793

At timestep:     1061 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 95.09091186523438
Eval_StdReturn : 26.102785110473633
Eval_MaxReturn : 140.0
Eval_MinReturn : 40.0
Eval_AverageEpLen : 95.0909090909091
Train_AverageReturn : 88.41666412353516
Train_StdReturn : 35.60771942138672
Train_MaxReturn : 200.0
Train_MinReturn : 61.0
Train_AverageEpLen : 88.41666666666667
Train_EnvstepsSoFar : 24979
TimeSinceStart : 20.66588306427002
Training Loss : 10.522383689880371
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 24 ************

Collecting data to be used for training...
At timestep:     1035 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 115.66666412353516
Eval_StdReturn : 38.43898391723633
Eval_MaxReturn : 184.0
Eval_MinReturn : 61.0
Eval_AverageEpLen : 115.66666666666667
Train_AverageReturn : 94.090911

At timestep:     1012 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 31.454545974731445
Eval_StdReturn : 6.840569496154785
Eval_MaxReturn : 50.0
Eval_MinReturn : 17.0
Eval_AverageEpLen : 31.454545454545453
Train_AverageReturn : 32.64516067504883
Train_StdReturn : 7.965847969055176
Train_MaxReturn : 50.0
Train_MinReturn : 22.0
Train_AverageEpLen : 32.645161290322584
Train_EnvstepsSoFar : 37283
TimeSinceStart : 32.55240797996521
Training Loss : 8.471728324890137
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 36 ************

Collecting data to be used for training...
At timestep:     1015 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 32.6129035949707
Eval_StdReturn : 6.4140849113464355
Eval_MaxReturn : 49.0
Eval_MinReturn : 23.0
Eval_AverageEpLen : 32.61290322580645
Train_AverageReturn : 33.833332061

At timestep:     1082 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 80.61538696289062
Eval_StdReturn : 30.595470428466797
Eval_MaxReturn : 139.0
Eval_MinReturn : 43.0
Eval_AverageEpLen : 80.61538461538461
Train_AverageReturn : 90.16666412353516
Train_StdReturn : 36.58285903930664
Train_MaxReturn : 185.0
Train_MinReturn : 57.0
Train_AverageEpLen : 90.16666666666667
Train_EnvstepsSoFar : 49662
TimeSinceStart : 43.21893501281738
Training Loss : 9.906045913696289
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 48 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 86.66666412353516
Eval_StdReturn : 35.12200927734375
Eval_MaxReturn : 144.0
Eval_MinReturn : 43.0
Eval_AverageEpLen : 86.66666666666667
Train_AverageReturn : 71.78571319

At timestep:     1022 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 116.33333587646484
Eval_StdReturn : 53.047149658203125
Eval_MaxReturn : 200.0
Eval_MinReturn : 38.0
Eval_AverageEpLen : 116.33333333333333
Train_AverageReturn : 113.55555725097656
Train_StdReturn : 32.47601318359375
Train_MaxReturn : 194.0
Train_MinReturn : 74.0
Train_AverageEpLen : 113.55555555555556
Train_EnvstepsSoFar : 62380
TimeSinceStart : 53.40447998046875
Training Loss : 10.92153549194336
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 60 ************

Collecting data to be used for training...
At timestep:     1028 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 142.875
Eval_StdReturn : 47.55375289916992
Eval_MaxReturn : 200.0
Eval_MinReturn : 51.0
Eval_AverageEpLen : 142.875
Train_AverageReturn : 146.85714721679688
Train_St

At timestep:     1084 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 187.000
Eval_StdReturn : 29.068883895874023
Eval_MaxReturn : 200.0
Eval_MinReturn : 122.0
Eval_AverageEpLen : 187.0
Train_AverageReturn : 180.6666717529297
Train_StdReturn : 25.58428382873535
Train_MaxReturn : 200.0
Train_MinReturn : 136.0
Train_AverageEpLen : 180.66666666666666
Train_EnvstepsSoFar : 75351
TimeSinceStart : 64.32054591178894
Training Loss : 9.968252182006836
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 72 ************

Collecting data to be used for training...
At timestep:     1031 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 167.500
Eval_StdReturn : 34.374168395996094
Eval_MaxReturn : 200.0
Eval_MinReturn : 111.0
Eval_AverageEpLen : 167.5
Train_AverageReturn : 147.2857208251953
Train_StdReturn : 35.53153228759

At timestep:     1120 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 126.375
Eval_StdReturn : 27.590476989746094
Eval_MaxReturn : 159.0
Eval_MinReturn : 84.0
Eval_AverageEpLen : 126.375
Train_AverageReturn : 124.44444274902344
Train_StdReturn : 34.86198425292969
Train_MaxReturn : 186.0
Train_MinReturn : 77.0
Train_AverageEpLen : 124.44444444444444
Train_EnvstepsSoFar : 88157
TimeSinceStart : 75.53221297264099
Training Loss : 10.168410301208496
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 84 ************

Collecting data to be used for training...
At timestep:     1060 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 114.44444274902344
Eval_StdReturn : 34.483848571777344
Eval_MaxReturn : 200.0
Eval_MinReturn : 79.0
Eval_AverageEpLen : 114.44444444444444
Train_AverageReturn : 117.77777862548828
Train_

At timestep:     1008 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 96.36363983154297
Eval_StdReturn : 24.673404693603516
Eval_MaxReturn : 149.0
Eval_MinReturn : 73.0
Eval_AverageEpLen : 96.36363636363636
Train_AverageReturn : 91.63636016845703
Train_StdReturn : 15.801453590393066
Train_MaxReturn : 136.0
Train_MinReturn : 72.0
Train_AverageEpLen : 91.63636363636364
Train_EnvstepsSoFar : 100843
TimeSinceStart : 87.7284939289093
Training Loss : 9.083993911743164
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 96 ************

Collecting data to be used for training...
At timestep:     1074 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 92.18181610107422
Eval_StdReturn : 16.04435920715332
Eval_MaxReturn : 121.0
Eval_MinReturn : 76.0
Eval_AverageEpLen : 92.18181818181819
Train_AverageReturn : 97.6363601

At timestep:     1017 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 39.153846740722656
Eval_StdReturn : 16.04726791381836
Eval_MaxReturn : 78.0
Eval_MinReturn : 19.0
Eval_AverageEpLen : 39.15384615384615
Train_AverageReturn : 44.21739196777344
Train_StdReturn : 19.5513973236084
Train_MaxReturn : 110.0
Train_MinReturn : 21.0
Train_AverageEpLen : 44.21739130434783
Train_EnvstepsSoFar : 8129
TimeSinceStart : 9.119026184082031
Training Loss : 10.537053108215332
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 8 ************

Collecting data to be used for training...
At timestep:     1049 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 34.86206817626953
Eval_StdReturn : 10.3414945602417
Eval_MaxReturn : 65.0
Eval_MinReturn : 18.0
Eval_AverageEpLen : 34.86206896551724
Train_AverageReturn : 37.4642868

At timestep:     1016 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 45.5217399597168
Eval_StdReturn : 15.24258804321289
Eval_MaxReturn : 94.0
Eval_MinReturn : 26.0
Eval_AverageEpLen : 45.52173913043478
Train_AverageReturn : 42.33333206176758
Train_StdReturn : 13.18985366821289
Train_MaxReturn : 75.0
Train_MinReturn : 27.0
Train_AverageEpLen : 42.333333333333336
Train_EnvstepsSoFar : 20355
TimeSinceStart : 28.00845718383789
Training Loss : 9.066252708435059
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 20 ************

Collecting data to be used for training...
At timestep:     1004 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 54.894737243652344
Eval_StdReturn : 33.113670349121094
Eval_MaxReturn : 163.0
Eval_MinReturn : 23.0
Eval_AverageEpLen : 54.89473684210526
Train_AverageReturn : 45.636363983

At timestep:     1071 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 77.76923370361328
Eval_StdReturn : 33.88703918457031
Eval_MaxReturn : 162.0
Eval_MinReturn : 44.0
Eval_AverageEpLen : 77.76923076923077
Train_AverageReturn : 66.9375
Train_StdReturn : 28.547040939331055
Train_MaxReturn : 143.0
Train_MinReturn : 31.0
Train_AverageEpLen : 66.9375
Train_EnvstepsSoFar : 32814
TimeSinceStart : 44.21562433242798
Training Loss : 9.439033508300781
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 32 ************

Collecting data to be used for training...
At timestep:     1081 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 60.117645263671875
Eval_StdReturn : 19.127689361572266
Eval_MaxReturn : 95.0
Eval_MinReturn : 39.0
Eval_AverageEpLen : 60.11764705882353
Train_AverageReturn : 72.06666564941406
Train_StdRet

At timestep:     1011 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 32.83871078491211
Eval_StdReturn : 8.128393173217773
Eval_MaxReturn : 59.0
Eval_MinReturn : 22.0
Eval_AverageEpLen : 32.83870967741935
Train_AverageReturn : 34.86206817626953
Train_StdReturn : 11.112975120544434
Train_MaxReturn : 65.0
Train_MinReturn : 16.0
Train_AverageEpLen : 34.86206896551724
Train_EnvstepsSoFar : 45120
TimeSinceStart : 58.15203619003296
Training Loss : 8.523886680603027
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 44 ************

Collecting data to be used for training...
At timestep:     1028 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 31.53125
Eval_StdReturn : 7.885843276977539
Eval_MaxReturn : 46.0
Eval_MinReturn : 16.0
Eval_AverageEpLen : 31.53125
Train_AverageReturn : 32.125
Train_StdReturn : 7.33463

At timestep:     1041 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 36.42856979370117
Eval_StdReturn : 7.789317607879639
Eval_MaxReturn : 53.0
Eval_MinReturn : 18.0
Eval_AverageEpLen : 36.42857142857143
Train_AverageReturn : 33.58064651489258
Train_StdReturn : 7.970810413360596
Train_MaxReturn : 54.0
Train_MinReturn : 19.0
Train_AverageEpLen : 33.58064516129032
Train_EnvstepsSoFar : 57300
TimeSinceStart : 69.68487501144409
Training Loss : 8.140542030334473
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 56 ************

Collecting data to be used for training...
At timestep:     1001 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 36.46428680419922
Eval_StdReturn : 10.32501220703125
Eval_MaxReturn : 62.0
Eval_MinReturn : 19.0
Eval_AverageEpLen : 36.464285714285715
Train_AverageReturn : 32.29032135009

At timestep:     1056 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 55.0000
Eval_StdReturn : 18.643468856811523
Eval_MaxReturn : 115.0
Eval_MinReturn : 30.0
Eval_AverageEpLen : 55.0
Train_AverageReturn : 50.28571319580078
Train_StdReturn : 19.68113136291504
Train_MaxReturn : 100.0
Train_MinReturn : 30.0
Train_AverageEpLen : 50.285714285714285
Train_EnvstepsSoFar : 69514
TimeSinceStart : 83.27256441116333
Training Loss : 7.593859672546387
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 68 ************

Collecting data to be used for training...
At timestep:     1042 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 61.411766052246094
Eval_StdReturn : 18.44884490966797
Eval_MaxReturn : 100.0
Eval_MinReturn : 40.0
Eval_AverageEpLen : 61.411764705882355
Train_AverageReturn : 65.125
Train_StdReturn : 29.173

At timestep:     1043 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 38.88461685180664
Eval_StdReturn : 12.801223754882812
Eval_MaxReturn : 92.0
Eval_MinReturn : 27.0
Eval_AverageEpLen : 38.88461538461539
Train_AverageReturn : 40.11538314819336
Train_StdReturn : 8.872285842895508
Train_MaxReturn : 60.0
Train_MinReturn : 27.0
Train_AverageEpLen : 40.11538461538461
Train_EnvstepsSoFar : 82048
TimeSinceStart : 96.13049912452698
Training Loss : 5.899107933044434
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 80 ************

Collecting data to be used for training...
At timestep:     1051 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 42.0000
Eval_StdReturn : 12.606215476989746
Eval_MaxReturn : 73.0
Eval_MinReturn : 27.0
Eval_AverageEpLen : 42.0
Train_AverageReturn : 47.772727966308594
Train_StdReturn :

At timestep:     1003 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 30.393939971923828
Eval_StdReturn : 6.710188865661621
Eval_MaxReturn : 45.0
Eval_MinReturn : 19.0
Eval_AverageEpLen : 30.393939393939394
Train_AverageReturn : 31.34375
Train_StdReturn : 7.078176975250244
Train_MaxReturn : 47.0
Train_MinReturn : 21.0
Train_AverageEpLen : 31.34375
Train_EnvstepsSoFar : 94302
TimeSinceStart : 107.88527917861938
Training Loss : 4.448338508605957
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 92 ************

Collecting data to be used for training...
At timestep:     1007 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 32.290321350097656
Eval_StdReturn : 5.826488971710205
Eval_MaxReturn : 46.0
Eval_MinReturn : 23.0
Eval_AverageEpLen : 32.29032258064516
Train_AverageReturn : 33.56666564941406
Train_StdRe

In [70]:
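# Start from a copy of the base arguments so earlier experiments are unaffected
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['reward_to_go'] = True  # use the reward-to-go estimator instead of full-trajectory returns
pg_args['n_iter'] = 100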

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/return_to_go/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/return_to_go/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/CartPole/return_to_go/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/CartPole/return_to_go/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     1012 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 27.578947067260742
Eval_StdReturn : 15.315337181091309
Eval_MaxReturn : 69.0
Eval_MinReturn : 11.0
Eval_AverageEpLen : 27.57894736842105
Train_AverageReturn : 22.0
Train_StdReturn : 10.642409324645996
Train_MaxReturn : 51.0
Train_MinReturn : 9.0
Train_AverageEpLen : 22.0
Train_EnvstepsSoFar : 1012
TimeSinceStart : 0.9216792583465576
Training Loss

At timestep:     1109 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 185.000
Eval_StdReturn : 23.3737735748291
Eval_MaxReturn : 200.0
Eval_MinReturn : 138.0
Eval_AverageEpLen : 185.0
Train_AverageReturn : 158.42857360839844
Train_StdReturn : 28.489883422851562
Train_MaxReturn : 200.0
Train_MinReturn : 117.0
Train_AverageEpLen : 158.42857142857142
Train_EnvstepsSoFar : 13498
TimeSinceStart : 15.585062265396118
Training Loss : 9.84910774230957
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 13 ************

Collecting data to be used for training...
At timestep:     1043 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 192.8333282470703
Eval_StdReturn : 9.788030624389648
Eval_MaxReturn : 200.0
Eval_MinReturn : 179.0
Eval_AverageEpLen : 192.83333333333334
Train_AverageReturn : 149.0
Train_StdReturn : 32.872047424316406

Eval_AverageReturn : 102.000
Eval_StdReturn : 18.718975067138672
Eval_MaxReturn : 122.0
Eval_MinReturn : 54.0
Eval_AverageEpLen : 102.0
Train_AverageReturn : 92.36363983154297
Train_StdReturn : 26.867595672607422
Train_MaxReturn : 121.0
Train_MinReturn : 41.0
Train_AverageEpLen : 92.36363636363636
Train_EnvstepsSoFar : 26264
TimeSinceStart : 29.208611249923706
Training Loss : 7.945452690124512
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 25 ************

Collecting data to be used for training...
At timestep:     1020 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 99.09091186523438
Eval_StdReturn : 27.4141788482666
Eval_MaxReturn : 131.0
Eval_MinReturn : 48.0
Eval_AverageEpLen : 99.0909090909091
Train_AverageReturn : 92.7272720336914
Train_StdReturn : 30.710054397583008
Train_MaxReturn : 124.0
Train_MinReturn : 39.0
Train_AverageEpLen : 92.72727272727273
Train_EnvstepsS

At timestep:     1198 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 199.6666717529297
Train_StdReturn : 0.745356023311615
Train_MaxReturn : 200.0
Train_MinReturn : 198.0
Train_AverageEpLen : 199.66666666666666
Train_EnvstepsSoFar : 40434
TimeSinceStart : 48.81577110290527
Training Loss : 8.923415184020996
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 38 ************

Collecting data to be used for training...
At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 200.0
Train_StdReturn : 0.0
Train_MaxReturn : 200.0
Train_MinReturn : 200.0
Train_AverageEpLen

At timestep:     1171 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 195.1666717529297
Train_StdReturn : 10.807662963867188
Train_MaxReturn : 200.0
Train_MinReturn : 171.0
Train_AverageEpLen : 195.16666666666666
Train_EnvstepsSoFar : 54290
TimeSinceStart : 68.05814099311829
Training Loss : 9.134928703308105
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 51 ************

Collecting data to be used for training...
At timestep:     1099 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 165.2857208251953
Eval_StdReturn : 47.5235710144043
Eval_MaxReturn : 200.0
Eval_MinReturn : 88.0
Eval_AverageEpLen : 165.28571428571428
Train_AverageReturn : 183.1666717529297
Train_StdReturn : 26.333860397338867
Tra

At timestep:     1065 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 174.6666717529297
Eval_StdReturn : 21.304668426513672
Eval_MaxReturn : 200.0
Eval_MinReturn : 134.0
Eval_AverageEpLen : 174.66666666666666
Train_AverageReturn : 152.14285278320312
Train_StdReturn : 54.67939758300781
Train_MaxReturn : 184.0
Train_MinReturn : 21.0
Train_AverageEpLen : 152.14285714285714
Train_EnvstepsSoFar : 68501
TimeSinceStart : 83.19861912727356
Training Loss : 8.311871528625488
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 64 ************

Collecting data to be used for training...
At timestep:     1126 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 174.3333282470703
Eval_StdReturn : 40.68442153930664
Eval_MaxReturn : 200.0
Eval_MinReturn : 86.0
Eval_AverageEpLen : 174.33333333333334
Train_AverageReturn : 187.6666717529297
Tr

At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 196.1666717529297
Eval_StdReturn : 5.610011100769043
Eval_MaxReturn : 200.0
Eval_MinReturn : 186.0
Eval_AverageEpLen : 196.16666666666666
Train_AverageReturn : 200.0
Train_StdReturn : 0.0
Train_MaxReturn : 200.0
Train_MinReturn : 200.0
Train_AverageEpLen : 200.0
Train_EnvstepsSoFar : 82269
TimeSinceStart : 97.24650287628174
Training Loss : 7.367879390716553
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 77 ************

Collecting data to be used for training...
At timestep:     1197 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 185.1666717529297
Eval_StdReturn : 10.66796875
Eval_MaxReturn : 200.0
Eval_MinReturn : 169.0
Eval_AverageEpLen : 185.16666666666666
Train_AverageReturn : 199.5
Train_StdReturn : 1.1180340051651
Train_MaxReturn : 200.0
T

At timestep:     1075 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 172.8333282470703
Eval_StdReturn : 6.618576526641846
Eval_MaxReturn : 186.0
Eval_MinReturn : 165.0
Eval_AverageEpLen : 172.83333333333334
Train_AverageReturn : 179.1666717529297
Train_StdReturn : 10.006941795349121
Train_MaxReturn : 193.0
Train_MinReturn : 167.0
Train_AverageEpLen : 179.16666666666666
Train_EnvstepsSoFar : 96395
TimeSinceStart : 112.86360120773315
Training Loss : 6.47220516204834
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 22.0
Done logging...




********** Iteration 90 ************

Collecting data to be used for training...
At timestep:     1087 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 182.1666717529297
Eval_StdReturn : 5.639641761779785
Eval_MaxReturn : 189.0
Eval_MinReturn : 172.0
Eval_AverageEpLen : 182.16666666666666
Train_AverageReturn : 181.1666717529297
T

At timestep:     1001 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 45.818180084228516
Eval_StdReturn : 22.16802978515625
Eval_MaxReturn : 110.0
Eval_MinReturn : 14.0
Eval_AverageEpLen : 45.81818181818182
Train_AverageReturn : 31.28125
Train_StdReturn : 17.965652465820312
Train_MaxReturn : 86.0
Train_MinReturn : 13.0
Train_AverageEpLen : 31.28125
Train_EnvstepsSoFar : 2022
TimeSinceStart : 2.2773258686065674
Training Loss : 7.402402400970459
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 2 ************

Collecting data to be used for training...
At timestep:     1038 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 48.80952453613281
Eval_StdReturn : 25.240970611572266
Eval_MaxReturn : 121.0
Eval_MinReturn : 16.0
Eval_AverageEpLen : 48.80952380952381
Train_AverageReturn : 37.07143020629883
Train_StdRe

At timestep:     1008 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 166.57142639160156
Eval_StdReturn : 34.066558837890625
Eval_MaxReturn : 200.0
Eval_MinReturn : 99.0
Eval_AverageEpLen : 166.57142857142858
Train_AverageReturn : 144.0
Train_StdReturn : 29.832868576049805
Train_MaxReturn : 192.0
Train_MinReturn : 103.0
Train_AverageEpLen : 144.0
Train_EnvstepsSoFar : 14694
TimeSinceStart : 15.656123876571655
Training Loss : 9.14818286895752
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 14 ************

Collecting data to be used for training...
At timestep:     1132 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 189.3333282470703
Eval_StdReturn : 18.651779174804688
Eval_MaxReturn : 200.0
Eval_MinReturn : 149.0
Eval_AverageEpLen : 189.33333333333334
Train_AverageReturn : 161.7142791748047
Train_StdR

At timestep:     1045 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 78.23076629638672
Eval_StdReturn : 18.288936614990234
Eval_MaxReturn : 110.0
Eval_MinReturn : 48.0
Eval_AverageEpLen : 78.23076923076923
Train_AverageReturn : 95.0
Train_StdReturn : 29.750476837158203
Train_MaxReturn : 143.0
Train_MinReturn : 62.0
Train_AverageEpLen : 95.0
Train_EnvstepsSoFar : 27594
TimeSinceStart : 28.880553007125854
Training Loss : 7.246464252471924
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 26 ************

Collecting data to be used for training...
At timestep:     1026 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 74.92857360839844
Eval_StdReturn : 14.184822082519531
Eval_MaxReturn : 93.0
Eval_MinReturn : 44.0
Eval_AverageEpLen : 74.92857142857143
Train_AverageReturn : 85.5
Train_StdReturn : 21.542593002

At timestep:     1110 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 178.1666717529297
Eval_StdReturn : 26.835403442382812
Eval_MaxReturn : 200.0
Eval_MinReturn : 130.0
Eval_AverageEpLen : 178.16666666666666
Train_AverageReturn : 158.57142639160156
Train_StdReturn : 28.0604305267334
Train_MaxReturn : 200.0
Train_MinReturn : 122.0
Train_AverageEpLen : 158.57142857142858
Train_EnvstepsSoFar : 40687
TimeSinceStart : 42.181416034698486
Training Loss : 5.584237098693848
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 38 ************

Collecting data to be used for training...
At timestep:     1049 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 164.7142791748047
Eval_StdReturn : 24.82345962524414
Eval_MaxReturn : 197.0
Eval_MinReturn : 131.0
Eval_AverageEpLen : 164.71428571428572
Train_AverageReturn : 174.

At timestep:     1033 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 90.58333587646484
Eval_StdReturn : 23.963369369506836
Eval_MaxReturn : 131.0
Eval_MinReturn : 51.0
Eval_AverageEpLen : 90.58333333333333
Train_AverageReturn : 93.90908813476562
Train_StdReturn : 20.36241912841797
Train_MaxReturn : 128.0
Train_MinReturn : 59.0
Train_AverageEpLen : 93.9090909090909
Train_EnvstepsSoFar : 53437
TimeSinceStart : 55.10128116607666
Training Loss : 4.942727565765381
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 50 ************

Collecting data to be used for training...
At timestep:     1006 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 75.57142639160156
Eval_StdReturn : 13.243403434753418
Eval_MaxReturn : 95.0
Eval_MinReturn : 45.0
Eval_AverageEpLen : 75.57142857142857
Train_AverageReturn : 83.833335876

At timestep:     1039 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 123.11111450195312
Eval_StdReturn : 13.731104850769043
Eval_MaxReturn : 144.0
Eval_MinReturn : 105.0
Eval_AverageEpLen : 123.11111111111111
Train_AverageReturn : 115.44444274902344
Train_StdReturn : 21.150840759277344
Train_MaxReturn : 133.0
Train_MinReturn : 62.0
Train_AverageEpLen : 115.44444444444444
Train_EnvstepsSoFar : 65999
TimeSinceStart : 69.19569301605225
Training Loss : 5.311056137084961
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 62 ************

Collecting data to be used for training...
At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 126.0
Eval_StdReturn : 9.433980941772461
Eval_MaxReturn : 142.0
Eval_MinReturn : 108.0
Eval_AverageEpLen : 126.0
Train_AverageReturn : 125.0
Train_StdRetur

At timestep:     1089 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 174.1666717529297
Eval_StdReturn : 26.610877990722656
Eval_MaxReturn : 200.0
Eval_MinReturn : 132.0
Eval_AverageEpLen : 174.16666666666666
Train_AverageReturn : 155.57142639160156
Train_StdReturn : 21.704885482788086
Train_MaxReturn : 200.0
Train_MinReturn : 134.0
Train_AverageEpLen : 155.57142857142858
Train_EnvstepsSoFar : 78770
TimeSinceStart : 82.15489888191223
Training Loss : 5.036842346191406
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 74 ************

Collecting data to be used for training...
At timestep:     1142 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 197.1666717529297
Eval_StdReturn : 4.25897741317749
Eval_MaxReturn : 200.0
Eval_MinReturn : 189.0
Eval_AverageEpLen : 197.16666666666666
Train_AverageReturn : 163.

At timestep:     1121 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 185.1666717529297
Eval_StdReturn : 23.80767822265625
Eval_MaxReturn : 200.0
Eval_MinReturn : 136.0
Eval_AverageEpLen : 185.16666666666666
Train_AverageReturn : 160.14285278320312
Train_StdReturn : 27.689237594604492
Train_MaxReturn : 200.0
Train_MinReturn : 125.0
Train_AverageEpLen : 160.14285714285714
Train_EnvstepsSoFar : 92253
TimeSinceStart : 95.77959609031677
Training Loss : 5.042569160461426
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 87 ************

Collecting data to be used for training...
At timestep:     1076 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 153.7142791748047
Train_StdReturn : 22.71

At timestep:     1079 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 110.500
Eval_StdReturn : 7.473285675048828
Eval_MaxReturn : 124.0
Eval_MinReturn : 101.0
Eval_AverageEpLen : 110.5
Train_AverageReturn : 107.9000015258789
Train_StdReturn : 7.006425857543945
Train_MaxReturn : 121.0
Train_MinReturn : 97.0
Train_AverageEpLen : 107.9
Train_EnvstepsSoFar : 105126
TimeSinceStart : 108.97615909576416
Training Loss : 4.946568012237549
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 23.204545974731445
Done logging...




********** Iteration 99 ************

Collecting data to be used for training...
At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 108.000
Eval_StdReturn : 3.7947330474853516
Eval_MaxReturn : 115.0
Eval_MinReturn : 103.0
Eval_AverageEpLen : 108.0
Train_AverageReturn : 111.11111450195312
Train_StdReturn : 8.49109935760498
Train_Max

At timestep:     1031 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 100.36363983154297
Eval_StdReturn : 40.041507720947266
Eval_MaxReturn : 200.0
Eval_MinReturn : 58.0
Eval_AverageEpLen : 100.36363636363636
Train_AverageReturn : 93.7272720336914
Train_StdReturn : 53.47062301635742
Train_MaxReturn : 194.0
Train_MinReturn : 28.0
Train_AverageEpLen : 93.72727272727273
Train_EnvstepsSoFar : 11370
TimeSinceStart : 11.841468095779419
Training Loss : 9.340672492980957
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 11 ************

Collecting data to be used for training...
At timestep:     1068 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 137.125
Eval_StdReturn : 35.65963363647461
Eval_MaxReturn : 189.0
Eval_MinReturn : 84.0
Eval_AverageEpLen : 137.125
Train_AverageReturn : 106.80000305175781
Train_StdR

At timestep:     1167 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 161.000
Eval_StdReturn : 29.27212142944336
Eval_MaxReturn : 200.0
Eval_MinReturn : 119.0
Eval_AverageEpLen : 161.0
Train_AverageReturn : 166.7142791748047
Train_StdReturn : 35.127174377441406
Train_MaxReturn : 200.0
Train_MinReturn : 103.0
Train_AverageEpLen : 166.71428571428572
Train_EnvstepsSoFar : 24476
TimeSinceStart : 25.337657928466797
Training Loss : 9.709859848022461
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 23 ************

Collecting data to be used for training...
At timestep:     1120 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 147.57142639160156
Eval_StdReturn : 22.35976791381836
Eval_MaxReturn : 200.0
Eval_MinReturn : 130.0
Eval_AverageEpLen : 147.57142857142858
Train_AverageReturn : 160.0
Train_StdReturn : 35

Eval_AverageReturn : 187.500
Eval_StdReturn : 22.284149169921875
Eval_MaxReturn : 200.0
Eval_MinReturn : 139.0
Eval_AverageEpLen : 187.5
Train_AverageReturn : 193.1666717529297
Train_StdReturn : 9.872801780700684
Train_MaxReturn : 200.0
Train_MinReturn : 176.0
Train_AverageEpLen : 193.16666666666666
Train_EnvstepsSoFar : 37627
TimeSinceStart : 39.13060212135315
Training Loss : 9.845951080322266
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 35 ************

Collecting data to be used for training...
At timestep:     1039 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 182.500
Eval_StdReturn : 15.85086727142334
Eval_MaxReturn : 200.0
Eval_MinReturn : 159.0
Eval_AverageEpLen : 182.5
Train_AverageReturn : 173.1666717529297
Train_StdReturn : 33.86943817138672
Train_MaxReturn : 200.0
Train_MinReturn : 107.0
Train_AverageEpLen : 173.16666666666666
Train_EnvstepsSoF

Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 168.8333282470703
Train_StdReturn : 46.29044723510742
Train_MaxReturn : 200.0
Train_MinReturn : 82.0
Train_AverageEpLen : 168.83333333333334
Train_EnvstepsSoFar : 50324
TimeSinceStart : 56.60594606399536
Training Loss : 10.142221450805664
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 47 ************

Collecting data to be used for training...
At timestep:     1167 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 148.14285278320312
Eval_StdReturn : 46.91460037231445
Eval_MaxReturn : 200.0
Eval_MinReturn : 91.0
Eval_AverageEpLen : 148.14285714285714
Train_AverageReturn : 194.5
Train_StdReturn : 12.29837417602539
Train_MaxReturn : 200.0
Train_MinReturn : 167.0
Train_AverageEpLen : 194.5
Train_EnvstepsSoFar : 51491
TimeSi

Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 173.5
Train_StdReturn : 37.4777717590332
Train_MaxReturn : 200.0
Train_MinReturn : 120.0
Train_AverageEpLen : 173.5
Train_EnvstepsSoFar : 63117
TimeSinceStart : 72.0246090888977
Training Loss : 9.81534481048584
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 59 ************

Collecting data to be used for training...
At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 200.0
Train_StdReturn : 0.0
Train_MaxReturn : 200.0
Train_MinReturn : 200.0
Train_AverageEpLen : 200.0
Train_EnvstepsSoFar : 64117
TimeSinceStart : 73.42553901672363
Training Loss : 9.778752326965332
Baseline Loss : 

At timestep:     1098 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 200.000
Eval_StdReturn : 0.0
Eval_MaxReturn : 200.0
Eval_MinReturn : 200.0
Eval_AverageEpLen : 200.0
Train_AverageReturn : 183.0
Train_StdReturn : 32.16105270385742
Train_MaxReturn : 200.0
Train_MinReturn : 112.0
Train_AverageEpLen : 183.0
Train_EnvstepsSoFar : 76807
TimeSinceStart : 90.31444692611694
Training Loss : 7.828932762145996
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 72 ************

Collecting data to be used for training...
At timestep:     1000 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 186.000
Eval_StdReturn : 27.080127716064453
Eval_MaxReturn : 200.0
Eval_MinReturn : 126.0
Eval_AverageEpLen : 186.0
Train_AverageReturn : 200.0
Train_StdReturn : 0.0
Train_MaxReturn : 200.0
Train_MinReturn : 200.0
Train_AverageE

At timestep:     1059 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 120.77777862548828
Eval_StdReturn : 10.453896522521973
Eval_MaxReturn : 135.0
Eval_MinReturn : 105.0
Eval_AverageEpLen : 120.77777777777777
Train_AverageReturn : 117.66666412353516
Train_StdReturn : 8.550503730773926
Train_MaxReturn : 131.0
Train_MinReturn : 106.0
Train_AverageEpLen : 117.66666666666667
Train_EnvstepsSoFar : 89697
TimeSinceStart : 104.8240180015564
Training Loss : 6.458822250366211
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 84 ************

Collecting data to be used for training...
At timestep:     1013 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 123.22222137451172
Eval_StdReturn : 9.040375709533691
Eval_MaxReturn : 136.0
Eval_MinReturn : 109.0
Eval_AverageEpLen : 123.22222222222223
Train_AverageReturn : 11

At timestep:     1115 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 127.000
Eval_StdReturn : 10.828204154968262
Eval_MaxReturn : 154.0
Eval_MinReturn : 117.0
Eval_AverageEpLen : 127.0
Train_AverageReturn : 123.88888549804688
Train_StdReturn : 6.740333557128906
Train_MaxReturn : 136.0
Train_MinReturn : 112.0
Train_AverageEpLen : 123.88888888888889
Train_EnvstepsSoFar : 102346
TimeSinceStart : 119.7802369594574
Training Loss : 6.170619964599609
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 27.432432174682617
Done logging...




********** Iteration 96 ************

Collecting data to be used for training...
At timestep:     1029 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 133.750
Eval_StdReturn : 7.066647052764893
Eval_MaxReturn : 147.0
Eval_MinReturn : 123.0
Eval_AverageEpLen : 133.75
Train_AverageReturn : 128.625
Train_StdReturn : 9.109576225280762
Trai

We should see the reward-to-go estimator outperform the full-returns estimator, with some runs reaching the maximum return of 200. Expect high variance between runs, however.
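
To make the difference between the two estimators concrete, here is a minimal NumPy sketch of the per-timestep weights each one assigns to $\nabla_\theta \log \pi_\theta(a_t \vert s_t)$. The function names are hypothetical and are not taken from `PGAgent`. The sketch also assumes the common convention that the reward to go at step $t$ is discounted relative to $t$, i.e. $\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.

```python
import numpy as np

def discounted_full_return(rewards, gamma):
    """Full-returns weights: every timestep gets sum_{t'=0}^{T} gamma^t' * r_t'."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    total = np.sum(discounts * rewards)
    # The same scalar is broadcast to every timestep of the trajectory.
    return np.full(len(rewards), total)

def discounted_reward_to_go(rewards, gamma):
    """Reward-to-go weights: timestep t gets sum_{t'=t}^{T} gamma^(t'-t) * r_t'."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    # Accumulate from the last timestep backwards so each step is O(1).
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [1.0, 1.0, 1.0]
full = discounted_full_return(rewards, gamma=0.5)
rtg = discounted_reward_to_go(rewards, gamma=0.5)
print(full)  # [1.75 1.75 1.75]
print(rtg)   # [1.75 1.5  1.  ]
```

With reward to go, an action is no longer credited for rewards collected before it was taken. This usually lowers the variance of the gradient estimate, which is one plausible reason for the faster learning seen in the CartPole logs above.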

In [71]:
### Visualize Policy Gradient results on CartPole
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/CartPole

We can also compare our estimators on a more complex task, Hopper, though you will probably see that neither performs well (average returns do not get much above 200). Note that on this more complex task we use a much larger batch size to reduce the variance of the policy gradient estimates.
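
The effect of batch size on variance can be illustrated with a toy problem that is unrelated to Hopper. The sketch below assumes a one-step task with a Gaussian policy $a \sim \mathcal N(\theta, 1)$ and reward $r = -(a - 3)^2$, both chosen purely for illustration. It averages the REINFORCE estimate $\nabla_\theta \log \pi_\theta(a)\, r = (a - \theta)\, r$ over batches of two sizes and compares the spread of the resulting estimates.

```python
import numpy as np

def batch_gradient_estimates(batch_size, n_repeats, theta=0.0, seed=0):
    """Return n_repeats independent batch-averaged REINFORCE gradient estimates."""
    rng = np.random.default_rng(seed)
    # Sample actions from the toy policy N(theta, 1).
    actions = rng.normal(loc=theta, scale=1.0, size=(n_repeats, batch_size))
    rewards = -(actions - 3.0) ** 2
    # For a unit-variance Gaussian, grad_theta log pi(a) = (a - theta).
    per_sample = (actions - theta) * rewards
    # Average over the batch dimension: one gradient estimate per repeat.
    return per_sample.mean(axis=1)

small = batch_gradient_estimates(batch_size=1000, n_repeats=200)
large = batch_gradient_estimates(batch_size=10000, n_repeats=200)
print("std with batch size 1000: ", small.std())
print("std with batch size 10000:", large.std())
```

In this toy setting the true gradient at $\theta = 0$ is $6$, and the standard deviation of the batch-averaged estimate shrinks roughly as $1/\sqrt{N}$ with batch size $N$. The same reasoning motivates the larger batch on the harder task, at the cost of more environment steps per iteration.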

In [72]:
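# Start from a copy of the base arguments so earlier experiments are unaffected
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = False  # full-trajectory returns estimator
# Larger batches reduce the variance of the policy gradient estimate
pg_args['batch_size'] = 10000
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100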

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/full_returns/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/full_returns/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/Hopper/full_returns/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/Hopper/full_returns/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10004 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 61.34284210205078
Eval_StdReturn : 57.76802444458008
Eval_MaxReturn : 203.4663543701172
Eval_MinReturn : 7.391658306121826
Eval_AverageEpLen : 39.96153846153846
Train_AverageReturn : 9.657980918884277
Train_StdReturn : 5.853752613067627
Train_MaxReturn : 87.45542907714844
Train_MinReturn : 3.010934352874756
Train_AverageEpLen : 13.215323645970939
Tra

Eval_AverageReturn : 69.89344787597656
Eval_StdReturn : 15.411808967590332
Eval_MaxReturn : 107.07992553710938
Eval_MinReturn : 46.55685043334961
Eval_AverageEpLen : 39.57692307692308
Train_AverageReturn : 62.408966064453125
Train_StdReturn : 14.828570365905762
Train_MaxReturn : 163.44375610351562
Train_MinReturn : 3.8587586879730225
Train_AverageEpLen : 36.974169741697416
Train_EnvstepsSoFar : 110183
TimeSinceStart : 122.91044473648071
Training Loss : 114.06008911132812
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 11 ************

Collecting data to be used for training...
At timestep:     10038 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 77.67560577392578
Eval_StdReturn : 27.553897857666016
Eval_MaxReturn : 130.55270385742188
Eval_MinReturn : 44.040863037109375
Eval_AverageEpLen : 44.04347826086956
Train_AverageReturn : 70.12300872802734
Train_StdRetu

At timestep:     10024 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 161.77586364746094
Eval_StdReturn : 31.873241424560547
Eval_MaxReturn : 208.13577270507812
Eval_MinReturn : 91.60504150390625
Eval_AverageEpLen : 83.91666666666667
Train_AverageReturn : 182.6273956298828
Train_StdReturn : 21.851411819458008
Train_MaxReturn : 236.52513122558594
Train_MinReturn : 46.24435806274414
Train_AverageEpLen : 91.12727272727273
Train_EnvstepsSoFar : 220545
TimeSinceStart : 252.09158778190613
Training Loss : 135.387451171875
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 22 ************

Collecting data to be used for training...
At timestep:     10000 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 151.78512573242188
Eval_StdReturn : 37.332122802734375
Eval_MaxReturn : 199.15621948242188
Eval_MinReturn : 81.

At timestep:     10046 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 124.95661163330078
Eval_StdReturn : 26.34847068786621
Eval_MaxReturn : 177.62786865234375
Eval_MinReturn : 84.92816162109375
Eval_AverageEpLen : 65.0625
Train_AverageReturn : 117.18943786621094
Train_StdReturn : 17.515832901000977
Train_MaxReturn : 190.31251525878906
Train_MinReturn : 80.65196228027344
Train_AverageEpLen : 63.18238993710692
Train_EnvstepsSoFar : 330891
TimeSinceStart : 358.3003737926483
Training Loss : 120.26919555664062
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 33 ************

Collecting data to be used for training...
At timestep:     10068 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 109.55646514892578
Eval_StdReturn : 20.1720027923584
Eval_MaxReturn : 160.6954345703125
Eval_MinReturn : 80.77812

At timestep:     10043 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 103.65486145019531
Eval_StdReturn : 15.924715995788574
Eval_MaxReturn : 138.12489318847656
Eval_MinReturn : 75.58454895019531
Eval_AverageEpLen : 55.888888888888886
Train_AverageReturn : 100.67804718017578
Train_StdReturn : 15.600932121276855
Train_MaxReturn : 142.73513793945312
Train_MinReturn : 68.3345947265625
Train_AverageEpLen : 54.87978142076503
Train_EnvstepsSoFar : 441202
TimeSinceStart : 477.6117477416992
Training Loss : 119.65421295166016
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 44 ************

Collecting data to be used for training...
At timestep:     10027 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 100.7995376586914
Eval_StdReturn : 11.36463737487793
Eval_MaxReturn : 129.0852508544922
Eval_MinReturn : 85.2

At timestep:     10026 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 106.31363677978516
Eval_StdReturn : 26.72528839111328
Eval_MaxReturn : 139.85496520996094
Eval_MinReturn : 15.086146354675293
Eval_AverageEpLen : 57.05555555555556
Train_AverageReturn : 111.17737579345703
Train_StdReturn : 14.717923164367676
Train_MaxReturn : 145.17945861816406
Train_MinReturn : 10.41646957397461
Train_AverageEpLen : 58.976470588235294
Train_EnvstepsSoFar : 551573
TimeSinceStart : 591.2109429836273
Training Loss : 121.83287048339844
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 55 ************

Collecting data to be used for training...
At timestep:     10030 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 113.39131927490234
Eval_StdReturn : 14.698614120483398
Eval_MaxReturn : 140.8932647705078
Eval_Mi

At timestep:     10039 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 110.55606079101562
Eval_StdReturn : 26.36712074279785
Eval_MaxReturn : 168.28091430664062
Eval_MinReturn : 71.86616516113281
Eval_AverageEpLen : 78.92307692307692
Train_AverageReturn : 111.29061889648438
Train_StdReturn : 33.58397674560547
Train_MaxReturn : 208.76467895507812
Train_MinReturn : 20.70589828491211
Train_AverageEpLen : 74.36296296296297
Train_EnvstepsSoFar : 662110
TimeSinceStart : 699.1257698535919
Training Loss : 100.21446228027344
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 66 ************

Collecting data to be used for training...
At timestep:     10034 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 108.50534057617188
Eval_StdReturn : 15.00790786743164
Eval_MaxReturn : 132.45179748535156
Eval_MinReturn : 80.7

At timestep:     10011 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 64.85153198242188
Eval_StdReturn : 4.41401481628418
Eval_MaxReturn : 73.93831634521484
Eval_MinReturn : 56.6352653503418
Eval_AverageEpLen : 40.36
Train_AverageReturn : 68.2757568359375
Train_StdReturn : 3.8290388584136963
Train_MaxReturn : 84.58049011230469
Train_MinReturn : 58.42625427246094
Train_AverageEpLen : 42.41949152542373
Train_EnvstepsSoFar : 772414
TimeSinceStart : 875.8442440032959
Training Loss : 107.4666748046875
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 77 ************

Collecting data to be used for training...
At timestep:     10021 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 65.5848388671875
Eval_StdReturn : 2.84312105178833
Eval_MaxReturn : 69.93355560302734
Eval_MinReturn : 59.59579086303711
Eval_Aver

At timestep:     10029 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 61.81971740722656
Eval_StdReturn : 7.102625846862793
Eval_MaxReturn : 80.59412384033203
Eval_MinReturn : 53.77024841308594
Eval_AverageEpLen : 37.51851851851852
Train_AverageReturn : 57.99477005004883
Train_StdReturn : 5.000138282775879
Train_MaxReturn : 96.67056274414062
Train_MinReturn : 46.930484771728516
Train_AverageEpLen : 35.56382978723404
Train_EnvstepsSoFar : 882672
TimeSinceStart : 996.1699328422546
Training Loss : 106.36421966552734
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 88 ************

Collecting data to be used for training...
At timestep:     10020 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 61.05305099487305
Eval_StdReturn : 11.282602310180664
Eval_MaxReturn : 98.31768035888672
Eval_MinReturn : 48.92453

At timestep:     10067 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 162.3877716064453
Eval_StdReturn : 23.548702239990234
Eval_MaxReturn : 176.61721801757812
Eval_MinReturn : 85.04437255859375
Eval_AverageEpLen : 79.84615384615384
Train_AverageReturn : 133.560302734375
Train_StdReturn : 36.030303955078125
Train_MaxReturn : 181.20301818847656
Train_MinReturn : 51.69378662109375
Train_AverageEpLen : 70.8943661971831
Train_EnvstepsSoFar : 993001
TimeSinceStart : 1100.396654844284
Training Loss : 138.59158325195312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 99 ************

Collecting data to be used for training...
At timestep:     10065 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 143.8063507080078
Eval_StdReturn : 26.85618019104004
Eval_MaxReturn : 177.14804077148438
Eval_MinReturn : 96.7269

At timestep:     10024 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 165.42149353027344
Eval_StdReturn : 38.07697677612305
Eval_MaxReturn : 211.10232543945312
Eval_MinReturn : 92.5669937133789
Eval_AverageEpLen : 78.6923076923077
Train_AverageReturn : 154.2207489013672
Train_StdReturn : 36.97023391723633
Train_MaxReturn : 211.47828674316406
Train_MinReturn : 49.30400848388672
Train_AverageEpLen : 75.36842105263158
Train_EnvstepsSoFar : 100330
TimeSinceStart : 95.75620460510254
Training Loss : 120.7164077758789
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 10 ************

Collecting data to be used for training...
At timestep:     10005 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 138.81393432617188
Eval_StdReturn : 54.426849365234375
Eval_MaxReturn : 210.98178100585938
Eval_MinReturn : 67.827

At timestep:     10011 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 120.54751586914062
Eval_StdReturn : 11.254537582397461
Eval_MaxReturn : 145.80169677734375
Eval_MinReturn : 106.96900177001953
Eval_AverageEpLen : 68.2
Train_AverageReturn : 139.25306701660156
Train_StdReturn : 17.421104431152344
Train_MaxReturn : 164.65394592285156
Train_MinReturn : 102.83377075195312
Train_AverageEpLen : 76.41984732824427
Train_EnvstepsSoFar : 210596
TimeSinceStart : 197.6680407524109
Training Loss : 134.3524932861328
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 21 ************

Collecting data to be used for training...
At timestep:     10011 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 101.7261734008789
Eval_StdReturn : 5.617048740386963
Eval_MaxReturn : 111.98857116699219
Eval_MinReturn : 91.74194335937

At timestep:     10012 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 80.53517150878906
Eval_StdReturn : 4.015874862670898
Eval_MaxReturn : 88.64310455322266
Eval_MinReturn : 73.09619140625
Eval_AverageEpLen : 48.19047619047619
Train_AverageReturn : 77.28224182128906
Train_StdReturn : 4.099834442138672
Train_MaxReturn : 93.31214904785156
Train_MinReturn : 66.97000122070312
Train_AverageEpLen : 46.56744186046512
Train_EnvstepsSoFar : 320811
TimeSinceStart : 284.627742767334
Training Loss : 115.16990661621094
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 32 ************

Collecting data to be used for training...
At timestep:     10036 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 82.57905578613281
Eval_StdReturn : 3.2894325256347656
Eval_MaxReturn : 90.04716491699219
Eval_MinReturn : 76.

At timestep:     10024 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 101.92935180664062
Eval_StdReturn : 10.341424942016602
Eval_MaxReturn : 126.89287567138672
Eval_MinReturn : 85.1498031616211
Eval_AverageEpLen : 61.0
Train_AverageReturn : 96.62285614013672
Train_StdReturn : 15.71243667602539
Train_MaxReturn : 163.42291259765625
Train_MinReturn : 47.43388366699219
Train_AverageEpLen : 56.95454545454545
Train_EnvstepsSoFar : 431172
TimeSinceStart : 370.129909992218
Training Loss : 114.26116180419922
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 43 ************

Collecting data to be used for training...
At timestep:     10033 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 101.37691497802734
Eval_StdReturn : 11.962509155273438
Eval_MaxReturn : 135.55445861816406
Eval_MinReturn : 82.385665893

At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 64.66065216064453
Eval_StdReturn : 10.448095321655273
Eval_MaxReturn : 83.57756042480469
Eval_MinReturn : 45.8424186706543
Eval_AverageEpLen : 40.04
Train_AverageReturn : 65.01966094970703
Train_StdReturn : 10.368696212768555
Train_MaxReturn : 88.42020416259766
Train_MinReturn : 40.67082214355469
Train_AverageEpLen : 40.18473895582329
Train_EnvstepsSoFar : 541397
TimeSinceStart : 3115.14515376091
Training Loss : 105.31111145019531
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 54 ************

Collecting data to be used for training...
At timestep:     10000 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 65.02336883544922
Eval_StdReturn : 7.5594916343688965
Eval_MaxReturn : 77.71466064453125
Eval_MinReturn : 50.021507263183594
E

At timestep:     10031 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 90.50091552734375
Eval_StdReturn : 6.6021881103515625
Eval_MaxReturn : 106.89591217041016
Eval_MinReturn : 80.86262512207031
Eval_AverageEpLen : 55.05263157894737
Train_AverageReturn : 85.822509765625
Train_StdReturn : 8.899385452270508
Train_MaxReturn : 103.23174285888672
Train_MinReturn : 45.7327995300293
Train_AverageEpLen : 52.244791666666664
Train_EnvstepsSoFar : 651594
TimeSinceStart : 30513.17103791237
Training Loss : 112.70311737060547
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 65 ************

Collecting data to be used for training...
At timestep:     10012 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 85.40222930908203
Eval_StdReturn : 3.5820629596710205
Eval_MaxReturn : 90.42115783691406
Eval_MinReturn : 75.6697

At timestep:     10021 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 32.853145599365234
Eval_StdReturn : 0.8465985655784607
Eval_MaxReturn : 35.573421478271484
Eval_MinReturn : 31.135677337646484
Eval_AverageEpLen : 24.634146341463413
Train_AverageReturn : 33.60761260986328
Train_StdReturn : 0.8726353645324707
Train_MaxReturn : 37.70553970336914
Train_MinReturn : 31.520498275756836
Train_AverageEpLen : 25.241813602015114
Train_EnvstepsSoFar : 761767
TimeSinceStart : 30633.567977905273
Training Loss : 76.75869750976562
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 76 ************

Collecting data to be used for training...
At timestep:     10015 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 32.73672866821289
Eval_StdReturn : 1.124098777770996
Eval_MaxReturn : 36.19258117675781
Eval_MinRet

At timestep:     10015 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 31.52554702758789
Eval_StdReturn : 1.9619112014770508
Eval_MaxReturn : 38.447265625
Eval_MinReturn : 29.31250762939453
Eval_AverageEpLen : 23.767441860465116
Train_AverageReturn : 31.715791702270508
Train_StdReturn : 3.3649537563323975
Train_MaxReturn : 43.08676528930664
Train_MinReturn : 3.370514154434204
Train_AverageEpLen : 23.732227488151658
Train_EnvstepsSoFar : 871907
TimeSinceStart : 30749.841537714005
Training Loss : 73.47175598144531
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 87 ************

Collecting data to be used for training...
At timestep:     10012 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 30.341449737548828
Eval_StdReturn : 0.7615182399749756
Eval_MaxReturn : 32.496402740478516
Eval_MinRet

At timestep:     10007 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 29.037914276123047
Eval_StdReturn : 0.39652082324028015
Eval_MaxReturn : 29.93153953552246
Eval_MinReturn : 28.122095108032227
Eval_AverageEpLen : 23.045454545454547
Train_AverageReturn : 29.196090698242188
Train_StdReturn : 0.43280401825904846
Train_MaxReturn : 30.470123291015625
Train_MinReturn : 28.142311096191406
Train_AverageEpLen : 23.057603686635943
Train_EnvstepsSoFar : 982009
TimeSinceStart : 30859.64511680603
Training Loss : 68.46556854248047
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 98 ************

Collecting data to be used for training...
At timestep:     10007 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 29.228071212768555
Eval_StdReturn : 0.7220174074172974
Eval_MaxReturn : 31.925382614135742
Eval_Mi

At timestep:     10007 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 60.08576583862305
Eval_StdReturn : 8.39274787902832
Eval_MaxReturn : 79.63451385498047
Eval_MinReturn : 46.0942497253418
Eval_AverageEpLen : 33.93333333333333
Train_AverageReturn : 64.28278350830078
Train_StdReturn : 9.76318645477295
Train_MaxReturn : 95.03132629394531
Train_MinReturn : 42.15681457519531
Train_AverageEpLen : 36.52189781021898
Train_EnvstepsSoFar : 90152
TimeSinceStart : 100.67051911354065
Training Loss : 112.88089752197266
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 9 ************

Collecting data to be used for training...
At timestep:     10022 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 53.47464370727539
Eval_StdReturn : 6.519129753112793
Eval_MaxReturn : 65.19525909423828
Eval_MinRetu

At timestep:     10019 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 54.074859619140625
Eval_StdReturn : 5.245983123779297
Eval_MaxReturn : 65.56797790527344
Eval_MinReturn : 43.8773078918457
Eval_AverageEpLen : 30.393939393939394
Train_AverageReturn : 53.247493743896484
Train_StdReturn : 6.088235855102539
Train_MaxReturn : 88.28999328613281
Train_MinReturn : 40.43385696411133
Train_AverageEpLen : 29.907462686567165
Train_EnvstepsSoFar : 200330
TimeSinceStart : 252.25249814987183
Training Loss : 106.76034545898438
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 20 ************

Collecting data to be used for training...
At timestep:     10009 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 59.98946762084961
Eval_StdReturn : 6.5787529945373535
Eval_MaxReturn : 73.31512451171875
Eval_MinReturn : 48.3

At timestep:     10041 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 113.7114486694336
Eval_StdReturn : 28.89047622680664
Eval_MaxReturn : 153.9755401611328
Eval_MinReturn : 45.928226470947266
Eval_AverageEpLen : 64.8125
Train_AverageReturn : 147.67315673828125
Train_StdReturn : 36.05928421020508
Train_MaxReturn : 208.26779174804688
Train_MinReturn : 51.138729095458984
Train_AverageEpLen : 79.06299212598425
Train_EnvstepsSoFar : 310733
TimeSinceStart : 389.50833201408386
Training Loss : 129.50006103515625
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 31 ************

Collecting data to be used for training...
At timestep:     10007 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 109.54266357421875
Eval_StdReturn : 45.039180755615234
Eval_MaxReturn : 187.2097930908203
Eval_MinReturn : 48.202705383

At timestep:     10038 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 52.291133880615234
Eval_StdReturn : 18.790029525756836
Eval_MaxReturn : 69.46527862548828
Eval_MinReturn : 2.3601958751678467
Eval_AverageEpLen : 40.16
Train_AverageReturn : 56.178043365478516
Train_StdReturn : 14.424795150756836
Train_MaxReturn : 76.59083557128906
Train_MinReturn : 0.21079349517822266
Train_AverageEpLen : 42.0
Train_EnvstepsSoFar : 421153
TimeSinceStart : 504.84230494499207
Training Loss : 104.5698471069336
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 42 ************

Collecting data to be used for training...
At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 50.069068908691406
Eval_StdReturn : 20.684389114379883
Eval_MaxReturn : 65.49968719482422
Eval_MinReturn : 0.5958443880081177
Ev

At timestep:     10019 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 63.55006408691406
Eval_StdReturn : 23.320634841918945
Eval_MaxReturn : 131.39764404296875
Eval_MinReturn : 10.364494323730469
Eval_AverageEpLen : 46.0
Train_AverageReturn : 64.24884796142578
Train_StdReturn : 14.702951431274414
Train_MaxReturn : 95.77143096923828
Train_MinReturn : -0.3721199631690979
Train_AverageEpLen : 46.38425925925926
Train_EnvstepsSoFar : 531379
TimeSinceStart : 631.3260102272034
Training Loss : 103.41677856445312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 53 ************

Collecting data to be used for training...
At timestep:     10022 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 61.92425537109375
Eval_StdReturn : 19.006193161010742
Eval_MaxReturn : 84.81050872802734
Eval_MinReturn : 17.322891235351

At timestep:     10010 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 44.91085433959961
Eval_StdReturn : 3.0819718837738037
Eval_MaxReturn : 55.265419006347656
Eval_MinReturn : 40.76262664794922
Eval_AverageEpLen : 34.36666666666667
Train_AverageReturn : 46.04589080810547
Train_StdReturn : 3.6041557788848877
Train_MaxReturn : 61.93684387207031
Train_MinReturn : 39.25407791137695
Train_AverageEpLen : 35.49645390070922
Train_EnvstepsSoFar : 641602
TimeSinceStart : 730.8037631511688
Training Loss : 97.18539428710938
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 64 ************

Collecting data to be used for training...
At timestep:     10022 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 46.19573211669922
Eval_StdReturn : 3.8075315952301025
Eval_MaxReturn : 61.10976028442383
Eval_MinReturn : 42.053

At timestep:     10047 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 81.15370178222656
Eval_StdReturn : 1.4353430271148682
Eval_MaxReturn : 84.09574890136719
Eval_MinReturn : 78.45085906982422
Eval_AverageEpLen : 48.57142857142857
Train_AverageReturn : 82.77550506591797
Train_StdReturn : 1.6726380586624146
Train_MaxReturn : 87.71041870117188
Train_MinReturn : 78.97630310058594
Train_AverageEpLen : 49.49261083743843
Train_EnvstepsSoFar : 751982
TimeSinceStart : 848.9804799556732
Training Loss : 119.44677734375
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 75 ************

Collecting data to be used for training...
At timestep:     10002 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 80.95610046386719
Eval_StdReturn : 1.513807773590088
Eval_MaxReturn : 84.62277221679688
Eval_MinReturn : 78.8890380

At timestep:     10013 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 87.93426513671875
Eval_StdReturn : 37.084354400634766
Eval_MaxReturn : 162.18116760253906
Eval_MinReturn : 44.527828216552734
Eval_AverageEpLen : 53.578947368421055
Train_AverageReturn : 68.09782409667969
Train_StdReturn : 23.155122756958008
Train_MaxReturn : 154.24749755859375
Train_MinReturn : 41.43343734741211
Train_AverageEpLen : 47.45497630331754
Train_EnvstepsSoFar : 862233
TimeSinceStart : 942.3539712429047
Training Loss : 105.68356323242188
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 86 ************

Collecting data to be used for training...
At timestep:     10008 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 74.77178192138672
Eval_StdReturn : 22.843124389648438
Eval_MaxReturn : 121.78209686279297
Eval_MinReturn : 4

At timestep:     10018 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 58.700618743896484
Eval_StdReturn : 4.865888595581055
Eval_MaxReturn : 71.66315460205078
Eval_MinReturn : 52.04790496826172
Eval_AverageEpLen : 35.82142857142857
Train_AverageReturn : 65.21920776367188
Train_StdReturn : 8.4456148147583
Train_MaxReturn : 93.57017517089844
Train_MinReturn : 48.02781677246094
Train_AverageEpLen : 39.75396825396825
Train_EnvstepsSoFar : 972565
TimeSinceStart : 1030.543050289154
Training Loss : 109.3937759399414
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 97 ************

Collecting data to be used for training...
At timestep:     10024 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 56.43861389160156
Eval_StdReturn : 4.891231536865234
Eval_MaxReturn : 70.1384048461914
Eval_MinReturn : 49.520965576

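The Hopper experiment in the next cell enables the `reward_to_go` option. As a minimal sketch of what that option is commonly understood to mean, the snippet here contrasts the full-trajectory discounted return $\tilde R(\tau)$ with a discounted reward-to-go, in which the action at time $t$ is weighted only by rewards collected from $t$ onward. This is an assumption about the option's behavior: the `PGAgent` implementation in `deeprl` is not shown here and may differ in details such as the discount indexing. The helper names `discounted_return` and `discounted_reward_to_go` are hypothetical and are not part of the library.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Full-trajectory discounted return, repeated for every timestep.

    Hypothetical helper: every action in the trajectory gets the same weight,
    sum_{t'=0}^{T} gamma^{t'} r_{t'}.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    total = np.sum(discounts * rewards)
    return np.full(len(rewards), total)

def discounted_reward_to_go(rewards, gamma):
    """Discounted sum of rewards from each timestep t to the end.

    Hypothetical helper: the action at time t is weighted by
    sum_{t'=t}^{T} gamma^{t'-t} r_{t'} (assumed indexing convention).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the sum of later rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

if __name__ == "__main__":
    toy_rewards = [1.0, 1.0, 1.0]
    print(discounted_return(toy_rewards, gamma=0.5))
    print(discounted_reward_to_go(toy_rewards, gamma=0.5))
```

Under this sketch, later actions receive smaller weights than earlier ones, because rewards earned before an action was taken no longer contribute to its weight. This is usually motivated as a variance-reduction measure for the gradient estimate.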
In [73]:
# Copy the base arguments so that changes here do not affect other experiments
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = True  # enable the reward-to-go option
pg_args['batch_size'] = 10000  # environment steps collected per iteration
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/return_to_go/'.format(env_str))

# Run the same experiment with three seeds, logging each run to its own directory
for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/return_to_go/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/Hopper/return_to_go/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/Hopper/return_to_go/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10004 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 97.36431121826172
Eval_StdReturn : 56.1705207824707
Eval_MaxReturn : 247.20233154296875
Eval_MinReturn : 17.86540985107422
Eval_AverageEpLen : 61.64705882352941
Train_AverageReturn : 9.657980918884277
Train_StdReturn : 5.853752613067627
Train_MaxReturn : 87.45542907714844
Train_MinReturn : 3.010934352874756
Train_AverageEpLen : 13.21532364597093

At timestep:     10044 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 141.20445251464844
Eval_StdReturn : 15.032516479492188
Eval_MaxReturn : 166.6125946044922
Eval_MinReturn : 111.88572692871094
Eval_AverageEpLen : 70.06666666666666
Train_AverageReturn : 156.4729766845703
Train_StdReturn : 16.73090934753418
Train_MaxReturn : 203.66421508789062
Train_MinReturn : 119.04261016845703
Train_AverageEpLen : 74.95522388059702
Train_EnvstepsSoFar : 120417
TimeSinceStart : 136.10811018943787
Training Loss : 145.02505493164062
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 12 ************

Collecting data to be used for training...
At timestep:     10074 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 130.7698974609375
Eval_StdReturn : 11.940099716186523
Eval_MaxReturn : 149.49746704101562
Eval_MinReturn : 10

At timestep:     10051 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 197.45745849609375
Eval_StdReturn : 10.737997055053711
Eval_MaxReturn : 211.17332458496094
Eval_MinReturn : 171.08966064453125
Eval_AverageEpLen : 91.0
Train_AverageReturn : 198.16322326660156
Train_StdReturn : 7.0099992752075195
Train_MaxReturn : 214.23690795898438
Train_MinReturn : 175.2529296875
Train_AverageEpLen : 91.37272727272727
Train_EnvstepsSoFar : 230882
TimeSinceStart : 253.248211145401
Training Loss : 156.94590759277344
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 23 ************

Collecting data to be used for training...
At timestep:     10052 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 190.188232421875
Eval_StdReturn : 18.8946590423584
Eval_MaxReturn : 206.81756591796875
Eval_MinReturn : 141.8779296875
Eval_A

At timestep:     10032 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 167.41226196289062
Eval_StdReturn : 15.48440170288086
Eval_MaxReturn : 192.04769897460938
Eval_MinReturn : 140.5677032470703
Eval_AverageEpLen : 82.6923076923077
Train_AverageReturn : 165.17172241210938
Train_StdReturn : 24.101377487182617
Train_MaxReturn : 200.47596740722656
Train_MinReturn : 102.49371337890625
Train_AverageEpLen : 80.90322580645162
Train_EnvstepsSoFar : 341228
TimeSinceStart : 353.2832462787628
Training Loss : 141.08807373046875
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 34 ************

Collecting data to be used for training...
At timestep:     10035 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 172.38087463378906
Eval_StdReturn : 12.002298355102539
Eval_MaxReturn : 188.1773681640625
Eval_MinReturn : 140

At timestep:     10030 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 154.8797607421875
Eval_StdReturn : 46.942325592041016
Eval_MaxReturn : 189.38722229003906
Eval_MinReturn : -1.7917325496673584
Eval_AverageEpLen : 79.07692307692308
Train_AverageReturn : 161.80113220214844
Train_StdReturn : 31.451339721679688
Train_MaxReturn : 274.0237731933594
Train_MinReturn : 2.4339327812194824
Train_AverageEpLen : 82.89256198347107
Train_EnvstepsSoFar : 451494
TimeSinceStart : 455.0623230934143
Training Loss : 134.97552490234375
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 45 ************

Collecting data to be used for training...
At timestep:     10076 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 168.03086853027344
Eval_StdReturn : 23.781099319458008
Eval_MaxReturn : 236.2731475830078
Eval_MinReturn : 1

At timestep:     10015 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 201.96621704101562
Eval_StdReturn : 48.78855514526367
Eval_MaxReturn : 288.8631591796875
Eval_MinReturn : 138.1765899658203
Eval_AverageEpLen : 100.4
Train_AverageReturn : 229.67051696777344
Train_StdReturn : 83.52855682373047
Train_MaxReturn : 516.6126098632812
Train_MinReturn : 150.03390502929688
Train_AverageEpLen : 106.54255319148936
Train_EnvstepsSoFar : 562004
TimeSinceStart : 550.2073972225189
Training Loss : 160.0828399658203
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 56 ************

Collecting data to be used for training...
At timestep:     10141 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 184.4468994140625
Eval_StdReturn : 72.87672424316406
Eval_MaxReturn : 335.3495788574219
Eval_MinReturn : 73.2235336303711
Ev

At timestep:     10083 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 338.6156005859375
Eval_StdReturn : 93.64818572998047
Eval_MaxReturn : 428.9682922363281
Eval_MinReturn : 107.15160369873047
Eval_AverageEpLen : 123.0
Train_AverageReturn : 296.2561340332031
Train_StdReturn : 136.74363708496094
Train_MaxReturn : 584.3955078125
Train_MinReturn : -0.6069857478141785
Train_AverageEpLen : 112.03333333333333
Train_EnvstepsSoFar : 672743
TimeSinceStart : 634.8208911418915
Training Loss : 206.20497131347656
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 67 ************

Collecting data to be used for training...
At timestep:     10015 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 245.22625732421875
Eval_StdReturn : 144.47906494140625
Eval_MaxReturn : 463.843017578125
Eval_MinReturn : 90.18399047851562
E

At timestep:     10010 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 299.5079345703125
Eval_StdReturn : 115.74913787841797
Eval_MaxReturn : 498.2506103515625
Eval_MinReturn : 95.42056274414062
Eval_AverageEpLen : 123.66666666666667
Train_AverageReturn : 307.2569580078125
Train_StdReturn : 110.95171356201172
Train_MaxReturn : 531.7677001953125
Train_MinReturn : 87.14209747314453
Train_AverageEpLen : 123.58024691358025
Train_EnvstepsSoFar : 783499
TimeSinceStart : 720.0322370529175
Training Loss : 194.72604370117188
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 78 ************

Collecting data to be used for training...
At timestep:     10118 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 350.98468017578125
Eval_StdReturn : 103.31111907958984
Eval_MaxReturn : 535.5054931640625
Eval_MinReturn :

At timestep:     10064 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 408.9747314453125
Eval_StdReturn : 125.59825897216797
Eval_MaxReturn : 612.5953369140625
Eval_MinReturn : 186.8405303955078
Eval_AverageEpLen : 174.83333333333334
Train_AverageReturn : 320.29071044921875
Train_StdReturn : 88.08817291259766
Train_MaxReturn : 534.1858520507812
Train_MinReturn : 173.45193481445312
Train_AverageEpLen : 150.2089552238806
Train_EnvstepsSoFar : 894230
TimeSinceStart : 803.9000110626221
Training Loss : 169.16238403320312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...




********** Iteration 89 ************

Collecting data to be used for training...
At timestep:     10059 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 330.4130554199219
Eval_StdReturn : 94.88734436035156
Eval_MaxReturn : 485.009765625
Eval_MinReturn : 187.245178

At timestep:     10169 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 263.6753845214844
Eval_StdReturn : 61.844573974609375
Eval_MaxReturn : 378.4414978027344
Eval_MinReturn : 193.18544006347656
Eval_AverageEpLen : 187.0
Train_AverageReturn : 317.63580322265625
Train_StdReturn : 89.16246032714844
Train_MaxReturn : 496.59686279296875
Train_MinReturn : 120.39752960205078
Train_AverageEpLen : 172.35593220338984
Train_EnvstepsSoFar : 1005008
TimeSinceStart : 891.2509171962738
Training Loss : 145.78717041015625
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.657980918884277
Done logging...


Running policy gradient experiment with seed 1
########################
logging outputs to  logs/policy_gradient/Hopper/return_to_go/seed1
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hop

At timestep:     10065 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 240.5706787109375
Eval_StdReturn : 109.90740203857422
Eval_MaxReturn : 457.4300537109375
Eval_MinReturn : 80.2801742553711
Eval_AverageEpLen : 102.2
Train_AverageReturn : 207.6353759765625
Train_StdReturn : 89.88446044921875
Train_MaxReturn : 422.74578857421875
Train_MinReturn : 96.10395812988281
Train_AverageEpLen : 94.95283018867924
Train_EnvstepsSoFar : 110370
TimeSinceStart : 86.26026272773743
Training Loss : 155.79800415039062
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 11 ************

Collecting data to be used for training...
At timestep:     10152 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 324.76116943359375
Eval_StdReturn : 79.50770568847656
Eval_MaxReturn : 441.11126708984375
Eval_MinReturn : 167.33396911621094

At timestep:     10029 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 249.467041015625
Eval_StdReturn : 59.91781997680664
Eval_MaxReturn : 319.95159912109375
Eval_MinReturn : 147.69558715820312
Eval_AverageEpLen : 120.77777777777777
Train_AverageReturn : 228.01675415039062
Train_StdReturn : 29.467479705810547
Train_MaxReturn : 301.6164245605469
Train_MinReturn : 183.0254669189453
Train_AverageEpLen : 105.56842105263158
Train_EnvstepsSoFar : 221139
TimeSinceStart : 180.002338886261
Training Loss : 155.7416534423828
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 22 ************

Collecting data to be used for training...
At timestep:     10059 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 245.059326171875
Eval_StdReturn : 33.44093322753906
Eval_MaxReturn : 286.7674560546875
Eval_MinReturn : 189.704

At timestep:     10179 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 319.8657531738281
Eval_StdReturn : 40.59473419189453
Eval_MaxReturn : 391.2047119140625
Eval_MinReturn : 248.78717041015625
Eval_AverageEpLen : 144.28571428571428
Train_AverageReturn : 330.73822021484375
Train_StdReturn : 58.113037109375
Train_MaxReturn : 447.0581970214844
Train_MinReturn : 191.32598876953125
Train_AverageEpLen : 154.22727272727272
Train_EnvstepsSoFar : 332007
TimeSinceStart : 270.15208983421326
Training Loss : 159.89222717285156
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 33 ************

Collecting data to be used for training...
At timestep:     10076 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 318.9512023925781
Eval_StdReturn : 50.296966552734375
Eval_MaxReturn : 390.4940490722656
Eval_MinReturn : 227.

At timestep:     10052 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 276.0971984863281
Eval_StdReturn : 94.49604034423828
Eval_MaxReturn : 472.5479736328125
Eval_MinReturn : 145.5658721923828
Eval_AverageEpLen : 121.88888888888889
Train_AverageReturn : 326.4314270019531
Train_StdReturn : 93.15843200683594
Train_MaxReturn : 549.7386474609375
Train_MinReturn : 164.60086059570312
Train_AverageEpLen : 137.6986301369863
Train_EnvstepsSoFar : 442673
TimeSinceStart : 354.81927394866943
Training Loss : 174.45681762695312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 44 ************

Collecting data to be used for training...
At timestep:     10053 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 338.8292236328125
Eval_StdReturn : 110.60386657714844
Eval_MaxReturn : 524.4000244140625
Eval_MinReturn : 180.5

At timestep:     10070 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 180.46446228027344
Eval_StdReturn : 120.91796875
Eval_MaxReturn : 437.39910888671875
Eval_MinReturn : 22.868453979492188
Eval_AverageEpLen : 78.71428571428571
Train_AverageReturn : 196.44117736816406
Train_StdReturn : 130.4986114501953
Train_MaxReturn : 555.2952880859375
Train_MinReturn : 12.081817626953125
Train_AverageEpLen : 83.91666666666667
Train_EnvstepsSoFar : 553360
TimeSinceStart : 439.28030276298523
Training Loss : 163.3215789794922
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 55 ************

Collecting data to be used for training...
At timestep:     10068 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 165.02169799804688
Eval_StdReturn : 110.20954132080078
Eval_MaxReturn : 365.70025634765625
Eval_MinReturn : 20.753

At timestep:     10123 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 395.3037109375
Eval_StdReturn : 52.223880767822266
Eval_MaxReturn : 487.44085693359375
Eval_MinReturn : 337.6259765625
Eval_AverageEpLen : 196.5
Train_AverageReturn : 345.215576171875
Train_StdReturn : 85.53211975097656
Train_MaxReturn : 497.7361755371094
Train_MinReturn : 111.84888458251953
Train_AverageEpLen : 165.95081967213116
Train_EnvstepsSoFar : 664009
TimeSinceStart : 528.0279359817505
Training Loss : 152.94293212890625
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 66 ************

Collecting data to be used for training...
At timestep:     10019 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 353.7002868652344
Eval_StdReturn : 26.3836727142334
Eval_MaxReturn : 387.6013488769531
Eval_MinReturn : 315.36749267578125
Eval_A

At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 321.2747497558594
Eval_StdReturn : 26.70758819580078
Eval_MaxReturn : 358.35211181640625
Eval_MinReturn : 283.5878601074219
Eval_AverageEpLen : 154.57142857142858
Train_AverageReturn : 332.9762268066406
Train_StdReturn : 36.44728088378906
Train_MaxReturn : 412.6811218261719
Train_MinReturn : 237.1576385498047
Train_AverageEpLen : 166.76666666666668
Train_EnvstepsSoFar : 774733
TimeSinceStart : 618.0173296928406
Training Loss : 146.64529418945312
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 77 ************

Collecting data to be used for training...
At timestep:     10019 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 344.9144592285156
Eval_StdReturn : 53.06108856201172
Eval_MaxReturn : 423.58892822265625
Eval_MinReturn : 281.9

At timestep:     10073 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 328.13348388671875
Eval_StdReturn : 77.15644073486328
Eval_MaxReturn : 396.3882751464844
Eval_MinReturn : 168.81509399414062
Eval_AverageEpLen : 179.16666666666666
Train_AverageReturn : 323.2600402832031
Train_StdReturn : 81.85011291503906
Train_MaxReturn : 458.7667236328125
Train_MinReturn : 77.87782287597656
Train_AverageEpLen : 176.71929824561403
Train_EnvstepsSoFar : 885705
TimeSinceStart : 717.770397901535
Training Loss : 133.04476928710938
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 88 ************

Collecting data to be used for training...
At timestep:     10079 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 280.90545654296875
Eval_StdReturn : 107.39653778076172
Eval_MaxReturn : 360.8293762207031
Eval_MinReturn : 105.

At timestep:     10111 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 346.0997009277344
Eval_StdReturn : 12.065855979919434
Eval_MaxReturn : 372.65301513671875
Eval_MinReturn : 332.50518798828125
Eval_AverageEpLen : 158.42857142857142
Train_AverageReturn : 334.3992004394531
Train_StdReturn : 20.820512771606445
Train_MaxReturn : 376.16119384765625
Train_MinReturn : 255.0543975830078
Train_AverageEpLen : 155.55384615384617
Train_EnvstepsSoFar : 997071
TimeSinceStart : 839.6443557739258
Training Loss : 153.64190673828125
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.049403190612793
Done logging...




********** Iteration 99 ************

Collecting data to be used for training...
At timestep:     10163 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 341.97119140625
Eval_StdReturn : 24.763477325439453
Eval_MaxReturn : 387.0968322753906
Eval_MinReturn : 308

At timestep:     10049 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 124.49490356445312
Eval_StdReturn : 20.39849853515625
Eval_MaxReturn : 153.55621337890625
Eval_MinReturn : 80.66252899169922
Eval_AverageEpLen : 64.4375
Train_AverageReturn : 125.23103332519531
Train_StdReturn : 27.933921813964844
Train_MaxReturn : 207.0884246826172
Train_MinReturn : 23.048498153686523
Train_AverageEpLen : 64.41666666666667
Train_EnvstepsSoFar : 100329
TimeSinceStart : 118.64782476425171
Training Loss : 124.97303009033203
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 10 ************

Collecting data to be used for training...
At timestep:     10041 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 122.97917938232422
Eval_StdReturn : 28.59758186340332
Eval_MaxReturn : 177.1664276123047
Eval_MinReturn : 77.964736938

At timestep:     10074 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 179.958984375
Eval_StdReturn : 28.428102493286133
Eval_MaxReturn : 201.1160430908203
Eval_MinReturn : 89.27960968017578
Eval_AverageEpLen : 89.41666666666667
Train_AverageReturn : 183.0699462890625
Train_StdReturn : 10.602184295654297
Train_MaxReturn : 209.00479125976562
Train_MinReturn : 138.59188842773438
Train_AverageEpLen : 86.84482758620689
Train_EnvstepsSoFar : 210755
TimeSinceStart : 218.78533101081848
Training Loss : 145.0617218017578
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 21 ************

Collecting data to be used for training...
At timestep:     10034 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 193.10069274902344
Eval_StdReturn : 9.789602279663086
Eval_MaxReturn : 208.69322204589844
Eval_MinReturn : 176.016

At timestep:     10012 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 119.11956787109375
Eval_StdReturn : 24.64145278930664
Eval_MaxReturn : 147.05624389648438
Eval_MinReturn : 72.73356628417969
Eval_AverageEpLen : 63.8125
Train_AverageReturn : 120.81153869628906
Train_StdReturn : 26.251598358154297
Train_MaxReturn : 211.395751953125
Train_MinReturn : 70.83440399169922
Train_AverageEpLen : 65.01298701298701
Train_EnvstepsSoFar : 321257
TimeSinceStart : 302.8427758216858
Training Loss : 117.8016128540039
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 32 ************

Collecting data to be used for training...
At timestep:     10030 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 123.45471954345703
Eval_StdReturn : 34.914459228515625
Eval_MaxReturn : 188.4209442138672
Eval_MinReturn : 62.716270446777

At timestep:     10060 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 117.4285888671875
Eval_StdReturn : 25.63446044921875
Eval_MaxReturn : 148.6278533935547
Eval_MinReturn : 74.84500885009766
Eval_AverageEpLen : 63.0625
Train_AverageReturn : 109.05223083496094
Train_StdReturn : 24.75372886657715
Train_MaxReturn : 156.15750122070312
Train_MinReturn : 56.35959243774414
Train_AverageEpLen : 59.1764705882353
Train_EnvstepsSoFar : 431593
TimeSinceStart : 394.08034086227417
Training Loss : 116.0173568725586
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 43 ************

Collecting data to be used for training...
At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 114.42523956298828
Eval_StdReturn : 22.647424697875977
Eval_MaxReturn : 143.56163024902344
Eval_MinReturn : 65.

At timestep:     10029 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 175.862548828125
Eval_StdReturn : 4.303638458251953
Eval_MaxReturn : 183.8257598876953
Eval_MinReturn : 169.3415069580078
Eval_AverageEpLen : 83.41666666666667
Train_AverageReturn : 177.13873291015625
Train_StdReturn : 6.031843662261963
Train_MaxReturn : 191.7562713623047
Train_MinReturn : 164.86927795410156
Train_AverageEpLen : 83.575
Train_EnvstepsSoFar : 541969
TimeSinceStart : 496.09584283828735
Training Loss : 148.24717712402344
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 54 ************

Collecting data to be used for training...
At timestep:     10019 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 172.92433166503906
Eval_StdReturn : 8.139564514160156
Eval_MaxReturn : 189.83995056152344
Eval_MinReturn : 152.068115234375

At timestep:     10049 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 180.47828674316406
Eval_StdReturn : 5.10537052154541
Eval_MaxReturn : 187.23126220703125
Eval_MinReturn : 172.02464294433594
Eval_AverageEpLen : 83.58333333333333
Train_AverageReturn : 174.3301544189453
Train_StdReturn : 7.0738091468811035
Train_MaxReturn : 195.1807861328125
Train_MinReturn : 156.80795288085938
Train_AverageEpLen : 83.0495867768595
Train_EnvstepsSoFar : 652341
TimeSinceStart : 586.307247877121
Training Loss : 145.8621368408203
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 65 ************

Collecting data to be used for training...
At timestep:     10063 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 192.0714874267578
Eval_StdReturn : 10.162866592407227
Eval_MaxReturn : 208.0217742919922
Eval_MinReturn : 178.174

At timestep:     10074 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 149.8384246826172
Eval_StdReturn : 55.151824951171875
Eval_MaxReturn : 186.46151733398438
Eval_MinReturn : 14.782209396362305
Eval_AverageEpLen : 72.07142857142857
Train_AverageReturn : 175.46347045898438
Train_StdReturn : 16.708438873291016
Train_MaxReturn : 201.61561584472656
Train_MinReturn : 17.821237564086914
Train_AverageEpLen : 82.57377049180327
Train_EnvstepsSoFar : 762957
TimeSinceStart : 672.0764496326447
Training Loss : 149.42384338378906
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 76 ************

Collecting data to be used for training...
At timestep:     10040 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 131.98321533203125
Eval_StdReturn : 64.62855529785156
Eval_MaxReturn : 183.99891662597656
Eval_MinReturn : 

At timestep:     10021 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 60.42292022705078
Eval_StdReturn : 5.351215839385986
Eval_MaxReturn : 71.07717895507812
Eval_MinReturn : 50.9275016784668
Eval_AverageEpLen : 34.4
Train_AverageReturn : 60.65076446533203
Train_StdReturn : 7.667541980743408
Train_MaxReturn : 145.97854614257812
Train_MinReturn : 48.97329330444336
Train_AverageEpLen : 34.31849315068493
Train_EnvstepsSoFar : 873459
TimeSinceStart : 758.4447586536407
Training Loss : 86.89322662353516
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 87 ************

Collecting data to be used for training...
At timestep:     10005 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 62.310691833496094
Eval_StdReturn : 13.986957550048828
Eval_MaxReturn : 131.86880493164062
Eval_MinReturn : 50.534942626953125
E

At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 166.9801483154297
Eval_StdReturn : 1.914033055305481
Eval_MaxReturn : 169.52749633789062
Eval_MinReturn : 163.375244140625
Eval_AverageEpLen : 83.58333333333333
Train_AverageReturn : 166.77931213378906
Train_StdReturn : 2.3248870372772217
Train_MaxReturn : 174.1495819091797
Train_MinReturn : 161.65708923339844
Train_AverageEpLen : 84.08403361344538
Train_EnvstepsSoFar : 983650
TimeSinceStart : 849.3052098751068
Training Loss : 135.0792236328125
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 18.039928436279297
Done logging...




********** Iteration 98 ************

Collecting data to be used for training...
At timestep:     10045 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 167.998291015625
Eval_StdReturn : 2.0268592834472656
Eval_MaxReturn : 171.5326690673828
Eval_MinReturn : 165.628

In [74]:
### Visualize Policy Gradient results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/Hopper

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


## Variance Reduction with a Value Function Baseline
We can further reduce the variance of the policy gradient estimate by subtracting a state-dependent baseline. In this section, we train a value function network to predict the expected return of the current policy from a given state, then use its prediction as the baseline by subtracting it from our reward-to-go estimate.
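
As a sketch of what this looks like in the notation used earlier (the symbols $\hat Q_t$, $V_\phi$, and $\phi$ are introduced here for illustration and are not taken from the assignment code), write $\hat Q_t = \sum_{t'=t}^T \gamma^{t'-t} r_{t'}$ for the discounted reward-to-go and $V_\phi$ for the value network with parameters $\phi$. The baseline-subtracted estimate for a sampled trajectory is then

$$\sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t) \left(\hat Q_t - V_\phi(s_t)\right).$$

Subtracting the baseline leaves the expectation unchanged because $V_\phi(s_t)$ does not depend on $a_t$:

$$\mathbb E_{a_t \sim \pi_\theta(\cdot \vert s_t)}\left[\nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t)\right] V_\phi(s_t) = \nabla_\theta \left(\int \pi_\theta(a \vert s_t)\, da\right) V_\phi(s_t) = \nabla_\theta (1)\, V_\phi(s_t) = 0.$$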

Implement the value function baseline loss in the <code>update</code> method of the <code>MLPPolicyPG</code> class in <code>policies/MLP_policy.py</code>.
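
To make the idea of a baseline loss concrete, here is a minimal standard-library sketch. It assumes the regression targets are the Q-value estimates normalized to zero mean and unit standard deviation, which is one common choice and not necessarily what the assignment's reference solution does. The helper names `normalize` and `baseline_loss` are hypothetical and are not part of the `deeprl` package.

```python
import math


def normalize(values, eps=1e-8):
    """Shift and scale values to zero mean and (approximately) unit std."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    # eps guards against division by zero when all values are equal
    return [(v - mean) / (std + eps) for v in values]


def baseline_loss(predictions, qvals):
    """Mean squared error between baseline predictions and normalized targets.

    Assumption: the baseline is trained to predict *normalized* Q-values.
    """
    targets = normalize(qvals)
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)
```

Inside a PyTorch policy class the same quantity would typically be computed on tensors, for example with `torch.nn.functional.mse_loss`, followed by a step of the baseline's optimizer.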

In [88]:
# Test value function gradient
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

policy = MLPPolicyPG(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25,
            nn_baseline=True)

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))
acts = np.random.normal(size=(batch_size, ac_dim))
advs = 1000 * np.random.normal(size=(batch_size,))
qvals = advs

first_weight_before = np.array(ptu.to_numpy(next(policy.baseline.parameters())))
print("Weight before update", first_weight_before)

for i in range(5):
    loss = policy.update(obs, acts, advs, qvals=qvals)['Baseline Loss']

print(loss)
expected_loss = 0.925361
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")

first_weight_after = ptu.to_numpy(next(policy.baseline.parameters()))
print('Weight after update', first_weight_after)

weight_change = first_weight_after - first_weight_before
print("Change in weights", weight_change)

expected_change = np.array([[ 0.38988823,  0.70297027,  0.2609921 ],
                            [-1.0340402,  -0.84166795,  0.7254925 ]])
updated_weight_error = rel_error(weight_change, expected_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

Weight before update [[-0.23799711  0.0213871   0.22824687]
 [ 0.34642333 -0.39140946 -0.25141457]]
0.92536086
Loss Error 7.648885767117787e-08 should be on the order of 1e-6 or lower
Weight after update [[ 0.15189107  0.72435737  0.48923904]
 [-0.68761694 -1.2330772   0.474078  ]]
Change in weights [[ 0.38988817  0.70297027  0.26099217]
 [-1.0340402  -0.8416677   0.7254926 ]]
Weight Update Error 1.4154350421085664e-07 should be on the order of 1e-6 or lower


In the <code>estimate_advantage</code> function in <code>agents/pg_agent.py</code>, fill in the advantage estimate using the baseline, then test your implementation below.
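
The following sketch illustrates one way the advantage could be formed from Q-value estimates and baseline predictions. It assumes the baseline network outputs values in normalized units, so they are rescaled to the mean and standard deviation of the current batch of Q-values before subtraction. That convention is an assumption made for illustration, and the function name `estimate_advantage_sketch` is hypothetical.

```python
import math


def mean_and_std(values):
    """Return the mean and population standard deviation of a list."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def estimate_advantage_sketch(q_values, baseline_predictions):
    """Advantage = Q-value estimate minus the (rescaled) baseline prediction."""
    q_mean, q_std = mean_and_std(q_values)
    # Assumption: baseline predictions are in normalized units, so map them
    # back to the scale of the Q-values before subtracting.
    rescaled = [b * q_std + q_mean for b in baseline_predictions]
    return [q - b for q, b in zip(q_values, rescaled)]
```

With a baseline that always predicts zero in normalized units, the result is just the Q-values with their batch mean removed.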

In [91]:
### Test advantage estimation with the value function baseline
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['nn_baseline'] = True
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

obs_dim = 4
N = 10
np.random.seed(0)
obs = np.random.normal(size=(N, obs_dim))
qs = np.random.normal(size=N)

baseline_advantages = pgagent.estimate_advantage(obs, qs)
expected_advantages = np.array([-0.44662586, -0.89629588, -1.14574752,  2.43957172, -0.06601728,
       -0.00501807, -0.74720337,  1.27468092, -1.20184486,  0.25312274])

advantage_error = rel_error(expected_advantages, baseline_advantages)
print("Advantage error", advantage_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0
Advantage error 3.6888988344448045e-07 should be on the order of 1e-6 or lower


## Train your policies!
In this section, we train our policies using the reward-to-go estimator together with a learned value function baseline. On Hopper, you should see average returns consistently above 300, and sometimes above 400. Returns tend to oscillate during training. You should also see that the value function baseline greatly improves performance over our earlier experiments without it.
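
As a reminder of what the reward-to-go estimator computes, here is a small standalone sketch. The function name `discounted_reward_to_go` is hypothetical and this is not the assignment's implementation. It assumes the discount is applied relative to the current timestep, so entry `t` is the sum of `gamma ** (k - t) * rewards[k]` over all `k >= t`.

```python
def discounted_reward_to_go(rewards, gamma):
    """Return a list whose entry t is sum_{k >= t} gamma**(k - t) * rewards[k]."""
    reward_to_go = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the discounted sum of later rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        reward_to_go[t] = running
    return reward_to_go


print(discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward pass keeps the computation linear in the trajectory length, instead of recomputing a sum for every timestep.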

In [92]:
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = True
pg_args['nn_baseline'] = True
pg_args['batch_size'] = 10000
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/with_baseline/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/with_baseline/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/Hopper/with_baseline/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/Hopper/with_baseline/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10015 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 89.84849548339844
Eval_StdReturn : 33.547855377197266
Eval_MaxReturn : 155.4940185546875
Eval_MinReturn : 29.384666442871094
Eval_AverageEpLen : 58.5
Train_AverageReturn : 9.403343200683594
Train_StdReturn : 4.718483924865723
Train_MaxReturn : 46.106727600097656
Train_MinReturn : 2.846346378326416
Train_AverageEpLen : 12.939276485788113
Train_Envst

At timestep:     10062 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 184.14830017089844
Eval_StdReturn : 7.827200889587402
Eval_MaxReturn : 202.0488739013672
Eval_MinReturn : 174.2275390625
Eval_AverageEpLen : 84.66666666666667
Train_AverageReturn : 187.31427001953125
Train_StdReturn : 7.629891872406006
Train_MaxReturn : 209.68017578125
Train_MinReturn : 170.99754333496094
Train_AverageEpLen : 85.27118644067797
Train_EnvstepsSoFar : 110399
TimeSinceStart : 126.84771299362183
Training Loss : 3.736957311630249
Baseline Loss : 0.697478711605072
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 11 ************

Collecting data to be used for training...
At timestep:     10047 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 179.7754669189453
Eval_StdReturn : 6.320042610168457
Eval_MaxReturn : 189.43637084960938


At timestep:     10102 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 124.47021484375
Eval_StdReturn : 50.19852828979492
Eval_MaxReturn : 223.92430114746094
Eval_MinReturn : -0.5181845426559448
Eval_AverageEpLen : 71.42857142857143
Train_AverageReturn : 198.38404846191406
Train_StdReturn : 96.33367156982422
Train_MaxReturn : 518.9171142578125
Train_MinReturn : 6.852484703063965
Train_AverageEpLen : 113.50561797752809
Train_EnvstepsSoFar : 221028
TimeSinceStart : 232.41298580169678
Training Loss : -2.4056758880615234
Baseline Loss : 1.0359759330749512
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 22 ************

Collecting data to be used for training...
At timestep:     10096 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 270.92279052734375
Eval_StdReturn : 161.006103515625
Eval_MaxReturn : 536.378662109375
Eval_M

At timestep:     10141 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 320.9103698730469
Eval_StdReturn : 98.31527709960938
Eval_MaxReturn : 402.2198486328125
Eval_MinReturn : 88.40558624267578
Eval_AverageEpLen : 125.22222222222223
Train_AverageReturn : 338.0621032714844
Train_StdReturn : 112.13220977783203
Train_MaxReturn : 524.9471435546875
Train_MinReturn : 78.20724487304688
Train_AverageEpLen : 128.36708860759492
Train_EnvstepsSoFar : 332182
TimeSinceStart : 339.60259675979614
Training Loss : -0.5081696510314941
Baseline Loss : 0.6546671986579895
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 33 ************

Collecting data to be used for training...
At timestep:     10118 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 353.0106201171875
Eval_StdReturn : 131.37326049804688
Eval_MaxReturn : 573.7870483398438

At timestep:     10129 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 474.8321533203125
Eval_StdReturn : 89.25525665283203
Eval_MaxReturn : 551.1094970703125
Eval_MinReturn : 283.96124267578125
Eval_AverageEpLen : 174.16666666666666
Train_AverageReturn : 409.05877685546875
Train_StdReturn : 97.8569107055664
Train_MaxReturn : 604.012939453125
Train_MinReturn : 147.63624572753906
Train_AverageEpLen : 153.46969696969697
Train_EnvstepsSoFar : 443102
TimeSinceStart : 431.5042450428009
Training Loss : 2.533851385116577
Baseline Loss : 0.6139864325523376
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 44 ************

Collecting data to be used for training...
At timestep:     10058 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 467.9366149902344
Eval_StdReturn : 68.28863525390625
Eval_MaxReturn : 539.91015625
Eval_MinRetur

At timestep:     10056 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 428.10302734375
Eval_StdReturn : 51.39466857910156
Eval_MaxReturn : 516.3291015625
Eval_MinReturn : 365.7142333984375
Eval_AverageEpLen : 154.14285714285714
Train_AverageReturn : 390.4894104003906
Train_StdReturn : 39.05012130737305
Train_MaxReturn : 566.8751220703125
Train_MinReturn : 342.672119140625
Train_AverageEpLen : 143.65714285714284
Train_EnvstepsSoFar : 554253
TimeSinceStart : 519.4802477359772
Training Loss : -2.486948013305664
Baseline Loss : 0.3010525405406952
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 55 ************

Collecting data to be used for training...
At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 386.234619140625
Eval_StdReturn : 14.11349105834961
Eval_MaxReturn : 408.242919921875
Eval_MinReturn :

At timestep:     10042 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 436.8429260253906
Eval_StdReturn : 35.993289947509766
Eval_MaxReturn : 490.9903564453125
Eval_MinReturn : 394.73211669921875
Eval_AverageEpLen : 155.14285714285714
Train_AverageReturn : 411.9078369140625
Train_StdReturn : 54.22826385498047
Train_MaxReturn : 586.28271484375
Train_MinReturn : 293.96844482421875
Train_AverageEpLen : 149.88059701492537
Train_EnvstepsSoFar : 665124
TimeSinceStart : 624.1147677898407
Training Loss : -5.013966083526611
Baseline Loss : 0.3569519817829132
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 66 ************

Collecting data to be used for training...
At timestep:     10018 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 416.8251647949219
Eval_StdReturn : 75.79549407958984
Eval_MaxReturn : 561.662109375
Eval_MinRet

At timestep:     10049 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 360.9720764160156
Eval_StdReturn : 90.32759094238281
Eval_MaxReturn : 480.27130126953125
Eval_MinReturn : 180.9306640625
Eval_AverageEpLen : 158.28571428571428
Train_AverageReturn : 364.48931884765625
Train_StdReturn : 76.59603881835938
Train_MaxReturn : 537.2095947265625
Train_MinReturn : 208.8971405029297
Train_AverageEpLen : 162.08064516129033
Train_EnvstepsSoFar : 775881
TimeSinceStart : 711.5594348907471
Training Loss : 1.0810003280639648
Baseline Loss : 0.5753395557403564
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 77 ************

Collecting data to be used for training...
At timestep:     10080 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 373.47625732421875
Eval_StdReturn : 74.2846908569336
Eval_MaxReturn : 493.006103515625
Eval_MinRe

At timestep:     10047 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 381.12359619140625
Eval_StdReturn : 48.53044891357422
Eval_MaxReturn : 441.3617858886719
Eval_MinReturn : 294.1536865234375
Eval_AverageEpLen : 149.28571428571428
Train_AverageReturn : 395.85260009765625
Train_StdReturn : 60.27042007446289
Train_MaxReturn : 545.05224609375
Train_MinReturn : 171.5089569091797
Train_AverageEpLen : 156.984375
Train_EnvstepsSoFar : 886761
TimeSinceStart : 802.8970708847046
Training Loss : -4.916229724884033
Baseline Loss : 0.43578317761421204
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 88 ************

Collecting data to be used for training...
At timestep:     10164 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 460.9981689453125
Eval_StdReturn : 68.61882781982422
Eval_MaxReturn : 544.2669677734375
Eval_MinReturn 

At timestep:     10098 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 422.30572509765625
Eval_StdReturn : 51.786460876464844
Eval_MaxReturn : 502.15667724609375
Eval_MinReturn : 355.6953430175781
Eval_AverageEpLen : 158.57142857142858
Train_AverageReturn : 405.5245056152344
Train_StdReturn : 55.02557373046875
Train_MaxReturn : 543.2630615234375
Train_MinReturn : 272.45416259765625
Train_AverageEpLen : 153.0
Train_EnvstepsSoFar : 997905
TimeSinceStart : 892.639720916748
Training Loss : -5.729735374450684
Baseline Loss : 0.3133246898651123
Initial_DataCollection_AverageReturn : 9.403343200683594
Done logging...




********** Iteration 99 ************

Collecting data to be used for training...
At timestep:     10069 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 406.0423889160156
Eval_StdReturn : 35.58473587036133
Eval_MaxReturn : 451.476318359375
Eval_MinReturn : 35

At timestep:     10080 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 202.6615753173828
Eval_StdReturn : 5.987583160400391
Eval_MaxReturn : 214.0958709716797
Eval_MinReturn : 194.27371215820312
Eval_AverageEpLen : 90.75
Train_AverageReturn : 206.04759216308594
Train_StdReturn : 7.868252754211426
Train_MaxReturn : 230.9669189453125
Train_MinReturn : 185.52537536621094
Train_AverageEpLen : 91.63636363636364
Train_EnvstepsSoFar : 100353
TimeSinceStart : 81.83540606498718
Training Loss : 0.708162784576416
Baseline Loss : 0.8947439789772034
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 10 ************

Collecting data to be used for training...
At timestep:     10079 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 202.68226623535156
Eval_StdReturn : 9.17101001739502
Eval_MaxReturn : 219.1761932373047
Eval_MinReturn : 189

At timestep:     10012 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 332.6534729003906
Eval_StdReturn : 104.61033630371094
Eval_MaxReturn : 501.74609375
Eval_MinReturn : 178.9033203125
Eval_AverageEpLen : 133.125
Train_AverageReturn : 292.021728515625
Train_StdReturn : 57.30270004272461
Train_MaxReturn : 451.014404296875
Train_MinReturn : 134.7144775390625
Train_AverageEpLen : 130.02597402597402
Train_EnvstepsSoFar : 210818
TimeSinceStart : 176.49537897109985
Training Loss : -7.501700401306152
Baseline Loss : 1.07321298122406
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 21 ************

Collecting data to be used for training...
At timestep:     10101 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 298.98760986328125
Eval_StdReturn : 97.08160400390625
Eval_MaxReturn : 460.4727783203125
Eval_MinReturn : 149.0

At timestep:     10035 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 301.8909912109375
Eval_StdReturn : 22.6831111907959
Eval_MaxReturn : 353.6127624511719
Eval_MinReturn : 274.01641845703125
Eval_AverageEpLen : 133.5
Train_AverageReturn : 298.9471130371094
Train_StdReturn : 34.73854446411133
Train_MaxReturn : 406.7890625
Train_MinReturn : 203.9122772216797
Train_AverageEpLen : 133.8
Train_EnvstepsSoFar : 321574
TimeSinceStart : 269.7052049636841
Training Loss : -0.05584705248475075
Baseline Loss : 0.8844015598297119
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 32 ************

Collecting data to be used for training...
At timestep:     10039 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 310.8370361328125
Eval_StdReturn : 30.52297019958496
Eval_MaxReturn : 387.67047119140625
Eval_MinReturn : 284.46337890625
Eval

At timestep:     10115 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 335.2375183105469
Eval_StdReturn : 47.121315002441406
Eval_MaxReturn : 400.77789306640625
Eval_MinReturn : 265.6454162597656
Eval_AverageEpLen : 150.57142857142858
Train_AverageReturn : 332.6598205566406
Train_StdReturn : 57.9860954284668
Train_MaxReturn : 524.4358520507812
Train_MinReturn : 175.9961700439453
Train_AverageEpLen : 144.5
Train_EnvstepsSoFar : 432256
TimeSinceStart : 358.995285987854
Training Loss : -4.6376729011535645
Baseline Loss : 0.7971521019935608
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 43 ************

Collecting data to be used for training...
At timestep:     10169 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 338.533203125
Eval_StdReturn : 101.78726196289062
Eval_MaxReturn : 517.351806640625
Eval_MinReturn : 147.262

At timestep:     10071 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 381.9570617675781
Eval_StdReturn : 104.82630157470703
Eval_MaxReturn : 505.7802734375
Eval_MinReturn : 199.8398895263672
Eval_AverageEpLen : 160.57142857142858
Train_AverageReturn : 398.35894775390625
Train_StdReturn : 94.39500427246094
Train_MaxReturn : 575.130126953125
Train_MinReturn : 116.31471252441406
Train_AverageEpLen : 170.6949152542373
Train_EnvstepsSoFar : 543057
TimeSinceStart : 472.8730010986328
Training Loss : -13.394683837890625
Baseline Loss : 0.6632648706436157
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 54 ************

Collecting data to be used for training...
At timestep:     10145 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 366.6787109375
Eval_StdReturn : 93.61835479736328
Eval_MaxReturn : 536.7858276367188
Eval_MinRetu

At timestep:     10115 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 402.9142761230469
Eval_StdReturn : 19.1783447265625
Eval_MaxReturn : 433.6881408691406
Eval_MinReturn : 373.0976257324219
Eval_AverageEpLen : 145.57142857142858
Train_AverageReturn : 428.8387756347656
Train_StdReturn : 50.22399139404297
Train_MaxReturn : 621.1859741210938
Train_MinReturn : 369.6539306640625
Train_AverageEpLen : 153.25757575757575
Train_EnvstepsSoFar : 654064
TimeSinceStart : 572.7072992324829
Training Loss : 1.2174065113067627
Baseline Loss : 0.3119867146015167
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 65 ************

Collecting data to be used for training...
At timestep:     10029 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 382.3157043457031
Eval_StdReturn : 77.26905059814453
Eval_MaxReturn : 424.84075927734375
Eval_Min

At timestep:     10104 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 390.10137939453125
Eval_StdReturn : 159.2820587158203
Eval_MaxReturn : 618.4410400390625
Eval_MinReturn : 65.96343994140625
Eval_AverageEpLen : 136.5
Train_AverageReturn : 418.32073974609375
Train_StdReturn : 119.92778778076172
Train_MaxReturn : 649.919189453125
Train_MinReturn : 65.31732940673828
Train_AverageEpLen : 148.58823529411765
Train_EnvstepsSoFar : 764892
TimeSinceStart : 671.9147930145264
Training Loss : 0.5675199031829834
Baseline Loss : 0.2716856300830841
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 76 ************

Collecting data to be used for training...
At timestep:     10078 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 375.9290771484375
Eval_StdReturn : 102.95918273925781
Eval_MaxReturn : 567.2153930664062
Eval_MinReturn : 2

At timestep:     10118 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 462.14691162109375
Eval_StdReturn : 118.026611328125
Eval_MaxReturn : 631.00634765625
Eval_MinReturn : 249.68344116210938
Eval_AverageEpLen : 165.71428571428572
Train_AverageReturn : 406.6960144042969
Train_StdReturn : 106.40902709960938
Train_MaxReturn : 574.4083862304688
Train_MinReturn : 229.50592041015625
Train_AverageEpLen : 165.86885245901638
Train_EnvstepsSoFar : 875697
TimeSinceStart : 769.4715940952301
Training Loss : -2.4587385654449463
Baseline Loss : 0.47539976239204407
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 87 ************

Collecting data to be used for training...
At timestep:     10169 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 438.5653991699219
Eval_StdReturn : 109.49365997314453
Eval_MaxReturn : 558.08154296875
Eval_M

At timestep:     10040 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 359.00543212890625
Eval_StdReturn : 87.58600616455078
Eval_MaxReturn : 524.7109375
Eval_MinReturn : 238.81617736816406
Eval_AverageEpLen : 160.0
Train_AverageReturn : 459.1153869628906
Train_StdReturn : 84.13641357421875
Train_MaxReturn : 565.725341796875
Train_MinReturn : 249.058837890625
Train_AverageEpLen : 189.43396226415095
Train_EnvstepsSoFar : 986629
TimeSinceStart : 861.9353790283203
Training Loss : -0.8481332659721375
Baseline Loss : 0.666935920715332
Initial_DataCollection_AverageReturn : 12.80981159210205
Done logging...




********** Iteration 98 ************

Collecting data to be used for training...
At timestep:     10069 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 346.4390563964844
Eval_StdReturn : 115.29755401611328
Eval_MaxReturn : 546.2327880859375
Eval_MinReturn : 244.74015

At timestep:     10079 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 208.70855712890625
Eval_StdReturn : 3.2876529693603516
Eval_MaxReturn : 212.5325164794922
Eval_MinReturn : 203.3687286376953
Eval_AverageEpLen : 91.63636363636364
Train_AverageReturn : 186.04095458984375
Train_StdReturn : 24.494277954101562
Train_MaxReturn : 217.96131896972656
Train_MinReturn : 88.15354919433594
Train_AverageEpLen : 83.99166666666666
Train_EnvstepsSoFar : 90341
TimeSinceStart : 77.000333070755
Training Loss : -1.9618091583251953
Baseline Loss : 0.923035740852356
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 9 ************

Collecting data to be used for training...
At timestep:     10086 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 201.22972106933594
Eval_StdReturn : 8.063844680786133
Eval_MaxReturn : 215.3082733154297
Eval_Min

At timestep:     10031 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 209.7123565673828
Eval_StdReturn : 45.26104736328125
Eval_MaxReturn : 288.2976379394531
Eval_MinReturn : 143.31881713867188
Eval_AverageEpLen : 93.27272727272727
Train_AverageReturn : 190.7678680419922
Train_StdReturn : 13.084724426269531
Train_MaxReturn : 251.01068115234375
Train_MinReturn : 171.97976684570312
Train_AverageEpLen : 87.99122807017544
Train_EnvstepsSoFar : 200890
TimeSinceStart : 174.77068614959717
Training Loss : -2.4105491638183594
Baseline Loss : 0.7185052633285522
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 20 ************

Collecting data to be used for training...
At timestep:     10047 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 243.8162078857422
Eval_StdReturn : 69.92071533203125
Eval_MaxReturn : 327.10760498046875
Eva

At timestep:     10130 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 374.0146484375
Eval_StdReturn : 92.18383026123047
Eval_MaxReturn : 545.8323974609375
Eval_MinReturn : 261.19500732421875
Eval_AverageEpLen : 154.42857142857142
Train_AverageReturn : 413.4175109863281
Train_StdReturn : 93.91649627685547
Train_MaxReturn : 595.724609375
Train_MinReturn : 199.83261108398438
Train_AverageEpLen : 168.83333333333334
Train_EnvstepsSoFar : 311907
TimeSinceStart : 270.91811203956604
Training Loss : 3.256582498550415
Baseline Loss : 0.895759105682373
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 31 ************

Collecting data to be used for training...
At timestep:     10097 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 335.84796142578125
Eval_StdReturn : 85.93113708496094
Eval_MaxReturn : 546.951416015625
Eval_MinReturn

At timestep:     10023 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 251.05921936035156
Eval_StdReturn : 78.6911392211914
Eval_MaxReturn : 407.6509704589844
Eval_MinReturn : 137.8756561279297
Eval_AverageEpLen : 122.22222222222223
Train_AverageReturn : 264.01470947265625
Train_StdReturn : 69.41127014160156
Train_MaxReturn : 467.44525146484375
Train_MinReturn : 108.61053466796875
Train_AverageEpLen : 122.23170731707317
Train_EnvstepsSoFar : 422872
TimeSinceStart : 378.8840491771698
Training Loss : -2.1058969497680664
Baseline Loss : 0.7392148971557617
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 42 ************

Collecting data to be used for training...
At timestep:     10111 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 301.97735595703125
Eval_StdReturn : 87.8959732055664
Eval_MaxReturn : 475.96246337890625
Eva

At timestep:     10082 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 383.4271240234375
Eval_StdReturn : 59.120689392089844
Eval_MaxReturn : 459.44500732421875
Eval_MinReturn : 286.208740234375
Eval_AverageEpLen : 140.25
Train_AverageReturn : 414.4706115722656
Train_StdReturn : 55.874290466308594
Train_MaxReturn : 552.168701171875
Train_MinReturn : 280.39556884765625
Train_AverageEpLen : 150.47761194029852
Train_EnvstepsSoFar : 533770
TimeSinceStart : 484.80753922462463
Training Loss : 11.455231666564941
Baseline Loss : 0.6129136681556702
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 53 ************

Collecting data to be used for training...
At timestep:     10105 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 420.3740539550781
Eval_StdReturn : 45.831111907958984
Eval_MaxReturn : 527.7674560546875
Eval_MinRetur

At timestep:     10133 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 382.5570068359375
Eval_StdReturn : 104.75659942626953
Eval_MaxReturn : 520.971435546875
Eval_MinReturn : 129.6402130126953
Eval_AverageEpLen : 144.0
Train_AverageReturn : 403.8244934082031
Train_StdReturn : 150.5921173095703
Train_MaxReturn : 597.5501708984375
Train_MinReturn : 109.62350463867188
Train_AverageEpLen : 153.53030303030303
Train_EnvstepsSoFar : 644898
TimeSinceStart : 575.4390671253204
Training Loss : 9.079623222351074
Baseline Loss : 0.5533736944198608
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 64 ************

Collecting data to be used for training...
At timestep:     10081 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 388.75537109375
Eval_StdReturn : 119.59822845458984
Eval_MaxReturn : 523.644287109375
Eval_MinReturn : 133.74

At timestep:     10009 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 388.9479675292969
Eval_StdReturn : 35.930938720703125
Eval_MaxReturn : 470.18145751953125
Eval_MinReturn : 361.3738098144531
Eval_AverageEpLen : 147.0
Train_AverageReturn : 413.7059020996094
Train_StdReturn : 63.097251892089844
Train_MaxReturn : 598.4412231445312
Train_MinReturn : 359.166748046875
Train_AverageEpLen : 153.98461538461538
Train_EnvstepsSoFar : 755499
TimeSinceStart : 667.2580530643463
Training Loss : 1.2619308233261108
Baseline Loss : 0.33457469940185547
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 75 ************

Collecting data to be used for training...
At timestep:     10031 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 389.8663024902344
Eval_StdReturn : 21.39141082763672
Eval_MaxReturn : 430.45623779296875
Eval_MinReturn : 

At timestep:     10106 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 421.3179016113281
Eval_StdReturn : 125.04349517822266
Eval_MaxReturn : 522.5879516601562
Eval_MinReturn : 154.95077514648438
Eval_AverageEpLen : 168.33333333333334
Train_AverageReturn : 445.2340087890625
Train_StdReturn : 80.49298858642578
Train_MaxReturn : 563.525390625
Train_MinReturn : 164.30462646484375
Train_AverageEpLen : 168.43333333333334
Train_EnvstepsSoFar : 866342
TimeSinceStart : 765.9889929294586
Training Loss : 2.6736819744110107
Baseline Loss : 0.42514869570732117
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 86 ************

Collecting data to be used for training...
At timestep:     10137 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 396.35662841796875
Eval_StdReturn : 114.17223358154297
Eval_MaxReturn : 563.4468994140625
Eval_M

At timestep:     10138 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 438.61712646484375
Eval_StdReturn : 61.678165435791016
Eval_MaxReturn : 556.601806640625
Eval_MinReturn : 359.93548583984375
Eval_AverageEpLen : 159.14285714285714
Train_AverageReturn : 433.342041015625
Train_StdReturn : 76.7831039428711
Train_MaxReturn : 606.865234375
Train_MinReturn : 347.90740966796875
Train_AverageEpLen : 160.9206349206349
Train_EnvstepsSoFar : 977261
TimeSinceStart : 852.1909449100494
Training Loss : 1.7469687461853027
Baseline Loss : 0.2687358260154724
Initial_DataCollection_AverageReturn : 17.71956443786621
Done logging...




********** Iteration 97 ************

Collecting data to be used for training...
At timestep:     10067 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 475.5470886230469
Eval_StdReturn : 89.56056213378906
Eval_MaxReturn : 611.4907836914062
Eval

In [93]:
# Plot learning curves: visualize policy gradient results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/Hopper

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6010 (pid 58182), started 1:16:34 ago. (Use '!kill 58182' to kill it.)
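
If TensorBoard is not convenient, the scalar metrics printed in the training logs above can also be read straight from the captured text. The following is a minimal sketch, assuming each metric appears on its own line in the form `Name : value` as in the output above. `parse_metrics` is a hypothetical helper written for illustration; it is not part of the `deeprl` package. Lines cut off by the notebook's output truncation (for example `Eval_MinRetur`) are skipped, but a line whose numeric value was cut mid-number still looks like a valid number and cannot be detected this way, so treat the parsed values with some care.

```python
import re

# Hypothetical helper (not part of deeprl): matches lines such as
# "Eval_AverageReturn : 474.83" or "Training Loss : -2.48".
METRIC_LINE = re.compile(
    r"^(\w+(?: \w+)*) : (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)

def parse_metrics(log_text):
    """Return {metric name: [values in order of appearance]}."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            name, value = match.group(1), float(match.group(2))
            metrics.setdefault(name, []).append(value)
    return metrics

# Small sample copied from the log format above, including a truncated line.
sample = """Eval_AverageReturn : 474.8321533203125
Training Loss : 2.533851385116577
Eval_MinRetur
At timestep:     10058 / 10000
Eval_AverageReturn : 467.9366149902344
"""

parsed = parse_metrics(sample)
print(parsed["Eval_AverageReturn"])
```

The resulting lists can be passed to any plotting routine to compare runs, for example the per-iteration `Eval_AverageReturn` of each Hopper configuration.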