# Deep Q Networks
Previously, we trained policies using policy gradient algorithms, which directly estimate the gradient of the expected return with respect to the policy parameters and perform stochastic gradient ascent. In this section, we implement deep Q-learning, which does not explicitly optimize a policy but instead infers one from a learned Q-function that is trained via dynamic programming.

We assume discrete action spaces in this notebook so that we can easily select the action that maximizes the Q-function at a given state.

In [1]:
# As usual, a bit of setup
import os
import shutil
import time
import torch
import numpy as np

import deeprl.infrastructure.pytorch_util as ptu

from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import PG_Trainer
from deeprl.infrastructure.trainers import DQN_Trainer

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))

In [2]:
dqn_base_args_dict = dict(
    env_name = 'LunarLander-v3', #@param 
    exp_name = 'test_dqn', #@param
    save_params = False, #@param {type: "boolean"}
    
    ## PDF will tell you how to set ep_len
    ## and discount for each environment
    ep_len = 200, #@param {type: "integer"}
    # discount = 0.95, #@param {type: "number"}

    # Training
    num_agent_train_steps_per_iter = 1, #@param {type: "integer"}
    num_critic_updates_per_agent_update = 1, #@param {type: "integer"}
  
    #@markdown Q-learning parameters
    double_q = False, #@param {type: "boolean"}

    # batches & buffers
    batch_size = 32, #@param {type: "integer"}
    batch_size_initial=1000,

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1000, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

## DQN updates
Recall that in Q-learning we attempt to solve for the optimal state-action values (which we refer to as Q-values $Q(s,a)$) by finding solutions to the Bellman equation given by
$$Q(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim p(s'\vert s,a)}[\max_{a'}Q(s', a')].$$

Regular tabular Q-learning would take sample transitions $(s, a, r, s')$ and perform updates according to
$$Q(s,a) \leftarrow Q(s,a) + \alpha (r(s,a) + \gamma \max_{a'} Q(s', a') - Q(s,a)),$$
where $\alpha$ is a stepsize parameter.
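
As a minimal sketch of this tabular update, assuming states and actions are hashable and the table is a dictionary keyed by `(state, action)`. The function name `q_learning_update` and the toy states `"s0"` and `"s1"` are hypothetical and not part of the assignment code.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    td_error = target - Q[(s, a)]
    # Move Q(s, a) a fraction alpha of the way toward the target.
    Q[(s, a)] += alpha * td_error
    return td_error


# Toy example (hypothetical): two actions, all Q-values default to zero.
Q = defaultdict(float)
actions = [0, 1]
Q[("s1", 1)] = 2.0  # pretend action 1 already looks good in state s1

td = q_learning_update(Q, s="s0", a=0, r=1.0, s_next="s1", actions=actions,
                       alpha=0.5, gamma=0.9)
print(td, Q[("s0", 0)])  # TD error is about 2.8, updated Q(s0, 0) is about 1.4
```

Here the target is $1.0 + 0.9 \cdot 2.0 = 2.8$, so with $\alpha = 0.5$ the entry $Q(s_0, 0)$ moves halfway from 0 to 2.8.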

This can be interpreted as updating $Q(s,a)$ by taking one gradient step on a squared Bellman error objective
$$(r(s, a) +\gamma \max_{a'} \tilde Q(s', a') - Q(s,a))^2,$$
where $\tilde Q$ is a copy of $Q$, but is not differentiated when taking the gradient step.
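
To make the correspondence explicit, here is a short derivation. The symbol $y$ is introduced here only as shorthand for the fixed target; it is not notation from the assignment.

$$y = r(s,a) + \gamma \max_{a'} \tilde Q(s', a'), \qquad \frac{\partial}{\partial Q(s,a)} \big(y - Q(s,a)\big)^2 = -2\,\big(y - Q(s,a)\big).$$

Because $\tilde Q$ is not differentiated, $y$ is a constant in the derivative above. One gradient descent step with stepsize $\alpha/2$ then gives

$$Q(s,a) \leftarrow Q(s,a) - \frac{\alpha}{2} \cdot \Big(-2\,\big(y - Q(s,a)\big)\Big) = Q(s,a) + \alpha\,\big(r(s,a) + \gamma \max_{a'} \tilde Q(s', a') - Q(s,a)\big),$$

which matches the tabular update term by term.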

Adapting this update to the setting where a neural network with parameters $\theta$ approximates $Q(s,a)$, we train $\theta$ by solving
$$\min_{\theta} \mathbb{E}_{s, a, s' \sim D} [L(Q_{\theta}(s,a), r(s,a) + \gamma \max_{a'} Q_{\tilde \theta}(s', a'))]$$
where $D$ is our replay buffer containing past transitions we've experienced, $L$ is some loss function capturing how far the predicted Q-values are from the target values, and $\tilde \theta$ are the target Q function parameters, which are usually a delayed copy of $\theta$ for stability reasons.
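
The objective above can be sketched in NumPy on a toy batch. This is an illustration under stated assumptions, not the code expected in `critics/dqn_critic.py`: the helper names `dqn_targets` and `dqn_loss` are hypothetical, $L$ is taken to be mean squared error, and the bootstrap term is masked at terminal transitions (a common convention that the formula above leaves implicit).

```python
import numpy as np


def dqn_targets(rewards, next_q_target, terminals, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a'), without bootstrapping at terminals."""
    max_next_q = next_q_target.max(axis=1)
    # Assumption: terminals is 1.0 where the episode ended, 0.0 otherwise.
    return rewards + gamma * (1.0 - terminals) * max_next_q


def dqn_loss(q_values, actions, targets):
    """Mean squared error between Q_theta(s, a) for the taken actions and fixed targets."""
    q_taken = q_values[np.arange(len(actions)), actions]
    return np.mean((q_taken - targets) ** 2)


# Toy batch of N=2 transitions with 3 actions (made-up numbers).
q_values = np.array([[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]])       # Q_theta(s, .)
next_q_target = np.array([[0.0, 4.0, 1.0], [2.0, 2.0, 10.0]])  # Q_theta_tilde(s', .)
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
terminals = np.array([0.0, 1.0])  # the second transition ends its episode

targets = dqn_targets(rewards, next_q_target, terminals, gamma=0.5)
loss = dqn_loss(q_values, actions, targets)
print(targets, loss)  # targets are [3.0, -1.0]; loss is 1.625
```

In a real implementation the targets would be computed from the target network without tracking gradients, so that only $Q_{\theta}(s,a)$ is differentiated.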

Note that our previous policy gradient algorithms were _on-policy_ algorithms: they updated the policy using only data collected by the most recent policy and discarded that data after using it just once. In contrast, DQN uses _off-policy_ updates by sampling data from all past interactions, allowing for data reuse over time.
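
The data reuse described above can be sketched with a small replay buffer. This is a minimal illustration using only the standard library; the `ReplayBuffer` class here is hypothetical and is not the buffer used by the assignment infrastructure.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement over everything currently stored,
        # regardless of which past policy collected it.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add((t, 0, 0.0, t + 1, False))  # dummy transitions indexed by t

batch = buffer.sample(batch_size=2)
print(len(buffer), [tr[0] for tr in batch])
```

With a capacity of 3, only the three most recent transitions remain after five insertions, and each sampled batch can mix transitions from different points in training.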

Fill out the missing components for the basic Q-learning update in <code>critics/dqn_critic.py</code> (not including the double_q section).

In [84]:
#### Test DQN updates
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
critic = dqnagent.critic

ob_dim = critic.ob_dim
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

first_weight_before = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight before update (first row)", first_weight_before[0])


loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
expected_loss = 0.9408444
loss_error = rel_error(loss, expected_loss)
print("Initial loss", loss)
print("Initial Loss Error", loss_error, "should be on the order of 1e-6 or lower")

for i in range(4):
    loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
    print(loss)

expected_loss = 0.7889254
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")


first_weight_after = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight after update (first row)", first_weight_after.shape)
# Test DQN gradient
print(first_weight_after[0])
weight_change_partial = first_weight_after[0] - first_weight_before[0]
expected_weight_change = np.array([-0.00491365, -0.00500049, -0.00499149, -0.00491229, -0.00490125,  0.00489534,
 -0.00282785, -0.00171614,  0.00485604])


updated_weight_error = rel_error(weight_change_partial, expected_weight_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3
Weight before update (first row) [ 0.07646337 -0.07932475  0.09140956 -0.01702595  0.14239588  0.07935759
 -0.03831157 -0.2694876   0.07610479]
Initial loss 0.9408444
Initial Loss Error 8.831612855468697e-09 should be on the order of 1e-6 or lower
0.9007309
0.8621322
0.82467556
0.7889254
Loss Error 5.904878045285727e-09 should be on the order of 1e-6 or lower
Weight after update (first row) (64, 9)
[ 0.07154972 -0.08432525  0.08641808 -0.02193824  0.13749464  0.08425293
 -0.04113942 -0.27120373  0.08096083]
Weight Update Error 8.937585787696078e-07 should be on the order of 1e-6 or lower




Implement the missing components in the get_action method of <code>policies/argmax_policy.py</code> and the step_env method in <code>agents/dqn_agent.py</code> to allow our agent to interact with the environment.
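
The idea behind an argmax policy can be sketched as follows: given Q-values for each action, pick the action with the largest value. The function `greedy_actions` is a hypothetical helper operating on a precomputed array of Q-values; the actual `get_action` method would first obtain those Q-values from the critic.

```python
import numpy as np


def greedy_actions(q_values):
    """Return argmax_a Q(s, a) for each row of a (batch, num_actions) array."""
    q_values = np.atleast_2d(q_values)  # also accept Q-values for a single observation
    return q_values.argmax(axis=1)


# Made-up Q-values for two observations and four discrete actions.
q_values = np.array([[0.1, 0.9, -0.3, 0.0],
                     [1.2, 0.4, 0.8, 1.1]])
print(greedy_actions(q_values))  # [1 0]
```

This is only tractable because the action space is discrete: the maximization is a lookup over a finite set of actions.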

In [45]:
### Test argmax policy
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
actor = dqnagent.actor

ob_dim = dqnagent.critic.ob_dim
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))

actions = actor.get_action(obs)
correct_actions = np.array([1, 0, 1, 0, 1])

assert np.all(correct_actions == actions)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3




We can now test our DQN implementation on the LunarLander environment. These experiments can take a while to run (over 10 minutes per seed) on CPU, so start early.

In [56]:
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False

# Delete all previous logs
remove_folder('logs/dqn/{}/vanilla_dqn'.format(env_str))

for seed in range(3):
    print("Running DQN experiment with seed", seed)
    dqn_args['seed'] = seed
    dqn_args['logdir'] = 'logs/dqn/{}/vanilla_dqn/seed{}'.format(env_str, seed)
    dqntrainer = DQN_Trainer(dqn_args)
    dqntrainer.run_training_loop()

Clearing old results at logs/dqn/LunarLander/vanilla_dqn
Running DQN experiment with seed 0
########################
logging outputs to  logs/dqn/LunarLander/vanilla_dqn/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3


********** Iteration 0 ************

Training agent...

Beginning logging procedure...
Timestep 1
mean reward (100 episodes) nan
best mean reward -inf
running time 0.001006
Train_EnvstepsSoFar : 1
TimeSinceStart : 0.0010058879852294922
Done logging...








********** Iteration 1000 ************

Training agent...

Beginning logging procedure...
Timestep 1001
mean reward (100 episodes) -361.458927
best mean reward -inf
running time 0.663646
Train_EnvstepsSoFar : 1001
Train_AverageReturn : -361.4589273606985
TimeSinceStart : 0.6636459827423096
Done logging...




********** Iteration 2000 ************

Training agent...

Beginning logging procedure...
Timestep 2001
mean reward (100 episodes) -310.004133
best mean reward -inf
running time 4.109574
Train_EnvstepsSoFar : 2001
Train_AverageReturn : -310.0041333667857
TimeSinceStart : 4.109573841094971
Training Loss : 0.5019471645355225
Done logging...




********** Iteration 3000 ************

Training agent...

Beginning logging procedure...
Timestep 3001
mean reward (100 episodes) -310.660305
best mean reward -inf
running time 6.949341
Train_EnvstepsSoFar : 3001
Train_AverageReturn : -310.660304801111
TimeSinceStart : 6.9493408203125
Training Loss : 0.5492397546768188
Done logging...






Train_EnvstepsSoFar : 24001
Train_AverageReturn : -183.22477033921928
Train_BestReturn : -183.22477033921928
TimeSinceStart : 74.11091876029968
Training Loss : 0.6776131987571716
Done logging...




********** Iteration 25000 ************

Training agent...

Beginning logging procedure...
Timestep 25001
mean reward (100 episodes) -181.724345
best mean reward -181.724345
running time 77.561211
Train_EnvstepsSoFar : 25001
Train_AverageReturn : -181.7243454470348
Train_BestReturn : -181.7243454470348
TimeSinceStart : 77.5612108707428
Training Loss : 0.7930812835693359
Done logging...




********** Iteration 26000 ************

Training agent...

Beginning logging procedure...
Timestep 26001
mean reward (100 episodes) -180.220301
best mean reward -180.220301
running time 81.028885
Train_EnvstepsSoFar : 26001
Train_AverageReturn : -180.22030074503732
Train_BestReturn : -180.22030074503732
TimeSinceStart : 81.02888488769531
Training Loss : 1.0074446201324463
Done logging...




...

Train_EnvstepsSoFar : 45001
Train_AverageReturn : -149.8981531322144
Train_BestReturn : -149.46477256544978
TimeSinceStart : 153.97464394569397
Training Loss : 0.2759976387023926
Done logging...




********** Iteration 46000 ************

Training agent...

Beginning logging procedure...
Timestep 46001
mean reward (100 episodes) -150.156117
best mean reward -149.464773
running time 157.800271
Train_EnvstepsSoFar : 46001
Train_AverageReturn : -150.15611669220007
Train_BestReturn : -149.46477256544978
TimeSinceStart : 157.80027079582214
Training Loss : 4.0942888259887695
Done logging...




********** Iteration 47000 ************

Training agent...

Beginning logging procedure...
Timestep 47001
mean reward (100 episodes) -150.335903
best mean reward -149.464773
running time 162.017846
Train_EnvstepsSoFar : 47001
Train_AverageReturn : -150.33590301807683
Train_BestReturn : -149.46477256544978
TimeSinceStart : 162.01784563064575
Training Loss : 0.8969570398330688
Done logging...




...

Train_EnvstepsSoFar : 66001
Train_AverageReturn : -125.45751014808347
Train_BestReturn : -125.45751014808347
TimeSinceStart : 232.91591572761536
Training Loss : 0.572282075881958
Done logging...




********** Iteration 67000 ************

Training agent...

Beginning logging procedure...
Timestep 67001
mean reward (100 episodes) -126.984340
best mean reward -125.457510
running time 236.038064
Train_EnvstepsSoFar : 67001
Train_AverageReturn : -126.9843401307386
Train_BestReturn : -125.45751014808347
TimeSinceStart : 236.03806376457214
Training Loss : 0.3295862674713135
Done logging...




********** Iteration 68000 ************

Training agent...

Beginning logging procedure...
Timestep 68001
mean reward (100 episodes) -127.220023
best mean reward -125.457510
running time 239.619195
Train_EnvstepsSoFar : 68001
Train_AverageReturn : -127.2200230149203
Train_BestReturn : -125.45751014808347
TimeSinceStart : 239.61919474601746
Training Loss : 0.2092868685722351
Done logging...




...

Train_EnvstepsSoFar : 87001
Train_AverageReturn : -102.07491148447065
Train_BestReturn : -101.31235981569705
TimeSinceStart : 310.5790207386017
Training Loss : 0.19763830304145813
Done logging...




********** Iteration 88000 ************

Training agent...

Beginning logging procedure...
Timestep 88001
mean reward (100 episodes) -102.534383
best mean reward -101.312360
running time 314.243753
Train_EnvstepsSoFar : 88001
Train_AverageReturn : -102.53438279414128
Train_BestReturn : -101.31235981569705
TimeSinceStart : 314.2437529563904
Training Loss : 0.2599029839038849
Done logging...




********** Iteration 89000 ************

Training agent...

Beginning logging procedure...
Timestep 89001
mean reward (100 episodes) -99.595139
best mean reward -99.595139
running time 317.293139
Train_EnvstepsSoFar : 89001
Train_AverageReturn : -99.59513888166853
Train_BestReturn : -99.59513888166853
TimeSinceStart : 317.2931389808655
Training Loss : 0.40164095163345337
Done logging...




...

Train_EnvstepsSoFar : 108001
Train_AverageReturn : -104.00898428013339
Train_BestReturn : -99.59513888166853
TimeSinceStart : 393.9530539512634
Training Loss : 0.20738233625888824
Done logging...




********** Iteration 109000 ************

Training agent...

Beginning logging procedure...
Timestep 109001
mean reward (100 episodes) -99.837549
best mean reward -99.595139
running time 397.447776
Train_EnvstepsSoFar : 109001
Train_AverageReturn : -99.83754908079607
Train_BestReturn : -99.59513888166853
TimeSinceStart : 397.44777607917786
Training Loss : 0.24212327599525452
Done logging...




********** Iteration 110000 ************

Training agent...

Beginning logging procedure...
Timestep 110001
mean reward (100 episodes) -99.677394
best mean reward -99.595139
running time 401.700662
Train_EnvstepsSoFar : 110001
Train_AverageReturn : -99.67739447152532
Train_BestReturn : -99.59513888166853
TimeSinceStart : 401.7006616592407
Training Loss : 0.548340916633606
Done logging...




...

Train_EnvstepsSoFar : 129001
Train_AverageReturn : -70.82738196736045
Train_BestReturn : -70.82738196736045
TimeSinceStart : 476.67678689956665
Training Loss : 0.12682610750198364
Done logging...




********** Iteration 130000 ************

Training agent...

Beginning logging procedure...
Timestep 130001
mean reward (100 episodes) -65.028424
best mean reward -65.028424
running time 480.154533
Train_EnvstepsSoFar : 130001
Train_AverageReturn : -65.02842392143246
Train_BestReturn : -65.02842392143246
TimeSinceStart : 480.1545329093933
Training Loss : 0.2731015086174011
Done logging...




********** Iteration 131000 ************

Training agent...

Beginning logging procedure...
Timestep 131001
mean reward (100 episodes) -64.372451
best mean reward -64.372451
running time 483.653093
Train_EnvstepsSoFar : 131001
Train_AverageReturn : -64.3724511673571
Train_BestReturn : -64.3724511673571
TimeSinceStart : 483.65309286117554
Training Loss : 0.1849430501461029
Done logging...




...

Train_EnvstepsSoFar : 150001
Train_AverageReturn : -9.126679408708032
Train_BestReturn : -9.126679408708032
TimeSinceStart : 553.61270403862
Training Loss : 0.12610465288162231
Done logging...




********** Iteration 151000 ************

Training agent...

Beginning logging procedure...
Timestep 151001
mean reward (100 episodes) -9.061874
best mean reward -9.061874
running time 558.088909
Train_EnvstepsSoFar : 151001
Train_AverageReturn : -9.061874021478602
Train_BestReturn : -9.061874021478602
TimeSinceStart : 558.0889086723328
Training Loss : 0.08390339463949203
Done logging...




********** Iteration 152000 ************

Training agent...

Beginning logging procedure...
Timestep 152001
mean reward (100 episodes) -2.446398
best mean reward -2.446398
running time 560.909734
Train_EnvstepsSoFar : 152001
Train_AverageReturn : -2.4463976016089406
Train_BestReturn : -2.4463976016089406
TimeSinceStart : 560.9097337722778
Training Loss : 0.10231329500675201
Done logging...




...

Train_EnvstepsSoFar : 171001
Train_AverageReturn : 41.75281807944117
Train_BestReturn : 42.37079933369172
TimeSinceStart : 627.8439569473267
Training Loss : 0.10063111782073975
Done logging...




********** Iteration 172000 ************

Training agent...

Beginning logging procedure...
Timestep 172001
mean reward (100 episodes) 43.416134
best mean reward 43.416134
running time 631.759931
Train_EnvstepsSoFar : 172001
Train_AverageReturn : 43.41613395609527
Train_BestReturn : 43.41613395609527
TimeSinceStart : 631.7599308490753
Training Loss : 1.8448090553283691
Done logging...




********** Iteration 173000 ************

Training agent...

Beginning logging procedure...
Timestep 173001
mean reward (100 episodes) 43.037295
best mean reward 43.416134
running time 635.126220
Train_EnvstepsSoFar : 173001
Train_AverageReturn : 43.03729457140954
Train_BestReturn : 43.41613395609527
TimeSinceStart : 635.1262197494507
Training Loss : 0.5921532511711121
Done logging...




...

Train_EnvstepsSoFar : 192001
Train_AverageReturn : 71.9812551379289
Train_BestReturn : 71.9812551379289
TimeSinceStart : 695.765344619751
Training Loss : 0.7016628384590149
Done logging...




********** Iteration 193000 ************

Training agent...

Beginning logging procedure...
Timestep 193001
mean reward (100 episodes) 72.132247
best mean reward 72.132247
running time 699.192281
Train_EnvstepsSoFar : 193001
Train_AverageReturn : 72.13224708498277
Train_BestReturn : 72.13224708498277
TimeSinceStart : 699.1922807693481
Training Loss : 0.46577930450439453
Done logging...




********** Iteration 194000 ************

Training agent...

Beginning logging procedure...
Timestep 194001
mean reward (100 episodes) 72.892416
best mean reward 72.892416
running time 704.190026
Train_EnvstepsSoFar : 194001
Train_AverageReturn : 72.89241568214784
Train_BestReturn : 72.89241568214784
TimeSinceStart : 704.1900260448456
Training Loss : 1.9745227098464966
Done logging...




...

Train_EnvstepsSoFar : 213001
Train_AverageReturn : 79.18565247276112
Train_BestReturn : 79.18565247276112
TimeSinceStart : 767.2984919548035
Training Loss : 0.12912341952323914
Done logging...




********** Iteration 214000 ************

Training agent...

Beginning logging procedure...
Timestep 214001
mean reward (100 episodes) 79.491542
best mean reward 79.491542
running time 771.334680
Train_EnvstepsSoFar : 214001
Train_AverageReturn : 79.49154193806017
Train_BestReturn : 79.49154193806017
TimeSinceStart : 771.3346798419952
Training Loss : 0.14437708258628845
Done logging...




********** Iteration 215000 ************

Training agent...

Beginning logging procedure...
Timestep 215001
mean reward (100 episodes) 83.471761
best mean reward 83.471761
running time 774.711917
Train_EnvstepsSoFar : 215001
Train_AverageReturn : 83.4717612795487
Train_BestReturn : 83.4717612795487
TimeSinceStart : 774.7119166851044
Training Loss : 1.50114107131958
Done logging...




...

Train_EnvstepsSoFar : 234001
Train_AverageReturn : 68.010531633467
Train_BestReturn : 89.01401176044608
TimeSinceStart : 860.758975982666
Training Loss : 0.2807939946651459
Done logging...




********** Iteration 235000 ************

Training agent...

Beginning logging procedure...
Timestep 235001
mean reward (100 episodes) 66.949310
best mean reward 89.014012
running time 866.497810
Train_EnvstepsSoFar : 235001
Train_AverageReturn : 66.94930971546906
Train_BestReturn : 89.01401176044608
TimeSinceStart : 866.4978098869324
Training Loss : 0.08750791847705841
Done logging...




********** Iteration 236000 ************

Training agent...

Beginning logging procedure...
Timestep 236001
mean reward (100 episodes) 65.609635
best mean reward 89.014012
running time 872.286431
Train_EnvstepsSoFar : 236001
Train_AverageReturn : 65.6096348828747
Train_BestReturn : 89.01401176044608
TimeSinceStart : 872.2864308357239
Training Loss : 0.7335444092750549
Done logging...




...

Train_EnvstepsSoFar : 255001
Train_AverageReturn : 70.02545064606676
Train_BestReturn : 89.01401176044608
TimeSinceStart : 950.646158695221
Training Loss : 3.140202760696411
Done logging...




********** Iteration 256000 ************

Training agent...

Beginning logging procedure...
Timestep 256001
mean reward (100 episodes) 68.873311
best mean reward 89.014012
running time 953.694918
Train_EnvstepsSoFar : 256001
Train_AverageReturn : 68.87331068885314
Train_BestReturn : 89.01401176044608
TimeSinceStart : 953.6949179172516
Training Loss : 0.42642661929130554
Done logging...




********** Iteration 257000 ************

Training agent...

Beginning logging procedure...
Timestep 257001
mean reward (100 episodes) 72.093990
best mean reward 89.014012
running time 958.141695
Train_EnvstepsSoFar : 257001
Train_AverageReturn : 72.09398961744589
Train_BestReturn : 89.01401176044608
TimeSinceStart : 958.141695022583
Training Loss : 0.3066883087158203
Done logging...




...

Train_EnvstepsSoFar : 276001
Train_AverageReturn : 72.36063960624897
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1029.3634667396545
Training Loss : 0.5009975433349609
Done logging...




********** Iteration 277000 ************

Training agent...

Beginning logging procedure...
Timestep 277001
mean reward (100 episodes) 67.858672
best mean reward 89.014012
running time 1034.199267
Train_EnvstepsSoFar : 277001
Train_AverageReturn : 67.85867156082928
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1034.1992666721344
Training Loss : 0.2840096950531006
Done logging...




********** Iteration 278000 ************

Training agent...

Beginning logging procedure...
Timestep 278001
mean reward (100 episodes) 64.871323
best mean reward 89.014012
running time 1041.999840
Train_EnvstepsSoFar : 278001
Train_AverageReturn : 64.87132295244066
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1041.9998400211334
Training Loss : 0.16190633177757263
Done logging...




...

Train_EnvstepsSoFar : 297001
Train_AverageReturn : 57.59407269021192
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1104.3640027046204
Training Loss : 0.32857590913772583
Done logging...




********** Iteration 298000 ************

Training agent...

Beginning logging procedure...
Timestep 298001
mean reward (100 episodes) 68.529858
best mean reward 89.014012
running time 1107.052532
Train_EnvstepsSoFar : 298001
Train_AverageReturn : 68.52985826948158
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1107.0525319576263
Training Loss : 0.13248154520988464
Done logging...




********** Iteration 299000 ************

Training agent...

Beginning logging procedure...
Timestep 299001
mean reward (100 episodes) 65.330246
best mean reward 89.014012
running time 1109.789802
Train_EnvstepsSoFar : 299001
Train_AverageReturn : 65.33024554141468
Train_BestReturn : 89.01401176044608
TimeSinceStart : 1109.7898018360138
Training Loss : 0.07267972826957703
Done logging...




...

Train_EnvstepsSoFar : 318001
Train_AverageReturn : 105.62399265716418
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1163.4593946933746
Training Loss : 0.2225923091173172
Done logging...




********** Iteration 319000 ************

Training agent...

Beginning logging procedure...
Timestep 319001
mean reward (100 episodes) 103.940965
best mean reward 112.713147
running time 1166.264719
Train_EnvstepsSoFar : 319001
Train_AverageReturn : 103.9409646337599
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1166.2647187709808
Training Loss : 7.183232307434082
Done logging...




********** Iteration 320000 ************

Training agent...

Beginning logging procedure...
Timestep 320001
mean reward (100 episodes) 101.100486
best mean reward 112.713147
running time 1168.963505
Train_EnvstepsSoFar : 320001
Train_AverageReturn : 101.10048613548584
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1168.9635047912598
Training Loss : 2.81533145904541
Done logging...




...

Train_EnvstepsSoFar : 339001
Train_AverageReturn : 88.14785415331022
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1221.2756669521332
Training Loss : 0.20074470341205597
Done logging...




********** Iteration 340000 ************

Training agent...

Beginning logging procedure...
Timestep 340001
mean reward (100 episodes) 79.587139
best mean reward 112.713147
running time 1224.263263
Train_EnvstepsSoFar : 340001
Train_AverageReturn : 79.58713893242697
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1224.2632629871368
Training Loss : 0.23371581733226776
Done logging...




********** Iteration 341000 ************

Training agent...

Beginning logging procedure...
Timestep 341001
mean reward (100 episodes) 78.476098
best mean reward 112.713147
running time 1227.150803
Train_EnvstepsSoFar : 341001
Train_AverageReturn : 78.47609804226025
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1227.1508028507233
Training Loss : 0.4239997863769531
Done logging...




...

Train_EnvstepsSoFar : 360001
Train_AverageReturn : 34.904932591577605
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1278.8098998069763
Training Loss : 1.3232483863830566
Done logging...




********** Iteration 361000 ************

Training agent...

Beginning logging procedure...
Timestep 361001
mean reward (100 episodes) 39.086737
best mean reward 112.713147
running time 1281.743326
Train_EnvstepsSoFar : 361001
Train_AverageReturn : 39.086737439560146
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1281.7433259487152
Training Loss : 7.247232913970947
Done logging...




********** Iteration 362000 ************

Training agent...

Beginning logging procedure...
Timestep 362001
mean reward (100 episodes) 35.436764
best mean reward 112.713147
running time 1285.384439
Train_EnvstepsSoFar : 362001
Train_AverageReturn : 35.43676381971909
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1285.384438753128
Training Loss : 1.2518861293792725
Done logging...




...

Train_EnvstepsSoFar : 381001
Train_AverageReturn : 88.42453164274735
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1342.0668649673462
Training Loss : 3.601830244064331
Done logging...




********** Iteration 382000 ************

Training agent...

Beginning logging procedure...
Timestep 382001
mean reward (100 episodes) 88.895625
best mean reward 112.713147
running time 1345.961121
Train_EnvstepsSoFar : 382001
Train_AverageReturn : 88.89562459979784
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1345.9611208438873
Training Loss : 0.3802071809768677
Done logging...




********** Iteration 383000 ************

Training agent...

Beginning logging procedure...
Timestep 383001
mean reward (100 episodes) 90.887418
best mean reward 112.713147
running time 1349.791607
Train_EnvstepsSoFar : 383001
Train_AverageReturn : 90.88741772503803
Train_BestReturn : 112.71314682673022
TimeSinceStart : 1349.7916066646576
Training Loss : 2.4383816719055176
Done logging...




...

Train_EnvstepsSoFar : 402001
Train_AverageReturn : 105.39448679018633
Train_BestReturn : 127.9627156402933
TimeSinceStart : 1405.41073679924
Training Loss : 1.843034267425537
Done logging...




********** Iteration 403000 ************

Training agent...

Beginning logging procedure...
Timestep 403001
mean reward (100 episodes) 105.721719
best mean reward 127.962716
running time 1408.320566
Train_EnvstepsSoFar : 403001
Train_AverageReturn : 105.72171945828221
Train_BestReturn : 127.9627156402933
TimeSinceStart : 1408.320565700531
Training Loss : 0.8238515853881836
Done logging...




********** Iteration 404000 ************

Training agent...

Beginning logging procedure...
Timestep 404001
mean reward (100 episodes) 105.263671
best mean reward 127.962716
running time 1411.509156
Train_EnvstepsSoFar : 404001
Train_AverageReturn : 105.26367118277896
Train_BestReturn : 127.9627156402933
TimeSinceStart : 1411.5091557502747
Training Loss : 0.23168708384037018
Done logging...




...

Train_EnvstepsSoFar : 423001
Train_AverageReturn : 128.2322932155743
Train_BestReturn : 132.89345823039
TimeSinceStart : 1465.240294933319
Training Loss : 0.3722228407859802
Done logging...




********** Iteration 424000 ************

Training agent...

Beginning logging procedure...
Timestep 424001
mean reward (100 episodes) 128.321762
best mean reward 132.893458
running time 1468.685095
Train_EnvstepsSoFar : 424001
Train_AverageReturn : 128.32176211996867
Train_BestReturn : 132.89345823039
TimeSinceStart : 1468.685094833374
Training Loss : 10.197915077209473
Done logging...




********** Iteration 425000 ************

Training agent...

Beginning logging procedure...
Timestep 425001
mean reward (100 episodes) 137.702542
best mean reward 137.702542
running time 1471.393197
Train_EnvstepsSoFar : 425001
Train_AverageReturn : 137.7025422102559
Train_BestReturn : 137.7025422102559
TimeSinceStart : 1471.3931968212128
Training Loss : 0.19317761063575745
Done logging...




...

Train_EnvstepsSoFar : 444001
Train_AverageReturn : 89.19364956887993
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1522.4783596992493
Training Loss : 0.24663697183132172
Done logging...




********** Iteration 445000 ************

Training agent...

Beginning logging procedure...
Timestep 445001
mean reward (100 episodes) 77.484938
best mean reward 158.981212
running time 1525.146244
Train_EnvstepsSoFar : 445001
Train_AverageReturn : 77.4849383083648
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1525.1462438106537
Training Loss : 0.1818065196275711
Done logging...




********** Iteration 446000 ************

Training agent...

Beginning logging procedure...
Timestep 446001
mean reward (100 episodes) 73.624598
best mean reward 158.981212
running time 1527.869932
Train_EnvstepsSoFar : 446001
Train_AverageReturn : 73.62459800170214
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1527.8699316978455
Training Loss : 0.2838208079338074
Done logging...




...

Train_EnvstepsSoFar : 465001
Train_AverageReturn : 66.03885417814566
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1580.3389739990234
Training Loss : 0.4711619019508362
Done logging...




********** Iteration 466000 ************

Training agent...

Beginning logging procedure...
Timestep 466001
mean reward (100 episodes) 50.793055
best mean reward 158.981212
running time 1582.932277
Train_EnvstepsSoFar : 466001
Train_AverageReturn : 50.79305548195736
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1582.9322769641876
Training Loss : 0.24079059064388275
Done logging...




********** Iteration 467000 ************

Training agent...

Beginning logging procedure...
Timestep 467001
mean reward (100 episodes) 50.706680
best mean reward 158.981212
running time 1585.597980
Train_EnvstepsSoFar : 467001
Train_AverageReturn : 50.706680242039404
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1585.5979797840118
Training Loss : 2.2208917140960693
Done logging...





Train_EnvstepsSoFar : 486001
Train_AverageReturn : 42.766572866292705
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1636.3136088848114
Training Loss : 0.5933546423912048
Done logging...




********** Iteration 487000 ************

Training agent...

Beginning logging procedure...
Timestep 487001
mean reward (100 episodes) 30.219637
best mean reward 158.981212
running time 1638.956865
Train_EnvstepsSoFar : 487001
Train_AverageReturn : 30.2196366081615
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1638.9568648338318
Training Loss : 0.21000882983207703
Done logging...




********** Iteration 488000 ************

Training agent...

Beginning logging procedure...
Timestep 488001
mean reward (100 episodes) 28.141394
best mean reward 158.981212
running time 1641.570862
Train_EnvstepsSoFar : 488001
Train_AverageReturn : 28.1413941950612
Train_BestReturn : 158.98121240261554
TimeSinceStart : 1641.5708618164062
Training Loss : 0.18153449892997742
Done logging...







********** Iteration 8000 ************

Training agent...

Beginning logging procedure...
Timestep 8001
mean reward (100 episodes) -263.580618
best mean reward -inf
running time 18.183663
Train_EnvstepsSoFar : 8001
Train_AverageReturn : -263.5806179902761
TimeSinceStart : 18.18366289138794
Training Loss : 0.3545996844768524
Done logging...




********** Iteration 9000 ************

Training agent...

Beginning logging procedure...
Timestep 9001
mean reward (100 episodes) -258.909854
best mean reward -inf
running time 21.049743
Train_EnvstepsSoFar : 9001
Train_AverageReturn : -258.90985427004284
TimeSinceStart : 21.049742698669434
Training Loss : 0.5896323919296265
Done logging...




********** Iteration 10000 ************

Training agent...

Beginning logging procedure...
Timestep 10001
mean reward (100 episodes) -252.961994
best mean reward -inf
running time 24.903784
Train_EnvstepsSoFar : 10001
Train_AverageReturn : -252.96199442422886
TimeSinceStart : 24.903783798217773



********** Iteration 30000 ************

Training agent...

Beginning logging procedure...
Timestep 30001
mean reward (100 episodes) -211.347909
best mean reward -211.347909
running time 95.439081
Train_EnvstepsSoFar : 30001
Train_AverageReturn : -211.34790948624567
Train_BestReturn : -211.34790948624567
TimeSinceStart : 95.43908095359802
Training Loss : 1.0748368501663208
Done logging...




********** Iteration 31000 ************

Training agent...

Beginning logging procedure...
Timestep 31001
mean reward (100 episodes) -208.592734
best mean reward -208.592734
running time 99.067875
Train_EnvstepsSoFar : 31001
Train_AverageReturn : -208.59273392055758
Train_BestReturn : -208.59273392055758
TimeSinceStart : 99.06787467002869
Training Loss : 1.4594875574111938
Done logging...




********** Iteration 32000 ************

Training agent...

Beginning logging procedure...
Timestep 32001
mean reward (100 episodes) -205.277088
best mean reward -205.277088
running time 103.035951



********** Iteration 51000 ************

Training agent...

Beginning logging procedure...
Timestep 51001
mean reward (100 episodes) -158.954610
best mean reward -158.954610
running time 176.741582
Train_EnvstepsSoFar : 51001
Train_AverageReturn : -158.95461022685066
Train_BestReturn : -158.95461022685066
TimeSinceStart : 176.7415816783905
Training Loss : 0.389000803232193
Done logging...




********** Iteration 52000 ************

Training agent...

Beginning logging procedure...
Timestep 52001
mean reward (100 episodes) -156.717674
best mean reward -156.717674
running time 180.407864
Train_EnvstepsSoFar : 52001
Train_AverageReturn : -156.7176737868372
Train_BestReturn : -156.7176737868372
TimeSinceStart : 180.40786385536194
Training Loss : 1.032907485961914
Done logging...




********** Iteration 53000 ************

Training agent...

Beginning logging procedure...
Timestep 53001
mean reward (100 episodes) -155.986086
best mean reward -155.986086
running time 183.753349



********** Iteration 72000 ************

Training agent...

Beginning logging procedure...
Timestep 72001
mean reward (100 episodes) -115.261200
best mean reward -115.261200
running time 258.122749
Train_EnvstepsSoFar : 72001
Train_AverageReturn : -115.2611995377014
Train_BestReturn : -115.2611995377014
TimeSinceStart : 258.1227488517761
Training Loss : 0.7046878337860107
Done logging...




********** Iteration 73000 ************

Training agent...

Beginning logging procedure...
Timestep 73001
mean reward (100 episodes) -114.326683
best mean reward -114.326683
running time 261.961974
Train_EnvstepsSoFar : 73001
Train_AverageReturn : -114.32668318130665
Train_BestReturn : -114.32668318130665
TimeSinceStart : 261.96197390556335
Training Loss : 0.2598627209663391
Done logging...




********** Iteration 74000 ************

Training agent...

Beginning logging procedure...
Timestep 74001
mean reward (100 episodes) -109.808883
best mean reward -109.808883
running time 265.961318



********** Iteration 93000 ************

Training agent...

Beginning logging procedure...
Timestep 93001
mean reward (100 episodes) -61.678147
best mean reward -61.678147
running time 337.888052
Train_EnvstepsSoFar : 93001
Train_AverageReturn : -61.6781466727888
Train_BestReturn : -61.6781466727888
TimeSinceStart : 337.88805198669434
Training Loss : 0.5950069427490234
Done logging...




********** Iteration 94000 ************

Training agent...

Beginning logging procedure...
Timestep 94001
mean reward (100 episodes) -62.675967
best mean reward -61.678147
running time 341.310179
Train_EnvstepsSoFar : 94001
Train_AverageReturn : -62.67596720913922
Train_BestReturn : -61.6781466727888
TimeSinceStart : 341.31017875671387
Training Loss : 0.1475774645805359
Done logging...




********** Iteration 95000 ************

Training agent...

Beginning logging procedure...
Timestep 95001
mean reward (100 episodes) -58.421487
best mean reward -58.421487
running time 344.807649



********** Iteration 114000 ************

Training agent...

Beginning logging procedure...
Timestep 114001
mean reward (100 episodes) -40.772661
best mean reward -40.772661
running time 414.169818
Train_EnvstepsSoFar : 114001
Train_AverageReturn : -40.772661125244596
Train_BestReturn : -40.772661125244596
TimeSinceStart : 414.16981768608093
Training Loss : 1.0022165775299072
Done logging...




********** Iteration 115000 ************

Training agent...

Beginning logging procedure...
Timestep 115001
mean reward (100 episodes) -41.889907
best mean reward -40.772661
running time 417.680287
Train_EnvstepsSoFar : 115001
Train_AverageReturn : -41.889906600332786
Train_BestReturn : -40.772661125244596
TimeSinceStart : 417.68028688430786
Training Loss : 0.256264328956604
Done logging...




********** Iteration 116000 ************

Training agent...

Beginning logging procedure...
Timestep 116001
mean reward (100 episodes) -35.203860
best mean reward -35.203860
running time 420.743853



********** Iteration 135000 ************

Training agent...

Beginning logging procedure...
Timestep 135001
mean reward (100 episodes) -16.733884
best mean reward -16.524502
running time 489.592245
Train_EnvstepsSoFar : 135001
Train_AverageReturn : -16.733883839239155
Train_BestReturn : -16.524501633575372
TimeSinceStart : 489.59224486351013
Training Loss : 0.4501585066318512
Done logging...




********** Iteration 136000 ************

Training agent...

Beginning logging procedure...
Timestep 136001
mean reward (100 episodes) -18.756725
best mean reward -16.524502
running time 493.943577
Train_EnvstepsSoFar : 136001
Train_AverageReturn : -18.75672498842205
Train_BestReturn : -16.524501633575372
TimeSinceStart : 493.94357681274414
Training Loss : 0.0892878919839859
Done logging...




********** Iteration 137000 ************

Training agent...

Beginning logging procedure...
Timestep 137001
mean reward (100 episodes) -15.531476
best mean reward -15.531476
running time 497.578325



********** Iteration 156000 ************

Training agent...

Beginning logging procedure...
Timestep 156001
mean reward (100 episodes) -8.478115
best mean reward -6.947341
running time 569.694950
Train_EnvstepsSoFar : 156001
Train_AverageReturn : -8.478114987188185
Train_BestReturn : -6.9473412658907305
TimeSinceStart : 569.6949498653412
Training Loss : 0.13217952847480774
Done logging...




********** Iteration 157000 ************

Training agent...

Beginning logging procedure...
Timestep 157001
mean reward (100 episodes) -5.311748
best mean reward -5.311748
running time 573.315838
Train_EnvstepsSoFar : 157001
Train_AverageReturn : -5.311748187092747
Train_BestReturn : -5.311748187092747
TimeSinceStart : 573.3158378601074
Training Loss : 0.09902116656303406
Done logging...




********** Iteration 158000 ************

Training agent...

Beginning logging procedure...
Timestep 158001
mean reward (100 episodes) -2.817611
best mean reward -2.817611
running time 576.346345



********** Iteration 177000 ************

Training agent...

Beginning logging procedure...
Timestep 177001
mean reward (100 episodes) 40.825404
best mean reward 40.962902
running time 647.949181
Train_EnvstepsSoFar : 177001
Train_AverageReturn : 40.82540428900028
Train_BestReturn : 40.96290201562405
TimeSinceStart : 647.9491808414459
Training Loss : 0.13814903795719147
Done logging...




********** Iteration 178000 ************

Training agent...

Beginning logging procedure...
Timestep 178001
mean reward (100 episodes) 42.069591
best mean reward 42.069591
running time 651.548230
Train_EnvstepsSoFar : 178001
Train_AverageReturn : 42.06959138528551
Train_BestReturn : 42.06959138528551
TimeSinceStart : 651.548229932785
Training Loss : 0.13533543050289154
Done logging...




********** Iteration 179000 ************

Training agent...

Beginning logging procedure...
Timestep 179001
mean reward (100 episodes) 41.695378
best mean reward 42.069591
running time 654.668233



********** Iteration 198000 ************

Training agent...

Beginning logging procedure...
Timestep 198001
mean reward (100 episodes) 66.843976
best mean reward 66.843976
running time 722.565748
Train_EnvstepsSoFar : 198001
Train_AverageReturn : 66.8439760181735
Train_BestReturn : 66.8439760181735
TimeSinceStart : 722.5657479763031
Training Loss : 0.7108044624328613
Done logging...




********** Iteration 199000 ************

Training agent...

Beginning logging procedure...
Timestep 199001
mean reward (100 episodes) 64.373533
best mean reward 66.843976
running time 726.027502
Train_EnvstepsSoFar : 199001
Train_AverageReturn : 64.37353308335678
Train_BestReturn : 66.8439760181735
TimeSinceStart : 726.027501821518
Training Loss : 0.1251034289598465
Done logging...




********** Iteration 200000 ************

Training agent...

Beginning logging procedure...
Timestep 200001
mean reward (100 episodes) 62.619008
best mean reward 66.843976
running time 729.341247



********** Iteration 219000 ************

Training agent...

Beginning logging procedure...
Timestep 219001
mean reward (100 episodes) 65.616842
best mean reward 76.286098
running time 788.897580
Train_EnvstepsSoFar : 219001
Train_AverageReturn : 65.6168415015569
Train_BestReturn : 76.28609845349386
TimeSinceStart : 788.8975796699524
Training Loss : 0.19993971288204193
Done logging...




********** Iteration 220000 ************

Training agent...

Beginning logging procedure...
Timestep 220001
mean reward (100 episodes) 64.897681
best mean reward 76.286098
running time 791.997587
Train_EnvstepsSoFar : 220001
Train_AverageReturn : 64.89768073773132
Train_BestReturn : 76.28609845349386
TimeSinceStart : 791.9975869655609
Training Loss : 0.1783498376607895
Done logging...




********** Iteration 221000 ************

Training agent...

Beginning logging procedure...
Timestep 221001
mean reward (100 episodes) 65.824450
best mean reward 76.286098
running time 794.913528



********** Iteration 240000 ************

Training agent...

Beginning logging procedure...
Timestep 240001
mean reward (100 episodes) 55.009976
best mean reward 76.286098
running time 858.091973
Train_EnvstepsSoFar : 240001
Train_AverageReturn : 55.009975866113685
Train_BestReturn : 76.28609845349386
TimeSinceStart : 858.0919728279114
Training Loss : 1.1500493288040161
Done logging...




********** Iteration 241000 ************

Training agent...

Beginning logging procedure...
Timestep 241001
mean reward (100 episodes) 56.944458
best mean reward 76.286098
running time 860.812762
Train_EnvstepsSoFar : 241001
Train_AverageReturn : 56.94445810257561
Train_BestReturn : 76.28609845349386
TimeSinceStart : 860.8127617835999
Training Loss : 1.0348554849624634
Done logging...




********** Iteration 242000 ************

Training agent...

Beginning logging procedure...
Timestep 242001
mean reward (100 episodes) 62.716727
best mean reward 76.286098
running time 863.814560



********** Iteration 261000 ************

Training agent...

Beginning logging procedure...
Timestep 261001
mean reward (100 episodes) 61.814477
best mean reward 76.286098
running time 922.577771
Train_EnvstepsSoFar : 261001
Train_AverageReturn : 61.81447662926888
Train_BestReturn : 76.28609845349386
TimeSinceStart : 922.57777094841
Training Loss : 0.8508968353271484
Done logging...




********** Iteration 262000 ************

Training agent...

Beginning logging procedure...
Timestep 262001
mean reward (100 episodes) 63.705878
best mean reward 76.286098
running time 925.371033
Train_EnvstepsSoFar : 262001
Train_AverageReturn : 63.705878168638264
Train_BestReturn : 76.28609845349386
TimeSinceStart : 925.3710327148438
Training Loss : 3.9408440589904785
Done logging...




********** Iteration 263000 ************

Training agent...

Beginning logging procedure...
Timestep 263001
mean reward (100 episodes) 61.913326
best mean reward 76.286098
running time 928.618428



********** Iteration 282000 ************

Training agent...

Beginning logging procedure...
Timestep 282001
mean reward (100 episodes) 66.855284
best mean reward 76.286098
running time 984.608394
Train_EnvstepsSoFar : 282001
Train_AverageReturn : 66.85528388475576
Train_BestReturn : 76.28609845349386
TimeSinceStart : 984.608393907547
Training Loss : 1.0584306716918945
Done logging...




********** Iteration 283000 ************

Training agent...

Beginning logging procedure...
Timestep 283001
mean reward (100 episodes) 65.799206
best mean reward 76.286098
running time 987.564997
Train_EnvstepsSoFar : 283001
Train_AverageReturn : 65.79920623411415
Train_BestReturn : 76.28609845349386
TimeSinceStart : 987.5649967193604
Training Loss : 0.2778506875038147
Done logging...




********** Iteration 284000 ************

Training agent...

Beginning logging procedure...
Timestep 284001
mean reward (100 episodes) 64.002060
best mean reward 76.286098
running time 991.058432



********** Iteration 303000 ************

Training agent...

Beginning logging procedure...
Timestep 303001
mean reward (100 episodes) 80.617444
best mean reward 87.664460
running time 1047.940372
Train_EnvstepsSoFar : 303001
Train_AverageReturn : 80.6174441038012
Train_BestReturn : 87.66446017761444
TimeSinceStart : 1047.9403719902039
Training Loss : 3.8416919708251953
Done logging...




********** Iteration 304000 ************

Training agent...

Beginning logging procedure...
Timestep 304001
mean reward (100 episodes) 79.289861
best mean reward 87.664460
running time 1052.749195
Train_EnvstepsSoFar : 304001
Train_AverageReturn : 79.28986087906958
Train_BestReturn : 87.66446017761444
TimeSinceStart : 1052.7491948604584
Training Loss : 3.0402886867523193
Done logging...




********** Iteration 305000 ************

Training agent...

Beginning logging procedure...
Timestep 305001
mean reward (100 episodes) 83.036126
best mean reward 87.664460
running time 1055.392846



********** Iteration 324000 ************

Training agent...

Beginning logging procedure...
Timestep 324001
mean reward (100 episodes) 83.124949
best mean reward 87.664460
running time 1112.261129
Train_EnvstepsSoFar : 324001
Train_AverageReturn : 83.12494857133335
Train_BestReturn : 87.66446017761444
TimeSinceStart : 1112.2611289024353
Training Loss : 0.5830877423286438
Done logging...




********** Iteration 325000 ************

Training agent...

Beginning logging procedure...
Timestep 325001
mean reward (100 episodes) 89.310046
best mean reward 89.310046
running time 1114.954711
Train_EnvstepsSoFar : 325001
Train_AverageReturn : 89.31004561653567
Train_BestReturn : 89.31004561653567
TimeSinceStart : 1114.9547107219696
Training Loss : 1.9056756496429443
Done logging...




********** Iteration 326000 ************

Training agent...

Beginning logging procedure...
Timestep 326001
mean reward (100 episodes) 95.448305
best mean reward 95.448305
running time 1117.739417



********** Iteration 345000 ************

Training agent...

Beginning logging procedure...
Timestep 345001
mean reward (100 episodes) 68.936030
best mean reward 96.630523
running time 1171.130306
Train_EnvstepsSoFar : 345001
Train_AverageReturn : 68.93602970927468
Train_BestReturn : 96.6305226721904
TimeSinceStart : 1171.1303057670593
Training Loss : 0.7866537570953369
Done logging...




********** Iteration 346000 ************

Training agent...

Beginning logging procedure...
Timestep 346001
mean reward (100 episodes) 72.699691
best mean reward 96.630523
running time 1173.821855
Train_EnvstepsSoFar : 346001
Train_AverageReturn : 72.69969106908235
Train_BestReturn : 96.6305226721904
TimeSinceStart : 1173.8218548297882
Training Loss : 0.4092033803462982
Done logging...




********** Iteration 347000 ************

Training agent...

Beginning logging procedure...
Timestep 347001
mean reward (100 episodes) 77.302383
best mean reward 96.630523
running time 1176.487766



********** Iteration 366000 ************

Training agent...

Beginning logging procedure...
Timestep 366001
mean reward (100 episodes) 138.072463
best mean reward 138.072463
running time 1231.310334
Train_EnvstepsSoFar : 366001
Train_AverageReturn : 138.0724634357421
Train_BestReturn : 138.0724634357421
TimeSinceStart : 1231.3103339672089
Training Loss : 0.9736014008522034
Done logging...




********** Iteration 367000 ************

Training agent...

Beginning logging procedure...
Timestep 367001
mean reward (100 episodes) 140.442024
best mean reward 140.442024
running time 1234.014262
Train_EnvstepsSoFar : 367001
Train_AverageReturn : 140.44202414409835
Train_BestReturn : 140.44202414409835
TimeSinceStart : 1234.0142619609833
Training Loss : 0.1635926216840744
Done logging...




********** Iteration 368000 ************

Training agent...

Beginning logging procedure...
Timestep 368001
mean reward (100 episodes) 140.995970
best mean reward 140.995970
running time 1236.900949



********** Iteration 387000 ************

Training agent...

Beginning logging procedure...
Timestep 387001
mean reward (100 episodes) 146.763162
best mean reward 153.582931
running time 1288.491547
Train_EnvstepsSoFar : 387001
Train_AverageReturn : 146.76316206182813
Train_BestReturn : 153.58293085907854
TimeSinceStart : 1288.491546869278
Training Loss : 1.3121756315231323
Done logging...




********** Iteration 388000 ************

Training agent...

Beginning logging procedure...
Timestep 388001
mean reward (100 episodes) 145.849082
best mean reward 153.582931
running time 1291.200745
Train_EnvstepsSoFar : 388001
Train_AverageReturn : 145.84908215353244
Train_BestReturn : 153.58293085907854
TimeSinceStart : 1291.2007448673248
Training Loss : 0.9968553185462952
Done logging...




********** Iteration 389000 ************

Training agent...

Beginning logging procedure...
Timestep 389001
mean reward (100 episodes) 147.165499
best mean reward 153.582931
running time 1293.912772



********** Iteration 408000 ************

Training agent...

Beginning logging procedure...
Timestep 408001
mean reward (100 episodes) 154.149129
best mean reward 157.903032
running time 1345.860375
Train_EnvstepsSoFar : 408001
Train_AverageReturn : 154.14912895269407
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1345.8603746891022
Training Loss : 0.39018499851226807
Done logging...




********** Iteration 409000 ************

Training agent...

Beginning logging procedure...
Timestep 409001
mean reward (100 episodes) 150.671057
best mean reward 157.903032
running time 1348.544316
Train_EnvstepsSoFar : 409001
Train_AverageReturn : 150.6710571222876
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1348.544315814972
Training Loss : 0.78148353099823
Done logging...




********** Iteration 410000 ************

Training agent...

Beginning logging procedure...
Timestep 410001
mean reward (100 episodes) 151.948892
best mean reward 157.903032
running time 1351.269162



********** Iteration 429000 ************

Training agent...

Beginning logging procedure...
Timestep 429001
mean reward (100 episodes) 125.754671
best mean reward 157.903032
running time 1402.354364
Train_EnvstepsSoFar : 429001
Train_AverageReturn : 125.75467140635976
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1402.3543639183044
Training Loss : 2.213430643081665
Done logging...




********** Iteration 430000 ************

Training agent...

Beginning logging procedure...
Timestep 430001
mean reward (100 episodes) 121.822506
best mean reward 157.903032
running time 1405.168878
Train_EnvstepsSoFar : 430001
Train_AverageReturn : 121.82250636550093
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1405.1688778400421
Training Loss : 0.7387622594833374
Done logging...




********** Iteration 431000 ************

Training agent...

Beginning logging procedure...
Timestep 431001
mean reward (100 episodes) 125.373486
best mean reward 157.903032
running time 1407.845112



********** Iteration 450000 ************

Training agent...

Beginning logging procedure...
Timestep 450001
mean reward (100 episodes) 125.800696
best mean reward 157.903032
running time 1461.950175
Train_EnvstepsSoFar : 450001
Train_AverageReturn : 125.80069589847103
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1461.9501748085022
Training Loss : 0.1794559806585312
Done logging...




********** Iteration 451000 ************

Training agent...

Beginning logging procedure...
Timestep 451001
mean reward (100 episodes) 123.466151
best mean reward 157.903032
running time 1464.644028
Train_EnvstepsSoFar : 451001
Train_AverageReturn : 123.46615073078121
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1464.6440279483795
Training Loss : 1.4061235189437866
Done logging...




********** Iteration 452000 ************

Training agent...

Beginning logging procedure...
Timestep 452001
mean reward (100 episodes) 124.924756
best mean reward 157.903032
running time 1468.497196



********** Iteration 471000 ************

Training agent...

Beginning logging procedure...
Timestep 471001
mean reward (100 episodes) 134.306606
best mean reward 157.903032
running time 1523.008235
Train_EnvstepsSoFar : 471001
Train_AverageReturn : 134.30660569876636
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1523.0082349777222
Training Loss : 2.7748019695281982
Done logging...




********** Iteration 472000 ************

Training agent...

Beginning logging procedure...
Timestep 472001
mean reward (100 episodes) 141.107175
best mean reward 157.903032
running time 1525.719099
Train_EnvstepsSoFar : 472001
Train_AverageReturn : 141.1071749174333
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1525.7190988063812
Training Loss : 2.599553346633911
Done logging...




********** Iteration 473000 ************

Training agent...

Beginning logging procedure...
Timestep 473001
mean reward (100 episodes) 143.582199
best mean reward 157.903032
running time 1528.474528



********** Iteration 492000 ************

Training agent...

Beginning logging procedure...
Timestep 492001
mean reward (100 episodes) 148.839253
best mean reward 157.903032
running time 1582.167103
Train_EnvstepsSoFar : 492001
Train_AverageReturn : 148.83925326609608
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1582.1671028137207
Training Loss : 7.701116561889648
Done logging...




********** Iteration 493000 ************

Training agent...

Beginning logging procedure...
Timestep 493001
mean reward (100 episodes) 157.529837
best mean reward 157.903032
running time 1584.980689
Train_EnvstepsSoFar : 493001
Train_AverageReturn : 157.52983726979656
Train_BestReturn : 157.9030320058246
TimeSinceStart : 1584.9806888103485
Training Loss : 2.1182079315185547
Done logging...




********** Iteration 494000 ************

Training agent...

Beginning logging procedure...
Timestep 494001
mean reward (100 episodes) 151.644134
best mean reward 157.903032
running time 1587.699058



********** Iteration 14000 ************

Training agent...

Beginning logging procedure...
Timestep 14001
mean reward (100 episodes) -239.036714
best mean reward -inf
running time 33.978938
Train_EnvstepsSoFar : 14001
Train_AverageReturn : -239.03671408213813
TimeSinceStart : 33.97893810272217
Training Loss : 3.459628105163574
Done logging...




********** Iteration 15000 ************

Training agent...

Beginning logging procedure...
Timestep 15001
mean reward (100 episodes) -235.639982
best mean reward -inf
running time 36.670428
Train_EnvstepsSoFar : 15001
Train_AverageReturn : -235.63998235493042
TimeSinceStart : 36.67042827606201
Training Loss : 3.888798475265503
Done logging...




********** Iteration 16000 ************

Training agent...

Beginning logging procedure...
Timestep 16001
mean reward (100 episodes) -233.718345
best mean reward -233.718345
running time 39.355478
Train_EnvstepsSoFar : 16001
Train_AverageReturn : -233.7183445782002
Train_BestReturn : -233.7183445782



********** Iteration 35000 ************

Training agent...

Beginning logging procedure...
Timestep 35001
mean reward (100 episodes) -190.384332
best mean reward -190.384332
running time 111.442716
Train_EnvstepsSoFar : 35001
Train_AverageReturn : -190.38433163298964
Train_BestReturn : -190.38433163298964
TimeSinceStart : 111.44271612167358
Training Loss : 0.43242666125297546
Done logging...




********** Iteration 36000 ************

Training agent...

Beginning logging procedure...
Timestep 36001
mean reward (100 episodes) -190.859439
best mean reward -190.384332
running time 115.791739
Train_EnvstepsSoFar : 36001
Train_AverageReturn : -190.85943856819125
Train_BestReturn : -190.38433163298964
TimeSinceStart : 115.79173922538757
Training Loss : 1.268868327140808
Done logging...




********** Iteration 37000 ************

Training agent...

Beginning logging procedure...
Timestep 37001
mean reward (100 episodes) -188.503938
best mean reward -188.503938
running time 119.893769



********** Iteration 56000 ************

Training agent...

Beginning logging procedure...
Timestep 56001
mean reward (100 episodes) -171.672360
best mean reward -168.804139
running time 194.308165
Train_EnvstepsSoFar : 56001
Train_AverageReturn : -171.67236042416087
Train_BestReturn : -168.80413938928928
TimeSinceStart : 194.30816507339478
Training Loss : 0.27756622433662415
Done logging...




********** Iteration 57000 ************

Training agent...

Beginning logging procedure...
Timestep 57001
mean reward (100 episodes) -170.263256
best mean reward -168.804139
running time 197.942803
Train_EnvstepsSoFar : 57001
Train_AverageReturn : -170.26325648594798
Train_BestReturn : -168.80413938928928
TimeSinceStart : 197.94280338287354
Training Loss : 0.340167373418808
Done logging...




********** Iteration 58000 ************

Training agent...

Beginning logging procedure...
Timestep 58001
mean reward (100 episodes) -169.543343
best mean reward -168.804139
running time 202.347852



********** Iteration 77000 ************

Training agent...

Beginning logging procedure...
Timestep 77001
mean reward (100 episodes) -150.116340
best mean reward -148.153428
running time 274.127051
Train_EnvstepsSoFar : 77001
Train_AverageReturn : -150.11634023138126
Train_BestReturn : -148.15342786508066
TimeSinceStart : 274.127051115036
Training Loss : 0.13249947130680084
Done logging...




********** Iteration 78000 ************

Training agent...

Beginning logging procedure...
Timestep 78001
mean reward (100 episodes) -145.784975
best mean reward -145.784975
running time 277.859155
Train_EnvstepsSoFar : 78001
Train_AverageReturn : -145.78497454825632
Train_BestReturn : -145.78497454825632
TimeSinceStart : 277.85915517807007
Training Loss : 0.22739431262016296
Done logging...




********** Iteration 79000 ************

Training agent...

Beginning logging procedure...
Timestep 79001
mean reward (100 episodes) -148.330269
best mean reward -145.784975
running time 281.995651



********** Iteration 98000 ************

Training agent...

Beginning logging procedure...
Timestep 98001
mean reward (100 episodes) -134.959790
best mean reward -134.699894
running time 357.830692
Train_EnvstepsSoFar : 98001
Train_AverageReturn : -134.9597897384944
Train_BestReturn : -134.69989442300346
TimeSinceStart : 357.8306920528412
Training Loss : 0.19053146243095398
Done logging...




********** Iteration 99000 ************

Training agent...

Beginning logging procedure...
Timestep 99001
mean reward (100 episodes) -127.957167
best mean reward -127.957167
running time 360.894508
Train_EnvstepsSoFar : 99001
Train_AverageReturn : -127.95716722433497
Train_BestReturn : -127.95716722433497
TimeSinceStart : 360.8945081233978
Training Loss : 0.2479477822780609
Done logging...




********** Iteration 100000 ************

Training agent...

Beginning logging procedure...
Timestep 100001
mean reward (100 episodes) -128.858255
best mean reward -127.957167
running time 364.810489

Train_EnvstepsSoFar : 118001
Train_AverageReturn : -116.35545531798631
Train_BestReturn : -116.35545531798631
TimeSinceStart : 433.7761740684509
Training Loss : 0.26184317469596863
Done logging...




********** Iteration 119000 ************

Training agent...

Beginning logging procedure...
Timestep 119001
mean reward (100 episodes) -115.034593
best mean reward -115.034593
running time 437.651545
Train_EnvstepsSoFar : 119001
Train_AverageReturn : -115.03459292908177
Train_BestReturn : -115.03459292908177
TimeSinceStart : 437.65154504776
Training Loss : 0.1298454850912094
Done logging...




********** Iteration 120000 ************

Training agent...

Beginning logging procedure...
Timestep 120001
mean reward (100 episodes) -112.362062
best mean reward -112.362062
running time 441.400427
Train_EnvstepsSoFar : 120001
Train_AverageReturn : -112.36206243324395
Train_BestReturn : -112.36206243324395
TimeSinceStart : 441.4004271030426
Training Loss : 0.2226596474647522
Done logging...





Train_EnvstepsSoFar : 139001
Train_AverageReturn : -61.5638133684805
Train_BestReturn : -61.5638133684805
TimeSinceStart : 510.85567927360535
Training Loss : 0.2649880349636078
Done logging...




********** Iteration 140000 ************

Training agent...

Beginning logging procedure...
Timestep 140001
mean reward (100 episodes) -57.241312
best mean reward -57.241312
running time 514.991815
Train_EnvstepsSoFar : 140001
Train_AverageReturn : -57.24131242237132
Train_BestReturn : -57.24131242237132
TimeSinceStart : 514.991815328598
Training Loss : 0.438372403383255
Done logging...




********** Iteration 141000 ************

Training agent...

Beginning logging procedure...
Timestep 141001
mean reward (100 episodes) -55.856632
best mean reward -55.856632
running time 518.517978
Train_EnvstepsSoFar : 141001
Train_AverageReturn : -55.85663242933067
Train_BestReturn : -55.85663242933067
TimeSinceStart : 518.5179781913757
Training Loss : 0.4632313549518585
Done logging...




...

Train_EnvstepsSoFar : 160001
Train_AverageReturn : -3.389753725682612
Train_BestReturn : -3.389753725682612
TimeSinceStart : 589.0408780574799
Training Loss : 0.3007430136203766
Done logging...




********** Iteration 161000 ************

Training agent...

Beginning logging procedure...
Timestep 161001
mean reward (100 episodes) 0.316443
best mean reward 0.316443
running time 592.841819
Train_EnvstepsSoFar : 161001
Train_AverageReturn : 0.31644287118442976
Train_BestReturn : 0.31644287118442976
TimeSinceStart : 592.8418192863464
Training Loss : 5.273673057556152
Done logging...




********** Iteration 162000 ************

Training agent...

Beginning logging procedure...
Timestep 162001
mean reward (100 episodes) 0.610461
best mean reward 0.610461
running time 597.705613
Train_EnvstepsSoFar : 162001
Train_AverageReturn : 0.6104609177799613
Train_BestReturn : 0.6104609177799613
TimeSinceStart : 597.7056131362915
Training Loss : 0.1107105165719986
Done logging...




...

Train_EnvstepsSoFar : 181001
Train_AverageReturn : 32.073610087613915
Train_BestReturn : 32.073610087613915
TimeSinceStart : 672.8343412876129
Training Loss : 0.07645134627819061
Done logging...




********** Iteration 182000 ************

Training agent...

Beginning logging procedure...
Timestep 182001
mean reward (100 episodes) 32.806741
best mean reward 32.806741
running time 677.663931
Train_EnvstepsSoFar : 182001
Train_AverageReturn : 32.806741498180514
Train_BestReturn : 32.806741498180514
TimeSinceStart : 677.6639311313629
Training Loss : 0.07221828401088715
Done logging...




********** Iteration 183000 ************

Training agent...

Beginning logging procedure...
Timestep 183001
mean reward (100 episodes) 34.789140
best mean reward 34.789140
running time 681.898948
Train_EnvstepsSoFar : 183001
Train_AverageReturn : 34.78914042476384
Train_BestReturn : 34.78914042476384
TimeSinceStart : 681.8989481925964
Training Loss : 0.032665424048900604
Done logging...




...

Train_EnvstepsSoFar : 202001
Train_AverageReturn : 39.402909802494555
Train_BestReturn : 44.614112275433165
TimeSinceStart : 759.349750995636
Training Loss : 0.0875818133354187
Done logging...




********** Iteration 203000 ************

Training agent...

Beginning logging procedure...
Timestep 203001
mean reward (100 episodes) 39.515292
best mean reward 44.614112
running time 763.199542
Train_EnvstepsSoFar : 203001
Train_AverageReturn : 39.51529163510105
Train_BestReturn : 44.614112275433165
TimeSinceStart : 763.1995420455933
Training Loss : 0.11163073033094406
Done logging...




********** Iteration 204000 ************

Training agent...

Beginning logging procedure...
Timestep 204001
mean reward (100 episodes) 39.273146
best mean reward 44.614112
running time 766.729535
Train_EnvstepsSoFar : 204001
Train_AverageReturn : 39.27314551093652
Train_BestReturn : 44.614112275433165
TimeSinceStart : 766.7295353412628
Training Loss : 0.06513969600200653
Done logging...




...

Train_EnvstepsSoFar : 223001
Train_AverageReturn : 26.777831342488106
Train_BestReturn : 44.614112275433165
TimeSinceStart : 841.5964510440826
Training Loss : 0.05151914432644844
Done logging...




********** Iteration 224000 ************

Training agent...

Beginning logging procedure...
Timestep 224001
mean reward (100 episodes) 26.420706
best mean reward 44.614112
running time 845.005002
Train_EnvstepsSoFar : 224001
Train_AverageReturn : 26.42070645276496
Train_BestReturn : 44.614112275433165
TimeSinceStart : 845.0050022602081
Training Loss : 0.3533135950565338
Done logging...




********** Iteration 225000 ************

Training agent...

Beginning logging procedure...
Timestep 225001
mean reward (100 episodes) 28.485175
best mean reward 44.614112
running time 849.867886
Train_EnvstepsSoFar : 225001
Train_AverageReturn : 28.48517452068364
Train_BestReturn : 44.614112275433165
TimeSinceStart : 849.8678860664368
Training Loss : 0.11745458841323853
Done logging...




...

Train_EnvstepsSoFar : 244001
Train_AverageReturn : 49.70382080194691
Train_BestReturn : 49.70382080194691
TimeSinceStart : 918.3198993206024
Training Loss : 0.08140988647937775
Done logging...




********** Iteration 245000 ************

Training agent...

Beginning logging procedure...
Timestep 245001
mean reward (100 episodes) 54.870328
best mean reward 54.870328
running time 921.412668
Train_EnvstepsSoFar : 245001
Train_AverageReturn : 54.870327695040054
Train_BestReturn : 54.870327695040054
TimeSinceStart : 921.4126682281494
Training Loss : 0.16218678653240204
Done logging...




********** Iteration 246000 ************

Training agent...

Beginning logging procedure...
Timestep 246001
mean reward (100 episodes) 60.176163
best mean reward 60.176163
running time 924.549884
Train_EnvstepsSoFar : 246001
Train_AverageReturn : 60.17616267389957
Train_BestReturn : 60.17616267389957
TimeSinceStart : 924.5498843193054
Training Loss : 0.06997520476579666
Done logging...




...

Train_EnvstepsSoFar : 265001
Train_AverageReturn : 105.35265567400191
Train_BestReturn : 105.35265567400191
TimeSinceStart : 987.0483512878418
Training Loss : 2.775606155395508
Done logging...




********** Iteration 266000 ************

Training agent...

Beginning logging procedure...
Timestep 266001
mean reward (100 episodes) 106.621388
best mean reward 106.621388
running time 990.399630
Train_EnvstepsSoFar : 266001
Train_AverageReturn : 106.62138814043428
Train_BestReturn : 106.62138814043428
TimeSinceStart : 990.3996300697327
Training Loss : 0.08931712061166763
Done logging...




********** Iteration 267000 ************

Training agent...

Beginning logging procedure...
Timestep 267001
mean reward (100 episodes) 108.007266
best mean reward 108.007266
running time 994.153111
Train_EnvstepsSoFar : 267001
Train_AverageReturn : 108.00726625841506
Train_BestReturn : 108.00726625841506
TimeSinceStart : 994.1531112194061
Training Loss : 1.050549030303955
Done logging...




...

Train_EnvstepsSoFar : 286001
Train_AverageReturn : 141.03915597003052
Train_BestReturn : 141.03915597003052
TimeSinceStart : 1052.3719532489777
Training Loss : 1.0230305194854736
Done logging...




********** Iteration 287000 ************

Training agent...

Beginning logging procedure...
Timestep 287001
mean reward (100 episodes) 143.281603
best mean reward 143.281603
running time 1055.346989
Train_EnvstepsSoFar : 287001
Train_AverageReturn : 143.28160273306779
Train_BestReturn : 143.28160273306779
TimeSinceStart : 1055.3469891548157
Training Loss : 0.19561129808425903
Done logging...




********** Iteration 288000 ************

Training agent...

Beginning logging procedure...
Timestep 288001
mean reward (100 episodes) 145.279422
best mean reward 145.279422
running time 1058.609653
Train_EnvstepsSoFar : 288001
Train_AverageReturn : 145.27942162841657
Train_BestReturn : 145.27942162841657
TimeSinceStart : 1058.6096529960632
Training Loss : 0.06238973140716553
Done logging...




...

Train_EnvstepsSoFar : 307001
Train_AverageReturn : 145.27234260710148
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1114.9614610671997
Training Loss : 0.9248181581497192
Done logging...




********** Iteration 308000 ************

Training agent...

Beginning logging procedure...
Timestep 308001
mean reward (100 episodes) 145.270862
best mean reward 154.280998
running time 1117.816237
Train_EnvstepsSoFar : 308001
Train_AverageReturn : 145.27086176904487
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1117.8162372112274
Training Loss : 0.073231540620327
Done logging...




********** Iteration 309000 ************

Training agent...

Beginning logging procedure...
Timestep 309001
mean reward (100 episodes) 141.675181
best mean reward 154.280998
running time 1120.603170
Train_EnvstepsSoFar : 309001
Train_AverageReturn : 141.6751809860166
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1120.6031701564789
Training Loss : 0.10196651518344879
Done logging...




...

Train_EnvstepsSoFar : 328001
Train_AverageReturn : 117.48048108313962
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1175.9052453041077
Training Loss : 0.07791432738304138
Done logging...




********** Iteration 329000 ************

Training agent...

Beginning logging procedure...
Timestep 329001
mean reward (100 episodes) 112.745800
best mean reward 154.280998
running time 1178.629293
Train_EnvstepsSoFar : 329001
Train_AverageReturn : 112.74579967472475
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1178.6292932033539
Training Loss : 0.10193532705307007
Done logging...




********** Iteration 330000 ************

Training agent...

Beginning logging procedure...
Timestep 330001
mean reward (100 episodes) 116.893294
best mean reward 154.280998
running time 1181.314324
Train_EnvstepsSoFar : 330001
Train_AverageReturn : 116.89329382201629
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1181.3143243789673
Training Loss : 2.7822487354278564
Done logging...




...

Train_EnvstepsSoFar : 349001
Train_AverageReturn : 123.65498163827974
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1237.997854232788
Training Loss : 0.4157940745353699
Done logging...




********** Iteration 350000 ************

Training agent...

Beginning logging procedure...
Timestep 350001
mean reward (100 episodes) 121.685108
best mean reward 154.280998
running time 1240.955635
Train_EnvstepsSoFar : 350001
Train_AverageReturn : 121.6851080594648
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1240.9556353092194
Training Loss : 0.20579469203948975
Done logging...




********** Iteration 351000 ************

Training agent...

Beginning logging procedure...
Timestep 351001
mean reward (100 episodes) 121.258097
best mean reward 154.280998
running time 1243.709301
Train_EnvstepsSoFar : 351001
Train_AverageReturn : 121.25809738313195
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1243.709300994873
Training Loss : 5.090541839599609
Done logging...




...

Train_EnvstepsSoFar : 370001
Train_AverageReturn : 134.69093720742424
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1296.1216413974762
Training Loss : 0.07017703354358673
Done logging...




********** Iteration 371000 ************

Training agent...

Beginning logging procedure...
Timestep 371001
mean reward (100 episodes) 132.093181
best mean reward 154.280998
running time 1299.475226
Train_EnvstepsSoFar : 371001
Train_AverageReturn : 132.09318083531412
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1299.4752261638641
Training Loss : 0.17159268260002136
Done logging...




********** Iteration 372000 ************

Training agent...

Beginning logging procedure...
Timestep 372001
mean reward (100 episodes) 128.968215
best mean reward 154.280998
running time 1302.196106
Train_EnvstepsSoFar : 372001
Train_AverageReturn : 128.9682152260247
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1302.1961061954498
Training Loss : 0.5718422532081604
Done logging...




...

Train_EnvstepsSoFar : 391001
Train_AverageReturn : 131.83938026148272
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1357.4693701267242
Training Loss : 0.14985671639442444
Done logging...




********** Iteration 392000 ************

Training agent...

Beginning logging procedure...
Timestep 392001
mean reward (100 episodes) 131.281820
best mean reward 154.280998
running time 1360.150543
Train_EnvstepsSoFar : 392001
Train_AverageReturn : 131.28182001293806
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1360.1505432128906
Training Loss : 0.21108773350715637
Done logging...




********** Iteration 393000 ************

Training agent...

Beginning logging procedure...
Timestep 393001
mean reward (100 episodes) 132.526317
best mean reward 154.280998
running time 1362.891650
Train_EnvstepsSoFar : 393001
Train_AverageReturn : 132.5263167659822
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1362.8916501998901
Training Loss : 2.980391025543213
Done logging...




...

Train_EnvstepsSoFar : 412001
Train_AverageReturn : 115.94011008173494
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1416.8993372917175
Training Loss : 0.3488525152206421
Done logging...




********** Iteration 413000 ************

Training agent...

Beginning logging procedure...
Timestep 413001
mean reward (100 episodes) 116.943736
best mean reward 154.280998
running time 1419.660407
Train_EnvstepsSoFar : 413001
Train_AverageReturn : 116.94373561465076
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1419.6604070663452
Training Loss : 2.4622206687927246
Done logging...




********** Iteration 414000 ************

Training agent...

Beginning logging procedure...
Timestep 414001
mean reward (100 episodes) 122.909716
best mean reward 154.280998
running time 1422.390275
Train_EnvstepsSoFar : 414001
Train_AverageReturn : 122.9097156130925
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1422.3902752399445
Training Loss : 1.606164574623108
Done logging...




...

Train_EnvstepsSoFar : 433001
Train_AverageReturn : 151.6298229944743
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1474.5339341163635
Training Loss : 0.09618035703897476
Done logging...




********** Iteration 434000 ************

Training agent...

Beginning logging procedure...
Timestep 434001
mean reward (100 episodes) 153.193028
best mean reward 154.280998
running time 1477.205372
Train_EnvstepsSoFar : 434001
Train_AverageReturn : 153.1930275809381
Train_BestReturn : 154.28099773601443
TimeSinceStart : 1477.2053723335266
Training Loss : 2.216078042984009
Done logging...




********** Iteration 435000 ************

Training agent...

Beginning logging procedure...
Timestep 435001
mean reward (100 episodes) 159.145950
best mean reward 159.145950
running time 1479.874536
Train_EnvstepsSoFar : 435001
Train_AverageReturn : 159.14594958299395
Train_BestReturn : 159.14594958299395
TimeSinceStart : 1479.8745362758636
Training Loss : 0.22744882106781006
Done logging...




...

Train_EnvstepsSoFar : 454001
Train_AverageReturn : 147.72136412578413
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1530.8314652442932
Training Loss : 0.08983515948057175
Done logging...




********** Iteration 455000 ************

Training agent...

Beginning logging procedure...
Timestep 455001
mean reward (100 episodes) 147.857998
best mean reward 161.944809
running time 1533.480659
Train_EnvstepsSoFar : 455001
Train_AverageReturn : 147.8579983535985
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1533.4806592464447
Training Loss : 3.9369094371795654
Done logging...




********** Iteration 456000 ************

Training agent...

Beginning logging procedure...
Timestep 456001
mean reward (100 episodes) 145.908502
best mean reward 161.944809
running time 1536.164843
Train_EnvstepsSoFar : 456001
Train_AverageReturn : 145.9085020744729
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1536.164843082428
Training Loss : 0.20499911904335022
Done logging...




...

Train_EnvstepsSoFar : 475001
Train_AverageReturn : 109.07073747742672
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1588.0679211616516
Training Loss : 1.6317424774169922
Done logging...




********** Iteration 476000 ************

Training agent...

Beginning logging procedure...
Timestep 476001
mean reward (100 episodes) 108.157606
best mean reward 161.944809
running time 1590.824520
Train_EnvstepsSoFar : 476001
Train_AverageReturn : 108.15760568752131
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1590.8245203495026
Training Loss : 0.20396649837493896
Done logging...




********** Iteration 477000 ************

Training agent...

Beginning logging procedure...
Timestep 477001
mean reward (100 episodes) 100.193685
best mean reward 161.944809
running time 1593.555040
Train_EnvstepsSoFar : 477001
Train_AverageReturn : 100.1936852575259
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1593.5550401210785
Training Loss : 0.18437299132347107
Done logging...




...

Train_EnvstepsSoFar : 496001
Train_AverageReturn : 70.37033736636005
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1646.2972462177277
Training Loss : 7.371368408203125
Done logging...




********** Iteration 497000 ************

Training agent...

Beginning logging procedure...
Timestep 497001
mean reward (100 episodes) 66.434760
best mean reward 161.944809
running time 1648.955200
Train_EnvstepsSoFar : 497001
Train_AverageReturn : 66.4347597544451
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1648.955199956894
Training Loss : 1.1901475191116333
Done logging...




********** Iteration 498000 ************

Training agent...

Beginning logging procedure...
Timestep 498001
mean reward (100 episodes) 64.204122
best mean reward 161.944809
running time 1651.603387
Train_EnvstepsSoFar : 498001
Train_AverageReturn : 64.20412236253841
Train_BestReturn : 161.94480942757139
TimeSinceStart : 1651.6033871173859
Training Loss : 0.20250026881694794
Done logging...




...

In [57]:
### Visualize vanilla DQN results on Lunar Lander
%load_ext tensorboard
%tensorboard --logdir logs/dqn/LunarLander/vanilla_dqn

## Double DQN
One potential issue with learning Q functions by bootstrapping is _maximization bias_: the learned Q-values tend to overestimate the actual expected future returns. The reason is that when the next state's Q-values contain estimation error, even if they are correct on average, picking the action with the maximum Q-value tends to select one whose value is overestimated. This overoptimistic value is then propagated by the Bellman backups to other states and actions, which can slow down learning.
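
To make the bias concrete, here is a minimal NumPy sketch. It is a toy setup chosen for illustration, not part of the assignment code: every action has a true value of 0, and the estimates are the true values plus zero-mean Gaussian noise. The names `num_actions`, `num_trials`, and the noise scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_actions = 4       # hypothetical number of discrete actions
num_trials = 10_000   # number of independent noisy estimates
true_q = np.zeros(num_actions)  # every action is truly worth 0

# Unbiased but noisy Q-value estimates, one row per trial.
noisy_q = true_q + rng.normal(scale=1.0, size=(num_trials, num_actions))

# Averaged over all entries, the estimates are close to the true value 0.
mean_estimate = noisy_q.mean()

# Taking the max over actions first and then averaging is biased upward,
# even though the true maximum value is exactly 0.
mean_of_max = noisy_q.max(axis=1).mean()

print("true max value:         ", true_q.max())
print("average estimate:       ", mean_estimate)
print("average of max estimate:", mean_of_max)
```

The individual estimates are unbiased, yet the maximum over them is clearly positive on average. A bootstrapped target that takes a max over noisy next-state Q-values has the same upward bias.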

Double DQN (https://arxiv.org/abs/1509.06461) proposes a simple solution to alleviate this _maximization bias_. Instead of taking the next action that maximizes the target network's Q-value, it selects the action that maximizes the _current_ Q function at the next state, and then uses the target network's estimate of that action's value.
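
The difference between the two targets can be sketched in plain NumPy, assuming the next-state Q-values from both networks have already been computed as arrays of shape `(N, num_actions)`. The function names, argument names, and the discount `gamma = 0.5` below are illustrative assumptions, not the interface of <code>critics/dqn_critic.py</code>.

```python
import numpy as np

def vanilla_dqn_targets(q_next_target, rewards, terminals, gamma):
    """Standard DQN: the target network both selects and evaluates the action."""
    next_values = q_next_target.max(axis=1)
    # (1 - terminals) removes the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - terminals) * next_values

def double_dqn_targets(q_next_current, q_next_target, rewards, terminals, gamma):
    """Double DQN: the current network selects, the target network evaluates."""
    best_actions = np.argmax(q_next_current, axis=1)
    next_values = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - terminals) * next_values

# Hand-picked toy batch of two transitions with three actions each.
q_next_current = np.array([[1.0, 2.0, 0.5],
                           [0.2, 0.1, 0.3]])
q_next_target = np.array([[3.0, 1.5, 0.0],
                          [0.4, 0.9, 0.6]])
rewards = np.array([1.0, -1.0])
terminals = np.array([0.0, 1.0])  # the second transition ends the episode
gamma = 0.5                       # assumed discount, for easy arithmetic

vanilla_targets = vanilla_dqn_targets(q_next_target, rewards, terminals, gamma)
double_targets = double_dqn_targets(q_next_current, q_next_target,
                                    rewards, terminals, gamma)

# Row 0, vanilla: 1.0 + 0.5 * max(3.0, 1.5, 0.0) = 2.5
# Row 0, double:  current net picks action 1, target value 1.5 -> 1.0 + 0.5 * 1.5 = 1.75
# Row 1 is terminal, so both targets equal the reward, -1.0.
print(vanilla_targets)
print(double_targets)
```

Because the value the target network assigns to the selected action can never exceed its own maximum, the double DQN target is never larger than the vanilla target for the same batch.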

Implement the double DQN target value in the update method in <code>critics/dqn_critic.py</code>.

In [83]:
#### Test DQN target value with double Q
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = True
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
critic = dqnagent.critic

ob_dim = critic.ob_dim
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

first_weight_before = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight before update (first row)", first_weight_before[0])


loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
expected_loss = 0.93894196
loss_error = rel_error(loss, expected_loss)
print("Initial loss", loss)
print("Initial Loss Error", loss_error, "should be on the order of 1e-6 or lower")

for i in range(4):
    loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
    print(loss)

expected_loss = 0.7871182
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")


first_weight_after = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight after update (first row)", first_weight_after.shape)
# Test DQN gradient
print(first_weight_after[0])
weight_change_partial = first_weight_after[0] - first_weight_before[0]
print(weight_change_partial)
expected_weight_change = np.array([-0.0049137, -0.00500057, -0.00499138, -0.00491226, -0.00490116,  0.00489506,
 -0.00284088, -0.00171939,  0.00485736])


updated_weight_error = rel_error(weight_change_partial, expected_weight_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3
Weight before update (first row) [ 0.07646337 -0.07932475  0.09140956 -0.01702595  0.14239588  0.07935759
 -0.03831157 -0.2694876   0.07610479]
Initial loss 0.93894196
Initial Loss Error 2.3609519617425807e-09 should be on the order of 1e-6 or lower
0.8988525
0.8602768
0.8228415
0.7871182
Loss Error 2.231287016432748e-09 should be on the order of 1e-6 or lower
Weight after update (first row) (64, 9)
[ 0.07154967 -0.08432532  0.08641818 -0.02193821  0.13749473  0.08425266
 -0.04115245 -0.27120697  0.08096215]
[-0.0049137  -0.00500057 -0.00499138 -0.00491226 -0.00490116  0.00489506
 -0.00284088 -0.00171939  0.00485736]
Weight Update Error 1.3418982564751781e-06 should be on the order of 1e-6 or lower




We can now run some experiments on LunarLander with double DQN. You may see that double DQN performs slightly better and more stably, but because the variance across runs is very high, don't worry if you do not.

In [85]:
# Run with double DQN
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = True

# Delete all previous logs
remove_folder('logs/dqn/{}/double_dqn'.format(env_str))

for seed in range(3):
    print("Running DQN experiment with seed", seed)
    dqn_args['seed'] = seed
    dqn_args['logdir'] = 'logs/dqn/{}/double_dqn/seed{}'.format(env_str, seed)
    dqntrainer = DQN_Trainer(dqn_args)
    dqntrainer.run_training_loop()

Folder logs/dqn/LunarLander/double_dqn does not exist yet. No old results to delete
Running DQN experiment with seed 0
########################
logging outputs to  logs/dqn/LunarLander/double_dqn/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3


********** Iteration 0 ************

Training agent...

Beginning logging procedure...
Timestep 1
mean reward (100 episodes) nan
best mean reward -inf
running time 0.000730
Train_EnvstepsSoFar : 1
TimeSinceStart : 0.0007300376892089844
Done logging...








********** Iteration 1000 ************

Training agent...

Beginning logging procedure...
Timestep 1001
mean reward (100 episodes) -361.458927
best mean reward -inf
running time 0.657439
Train_EnvstepsSoFar : 1001
Train_AverageReturn : -361.4589273606985
TimeSinceStart : 0.6574389934539795
Done logging...




********** Iteration 2000 ************

Training agent...

Beginning logging procedure...
Timestep 2001
mean reward (100 episodes) -317.658504
best mean reward -inf
running time 4.847034
Train_EnvstepsSoFar : 2001
Train_AverageReturn : -317.6585039542492
TimeSinceStart : 4.847033977508545
Training Loss : 0.3789442181587219
Done logging...




********** Iteration 3000 ************

Training agent...

Beginning logging procedure...
Timestep 3001
mean reward (100 episodes) -298.225730
best mean reward -inf
running time 8.871307
Train_EnvstepsSoFar : 3001
Train_AverageReturn : -298.2257298425541
TimeSinceStart : 8.871306896209717
Training Loss : 0.24803029000759125
Done logging...




********** Iteration 24000 ************

Training agent...

Beginning logging procedure...
Timestep 24001
mean reward (100 episodes) -212.727138
best mean reward -212.727138
running time 109.342728
Train_EnvstepsSoFar : 24001
Train_AverageReturn : -212.7271384089546
Train_BestReturn : -212.7271384089546
TimeSinceStart : 109.34272789955139
Training Loss : 0.836423397064209
Done logging...




********** Iteration 25000 ************

Training agent...

Beginning logging procedure...
Timestep 25001
mean reward (100 episodes) -210.460493
best mean reward -210.460493
running time 114.993388
Train_EnvstepsSoFar : 25001
Train_AverageReturn : -210.46049303442564
Train_BestReturn : -210.46049303442564
TimeSinceStart : 114.99338793754578
Training Loss : 1.9687288999557495
Done logging...




********** Iteration 26000 ************

Training agent...

Beginning logging procedure...
Timestep 26001
mean reward (100 episodes) -210.516870
best mean reward -210.460493
running time 120.341567
...



********** Iteration 45000 ************

Training agent...

Beginning logging procedure...
Timestep 45001
mean reward (100 episodes) -178.152212
best mean reward -178.152212
running time 214.358650
Train_EnvstepsSoFar : 45001
Train_AverageReturn : -178.1522118803098
Train_BestReturn : -178.1522118803098
TimeSinceStart : 214.35864996910095
Training Loss : 0.4385693073272705
Done logging...




********** Iteration 46000 ************

Training agent...

Beginning logging procedure...
Timestep 46001
mean reward (100 episodes) -171.024686
best mean reward -171.024686
running time 219.908688
Train_EnvstepsSoFar : 46001
Train_AverageReturn : -171.0246862056383
Train_BestReturn : -171.0246862056383
TimeSinceStart : 219.9086878299713
Training Loss : 3.0320754051208496
Done logging...




********** Iteration 47000 ************

Training agent...

Beginning logging procedure...
Timestep 47001
mean reward (100 episodes) -169.350078
best mean reward -169.350078
running time 224.242533
...



********** Iteration 66000 ************

Training agent...

Beginning logging procedure...
Timestep 66001
mean reward (100 episodes) -130.446549
best mean reward -130.446549
running time 346.445589
Train_EnvstepsSoFar : 66001
Train_AverageReturn : -130.4465490782374
Train_BestReturn : -130.4465490782374
TimeSinceStart : 346.44558906555176
Training Loss : 0.29139989614486694
Done logging...




********** Iteration 67000 ************

Training agent...

Beginning logging procedure...
Timestep 67001
mean reward (100 episodes) -129.396502
best mean reward -129.396502
running time 353.440817
Train_EnvstepsSoFar : 67001
Train_AverageReturn : -129.3965021382186
Train_BestReturn : -129.3965021382186
TimeSinceStart : 353.44081687927246
Training Loss : 0.9827867746353149
Done logging...




********** Iteration 68000 ************

Training agent...

Beginning logging procedure...
Timestep 68001
mean reward (100 episodes) -128.893449
best mean reward -128.893449
running time 360.792584
...



********** Iteration 87000 ************

Training agent...

Beginning logging procedure...
Timestep 87001
mean reward (100 episodes) -86.418797
best mean reward -85.570402
running time 457.873509
Train_EnvstepsSoFar : 87001
Train_AverageReturn : -86.41879664527397
Train_BestReturn : -85.57040173237489
TimeSinceStart : 457.8735091686249
Training Loss : 0.29239621758461
Done logging...




********** Iteration 88000 ************

Training agent...

Beginning logging procedure...
Timestep 88001
mean reward (100 episodes) -83.353286
best mean reward -83.353286
running time 462.310347
Train_EnvstepsSoFar : 88001
Train_AverageReturn : -83.35328572454827
Train_BestReturn : -83.35328572454827
TimeSinceStart : 462.3103470802307
Training Loss : 0.40402737259864807
Done logging...




********** Iteration 89000 ************

Training agent...

Beginning logging procedure...
Timestep 89001
mean reward (100 episodes) -80.977610
best mean reward -80.977610
running time 466.754385
...



********** Iteration 108000 ************

Training agent...

Beginning logging procedure...
Timestep 108001
mean reward (100 episodes) -44.270229
best mean reward -43.786751
running time 558.282304
Train_EnvstepsSoFar : 108001
Train_AverageReturn : -44.27022919220976
Train_BestReturn : -43.78675074820081
TimeSinceStart : 558.2823040485382
Training Loss : 0.21593137085437775
Done logging...




********** Iteration 109000 ************

Training agent...

Beginning logging procedure...
Timestep 109001
mean reward (100 episodes) -44.664730
best mean reward -43.786751
running time 567.411525
Train_EnvstepsSoFar : 109001
Train_AverageReturn : -44.66473042235596
Train_BestReturn : -43.78675074820081
TimeSinceStart : 567.411524772644
Training Loss : 0.20420609414577484
Done logging...




********** Iteration 110000 ************

Training agent...

Beginning logging procedure...
Timestep 110001
mean reward (100 episodes) -43.211142
best mean reward -43.211142
running time 574.705217
...



********** Iteration 129000 ************

Training agent...

Beginning logging procedure...
Timestep 129001
mean reward (100 episodes) 3.618463
best mean reward 3.618463
running time 689.343130
Train_EnvstepsSoFar : 129001
Train_AverageReturn : 3.6184631740100484
Train_BestReturn : 3.6184631740100484
TimeSinceStart : 689.3431301116943
Training Loss : 0.177327498793602
Done logging...




********** Iteration 130000 ************

Training agent...

Beginning logging procedure...
Timestep 130001
mean reward (100 episodes) 8.750686
best mean reward 8.750686
running time 693.615366
Train_EnvstepsSoFar : 130001
Train_AverageReturn : 8.750686000995888
Train_BestReturn : 8.750686000995888
TimeSinceStart : 693.6153657436371
Training Loss : 0.14056003093719482
Done logging...




********** Iteration 131000 ************

Training agent...

Beginning logging procedure...
Timestep 131001
mean reward (100 episodes) 10.993360
best mean reward 10.993360
running time 698.961270
...



********** Iteration 150000 ************

Training agent...

Beginning logging procedure...
Timestep 150001
mean reward (100 episodes) 63.082970
best mean reward 63.789636
running time 783.525314
Train_EnvstepsSoFar : 150001
Train_AverageReturn : 63.08297012185266
Train_BestReturn : 63.789635644749524
TimeSinceStart : 783.5253140926361
Training Loss : 0.48043230175971985
Done logging...




********** Iteration 151000 ************

Training agent...

Beginning logging procedure...
Timestep 151001
mean reward (100 episodes) 63.678259
best mean reward 63.789636
running time 788.437517
Train_EnvstepsSoFar : 151001
Train_AverageReturn : 63.67825914676343
Train_BestReturn : 63.789635644749524
TimeSinceStart : 788.4375169277191
Training Loss : 2.581671953201294
Done logging...




********** Iteration 152000 ************

Training agent...

Beginning logging procedure...
Timestep 152001
mean reward (100 episodes) 68.387196
best mean reward 68.387196
running time 793.189515
...



********** Iteration 171000 ************

Training agent...

Beginning logging procedure...
Timestep 171001
mean reward (100 episodes) 77.904460
best mean reward 81.613048
running time 887.823826
Train_EnvstepsSoFar : 171001
Train_AverageReturn : 77.90446032370059
Train_BestReturn : 81.6130483360056
TimeSinceStart : 887.8238258361816
Training Loss : 0.13742202520370483
Done logging...




********** Iteration 172000 ************

Training agent...

Beginning logging procedure...
Timestep 172001
mean reward (100 episodes) 80.537242
best mean reward 81.613048
running time 891.702370
Train_EnvstepsSoFar : 172001
Train_AverageReturn : 80.53724208772857
Train_BestReturn : 81.6130483360056
TimeSinceStart : 891.70236992836
Training Loss : 1.0892399549484253
Done logging...




********** Iteration 173000 ************

Training agent...

Beginning logging procedure...
Timestep 173001
mean reward (100 episodes) 85.492784
best mean reward 85.492784
running time 895.421942
...



********** Iteration 192000 ************

Training agent...

Beginning logging procedure...
Timestep 192001
mean reward (100 episodes) 81.444715
best mean reward 88.808934
running time 980.953387
Train_EnvstepsSoFar : 192001
Train_AverageReturn : 81.44471493107243
Train_BestReturn : 88.80893372511214
TimeSinceStart : 980.9533870220184
Training Loss : 0.2208367884159088
Done logging...




********** Iteration 193000 ************

Training agent...

Beginning logging procedure...
Timestep 193001
mean reward (100 episodes) 80.468174
best mean reward 88.808934
running time 985.883024
Train_EnvstepsSoFar : 193001
Train_AverageReturn : 80.46817418477366
Train_BestReturn : 88.80893372511214
TimeSinceStart : 985.8830237388611
Training Loss : 0.12727606296539307
Done logging...




********** Iteration 194000 ************

Training agent...

Beginning logging procedure...
Timestep 194001
mean reward (100 episodes) 83.021849
best mean reward 88.808934
running time 990.602387
Train_EnvstepsSoF



********** Iteration 213000 ************

Training agent...

Beginning logging procedure...
Timestep 213001
mean reward (100 episodes) 82.521867
best mean reward 89.438188
running time 1085.445978
Train_EnvstepsSoFar : 213001
Train_AverageReturn : 82.52186746112051
Train_BestReturn : 89.4381881218353
TimeSinceStart : 1085.4459781646729
Training Loss : 1.537217140197754
Done logging...




********** Iteration 214000 ************

Training agent...

Beginning logging procedure...
Timestep 214001
mean reward (100 episodes) 83.611038
best mean reward 89.438188
running time 1089.582769
Train_EnvstepsSoFar : 214001
Train_AverageReturn : 83.61103764266446
Train_BestReturn : 89.4381881218353
TimeSinceStart : 1089.5827689170837
Training Loss : 0.3685391843318939
Done logging...




********** Iteration 215000 ************

Training agent...

Beginning logging procedure...
Timestep 215001
mean reward (100 episodes) 83.497457
best mean reward 89.438188
running time 1095.929100
Train_EnvstepsSo



********** Iteration 234000 ************

Training agent...

Beginning logging procedure...
Timestep 234001
mean reward (100 episodes) 86.957753
best mean reward 98.693026
running time 1179.956529
Train_EnvstepsSoFar : 234001
Train_AverageReturn : 86.95775260352875
Train_BestReturn : 98.69302625627078
TimeSinceStart : 1179.9565289020538
Training Loss : 0.21028867363929749
Done logging...




********** Iteration 235000 ************

Training agent...

Beginning logging procedure...
Timestep 235001
mean reward (100 episodes) 88.065522
best mean reward 98.693026
running time 1184.313819
Train_EnvstepsSoFar : 235001
Train_AverageReturn : 88.06552173353664
Train_BestReturn : 98.69302625627078
TimeSinceStart : 1184.3138189315796
Training Loss : 1.1236929893493652
Done logging...




********** Iteration 236000 ************

Training agent...

Beginning logging procedure...
Timestep 236001
mean reward (100 episodes) 90.545735
best mean reward 98.693026
running time 1188.012116
Train_Envste



********** Iteration 255000 ************

Training agent...

Beginning logging procedure...
Timestep 255001
mean reward (100 episodes) 99.434065
best mean reward 107.236988
running time 1277.984282
Train_EnvstepsSoFar : 255001
Train_AverageReturn : 99.43406482465804
Train_BestReturn : 107.23698763848921
TimeSinceStart : 1277.9842820167542
Training Loss : 0.237761452794075
Done logging...




********** Iteration 256000 ************

Training agent...

Beginning logging procedure...
Timestep 256001
mean reward (100 episodes) 98.374361
best mean reward 107.236988
running time 1285.116032
Train_EnvstepsSoFar : 256001
Train_AverageReturn : 98.3743606909325
Train_BestReturn : 107.23698763848921
TimeSinceStart : 1285.116031885147
Training Loss : 0.22871147096157074
Done logging...




********** Iteration 257000 ************

Training agent...

Beginning logging procedure...
Timestep 257001
mean reward (100 episodes) 102.761861
best mean reward 107.236988
running time 1289.461854
Train_Env



********** Iteration 276000 ************

Training agent...

Beginning logging procedure...
Timestep 276001
mean reward (100 episodes) 104.578314
best mean reward 107.912809
running time 1372.836519
Train_EnvstepsSoFar : 276001
Train_AverageReturn : 104.57831389261653
Train_BestReturn : 107.91280864283075
TimeSinceStart : 1372.8365187644958
Training Loss : 0.11376181244850159
Done logging...




********** Iteration 277000 ************

Training agent...

Beginning logging procedure...
Timestep 277001
mean reward (100 episodes) 104.801032
best mean reward 107.912809
running time 1376.724850
Train_EnvstepsSoFar : 277001
Train_AverageReturn : 104.80103226839668
Train_BestReturn : 107.91280864283075
TimeSinceStart : 1376.7248499393463
Training Loss : 1.3786461353302002
Done logging...




********** Iteration 278000 ************

Training agent...

Beginning logging procedure...
Timestep 278001
mean reward (100 episodes) 111.690154
best mean reward 111.690154
running time 1380.876976
Tr



********** Iteration 297000 ************

Training agent...

Beginning logging procedure...
Timestep 297001
mean reward (100 episodes) 136.851126
best mean reward 136.851126
running time 1480.807809
Train_EnvstepsSoFar : 297001
Train_AverageReturn : 136.8511261812021
Train_BestReturn : 136.8511261812021
TimeSinceStart : 1480.8078088760376
Training Loss : 0.474625825881958
Done logging...




********** Iteration 298000 ************

Training agent...

Beginning logging procedure...
Timestep 298001
mean reward (100 episodes) 133.411256
best mean reward 136.851126
running time 1484.645795
Train_EnvstepsSoFar : 298001
Train_AverageReturn : 133.4112561541634
Train_BestReturn : 136.8511261812021
TimeSinceStart : 1484.6457951068878
Training Loss : 0.8796972036361694
Done logging...




********** Iteration 299000 ************

Training agent...

Beginning logging procedure...
Timestep 299001
mean reward (100 episodes) 130.830482
best mean reward 136.851126
running time 1489.184239
Train_En



********** Iteration 318000 ************

Training agent...

Beginning logging procedure...
Timestep 318001
mean reward (100 episodes) 149.474139
best mean reward 151.294436
running time 1562.726366
Train_EnvstepsSoFar : 318001
Train_AverageReturn : 149.474138719292
Train_BestReturn : 151.294435537452
TimeSinceStart : 1562.7263658046722
Training Loss : 2.3065507411956787
Done logging...




********** Iteration 319000 ************

Training agent...

Beginning logging procedure...
Timestep 319001
mean reward (100 episodes) 143.346553
best mean reward 151.294436
running time 1566.554368
Train_EnvstepsSoFar : 319001
Train_AverageReturn : 143.34655288742172
Train_BestReturn : 151.294435537452
TimeSinceStart : 1566.554368019104
Training Loss : 1.1845017671585083
Done logging...




********** Iteration 320000 ************

Training agent...

Beginning logging procedure...
Timestep 320001
mean reward (100 episodes) 139.775496
best mean reward 151.294436
running time 1570.919694
Train_Envs



********** Iteration 339000 ************

Training agent...

Beginning logging procedure...
Timestep 339001
mean reward (100 episodes) 161.505905
best mean reward 166.167337
running time 1641.038642
Train_EnvstepsSoFar : 339001
Train_AverageReturn : 161.50590479405415
Train_BestReturn : 166.1673369362377
TimeSinceStart : 1641.0386419296265
Training Loss : 0.07412020862102509
Done logging...




********** Iteration 340000 ************

Training agent...

Beginning logging procedure...
Timestep 340001
mean reward (100 episodes) 164.608094
best mean reward 166.167337
running time 1644.660938
Train_EnvstepsSoFar : 340001
Train_AverageReturn : 164.6080942095437
Train_BestReturn : 166.1673369362377
TimeSinceStart : 1644.6609380245209
Training Loss : 0.6686381101608276
Done logging...




********** Iteration 341000 ************

Training agent...

Beginning logging procedure...
Timestep 341001
mean reward (100 episodes) 166.118038
best mean reward 166.167337
running time 1648.378166
Train



********** Iteration 360000 ************

Training agent...

Beginning logging procedure...
Timestep 360001
mean reward (100 episodes) 139.977337
best mean reward 169.148691
running time 1716.966740
Train_EnvstepsSoFar : 360001
Train_AverageReturn : 139.97733737958524
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1716.9667398929596
Training Loss : 0.11134292185306549
Done logging...




********** Iteration 361000 ************

Training agent...

Beginning logging procedure...
Timestep 361001
mean reward (100 episodes) 131.371576
best mean reward 169.148691
running time 1720.530148
Train_EnvstepsSoFar : 361001
Train_AverageReturn : 131.37157584868544
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1720.5301480293274
Training Loss : 1.143958568572998
Done logging...




********** Iteration 362000 ************

Training agent...

Beginning logging procedure...
Timestep 362001
mean reward (100 episodes) 121.412854
best mean reward 169.148691
running time 1724.113763
Tra



********** Iteration 381000 ************

Training agent...

Beginning logging procedure...
Timestep 381001
mean reward (100 episodes) 69.264997
best mean reward 169.148691
running time 1793.120000
Train_EnvstepsSoFar : 381001
Train_AverageReturn : 69.2649970978801
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1793.119999885559
Training Loss : 0.11849690973758698
Done logging...




********** Iteration 382000 ************

Training agent...

Beginning logging procedure...
Timestep 382001
mean reward (100 episodes) 71.126934
best mean reward 169.148691
running time 1796.738592
Train_EnvstepsSoFar : 382001
Train_AverageReturn : 71.1269335044984
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1796.7385919094086
Training Loss : 0.9562792778015137
Done logging...




********** Iteration 383000 ************

Training agent...

Beginning logging procedure...
Timestep 383001
mean reward (100 episodes) 67.230597
best mean reward 169.148691
running time 1800.297986
Train_Envs



********** Iteration 402000 ************

Training agent...

Beginning logging procedure...
Timestep 402001
mean reward (100 episodes) 98.661562
best mean reward 169.148691
running time 1868.213466
Train_EnvstepsSoFar : 402001
Train_AverageReturn : 98.66156161711189
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1868.21346616745
Training Loss : 0.2660028636455536
Done logging...




********** Iteration 403000 ************

Training agent...

Beginning logging procedure...
Timestep 403001
mean reward (100 episodes) 90.087953
best mean reward 169.148691
running time 1871.822812
Train_EnvstepsSoFar : 403001
Train_AverageReturn : 90.08795270062663
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1871.8228118419647
Training Loss : 0.27063509821891785
Done logging...




********** Iteration 404000 ************

Training agent...

Beginning logging procedure...
Timestep 404001
mean reward (100 episodes) 90.878493
best mean reward 169.148691
running time 1875.360139
Train_Env



********** Iteration 423000 ************

Training agent...

Beginning logging procedure...
Timestep 423001
mean reward (100 episodes) 87.448875
best mean reward 169.148691
running time 1943.402298
Train_EnvstepsSoFar : 423001
Train_AverageReturn : 87.44887546022335
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1943.4022979736328
Training Loss : 0.33945316076278687
Done logging...




********** Iteration 424000 ************

Training agent...

Beginning logging procedure...
Timestep 424001
mean reward (100 episodes) 82.771252
best mean reward 169.148691
running time 1946.988360
Train_EnvstepsSoFar : 424001
Train_AverageReturn : 82.77125160958484
Train_BestReturn : 169.14869073015078
TimeSinceStart : 1946.988359928131
Training Loss : 0.15707272291183472
Done logging...




********** Iteration 425000 ************

Training agent...

Beginning logging procedure...
Timestep 425001
mean reward (100 episodes) 80.098098
best mean reward 169.148691
running time 1950.526788
Train_E



********** Iteration 444000 ************

Training agent...

Beginning logging procedure...
Timestep 444001
mean reward (100 episodes) 85.511613
best mean reward 169.148691
running time 2022.321227
Train_EnvstepsSoFar : 444001
Train_AverageReturn : 85.51161274649935
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2022.3212270736694
Training Loss : 0.18655993044376373
Done logging...




********** Iteration 445000 ************

Training agent...

Beginning logging procedure...
Timestep 445001
mean reward (100 episodes) 91.804273
best mean reward 169.148691
running time 2025.989278
Train_EnvstepsSoFar : 445001
Train_AverageReturn : 91.80427319868969
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2025.9892778396606
Training Loss : 0.29468584060668945
Done logging...




********** Iteration 446000 ************

Training agent...

Beginning logging procedure...
Timestep 446001
mean reward (100 episodes) 87.624038
best mean reward 169.148691
running time 2029.584872
Train_



********** Iteration 465000 ************

Training agent...

Beginning logging procedure...
Timestep 465001
mean reward (100 episodes) 50.697470
best mean reward 169.148691
running time 2098.555534
Train_EnvstepsSoFar : 465001
Train_AverageReturn : 50.69747000979316
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2098.5555341243744
Training Loss : 1.5802923440933228
Done logging...




********** Iteration 466000 ************

Training agent...

Beginning logging procedure...
Timestep 466001
mean reward (100 episodes) 50.822402
best mean reward 169.148691
running time 2102.387278
Train_EnvstepsSoFar : 466001
Train_AverageReturn : 50.822402204353885
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2102.3872780799866
Training Loss : 1.7395257949829102
Done logging...




********** Iteration 467000 ************

Training agent...

Beginning logging procedure...
Timestep 467001
mean reward (100 episodes) 41.756782
best mean reward 169.148691
running time 2106.004713
Train_E



********** Iteration 486000 ************

Training agent...

Beginning logging procedure...
Timestep 486001
mean reward (100 episodes) 19.656990
best mean reward 169.148691
running time 2176.510720
Train_EnvstepsSoFar : 486001
Train_AverageReturn : 19.656989911676916
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2176.510720014572
Training Loss : 0.5926234722137451
Done logging...




********** Iteration 487000 ************

Training agent...

Beginning logging procedure...
Timestep 487001
mean reward (100 episodes) 10.868733
best mean reward 169.148691
running time 2180.277053
Train_EnvstepsSoFar : 487001
Train_AverageReturn : 10.868732564439288
Train_BestReturn : 169.14869073015078
TimeSinceStart : 2180.2770528793335
Training Loss : 0.4942762851715088
Done logging...




********** Iteration 488000 ************

Training agent...

Beginning logging procedure...
Timestep 488001
mean reward (100 episodes) 14.390385
best mean reward 169.148691
running time 2184.009177
Train_E

Train_EnvstepsSoFar : 7001
Train_AverageReturn : -253.7136636222047
TimeSinceStart : 26.185008764266968
Training Loss : 1.1927947998046875
Done logging...




********** Iteration 8000 ************

Training agent...

Beginning logging procedure...
Timestep 8001
mean reward (100 episodes) -246.240848
best mean reward -inf
running time 29.905900
Train_EnvstepsSoFar : 8001
Train_AverageReturn : -246.2408483433308
TimeSinceStart : 29.90590000152588
Training Loss : 3.269710063934326
Done logging...




********** Iteration 9000 ************

Training agent...

Beginning logging procedure...
Timestep 9001
mean reward (100 episodes) -240.547197
best mean reward -inf
running time 33.820899
Train_EnvstepsSoFar : 9001
Train_AverageReturn : -240.54719675806547
TimeSinceStart : 33.82089877128601
Training Loss : 3.5026354789733887
Done logging...




********** Iteration 10000 ************

Training agent...

Beginning logging procedure...
Timestep 10001
mean reward (100 episodes) -235.280430
best

Train_EnvstepsSoFar : 29001
Train_AverageReturn : -163.04587396234515
Train_BestReturn : -163.04587396234515
TimeSinceStart : 135.31939578056335
Training Loss : 0.49166059494018555
Done logging...




********** Iteration 30000 ************

Training agent...

Beginning logging procedure...
Timestep 30001
mean reward (100 episodes) -160.602186
best mean reward -160.602186
running time 140.871974
Train_EnvstepsSoFar : 30001
Train_AverageReturn : -160.60218611198002
Train_BestReturn : -160.60218611198002
TimeSinceStart : 140.87197375297546
Training Loss : 0.5718457102775574
Done logging...




********** Iteration 31000 ************

Training agent...

Beginning logging procedure...
Timestep 31001
mean reward (100 episodes) -159.587706
best mean reward -159.587706
running time 145.942649
Train_EnvstepsSoFar : 31001
Train_AverageReturn : -159.5877062871407
Train_BestReturn : -159.5877062871407
TimeSinceStart : 145.9426486492157
Training Loss : 0.5407890677452087
Done logging...




******

Train_EnvstepsSoFar : 50001
Train_AverageReturn : -130.66403034153416
Train_BestReturn : -130.66403034153416
TimeSinceStart : 244.38995575904846
Training Loss : 1.0754667520523071
Done logging...




********** Iteration 51000 ************

Training agent...

Beginning logging procedure...
Timestep 51001
mean reward (100 episodes) -130.562132
best mean reward -130.562132
running time 249.089722
Train_EnvstepsSoFar : 51001
Train_AverageReturn : -130.56213235136025
Train_BestReturn : -130.56213235136025
TimeSinceStart : 249.0897216796875
Training Loss : 0.7338390350341797
Done logging...




********** Iteration 52000 ************

Training agent...

Beginning logging procedure...
Timestep 52001
mean reward (100 episodes) -130.521569
best mean reward -130.521569
running time 254.160743
Train_EnvstepsSoFar : 52001
Train_AverageReturn : -130.52156942601476
Train_BestReturn : -130.52156942601476
TimeSinceStart : 254.1607427597046
Training Loss : 2.385704755783081
Done logging...




*******

Train_EnvstepsSoFar : 71001
Train_AverageReturn : -111.08238003941439
Train_BestReturn : -110.67847287271393
TimeSinceStart : 349.6446976661682
Training Loss : 0.22718748450279236
Done logging...




********** Iteration 72000 ************

Training agent...

Beginning logging procedure...
Timestep 72001
mean reward (100 episodes) -110.198487
best mean reward -110.198487
running time 355.627127
Train_EnvstepsSoFar : 72001
Train_AverageReturn : -110.19848720294257
Train_BestReturn : -110.19848720294257
TimeSinceStart : 355.62712693214417
Training Loss : 0.16120541095733643
Done logging...




********** Iteration 73000 ************

Training agent...

Beginning logging procedure...
Timestep 73001
mean reward (100 episodes) -109.561503
best mean reward -109.561503
running time 361.015009
Train_EnvstepsSoFar : 73001
Train_AverageReturn : -109.56150338952753
Train_BestReturn : -109.56150338952753
TimeSinceStart : 361.0150089263916
Training Loss : 0.16836439073085785
Done logging...




***

Train_EnvstepsSoFar : 92001
Train_AverageReturn : -85.7552297919261
Train_BestReturn : -85.7552297919261
TimeSinceStart : 451.9696125984192
Training Loss : 0.14192558825016022
Done logging...




********** Iteration 93000 ************

Training agent...

Beginning logging procedure...
Timestep 93001
mean reward (100 episodes) -83.696931
best mean reward -83.696931
running time 457.148394
Train_EnvstepsSoFar : 93001
Train_AverageReturn : -83.6969311983095
Train_BestReturn : -83.6969311983095
TimeSinceStart : 457.1483938694
Training Loss : 0.14871010184288025
Done logging...




********** Iteration 94000 ************

Training agent...

Beginning logging procedure...
Timestep 94001
mean reward (100 episodes) -82.290433
best mean reward -82.290433
running time 461.528254
Train_EnvstepsSoFar : 94001
Train_AverageReturn : -82.29043272408907
Train_BestReturn : -82.29043272408907
TimeSinceStart : 461.52825379371643
Training Loss : 0.1443696916103363
Done logging...




********** Iteration 

Train_EnvstepsSoFar : 113001
Train_AverageReturn : -38.94013941648632
Train_BestReturn : -38.94013941648632
TimeSinceStart : 573.4838757514954
Training Loss : 0.19876936078071594
Done logging...




********** Iteration 114000 ************

Training agent...

Beginning logging procedure...
Timestep 114001
mean reward (100 episodes) -37.702094
best mean reward -37.702094
running time 579.240916
Train_EnvstepsSoFar : 114001
Train_AverageReturn : -37.70209443447619
Train_BestReturn : -37.70209443447619
TimeSinceStart : 579.2409157752991
Training Loss : 0.1421586126089096
Done logging...




********** Iteration 115000 ************

Training agent...

Beginning logging procedure...
Timestep 115001
mean reward (100 episodes) -35.762724
best mean reward -35.762724
running time 584.613102
Train_EnvstepsSoFar : 115001
Train_AverageReturn : -35.76272432745173
Train_BestReturn : -35.76272432745173
TimeSinceStart : 584.6131017208099
Training Loss : 0.16620247066020966
Done logging...




********

Train_EnvstepsSoFar : 134001
Train_AverageReturn : 6.295530160449896
Train_BestReturn : 6.295530160449896
TimeSinceStart : 680.7950809001923
Training Loss : 0.1001020148396492
Done logging...




********** Iteration 135000 ************

Training agent...

Beginning logging procedure...
Timestep 135001
mean reward (100 episodes) 8.128076
best mean reward 8.128076
running time 685.277220
Train_EnvstepsSoFar : 135001
Train_AverageReturn : 8.128076282953847
Train_BestReturn : 8.128076282953847
TimeSinceStart : 685.2772197723389
Training Loss : 0.35216158628463745
Done logging...




********** Iteration 136000 ************

Training agent...

Beginning logging procedure...
Timestep 136001
mean reward (100 episodes) 9.296009
best mean reward 9.296009
running time 691.534290
Train_EnvstepsSoFar : 136001
Train_AverageReturn : 9.29600921257809
Train_BestReturn : 9.29600921257809
TimeSinceStart : 691.5342898368835
Training Loss : 0.13239380717277527
Done logging...




********** Iteration 137

Train_EnvstepsSoFar : 155001
Train_AverageReturn : 49.52842624477105
Train_BestReturn : 49.74569216749692
TimeSinceStart : 784.9376657009125
Training Loss : 0.056239739060401917
Done logging...




********** Iteration 156000 ************

Training agent...

Beginning logging procedure...
Timestep 156001
mean reward (100 episodes) 50.155182
best mean reward 50.155182
running time 790.992136
Train_EnvstepsSoFar : 156001
Train_AverageReturn : 50.15518166562048
Train_BestReturn : 50.15518166562048
TimeSinceStart : 790.9921357631683
Training Loss : 0.7790929079055786
Done logging...




********** Iteration 157000 ************

Training agent...

Beginning logging procedure...
Timestep 157001
mean reward (100 episodes) 48.583750
best mean reward 50.155182
running time 797.817924
Train_EnvstepsSoFar : 157001
Train_AverageReturn : 48.58374973339592
Train_BestReturn : 50.15518166562048
TimeSinceStart : 797.817923784256
Training Loss : 0.691045343875885
Done logging...




********** Iteration

Train_EnvstepsSoFar : 176001
Train_AverageReturn : 53.11007901617103
Train_BestReturn : 58.38030116510297
TimeSinceStart : 897.9791738986969
Training Loss : 0.1825830638408661
Done logging...




********** Iteration 177000 ************

Training agent...

Beginning logging procedure...
Timestep 177001
mean reward (100 episodes) 52.856909
best mean reward 58.380301
running time 903.195857
Train_EnvstepsSoFar : 177001
Train_AverageReturn : 52.85690922973777
Train_BestReturn : 58.38030116510297
TimeSinceStart : 903.1958568096161
Training Loss : 0.1502504199743271
Done logging...




********** Iteration 178000 ************

Training agent...

Beginning logging procedure...
Timestep 178001
mean reward (100 episodes) 53.127931
best mean reward 58.380301
running time 908.406853
Train_EnvstepsSoFar : 178001
Train_AverageReturn : 53.12793065028716
Train_BestReturn : 58.38030116510297
TimeSinceStart : 908.406852722168
Training Loss : 0.2219274491071701
Done logging...




********** Iteration 

Train_EnvstepsSoFar : 197001
Train_AverageReturn : 54.95821160839641
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1002.2253828048706
Training Loss : 0.05537428706884384
Done logging...




********** Iteration 198000 ************

Training agent...

Beginning logging procedure...
Timestep 198001
mean reward (100 episodes) 53.980004
best mean reward 61.753721
running time 1007.067003
Train_EnvstepsSoFar : 198001
Train_AverageReturn : 53.9800041279502
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1007.0670027732849
Training Loss : 0.10930000245571136
Done logging...




********** Iteration 199000 ************

Training agent...

Beginning logging procedure...
Timestep 199001
mean reward (100 episodes) 51.660589
best mean reward 61.753721
running time 1012.469322
Train_EnvstepsSoFar : 199001
Train_AverageReturn : 51.6605892481416
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1012.4693219661713
Training Loss : 0.409710168838501
Done logging...




********** Itera

Train_EnvstepsSoFar : 218001
Train_AverageReturn : 43.430950071762084
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1104.959930896759
Training Loss : 0.07710558921098709
Done logging...




********** Iteration 219000 ************

Training agent...

Beginning logging procedure...
Timestep 219001
mean reward (100 episodes) 40.370862
best mean reward 61.753721
running time 1110.311688
Train_EnvstepsSoFar : 219001
Train_AverageReturn : 40.37086216611678
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1110.3116879463196
Training Loss : 0.10148104280233383
Done logging...




********** Iteration 220000 ************

Training agent...

Beginning logging procedure...
Timestep 220001
mean reward (100 episodes) 39.816360
best mean reward 61.753721
running time 1115.372871
Train_EnvstepsSoFar : 220001
Train_AverageReturn : 39.81636017693489
Train_BestReturn : 61.75372100715494
TimeSinceStart : 1115.3728709220886
Training Loss : 0.03826781362295151
Done logging...




********** I

Train_EnvstepsSoFar : 239001
Train_AverageReturn : 59.39432851196456
Train_BestReturn : 64.03175465080814
TimeSinceStart : 1207.1610958576202
Training Loss : 0.385295033454895
Done logging...




********** Iteration 240000 ************

Training agent...

Beginning logging procedure...
Timestep 240001
mean reward (100 episodes) 59.844980
best mean reward 64.031755
running time 1211.080508
Train_EnvstepsSoFar : 240001
Train_AverageReturn : 59.84497960258777
Train_BestReturn : 64.03175465080814
TimeSinceStart : 1211.0805077552795
Training Loss : 0.07036439329385757
Done logging...




********** Iteration 241000 ************

Training agent...

Beginning logging procedure...
Timestep 241001
mean reward (100 episodes) 63.080044
best mean reward 64.031755
running time 1214.940825
Train_EnvstepsSoFar : 241001
Train_AverageReturn : 63.08004375250173
Train_BestReturn : 64.03175465080814
TimeSinceStart : 1214.9408247470856
Training Loss : 0.134998619556427
Done logging...




********** Itera

Train_EnvstepsSoFar : 260001
Train_AverageReturn : 106.25454243877869
Train_BestReturn : 106.25454243877869
TimeSinceStart : 1303.052613735199
Training Loss : 0.09705817699432373
Done logging...




********** Iteration 261000 ************

Training agent...

Beginning logging procedure...
Timestep 261001
mean reward (100 episodes) 108.439437
best mean reward 108.439437
running time 1307.840602
Train_EnvstepsSoFar : 261001
Train_AverageReturn : 108.43943674516628
Train_BestReturn : 108.43943674516628
TimeSinceStart : 1307.840601682663
Training Loss : 0.2091386616230011
Done logging...




********** Iteration 262000 ************

Training agent...

Beginning logging procedure...
Timestep 262001
mean reward (100 episodes) 111.434224
best mean reward 111.434224
running time 1311.541255
Train_EnvstepsSoFar : 262001
Train_AverageReturn : 111.43422388194847
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1311.5412547588348
Training Loss : 0.07927919179201126
Done logging...




*****

Train_EnvstepsSoFar : 281001
Train_AverageReturn : 98.21304623164588
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1394.2537689208984
Training Loss : 0.4301981031894684
Done logging...




********** Iteration 282000 ************

Training agent...

Beginning logging procedure...
Timestep 282001
mean reward (100 episodes) 98.115374
best mean reward 111.434224
running time 1398.831038
Train_EnvstepsSoFar : 282001
Train_AverageReturn : 98.11537440352237
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1398.8310377597809
Training Loss : 0.04977509751915932
Done logging...




********** Iteration 283000 ************

Training agent...

Beginning logging procedure...
Timestep 283001
mean reward (100 episodes) 95.838155
best mean reward 111.434224
running time 1403.351485
Train_EnvstepsSoFar : 283001
Train_AverageReturn : 95.83815534826698
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1403.3514847755432
Training Loss : 2.5616588592529297
Done logging...




*********

Train_EnvstepsSoFar : 302001
Train_AverageReturn : 66.25788343540682
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1477.6686577796936
Training Loss : 0.1646890640258789
Done logging...




********** Iteration 303000 ************

Training agent...

Beginning logging procedure...
Timestep 303001
mean reward (100 episodes) 72.096381
best mean reward 111.434224
running time 1481.427359
Train_EnvstepsSoFar : 303001
Train_AverageReturn : 72.09638082603392
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1481.427358865738
Training Loss : 0.14765644073486328
Done logging...




********** Iteration 304000 ************

Training agent...

Beginning logging procedure...
Timestep 304001
mean reward (100 episodes) 78.520576
best mean reward 111.434224
running time 1485.230649
Train_EnvstepsSoFar : 304001
Train_AverageReturn : 78.52057607296659
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1485.2306487560272
Training Loss : 0.6232301592826843
Done logging...




**********

Train_EnvstepsSoFar : 323001
Train_AverageReturn : 81.36163598072828
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1577.4957387447357
Training Loss : 0.5112910866737366
Done logging...




********** Iteration 324000 ************

Training agent...

Beginning logging procedure...
Timestep 324001
mean reward (100 episodes) 85.289433
best mean reward 111.434224
running time 1584.111949
Train_EnvstepsSoFar : 324001
Train_AverageReturn : 85.28943327155476
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1584.1119487285614
Training Loss : 2.316347360610962
Done logging...




********** Iteration 325000 ************

Training agent...

Beginning logging procedure...
Timestep 325001
mean reward (100 episodes) 83.123821
best mean reward 111.434224
running time 1595.092228
Train_EnvstepsSoFar : 325001
Train_AverageReturn : 83.1238212953312
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1595.0922276973724
Training Loss : 0.21583746373653412
Done logging...




********** 

Train_EnvstepsSoFar : 344001
Train_AverageReturn : 80.03917779669517
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1706.9581990242004
Training Loss : 0.44380971789360046
Done logging...




********** Iteration 345000 ************

Training agent...

Beginning logging procedure...
Timestep 345001
mean reward (100 episodes) 77.978689
best mean reward 111.434224
running time 1712.299788
Train_EnvstepsSoFar : 345001
Train_AverageReturn : 77.97868928657584
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1712.2997879981995
Training Loss : 0.13271623849868774
Done logging...




********** Iteration 346000 ************

Training agent...

Beginning logging procedure...
Timestep 346001
mean reward (100 episodes) 78.895388
best mean reward 111.434224
running time 1717.729288
Train_EnvstepsSoFar : 346001
Train_AverageReturn : 78.89538834733906
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1717.7292878627777
Training Loss : 0.49396106600761414
Done logging...




*******

Train_EnvstepsSoFar : 365001
Train_AverageReturn : 79.09499866636916
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1813.6853487491608
Training Loss : 0.26603585481643677
Done logging...




********** Iteration 366000 ************

Training agent...

Beginning logging procedure...
Timestep 366001
mean reward (100 episodes) 78.809102
best mean reward 111.434224
running time 1820.726495
Train_EnvstepsSoFar : 366001
Train_AverageReturn : 78.80910180035778
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1820.7264947891235
Training Loss : 0.10514353960752487
Done logging...




********** Iteration 367000 ************

Training agent...

Beginning logging procedure...
Timestep 367001
mean reward (100 episodes) 82.094490
best mean reward 111.434224
running time 1824.582403
Train_EnvstepsSoFar : 367001
Train_AverageReturn : 82.09449039687364
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1824.5824027061462
Training Loss : 0.21501457691192627
Done logging...




*******

Train_EnvstepsSoFar : 386001
Train_AverageReturn : 72.8212636268294
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1916.5738077163696
Training Loss : 0.15756593644618988
Done logging...




********** Iteration 387000 ************

Training agent...

Beginning logging procedure...
Timestep 387001
mean reward (100 episodes) 73.700423
best mean reward 111.434224
running time 1920.509021
Train_EnvstepsSoFar : 387001
Train_AverageReturn : 73.70042320303729
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1920.509020805359
Training Loss : 0.13601134717464447
Done logging...




********** Iteration 388000 ************

Training agent...

Beginning logging procedure...
Timestep 388001
mean reward (100 episodes) 72.818299
best mean reward 111.434224
running time 1924.430449
Train_EnvstepsSoFar : 388001
Train_AverageReturn : 72.81829933177663
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1924.430448770523
Training Loss : 0.06638436019420624
Done logging...




**********

Train_EnvstepsSoFar : 407001
Train_AverageReturn : 69.1687133411388
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2011.8274908065796
Training Loss : 0.10528846085071564
Done logging...




********** Iteration 408000 ************

Training agent...

Beginning logging procedure...
Timestep 408001
mean reward (100 episodes) 69.187521
best mean reward 111.434224
running time 2016.606773
Train_EnvstepsSoFar : 408001
Train_AverageReturn : 69.18752065640848
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2016.6067728996277
Training Loss : 0.26196619868278503
Done logging...




********** Iteration 409000 ************

Training agent...

Beginning logging procedure...
Timestep 409001
mean reward (100 episodes) 70.039205
best mean reward 111.434224
running time 2020.658957
Train_EnvstepsSoFar : 409001
Train_AverageReturn : 70.03920526344868
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2020.6589567661285
Training Loss : 0.5282951593399048
Done logging...




*********

Train_EnvstepsSoFar : 428001
Train_AverageReturn : 87.69342333525545
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2109.064951658249
Training Loss : 0.14004671573638916
Done logging...




********** Iteration 429000 ************

Training agent...

Beginning logging procedure...
Timestep 429001
mean reward (100 episodes) 84.776784
best mean reward 111.434224
running time 2114.478986
Train_EnvstepsSoFar : 429001
Train_AverageReturn : 84.77678350368772
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2114.478985786438
Training Loss : 0.0845358818769455
Done logging...




********** Iteration 430000 ************

Training agent...

Beginning logging procedure...
Timestep 430001
mean reward (100 episodes) 87.272257
best mean reward 111.434224
running time 2118.739276
Train_EnvstepsSoFar : 430001
Train_AverageReturn : 87.27225709287
Train_BestReturn : 111.43422388194847
TimeSinceStart : 2118.739275932312
Training Loss : 0.184231698513031
Done logging...




********** Itera

Train_EnvstepsSoFar : 449001
Train_AverageReturn : 134.17577720157288
Train_BestReturn : 135.79901615427525
TimeSinceStart : 2217.778963804245
Training Loss : 0.11487796157598495
Done logging...




********** Iteration 450000 ************

Training agent...

Beginning logging procedure...
Timestep 450001
mean reward (100 episodes) 134.609376
best mean reward 135.799016
running time 2223.213791
Train_EnvstepsSoFar : 450001
Train_AverageReturn : 134.60937637193254
Train_BestReturn : 135.79901615427525
TimeSinceStart : 2223.2137908935547
Training Loss : 0.09704088419675827
Done logging...




********** Iteration 451000 ************

Training agent...

Beginning logging procedure...
Timestep 451001
mean reward (100 episodes) 136.345846
best mean reward 136.345846
running time 2227.517277
Train_EnvstepsSoFar : 451001
Train_AverageReturn : 136.34584640560243
Train_BestReturn : 136.34584640560243
TimeSinceStart : 2227.517276763916
Training Loss : 1.4264951944351196
Done logging...




*****

Train_EnvstepsSoFar : 470001
Train_AverageReturn : 140.98946209107322
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2323.90580868721
Training Loss : 0.10000870376825333
Done logging...




********** Iteration 471000 ************

Training agent...

Beginning logging procedure...
Timestep 471001
mean reward (100 episodes) 141.617317
best mean reward 150.378865
running time 2328.025901
Train_EnvstepsSoFar : 471001
Train_AverageReturn : 141.6173165470646
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2328.0259006023407
Training Loss : 0.36122339963912964
Done logging...




********** Iteration 472000 ************

Training agent...

Beginning logging procedure...
Timestep 472001
mean reward (100 episodes) 142.111187
best mean reward 150.378865
running time 2331.944759
Train_EnvstepsSoFar : 472001
Train_AverageReturn : 142.11118718311081
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2331.9447588920593
Training Loss : 1.2600679397583008
Done logging...




*********

Train_EnvstepsSoFar : 491001
Train_AverageReturn : 137.50576119274183
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2421.9307997226715
Training Loss : 0.1173701137304306
Done logging...




********** Iteration 492000 ************

Training agent...

Beginning logging procedure...
Timestep 492001
mean reward (100 episodes) 134.681733
best mean reward 150.378865
running time 2425.681937
Train_EnvstepsSoFar : 492001
Train_AverageReturn : 134.68173333976395
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2425.6819367408752
Training Loss : 0.0847950279712677
Done logging...




********** Iteration 493000 ************

Training agent...

Beginning logging procedure...
Timestep 493001
mean reward (100 episodes) 142.151452
best mean reward 150.378865
running time 2430.201422
Train_EnvstepsSoFar : 493001
Train_AverageReturn : 142.15145228803226
Train_BestReturn : 150.3788654123014
TimeSinceStart : 2430.201421737671
Training Loss : 0.14168283343315125
Done logging...




********

Train_EnvstepsSoFar : 13001
Train_AverageReturn : -220.39655579400207
TimeSinceStart : 46.53973984718323
Training Loss : 1.612704873085022
Done logging...




********** Iteration 14000 ************

Training agent...

Beginning logging procedure...
Timestep 14001
mean reward (100 episodes) -218.416954
best mean reward -inf
running time 52.195799
Train_EnvstepsSoFar : 14001
Train_AverageReturn : -218.41695441379497
TimeSinceStart : 52.195799112319946
Training Loss : 6.472461700439453
Done logging...




********** Iteration 15000 ************

Training agent...

Beginning logging procedure...
Timestep 15001
mean reward (100 episodes) -220.200848
best mean reward -inf
running time 56.910662
Train_EnvstepsSoFar : 15001
Train_AverageReturn : -220.20084823536314
TimeSinceStart : 56.910661935806274
Training Loss : 1.5414334535598755
Done logging...




********** Iteration 16000 ************

Training agent...

Beginning logging procedure...
Timestep 16001
mean reward (100 episodes) -226.90

Train_EnvstepsSoFar : 35001
Train_AverageReturn : -180.26678982527534
Train_BestReturn : -180.26678982527534
TimeSinceStart : 152.64046382904053
Training Loss : 0.8518844246864319
Done logging...




********** Iteration 36000 ************

Training agent...

Beginning logging procedure...
Timestep 36001
mean reward (100 episodes) -179.460022
best mean reward -179.460022
running time 157.701024
Train_EnvstepsSoFar : 36001
Train_AverageReturn : -179.46002160208693
Train_BestReturn : -179.46002160208693
TimeSinceStart : 157.70102405548096
Training Loss : 1.1785967350006104
Done logging...




********** Iteration 37000 ************

Training agent...

Beginning logging procedure...
Timestep 37001
mean reward (100 episodes) -179.316802
best mean reward -179.316802
running time 162.562716
Train_EnvstepsSoFar : 37001
Train_AverageReturn : -179.31680229048817
Train_BestReturn : -179.31680229048817
TimeSinceStart : 162.56271600723267
Training Loss : 0.33487430214881897
Done logging...




***

Train_EnvstepsSoFar : 56001
Train_AverageReturn : -158.10126358991678
Train_BestReturn : -158.10126358991678
TimeSinceStart : 252.4087109565735
Training Loss : 0.5487604737281799
Done logging...




********** Iteration 57000 ************

Training agent...

Beginning logging procedure...
Timestep 57001
mean reward (100 episodes) -154.090605
best mean reward -154.090605
running time 257.358698
Train_EnvstepsSoFar : 57001
Train_AverageReturn : -154.09060468124144
Train_BestReturn : -154.09060468124144
TimeSinceStart : 257.35869812965393
Training Loss : 0.2686486840248108
Done logging...




********** Iteration 58000 ************

Training agent...

Beginning logging procedure...
Timestep 58001
mean reward (100 episodes) -149.077859
best mean reward -149.077859
running time 262.520087
Train_EnvstepsSoFar : 58001
Train_AverageReturn : -149.07785902419675
Train_BestReturn : -149.07785902419675
TimeSinceStart : 262.5200870037079
Training Loss : 0.3289509117603302
Done logging...




******

Train_EnvstepsSoFar : 77001
Train_AverageReturn : -137.99554585694568
Train_BestReturn : -137.99554585694568
TimeSinceStart : 356.94683599472046
Training Loss : 0.5645427703857422
Done logging...




********** Iteration 78000 ************

Training agent...

Beginning logging procedure...
Timestep 78001
mean reward (100 episodes) -135.948716
best mean reward -135.948716
running time 361.933619
Train_EnvstepsSoFar : 78001
Train_AverageReturn : -135.94871619547345
Train_BestReturn : -135.94871619547345
TimeSinceStart : 361.9336190223694
Training Loss : 0.2336585521697998
Done logging...




********** Iteration 79000 ************

Training agent...

Beginning logging procedure...
Timestep 79001
mean reward (100 episodes) -135.872690
best mean reward -135.872690
running time 368.712294
Train_EnvstepsSoFar : 79001
Train_AverageReturn : -135.87268981213845
Train_BestReturn : -135.87268981213845
TimeSinceStart : 368.7122938632965
Training Loss : 0.22659209370613098
Done logging...




*****

Train_EnvstepsSoFar : 98001
Train_AverageReturn : -74.04644722581939
Train_BestReturn : -74.04644722581939
TimeSinceStart : 465.6822350025177
Training Loss : 1.1019253730773926
Done logging...




********** Iteration 99000 ************

Training agent...

Beginning logging procedure...
Timestep 99001
mean reward (100 episodes) -72.458857
best mean reward -72.458857
running time 470.549030
Train_EnvstepsSoFar : 99001
Train_AverageReturn : -72.45885658505051
Train_BestReturn : -72.45885658505051
TimeSinceStart : 470.5490298271179
Training Loss : 1.1360039710998535
Done logging...




********** Iteration 100000 ************

Training agent...

Beginning logging procedure...
Timestep 100001
mean reward (100 episodes) -70.307850
best mean reward -70.307850
running time 475.009503
Train_EnvstepsSoFar : 100001
Train_AverageReturn : -70.30784973902453
Train_BestReturn : -70.30784973902453
TimeSinceStart : 475.0095031261444
Training Loss : 0.1778167337179184
Done logging...




********** Ite

Train_EnvstepsSoFar : 119001
Train_AverageReturn : -31.462014834901012
Train_BestReturn : -31.462014834901012
TimeSinceStart : 565.4830510616302
Training Loss : 0.5311988592147827
Done logging...




********** Iteration 120000 ************

Training agent...

Beginning logging procedure...
Timestep 120001
mean reward (100 episodes) -28.874590
best mean reward -28.874590
running time 569.844695
Train_EnvstepsSoFar : 120001
Train_AverageReturn : -28.87459023671616
Train_BestReturn : -28.87459023671616
TimeSinceStart : 569.8446950912476
Training Loss : 0.13801506161689758
Done logging...




********** Iteration 121000 ************

Training agent...

Beginning logging procedure...
Timestep 121001
mean reward (100 episodes) -28.735348
best mean reward -28.735348
running time 574.774840
Train_EnvstepsSoFar : 121001
Train_AverageReturn : -28.73534796955171
Train_BestReturn : -28.73534796955171
TimeSinceStart : 574.7748398780823
Training Loss : 0.12755540013313293
Done logging...




******

Train_EnvstepsSoFar : 140001
Train_AverageReturn : 17.678490576927025
Train_BestReturn : 17.678490576927025
TimeSinceStart : 665.7740979194641
Training Loss : 0.3348587155342102
Done logging...




********** Iteration 141000 ************

Training agent...

Beginning logging procedure...
Timestep 141001
mean reward (100 episodes) 21.299537
best mean reward 21.299537
running time 669.821908
Train_EnvstepsSoFar : 141001
Train_AverageReturn : 21.29953704907351
Train_BestReturn : 21.29953704907351
TimeSinceStart : 669.8219079971313
Training Loss : 0.28934594988822937
Done logging...




********** Iteration 142000 ************

Training agent...

Beginning logging procedure...
Timestep 142001
mean reward (100 episodes) 22.324878
best mean reward 22.324878
running time 674.881321
Train_EnvstepsSoFar : 142001
Train_AverageReturn : 22.324878451793683
Train_BestReturn : 22.324878451793683
TimeSinceStart : 674.8813209533691
Training Loss : 0.16667698323726654
Done logging...




********** Ite

Train_EnvstepsSoFar : 161001
Train_AverageReturn : 46.658702821473696
Train_BestReturn : 46.658702821473696
TimeSinceStart : 760.4295670986176
Training Loss : 0.15436209738254547
Done logging...




********** Iteration 162000 ************

Training agent...

Beginning logging procedure...
Timestep 162001
mean reward (100 episodes) 49.304016
best mean reward 49.304016
running time 764.871473
Train_EnvstepsSoFar : 162001
Train_AverageReturn : 49.304016102237675
Train_BestReturn : 49.304016102237675
TimeSinceStart : 764.8714728355408
Training Loss : 0.8248716592788696
Done logging...




********** Iteration 163000 ************

Training agent...

Beginning logging procedure...
Timestep 163001
mean reward (100 episodes) 46.527713
best mean reward 49.304016
running time 768.812597
Train_EnvstepsSoFar : 163001
Train_AverageReturn : 46.52771275897785
Train_BestReturn : 49.304016102237675
TimeSinceStart : 768.8125970363617
Training Loss : 0.38871175050735474
Done logging...




********** It

Train_EnvstepsSoFar : 182001
Train_AverageReturn : 55.283013124363194
Train_BestReturn : 60.37864307523524
TimeSinceStart : 867.2779459953308
Training Loss : 0.324370801448822
Done logging...




********** Iteration 183000 ************

Training agent...

Beginning logging procedure...
Timestep 183001
mean reward (100 episodes) 56.578148
best mean reward 60.378643
running time 871.497681
Train_EnvstepsSoFar : 183001
Train_AverageReturn : 56.578148183258335
Train_BestReturn : 60.37864307523524
TimeSinceStart : 871.4976809024811
Training Loss : 0.11987411230802536
Done logging...




********** Iteration 184000 ************

Training agent...

Beginning logging procedure...
Timestep 184001
mean reward (100 episodes) 57.392839
best mean reward 60.378643
running time 875.792369
Train_EnvstepsSoFar : 184001
Train_AverageReturn : 57.39283931752638
Train_BestReturn : 60.37864307523524
TimeSinceStart : 875.792368888855
Training Loss : 0.10790576785802841
Done logging...




********** Iterati

Train_EnvstepsSoFar : 203001
Train_AverageReturn : 58.177166253464755
Train_BestReturn : 61.74907224094897
TimeSinceStart : 962.6851608753204
Training Loss : 0.27471429109573364
Done logging...




********** Iteration 204000 ************

Training agent...

Beginning logging procedure...
Timestep 204001
mean reward (100 episodes) 56.629506
best mean reward 61.749072
running time 967.910075
Train_EnvstepsSoFar : 204001
Train_AverageReturn : 56.629506016494446
Train_BestReturn : 61.74907224094897
TimeSinceStart : 967.9100749492645
Training Loss : 0.3135134279727936
Done logging...




********** Iteration 205000 ************

Training agent...

Beginning logging procedure...
Timestep 205001
mean reward (100 episodes) 58.640884
best mean reward 61.749072
running time 973.618285
Train_EnvstepsSoFar : 205001
Train_AverageReturn : 58.640884047751015
Train_BestReturn : 61.74907224094897
TimeSinceStart : 973.6182851791382
Training Loss : 0.07075856626033783
Done logging...




********** Iter

Train_EnvstepsSoFar : 224001
Train_AverageReturn : 64.35299128695638
Train_BestReturn : 64.35299128695638
TimeSinceStart : 1064.2861700057983
Training Loss : 0.07396013289690018
Done logging...




********** Iteration 225000 ************

Training agent...

Beginning logging procedure...
Timestep 225001
mean reward (100 episodes) 68.941797
best mean reward 68.941797
running time 1068.101576
Train_EnvstepsSoFar : 225001
Train_AverageReturn : 68.94179679750329
Train_BestReturn : 68.94179679750329
TimeSinceStart : 1068.1015758514404
Training Loss : 0.08676868677139282
Done logging...




********** Iteration 226000 ************

Training agent...

Beginning logging procedure...
Timestep 226001
mean reward (100 episodes) 67.157239
best mean reward 68.941797
running time 1073.313509
Train_EnvstepsSoFar : 226001
Train_AverageReturn : 67.15723878158747
Train_BestReturn : 68.94179679750329
TimeSinceStart : 1073.3135089874268
Training Loss : 0.07635335624217987
Done logging...




********** I

Train_EnvstepsSoFar : 245001
Train_AverageReturn : 83.05806822475547
Train_BestReturn : 87.81835612549794
TimeSinceStart : 1160.8997037410736
Training Loss : 0.13402274250984192
Done logging...




********** Iteration 246000 ************

Training agent...

Beginning logging procedure...
Timestep 246001
mean reward (100 episodes) 82.673599
best mean reward 87.818356
running time 1165.322554
Train_EnvstepsSoFar : 246001
Train_AverageReturn : 82.67359900673542
Train_BestReturn : 87.81835612549794
TimeSinceStart : 1165.3225538730621
Training Loss : 0.076202392578125
Done logging...




********** Iteration 247000 ************

Training agent...

Beginning logging procedure...
Timestep 247001
mean reward (100 episodes) 81.127046
best mean reward 87.818356
running time 1169.990861
Train_EnvstepsSoFar : 247001
Train_AverageReturn : 81.1270463349923
Train_BestReturn : 87.81835612549794
TimeSinceStart : 1169.9908609390259
Training Loss : 0.14393244683742523
Done logging...




********** Iter

Train_EnvstepsSoFar : 266001
Train_AverageReturn : 97.29954314832132
Train_BestReturn : 100.09141911277864
TimeSinceStart : 1264.57817196846
Training Loss : 0.0925464928150177
Done logging...




********** Iteration 267000 ************

Training agent...

Beginning logging procedure...
Timestep 267001
mean reward (100 episodes) 98.843703
best mean reward 100.091419
running time 1268.612122
Train_EnvstepsSoFar : 267001
Train_AverageReturn : 98.84370349326305
Train_BestReturn : 100.09141911277864
TimeSinceStart : 1268.6121218204498
Training Loss : 0.17558664083480835
Done logging...




********** Iteration 268000 ************

Training agent...

Beginning logging procedure...
Timestep 268001
mean reward (100 episodes) 98.131027
best mean reward 100.091419
running time 1272.857733
Train_EnvstepsSoFar : 268001
Train_AverageReturn : 98.13102660398968
Train_BestReturn : 100.09141911277864
TimeSinceStart : 1272.8577330112457
Training Loss : 0.17761093378067017
Done logging...




**********

Train_EnvstepsSoFar : 287001
Train_AverageReturn : 98.17764345655901
Train_BestReturn : 103.05884896663953
TimeSinceStart : 1353.711089849472
Training Loss : 0.5750413537025452
Done logging...




********** Iteration 288000 ************

Training agent...

Beginning logging procedure...
Timestep 288001
mean reward (100 episodes) 100.629095
best mean reward 103.058849
running time 1357.820476
Train_EnvstepsSoFar : 288001
Train_AverageReturn : 100.62909468096058
Train_BestReturn : 103.05884896663953
TimeSinceStart : 1357.8204758167267
Training Loss : 0.07169114798307419
Done logging...




********** Iteration 289000 ************

Training agent...

Beginning logging procedure...
Timestep 289001
mean reward (100 episodes) 102.298332
best mean reward 103.058849
running time 1362.302156
Train_EnvstepsSoFar : 289001
Train_AverageReturn : 102.29833247350729
Train_BestReturn : 103.05884896663953
TimeSinceStart : 1362.302155971527
Training Loss : 0.268636018037796
Done logging...




********

Train_EnvstepsSoFar : 308001
Train_AverageReturn : 122.50858347477974
Train_BestReturn : 123.46580186635167
TimeSinceStart : 1443.897311925888
Training Loss : 0.34545692801475525
Done logging...




********** Iteration 309000 ************

Training agent...

Beginning logging procedure...
Timestep 309001
mean reward (100 episodes) 116.827799
best mean reward 123.465802
running time 1448.322484
Train_EnvstepsSoFar : 309001
Train_AverageReturn : 116.8277992114639
Train_BestReturn : 123.46580186635167
TimeSinceStart : 1448.3224840164185
Training Loss : 0.33344218134880066
Done logging...




********** Iteration 310000 ************

Training agent...

Beginning logging procedure...
Timestep 310001
mean reward (100 episodes) 119.853353
best mean reward 123.465802
running time 1452.546265
Train_EnvstepsSoFar : 310001
Train_AverageReturn : 119.85335280823998
Train_BestReturn : 123.46580186635167
TimeSinceStart : 1452.546264886856
Training Loss : 0.30405616760253906
Done logging...




*****

Train_EnvstepsSoFar : 329001
Train_AverageReturn : 93.73238126658745
Train_BestReturn : 127.19637001154697
TimeSinceStart : 1537.0546569824219
Training Loss : 3.0453031063079834
Done logging...




********** Iteration 330000 ************

Training agent...

Beginning logging procedure...
Timestep 330001
mean reward (100 episodes) 97.784415
best mean reward 127.196370
running time 1541.017111
Train_EnvstepsSoFar : 330001
Train_AverageReturn : 97.78441474854701
Train_BestReturn : 127.19637001154697
TimeSinceStart : 1541.0171110630035
Training Loss : 0.4218275845050812
Done logging...




********** Iteration 331000 ************

Training agent...

Beginning logging procedure...
Timestep 331001
mean reward (100 episodes) 98.118947
best mean reward 127.196370
running time 1546.230082
Train_EnvstepsSoFar : 331001
Train_AverageReturn : 98.11894682087451
Train_BestReturn : 127.19637001154697
TimeSinceStart : 1546.2300820350647
Training Loss : 0.08261734992265701
Done logging...




*********

Train_EnvstepsSoFar : 350001
Train_AverageReturn : 140.2052781094483
Train_BestReturn : 140.2052781094483
TimeSinceStart : 1635.94496011734
Training Loss : 0.11286661773920059
Done logging...




********** Iteration 351000 ************

Training agent...

Beginning logging procedure...
Timestep 351001
mean reward (100 episodes) 142.789345
best mean reward 142.789345
running time 1640.830972
Train_EnvstepsSoFar : 351001
Train_AverageReturn : 142.7893445951395
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1640.830971956253
Training Loss : 1.6925592422485352
Done logging...




********** Iteration 352000 ************

Training agent...

Beginning logging procedure...
Timestep 352001
mean reward (100 episodes) 137.040683
best mean reward 142.789345
running time 1645.132080
Train_EnvstepsSoFar : 352001
Train_AverageReturn : 137.0406827398798
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1645.1320798397064
Training Loss : 0.10047511756420135
Done logging...




********** I

Train_EnvstepsSoFar : 371001
Train_AverageReturn : 112.80113955979819
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1727.803288936615
Training Loss : 1.1597628593444824
Done logging...




********** Iteration 372000 ************

Training agent...

Beginning logging procedure...
Timestep 372001
mean reward (100 episodes) 114.913166
best mean reward 142.789345
running time 1732.101292
Train_EnvstepsSoFar : 372001
Train_AverageReturn : 114.91316614574323
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1732.1012921333313
Training Loss : 0.2048526257276535
Done logging...




********** Iteration 373000 ************

Training agent...

Beginning logging procedure...
Timestep 373001
mean reward (100 episodes) 114.961343
best mean reward 142.789345
running time 1736.246670
Train_EnvstepsSoFar : 373001
Train_AverageReturn : 114.96134311203409
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1736.246669769287
Training Loss : 1.8314226865768433
Done logging...




**********

Train_EnvstepsSoFar : 392001
Train_AverageReturn : 122.58103527504592
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1807.482048034668
Training Loss : 0.4453897774219513
Done logging...




********** Iteration 393000 ************

Training agent...

Beginning logging procedure...
Timestep 393001
mean reward (100 episodes) 122.909106
best mean reward 142.789345
running time 1811.092639
Train_EnvstepsSoFar : 393001
Train_AverageReturn : 122.90910586837593
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1811.0926389694214
Training Loss : 0.4382851719856262
Done logging...




********** Iteration 394000 ************

Training agent...

Beginning logging procedure...
Timestep 394001
mean reward (100 episodes) 117.930441
best mean reward 142.789345
running time 1814.991692
Train_EnvstepsSoFar : 394001
Train_AverageReturn : 117.93044083252099
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1814.9916920661926
Training Loss : 2.275508165359497
Done logging...




**********

Train_EnvstepsSoFar : 413001
Train_AverageReturn : 133.30392573762677
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1886.878389120102
Training Loss : 0.48190179467201233
Done logging...




********** Iteration 414000 ************

Training agent...

Beginning logging procedure...
Timestep 414001
mean reward (100 episodes) 135.634789
best mean reward 142.789345
running time 1890.828728
Train_EnvstepsSoFar : 414001
Train_AverageReturn : 135.63478919531053
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1890.828727722168
Training Loss : 0.2411002218723297
Done logging...




********** Iteration 415000 ************

Training agent...

Beginning logging procedure...
Timestep 415001
mean reward (100 episodes) 135.733225
best mean reward 142.789345
running time 1894.968113
Train_EnvstepsSoFar : 415001
Train_AverageReturn : 135.73322500173077
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1894.9681129455566
Training Loss : 0.14736312627792358
Done logging...




********

Train_EnvstepsSoFar : 434001
Train_AverageReturn : 126.4744099109083
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1969.6733059883118
Training Loss : 1.5191841125488281
Done logging...




********** Iteration 435000 ************

Training agent...

Beginning logging procedure...
Timestep 435001
mean reward (100 episodes) 132.309693
best mean reward 142.789345
running time 1973.314711
Train_EnvstepsSoFar : 435001
Train_AverageReturn : 132.30969280799002
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1973.3147110939026
Training Loss : 2.764617443084717
Done logging...




********** Iteration 436000 ************

Training agent...

Beginning logging procedure...
Timestep 436001
mean reward (100 episodes) 132.126307
best mean reward 142.789345
running time 1976.928221
Train_EnvstepsSoFar : 436001
Train_AverageReturn : 132.12630748632992
Train_BestReturn : 142.7893445951395
TimeSinceStart : 1976.92822098732
Training Loss : 1.7505033016204834
Done logging...




********** I

Train_EnvstepsSoFar : 455001
Train_AverageReturn : 149.68419810478292
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2048.4714210033417
Training Loss : 0.1774071902036667
Done logging...




********** Iteration 456000 ************

Training agent...

Beginning logging procedure...
Timestep 456001
mean reward (100 episodes) 142.348715
best mean reward 152.159215
running time 2052.645479
Train_EnvstepsSoFar : 456001
Train_AverageReturn : 142.34871533808527
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2052.645478963852
Training Loss : 0.3210744857788086
Done logging...




********** Iteration 457000 ************

Training agent...

Beginning logging procedure...
Timestep 457001
mean reward (100 episodes) 138.145982
best mean reward 152.159215
running time 2056.828600
Train_EnvstepsSoFar : 457001
Train_AverageReturn : 138.14598155385536
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2056.8285999298096
Training Loss : 0.20742270350456238
Done logging...




*****

Train_EnvstepsSoFar : 476001
Train_AverageReturn : 133.75172778044828
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2128.074028968811
Training Loss : 0.3200947940349579
Done logging...




********** Iteration 477000 ************

Training agent...

Beginning logging procedure...
Timestep 477001
mean reward (100 episodes) 133.148790
best mean reward 152.159215
running time 2131.694787
Train_EnvstepsSoFar : 477001
Train_AverageReturn : 133.14878984269617
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2131.694786787033
Training Loss : 0.2311696857213974
Done logging...




********** Iteration 478000 ************

Training agent...

Beginning logging procedure...
Timestep 478001
mean reward (100 episodes) 129.001348
best mean reward 152.159215
running time 2135.326839
Train_EnvstepsSoFar : 478001
Train_AverageReturn : 129.00134808316
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2135.3268389701843
Training Loss : 0.3433936536312103
Done logging...




**********

Train_EnvstepsSoFar : 497001
Train_AverageReturn : 133.03615477807654
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2204.8352420330048
Training Loss : 0.644425094127655
Done logging...




********** Iteration 498000 ************

Training agent...

Beginning logging procedure...
Timestep 498001
mean reward (100 episodes) 132.500988
best mean reward 152.159215
running time 2208.465910
Train_EnvstepsSoFar : 498001
Train_AverageReturn : 132.50098807127966
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2208.4659099578857
Training Loss : 2.8511452674865723
Done logging...




********** Iteration 499000 ************

Training agent...

Beginning logging procedure...
Timestep 499001
mean reward (100 episodes) 136.360393
best mean reward 152.159215
running time 2212.102291
Train_EnvstepsSoFar : 499001
Train_AverageReturn : 136.36039334839845
Train_BestReturn : 152.15921525328358
TimeSinceStart : 2212.102290868759
Training Loss : 1.2850500345230103
Done logging...




In [86]:
### Visualize all DQN results on Lunar Lander
%reload_ext tensorboard
%tensorboard --logdir logs/dqn/LunarLander/
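
If TensorBoard is not available, the printed log above can also be inspected directly. The following is a minimal sketch, not part of the course infrastructure: it assumes the log text has been captured as a string, and the helper name `parse_training_log` is hypothetical. It collects the `key : value` lines (such as `Train_AverageReturn`) into one dictionary per logging block, treating `Done logging...` as the end of a block.

```python
import re

# Matches lines such as "Train_AverageReturn : 85.28943327155476".
# Lines without a " : " separator (e.g. "Timestep 325001") are ignored.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z_][\w ]*?)\s+:\s+(-?(?:inf|\d+(?:\.\d+)?(?:[eE][-+]?\d+)?))\s*$"
)


def parse_training_log(text):
    """Group 'key : value' lines into one dict per logging block.

    Hypothetical helper: a block is assumed to end at a 'Done logging' line.
    """
    records, current = [], {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            current[match.group(1)] = float(match.group(2))
        elif line.strip().startswith("Done logging") and current:
            records.append(current)
            current = {}
    return records


# Sample copied from the log output above.
sample = """Timestep 325001
mean reward (100 episodes) 83.123821
Train_EnvstepsSoFar : 325001
Train_AverageReturn : 83.1238212953312
Train_BestReturn : 111.43422388194847
TimeSinceStart : 1595.0922276973724
Training Loss : 0.21583746373653412
Done logging...
"""

records = parse_training_log(sample)
steps = [r["Train_EnvstepsSoFar"] for r in records]
returns = [r["Train_AverageReturn"] for r in records]
```

The resulting `steps` and `returns` lists can be passed to any plotting library. Blocks that were cut off in the notebook output simply yield dictionaries with fewer keys, so it is safer to read optional keys such as `Train_BestReturn` with `dict.get`.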
