# Actor Critic Algorithms
In the DQN algorithm, we learned a Q-function by minimizing Bellman errors for an implicit policy that always took the action maximizing the Q-function. However, this scheme requires a discrete action space so that we can easily compute the optimal action at each state, unlike generic policy gradient algorithms, which also work with continuous action spaces.

In this section, we will explore actor-critic algorithms, which maintain an explicit policy (the actor) like the policy gradient algorithms, learn a Q-function (the critic) capturing the values of the _current policy_, and use this learned Q-function to update the policy. Using a learned critic can provide much lower-variance updates for the policy than Monte Carlo return estimates, and it also allows us to reuse data by training the actor and critic on _off-policy_ data, which improves sample efficiency. We can thus take many more policy updates with an actor-critic algorithm using our learned critic, instead of needing to wait and gather fresh samples every time.

In [1]:
# As usual, a bit of setup
import os
import shutil
import time
import numpy as np
import torch

import deeprl.infrastructure.pytorch_util as ptu
from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import AC_Trainer

from deeprl.policies.MLP_policy import MLPPolicyAC

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))

In [12]:
ac_base_args_dict = dict(
    env_name = 'Hopper-v2', #@param ['Ant-v2', 'Humanoid-v2', 'Walker2d-v2', 'HalfCheetah-v2', 'Hopper-v2']
    exp_name = 'test_ac', #@param
    save_params = False, #@param {type: "boolean"}
    
    ## PDF will tell you how to set ep_len
    ## and discount for each environment
    ep_len = 200, #@param {type: "integer"}
    discount = 0.99, #@param {type: "number"}

    # Training
    num_agent_train_steps_per_iter = 1000, #@param {type: "integer"}
    n_iter = 100, #@param {type: "integer"}

    # batches & buffers
    batch_size = 1000, #@param {type: "integer"}
    eval_batch_size = 1000, #@param {type: "integer"}
    train_batch_size = 256, #@param {type: "integer"}
    max_replay_buffer_size = 1000000, #@param {type: "integer"}

    #@markdown actor network
    n_layers = 2, #@param {type: "integer"}
    size = 256, #@param {type: "integer"}
    entropy_weight=0, #@param {type: "number"}
    learning_rate = 3e-4, #@param {type: "number"}
    
    # critic network
    critic_n_layers = 2, #@param {type: "integer"}
    critic_size = 256, #@param {type: "integer"}
    target_update_rate = 5e-3,

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

First fill out the target value calculation in the <code>compute_target_value</code> method of <code>critics/bootstrapped_continuous_critic.py</code>. Compared to the DQN critic, the key difference is that we are now estimating the value of the current policy, instead of the optimal policy as in DQN or Q-learning.

To train our critic to evaluate the current policy $\pi$, we simply sample actions from the current policy in our target value. For each sample $(s,a,s')$, our loss will be
$$L(Q_{\theta}(s, a), r(s,a) + \gamma \mathbb{E}_{a'\sim \pi(s')}[Q_{\bar \theta} (s', a')]),$$
where $L$ is our loss function (for example squared error or the smooth L1 loss).
In this assignment, we will simply sample a single action from the policy to estimate the target value.
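
As a rough sketch of the one-sample target above, written in NumPy rather than against the assignment's classes: `one_sample_target`, `toy_policy`, and `toy_target_q` are hypothetical stand-ins, and masking the bootstrap term on terminal transitions is an assumption (a common convention) that the formula above does not spell out.

```python
import numpy as np

def one_sample_target(next_obs, rewards, terminals, sample_action, target_q, discount):
    """One-sample estimate of r + gamma * E_{a' ~ pi(s')}[Q_target(s', a')]."""
    next_actions = sample_action(next_obs)      # a' ~ pi(. | s'), one sample per state
    next_q = target_q(next_obs, next_actions)   # Q_target(s', a'), shape (N,)
    # Assumption: no bootstrapping past terminal transitions.
    return rewards + discount * (1.0 - terminals) * next_q

# Hypothetical stand-ins for the actor and the target critic.
def toy_policy(obs):
    return np.ones((obs.shape[0], 2))

def toy_target_q(obs, acts):
    return obs.sum(axis=1) + acts.sum(axis=1)

next_obs = np.array([[1.0, 2.0], [0.5, -0.5], [0.0, 0.0]])
rewards = np.array([1.0, 0.0, -1.0])
terminals = np.array([1.0, 0.0, 0.0])

targets = one_sample_target(next_obs, rewards, terminals,
                            toy_policy, toy_target_q, discount=0.5)
print(targets)
```

The first transition is terminal, so its target is just the reward. The other two add the discounted target-critic value at the sampled next action.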

In [30]:
# Test bellman error for policy evaluation
ac_dim = 3
ob_dim = 11
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

ac_args = dict(ac_base_args_dict)

env_str = 'Hopper'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['entropy_weight'] = 0.1
actrainer = AC_Trainer(ac_args)
critic = actrainer.rl_trainer.agent.critic

class DummyDist:
    def sample(self):
        return ptu.from_numpy(1 + np.zeros(shape=(N, ac_dim)))

def dummy_actor(next_obs):
    return DummyDist()

# assumes you call actor(next_obs) to get the distribution, then call distribution.sample()
target_vals = critic.compute_target_value(ptu.from_numpy(next_obs), 
                                          ptu.from_numpy(rewards), 
                                          ptu.from_numpy(terminals), 
                                          dummy_actor)
target_vals = ptu.to_numpy(target_vals)
expected_targets = np.array([-0.9167948, -0.11123351, -0.36787638, -2.1131861,  -0.13868617])

target_error = rel_error(target_vals, expected_targets)
print("Target value error", target_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2
Target value error 1.6779968730287167e-07 should be on the order of 1e-6 or lower


For this section, we will also update our target network parameters as an exponential moving average of the critic parameters, instead of simply copying the current parameters periodically as in DQN. Generally, either method for target networks tends to work with appropriately chosen update rates.

Fill out the <code>update_target_network_ema</code> method in <code>critics/bootstrapped_continuous_critic.py</code>.
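
A minimal sketch of an exponential-moving-average target update, using plain NumPy arrays in place of network parameters. The helper `ema_update` is hypothetical, and the convention that the update rate weights the critic parameters (target ← (1 − rate) · target + rate · critic) is an assumption, though it is the usual one for small rates such as `5e-3`.

```python
import numpy as np

def ema_update(target_params, critic_params, rate):
    """In place: target <- (1 - rate) * target + rate * critic."""
    for target_p, p in zip(target_params, critic_params):
        target_p *= (1.0 - rate)
        target_p += rate * p

critic_params = [np.array([1.0, -2.0]), np.array([[0.5]])]
target_params = [p.copy() for p in critic_params]   # identical at initialization

# Move the critic away from the target, then take one EMA step.
for p in critic_params:
    p += 1.0
ema_update(target_params, critic_params, rate=0.5)

gaps = [p - t for p, t in zip(critic_params, target_params)]
print(gaps)
```

With a rate of 0.5, the target moves halfway toward the shifted critic, so every parameter gap is 0.5.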

In [37]:
# Test target network update
ac_args = dict(ac_base_args_dict)

env_str = 'Hopper'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['entropy_weight'] = 0.1
actrainer = AC_Trainer(ac_args)
critic = actrainer.rl_trainer.agent.critic

critic.target_update_rate = 0.5

# at initialization, target and critic networks are the same
for p in critic.critic_network.parameters():
    p.data += 1.
    
critic.update_target_network_ema()

for p, target_p in zip(critic.critic_network.parameters(), critic.target_network.parameters()):
    assert np.allclose(ptu.to_numpy(p - target_p), 0.5)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2


Next, we will implement the actor update using the learned critic instead of Monte Carlo returns. 

To update our policy at a particular state $s$, our previous policy gradient (using the reward to go estimator) took a step on the objective (treating $Q^{\pi}$ as a function that didn't depend on $\pi$ and using the results of a single trajectory to estimate $Q^\pi$)
$$\mathbb E_{a \sim \pi_{\theta}(s)}[Q^{\pi_\theta}(s, a)],$$
using the REINFORCE gradient estimator 
$$\mathbb E_{a \sim \pi_{\theta}(s)}[\nabla_{\theta} \log \pi_{\theta}(a\vert s) Q^{\pi_\theta}(s, a)].$$
This estimator relies only on the estimated value $Q^{\pi_{\theta}}(s,a)$, so it is very general and can be applied with Monte Carlo estimates of $Q^{\pi_\theta}$.

One way to estimate policy gradients with an actor critic algorithm would be to directly replace the Monte Carlo estimate of $Q$ with the learned critic $Q_{\phi}$, and continue using the REINFORCE gradient estimator.
However, we note that we can explicitly compute derivatives of our learned critic $Q(s, a)$ with respect to the action $a$, which can enable potentially better gradient estimates. 

To take advantage of this, we also need to differentiate sampled actions $a$ with respect to our policy parameters, which we can do through a technique known as the _reparameterization trick_ or the _pathwise_ estimator. 
The idea is that if our policy samples actions according to $a \sim \mathcal N(\mu_{\theta}(s), \sigma^2_{\theta}(s))$, we can rewrite $a = f_{\theta}(z)$, where $z \sim \mathcal N(0, 1)$ and $f_{\theta}(z) = \mu_{\theta}(s) + z \cdot \sigma_{\theta}(s)$. Now all the randomness comes from sampling $z$, which doesn't depend on our policy, so we can differentiate the sampled action $a$ with respect to our policy parameters $\theta$ by simply differentiating through the function $f_{\theta}$ applied to the random noise $z$. 
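
To make the rewrite concrete, here is a small NumPy sketch for a one-dimensional Gaussian with fixed, made-up values of $\mu$ and $\sigma$ (in the assignment these would be network outputs). It only illustrates that $a = \mu + z \cdot \sigma$ has the intended distribution and that its derivatives with respect to $\mu$ and $\sigma$ are available per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                 # illustrative values, not from the assignment

z = rng.standard_normal(100_000)     # noise that does not depend on the policy
a = mu + sigma * z                   # a = f_theta(z)

# Per-sample derivatives of the action with respect to the parameters.
da_dmu = np.ones_like(z)             # d/dmu    (mu + sigma * z) = 1
da_dsigma = z                        # d/dsigma (mu + sigma * z) = z

sample_mean, sample_std = a.mean(), a.std()
print(sample_mean, sample_std)
```

The sample mean and standard deviation should land close to `mu` and `sigma`. In PyTorch, the `rsample` method of `torch.distributions.Normal` draws samples in this reparameterized form so that autograd can differentiate through them.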

Using the chain rule then allows us to directly estimate gradients of 
$$\mathbb{E}_{a \sim \pi_{\theta}(s)}[Q_{\phi}(s,a)]$$
by drawing samples from $\pi_{\theta}$ and differentiating $Q_{\phi}(s,a)$ on the samples. 
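
As a toy check of this chain-rule estimator, the sketch below compares it with the REINFORCE estimator for the gradient with respect to $\mu$. It assumes a one-dimensional Gaussian policy with fixed $\mu$ and $\sigma$ and a hypothetical quadratic critic `q` whose derivative `dq_da` is written out by hand. None of these names come from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5

def q(a):
    # Hypothetical critic with its maximum at a = 2.
    return -(a - 2.0) ** 2

def dq_da(a):
    # Derivative of the critic with respect to the action.
    return -2.0 * (a - 2.0)

z = rng.standard_normal(100_000)
a = mu + sigma * z

# Pathwise: dQ/da * da/dmu, with da/dmu = 1.
pathwise = dq_da(a) * 1.0
# REINFORCE: Q(a) * d/dmu log N(a; mu, sigma^2) = Q(a) * (a - mu) / sigma^2.
score_function = q(a) * (a - mu) / sigma**2

# Closed form: E[Q] = -((mu - 2)^2 + sigma^2), so dE[Q]/dmu = -2 (mu - 2).
true_grad = -2.0 * (mu - 2.0)

print("true gradient     ", true_grad)
print("pathwise mean/var ", pathwise.mean(), pathwise.var())
print("REINFORCE mean/var", score_function.mean(), score_function.var())
```

Both estimators average to roughly the true gradient, but in this particular setup the pathwise samples have noticeably lower variance. This is one illustrative case, not a general guarantee.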

Implement the actor update using this pathwise estimator in the <code>update</code> method of the <code>MLPPolicyAC</code> class in <code>policies/MLP_policy.py</code> (Hint: see the <code>rsample</code> method in <code>torch.distributions</code>). Note that our implementation samples states uniformly from the entire replay buffer, not necessarily from the state distribution of the current policy. While this means we are no longer taking unbiased policy gradients (our estimates were already biased anyway because we use a learned critic), it works well in practice.
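
For intuition, a pathwise actor update can be sketched in PyTorch roughly as follows. This is a standalone toy, not the `MLPPolicyAC` implementation: `mean_net`, `log_std`, and `toy_critic` are hypothetical, the policy is a linear Gaussian with a state-independent standard deviation, and there is no entropy term.

```python
import torch
from torch import nn

torch.manual_seed(0)
ob_dim, ac_dim = 3, 2

# Hypothetical Gaussian policy: linear mean plus a learned log standard deviation.
mean_net = nn.Linear(ob_dim, ac_dim)
log_std = nn.Parameter(torch.zeros(ac_dim))
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=1e-2)

def toy_critic(obs, acts):
    # Hypothetical critic that prefers actions close to 1 in every dimension.
    return -((acts - 1.0) ** 2).sum(dim=-1)

obs = torch.randn(64, ob_dim)   # stands in for a batch of replay-buffer states

losses = []
for _ in range(200):
    dist = torch.distributions.Normal(mean_net(obs), log_std.exp())
    acts = dist.rsample()                   # reparameterized sample, keeps gradients
    loss = -toy_critic(obs, acts).mean()    # maximizing Q is minimizing -Q
    optimizer.zero_grad()
    loss.backward()                         # gradients flow through acts into the policy
    optimizer.step()
    losses.append(loss.item())

print(sum(losses[:20]) / 20, sum(losses[-20:]) / 20)
```

Calling `sample` instead of `rsample` would detach the actions from the policy parameters, so `loss.backward()` would have no path back to the policy.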

In [64]:
# Compute actor update using the policy gradient. 
# For this test to pass, make sure you only call sample once per actor update so that you do not
# throw off the actor samples expected for the updates in this test.
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))

policy = MLPPolicyAC(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25,
            entropy_weight=0.)

def dummy_critic(obs, acts):
    return torch.sum(acts + 1) + torch.sum(obs)

initial_loss = policy.update(obs, dummy_critic)['Actor Training Loss']
expected_initial_loss = -17.083496

print("Initial loss error", rel_error(expected_initial_loss, initial_loss), "should be on the order of 1e-6 or less.")
for i in range(5):
    loss = policy.update(obs, dummy_critic)['Actor Training Loss']
    print(loss)

expected_final_loss = -30.103575

print("Final loss error", rel_error(expected_final_loss, loss), "should be on the order of 1e-6 or less.")
    


tensor(-17.0835, grad_fn=<NegBackward>)
Initial loss error 2.7438762975115168e-09 should be on the order of 1e-6 or less.
tensor(-17.3147, grad_fn=<NegBackward>)
-17.314678
tensor(-22.6220, grad_fn=<NegBackward>)
-22.621984
tensor(-25.8285, grad_fn=<NegBackward>)
-25.828484
tensor(-29.6705, grad_fn=<NegBackward>)
-29.670477
tensor(-30.1036, grad_fn=<NegBackward>)
-30.103575
Final loss error 4.105698129439902e-09 should be on the order of 1e-6 or less.


Now we'll train our actor-critic agent on the HalfCheetah task. You should see your policies generally reach returns above 600. 

We note that these actor-critic algorithms, since they make use of off-policy updates, can be much more sample efficient than the basic policy gradient algorithm we saw earlier. In our actor-critic algorithms here, we only take 1000 new samples from the environment per iteration, while the policy gradient algorithms often needed many more samples per iteration to estimate the Monte Carlo returns (for example, we used 10000 in the Hopper experiments with policy gradient).



In [65]:
ac_args = dict(ac_base_args_dict)

env_str = 'HalfCheetah'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['n_iter'] = 50

# Delete all previous logs
remove_folder('logs/actor_critic/{}'.format(env_str))

for seed in range(3):
    print("Running actor critic experiment with seed", seed)
    ac_args['seed'] = seed
    ac_args['logdir'] = 'logs/actor_critic/{}/seed{}'.format(env_str, seed)
    actrainer = AC_Trainer(ac_args)
    actrainer.run_training_loop()

Clearing old results at logs/actor_critic/HalfCheetah
Running actor critic experiment with seed 0
########################
logging outputs to  logs/actor_critic/HalfCheetah/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
HalfCheetah-v2


********** Iteration 0 ************

Collecting initial random data to be used for training...
At timestep:     10050 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : -90.31543731689453
Eval_StdReturn : 3.3160579204559326
Eval_MaxReturn : -85.4679946899414
Eval_MinReturn : -95.55809020996094
Eval_AverageEpLen : 201.0
Train_AverageReturn : -53.001216888427734
Train_StdReturn : 30.279033660888672
Train_MaxReturn : -0.5765558481216431
Train_MinReturn : -122.80113220214844
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 10050
TimeSinceStart : 19.5030

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 296.83331298828125
Eval_StdReturn : 13.115283966064453
Eval_MaxReturn : 311.7648620605469
Eval_MinReturn : 278.59674072265625
Eval_AverageEpLen : 201.0
Train_AverageReturn : 268.18548583984375
Train_StdReturn : 22.93358039855957
Train_MaxReturn : 302.7632141113281
Train_MinReturn : 243.4164581298828
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 20100
TimeSinceStart : 212.42126989364624
Critic Training Loss : 0.7096513509750366
Critic Mean : 116.7470703125
Actor Training Loss : -117.97205352783203
Initial_DataCollection_AverageReturn : -53.001216888427734
Done logging...




********** Iteration 11 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 275.52008056640625
Eval_StdReturn : 24.71648406982422
Eval_MaxReturn : 302.26

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 298.9920654296875
Eval_StdReturn : 35.22999572753906
Eval_MaxReturn : 359.7818603515625
Eval_MinReturn : 266.73486328125
Eval_AverageEpLen : 201.0
Train_AverageReturn : 437.71197509765625
Train_StdReturn : 16.211833953857422
Train_MaxReturn : 450.0612487792969
Train_MinReturn : 406.266357421875
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 31155
TimeSinceStart : 412.22196769714355
Critic Training Loss : 2.3492178916931152
Critic Mean : 211.05072021484375
Actor Training Loss : -212.59518432617188
Initial_DataCollection_AverageReturn : -53.001216888427734
Done logging...




********** Iteration 22 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 439.94647216796875
Eval_StdReturn : 35.0020751953125
Eval_MaxReturn : 488.3386

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 490.30072021484375
Eval_StdReturn : 232.32740783691406
Eval_MaxReturn : 654.0078125
Eval_MinReturn : 33.60002136230469
Eval_AverageEpLen : 201.0
Train_AverageReturn : 582.2974243164062
Train_StdReturn : 35.1473388671875
Train_MaxReturn : 612.49853515625
Train_MinReturn : 514.5028076171875
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 42210
TimeSinceStart : 609.0287318229675
Critic Training Loss : 3.0767507553100586
Critic Mean : 292.81640625
Actor Training Loss : -294.4369201660156
Initial_DataCollection_AverageReturn : -53.001216888427734
Done logging...




********** Iteration 33 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 567.0406494140625
Eval_StdReturn : 60.002071380615234
Eval_MaxReturn : 646.346923828125
Eval

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 644.9080810546875
Eval_StdReturn : 6.920444965362549
Eval_MaxReturn : 651.5767211914062
Eval_MinReturn : 634.420166015625
Eval_AverageEpLen : 201.0
Train_AverageReturn : 627.4456787109375
Train_StdReturn : 28.83306312561035
Train_MaxReturn : 653.9252319335938
Train_MinReturn : 574.9691162109375
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 53265
TimeSinceStart : 819.5125896930695
Critic Training Loss : 0.940638542175293
Critic Mean : 325.6229248046875
Actor Training Loss : -327.39990234375
Initial_DataCollection_AverageReturn : -53.001216888427734
Done logging...




********** Iteration 44 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 683.0411987304688
Eval_StdReturn : 21.250858306884766
Eval_MaxReturn : 709.697875976

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : -54.915313720703125
Eval_StdReturn : 53.12058639526367
Eval_MaxReturn : 32.72560501098633
Eval_MinReturn : -101.84767150878906
Eval_AverageEpLen : 201.0
Train_AverageReturn : -93.17149353027344
Train_StdReturn : 3.2919929027557373
Train_MaxReturn : -86.76609802246094
Train_MinReturn : -96.22815704345703
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 13065
TimeSinceStart : 68.00520706176758
Critic Training Loss : 0.5188156366348267
Critic Mean : 38.95003128051758
Actor Training Loss : -39.79171371459961
Initial_DataCollection_AverageReturn : -57.30979919433594
Done logging...




********** Iteration 4 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 187.21176147460938
Eval_StdReturn : 30.95991325378418
Eval_MaxReturn : 220

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 340.7718200683594
Eval_StdReturn : 71.93827056884766
Eval_MaxReturn : 446.171875
Eval_MinReturn : 243.5679168701172
Eval_AverageEpLen : 201.0
Train_AverageReturn : 292.6722106933594
Train_StdReturn : 85.98463439941406
Train_MaxReturn : 452.1413269042969
Train_MinReturn : 207.26675415039062
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 24120
TimeSinceStart : 269.3797631263733
Critic Training Loss : 0.6846283674240112
Critic Mean : 197.54595947265625
Actor Training Loss : -199.10740661621094
Initial_DataCollection_AverageReturn : -57.30979919433594
Done logging...




********** Iteration 15 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 343.530517578125
Eval_StdReturn : 48.53744888305664
Eval_MaxReturn : 437.499145507812

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 522.2496948242188
Eval_StdReturn : 20.495969772338867
Eval_MaxReturn : 544.5376586914062
Eval_MinReturn : 483.7055358886719
Eval_AverageEpLen : 201.0
Train_AverageReturn : 505.71221923828125
Train_StdReturn : 23.481733322143555
Train_MaxReturn : 526.937744140625
Train_MinReturn : 461.70147705078125
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 35175
TimeSinceStart : 490.4499170780182
Critic Training Loss : 1.6813393831253052
Critic Mean : 252.1392059326172
Actor Training Loss : -253.73748779296875
Initial_DataCollection_AverageReturn : -57.30979919433594
Done logging...




********** Iteration 26 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 508.10858154296875
Eval_StdReturn : 33.19829177856445
Eval_MaxReturn : 549.51

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 590.8043823242188
Eval_StdReturn : 16.68000030517578
Eval_MaxReturn : 609.492431640625
Eval_MinReturn : 566.3306884765625
Eval_AverageEpLen : 201.0
Train_AverageReturn : 569.7374267578125
Train_StdReturn : 37.032684326171875
Train_MaxReturn : 612.1680908203125
Train_MinReturn : 501.86456298828125
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 46230
TimeSinceStart : 704.956621170044
Critic Training Loss : 0.7830114364624023
Critic Mean : 290.03411865234375
Actor Training Loss : -291.5611572265625
Initial_DataCollection_AverageReturn : -57.30979919433594
Done logging...




********** Iteration 37 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 503.63641357421875
Eval_StdReturn : 88.48041534423828
Eval_MaxReturn : 605.42285

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 620.275146484375
Eval_StdReturn : 37.99312973022461
Eval_MaxReturn : 663.49462890625
Eval_MinReturn : 554.2861328125
Eval_AverageEpLen : 201.0
Train_AverageReturn : 629.9954833984375
Train_StdReturn : 8.502632141113281
Train_MaxReturn : 644.68017578125
Train_MinReturn : 618.5933837890625
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 57285
TimeSinceStart : 910.269702911377
Critic Training Loss : 0.7392016649246216
Critic Mean : 318.253173828125
Actor Training Loss : -319.74029541015625
Initial_DataCollection_AverageReturn : -57.30979919433594
Done logging...




********** Iteration 48 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 629.515625
Eval_StdReturn : 28.99941635131836
Eval_MaxReturn : 670.9302978515625
Eval_MinR

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 135.34637451171875
Eval_StdReturn : 33.240848541259766
Eval_MaxReturn : 185.63919067382812
Eval_MinReturn : 82.64554595947266
Eval_AverageEpLen : 201.0
Train_AverageReturn : 154.79464721679688
Train_StdReturn : 13.404752731323242
Train_MaxReturn : 171.15994262695312
Train_MinReturn : 136.44964599609375
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 17085
TimeSinceStart : 143.79071807861328
Critic Training Loss : 0.6230981349945068
Critic Mean : 74.67707824707031
Actor Training Loss : -75.71951293945312
Initial_DataCollection_AverageReturn : -60.98212814331055
Done logging...




********** Iteration 8 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 179.94143676757812
Eval_StdReturn : 48.453277587890625
Eval_MaxReturn : 23

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 330.90142822265625
Eval_StdReturn : 64.97025299072266
Eval_MaxReturn : 399.74169921875
Eval_MinReturn : 225.84930419921875
Eval_AverageEpLen : 201.0
Train_AverageReturn : 426.4727478027344
Train_StdReturn : 34.10279846191406
Train_MaxReturn : 464.93963623046875
Train_MinReturn : 367.15594482421875
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 28140
TimeSinceStart : 347.5069100856781
Critic Training Loss : 3.3256938457489014
Critic Mean : 205.3630828857422
Actor Training Loss : -207.2161407470703
Initial_DataCollection_AverageReturn : -60.98212814331055
Done logging...




********** Iteration 19 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 465.5052185058594
Eval_StdReturn : 19.08128547668457
Eval_MaxReturn : 490.09356

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 431.457275390625
Eval_StdReturn : 126.84259796142578
Eval_MaxReturn : 580.45166015625
Eval_MinReturn : 267.4462585449219
Eval_AverageEpLen : 201.0
Train_AverageReturn : 559.89013671875
Train_StdReturn : 14.651942253112793
Train_MaxReturn : 587.482177734375
Train_MinReturn : 547.2520141601562
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 39195
TimeSinceStart : 557.7432789802551
Critic Training Loss : 1.4543863534927368
Critic Mean : 269.675048828125
Actor Training Loss : -271.54827880859375
Initial_DataCollection_AverageReturn : -60.98212814331055
Done logging...




********** Iteration 30 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 538.0958251953125
Eval_StdReturn : 14.682600975036621
Eval_MaxReturn : 552.0822143554

At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 607.6558837890625
Eval_StdReturn : 35.90357971191406
Eval_MaxReturn : 645.7451171875
Eval_MinReturn : 542.9337768554688
Eval_AverageEpLen : 201.0
Train_AverageReturn : 565.2939453125
Train_StdReturn : 77.66166687011719
Train_MaxReturn : 643.484619140625
Train_MinReturn : 428.44720458984375
Train_AverageEpLen : 201.0
Train_EnvstepsSoFar : 50250
TimeSinceStart : 769.0237860679626
Critic Training Loss : 6.291599273681641
Critic Mean : 309.32720947265625
Actor Training Loss : -310.7425537109375
Initial_DataCollection_AverageReturn : -60.98212814331055
Done logging...




********** Iteration 41 ************

Collecting data to be used for training...
At timestep:     1005 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 618.4855346679688
Eval_StdReturn : 19.40786361694336
Eval_MaxReturn : 642.1265869140625

In [66]:
### Visualize Actor Critic results on HalfCheetah
%load_ext tensorboard
%tensorboard --logdir logs/actor_critic/HalfCheetah