<a href="https://colab.research.google.com/github/rohandekate10/Purdue---Offline-Reinforcement-Learning-Research/blob/main/Solving_Pendulum_Problem_with_Offline_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Gym Pendulum

Reference - https://www.gymlibrary.dev/environments/classic_control/pendulum/

# Setting Things Up

Run once to set up Colab and install dependencies.

In [None]:
# Setup Colab

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

# Install d3rlpy
!pip install d3rlpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install gym[classic_control]
#suppress pygame's attempt to use a real display device by telling SDL to use a dummy driver
import os
os.environ['SDL_VIDEODRIVER']='dummy'
import pygame
pygame.display.set_mode((640,480))

# Reproducibility
# set random seeds in random module, numpy module and PyTorch module.
import d3rlpy
d3rlpy.seed(100)

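The effect of seeding can be illustrated with the standard library alone. The sketch below is not d3rlpy's implementation. It only shows that re-seeding a generator reproduces the same sequence, which is the property the seeding call above is meant to provide for the `random`, NumPy, and PyTorch generators. The helper name `sample_actions` is hypothetical.

```python
import random

def sample_actions(seed, n=5):
    """Draw n pseudo-random torques in [-1, 1] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

first_run = sample_actions(100)
second_run = sample_actions(100)
other_seed = sample_actions(101)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```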


# Import Datasets

In [None]:
# Load Datasets in the d3rlpy package

from d3rlpy.datasets import get_pendulum  # targets Pendulum-v0, which is deprecated and no longer functional

In [None]:
# Create pendulum environment and dataset
#dataset, env = get_pendulum()

In [None]:
'''collect method
Data collection plays an important role in offline RL experiments, especially when you try new tasks.
The `collect` method runs a policy in the environment and stores the transitions in a buffer without updating the policy.'''

import d3rlpy
import gym

#prepare environment
#env = gym.make('Pendulum-v0')
env = gym.make('Pendulum-v1', g=9.81)

#prepare algorithm
#sac = d3rlpy.algos.SAC()

#continuous action-space
policy = d3rlpy.algos.RandomPolicy()

#discrete action-space
#policy = d3rlpy.algos.DiscreteRandomPolicy()

#prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)

#start data collection without updates
policy.collect(env, buffer)

#export to MDPDataset
dataset = buffer.to_mdp_dataset()

#save as file
dataset.dump('pendulum.h5')

# Random policies were introduced along with `collect`; they are useful for collecting a dataset with a random policy.


'''Enhancements
- CQL and BEAR become closer to the official implementations
- `callback` argument has been added to algorithms
- random dataset has been added to cartpole and pendulum dataset
- you can specify it via `dataset_type='random'` at `get_cartpole` and `get_pendulum` method'''

2022-11-03 00:17.43 [debug    ] Building model...
2022-11-03 00:17.43 [debug    ] Model has been built.


  "Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
  "Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."


  0%|          | 0/1000000 [00:00<?, ?it/s]
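The progress bar above counts 1,000,000 collection steps, while the buffer was created with `maxlen=100000`. Assuming the buffer drops its oldest transitions once full (first in, first out), only the most recent 100,000 transitions would end up in the exported dataset. The sketch below mimics that behaviour on a scaled-down example with `collections.deque`. The function `collect_random` and the integer stand-in for an environment step are hypothetical, and this is not d3rlpy's `ReplayBuffer`.

```python
import random
from collections import deque

def collect_random(n_steps, maxlen, seed=0):
    """Fill a bounded FIFO buffer with (step, action) pairs from a uniform random policy."""
    rng = random.Random(seed)
    buffer = deque(maxlen=maxlen)  # oldest entries are discarded automatically
    for step in range(n_steps):
        action = rng.uniform(-1.0, 1.0)  # random torque, standing in for the policy output
        buffer.append((step, action))
    return buffer

buffer = collect_random(n_steps=1000, maxlen=100)
print(len(buffer))   # capped at maxlen
print(buffer[0][0])  # index of the oldest step still stored
```

If the whole random rollout should be kept, `maxlen` would need to be at least as large as the number of collected steps.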

## Check Dataset

In [None]:
episode = dataset.episodes[0]

# Columns: x = cos(theta), y = sin(theta), angular velocity
episode.observations

array([[-7.43572235e-01, -6.68655634e-01, -7.58046806e-01],
       [-7.88283765e-01, -6.15311921e-01, -1.39235473e+00],
       [-8.45494628e-01, -5.33983946e-01, -1.98952007e+00],
       [-9.03054893e-01, -4.29525167e-01, -2.38677382e+00],
       [-9.52883065e-01, -3.03337902e-01, -2.71546483e+00],
       [-9.88052785e-01, -1.54115841e-01, -3.06922197e+00],
       [-9.99987960e-01,  4.90204385e-03, -3.19269228e+00],
       [-9.86084163e-01,  1.66246846e-01, -3.24240494e+00],
       [-9.48754907e-01,  3.16012919e-01, -3.09003544e+00],
       [-8.93292546e-01,  4.49475706e-01, -2.89308524e+00],
       [-8.26134801e-01,  5.63472509e-01, -2.64809537e+00],
       [-7.58924305e-01,  6.51178837e-01, -2.21107221e+00],
       [-7.01421261e-01,  7.12746918e-01, -1.68539965e+00],
       [-6.60988271e-01,  7.50396252e-01, -1.10509229e+00],
       [-6.42452419e-01,  7.66325593e-01, -4.88816112e-01],
       [-6.47054553e-01,  7.62443662e-01,  1.20414726e-01],
       [-6.74823403e-01,  7.37979233e-01

In [None]:
episode.observations.shape

(199, 3)
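Each observation row stores the angle as a (cos, sin) pair, so the angle itself can be recovered with `atan2`. The sketch below does this for the first row printed above. The helper `angle_from_observation` is a hypothetical name, and the convention that an angle of 0 means upright is taken from the Gym documentation referenced at the top.

```python
import math

def angle_from_observation(obs):
    """Return (theta, angular_velocity) from a (cos(theta), sin(theta), angular velocity) row."""
    cos_theta, sin_theta, angular_velocity = obs
    theta = math.atan2(sin_theta, cos_theta)  # result lies in [-pi, pi]
    return theta, angular_velocity

# First row of episode.observations shown above.
first_obs = (-7.43572235e-01, -6.68655634e-01, -7.58046806e-01)
theta, omega = angle_from_observation(first_obs)

print(round(theta, 3))                # angle in radians
print(round(math.degrees(theta), 1))  # same angle in degrees
```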

In [None]:
# Torque
episode.actions

array([[-0.94896334],
       [-0.962998  ],
       [-0.02916598],
       [-0.08445314],
       [-0.8705091 ],
       [-0.06719635],
       [-0.35546294],
       [ 0.20035684],
       [-0.23704217],
       [-0.57141346],
       [ 0.14965533],
       [ 0.31045195],
       [ 0.37269217],
       [ 0.42781407],
       [ 0.30271184],
       [ 0.39217353],
       [ 0.3393412 ],
       [ 0.86981666],
       [-0.5371492 ],
       [ 0.792887  ],
       [ 0.7335417 ],
       [-0.09701575],
       [-0.112661  ],
       [ 0.3646668 ],
       [-0.4424601 ],
       [ 0.4587785 ],
       [-0.9707479 ],
       [ 0.5747761 ],
       [ 0.16082293],
       [ 0.10764398],
       [-0.8909391 ],
       [-0.46323302],
       [-0.58708626],
       [ 0.1194509 ],
       [-0.7067579 ],
       [-0.6642202 ],
       [-0.04457137],
       [-0.15284209],
       [ 0.4608678 ],
       [ 0.15799785],
       [-0.24622844],
       [-0.89547783],
       [ 0.31955856],
       [ 0.14915474],
       [-0.56532514],
       [-0

In [None]:
episode.actions.shape

(199, 1)

In [None]:
episode.rewards

array([ -5.8625765,  -6.339297 ,  -7.043384 ,  -7.846861 ,  -8.766283 ,
        -9.86335  , -10.858283 ,  -9.899427 ,  -8.907675 ,  -7.9951625,
        -7.168161 ,  -6.4058223,  -5.7981734,  -5.37984  ,  -5.170035 ,
        -5.1750054,  -5.3980384,  -5.83469  ,  -6.524614 ,  -7.281151 ,
        -8.300965 ,  -9.459543 , -10.58529  , -10.437466 ,  -9.411457 ,
        -8.354664 ,  -7.4385605,  -6.5517573,  -5.905061 ,  -5.3854036,
        -5.0398955,  -4.903833 ,  -5.012407 ,  -5.3618636,  -5.8888907,
        -6.6752696,  -7.6674585,  -8.746128 ,  -9.923375 , -11.041434 ,
       -10.091524 ,  -9.045189 ,  -8.069565 ,  -7.071994 ,  -6.227502 ,
        -5.56807  ,  -5.0535784,  -4.6999826,  -4.563006 ,  -4.6703196,
        -4.9408503,  -5.477125 ,  -6.256686 ,  -7.1169105,  -8.204477 ,
        -9.257075 , -10.471008 , -10.708343 ,  -9.687781 ,  -8.532714 ,
        -7.528878 ,  -6.681453 ,  -5.88086  ,  -5.311302 ,  -4.9391723,
        -4.784148 ,  -4.796957 ,  -4.9972334,  -5.3369846,  -5.8
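According to the Gym documentation referenced at the top, the Pendulum reward is `r = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2)`, with `theta` normalised to [-pi, pi]. The sketch below recomputes the first reward from the first observation and the first action printed above. The helper `pendulum_reward` is hypothetical and not part of Gym.

```python
import math

def pendulum_reward(obs, torque):
    """Reward of one Pendulum step from its observation and applied torque."""
    cos_theta, sin_theta, angular_velocity = obs
    theta = math.atan2(sin_theta, cos_theta)  # already normalised to [-pi, pi]
    return -(theta ** 2 + 0.1 * angular_velocity ** 2 + 0.001 * torque ** 2)

first_obs = (-7.43572235e-01, -6.68655634e-01, -7.58046806e-01)
first_action = -0.94896334

reward = pendulum_reward(first_obs, first_action)
print(round(reward, 3))  # compare with the first entry of episode.rewards above
```

The angle term dominates here, which is why the rewards oscillate as the pendulum swings back and forth under random torques.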

In [None]:
# Transition
transition = episode.transitions[0]
transition.observation

array([-0.74357224, -0.66865563, -0.7580468 ], dtype=float32)

# Split Dataset Into Training & Test

In [None]:
from sklearn.model_selection import train_test_split

train_episodes, test_episodes = train_test_split(dataset, test_size=0.2)
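Because the dataset is indexed by episode, the split above operates on whole episodes rather than individual transitions, so no episode contributes to both sets. A minimal standard-library sketch of the same idea follows. The helper `split_episodes` and the integer stand-ins for episodes are hypothetical, and the rounding rule is an assumption that may differ from scikit-learn's.

```python
import random

def split_episodes(episodes, test_size=0.2, seed=0):
    """Shuffle whole episodes and hold out a fraction of them for evaluation."""
    shuffled = list(episodes)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

episode_ids = list(range(10))  # stand-ins for Episode objects
train_ids, test_ids = split_episodes(episode_ids)
print(len(train_ids), len(test_ids))
```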

# Setup Algorithm

In [None]:
from d3rlpy.algos import CQL

# if you don't use GPU, set use_gpu=False instead.
cql = CQL(use_gpu=True, scaler='standard')

# initialize neural networks with the given observation shape and action size.
# this is not necessary when you directly call fit or fit_online method.
cql.build_with_dataset(dataset)

cql

d3rlpy.algos.cql.CQL(action_scaler=None, actor_encoder_factory=d3rlpy.models.encoders.DefaultEncoderFactory(activation='relu', use_batch_norm=False, dropout_rate=None), actor_learning_rate=0.0001, actor_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), alpha_learning_rate=0.0001, alpha_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), alpha_threshold=10.0, batch_size=256, conservative_weight=5.0, critic_encoder_factory=d3rlpy.models.encoders.DefaultEncoderFactory(activation='relu', use_batch_norm=False, dropout_rate=None), critic_learning_rate=0.0003, critic_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), gamma=0.99, generated_maxlen=100000, impl=<d3rlpy.algos.torch.cql_impl.CQLImpl object at 0x7f8b6f6a0490>, initial_alpha=1.0, initial_temper
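The `scaler='standard'` option standardises observations before they reach the networks. The sketch below shows the underlying arithmetic, subtracting the per-column mean and dividing by the per-column standard deviation, on three observation rows from earlier. The helper names and the small `eps` term are assumptions for illustration, not d3rlpy's code.

```python
import statistics

def fit_standard_scaler(rows):
    """Return per-column (mean, std) computed over a list of observation rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def transform(row, means, stds, eps=1e-3):
    """Standardise one row; eps guards against division by zero."""
    return [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]

# First three observation rows printed earlier, rounded for brevity.
rows = [
    [-0.7436, -0.6687, -0.7580],
    [-0.7883, -0.6153, -1.3924],
    [-0.8455, -0.5340, -1.9895],
]
means, stds = fit_standard_scaler(rows)
scaled = [transform(row, means, stds) for row in rows]
print([round(sum(col) / len(col), 6) for col in zip(*scaled)])  # column means after scaling
```

Standardising matters here because the angular velocity column has a much wider range than the cos and sin columns.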

# Setup Metrics

In [None]:
from d3rlpy.metrics.scorer import td_error_scorer
'''Returns average TD error.
This metric indicates how much the Q functions overfit to the training set.
If the TD error on held-out episodes is large, the Q functions are overfitting.'''
#td_error_train = td_error_scorer(discrete_bc, test_episodes) #Error: BC does not support value estimation

from d3rlpy.metrics.scorer import discounted_sum_of_advantage_scorer 
'''Returns average of discounted sum of advantage.
This metric indicates how the greedy policy's actions differ from the dataset actions in action-value space.
If the sum of advantage is small, the policy selects actions with larger estimated action-values.'''
#discounted_sum_advtg_train = discounted_sum_of_advantage_scorer(train_episodes, test_episodes) #Error: 'list' object has no attribute 'n_frames'

from d3rlpy.metrics.scorer import average_value_estimation_scorer
'''Returns average value estimation.
This metric indicates the scale of the Q-function estimates.
If the average value estimation is too large, the Q functions overestimate action-values, which can make training fail.'''
#avg_val_estimate_train = average_value_estimation_scorer(discrete_bc, test_episodes) #Error: assert self._mean is not None and self._std is not None

from d3rlpy.metrics.scorer import value_estimation_std_scorer
'''Returns standard deviation of value estimation.
This metric indicates how confident the Q functions are for the given episodes.
It is more accurate with bootstrap enabled and a larger n_critics in the algorithm.
If the standard deviation of value estimation is large, the Q functions are overfitting to the training set.'''
#val_estimate_std_train = value_estimation_std_scorer(discrete_bc, train_episodes) #AssertionError:

from d3rlpy.metrics.scorer import initial_state_value_estimation_scorer
'''
Returns mean estimated action-values at the initial states.
This metric indicates how much return the trained policy would obtain when deployed from the initial states.
If the estimated value is large, the trained policy is expected to get higher returns.'''
#init_state_value_estimate_train = initial_state_value_estimation_scorer(discrete_bc, train_episodes) #AssertionError:

from d3rlpy.metrics.scorer import soft_opc_scorer
'''
Returns Soft Off-Policy Classification metric.
This function returns a scorer function in the standard scikit-learn scorer style.
The metric measures the gap in action-value estimates between the success episodes and all episodes.
If the learned Q-function is optimal, action-values in success episodes are expected to be higher than the others.
A success episode is defined as an episode with a return above the given threshold.'''
#scorer = soft_opc_scorer(return_threshold=180)

from d3rlpy.metrics.scorer import continuous_action_diff_scorer
'''
Returns squared difference of actions between algorithm and dataset.
This metric indicates how different the greedy policy is from the given episodes in continuous action-space.
If the given episodes are near-optimal, a small action difference is better.'''

from d3rlpy.metrics.scorer import discrete_action_match_scorer
'''
Returns percentage of identical actions between algorithm and dataset.
This metric indicates how different the greedy policy is from the given episodes in discrete action-space.
If the given episodes are near-optimal, a large percentage is better.'''

from d3rlpy.metrics.scorer import evaluate_on_environment
'''Returns scorer function of evaluation on environment.
This function returns a scorer function in the standard scikit-learn scorer style.
The scorer runs the policy in the environment, which makes it the ideal metric for evaluating the resulting policy.'''

from d3rlpy.metrics.comparer import compare_continuous_action_diff
'''Returns scorer function of action difference between algorithms.
This metric indicates how different the two algorithms are in continuous action-space.
If the algorithm to compare with is near-optimal, a small action difference is better.'''

from d3rlpy.metrics.comparer import compare_discrete_action_match
'''Returns scorer function of action matches between algorithms.
This metric indicates how different the two algorithms are in discrete action-space.
If the algorithm to compare with is near-optimal, a high match rate is better.'''

from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
'''Returns MSE of observation prediction.
This metric indicates how well the dynamics model generalizes to the test set.
If the MSE is large, the dynamics model is overfitting.'''

from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
'''Returns MSE of reward prediction.
This metric indicates how well the dynamics model generalizes to the test set.
If the MSE is large, the dynamics model is overfitting.'''

from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
'''Returns prediction variance of ensemble dynamics.
This metric indicates how confident the dynamics model is on the test set.
If the variance is large, the dynamics model has large uncertainty.'''

all_scorers={'td_error' : td_error_scorer,
             'discounted_sum_of_advantage': discounted_sum_of_advantage_scorer,
             'average_value_estimation' : average_value_estimation_scorer,
             'value_estimation_std':value_estimation_std_scorer,
             'initial_state_value_estimation':initial_state_value_estimation_scorer,
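             # Pendulum rewards are never positive, so no episode reaches a return of 180
             'soft_opc' :soft_opc_scorer(return_threshold=180),
             'continuous_action_diff':continuous_action_diff_scorer,
             # intended for discrete action spaces; Pendulum's action is continuous
             'discrete_action_match':discrete_action_match_scorer,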
             'evaluate_on_environment': evaluate_on_environment(env),
             # cql is compared against itself here, so this metric is always 0.0
             'compare_continuous_action_diff': compare_continuous_action_diff(cql),
             #'dynamics_observation_prediction_error':dynamics_observation_prediction_error_scorer,
             #'dynamics_reward_prediction_error':dynamics_reward_prediction_error_scorer
             }
all_scorers

{'td_error': <function d3rlpy.metrics.scorer.td_error_scorer(algo: d3rlpy.metrics.scorer.AlgoProtocol, episodes: List[d3rlpy.dataset.Episode]) -> float>,
 'discounted_sum_of_advantage': <function d3rlpy.metrics.scorer.discounted_sum_of_advantage_scorer(algo: d3rlpy.metrics.scorer.AlgoProtocol, episodes: List[d3rlpy.dataset.Episode]) -> float>,
 'average_value_estimation': <function d3rlpy.metrics.scorer.average_value_estimation_scorer(algo: d3rlpy.metrics.scorer.AlgoProtocol, episodes: List[d3rlpy.dataset.Episode]) -> float>,
 'value_estimation_std': <function d3rlpy.metrics.scorer.value_estimation_std_scorer(algo: d3rlpy.metrics.scorer.AlgoProtocol, episodes: List[d3rlpy.dataset.Episode]) -> float>,
 'initial_state_value_estimation': <function d3rlpy.metrics.scorer.initial_state_value_estimation_scorer(algo: d3rlpy.metrics.scorer.AlgoProtocol, episodes: List[d3rlpy.dataset.Episode]) -> float>,
 'soft_opc': <function d3rlpy.metrics.scorer.soft_opc_scorer.<locals>.scorer(algo: d3rlpy.me
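The TD error described above compares a Q-value with its one-step bootstrapped target. A minimal sketch of the squared one-step TD error follows, using `gamma=0.99` as shown in the CQL configuration. The Q-values are made-up numbers, and `squared_td_error` is a hypothetical helper rather than the d3rlpy scorer.

```python
def squared_td_error(q_sa, reward, q_next, gamma=0.99, terminal=False):
    """Squared one-step TD error: (Q(s, a) - (r + gamma * Q(s', a')))^2."""
    target = reward if terminal else reward + gamma * q_next
    return (q_sa - target) ** 2

# Made-up value estimates for two transitions; rewards resemble the Pendulum data above.
transitions = [
    {"q_sa": -30.0, "reward": -5.86, "q_next": -25.0},
    {"q_sa": -25.0, "reward": -6.34, "q_next": -20.0},
]
errors = [squared_td_error(**t) for t in transitions]
print(sum(errors) / len(errors))  # average TD error over the batch
```

Evaluating this quantity on held-out episodes, as `eval_episodes` does during training, is what makes it usable as an overfitting signal.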

# Start Training

In [None]:
#Load the TensorBoard notebook extension
%load_ext tensorboard

import tensorflow as tf
import datetime

# Clear any logs from previous runs
#rm -rf ./logs/

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
cql.fit(train_episodes,
        n_epochs=100,
        verbose=True,
        eval_episodes=test_episodes,
        scorers=all_scorers,
        tensorboard_dir='cql_run'
        )

2022-11-03 00:25.31 [debug    ] RoundIterator is selected.
2022-11-03 00:25.31 [info     ] Directory is created at d3rlpy_logs/CQL_20221103002531
2022-11-03 00:25.31 [debug    ] Fitting scaler...              scaler=standard
2022-11-03 00:25.31 [info     ] Parameters are saved to d3rlpy_logs/CQL_20221103002531/params.json params={'action_scaler': None, 'actor_encoder_factory': {'type': 'default', 'params': {'activation': 'relu', 'use_batch_norm': False, 'dropout_rate': None}}, 'actor_learning_rate': 0.0001, 'actor_optim_factory': {'optim_cls': 'Adam', 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}, 'alpha_learning_rate': 0.0001, 'alpha_optim_factory': {'optim_cls': 'Adam', 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}, 'alpha_threshold': 10.0, 'batch_size': 256, 'conservative_weight': 5.0, 'critic_encoder_factory': {'type': 'default', 'params': {'activation': 'relu', 'use_batch_norm': False, 'dropout_rate': None}}, 'critic_learning

Epoch 1/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:25.50 [info     ] CQL_20221103002531: epoch=1 step=310 epoch=1 metrics={'time_sample_batch': 0.0007875950105728642, 'time_algorithm_update': 0.03695263631882206, 'temp_loss': 1.5528283830611938, 'temp': 0.9264482005949943, 'alpha_loss': -11.382873568996306, 'alpha': 1.0820077500035685, 'critic_loss': 11.7296535922635, 'actor_loss': 23.677820605616414, 'time_step': 0.037899459561993996, 'td_error': 45.817779341490656, 'discounted_sum_of_advantage': -2.0480497417351318, 'average_value_estimation': -28.35122573438889, 'value_estimation_std': 0.07755039680004763, 'initial_state_value_estimation': -18.536521911621094, 'soft_opc': nan, 'continuous_action_diff': 0.34615130289514556, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1382.0189640696585, 'compare_continuous_action_diff': 0.0} step=310




2022-11-03 00:25.50 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_310.pt
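The `soft_opc` value in the epoch log above is `nan`. A likely explanation is that the scorer was created with `return_threshold=180`, while Pendulum returns are never positive, so no evaluation episode counts as a success and the mean over success episodes is undefined. The sketch below is a simplified stand-in, not d3rlpy's implementation. The function `soft_opc`, the sample numbers, and the explicit `nan` for the empty case are assumptions for illustration.

```python
import math

def soft_opc(episode_returns, episode_values, return_threshold):
    """Mean value estimate of success episodes minus the mean over all episodes."""
    all_values = [v for values in episode_values for v in values]
    success_values = [
        v
        for ret, values in zip(episode_returns, episode_values)
        if ret >= return_threshold
        for v in values
    ]
    if not success_values:  # no episode reached the threshold
        return float("nan")
    return sum(success_values) / len(success_values) - sum(all_values) / len(all_values)

# Made-up returns and value estimates for two held-out episodes.
returns = [-1382.0, -950.0]
values = [[-30.0, -28.0], [-20.0, -18.0]]

print(soft_opc(returns, values, return_threshold=180))    # no success episodes
print(soft_opc(returns, values, return_threshold=-1000))  # gap for the better episode
```

A threshold chosen from the range of returns actually present in the dataset would give the metric something to separate.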


Epoch 2/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:26.10 [info     ] CQL_20221103002531: epoch=2 step=620 epoch=2 metrics={'time_sample_batch': 0.00108150512941422, 'time_algorithm_update': 0.04125119024707425, 'temp_loss': 1.504301239213636, 'temp': 0.8991311617435948, 'alpha_loss': -11.775877721848026, 'alpha': 1.1174159715252538, 'critic_loss': 12.116087421294182, 'actor_loss': 32.24229314250331, 'time_step': 0.04253011441999866, 'td_error': 44.010949489201295, 'discounted_sum_of_advantage': -2.5782382622894473, 'average_value_estimation': -36.956079611576534, 'value_estimation_std': 0.09173138431393915, 'initial_state_value_estimation': -25.776845932006836, 'soft_opc': nan, 'continuous_action_diff': 0.3496989122892053, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1351.3368746393248, 'compare_continuous_action_diff': 0.0} step=620




2022-11-03 00:26.10 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_620.pt


Epoch 3/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:26.28 [info     ] CQL_20221103002531: epoch=3 step=930 epoch=3 metrics={'time_sample_batch': 0.0007782013185562626, 'time_algorithm_update': 0.03656632361873503, 'temp_loss': 1.4577212437506646, 'temp': 0.872952627558862, 'alpha_loss': -12.188248871218773, 'alpha': 1.1544255375862122, 'critic_loss': 12.56622403975456, 'actor_loss': 40.95600094949045, 'time_step': 0.037495830751234485, 'td_error': 42.395805447783744, 'discounted_sum_of_advantage': -2.6585224872127164, 'average_value_estimation': -45.728118789650246, 'value_estimation_std': 0.09794835564976788, 'initial_state_value_estimation': -34.3692512512207, 'soft_opc': nan, 'continuous_action_diff': 0.3483290639898293, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1370.38731818816, 'compare_continuous_action_diff': 0.0} step=930




2022-11-03 00:26.28 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_930.pt


Epoch 4/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:26.49 [info     ] CQL_20221103002531: epoch=4 step=1240 epoch=4 metrics={'time_sample_batch': 0.0011353746537239322, 'time_algorithm_update': 0.04337808624390633, 'temp_loss': 1.4084152279361601, 'temp': 0.847811112865325, 'alpha_loss': -12.6302934985007, 'alpha': 1.1930666050603314, 'critic_loss': 13.110383498284124, 'actor_loss': 49.710082035679974, 'time_step': 0.044740516139614966, 'td_error': 40.65791243862985, 'discounted_sum_of_advantage': -3.8174310307260533, 'average_value_estimation': -54.229935372420606, 'value_estimation_std': 0.1223726407608141, 'initial_state_value_estimation': -44.0687141418457, 'soft_opc': nan, 'continuous_action_diff': 0.3613024747454824, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1341.8579174869712, 'compare_continuous_action_diff': 0.0} step=1240




2022-11-03 00:26.49 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_1240.pt


Epoch 5/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:27.07 [info     ] CQL_20221103002531: epoch=5 step=1550 epoch=5 metrics={'time_sample_batch': 0.0007551416274039977, 'time_algorithm_update': 0.03653955536503946, 'temp_loss': 1.3635088424528798, 'temp': 0.8236415011267508, 'alpha_loss': -13.103608995868314, 'alpha': 1.2333813821115802, 'critic_loss': 13.72760427536503, 'actor_loss': 58.40785667665543, 'time_step': 0.03744863540895524, 'td_error': 38.58242398235805, 'discounted_sum_of_advantage': -5.987208408820379, 'average_value_estimation': -62.86672281622827, 'value_estimation_std': 0.13622590891732025, 'initial_state_value_estimation': -54.824581146240234, 'soft_opc': nan, 'continuous_action_diff': 0.3791653393159784, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1217.6355647854336, 'compare_continuous_action_diff': 0.0} step=1550




2022-11-03 00:27.07 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_1550.pt


Epoch 6/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:27.29 [info     ] CQL_20221103002531: epoch=6 step=1860 epoch=6 metrics={'time_sample_batch': 0.0010198000938661637, 'time_algorithm_update': 0.041776112587221206, 'temp_loss': 1.3146791373529743, 'temp': 0.8003825799111397, 'alpha_loss': -13.61239374222294, 'alpha': 1.275425330285103, 'critic_loss': 14.382290557123, 'actor_loss': 66.96375649975192, 'time_step': 0.042981663057881016, 'td_error': 37.009519850711705, 'discounted_sum_of_advantage': -7.116457760418803, 'average_value_estimation': -71.30763076448312, 'value_estimation_std': 0.14843093815741784, 'initial_state_value_estimation': -65.88003540039062, 'soft_opc': nan, 'continuous_action_diff': 0.38937854630523433, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1008.5475770417812, 'compare_continuous_action_diff': 0.0} step=1860




2022-11-03 00:27.29 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_1860.pt


Epoch 7/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:27.49 [info     ] CQL_20221103002531: epoch=7 step=2170 epoch=7 metrics={'time_sample_batch': 0.000989737049225838, 'time_algorithm_update': 0.043250339261947136, 'temp_loss': 1.2693898139461395, 'temp': 0.7779710504316515, 'alpha_loss': -14.143150141931349, 'alpha': 1.3192159033590747, 'critic_loss': 15.147567198353428, 'actor_loss': 75.36984949419575, 'time_step': 0.04443991030416181, 'td_error': 35.467400735055286, 'discounted_sum_of_advantage': -8.927363483833346, 'average_value_estimation': -79.48834403789881, 'value_estimation_std': 0.16006344754220395, 'initial_state_value_estimation': -76.68757629394531, 'soft_opc': nan, 'continuous_action_diff': 0.4046571286757591, 'discrete_action_match': 0.0, 'evaluate_on_environment': -1008.9416677297975, 'compare_continuous_action_diff': 0.0} step=2170




2022-11-03 00:27.49 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_2170.pt


Epoch 8/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:28.10 [info     ] CQL_20221103002531: epoch=8 step=2480 epoch=8 metrics={'time_sample_batch': 0.0011117358361521074, 'time_algorithm_update': 0.04169234537309216, 'temp_loss': 1.2233870552432153, 'temp': 0.7563540475983773, 'alpha_loss': -14.706059378962363, 'alpha': 1.364777264671941, 'critic_loss': 16.103085797832858, 'actor_loss': 83.63062293760238, 'time_step': 0.042991712016444056, 'td_error': 34.080657802981456, 'discounted_sum_of_advantage': -10.541739217691203, 'average_value_estimation': -87.71369588588497, 'value_estimation_std': 0.17155858189158857, 'initial_state_value_estimation': -87.21977233886719, 'soft_opc': nan, 'continuous_action_diff': 0.4335734429604144, 'discrete_action_match': 0.0, 'evaluate_on_environment': -888.6307755691271, 'compare_continuous_action_diff': 0.0} step=2480




2022-11-03 00:28.10 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_2480.pt


Epoch 9/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:28.33 [info     ] CQL_20221103002531: epoch=9 step=2790 epoch=9 metrics={'time_sample_batch': 0.0012297422655167117, 'time_algorithm_update': 0.046970605081127534, 'temp_loss': 1.1778975017609135, 'temp': 0.7354994760405633, 'alpha_loss': -15.305879651346514, 'alpha': 1.4121608511094124, 'critic_loss': 17.134791934105657, 'actor_loss': 91.74265899658204, 'time_step': 0.048436912413566346, 'td_error': 32.61567769876938, 'discounted_sum_of_advantage': -12.658120251194804, 'average_value_estimation': -95.64454075116755, 'value_estimation_std': 0.17961847652803573, 'initial_state_value_estimation': -96.69023895263672, 'soft_opc': nan, 'continuous_action_diff': 0.43832714464613093, 'discrete_action_match': 0.0, 'evaluate_on_environment': -994.2170127297719, 'compare_continuous_action_diff': 0.0} step=2790




2022-11-03 00:28.33 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_2790.pt


Epoch 10/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:28.51 [info     ] CQL_20221103002531: epoch=10 step=3100 epoch=10 metrics={'time_sample_batch': 0.0008084181816347184, 'time_algorithm_update': 0.0383561618866459, 'temp_loss': 1.1334995715848861, 'temp': 0.7153528836465651, 'alpha_loss': -15.931469714257025, 'alpha': 1.4614241707709528, 'critic_loss': 18.226592033140122, 'actor_loss': 99.74447757351783, 'time_step': 0.03932829595381214, 'td_error': 31.440532729642236, 'discounted_sum_of_advantage': -14.258887050468402, 'average_value_estimation': -103.3129471249909, 'value_estimation_std': 0.19662020450086842, 'initial_state_value_estimation': -105.27530670166016, 'soft_opc': nan, 'continuous_action_diff': 0.443284802495564, 'discrete_action_match': 0.0, 'evaluate_on_environment': -758.1083516595398, 'compare_continuous_action_diff': 0.0} step=3100




2022-11-03 00:28.51 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_3100.pt


Epoch 11/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:29.10 [info     ] CQL_20221103002531: epoch=11 step=3410 epoch=11 metrics={'time_sample_batch': 0.0007723054578227382, 'time_algorithm_update': 0.03726540842363911, 'temp_loss': 1.09742977042352, 'temp': 0.6958395100408985, 'alpha_loss': -16.60484044782577, 'alpha': 1.512587663435167, 'critic_loss': 19.336881772933467, 'actor_loss': 107.61107039913054, 'time_step': 0.038171147531078704, 'td_error': 30.22601579533578, 'discounted_sum_of_advantage': -15.738919327747375, 'average_value_estimation': -111.30560027781648, 'value_estimation_std': 0.20782155875688907, 'initial_state_value_estimation': -113.52323150634766, 'soft_opc': nan, 'continuous_action_diff': 0.4591246728767505, 'discrete_action_match': 2.5255050757601015e-07, 'evaluate_on_environment': -1067.2941004555655, 'compare_continuous_action_diff': 0.0} step=3410




2022-11-03 00:29.10 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_3410.pt


Epoch 12/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:29.30 [info     ] CQL_20221103002531: epoch=12 step=3720 epoch=12 metrics={'time_sample_batch': 0.0009537427656112179, 'time_algorithm_update': 0.041491265450754475, 'temp_loss': 1.0590598771649022, 'temp': 0.6768684296838698, 'alpha_loss': -17.29815588920347, 'alpha': 1.565710711479187, 'critic_loss': 20.551609777635143, 'actor_loss': 115.32425492809665, 'time_step': 0.04263048787270823, 'td_error': 29.67498668421378, 'discounted_sum_of_advantage': -14.904979062547092, 'average_value_estimation': -118.7774653432369, 'value_estimation_std': 0.22067764688387415, 'initial_state_value_estimation': -120.54293823242188, 'soft_opc': nan, 'continuous_action_diff': 0.4575607284420249, 'discrete_action_match': 0.0, 'evaluate_on_environment': -890.9681684060939, 'compare_continuous_action_diff': 0.0} step=3720




2022-11-03 00:29.30 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_3720.pt


Epoch 13/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:29.49 [info     ] CQL_20221103002531: epoch=13 step=4030 epoch=13 metrics={'time_sample_batch': 0.0008127743198025611, 'time_algorithm_update': 0.0368228635480327, 'temp_loss': 1.024507445866062, 'temp': 0.6584541059309437, 'alpha_loss': -17.992088440925844, 'alpha': 1.6207963016725355, 'critic_loss': 21.692893526631018, 'actor_loss': 122.92265071253622, 'time_step': 0.03779460307090513, 'td_error': 28.22128739360723, 'discounted_sum_of_advantage': -17.014715151417512, 'average_value_estimation': -126.58447136122152, 'value_estimation_std': 0.21328211192224572, 'initial_state_value_estimation': -127.59979248046875, 'soft_opc': nan, 'continuous_action_diff': 0.48489492241990795, 'discrete_action_match': 0.0, 'evaluate_on_environment': -988.3668696390472, 'compare_continuous_action_diff': 0.0} step=4030




2022-11-03 00:29.49 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_4030.pt


Epoch 14/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:30.09 [info     ] CQL_20221103002531: epoch=14 step=4340 epoch=14 metrics={'time_sample_batch': 0.0009488528774630639, 'time_algorithm_update': 0.03976715687782534, 'temp_loss': 0.9898731445112536, 'temp': 0.6405377380309566, 'alpha_loss': -18.706312271856493, 'alpha': 1.6778634705851154, 'critic_loss': 22.827614458145632, 'actor_loss': 130.38098863170993, 'time_step': 0.04086515672745243, 'td_error': 27.00390677677181, 'discounted_sum_of_advantage': -18.63283093574594, 'average_value_estimation': -133.7112456957976, 'value_estimation_std': 0.233083325901361, 'initial_state_value_estimation': -134.0327911376953, 'soft_opc': nan, 'continuous_action_diff': 0.4882314431840984, 'discrete_action_match': 0.0, 'evaluate_on_environment': -918.1393226333427, 'compare_continuous_action_diff': 0.0} step=4340




2022-11-03 00:30.09 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_4340.pt


Epoch 15/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:30.28 [info     ] CQL_20221103002531: epoch=15 step=4650 epoch=15 metrics={'time_sample_batch': 0.0007946214368266444, 'time_algorithm_update': 0.03744856772884246, 'temp_loss': 0.9576962315267132, 'temp': 0.6231121778488159, 'alpha_loss': -19.446248725152785, 'alpha': 1.7368562440718374, 'critic_loss': 23.806026132645144, 'actor_loss': 137.71927268735823, 'time_step': 0.03839555017409786, 'td_error': 26.17916243208109, 'discounted_sum_of_advantage': -18.136780847939548, 'average_value_estimation': -141.0836592026741, 'value_estimation_std': 0.23844953005957015, 'initial_state_value_estimation': -140.6465606689453, 'soft_opc': nan, 'continuous_action_diff': 0.4869693307599219, 'discrete_action_match': 0.0, 'evaluate_on_environment': -956.417497644243, 'compare_continuous_action_diff': 0.0} step=4650




2022-11-03 00:30.28 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_4650.pt


2022-11-03 00:30.48 [info     ] CQL_20221103002531: epoch=16 step=4960 epoch=16 metrics={'time_sample_batch': 0.0010183588151008852, 'time_algorithm_update': 0.04094397406424245, 'temp_loss': 0.9247773651153811, 'temp': 0.6062146207978648, 'alpha_loss': -20.15659460867605, 'alpha': 1.7979431782999347, 'critic_loss': 24.66725261442123, 'actor_loss': 144.95645141601562, 'time_step': 0.042146519691713395, 'td_error': 25.112898380004616, 'discounted_sum_of_advantage': -18.851873195418246, 'average_value_estimation': -148.37966426571245, 'value_estimation_std': 0.25257002773015386, 'initial_state_value_estimation': -147.53065490722656, 'soft_opc': nan, 'continuous_action_diff': 0.4649940983921, 'discrete_action_match': 0.0, 'evaluate_on_environment': -841.7099177739768, 'compare_continuous_action_diff': 0.0} step=4960




2022-11-03 00:30.48 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_4960.pt


2022-11-03 00:31.06 [info     ] CQL_20221103002531: epoch=17 step=5270 epoch=17 metrics={'time_sample_batch': 0.0007375301853302987, 'time_algorithm_update': 0.03666180641420426, 'temp_loss': 0.8973698985192083, 'temp': 0.5897319403386885, 'alpha_loss': -20.883365631103516, 'alpha': 1.8609736038792517, 'critic_loss': 25.540466776201804, 'actor_loss': 152.07298446163054, 'time_step': 0.03754280228768626, 'td_error': 23.898242690623942, 'discounted_sum_of_advantage': -20.886704227487282, 'average_value_estimation': -155.62019622582224, 'value_estimation_std': 0.2501515839718213, 'initial_state_value_estimation': -154.47593688964844, 'soft_opc': nan, 'continuous_action_diff': 0.48410524808165983, 'discrete_action_match': 0.0, 'evaluate_on_environment': -913.7196509752691, 'compare_continuous_action_diff': 0.0} step=5270




2022-11-03 00:31.06 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_5270.pt


2022-11-03 00:31.26 [info     ] CQL_20221103002531: epoch=18 step=5580 epoch=18 metrics={'time_sample_batch': 0.0010214190329274823, 'time_algorithm_update': 0.04127864530009608, 'temp_loss': 0.867211038643314, 'temp': 0.5737206009126479, 'alpha_loss': -21.62420565082181, 'alpha': 1.9261641917690153, 'critic_loss': 26.28187628715269, 'actor_loss': 159.06373650335496, 'time_step': 0.04246063540058751, 'td_error': 23.26284984936891, 'discounted_sum_of_advantage': -19.850880242483967, 'average_value_estimation': -162.49150482312788, 'value_estimation_std': 0.2563645062751152, 'initial_state_value_estimation': -161.2802276611328, 'soft_opc': nan, 'continuous_action_diff': 0.4932727616233937, 'discrete_action_match': 0.0, 'evaluate_on_environment': -887.4140035079533, 'compare_continuous_action_diff': 0.0} step=5580




2022-11-03 00:31.26 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_5580.pt


2022-11-03 00:31.44 [info     ] CQL_20221103002531: epoch=19 step=5890 epoch=19 metrics={'time_sample_batch': 0.0007638708237678774, 'time_algorithm_update': 0.036690069783118465, 'temp_loss': 0.8429947624283453, 'temp': 0.5581012829657523, 'alpha_loss': -22.36562343720467, 'alpha': 1.993478686578812, 'critic_loss': 27.08843440394248, 'actor_loss': 165.99720129197644, 'time_step': 0.03759390692557058, 'td_error': 22.376651788012637, 'discounted_sum_of_advantage': -20.665711961120245, 'average_value_estimation': -169.18142408303635, 'value_estimation_std': 0.24709131829362926, 'initial_state_value_estimation': -167.92056274414062, 'soft_opc': nan, 'continuous_action_diff': 0.5092544536228862, 'discrete_action_match': 0.0, 'evaluate_on_environment': -778.0696821267856, 'compare_continuous_action_diff': 0.0} step=5890




2022-11-03 00:31.44 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_5890.pt


2022-11-03 00:32.07 [info     ] CQL_20221103002531: epoch=20 step=6200 epoch=20 metrics={'time_sample_batch': 0.0009487890428112399, 'time_algorithm_update': 0.04116896583187964, 'temp_loss': 0.8161984378291715, 'temp': 0.5428876671098893, 'alpha_loss': -23.1304386631135, 'alpha': 2.0629593241599298, 'critic_loss': 27.886741084437215, 'actor_loss': 172.76388347994896, 'time_step': 0.04229698719516877, 'td_error': 21.594508579490803, 'discounted_sum_of_advantage': -20.953711581412392, 'average_value_estimation': -175.63142021427942, 'value_estimation_std': 0.2533918399715016, 'initial_state_value_estimation': -174.81985473632812, 'soft_opc': nan, 'continuous_action_diff': 0.5069009012484229, 'discrete_action_match': 0.0, 'evaluate_on_environment': -690.6560825822908, 'compare_continuous_action_diff': 0.0} step=6200




2022-11-03 00:32.07 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_6200.pt


2022-11-03 00:32.26 [info     ] CQL_20221103002531: epoch=21 step=6510 epoch=21 metrics={'time_sample_batch': 0.0007560829962453534, 'time_algorithm_update': 0.03708791886606524, 'temp_loss': 0.7890555429843165, 'temp': 0.5280819985174363, 'alpha_loss': -23.936051257964102, 'alpha': 2.134861457732416, 'critic_loss': 28.830605722242787, 'actor_loss': 179.4743926017515, 'time_step': 0.03800482211574431, 'td_error': 20.402172795899947, 'discounted_sum_of_advantage': -23.21645292218558, 'average_value_estimation': -182.17736444703124, 'value_estimation_std': 0.24822427242673648, 'initial_state_value_estimation': -181.68748474121094, 'soft_opc': nan, 'continuous_action_diff': 0.5250175013899685, 'discrete_action_match': 0.0, 'evaluate_on_environment': -440.14984242008506, 'compare_continuous_action_diff': 0.0} step=6510




2022-11-03 00:32.26 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_6510.pt


2022-11-03 00:32.45 [info     ] CQL_20221103002531: epoch=22 step=6820 epoch=22 metrics={'time_sample_batch': 0.0009282527431364983, 'time_algorithm_update': 0.03951369870093561, 'temp_loss': 0.759663870065443, 'temp': 0.5137524018364568, 'alpha_loss': -24.747359097388482, 'alpha': 2.209174250018212, 'critic_loss': 29.606209373474123, 'actor_loss': 186.02808498259515, 'time_step': 0.04058861886301348, 'td_error': 19.824473979597695, 'discounted_sum_of_advantage': -23.190047046091557, 'average_value_estimation': -189.05220988159454, 'value_estimation_std': 0.24749009166689312, 'initial_state_value_estimation': -188.77272033691406, 'soft_opc': nan, 'continuous_action_diff': 0.5419217296288393, 'discrete_action_match': 0.0, 'evaluate_on_environment': -446.6081059533296, 'compare_continuous_action_diff': 0.0} step=6820




2022-11-03 00:32.45 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_6820.pt


2022-11-03 00:33.03 [info     ] CQL_20221103002531: epoch=23 step=7130 epoch=23 metrics={'time_sample_batch': 0.0007937685135872134, 'time_algorithm_update': 0.03665715109917425, 'temp_loss': 0.734957076849476, 'temp': 0.4998208939067779, 'alpha_loss': -25.608051349270728, 'alpha': 2.286048053926037, 'critic_loss': 30.605165358512632, 'actor_loss': 192.47396259923136, 'time_step': 0.03760368747095908, 'td_error': 19.37576532630905, 'discounted_sum_of_advantage': -22.54748572022187, 'average_value_estimation': -195.66156631074483, 'value_estimation_std': 0.24543907274536397, 'initial_state_value_estimation': -195.2541046142578, 'soft_opc': nan, 'continuous_action_diff': 0.5495441268362956, 'discrete_action_match': 0.0, 'evaluate_on_environment': -249.59652146926754, 'compare_continuous_action_diff': 0.0} step=7130




2022-11-03 00:33.03 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_7130.pt


2022-11-03 00:33.24 [info     ] CQL_20221103002531: epoch=24 step=7440 epoch=24 metrics={'time_sample_batch': 0.0008977513159475019, 'time_algorithm_update': 0.041765111492526145, 'temp_loss': 0.7068232501706769, 'temp': 0.48630657013385525, 'alpha_loss': -26.48982562403525, 'alpha': 2.365602947050525, 'critic_loss': 31.686365798211867, 'actor_loss': 198.8673538208008, 'time_step': 0.04284874546912409, 'td_error': 19.013133758655943, 'discounted_sum_of_advantage': -22.698210001737706, 'average_value_estimation': -201.33331164232146, 'value_estimation_std': 0.25414983534774643, 'initial_state_value_estimation': -200.79165649414062, 'soft_opc': nan, 'continuous_action_diff': 0.5603580900264997, 'discrete_action_match': 0.0, 'evaluate_on_environment': -448.01377697140015, 'compare_continuous_action_diff': 0.0} step=7440




2022-11-03 00:33.24 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_7440.pt


2022-11-03 00:33.42 [info     ] CQL_20221103002531: epoch=25 step=7750 epoch=25 metrics={'time_sample_batch': 0.0007816499279391381, 'time_algorithm_update': 0.03741877694283762, 'temp_loss': 0.6869857447762643, 'temp': 0.47316919959360554, 'alpha_loss': -27.41950427639869, 'alpha': 2.4479214875928816, 'critic_loss': 32.87847322648572, 'actor_loss': 205.10450596963204, 'time_step': 0.03834832560631537, 'td_error': 18.311468778477916, 'discounted_sum_of_advantage': -24.932168445947624, 'average_value_estimation': -208.00524827568205, 'value_estimation_std': 0.2660972551695543, 'initial_state_value_estimation': -207.31495666503906, 'soft_opc': nan, 'continuous_action_diff': 0.5633556218111085, 'discrete_action_match': 0.0, 'evaluate_on_environment': -438.43965536102024, 'compare_continuous_action_diff': 0.0} step=7750




2022-11-03 00:33.42 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_7750.pt


2022-11-03 00:34.02 [info     ] CQL_20221103002531: epoch=26 step=8060 epoch=26 metrics={'time_sample_batch': 0.0012900106368526336, 'time_algorithm_update': 0.04123260667247157, 'temp_loss': 0.6628338317717275, 'temp': 0.46033862627321676, 'alpha_loss': -28.37681879843435, 'alpha': 2.533039669067629, 'critic_loss': 34.10919561078472, 'actor_loss': 211.2827372889365, 'time_step': 0.04268653931156281, 'td_error': 18.063375748052806, 'discounted_sum_of_advantage': -23.75828268681555, 'average_value_estimation': -213.65577028093128, 'value_estimation_std': 0.24035291793179384, 'initial_state_value_estimation': -212.64215087890625, 'soft_opc': nan, 'continuous_action_diff': 0.5650791811075278, 'discrete_action_match': 0.0, 'evaluate_on_environment': -662.050701847944, 'compare_continuous_action_diff': 0.0} step=8060




2022-11-03 00:34.02 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_8060.pt


2022-11-03 00:34.20 [info     ] CQL_20221103002531: epoch=27 step=8370 epoch=27 metrics={'time_sample_batch': 0.0007579688102968277, 'time_algorithm_update': 0.03690568324058287, 'temp_loss': 0.6437836871993157, 'temp': 0.4478816713056257, 'alpha_loss': -29.415664260618147, 'alpha': 2.6212899077323177, 'critic_loss': 35.44017548099641, 'actor_loss': 217.28805507536856, 'time_step': 0.03782102369493054, 'td_error': 17.743563634285668, 'discounted_sum_of_advantage': -23.737260151116562, 'average_value_estimation': -219.65206060477263, 'value_estimation_std': 0.2367118139518548, 'initial_state_value_estimation': -218.17324829101562, 'soft_opc': nan, 'continuous_action_diff': 0.5704780461787325, 'discrete_action_match': 0.0, 'evaluate_on_environment': -296.58844345668, 'compare_continuous_action_diff': 0.0} step=8370




2022-11-03 00:34.20 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_8370.pt


2022-11-03 00:34.40 [info     ] CQL_20221103002531: epoch=28 step=8680 epoch=28 metrics={'time_sample_batch': 0.00103084348863171, 'time_algorithm_update': 0.041787347485942226, 'temp_loss': 0.6200965821743012, 'temp': 0.4357178468858042, 'alpha_loss': -30.416906596768285, 'alpha': 2.7125879210810506, 'critic_loss': 36.89169617929766, 'actor_loss': 223.20921394594254, 'time_step': 0.04299834312931184, 'td_error': 17.335074175629824, 'discounted_sum_of_advantage': -24.41091382754506, 'average_value_estimation': -225.3791311588129, 'value_estimation_std': 0.24220398784442648, 'initial_state_value_estimation': -223.67196655273438, 'soft_opc': nan, 'continuous_action_diff': 0.5535455873529014, 'discrete_action_match': 0.0, 'evaluate_on_environment': -444.1802450659692, 'compare_continuous_action_diff': 0.0} step=8680




2022-11-03 00:34.40 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_8680.pt


2022-11-03 00:34.59 [info     ] CQL_20221103002531: epoch=29 step=8990 epoch=29 metrics={'time_sample_batch': 0.0007891724186558877, 'time_algorithm_update': 0.036860386786922335, 'temp_loss': 0.6017295197133095, 'temp': 0.4239245745443529, 'alpha_loss': -31.49737300257529, 'alpha': 2.807049301362807, 'critic_loss': 38.2884387354697, 'actor_loss': 229.08314159762475, 'time_step': 0.03779883923069123, 'td_error': 17.04694116549386, 'discounted_sum_of_advantage': -24.65383725018136, 'average_value_estimation': -231.1162383835582, 'value_estimation_std': 0.2530172879419724, 'initial_state_value_estimation': -229.22866821289062, 'soft_opc': nan, 'continuous_action_diff': 0.5627117794188013, 'discrete_action_match': 0.0, 'evaluate_on_environment': -668.0286962993554, 'compare_continuous_action_diff': 0.0} step=8990




2022-11-03 00:34.59 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_8990.pt


2022-11-03 00:35.19 [info     ] CQL_20221103002531: epoch=30 step=9300 epoch=30 metrics={'time_sample_batch': 0.0008266625865813225, 'time_algorithm_update': 0.04156081445755497, 'temp_loss': 0.5812946211907172, 'temp': 0.41240750262814185, 'alpha_loss': -32.59595708539409, 'alpha': 2.9048167367135327, 'critic_loss': 39.74416844767909, 'actor_loss': 234.84296156360256, 'time_step': 0.04257239757045623, 'td_error': 16.749776425955236, 'discounted_sum_of_advantage': -24.92182914969818, 'average_value_estimation': -237.016693532604, 'value_estimation_std': 0.24862501230885256, 'initial_state_value_estimation': -235.19046020507812, 'soft_opc': nan, 'continuous_action_diff': 0.5769097815536166, 'discrete_action_match': 0.0, 'evaluate_on_environment': -381.6182735464669, 'compare_continuous_action_diff': 0.0} step=9300




2022-11-03 00:35.19 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_9300.pt


2022-11-03 00:35.37 [info     ] CQL_20221103002531: epoch=31 step=9610 epoch=31 metrics={'time_sample_batch': 0.0007982846229307113, 'time_algorithm_update': 0.03719484652242353, 'temp_loss': 0.5616042963920101, 'temp': 0.40124309783981693, 'alpha_loss': -33.72959319699195, 'alpha': 3.005893810333744, 'critic_loss': 41.258509186775456, 'actor_loss': 240.51742381434286, 'time_step': 0.03813578313396823, 'td_error': 16.313640490585048, 'discounted_sum_of_advantage': -25.503415323088124, 'average_value_estimation': -242.52908722383162, 'value_estimation_std': 0.24824567766544917, 'initial_state_value_estimation': -241.06893920898438, 'soft_opc': nan, 'continuous_action_diff': 0.6049119840235916, 'discrete_action_match': 0.0, 'evaluate_on_environment': -493.3232530494802, 'compare_continuous_action_diff': 0.0} step=9610




2022-11-03 00:35.37 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_9610.pt


2022-11-03 00:36.00 [info     ] CQL_20221103002531: epoch=32 step=9920 epoch=32 metrics={'time_sample_batch': 0.0013126434818390878, 'time_algorithm_update': 0.04763066076463269, 'temp_loss': 0.543325821718862, 'temp': 0.39037715504246373, 'alpha_loss': -34.920929619573776, 'alpha': 3.11053466796875, 'critic_loss': 42.87396264845325, 'actor_loss': 246.07217584425402, 'time_step': 0.04920074632090907, 'td_error': 15.953712370507308, 'discounted_sum_of_advantage': -26.56643456229938, 'average_value_estimation': -248.33724554172335, 'value_estimation_std': 0.24982225555516527, 'initial_state_value_estimation': -246.90966796875, 'soft_opc': nan, 'continuous_action_diff': 0.5838056252241782, 'discrete_action_match': 0.0, 'evaluate_on_environment': -425.9130643282591, 'compare_continuous_action_diff': 0.0} step=9920




2022-11-03 00:36.00 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_9920.pt


2022-11-03 00:36.19 [info     ] CQL_20221103002531: epoch=33 step=10230 epoch=33 metrics={'time_sample_batch': 0.0007973578668409779, 'time_algorithm_update': 0.037570175047843685, 'temp_loss': 0.5227677682715077, 'temp': 0.3798388767626978, 'alpha_loss': -36.13261998084284, 'alpha': 3.2188823823005923, 'critic_loss': 44.506640612694525, 'actor_loss': 251.55904481949344, 'time_step': 0.038521341354616226, 'td_error': 16.34281106061362, 'discounted_sum_of_advantage': -23.13660914373769, 'average_value_estimation': -253.59625591885055, 'value_estimation_std': 0.26266458914524105, 'initial_state_value_estimation': -252.57632446289062, 'soft_opc': nan, 'continuous_action_diff': 0.595943875532939, 'discrete_action_match': 0.0, 'evaluate_on_environment': -364.12392109784327, 'compare_continuous_action_diff': 0.0} step=10230




2022-11-03 00:36.19 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_10230.pt


2022-11-03 00:36.39 [info     ] CQL_20221103002531: epoch=34 step=10540 epoch=34 metrics={'time_sample_batch': 0.0011837597816221177, 'time_algorithm_update': 0.04200965742911062, 'temp_loss': 0.5062993372640302, 'temp': 0.3695628717061012, 'alpha_loss': -37.361911318379065, 'alpha': 3.3308228485045897, 'critic_loss': 46.19145696086268, 'actor_loss': 256.9710156348444, 'time_step': 0.043377919350900955, 'td_error': 15.900535100412641, 'discounted_sum_of_advantage': -24.59261541513596, 'average_value_estimation': -258.4692380294548, 'value_estimation_std': 0.2609170354882691, 'initial_state_value_estimation': -257.69866943359375, 'soft_opc': nan, 'continuous_action_diff': 0.599041908743937, 'discrete_action_match': 0.0, 'evaluate_on_environment': -455.36072090631467, 'compare_continuous_action_diff': 0.0} step=10540




2022-11-03 00:36.39 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_10540.pt


2022-11-03 00:36.57 [info     ] CQL_20221103002531: epoch=35 step=10850 epoch=35 metrics={'time_sample_batch': 0.0007307867850026776, 'time_algorithm_update': 0.03679714202880859, 'temp_loss': 0.49315381511565176, 'temp': 0.3595377278904761, 'alpha_loss': -38.68354975792669, 'alpha': 3.446705662819647, 'critic_loss': 47.92317044658046, 'actor_loss': 262.2261856571321, 'time_step': 0.037681148898216985, 'td_error': 15.408711671395755, 'discounted_sum_of_advantage': -25.719863622315245, 'average_value_estimation': -264.2069944363974, 'value_estimation_std': 0.265649629012499, 'initial_state_value_estimation': -263.531494140625, 'soft_opc': nan, 'continuous_action_diff': 0.5980499139476205, 'discrete_action_match': 2.5255050757601015e-07, 'evaluate_on_environment': -469.1213979847429, 'compare_continuous_action_diff': 0.0} step=10850




2022-11-03 00:36.57 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_10850.pt


2022-11-03 00:37.17 [info     ] CQL_20221103002531: epoch=36 step=11160 epoch=36 metrics={'time_sample_batch': 0.001182138535284227, 'time_algorithm_update': 0.041539824393487745, 'temp_loss': 0.47626646487943586, 'temp': 0.3497786317140825, 'alpha_loss': -39.92601437722483, 'alpha': 3.5665265237131427, 'critic_loss': 49.60651146673387, 'actor_loss': 267.4646950506395, 'time_step': 0.042899759354129915, 'td_error': 15.326899211868716, 'discounted_sum_of_advantage': -24.476060623171378, 'average_value_estimation': -269.0811758662286, 'value_estimation_std': 0.2890784638238978, 'initial_state_value_estimation': -268.789306640625, 'soft_opc': nan, 'continuous_action_diff': 0.5775260367860019, 'discrete_action_match': 0.0, 'evaluate_on_environment': -410.84179173706934, 'compare_continuous_action_diff': 0.0} step=11160




2022-11-03 00:37.17 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_11160.pt


2022-11-03 00:37.36 [info     ] CQL_20221103002531: epoch=37 step=11470 epoch=37 metrics={'time_sample_batch': 0.0007867605455460087, 'time_algorithm_update': 0.037499995385446856, 'temp_loss': 0.4616570842842902, 'temp': 0.34025617088041, 'alpha_loss': -41.30566008783156, 'alpha': 3.6904138380481353, 'critic_loss': 51.30108950215001, 'actor_loss': 272.6568064043599, 'time_step': 0.03843083996926584, 'td_error': 15.537339163029184, 'discounted_sum_of_advantage': -23.993722940863808, 'average_value_estimation': -274.5183292636515, 'value_estimation_std': 0.28601068232939675, 'initial_state_value_estimation': -273.90557861328125, 'soft_opc': nan, 'continuous_action_diff': 0.6141208011879452, 'discrete_action_match': 0.0, 'evaluate_on_environment': -315.4014268432202, 'compare_continuous_action_diff': 0.0} step=11470




2022-11-03 00:37.36 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_11470.pt


2022-11-03 00:37.56 [info     ] CQL_20221103002531: epoch=38 step=11780 epoch=38 metrics={'time_sample_batch': 0.0010572741108555949, 'time_algorithm_update': 0.04174705013152092, 'temp_loss': 0.44673186665581116, 'temp': 0.3310385630015404, 'alpha_loss': -42.65371521980532, 'alpha': 3.8185456968122913, 'critic_loss': 53.02462740252095, 'actor_loss': 277.74585285802044, 'time_step': 0.042973822932089525, 'td_error': 15.412839915001472, 'discounted_sum_of_advantage': -22.12699964751514, 'average_value_estimation': -279.43679994105673, 'value_estimation_std': 0.26874420526346027, 'initial_state_value_estimation': -278.90283203125, 'soft_opc': nan, 'continuous_action_diff': 0.6041676019796878, 'discrete_action_match': 0.0, 'evaluate_on_environment': -397.474161688602, 'compare_continuous_action_diff': 0.0} step=11780




2022-11-03 00:37.56 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_11780.pt


2022-11-03 00:38.14 [info     ] CQL_20221103002531: epoch=39 step=12090 epoch=39 metrics={'time_sample_batch': 0.0007768338726412866, 'time_algorithm_update': 0.037614687027469756, 'temp_loss': 0.43405542373657224, 'temp': 0.3219967851715703, 'alpha_loss': -44.0586554496519, 'alpha': 3.9510509283311905, 'critic_loss': 54.78178858603201, 'actor_loss': 282.78572092363913, 'time_step': 0.038524736127545756, 'td_error': 14.938948583182027, 'discounted_sum_of_advantage': -23.60603615664477, 'average_value_estimation': -284.5028053842219, 'value_estimation_std': 0.27211197460666475, 'initial_state_value_estimation': -283.9095153808594, 'soft_opc': nan, 'continuous_action_diff': 0.6078201671961971, 'discrete_action_match': 0.0, 'evaluate_on_environment': -389.09694399847245, 'compare_continuous_action_diff': 0.0} step=12090




2022-11-03 00:38.14 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_12090.pt


2022-11-03 00:38.34 [info     ] CQL_20221103002531: epoch=40 step=12400 epoch=40 metrics={'time_sample_batch': 0.0011977341867262317, 'time_algorithm_update': 0.04205829405015515, 'temp_loss': 0.4193573263383681, 'temp': 0.3132165376217135, 'alpha_loss': -45.48398629465411, 'alpha': 4.087918386151713, 'critic_loss': 56.71921922006915, 'actor_loss': 287.7186517530872, 'time_step': 0.04345617678857619, 'td_error': 15.090288195799275, 'discounted_sum_of_advantage': -21.518985679409884, 'average_value_estimation': -289.3446579139725, 'value_estimation_std': 0.2701619279322017, 'initial_state_value_estimation': -288.48974609375, 'soft_opc': nan, 'continuous_action_diff': 0.6266532824765755, 'discrete_action_match': 0.0, 'evaluate_on_environment': -420.9780025912334, 'compare_continuous_action_diff': 0.0} step=12400




2022-11-03 00:38.34 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_12400.pt


2022-11-03 00:38.53 [info     ] CQL_20221103002531: epoch=41 step=12710 epoch=41 metrics={'time_sample_batch': 0.0007574973567839591, 'time_algorithm_update': 0.03752244057193879, 'temp_loss': 0.40881628865195857, 'temp': 0.30468359493440195, 'alpha_loss': -46.9370660228114, 'alpha': 4.229511699368877, 'critic_loss': 58.38670462331464, 'actor_loss': 292.62701445548765, 'time_step': 0.03841637026879095, 'td_error': 15.13824689841913, 'discounted_sum_of_advantage': -20.45243990218813, 'average_value_estimation': -293.7553819545162, 'value_estimation_std': 0.2617068411122112, 'initial_state_value_estimation': -292.7869873046875, 'soft_opc': nan, 'continuous_action_diff': 0.5939992712755816, 'discrete_action_match': 5.051010151520203e-07, 'evaluate_on_environment': -483.0363623378151, 'compare_continuous_action_diff': 0.0} step=12710




2022-11-03 00:38.53 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_12710.pt


2022-11-03 00:39.14 [info     ] CQL_20221103002531: epoch=42 step=13020 epoch=42 metrics={'time_sample_batch': 0.0011784168981736707, 'time_algorithm_update': 0.0420671655285743, 'temp_loss': 0.3956994496045574, 'temp': 0.29636408475137527, 'alpha_loss': -48.422482201360886, 'alpha': 4.375823113226121, 'critic_loss': 60.22434179244503, 'actor_loss': 297.4895812988281, 'time_step': 0.0434375855230516, 'td_error': 14.532895838402704, 'discounted_sum_of_advantage': -21.736995194580683, 'average_value_estimation': -298.5136540952069, 'value_estimation_std': 0.26257420695432443, 'initial_state_value_estimation': -297.6562194824219, 'soft_opc': nan, 'continuous_action_diff': 0.6144286087008527, 'discrete_action_match': 0.0, 'evaluate_on_environment': -471.25977535408964, 'compare_continuous_action_diff': 0.0} step=13020




2022-11-03 00:39.15 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_13020.pt


2022-11-03 00:39.35 [info     ] CQL_20221103002531: epoch=43 step=13330 epoch=43 metrics={'time_sample_batch': 0.0011173540546048073, 'time_algorithm_update': 0.04287768794644264, 'temp_loss': 0.3826257756640834, 'temp': 0.28825596917060115, 'alpha_loss': -49.995423224664506, 'alpha': 4.527110439731229, 'critic_loss': 62.15163912619314, 'actor_loss': 302.25026136829007, 'time_step': 0.0441640030953192, 'td_error': 14.668491049146205, 'discounted_sum_of_advantage': -21.10986362482385, 'average_value_estimation': -303.6951619474539, 'value_estimation_std': 0.2713393284291212, 'initial_state_value_estimation': -302.5964050292969, 'soft_opc': nan, 'continuous_action_diff': 0.6508973535139728, 'discrete_action_match': 0.0, 'evaluate_on_environment': -263.54989537497175, 'compare_continuous_action_diff': 0.0} step=13330




2022-11-03 00:39.35 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_13330.pt


2022-11-03 00:39.55 [info     ] CQL_20221103002531: epoch=44 step=13640 epoch=44 metrics={'time_sample_batch': 0.0011695492652154737, 'time_algorithm_update': 0.041956626215288714, 'temp_loss': 0.3726206354556545, 'temp': 0.2803874375358705, 'alpha_loss': -51.56802338630922, 'alpha': 4.683671214503627, 'critic_loss': 64.14565376773957, 'actor_loss': 306.93871105563255, 'time_step': 0.04331623969539519, 'td_error': 14.630626180006578, 'discounted_sum_of_advantage': -20.116774303741344, 'average_value_estimation': -308.21615417652913, 'value_estimation_std': 0.2694712371811042, 'initial_state_value_estimation': -307.19830322265625, 'soft_opc': nan, 'continuous_action_diff': 0.6219841188598777, 'discrete_action_match': 0.0, 'evaluate_on_environment': -367.6485933737753, 'compare_continuous_action_diff': 0.0} step=13640




2022-11-03 00:39.55 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_13640.pt


2022-11-03 00:40.14 [info     ] CQL_20221103002531: epoch=45 step=13950 epoch=45 metrics={'time_sample_batch': 0.0008567487039873677, 'time_algorithm_update': 0.0379437554267145, 'temp_loss': 0.36380371557128044, 'temp': 0.272683292146652, 'alpha_loss': -53.25151981230705, 'alpha': 4.845481854100381, 'critic_loss': 66.06657254618983, 'actor_loss': 311.61441581479966, 'time_step': 0.038950777053833006, 'td_error': 13.929156125907777, 'discounted_sum_of_advantage': -23.17084620575535, 'average_value_estimation': -312.11776638326677, 'value_estimation_std': 0.27047675309778946, 'initial_state_value_estimation': -311.15020751953125, 'soft_opc': nan, 'continuous_action_diff': 0.6157455307915671, 'discrete_action_match': 0.0, 'evaluate_on_environment': -395.1091120081731, 'compare_continuous_action_diff': 0.0} step=13950




2022-11-03 00:40.14 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_13950.pt


2022-11-03 00:40.34 [info     ] CQL_20221103002531: epoch=46 step=14260 epoch=46 metrics={'time_sample_batch': 0.0008925568672918504, 'time_algorithm_update': 0.04242581090619487, 'temp_loss': 0.3514047673633022, 'temp': 0.26518445582159106, 'alpha_loss': -54.85195239897697, 'alpha': 5.012891672503564, 'critic_loss': 68.01623313657699, 'actor_loss': 316.2063149729083, 'time_step': 0.043527595458492156, 'td_error': 14.072246683791958, 'discounted_sum_of_advantage': -22.07001448804485, 'average_value_estimation': -316.83944593723896, 'value_estimation_std': 0.2686212666527027, 'initial_state_value_estimation': -316.0666198730469, 'soft_opc': nan, 'continuous_action_diff': 0.6421970708499906, 'discrete_action_match': 2.5255050757601015e-07, 'evaluate_on_environment': -482.6957410680426, 'compare_continuous_action_diff': 0.0} step=14260




2022-11-03 00:40.34 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_14260.pt


2022-11-03 00:40.53 [info     ] CQL_20221103002531: epoch=47 step=14570 epoch=47 metrics={'time_sample_batch': 0.0008621408093360162, 'time_algorithm_update': 0.038439934484420284, 'temp_loss': 0.33979782904348066, 'temp': 0.2579141644700881, 'alpha_loss': -56.65564258944604, 'alpha': 5.186062835877942, 'critic_loss': 70.07053236192273, 'actor_loss': 320.8004681002709, 'time_step': 0.03945419865269815, 'td_error': 14.295241134728318, 'discounted_sum_of_advantage': -18.034154480289597, 'average_value_estimation': -322.2010454883074, 'value_estimation_std': 0.2809629947705226, 'initial_state_value_estimation': -321.01092529296875, 'soft_opc': nan, 'continuous_action_diff': 0.6502058485854486, 'discrete_action_match': 0.0, 'evaluate_on_environment': -304.851931295867, 'compare_continuous_action_diff': 0.0} step=14570




2022-11-03 00:40.53 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_14570.pt


2022-11-03 00:41.13 [info     ] CQL_20221103002531: epoch=48 step=14880 epoch=48 metrics={'time_sample_batch': 0.0009933763934719948, 'time_algorithm_update': 0.04239943796588529, 'temp_loss': 0.32862314341529725, 'temp': 0.2508895680308342, 'alpha_loss': -58.365234411916425, 'alpha': 5.365097108963997, 'critic_loss': 72.07498309227728, 'actor_loss': 325.25600782825103, 'time_step': 0.04358938970873433, 'td_error': 14.404213417172812, 'discounted_sum_of_advantage': -17.196784750630172, 'average_value_estimation': -326.0151968291311, 'value_estimation_std': 0.2704646673799212, 'initial_state_value_estimation': -325.31640625, 'soft_opc': nan, 'continuous_action_diff': 0.6254709854185213, 'discrete_action_match': 0.0, 'evaluate_on_environment': -276.1359337683192, 'compare_continuous_action_diff': 0.0} step=14880




2022-11-03 00:41.13 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_14880.pt


Epoch 49/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:41.32 [info     ] CQL_20221103002531: epoch=49 step=15190 epoch=49 metrics={'time_sample_batch': 0.0007886655869022493, 'time_algorithm_update': 0.037869281922617265, 'temp_loss': 0.321363690303218, 'temp': 0.24400429759294756, 'alpha_loss': -60.186530857701456, 'alpha': 5.550052115225022, 'critic_loss': 74.28603633757561, 'actor_loss': 329.71271647791707, 'time_step': 0.038803954278269125, 'td_error': 14.338271914110816, 'discounted_sum_of_advantage': -16.523260332031075, 'average_value_estimation': -331.16457867786903, 'value_estimation_std': 0.2845683191544986, 'initial_state_value_estimation': -330.45269775390625, 'soft_opc': nan, 'continuous_action_diff': 0.6320189880994705, 'discrete_action_match': 0.0, 'evaluate_on_environment': -460.25664532975213, 'compare_continuous_action_diff': 0.0} step=15190




2022-11-03 00:41.32 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_15190.pt


Epoch 50/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:41.52 [info     ] CQL_20221103002531: epoch=50 step=15500 epoch=50 metrics={'time_sample_batch': 0.0010151986152895035, 'time_algorithm_update': 0.04247639102320517, 'temp_loss': 0.31281160423832555, 'temp': 0.23727891897001574, 'alpha_loss': -62.107152015932144, 'alpha': 5.741442363492904, 'critic_loss': 76.38299467025264, 'actor_loss': 334.10096159904236, 'time_step': 0.04368624917922481, 'td_error': 14.02121195445933, 'discounted_sum_of_advantage': -16.314428884192626, 'average_value_estimation': -335.16622118907446, 'value_estimation_std': 0.27397816405483655, 'initial_state_value_estimation': -334.366455078125, 'soft_opc': nan, 'continuous_action_diff': 0.6363086569856405, 'discrete_action_match': 0.0, 'evaluate_on_environment': -309.88967635087795, 'compare_continuous_action_diff': 0.0} step=15500




2022-11-03 00:41.52 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_15500.pt


Epoch 51/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:42.11 [info     ] CQL_20221103002531: epoch=51 step=15810 epoch=51 metrics={'time_sample_batch': 0.0007892408678608556, 'time_algorithm_update': 0.037902363654105896, 'temp_loss': 0.3005465881478402, 'temp': 0.23077146252316813, 'alpha_loss': -64.02965537040464, 'alpha': 5.939439213660456, 'critic_loss': 78.62857488816785, 'actor_loss': 338.422750952936, 'time_step': 0.03883553397270941, 'td_error': 13.867559054441951, 'discounted_sum_of_advantage': -16.3225230218103, 'average_value_estimation': -340.0381823445406, 'value_estimation_std': 0.2808011365455985, 'initial_state_value_estimation': -338.7613220214844, 'soft_opc': nan, 'continuous_action_diff': 0.6609507447123271, 'discrete_action_match': 0.0, 'evaluate_on_environment': -358.6208432784237, 'compare_continuous_action_diff': 0.0} step=15810




2022-11-03 00:42.11 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_15810.pt


Epoch 52/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:42.31 [info     ] CQL_20221103002531: epoch=52 step=16120 epoch=52 metrics={'time_sample_batch': 0.0010569541685042842, 'time_algorithm_update': 0.04246832632249401, 'temp_loss': 0.29424361644252656, 'temp': 0.22444555591191015, 'alpha_loss': -66.12902904633553, 'alpha': 6.144602189525481, 'critic_loss': 81.0909792254048, 'actor_loss': 342.73527457944806, 'time_step': 0.04369901610958961, 'td_error': 13.255244274909288, 'discounted_sum_of_advantage': -19.272241190085413, 'average_value_estimation': -343.5084692406313, 'value_estimation_std': 0.2958118160243361, 'initial_state_value_estimation': -342.6126708984375, 'soft_opc': nan, 'continuous_action_diff': 0.6430455632808615, 'discrete_action_match': 0.0, 'evaluate_on_environment': -381.2288473862028, 'compare_continuous_action_diff': 0.0} step=16120




2022-11-03 00:42.32 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_16120.pt


Epoch 53/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:42.51 [info     ] CQL_20221103002531: epoch=53 step=16430 epoch=53 metrics={'time_sample_batch': 0.0007451726544287897, 'time_algorithm_update': 0.037952803796337496, 'temp_loss': 0.2849038967201787, 'temp': 0.21826450690146415, 'alpha_loss': -68.2491829410676, 'alpha': 6.3567311440744705, 'critic_loss': 83.39837262553553, 'actor_loss': 346.97579355547504, 'time_step': 0.03886212456610895, 'td_error': 14.196845124406767, 'discounted_sum_of_advantage': -12.674043181928793, 'average_value_estimation': -348.1182245467589, 'value_estimation_std': 0.27926001392535316, 'initial_state_value_estimation': -347.0975036621094, 'soft_opc': nan, 'continuous_action_diff': 0.6485954649716206, 'discrete_action_match': 0.0, 'evaluate_on_environment': -183.93806917893565, 'compare_continuous_action_diff': 0.0} step=16430




2022-11-03 00:42.51 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_16430.pt


Epoch 54/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:43.14 [info     ] CQL_20221103002531: epoch=54 step=16740 epoch=54 metrics={'time_sample_batch': 0.001221417611645114, 'time_algorithm_update': 0.04995508809243479, 'temp_loss': 0.27943655026535835, 'temp': 0.2122243004941171, 'alpha_loss': -70.42597324002173, 'alpha': 6.576267274733513, 'critic_loss': 85.80567681097216, 'actor_loss': 351.19428041519654, 'time_step': 0.05144812214759088, 'td_error': 13.659804921409538, 'discounted_sum_of_advantage': -14.81758060479634, 'average_value_estimation': -352.3599741459131, 'value_estimation_std': 0.28162180067856535, 'initial_state_value_estimation': -351.2664489746094, 'soft_opc': nan, 'continuous_action_diff': 0.6618404597553763, 'discrete_action_match': 0.0, 'evaluate_on_environment': -241.71869558686095, 'compare_continuous_action_diff': 0.0} step=16740




2022-11-03 00:43.14 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_16740.pt


Epoch 55/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:43.33 [info     ] CQL_20221103002531: epoch=55 step=17050 epoch=55 metrics={'time_sample_batch': 0.0007776014266475554, 'time_algorithm_update': 0.03792945646470593, 'temp_loss': 0.2691590468729696, 'temp': 0.20636805863149704, 'alpha_loss': -72.5544083872149, 'alpha': 6.803282786953834, 'critic_loss': 88.30587761171402, 'actor_loss': 355.34229509907385, 'time_step': 0.0388540760163338, 'td_error': 12.876005998416858, 'discounted_sum_of_advantage': -17.8069138394663, 'average_value_estimation': -356.2547827742674, 'value_estimation_std': 0.2746714205346122, 'initial_state_value_estimation': -355.1800842285156, 'soft_opc': nan, 'continuous_action_diff': 0.6767085656236286, 'discrete_action_match': 0.0, 'evaluate_on_environment': -261.055233943836, 'compare_continuous_action_diff': 0.0} step=17050




2022-11-03 00:43.33 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_17050.pt


Epoch 56/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:43.53 [info     ] CQL_20221103002531: epoch=56 step=17360 epoch=56 metrics={'time_sample_batch': 0.001146498034077306, 'time_algorithm_update': 0.04284330183459866, 'temp_loss': 0.25972679041085706, 'temp': 0.20072544433416858, 'alpha_loss': -74.91467885663432, 'alpha': 7.038100545637069, 'critic_loss': 90.87469726070282, 'actor_loss': 359.4045648390247, 'time_step': 0.04417549717810846, 'td_error': 13.870276575927718, 'discounted_sum_of_advantage': -11.204018826145555, 'average_value_estimation': -360.28840610654083, 'value_estimation_std': 0.2826168321393497, 'initial_state_value_estimation': -359.2056884765625, 'soft_opc': nan, 'continuous_action_diff': 0.6575211305942236, 'discrete_action_match': 0.0, 'evaluate_on_environment': -204.52422155542655, 'compare_continuous_action_diff': 0.0} step=17360




2022-11-03 00:43.53 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_17360.pt


Epoch 57/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:44.12 [info     ] CQL_20221103002531: epoch=57 step=17670 epoch=57 metrics={'time_sample_batch': 0.0008378498015865202, 'time_algorithm_update': 0.03813452566823652, 'temp_loss': 0.2549294025667252, 'temp': 0.19519841584467118, 'alpha_loss': -77.38529628630607, 'alpha': 7.281114310602988, 'critic_loss': 93.34646665511593, 'actor_loss': 363.45313996345766, 'time_step': 0.039126150838790404, 'td_error': 13.50042424865243, 'discounted_sum_of_advantage': -11.88448098584004, 'average_value_estimation': -364.8055151958372, 'value_estimation_std': 0.28303349706812037, 'initial_state_value_estimation': -363.4696350097656, 'soft_opc': nan, 'continuous_action_diff': 0.6642013496756937, 'discrete_action_match': 0.0, 'evaluate_on_environment': -333.4841226680367, 'compare_continuous_action_diff': 0.0} step=17670




2022-11-03 00:44.12 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_17670.pt


Epoch 58/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:44.32 [info     ] CQL_20221103002531: epoch=58 step=17980 epoch=58 metrics={'time_sample_batch': 0.0009740483376287645, 'time_algorithm_update': 0.04256096270776564, 'temp_loss': 0.24625117524016288, 'temp': 0.18983613348776293, 'alpha_loss': -79.76552845124276, 'alpha': 7.532576922447451, 'critic_loss': 95.9607566341277, 'actor_loss': 367.4532012939453, 'time_step': 0.0437385912864439, 'td_error': 13.426921944422618, 'discounted_sum_of_advantage': -11.932791016506815, 'average_value_estimation': -368.0229900759546, 'value_estimation_std': 0.29162284254481474, 'initial_state_value_estimation': -366.81988525390625, 'soft_opc': nan, 'continuous_action_diff': 0.669719821987368, 'discrete_action_match': 0.0, 'evaluate_on_environment': -280.9887599617538, 'compare_continuous_action_diff': 0.0} step=17980




2022-11-03 00:44.32 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_17980.pt


Epoch 59/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:44.51 [info     ] CQL_20221103002531: epoch=59 step=18290 epoch=59 metrics={'time_sample_batch': 0.0007674801734185988, 'time_algorithm_update': 0.038016077010862286, 'temp_loss': 0.2409714645916416, 'temp': 0.18458469947499614, 'alpha_loss': -82.40025302517799, 'alpha': 7.792734750624626, 'critic_loss': 98.76914224932271, 'actor_loss': 371.4403425647366, 'time_step': 0.03892902097394389, 'td_error': 13.250022332614716, 'discounted_sum_of_advantage': -11.850847578875342, 'average_value_estimation': -372.6726776391628, 'value_estimation_std': 0.28151403135797703, 'initial_state_value_estimation': -371.273681640625, 'soft_opc': nan, 'continuous_action_diff': 0.6903835670255669, 'discrete_action_match': 2.5255050757601015e-07, 'evaluate_on_environment': -275.23632483179466, 'compare_continuous_action_diff': 0.0} step=18290




2022-11-03 00:44.51 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_18290.pt


Epoch 60/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:45.11 [info     ] CQL_20221103002531: epoch=60 step=18600 epoch=60 metrics={'time_sample_batch': 0.0011250534365254063, 'time_algorithm_update': 0.04276595730935374, 'temp_loss': 0.23355031936399398, 'temp': 0.17949132275196814, 'alpha_loss': -85.18726063389931, 'alpha': 8.06233698321927, 'critic_loss': 101.89201369747039, 'actor_loss': 375.35280653430567, 'time_step': 0.044063474285987114, 'td_error': 12.808400353670402, 'discounted_sum_of_advantage': -12.639685053970288, 'average_value_estimation': -376.58432740399286, 'value_estimation_std': 0.2781220552006056, 'initial_state_value_estimation': -375.0102233886719, 'soft_opc': nan, 'continuous_action_diff': 0.6807516795468243, 'discrete_action_match': 0.0, 'evaluate_on_environment': -254.4607392189946, 'compare_continuous_action_diff': 0.0} step=18600




2022-11-03 00:45.11 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_18600.pt


Epoch 61/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:45.30 [info     ] CQL_20221103002531: epoch=61 step=18910 epoch=61 metrics={'time_sample_batch': 0.0008130088929207094, 'time_algorithm_update': 0.038340244754668205, 'temp_loss': 0.22587282335565936, 'temp': 0.1745377629995346, 'alpha_loss': -87.87037259994014, 'alpha': 8.341334087617936, 'critic_loss': 104.58362370152628, 'actor_loss': 379.1788288731729, 'time_step': 0.0393033012267082, 'td_error': 13.051530627133785, 'discounted_sum_of_advantage': -11.480808547217752, 'average_value_estimation': -380.2601034090225, 'value_estimation_std': 0.2810328700685592, 'initial_state_value_estimation': -379.0344543457031, 'soft_opc': nan, 'continuous_action_diff': 0.6688993393020657, 'discrete_action_match': 0.0, 'evaluate_on_environment': -313.7666105856603, 'compare_continuous_action_diff': 0.0} step=18910




2022-11-03 00:45.30 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_18910.pt


Epoch 62/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:45.50 [info     ] CQL_20221103002531: epoch=62 step=19220 epoch=62 metrics={'time_sample_batch': 0.0009020951486402942, 'time_algorithm_update': 0.04309648698376071, 'temp_loss': 0.22056393767556837, 'temp': 0.16973028312767705, 'alpha_loss': -90.83641982540007, 'alpha': 8.629876146008892, 'critic_loss': 107.63499682026524, 'actor_loss': 383.017475054341, 'time_step': 0.04421696662902832, 'td_error': 12.565240078926212, 'discounted_sum_of_advantage': -13.255731415235251, 'average_value_estimation': -384.3897024826689, 'value_estimation_std': 0.2805402873281917, 'initial_state_value_estimation': -382.98712158203125, 'soft_opc': nan, 'continuous_action_diff': 0.6458126073875295, 'discrete_action_match': 0.0, 'evaluate_on_environment': -268.25262867100327, 'compare_continuous_action_diff': 0.0} step=19220




2022-11-03 00:45.50 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_19220.pt


Epoch 63/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:46.09 [info     ] CQL_20221103002531: epoch=63 step=19530 epoch=63 metrics={'time_sample_batch': 0.0007756363961004442, 'time_algorithm_update': 0.03811139137514176, 'temp_loss': 0.2133417643847004, 'temp': 0.1650544427094921, 'alpha_loss': -93.83056121333954, 'alpha': 8.92853161288846, 'critic_loss': 110.88266013360793, 'actor_loss': 386.8306507725869, 'time_step': 0.039031893976273074, 'td_error': 13.386668379143982, 'discounted_sum_of_advantage': -7.881432803805528, 'average_value_estimation': -387.6310063333604, 'value_estimation_std': 0.28465281585483776, 'initial_state_value_estimation': -385.988037109375, 'soft_opc': nan, 'continuous_action_diff': 0.647795363057281, 'discrete_action_match': 0.0, 'evaluate_on_environment': -361.23409598591087, 'compare_continuous_action_diff': 0.0} step=19530




2022-11-03 00:46.09 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_19530.pt


Epoch 64/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:46.33 [info     ] CQL_20221103002531: epoch=64 step=19840 epoch=64 metrics={'time_sample_batch': 0.0015512697158321256, 'time_algorithm_update': 0.04723154960140105, 'temp_loss': 0.20791533036578086, 'temp': 0.16049349774276056, 'alpha_loss': -96.97674575313445, 'alpha': 9.237739796792308, 'critic_loss': 114.29789588682114, 'actor_loss': 390.5170115809287, 'time_step': 0.048969471070074266, 'td_error': 12.649143608052364, 'discounted_sum_of_advantage': -11.71091871789991, 'average_value_estimation': -391.3437955398323, 'value_estimation_std': 0.29053879101696606, 'initial_state_value_estimation': -390.0496520996094, 'soft_opc': nan, 'continuous_action_diff': 0.6753849462260071, 'discrete_action_match': 2.5255050757601015e-07, 'evaluate_on_environment': -386.58201099574666, 'compare_continuous_action_diff': 0.0} step=19840




2022-11-03 00:46.33 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_19840.pt


Epoch 65/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:46.52 [info     ] CQL_20221103002531: epoch=65 step=20150 epoch=65 metrics={'time_sample_batch': 0.0008222556883288968, 'time_algorithm_update': 0.03813617152552451, 'temp_loss': 0.2015674674703229, 'temp': 0.15608457379764126, 'alpha_loss': -100.16294457220262, 'alpha': 9.557734206414992, 'critic_loss': 117.60142499862178, 'actor_loss': 394.2039105815272, 'time_step': 0.03910013937181042, 'td_error': 12.48352785505108, 'discounted_sum_of_advantage': -12.867551374813976, 'average_value_estimation': -394.99270270502393, 'value_estimation_std': 0.2934226173049189, 'initial_state_value_estimation': -393.5263366699219, 'soft_opc': nan, 'continuous_action_diff': 0.6815088953626753, 'discrete_action_match': 0.0, 'evaluate_on_environment': -290.37019156862544, 'compare_continuous_action_diff': 0.0} step=20150




2022-11-03 00:46.52 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_20150.pt


Epoch 66/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:47.12 [info     ] CQL_20221103002531: epoch=66 step=20460 epoch=66 metrics={'time_sample_batch': 0.001172121109501008, 'time_algorithm_update': 0.04270782009247811, 'temp_loss': 0.19599981918450324, 'temp': 0.15178216830376656, 'alpha_loss': -103.54737558672505, 'alpha': 9.888944302835773, 'critic_loss': 121.28630828857422, 'actor_loss': 397.80325209094633, 'time_step': 0.04405375449888168, 'td_error': 13.134469008106741, 'discounted_sum_of_advantage': -7.663620243038445, 'average_value_estimation': -398.81347446420665, 'value_estimation_std': 0.2870947655259005, 'initial_state_value_estimation': -397.20086669921875, 'soft_opc': nan, 'continuous_action_diff': 0.70044030687708, 'discrete_action_match': 0.0, 'evaluate_on_environment': -372.10536902612563, 'compare_continuous_action_diff': 0.0} step=20460




2022-11-03 00:47.12 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_20460.pt


Epoch 67/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:47.31 [info     ] CQL_20221103002531: epoch=67 step=20770 epoch=67 metrics={'time_sample_batch': 0.000755187003843246, 'time_algorithm_update': 0.03798898496935445, 'temp_loss': 0.18992743679592686, 'temp': 0.14760331694156892, 'alpha_loss': -107.01595845376292, 'alpha': 10.231817522356588, 'critic_loss': 124.83751902426442, 'actor_loss': 401.3803752283896, 'time_step': 0.038887031616703156, 'td_error': 12.701651764805662, 'discounted_sum_of_advantage': -9.278003291709686, 'average_value_estimation': -402.87553857467043, 'value_estimation_std': 0.2930907295589334, 'initial_state_value_estimation': -401.4286804199219, 'soft_opc': nan, 'continuous_action_diff': 0.6553354166277822, 'discrete_action_match': 0.0, 'evaluate_on_environment': -229.68953813398784, 'compare_continuous_action_diff': 0.0} step=20770




2022-11-03 00:47.31 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_20770.pt


Epoch 68/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:47.51 [info     ] CQL_20221103002531: epoch=68 step=21080 epoch=68 metrics={'time_sample_batch': 0.001002223260941044, 'time_algorithm_update': 0.04263856257161786, 'temp_loss': 0.18575289975250922, 'temp': 0.1435330543306566, 'alpha_loss': -110.53358698198872, 'alpha': 10.5864518104061, 'critic_loss': 128.49245366742534, 'actor_loss': 404.9722590292654, 'time_step': 0.04381301710682531, 'td_error': 12.324205343867993, 'discounted_sum_of_advantage': -11.056049938312208, 'average_value_estimation': -405.1035452193291, 'value_estimation_std': 0.29434227596191614, 'initial_state_value_estimation': -403.9288330078125, 'soft_opc': nan, 'continuous_action_diff': 0.6389468713582664, 'discrete_action_match': 0.0, 'evaluate_on_environment': -479.46352649821773, 'compare_continuous_action_diff': 0.0} step=21080




2022-11-03 00:47.51 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_21080.pt


Epoch 69/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:48.09 [info     ] CQL_20221103002531: epoch=69 step=21390 epoch=69 metrics={'time_sample_batch': 0.0007344099783128308, 'time_algorithm_update': 0.0379067390195785, 'temp_loss': 0.18094347758639243, 'temp': 0.1395517028627857, 'alpha_loss': -114.57449025800152, 'alpha': 10.953824624707622, 'critic_loss': 132.63601544287897, 'actor_loss': 408.4334686279297, 'time_step': 0.03876133964907739, 'td_error': 12.740913606666245, 'discounted_sum_of_advantage': -8.060141700841184, 'average_value_estimation': -409.1180546765614, 'value_estimation_std': 0.30709892137958317, 'initial_state_value_estimation': -407.86297607421875, 'soft_opc': nan, 'continuous_action_diff': 0.6678009554348486, 'discrete_action_match': 0.0, 'evaluate_on_environment': -412.83875921547843, 'compare_continuous_action_diff': 0.0} step=21390




2022-11-03 00:48.10 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_21390.pt


Epoch 70/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:48.30 [info     ] CQL_20221103002531: epoch=70 step=21700 epoch=70 metrics={'time_sample_batch': 0.0010206153315882528, 'time_algorithm_update': 0.04245169085841025, 'temp_loss': 0.17570087049276598, 'temp': 0.13568277296520048, 'alpha_loss': -118.19778144590316, 'alpha': 11.33427769445604, 'critic_loss': 136.51538750433153, 'actor_loss': 411.9251403808594, 'time_step': 0.04363754257079094, 'td_error': 12.833426093589692, 'discounted_sum_of_advantage': -7.372641463533032, 'average_value_estimation': -412.74754810075734, 'value_estimation_std': 0.3184755689859449, 'initial_state_value_estimation': -411.34674072265625, 'soft_opc': nan, 'continuous_action_diff': 0.666653618678764, 'discrete_action_match': 0.0, 'evaluate_on_environment': -385.3035228810644, 'compare_continuous_action_diff': 0.0} step=21700




2022-11-03 00:48.30 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_21700.pt


Epoch 71/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:48.49 [info     ] CQL_20221103002531: epoch=71 step=22010 epoch=71 metrics={'time_sample_batch': 0.0007824097910235005, 'time_algorithm_update': 0.03797680101087016, 'temp_loss': 0.1695410776042169, 'temp': 0.13193417028073343, 'alpha_loss': -122.23272409746724, 'alpha': 11.727395620653706, 'critic_loss': 140.864259781376, 'actor_loss': 415.3404575470955, 'time_step': 0.03891858362382458, 'td_error': 12.79139905285683, 'discounted_sum_of_advantage': -6.33599106936502, 'average_value_estimation': -415.814461698245, 'value_estimation_std': 0.3063478727865667, 'initial_state_value_estimation': -414.34405517578125, 'soft_opc': nan, 'continuous_action_diff': 0.6673190170529373, 'discrete_action_match': 0.0, 'evaluate_on_environment': -460.7282291673787, 'compare_continuous_action_diff': 0.0} step=22010




2022-11-03 00:48.49 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_22010.pt


Epoch 72/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:49.09 [info     ] CQL_20221103002531: epoch=72 step=22320 epoch=72 metrics={'time_sample_batch': 0.0012066671925206338, 'time_algorithm_update': 0.04108278443736415, 'temp_loss': 0.1668826145029837, 'temp': 0.12828955679170548, 'alpha_loss': -126.34734351865707, 'alpha': 12.13420322787377, 'critic_loss': 145.04197540283204, 'actor_loss': 418.70931829637095, 'time_step': 0.042453281341060516, 'td_error': 12.644178899366876, 'discounted_sum_of_advantage': -6.848791481609192, 'average_value_estimation': -418.9979975536616, 'value_estimation_std': 0.3100472074572182, 'initial_state_value_estimation': -417.491943359375, 'soft_opc': nan, 'continuous_action_diff': 0.6819776321863288, 'discrete_action_match': 0.0, 'evaluate_on_environment': -416.89107580185885, 'compare_continuous_action_diff': 0.0} step=22320




2022-11-03 00:49.09 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_22320.pt


Epoch 73/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:49.28 [info     ] CQL_20221103002531: epoch=73 step=22630 epoch=73 metrics={'time_sample_batch': 0.0007752564645582629, 'time_algorithm_update': 0.038233175585346835, 'temp_loss': 0.1614683529061656, 'temp': 0.12473382663822943, 'alpha_loss': -130.68027292067003, 'alpha': 12.55540995444021, 'critic_loss': 149.80208961732927, 'actor_loss': 422.0395758844191, 'time_step': 0.03913825558077905, 'td_error': 12.52719301689561, 'discounted_sum_of_advantage': -7.586817199894582, 'average_value_estimation': -422.5069793836989, 'value_estimation_std': 0.29771640089222223, 'initial_state_value_estimation': -421.2919006347656, 'soft_opc': nan, 'continuous_action_diff': 0.6774597248149624, 'discrete_action_match': 0.0, 'evaluate_on_environment': -456.94150575264393, 'compare_continuous_action_diff': 0.0} step=22630




2022-11-03 00:49.28 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_22630.pt


Epoch 74/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:49.48 [info     ] CQL_20221103002531: epoch=74 step=22940 epoch=74 metrics={'time_sample_batch': 0.0010911556982224988, 'time_algorithm_update': 0.04160797749796221, 'temp_loss': 0.15829890788562836, 'temp': 0.12126616443837843, 'alpha_loss': -135.37726189397998, 'alpha': 12.99171483132147, 'critic_loss': 154.44252004315777, 'actor_loss': 425.2957447667276, 'time_step': 0.042859745794726956, 'td_error': 12.085940510682967, 'discounted_sum_of_advantage': -9.277386564516728, 'average_value_estimation': -425.79878407533744, 'value_estimation_std': 0.30906594042039265, 'initial_state_value_estimation': -424.5594482421875, 'soft_opc': nan, 'continuous_action_diff': 0.6549478963682748, 'discrete_action_match': 0.0, 'evaluate_on_environment': -391.3441944871981, 'compare_continuous_action_diff': 0.0} step=22940




2022-11-03 00:49.48 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_22940.pt


Epoch 75/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:50.07 [info     ] CQL_20221103002531: epoch=75 step=23250 epoch=75 metrics={'time_sample_batch': 0.0007854730852188603, 'time_algorithm_update': 0.038105791614901637, 'temp_loss': 0.15413548148447467, 'temp': 0.11789378502195881, 'alpha_loss': -139.9965087890625, 'alpha': 13.443488634786299, 'critic_loss': 159.16634555939706, 'actor_loss': 428.51798637144026, 'time_step': 0.039015045473652504, 'td_error': 12.646180318831727, 'discounted_sum_of_advantage': -5.215619536386676, 'average_value_estimation': -429.1789270319549, 'value_estimation_std': 0.2920625111624588, 'initial_state_value_estimation': -427.92144775390625, 'soft_opc': nan, 'continuous_action_diff': 0.6673204715700481, 'discrete_action_match': 0.0, 'evaluate_on_environment': -340.16801134485536, 'compare_continuous_action_diff': 0.0} step=23250




2022-11-03 00:50.07 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_23250.pt


Epoch 76/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:50.30 [info     ] CQL_20221103002531: epoch=76 step=23560 epoch=76 metrics={'time_sample_batch': 0.001012354512368479, 'time_algorithm_update': 0.047007812992219004, 'temp_loss': 0.14887349677662695, 'temp': 0.11462154328342407, 'alpha_loss': -144.6522243376701, 'alpha': 13.910474174253402, 'critic_loss': 163.95867073305192, 'actor_loss': 431.7629975349672, 'time_step': 0.04820624551465434, 'td_error': 12.418073751256916, 'discounted_sum_of_advantage': -5.8729494266120135, 'average_value_estimation': -432.0280484469345, 'value_estimation_std': 0.30201437392621916, 'initial_state_value_estimation': -430.7159118652344, 'soft_opc': nan, 'continuous_action_diff': 0.6639796273159515, 'discrete_action_match': 0.0, 'evaluate_on_environment': -426.45429334417713, 'compare_continuous_action_diff': 0.0} step=23560




2022-11-03 00:50.30 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_23560.pt


Epoch 77/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:50.49 [info     ] CQL_20221103002531: epoch=77 step=23870 epoch=77 metrics={'time_sample_batch': 0.0007882595062255859, 'time_algorithm_update': 0.03831080159833354, 'temp_loss': 0.14673852886884442, 'temp': 0.11142422516019114, 'alpha_loss': -149.62102316579512, 'alpha': 14.393894115571053, 'critic_loss': 169.2091810164913, 'actor_loss': 434.8627190374559, 'time_step': 0.03922342331178727, 'td_error': 12.287781914501023, 'discounted_sum_of_advantage': -8.106097378946343, 'average_value_estimation': -435.9139601200721, 'value_estimation_std': 0.3122608543780367, 'initial_state_value_estimation': -434.695068359375, 'soft_opc': nan, 'continuous_action_diff': 0.6898658010989822, 'discrete_action_match': 0.0, 'evaluate_on_environment': -366.8669998494239, 'compare_continuous_action_diff': 0.0} step=23870




2022-11-03 00:50.49 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_23870.pt


Epoch 78/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:51.09 [info     ] CQL_20221103002531: epoch=78 step=24180 epoch=78 metrics={'time_sample_batch': 0.0011401168761714811, 'time_algorithm_update': 0.04262947344010876, 'temp_loss': 0.13997068015798445, 'temp': 0.10835131923517873, 'alpha_loss': -154.9283455141129, 'alpha': 14.894247221177624, 'critic_loss': 174.50890699817288, 'actor_loss': 437.98220835039695, 'time_step': 0.04392797100928522, 'td_error': 12.095400955989351, 'discounted_sum_of_advantage': -5.545486829346368, 'average_value_estimation': -438.38452507589017, 'value_estimation_std': 0.2974855024247898, 'initial_state_value_estimation': -436.9476013183594, 'soft_opc': nan, 'continuous_action_diff': 0.6511482189072515, 'discrete_action_match': 0.0, 'evaluate_on_environment': -469.8770816649803, 'compare_continuous_action_diff': 0.0} step=24180




2022-11-03 00:51.10 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_24180.pt


Epoch 79/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:51.28 [info     ] CQL_20221103002531: epoch=79 step=24490 epoch=79 metrics={'time_sample_batch': 0.0007744604541409401, 'time_algorithm_update': 0.038310816980177353, 'temp_loss': 0.13843496090942814, 'temp': 0.10534789079139309, 'alpha_loss': -160.35532723703693, 'alpha': 15.412235481508317, 'critic_loss': 180.28216331235825, 'actor_loss': 441.0160559869582, 'time_step': 0.03923462975409723, 'td_error': 12.315486892485154, 'discounted_sum_of_advantage': -5.3604820777633995, 'average_value_estimation': -441.87239428441137, 'value_estimation_std': 0.30754045878342967, 'initial_state_value_estimation': -440.6325378417969, 'soft_opc': nan, 'continuous_action_diff': 0.6479679622768107, 'discrete_action_match': 0.0, 'evaluate_on_environment': -325.76690883257197, 'compare_continuous_action_diff': 0.0} step=24490




2022-11-03 00:51.28 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_24490.pt


Epoch 80/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:51.48 [info     ] CQL_20221103002531: epoch=80 step=24800 epoch=80 metrics={'time_sample_batch': 0.0008719475038589969, 'time_algorithm_update': 0.040186649753201395, 'temp_loss': 0.13373020668664287, 'temp': 0.10240788678488423, 'alpha_loss': -165.78841080204134, 'alpha': 15.948596422133907, 'critic_loss': 185.78092882710118, 'actor_loss': 444.0509024343183, 'time_step': 0.041190141247164816, 'td_error': 12.303005971681532, 'discounted_sum_of_advantage': -5.679839738795788, 'average_value_estimation': -445.27727228899647, 'value_estimation_std': 0.30450273744900347, 'initial_state_value_estimation': -443.7253112792969, 'soft_opc': nan, 'continuous_action_diff': 0.6557561131863751, 'discrete_action_match': 0.0, 'evaluate_on_environment': -370.40998042582294, 'compare_continuous_action_diff': 0.0} step=24800




2022-11-03 00:51.48 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_24800.pt


Epoch 81/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:52.07 [info     ] CQL_20221103002531: epoch=81 step=25110 epoch=81 metrics={'time_sample_batch': 0.0007857161183511057, 'time_algorithm_update': 0.038277870608914286, 'temp_loss': 0.12959716831964832, 'temp': 0.09958611608993623, 'alpha_loss': -171.294874523532, 'alpha': 16.50250714209772, 'critic_loss': 191.51304193312123, 'actor_loss': 447.1001719813193, 'time_step': 0.0391861661787956, 'td_error': 12.867944180577535, 'discounted_sum_of_advantage': -2.42653339342044, 'average_value_estimation': -447.7804137553343, 'value_estimation_std': 0.3201199633961632, 'initial_state_value_estimation': -446.1153564453125, 'soft_opc': nan, 'continuous_action_diff': 0.6804627509861814, 'discrete_action_match': 0.0, 'evaluate_on_environment': -378.85063594224687, 'compare_continuous_action_diff': 0.0} step=25110




2022-11-03 00:52.07 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_25110.pt


Epoch 82/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:52.28 [info     ] CQL_20221103002531: epoch=82 step=25420 epoch=82 metrics={'time_sample_batch': 0.000956851436245826, 'time_algorithm_update': 0.03988938254694785, 'temp_loss': 0.12669773462318604, 'temp': 0.09682419461108023, 'alpha_loss': -177.36804691437752, 'alpha': 17.07649453070856, 'critic_loss': 197.76275462488974, 'actor_loss': 450.05403570359755, 'time_step': 0.041002135892068185, 'td_error': 12.966207032101183, 'discounted_sum_of_advantage': -0.47893335015228133, 'average_value_estimation': -451.1867534997213, 'value_estimation_std': 0.3190674365467002, 'initial_state_value_estimation': -449.5409851074219, 'soft_opc': nan, 'continuous_action_diff': 0.7012126070115735, 'discrete_action_match': 0.0, 'evaluate_on_environment': -304.73653526320584, 'compare_continuous_action_diff': 0.0} step=25420




2022-11-03 00:52.28 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_25420.pt


Epoch 83/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:52.46 [info     ] CQL_20221103002531: epoch=83 step=25730 epoch=83 metrics={'time_sample_batch': 0.0008187755461662046, 'time_algorithm_update': 0.038387628524534166, 'temp_loss': 0.12404001660404666, 'temp': 0.09414022288495494, 'alpha_loss': -183.67229865289502, 'alpha': 17.670319427982452, 'critic_loss': 204.13778760356288, 'actor_loss': 452.9765670284148, 'time_step': 0.03936291817695864, 'td_error': 12.509008741985697, 'discounted_sum_of_advantage': -4.26974569224488, 'average_value_estimation': -453.2272014219244, 'value_estimation_std': 0.3096528612896505, 'initial_state_value_estimation': -451.4392395019531, 'soft_opc': nan, 'continuous_action_diff': 0.6814441816313369, 'discrete_action_match': 0.0, 'evaluate_on_environment': -464.4390878401494, 'compare_continuous_action_diff': 0.0} step=25730




2022-11-03 00:52.46 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_25730.pt


Epoch 84/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:53.07 [info     ] CQL_20221103002531: epoch=84 step=26040 epoch=84 metrics={'time_sample_batch': 0.0007655043755808185, 'time_algorithm_update': 0.03843261887950282, 'temp_loss': 0.11993011296276124, 'temp': 0.09151618490296025, 'alpha_loss': -189.58523416826802, 'alpha': 18.284529101464056, 'critic_loss': 210.35424100814328, 'actor_loss': 455.81510748094126, 'time_step': 0.039331734564996536, 'td_error': 12.658468035577151, 'discounted_sum_of_advantage': -2.8639336614151647, 'average_value_estimation': -456.5174764310232, 'value_estimation_std': 0.306623495081949, 'initial_state_value_estimation': -454.96856689453125, 'soft_opc': nan, 'continuous_action_diff': 0.6634421428253586, 'discrete_action_match': 0.0, 'evaluate_on_environment': -257.90137291294434, 'compare_continuous_action_diff': 0.0} step=26040




2022-11-03 00:53.07 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_26040.pt


Epoch 85/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:53.25 [info     ] CQL_20221103002531: epoch=85 step=26350 epoch=85 metrics={'time_sample_batch': 0.0007879980148807648, 'time_algorithm_update': 0.03865606015728366, 'temp_loss': 0.11595840771352091, 'temp': 0.08898274170294884, 'alpha_loss': -196.43319908880417, 'alpha': 18.920596104283486, 'critic_loss': 217.10291314894152, 'actor_loss': 458.64229864305065, 'time_step': 0.03958274164507466, 'td_error': 12.002770929484338, 'discounted_sum_of_advantage': -5.746079535183937, 'average_value_estimation': -459.1535961655858, 'value_estimation_std': 0.31492010832153394, 'initial_state_value_estimation': -457.7252197265625, 'soft_opc': nan, 'continuous_action_diff': 0.6813032746431966, 'discrete_action_match': 0.0, 'evaluate_on_environment': -371.1610847794212, 'compare_continuous_action_diff': 0.0} step=26350




2022-11-03 00:53.25 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_26350.pt


Epoch 86/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:53.46 [info     ] CQL_20221103002531: epoch=86 step=26660 epoch=86 metrics={'time_sample_batch': 0.0007556307700372512, 'time_algorithm_update': 0.038797784620715724, 'temp_loss': 0.1134550918975184, 'temp': 0.08653273027270071, 'alpha_loss': -203.29842361942414, 'alpha': 19.578777147108507, 'critic_loss': 224.2403814500378, 'actor_loss': 461.51524156139743, 'time_step': 0.03969492297018728, 'td_error': 12.501231945062562, 'discounted_sum_of_advantage': -4.653423352001452, 'average_value_estimation': -461.4894938621059, 'value_estimation_std': 0.3208624404151215, 'initial_state_value_estimation': -460.3437194824219, 'soft_opc': nan, 'continuous_action_diff': 0.6542597402363186, 'discrete_action_match': 0.0, 'evaluate_on_environment': -334.8297715171623, 'compare_continuous_action_diff': 0.0} step=26660




2022-11-03 00:53.46 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_26660.pt


Epoch 87/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:54.05 [info     ] CQL_20221103002531: epoch=87 step=26970 epoch=87 metrics={'time_sample_batch': 0.0007513000119116998, 'time_algorithm_update': 0.03893540136275753, 'temp_loss': 0.11040952126345327, 'temp': 0.08411847831260773, 'alpha_loss': -210.543287855579, 'alpha': 20.26061902815296, 'critic_loss': 231.53026251023815, 'actor_loss': 464.2616667716734, 'time_step': 0.039837524967808874, 'td_error': 12.728375790029983, 'discounted_sum_of_advantage': -1.2937916953018411, 'average_value_estimation': -465.22011687710517, 'value_estimation_std': 0.3125326860351351, 'initial_state_value_estimation': -463.6593322753906, 'soft_opc': nan, 'continuous_action_diff': 0.6758445452360944, 'discrete_action_match': 0.0, 'evaluate_on_environment': -340.1864774441519, 'compare_continuous_action_diff': 0.0} step=26970




2022-11-03 00:54.05 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_26970.pt


Epoch 88/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:54.24 [info     ] CQL_20221103002531: epoch=88 step=27280 epoch=88 metrics={'time_sample_batch': 0.0008255481719970703, 'time_algorithm_update': 0.038570146406850504, 'temp_loss': 0.10698756820732547, 'temp': 0.0818049687772028, 'alpha_loss': -217.65813706920994, 'alpha': 20.96527275577668, 'critic_loss': 238.83083033407888, 'actor_loss': 467.0170197517641, 'time_step': 0.039532842943745275, 'td_error': 12.035414488017416, 'discounted_sum_of_advantage': -5.529716483232182, 'average_value_estimation': -467.60022304720707, 'value_estimation_std': 0.3176388201118245, 'initial_state_value_estimation': -466.3232116699219, 'soft_opc': nan, 'continuous_action_diff': 0.6472763120222506, 'discrete_action_match': 0.0, 'evaluate_on_environment': -434.3155865368055, 'compare_continuous_action_diff': 0.0} step=27280




2022-11-03 00:54.24 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_27280.pt


Epoch 89/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:54.42 [info     ] CQL_20221103002531: epoch=89 step=27590 epoch=89 metrics={'time_sample_batch': 0.0007505293815366684, 'time_algorithm_update': 0.03886693677594585, 'temp_loss': 0.10535534110761458, 'temp': 0.07952456433446177, 'alpha_loss': -225.11361211961315, 'alpha': 21.694966586943597, 'critic_loss': 246.5590329077936, 'actor_loss': 469.6626172465663, 'time_step': 0.03975032914069391, 'td_error': 12.184489776815004, 'discounted_sum_of_advantage': -4.81445657606456, 'average_value_estimation': -470.40033610294626, 'value_estimation_std': 0.3197837839740084, 'initial_state_value_estimation': -469.04046630859375, 'soft_opc': nan, 'continuous_action_diff': 0.6421108476849339, 'discrete_action_match': 5.051010151520203e-07, 'evaluate_on_environment': -305.919847088166, 'compare_continuous_action_diff': 0.0} step=27590




2022-11-03 00:54.42 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_27590.pt


Epoch 90/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:55.06 [info     ] CQL_20221103002531: epoch=90 step=27900 epoch=90 metrics={'time_sample_batch': 0.001089862085157825, 'time_algorithm_update': 0.04390938897286692, 'temp_loss': 0.10380633170566252, 'temp': 0.0772909379534183, 'alpha_loss': -233.20334964875252, 'alpha': 22.449466563809302, 'critic_loss': 254.56457790251702, 'actor_loss': 472.32120243195567, 'time_step': 0.04518718642573203, 'td_error': 12.27210281135882, 'discounted_sum_of_advantage': -4.306631279051388, 'average_value_estimation': -472.8326475393511, 'value_estimation_std': 0.3227818622029866, 'initial_state_value_estimation': -471.54547119140625, 'soft_opc': nan, 'continuous_action_diff': 0.6870387890170592, 'discrete_action_match': 0.0, 'evaluate_on_environment': -312.6302727294511, 'compare_continuous_action_diff': 0.0} step=27900




2022-11-03 00:55.06 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_27900.pt


Epoch 91/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:55.25 [info     ] CQL_20221103002531: epoch=91 step=28210 epoch=91 metrics={'time_sample_batch': 0.0007518083818497197, 'time_algorithm_update': 0.03854177228866085, 'temp_loss': 0.09900078316850047, 'temp': 0.07514050521196858, 'alpha_loss': -241.05964483445692, 'alpha': 23.230734062194824, 'critic_loss': 262.67694096719066, 'actor_loss': 474.97604478405367, 'time_step': 0.03945698122824392, 'td_error': 11.977038774176547, 'discounted_sum_of_advantage': -5.654719961386825, 'average_value_estimation': -475.506757235477, 'value_estimation_std': 0.3322017988759695, 'initial_state_value_estimation': -473.6894226074219, 'soft_opc': nan, 'continuous_action_diff': 0.6605416028146543, 'discrete_action_match': 0.0, 'evaluate_on_environment': -355.1451088548087, 'compare_continuous_action_diff': 0.0} step=28210




2022-11-03 00:55.25 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_28210.pt


Epoch 92/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:55.46 [info     ] CQL_20221103002531: epoch=92 step=28520 epoch=92 metrics={'time_sample_batch': 0.0010116300275248866, 'time_algorithm_update': 0.0436487244021508, 'temp_loss': 0.09668324166728605, 'temp': 0.07305899534013964, 'alpha_loss': -249.51382874519595, 'alpha': 24.039043629554012, 'critic_loss': 271.16659664031, 'actor_loss': 477.55948584771926, 'time_step': 0.04486132283364573, 'td_error': 12.053283709676574, 'discounted_sum_of_advantage': -4.534405503942218, 'average_value_estimation': -477.59792569985757, 'value_estimation_std': 0.3278573711318797, 'initial_state_value_estimation': -476.1619873046875, 'soft_opc': nan, 'continuous_action_diff': 0.6310137856171584, 'discrete_action_match': 0.0, 'evaluate_on_environment': -383.5034113914161, 'compare_continuous_action_diff': 0.0} step=28520




2022-11-03 00:55.46 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_28520.pt


Epoch 93/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:56.05 [info     ] CQL_20221103002531: epoch=93 step=28830 epoch=93 metrics={'time_sample_batch': 0.0007525059484666393, 'time_algorithm_update': 0.03833379976211056, 'temp_loss': 0.09387071139870151, 'temp': 0.07103343870370618, 'alpha_loss': -257.9198771815146, 'alpha': 24.875188501419558, 'critic_loss': 279.98480382119453, 'actor_loss': 480.1101992699408, 'time_step': 0.039227607942396596, 'td_error': 11.970107304714512, 'discounted_sum_of_advantage': -4.205534194240687, 'average_value_estimation': -480.50323339523896, 'value_estimation_std': 0.32679494132760195, 'initial_state_value_estimation': -478.94921875, 'soft_opc': nan, 'continuous_action_diff': 0.6810701448565376, 'discrete_action_match': 0.0, 'evaluate_on_environment': -509.6619532181668, 'compare_continuous_action_diff': 0.0} step=28830




2022-11-03 00:56.05 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_28830.pt


Epoch 94/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:56.25 [info     ] CQL_20221103002531: epoch=94 step=29140 epoch=94 metrics={'time_sample_batch': 0.0010331446124661353, 'time_algorithm_update': 0.04280714604162401, 'temp_loss': 0.09141526827889104, 'temp': 0.06906577454459283, 'alpha_loss': -267.19511265908517, 'alpha': 25.740652613486013, 'critic_loss': 289.1649560743763, 'actor_loss': 482.61361074139995, 'time_step': 0.04402922507255308, 'td_error': 12.62977969653039, 'discounted_sum_of_advantage': -0.6961995606479449, 'average_value_estimation': -483.2759552703928, 'value_estimation_std': 0.3276081347331532, 'initial_state_value_estimation': -481.30743408203125, 'soft_opc': nan, 'continuous_action_diff': 0.6796788758995198, 'discrete_action_match': 0.0, 'evaluate_on_environment': -292.50398927809186, 'compare_continuous_action_diff': 0.0} step=29140




2022-11-03 00:56.25 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_29140.pt


Epoch 95/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:56.44 [info     ] CQL_20221103002531: epoch=95 step=29450 epoch=95 metrics={'time_sample_batch': 0.0007743081738871912, 'time_algorithm_update': 0.03872188291242046, 'temp_loss': 0.08944230586771042, 'temp': 0.06715020753683583, 'alpha_loss': -276.43838530509703, 'alpha': 26.636345795662173, 'critic_loss': 298.42801858225175, 'actor_loss': 485.1419982910156, 'time_step': 0.039639426815894344, 'td_error': 12.243567644430305, 'discounted_sum_of_advantage': -3.4573359806600243, 'average_value_estimation': -485.7025460251714, 'value_estimation_std': 0.3373971035121001, 'initial_state_value_estimation': -483.9903564453125, 'soft_opc': nan, 'continuous_action_diff': 0.6443703355141318, 'discrete_action_match': 0.0, 'evaluate_on_environment': -483.71804825533934, 'compare_continuous_action_diff': 0.0} step=29450




2022-11-03 00:56.44 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_29450.pt


Epoch 96/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:57.04 [info     ] CQL_20221103002531: epoch=96 step=29760 epoch=96 metrics={'time_sample_batch': 0.000853667720671623, 'time_algorithm_update': 0.0403499226416311, 'temp_loss': 0.08752204778213654, 'temp': 0.06527228593345612, 'alpha_loss': -286.34342582456526, 'alpha': 27.56412035419095, 'critic_loss': 308.5507061373803, 'actor_loss': 487.6088155438823, 'time_step': 0.04137496794423749, 'td_error': 12.40456860568147, 'discounted_sum_of_advantage': -2.2652413192102645, 'average_value_estimation': -489.008537560299, 'value_estimation_std': 0.3486974964197763, 'initial_state_value_estimation': -487.28765869140625, 'soft_opc': nan, 'continuous_action_diff': 0.6579133756221603, 'discrete_action_match': 0.0, 'evaluate_on_environment': -450.6219307863542, 'compare_continuous_action_diff': 0.0} step=29760




2022-11-03 00:57.04 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_29760.pt


Epoch 97/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:57.23 [info     ] CQL_20221103002531: epoch=97 step=30070 epoch=97 metrics={'time_sample_batch': 0.0007615558562740202, 'time_algorithm_update': 0.03843784178456953, 'temp_loss': 0.08550722711989957, 'temp': 0.06344523492359345, 'alpha_loss': -296.20122729885964, 'alpha': 28.524057160654376, 'critic_loss': 318.5599306168095, 'actor_loss': 490.0178721766318, 'time_step': 0.03934823159248598, 'td_error': 12.009036050012272, 'discounted_sum_of_advantage': -3.6332898672251495, 'average_value_estimation': -490.41513246478456, 'value_estimation_std': 0.3477159229937492, 'initial_state_value_estimation': -488.9451904296875, 'soft_opc': nan, 'continuous_action_diff': 0.6232605360419823, 'discrete_action_match': 0.0, 'evaluate_on_environment': -376.2744870339386, 'compare_continuous_action_diff': 0.0} step=30070




2022-11-03 00:57.23 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_30070.pt


Epoch 98/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:57.43 [info     ] CQL_20221103002531: epoch=98 step=30380 epoch=98 metrics={'time_sample_batch': 0.0007370148935625631, 'time_algorithm_update': 0.03876826532425419, 'temp_loss': 0.08272925737884737, 'temp': 0.06166633452859617, 'alpha_loss': -306.6239335583102, 'alpha': 29.51726514754757, 'critic_loss': 328.92642871487527, 'actor_loss': 492.4075221892326, 'time_step': 0.03965718054002331, 'td_error': 12.473636844047284, 'discounted_sum_of_advantage': -0.8487471142291281, 'average_value_estimation': -492.8485536999745, 'value_estimation_std': 0.35441431081497843, 'initial_state_value_estimation': -491.17413330078125, 'soft_opc': nan, 'continuous_action_diff': 0.6495037082370569, 'discrete_action_match': 0.0, 'evaluate_on_environment': -435.97794713749647, 'compare_continuous_action_diff': 0.0} step=30380




2022-11-03 00:57.43 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_30380.pt


Epoch 99/100:   0%|          | 0/310 [00:00<?, ?it/s]



2022-11-03 00:58.02 [info     ] CQL_20221103002531: epoch=99 step=30690 epoch=99 metrics={'time_sample_batch': 0.0007720370446482012, 'time_algorithm_update': 0.038736561036879016, 'temp_loss': 0.08155495846463788, 'temp': 0.05994908687087797, 'alpha_loss': -317.43363086331277, 'alpha': 30.54458626777895, 'critic_loss': 340.0071957495905, 'actor_loss': 494.76472453455773, 'time_step': 0.03967460509269468, 'td_error': 12.257475543521414, 'discounted_sum_of_advantage': -1.959221880257249, 'average_value_estimation': -495.5292101048484, 'value_estimation_std': 0.34174674336495303, 'initial_state_value_estimation': -493.699951171875, 'soft_opc': nan, 'continuous_action_diff': 0.6447359503535812, 'discrete_action_match': 0.0, 'evaluate_on_environment': -475.06287000693374, 'compare_continuous_action_diff': 0.0} step=30690




2022-11-03 00:58.02 [info     ] Model parameters are saved to d3rlpy_logs/CQL_20221103002531/model_30690.pt


Epoch 100/100:   0%|          | 0/310 [00:00<?, ?it/s]
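
The per-epoch log lines above pack every metric into a `metrics={...}` dictionary, which is hard to compare across epochs by eye. Below is a minimal sketch of pulling `evaluate_on_environment` out of captured log text. It assumes the lines keep exactly the layout shown above; `parse_log_line` and `evaluation_returns` are hypothetical helper names, not part of d3rlpy.

```python
import ast
import re

# Matches "epoch=<n> step=<n> ... metrics={...}" in one log line.
# Assumption: the metrics dict contains no nested braces, as in the output above.
METRICS_RE = re.compile(r"epoch=(\d+) step=(\d+) .*?metrics=(\{.*?\})")


def parse_log_line(line):
    """Return (epoch, step, metrics) for a metrics line, or None for any other line."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    epoch, step = int(match.group(1)), int(match.group(2))
    # `nan` is not a Python literal, so map it to None before evaluating the dict.
    payload = re.sub(r"\bnan\b", "None", match.group(3))
    return epoch, step, ast.literal_eval(payload)


def evaluation_returns(lines):
    """Collect (epoch, evaluate_on_environment) pairs from an iterable of log lines."""
    pairs = []
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is not None and "evaluate_on_environment" in parsed[2]:
            pairs.append((parsed[0], parsed[2]["evaluate_on_environment"]))
    return pairs
```

Passing the captured cell output line by line to `evaluation_returns` gives one `(epoch, return)` pair per epoch, which makes it easier to see how the environment return moves while `critic_loss` and `alpha` keep changing.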

In [None]:
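# Launch TensorBoard on the CQL run directory.
# Assumes the TensorBoard extension is already loaded (%load_ext tensorboard)
# and that training wrote its TensorBoard logs to 'cql_run'.
%tensorboard --logdir cql_run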