<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_03_keras_reinforce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 12: Deep Learning and Security**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 12 Video Material

* Part 12.1: Introduction to the OpenAI Gym [[Video]](https://www.youtube.com/watch?v=_KbUxgyisjM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_01_ai_gym.ipynb)
* Part 12.2: Introduction to Q-Learning [[Video]](https://www.youtube.com/watch?v=uwcXWe_Fra0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_02_qlearningreinforcement.ipynb)
* **Part 12.3: Keras Q-Learning in the OpenAI Gym** [[Video]](https://www.youtube.com/watch?v=Ya1gYt63o3M&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_03_keras_reinforce.ipynb)
* Part 12.4: Atari Games with Keras Neural Networks [[Video]](https://www.youtube.com/watch?v=t2yIu6cRa38&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_04_atari.ipynb)
* Part 12.5: How Alpha Zero used Reinforcement Learning to Master Chess [[Video]](https://www.youtube.com/watch?v=ikDgyD7nVI8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_05_alpha_zero.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
try:
    from google.colab import drive
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Part 12.3: Keras Q-Learning in the OpenAI Gym

![Deep Q-Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/deepqlearning.png "Reinforcement Learning")

* **CEMAgent**
    * **model** - The neural network that will be trained.
    * **nb_actions** - The number of actions the agent can take (e.g. up, down, left, right, fire)
    * **memory** - The EpisodeParameterMemory object to use. It records the parameter set tried in each episode together with that episode's total reward, so the agent can later sample past episodes for training instead of re-running the environment.
    * **batch_size** - The batch size for neural network training, same concept as deep learning batch sizes.
    * **nb_steps_warmup** - Number of training steps to pass before any learning occurs.
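    * **train_interval** - How often, in episodes, the agent updates its parameter distribution from a sampled batch.
    * **elite_frac** - The fraction of the sampled batch, ranked by episode reward, that is kept as the elite set for the update.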
* **CEMAgent.fit**
    * **env** - The OpenAI gym environment being used.
    * **nb_steps** - Number of training steps to be performed.
    * **visualize** - If `True`, the environment is visualized during training. However, this is likely to slow down training significantly and is thus intended to be a debugging instrument.
    * **verbose** - 0 for no logging, 1 for interval logging (compare `log_interval`), 2 for episode logging
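
The roles of **batch_size** and **elite_frac** are easier to see in a stripped-down version of the cross-entropy method (CEM). The following is a minimal NumPy sketch on a toy objective, not the keras-rl implementation: the function `cem_optimize`, its `min_std` noise floor, the `target` vector, and the default values are all hypothetical names and choices made for illustration. It assumes the elite count is `int(batch_size * elite_frac)`; under that assumption, the notebook's settings of `batch_size=50` and `elite_frac=0.05` would keep 2 candidates per update.

```python
import numpy as np

def cem_optimize(reward_fn, dim, batch_size=50, elite_frac=0.2,
                 iterations=50, min_std=0.05, seed=123):
    """Toy cross-entropy method: search for a vector that maximizes reward_fn."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)   # center of the sampling distribution
    std = np.ones(dim)     # per-parameter spread
    # Assumed rule for the size of the elite set.
    num_best = max(1, int(batch_size * elite_frac))

    for _ in range(iterations):
        # Sample a batch of candidate parameter vectors.
        samples = rng.normal(mean, std, size=(batch_size, dim))
        rewards = np.array([reward_fn(s) for s in samples])
        # Keep the candidates with the highest rewards (the elite set).
        elite = samples[np.argsort(rewards)[-num_best:]]
        # Refit the distribution to the elite set; the floor keeps exploring.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + min_std
    return mean

# Hypothetical toy problem: reward is highest when the vector equals target.
target = np.array([0.5, -1.0, 0.25, 1.0])
best = cem_optimize(lambda w: -np.sum((w - target) ** 2), dim=4)
print(np.round(best, 2))
```

In the agent, the sampled vector plays the role of the flattened network weights and the reward is the total reward of one episode, so no gradients are needed.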




In [1]:
import numpy as np
import gym

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents.cem import CEMAgent
from rl.memory import EpisodeParameterMemory

ENV_NAME = 'CartPole-v0'


# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)

nb_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Option 1 : Simple model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('softmax'))

# Option 2: deep network
# model = Sequential()
# model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(nb_actions))
# model.add(Activation('softmax'))


print(model.summary())


# Configure and compile the agent. CEM is a gradient-free method, so compile() is called
# without an optimizer or metrics.
memory = EpisodeParameterMemory(limit=1000, window_length=1)

cem = CEMAgent(model=model, nb_actions=nb_actions, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()

# Okay, now it's time to learn something! Visualization is disabled here because it
# slows down training quite a lot. You can always safely abort the training prematurely using
# Ctrl + C.
cem.fit(env, nb_steps=100000, visualize=False, verbose=2)

# After training is done, we save the learned parameters.
cem.save_weights('cem_{}_params.h5f'.format(ENV_NAME), overwrite=True)

# Finally, evaluate our algorithm for 5 episodes.
cem.test(env, nb_episodes=5, visualize=True)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 4)                 0         
_________________________________________________________________
dense (Dense)                (None, 2)                 10        
_________________________________________________________________
activation (Activation)      (None, 2)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________
None
Training for 100000 steps ...
    28/100000: episode: 1, duration: 0.049s, episode steps:  28, steps per second: 567, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
    41/100000: episode: 2, duration: 0.008s, episode steps:  13, steps per second: 1537, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000

   883/100000: episode: 45, duration: 0.018s, episode steps:  31, steps per second: 1678, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
   902/100000: episode: 46, duration: 0.015s, episode steps:  19, steps per second: 1245, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.368 [0.000, 1.000],  mean_best_reward: --
   917/100000: episode: 47, duration: 0.011s, episode steps:  15, steps per second: 1343, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
   945/100000: episode: 48, duration: 0.016s, episode steps:  28, steps per second: 1728, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.679 [0.000, 1.000],  mean_best_reward: --
   974/100000: episode: 49, duration: 0.018s, episode steps:  29, steps per second: 1607, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action:

  1848/100000: episode: 97, duration: 0.028s, episode steps:  48, steps per second: 1735, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
  1866/100000: episode: 98, duration: 0.012s, episode steps:  18, steps per second: 1549, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
  1882/100000: episode: 99, duration: 0.011s, episode steps:  16, steps per second: 1498, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.312 [0.000, 1.000],  mean_best_reward: --
  1897/100000: episode: 100, duration: 0.011s, episode steps:  15, steps per second: 1414, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.133 [0.000, 1.000],  mean_best_reward: --
  1910/100000: episode: 101, duration: 0.009s, episode steps:  13, steps per second: 1490, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean actio

  2912/100000: episode: 147, duration: 0.020s, episode steps:  38, steps per second: 1899, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.421 [0.000, 1.000],  mean_best_reward: --
  2925/100000: episode: 148, duration: 0.009s, episode steps:  13, steps per second: 1482, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.308 [0.000, 1.000],  mean_best_reward: --
  3000/100000: episode: 149, duration: 0.036s, episode steps:  75, steps per second: 2068, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
  3037/100000: episode: 150, duration: 0.019s, episode steps:  37, steps per second: 1985, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.351 [0.000, 1.000],  mean_best_reward: --
  3054/100000: episode: 151, duration: 0.010s, episode steps:  17, steps per second: 1648, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  4012/100000: episode: 204, duration: 0.008s, episode steps:  12, steps per second: 1490, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
  4030/100000: episode: 205, duration: 0.011s, episode steps:  18, steps per second: 1664, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
  4043/100000: episode: 206, duration: 0.007s, episode steps:  13, steps per second: 1839, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.308 [0.000, 1.000],  mean_best_reward: --
  4061/100000: episode: 207, duration: 0.011s, episode steps:  18, steps per second: 1695, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  4111/100000: episode: 208, duration: 0.026s, episode steps:  50, steps per second: 1951, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  5161/100000: episode: 250, duration: 0.008s, episode steps:  12, steps per second: 1570, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
  5270/100000: episode: 251, duration: 0.061s, episode steps: 109, steps per second: 1774, episode reward: 109.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: 88.500000
  5332/100000: episode: 252, duration: 0.039s, episode steps:  62, steps per second: 1593, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
  5355/100000: episode: 253, duration: 0.014s, episode steps:  23, steps per second: 1640, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.609 [0.000, 1.000],  mean_best_reward: --
  5384/100000: episode: 254, duration: 0.017s, episode steps:  29, steps per second: 1659, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000],

  6642/100000: episode: 302, duration: 0.010s, episode steps:  16, steps per second: 1617, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.688 [0.000, 1.000],  mean_best_reward: --
  6685/100000: episode: 303, duration: 0.023s, episode steps:  43, steps per second: 1839, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
  6769/100000: episode: 304, duration: 0.040s, episode steps:  84, steps per second: 2111, episode reward: 84.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
  6788/100000: episode: 305, duration: 0.011s, episode steps:  19, steps per second: 1761, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
  6824/100000: episode: 306, duration: 0.018s, episode steps:  36, steps per second: 1962, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  8192/100000: episode: 353, duration: 0.021s, episode steps:  39, steps per second: 1849, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.487 [0.000, 1.000],  mean_best_reward: --
  8212/100000: episode: 354, duration: 0.012s, episode steps:  20, steps per second: 1723, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.650 [0.000, 1.000],  mean_best_reward: --
  8223/100000: episode: 355, duration: 0.006s, episode steps:  11, steps per second: 1695, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.636 [0.000, 1.000],  mean_best_reward: --
  8237/100000: episode: 356, duration: 0.007s, episode steps:  14, steps per second: 1869, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
  8252/100000: episode: 357, duration: 0.008s, episode steps:  15, steps per second: 1900, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  9366/100000: episode: 393, duration: 0.013s, episode steps:  24, steps per second: 1792, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  9401/100000: episode: 394, duration: 0.019s, episode steps:  35, steps per second: 1818, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.457 [0.000, 1.000],  mean_best_reward: --
  9439/100000: episode: 395, duration: 0.019s, episode steps:  38, steps per second: 2017, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  9452/100000: episode: 396, duration: 0.007s, episode steps:  13, steps per second: 1848, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.769 [0.000, 1.000],  mean_best_reward: --
  9485/100000: episode: 397, duration: 0.018s, episode steps:  33, steps per second: 1868, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 10535/100000: episode: 437, duration: 0.022s, episode steps:  42, steps per second: 1871, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 10560/100000: episode: 438, duration: 0.014s, episode steps:  25, steps per second: 1793, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 10613/100000: episode: 439, duration: 0.027s, episode steps:  53, steps per second: 1972, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 10647/100000: episode: 440, duration: 0.018s, episode steps:  34, steps per second: 1939, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 10776/100000: episode: 441, duration: 0.066s, episode steps: 129, steps per second: 1968, episode reward: 129.000, mean reward:  1.000 [ 1.000,  1.000], mean a

 12094/100000: episode: 485, duration: 0.029s, episode steps:  53, steps per second: 1831, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.491 [0.000, 1.000],  mean_best_reward: --
 12121/100000: episode: 486, duration: 0.014s, episode steps:  27, steps per second: 1888, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 12158/100000: episode: 487, duration: 0.019s, episode steps:  37, steps per second: 1970, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.541 [0.000, 1.000],  mean_best_reward: --
 12178/100000: episode: 488, duration: 0.011s, episode steps:  20, steps per second: 1834, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: --
 12193/100000: episode: 489, duration: 0.008s, episode steps:  15, steps per second: 1893, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 13727/100000: episode: 530, duration: 0.035s, episode steps:  64, steps per second: 1824, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 13748/100000: episode: 531, duration: 0.011s, episode steps:  21, steps per second: 1907, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 13782/100000: episode: 532, duration: 0.017s, episode steps:  34, steps per second: 1965, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 13852/100000: episode: 533, duration: 0.035s, episode steps:  70, steps per second: 2019, episode reward: 70.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 13892/100000: episode: 534, duration: 0.020s, episode steps:  40, steps per second: 2025, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 15337/100000: episode: 575, duration: 0.015s, episode steps:  27, steps per second: 1857, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 15390/100000: episode: 576, duration: 0.028s, episode steps:  53, steps per second: 1907, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.491 [0.000, 1.000],  mean_best_reward: --
 15425/100000: episode: 577, duration: 0.019s, episode steps:  35, steps per second: 1811, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.457 [0.000, 1.000],  mean_best_reward: --
 15455/100000: episode: 578, duration: 0.015s, episode steps:  30, steps per second: 1968, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 15491/100000: episode: 579, duration: 0.019s, episode steps:  36, steps per second: 1920, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 16940/100000: episode: 623, duration: 0.011s, episode steps:  18, steps per second: 1600, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 16951/100000: episode: 624, duration: 0.007s, episode steps:  11, steps per second: 1643, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.273 [0.000, 1.000],  mean_best_reward: --
 16978/100000: episode: 625, duration: 0.015s, episode steps:  27, steps per second: 1836, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 17014/100000: episode: 626, duration: 0.019s, episode steps:  36, steps per second: 1927, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 17044/100000: episode: 627, duration: 0.016s, episode steps:  30, steps per second: 1928, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 18120/100000: episode: 662, duration: 0.032s, episode steps:  57, steps per second: 1778, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.544 [0.000, 1.000],  mean_best_reward: --
 18179/100000: episode: 663, duration: 0.029s, episode steps:  59, steps per second: 2051, episode reward: 59.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.508 [0.000, 1.000],  mean_best_reward: --
 18207/100000: episode: 664, duration: 0.015s, episode steps:  28, steps per second: 1882, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 18265/100000: episode: 665, duration: 0.033s, episode steps:  58, steps per second: 1747, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 18274/100000: episode: 666, duration: 0.006s, episode steps:   9, steps per second: 1552, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 19717/100000: episode: 705, duration: 0.020s, episode steps:  36, steps per second: 1834, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 19764/100000: episode: 706, duration: 0.025s, episode steps:  47, steps per second: 1904, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.511 [0.000, 1.000],  mean_best_reward: --
 19784/100000: episode: 707, duration: 0.010s, episode steps:  20, steps per second: 1982, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 19822/100000: episode: 708, duration: 0.019s, episode steps:  38, steps per second: 2015, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 19873/100000: episode: 709, duration: 0.026s, episode steps:  51, steps per second: 1993, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 21305/100000: episode: 750, duration: 0.020s, episode steps:  38, steps per second: 1908, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.553 [0.000, 1.000],  mean_best_reward: --
 21338/100000: episode: 751, duration: 0.019s, episode steps:  33, steps per second: 1745, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: 69.000000
 21350/100000: episode: 752, duration: 0.007s, episode steps:  12, steps per second: 1797, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 21382/100000: episode: 753, duration: 0.016s, episode steps:  32, steps per second: 2009, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 21423/100000: episode: 754, duration: 0.021s, episode steps:  41, steps per second: 1975, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], 

 22475/100000: episode: 789, duration: 0.024s, episode steps:  43, steps per second: 1811, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 22499/100000: episode: 790, duration: 0.014s, episode steps:  24, steps per second: 1696, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
 22548/100000: episode: 791, duration: 0.027s, episode steps:  49, steps per second: 1835, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.490 [0.000, 1.000],  mean_best_reward: --
 22581/100000: episode: 792, duration: 0.018s, episode steps:  33, steps per second: 1792, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
 22644/100000: episode: 793, duration: 0.031s, episode steps:  63, steps per second: 2018, episode reward: 63.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 24098/100000: episode: 834, duration: 0.019s, episode steps:  40, steps per second: 2058, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 24136/100000: episode: 835, duration: 0.018s, episode steps:  38, steps per second: 2068, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 24161/100000: episode: 836, duration: 0.013s, episode steps:  25, steps per second: 1858, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.560 [0.000, 1.000],  mean_best_reward: --
 24205/100000: episode: 837, duration: 0.021s, episode steps:  44, steps per second: 2070, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.477 [0.000, 1.000],  mean_best_reward: --
 24236/100000: episode: 838, duration: 0.014s, episode steps:  31, steps per second: 2160, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 25848/100000: episode: 876, duration: 0.025s, episode steps:  50, steps per second: 1979, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 25894/100000: episode: 877, duration: 0.023s, episode steps:  46, steps per second: 1973, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 25922/100000: episode: 878, duration: 0.015s, episode steps:  28, steps per second: 1908, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.464 [0.000, 1.000],  mean_best_reward: --
 25939/100000: episode: 879, duration: 0.010s, episode steps:  17, steps per second: 1740, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 25972/100000: episode: 880, duration: 0.016s, episode steps:  33, steps per second: 2069, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 27543/100000: episode: 927, duration: 0.015s, episode steps:  29, steps per second: 1961, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.621 [0.000, 1.000],  mean_best_reward: --
 27559/100000: episode: 928, duration: 0.009s, episode steps:  16, steps per second: 1789, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.562 [0.000, 1.000],  mean_best_reward: --
 27572/100000: episode: 929, duration: 0.007s, episode steps:  13, steps per second: 1850, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 27595/100000: episode: 930, duration: 0.012s, episode steps:  23, steps per second: 1998, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
 27652/100000: episode: 931, duration: 0.026s, episode steps:  57, steps per second: 2156, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 28732/100000: episode: 968, duration: 0.048s, episode steps: 102, steps per second: 2127, episode reward: 102.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.490 [0.000, 1.000],  mean_best_reward: --
 28753/100000: episode: 969, duration: 0.011s, episode steps:  21, steps per second: 1918, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.619 [0.000, 1.000],  mean_best_reward: --
 28768/100000: episode: 970, duration: 0.008s, episode steps:  15, steps per second: 1927, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 28857/100000: episode: 971, duration: 0.041s, episode steps:  89, steps per second: 2157, episode reward: 89.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.494 [0.000, 1.000],  mean_best_reward: --
 28874/100000: episode: 972, duration: 0.009s, episode steps:  17, steps per second: 1920, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean a

 30252/100000: episode: 1017, duration: 0.027s, episode steps:  54, steps per second: 1970, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 30272/100000: episode: 1018, duration: 0.012s, episode steps:  20, steps per second: 1718, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 30293/100000: episode: 1019, duration: 0.011s, episode steps:  21, steps per second: 1930, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 30319/100000: episode: 1020, duration: 0.014s, episode steps:  26, steps per second: 1800, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.423 [0.000, 1.000],  mean_best_reward: --
 30334/100000: episode: 1021, duration: 0.008s, episode steps:  15, steps per second: 1781, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], me

 31507/100000: episode: 1057, duration: 0.018s, episode steps:  31, steps per second: 1754, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 31556/100000: episode: 1058, duration: 0.023s, episode steps:  49, steps per second: 2088, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 31606/100000: episode: 1059, duration: 0.024s, episode steps:  50, steps per second: 2123, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 31634/100000: episode: 1060, duration: 0.014s, episode steps:  28, steps per second: 1985, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.464 [0.000, 1.000],  mean_best_reward: --
 31655/100000: episode: 1061, duration: 0.011s, episode steps:  21, steps per second: 1980, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], me

 33172/100000: episode: 1102, duration: 0.015s, episode steps:  28, steps per second: 1848, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 33193/100000: episode: 1103, duration: 0.012s, episode steps:  21, steps per second: 1807, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 33218/100000: episode: 1104, duration: 0.013s, episode steps:  25, steps per second: 1964, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 33233/100000: episode: 1105, duration: 0.008s, episode steps:  15, steps per second: 1986, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: --
 33265/100000: episode: 1106, duration: 0.015s, episode steps:  32, steps per second: 2080, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], me

 34403/100000: episode: 1141, duration: 0.012s, episode steps:  23, steps per second: 1973, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.391 [0.000, 1.000],  mean_best_reward: --
 34419/100000: episode: 1142, duration: 0.010s, episode steps:  16, steps per second: 1663, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.438 [0.000, 1.000],  mean_best_reward: --
 34443/100000: episode: 1143, duration: 0.012s, episode steps:  24, steps per second: 1939, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 34487/100000: episode: 1144, duration: 0.021s, episode steps:  44, steps per second: 2119, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 34518/100000: episode: 1145, duration: 0.015s, episode steps:  31, steps per second: 2067, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 36147/100000: episode: 1183, duration: 0.019s, episode steps:  38, steps per second: 2034, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 36173/100000: episode: 1184, duration: 0.014s, episode steps:  26, steps per second: 1852, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 36216/100000: episode: 1185, duration: 0.020s, episode steps:  43, steps per second: 2138, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 36267/100000: episode: 1186, duration: 0.025s, episode steps:  51, steps per second: 2072, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 36300/100000: episode: 1187, duration: 0.016s, episode steps:  33, steps per second: 2098, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], me

 37870/100000: episode: 1235, duration: 0.010s, episode steps:  19, steps per second: 1827, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 37883/100000: episode: 1236, duration: 0.007s, episode steps:  13, steps per second: 1755, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.385 [0.000, 1.000],  mean_best_reward: --
 37903/100000: episode: 1237, duration: 0.011s, episode steps:  20, steps per second: 1826, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 37946/100000: episode: 1238, duration: 0.020s, episode steps:  43, steps per second: 2111, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 37999/100000: episode: 1239, duration: 0.025s, episode steps:  53, steps per second: 2118, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], me

 39146/100000: episode: 1277, duration: 0.018s, episode steps:  35, steps per second: 1906, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 39225/100000: episode: 1278, duration: 0.037s, episode steps:  79, steps per second: 2134, episode reward: 79.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 39266/100000: episode: 1279, duration: 0.019s, episode steps:  41, steps per second: 2104, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 39283/100000: episode: 1280, duration: 0.008s, episode steps:  17, steps per second: 2020, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.588 [0.000, 1.000],  mean_best_reward: --
 39311/100000: episode: 1281, duration: 0.013s, episode steps:  28, steps per second: 2084, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], me

 40827/100000: episode: 1323, duration: 0.028s, episode steps:  61, steps per second: 2155, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.508 [0.000, 1.000],  mean_best_reward: --
 40854/100000: episode: 1324, duration: 0.017s, episode steps:  27, steps per second: 1603, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.593 [0.000, 1.000],  mean_best_reward: --
 40890/100000: episode: 1325, duration: 0.018s, episode steps:  36, steps per second: 1950, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 40924/100000: episode: 1326, duration: 0.018s, episode steps:  34, steps per second: 1895, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 40947/100000: episode: 1327, duration: 0.011s, episode steps:  23, steps per second: 2070, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], me

 42522/100000: episode: 1373, duration: 0.023s, episode steps:  42, steps per second: 1796, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 42564/100000: episode: 1374, duration: 0.020s, episode steps:  42, steps per second: 2116, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 42597/100000: episode: 1375, duration: 0.016s, episode steps:  33, steps per second: 2051, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.515 [0.000, 1.000],  mean_best_reward: --
 42664/100000: episode: 1376, duration: 0.031s, episode steps:  67, steps per second: 2172, episode reward: 67.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.507 [0.000, 1.000],  mean_best_reward: --
 42686/100000: episode: 1377, duration: 0.011s, episode steps:  22, steps per second: 2088, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], me

 44347/100000: episode: 1417, duration: 0.027s, episode steps:  55, steps per second: 2014, episode reward: 55.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.473 [0.000, 1.000],  mean_best_reward: --
 44381/100000: episode: 1418, duration: 0.017s, episode steps:  34, steps per second: 2009, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.588 [0.000, 1.000],  mean_best_reward: --
 44416/100000: episode: 1419, duration: 0.017s, episode steps:  35, steps per second: 2104, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 44451/100000: episode: 1420, duration: 0.017s, episode steps:  35, steps per second: 2058, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 44481/100000: episode: 1421, duration: 0.016s, episode steps:  30, steps per second: 1904, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], me

 46068/100000: episode: 1461, duration: 0.019s, episode steps:  38, steps per second: 2042, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 46104/100000: episode: 1462, duration: 0.018s, episode steps:  36, steps per second: 1980, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 46140/100000: episode: 1463, duration: 0.017s, episode steps:  36, steps per second: 2155, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 46165/100000: episode: 1464, duration: 0.013s, episode steps:  25, steps per second: 1989, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: --
 46198/100000: episode: 1465, duration: 0.017s, episode steps:  33, steps per second: 1994, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], me

 47768/100000: episode: 1511, duration: 0.012s, episode steps:  24, steps per second: 1976, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 47830/100000: episode: 1512, duration: 0.030s, episode steps:  62, steps per second: 2046, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 47862/100000: episode: 1513, duration: 0.016s, episode steps:  32, steps per second: 1981, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.438 [0.000, 1.000],  mean_best_reward: --
 47874/100000: episode: 1514, duration: 0.006s, episode steps:  12, steps per second: 1875, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.417 [0.000, 1.000],  mean_best_reward: --
 47914/100000: episode: 1515, duration: 0.019s, episode steps:  40, steps per second: 2143, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], me

 49503/100000: episode: 1558, duration: 0.021s, episode steps:  42, steps per second: 1976, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 49532/100000: episode: 1559, duration: 0.014s, episode steps:  29, steps per second: 2007, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 49568/100000: episode: 1560, duration: 0.018s, episode steps:  36, steps per second: 2017, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 49582/100000: episode: 1561, duration: 0.007s, episode steps:  14, steps per second: 1894, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 49626/100000: episode: 1562, duration: 0.021s, episode steps:  44, steps per second: 2093, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], me

 51216/100000: episode: 1610, duration: 0.011s, episode steps:  20, steps per second: 1903, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 51241/100000: episode: 1611, duration: 0.013s, episode steps:  25, steps per second: 1919, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 51262/100000: episode: 1612, duration: 0.010s, episode steps:  21, steps per second: 2068, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 51302/100000: episode: 1613, duration: 0.019s, episode steps:  40, steps per second: 2121, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 51329/100000: episode: 1614, duration: 0.014s, episode steps:  27, steps per second: 1926, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], me

 52500/100000: episode: 1649, duration: 0.007s, episode steps:  12, steps per second: 1730, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 52519/100000: episode: 1650, duration: 0.011s, episode steps:  19, steps per second: 1809, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  mean_best_reward: --
 52548/100000: episode: 1651, duration: 0.015s, episode steps:  29, steps per second: 1929, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.448 [0.000, 1.000],  mean_best_reward: 63.000000
 52568/100000: episode: 1652, duration: 0.010s, episode steps:  20, steps per second: 2007, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 52631/100000: episode: 1653, duration: 0.029s, episode steps:  63, steps per second: 2164, episode reward: 63.000, mean reward:  1.000 [ 1.000,  1.0

 53732/100000: episode: 1692, duration: 0.013s, episode steps:  20, steps per second: 1541, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 53752/100000: episode: 1693, duration: 0.011s, episode steps:  20, steps per second: 1847, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 53773/100000: episode: 1694, duration: 0.011s, episode steps:  21, steps per second: 1983, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 53795/100000: episode: 1695, duration: 0.012s, episode steps:  22, steps per second: 1907, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 53814/100000: episode: 1696, duration: 0.010s, episode steps:  19, steps per second: 1999, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], me

 55451/100000: episode: 1742, duration: 0.015s, episode steps:  30, steps per second: 1966, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 55496/100000: episode: 1743, duration: 0.023s, episode steps:  45, steps per second: 1973, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 55516/100000: episode: 1744, duration: 0.010s, episode steps:  20, steps per second: 2076, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 55557/100000: episode: 1745, duration: 0.019s, episode steps:  41, steps per second: 2135, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 55596/100000: episode: 1746, duration: 0.019s, episode steps:  39, steps per second: 2097, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], me

 57122/100000: episode: 1790, duration: 0.011s, episode steps:  20, steps per second: 1842, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 57153/100000: episode: 1791, duration: 0.016s, episode steps:  31, steps per second: 1948, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 57194/100000: episode: 1792, duration: 0.019s, episode steps:  41, steps per second: 2112, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.537 [0.000, 1.000],  mean_best_reward: --
 57210/100000: episode: 1793, duration: 0.008s, episode steps:  16, steps per second: 1888, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 57266/100000: episode: 1794, duration: 0.026s, episode steps:  56, steps per second: 2174, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], me

 58959/100000: episode: 1832, duration: 0.052s, episode steps: 106, steps per second: 2043, episode reward: 106.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.509 [0.000, 1.000],  mean_best_reward: --
 59002/100000: episode: 1833, duration: 0.020s, episode steps:  43, steps per second: 2157, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 59045/100000: episode: 1834, duration: 0.021s, episode steps:  43, steps per second: 2005, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 59060/100000: episode: 1835, duration: 0.008s, episode steps:  15, steps per second: 1849, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 59117/100000: episode: 1836, duration: 0.027s, episode steps:  57, steps per second: 2075, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], m

 61174/100000: episode: 1881, duration: 0.016s, episode steps:  30, steps per second: 1918, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 61187/100000: episode: 1882, duration: 0.007s, episode steps:  13, steps per second: 1745, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.615 [0.000, 1.000],  mean_best_reward: --
 61218/100000: episode: 1883, duration: 0.016s, episode steps:  31, steps per second: 1970, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.452 [0.000, 1.000],  mean_best_reward: --
 61232/100000: episode: 1884, duration: 0.008s, episode steps:  14, steps per second: 1840, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.286 [0.000, 1.000],  mean_best_reward: --
 61288/100000: episode: 1885, duration: 0.028s, episode steps:  56, steps per second: 1994, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], me

 62896/100000: episode: 1928, duration: 0.010s, episode steps:  19, steps per second: 1822, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 62973/100000: episode: 1929, duration: 0.037s, episode steps:  77, steps per second: 2069, episode reward: 77.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 63005/100000: episode: 1930, duration: 0.015s, episode steps:  32, steps per second: 2066, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 63073/100000: episode: 1931, duration: 0.032s, episode steps:  68, steps per second: 2114, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 63090/100000: episode: 1932, duration: 0.008s, episode steps:  17, steps per second: 2016, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], me

 64607/100000: episode: 1972, duration: 0.024s, episode steps:  50, steps per second: 2092, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 64677/100000: episode: 1973, duration: 0.035s, episode steps:  70, steps per second: 2011, episode reward: 70.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 64727/100000: episode: 1974, duration: 0.024s, episode steps:  50, steps per second: 2121, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 64777/100000: episode: 1975, duration: 0.024s, episode steps:  50, steps per second: 2068, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 64803/100000: episode: 1976, duration: 0.013s, episode steps:  26, steps per second: 2056, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], me

 66776/100000: episode: 2019, duration: 0.016s, episode steps:  30, steps per second: 1922, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 66808/100000: episode: 2020, duration: 0.016s, episode steps:  32, steps per second: 2006, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 66829/100000: episode: 2021, duration: 0.011s, episode steps:  21, steps per second: 1912, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 66863/100000: episode: 2022, duration: 0.017s, episode steps:  34, steps per second: 1972, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 66930/100000: episode: 2023, duration: 0.032s, episode steps:  67, steps per second: 2121, episode reward: 67.000, mean reward:  1.000 [ 1.000,  1.000], me

 68550/100000: episode: 2067, duration: 0.030s, episode steps:  60, steps per second: 2024, episode reward: 60.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 68591/100000: episode: 2068, duration: 0.020s, episode steps:  41, steps per second: 2059, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 68632/100000: episode: 2069, duration: 0.020s, episode steps:  41, steps per second: 2078, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 68736/100000: episode: 2070, duration: 0.049s, episode steps: 104, steps per second: 2115, episode reward: 104.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 68788/100000: episode: 2071, duration: 0.026s, episode steps:  52, steps per second: 1998, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], m

 70287/100000: episode: 2111, duration: 0.021s, episode steps:  39, steps per second: 1885, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 70322/100000: episode: 2112, duration: 0.017s, episode steps:  35, steps per second: 2036, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 70350/100000: episode: 2113, duration: 0.014s, episode steps:  28, steps per second: 1989, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 70417/100000: episode: 2114, duration: 0.031s, episode steps:  67, steps per second: 2187, episode reward: 67.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.507 [0.000, 1.000],  mean_best_reward: --
 70441/100000: episode: 2115, duration: 0.012s, episode steps:  24, steps per second: 2086, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], me

 72118/100000: episode: 2156, duration: 0.023s, episode steps:  46, steps per second: 1975, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 72187/100000: episode: 2157, duration: 0.032s, episode steps:  69, steps per second: 2142, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 72234/100000: episode: 2158, duration: 0.022s, episode steps:  47, steps per second: 2123, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 72269/100000: episode: 2159, duration: 0.017s, episode steps:  35, steps per second: 2099, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 72311/100000: episode: 2160, duration: 0.019s, episode steps:  42, steps per second: 2175, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], me

 73887/100000: episode: 2199, duration: 0.022s, episode steps:  46, steps per second: 2113, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 73947/100000: episode: 2200, duration: 0.030s, episode steps:  60, steps per second: 2010, episode reward: 60.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 74002/100000: episode: 2201, duration: 0.026s, episode steps:  55, steps per second: 2078, episode reward: 55.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.509 [0.000, 1.000],  mean_best_reward: 93.000000
 74018/100000: episode: 2202, duration: 0.008s, episode steps:  16, steps per second: 1970, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.688 [0.000, 1.000],  mean_best_reward: --
 74034/100000: episode: 2203, duration: 0.010s, episode steps:  16, steps per second: 1604, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.0

 75589/100000: episode: 2247, duration: 0.012s, episode steps:  22, steps per second: 1842, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 75637/100000: episode: 2248, duration: 0.023s, episode steps:  48, steps per second: 2103, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.479 [0.000, 1.000],  mean_best_reward: --
 75658/100000: episode: 2249, duration: 0.011s, episode steps:  21, steps per second: 1892, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 75675/100000: episode: 2250, duration: 0.009s, episode steps:  17, steps per second: 1999, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 75690/100000: episode: 2251, duration: 0.009s, episode steps:  15, steps per second: 1690, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], me

 77358/100000: episode: 2292, duration: 0.025s, episode steps:  50, steps per second: 2007, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 77369/100000: episode: 2293, duration: 0.006s, episode steps:  11, steps per second: 1699, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 77425/100000: episode: 2294, duration: 0.027s, episode steps:  56, steps per second: 2106, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.482 [0.000, 1.000],  mean_best_reward: --
 77467/100000: episode: 2295, duration: 0.020s, episode steps:  42, steps per second: 2104, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 77503/100000: episode: 2296, duration: 0.017s, episode steps:  36, steps per second: 2064, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], me

 79075/100000: episode: 2336, duration: 0.020s, episode steps:  40, steps per second: 2013, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 79143/100000: episode: 2337, duration: 0.033s, episode steps:  68, steps per second: 2091, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
 79177/100000: episode: 2338, duration: 0.018s, episode steps:  34, steps per second: 1916, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 79230/100000: episode: 2339, duration: 0.025s, episode steps:  53, steps per second: 2121, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 79268/100000: episode: 2340, duration: 0.018s, episode steps:  38, steps per second: 2140, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], me

 80905/100000: episode: 2383, duration: 0.023s, episode steps:  47, steps per second: 2001, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.468 [0.000, 1.000],  mean_best_reward: --
 80931/100000: episode: 2384, duration: 0.014s, episode steps:  26, steps per second: 1844, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.423 [0.000, 1.000],  mean_best_reward: --
 80964/100000: episode: 2385, duration: 0.017s, episode steps:  33, steps per second: 1974, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.515 [0.000, 1.000],  mean_best_reward: --
 81004/100000: episode: 2386, duration: 0.019s, episode steps:  40, steps per second: 2090, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 81024/100000: episode: 2387, duration: 0.011s, episode steps:  20, steps per second: 1900, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], me

 82694/100000: episode: 2433, duration: 0.013s, episode steps:  25, steps per second: 1856, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.560 [0.000, 1.000],  mean_best_reward: --
 82709/100000: episode: 2434, duration: 0.009s, episode steps:  15, steps per second: 1647, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
 82735/100000: episode: 2435, duration: 0.013s, episode steps:  26, steps per second: 1974, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 82758/100000: episode: 2436, duration: 0.011s, episode steps:  23, steps per second: 2085, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 82808/100000: episode: 2437, duration: 0.024s, episode steps:  50, steps per second: 2106, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], me

 84086/100000: episode: 2474, duration: 0.042s, episode steps:  84, steps per second: 1992, episode reward: 84.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 84111/100000: episode: 2475, duration: 0.013s, episode steps:  25, steps per second: 1915, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 84127/100000: episode: 2476, duration: 0.008s, episode steps:  16, steps per second: 1961, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 84158/100000: episode: 2477, duration: 0.017s, episode steps:  31, steps per second: 1837, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.452 [0.000, 1.000],  mean_best_reward: --
 84194/100000: episode: 2478, duration: 0.017s, episode steps:  36, steps per second: 2154, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], me

 85836/100000: episode: 2522, duration: 0.012s, episode steps:  25, steps per second: 2010, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 85854/100000: episode: 2523, duration: 0.010s, episode steps:  18, steps per second: 1784, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.444 [0.000, 1.000],  mean_best_reward: --
 85894/100000: episode: 2524, duration: 0.020s, episode steps:  40, steps per second: 2031, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 85914/100000: episode: 2525, duration: 0.010s, episode steps:  20, steps per second: 1953, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 85967/100000: episode: 2526, duration: 0.025s, episode steps:  53, steps per second: 2152, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], me

 87179/100000: episode: 2564, duration: 0.022s, episode steps:  44, steps per second: 2017, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.455 [0.000, 1.000],  mean_best_reward: --
 87198/100000: episode: 2565, duration: 0.010s, episode steps:  19, steps per second: 1868, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 87251/100000: episode: 2566, duration: 0.025s, episode steps:  53, steps per second: 2132, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.509 [0.000, 1.000],  mean_best_reward: --
 87277/100000: episode: 2567, duration: 0.013s, episode steps:  26, steps per second: 2072, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 87304/100000: episode: 2568, duration: 0.013s, episode steps:  27, steps per second: 2056, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], me

 88419/100000: episode: 2604, duration: 0.017s, episode steps:  33, steps per second: 1969, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
 88472/100000: episode: 2605, duration: 0.026s, episode steps:  53, steps per second: 2047, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 88503/100000: episode: 2606, duration: 0.015s, episode steps:  31, steps per second: 2083, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 88532/100000: episode: 2607, duration: 0.015s, episode steps:  29, steps per second: 1939, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.483 [0.000, 1.000],  mean_best_reward: --
 88605/100000: episode: 2608, duration: 0.035s, episode steps:  73, steps per second: 2088, episode reward: 73.000, mean reward:  1.000 [ 1.000,  1.000], me

 90126/100000: episode: 2653, duration: 0.035s, episode steps:  65, steps per second: 1875, episode reward: 65.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 90143/100000: episode: 2654, duration: 0.009s, episode steps:  17, steps per second: 1872, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 90192/100000: episode: 2655, duration: 0.023s, episode steps:  49, steps per second: 2143, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 90254/100000: episode: 2656, duration: 0.030s, episode steps:  62, steps per second: 2101, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 90285/100000: episode: 2657, duration: 0.015s, episode steps:  31, steps per second: 2104, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 91872/100000: episode: 2695, duration: 0.031s, episode steps:  68, steps per second: 2164, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 91968/100000: episode: 2696, duration: 0.045s, episode steps:  96, steps per second: 2119, episode reward: 96.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 92016/100000: episode: 2697, duration: 0.024s, episode steps:  48, steps per second: 2032, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.521 [0.000, 1.000],  mean_best_reward: --
 92039/100000: episode: 2698, duration: 0.012s, episode steps:  23, steps per second: 2000, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.435 [0.000, 1.000],  mean_best_reward: --
 92134/100000: episode: 2699, duration: 0.043s, episode steps:  95, steps per second: 2192, episode reward: 95.000, mean reward:  1.000 [ 1.000,  1.000], me

 93750/100000: episode: 2740, duration: 0.033s, episode steps:  66, steps per second: 2006, episode reward: 66.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 93792/100000: episode: 2741, duration: 0.020s, episode steps:  42, steps per second: 2104, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 93870/100000: episode: 2742, duration: 0.036s, episode steps:  78, steps per second: 2174, episode reward: 78.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.487 [0.000, 1.000],  mean_best_reward: --
 93913/100000: episode: 2743, duration: 0.020s, episode steps:  43, steps per second: 2178, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 93981/100000: episode: 2744, duration: 0.032s, episode steps:  68, steps per second: 2134, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], me

 95570/100000: episode: 2786, duration: 0.063s, episode steps: 136, steps per second: 2152, episode reward: 136.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 95588/100000: episode: 2787, duration: 0.010s, episode steps:  18, steps per second: 1894, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.444 [0.000, 1.000],  mean_best_reward: --
 95608/100000: episode: 2788, duration: 0.010s, episode steps:  20, steps per second: 1969, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 95627/100000: episode: 2789, duration: 0.009s, episode steps:  19, steps per second: 2014, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 95651/100000: episode: 2790, duration: 0.012s, episode steps:  24, steps per second: 1993, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], m

 97339/100000: episode: 2833, duration: 0.037s, episode steps:  78, steps per second: 2092, episode reward: 78.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 97370/100000: episode: 2834, duration: 0.015s, episode steps:  31, steps per second: 2001, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 97411/100000: episode: 2835, duration: 0.019s, episode steps:  41, steps per second: 2118, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 97436/100000: episode: 2836, duration: 0.014s, episode steps:  25, steps per second: 1820, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.560 [0.000, 1.000],  mean_best_reward: --
 97464/100000: episode: 2837, duration: 0.013s, episode steps:  28, steps per second: 2075, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], me

 99490/100000: episode: 2881, duration: 0.016s, episode steps:  34, steps per second: 2141, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 99507/100000: episode: 2882, duration: 0.010s, episode steps:  17, steps per second: 1665, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 99604/100000: episode: 2883, duration: 0.045s, episode steps:  97, steps per second: 2159, episode reward: 97.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.495 [0.000, 1.000],  mean_best_reward: --
 99646/100000: episode: 2884, duration: 0.020s, episode steps:  42, steps per second: 2065, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 99685/100000: episode: 2885, duration: 0.018s, episode steps:  39, steps per second: 2143, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], me

<tensorflow.python.keras.callbacks.History at 0x631b3f2e8>
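A log this long is hard to judge by eye. The following is a minimal sketch, not part of the original notebook, that pulls the episode number and episode reward out of log lines with a regular expression and prints a short summary. The names `LOG`, `PATTERN`, and `parse_episode_rewards` are hypothetical helpers introduced for illustration. The sample lines are copied from the end of the log above, with the trailing fields trimmed.

```python
import re
from statistics import mean

# A few lines copied from the training log above (trailing fields trimmed).
LOG = """\
 99490/100000: episode: 2881, duration: 0.016s, episode steps:  34, steps per second: 2141, episode reward: 34.000
 99507/100000: episode: 2882, duration: 0.010s, episode steps:  17, steps per second: 1665, episode reward: 17.000
 99604/100000: episode: 2883, duration: 0.045s, episode steps:  97, steps per second: 2159, episode reward: 97.000
 99646/100000: episode: 2884, duration: 0.020s, episode steps:  42, steps per second: 2065, episode reward: 42.000
 99685/100000: episode: 2885, duration: 0.018s, episode steps:  39, steps per second: 2143, episode reward: 39.000
"""

# Hypothetical pattern: captures the episode number and the episode reward.
PATTERN = re.compile(r"episode:\s*(\d+),.*?episode reward:\s*([-\d.]+)")


def parse_episode_rewards(log_text):
    """Return a list of (episode, reward) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:  # skip blank or unrelated lines
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


pairs = parse_episode_rewards(LOG)
rewards = [reward for _, reward in pairs]
print(f"episodes parsed: {len(pairs)}")
print(f"mean reward: {mean(rewards):.1f}, best: {max(rewards):.0f}")
```

Because every step in this log earns a reward of 1, the episode reward equals the number of steps the episode lasted, so the mean reward is also the mean episode length.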