# Applied Deep Learning Tutorial 
# Deep Reinforcement Learning with Deep-Q-Network (DQN)

## Introduction
In this tutorial, you will implement a Deep Q-Network (DQN) that solves a classic control task. The approach builds on the DeepMind paper [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), which first introduced the notion of a Deep Q-Network.
<img src="graphics/atari_play.png" width="700"><br>
<center> Fig. 1: Breakout environment of the Atari game </center>

## Core idea
As you probably remember from the lecture, an agent can learn a policy by trial and error and store what it has learned in a Q-matrix. In a DQN, a deep neural network approximates this Q-matrix. After training, it gives us an estimate of the expected (discounted) reward when taking action a in state s: Q(s, a).
Playing the action with the maximum Q-value in any given state is the same as playing optimally with respect to the learned estimates, i.e. following a full exploitation strategy.
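
As a minimal sketch of full exploitation, the snippet below picks the greedy action from a vector of Q-values. The numbers in `q_values` are made up for illustration and do not come from a trained network.

```python
import numpy as np

# Hypothetical Q-values for one state and four actions (illustrative numbers only).
q_values = np.array([-3.2, -1.5, -0.7, -2.1])

# Full exploitation: choose the index of the action with the largest Q-value.
greedy_action = int(np.argmax(q_values))
print(greedy_action)  # 2
```

Note that `np.argmax` returns the index of the best action, while `np.amax` would return the Q-value itself.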

## OpenAI Gym
[OpenAI Gym](https://gym.openai.com/docs/) is a library that can simulate a large number of reinforcement learning environments, including Atari games (these need to be installed additionally). You will need Python 3.5+.

>pip install gym


## Taking our pendulum on a first ride
Now that you have gym installed you can load the 'Pendulum-v0' environment. Note that this is a classic control environment, not an Atari game.


In [50]:
# Import the gym module
import gym


In [51]:
# Load the environment
env = gym.make('Pendulum-v0')

# Reset the environment; this returns the initial observation
observation = env.reset()

for _ in range(100000):
    # Perform a random action; step() returns the new observation, the reward,
    # whether the episode is over, and additional info
    action = env.action_space.sample()
    observation, reward, is_done, info = env.step(action)
    print('observation: ', observation, 'reward: ', reward)
    if is_done: break
    env.render()

env.close()


observation:  [ 0.86493428 -0.50188513 -0.55901944] reward:  -0.2527977999118114
observation:  [ 0.83464549 -0.55078753 -1.15061163] reward:  -0.3097494877120603
observation:  [ 0.79602129 -0.60526862 -1.3359158 ] reward:  -0.47494441788282654
observation:  [ 0.73525871 -0.67778656 -1.89289474] reward:  -0.601573143554305
observation:  [ 0.65985849 -0.7513899  -2.10835792] reward:  -0.9167669661274144
observation:  [ 0.55508021 -0.83179683 -2.64342032] reward:  -1.1673354431915164
observation:  [ 0.42471827 -0.90532557 -2.99617455] reward:  -1.6670192404112978
observation:  [ 0.23821938 -0.97121137 -3.96237256] reward:  -2.1831260346424
observation:  [-1.94282752e-03 -9.99998113e-01 -4.84949737e+00] reward:  -3.3407622952310305
observation:  [-0.27963141 -0.96010743 -5.62934702] reward:  -4.82531052925543
observation:  [-0.56235717 -0.82689444 -6.27646994] reward:  -6.607273120874804
observation:  [-0.80020766 -0.59972301 -6.60815458] reward:  -8.643460447705376
observation:  [-0.95592

observation:  [0.99970808 0.02416083 3.36359403] reward:  -1.2898567693292018
observation:  [0.98014441 0.19828498 3.50889325] reward:  -1.1326792024098062
observation:  [0.9327887  0.3604237  3.38228462] reward:  -1.2744454743618234
observation:  [0.84988949 0.52696097 3.72597421] reward:  -1.280180157183985
observation:  [0.72313105 0.69071085 4.14901644] reward:  -1.696370860934642
observation:  [0.53195302 0.84677387 4.94838772] reward:  -2.306314529037839
observation:  [0.27214774 0.96225548 5.70562034] reward:  -3.4691970886978196
observation:  [-0.04112822  0.99915388  6.33528277] reward:  -4.933257309238082
observation:  [-0.38488876  0.92296297  7.07894933] reward:  -6.6119203883280155
observation:  [-0.70704555  0.70716801  7.80448563] reward:  -8.875900099297148
observation:  [-0.9330223   0.35981856  8.        ] reward:  -11.642251971799386
observation:  [-0.99945102 -0.03313097  8.        ] reward:  -14.09509100682671
observation:  [-0.90440625 -0.42667239  8.        ] rew

This already looks nice, but the actions are random. It is time to better understand our environment and to implement our Deep Q-Network.


In [135]:
# import the necessary libraries
import gym
import gym.spaces
import gym.wrappers
import numpy as np
import random
import pickle
from collections import deque
from keras.layers import Flatten, Dense
from keras import backend as K
from keras.models import Sequential, Model, load_model
from keras import optimizers

## Observation
The observation is made up of cos(theta), sin(theta) and theta dot (the angular velocity, limited to the range -8 to 8).
Theta is normalized between -pi and pi.
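
Because the observation contains cos(theta) and sin(theta) rather than theta itself, the angle can be recovered with `arctan2` if it is needed, for example for logging. The helper name `angle_from_observation` below is hypothetical and not part of gym; this is only a sketch.

```python
import numpy as np

def angle_from_observation(observation):
    """Recover theta (radians, within [-pi, pi]) from [cos(theta), sin(theta), theta_dot]."""
    cos_theta, sin_theta, _theta_dot = observation
    return float(np.arctan2(sin_theta, cos_theta))

# Illustrative observation: cos(theta) = 0, sin(theta) = 1, so theta = pi / 2.
obs = np.array([0.0, 1.0, 0.5])
theta = angle_from_observation(obs)
```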

## Action
The action is the joint effort (torque), a continuous value between -2.0 and +2.0.
Write a function to discretize the continuous action space of the joint effort.


In [136]:
# define the action space
def create_action_bins(num_action_bins):
    actionbins = np.linspace(-2.0, 2.0, num_action_bins)
    
    return actionbins

# depending on the action, find the according actionbin 
# discretization of the continuous action space
def find_actionbin(action, actionbins):
    idx = (np.abs(actionbins - action)).argmin()

    return idx
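
To see how the discretization behaves, here is a small usage sketch. The two functions are repeated from the cell above so the snippet runs on its own. The bin count of 5 and the continuous action 0.3 are arbitrary illustrative choices.

```python
import numpy as np

def create_action_bins(num_action_bins):
    # Evenly spaced torques over the valid range [-2.0, 2.0].
    return np.linspace(-2.0, 2.0, num_action_bins)

def find_actionbin(action, actionbins):
    # Index of the bin whose value is closest to the continuous action.
    return (np.abs(actionbins - action)).argmin()

actionbins = create_action_bins(5)           # [-2., -1., 0., 1., 2.]
idx = int(find_actionbin(0.3, actionbins))   # nearest bin to 0.3 is 0.0
chosen = float(actionbins[idx])
print(idx, chosen)  # 2 0.0
```

One design point to keep in mind: with an even number of bins, such as 20, `np.linspace(-2.0, 2.0, 20)` contains no bin at exactly 0, so the agent cannot apply zero torque. An odd bin count includes 0.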

## Reward
The reward is defined as
> -(theta^2 + 0.1 x theta_dt^2 + 0.001 x action^2)

What is the lowest possible reward (the highest cost)? And what is the highest possible reward (the lowest cost)?

Lowest reward: -(pi^2 + 0.1 x 8^2 + 0.001 x 2^2) = -16.2736044

Highest reward: -(0^2 + 0.1 x 0^2 + 0.001 x 0^2) = 0

From this reward function, what is the goal of the agent?
In essence, the goal is to remain at zero angle (vertical), with the least rotational velocity, and the least effort.

For a hint have a look at the [wiki](https://github.com/openai/gym/wiki).
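
The reward bounds can be checked numerically. The function `pendulum_reward` below is a hypothetical helper that simply evaluates the reward formula given in this section. It is not a gym API.

```python
import math

def pendulum_reward(theta, theta_dot, action):
    """Evaluate -(theta^2 + 0.1 * theta_dot^2 + 0.001 * action^2)."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * action ** 2)

# Worst case: maximum angle, maximum angular velocity, maximum torque.
worst = pendulum_reward(math.pi, 8.0, 2.0)
# Best case: upright, at rest, no torque applied.
best = pendulum_reward(0.0, 0.0, 0.0)
```

Here `worst` evaluates to about -16.27 and `best` to zero, which matches the hand calculation.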

In [137]:
def train_model(memory, gamma=0.9):
    for state, action, reward, state_new in memory:
        
        # flatten state to make it compatible to our neural network
        flat_state_new = np.reshape(state_new, [1, 3])
        flat_state = np.reshape(state, [1, 3])

        # Q-learning target: observed reward plus the discounted best
        # predicted Q-value in the next state s'
        target = reward + gamma * np.amax(model.predict(flat_state_new))
        # current Q-value predictions for all action bins in state s
        targetfull = model.predict(flat_state)
        
        # overwrite the Q-value of the action actually taken with the target
        targetfull[0][action] = target
        
        # fit the model towards the updated Q-values
        model.fit(flat_state, targetfull, epochs=1, verbose=0)

## Deep Q Model

As a reminder, this is our Q function.
> Q(s, a) = r + gamma max_a'(Q(s', a'))

The input of our neural network, our generalizable Q-matrix, is the observation, i.e. the state s of the pendulum. The output is one estimate Q(s, a) per action a. Here r is the reward observed after taking action a in state s, s' is the resulting next state, and gamma is the discount factor applied to the predicted value of that next state.
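
A small worked example of the target on the right-hand side of the Q function may help. The helper `td_target` and all numbers below are illustrative assumptions. In the tutorial, the next-state Q-values would come from the network's prediction.

```python
import numpy as np

def td_target(reward, q_next, gamma=0.9):
    """Return r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * float(np.max(q_next))

# Hypothetical Q-values predicted for the next state s' (three action bins).
q_next = np.array([-4.0, -2.0, -3.0])

# Observed reward -1.0 plus 0.9 times the best next Q-value (-2.0).
target = td_target(-1.0, q_next, gamma=0.9)
```

With these numbers the target is -1.0 + 0.9 x (-2.0) = -2.8.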

For our first network we will implement a DQN with keras:

- Input layer with 3 inputs (the observation)
- Hidden layer with 128 ReLU units
- Hidden layer with 64 ReLU units
- Output layer with one output per action bin and linear activation function
- Adam optimizer with learning rate 0.0002, beta_1 0.9 and beta_2 0.999
- Mean squared error loss

In [138]:
# Define the Deep-Q-Network in keras

def build_model(num_output_nodes):
    model = Sequential()

    model.add(Dense(128, input_shape=(3,), activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(num_output_nodes, activation='linear'))

    adam = optimizers.Adam(lr=0.0002, beta_1=0.9, beta_2=0.999)
    model.compile(loss='mse', optimizer=adam)

    return model

In [145]:
def run_episodes(epsilon, gamma, training_iterations, sequence_iterations):
    
    epsilon_decay = 0.9999
    epsilon_min = 0.02
    steps_per_sequence = 250

    for epoch in range(0, training_iterations // sequence_iterations):
        for sequence_id in range(0, sequence_iterations):
            state = env.reset()
            memory = deque()
            
            total_reward = 0
            
            # Easy implementation of decaying exploration
            if epsilon > epsilon_min:
                epsilon = epsilon * epsilon_decay
            
            for i in range(0, steps_per_sequence):
                    
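                # exploration: sample a random continuous action
                if np.random.uniform() < epsilon:
                    action = env.action_space.sample()
                # exploitation: take the action bin with the highest predicted Q-value
                # (argmax gives the bin index; amax would give the Q-value itself)
                else:
                    flat_state = np.reshape(state, [1, 3])
                    action = actionbinslist[np.argmax(model.predict(flat_state))]

                # map the action to its nearest bin and execute the bin's value
                actionbin = find_actionbin(action, actionbinslist)
                action = actionbinslist[actionbin]
                action = np.array([action])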

                # emulate the action in the simulation and observe the transition 
                # as well as the reward
                observation, reward, done, _ = env.step(action)
                total_reward += reward

                state_new = observation

                # save the transition into memory
                memory.append((state, actionbin, reward, state_new))

                state = state_new
                
            # train the model on the transitions collected in this sequence
            train_model(memory, gamma)
            
            print(epoch , ' epoch', sequence_id, ' sequence. Average reward = ', total_reward / steps_per_sequence, '. epsilon = ', epsilon)
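
To get a feeling for how slowly exploration decays with `epsilon_decay = 0.9999`, the sketch below evaluates the schedule in closed form. The helper `epsilon_after` is hypothetical. It clamps at `epsilon_min`, which is a slight simplification of the `if epsilon > epsilon_min` check used above.

```python
import math

epsilon_start = 1.0
epsilon_decay = 0.9999
epsilon_min = 0.02

def epsilon_after(sequences):
    """Epsilon after a given number of sequences, clamped at epsilon_min."""
    return max(epsilon_min, epsilon_start * epsilon_decay ** sequences)

# Number of sequences needed until epsilon reaches epsilon_min.
sequences_to_min = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
)

print(epsilon_after(1000))   # still above 0.9 after 1000 sequences
print(sequences_to_min)      # on the order of 39,000 sequences
```

With these settings the agent still acts almost entirely at random after 1000 sequences, so a faster decay or a much longer training run is needed before exploitation dominates.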

           

## Function for running the policy of our DQN after loading or training


In [146]:
def play_game(rounds):
    state = env.reset()
    totalreward = 0

    for _ in range(0, rounds):
        # Rendering for visualization
        env.render()

        flat_state = np.reshape(state, [1, 3])
        actionbin = np.argmax(model.predict(flat_state))

        action = actionbinslist[actionbin]
        action = np.array([action])

        observation, reward, done, _ = env.step(action)

        totalreward += reward

        state_new = observation
        state = state_new
        
    return totalreward

## Train the DQN


In [None]:
env = gym.make('Pendulum-v0')


# iterations
training_iterations = 1000
sequence_iterations = 25

# epsilon (setting exploitation vs exploration)
epsilon = 1

# gamma (importance of predicted estimated reward)
gamma = 0.9

# Discretization settings for the action space
num_action_bins = 20
actionbinslist = create_action_bins(num_action_bins)



# Build model
model = build_model(num_action_bins)

run_episodes(epsilon, gamma, training_iterations, sequence_iterations)

   



0  epoch 0  sequence. Average reward =  -6.1693816682624405 . epsilon =  0.9999
0  epoch 1  sequence. Average reward =  -6.103361849396182 . epsilon =  0.9998000100000001
0  epoch 2  sequence. Average reward =  -7.4376313848445115 . epsilon =  0.9997000299990001
0  epoch 3  sequence. Average reward =  -4.804119388389292 . epsilon =  0.9996000599960002
0  epoch 4  sequence. Average reward =  -5.238030700912224 . epsilon =  0.9995000999900007
0  epoch 5  sequence. Average reward =  -6.103151804757043 . epsilon =  0.9994001499800017
0  epoch 6  sequence. Average reward =  -7.2355087385584245 . epsilon =  0.9993002099650037
0  epoch 7  sequence. Average reward =  -6.091206494715181 . epsilon =  0.9992002799440072
0  epoch 8  sequence. Average reward =  -4.50672234640893 . epsilon =  0.9991003599160128
0  epoch 9  sequence. Average reward =  -6.0409351612612 . epsilon =  0.9990004498800211
0  epoch 10  sequence. Average reward =  -6.144442319953128 . epsilon =  0.9989005498350332
0  epoch 1

3  epoch 15  sequence. Average reward =  -5.035032888284185 . epsilon =  0.9909408287818039
3  epoch 16  sequence. Average reward =  -8.774774093637603 . epsilon =  0.9908417346989258
3  epoch 17  sequence. Average reward =  -6.596894512713364 . epsilon =  0.9907426505254558
3  epoch 18  sequence. Average reward =  -8.702508151872454 . epsilon =  0.9906435762604033
3  epoch 19  sequence. Average reward =  -7.5386949864379975 . epsilon =  0.9905445119027773
3  epoch 20  sequence. Average reward =  -5.64687164224414 . epsilon =  0.990445457451587
3  epoch 21  sequence. Average reward =  -6.213091834810061 . epsilon =  0.9903464129058418
3  epoch 22  sequence. Average reward =  -6.269378891987121 . epsilon =  0.9902473782645512
3  epoch 23  sequence. Average reward =  -6.148349646770915 . epsilon =  0.9901483535267248
3  epoch 24  sequence. Average reward =  -6.039287694925221 . epsilon =  0.9900493386913721
4  epoch 0  sequence. Average reward =  -7.8429714472451835 . epsilon =  0.989950

7  epoch 5  sequence. Average reward =  -5.178568457756237 . epsilon =  0.982061932340002
7  epoch 6  sequence. Average reward =  -5.0946314420730605 . epsilon =  0.9819637261467681
7  epoch 7  sequence. Average reward =  -4.635768419783698 . epsilon =  0.9818655297741534
7  epoch 8  sequence. Average reward =  -7.726272129654802 . epsilon =  0.981767343221176
7  epoch 9  sequence. Average reward =  -8.224994474098535 . epsilon =  0.9816691664868539
7  epoch 10  sequence. Average reward =  -4.381272065192495 . epsilon =  0.9815709995702052
7  epoch 11  sequence. Average reward =  -5.703149011562138 . epsilon =  0.9814728424702482
7  epoch 12  sequence. Average reward =  -6.4138739981201365 . epsilon =  0.9813746951860012
7  epoch 13  sequence. Average reward =  -8.234966890354427 . epsilon =  0.9812765577164826
7  epoch 14  sequence. Average reward =  -7.9809076750995285 . epsilon =  0.9811784300607109
7  epoch 15  sequence. Average reward =  -7.362387144720857 . epsilon =  0.981080312

10  epoch 20  sequence. Average reward =  -4.379460374934107 . epsilon =  0.9732625914072018
10  epoch 21  sequence. Average reward =  -6.598131330500858 . epsilon =  0.9731652651480611
10  epoch 22  sequence. Average reward =  -6.064414131594992 . epsilon =  0.9730679486215463
10  epoch 23  sequence. Average reward =  -7.146010381489505 . epsilon =  0.9729706418266841
10  epoch 24  sequence. Average reward =  -7.0212686903085695 . epsilon =  0.9728733447625014
11  epoch 0  sequence. Average reward =  -8.669303633755003 . epsilon =  0.9727760574280252
11  epoch 1  sequence. Average reward =  -6.063678440307088 . epsilon =  0.9726787798222825
11  epoch 2  sequence. Average reward =  -4.803067127447964 . epsilon =  0.9725815119443002
11  epoch 3  sequence. Average reward =  -4.626108819585894 . epsilon =  0.9724842537931058
11  epoch 4  sequence. Average reward =  -7.660444336154984 . epsilon =  0.9723870053677265
11  epoch 5  sequence. Average reward =  -4.977998160688636 . epsilon =  0

14  epoch 9  sequence. Average reward =  -8.088396161966543 . epsilon =  0.9646385570163959
14  epoch 10  sequence. Average reward =  -8.513938303419046 . epsilon =  0.9645420931606943
14  epoch 11  sequence. Average reward =  -7.007603416937717 . epsilon =  0.9644456389513782
14  epoch 12  sequence. Average reward =  -8.647486499676175 . epsilon =  0.9643491943874831
14  epoch 13  sequence. Average reward =  -5.861969947346815 . epsilon =  0.9642527594680443
14  epoch 14  sequence. Average reward =  -4.508723747536428 . epsilon =  0.9641563341920976
14  epoch 15  sequence. Average reward =  -7.798023124624614 . epsilon =  0.9640599185586783
14  epoch 16  sequence. Average reward =  -7.977095682991967 . epsilon =  0.9639635125668224
14  epoch 17  sequence. Average reward =  -5.536056071300988 . epsilon =  0.9638671162155658
14  epoch 18  sequence. Average reward =  -4.861401720958721 . epsilon =  0.9637707295039442
14  epoch 19  sequence. Average reward =  -5.282801599218541 . epsilon 

17  epoch 23  sequence. Average reward =  -6.684018879156434 . epsilon =  0.9560909397917597
17  epoch 24  sequence. Average reward =  -7.638143126507069 . epsilon =  0.9559953306977805
18  epoch 0  sequence. Average reward =  -6.616558115326755 . epsilon =  0.9558997311647107
18  epoch 1  sequence. Average reward =  -8.131539908182404 . epsilon =  0.9558041411915943
18  epoch 2  sequence. Average reward =  -6.613853977825596 . epsilon =  0.9557085607774751
18  epoch 3  sequence. Average reward =  -5.3540916797582705 . epsilon =  0.9556129899213974
18  epoch 4  sequence. Average reward =  -7.456419048854893 . epsilon =  0.9555174286224053
18  epoch 5  sequence. Average reward =  -8.350788285159307 . epsilon =  0.955421876879543
18  epoch 6  sequence. Average reward =  -4.838258159231288 . epsilon =  0.9553263346918551
18  epoch 7  sequence. Average reward =  -5.754887616695981 . epsilon =  0.955230802058386
18  epoch 8  sequence. Average reward =  -4.435192053680382 . epsilon =  0.9551

21  epoch 12  sequence. Average reward =  -5.98644026741869 . epsilon =  0.9476190626043499
21  epoch 13  sequence. Average reward =  -5.405820373591107 . epsilon =  0.9475243006980895
21  epoch 14  sequence. Average reward =  -3.9958821620462293 . epsilon =  0.9474295482680196
21  epoch 15  sequence. Average reward =  -7.016443682206228 . epsilon =  0.9473348053131928
21  epoch 16  sequence. Average reward =  -6.745407678726373 . epsilon =  0.9472400718326615
21  epoch 17  sequence. Average reward =  -7.1449551861913125 . epsilon =  0.9471453478254782
21  epoch 18  sequence. Average reward =  -8.828087934016438 . epsilon =  0.9470506332906957
21  epoch 19  sequence. Average reward =  -4.5850641307350815 . epsilon =  0.9469559282273666
21  epoch 20  sequence. Average reward =  -4.4530212941597185 . epsilon =  0.9468612326345439
21  epoch 21  sequence. Average reward =  -8.516185901379107 . epsilon =  0.9467665465112804
21  epoch 22  sequence. Average reward =  -6.695352464418698 . epsi

25  epoch 1  sequence. Average reward =  -4.843388080880756 . epsilon =  0.9392222543252324
25  epoch 2  sequence. Average reward =  -7.532066723037054 . epsilon =  0.9391283320998
25  epoch 3  sequence. Average reward =  -5.255153932832116 . epsilon =  0.93903441926659
25  epoch 4  sequence. Average reward =  -7.51948332199497 . epsilon =  0.9389405158246634
25  epoch 5  sequence. Average reward =  -4.910314919623226 . epsilon =  0.938846621773081
25  epoch 6  sequence. Average reward =  -5.066135636657844 . epsilon =  0.9387527371109037
25  epoch 7  sequence. Average reward =  -6.279556895258972 . epsilon =  0.9386588618371926
25  epoch 8  sequence. Average reward =  -6.840608776587891 . epsilon =  0.9385649959510088
25  epoch 9  sequence. Average reward =  -4.497889356011475 . epsilon =  0.9384711394514138
25  epoch 10  sequence. Average reward =  -3.866882927002844 . epsilon =  0.9383772923374687
25  epoch 11  sequence. Average reward =  -8.047180828065667 . epsilon =  0.9382834546

28  epoch 15  sequence. Average reward =  -6.691923152522143 . epsilon =  0.9308998497723158
28  epoch 16  sequence. Average reward =  -6.524174800823152 . epsilon =  0.9308067597873385
28  epoch 17  sequence. Average reward =  -4.387963079549854 . epsilon =  0.9307136791113598
28  epoch 18  sequence. Average reward =  -5.676836514773319 . epsilon =  0.9306206077434487
28  epoch 19  sequence. Average reward =  -7.634706729248972 . epsilon =  0.9305275456826744
28  epoch 20  sequence. Average reward =  -4.898939263021155 . epsilon =  0.9304344929281061
28  epoch 21  sequence. Average reward =  -7.09994516703015 . epsilon =  0.9303414494788133
28  epoch 22  sequence. Average reward =  -6.561268063982231 . epsilon =  0.9302484153338654
28  epoch 23  sequence. Average reward =  -5.01169887386651 . epsilon =  0.930155390492332
28  epoch 24  sequence. Average reward =  -5.614079317809342 . epsilon =  0.9300623749532828
29  epoch 0  sequence. Average reward =  -8.802739176208558 . epsilon =  

32  epoch 4  sequence. Average reward =  -6.65543449417379 . epsilon =  0.9226511896576549
32  epoch 5  sequence. Average reward =  -7.6654581602518155 . epsilon =  0.9225589245386892
32  epoch 6  sequence. Average reward =  -4.710552418151881 . epsilon =  0.9224666686462353
32  epoch 7  sequence. Average reward =  -8.674456680767761 . epsilon =  0.9223744219793707
32  epoch 8  sequence. Average reward =  -5.6237866872205915 . epsilon =  0.9222821845371728
32  epoch 9  sequence. Average reward =  -6.53048029667966 . epsilon =  0.9221899563187191
32  epoch 10  sequence. Average reward =  -8.086989781923416 . epsilon =  0.9220977373230873
32  epoch 11  sequence. Average reward =  -7.406978269292366 . epsilon =  0.922005527549355
32  epoch 12  sequence. Average reward =  -6.379142821034786 . epsilon =  0.9219133269966001
32  epoch 13  sequence. Average reward =  -5.050629831141887 . epsilon =  0.9218211356639004
32  epoch 14  sequence. Average reward =  -7.550861398403511 . epsilon =  0.9

35  epoch 18  sequence. Average reward =  -7.709334122780835 . epsilon =  0.9144756205352245
35  epoch 19  sequence. Average reward =  -4.703807375820872 . epsilon =  0.914384172973171
35  epoch 20  sequence. Average reward =  -7.170867493154012 . epsilon =  0.9142927345558737
35  epoch 21  sequence. Average reward =  -4.554532841599707 . epsilon =  0.914201305282418
35  epoch 22  sequence. Average reward =  -6.3482879617086425 . epsilon =  0.9141098851518898
35  epoch 23  sequence. Average reward =  -5.359563236025524 . epsilon =  0.9140184741633747
35  epoch 24  sequence. Average reward =  -4.617492275178604 . epsilon =  0.9139270723159584
36  epoch 0  sequence. Average reward =  -4.915180601083846 . epsilon =  0.9138356796087268
36  epoch 1  sequence. Average reward =  -5.163398513301604 . epsilon =  0.9137442960407659
36  epoch 2  sequence. Average reward =  -6.627472574019072 . epsilon =  0.9136529216111619
36  epoch 3  sequence. Average reward =  -4.667294561056605 . epsilon =  0

39  epoch 7  sequence. Average reward =  -4.862692275394958 . epsilon =  0.9063724947491544
39  epoch 8  sequence. Average reward =  -4.977026267198624 . epsilon =  0.9062818574996795
39  epoch 9  sequence. Average reward =  -4.972486741070415 . epsilon =  0.9061912293139295
39  epoch 10  sequence. Average reward =  -7.014677481878989 . epsilon =  0.9061006101909981
39  epoch 11  sequence. Average reward =  -4.527398438122505 . epsilon =  0.906010000129979
39  epoch 12  sequence. Average reward =  -6.712051510436032 . epsilon =  0.905919399129966
39  epoch 13  sequence. Average reward =  -6.189550286943619 . epsilon =  0.905828807190053
39  epoch 14  sequence. Average reward =  -8.560432868325416 . epsilon =  0.905738224309334
39  epoch 15  sequence. Average reward =  -7.8463298244341315 . epsilon =  0.9056476504869031
39  epoch 16  sequence. Average reward =  -5.227000408159071 . epsilon =  0.9055570857218544
39  epoch 17  sequence. Average reward =  -7.745483024513975 . epsilon =  0.

42  epoch 21  sequence. Average reward =  -7.88762637457801 . epsilon =  0.8983411703824229
42  epoch 22  sequence. Average reward =  -7.765236229852451 . epsilon =  0.8982513362653847
42  epoch 23  sequence. Average reward =  -8.16730026890201 . epsilon =  0.8981615111317581
42  epoch 24  sequence. Average reward =  -8.8390711062678 . epsilon =  0.898071694980645
43  epoch 0  sequence. Average reward =  -7.602451127687275 . epsilon =  0.897981887811147
43  epoch 1  sequence. Average reward =  -8.090808861101634 . epsilon =  0.8978920896223659
43  epoch 2  sequence. Average reward =  -4.538041814095906 . epsilon =  0.8978023004134037
43  epoch 3  sequence. Average reward =  -8.607678060941922 . epsilon =  0.8977125201833623
43  epoch 4  sequence. Average reward =  -5.387298409896091 . epsilon =  0.897622748931344
43  epoch 5  sequence. Average reward =  -7.184430441603505 . epsilon =  0.8975329866564509
43  epoch 6  sequence. Average reward =  -8.186284466222297 . epsilon =  0.89744323

46  epoch 10  sequence. Average reward =  -4.589190832935932 . epsilon =  0.8903810112060048
46  epoch 11  sequence. Average reward =  -4.744295664561129 . epsilon =  0.8902919731048842
46  epoch 12  sequence. Average reward =  -4.760225360752386 . epsilon =  0.8902029439075737
46  epoch 13  sequence. Average reward =  -4.577713252932968 . epsilon =  0.890113923613183
46  epoch 14  sequence. Average reward =  -9.059464573384295 . epsilon =  0.8900249122208217
46  epoch 15  sequence. Average reward =  -8.36776150457438 . epsilon =  0.8899359097295996
46  epoch 16  sequence. Average reward =  -6.976016968841335 . epsilon =  0.8898469161386267
46  epoch 17  sequence. Average reward =  -6.31655217942244 . epsilon =  0.8897579314470128
46  epoch 18  sequence. Average reward =  -6.481554373660303 . epsilon =  0.889668955653868
46  epoch 19  sequence. Average reward =  -8.756984582312764 . epsilon =  0.8895799887583027
46  epoch 20  sequence. Average reward =  -4.85353742427888 . epsilon =  0

49  epoch 24  sequence. Average reward =  -7.9490698254888255 . epsilon =  0.8824913866284702
50  epoch 0  sequence. Average reward =  -6.136517154634197 . epsilon =  0.8824031374898074
50  epoch 1  sequence. Average reward =  -8.880417731423153 . epsilon =  0.8823148971760584
50  epoch 2  sequence. Average reward =  -6.820408277976667 . epsilon =  0.8822266656863408
50  epoch 3  sequence. Average reward =  -5.6338425525017675 . epsilon =  0.8821384430197722
50  epoch 4  sequence. Average reward =  -8.357647080347487 . epsilon =  0.8820502291754702
50  epoch 5  sequence. Average reward =  -5.394017951780136 . epsilon =  0.8819620241525526
50  epoch 6  sequence. Average reward =  -4.157956983389514 . epsilon =  0.8818738279501374
50  epoch 7  sequence. Average reward =  -4.869735204108804 . epsilon =  0.8817856405673423
50  epoch 8  sequence. Average reward =  -7.638719040560585 . epsilon =  0.8816974620032856
50  epoch 9  sequence. Average reward =  -6.03554703860673 . epsilon =  0.881

53  epoch 13  sequence. Average reward =  -7.268398623461833 . epsilon =  0.874671671646032
53  epoch 14  sequence. Average reward =  -7.33109035244215 . epsilon =  0.8745842044788674
53  epoch 15  sequence. Average reward =  -7.398546236249326 . epsilon =  0.8744967460584195
53  epoch 16  sequence. Average reward =  -5.770699060144606 . epsilon =  0.8744092963838137
53  epoch 17  sequence. Average reward =  -4.598486065159274 . epsilon =  0.8743218554541753
53  epoch 18  sequence. Average reward =  -5.134657904897694 . epsilon =  0.8742344232686299
53  epoch 19  sequence. Average reward =  -4.87774896514892 . epsilon =  0.8741469998263031
53  epoch 20  sequence. Average reward =  -6.839634118506368 . epsilon =  0.8740595851263204
53  epoch 21  sequence. Average reward =  -4.556673927037261 . epsilon =  0.8739721791678078
53  epoch 22  sequence. Average reward =  -5.351965578339064 . epsilon =  0.873884781949891
53  epoch 23  sequence. Average reward =  -6.491856630675203 . epsilon =  

57  epoch 2  sequence. Average reward =  -7.612242035590114 . epsilon =  0.8669212467930305
57  epoch 3  sequence. Average reward =  -5.232596975782571 . epsilon =  0.8668345546683512
57  epoch 4  sequence. Average reward =  -5.611186995525891 . epsilon =  0.8667478712128844
57  epoch 5  sequence. Average reward =  -6.4012834821249704 . epsilon =  0.8666611964257631
57  epoch 6  sequence. Average reward =  -5.721393779589898 . epsilon =  0.8665745303061205
57  epoch 7  sequence. Average reward =  -5.114839600004468 . epsilon =  0.8664878728530899
57  epoch 8  sequence. Average reward =  -7.627723584985534 . epsilon =  0.8664012240658047
57  epoch 9  sequence. Average reward =  -6.523844722946406 . epsilon =  0.8663145839433981
57  epoch 10  sequence. Average reward =  -4.614519115211531 . epsilon =  0.8662279524850037
57  epoch 11  sequence. Average reward =  -6.526190581439485 . epsilon =  0.8661413296897552
57  epoch 12  sequence. Average reward =  -5.835678441420757 . epsilon =  0.8

60  epoch 16  sequence. Average reward =  -7.7405546840454305 . epsilon =  0.8592394980928637
60  epoch 17  sequence. Average reward =  -4.480228424251339 . epsilon =  0.8591535741430544
60  epoch 18  sequence. Average reward =  -6.420376695807766 . epsilon =  0.8590676587856401
60  epoch 19  sequence. Average reward =  -8.781558791518123 . epsilon =  0.8589817520197615
60  epoch 20  sequence. Average reward =  -6.2480635723523985 . epsilon =  0.8588958538445595
60  epoch 21  sequence. Average reward =  -5.6177114044953695 . epsilon =  0.858809964259175
60  epoch 22  sequence. Average reward =  -5.69109379961595 . epsilon =  0.8587240832627491
60  epoch 23  sequence. Average reward =  -5.615497094076509 . epsilon =  0.8586382108544228
60  epoch 24  sequence. Average reward =  -7.656848145414283 . epsilon =  0.8585523470333374
61  epoch 0  sequence. Average reward =  -5.554524009568689 . epsilon =  0.8584664917986341
61  epoch 1  sequence. Average reward =  -5.195127777517649 . epsilon 

64  epoch 5  sequence. Average reward =  -4.836698927935461 . epsilon =  0.8516258170093467
64  epoch 6  sequence. Average reward =  -4.9713815180799905 . epsilon =  0.8515406544276458
64  epoch 7  sequence. Average reward =  -5.1651065643955825 . epsilon =  0.8514555003622031
64  epoch 8  sequence. Average reward =  -6.150493900712077 . epsilon =  0.8513703548121668
64  epoch 9  sequence. Average reward =  -6.695434101722618 . epsilon =  0.8512852177766856
64  epoch 10  sequence. Average reward =  -5.245719164040699 . epsilon =  0.851200089254908
64  epoch 11  sequence. Average reward =  -7.049128447406486 . epsilon =  0.8511149692459825
64  epoch 12  sequence. Average reward =  -4.823288915578333 . epsilon =  0.8510298577490579
64  epoch 13  sequence. Average reward =  -5.166669131291118 . epsilon =  0.850944754763283
64  epoch 14  sequence. Average reward =  -5.135093171497452 . epsilon =  0.8508596602878067
64  epoch 15  sequence. Average reward =  -6.543436788290155 . epsilon =  0

67  epoch 19  sequence. Average reward =  -8.139923312440937 . epsilon =  0.8440796003985066
67  epoch 20  sequence. Average reward =  -7.312112035954564 . epsilon =  0.8439951924384668
67  epoch 21  sequence. Average reward =  -6.958177042614661 . epsilon =  0.8439107929192229
67  epoch 22  sequence. Average reward =  -6.903692665170002 . epsilon =  0.843826401839931
67  epoch 23  sequence. Average reward =  -7.213854556405664 . epsilon =  0.843742019199747
67  epoch 24  sequence. Average reward =  -4.8842143214583125 . epsilon =  0.843657644997827
68  epoch 0  sequence. Average reward =  -7.396902408316447 . epsilon =  0.8435732792333273
68  epoch 1  sequence. Average reward =  -4.9492751860938125 . epsilon =  0.8434889219054039
68  epoch 2  sequence. Average reward =  -6.0757102044710845 . epsilon =  0.8434045730132134
68  epoch 3  sequence. Average reward =  -4.875614864527671 . epsilon =  0.843320232555912
68  epoch 4  sequence. Average reward =  -5.419447862168693 . epsilon =  0.

71  epoch 8  sequence. Average reward =  -5.143110792756145 . epsilon =  0.8366002504607994
71  epoch 9  sequence. Average reward =  -6.293261369017285 . epsilon =  0.8365165904357533
71  epoch 10  sequence. Average reward =  -5.4012630498783825 . epsilon =  0.8364329387767098
71  epoch 11  sequence. Average reward =  -7.640437339790032 . epsilon =  0.8363492954828321
71  epoch 12  sequence. Average reward =  -8.396757523916738 . epsilon =  0.8362656605532838
71  epoch 13  sequence. Average reward =  -5.226066258033214 . epsilon =  0.8361820339872285
71  epoch 14  sequence. Average reward =  -6.75640752387058 . epsilon =  0.8360984157838297
71  epoch 15  sequence. Average reward =  -7.848960802334649 . epsilon =  0.8360148059422513
71  epoch 16  sequence. Average reward =  -7.2770346649216435 . epsilon =  0.8359312044616571
71  epoch 17  sequence. Average reward =  -4.849922537261766 . epsilon =  0.835847611341211
71  epoch 18  sequence. Average reward =  -5.226186056870852 . epsilon =

74  epoch 22  sequence. Average reward =  -4.877694045647493 . epsilon =  0.8291871746937558
74  epoch 23  sequence. Average reward =  -4.868124662990597 . epsilon =  0.8291042559762865
74  epoch 24  sequence. Average reward =  -5.293558539240609 . epsilon =  0.8290213455506888
75  epoch 0  sequence. Average reward =  -6.569501842003011 . epsilon =  0.8289384434161338
75  epoch 1  sequence. Average reward =  -5.199699978012329 . epsilon =  0.8288555495717922
75  epoch 2  sequence. Average reward =  -5.640725528295607 . epsilon =  0.828772664016835
75  epoch 3  sequence. Average reward =  -7.868317183089562 . epsilon =  0.8286897867504334
75  epoch 4  sequence. Average reward =  -4.471380803932494 . epsilon =  0.8286069177717583
75  epoch 5  sequence. Average reward =  -4.080627239036215 . epsilon =  0.8285240570799811
75  epoch 6  sequence. Average reward =  -6.499263014836763 . epsilon =  0.8284412046742732
75  epoch 7  sequence. Average reward =  -4.880890980644872 . epsilon =  0.828

78  epoch 11  sequence. Average reward =  -5.19536519312064 . epsilon =  0.8218397858450438
78  epoch 12  sequence. Average reward =  -8.462772404771323 . epsilon =  0.8217576018664593
78  epoch 13  sequence. Average reward =  -5.294188115753479 . epsilon =  0.8216754261062726
78  epoch 14  sequence. Average reward =  -6.091963126260808 . epsilon =  0.821593258563662
78  epoch 15  sequence. Average reward =  -8.54422991627961 . epsilon =  0.8215110992378057
78  epoch 16  sequence. Average reward =  -8.737901503855474 . epsilon =  0.8214289481278819
78  epoch 17  sequence. Average reward =  -8.481244231584398 . epsilon =  0.8213468052330691
78  epoch 18  sequence. Average reward =  -6.7963411317 . epsilon =  0.8212646705525458
78  epoch 19  sequence. Average reward =  -5.921729847219549 . epsilon =  0.8211825440854905
78  epoch 20  sequence. Average reward =  -4.087615185768202 . epsilon =  0.821100425831082
78  epoch 21  sequence. Average reward =  -6.25597962285553 . epsilon =  0.8210

82  epoch 0  sequence. Average reward =  -7.075281575196707 . epsilon =  0.8145575018659459
82  epoch 1  sequence. Average reward =  -5.113246557104087 . epsilon =  0.8144760461157593
82  epoch 2  sequence. Average reward =  -5.258803711856762 . epsilon =  0.8143945985111477
82  epoch 3  sequence. Average reward =  -4.716976622003524 . epsilon =  0.8143131590512966
82  epoch 4  sequence. Average reward =  -8.241229620553609 . epsilon =  0.8142317277353915
82  epoch 5  sequence. Average reward =  -4.89669704124162 . epsilon =  0.814150304562618
82  epoch 6  sequence. Average reward =  -5.2566022895197415 . epsilon =  0.8140688895321618
82  epoch 7  sequence. Average reward =  -5.709663628836539 . epsilon =  0.8139874826432086
82  epoch 8  sequence. Average reward =  -6.757073609906348 . epsilon =  0.8139060838949443
82  epoch 9  sequence. Average reward =  -6.635957241881381 . epsilon =  0.8138246932865548
82  epoch 10  sequence. Average reward =  -8.243172751545648 . epsilon =  0.81374

85  epoch 14  sequence. Average reward =  -6.693506338295184 . epsilon =  0.8073397458652514
85  epoch 15  sequence. Average reward =  -5.905926488416364 . epsilon =  0.8072590118906648
85  epoch 16  sequence. Average reward =  -6.974863279531843 . epsilon =  0.8071782859894758
85  epoch 17  sequence. Average reward =  -5.252847153574324 . epsilon =  0.8070975681608769
85  epoch 18  sequence. Average reward =  -4.828868672523803 . epsilon =  0.8070168584040608
85  epoch 19  sequence. Average reward =  -5.491001694116104 . epsilon =  0.8069361567182204
85  epoch 20  sequence. Average reward =  -5.109071393145182 . epsilon =  0.8068554631025485
85  epoch 21  sequence. Average reward =  -7.188481920936948 . epsilon =  0.8067747775562383
85  epoch 22  sequence. Average reward =  -8.337803344006973 . epsilon =  0.8066941000784826
85  epoch 23  sequence. Average reward =  -5.058349216989339 . epsilon =  0.8066134306684748
85  epoch 24  sequence. Average reward =  -3.9270071451764927 . epsilo

89  epoch 3  sequence. Average reward =  -7.20521113992671 . epsilon =  0.8001859460635561
89  epoch 4  sequence. Average reward =  -5.6968067324654 . epsilon =  0.8001059274689497
89  epoch 5  sequence. Average reward =  -5.872315123520138 . epsilon =  0.8000259168762028
89  epoch 6  sequence. Average reward =  -6.8121703445444695 . epsilon =  0.7999459142845152
89  epoch 7  sequence. Average reward =  -4.786108592160315 . epsilon =  0.7998659196930867
89  epoch 8  sequence. Average reward =  -4.543969766859854 . epsilon =  0.7997859331011175
89  epoch 9  sequence. Average reward =  -8.044492189546467 . epsilon =  0.7997059545078073
89  epoch 10  sequence. Average reward =  -4.593898583404613 . epsilon =  0.7996259839123566
89  epoch 11  sequence. Average reward =  -5.64078892870396 . epsilon =  0.7995460213139653
89  epoch 12  sequence. Average reward =  -4.867793313490641 . epsilon =  0.799466066711834
89  epoch 13  sequence. Average reward =  -6.622810624577621 . epsilon =  0.79938

92  epoch 17  sequence. Average reward =  -6.324787309213805 . epsilon =  0.7930955357479659
92  epoch 18  sequence. Average reward =  -7.6002165665242964 . epsilon =  0.793016226194391
92  epoch 19  sequence. Average reward =  -8.623552477167305 . epsilon =  0.7929369245717717
92  epoch 20  sequence. Average reward =  -7.3917602716684465 . epsilon =  0.7928576308793145
92  epoch 21  sequence. Average reward =  -8.031959091333732 . epsilon =  0.7927783451162266
92  epoch 22  sequence. Average reward =  -5.517649670972187 . epsilon =  0.792699067281715
92  epoch 23  sequence. Average reward =  -6.078577418097075 . epsilon =  0.7926197973749869
92  epoch 24  sequence. Average reward =  -4.438754833254623 . epsilon =  0.7925405353952494
93  epoch 0  sequence. Average reward =  -7.768376119852359 . epsilon =  0.7924612813417099
93  epoch 1  sequence. Average reward =  -7.349445454957367 . epsilon =  0.7923820352135758
93  epoch 2  sequence. Average reward =  -5.988888501851309 . epsilon = 

96  epoch 6  sequence. Average reward =  -4.445340494260427 . epsilon =  0.7860679532272039
96  epoch 7  sequence. Average reward =  -4.39914493349434 . epsilon =  0.7859893464318812
96  epoch 8  sequence. Average reward =  -8.264388284821607 . epsilon =  0.7859107474972381
96  epoch 9  sequence. Average reward =  -5.824178809083243 . epsilon =  0.7858321564224884
96  epoch 10  sequence. Average reward =  -5.899298501696634 . epsilon =  0.7857535732068461
96  epoch 11  sequence. Average reward =  -7.041852976286681 . epsilon =  0.7856749978495254
96  epoch 12  sequence. Average reward =  -8.048354697542948 . epsilon =  0.7855964303497405
96  epoch 13  sequence. Average reward =  -5.249803354004356 . epsilon =  0.7855178707067055
96  epoch 14  sequence. Average reward =  -7.427656107902005 . epsilon =  0.7854393189196349
96  epoch 15  sequence. Average reward =  -8.46919195136595 . epsilon =  0.7853607749877429
96  epoch 16  sequence. Average reward =  -8.134849995995955 . epsilon =  0.

99  epoch 20  sequence. Average reward =  -5.006048953425316 . epsilon =  0.7791026417871125
99  epoch 21  sequence. Average reward =  -4.046478946665899 . epsilon =  0.7790247315229338
99  epoch 22  sequence. Average reward =  -7.5129165892087535 . epsilon =  0.7789468290497815
99  epoch 23  sequence. Average reward =  -5.072544899054636 . epsilon =  0.7788689343668765
99  epoch 24  sequence. Average reward =  -7.383432705232275 . epsilon =  0.7787910474734399
100  epoch 0  sequence. Average reward =  -5.801642418438553 . epsilon =  0.7787131683686925
100  epoch 1  sequence. Average reward =  -4.963123135612355 . epsilon =  0.7786352970518556
100  epoch 2  sequence. Average reward =  -5.628067402139868 . epsilon =  0.7785574335221505
100  epoch 3  sequence. Average reward =  -8.230499120986375 . epsilon =  0.7784795777787983
100  epoch 4  sequence. Average reward =  -7.664585743474343 . epsilon =  0.7784017298210204
100  epoch 5  sequence. Average reward =  -4.969304730607373 . epsilo

103  epoch 8  sequence. Average reward =  -6.3554480576623025 . epsilon =  0.77227627727428
103  epoch 9  sequence. Average reward =  -5.542334079294506 . epsilon =  0.7721990496465526
103  epoch 10  sequence. Average reward =  -6.572179735457533 . epsilon =  0.7721218297415879
103  epoch 11  sequence. Average reward =  -6.474167865029693 . epsilon =  0.7720446175586138
103  epoch 12  sequence. Average reward =  -6.088140490254862 . epsilon =  0.7719674130968579
103  epoch 13  sequence. Average reward =  -8.180844462675289 . epsilon =  0.7718902163555482
103  epoch 14  sequence. Average reward =  -6.0390877736523665 . epsilon =  0.7718130273339127
103  epoch 15  sequence. Average reward =  -8.122965430123125 . epsilon =  0.7717358460311793
103  epoch 16  sequence. Average reward =  -5.138278237095319 . epsilon =  0.7716586724465762
103  epoch 17  sequence. Average reward =  -8.387172484664584 . epsilon =  0.7715815065793316
103  epoch 18  sequence. Average reward =  -5.192184595808795 

106  epoch 21  sequence. Average reward =  -5.670410586100735 . epsilon =  0.7655097242034357
106  epoch 22  sequence. Average reward =  -8.996288534148592 . epsilon =  0.7654331732310153
106  epoch 23  sequence. Average reward =  -4.498548725170127 . epsilon =  0.7653566299136922
106  epoch 24  sequence. Average reward =  -5.608537014699966 . epsilon =  0.7652800942507009
107  epoch 0  sequence. Average reward =  -5.6849826314147895 . epsilon =  0.7652035662412758
107  epoch 1  sequence. Average reward =  -6.124233456756229 . epsilon =  0.7651270458846517
107  epoch 2  sequence. Average reward =  -5.487309567409769 . epsilon =  0.7650505331800633
107  epoch 3  sequence. Average reward =  -5.668316055876228 . epsilon =  0.7649740281267453
107  epoch 4  sequence. Average reward =  -8.404461409078692 . epsilon =  0.7648975307239326
107  epoch 5  sequence. Average reward =  -8.557724829810999 . epsilon =  0.7648210409708602
107  epoch 6  sequence. Average reward =  -6.077221055615139 . ep

110  epoch 9  sequence. Average reward =  -4.808842319580411 . epsilon =  0.7588024585169224
110  epoch 10  sequence. Average reward =  -7.409739255000028 . epsilon =  0.7587265782710707
110  epoch 11  sequence. Average reward =  -7.857129498766894 . epsilon =  0.7586507056132437
110  epoch 12  sequence. Average reward =  -7.343513933489721 . epsilon =  0.7585748405426823
110  epoch 13  sequence. Average reward =  -4.972190967515903 . epsilon =  0.7584989830586281
110  epoch 14  sequence. Average reward =  -8.909011269044365 . epsilon =  0.7584231331603222
110  epoch 15  sequence. Average reward =  -5.28159153264277 . epsilon =  0.7583472908470061
110  epoch 16  sequence. Average reward =  -7.1801865059946195 . epsilon =  0.7582714561179215
110  epoch 17  sequence. Average reward =  -4.980562236202714 . epsilon =  0.7581956289723096
110  epoch 18  sequence. Average reward =  -4.419552973663078 . epsilon =  0.7581198094094124
110  epoch 19  sequence. Average reward =  -7.407399035954651

113  epoch 22  sequence. Average reward =  -7.096030204608766 . epsilon =  0.7521539607487872
113  epoch 23  sequence. Average reward =  -8.170823727229998 . epsilon =  0.7520787453527122
113  epoch 24  sequence. Average reward =  -6.837521417596222 . epsilon =  0.7520035374781769
114  epoch 0  sequence. Average reward =  -4.7502489811285775 . epsilon =  0.7519283371244291
114  epoch 1  sequence. Average reward =  -5.487955928720709 . epsilon =  0.7518531442907166
114  epoch 2  sequence. Average reward =  -4.863594096511806 . epsilon =  0.7517779589762875
114  epoch 3  sequence. Average reward =  -5.715632873092398 . epsilon =  0.7517027811803899
114  epoch 4  sequence. Average reward =  -8.509863908489015 . epsilon =  0.7516276109022719
114  epoch 5  sequence. Average reward =  -7.755319496489861 . epsilon =  0.7515524481411817
114  epoch 6  sequence. Average reward =  -5.246395195939555 . epsilon =  0.7514772928963676
114  epoch 7  sequence. Average reward =  -6.588607476049912 . eps

117  epoch 10  sequence. Average reward =  -4.888448520333321 . epsilon =  0.7455637159845491
117  epoch 11  sequence. Average reward =  -5.613386924484541 . epsilon =  0.7454891596129506
117  epoch 12  sequence. Average reward =  -5.004636730629266 . epsilon =  0.7454146106969893
117  epoch 13  sequence. Average reward =  -5.7732841497629135 . epsilon =  0.7453400692359197
117  epoch 14  sequence. Average reward =  -5.733797067168216 . epsilon =  0.7452655352289961
117  epoch 15  sequence. Average reward =  -7.392777021152343 . epsilon =  0.7451910086754732
117  epoch 16  sequence. Average reward =  -6.435190827490019 . epsilon =  0.7451164895746056
117  epoch 17  sequence. Average reward =  -8.010582609697359 . epsilon =  0.7450419779256482
117  epoch 18  sequence. Average reward =  -6.205865103577615 . epsilon =  0.7449674737278557
117  epoch 19  sequence. Average reward =  -5.34139895840892 . epsilon =  0.7448929769804828
117  epoch 20  sequence. Average reward =  -4.90514068445625

120  epoch 23  sequence. Average reward =  -7.5064861163705565 . epsilon =  0.7390312138213185
120  epoch 24  sequence. Average reward =  -5.0089175697260835 . epsilon =  0.7389573106999364
121  epoch 0  sequence. Average reward =  -5.425351626446773 . epsilon =  0.7388834149688664
121  epoch 1  sequence. Average reward =  -5.289919603977698 . epsilon =  0.7388095266273695
121  epoch 2  sequence. Average reward =  -6.700219607677062 . epsilon =  0.7387356456747068
121  epoch 3  sequence. Average reward =  -5.7978350583191025 . epsilon =  0.7386617721101394
121  epoch 4  sequence. Average reward =  -5.0340063595479805 . epsilon =  0.7385879059329283
121  epoch 5  sequence. Average reward =  -4.968777011111777 . epsilon =  0.738514047142335
121  epoch 6  sequence. Average reward =  -6.28268785416664 . epsilon =  0.7384401957376208
121  epoch 7  sequence. Average reward =  -5.610821104650887 . epsilon =  0.738366351718047
121  epoch 8  sequence. Average reward =  -8.797968420170612 . epsi

124  epoch 11  sequence. Average reward =  -4.8220708529040905 . epsilon =  0.7325559483282726
124  epoch 12  sequence. Average reward =  -7.118053239826366 . epsilon =  0.7324826927334398
124  epoch 13  sequence. Average reward =  -5.579612015243691 . epsilon =  0.7324094444641664
124  epoch 14  sequence. Average reward =  -5.23681122492867 . epsilon =  0.7323362035197201
124  epoch 15  sequence. Average reward =  -5.772565468514486 . epsilon =  0.7322629698993681
124  epoch 16  sequence. Average reward =  -4.860930270211841 . epsilon =  0.7321897436023782
124  epoch 17  sequence. Average reward =  -6.27914894833052 . epsilon =  0.732116524628018
124  epoch 18  sequence. Average reward =  -5.2008365935558984 . epsilon =  0.7320433129755552
124  epoch 19  sequence. Average reward =  -5.661609927935604 . epsilon =  0.7319701086442577
124  epoch 20  sequence. Average reward =  -8.027078651254158 . epsilon =  0.7318969116333933
124  epoch 21  sequence. Average reward =  -8.387265503442135

127  epoch 24  sequence. Average reward =  -5.13360786894744 . epsilon =  0.7261374180074647
128  epoch 0  sequence. Average reward =  -7.034328627903595 . epsilon =  0.7260648042656639
128  epoch 1  sequence. Average reward =  -5.950494569909502 . epsilon =  0.7259921977852374
128  epoch 2  sequence. Average reward =  -8.400977492644236 . epsilon =  0.7259195985654588
128  epoch 3  sequence. Average reward =  -4.9087396869547835 . epsilon =  0.7258470066056023
128  epoch 4  sequence. Average reward =  -5.530945321568984 . epsilon =  0.7257744219049418
128  epoch 5  sequence. Average reward =  -7.8936175902599 . epsilon =  0.7257018444627513
128  epoch 6  sequence. Average reward =  -7.5762484330183835 . epsilon =  0.725629274278305
128  epoch 7  sequence. Average reward =  -8.638213419052674 . epsilon =  0.7255567113508772
128  epoch 8  sequence. Average reward =  -5.090775452388447 . epsilon =  0.7254841556797421
128  epoch 9  sequence. Average reward =  -5.90651064586201 . epsilon =

131  epoch 12  sequence. Average reward =  -4.842400638011937 . epsilon =  0.7197751257549888
131  epoch 13  sequence. Average reward =  -6.827522802253564 . epsilon =  0.7197031482424133
131  epoch 14  sequence. Average reward =  -5.76631713561752 . epsilon =  0.7196311779275891
131  epoch 15  sequence. Average reward =  -4.8241776655268955 . epsilon =  0.7195592148097963
131  epoch 16  sequence. Average reward =  -6.086277984531909 . epsilon =  0.7194872588883153
131  epoch 17  sequence. Average reward =  -7.603656429652375 . epsilon =  0.7194153101624265
131  epoch 18  sequence. Average reward =  -5.31693572313428 . epsilon =  0.7193433686314104
131  epoch 19  sequence. Average reward =  -4.784874153391468 . epsilon =  0.7192714342945472
131  epoch 20  sequence. Average reward =  -7.109994009260486 . epsilon =  0.7191995071511178
131  epoch 21  sequence. Average reward =  -7.144247868477962 . epsilon =  0.7191275872004027
131  epoch 22  sequence. Average reward =  -5.182223286629595

135  epoch 0  sequence. Average reward =  -5.34355443623656 . epsilon =  0.713468578822478
135  epoch 1  sequence. Average reward =  -8.600237709531644 . epsilon =  0.7133972319645958
135  epoch 2  sequence. Average reward =  -4.912085634594072 . epsilon =  0.7133258922413993
135  epoch 3  sequence. Average reward =  -5.653414615426173 . epsilon =  0.7132545596521752
135  epoch 4  sequence. Average reward =  -8.860528362945074 . epsilon =  0.71318323419621
135  epoch 5  sequence. Average reward =  -4.59099905479529 . epsilon =  0.7131119158727903
135  epoch 6  sequence. Average reward =  -5.574822734604827 . epsilon =  0.713040604681203
135  epoch 7  sequence. Average reward =  -7.747763657822355 . epsilon =  0.7129693006207349
135  epoch 8  sequence. Average reward =  -4.789367226384808 . epsilon =  0.7128980036906729
135  epoch 9  sequence. Average reward =  -5.078336265706866 . epsilon =  0.7128267138903038
135  epoch 10  sequence. Average reward =  -5.823477187674027 . epsilon =  0

138  epoch 13  sequence. Average reward =  -4.799075411679825 . epsilon =  0.707217288778944
138  epoch 14  sequence. Average reward =  -5.444444065569213 . epsilon =  0.7071465670500661
138  epoch 15  sequence. Average reward =  -6.670813589843386 . epsilon =  0.707075852393361
138  epoch 16  sequence. Average reward =  -7.249487724024673 . epsilon =  0.7070051448081217
138  epoch 17  sequence. Average reward =  -5.188868158457408 . epsilon =  0.7069344442936409
138  epoch 18  sequence. Average reward =  -4.721756425268265 . epsilon =  0.7068637508492116
138  epoch 19  sequence. Average reward =  -6.9988655466089496 . epsilon =  0.7067930644741267
138  epoch 20  sequence. Average reward =  -7.824020611828256 . epsilon =  0.7067223851676793
138  epoch 21  sequence. Average reward =  -4.863729554812768 . epsilon =  0.7066517129291625
138  epoch 22  sequence. Average reward =  -5.295208087708715 . epsilon =  0.7065810477578696
138  epoch 23  sequence. Average reward =  -6.356803057577971

142  epoch 1  sequence. Average reward =  -7.880476759459022 . epsilon =  0.7010207714729464
142  epoch 2  sequence. Average reward =  -6.780626031751949 . epsilon =  0.7009506693957991
142  epoch 3  sequence. Average reward =  -4.4492993914601495 . epsilon =  0.7008805743288595
142  epoch 4  sequence. Average reward =  -4.797881994094753 . epsilon =  0.7008104862714266
142  epoch 5  sequence. Average reward =  -5.92741469924614 . epsilon =  0.7007404052227995
142  epoch 6  sequence. Average reward =  -8.09395575117348 . epsilon =  0.7006703311822772
142  epoch 7  sequence. Average reward =  -5.638028230139122 . epsilon =  0.700600264149159
142  epoch 8  sequence. Average reward =  -6.117622897453152 . epsilon =  0.7005302041227441
142  epoch 9  sequence. Average reward =  -5.378099725527405 . epsilon =  0.7004601511023318
142  epoch 10  sequence. Average reward =  -6.363025621929928 . epsilon =  0.7003901050872217
142  epoch 11  sequence. Average reward =  -5.612530739137871 . epsilon

145  epoch 14  sequence. Average reward =  -8.361969317039685 . epsilon =  0.694878546995098
145  epoch 15  sequence. Average reward =  -4.918323386076592 . epsilon =  0.6948090591403985
145  epoch 16  sequence. Average reward =  -5.036255948091062 . epsilon =  0.6947395782344844
145  epoch 17  sequence. Average reward =  -5.240693118899889 . epsilon =  0.694670104276661
145  epoch 18  sequence. Average reward =  -7.619750110850808 . epsilon =  0.6946006372662333
145  epoch 19  sequence. Average reward =  -8.79526322532048 . epsilon =  0.6945311772025067
145  epoch 20  sequence. Average reward =  -6.982260588189245 . epsilon =  0.6944617240847865
145  epoch 21  sequence. Average reward =  -5.6334079563587505 . epsilon =  0.694392277912378
145  epoch 22  sequence. Average reward =  -7.989567204566014 . epsilon =  0.6943228386845868
145  epoch 23  sequence. Average reward =  -6.0641685105916485 . epsilon =  0.6942534064007183
145  epoch 24  sequence. Average reward =  -5.9685484723786635

149  epoch 2  sequence. Average reward =  -8.826323667645667 . epsilon =  0.6887901396408949
149  epoch 3  sequence. Average reward =  -6.267374158951032 . epsilon =  0.6887212606269307
149  epoch 4  sequence. Average reward =  -5.881257288239137 . epsilon =  0.6886523885008681
149  epoch 5  sequence. Average reward =  -6.692150149473145 . epsilon =  0.688583523262018
149  epoch 6  sequence. Average reward =  -7.342455792008822 . epsilon =  0.6885146649096918
149  epoch 7  sequence. Average reward =  -6.886855420837117 . epsilon =  0.6884458134432009
149  epoch 8  sequence. Average reward =  -6.068629254716298 . epsilon =  0.6883769688618566
149  epoch 9  sequence. Average reward =  -5.923544655623993 . epsilon =  0.6883081311649705
149  epoch 10  sequence. Average reward =  -4.834460738496938 . epsilon =  0.688239300351854
149  epoch 11  sequence. Average reward =  -6.5850002020391845 . epsilon =  0.6881704764218188
149  epoch 12  sequence. Average reward =  -5.667931292213594 . epsil

152  epoch 15  sequence. Average reward =  -6.11974856035501 . epsilon =  0.6827550778738755
152  epoch 16  sequence. Average reward =  -4.965853238931661 . epsilon =  0.682686802366088
152  epoch 17  sequence. Average reward =  -7.579015590251589 . epsilon =  0.6826185336858515
152  epoch 18  sequence. Average reward =  -4.913046015368261 . epsilon =  0.6825502718324828
152  epoch 19  sequence. Average reward =  -5.464241358747129 . epsilon =  0.6824820168052996
152  epoch 20  sequence. Average reward =  -6.4788955276258875 . epsilon =  0.6824137686036191
152  epoch 21  sequence. Average reward =  -5.6207540956874285 . epsilon =  0.6823455272267588
152  epoch 22  sequence. Average reward =  -7.424217525764969 . epsilon =  0.6822772926740361
152  epoch 23  sequence. Average reward =  -5.162767303914645 . epsilon =  0.6822090649447687
152  epoch 24  sequence. Average reward =  -6.380089408138544 . epsilon =  0.6821408440382742
153  epoch 0  sequence. Average reward =  -8.024936405752491

156  epoch 3  sequence. Average reward =  -5.455133268297484 . epsilon =  0.6767728942890996
156  epoch 4  sequence. Average reward =  -5.48524882841272 . epsilon =  0.6767052169996707
156  epoch 5  sequence. Average reward =  -6.484045872156041 . epsilon =  0.6766375464779707
156  epoch 6  sequence. Average reward =  -8.06948835753569 . epsilon =  0.6765698827233229
156  epoch 7  sequence. Average reward =  -4.892879542985815 . epsilon =  0.6765022257350505
156  epoch 8  sequence. Average reward =  -7.001723362651527 . epsilon =  0.676434575512477
156  epoch 9  sequence. Average reward =  -5.191279838301346 . epsilon =  0.6763669320549257
156  epoch 10  sequence. Average reward =  -7.821463101681686 . epsilon =  0.6762992953617202
156  epoch 11  sequence. Average reward =  -5.053985547623162 . epsilon =  0.676231665432184
156  epoch 12  sequence. Average reward =  -8.188763186430307 . epsilon =  0.6761640422656409
156  epoch 13  sequence. Average reward =  -6.426334994115298 . epsilon

159  epoch 16  sequence. Average reward =  -4.831435773683519 . epsilon =  0.6708431255769504
159  epoch 17  sequence. Average reward =  -5.592100665512168 . epsilon =  0.6707760412643927
159  epoch 18  sequence. Average reward =  -4.973600466869596 . epsilon =  0.6707089636602663
159  epoch 19  sequence. Average reward =  -6.170340273736755 . epsilon =  0.6706418927639003
159  epoch 20  sequence. Average reward =  -4.956876508452576 . epsilon =  0.6705748285746239
159  epoch 21  sequence. Average reward =  -4.9487326131480005 . epsilon =  0.6705077710917665
159  epoch 22  sequence. Average reward =  -5.892985204786298 . epsilon =  0.6704407203146573
159  epoch 23  sequence. Average reward =  -6.100092224676599 . epsilon =  0.6703736762426258
159  epoch 24  sequence. Average reward =  -5.693589604487905 . epsilon =  0.6703066388750015
160  epoch 0  sequence. Average reward =  -8.052175226756425 . epsilon =  0.6702396082111141
160  epoch 1  sequence. Average reward =  -6.470827022602681

163  epoch 4  sequence. Average reward =  -5.5775571937132975 . epsilon =  0.6649653124872504
163  epoch 5  sequence. Average reward =  -5.838476045399217 . epsilon =  0.6648988159560018
163  epoch 6  sequence. Average reward =  -7.013274065451167 . epsilon =  0.6648323260744061
163  epoch 7  sequence. Average reward =  -5.337762305064393 . epsilon =  0.6647658428417987
163  epoch 8  sequence. Average reward =  -5.059819294985797 . epsilon =  0.6646993662575146
163  epoch 9  sequence. Average reward =  -5.593499081330319 . epsilon =  0.6646328963208888
163  epoch 10  sequence. Average reward =  -7.063902297600116 . epsilon =  0.6645664330312567
163  epoch 11  sequence. Average reward =  -6.95544264974231 . epsilon =  0.6644999763879537
163  epoch 12  sequence. Average reward =  -6.7939927332371575 . epsilon =  0.6644335263903148
163  epoch 13  sequence. Average reward =  -7.799473070432698 . epsilon =  0.6643670830376758
163  epoch 14  sequence. Average reward =  -5.625832245688907 . e

166  epoch 17  sequence. Average reward =  -6.337092720043079 . epsilon =  0.6591389997936938
166  epoch 18  sequence. Average reward =  -5.694467005336217 . epsilon =  0.6590730858937145
166  epoch 19  sequence. Average reward =  -5.733147054135686 . epsilon =  0.6590071785851251
166  epoch 20  sequence. Average reward =  -8.510843731033175 . epsilon =  0.6589412778672666
166  epoch 21  sequence. Average reward =  -6.188542435835357 . epsilon =  0.65887538373948
166  epoch 22  sequence. Average reward =  -4.68598666381397 . epsilon =  0.6588094962011061
166  epoch 23  sequence. Average reward =  -5.361525183363707 . epsilon =  0.658743615251486
166  epoch 24  sequence. Average reward =  -7.999510491367854 . epsilon =  0.6586777408899608
167  epoch 0  sequence. Average reward =  -6.781788855595571 . epsilon =  0.6586118731158718
167  epoch 1  sequence. Average reward =  -8.520347578640756 . epsilon =  0.6585460119285602
167  epoch 2  sequence. Average reward =  -6.7129299074509055 . ep

170  epoch 5  sequence. Average reward =  -5.117266424907669 . epsilon =  0.6533637362585909
170  epoch 6  sequence. Average reward =  -5.725013712899987 . epsilon =  0.6532983998849651
170  epoch 7  sequence. Average reward =  -5.540207486523213 . epsilon =  0.6532330700449765
170  epoch 8  sequence. Average reward =  -7.601125930538234 . epsilon =  0.6531677467379721
170  epoch 9  sequence. Average reward =  -6.706768885956828 . epsilon =  0.6531024299632983
170  epoch 10  sequence. Average reward =  -6.577191428754378 . epsilon =  0.653037119720302
170  epoch 11  sequence. Average reward =  -6.763976048758618 . epsilon =  0.65297181600833
170  epoch 12  sequence. Average reward =  -5.577040725411874 . epsilon =  0.6529065188267291
170  epoch 13  sequence. Average reward =  -7.341674133900196 . epsilon =  0.6528412281748465
170  epoch 14  sequence. Average reward =  -6.405881909745456 . epsilon =  0.652775944052029
170  epoch 15  sequence. Average reward =  -8.174071332247333 . epsil

173  epoch 18  sequence. Average reward =  -5.025005947277912 . epsilon =  0.6476390745979188
173  epoch 19  sequence. Average reward =  -6.629300367927099 . epsilon =  0.647574310690459
173  epoch 20  sequence. Average reward =  -5.399002305825474 . epsilon =  0.64750955325939
173  epoch 21  sequence. Average reward =  -5.800431160713317 . epsilon =  0.647444802304064
173  epoch 22  sequence. Average reward =  -8.027531456359153 . epsilon =  0.6473800578238336
173  epoch 23  sequence. Average reward =  -5.592337298471218 . epsilon =  0.6473153198180512
173  epoch 24  sequence. Average reward =  -8.499837062927496 . epsilon =  0.6472505882860694
174  epoch 0  sequence. Average reward =  -7.151527184302006 . epsilon =  0.6471858632272408
174  epoch 1  sequence. Average reward =  -5.429105664319858 . epsilon =  0.6471211446409181
174  epoch 2  sequence. Average reward =  -4.797482598913667 . epsilon =  0.647056432526454
174  epoch 3  sequence. Average reward =  -6.6291978978271136 . epsi

177  epoch 6  sequence. Average reward =  -7.580268642695536 . epsilon =  0.6419645714466807
177  epoch 7  sequence. Average reward =  -5.942166779650124 . epsilon =  0.6419003749895361
177  epoch 8  sequence. Average reward =  -6.90707221330258 . epsilon =  0.6418361849520372
177  epoch 9  sequence. Average reward =  -5.818296019465955 . epsilon =  0.641772001333542
177  epoch 10  sequence. Average reward =  -5.7074564683246924 . epsilon =  0.6417078241334087
177  epoch 11  sequence. Average reward =  -5.6125653509267055 . epsilon =  0.6416436533509953
177  epoch 12  sequence. Average reward =  -5.724384857244477 . epsilon =  0.6415794889856602
177  epoch 13  sequence. Average reward =  -8.538112512132118 . epsilon =  0.6415153310367616
177  epoch 14  sequence. Average reward =  -5.87052196258399 . epsilon =  0.6414511795036579
177  epoch 15  sequence. Average reward =  -4.842266770362227 . epsilon =  0.6413870343857075
177  epoch 16  sequence. Average reward =  -5.1997753842637415 . 

180  epoch 19  sequence. Average reward =  -5.684602972853031 . epsilon =  0.6363397873245691
180  epoch 20  sequence. Average reward =  -5.332605287725426 . epsilon =  0.6362761533458366
180  epoch 21  sequence. Average reward =  -5.301074163385677 . epsilon =  0.636212525730502
180  epoch 22  sequence. Average reward =  -5.784008144871892 . epsilon =  0.6361489044779289
180  epoch 23  sequence. Average reward =  -7.033185251261092 . epsilon =  0.6360852895874811
180  epoch 24  sequence. Average reward =  -4.874864160818905 . epsilon =  0.6360216810585224
181  epoch 0  sequence. Average reward =  -4.7206083075299 . epsilon =  0.6359580788904166
181  epoch 1  sequence. Average reward =  -5.194299626888136 . epsilon =  0.6358944830825275
181  epoch 2  sequence. Average reward =  -5.080902731338963 . epsilon =  0.6358308936342193
181  epoch 3  sequence. Average reward =  -6.204663149553087 . epsilon =  0.6357673105448559
181  epoch 4  sequence. Average reward =  -8.100248352131523 . epsi

184  epoch 7  sequence. Average reward =  -8.54883793138551 . epsilon =  0.6307642866019273
184  epoch 8  sequence. Average reward =  -5.683754821808133 . epsilon =  0.6307012101732671
184  epoch 9  sequence. Average reward =  -5.632047460505272 . epsilon =  0.6306381400522497
184  epoch 10  sequence. Average reward =  -5.613237430643711 . epsilon =  0.6305750762382445
184  epoch 11  sequence. Average reward =  -5.18396318612355 . epsilon =  0.6305120187306207
184  epoch 12  sequence. Average reward =  -7.012914476374358 . epsilon =  0.6304489675287476
184  epoch 13  sequence. Average reward =  -6.915318647822295 . epsilon =  0.6303859226319947
184  epoch 14  sequence. Average reward =  -6.593985617790862 . epsilon =  0.6303228840397315
184  epoch 15  sequence. Average reward =  -5.213003091919355 . epsilon =  0.6302598517513275
184  epoch 16  sequence. Average reward =  -4.6547032476261325 . epsilon =  0.6301968257661523
184  epoch 17  sequence. Average reward =  -8.074878302289678 . 

187  epoch 20  sequence. Average reward =  -4.976742454122402 . epsilon =  0.6252376374660126
187  epoch 21  sequence. Average reward =  -5.427275783806253 . epsilon =  0.625175113702266
187  epoch 22  sequence. Average reward =  -6.8428701188935035 . epsilon =  0.6251125961908958
187  epoch 23  sequence. Average reward =  -4.905420895361912 . epsilon =  0.6250500849312767
187  epoch 24  sequence. Average reward =  -6.446400547403449 . epsilon =  0.6249875799227835
188  epoch 0  sequence. Average reward =  -5.7240542236657355 . epsilon =  0.6249250811647913
188  epoch 1  sequence. Average reward =  -6.705640927029241 . epsilon =  0.6248625886566748
188  epoch 2  sequence. Average reward =  -8.3964778711239 . epsilon =  0.6248001023978091
188  epoch 3  sequence. Average reward =  -7.049794844259552 . epsilon =  0.6247376223875694
188  epoch 4  sequence. Average reward =  -4.725421993760772 . epsilon =  0.6246751486253306
188  epoch 5  sequence. Average reward =  -7.380105971378851 . eps

191  epoch 8  sequence. Average reward =  -5.122721265983215 . epsilon =  0.6197594118875509
191  epoch 9  sequence. Average reward =  -5.548890488321516 . epsilon =  0.6196974359463622
191  epoch 10  sequence. Average reward =  -5.140224503690388 . epsilon =  0.6196354662027675
191  epoch 11  sequence. Average reward =  -6.074149755339962 . epsilon =  0.6195735026561473
191  epoch 12  sequence. Average reward =  -7.855104598361924 . epsilon =  0.6195115453058817
191  epoch 13  sequence. Average reward =  -5.6671537806399375 . epsilon =  0.6194495941513511
191  epoch 14  sequence. Average reward =  -5.744112256851253 . epsilon =  0.619387649191936
191  epoch 15  sequence. Average reward =  -5.20457819730191 . epsilon =  0.6193257104270168
191  epoch 16  sequence. Average reward =  -7.6533904911864585 . epsilon =  0.6192637778559741
191  epoch 17  sequence. Average reward =  -6.115223379848752 . epsilon =  0.6192018514781885
191  epoch 18  sequence. Average reward =  -5.7050327793502635

194  epoch 21  sequence. Average reward =  -4.743162624342451 . epsilon =  0.614329185587588
194  epoch 22  sequence. Average reward =  -8.182588677565935 . epsilon =  0.6142677526690292
194  epoch 23  sequence. Average reward =  -6.240504382682751 . epsilon =  0.6142063258937623
194  epoch 24  sequence. Average reward =  -4.003567357695198 . epsilon =  0.6141449052611729
195  epoch 0  sequence. Average reward =  -7.948768947893388 . epsilon =  0.6140834907706468
195  epoch 1  sequence. Average reward =  -8.617906958543776 . epsilon =  0.6140220824215697
195  epoch 2  sequence. Average reward =  -5.661984642715342 . epsilon =  0.6139606802133276
195  epoch 3  sequence. Average reward =  -6.101062981658603 . epsilon =  0.6138992841453063
195  epoch 4  sequence. Average reward =  -7.738224318700729 . epsilon =  0.6138378942168917
195  epoch 5  sequence. Average reward =  -5.7556704796606954 . epsilon =  0.6137765104274701
195  epoch 6  sequence. Average reward =  -8.200271497284259 . eps
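
The epsilon values printed in the log shrink by a constant factor: each value is 0.9999 times the one on the previous line. The sketch below reproduces that schedule. The decay factor, the starting value of 1.0 and the zero-based sequence numbering are inferred from the printed numbers, not read from the training code, and the name `epsilon_after` is hypothetical.

```python
# Decay factor inferred from consecutive log lines (assumption, not read from the training code)
EPSILON_DECAY = 0.9999
# Starting value assumed to be 1.0, i.e. fully random actions at the beginning
EPSILON_START = 1.0


def epsilon_after(num_epochs, start=EPSILON_START, decay=EPSILON_DECAY):
    """Return epsilon after `num_epochs` multiplicative decay steps."""
    return start * decay ** num_epochs


# Two consecutive values copied from the log (sequence 180, epochs 22 and 23)
logged_now = 0.6361489044779289
logged_next = 0.6360852895874811
print(logged_now * EPSILON_DECAY, logged_next)

# The log shows 25 epochs per sequence; assuming sequences are counted from 0,
# sequence 180 / epoch 22 is the 4523rd epoch overall
print(epsilon_after(180 * 25 + 22 + 1), logged_now)
```

At this rate epsilon halves roughly every 6,900 epochs, so exploration stays high for a long time. A faster decay or a lower bound on epsilon would be the natural knobs to experiment with.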

In [None]:

# Save the full model (architecture and weights) so it can be restored later with load_model

print('saving model')
model.save('pendulum_model_juno_' + str(training_iterations) + '.h5')
print('model saved')


In [148]:
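# Evaluate performance on 10 test runs with 100 steps each
# (play_game is assumed to return the summed reward of one run)
trarray = []   # summed reward of each test run
rounds = 100   # number of steps per test run
for i in range(10):
    trarray.append(play_game(rounds))
    # First value: running mean of the per-step reward over all runs so far
    # Second value: per-step reward of the current run only
    print(i, ' sequence. Average test reward = ', np.average(trarray)/rounds, 'Average test reward = ', trarray[-1]/rounds)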
    

0  sequence. Average test reward =  -3.4512197585919715 Average test reward =  -3.4512197585919715
1  sequence. Average test reward =  -3.009438806659382 Average test reward =  -2.567657854726793
2  sequence. Average test reward =  -2.4452600521959984 Average test reward =  -1.3169025432692325
3  sequence. Average test reward =  -2.1629505577161874 Average test reward =  -1.3160220742767534
4  sequence. Average test reward =  -1.9895910320026298 Average test reward =  -1.2961529291483982
5  sequence. Average test reward =  -1.884748807913826 Average test reward =  -1.3605376874698074
6  sequence. Average test reward =  -1.9707895391840953 Average test reward =  -2.4870339268057107
7  sequence. Average test reward =  -2.048546774529388 Average test reward =  -2.5928474219464364
8  sequence. Average test reward =  -2.2271028859710635 Average test reward =  -3.655551777504468
9  sequence. Average test reward =  -2.136622170053155 Average test reward =  -1.3222957267919802
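
A single run says little here: the per-run rewards above range from about -1.3 to -3.7. One way to make the evaluation more meaningful is to report the spread alongside the mean. The sketch below uses the ten per-run values printed above. The choice of standard deviation and standard error as summary statistics is a suggestion, not part of the original notebook.

```python
import numpy as np

# Per-step reward of each of the 10 test runs, copied from the output above
per_run_rewards = np.array([
    -3.4512197585919715, -2.567657854726793, -1.3169025432692325,
    -1.3160220742767534, -1.2961529291483982, -1.3605376874698074,
    -2.4870339268057107, -2.5928474219464364, -3.655551777504468,
    -1.3222957267919802,
])

mean_reward = per_run_rewards.mean()
# Sample standard deviation (ddof=1) describes the run-to-run variation
std_reward = per_run_rewards.std(ddof=1)
# Standard error of the mean: how uncertain the average itself is
sem_reward = std_reward / np.sqrt(len(per_run_rewards))

print(f'mean = {mean_reward:.3f}, std = {std_reward:.3f}, sem = {sem_reward:.3f}')
```

The mean equals the last running average in the output. More test runs reduce the standard error, which helps when comparing models trained for different numbers of iterations.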


## Run pretrained model

If you have already trained a model, or want to load a pretrained model as a sanity check, use the following script (make sure you have executed the necessary cells first, starting with the imports).

While running it, consider the following questions:

- How does the performance change with the number of training iterations?
- How can we measure performance?
- Is it sufficient to run the play_game function a single time?
- How can we make sure that the evaluation is meaningful?



In [74]:
env = gym.make('Pendulum-v0')

actionbinslist = create_action_bins(20)

# Model files are named 'pendulum_model_[iterationstrained].h5',
# where iterationstrained is one of: 100, 1000, 10000
model = load_model('pendulum_model_1000.h5')

for i in range(10):
    play_game(rounds=250)
    
env.close()
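
The cell above relies on `create_action_bins`, which is defined earlier in the notebook. As a reminder of the idea, here is a minimal sketch of one way to discretize the pendulum's continuous torque, assuming evenly spaced bins over the torque range [-2, 2] of `Pendulum-v0`. The names `make_torque_bins` and `index_to_action` are hypothetical, and the notebook's own implementation may differ.

```python
import numpy as np


def make_torque_bins(num_bins, low=-2.0, high=2.0):
    """Return `num_bins` evenly spaced torque values covering [low, high]."""
    return np.linspace(low, high, num_bins)


def index_to_action(bins, index):
    """Map a discrete action index to the 1-element array that env.step expects."""
    return np.array([bins[index]])


bins = make_torque_bins(20)
print(bins[0], bins[-1], len(bins))
print(index_to_action(bins, 0))
```

With an even number of bins there is no exact zero-torque action. An odd number, such as 21, includes 0.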

## Next steps to take it from here

- Implement a frame-skipping approach
- Experiment with the discretization of the action bins (e.g. the advantages and disadvantages of triadisation)
- Experiment with the trade-off between exploration and exploitation
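
For the first item in the list above, frame skipping repeats the chosen action for several environment steps and sums the rewards, so the network is queried less often. A minimal sketch, assuming the 4-tuple `step` return value used in this tutorial. The class name `FrameSkip` and the `CountingEnv` stand-in are hypothetical and only serve to make the example runnable without gym.

```python
class FrameSkip:
    """Wrap an environment so that each step repeats the action `skip` times."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            observation, reward, is_done, info = self.env.step(action)
            total_reward += reward
            if is_done:
                break  # stop repeating once the episode has ended
        return observation, total_reward, is_done, info


class CountingEnv:
    """Tiny stand-in environment: reward -1 per step, episode ends after 10 steps."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, -1.0, self.t >= 10, {}


env = FrameSkip(CountingEnv(), skip=4)
env.reset()
print(env.step(0))  # one wrapped step advances 4 underlying steps
```

Because the wrapper exposes the same `reset` and `step` methods, it could be placed around the gym environment before it is passed to the training loop.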