# Setting our program's arguments

A few arguments control the model, the environment and the debug output of this notebook.

> **Environments**: For all possible environments please have a look at: [OpenAI Environments](https://github.com/openai/gym/wiki/Table-of-environments)

| Variable | Description | Default | Possible values | Type |
| :- | :- | :- | :- | :- |
|| **ENVIRONMENT OPTIONS** ||||
| **ENVIRONMENT** | The selected environment (from the environments list or just typed in by hand) | 'CartPole-v1' | A string representing a gym environment | str |
|| **DEBUG OPTIONS** ||||
| **VERBOSE** | How much debug output is printed: 0 for none, 1 for normal debug printing and 2 for normal + GPU device placement printing (this option is very spammy and resource intensive) | 1 | 0/1/2 | int |
| **VISUALIZE['train']** | Whether to visualize training | False | - | bool |
| **VISUALIZE['evaluate']** | Whether to visualize evaluation | True | - | bool |
|| **AGENT OPTIONS** ||||
| **LOAD_PRETRAINED** | If you already have a saved model, load its weights from `./models/{ENVIRONMENT}.h5` | False | - | bool |
| **STEPS['train']** | The number of training games (episodes) we wish to play | 10k | > 0 | int |
| **STEPS['evaluate']** | The number of evaluation games (episodes) we wish to evaluate on | 10 | > 0 | int |

In [1]:
ENVIRONMENT = 'CartPole-v1'

# Parameters
LOAD_PRETRAINED = True
STEPS = {
    'train': 10000,
    'evaluate': 100
}

# Logging and visualisation
VERBOSE = 1
VISUALIZE = {
    'train': False,
    'evaluate': True
}

# Adding our packages

If your notebook stops here, please make sure you have TensorFlow 2.0 or above installed.

In [2]:
from packaging import version
import numpy as np
import random
import gym
import os

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense
from collections import deque

# Purely for logging
from gym.spaces.discrete import Discrete

# Check to make sure you're running TensorFlow 2.0+
assert version.parse(tf.__version__).release[0] >= 2, \
"This notebook requires TensorFlow 2.0 or above."

# Setting up our environment

Our model has inputs and outputs:
- The inputs are the things we perceive, also known as the observation space (`env.observation_space`).
- The outputs are the things we can do in the environment, also known as the action space (`env.action_space`).

This notebook distinguishes two types of spaces in OpenAI gym:
- Discrete: a fixed set of choices, where an action is either on or off but never in between (0/1).
- Box/Real: continuous values within bounds, so an action can be partly on or partly off (0, .1, .11, .5, 1).

Box action spaces usually tend to have fewer actions, since you have finer control over every individual action.
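
To make the difference concrete, here is a minimal sketch in plain Python. The helper names `sample_discrete` and `sample_box` are hypothetical stand-ins rather than gym's API; they only mimic what sampling from each kind of space could look like.

```python
import random

def sample_discrete(n, rng):
    """Pick exactly one of n choices, similar to a discrete action space."""
    return rng.randrange(n)

def sample_box(low, high, size, rng):
    """Pick `size` continuous values within [low, high], similar to a box action space."""
    return [rng.uniform(low, high) for _ in range(size)]

rng = random.Random(0)  # seeded so the sketch is repeatable
discrete_action = sample_discrete(2, rng)   # e.g. CartPole: push left (0) or push right (1)
box_action = sample_box(-1.0, 1.0, 1, rng)  # e.g. a single continuous value (assumed bounds)

print(f'discrete action: {discrete_action}')
print(f'box action: {box_action}')
```

The discrete sample is always a whole number below `n`, while the box sample can land anywhere between the bounds.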

In [3]:
env = gym.make(ENVIRONMENT)

# Check if we're running a Discrete (on/off) or Box action_space (0,.1,.11,1)
if isinstance(env.action_space, Discrete):
    possible_actions = env.action_space.n
    ACTION_TYPE = 'discrete'
else:
    possible_actions = env.action_space.shape[0]
    ACTION_TYPE = 'real'

observation_shape = env.observation_space.shape

# Log some information about the environment you're playing
print(f'You\'re playing {ENVIRONMENT}')
print(f'You have control over {possible_actions} {ACTION_TYPE} actions')
print(f'Your observable space has a shape of: {observation_shape}')

You're playing CartPole-v1
You have control over 2 discrete actions
Your observable space has a shape of: (4,)




# Checking our TensorFlow setup

Here we check which TensorFlow version is installed, whether it was built with GPU (CUDA) support, and which GPUs it can see.

In [4]:
if VERBOSE >= 1:  # levels 1 and 2 both print the normal debug info
    print(f'Tensorflow Version: {tf.__version__}')
    print(f'Tensorflow Build: {"GPU" if tf.test.is_built_with_cuda() else "CPU"}')
          
if VERBOSE == 2:
    # Use this to check if your GPU is actually utilized
    tf.debugging.set_log_device_placement(True)


# To check if it can see and possibly utilize your gpu(s)
gpu_lst = tf.config.experimental.list_physical_devices('GPU')

for gpu in gpu_lst:
    # Allocate memory on the fly instead of preallocating all your VRAM
    tf.config.experimental.set_memory_growth(gpu, True)
    print(f'GPU: {gpu}')

Tensorflow Version: 2.1.0
Tensorflow Build: GPU
GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


# Creating your model

A model always has a certain number of inputs and outputs, and a way to transform the inputs into outputs.
The little knobs we can turn to change that transformation are called hyperparameters. Our agent has the following:

### HyperParameters
| Math Symbol | Variable | Description | Default | Possible values | Type |
| :- | :- | :- | :- | :- | :- |
| - | **inputs** | The input shape (the observation shape) | (4,) | every dimension > 0 | tuple |
| - | **outputs** | The output size (the number of actions) | 2 | > 0 | int |
| - | **learning_rate** | The initial step size of our model | 1e-3 | > 0 | float |
| - | **memory** | How many past transitions the replay memory can hold | 2k | > 0 | int |
| γ | **gamma** | The discount factor: how much future rewards count compared to immediate ones | 0.95 | 0 - 1 | float |
| 𝜖 | **epsilon** | The chance of our network taking a random action instead of a predicted one | 1.0 | 0 - 1 | float |
| 𝜖 | **epsilon_low** | (EPSILON lower bound) How much our network still explores once it has learned patterns | 1e-2 | 0 - <1 | float |
| 𝜖-greedy | **epsilon_decay** | The factor EPSILON is multiplied by after each replay (`EPSILON *= EPSILON_DECAY`) | 0.95 | 0 - <1 | float |
| - | **batch_size** | How many memories are sampled for each replay | 32 | > 0 | int |
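
As a rough sketch of how `epsilon`, `epsilon_low` and `epsilon_decay` interact, the plain-Python snippet below applies the same rule the agent uses (keep multiplying while `epsilon` is above the lower bound) and counts how many decays that takes with the defaults. The function name `decays_until_floor` is made up for this illustration.

```python
def decays_until_floor(epsilon=1.0, epsilon_low=1e-2, epsilon_decay=0.95):
    """Count the multiplications needed before epsilon drops to epsilon_low or below."""
    steps = 0
    while epsilon > epsilon_low:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon

steps, final_epsilon = decays_until_floor()
print(f'{steps} decays, final epsilon {final_epsilon:.5f}')
```

With the defaults this takes 90 decays, so exploration has mostly faded after roughly the first 90 replays that actually train.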

In [5]:
class Agent:
    def __init__(self, observation_shape, number_of_actions, environment_name):
        self.inputs = observation_shape
        self.outputs = number_of_actions
        self.env_name = environment_name # the model doesn't use this, it's just used as a name for the weights file
        self.memory = deque(maxlen=2000)
        self.gamma = .95
        self.epsilon = 1.0
        self.epsilon_low = 1e-2
        self.epsilon_decay = .95
        self.learning_rate = 1e-3
        self.batch_size = 32
        self.model = self._create_model()
    
    def _create_model(self):
        """ Creates a keras model
        
            Returns:
                (Sequential): A sequential keras model
        """
        # Neural Net for Deep-Q learning Model
        model = Sequential([
            Dense(24, input_shape=self.inputs, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.outputs, activation='softmax' if ACTION_TYPE == 'real' else 'linear')
        ])
        
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        
        if VERBOSE >= 1:
            model.summary()
        
        return model

    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def save_weights(self):
        """ saves the weights into a .h5 file for later use """
        self.model.save_weights(f"./models/{self.env_name}.h5")
        
    def load_weights(self):
        """ loads the weights from a .h5 file """
        self.model.load_weights(f"./models/{self.env_name}.h5")
        
    def act(self, state):
        """
            Returns:
                (int):
                    The index of the chosen action, in range(self.outputs).
                    The value is either predicted or random based on the epsilon trigger.
        """
        if np.random.rand() <= self.epsilon: # See if we explore (do a random action) or predict
            return random.randrange(self.outputs)
        
        return np.argmax(self.model.predict(state)[0]) # return the action with the highest predicted value

    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        for state, action, reward, next_state, done in batch:
            target = reward
            
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            
            target_f = self.model.predict(state)
            target_f[0][action] = target
            
            self.model.fit(
                state,
                target_f,
                verbose=0,
                epochs=1
            )
            
        # Slowly base actions more on predictions than on random actions
        if self.epsilon > self.epsilon_low:
            self.epsilon *= self.epsilon_decay
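
The target computed inside `replay` is the reward plus the discounted best predicted value of the next state, or just the reward when the episode has ended. A minimal sketch of that calculation with made-up numbers (the helper `q_target` and the example values are illustrative and not part of the notebook):

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Return the training target for the action that was taken."""
    if done:
        # There is no future reward after a terminal state
        return reward
    return reward + gamma * max(next_q_values)

# Same transition, once mid-episode and once as the final step
print(q_target(1.0, [0.5, 2.0], done=False))
print(q_target(1.0, [0.5, 2.0], done=True))
```

Only the entry for the action that was taken is overwritten with this target; the other outputs keep their predicted values, so they contribute nothing to the loss.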

# Create our agent

In [6]:
agent = Agent(
    observation_shape = observation_shape,
    number_of_actions = possible_actions,
    environment_name = ENVIRONMENT
)

# If we have a saved model, load it.
if (LOAD_PRETRAINED and os.path.isfile(f'./models/{ENVIRONMENT}.h5')):
    print('loading model')
    agent.load_weights()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 24)                120       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
loading model


# Boiling your PC, also known as "the training phase"


<img src="https://4.bp.blogspot.com/-Cu1mJOh11AU/XAIcUyPK0WI/AAAAAAAANNA/BRlNj0Cbt6EJHNH25D4RhB0e6_sbL1Y8QCLcBGAs/s1600/28056576_10213577221682063_7572084637958860851_n.jpg" width="500px" align="right"/>

In [None]:
SCORES = []

for step in range(STEPS["train"] + 1):
    state = env.reset()
    state = np.reshape(state, [1, observation_shape[0]])
    
    done = False
    score = 0
    while not done:
        if VISUALIZE['train']:
            env.render()
        
        action = agent.act(state)
        
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, observation_shape[0]])
        
        agent.memorize(state, action, reward, next_state, done)
        
        state = next_state
        score += 1
        
    print(f"Episode: {step}/{STEPS['train']} score: {score}")
    SCORES.append(score)
    
    # Save our weights every 100 episodes
    if step % 100 == 0 and step != 0:
        print(f"Mean score of last 100 actions is {np.mean(SCORES[-100:])}")
        agent.save_weights()
    
    # Retrain on memory
    agent.replay()

Episode: 0/10000 score: 228
Episode: 1/10000 score: 28
Episode: 2/10000 score: 84
Episode: 3/10000 score: 67
Episode: 4/10000 score: 23
Episode: 5/10000 score: 22
Episode: 6/10000 score: 25
Episode: 7/10000 score: 16
Episode: 8/10000 score: 19
Episode: 9/10000 score: 12
Episode: 10/10000 score: 15
Episode: 11/10000 score: 25
Episode: 12/10000 score: 37
Episode: 13/10000 score: 69
Episode: 14/10000 score: 55
Episode: 15/10000 score: 34
Episode: 16/10000 score: 50
Episode: 17/10000 score: 18
Episode: 18/10000 score: 23
Episode: 19/10000 score: 16
Episode: 20/10000 score: 22
Episode: 21/10000 score: 92
Episode: 22/10000 score: 28
Episode: 23/10000 score: 72
Episode: 24/10000 score: 22
Episode: 25/10000 score: 57
Episode: 26/10000 score: 32
Episode: 27/10000 score: 27
Episode: 28/10000 score: 110
Episode: 29/10000 score: 94
Episode: 30/10000 score: 59
Episode: 31/10000 score: 31
Episode: 32/10000 score: 21
Episode: 33/10000 score: 29
Episode: 34/10000 score: 38
Episode: 35/10000 score: 50


Episode: 280/10000 score: 171
Episode: 281/10000 score: 197
Episode: 282/10000 score: 197
Episode: 283/10000 score: 224
Episode: 284/10000 score: 244
Episode: 285/10000 score: 445
Episode: 286/10000 score: 388
Episode: 287/10000 score: 500
Episode: 288/10000 score: 185
Episode: 289/10000 score: 198
Episode: 290/10000 score: 226
Episode: 291/10000 score: 500
Episode: 292/10000 score: 500
Episode: 293/10000 score: 500
Episode: 294/10000 score: 301
Episode: 295/10000 score: 210
Episode: 296/10000 score: 332
Episode: 297/10000 score: 348
Episode: 298/10000 score: 170
Episode: 299/10000 score: 163
Episode: 300/10000 score: 162
Mean score of last 100 actions is 244.04
Episode: 301/10000 score: 184
Episode: 302/10000 score: 215
Episode: 303/10000 score: 217
Episode: 304/10000 score: 205
Episode: 305/10000 score: 176
Episode: 306/10000 score: 201
Episode: 307/10000 score: 208
Episode: 308/10000 score: 277
Episode: 309/10000 score: 148
Episode: 310/10000 score: 178
Episode: 311/10000 score: 175

Episode: 550/10000 score: 300
Episode: 551/10000 score: 241
Episode: 552/10000 score: 407
Episode: 553/10000 score: 223
Episode: 554/10000 score: 500
Episode: 555/10000 score: 193
Episode: 556/10000 score: 215
Episode: 557/10000 score: 343
Episode: 558/10000 score: 209
Episode: 559/10000 score: 500
Episode: 560/10000 score: 291
Episode: 561/10000 score: 373
Episode: 562/10000 score: 500
Episode: 563/10000 score: 460
Episode: 564/10000 score: 349
Episode: 565/10000 score: 461
Episode: 566/10000 score: 238
Episode: 567/10000 score: 283
Episode: 568/10000 score: 329
Episode: 569/10000 score: 198
Episode: 570/10000 score: 308
Episode: 571/10000 score: 242
Episode: 572/10000 score: 181
Episode: 573/10000 score: 182
Episode: 574/10000 score: 446
Episode: 575/10000 score: 252
Episode: 576/10000 score: 500
Episode: 577/10000 score: 304
Episode: 578/10000 score: 231
Episode: 579/10000 score: 305
Episode: 580/10000 score: 500
Episode: 581/10000 score: 193
Episode: 582/10000 score: 305

Episode: 820/10000 score: 214
Episode: 821/10000 score: 234
Episode: 822/10000 score: 234
Episode: 823/10000 score: 221
Episode: 824/10000 score: 215
Episode: 825/10000 score: 232
Episode: 826/10000 score: 329
Episode: 827/10000 score: 206
Episode: 828/10000 score: 229
Episode: 829/10000 score: 227
Episode: 830/10000 score: 252
Episode: 831/10000 score: 285
Episode: 832/10000 score: 211
Episode: 833/10000 score: 238
Episode: 834/10000 score: 240
Episode: 835/10000 score: 266
Episode: 836/10000 score: 262
Episode: 837/10000 score: 395
Episode: 838/10000 score: 500
Episode: 839/10000 score: 500
Episode: 840/10000 score: 373
Episode: 841/10000 score: 303
Episode: 842/10000 score: 230
Episode: 843/10000 score: 226
Episode: 844/10000 score: 253
Episode: 845/10000 score: 500
Episode: 846/10000 score: 151
Episode: 847/10000 score: 500
Episode: 848/10000 score: 397
Episode: 849/10000 score: 336
Episode: 850/10000 score: 492
Episode: 851/10000 score: 338
Episode: 852/10000 score: 178

Episode: 1089/10000 score: 141
Episode: 1090/10000 score: 153
Episode: 1091/10000 score: 130
Episode: 1092/10000 score: 129
Episode: 1093/10000 score: 153
Episode: 1094/10000 score: 126
Episode: 1095/10000 score: 131
Episode: 1096/10000 score: 117
Episode: 1097/10000 score: 143
Episode: 1098/10000 score: 133
Episode: 1099/10000 score: 153
Episode: 1100/10000 score: 141
Mean score of last 100 actions is 170.51
Episode: 1101/10000 score: 15
Episode: 1102/10000 score: 369
Episode: 1103/10000 score: 160
Episode: 1104/10000 score: 247
Episode: 1105/10000 score: 134
Episode: 1106/10000 score: 123
Episode: 1107/10000 score: 161
Episode: 1108/10000 score: 177
Episode: 1109/10000 score: 175
Episode: 1110/10000 score: 190
Episode: 1111/10000 score: 134
Episode: 1112/10000 score: 138
Episode: 1113/10000 score: 201
Episode: 1114/10000 score: 213
Episode: 1115/10000 score: 259
Episode: 1116/10000 score: 192
Episode: 1117/10000 score: 122
Episode: 1118/10000 score: 147
Episode: 1119/10000 score: 195

Episode: 1352/10000 score: 118
Episode: 1353/10000 score: 127
Episode: 1354/10000 score: 152
Episode: 1355/10000 score: 138
Episode: 1356/10000 score: 331
Episode: 1357/10000 score: 105
Episode: 1358/10000 score: 202
Episode: 1359/10000 score: 321
Episode: 1360/10000 score: 234
Episode: 1361/10000 score: 90
Episode: 1362/10000 score: 189
Episode: 1363/10000 score: 110
Episode: 1364/10000 score: 102
Episode: 1365/10000 score: 114
Episode: 1366/10000 score: 145
Episode: 1367/10000 score: 76
Episode: 1368/10000 score: 124
Episode: 1369/10000 score: 97
Episode: 1370/10000 score: 348
Episode: 1371/10000 score: 111
Episode: 1372/10000 score: 151
Episode: 1373/10000 score: 115
Episode: 1374/10000 score: 117
Episode: 1375/10000 score: 129
Episode: 1376/10000 score: 130
Episode: 1377/10000 score: 109
Episode: 1378/10000 score: 287
Episode: 1379/10000 score: 500
Episode: 1380/10000 score: 198
Episode: 1381/10000 score: 192
Episode: 1382/10000 score: 143
Episode: 1383/10000 score: 98

Episode: 1614/10000 score: 127
Episode: 1615/10000 score: 132
Episode: 1616/10000 score: 205
Episode: 1617/10000 score: 225
Episode: 1618/10000 score: 213
Episode: 1619/10000 score: 117
Episode: 1620/10000 score: 227
Episode: 1621/10000 score: 186
Episode: 1622/10000 score: 187
Episode: 1623/10000 score: 221
Episode: 1624/10000 score: 246
Episode: 1625/10000 score: 500
Episode: 1626/10000 score: 280
Episode: 1627/10000 score: 199
Episode: 1628/10000 score: 349
Episode: 1629/10000 score: 212
Episode: 1630/10000 score: 191
Episode: 1631/10000 score: 255
Episode: 1632/10000 score: 101
Episode: 1633/10000 score: 370
Episode: 1634/10000 score: 250
Episode: 1635/10000 score: 221
Episode: 1636/10000 score: 202
Episode: 1637/10000 score: 186
Episode: 1638/10000 score: 130
Episode: 1639/10000 score: 163
Episode: 1640/10000 score: 217
Episode: 1641/10000 score: 118
Episode: 1642/10000 score: 105
Episode: 1643/10000 score: 101
Episode: 1644/10000 score: 147
Episode: 1645/10000 score: 65

Episode: 1877/10000 score: 111
Episode: 1878/10000 score: 120
Episode: 1879/10000 score: 115
Episode: 1880/10000 score: 130
Episode: 1881/10000 score: 127
Episode: 1882/10000 score: 116
Episode: 1883/10000 score: 20
Episode: 1884/10000 score: 111
Episode: 1885/10000 score: 104
Episode: 1886/10000 score: 104
Episode: 1887/10000 score: 116
Episode: 1888/10000 score: 112
Episode: 1889/10000 score: 117
Episode: 1890/10000 score: 104
Episode: 1891/10000 score: 116
Episode: 1892/10000 score: 121
Episode: 1893/10000 score: 134
Episode: 1894/10000 score: 105
Episode: 1895/10000 score: 118
Episode: 1896/10000 score: 110
Episode: 1897/10000 score: 158
Episode: 1898/10000 score: 125
Episode: 1899/10000 score: 130
Episode: 1900/10000 score: 108
Mean score of last 100 actions is 129.03
Episode: 1901/10000 score: 137
Episode: 1902/10000 score: 132
Episode: 1903/10000 score: 124
Episode: 1904/10000 score: 165
Episode: 1905/10000 score: 111
Episode: 1906/10000 score: 189
Episode: 1907/10000 score: 32


# Evaluating your model

In [None]:
for step in range(STEPS['evaluate'] + 1):
    state = env.reset()
    # Reshape to a batch of one, as in the training loop, so the model can predict on it
    state = np.reshape(state, [1, observation_shape[0]])
    done = False

    while not done:
        if VISUALIZE['evaluate']:
            env.render()
        
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        next_state = np.reshape(next_state, [1, observation_shape[0]])
        state = next_state

# Lastly close the environment
env.close()
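
Note that `agent.act` keeps taking random actions with probability `epsilon`, also during evaluation. If you only want to evaluate the learned policy, one option is to act greedily. The sketch below shows the idea in plain Python; `select_action` is a hypothetical helper and the Q-values are made up.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the highest predicted value
    return max(range(len(q_values)), key=lambda i: q_values[i])

# With epsilon set to 0 the choice is always greedy
print(select_action([0.1, 0.9], epsilon=0.0))
```

In the notebook, setting `agent.epsilon = 0` before the evaluation loop should have a similar effect, since `agent.act` would then predict practically every time.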