# Deep Q Learning - Example 1
This is a tutorial on Deep Q-Learning using TensorFlow and the OpenAI Gym package, taken from:

https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

In [1]:
# import packages
import numpy as np
import gym
import tensorflow as tf

from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

# See TensorFlow version
print(f"TensorFlow version: {tf.__version__}")
# Check for TensorFlow GPU access
print(f"TensorFlow has access to the following devices:\n{tf.config.list_physical_devices()}")

TensorFlow version: 2.8.0
TensorFlow has access to the following devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Create the Environment
The OpenAI Gym package is used to create an environment for the learning agent. Each environment defines an action space and an observation space. 'CartPole-v0' has two actions: push the cart to the left (0) or to the right (1). Its action space is Discrete, which allows a fixed range of non-negative integers. Its observation space is a Box, which represents an n-dimensional box, so each valid observation here is an array of 4 numbers.

In [2]:
# set the environment name
ENV_NAME = 'CartPole-v0'

# make the environment
env = gym.make(ENV_NAME)

# set random seed
np.random.seed(123)
env.seed(123)

# get the number of actions available
# for the agent in the environment
num_actions = env.action_space.n

print(f'There are {num_actions} actions in the {ENV_NAME} environment.')
print(f'The action space is:\n {env.action_space}')
print(f'The observation space is:\n {env.observation_space}')
print(f'Obs. High = {env.observation_space.high}')
print(f'Obs. Low = {env.observation_space.low}')
print(f'Obs. Shape = {env.observation_space.shape}')


There are 2 actions in the CartPole-v0 environment.
The action space is:
 Discrete(2)
The observation space is:
 Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Obs. High = [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Obs. Low = [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Obs. Shape = (4,)
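
The printed bounds are easier to read with a little arithmetic. The sketch below uses only the standard library to check that the 3.4028235e+38 entries equal the largest finite float32 value (so those two components are effectively unbounded) and to convert the 0.41887903 bound from radians to degrees. The component labels are an assumption taken from the Gym CartPole documentation; the notebook itself does not print them.

```python
import math

# Largest finite IEEE-754 single-precision value: (2 - 2**-23) * 2**127
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

# Upper bounds copied from the notebook output above
obs_high = [4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]

# Component labels (assumed from the Gym CartPole documentation)
labels = ["cart position", "cart velocity", "pole angle (rad)", "pole angular velocity"]

for label, high in zip(labels, obs_high):
    unbounded = math.isclose(high, FLOAT32_MAX, rel_tol=1e-6)
    note = " (effectively unbounded)" if unbounded else ""
    print(f"{label:>22}: +/-{high:.4g}{note}")

# The angle bound expressed in degrees
angle_bound_deg = math.degrees(obs_high[2])
print(f"pole angle bound = {angle_bound_deg:.1f} degrees")
```

Only the cart position and the pole angle have meaningful limits; the two velocity components can take any float32 value.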


# Building the Model
The next step is to build the model to be trained. For this example, we will use a 3-layer model: an input layer that flattens the observation, followed by two fully connected layers with 'ReLU' and 'Linear' activation functions, respectively. The output layer has one unit per action that the agent can take, in this case 2.

In [3]:
# construct the model
model = Sequential([
    Flatten(input_shape=(1, ) + env.observation_space.shape),
    Dense(16, activation='relu'),
    Dense(num_actions, activation='linear')
])

# output the model summary
print(model.summary())

Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 4)                 0         
                                                                 
 dense (Dense)               (None, 16)                80        
                                                                 
 dense_1 (Dense)             (None, 2)                 34        
                                                                 
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None


2022-02-16 19:46:19.052033: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-16 19:46:19.052148: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
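
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced only for this check.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

obs_dim = 4        # length of a CartPole observation after flattening
hidden_units = 16  # width of the hidden Dense layer
num_actions = 2    # one output per action

hidden = dense_params(obs_dim, hidden_units)      # 4 * 16 + 16
output = dense_params(hidden_units, num_actions)  # 16 * 2 + 2
total = hidden + output

print(f"hidden: {hidden}, output: {output}, total: {total}")
```

The Flatten layer only reshapes its input, so it contributes no parameters.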


# Configure & Compile the Agent
For this example, we will use an epsilon-greedy policy and a sequential memory. The memory stores the actions we performed and the rewards obtained for each action, so that they can be replayed during training. We then set the agent parameters and the optimizer, and finally compile the agent. Some possible metrics for compilation are:
1. Mean Squared Error ('mse')
2. Mean Absolute Error ('mae')
3. Mean Absolute Percentage Error ('mape')
4. Cosine Proximity ('cosine')
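
To make the policy and memory choices above more concrete, here is a minimal sketch in plain Python. It is not the keras-rl implementation: `epsilon_greedy` and `ReplayMemory` are hypothetical names, the transition tuple layout is an assumption, and the epsilon values are chosen only to show the two extremes.

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

class ReplayMemory:
    """Hypothetical fixed-size buffer of (state, action, reward, next_state, done)."""

    def __init__(self, limit):
        # A deque with maxlen drops the oldest transition once the limit is reached
        self.buffer = deque(maxlen=limit)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniformly sample distinct stored transitions
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

rng = random.Random(123)
q_values = [0.2, 0.7]

# epsilon = 0 always exploits, epsilon = 1 always explores
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)

demo_memory = ReplayMemory(limit=3)
for step in range(5):
    demo_memory.append((step, 0, 1.0, step + 1, False))
batch = demo_memory.sample(2, rng=rng)

print(greedy_action, random_action, len(demo_memory), batch)
```

Sampling past transitions at random, rather than training on them in order, reduces the correlation between consecutive training examples.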

In [4]:
# set the policy
policy = EpsGreedyQPolicy()

# set the memory
memory = SequentialMemory(limit=50000, window_length=1)

# create the Deep Q agent
dqn = DQNAgent(
    model=model,
    nb_actions=num_actions,
    memory=memory,
    nb_steps_warmup=250,
    target_model_update=1e-2,
    policy=policy
)

# set the optimizer with the learning rate
opt = Adam(learning_rate=0.001)

# compile the agent with mean absolute error (mae) 
dqn.compile(opt, metrics=['mae'])

TypeError: Keras symbolic inputs/outputs do not implement `__len__`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model. This error will also get raised if you try asserting a symbolic input/output directly.

# Training
Now the agent can be trained using the .fit() method. Setting visualize=True renders the environment during training, but rendering slows training down, especially for longer runs, so it is turned off here.

In [36]:
# set the number of training steps
n_steps = 10000

# train the model
history = dqn.fit(
    env,
    nb_steps=n_steps,
    visualize=False,
    verbose=1,
    log_interval=n_steps
)

# close the environment after training
env.close()


Training for 10000 steps ...
Interval 1 (0 steps performed)
done, took 87.786 seconds
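
For intuition about what happens during those training steps, the sketch below shows two pieces of a generic DQN update in plain Python: the bootstrapped target for a single transition, and a soft update of the target network. This is a sketch under stated assumptions, not the keras-rl code. The function names are hypothetical, the discount factor `gamma=0.99` is an assumed value, and `tau=1e-2` mirrors `target_model_update` under the assumption that values below 1 are treated as a soft-update rate.

```python
import math

def td_target(reward, next_q_values, done, gamma=0.99):
    """Return r for a terminal step, else r + gamma * max over next-state Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the matching online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]

# Non-terminal step: the target bootstraps from the best next-state value
y = td_target(reward=1.0, next_q_values=[10.0, 12.0], done=False)

# Terminal step: there is no future value to add
y_terminal = td_target(reward=1.0, next_q_values=[10.0, 12.0], done=True)

# Target weights drift slowly toward the online weights
new_target = soft_update([0.0, 1.0], [1.0, 1.0])

print(y, y_terminal, new_target)
```

The network is trained to bring its prediction for the chosen action closer to this target, while the slowly moving target network keeps the target from shifting too quickly.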


# Testing
Lastly, we can test the model to see how well it was trained.

In [39]:
# test the model
dqn.test(env, nb_episodes=10, visualize=True)

# close the environment after testing
env.close()

Testing for 10 episodes ...
Episode 1: reward: 99.000, steps: 99
Episode 2: reward: 96.000, steps: 96
Episode 3: reward: 103.000, steps: 103
Episode 4: reward: 102.000, steps: 102
Episode 5: reward: 107.000, steps: 107
Episode 6: reward: 115.000, steps: 115
Episode 7: reward: 164.000, steps: 164
Episode 8: reward: 111.000, steps: 111
Episode 9: reward: 103.000, steps: 103
Episode 10: reward: 93.000, steps: 93
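
To put these numbers in context, the rewards can be summarized and compared with the benchmark for this environment. The sketch below copies the ten episode rewards from the output above. The episode cap of 200 steps and the reward threshold of 195 are the registered values for 'CartPole-v0' as far as I know, and are stated here as assumptions rather than something the notebook prints.

```python
from statistics import mean

# Episode rewards copied from the test output above
rewards = [99, 96, 103, 102, 107, 115, 164, 111, 103, 93]

MAX_EPISODE_STEPS = 200   # assumed step cap for CartPole-v0
REWARD_THRESHOLD = 195.0  # assumed "solved" threshold for CartPole-v0

mean_reward = mean(rewards)
print(f"mean = {mean_reward:.1f}, min = {min(rewards)}, max = {max(rewards)}")
print(f"episodes reaching the step cap: {sum(r >= MAX_EPISODE_STEPS for r in rewards)}")
print(f"meets threshold: {mean_reward >= REWARD_THRESHOLD}")
```

By this measure the agent balances the pole for roughly half of the maximum episode length on average, so more training steps or further tuning would be needed to reach the threshold.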
