# A Minimal Deep Q-Learning Implementation (minDQN)
Running this code renders an agent solving the CartPole
environment from OpenAI Gym. Our minimal Deep Q-Network is
approximately 150 lines of code. The implementation
uses TensorFlow and Keras and should generally run in less than 15 minutes.

In [3]:
# import packages
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import HeUniform
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import Huber
from tensorflow.keras.optimizers import Adam

# See TensorFlow version
print(f"TensorFlow version: {tf.__version__}")
# Check for TensorFlow GPU access
print(
    'TensorFlow has access to the following devices:'
    f'\n{tf.config.list_physical_devices()}'
)


TensorFlow version: 2.8.0
TensorFlow has access to the following devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Set Up the Learning Agent
Set up the learning agent and seed the random number generators used by TensorFlow, NumPy, and the environment.

In [7]:
RANDOM_SEED = 5  # seed the random number generation
ENV_NAME = 'CartPole-v1'  # define the gym environment name
TRAIN_EPISODES = 300  # number of training episodes
TEST_EPISODES = 100  # number of test episodes

# create the environment
env = gym.make(ENV_NAME)

# seed random numbers
tf.random.set_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
env.seed(RANDOM_SEED)

# get the number of actions available
# for the agent in the environment
num_actions = env.action_space.n

print(f'There are {num_actions} actions in the {ENV_NAME} environment.')
print(f'The action space is:\n {env.action_space}')
print(f'\nThe observation space is:\n {env.observation_space}')
print(f'\nObs. High = {env.observation_space.high}')
print(f'Obs. Low = {env.observation_space.low}')
print(f'Obs. Shape = {env.observation_space.shape}')

There are 2 actions in the CartPole-v1 environment.
The action space is:
 Discrete(2)

The observation space is:
 Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

Obs. High = [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Obs. Low = [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Obs. Shape = (4,)
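
The bounds printed above can be reproduced by hand. The sketch below assumes the standard CartPole termination thresholds from the Gym documentation (cart position 2.4 and pole angle 12 degrees, neither of which appears in this notebook) and assumes the observation space is declared at twice those thresholds. The helper `as_float32` is a hypothetical name introduced only for illustration.

```python
import math
import struct

# Assumed CartPole termination thresholds (from the Gym documentation,
# not from this notebook): cart position 2.4, pole angle 12 degrees.
X_THRESHOLD = 2.4
THETA_THRESHOLD_DEG = 12.0


def as_float32(value):
    """Round a Python float to the nearest float32 (the dtype of the Box space)."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Position and angle bounds are assumed to be twice the termination thresholds.
position_bound = as_float32(2 * X_THRESHOLD)
angle_bound = as_float32(2 * math.radians(THETA_THRESHOLD_DEG))

# The two velocity entries are unbounded in practice: their limit is the
# largest finite float32 value.
float32_max = (2 - 2**-23) * 2.0**127

print(position_bound)
print(angle_bound)
print(float32_max)
```

Under these assumptions, the cart position and pole angle have physically meaningful limits, while the two velocity components are bounded only by the float32 range.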


# Construct the Model
The agent maps states (the network inputs) to actions (the network outputs). For this problem, we use a three-layer architecture: a fully connected input layer with a ReLU activation, followed by two more fully connected layers with ReLU and linear activations, respectively. The network outputs one Q-value per action, and the greedy action is the index of the largest output. For example, if a network with four actions outputs [0.1, 0.7, 0.1, 0.3], the highest Q-value is 0.7 and its index, 1, is the action the agent selects. For CartPole the output has only two entries, one per action.
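
As a minimal sketch of the action selection described above, the following picks the index of the largest Q-value from the example output. The function name `greedy_action` is hypothetical and not part of this notebook. With NumPy arrays, `np.argmax` performs the same lookup.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Example network output from the text: one Q-value per action.
q_values = [0.1, 0.7, 0.1, 0.3]
action = greedy_action(q_values)

print(f"action index: {action}, Q-value: {q_values[action]}")
```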

In [9]:
# define the initialization kernel
init_kernel = HeUniform()

# define the model
model = Sequential([
    Dense(
        24,
        input_shape=env.observation_space.shape,
        activation='relu',
        kernel_initializer=init_kernel
    ),
    Dense(12, activation='relu', kernel_initializer=init_kernel),
    Dense(env.action_space.n, activation='linear', kernel_initializer=init_kernel)
])

# display the model summary
# summary() prints the table itself and returns None,
# so it should not be wrapped in print()
model.summary()

2022-02-16 20:24:25.354485: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-16 20:24:25.354679: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 24)                120       
                                                                 
 dense_4 (Dense)             (None, 12)                300       
                                                                 
 dense_5 (Dense)             (None, 2)                 26        
                                                                 
Total params: 446
Trainable params: 446
Non-trainable params: 0
_________________________________________________________________
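
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the 4-dimensional CartPole observation as the input size. `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units


# Observation size followed by the width of each Dense layer in the model.
layer_sizes = [4, 24, 12, 2]

# Pair each layer's input size with its output size and count parameters.
per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer, total)
```

The per-layer values should match the `Param #` column, and their sum should match `Total params`.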
