___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>


# Manually Creating a DQN Model


## Deep-Q-Learning
In this notebook we will create our first Deep Reeinforcement Learning model, called Deep-Q-Network (DQN)
We are again using a simple environment from openai gym. <br />
However, you will soon see the enormous gain we will get by switching from standard Q-Learning to Deep Q Learning.

In this notebook we again take a look at the CartPole problem (https://gym.openai.com/envs/CartPole-v1/)



Let us start by importing the necessary packages

# Part 0: Imports

Notice how we're importing the TF libraries here at the top together, in some rare instances, if you import them later on, you get strange bugs, so best just to import everything from Tensorflow here at the top.

In [16]:
from collections import deque
import random

import numpy as np
import gymnasium as gym
from tensorflow.keras.models import Sequential  # To compose multiple Layers
from tensorflow.keras.layers import Dense  # Fully-Connected layer
from tensorflow.keras.layers import Activation  # Activation functions
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import clone_model

# Part 1: The Environment

In [17]:
env_name = 'CartPole-v1'
env = gym.make("CartPole-v1", render_mode="human")

Remember, the goal of the CartPole challenge was to balance the stick upright

env.reset()  # reset the environment to the initial state
for _ in range(200):  # play for max 200 iterations
    env.render()  # render the current game state on your screen
    random_action = env.action_space.sample()  # chose a random action
    env.step(random_action)  # execute that action
env.close()  # close the environment

# Part 2: The Artificial Neural Network

### Let us build our first Neural Network
To build our network, we first need to find out how many actions and observation our environment has.
We can either get those information from the source code (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py) or via the following commands:

In [18]:
num_actions = env.action_space.n
num_observations = env.observation_space.shape[0]  # You can use this command to get the number of observations
print(f"There are {num_actions} possible actions and {num_observations} observations")


There are 2 possible actions and 4 observations


So our network needs to have an input dimension of 4 and an output dimension of 2.
In between we are free to chose.

Let's just say we want to use a four layer architecture:


1. The first layer has 16 neurons
2. The second layer has 32 neurons
4. The fourth layer (output layer) has 2 neurons

This yields 690 parameters
$$ \text{4 observations} * 16 (\text{neurons}) + 16 (\text{bias}) + (16*32) + 32 + (32*2)+2 = 690$$

In [19]:
model = Sequential()

model.add(Dense(16, input_shape=(1, num_observations)))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))


model.add(Dense(num_actions))
model.add(Activation('linear'))

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 1, 16)             80        
                                                                 
 activation_3 (Activation)   (None, 1, 16)             0         
                                                                 
 dense_4 (Dense)             (None, 1, 32)             544       
                                                                 
 activation_4 (Activation)   (None, 1, 32)             0         
                                                                 
 dense_5 (Dense)             (None, 1, 2)              66        
                                                                 
 activation_5 (Activation)   (None, 1, 2)              0         
                                                                 
Total params: 690
Trainable params: 690
Non-trainable 

Now we have our model which takes an observation as input and outputs a value for each action.
The higher the value, the more likely that this value is a suitable action for the current observation

As stated in the lecture, Deep-Q-Learning works better when using a target network.
So let's just copy the above network

In [20]:
#model.load_weights("34.ckt")
target_model = clone_model(model)


Now it is time to define our hyperparameters.

# Part 3: Hyperparameters and Update Function

In [21]:
EPOCHS = 1000

epsilon = 1.0
EPSILON_REDUCE = 0.995  # is multiplied with epsilon each epoch to reduce it
LEARNING_RATE = 0.001 #NOT THE SAME AS ALPHA FROM Q-LEARNING FROM BEFORE!!
GAMMA = 0.95


Let us use the epsilon greedy action selection function once again:

In [22]:
def epsilon_greedy_action_selection(model, epsilon, observation):
    if np.random.random() > epsilon:
        print(observation)
        prediction = model.predict(observation)  # perform the prediction on the observation
        action = np.argmax(prediction)  # Chose the action with the higher value
    else:
        action = np.random.randint(0, env.action_space.n)  # Else use random action
    return action

As shown in the lecture, we need a replay buffer.
We can use the **deque** data structure for this, which already implements the circular behavior.

The *maxlen* argument specifies the number of elements the buffer can store between he overwrites them at the beginning

The following cell shows an example usage of the deque function. You can see, that in the first example all values fit into the deque, so nothing is overwritten. 

In the second example, the deque is printed in each iteration. It can hold all values in the first five iterations but then needs to delete the oldest value in the deque to make room for the new value 

In [23]:
### deque examples
deque_1 = deque(maxlen=5)
for i in range(5):  # all values fit into the deque, no overwriting
    deque_1.append(i)
print(deque_1)
print("---------------------")
deque_2 = deque(maxlen=5)

# after the first 5 values are stored, it needs to overwrite the oldest value to store the new one
for i in range(10):  
    deque_2.append(i)
    print(deque_2)

deque([0, 1, 2, 3, 4], maxlen=5)
---------------------
deque([0], maxlen=5)
deque([0, 1], maxlen=5)
deque([0, 1, 2], maxlen=5)
deque([0, 1, 2, 3], maxlen=5)
deque([0, 1, 2, 3, 4], maxlen=5)
deque([1, 2, 3, 4, 5], maxlen=5)
deque([2, 3, 4, 5, 6], maxlen=5)
deque([3, 4, 5, 6, 7], maxlen=5)
deque([4, 5, 6, 7, 8], maxlen=5)
deque([5, 6, 7, 8, 9], maxlen=5)


Let's say we allow our replay buffer a maximum size of 20000

In [24]:
replay_buffer = deque(maxlen=20000)
update_target_model = 10

As mentioned in the lecture, action replaying is crucial for Deep Q-Learning. <br />
The following cell implements one version of the action replay algorithm. <br />
It uses the zip statement paired with the * (Unpacking Argument Lists) operator to create batches from the samples for efficient prediction and training.<br />
The zip statement returns all corresponding pairs from each entry. <br />
It might look confusing but the following example should clarify it

In [25]:
test_tuple = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
zipped_list = list(zip(*test_tuple))
a, b, c = zipped_list
print(a, b, c)

(1, 4, 7) (2, 5, 8) (3, 6, 9)


Now it's time to write the replay function

In [26]:
def replay(replay_buffer, batch_size, model, target_model):
    
    # As long as the buffer has not enough elements we do nothing
    if len(replay_buffer) < batch_size: 
        return
    
    # Take a random sample from the buffer with size batch_size
    samples = random.sample(replay_buffer, batch_size)  
    
    # to store the targets predicted by the target network for training
    target_batch = []  
    
    # Efficient way to handle the sample by using the zip functionality
    zipped_samples = list(zip(*samples))  
    states, actions, rewards, new_states, dones = zipped_samples  
    
    # Predict targets for all states from the sample
    targets = target_model.predict(np.array(states))
    
    # Predict Q-Values for all new states from the sample
    q_values = model.predict(np.array(new_states))  
    
    # Now we loop over all predicted values to compute the actual targets
    for i in range(batch_size):  
        
        # Take the maximum Q-Value for each sample
        q_value = max(q_values[i][0])  
        
        # Store the ith target in order to update it according to the formula
        target = targets[i].copy()  
        if dones[i]:
            target[0][actions[i]] = rewards[i]
        else:
            target[0][actions[i]] = rewards[i] + q_value * GAMMA
        target_batch.append(target)

    # Fit the model based on the states and the updated targets for 1 epoch
    model.fit(np.array(states), np.array(target_batch), epochs=1, verbose=0)  


We need to update our target network every once in a while. <br />
Keras provides the *set_weights()* and *get_weights()* methods which do the work for us, so we only need to check whether we hit an update epoch

In [27]:
def update_model_handler(epoch, update_target_model, model, target_model):
    if epoch > 0 and epoch % update_target_model == 0:
        target_model.set_weights(model.get_weights())


# Part 4: Training the Model

Now it is time to write the training loop! <br />
First we compile the model

In [28]:
model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))




Then we perform the training routine. <br />
This might take some time, so make sure to grab your favorite beverage and watch your model learn. <br />
Feel free to use our provided chekpoints as a starting point

In [29]:
best_so_far = 0
for epoch in range(EPOCHS):
    observation = env.reset()[0]  # Get inital state
    
    # Keras expects the input to be of shape [1, X] thus we have to reshape
    observation = observation.reshape([1, 4])  
    done = False  
    
    points = 0
    while not done:  # as long current run is active
        
        # Select action acc. to strategy
        action = epsilon_greedy_action_selection(model, epsilon, observation)
        
        # Perform action and get next state
        next_observation,reward,done,truncated,info = env.step(action)

        next_observation = next_observation.reshape([1, 4])  # Reshape!!
        replay_buffer.append((observation, action, reward, next_observation, done))  # Update the replay buffer
        observation = next_observation  # update the observation
        points+=1

        # Most important step! Training the model by replaying
        replay(replay_buffer, 32, model, target_model)

    
    epsilon *= EPSILON_REDUCE  # Reduce epsilon
    
    # Check if we need to update the target model
    update_model_handler(epoch, update_target_model, model, target_model)
    
    if points > best_so_far:
        best_so_far = points
    if epoch %25 == 0:
        print(f"{epoch}: Points reached: {points} - epsilon: {epsilon} - Best: {best_so_far}")


0: Points reached: 21 - epsilon: 0.995 - Best: 21
[[-0.09287532 -0.7877592   0.08816639  1.1571473 ]]


InvalidArgumentError: Graph execution error:

Detected at node 'sequential_1/dense_3/Tensordot/GatherV2_1' defined at (most recent call last):
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/traitlets/config/application.py", line 846, in launch_instance
      app.start()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 712, in start
      self.io_loop.start()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 199, in start
      self.asyncio_loop.run_forever()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
      self._run_once()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
      handle._run()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
      await self.process_one()
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 499, in process_one
      await dispatch(*args)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
      await result
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 730, in execute_request
      reply_content = await reply_content
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 390, in do_execute
      res = shell.run_cell(code, store_history=store_history, silent=silent)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 528, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2914, in run_cell
      result = self._run_cell(
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2960, in _run_cell
      return runner(coro)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner
      coro.send(None)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3185, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3377, in run_ast_nodes
      if (await self.run_code(code, result,  async_=asy)):
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/var/folders/4p/q2w97g_n3xv8t1g03z0l35sw0000gn/T/ipykernel_8867/4034569020.py", line 13, in <module>
      action = epsilon_greedy_action_selection(model, epsilon, observation)
    File "/var/folders/4p/q2w97g_n3xv8t1g03z0l35sw0000gn/T/ipykernel_8867/3577742953.py", line 4, in epsilon_greedy_action_selection
      prediction = model.predict(observation)  # perform the prediction on the observation
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 2350, in predict
      tmp_batch_outputs = self.predict_function(iterator)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 2137, in predict_function
      return step_function(self, iterator)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 2123, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 2111, in run_step
      outputs = model.predict_step(data)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 2079, in predict_step
      return self(x, training=False)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/sequential.py", line 413, in call
      return super().call(inputs, training=training, mask=mask)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Users/tylergasperlin/opt/anaconda3/lib/python3.9/site-packages/keras/layers/core/dense.py", line 244, in call
      outputs = tf.tensordot(inputs, self.kernel, [[rank - 1], [0]])
Node: 'sequential_1/dense_3/Tensordot/GatherV2_1'
indices[0] = 2 is not in [0, 2)
	 [[{{node sequential_1/dense_3/Tensordot/GatherV2_1}}]] [Op:__inference_predict_function_25822]

In [None]:
epoch

# Part 5: Using Trained Model

In [None]:
observation = env.reset()[0]
for counter in range(300):
    env.render()
    
    # TODO: Get discretized observation
    action = np.argmax(model.predict(observation.reshape([1,4])))
    
    # TODO: Perform the action 
    observation,reward,done,truncated,info = env.step(action)

    if done:
        print(f"done")
        break
env.close()