# Reinforcement Learning (DQN) Tutorial

*Prepared by Damian Dailisan*

---

## Problem: `LunarLander-v2`

This example implements a Deep Q-Network (DQN) agent
trained to solve the `LunarLander-v2` task from the [OpenAI Gym](https://gym.openai.com/envs/LunarLander-v2/).

<video controls autoplay=true src="https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/LunarLander-v2/original.mp4"/>



## Task
This environment is a classic rocket trajectory optimization problem.
The goal is to train an agent to land a rocket on a landing pad.
In this environment, landing outside the landing pad is possible.
Fuel is infinite, so an agent can learn to fly and then land on its first attempt.

### Actions
The agent has to decide between four actions --- do nothing, fire left orientation engine, fire main engine, fire right orientation engine --- with the objective of landing on the landing pad.

### States
The state of the lander is encoded in 8 variables:
- x position
- y position
- x velocity
- y velocity
- angle
- angular velocity
- left leg touching ground
- right leg touching ground
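
To make the layout of the observation concrete, here is a minimal sketch that unpacks an 8-element observation into named fields. The `LanderState` type, the `decode_state` helper, and the example numbers are hypothetical illustrations, and the sketch assumes the observation uses the same order as the list above.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation."""
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode_state(obs: Sequence[float]) -> LanderState:
    """Unpack a raw observation, assuming the ordering listed above."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *continuous, left, right = obs
    # The two leg flags arrive as 0.0 / 1.0, so convert them to booleans
    return LanderState(*(float(v) for v in continuous), bool(left), bool(right))


# Made-up observation: slightly above the pad, right leg touching the ground
example = decode_state([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example.y, example.right_leg_contact)
```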

### Rewards
As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state, and also
returns a reward that indicates the consequences of the action.
This environment rewards the agent as follows:
- -100 if the lander crashes or lands outside the landing pad (ends the episode)
- +100 if the lander comes to rest within the landing pad (ends the episode)
- +10 for each leg currently on the ground (lifting a leg incurs a -10 reward)
- -0.3 for each frame the main engine is used
- -0.03 for each frame the side engines are used
- Miscellaneous positive (negative) rewards for decreasing (increasing) the distance to the landing pad.

The rewards incentivise the agent to land inside the landing pad on both legs while using as little fuel as possible.



In [1]:
# !pip install pyglet
# !pip install "gym[box2d]"
# !pip install tensorflow
# !pip install Box2D

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import gym
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # this is a CPU-bound process
seed = 42

# load the environment from openai gym
env = gym.make("LunarLander-v2").env
env.seed(seed)
state = env.reset()



## Model: Q-network

Our model is a fully connected neural network with three hidden layers of 64 units each that takes the state observation $s$ as input.
It has four outputs, representing $Q(s, \text{do nothing})$,
$Q(s, \text{fire left})$, $Q(s, \text{fire main})$, and $Q(s, \text{fire right})$.
In effect, the network tries to predict the *expected return* of taking each action given the current input.


In [3]:
num_actions = env.action_space.n # this should be 4
num_observations =  len(state) # this is 8

def create_q_model():
    # shape must be a tuple, so a trailing comma is needed for a 1-D input
    inputs = layers.Input(shape=(num_observations,))

    layer1 = layers.Dense(64, activation="relu")(inputs)
    layer2 = layers.Dense(64, activation="relu")(layer1)
    layer3 = layers.Dense(64, activation="relu")(layer2)

    action = layers.Dense(num_actions, activation=None)(layer3)
    return keras.Model(inputs=inputs, outputs=action)
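
As a sanity check on the size of this network, the sketch below counts the trainable weights and biases of a stack of dense layers. The helper `count_dense_parameters` is a hypothetical name, and the layer sizes are read off the `create_q_model` code above (8 inputs, three hidden layers of 64 units, 4 outputs).

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Each dense layer has a fan_in x fan_out weight matrix plus one bias per unit
        total += fan_in * fan_out + fan_out
    return total


# Sizes assumed from create_q_model: 8 observations -> 64 -> 64 -> 64 -> 4 actions
sizes = [8, 64, 64, 64, 4]
print(count_dense_parameters(sizes))
```

If TensorFlow is available, the result can be compared against the total reported by `create_q_model().summary()`.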

## Replay Buffer

Experience replay is a useful trick in DQNs, particularly when consecutive states are highly correlated with each other.
Instead of batching consecutive experiences together to train the DQN, we temporarily store the agent's recent experiences in a buffer.
This allows us to reuse the data later.
Randomly sampling from the replay buffer yields a batch of decorrelated transitions.
It has been shown that this greatly stabilizes and improves the DQN training procedure.

The replay buffer is a first-in-first-out (FIFO) storage with finite capacity, which we will implement as a `deque`.

In [4]:
from collections import deque
import random

class ReplayBuffer(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, transition):
        """Save a transition, passed in as a single tuple"""
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
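
The FIFO behaviour comes entirely from the `maxlen` argument of `deque`. The short sketch below uses a bare `deque` with a deliberately tiny capacity and made-up placeholder transitions to show that the oldest entries are dropped once the buffer is full.

```python
from collections import deque
import random

# Tiny capacity so the eviction is easy to see (the tutorial uses 50000)
buffer = deque([], maxlen=3)

for step in range(5):
    # Placeholder transition: (step index, dummy state label)
    buffer.append((step, f"state_{step}"))

# Only the three most recent transitions remain; steps 0 and 1 were evicted
print(list(buffer))

# Uniform sampling without replacement, as in ReplayBuffer.sample
random.seed(0)
batch = random.sample(buffer, 2)
print(batch)
```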

## DQN algorithm


Our aim is to train a policy that maximizes the discounted,
cumulative reward
$R = \sum_{t=t_0}^{\tau} \gamma^{t - t_0} r_t$, where
$R$ is also known as the *return*. The discount,
$\gamma$, is a constant between $0$ and $1$
that ensures the sum converges.
It also weights rewards from the uncertain far
future less than those in the near future.
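
As a small worked example, the sketch below computes the return of a short reward sequence, with the discount exponent counted from the first step of the sequence. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per step after the first."""
    total = 0.0
    # Work backwards: R_t = r_t + gamma * R_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A smaller $\gamma$ shrinks the contribution of later rewards faster, which is why the last reward here only counts for a quarter of its value.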

$Q$-learning tries to find the function
$Q(s,a)$ that estimates our return if we were to take action $a$ in
state $s$.
This allows us to construct a policy $\pi$ that maximizes our
rewards:

$$ \pi(s) = \arg\!\max_a \ Q(s, a) $$
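
In code, this policy is just an argmax over the Q-values predicted for one state. The sketch below uses made-up Q-values and assumes the action indices follow the order in which the actions were listed earlier; `greedy_action` and `ACTION_NAMES` are hypothetical names.

```python
# Assumed index order, following the action list given earlier
ACTION_NAMES = ["do nothing", "fire left", "fire main", "fire right"]


def greedy_action(q_values):
    """Return the index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for a single state
q_values = [-1.2, 0.3, 2.5, 0.1]
best = greedy_action(q_values)
print(best, ACTION_NAMES[best])
```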

The challenge is to find a $Q$ that accurately describes our environment.
Because neural networks are universal function
approximators, one approach is to train a neural network to approximate $Q$.
This is a vast improvement over the tabular approach, which becomes intractable once there are many states and actions to consider, as in more complex environments.

We can use the Bellman equation:
$$ Q(s,a)= \mathbb{E}\left[r + \gamma \max_{a'} Q(s',a')\right] $$
to define a loss function for our problem.
The difference between the two sides of this equation is the temporal difference error, $\delta$:
\begin{align}\delta = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))\end{align}
Rather than minimizing $\delta$ directly, we minimize its [Huber
loss](https://en.wikipedia.org/wiki/Huber_loss) to train the neural network.
For small errors, the Huber loss behaves like the mean squared error, while for large errors it behaves like the mean absolute error.
This makes the Huber loss more robust to outliers caused by noisy estimates of $Q$.
The network is trained over a batch of transitions $B$ sampled from the replay memory:

\begin{align}\mathcal{L} = \frac{1}{|B|}\sum_{(s, a, s', r) \ \in \ B} \mathcal{L}(\delta)\end{align}

\begin{align}\text{where} \quad \mathcal{L}(\delta) = \begin{cases}
     \frac{1}{2}{\delta^2}  & \text{for } |\delta| \le 1, \\
     |\delta| - \frac{1}{2} & \text{otherwise.}
   \end{cases}\end{align}
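
The piecewise definition above can be checked with a few lines of plain Python. This is only an illustrative sketch with made-up error values; the training code uses `keras.losses.Huber`, and the helper names `huber` and `batch_loss` are hypothetical.

```python
def huber(delta):
    """Huber loss of a single error, with the threshold fixed at 1."""
    if abs(delta) <= 1.0:
        return 0.5 * delta * delta  # quadratic region
    return abs(delta) - 0.5         # linear region


def batch_loss(deltas):
    """Mean Huber loss over a batch of temporal difference errors."""
    return sum(huber(d) for d in deltas) / len(deltas)


# One small error (quadratic region) and one large error (linear region)
print(huber(0.5), huber(-2.0), batch_loss([0.5, -2.0]))
```

The large error contributes 1.5 rather than the 2.0 it would under a squared loss of $\frac{1}{2}\delta^2$, which is the robustness to outliers mentioned above.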
   
For convenience and numerical stability, we also make use of two neural networks: the policy network and the target network.
The policy network computes the first $Q$ term in the temporal difference error, while the target network computes the second $Q$ term.
The target network copies its weights from the policy network only periodically.
Avoiding frequent updates to the target network keeps the training targets stable.
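
To illustrate the second term, the sketch below builds the targets $r + \gamma \max Q(s', \cdot)$ from a batch of next-state Q-values, which in the tutorial would come from the target network. The helper `td_targets` and the numbers are hypothetical; the sketch also assumes, as the training code does, that a transition ending an episode contributes no future reward.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute r + gamma * max Q(s', .) per transition, dropping the
    bootstrap term when the transition ended the episode."""
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Two made-up transitions: an ordinary step and a crash that ends the episode
rewards = [1.0, -100.0]
next_q_values = [[0.0, 2.0], [5.0, 1.0]]  # stand-in for target network outputs
dones = [False, True]
print(td_targets(rewards, next_q_values, dones, gamma=0.5))
```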

In [5]:
# The first model predicts the Q-values that are used to
# choose an action.
model_policy = create_q_model()

# Build a target model for the prediction of future rewards.
# The weights of the target model are only updated every `update_target_network` steps,
# so the target Q-values stay stable while the loss is calculated.
model_target = create_q_model()



## Training

Some hyperparameters:

-  `epsilon_max`, `epsilon_min`, and `exploration_fraction` control how epsilon is annealed over the training steps.
   This allows us to decay the amount of exploration of the agent over time.
-  `update_target_network` sets how often (in steps) the target network is updated.
-  `train_freq` is the number of actions taken between updates of the policy network weights.
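
The annealing controlled by the first three hyperparameters can be written in closed form. The sketch below is a hypothetical helper, `linear_epsilon`, that assumes a linear decay from `epsilon_max` to `epsilon_min` over the first `exploration_fraction` of all training steps, with epsilon held at `epsilon_min` afterwards; the default values mirror the ones used in this tutorial.

```python
def linear_epsilon(step, total_steps, epsilon_max=1.0, epsilon_min=0.05,
                   exploration_fraction=0.1):
    """Epsilon after `step` steps of a linear decay that finishes after
    `exploration_fraction * total_steps` steps."""
    anneal_steps = exploration_fraction * total_steps
    # Fraction of the decay completed, capped at 1 once annealing is over
    progress = min(step / anneal_steps, 1.0)
    return epsilon_max - progress * (epsilon_max - epsilon_min)


# With 100000 training steps, the decay ends after 10000 steps
for step in (0, 5000, 10000, 50000):
    print(step, round(linear_epsilon(step, total_steps=100000), 3))
```

The training loop reaches the same schedule incrementally, by subtracting a fixed amount from epsilon at every step and clipping at `epsilon_min`.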





In [6]:
# Configuration parameters for the whole setup
gamma = 0.99  # Discount factor for future rewards
epsilon_min = 0.05  # Minimum epsilon greedy parameter
epsilon_max = 1.0  # Maximum epsilon greedy parameter
epsilon = epsilon_max  # Epsilon greedy parameter
batch_size = 32  # Size of batch taken from replay buffer
max_steps_per_episode = 1000 # just a safety constraint
exploration_fraction = 0.1 # Fraction of training steps over which epsilon is annealed
buffer_size = 50000 # Maximum replay length
train_freq = 4 # Train the model after 4 actions
update_target_network = 500 # How often to update the target network

# The DeepMind paper used RMSProp, but the Adam optimizer is faster
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

episode_reward_history = [0.]
running_reward = 0
episode_count = 0


loss_function = keras.losses.Huber() # Using huber loss for stability

# Experience replay buffers
replay_buffer = ReplayBuffer(buffer_size)

num_timesteps = 100000 # longer to train
# num_timesteps = 10000 # debug
epsilon_greedy_frames = num_timesteps*exploration_fraction

state = env.reset()
step_count = 0
for frame_count in range(1,num_timesteps+1):
    # env.render(); Adding this line would show the attempts
    # of the agent in a pop up window.

    # Use epsilon-greedy for exploration
    if epsilon > np.random.rand(1)[0]:
        # Take random action
        action = np.random.choice(num_actions)
    else:
        # Predict action Q-values from state
        action_probs = model_policy(state[np.newaxis], training=False)
        # Take best action
        action = tf.argmax(action_probs[0]).numpy()

    # Linear Decay probability of taking random action
    epsilon -= (epsilon_max - epsilon_min)/epsilon_greedy_frames
    epsilon = max(epsilon, epsilon_min)

    # Apply the sampled action in our environment
    state_next, reward, done, _ = env.step(action)

    # Accumulate the reward of the current episode
    episode_reward_history[-1] += reward

    # Save actions and states in replay buffer
    replay_buffer.push((action, state, state_next, reward, done))

    state = state_next

    # Update every `train_freq` frames, once the buffer holds more than `batch_size` transitions
    if frame_count % train_freq == 0 and len(replay_buffer) > batch_size:
        # sample the replay buffer
        samples = replay_buffer.sample(batch_size)
        action_sample = [sample[0] for sample in samples]
        state_sample = np.array([sample[1] for sample in samples])
        state_next_sample = np.array([sample[2] for sample in samples])
        rewards_sample = [sample[3] for sample in samples]
        done_sample = tf.convert_to_tensor(
            [float(sample[4]) for sample in samples]
        )

        # Build the updated Q-values for the sampled future states
        # Use the target model for stability
        future_rewards = model_target.predict(state_next_sample)
        # Q value = reward + discount factor * expected future reward;
        # the (1 - done_sample) factor removes the future reward on final frames
        updated_q_values = rewards_sample + gamma * tf.reduce_max(future_rewards, axis=1)*(1 - done_sample)

        # Create a mask so we only calculate loss on the updated Q-values
        masks = tf.one_hot(action_sample, num_actions)

        with tf.GradientTape() as tape:
            # Train the model on the states and updated Q-values
            q_values = model_policy(state_sample)

            # Apply the masks to the Q-values to get the Q-value for action taken
            q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
            # Calculate loss between new Q-value and old Q-value
            loss = loss_function(updated_q_values, q_action)

        # Backpropagation
        grads = tape.gradient(loss, model_policy.trainable_variables)
        optimizer.apply_gradients(zip(grads, model_policy.trainable_variables))

    if frame_count % update_target_network == 0:
        # update the target network with new weights
        model_target.set_weights(model_policy.get_weights())
        # Log details
        if frame_count%(update_target_network*4) == 0:
            template = f"running reward: {running_reward:.2f} at episode {episode_count}, frames: {frame_count}"
            print(template)

    step_count += 1
    if step_count == max_steps_per_episode:
        # The episode is taking too long, so force a reset
        done = True

    if done:
        state = env.reset()
        # Restart the per-episode step counter, whether the episode
        # ended naturally or was cut off above
        step_count = 0

        # Update the running mean reward over the most recent episodes
        if len(episode_reward_history) > 20:
            del episode_reward_history[:1]
        running_reward = np.mean(episode_reward_history)

        episode_reward_history.append(0)

        episode_count += 1


    if frame_count in [1000, 10000, 100000]:
        model_policy.save(f"dqn_{frame_count}.h5")

running reward: -151.32 at episode 21, frames: 2000
running reward: -99.20 at episode 41, frames: 4000
running reward: -78.82 at episode 54, frames: 6000
running reward: -69.35 at episode 58, frames: 8000
running reward: -58.75 at episode 60, frames: 10000
running reward: -67.31 at episode 62, frames: 12000
running reward: -27.84 at episode 67, frames: 14000
running reward: -8.61 at episode 72, frames: 16000
running reward: 18.65 at episode 77, frames: 18000
running reward: 33.82 at episode 81, frames: 20000
running reward: 37.62 at episode 83, frames: 22000


We save the trained model for later reuse, since it takes some time to train the model until it performs well.

## Visualization



In [9]:
model_policy = keras.models.load_model("dqn_100000.h5", compile=False)
state = env.reset()
done = False
episode_rewards=[0]
steps=0
for i in range(5000):  
#     env.render() # for visualization, must be done on a local machine

    action_probs = model_policy(state[np.newaxis], training=False)
    action = tf.argmax(action_probs[0]).numpy()

    # Apply the sampled action in our environment
    state, reward, done, _ = env.step(action)

    episode_rewards[-1] += reward
    steps += 1
    
    if steps == max_steps_per_episode:
        # Cut off episodes that run too long
        done = True

    if done:
        state = env.reset()
        steps = 0  # restart the per-episode step counter
        episode_rewards.append(0.0)

# Compute mean reward for the last 100 episodes
print(f"Mean reward: {np.mean(episode_rewards[-100:]):.2f}\t Num episodes: {len(episode_rewards)}")

Mean reward: -13.67	 Num episodes: 6


## References
1. https://keras.io/examples/rl/deep_q_network_breakout/
2. https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html
3. https://stable-baselines.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading
4. https://goodboychan.github.io/python/reinforcement_learning/pytorch/udacity/2021/05/07/DQN-LunarLander.html