# Deep Reinforcement Learning

## Introduction

Deep Reinforcement Learning (Deep RL) combines Reinforcement Learning (RL) with Deep Learning to create agents that can learn to make decisions by interacting with an environment. Deep RL has achieved remarkable success in various domains, including game playing, robotics, and autonomous vehicles.

In this tutorial, we will explore the fundamentals of Deep RL, implement algorithms like Deep Q-Networks (DQN) and Policy Gradients, and understand the underlying mathematics. We will also dive into some of the latest developments in this field.

## Table of Contents

1. [Fundamentals of Reinforcement Learning](#1)
   - [Markov Decision Processes](#1.1)
   - [Key Concepts](#1.2)
2. [Deep Reinforcement Learning](#2)
   - [Combining Deep Learning and Reinforcement Learning](#2.1)
3. [Deep Q-Networks (DQN)](#3)
   - [Mathematical Foundations](#3.1)
   - [Algorithm Explanation](#3.2)
   - [Implementation](#3.3)
4. [Policy Gradient Methods](#4)
   - [Mathematical Foundations](#4.1)
   - [Algorithm Explanation](#4.2)
   - [Implementation](#4.3)
5. [Latest Developments in Deep RL](#5)
   - [Double DQN](#5.1)
   - [Dueling DQN](#5.2)
   - [Asynchronous Advantage Actor-Critic (A3C)](#5.3)
   - [Proximal Policy Optimization (PPO)](#5.4)
6. [Conclusion](#6)
7. [References](#7)


<a id="1"></a>
## 1. Fundamentals of Reinforcement Learning

Reinforcement Learning (RL) is a computational approach to learning from interaction. An agent learns to make decisions by performing actions in an environment to maximize cumulative rewards.

<a id="1.1"></a>
### Markov Decision Processes

The RL problem is often formalized as a Markov Decision Process (MDP), defined by:

- **States (S)**: The set of possible states the agent can be in.
- **Actions (A)**: The set of actions the agent can take.
- **Transition Probability (P)**: $P(s' \mid s, a)$, the probability of moving to state $s'$ after taking action $a$ in state $s$.
- **Reward Function (R)**: The immediate reward received after taking an action in a state and transitioning to the next state.
- **Discount Factor (γ)**: A factor between 0 and 1 that controls how much future rewards count relative to immediate ones.

At each time step $t$, the agent observes a state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$.
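
To make the interaction loop concrete, here is a minimal sketch of a two-state MDP and a rollout through it. The states, actions, probabilities, and rewards are invented purely for illustration, and the helper names `step` and `rollout` are hypothetical.

```python
import random

# Hypothetical two-state MDP, invented purely for illustration.
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
        "fast": [(1.0, "hot", -10.0)],
    },
}

def step(state, action, rng):
    """Sample (next_state, reward) from the transition distribution."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def rollout(policy, start_state, n_steps, seed=0):
    """Run the agent-environment loop and record each (s, a, r, s') tuple."""
    rng = random.Random(seed)
    state, trajectory = start_state, []
    for _ in range(n_steps):
        action = policy(state)
        next_state, reward = step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

# A fixed policy that always picks "slow" never leaves the "cool" state.
trajectory = rollout(lambda state: "slow", "cool", n_steps=3)
```

Here the policy is a plain function from states to actions; a learning agent would adjust it based on the rewards it observes.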

<a id="1.2"></a>
### Key Concepts

- **Policy (π)**: A mapping from states to actions (or to a distribution over actions) that determines the agent's behavior.
- **Value Function (V)**: The expected cumulative discounted reward from a state when following the policy.
- **Q-Function (Q)**: The expected cumulative discounted reward from taking a specific action in a state and following the policy afterwards.

- **Objective**: Find a policy that maximizes the expected cumulative reward:

$$
J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]
$$
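
For a single finite episode, the sum inside the expectation can be computed directly. The sketch below assumes a plain list of rewards; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Averaging this quantity over many episodes sampled with a policy gives a Monte Carlo estimate of the objective for that policy.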

<a id="2"></a>
## 2. Deep Reinforcement Learning

Traditional tabular RL methods, which store a separate value for every state or state-action pair, struggle with high-dimensional state and action spaces. Deep Reinforcement Learning addresses this by using deep neural networks to approximate value functions and policies.

<a id="2.1"></a>
### Combining Deep Learning and Reinforcement Learning

- **Function Approximation**: Use neural networks to approximate value functions or policies.
- **Experience Replay**: Store past transitions and train on randomly sampled mini-batches to break the correlation between consecutive samples.
- **Stability Techniques**: Use target networks, regularization, and other methods to stabilize training.
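
As an illustration of experience replay, a minimal buffer might look like the sketch below. The class name `ReplayBuffer` and its methods are hypothetical, and the transitions stored at the end are dummy values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Dummy transitions: with capacity 3, only the last three survive.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=2)
```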

<a id="3"></a>
## 3. Deep Q-Networks (DQN)

DQN is a seminal Deep RL algorithm that combines Q-Learning with deep neural networks to handle high-dimensional input spaces.

**Reference:**

- Mnih, V., et al. (2015). *Human-level control through deep reinforcement learning*. Nature, 518(7540), 529–533.

<a id="3.1"></a>
### Mathematical Foundations

**Q-Learning** aims to learn the optimal action-value function $Q^*(s, a)$, which satisfies the Bellman optimality equation:

$$
Q^*(s, a) = \mathbb{E}_{s'} \left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]
$$

**Deep Q-Network** approximates $Q^*(s, a)$ using a neural network with parameters $\theta$:

$$
Q(s, a; \theta) \approx Q^*(s, a)
$$

**Loss Function:**

$$
L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( y_i - Q(s, a; \theta) \right)^2 \right]
$$

where the target for the $i$-th sampled transition is:

$$
y_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-})
$$

- $\theta^{-}$: Parameters of the target network, a copy of $\theta$ that is updated only periodically so that the targets do not shift on every gradient step.
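
The target and loss can be checked on a tiny batch with NumPy. This is a sketch with made-up Q-values rather than network outputs. The `dones` mask, which sets the target to the reward alone on terminal transitions, is the usual convention and is not part of the formula above.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with y = r on terminal steps."""
    max_next_q = next_q_target.max(axis=1)
    return rewards + gamma * max_next_q * (1.0 - dones)

def dqn_loss(q_online, actions, targets):
    """Mean squared error between targets and Q(s, a; theta) of the taken actions."""
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

# Made-up numbers: a batch of two transitions with two actions each.
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])                         # second transition is terminal
next_q_target = np.array([[2.0, 4.0], [3.0, 5.0]])   # Q(s', .; theta^-)
q_online = np.array([[4.0, 0.0], [0.0, 2.0]])        # Q(s, .; theta)
actions = np.array([0, 1])

targets = td_targets(rewards, next_q_target, dones, gamma=0.5)  # [3.0, 1.0]
loss = dqn_loss(q_online, actions, targets)                      # 1.0
```

Only the Q-value of the action actually taken enters the loss; the other outputs of the network receive no gradient from that transition.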

<a id="3.2"></a>
### Algorithm Explanation

1. **Initialize** the Q-network $Q(s, a; \theta)$ with random weights $\theta$.
2. **Initialize** the target network $Q(s, a; \theta^{-})$ with weights $\theta^{-} = \theta$.
3. **Initialize** the replay memory $D$.
4. **For** each episode:
   - **For** each step in the episode:
     - Observe state $s$.
     - Select action $a$ using an ε-greedy policy.
     - Execute action $a$, observe reward $r$ and next state $s'$.
     - Store transition $(s, a, r, s')$ in $D$.
     - Sample a mini-batch of transitions from $D$.
     - Compute the target $y_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ for each sampled transition (use $y_i = r$ if $s'$ is terminal).
     - Update $\theta$ by a gradient step on the loss $L(\theta)$.
     - **Periodically** update $\theta^{-} \leftarrow \theta$.
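
The ε-greedy step in the loop above can be sketched on its own. The helper names `epsilon_greedy` and `decay` are hypothetical, and the multiplicative decay schedule is one common choice, not the only one.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay(epsilon, epsilon_min=0.01, epsilon_decay=0.995):
    """Shrink epsilon multiplicatively, but never below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)

greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # always action 1
```

Starting with a large ε and decaying it lets the agent explore widely early on and exploit its Q-estimates more as they improve.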


<a id="3.3"></a>
### Implementation

We'll implement DQN using OpenAI Gym's CartPole environment.

```python
# Import necessary libraries
import random
from collections import deque

import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers

# Set up the environment.
# Note: this cell assumes the Gym >= 0.26 API, where reset() returns
# (observation, info) and step() returns
# (observation, reward, terminated, truncated, info).
# Older Gym releases use env.seed(1) and return a 4-tuple from step().
env = gym.make('CartPole-v1')

# Set seeds for reproducibility
random.seed(1)
np.random.seed(1)
tf.random.set_seed(1)
env.reset(seed=1)

# Define hyperparameters
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 64
n_episodes = 500
gamma = 0.99          # Discount factor
epsilon = 1.0         # Exploration rate
epsilon_min = 0.01
epsilon_decay = 0.995
learning_rate = 0.001
memory = deque(maxlen=2000)

# Build the Q-network: state in, one Q-value per action out
def build_model():
    model = models.Sequential()
    model.add(layers.Input(shape=(state_size,)))
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=optimizers.Adam(learning_rate=learning_rate))
    return model

# Initialize the online network and the target network with identical weights
model = build_model()
target_model = build_model()
target_model.set_weights(model.get_weights())

# Function to choose an action with an epsilon-greedy policy
def choose_action(state, epsilon):
    if np.random.rand() <= epsilon:
        return random.randrange(action_size)
    act_values = model.predict(state, verbose=0)
    return int(np.argmax(act_values[0]))

# Function to sample a mini-batch from memory and train the network on it
def replay(batch_size):
    minibatch = random.sample(memory, batch_size)
    states = np.vstack([m[0] for m in minibatch])
    actions = np.array([m[1] for m in minibatch])
    rewards = np.array([m[2] for m in minibatch], dtype=np.float32)
    next_states = np.vstack([m[3] for m in minibatch])
    terminal = np.array([m[4] for m in minibatch], dtype=np.float32)
    # Start from the current predictions so only the taken action's output changes
    targets = model.predict(states, verbose=0)
    next_q = target_model.predict(next_states, verbose=0)
    # y = r for terminal transitions, r + gamma * max_a' Q(s', a'; theta^-) otherwise
    targets[np.arange(batch_size), actions] = (
        rewards + gamma * np.amax(next_q, axis=1) * (1.0 - terminal)
    )
    model.train_on_batch(states, targets)

# Function to update the target network
def update_target_model():
    target_model.set_weights(model.get_weights())

# Main training loop
for e in range(n_episodes):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    for t in range(500):
        # Uncomment to render the environment
        # env.render()
        action = choose_action(state, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Penalize failure (the pole fell), not reaching the time limit
        reward = reward if not terminated else -10
        next_state = np.reshape(next_state, [1, state_size])
        memory.append((state, action, reward, next_state, terminated))
        state = next_state
        if done:
            print(f"Episode: {e}/{n_episodes}, Score: {t}, Epsilon: {epsilon:.2}")
            break
        if len(memory) > batch_size:
            replay(batch_size)
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    # Synchronize the target network once per episode
    update_target_model()
```

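This implementation uses experience replay and a target network for stable learning. The target network is synchronized with the Q-network at the end of every episode, and the reward is replaced by a penalty of -10 when an episode ends, to discourage failure.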

<a id="4"></a>
## 4. Policy Gradient Methods

Policy gradient methods optimize the policy directly by adjusting parameters in the direction of greater expected reward.

<a id="4.1"></a>
### Mathematical Foundations

The objective is to maximize the expected return:

$$
J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]
$$

By the policy gradient theorem, the gradient of this objective can be written as:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t | s_t) \, G_t \right]
$$

where $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$ is the return (cumulative discounted future reward) from time step $t$.
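
For intuition, the gradient estimate can be worked out by hand for a very simple policy. The sketch below assumes a state-independent softmax policy whose parameters `theta` are one logit per action, which is a simplification chosen for illustration. For that policy, the gradient of the log-probability of action `a` with respect to `theta` is `onehot(a) - pi`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()

def reinforce_gradient(theta, actions, returns):
    """Single-episode estimate: sum over t of grad log pi(a_t) * G_t."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for action, g in zip(actions, returns):
        score = -pi.copy()      # grad of log pi(action) w.r.t. theta ...
        score[action] += 1.0    # ... equals onehot(action) - pi
        grad += score * g
    return grad

# Made-up episode: two actions, uniform initial policy.
theta = np.zeros(2)
grad = reinforce_gradient(theta, actions=[0, 1], returns=[2.0, 1.0])  # [0.5, -0.5]
```

Action 0 was followed by the larger return, so the estimate pushes its logit up and the other logit down by the same amount.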

<a id="4.2"></a>
### Algorithm Explanation

1. **Initialize** policy network $\pi_\theta(a|s)$ with parameters $\theta$.
2. **Collect** $N$ episodes using the current policy.
3. **Compute** returns $G_t$ for each time step.
4. **Compute** policy gradient estimates:

   $$
   \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) G_t^i
   $$

5. **Update** policy parameters by gradient ascent with learning rate $\alpha$:

   $$
   \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
   $$

<a id="4.3"></a>
### Implementation

We'll implement the REINFORCE algorithm on the CartPole environment.

```python
# Import necessary libraries
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers

# Set up the environment.
# Note: this cell assumes the Gym >= 0.26 API, where reset() returns
# (observation, info) and step() returns
# (observation, reward, terminated, truncated, info).
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
np.random.seed(1)
tf.random.set_seed(1)
env.reset(seed=1)

# Define the policy network: state in, action probabilities out
def build_policy_network():
    model = models.Sequential()
    model.add(layers.Input(shape=(state_size,)))
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(action_size, activation='softmax'))
    return model

policy = build_policy_network()
optimizer = optimizers.Adam(learning_rate=0.01)

# Function to select action based on policy probabilities
def choose_action(state):
    state = state.reshape([1, state_size])
    probs = policy.predict(state, verbose=0).flatten().astype(np.float64)
    probs /= probs.sum()  # Renormalize: float32 rounding can upset np.random.choice
    return np.random.choice(action_size, p=probs)

# Function to compute discounted returns G_t for every time step
def discount_rewards(rewards, gamma=0.99):
    discounted = np.zeros(len(rewards), dtype=np.float32)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[i]
        discounted[i] = cumulative
    return discounted

# Main training loop
episodes = 1000
gamma = 0.99
for episode in range(episodes):
    state, _ = env.reset()
    states, actions, rewards = [], [], []
    total_reward = 0
    done = False
    while not done:
        # env.render()
        action = choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        states.append(state)
        action_onehot = np.zeros(action_size)
        action_onehot[action] = 1
        actions.append(action_onehot)
        rewards.append(reward)
        state = next_state
        total_reward += reward
    # Compute discounted rewards (kept 1-D so they align with the log-probabilities)
    discounted_rewards = discount_rewards(rewards, gamma)
    # Convert lists to arrays
    states = np.vstack(states).astype(np.float32)
    actions = np.vstack(actions).astype(np.float32)
    # Normalize rewards to reduce the variance of the gradient estimate
    discounted_rewards = (discounted_rewards - np.mean(discounted_rewards)) / (np.std(discounted_rewards) + 1e-7)
    # Train the policy network
    with tf.GradientTape() as tape:
        probs = policy(states)
        # The network already outputs probabilities, so take the log directly
        log_prob = tf.math.log(tf.reduce_sum(probs * actions, axis=1) + 1e-8)
        # Minimizing the negative objective performs gradient ascent on J
        loss = -tf.reduce_mean(log_prob * discounted_rewards)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    print(f"Episode: {episode}, Total Reward: {total_reward}")
```

In this implementation, we collect episodes, compute discounted rewards, and update the policy network using the REINFORCE algorithm.

<a id="5"></a>
## 5. Latest Developments in Deep RL

Deep RL has rapidly evolved, with numerous advancements improving stability, sample efficiency, and performance.

<a id="5.1"></a>
### 5.1 Double DQN

Double DQN addresses the overestimation bias in Q-Learning by decoupling action selection from evaluation.
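
The decoupling can be sketched in a few lines of NumPy: the online network's Q-values choose the next action, and the target network's Q-values score it. The arrays below are made-up numbers standing in for network outputs, and `double_dqn_targets` is a hypothetical helper name.

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma):
    """Double DQN target: select with the online net, evaluate with the target net."""
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * evaluated * (1.0 - dones)

# Made-up Q-values for two transitions with two actions each.
rewards = np.array([0.0, 0.0])
dones = np.array([0.0, 0.0])
next_q_online = np.array([[1.0, 2.0], [3.0, 0.0]])   # picks actions 1 and 0
next_q_target = np.array([[5.0, 1.0], [2.0, 9.0]])

double_targets = double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=1.0)
standard_targets = rewards + next_q_target.max(axis=1)  # plain DQN, for comparison
# double_targets is [1.0, 2.0]; standard_targets is [5.0, 9.0]
```

In this example the plain maximum picks up the largest target-network values, while the Double DQN target only uses the value of the action the online network prefers.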

**Reference:**

- van Hasselt, H., Guez, A., & Silver, D. (2016). *Deep Reinforcement Learning with Double Q-learning*. [arXiv:1509.06461](https://arxiv.org/abs/1509.06461)

<a id="5.2"></a>
### 5.2 Dueling DQN

Dueling DQN separates the estimation of state value and advantage, improving learning efficiency.
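
A sketch of how the two streams can be recombined into Q-values follows. It uses the mean-subtracted aggregation described in the referenced paper; the inputs are made-up numbers standing in for the outputs of the value and advantage streams, and `dueling_q_values` is a hypothetical helper name.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    return state_value + advantages - advantages.mean(axis=1, keepdims=True)

# One state, three actions.
state_value = np.array([[10.0]])           # shape (batch, 1)
advantages = np.array([[1.0, 2.0, 3.0]])   # shape (batch, n_actions)
q_values = dueling_q_values(state_value, advantages)  # [[9.0, 10.0, 11.0]]
```

Subtracting the mean advantage keeps the decomposition identifiable: without it, a constant could be moved freely between the value and advantage streams.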

**Reference:**

- Wang, Z., et al. (2016). *Dueling Network Architectures for Deep Reinforcement Learning*. [arXiv:1511.06581](https://arxiv.org/abs/1511.06581)

<a id="5.3"></a>
### 5.3 Asynchronous Advantage Actor-Critic (A3C)

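A3C runs multiple actor-learner workers in parallel, each interacting with its own copy of the environment and asynchronously updating a shared set of parameters, which stabilizes and speeds up training.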

**Reference:**

- Mnih, V., et al. (2016). *Asynchronous Methods for Deep Reinforcement Learning*. [arXiv:1602.01783](https://arxiv.org/abs/1602.01783)

<a id="5.4"></a>
### 5.4 Proximal Policy Optimization (PPO)

PPO simplifies trust region policy optimization, balancing ease of implementation and performance.
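
The core of PPO in the referenced paper is a clipped surrogate objective, sketched below with NumPy. The log-probabilities and advantages are made-up numbers, `ppo_clip_objective` is a hypothetical helper name, and `clip_eps=0.2` is a commonly used value rather than a requirement.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Made-up batch: probability ratios of 1.5 and 0.5 against the old policy.
new_log_probs = np.log(np.array([1.5, 0.5]))
old_log_probs = np.zeros(2)
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(new_log_probs, old_log_probs, advantages)  # about 0.2
```

In both samples the ratio has moved outside the range 0.8 to 1.2 in the direction the advantage favors, so the clipped term is used and further movement in that direction earns no extra objective.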

**Reference:**

- Schulman, J., et al. (2017). *Proximal Policy Optimization Algorithms*. [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)

<a id="6"></a>
## 6. Conclusion

Deep Reinforcement Learning combines the strengths of deep learning and reinforcement learning, enabling agents to learn complex tasks from raw sensory inputs. We explored foundational algorithms like DQN and policy gradients and touched upon advanced techniques that continue to push the boundaries of what's possible in AI.

<a id="7"></a>
## 7. References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
2. Mnih, V., et al. (2015). *Human-level control through deep reinforcement learning*. Nature, 518(7540), 529–533.
3. Schulman, J., et al. (2017). *Proximal Policy Optimization Algorithms*. [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
4. Mnih, V., et al. (2016). *Asynchronous Methods for Deep Reinforcement Learning*. [arXiv:1602.01783](https://arxiv.org/abs/1602.01783)
5. van Hasselt, H., Guez, A., & Silver, D. (2016). *Deep Reinforcement Learning with Double Q-learning*. [arXiv:1509.06461](https://arxiv.org/abs/1509.06461)
6. Wang, Z., et al. (2016). *Dueling Network Architectures for Deep Reinforcement Learning*. [arXiv:1511.06581](https://arxiv.org/abs/1511.06581)

---

This notebook provides an in-depth exploration of Deep Reinforcement Learning. You can run the code cells to see how DQN and policy gradient methods are implemented and experiment with the models.