<center>
    <h1>Deep Reinforcement Learning</h1>
</center>

# Brief Recap of Deep Reinforcement Learning

- Deep Reinforcement Learning (DRL) is a machine learning approach that combines deep neural networks with reinforcement learning to create systems that can learn optimal behaviors through interaction with an environment.

- It advanced autonomous decision-making by enabling agents to learn complex strategies directly from raw input data, such as pixels or sensor readings, without hand-coded rules or behaviors.

- Deep Reinforcement Learning has several key advantages over traditional machine learning approaches:
    1. End-to-end learning: Can learn directly from raw sensory inputs to actions
    2. Adaptability: Capable of learning and adjusting strategies in dynamic environments
    3. Generalization: Can transfer learned skills to similar but previously unseen situations

- DRL uses various algorithms like Deep Q-Networks (DQN), Policy Gradients, and Actor-Critic methods to achieve efficient learning in complex environments with large state and action spaces.

- It has become foundational for many cutting-edge applications, including game playing, robotics control, autonomous vehicles, and resource management systems.

- Popular DRL tooling includes the OpenAI Gym environment toolkit and libraries such as Stable Baselines, RLlib, and TF-Agents, each offering different features and optimization capabilities.

- These technologies continue to evolve, finding new applications across industries such as healthcare, finance, manufacturing, and logistics optimization.
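
The interaction loop behind these points can be sketched in a few lines. This is a minimal illustration, not a DRL algorithm: `ToyCorridor`, `random_policy`, and `run_episode` are hypothetical names invented for the example, and the policy is random rather than learned.

```python
import random

class ToyCorridor:
    """Hypothetical 1-D environment: start at cell 0, reach cell `length` to finish."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, action 0 moves left (never below cell 0)
        self.position = max(0, self.position + (1 if action == 1 else -1))
        done = self.position >= self.length
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.position, reward, done


def random_policy(state):
    # Placeholder for a learned policy: ignores the state entirely
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


episode_return = run_episode(ToyCorridor(), random_policy)
```

A DRL agent replaces `random_policy` with a neural network whose parameters are adjusted to increase the return.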

## Architecture of Deep Reinforcement Learning

- Deep Reinforcement Learning architectures are specialized neural network systems designed to learn optimal decision-making policies through environment interaction and reward optimization.

- They combine deep neural networks with traditional reinforcement learning principles, enabling end-to-end learning from raw inputs to actions.

- Deep RL architectures have several fundamental components:
    1. Input Processing: Handles raw state information from the environment
    2. Feature Extraction: Transforms raw inputs into meaningful representations
    3. Policy/Value Estimation: Determines actions or state values
    4. Action Selection: Chooses optimal actions based on learned policies

- A typical implementation consists of a layered network plus an experience store:
    1. Input Layer: Receives state observations from the environment
    2. Hidden Layers: Process and transform state information
    3. Output Layer: Generates action probabilities or value estimates
    4. Memory Buffer: Stores experience for replay and learning (a data structure alongside the network, not a layer)

- These architectures employ various optimization techniques:
    1. Experience Replay: Stores and reuses past experiences
    2. Target Networks: Stabilize training through delayed updates
    3. Advantage Estimation: Improves policy gradient calculations

- Popular architectural variants include:
    - Deep Q-Networks (DQN) for discrete action spaces
    - Policy Gradient Networks, which also handle continuous action spaces
    - Actor-Critic Networks for combined policy and value learning

- These architectures continue to evolve, incorporating new advances in deep learning such as attention mechanisms, transformers, and multi-agent learning capabilities.
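
As a rough illustration of the advantage estimation technique listed above, the sketch below computes discounted returns for one episode and subtracts a value baseline. The function names and sample numbers are made up for the example, and practical implementations often use other estimators.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, value_estimates, gamma=0.99):
    """Advantage A_t = G_t - V(s_t): how much better the outcome was than predicted."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, value_estimates)]


# Illustrative numbers only: three steps with reward 1 and a critic that always predicts 2.0
adv = advantages([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], gamma=0.5)
```

A negative advantage means the action led to a lower return than the critic expected, so a policy gradient update would make that action less likely.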

## Applications of Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) is a powerful approach that combines deep learning and reinforcement learning to solve complex problems in various domains such as:

- Robotics
    - **Autonomous Navigation**: DRL is used to train robots for navigation tasks in dynamic environments. Robots learn to make decisions based on sensory input to navigate through obstacles and reach goals.
    - **Manipulation Tasks**: In industrial settings, DRL helps robots learn to perform tasks like picking, placing, and assembling objects with precision.

- Game Playing
    - **Atari Games**: DRL algorithms, such as Deep Q-Networks (DQN), have been successfully applied to play Atari games, achieving superhuman performance in many cases.
    - **Board Games**: DRL has been used in games like Go (AlphaGo) and chess (AlphaZero), where agents learn strategies largely through self-play, leading to groundbreaking results.

- Finance
    - **Algorithmic Trading**: DRL can optimize trading strategies by learning from historical market data and making buy/sell decisions to maximize returns.
    - **Portfolio Management**: It helps in dynamically adjusting asset allocations in a portfolio to balance risk and reward based on changing market conditions.

- Healthcare
    - **Personalized Treatment Plans**: DRL can assist in developing personalized treatment strategies by learning optimal interventions for individual patients based on their health data.
    - **Drug Discovery**: In pharmaceuticals, DRL models are utilized to explore molecular space for potential drug candidates by predicting interactions and outcomes.

- Transportation
    - **Traffic Signal Control**: DRL algorithms optimize traffic light timings to reduce congestion and improve traffic flow in urban areas.
    - **Autonomous Vehicles**: DRL is integral to developing autonomous driving systems, enabling vehicles to make real-time decisions based on environmental conditions.

- Natural Language Processing
    - **Dialogue Systems**: DRL enhances conversational agents by optimizing response strategies based on user interactions, leading to more engaging and context-aware dialogues.
    - **Text Summarization**: It is applied to learn how to summarize text effectively by evaluating the quality of generated summaries through reinforcement signals.

- Energy Management
    - **Smart Grid Optimization**: DRL can optimize energy distribution and consumption in smart grids, balancing supply and demand while minimizing costs.
    - **Demand Response**: It helps in managing energy consumption patterns in response to changing prices and grid conditions, promoting efficient energy usage.

- Game Development
    - **Procedural Content Generation**: DRL is used to create dynamic and adaptive game content, enhancing player experience by adjusting difficulty levels based on player performance.
    - **Player Behavior Modeling**: It can model and predict player behavior to improve game design and user engagement.

# Implementing Some Core Concepts of a Deep Reinforcement Learning Model with TensorFlow

In [1]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

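`gym`: OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. We use it to create the environment (CartPole in this case). `numpy` handles array manipulation and random sampling, and `tensorflow` with `tensorflow.keras.layers` builds and trains the Q-network.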

## Environment Setup

- `env = gym.make('CartPole-v1')`: We create an instance of the CartPole environment where the agent will interact and learn. The environment provides the state, action space, and rewards.
- `state_shape`: This variable holds the shape of the state space (input) for the neural network. For CartPole, the state is a 4-element vector (cart position, cart velocity, pole angle, and pole angular velocity), so `state_shape` is `(4,)`.
- `num_actions`: This is the number of possible actions the agent can take, which is 2 (push the cart left or right).

In [2]:
env = gym.make('CartPole-v1')
state_shape = env.observation_space.shape
num_actions = env.action_space.n

## Defining the Q-Network

- **Q-Network**: This neural network approximates the Q-value function. It takes the state as input and outputs the Q-values for all possible actions.
    - `layers.Dense(128, activation='relu')`: We use fully connected layers (Dense) with 128 neurons, using the ReLU activation function to introduce non-linearity.
    - `layers.Dense(num_actions)`: The output layer has `num_actions` neurons (2 for CartPole), each representing the predicted Q-value for that action.
    - `Sequential`: A simple feed-forward neural network where each layer's output is the next layer's input.

In [3]:
def create_q_network(state_shape, num_actions):
    model = tf.keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=state_shape),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_actions)
    ])
    return model
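
To make the layer shapes concrete, here is a NumPy-only sketch of the same forward pass (two ReLU layers of 128 units followed by a linear output). The weights are random stand-ins, so the outputs carry no meaning; `init_mlp` and `forward` are hypothetical helpers, not part of Keras.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Create random (weights, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, states):
    """Apply ReLU after every layer except the last, which stays linear."""
    x = states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

# CartPole-like dimensions: 4 state features in, 2 Q-values out
params = init_mlp([4, 128, 128, 2])
batch = np.zeros((3, 4))            # a batch of three dummy states
q_values = forward(params, batch)   # one row of Q-values per state
```

The output layer has no activation because Q-values are unbounded real numbers, not probabilities.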

## Defining the Replay Buffer

- **Replay Buffer**: This class stores past experiences in the form of `(state, action, reward, next_state, done)` tuples.
- **add()**: This method adds new experiences to the buffer. If the buffer is full, it replaces older experiences in a circular manner.
- **sample()**: It randomly selects a batch of experiences from the buffer to break correlations between consecutive experiences. This helps stabilize training.
- **Purpose**: Replay buffer allows the agent to learn from a wider variety of experiences, enhancing sample efficiency and reducing instability in training.

In [4]:
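class ReplayBuffer:
    def __init__(self, size):
        self.buffer = []
        self.max_size = size
        self.size = 0      # number of experiences currently stored
        self.position = 0  # index of the next slot to write

    def add(self, experience):
        if self.size < self.max_size:
            self.buffer.append(experience)
            self.size += 1
        else:
            # Buffer is full: overwrite the oldest experience
            self.buffer[self.position] = experience
        # Advance the write index, wrapping around at max_size
        self.position = (self.position + 1) % self.max_size

    def sample(self, batch_size):
        # Sample indices uniformly at random (with replacement)
        idx = np.random.choice(len(self.buffer), batch_size)
        return [self.buffer[i] for i in idx]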
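
The circular-overwrite behavior described above can also be had from the standard library. The sketch below is an alternative, not the class used in the rest of this notebook: `collections.deque` with `maxlen` drops the oldest item automatically, and `random.sample` draws without replacement (unlike `np.random.choice`, which samples with replacement by default). `DequeReplayBuffer` is a name chosen for this example.

```python
import random
from collections import deque

class DequeReplayBuffer:
    def __init__(self, size):
        # deque discards the oldest entry once maxlen is reached
        self.buffer = deque(maxlen=size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Draw distinct experiences; requires batch_size <= len(self.buffer)
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = DequeReplayBuffer(size=3)
for step in range(5):
    # (state, action, reward, next_state, done) with dummy values
    buf.add((step, 0, 1.0, step + 1, False))

batch = buf.sample(2)
```

After five additions to a buffer of size 3, only the three most recent experiences remain.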

## Epsilon-Greedy Policy for Exploration

- **Epsilon-Greedy Policy**: This policy balances exploration and exploitation.
  - **Exploration**: With a probability `epsilon`, the agent takes a random action, encouraging exploration of new states.
  - **Exploitation**: With a probability `1 - epsilon`, the agent chooses the action with the highest Q-value, exploiting current knowledge.
- `np.random.rand() < epsilon`: Generates a random number between 0 and 1. If it’s less than `epsilon`, the agent explores; otherwise, it exploits.

In [5]:
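def epsilon_greedy_policy(q_values, epsilon):
    # Relies on the global `num_actions` defined in the environment setup cell
    if np.random.rand() < epsilon:
        # Explore: pick a uniformly random action
        return np.random.randint(num_actions)
    else:
        # Exploit: pick the action with the highest predicted Q-value
        return np.argmax(q_values)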
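
A quick way to see the exploration/exploitation split is to call the policy many times and count how often the greedy action is picked. The sketch below re-implements the policy with the standard library so it runs on its own; the Q-values and the helper name `epsilon_greedy` are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)   # fixed seed so the run is repeatable
q_values = [0.2, 1.5]    # action 1 is the greedy choice
trials = 10_000
greedy_picks = sum(
    epsilon_greedy(q_values, epsilon=0.1, rng=rng) == 1 for _ in range(trials)
)
greedy_fraction = greedy_picks / trials
```

With two actions and `epsilon = 0.1`, the greedy action is expected about `1 - epsilon + epsilon / 2 = 0.95` of the time, because random exploration also lands on it half the time.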

## Training the DQN

```python
def train_dqn(env, episodes, batch_size=64, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, min_epsilon=0.1):
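    # Create Q-network and target Q-network
    q_network = create_q_network(state_shape, num_actions)
    target_q_network = create_q_network(state_shape, num_actions)
    # Start the target network from the same weights as the Q-network
    target_q_network.set_weights(q_network.get_weights())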
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    loss_fn = tf.keras.losses.MeanSquaredError()

    replay_buffer = ReplayBuffer(100000)  # Large replay buffer
```

### Explanation:
- **Q-Network**: The main neural network that learns to approximate the Q-values.
- **Target Q-Network**: A separate network used to stabilize training. Its weights are copied from the Q-network only periodically (every 10 episodes in this implementation), so the training targets do not shift after every gradient step.
- **Optimizer**: Adam optimizer is used to minimize the loss by adjusting the network's weights.
- **Loss Function**: Mean Squared Error is used to minimize the difference between the predicted Q-values and the target Q-values.
- **Replay Buffer**: A buffer with a size of 100,000 to store experiences.
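
The target-network idea can be shown without TensorFlow by treating each network's weights as a list of numbers. The sketch below shows a hard copy, which is what this notebook does with `set_weights`, and for comparison a soft (Polyak) update, a different technique that this notebook does not use. Both function names and `tau` are assumptions for the example.

```python
def hard_update(target_weights, online_weights):
    """Overwrite the target weights with a copy of the online weights."""
    target_weights[:] = list(online_weights)

def soft_update(target_weights, online_weights, tau=0.01):
    """Move each target weight a fraction tau toward the online weight."""
    for i, (t, o) in enumerate(zip(target_weights, online_weights)):
        target_weights[i] = (1.0 - tau) * t + tau * o

online = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0]

soft_update(target, online, tau=0.5)   # target is now halfway to online
halfway = list(target)

hard_update(target, online)            # target now equals online
```

Hard updates keep the targets fixed between copies, while soft updates let them drift slowly; both aim to keep the regression target from chasing the network being trained.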

```python
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            state_input = np.expand_dims(state, axis=0).astype(np.float32)
            q_values = q_network(state_input)
            action = epsilon_greedy_policy(q_values, epsilon)
            
            # Take the chosen action in the environment
            next_state, reward, done, _ = env.step(action)
            replay_buffer.add((state, action, reward, next_state, done))

            state = next_state
            total_reward += reward
```

### Explanation:
- **Episode Loop**: The outer loop runs for the specified number of episodes. Each episode is a complete run of the environment (from start to terminal state).
- `env.reset()`: Resets the environment and returns the initial state.
- **Action Selection**: The agent selects an action using the epsilon-greedy policy based on the Q-values predicted by the Q-network.
- **Environment Interaction**: `env.step(action)` executes the selected action and returns the next state, the reward, a done flag, and an info dictionary (discarded here as `_`). This four-value return is the classic Gym API used by versions before 0.26.
- **Experience Storage**: The `(state, action, reward, next_state, done)` tuple is stored in the replay buffer for future training.
- **Total Reward**: Tracks the cumulative reward obtained by the agent in the episode.
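
The loop above uses the classic Gym API, where `reset()` returns only the observation and `step()` returns four values. Gym 0.26 and later (and Gymnasium) return `(obs, info)` from `reset()` and `(obs, reward, terminated, truncated, info)` from `step()`. If the installed version is unknown, small adapters like the following can normalize both forms; `unpack_reset` and `unpack_step` are hypothetical helpers, not Gym functions.

```python
def unpack_reset(result):
    """Return just the observation from either reset() convention."""
    # Newer API: (obs, info) tuple with a dict as the second element
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result

def unpack_step(result):
    """Return (next_state, reward, done, info) from either step() convention."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # Treat truncation as episode end, matching the classic `done` flag
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info

# Dummy tuples standing in for real environment output
old_style = unpack_step(([0.1, 0.0], 1.0, False, {}))
new_style = unpack_step(([0.1, 0.0], 1.0, False, True, {}))
```

In the training loop this would read `state = unpack_reset(env.reset())` and `next_state, reward, done, _ = unpack_step(env.step(action))`.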

```python
            if len(replay_buffer.buffer) > batch_size:
                # Sample a batch from the replay buffer
                experiences = replay_buffer.sample(batch_size)
                states, actions, rewards, next_states, dones = map(np.array, zip(*experiences))

                # Predict Q-values for the next states using the target network
                next_q_values = target_q_network(next_states)
                max_next_q_values = np.max(next_q_values, axis=1)

                # Bellman equation for the target Q-value
                targets = rewards + gamma * max_next_q_values * (1 - dones)
```

### Explanation:
- **Experience Sampling**: Once the replay buffer holds more than `batch_size` experiences, a batch is sampled from it at every environment step.
- **Target Calculation**: 
  - `next_q_values`: The Q-values for the next states are predicted by the target network.
  - `max_next_q_values`: The highest Q-value for the next state is chosen (maximizing future reward).
  - **Bellman Equation**: The target Q-value is calculated using the Bellman equation: `reward + (discount factor * max future reward)`. The factor `(1 - dones)` ensures that no future reward is added if the episode is done.
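
A small worked example may help here. The numbers below are made up, and the array stands in for the target network's output; the arithmetic mirrors the target calculation described above.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0, 1.0], dtype=np.float32)
dones = np.array([False, False, True])
# Made-up target-network outputs: one row per next state, one column per action
next_q_values = np.array([[0.5, 1.0],
                          [2.0, 1.5],
                          [3.0, 4.0]], dtype=np.float32)

max_next_q_values = next_q_values.max(axis=1)   # best action value per next state
not_done = 1.0 - dones.astype(np.float32)       # 1 while the episode continues, 0 at the end
targets = rewards + gamma * max_next_q_values * not_done
```

The targets come out as roughly `[1.99, 2.98, 1.0]`: the first two transitions add the discounted best next-state value, while the third is terminal, so its target is just the reward.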

```python
                # Gradient descent to update the Q-network
                with tf.GradientTape() as tape:
                    q_values = q_network(states)
                    action_masks = tf.one_hot(actions, num_actions)
                    q_values_taken = tf.reduce_sum(q_values * action_masks, axis=1)
                    loss = loss_fn(targets, q_values_taken)

                grads = tape.gradient(loss, q_network.trainable_variables)
                optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
```

### Explanation:
- **Q-Value Prediction**: The Q-network predicts the Q-values for the batch of states.
- **Action Mask**: A one-hot mask is created for the actions taken in those states to extract the Q-values corresponding to the actions the agent chose.
- **Loss Calculation**: The difference between the predicted Q-values and the target Q-values is calculated using Mean Squared Error.
- **Gradient Descent**: The gradients of the loss are computed with respect to the network's weights, and the optimizer applies these gradients to update the Q-network's parameters.
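
The masking step is easier to follow with concrete numbers. This NumPy sketch reproduces the mask and the loss only (no gradients); the Q-values, actions, and targets are invented for illustration.

```python
import numpy as np

num_actions = 2
q_values = np.array([[1.0, 2.0],
                     [3.0, 4.0]])    # made-up network outputs for two states
actions = np.array([1, 0])           # action taken in each state
targets = np.array([1.99, 2.98])     # made-up Bellman targets

action_masks = np.eye(num_actions)[actions]              # one-hot rows: [[0, 1], [1, 0]]
q_values_taken = (q_values * action_masks).sum(axis=1)   # -> [2.0, 3.0]
loss = np.mean((targets - q_values_taken) ** 2)          # mean squared error
```

Only the Q-value of the action actually taken contributes to the loss, so the gradient leaves the other action's output untouched for that sample.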

```python
            if done:
                print(f"Episode {episode + 1}, Total Reward: {total_reward}")
                break

        # Update epsilon for the next episode
        epsilon = max(min_epsilon, epsilon * epsilon_decay)

        # Periodically update the target network
        if episode % 10 == 0:
            target_q_network.set_weights(q_network.get_weights())

    # Return the trained network so the caller can evaluate or save it
    return q_network
```
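
To get a feel for the exploration schedule, the sketch below counts how many episodes the multiplicative decay needs before epsilon reaches its floor. It reuses the default values from `train_dqn`; the helper name is an assumption for this example.

```python
def episodes_until_min_epsilon(epsilon=1.0, epsilon_decay=0.995, min_epsilon=0.1):
    """Count the decay steps needed before epsilon is clamped at min_epsilon."""
    episodes = 0
    while epsilon > min_epsilon:
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
        episodes += 1
    return episodes

n_episodes = episodes_until_min_epsilon()
```

With the defaults, epsilon reaches its floor after 460 episodes, so a run much shorter than that never becomes mostly greedy.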

# Let's Build a Real-World Project to Understand Deep Reinforcement Learning Better