# ABE Tutorial 4
## Using Functional Approximation in Deep Reinforcement Learning

In this tutorial, we explore how neural networks can be used for function approximation in deep reinforcement learning (DRL). We build on previous tutorials that covered value-based and actor–critic methods, and now we focus on:

- **Constructing neural networks** using PyTorch.
- **Data normalization** techniques to stabilize learning.
- **Training networks** using gradient descent and backpropagation.
- **Extending to continuous action spaces** for actor–critic methods.

***
## 1. Building a Neural Network

We'll use PyTorch as our Python package for building neural networks. We've seen these networks in the previous tutorials, but here let's dive into the details a little more.

The first thing to note is that we are using a sequential approach to building our neural networks. In this approach we just need to specify a network by providing an ordered list of layers. Let's take a look at how to do this below, by building a simple three layer network:

* **Input Layer**: This is the layer where the data comes into the model. Let's assume there are 4 input variables.

* **First Hidden Layer**: This is a layer of nodes that is connected to the input layer. It transforms the input data and passes the transformed values on toward the output layer. Let's assume this hidden layer has 32 nodes.

* **Output Layer**: This output layer will take the transformed values and output values that can be used to inform what actions can be taken. Let's assume there are two actions that can be taken.

You should see these layers below, and how each layer's shape corresponds to the data: the 4 input values get passed to 32 hidden nodes, a second linear layer maps those 32 values to another 32, and the final layer maps them to the 2 output values (one per action).

In [None]:
import torch
import torch.nn as nn

def build_simple_network():
    """
    Construct a simple neural network.
    
    Returns:
        nn.Sequential: A neural network model.
    """
    model = nn.Sequential(
        nn.Linear(4, 32),  # Maps 4 input features to 32 hidden nodes.
        nn.Linear(32, 32), # Hidden layer maintains 32 features.
        nn.Linear(32, 2)   # Output layer produces 2 values (e.g., action values).
    )
    return model

# Build the network and print its architecture.
simple_net = build_simple_network()
print(simple_net)

***
## 2. Inspecting the Network Architecture

We use the `torchinfo` package (a modern alternative to `torchsummary`) to inspect the network. This tool provides detailed insights into each layer, including the shape of tensors and the number of trainable parameters. This step is essential for understanding how data flows through the network.

In [None]:
from torchinfo import summary

# Define the input shape as a tuple: batch size of 1 and 4 features.
input_shape = (1, 4)

# Display the network summary.
summary(simple_net, input_size=input_shape)

We can see in the summary above that the model has 1282 parameters! These are all the weights (one for each connection between nodes) and the bias values (one for each node in a layer).
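
The 1282 figure can be checked by hand. The sketch below rebuilds the same three linear layers (the variable names are illustrative, not part of the tutorial's code) and compares a manual count against what PyTorch reports through `model.parameters()`.

```python
import torch.nn as nn

# Same layer sizes as the simple network above.
model = nn.Sequential(
    nn.Linear(4, 32),
    nn.Linear(32, 32),
    nn.Linear(32, 2),
)

# Each Linear layer holds in_features * out_features weights
# plus one bias per output node.
by_hand = (4 * 32 + 32) + (32 * 32 + 32) + (32 * 2 + 2)

# PyTorch exposes every trainable tensor through model.parameters().
counted = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(by_hand, counted)  # prints: 1282 1282
```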

However, linear layers alone can only model linear relationships. In DRL and deep learning, activation functions such as ReLU introduce non-linearity, enabling the network to approximate more complex functions. 
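
To see why stacked linear layers stay linear, here is a minimal sketch (layer sizes and names are chosen only for illustration) showing that two `nn.Linear` layers without an activation in between compute exactly the same thing as one combined weight matrix and bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # seed chosen only so the sketch is repeatable

layer1 = nn.Linear(4, 32)
layer2 = nn.Linear(32, 2)
stacked = nn.Sequential(layer1, layer2)

with torch.no_grad():
    # Collapse the two layers into a single affine map:
    #   W = W2 @ W1,  b = W2 @ b1 + b2
    combined_weight = layer2.weight @ layer1.weight            # shape (2, 4)
    combined_bias = layer2.weight @ layer1.bias + layer2.bias  # shape (2,)

    x = torch.randn(5, 4)
    stacked_out = stacked(x)
    collapsed_out = x @ combined_weight.T + combined_bias

# The two outputs agree up to floating-point error.
print(torch.allclose(stacked_out, collapsed_out, atol=1e-5))
```

Adding depth without a non-linearity therefore adds parameters but no extra expressive power.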

In our network:
- **ReLU (Rectified Linear Unit):** Zeroes out negative values, helping to model non-linearities.

**Note**: We do not apply an activation function to the output layer so that the network can output any continuous value, which is often necessary for value estimation.

In [None]:
def build_network_with_activation():
    """
    Construct a neural network with activation functions.
    
    Returns:
        nn.Sequential: A neural network model with ReLU activations.
    """
    model = nn.Sequential(
        nn.Linear(4, 32),
        nn.ReLU(),             # Introduces non-linearity.
        nn.Linear(32, 32),
        nn.ReLU(),             # Second non-linear transformation.
        nn.Linear(32, 2)
    )
    return model

# Create the network with activation functions.
activated_net = build_network_with_activation()

# Display the model summary using torchinfo.
summary(activated_net, input_size=input_shape)

By placing a ReLU activation after each hidden layer, we zero out the output of any node that produces a negative value. This cutoff is what lets a neural network model non-linear relationships.

You'll notice that the model has the same number of weight and bias parameters as before. This is because the activation function is really just a filter and requires no new parameters. 
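
As a quick illustration of both points (the input values here are made up for the example), ReLU replaces negative values with zero, leaves positive values unchanged, and owns no trainable parameters.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Illustrative values spanning negative, zero, and positive inputs.
values = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
filtered = relu(values)
print(filtered.tolist())  # prints: [0.0, 0.0, 0.0, 0.5, 2.0]

# ReLU has no weights or biases, so it adds nothing to the parameter count.
num_relu_params = sum(p.numel() for p in relu.parameters())
print(num_relu_params)  # prints: 0
```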

You'll notice too that there is no activation function applied after the output layer. This is because we want the output layer to produce a continuous value, and we want to keep negative values as an option. We'll see that for the output layer we have to think more about what kinds of outputs we want (continuous numeric values, values restricted to lie between 0 and 1, etc.), and that will determine how we build this last layer. Internally, however, we will generally use ReLU activation functions for the hidden layers.

***
## 3. Incorporating Normalization

Normalization techniques, such as Layer Normalization, are crucial for stabilizing the training of deep networks. In DRL, where the agent’s experience is non-stationary, normalization helps maintain a consistent scale for inputs and intermediate activations.

In [None]:
def build_normalized_network():
    """
    Construct a neural network with normalization layers.
    
    Returns:
        nn.Sequential: A normalized neural network model.
    """
    model = nn.Sequential(
        nn.Linear(4, 32),
        nn.ReLU(),
        nn.LayerNorm(32),   # Normalize activations from the first hidden layer.
        nn.Linear(32, 32),
        nn.ReLU(),
        nn.LayerNorm(32),   # Normalize activations from the second hidden layer.
        nn.Linear(32, 2)
    )
    return model

# Build and inspect the normalized network.
normalized_net = build_normalized_network()
summary(normalized_net, input_size=input_shape)

- **LayerNorm:** Normalizes the activations of a layer for each given example, which is particularly useful in deep networks and recurrent models.

Normalization layers add additional parameters for scaling and shifting but are omitted in the output layer to preserve the meaningful scale of outputs.
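
The sketch below illustrates what `nn.LayerNorm` does to a small batch whose rows have very different scales (the batch and the scales are made up for this example). Each example is normalized on its own, and the layer contributes one scale and one shift parameter per feature.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # seed chosen only so the sketch is repeatable
layer_norm = nn.LayerNorm(32)

# Three examples with 32 features each, at scales of roughly 1, 10, and 100.
activations = torch.randn(3, 32) * torch.tensor([[1.0], [10.0], [100.0]])

with torch.no_grad():
    normalized = layer_norm(activations)
    per_example_mean = normalized.mean(dim=1)
    per_example_std = normalized.std(dim=1, unbiased=False)

print(per_example_mean)  # each entry is close to 0
print(per_example_std)   # each entry is close to 1

# One scale (weight) and one shift (bias) per feature: 32 + 32.
num_params = sum(p.numel() for p in layer_norm.parameters())
print(num_params)  # prints: 64
```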

***
## 4. Simulating Data for Regression

We've seen how we can build neural networks using a Lego-like approach, stacking different kinds of layers. Let's see now how we can update the weights and biases of these layers so that the network can learn. To do this, let's:

* Simulate some data to use as input
* Measure how far the network predictions are from the "right" answer
* Adjust the weights and biases to make better predictions
* Do this many times, until the network is making good predictions!

To train our network, we simulate a simple regression problem where the state variable and expected return have a linear relationship with added noise:

$$ y = 0.5x + \epsilon $$

where $\epsilon$ represents Gaussian noise. This provides a controlled environment to study the learning process, loss minimization, and gradient-based optimization, which are foundational concepts in both deep learning and DRL.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def simulate_data(num_samples=1000):
    """
    Simulate a linear relationship between a state variable and an expected return.
    
    The relationship is given by:
        y = 0.5 * x + ε
    where ε is Gaussian noise.
    
    Args:
        num_samples (int): Number of data samples.
    
    Returns:
        tuple: Arrays of state variables and expected returns.
    """
    rng = np.random.default_rng()  # Modern NumPy random generator; pass a seed here for reproducible data.
    state_vars = rng.normal(loc=0, scale=1, size=num_samples)
    expected_return = state_vars * 0.5 + rng.normal(loc=0, scale=0.25, size=num_samples)
    return state_vars, expected_return

# Generate the synthetic data.
state_vars, expected_return = simulate_data()

# Plot the relationship between the state variable and expected return.
plt.figure(figsize=(8, 6))
plt.scatter(state_vars, expected_return, alpha=0.6)
plt.xlabel("State Variable")
plt.ylabel("Expected Return")
plt.title("Scatter Plot: State Variable vs Expected Return")
plt.show()

Let's see if our neural network can learn the relationship between state values and expected returns.
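
Because the simulated relationship is linear, an ordinary least-squares line gives a useful reference point for what a good fit looks like. The sketch below is an aside rather than part of the tutorial's pipeline: it regenerates data with the same recipe (using a fixed seed chosen only for this example) and fits a line with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is an assumption made for this sketch
x = rng.normal(loc=0, scale=1, size=1000)
y = 0.5 * x + rng.normal(loc=0, scale=0.25, size=1000)

# Fit a straight line y ≈ slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Mean squared error of the fitted line; the noise has variance
# 0.25**2 = 0.0625, so no model can do much better than that on average.
residual_mse = np.mean((y - (slope * x + intercept)) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, MSE={residual_mse:.4f}")
```

The fitted slope should land near 0.5 and the MSE near 0.0625, which is roughly the loss level a well-trained network can be expected to approach on this data.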

***
## 5. Constructing a Regression Network

For our regression problem, we design a network that takes a single input (state variable) and outputs a single prediction (expected return). The architecture includes:

- **Input Layer:** 1 neuron.
- **Hidden Layers:** Two layers with 32 neurons each, utilizing ReLU activations and Layer Normalization.
- **Output Layer:** 1 neuron for continuous output.

This design illustrates how even simple networks in DRL must be carefully architected to capture the underlying data distribution.


In [None]:
def build_regression_network():
    """
    Build a neural network for regression with one input and one output.
    
    Returns:
        nn.Sequential: The constructed regression network.
    """
    model = nn.Sequential(
        nn.Linear(1, 32),  # Input layer: 1 feature to 32 neurons.
        nn.ReLU(),
        nn.LayerNorm(32),
        nn.Linear(32, 32),
        nn.ReLU(),
        nn.LayerNorm(32),
        nn.Linear(32, 1)   # Output layer: 1 continuous value.
    )
    return model

# Build the regression network and print its summary.
regression_net = build_regression_network()
summary(regression_net, input_size=(1, 1))

LayerNorm is a little overkill for the simple model we are building (i.e., it is more useful in deeper networks that are learning continuously), but let's leave it in for the example.

Let's see how well this network does without learning (i.e., all the weights/biases are randomly selected).

***
## 6. Preparing Data for the Network

Before training, we convert our simulated state variables into a PyTorch tensor that matches the input shape expected by our regression network. This is an important preprocessing step in DRL and deep learning in general.
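
As a small illustration of that conversion (the array values are made up), NumPy produces a one-dimensional `float64` array, while a network whose first layer is `nn.Linear(1, ...)` expects `float32` inputs of shape `(N, 1)`.

```python
import numpy as np
import torch

# A toy array standing in for the simulated state variables.
values = np.array([0.1, -0.4, 1.2, 0.0, 2.5])
print(values.shape, values.dtype)  # prints: (5,) float64

# Cast to float32 (the default dtype of nn.Linear weights) and
# add a feature dimension so each row is one sample with one feature.
tensor = torch.tensor(values, dtype=torch.float32).view(-1, 1)
print(tensor.shape, tensor.dtype)  # prints: torch.Size([5, 1]) torch.float32
```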


In [None]:
# Convert the state variables into a PyTorch tensor with shape (N, 1)
state_vars_tensor = torch.tensor(state_vars, dtype=torch.float32).view(-1, 1)

# Generate predictions using the untrained regression network.
with torch.no_grad():
    initial_predictions = regression_net(state_vars_tensor)

# Plot the true data against the initial network predictions.
plt.figure(figsize=(8, 6))
plt.scatter(state_vars, expected_return, label="True Data", alpha=0.6)
plt.scatter(state_vars, initial_predictions.numpy(), color="red", label="Initial Predictions", alpha=0.6)
plt.xlabel("State Variable")
plt.ylabel("Expected Return")
plt.title("Untrained Model Predictions vs True Data")
plt.legend()
plt.show()

You may run the code multiple times to observe that the initial predictions vary due to random weight initialization. As expected, these predictions do not fit the data well before training.

To quantify the discrepancy between the network's predictions and the true expected rewards, we use a **loss function**. The loss function provides a single scalar value that measures the error in the network's predictions and guides the learning process by indicating how much the network's parameters need to change.

***
## 7. Loss Function: Mean Squared Error

In supervised learning and DRL, the loss function is critical for evaluating model performance. Here, we use the **Mean Squared Error (MSE)**, defined as:

$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
$$

where:
- $N$ is the number of samples.
- $y_i$ is the true expected reward for the $i$-th sample.
- $\hat{y}_i$ is the predicted reward for the $i$-th sample.

This metric computes the average of the squared differences between the actual and predicted values. The gradients of the MSE with respect to the network parameters are computed using the chain rule, enabling the optimizer to update the weights in a direction that minimizes the error.


In [None]:
# Convert the expected return data to a PyTorch tensor.
expected_return_tensor = torch.tensor(expected_return, dtype=torch.float32).view(-1, 1)

# Define the MSE loss function.
loss_fn = nn.MSELoss()

# Compute the loss between initial predictions and the true expected returns.
loss = loss_fn(initial_predictions, expected_return_tensor)
print(f"Initial Loss: {loss.item():.4f}")

Loss functions serve as the objective that our training process aims to minimize. They quantify the discrepancy between the model's predictions and the true targets, guiding the network to learn meaningful representations of the data.
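
To connect the formula to the PyTorch call, here is a small sketch with made-up targets and predictions that computes the MSE by hand and compares it with `nn.MSELoss`.

```python
import torch
import torch.nn as nn

# Toy values chosen for illustration.
targets = torch.tensor([[1.0], [2.0], [3.0]])
predictions = torch.tensor([[1.5], [2.0], [2.0]])

# Squared errors are 0.25, 0.0, and 1.0, so the mean is 1.25 / 3.
manual_mse = ((targets - predictions) ** 2).mean()

# nn.MSELoss averages the squared errors by default (reduction="mean").
builtin_mse = nn.MSELoss()(predictions, targets)

print(f"{manual_mse.item():.4f} {builtin_mse.item():.4f}")  # prints: 0.4167 0.4167
```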

Let's see if we can reduce the MSE!

***
## 8. Optimizers

**Optimizers** are algorithms designed to update the network’s parameters in order to minimize the loss function. They use gradient information computed with respect to the loss to determine the direction and magnitude of updates to the weights and biases. For example, the [Adam optimizer](https://arxiv.org/abs/1412.6980) uses adaptive learning rates and momentum to converge on a set of weights that reduce the loss. Its update rule is:

$$
w_{t+1} = w_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
$$

where:
- $w_t$ is the current weight,
- $\alpha$ is the learning rate,
- $\hat{m}_t$ is the bias-corrected first moment (mean of gradients),
- $\hat{v}_t$ is the bias-corrected second moment (running mean of squared gradients, i.e., the uncentered variance),
- $\epsilon$ is a small constant to prevent division by zero.

The optimizer uses the gradient information to update the weights in a direction that minimizes the loss (*gradient descent*), which in our case is the MSE.
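
The update rule can be checked against PyTorch for a single step. The sketch below uses a toy parameter vector and a toy loss (both assumptions for illustration), applies one step of `torch.optim.Adam`, and then reproduces it by hand with the default decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$.

```python
import torch

lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

# Toy parameter and loss, used only to produce a gradient.
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)

loss = (w ** 2).sum()
loss.backward()
grad = w.grad.clone()
w_before = w.detach().clone()

optimizer.step()  # PyTorch's Adam update (step t = 1)

# Manual Adam update for the first step (moments start at zero).
t = 1
m = (1 - beta1) * grad            # first moment estimate
v = (1 - beta2) * grad ** 2       # second moment estimate
m_hat = m / (1 - beta1 ** t)      # bias correction
v_hat = v / (1 - beta2 ** t)
w_manual = w_before - lr * m_hat / (v_hat.sqrt() + eps)

print(torch.allclose(w.detach(), w_manual, atol=1e-6))
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so every weight moves by about one learning rate regardless of how large its gradient is.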

***
## 9. Backpropagation

**Backpropagation** is the method used to compute the gradients that the optimizer requires. It works in two main phases:

1. **Forward Pass:**

    a. *Neuron Computation:*  
     Each neuron computes a weighted sum of its inputs plus a bias:
     $$
     z = \sum_{j} w_j x_j + b,
     $$
     where:
     - $x_j$ are the inputs,
     - $w_j$ are the weights,
     - $b$ is the bias.
     
     The neuron then applies an activation function (such as ReLU) to produce its output:
     $$
     a = \sigma(z).
     $$
   
    b. *Layer-by-Layer Propagation:*  
     The outputs from one layer become the inputs to the next, culminating in the final prediction, $\hat{y}$.
   
    c. *Loss Computation:*  
     The MSE loss is computed over all samples:
     $$
     \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
     $$
     where:
     - $N$ is the number of samples,
     - $y_i$ is the true value for the $i$-th sample,
     - $\hat{y}_i$ is the predicted value.

2. **Backward Pass:**

    a. *Local Gradient Computation:*  
     At the neuron level, we calculate the derivative of the activation function, $\sigma'(z)$, to understand how a small change in $z$ affects the output $a$.
     
    b. *Applying the Chain Rule:*  
     When computing the gradient of the loss $L$ with respect to a weight $w$, we use the chain rule, which is
     $$
     \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
     $$
     Here:
     - $\frac{\partial L}{\partial \hat{y}}$ tells us how the loss changes as the network's output $\hat{y}$ changes. It is derived by differentiating the loss function with respect to the network's prediction.
     - $\frac{\partial \hat{y}}{\partial z}$ is the derivative of the activation function. It measures how the neuron's output $\hat{y}$ responds to a change in its input $z$. For example, if you use the ReLU activation, this derivative is 1 when $z$ is positive and 0 when $z$ is not positive.
     - $\frac{\partial z}{\partial w}$ is simply the input $x$, since the neuron's input is computed as $z = w \cdot x + b$. This indicates that a small change in $w$ results in a change in $z$ proportional to $x$.

   
    c. *Gradient Propagation:*  
     These gradients are computed starting from the output layer and propagated backward through the network. This tells us how each weight contributes to the overall MSE loss.

The optimizer then uses these computed gradients to update the weights, aiming to reduce the loss. By repeating this cycle, the network incrementally improves its predictions, driving the MSE lower with each *epoch*.
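
The chain rule above can be verified on a single neuron. In this sketch the input, target, weight, and bias are made-up numbers; the gradient from PyTorch's autograd is compared with the product of the three hand-computed factors.

```python
import torch

# Toy single-neuron setup (values chosen for illustration).
x = torch.tensor(2.0)
y = torch.tensor(3.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass: z = w * x + b, y_hat = ReLU(z), L = (y - y_hat)^2.
z = w * x + b
y_hat = torch.relu(z)
loss = (y - y_hat) ** 2

# Backward pass: autograd applies the chain rule for us.
loss.backward()

# The same gradient, factor by factor.
with torch.no_grad():
    dL_dyhat = -2.0 * (y - y_hat)             # derivative of the squared error
    dyhat_dz = 1.0 if z.item() > 0 else 0.0   # derivative of ReLU
    dz_dw = x                                 # since z = w * x + b
    manual_grad = dL_dyhat * dyhat_dz * dz_dw

print(f"{w.grad.item():.4f} {manual_grad.item():.4f}")  # prints: -7.6000 -7.6000
```

The negative gradient says that increasing $w$ would reduce the loss, which makes sense because the prediction (1.1) is below the target (3.0).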

In [None]:
# Define the optimizer using the Adam algorithm with a learning rate of 0.01.
optimizer = torch.optim.Adam(regression_net.parameters(), lr=0.01)

# Forward pass: Compute predictions using the regression network.
predictions = regression_net(state_vars_tensor)
loss = loss_fn(predictions, expected_return_tensor)

# Backpropagation and optimization steps.
optimizer.zero_grad()   # Clears any existing gradients.
loss.backward()         # Computes the new gradients using the chain rule.
optimizer.step()        # Applies the computed gradients to update the model's weights.

print(f"Loss at this training step (computed before the weight update): {loss.item():.4f}")

You can run the code cell above a few times. Hopefully, you see the loss get smaller than the initial loss!

In practice, the network is trained over many iterations (epochs). During each epoch, the following occurs:
1. **Forward Pass:** Compute predictions for all input data.
2. **Loss Calculation:** Evaluate the MSE loss.
3. **Backpropagation:** Use the chain rule to compute gradients.
4. **Weight Update:** Adjust weights using gradient descent (Adam optimizer).

This iterative process minimizes the loss, allowing the network to approximate the target function. The theory behind this process stems from optimization theory and the principles of stochastic gradient descent, which underlie many DRL algorithms.

Let's formalize these steps a little better, and write the steps into a loop!

In [None]:
def train_network(model, optimizer, loss_fn, inputs, targets, num_epochs=1000, log_interval=100):
    """
    Train the neural network model using a training loop.
    
    Args:
        model (nn.Module): The neural network to be trained.
        optimizer (torch.optim.Optimizer): Optimizer for updating model parameters.
        loss_fn (nn.Module): Loss function to compute prediction error.
        inputs (torch.Tensor): Input data.
        targets (torch.Tensor): Target data.
        num_epochs (int): Number of training epochs.
        log_interval (int): Interval (in epochs) to log the loss.
    
    Returns:
        list: History of loss values logged during training.
    """
    loss_history = []
    
    for epoch in range(num_epochs):
        # Forward pass: compute predictions.
        predictions = model(inputs)
        loss = loss_fn(predictions, targets)
        
        # Backpropagation: reset gradients, compute new gradients, and update weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Log the loss at specified intervals.
        if (epoch + 1) % log_interval == 0:
            print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
            loss_history.append(loss.item())
    
    return loss_history

# Train the regression network using the defined training loop.
loss_history = train_network(regression_net, optimizer, loss_fn, state_vars_tensor, expected_return_tensor, num_epochs=1000, log_interval=100)

***
## 10. Evaluating Model Performance

After training the model over multiple epochs, it is important to evaluate the network’s performance by comparing the network's final predictions to the true data. The goal is for the predicted values to align closely with the target values.

Since we use neural networks as function approximators in DRL, we need them to accurately represent the value or policy function. The closer the predicted values are to the true expected rewards, the better the agent can learn and make decisions.

In our DRL context, we can think of $y_i$ as the expected reward for a given state $s_i$, and $\hat{y}_i$ as the predicted reward from the neural network.

A close alignment between $y_i$ and $\hat{y}_i$ indicates that the network has learned the underlying relationship in the data.

In [None]:
# Generate final predictions without computing gradients.
with torch.no_grad():
    final_predictions = regression_net(state_vars_tensor)

# Plot the true data vs. the model's final predictions.
plt.figure(figsize=(8, 6))
plt.scatter(state_vars, expected_return, label="True Data", alpha=0.6)
plt.scatter(state_vars, final_predictions.numpy(), color="red", label="Predicted Data", alpha=0.6)
plt.xlabel("State Variable")
plt.ylabel("Expected Return")
plt.title("Model Predictions vs True Data After Training")
plt.legend()
plt.show()

We should see that the model predictions are much better aligned with the true expected returns! 

Note: we know the true relationship is a straight line with a slope of 0.5. But we can see that the red line is trying to find more complex relationships, and is overfitting the data! There are ways to reduce overfitting in a neural network model, and we'll see some of these as we go through more RL examples. But it is usually a good idea to overfit our models first and then take steps to reduce overfitting (e.g., regularization).

***
From the examples above we should now have an introductory sense of how we can build and train neural network models! We'll build on these skills in the tutorials that follow. The first thing we'll do is see how we can change the outputs of our neural networks to allow for continuous actions! We'll do this in the next tutorial.