In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Bayesian Neural Networks

So far, our discussion of neural networks has focused on the use of maximum likelihood to determine the network parameters (weights and biases). Regularized maximum likelihood can be interpreted as a MAP (maximum a posteriori) approach in which the regularizer can be viewed as the logarithm of a prior parameter distribution. However, in a Bayesian treatment, we need to marginalize over the distribution of parameters in order to make predictions.

In Section 3.3, we developed a Bayesian solution for a simple linear regression model under the assumption of Gaussian noise. We saw that the posterior distribution, which is Gaussian, could be evaluated exactly, and the predictive distribution could also be found in closed form. 

In the case of a multilayered network, the highly nonlinear dependence of the network function on the parameter values means that an exact Bayesian treatment can no longer be found. In fact, the log of the posterior distribution will be non-convex, corresponding to the multiple local minima in the error function. 

The technique of variational inference, to be discussed in Chapter 10, has been applied to Bayesian neural networks using a factorized Gaussian approximation to the posterior distribution (Hinton and van Camp, 1993) and also using a full-covariance Gaussian (Barber and Bishop, 1998a; Barber and Bishop, 1998b). The most complete treatment, however, has been based on the Laplace approximation (MacKay, 1992c; MacKay, 1992b) and forms the basis for the discussion given here. We will approximate the posterior distribution by a Gaussian, centered at a mode of the true posterior. Furthermore, we shall assume that the covariance of this Gaussian is small so that the network function is approximately linear with respect to the parameters over the region of parameter space for which the posterior probability is significantly nonzero. With these two approximations, we will obtain models that are analogous to the linear regression and classification models discussed in earlier chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

## Posterior Parameter Distribution

Consider the problem of predicting a single continuous target variable $ t $ from a vector $ x $ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $ p(t|x) $ is Gaussian, with an $ x $-dependent mean given by the output of a neural network model $ y(x, w) $, and with precision (inverse variance) $ \beta $:

$$
p(t|x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})
$$

Similarly, we shall choose a prior distribution over the weights $ w $ that is Gaussian of the form:

$$
p(w|\alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)
$$

For an i.i.d. data set of $ N $ observations $ x_1, \dots, x_N $, with a corresponding set of target values $ D = \{t_1, \dots, t_N\} $, the likelihood function is given by:

$$
p(D | w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, w), \beta^{-1})
$$

The resulting posterior distribution is then:

$$
p(w | D, \alpha, \beta) \propto p(w|\alpha) p(D | w, \beta)
$$

Because $ y(x, w) $ depends nonlinearly on $ w $, this posterior is non-Gaussian. We can find a Gaussian approximation to it by using the Laplace approximation. To do this, we must first find a (local) maximum of the posterior, which has to be done by iterative numerical optimization.

As usual, it is convenient to maximize the logarithm of the posterior, which can be written as:

$$
\ln p(w|D) = - \frac{\alpha}{2} w^T w - \sum_{n=1}^{N} \frac{1}{2} \beta \left( y(x_n, w) - t_n \right)^2 + \text{constant}
$$

This corresponds to a regularized sum-of-squares error function. Assuming for the moment that $ \alpha $ and $ \beta $ are fixed, we can find a maximum of the posterior, which we denote $ w_{\text{MAP}} $, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.
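As a rough sketch of this optimization step, the regularized sum-of-squares objective can be passed to SciPy's conjugate-gradient routine, with the gradient computed by hand-written backpropagation. The toy data, the 1-3-1 tanh network, the values of `alpha` and `beta`, and the helper names `unpack` and `neg_log_posterior` are all assumptions made for illustration. Unlike the training code later in this notebook, the prior here covers the biases as well as the weights.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, M = 20, 3                                   # data points and hidden units (assumed)
alpha, beta = 0.1, 100.0                       # assumed fixed hyperparameters
X = np.linspace(-1.0, 1.0, N).reshape(-1, 1)   # toy inputs
t = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(N)  # toy targets

def unpack(w):
    """Split the flat parameter vector into (w1, b1, w2, b2)."""
    return w[:M], w[M:2 * M], w[2 * M:3 * M], w[3 * M]

def neg_log_posterior(w):
    """Return beta/2 * SSE + alpha/2 * w^T w and its gradient."""
    w1, b1, w2, b2 = unpack(w)
    a = np.tanh(X * w1 + b1)                   # hidden activations, shape (N, M)
    r = a @ w2 + b2 - t                        # residuals y(x_n, w) - t_n
    value = 0.5 * beta * (r @ r) + 0.5 * alpha * (w @ w)
    d_hidden = np.outer(r, w2) * (1.0 - a ** 2)  # error backpropagated through tanh
    grad = beta * np.concatenate([
        (d_hidden * X).sum(axis=0),            # derivative w.r.t. w1
        d_hidden.sum(axis=0),                  # derivative w.r.t. b1
        a.T @ r,                               # derivative w.r.t. w2
        [r.sum()],                             # derivative w.r.t. b2
    ]) + alpha * w
    return value, grad

w0 = 0.5 * rng.standard_normal(3 * M + 1)
result = minimize(neg_log_posterior, w0, jac=True, method="CG")
w_map = result.x
print("negative log posterior at start:", neg_log_posterior(w0)[0])
print("negative log posterior at w_map:", result.fun)
```

The optimizer only finds a local mode, so a different `w0` may lead to a different `w_map`.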

Having found a mode $ w_{\text{MAP}} $, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From the equation above, this is given by:

$$
A = -\nabla\nabla \ln p(w|D, \alpha, \beta) = \alpha I + \beta H
$$

Here $ H $ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $ w $, evaluated at $ w_{\text{MAP}} $. The corresponding Gaussian approximation to the posterior is then:

$$
q(w|D) = \mathcal{N}(w | w_{\text{MAP}}, A^{-1})
$$
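The construction of $ A $ and of the Gaussian $ q(w|D) $ can be sketched as follows. Note one substitution: instead of the exact Hessian, this sketch uses the outer-product (Gauss-Newton) approximation $ H \approx \sum_n g_n g_n^T $, which keeps $ A $ positive definite by construction. The toy data, the network size, the hyperparameter values, and the helper names `y`, `jacobian`, and `neg_log_post` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M, alpha, beta = 3, 0.1, 100.0                 # assumed size and hyperparameters
X = np.linspace(-1.0, 1.0, 20)                 # toy inputs
t = np.sin(np.pi * X) + 0.1 * rng.standard_normal(20)  # toy targets

def y(x, w):
    """1-M-1 tanh network; x has shape (N,), w is flat with 3M+1 entries."""
    w1, b1, w2, b2 = w[:M], w[M:2 * M], w[2 * M:3 * M], w[3 * M]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

def jacobian(x, w):
    """Rows are g_n = dy(x_n, w)/dw, in the same parameter order as w."""
    w1, b1, w2 = w[:M], w[M:2 * M], w[2 * M:3 * M]
    a = np.tanh(np.outer(x, w1) + b1)
    s = (1.0 - a ** 2) * w2
    return np.hstack([s * x[:, None], s, a, np.ones((x.size, 1))])

def neg_log_post(w):
    return 0.5 * beta * np.sum((y(X, w) - t) ** 2) + 0.5 * alpha * (w @ w)

# Locate a mode (default quasi-Newton optimizer with numerical gradients)
w_map = minimize(neg_log_post, 0.5 * rng.standard_normal(3 * M + 1)).x

J = jacobian(X, w_map)
H = J.T @ J                                    # outer-product approximation to the Hessian
A = alpha * np.eye(w_map.size) + beta * H      # precision matrix of q(w | D)
cov = np.linalg.inv(A)                         # covariance A^{-1} of q(w | D)
posterior_std = np.sqrt(np.diag(cov))          # marginal uncertainty per parameter
print("posterior standard deviations:", posterior_std)
```

For networks with many parameters, forming and inverting $ A $ explicitly becomes expensive, so this direct approach is only practical for small models.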

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:

$$
p(t | x, D) = \int p(t | x, w) q(w | D) dw
$$

However, even with the Gaussian approximation to the posterior, this integration is still analytically intractable due to the nonlinearity of the network function $ y(x, w) $ as a function of $ w $. 

To make progress, we now assume that the posterior distribution has small variance compared with the characteristic scales of $ w $ over which $ y(x, w) $ is varying. This allows us to make a Taylor series expansion of the network function around $ w_{\text{MAP}} $ and retain only the linear terms:

$$
y(x, w) \approx y(x, w_{\text{MAP}}) + g^T (w - w_{\text{MAP}})
$$

Here $ g = \nabla_w y(x, w) \big|_{w=w_{\text{MAP}}} $ denotes the gradient of the network output with respect to the parameters, evaluated at $ w_{\text{MAP}} $.
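A quick numerical check illustrates when this linearization is reasonable. The sketch below compares the exact network output with its first-order expansion for perturbations of decreasing size. The random vector standing in for $ w_{\text{MAP}} $, the input value, and the helper names `y` and `grad_y` are assumptions for illustration.

```python
import numpy as np

M = 3  # hidden units (assumed)

def y(x, w):
    """Scalar output of a 1-M-1 tanh network for a scalar input x."""
    w1, b1, w2, b2 = w[:M], w[M:2 * M], w[2 * M:3 * M], w[3 * M]
    return np.tanh(w1 * x + b1) @ w2 + b2

def grad_y(x, w):
    """g = dy/dw evaluated at w, in the same parameter order as w."""
    w1, b1, w2 = w[:M], w[M:2 * M], w[2 * M:3 * M]
    a = np.tanh(w1 * x + b1)
    s = (1.0 - a ** 2) * w2
    return np.concatenate([s * x, s, a, [1.0]])

rng = np.random.default_rng(2)
w_map = rng.standard_normal(3 * M + 1)       # stand-in for an optimized w_MAP
direction = rng.standard_normal(w_map.size)  # fixed perturbation direction
x = 0.3
g = grad_y(x, w_map)

errors = []
for scale in (1e-1, 1e-2, 1e-3):
    delta = scale * direction
    exact = y(x, w_map + delta)
    linear = y(x, w_map) + g @ delta
    errors.append(abs(exact - linear))
    print(f"perturbation scale {scale:g}: linearization error {errors[-1]:.2e}")
```

The error is dominated by the neglected second-order term, so it should shrink roughly quadratically with the perturbation scale. This is why the approximation relies on the posterior being narrow.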

With this approximation, we now have a linear-Gaussian model with a Gaussian distribution for $ p(w) $ and a Gaussian for $ p(t|w) $ whose mean is a linear function of $ w $:

$$
p(t|x, w, \beta) \approx \mathcal{N}(t | y(x, w_{\text{MAP}}) + g^T (w - w_{\text{MAP}}), \beta^{-1})
$$

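We can therefore use the general result for the marginal distribution of a linear-Gaussian model to obtain:

$$
p(t | x, D, \alpha, \beta) = \mathcal{N}(t | y(x, w_{\text{MAP}}), \sigma^2(x))
$$

where the input-dependent variance $ \sigma^2(x) $ is given below.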


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Define the neural network model
class BayesianNeuralNetwork:
    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        # Initialize weights and biases for a single hidden layer
        self.W1 = np.random.randn(input_dim, hidden_dim)
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = np.random.randn(hidden_dim, 1)
        self.b2 = np.zeros((1, 1))
    
    def forward(self, X):
        """Forward pass of the neural network."""
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = np.tanh(self.Z1)  # Activation function
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        return self.Z2

    def compute_loss(self, X, y, beta=1.0, alpha=1.0):
        """Compute the negative log posterior (including prior and likelihood)."""
        # Predict
        y_pred = self.forward(X)
        
        # Likelihood term (sum of squared errors, with precision beta)
        likelihood = np.sum((y - y_pred) ** 2)
        
        # Prior term (L2 regularization on weights)
        prior = alpha * (np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2))
        
        # Combine the terms: negative log posterior
        return (0.5 * beta * likelihood) + (0.5 * prior)
    
    def gradient(self, X, y, beta=1.0, alpha=1.0):
        """Compute gradients of the negative log posterior."""
        y_pred = self.forward(X)
        
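        # Gradients for output layer (W2, b2); the factor beta comes from
        # differentiating the 0.5 * beta * sum-of-squares term in compute_loss
        dZ2 = beta * (y_pred - y)
        dW2 = np.dot(self.A1.T, dZ2) + alpha * self.W2  # Add L2 regularization term
        db2 = np.sum(dZ2, axis=0, keepdims=True)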
        
        # Gradients for hidden layer (W1, b1)
        dA1 = np.dot(dZ2, self.W2.T)
        dZ1 = dA1 * (1 - self.A1 ** 2)  # Derivative of tanh activation
        dW1 = np.dot(X.T, dZ1) + alpha * self.W1  # Add L2 regularization term
        db1 = np.sum(dZ1, axis=0, keepdims=True)
        
        # Return gradients
        return dW1, db1, dW2, db2
    
    def train(self, X_train, y_train, learning_rate=0.01, epochs=1000, beta=1.0, alpha=1.0):
        """Find w_MAP by gradient descent on the negative log posterior."""
        for epoch in range(epochs):
            # Compute the gradients
            dW1, db1, dW2, db2 = self.gradient(X_train, y_train, beta, alpha)
            
            # Update weights using gradient descent
            self.W1 -= learning_rate * dW1
            self.b1 -= learning_rate * db1
            self.W2 -= learning_rate * dW2
            self.b2 -= learning_rate * db2
            
            # Print the loss every 100 epochs
            if epoch % 100 == 0:
                loss = self.compute_loss(X_train, y_train, beta, alpha)
                print(f"Epoch {epoch}, Loss: {loss}")

    def predict(self, X):
        """Make predictions using the trained model."""
        return self.forward(X)

# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)
y = y.reshape(-1, 1)  # Column vector, so y - y_pred does not broadcast to (N, N)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

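# Standardize inputs and targets, fitting the scalers on the training split only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))

# Instantiate and train the model
input_dim = X_train.shape[1]
hidden_dim = 10  # Hidden layer size
bnn = BayesianNeuralNetwork(input_dim, hidden_dim)

# Train the model; gradients are summed over all training points, so a small
# step size is used to keep plain gradient descent from overshooting
bnn.train(X_train, y_train, learning_rate=0.001, epochs=2000)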

# Make predictions
y_pred = bnn.predict(X_test)

# Plot the results (sort by input so the prediction curve is drawn left to right)
order = np.argsort(X_test[:, 0])
plt.scatter(X_test, y_test, color='blue', label='True Data')
plt.plot(X_test[order], y_pred[order], color='red', label='MAP Prediction')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title("Bayesian Neural Network: MAP Predictions")
plt.show()


### Predictive Distribution with Input-Dependent Variance

The input-dependent variance is given by:

$$
\sigma^2(x) = \beta^{-1} + \mathbf{g}^\top \mathbf{A}^{-1} \mathbf{g}.
$$

We see that the predictive distribution $ p(t | \mathbf{x}, \mathcal{D}) $ is Gaussian, with mean given by the network function $ y(\mathbf{x}, \mathbf{w}_{\text{MAP}}) $ evaluated at the maximum a posteriori (MAP) parameter values.

The variance of the predictive distribution has two components:

1. **Intrinsic Noise**: 
   $$ \beta^{-1} $$
   This term arises from the intrinsic noise on the target variable.

2. **Uncertainty in the Model Parameters**:
   $$ \mathbf{g}^\top \mathbf{A}^{-1} \mathbf{g} $$
   This is an $\mathbf{x}$-dependent term that expresses the uncertainty in the interpolant due to the uncertainty in the model parameters $ \mathbf{w} $.
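The two variance components can be evaluated separately, as in the following sketch. It uses a random vector as a stand-in for $ \mathbf{w}_{\text{MAP}} $ and the outer-product approximation in place of the exact Hessian when forming $ \mathbf{A} $. Both are simplifying assumptions, as are the toy inputs, the hyperparameter values, and the helper name `jacobian`.

```python
import numpy as np

rng = np.random.default_rng(3)
M, alpha, beta = 3, 0.1, 25.0                  # assumed size and hyperparameters
w_map = rng.standard_normal(3 * M + 1)         # stand-in for optimized weights
x_train = np.linspace(-1.0, 1.0, 20)           # toy training inputs
x_query = np.linspace(-2.0, 2.0, 9)            # inputs where error bars are wanted

def jacobian(x, w):
    """Rows are g(x_n) = dy(x_n, w)/dw for a 1-M-1 tanh network."""
    w1, b1, w2 = w[:M], w[M:2 * M], w[2 * M:3 * M]
    a = np.tanh(np.outer(x, w1) + b1)
    s = (1.0 - a ** 2) * w2
    return np.hstack([s * x[:, None], s, a, np.ones((x.size, 1))])

# A = alpha*I + beta*H, with H replaced by the outer-product approximation
J = jacobian(x_train, w_map)
A = alpha * np.eye(w_map.size) + beta * (J.T @ J)

G = jacobian(x_query, w_map)                   # one gradient g per query input
noise_var = 1.0 / beta                         # intrinsic noise, beta^{-1}
# g^T A^{-1} g for every query point, using a linear solve instead of an inverse
param_var = np.einsum("ni,ni->n", G, np.linalg.solve(A, G.T).T)
sigma2 = noise_var + param_var

for xq, pv, s2 in zip(x_query, param_var, sigma2):
    print(f"x = {xq:+.1f}: parameter term {pv:.4f}, total variance {s2:.4f}")
```

Only the parameter term changes with the query input. The noise term sets a floor of $ \beta^{-1} $ on the predictive variance.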

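This predictive distribution can be compared with the corresponding result for the Bayesian linear regression model $ y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x} $, for which:

- Mean:
  $$
  \mu(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}_{\text{MAP}}
  $$

- Variance:
  $$
  \sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{x}^\top \mathbf{S} \mathbf{x}.
  $$

Here $ \mathbf{S} $ is the posterior covariance of the weights. It plays the role of $ \mathbf{A}^{-1} $, and the input $ \mathbf{x} $ plays the role of the gradient $ \mathbf{g} $, since $ \nabla_{\mathbf{w}} (\mathbf{w}^\top \mathbf{x}) = \mathbf{x} $.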
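The correspondence can be checked numerically. The sketch below assumes that $ \mathbf{S} $ denotes the posterior covariance $ (\alpha \mathbf{I} + \beta \mathbf{X}^\top \mathbf{X})^{-1} $ of a linear model with the same Gaussian prior and noise model. The toy design matrix, the generating weights `w_true`, and the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 1.0, 25.0                          # assumed hyperparameters
X = rng.standard_normal((30, 2))                 # toy design matrix, rows are inputs x_n
w_true = np.array([0.5, -1.0])                   # hypothetical generating weights
t = X @ w_true + rng.standard_normal(30) / np.sqrt(beta)

# Posterior over w for y = x^T w with prior N(0, alpha^{-1} I)
S = np.linalg.inv(alpha * np.eye(2) + beta * (X.T @ X))  # posterior covariance
w_map = beta * S @ X.T @ t                       # posterior mean, equal to the MAP estimate

x_new = np.array([1.0, 2.0])
mean = x_new @ w_map
variance = 1.0 / beta + x_new @ S @ x_new

# Same variance from the network formula: for a linear model g = x and H = X^T X
g = x_new
A = alpha * np.eye(2) + beta * (X.T @ X)
variance_network_form = 1.0 / beta + g @ np.linalg.solve(A, g)

print("predictive mean:", mean)
print("variance (linear regression form):", variance)
print("variance (network form with g = x):", variance_network_form)
```

For the linear model the gradient does not depend on $ \mathbf{w} $, so no linearization is involved and the two expressions agree exactly. For the network, the same formula holds only to the extent that the first-order expansion is accurate.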
