In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

###  Convolutional Networks

Another approach to creating models that are invariant to certain transformations of the inputs is to build the invariance properties into the structure of a neural network. This is the basis for the convolutional neural network (Le Cun et al., 1989; LeCun et al., 1998), which has been widely applied to image data.

Consider the specific task of recognizing handwritten digits. Each input image comprises a set of pixel intensity values, and the desired output is a posterior probability distribution over the ten digit classes. We know that the identity of the digit is invariant under translations and scaling as well as (small) rotations. 
![image-3.png](attachment:image-3.png)

Furthermore, the network must also exhibit invariance to more subtle transformations such as elastic deformations of the kind illustrated in Fig.14.

![image.png](attachment:image.png)

Fig.17 Diagram illustrating part of a convolutional neural network, showing a layer of convolutional units followed by a layer of subsampling units. Several successive pairs of such layers may be used.

One simple approach would be to treat the image as the input to a fully connected network, such as the kind shown in Fig.1. 

![image-2.png](attachment:image-2.png)

Given a sufficiently large training set, such a network could in principle yield a good solution to this problem and would learn the appropriate invariances by example. However, this approach ignores a key property of images, which is that nearby pixels are more strongly correlated than more distant pixels.

Many of the modern approaches to computer vision exploit this property by extracting local features that depend only on small subregions of the image. Information from such features can then be merged in later stages of processing in order to detect higher-order features and ultimately to yield information about the image as a whole. Also, local features that are useful in one region of the image are likely to be useful in other regions of the image, for instance if the object of interest is translated.

These notions are incorporated into convolutional neural networks through three mechanisms: 
1. Local receptive fields,
2. Weight sharing,
3. Subsampling.

The structure of a convolutional network is illustrated in **Fig.17**. In the convolutional layer, the units are organized into planes, each of which is called a **feature map**. Units in a feature map each take inputs only from a small subregion of the image, and all of the units in a feature map are constrained to share the same weight values. 

For instance, a feature map might consist of 100 units arranged in a $ 10 \times 10 $ grid, with each unit taking inputs from a $ 5 \times 5 $ pixel patch of the image. The whole feature map therefore has 25 adjustable weight parameters plus one adjustable bias parameter. Input values from a patch are linearly combined using the weights and the bias, and the result is transformed by a sigmoidal nonlinearity using the equation:

$$
y(x) = \sigma(w^T x + b)
$$

where $ \sigma $ is the sigmoid activation function, $ w $ is the vector of shared weights, $ x $ is the vector of input values from the patch, and $ b $ is the shared bias.
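The computation of one feature map can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the $ 14 \times 14 $ image size is an assumption (the text only fixes the $ 10 \times 10 $ map and the $ 5 \times 5 $ patch), and the names `feature_map`, `kernel`, and `bias` are introduced here for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feature_map(image, kernel, bias):
    """Apply one shared kernel and bias at every valid position of the image."""
    k = kernel.shape[0]
    rows = image.shape[0] - k + 1
    cols = image.shape[1] - k + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + k, c:c + k]              # local receptive field
            out[r, c] = sigmoid(np.sum(kernel * patch) + bias)
    return out

rng = np.random.default_rng(0)
image = rng.random((14, 14))        # assumed image size, not given in the text
kernel = rng.normal(size=(5, 5))    # 25 shared weights
bias = 0.1                          # 1 shared bias

fmap = feature_map(image, kernel, bias)
num_params = kernel.size + 1        # parameters of the whole feature map
print(fmap.shape, num_params)
```

Every one of the 100 units reuses the same `kernel` and `bias`, which is why the parameter count does not grow with the size of the map.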

If we think of the units as feature detectors, then all of the units in a feature map detect the same pattern but at different locations in the input image. Due to the weight sharing, the evaluation of the activations of these units is equivalent to a convolution of the image pixel intensities with a **kernel** comprising the weight parameters. If the input image is shifted, the activations of the feature map will be shifted by the same amount but will otherwise be unchanged. This provides the basis for the (approximate) invariance of the network outputs to translations and distortions of the input image.
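The shift property can be checked numerically. The sketch below assumes a $ 12 \times 12 $ random image and a one-pixel shift to the right with zero fill at the left edge; both choices, and the helper name `feature_map`, are illustrative rather than taken from the text.

```python
import numpy as np

def feature_map(image, kernel, bias):
    """Shared-weight feature map with a sigmoidal nonlinearity."""
    k = kernel.shape[0]
    rows, cols = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            a = np.sum(kernel * image[r:r + k, c:c + k]) + bias
            out[r, c] = 1.0 / (1.0 + np.exp(-a))
    return out

rng = np.random.default_rng(1)
image = rng.random((12, 12))
kernel = rng.normal(size=(5, 5))

# Move the image one pixel to the right, filling the first column with zeros
shifted = np.zeros_like(image)
shifted[:, 1:] = image[:, :-1]

original_map = feature_map(image, kernel, 0.0)
shifted_map = feature_map(shifted, kernel, 0.0)

# Away from the border, the new map is the old map moved by one column
equivariant = np.allclose(shifted_map[:, 1:], original_map[:, :-1])
print(equivariant)
```

Only the column next to the filled-in border differs, because those units see pixels that were not part of the original image.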

Because we will typically need to detect multiple features in order to build an effective model, there will generally be multiple feature maps in the convolutional layer, each having its own set of weight and bias parameters.

The outputs of the convolutional units form the inputs to the **subsampling layer** of the network. For each feature map in the convolutional layer, there is a plane of units in the subsampling layer, and each unit takes inputs from a small receptive field in the corresponding feature map of the convolutional layer. These units perform subsampling. For instance, each subsampling unit might take inputs from a $ 2 \times 2 $ unit region in the corresponding feature map and would compute the average of those inputs, multiplied by an adaptive weight with the addition of an adaptive bias parameter, and then transformed using a sigmoidal nonlinear activation function.

The receptive fields are chosen to be contiguous and non-overlapping so that there are half the number of rows and columns in the subsampling layer compared with the convolutional layer. In this way, the response of a unit in the subsampling layer will be relatively insensitive to small shifts of the image in the corresponding regions of the input space.
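A subsampling plane of this kind can be sketched as below. The function name `subsample` and the particular values of the adaptive weight and bias are assumptions chosen for illustration.

```python
import numpy as np

def subsample(fmap, weight, bias):
    """Average non-overlapping 2x2 blocks, then scale, shift and squash."""
    rows, cols = fmap.shape[0] // 2, fmap.shape[1] // 2
    # blocks[i, a, j, b] == fmap[2*i + a, 2*j + b]
    blocks = fmap[:2 * rows, :2 * cols].reshape(rows, 2, cols, 2)
    means = blocks.mean(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(weight * means + bias)))

fmap = np.arange(100, dtype=float).reshape(10, 10) / 100.0
pooled = subsample(fmap, weight=1.5, bias=-0.5)
print(pooled.shape)
```

A $ 10 \times 10 $ feature map yields a $ 5 \times 5 $ plane, with a single adaptive weight and bias shared by all of its units.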

In a practical architecture, there may be several pairs of convolutional and subsampling layers. At each stage, there is a larger degree of invariance to input transformations compared to the previous layer. There may be several feature maps in a given convolutional layer for each plane of units in the previous subsampling layer, so that the gradual reduction in spatial resolution is then compensated by an increasing number of features.

The final layer of the network would typically be a fully connected, fully adaptive layer, with a **softmax** output nonlinearity in the case of multiclass classification.

The whole network can be trained by error minimization using **backpropagation** to evaluate the gradient of the error function. This involves a slight modification of the usual backpropagation algorithm to ensure that the shared-weight constraints are satisfied. Due to the use of local receptive fields, the number of weights in the network is smaller than if the network were fully connected. Furthermore, the number of independent parameters to be learned from the data is much smaller still, due to the substantial numbers of constraints on the weights.
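To make the saving concrete, the following arithmetic compares three ways of connecting a single $ 10 \times 10 $ plane of units to the image. The $ 14 \times 14 $ input size is an assumption made only for this example.

```python
image_pixels = 14 * 14   # assumed input size, not stated in the text
units = 10 * 10          # units in one feature map
patch = 5 * 5            # pixels in each local receptive field

fully_connected = units * (image_pixels + 1)  # every unit sees every pixel, plus a bias
local_only = units * (patch + 1)              # local receptive fields, no weight sharing
shared = patch + 1                            # one kernel and one bias for the whole map

print(fully_connected, local_only, shared)
```

Local receptive fields reduce the number of weights, and the sharing constraints reduce the number of independent parameters much further.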



![image.png](attachment:image.png)

Fig.18 The left figure shows a two-link robot arm, in which the Cartesian coordinates $ (x_1, x_2) $ of the end effector are determined uniquely by the two joint angles $ \theta_1 $ and $ \theta_2 $ and the (fixed) lengths $ L_1 $ and $ L_2 $ of the arms. This is known as the forward kinematics of the arm. In practice, we have to find the joint angles that will give rise to a desired end effector position and, as shown in the right figure, this inverse kinematics has two solutions corresponding to ‘elbow up’ and ‘elbow down’.

###  Soft Weight Sharing

One way to reduce the effective complexity of a network with a large number of weights is to constrain weights within certain groups to be equal. This is the technique of weight sharing that was discussed in Section 5.5.6 as a way of building translation invariance into networks used for image interpretation. It is only applicable, however, to particular problems in which the form of the constraints can be specified in advance. Here we consider a form of **soft weight sharing** (Nowlan and Hinton, 1992) in which the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.

Recall that the simple weight decay regularizer, given in (5.112), can be viewed as the negative log of a Gaussian prior distribution over the weights. We can encourage the weight values to form several groups, rather than just one group, by considering instead a probability distribution that is a mixture of Gaussians. The centres and variances of the Gaussian components, as well as the mixing coefficients, will be considered as adjustable parameters to be determined as part of the learning process.

Thus, we have a probability density of the form:

$$
p(w) = \prod_{i} p(w_i)
$$

where 

$$
p(w_i) = \sum_{j=1}^M \pi_j \mathcal{N}(w_i | \mu_j, \sigma_j^2)
$$

and $ \pi_j $ are the mixing coefficients. Taking the negative logarithm then leads to a regularization function of the form:

$$
\Omega(w) = -\sum_{i} \ln \left( \sum_{j=1}^M \pi_j \mathcal{N}(w_i | \mu_j, \sigma_j^2) \right)
$$
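As a rough numerical illustration, the regularizer can be evaluated with a log-sum-exp for stability. The mixture parameters and the two weight vectors below are made-up values, and the helper names `log_joint` and `omega` are introduced here.

```python
import numpy as np

def log_joint(w, pi, mu, sigma):
    """ln(pi_j N(w_i | mu_j, sigma_j^2)) as an array of shape (num_weights, M)."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return (np.log(pi) - 0.5 * np.log(2.0 * np.pi * sigma ** 2)
            - 0.5 * (w - mu) ** 2 / sigma ** 2)

def omega(w, pi, mu, sigma):
    """Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)."""
    lj = log_joint(w, pi, mu, sigma)
    m = lj.max(axis=1, keepdims=True)            # log-sum-exp for stability
    return -np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1)))

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.2, 0.2])

clustered = np.array([-1.0, -1.0, 1.0, 1.0])   # weights sitting on the centres
scattered = np.array([-0.3, 0.1, 0.4, 2.0])    # weights far from both centres

penalty_clustered = omega(clustered, pi, mu, sigma)
penalty_scattered = omega(scattered, pi, mu, sigma)
print(penalty_clustered < penalty_scattered)
```

Weights that sit close to a component centre incur a small penalty, while weights far from every centre are penalized heavily.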

The total error function is then given by:

$$
\tilde{E}(w) = E(w) + \lambda \Omega(w)
$$

where $ E(w) $ is the unregularized error and $ \lambda $ is the regularization coefficient. This error is minimized both with respect to the weights $ w_i $ and with respect to the parameters $ \{\pi_j, \mu_j, \sigma_j\} $ of the mixture model.

If the weights were constant, then the parameters of the mixture model could be determined by using the **EM algorithm** discussed in Chapter 9. However, the distribution of weights is itself evolving during the learning process, and so to avoid numerical instability, a joint optimization is performed simultaneously over the weights and the mixture-model parameters. This can be done using a standard optimization algorithm such as conjugate gradients or quasi-Newton methods.

In order to minimize the total error function, it is necessary to be able to evaluate its derivatives with respect to the various adjustable parameters. To do this, it is convenient to regard the $ \{\pi_j\} $ as prior probabilities and to introduce the corresponding posterior probabilities which, following Bayes' theorem, are given by:

$$
\gamma_j(w) = \frac{\pi_j \mathcal{N}(w | \mu_j, \sigma_j^2)}{\sum_{k=1}^M \pi_k \mathcal{N}(w | \mu_k, \sigma_k^2)}
$$
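The posterior probabilities can be computed directly from this formula. The sketch below uses an assumed two-component mixture and three example weights; `responsibilities` is a name introduced for illustration.

```python
import numpy as np

def responsibilities(w, pi, mu, sigma):
    """gamma_j(w_i) for every weight i (rows) and component j (columns)."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    dens = pi * np.exp(-0.5 * (w - mu) ** 2 / sigma ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)
    return dens / dens.sum(axis=1, keepdims=True)

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([0.5, 0.5])

gamma = responsibilities([-1.0, 0.0, 1.0], pi, mu, sigma)
print(gamma.round(4))
```

A weight lying on a centre is assigned almost entirely to that component, while a weight midway between two identical components is split evenly.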

The derivatives of the total error function $ \tilde{E} $ with respect to the weights are then given by:

$$
\frac{\partial \tilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda \sum_{j} \gamma_j(w_i) \left( \frac{w_i - \mu_j}{\sigma_j^2} \right)
$$

The effect of the regularization term is therefore to pull each weight towards the center of the $ j $-th Gaussian, with a force proportional to the posterior probability of that Gaussian for the given weight. This is precisely the kind of effect that we are seeking.
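The regularization part of this gradient, $ \sum_j \gamma_j(w_i)(w_i - \mu_j)/\sigma_j^2 $, can be verified against central finite differences. The mixture parameters and weights below are arbitrary values chosen for the check, and $ \lambda = 1 $ is assumed.

```python
import numpy as np

pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 0.5])
sigma = np.array([0.4, 0.8])

def densities(w):
    """pi_j N(w_i | mu_j, sigma_j^2), shape (num_weights, M)."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return pi * np.exp(-0.5 * (w - mu) ** 2 / sigma ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

def omega(w):
    return -np.sum(np.log(densities(w).sum(axis=1)))

def grad_omega_w(w):
    """Analytic gradient: sum_j gamma_j(w_i) (w_i - mu_j) / sigma_j^2."""
    w = np.asarray(w, dtype=float)
    dens = densities(w)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    return np.sum(gamma * (w.reshape(-1, 1) - mu) / sigma ** 2, axis=1)

w = np.array([-0.9, 0.0, 1.2])
eps = 1e-6
numeric = np.array([
    (omega(w + eps * np.eye(3)[i]) - omega(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
max_error = np.max(np.abs(numeric - grad_omega_w(w)))
print(max_error < 1e-6)
```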

Derivatives of the total error $ \tilde{E} $ with respect to the centres of the Gaussians are also easily computed to give:

$$
\frac{\partial \tilde{E}}{\partial \mu_j} = \lambda \sum_{i} \gamma_j(w_i) \left( \frac{\mu_j - w_i}{\sigma_j^2} \right)
$$

This has a simple intuitive interpretation, because it pushes $ \mu_j $ towards an average of the weight values, weighted by the posterior probabilities that the respective weight parameters were generated by component $ j $.
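The effect on the centres can be seen in a small experiment in which the weights are held fixed and only the $ \mu_j $ are updated by gradient descent. The weight values, step size, and $ \lambda = 1 $ are assumptions chosen for this sketch.

```python
import numpy as np

w = np.array([-1.1, -0.9, 0.9, 1.1])   # fixed weights forming two clusters
pi = np.array([0.5, 0.5])
sigma = np.array([0.3, 0.3])
mu = np.array([-0.5, 0.5])             # deliberately poor initial centres

def responsibilities(mu):
    d = w.reshape(-1, 1) - mu
    dens = pi * np.exp(-0.5 * d ** 2 / sigma ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)
    return dens / dens.sum(axis=1, keepdims=True)

learning_rate = 0.01
for _ in range(200):
    gamma = responsibilities(mu)
    # sum_i gamma_j(w_i) (mu_j - w_i) / sigma_j^2, one entry per component
    grad_mu = np.sum(gamma * (mu - w.reshape(-1, 1)) / sigma ** 2, axis=0)
    mu = mu - learning_rate * grad_mu

gamma = responsibilities(mu)
weighted_mean = (gamma * w.reshape(-1, 1)).sum(axis=0) / gamma.sum(axis=0)
print(mu, weighted_mean)
```

At convergence each centre coincides with the responsibility-weighted mean of the weights, which is the stationary point of the gradient expression.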

Similarly, the derivatives of the total error $ \tilde{E} $ with respect to the standard deviations $ \sigma_j $ are given by:

$$
\frac{\partial \tilde{E}}{\partial \sigma_j} = \lambda \sum_{i} \gamma_j(w_i) \left( \frac{1}{\sigma_j} - \frac{(w_i - \mu_j)^2}{\sigma_j^3} \right)
$$

Setting this derivative to zero shows that $ \sigma_j^2 $ is driven towards the weighted average of the squared deviations of the weights around the corresponding center $ \mu_j $, where the weighting coefficients are again given by the posterior probability that each weight is generated by component $ j $.
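The sign of each term in this derivative is easy to get wrong, so a finite-difference check is useful. The sketch below tests the expression $ \sum_i \gamma_j(w_i)\left(1/\sigma_j - (w_i - \mu_j)^2/\sigma_j^3\right) $ for the regularizer alone, with $ \lambda = 1 $ and arbitrary illustrative parameter values.

```python
import numpy as np

pi = np.array([0.4, 0.6])
mu = np.array([-0.5, 1.0])
w = np.array([-0.7, 0.2, 0.9, 1.4])

def densities(sigma):
    d = w.reshape(-1, 1) - mu
    return pi * np.exp(-0.5 * d ** 2 / sigma ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

def omega(sigma):
    return -np.sum(np.log(densities(sigma).sum(axis=1)))

def grad_omega_sigma(sigma):
    """Analytic gradient: sum_i gamma_j(w_i) (1/sigma_j - (w_i - mu_j)^2 / sigma_j^3)."""
    dens = densities(sigma)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    d = w.reshape(-1, 1) - mu
    return np.sum(gamma * (1.0 / sigma - d ** 2 / sigma ** 3), axis=0)

sigma = np.array([0.5, 0.9])
eps = 1e-6
numeric = np.array([
    (omega(sigma + eps * np.eye(2)[j]) - omega(sigma - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
max_error = np.max(np.abs(numeric - grad_omega_sigma(sigma)))
print(max_error < 1e-6)
```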

Note that in a practical implementation, new variables $ \xi_j $ defined by:

$$
\sigma_j^2 = \exp(\xi_j)
$$

are introduced, and the minimization is performed with respect to the $ \xi_j $. This ensures that the parameters $ \sigma_j $ remain positive. It also has the effect of discouraging pathological solutions in which one or more of the $ \sigma_j $ goes to zero, corresponding to a Gaussian component collapsing onto one of the weight parameter values.
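The benefit of optimizing the log-variance can be seen with a single gradient step. In this sketch the gradient values are hypothetical numbers chosen only to show the effect, and `log_var` is a name introduced here for the logarithm of $ \sigma_j^2 $.

```python
import numpy as np

log_var = np.log(np.array([0.04, 0.25]))   # log of sigma_j^2
grad = np.array([50.0, -3.0])              # hypothetical gradient values, illustration only
learning_rate = 0.1

# A step in the log-variance always maps back to a positive variance
log_var_new = log_var - learning_rate * grad
sigma2_new = np.exp(log_var_new)

# The same step applied directly to sigma_j^2 can leave the valid region
sigma2_direct = np.exp(log_var) - learning_rate * grad

print(sigma2_new > 0, sigma2_direct > 0)
```

Since $ \sigma_j = \exp(\frac{1}{2}\ln\sigma_j^2) $, the chain rule gives the derivative with respect to the log-variance as $ \frac{\sigma_j}{2} $ times the derivative with respect to $ \sigma_j $.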

For the derivatives with respect to the mixing coefficients $ \pi_j $, we need to take account of the constraints:

$$
\sum_{j} \pi_j = 1, \quad 0 \leq \pi_j \leq 1
$$

This can be done by expressing the mixing coefficients in terms of a set of auxiliary variables $ \eta_j $ using the **softmax** function:

$$
\pi_j = \frac{\exp(\eta_j)}{\sum_{k=1}^M \exp(\eta_k)}
$$

The derivatives of the regularized error function $ \tilde{E} $ with respect to the $ \eta_j $ then take the form:

$$
\frac{\partial \tilde{E}}{\partial \eta_j} = \lambda \sum_{i} \left( \pi_j - \gamma_j(w_i) \right)
$$

We see that $ \pi_j $ is therefore driven towards the average posterior probability for component $ j $.
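The softmax parameterization and the resulting gradient $ \sum_i (\pi_j - \gamma_j(w_i)) $ can be checked numerically in the same way. The three-component mixture and the weights below are arbitrary illustrative values, and $ \lambda = 1 $ is assumed.

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.5])
sigma = np.array([0.5, 0.3, 0.8])
w = np.array([-1.2, -0.1, 0.2, 1.0, 1.9])

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def densities(eta):
    d = w.reshape(-1, 1) - mu
    return softmax(eta) * np.exp(-0.5 * d ** 2 / sigma ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

def omega(eta):
    return -np.sum(np.log(densities(eta).sum(axis=1)))

def grad_omega_eta(eta):
    """Analytic gradient: sum_i (pi_j - gamma_j(w_i))."""
    dens = densities(eta)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    return np.sum(softmax(eta) - gamma, axis=0)

eta = np.array([0.2, -0.4, 0.1])
eps = 1e-6
numeric = np.array([
    (omega(eta + eps * np.eye(3)[j]) - omega(eta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
max_error = np.max(np.abs(numeric - grad_omega_eta(eta)))
print(max_error < 1e-6)
```

The gradient components sum to zero, reflecting the fact that adding a constant to every $ \eta_j $ leaves the mixing coefficients unchanged.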


In [None]:
import math

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


# Soft weight sharing regularizer: a mixture-of-Gaussians prior over scalar weights
class SoftWeightSharing(nn.Module):
    def __init__(self, num_gaussians):
        super().__init__()
        self.num_gaussians = num_gaussians

        # Centres mu_j, spread out so that the components start distinct
        self.mu = nn.Parameter(torch.linspace(-1.0, 1.0, num_gaussians))
        # log_var_j = ln(sigma_j^2): optimizing it keeps every variance positive
        self.log_var = nn.Parameter(torch.full((num_gaussians,), -2.0))
        # eta_j: softmax logits, so the pi_j stay in [0, 1] and sum to one
        self.eta = nn.Parameter(torch.zeros(num_gaussians))

    def log_joint(self, weights):
        # ln(pi_j * N(w_i | mu_j, sigma_j^2)), shape (num_weights, num_gaussians)
        w = weights.reshape(-1, 1)
        log_pi = torch.log_softmax(self.eta, dim=0)
        log_normal = -0.5 * (
            math.log(2.0 * math.pi)
            + self.log_var
            + (w - self.mu) ** 2 / torch.exp(self.log_var)
        )
        return log_pi + log_normal

    def compute_gamma(self, weights):
        # Posterior probabilities gamma_j(w_i) of each component for each weight
        return torch.softmax(self.log_joint(weights), dim=1)

    def forward(self, weights):
        # Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)
        return -torch.logsumexp(self.log_joint(weights), dim=1).sum()


# Define a simple neural network model with Soft Weight Sharing
class SimpleModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_gaussians=3):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.soft_weight_sharing = SoftWeightSharing(num_gaussians)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def get_regularization(self):
        # Regularize the first-layer weight matrix only (biases are left free)
        return self.soft_weight_sharing(self.fc1.weight)


# Training function with regularization
def train(model, train_loader, num_epochs=10, learning_rate=0.001, lambda_reg=0.01):
    criterion = nn.CrossEntropyLoss()
    # model.parameters() includes mu, log_var and eta, so the weights and the
    # mixture parameters are optimized jointly
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, targets in train_loader:
            optimizer.zero_grad()

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            # Total error: data term plus lambda times the regularizer
            regularization_loss = model.get_regularization()
            total_loss = loss + lambda_reg * regularization_loss

            # Backward pass and optimization
            total_loss.backward()
            optimizer.step()

            running_loss += total_loss.item()

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}')


# Example Usage
if __name__ == "__main__":
    torch.manual_seed(0)

    # Example data, assume input dimension of 10, hidden layer of size 5, output of 3 classes
    input_dim = 10
    hidden_dim = 5
    output_dim = 3
    num_gaussians = 3

    # Dummy data loader (replace with real dataset)
    train_data = torch.randn(100, input_dim)
    train_targets = torch.randint(0, output_dim, (100,))
    train_loader = DataLoader(TensorDataset(train_data, train_targets), batch_size=32)

    # Initialize and train the model
    model = SimpleModel(input_dim, hidden_dim, output_dim, num_gaussians)
    train(model, train_loader)


In [None]:
import numpy as np


# Soft weight sharing regularizer over scalar weights, with analytic gradients
class SoftWeightSharing:
    def __init__(self, num_gaussians):
        self.num_gaussians = num_gaussians
        self.mu = np.linspace(-1.0, 1.0, num_gaussians)  # centres mu_j
        self.xi = np.full(num_gaussians, -2.0)           # xi_j = ln(sigma_j^2)
        self.eta = np.zeros(num_gaussians)               # softmax logits of pi_j

    def mixing_coefficients(self):
        # pi_j = softmax(eta)_j
        e = np.exp(self.eta - self.eta.max())
        return e / e.sum()

    def log_joint(self, weights):
        # ln(pi_j * N(w_i | mu_j, sigma_j^2)), shape (num_weights, num_gaussians)
        w = weights.reshape(-1, 1)
        log_normal = -0.5 * (np.log(2.0 * np.pi) + self.xi
                             + (w - self.mu) ** 2 / np.exp(self.xi))
        return np.log(self.mixing_coefficients()) + log_normal

    def compute_gamma(self, weights):
        # Posterior probabilities gamma_j(w_i), normalized over components
        log_joint = self.log_joint(weights)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        gamma = np.exp(log_joint)
        return gamma / gamma.sum(axis=1, keepdims=True)

    def compute_regularization(self, weights):
        # Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)
        log_joint = self.log_joint(weights)
        m = log_joint.max(axis=1, keepdims=True)
        return -np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1)))

    def gradients(self, weights):
        # Derivatives of Omega with respect to the weights, mu, xi and eta
        gamma = self.compute_gamma(weights)
        var = np.exp(self.xi)
        diff = weights.reshape(-1, 1) - self.mu
        grad_w = (gamma * diff / var).sum(axis=1).reshape(weights.shape)
        grad_mu = (-gamma * diff / var).sum(axis=0)
        grad_xi = (0.5 * gamma * (1.0 - diff ** 2 / var)).sum(axis=0)
        grad_eta = (self.mixing_coefficients() - gamma).sum(axis=0)
        return grad_w, grad_mu, grad_xi, grad_eta


# Define a simple feedforward network without using torch
class SimpleModel:
    def __init__(self, input_dim, hidden_dim, output_dim, num_gaussians=3):
        # Manually define the weights and biases
        self.W1 = 0.5 * np.random.randn(input_dim, hidden_dim)   # input to hidden
        self.b1 = np.zeros(hidden_dim)
        self.W2 = 0.5 * np.random.randn(hidden_dim, output_dim)  # hidden to output
        self.b2 = np.zeros(output_dim)

        # Soft weight sharing applied to the first-layer weights
        self.soft_weight_sharing = SoftWeightSharing(num_gaussians)

    def forward(self, x):
        # Cache the hidden pre-activations and activations for backpropagation
        self.z1 = x @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU activation
        return self.a1 @ self.W2 + self.b2

    def get_regularization(self):
        return self.soft_weight_sharing.compute_regularization(self.W1)


# Train the model with per-sample gradient descent
def train(model, train_data, train_targets, num_epochs=10, learning_rate=0.001, lambda_reg=0.01):
    sws = model.soft_weight_sharing
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for x, target in zip(train_data, train_targets):
            x = x.reshape(1, -1)
            target = target.reshape(1, -1)

            # Forward pass with a sum-of-squares error against one-hot targets
            outputs = model.forward(x)
            loss = 0.5 * np.sum((outputs - target) ** 2)
            epoch_loss += loss + lambda_reg * model.get_regularization()

            # Backpropagation: compute every gradient before updating anything
            delta2 = outputs - target
            grad_W2 = model.a1.T @ delta2
            grad_b2 = delta2.sum(axis=0)
            delta1 = (delta2 @ model.W2.T) * (model.z1 > 0)  # derivative of ReLU
            grad_W1 = x.T @ delta1
            grad_b1 = delta1.sum(axis=0)

            # Gradients of the soft weight sharing regularizer
            reg_w, reg_mu, reg_xi, reg_eta = sws.gradients(model.W1)

            # Joint update of the network weights and the mixture parameters
            model.W2 -= learning_rate * grad_W2
            model.b2 -= learning_rate * grad_b2
            model.W1 -= learning_rate * (grad_W1 + lambda_reg * reg_w)
            model.b1 -= learning_rate * grad_b1
            sws.mu -= learning_rate * lambda_reg * reg_mu
            sws.xi -= learning_rate * lambda_reg * reg_xi
            sws.eta -= learning_rate * lambda_reg * reg_eta

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss / len(train_data):.4f}')


# Example usage
if __name__ == "__main__":
    np.random.seed(0)

    # Example data, assume input dimension of 10, hidden layer of size 5, output of 3 classes
    input_dim = 10
    hidden_dim = 5
    output_dim = 3
    num_gaussians = 3

    # Generate random data (100 samples, input dimension = 10) with one-hot targets
    train_data = np.random.randn(100, input_dim)
    labels = np.random.randint(0, output_dim, size=100)
    train_targets = np.eye(output_dim)[labels]

    # Initialize and train the model
    model = SimpleModel(input_dim, hidden_dim, output_dim, num_gaussians)
    train(model, train_data, train_targets)
