In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

![image.png](attachment:image.png)

Fig. 30: The saliency of a parameter, such as a weight, is the increase in the training error when that weight is set to zero. One can approximate the saliency by expanding the true error around a local minimum, $w^*$, and setting the weight to zero. In this example the approximated saliency is smaller than the true saliency; this is typically, but not always, the case.

## Complexity Measurement

### Wald Statistics

The fundamental theory of generalization favors simplicity. For a given level of performance on observed data, models with fewer parameters can be expected to perform better on test data. For instance, weight decay leads to simpler decision boundaries (closer to linear). Likewise, training via cascade-correlation adds weights only as needed.

The fundamental idea in Wald statistics is that we can estimate the importance of a parameter in a model, such as a weight, by how much the training error increases if that parameter is eliminated. To this end, the Optimal Brain Damage (OBD) method deletes the weights whose removal increases the training error the least.
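The definition of saliency can be checked by brute force on a small problem. The sketch below assumes a linear model with a half sum-of-squares training error; the data, the function names `training_error` and `brute_force_saliency`, and the chosen weights are all illustrative and not part of the original text. It fits the model, sets each weight to zero in turn while leaving the others fixed, and records the increase in training error.

```python
import numpy as np

def training_error(w, X, t):
    """Half the sum of squared residuals of the linear model z = X @ w."""
    r = X @ w - t
    return 0.5 * float(r @ r)

def brute_force_saliency(w, X, t):
    """Increase in training error when each weight alone is set to zero."""
    base = training_error(w, X, t)
    out = np.empty_like(w)
    for q in range(w.size):
        w_pruned = w.copy()
        w_pruned[q] = 0.0
        out[q] = training_error(w_pruned, X, t) - base
    return out

# Illustrative data: two large weights, one zero weight, one small weight
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, 0.0, -1.0, 0.05])
t = X @ true_w + 0.1 * rng.normal(size=50)

# Train to the minimum of the error (least squares)
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)

saliencies = brute_force_saliency(w_star, X, t)
print("saliencies:", np.round(saliencies, 4))
print("least important weight:", int(np.argmin(saliencies)))
```

Brute force needs one error evaluation per weight, which is why the second-order approximations discussed next are attractive for larger networks.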

### Optimal Brain Damage and Optimal Brain Surgeon

The **Optimal Brain Surgeon (OBS)** method extends OBD to include the off-diagonal terms of the network's Hessian, which have been shown to matter for pruning on classical benchmark problems. OBD and OBS share the same basic approach: train a network to a (local) minimum in error at weight $ w^* $, and then prune the weight whose removal leads to the smallest increase in the training error.

The predicted functional increase in the error for a change in full weight vector $ \delta w $ is:

$$
\delta J = \nabla_w J^T \delta w + \frac{1}{2} \delta w^T H \delta w + O(\|\delta w\|^3),
$$

where $ H $ is the Hessian matrix. The first term vanishes because we are at a local minimum in error, and we ignore third- and higher-order terms. The general solution for minimizing this function, given the constraint of deleting one weight, is:

$$
\delta w = - \frac{w_q}{\left[ H^{-1} \right]_{qq}} H^{-1} u_q,
$$

where $ w_q $ is the value of weight $ q $ at $ w^* $, and the saliency $ L_q $ of weight $ q $ is:

$$
L_q = \frac{1}{2} \frac{w_q^2}{\left[ H^{-1} \right]_{qq}}.
$$

Here, $ u_q $ is the unit vector along the $ q $-th direction in weight space, and $ L_q $ is the saliency of weight $ q $—an estimate of the increase in training error if weight $ q $ is pruned and the other weights updated accordingly.
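A small numerical check makes the OBS step concrete. The sketch below assumes an exactly quadratic error around $w^*$ and uses the standard OBS expressions $\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} u_q$ and $L_q = \frac{1}{2} w_q^2 / [H^{-1}]_{qq}$. The Hessian `H`, the weights `w_star`, and the helper names are made up for illustration.

```python
import numpy as np

def obs_step(w, H):
    """One OBS pruning step.

    Returns the index q of the weight with the smallest saliency, that
    saliency, and the weight vector after the OBS update.
    """
    H_inv = np.linalg.inv(H)
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    # H_inv[:, q] is H^{-1} u_q, the q-th column of the inverse Hessian
    delta_w = -(w[q] / diag[q]) * H_inv[:, q]
    return q, float(saliency[q]), w + delta_w

# Hypothetical positive definite Hessian and trained weights
H = np.array([[2.0, 0.8, 0.0],
              [0.8, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
w_star = np.array([1.0, -0.4, 0.6])

def error_increase(w):
    """Increase of the quadratic error model relative to its minimum w_star."""
    d = w - w_star
    return 0.5 * float(d @ H @ d)

q, L_q, w_new = obs_step(w_star, H)
print("pruned weight:", q)
print("predicted saliency:", L_q)
print("actual increase:   ", error_increase(w_new))
print("weights after update:", w_new)
```

For a quadratic error the predicted saliency and the actual increase agree, and the pruned weight lands exactly on zero while the other weights shift to compensate.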

### Computing the Inverse Hessian

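For the $ k $-th training pattern $ x_k $ with target $ t_k $ and network output $ z_k = g(x_k; w) $, we define:

$$
X_k = \frac{\partial g(x_k; w)}{\partial w},
$$

and

$$
a_k = \frac{\partial^2 d(t_k, z_k)}{\partial z^2}.
$$

With the outer-product approximation $ H \approx \frac{1}{n} \sum_{k=1}^{n} a_k X_k X_k^T $ over $ n $ patterns, the inverse Hessian can be built up one pattern at a time. The recursion becomes:

$$
H_{m+1}^{-1} = H_m^{-1} - \frac{H_m^{-1} X_{m+1} X_{m+1}^T H_m^{-1}}{\frac{n}{a_{m+1}} + X_{m+1}^T H_m^{-1} X_{m+1}}, \qquad H_0^{-1} = \alpha^{-1} I,
$$

where $ \alpha $ is a small parameter, effectively a weight decay constant.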
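The recursion is a rank-one (Sherman-Morrison) update, so it can be checked against a direct matrix inverse. The sketch below assumes the outer-product form $H = \alpha I + \frac{1}{n}\sum_k a_k X_k X_k^T$ with the initialization $H_0^{-1} = \alpha^{-1} I$. The random gradient vectors and the function name `inverse_hessian_recursive` are illustrative.

```python
import numpy as np

def inverse_hessian_recursive(X, a, alpha=1e-3):
    """Build H^{-1} for H = alpha*I + (1/n) * sum_k a[k] * X[k] X[k]^T.

    X has one gradient vector per row; a holds the (nonzero) second
    derivatives of the error measure. Each pattern contributes one
    rank-one update, so no matrix is ever inverted explicitly.
    """
    n, num_weights = X.shape
    H_inv = np.eye(num_weights) / alpha  # H_0^{-1} = alpha^{-1} I
    for x_k, a_k in zip(X, a):
        Hx = H_inv @ x_k  # H_inv is symmetric, so x_k^T H_inv equals Hx^T
        H_inv = H_inv - np.outer(Hx, Hx) / (n / a_k + x_k @ Hx)
    return H_inv

# Illustrative gradient vectors for 20 patterns and 5 weights
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
a = np.ones(20)  # squared error with the factor 1/2 gives a_k = 1
alpha = 1e-3

H_direct = alpha * np.eye(5) + (X.T * a) @ X / 20
H_inv = inverse_hessian_recursive(X, a, alpha)
print("matches direct inverse:", np.allclose(H_inv, np.linalg.inv(H_direct)))
```

Each update costs on the order of the number of weights squared, compared with the cubic cost of inverting the full Hessian from scratch.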

### Error Measures and Hessian

Note how different error measures $ d(t, z) $ scale the gradient vectors $ X_k $ that form the Hessian. For the squared error $ d(t, z) = \frac{1}{2}(t - z)^2 $, the second derivative with respect to $ z $ is 1, so $ a_k = 1 $ and all gradient vectors are weighted equally. This can be extended to other error measures, such as cross-entropy (Problem 36).
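The scaling factor can be estimated numerically as the second derivative of the error measure with respect to the output. The sketch below is only an illustration: it assumes a single output $z \in (0, 1)$ and uses the binary cross-entropy $d(t, z) = -t \ln z - (1 - t)\ln(1 - z)$ as one possible cross-entropy form. The helper names are hypothetical.

```python
import numpy as np

def second_derivative(d, t, z, h=1e-4):
    """Central finite-difference estimate of the second derivative of d(t, z) in z."""
    return (d(t, z + h) - 2.0 * d(t, z) + d(t, z - h)) / h ** 2

def half_squared_error(t, z):
    return 0.5 * (t - z) ** 2

def cross_entropy(t, z):
    # Binary cross-entropy for a single output z in (0, 1) (assumed form)
    return -(t * np.log(z) + (1.0 - t) * np.log(1.0 - z))

for z in (0.2, 0.5, 0.9):
    a_sq = second_derivative(half_squared_error, 1.0, z)
    a_ce = second_derivative(cross_entropy, 1.0, z)
    print(f"z={z}: squared error a_k={a_sq:.3f}, cross-entropy a_k={a_ce:.3f}")
```

Under these assumptions the squared-error factor is constant, while the cross-entropy factor grows as the output moves away from the target, so patterns with poor outputs contribute more strongly to the Hessian.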

### Saliency Approximation

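In the second-order approximation to the criterion function, Optimal Brain Damage assumes the Hessian matrix is diagonal, so the saliency of weight $ q $ reduces to $ L_q = \frac{1}{2} H_{qq} w_q^2 $ and the remaining weights are left unchanged. Optimal Brain Surgeon uses the full Hessian matrix, so correlations between weights enter through $ [H^{-1}]_{qq} $ and the surviving weights are adjusted when a weight is removed.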
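The two approximations can disagree about which weight to prune when the Hessian has strong off-diagonal terms. The sketch below uses a hypothetical Hessian in which the first two weights are highly correlated. It compares the diagonal (OBD) saliency $\frac{1}{2} H_{qq} w_q^2$ with the full-Hessian (OBS) saliency $\frac{1}{2} w_q^2 / [H^{-1}]_{qq}$. All numbers are illustrative.

```python
import numpy as np

def obd_saliency(w, H):
    """Diagonal approximation: only H_qq is used, other weights stay fixed."""
    return 0.5 * np.diag(H) * w ** 2

def obs_saliency(w, H):
    """Full-Hessian saliency: other weights are allowed to compensate."""
    return w ** 2 / (2.0 * np.diag(np.linalg.inv(H)))

# Hypothetical Hessian: weights 0 and 1 are strongly correlated,
# weight 2 is independent and has low curvature
H = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 0.3]])
w = np.array([1.0, 1.1, 1.0])

obd = obd_saliency(w, H)
obs = obs_saliency(w, H)
print("OBD saliencies:", np.round(obd, 4), "-> prune weight", int(np.argmin(obd)))
print("OBS saliencies:", np.round(obs, 4), "-> prune weight", int(np.argmin(obs)))
```

In this example OBD selects the independent low-curvature weight, while OBS selects one of the correlated pair, because the partner weight can absorb most of the change.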

#### Visualizing Saliency


![image-2.png](attachment:image-2.png)

Fig. 31: In the second-order approximation to the criterion function, Optimal Brain Damage assumes the Hessian matrix is diagonal, while Optimal Brain Surgeon uses the full Hessian matrix.


In [None]:
import numpy as np
import random
import matplotlib.pyplot as plt

class Layer:
    def __init__(self, num_units, input_size):
        self.num_units = num_units
        self.weights = np.random.uniform(-1, 1, (num_units, input_size))

    def forward(self, inputs):
        """Forward pass through the layer"""
        self.inputs = inputs
        self.output = np.maximum(0, np.dot(self.weights, inputs))  # ReLU activation
        return self.output

    def compute_gradients(self, output_error):
        """Compute gradients for the weights"""
        grad = output_error * (self.output > 0)  # Gradient of ReLU
        self.dweights = np.outer(grad, self.inputs)
        return np.dot(self.weights.T, grad)

class Neocognitron:
    def __init__(self, input_size, layer_sizes):
        self.layers = []
        prev_size = input_size
        for size in layer_sizes:
            self.layers.append(Layer(size, prev_size))
            prev_size = size

    def forward(self, inputs):
        """Forward pass through all layers"""
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def compute_loss(self, predicted, true):
        """Calculate the sum-of-squares error over the output units"""
        return np.sum((predicted - true) ** 2)

    def compute_hessian(self, inputs, targets=None, alpha=1e-4):
        """Outer-product approximation H = alpha*I + a * sum_o X_o X_o^T for one pattern.

        X_o is the gradient of output o with respect to all weights. For squared
        error this does not depend on the targets, so `targets` is accepted only
        to keep the original call signature.
        """
        outputs = self.forward(inputs)
        num_params = sum(layer.weights.size for layer in self.layers)
        hessian = alpha * np.eye(num_params)  # alpha*I keeps H invertible
        a = 2.0  # second derivative of (z - t)**2 with respect to z

        for o in range(outputs.size):
            # Backpropagate a one-hot signal to obtain d(output o)/d(weights)
            signal = np.zeros(outputs.size)
            signal[o] = 1.0
            grads = []
            for layer in reversed(self.layers):
                signal = layer.compute_gradients(signal)
                grads.append(layer.dweights.ravel())
            x = np.concatenate(grads[::-1])  # same order as self.layers
            hessian += a * np.outer(x, x)

        return hessian

    def saliency(self, hessian):
        """OBS saliency L_q = w_q^2 / (2 [H^-1]_qq) for every weight"""
        weights = np.concatenate([layer.weights.ravel() for layer in self.layers])
        inv_diag = np.diagonal(np.linalg.inv(hessian))
        return weights ** 2 / (2.0 * inv_diag)

    def prune_weights(self, threshold):
        """Zero every weight whose saliency is below the threshold"""
        # layers[0].inputs holds the network input from the latest forward pass
        saliency = self.saliency(self.compute_hessian(self.layers[0].inputs))
        start = 0
        for layer in self.layers:
            stop = start + layer.weights.size
            low = saliency[start:stop].reshape(layer.weights.shape) < threshold
            layer.weights[low] = 0.0
            start = stop

    def train(self, inputs, targets, num_epochs=10, threshold=0.1, learning_rate=1e-4):
        """Train the network using gradient descent and prune based on saliency"""
        errors = []
        
        for epoch in range(num_epochs):
            # Forward pass
            predicted = self.forward(inputs)
            
            # Compute loss
            loss = self.compute_loss(predicted, targets)
            errors.append(loss)

            # Backpropagate to fill layer.dweights for every layer
            output_error = 2 * (predicted - targets)
            for layer in reversed(self.layers):
                output_error = layer.compute_gradients(output_error)

            # Gradient-descent step; weights already pruned to zero stay at zero
            for layer in self.layers:
                layer.weights -= learning_rate * layer.dweights * (layer.weights != 0)
            
            # Prune weights
            self.prune_weights(threshold)
            
            print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss:.4f}')
        
        # Plot the error over epochs
        plt.plot(range(1, num_epochs + 1), errors, marker='o')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.title('Training Loss over Epochs')
        plt.grid(True)
        plt.show()

# Example usage
if __name__ == "__main__":
    # One random training pattern: 8x8 inputs, 10 output classes.
    # A dense Hessian has num_params**2 entries, so the network is kept small;
    # a 28x28 MNIST-sized input with these layers would need tens of gigabytes.
    input_size = 8 * 8
    output_size = 10  # 10 digits (0-9)
    layer_sizes = [16, 8]  # Example intermediate layer sizes

    # Initialize network
    neocognitron = Neocognitron(input_size, layer_sizes + [output_size])

    # Dummy training data (for example, random values)
    inputs = np.random.rand(input_size)  # Random image input
    targets = np.zeros(output_size)  # One-hot encoded target
    targets[random.randint(0, output_size - 1)] = 1  # Random target class

    # Train the Neocognitron network with pruning.
    # With a single pattern most saliencies are tiny, so the threshold
    # is chosen on that scale.
    neocognitron.train(inputs, targets, num_epochs=5, threshold=1e-7)

    # Make a prediction after training
    test_input = np.random.rand(input_size)  # New random input
    prediction = neocognitron.forward(test_input)
    print(f"Prediction: {prediction}")


In [None]:
# Pure-Python variant of the cell above (no NumPy)
import random
import matplotlib.pyplot as plt

# Define a basic Layer class
class Layer:
    def __init__(self, num_units, input_size):
        self.num_units = num_units
        self.weights = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(num_units)]

    def forward(self, inputs):
        """Forward pass through the layer"""
        self.inputs = inputs
        self.output = [max(0, sum(w * i for w, i in zip(unit_weights, inputs))) for unit_weights in self.weights]
        return self.output

    def compute_gradients(self, output_error):
        """Compute gradients for the weights"""
        grad = [error * (output > 0) for output, error in zip(self.output, output_error)]  # Gradient of ReLU
        self.dweights = [[grad[i] * self.inputs[j] for j in range(len(self.inputs))] for i in range(len(grad))]
        return [sum(self.weights[i][j] * grad[i] for i in range(self.num_units)) for j in range(len(self.inputs))]

# Define the Neocognitron class
class Neocognitron:
    def __init__(self, input_size, layer_sizes):
        self.layers = []
        prev_size = input_size
        for size in layer_sizes:
            self.layers.append(Layer(size, prev_size))
            prev_size = size

    def forward(self, inputs):
        """Forward pass through all layers"""
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs[0]  # Return the first (and only) element in case of single output unit

    def compute_loss(self, predicted, targets):
        """Calculate the squared error for the single output unit"""
        return (predicted - targets[0]) ** 2

    def compute_hessian(self, inputs, targets=None, alpha=1e-4):
        """Outer-product approximation H = alpha*I + 2 * X X^T for one pattern.

        X is the gradient of the network output with respect to all weights.
        For squared error this does not depend on the targets, so `targets`
        is accepted only to keep the original call signature.
        """
        self.forward(inputs)
        signal = [1.0]  # d(output)/d(output)
        per_layer = []
        for layer in reversed(self.layers):
            signal = layer.compute_gradients(signal)
            per_layer.append([g for row in layer.dweights for g in row])

        # Flatten in the same order as self.layers
        x = [g for grads in reversed(per_layer) for g in grads]
        n = len(x)
        # Factor 2 is the second derivative of (z - t)**2; alpha*I keeps H invertible
        return [[2.0 * x[i] * x[j] + (alpha if i == j else 0.0) for j in range(n)]
                for i in range(n)]

    def saliency(self, hessian):
        """OBS saliency L_q = w_q^2 / (2 [H^-1]_qq) for every weight"""
        inv_hessian = self.inverse_matrix(hessian)
        weights = [w for layer in self.layers for row in layer.weights for w in row]
        return [w * w / (2.0 * inv_hessian[q][q]) for q, w in enumerate(weights)]

    def inverse_matrix(self, matrix):
        """Invert a square matrix by Gauss-Jordan elimination with partial pivoting"""
        n = len(matrix)
        # Augment with the identity: [A | I] is reduced to [I | A^-1]
        aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
               for i, row in enumerate(matrix)]
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
            if aug[pivot][col] == 0.0:
                raise ValueError("Matrix is singular")
            aug[col], aug[pivot] = aug[pivot], aug[col]
            p = aug[col][col]
            aug[col] = [v / p for v in aug[col]]
            for r in range(n):
                if r != col and aug[r][col] != 0.0:
                    f = aug[r][col]
                    aug[r] = [v - f * u for v, u in zip(aug[r], aug[col])]
        return [row[n:] for row in aug]

    def prune_weights(self, threshold):
        """Zero every weight whose saliency is below the threshold"""
        # layers[0].inputs holds the network input from the latest forward pass
        saliency = self.saliency(self.compute_hessian(self.layers[0].inputs))
        q = 0
        for layer in self.layers:
            for row in layer.weights:
                for j in range(len(row)):
                    if saliency[q] < threshold:
                        row[j] = 0.0
                    q += 1

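    def train(self, inputs, targets, num_epochs=10, threshold=0.1, learning_rate=0.01):
        """Train the network using gradient descent and prune based on saliency"""
        errors = []

        for epoch in range(num_epochs):
            # Forward pass
            predicted = self.forward(inputs)

            # Compute loss
            loss = self.compute_loss(predicted, targets)
            errors.append(loss)

            # Backpropagate to fill layer.dweights for every layer
            output_error = [2 * (predicted - targets[0])]  # Only one output error
            for layer in reversed(self.layers):
                output_error = layer.compute_gradients(output_error)

            # Gradient-descent step; weights already pruned to zero stay at zero
            for layer in self.layers:
                for i, row in enumerate(layer.weights):
                    for j in range(len(row)):
                        if row[j] != 0:
                            row[j] -= learning_rate * layer.dweights[i][j]

            # Prune weights
            self.prune_weights(threshold)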

            print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss:.4f}')

        # Plot the error over epochs
        plt.plot(range(1, num_epochs + 1), errors, marker='o')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.title('Training Loss over Epochs')
        plt.grid(True)
        plt.show()

# Example usage with a small dataset
if __name__ == "__main__":
    # Input size and layer sizes for a toy example
    input_size = 3  # Small input size (for simplicity)
    layer_sizes = [5, 3]  # Simple layers (5 units in first, 3 in second)

    # Initialize the Neocognitron
    neocognitron = Neocognitron(input_size, layer_sizes + [1])  # Single output unit

    # Toy training data: one input vector and one target value
    inputs = [random.random() for _ in range(input_size)]
    targets = [random.random()]  # Single target value

    # Train the network; with a single pattern most saliencies are tiny,
    # so the threshold is chosen on that scale
    neocognitron.train(inputs, targets, num_epochs=5, threshold=1e-7)

    # Make a prediction after training
    test_input = [random.random() for _ in range(input_size)]
    prediction = neocognitron.forward(test_input)
    print(f"Prediction: {prediction}")
