In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

# GraphSAGE

Most graph embedding methods require all the nodes in the graph to participate in the training process, which is a property of transductive learning, and cannot directly generalize to nodes that have not been seen before.

Graph SAmple and aggreGatE (GraphSAGE), introduced by Hamilton et al. [50], is an inductive node embedding method for large-scale networks: it can generate embeddings for previously unseen nodes without an additional training process. The framework of GraphSAGE is as follows [50]:

- **Embedding generation**: Unlike embedding approaches that are based on matrix factorization, GraphSAGE leverages node features (e.g., text attributes, node profile information, node degrees) in order to learn an embedding function that generalizes to unseen nodes.
  
- **Parameter learning**: By incorporating node features in the learning of parameters, GraphSAGE simultaneously learns the topological structure of each node’s neighborhood as well as the distribution of node features in the neighborhood.
  
- **Aggregator architecture**: Instead of training a distinct embedding vector for each node, GraphSAGE trains a set of aggregate functions that learn to aggregate feature information from a node’s local neighborhood. Each aggregate function aggregates information from a different number of hops, or search depth, away from a given node.

The aggregate functions applied in GraphSAGE must meet the following two basic requirements:

1. The aggregate functions must operate over an unordered set of vectors because a node’s neighbors have no natural ordering.
  
2. An aggregate function should be symmetric (i.e., invariant to permutations of its inputs), but should still be trainable and maintain high representational capacity.
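The symmetry requirement can be checked numerically. The sketch below is only an illustration: the helper names `mean_agg`, `max_agg`, and `recurrent_agg` are hypothetical and not from the source. It compares two symmetric aggregators with a toy order-dependent recurrence, which stands in for any sequential model that reads its inputs one after another.

```python
import numpy as np

def mean_agg(vectors):
    """Elementwise mean over a set of neighbor vectors (symmetric)."""
    return np.mean(vectors, axis=0)

def max_agg(vectors):
    """Elementwise max over a set of neighbor vectors (symmetric)."""
    return np.max(vectors, axis=0)

def recurrent_agg(vectors, decay=0.5):
    """Toy sequential aggregator: state <- decay * state + x (order-dependent)."""
    state = np.zeros(vectors.shape[1])
    for x in vectors:
        state = decay * state + x
    return state

rng = np.random.default_rng(0)
neighbors = rng.normal(size=(4, 3))   # 4 neighbors, 3-dimensional vectors
shuffled = neighbors[[2, 0, 3, 1]]    # same set, different order

mean_is_symmetric = np.allclose(mean_agg(neighbors), mean_agg(shuffled))
max_is_symmetric = np.allclose(max_agg(neighbors), max_agg(shuffled))
recurrent_is_symmetric = np.allclose(recurrent_agg(neighbors), recurrent_agg(shuffled))
print(mean_is_symmetric, max_is_symmetric, recurrent_is_symmetric)
```

The mean and the max give the same result for any ordering of the neighbors, whereas the sequential aggregator generally does not.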

There are three aggregate functions satisfying the above two basic requirements [50]:

### 1. Mean Aggregate Function

The basic operator simply takes the elementwise mean of the vectors \( h^{k-1}_u \), \( \forall u \in N(v) \):

$$
\text{AGG}_{\text{mean}} = \text{MEAN}(h^{k-1}_u, \forall u \in N(v)) = \frac{1}{|N(v)|} \sum_{u \in N(v)} h^{k-1}_u 
\quad \text{(7.14.10)}
$$

The embedding in the \( k \)th layer of the node \( v \) is given by:

$$
h^k_v = \sigma \left( W_k \cdot \frac{1}{|N(v)|} \sum_{u \in N(v)} h^{k-1}_u + B_k h^{k-1}_v \right), \forall k > 0 
\quad \text{(7.14.11)}
$$

A better choice is the Graph Convolution Network (GCN) neighborhood aggregation, which includes the node itself in the sum and normalizes each term by the neighborhood sizes of both endpoints:

$$
\text{AGG}_{\text{GCN}} = \sum_{u \in N(v) \cup \{v\}} \frac{h^{k-1}_u}{\sqrt{|N(v)| \cdot |N(u)|}}
\quad \text{(7.14.12)}
$$

$$
\Rightarrow h^k_v = \sigma \left( W_k \cdot \sum_{u \in N(v) \cup \{v\}} \frac{h^{k-1}_u}{\sqrt{|N(v)| \cdot |N(u)|}} \right), \forall k > 0 
\quad \text{(7.14.13)}
$$

### 2. LSTM Aggregate Function

This is a more complex aggregator based on an LSTM architecture [63]:

$$
\text{AGG}_{\text{LSTM}} = \text{LSTM}(h^{k-1}_u, \forall u \in N(v)) 
\quad \text{(7.14.14)}
$$

Compared to the mean aggregate function, the advantage of LSTM aggregate functions is their larger expressive capability. However, LSTMs are not inherently symmetric, so it is necessary to adapt LSTMs for operating on an unordered set by simply applying the LSTMs to a random permutation of the node’s neighbors.

### 3. Pooling Aggregate Function

This aggregator is symmetric and trainable. In this pooling approach, each neighbor’s vector is independently fed through a fully connected neural network; following this transformation, an elementwise pooling operation is applied to aggregate information across the neighbor set:

$$
\text{AGG}_{\text{pool}} = \gamma \left( \left\{ W_{\text{pool}} h^{k-1}_{u_i} + b, \ \forall u_i \in N(v) \right\} \right)
\quad \text{(7.14.15)}
$$

where \( W_{\text{pool}} \) and \( b \) are the weight matrix and bias of the fully connected layer, and \( \gamma \) is an elementwise pooling function, usually the max or the mean.
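A minimal NumPy sketch of the pooling aggregator follows. The function name `pool_aggregate`, the argument names, and the use of a ReLU inside the fully connected layer are assumptions made for illustration; the source does not specify the nonlinearity.

```python
import numpy as np

def pool_aggregate(neighbor_h, W_pool, b, reduce="max"):
    """Pooling aggregator sketch.

    neighbor_h: (num_neighbors, d_in) array, one row per neighbor vector.
    W_pool:     (d_out, d_in) weight matrix of the fully connected layer.
    b:          (d_out,) bias vector.
    reduce:     "max" or "mean", the elementwise pooling function gamma.
    """
    # Each neighbor is transformed independently (ReLU is an assumption here).
    transformed = np.maximum(0.0, neighbor_h @ W_pool.T + b)
    if reduce == "max":
        return transformed.max(axis=0)
    return transformed.mean(axis=0)

# Tiny hand-checkable example: identity weights, zero bias.
neighbor_h = np.array([[1.0, -2.0],
                       [0.0,  3.0]])
W_pool = np.eye(2)
b = np.zeros(2)

pooled_max = pool_aggregate(neighbor_h, W_pool, b, reduce="max")
pooled_mean = pool_aggregate(neighbor_h, W_pool, b, reduce="mean")
pooled_max_reordered = pool_aggregate(neighbor_h[::-1], W_pool, b, reduce="max")
print(pooled_max, pooled_mean)
```

Because the transformation is applied per neighbor and the pooling is elementwise, reordering the rows of `neighbor_h` does not change the result.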

### GraphSAGE Algorithm

Algorithm 7.10 shows the GraphSAGE embedding generation algorithm [50]:

```text
Algorithm 7.10 GraphSAGE embedding generation algorithm [50]
input: Graph G(V, E); input features {x_v, ∀ v ∈ V}; depth K; weight matrices W_k, k = 1, ..., K; nonlinearity σ; neighborhood function N : v → 2^V.
h^0_v ← x_v, ∀ v ∈ V
for k = 1, ..., K do
    for v ∈ V do
        h^k_v ← σ(W_k · MEAN({h^{k-1}_u, ∀ u ∈ N(v)}))
    end for
    h^k_v ← h^k_v / ||h^k_v||, ∀ v ∈ V
end for
z_v ← h^K_v, ∀ v ∈ V
output: vector representations z_v for all v ∈ V
```
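Algorithm 7.10 can be sketched in a few lines of NumPy. This is an illustrative forward pass only, under several assumptions that are not in the source: nodes are labeled `0..N-1`, the weight matrices are given (here random, not trained), ReLU is used as the nonlinearity, and the function name `graphsage_embeddings` is hypothetical.

```python
import numpy as np

def graphsage_embeddings(graph, X, weights, eps=1e-12):
    """Forward pass of the mean-aggregator embedding algorithm.

    graph:   dict mapping node index -> list of neighbor indices.
    X:       (N, d_0) array of input features x_v.
    weights: list of K matrices; weights[k] has shape (d_{k+1}, d_k).
    """
    H = X.astype(float)                      # h^0_v = x_v
    for W in weights:                        # k = 1, ..., K
        H_next = np.zeros((H.shape[0], W.shape[0]))
        for v, nbrs in graph.items():
            # Mean of the previous-layer embeddings of the neighbors of v
            agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
            H_next[v] = np.maximum(0.0, W @ agg)   # sigma = ReLU (assumption)
        # L2-normalize every embedding; eps guards against all-zero rows
        norms = np.linalg.norm(H_next, axis=1, keepdims=True)
        H = H_next / np.maximum(norms, eps)
    return H                                  # z_v = h^K_v

graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
rng = np.random.default_rng(0)
X = rng.random((4, 5))                        # 4 nodes, 5 features
weights = [rng.normal(size=(3, 5)), rng.normal(size=(2, 3))]  # K = 2

Z = graphsage_embeddings(graph, X, weights)
print(Z)
```

In this 4-cycle, nodes 0 and 3 share the same neighbor set, as do nodes 1 and 2. Because the update in the algorithm uses only neighbor embeddings, each pair receives identical embeddings.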


In [1]:
import numpy as np

# GraphSAGE Mean Aggregator Implementation

class GraphSAGE:
    def __init__(self, graph, features, output_dim, depth, learning_rate=0.01, epochs=100):
        self.graph = graph  # Dictionary where keys are nodes and values are lists of neighbors
        self.features = features  # Node features, a numpy array of shape (num_nodes, feature_dim)
        self.output_dim = output_dim  # Dimension of the node embeddings
        self.depth = depth  # Number of layers in the GraphSAGE model
        self.learning_rate = learning_rate  # Learning rate for gradient descent
        self.epochs = epochs  # Number of training epochs
        self.weights = self.initialize_weights()  # Initialize weights

    def initialize_weights(self):
        weights = []
        input_dim = self.features.shape[1]
        for _ in range(self.depth):
            W = np.random.randn(input_dim, self.output_dim) * np.sqrt(2.0 / input_dim)
            weights.append(W)
            input_dim = self.output_dim
        return weights

    def mean_aggregator(self, node, neighbor_features):
        if len(neighbor_features) == 0:
            return np.zeros(self.output_dim)
        mean_features = np.mean(neighbor_features, axis=0)
        return mean_features

    def forward(self, node):
        current_features = self.features[node]
        for i in range(self.depth):
            neighbor_features = np.array([self.features[neighbor] for neighbor in self.graph[node]])
            aggregated_features = self.mean_aggregator(node, neighbor_features)
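            # Bug: on the second layer current_features has length output_dim (3),
            # while aggregated_features still has the input dimension (5), so the
            # addition below raises the ValueError shown in the output.
            current_features = np.dot(current_features + aggregated_features, self.weights[i])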
            current_features = np.maximum(0, current_features)  # ReLU activation
        return current_features

    def train(self):
        for epoch in range(self.epochs):
            updated_features = np.zeros_like(self.features)
            for node in self.graph.keys():
                updated_features[node] = self.forward(node)
            self.features = updated_features
            # Normally, you'd have a loss function and backpropagation step here.
            print(f'Epoch {epoch+1}/{self.epochs} completed')

    def get_embeddings(self):
        embeddings = np.zeros((len(self.graph), self.output_dim))
        for node in self.graph.keys():
            embeddings[node] = self.forward(node)
        return embeddings


# Example usage

# Define a simple graph as a dictionary where each key is a node, and the value is a list of neighbors
graph = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2]
}

# Create some random features for each node
features = np.random.rand(4, 5)  # 4 nodes, 5-dimensional features

# Instantiate and train the GraphSAGE model
sage = GraphSAGE(graph, features, output_dim=3, depth=2, learning_rate=0.01, epochs=10)
sage.train()

# Get the node embeddings
embeddings = sage.get_embeddings()
print(embeddings)


ValueError: operands could not be broadcast together with shapes (3,) (5,) 

In [2]:
import numpy as np

# GraphSAGE Mean Aggregator Implementation

class GraphSAGE:
    def __init__(self, graph, features, output_dim, depth, learning_rate=0.01, epochs=100):
        self.graph = graph  # Dictionary where keys are nodes and values are lists of neighbors
        self.features = features  # Node features, a numpy array of shape (num_nodes, feature_dim)
        self.output_dim = output_dim  # Dimension of the node embeddings
        self.depth = depth  # Number of layers in the GraphSAGE model
        self.learning_rate = learning_rate  # Learning rate for gradient descent
        self.epochs = epochs  # Number of training epochs
        self.weights = self.initialize_weights()  # Initialize weights

    def initialize_weights(self):
        weights = []
        input_dim = self.features.shape[1]
        for _ in range(self.depth):
            W = np.random.randn(input_dim, self.output_dim) * np.sqrt(2.0 / input_dim)
            weights.append(W)
            input_dim = self.output_dim  # Update input_dim to the output_dim for the next layer
        return weights

    def mean_aggregator(self, node, neighbor_features):
        if len(neighbor_features) == 0:
            return np.zeros(self.output_dim)
        mean_features = np.mean(neighbor_features, axis=0)
        return mean_features

    def forward(self, node):
        current_features = self.features[node]
        for i in range(self.depth):
            neighbor_features = np.array([self.features[neighbor] for neighbor in self.graph[node]])
            aggregated_features = self.mean_aggregator(node, neighbor_features)
            
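            # Transform the node's own features and the aggregated neighbor features
            # with the same weight matrix and sum them. Neighbors are read from
            # self.features at every layer, so this only runs when output_dim
            # equals the feature dimension.
            current_features = np.dot(current_features, self.weights[i]) + np.dot(aggregated_features, self.weights[i])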
            current_features = np.maximum(0, current_features)  # ReLU activation
        return current_features

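    def train(self):
        # No loss or gradient step is implemented: each epoch re-applies the
        # forward pass and overwrites the stored features with its output,
        # so the weights never change and learning_rate is unused.
        for epoch in range(self.epochs):
            updated_features = np.zeros((len(self.graph), self.output_dim))
            for node in self.graph.keys():
                updated_features[node] = self.forward(node)
            self.features = updated_features
            print(f'Epoch {epoch+1}/{self.epochs} completed')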

    def get_embeddings(self):
        embeddings = np.zeros((len(self.graph), self.output_dim))
        for node in self.graph.keys():
            embeddings[node] = self.forward(node)
        return embeddings


# Example usage

# Define a simple graph as a dictionary where each key is a node, and the value is a list of neighbors
graph = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2]
}

# Create some random features for each node
features = np.random.rand(4, 5)  # 4 nodes, 5-dimensional features

# Instantiate and train the GraphSAGE model
sage = GraphSAGE(graph, features, output_dim=5, depth=2, learning_rate=0.01, epochs=10)
sage.train()

# Get the node embeddings
embeddings = sage.get_embeddings()
print(embeddings)


Epoch 1/10 completed
Epoch 2/10 completed
Epoch 3/10 completed
Epoch 4/10 completed
Epoch 5/10 completed
Epoch 6/10 completed
Epoch 7/10 completed
Epoch 8/10 completed
Epoch 9/10 completed
Epoch 10/10 completed
[[0.         0.         0.         0.         0.00063144]
 [0.         0.         0.         0.         0.00286013]
 [0.         0.         0.         0.         0.00236066]
 [0.         0.         0.         0.         0.00063144]]


### 7.14.3 Graph Convolutional Networks (GCNs)

There are two types of spatial information in graph data:

- **Node information**: Each vertex or node has its own information or characteristics that are represented by the nodes themselves.
- **Structural information**: Each node in the graph data has its own structural information, which is association information between nodes and is represented by edges connecting a node and other nodes.

Generally speaking, graph data should consider not only node information but also structure information. A graph convolutional neural network (GCN) can automatically learn not only node information but also association information between nodes.

Graph Convolutional Networks (GCNs), introduced by Kipf and Welling [82], are machine learning methods that "learn" graph-structured data by extracting spatial features. Because the standard convolution for image or text cannot be directly applied to graphs without a grid structure, a graph must necessarily be mapped onto another spectral domain that does have a grid structure.

Spatial convolution (as in GraphSAGE) is a vertex-domain method: the convolution is defined directly by the connection relationships of each node. Spectral convolution, by contrast, is a frequency-domain method.

Bruna et al. [13] first introduced a convolution for graph data from spectral domains using the graph Laplacian matrix \( L \). This convolution in the spectral domain of a graph is called the spectral convolution. The Laplacian matrix has many important properties. The following are the two points related to GCNs:

- The Laplacian matrix is symmetric and therefore admits an eigenvalue decomposition (spectral decomposition), which defines the spectral domain used by GCNs.
- In each row, the Laplacian matrix has nonzero elements only at the vertex itself (the diagonal entry) and at its first-order connected vertices; all other entries are 0.

In classical signal processing, we have the convolution theorem in the time domain:

$$
f(t) \ast h(t) = \int_{-\infty}^{\infty} \hat{f}(\omega)\hat{h}(\omega)e^{-j \omega t} \, d\omega,
$$

and the convolution theorem in the frequency domain:

$$
\hat{f}(\omega) \ast \hat{h}(\omega) = \int_{-\infty}^{\infty} f(t)h(t)e^{j \omega t} \, dt.
$$

Because a graph signal \( f \) has no grid structure, the standard convolution cannot be directly applied to \( f \). Compared with the graph signal \( f \) in the vertex domain, its spectral graph signal \( \hat{f} \) in the graph spectral domain has a grid structure, and thus the convolution can be directly applied to \( \hat{f} \). The spectral signals \( \hat{f}(\lambda) \) are referred to as kernels of the vertex signals \( f(i) \).

To introduce the convolution on graph signals, we need to search for an orthonormal basis instead of the basis function \( e^{\pm j \omega t} \) in the standard convolution on Euclidean structures.

Since both normalized and unnormalized Laplacians are symmetric and positive semi-definite matrices, they admit an eigenvalue decomposition \( L = U \Lambda U^T \), where \( U = [u_1, \dots, u_n] \) are the orthonormal eigenvectors and \( \Lambda = \text{diag}(\lambda_1, \dots, \lambda_n) \) is the diagonal matrix of the corresponding nonnegative eigenvalues (spectrum) \( \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n = 0 \). The eigenvectors play the role of Fourier atoms in classical harmonic analysis, and the eigenvalues can be interpreted as (the square of) frequencies.

Consider signals defined on an undirected, connected, weighted graph \( G(V, E) \), which consists of a finite set of vertices \( V \) with \( |V| = N \), a set of edges \( E \), and a weighted adjacency matrix \( A = [w_{ij}] \). If there is an edge \( e = (i, j) \) connecting vertices \( i \) and \( j \), then the entry \( w_{ij} \) represents the weight of the edge; otherwise, \( w_{ij} = 0 \). If the graph \( G(V, E) \) is not connected and has \( M \) connected components (\( M > 1 \)), then \( G(V, E) \) is separated into \( M \) subgraphs \( G_1, \dots, G_M \). Signals on \( G(V, E) \) are correspondingly split into \( M \) pieces, one per connected component, and the signal on each subgraph is processed independently.

A signal or function \( f: V \rightarrow \mathbb{R} \) defined on the vertices of the graph may be represented as a vector \( f \in \mathbb{R}^N \), where the \( i \)th component of the vector \( f \) represents the function value at the \( i \)th vertex in \( V \).
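The properties of the Laplacian used above can be verified on a small example. The following sketch assumes an arbitrary 4-node unweighted graph chosen for illustration; it builds the unnormalized and normalized Laplacians and checks the eigenvalue decomposition. Note that `np.linalg.eigh` returns eigenvalues in ascending order, the reverse of the ordering used in the text.

```python
import numpy as np

# Adjacency matrix of a small connected, undirected graph (illustrative choice)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = A.shape[0]
deg = A.sum(axis=1)

L = np.diag(deg) - A                                  # unnormalized Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt      # normalized Laplacian

lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T, lam ascending
lam_norm, U_norm = np.linalg.eigh(L_norm)

# A graph signal: one value per vertex
f = np.array([1.0, -2.0, 0.5, 3.0])

# The quadratic form f^T L f sums squared differences across edges,
# which is why small eigenvalues correspond to smooth (low-frequency) signals.
quad_form = f @ L @ f
edge_sum = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)

print("eigenvalues of L:     ", np.round(lam, 4))
print("eigenvalues of L_norm:", np.round(lam_norm, 4))
```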


### Graph Fourier Transforms and Convolutions on Graphs

Suppose we are given two signals \( \mathbf{f} = [f_1, \dots, f_N]^T \) and \( \mathbf{h} = [h_1, \dots, h_N]^T \) on the vertices of graph \( G(V, E) \). By replacing \( e^{\pm j \omega t} \) with the eigenvectors \( u_k \) and \( u_k^* \), one can define the graph Fourier transforms and the graph convolution as follows [136]:

1. **Graph Fourier Transform** \( \hat{\mathbf{f}} \) of any graph signal or function vector \( \mathbf{f} \in \mathbb{R}^N \) on the vertices of \( G(V, E) \) is defined as the expansion of \( \mathbf{f} \) in terms of the eigenvectors \( u_1, \dots, u_N \) of the graph Laplacian \( L \):

   $$
   \hat{f}(\lambda_k) = \sum_{i=1}^{N} f(i) u^*_k(i) \quad \text{or} \quad \hat{\mathbf{f}} = U^H \mathbf{f},
   $$
   where \( U = [u_1, \dots, u_N] \) is the eigenvector matrix of the EVD \( L = U \Lambda U^H \). For a real symmetric Laplacian, \( U \) is real and \( U^H = U^T \).

2. **Inverse Graph Fourier Transform** is given by:

   $$
   f(i) = \sum_{k=1}^{N} \hat{f}(\lambda_k) u_k(i) \quad \text{or} \quad \mathbf{f} = U \hat{\mathbf{f}}.
   $$

3. **Convolution Theorem in the Time Domain** \( f(t) \ast h(t) = \int_{-\infty}^{\infty} \hat{f}(\omega) \hat{h}(\omega) e^{-j \omega t} \, d\omega \) becomes the **Graph Convolution Theorem**:

   $$
   f(i) \ast_G h(i) = \sum_{k=1}^{N} \hat{f}(\lambda_k) \hat{h}(\lambda_k) u_k(i),
   $$
   and can be written in the **Spectral Convolution Form**:

   $$
   \mathbf{f} \ast_G \mathbf{h} = U \text{Diag}(\hat{h}_1, \dots, \hat{h}_N) \hat{\mathbf{f}} = U \text{Diag}(\hat{h}_1, \dots, \hat{h}_N) U^T \mathbf{f},
   $$
   which enforces the property that convolution in the vertex domain is equivalent to multiplication in the graph spectral domain.

4. Given a signal \( \mathbf{x} \in \mathbb{R}^N \) and a filter \( g_\theta = \text{Diag}(\theta_1, \dots, \theta_N) \) parameterized by \( \theta = [\theta_1, \dots, \theta_N]^T \in \mathbb{R}^N \), the spectral convolutions on graphs in Eq. (7.14.21) become:

   $$
   \mathbf{x} \ast g_\theta = U g_\theta U^T \mathbf{x},
   $$
   where \( U \) is the matrix of eigenvectors of the normalized graph Laplacian \( L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^T \), with a diagonal matrix of its eigenvalues \( \Lambda \) and \( U^T \mathbf{x} \) being the graph Fourier transform of \( \mathbf{x} \). \( g_\theta \) can be understood as a function of the eigenvalues of \( L \), i.e., \( g_\theta(\Lambda) \). This eigenvalue function can be well-approximated by a truncated expansion in terms of Chebyshev polynomials \( T_k(x) \) up to the \( K \)th order [51, 82]:

   $$
   g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}),
   $$
   $$
   \text{where } \tilde{\Lambda} = \frac{2 \Lambda}{\lambda_{\text{max}}} - I_N \text{ with } \lambda_{\text{max}} \text{ denoting the largest eigenvalue of } L, \theta \in \mathbb{R}^K \text{ is a vector of Chebyshev coefficients}.
   $$

   From Eqs. (7.14.22) and (7.14.23) it follows that the spectral convolution of a graph signal \( \mathbf{x} \) with a filter \( g_\theta \) is given by [82]:

   $$
   \mathbf{x} \ast g_\theta \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L}) \mathbf{x},
   $$
   $$
   \text{where } \tilde{L} = \frac{2 L}{\lambda_{\text{max}}} - I_N \text{ and } T_k(\tilde{L}) = 2 \tilde{L} T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}) \text{ with } T_0(\tilde{L}) = I_N \text{ and } T_1(\tilde{L}) = \tilde{L}.
   $$

   Limiting the expansion to \( K = 1 \) and approximating \( \lambda_{\text{max}} \approx 2 \), Eq. (7.14.24) simplifies to:

   $$
   \mathbf{x} \ast g_\theta \approx \theta_0 \mathbf{x} + \theta_1 (L - I_N) \mathbf{x} = \theta_0 \mathbf{x} - \theta_1 D^{-1/2} A D^{-1/2} \mathbf{x}.
   $$

   If taking a single parameter \( \theta = \theta_0 = -\theta_1 \), then (7.14.25) gives the following expression [82]:

   $$
   \mathbf{x} \ast g_\theta \approx \theta(I_N + D^{-1/2} A D^{-1/2}) \mathbf{x},
   $$
   $$
   \text{where } I_N + D^{-1/2} A D^{-1/2} \text{ has eigenvalues in the range } [0, 2].
   $$

   Repeated application of the operator in Eq. (7.14.26) can lead to numerical instability and to exploding or vanishing gradients in deep neural network models. To alleviate this problem, the following renormalization technique is used:

   $$
   I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2},
   $$
   $$
   \text{where } \tilde{A} = A + I_N \text{ and } \tilde{D}_{ii} = \sum_{j=1}^{N} \tilde{A}_{ij}.
   $$

   The above definition can be generalized to a signal \( X \in \mathbb{R}^{N \times C} \) with \( C \) input channels (i.e., a \( C \)-dimensional feature vector for every node) and \( F \) filters or feature maps, giving:

   $$
   Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta,
   $$
   $$
   \text{where } Z \in \mathbb{R}^{N \times F} \text{ is the convolved signal matrix and } \Theta \in \mathbb{R}^{C \times F} \text{ is a matrix of filter parameters}.
   $$
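The steps above can be checked numerically. The sketch below uses an arbitrary 4-node graph, an arbitrary signal `x`, and arbitrary Chebyshev coefficients `theta` (all illustrative assumptions). It verifies that the graph Fourier transform is invertible, that filtering in the spectral domain matches the Chebyshev recurrence applied directly in the vertex domain, and that the renormalized operator has eigenvalues no larger than 1 in magnitude.

```python
import numpy as np
from numpy.polynomial import chebyshev

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = A.shape[0]
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt          # normalized Laplacian
lam, U = np.linalg.eigh(L)

# Graph Fourier transform and its inverse
x = np.array([1.0, -2.0, 0.5, 3.0])
x_hat = U.T @ x
x_back = U @ x_hat

# Spectral filtering: g_theta(Lambda) as a Chebyshev series in the rescaled eigenvalues
theta = np.array([0.5, -0.3, 0.2])                   # K = 2
lam_max = lam.max()
g = chebyshev.chebval(2.0 * lam / lam_max - 1.0, theta)
y_spectral = U @ (g * x_hat)

# Same filter in the vertex domain via the recurrence T_k = 2 L~ T_{k-1} - T_{k-2}
L_tilde = 2.0 * L / lam_max - np.eye(N)
T_prev, T_curr = x, L_tilde @ x
y_vertex = theta[0] * T_prev + theta[1] * T_curr
for k in range(2, len(theta)):
    T_next = 2.0 * (L_tilde @ T_curr) - T_prev
    y_vertex = y_vertex + theta[k] * T_next
    T_prev, T_curr = T_curr, T_next

# Eigenvalue ranges before and after renormalization
eig_first_order = np.linalg.eigvalsh(np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt)
A_tilde = A + np.eye(N)
Dt_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
eig_renorm = np.linalg.eigvalsh(Dt_inv_sqrt @ A_tilde @ Dt_inv_sqrt)
print(np.round(eig_first_order, 4), np.round(eig_renorm, 4))
```

The vertex-domain recurrence never needs the eigenvectors, which is the practical reason for using the Chebyshev form.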


### Semi-Supervised Multiclass Classification and Graph Convolutional Networks (GCNs)

For semi-supervised multiclass classification, the loss is defined by the cross-entropy error over all labeled examples as follows:

$$
L = -\sum_{l \in Y_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf},
$$
where \( Y_L \) is the set of node indices that have labels.
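As a small illustration of this loss, the sketch below evaluates the cross-entropy over the labeled nodes only. The function name `masked_cross_entropy`, the example probabilities, and the small constant `eps` inside the logarithm are assumptions for the example.

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx, eps=1e-12):
    """Cross-entropy summed over labeled nodes only.

    Z:           (N, F) predicted class probabilities (rows sum to 1).
    Y:           (N, F) one-hot labels; rows of unlabeled nodes are ignored.
    labeled_idx: indices of the nodes in Y_L.
    """
    Z_l = Z[labeled_idx]
    Y_l = Y[labeled_idx]
    return -np.sum(Y_l * np.log(Z_l + eps))

Z = np.array([[0.7, 0.3],
              [0.2, 0.8],
              [0.5, 0.5]])
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])          # node 2 is unlabeled
labeled_idx = [0, 1]

loss = masked_cross_entropy(Z, Y, labeled_idx)
print(loss)
```

Only nodes 0 and 1 contribute, so the loss equals `-(ln 0.7 + ln 0.8)`; the prediction for the unlabeled node 2 does not affect it.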

#### Two-Layer GCN for Semi-Supervised Node Classification

For a two-layer Graph Convolutional Network (GCN) used for semi-supervised node classification on a graph with a symmetric adjacency matrix \( A \) (binary or weighted), the forward model takes the simple form:

$$
Z = f(X, A) = \text{softmax} \left( \hat{A} \, \text{ReLU} \left( \hat{A} X W^{(0)} \right) W^{(1)} \right),
$$
where:
- \( \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \) is calculated in the pre-processing step.
- \( W^{(0)} \in \mathbb{R}^{C \times H} \) is an input-to-hidden weight matrix for a hidden layer with \( H \) feature maps.
- \( W^{(1)} \in \mathbb{R}^{H \times F} \) is a hidden-to-output weight matrix.
- The softmax activation function is defined as:

$$
\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)},
$$
and is applied row-wise.

#### Characteristics of GCNs

GCNs possess the following four characteristics:

1. **Natural Extension**: GCNs are a natural extension of convolutional neural networks (CNNs) to the graph domain. Graph convolution is widely applicable to nodes and graphs of any topological structure.
2. **Local Characteristics**: GCNs focus on information within the K-order neighborhood centered on a node, which is fundamentally different from Graph Neural Networks (GNNs).
3. **First-Order Characteristics**: After several approximations, a GCN becomes a first-order model. A single-layer GCN processes information from first-order neighbors in graphs, while a multi-layer GCN can handle K-order neighbors.
4. **Parameter Sharing**: The filter parameter \( W \) is shared across all nodes, which is one reason the graph convolution network is named as such.

#### Comparison of GCN Neighborhood Aggregation with Basic Neighborhood Aggregation in GraphSAGE

- **Basic Neighborhood Aggregation**:

$$
h_v^{(k)} = \sigma \left( W^{(k)} \left( \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right) + B^{(k)} h_v^{(k-1)} \right),
$$

where:
  - \( h_v^{(k)} \) is the embedding of node \( v \) in the \( k \)-th layer,
  - \( \sigma(\cdot) \) is a nonlinear activation function,
  - \( \mathcal{N}(v) \) denotes the neighborhood of node \( v \),
  - The summing term denotes the average embedding of the neighboring nodes \( u \in \mathcal{N}(v) \) in the \( (k-1) \)-th layer,
  - The second term denotes the embedding of node \( v \) in the \( (k-1) \)-th layer.

- **GCN Neighborhood Aggregation**:

$$
h_v^{(k)} = \sigma \left( W^{(k)} \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{h_u^{(k-1)}}{\sqrt{|\mathcal{N}(u)|} \cdot \sqrt{|\mathcal{N}(v)|}} \right),
$$

where:
  - \( W^{(k)} \) denotes the same weight matrix for both node \( v \) and its neighboring nodes' embeddings,
  - \( \sqrt{|\mathcal{N}(u)|} \cdot \sqrt{|\mathcal{N}(v)|} \) represents the normalization per neighbor.
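The two aggregation rules can be compared side by side. The sketch below is illustrative: the function names, the ReLU nonlinearity, and the random weights are assumptions. It implements the basic rule both node by node and in matrix form, and implements the GCN rule literally as written above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def basic_aggregation_loop(graph, H, W, B):
    """Basic rule, node by node: sigma(W * mean of neighbors + B * own embedding)."""
    out = np.zeros((H.shape[0], W.shape[0]))
    for v, nbrs in graph.items():
        out[v] = relu(W @ H[nbrs].mean(axis=0) + B @ H[v])
    return out

def basic_aggregation_matrix(A, H, W, B):
    """Same rule in matrix form: sigma(D^{-1} A H W^T + H B^T)."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return relu(D_inv @ A @ H @ W.T + H @ B.T)

def gcn_aggregation_loop(graph, H, W):
    """GCN rule: one shared W, sum over N(v) and v, per-neighbor normalization."""
    out = np.zeros((H.shape[0], W.shape[0]))
    for v, nbrs in graph.items():
        acc = np.zeros(H.shape[1])
        for u in nbrs + [v]:
            acc += H[u] / np.sqrt(len(graph[u]) * len(graph[v]))
        out[v] = relu(W @ acc)
    return out

graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
A = np.zeros((4, 4))
for v, nbrs in graph.items():
    A[v, nbrs] = 1.0

rng = np.random.default_rng(0)
H = rng.random((4, 3))            # previous-layer embeddings
W = rng.normal(size=(2, 3))
B = rng.normal(size=(2, 3))

H_loop = basic_aggregation_loop(graph, H, W, B)
H_mat = basic_aggregation_matrix(A, H, W, B)
H_gcn = gcn_aggregation_loop(graph, H, W)
print(H_loop, H_gcn, sep="\n")
```

The basic rule keeps a separate matrix `B` for the node's own embedding, while the GCN rule folds the node into the sum and uses one shared matrix.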


In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features, dropout=0.5, activation=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.dropout = nn.Dropout(dropout)
        self.activation = activation  # Apply ReLU only on hidden layers
    
    def forward(self, A_hat, X):
        # A_hat is the preprocessed (renormalized) adjacency matrix
        # X is the input feature matrix
        X = self.dropout(X)
        X = torch.mm(A_hat, X)  # Aggregate neighbor features (A_hat is dense here)
        X = self.linear(X)
        return F.relu(X) if self.activation else X

class GCN(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gcn1 = GCNLayer(n_features, n_hidden, dropout)
        # The output layer has no ReLU: softmax is applied directly to its logits
        self.gcn2 = GCNLayer(n_hidden, n_classes, dropout, activation=False)

    def forward(self, A_hat, X):
        X = self.gcn1(A_hat, X)
        X = self.gcn2(A_hat, X)
        return F.log_softmax(X, dim=1)

def preprocess_adjacency_matrix(A):
    # Add self-loops to the adjacency matrix
    A_tilde = A + torch.eye(A.size(0))
    
    # D^(-1/2), built from the degrees of the graph with self-loops
    D_inv_sqrt = torch.diag(torch.pow(A_tilde.sum(1), -0.5))
    
    # Normalized adjacency matrix: D^(-1/2) (A + I) D^(-1/2)
    A_hat = torch.mm(torch.mm(D_inv_sqrt, A_tilde), D_inv_sqrt)
    
    return A_hat

# Example usage:

# Assume A is your adjacency matrix and X is your feature matrix
A = torch.tensor([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=torch.float32)  # Example adjacency matrix
X = torch.tensor([[1, 0], [0, 1], [1, 1]], dtype=torch.float32)  # Example feature matrix

# Number of features, hidden units, and output classes
n_features = X.size(1)
n_hidden = 4
n_classes = 2  # Example with binary classification

# Preprocess the adjacency matrix
A_hat = preprocess_adjacency_matrix(A)

# Define the model
model = GCN(n_features, n_hidden, n_classes)

# Forward pass (eval mode disables dropout, so the output is deterministic)
model.eval()
output = model(A_hat, X)
print(output)


tensor([[-0.6150, -0.7779],
        [-0.6150, -0.7779],
        [-0.6150, -0.7779]], grad_fn=<LogSoftmaxBackward>)


In [4]:
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # for numerical stability
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def normalize_adjacency_matrix(A):
    A_tilde = A + np.eye(A.shape[0])  # Add self-loops
    deg = np.sum(A_tilde, axis=1)  # Node degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^(-1/2)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt  # A_hat = D^(-1/2) * (A + I) * D^(-1/2)

class GCNLayer:
    def __init__(self, in_features, out_features, activation=relu):
        # Initialize weights randomly
        self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.activation = activation  # None means a linear output layer

    def forward(self, A_hat, X):
        Z = A_hat @ X @ self.weights  # Graph Convolution operation
        return self.activation(Z) if self.activation is not None else Z

class GCN:
    def __init__(self, n_features, n_hidden, n_classes):
        self.layer1 = GCNLayer(n_features, n_hidden)
        # No ReLU on the output layer: softmax is applied to its raw scores
        self.layer2 = GCNLayer(n_hidden, n_classes, activation=None)

    def forward(self, A_hat, X):
        H = self.layer1.forward(A_hat, X)  # First GCN layer
        Z = self.layer2.forward(A_hat, H)  # Second GCN layer
        return softmax(Z)  # Apply softmax activation for output

# Example usage:

# Example adjacency matrix (A)
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0]
], dtype=np.float32)

# Example feature matrix (X)
X = np.array([
    [1, 0],
    [0, 1],
    [1, 1]
], dtype=np.float32)

# Number of features, hidden units, and output classes
n_features = X.shape[1]
n_hidden = 4
n_classes = 2  # Example with binary classification

# Preprocess the adjacency matrix
A_hat = normalize_adjacency_matrix(A)

# Define the model
gcn = GCN(n_features, n_hidden, n_classes)

# Forward pass
output = gcn.forward(A_hat, X)
print("Output:\n", output)


Output:
 [[0.6858288 0.3141712]
 [0.6858288 0.3141712]
 [0.6858288 0.3141712]]


## Batch Normalization

Batch normalization (BatchNorm) is a technique used to stabilize and accelerate the training of deep neural networks. It normalizes the output of each layer by adjusting and scaling the activations.

Consider a layer with a \( D \)-dimensional input \( \mathbf{x} = [x_1, \dots, x_D]^T \) and a fully connected matrix \( \mathbf{W} \) for extracting \( d \)-dimensional feature vectors \( \mathbf{y} = \mathbf{W}\mathbf{x} = [y_1, \dots, y_d]^T \), where \( d \leq D \).

Given a mini-batch of feature vectors \( \mathbf{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_N\} \), batch normalization is applied as follows:

### Steps:

1. **Compute the mean** of the mini-batch:
   $$
   \mu_B = \frac{1}{N} \sum_{n=1}^{N} y_n
   $$

2. **Compute the variance** of the mini-batch:
   $$
   \sigma_B^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_B)^2
   $$

3. **Normalize the batch**:
   $$
   \hat{y}_n = \frac{y_n - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$

4. **Scale and shift** the normalized values using learnable parameters \( \gamma \) (scale) and \( \beta \) (shift):
   $$
   z_n = \gamma \hat{y}_n + \beta
   $$

   where \( \epsilon \) is a small constant added to the variance to prevent division by zero.

### Summary:

The transformation can be summarized as:
$$
\text{BN}_{\gamma, \beta}(\mathbf{Y}) = \gamma \hat{\mathbf{Y}} + \beta
$$

where:
$$
\hat{\mathbf{Y}} = \frac{\mathbf{Y} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

Here, $ \gamma $ and $ \beta $ are parameters that are learned during training, allowing the network to adjust the normalized output.
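The four steps translate directly into NumPy. This is a minimal training-mode sketch: the function name `batchnorm_forward` and the default `eps=1e-5` are assumptions, and the running statistics used at inference time are not covered.

```python
import numpy as np

def batchnorm_forward(Y, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch.

    Y:     (N, d) mini-batch, one feature vector y_n per row.
    gamma: (d,) learnable scale.
    beta:  (d,) learnable shift.
    """
    mu = Y.mean(axis=0)                      # step 1: per-feature mean
    var = Y.var(axis=0)                      # step 2: per-feature variance (1/N)
    Y_hat = (Y - mu) / np.sqrt(var + eps)    # step 3: normalize
    Z = gamma * Y_hat + beta                 # step 4: scale and shift
    return Z, Y_hat, mu, var

rng = np.random.default_rng(0)
Y = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
gamma = np.array([1.0, 2.0, 0.5, 1.5])
beta = np.array([0.0, -1.0, 1.0, 0.5])

Z, Y_hat, mu, var = batchnorm_forward(Y, gamma, beta)
print(Y_hat.mean(axis=0), Y_hat.std(axis=0))
print(Z.mean(axis=0), Z.std(axis=0))
```

After normalization each feature has approximately zero mean and unit variance over the batch; after the affine step the batch mean is `beta` and the batch standard deviation is approximately `gamma`.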


## Figure 7.16: Comparison of Network Architectures

**(a)** Network without BatchNorm layer.

**(b)** The same network with a BatchNorm layer inserted after the fully connected layer $ \mathbf{W} $. Both networks have the same loss function $ \hat{L} = L $.

### Batch Normalization Formulas

1. **Mean Calculation**:
   $$
   \mu_B = \frac{1}{N} \sum_{n=1}^{N} y_n
   $$
   (Equation 7.15.7)

2. **Variance Calculation**:
   $$
   \sigma_B^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_B)^2
   $$
   (Equation 7.15.8)

3. **Normalized Value**:
   $$
   \hat{y}_n = \frac{y_n - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$
   (Equation 7.15.9)

4. **BatchNorm Transformation**:
   $$
   z_n = \gamma \hat{y}_n + \beta
   $$
   (Equation 7.15.10)

### Loss Function and Gradients

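Given a loss \( L \) designed for a specific application, the gradients with respect to the BatchNorm parameters, the intermediate quantities, and the input follow from the chain rule: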

- **Gradient with respect to \( \gamma \)**:
  $$
  \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_i} \cdot \hat{y}_i
  $$
  (Equation 7.15.15)

- **Gradient with respect to \( \beta \)**:
  $$
  \frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_i}
  $$

- **Gradient with respect to the normalized output \( \hat{y}_i \)**:
  $$
  \frac{\partial L}{\partial \hat{y}_i} = \frac{\partial L}{\partial z_i} \cdot \gamma
  $$

- **Gradient with respect to the variance \( \sigma_B^2 \)**:
  $$
  \frac{\partial L}{\partial \sigma_B^2} = -\frac{1}{2} \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{y}_i} \cdot (y_i - \mu_B) \cdot \left( \sigma_B^2 + \epsilon \right)^{-\frac{3}{2}}
  $$

- **Gradient with respect to the mean \( \mu_B \)**:
  $$
  \frac{\partial L}{\partial \mu_B} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{y}_i} \cdot \left( - \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{1}{N} \sum_{i=1}^{N} \left( -2 (y_i - \mu_B) \right)
  $$

- **Gradient with respect to the input \( y_i \)**:
  $$
  \frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2 (y_i - \mu_B)}{N} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{N}
  $$



![Comparison of Network Architectures](en4.png)

## Comparison of Network Architectures and Batch Normalization

### (a) Network Without BatchNorm
$$ x \to y \to \text{Activation Function} \to \text{Output} $$
### (b) Network With BatchNorm
$$ x \to \text{Fully Connected Layer } \mathbf{W} \to \text{BatchNorm} \to \text{Activation Function} \to \text{Output} $$
### Comparison

In figure 7.16, we compare two network architectures:

- **(a)** The network without a BatchNorm layer.
- **(b)** The same network as in (a) with a BatchNorm layer inserted after the fully connected layer $\mathbf{W}$. All layer parameters are the same, and the loss function $ \hat{L} = L $ remains unchanged.

### Batch Normalization Formulas

Given a mini-batch $\mathbf{Y} = \{ \mathbf{y}_1, \dots, \mathbf{y}_N \}$, BatchNorm performs the following operations:

1. **Compute the mean**:
   $$
   \mu_B = \frac{1}{N} \sum_{n=1}^{N} y_n
   $$

2. **Compute the variance**:
   $$
   \sigma_B^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_B)^2
   $$

3. **Normalize the batch**:
   $$
   \hat{y}_n = \frac{y_n - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$

4. **Scale and shift**:
   $$
   z_n = \gamma \hat{y}_n + \beta
   $$

### Gradient Computation for Learning Parameters $\gamma$ and $\beta$

To learn the parameters $\gamma$ and $\beta$, we need to compute the gradients of the loss \( L \) with respect to these parameters. The gradients are computed as follows:

1. **Gradient with respect to \(\gamma\)**:
   $$
   \frac{\partial L}{\partial \gamma} = \sum_{i=1}^N \frac{\partial L}{\partial z_i} \cdot \hat{y}_i
   $$

2. **Gradient with respect to \(\beta\)**:
   $$
   \frac{\partial L}{\partial \beta} = \sum_{i=1}^N \frac{\partial L}{\partial z_i}
   $$

3. **Gradient with respect to the normalized output \(\hat{y}_i\)**:
   $$
   \frac{\partial L}{\partial \hat{y}_i} = \frac{\partial L}{\partial z_i} \cdot \gamma
   $$

4. **Gradient with respect to the variance \(\sigma_B^2\)**:
   $$
   \frac{\partial L}{\partial \sigma_B^2} = -\frac{1}{2} \sum_{i=1}^N \frac{\partial L}{\partial \hat{y}_i} \cdot (y_i - \mu_B) \cdot \left( \sigma_B^2 + \epsilon \right)^{-\frac{3}{2}}
   $$

5. **Gradient with respect to the mean \(\mu_B\)**:
   $$
   \frac{\partial L}{\partial \mu_B} = \sum_{i=1}^N \frac{\partial L}{\partial \hat{y}_i} \cdot \left( - \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{1}{N} \sum_{i=1}^N \left( -2 (y_i - \mu_B) \right)
   $$

6. **Gradient with respect to the input \( y_i \)**:
   $$
   \frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2 (y_i - \mu_B)}{N} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{N}
   $$

The gradients with respect to $\gamma$ and $\beta$ are used to update these parameters during training to minimize the loss function, while the gradient with respect to the input $y_i$ is propagated back to the preceding layers.
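One way to gain confidence in a BatchNorm backward pass is to compare it against central finite differences. The sketch below applies the chain rule through the four forward steps, including the paths through $\mu_B$ and $\sigma_B^2$, and checks the result on a toy loss $L = \sum_i w_i z_i$. The helper names `bn_forward` and `bn_backward`, the toy loss, and the step size `h` are assumptions made for illustration.

```python
import numpy as np

def bn_forward(y, gamma, beta, eps=1e-5):
    """BatchNorm forward pass for y of shape (N, D); returns z and a cache."""
    mu = y.mean(axis=0)
    var = y.var(axis=0)
    y_hat = (y - mu) / np.sqrt(var + eps)
    return gamma * y_hat + beta, (y, y_hat, mu, var, gamma, eps)

def bn_backward(dz, cache):
    """Backward pass: dz is dL/dz; returns dL/dy, dL/dgamma, dL/dbeta."""
    y, y_hat, mu, var, gamma, eps = cache
    N = y.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dgamma = np.sum(dz * y_hat, axis=0)
    dbeta = np.sum(dz, axis=0)
    dy_hat = dz * gamma
    # Path through the variance (note the factor -1/2)
    dvar = -0.5 * np.sum(dy_hat * (y - mu), axis=0) * inv_std ** 3
    # Path through the mean (the second term vanishes analytically)
    dmu = -np.sum(dy_hat, axis=0) * inv_std + dvar * np.mean(-2.0 * (y - mu), axis=0)
    dy = dy_hat * inv_std + dvar * 2.0 * (y - mu) / N + dmu / N
    return dy, dgamma, dbeta

rng = np.random.default_rng(0)
y = rng.normal(size=(6, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
w = rng.normal(size=(6, 3))      # fixed weights of the toy loss L = sum(w * z)

z, cache = bn_forward(y, gamma, beta)
dy, dgamma, dbeta = bn_backward(w, cache)   # for this loss, dL/dz = w

# Central finite differences for dL/dy
h = 1e-6
dy_num = np.zeros_like(y)
for idx in np.ndindex(*y.shape):
    y_p, y_m = y.copy(), y.copy()
    y_p[idx] += h
    y_m[idx] -= h
    L_p = np.sum(w * bn_forward(y_p, gamma, beta)[0])
    L_m = np.sum(w * bn_forward(y_m, gamma, beta)[0])
    dy_num[idx] = (L_p - L_m) / (2 * h)

print("max |analytic - numeric|:", np.max(np.abs(dy - dy_num)))
```

The printed difference should be tiny. Dropping the variance or mean terms from `dy` makes the check fail, which shows that the input gradient cannot be reduced to the direct term alone.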


In [6]:
import numpy as np

class BatchNormalization:
    def __init__(self, epsilon=1e-8):
        # Initialize Batch Normalization parameters
        self.epsilon = epsilon
        self.gamma = None  # Scale parameter
        self.beta = None   # Shift parameter
        self.mu = None     # Mean
        self.sigma = None  # Variance
        self.mean_running = None
        self.var_running = None
        self.training = True  # Training mode or inference mode

    def initialize_params(self, D):
        # Initialize gamma and beta
        self.gamma = np.ones(D)
        self.beta = np.zeros(D)
        self.mean_running = np.zeros(D)
        self.var_running = np.ones(D)

    def forward(self, X):
        if self.training:
            # Calculate mean and variance for the mini-batch
            self.mu = np.mean(X, axis=0)
            self.sigma = np.var(X, axis=0)
            
            # Normalize
            self.X_hat = (X - self.mu) / np.sqrt(self.sigma + self.epsilon)
            
            # Scale and shift
            self.out = self.gamma * self.X_hat + self.beta
            
            # Update running mean and variance
            self.mean_running = 0.9 * self.mean_running + 0.1 * self.mu
            self.var_running = 0.9 * self.var_running + 0.1 * self.sigma

            return self.out
        else:
            # Use running mean and variance for inference
            X_hat = (X - self.mean_running) / np.sqrt(self.var_running + self.epsilon)
            return self.gamma * X_hat + self.beta

    def backward(self, d_out, X):
        # Gradients for gamma and beta
        d_gamma = np.sum(d_out * self.X_hat, axis=0)
        d_beta = np.sum(d_out, axis=0)
        
        # Gradient of the normalized output
        d_X_hat = d_out * self.gamma
        
        # Gradient of variance (X is the mini-batch that was passed to forward)
        d_sigma = -0.5 * np.sum(d_X_hat * (X - self.mu) * np.power(self.sigma + self.epsilon, -1.5), axis=0)
        
        # Gradient of mean
        d_mu = -np.sum(d_X_hat / np.sqrt(self.sigma + self.epsilon), axis=0) - 2 * d_sigma * np.mean(X - self.mu, axis=0)
        
        # Gradient of input
        d_X = d_X_hat / np.sqrt(self.sigma + self.epsilon) + d_sigma * 2 * (X - self.mu) / X.shape[0] + d_mu / X.shape[0]
        
        return d_X, d_gamma, d_beta

# Example usage:
if __name__ == "__main__":
    # Define a simple network layer with BatchNorm
    np.random.seed(0)
    X = np.random.randn(10, 5)  # Example data: 10 samples, 5 features

    bn = BatchNormalization()
    bn.initialize_params(X.shape[1])
    
    # Forward pass
    out = bn.forward(X)
    print("Forward pass output:")
    print(out)
    
    # Backward pass (using dummy gradient)
    d_out = np.random.randn(*out.shape)
    d_X, d_gamma, d_beta = bn.backward(d_out, X)
    
    print("\nBackward pass gradients:")
    print("d_X:", d_X)
    print("d_gamma:", d_gamma)
    print("d_beta:", d_beta)


Forward pass output:
[[ 1.91782147  0.00695959  0.93696414  1.55587144  1.40145554]
 [-0.52081245  0.57015465 -0.33249029 -0.17450368 -0.01246507]
 [ 0.47669325  1.08650023  0.69241787 -0.00849167  0.01981701]
 [ 0.64538508  1.12726587 -0.3929259   0.13279046 -1.23979999]
 [-1.92253507  0.26653424  0.80856707 -0.64616055  1.79177121]
 [-0.94522045 -0.35598688 -0.37273495  1.03315631  1.01501921]
 [ 0.48639311 -0.01556559 -1.15973252 -1.56049272 -0.74856917]
 [ 0.48763989  0.85711514  1.1881847  -0.38422632 -0.70430708]
 [-0.58421738 -1.85711655 -2.0791495   1.34171247 -0.90553137]
 [-0.04114746 -1.68586071  0.71089938 -1.28965573 -0.6173903 ]]

### Theorem 7.4

For a BatchNorm network with loss \(\hat{L}\) and an identical non-BatchNorm network with (identical) loss \(L\), the following inequality is true:

$$
\|\nabla_{z_j} \hat{L}\|^2 \leq \frac{\sigma^2}{\gamma_j^2} \left(\|\nabla_{z_j} L\|^2 - \frac{1}{N} \left\langle \nabla_{z_j} L, \hat{z}_j \right\rangle^2 \right)
$$

The reduction of the gradient magnitude \(\|\nabla_{z_j} \hat{L}\| \leq \frac{\sigma}{\gamma_j} \|\nabla_{z_j} L\|\) has an effect even when the scaling of BatchNorm is identical to the original layer scaling (i.e., even when \(\gamma = \sigma_j\)). Because the gradient magnitude \(\|\nabla_{z_j} \hat{L}\|\) captures the Lipschitzness of the loss \(\hat{L}\), BatchNorm exhibits a better Lipschitz constant of the loss \(L\).

### Theorem 7.5

Let \(\hat{g}_j = \nabla_{z_j} L\) and \(H_{jj} = \frac{\partial^2 L}{\partial z_j^2}\) be the gradient vector and Hessian matrix of the loss with respect to the layer outputs, respectively. Then:

$$
\left( \nabla_{z_j} \hat{L} \cdot \frac{\partial \hat{L}}{\partial z_j} \cdot \frac{\partial \gamma}{\partial z_j} \right)^2 \leq \frac{\sigma^2}{N} \left( \hat{g}_j^T H_{jj} \hat{g}_j - \left( \frac{\partial \hat{L}}{\partial z_j} \right)^T \frac{\partial L}{\partial z_j} \right)
$$

If the Hessian matrix \(H_{jj}\) preserves the relative norms of \(\hat{g}_j\) and \(\nabla_{z_j} \hat{L}\), then:

$$
\left( \nabla_{z_j} \hat{L} \right)^T \left( \frac{\partial^2 \hat{L}}{\partial z_j^2} \cdot \frac{\partial L}{\partial z_j} \cdot \frac{\partial L}{\partial z_j} \cdot \hat{g}_j \right) \leq \frac{\sigma^2}{N} \left( \hat{g}_j^T H_{jj} \hat{g}_j - \gamma^2 \left( \hat{g}_j^T \frac{\partial \hat{L}}{\partial z_j} \right) \right)
$$

The quadratic form of the loss Hessian matrix captures the second-order term of the Taylor expansion of the gradient around the current point. Therefore, if the quadratic forms involving the loss Hessian \(H_{jj}\) and the inner product \(\langle \hat{y}_j, \hat{g}_j \rangle\) are nonnegative (both fairly mild assumptions), Theorem 7.5 implies that the quadratic form of the loss Hessian is reduced for a BatchNorm network compared to standard networks, and thus the first-order term (gradient) is more predictive in BatchNorm networks.

### Lemma 7.2

Let \(W^*\) and \(\hat{W}^*\) be the sets of local optima for the weights in the normal and BatchNorm networks, respectively. For any initialization \(W_0\), if \(\langle W^*, W_0 \rangle > 0\), where \(\hat{W}^*\) and \(W^*\) are closest optima for BatchNorm and standard networks, respectively, then:

$$
\|W_0 - \hat{W}^*\|^2 \leq \|W_0 - W^*\|^2 - \frac{\|W^*\|^2}{\|W^*\|} \left\langle W^*, W_0 \right\rangle
$$

This lemma shows that \(\|W_0 - \hat{W}^*\|^2 < \|W_0 - W^*\|^2\). That is, the effect of any initialization \(W_0\) on the closest optima \(\hat{W}^*\) for BatchNorm networks is smaller compared with standard networks. In other words, the initialization in optimization for BatchNorm networks is more favorable.

### Variants and Extensions of Batch Normalization

As pointed out by Ioffe, the dependence of the batch-normalized activations on the entire mini-batch makes BatchNorm powerful, but it also introduces some drawbacks:

- When the training mini-batches are small, the estimates of the mean and variance become less accurate. These inaccuracies are compounded with depth, leading to performance degradation.
- If the training mini-batches do not consist of independent samples, then different activations are produced between training and inference, which may lead to errors during inference.

To address these issues, variants and extensions of batch normalization have been proposed, including:

- **Batch Renormalization**: Uses moving averages of mini-batch statistics to normalize data. The moving averages \((\mu, \sigma^2)\) are given by:

    $$
    \mu = \frac{1}{m} \sum_{i=1}^m \mu_{B_{t-i+1}}
    $$

    $$
    \sigma^2 = \frac{1}{m} \sum_{i=1}^m \sigma_{B_{t-i+1}}^2
    $$

    Normalization using moving averages:

    $$
    \hat{y}_i = \frac{y_i - \mu}{\sigma}
    $$

    $$
    z_i = \gamma \hat{y}_i + \beta
    $$

    However, when the source and target data have different distributions, moving average normalization statistics of the source data may not accurately represent the normalized statistics of the target (or testing) data.


### Algorithm 7.11: Batch Renormalization

**Input:** Feature vectors \( y \) over a training mini-batch \( B = \{y_1, \ldots, y_m\} \); parameters \( \gamma \), \( \beta \); current moving mean \( \mu \) and standard deviation \( \sigma \); moving average update rate \( \alpha \); maximum allowed correction \( r_{\text{max}} \), \( d_{\text{max}} \).

1. Compute the mini-batch mean:
   $$
   \mu_B \leftarrow \frac{1}{m} \sum_{i=1}^m y_i
   $$

2. Compute the mini-batch standard deviation:
   $$
   \sigma_B \leftarrow \sqrt{\frac{1}{m} \sum_{i=1}^m (y_i - \mu_B)^2}
   $$

3. Compute the corrected scale and shift:
   $$
   r \leftarrow \text{clip}_{\left[\frac{1}{r_{\text{max}}},\, r_{\text{max}}\right]}\left(\frac{\sigma_B}{\sigma}\right)
   $$
   $$
   d \leftarrow \text{clip}_{\left[-d_{\text{max}},\, d_{\text{max}}\right]}\left(\frac{\mu_B - \mu}{\sigma}\right)
   $$

4. Normalize the activations:
   $$
   \hat{y}_i \leftarrow \frac{y_i - \mu_B}{\sigma_B} \cdot r + d
   $$

5. Apply the learned affine transformation:
   $$
   z_i \leftarrow \gamma \hat{y}_i + \beta
   $$

6. Update the moving averages:
   $$
   \mu \leftarrow \mu + \alpha (\mu_B - \mu)
   $$
   $$
   \sigma \leftarrow \sigma + \alpha (\sigma_B - \sigma)
   $$

7. **Output:** \( z_i = \text{BatchRenorm}(y_i) \); updated \( \mu \), \( \sigma \).

**Inference:**
   $$
   z \leftarrow \gamma \cdot \frac{y - \mu}{\sigma} + \beta
   $$
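The inference rule can be sketched in a few lines of NumPy. The point of the example is that the output for a sample depends only on the stored moving statistics, not on the other samples in the batch. The function name `batch_renorm_inference` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def batch_renorm_inference(y, gamma, beta, mu, sigma):
    """Inference-time transform: uses only the stored moving mean and std."""
    return gamma * (y - mu) / sigma + beta

# Hypothetical moving statistics for D = 3 features
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([1.0, 2.0, 0.5])
gamma = np.ones(3)
beta = np.zeros(3)

batch = np.array([[0.5, 1.0, 2.5],
                  [1.5, -3.0, 2.0]])

z_batch = batch_renorm_inference(batch, gamma, beta, mu, sigma)
z_single = batch_renorm_inference(batch[0], gamma, beta, mu, sigma)

print(z_batch)
# The first sample gives the same result whether or not it is processed in a batch
print(np.allclose(z_batch[0], z_single))
```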

### Gradient Computation

The gradients of the loss \(L\) follow from backpropagating through steps 4 and 5 of Algorithm 7.11, with the corrections \(r\) and \(d\) treated as constants:

1. Gradient with respect to \(\hat{y}_i\):
   $$
   \frac{\partial L}{\partial \hat{y}_i} = \frac{\partial L}{\partial z_i} \cdot \gamma
   $$

2. Gradient with respect to \(\sigma_B\):
   $$
   \frac{\partial L}{\partial \sigma_B} = \sum_{i=1}^m \frac{\partial L}{\partial \hat{y}_i} \cdot (y_i - \mu_B) \cdot \left(-\frac{r}{\sigma_B^2}\right)
   $$

3. Gradient with respect to \(\mu_B\):
   $$
   \frac{\partial L}{\partial \mu_B} = \sum_{i=1}^m \frac{\partial L}{\partial \hat{y}_i} \cdot \left(-\frac{r}{\sigma_B}\right)
   $$

4. Gradient with respect to the input \( y_i \):
   $$
   \frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{r}{\sigma_B} + \frac{\partial L}{\partial \sigma_B} \cdot \frac{y_i - \mu_B}{m \sigma_B} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}
   $$

5. Gradient with respect to \( \gamma \):
   $$
   \frac{\partial L}{\partial \gamma} = \sum_{i=1}^m \frac{\partial L}{\partial z_i} \cdot \hat{y}_i
   $$

6. Gradient with respect to \( \beta \):
   $$
   \frac{\partial L}{\partial \beta} = \sum_{i=1}^m \frac{\partial L}{\partial z_i}
   $$
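A backward pass for the training-time transform $\hat{y}_i = \frac{y_i - \mu_B}{\sigma_B} \cdot r + d$ can be sketched and checked numerically as follows. The sketch assumes that $r$ and $d$ are held constant during backpropagation, so no gradient flows through them. The names `renorm_forward` and `renorm_backward`, the toy loss $L = \sum_i w_i z_i$, and the chosen values of `r` and `d` are assumptions for this example.

```python
import numpy as np

def renorm_forward(y, gamma, beta, r, d, eps=1e-8):
    """Training-time Batch Renormalization for y of shape (m, D); r, d are constants."""
    mu_B = y.mean(axis=0)
    sigma_B = np.sqrt(y.var(axis=0) + eps)
    y_hat = (y - mu_B) / sigma_B * r + d
    return gamma * y_hat + beta, (y, y_hat, mu_B, sigma_B)

def renorm_backward(dz, cache, gamma, r):
    """Returns dL/dy, dL/dgamma, dL/dbeta, with no gradient through r and d."""
    y, y_hat, mu_B, sigma_B = cache
    m = y.shape[0]
    dy_hat = dz * gamma
    dsigma_B = np.sum(dy_hat * (y - mu_B), axis=0) * (-r / sigma_B ** 2)
    dmu_B = np.sum(dy_hat, axis=0) * (-r / sigma_B)
    dy = dy_hat * r / sigma_B + dsigma_B * (y - mu_B) / (m * sigma_B) + dmu_B / m
    return dy, np.sum(dz * y_hat, axis=0), np.sum(dz, axis=0)

rng = np.random.default_rng(1)
m, D = 5, 3
y = rng.normal(size=(m, D))
gamma, beta = rng.normal(size=D), rng.normal(size=D)
r = np.array([0.8, 1.0, 1.3])    # hypothetical fixed scale corrections
d = np.array([0.1, -0.2, 0.0])   # hypothetical fixed shift corrections
w = rng.normal(size=(m, D))      # weights of the toy loss L = sum(w * z)

z, cache = renorm_forward(y, gamma, beta, r, d)
dy, dgamma, dbeta = renorm_backward(w, cache, gamma, r)

# Central finite differences for dL/dy, keeping r and d fixed
h = 1e-6
dy_num = np.zeros_like(y)
for idx in np.ndindex(*y.shape):
    y_p, y_m = y.copy(), y.copy()
    y_p[idx] += h
    y_m[idx] -= h
    z_p = renorm_forward(y_p, gamma, beta, r, d)[0]
    z_m = renorm_forward(y_m, gamma, beta, r, d)[0]
    dy_num[idx] = np.sum(w * (z_p - z_m)) / (2 * h)

print("max |analytic - numeric|:", np.max(np.abs(dy - dy_num)))
```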

### Note on BatchNorm for CNNs

For convolutional neural networks, the input and output (activation) of a BatchNorm layer are four-dimensional tensors \( Y \), \( Z \in \mathbb{R}^{N \times C \times H \times W} \) with elements \( y_{nij}^k \) and \( z_{nij}^k \), respectively. Here, \( N \) is the number of images in the mini-batch, \( C \) is the number of feature channels, \( H \) and \( W \) are the spatial height and width of the activation map, respectively.


In [7]:
import numpy as np

def batch_renormalization(y, gamma, beta, mu, sigma, alpha, rmax, dmax):
    """
    Applies Batch Renormalization to the feature vectors y over a mini-batch.
    
    Parameters:
    - y: Feature vectors of shape (N, C, H, W)
    - gamma: Scale parameter for BatchNorm
    - beta: Shift parameter for BatchNorm
    - mu: Current moving mean
    - sigma: Current moving standard deviation
    - alpha: Moving average update rate
    - rmax: Maximum allowed correction for the scale
    - dmax: Maximum allowed correction for the shift
    
    Returns:
    - z: Batch normalized and affine-transformed activations
    - mu: Updated moving mean
    - sigma: Updated moving standard deviation
    """
    N, C, H, W = y.shape

    # Compute mini-batch mean
    mu_B = np.mean(y, axis=(0, 2, 3), keepdims=True)
    
    # Compute mini-batch standard deviation
    sigma_B = np.sqrt(np.var(y, axis=(0, 2, 3), keepdims=True) + 1e-8)
    
    # Compute the correction factors
    r = np.clip(sigma_B / sigma, 1 / rmax, rmax)
    d = np.clip((mu_B - mu) / sigma, -dmax, dmax)
    
    # Normalize
    y_hat = (y - mu_B) / sigma_B * r + d
    
    # Apply affine transformation
    z = gamma * y_hat + beta
    
    # Update moving averages
    mu = mu + alpha * (mu_B - mu)
    sigma = sigma + alpha * (sigma_B - sigma)
    
    return z, mu, sigma

# Example usage
N, C, H, W = 32, 64, 32, 32  # Mini-batch size, number of channels, height, width
y = np.random.randn(N, C, H, W)
gamma = np.ones((1, C, 1, 1))
beta = np.zeros((1, C, 1, 1))
mu = np.zeros((1, C, 1, 1))
sigma = np.ones((1, C, 1, 1))
alpha = 0.1
rmax = 3.0
dmax = 3.0

z, mu, sigma = batch_renormalization(y, gamma, beta, mu, sigma, alpha, rmax, dmax)
print("Batch normalized and affine-transformed activations shape:", z.shape)
print("Updated moving mean:", mu)
print("Updated moving standard deviation:", sigma)


Batch normalized and affine-transformed activations shape: (32, 64, 32, 32)
Updated moving mean: [[[[ 3.96129474e-04]]

  [[-1.17578703e-03]]

  [[-1.81113646e-04]]

  [[ 2.14558128e-04]]

  [[ 2.69193033e-04]]

  [[ 3.23901713e-04]]

  [[-2.79679022e-05]]

  [[ 1.84871023e-04]]

  [[-2.93273967e-04]]

  [[ 4.79933556e-04]]

  [[ 7.63765604e-04]]

  [[-6.73948618e-04]]

  [[-1.27552161e-04]]

  [[ 7.73047509e-04]]

  [[ 1.14185056e-03]]

  [[ 6.92246644e-04]]

  [[ 6.46889082e-04]]

  [[ 3.91375650e-04]]

  [[ 4.13265528e-04]]

  [[-1.60871184e-05]]

  [[-7.28630664e-04]]

  [[-2.83280263e-04]]

  [[ 2.59267672e-04]]

  [[-2.45911192e-04]]

  [[-7.44960450e-04]]

  [[-5.87769889e-04]]

  [[ 8.31309267e-04]]

  [[-2.31470418e-04]]

  [[ 2.92184912e-04]]

  [[-2.12327163e-04]]

  [[-3.98613704e-04]]

  [[-3.46699749e-04]]

  [[-3.41094178e-04]]

  [[ 2.05370857e-05]]

  [[-3.65487148e-04]]

  [[ 2.91893980e-04]]

  [[ 7.21974794e-04]]

  [[ 2.99460646e-04]]

  [[ 7.03789828e-05]]

  [[-5

## Layer Normalization (LN)

Layer Normalization (LN) normalizes across all hidden units within a layer, separately for each sample. This helps to mitigate the problem of covariate shift by fixing the mean and variance of the summed inputs within each layer.

### Formulas

The normalization statistics for Layer Normalization are computed as follows:

1. Compute the mean across all channels and spatial locations for each image:

   $$ \mu_{n} = \frac{1}{C \times H \times W} \sum_{i=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nijk} $$

   where \( C \) is the number of channels, \( H \) is the height, and \( W \) is the width. Here, \( i \) is the index of the channel and \( n \) is the index of the image in the mini-batch.

2. Compute the variance across all channels and spatial locations for each image:

   $$ \sigma_{n}^2 = \frac{1}{C \times H \times W} \sum_{i=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nijk} - \mu_{n})^2 $$

   where \( \sigma_{n}^2 \) is the variance for the \(n\)-th image.

Layer Normalization pools its statistics over all \(C\) channels of a sample, and is named as such because it normalizes across the entire layer.

## Instance Normalization (IN)

Instance Normalization (IN) normalizes each channel of each image independently. This technique is useful for tasks where individual image statistics are more relevant than layer-wide statistics.

### Formulas

The normalization statistics for Instance Normalization are computed as follows:

1. Compute the mean across all spatial locations for each channel of each image:

   $$ \mu_{n,i} = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nijk} $$

   where \( H \) is the height and \( W \) is the width. Here, \( i \) is the index of the channel and \( n \) is the index of the image in the mini-batch.

2. Compute the variance across all spatial locations for each channel of each image:

   $$ \sigma_{n,i}^2 = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nijk} - \mu_{n,i})^2 $$

   where \( \sigma_{n,i}^2 \) is the variance for the \(i\)-th channel of the \(n\)-th image.

Instance Normalization is also known as "contrast normalization," and it normalizes each image instance independently.


In [8]:
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, num_features, eps=1e-5, affine=True):
        super(LayerNormalization, self).__init__()
        self.num_features = num_features
        self.eps = eps
        self.affine = affine
        
        if self.affine:
            self.gamma = nn.Parameter(torch.ones(num_features))
            self.beta = nn.Parameter(torch.zeros(num_features))
        else:
            self.gamma = None
            self.beta = None

    def forward(self, x):
        # Compute per-sample mean and variance over channels, height, and width
        mean = x.mean(dim=(1, 2, 3), keepdim=True)
        variance = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        
        # Normalize
        x_normalized = (x - mean) / torch.sqrt(variance + self.eps)
        
        if self.affine:
            # Reshape the per-channel parameters to (1, C, 1, 1) so they broadcast over (N, C, H, W)
            x_normalized = self.gamma.view(1, -1, 1, 1) * x_normalized + self.beta.view(1, -1, 1, 1)
        
        return x_normalized

# Example usage
batch_size, channels, height, width = 16, 3, 32, 32
x = torch.randn(batch_size, channels, height, width)
layer_norm = LayerNormalization(num_features=channels)
output = layer_norm(x)

In [9]:
import numpy as np

class LayerNormalization:
    def __init__(self, num_features, eps=1e-5, affine=True):
        self.num_features = num_features
        self.eps = eps
        self.affine = affine
        if self.affine:
            self.gamma = np.ones(num_features)
            self.beta = np.zeros(num_features)
        else:
            self.gamma = None
            self.beta = None

    def forward(self, x):
        # Compute per-sample mean and variance over channels, height, and width
        mean = np.mean(x, axis=(1, 2, 3), keepdims=True)
        variance = np.var(x, axis=(1, 2, 3), keepdims=True)
        
        # Normalize
        x_normalized = (x - mean) / np.sqrt(variance + self.eps)
        
        if self.affine:
            # Reshape the per-channel parameters to (1, C, 1, 1) so they broadcast over (N, C, H, W)
            x_normalized = self.gamma.reshape(1, -1, 1, 1) * x_normalized + self.beta.reshape(1, -1, 1, 1)
        
        return x_normalized

# Example usage
batch_size, channels, height, width = 16, 3, 32, 32
x = np.random.randn(batch_size, channels, height, width)
layer_norm = LayerNormalization(num_features=channels)
output = layer_norm.forward(x)

In [10]:
import numpy as np

class InstanceNormalization:
    def __init__(self, num_features, eps=1e-5, affine=False):
        self.num_features = num_features
        self.eps = eps
        self.affine = affine
        if self.affine:
            self.gamma = np.ones(num_features)
            self.beta = np.zeros(num_features)
        else:
            self.gamma = None
            self.beta = None

    def forward(self, x):
        # Compute mean and variance for each channel of each instance (over height and width)
        mean = np.mean(x, axis=(2, 3), keepdims=True)
        variance = np.var(x, axis=(2, 3), keepdims=True)
        
        # Normalize
        x_normalized = (x - mean) / np.sqrt(variance + self.eps)
        
        if self.affine:
            # Per-channel parameters of shape (C, 1, 1) broadcast over (N, C, H, W)
            x_normalized = self.gamma[:, None, None] * x_normalized + self.beta[:, None, None]
        
        return x_normalized

# Example usage
batch_size, channels, height, width = 16, 3, 32, 32
x = np.random.randn(batch_size, channels, height, width)
instance_norm = InstanceNormalization(num_features=channels)
output = instance_norm.forward(x)


## Instance Normalization (IN)

Instance Normalization normalizes each instance (image) independently, one channel at a time. The normalization statistics for each image in a mini-batch are calculated as follows:

For the i-th channel of the n-th image in the mini-batch:

- **Mean** $(\mu_{n,i})$:
  $$
  \mu_{n,i} = \frac{1}{H \cdot W} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nijk}
  $$
  
- **Variance** $(\sigma_{n,i}^2)$:
  $$
  \sigma_{n,i}^2 = \frac{1}{H \cdot W} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nijk} - \mu_{n,i})^2
  $$

where:
- C is the number of channels (the channel index i runs from 1 to C),
- H is the height of the image,
- W is the width of the image,
- $y_{nijk}$ represents the value at channel i, row j, and column k of the n-th image.

Instance Normalization is also known as "contrast normalization."

### Explanation

Instance Normalization performs normalization on each image individually, ensuring that each image is normalized separately rather than across the entire batch. This technique can be especially useful in tasks like style transfer where normalization needs to be applied on a per-instance basis.



In [11]:
import numpy as np

def instance_normalization(x, epsilon=1e-5):
    """
    Apply instance normalization to the input tensor x.
    
    Args:
        x (numpy.ndarray): Input tensor with shape (N, C, H, W).
        epsilon (float): Small constant to avoid division by zero.
    
    Returns:
        numpy.ndarray: Normalized tensor.
    """
    # Get dimensions
    N, C, H, W = x.shape
    
    # Initialize output tensor
    normalized_x = np.zeros_like(x)
    
    # Compute normalization for each instance
    for n in range(N):
        for c in range(C):
            # Extract the feature map for this instance and channel
            feature_map = x[n, c, :, :]
            
            # Compute mean and variance
            mean = np.mean(feature_map)
            variance = np.var(feature_map)
            
            # Normalize the feature map
            normalized_feature_map = (feature_map - mean) / np.sqrt(variance + epsilon)
            
            # Store the normalized feature map
            normalized_x[n, c, :, :] = normalized_feature_map
    
    return normalized_x

# Example usage
N, C, H, W = 2, 3, 4, 4  # Example dimensions
x = np.random.rand(N, C, H, W)  # Random input tensor
normalized_x = instance_normalization(x)

print("Input Tensor:\n", x)
print("Normalized Tensor:\n", normalized_x)


Input Tensor:
 [[[[0.04443066 0.48850864 0.57921359 0.96228825]
   [0.16852245 0.35207429 0.27123914 0.86778994]
   [0.48034966 0.86605724 0.84847832 0.72468133]
   [0.81184092 0.78300024 0.32711358 0.536134  ]]

  [[0.06734881 0.52961901 0.81477889 0.60507987]
   [0.84767498 0.28545461 0.86645336 0.27534859]
   [0.43873507 0.0380017  0.58281981 0.30684952]
   [0.89328281 0.46400591 0.4982895  0.41524684]]

  [[0.82624799 0.63863246 0.1922194  0.78312031]
   [0.03110266 0.68066567 0.21641838 0.96076069]
   [0.9811581  0.84264099 0.26262749 0.16667852]
   [0.42575312 0.20685708 0.01237106 0.163002  ]]]


 [[[0.22595378 0.89277324 0.5528927  0.99532619]
   [0.85462134 0.6041483  0.83135015 0.49747264]
   [0.31501688 0.99417827 0.34602359 0.91422741]
   [0.50030688 0.70146195 0.58175094 0.17018546]]

  [[0.29760236 0.95853937 0.54785507 0.36530722]
   [0.47552322 0.14332099 0.83468931 0.18605998]
   [0.38512755 0.7257209  0.90633744 0.68388011]
   [0.98515906 0.7659335  0.11106902 0.59877

## Group Normalization (GN) [161]

Group Normalization (GN) is different from Batch Normalization (BN), Layer Normalization (LN), and Instance Normalization (IN). GN divides the \(C\) channels into \(G\) groups, where each group contains \(\frac{C}{G}\) channels. GN computes the mean and variance within each group for normalization as follows:

For the \(n\)th image in the mini-batch, the normalization statistics are:

1. **Mean Calculation**:
   $$
   \mu_n = \frac{1}{\frac{C}{G} \cdot H \cdot W} \sum_{i=1}^{\frac{C}{G}} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nij}^k
   $$
   (7.15.41)

2. **Variance Calculation**:
   $$
   \sigma_n^2 = \frac{1}{\frac{C}{G} \cdot H \cdot W} \sum_{i=1}^{\frac{C}{G}} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nij}^k - \mu_n)^2
   $$
   (7.15.42)

In this way, GN normalizes each group of channels, and hence is named Group Normalization (GN). 
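The grouping described above can be sketched in NumPy by reshaping the channel axis into `G` groups and pooling the statistics within each group. The function name `group_norm`, the value of `eps`, and the toy shapes are assumptions for this example; the learnable scale and shift are omitted to keep the sketch short.

```python
import numpy as np

def group_norm(y, G, eps=1e-5):
    """Group Normalization for y of shape (N, C, H, W); C must be divisible by G."""
    N, C, H, W = y.shape
    if C % G != 0:
        raise ValueError("the number of channels C must be divisible by G")
    # Split the channel axis into G groups of C // G channels each
    grouped = y.reshape(N, G, C // G, H, W)
    # One mean and one variance per (image, group)
    mu = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    y_hat = (grouped - mu) / np.sqrt(var + eps)
    return y_hat.reshape(N, C, H, W)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=(2, 6, 4, 4))   # N=2, C=6, H=W=4
out = group_norm(y, G=3)                                # 3 groups of 2 channels

# Statistics of the output, viewed per (image, group)
out_grouped = out.reshape(2, 3, 2, 4, 4)
print(out_grouped.mean(axis=(2, 3, 4)))   # close to 0 everywhere
print(out_grouped.var(axis=(2, 3, 4)))    # close to 1 everywhere
```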

**Relation to LN and IN**:
- If \(G = 1\), all \(C\) channels form a single group, and the equations for GN reduce to those for Layer Normalization (LN):
  $$
  \mu_n = \frac{1}{C \cdot H \cdot W} \sum_{i=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nij}^k
  $$
  (7.15.39)
  $$
  \sigma_n^2 = \frac{1}{C \cdot H \cdot W} \sum_{i=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nij}^k - \mu_n)^2
  $$
  (7.15.40)

- If \(G = C\), each group contains a single channel, and the equations for GN become those for Instance Normalization (IN), with one statistic per image \(n\) and channel \(i\):
  $$
  \mu_{n,i} = \frac{1}{H \cdot W} \sum_{j=1}^{H} \sum_{k=1}^{W} y_{nij}^k
  $$
  (7.15.37)
  $$
  \sigma_{n,i}^2 = \frac{1}{H \cdot W} \sum_{j=1}^{H} \sum_{k=1}^{W} (y_{nij}^k - \mu_{n,i})^2
  $$
  (7.15.38)
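The two extreme group counts can be checked numerically. The sketch below compares a group-normalization helper against plain normalization over explicit axes: with `G = 1` the statistics pool all channels and spatial positions of an image, and with `G = C` they pool only the spatial positions of a single channel. The helper names `group_norm` and `normalize` are assumptions for this example.

```python
import numpy as np

def group_norm(y, G, eps=1e-5):
    """Normalize y of shape (N, C, H, W) within G channel groups per image."""
    N, C, H, W = y.shape
    g = y.reshape(N, G, C // G, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)

def normalize(y, axes, eps=1e-5):
    """Normalize y with statistics pooled over the given axes."""
    mu = y.mean(axis=axes, keepdims=True)
    var = y.var(axis=axes, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
y = rng.normal(size=(2, 4, 3, 3))
C = y.shape[1]

# G = 1: one group holding all channels, statistics over (C, H, W) per image
same_as_chw = np.allclose(group_norm(y, G=1), normalize(y, (1, 2, 3)))
# G = C: one channel per group, statistics over (H, W) per image and channel
same_as_hw = np.allclose(group_norm(y, G=C), normalize(y, (2, 3)))

print(same_as_chw, same_as_hw)
```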

**Unified Normalization Statistics**:
The normalization statistics in BN, LN, IN, and GN can be unified as follows:
$$
\mu_l = \frac{1}{m} \sum_{p \in S} y_{p}
$$
$$
\sigma_l^2 = \frac{1}{m} \sum_{p \in S} (y_{p} - \mu_l)^2
$$
(7.15.43) (7.15.44)

Where \(S\) is a designed subset of \(\{n, i, j, k\}\) and \(l\) is the index of \(\{n, i, j, k\} \setminus S\).

**Cases**:
- **Batch Normalization (BN)**:
  $$ S = \{p\} = \{1, \ldots, N\} $$
  $$ m = |S| = N $$
  $$
  \mu_{BN}(i, j, k) = \frac{1}{N} \sum_{n=1}^{N} y_{nij}^k
  $$
  $$
  \sigma_{BN}^2(i, j, k) = \frac{1}{N} \sum_{n=1}^{N} (y_{nij}^k - \mu_{BN}(i, j, k))^2
  $$
  (7.15.45) (7.15.46)

- **Layer Normalization (LN)**:
  $$ S = \{p\} = \{C, H, W\} $$
  $$ m = C \cdot H \cdot W $$
  (7.15.39) (7.15.40)

- **Instance Normalization (IN)**:
  $$ S = \{p\} = \{H, W\} $$
  $$ m = H \cdot W $$
  (7.15.37) (7.15.38)

- **Group Normalization (GN)**:
  $$ S = \{p\} = \{C/G, H, W\} $$
  $$ m = \frac{C}{G} \cdot H \cdot W $$
  (7.15.41) (7.15.42)
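The unified form suggests a single routine in which only the index set $S$ changes. In the sketch below, $S$ is passed as a tuple of array axes, $m = |S|$ is the number of pooled elements, and the shape of the returned mean shows which indices $l$ remain. The function name `normalize_over` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def normalize_over(y, axes, eps=1e-5):
    """Normalize y using statistics pooled over the index set S given by `axes`."""
    mu = y.mean(axis=axes, keepdims=True)
    var = y.var(axis=axes, keepdims=True)
    m = int(np.prod([y.shape[a] for a in axes]))   # m = |S|
    return (y - mu) / np.sqrt(var + eps), mu, m

N, C, H, W, G = 4, 6, 5, 5, 3
rng = np.random.default_rng(0)
y = rng.normal(size=(N, C, H, W))

# Pool over the batch index n only: one statistic per (i, j, k), m = N
_, mu_batch, m_batch = normalize_over(y, (0,))
# Pool over the spatial positions (j, k): one statistic per (n, i), m = H * W
_, mu_hw, m_hw = normalize_over(y, (2, 3))
# Pool over channels and spatial positions: one statistic per n, m = C * H * W
_, mu_chw, m_chw = normalize_over(y, (1, 2, 3))
# Pool over C / G channels and spatial positions: one statistic per (n, group)
y_grouped = y.reshape(N, G, C // G, H, W)
_, mu_group, m_group = normalize_over(y_grouped, (2, 3, 4))

print("batch index only:    ", mu_batch.shape, m_batch)
print("spatial only:        ", mu_hw.shape, m_hw)
print("channels and spatial:", mu_chw.shape, m_chw)
print("group and spatial:   ", mu_group.shape, m_group)
```

Each variant differs only in the `axes` argument, which mirrors how the cases above differ only in the choice of $S$.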
