## Objectives

This module aims to provide a comprehensive understanding of Graph Attention Networks (GATs). Specifically, upon completion, you should be able to:
1.  **Understand Graph Neural Networks (GNNs):** Understand core GNN concepts such as the adjacency matrix, message passing, and permutation invariance.
2.  **Understand how attention mechanisms enhance information aggregation in graphs:** Grasp the core idea behind using attention to selectively weigh the importance of neighboring nodes when updating a node's representation.
3.  **Understand the multi-head attention for node representations:** Comprehend how multiple independent attention mechanisms can be used to capture diverse relationships and improve the robustness of node embeddings.

# Graph-Based Networks
A graph-based network, or Graph Neural Network (GNN), is any neural network that uses the graph’s structure (nodes and edges) to learn meaningful representations or make predictions.
Graph structure is important because graphs represent relationships between entities. Using graphs as input to a neural network comes with a challenge, however: every graph differs in size and shape.
<img src ="https://i.postimg.cc/MHkm8601/graph.png">
Fig: Different graph structures in different scenarios

Graphs also have no canonical node ordering. Two graphs that look different can still be structurally identical, a property known as **isomorphism**.
If we flip an image we get a new image, but if we flip a graph the only thing that changes is the order of the nodes.

An algorithm that handles graph data therefore needs to be **permutation invariant**: its output must not depend on the order in which the nodes are listed.
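As a small illustration of permutation invariance, the sketch below compares an order-independent operation (summing neighbor features) with an order-dependent one (concatenating them). The feature values and the permutation are made up for the example.

```python
import numpy as np

# Hypothetical toy data: feature vectors of 4 neighbors, 3 features each.
neighbor_features = np.arange(12, dtype=float).reshape(4, 3)

# List the same neighbors in a different order.
perm = np.array([2, 0, 3, 1])
shuffled = neighbor_features[perm]

# Summing over neighbors ignores their order: permutation invariant.
sum_invariant = np.array_equal(neighbor_features.sum(axis=0), shuffled.sum(axis=0))

# Concatenating neighbors into one long vector depends on their order.
flatten_invariant = np.array_equal(neighbor_features.ravel(), shuffled.ravel())

print(sum_invariant, flatten_invariant)  # True False
```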

Graphs live in a non-Euclidean space: **Euclidean distance** between nodes is not clearly defined, because nodes have no fixed coordinates and how close two nodes are depends on the edges that connect them. Node embeddings therefore need to contain **structural** as well as **feature information** about other nodes in the graph.
<img src= "https://i.postimg.cc/4N9Qw9gw/euc.png">
**Left:** Image in Euclidean space   **Right:** Graph in non-Euclidean space

For every graph shown, each node can be described with a **feature vector**. For example, if a node of the graph represents a person, the feature vector would hold their attributes, such as name, address, and contact details.
The structure of a graph can be described by an **adjacency matrix**, which tells us which nodes of the graph are connected. The adjacency matrix cannot simply be fed into a feedforward network. It encodes explicit pairwise connections (edges) between nodes, that is, who is connected to whom, and its rows and columns change whenever the nodes are reordered. A feedforward layer (such as a fully connected dense layer), by contrast, treats its input as a fixed-order vector with no notion of explicit pairwise relationships between inputs: the connections are learned freely through weight matrices.
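The following sketch builds an adjacency matrix from an edge list and relabels the nodes. The 4-node graph, the feature values, and the permutation are hypothetical. It shows that the flattened adjacency matrix changes under relabeling, while neighbor sums (`A @ X`) are simply reordered along with the nodes.

```python
import numpy as np

# Hypothetical undirected graph with 4 nodes, given as an edge list.
edges = [(0, 1), (0, 2), (2, 3)]
num_nodes = 4

A = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1  # undirected graph: the matrix is symmetric

# Feature matrix: one row per node (2 made-up features per node).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])

# Relabel the nodes with a permutation matrix P.
perm = [2, 0, 3, 1]
P = np.eye(num_nodes, dtype=int)[perm]
A_perm = P @ A @ P.T  # same graph, different node order
X_perm = P @ X

# A dense layer applied to the flattened matrix would see a different input.
same_flat_input = np.array_equal(A.ravel(), A_perm.ravel())

# Row i of A @ X is the sum of the features of node i's neighbors.
neighbor_sums = A @ X
neighbor_sums_perm = A_perm @ X_perm
reordered_only = np.array_equal(neighbor_sums_perm, P @ neighbor_sums)
```

Here `same_flat_input` is false even though both matrices describe the same graph, while `reordered_only` is true: neighbor aggregation follows the relabeling instead of being broken by it.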

**Message Passing Layers**
They combine the node and edge information into the node embeddings. The graph information (node features and structural properties) is fed through message passing layers, which construct node embeddings that contain knowledge about other nodes and edges in a compressed format. This is done by gathering the current information of the neighbor nodes, combining it in a certain way to get a new embedding, and updating the node features or states with this embedding. This is also called graph convolution and can be seen as an extension of convolution to graph data.

**How Message Passing Layers Work**
<img src = "https://i.postimg.cc/brgB3Sbm/Message-Passing.png">

Let's say each node has 50 features. For node 1, we take the features of the node, aggregate them with the features of its neighbor nodes, and update the features of node 1.

**Operation in Message Passing Layer**


---

A typical message passing layer has **three main steps**:

---

## 1. Message Computation

Each node $v$ gathers messages from its neighbors $u \in N(v)$:

$$
m_{uv} = M(h_u, h_v, e_{uv})
$$

Where:

* $h_u$ = feature vector of the neighbor node $u$
* $h_v$ = feature vector of the target node $v$
* $e_{uv}$ = optional edge feature between $u$ and $v$
* $M$ = learnable or fixed message function (often a neural network or linear layer)

---

## 2. Message Aggregation

Each node collects all incoming messages:

$$
m_v = AGG\big( \{ m_{uv} : u \in N(v) \} \big)
$$
<img src = "https://i.postimg.cc/G2WyVXP2/featureupdate.png">
Here, `AGG` is a permutation-invariant aggregation function. Common choices are:

* **Sum:** $\sum$
* **Mean:** mean
* **Max:** $\max$
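As a quick illustration, the three aggregators can be applied to a small set of incoming messages. The message values are made up for the example.

```python
import numpy as np

# Hypothetical messages arriving at one node from three neighbors,
# each message being a 2-dimensional vector.
messages = np.array([[1.0, 4.0],
                     [2.0, 0.0],
                     [3.0, 2.0]])

agg_sum = messages.sum(axis=0)    # [6.0, 6.0]
agg_mean = messages.mean(axis=0)  # [2.0, 2.0]
agg_max = messages.max(axis=0)    # [3.0, 4.0]
```

All three reduce over the neighbor axis, so the result has the same shape no matter how many neighbors a node has.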

---

## 3. Node Update

Each node updates its own feature using the aggregated message:

$$
h'_v = U(h_v, m_v)
$$

Where:

* $U$ is an update function (often an MLP or simply $\text{ReLU}(W \cdot \text{concat}(h_v, m_v))$)

---

## Putting It Together
<img src ="https://i.postimg.cc/prCfyzrX/fdfd.png">

A general message passing layer can be written as:

$$
h'_v = U\Big( h_v,\; AGG\big( \{ M(h_u, h_v, e_{uv}) : u \in N(v) \} \big) \Big)
$$
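The three steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the message function is a linear map without edge features, aggregation is a sum, and the update is the ReLU-of-concatenation form mentioned above. The names `message_passing_layer`, `W_msg`, and `W_upd` are hypothetical.

```python
import numpy as np

def message_passing_layer(h, neighbors, W_msg, W_upd):
    """One round of message passing with sum aggregation.

    h:         (num_nodes, F) node features
    neighbors: dict mapping node index -> list of neighbor indices
    W_msg:     (F, F_msg) weights of the message function M
    W_upd:     (F + F_msg, F_out) weights of the update function U
    """
    num_nodes = h.shape[0]
    h_new = np.zeros((num_nodes, W_upd.shape[1]))
    for v in range(num_nodes):
        # 1. Message computation: M(h_u) = h_u @ W_msg (edge features omitted).
        msgs = [h[u] @ W_msg for u in neighbors[v]]
        # 2. Aggregation: sum over neighbors (zero vector if there are none).
        m_v = np.sum(msgs, axis=0) if msgs else np.zeros(W_msg.shape[1])
        # 3. Update: U(h_v, m_v) = ReLU(concat(h_v, m_v) @ W_upd).
        h_new[v] = np.maximum(0.0, np.concatenate([h[v], m_v]) @ W_upd)
    return h_new

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W_msg = rng.normal(size=(3, 5))
W_upd = rng.normal(size=(3 + 5, 2))
h_next = message_passing_layer(h, neighbors, W_msg, W_upd)
```

The explicit loop over nodes keeps the correspondence with the formula visible; practical libraries vectorize the same computation over all edges at once.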

---

## Example: Simple GCN Layer

A classic Graph Convolutional Network (GCN) layer is a special case:

* No edge features
* Messages are just the neighbor’s feature times a weight matrix

The layer is:

$$
H' = \sigma(\tilde{A} H W)
$$

Where:

* $\tilde{A} = \hat{D}^{-\frac{1}{2}} (A + I) \hat{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-loops, where $\hat{D}$ is the degree matrix of $A + I$
* $H$ is the node feature matrix
* $W$ is a learnable weight matrix
* $\sigma$ is an activation function (e.g., ReLU)

In the GCN layer:

$$
M(h_u) = W h_u,\quad
AGG = \text{sum weighted by the normalized adjacency},\quad
U = \text{activation } \sigma.
$$
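A minimal NumPy sketch of such a layer is shown below. It assumes self-loops are added before the symmetric degree normalization and uses ReLU as the activation. The 3-node graph, the features, and the weights are made-up values.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Single GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    A_hat = A + np.eye(A.shape[0])   # add self-loops
    deg = A_hat.sum(axis=1)          # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale entry (i, j) by 1 / sqrt(deg_i * deg_j).
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)

# Node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[0.5], [1.0]])
H_next = gcn_layer(A, H, W)
```

Note that the aggregation weights in `A_norm` depend only on node degrees, not on the node features.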

---





# GNN Variants
<ul>
<li>Graph Convolution Network</li>
<li>Graph Multi Layer Perceptron</li>
<li>Graph Attention Networks</li>
<li>Gated Graph Neural Networks</li>
</ul>

# Graph Attention Networks

## Key Topics

### 2.1 Attention Coefficients and Neighborhood Weighting

Traditional Graph Convolutional Networks (GCNs) treat all neighbors equally, or apply weights based on fixed graph structure (e.g., degree normalization). GATs introduce an attention mechanism that allows each node to **learnably assign different importance (attention coefficients)** to its neighbors during the message passing process. This dynamic weighting is crucial for handling complex graph structures and capturing heterogeneous relationships.

<img src = "https://i.postimg.cc/k5x4N0p2/AM.png" width =50% >

Let $h_i \in \mathbb{R}^F$ be the input features of node $i$, and $h_j \in \mathbb{R}^F$ be the input features of node $j$ (a neighbor of $i$). The attention mechanism first transforms the input features using a shared linear transformation parameterized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, where $F'$ is the number of output features. So, we have $Wh_i$ and $Wh_j$.

The attention score $e_{ij}$ between node $i$ and its neighbor $j$ is then computed using a shared attentional mechanism $a$:
$$e_{ij} = a(Wh_i, Wh_j)$$
This attentional mechanism $a$ is a single-layer feedforward neural network, parameterized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, and followed by a LeakyReLU non-linearity:
$$e_{ij} = \text{LeakyReLU}(\vec{a}^T [Wh_i \, || \, Wh_j])$$
where $||$ denotes concatenation.

To make attention coefficients comparable across different nodes and easier to interpret as weights, we normalize them using the softmax function over all neighbors $j \in N_i$ of node $i$:
$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}$$
These normalized attention coefficients $\alpha_{ij}$ represent the importance of node $j$'s features to node $i$'s new representation. Finally, the output feature for node $i$, $h'_i$, is computed as a weighted sum of its neighbors' transformed features, weighted by the attention coefficients:
$$h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} Wh_j\right)$$
where $\sigma$ is an activation function (e.g., ELU).
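The computation above can be sketched for a single attention head in NumPy. This is an illustrative sketch, not the paper's implementation: the function name `gat_head`, the dictionary-based neighbor lists, and the random inputs are assumptions, the final activation $\sigma$ is left out, and every node is assumed to have at least one neighbor (here each node is listed in its own neighborhood).

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, neighbors, W, a):
    """One attention head, returned before the final nonlinearity.

    h:         (num_nodes, F) input features
    neighbors: dict node -> list of neighbor indices (N_i)
    W:         (F_out, F) shared weight matrix
    a:         (2 * F_out,) attention vector
    Returns (h_new, alphas), where alphas[i] holds alpha_ij for j in N_i.
    """
    Wh = h @ W.T  # transformed features, shape (num_nodes, F_out)
    h_new = np.zeros_like(Wh)
    alphas = {}
    for i, nbrs in neighbors.items():
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every neighbor j
        e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                      for j in nbrs])
        # Softmax over the neighborhood (shifted for numerical stability).
        w = np.exp(e - e.max())
        alpha = w / w.sum()
        alphas[i] = alpha
        # Weighted sum of the neighbors' transformed features.
        h_new[i] = alpha @ Wh[nbrs]
    return h_new, alphas

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0, 3], 3: [3, 2]}
W = rng.normal(size=(2, 3))
a = rng.normal(size=4)
h_new, alphas = gat_head(h, neighbors, W, a)
```

Because `W` and `a` are shared by all nodes, the same function applies to neighborhoods of any size.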

---

# Attention Mechanism

The attention mechanism $a$ is a **single-layer feedforward neural network**, parameterized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, and applies the **LeakyReLU** nonlinearity with a negative input slope $\alpha = 0.2$ (not to be confused with the attention coefficients $\alpha_{ij}$).

Fully expanded, the coefficients computed by the attention mechanism can be expressed as:

$$
\alpha_{ij} = \frac{
\exp\Big( \text{LeakyReLU}\big( \vec{a}^T [ W \vec{h}_i \parallel W \vec{h}_j ] \big) \Big)
}{
\sum_{k \in N_i} \exp\Big( \text{LeakyReLU}\big( \vec{a}^T [ W \vec{h}_i \parallel W \vec{h}_k ] \big) \Big)
}
$$

Where:

* $\vec{h}_i$ and $\vec{h}_j$ are the feature vectors of nodes $i$ and $j$
* $W$ is a learnable weight matrix
* $\parallel$ denotes concatenation
* $N_i$ is the neighborhood of node $i$

---
Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding neighbor features, which serves as the final output feature for every node (after potentially applying a nonlinearity $\sigma$):

$$
\vec{h}'_i = \sigma \Big( \sum_{j \in N_i} \alpha_{ij} \, W \vec{h}_j \Big)
$$

Where:

* $\vec{h}'_i$ is the updated feature vector for node $i$
* $\sigma$ is a nonlinearity (e.g., ELU or ReLU)
* $\alpha_{ij}$ are the normalized attention coefficients
* $W$ is a learnable weight matrix
* $N_i$ is the neighborhood of node $i$

---
To stabilize the learning process of self-attention, multi-head attention is employed.

---



### 2.2 Multi-head Attention Layers

To enhance the expressive power and robustness of GATs, the concept of **multi-head attention** is introduced. Instead of performing a single attention calculation, $K$ independent attention mechanisms (or "heads") are run in parallel. Each head learns a different set of attention parameters ($\vec{a}^k$) and performs the transformation with its own weight matrix ($W^k$).
<img src = "https://i.postimg.cc/X7m7RNMZ/MAM.png">

For each head $k$, we compute a set of attention-weighted features $h'_i{}^{(k)}$:
$$h'_i{}^{(k)} = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}{}^{(k)} W^k h_j\right)$$

<img src = "https://i.postimg.cc/G2gxrTxw/X.png">
After computing the output features for each of the $K$ heads, there are two common ways to aggregate these results:


## 1. Concatenation (for hidden layers)

The features from all attention heads are **concatenated** to form the final output feature for node $i$. If each head outputs $F'$ features, the concatenated output will have $K \times F'$ features:

$$
h'_i = \mathop{\|}_{k=1}^{K} h'_i{}^{(k)}
$$

This approach allows the model to learn multiple, potentially complementary, representations for each node.

---

## 2. Averaging (for the output layer or when keeping feature dimension constant)

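The features from all heads are **averaged**. This is typically used in the final layer of a GAT to produce a single, consolidated embedding or classification output. In the original paper, the final nonlinearity (usually softmax or a logistic sigmoid for classification) is applied after averaging rather than per head:

$$
h'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}{}^{(k)} W^k h_j\right)
$$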
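The two ways of combining heads differ only in the output shape, which the sketch below illustrates. The per-head outputs are random placeholders standing in for the features produced by $K = 3$ heads, and the sizes are arbitrary.

```python
import numpy as np

# Hypothetical outputs of K = 3 heads for 4 nodes, F' = 2 features per head.
rng = np.random.default_rng(0)
head_outputs = [rng.normal(size=(4, 2)) for _ in range(3)]

# Hidden layer: concatenate along the feature axis -> (4, K * F') = (4, 6).
h_concat = np.concatenate(head_outputs, axis=1)

# Output layer: average over the heads -> (4, F') = (4, 2).
h_avg = np.mean(head_outputs, axis=0)

print(h_concat.shape, h_avg.shape)  # (4, 6) (4, 2)
```

Concatenation grows the feature dimension with the number of heads, so the next layer's weight matrix must accept $K \times F'$ inputs, whereas averaging keeps the dimension at $F'$.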

---


Multi-head attention allows the model to capture diverse relationships and dependencies within the graph, making it more robust to noisy or irrelevant connections. It effectively stabilizes the learning process and often leads to better performance.

### 2.3 Comparison to GCNs

| Feature                 | Graph Convolutional Networks (GCNs)                           | Graph Attention Networks (GATs)                               |
| :---------------------- | :------------------------------------------------------------ | :------------------------------------------------------------ |
| **Information Aggregation** | Fixed, pre-defined aggregation based on graph structure (e.g., degree normalization). | **Learned, dynamic attention weights** for each neighbor, allowing selective aggregation. |
| **Weight Sharing** | Weights are shared across all nodes for feature transformation. | Weights for feature transformation ($W$) are shared across nodes. Attentional mechanism ($a$) weights are also shared. |
| **Inductive Capability** | Layer weights are shared across nodes, but the aggregation coefficients come from the degree normalization of the specific graph; the original GCN was formulated for the transductive setting. | **Strongly inductive** due to the shared attention mechanism, which generalizes to unseen graph structures. |
| **Computational Cost** | Typically less complex: matrix multiplications with a sparse adjacency matrix. | Adds an attention coefficient computation for every edge and a factor of $K$ for multi-head attention. A single head costs $\mathcal{O}(VFF' + EF')$ for $V$ nodes and $E$ edges, which the GAT paper reports as on par with a GCN layer. |
| **Interpretation** | Less direct interpretation of neighbor importance.            | **Attention coefficients offer interpretability** on which neighbors are more important. |
| **Performance** | Good performance on many tasks.                               | Matched or exceeded the **state of the art** on transductive (Cora, Citeseer, Pubmed) and inductive (PPI) benchmarks in the original paper. |
| **Local Information** | Aggregates information from immediate neighbors.              | Can selectively focus on relevant neighbors, potentially capturing more nuanced local structures. |

**Key Advantages of GATs over GCNs:**

* **Inductive Learning:** GATs are inherently better suited for inductive learning tasks (generalizing to unseen nodes or entire graphs) because their attention mechanism does not depend on the global graph structure. The attention weights are computed on-the-fly based on node features, making them highly transferable.
* **Varying Neighbor Importance:** GATs overcome the limitation of GCNs where all neighbors contribute equally (or based on pre-defined structural coefficients like degree). They can learn to assign different importance to different neighbors, which is crucial for heterogeneous graphs or graphs with varying relational strengths.
* **Interpretability:** The learned attention coefficients provide a level of interpretability, indicating which neighbors are more influential in forming a node's representation. This can be valuable for understanding the model's decisions.
* **Robustness to Noise:** By focusing on more relevant neighbors and down-weighting less relevant ones, GATs can be more robust to noisy or spurious connections in the graph.


---
## Worked Example

* **Graph:**

  * Node **A** with neighbors **B** and **C** (A's self-connection is left out to keep the arithmetic short)
* **Feature dimension (input):** $F = 2$
* **Feature dimension (output per head):** $F' = 1$
* **Number of heads:** $K = 2$

Suppose:

$$
\vec{h}_A = [1, 2]^T,\quad
\vec{h}_B = [0, 1]^T,\quad
\vec{h}_C = [1, 1]^T.
$$

---

## 1. Linear Transformation

Each node’s feature vector is transformed by a learnable **weight matrix** $W$.

Suppose for **head 1**:

$$
W^{(1)} = [0.5, 1.0].
\quad (\text{shape: } 1 \times 2)
$$

So the transformed features for head 1 are:

$$
W^{(1)} \vec{h}_A = 0.5(1) + 1.0(2) = 2.5 \\
W^{(1)} \vec{h}_B = 0.5(0) + 1.0(1) = 1.0 \\
W^{(1)} \vec{h}_C = 0.5(1) + 1.0(1) = 1.5
$$

---

## 2. Attention Coefficient (Self-Attention)

For head 1, suppose the learnable vector is $\vec{a}^{(1)} = [1.0, 1.0]^T$ (shape $2 \times 1$).

**Step 1:** Concatenate the transformed features for each pair.

$$
[W^{(1)} \vec{h}_A \parallel W^{(1)} \vec{h}_B] = [2.5, 1.0] \\
[W^{(1)} \vec{h}_A \parallel W^{(1)} \vec{h}_C] = [2.5, 1.5]
$$

**Step 2:** Compute raw scores:

$$
e_{AB} = \vec{a}^{(1)T} [2.5, 1.0]^T = 2.5 + 1.0 = 3.5 \\
e_{AC} = \vec{a}^{(1)T} [2.5, 1.5]^T = 2.5 + 1.5 = 4.0.
$$

Apply **LeakyReLU** (with negative slope $\alpha = 0.2$):

$$
\text{LeakyReLU}(x) = \begin{cases} x, & x \geq 0 \\ 0.2x, & x < 0 \end{cases}
\quad \Longrightarrow \quad
\text{LeakyReLU}(3.5) = 3.5, \quad \text{LeakyReLU}(4.0) = 4.0.
$$

---

## 3. Normalize (Softmax)

$$
\alpha_{AB} = \frac{\exp(3.5)}{\exp(3.5) + \exp(4.0)} = \frac{33.12}{33.12 + 54.60} = 0.38, \\
\text{and similarly } \alpha_{AC} = \frac{54.60}{33.12 + 54.60} = 0.62.
$$

So, for head 1: Node A will pay **38% attention** to B and **62% attention** to C.

---

## 4. Aggregation

Aggregate neighbor features for A:

$$
h'_A{}^{(1)} = \sigma\big( \alpha_{AB} W^{(1)} \vec{h}_B + \alpha_{AC} W^{(1)} \vec{h}_C \big) \\
= \sigma\big( 0.38(1.0) + 0.62(1.5) \big) \\
= \sigma(0.38 + 0.93) = \sigma(1.31).
$$

Suppose $\sigma$ is the identity → $h'_A{}^{(1)} = 1.31$.

---

## 5.  Multi-Head Attention

Now do the same for **head 2** with different weights.

Suppose:

$$
W^{(2)} = [1.0, -1.0], \quad \vec{a}^{(2)} = [1.0, 1.0]^T.
$$

* $W^{(2)} \vec{h}_A = 1(1) + (-1)(2) = -1$
* $W^{(2)} \vec{h}_B = 1(0) + (-1)(1) = -1$
* $W^{(2)} \vec{h}_C = 1(1) + (-1)(1) = 0$

Then:

$$
[W^{(2)} \vec{h}_A \parallel W^{(2)} \vec{h}_B] = [-1, -1] \\
[W^{(2)} \vec{h}_A \parallel W^{(2)} \vec{h}_C] = [-1, 0]
$$

Dot products:

$$
e_{AB}^{(2)} = (-1) + (-1) = -2,\quad e_{AC}^{(2)} = (-1) + 0 = -1.
$$

LeakyReLU:

$$
\text{LeakyReLU}(-2) = 0.2(-2) = -0.4,\quad \exp(-0.4) = 0.67 \\
\text{LeakyReLU}(-1) = 0.2(-1) = -0.2,\quad \exp(-0.2) = 0.82.
$$

Normalized:

$$
\alpha_{AB}^{(2)} = \frac{0.67}{0.67 + 0.82} = 0.45, \quad
\alpha_{AC}^{(2)} = 0.55.
$$

Aggregate:

$$
h'_A{}^{(2)} = \alpha_{AB}^{(2)} W^{(2)} \vec{h}_B + \alpha_{AC}^{(2)} W^{(2)} \vec{h}_C \\
= 0.45(-1) + 0.55(0) = -0.45.
$$

---

## Combine Heads

* If it’s a **hidden layer** → **concatenate**:

  $$
  h'_A = [\, 1.31 \; || \; -0.45 \,].
  $$

* If it’s an **output layer** → **average**:

  $$
  h'_A = \frac{1}{2} (1.31 + (-0.45)) = 0.43.
  $$
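The hand calculation can be checked with a short NumPy script. It uses the feature vectors, weight matrices, and attention vectors from the example, with $\sigma$ taken as the identity. The variable names and the dictionary layout are just one possible way to organize it.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Node features from the example.
h = {"A": np.array([1.0, 2.0]),
     "B": np.array([0.0, 1.0]),
     "C": np.array([1.0, 1.0])}
neighbors_of_A = ["B", "C"]

# Parameters of the two heads from the example.
heads = [
    {"W": np.array([[0.5, 1.0]]), "a": np.array([1.0, 1.0])},   # head 1
    {"W": np.array([[1.0, -1.0]]), "a": np.array([1.0, 1.0])},  # head 2
]

outputs, all_alphas = [], []
for head in heads:
    W, a = head["W"], head["a"]
    Wh = {name: W @ vec for name, vec in h.items()}  # linear transformation
    # Raw scores e_Aj = LeakyReLU(a^T [Wh_A || Wh_j])
    e = np.array([leaky_relu(a @ np.concatenate([Wh["A"], Wh[j]]))
                  for j in neighbors_of_A])
    alpha = np.exp(e) / np.exp(e).sum()              # softmax over B and C
    # Weighted sum of neighbor features (sigma = identity).
    out = sum(alpha[k] * Wh[j] for k, j in enumerate(neighbors_of_A))
    all_alphas.append(alpha)
    outputs.append(out)

h_concat = np.concatenate(outputs)  # hidden layer: approx [1.31, -0.45]
h_avg = np.mean(outputs, axis=0)    # output layer: approx [0.43]
```

Rounded to two decimals, `all_alphas` reproduces the attention weights 0.38/0.62 for head 1 and 0.45/0.55 for head 2.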

---



This example shows:

* **How self-attention scores are computed** (dot product + nonlinearity)
* **How multiple heads learn different views** (different $W$ and $\vec{a}$)
* **How they’re combined** (concat or average)

---


## Reference Materials

For a practical deep dive and implementation details, refer to the following resources:

* **GAT in PyTorch Geometric:** The library documentation and examples provide a hands-on starting point for implementing and training a GAT for node classification.
    * [PyTorch Geometric `GAT` model documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GAT.html) (see also the module documentation for `GATConv`; check the PyTorch Geometric documentation for current links and tutorials).
    * Relevant examples can be found in the GitHub repository: [PyTorch Geometric GitHub Examples](https://github.com/pyg-team/pytorch_geometric/tree/master/examples)

**Original Paper:**

* **Graph Attention Networks** by Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. (2018). Available on arXiv: [https://arxiv.org/abs/1710.10903](https://arxiv.org/abs/1710.10903)