# Recap: Deep Graph Encoder
- ![image.png](attachment:0ea7c541-3793-4dc1-a02a-d71e35fa6395.png)
- Idea: Node's neighborhood defines a computation graph
- ![image.png](attachment:cb37edea-279b-4e7d-83b3-bca81bd60dad.png)
- Learn how to propagate information across the graph to compute node features

## Aggregate From Neighbors:
- Intuition: Nodes aggregate information from their neighbors using neural networks
- ![image.png](attachment:c973a698-0eac-4a3e-8026-01eff4fb439b.png)
- Intuition: Network neighborhood defines a computation graph
- Every node defines a computation graph based on its neighborhood
- ![image.png](attachment:012fb84d-68b9-43b6-a7c3-5a345a9e45cd.png)

# A General GNN Framework
- ![image.png](attachment:980a40a0-52ae-4361-9a00-2866ffea5915.png)
1. GNN layer = Message + Aggregation
    - Different instantiations under this perspective
    - GCN, GraphSAGE, GAT,...
2. Connect GNN layers into a GNN
    - Stack layers sequentially
    - Ways of adding skip connections
3. Idea: Raw input graph != computational graph
    - Graph feature augmentation
    - Graph structure augmentation

- How do we train a GNN?
    - Supervised / unsupervised objectives
    - Node / edge / graph level objectives

# Single Layer of a GNN
- GNN layer = Message + Aggregation
    - Different instantiations under this perspective
    - GCN, GraphSAGE, GAT, ...
    - ![image.png](attachment:0f146c86-95af-4560-8286-0c9a8ae1f77d.png)
- Idea of a GNN layer
    - Compress a set of vectors into a single vector
    - Two-step process: 1. Message 2. Aggregation
    - ![image.png](attachment:bedc1554-6366-4e44-ad3c-55587f5d7db6.png)

## Message Computation
- Message function
    - ![image.png](attachment:ef73e771-26f0-4d27-8f13-4f17a32b3729.png)
- Intuition: Each node will create a message, which will be sent to other nodes later.
- Example: A linear layer
    - ![image.png](attachment:d6d84314-5bbb-48e0-a81b-7efb9c9052d6.png)
    - Multiply node features with weight matrix W^(l)
    - ![image.png](attachment:5bf00d64-6e1d-406a-a4ff-d7c6f21189ed.png)

## Message Aggregation
- Intuition: Node v will aggregate the messages from its neighbors u:
    - ![image.png](attachment:f77f9c99-4265-4bc4-a193-013ab0fb4cc8.png)
- Example: Sum(.), Mean(.) or Max(.) aggregator
    - ![image.png](attachment:35e0bdb3-eea6-4ddf-9a58-e9cb2998af2b.png)
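
The two steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the toy graph, the `neighbors` list, and the identity weight matrix are assumptions chosen only to make the numbers easy to follow.

```python
import numpy as np

def compute_messages(H, W):
    """Linear message function: m_u = W h_u for every node u (rows of H)."""
    return H @ W.T

def aggregate(messages, neighbors, how="sum"):
    """Aggregate each node's neighbor messages with Sum, Mean or Max."""
    reducers = {"sum": np.sum, "mean": np.mean, "max": np.max}
    reduce = reducers[how]
    return np.stack([reduce(messages[nbrs], axis=0) for nbrs in neighbors])

# Hypothetical toy graph: 3 nodes, neighbors[v] lists the neighbors of node v.
neighbors = [[1, 2], [0], [0]]
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # h_u^(l-1), one row per node
W = np.eye(2)                                       # identity weights for readability

M = compute_messages(H, W)
h_sum = aggregate(M, neighbors, "sum")
h_mean = aggregate(M, neighbors, "mean")
h_max = aggregate(M, neighbors, "max")
```

All three aggregators are order-invariant, so the result does not depend on how the neighbors are listed.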

## Message Aggregation: Issue
- Issue: Information from node v itself could get lost
    - Computation of h(l)_v does not directly depend on h(l-1)_v
- Solution: Include h(l-1)_v when computing h(l)_v
    1. Message: compute message from node v itself
        - Usually, a different message computation will be performed
        - ![image.png](attachment:e50a9f06-7037-4b2b-9e8b-760866f0468a.png)
    2. Aggregation: After aggregating from neighbors, we can aggregate the message from node v itself.
        - Via concatenation or summation
        - ![image.png](attachment:00068f30-b33d-4781-a70b-2bb3458be4f7.png)
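
A minimal sketch of this fix, assuming a separate weight matrix `B` for the node's own message and a sum over neighbor messages. The names `W`, `B`, and `layer_with_self` are illustrative, not taken from the notes.

```python
import numpy as np

def layer_with_self(H, neighbors, W, B, combine="sum"):
    """Sum neighbor messages W h_u, then add or concatenate the self-message B h_v."""
    neighbor_msgs = H @ W.T  # m_u = W h_u
    self_msgs = H @ B.T      # m_v = B h_v, computed with separate weights
    agg = np.stack([neighbor_msgs[nbrs].sum(axis=0) for nbrs in neighbors])
    if combine == "sum":
        return agg + self_msgs
    return np.concatenate([agg, self_msgs], axis=1)

# Hypothetical toy graph and weights.
neighbors = [[1, 2], [0], [0]]
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
W, B = np.eye(2), 2.0 * np.eye(2)

h_added = layer_with_self(H, neighbors, W, B, combine="sum")
h_concat = layer_with_self(H, neighbors, W, B, combine="concat")
```

Concatenation doubles the embedding width, so the next layer's weight matrix has to be sized accordingly.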

## A Single GNN layer
- Putting things together
    1. Message: Each node computes a message
        - ![image.png](attachment:99e4cb47-1d4f-44b0-995f-bacc9666087e.png)
    2. Aggregation: aggregate messages from neighbors
        - ![image.png](attachment:eb4ed69f-3de1-408f-bc0c-f259a127a5e0.png)
- Nonlinearity (activation): Adds expressiveness
    - Often written as σ(.), examples: ReLU(.), Sigmoid(.)
    - Can be added to message or aggregation

## Activation (Non-Linearity)
- Apply activation to ith dimension of embedding x
    1. Rectified Linear unit (ReLU)
        - ReLU(x_i) = max(x_i, 0)
        - ![image.png](attachment:088fe97f-5b4b-4d43-b569-3722fde4488c.png)
        - most commonly used
    2. sigmoid
        - ![image.png](attachment:f63f46c8-1c5c-4cb3-ac49-53b95958e836.png)
        - Used only when we want to restrict the range of our embeddings
        - ![image.png](attachment:e63b185a-3736-4608-97ad-5861f7328dd8.png)
    3. Parametric ReLU
        - PReLU(x_i) = max(x_i, 0) + a_i min(x_i, 0)
        - a_i is a trainable parameter
        - ![image.png](attachment:5287cb95-40cd-4674-9eef-578507ac4681.png)
        - Empirically performs better than ReLU
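
The three activations can be written element-wise in NumPy. This is a small sketch: the sigmoid uses the standard definition 1 / (1 + exp(-x)), and the slope vector `a` is a fixed stand-in for what would be a trainable parameter.

```python
import numpy as np

def relu(x):
    """ReLU(x_i) = max(x_i, 0)."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Squashes every dimension into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, a):
    """PReLU(x_i) = max(x_i, 0) + a_i * min(x_i, 0)."""
    return np.maximum(x, 0.0) + a * np.minimum(x, 0.0)

x = np.array([-2.0, 0.0, 3.0])
a = np.array([0.1, 0.1, 0.1])  # hypothetical slopes; trainable in practice

relu_out = relu(x)
sigmoid_out = sigmoid(x)
prelu_out = prelu(x, a)
```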

## Classical GNN layers
1. Graph Convolutional Networks (GCN)
    - Message + Aggregation
    - ![image.png](attachment:ff30d094-d226-4ad4-a5c0-93241dea7510.png)
    - Message:
        - Each neighbor: ![image.png](attachment:f7c22204-ab4b-4298-ba55-c72bbe43cbc0.png)
    - Aggregation
        - Sum over messages from neighbors, then apply activation
        - ![image.png](attachment:84f2ff89-1613-4ae5-bfe8-59400d240e4d.png)
    - In GCN the input graph is assumed to have self-edges that are included in the summation
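
A hedged sketch of one GCN-style layer, assuming the degree-normalized form h_v = ReLU(Σ_{u ∈ N(v)} W h_u / |N(v)|) with self-edges added to the adjacency matrix. The dense adjacency matrix `A` and the function name are illustrative choices.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style layer: normalized sum of neighbor messages, then ReLU."""
    A_hat = A + np.eye(A.shape[0])          # add self-edges
    deg = A_hat.sum(axis=1, keepdims=True)  # |N(v)|, counting the self-edge
    messages = H @ W.T                      # W h_u for every node u
    return np.maximum((A_hat @ messages) / deg, 0.0)

# Hypothetical toy graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
W = np.eye(2)

H_next = gcn_layer(A, H, W)
```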

2. GraphSAGE
- ![image.png](attachment:6267cdb7-c73e-4736-ad0c-dd9169967c49.png)
- Two-stage aggregation
    - Stage 1: Aggregate from node neighbors
        -  ![image.png](attachment:fc9ad606-e261-460f-b191-c932c655dc12.png)
    - Stage 2: Further aggregate over the node itself.
        - ![image.png](attachment:69aa95a1-4646-4849-9c88-53908b4ad48b.png)
    - Message is computed within the AGG(.)

#### GraphSAGE neighbor aggregation
- Mean: Take a weighted average of neighbors
    - ![image.png](attachment:418e0c36-cd2a-49d3-bb49-f9bf8a6a1201.png)
- Pool: Transform neighbor vectors and apply symmetric vector function Mean(.) or Max(.)
    - ![image.png](attachment:d3d5fbad-4406-4a53-b462-8ce755775832.png)
- LSTM: Apply LSTM to reshuffled neighbors
    - ![image.png](attachment:4309794d-c78f-4eb6-bb86-9c2f3e74e25f.png)

#### GraphSAGE: L2 Normalization
- Optional: apply l2 normalization to h(l)_v at every layer
- ![image.png](attachment:2f2e6b4f-692e-4d40-9468-e81f4a916c5d.png)
- Without l2 normalization, the embedding vectors have different scales (l2 norms)
- In some cases, normalization of embedding results in performance improvement
- After l2 normalization, all vectors will have the same l2 norm.
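
A minimal sketch of a GraphSAGE-style layer with the mean aggregator, assuming stage 2 concatenates h_v with the aggregated neighbor vector before a linear layer and ReLU. The weight shapes and the all-ones `W` are assumptions for illustration.

```python
import numpy as np

def graphsage_layer(H, neighbors, W, normalize=True):
    """Two-stage aggregation followed by optional l2 normalization."""
    # Stage 1: aggregate from neighbors (mean aggregator).
    agg = np.stack([H[nbrs].mean(axis=0) for nbrs in neighbors])
    # Stage 2: combine with the node's own embedding via concatenation.
    out = np.maximum(np.concatenate([H, agg], axis=1) @ W.T, 0.0)
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return out

neighbors = [[1, 2], [0], [0]]
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
W = np.ones((3, 4))  # maps the concatenated 4-dim vector to 3 dims

raw = graphsage_layer(H, neighbors, W, normalize=False)
unit = graphsage_layer(H, neighbors, W, normalize=True)
```

After normalization every row of `unit` has l2 norm 1, while the rows of `raw` keep their different scales.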

3. Graph Attention Networks
- ![image.png](attachment:d342c011-aa1d-4ec3-b08c-cce11a0e0447.png)
- In GCN / GraphSAGE
    - ![image.png](attachment:44e7f0a5-0bdb-4f74-9c65-29d99d46acd7.png) is the weighting factor (importance) of node u's message to node v
    - alpha_vu is defined explicitly based on the structural properties of the graph (node degree)
    - All neighbors u ∈ N(v) are equally important to node v
- Not all of a node's neighbors are equally important
    - Attention is inspired by cognitive attention
    - The attention alpha_vu focuses on the important parts of the input data and fades out the rest.
        - Idea: The NN should devote more computing power to the small but important part of the data
        - Which part of the data is more important depends on the context and is learned through training.

### Graph Attention Networks
- Can we do better than simple neighborhood aggregation?
- Can weighting factors alpha_vu be learned?
- Goal: specify arbitrary importance to different neighbors of each node in the graph.
- Idea: compute embedding h(l)_v of each node in the graph following an attention strategy:
    - Nodes attend over their neighborhood's message
    - Implicitly specifying different weights to different nodes in a neighborhood.
#### Attention Mechanism
- Let alpha_vu be computed as a byproduct of an attention mechanism `a`:
    1. Let `a` compute attention coefficient e_vu across pairs of nodes u, v based on their messages:
        - ![image.png](attachment:766bc05d-149b-464b-9dd2-a2b99079cfa5.png)
        - e_vu indicates the importance of u's message to node v
        - ![image.png](attachment:52bfd482-897d-4465-a083-358f1cda54b3.png)
    2. Normalize e_vu into the final attention weight alpha_vu
        - Use the softmax function, so that the sum of alpha_vu over u ∈ N(v) equals 1
            - ![image.png](attachment:87ebcdb3-712b-4d0f-b668-66bb8000d15e.png)
        - Weighted sum based on the final attention weight alpha_vu
            - ![image.png](attachment:6fb9ca9e-8357-4912-8d59-99f8e259d8ac.png)
    - ![image.png](attachment:9ced6716-16ad-45e9-b1ce-7eb866d7cb56.png)

- What is the form of attention mechanism `a`?
    - The approach is agnostic to the choice of a
        - E.g., use a simple single layer neural network
            - `a` has trainable parameters (weights in the linear layer)
    - ![image.png](attachment:e787508e-beca-4316-a434-c0b67878b5e9.png)
    - Parameters of `a` are trained jointly
        - Learn the parameters together with weight matrices (i.e., other parameters of the neural net W^(l)) in an end-to-end fashion.

- Multi-head attention: Stabilizes the learning process of attention mechanism.
    - Create multiple attention scores (each replica with a different set of parameters):
        - ![image.png](attachment:791759b2-6313-4297-bfed-30ec33358441.png)
    - Outputs are aggregated
        - By concatenation or summation
        - ![image.png](attachment:7ae5efc4-765a-4690-a430-ea9eff26dca4.png)
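
The attention steps and the multi-head variant can be sketched as follows. This assumes `a` is a single linear layer applied to the concatenation of the two messages (the simple choice mentioned above) and omits any extra nonlinearity on e_vu. All names, shapes, and the random parameters are illustrative.

```python
import numpy as np

def attention_head(H, neighbors, W, a_vec):
    """One attention head: scores e_vu, softmax weights alpha_vu, weighted sum."""
    Z = H @ W.T  # messages W h_u
    out = np.zeros_like(Z)
    alphas = []
    for v, nbrs in enumerate(neighbors):
        # e_vu = a(W h_u, W h_v): linear layer on the concatenated messages
        e = np.array([a_vec @ np.concatenate([Z[u], Z[v]]) for u in nbrs])
        alpha = np.exp(e - e.max())  # numerically stable softmax over N(v)
        alpha /= alpha.sum()
        alphas.append(alpha)
        out[v] = alpha @ Z[nbrs]     # sum_u alpha_vu * W h_u
    return np.maximum(out, 0.0), alphas

def multi_head(H, neighbors, heads, combine="concat"):
    """Run several heads with separate parameters and aggregate their outputs."""
    outputs = [attention_head(H, neighbors, W, a)[0] for W, a in heads]
    if combine == "concat":
        return np.concatenate(outputs, axis=1)
    return np.sum(outputs, axis=0)

neighbors = [[1, 2], [0], [0]]
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# Single head with hand-picked (hypothetical) parameters.
h_single, alphas = attention_head(H, neighbors, np.eye(2), np.array([1.0, 0.0, 0.0, 0.0]))

# Three heads with random (hypothetical) parameters, 2 -> 4 dimensions each.
rng = np.random.default_rng(0)
heads = [(rng.normal(size=(4, 2)), rng.normal(size=8)) for _ in range(3)]
h_concat = multi_head(H, neighbors, heads, combine="concat")
h_summed = multi_head(H, neighbors, heads, combine="sum")
```

In training, `W` and `a_vec` of every head would be learned jointly with the rest of the network; here they are fixed only to show the data flow.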

#### Benefits of Attention Mechanism
- Key benefit: Allows for (implicitly) specifying different importance values (alpha_vu) to different neighbors
- Computationally Efficient
    - Computation of attentional coefficients can be parallelized across all edges of the graph.
    - Aggregation may be parallelized across all nodes
- Storage efficient:
    - Sparse matrix operations do not require more than O(V + E) entries to be stored.
    - Fixed number of parameters, irrespective of graph size
- Localized:
    - Only attends over local network neighborhoods
- Inductive capability

# GNN Layers in Practice
- In practice, these classic GNN layers are a great starting point
    - We can often get better performance by considering a general GNN layer design
    - Concretely, we can include modern deep learning modules that proved to be useful in many domains.
    - ![image.png](attachment:aec8fac0-bf44-4d4d-b469-71d4ba95725d.png)

- Many modern deep learning modules can be incorporated into a GNN layer
    - Batch Normalization
        - Stabilize neural network training
    - Dropout:
        - Prevent overfitting
    - Attention/Gating
        - Control the importance of a message
    - More:
        - Any other useful deep learning modules

## Batch Normalization
- Goal: Stabilize neural network training
- Idea: Given a batch of inputs (node embeddings)
    - Re-center the node embeddings into zero mean
    - Re-scale the node embeddings to unit variance
- ![image.png](attachment:abff3fe6-1704-45b0-b0c8-8d247f999777.png)
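
A minimal training-time sketch of batch normalization over a batch of node embeddings. The scale `gamma`, shift `beta`, and the small constant `eps` follow the standard formulation and are assumptions here; running statistics for test time are left out.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each embedding dimension across the batch, then scale and shift."""
    mu = X.mean(axis=0)                    # per-dimension mean over the batch
    var = X.var(axis=0)                    # per-dimension variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # hypothetical node embeddings
Y = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
```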

## Dropout
- Goal: Regularize a neural net to prevent overfitting
- Idea:
    - During training: With some probability `p`, randomly set neurons to zero (turn off)
    - During testing: Use all the neurons for computation
- ![image.png](attachment:9f0f7af9-7627-49c2-8bde-d87fc19d7eeb.png)

- In a GNN, dropout is applied to the linear layer in the message function
    - A simple message function with linear layer:![image.png](attachment:68a661ec-97a3-4818-86d5-0f1d06bfc033.png)
    - ![image.png](attachment:f1e58a50-3577-4547-b1b4-e49f91e7d7c4.png)
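
A small sketch of dropout in front of a linear message function. It assumes the common "inverted dropout" variant, which rescales the kept entries by 1 / (1 - p) during training so that nothing needs to change at test time. Applying the mask to the layer input is one possible placement, chosen here for illustration.

```python
import numpy as np

def dropout(X, p, training, rng):
    """Zero each entry with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return X
    keep = rng.random(X.shape) >= p  # True for entries that stay on
    return X * keep / (1.0 - p)      # rescale so the expected value is unchanged

def message_with_dropout(H, W, p, training, rng):
    """Linear message m_u = W h_u with dropout applied to the layer input."""
    return dropout(H, p, training, rng) @ W.T

rng = np.random.default_rng(0)
X = np.ones((4, 5))
out_eval = dropout(X, 0.5, training=False, rng=rng)
out_train = dropout(X, 0.5, training=True, rng=rng)
msgs = message_with_dropout(X, np.ones((3, 5)), 0.5, training=True, rng=rng)
```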

## Summary:
- Modern deep learning modules can be included into a GNN layer for better performance
- Designing novel GNN layers is still an active research frontier!
- Suggested resources: We can explore diverse GNN designs or try out our own ideas in GraphGym

# Stacking Layers of a GNN
- How to connect GNN layers into a GNN?
    - Stack layers sequentially
    - Ways of adding skip connections
    - ![image.png](attachment:feeb6dcb-15fd-47f3-bd9e-55f571b1d9e3.png)

- How to construct a Graph Neural Network?
    - The standard way: Stack GNN layers sequentially
    - Input: Initial raw node feature x_v
    - Output: Node embeddings h(L)_v after L GNN layers
    - ![image.png](attachment:db78eeff-af90-45bd-9079-8d022ea3025d.png)
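
Sequential stacking can be sketched as a loop over layers. The layer used here is a simple mean-aggregation layer with self-edges and ReLU, chosen as an assumption; any of the layers discussed earlier could be substituted. The weight shapes are hypothetical.

```python
import numpy as np

def gnn_layer(A_norm, H, W):
    """Mean-aggregate neighbor embeddings, apply a linear map, then ReLU."""
    return np.maximum(A_norm @ H @ W.T, 0.0)

def gnn_forward(A, X, weights):
    """Stack len(weights) GNN layers: x_v -> h_v^(1) -> ... -> h_v^(L)."""
    A_hat = A + np.eye(A.shape[0])                     # add self-edges
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-normalize
    H = X
    for W in weights:
        H = gnn_layer(A_norm, H, W)
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.normal(size=(3, 2))  # raw node features x_v
weights = [rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), rng.normal(size=(3, 4))]

H_L = gnn_forward(A, X, weights)  # embeddings after L = 3 layers
```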

## The over-smoothing problem
- The issue of stacking many GNN layers
    - The GNN suffers from the over-smoothing problem
- The over-smoothing problem: all the node embeddings converge to the same value
    - This is bad because we want to use node embeddings to differentiate nodes
- Why does the over-smoothing problem happen?

## Receptive Field of a GNN
- Receptive field: the set of nodes that determine the embedding of a node of interest
    - In a K-layer GNN, each node's receptive field is its K-hop neighborhood
    - ![image.png](attachment:fa279a6b-1764-451a-9dc0-bf6e150832c1.png)
- Receptive field overlap for 2 nodes:
    - The shared neighbors quickly grow when we increase the number of hops (num of GNN layers)
    - ![image.png](attachment:911fbe08-8ef9-425b-9e7b-054867687444.png)
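
To make the growth of the overlap concrete, the sketch below computes K-hop receptive fields on a small path graph and measures how much two nodes share. The graph, the helper names, and the use of Jaccard overlap (shared nodes divided by all nodes in either field) are assumptions for illustration. The receptive field is taken to include the node itself.

```python
def k_hop_neighborhood(adj, v, k):
    """Nodes reachable from v within k hops: the receptive field of a k-layer GNN."""
    field = {v}
    frontier = {v}
    for _ in range(k):
        frontier = {u for w in frontier for u in adj[w]} - field
        field |= frontier
    return field

def overlap(adj, a, b, k):
    """Fraction of nodes shared by the k-hop receptive fields of a and b."""
    fa = k_hop_neighborhood(adj, a, k)
    fb = k_hop_neighborhood(adj, b, k)
    return len(fa & fb) / len(fa | fb)

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

overlaps = [overlap(adj, 1, 4, k) for k in range(1, 5)]
```

On this graph the overlap between nodes 1 and 4 rises from no shared nodes at one hop to identical receptive fields at four hops.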

## Receptive Field and over smoothing
- We can explain over-smoothing via the notion of the receptive field
    - We know the embedding of a node is determined by its receptive field
        - If 2 nodes have highly overlapped receptive fields, then their embeddings are highly similar.
    - Stack many GNN layers -> nodes will have highly overlapped receptive fields -> node embeddings will be highly similar -> suffer from the over-smoothing problem
- Next: How do we overcome the over-smoothing problem?

## Design GNN layer connectivity
- What do we learn from the over-smoothing problem?
- Lesson 1: Be cautious when adding GNN layers
    - Unlike neural networks in other domains (CNN for image classification), adding more GNN layers does not always help
    - Step 1: Analyze the necessary receptive field to solve our problem. E.g., by computing the diameter of the graph
    - Step 2: Set number of GNN layers L to be a bit more than the receptive field we like. Do not set L to be unnecessarily large.
- Question: how to enhance the expressive power of a GNN, if the number of GNN layers is small?

### Expressive Power for Shallow GNNs
- How to make a shallow GNN more expressive?
- Solution 1: Increase the expressive power within each GNN layer
    - In our previous examples, each transformation or aggregation function only includes one linear layer
    - We can make aggregation/transformation become a deep neural network
    - ![image.png](attachment:561d4dd3-0913-4473-ad36-f8adbdd94555.png)
- Solution 2: Add layers that do not pass messages
    - A GNN does not necessarily only contain GNN layers
    - E.g., we can add MLP layers (applied to each node) before and after GNN layers, as pre-process layers and post-process layers.
    - Preprocessing layers: Important when encoding node features is necessary. E.g., when nodes represent images/text.
    - Post-processing layers: Important when reasoning / transformation over node embeddings is needed. E.g., graph classification, knowledge graphs
    - In practice, adding these layers works great!
    - ![image.png](attachment:27972498-0030-4f1c-b0c0-a44c8800efb1.png)
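
Solution 2 can be sketched as a model with node-wise MLP layers around the message-passing layers. The mean-aggregation GNN layer, the function names, and all weight shapes are assumptions for illustration.

```python
import numpy as np

def mlp_layer(H, W):
    """Node-wise layer: transforms every node independently, no message passing."""
    return np.maximum(H @ W.T, 0.0)

def gnn_layer(A_norm, H, W):
    """Message-passing layer: mean over neighbors (with self-edge), linear, ReLU."""
    return np.maximum(A_norm @ H @ W.T, 0.0)

def forward(A, X, pre, gnn, post):
    """Pre-process MLP layers -> GNN layers -> post-process MLP layers."""
    A_hat = A + np.eye(A.shape[0])
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
    H = X
    for W in pre:
        H = mlp_layer(H, W)
    for W in gnn:
        H = gnn_layer(A_norm, H, W)
    for W in post:
        H = mlp_layer(H, W)
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.normal(size=(3, 2))
pre = [rng.normal(size=(4, 2))]                           # encode raw features
gnn = [rng.normal(size=(4, 4)), rng.normal(size=(4, 4))]  # two message-passing layers
post = [rng.normal(size=(3, 4))]                          # transform final embeddings

out = forward(A, X, pre, gnn, post)
```

Only the `gnn` layers mix information between nodes, so the pre- and post-processing layers add depth without enlarging the receptive field.
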

- What if my problem still requires many GNN layers?
    - Lesson 2: Add skip connections in GNNs
        - Observation from over-smoothing: Node embeddings in earlier GNN layers can sometimes better differentiate nodes
        - Solution: we can increase the impact of earlier layers on the final node embeddings, by adding shortcuts in the GNN
        - ![image.png](attachment:e54c53f7-5366-414b-aae2-ef8f2f7412ff.png)

### Idea of Skip connections
- Why do skip connections work?
    - Intuition: Skip connections create a mixture of models
    - N skip connections -> 2**N possible paths
    - Each path could have up to N modules
    - We automatically get a mixture of shallow GNNs and deep GNNs
- ![image.png](attachment:4f81ddba-f2cc-4efd-af97-a3b0a5c42e55.png)

![image.png](attachment:b6fee7a6-d565-4380-9a3f-34feff518247.png)
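
A minimal sketch of a layer with a skip connection, assuming the form h_v^(l) = ReLU(F(h^(l-1))_v + h_v^(l-1)), where F is a mean-aggregation layer. The weight matrix must be square so that the two terms have the same shape; the names are illustrative.

```python
import numpy as np

def gnn_layer_with_skip(A_norm, H, W):
    """Add the layer input back to the layer output before the nonlinearity."""
    return np.maximum(A_norm @ H @ W.T + H, 0.0)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
A_hat = A + np.eye(3)
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# With zero weights the message-passing path contributes nothing,
# so the layer falls back to the identity path.
H_identity = gnn_layer_with_skip(A_norm, H, np.zeros((2, 2)))
H_mixed = gnn_layer_with_skip(A_norm, H, np.eye(2))
```

With the skip path available, each layer can behave like "no layer at all", which is one way to read the mixture-of-models intuition above.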

### Other skip connections
- Directly skip to the last layer
    - The final layer directly aggregates from all the node embeddings in the previous layers
- ![image.png](attachment:fcec9163-1c13-40c6-9048-d073814aaa96.png)
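
One way to sketch this variant is to keep the embeddings of every layer and let a final linear layer read their concatenation. Concatenation is an assumed choice of aggregation, and `W_out` and the other names are hypothetical.

```python
import numpy as np

def gnn_layer(A_norm, H, W):
    """Mean-aggregate neighbor embeddings, apply a linear map, then ReLU."""
    return np.maximum(A_norm @ H @ W.T, 0.0)

def forward_skip_to_last(A, X, weights, W_out):
    """Final layer aggregates the node embeddings of all previous layers."""
    A_hat = A + np.eye(A.shape[0])
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
    H, per_layer = X, []
    for W in weights:
        H = gnn_layer(A_norm, H, W)
        per_layer.append(H)                          # remember every layer's output
    stacked = np.concatenate(per_layer, axis=1)      # shape: (nodes, sum of layer dims)
    return stacked @ W_out.T

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.normal(size=(3, 2))
weights = [rng.normal(size=(4, 2)), rng.normal(size=(4, 4))]
W_out = rng.normal(size=(5, 8))  # reads the 4 + 4 concatenated dimensions

final = forward_skip_to_last(A, X, weights, W_out)
```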