# Recap: Deep Graph Encoder
- ![image.png](attachment:0ea7c541-3793-4dc1-a02a-d71e35fa6395.png)
- Idea: Node's neighborhood defines a computation graph
- ![image.png](attachment:cb37edea-279b-4e7d-83b3-bca81bd60dad.png)
- Learn how to propagate information across the graph to compute node features

## Aggregate From Neighbors:
- Intuition: Nodes aggregate information from their neighbors using neural networks
- ![image.png](attachment:c973a698-0eac-4a3e-8026-01eff4fb439b.png)
- Intuition: Network neighborhood defines a computation graph
- Every node defines a computation graph based on its neighborhood
- ![image.png](attachment:012fb84d-68b9-43b6-a7c3-5a345a9e45cd.png)

# A General GNN Framework
- ![image.png](attachment:980a40a0-52ae-4361-9a00-2866ffea5915.png)
1. GNN layer = Message + Aggregation
    - Different instantiations under this perspective
    - GCN, GraphSAGE, GAT,...
2. Connect GNN layers into a GNN
    - Stack layers sequentially
    - Ways of adding skip connections
3. Idea: Raw input graph != computational graph
    - Graph feature augmentation
    - Graph structure augmentation

- How do we train a GNN?
    - Supervised / unsupervised objectives
    - Node / edge / graph level objectives

# Single Layer of a GNN
- GNN layer = Message + Aggregation
    - Different instantiations under this perspective
    - GCN, GraphSAGE, GAT, ...
    - ![image.png](attachment:0f146c86-95af-4560-8286-0c9a8ae1f77d.png)
- Idea of a GNN layer
    - Compress a set of vectors into a single vector
    - Two-step process: 1. Message 2. Aggregation
    - ![image.png](attachment:bedc1554-6366-4e44-ad3c-55587f5d7db6.png)

## Message Computation
- Message function
    - ![image.png](attachment:ef73e771-26f0-4d27-8f13-4f17a32b3729.png)
- Intuition: Each node will create a message, which will be sent to other nodes later.
- Example: A linear layer
    - ![image.png](attachment:d6d84314-5bbb-48e0-a81b-7efb9c9052d6.png)
    - Multiply node features with weight matrix W^(l)
    - ![image.png](attachment:5bf00d64-6e1d-406a-a4ff-d7c6f21189ed.png)
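
The linear-layer message above can be sketched as follows. This is a minimal illustration, assuming NumPy; the toy feature matrix `H` (one row per node), the weight matrix `W`, and the helper `compute_messages` are hypothetical names chosen for this example.

```python
import numpy as np

# Hypothetical toy data: 3 nodes with 2-dimensional features from layer l-1.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical weight matrix W^(l) mapping 2-d features to 3-d messages.
W = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

def compute_messages(H, W):
    """Return one message per node, m_u = W h_u, stacked as rows."""
    return H @ W.T

messages = compute_messages(H, W)  # shape: (num_nodes, message_dim)
```

Stacking node features as rows lets a single matrix product compute every node's message at once.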

## Message Aggregation
- Intuition: Node v will aggregate the messages from its neighbors u:
    - ![image.png](attachment:f77f9c99-4265-4bc4-a193-013ab0fb4cc8.png)
- Example: Sum(.), Mean(.) or Max(.) aggregator
    - ![image.png](attachment:35e0bdb3-eea6-4ddf-9a58-e9cb2998af2b.png)
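
A small sketch of the three aggregators named above, assuming NumPy. The array `neighbor_messages` and the helper `aggregate` are hypothetical and exist only for illustration.

```python
import numpy as np

# Hypothetical messages from the three neighbors of node v (one row each).
neighbor_messages = np.array([[1.0, 4.0],
                              [3.0, 0.0],
                              [2.0, 2.0]])

def aggregate(messages, how="sum"):
    """Combine a set of neighbor messages into a single vector."""
    if how == "sum":
        return messages.sum(axis=0)
    if how == "mean":
        return messages.mean(axis=0)
    if how == "max":
        return messages.max(axis=0)  # element-wise maximum
    raise ValueError(f"unknown aggregator: {how}")

h_sum = aggregate(neighbor_messages, "sum")
h_mean = aggregate(neighbor_messages, "mean")
h_max = aggregate(neighbor_messages, "max")
```

All three reductions give the same result for any ordering of the rows, which is what makes them suitable for an unordered set of neighbors.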

## Message Aggregation: Issue
- Issue: Information from node v itself could get lost
    - Computation of h(l)_v does not directly depend on h(l-1)_v
- Solution: Include h(l-1)_v when computing h(l)_v
    1. Message: compute message from node v itself
        - Usually, a different message computation will be performed
        - ![image.png](attachment:e50a9f06-7037-4b2b-9e8b-760866f0468a.png)
    2. Aggregation: After aggregating from neighbors, aggregate the message from node v itself
        - Via concatenation or summation
        - ![image.png](attachment:00068f30-b33d-4781-a70b-2bb3458be4f7.png)
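
One way to sketch this fix, assuming NumPy, a sum aggregator over neighbors, and two separate weight matrices: `W` for neighbor messages and `B` for the node's own message. Both names and the helper `layer_with_self_message` are assumptions made for this example.

```python
import numpy as np

def layer_with_self_message(h_v, neighbor_feats, W, B, combine="concat"):
    """Aggregate neighbor messages, then merge in node v's own message."""
    neighbor_msgs = neighbor_feats @ W.T   # m_u = W h_u for each neighbor u
    self_msg = B @ h_v                     # m_v = B h_v, a different transform
    agg = neighbor_msgs.sum(axis=0)        # aggregate over neighbors only
    if combine == "concat":
        return np.concatenate([agg, self_msg])
    if combine == "sum":
        return agg + self_msg
    raise ValueError(f"unknown combine mode: {combine}")

# Hypothetical toy inputs.
h_v = np.array([1.0, 2.0])
neighbor_feats = np.array([[1.0, 0.0],
                           [0.0, 1.0]])
W = 2.0 * np.eye(2)
B = np.eye(2)

h_concat = layer_with_self_message(h_v, neighbor_feats, W, B, "concat")
h_summed = layer_with_self_message(h_v, neighbor_feats, W, B, "sum")
```

Concatenation doubles the output dimension but keeps the node's own signal separate from its neighbors'; summation keeps the dimension fixed and requires both messages to have the same size.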

## A Single GNN layer
- Putting things together
    1. Message: Each node computes a message
        - ![image.png](attachment:99e4cb47-1d4f-44b0-995f-bacc9666087e.png)
    2. Aggregation: aggregate messages from neighbors
        - ![image.png](attachment:eb4ed69f-3de1-408f-bc0c-f259a127a5e0.png)
- Nonlinearity (activation): Adds expressiveness
    - Often written as σ(·); examples: ReLU(·), Sigmoid(·)
    - Can be added to message or aggregation

## Activation (Non-Linearity)
- Apply the activation to the i-th dimension of embedding x
    1. Rectified Linear unit (ReLU)
        - ReLU(x_i) = max(x_i, 0)
        - ![image.png](attachment:088fe97f-5b4b-4d43-b569-3722fde4488c.png)
        - most commonly used
    2. sigmoid
        - ![image.png](attachment:f63f46c8-1c5c-4cb3-ac49-53b95958e836.png)
        - Used only when we want to restrict the range of our embeddings
        - ![image.png](attachment:e63b185a-3736-4608-97ad-5861f7328dd8.png)
    3. Parametric ReLU
        - PReLU(x_i) = max(x_i, 0) + a_i * min(x_i, 0)
        - a_i is a trainable parameter
        - ![image.png](attachment:5287cb95-40cd-4674-9eef-578507ac4681.png)
        - Empirically performs better than ReLU
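
The three activations can be written element-wise as below. This is a sketch assuming NumPy; in PReLU the slope `a` would be a trainable parameter, but here it is fixed to a hypothetical value of 0.1 for illustration.

```python
import numpy as np

def relu(x):
    """ReLU(x_i) = max(x_i, 0)."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Squashes each dimension into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, a):
    """PReLU(x_i) = max(x_i, 0) + a_i * min(x_i, 0)."""
    return np.maximum(x, 0.0) + a * np.minimum(x, 0.0)

x = np.array([-2.0, 0.0, 3.0])
y_relu = relu(x)
y_sigmoid = sigmoid(x)
y_prelu = prelu(x, a=0.1)  # a is fixed here; it is learned in practice
```

Unlike ReLU, PReLU passes a scaled version of negative inputs through, so those dimensions still receive a gradient.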

## Classical GNN layers
1. Graph Convolutional Networks (GCN)
    - Message + Aggregation
    - ![image.png](attachment:ff30d094-d226-4ad4-a5c0-93241dea7510.png)
    - Message:
        - Each neighbor: ![image.png](attachment:f7c22204-ab4b-4298-ba55-c72bbe43cbc0.png)
    - Aggregation
        - Sum over messages from neighbors, then apply activation
        - ![image.png](attachment:84f2ff89-1613-4ae5-bfe8-59400d240e4d.png)
    - In GCN the input graph is assumed to have self-edges that are included in the summation
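
A dense-matrix sketch of one GCN layer, assuming NumPy. It assumes each message is normalized by the receiving node's degree |N(v)| and that ReLU is the activation; the exact formula appears only as an image in these notes, so treat both as assumptions. The names `gcn_layer`, `A`, `H`, and `W` are hypothetical.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer on a dense adjacency matrix.

    A: (n, n) adjacency matrix that already contains self-edges.
    H: (n, d_in) node embeddings from the previous layer.
    W: (d_out, d_in) weight matrix.
    """
    deg = A.sum(axis=1, keepdims=True)   # |N(v)|, counting the self-edge
    messages = H @ W.T                   # W h_u for every node u
    aggregated = (A @ messages) / deg    # degree-normalized sum over neighbors
    return np.maximum(aggregated, 0.0)   # ReLU activation (assumed)

# Hypothetical path graph 0 - 1 - 2 with self-edges added.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])

H_next = gcn_layer(A, H, W)
```

Because the self-edges are part of `A`, each node's own embedding enters the sum through the same weight matrix as its neighbors'.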

2. GraphSAGE
- ![image.png](attachment:6267cdb7-c73e-4736-ad0c-dd9169967c49.png)
- Two-stage aggregation
    - Stage 1: Aggregate from node neighbors
        -  ![image.png](attachment:fc9ad606-e261-460f-b191-c932c655dc12.png)
    - Stage 2: Further aggregate over the node itself.
        - ![image.png](attachment:69aa95a1-4646-4849-9c88-53908b4ad48b.png)
    - Message is computed within the AGG(.)
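
The two stages can be sketched as below, assuming NumPy, a mean aggregator for Stage 1, concatenation for Stage 2, and ReLU as the activation. The function name `graphsage_layer` and the toy inputs are hypothetical.

```python
import numpy as np

def graphsage_layer(A, H, W):
    """One GraphSAGE-style layer with mean aggregation.

    A: (n, n) adjacency matrix WITHOUT self-edges.
    H: (n, d) node embeddings from the previous layer.
    W: (d_out, 2 * d) weight matrix applied to the concatenation.
    """
    # Guard against division by zero for isolated nodes.
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    h_neigh = (A @ H) / deg                        # Stage 1: mean over neighbors
    h_cat = np.concatenate([H, h_neigh], axis=1)   # Stage 2: add the node itself
    return np.maximum(h_cat @ W.T, 0.0)            # linear map + ReLU (assumed)

# Hypothetical path graph 0 - 1 - 2, no self-edges.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])

H_next = graphsage_layer(A, H, W)
```

Here the node's own embedding is kept separate from the neighborhood summary until the final linear map, in contrast to relying on self-edges in the adjacency matrix.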

#### GraphSAGE neighbor aggregation
- Mean: Take a weighted average of neighbors
    - ![image.png](attachment:418e0c36-cd2a-49d3-bb49-f9bf8a6a1201.png)
- Pool: Transform neighbor vectors and apply symmetric vector function Mean(.) or Max(.)
    - ![image.png](attachment:d3d5fbad-4406-4a53-b462-8ce755775832.png)
- LSTM: Apply an LSTM to the reshuffled neighbors
    - ![image.png](attachment:4309794d-c78f-4eb6-bb86-9c2f3e74e25f.png)
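
A sketch of the pool aggregator, assuming NumPy. A single linear map `Q` followed by ReLU stands in for the transformation applied to each neighbor; `Q` and `pool_aggregate` are hypothetical names, and a real model might use a deeper network here.

```python
import numpy as np

def pool_aggregate(neighbor_feats, Q, reduce="max"):
    """Transform each neighbor vector, then apply a symmetric reduction."""
    transformed = np.maximum(neighbor_feats @ Q.T, 0.0)  # per-neighbor transform
    if reduce == "max":
        return transformed.max(axis=0)    # element-wise max over neighbors
    if reduce == "mean":
        return transformed.mean(axis=0)
    raise ValueError(f"unknown reduction: {reduce}")

# Hypothetical neighbor embeddings (one row per neighbor).
neighbors = np.array([[1.0, -1.0],
                      [2.0, 0.0],
                      [0.0, 3.0]])
Q = np.eye(2)

pooled_max = pool_aggregate(neighbors, Q, "max")
pooled_mean = pool_aggregate(neighbors, Q, "mean")
# Reversing the neighbor order leaves the result unchanged.
pooled_reversed = pool_aggregate(neighbors[::-1], Q, "max")
```

Mean and Max do not depend on neighbor order, whereas an LSTM does; that is why the LSTM variant is applied to reshuffled neighbors.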

#### GraphSAGE: L2 Normalization
- Optional: apply l2 normalization to h(l)_v at every layer
- ![image.png](attachment:2f2e6b4f-692e-4d40-9468-e81f4a916c5d.png)
- Without l2 normalization, the embedding vectors have different scales (l2 norms)
- In some cases, normalizing the embeddings improves performance
- After l2 normalization, all vectors will have the same l2 norm.
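
Row-wise l2 normalization can be sketched as below, assuming NumPy. The helper `l2_normalize` and the small `eps` guard against all-zero embeddings are additions for this example.

```python
import numpy as np

def l2_normalize(H, eps=1e-12):
    """Scale every row of H to unit l2 norm."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, eps)  # eps avoids division by zero

# Hypothetical embeddings with very different scales.
H = np.array([[3.0, 4.0],
              [0.5, 0.0],
              [10.0, 0.0]])

H_norm = l2_normalize(H)
row_norms = np.linalg.norm(H_norm, axis=1)
```

After this step only the direction of each embedding carries information, since every nonzero row has norm 1.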

3. Graph Attention Networks
- ![image.png](attachment:d342c011-aa1d-4ec3-b08c-cce11a0e0447.png)
- In GCN / GraphSAGE
    - ![image.png](attachment:44e7f0a5-0bdb-4f74-9c65-29d99d46acd7.png) is the weighting factor (importance) of node u's message to node v
    - alpha_vu is defined explicitly based on the structural properties of the graph (node degree)
    - All neighbors u ∈ N(v) are equally important to node v
- Not all of a node's neighbors are equally important
    - Attention is inspired by cognitive attention
    - The attention alpha_vu focuses on the important parts of the input data and fades out the rest.
        - Idea: The NN should devote more computing power to the small but important part of the data
        - Which part of the data is more important depends on the context and is learned through training.
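
For contrast with learned attention, the fixed degree-based weighting used by GCN / GraphSAGE can be sketched as below. This assumes NumPy and assumes alpha_vu = 1/|N(v)|, since the formula appears only as an image in these notes; `degree_weights` is a hypothetical helper, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def degree_weights(A):
    """Return alpha with alpha[v, u] = 1/|N(v)| if u is a neighbor of v, else 0."""
    deg = A.sum(axis=1, keepdims=True)   # |N(v)| for each node v
    return A / deg

# Hypothetical star graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

alpha = degree_weights(A)
```

Each row of `alpha` sums to 1 and assigns the same weight to every neighbor, determined purely by graph structure. An attention mechanism would instead learn these weights.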