# Graph Manipulation in GNNs

## General GNN Framework
- Idea: Raw input graph != computational graph
    - Graph feature augmentation
    - Graph structure manipulation
    - ![image.png](attachment:c8446120-322e-4df9-848c-c291a343cf38.png)

## Why Manipulate Graphs
- Our assumption so far has been
    - Raw input graph = computational graph
- Reasons for breaking this assumption
    - Feature level:
        - The input graph lacks features -> Feature augmentation
    - Structure level
        - The graph is too sparse -> inefficient message passing
        - The graph is too dense -> message passing is too costly
        - The graph is too large -> cannot fit the computational graph into a GPU
    - It's just unlikely that the input graph happens to be the optimal computation graph for embeddings.

## Graph Manipulation Approaches
- Graph Feature manipulation
    - The input graph lacks features -> feature augmentation
- Graph structure manipulation
    - The graph is too sparse -> Add virtual nodes / edges
    - The graph is too dense -> Sample neighbors when doing message passing
    - The graph is too large -> Sample subgraphs to compute embeddings

## Feature Augmentation on Graphs
- Why do we need feature augmentation?
    1. Input graph does not have node features
        - This is common when we only have the adj. matrix
    - Standard approaches:
        1. Assign constant values to nodes
            - ![image.png](attachment:a28c63b2-8f24-4a57-8b74-b5c4701a809b.png)
        2. Assign unique IDs to nodes
            - These IDs are converted into one-hot vectors
            - ![image.png](attachment:6aad892c-a78e-4a5a-991d-d7e9f3177547.png)

- ![image.png](attachment:44734503-31be-4693-a110-fae75b796a2a.png)
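
The two standard approaches can be sketched in a few lines. This is a minimal illustration, not a library API: the helper names `constant_features` and `one_hot_features` are made up for this example, and features are plain Python lists.

```python
def constant_features(num_nodes, value=1.0):
    """Assign the same 1-dimensional feature to every node."""
    return [[value] for _ in range(num_nodes)]


def one_hot_features(num_nodes):
    """Assign each node a unique ID, encoded as a one-hot vector."""
    return [[1.0 if i == j else 0.0 for j in range(num_nodes)]
            for i in range(num_nodes)]


print(constant_features(3))  # [[1.0], [1.0], [1.0]]
print(one_hot_features(3))   # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Note the trade-off visible here: the constant feature stays 1-dimensional for any graph size, while the one-hot feature has one dimension per node.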

- Why do we need feature augmentation?
    2. Certain structures are hard to learn by GNN
    - Example: cycle count feature
    - Can a GNN learn the length of the cycle that v1 resides in? No
    - ![image.png](attachment:9e452c21-ecc0-491e-9be3-82c4b3f92681.png)
    - v1 cannot differentiate which graph it resides in
        - Because all the nodes in the graph have degree of 2
        - The computational graphs will be the same binary tree
    - ![image.png](attachment:93251469-4608-4ad5-99ef-b0cf246f84bb.png)
- Solution:
    - We can use cycle count as augmented node features
    - ![image.png](attachment:914b786a-e5c8-45e7-90dc-e8a282ae9791.png)
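
A small experiment makes the argument concrete. The sketch below assumes a toy sum-aggregation layer with no learned weights (an assumption for illustration, not a particular GNN) and uses the scalar cycle length as a simplified stand-in for the cycle count feature. Node 0 plays the role of v1.

```python
def cycle_graph(n):
    """Adjacency list of a single cycle on n >= 3 nodes."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def sum_message_passing(adj, feats, num_layers):
    """Each layer: new value = own value + sum of the neighbors' values."""
    for _ in range(num_layers):
        feats = {v: feats[v] + sum(feats[u] for u in adj[v]) for v in adj}
    return feats


def embed_v1(cycle_length, use_cycle_feature, num_layers=2):
    """Embedding of node 0 (standing in for v1) in a cycle of the given length."""
    adj = cycle_graph(cycle_length)
    value = float(cycle_length) if use_cycle_feature else 1.0
    feats = {v: value for v in adj}
    return sum_message_passing(adj, feats, num_layers)[0]


print(embed_v1(3, False), embed_v1(4, False))  # identical values
print(embed_v1(3, True), embed_v1(4, True))    # different values
```

With constant features the 3-cycle and the 4-cycle produce the same embedding, because every node has degree 2. Once the cycle information is supplied as an input feature, the embeddings differ.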

### Add Virtual Nodes / Edges
- Motivation: Augment sparse graph
- Add virtual nodes
- The virtual node will connect to all the nodes in the graph
    - Suppose in a sparse graph, 2 nodes have shortest path distance of 10
    - After adding the virtual node, all pairs of nodes will have a distance of at most 2.
        - Node A - Virtual node - Node B
    - ![image.png](attachment:5fb5db97-cd07-4b73-ac54-a7cf6c28ac4a.png)
- Benefits: Greatly improves message passing in sparse graphs
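
The distance claim can be checked with a breadth-first search. The sketch below uses a hypothetical path graph with 11 nodes, so the endpoints start 10 hops apart. The helper names and the node label `"virtual"` are illustrative choices.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Breadth-first search distance between two nodes (None if unreachable)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return None


def add_virtual_node(adj, name="virtual"):
    """Return a copy of the graph with one extra node linked to every node."""
    new_adj = {v: list(nbrs) + [name] for v, nbrs in adj.items()}
    new_adj[name] = list(adj)
    return new_adj


# A path graph 0 - 1 - ... - 10: the endpoints are 10 hops apart.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 10] for i in range(11)}
print(shortest_path_length(path, 0, 10))                    # 10
print(shortest_path_length(add_virtual_node(path), 0, 10))  # 2
```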

## Node Neighborhood sampling
- Our approach so far:
    - All the neighbors are used for message passing
- Problem: Dense / Large graphs, high-degree nodes
    - ![image.png](attachment:d17d5025-c2c1-4a42-bdf2-7a979b9657c8.png)
- New Idea: (Randomly) determine a node's neighborhood for message passing
- For example, we can randomly choose 2 neighbors to pass messages
    - Only nodes B and D will pass message to A
    - ![image.png](attachment:63d1aa21-dab5-49ae-a865-72d0eabefedc.png)
    - Next time when we compute the embeddings, we can sample different neighbors
    - ![image.png](attachment:58ac3cde-1ebc-4142-ad2f-2ad0bcc185d6.png)
- In expectation, we get embeddings similar to the case where all the neighbors are used
    - Benefits: Greatly reduces computational cost
    - It works great in practice
    - ![image.png](attachment:2aebc991-276e-49e8-a647-09da56f50da1.png)
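
As a rough sketch of the idea, the code below samples 2 of node A's 3 neighbors uniformly at random and averages their features. The scalar features, the mean aggregator, and the helper names are assumptions made for this example. Under these assumptions the sampled aggregate matches the full-neighborhood aggregate on average.

```python
import random


def sample_neighbors(adj, node, num_samples, rng):
    """Pick at most num_samples neighbors of node uniformly at random."""
    neighbors = adj[node]
    if len(neighbors) <= num_samples:
        return list(neighbors)
    return rng.sample(neighbors, num_samples)


def mean_aggregate(feats, neighbors):
    """Average the (scalar) features of the chosen neighbors."""
    return sum(feats[u] for u in neighbors) / len(neighbors)


adj = {"A": ["B", "C", "D"]}
feats = {"B": 1.0, "C": 2.0, "D": 3.0}
rng = random.Random(0)  # fixed seed so the run is repeatable

full = mean_aggregate(feats, adj["A"])
estimates = [mean_aggregate(feats, sample_neighbors(adj, "A", 2, rng))
             for _ in range(10000)]
average_estimate = sum(estimates) / len(estimates)
print(full, average_estimate)  # the two values should be close
```

Each individual call touches only 2 neighbors instead of all of them, which is where the savings come from on high-degree nodes.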

# Prediction with GNNs
- How do we train a GNN?

## GNN Pipeline
- ![image.png](attachment:0137eac2-9ae6-4c26-8209-6394556d966f.png)
- Next: the prediction head
- Different prediction heads:
    1. Node level tasks
    2. Edge level tasks
    3. Graph level tasks

## GNN Prediction Heads
- Idea: Different task levels require different prediction heads
- ![image.png](attachment:2a8cf7b7-aa91-4711-bd8e-355ebe0298b2.png)

## Prediction Heads: Node-level
- Node-level prediction: We can directly make prediction using node embeddings!
- After GNN computation, we have d-dim node embeddings:
    - ![image.png](attachment:5b97f35d-c240-4a28-a85f-b4f6d1c9a6dd.png)
- Suppose we want to make k-way prediction
    - Classification: classify among k categories
    - Regression: regress on k targets
- ![image.png](attachment:1ad8f21f-a0dc-47a1-8f96-3c0cafab9003.png)
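
One simple node-level head is a single linear map from the d-dim embedding to k outputs. The sketch below assumes that form (a k x d weight matrix, no bias) and uses made-up numbers with d = 3 and k = 2.

```python
def linear_head(weight, embedding):
    """Map a d-dim embedding to k outputs with a k x d weight matrix."""
    return [sum(w * h for w, h in zip(row, embedding)) for row in weight]


# Hypothetical values: d = 3 embedding, k = 2 outputs.
h_v = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.5]]
print(linear_head(W, h_v))  # [2.5, -1.0]
```

For classification the k outputs would be treated as class scores, and for regression as the k predicted targets.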

## Prediction Heads: Edge-level
- Edge-level prediction: Make prediction using pairs of node embeddings
- Suppose we want to make k-way prediction
- ![image.png](attachment:04ea8423-3cd8-4d92-a90c-f3932cf5b2eb.png)
- Options for Head_edge(h(L)_u, h(L)_v)
    1. Concatenation + Linear
        - We have seen this in graph attention
        - ![image.png](attachment:7f9d7d26-f320-4c9b-a4be-37d0c561282d.png)
        - Here Linear(.) maps the 2d-dimensional concatenated embedding to a k-dim output (k-way prediction)
    2. Dot Product
        - ![image.png](attachment:df106ae6-8cc1-4bfe-82b2-b4ef38f54b04.png)
        - As written, this approach only applies to 1-way prediction (e.g., link prediction: predict the existence of an edge)
        - Extending it to k-way prediction:
            - ![image.png](attachment:0f2678c3-01c0-4802-a966-2c6022317988.png)
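
The edge-level options can be sketched side by side. The k-way variant shown here uses one d x d matrix per output and computes h_u^T W h_v for each. That is one common way to generalize the dot product and is an assumption about the formulation, as are the helper names and the toy numbers.

```python
def linear(weight, x):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def concat_linear_head(weight, h_u, h_v):
    """Concatenate the two d-dim embeddings, then apply a k x 2d linear map."""
    return linear(weight, h_u + h_v)


def dot_product_head(h_u, h_v):
    """1-way prediction: a single score per node pair."""
    return sum(a * b for a, b in zip(h_u, h_v))


def bilinear_k_way_head(weights, h_u, h_v):
    """k-way prediction: one d x d matrix per output, y_j = h_u^T W_j h_v."""
    return [dot_product_head(h_u, linear(W_j, h_v)) for W_j in weights]


h_u, h_v = [1.0, 2.0], [3.0, -1.0]
W_concat = [[1.0, 0.0, 0.0, 1.0],
            [0.0, 1.0, 1.0, 0.0]]
W_bilinear = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]

print(concat_linear_head(W_concat, h_u, h_v))     # [0.0, 5.0]
print(dot_product_head(h_u, h_v))                 # 1.0
print(bilinear_k_way_head(W_bilinear, h_u, h_v))  # [1.0, 5.0]
```

With the identity matrix as the first bilinear weight, the first output reduces to the plain dot product.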

## Prediction Heads: Graph Level
- Graph level prediction: Make prediction using all the node embeddings in our graph.
- Suppose we want to make k-way prediction
- ![image.png](attachment:b5cea884-da31-4dd1-94da-fe5bc58d8f0a.png)
- ![image.png](attachment:f8d0a078-5770-4a7d-b1e5-9182ca43a777.png)

## Issue of Global Pooling
- Issue: Global pooling over a (large) graph will lose information
- Toy example: We use 1-dim node embeddings
    - Node embedding for G1: {-1, -2, 0, 1, 2}
    - Node embedding for G2: {-10, -20, 0, 10, 20}
    - Clearly G1 and G2 have very different node embeddings
        - Their structures should be different
    - If we do global sum pooling
        - Prediction for G1: y_G = Sum({-1, -2, 0, 1, 2}) = 0
        - Prediction for G2: y_G = Sum({-10, -20, 0, 10, 20}) = 0
        - We cannot differentiate G1 and G2.
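
The toy example is easy to reproduce. The sketch below applies three global pooling choices to the two embedding sets above. The function name `global_pool` is illustrative.

```python
def global_pool(embeddings, how):
    """Pool a list of 1-dim node embeddings into one graph-level value."""
    if how == "sum":
        return sum(embeddings)
    if how == "mean":
        return sum(embeddings) / len(embeddings)
    if how == "max":
        return max(embeddings)
    raise ValueError(f"unknown pooling: {how}")


g1 = [-1, -2, 0, 1, 2]
g2 = [-10, -20, 0, 10, 20]
for how in ("sum", "mean", "max"):
    print(how, global_pool(g1, how), global_pool(g2, how))
```

Sum and mean pooling both map G1 and G2 to 0. Max pooling happens to separate this particular pair (2 versus 20), but it discards everything except the largest value, so it is not a general fix.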

## Add Virtual Super Node
- To embed a graph, add a virtual super node
    - The virtual node will connect to all the nodes in the graph
    - Benefits: GNN learns how to encode the entire input graph.
- ![image.png](attachment:390c0ba4-a064-4e3a-b415-37b434b8b164.png)

## Training Graph Neural Networks
- Where does ground-truth come from?
    - Supervised labels
    - Unsupervised signals
- ![image.png](attachment:341bb80d-a7a5-41bb-8821-04e567688117.png)

## Supervised vs Unsupervised
- Supervised learning on graphs
    - Labels come from external sources
        - E.g., predict drug likeness of a molecular graph
- Unsupervised learning on graphs
    - Signals come from graphs themselves
        - E.g., link prediction: Predict if 2 nodes are connected
- Sometimes the differences are blurry
    - We still have 'supervision' in unsupervised learning
        - E.g., train a GNN to predict node clustering coefficient
    - An alternative name for unsupervised is self-supervised

## Supervised Labels on Graphs
- Supervised labels come from the specific use cases. For example:
    - Node labels y_v: in a citation network, which subject area does a node belong to
    - Edge labels y_uv: in a transaction network, whether an edge is fraudulent
    - Graph labels y_G: among molecular graphs, the drug likeness of graphs
- Advice: Reduce your task to node / edge / graph labels, since they are easy to work with
    - E.g., if we know that some nodes form a cluster, we can treat the cluster that a node belongs to as a node label.

## Unsupervised signals on Graphs
- Problem: Sometimes we only have a graph, without any external labels
- Solution: self-supervised learning, where we find supervision signals within the graph itself.
    - For example, we can let the GNN predict the following:
        - Node-level y_v. Node statistics: such as clustering coefficient, PageRank, ...
        - Edge-level y_uv. Link prediction: Hide the edge between 2 nodes, predict if there should be a link
        - Graph-level y_G. Graph statistics: For example, predict if 2 graphs are isomorphic
    - These tasks do not require external labels
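
As an illustration of a node-level self-supervised target, the sketch below computes the local clustering coefficient of every node directly from the graph structure. The toy graph and the function name are assumptions for this example. The resulting values could serve as regression labels y_v.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(neighbors, 2))
    closed = sum(1 for a, b in pairs if b in adj[a])
    return closed / len(pairs)


# Hypothetical graph: a triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
labels = {v: clustering_coefficient(adj, v) for v in adj}
print(labels)
```

No external annotation is involved: the labels are derived from the adjacency structure alone.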

## 3. Graph pipeline
- How do we compute the final loss?
    - Classification loss
    - Regression loss
- ![image.png](attachment:bbc96f5c-1d9a-4ad9-985b-22204940a721.png)

## Settings for GNN training
- Setting: we have N data points
    - Each data point can be a node/edge/graph
    - ![image.png](attachment:629d17c6-db05-4a3b-a92d-98cfe0fc7e40.png)
    - We will use prediction y_hat(i) and label y(i) to refer to predictions and labels at all levels.

## Classification or Regression
- Classification: labels y(i) with discrete values
    - E.g., node classification: which category does a node belong to
- Regression: labels y(i) with continuous values
    - E.g., predict the drug likeness of a molecular graph
- GNNs can be applied to both settings
- Differences: loss function & evaluation metrics

## Classification Loss:
- Cross entropy is a very common loss function in classification
- K-way prediction for ith data point:
    - ![image.png](attachment:8bb1cae4-3d85-480e-80fc-93f43c423567.png)
- Total loss over all N training examples
    - ![image.png](attachment:2d9061c7-4504-4a1e-b208-ff58a6887dc5.png)
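
A minimal sketch of the cross entropy computation follows. It assumes one-hot labels, a softmax over the k raw outputs of the prediction head, and a plain sum over the N examples. These are common conventions, but they are assumptions here rather than a statement of the exact formula in the figures.

```python
import math


def softmax(logits):
    """Turn k raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, logits):
    """CE for one data point: -sum_j y_j * log(softmax(logits)_j)."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


def total_loss(labels, all_logits):
    """Sum of the per-example cross entropy over all N training examples."""
    return sum(cross_entropy(y, z) for y, z in zip(labels, all_logits))


labels = [[1, 0], [0, 1]]
all_logits = [[0.0, 0.0], [0.0, 0.0]]
print(total_loss(labels, all_logits))  # 2 * log(2), since both predictions are uniform
```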

## Regression Loss
- For regression tasks we often use mean squared error (MSE), a.k.a. L2 loss
- K-way regression for the ith data point:
    - ![image.png](attachment:89e5e8bf-55dd-4d57-ab4e-6518ee63ebf7.png)
- Total loss over all N training examples:
    - ![image.png](attachment:ff9efeff-1608-4091-93d1-c0057117cacc.png)
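
The squared-error loss can be sketched as follows. This version sums the squared differences over the k targets and then over the N examples. Whether to divide by k or N is a convention, and the choice made here is an assumption.

```python
def squared_error(y, y_hat):
    """Squared error for one data point, summed over its k targets."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def total_squared_error(labels, predictions):
    """Sum of the per-example squared error over all N training examples."""
    return sum(squared_error(y, p) for y, p in zip(labels, predictions))


labels = [[1.0, 2.0], [0.0, 0.0]]
predictions = [[0.0, 4.0], [1.0, 0.0]]
print(squared_error(labels[0], predictions[0]))  # 5.0
print(total_squared_error(labels, predictions))  # 6.0
```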

# Graph Pipeline 
- How do we measure the success of a GNN?
    - Accuracy
    - ROC AUC
- ![image.png](attachment:0e569b30-c760-4df2-aa17-9062c752c52f.png)

## Evaluation Metrics: Regression
- We use standard evaluation metrics for GNNs
- Evaluate regression tasks on graphs
- Root mean square error (RMSE)
    - ![image.png](attachment:fbab0aef-3149-46a4-b276-55470e62fc7e.png)
- Mean absolute error (MAE)
    - ![image.png](attachment:e6bb2a3a-6f70-45d1-a6d1-617e57c50c31.png)
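
Both metrics follow directly from their definitions. The sketch below assumes scalar labels and predictions given as plain lists.

```python
import math


def rmse(labels, predictions):
    """Root mean square error over N scalar predictions."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)


def mae(labels, predictions):
    """Mean absolute error over N scalar predictions."""
    n = len(labels)
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / n


labels = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 6.0]
print(rmse(labels, predictions))  # sqrt(3), about 1.732
print(mae(labels, predictions))   # 1.0
```

The single large error dominates RMSE more than MAE, which is the usual reason for reporting both.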

## Evaluation Metrics: Classification
- Evaluate classification tasks on graphs:
    1. Multi-class classification
        - We simply report the accuracy
        - ![image.png](attachment:4450bd22-1983-4921-ba7c-356acd89f46f.png)
    2. Binary classification
        - Metrics sensitive to classification threshold
            - Accuracy
            - Precision / Recall
            - If the range of prediction is [0, 1], we will use 0.5 as threshold
        - Metrics agnostic to classification threshold
            -  ROC AUC
        - ![image.png](attachment:ee9b14be-2be4-4981-a77f-3a8232e21502.png)
        - ROC curve: Captures the tradeoff between TPR and FPR as the classification threshold is varied for a binary classifier
        - ![image.png](attachment:4781011c-ad03-4e17-8194-b6a396dab879.png)
        - ROC AUC: Area under the ROC curve. Intuition: The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
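
The difference between threshold-sensitive and threshold-agnostic metrics can be seen on a tiny example. The sketch below computes ROC AUC from the ranking intuition (fraction of positive/negative pairs ordered correctly, with ties counted as half) instead of integrating the curve. The function names and the data are made up for illustration.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision and recall after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(threshold_metrics(labels, scores))  # (0.5, 0.5, 0.5)
print(roc_auc(labels, scores))            # 0.75
```

Changing `threshold` changes the first result but never the second, since ROC AUC depends only on how the scores are ordered.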

## Graph Training pipeline
- ![image.png](attachment:11c49f75-d09f-4f27-9553-1e4837ae0064.png)

### Dataset split: Fixed / Random split
- Fixed split: We will split our dataset once
    - Training set: Used for optimizing GNN parameters
    - Validation set: Used for developing the model / hyperparameters
    - Test set: Held out until we report final performance
- A concern: Sometimes we cannot guarantee that the test set will really be held out.
- Random split: We will randomly split our dataset into training/validation/test
    - We report average performance over different random seeds.
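
A random split with seed averaging might look like the sketch below. The 60/20/20 fractions and the `evaluate` callback (which stands in for "train a model and return its test score") are assumptions made for this example.

```python
import random


def random_split(items, seed, train_frac=0.6, val_frac=0.2):
    """Shuffle with the given seed, then cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def average_over_seeds(items, seeds, evaluate):
    """Mean of evaluate(train, val, test) across one random split per seed."""
    scores = [evaluate(*random_split(items, seed)) for seed in seeds]
    return sum(scores) / len(scores)


train, val, test = random_split(range(10), seed=0)
print(len(train), len(val), len(test))  # 6 2 2
```

Fixing the seed inside `random_split` keeps each split reproducible, while varying the seed across runs gives the different random splits to average over.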

## Why splitting graphs is special
- Suppose we want to split an image dataset
    - Image classification: Each data point is an image
    - Here data points are independent
        - Image 5 will not affect our prediction on image 1
        - ![image.png](attachment:8fb21ef8-b867-4d79-b5b4-5b393494fda0.png)
    - Splitting a graph dataset is different!
        - Node classification: Each data point is a node
        - Here data points are NOT independent
            - Node 5 will affect our prediction on node 1, because it will participate in message passing -> affect node 1's embedding
        - ![image.png](attachment:ee169da3-ae56-4a98-9a51-708273629bfb.png)
- Solution 1 (Transductive setting): The input graph can be observed in all the dataset splits (training, validation and test set)
- We will only split the (node) labels
    - At training time, we compute embeddings using the entire graph, and train using node 1 and 2's labels
    - At validation time, we compute embeddings using the entire graph, and evaluate on node 3 & 4's labels
- Solution 2 (Inductive setting): We break the edges between splits to get multiple graphs
    - Now we have 3 graphs that are independent. Node 5 will not affect our prediction on node 1 any more.
    - At training time, we compute embeddings using the graph over node 1 and 2, and train using node 1&2's labels.
    - At validation time, we compute embeddings using the graph over node 3 & 4, and evaluate on node 3 & 4's labels.
    - ![image.png](attachment:bf9d7ba8-8f97-4e69-a5d7-07023bc5c371.png)
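
The two solutions differ only in which edges each split is allowed to see. The sketch below uses a hypothetical 6-node graph (nodes 1-2 for training, 3-4 for validation, 5-6 for test). The edge list and the function names are illustrative.

```python
def transductive_split(edges, split_of):
    """Every split sees the whole graph; only the labeled nodes differ."""
    splits = {}
    for name in ("train", "val", "test"):
        nodes = sorted(v for v, s in split_of.items() if s == name)
        splits[name] = {"edges": list(edges), "label_nodes": nodes}
    return splits


def inductive_split(edges, split_of):
    """Drop edges that cross splits, so each split is an independent graph."""
    splits = {}
    for name in ("train", "val", "test"):
        nodes = sorted(v for v, s in split_of.items() if s == name)
        kept = [(u, v) for u, v in edges if split_of[u] == split_of[v] == name]
        splits[name] = {"edges": kept, "label_nodes": nodes}
    return splits


# Hypothetical 6-node graph; nodes 1-2 train, 3-4 validation, 5-6 test.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5)]
split_of = {1: "train", 2: "train", 3: "val", 4: "val", 5: "test", 6: "test"}

print(transductive_split(edges, split_of)["train"])
print(inductive_split(edges, split_of)["train"])
```

In the transductive output the training split still carries all 6 edges, whereas the inductive output keeps only the edge between nodes 1 and 2.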

## Transductive / Inductive settings
- Transductive setting: training / validation / test sets are on the same graph
    - The dataset consist of one graph
    - The entire graph can be observed in all dataset splits, we only split the labels.
    - Only applicable to node / edge prediction tasks.
- Inductive setting: training / validation / test sets are on different graphs
    - The dataset consists of multiple graphs
    - Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs
    - Applicable to node / edge / graph tasks.

## Example: Node classification
- Transductive node classification
    - All the splits can observe the entire graph structure, but can only observe the labels of their respective nodes
    - ![image.png](attachment:06bd09b3-8602-4382-81bf-c80be871abfc.png)
- Inductive node classification
    - Suppose we have a dataset of 3 graphs
    - Each split contains an independent graph
    - ![image.png](attachment:3cdba2df-77b0-466f-ab32-7b3645a6b338.png)

## Example: Graph Classification
- Only the inductive setting is well defined for graph classification
    - Because we have to test on unseen graphs
    - Suppose we have a dataset of 5 graphs. Each split will contain independent graph(s).
- ![image.png](attachment:929e2f65-010d-466b-a1ed-cf22dbc57166.png)

## Example: Link Prediction
- Goal of link prediction: Predict missing edges
- Setting up link prediction is tricky
    - Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own.
    - Concretely, we need to hide some edges from the GNN and let the GNN predict if the edges exist.
- ![image.png](attachment:686e5394-05b9-42cc-b7e9-0137c588fbb5.png)

## Setting up Link Prediction
- ![image.png](attachment:379a91b0-7668-4740-935b-9093efff3d8b.png)
- For link prediction, we will split edges twice
- Step1: Assign 2 types of edges in the original graph
    - Message edges: Used for GNN message passing
    - Supervision edges: Used for computing objectives
    - After step1:
        - Only message edges will remain in the graph
        - Supervision edges are used to supervise the edge predictions made by the model; they will not be fed into the GNN.
- Step2: Split edges into train / validation / test
    - Option1: Inductive link prediction split
        - Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent graph
        - ![image.png](attachment:e49d5411-32a4-46aa-8d9c-270296cc26fd.png)
        - In train or val or test set, each graph will have 2 types of edges: message edges + supervision edges
        - ![image.png](attachment:b911e914-1b67-4550-bbfa-4c0940d306b4.png)
    - Option2: Transductive link prediction split:
        - This is the default setting when people talk about link prediction
        - Suppose we have a dataset of 1 graph
        - By definition of 'transductive', the entire graph can be observed in all dataset splits
            - But since edges are both part of graph structure and the supervision, we need to hold out validation / test edges
            - To train on the training set, we further need to hold out supervision edges for the training set
        - ![image.png](attachment:4f25a895-ea37-42ae-b611-95fdeb8e216b.png)
        - Why do we use a growing number of edges? After training, the supervision edges are known to the GNN. Therefore, an ideal model should use the supervision edges in message passing at validation time. The same applies at test time.
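
The transductive split with a growing set of message edges can be sketched as below. The function name, the edge counts, and the toy edge list are assumptions for illustration. Only existing (positive) edges are handled.

```python
import random


def transductive_link_split(edges, n_train_sup, n_val, n_test, seed=0):
    """Split one graph's edges into message / supervision sets per stage."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train_sup = shuffled[n_test + n_val:n_test + n_val + n_train_sup]
    train_msg = shuffled[n_test + n_val + n_train_sup:]
    return {
        # Training: predict training supervision edges from training message edges.
        "train": {"message": train_msg, "supervision": train_sup},
        # Validation: training supervision edges are now known, so they join
        # the message edges.
        "val": {"message": train_msg + train_sup, "supervision": val},
        # Test: validation edges join the message edges as well.
        "test": {"message": train_msg + train_sup + val, "supervision": test},
    }


# Hypothetical graph with 10 edges: a path over nodes 0..10.
edges = [(i, i + 1) for i in range(10)]
split = transductive_link_split(edges, n_train_sup=2, n_val=2, n_test=2)
for stage in ("train", "val", "test"):
    print(stage, len(split[stage]["message"]), len(split[stage]["supervision"]))
```

At every stage the supervision edges are disjoint from the message edges, so the GNN never passes messages over an edge it is being asked to predict.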