# Recap: Node Embeddings
- Intuition: Map nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.
- ![image.png](attachment:93343202-7a12-4c22-970c-0f7aea38ec44.png)
- How to learn mapping function f?
- ![image.png](attachment:38bd0c5f-9e24-48f4-a749-1d7ec9848c28.png)

## Shallow Encoding
- Simplest encoding approach: the encoder is just an embedding lookup
- ![image.png](attachment:49e293ab-0590-43ba-b248-07d7549ea46d.png)
- Limitations of shallow embedding methods:
    - O(|V|d) parameters are needed:
        - No sharing of parameters between nodes
        - Every node has its own unique embedding
    - Inherently 'transductive'
        - Cannot generate embeddings for nodes that are not seen during training
    - Do not incorporate node features
        - Nodes in many graphs have features that we can and should leverage
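
The embedding lookup and its parameter count can be sketched as follows. This is a minimal illustration, assuming NumPy; the sizes and the names `Z` and `enc` are hypothetical, and storing one row per node is a layout choice made here, not something the notes specify.

```python
import numpy as np

num_nodes, d = 5, 3                      # hypothetical |V| and embedding dimension
rng = np.random.default_rng(0)
Z = rng.normal(size=(num_nodes, d))      # one trainable row per node, nothing shared

def enc(v: int) -> np.ndarray:
    """Shallow encoder: look up the row of Z that belongs to node v."""
    return Z[v]

# The lookup is the same as multiplying Z by a one-hot indicator vector.
one_hot = np.eye(num_nodes)[2]
same_as_matmul = np.allclose(enc(2), one_hot @ Z)

num_params = Z.size                      # |V| * d parameters
```

A node index outside `range(num_nodes)` has no row in `Z`, which is the transductive limitation in concrete form.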

# Deep Graph Encoders
- We will discuss deep learning methods based on graph neural networks (GNNs)
    - ENC(v) = multiple layers of non-linear transformations based on graph structure
- All these deep encoders can be combined with node similarity functions.
- ![image.png](attachment:2491a312-f4d5-4cbb-8807-2c30271120c3.png)

### Tasks on Networks
- Tasks we will be able to solve:
    - Node classification
        - Predict the type of a given node
    - Link Prediction
        - Predict whether 2 nodes are linked
    - Community detection
        - Identify densely linked clusters of nodes
    - Network similarity
        - Measure how similar two (sub)networks are

## Modern ML toolbox
- Modern deep learning toolbox is designed for simple sequences & grids
- ![image.png](attachment:030e1769-e0ae-43ee-b87d-914d71a71f5e.png)

- But networks are far more complex
    - Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
        - ![image.png](attachment:8ac2b26b-9cc4-4697-bd7d-c128af089f6c.png)
    - No fixed node ordering or reference point
    - Often dynamic and have multimodal features

## Basics of Deep Learning
- Loss function L: we optimize the parameters to minimize L(y, f(x))
- f can be a simple linear layer, an MLP, or another neural network (e.g., a GNN layer)
- Sample a minibatch of input data x
- Forward propagation: Compute L given x
- Back propagation: Obtain the gradient using the chain rule
- Use stochastic gradient descent (SGD) to optimize L with respect to the weights over many iterations

## Deep Learning for Graphs
- Local network neighborhoods:
    - Describe aggregation strategies
    - Define computation graphs
- Stacking multiple layers
    - Describe the model, parameters, training
    - How to fit the model?
    - Simple example for unsupervised and supervised training.

## Setup
- Assume we have a graph G
    - V is the vertex set
    - A is the adjacency matrix (assume binary)
    - $X \in \mathbb{R}^{|V| \times m}$ is a matrix of node features (one row per node, m features each)
    - v: a node in V; N(v): the set of neighbors of v
    - Node features:
        - Social networks: user profile, user image
        - Biological networks: gene expression profiles, gene functional information
        - When there are no node features in the graph dataset:
            - Indicator vectors (one-hot encoding of a node)
            - Vector of constant 1: [1, 1, ..., 1]
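
As a concrete instance of this setup, the sketch below builds A, N(v), and the two fallback feature matrices for a small graph. It assumes NumPy; the edge list and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical undirected graph on 4 nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

A = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1                # binary, symmetric adjacency matrix

def neighbors(v: int) -> list[int]:
    """N(v): indices of the nodes adjacent to v."""
    return np.flatnonzero(A[v]).tolist()

# Two fallbacks when the dataset has no node features.
X_one_hot = np.eye(num_nodes)            # indicator vector per node
X_const = np.ones((num_nodes, 1))        # constant feature 1 for every node
```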

## A Naive Approach
- Join adjacency matrix and features
- Feed them into a deep neural network
- ![image.png](attachment:76fe11d4-12e8-44fb-8d77-0d61de2f0b2f.png)
- Issues with this idea:
    - O(|V|) parameters
    - Not applicable to graphs of different sizes
    - Sensitive to node ordering

## Idea: Convolutional Networks
- CNN on an image
- ![image.png](attachment:dd6862dd-e609-400d-bad5-bd1ee5c1657a.png)
- The goal is to generalize convolutions beyond simple lattices and to leverage node features/attributes (e.g., text, images)

## Real-world graphs
- But our graphs look like this:
- ![image.png](attachment:eab040d7-0061-4e85-96f8-13ce235553ed.png)
- There is no fixed notion of locality or sliding window on the graph
- Graphs are permutation invariant

### Permutation invariance
- Imagine we want to embed an entire graph
- Observation: A graph does not have a canonical ordering of its nodes
    - We can have many different node orderings of the same graph
- What do we want: If we learn an embedding function over a graph, we should get the same result (same embedding) regardless of how the nodes are numbered
- ![image.png](attachment:eba113a8-716b-4fdb-9fc6-7fefc9d49ebd.png)
- The learned graph representation should be the same for Order 1 and Order 2
- What does it mean for the graph representation to be the same for two orderings?
    - Suppose we learn a function f that maps a graph G = (A, X) to a vector in $\mathbb{R}^d$; then
        - ![image.png](attachment:ff514be4-72d6-4b87-896b-33b05fb6b369.png)
- ![image.png](attachment:bf2a5b70-cb9c-4186-86c3-ee0a5347fc6d.png)
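
The invariance condition can be checked numerically. The sketch below assumes NumPy and uses a deliberately simple, hypothetical graph-level function `f_graph` (sum of neighbor-aggregated features); renumbering nodes is expressed with a permutation matrix P, so the reordered graph is (P A Pᵀ, P X).

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))              # hypothetical node features

def f_graph(A, X):
    """Toy graph-level function: sum over nodes of aggregated neighbor features."""
    return (A @ X).sum(axis=0)

P = np.eye(4)[[2, 0, 3, 1]]              # permutation matrix for one reordering
A_perm, X_perm = P @ A @ P.T, P @ X      # same graph, nodes renumbered

invariant = np.allclose(f_graph(A, X), f_graph(A_perm, X_perm))
```

The check passes because summing over nodes discards the order in which they are listed.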

## Permutation Equivariance
- For node representation: We learn a function f that maps the nodes of G to a matrix in $\mathbb{R}^{|V| \times d}$
- In other words, each node in V is mapped to a d-dim embedding.
- ![image.png](attachment:e127510d-b86f-428c-b5ed-1d43b82e6924.png)

- For node representation: 
    - Suppose we learn a function f that maps a graph G = (A, X) to a matrix in $\mathbb{R}^{|V| \times d}$
    - If the output vector of a node at the same position in the graph remains unchanged for any ordering, we say f is permutation equivariant.

![image.png](attachment:f77a1cd7-f9f8-4670-b002-74ede1d4b8fc.png)
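
Equivariance can be checked the same way: permuting the input should permute the output rows identically, i.e. f(P A Pᵀ, P X) = P f(A, X). A minimal sketch assuming NumPy, with a hypothetical node-level function `f_nodes` that sums neighbor features:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))              # hypothetical node features

def f_nodes(A, X):
    """Toy node-level function: row v is the sum of the features of v's neighbors."""
    return A @ X

P = np.eye(4)[[2, 0, 3, 1]]              # permutation matrix for one reordering
equivariant = np.allclose(f_nodes(P @ A @ P.T, P @ X), P @ f_nodes(A, X))
```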

## Graph Neural Network overview
- Graph neural networks consist of multiple permutation equivariant / invariant functions
- ![image.png](attachment:6b6fcebe-0f2d-417b-8920-5045c0e775c9.png)
- Are other neural network architectures (MLPs) permutation invariant/equivariant? No. Switching the order of the input leads to different outputs!
- ![image.png](attachment:4592a56f-639b-43f2-a10e-bbe9e1987a18.png)
- This explains why the naive MLP approach fails for graphs!
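
A small experiment makes the failure visible. The sketch assumes NumPy and a hypothetical two-layer MLP with fixed random weights applied to each row of [A | X]; `tanh` is used only so that no hidden unit is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))
W1 = rng.normal(size=(4 + 2, 8))         # first layer is sized by |V| + feature dim
W2 = rng.normal(size=(8, 3))

def naive_mlp(A, X):
    """Feed each node's row of [A | X] through a fixed two-layer MLP."""
    return np.tanh(np.hstack([A, X]) @ W1) @ W2

P = np.eye(4)[[1, 0, 2, 3]]              # swap the labels of nodes 0 and 1
out = naive_mlp(A, X)
out_perm = naive_mlp(P @ A @ P.T, P @ X)

# For an equivariant function the outputs would just be reordered by P.
equivariant = np.allclose(out_perm, P @ out)
```

Here `equivariant` comes out false: relabeling two nodes shuffles the adjacency columns, and the fixed weights in `W1` treat each column position differently. The shape of `W1` also shows why the model cannot be applied to a graph with a different number of nodes.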

# Graph Convolutional Networks
- Idea: Node's neighborhood defines a computation graph
- ![image.png](attachment:2dee049e-1986-440f-8b0c-3b183a6421c7.png)
- Learn how to propagate information across the graph to compute node features.

## Idea: Aggregate Neighbors
- Key idea: Generate node embeddings based on local network neighborhood
    - ![image.png](attachment:c8a8a6d1-0199-4d43-8477-0fbbc1ce8876.png)
- Intuition: Nodes aggregate information from their neighbors using neural networks
- ![image.png](attachment:629a0e61-0bb5-423d-9b92-b81aaf37c9e9.png)
- Intuition: Network neighborhood defines a computation graph
    - Every node defines a computation graph based on its neighborhood
    - ![image.png](attachment:7690d011-d3f2-4d61-9030-5d239c11df24.png)

## Deep Model: Many Layers
- Model can be of arbitrary depth:
    - Nodes have embeddings at each layer
    - Layer-0 embedding of node v is its input feature, x_v
    - Layer-k embedding gets information from nodes that are k hops away
    - ![image.png](attachment:4da30f9d-f284-42de-b640-c110000b6493.png)

## Neighborhood Aggregation
- Key distinctions are in how different approaches aggregate information across the layers.
- Basic Approach: Average information from neighbors and apply a neural network
- ![image.png](attachment:f400f040-7579-428f-aece-8d24e844dd7d.png)

## Math: Deep Encoder of a GCN
- Basic Approach: Average neighbor messages and apply a neural network
- ![image.png](attachment:dbbdb1cf-cce9-4e77-928d-4c210ba461e1.png)
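
Assuming the update in the figure is the usual mean-aggregation rule,

$$h_v^{(k+1)} = \sigma\Big(W_k \frac{1}{|N(v)|}\sum_{u \in N(v)} h_u^{(k)} + B_k h_v^{(k)}\Big), \qquad h_v^{(0)} = x_v,$$

a single layer for one node can be sketched as below. NumPy, ReLU as the non-linearity, the toy graph, and the identity/zero weights are all assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_node_update(v, H, neighbors, W, B):
    """One layer for node v: average the neighbor embeddings, then transform."""
    neigh_mean = H[neighbors[v]].mean(axis=0)   # the order of neighbors is irrelevant
    return relu(W @ neigh_mean + B @ H[v])

# Hypothetical toy graph as adjacency lists, with 2-dimensional input features.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
W = np.eye(2)                            # neighbor weights (identity for readability)
B = np.zeros((2, 2))                     # self weights (zero for readability)

h_3 = gcn_node_update(3, H0, neighbors, W, B)   # single neighbor: node 2
h_0 = gcn_node_update(0, H0, neighbors, W, B)   # mean of nodes 1 and 2
```

With these weights, `h_3` equals the features of node 2 and `h_0` is the average of the features of nodes 1 and 2.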

## GCN: Invariance and Equivariance
- What are the invariance and equivariance properties of a GCN?
    - Given a node, the GCN computation of its embedding is permutation invariant.
    - ![image.png](attachment:5d01b159-2060-460a-9a2c-e7e0a4cf15a8.png)
- Considering all nodes in a graph, GCN computation is permutation equivariant
    - ![image.png](attachment:cc1aaa07-2563-4f7f-b74f-ad2ca6c9ccd5.png)

- Detailed reasoning:
    1. The rows of input node features and output embeddings are aligned
    2. We know computing the embedding of a given node with GCN is invariant.
    3. So, after a permutation, the location of a given node in the input node feature matrix changes, while the output embedding of that node stays the same (the colors of node feature and embedding are matched). This is called permutation equivariance.

## Training the model
- How do we train the GCN to generate embeddings?
- ![image.png](attachment:037e11ce-6b33-42d5-bf5a-d0b08f39a6d1.png)
- Need to define a loss function on the embeddings.

## Model Parameters
- ![image.png](attachment:6b2ec5ff-cee7-43fe-b32f-383335082389.png)
- We can feed these embeddings into any loss function and run SGD to train the weight parameters
- (h^k)_v: the hidden representation of node v at layer k
- W_k: weight matrix for neighborhood aggregation
- B_k: weight matrix for transforming the hidden vector of the node itself

## Matrix Formulation (1)
- Many aggregations can be performed efficiently by (sparse) matrix operations
    - ![image.png](attachment:aaf74031-ef94-4000-8b68-3d29e398ad08.png)

## Matrix Formulation (2)
- Re-writing update function in matrix form:
    - ![image.png](attachment:d88ed953-569c-4302-95d2-fa2bf1b0f3df.png)
- In practice, this implies that efficient sparse matrix multiplication can be used
- Note: not all GNNs can be expressed in a simple matrix form, e.g., when the aggregation function is complex.
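
The sketch below checks that the matrix form agrees with the per-node update, assuming mean aggregation, ReLU, and NumPy. It writes the layer as ReLU(D⁻¹ A H Wᵀ + H Bᵀ) with one row of H per node; the shapes and random weights are hypothetical, and a dense `A` stands in for what would normally be a sparse matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))              # current embeddings, one row per node
W = rng.normal(size=(2, 3))              # neighbor-aggregation weights
B = rng.normal(size=(2, 3))              # self-transformation weights

deg = A.sum(axis=1)
A_norm = A / deg[:, None]                # D^{-1} A: row v averages over N(v)

# Matrix form: all nodes updated at once.
H_next = np.maximum(A_norm @ H @ W.T + H @ B.T, 0.0)

# Per-node form, for comparison.
H_loop = np.stack([
    np.maximum(W @ H[A[v] > 0].mean(axis=0) + B @ H[v], 0.0)
    for v in range(4)
])
matches = np.allclose(H_next, H_loop)
```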

## How to train a GNN
- Node embedding z_v is a function of input graph
- Supervised setting: We want to minimize loss L
    - ![image.png](attachment:373d7bb7-57a5-4ca5-82f6-1907438198e6.png)
    - y: node label
    - L could be L2 if y is a real number, or cross entropy if y is categorical
- Unsupervised settings:
    - No node label available
    - Use the graph structure as the supervision.

## Unsupervised Training
- One possible idea: 'Similar' nodes have similar embeddings
    - ![image.png](attachment:22e2273a-5ae8-4126-8d2f-8ee1430966c4.png)
- CE is the cross entropy loss
    - ![image.png](attachment:13aa68f1-e7dd-4d84-b554-e0593166ee8d.png)
- Node similarity can be anything, e.g., a loss based on
    - Random walks (node2vec, Deepwalk, struc2vec)
    - Matrix factorization

## Supervised Training
- Directly train the model for a supervised task (e.g., node classification)
- E.g: A drug-drug interaction network
- ![image.png](attachment:0cef92b6-88b7-4355-b857-252958993ac5.png)
- Use cross entropy loss
    - ![image.png](attachment:226ab9af-2409-46fa-8b06-d48bd6574cee.png)
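
For binary node labels, one common form of this loss applies a sigmoid to z_v · θ and sums the cross entropy over labeled nodes. The sketch below assumes that form and NumPy; the embeddings, labels, and the classification weights `theta` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y, Z, theta):
    """Sum over nodes of -[y log s + (1 - y) log(1 - s)], with s = sigmoid(z . theta)."""
    s = sigmoid(Z @ theta)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

# Hypothetical embeddings for 3 nodes (d = 2) with binary labels.
Z = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([1.0, 0.0])             # classification weights

loss = binary_cross_entropy(y, Z, theta)
loss_flipped = binary_cross_entropy(y, Z, -theta)   # weights pointing the wrong way
```

Flipping the sign of `theta` makes the predictions disagree with the labels, so `loss_flipped` is larger than `loss`.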

## Model Design: Overview
1. Define a neighborhood aggregation function
2. Define a loss function on the embeddings
    - ![image.png](attachment:6ff9cb17-f790-4593-bd54-6d6601e27100.png)
3. Train on a set of nodes, i.e., a batch of compute graphs
    - ![image.png](attachment:49a2cb19-2b48-4e28-84ea-09ae22c4e292.png)
4. Generate embeddings for nodes as needed
    - ![image.png](attachment:62165298-989b-433a-9650-5402369ee432.png)

## Inductive Capability
- The same aggregation parameters are shared for all nodes:
    - The number of model parameters is sublinear in |V| and we can generalize to unseen nodes
    - ![image.png](attachment:0bd71e5c-534d-4251-99a4-843ce174622b.png)
- New Graph
    - E.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B
    - Inductive node embedding -> generalize to entirely unseen graphs
    - ![image.png](attachment:abda1c4f-1d37-4224-8418-e3e946c838cf.png)
- New Node
    - Many application settings constantly encounter previously unseen nodes
    - Need to generate new embeddings 'on the fly'
    - ![image.png](attachment:ee615aa5-6f52-4331-a786-d53bfb00d2e1.png)
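
Because the weights are shared, the same layer can be applied to graphs of any size. The sketch below assumes a mean-aggregation layer with ReLU and NumPy; the weights are random stand-ins for trained parameters, and both graphs are hypothetical.

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """Mean-aggregation layer in matrix form; W and B are shared by all nodes."""
    A_norm = A / A.sum(axis=1, keepdims=True)
    return np.maximum(A_norm @ H @ W.T + H @ B.T, 0.0)

rng = np.random.default_rng(0)
W, B = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))   # size does not depend on |V|

# Graph seen during "training": a triangle on 3 nodes.
A_train = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
Z_train = gcn_layer(A_train, rng.normal(size=(3, 3)), W, B)

# Unseen graph with a different number of nodes: a path on 5 nodes.
A_new = np.diag(np.ones(4), k=1) + np.diag(np.ones(4), k=-1)
Z_new = gcn_layer(A_new, rng.normal(size=(5, 3)), W, B)
```

The sketch assumes every node has at least one neighbor; an isolated node would need a guard against division by zero in `A_norm`.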

# Architecture Comparison
- How do GNNs compare to prominent architectures such as CNNs?

## Convolutional Neural Network
- CNN layer with 3x3 filter
- ![image.png](attachment:5be266ef-1de3-47c7-a2ee-c6596b9ff8b0.png)

## GNN vs CNN 
- ![image.png](attachment:064c4e81-b852-42c6-981a-b40f6adf3d6a.png)
- A CNN can be seen as a special GNN with fixed neighbor size and ordering
    - The size of the filter is pre-defined for a CNN
    - The advantage of a GNN is that it processes arbitrary graphs with different degrees for each node.
- A CNN is not permutation invariant/equivariant
    - Switching the order of pixels leads to different outputs
- Key difference: We can learn different (W^u)_l for each 'neighbor' u of pixel v on the image, because we can pick an order for the 9 neighbors using their position relative to the central pixel: {(-1, -1), (-1, 0), (-1, 1), ..., (1, 1)}

## Transformer
- Transformers are among the most popular architectures and achieve great performance in many sequence modeling tasks
- ![image.png](attachment:8f0ad314-3040-4ddd-bb8b-04887a22867a.png)
- Key component: self-attention
    - Every token/word attends to all the other tokens/words via matrix calculation.
    - ![image.png](attachment:8d423cd3-86c9-4d70-b1a4-103cd9f515a2.png)
- A general definition of attention
    - Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, dependent on the query.
- Each token/word has a value vector and a query vector. The value vector can be seen as the representation of the token/word. We use the query vector to calculate the attention score (weights in the weighted sum)
    - ![image.png](attachment:54388c6b-23f8-401a-8e30-4b2d9732bee3.png)
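
The general definition can be sketched in a few lines. The notes only say that the weights depend on the query; scoring by dot product and normalizing with a softmax is an assumption made here, as are the toy vectors. NumPy is assumed.

```python
import numpy as np

def attention(query, values):
    """Weighted sum of the value vectors, with weights that depend on the query."""
    scores = values @ query                   # assumed scoring: dot product
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values, weights

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# A zero query scores every value equally, so the output is the plain average.
out_uniform, w_uniform = attention(np.array([0.0, 0.0]), values)

# A query aligned with the first coordinate favors values 0 and 2.
out_focused, w_focused = attention(np.array([10.0, 0.0]), values)
```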

## GNN vs Transformer
- A Transformer layer can be seen as a special GNN that runs on a fully-connected 'word' graph!
- Since each word attends to all the other words, the computation graph of a transformer layer is identical to that of a GNN on the fully-connected 'word' graph.
- ![image.png](attachment:80eebefc-c603-4310-9dc1-9ce015f2ce4c.png)

# Basics of Deep Learning

## Machine learning as optimization
- Supervised learning: we are given input x, and the goal is to predict label y
- Input x can be:
    - vectors of real numbers
    - Sequences (natural language)
    - Matrices (images)
    - Graphs (potentially with node and edge features)
- We formulate the task as an optimization problem:
    - ![image.png](attachment:07e0d022-2c9a-43c1-8935-721077147019.png)
- Theta: a set of parameters we optimize
    - could contain one or more scalars, vectors, matrices
    - E.g. Theta = {Z} in the shallow encoder (the embedding lookup)
- Loss function L: e.g., L2 loss
    - ![image.png](attachment:114f2993-b8e9-4e8b-96b5-bc3a770e10b1.png)

## Loss function example:
- One common loss for classification: cross entropy (CE)
- Label y is a categorical vector (one-hot encoding)
    - E.g., y = [0, 0, 1, 0, 0] means y belongs to class 3
- ![image.png](attachment:04096b97-e928-44e9-bd87-0cc539678516.png)
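
A worked example, assuming the standard definition CE(y, f(x)) = -Σ_i y_i log f(x)_i with f(x) produced by a softmax. NumPy is assumed and the logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, probs):
    """CE(y, p) = -sum_i y_i log p_i for a one-hot label y."""
    return -np.sum(y * np.log(probs))

y = np.array([0, 0, 1, 0, 0])            # class 3, one-hot encoded

uniform = softmax(np.zeros(5))           # model with no preference between classes
loss_uniform = cross_entropy(y, uniform) # equals log(5)

confident = softmax(np.array([0.0, 0.0, 5.0, 0.0, 0.0]))
loss_confident = cross_entropy(y, confident)
```

The loss shrinks as the predicted probability of the true class approaches 1, so `loss_confident` is smaller than `loss_uniform`.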

## Gradient Descent
- Gradient vector: Direction and rate of fastest increase
- The directional derivative of a multi-variable function (e.g., L) along a given vector represents the instantaneous rate of change of the function along that vector.
- Gradient is the directional derivative in the direction of largest increase.
- Iterative algorithm: Repeatedly update weights in the (opposite) direction of gradients until convergence
- Learning rate (alpha):
    - Hyperparameter that controls the size of gradient step
    - Can vary over the course of training (LR scheduling)
- Ideal termination condition: gradient = 0
    - In practice, we stop training when performance on the validation set (the part of the dataset we hold out from training) no longer improves.
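
The update rule can be sketched on a one-parameter toy problem. The loss L(θ) = (θ - 3)², the learning rate, and the gradient-norm stopping threshold are all hypothetical choices for illustration.

```python
def grad(theta: float) -> float:
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial weight
alpha = 0.1          # learning rate
steps_taken = 0

for step in range(1000):
    g = grad(theta)
    if abs(g) < 1e-8:            # approximate version of "gradient = 0"
        break
    theta -= alpha * g           # move against the gradient
    steps_taken += 1
```

Each step shrinks the distance to the minimizer θ = 3 by a factor of 1 - 2·alpha = 0.8, so the loop stops well before the iteration cap.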
 
## Stochastic Gradient Descent (SGD):
- Problem with gradient descent
    - Computing the exact gradient requires a pass over the entire dataset
        - This means summing gradient contributions over all the points in the dataset
        - Modern datasets often contain billions of data points
        - Extremely expensive for every gradient descent step
- Solution: SGD
    - At every step, pick a different minibatch B containing a subset of the dataset, and use it as input x
    - Common optimizers that improve on SGD:
        - Adam, Adagrad, Adadelta, RMSprop
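
A minimal minibatch SGD loop, assuming NumPy and a synthetic one-weight regression problem. The data, batch size, learning rate, and step count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))           # synthetic inputs
y = 2.0 * X[:, 0]                        # targets generated with true weight 2

w = 0.0                                  # single weight to learn
alpha, batch_size = 0.1, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample minibatch B
    xb, yb = X[idx, 0], y[idx]
    grad = np.mean(2.0 * (w * xb - yb) * xb)   # gradient of the mean squared error on B
    w -= alpha * grad
```

Each step touches only 32 of the 1000 points, yet `w` still approaches the true weight because the minibatch gradient is an unbiased estimate of the full gradient.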