**GAT**

GAT (Graph Attention Network) is a graph representation learning method that uses an **attention** mechanism.

Graph models usually put weights on edges to represent how important each neighbor is to the center node. However, fixed weights may not capture the actual relation between two linked nodes, so the authors proposed a graph representation learning algorithm that uses an attention mechanism to compute the importance of each neighbor. (Basically, this paper applies the idea of the original attention paper to graph data.)

First, they **project** the features of every node with a shared linear layer; here the projection goes to a lower dimension, which also reduces running time and memory.

Second, they use a **shared attentional mechanism** (a **linear** layer) to calculate **attention coefficients** for each pair of linked nodes.

Third, to make these coefficients easily **comparable** across different nodes, they **normalize** them with **softmax** over each node's neighbors. To add nonlinearity, they apply **LeakyReLU** to the coefficients before the softmax.

Fourth, steps 1 to 3 together form one attention head. In practice, **multi-head attention** helps capture different kinds of information between two nodes. The paper concatenates the heads in hidden layers and averages them in the final layer; this notebook **averages** the results of k heads to get the node representations.
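
In symbols, the four steps can be summarized roughly as follows. This is a sketch under assumptions: $N_i$ (the set of neighbors of node $i$), the head index $m$ and the superscripts $(m)$ are notation introduced here for illustration only, and the output nonlinearity is left out because its placement differs between the paper and the code in this notebook.

$$e_{ij} = att(Wh_i || Wh_j)$$

$$\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{l \in N_i} \exp(\mathrm{LeakyReLU}(e_{il}))}$$

$$h_i' = \frac{1}{k} \sum_{m=1}^{k} \sum_{j \in N_i} \alpha_{ij}^{(m)} W^{(m)} h_j$$

The first line is step 2 applied to the projected features from step 1, the second line is step 3, and the third line is the averaging over k heads from step 4.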

In [None]:
import torch

def format_pytorch_version(version):
  return version.split('+')[0]

TORCH_version = torch.__version__
TORCH = format_pytorch_version(TORCH_version)

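def format_cuda_version(version):
  # torch.version.cuda is None on CPU-only builds of PyTorch.
  if version is None:
    return 'cpu'
  # e.g. '11.1' -> 'cu111'
  return 'cu' + version.replace('.', '')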

CUDA_version = torch.version.cuda
CUDA = format_cuda_version(CUDA_version)

!pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-cluster     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-geometric 

Looking in links: https://pytorch-geometric.com/whl/torch-1.10.0+cu111.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-1.10.0%2Bcu113/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl (7.9 MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.9
Looking in links: https://pytorch-geometric.com/whl/torch-1.10.0+cu111.html
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-1.10.0%2Bcu113/torch_sparse-0.6.13-cp37-cp37m-linux_x86_64.whl (3.5 MB)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.13
Looking in links: https://pytorch-geometric.com/whl/torch-1.10.0+cu111.html
Collecting torch-cluster
  Downloading https://data.pyg.org/whl/torch-1.10.0%2Bcu113/torch_cluster-1.6.0-cp37-cp37m-linux_x86_64.whl (2.5 MB)

In [None]:
from torch_geometric.datasets import Planetoid
import os
import re
import random
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from collections import defaultdict
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

In [None]:
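dataset = Planetoid(root='/tmp/Cora', name='Cora')
nodenum = dataset.data.num_nodes
# Dense adjacency matrix with one row and one column per node.
adj = torch.zeros(nodenum, nodenum)
# edge_index has shape [2, num_edges]; transpose it to iterate over (source, target) pairs.
edges = dataset.data.edge_index.T
for edge in edges:
  # Cora's edge_index already stores both directions of every edge, so adding
  # the reverse direction here as well makes each linked pair count 2.
  adj[edge[0]][edge[1]] += 1
  adj[edge[1]][edge[0]] += 1
print(adj)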

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 2.,  ..., 0., 0., 0.],
        [0., 2., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 2.],
        [0., 0., 0.,  ..., 0., 2., 0.]])


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = dataset[0].to(device)
features = data.x
print(features.shape)

torch.Size([2708, 1433])


In [None]:
# Node indices 0..nodenum-1 as an int32 column vector of shape [nodenum, 1].
xs = torch.arange(nodenum, dtype=torch.int32).unsqueeze(1)

In [None]:
batchsz = 64
trainset = torch.utils.data.TensorDataset(xs[data.train_mask].to(device), data.y[data.train_mask])
train_loader = DataLoader(trainset, batch_size=batchsz, shuffle=True)

To implement this algorithm with mini-batches of nodes, we first take a batch of node indices. The features of the batch have dimension **[batchsz, featuresz]**.

Since a **shared projection** layer projects all nodes to a lower dimension, we pass the full feature tensor through the **lineartrans** layer and get a tensor with dimension **[nodenum, hiddendim]**; the rows that belong to the batch nodes form a **[batchsz, hiddendim]** slice of it.

The paper uses a **shared attentional mechanism** layer to get the attention coefficients with the formula $att(Wh_i||Wh_j)$, where att is a linear layer with a **[2 * hiddendim, 1]** weight. Its input is the **concatenation** of the projected representations of two linked nodes, so the coefficient $e_{ij}$ is a **weighted** sum of the elements of $Wh_i$ and $Wh_j$. This means att can be split into two linear layers, att1 and att2, each with a **[hiddendim, 1]** weight.

First, I pass the hidden features of all nodes (**[nodenum, hiddendim]**) through att1 and get a **[nodenum, 1]** vector, where the jth row is the weighted sum of the jth node's hidden features. To spread these scores over a matrix, I multiply an all-ones matrix of dimension **[batchsz, 1]** with the **transpose** of that vector. The result is a **[batchsz, nodenum]** matrix in which the jth element of every row is the att1 score of node j.

Then I pass the hidden features of the batch nodes through att2 to get a **[batchsz, 1]** vector (each row is again a weighted sum, now for a center node of the batch) and add it to the **[batchsz, nodenum]** matrix by broadcasting. The element in row i and column j of the final matrix is the att2 score of center node i plus the att1 score of node j, which equals $att(Wh_i||Wh_j)$.
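
The split of att into two smaller layers can be checked numerically. The snippet below is a dependency-free sketch in plain Python, not the notebook's implementation: `wh`, `a`, `a_center` and `a_neighbor` are hypothetical stand-ins for the projected features and for the two halves of the att weight (in the class used in this notebook, att2 plays the role of `a_center` and att1 the role of `a_neighbor`).

```python
import random

random.seed(0)
hidden_dim, num_nodes = 4, 3

# Hypothetical projected features Wh_i, one row per node.
wh = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_nodes)]
# Hypothetical weight vector of the attention layer, length 2 * hidden_dim.
a = [random.gauss(0, 1) for _ in range(2 * hidden_dim)]
a_center, a_neighbor = a[:hidden_dim], a[hidden_dim:]


def dot(u, v):
    """Plain dot product of two equally long lists."""
    return sum(x * y for x, y in zip(u, v))


# Direct form: score every pair (i, j) from the concatenated vector Wh_i || Wh_j.
e_concat = [[dot(a, wh[i] + wh[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

# Split form: one score per node from each half, combined like a broadcast sum.
s_center = [dot(a_center, row) for row in wh]
s_neighbor = [dot(a_neighbor, row) for row in wh]
e_split = [[s_center[i] + s_neighbor[j] for j in range(num_nodes)]
           for i in range(num_nodes)]

# Largest difference between the two forms over all pairs.
max_gap = max(abs(e_concat[i][j] - e_split[i][j])
              for i in range(num_nodes) for j in range(num_nodes))
print(max_gap < 1e-9)
```

The two forms agree up to floating-point rounding. The split form needs only one score per node from each half instead of one concatenated vector per pair, which is why it is cheaper for a dense score matrix.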

Since not every pair of nodes is linked, we multiply this matrix element-wise with the corresponding rows of the **adjacency matrix** to get the final coefficient matrix E. Note that this sets the scores of unlinked pairs to 0 rather than removing them from the softmax that follows.

Finally, we apply LeakyReLU and a row-wise softmax to E to get the **normalized** coefficient matrix.
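
One thing worth checking is what the softmax does with scores that were multiplied by 0. The sketch below compares, for a single toy row, multiplying the scores by the adjacency row with replacing the scores of unlinked nodes by negative infinity before the softmax. It is an illustration under assumptions, not the notebook's code: `scores` and `adj_row` are made-up values, the `softmax` helper is written here in plain Python, and LeakyReLU is left out because it maps 0 to 0.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


scores = [2.0, -1.0, 0.5, 3.0]  # hypothetical raw scores e_ij for one center node
adj_row = [1, 0, 1, 0]          # 1 marks a neighbor, 0 marks no edge

# Variant 1: multiply by the adjacency row, then apply softmax.
multiplied = softmax([s * a for s, a in zip(scores, adj_row)])

# Variant 2: replace the scores of unlinked nodes with -inf, then apply softmax.
masked = softmax([s if a else float("-inf") for s, a in zip(scores, adj_row)])

print([round(v, 3) for v in multiplied])
print([round(v, 3) for v in masked])
```

In the first variant the unlinked nodes keep a nonzero share of the attention, because a score of 0 still contributes exp(0) to the softmax. In the second variant their weight is exactly 0 and the neighbors' weights sum to 1.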

In [None]:
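class GAT(nn.Module):
  def __init__(self, features, adj, in_dim, hidden_dim, out_dim, k):
    super(GAT, self).__init__()
    self.features = features   # [nodenum, in_dim] features of all nodes
    self.adj = adj.to(device)  # [nodenum, nodenum] dense adjacency matrix
    self.k = k                 # number of attention heads
    self.hidden_dim = hidden_dim
    # One projection layer and one pair of attention layers per head.
    self.lineartrans = nn.ModuleList([nn.Linear(in_dim, hidden_dim, bias=False) for i in range(k)])
    # att1 scores the neighbor j, att2 scores the center node i.
    self.att1 = nn.ModuleList([nn.Linear(hidden_dim, 1, bias=False) for i in range(k)])
    self.att2 = nn.ModuleList([nn.Linear(hidden_dim, 1, bias=False) for i in range(k)])
    self.mlp = nn.Linear(hidden_dim, out_dim, bias=True)
    self.leakyrelu = nn.LeakyReLU()
    self.softmax = nn.Softmax(dim=1)
    # Pass dim explicitly; relying on the implicit dimension is deprecated.
    self.logsoftmax = nn.LogSoftmax(dim=1)

  def forward(self, batch):
    # batch holds node indices with shape [batchsz, 1]; turn it into a list of ints.
    batchlist = batch.flatten().tolist()
    result = torch.zeros(len(batchlist), self.hidden_dim).to(device)
    for i in range(self.k):
      hiddenfeatures = self.lineartrans[i](self.features)  # [nodenum, hidden_dim]
      att1 = self.att1[i](hiddenfeatures)                  # [nodenum, 1]
      att2 = self.att2[i](hiddenfeatures[batchlist])       # [batchsz, 1]
      # Copy the att1 scores into every row: [batchsz, 1] x [1, nodenum].
      tmp = torch.ones(len(batchlist), att1.shape[1]).to(device)
      tmp = torch.mm(tmp, att1.t())                        # [batchsz, nodenum]
      # Broadcast add: entry (i, j) = att2 score of batch node i + att1 score of node j.
      tmp += att2
      # Zero the scores of unlinked pairs; linked pairs are scaled by their adj entry.
      e = torch.mul(tmp, self.adj[batchlist])
      e = self.leakyrelu(e)
      a = self.softmax(e)                                  # each row sums to 1
      out = torch.mm(a, hiddenfeatures)                    # [batchsz, hidden_dim]
      out = self.leakyrelu(out)
      result += out
    result /= self.k                                       # average over the k heads
    result = self.mlp(result)
    return self.logsoftmax(result)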

In [None]:
lr = 0.1
epochs = 50
hidden_dim = 256

In [None]:
model = GAT(features, adj, dataset.num_node_features, hidden_dim, dataset.num_classes, 3).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.NLLLoss()

In [None]:
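model.train()
for epoch in range(epochs):
  acc = 0  # number of correct training predictions in this epoch
  for x, y in train_loader:
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    _, pred = out.max(dim=1)  # predicted class = index of the largest log-probability
    acc += float(pred.eq(y).sum().item())
  # The printed loss is the loss of the last mini-batch, not the epoch average.
  print("epoch: {0}, loss: {1}, train acc: {2}".format(epoch, loss.item(), acc / data.train_mask.sum().item()))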



epoch: 0, loss: 12.028645515441895, train acc: 0.08571428571428572
epoch: 1, loss: 2.7495124340057373, train acc: 0.17857142857142858
epoch: 2, loss: 4.763061046600342, train acc: 0.11428571428571428
epoch: 3, loss: 2.6664645671844482, train acc: 0.2571428571428571
epoch: 4, loss: 3.093618631362915, train acc: 0.40714285714285714
epoch: 5, loss: 1.3721275329589844, train acc: 0.65
epoch: 6, loss: 0.35978391766548157, train acc: 0.75
epoch: 7, loss: 0.15279605984687805, train acc: 0.85
epoch: 8, loss: 0.04930971562862396, train acc: 0.9642857142857143
epoch: 9, loss: 0.06350747495889664, train acc: 0.9571428571428572
epoch: 10, loss: 0.001685184775851667, train acc: 0.9642857142857143
epoch: 11, loss: 4.711748260888271e-05, train acc: 0.9857142857142858
epoch: 12, loss: 0.0016544405370950699, train acc: 0.9785714285714285
epoch: 13, loss: 0.000497175904456526, train acc: 0.9928571428571429
epoch: 14, loss: 0.0006171417771838605, train acc: 0.9928571428571429
epoch: 15, loss: 0.001018341

In [None]:
testset = torch.utils.data.TensorDataset(xs[data.test_mask].to(device), data.y[data.test_mask])
test_loader = DataLoader(testset, batch_size=batchsz, shuffle=True)

In [None]:
model.eval()
acc = 0
# Gradients are not needed for evaluation.
with torch.no_grad():
  for x, y in test_loader:
    out = model(x)
    _, pred = out.max(dim=1)
    acc += float(pred.eq(y).sum().item())
print("test acc: {0}".format(acc / data.test_mask.sum().item()))

test acc: 0.512


