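This notebook implements the GAT model in plain PyTorch, following the 2018 paper [GRAPH ATTENTION NETWORKS](https://arxiv.org/abs/1710.10903). The paper explains the algorithm in great detail, so the code can be written almost directly from it, which is rare. I was nearly in tears reading a paper this clear!  
The implementation reuses my earlier code almost wholesale, which is a joy: the overall framework is copied from my GCN implementation, and the attention comes from the masked self-attention of the transformer decoder. The only difference is that instead of masking positions that come later in the sequence, the mask now covers every node that is not a neighbor.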
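
To make the neighbor mask concrete, here is a minimal sketch on a made-up 4-node graph. The adjacency matrix and the random scores are assumptions for illustration only; the point is that entries set to `-inf` before the softmax end up with exactly zero attention weight.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-node graph: A[i, j] != 0 means j is a neighbor of i (self-loops included).
A = torch.tensor([[1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 1, 1, 1],
                  [1, 0, 1, 1]])

torch.manual_seed(0)
scores = torch.randn(4, 4)     # raw attention scores, one row per query node
mask = A.eq(0)                 # True where there is NO edge

# Non-neighbors are pushed to -inf so that softmax assigns them zero weight.
masked_scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Because every row keeps its self-loop, no row is fully masked, so the softmax never produces NaN here.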

In [39]:
import os
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F

#device = 'cpu'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


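The dataset is the [Cora dataset](https://linqs.soe.ucsc.edu/data) (the site also hosts other datasets for graph neural networks). It consists of 2708 machine-learning papers divided into 7 classes, and every paper cites or is cited by at least one other paper in the dataset.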

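It contains two files:  
1. `cora.content` describes the papers, one per line, in the format $$ \text{<paper_id> <word_attributes> <class_label>} $$ where:  
&emsp;`<paper_id>` is the paper's unique identifier.  
&emsp;`<word_attributes>` are the word features, each 0 or 1, indicating whether the corresponding vocabulary word occurs in the paper.  
&emsp;`<class_label>` is the class the paper belongs to.  
  
2. `cora.cites` contains the citation graph, one edge per line, in the format $$ \text{<ID of cited paper> <ID of citing paper>} $$ where:  
&emsp;`<ID of cited paper>` is the identifier of the paper being cited.  
&emsp;`<ID of citing paper>` is the identifier of the paper that cites it. 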
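
As a rough illustration of the two formats, the sketch below parses a tiny made-up pair of files held in memory. The paper IDs, the 4 word attributes (the real file has 1433) and the edges are all hypothetical; only the tab-separated layout follows the description above.

```python
import io
import numpy as np

# Hypothetical miniature files (tab-separated, 4 word attributes instead of 1433).
content_file = io.StringIO(
    "101\t0\t1\t0\t1\tNeural_Networks\n"
    "102\t1\t0\t0\t0\tTheory\n"
    "103\t0\t0\t1\t1\tNeural_Networks\n"
)
cites_file = io.StringIO(
    "101\t102\n"   # paper 102 cites paper 101
    "101\t103\n"   # paper 103 cites paper 101
)

# <paper_id> <word_attributes...> <class_label>
rows = [line.strip().split("\t") for line in content_file]
paper_ids = [r[0] for r in rows]
features = np.array([[float(x) for x in r[1:-1]] for r in rows], dtype=np.float32)
class_labels = [r[-1] for r in rows]

# Map raw paper IDs to consecutive row indices, then translate the edges.
paper_index = {pid: i for i, pid in enumerate(paper_ids)}
edges = np.array(
    [[paper_index[cited], paper_index[citing]]
     for cited, citing in (line.strip().split("\t") for line in cites_file)],
    dtype=np.int64,
)

print(features.shape)
print(edges)
```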

In [51]:
'''
node_num, feat_dim, stat_dim, num_class, T
feat_Matrix, X_Node, X_Neis, dg_list
'''
# data preprocessing
content_path = "./cora/cora.content"
cite_path = "./cora/cora.cites"

# read the raw text files
with open(content_path, "r") as fp:
    contents = fp.readlines()
with open(cite_path, "r") as fp:
    cites = fp.readlines()
    
contents = np.array([np.array(s.strip().split("\t")) for s in contents])
paper_list, feat_list, label_list = np.split(contents, [1,-1], axis=1)
paper_list, label_list = np.squeeze(paper_list), np.squeeze(label_list)

# paper -> index dict
#print("paper_list",sorted(paper_list))
paper_dict = dict([(key, val) for val, key in enumerate(paper_list)])

# label -> index dict (sorted so that the label indices are the same on every run)
labels = sorted(set(label_list))
label_dict = dict([(key, val) for val, key in enumerate(labels)])

# edge_index
cites = [i.strip().split("\t") for i in cites]
#print(cites)
# the next few lines differ from the earlier code
#print(paper_dict)
cites = np.array([[paper_dict[i[0]], paper_dict[i[1]]] for i in cites], dtype = np.int64)
#print(cites[1:7])
cites = np.concatenate((cites, cites[:, ::-1]), axis=0) 
#print(cites[1:7])
# this part differs as well
#print(cites[:,0])
degree_list=np.zeros(len(paper_list), dtype = np.int32)
for i in cites:
    degree_list[i[0]] += 1
#_, degree_list = np.unique(cites[:,0],return_counts=True)
#print(degree_list)

#input
node_num = len(paper_list)
feat_dim = feat_list.shape[1]
new_feat = 512
num_class = len(labels)
T = 2
feat_Matrix = torch.Tensor(feat_list.astype(np.float32))
X_Node, X_Neis = np.split(cites, 2, axis=1)
X_Node, X_Neis = torch.tensor(np.squeeze(X_Node)), \
                 torch.tensor(np.squeeze(X_Neis))
#print(X_Node)
#print(degree_list[163])
dg_list = torch.tensor(degree_list[X_Node])
label_list = np.array([label_dict[i] for i in label_list])
label_list = torch.tensor(label_list, dtype = torch.long, device = device)
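
The cell above counts degrees with an explicit loop and leaves an `np.unique` variant commented out. The sketch below, on a made-up edge list, shows one situation where the two could disagree: `np.unique` only reports nodes that actually occur, so a node that never appears as a source would shift the counts. `np.bincount` with `minlength` is a vectorized alternative that keeps a slot for every node. Whether this is the reason the original line was replaced is an assumption.

```python
import numpy as np

# Hypothetical directed edge list for 4 nodes; node 3 never appears as a source.
edges = np.array([[0, 2], [0, 1], [1, 2], [2, 0]], dtype=np.int64)
num_nodes = 4

# Explicit loop, same idea as in the preprocessing cell.
degree_loop = np.zeros(num_nodes, dtype=np.int32)
for src, _ in edges:
    degree_loop[src] += 1

# Vectorized equivalent; minlength keeps an entry for nodes that never occur.
degree_bincount = np.bincount(edges[:, 0], minlength=num_nodes)

# np.unique only returns counts for the nodes it sees, so node 3 is missing.
_, degree_unique = np.unique(edges[:, 0], return_counts=True)

print(degree_loop, degree_bincount, degree_unique)
```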

In [52]:
print("Number of node : ", node_num)
print("Number of edges : ", cites.shape[0])
print("Number of classes : ", num_class)
print("Dimension of node features : ", feat_dim)
print("Dimension of node state : ", new_feat)
print("Shape of feat_Matrix : ", feat_Matrix.shape)
print("Shape of X_Node : ", X_Node.shape)
print("Shape of X_Neis : ", X_Neis.shape)
print("Length of dg_list : ", len(dg_list))

Number of node :  2708
Number of edges :  10858
Number of classes :  7
Dimension of node features :  1433
Dimension of node state :  512
Shape of feat_Matrix :  torch.Size([2708, 1433])
Shape of X_Node :  torch.Size([10858])
Shape of X_Neis :  torch.Size([10858])
Length of dg_list :  10858


In [42]:
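'''
Init:
input_size: number of input features
new_size: number of output features
head: number of attention heads
Input:
inputq,k,v: N * input_size
mask: N * N, mask(i,j) is True when there is no edge from j to i
Output: 
N * new_size

Notes: 1. In the paper, the heads of the final layer are averaged rather than concatenated. That is not done here, although it would be easy to add.
    2. The paper's attention also differs from the transformer's attention. I don't expect the difference to matter much (it is worth thinking about what each one means),
    but since neural networks are hard to interpret I won't go into it here. For convenience the code below simply uses the transformer version.
'''
class MultiHeadAttention(nn.Module):# multi-head attention (2 heads by default)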
    def __init__(self, input_size, new_size, head = 2):
        super(MultiHeadAttention, self).__init__()
        self.hidden_size = new_size//head
        assert head * self.hidden_size == new_size
        self.new_size = new_size
        self.head = head
        self.input_size = input_size
        
        self.qlinear = nn.Linear(input_size,new_size)
        self.vlinear = nn.Linear(input_size,new_size)
        self.klinear = nn.Linear(input_size,new_size)
        self.ylinear = nn.Linear(new_size,new_size)
        
    def scaledDotProductAttention(self, matq, matk, matv, mask):
        scale = matk.size(-1)**0.5
        matt = matq.matmul(matk.transpose(1,2))/scale  #(head, N,hidden_size) * (head, hidden_size,N) -> (head, N,N)
        if mask is not None:
            matt = matt.masked_fill(mask, float('-inf'))  # non-neighbors get zero weight after softmax
        matt = F.softmax(matt, dim=-1)
        mattv = matt.matmul(matv)  #(head, N, N) * (head, N, hidden_size) -> (head, N, hidden_size)
        return torch.cat(torch.chunk(mattv.view(-1,self.hidden_size),self.head,0), dim = 1) #(head, N, hidden_size) ->(N, new_size)
    
    def toMulti(self, matx):# split the features evenly across heads: (N, new_size) -> (head, N, hidden_size)
        return torch.cat(torch.chunk(matx, self.head, -1),dim=0).view(self.head, -1, self.hidden_size)
    
    def forward(self, inputq, inputk, inputv, mask = None):
        matq = self.toMulti(self.qlinear(inputq))
        matv = self.toMulti(self.vlinear(inputv))
        matk = self.toMulti(self.klinear(inputk))
        mattv = self.scaledDotProductAttention(matq, matk, matv, mask)
        y = self.ylinear(mattv)
        return y
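
For comparison with the transformer-style attention used above, here is a minimal single-head sketch of the additive scoring described in the GAT paper, e_ij = LeakyReLU(a^T [W h_i || W h_j]), followed by averaging two heads as the paper does in its final layer. The class name `AdditiveGraphAttention` and the toy sizes are hypothetical and not part of this notebook; treat it as an illustration under those assumptions rather than a drop-in replacement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveGraphAttention(nn.Module):  # hypothetical name, not used elsewhere in this notebook
    def __init__(self, input_size, new_size):
        super().__init__()
        self.W = nn.Linear(input_size, new_size, bias=False)
        # a^T [Wh_i || Wh_j] splits into a_src^T Wh_i + a_dst^T Wh_j
        self.a_src = nn.Linear(new_size, 1, bias=False)
        self.a_dst = nn.Linear(new_size, 1, bias=False)

    def forward(self, H, mask=None):
        Wh = self.W(H)                                        # (N, new_size)
        e = self.a_src(Wh) + self.a_dst(Wh).transpose(0, 1)   # (N, 1) + (1, N) -> (N, N)
        e = F.leaky_relu(e, negative_slope=0.2)
        if mask is not None:
            e = e.masked_fill(mask, float("-inf"))            # mask[i, j] True: no edge from j to i
        alpha = F.softmax(e, dim=-1)                          # attention over the neighbors of i
        return alpha.matmul(Wh)                               # (N, new_size)

torch.manual_seed(0)
H = torch.randn(6, 5)
mask = torch.tensor([[False, False, False, False, False, True]] * 6)

# Final-layer style aggregation: average the heads instead of concatenating them.
heads = [AdditiveGraphAttention(5, 3) for _ in range(2)]
out = torch.stack([head(H, mask) for head in heads]).mean(dim=0)
print(out.shape)
```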

In [43]:
# sanity check: forward and backward pass without a mask
model1 = MultiHeadAttention(5, 2).to(device)
input1 = torch.randn(6,5).to(device)
y1 = model1(input1,input1,input1)
print(y1)
torch.sum(y1).backward()

tensor([[0.5229, 0.5034],
        [0.4926, 0.4687],
        [0.3847, 0.3493],
        [0.3212, 0.3451],
        [0.3911, 0.3379],
        [0.3800, 0.2676]], device='cuda:0', grad_fn=<AddmmBackward>)


In [45]:
# sanity check: forward and backward pass with a mask
model1 = MultiHeadAttention(5, 2).to(device)
input1 = torch.randn(6,5).to(device)
y1 = model1(input1,input1,input1,torch.tensor([[False,False,False,False,False,True]]*6, device=device))
print(y1)
torch.sum(y1).backward()

tensor([[0.5505, 0.6872],
        [0.5592, 0.6937],
        [0.5571, 0.5888],
        [0.5451, 0.5281],
        [0.5635, 0.5805],
        [0.5562, 0.6307]], device='cuda:0', grad_fn=<AddmmBackward>)


In [46]:
'''
Init:
    feat_dim : number of input features
    new_feat : number of output features of each node
Input:
    H : (N, feat_dim), features of the N nodes
    A : (N, N), adjacency matrix, used to build the mask
    
'''
class GATlayer(nn.Module):
    def __init__(self, feat_dim, new_feat):
        super(GATlayer, self).__init__()
        self.feat_dim = feat_dim
        self.new_feat = new_feat
        
        self.attention = MultiHeadAttention(feat_dim, new_feat)
    
    def forward(self, H, A):
        return F.relu(self.attention(H,H,H,A.eq(0)))

In [62]:
class GAT(nn.Module):
    def __init__(self, feat_dim, new_feat, num_class):
        super(GAT,self).__init__()
        
        self.gat_layer1 = GATlayer(feat_dim, new_feat)
        self.gat_layer2 = GATlayer(new_feat, new_feat)
        self.out_layer = nn.Linear(new_feat, num_class)
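        # Big pitfall: for layers with nothing to learn I usually call the F. function instead of the module class.
        # The accuracy was wrong for a long time until I realised that nn.Dropout is switched off automatically
        # in eval mode, whereas F.dropout is not (unless training=self.training is passed).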
        self.dropout = nn.Dropout(p=0.3)
    
    def forward(self, X, A):
        H = self.gat_layer1(X, A)
        H = self.gat_layer2(H, A)
        output = F.log_softmax(self.dropout(self.out_layer(H)),dim = -1)
        return output
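
The dropout pitfall mentioned in the comments can be reproduced in isolation. The sketch below uses a hypothetical `DropoutDemo` module to compare `nn.Dropout` with `F.dropout` after calling `.eval()`: the module version and the function with `training=self.training` return the input unchanged, while the function with its default `training=True` still drops values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutDemo(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x):
        as_module = self.drop(x)                                           # follows model.train()/eval()
        as_function_default = F.dropout(x, p=0.5)                          # training=True by default
        as_function_flagged = F.dropout(x, p=0.5, training=self.training)  # follows the module's mode
        return as_module, as_function_default, as_function_flagged

demo = DropoutDemo().eval()
x = torch.ones(1000)
m, f_default, f_flagged = demo(x)

# In eval mode only the unflagged functional call still changes the input.
print(torch.equal(m, x), torch.equal(f_default, x), torch.equal(f_flagged, x))
```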

In [63]:
def edgeToMat(edges, node_num):
    A = torch.eye(node_num).to(device)
    for i in edges:
        A[i[0]][i[1]] += 1
        A[i[1]][i[0]] += 1
    return A

In [64]:
a1= [[0,2],[0,3],[1,2],[2,3]]
print(edgeToMat(a1, 4))

tensor([[1., 0., 1., 1.],
        [0., 1., 1., 0.],
        [1., 1., 1., 1.],
        [1., 0., 1., 1.]], device='cuda:0')
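
The Python loop in `edgeToMat` touches the tensor once per edge, which can be slow for the roughly ten thousand Cora edges. A possible vectorized variant is sketched below; the helper name `edge_to_mat_vectorized` is hypothetical. Note one difference: it writes 1 for every edge instead of accumulating counts, which is enough here because the model only tests `A.eq(0)`.

```python
import torch

def edge_to_mat_vectorized(edges, node_num):  # hypothetical helper name
    # Accepts a list of pairs, a NumPy array, or a tensor of shape (E, 2).
    edges = torch.as_tensor(edges, dtype=torch.long)
    A = torch.eye(node_num)                 # self-loops on the diagonal
    A[edges[:, 0], edges[:, 1]] = 1         # edge i -> j
    A[edges[:, 1], edges[:, 0]] = 1         # symmetric counterpart j -> i
    return A

A_small = edge_to_mat_vectorized([[0, 2], [0, 3], [1, 2], [2, 3]], 4)
print(A_small)
```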


In [68]:
def train(model,optimizer,feat_Matrix,A,train_mask,test_mask,learning_rate = 0.01, weight_decay = 1e-3):
    # learning_rate and weight_decay are not used here: the optimizer passed in is already configured
    for epoch in range(100):
        model.train()
        optimizer.zero_grad()

        out = model(feat_Matrix, A)

        loss = F.nll_loss(out[train_mask], label_list[train_mask])
        _, pred = out.max(dim=1)

        correct = float(pred[train_mask].eq(label_list[train_mask]).sum().item())
        acc = correct / train_mask.sum().item()
        if epoch % 10 == 0:
            print('[Epoch {}/100] Loss {:.4f}, train acc {:.4f}'.format(epoch, loss.cpu().detach().data.item(), acc))

        loss.backward()
        optimizer.step()

        if (epoch + 1) % 20 == 0:
            model.eval()
            _, pred = model(feat_Matrix, A).max(dim = 1)
            correct = float(pred[test_mask].eq(label_list[test_mask]).sum().item())
            acc = correct / test_mask.sum().item()
            print('Accuracy: {:.4f}'.format(acc))
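
The evaluation step inside `train` still builds an autograd graph. One way to factor it out is sketched below with a hypothetical `evaluate` helper that runs under `torch.no_grad()` and restores the previous train/eval mode afterwards. `TinyModel` is only a stand-in so that the snippet runs on its own; in the notebook the GAT model would be passed instead.

```python
import torch

@torch.no_grad()
def evaluate(model, feat_Matrix, A, labels, node_mask):  # hypothetical helper
    was_training = model.training
    model.eval()                                   # disables nn.Dropout
    pred = model(feat_Matrix, A).argmax(dim=1)
    acc = pred[node_mask].eq(labels[node_mask]).float().mean().item()
    model.train(was_training)                      # restore the previous mode
    return acc

class TinyModel(torch.nn.Module):  # stand-in for GAT so the sketch is self-contained
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(3, 2)

    def forward(self, X, A):
        return self.lin(A.matmul(X))

torch.manual_seed(0)
X = torch.randn(4, 3)
A_toy = torch.eye(4)
y = torch.tensor([0, 1, 0, 1])
node_mask = torch.tensor([True, True, False, True])

toy_model = TinyModel()
acc = evaluate(toy_model, X, A_toy, y, node_mask)
print(acc)
```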

In [66]:
#split dataset
train_mask = torch.zeros(node_num, dtype = torch.bool)
train_mask[:node_num - 1000] = 1               # first 1708 nodes for training

val_mask = None                                
test_mask = torch.zeros(node_num, dtype = torch.bool)
test_mask[node_num - 500:] = 1                 # last 500 nodes for testing

model = GAT(feat_dim, new_feat, num_class).to(device)

# Adam is an adaptive gradient-based optimization algorithm
optimizer = torch.optim.Adam(model.parameters(), lr = 0.01, weight_decay = 1e-3)
feat_Matrix = feat_Matrix.to(device)
A = edgeToMat(cites, node_num).to(device)
train(model,optimizer,feat_Matrix,A,train_mask,test_mask)

[Epoch 0/200] Loss 1.9400, train acc 0.2324
[Epoch 10/200] Loss 1.6778, train acc 0.2670
Accuracy: 0.2980
[Epoch 20/200] Loss 1.5534, train acc 0.3402
[Epoch 30/200] Loss 1.2507, train acc 0.4578
Accuracy: 0.6120
[Epoch 40/200] Loss 1.0585, train acc 0.5345
[Epoch 50/200] Loss 0.6797, train acc 0.7207
Accuracy: 0.7560
[Epoch 60/200] Loss 0.4866, train acc 0.7687
[Epoch 70/200] Loss 0.4091, train acc 0.7963
Accuracy: 0.7540
[Epoch 80/200] Loss 0.3747, train acc 0.8044
[Epoch 90/200] Loss 3.2781, train acc 0.5287
Accuracy: 0.6420
[Epoch 100/200] Loss 1.1929, train acc 0.6142
[Epoch 110/200] Loss 0.5371, train acc 0.7681
Accuracy: 0.7880
[Epoch 120/200] Loss 0.4266, train acc 0.7851
[Epoch 130/200] Loss 0.3717, train acc 0.8056
Accuracy: 0.7680
[Epoch 140/200] Loss 0.3470, train acc 0.8179
[Epoch 150/200] Loss 0.3202, train acc 0.8220
Accuracy: 0.7560
[Epoch 160/200] Loss 0.3383, train acc 0.8132
[Epoch 170/200] Loss 0.3079, train acc 0.8314
Accuracy: 0.7460
[Epoch 180/200] Loss 0.2944, t

In [77]:
train(model,optimizer,feat_Matrix,A,train_mask,test_mask,learning_rate = 0.01, weight_decay = 1e-3)

[Epoch 0/100] Loss 1.5971, train acc 0.3454
[Epoch 10/100] Loss 1.3206, train acc 0.4321
Accuracy: 0.5080
[Epoch 20/100] Loss 1.2189, train acc 0.4801
[Epoch 30/100] Loss 1.1940, train acc 0.4830
Accuracy: 0.4680
[Epoch 40/100] Loss 1.1949, train acc 0.4748
[Epoch 50/100] Loss 1.1979, train acc 0.4813
Accuracy: 0.4840
[Epoch 60/100] Loss 1.1786, train acc 0.4994
[Epoch 70/100] Loss 1.1455, train acc 0.5117
Accuracy: 0.4720
[Epoch 80/100] Loss 1.2571, train acc 0.4719
[Epoch 90/100] Loss 1.2137, train acc 0.4684
Accuracy: 0.5300
