## Graph Convolutional Network (GCN)

### Reference
1. [Understanding graph convolutional networks (GCN), with a TensorFlow 2.0 implementation](https://github.com/zxxwin/tf2_gcn)

2.6.0


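## 1. The Cora dataset
The Cora dataset consists of machine learning papers and has become a popular benchmark for graph deep learning in recent years.
It contains 2,708 papers, and every sample belongs to one of 7 classes.
The classes are:
+ 1) Case-based;
+ 2) Genetic algorithms;
+ 3) Neural networks;
+ 4) Probabilistic methods;
+ 5) Reinforcement learning;
+ 6) Rule learning;
+ 7) Theory.

Each paper is represented by a 1,433-dimensional word vector, so each sample has 1,433 features. Every element of the vector corresponds to one word and takes the value 0 or 1: 0 means the word does not appear in the paper, and 1 means it does.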


Download link: https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz

Reference: [Reading and processing the Cora dataset](https://blog.csdn.net/weixin_41650348/article/details/109406230)


In [16]:
!cd ../../data;wget https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz;tar -zxvf cora.tgz

--2022-07-08 19:01:39--  https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
Resolving linqs-data.soe.ucsc.edu (linqs-data.soe.ucsc.edu)... 128.114.47.74
Connecting to linqs-data.soe.ucsc.edu (linqs-data.soe.ucsc.edu)|128.114.47.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 168052 (164K) [application/x-gzip]
Saving to: ‘cora.tgz’


2022-07-08 19:01:41 (141 KB/s) - ‘cora.tgz’ saved [168052/168052]

cora/
cora/README
cora/cora.cites
cora/cora.content


/workspace/user_code/davidwwang/workspace/tensorflow/gnn


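###### Inspecting the dataset
The downloaded archive contains three files: cora.cites, cora.content, and README.

+ README describes the dataset.
+ cora.content holds the per-paper information.
cora.content has 2,708 lines, one per sample (paper). Each line has three parts: the paper ID (e.g. 31336), the paper's word vector (1,433 binary values), and the paper's class (e.g. Neural_Networks).
+ cora.cites holds the citation records between papers.
cora.cites has 5,429 lines. Each line contains two paper IDs: the paper with the first ID is the cited (earlier) paper, and the paper with the second ID cites it. Both files are loaded as follows: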

In [3]:
import numpy as np
import pandas as pd

# Read the .content file (tab-separated, no header row)
cora_content = pd.read_csv('../../data/cora/cora.content', sep='\t', header=None)
# Inspect the raw layout
print(cora_content.shape)
print(cora_content.head(3))

# Read the .cites file
cora_cites = pd.read_csv('../../data/cora/cora.cites', sep='\t', header=None)
print(cora_cites.shape)
print(cora_cites.head(3))

(2708, 1435)
      0     1     2     3     4     5     6     7     8     9     ...  1425  \
0    31336     0     0     0     0     0     0     0     0     0  ...     0   
1  1061127     0     0     0     0     0     0     0     0     0  ...     0   
2  1106406     0     0     0     0     0     0     0     0     0  ...     0   

   1426  1427  1428  1429  1430  1431  1432  1433                    1434  
0     0     1     0     0     0     0     0     0         Neural_Networks  
1     1     0     0     0     0     0     0     0           Rule_Learning  
2     0     0     0     0     0     0     0     0  Reinforcement_Learning  

[3 rows x 1435 columns]
(5429, 2)
    0       1
0  35    1033
1  35  103482
2  35  103515


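Build a mapping from paper_id to an integer index in [0, 2707]

In [12]:
content_idx = list(cora_content.index)  # row indices as a list
paper_id = list(cora_content.iloc[:, 0])  # first column of .content: the paper IDs
mp = dict(zip(paper_id, content_idx))  # dictionary of the form {paper_id: row_index}
# Look up the row index of a given paper ID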
mp[31336]

0
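
As a small illustration of why this mapping is needed, the sketch below builds the same kind of dictionary for a handful of paper IDs and uses it to turn citation pairs into index pairs. The citation pairs and the fourth ID are made up for the example; `id_to_idx` plays the role of `mp`.

```python
# Toy paper IDs in file order (the pairs below are made up for illustration)
paper_ids = [31336, 1061127, 1106406, 13195]
id_to_idx = {pid: idx for idx, pid in enumerate(paper_ids)}

# Toy (cited, citing) pairs expressed with raw paper IDs
citations = [(31336, 1061127), (13195, 31336)]

# Translate every raw ID into its row index
edges = [(id_to_idx[a], id_to_idx[b]) for a, b in citations]
print(edges)  # [(0, 1), (3, 0)]
```

Once the edges are expressed as row indices, they can address rows and columns of both the feature matrix and the adjacency matrix.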

Extract the feature matrix (2708, 1433)

In [14]:
# Slice columns 1 through the second-to-last one (the end index -1 is exclusive)
feature = cora_content.iloc[:,1:-1]
print(feature.shape)
print(feature.head(3))


(2708, 1433)
   1     2     3     4     5     6     7     8     9     10    ...  1424  \
0     0     0     0     0     0     0     0     0     0     0  ...     0   
1     0     0     0     0     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   

   1425  1426  1427  1428  1429  1430  1431  1432  1433  
0     0     0     1     0     0     0     0     0     0  
1     0     1     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  

[3 rows x 1433 columns]


One-hot encode the labels

In [16]:
label = cora_content.iloc[:, -1]
label = pd.get_dummies(label) # one-hot encoding
label.head(3)

   Case_Based  Genetic_Algorithms  Neural_Networks  Probabilistic_Methods  Reinforcement_Learning  Rule_Learning  Theory
0           0                   0                1                      0                       0              0       0
1           0                   0                0                      0                       0              1       0
2           0                   0                0                      0                       1              0       0
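
The training code in section 2 works with integer class indices and converts between the two representations with `np.eye(K)[labels]` and `argmax`. A minimal sketch of that round trip, using toy label values that correspond to the three rows shown above under alphabetical column order:

```python
import numpy as np

labels = np.array([2, 5, 4])            # integer class indices (toy values)
num_classes = 7

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array one-hot encodes all labels at once.
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)                    # (3, 7)

# argmax along the class axis recovers the integer labels
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())               # [2, 5, 4]
```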


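Build the adjacency matrix

In [19]:
mat_size = cora_content.shape[0]  # 2708 papers, so the adjacency matrix is 2708 x 2708
adj_mat = np.zeros((mat_size, mat_size))  # start from an all-zero matrix
for i, j in zip(cora_cites[0], cora_cites[1]):  # iterate over the (cited, citing) ID pairs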
    x = mp[i]
    y = mp[j]
    adj_mat[x][y]=adj_mat[y][x]=1

print(sum(adj_mat).shape)
print(sum(sum(adj_mat)))

(2708,)
10556.0
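
The loop above can also be written with NumPy fancy indexing. The sketch below does this on a made-up 4-node edge list and shows how the total relates to the number of edges: every undirected edge contributes two entries, and a repeated pair is only counted once. That is presumably why the total above (10556) is smaller than 2 × 5429.

```python
import numpy as np

n = 4
# Toy undirected edges as (row, col) index pairs; (0, 1) appears twice on purpose
rows = np.array([0, 0, 1, 0])
cols = np.array([1, 2, 3, 1])

adj = np.zeros((n, n))
adj[rows, cols] = 1   # set A[i, j]
adj[cols, rows] = 1   # mirror to A[j, i] so the matrix is symmetric

total = adj.sum()               # 3 unique edges, each counted twice -> 6.0
degrees = adj.sum(axis=1)       # per-node degree
print(total, degrees.tolist())  # 6.0 [2.0, 2.0, 1.0, 1.0]
```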


To continue with NumPy or another format (everything so far has been a pandas DataFrame), convert the data:

In [20]:
# Convert to NumPy arrays
feature = np.array(feature)
label = np.array(label)
adj_mat = np.array(adj_mat)

## 2. GCN

In [1]:
import numpy as np
import os
import warnings
import scipy.sparse as sp
from time import time
from sklearn.metrics import accuracy_score
import tensorflow as tf
from collections import defaultdict
import pickle
import networkx as nx

In [13]:
# Deep learning code often keeps global settings in an EasyDict; the hyperparameters are recorded here
from easydict import EasyDict
config = {
    'dataset':'cora',
    'hidden1':16,
    'epochs':2,
    'early_stopping':20,
    'weight_decay':5e-4,
    'learning_rate': 0.01,
    'dropout':0.,
    'verbose':False,
    'logging':False,
    'gpu_id':None
}
FLAGS = EasyDict(config)


Load the data

In [34]:
from matplotlib import pyplot as plt
def load_data_planetoid(dataset):
    keys ={'x', 'y', 'tx','ty','allx', 'ally', 'graph'} # file name suffixes
    # print(type(keys))
    objects = defaultdict() # no default factory is given, so this behaves like a plain dict
    # print(objects)
    for key in keys:
        with open('data_split/ind.{}.{}'.format(dataset, key), 'rb') as f:
            objects[key]=pickle.load(f, encoding='latin1')
            # print(key, "  ", type(objects[key]))
    
    check_x = objects['x'].toarray()
    # print("allx ", objects['allx'].toarray().shape)
    # print("tx ", objects['tx'].toarray().shape)
    # print("x ", check_x.shape, '\n', check_x[0:3,:])
    # print("ally ", objects['ally'].shape)
    # print("ty ", objects['ty'].shape)
    # print("y ", objects['y'].shape, '\n', objects['y'][0:3,:])
    # print("graph ",len(objects['graph']), objects['graph'][0])
    
    test_index = [int(x) for x in open('data_split/ind.{}.test.index'.format(dataset))]
    # print('test_index', type(test_index), len(test_index), test_index[:3])
    test_index_sort = np.sort(test_index)
    # print('test_index_sort', test_index_sort[0:5])
    G = nx.from_dict_of_lists(objects['graph'])
    
    A_mat = nx.adjacency_matrix(G)
    # print('A_mat', type(A_mat), A_mat.toarray().shape, '\n', A_mat.toarray()[0:3,:])
    X_mat = sp.vstack((objects['allx'], objects['tx'])).tolil()
    # print('X_mat', type(X_mat), X_mat.toarray().shape, '\n', X_mat.toarray()[0:3,:])
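    # Put the test rows back into node order so that they line up with the adjacency matrix.
    # The test rows are stored in shuffled order; without this step, features would be attached to the wrong nodes.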
    X_mat[test_index, :] = X_mat[test_index_sort,:]
    z_vec = np.vstack((objects['ally'], objects['ty']))
    # print(type(z_vec), z_vec.shape, '\n', z_vec[0:3,:])

    z_vec[test_index, :] = z_vec[test_index_sort, :]
    z_vec = z_vec.argmax(1)
    # print('z_vec', type(z_vec),  '\n', z_vec.shape, z_vec[0:3])

    
    train_idx = range(len(objects['y']))
    val_idx = range(len(objects['y']), len(objects['y']) + 500)
    test_idx = test_index_sort.tolist()
    # print('train_idx ', len(train_idx), train_idx)
    # print('val_idx ', len(val_idx), val_idx)
    # print('test_idx ', len(test_idx))

    return A_mat, X_mat, z_vec, train_idx, val_idx, test_idx

cora_data = load_data_planetoid(FLAGS.dataset)
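
The reordering step `X_mat[test_index, :] = X_mat[test_index_sort, :]` is easy to misread, so here is a toy sketch of what it does. It assumes, as in the loader above, that the test rows were stacked at the end in the shuffled order listed in the index file; all values are made up.

```python
import numpy as np

# Rows 0-2: nodes already in place. Rows 3-5: test rows stacked in the
# shuffled order given by the index file.
test_index = [5, 3, 4]                  # node ID of each stacked test row (toy values)
test_index_sort = np.sort(test_index)   # [3, 4, 5]: where those rows currently sit

# For the test rows, the stored value is 10 * node ID, so misplacement is visible
X = np.array([[0], [1], [2], [50], [30], [40]])

# The right-hand side is read first (rows 3, 4, 5 in stacked order),
# then the k-th stacked row is written to row test_index[k].
X[test_index, :] = X[test_index_sort, :]
print(X.ravel().tolist())               # [0, 1, 2, 30, 40, 50]
```

After the assignment, row `i` holds the data of node `i`, which is the order the adjacency matrix uses.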

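Sparse matrix utilities

In [35]:
# Dropout for a sparse tensor
def sparse_dropout(x, dropout_rate, noise_shape):
    # print('dropout', x.shape, 'rate', dropout_rate, 'noise', (noise_shape))
    
    # floor(keep_prob + U[0, 1)) is 1 with probability keep_prob and 0 otherwise
    random_tensor = 1 - dropout_rate
    random_tensor += tf.random.uniform(noise_shape)
    dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool)
    # Keep only the stored entries of the sparse tensor selected by dropout_mask
    pre_out = tf.sparse.retain(x, dropout_mask)
    
    # Rescale the kept entries so the expected value is unchanged
    return pre_out * (1. / (1 - dropout_rate))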
    
# Convert a SciPy sparse matrix to a tf.SparseTensor
def sp_matrix_to_sp_tensor(M):
    # print('M', type(M), M.shape)
    if not isinstance(M, sp.csr_matrix):
        M = M.tocsr()
    # Coordinates of the non-zero entries
    row, col = M.nonzero()
    # SparseTensor arguments: (nnz, 2) int64 index array, values, dense shape
    X = tf.SparseTensor(np.column_stack((row, col)).astype(np.int64), M.data, M.shape)
    X = tf.cast(X, tf.float32)
    # print('X', type(X), X.shape)
    return X
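
The mask construction in `sparse_dropout` relies on a small trick: `floor(keep_prob + u)` with `u` uniform in [0, 1) equals 1 with probability `keep_prob`. The NumPy sketch below reproduces that trick on a plain array of values (standing in for the stored entries of a sparse tensor) and checks empirically that the rescaling keeps the mean roughly unchanged. `dropout_values` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def dropout_values(values, dropout_rate, rng):
    """Inverted dropout on a 1-D array of stored sparse values."""
    keep_prob = 1.0 - dropout_rate
    # floor(keep_prob + U[0, 1)) is 1 with probability keep_prob, else 0
    mask = np.floor(keep_prob + rng.uniform(size=values.shape)).astype(bool)
    # Zero the dropped entries and scale the kept ones by 1 / keep_prob
    return np.where(mask, values / keep_prob, 0.0), mask

rng = np.random.default_rng(0)
values = np.ones(100_000)
dropped, mask = dropout_values(values, 0.5, rng)
print(mask.mean())      # fraction kept, close to 0.5
print(dropped.mean())   # mean after rescaling, close to 1.0
```

Note that with `dropout_rate = 0`, as in the configuration above, every entry is kept and the scale factor is 1, so the function is a no-op.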



Define the graph convolution layer

In [36]:
import tensorflow as tf
from tensorflow.keras import activations, initializers

class GCNConv(tf.keras.layers.Layer):
    def __init__( self,
                 units,
                 activation=lambda x:x,
                 use_bias = True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 **kwargs):
        
        # Forward extra keyword arguments (e.g. name) to the base Layer
        super().__init__(**kwargs)
        
        self.units = units
        self.activation = activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer=initializers.get(kernel_initializer)
        self.bias_initializer=initializers.get(bias_initializer)
        
    def build(self, input_shape):
        """GCN has two inputs : [shape(An), shape(X)]
        """
        fdim = input_shape[1][1] #feature dim
        # print('input_shape', type(input_shape), input_shape, fdim, self.units)
        # Initialize the weight matrix
        self.weight = self.add_weight(name='weight',
                                     shape=(fdim, self.units),
                                     initializer= self.kernel_initializer,
                                     trainable=True)
        if self.use_bias:
            # Initialize the bias vector
            self.bias = self.add_weight(name='bias',
                                       shape=(self.units, ),
                                       initializer = self.bias_initializer,
                                       trainable=True)
    
    def call(self, inputs):
        """ GCN has two inputs : [An, X]
        """
        self.An = inputs[0]
        self.X = inputs[1]
        # print('An', type(self.An), self.An.shape)
        # print('X', type(self.X), self.X.shape)
        # print('W', type(self.weight), self.weight.shape)

        
        # Compute XW (X may be sparse or dense)
        if isinstance(self.X, tf.SparseTensor):
            h = tf.sparse.sparse_dense_matmul(self.X, self.weight)
        else:
            h = tf.matmul(self.X, self.weight)
        # Compute An @ (XW)
        # print('h', type(h), h.shape)
        output = tf.sparse.sparse_dense_matmul(self.An, h)
        # print('bias', type(self.bias), self.bias.shape)
        
        if self.use_bias:
            output = tf.nn.bias_add(output, self.bias)
        
        if self.activation:
            output = self.activation(output)
        
        # print('output', type(output), output.shape)
        return output
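
For readers who want to see the arithmetic of `GCNConv.call` without TensorFlow, the following is a dense NumPy sketch of the same forward pass, `activation(An @ (X @ W) + b)`. The matrices are random toy data, and the uniform averaging matrix is only a stand-in for a real normalized adjacency matrix.

```python
import numpy as np

def gcn_layer(A_norm, X, W, b, activation=None):
    """Dense NumPy version of one graph convolution: act(A_norm @ (X @ W) + b)."""
    h = X @ W              # transform features: (N, F) @ (F, units)
    out = A_norm @ h + b   # aggregate over neighbours, then add the bias
    return activation(out) if activation is not None else out

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
N, F, units = 4, 3, 2
A_norm = np.full((N, N), 1.0 / N)   # toy stand-in: every node averages all nodes
X = rng.normal(size=(N, F))
W = rng.normal(size=(F, units))
b = np.zeros(units)

H = gcn_layer(A_norm, X, W, b, activation=relu)
print(H.shape)  # (4, 2)
```

Multiplying `X @ W` first, as the layer does, keeps the intermediate result at size `(N, units)` instead of `(N, F)`, which matters when `F` is large (1,433 for Cora) and `units` is small.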
        

        


Define the GCN model

In [54]:
tf.get_logger().setLevel('ERROR')

class GCN():
    def __init__(self, An, X, sizes, **kwargs):
        self.with_relu = True
        self.with_bias = True
        
        self.lr = FLAGS.learning_rate
        self.dropout = FLAGS.dropout
        self.verbose = FLAGS.verbose
        
        self.An = An
        self.X = X
        self.layer_sizes = sizes
        self.shape = An.shape
        
        self.An_tf = sp_matrix_to_sp_tensor(self.An)
        self.X_tf = sp_matrix_to_sp_tensor(self.X)
        
        self.layer1 = GCNConv(self.layer_sizes[0], activation='relu')
        self.layer2 = GCNConv(self.layer_sizes[1])
        self.opt = tf.optimizers.Adam(learning_rate = self.lr)
        
    def train(self, idx_train, labels_train, idx_val, label_val):
        # print(len(idx_train), labels_train.shape, len(idx_val), label_val.shape)
        # print(idx_train, labels_train, len(idx_val), label_val)

        K = labels_train.max() + 1
        print(K)
        train_losses = []
        val_losses = []
        # use adam to optimize
        for it in range(FLAGS.epochs):
            tic = time()
            with tf.GradientTape() as tape:
                _loss = self.loss_fn(idx_train, np.eye(K)[labels_train])
            # optimize over weights
            grad_list = tape.gradient(_loss, self.var_list)
            grads_and_vars = zip(grad_list, self.var_list)
            self.opt.apply_gradients(grads_and_vars)
            
            # evaluate on the training
            train_loss, train_acc = self.evaluate(idx_train, labels_train, training=True)
            train_losses.append(train_loss)
            val_loss, val_acc = self.evaluate(idx_val, label_val, training=False)
            val_losses.append(val_loss)
            toc =time()
            if self.verbose:
                print("iter:{:03d}".format(it),
                      "train_loss:{:.4f}".format(train_loss),
                      "train_acc:{:.4f}".format(train_acc),
                      "val_loss:{:.4f}".format(val_loss),
                      "val_acc:{:.4f}".format(val_acc),
                      "time:{:.4f}".format(toc - tic))
            
        return train_losses
    
    def loss_fn(self, idx, labels, training=True):
        if training:
            # .nnz is the number of stored (non-zero) entries in X
            _X = sparse_dropout(self.X_tf, self.dropout, [self.X.nnz])
        else:
            _X = self.X_tf
            
        self.h1 = self.layer1([self.An_tf,_X])
        if training:
            _h1 = tf.nn.dropout(self.h1, self.dropout)
        else:
            _h1 = self.h1
        
        self.h2 = self.layer2([self.An_tf, _h1])
        print('h2', self.h2.shape)
        print('idx', len(idx))
        self.var_list = self.layer1.weights + self.layer2.weights
        # calculate the loss base on idx and labels
        _logits = tf.gather(self.h2, idx)
        # print('logit', _logits.shape)
        _loss_per_node = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=_logits)
        # print('_loss_per_node', _loss_per_node.shape)

        _loss = tf.reduce_mean(_loss_per_node)
        # Add the L2 regularization term (first layer only)
        _loss +=FLAGS.weight_decay * sum(map(tf.nn.l2_loss, self.layer1.weights))
        # print('_loss', _loss)
        return _loss
    
    def evaluate(self, idx, true_labels, training):
        K = true_labels.max() +1
        _loss = self.loss_fn(idx, np.eye(K)[true_labels], training=training).numpy()
        _pred_logits = tf.gather(self.h2, idx)
        _pred_labels = tf.argmax(_pred_logits, axis=1).numpy()
        _acc = accuracy_score(true_labels, _pred_labels)
        
        return _loss, _acc
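
The loss in `loss_fn` is a softmax cross-entropy averaged over the selected nodes only, plus an L2 penalty. Below is a NumPy sketch of those two pieces on toy data; the function names are hypothetical. With all-zero logits the prediction is uniform over 7 classes, so the cross-entropy equals ln 7 ≈ 1.9459, which is in the same range as the first losses logged in the run at the end of this notebook.

```python
import numpy as np

def masked_softmax_cross_entropy(logits, labels, idx):
    """Mean cross-entropy over the nodes listed in idx only."""
    z = logits[idx]
    z = z - z.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(idx)), labels].mean()

def l2_loss(w):
    """Same convention as tf.nn.l2_loss: sum(w ** 2) / 2."""
    return 0.5 * np.sum(w ** 2)

logits = np.zeros((5, 7))      # 5 nodes, 7 classes, untrained (uniform) output
idx = [0, 2]                   # nodes that have labels
labels = np.array([3, 1])      # integer classes of those nodes
loss = masked_softmax_cross_entropy(logits, labels, idx)

W = np.array([3.0, 4.0])       # toy weights
total = loss + 5e-4 * l2_loss(W)   # weight_decay = 5e-4 as in the config
print(loss, total)
```

Only the rows in `idx` contribute to the loss, while the forward pass itself still runs over the whole graph.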
    
            
        
        
        

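Compute the normalized adjacency matrix $\tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$, where $\tilde{D}$ is the diagonal degree matrix of $A + I$

In [55]:
# Compute the symmetrically normalized adjacency matrix: D^(-1/2) (A + I) D^(-1/2)
def preprocess_graph(adj):
    # print('adj ', adj.shape, type(adj))
    # _adj = A + I (add self-loops)
    _adj = adj + sp.eye(adj.shape[0])
    # _dseq: degree of each node (row sums of A + I)
    _dseq = _adj.sum(1).A1
    # print(type(_dseq), _dseq.shape)
    # Diagonal matrix D^(-1/2)
    _D_half = sp.diags(np.power(_dseq, -0.5))
    # Normalized adjacency matrix; @ is matrix multiplication
    adj_normalized = _D_half @ _adj @ _D_half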
    return adj_normalized.tocsr()
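
A worked example on a 3-node path graph makes the normalization concrete. This sketch uses dense NumPy arrays and broadcasting rather than the sparse matrices used by `preprocess_graph`, but computes the same quantity.

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])       # path graph 0 - 1 - 2
A_tilde = A + np.eye(3)            # add self-loops
deg = A_tilde.sum(axis=1)          # degrees of A + I: [2, 3, 2]
d_inv_sqrt = deg ** -0.5

# D^(-1/2) (A + I) D^(-1/2), written with broadcasting instead of diagonal matrices:
# entry (i, j) becomes A_tilde[i, j] / sqrt(deg[i] * deg[j])
A_norm = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
print(np.round(A_norm, 3))
```

The diagonal entries are 1/2, 1/3, 1/2 and the off-diagonal entries for the two edges are 1/√6 ≈ 0.408, so high-degree nodes contribute less per edge and the matrix stays symmetric.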
 

In [56]:
if __name__ == "__main__":
    # Load the data
    # A_mat: adjacency matrix
    # X_mat: feature matrix
    # z_vec: labels
    # train_idx, val_idx, test_idx: node indices of each split
    A_mat, X_mat, z_vec, train_idx, val_idx, test_idx = load_data_planetoid(FLAGS.dataset)
    # Normalize the adjacency matrix
    An_mat = preprocess_graph(A_mat)
    print('An_mat', An_mat.shape)
    # Number of node classes
    K = z_vec.max() + 1
    print(K)
    # Build the GCN model
    gcn = GCN(An_mat, X_mat, [FLAGS.hidden1, K])
    # Train
    gcn.train(train_idx, z_vec[train_idx], val_idx, z_vec[val_idx])
    # Evaluate on the test set
    test_res = gcn.evaluate(test_idx, z_vec[test_idx], training=False)
    print("Dataset {}".format(FLAGS.dataset),
          "Test loss {:.4f}".format(test_res[0]),
          "test acc {:.4f}".format(test_res[1]))

An_mat (2708, 2708)
7
7
h2 (2708, 7)
idx 140
logit (140, 7)
_loss_per_node (140,)
_loss tf.Tensor(1.9627378, shape=(), dtype=float32)
h2 (2708, 7)
idx 140
logit (140, 7)
_loss_per_node (140,)
_loss tf.Tensor(1.86438, shape=(), dtype=float32)
h2 (2708, 7)
idx 500
logit (500, 7)
_loss_per_node (500,)
_loss tf.Tensor(1.9248261, shape=(), dtype=float32)
h2 (2708, 7)
idx 140
logit (140, 7)
_loss_per_node (140,)
_loss tf.Tensor(1.86438, shape=(), dtype=float32)
h2 (2708, 7)
idx 140
logit (140, 7)
_loss_per_node (140,)
_loss tf.Tensor(1.7453369, shape=(), dtype=float32)
h2 (2708, 7)
idx 500
logit (500, 7)
_loss_per_node (500,)
_loss tf.Tensor(1.8768542, shape=(), dtype=float32)
h2 (2708, 7)
idx 1000
logit (1000, 7)
_loss_per_node (1000,)
_loss tf.Tensor(1.87412, shape=(), dtype=float32)
Dataset cora Test loss 1.8741 test acc 0.3250
