## Graph Convolutional Network (GCN)

### Reference
1. [Understanding Graph Convolutional Networks (GCN), with a TensorFlow 2.0 implementation](https://github.com/zxxwin/tf2_gcn)



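## 1. The Cora dataset
The Cora dataset is a collection of machine learning papers and has become one of the most widely used benchmarks in graph deep learning in recent years.
It contains 2,708 papers, each assigned to one of 7 classes:

1. Case-based
2. Genetic algorithms
3. Neural networks
4. Probabilistic methods
5. Reinforcement learning
6. Rule learning
7. Theory

Each paper is represented by a 1433-dimensional word vector, so every sample has 1433 features. Each element of the vector corresponds to one word and takes only the values 0 or 1: 0 means the word does not appear in the paper, 1 means it does.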
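
To make the binary word-vector encoding concrete, here is a minimal sketch. The six-word vocabulary and the `binary_bow` helper are hypothetical and only illustrate the idea; the real Cora vocabulary has 1433 entries and ships already encoded.

```python
# Hypothetical 6-word vocabulary; the real Cora vocabulary has 1433 entries.
vocab = ["graph", "neural", "network", "genetic", "policy", "bayes"]

def binary_bow(text, vocab):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

vec = binary_bow("A neural network for graph data", vocab)
print(vec)  # [1, 1, 1, 0, 0, 0]
```

Word counts are discarded: a word that occurs ten times still produces a single 1.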


Download link: https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz

Reference: [Reading and processing the Cora dataset](https://blog.csdn.net/weixin_41650348/article/details/109406230)


In [16]:
!cd ../../data;wget https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz;tar -zxvf cora.tgz

--2022-07-08 19:01:39--  https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
Resolving linqs-data.soe.ucsc.edu (linqs-data.soe.ucsc.edu)... 128.114.47.74
Connecting to linqs-data.soe.ucsc.edu (linqs-data.soe.ucsc.edu)|128.114.47.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 168052 (164K) [application/x-gzip]
Saving to: ‘cora.tgz’


2022-07-08 19:01:41 (141 KB/s) - ‘cora.tgz’ saved [168052/168052]

cora/
cora/README
cora/cora.cites
cora/cora.content



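###### Inspecting the dataset
The downloaded archive contains three files: cora.cites, cora.content, and README.

+ README describes the dataset.
+ cora.content holds the information for each individual paper.
  It has 2,708 rows, one per sample, i.e. one per paper. Each row has three parts: the paper ID (e.g. 31336), the paper's word vector (1433 binary values), and the paper's class (e.g. Neural_Networks).
+ cora.cites holds the citation records between papers.
  It has 5,429 rows, each containing two paper IDs: the first is the cited (earlier) paper and the second is the paper that cites it. Both files are loaded below: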

In [3]:
import numpy as np
import pandas as pd

# Read the .content file
cora_content = pd.read_csv('../../data/cora/cora.content', sep='\t', header=None)
# Inspect the raw layout
print(cora_content.shape)
print(cora_content.head(3))

# Read the .cites file
cora_cites = pd.read_csv('../../data/cora/cora.cites', sep='\t', header=None)
print(cora_cites.shape)
print(cora_cites.head(3))

(2708, 1435)
      0     1     2     3     4     5     6     7     8     9     ...  1425  \
0    31336     0     0     0     0     0     0     0     0     0  ...     0   
1  1061127     0     0     0     0     0     0     0     0     0  ...     0   
2  1106406     0     0     0     0     0     0     0     0     0  ...     0   

   1426  1427  1428  1429  1430  1431  1432  1433                    1434  
0     0     1     0     0     0     0     0     0         Neural_Networks  
1     1     0     0     0     0     0     0     0           Rule_Learning  
2     0     0     0     0     0     0     0     0  Reinforcement_Learning  

[3 rows x 1435 columns]
(5429, 2)
    0       1
0  35    1033
1  35  103482
2  35  103515
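
The column order of cora.cites is easy to misread, so here is a small sketch of the direction convention described above (first column cited, second column citing). It uses the three rows printed in the output as an inline string rather than reading the file; the `cited_by` name is only illustrative.

```python
from collections import defaultdict

# Three sample lines in the cora.cites layout: "<cited id>\t<citing id>".
sample = "35\t1033\n35\t103482\n35\t103515\n"

cited_by = defaultdict(list)  # cited paper -> list of papers that cite it
for line in sample.splitlines():
    cited, citing = (int(tok) for tok in line.split("\t"))
    cited_by[cited].append(citing)

print(dict(cited_by))  # {35: [1033, 103482, 103515]}
```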


Build a mapping from paper_id to an integer index in [0, 2707]

In [12]:
content_idx = list(cora_content.index)  # row indices as a list
paper_id = list(cora_content.iloc[:, 0])  # first column of .content: the paper IDs
mp = dict(zip(paper_id, content_idx))  # dictionary of {paper ID: row index}
# Look up the row index of a given paper ID
mp[31336]

0
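
Before the mapping is used to index a matrix, it may be worth checking that it is well formed. The sketch below uses toy stand-ins for the ID column and the edge list (the `edges`, `missing`, and `inv` names are illustrative, not part of the notebook) and checks three things: IDs are unique, every ID in an edge has a row index, and the mapping can be inverted.

```python
# Toy stand-ins for the ID column of cora_content and the pairs in cora_cites.
paper_id = [31336, 1061127, 1106406]
edges = [(31336, 1061127), (1106406, 31336)]

mp = {pid: idx for idx, pid in enumerate(paper_id)}

# IDs must be unique, otherwise later entries silently overwrite earlier ones.
assert len(mp) == len(paper_id)

# Every ID that appears in an edge must have a row index.
missing = {p for edge in edges for p in edge if p not in mp}
print(missing)  # set()

# Inverse mapping: row index -> paper ID.
inv = {idx: pid for pid, idx in mp.items()}
print(inv[0])  # 31336
```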

Extract the feature matrix (2708, 1433)

In [14]:
# Slice from column 1 up to (but not including) the last column, dropping the paper ID and the label
feature = cora_content.iloc[:,1:-1]
print(feature.shape)
print(feature.head(3))


(2708, 1433)
   1     2     3     4     5     6     7     8     9     10    ...  1424  \
0     0     0     0     0     0     0     0     0     0     0  ...     0   
1     0     0     0     0     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   

   1425  1426  1427  1428  1429  1430  1431  1432  1433  
0     0     0     1     0     0     0     0     0     0  
1     0     1     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  

[3 rows x 1433 columns]


One-hot encode the labels

In [16]:
label = cora_content.iloc[:, -1]
label = pd.get_dummies(label)  # one-hot encoding
label.head(3)

| | Case_Based | Genetic_Algorithms | Neural_Networks | Probabilistic_Methods | Reinforcement_Learning | Rule_Learning | Theory |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
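
Some loss functions (for example Keras' sparse categorical cross-entropy) expect integer class indices rather than one-hot rows. As a small sketch on a toy array with the same layout as the three rows above, the two forms can be converted back and forth with NumPy; the `onehot`, `class_ids`, and `restored` names are illustrative.

```python
import numpy as np

# Toy one-hot labels: 3 samples, 7 classes (same layout as the rows above).
onehot = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
])

class_ids = onehot.argmax(axis=1)           # one-hot -> integer class index
print(class_ids)                            # [2 5 4]

restored = np.eye(7, dtype=int)[class_ids]  # integer class index -> one-hot
print((restored == onehot).all())           # True
```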


Build the adjacency matrix

In [19]:
mat_size = cora_content.shape[0]  # number of papers (2708), i.e. the size of the adjacency matrix
adj_mat = np.zeros((mat_size, mat_size))  # start from an all-zero matrix
for i, j in zip(cora_cites[0], cora_cites[1]):  # iterate over the citation pairs (u, v)
    x = mp[i]
    y = mp[j]
    adj_mat[x][y]=adj_mat[y][x]=1

print(sum(adj_mat).shape)
print(sum(sum(adj_mat)))

(2708,)
10556.0
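
The total printed above, 10556, is lower than 2 × 5429 = 10858. A plausible explanation is that some citation pairs are repeated or listed in both directions, and these land on the same pair of matrix entries. The toy sketch below (4 nodes, hypothetical edges) shows this effect, and also shows a vectorized way to fill the matrix that should behave like the Python loop.

```python
import numpy as np

# Toy edge list over 4 nodes, already mapped to row indices.
# (0, 1) and (1, 0) are reciprocal; (2, 3) appears twice.
src = np.array([0, 1, 2, 2])
dst = np.array([1, 0, 3, 3])

n = 4
adj = np.zeros((n, n))
adj[src, dst] = 1  # vectorized equivalent of the per-edge loop
adj[dst, src] = 1  # mirror every edge so the matrix is symmetric

print(int(adj.sum()))        # 4 nonzero entries, not 2 * 4 = 8
print(int(adj.sum()) // 2)   # 2 undirected edges (valid because there are no self-loops)
print((adj == adj.T).all())  # True
```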


Convert the data to NumPy arrays (or another format) for the later steps; up to this point the features and labels have been pandas DataFrames.

In [20]:
# Convert to NumPy arrays
feature = np.array(feature)
label = np.array(label)
adj_mat = np.array(adj_mat)
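
The layer defined in the next section takes a matrix called `An` as input, but the notebook does not show how it is computed. Assuming `An` denotes the symmetrically normalized adjacency with self-loops from the original GCN formulation, D^{-1/2} (A + I) D^{-1/2}, a minimal NumPy sketch looks like this. The `normalize_adj` helper is hypothetical and is demonstrated on a 3-node toy graph.

```python
import numpy as np

def normalize_adj(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_tilde.sum(axis=1)             # degrees, always >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scaling rows and columns is equivalent to D^{-1/2} @ a_tilde @ D^{-1/2}.
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy path graph 0 - 1 - 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
An = normalize_adj(A)
print(np.round(An, 3))
```

If the layer expects a sparse input, the dense result could then be converted with something like `tf.sparse.from_dense(tf.constant(An, dtype=tf.float32))`.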

## 2. GCN

In [21]:
import tensorflow as tf
print(tf.__version__)

2.6.0


Define the graph convolution layer

In [23]:
import tensorflow as tf
from tensorflow.keras import activations, initializers

class GCNConv(tf.keras.layers.Layer):
    def __init__(self,
                 units,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 **kwargs):

        super(GCNConv, self).__init__(**kwargs)

        self.units = units
        # activations.get(None) returns the linear (identity) activation
        self.activation = activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

    def build(self, input_shape):
        """GCN has two inputs: [shape(An), shape(X)]."""
        fdim = input_shape[1][1]  # feature dimension
        # Initialize the weight matrix
        self.weight = self.add_weight(name='weight',
                                      shape=(fdim, self.units),
                                      initializer=self.kernel_initializer,
                                      trainable=True)
        if self.use_bias:
            # Initialize the bias term
            self.bias = self.add_weight(name='bias',
                                        shape=(self.units,),
                                        initializer=self.bias_initializer,
                                        trainable=True)

    def call(self, inputs):
        """GCN has two inputs: [An, X]."""
        An, X = inputs
        # Compute XW
        if isinstance(X, tf.SparseTensor):
            h = tf.sparse.sparse_dense_matmul(X, self.weight)
        else:
            h = tf.matmul(X, self.weight)
        # Compute An(XW); An is expected to be a tf.SparseTensor
        output = tf.sparse.sparse_dense_matmul(An, h)

        if self.use_bias:
            output = tf.nn.bias_add(output, self.bias)

        if self.activation is not None:
            output = self.activation(output)

        return output
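
For reference, the computation that `call` is meant to perform can be written as a single formula. The symbols N (number of nodes), F (input feature dimension), and U (the layer's `units`) are introduced here only for illustration; for Cora, N = 2708 and the first layer has F = 1433.

```latex
Z = \sigma\left(A_n \,(X W) + b\right),
\qquad
A_n \in \mathbb{R}^{N \times N},\;
X \in \mathbb{R}^{N \times F},\;
W \in \mathbb{R}^{F \times U},\;
b \in \mathbb{R}^{U},\;
Z \in \mathbb{R}^{N \times U}
```

Here σ is the layer's activation, and the bias b is broadcast across all N rows. Computing XW first and then multiplying by A_n keeps the intermediate result at N × U rather than N × F, which is usually smaller when U < F.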
        

        


Define the GCN model

In [None]:
class GCN():
    def __init__(self, An, X, sizes, **kwargs):
        self.with_relu = True
        self.with_bias = True
        
        