In [None]:
import numpy as np
import pandas as pd
import os
import random
from sklearn.utils import shuffle

!pip install spektral

### 1. Basic Graph Theory
Let’s start from the beginning: basic graph theory. 

Nowadays, a lot of information is represented as graphs. 

Examples include Google’s Knowledge Graph, which helps with Search Engine Optimization (SEO), chemical molecular structures, document citation networks (document A has cited document B), and social media networks (who is connected to whom?). 

A graph consists of 2 main elements, nodes (vertices or points) and edges (links or lines) where the nodes are connected by edges.

Now comes the next question: which parts of the data are nodes, and which are edges? 

There is no strict answer to this, as we define the nodes and edges ourselves. 

For example, in a chemical molecule that consists of multiple atoms, the atoms can be defined as nodes and the bonds between atoms can be defined as edges.

Illustrations of chemical molecules
<img src = "https://miro.medium.com/max/500/1*_IhgiTj5GC92HVUfQDy-6g.png">

Another example is document citation networks from the [Cora dataset](https://relational.fit.cvut.cz/dataset/CORA). 

The nodes represent individual documents, and each edge represents a citation between two documents.

How the edges link the nodes allows us to distinguish between a directed vs an undirected graph. 

Simply put, in a directed graph, direction matters: an edge from node A to node B cannot be traversed from B to A. 
In an undirected graph, the edges have no direction and can be traversed either way.
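
As a small illustration, one possible way to sketch the difference is to store directed edges as ordered pairs and undirected edges as unordered pairs. The node names below are made up for this example.

```python
# Hypothetical toy edges; the node names are illustrative only.
directed_edges = {("A", "B"), ("B", "C")}

# In a directed graph, (u, v) does not imply (v, u).
print(("A", "B") in directed_edges)  # True
print(("B", "A") in directed_edges)  # False

# In an undirected graph each edge is an unordered pair,
# so it can be stored as a frozenset.
undirected_edges = {frozenset(e) for e in directed_edges}
print(frozenset(("B", "A")) in undirected_edges)  # True
```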

<img src = "https://miro.medium.com/max/875/1*zXvtHcbKASzD0vPSh3EoCA.png">

<img src = "https://miro.medium.com/max/293/1*v8jNKaprhHJj8RNjDYYPNw.png">

Besides that, there is also a special type of graph in which each node is connected to every other node. This is called a complete graph.

<img src = "https://miro.medium.com/max/321/1*JA9xZHR9PEk2wFdLbYpqkw.png">

### 2. Translating Graph into Features for Neural Networks

#### 2.1. Adjacency matrix (A)
An adjacency matrix is an N x N matrix filled with either 0 or 1, where N is the total number of nodes. 

An adjacency matrix represents the existence of edges that connect node pairs through the values of its elements. 

For example, if we have 5 nodes in our graph, then the shape of the matrix is [5, 5]. 

Matrix element Aᵢⱼ is 1 if an edge exists between node i and j. 

From the adjacency matrix below, we can see that the connection between node 2 and 3 (A₂₃) is colored yellow to represent 1 as they’re connected, while A₂₁ is dark purple as node 2 and 1 are not connected to each other.

<img src = "https://miro.medium.com/max/875/1*lvWOW6EyxXi3nSn7CqdQQw.png">
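
The following is a minimal sketch of how such a matrix could be built from an edge list with NumPy. The edge list is hypothetical (it is not the graph from the figure above) and uses 0-based node indices.

```python
import numpy as np

# Hypothetical undirected edge list for a 5-node graph (0-indexed);
# it is not the graph shown in the figure.
num_nodes = 5
edges = [(0, 1), (0, 4), (2, 3), (3, 4)]

A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph: mirror the entry

print(A.shape)  # (5, 5)
print(A[2, 3])  # 1, nodes 2 and 3 are connected
print(A[2, 1])  # 0, nodes 2 and 1 are not connected
```

For an undirected graph the resulting matrix is symmetric, which is why every edge is written twice.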

#### 2.2 Node attributes matrix (X)
Unlike the adjacency matrix, which models the relationships between nodes, this matrix represents the features or attributes of each node. 

If there are N nodes and the size of node attributes is F, then the shape of this matrix is N x F.

In the example of the CORA dataset, we will have a corpus that contains words from all the documents. 

The node attributes would be bag-of-words that indicate the presence of a word in the document, while each document is represented by a node. 

In this case, F will represent the size of the corpus (the total number of unique words) while N is the total number of documents available.

<img src = "https://miro.medium.com/max/595/1*9obE8LS2gPSFjbxR8yml4Q.png">
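
To make the bag-of-words idea concrete, here is a small sketch that builds X for a toy corpus. The three documents are invented for illustration; the real CORA features are precomputed in the dataset files.

```python
import numpy as np

# Hypothetical toy corpus: each document becomes one node.
docs = [
    "graph neural network",
    "neural network training",
    "graph citation network",
]

# Vocabulary = sorted unique words across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})
word_index = {word: k for k, word in enumerate(vocab)}

# X[i, k] = 1 if word k appears in document i, else 0.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i, word_index[word]] = 1

print(vocab)
print(X.shape)  # (N, F) = (3, 5)
```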

#### 2.3. Edge attributes matrix (E)
Sometimes, edges can have their own attributes too, just like nodes. 

If the size of edge attributes is S and the number of edges available is n_edges, the shape of this matrix is n_edges x S.
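
As a rough sketch, assume a molecule-like graph where every edge carries a one-hot bond type, so S = 2. The edges and bond types below are hypothetical.

```python
import numpy as np

# Hypothetical molecule-like graph: 3 edges, each with S = 2 attributes
# (a one-hot bond type: [single, double]).
edges = [(0, 1), (1, 2), (2, 3)]
bond_types = ["single", "double", "single"]

type_index = {"single": 0, "double": 1}
E = np.zeros((len(edges), len(type_index)), dtype=int)
for row, bond in enumerate(bond_types):
    E[row, type_index[bond]] = 1

print(E.shape)  # (n_edges, S) = (3, 2)
print(E)
```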

### 3. Graph Neural Networks Data Modes
If we look back at the chemical molecular structures and document citation networks mentioned earlier, we will realize that the two use different graph representation settings.

For example, if we want to classify chemical molecules, we treat each molecule as a separate graph, so the number of graphs equals the number of molecules. We call this Batch Mode. 

On the other hand, if we want to classify documents within a document citation network, we have only 1 big graph with all the documents as its nodes. We call this Single Mode.
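
One way to picture the two modes is by the shape of the data we would hold in memory. The sketch below uses placeholder matrices with made-up sizes; it only illustrates the layout, not a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch mode: many small graphs, each with its own (A, X) pair.
# In a graph classification setting there is one label per graph.
molecules = []
for n_atoms in (3, 5, 4):
    A_mol = np.zeros((n_atoms, n_atoms), dtype=int)  # placeholder adjacency
    X_mol = rng.random((n_atoms, 4))                 # 4 features per atom
    molecules.append((A_mol, X_mol))
print(len(molecules), "graphs")

# Single mode: one big graph with a single (A, X) pair.
# In a node classification setting there is one label per node.
n_docs = 10
A_single = np.zeros((n_docs, n_docs), dtype=int)
X_single = rng.random((n_docs, 4))
print(A_single.shape, X_single.shape)
```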

### 4. Graph Convolutional Networks

In GCN, we will take into account the Adjacency Matrix (A) in the forward propagation equation in addition to the node features (or so-called input features). 

`A` is the matrix that represents the edges, or connections, between the nodes. 

Including A in the forward pass equation enables the model to learn feature representations based on node connectivity. 

For the sake of simplicity, the bias b is omitted. The resulting GCN can be seen as the first-order approximation of Spectral Graph Convolution in the form of a message passing network where the information is propagated along the neighboring nodes within the graph.

By adding the adjacency matrix as an additional element, the forward pass equation will then be:
<img src = "https://miro.medium.com/max/600/1*2cT063K_PIvJVRqFn8c5gg.png">

`A*` is the normalized version of A. To get a better understanding of why we need to normalize A and what happens during the forward pass in GCNs, let’s do an experiment.

### 5. Building Graph Convolutional Networks

#### 5.1. Initializing the Graph G
Let’s start by building a simple undirected graph (G) using NetworkX. 

The graph G will consist of 6 nodes and the feature of each node will correspond to that particular node number. 

For example, node 1 will have a node feature of 1, node 2 will have a node feature of 2, and so on. To simplify, we are not going to assign edge features in this experiment.

In [None]:
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import fractional_matrix_power

import warnings
warnings.filterwarnings("ignore", category=UserWarning)


#Initialize the graph
G = nx.Graph(name='G')

#Create nodes
#In this example, the graph will consist of 6 nodes.
#Each node is assigned node feature which corresponds to the node name
for i in range(6):
    G.add_node(i, name=i)


#Define the edges and add the edges to the graph
edges = [(0,1),(0,2),(1,2),(0,3),(3,4),(3,5),(4,5)]
G.add_edges_from(edges)

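#See graph info (nx.info() was removed in NetworkX 3.0, so we print the counts directly)
print('Graph Info:\n', f'{G.number_of_nodes()} nodes, {G.number_of_edges()} edges')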

#Inspect the node features
print('\nGraph Nodes: ', G.nodes.data())

#Plot the graph
nx.draw(G, with_labels=True, font_weight='bold')
plt.show()

Since we only have 1 graph, this data configuration is an example of a Single Mode representation. 

We will build a GCN that learns the node feature representations.

#### 5.2 Inserting Adjacency Matrix (A) to Forward Pass Equation
##### 5.2.1 Obtaining the Adjacency Matrix (A) and Node Features Matrix (X)
The next step is to obtain the Adjacency Matrix (A) and Node Features Matrix (X) from graph G.

In [None]:
#Get the Adjacency Matrix (A) and Node Features Matrix (X) as numpy arrays
#nx.attr_matrix returns the matrix and the ordered list of attribute values
A_mat, node_order = nx.attr_matrix(G, node_attr='name')
A = np.array(A_mat)
X = np.expand_dims(np.array(node_order), axis=1)

print('Shape of A: ', A.shape)
print('\nShape of X: ', X.shape)
print('\nAdjacency Matrix (A):\n', A)
print('\nNode Features Matrix (X):\n', X)

Now, let’s investigate how inserting A into the forward pass equation leads to a richer feature representation. We are going to compute the dot product of A and X and call the result AX in this notebook.

In [None]:
#Dot product Adjacency Matrix (A) and Node Features (X)
AX = np.dot(A,X)
print("Dot product of A and X (AX):\n", AX)

From the results, it is apparent that AX represents the sum of the neighboring nodes’ features. 

For example, the first row of AX corresponds to the sum of the features of the nodes connected to node 0, namely nodes 1, 2, and 3 (1 + 2 + 3 = 6). 

This gives us an idea of how the propagation mechanism works in GCNs and how node connectivity affects the hidden feature representations seen by GCNs.

`There is one problem here: while AX sums up the adjacent node features, it does not take into account the features of the node itself.`

##### 5.2.2 Inserting Self-Loops and Normalizing A
To address this problem, we now add self-loops to each node of A. 

Adding self-loops is basically a mechanism to connect a node to itself. 

As a result, all the diagonal elements of Adjacency Matrix A will now become 1 because each node is connected to itself. 

Let’s call A with self-loops added as A_hat and recalculate AX, which is now the dot product of A_hat and X.

In [None]:
#Add Self Loops
G_self_loops = G.copy()

self_loops = []
for i in range(G.number_of_nodes()):
    self_loops.append((i,i))

G_self_loops.add_edges_from(self_loops)

#Check the edges of G_self_loops after adding the self loops
print('Edges of G with self-loops:\n', G_self_loops.edges)

#Get the Adjacency Matrix (A_hat) of the graph with added self-loops
A_hat = np.array(nx.attr_matrix(G_self_loops, node_attr='name')[0])
print('Adjacency Matrix of added self-loops G (A_hat):\n', A_hat)

#Calculate the dot product of A_hat and X (AX)
AX = np.dot(A_hat, X)
print('AX:\n', AX)

Now we can see that AX also takes into account the features of each node itself.

But there is still another problem. 

`The elements of AX are not normalized.` 

Similar to data pre-processing for any Neural Networks operation, we need to normalize the features to prevent numerical instabilities and vanishing/exploding gradients in order for the model to converge. 

In GCNs, we normalize our data by calculating the Degree Matrix (D) and performing dot product operation of the inverse of D with AX.

In graph terminology, the term “degree” refers to the number of edges a node is connected to.

`normalized features = (Inverse)D A X` 

We will call the normalized features `DAX` in this notebook. 

In [None]:
#Get the node degrees of the added self-loops graph
#Note: NetworkX counts each self-loop twice when computing degree()
Deg_Mat = G_self_loops.degree()
print('Degree Matrix of added self-loops G (D): ', Deg_Mat)

#Convert the degrees to an N x N diagonal matrix where N is the number of nodes
D = np.diag([deg for (n,deg) in list(Deg_Mat)])
print('Degree Matrix of added self-loops G as numpy array (D):\n', D)

#Find the inverse of Degree Matrix (D)
D_inv = np.linalg.inv(D)
print('Inverse of D:\n', D_inv)

#Dot product of the inverse of D and AX for normalization
DAX = np.dot(D_inv,AX)
print('DAX:\n', DAX)
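
A note on the degree values: NetworkX counts a self-loop twice in `degree()`, whereas the GCN paper defines the degree of the self-loop graph as the row sum of its adjacency matrix, so each self-loop counts once. The sketch below is an alternative under that row-sum definition, rebuilt with NumPy only and with separate variable names (`A_tilde`, `feats`, `deg_rowsum`) so that it does not overwrite the notebook's variables. With it, each row of the result is simply the mean of the node's own feature and its neighbors' features.

```python
import numpy as np

# Same toy graph as above, rebuilt with NumPy only.
toy_edges = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4), (3, 5), (4, 5)]
n_toy = 6
A_tilde = np.eye(n_toy)                  # self-loops on the diagonal
for i, j in toy_edges:
    A_tilde[i, j] = A_tilde[j, i] = 1.0

feats = np.arange(n_toy, dtype=float).reshape(-1, 1)  # node i has feature i

# Degree taken as the row sum of A_tilde: each self-loop counts once.
deg_rowsum = A_tilde.sum(axis=1)
D_rowsum_inv = np.diag(1.0 / deg_rowsum)

DAX_rowsum = D_rowsum_inv @ A_tilde @ feats
print('Row-sum degrees:', deg_rowsum)    # [4. 3. 3. 4. 3. 3.]
print('Row-normalized features:\n', DAX_rowsum)
```

The qualitative conclusion is the same under both definitions: node 3 ends up with a lower value than nodes 4 and 5 because its degree is higher.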

If we compare DAX with AX, we will notice that:
<img src = "https://miro.medium.com/max/353/1*ltJAgs2TYVIW3j16H3oWzg.png">

We can see the impact normalization has on DAX: the element that corresponds to node 3 has a lower value than those of nodes 4 and 5. `But why would node 3 have a different value after normalization if it has the same initial value as nodes 4 and 5?`

Let’s take a look back at our graph. 

Node 3 has 3 incident edges, while nodes 4 and 5 only have 2 incident edges. 

The fact that node 3 has a higher degree than node 4 and 5 leads to a lower weighting of node 3’s features in DAX. 

`In other words, the lower the degree of a node, the more strongly that node belongs to a certain group or cluster.`

In the original GCN paper, authors `Kipf and Welling` state that symmetric normalization makes the dynamics more interesting; hence, the normalization equation is modified as follows:

<img src = "https://miro.medium.com/max/530/1*yd8uL8Ewj_C4ES5faZVUxg.png">

Let’s calculate the normalized values using the new symmetric normalization equation!

In [None]:
#Symmetric normalization
D_half_norm = fractional_matrix_power(D, -0.5)
DADX = D_half_norm.dot(A_hat).dot(D_half_norm).dot(X)
print('DADX:\n', DADX)

Looking back at the forward pass equation that we saw earlier, we now have the answer to what `A*` is! In the paper, computing `A*` this way is referred to as the renormalization trick.
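
Putting the pieces together, a single GCN layer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not Spektral's implementation: the helper names `normalize_adjacency` and `gcn_layer` are made up, the weights are random rather than learned, ReLU is assumed as the activation, and the degree is taken as the row sum of the self-loop adjacency matrix as defined in the paper.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2, with D the row-sum degree of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    # Scaling rows and columns is equivalent to multiplying by the
    # diagonal matrix D^-1/2 on the left and on the right.
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_star, H, W):
    """One forward pass: ReLU(A* H W), bias omitted."""
    return np.maximum(A_star @ H @ W, 0.0)

# Toy graph from Section 5 (node i has feature i).
toy_edges = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4), (3, 5), (4, 5)]
A_toy = np.zeros((6, 6))
for i, j in toy_edges:
    A_toy[i, j] = A_toy[j, i] = 1.0
H0 = np.arange(6, dtype=float).reshape(-1, 1)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(1, 4))   # random weights: 1 input feature -> 4 hidden units

A_star = normalize_adjacency(A_toy)
H1 = gcn_layer(A_star, H0, W0)
print(H1.shape)  # (6, 4)
```

Because D is diagonal, its inverse square root can be computed element-wise, which avoids a general matrix power.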

### 6. Implementation of Graph Convolution Neural Networks (GCNN) using Spektral API

Spektral is a Python library for graph deep learning based on TensorFlow 2. 

We are going to perform Semi-Supervised Node Classification using the CORA dataset, similar to the work presented in the original GCN paper by Thomas Kipf and Max Welling (2017).

#### 6.1 Dataset Overview

`CORA citation network dataset consists of 2708 nodes, where each node represents a document or a technical paper.` 

The node features are bag-of-words representation that indicates the presence of a word in the document. 

`The vocabulary — hence, also the node features — contains 1433 words.`

We will treat the dataset as an undirected graph where the edge represents whether one document cites the other or vice versa. 

`There is no edge feature in this dataset.` 

The goal of this task is to classify the nodes (or the documents) into 7 different classes which correspond to the papers’ research areas. 

This is a single-label multi-class classification problem with `Single Mode` data representation setting.

This implementation is also an example of `Transductive Learning`, where the neural network sees all data, including the test dataset, during training. 

This is in contrast to `Inductive Learning`, which is the typical Supervised Learning setting, where the test data is kept separate during training.

#### 6.2 Text Classification Problem

Since we are going to classify documents based on their textual features, a common machine learning way to look at this problem is by seeing it as a supervised text classification problem. 

Using this approach, the machine learning model will learn each document’s hidden representation only based on its own features.

The image below illustrates the text classification approach on a document classification problem:
<img src = "https://miro.medium.com/max/700/1*JC9WB-6BbH9hxsU4PlYgcg.png">

This approach might work well if there are enough labeled examples for each class. 

Unfortunately, in real world cases, labeling data might be expensive.

Let's look at another approach:

Besides its own text content, a technical paper normally also cites other related papers. 

Intuitively, the cited papers are likely to belong to a similar research area.

In this citation network dataset, we want to leverage the citation information from each paper in addition to its own textual content. Hence, the dataset has now turned into a network of papers.
<img src = "https://miro.medium.com/max/700/1*gT9NBC2Ybl7w7V_RGEAILQ.png">


Using this configuration, we can utilize Graph Neural Networks, such as Graph Convolutional Networks (GCNs), to build a model that learns the interconnections between documents in addition to their own textual features. 

The GCN model will learn each node’s (or document’s) hidden representation based not only on its own features, but also on its neighboring nodes’ features. 

Hence, we can reduce the number of necessary labeled examples and implement semi-supervised learning utilizing the Adjacency Matrix (A) or the nodes connectivity within a graph.

Another case where Graph Neural Networks might be useful is when each example does not have distinct features on its own, but the relations between the examples can enrich the feature representations.

#### 6.3 Implementation of Graph Convolutional Networks

##### 6.3.1 Loading and Parsing the Dataset

In this experiment, we are going to build and train a GCN model using the Spektral API, which is built on TensorFlow 2. 

Although Spektral provides built-in functions to load and preprocess the CORA dataset, we are going to use the raw dataset in order to gain a deeper understanding of the data preprocessing and configuration. 

We will use `cora.content` and `cora.cites` files and we will randomly shuffle the data.

In the `cora.content` file, each line consists of several elements:
1. the first element indicates the document (or node) ID,
2. the second through second-to-last elements indicate the node features,
3. the last element indicates the label of that particular node.

In the `cora.cites` file, each line contains a pair of document (or node) IDs:
1. the first element indicates the ID of the paper being cited,
2. the second element indicates the ID of the paper containing the citation.

Although this configuration represents a directed graph, in this approach we treat the dataset as an undirected graph.
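
The layout described above can be sketched on two made-up lines. The IDs and feature values here are illustrative only, and a real `cora.content` line carries 1433 feature values rather than 4.

```python
# Hypothetical lines in the tab-separated layout described above.
content_line = "31336\t0\t1\t0\t1\tNeural_Networks"
cites_line = "35\t1033"

fields = content_line.split("\t")
node_id = fields[0]                          # first element: node ID
features = [int(v) for v in fields[1:-1]]    # middle elements: node features
label = fields[-1]                           # last element: label
print(node_id, features, label)

cited, citing = cites_line.split("\t")
# Treating the graph as undirected: store the edge as an unordered pair.
edge = frozenset((cited, citing))
print(edge == frozenset(("1033", "35")))  # True
```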

In [None]:
#loading the data

all_data = []
all_edges = []

for root,dirs,files in os.walk('../input/coradataset'):
    #print(root)
    for file in files:
        #print(file)
        if '.content' in file:
            with open(os.path.join(root,file),'r') as f:                
                all_data.extend(f.read().splitlines())
        elif 'cites' in file:
            with open(os.path.join(root,file),'r') as f:
                all_edges.extend(f.read().splitlines())

                
#Shuffle the data because the raw data is ordered based on the label
random_state = 77
all_data = shuffle(all_data,random_state=random_state)

After loading the data, we build the `Node Features Matrix (X)` and `a list containing tuples of adjacent nodes`. 

This edge list will be used to build a graph from which we can obtain the Adjacency Matrix (A).

In [None]:
#parse the data
labels = []
nodes = []
X = []

for i,data in enumerate(all_data):
    elements = data.split('\t')
    labels.append(elements[-1])
    X.append(elements[1:-1])
    nodes.append(elements[0])

X = np.array(X,dtype=int)
N = X.shape[0] #the number of nodes
F = X.shape[1] #the size of node features
print('X shape: ', X.shape)


#parse the edge
edge_list=[]
for edge in all_edges:
    e = edge.split('\t')
    edge_list.append((e[0],e[1]))

print('\nNumber of nodes (N): ', N)
print('\nNumber of features (F) of each node: ', F)
print('\nCategories: ', set(labels))

num_classes = len(set(labels))
print('\nNumber of classes: ', num_classes)


##### 6.3.2 Setting the Train, Validation, and Test Mask
We will feed in the Node Features Matrix (X) and Adjacency Matrix (A) to the neural networks. 

We are also going to set Boolean masks of length N for each of the training, validation, and testing sets.

An element of a mask is True when the corresponding node belongs to that set. 

For example, the elements of the train mask are True for the nodes that belong to the training data.

<img src="https://miro.medium.com/max/875/1*nuanGfgWcumBM-jb6cOtiA.png">
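
Here is a toy sketch of the masking idea with 8 nodes and made-up index splits. It also shows why a mask can act as a per-node weight: multiplying per-node losses by the mask zeroes out every node outside the split. The loss values are invented for illustration.

```python
import numpy as np

# Toy setting: 8 nodes with hypothetical index splits.
n_nodes = 8
train_idx_toy = [0, 3]
val_idx_toy = [1, 4]
test_idx_toy = [2, 5, 6]

def make_mask(indices, size):
    """Boolean vector that is True at the given indices."""
    mask = np.zeros(size, dtype=bool)
    mask[indices] = True
    return mask

train_mask_toy = make_mask(train_idx_toy, n_nodes)
val_mask_toy = make_mask(val_idx_toy, n_nodes)
test_mask_toy = make_mask(test_idx_toy, n_nodes)
print(train_mask_toy)

# The splits should not overlap: no node is True in two masks.
overlap = ((train_mask_toy & val_mask_toy)
           | (train_mask_toy & test_mask_toy)
           | (val_mask_toy & test_mask_toy))
print(overlap.any())  # False

# Used as per-node weights, a mask removes the loss of nodes outside the split.
per_node_loss = np.linspace(0.1, 0.8, n_nodes)   # made-up loss values
print((per_node_loss * train_mask_toy).sum())    # only nodes 0 and 3 contribute
```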

In the original paper, they picked 20 labeled examples for each class. 

Hence, with 7 classes, we will have a total of 140 labeled training examples. 

We will also use 500 labeled validation examples and 1000 labeled testing examples.

In [None]:
def limit_data(labels,limit=20,val_num=500,test_num=1000):
    '''
    Get the index of train, validation, and test data
    '''
    label_counter = dict((l, 0) for l in labels)
    train_idx = []

    for i in range(len(labels)):
        label = labels[i]
        if label_counter[label]<limit:
            #add the example to the training data
            train_idx.append(i)
            label_counter[label]+=1
        
        #exit the loop once we found 20 examples for each class
        if all(count == limit for count in label_counter.values()):
            break
    
    #get the indices that do not go to training data
    train_set = set(train_idx)  #set lookup is faster than scanning the list
    rest_idx = [x for x in range(len(labels)) if x not in train_set]
    val_idx = rest_idx[:val_num]
    test_idx = rest_idx[val_num:(val_num+test_num)]
    return train_idx, val_idx,test_idx

train_idx,val_idx,test_idx = limit_data(labels)

#set the mask
train_mask = np.zeros((N,),dtype=bool)
train_mask[train_idx] = True

val_mask = np.zeros((N,),dtype=bool)
val_mask[val_idx] = True

test_mask = np.zeros((N,),dtype=bool)
test_mask[test_idx] = True

##### 6.3.3 Obtaining the Adjacency Matrix
The next step is to obtain the Adjacency Matrix (A) of the graph. 

We use NetworkX to help us do this. 

We will initialize a graph and then add the nodes and edges lists to the graph.

In [None]:
#build the graph
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edge_list)

#obtain the adjacency matrix (A)
A = nx.adjacency_matrix(G)
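#nx.info() was removed in NetworkX 3.0, so we print the counts directly
print('Graph info: ', f'{G.number_of_nodes()} nodes, {G.number_of_edges()} edges')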

##### 6.3.4 Converting the label to one-hot encoding
The last step before building our GCN is, just like any other machine learning model, encoding the labels and then converting them to one-hot encoding.

In [None]:
from sklearn import preprocessing
#use tensorflow.keras consistently, since Spektral is built on TensorFlow 2
from tensorflow.keras.utils import to_categorical
def encode_label(labels):
    label_encoder = preprocessing.LabelEncoder()
    labels = label_encoder.fit_transform(labels)
    labels = to_categorical(labels)
    return labels, label_encoder.classes_

labels_encoded, classes = encode_label(labels)
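
To make the encoding concrete, the same two steps (label to integer, integer to one-hot vector) can be sketched with NumPy alone. This is an illustrative alternative on a made-up label list, not a replacement for the cell above.

```python
import numpy as np

# Made-up label list for illustration.
labels_toy = ["Theory", "Neural_Networks", "Theory", "Rule_Learning"]

# np.unique returns the sorted classes and, with return_inverse=True,
# the integer class index of every label.
classes_toy, label_ids = np.unique(labels_toy, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes_toy), dtype=int)[label_ids]

print(classes_toy)  # ['Neural_Networks' 'Rule_Learning' 'Theory']
print(label_ids)    # [2 0 2 1]
print(one_hot)
```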

We are now done with data preprocessing and ready to build our GCN!

##### 6.3.5 Build the Graph Convolutional Networks
The GCN model architecture and hyperparameters follow the design from the original GCN paper. 

The GCN model will take 2 inputs:
1. the Node Features Matrix (X) 
2. Adjacency Matrix (A). 

We are going to implement a 2-layer GCN with Dropout layers and L2 regularization. 

We are also going to set the maximum number of training epochs to 200 and implement Early Stopping with a patience of 10. 

It means that the training will be stopped once the validation loss does not decrease for 10 consecutive epochs.
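
The stopping rule can be sketched in plain Python. This is a simplified illustration of the logic only, not the Keras implementation (which has additional options); the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71]
print(early_stopping_epoch(losses, patience=3))   # 6
print(early_stopping_epoch(losses, patience=10))  # None
```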

To monitor the training and validation accuracy and loss, we are also going to call TensorBoard in the callbacks.

Before feeding the Adjacency Matrix (A) to the GCN, we need to do extra preprocessing by applying the renormalization trick described in the original paper. 

The code to train GCN below was originally obtained from [Spektral GitHub page](https://github.com/danielegrattarola/spektral/blob/master/examples/node_prediction/citation_gcn.py).

In [None]:
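from spektral.layers import GCNConv
#import everything from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.layers import Input, Dropout
from tensorflow.keras import regularizers
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
import tensorflow as tf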

channels = 16           # Number of channels in the first layer
dropout = 0.5           # Dropout rate for the features
l2_reg = 5e-4           # L2 regularization rate
learning_rate = 1e-2    # Learning rate
epochs = 200            # Number of training epochs
es_patience = 10        # Patience for early stopping

# Preprocessing operations
A = GCNConv.preprocess(A).astype('f4')

# Model definition
X_in = Input(shape=(F, ))
fltr_in = Input((N, ), sparse=True)

dropout_1 = Dropout(dropout)(X_in)
graph_conv_1 = GCNConv(channels,
                         activation='relu',
                         kernel_regularizer=regularizers.l2(l2_reg),
                         use_bias=False)([dropout_1, fltr_in])

dropout_2 = Dropout(dropout)(graph_conv_1)
graph_conv_2 = GCNConv(num_classes,
                         activation='softmax',
                         use_bias=False)([dropout_2, fltr_in])

# Build model
model = Model(inputs=[X_in, fltr_in], outputs=graph_conv_2)
optimizer = Adam(learning_rate=learning_rate)  # 'lr' is a deprecated alias
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              weighted_metrics=['acc'])
model.summary()

tbCallBack_GCN = tf.keras.callbacks.TensorBoard(
    log_dir='./Tensorboard_GCN_cora',
)
callback_GCN = [tbCallBack_GCN]

##### 6.3.6 Train the Graph Convolutional Networks
We are implementing Transductive Learning, which means we feed the whole graph to the model for both training and testing. 

We separate the training, validation, and testing data using the Boolean masks we constructed before. 

These masks are passed to the `sample_weight` argument. We set `batch_size` to the whole graph size and `shuffle=False` so that the graph is processed as a single, unshuffled batch.

To better evaluate the model performance for each class, we use F1-score instead of accuracy and loss metrics.
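
As a reminder of what the per-class scores mean, here is a small NumPy sketch that computes precision, recall, and F1 for each class. The helper name `per_class_f1` and the toy labels are made up; the notebook itself relies on scikit-learn's `classification_report`.

```python
import numpy as np

def per_class_f1(y_true, y_pred, num_classes):
    """Compute (precision, recall, F1) for each class from integer labels."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        scores.append((precision, recall, f1))
    return scores

# Made-up labels and predictions for 3 classes.
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])
for c, (p, r, f1) in enumerate(per_class_f1(y_true_toy, y_pred_toy, 3)):
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```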


In [None]:
from tensorflow.keras.callbacks import EarlyStopping
# Train model
validation_data = ([X, A], labels_encoded, val_mask)
model.fit([X, A],
          labels_encoded,
          sample_weight=train_mask,
          epochs=epochs,
          batch_size=N,
          validation_data=validation_data,
          shuffle=False,
          callbacks=[
              EarlyStopping(patience=es_patience,  restore_best_weights=True),
              tbCallBack_GCN
          ])

In [None]:
y = labels_encoded

In [None]:
from sklearn.metrics import classification_report
#Predict on the whole graph, then score only the test nodes selected by test_mask
y_pred = model.predict([X, A],batch_size=N)
report = classification_report(np.argmax(y[test_mask],axis=1), np.argmax(y_pred[test_mask],axis=1), target_names=classes)
print('GCN Classification Report: \n {}'.format(report))

# Hope you liked this notebook!