# Academic Papers Classification Contest

## Introduction

Graph neural networks(GNNs) is a variant of neural network specialized in dealing with non-euclidean structured data. It is widely applied in recommender systems, financial risk control and biology computations. There are three types of Graph Neural Network problems: node classification, connectivity prediction and graph classification.

The ogbn-arxiv dataset consists of many academic papers as nodes, with references as edges. Each paper is represented by a 100 dimensional vector. Our task is to infer their class based on their representations. 


## Requirements 
This notebook is based on PaddlePaddle 2.2

## Structure of the Code

1) Read the ogbn-arxiv files, Including the graph and node representations

2) Construct the graph neural network

3) Start Training

4) Perform the final prediction and generate the submission file

5) Using Label Propagation to further improve the accuracy of the sumbimission.

In [1]:

# If a persistence installation is required, 
# you need to use the persistence path as the following: 
!mkdir /home/aistudio/external-libraries
!pip install pgl easydict -q -t /home/aistudio/external-libraries

mkdir: cannot create directory ‘/home/aistudio/external-libraries’: File exists
[31mERROR: parl 1.4.1 has requirement pyzmq==18.1.1, but you'll have pyzmq 22.3.0 which is incompatible.[0m
[31mERROR: paddlefsl 1.0.0 has requirement numpy~=1.19.2, but you'll have numpy 1.21.5 which is incompatible.[0m
[31mERROR: paddlefsl 1.0.0 has requirement pillow==8.2.0, but you'll have pillow 7.1.2 which is incompatible.[0m
[31mERROR: paddlefsl 1.0.0 has requirement requests~=2.24.0, but you'll have requests 2.22.0 which is incompatible.[0m
[31mERROR: blackhole 1.0.1 has requirement numpy<=1.19.5, but you'll have numpy 1.21.5 which is incompatible.[0m


In [2]:

# Also add the following code, 
# so that every time the environment (kernel) starts, 
# just run the following code: 
import sys 
sys.path.append('/home/aistudio/external-libraries')

In [3]:
import pgl
import paddle
import paddle.fluid as fluid
import paddle.nn as nn
import numpy as np
import time
import pandas as pd

In [4]:
from easydict import EasyDict as edict

config = {
    "model_name": "GCN",
    "num_class": 35,
    "num_layers": 8,
    "dropout": 0.3,
    "hidden_size": 256,
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "edge_dropout": 0.00
}

config = edict(config)

## Data Loading 

In this cell block we read the dataset, including the graph and features, as well as the training/validation/testing set data split.

In [5]:
from collections import namedtuple

Dataset = namedtuple("Dataset", 
               ["graph", "num_classes", "train_index",
                "train_label", "valid_index", "valid_label", "test_index", "node_feat", "edges", "node_label"])

def load_edges(num_nodes, self_loop=True, add_inverse_edge=True):
    # read edges
    edges = pd.read_csv("work/edges.csv", header=None, names=["src", "dst"]).values

    if add_inverse_edge:
        edges = np.vstack([edges, edges[:, ::-1]])

    if self_loop:
        src = np.arange(0, num_nodes)
        dst = np.arange(0, num_nodes)
        self_loop = np.vstack([src, dst]).T
        edges = np.vstack([edges, self_loop])
    
    return edges

def load():
    # read edges and features
    node_feat = np.load("work/feat.npy")
    num_nodes = node_feat.shape[0]
    edges = load_edges(num_nodes=num_nodes, self_loop=True, add_inverse_edge=True)
    graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges, node_feat={"feat": node_feat})
    
    df = pd.read_csv("work/train.csv")
    node_index = df["nid"].values
    node_label = df["label"].values
    train_part = int(len(node_index) * 0.8)
    train_index = node_index[:train_part]
    train_label = node_label[:train_part]
    valid_index = node_index[train_part:]
    valid_label = node_label[train_part:]
    test_index = pd.read_csv("work/test.csv")["nid"].values
    dataset = Dataset(graph=graph, 
                    train_label=train_label,
                    train_index=train_index,
                    valid_index=valid_index,
                    valid_label=valid_label,
                    test_index=test_index, num_classes=35, node_feat = node_feat, edges = edges, node_label=node_label)
    return dataset

In [6]:
dataset = load()

train_index = dataset.train_index
train_label = paddle.to_tensor(np.reshape(dataset.train_label, [-1 , 1]))
train_index = paddle.to_tensor(np.expand_dims(train_index, -1))

val_index = dataset.valid_index
val_label = paddle.to_tensor(np.reshape(dataset.valid_label, [-1, 1]))
val_index = paddle.to_tensor(np.expand_dims(val_index, -1))

test_index = dataset.test_index
test_index = paddle.to_tensor(np.expand_dims(test_index, -1))
test_label = paddle.to_tensor(np.zeros((len(test_index), 1), dtype="int64"))
num_class = dataset.num_classes


## Model Construction

In [7]:
import pgl
from pgl.sampling import subgraph
from pgl.graph import Graph
import graphmodel_1
from graphmodel_1 import Model
from unimpmodel import UniMP
import paddle
import paddle.nn as nn
import numpy as np
import time
 #Using CPU
#place = fluid.CPUPlace()
# Using GPU
place = fluid.CUDAPlace(0)
model_name = config.get("model_name", "GCN")
if model_name == "UniMP":
    model = UniMP(config)
else:
    model = Model(config)
lr = 0.005
#lr = paddle.optimizer.lr.ExponentialDecay(learning_rate=config.get("learning_rate", 0.005), gamma=0.9, verbose=True)
optim = paddle.optimizer.Adam(learning_rate = lr, parameters = model.parameters())

W0109 17:23:01.197204   299 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0109 17:23:01.200810   299 device_context.cc:465] device: 0, cuDNN Version: 7.6.


## Start Training
Graph Neural Networks usually uses full batch training. However, GraphSAGE uses mini batch training. There are also algorithms that first partition the graph into subgraphs(Cluster-GCN)


In [None]:
epoch = 250

criterion = paddle.nn.loss.CrossEntropyLoss()


edges = dataset.edges
graph = dataset.graph
graph.tensor()
for epoch in range(epoch):
    # Full Batch Training
    # input shape = [N, Cin]
    # output shape [N, Co]

    #pgl.sampling.subgraph(graph, nodes, eid=None, edges=None, with_node_feat=True, with_edge_feat=True)
    #g = subgraph(graph=graph, nodes=train_index, edges=edges)
    #g.tensor()
    
    pred = model(graph, graph.node_feat["feat"])
    print(pred.shape)
    pred = paddle.gather(pred, train_index)
    loss = criterion(pred, train_label)
    loss.backward()
    acc = paddle.metric.accuracy(input=pred, label=train_label, k=1)
    
    optim.step()
    optim.clear_grad()
    
    #optim.minimize(loss)
    #optim.clear_grad()
    #if(epoch % 50 == 0):
    #    lr.step()
    
    # Full Batch Validation
    #g = subgraph(graph=graph, nodes=val_index, edges=edges)
    #g.tensor()
    val_pred = model(graph, graph.node_feat["feat"])
    val_pred = paddle.gather(val_pred, val_index)
    val_acc = paddle.metric.accuracy(input=val_pred, label=val_label, k=1)
    print("Epoch", epoch, "Train Acc", acc, "Valid Acc", val_acc)



[130644, 35]
Epoch 0 Train Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.01491422]) Valid Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.18794048])
[130644, 35]
Epoch 1 Train Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.18859899]) Valid Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.14850146])
[130644, 35]
Epoch 2 Train Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.14739802]) Valid Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.14850146])
[130644, 35]
Epoch 3 Train Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.14739802]) Valid Acc Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [0.14850146])
[130644, 35]
Epoch 4 Train Acc Tensor(shape=[1], dtype=float32, place=CUDAPl

## Saving the model 
We use the PaddlePaddle API to save the model parameters for further correcting and smoothing.

In [None]:
test_pred = model(graph, graph.node_feat["feat"])
test_pred = paddle.gather(test_pred, test_index)
test_pred = paddle.argmax(test_pred, axis=1)
test_index = np.array(test_index)
test_pred = np.array(test_pred)
submission = pd.DataFrame(data={
                            "nid": test_index.reshape(-1),
                            "label": test_pred.reshape(-1)
                        })
submission.to_csv("submission.csv", index=False)

#saving the state dict for smoothing final results 
paddle.save(model.state_dict(), "model_state_dict")




## Correct and Smooth
Using Label Propagation to smooth the predicted results.

In [None]:
from correctandsmooth import LayerPropagation, CorrectAndSmooth

model_state_dict = paddle.load('model_state_dict')
model.load_dict(model_state_dict)
y_pred = model(graph, graph.node_feat['feat']) 

y_soft = nn.functional.softmax(y_pred)

cas = CorrectAndSmooth(50, 0.979, 'DAD', 100, 0.5, 'DAD', 20.)

mask_idx = paddle.concat([train_index, val_index])
node_label = paddle.to_tensor(np.reshape(dataset.node_label, [-1 , 1]))

mask_label = paddle.gather(node_label, mask_idx)
mask_label = paddle.nn.functional.one_hot(mask_label, num_classes=35)
y_soft = cas.smooth(graph, y_soft, mask_label, mask_idx)



## Generating Submission Files
We have arrived at our last step! Use pandas to save the file to an csv

In [None]:
pred = paddle.argmax(y_soft, axis=-1, keepdim=True)
test_index = paddle.to_tensor(test_index)
pred = paddle.gather(pred, test_index)
test_index = np.array(test_index)
pred = np.array(pred)

test_index = np.array(test_index)
pred = np.array(pred)
submission = pd.DataFrame(data={
                            "nid": test_index.reshape(-1),
                            "label": pred.reshape(-1)
                        })
submission.to_csv("submission_cs.csv", index=False)
