This notebook contains simple code for viewing the Microsoft Academic Graph node IDs that we used for the real-data experiments in our paper.

In [None]:
import pickle
import os

In [None]:
DATA_DIR = "data/"

In [None]:
# Dataset 1.

# Each list contains the IDs of the nodes from the Microsoft Academic Graph that
# we use in our experiments.
# Open each file in a `with` block so the file handle is closed after loading.
with open(os.path.join(DATA_DIR, "dataset_1/train_nodes.pkl"), "rb") as f:
    train_nodes = pickle.load(f)
with open(os.path.join(DATA_DIR, "dataset_1/val_nodes.pkl"), "rb") as f:
    val_nodes = pickle.load(f)
with open(os.path.join(DATA_DIR, "dataset_1/test_nodes.pkl"), "rb") as f:
    test_nodes = pickle.load(f)

In [None]:
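class PaperIdAndIndexMap:
    """Two-way mapping between paper IDs and contiguous integer indices."""

    def __init__(self, topo_sorted_nodes):
        # Indices follow the order of `topo_sorted_nodes`: the i-th paper ID
        # in the sequence is assigned index i.
        self.paper_id_to_idx = {}
        self.idx_to_paper_id = {}
        for idx, paper_id in enumerate(topo_sorted_nodes):
            self.paper_id_to_idx[paper_id] = idx
            self.idx_to_paper_id[idx] = paper_id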
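
As a quick illustration of how the map behaves, the sketch below builds it from a tiny, made-up list of paper IDs and checks that the two dictionaries invert each other. The IDs in `toy_nodes` are hypothetical, not real Microsoft Academic Graph nodes, and the class is repeated only so that the snippet runs on its own.

```python
class PaperIdAndIndexMap:
    """Two-way mapping between paper IDs and contiguous integer indices."""

    def __init__(self, topo_sorted_nodes):
        self.paper_id_to_idx = {}
        self.idx_to_paper_id = {}
        for idx, paper_id in enumerate(topo_sorted_nodes):
            self.paper_id_to_idx[paper_id] = idx
            self.idx_to_paper_id[idx] = paper_id


# Hypothetical paper IDs, used only for illustration.
toy_nodes = [9001, 42, 77]
id_map = PaperIdAndIndexMap(toy_nodes)

# The first ID in the input sequence is assigned index 0.
first_idx = id_map.paper_id_to_idx[9001]

# Mapping ID -> index -> ID should return the original sequence.
round_trip = [
    id_map.idx_to_paper_id[id_map.paper_id_to_idx[paper_id]]
    for paper_id in toy_nodes
]
print("Index of first paper: %d" % first_idx)
print("Round trip matches input: %s" % (round_trip == toy_nodes))
```

Note that if the input sequence contained a repeated ID, the later index would overwrite the earlier one in `paper_id_to_idx`, so the map is best used with a list of unique IDs.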

In [None]:
print("Number of training nodes: %d" % len(train_nodes))
print("Number of validation nodes: %d" % len(val_nodes))
print("Number of test nodes: %d" % len(test_nodes))

Number of training nodes: 1709405
Number of validation nodes: 244200
Number of test nodes: 488403


In [None]:
# Dataset 2.

# Each list contains the IDs of the nodes from the Microsoft Academic Graph that
# we use in our experiments.
# Open each file in a `with` block so the file handle is closed after loading.
with open(os.path.join(DATA_DIR, "dataset_2/train_nodes.pkl"), "rb") as f:
    train_nodes = pickle.load(f)
with open(os.path.join(DATA_DIR, "dataset_2/val_nodes.pkl"), "rb") as f:
    val_nodes = pickle.load(f)
with open(os.path.join(DATA_DIR, "dataset_2/test_nodes.pkl"), "rb") as f:
    test_nodes = pickle.load(f)

In [None]:
print("Number of training nodes: %d" % len(train_nodes))
print("Number of validation nodes: %d" % len(val_nodes))
print("Number of test nodes: %d" % len(test_nodes))

Number of training nodes: 930064
Number of validation nodes: 132866
Number of test nodes: 265734
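
The counts printed above also determine the relative size of each split. The following sketch recomputes those proportions from the printed numbers, which are hard-coded so that it runs without the pickle files. The helper name `split_fractions` is introduced here for illustration only.

```python
def split_fractions(n_train, n_val, n_test):
    """Return the (train, val, test) sizes as fractions of the total."""
    total = n_train + n_val + n_test
    return tuple(n / total for n in (n_train, n_val, n_test))


# Node counts copied from the cell outputs in this notebook.
counts = {
    "dataset_1": (1709405, 244200, 488403),
    "dataset_2": (930064, 132866, 265734),
}

fractions = {name: split_fractions(*c) for name, c in counts.items()}
for name, (train_frac, val_frac, test_frac) in fractions.items():
    print("%s: train %.3f, val %.3f, test %.3f"
          % (name, train_frac, val_frac, test_frac))
```

For both datasets the proportions come out at roughly 70% training, 10% validation, and 20% test.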


In our paper, we preprocess each paper's text (title and abstract) into a 768-dimensional embedding using [SciBERT](https://github.com/allenai/scibert) and [bert-as-service](https://github.com/hanxiao/bert-as-service).

We provide the preprocessed data with the 768-dimensional embeddings for both datasets used in our paper. The data is a dictionary that maps each paper ID to its text embedding and can be loaded with the `pickle` library. It can be downloaded [here](https://drive.google.com/file/d/1cfR6strHk3SoSUHbYv_yY1fXbgWZaP5T/view?usp=sharing).

In [None]:
# A dictionary where each key is a Microsoft Academic Graph node ID representing
# an academic paper and each value is a 768-dimensional `np.array` holding
# the text embedding that we use in our training pipeline.
# `nodes_to_scibert_embedding_dataset_{1,2}.pkl` can be downloaded from the link above.
with open(os.path.join(DATA_DIR, "nodes_to_scibert_embedding_dataset_1.pkl"), "rb") as f:
    nodes_to_scibert = pickle.load(f)
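
Once the dictionary is loaded, a common next step is to stack the embeddings for a list of node IDs into a single feature matrix. The sketch below shows one possible way to do this with NumPy. It uses a small random dictionary, `toy_nodes_to_scibert`, as a hypothetical stand-in for `nodes_to_scibert`, and the helper `stack_embeddings` is introduced here for illustration and is not part of our training pipeline. Because this notebook does not verify that every node ID has an embedding, the helper skips IDs that are missing from the dictionary.

```python
import numpy as np

EMBEDDING_DIM = 768

# Hypothetical stand-in for `nodes_to_scibert`: three made-up paper IDs,
# each mapped to a random 768-dimensional vector.
rng = np.random.default_rng(0)
toy_nodes_to_scibert = {
    paper_id: rng.standard_normal(EMBEDDING_DIM)
    for paper_id in (101, 202, 303)
}


def stack_embeddings(node_ids, nodes_to_embedding):
    """Return (found_ids, matrix) with one row per node ID that has an embedding.

    IDs absent from `nodes_to_embedding` are skipped; row order follows
    the order of `node_ids`.
    """
    found_ids = [n for n in node_ids if n in nodes_to_embedding]
    if not found_ids:
        return found_ids, np.empty((0, EMBEDDING_DIM))
    matrix = np.stack([nodes_to_embedding[n] for n in found_ids])
    return found_ids, matrix


# ID 999 has no embedding in the toy dictionary, so it is dropped.
found_ids, features = stack_embeddings([303, 101, 999], toy_nodes_to_scibert)
print("IDs with embeddings:", found_ids)
print("Feature matrix shape:", features.shape)
```

With the real data, the same call could be made with `train_nodes` and `nodes_to_scibert` in place of the toy inputs, memory permitting.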