In [None]:
%matplotlib inline
import os
import sys
import pandas as pd

In [2]:
import os
import sys
# Add the project root (one level up) to sys.path so the `src` package can be imported
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.features.build_features import load_data
from src.model.train import train_test

# Preface

The task is to predict which category each scientific paper in the CORA dataset falls into. Two inputs are available: a citation network linking the papers, and word-based features for each paper.

## Load CORA Data

In [3]:
A, X, y, idx_train, idx_val, idx_test = load_data("../data/cora/", "cora", 1000, 2000, 3000)

Loading cora dataset...
Graph Info:
 Name: G
Type: Graph
Number of nodes: 2715
Number of edges: 5278
Average degree:   3.8880


torch.linalg.eig returns complex tensors of dtype cfloat or cdouble rather than real tensors mimicking complex tensors.
L, _ = torch.eig(A)
should be replaced with
L_complex = torch.linalg.eigvals(A)
and
L, V = torch.eig(A, eigenvectors=True)
should be replaced with
L_complex, V_complex = torch.linalg.eig(A) (Triggered internally at  /pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp:2897.)
  evals, evecs = torch.eig (m, eigenvectors = True)  # get eigendecomposition


Shape of A:  torch.Size([2715, 2715])

Shape of X:  torch.Size([2708, 1433])

Adjacency Matrix (A):
 tensor([[1.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 1.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.0000,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.0000, 0.0000, 0.0000,  ..., 0.2000, 0.2000, 0.2582],
        [0.0000, 0.0000, 0.0000,  ..., 0.2000, 0.2000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.2582, 0.0000, 0.3333]])

Node Features Matrix (X):
 tensor([[-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667],
        [-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667],
        [-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667],
        ...,
        [-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667],
        [-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667],
        [-0.0771, -0.1111, -0.1629,  ..., -0.0471, -0.1568, -0.0667]])
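
The printed entries of A (1.0000 on the diagonal for some nodes, plus values such as 0.2000, 0.2582 and 0.3333) look consistent with the symmetrically normalized adjacency with self-loops, D^{-1/2} (A + I) D^{-1/2}, that is commonly used as GCN input. Whether `load_data` computes exactly this is an assumption. The sketch below uses a hypothetical seven-node toy graph in plain Python to show how such values can arise.

```python
import math

def normalize_adjacency(edges, n_nodes):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists."""
    # Start from the identity so that every node has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(n_nodes)] for i in range(n_nodes)]
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0  # undirected graph
    # Degree of each node, counting the self-loop.
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n_nodes)]
        for i in range(n_nodes)
    ]

# Hypothetical toy graph: node 0 has four neighbours, node 1 has two, node 6 is isolated.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
A_hat = normalize_adjacency(edges, n_nodes=7)

print(round(A_hat[0][0], 4))  # 0.2     (degree 5 including the self-loop)
print(round(A_hat[0][1], 4))  # 0.2582  (1 / sqrt(5 * 3))
print(round(A_hat[1][1], 4))  # 0.3333  (degree 3 including the self-loop)
print(round(A_hat[6][6], 4))  # 1.0     (isolated node, self-loop only)
```

Under this reading, a diagonal entry of 1.0000 marks a node with no neighbours, and smaller values mark better-connected nodes.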


# FCN (features only)

In [9]:
train_test(A, X, y, idx_train, idx_val, idx_test,
    no_cuda = False,
    seed = 40,
    epochs = 200,
    learning_rate = 0.0001,
    weight_decay = 5e-4,
    hidden_units = 256,
    dropout = 0.5,
    type = "FCN")

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
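
A device-side assert during training is often caused by an index outside its valid range, for example a class label outside [0, number of classes) or a split index beyond the number of rows. The shapes printed earlier (2715 rows in A against 2708 in X) are also worth checking. The helper below is a hypothetical sanity check in plain Python, not part of `src`; it only illustrates the kind of checks that could run before calling `train_test`.

```python
def check_inputs(n_adj_rows, n_feature_rows, labels, n_classes, index_sets):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if n_adj_rows != n_feature_rows:
        problems.append(f"A has {n_adj_rows} rows but X has {n_feature_rows} rows")
    bad_labels = [y for y in labels if not 0 <= y < n_classes]
    if bad_labels:
        problems.append(f"{len(bad_labels)} labels fall outside [0, {n_classes})")
    for name, idx in index_sets.items():
        out_of_range = [i for i in idx if not 0 <= i < len(labels)]
        if out_of_range:
            problems.append(f"{name} has {len(out_of_range)} indices beyond the label array")
    return problems

# Hypothetical values chosen to trigger each check; not taken from the real run.
problems = check_inputs(
    n_adj_rows=5,
    n_feature_rows=4,
    labels=[0, 1, 2, 7],
    n_classes=3,
    index_sets={"idx_train": [0, 1], "idx_test": [3, 4]},
)
for p in problems:
    print(p)
```

In the notebook the arguments would presumably come from `A.shape[0]`, `X.shape[0]`, the label tensor and the three index tensors. Re-running the cell with `no_cuda = True` is another option, since CPU errors are reported synchronously and usually name the offending index.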

# FCN (features and ad-hoc graph variables)

In [None]:
A1, X1, y1, idx_train, idx_val, idx_test = load_data("../data/twitch/", "musae_ENGB_", 1000, 2000, 3000, True)

In [None]:
train_test(A1, X1, y1, idx_train, idx_val, idx_test,
    no_cuda = False,
    seed = 40,
    epochs = 200,
    learning_rate = 0.0001,
    weight_decay = 5e-4,
    hidden_units = 256,
    dropout = 0.5,
    type = "FCN")

# node2vec (graph only)

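from node2vec import Node2Vec

# G is assumed to be the networkx graph of the dataset (it is not returned by load_data above)
# Precompute transition probabilities and generate the random walks
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)

# Embed nodes
mod = node2vec.fit(window=10, min_count=1, batch_words=4)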

Results 

Epoch: 0190 loss_train: 0.6387 acc_train: 0.6500 loss_val: 0.7089 acc_val: 0.5180 time: 0.0031s

Epoch: 0200 loss_train: 0.6372 acc_train: 0.6580 loss_val: 0.7102 acc_val: 0.5220 time: 0.0030s

Optimization Finished!
Total time elapsed: 0.6735s
Test set results: loss= 0.7037 accuracy= 0.5180

# node2vec (graph and features)

Epoch: 0190 loss_train: 0.6347 acc_train: 0.6420 loss_val: 0.7088 acc_val: 0.5095 time: 0.0045s

Epoch: 0200 loss_train: 0.6377 acc_train: 0.6420 loss_val: 0.7098 acc_val: 0.5095 time: 0.0030s

Optimization Finished!
Total time elapsed: 0.6197s
Test set results: loss= 0.7161 accuracy= 0.5093

# GCN (nodes and features)

In [None]:
train_test(A, X, y, idx_train, idx_val, idx_test,
    no_cuda = False,
    seed = 40,
    epochs = 500,
    learning_rate = 0.0001,
    weight_decay = 5e-4,
    hidden_units = 256,
    dropout = 0.5,
    type = "GCN")

# Discussion Questions

This node classification task is transductive: not every paper has a category label, and the unlabeled papers are already part of the citation network at training time. There is value in finding a suitable category for them based on related or similar papers in the network.

The train-test split can therefore be made over the nodes and their labels rather than by cutting the graph. The full network is still passed to the model, and the unlabeled nodes are the gaps in the network that we want to classify.
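
A minimal sketch of this kind of split, in plain Python with hypothetical helper names (`make_splits`, `masked_accuracy`) that are not part of `src`: every node stays in the input, and each split is only a list of node indices over which the loss or accuracy is evaluated.

```python
import random

def make_splits(n_nodes, n_train, n_val, n_test, seed=40):
    """Shuffle node indices and cut them into disjoint train/val/test index lists."""
    if n_train + n_val + n_test > n_nodes:
        raise ValueError("split sizes exceed the number of nodes")
    order = list(range(n_nodes))
    random.Random(seed).shuffle(order)
    idx_train = order[:n_train]
    idx_val = order[n_train:n_train + n_val]
    idx_test = order[n_train + n_val:n_train + n_val + n_test]
    return idx_train, idx_val, idx_test

def masked_accuracy(predictions, labels, idx):
    """Accuracy restricted to the nodes in idx; all other nodes are ignored."""
    correct = sum(1 for i in idx if predictions[i] == labels[i])
    return correct / len(idx)

# Hypothetical predictions for all 10 nodes of a toy graph (the full graph is always the input).
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
predictions = [0, 1, 0, 1, 0, 0, 0, 1, 1, 1]
idx_train, idx_val, idx_test = make_splits(n_nodes=10, n_train=4, n_val=3, n_test=3)

print(masked_accuracy(predictions, labels, idx_train))
print(masked_accuracy(predictions, labels, idx_test))
```

The same masking idea applies to the training loss: the model produces an output for every node, but only the training indices contribute to the gradient.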

*Summarize how each ML approach handles inductive graph learning (adding new nodes and edges at test-time). What computation has to occur at test-time?*

FCN (features only): This handles inductive graph learning without difficulty because the model uses no links between data points. At test time a new node only needs a forward pass over its own features.

FCN (features and ad-hoc graph variables): This also works, since the node features can carry the prediction even when a new data point has no ad-hoc graph summaries. I think the graph-derived inputs have to be imputed at test time.
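
As a rough illustration of that test-time step, the sketch below assumes the ad-hoc graph variable is node degree. That is an assumption, since the notebook does not say which variables are used. A new node with observed edges gets its degree computed directly; a node with no observed edges is treated as missing and receives the training mean.

```python
def degree_feature(node, edges):
    """Count the edges touching node; edges is a list of (u, v) pairs."""
    return sum(1 for u, v in edges if node in (u, v))

def graph_feature_or_impute(node, edges, train_mean):
    """Use the node's degree when it has edges, else fall back to the training mean."""
    deg = degree_feature(node, edges)
    return float(deg) if deg > 0 else train_mean

# Hypothetical training graph and its mean degree.
train_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
train_nodes = [0, 1, 2, 3]
train_mean = sum(degree_feature(n, train_edges) for n in train_nodes) / len(train_nodes)

# At test time node 4 arrives with one edge; node 5 arrives with none.
test_edges = train_edges + [(4, 2)]
print(graph_feature_or_impute(4, test_edges, train_mean))  # 1.0
print(graph_feature_or_impute(5, test_edges, train_mean))  # 2.0
```

Mean imputation is only one option. Recomputing the variables on the enlarged graph is more faithful but costs a pass over the affected edges.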

All graph-input approaches (node2vec, GCN): These need something like GraphSAGE, which generates embeddings by "sampling and aggregating features from a node’s local neighborhood".
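
The "sampling and aggregating" step can be sketched in plain Python. This is a simplified illustration with a mean aggregator and no learned weights, not the full GraphSAGE algorithm; the function name and toy data are hypothetical. In the real method the concatenated vector would then pass through a trained weight matrix and a nonlinearity.

```python
import random

def sage_mean_layer(node, features, neighbors, num_samples, rng):
    """One GraphSAGE-style step without learned weights:
    concatenate a node's own features with the mean of sampled neighbor features."""
    neigh = neighbors.get(node, [])
    if len(neigh) > num_samples:
        neigh = rng.sample(neigh, num_samples)  # fixed-size neighborhood sample
    dim = len(features[node])
    if neigh:
        mean = [sum(features[n][d] for n in neigh) / len(neigh) for d in range(dim)]
    else:
        mean = [0.0] * dim  # isolated node: nothing to aggregate
    return features[node] + mean

# Hypothetical toy data: node 3 is new at test time and links to nodes 0 and 1.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}
neighbors = {0: [1, 2, 3], 1: [0, 3], 2: [0], 3: [0, 1]}

rng = random.Random(40)
embedding = sage_mean_layer(3, features, neighbors, num_samples=5, rng=rng)
print(embedding)  # [0.5, 0.5, 0.5, 0.5]
```

Because the step only reads the features of a node and its sampled neighbors, it can be applied to a node that was absent during training, which is what makes the approach inductive.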