
Questions about use_valedges_as_input and train_on_subgraph on collab #2

Closed

skepsun opened this issue Apr 15, 2022 · 4 comments

@skepsun

skepsun commented Apr 15, 2022

Thanks for your excellent work on link prediction with GNNs. I have two questions about the tricks used on the ogbl-collab dataset.

For the trick 'use_valedges_as_input':
I note that in the original OGB example script this trick involves additional steps. During evaluation, only the raw training edges are used to obtain scores on the training and validation splits:

https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L140

Then the augmented training edges (including validation edges) are used to obtain the test scores:

https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L166

But in the PLNLP implementation, the raw training edges are replaced by the augmented version that includes validation edges, so the training, validation, and test scores are all computed on the augmented graph. The very high reported validation scores (100% Hits@50) therefore look over-fitted; they would otherwise be expected to be close to the test scores (~70% Hits@50).
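
For reference, a minimal sketch of the OGB-style evaluation pattern described above (the names model, predictor, and evaluate_split are illustrative placeholders, not the actual OGB script):

# Sketch of evaluation under use_valedges_as_input (illustrative names only).
# adj_train: adjacency built from the raw training edges only
# adj_full:  adjacency built from training + validation edges

h = model(x, adj_train)  # node embeddings from the raw training graph
train_score = evaluate_split(predictor, h, split_edge['train'])
valid_score = evaluate_split(predictor, h, split_edge['valid'])

h = model(x, adj_full)   # node embeddings from the augmented graph
test_score = evaluate_split(predictor, h, split_edge['test'])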

For the trick 'train_on_subgraph':
This trick restricts the time range of the training and validation edges to achieve better performance on the test edges. However, it seems that the test edges are also filtered (>= 2010) in PLNLP, which is a bit confusing to me, since the test set is effectively 'modified'.

@zhitao-wang
Owner

Thanks for your questions.

  1. This dataset allows validation links to be included in training once all hyperparameters have been finalized on the validation set. Our test scores are obtained by following this rule: fix all hyperparameters, then use the training and validation sets.
    See the issue "GraphSAGE (val as input) on collab does not reproduce the leaderboard results" (snap-stanford/ogb#84).
  2. Validation scores do not determine the rank on the leaderboard, and we also noticed that another method, "HOP-REC", uploaded over-fitted validation scores (100%) before us, which seems to be allowed. Therefore, we also uploaded the validation scores obtained when both the training and validation sets are used.
  3. Test edges are not filtered. This dataset is split by time, and all test edges are from 2019 (see https://ogb.stanford.edu/docs/linkprop/#ogbl-collab). 'train_on_subgraph' creates the adjacency matrix only from the filtered training and validation edges; it does not change the test edges but only reindexes them.

@skepsun
Author

skepsun commented Apr 18, 2022

Thanks for your detailed replies!

  1. Sorry for the unclear wording of my first question. I know that test scores are much more important than validation scores. The first question was only about reporting more 'precise' validation scores, as is done in the official OGB GraphSAGE (val as input) script.

  2. I also made some mistakes in my second question. The test edges are not filtered by the time range (>= 2010), since they are all from 2019; only the nodes of the test edges are reindexed. Test-edge nodes that do not exist in the filtered training & validation graph are reindexed as -1. As a result, edges whose source and destination nodes are both reindexed as -1 effectively become 'self-loops' of node -1. This happens in the reindexing step:

    PLNLP/main.py, line 172 (commit 3840ea9):
        split_edge['test']['edge'] = n_idx[split_edge['test']['edge']]

    PLNLP/main.py, line 173 (commit 3840ea9):
        split_edge['test']['edge_neg'] = n_idx[split_edge['test']['edge_neg']]

    Such reindexing effectively filters out these edges, since predicting self-loops makes no sense. I tried filtering the edges directly and counted the number of edges in each split before & after filtering:


import dgl
import numpy_indexed as npi
from ogb.linkproppred import DglLinkPropPredDataset
import torch

def filter_edge(split, nodes):
    # Keep only edges whose two endpoints both appear in `nodes`
    # (the nodes touched by >= 2010 edges).
    mask = npi.in_(split['edge'][:, 0], nodes) & npi.in_(split['edge'][:, 1], nodes)
    raw_num = len(mask)          # number of edges before filtering
    filtered_num = mask.sum()    # number of edges kept after filtering
    ratio = 1 - filtered_num / raw_num
    print(raw_num, filtered_num, f'{ratio*100:.4f}%')
    split['edge'] = split['edge'][mask]
    split['year'] = split['year'][mask]
    split['weight'] = split['weight'][mask]
    if 'edge_neg' in split.keys():
        # Apply the same node filter to the negative edges (valid/test splits only).
        mask = npi.in_(split['edge_neg'][:, 0], nodes) & npi.in_(split['edge_neg'][:, 1], nodes)
        split['edge_neg'] = split['edge_neg'][mask]
    return split

dataset = DglLinkPropPredDataset(name='ogbl-collab')
graph = dataset[0]

split_edge = dataset.get_edge_split()

# Keep only the edges from 2010 onwards and the nodes they touch.
mask = (graph.edata['year'] >= 2010).view(-1)

filtered_nodes = torch.cat([graph.edges()[0][mask], graph.edges()[1][mask]], dim=0).unique()
graph.remove_edges((~mask).nonzero(as_tuple=False).view(-1))

# Filter each split against the surviving node set.
split_edge['train'] = filter_edge(split_edge['train'], filtered_nodes)
split_edge['valid'] = filter_edge(split_edge['valid'], filtered_nodes)
split_edge['test'] = filter_edge(split_edge['test'], filtered_nodes)

The output is:

1179052 770389 34.6603%
60084 57987 3.4901%
46329 44455 4.0450%

About 4% of the test edges are filtered out by this script, or equivalently reindexed as 'self-loops' of node -1 in the PLNLP script. With this example script (directly filtering edges) I got a similar performance gain (64% -> 68.5%), which suggests that the reindexing and the filtering behave equivalently.
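
As a quick cross-check, one can also count, under the PLNLP-style reindexing, how many test edges have at least one endpoint mapped to -1 versus both endpoints mapped to -1 (a sketch that reuses the n_idx reindexing vector from PLNLP/main.py and the dataset loaded above; it is not part of either script):

# Inspect the reindexed test edges under the PLNLP-style n_idx mapping (sketch).
test_edge = dataset.get_edge_split()['test']['edge']   # original, un-reindexed test edges
reindexed = n_idx[test_edge]
any_unseen = (reindexed == -1).any(dim=1)    # at least one endpoint unseen (what direct filtering removes)
both_unseen = (reindexed == -1).all(dim=1)   # both endpoints unseen (becomes a '-1 self-loop')
print(int(any_unseen.sum()), int(both_unseen.sum()))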

@zhitao-wang
Copy link
Owner

zhitao-wang commented Apr 19, 2022

Thanks for your reminder.

  1. Unseen nodes reindexed as -1 result in unexpected self-loop edges in the test set, which we had not considered in our previous experiments. We also collected detailed statistics of the self-loop edges after filtering and reindexing:
import torch
from ogb.linkproppred import PygLinkPropPredDataset
from torch_geometric.utils import to_undirected
from torch_sparse import SparseTensor

dataset = PygLinkPropPredDataset(name='ogbl-collab')
data = dataset[0]
split_edge = dataset.get_edge_split()

if hasattr(data, 'num_nodes'):
    num_nodes = data.num_nodes
else:
    num_nodes = data.adj_t.size(0)

selected_year_index = torch.reshape(
    (split_edge['train']['year'] >= 2010).nonzero(as_tuple=False), (-1,))
split_edge['train']['edge'] = split_edge['train']['edge'][selected_year_index]
split_edge['train']['weight'] = split_edge['train']['weight'][selected_year_index]
split_edge['train']['year'] = split_edge['train']['year'][selected_year_index]
train_edge_index = split_edge['train']['edge'].t()
# create adjacency matrix from the filtered training edges only
new_edges = to_undirected(train_edge_index, split_edge['train']['weight'], reduce='add')
new_edge_index, new_edge_weight = new_edges[0], new_edges[1]
data.adj_t = SparseTensor(row=new_edge_index[0],
                          col=new_edge_index[1],
                          value=new_edge_weight.to(torch.float32))
data.edge_index = new_edge_index

full_edge_index = torch.cat([split_edge['valid']['edge'].t(), split_edge['train']['edge'].t()], dim=-1)
full_edge_weight = torch.cat([split_edge['valid']['weight'], split_edge['train']['weight']], dim=-1)
# create adjacency matrix from the filtered training + validation edges (use_valedges_as_input)
new_edges = to_undirected(full_edge_index, full_edge_weight, reduce='add')
new_edge_index, new_edge_weight = new_edges[0], new_edges[1]
data.adj_t = SparseTensor(row=new_edge_index[0],
                          col=new_edge_index[1],
                          value=new_edge_weight.to(torch.float32))
data.edge_index = new_edge_index

row, col, edge_weight = data.adj_t.coo()
subset = set(row.tolist()).union(set(col.tolist()))
subset, _ = torch.sort(torch.tensor(list(subset)))
# For unseen nodes, set the index to -1
n_idx = torch.zeros(num_nodes, dtype=torch.long) - 1
n_idx[subset] = torch.arange(subset.size(0))
# Reindex edge_index, adj_t, num_nodes
data.edge_index = n_idx[data.edge_index]
data.adj_t = SparseTensor(row=n_idx[row], col=n_idx[col], value=edge_weight)
num_nodes = subset.size(0)
if hasattr(data, 'x'):
    if data.x is not None:
        data.x = data.x[subset]
# Reindex train valid test edges
split_edge['train']['edge'] = n_idx[split_edge['train']['edge']]
split_edge['valid']['edge'] = n_idx[split_edge['valid']['edge']]
split_edge['valid']['edge_neg'] = n_idx[split_edge['valid']['edge_neg']]
split_edge['test']['edge'] = n_idx[split_edge['test']['edge']]
split_edge['test']['edge_neg'] = n_idx[split_edge['test']['edge_neg']]

test_index_dif = split_edge['test']['edge'][:, 0] - split_edge['test']['edge'][:, 1]
test_neg_index_dif = split_edge['test']['edge_neg'][:, 0] - split_edge['test']['edge_neg'][:, 1]
total_test_num, non_self_loop_test_num = len(test_index_dif), len(torch.nonzero(test_index_dif))
total_test_neg_num, non_self_loop_test_neg_num = len(test_neg_index_dif), len(torch.nonzero(test_neg_index_dif))
print(total_test_num, non_self_loop_test_num, (total_test_num - non_self_loop_test_num)/total_test_num)
print(total_test_neg_num, non_self_loop_test_neg_num, (total_test_neg_num - non_self_loop_test_neg_num)/total_test_neg_num)

The output is:

46329 46254 0.00161
100000 82182 0.17818

Negative test edges have a much larger proportion of "self-loop" pairs than positive test edges. Reindexing with -1 affects not only the positive test edges but, to an even larger extent, the negative test edges.

  2. In our previous experiments, we obtained very close performance with & without 'train_on_subgraph':
70.59 (with 'train_on_subgraph')
70.51 (without 'train_on_subgraph')

Our main purpose in using 'train_on_subgraph' is to reduce the number of parameters, since the embeddings of nodes that are unseen during training are never updated.
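
To illustrate the parameter saving, a minimal sketch (the subgraph node count and hidden size below are assumptions for illustration, not PLNLP's actual configuration):

import torch.nn as nn

full_num_nodes = 235868        # nodes in ogbl-collab
subgraph_num_nodes = 170000    # assumed node count of the >= 2010 subgraph (illustrative)
hidden_dim = 256               # assumed embedding size

full_emb = nn.Embedding(full_num_nodes, hidden_dim)     # one embedding per node in the full graph
sub_emb = nn.Embedding(subgraph_num_nodes, hidden_dim)  # embeddings only for reindexed subgraph nodes

print(sum(p.numel() for p in full_emb.parameters()))  # ~60.4M parameters
print(sum(p.numel() for p in sub_emb.parameters()))   # ~43.5M parameters; unseen nodes get no embedding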

Please use our code without "train_on_subgraph" for now; we will update the code to avoid the situation described above.

@skepsun
Copy link
Author

skepsun commented Apr 19, 2022

Thanks for your kind explanations! The negative test edges indeed contain many more self-loops, and the effect of this needs to be explored. I will try your code with different settings.

skepsun closed this as completed Apr 19, 2022