
Questions about use_valedges_as_input and train_on_subgraph on collab #2

Closed

skepsun opened this issue Apr 15, 2022 · 4 comments

@skepsun

skepsun commented Apr 15, 2022

Thanks for your excellent work on link prediction with GNNs. I have two questions about the tricks used on the ogbl-collab dataset.

For the trick 'use_valedges_as_input':
I note that in the original OGB example script this trick involves additional steps. During evaluation, only the raw training edges are used to obtain scores on the training and validation splits:

https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L140

Then the augmented training edges (including validation edges) are used to obtain the test scores:

https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L166

But in the PLNLP implementation, the raw training edges are replaced by the augmented version that includes validation edges, so the training, validation, and test scores are all computed on the augmented graph. The very high reported validation scores (100% Hits@50) therefore look over-fitted; they would otherwise be expected to be close to the test scores (~70% Hits@50).
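
For reference, a minimal sketch of the OGB-style evaluation pattern described above (the names model, predictor, and evaluate_split are illustrative placeholders, not the actual OGB script):

# Sketch of evaluation under use_valedges_as_input (illustrative names only).
# adj_train: adjacency built from the raw training edges only
# adj_full:  adjacency built from training + validation edges

h = model(x, adj_train)  # node embeddings from the raw training graph
train_score = evaluate_split(predictor, h, split_edge['train'])
valid_score = evaluate_split(predictor, h, split_edge['valid'])

h = model(x, adj_full)   # node embeddings from the augmented graph
test_score = evaluate_split(predictor, h, split_edge['test'])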

For the trick 'train_on_subgraph':
This trick restricts the time range of the training and validation edges to achieve better performance on the test edges. However, it seems that the test edges are also filtered (>= 2010) in PLNLP, which is a bit confusing to me, since the test set is effectively 'modified'.

@zhitao-wang
Owner

Thanks for your questions.

  1. This dataset allows validation links to be included in training once all hyperparameters have been finalized on the validation set. Our test scores are obtained by following this rule: fix all hyperparameters, then use the training and validation sets.
    See the issue "GraphSAGE (val as input) on collab does not reproduce the leaderboard results" (snap-stanford/ogb#84).
  2. Validation scores do not determine the rank on the leaderboard, and we also noticed that another method, "HOP-REC", uploaded over-fitted validation scores (100%) before us, which seems to be allowed. Therefore, we also uploaded the validation scores obtained when both the training and validation sets are used.
  3. Test edges are not filtered. This dataset is split by time, and all test edges are from 2019 (see https://ogb.stanford.edu/docs/linkprop/#ogbl-collab). 'train_on_subgraph' creates the adjacency matrix only from the filtered training and validation edges; it does not change the test edges but only reindexes them.

@skepsun
Author

skepsun commented Apr 18, 2022

Thanks for your detailed replies!

  1. Sorry for the unclear wording of my first question. I know that test scores are much more important than validation scores. The first question was only about reporting more 'precise' validation scores, as is done in the official OGB GraphSAGE (val as input) script.

  2. I also made some mistakes in my second question. The test edges are not filtered by the time range (>= 2010), since they are all from 2019; only the nodes of the test edges are reindexed. Test-edge nodes that do not exist in the filtered training & validation graph are reindexed as -1. As a result, edges whose source and destination nodes are both reindexed as -1 effectively become 'self-loops' of node -1. This happens in the reindexing step:

    PLNLP/main.py, line 172 (commit 3840ea9):
        split_edge['test']['edge'] = n_idx[split_edge['test']['edge']]

    PLNLP/main.py, line 173 (commit 3840ea9):
        split_edge['test']['edge_neg'] = n_idx[split_edge['test']['edge_neg']]

    Such reindexing effectively filters out these edges, since predicting self-loops makes no sense. I tried filtering the edges directly and counted the number of edges in each split before & after filtering:


import dgl
import numpy_indexed as npi
from ogb.linkproppred import DglLinkPropPredDataset
import torch

def filter_edge(split, nodes):
    # Keep only edges whose two endpoints both appear in `nodes`
    # (the nodes touched by >= 2010 edges).
    mask = npi.in_(split['edge'][:, 0], nodes) & npi.in_(split['edge'][:, 1], nodes)
    raw_num = len(mask)          # number of edges before filtering
    filtered_num = mask.sum()    # number of edges kept after filtering
    ratio = 1 - filtered_num / raw_num
    print(raw_num, filtered_num, f'{ratio*100:.4f}%')
    split['edge'] = split['edge'][mask]
    split['year'] = split['year'][mask]
    split['weight'] = split['weight'][mask]
    if 'edge_neg' in split.keys():
        # Apply the same node filter to the negative edges (valid/test splits only).
        mask = npi.in_(split['edge_neg'][:, 0], nodes) & npi.in_(split['edge_neg'][:, 1], nodes)
        split['edge_neg'] = split['edge_neg'][mask]
    return split

dataset = DglLinkPropPredDataset(name='ogbl-collab')
graph = dataset[0]

split_edge = dataset.get_edge_split()

# Keep only the edges from 2010 onwards and the nodes they touch.
mask = (graph.edata['year'] >= 2010).view(-1)

filtered_nodes = torch.cat([graph.edges()[0][mask], graph.edges()[1][mask]], dim=0).unique()
graph.remove_edges((~mask).nonzero(as_tuple=False).view(-1))

# Filter each split against the surviving node set.
split_edge['train'] = filter_edge(split_edge['train'], filtered_nodes)
split_edge['valid'] = filter_edge(split_edge['valid'], filtered_nodes)
split_edge['test'] = filter_edge(split_edge['test'], filtered_nodes)

The output is:

1179052 770389 34.6603%
60084 57987 3.4901%
46329 44455 4.0450%

About 4% of the test edges are filtered out by this script, or equivalently reindexed as 'self-loops' of node -1 in the PLNLP script. With this example script (directly filtering edges) I got a similar performance gain (64% -> 68.5%), which suggests that the reindexing and the filtering behave equivalently.
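
As a quick cross-check, one can also count, under the PLNLP-style reindexing, how many test edges have at least one endpoint mapped to -1 versus both endpoints mapped to -1 (a sketch that reuses the n_idx reindexing vector from PLNLP/main.py and the dataset loaded above; it is not part of either script):

# Inspect the reindexed test edges under the PLNLP-style n_idx mapping (sketch).
test_edge = dataset.get_edge_split()['test']['edge']   # original, un-reindexed test edges
reindexed = n_idx[test_edge]
any_unseen = (reindexed == -1).any(dim=1)    # at least one endpoint unseen (what direct filtering removes)
both_unseen = (reindexed == -1).all(dim=1)   # both endpoints unseen (becomes a '-1 self-loop')
print(int(any_unseen.sum()), int(both_unseen.sum()))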

@zhitao-wang
Copy link
Owner

zhitao-wang commented Apr 19, 2022

Thanks for your reminder.

  1. Unseen nodes reindexed as -1 result in unexpected self-loop edges in the test set, which we had not considered in our previous experiments. We also collected detailed statistics of the self-loop edges after filtering and reindexing:
import torch
from ogb.linkproppred import PygLinkPropPredDataset
from torch_geometric.utils import to_undirected
from torch_sparse import SparseTensor

dataset = PygLinkPropPredDataset(name='ogbl-collab')
data = dataset[0]
split_edge = dataset.get_edge_split()

if hasattr(data, 'num_nodes'):
    num_nodes = data.num_nodes
else:
    num_nodes = data.adj_t.size(0)

selected_year_index = torch.reshape(
    (split_edge['train']['year'] >= 2010).nonzero(as_tuple=False), (-1,))
split_edge['train']['edge'] = split_edge['train']['edge'][selected_year_index]
split_edge['train']['weight'] = split_edge['train']['weight'][selected_year_index]
split_edge['train']['year'] = split_edge['train']['year'][selected_year_index]
train_edge_index = split_edge['train']['edge'].t()
# create adjacency matrix from the filtered training edges only
new_edges = to_undirected(train_edge_index, split_edge['train']['weight'], reduce='add')
new_edge_index, new_edge_weight = new_edges[0], new_edges[1]
data.adj_t = SparseTensor(row=new_edge_index[0],
                          col=new_edge_index[1],
                          value=new_edge_weight.to(torch.float32))
data.edge_index = new_edge_index

full_edge_index = torch.cat([split_edge['valid']['edge'].t(), split_edge['train']['edge'].t()], dim=-1)
full_edge_weight = torch.cat([split_edge['valid']['weight'], split_edge['train']['weight']], dim=-1)
# create adjacency matrix from the filtered training + validation edges (use_valedges_as_input)
new_edges = to_undirected(full_edge_index, full_edge_weight, reduce='add')
new_edge_index, new_edge_weight = new_edges[0], new_edges[1]
data.adj_t = SparseTensor(row=new_edge_index[0],
                          col=new_edge_index[1],
                          value=new_edge_weight.to(torch.float32))
data.edge_index = new_edge_index

row, col, edge_weight = data.adj_t.coo()
subset = set(row.tolist()).union(set(col.tolist()))
subset, _ = torch.sort(torch.tensor(list(subset)))
# For unseen nodes, set the index to -1
n_idx = torch.zeros(num_nodes, dtype=torch.long) - 1
n_idx[subset] = torch.arange(subset.size(0))
# Reindex edge_index, adj_t, num_nodes
data.edge_index = n_idx[data.edge_index]
data.adj_t = SparseTensor(row=n_idx[row], col=n_idx[col], value=edge_weight)
num_nodes = subset.size(0)
if hasattr(data, 'x'):
    if data.x is not None:
        data.x = data.x[subset]
# Reindex train valid test edges
split_edge['train']['edge'] = n_idx[split_edge['train']['edge']]
split_edge['valid']['edge'] = n_idx[split_edge['valid']['edge']]
split_edge['valid']['edge_neg'] = n_idx[split_edge['valid']['edge_neg']]
split_edge['test']['edge'] = n_idx[split_edge['test']['edge']]
split_edge['test']['edge_neg'] = n_idx[split_edge['test']['edge_neg']]

test_index_dif = split_edge['test']['edge'][:, 0] - split_edge['test']['edge'][:, 1]
test_neg_index_dif = split_edge['test']['edge_neg'][:, 0] - split_edge['test']['edge_neg'][:, 1]
total_test_num, non_self_loop_test_num = len(test_index_dif), len(torch.nonzero(test_index_dif))
total_test_neg_num, non_self_loop_test_neg_num = len(test_neg_index_dif), len(torch.nonzero(test_neg_index_dif))
print(total_test_num, non_self_loop_test_num, (total_test_num - non_self_loop_test_num)/total_test_num)
print(total_test_neg_num, non_self_loop_test_neg_num, (total_test_neg_num - non_self_loop_test_neg_num)/total_test_neg_num)

The output is:

46329 46254 0.00161
100000 82182 0.17818

Negative test edges have a much larger proportion of "self-loop" pairs than positive test edges. Reindexing with -1 affects not only the positive test edges but, to an even larger extent, the negative test edges.

  2. In our previous experiments, we obtained very close performance with & without 'train_on_subgraph':
70.59 (with 'train_on_subgraph')
70.51 (without 'train_on_subgraph')

Our main purpose in using 'train_on_subgraph' is to reduce the number of parameters, since the embeddings of nodes that are unseen during training are never updated.
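
To illustrate the parameter saving, a minimal sketch (the subgraph node count and hidden size below are assumptions for illustration, not PLNLP's actual configuration):

import torch.nn as nn

full_num_nodes = 235868        # nodes in ogbl-collab
subgraph_num_nodes = 170000    # assumed node count of the >= 2010 subgraph (illustrative)
hidden_dim = 256               # assumed embedding size

full_emb = nn.Embedding(full_num_nodes, hidden_dim)     # one embedding per node in the full graph
sub_emb = nn.Embedding(subgraph_num_nodes, hidden_dim)  # embeddings only for reindexed subgraph nodes

print(sum(p.numel() for p in full_emb.parameters()))  # ~60.4M parameters
print(sum(p.numel() for p in sub_emb.parameters()))   # ~43.5M parameters; unseen nodes get no embedding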

Please use our code without "train_on_subgraph" for now; we will update the code to avoid the situation described above.

@skepsun
Copy link
Author

skepsun commented Apr 19, 2022

Thanks for your kind explanations! The negative test edges indeed contain many more self-loops, and the effect of this needs to be explored. I will try your code with different settings.

skepsun closed this as completed Apr 19, 2022