
use the whole graph adjacency matrix for link prediction task? #72

Closed

LeeJunHyun opened this issue Oct 2, 2020 · 14 comments

Comments

@LeeJunHyun

LeeJunHyun commented Oct 2, 2020

https://github.com/snap-stanford/ogb/blob/master/examples/linkproppred/collab/gnn.py#L106

It seems that the GNN model takes the whole adjacency matrix (data.adj_t).

But as far as I know, in the standard setting, the GNN takes an incomplete set of edges (split_edge['train']['edge']) and predicts the rest (split_edge['valid'] and split_edge['test']).

Should I fix this, or could you please give me a reference for this setting?

I really appreciate the great work you have all put in.
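For reference, the standard protocol the question describes can be sketched as follows. This is a generic illustration, not OGB's actual code: the graph the model sees is built only from the training edges, and the held-out edges are used purely as prediction targets. The common-neighbor score here is a hypothetical stand-in for a trained GNN link scorer.

```python
# Minimal sketch of the standard link-prediction protocol:
# the input graph is built ONLY from training edges, while
# validation/test edges are used purely as prediction targets.
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency built from the training split only."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbor_score(adj, u, v):
    """Stand-in for a GNN link scorer: number of shared neighbors."""
    return len(adj[u] & adj[v])

train_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
test_edges = [(1, 3)]  # held out: never added to the adjacency

adj = build_adjacency(train_edges)
scores = {e: common_neighbor_score(adj, *e) for e in test_edges}
print(scores)  # nodes 1 and 3 share neighbor 2 -> {(1, 3): 1}
```

The key property is that `test_edges` never influence `adj`, which is exactly the "no leak" condition discussed below.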

@weihua916
Contributor

Hi, the graph object only contains training edges, so there is no information leak.

@LeeJunHyun
Author

LeeJunHyun commented Oct 3, 2020

Hi @weihua916 ,

Thanks for your reply.

Do you mean that the graph object (PygLinkPropPredDataset()) only contains training edges?

PygLinkPropPredDataset() and data.adj_t have the same number of edges.

Then how did you extract the test and valid edges?

In the dataset code, https://github.com/snap-stanford/ogb/blob/master/ogb/linkproppred/dataset_pyg.py#L67
they are just loaded from the split path.

>>> dataset
PygLinkPropPredDataset()

>>> dataset[0]
Data(edge_index=[2, 2358104], edge_weight=[2358104, 1], edge_year=[2358104, 1], x=[235868, 128])

>>> data.adj_t
SparseTensor(row=tensor([     0,      0,      0,  ..., 235867, 235867, 235867]),
             col=tensor([ 20649,  21913,  46512,  ..., 230251, 230251, 235583]),
             val=tensor([1., 1., 1.,  ..., 1., 3., 2.]),
             size=(235868, 235868), nnz=2358104, density=0.00%)

>>> split_edge['train']['edge'].shape
torch.Size([1179052, 2])

@LeeJunHyun
Author

LeeJunHyun commented Oct 3, 2020

This is what I know as the standard setting, so could you please point me to some other references that you have?
(The image below is from the GRL book.)
[image: figure from the GRL book illustrating the standard link-prediction setup]

@weihua916
Contributor

1179052 * 2 = 2358104.

@LeeJunHyun
Author

LeeJunHyun commented Oct 3, 2020

Maybe there is something I missed: is adj_t equal to split_edge['train']['edge']?

@weihua916
Contributor

https://ogb.stanford.edu/docs/nodeprop/
Note: For undirected graphs, the loaded graphs will have a doubled number of edges because we add the bidirectional edges automatically.
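The arithmetic above (1179052 * 2 = 2358104) follows directly from this doubling. A small sketch, using an illustrative helper (`to_bidirectional` is not an OGB function) to show why nnz(adj_t) is exactly twice the number of stored training edges:

```python
# OGB stores each undirected training edge once; the loaded graph
# contains both directions, so edge counts double on load.

def to_bidirectional(edge_list):
    """Add the reverse of every edge, mirroring what the loader does
    for undirected graphs (illustrative helper, not OGB's API)."""
    return edge_list + [(v, u) for u, v in edge_list]

# Numbers from the thread: train split size vs. nnz of data.adj_t.
n_train = 1179052
assert 2 * n_train == 2358104

directed = to_bidirectional([(0, 1), (1, 2)])
print(directed)  # [(0, 1), (1, 2), (1, 0), (2, 1)]
```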

@LeeJunHyun
Author

I thought that the valid and test edges come from adj_t.

Then could you tell me where the valid and test edges actually come from?

@weihua916
Contributor

They are from split_edge['valid']['edge'] and split_edge['test']['edge'].

@LeeJunHyun
Author

LeeJunHyun commented Oct 3, 2020

Oh, I mean: where do split_edge['valid']['edge'] and split_edge['test']['edge'] come from?

In the dataset code, https://github.com/snap-stanford/ogb/blob/master/ogb/linkproppred/dataset_pyg.py#L67
they are just loaded from the split path.
(There are train.pt, valid.pt, and test.pt in the split path.)

How did you create train.pt, valid.pt, and test.pt?

@LeeJunHyun
Author

I also ran a check for overlapping edges between data.adj_t and split_edge['test']['edge'], and there are edges that appear in both the train and test sets.
Please let me know if there is another point that I'm still missing.

# Collect all (row, col) pairs from the training adjacency matrix
row, col, _ = data.adj_t.coo()
total_edge = torch.stack([row, col], dim=0).t()

# For each test edge, look for an identical pair among the training edges
for test_edge in split_edge['test']['edge']:
    overlap_validation = (test_edge == total_edge).sum(dim=1) == 2
    if overlap_validation.max() > 0:
        overlap_idx = overlap_validation.nonzero().squeeze()
        print(f'Test edge: {test_edge}')
        print(f'Train edge: {total_edge[overlap_idx, :]}')
        print('*' * 20)
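As an aside, the same check can be done without scanning the full edge list once per test edge by hashing the pairs into a set. This is a hypothetical helper for illustration, not part of OGB, sketched here on plain Python tuples:

```python
# Set-based overlap check: O(n + m) instead of O(n * m) pair scans.

def overlapping_edges(train_edges, test_edges):
    """Return the test edges that also appear (same direction) among
    the train edges. Inputs are iterables of (u, v) pairs."""
    train_set = set(map(tuple, train_edges))
    return [e for e in map(tuple, test_edges) if e in train_set]

train = [(105700, 201535), (159698, 220004), (5, 6)]
test = [(105700, 201535), (7, 8)]
print(overlapping_edges(train, test))  # [(105700, 201535)]
```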

@LeeJunHyun
Author

The results:

Test edge: tensor([105700, 201535])
Train edge: tensor([[105700, 201535],
[105700, 201535],
[105700, 201535],
[105700, 201535],
[105700, 201535]])


Test edge: tensor([159698, 220004])
Train edge: tensor([159698, 220004])


Test edge: tensor([ 10737, 220004])
Train edge: tensor([[ 10737, 220004],
[ 10737, 220004],
[ 10737, 220004]])


Test edge: tensor([106042, 220004])
Train edge: tensor([[106042, 220004],
[106042, 220004]])


Test edge: tensor([131588, 161470])
Train edge: tensor([[131588, 161470],
[131588, 161470]])


Test edge: tensor([117925, 112050])
Train edge: tensor([[117925, 112050],
[117925, 112050],
[117925, 112050],
[117925, 112050]])
...

@LeeJunHyun
Author

When I set total_edge = split_edge['train']['edge'], there are also overlapping edges (between split_edge['train']['edge'] and split_edge['test']['edge']).

@weihua916
Contributor

The overlapping edges are expected in ogbl-collab. See https://ogb.stanford.edu/docs/linkprop/#ogbl-collab.

@LeeJunHyun
Author

Because the dataset is split by time (year), it makes sense now.

Thanks for your help!
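To make the resolution concrete: in a time-based split like ogbl-collab's, the same node pair can legitimately appear in both train and test, because the pair interacted in both periods. A minimal sketch, with a made-up toy dataset and cutoff years (not the dataset's real values):

```python
# Time-based link split: edges up to the validation cutoff go to train,
# later ones to valid/test. A pair that interacts in both periods shows
# up in both splits, which is the overlap observed in the thread.

def split_by_year(edges_with_year, valid_year, test_year):
    train = [(u, v) for u, v, y in edges_with_year if y < valid_year]
    valid = [(u, v) for u, v, y in edges_with_year
             if valid_year <= y < test_year]
    test = [(u, v) for u, v, y in edges_with_year if y >= test_year]
    return train, valid, test

# Pair (1, 2) collaborates in 2015 and again in 2019.
edges = [(1, 2, 2015), (1, 2, 2019), (3, 4, 2016), (3, 4, 2018)]
train, valid, test = split_by_year(edges, 2017, 2019)
print(train)  # [(1, 2), (3, 4)]
print(test)   # [(1, 2)] -- same pair as in train, from a later year
```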
