Confusion about ogbl-biokg #92

Closed
vymao opened this issue Nov 25, 2020 · 14 comments

@vymao

vymao commented Nov 25, 2020

Hi, could you explain the test data for ogbl-biokg in more detail?

> Specifically, we corrupt each test triplet edges by replacing its head or tail with randomly-sampled 1,000 negative entities (500 for head and 500 for tail), while ensuring the resulting triplets do not appear in KG.

I'm not sure I fully understand. Does this mean that you randomly sample, from all nodes, each head and tail (500 for each) for each test edge? If we are to predict the existence of edges, what will this information be used for?

@weihua916
Contributor

weihua916 commented Nov 25, 2020

Hi!

> Does this mean that you randomly sample, from all nodes, each head and tail (500 for each) for each test edge?

That's correct.

> If we are to predict the existence of edges, what will this information be used for?

Those negative entities are used to evaluate your ML models. Good ML models should rank the ground-truth positive entity higher than the negative entities for each test triplet.
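
To make the ranking idea concrete, here is a minimal sketch of how a positive score can be ranked against its negatives and summarized as MRR and Hits@K. The function names, the toy scores, and the tie handling (only strictly higher negatives push the positive down) are assumptions for illustration; this is not the OGB evaluator itself.

```python
def positive_rank(pos_score, neg_scores):
    """1-based rank of the positive entity among its negatives.

    Assumption: ties are resolved in favour of the positive.
    """
    return 1 + sum(1 for s in neg_scores if s > pos_score)


def mrr_and_hits(pos_scores, neg_score_lists, k=10):
    """Mean reciprocal rank and Hits@k over a batch of test triplets."""
    ranks = [positive_rank(p, negs) for p, negs in zip(pos_scores, neg_score_lists)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


# Toy scores: 2 test triplets with 3 negatives each (the dataset uses 500 per side).
pos_scores = [0.9, 0.2]
neg_score_lists = [[0.1, 0.5, 0.3], [0.4, 0.1, 0.6]]
mrr, hits_at_1 = mrr_and_hits(pos_scores, neg_score_lists, k=1)
```

In this toy case the first positive is ranked 1st and the second is ranked 3rd, so a better model would push the second positive's score above its negatives.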

@vymao
Author

vymao commented Nov 25, 2020

Does this mean we must test against every existing edge and the corresponding 1000 negative edges for each edge? Or can we sample the edges to get an estimate?

@weihua916
Contributor

weihua916 commented Nov 25, 2020

Hi! I am not sure if I understand your question, but please refer to our example code for how we evaluate model performance. The most relevant part is here.

@vymao
Author

vymao commented Nov 28, 2020

For example, the validation set has these keys: `dict_keys(['head_type', 'head', 'head_neg', 'relation', 'tail_type', 'tail', 'tail_neg'])`. So we have a positive edge from (`head`, `tail`) and 500 negative edges described by (`head_neg`, `tail_neg`). Is this correct?

@weihua916
Contributor

Correct.

@vymao
Author

vymao commented Nov 29, 2020

Will either `head_neg` or `tail_neg` reference nodes for which there is no positive edge feature data? It seems that the features (i.e., protein, function, side effect, etc.) that describe the nodes are only available for the positive edges.

So for example, if node 15230 is not in the positive edge data, but is either a negative head or tail of a negative edge, it seems we wouldn't know the classification of the node, since this isn't given for negative edges.

@weihua916
Contributor

weihua916 commented Nov 29, 2020

For the i-th validation triplet (`val['head'][i]` of type `val['head_type'][i]`, `val['relation'][i]`, `val['tail'][i]` of type `val['tail_type'][i]`), we corrupt the head and tail entities by sampling 500 negative entities for each; these are `val['head_neg'][i]` and `val['tail_neg'][i]`.
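
The layout described above can be sketched as follows. The `val` dictionary here is a hand-made toy with two negatives per side instead of 500, and `candidates` is a hypothetical helper, not part of the OGB package; the real split stores arrays, but the indexing pattern is the same.

```python
# Toy stand-in for the validation split (values are made up for illustration).
val = {
    "head_type": ["drug", "protein"],
    "head": [3, 7],
    "head_neg": [[0, 5], [1, 2]],   # negatives share the head's node type
    "relation": [0, 4],
    "tail_type": ["protein", "function"],
    "tail": [7, 11],
    "tail_neg": [[4, 9], [6, 8]],   # negatives share the tail's node type
}


def candidates(split, i):
    """Return the positive triplet i plus its head- and tail-corrupted variants."""
    h, r, t = split["head"][i], split["relation"][i], split["tail"][i]
    positive = (h, r, t)
    # Head corruption: swap the head only, keep relation and tail fixed.
    head_corrupted = [(h_neg, r, t) for h_neg in split["head_neg"][i]]
    # Tail corruption: swap the tail only, keep head and relation fixed.
    tail_corrupted = [(h, r, t_neg) for t_neg in split["tail_neg"][i]]
    return positive, head_corrupted, tail_corrupted


positive, head_corrupted, tail_corrupted = candidates(val, 0)
```

Note that a negative is a single replaced entity, so each corrupted triplet differs from the positive in exactly one position.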

@vymao
Author

vymao commented Nov 29, 2020

Yes, I am aware. But the head and tail of the negative entities are nodes, which should have labels (i.e., protein, function, side effect, etc.). Are we to assume that the label for every `val['head_neg'][i]` is the same as `val['head'][i]`? Or can the labels of `val['head_neg'][i]` be different, and if so, will this information always be listed within the training data?

@weihua916
Contributor

Yes, the node type of negative entities is the same as the positive entity. We have this in our dataset description: "we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities."

@vymao
Author

vymao commented Dec 1, 2020

Thank you. Another question: does `train_edge` contain every node within the graph? I tried mapping the node indices of the heads and tails, but that only seems to cover 48688 nodes. The maximum index in the `train_edge` head and tail is 45084.

If not, are we expected to predict edges for nodes that do not exist in the training data?

@weihua916
Contributor

They should contain all the nodes. I think you have not taken the node types into account when you count. Can you share your code to obtain 45084?

@vymao
Author

vymao commented Dec 1, 2020

Ah, I see. Is there a reason for this? I noticed the other link prediction datasets have the node indices combined.

@weihua916
Contributor

Because it is a heterogeneous graph.
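
As an illustration of why counting by raw index undercounts, and of one way to get a combined index if you need it, here is a small sketch. The node counts, the toy edges, the alphabetical ordering of types, and the helper names are all assumptions made up for this example; the real per-type counts come from the dataset.

```python
# Hypothetical node counts per type.
num_nodes_by_type = {"drug": 4, "protein": 6, "function": 3}


def build_offsets(counts):
    """Assign each node type a starting offset in a combined index space."""
    offsets = {}
    start = 0
    for node_type in sorted(counts):  # fixed order so the mapping is reproducible
        offsets[node_type] = start
        start += counts[node_type]
    return offsets, start


offsets, total_nodes = build_offsets(num_nodes_by_type)


def to_global(node_type, local_index):
    """Map a per-type index to a combined index."""
    return offsets[node_type] + local_index


# Toy edges as (head_type, head, tail_type, tail); index 0 appears in three types.
toy_edges = [("drug", 0, "protein", 0), ("protein", 0, "function", 0)]

# Counting by index alone merges distinct nodes of different types.
by_index_only = {idx for _, h, _, t in toy_edges for idx in (h, t)}
# Counting by (type, index) keeps them apart.
by_type_and_index = {
    pair for ht, h, tt, t in toy_edges for pair in ((ht, h), (tt, t))
}
```

Here `by_index_only` holds a single entry while `by_type_and_index` holds three, which is the kind of mismatch that appears when node types are ignored in the count.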

@vymao
Author

vymao commented Dec 2, 2020

Ok. In the paper, it says that "All relations are modeled as directed edges, among which the relations connecting the same entity types (e.g., protein-protein, drug-drug, function-function) are always symmetric, i.e., the edges are bi-directional." Are the bidirectional edges reflected in the edge index dictionary? As in, for some directed edge [A, B], is [B, A] also in the same list?

Also, do you know approximately the average out-degree of the nodes? Just want to check and make sure I processed it correctly.
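
Both points can be checked locally on the loaded edge lists. The following is a minimal sketch on a toy edge list for a single same-type relation; the edges and the helper names are made up for illustration and say nothing about what the actual dataset contains.

```python
from collections import Counter

# Toy directed edges (head, tail) for one same-type relation.
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2)]


def missing_reverse_edges(edge_list):
    """Edges (h, t) whose reverse (t, h) is absent from the same list."""
    edge_set = set(edge_list)
    return sorted((h, t) for (h, t) in edge_set if (t, h) not in edge_set)


def average_out_degree(edge_list, num_nodes):
    """Average number of outgoing edges per node."""
    return len(edge_list) / num_nodes


missing = missing_reverse_edges(edges)
out_degree = Counter(h for h, _ in edges)
avg = average_out_degree(edges, num_nodes=3)
```

An empty `missing` list for a relation would indicate that both directions are stored explicitly for that relation.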
