Confusion about ogbl-biokg #92

vymao · 2020-11-25T00:20:44Z

Hi, could you explain the test data for ogbl-biokg in more detail?

"Specifically, we corrupt each test triplet edges by replacing its head or tail with randomly-sampled 1,000 negative entities (500 for head and 500 for tail), while ensuring the resulting triplets do not appear in KG."

I'm not sure I fully understand. Does this mean that you randomly sample, from all nodes, each head and tail (500 for each) for each test edge? If we are to predict the existence of edges, what will this information be used for?

weihua916 · 2020-11-25T06:34:21Z

Hi!

"Does this mean that you randomly sample, from all nodes, each head and tail (500 for each) for each test edge?"

That's correct.

"If we are to predict the existence of edges, what will this information be used for?"

Those negative entities are used to evaluate your ML models. Good ML models should rank the ground-truth positive entity higher than the negative entities for each test triplet.
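To make the ranking idea concrete, here is a minimal sketch of a reciprocal-rank computation in plain Python. It is not the OGB evaluator: the function names are hypothetical, higher scores are assumed to be better, and ties are resolved in favor of the positive, which is a simplifying assumption.

```python
def reciprocal_rank(pos_score, neg_scores):
    """Reciprocal rank of the positive entity among its negatives.

    Assumption: higher score means "more likely", and a negative only
    outranks the positive if its score is strictly greater.
    """
    rank = 1 + sum(1 for s in neg_scores if s > pos_score)
    return 1.0 / rank


def mean_reciprocal_rank(pos_scores, neg_scores_per_triplet):
    """Average the reciprocal ranks over all test triplets."""
    rr = [
        reciprocal_rank(p, negs)
        for p, negs in zip(pos_scores, neg_scores_per_triplet)
    ]
    return sum(rr) / len(rr)


# Toy example with 3 negatives per triplet (the dataset provides 500 per side).
pos = [0.9, 0.2]
negs = [[0.1, 0.95, 0.3], [0.1, 0.05, 0.0]]
mrr = mean_reciprocal_rank(pos, negs)
```

In the first triplet one negative outscores the positive (rank 2), and in the second none does (rank 1), so a better model pushes this average toward 1.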

vymao · 2020-11-25T17:36:55Z

Does this mean we must test against every existing edge and the corresponding 1000 negative edges for each edge? Or can we sample the edges to get an estimate?

weihua916 · 2020-11-25T17:50:31Z

Hi! I am not sure if I understand your question, but please refer to our example code for how we evaluate model performance. The most relevant part is here.

vymao · 2020-11-28T21:58:01Z

For example, the validation set has these keys: dict_keys(['head_type', 'head', 'head_neg', 'relation', 'tail_type', 'tail', 'tail_neg']). So we have a positive edge from (head, tail) and 500 negative edges described by (head_neg, tail_neg). Is this correct?

weihua916 · 2020-11-28T22:02:53Z

Correct.

vymao · 2020-11-29T05:16:23Z

Will either head_neg or tail_neg reference nodes for which there is no positive edge feature data? It seems that the features (i.e., protein, function, side effect, etc.) that describe the nodes are only available for the positive edges.

So for example, if node 15230 is not in the positive edge data, but is either a negative head or tail of a negative edge, then it seems we wouldn't know the classification of the node, since this isn't given for negative edges.

weihua916 · 2020-11-29T06:07:52Z

For the i-th validation triplet (val['head'][i] of type val['head_type'][i], val['relation'][i], val['tail'][i] of type val['tail_type'][i]), we corrupt the head and tail entities by sampling 500 negative entities for each, and they are val['head_neg'][i] and val['tail_neg'][i].
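The description above can be sketched as follows. The dictionary here is a toy stand-in with 2 negatives per side instead of 500 and made-up indices, and `corrupted_triplets` is a hypothetical helper, not part of the OGB package.

```python
# Toy stand-in for the validation split (values are invented for illustration).
val = {
    "head_type": ["protein", "drug"],
    "head": [3, 7],
    "head_neg": [[10, 11], [20, 21]],   # negatives that replace the head
    "relation": [0, 1],
    "tail_type": ["function", "protein"],
    "tail": [5, 3],
    "tail_neg": [[30, 31], [40, 41]],   # negatives that replace the tail
}


def corrupted_triplets(split, i):
    """Return the i-th positive triplet and its head/tail corruptions."""
    h, r, t = split["head"][i], split["relation"][i], split["tail"][i]
    positive = (h, r, t)
    # Head corruption keeps the relation and tail, and swaps only the head.
    head_corrupted = [(h_neg, r, t) for h_neg in split["head_neg"][i]]
    # Tail corruption keeps the head and relation, and swaps only the tail.
    tail_corrupted = [(h, r, t_neg) for t_neg in split["tail_neg"][i]]
    return positive, head_corrupted, tail_corrupted


positive, head_corrupted, tail_corrupted = corrupted_triplets(val, 0)
```

Each negative triplet differs from the positive one in exactly one position, so head_neg[i] and tail_neg[i] are two independent candidate lists rather than paired (head, tail) edges.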

vymao · 2020-11-29T16:29:48Z

Yes, I am aware. But the head and tail of the negative entities are nodes, which should have labels (i.e., protein, function, side effect, etc.). Are we to assume that the label for every val['head_neg'][i] is the same as val['head'][i]? Or can the labels of val['head_neg'][i] be different, and if so, will this information always be listed within the training data?

weihua916 · 2020-11-29T17:21:14Z

Yes, the node type of negative entities is the same as the positive entity. We have this in our dataset description: "we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities."

vymao · 2020-12-01T04:34:07Z

Thank you. Another question: does train_edge contain every node within the graph? I tried mapping the node indices of the heads and tails but that only seems to cover 48688 nodes. The maximum index in the train_edge head and tail is 45084.

If not, are we expected to predict edges for nodes that do not exist in the training data?

weihua916 · 2020-12-01T04:50:46Z

They should contain all the nodes. I think you have not taken the node types into account when you count. Can you share your code to obtain 45084?
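As an illustration of why the count changes once node types are considered, here is a small sketch on toy data. The lists and the `count_nodes` helper are hypothetical; the only assumption carried over from the thread is that each head and tail index comes with a type.

```python
def count_nodes(head_types, heads, tail_types, tails):
    """Count distinct nodes with and without taking the node type into account."""
    # Naive count: treats index 0 of every type as the same node.
    naive = set(heads) | set(tails)
    # Typed count: a node is identified by the pair (type, index).
    typed = set(zip(head_types, heads)) | set(zip(tail_types, tails))
    return len(naive), len(typed)


# Toy triplets: indices restart at 0 for every node type.
head_types = ["protein", "drug", "drug"]
heads = [0, 0, 1]
tail_types = ["function", "protein", "function"]
tails = [0, 1, 1]

naive_count, typed_count = count_nodes(head_types, heads, tail_types, tails)
```

On this toy data the naive count sees only the indices 0 and 1, while the typed count finds six distinct nodes.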

vymao · 2020-12-01T20:46:01Z

Ah I see. Is there a reason for this? I noticed the other link prediction datasets have a combined node index.

weihua916 · 2020-12-01T22:07:56Z

Because it is a heterogeneous graph.
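If a single combined index is needed, one option is to add a per-type offset to the local index. This is only a sketch: the node counts below are made up, the helper names are hypothetical, and ordering the types alphabetically is an arbitrary choice rather than anything the dataset prescribes.

```python
def build_offsets(num_nodes_by_type):
    """Assign each node type a starting position in a combined index space."""
    offsets, total = {}, 0
    for node_type in sorted(num_nodes_by_type):  # arbitrary but deterministic order
        offsets[node_type] = total
        total += num_nodes_by_type[node_type]
    return offsets, total


def to_global(node_type, local_index, offsets):
    """Map a (type, local index) pair to a combined index."""
    return offsets[node_type] + local_index


# Made-up node counts per type, for illustration only.
offsets, total = build_offsets({"drug": 3, "protein": 4, "function": 2})
global_id = to_global("protein", 1, offsets)
```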

vymao · 2020-12-02T19:28:57Z

Ok. In the paper, it says that "All relations are modeled as directed edges, among which the relations connecting the same entity types (e.g., protein-protein, drug-drug, function-function) are always symmetric, i.e., the edges are bi-directional." Are the bidirectional edges reflected in the edge index dictionary? As in, for some directed edge [A, B], is [B, A] also in the same list?

Also, do you know approximately the average out-degree of the nodes? Just want to check and make sure I processed it correctly.
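Both checks can be run locally on an edge list. The sketch below uses toy data and hypothetical helper names; it assumes the edges of one relation are available as (source, target) pairs within a single node type, and that the average out-degree is the number of edges divided by the number of nodes.

```python
def missing_reverse_edges(edges):
    """Return the edges (a, b) whose reverse (b, a) is not in the list."""
    edge_set = set(edges)
    return [(a, b) for (a, b) in edges if (b, a) not in edge_set]


def average_out_degree(edges, num_nodes):
    """Average number of outgoing edges per node, counting nodes with none."""
    return len(edges) / num_nodes


# Toy edge list: (0, 1) has its reverse, (1, 2) does not.
edges = [(0, 1), (1, 0), (1, 2)]
missing = missing_reverse_edges(edges)
avg_deg = average_out_degree(edges, num_nodes=3)
```

An empty result from `missing_reverse_edges` on a same-type relation would indicate that both directions are stored.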

weihua916 closed this as completed Nov 25, 2020
