Confusion about ogbl-biokg #92
Comments
Hi!
That's correct.
Those negative entities are used to evaluate your ML models. Good ML models should rank the ground-truth positive entity higher than the negative entities for each test triplet.
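To make the ranking idea concrete, here is a minimal sketch of how a rank-based score could be computed from model outputs. The function names and toy scores are made up for illustration, and ties are resolved optimistically, which is an assumption of this sketch; the official evaluator may handle ties differently.

```python
def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the positive among its negatives (higher score = better).

    Assumption of this sketch: ties are resolved in favour of the positive.
    """
    return 1 + sum(1 for s in neg_scores if s > pos_score)


def ranking_metrics(pos_scores, neg_scores_per_triplet, ks=(1, 3, 10)):
    """Mean reciprocal rank and Hits@K over a list of test triplets."""
    ranks = [
        rank_of_positive(pos, negs)
        for pos, negs in zip(pos_scores, neg_scores_per_triplet)
    ]
    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        metrics[f"hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Toy scores: two test triplets with three negatives each
# (the dataset supplies many more negatives per triplet).
pos_scores = [0.9, 0.8]
neg_scores = [[0.1, 0.95, 0.3], [0.1, 0.2, 0.3]]
metrics = ranking_metrics(pos_scores, neg_scores)
print(metrics)
```

In the first toy triplet one negative outscores the positive, so its rank is 2; in the second the positive is ranked first.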
Does this mean we must test against every existing edge and the corresponding 1000 negative edges for each edge? Or can we sample the edges to get an estimate?
Hi! I am not sure if I understand your question, but please refer to our example code for how we evaluate model performance. The most relevant part is here.
For example, the validation set has this as the keys:
Correct.
Will either So for example, if node 15230 is not in the positive edge data, but is either a negative head or tail of a negative edge, then it seems we wouldn't know the classification of the node then, since this isn't given for negative edges.
For i-th validation triplet
Yes, I am aware. But the head and tail of the negative entities are nodes, which should have labels (i.e., protein, function, side effect, etc.). Are we to assume that the label for every
Yes, the node type of negative entities is the same as the positive entity. We have this in our dataset description: "we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities."
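As an illustration of the same-type rule quoted above, the following is a rough sketch of type-constrained corruption. The entity counts, the helper name `sample_same_type_negatives`, the exclusion of the true entity, and sampling with replacement are all assumptions made for this example, not a description of how the dataset's negatives were actually generated.

```python
import random

# Hypothetical per-type entity counts; the real values come from the dataset.
num_nodes = {"protein": 6, "drug": 4}


def sample_same_type_negatives(true_index, entity_type, num_negatives, rng):
    """Sample corrupting entities of `entity_type`, skipping the true entity.

    Sampling is with replacement, which is an assumption of this sketch.
    """
    candidates = [i for i in range(num_nodes[entity_type]) if i != true_index]
    return [rng.choice(candidates) for _ in range(num_negatives)]


rng = random.Random(0)

# Toy triplet (protein 2, some relation, drug 1):
# the head is corrupted with proteins only, the tail with drugs only.
head_neg = sample_same_type_negatives(2, "protein", 5, rng)
tail_neg = sample_same_type_negatives(1, "drug", 5, rng)
print(head_neg, tail_neg)
```

Because each negative inherits the type of the entity it replaces, no separate type label is needed for the negative heads or tails.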
Thank you. Another question: does If not, are we expected to predict edges for nodes that do not exist in the training data?
They should contain all the nodes. I think you have not taken the node types into account when you count. Can you share your code to obtain 45084?
Ah I see. Is there a reason for this? I noticed the other link prediction datasets have the node indices combined.
Because it is a heterogeneous graph.
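The counting pitfall discussed above can be sketched on toy data. The triplet layout, type names, and counts below are invented for illustration; the point is only that an index is meaningful together with its node type, and that a combined index can be built with per-type offsets if one is wanted.

```python
# Toy triplets as (head_type, head, relation, tail_type, tail).
triplets = [
    ("protein", 0, "protein-function", "function", 0),
    ("drug", 0, "drug-protein", "protein", 1),
    ("drug", 1, "drug-drug", "drug", 0),
]

bare_indices = set()  # ignores node types, so different nodes collide
typed_nodes = set()   # (type, local index) identifies a node uniquely
for head_type, head, _relation, tail_type, tail in triplets:
    bare_indices.update([head, tail])
    typed_nodes.update([(head_type, head), (tail_type, tail)])

print(len(bare_indices), len(typed_nodes))

# Hypothetical per-type node counts, used to build a combined index.
num_nodes = {"drug": 2, "function": 1, "protein": 2}
offsets = {}
running_total = 0
for node_type in sorted(num_nodes):
    offsets[node_type] = running_total
    running_total += num_nodes[node_type]


def global_id(node_type, local_index):
    """Map a (type, local index) pair to a single combined index."""
    return offsets[node_type] + local_index


print(global_id("protein", 1))
```

Counting bare indices here finds only 2 distinct values, while counting typed pairs finds all 5 nodes.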
Ok. In the paper, it says that "All relations are modeled as directed edges, among which the relations connecting the same entity types (e.g., protein-protein, drug-drug, function-function) are always symmetric, i.e., the edges are bi-directional." Are the bidirectional edges reflected in the edge index dictionary? As in, for some directed edge [A, B], is [B, A] also in the same list? Also, do you know approximately the average out-degree of the nodes? Just want to check and make sure I processed it correctly.
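One way to check both points yourself is sketched below on a toy edge list. The edge list and node count are made up; with the real data you would first convert one same-type relation's edge index into (head, tail) pairs.

```python
from collections import Counter

# Toy edge list for one same-type relation, as (head, tail) pairs.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2)]
num_nodes = 3  # hypothetical node count for this entity type

# Symmetry check: every edge (A, B) should have its reverse (B, A).
edge_set = set(edges)
missing_reverse = [(h, t) for h, t in edges if (t, h) not in edge_set]

# Out-degree per node, and the average over all nodes of this type.
out_degree = Counter(h for h, _ in edges)
avg_out_degree = len(edges) / num_nodes

print(missing_reverse, dict(out_degree), avg_out_degree)
```

An empty `missing_reverse` list would indicate that the relation is stored with both directions present; in this toy list the edge (0, 2) has no reverse.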
Hi, could you explain the test data for ogbl-biokg in more detail?
I'm not sure I fully understand. Does this mean that you randomly sample, from all nodes, each head and tail (500 for each) for each test edge? If we are to predict the existence of edges, what will this information be used for?