
which version of ogbn-proteins dataset did you use in cluster_gin.py file? #15

Closed
Elizabeth1997 opened this issue Mar 30, 2020 · 11 comments

Comments

@Elizabeth1997

Hello, OGB team, hope you are doing great! I just downloaded the example code for ogbn-proteins and ran cluster_gin.py. I found that you no longer use node features, and that the node species information has changed from the previous one-hot encoding (version 3) to a taxonomy ID. However, in cluster_gin.py the statement `cluster_data.data.x = cluster_data.data.x.to(torch.float)` is now incorrect, because there is no attribute called `x` anymore; you can verify this by setting the argument `use_node_features` to True. Another question: is it possible for us to use the one-hot encoded features provided previously? We have no idea what the taxonomy ID of each protein represents, and can the similarity between two proteins be expressed by the difference between their taxonomy IDs? Thank you for replying in advance, and have a good one!
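
A minimal sketch of a defensive fix for the reported crash: only cast the node features when `data.x` actually exists. `Data` below is a hypothetical stand-in for `torch_geometric.data.Data`, and the float cast is simplified so the snippet is self-contained.

```python
# Hypothetical guard: cluster_gin.py calls `data.x = data.x.to(torch.float)`
# unconditionally, which raises AttributeError now that ogbn-proteins ships
# no node feature tensor. Guarding on `x` avoids the crash.
class Data:
    """Plain stand-in for torch_geometric.data.Data."""
    def __init__(self, x=None):
        self.x = x  # node feature matrix, or None when features were removed

def maybe_cast_features(data):
    if getattr(data, "x", None) is not None:
        data.x = [float(v) for v in data.x]  # placeholder for .to(torch.float)
    return data

print(maybe_cast_features(Data()).x)        # None, no crash
print(maybe_cast_features(Data([1, 2])).x)  # [1.0, 2.0]
```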

@weihua916
Copy link
Contributor

weihua916 commented Mar 30, 2020

Hi, thanks for the interest. Apologies for the inconsistency, as we are actively developing OGB now. As noted in the README, the datasets are likely to change in the next couple of weeks.

To answer your questions: We deleted the input node feature because we empirically found that including it gives worse performance (possibly because we use the species split, and the input node features are also about the species). Nevertheless, you can always use the meta information such as species id in your model. Taxonomy ID is just an identifier of species (https://www.ncbi.nlm.nih.gov/taxonomy), and it is fully compatible with the one-hot encoding that we provided before.
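
Since a taxonomy ID is just a species identifier, the old one-hot encoding can be recovered by indexing the unique IDs. A hedged sketch (the IDs 9606 and 10090 below are illustrative NCBI taxonomy IDs, not asserted dataset values):

```python
# Recover a one-hot species encoding from raw taxonomy IDs.
def one_hot_species(taxonomy_ids):
    uniq = sorted(set(taxonomy_ids))            # one column per species
    index = {t: i for i, t in enumerate(uniq)}  # taxonomy ID -> column
    return [[1 if index[t] == j else 0 for j in range(len(uniq))]
            for t in taxonomy_ids]

print(one_hot_species([9606, 10090, 9606]))
# [[1, 0], [0, 1], [1, 0]]
```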

To further clarify the ogbn-proteins dataset: each species has many proteins that are connected by edges representing their interactions. There are also some edges between proteins across different species. The goal is to use the labels assigned to the proteins from 6 species to make predictions on the proteins from the remaining 2 species (one for validation and one for test).
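
The species split described above can be sketched as follows (an illustrative reading of the split, not the dataset's actual code; the species IDs are made up):

```python
# Species-based split: proteins from the held-out validation/test species go
# to those sets; proteins from every other species form the training set.
def species_split(node_species, valid_species, test_species):
    train, valid, test = [], [], []
    for idx, sp in enumerate(node_species):
        if sp == test_species:
            test.append(idx)
        elif sp == valid_species:
            valid.append(idx)
        else:
            train.append(idx)
    return train, valid, test

species = [1, 1, 2, 3, 2, 4]  # hypothetical species ID per protein node
tr, va, te = species_split(species, valid_species=3, test_species=4)
print(tr, va, te)  # [0, 1, 2, 4] [3] [5]
```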

Hope this clarifies your questions. Please let us know if you have further questions.

@Elizabeth1997
Author

Thank you so much for your quick and detailed reply! Your answer clarifies my questions well. We also empirically found worse performance when using the node features, as you mentioned :) Regarding your clarification about the species, we plotted a figure illustrating it; hope it can help other people interested in your work gain a better understanding as well. Thank you again, and we look forward to your further work coming out soon.
[Figure: ogb_proteins]

@Elizabeth1997
Author

Hi, OGB team, I also noticed that you generate batches with Cluster-GCN only once, before training and testing:

loader = ClusterLoader(cluster_data, batch_size=args.batch_size,
                       shuffle=True, num_workers=args.num_workers)

However, in the Cluster-GCN code, they generate batches every epoch, which may help improve the model's performance because all edges can then be utilized.

@rusty1s
Collaborator

rusty1s commented Mar 31, 2020

Hi @Elizabeth1997,
with shuffle=True, the ClusterLoader generates different mini-batches for each epoch, too. That way, different partitions are merged into a batch in each epoch, and we utilize both the inter- and intra-partition connections.
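
A hedged sketch of why `shuffle=True` suffices: the graph is partitioned once, but the loader re-shuffles the partition order every epoch, so each epoch merges different partitions into each mini-batch (the names and structure below are illustrative, not the actual ClusterLoader internals):

```python
import random

def cluster_batches(partitions, batch_size, rng):
    order = list(range(len(partitions)))
    rng.shuffle(order)  # roughly what shuffle=True does once per epoch
    for i in range(0, len(order), batch_size):
        # merging several partitions keeps the inter-partition edges among them
        yield sorted(n for p in order[i:i + batch_size] for n in partitions[p])

parts = [[0, 1], [2, 3], [4, 5], [6, 7]]  # fixed partitioning, computed once
rng = random.Random(0)
epoch1 = list(cluster_batches(parts, 2, rng))  # partition groupings differ
epoch2 = list(cluster_batches(parts, 2, rng))  # across epochs
```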

@Elizabeth1997
Author

Hello @rusty1s , I see. Thank you for your reply!

@Elizabeth1997
Author

Hi OGB team, will you release the paper on time (mid April) ?

@weihua916
Contributor

We will try our best to release the paper by then. We will keep you updated. Thank you for your patience.

@Elizabeth1997
Author

Thank you very much.

@Elizabeth1997 Elizabeth1997 reopened this Apr 6, 2020
@Elizabeth1997
Author

Hello @rusty1s, hope you are doing great. I noticed that in graph_saint.py (for the ogbn-products dataset) you use edge_index for training but an adjacency matrix for inference. Why not just move the model to the CPU and do inference using the whole graph's edge indices?

@rusty1s
Collaborator

rusty1s commented Apr 10, 2020

Because the message-passing scheme of PyG would require too much memory and would be even slower than sparse matrix multiplication on the CPU. This is just a workaround to overcome current limitations in PyG (which I am working on fixing).
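
The memory argument can be sketched like this: full-graph inference can be written as y = A @ x with A stored in sparse COO form, which touches each edge once instead of materializing a per-edge message tensor. The function below is a simplified pure-Python stand-in for the sparse matmul that torch.sparse / torch_sparse would perform on the CPU:

```python
# Sparse matrix-vector product over a COO edge list: y[r] += v * x[c]
# for every stored edge (r, c, v). No per-edge message tensor is built.
def spmm(rows, cols, vals, x, num_nodes):
    dim = len(x[0])
    y = [[0.0] * dim for _ in range(num_nodes)]
    for r, c, v in zip(rows, cols, vals):
        for k in range(dim):
            y[r][k] += v * x[c][k]
    return y

# 2-node graph with edges 0->1 and 1->0, unit weights
print(spmm([0, 1], [1, 0], [1.0, 1.0], [[2.0], [3.0]], 2))
# [[3.0], [2.0]]
```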

@Elizabeth1997
Author

Hello @rusty1s, thanks for your reply. I see, many thanks!
