
which version of ogbn-proteins dataset did you use in cluster_gin.py file? #15

Closed
Elizabeth1997 opened this issue Mar 30, 2020 · 11 comments

Comments

@Elizabeth1997

Hello, OGB team, hope you are doing great! I just downloaded the example code for ogbn-proteins and ran cluster_gin.py. I found that you no longer use node features, and that the node species information has changed from the previous one-hot encoding (version 3) to a taxonomy ID. However, in cluster_gin.py the statement `cluster_data.data.x = cluster_data.data.x.to(torch.float)` is now incorrect, because there is no attribute called `x` anymore; you can verify this by setting the argument `use_node_features` to True. Another question: is it possible for us to use the one-hot encoded features provided previously? We have no idea what the taxonomy ID of each protein represents, and can the similarity between two proteins be expressed by the difference between their taxonomy IDs? Thank you for replying in advance, and have a good one!
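
A minimal sketch of a defensive fix for the reported crash: only cast the node features when `data.x` actually exists. `Data` below is a hypothetical stand-in for `torch_geometric.data.Data`, and the float cast is simplified so the snippet is self-contained.

```python
# Hypothetical guard: cluster_gin.py calls `data.x = data.x.to(torch.float)`
# unconditionally, which raises AttributeError now that ogbn-proteins ships
# no node feature tensor. Guarding on `x` avoids the crash.
class Data:
    """Plain stand-in for torch_geometric.data.Data."""
    def __init__(self, x=None):
        self.x = x  # node feature matrix, or None when features were removed

def maybe_cast_features(data):
    if getattr(data, "x", None) is not None:
        data.x = [float(v) for v in data.x]  # placeholder for .to(torch.float)
    return data

print(maybe_cast_features(Data()).x)        # None, no crash
print(maybe_cast_features(Data([1, 2])).x)  # [1.0, 2.0]
```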

@weihua916
Copy link
Contributor

weihua916 commented Mar 30, 2020

Hi, thanks for the interest. Apologies for the inconsistency, as we are actively developing OGB now. As noted in the README, the datasets are likely to change in the next couple of weeks.

To answer your questions: We deleted the input node feature because we empirically found that including it gives worse performance (possibly because we use the species split, and the input node features are also about the species). Nevertheless, you can always use the meta information such as species id in your model. Taxonomy ID is just an identifier of species (https://www.ncbi.nlm.nih.gov/taxonomy), and it is fully compatible with the one-hot encoding that we provided before.
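
Since a taxonomy ID is just a species identifier, the old one-hot encoding can be recovered by indexing the unique IDs. A hedged sketch (the IDs 9606 and 10090 below are illustrative NCBI taxonomy IDs, not asserted dataset values):

```python
# Recover a one-hot species encoding from raw taxonomy IDs.
def one_hot_species(taxonomy_ids):
    uniq = sorted(set(taxonomy_ids))            # one column per species
    index = {t: i for i, t in enumerate(uniq)}  # taxonomy ID -> column
    return [[1 if index[t] == j else 0 for j in range(len(uniq))]
            for t in taxonomy_ids]

print(one_hot_species([9606, 10090, 9606]))
# [[1, 0], [0, 1], [1, 0]]
```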

To further clarify the ogbn-proteins dataset: each species has many proteins that are connected by edges representing their interactions. There are also some edges between proteins across different species. The goal is to use the labels assigned to the proteins from 6 species to make predictions on the proteins from the remaining 2 species (one for validation and one for test).
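
The species split described above can be sketched as follows (an illustrative reading of the split, not the dataset's actual code; the species IDs are made up):

```python
# Species-based split: proteins from the held-out validation/test species go
# to those sets; proteins from every other species form the training set.
def species_split(node_species, valid_species, test_species):
    train, valid, test = [], [], []
    for idx, sp in enumerate(node_species):
        if sp == test_species:
            test.append(idx)
        elif sp == valid_species:
            valid.append(idx)
        else:
            train.append(idx)
    return train, valid, test

species = [1, 1, 2, 3, 2, 4]  # hypothetical species ID per protein node
tr, va, te = species_split(species, valid_species=3, test_species=4)
print(tr, va, te)  # [0, 1, 2, 4] [3] [5]
```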

Hope this clarifies your questions. Please let us know if you have further questions.

@Elizabeth1997
Author

Thank you so much for your quick and detailed reply! Your answer clarifies my questions well. We also empirically found worse performance when using the node features, as you mentioned :) Regarding your clarification about the species, we plotted a figure illustrating it; hope it can help other people interested in your work gain a better understanding as well. Thank you again, and we look forward to your further work coming out soon.
[Figure: ogb_proteins]

@Elizabeth1997
Author

Hi, OGB team, I also noticed that you generate batches with Cluster-GCN only once, before training and testing:

loader = ClusterLoader(cluster_data, batch_size=args.batch_size,
                       shuffle=True, num_workers=args.num_workers)

However, in the Cluster-GCN code, they generate batches every epoch, which may help improve the model's performance because all edges can then be utilized.

@rusty1s
Collaborator

rusty1s commented Mar 31, 2020

Hi @Elizabeth1997,
with shuffle=True, the ClusterLoader generates different mini-batches for each epoch, too. That way, different partitions are merged into a batch in each epoch, and we utilize both the inter- and intra-partition connections.
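
A hedged sketch of why `shuffle=True` suffices: the graph is partitioned once, but the loader re-shuffles the partition order every epoch, so each epoch merges different partitions into each mini-batch (the names and structure below are illustrative, not the actual ClusterLoader internals):

```python
import random

def cluster_batches(partitions, batch_size, rng):
    order = list(range(len(partitions)))
    rng.shuffle(order)  # roughly what shuffle=True does once per epoch
    for i in range(0, len(order), batch_size):
        # merging several partitions keeps the inter-partition edges among them
        yield sorted(n for p in order[i:i + batch_size] for n in partitions[p])

parts = [[0, 1], [2, 3], [4, 5], [6, 7]]  # fixed partitioning, computed once
rng = random.Random(0)
epoch1 = list(cluster_batches(parts, 2, rng))  # partition groupings differ
epoch2 = list(cluster_batches(parts, 2, rng))  # across epochs
```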

@Elizabeth1997
Author

Hello @rusty1s , I see. Thank you for your reply!

@Elizabeth1997
Author

Hi OGB team, will you release the paper on time (mid April) ?

@weihua916
Contributor

We will try our best to release the paper by then. We will keep you updated. Thank you for your patience.

@Elizabeth1997
Author

Thank you very much.

@Elizabeth1997 Elizabeth1997 reopened this Apr 6, 2020
@Elizabeth1997
Author

Hello @rusty1s, hope you are doing great. I noticed that in graph_saint.py (for the ogbn-products dataset) you use edge_index for training but an adjacency matrix for inference. Why not just move the model to the CPU and do inference using the whole graph's edge indices?

@rusty1s
Collaborator

rusty1s commented Apr 10, 2020

Because the message-passing scheme of PyG would require too much memory and would be even slower than sparse matrix multiplication on the CPU. This is just a workaround to overcome current limitations in PyG (which I am working on fixing).
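
The memory argument can be sketched like this: full-graph inference can be written as y = A @ x with A stored in sparse COO form, which touches each edge once instead of materializing a per-edge message tensor. The function below is a simplified pure-Python stand-in for the sparse matmul that torch.sparse / torch_sparse would perform on the CPU:

```python
# Sparse matrix-vector product over a COO edge list: y[r] += v * x[c]
# for every stored edge (r, c, v). No per-edge message tensor is built.
def spmm(rows, cols, vals, x, num_nodes):
    dim = len(x[0])
    y = [[0.0] * dim for _ in range(num_nodes)]
    for r, c, v in zip(rows, cols, vals):
        for k in range(dim):
            y[r][k] += v * x[c][k]
    return y

# 2-node graph with edges 0->1 and 1->0, unit weights
print(spmm([0, 1], [1, 0], [1.0, 1.0], [[2.0], [3.0]], 2))
# [[3.0], [2.0]]
```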

@Elizabeth1997
Author

Hello @rusty1s, thanks for your reply. I see, many thanks!
