Link Prediction on Heterogeneous Graphs with Heterogeneous Graph Learning #3958
You are right that we can strengthen our set of examples in that regard. Currently, we only have a single example of link-level prediction for movie recommendation, see here. Do you have some specific example in mind? Happy to work with you on this together. |
Hi @rusty1s , thanks for your reply! And yes, I have a specific example in mind. I have been trying to work on link prediction with the public biomedical knowledge graph ogbl-biokg. I think this would be a good example, since it's also a benchmarking graph for many other approaches. Indeed, I'd be happy to work on this together. When I check the link prediction example for movie recommendation that you mentioned, I see that there are the following tasks to do:
Is there anything I am missing? |
Sounds amazing. I'm wondering whether we need to make |
That would be a good idea! I saw that you also contributed to the availability of a pytorch-geometric formatted ogb dataset here. It looks like this has been done exactly in the way that is required for pytorch geometric - so could we directly use this class? As to the node features: Since @anniekmyatt did it in her example, I thought it might be necessary to have node features. But of course, if this is not necessary we can leave it out. |
Thanks for mentioning me, I'm happy to help out if you'd like me to! I think in your case you need to use a contrastive loss (either hinge or cross entropy), using examples of positive and negative edges. I think ogbl-biokg doesn't have features on the nodes, @rusty1s, so I thought @sophiakrix needs to add some if she wants to use a GNN. |
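The cross-entropy variant of this loss can be sketched in plain PyTorch. This is a minimal illustration, not code from the example; the scores are random placeholders standing in for a decoder's output:

```python
import torch
import torch.nn.functional as F

# Hypothetical decoder outputs: one logit per candidate edge.
pos_score = torch.randn(100, requires_grad=True)  # edges present in the graph
neg_score = torch.randn(100, requires_grad=True)  # sampled negative edges

scores = torch.cat([pos_score, neg_score])
labels = torch.cat([torch.ones(100), torch.zeros(100)])

# Cross-entropy form of the contrastive loss: positives -> 1, negatives -> 0.
loss = F.binary_cross_entropy_with_logits(scores, labels)
loss.backward()
```

With a hinge loss one would instead penalize negatives that score above a margin relative to the positives; the binary cross-entropy form above is the one used in most PyG link-prediction examples.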
@sophiakrix, the movielens example does have a heterogeneous encoder due to this line:
I think you mainly need to change the decoder and loss function, and think about how you feed in your batches of edges (are you just looking to predict on one edge type or would it be beneficial to optimise for different edge types in your use case?). |
@sophiakrix Yes, you can directly use the PyG dataset from the

Indeed, the |
Thanks for all the tips! And thanks @anniekmyatt for the nice implementation of your MovieLens example :) I'll start working on adapting it to the ogbl-biokg now. I'll keep you updated :) |
When trying to use the ogbl-biokg dataset with the PyGLinkPropPredDataset class, I realised that this is a
And when I look at the ogbl-biokg
@anniekmyatt did access the node types explicitly for the MovieLens dataset, and in the case of ogbl-biokg, I cannot access the nodes and their features like this. Is there another way to set them?
|
Currently, you can build the `HeteroData` object by hand from the dataset's dictionaries:

```python
hetero_data = HeteroData()
for node_type, num_nodes in data.num_nodes_dict.items():
    hetero_data[node_type].num_nodes = num_nodes
for edge_type, edge_index in data.edge_index_dict.items():
    hetero_data[edge_type].edge_index = edge_index
```
|
Great, I thought that that's the best way to go, too. When I then tried to train the model, I encountered a few issues:

**1) Message passing error for direct edge types**

When passing in a direct edge type to the model, the following error gets thrown:

The

**2) Error for reverse edge types**

The reverse edge types are lacking the

Error:

Would it be okay to simply add these attributes from the corresponding direct edge type? |
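Copying the attributes over from the direct edge type amounts to flipping the two rows of `edge_index` and reusing the edge-level tensors; PyG's `T.ToUndirected()` transform automates exactly this. A plain-PyTorch sketch with made-up edge types and values:

```python
import torch

# Hypothetical direct edge type ('drug', 'targets', 'protein') with one attribute.
edge_index = torch.tensor([[0, 1, 2],    # source (drug) indices
                           [3, 4, 5]])   # target (protein) indices
edge_label = torch.tensor([1.0, 0.0, 1.0])

# Reverse edge type ('protein', 'rev_targets', 'drug'): swap source and target
# rows of edge_index and reuse the edge attributes unchanged.
rev_edge_index = edge_index.flip(0)
rev_edge_label = edge_label.clone()
```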
What does your model look like? It looks like you are mixing encoder and decoder parts. The encoder takes |
I did not change the model from @anniekmyatt, since she said the encoder would already accept heterogeneous input. It looks as follows:
I will adapt the decoder in a following step, but I am yet unsure how to handle that there are different node type pairs, and not only one ('user', 'movie') as in the example. |
Yes, I hard-coded the edge type in the movielens example because we were only interested in one edge type and there's nothing in that decoder to handle other edge types. To handle different edge types, you can make a decoder in which you pass the edge type as an input to the forward function. You could use a bilinear decoder, like the DistMult decoder, because it learns parameters related to the different edge types. You then need to loop over the different edge types and aggregate the loss. Let me know if this makes sense or if you'd like more detail. |
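A DistMult-style decoder of the kind described above can be sketched as follows. This is a hedged illustration, not the movielens code; the edge types, dimensions, and class name are assumptions:

```python
import torch
from torch import nn

class DistMultDecoder(nn.Module):
    """Bilinear decoder with one learnable relation vector per edge type."""
    def __init__(self, edge_types, hidden_dim):
        super().__init__()
        # ParameterDict keys must be strings, so join the edge-type tuple.
        self.rel = nn.ParameterDict({
            '__'.join(et): nn.Parameter(torch.randn(hidden_dim))
            for et in edge_types
        })

    def forward(self, z_src, z_dst, edge_type, edge_label_index):
        src, dst = edge_label_index
        r = self.rel['__'.join(edge_type)]
        # DistMult score: <z_src, r, z_dst>, summed over the hidden dimension.
        return (z_src[src] * r * z_dst[dst]).sum(dim=-1)

# Usage with hypothetical node embeddings:
decoder = DistMultDecoder([('drug', 'treats', 'disease')], hidden_dim=16)
z_drug, z_disease = torch.randn(5, 16), torch.randn(7, 16)
edge_label_index = torch.tensor([[0, 1], [2, 3]])
scores = decoder(z_drug, z_disease, ('drug', 'treats', 'disease'), edge_label_index)
```

During training one would loop over the edge types, call the decoder with each type's `edge_label_index`, and sum the per-type losses.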
Just a few more thoughts:
|
Hi @anniekmyatt ! Thanks for your thoughts :) As to your question: I actually only need prediction for the out edge |
Something to consider for the message passing error for direct edge types: The same error already happens during lazy initialization in this line here:
Error:
I commented these lines out since they were throwing the error (bad decision, I know now)... Do you have an idea why the encoder has a problem with this input? As mentioned in the documentation, after converting a homogeneous model to a heterogeneous model with
The
|
I'm trying to reproduce this but don't think I have enough info to do so. Your

Out of curiosity, what does your |
Ha, turns out I could reproduce your error! 🥳 To be honest, I'm not sure what's happening; something is going wrong in the |
I realised you had asked a few questions earlier that are still unanswered:
The
This is a good question. I have to admit that I am not sure that we are going about this the best way for the

Our approach appears to be to learn the graph structure through both learnable node embeddings and a GNN. Given that you are interested in only one edge type, the decoder could either be just a dot product, or you could tack on something like a

I'm wondering whether it might be better to ditch the GNN completely and just work on implementing one or more of the KG algorithms (like |
Can you show me how?
I wouldn't say this is true at all, so I am interested to find out more about this error. The only reason one should use

@sophiakrix @anniekmyatt Yes, we could think about adding different encoder and decoder parts to PyG directly. Currently, we leave the task of KGEs (without multi-hop reasoning), e.g., |
@rusty1s Here is the code I used and which is throwing the error, so that you can reproduce it:
|
Thank you. I think this is resolved in master, as it does not throw an error for me. There was some issue in
|
Thanks for the info! I think it solves the error. The issue is that now, when I try the lazy initialization, I get a

Therefore, I thought of doing mini-batching for the training, but I am facing an issue here. I followed the example for mini-batch training with the HGTLoader:
When trying to call the next batch, it gives me the following error:
Do you have an idea why that could be? Also, lazy initialization is not possible here, no?

Update:
Is the parameter

Or is the issue that I am using a node sampler and not an edge sampler? I then tried to use an edge sampler from the docs for the samplers, the |
Ah, thanks! I had some issues with getting this to run in the past. I must admit I didn't check recently and just went straight to my workaround using the

Adding dropout to the homogeneous GNN before |
@sophiakrix Please see my newly created issue here: #4026. |
@anniekmyatt Thanks for all your answers and for putting in the effort to reproduce the error. Could you possibly show how you implemented the

As to your question how my |
About the decoder part: I think a simple dot product would already be sufficient, just to try it out. Of course, it would be good to be able to compare more decoders, but I'm quite tight on time for this, since I have until the end of this month. |
Sure! As @rusty1s explained above, it turns out you don't need to use the
|
Yeah, if you're a bit tight for time I would start as simple as possible, get it to a state where it trains, and then add complexity later if needed. :-) Thanks for sharing your

Also note that you can access the |
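The workaround discussed above, learnable embeddings for node types without input features, can be sketched with plain PyTorch. The node counts and dimension below are made-up placeholders, not values from ogbl-biokg:

```python
import torch
from torch import nn

# ogbl-biokg nodes come without input features, so one workaround is a
# learnable embedding table per node type. Sizes here are hypothetical.
num_nodes_dict = {'drug': 100, 'disease': 50, 'protein': 200}
hidden_dim = 32

emb = nn.ModuleDict({
    node_type: nn.Embedding(num_nodes, hidden_dim)
    for node_type, num_nodes in num_nodes_dict.items()
})

# Build the x_dict a heterogeneous encoder expects from the embedding weights;
# the embeddings are then trained end-to-end together with the GNN.
x_dict = {node_type: emb[node_type].weight for node_type in num_nodes_dict}
```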
Hi |
This would indicate that these edge types are not reachable from your set of seed links within the specified number of hops. |
What could be the possible ways to solve this issue? Is it to feed the model the whole train_data, since splitting the train dataset results in losing some edges?
|
How is your `LinkNeighborLoader` defined, and what does the metadata of your graph look like? |
my graph looks like that:

```
HeteroData(
  disorder={
    node_id=[22384],
    x=[22384, 384]
  },
  drug={
    node_id=[14315],
    x=[14315, 384]
  },
  protein={
    node_id=[204961],
    x=[204961, 769]
  },
  gene={
    node_id=[81491],
    x=[81491, 1536]
  },
  pathway={ x=[2566, 769] },
  phenotype={ x=[16614, 384] },
  side_effect={ x=[15095, 384] },
  go={ x=[43558, 768] },
  tissue={ x=[64, 384] },
  (protein, EncodedBy, gene)={ edge_index=[2, 32930] },
  (gene, AWith, disorder)={ edge_index=[2, 30252] },
  (drug, HasTarget, protein)={ edge_index=[2, 26537] },
  (drug, HasIndication, disorder)={
    edge_index=[2, 16837],
    edge_label=[16837]
  },
  (drug, drug_has_side_effect, side_effect)={ edge_index=[2, 217739] },
  (disorder, disorder_is_subtype_of_disorder, disorder)={ edge_index=[2, 71024] },
  (disorder, disorder_has_phenotype, phenotype)={ edge_index=[2, 216172] },
  (drug, drug_has_contraindication, disorder)={ edge_index=[2, 13788] },
  (drug, molecule_similarity_molecule_index, drug)={ edge_index=[2, 376398] },
  (protein, protein_has_go_annotation, go)={ edge_index=[2, 297002] },
  (go, go_is_subtype_of_go, go)={ edge_index=[2, 140116] },
  (protein, protein_in_pathway, pathway)={ edge_index=[2, 121759] },
  (gene, gene_expressed_in_tissue, tissue)={ edge_index=[2, 1133252] },
  (protein, protein_expressed_in_tissue, tissue)={ edge_index=[2, 632173] },
  (protein, interact_with, protein)={ edge_index=[2, 32090] },
  (gene, rev_EncodedBy, protein)={ edge_index=[2, 32930] },
  (disorder, rev_AWith, gene)={ edge_index=[2, 30252] },
  (protein, rev_HasTarget, drug)={ edge_index=[2, 26537] },
  (disorder, rev_HasIndication, drug)={
    edge_index=[2, 16837],
    edge_label=[16837]
  },
  (side_effect, rev_drug_has_side_effect, drug)={ edge_index=[2, 217739] },
  (phenotype, rev_disorder_has_phenotype, disorder)={ edge_index=[2, 216172] },
  (disorder, rev_drug_has_contraindication, drug)={ edge_index=[2, 13788] },
  (go, rev_protein_has_go_annotation, protein)={ edge_index=[2, 297002] },
  (pathway, rev_protein_in_pathway, protein)={ edge_index=[2, 121759] },
  (tissue, rev_gene_expressed_in_tissue, gene)={ edge_index=[2, 1133252] },
  (tissue, rev_protein_expressed_in_tissue, protein)={ edge_index=[2, 632173] }
)
```
Train_Loader:

```python
# Define seed edges:
from torch_geometric.loader import LinkNeighborLoader

edge_label_index = train_data["drug", "HasIndication", "disorder"].edge_label_index
edge_label = train_data["drug", "HasIndication", "disorder"].edge_label

train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[20, 10],
    neg_sampling_ratio=2.0,
    edge_label_index=(("drug", "HasIndication", "disorder"), edge_label_index),
    edge_label=edge_label,
    batch_size=16,
    shuffle=True,
)
```
```python
import torch_geometric.transforms as T

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    is_undirected=True,
    disjoint_train_ratio=0.3,  # TODO
    neg_sampling_ratio=2.0,  # TODO
    add_negative_train_samples=False,
    edge_types=("drug", "HasIndication", "disorder"),
    rev_edge_types=("disorder", "rev_HasIndication", "drug"),
)
train_data, val_data, test_data = transform(data)
```
|
So it looks like a two-hop sampling in
As such, two-hop sampling is not able to reach information such as |
I think something is wrong with my model. The model predicted negative values, positive values, and values greater than one, whereas the positive and negative labels are restricted to zero or one:

and here the predicted values:

I expected that the predictions can only be zero or one? Thanks |
This depends on what your model output looks like. If you return logits, then the range of the output can be `(-inf, inf)`, and you can squash that to `(0, 1)` via `sigmoid`. |
Hi,
should I apply sigmoid after each layer in the model or only after the final prediction?
Thanks
|
In this case, it should be sigmoid after the final prediction. You can also use sigmoid as intermediate activations, but usually something like ReLU performs better. |
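The logits-then-sigmoid pattern from the exchange above in a few lines of PyTorch (the logit values are arbitrary examples):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([-2.0, 0.0, 3.0])  # raw decoder outputs in (-inf, inf)
probs = torch.sigmoid(logits)            # squashed into (0, 1)

# In practice the sigmoid is usually folded into the loss for numerical
# stability, so the model keeps returning raw logits during training:
labels = torch.tensor([0.0, 1.0, 1.0])
loss = F.binary_cross_entropy_with_logits(logits, labels)
```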
When we perform negative sampling (Medium Blog) in (Defining Edge-level Training & Splits, Defining Mini-batch Loaders), I believe we don't create any negative homogeneous links, e.g., (users <--> users or movies <--> movies). In other words, all negative samples are of the types (users--rates--movies or movies--re-rates--users). I just want to confirm.

Related to the above, is the discussion here relevant? Could you explain where I can use that? |
Yes, negative edges are just sampled for the supervision loss, which operates on the single edge type (user, rates, movie). We do not modify the underlying graph. |
Thanks @rusty1s! Kind of related to this. Could you explain the difference between defining mini-batch loaders (here) versus performing negative sampling "for every training epoch" (here)? I am putting the code snippet for both:

```python
# In the first hop, we sample at most 20 neighbors.
# In the second hop, we sample at most 10 neighbors.
# In addition, during training, we want to sample negative edges on-the-fly with
# a ratio of 2:1.
# We can make use of the `loader.LinkNeighborLoader` from PyG:
from torch_geometric.loader import LinkNeighborLoader

# Define seed edges:
edge_label_index = train_data["user", "rates", "movie"].edge_label_index
edge_label = train_data["user", "rates", "movie"].edge_label

train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[20, 10],
    neg_sampling_ratio=2.0,
    edge_label_index=(("user", "rates", "movie"), edge_label_index),
    edge_label=edge_label,
    batch_size=128,
    shuffle=True,
)
```

Versus:

```python
from torch_geometric.utils import negative_sampling

def train_link_predictor(
    model, train_data, val_data, optimizer, criterion, n_epochs=100
):
    for epoch in range(1, n_epochs + 1):
        model.train()
        optimizer.zero_grad()
        z = model.encode(train_data.x, train_data.edge_index)

        # sampling training negatives for every training epoch
        neg_edge_index = negative_sampling(
            edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
            num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

        edge_label_index = torch.cat(
            [train_data.edge_label_index, neg_edge_index],
            dim=-1,
        )
        edge_label = torch.cat([
            train_data.edge_label,
            train_data.edge_label.new_zeros(neg_edge_index.size(1))
        ], dim=0)

        out = model.decode(z, edge_label_index).view(-1)
        loss = criterion(out, edge_label)
        loss.backward()
        optimizer.step()

        val_auc = eval_link_predictor(model, val_data)

        if epoch % 10 == 0:
            print(f"Epoch: {epoch:03d}, Train Loss: {loss:.3f}, Val AUC: {val_auc:.3f}")

    return model
```
|
The difference is whether you want to pre-compute negative samples or not. Usually, you want to pre-compute them for validation and testing, such that your results are comparable across different epochs/runs. During training, it is usually better to sample negatives on the fly to avoid overfitting on a specific set of negatives. That's why we usually use both |
Thanks @rusty1s! I fully understood |
There is an option in |
Thanks @rusty1s! Sorry for the confusion. I am aware of the negative sampling option of |
We randomly sample a new set of negative edges in every mini-batch. This is what I mean by on-the-fly. |
Hi @rusty1s ! From discussion (#3958 (comment)), I am deciding between |
Yes, they are both very similar and should perform very similarly. I cannot give a good rule of thumb here on which to use; at best, just try out both :) |
Hello great PyTorch Geometric team :) Apply the transformation to the dataset

However, negative sampling is added to the training data but not to val and not to test? |
Also, I have another issue: when I checked train_data['drug', 'HasIndication', 'disorder'].edge_label I got this:

This function assigns ones and zeros as labels to the edges even though I set False for the add-negative-samples parameter.
I think there is a big mistake in RandomLinkSplit: |
This was fixed on 2.2 and afterwards. Simply upgrade your PYG version. |
Hi, thanks for the answer, but I already have version 2.4. The problem is that RandomLinkSplit adds labels (ones) to the train edges, and then during mini-batching positive labels are also added, which leads to 2
|
Hi |
🚀 The feature, motivation and pitch
I am working with heterogeneous knowledge graphs and am trying to do link prediction on them. The specific issue I am facing is that I cannot find any working implementation that would allow me to do link prediction on a graph with multiple node types and multiple edge types and predict the existence and the type of edge between nodes.
It would be great to have a working example of how to do link prediction on a heterogeneous graph with the heterogeneous graph learning module.