
How to train the encoder for our own data? (A Knowledge graph and sample query) #16

Open
rd27995 opened this issue Apr 19, 2021 · 5 comments

Comments


rd27995 commented Apr 19, 2021

Hi,

I have a target graph in the form of a directed networkx graph with 14M nodes and 54M edges.
I wanted to know how I can use this target graph, along with a query graph (30 nodes, 33 edges), to train the encoder.

I can only see options for using the built-in datasets in PyTorch Geometric. Is there a simpler way to use my own datasets?

@jessxphil

I have the same question.


sML-90 commented Apr 28, 2021

+1

qema (Collaborator) commented May 8, 2021

Thanks for the question and sorry for the late reply. There is currently no user-facing mechanism for incorporating custom datasets, since things like the train/test split and subgraph sampling need to be defined per dataset; in general, one can create a new DataSource (see common/data.py) to handle a new dataset. Note that a pretrained model (such as the one provided in the repo) may be able to handle testing on new datasets, in which case subgraph_matching/alignment.py can load new graphs to evaluate on.
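For that evaluation route, here is a minimal sketch of preparing a query/target pair, assuming alignment.py can load pickled networkx graphs (check its arguments for the exact loading mechanism; the random graphs and file names below are placeholders, not anything the repo defines):

```python
import pickle

import networkx as nx

# Placeholder graphs standing in for a real query/target pair; substitute
# your own query graph and large target graph here.
query = nx.gnp_random_graph(30, 0.1, seed=0)
target = nx.gnp_random_graph(1000, 0.01, seed=0)

# Serialize both graphs so a separate evaluation script can load them back.
with open("query.pkl", "wb") as f:
    pickle.dump(query, f)
with open("target.pkl", "wb") as f:
    pickle.dump(target, f)
```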

If the goal is to train on new datasets, as a bit of a hack, one could append an "elif" after this line:

dataset = [g for g in nx.graph_atlas_g()[1:] if nx.is_connected(g)]

with a spec for the new dataset:

elif name == 'newdataset': dataset = [list of networkx or pytorch geometric graphs]

and then train with the command-line option --dataset=newdataset-balanced and test with --dataset=newdataset-imbalanced.
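To make the hack a bit more concrete, here is a hedged sketch of a helper that could produce the list assigned to dataset in that new elif branch; the pickle file name and the helper itself are illustrative assumptions, not code that exists in the repo:

```python
import pickle

import networkx as nx

def load_custom_dataset(path="newdataset_graphs.pkl"):
    """Return a list of networkx graphs to assign to `dataset`.

    The pickle file is assumed to contain graphs prepared offline, e.g.
    subgraphs sampled from a large target graph.
    """
    with open(path, "rb") as f:
        graphs = pickle.load(f)

    # Mirror the graph-atlas branch above and keep only connected graphs,
    # falling back to weak connectivity for directed graphs.
    kept = []
    for g in graphs:
        connected = (nx.is_weakly_connected(g) if g.is_directed()
                     else nx.is_connected(g))
        if connected:
            kept.append(g)
    return kept
```

The new branch in common/data.py would then reduce to elif name == 'newdataset': dataset = load_custom_dataset().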

rd27995 (Author) commented May 10, 2021

Thanks @qema, I was able to train the network using my custom dataset; however, I only get around 70% validation accuracy.
Do you have any suggestions for improving the accuracy or fine-tuning the model?
I am using all default model parameters.
The second plot depicts validation metrics.

[Figure: training metrics, 500 samples of 300 nodes each]

[Figure: validation results over 100 epochs, 500 samples of 300 nodes each]

qema (Collaborator) commented Jun 12, 2021

Hi @rd27995, please see the new experimental branch which supports node features and harder negative sampling. For now, the above procedure to add new datasets is still needed. However, one can now train with --dataset=newdataset-basis and test with --dataset=newdataset-imbalanced (-basis being the new data source with harder negative examples). Also, note that testing on the imbalanced dataset (which samples random pairs of graphs) may give a more realistic picture of model performance than validation (which uses an artificial 50-50 label split as well as artificially-generated negative examples).
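For anyone wiring up node features on the experimental branch, here is a hedged sketch of attaching a simple per-node feature to a networkx graph before adding it to the custom dataset list; the attribute key "node_feature" and the tensor format are assumptions based on the DeepSNAP conventions the repo builds on, so check the branch for the exact key and dtype it expects:

```python
import networkx as nx
import torch

# Placeholder graph standing in for one graph in the custom dataset list.
g = nx.gnp_random_graph(30, 0.2, seed=0)

# Attach a 1-dimensional feature per node (here, the node degree) under the
# assumed attribute key "node_feature".
for v in g.nodes:
    g.nodes[v]["node_feature"] = torch.tensor([float(g.degree[v])])
```

Training and testing then use the --dataset=newdataset-basis and --dataset=newdataset-imbalanced options described above.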
