Perform fully-inductive OR semi-inductive link prediction using my own dataset #1319
-
Hi All, I have a dataset of scientific abstracts manually labeled with gold standard entities and relations, which can be used to build a knowledge graph. Separately, I have an information extraction model that identifies entities and relations in new abstracts from the same scientific discipline. I would like to use a PyKEEN model trained on my existing gold standard data to predict links between the entities extracted from a new abstract. In essence, I'd like to write a piece of code that takes a set of triples from a new abstract as input (a "mini-graph"), and outputs the highest-scoring predicted links between entities in that mini-graph. The mini-graph may or may not have entities that overlap with those in the original training set, so the task could be either semi-inductive or fully-inductive, depending on the abstract. I would also like to be able to do this iteratively; that is, re-apply the same pre-trained model to new abstracts as I receive them.

I've read the introduction to inductive predictions in the docs; however, I can't quite figure out how to adapt it to my use case. Specifically, I'm confused about how to generate all four of the triples factories indicated in the docs:

- `transductive_training`: the training portion
- `inductive_inference`: the triples available at inference (validation/test time)
- `inductive_validation` / `inductive_test`: the triples to score

As of right now, I have the following implementation:

```python
from pykeen.triples import TriplesFactory
from pykeen.training import SLCWATrainingLoop
from pykeen.models.inductive import InductiveNodePiece
from pykeen.losses import NSSALoss
from torch.optim import Adam
from pykeen.predict import predict_all

# Read in data as lists of 3-tuples, where each 3-tuple is a triple
gold_trips = ...
one_abstract_trips = ...

tf_gold = TriplesFactory.from_labeled_triples(gold_trips, create_inverse_triples=True)
tf_one_abstract = TriplesFactory.from_labeled_triples(one_abstract_trips, create_inverse_triples=True)

model = InductiveNodePiece(
    triples_factory=tf_gold,  # training factory, used to tokenize training nodes
    inference_factory=tf_one_abstract,  # inference factory, used to tokenize inference nodes
    num_tokens=12,  # length of a node hash - how many unique relations per node will be used
    aggregation="mlp",  # aggregation function, defaults to an MLP, can be any PyTorch function
    loss=NSSALoss(margin=15),  # dummy loss
    random_seed=42,
)

optimizer = Adam(params=model.parameters(), lr=0.0005)

training_loop = SLCWATrainingLoop(
    triples_factory=tf_gold,  # training triples
    model=model,
    optimizer=optimizer,
    mode="training",  # necessary to specify for the inductive mode - training has its own set of nodes
)

training_loop.train(
    triples_factory=tf_gold,
    # stopper=early_stopper,
    num_epochs=100,
)

predictions = predict_all(model=model)
```

It all runs, except for the last line. I have a feeling I'm missing some fundamental concept about inductive predictions here; would be very appreciative of any and all feedback/advice!
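One detail worth double-checking in the snippet above: `TriplesFactory.from_labeled_triples` expects a NumPy array of string labels with shape `(n, 3)`, not a plain list of 3-tuples. A minimal sketch of the conversion (the example triples are made up for illustration; variable names mirror the snippet above):

```python
import numpy as np

# Hypothetical gold-standard triples as (head, relation, tail) label strings
gold_triples_list = [
    ("proteinA", "interacts_with", "proteinB"),
    ("proteinB", "located_in", "nucleus"),
]

# from_labeled_triples expects an (n, 3) array of string labels
gold_trips = np.array(gold_triples_list, dtype=str)
print(gold_trips.shape)  # (2, 3)
```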
-
Hi @serenalotreck, the inductive model was mostly coded with the "standard" benchmark setting of a fixed pre-defined validation/test split in mind, rather than the open-ended, continuous prediction setting.

It is not yet fully integrated with the prediction workflow we developed for the transductive setting. So, you would need to set, e.g., the validation factory to the one you want to run prediction with, e.g.:

```python
model = InductiveNodePiece(
    ...,
    inference_factory=tf_one_abstract,
    validation_factory=tf_one_abstract,  # <-
)
```

When running this code, I found another issue where the number of entities was not properly set; this will be fixed with #1320
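To then actually score links on the new abstract, one option is a small helper along these lines. This is a sketch, not a tested recipe: `score_new_abstract` is a hypothetical name, and passing `mode="validation"` to `predict_all` is my assumption of how prediction gets routed to the validation factory set above.

```python
def score_new_abstract(model, k=100):
    """Sketch: score candidate links among the new abstract's entities.

    Assumes `model` is an InductiveNodePiece built with
    `validation_factory=tf_one_abstract` as suggested above; the `mode`
    argument is my assumption for selecting that factory's entity set.
    """
    from pykeen.predict import predict_all  # deferred so this sketch stays importable

    # "validation" should switch the inductive model to the validation
    # factory's entity vocabulary instead of the training one
    return predict_all(model=model, mode="validation", k=k)
```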
-
@mberr, thanks for your help with #1411! I think this discussion relates to it in that many folks are excited to use PyKEEN for inductive LP modeling on their own data but are unclear about how best to do this. I'm trying to follow your suggestions above but keep hitting various errors. Here's my workflow:
This flow runs, but it does throw the following warning, which I'm not sure whether I can safely ignore:
Just wanted to follow up on this discussion to see whether anything's changed in the code base, whether my flow logic and parameterization/setup above look sound, and whether anything is unnecessary/redundant. Thanks again for your help! Hopefully these comments are helpful for others using PyKEEN for similar things!
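On the iterative part of the original question, one way to structure the per-abstract loop is sketched below. Hedged heavily: `train_and_score` is a hypothetical helper wrapping the model construction, training, and prediction steps shown earlier in this thread, and since the inductive models bind their inference factory at construction time, each new abstract gets a fresh factory here.

```python
def predict_links_per_abstract(abstract_triples, tf_gold, k=50):
    """Sketch of the iterative setting: build one TriplesFactory per new
    abstract and hand it, together with the gold-standard training factory,
    to a (hypothetical) `train_and_score` helper wrapping the steps shown
    earlier in this thread."""
    from pykeen.triples import TriplesFactory  # deferred so this sketch stays importable

    results = {}
    for abstract_id, trips in abstract_triples.items():
        # trips: (n, 3) array of (head, relation, tail) label strings
        tf_new = TriplesFactory.from_labeled_triples(trips, create_inverse_triples=True)
        results[abstract_id] = train_and_score(tf_gold, tf_new, k=k)  # hypothetical helper
    return results
```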