Perform fully-inductive OR semi-inductive link prediction using my own dataset #1319
-
Hi All, I have a dataset of scientific abstracts manually labeled with gold standard entities and relations, which can be used to build a knowledge graph. Separately, I have an information extraction model that identifies entities and relations in new abstracts from the same scientific discipline. I would like to use a PyKEEN model trained on my existing gold standard data to predict links between the entities extracted from a new abstract. In essence, I'd like to write a piece of code that takes a set of triples from a new abstract as input (a "mini-graph"), and outputs the highest-scoring predicted links between entities in that mini-graph. The mini-graph may or may not have entities that overlap with those in the original training set, so the task could be either semi-inductive or fully-inductive, depending on the abstract. I would also like to be able to do this iteratively; that is, re-apply the same pre-trained model to new abstracts as I receive them.

I've read the introduction to inductive predictions in the docs; however, I can't quite figure out how to adapt it to my use case. Specifically, I'm confused about how to generate all four of the triples factories indicated in the docs:

- `transductive_training`: the training portion
- `inductive_inference`: the triples available at inference (validation/test time)
- `inductive_validation` / `inductive_test`: the triples to score

As of right now, I have the following implementation:

```python
from pykeen.triples import TriplesFactory
from pykeen.training import SLCWATrainingLoop
from pykeen.models.inductive import InductiveNodePiece
from pykeen.losses import NSSALoss
from torch.optim import Adam
from pykeen.predict import predict_all

# Read in data as lists of 3-tuples, where each 3-tuple is a triple
gold_trips = ...
one_abstract_trips = ...

tf_gold = TriplesFactory.from_labeled_triples(gold_trips, create_inverse_triples=True)
tf_one_abstract = TriplesFactory.from_labeled_triples(one_abstract_trips, create_inverse_triples=True)

model = InductiveNodePiece(
    triples_factory=tf_gold,  # training factory, used to tokenize training nodes
    inference_factory=tf_one_abstract,  # inference factory, used to tokenize inference nodes
    num_tokens=12,  # length of a node hash - how many unique relations per node will be used
    aggregation="mlp",  # aggregation function, defaults to an MLP, can be any PyTorch function
    loss=NSSALoss(margin=15),  # dummy loss
    random_seed=42,
)

optimizer = Adam(params=model.parameters(), lr=0.0005)

training_loop = SLCWATrainingLoop(
    triples_factory=tf_gold,  # training triples
    model=model,
    optimizer=optimizer,
    mode="training",  # necessary to specify for the inductive mode - training has its own set of nodes
)

training_loop.train(
    triples_factory=tf_gold,
    # stopper=early_stopper,
    num_epochs=100,
)

predictions = predict_all(model=model)
```

It all runs, except for the last line. I have a feeling I'm missing some fundamental concept about inductive predictions here; would be very appreciative of any and all feedback/advice!
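One detail worth double-checking in the snippet above: `TriplesFactory.from_labeled_triples` expects a NumPy array of string labels with shape `(n, 3)`, not a plain list of 3-tuples. A minimal sketch of the conversion (the example triples are made up for illustration; variable names mirror the snippet above):

```python
import numpy as np

# Hypothetical gold-standard triples as (head, relation, tail) label strings
gold_triples_list = [
    ("proteinA", "interacts_with", "proteinB"),
    ("proteinB", "located_in", "nucleus"),
]

# from_labeled_triples expects an (n, 3) array of string labels
gold_trips = np.array(gold_triples_list, dtype=str)
print(gold_trips.shape)  # (2, 3)
```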
-
Hi @serenalotreck, the inductive model was mostly coded with the "standard" benchmark setting of a fixed pre-defined validation/test split in mind, rather than the open-ended, continuous prediction setting.

It is not yet fully integrated with the prediction workflow we developed for the transductive setting. So, you would need to set, e.g., the validation factory to the one you want to run prediction with, e.g.:

```python
model = InductiveNodePiece(
    ...,
    inference_factory=tf_one_abstract,
    validation_factory=tf_one_abstract,  # <-
)
```

When running this code, I found another issue where the number of entities was not properly set; this will be fixed with #1320
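To then actually score links on the new abstract, one option is a small helper along these lines. This is a sketch, not a tested recipe: `score_new_abstract` is a hypothetical name, and passing `mode="validation"` to `predict_all` is my assumption of how prediction gets routed to the validation factory set above.

```python
def score_new_abstract(model, k=100):
    """Sketch: score candidate links among the new abstract's entities.

    Assumes `model` is an InductiveNodePiece built with
    `validation_factory=tf_one_abstract` as suggested above; the `mode`
    argument is my assumption for selecting that factory's entity set.
    """
    from pykeen.predict import predict_all  # deferred so this sketch stays importable

    # "validation" should switch the inductive model to the validation
    # factory's entity vocabulary instead of the training one
    return predict_all(model=model, mode="validation", k=k)
```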
-
@mberr, thanks for your help with #1411! I think this discussion relates to it in that many folks are excited to use PyKEEN for inductive LP modeling on their own data but are unclear about how best to do this. I'm trying to follow your suggestions above but keep hitting various errors. Here's my workflow:
This flow runs, but it does throw the following warning, which I'm not sure whether I can safely ignore:
Just wanted to follow up on this discussion to see whether anything's changed in the code base, whether my flow logic and parameterization/setup above look sound, and whether anything is unnecessary/redundant. Thanks again for your help! Hopefully these comments are helpful for others using PyKEEN for similar things!
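On the iterative part of the original question, one way to structure the per-abstract loop is sketched below. Hedged heavily: `train_and_score` is a hypothetical helper wrapping the model construction, training, and prediction steps shown earlier in this thread, and since the inductive models bind their inference factory at construction time, each new abstract gets a fresh factory here.

```python
def predict_links_per_abstract(abstract_triples, tf_gold, k=50):
    """Sketch of the iterative setting: build one TriplesFactory per new
    abstract and hand it, together with the gold-standard training factory,
    to a (hypothetical) `train_and_score` helper wrapping the steps shown
    earlier in this thread."""
    from pykeen.triples import TriplesFactory  # deferred so this sketch stays importable

    results = {}
    for abstract_id, trips in abstract_triples.items():
        # trips: (n, 3) array of (head, relation, tail) label strings
        tf_new = TriplesFactory.from_labeled_triples(trips, create_inverse_triples=True)
        results[abstract_id] = train_and_score(tf_gold, tf_new, k=k)  # hypothetical helper
    return results
```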