As an alternative to using a pre-packaged dataset, the training and testing can be set explicitly by file path or with instances of :class:`pykeen.triples.TriplesFactory`. Throughout this tutorial, the paths to the training, testing, and validation sets for built-in :class:`pykeen.datasets.Nations` will be used as examples.
You've got a training and testing file as 3-column TSV files, all ready to go. You're sure that there aren't any entities or relations appearing in the testing set that don't appear in the training set. Load them in the pipeline like this:
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
PyKEEN will take care of making sure that the entities are mapped from their labels to appropriate integer (technically, 0-dimensional :class:`torch.LongTensor`) indexes and that the different sets of triples share the same mapping.
This is equally applicable for the :func:`pykeen.hpo.hpo_pipeline`, which has a similar interface to the :func:`pykeen.pipeline.pipeline` as in:
>>> from pykeen.hpo import hpo_pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
>>> result = hpo_pipeline(
... n_trials=3, # you probably want more than this
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... validation=NATIONS_VALIDATE_PATH,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')
The remainder of the examples will be for :func:`pykeen.pipeline.pipeline`, but all work exactly the same for :func:`pykeen.hpo.hpo_pipeline`.
If you want to add dataset-wide arguments, you can use the dataset_kwargs
argument
to the :class:`pykeen.pipeline.pipeline` to enable options like create_inverse_triples=True
.
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... dataset_kwargs={'create_inverse_triples': True},
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
If you want finer control over how the triples are created, for example, if they are not all coming from TSV files, you can use the :class:`pykeen.triples.TriplesFactory` interface.
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> testing = TriplesFactory.from_path(
... NATIONS_TEST_PATH,
... entity_to_id=training.entity_to_id,
... relation_to_id=training.relation_to_id,
... )
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
Warning
The instantiation of the testing factory, we used the entity_to_id
and relation_to_id
keyword arguments.
This is because PyKEEN automatically assigns numeric identifiers to all entities and relations for each triples
factory. However, we want the identifiers to be exactly the same for the testing set as the training
set, so we just reuse it. If we didn't have the same identifiers, then the testing set would get mixed up with
the wrong identifiers in the training set during evaluation, and we'd get nonsense results.
The dataset_kwargs
argument is ignored when passing your own :class:`pykeen.triples.TriplesFactory`, so be
sure to include the create_inverse_triples=True
in the instantiation of those classes if that's your
desired behavior as in:
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(
... NATIONS_TRAIN_PATH,
... create_inverse_triples=True,
... )
>>> testing = TriplesFactory.from_path(
... NATIONS_TEST_PATH,
... entity_to_id=training.entity_to_id,
... relation_to_id=training.relation_to_id,
... create_inverse_triples=True,
... )
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
Triples factories can also be instantiated using the triples
keyword argument instead of the path
argument
if you already have triples loaded in a :class:`numpy.ndarray`.
It's more realistic your real-world dataset is not already stratified into training and testing sets. PyKEEN has you covered with :func:`pykeen.triples.TriplesFactory.split`, which will allow you to create a stratified dataset.
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing = tf.split()
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_unstratified_transe')
By default, this is an 80/20 split. If you want to use early stopping, you'll also need a validation set, so you should specify the splits:
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1])
>>> result = pipeline(
... training=training,
... testing=testing,
... validation=validation,
... model='TransE',
... stopper='early',
... epochs=5, # short epochs for testing - you should go
... # higher, especially with early stopper enabled
... )
>>> result.save_to_directory('doctests/test_unstratified_stopped_transe')
For a tutorial on how to use your own data together with checkpoints, see :ref:`byod_and_checkpoints_training` and :ref:`byod_and_checkpoints_manually`.