💾💽 Change serialization format #785

Merged
merged 9 commits into from
Feb 16, 2022

mberr
Member

@mberr mberr commented Feb 15, 2022

This PR changes the serialization format of the triples factory so that its key components, the triples and the label-to-ID mappings, are stored in a compressed but human-readable format. This allows easy inspection outside of PyKEEN.
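
To illustrate what "compressed, but human-readable" can mean in practice, here is a minimal sketch using only the Python standard library. It assumes a gzip-compressed, tab-separated layout; the file names (`triples.tsv.gz`, `entities.tsv.gz`), the column headers, and the toy data are hypothetical and not necessarily what this PR writes.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import Iterable, List, Sequence

# Hypothetical toy data; the actual layout written by PyKEEN may differ.
entity_to_id = {"berlin": 0, "germany": 1, "paris": 2, "france": 3}
triples = [(0, 0, 1), (2, 0, 3)]  # (head_id, relation_id, tail_id)


def write_tsv_gz(path: Path, header: Sequence[str], rows: Iterable[Sequence]) -> None:
    """Write rows as a gzip-compressed, tab-separated text file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path: Path) -> List[List[str]]:
    """Read the file back without any PyKEEN-specific code."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


directory = Path(tempfile.mkdtemp())
triples_path = directory / "triples.tsv.gz"    # hypothetical file name
entities_path = directory / "entities.tsv.gz"  # hypothetical file name

write_tsv_gz(triples_path, ["head", "relation", "tail"], triples)
write_tsv_gz(entities_path, ["label", "id"], entity_to_id.items())

# Inspect the files: the first row is the header, the rest are data rows.
rows = read_tsv_gz(triples_path)
restored_triples = [tuple(int(value) for value in row) for row in rows[1:]]
labels = read_tsv_gz(entities_path)
restored_entity_to_id = {label: int(index) for label, index in labels[1:]}
```

Files of this kind can also be viewed on the command line with `zcat`, or loaded with `pandas.read_csv`, which infers gzip compression from the `.gz` extension.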

For FB15k237, I saw that the total file size was also significantly reduced (from 6.9 MiB on ef769a8 = master to 1.1 MiB for bf47eb5).

Based upon #655 (comment)

@mberr mberr requested a review from cthoyt February 15, 2022 16:06
@cthoyt
Member

cthoyt commented Feb 15, 2022

@mberr is this covered by unit tests? Have you checked it's working on a variety of datasets?

@mberr
Member Author

mberr commented Feb 15, 2022

@mberr is this covered by unit tests? Have you checked it's working on a variety of datasets?

The class TestUtils(unittest.TestCase) comprises the serialization tests.

So far I have only tested this for FB15k237, but I can try a few others tomorrow.

@mberr
Member Author

mberr commented Feb 16, 2022

All of the ten smallest datasets work.

from docdata import get_docdata
from pykeen.datasets import dataset_resolver, get_dataset
from pykeen.triples.triples_factory import TriplesFactory


def _triples(d: str) -> int:
    """Return the number of triples listed in the dataset's docdata statistics."""
    return get_docdata(dataset_resolver.lookup_dict[d])["statistics"]["triples"]


# Sort all registered datasets by their number of triples, smallest first.
dataset_list = sorted(dataset_resolver.lookup_dict, key=_triples)
for name in dataset_list[:10]:
    print(name)
    dataset = get_dataset(dataset=name)
    path = f"/tmp/{name}-temp"
    # Round trip: serialize the training triples factory, then load it again.
    dataset.training.to_path_binary(path=path)
    TriplesFactory.from_path_binary(path=path)

@mberr mberr added this to the PyKEEN v1.8.0 milestone Feb 16, 2022
@mberr mberr changed the title Change serialization format 💾💽 Change serialization format Feb 16, 2022
@mberr mberr requested a review from cthoyt February 16, 2022 08:47
@mberr mberr mentioned this pull request Feb 16, 2022
5 tasks
@mberr mberr merged commit e82057b into master Feb 16, 2022
@mberr mberr deleted the change-serialization-format branch February 16, 2022 10:13