💾💽 Change serialization format #785
trigger ci
@mberr is this covered by unit tests? Have you checked it's working on a variety of datasets?
pykeen/tests/test_triples_factory.py Line 501 in 4084a1a
So far I have only tested this for FB15k237, but I can try a few others tomorrow.
All of the ten smallest datasets work:

    from docdata import get_docdata

    from pykeen.datasets import dataset_resolver, get_dataset
    from pykeen.triples.triples_factory import TriplesFactory


    def _triples(d: str) -> int:
        # Number of triples as recorded in the dataset's docdata statistics.
        return get_docdata(dataset_resolver.lookup_dict[d])["statistics"]["triples"]


    # Sort all registered datasets by size and round-trip the ten smallest.
    dataset_list = sorted(dataset_resolver.lookup_dict, key=_triples)
    for name in dataset_list[:10]:
        print(name)
        dataset = get_dataset(dataset=name)
        path = f"/tmp/{name}-temp"
        dataset.training.to_path_binary(path=path)
        TriplesFactory.from_path_binary(path=path)
This PR changes the serialization format of the triples factory so that its key components, the triples and the label-to-ID mappings, are stored in a compressed but human-readable format. This allows easy inspection outside of PyKEEN.
For FB15k237, I saw that the total file size was also significantly reduced (from 6.9 MiB on ef769a8 = master to 1.1 MiB for bf47eb5).

Based upon #655 (comment)
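To illustrate what "compressed, but human-readable" can mean in practice, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the gzip-compressed TSV layout, the column names, and the helpers `save_triples` and `load_triples` are assumptions chosen for illustration.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical representation: one (head, relation, tail) ID triple per row.
Triple = tuple[int, int, int]


def save_triples(path: Path, triples: list[Triple]) -> None:
    """Write ID-based triples as a gzip-compressed, tab-separated text file."""
    # "wt" opens the gzip stream in text mode; newline="" is what the csv module expects.
    with gzip.open(path, "wt", encoding="utf-8", newline="") as file:
        writer = csv.writer(file, delimiter="\t", lineterminator="\n")
        writer.writerow(("head", "relation", "tail"))  # assumed column names
        writer.writerows(triples)


def load_triples(path: Path) -> list[Triple]:
    """Read the triples back from the compressed TSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as file:
        reader = csv.reader(file, delimiter="\t")
        next(reader)  # skip the header row
        return [(int(h), int(r), int(t)) for h, r, t in reader]


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "triples.tsv.gz"
    triples = [(0, 0, 1), (1, 0, 2), (2, 1, 0)]
    save_triples(path, triples)
    restored = load_triples(path)
    # After decompression the content is plain text, so the header is directly readable.
    with gzip.open(path, "rt", encoding="utf-8") as file:
        first_line = file.readline().strip()
```

A file written this way can be inspected without any PyKEEN code, for example with `zcat triples.tsv.gz | head` or `pandas.read_csv`, which is the kind of outside inspection the description refers to.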