Add ConceptNet #160

Merged
merged 14 commits into from Dec 10, 2020

Conversation

cthoyt
Member

@cthoyt cthoyt commented Nov 21, 2020

Closes #15

This PR adds the ConceptNet dataset to PyKEEN. It is distributed as a single gzipped TSV file, so it is automatically split into training, testing, and validation sets. This PR required a small extension to the SingleTabbedDataset base class to allow specifying usecols, since the file has 5 columns (edge URL, relation, head, tail, metadata).
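
To illustrate the column selection described above, here is a minimal standard-library sketch, not the code in this PR. The helper `read_triples` and the sample rows are hypothetical; only the five-column layout (edge URL, relation, head, tail, metadata) and the `usecols` name come from the description.

```python
import csv
import gzip
import io


def read_triples(handle, usecols=(2, 1, 3)):
    """Yield (head, relation, tail) tuples from a tab-separated text handle.

    ``usecols`` lists the column positions of head, relation, and tail,
    in that order. The default assumes the layout
    (edge URL, relation, head, tail, metadata).
    """
    # QUOTE_NONE keeps the quote characters in the metadata column literal
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield tuple(row[i] for i in usecols)


# Made-up rows in the five-column layout (not real ConceptNet data)
rows = [
    ["/a/1", "/r/IsA", "/c/en/cat", "/c/en/animal", '{"weight": 1.0}'],
    ["/a/2", "/r/PartOf", "/c/en/wheel", "/c/en/car", '{"weight": 2.0}'],
]
raw = "\n".join("\t".join(row) for row in rows).encode("utf-8")
compressed = gzip.compress(raw)  # stands in for the gzipped TSV file

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    triples = list(read_triples(handle))
```

Selecting by position keeps the edge URL and metadata columns out of the triples without having to name them.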

CC @isspek this could be useful for you given your interest in #2

Results

ConceptNet (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    13506431    17396623      27259908
Testing     13506431    17396623       3407488
Validation  13506431    17396623       3407489
Total       -           -             34074885

Caveats

There's a small concern that the dataset has some relations ending in _inverse, which PyKEEN interprets in a special way, logging this message (when running python -m pykeen.datasets.conceptnet to make a summary):

INFO:pykeen.triples.triples_factory:Some triples already have suffix _inverse. Creating TriplesFactory based on inverse triples
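
A quick way to see which relation labels would trigger this is to scan for the suffix before building the dataset. This is a hedged sketch: `relations_with_inverse_suffix` and the sample triples are hypothetical, and it only approximates the suffix check implied by the log message.

```python
INVERSE_SUFFIX = "_inverse"  # suffix named in the log message above


def relations_with_inverse_suffix(triples, suffix=INVERSE_SUFFIX):
    """Return the sorted, de-duplicated relation labels that end with ``suffix``."""
    return sorted({relation for _, relation, _ in triples if relation.endswith(suffix)})


# Made-up (head, relation, tail) triples for illustration
triples = [
    ("a", "part_of", "b"),
    ("b", "part_of_inverse", "a"),
    ("c", "is_a", "d"),
]
flagged = relations_with_inverse_suffix(triples)
```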

TODO

@cthoyt cthoyt added the 💾 Dataset Related to datasets label Nov 21, 2020
@cthoyt cthoyt self-assigned this Nov 21, 2020
@mberr
Member

mberr commented Nov 21, 2020

Not necessarily required for this PR, but it might be interesting to allow keeping the meta-data for triples, e.g. to support qualifiers, cf. e.g. https://www.aclweb.org/anthology/2020.emnlp-main.596/

@cthoyt
Member Author

cthoyt commented Dec 2, 2020

@mberr this PR is technically ready, but the implementation of dataset cleanup using np.isin is restrictively slow for this large dataset and I have yet to be able to generate a split in < 3 hours (after which I gave up each time)

@mberr
Member

mberr commented Dec 2, 2020

> @mberr this PR is technically ready, but the implementation of dataset cleanup using np.isin is restrictively slow for this large dataset and I have yet to be able to generate a split in < 3 hours (after which I gave up each time)

Is this for randomized cleanup? This one moves one triple at a time, right?

@mberr
Member

mberr commented Dec 2, 2020

@cthoyt I think I have an idea for another cleanup implementation 🙂

@cthoyt
Member Author

cthoyt commented Dec 2, 2020

I did some benchmarking of the cleanup algorithm last week - deterministic is way faster assuming you're okay with its caveats. It's currently the default.
[Image: cleanup_time_summary]
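
For readers unfamiliar with the cleanup step, here is a rough pure-Python sketch of what a single-pass (deterministic) cleanup could look like. It is an assumption-laden illustration, not PyKEEN's np.isin-based implementation: `cleanup_deterministic` and the sample triples are hypothetical, and set lookups stand in for the array operations.

```python
def cleanup_deterministic(training, testing):
    """Move testing triples with unseen entities or relations into training.

    Every testing triple whose head, tail, or relation does not occur in
    the original training triples is moved in one pass. This may move more
    triples than strictly necessary, since an earlier move could already
    have introduced the missing entity or relation.
    """
    train_entities = {entity for head, _, tail in training for entity in (head, tail)}
    train_relations = {relation for _, relation, _ in training}

    keep, move = [], []
    for head, relation, tail in testing:
        if head in train_entities and tail in train_entities and relation in train_relations:
            keep.append((head, relation, tail))
        else:
            move.append((head, relation, tail))
    return training + move, keep


# Made-up triples for illustration
training = [("a", "r1", "b"), ("b", "r1", "c")]
testing = [("a", "r1", "c"), ("c", "r2", "d"), ("d", "r1", "a")]
new_training, new_testing = cleanup_deterministic(training, testing)
```

In this toy case both `("c", "r2", "d")` and `("d", "r1", "a")` are moved, even though moving the first alone would already cover entity `d`. That is the kind of trade-off a one-pass approach accepts in exchange for speed.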

@mberr
Member

mberr commented Dec 2, 2020

@cthoyt #187

@cthoyt cthoyt added the blocked:pykeen Waiting on another PR or issue in PyKEEN label Dec 3, 2020
@mberr mberr removed the blocked:pykeen Waiting on another PR or issue in PyKEEN label Dec 10, 2020
@mberr
Member

mberr commented Dec 10, 2020

I removed the blocked label since #187 has been merged to master.

@cthoyt
Member Author

cthoyt commented Dec 10, 2020

@PyKEEN-bot test

@@ -49,6 +51,8 @@
'WN18',
'WN18RR',
'YAGO310',
'DRKG',
Member

This is an unrelated change, right?

Member Author

@cthoyt cthoyt Dec 10, 2020

Correct, but it was missing and this is a dataset-related PR, so I threw it in there (the docs weren't showing it because I forgot to include it in __all__).

@cthoyt cthoyt merged commit f25c011 into master Dec 10, 2020
@cthoyt cthoyt deleted the add-conceptnet branch December 10, 2020 17:15
Successfully merging this pull request may close these issues.

Add ConceptNet
3 participants