
🐍 🌟 Hyper-Relational Statement Factory #1117

Open · wants to merge 4 commits into base: master
Conversation

@migalkin migalkin commented Sep 16, 2022

This PR introduces a series of efforts to integrate hyper-relational graphs (aka RDF* or LPG) into PyKEEN with factories, datasets, and models.

Here is the adaptation of the code of our ISWC'21 paper:

  • StatementFactory (currently a mix of functions from CoreTriplesFactory and TriplesFactory)
  • HyperRelationalUnpackedRemoteDataset (its sole purpose is to have an explicit argument max_num_qualifier_pairs to limit the maximum statement length)
  • A family of hyper-relational WD50K datasets (with 13%, 33%, 66%, and 100% of statements having qualifiers)

To load statements of various sizes, we now have padding entities/relations; right now they are the last entries of entity_to_id and relation_to_id. They are auxiliary: for instance, we won't need to build an inverse type for the padding relation. This is somewhat similar to NodePiece with its auxiliary tokens.
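A toy sketch of that convention (the labels and the `__padding__` key are made up for illustration; the actual factory internals may differ):

```python
# Hypothetical sketch: padding entity/relation ids are appended as the
# LAST entries of entity_to_id / relation_to_id (labels are made up).
entity_to_id = {"Q64": 0, "Q183": 1}
relation_to_id = {"P1376": 0}

entity_padding_idx = len(entity_to_id)       # 2
relation_padding_idx = len(relation_to_id)   # 1
entity_to_id["__padding__"] = entity_padding_idx
relation_to_id["__padding__"] = relation_padding_idx

# the padding relation is auxiliary, so no inverse relation is created for it
```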

Some caveats:

  • Input statements in the datasets might generally be of arbitrary length, and we do not know this length in advance, but the loading procedure in pandas.read_csv requires a pre-defined number of columns (otherwise it crashes). So it's either fast loading with pandas or dummy for-loops (which preserve the whole dataset).
  • Loading statements into a fixed-size tensor of shape (num_statements, max_len) is memory-inefficient: many entries will be padding indices. We have functions to build a sparse edge_index and qualifier_index, but they are applicable to GNN encoders, not to training/evaluation. Ideally, we'd need some sort of RaggedTensor as in TensorFlow, but there is no standard implementation of that in PyTorch.
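To illustrate the second caveat, here is a minimal sketch (variable names and the padding value are illustrative, not the PR's API) of turning fixed-width padded statements into a sparse qualifier index:

```python
import torch

PAD = -1  # illustrative padding index; in the PR it is the last entity/relation id

# shape: (num_statements, 3 + 2 * max_num_qualifier_pairs)
# layout: (h, r, t, qr1, qv1, ...), padded with PAD
statements = torch.tensor([
    [0, 0, 1, PAD, PAD],  # plain triple, no qualifiers -> pure padding
    [2, 1, 3, 4, 5],      # one real qualifier pair
])

triples = statements[:, :3]      # dense (h, r, t) part
qual_rel = statements[:, 3::2]   # qualifier relations per statement
qual_val = statements[:, 4::2]   # qualifier values per statement
mask = qual_rel != PAD           # which qualifier slots are real
stmt_idx, _ = mask.nonzero(as_tuple=True)

# sparse (3, num_real_pairs): qualifier relation, qualifier value, statement id
qualifier_index = torch.stack([qual_rel[mask], qual_val[mask], stmt_idx])
```

Only the real qualifier pairs survive, so memory scales with the number of qualifiers rather than num_statements * max_len.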

The first "egg" of the project "PyKEEN Hyper-Relational" 😅

@migalkin (Member Author) commented Sep 16, 2022

Loading of hyper-relational WD50K datasets works; you can already try it with something like:

from pykeen.datasets.wd50k import WD50K_100


def main():
    dataset = WD50K_100(
        create_inverse_triples=True,
        max_num_qualifier_pairs=5,
        eager=True
    )

    print(dataset.num_entities)


if __name__ == '__main__':
    main()

if len(column_remapping) > max_len:
    raise ValueError("remapping must have length not more than the max statement length")

# TODO find a way to load files w/o knowing max_len in advance
Member commented:
You can open a file handle, read just the first row, split by the delimiter, and count; then just close the file handle.

Member Author replied:
Ah, if it were that easy 😢 Statements in the input files are not sorted "longest to shortest", so the longest statement might appear somewhere in the middle of the file, e.g.

h, r, t, qr1, qv1  # the first row does not have padding commas
...
h, r, t, qr1, qv1, qr2, qv2, qr3, qv3

so we'd still need to read through the whole file to identify the maximum length.
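That full-file scan could be sketched like this (the function name and delimiter handling are assumptions, not the PR's loader):

```python
import csv
from pathlib import Path


def find_max_statement_length(path: Path, delimiter: str = ",") -> int:
    """Scan the whole file once to find the longest statement.

    Statements are not sorted by length, so reading only the first row
    is not enough; every row has to be inspected.
    """
    max_len = 0
    with path.open(newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            max_len = max(max_len, len(row))
    return max_len
```

A second pass could then feed the result to pandas.read_csv (e.g. via names=range(max_len)) so the column count is known up front.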

@cthoyt (Member) left a comment:
Can we please add a small test dataset and demonstrate loading it in unit tests?

@@ -944,3 +945,105 @@ def _get_df(self) -> pd.DataFrame:
df = df[usecols]

return df


class HyperRelationalUnpackedRemoteDataset(PathDataset):
Member commented:
I think we need a different base class mixin for this kind of dataset, to note that the triples factories have to be StatementFactories.

@mberr (Member) left a comment:
Overall, nice to see that hyper-relational statements may find their way into PyKEEN 🚀

I added a few comments; I think we can merge/re-use some of the new code with existing triple utilities.

BASE_URL = "https://raw.githubusercontent.com/migalkin/StarE/master/data/clean"
TRIPLES_VALID_URL = f"{BASE_URL}/triples/valid.txt"
TRIPLES_TEST_URL = f"{BASE_URL}/triples/test.txt"
TRIPLES_TRAIN_URL = f"{BASE_URL}/triples/train.txt"
Member commented:
could make sense to put this on Zenodo

arxiv: 2009.10847
github: migalkin/StarE
statistics:
entities: 47,156 (5,460 qualifier-only)
Member commented:
@cthoyt do we rely somewhere on parsing these statistics to integers? If yes, we should store the second number under a different key

logger = logging.getLogger(__name__)

INVERSE_SUFFIX = "_inverse"
TRIPLES_DF_COLUMNS = ("head_id", "head_label", "relation_id", "relation_label", "tail_id", "tail_label")
Member commented:
These may already exist as constants either in pykeen/constants.py or pykeen/typing.py

Comment on lines +56 to +57
entity_to_id,
relation_to_id,
Member commented:
Suggested change
entity_to_id,
relation_to_id,
entity_to_id: Mapping[str, int],
relation_to_id: Mapping[str, int],

entity_to_id,
relation_to_id,
) -> Tuple[torch.LongTensor, Collection[np.ndarray]]:
if statements.size == 0:
Member commented:
can we re-use code from pykeen/triples/triples_factory.py?

entity_to_id=self.entity_to_id,
relation_to_id=self.relation_to_id,
labels=labels,
# entity_mask=entity_mask,
Member commented:
TODO

"""Return the percentage of statements with qualifiers."""
return (~(self.mapped_statements[:, 3::2] == self.padding_idx).all(dim=1)).sum().item() / self.num_statements

def extra_repr(self) -> str:
Member commented:
Better to overwrite iter_extra_repr and yield from super().

Comment on lines +524 to +526
def __repr__(self) -> str:
return f"{self.__class__.__name__}({self.extra_repr()})"

Member commented:
Suggested change
def __repr__(self) -> str:
return f"{self.__class__.__name__}({self.extra_repr()})"

not necessary



def create_statement_entity_mapping(statements: Iterable[str]):
entities = list(sorted(set([e for s in statements for e in s[::2]]))) # padding is already in the ndarray
Member commented:
The comment seems to assume a numpy array, but the parameter type annotation says str.

return s_p_q_to_multi_tails_new


def _create_multi_label_instances(
Member commented:
can we re-use this code for normal triples? If yes, then we can just extend the existing function.
