
Support for text on Literal models #470

Closed
sntcristian opened this issue May 28, 2021 · 18 comments
Labels
enhancement New feature or request

Comments

@sntcristian

I am currently working on a project on knowledge graph embeddings, and I want to test the efficiency of models that use literals to enrich the embeddings (i.e., LiteralE models).
I have seen that your library already provides implementations of DistMult and ComplEx that use information from numeric literals, and I was wondering whether you have thought about also supporting textual literals, as in https://github.com/SmartDataAnalytics/LiteralE.
The issue is that the repository mentioned above is no longer maintained and lacks utilities such as hyperparameter optimization that you already offer in your suite.

@sntcristian sntcristian added the enhancement New feature or request label May 28, 2021
@mberr
Member

mberr commented May 28, 2021

I am not very familiar with LiteralE, but from here it looks like the textual literals are encoded in a preprocessing step. From there on, you can treat them similarly to numeric literals (each entity is associated with a GloVe(?) vector).

A bit more involved would be to train the text encoder in an end-to-end fashion. For this, you would need to create a custom RepresentationModule comprising the textual encoder.

@sntcristian
Author

I think it's not just a matter of preprocessing. Looking at the difference between your implementation of DistMultLiteral and the class "DistMult_Literal_gate_text" here (row 516), I can see that their model enriches the representation of entities with information from both numeric and textual vectors associated with them, and that this interaction is managed through a gating mechanism. Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

@cthoyt
Member

cthoyt commented May 29, 2021

I agree with Max - if you want to use textual literals, it would be much more elegant to convert them to numeric values during preprocessing (either one-hot encodings if you consider them categorical, or a text embedding like GloVe/word2vec or any pretrained language model). Perhaps @mberr we can provide a representation module that uses BERT to encode text literals for use with one of the literal models.
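For concreteness, the one-hot route can be sketched without any PyKEEN machinery at all. This is a minimal illustration; the function name and the toy entity/literal data are made up:

```python
import numpy as np


def one_hot_literals(entity_to_value, entity_to_id):
    """Encode categorical text literals as one-hot feature vectors.

    entity_to_value: maps entity label -> categorical literal (a string)
    entity_to_id:    maps entity label -> integer entity ID
    """
    # One column per distinct category, in sorted order for determinism.
    categories = sorted(set(entity_to_value.values()))
    cat_to_idx = {c: i for i, c in enumerate(categories)}
    features = np.zeros((len(entity_to_id), len(categories)), dtype=np.float32)
    for label, value in entity_to_value.items():
        features[entity_to_id[label], cat_to_idx[value]] = 1.0
    return features


# Toy data for illustration only.
features = one_hot_literals(
    {"brazil": "south_america", "uk": "europe", "france": "europe"},
    {"brazil": 0, "uk": 1, "france": 2},
)
# features has shape (3, 2): one row per entity, one column per category
```

The resulting matrix can then be used anywhere a numeric literal matrix is expected.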

Also note that the DistMultLiteral and ComplExLiteral models in PyKEEN are both subclasses of the more general LiteralE model. You probably noticed that hacking on this isn't so easy from the end-user functions like the pipeline, but if you want to dig into the code, it is quite extensible.

@mberr
Member

mberr commented May 29, 2021

Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

Yes, this is right. Currently we do not have an implementation that combines three representations (learned embedding + numeric + textual). If you are up for writing your own class, you can easily extend this, taking inspiration from the existing model that combines two representations. We would also be very happy to accept a PR for this if you are willing 🙂
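To make the three-way extension concrete, here is a sketch of a gated combination in plain PyTorch, loosely following the LiteralE gate formulation. The class and attribute names are hypothetical and not part of the PyKEEN API:

```python
import torch
from torch import nn


class GateTriple(nn.Module):
    """Sketch: gate a learned entity embedding with numeric and textual
    literal vectors (illustrative, not the PyKEEN implementation)."""

    def __init__(self, emb_dim, num_lit_dim, txt_lit_dim):
        super().__init__()
        # One linear transformation per input, as in the LiteralE gate.
        self.g_ent = nn.Linear(emb_dim, emb_dim, bias=False)
        self.g_num = nn.Linear(num_lit_dim, emb_dim, bias=False)
        self.g_txt = nn.Linear(txt_lit_dim, emb_dim, bias=False)
        self.gate_bias = nn.Parameter(torch.zeros(emb_dim))
        # Candidate enriched representation from the concatenated inputs.
        self.h = nn.Linear(emb_dim + num_lit_dim + txt_lit_dim, emb_dim)

    def forward(self, x_ent, x_num, x_txt):
        gate = torch.sigmoid(
            self.g_ent(x_ent) + self.g_num(x_num) + self.g_txt(x_txt) + self.gate_bias
        )
        h = torch.tanh(self.h(torch.cat([x_ent, x_num, x_txt], dim=-1)))
        # The gate interpolates between the enriched representation and
        # the original embedding, so the model can ignore the literals.
        return gate * h + (1 - gate) * x_ent


combiner = GateTriple(emb_dim=4, num_lit_dim=3, txt_lit_dim=5)
out = combiner(torch.randn(2, 4), torch.randn(2, 3), torch.randn(2, 5))
# out has the same shape as the entity embedding input: (2, 4)
```

A class like this could serve as the combination module of an extended literal model.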

@sntcristian
Author

Perhaps @mberr we can provide a representation module that uses BERT to encode text literals for use with one of the literal models.

That would be very interesting for me. In any case, I'll take your suggestions and try to extend the code as far as I can.

@mberr
Member

mberr commented May 30, 2021

I guess the easiest way is to use the bert-as-service library, cf. https://github.com/hanxiao/bert-as-service#building-a-qa-semantic-search-engine-in-3-minutes

Here is a simple example where you use the entity labels as the only text information. While this might already be quite useful, having additional text information (description texts, etc.) might give even better results.

import torch
from bert_serving.client import BertClient
from pykeen.datasets import get_dataset

bc = BertClient()
triples_factory = get_dataset(dataset="nations").training   # modify for your triples factory
dim = 768   # depends on layer, pooling of BERT encoder, cf. configuration of bert serving 
text_features = torch.empty(triples_factory.num_entities, dim)
for entity_label, entity_id in triples_factory.entity_to_id.items():
    text_features[entity_id] = torch.as_tensor(data=bc.encode([entity_label])[0], dtype=torch.get_default_dtype()).view(-1)

Disclaimer: written in browser, may contain typos or syntax errors.

@sntcristian
Author

sntcristian commented May 30, 2021

Yes, I was developing similar code for an ad hoc class called "TriplesTextualLiteralsFactory", which I will pass as input to an extended DistMultLiteral model. The issue is that, as you can see in the following code, I cannot create a "literals_to_id" dictionary, since I do not one-hot encode the values but instead use fixed-dimension BERT embeddings.
Do you think I need "literals_to_id" for the training instances?

def create_matrix_of_literals(
    textual_triples: np.ndarray,
    entity_to_id: EntityMapping,
) -> np.ndarray:
    # Prepare the literal matrix: initialize every literal to zero, then fill
    # in the corresponding embedding where one is available.
    txt_literals = np.zeros([len(entity_to_id), 768], dtype=np.float32)
    model = SentenceTransformer('allenai-specter')
    # TODO: vectorize this loop
    for h, r, lit in textual_triples:
        try:
            embeddings = model.encode([lit])
            # The row index identifies the entity; set its literal vector.
            txt_literals[entity_to_id[h], :] = embeddings[0]
        except KeyError:
            logger.info("Entity %s does not exist in the entity mapping.", h)
            continue

    return txt_literals

@mberr
Member

mberr commented May 30, 2021

Actually, I think we do not need them. There is even a TODO in the code:

@fix_dataclass_init_docs
@dataclass
class MultimodalInstances(Instances):
    """Triples and mappings to their indices as well as multimodal data."""

    #: TODO: do we need these?
    numeric_literals: Mapping[str, np.ndarray]
    literals_to_id: Mapping[str, int]

@mali-git any reason why we should keep them?

@cthoyt
Member

cthoyt commented May 30, 2021

The entire premise of the numeric triples factory is questionable now that we have the extensible representations. I'm a bit woozy from my vaccine this morning, but I could look into cutting that out completely.

@moritzblum

I think it's not just a matter of preprocessing. Looking at the difference between your implementation of DistMultLiteral and the class "DistMult_Literal_gate_text" here (row 516), I can see that their model enriches the representation of entities with information from both numeric and textual vectors associated with them, and that this interaction is managed through a gating mechanism. Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

I'm working on a similar topic as @sntcristian and found this discussion.

Without having studied the original LiteralE code completely, I would say the gating mechanism you mention is actually not for combining textual and numeric vectors. It combines an embedding and a literal vector via a gating mechanism, as in RNNs, for example. This allows the model to learn "whether the additional literal information is useful or not, and adapt accordingly, e.g. by incorporating or ignoring that information" (Kristiadi et al.). The gating mechanism you mention is defined later, starting at line 705, and it just extends the gate by self.gate_txt_lit(x_lit_txt).
Gate:
gate = self.gate_activation(self.g1(x_ent) + self.g2(x_lit) + self.gate_bias)
GateMulti:
gate = self.gate_activation(self.gate_ent(x_ent) + self.gate_num_lit(x_lit_num) + self.gate_txt_lit(x_lit_txt) + self.gate_bias)

So I would say that merely merging numeric and textual data in an external representation is not enough; the gating mechanism should be extended to reach the best performance with the model. If you are working with textual literals only, just transforming them into a numeric representation should be totally fine.

Hopefully this helps you in some way. 😉

@Rodrigo-A-Pereira
Contributor

Hi,
I have an implementation of DistMultLiteral with the gating mechanism mentioned by @moritzblum on my fork of PyKEEN (based on the code from the official LiteralE repo). Should I make a PR?

@cthoyt
Member

cthoyt commented Sep 14, 2021

We’d definitely welcome a request. Have you seen our implementation of the literal models? I’m curious what the differences are (besides the gating function)

@Rodrigo-A-Pereira
Contributor

We’d definitely welcome a request. Have you seen our implementation of the literal models? I’m curious what the differences are (besides the gating function)

In terms of the model architecture, the only difference is the gating mechanism (see equation 4 of the LiteralE paper: the current PyKEEN implementation covers the "h" term, and the rest of the equation is the gating mechanism).
In terms of the implementation in the PyKEEN library, only a few changes were needed, mainly in the combinations.py module: the forward method of the RealCombination class no longer concatenates the entity embedding and the literal, since it is more convenient for the gate to receive them separately (given that there are specific linear transformations for each of them in the operation).

I'll try to make a PR in the next few hours 😊

@jas-ho
Contributor

jas-ho commented Jun 3, 2022

Actually, I think we do not need them. There is even a TODO in the code:

@fix_dataclass_init_docs
@dataclass
class MultimodalInstances(Instances):
    """Triples and mappings to their indices as well as multimodal data."""

    #: TODO: do we need these?
    numeric_literals: Mapping[str, np.ndarray]
    literals_to_id: Mapping[str, int]

@mali-git any reason why we should keep them?

The MultimodalInstances class mentioned above was removed in bbd69546445f45fb815ebbdc5f213f0ba1f5a834, right? If I wanted to create a multi-modal (including text) dataset for PyKEEN now, which base class should I use? Sorry in case that's obvious to you guys 😅

@mberr
Member

mberr commented Jun 3, 2022

This somewhat depends on how you want to use the text.

If you encode the text (e.g. using SBERT) in a preprocessing step, you could use TriplesNumericLiteralsFactory.

A somewhat more flexible variant is described here, where you create representations for the text with a trainable encoder; note that this requires significantly more computation and memory.

@jas-ho
Contributor

jas-ho commented Jun 3, 2022

Hmm, I see. Doesn't the linked description of label-based representations assume that the text I want to use is the node IDs? What if the text I want to use is not, and cannot be, the node ID?

For example, I might want to add information from a "note" attribute in the original knowledge graph into the representation where the "note" field might not be unique.

@mberr
Member

mberr commented Jun 3, 2022

entity_representations = LabelBasedTransformerRepresentation.from_triples_factory(
    triples_factory=dataset.training,
)

creates representations from the entity labels. In some cases, e.g. for Countries, these are meaningful; in other cases, e.g. for CodexSmall, they are just some IDs.

However, we can also directly use the constructor of LabelBasedTransformerRepresentation to create representations with any labels, which may come from other sources; e.g., for an example with two entities:

entity_representations = LabelBasedTransformerRepresentation(labels=["brazil", "uk"])

Note that the order has to match the internal IDs PyKEEN uses (the index in the sorted list of entity names). So you may instead use the following:

from operator import itemgetter

from pykeen.datasets import get_dataset
from pykeen.nn.representation import LabelBasedTransformerRepresentation

dataset = get_dataset(dataset="nations")
notes = {
    "brazil": "Brazil, a sunny country in South America",
    "uk": "United Kingdom, a rainy country in Europe",
    # ...
}
labels = [notes[key] for (key, _id) in sorted(dataset.entity_to_id.items(), key=itemgetter(1))]
entity_representations = LabelBasedTransformerRepresentation(labels=labels)
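The sorting trick can be checked in isolation: sorting entity_to_id.items() by the ID value yields labels aligned with PyKEEN's internal indices. The toy mapping below is made up for illustration:

```python
from operator import itemgetter

# Hypothetical mapping; real mappings come from a triples factory.
entity_to_id = {"uk": 1, "brazil": 0, "china": 2}
labels = [label for (label, _id) in sorted(entity_to_id.items(), key=itemgetter(1))]
# labels == ["brazil", "uk", "china"]: index i holds the label of entity ID i
```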

The following code would then train a model with a DistMult interaction function and directly trainable relation embeddings:

from pykeen.pipeline import pipeline
from pykeen.models import ERModel

result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="distmult",
        entity_representations=entity_representations,
        relation_representation_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
)

@cthoyt
Member

cthoyt commented Jul 4, 2022

I believe this is now implemented as of #966, #969, and #964

@cthoyt cthoyt closed this as completed Jul 4, 2022