
Support for text on Literal models #470

Closed
sntcristian opened this issue May 28, 2021 · 18 comments
Labels
enhancement New feature or request

Comments

@sntcristian

I am currently working on a project on knowledge graph embeddings, and I want to test the efficiency of models that use literals to enrich the embeddings (i.e., LiteralE models).
I have seen that your library already provides implementations of DistMult and ComplEx that use information from numeric literals, and I was wondering whether you have thought about also supporting textual literals, as in https://github.com/SmartDataAnalytics/LiteralE.
The issue is that the repository mentioned above is no longer maintained and lacks utilities such as hyperparameter optimization that you already offer in your suite.

@sntcristian sntcristian added the enhancement New feature or request label May 28, 2021
@mberr
Member

mberr commented May 28, 2021

I am not very familiar with LiteralE, but from here it looks like the textual literals are encoded in a preprocessing step. From there on, you can treat them similarly to numeric literals (each entity is associated with a GloVe(?) vector).

A bit more involved would be to train the text encoder in an end-to-end fashion. For this, you would need to create a custom RepresentationModule comprising the textual encoder.

@sntcristian
Author

I think it's not just a matter of preprocessing. Looking at the difference between your implementation of DistMultLiteral and the class "DistMult_Literal_gate_text" here (row 516), I can see that their model enriches the representation of entities with information from both numeric and textual vectors associated with them, and that this interaction is managed through a gating mechanism. Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

@cthoyt
Member

cthoyt commented May 29, 2021

I agree with Max - if you want to use textual literals, it would be much more elegant to convert them to numeric values during preprocessing (either one-hot encodings if you consider them categorical, or a text embedding like GloVe/word2vec or any pretrained language model). Perhaps @mberr we can provide a representation module that uses BERT to encode text literals for use with one of the literal models.
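For concreteness, the one-hot route can be sketched without any PyKEEN machinery at all. This is a minimal illustration; the function name and the toy entity/literal data are made up:

```python
import numpy as np


def one_hot_literals(entity_to_value, entity_to_id):
    """Encode categorical text literals as one-hot feature vectors.

    entity_to_value: maps entity label -> categorical literal (a string)
    entity_to_id:    maps entity label -> integer entity ID
    """
    # One column per distinct category, in sorted order for determinism.
    categories = sorted(set(entity_to_value.values()))
    cat_to_idx = {c: i for i, c in enumerate(categories)}
    features = np.zeros((len(entity_to_id), len(categories)), dtype=np.float32)
    for label, value in entity_to_value.items():
        features[entity_to_id[label], cat_to_idx[value]] = 1.0
    return features


# Toy data for illustration only.
features = one_hot_literals(
    {"brazil": "south_america", "uk": "europe", "france": "europe"},
    {"brazil": 0, "uk": 1, "france": 2},
)
# features has shape (3, 2): one row per entity, one column per category
```

The resulting matrix can then be used anywhere a numeric literal matrix is expected.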

Also note that the DistMultLiteral and ComplExLiteral models in PyKEEN are both subclasses of the more general LiteralE model. You probably noticed that hacking on this isn't so easy from the end-user functions like the pipeline, but if you want to dig into the code, it is quite extensible.

@mberr
Member

mberr commented May 29, 2021

Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

Yes, this is right. Currently we do not have an implementation that combines three representations (learned embedding + numeric + textual). If you are up for writing your own class, you can easily extend this, taking inspiration from the existing model that combines two representations. We would also be very happy to accept a PR for this if you are willing 🙂
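To make the three-way extension concrete, here is a sketch of a gated combination in plain PyTorch, loosely following the LiteralE gate formulation. The class and attribute names are hypothetical and not part of the PyKEEN API:

```python
import torch
from torch import nn


class GateTriple(nn.Module):
    """Sketch: gate a learned entity embedding with numeric and textual
    literal vectors (illustrative, not the PyKEEN implementation)."""

    def __init__(self, emb_dim, num_lit_dim, txt_lit_dim):
        super().__init__()
        # One linear transformation per input, as in the LiteralE gate.
        self.g_ent = nn.Linear(emb_dim, emb_dim, bias=False)
        self.g_num = nn.Linear(num_lit_dim, emb_dim, bias=False)
        self.g_txt = nn.Linear(txt_lit_dim, emb_dim, bias=False)
        self.gate_bias = nn.Parameter(torch.zeros(emb_dim))
        # Candidate enriched representation from the concatenated inputs.
        self.h = nn.Linear(emb_dim + num_lit_dim + txt_lit_dim, emb_dim)

    def forward(self, x_ent, x_num, x_txt):
        gate = torch.sigmoid(
            self.g_ent(x_ent) + self.g_num(x_num) + self.g_txt(x_txt) + self.gate_bias
        )
        h = torch.tanh(self.h(torch.cat([x_ent, x_num, x_txt], dim=-1)))
        # The gate interpolates between the enriched representation and
        # the original embedding, so the model can ignore the literals.
        return gate * h + (1 - gate) * x_ent


combiner = GateTriple(emb_dim=4, num_lit_dim=3, txt_lit_dim=5)
out = combiner(torch.randn(2, 4), torch.randn(2, 3), torch.randn(2, 5))
# out has the same shape as the entity embedding input: (2, 4)
```

A class like this could serve as the combination module of an extended literal model.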

@sntcristian
Author

Perhaps @mberr we can provide a representation module that uses BERT to encode text literals for use with one of the literal models.

That would be very interesting for me. In any case, I'll take your suggestions and try to extend the code as far as I can.

@mberr
Member

mberr commented May 30, 2021

I guess the easiest way is to use the bert-as-service library, cf. https://github.com/hanxiao/bert-as-service#building-a-qa-semantic-search-engine-in-3-minutes

Here is a simple example where you use the entity labels as the only text information. While this might already be quite useful, having additional text information (description texts, etc.) might give even better results.

import torch
from bert_serving.client import BertClient
from pykeen.datasets import get_dataset

bc = BertClient()
triples_factory = get_dataset(dataset="nations").training   # modify for your triples factory
dim = 768   # depends on layer, pooling of BERT encoder, cf. configuration of bert serving 
text_features = torch.empty(triples_factory.num_entities, dim)
for entity_label, entity_id in triples_factory.entity_to_id.items():
    text_features[entity_id] = torch.as_tensor(data=bc.encode([entity_label])[0], dtype=torch.get_default_dtype()).view(-1)

Disclaimer: written in browser, may contain typos or syntax errors.

@sntcristian
Author

sntcristian commented May 30, 2021

Yes, I was developing similar code for an ad hoc class called "TriplesTextualLiteralsFactory", which I will pass as input to an extended DistMultLiteral model. The issue is that, as you can see in the following code, I cannot create a "literals_to_id" dictionary, since I do not one-hot encode the values but instead use fixed-dimension BERT embeddings.
Do you think I need "literals_to_id" for the training instances?

def create_matrix_of_literals(
    textual_triples: np.ndarray,
    entity_to_id: EntityMapping,
) -> np.ndarray:
    # Prepare the literal matrix: initialize every literal to zero, then fill
    # in the corresponding embedding where one is available.
    txt_literals = np.zeros([len(entity_to_id), 768], dtype=np.float32)
    model = SentenceTransformer('allenai-specter')
    # TODO: vectorize this loop
    for h, r, lit in textual_triples:
        try:
            embeddings = model.encode([lit])
            # The row index identifies the entity; set its literal vector.
            txt_literals[entity_to_id[h], :] = embeddings[0]
        except KeyError:
            logger.info("Entity %s does not exist in the entity mapping.", h)
            continue

    return txt_literals

@mberr
Member

mberr commented May 30, 2021

Actually, I think we do not need them. There is even a TODO in the code:

@fix_dataclass_init_docs
@dataclass
class MultimodalInstances(Instances):
    """Triples and mappings to their indices as well as multimodal data."""

    #: TODO: do we need these?
    numeric_literals: Mapping[str, np.ndarray]
    literals_to_id: Mapping[str, int]

@mali-git any reason why we should keep them?

@cthoyt
Member

cthoyt commented May 30, 2021

The entire premise of the numeric triples factory is questionable now that we have the extensible representations. I'm a bit woozy from my vaccine this morning, but I could look into cutting that out completely.

@moritzblum

I think it's not just a matter of preprocessing. Looking at the difference between your implementation of DistMultLiteral and the class "DistMult_Literal_gate_text" here (row 516), I can see that their model enriches the representation of entities with information from both numeric and textual vectors associated with them, and that this interaction is managed through a gating mechanism. Looking at your code for DistMultLiteral, I see that you provide an enriched representation of entities through a linear transformation, but this does not allow using information from textual and numeric literals simultaneously.

I'm working on a similar topic as @sntcristian and found this discussion.

Without having studied the original LiteralE code completely, I would say the gating mechanism you mention is actually not for combining textual and numeric vectors. It combines an embedding and a literal vector via a gating mechanism, as in RNNs, for example. This allows the model to learn "whether the additional literal information is useful or not, and adapt accordingly, e.g. by incorporating or ignoring that information" (Kristiadi et al.). The gating mechanism you mention is defined later, starting at line 705, and it just extends the gate by self.gate_txt_lit(x_lit_txt).
Gate:
gate = self.gate_activation(self.g1(x_ent) + self.g2(x_lit) + self.gate_bias)
GateMulti:
gate = self.gate_activation(self.gate_ent(x_ent) + self.gate_num_lit(x_lit_num) + self.gate_txt_lit(x_lit_txt) + self.gate_bias)

So I would say that merely merging numeric and textual data in an external representation is not enough; the gating mechanism should be extended to reach the best performance with the model. If you are working with textual literals only, just transforming them into a numeric representation should be totally fine.

Hopefully this helps you in some way. 😉

@Rodrigo-A-Pereira
Contributor

Hi,
I have an implementation of DistMultLiteral with the gating mechanism mentioned by @moritzblum on my fork of PyKEEN (based on the code from the official LiteralE repo). Should I make a PR?

@cthoyt
Member

cthoyt commented Sep 14, 2021

We’d definitely welcome a request. Have you seen our implementation of the literal models? I’m curious what the differences are (besides the gating function)

@Rodrigo-A-Pereira
Contributor

We’d definitely welcome a request. Have you seen our implementation of the literal models? I’m curious what the differences are (besides the gating function)

In terms of the model architecture, the only difference is the gating mechanism (see equation 4 of the LiteralE paper: the current PyKEEN implementation covers the "h" term, and the rest of the equation is the gating mechanism).
In terms of the implementation in the PyKEEN library, only a few changes were needed, mainly in the combinations.py module: the forward method of the RealCombination class no longer concatenates the entity embedding and the literal, since it is more convenient for the gate to receive them separately (given that there are specific linear transformations for each of them in the operation).

I'll try to make a PR in the next few hours 😊

@jas-ho
Contributor

jas-ho commented Jun 3, 2022

Actually, I think we do not need them. There is even a TODO in the code:

@fix_dataclass_init_docs
@dataclass
class MultimodalInstances(Instances):
    """Triples and mappings to their indices as well as multimodal data."""

    #: TODO: do we need these?
    numeric_literals: Mapping[str, np.ndarray]
    literals_to_id: Mapping[str, int]

@mali-git any reason why we should keep them?

The MultimodalInstances class mentioned above was removed in bbd69546445f45fb815ebbdc5f213f0ba1f5a834, right? If I wanted to create a multi-modal (including text) dataset for PyKEEN now, which base class should I use? Sorry in case that's obvious to you guys 😅

@mberr
Member

mberr commented Jun 3, 2022

This somewhat depends on how you want to use the text.

If you encode the text (e.g. using SBERT) in a preprocessing step, you could use TriplesNumericLiteralsFactory.

A somewhat more flexible variant is described here, where you create representations for the text with a trainable encoder; note that this requires significantly more computation and memory.

@jas-ho
Contributor

jas-ho commented Jun 3, 2022

Hmm, I see. Doesn't the linked description of label-based representations assume that the text I want to use is the node IDs? What if the text I want to use is not, and cannot be, the node ID?

For example, I might want to add information from a "note" attribute in the original knowledge graph into the representation where the "note" field might not be unique.

@mberr
Member

mberr commented Jun 3, 2022

entity_representations = LabelBasedTransformerRepresentation.from_triples_factory(
    triples_factory=dataset.training,
)

creates representations from the entity labels. In some cases, e.g. for Countries, these are meaningful; in other cases, e.g. for CodexSmall, they are just some IDs.

However, we can also directly use the constructor of LabelBasedTransformerRepresentation to create representations with any labels, which may come from other sources; e.g., for an example with two entities:

entity_representations = LabelBasedTransformerRepresentation(labels=["brazil", "uk"])

Note that the order has to match the internal IDs PyKEEN uses (the index in the sorted list of entity names). So you may instead use the following:

from operator import itemgetter

from pykeen.datasets import get_dataset
from pykeen.nn.representation import LabelBasedTransformerRepresentation

dataset = get_dataset(dataset="nations")
notes = {
    "brazil": "Brazil, a sunny country in South America",
    "uk": "United Kingdom, a rainy country in Europe",
    # ...
}
labels = [notes[key] for (key, _id) in sorted(dataset.entity_to_id.items(), key=itemgetter(1))]
entity_representations = LabelBasedTransformerRepresentation(labels=labels)
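The sorting trick can be checked in isolation: sorting entity_to_id.items() by the ID value yields labels aligned with PyKEEN's internal indices. The toy mapping below is made up for illustration:

```python
from operator import itemgetter

# Hypothetical mapping; real mappings come from a triples factory.
entity_to_id = {"uk": 1, "brazil": 0, "china": 2}
labels = [label for (label, _id) in sorted(entity_to_id.items(), key=itemgetter(1))]
# labels == ["brazil", "uk", "china"]: index i holds the label of entity ID i
```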

The following code would then train a model with a DistMult interaction function and directly trainable relation embeddings:

from pykeen.pipeline import pipeline
from pykeen.models import ERModel

result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="distmult",
        entity_representations=entity_representations,
        relation_representation_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
)

@cthoyt
Member

cthoyt commented Jul 4, 2022

I believe this is now implemented as of #966, #969, and #964

@cthoyt cthoyt closed this as completed Jul 4, 2022