
Improve vectorization of novelty computation #51

Merged
cthoyt merged 64 commits into master from improve-novelty-computation on Aug 12, 2020

Conversation

@mberr (Member) commented Jul 12, 2020

This PR improves the vectorization of novelty computation for predict_heads / predict_tails.

Fixes #49

Still to do:

  • Provide a fast implementation for scoring/sorting all possible triples (@mberr)
  • Provide a tutorial in the documentation about making predictions (@cthoyt)

mberr mentioned this pull request Jul 12, 2020
mberr requested review from lvermue and cthoyt July 12, 2020 09:21
Review threads on src/pykeen/models/base.py and tests/test_pipeline.py (outdated, resolved)
Also found a bug where novelty columns were added whenever a testing map was present, even if novelties were not requested. Fixed by improving the conditional.
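A sketch of what that conditional fix might look like (hypothetical code; `add_novelty_columns` is an illustrative helper name, not the actual function in the PR):

# before (hypothetical): novelty info was added whenever a testing map existed
if testing is not None:
    add_novelty_columns(df, testing)

# after: only add it when novelties were actually requested
if add_novelties and testing is not None:
    add_novelty_columns(df, testing)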
cthoyt marked this pull request as draft July 21, 2020 10:59
@cthoyt (Member) commented Jul 21, 2020

@mberr I just added a todo for each of us. I think we should include a specific implementation for scoring all possible triples. Hopefully there's a fast way we can do this when we know we're doing it for everything.

@mberr (Member Author) commented Jul 21, 2020

@cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.
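(The count is num_entities² × num_relations; with roughly 15,000 entities and 237 relations for FB15k-237, a quick sanity check in Python reproduces the figure:)

# back-of-the-envelope count of all (head, relation, tail) combinations
num_entities = 15_000   # FB15k-237 has roughly this many entities
num_relations = 237
print(num_entities ** 2 * num_relations)  # 53325000000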

@cthoyt (Member) commented Jul 21, 2020

> @cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.

Hmm, good point. I'll just write a short tutorial on how somebody could go about doing that, if they wanted to.

@mberr (Member Author) commented Jul 21, 2020

> > @cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.
>
> Hmm, good point. I'll just write a short tutorial on how somebody could go about doing that, if they wanted to.

An easy option would be to use e.g. predict_scores_all_tails in a huge for-loop:

import torch

batch_size = 16
for r in range(model.num_relations):
    for e in range(0, model.num_entities, batch_size):
        # batch of head entity IDs
        hs = torch.arange(e, min(e + batch_size, model.num_entities), device=model.device)
        # pair each head with the current relation; shape: (batch_size, 2)
        hr_batch = torch.stack([hs, torch.full_like(hs, fill_value=r)], dim=-1)
        # scores for all possible tails; shape: (batch_size, num_entities)
        t_scores = model.predict_scores_all_tails(hr_batch=hr_batch)

Disclaimer: Written in browser, so no guarantees 😉

@mberr (Member Author) commented Jul 22, 2020

@cthoyt I added an implementation to get the top-k triples. While this is still prohibitively expensive in terms of required computation for reasonably sized datasets, it fixes the memory issue of needing to store all scores.
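For illustration only (this is not the PR's actual implementation), a minimal sketch of the idea: keep a running top-k buffer while iterating over (head, relation) batches, so that at most k + batch_size · num_entities scores are held at any time. Batching and predict_scores_all_tails usage follow the earlier snippet.

import torch

def top_k_triples_sketch(model, k: int = 10, batch_size: int = 16):
    """Hypothetical sketch: running top-k over all possible triples."""
    top_scores = torch.empty(0)                        # running best scores (on CPU)
    top_triples = torch.empty(0, 3, dtype=torch.long)  # their (h, r, t) IDs
    tails = torch.arange(model.num_entities, device=model.device)
    with torch.no_grad():
        for r in range(model.num_relations):
            for e in range(0, model.num_entities, batch_size):
                hs = torch.arange(e, min(e + batch_size, model.num_entities), device=model.device)
                hr_batch = torch.stack([hs, torch.full_like(hs, r)], dim=-1)
                scores = model.predict_scores_all_tails(hr_batch=hr_batch)  # (b, num_entities)
                # materialize the (h, r, t) IDs for this block, matching the row-major flatten
                h = hs.repeat_interleave(model.num_entities)
                t = tails.repeat(hs.shape[0])
                triples = torch.stack([h, torch.full_like(h, r), t], dim=-1)
                # merge the block with the running buffer and keep only the best k
                merged_scores = torch.cat([top_scores, scores.flatten().cpu()])
                merged_triples = torch.cat([top_triples, triples.cpu()])
                top_scores, ind = merged_scores.topk(k=min(k, merged_scores.shape[0]))
                top_triples = merged_triples[ind]
    return top_triples, top_scores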

@mberr (Member Author) commented Jul 22, 2020

The proposed implementation is model-agnostic.

For specific interaction functions, we could design more efficient ways to compute the highest-scoring triples (e.g., using nearest-neighbor search in embedding space for distance-based models). These solutions would, however, be specific to an interaction function.
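For instance (a hypothetical sketch, not part of this PR): for a TransE-style model, where score(h, r, t) = -||h + r - t||, the best tails for a given (h, r) are exactly the nearest neighbors of h + r in entity-embedding space:

import torch

def top_tails_transe(entity_emb: torch.Tensor, h_emb: torch.Tensor, r_emb: torch.Tensor, k: int):
    """Hypothetical: top-k tails for (h, r) under a TransE-style distance score."""
    query = (h_emb + r_emb).unsqueeze(0)               # (1, dim)
    dist = torch.cdist(query, entity_emb).squeeze(0)   # (num_entities,)
    # smallest distance corresponds to highest score
    return dist.topk(k=k, largest=False)

With an approximate nearest-neighbor index, this would avoid scoring every candidate explicitly, at the cost of being tied to one interaction function.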

@mberr (Member Author) commented Jul 22, 2020

Regarding the unit test: I put it into the DistMult unit test, since the predict_top_k_triples method relies on predict_scores_all_tails, which is tested for all models; testing predict_top_k_triples for all models may get expensive (in particular for slower models such as ConvKB).

mberr requested a review from cthoyt July 22, 2020 08:07
@cthoyt (Member) left a comment

The result of model.predict_top_k_triples(5) gives back only a tensor. It would be nice to also be able to return this as a dataframe, including the labels for the entities as well as the scores themselves. Using the same implementation, it should also be possible to make k optional and return all triples. This might also motivate a name change for this function.

>>> model.predict_top_k_triples(5)
tensor([[ 0, 40, 12],
        [ 0, 37, 12],
        [ 0, 27, 12],
        [ 0, 53, 12],
        [ 0, 15, 12]])

@mberr (Member Author) commented Jul 23, 2020

> The result of model.predict_top_k_triples(5) gives back only a tensor. It would be nice to also be able to return this as a dataframe, including the labels for the entities as well as the scores themselves.
> [...]
>
> >>> model.predict_top_k_triples(5)
> tensor([[ 0, 40, 12],
>         [ 0, 37, 12],
>         [ 0, 27, 12],
>         [ 0, 53, 12],
>         [ 0, 15, 12]])

What about keeping everything ID-based, and having one method which converts a tensor of ID-based triples to the desired dataframe? Using IDs ensures everything stays fast.
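A minimal sketch of such a converter (hypothetical signature; the entity_id_to_label / relation_id_to_label mapping arguments are assumptions, though the traceback further below shows the PR ended up with a make_labeled_df method along these lines):

import pandas as pd
import torch

def triples_to_df(triples: torch.LongTensor, entity_id_to_label: dict,
                  relation_id_to_label: dict, scores: torch.Tensor = None) -> pd.DataFrame:
    """Hypothetical: convert an (n, 3) tensor of ID-based triples to a labeled dataframe."""
    df = pd.DataFrame(triples.cpu().numpy(), columns=["head_id", "relation_id", "tail_id"])
    df["head_label"] = df["head_id"].map(entity_id_to_label)
    df["relation_label"] = df["relation_id"].map(relation_id_to_label)
    df["tail_label"] = df["tail_id"].map(entity_id_to_label)
    if scores is not None:
        df["score"] = scores.detach().cpu().numpy()
    return df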

> Using the same implementation, it should also be possible to make k optional and return all triples. This might also motivate a name change for this function.

I still do not see a real use case where returning all triples is desirable, except maybe on some dummy KGs. That being said, you can easily make k optional and return all triples by modifying

# reduce size if necessary
if result.shape[0] > k:
    scores, ind = scores.topk(k=k, largest=True, sorted=False)
    result = result[ind]

so that this is only performed if k is provided.
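i.e., roughly (a sketch under the same assumptions about result and scores):

# reduce size only if k was actually provided
if k is not None and result.shape[0] > k:
    scores, ind = scores.topk(k=k, largest=True, sorted=False)
    result = result[ind]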

cthoyt and others added 10 commits August 12, 2020 02:02
  • @mberr please confirm this change makes sense
  • Simplify code by using the fact that torch.sort returns the sorted array *and* the indices (see the short example after this list).
  • - unit tests written
    - some of the implementation written (abstracted some of the code from the previous implementation for sided predictions)
    - I left a todo for @mberr to implement get_novelty_all_mask()
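The torch.sort simplification mentioned above refers to the fact that a single call yields both the sorted values and the permutation indices:

import torch

scores = torch.tensor([0.3, 0.9, 0.1])
values, indices = torch.sort(scores, descending=True)
# values:  tensor([0.9000, 0.3000, 0.1000])
# indices: tensor([1, 0, 2])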
@mberr (Member Author) commented Aug 12, 2020

@cthoyt: @lvermue will also take a look at it and see how it can be merged with the filtering code from evaluation 🙂

@cthoyt (Member) commented Aug 12, 2020

Great! Do we want to hold the PR for that, though? Or do another one later?

@mberr (Member Author) commented Aug 12, 2020

> Great! Do we want to hold the PR for that, though? Or do another one later?

I added a Python-only quick-and-dirty variant, so we can merge this one without introducing unnecessary delay; if you go for computing scores for all triples, you may face performance issues anyway 😉

@mberr (Member Author) commented Aug 12, 2020

These notebooks really blow up the diff-count 😕

@cthoyt (Member) commented Aug 12, 2020

@mberr I just got this error from the following code:

from pykeen.pipeline import pipeline
result = pipeline(
    dataset='Nations',
    model='RotatE',
    random_seed=1235,
    device='cpu',
    training_kwargs=dict(num_epochs=100), 
)
model = result.model
model.score_all_triples()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-62bc458439cd> in <module>
      1 # Score all triples
----> 2 model.score_all_triples()

~/dev/pykeen/src/pykeen/models/base.py in score_all_triples(self, k, batch_size, return_tensors, add_novelties, remove_known, testing)
    716                     'knowledge graphs.'
    717                 )
--> 718                 return self._score_all_triples(
    719                     batch_size=batch_size,
    720                     return_tensors=return_tensors,

~/dev/pykeen/src/pykeen/models/base.py in _score_all_triples(self, batch_size, return_tensors, add_novelties, remove_known, testing)
    653 
    654         rv = self.make_labeled_df(triples, score=scores)
--> 655         return _postprocess_prediction_all_df(
    656             df=rv,
    657             add_novelties=add_novelties,

~/dev/pykeen/src/pykeen/models/base.py in _postprocess_prediction_all_df(df, add_novelties, remove_known, training, testing)
    154     if add_novelties or remove_known:
    155         assert training is not None
--> 156         df['in_training'] = ~get_novelty_all_mask(
    157             mapped_triples=training,
    158             query=df[['head_id', 'relation_id', 'tail_id']].values,

~/dev/pykeen/src/pykeen/models/base.py in get_novelty_all_mask(mapped_triples, query)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116 
    117 

~/dev/pykeen/src/pykeen/models/base.py in <listcomp>(.0)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116 
    117 

TypeError: unhashable type: 'numpy.ndarray'

I think q needs to be tuple(q)

@cthoyt (Member) commented Aug 12, 2020

Doing tuple(q) fixed it!
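For reference, the fixed helper as suggested (a sketch reconstructed from the traceback; note that np.bool is deprecated in newer NumPy, so plain bool is used here):

import numpy as np
import torch

def get_novelty_all_mask(mapped_triples: torch.LongTensor, query: np.ndarray) -> np.ndarray:
    known = {tuple(triple) for triple in mapped_triples.tolist()}
    # tuple(q): numpy rows are unhashable arrays, so convert before the set lookup
    return np.asarray([tuple(q) in known for q in query], dtype=bool)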

Inline review comment on:

        :return: Parallel arrays of triples and scores
        """
        # initialize buffer on cpu
        scores = torch.empty(self.num_relations, self.num_entities, self.num_entities, dtype=torch.float32)
Member: should we specify the device here?

Member Author: From the documentation:

> torch.empty(*size, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False, pin_memory=False) → Tensor

and

> device (torch.device, optional) – the desired device of returned tensor. Default: if None, uses the current device for the default tensor type (see torch.set_default_tensor_type()). device will be the CPU for CPU tensor types and the current CUDA device for CUDA tensor types.

So setting it to cpu should be the safer option.
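i.e., making the choice explicit (a sketch of the same allocation as in the excerpt above, with the device pinned):

# initialize buffer explicitly on CPU, independent of the default tensor type
scores = torch.empty(
    self.num_relations, self.num_entities, self.num_entities,
    dtype=torch.float32,
    device=torch.device("cpu"),
)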

cthoyt merged commit ed6b8b6 into master Aug 12, 2020
cthoyt deleted the improve-novelty-computation branch August 12, 2020 19:07
Development

Successfully merging this pull request may close these issues:

  • device mismatch between triples and model when model is in inference mode

4 participants