
Improve vectorization of novelty computation #51

Merged
cthoyt merged 64 commits into master from improve-novelty-computation on Aug 12, 2020

Conversation

@mberr (Member) commented Jul 12, 2020

This PR improves the vectorization of novelty computation for predict_heads / predict_tails.

Fixes #49

Still to do:

  • Provide a fast implementation for scoring/sorting all possible triples (@mberr)
  • Provide a tutorial in the documentation about making predictions (@cthoyt)

mberr mentioned this pull request Jul 12, 2020
mberr requested review from lvermue and cthoyt July 12, 2020 09:21
Review threads on src/pykeen/models/base.py and tests/test_pipeline.py (outdated, resolved)
Also found a bug where novelty columns were added whenever a testing map was present, even if novelties were not requested. Fixed by improving the conditional.
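A sketch of what that conditional fix might look like (hypothetical code; `add_novelty_columns` is an illustrative helper name, not the actual function in the PR):

# before (hypothetical): novelty info was added whenever a testing map existed
if testing is not None:
    add_novelty_columns(df, testing)

# after: only add it when novelties were actually requested
if add_novelties and testing is not None:
    add_novelty_columns(df, testing)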
cthoyt marked this pull request as draft July 21, 2020 10:59
@cthoyt (Member) commented Jul 21, 2020

@mberr I just added a todo for each of us. I think we should include a specific implementation for scoring all possible triples. Hopefully there's a fast way we can do this when we know we're doing it for everything.

@mberr (Member Author) commented Jul 21, 2020

@cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.
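(The count is num_entities² × num_relations; with roughly 15,000 entities and 237 relations for FB15k-237, a quick sanity check in Python reproduces the figure:)

# back-of-the-envelope count of all (head, relation, tail) combinations
num_entities = 15_000   # FB15k-237 has roughly this many entities
num_relations = 237
print(num_entities ** 2 * num_relations)  # 53325000000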

@cthoyt (Member) commented Jul 21, 2020

> @cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.

Hmm, good point. I'll just write a short tutorial on how somebody could go about doing that, if they wanted to.

@mberr (Member Author) commented Jul 21, 2020

> > @cthoyt Scoring all possible triples quickly becomes infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.
>
> Hmm, good point. I'll just write a short tutorial on how somebody could go about doing that, if they wanted to.

An easy option would be to use e.g. predict_scores_all_tails in a huge for-loop:

import torch

batch_size = 16
for r in range(model.num_relations):
    for e in range(0, model.num_entities, batch_size):
        # batch of head entity IDs
        hs = torch.arange(e, min(e + batch_size, model.num_entities), device=model.device)
        # pair each head with the current relation; shape: (batch_size, 2)
        hr_batch = torch.stack([hs, torch.full_like(hs, fill_value=r)], dim=-1)
        # scores for all possible tails; shape: (batch_size, num_entities)
        t_scores = model.predict_scores_all_tails(hr_batch=hr_batch)

Disclaimer: Written in browser, so no guarantees 😉

@mberr (Member Author) commented Jul 22, 2020

@cthoyt I added an implementation to get the top-k triples. While this is still prohibitively expensive in terms of required computation for reasonably sized datasets, it fixes the memory issue of needing to store all scores.
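For illustration only (this is not the PR's actual implementation), a minimal sketch of the idea: keep a running top-k buffer while iterating over (head, relation) batches, so that at most k + batch_size · num_entities scores are held at any time. Batching and predict_scores_all_tails usage follow the earlier snippet.

import torch

def top_k_triples_sketch(model, k: int = 10, batch_size: int = 16):
    """Hypothetical sketch: running top-k over all possible triples."""
    top_scores = torch.empty(0)                        # running best scores (on CPU)
    top_triples = torch.empty(0, 3, dtype=torch.long)  # their (h, r, t) IDs
    tails = torch.arange(model.num_entities, device=model.device)
    with torch.no_grad():
        for r in range(model.num_relations):
            for e in range(0, model.num_entities, batch_size):
                hs = torch.arange(e, min(e + batch_size, model.num_entities), device=model.device)
                hr_batch = torch.stack([hs, torch.full_like(hs, r)], dim=-1)
                scores = model.predict_scores_all_tails(hr_batch=hr_batch)  # (b, num_entities)
                # materialize the (h, r, t) IDs for this block, matching the row-major flatten
                h = hs.repeat_interleave(model.num_entities)
                t = tails.repeat(hs.shape[0])
                triples = torch.stack([h, torch.full_like(h, r), t], dim=-1)
                # merge the block with the running buffer and keep only the best k
                merged_scores = torch.cat([top_scores, scores.flatten().cpu()])
                merged_triples = torch.cat([top_triples, triples.cpu()])
                top_scores, ind = merged_scores.topk(k=min(k, merged_scores.shape[0]))
                top_triples = merged_triples[ind]
    return top_triples, top_scores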

@mberr (Member Author) commented Jul 22, 2020

The proposed implementation is model-agnostic.

For specific interaction functions, we could design more efficient ways to compute the highest-scoring triples (e.g., using nearest-neighbor search in embedding space for distance-based models). These solutions would, however, be specific to an interaction function.
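For instance (a hypothetical sketch, not part of this PR): for a TransE-style model, where score(h, r, t) = -||h + r - t||, the best tails for a given (h, r) are exactly the nearest neighbors of h + r in entity-embedding space:

import torch

def top_tails_transe(entity_emb: torch.Tensor, h_emb: torch.Tensor, r_emb: torch.Tensor, k: int):
    """Hypothetical: top-k tails for (h, r) under a TransE-style distance score."""
    query = (h_emb + r_emb).unsqueeze(0)               # (1, dim)
    dist = torch.cdist(query, entity_emb).squeeze(0)   # (num_entities,)
    # smallest distance corresponds to highest score
    return dist.topk(k=k, largest=False)

With an approximate nearest-neighbor index, this would avoid scoring every candidate explicitly, at the cost of being tied to one interaction function.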

@mberr (Member Author) commented Jul 22, 2020

Regarding the unit test: I put it into the DistMult unit test, since the predict_top_k_triples method relies on predict_scores_all_tails, which is tested for all models; testing predict_top_k_triples for all models may get expensive (in particular for slower models such as ConvKB).

mberr requested a review from cthoyt July 22, 2020 08:07
@cthoyt (Member) left a comment

The result of model.predict_top_k_triples(5) gives back only a tensor. It would be nice to also be able to return this as a dataframe, including the labels for the entities as well as the scores themselves. Using the same implementation, it should also be possible to make k optional and return all triples. This might also motivate a name change for this function.

>>> model.predict_top_k_triples(5)
tensor([[ 0, 40, 12],
        [ 0, 37, 12],
        [ 0, 27, 12],
        [ 0, 53, 12],
        [ 0, 15, 12]])

@mberr (Member Author) commented Jul 23, 2020

> The result of model.predict_top_k_triples(5) gives back only a tensor. It would be nice to also be able to return this as a dataframe, including the labels for the entities as well as the scores themselves.
> [...]
>
> >>> model.predict_top_k_triples(5)
> tensor([[ 0, 40, 12],
>         [ 0, 37, 12],
>         [ 0, 27, 12],
>         [ 0, 53, 12],
>         [ 0, 15, 12]])

What about keeping everything ID-based, and having one method which converts a tensor of ID-based triples to the desired dataframe? Using IDs ensures everything stays fast.
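A minimal sketch of such a converter (hypothetical signature; the entity_id_to_label / relation_id_to_label mapping arguments are assumptions, though the traceback further below shows the PR ended up with a make_labeled_df method along these lines):

import pandas as pd
import torch

def triples_to_df(triples: torch.LongTensor, entity_id_to_label: dict,
                  relation_id_to_label: dict, scores: torch.Tensor = None) -> pd.DataFrame:
    """Hypothetical: convert an (n, 3) tensor of ID-based triples to a labeled dataframe."""
    df = pd.DataFrame(triples.cpu().numpy(), columns=["head_id", "relation_id", "tail_id"])
    df["head_label"] = df["head_id"].map(entity_id_to_label)
    df["relation_label"] = df["relation_id"].map(relation_id_to_label)
    df["tail_label"] = df["tail_id"].map(entity_id_to_label)
    if scores is not None:
        df["score"] = scores.detach().cpu().numpy()
    return df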

> Using the same implementation, it should also be possible to make k optional and return all triples. This might also motivate a name change for this function.

I still do not see a real use case where returning all triples is desirable, except maybe on some dummy KGs. That being said, you can easily make k optional and return all triples by modifying

# reduce size if necessary
if result.shape[0] > k:
    scores, ind = scores.topk(k=k, largest=True, sorted=False)
    result = result[ind]

so that this is only performed if k is provided.
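i.e., roughly (a sketch under the same assumptions about result and scores):

# reduce size only if k was actually provided
if k is not None and result.shape[0] > k:
    scores, ind = scores.topk(k=k, largest=True, sorted=False)
    result = result[ind]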

cthoyt and others added 10 commits August 12, 2020 02:02
  • @mberr please confirm this change makes sense
  • Simplify code by using the fact that torch.sort returns the sorted array *and* the indices (see the short example after this list).
  • - unit tests written
    - some of the implementation written (abstracted some of the code from the previous implementation for sided predictions)
    - I left a todo for @mberr to implement get_novelty_all_mask()
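The torch.sort simplification mentioned above refers to the fact that a single call yields both the sorted values and the permutation indices:

import torch

scores = torch.tensor([0.3, 0.9, 0.1])
values, indices = torch.sort(scores, descending=True)
# values:  tensor([0.9000, 0.3000, 0.1000])
# indices: tensor([1, 0, 2])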
@mberr (Member Author) commented Aug 12, 2020

@cthoyt: @lvermue will also take a look at it and see how it can be merged with the filtering code from evaluation 🙂

@cthoyt (Member) commented Aug 12, 2020

Great! Do we want to hold the PR for that, though? Or do another one later?

@mberr (Member Author) commented Aug 12, 2020

> Great! Do we want to hold the PR for that, though? Or do another one later?

I added a Python-only quick-and-dirty variant, so we can merge this one without introducing unnecessary delay; if you go for computing scores for all triples, you may face performance issues anyway 😉

@mberr (Member Author) commented Aug 12, 2020

These notebooks really blow up the diff-count 😕

@cthoyt (Member) commented Aug 12, 2020

@mberr I just got this error from the following code:

from pykeen.pipeline import pipeline
result = pipeline(
    dataset='Nations',
    model='RotatE',
    random_seed=1235,
    device='cpu',
    training_kwargs=dict(num_epochs=100), 
)
model = result.model
model.score_all_triples()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-62bc458439cd> in <module>
      1 # Score all triples
----> 2 model.score_all_triples()

~/dev/pykeen/src/pykeen/models/base.py in score_all_triples(self, k, batch_size, return_tensors, add_novelties, remove_known, testing)
    716                     'knowledge graphs.'
    717                 )
--> 718                 return self._score_all_triples(
    719                     batch_size=batch_size,
    720                     return_tensors=return_tensors,

~/dev/pykeen/src/pykeen/models/base.py in _score_all_triples(self, batch_size, return_tensors, add_novelties, remove_known, testing)
    653 
    654         rv = self.make_labeled_df(triples, score=scores)
--> 655         return _postprocess_prediction_all_df(
    656             df=rv,
    657             add_novelties=add_novelties,

~/dev/pykeen/src/pykeen/models/base.py in _postprocess_prediction_all_df(df, add_novelties, remove_known, training, testing)
    154     if add_novelties or remove_known:
    155         assert training is not None
--> 156         df['in_training'] = ~get_novelty_all_mask(
    157             mapped_triples=training,
    158             query=df[['head_id', 'relation_id', 'tail_id']].values,

~/dev/pykeen/src/pykeen/models/base.py in get_novelty_all_mask(mapped_triples, query)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116 
    117 

~/dev/pykeen/src/pykeen/models/base.py in <listcomp>(.0)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116 
    117 

TypeError: unhashable type: 'numpy.ndarray'

I think q needs to be tuple(q)

@cthoyt (Member) commented Aug 12, 2020

Doing tuple(q) fixed it!
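For reference, the fixed helper as suggested (a sketch reconstructed from the traceback; note that np.bool is deprecated in newer NumPy, so plain bool is used here):

import numpy as np
import torch

def get_novelty_all_mask(mapped_triples: torch.LongTensor, query: np.ndarray) -> np.ndarray:
    known = {tuple(triple) for triple in mapped_triples.tolist()}
    # tuple(q): numpy rows are unhashable arrays, so convert before the set lookup
    return np.asarray([tuple(q) in known for q in query], dtype=bool)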

Inline review comment on:

        :return: Parallel arrays of triples and scores
        """
        # initialize buffer on cpu
        scores = torch.empty(self.num_relations, self.num_entities, self.num_entities, dtype=torch.float32)
Member: should we specify the device here?

Member Author: From the documentation:

> torch.empty(*size, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False, pin_memory=False) → Tensor

and

> device (torch.device, optional) – the desired device of returned tensor. Default: if None, uses the current device for the default tensor type (see torch.set_default_tensor_type()). device will be the CPU for CPU tensor types and the current CUDA device for CUDA tensor types.

So setting it to cpu should be the safer option.
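i.e., making the choice explicit (a sketch of the same allocation as in the excerpt above, with the device pinned):

# initialize buffer explicitly on CPU, independent of the default tensor type
scores = torch.empty(
    self.num_relations, self.num_entities, self.num_entities,
    dtype=torch.float32,
    device=torch.device("cpu"),
)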

cthoyt merged commit ed6b8b6 into master Aug 12, 2020
cthoyt deleted the improve-novelty-computation branch August 12, 2020 19:07
Development

Successfully merging this pull request may close these issues:

  • device mismatch between triples and model when model is in inference mode

4 participants