Improve vectorization of novelty computation #51
Conversation
Also found a bug where, even if you don't want novelties, having a testing map added them anyway. Fixed by improving the conditional.
@mberr I just added a todo for each of us. I think we should include a specific implementation for scoring all possible triples. Hopefully there's a fast way we can do this when we know we're doing it for everything.
@cthoyt Scoring all possible triples quickly gets infeasible: for a smaller dataset such as FB15k-237, we already have 53,325,000,000 possible triples.
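For scale, the count in question is just entities² × relations; a quick sanity check (rounding FB15k-237 to roughly 15,000 entities and 237 relations, as the figure above implies):

```python
# Back-of-the-envelope count of all candidate triples (h, r, t):
# every head/relation/tail combination is a candidate, so the count
# grows quadratically with the number of entities.
num_entities = 15_000   # rough figure for FB15k-237
num_relations = 237

num_possible_triples = num_entities ** 2 * num_relations
print(f"{num_possible_triples:,}")  # 53,325,000,000
```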
Hmm, good point. I'll just write a short tutorial on how somebody could go about doing that, if they wanted.
An easy option would be to use e.g. `batch_size = 16`:

```python
batch_size = 16
for r in range(model.num_relations):
    for e in range(0, model.num_entities, batch_size):
        hs = torch.arange(e, min(e + batch_size, model.num_entities), device=model.device)
        # pair each head with the current relation id; shape: (batch, 2)
        hr_batch = torch.stack([hs, torch.full_like(hs, r)], dim=-1)
        t_scores = model.predict_scores_all_tails(hr_batch=hr_batch)
```

Disclaimer: Written in browser, so no guarantees 😉
@cthoyt I added an implementation to get the top-k triples. While this is still prohibitively expensive in terms of required computation for reasonably sized datasets, it fixed the memory issue of needing to store all scores.
The proposed implementation is model-agnostic. For specific interaction functions, we could design more efficient ways to compute the highest-scoring triples (e.g. using NN search in embedding space for distance-based models). These solutions would, however, be specific to an interaction function.
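The memory-saving idea described above (keep only a running top-k instead of materializing all scores) can be sketched model-agnostically. This is a plain-Python illustration, not the actual pykeen implementation; the batch format here is a made-up example:

```python
import heapq

def running_top_k(score_batches, k):
    """Keep only the k best (score, triple) pairs across an iterable of
    batches, so memory stays O(k + batch) instead of O(all triples)."""
    heap = []  # min-heap; the worst score still kept sits at heap[0]
    for batch in score_batches:
        for score, triple in batch:
            if len(heap) < k:
                heapq.heappush(heap, (score, triple))
            elif score > heap[0][0]:
                # new candidate beats the current worst: swap it in
                heapq.heapreplace(heap, (score, triple))
    return sorted(heap, reverse=True)

batches = [
    [(1.0, (0, 0, 1)), (5.0, (0, 0, 2))],
    [(3.0, (0, 1, 0)), (7.0, (0, 1, 2))],
]
print(running_top_k(batches, k=2))  # [(7.0, (0, 1, 2)), (5.0, (0, 0, 2))]
```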
Regarding the unittest: I put it into the DistMult unittest, since the
The result of `model.predict_top_k_triples(5)` gives back only a tensor:

```python
>>> model.predict_top_k_triples(5)
tensor([[ 0, 40, 12],
        [ 0, 37, 12],
        [ 0, 27, 12],
        [ 0, 53, 12],
        [ 0, 15, 12]])
```

It would be nice to make it possible to also return this as a dataframe, including the labels for entities as well as the scores themselves. Using the same implementation, it should also be possible to make `k` optional and return all triples. This might also motivate a name change for this function.
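A labeled view could be produced by joining the ID columns against the label mappings. A rough sketch; the function name and the label-mapping arguments here are illustrative, not the actual pykeen API:

```python
import pandas as pd

def make_labeled_df(triples, scores, entity_labels, relation_labels):
    """Turn an (n, 3) sequence of (head, relation, tail) IDs plus scores
    into a readable DataFrame carrying both IDs and labels."""
    df = pd.DataFrame(triples, columns=["head_id", "relation_id", "tail_id"])
    df["head_label"] = df["head_id"].map(entity_labels)
    df["relation_label"] = df["relation_id"].map(relation_labels)
    df["tail_label"] = df["tail_id"].map(entity_labels)
    df["score"] = scores
    return df

df = make_labeled_df(
    triples=[[0, 40, 12], [0, 37, 12]],
    scores=[0.9, 0.8],
    entity_labels={0: "e0", 12: "e12"},
    relation_labels={40: "r40", 37: "r37"},
)
```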
What about keeping everything ID-based, and having one method which converts a tensor of
I still do not see a real use case where returning all triples is desirable, except maybe on some dummy KGs. That being said, you can easily change `src/pykeen/models/base.py` (lines 612 to 615 at 6bbc8d5) to additionally perform this only if
@mberr please confirm this change makes sense
Simplify code by using the fact that torch.sort returns the sorted array *and* the indices.
- unit tests written
- some of the implementation written (abstracted some of the code from the previous implementation for sided predictions)
- I left a todo for @mberr to implement get_novelty_all_mask()
[skip ci]
Great! Do we want to hold the PR for that, though? Or do another one later?
I added a Python-only quick-and-dirty variant, so we can merge this one without introducing unnecessary delay; when you go for computing scores for all triples, you may face performance issues anyway 😉
These notebooks really blow up the diff count 😕
@mberr I just got this error from the following code:

```python
from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='RotatE',
    random_seed=1235,
    device='cpu',
    training_kwargs=dict(num_epochs=100),
)
model = result.model
model.score_all_triples()
```

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-62bc458439cd> in <module>
      1 # Score all triples
----> 2 model.score_all_triples()

~/dev/pykeen/src/pykeen/models/base.py in score_all_triples(self, k, batch_size, return_tensors, add_novelties, remove_known, testing)
    716                 'knowledge graphs.'
    717             )
--> 718         return self._score_all_triples(
    719             batch_size=batch_size,
    720             return_tensors=return_tensors,

~/dev/pykeen/src/pykeen/models/base.py in _score_all_triples(self, batch_size, return_tensors, add_novelties, remove_known, testing)
    653
    654         rv = self.make_labeled_df(triples, score=scores)
--> 655         return _postprocess_prediction_all_df(
    656             df=rv,
    657             add_novelties=add_novelties,

~/dev/pykeen/src/pykeen/models/base.py in _postprocess_prediction_all_df(df, add_novelties, remove_known, training, testing)
    154     if add_novelties or remove_known:
    155         assert training is not None
--> 156         df['in_training'] = ~get_novelty_all_mask(
    157             mapped_triples=training,
    158             query=df[['head_id', 'relation_id', 'tail_id']].values,

~/dev/pykeen/src/pykeen/models/base.py in get_novelty_all_mask(mapped_triples, query)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116
    117

~/dev/pykeen/src/pykeen/models/base.py in <listcomp>(.0)
    113 ) -> np.ndarray:
    114     known = set(tuple(triple) for triple in mapped_triples.tolist())
--> 115     return np.asarray([(q in known) for q in query], dtype=np.bool)
    116
    117

TypeError: unhashable type: 'numpy.ndarray'
```

I think
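The error comes from the list comprehension: each `q` is a row of a 2-D numpy array, i.e. itself an `ndarray`, which is unhashable, so `q in known` raises. Converting each row to a tuple before the set-membership test fixes it. A sketch of a fixed variant (note it also uses plain `bool`, since `np.bool` is deprecated, and returns `not in known` so the mask actually marks *novel* triples, matching the caller's `~` negation into `in_training`):

```python
import numpy as np

def get_novelty_all_mask(mapped_triples, query):
    """Mark which query triples do NOT occur in mapped_triples.

    Rows of a 2-D numpy array are arrays themselves and therefore not
    hashable; converting each row to a tuple makes the set lookup work.
    """
    known = {tuple(triple) for triple in mapped_triples.tolist()}
    return np.asarray([tuple(q) not in known for q in query], dtype=bool)

mapped = np.array([[0, 0, 1], [1, 0, 2]])
query = np.array([[0, 0, 1], [2, 0, 2]])
print(get_novelty_all_mask(mapped, query))  # [False  True]
```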
Doing
[skip ci]
```python
        :return: Parallel arrays of triples and scores
        """
        # initialize buffer on cpu
        scores = torch.empty(self.num_relations, self.num_entities, self.num_entities, dtype=torch.float32)
```
should we specify the device here?
From the documentation:

```
torch.empty(*size, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False, pin_memory=False) → Tensor
```

and

> device (torch.device, optional) – the desired device of returned tensor. Default: if None, uses the current device for the default tensor type (see torch.set_default_tensor_type()). device will be the CPU for CPU tensor types and the current CUDA device for CUDA tensor types.

So setting it to `cpu` should be the safer option.
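As a concrete illustration of that default, passing the device explicitly pins the buffer to host memory regardless of what the default tensor type has been set to. A minimal sketch, not the PR's code:

```python
import torch

# device="cpu" keeps the buffer on the host even if the default tensor
# type has been switched to a CUDA type elsewhere in the program.
scores = torch.empty(2, 3, 4, dtype=torch.float32, device="cpu")
print(scores.device)  # cpu
```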
This PR improves the vectorization of novelty computation for `predict_heads`/`predict_tails`.

Fixes #49

Still to do: