Apparent bug for torchtext.data.vocab.Vocab causing collisions between OOV words and in vocab words #447
When using torchtext version 0.3.1, I noticed a rather alarming bug affecting torchtext.vocab.Vocab that causes collisions between OOV words and non-OOV words. When a vocabulary is instantiated with specials_first=True, there is a collision between OOV words and the first special word.
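For illustration, here is a minimal sketch of that collision. It is not an exact reproduction: it assumes a torchtext build whose Vocab constructor accepts the specials_first keyword (not every release does) and whose stoi falls back to _default_unk_index (i.e. index 0) for unknown tokens, and the tokens, counts, and specials are made up.

```python
from collections import Counter
from torchtext.vocab import Vocab

counter = Counter({"hello": 3, "world": 2})
# specials chosen for illustration; '<pad>' ends up at index 0
v = Vocab(counter, specials=["<pad>"], specials_first=True)

print(v.itos)                  # ['<pad>', 'hello', 'world']
print(v.stoi["<pad>"])         # 0
print(v.stoi["never_seen"])    # 0 -- OOV falls back to _default_unk_index(), per the behaviour described above
# The OOV word and the first special ('<pad>') now share index 0.
```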
Similarly, when specials_first=False, there is a collision between OOV words and the highest-frequency word in the vocabulary.
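Continuing the same illustrative setup (same assumptions and made-up counts as above):

```python
from collections import Counter
from torchtext.vocab import Vocab

counter = Counter({"hello": 3, "world": 2})
v = Vocab(counter, specials=["<pad>"], specials_first=False)

print(v.itos)                  # ['hello', 'world', '<pad>']
print(v.stoi["hello"])         # 0
print(v.stoi["never_seen"])    # 0 -- OOV again falls back to index 0
# The OOV word collides with 'hello', the highest-frequency in-vocabulary word.
```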
The culprit appears to be the following line of vocab.py:

self.stoi.update({tok: i for i, tok in enumerate(self.itos)})
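If I read vocab.py correctly, stoi is a collections.defaultdict whose default factory is _default_unk_index (which returns 0), so the collision can be shown in isolation, independent of torchtext:

```python
from collections import defaultdict

def _default_unk_index():
    return 0

itos = ['<pad>', 'hello', 'world']
stoi = defaultdict(_default_unk_index)
stoi.update({tok: i for i, tok in enumerate(itos)})  # the line quoted above

print(stoi['never_seen'])  # 0 -- unknown key triggers the default factory
print(stoi['<pad>'])       # 0 -- same index as the OOV lookup: the collision
```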
I assume this behavior is unintended. If so, it's a relatively easy fix, assuming that _default_unk_index will never be changed to output anything other than 0.