Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apparent bug for torchtext.data.vocab.Vocab causing collisions between OOV words and in vocab words. #447

Closed
nryant opened this issue Oct 12, 2018 · 2 comments

Comments

@nryant
Copy link

nryant commented Oct 12, 2018

When using torchtext version 0.3.1, I notice a rather alarming bug affecting torchtext.vocab.Vocab that causes collisions between OOV words and non-OOV words. When a vocabulary is instantiated with specials_first=True, there is a collision between OOV words and the first special word:

from collections import Counter
import torchtext
word_freqs = Counter({'the' : 10, 'apple' : 5, 'cat' : 2})
word_vocab = torchtext.vocab.Vocab(word_freqs, specials=['<pad>'], specials_first=True)
for word in ['<pad>', 'dfsddfsdfds']:
   print(word, word_vocab.stoi[word])

which yields:

('<pad>', 0)
('dfsddfsdfds', 0)

Similarly, when specials_first=False, there is a collision between OOV words and the highest frequency word in the vocabulary:

word_vocab = torchtext.vocab.Vocab(word_freqs, specials=['<pad>'], specials_first=False)
for word in ['the', 'dfsddfsdfds']:
    print(word, word_vocab.stoi[word])
('the', 0)
('dfsddfsdfds', 0)

The culprit appears to be the following line of vocab.py:

self.stoi.update({tok: i for i, tok in enumerate(self.itos)})

I assume this behavior is unintended. If so, it's a relatively easy fix assuming that _default_unk_index sill never be changed to output anything other than 0.

@mttk
Copy link
Contributor

mttk commented Oct 16, 2018

Yeah this is not the nicest bug. Assuming UNK=0 is ok since it's already hardcoded (so we can hardcode it further). I'll fix this when I get the time, if someone wants to take a stab at it, feel free.

@mttk
Copy link
Contributor

mttk commented Jan 31, 2019

Should be fixed via #482

@mttk mttk closed this as completed Jan 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants