Batch does not carry index #84
Comments
I think in general changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy Doc objects, there's currently only a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. But even with the missing method, the code isn't too bad. I'll show usage where the tokenization is provided by e.g. NLTK, to show how this can be used without the rest of spaCy's stuff:

import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV
def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov setter.
    # We also assign a setter for is_oov.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but NLTK
        # destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text-string, and whether a
        # space followed. The tokenizer cannot change the text, only decide where
        # to split. We also don't throw away characters other than ' '. This means
        # we never lose alignment.
        spaces = align_tokens(words, text)
        return Doc(vocab, words=words, spaces=spaces)

    return tokenize
def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # Produces a numpy array of uint64 values.
    # You could also export LENGTH, SPACE. Then the cumulative sum of both columns
    # will give you the starting index of the token in the string.
    array = doc.to_array([NORM])
    return array
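The align_tokens helper used above isn't defined in the snippet. Here is a minimal sketch of what it could look like, assuming the tokenizer only splits the text and never rewrites characters (NLTK's word_tokenize does rewrite quotes, which is the alignment problem alluded to above, so a real version would need extra handling); the function name simply mirrors the call in the snippet:

def align_tokens(words, text):
    # Walk the original string, locate each token in order, and record whether it
    # is followed by a single space -- the `spaces` flags that
    # Doc(vocab, words=words, spaces=spaces) expects.
    spaces = []
    offset = 0
    for word in words:
        start = text.index(word, offset)  # ValueError here means the token was rewritten
        end = start + len(word)
        spaces.append(end < len(text) and text[end] == ' ')
        offset = end
    return spaces

# e.g. align_tokens(['Hello', ',', 'world'], 'Hello, world') -> [False, True, False]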
So I really like the spaCy tokenizer/doc API, but if we went that route we'd probably need to be able to reconstruct the original text. In general, integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!
+1, I don't think that this is the right thing to do at the moment.
Tokenization is fully reversible if you have the words plus the trailing-space information, as in the Doc example above. So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
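As a concrete illustration (a sketch, not part of either library's API; the detokenize name is made up here), the words-plus-spaces representation from the snippet above reconstructs the original string exactly:

def detokenize(words, spaces):
    # Reattach each token's trailing space; the result is byte-for-byte the
    # string the tokenizer was given.
    return ''.join(word + (' ' if space else '') for word, space in zip(words, spaces))

assert detokenize(['Hello', ',', 'world'], [False, True, False]) == 'Hello, world'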
I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy. While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.
For passing a gradient back to PyTorch, you can call backward() on the output tensor and pass the externally computed gradient in as its argument, rather than reducing to a scalar loss first.
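A minimal sketch of that in PyTorch (the shapes and the Linear sub-model are purely illustrative; the point is that Tensor.backward accepts a gradient argument, so no scalar loss is needed):

import torch

model = torch.nn.Linear(4, 2)      # stand-in for any PyTorch sub-model
x = torch.randn(3, 4)
y = model(x)                       # forward pass inside a larger, non-PyTorch pipeline

# Suppose the outer framework hands us d_loss/d_y with the same shape as y.
d_y = torch.randn(3, 2)

# Push the external gradient into the PyTorch graph.
y.backward(gradient=d_y)

print(model.weight.grad.shape)     # parameter gradients are now populated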
Use Case:
replace_unk: most strategies for replacing tokens rely on aligning with the source sequence before numericalization.
Problem:
Using the Batch object, you are unable to retrieve the original text before padding and numericalization. There are no indexes stored with the batch to look the original text up in the dataset.
Quick workaround:
Define an 'index' field in the dataset. While building your dataset, pass in the index of each item.
Batch will then expose an index attribute you can look up, as in the sketch below.
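Here is a sketch of that workaround against the classic torchtext.data API (later moved to torchtext.legacy); the field names and toy data are illustrative:

from torchtext import data

TEXT = data.Field(sequential=True, lower=True)
INDEX = data.Field(sequential=False, use_vocab=False)  # plain integers, no vocab

texts = ['the quick brown fox', 'jumped over the lazy dog']
fields = [('index', INDEX), ('text', TEXT)]
examples = [data.Example.fromlist([i, t], fields) for i, t in enumerate(texts)]
dataset = data.Dataset(examples, fields)

TEXT.build_vocab(dataset)
iterator = data.Iterator(dataset, batch_size=2, sort=False, repeat=False)

for batch in iterator:
    # batch.index recovers each example's position in the original dataset, so the
    # raw (pre-padding, pre-numericalization) text can be looked up again.
    for i in batch.index.tolist():
        print(i, texts[i])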