Batch does not carry index #84
Comments
I think in general changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy Doc objects, there's currently only a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. But even with the missing method, the code isn't too bad. I'll show usage where the tokenization is provided by e.g. NLTK, to show how this can be used without the rest of spaCy's stuff:

import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV
def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov setter.
    # We also assign a setter for is_oov.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but NLTK
        # destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text-string, and whether a
        # space followed. The tokenizer cannot change the text, only decide where
        # to split. We also don't throw away characters other than ' '. This means
        # we never lose alignment.
        spaces = align_tokens(words, text)
        return Doc(vocab, words=words, spaces=spaces)

    return tokenize
def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # Produces a numpy array of uint64 values.
    # You could also export LENGTH, SPACE. Then the cumulative sum of both columns
    # will give you the starting index of the token in the string.
    array = doc.to_array([NORM])
    return array
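The align_tokens helper used above isn't defined in the snippet. Here is a minimal sketch of what it could look like, assuming the tokenizer only splits the text and never rewrites characters (NLTK's word_tokenize does rewrite quotes, which is the alignment problem alluded to above, so a real version would need extra handling); the function name simply mirrors the call in the snippet:

def align_tokens(words, text):
    # Walk the original string, locate each token in order, and record whether it
    # is followed by a single space -- the `spaces` flags that
    # Doc(vocab, words=words, spaces=spaces) expects.
    spaces = []
    offset = 0
    for word in words:
        start = text.index(word, offset)  # ValueError here means the token was rewritten
        end = start + len(word)
        spaces.append(end < len(text) and text[end] == ' ')
        offset = end
    return spaces

# e.g. align_tokens(['Hello', ',', 'world'], 'Hello, world') -> [False, True, False]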
So I really like the spaCy tokenizer/doc API, but if we went that route we'd probably need to be able to reconstruct the original text. In general, integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!
+1, I don't think that this is the right thing to do at the moment.
Tokenization is fully reversible if you have the words plus the trailing-space information, as in the Doc example above. So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
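As a concrete illustration (a sketch, not part of either library's API; the detokenize name is made up here), the words-plus-spaces representation from the snippet above reconstructs the original string exactly:

def detokenize(words, spaces):
    # Reattach each token's trailing space; the result is byte-for-byte the
    # string the tokenizer was given.
    return ''.join(word + (' ' if space else '') for word, space in zip(words, spaces))

assert detokenize(['Hello', ',', 'world'], [False, True, False]) == 'Hello, world'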
I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy. While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.
For passing a gradient back to PyTorch, you can call backward() on the output tensor and pass the externally computed gradient in as its argument, rather than reducing to a scalar loss first.
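A minimal sketch of that in PyTorch (the shapes and the Linear sub-model are purely illustrative; the point is that Tensor.backward accepts a gradient argument, so no scalar loss is needed):

import torch

model = torch.nn.Linear(4, 2)      # stand-in for any PyTorch sub-model
x = torch.randn(3, 4)
y = model(x)                       # forward pass inside a larger, non-PyTorch pipeline

# Suppose the outer framework hands us d_loss/d_y with the same shape as y.
d_y = torch.randn(3, 2)

# Push the external gradient into the PyTorch graph.
y.backward(gradient=d_y)

print(model.weight.grad.shape)     # parameter gradients are now populated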
Use Case:
replace_unk: most strategies for replacing tokens rely on aligning with the source sequence before numericalization.
Problem:
Using the Batch object, you are unable to retrieve the original text before padding and numericalization. There are no indexes stored with the batch to look the original text up in the dataset.
Quick workaround:
Define an 'index' field in the dataset. While building your dataset, pass in the index of each item.
Batch will then expose an index attribute you can look up, as in the sketch below.
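Here is a sketch of that workaround against the classic torchtext.data API (later moved to torchtext.legacy); the field names and toy data are illustrative:

from torchtext import data

TEXT = data.Field(sequential=True, lower=True)
INDEX = data.Field(sequential=False, use_vocab=False)  # plain integers, no vocab

texts = ['the quick brown fox', 'jumped over the lazy dog']
fields = [('index', INDEX), ('text', TEXT)]
examples = [data.Example.fromlist([i, t], fields) for i, t in enumerate(texts)]
dataset = data.Dataset(examples, fields)

TEXT.build_vocab(dataset)
iterator = data.Iterator(dataset, batch_size=2, sort=False, repeat=False)

for batch in iterator:
    # batch.index recovers each example's position in the original dataset, so the
    # raw (pre-padding, pre-numericalization) text can be looked up again.
    for i in batch.index.tolist():
        print(i, texts[i])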