
Batch does not carry index #84

Closed · PetrochukM opened this issue Aug 3, 2017 · 5 comments

PetrochukM (Contributor) commented Aug 3, 2017

Use case:
replace_unk: most strategies for replacing unknown tokens rely on aligning the output with the source sequence as it was before numericalization.

Problem:
Using the Batch object, you cannot retrieve the original text as it was before padding and numericalization.
No indexes are stored with the batch that would let you look the original text up in the dataset.

Quick workaround:
Define an 'index' field in the dataset, and pass in the index of each item while building the dataset.

The Batch will then expose an index attribute you can look up, as in the sketch below.
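
A minimal sketch of this workaround, assuming the 2017-era torchtext Field/Example/Dataset API; raw_lines and the field names are placeholders:

import torchtext.data as data

TEXT = data.Field(lower=True)
# sequential=False + use_vocab=False keeps the raw integer: no tokenization,
# no vocab lookup, no padding.
INDEX = data.Field(sequential=False, use_vocab=False)

fields = [('text', TEXT), ('index', INDEX)]
examples = [data.Example.fromlist([line, i], fields)
            for i, line in enumerate(raw_lines)]
dataset = data.Dataset(examples, fields)
TEXT.build_vocab(dataset)

for batch in data.Iterator(dataset, batch_size=32, repeat=False):
    # batch.index points back into dataset.examples, so the original
    # (pre-padding, pre-numericalization) text is recoverable.
    originals = [dataset.examples[i].text for i in batch.index.data.tolist()]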

PetrochukM changed the title from "Batch loses index" to "Batch does not carry index" on Aug 3, 2017
honnibal commented Aug 6, 2017

I think that, in general, changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy Doc objects, you could have a better solution.

The spaCy Doc object holds a TokenC* array, and each TokenC struct holds a const pointer to a LexemeC. The lexemes are vocabulary items, and they have a number of integer fields. This means you can register string transforms that are computed once over the vocabulary, with the results available to each lexical item.

There's currently a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. Even with the missing method, though, the code isn't too bad. Here's an example where the tokenization is provided by NLTK, to demonstrate how this can be used without the rest of spaCy's pipeline:

import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV

def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov getter.
    # We also assign a getter for IS_OOV.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but NLTK
        # destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text string and whether a
        # space followed it. The tokenizer cannot change the text, only decide
        # where to split, and no characters other than ' ' are thrown away, so
        # alignment is never lost.
        spaces = align_tokens(words, text)  # helper sketched below
        return Doc(vocab, words=words, spaces=spaces)
    return tokenize

def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # doc.to_array produces a numpy array of uint64 values.
    # You could also export LENGTH and SPACY; the cumulative sum of both columns
    # gives the starting index of each token in the string.
    array = doc.to_array([NORM])
    return array
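
For completeness, a rough sketch of the align_tokens helper the snippet above assumes (it is not part of spaCy). It records, for each token, whether a single space follows it in the original string; it will fail for tokens that NLTK rewrites (e.g. quote characters), which is exactly the alignment problem mentioned in the comments:

def align_tokens(words, text):
    spaces = []
    pos = 0
    for word in words:
        # Find the token in the original string, then check for a trailing space.
        pos = text.index(word, pos) + len(word)
        spaces.append(pos < len(text) and text[pos] == ' ')
    return spaces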

jekbradbury (Contributor) commented:

So I really like the spaCy tokenizer/Doc API, but if we went that route we'd probably need to be able to reconstruct Doc objects from the output of (e.g.) Seq2Seq decoders. We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

So what I'm trying to do now is build a fully reversible tokenizer -- one that works for any language with spaces (and any other language when coupled with a BPE algorithm) -- and allow the raw source text to be fully reconstructed from a list of indices, whether those are indices from data or indices from a model. Check out revtok in the reversible branch; I'll add the subword stuff soon.

The overall idea is to augment the existing Field API with a ReversibleField subclass that additionally exposes inverses of all the processing methods; I think that will solve the main question here. (If you just want a batch of tags to let you retrieve something extra/nontextual from the original data sample, like an image, then include that tag as a vocabless Field.)
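
As a toy illustration of the reversibility property described above (not revtok's actual algorithm, which also splits punctuation and will handle subwords), the trick is to keep the whitespace information inside the tokens themselves, so that a plain join reconstructs the source text exactly:

def reversible_tokenize(text):
    # Every token that was followed by a space keeps a trailing ' '; runs of
    # extra spaces become empty-word tokens, so nothing is thrown away.
    pieces = text.split(' ')
    return [piece + ' ' for piece in pieces[:-1]] + [pieces[-1]]

def detokenize(tokens):
    return ''.join(tokens)

assert detokenize(reversible_tokenize('hello  world !')) == 'hello  world !'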

In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!

nelson-liu (Contributor) commented:

> In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption.

+1, I don't think that this is the right thing to do at the moment.

honnibal commented Aug 7, 2017

> We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you would double the number of entries in the vocab in theory. Of course the extra bit introduces little extra entropy given the word ID.

So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
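
A minimal sketch of that (orth_id, has_space) round trip, using spaCy v2-style imports; only the tokenizer is needed, not a statistical model:

from spacy.lang.en import English
from spacy.attrs import ORTH, SPACY
from spacy.tokens import Doc

nlp = English()  # blank pipeline: tokenizer + vocab only
doc = nlp("Tokenization is reversible, isn't it?")

# One row per token: (orth_id, has_space), both stored as uint64.
pairs = doc.to_array([ORTH, SPACY])
words = [nlp.vocab.strings[int(orth)] for orth, has_space in pairs]
spaces = [bool(has_space) for orth, has_space in pairs]

rebuilt = Doc(nlp.vocab, words=words, spaces=spaces)
assert rebuilt.text == doc.text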

> In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!

I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy.

While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.

jekbradbury (Contributor) commented:

For passing a gradient back to PyTorch, var.backward has an optional gradient argument that allows you to inject a gradient at a specific place in the computation graph. If you want to inject several gradients, you can use torch.autograd.backward((var_1, var_2), (grad_1, grad_2)), I believe.
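
A minimal sketch of that gradient-injection pattern, written against the current tensor API rather than the 2017 Variable API (the keyword is grad_tensors on torch.autograd.backward):

import torch

x = torch.randn(4, 3, requires_grad=True)
y = torch.tanh(x)               # some PyTorch sub-graph

# A gradient with respect to y, e.g. handed over by thinc or another library.
dy = torch.randn(4, 3)

# Inject it directly instead of calling .backward() on a scalar loss.
y.backward(dy)
print(x.grad.shape)             # torch.Size([4, 3])

# Several outputs at once:
# torch.autograd.backward((y1, y2), grad_tensors=(g1, g2))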
