Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revtok integration: reversible fields and subwords #107

Merged
merged 34 commits into from
Oct 10, 2017
Merged

Conversation

jekbradbury
Copy link
Contributor

@jekbradbury jekbradbury commented Sep 4, 2017

Integrates revtok, a simple fully-reversible tokenizer that optionally supports subwords. Adds the ReversibleField.reverse API to go from a padded batch of token IDs to a list of detokenized sentences. Speed is not great right now but I'm working on an implementation in a fast language and will add an optional wrapper for that in the Python revtok when it's ready.

Added a first stab at a test (want to release revtok to pypi before I add it to requirements.txt and the test can actually run); will add more.

Usage example:

>>> from torchtext import data, datasets
>>> TEXT = data.SubwordField()
>>> LABEL = data.Field(sequential=False)
>>> train, dev, test = datasets.SNLI.splits(TEXT, LABEL)
>>> LABEL.build_vocab(train)
>>> TEXT.build_vocab(train, max_size=2000) # currently takes about 8 minutes
>>> train[1].hypothesis
['\ue302 a ', ' person ', ' is ', ' at ', ' a ', ' diner ', ' ordering ', ' an ', ' omelette ', '. ']
>>> TEXT.segment(train, dev, test) # currently takes about 6 minutes
>>> train[1].hypothesis
['\ue302 a ', ' person ', ' is ', ' at ', ' a ', ' di', 'ner ', ', ', ' ', 'or', 'd', 'ering ', ' an ', ' o', 'm', 'el', 'e', 't', 'te ', '. ']
>>> train_iter, dev_iter, test_iter = data.BucketIterator.splits((train, dev, test), batch_size=16)
>>> b = next(iter(train))
>>> TEXT.reverse(b.premise.data)
['In a park a kid is chasing pigeons and two men are walking', 'A basketball player is hanging onto the rim while the ball is in the basket', 'An old man is speaking in a brown fedora and blue jacket.', ...]

@bmccann bmccann self-requested a review September 9, 2017 19:36
LABEL.build_vocab(cooked)
TEXT.build_vocab(cooked, max_size=100)
TEXT.segment(cooked)
batch = next(iter(data.Iterator(cooked, 1)))

This comment was marked as off-topic.

This comment was marked as off-topic.

TEXT.build_vocab(cooked, max_size=100)
TEXT.segment(cooked)
batch = next(iter(data.Iterator(cooked, 1)))
self.assertEqual(TEXT.reverse(batch.text)[0], raw[0].text)

This comment was marked as off-topic.

This comment was marked as off-topic.

print("Please install revtok.")
raise
if not self.batch_first:
batch.t_()

This comment was marked as off-topic.

This comment was marked as off-topic.

LABEL = data.Field(sequential=False)
RAW = data.Field(sequential=False, use_vocab=False)
raw, = TREC.splits(RAW, LABEL, train='TREC_10.label',
validation=None, test=None)

This comment was marked as off-topic.

Copy link
Contributor

@bmccann bmccann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments where I think the test breaks right now. -- (changed the test in ac4f794)

It looks like the counts in the SubwordVocab are handled differently for special tokens (no subtract). Why is that?

Can you add comments in the subword field and vocab with a short description of how the handling differs from normal fields and vocabs and why?

I'm reading through revotk right now, but until then I think once the test is working this looks like a nice addition. We should probably wait until after the most recent wave of vocab additions and improvements go through before merging though.

self.assertEqual(TEXT.reverse(batch.text.data)[0], raw[0].text)

def test_subword_snli(self):
TEXT = data.SubwordField()

This comment was marked as off-topic.

@bmccann bmccann force-pushed the reversible branch 2 times, most recently from 2c0487e to bf9a490 Compare September 11, 2017 15:29
@jekbradbury
Copy link
Contributor Author

jekbradbury commented Oct 6, 2017

Notes from CI:

@jekbradbury
Copy link
Contributor Author

I think I messed up somewhere in a rebase, so the push build is failing on Python 2 due to a failure to round-trip the capitalization thing even though the PR build is succeeding and the branch claims to be up to date with master...

@jekbradbury jekbradbury changed the title [WIP] Revtok integration: reversible fields and subwords Revtok integration: reversible fields and subwords Oct 10, 2017
@jekbradbury jekbradbury merged commit 4f5801b into master Oct 10, 2017
@windweller
Copy link

I installed revtok by pip install revtok...maybe the version is not correct?
This error was thrown:

  File "text_classifier.py", line 298, in <module>
    TEXT = data.ReversibleField(sequential=True, tokenize=tokenizer, lower=True)
  File "/torchtext/data/field.py", line 324, in __init__
    super(ReversibleField, self).__init__(**kwargs)
  File "/torchtext/data/field.py", line 148, in __init__
    self.tokenize = get_tokenizer(tokenize)
  File "/torchtext/data/utils.py", line 32, in get_tokenizer
    import revtok
  File "/revtok/__init__.py", line 1, in <module>
    from vocab.vocab import Vocab, OutOfVocabularyException
ImportError: No module named vocab.vocab

Is the revtok on pip is not the correct version, how should we install it??

@jekbradbury
Copy link
Contributor Author

jekbradbury commented Nov 15, 2017

Currently you need to install from source https://github.com/jekbradbury/revtok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants