Revtok integration: reversible fields and subwords #107

jekbradbury · 2017-09-04T07:31:33Z

Integrates revtok, a simple fully-reversible tokenizer that optionally supports subwords. Adds the ReversibleField.reverse API to go from a padded batch of token IDs to a list of detokenized sentences. Speed is not great right now but I'm working on an implementation in a fast language and will add an optional wrapper for that in the Python revtok when it's ready.

Added a first stab at a test (I want to release revtok to PyPI before adding it to requirements.txt, so that the test can actually run); will add more.

Usage example:

>>> from torchtext import data, datasets
>>> TEXT = data.SubwordField()
>>> LABEL = data.Field(sequential=False)
>>> train, dev, test = datasets.SNLI.splits(TEXT, LABEL)
>>> LABEL.build_vocab(train)
>>> TEXT.build_vocab(train, max_size=2000) # currently takes about 8 minutes
>>> train[1].hypothesis
['\ue302 a ', ' person ', ' is ', ' at ', ' a ', ' diner ', ', ', ' ordering ', ' an ', ' omelette ', '. ']
>>> TEXT.segment(train, dev, test) # currently takes about 6 minutes
>>> train[1].hypothesis
['\ue302 a ', ' person ', ' is ', ' at ', ' a ', ' di', 'ner ', ', ', ' ', 'or', 'd', 'ering ', ' an ', ' o', 'm', 'el', 'e', 't', 'te ', '. ']
>>> train_iter, dev_iter, test_iter = data.BucketIterator.splits((train, dev, test), batch_size=16)
>>> b = next(iter(train_iter))
>>> TEXT.reverse(b.premise.data)
['In a park a kid is chasing pigeons and two men are walking', 'A basketball player is hanging onto the rim while the ball is in the basket', 'An old man is speaking in a brown fedora and blue jacket.', ...]
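The token lists above hint at how the round trip works: each token keeps its own leading and trailing spaces, and '\ue302' appears to mark that the following character is capitalized. Below is a minimal sketch of a detokenizer under those assumptions. The adjacency rule, the `CAP` constant, and the `toy_detokenize` helper are inferred from the example output for illustration only; this is not revtok's actual code.

```python
CAP = "\ue302"  # assumed: marks that the next character is capitalized


def toy_detokenize(tokens):
    """Join space-annotated tokens back into a sentence.

    Assumed rule: a space is emitted between two neighbouring tokens only
    when the left one ends with a space AND the right one starts with one.
    """
    out = []
    prev_trailing = False
    capitalize_next = False
    for tok in tokens:
        if tok.startswith(CAP):
            capitalize_next = True
            tok = tok[len(CAP):]
        leading = tok.startswith(" ")
        trailing = tok.endswith(" ")
        core = tok.strip(" ")
        if out and prev_trailing and leading:
            out.append(" ")
        if core and capitalize_next:
            core = core[0].upper() + core[1:]
            capitalize_next = False
        out.append(core)
        prev_trailing = trailing
    return "".join(out)


segmented = ['\ue302 a ', ' person ', ' is ', ' at ', ' a ', ' di', 'ner ',
             ', ', ' ', 'or', 'd', 'ering ', ' an ', ' o', 'm', 'el', 'e',
             't', 'te ', '. ']
print(toy_detokenize(segmented))
# A person is at a diner, ordering an omelette.
```

Because subword pieces such as ' di' and 'ner ' carry no space on their inner edges, the same rule glues them back into one word without any extra bookkeeping.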

test/test_subword.py

+        LABEL.build_vocab(cooked)
+        TEXT.build_vocab(cooked, max_size=100)
+        TEXT.segment(cooked)
+        batch = next(iter(data.Iterator(cooked, 1)))


test/test_subword.py

+        TEXT.build_vocab(cooked, max_size=100)
+        TEXT.segment(cooked)
+        batch = next(iter(data.Iterator(cooked, 1)))
+        self.assertEqual(TEXT.reverse(batch.text)[0], raw[0].text)


torchtext/data/field.py

+            print("Please install revtok.")
+            raise
+        if not self.batch_first:
+            batch.t_()
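The `batch.t_()` call in this hunk transposes a sequence-first batch so that each row holds one example before decoding. Here is a rough pure-Python sketch of that flow, using nested lists in place of tensors. The `reverse_batch` helper, the `itos` list, and the `pad_id` and `eos_id` values are hypothetical stand-ins, not the code in field.py.

```python
def reverse_batch(batch, itos, pad_id, eos_id, batch_first=False):
    """Turn a padded batch of token IDs into one token list per example."""
    if not batch_first:
        # [seq_len, batch] -> [batch, seq_len], the list analogue of batch.t_()
        batch = [list(column) for column in zip(*batch)]
    sentences = []
    for row in batch:
        tokens = []
        for idx in row:
            if idx == eos_id:
                break  # everything after end-of-sentence is padding
            if idx == pad_id:
                continue
            tokens.append(itos[idx])
        sentences.append(tokens)
    return sentences


itos = ["<unk>", "<pad>", "<eos>", " a ", " person ", " is ", " here "]
seq_first = [
    [3, 4],
    [4, 5],
    [2, 6],
    [1, 2],
]
print(reverse_batch(seq_first, itos, pad_id=1, eos_id=2))
# [[' a ', ' person '], [' person ', ' is ', ' here ']]
```

Each resulting token list would then be handed to the detokenizer to produce the final sentence string.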


test/test_subword.py

+        LABEL = data.Field(sequential=False)
+        RAW = data.Field(sequential=False, use_vocab=False)
+        raw, = TREC.splits(RAW, LABEL, train='TREC_10.label',
+                           validation=None, test=None)


bmccann

I added some comments where I think the test breaks right now. -- (changed the test in ac4f794)

It looks like the counts in the SubwordVocab are handled differently for special tokens (no subtract). Why is that?

Can you add comments in the subword field and vocab with a short description of how the handling differs from normal fields and vocabs and why?

I'm reading through revtok right now, but until then I think once the test is working this looks like a nice addition. We should probably wait until after the most recent wave of vocab additions and improvements goes through before merging though.

test/test_subword.py

+        self.assertEqual(TEXT.reverse(batch.text.data)[0], raw[0].text)
+
+    def test_subword_snli(self):
+        TEXT = data.SubwordField()


jekbradbury · 2017-10-06T03:26:18Z

Notes from CI:

- declare encoding on field.py for Python 2
- fix subword TREC test to use CPU device
- fix Vocab tests that explicitly provide specials to also include '<unk>' (and note this as a breaking change that also closes #121, "Adding is_label option in Field to remove unknown label")
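The third note describes a behavior change that a toy example may make clearer. The sketch below assumes that, after this change, '<unk>' ends up in the vocabulary only when the caller lists it in `specials`. The `build_itos` helper is hypothetical and is not torchtext's `Vocab`.

```python
from collections import Counter


def build_itos(counter, specials=("<unk>", "<pad>"), max_size=None):
    """Return an index-to-string list: specials first, then words by frequency."""
    itos = list(specials)
    # Sort alphabetically first so that ties in frequency break deterministically.
    words = sorted(counter.items(), key=lambda kv: kv[0])
    words.sort(key=lambda kv: kv[1], reverse=True)
    for word, _ in words:
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break
        if word not in specials:
            itos.append(word)
    return itos


counts = Counter("the cat sat on the mat the end".split())
with_unk = build_itos(counts, specials=("<unk>", "<pad>", "<eos>"))
without_unk = build_itos(counts, specials=("<pad>", "<eos>"))
print(with_unk[:4])            # ['<unk>', '<pad>', '<eos>', 'the']
print("<unk>" in without_unk)  # False
```

Under that assumption, a label field can pass specials without '<unk>' and get a vocabulary with no unknown entry, which is what #121 asked for.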

jekbradbury · 2017-10-09T18:48:58Z

I think I messed up somewhere in a rebase, so the push build is failing on Python 2 due to a failure to round-trip the capitalization thing even though the PR build is succeeding and the branch claims to be up to date with master...

windweller · 2017-11-13T16:29:35Z

I installed revtok by pip install revtok...maybe the version is not correct?
This error was thrown:

  File "text_classifier.py", line 298, in <module>
    TEXT = data.ReversibleField(sequential=True, tokenize=tokenizer, lower=True)
  File "/torchtext/data/field.py", line 324, in __init__
    super(ReversibleField, self).__init__(**kwargs)
  File "/torchtext/data/field.py", line 148, in __init__
    self.tokenize = get_tokenizer(tokenize)
  File "/torchtext/data/utils.py", line 32, in get_tokenizer
    import revtok
  File "/revtok/__init__.py", line 1, in <module>
    from vocab.vocab import Vocab, OutOfVocabularyException
ImportError: No module named vocab.vocab

If the revtok on pip is not the correct version, how should we install it?

jekbradbury · 2017-11-15T06:16:43Z

Currently you need to install from source: https://github.com/jekbradbury/revtok
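A small guard can surface this problem before a `ReversibleField` is constructed. The sketch below assumes the failure shows up as an ImportError, as in the traceback above; `try_import` is a hypothetical helper name.

```python
import importlib


def try_import(module_name):
    """Return (module, None) on success or (None, error message) on ImportError."""
    try:
        return importlib.import_module(module_name), None
    except ImportError as exc:
        return None, "{}: {}".format(type(exc).__name__, exc)


module, error = try_import("revtok")
if module is None:
    print("revtok is not usable here ({}); install it from source.".format(error))
else:
    print("revtok imported from", getattr(module, "__file__", "<unknown>"))
```

Printing the module's `__file__` also shows which installed copy of revtok Python actually picked up.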

bmccann self-requested a review September 9, 2017 19:36

bmccann reviewed Sep 10, 2017

bmccann reviewed Sep 10, 2017

bmccann reviewed Sep 10, 2017

bmccann suggested changes Sep 10, 2017

bmccann approved these changes Sep 11, 2017

jekbradbury commented Sep 11, 2017

bmccann added 3 commits September 11, 2017 14:40

adding Multi30k wrapper

93189bd

abstracting tar.gz compression

795850d

removing cls.filename

3cacb41

bmccann force-pushed the reversible branch 2 times, most recently from 2c0487e to bf9a490 Compare September 11, 2017 15:29

refactor datasets for downloading

f8b9647

bmccann force-pushed the reversible branch from bf9a490 to 3503466 Compare September 11, 2017 18:02

bug in sst tree examples

e5f16a3

bmccann force-pushed the reversible branch from 95f5f44 to 655e191 Compare September 12, 2017 21:06

bmccann force-pushed the reversible branch from d593591 to b7d3be7 Compare September 21, 2017 23:02

jekbradbury force-pushed the reversible branch 3 times, most recently from fa62a93 to cdd27ca Compare September 30, 2017 00:22

jekbradbury and others added 8 commits October 8, 2017 20:28

starting work on reversible field processing

8033710

strip examples and values

711c5a3

removing snli test

2636e05

making subword vocab serializable

420a991

Add pretrained string aliases to Vocab.load_vectors

1117e75

Multi30k and Dataset download refactor (#116)

34bfbe1

* adding Multi30k wrapper
* abstracting tar.gz compression
* removing cls.filename
* refactor datasets for downloading
* bug in sst tree examples

Make travis slow python3 environment sudo-required (#122)

160a773

limiting strips to newlines

94c6951

bmccann and others added 16 commits October 8, 2017 20:28

iwslt (#126)

6a17d10

allowing IWSLT to download multiple language pairs

11a17bb

more unk fixes and device_of for reverse

b5656dc

simplify segment

bd2d9ca

unknown subword

cd4cfd7

allow list as reversible tokenizer

9fd54ae

moving rstrip to just before tokenization

fd08618

Update README.md to fix #133 (#134)

2cb86b1

Fixed #133

Finish tests for Field class (#119)

2bc3948

Also:
- Add a dataset for testing numeric features (float and int)
- Coerce non-sequential data with use_vocab=False to numeric types

Update and clean up Dataset docstrings (#123)

1a6b8ca

Fix #131 (#135)

c0e967d

add revtok to requirements

36c797b

move test_subword

20ac3e0

add encoding to field test

5a4c8bd

use CPU device for subword test

47edd42

add <unk> explicitly to specials in vocab tests

98cb758

jekbradbury force-pushed the reversible branch from 5f67b40 to 98cb758 Compare October 9, 2017 03:30

jekbradbury added 2 commits October 8, 2017 20:31

Merge branch 'master' into reversible

5771863

Fix SubwordField constructor

312bb57

jekbradbury changed the title ~~[WIP] Revtok integration: reversible fields and subwords~~ Revtok integration: reversible fields and subwords Oct 10, 2017

jekbradbury merged commit 4f5801b into master Oct 10, 2017

jekbradbury mentioned this pull request Oct 10, 2017

Adding is_label option in Field to remove unknown label #121

Closed

This was referenced Dec 22, 2017

Understanding vocabulary of text and labels #17

Closed

[Feature Request] parameter for detokenization in fields #82

Closed

[Discussion] Object Design #89

Closed

jekbradbury mentioned this pull request Apr 27, 2018

[Help Wantted]Could any one gives an example of using the subword field? #290

Open

ZhuBaohe mentioned this pull request Sep 18, 2019

SubwordField documentation is absent #599

Closed
