Revtok integration: reversible fields and subwords #107
test/test_subword.py
Outdated
LABEL.build_vocab(cooked)
TEXT.build_vocab(cooked, max_size=100)
TEXT.segment(cooked)
batch = next(iter(data.Iterator(cooked, 1)))
test/test_subword.py
Outdated
TEXT.build_vocab(cooked, max_size=100)
TEXT.segment(cooked)
batch = next(iter(data.Iterator(cooked, 1)))
self.assertEqual(TEXT.reverse(batch.text)[0], raw[0].text)
torchtext/data/field.py
Outdated
print("Please install revtok.")
raise
if not self.batch_first:
    batch.t_()
test/test_subword.py
Outdated
LABEL = data.Field(sequential=False)
RAW = data.Field(sequential=False, use_vocab=False)
raw, = TREC.splits(RAW, LABEL, train='TREC_10.label',
                   validation=None, test=None)
I added some comments where I think the test breaks right now. -- (changed the test in ac4f794)
It looks like the counts in the SubwordVocab are handled differently for special tokens (no subtract). Why is that?
Can you add comments in the subword field and vocab with a short description of how the handling differs from normal fields and vocabs and why?
I'm reading through revtok right now; in the meantime, I think that once the test is working this looks like a nice addition. We should probably wait until the most recent wave of vocab additions and improvements goes through before merging, though.
test/test_subword.py
Outdated
self.assertEqual(TEXT.reverse(batch.text.data)[0], raw[0].text)

def test_subword_snli(self):
    TEXT = data.SubwordField()
2c0487e to bf9a490
fa62a93 to cdd27ca
Notes from CI:
* adding Multi30k wrapper
* abstracting tar.gz compression
* removing cls.filename
* refactor datasets for downloading
* bug in sst tree examples
Also:
- Add a dataset for testing numeric features (float and int)
- Coerce non-sequential data with use_vocab=False to numeric types
5f67b40 to 98cb758
I think I messed up somewhere in a rebase: the push build is failing on Python 2 due to a failure to round-trip capitalization, even though the PR build is succeeding and the branch claims to be up to date with master...
I installed revtok by
If the revtok on pip is not the correct version, how should we install it?
Currently you need to install from source: https://github.com/jekbradbury/revtok
Integrates revtok, a simple fully-reversible tokenizer that optionally supports subwords. Adds the ReversibleField.reverse API to go from a padded batch of token IDs to a list of detokenized sentences. Speed is not great right now, but I'm working on an implementation in a fast language and will add an optional wrapper for that in the Python revtok when it's ready.
Added a first stab at a test (I want to release revtok to PyPI before I add it to requirements.txt and the test can actually run); will add more.
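To make the round-trip idea concrete, here is a minimal standard-library sketch of what "reversible" means for a padded batch of token IDs. It is not revtok's algorithm and not the ReversibleField implementation from this PR: the helper names (`tokenize`, `detokenize`, `build_vocab`, `numericalize`, `reverse`) and the `<pad>` symbol are assumptions made for illustration, and the toy tokenizer achieves reversibility simply by keeping whitespace runs as tokens.

```python
import re

PAD = "<pad>"  # hypothetical padding symbol for this sketch


def tokenize(text):
    # Every character is whitespace, a word character, or neither,
    # so these three alternatives cover the string with no gaps.
    return re.findall(r"\s+|\w+|[^\w\s]", text)


def detokenize(tokens):
    # Whitespace was kept as tokens, so joining restores the input exactly.
    return "".join(tokens)


def build_vocab(sentences):
    # Index 0 is reserved for padding; other tokens get IDs in first-seen order.
    itos = [PAD]
    for sentence in sentences:
        for tok in tokenize(sentence):
            if tok not in itos:
                itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(sentences, stoi):
    # Map tokens to IDs and right-pad every row to the longest row.
    rows = [[stoi[tok] for tok in tokenize(s)] for s in sentences]
    width = max(len(row) for row in rows)
    return [row + [stoi[PAD]] * (width - len(row)) for row in rows]


def reverse(batch, itos):
    # Padded batch of token IDs -> list of detokenized sentences.
    pad_id = itos.index(PAD)
    return [detokenize(itos[i] for i in row if i != pad_id) for row in batch]


sentences = ["What is  the capital of France?", "Hello, World!"]
itos, stoi = build_vocab(sentences)
batch = numericalize(sentences, stoi)
restored = reverse(batch, itos)
```

In this sketch `restored` equals `sentences` character for character, including the double space and the capitalization. The toy vocabulary contains every token it will ever see; with a capped vocabulary, unknown tokens would break the round trip, which is the gap subword segmentation is meant to close.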
Usage example: