Add Sentencepiece binding to torchtext, plus example to build torchtext dataset with sentencepiece #597
Conversation
I'm noticing the dataset class does not use an iterator. Is this meant to reflect a consensus we agreed on?
torchtext/data/functional.py (outdated)

    >>> sp_user.load('spm_user.model')
    """

    import sentencepiece as spm
nit: should this be at the beginning?
examples/sentencepiece/README.md (outdated)

    @@ -0,0 +1,13 @@
    # This is an example to create a dataset with SentencePiece binding.
nit: the title should probably just be "# Create a dataset with SentencePiece binding", since the text already refers to the content as an example, and it's in the examples folder :)
test/data/test_transforms.py (outdated)

    class TestUtils(TorchtextTestCase):
        def test_sentencepiece_encode_as_ids(self):
            from torchtext.data.transforms import sentencepiece_encode_as_ids
nit: same comment about the import.
test/data/test_transforms.py (outdated)

            ref_results)

        def test_sentencepiece_encode_as_pieces(self):
            import sys
nit: same comment about the imports.
torchtext/data/transforms.py (outdated)

    >>> list(sp_tokens_generator(list_a))
    [['_sentence', 'piece', '_en', 'co', 'de', '_as', '_pieces'],
     ['_example', 's', '_to', '_try', '!']]
nit: extra space?
torchtext/data/transforms.py (outdated)

    >>> list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]
nit: extra space?
Not at the dataset level. At the current stage, I would like to implement a few transforms (two in this case) as generators and apply them as tokenizers. As we have more use cases, we can sit together and make a final decision across all the domains. :)
    @@ -137,6 +137,10 @@ def test(data_):
                            help='device (default=cpu)')
        parser.add_argument('--data', default='.data',
                            help='data directory (default=.data)')
        parser.add_argument('--use-sp-tokenizer', type=bool, default=False,
                            help='use sentencepiece tokenizer (default=False)')
        parser.add_argument('--vocab-size', type=int, default=20000,
I'd change this to say "--sp-vocab-size" just so it's less confusing
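A sketch of how the renamed flag could look, assuming the reviewer's suggested name `--sp-vocab-size`. As a side note, `type=bool` in argparse is a well-known pitfall (any non-empty string, including `"False"`, is truthy), so a `store_true` flag is usually safer for `--use-sp-tokenizer`:

```python
import argparse

# Illustrative sketch only; flag names follow the review suggestion.
parser = argparse.ArgumentParser()
# store_true avoids the argparse type=bool pitfall (bool("False") is True).
parser.add_argument('--use-sp-tokenizer', action='store_true',
                    help='use sentencepiece tokenizer (default=False)')
parser.add_argument('--sp-vocab-size', type=int, default=20000,
                    help='sentencepiece vocab size (default=20000)')

args = parser.parse_args(['--use-sp-tokenizer', '--sp-vocab-size', '32000'])
print(args.use_sp_tokenizer, args.sp_vocab_size)  # True 32000
```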
        model_prefix,
        model_type)
    spm.SentencePieceTrainer.train(spm_training_string)
    return None
Why return None?
Ah, because the training also involves writing to disk. Is it possible to just return the object and have the user save it herself (if necessary)? That way we don't add any constraints on where this is supposed to be saved, or whether it's supposed to be saved at all.
I have checked the APIs in sentencepiece, and it seems they don't have an option to return an object.
torchtext/data/transforms.py (outdated)

    """

    sp_model = spm.SentencePieceProcessor()
    sp_model.Load(spm_path)
Instead of loading the model here, can this just accept a sentencepiece model and nothing else? That way we don't add any constraints on where the model is supposed to be loaded from. It could, however, be good to have a "load_sentencepiece_model_from_path" function or something that wraps the first two lines (just because static constructors are more idiomatic than a method that overwrites an entire state).
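A minimal sketch of the refactor being suggested. The names `load_sp_model` and `sentencepiece_tokenizer` are illustrative, not the final API; a whitespace-splitting stand-in replaces `spm.SentencePieceProcessor` so the sketch runs without sentencepiece installed:

```python
# Stand-in for spm.SentencePieceProcessor so this sketch is self-contained.
class _FakeSPModel:
    def EncodeAsPieces(self, text):
        return text.split()

def load_sp_model(spm_path):
    """Static-constructor-style loader wrapping the two lines under review.

    With the real library this body would be:
        sp_model = spm.SentencePieceProcessor()
        sp_model.Load(spm_path)
        return sp_model
    """
    return _FakeSPModel()

def sentencepiece_tokenizer(sp_model):
    """Accepts an already-loaded model, so callers decide where it comes from."""
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsPieces(line)
    return _internal_func

model = load_sp_model('spm_user.model')  # path is illustrative
tokenize = sentencepiece_tokenizer(model)
print(list(tokenize(['hello world'])))  # [['hello', 'world']]
```

Separating loading from tokenizing also means one loaded model can back several transforms.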
torchtext/data/transforms.py (outdated)

    """

    sp_model = spm.SentencePieceProcessor()
    sp_model.Load(spm_path)
Same thing as above. Here the "load...from_path" function can be used to pass into this. Also, if you can load it independently you can run this multiple times.
Further, this doesn't need to return a function that then yields; it can yield itself.
torchtext/data/transforms.py (outdated)

    Arguments:
        spm_path: the file path saving the sentencepiece model.
        txt_iter: input sentence text generator.
This doesn't accept a "txt_iter" argument.
torchtext/data/transforms.py (outdated)

        return _internal_func


    def sentencepiece_encode_as_pieces(spm_path):
If "pieces" means "tokens" in our terminology, I suggest we use "tokens" instead of "pieces" here, because we need to standardize on one vocabulary within our project even if other projects use different terms.
Add SentencePiece binding to torchtext.
Show an example of building a torchtext dataset with SentencePiece. Reproduce fastText's results in the example with the SentencePiece binding.