Add Sentencepiece binding to torchtext, plus example to build torchtext dataset with sentencepiece #597
Conversation
I'm noticing the dataset class does not use an iterator. Is this meant to reflect a consensus we agreed on?
torchtext/data/functional.py (outdated)

    >>> sp_user.load('spm_user.model')
    """

    import sentencepiece as spm
nit: should this be at the beginning?
examples/sentencepiece/README.md (outdated)

    @@ -0,0 +1,13 @@
    # This is an example to create a dataset with SentencePiece binding.
nit: the title should probably just be "# Create a dataset with SentencePiece binding", since the text already refers to the content as an example, and it's in the examples folder :)
test/data/test_transforms.py (outdated)

    class TestUtils(TorchtextTestCase):
        def test_sentencepiece_encode_as_ids(self):
            from torchtext.data.transforms import sentencepiece_encode_as_ids
nit: same comment about the import.
test/data/test_transforms.py (outdated)

            ref_results)

        def test_sentencepiece_encode_as_pieces(self):
            import sys
nit: same comment about the imports.
torchtext/data/transforms.py (outdated)

    >>> list(sp_tokens_generator(list_a))
    [['_sentence', 'piece', '_en', 'co', 'de', '_as', '_pieces'],
     ['_example', 's', '_to', '_try', '!']]
nit: extra space?
torchtext/data/transforms.py (outdated)

    >>> list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]
nit: extra space?
Not at the dataset level. At the current stage, I would like to implement a few transforms (two in this case) as generators and apply them as tokenizers. As we have more use cases, we can sit together and make a final decision across all the domains. :)
    @@ -137,6 +137,10 @@ def test(data_):
                            help='device (default=cpu)')
        parser.add_argument('--data', default='.data',
                            help='data directory (default=.data)')
        parser.add_argument('--use-sp-tokenizer', type=bool, default=False,
                            help='use sentencepiece tokenizer (default=False)')
        parser.add_argument('--vocab-size', type=int, default=20000,
I'd change this to say "--sp-vocab-size" just so it's less confusing
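A sketch of how the renamed flag could look, assuming the reviewer's suggested name `--sp-vocab-size`. As a side note, `type=bool` in argparse is a well-known pitfall (any non-empty string, including `"False"`, is truthy), so a `store_true` flag is usually safer for `--use-sp-tokenizer`:

```python
import argparse

# Illustrative sketch only; flag names follow the review suggestion.
parser = argparse.ArgumentParser()
# store_true avoids the argparse type=bool pitfall (bool("False") is True).
parser.add_argument('--use-sp-tokenizer', action='store_true',
                    help='use sentencepiece tokenizer (default=False)')
parser.add_argument('--sp-vocab-size', type=int, default=20000,
                    help='sentencepiece vocab size (default=20000)')

args = parser.parse_args(['--use-sp-tokenizer', '--sp-vocab-size', '32000'])
print(args.use_sp_tokenizer, args.sp_vocab_size)  # True 32000
```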
        model_prefix,
        model_type)
    spm.SentencePieceTrainer.train(spm_training_string)
    return None
Why return None?
Ah, because the training also involves writing to disk. Is it possible to just return the object and have the user save it herself (if necessary)? That way we don't add any constraints on where this is supposed to be saved, or whether it's supposed to be saved at all.
I have checked the APIs in sentencepiece, and it seems they don't have an option to return an object.
torchtext/data/transforms.py (outdated)

    """

    sp_model = spm.SentencePieceProcessor()
    sp_model.Load(spm_path)
Instead of loading the model here, can this just accept a sentencepiece model and nothing else? That way we don't add any constraints on where the model is supposed to be loaded from. It could, however, be good to have a "load_sentencepiece_model_from_path" function or something that wraps the first two lines (just because static constructors are more idiomatic than a method that overwrites an entire state).
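A minimal sketch of the refactor being suggested. The names `load_sp_model` and `sentencepiece_tokenizer` are illustrative, not the final API; a whitespace-splitting stand-in replaces `spm.SentencePieceProcessor` so the sketch runs without sentencepiece installed:

```python
# Stand-in for spm.SentencePieceProcessor so this sketch is self-contained.
class _FakeSPModel:
    def EncodeAsPieces(self, text):
        return text.split()

def load_sp_model(spm_path):
    """Static-constructor-style loader wrapping the two lines under review.

    With the real library this body would be:
        sp_model = spm.SentencePieceProcessor()
        sp_model.Load(spm_path)
        return sp_model
    """
    return _FakeSPModel()

def sentencepiece_tokenizer(sp_model):
    """Accepts an already-loaded model, so callers decide where it comes from."""
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsPieces(line)
    return _internal_func

model = load_sp_model('spm_user.model')  # path is illustrative
tokenize = sentencepiece_tokenizer(model)
print(list(tokenize(['hello world'])))  # [['hello', 'world']]
```

Separating loading from tokenizing also means one loaded model can back several transforms.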
torchtext/data/transforms.py (outdated)

    """

    sp_model = spm.SentencePieceProcessor()
    sp_model.Load(spm_path)
Same thing as above. Here the "load...from_path" function can be used to pass into this. Also, if you can load it independently you can run this multiple times.
Further, this doesn't need to return a function that then yields; it can yield itself.
torchtext/data/transforms.py (outdated)

    Arguments:
        spm_path: the file path saving the sentencepiece model.
        txt_iter: input sentence text generator.
This doesn't accept a "txt_iter" argument.
torchtext/data/transforms.py (outdated)

        return _internal_func


    def sentencepiece_encode_as_pieces(spm_path):
If "pieces" means "tokens" in our terminology, I suggest we use "tokens" instead of "pieces" here, because we need to standardize on one vocabulary within our project even if other projects use different terms.
Add SentencePiece binding to torchtext.
Show an example of building a torchtext dataset with SentencePiece. Reproduce fastText's results in the example with the SentencePiece binding.