Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BC breaking] Add Sentencepiece torchscript Extension #755

Merged
merged 1 commit into from
May 8, 2020

Conversation

mthrok
Copy link
Contributor

@mthrok mthrok commented May 6, 2020

This PR adds torchscript extension _torchtext.so, which contains simple interface to SentencePiece.

  • SentencePiece v0.1.86 is used.
  • libsentencepiece.a is built right before _torchtext.so is compiled.
    The logic for triggering this build from setuptools can be found under build_tools/setup_helpers.
  • _torchtext.so provides interface to train a SentencePiece model and load a model from file.
  • Our custom SentencePiece class has the following methods
    • Encode
    • EncodeAsIds
    • EncodeAsPieces
    • is pickle-able TODO Add pickle/un-pickle test This is not for simple pickling, but for torchscrpit serialization.
  • Our custom SentencePiece class replaces the original SentencePieceProcessor class, which is the official Python binding using Swig.
    • Due to the limitation of torchscript, our custom SentencePiece class does not implement special methods __len__ and __getitem__.
    • Other methods provided SentencePieceProcessor are not implemented.
  • generate_sp_model and load_sp_model are replaces with equivalent C++ implementation that handles our custom SentencePiece class.
  • Original load_sp_model was returning sentencepiece.SentencePieceProcessor. (BC-braeaking.)
  • We cannot replace sentencepiece_tokenizer and sentencepiece_numericalizer with C++ ext yet as C++ ext does not handle generator.

Breaking change:

Previously torchtext.data.functional.load_sp_model returned sentencepiece.SentencePieceProcessor object, which supported the following methods, in addition to __len__ and __getitem__

$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
    return $self->Load(filename);
    return $self->LoadFromSerializedProto(filename);
    return $self->SetEncodeExtraOptions(extra_option);
    return $self->SetDecodeExtraOptions(extra_option);
    return $self->SetVocabulary(valid_vocab);
    return $self->ResetVocabulary();
    return $self->LoadVocabulary(filename, threshold);
    return $self->EncodeAsPieces(input);
    return $self->EncodeAsIds(input);
    return $self->NBestEncodeAsPieces(input, nbest_size);
    return $self->NBestEncodeAsIds(input, nbest_size);
    return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
    return $self->SampleEncodeAsIds(input, nbest_size, alpha);
    return $self->DecodePieces(input);
    return $self->DecodeIds(input);
    return $self->EncodeAsSerializedProto(input);
    return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
    return $self->NBestEncodeAsSerializedProto(input, nbest_size);
    return $self->DecodePiecesAsSerializedProto(pieces);
    return $self->DecodeIdsAsSerializedProto(ids);
    return $self->GetPieceSize();
    return $self->PieceToId(piece);
    return $self->IdToPiece(id);
    return $self->GetScore(id);
    return $self->IsUnused(id);
    return $self->IsControl(id);
    return $self->IsUnused(id);
    return $self->GetPieceSize();
    return $self->PieceToId(key);

The new C++ Extension provides the following methods

Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)

@mthrok mthrok marked this pull request as draft May 6, 2020 17:13
@mthrok mthrok changed the title [WIP] Add Sentencepiece CPP Extension [WIP] Add Sentencepiece torchscript Extension May 7, 2020
@mthrok mthrok force-pushed the sentencepiece branch 3 times, most recently from 4dbf3f1 to 1a1b6f0 Compare May 7, 2020 20:25
@mthrok mthrok changed the title [WIP] Add Sentencepiece torchscript Extension Add Sentencepiece torchscript Extension May 8, 2020
@mthrok mthrok marked this pull request as ready for review May 8, 2020 01:08
Copy link
Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a JIT test.

# Extension info
ext_modules=setup_helpers.get_ext_modules(),
cmdclass={
'build_ext': setup_helpers.BuildExtension.with_options(no_python_abi_suffix=True),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this abi option necessary?

Also, I'd add a comment that references the issue around dropping symbols we recently referenced (I can dig it up if this doesn't ring a bell).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah not necessary. It's a reminding custom from what I followed in torchvision. (_C.so notation.)
I will remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it turned out that we need to provide the full path to so file, so removing abi suffix makes it easier to locate the library file.


public:
std::string content_;
explicit SentencePiece(const std::string &content) : content_(content) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be changed to take a ::sentencepiece:: via move constructor and then effectively represent a shell that forwards methods to the underlying object.

Do you need this string constructor? Does this match previous behavior? Usually I'd make this a separate factory function and then construct a SentencePiece object that is initialized with whatever processor_.LoadFromSerializedProto(content_); returns.

Copy link
Contributor Author

@mthrok mthrok May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has to provide serialization mechanism so that scripted module can be saved.
Ideally, the underlying SentencePieceProcessor class should provide such mechanism, but it does not provide interface to serialize the model.
Note 1. The lifecycle of the official SentencePieceProcessor is 1. train and save to a file then 2. load from the saved file. It does not have a functionality to save the model.
Note 2. If you look at the source code of SentencePieceProcessor, it almost look like provide serialization interface, but ModelProto and all the functionality related to protobuf does not exist in the library file.

To workaround this, we need to keep the content of the saved model file (model data serialized protobuf) when loading a model from a file.
So, it is simpler to pass the model content as string to constructor and to instantiate the underlying SentencePieceProcessor there. Because, you still need to pass the model content separately for the later serialization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialization here is necessary to support pickle? As far as I know that's what we use to save models. We could look into adding custom pickle support as does pybind11 for this. This might also enable us, if they don't already or if there's a need, to apply compression algorithms on the fly and save a smaller model.

}
}

std::vector<std::string> Encode(const std::string &input) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you could play with iterables here. You pass an object that yields strings and return and object that yields iterators over strings. That could be quite powerful when it comes to avoiding the overhead of writing out a lot of text while processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder. Iterator is not supported by torchscript but we expect those transforms jitable.

@mthrok
Copy link
Contributor Author

mthrok commented May 8, 2020

@zhangguanheng66 @cpuhrsch

Addressed your feedback.
Regarding replacing sentencepiece_tokenizer and sentencepiece_numericalizer, we can try in the next step with @cpuhrsch 's idea to pass a generator to C++.

Copy link
Contributor

@cpuhrsch cpuhrsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but please document the BC-breaking changes and changed semantics so they can be referenced.

Also, maybe it makes sense to wait for len and getitem to land

@mthrok
Copy link
Contributor Author

mthrok commented May 8, 2020

@cpuhrsch I squashed the commits and added the list of methods available now and then in the commit message.
Checkout the commit message 👉https://github.com/pytorch/text/pull/755/commits

EDIT: also added the same message to this PR description.

@mthrok
Copy link
Contributor Author

mthrok commented May 8, 2020

Also, maybe it makes sense to wait for len and getitem to land

I prefer to merge this one then work on separately.

@cpuhrsch
Copy link
Contributor

cpuhrsch commented May 8, 2020

Ok sounds good. Please add a [BC breaking] tag and then this should be good to go. Also, please do an import into fbcode after this.

@mthrok mthrok changed the title Add Sentencepiece torchscript Extension [BC breaking] Add Sentencepiece torchscript Extension May 8, 2020
This CC adds `torchscript` extension `_torchtext.so`, which contains simple interface to `SentencePiece`.

 - SentencePiece `v0.1.86` is used.
 - `libsentencepiece.a` is built right before `_torchtext.so` is compiled.
   The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
 - `_torchtext.so` provides interface to train a SentencePiece model and load a model from file.

Breaking change:

Previously `torchtext.data.functional.load_sp_model` returned `sentencepiece.SentencePieceProcessor` object, which supported the following methods, in addition to `__len__` and `__getitem__`

```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
    return $self->Load(filename);
    return $self->LoadFromSerializedProto(filename);
    return $self->SetEncodeExtraOptions(extra_option);
    return $self->SetDecodeExtraOptions(extra_option);
    return $self->SetVocabulary(valid_vocab);
    return $self->ResetVocabulary();
    return $self->LoadVocabulary(filename, threshold);
    return $self->EncodeAsPieces(input);
    return $self->EncodeAsIds(input);
    return $self->NBestEncodeAsPieces(input, nbest_size);
    return $self->NBestEncodeAsIds(input, nbest_size);
    return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
    return $self->SampleEncodeAsIds(input, nbest_size, alpha);
    return $self->DecodePieces(input);
    return $self->DecodeIds(input);
    return $self->EncodeAsSerializedProto(input);
    return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
    return $self->NBestEncodeAsSerializedProto(input, nbest_size);
    return $self->DecodePiecesAsSerializedProto(pieces);
    return $self->DecodeIdsAsSerializedProto(ids);
    return $self->GetPieceSize();
    return $self->PieceToId(piece);
    return $self->IdToPiece(id);
    return $self->GetScore(id);
    return $self->IsUnused(id);
    return $self->IsControl(id);
    return $self->IsUnused(id);
    return $self->GetPieceSize();
    return $self->PieceToId(key);
```

The new C++ Extension provides the following methods

```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
@mthrok mthrok merged commit 5de163a into pytorch:master May 8, 2020
@mthrok mthrok deleted the sentencepiece branch May 8, 2020 20:50
@zhangguanheng66
Copy link
Contributor

We need to update the README file for some instructions to build torchtext from master branch.

zhangguanheng66 pushed a commit to zhangguanheng66/text that referenced this pull request May 12, 2020
This CC adds `torchscript` extension `_torchtext.so`, which contains simple interface to `SentencePiece`.

 - SentencePiece `v0.1.86` is used.
 - `libsentencepiece.a` is built right before `_torchtext.so` is compiled.
   The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
 - `_torchtext.so` provides interface to train a SentencePiece model and load a model from file.

Breaking change:

Previously `torchtext.data.functional.load_sp_model` returned `sentencepiece.SentencePieceProcessor` object, which supported the following methods, in addition to `__len__` and `__getitem__`

```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
    return $self->Load(filename);
    return $self->LoadFromSerializedProto(filename);
    return $self->SetEncodeExtraOptions(extra_option);
    return $self->SetDecodeExtraOptions(extra_option);
    return $self->SetVocabulary(valid_vocab);
    return $self->ResetVocabulary();
    return $self->LoadVocabulary(filename, threshold);
    return $self->EncodeAsPieces(input);
    return $self->EncodeAsIds(input);
    return $self->NBestEncodeAsPieces(input, nbest_size);
    return $self->NBestEncodeAsIds(input, nbest_size);
    return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
    return $self->SampleEncodeAsIds(input, nbest_size, alpha);
    return $self->DecodePieces(input);
    return $self->DecodeIds(input);
    return $self->EncodeAsSerializedProto(input);
    return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
    return $self->NBestEncodeAsSerializedProto(input, nbest_size);
    return $self->DecodePiecesAsSerializedProto(pieces);
    return $self->DecodeIdsAsSerializedProto(ids);
    return $self->GetPieceSize();
    return $self->PieceToId(piece);
    return $self->IdToPiece(id);
    return $self->GetScore(id);
    return $self->IsUnused(id);
    return $self->IsControl(id);
    return $self->IsUnused(id);
    return $self->GetPieceSize();
    return $self->PieceToId(key);
```

The new C++ Extension provides the following methods

```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants