[BC breaking] Add Sentencepiece torchscript Extension #755
Force-pushed from 4dbf3f1 to 1a1b6f0.
Need a JIT test.
```python
# Extension info
ext_modules=setup_helpers.get_ext_modules(),
cmdclass={
    'build_ext': setup_helpers.BuildExtension.with_options(no_python_abi_suffix=True),
```
Why is this abi option necessary?
Also, I'd add a comment that references the issue around dropping symbols that we discussed recently (I can dig it up if this doesn't ring a bell).
Ah, it's not necessary. It's a convention I carried over from torchvision (the `_C.so` naming). I will remove it.
Actually, it turned out that we need to provide the full path to the `.so` file, so removing the ABI suffix makes it easier to locate the library file.
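As a rough illustration of why this helps (the function and directory names here are hypothetical, not torchtext's actual loader), the ABI suffix is what makes the extension filename interpreter-specific; dropping it yields one fixed, predictable path:

```python
import sysconfig
from pathlib import Path

def extension_path(pkg_dir: str, name: str = "_torchtext",
                   keep_abi_suffix: bool = False) -> Path:
    # With the ABI suffix kept, setuptools names the file something like
    # _torchtext.cpython-38-x86_64-linux-gnu.so, which varies per
    # interpreter. With no_python_abi_suffix=True it is always
    # _torchtext.so, so code that must hand a full path to a loader can
    # construct it without reconstructing the ABI tag.
    if keep_abi_suffix:
        suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    else:
        suffix = ".so"
    return Path(pkg_dir) / (name + suffix)

print(extension_path("/site-packages/torchtext"))
```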
```cpp
public:
  std::string content_;
  explicit SentencePiece(const std::string &content) : content_(content) {
```
This could be changed to take the underlying `sentencepiece` processor object via move constructor and then effectively act as a shell that forwards methods to it.
Do you need this string constructor? Does this match previous behavior? Usually I'd make this a separate factory function and then construct a SentencePiece object that is initialized with whatever `processor_.LoadFromSerializedProto(content_)` returns.
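The factory shape being suggested could look roughly like this Python sketch; the processor here is a toy stand-in, not the real sentencepiece binding, and all names are illustrative:

```python
class FakeProcessor:
    """Toy stand-in for the sentencepiece processor: it can load a
    serialized model proto but offers no way to re-serialize it."""
    def __init__(self):
        self.loaded = None

    def LoadFromSerializedProto(self, content: bytes) -> bool:
        self.loaded = content  # the real binding parses the proto here
        return True

class SentencePiece:
    """Thin shell that owns an already-initialized processor and
    forwards calls to it."""
    def __init__(self, processor: FakeProcessor):
        self.processor_ = processor

def load_sp_model_from_proto(content: bytes) -> SentencePiece:
    # Factory function: do the fallible loading here, then hand a
    # ready processor to the wrapper's constructor.
    proc = FakeProcessor()
    if not proc.LoadFromSerializedProto(content):
        raise RuntimeError("failed to load SentencePiece model")
    return SentencePiece(proc)
```

This keeps the wrapper's constructor infallible and moves the error handling into one place.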
This class has to provide a serialization mechanism so that the scripted module can be saved. Ideally, the underlying `SentencePieceProcessor` class would provide such a mechanism, but it does not expose an interface to serialize the model.

Note 1: The lifecycle of the official `SentencePieceProcessor` is: 1. train and save to a file, then 2. load from the saved file. It has no functionality to save the model.

Note 2: If you look at the source code of `SentencePieceProcessor`, it almost looks like it provides a serialization interface, but `ModelProto` and all the `protobuf`-related functionality do not exist in the library file.

To work around this, we need to keep the content of the saved model file (the model data as serialized protobuf) when loading a model from a file. So it is simpler to pass the model content as a string to the constructor and instantiate the underlying `SentencePieceProcessor` there, because you still need to pass the model content separately for the later serialization.
The serialization here is necessary to support pickle? As far as I know that's what we use to save models. We could look into adding custom pickle support, as pybind11 does. This might also enable us, if there's a need, to apply compression algorithms on the fly and save a smaller model.
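A minimal Python sketch of both ideas under discussion: keep the raw model bytes alongside the processor for re-serialization, and compress them in the pickle state. The processor is a mock and every name here is illustrative, not the real extension's API:

```python
import pickle
import zlib

class MockProcessor:
    """Stand-in for SentencePieceProcessor, which can only load a
    serialized proto, never produce one."""
    def __init__(self):
        self.model = None

    def load_from_serialized_proto(self, content: bytes) -> None:
        self.model = content  # pretend to parse the proto

class PicklableSentencePiece:
    """Keeps the serialized model next to the processor so the object
    can be pickled even though the processor itself cannot save."""
    def __init__(self, content: bytes):
        self.content_ = content
        self.processor_ = MockProcessor()
        self.processor_.load_from_serialized_proto(content)

    def __getstate__(self):
        # Compress on the fly, shrinking the saved artifact.
        return zlib.compress(self.content_)

    def __setstate__(self, state):
        # Decompress and rebuild the processor from the raw bytes.
        self.__init__(zlib.decompress(state))

sp = PicklableSentencePiece(b"model-proto-bytes")
restored = pickle.loads(pickle.dumps(sp))
```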
```cpp
  }
}

std::vector<std::string> Encode(const std::string &input) const {
```
If you want, you could play with iterables here. You pass an object that yields strings and return an object that yields iterators over strings. That could be quite powerful when it comes to avoiding the overhead of writing out a lot of text while processing.
Just a reminder: iterators are not supported by TorchScript, and we expect these transforms to be jitable.
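To make the constraint concrete, here is a toy sketch (whitespace splitting stands in for SentencePiece encoding; names are illustrative): the generator version is convenient in eager Python but cannot be compiled by TorchScript, while the list-returning version can be expressed in it:

```python
from typing import Iterator, List

def toy_encode(line: str) -> List[str]:
    # Stand-in for SentencePiece's Encode.
    return line.split()

def encode_iter(lines) -> Iterator[List[str]]:
    # Lazy: avoids materializing all output at once, but generators
    # are not supported by TorchScript.
    for line in lines:
        yield toy_encode(line)

def encode_list(lines: List[str]) -> List[List[str]]:
    # Eager: returns a concrete List[List[str]], a type TorchScript
    # understands.
    out: List[List[str]] = []
    for line in lines:
        out.append(toy_encode(line))
    return out
```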
Addressed your feedback.
Looks good, but please document the BC-breaking changes and changed semantics so they can be referenced.
Also, maybe it makes sense to wait for `__len__` and `__getitem__` to land.
@cpuhrsch I squashed the commits and added the list of methods available before and after the change to the commit message. EDIT: also added the same message to this PR description.
I prefer to merge this one first and then work on that separately.
Ok sounds good. Please add a [BC breaking] tag and then this should be good to go. Also, please do an import into fbcode after this.
This PR adds `torchscript` extension `_torchtext.so`, which contains a simple interface to `SentencePiece`.

- SentencePiece `v0.1.86` is used.
- `libsentencepiece.a` is built right before `_torchtext.so` is compiled. The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
- `_torchtext.so` provides an interface to train a SentencePiece model and load a model from file.

Breaking change: Previously `torchtext.data.functional.load_sp_model` returned a `sentencepiece.SentencePieceProcessor` object, which supported the following methods, in addition to `__len__` and `__getitem__`:

```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
return $self->Load(filename);
return $self->LoadFromSerializedProto(filename);
return $self->SetEncodeExtraOptions(extra_option);
return $self->SetDecodeExtraOptions(extra_option);
return $self->SetVocabulary(valid_vocab);
return $self->ResetVocabulary();
return $self->LoadVocabulary(filename, threshold);
return $self->EncodeAsPieces(input);
return $self->EncodeAsIds(input);
return $self->NBestEncodeAsPieces(input, nbest_size);
return $self->NBestEncodeAsIds(input, nbest_size);
return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
return $self->SampleEncodeAsIds(input, nbest_size, alpha);
return $self->DecodePieces(input);
return $self->DecodeIds(input);
return $self->EncodeAsSerializedProto(input);
return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
return $self->NBestEncodeAsSerializedProto(input, nbest_size);
return $self->DecodePiecesAsSerializedProto(pieces);
return $self->DecodeIdsAsSerializedProto(ids);
return $self->GetPieceSize();
return $self->PieceToId(piece);
return $self->IdToPiece(id);
return $self->GetScore(id);
return $self->IsUnused(id);
return $self->IsControl(id);
return $self->IsUnused(id);
return $self->GetPieceSize();
return $self->PieceToId(key);
```

The new C++ extension provides the following methods:

```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
We need to update the README file with instructions for building torchtext from the master branch.
This PR adds `torchscript` extension `_torchtext.so`, which contains a simple interface to `SentencePiece`.

- SentencePiece `v0.1.86` is used.
- `libsentencepiece.a` is built right before `_torchtext.so` is compiled. The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
- `_torchtext.so` provides an interface to train a SentencePiece model and load a model from file.
- The `SentencePiece` class has the following methods: `Encode`, `EncodeAsIds`, `EncodeAsPieces`.
- The `SentencePiece` class is pickle-able. TODO: add a pickle/un-pickle test. This is not for simple pickling, but for `torchscript` serialization.
- The `SentencePiece` class replaces the original `SentencePieceProcessor` class, which is the official Python binding using Swig.
- Because of `torchscript`, our custom `SentencePiece` class does not implement the special methods `__len__` and `__getitem__`.
- The other methods of `SentencePieceProcessor` are not implemented.
- `generate_sp_model` and `load_sp_model` are replaced with an equivalent C++ implementation that handles our custom `SentencePiece` class. `load_sp_model` was returning `sentencepiece.SentencePieceProcessor`. (BC-breaking.)
- `sentencepiece_tokenizer` and `sentencepiece_numericalizer` are not replaced with the C++ extension yet, as the C++ extension does not handle generators.

Breaking change: Previously `torchtext.data.functional.load_sp_model` returned a `sentencepiece.SentencePieceProcessor` object, which supported the methods listed in the commit message above, in addition to `__len__` and `__getitem__`. The new C++ extension provides only `Encode`, `EncodeAsIds`, and `EncodeAsPieces`.
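As a hedged illustration of what the BC break means for downstream code (this is a toy stand-in, not the real extension class): only the three `Encode*` methods survive, so anything relying on `__len__`, `__getitem__`, or the other Swig-binding methods needs rework.

```python
from typing import List

class NewStyleSentencePiece:
    """Toy model of the reduced API surface of the new extension."""
    def Encode(self, line: str) -> List[str]:
        return line.split()

    def EncodeAsPieces(self, line: str) -> List[str]:
        return self.Encode(line)

    def EncodeAsIds(self, line: str) -> List[int]:
        # Fake ids purely for illustration.
        return [len(piece) for piece in self.Encode(line)]

sp = NewStyleSentencePiece()
# Old code such as `len(sp)` or `sp[0]` now fails, since the class
# defines neither __len__ nor __getitem__:
assert not hasattr(sp, "__len__") and not hasattr(sp, "__getitem__")
```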