Request for more challenging Transformer architecture use cases through a better performance tokenizer .NET library #33

Closed
GeorgeS2019 opened this issue Oct 8, 2021 · 2 comments

Comments

@GeorgeS2019

Seq2SeqSharp is a valid alternative for a .NET Transformer architecture solution.

It seems that a cross-platform .NET tokenizer library, especially one with better performance than those provided through Python libraries, would make it less challenging for Seq2SeqSharp to explore other real-world, end-to-end Transformer architecture examples such as GPT2, BERT, etc.

I am raising this issue to encourage users here to share their feedback toward a concerted effort on such a .NET tokenization library.

@GeorgeS2019 GeorgeS2019 changed the title Request for more challenging Transformer architecture use cases through a better performance tokenizers Request for more challenging Transformer architecture use cases through a better performance tokenizer .NET library Oct 8, 2021
@zhongkaifu
Owner

Hi @GeorgeS2019 ,

Thanks for your comments.

I may not understand why you need a .NET tokenization library. Could you please specify the scenario in which you would like to use it?

For Seq2SeqSharp, you can use any tokenization library for data processing. Seq2SeqSharp only cares about tokens: it takes tokens as input and outputs tokens. For example, in the release package, if you open a test batch file such as test_enu_chs.bat, you will see that it first calls "spm_encode.exe" to encode the given input sentences into BPE tokens, then calls the Seq2SeqConsole tool, and finally calls "spm_decode.exe" to decode the BPE tokens back into sentences. Both "spm_encode" and "spm_decode" are from Google's SentencePiece project.
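For reference, here is a minimal C# sketch of that encode → translate → decode flow. The model and file names and the command-line flags below are illustrative assumptions, not the exact arguments used by test_enu_chs.bat; see the actual batch file in the release package for the real ones.

```csharp
using System.Diagnostics;

// Minimal sketch of the encode -> translate -> decode pipeline that
// test_enu_chs.bat performs. File names and flags are assumptions
// for illustration only.
class TranslatePipeline
{
    // Run a command through cmd.exe so that < and > redirection works.
    static void Run(string commandLine)
    {
        using var p = Process.Start("cmd.exe", "/c " + commandLine);
        p.WaitForExit();
    }

    static void Main()
    {
        // 1. Encode raw source sentences into BPE tokens with SentencePiece.
        Run("spm_encode.exe --model=enu.model < input.txt > input.bpe");
        // 2. Translate the token sequences with Seq2SeqSharp's console tool
        //    (argument names assumed; check the batch file for the real ones).
        Run("Seq2SeqConsole.exe -Task Test -InputTestFile input.bpe -OutputFile output.bpe");
        // 3. Decode the output BPE tokens back into plain sentences.
        Run("spm_decode.exe --model=chs.model < output.bpe > output.txt");
    }
}
```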

In addition, the release package so far includes vocabularies and models for 8 languages (Chinese, German, English, French, Italian, Japanese, Korean, and Russian). They were all built with the SentencePiece library.

@GeorgeS2019
Author

@zhongkaifu Microsoft's BlingFire provides tokenizers very similar to those from HuggingFace, but with claimed better performance.

For example, the GPT2 tokenizer provided by BlingFire exactly matches HuggingFace's vocabulary size. The library also provides documentation on how to create a custom tokenizer based on a diverse, close-to-complete set of templates matching those of HuggingFace.
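For anyone who wants to try it from .NET, here is a rough sketch using the BlingFireNuget package. The method names follow the BlingFireUtils C# wrapper in the BlingFire repo, but treat the exact signatures and the model file name "gpt2.bin" as assumptions to verify against the repo.

```csharp
using System;
using System.Linq;
using BlingFire; // from the BlingFireNuget package

// Rough sketch: tokenize text into GPT2 ids with BlingFire.
// LoadModel/TextToIds/FreeModel follow the C# wrapper in the BlingFire
// repo; "gpt2.bin" is an assumed name for one of the pretrained
// tokenizer models BlingFire ships as .bin files.
class BlingFireDemo
{
    static void Main()
    {
        ulong model = BlingFireUtils.LoadModel("gpt2.bin");

        var ids = new int[1024];
        // Returns the number of ids written; the last argument is the
        // id to use for unknown tokens.
        int count = BlingFireUtils.TextToIds(model, "Hello, Seq2SeqSharp!", ids, ids.Length, 0);

        Console.WriteLine(string.Join(" ", ids.Take(count)));
        BlingFireUtils.FreeModel(model);
    }
}
```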
