Request for more challenging Transformer architecture use cases through a better performance tokenizer .NET library #33

Closed
GeorgeS2019 opened this issue Oct 8, 2021 · 2 comments

Comments

@GeorgeS2019

Seq2SeqSharp is a valid alternative for a .NET Transformer architecture solution.

It seems that a cross-platform .NET tokenizer library, especially one with better performance than those provided through Python libraries, would make it less challenging for Seq2SeqSharp to explore other real-world, end-to-end Transformer architecture examples such as GPT2, BERT, etc.

I am raising this issue to encourage users here to share their feedback toward a concerted effort on such a .NET tokenization library.

@GeorgeS2019 GeorgeS2019 changed the title Request for more challenging Transformer architecture use cases through a better performance tokenizers Request for more challenging Transformer architecture use cases through a better performance tokenizer .NET library Oct 8, 2021
@zhongkaifu
Owner

Hi @GeorgeS2019 ,

Thanks for your comments.

I may not understand why you need a .NET tokenization library. Could you please specify the scenario in which you would like to use it?

For Seq2SeqSharp, you can use any tokenization library for data processing. Seq2SeqSharp only cares about tokens: it takes tokens as input and outputs tokens. For example, in the release package, if you open a test batch file such as test_enu_chs.bat, you will see that it first calls "spm_encode.exe" to encode the given input sentences into BPE tokens, then calls the Seq2SeqConsole tool, and finally calls "spm_decode.exe" to decode the BPE tokens back into sentences. Both "spm_encode" and "spm_decode" are from Google's SentencePiece project.
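For reference, here is a minimal C# sketch of that encode → translate → decode flow. The model and file names and the command-line flags below are illustrative assumptions, not the exact arguments used by test_enu_chs.bat; see the actual batch file in the release package for the real ones.

```csharp
using System.Diagnostics;

// Minimal sketch of the encode -> translate -> decode pipeline that
// test_enu_chs.bat performs. File names and flags are assumptions
// for illustration only.
class TranslatePipeline
{
    // Run a command through cmd.exe so that < and > redirection works.
    static void Run(string commandLine)
    {
        using var p = Process.Start("cmd.exe", "/c " + commandLine);
        p.WaitForExit();
    }

    static void Main()
    {
        // 1. Encode raw source sentences into BPE tokens with SentencePiece.
        Run("spm_encode.exe --model=enu.model < input.txt > input.bpe");
        // 2. Translate the token sequences with Seq2SeqSharp's console tool
        //    (argument names assumed; check the batch file for the real ones).
        Run("Seq2SeqConsole.exe -Task Test -InputTestFile input.bpe -OutputFile output.bpe");
        // 3. Decode the output BPE tokens back into plain sentences.
        Run("spm_decode.exe --model=chs.model < output.bpe > output.txt");
    }
}
```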

In addition, the release package so far includes vocabularies and models for 8 languages (Chinese, German, English, French, Italian, Japanese, Korean, and Russian). They were all built with the SentencePiece library.

@GeorgeS2019
Author

@zhongkaifu Microsoft's BlingFire provides tokenizers very similar to those from HuggingFace, but with claimed better performance.

For example, the GPT2 tokenizer provided by BlingFire exactly matches HuggingFace's vocabulary size. The library also provides documentation on how to create a custom tokenizer based on a diverse, close-to-complete set of templates matching those of HuggingFace.
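For anyone who wants to try it from .NET, here is a rough sketch using the BlingFireNuget package. The method names follow the BlingFireUtils C# wrapper in the BlingFire repo, but treat the exact signatures and the model file name "gpt2.bin" as assumptions to verify against the repo.

```csharp
using System;
using System.Linq;
using BlingFire; // from the BlingFireNuget package

// Rough sketch: tokenize text into GPT2 ids with BlingFire.
// LoadModel/TextToIds/FreeModel follow the C# wrapper in the BlingFire
// repo; "gpt2.bin" is an assumed name for one of the pretrained
// tokenizer models BlingFire ships as .bin files.
class BlingFireDemo
{
    static void Main()
    {
        ulong model = BlingFireUtils.LoadModel("gpt2.bin");

        var ids = new int[1024];
        // Returns the number of ids written; the last argument is the
        // id to use for unknown tokens.
        int count = BlingFireUtils.TextToIds(model, "Hello, Seq2SeqSharp!", ids, ids.Length, 0);

        Console.WriteLine(string.Join(" ", ids.Take(count)));
        BlingFireUtils.FreeModel(model);
    }
}
```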
