Use the HuggingFace llama Tokenizer #35
Alright, I made a first attempt, but couldn't manage to get it working. Here's what I tried:
Here, I got an "invalid UTF-8" error. By digging into the source, I figured out that this expects a JSON file, so I tried pointing it at tokenizer_config.json, but that didn't work either 🤔

Digging further into the source, it doesn't even look like the file I got is correct. Perhaps I need to convert it in some other way? Pinging @Narsil again; if you would be so kind to give us a hand here 😅
Apart from my initial exploration, I also realized that I don't think all of this (especially OpenSSL) is needed just to get a tokenizer working, so we should look at this dependency more carefully. Maybe there's a way to extract just the bits we need?
Hey, you're trying to convert the model; there are other scripts for the tokenizer. I haven't finished it yet (it just requires more testing). For dependencies, you can use no-default-features. The crate does depend on esaxx-rs and onig, which aren't strictly needed for this specific tokenizer, but the lib covers a bit more. I'll share here once the file is done (and checked).
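A minimal sketch of how the dependency could be slimmed down as suggested above, assuming standard Cargo conventions (the exact optional feature names to re-enable, if any, should be checked against the `tokenizers` crate's own Cargo.toml):

```toml
[dependencies]
# Dropping default features avoids pulling in extras that this specific
# tokenizer doesn't need; re-enable individual features if compilation
# shows something is missing.
tokenizers = { version = "0.13.3", default-features = false }
```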
The tokenizer is ready here: https://huggingface.co/hf-internal-testing/tiny-random-llama/tree/main But it does require an unreleased version of `tokenizers`.
tokenizers=0.13.3 is released and can be used. The tokenizer is here: https://huggingface.co/hf-internal-testing/llama-tokenizer (tokenizer.json).

```rust
let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
// Encode without special tokens first; post-processing adds them below.
let encoded = tokenizer.encode("This is a test", false).unwrap();
// `None` is the optional second sentence;
// `true` means "add special tokens".
let encoded = tokenizer.post_process(encoded, None, true).unwrap();
```

https://docs.rs/tokenizers/0.13.3/tokenizers/tokenizer/struct.TokenizerImpl.html#method.post_process

Cheers!
Hi @Narsil! Thanks a lot :) We are evaluating the best route to integrate this; I have a few questions, if you don't mind:
Bloom is supported, with exactly the same code (just use the file from bigscience/bloom)
These are the "byte-fallback" tokens. When encountering 'UNK' tokens, the byte-fallback will split the char(s) into raw bytes and use the corresponding tokens. When decoding, it will attempt to interpret the bytes as UTF-8, and uses the unknown glyph otherwise: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs#L47 This mirrors exactly what sentencepiece does.
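The byte-fallback scheme described above can be sketched in plain Rust. This is an illustration of the idea only, not the `tokenizers` crate's actual implementation; the `<0xNN>` token spelling follows the linked decoder.

```rust
/// Encode a string the byte-fallback way: one "<0xNN>" token per raw byte.
/// (Illustrative sketch; the real tokenizer only falls back for
/// out-of-vocabulary characters.)
fn byte_fallback_encode(s: &str) -> Vec<String> {
    s.bytes().map(|b| format!("<0x{:02X}>", b)).collect()
}

/// Decode byte-fallback tokens back into a string. Byte sequences that
/// are not valid UTF-8 become the replacement character (the "unknown
/// glyph" behaviour described above).
fn byte_fallback_decode(tokens: &[String]) -> String {
    let bytes: Vec<u8> = tokens
        .iter()
        .filter_map(|t| {
            t.strip_prefix("<0x")
                .and_then(|rest| rest.strip_suffix('>'))
                .and_then(|hex| u8::from_str_radix(hex, 16).ok())
        })
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    // A multi-byte character splits into one token per UTF-8 byte.
    let tokens = byte_fallback_encode("狐");
    println!("{:?}", tokens); // ["<0xE7>", "<0x8B>", "<0x90>"]
    // Decoding reassembles the bytes and re-interprets them as UTF-8.
    println!("{}", byte_fallback_decode(&tokens)); // 狐
}
```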
For RWKV I have no idea what tokenizer they use. Do you have a link?
The official RWKV project uses the Python version of `tokenizers`. I'm also using it in my little RWKV inference experiment, if an example of use would be helpful: https://github.com/KerfuffleV2/smolrsrwkv You will need the
It seems like the current tokenizer can't handle non-English? For example:
But:

> ### Human: 请给我讲一个关于狐狸的故事。 ("Please tell me a story about a fox.")
>
> ### Assistant: 从前有一只非常机智的狐狸。她名叫小美,是整个森林中最受欢迎的狐狸之一。有一天,小美在追逐一只松鼠时,她发现了一件奇怪的东西。
>
> 这是一条光明的线条,从美麦上看起来是散发的,但是实际上跟随着小美的动作。小美快速找到了这个线条并试图拉响他们之间的联系。然而,在她的拼命中失去了一部分半透明绳子!
>
> 小美意识到现在只剩下几乎一线,但是她还是希望能够找到这条完整的线条。于是,她开始了旅程,去各个方向走动,细嫩地跟随着每一乎线之间的疑问线索。
>
> 小美花费不少时间才发现了那段完整的线条了。很好奇,这条线在哪里面与线走在各个方向之间是什么关系?小美就开始了她的冒险,她想要知道这条线和它所跟随的疑问与它具有何种联系。
>
> 在她一直追逐后,小美渐渐意识到了一个很特别的现象。每当她接近线走的一根线时,都会被拉动;然而,每当她与线走的附近时,他们之间是两根不相连的线段!
>
> 小美很困惑,为什么这条线和那个线之间要有这样一个特殊的联系?在她追逐着不断放弃之前的好奇心中,她开始意识到了线之间并存的其他奇妙现象。
>
> 举个例子:有一次,小美发现他们遇到了一只大雌狗。这只雌狗实在是线走的连续的线段!他们向前攀爬,向后攀爬,但是没有放开线条之间的联系,从而成为了连接着两个线段的中转点。
>
> 尽管雌狗已经有足够的体质,但是小美还是不会放过这种奇妙现象,最终也成了一只身旁的狐狸。而那条连环形的线段就可能让他们变得永远隔开,无法在互相之间创造联系。

Also, I'm really impressed: it seems like Vicuna 13B can actually speak Mandarin!
I'll test that with my fix for #11, which I suspect is the same issue.
Yes, it looks like the same thing to me as well. |
With #122, your sample prompt tokenizes correctly, but doesn't produce any output with Alpaca. I'll have to get my hands on a compatible Vicuna 13B at some point... |
The `tokenizers` crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using. Looks like a LLaMA implementation already landed there (huggingface/transformers#21955), and then @Narsil shared an additional PR on the tokenizers crate (not sure what this fixes, but I assume the changes are necessary?): huggingface/tokenizers#1183
Seems like we have everything we need to use the new tokenizer. An important point remains though: Are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?