Use the HuggingFace llama Tokenizer #35
Alright, I made a first attempt, but couldn't manage to get it working. Here's what I tried:
Here, I got an "invalid UTF-8" error. By digging into the source, I figured out that this expects a JSON file, so I tried pointing it at tokenizer_config.json, but that didn't work either 🤔
Digging further into the source, it doesn't even look like the file I got is correct. Perhaps I need to convert it in some other way? Pinging @Narsil again, if you would be so kind to give us a hand here 😅
Apart from my initial exploration, I also realized that I don't think all of this (especially OpenSSL) is needed just to get a tokenizer working, so we should look at this dependency more carefully. Maybe there's a way to extract just the bits we need?
Hey, you're trying to convert the model; there are other scripts for the tokenizer. I haven't finished it yet (it just requires more testing). For dependencies, you can use no-default-features. It does depend on esaxx-rs and onig, which are not entirely needed for this specific tokenizer, but the lib covers a bit more. Will share here once the file is done (and checked against)
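For illustration, trimming the dependency tree would start with `default-features = false` in Cargo.toml; this is a sketch, and which optional features exist (and whether esaxx-rs/onig can actually be dropped) depends on the tokenizers version:

```toml
[dependencies]
# Disable the crate's default feature set and opt back in only to
# what you need; check the crate's own Cargo.toml for which features
# are optional in the version you pin.
tokenizers = { version = "0.13.3", default-features = false }
```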
The tokenizer is ready here: https://huggingface.co/hf-internal-testing/tiny-random-llama/tree/main
But it does require an unreleased version of tokenizers for now.
tokenizers=0.13.3 is released and can be used. The tokenizer is here: https://huggingface.co/hf-internal-testing/llama-tokenizer (tokenizer.json).

```rust
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
// false: don't add special tokens yet; post_process below handles that
let encoded = tokenizer.encode("This is a test", false).unwrap();
// None is the optional second sentence
// true is whether to add special tokens
let encoded = tokenizer.post_process(encoded, None, true).unwrap();
```

https://docs.rs/tokenizers/0.13.3/tokenizers/tokenizer/struct.TokenizerImpl.html#method.post_process

Cheers!
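For the reverse direction there's also `decode`; a minimal sketch, assuming the 0.13.x signature that takes an owned `Vec<u32>`:

```rust
// Turn the token ids back into text, skipping special tokens.
// (Sketch: in tokenizers 0.13.x, decode takes an owned Vec<u32>.)
let text = tokenizer
    .decode(encoded.get_ids().to_vec(), true)
    .unwrap();
println!("{text}");
```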
Hi @Narsil! Thanks a lot :) We are evaluating the best route to integrate this; I have a few questions, if you don't mind:
Bloom is supported, with exactly the same code (just use the file from bigscience/bloom).
These are the "byte-fallback" tokens. When encountering 'UNK' tokens, the byte-fallback will split the char(s) into raw bytes and use the byte tokens appropriately. When decoding, it will attempt to interpret the bytes as UTF-8, and otherwise use the unknown glyph: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs#L47 This mirrors exactly what sentencepiece does.
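As a toy sketch of the idea (not the tokenizers crate's internals): a character with no token of its own is split into its raw UTF-8 bytes, each mapped to a `<0xNN>` byte token.

```rust
// Toy illustration of byte fallback: map a character with no
// dedicated token to one <0xNN> token per raw UTF-8 byte.
fn byte_fallback_tokens(unknown: char) -> Vec<String> {
    let mut buf = [0u8; 4];
    unknown
        .encode_utf8(&mut buf)
        .as_bytes()
        .iter()
        .map(|b| format!("<0x{:02X}>", b))
        .collect()
}

fn main() {
    // '狐' encodes to three UTF-8 bytes, so it becomes three byte tokens.
    assert_eq!(
        byte_fallback_tokens('狐'),
        vec!["<0xE7>", "<0x8B>", "<0x90>"]
    );
}
```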
For RWKV I have no idea what tokenizer they use. Do you have a link?
The official RWKV project uses the Python version of tokenizers. I'm also using it in my little RWKV inference experiment, if an example of use would be helpful: https://github.com/KerfuffleV2/smolrsrwkv You will need the […]
It seems like the current tokenizer can't handle non-English? For example:
But:

### Human: 请给我讲一个关于狐狸的故事。 ("Please tell me a story about a fox.")
### Assistant: 从前有一只非常机智的狐狸。她名叫小美,是整个森林中最受欢迎的狐狸之一。有一天,小美在追逐一只松鼠时,她发现了一件奇怪的东西。

这是一条光明的线条,从美麦上看起来是散发的,但是实际上跟随着小美的动作。小美快速找到了这个线条并试图拉响他们之间的联系。然而,在她的拼命中失去了一部分半透明绳子!

小美意识到现在只剩下几乎一线,但是她还是希望能够找到这条完整的线条。于是,她开始了旅程,去各个方向走动,细嫩地跟随着每一乎线之间的疑问线索。

小美花费不少时间才发现了那段完整的线条了。很好奇,这条线在哪里面与线走在各个方向之间是什么关系?小美就开始了她的冒险,她想要知道这条线和它所跟随的疑问与它具有何种联系。

在她一直追逐后,小美渐渐意识到了一个很特别的现象。每当她接近线走的一根线时,都会被拉动;然而,每当她与线走的附近时,他们之间是两根不相连的线段!

小美很困惑,为什么这条线和那个线之间要有这样一个特殊的联系?在她追逐着不断放弃之前的好奇心中,她开始意识到了线之间并存的其他奇妙现象。

举个例子:有一次,小美发现他们遇到了一只大雌狗。这只雌狗实在是线走的连续的线段!他们向前攀爬,向后攀爬,但是没有放开线条之间的联系,从而成为了连接着两个线段的中转点。

尽管雌狗已经有足够的体质,但是小美还是不会放过这种奇妙现象,最终也成了一只身旁的狐狸。而那条连环形的线段就可能让他们变得永远隔开,无法在互相之间创造联系。

Also, I'm really impressed: it seems like Vicuna 13B can actually speak Mandarin!
I'll test that with my fix for #11, which I suspect is the same issue.
Yes, it looks like the same thing to me as well.
With #122, your sample prompt tokenizes correctly, but doesn't produce any output with Alpaca. I'll have to get my hands on a compatible Vicuna 13B at some point...
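For reference, a quick round-trip check is an easy way to verify this kind of thing; a minimal sketch, assuming the HF `tokenizer.json` is on disk (and that sentencepiece-style normalization may leave a leading space on decode):

```rust
use tokenizers::Tokenizer;

fn main() {
    // Load the HF llama tokenizer file (the path is an assumption).
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    let prompt = "请给我讲一个关于狐狸的故事。";
    let encoded = tokenizer.encode(prompt, false).unwrap();
    // With byte fallback, decoding should reproduce the prompt
    // (possibly modulo a leading space from normalization).
    let decoded = tokenizer.decode(encoded.get_ids().to_vec(), false).unwrap();
    println!("{} tokens: {:?}", encoded.get_ids().len(), decoded);
}
```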
The `tokenizers` crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using. Looks like a LLaMA implementation already landed there (huggingface/transformers#21955), and then @Narsil shared an additional PR on the tokenizers crate (not sure what this fixes, but I assume the changes are necessary?): huggingface/tokenizers#1183
Seems like we have everything we need to use the new tokenizer. An important point remains though: Are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?