
Use the HuggingFace llama Tokenizer #35

Closed
setzer22 opened this issue Mar 18, 2023 · 13 comments · Fixed by #271
Labels
issue:enhancement New feature or request meta:maintenance Changes that will make it easier for us to maintain code

Comments

@setzer22
Collaborator

The tokenizers crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using.

Looks like a LLaMA implementation already landed there (huggingface/transformers#21955), and @Narsil shared an additional PR on the tokenizers crate (not sure what it fixes, but I assume the changes are necessary?): huggingface/tokenizers#1183

Seems like we have everything we need to use the new tokenizer. An important point remains, though: are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?

@setzer22
Collaborator Author

setzer22 commented Mar 18, 2023

Alright, I made a first attempt, but couldn't manage to get it working. Here's what I tried:

  1. Pulled the https://github.com/huggingface/transformers/ repository.
  2. Installed torch using pip install torch
  3. Ran the converter script to convert both the weights and the tokenizer to the huggingface format, i.e.:
python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir "/data/Llama/LLaMA/" --model_size 7B --output_dir /data/Llama/LLaMA/7B
  4. Added the https://github.com/huggingface/tokenizers/ crate as a git dependency in our Cargo.toml, pointing to the latest main branch commit.
  5. Tried loading the tokenizer from the file, as suggested:
use tokenizers::tokenizer::{Result, Tokenizer};
let tokenizer = Tokenizer::from_file("/data/Llama/LLaMA/7B/tokenizer/tokenizer.model").unwrap();

Here, I got an "invalid UTF-8" error. By digging into the source, I figured out this expects a JSON file, so I tried pointing it at tokenizer_config.json, but that didn't work either 🤔

Error("expected `,` or `}`", line: 1, column: 13)'

Digging further into the source, it doesn't even look like the file I got is correct. Perhaps I need to convert it in some other way?

Pinging @Narsil again, if you would be so kind to give us a hand here 😅

@setzer22
Collaborator Author

Apart from my initial exploration, I also realized the tokenizers crate brings in a ton of dependencies, and requires OpenSSL libraries to be installed in order to build.

I don't think all of this (especially OpenSSL) is needed just to get a tokenizer working, so we should look at this dependency more carefully. Maybe there's a way to extract just the bits we need?

@Narsil

Narsil commented Mar 18, 2023

Hey, you're trying to convert the model; there are separate scripts for the tokenizer. I haven't finished it yet (it just requires more testing).

For dependencies, you can use no-default-features. The crate does depend on esaxx-rs and onig, which are not strictly needed for this specific tokenizer, but the lib covers a bit more. Will share here once the file is done (and checked).

@philpax philpax added issue:enhancement New feature or request meta:maintenance Changes that will make it easier for us to maintain code labels Mar 24, 2023
@philpax philpax mentioned this issue Mar 26, 2023
@Narsil

Narsil commented Mar 26, 2023

The tokenizer is ready here: https://huggingface.co/hf-internal-testing/tiny-random-llama/tree/main

But it does require tokenizers@main, which is not released yet. I'll try to do a release next week (there are still a few needed updates within transformers, and some additional checks, since the change is much bigger than anticipated).

@Narsil

Narsil commented Apr 5, 2023

tokenizers=0.13.3 is released and can be used.

The tokenizer is here https://huggingface.co/hf-internal-testing/llama-tokenizer (tokenizer.json).

use tokenizers::tokenizer::Tokenizer;

let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
// false: don't add special tokens here; post_process adds them below
let encoded = tokenizer.encode("This is a test", false).unwrap();
// None is the optional second sentence
// true is whether to add special tokens
let encoded = tokenizer.post_process(encoded, None, true).unwrap();

https://docs.rs/tokenizers/0.13.3/tokenizers/tokenizer/struct.TokenizerImpl.html#method.post_process

Cheers!

@setzer22
Collaborator Author

setzer22 commented Apr 6, 2023

Hi @Narsil! Thanks a lot :)

We are evaluating the best route to integrate this. I have a few questions, if you don't mind:

  • We are considering a potential integration of BLOOM and RWKV in the future. Would it be possible to use this library to tokenize input for those models?
  • Do you happen to know what tokens 3 to 258 are used for? They seem to be used to represent raw byte data. Is the point of this to allow the model to represent non-UTF-8 sequences of characters? How does the library handle these tokens when decoding back to a string?

@Narsil

Narsil commented Apr 6, 2023

We are considering a potential integration of BLOOM and RWKV in the future. Would it be possible to use this library to tokenize input for those models?

Bloom is supported, with exactly the same code (just use the file from bigscience/bloom).

Do you happen to know what tokens 3 to 258 are used for? They seem to be used to represent raw byte data. Is the point of this to allow the model to represent non-UTF-8 sequences of characters? How does the library handle these tokens when decoding back to a string?

These are the "byte-fallback" tokens. When encountering 'UNK' tokens, the byte-fallback will split the char(s) into raw bytes and use the tokens appropriately.

When decoding, it will attempt to interpret the bytes as UTF-8, substituting the unknown glyph (�) for each invalid byte:

https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs#L47

This mirrors exactly what sentencepiece does.

@Narsil

Narsil commented Apr 6, 2023

For RWKV I have no idea what tokenizer they use. Do you have a link?

@KerfuffleV2
Contributor

and RWKV in the future.

The official RWKV project uses the Python version of tokenizers.

I'm also using it in my little RWKV inference experiment if an example of use would be helpful: https://github.com/KerfuffleV2/smolrsrwkv

You will need the .json file that defines the tokenizer RWKV models are set up to use. You can find it here: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/20B_tokenizer.json

@KerfuffleV2
Contributor

It seems like the current tokenizer can't handle non-English? For example:

Using ### Human: 请给我讲一个关于狐狸的故事。 (“Please tell me a story about a fox.”) as the prompt results in:

[2023-04-07T14:56:15Z ERROR llama_cli] Failed to tokenize initial prompt.

But llama.cpp works fine:


### Human: 请给我讲一个关于狐狸的故事。

### Assistant:

从前有一只非常机智的狐狸。她名叫小美,是整个森林中最受欢迎的狐狸之一。有一天,小美在追逐一只松鼠时,她发现了一件奇怪的东西。

这是一条光明的线条,从美麦上看起来是散发的,但是实际上跟随着小美的动作。小美快速找到了这个线条并试图拉响他们之间的联系。然而,在她的拼命中失去了一部分半透明绳子!

小美意识到现在只剩下几乎一线,但是她还是希望能够找到这条完整的线条。于是,她开始了旅程,去各个方向走动,细嫩地跟随着每一乎线之间的疑问线索。

小美花费不少时间才发现了那段完整的线条了。很好奇,这条线在哪里面与线走在各个方向之间是什么关系?小美就开始了她的冒险,她想要知道这条线和它所跟随的疑问与它具有何种联系。

在她一直追逐后,小美渐渐意识到了一个很特别的现象。每当她接近线走的一根线时,都会被拉动;然而,每当她与线走的附近时,他们之间是两根不相连的线段!

小美很困惑,为什么这条线和那个线之间要有这样一个特殊的联系?在她追逐着不断放弃之前的好奇心中,她开始意识到了线之间并存的其他奇妙现象。

举个例子:有一次,小美发现他们遇到了一只大雌狗。这只雌狗实在是线走的连续的线段!他们向前攀爬,向后攀爬,但是没有放开线条之间的联系,从而成为了连接着两个线段的中转点。

尽管雌狗已经有足够的体质,但是小美还是不会放过这种奇妙现象,最终也成了一只身旁的狐狸。而那条连环形的线段就可能让他们变得永远隔开,无法在互相之间创造联系。


Also, I'm really impressed; it seems like Vicuna 13B can actually speak Mandarin!

@philpax
Collaborator

philpax commented Apr 7, 2023

I'll test that with my fix for #11, which I suspect is the same issue.

@KerfuffleV2
Contributor

Yes, it looks like the same thing to me as well.

@philpax
Collaborator

philpax commented Apr 7, 2023

With #122, your sample prompt tokenizes correctly, but doesn't produce any output with Alpaca. I'll have to get my hands on a compatible Vicuna 13B at some point...
