feat: get vocab from tokenizer is better. #31

FFengIll · 2023-09-14T07:31:51Z

Some models do not provide a vocab.txt, but the tokenizer must still work after parsing tokenizer.json (it rebuilds the vocab internally).

So we can load the vocab directly from the tokenizer and re-sort it.
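
A minimal sketch of the re-sort, assuming `tokenizer.get_vocab()` returns a token-to-id mapping (as Hugging Face tokenizers do) and that the converter wants the tokens in id order. The helper name `vocab_in_id_order` and the sample mapping are hypothetical and only stand in for a real tokenizer:

```python
def vocab_in_id_order(vocab):
    """Return the tokens of a token -> id mapping, ordered by id."""
    # Sort the (token, id) pairs on the id, then keep only the tokens.
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


# Stand-in for tokenizer.get_vocab(); the dict is not in id order.
sample_vocab = {"hello": 3, "[PAD]": 0, "[CLS]": 2, "[UNK]": 1}
vocab_list = vocab_in_id_order(sample_vocab)
print(vocab_list)
```

Sorting on the id rather than relying on dict order keeps the output independent of how the tokenizer happens to build its mapping.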

FFengIll · 2023-09-19T02:26:21Z

models/convert-to-ggml.py

@@ -22,8 +22,6 @@
 with open(dir_model + "/config.json", "r", encoding="utf-8") as f:
    hparams = json.load(f)

-with open(dir_model + "/vocab.txt", "r", encoding="utf-8") as f:


vocab.txt may not exist.

FFengIll · 2023-09-19T02:26:51Z

models/convert-to-ggml.py

+vocab_list = []
+
+# print(tokenizer.get_vocab())
+vocab = tokenizer.get_vocab()


The tokenizer already has a good implementation for getting the vocab.

feat: get vocab from tokenizer is better.

1ccc15d

FFengIll mentioned this pull request Sep 18, 2023

GGUF file format specification ggerganov/ggml#302

Merged

skeskinen mentioned this pull request Sep 18, 2023

subword # should be an option. #33

Open

FFengIll commented Sep 19, 2023


feat: use vocab_size to loop vocab to avoid error.

4aa1a17
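
The commit title does not say which error is avoided. One plausible reading, sketched here as an assumption and not as the actual change, is that the converter walks the ids from 0 up to `vocab_size` (read from config.json) so that the written vocab always has the size the model expects, even when the tokenizer's mapping has gaps. The helper `tokens_for_vocab_size` and the `[unusedN]` placeholder format are hypothetical:

```python
def tokens_for_vocab_size(vocab, vocab_size):
    """Return one token per id in range(vocab_size), filling any gaps."""
    # Invert the token -> id mapping so tokens can be looked up by id.
    id_to_token = {idx: token for token, idx in vocab.items()}
    tokens = []
    for idx in range(vocab_size):
        # Hypothetical placeholder for ids the tokenizer does not define.
        tokens.append(id_to_token.get(idx, f"[unused{idx}]"))
    return tokens


hparams = {"vocab_size": 5}  # stand-in for the parsed config.json
sample_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 3}
tokens = tokens_for_vocab_size(sample_vocab, hparams["vocab_size"])
print(tokens)
```

Looping over `range(vocab_size)` instead of over the mapping itself means a missing id yields a placeholder rather than a shifted or short token list.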
