
## To run this file

### Meta tokenizer

First, copy the original Meta `tokenizer.py` file from [here](https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/llama/tokenizer.py) so that it can be imported as `tokenizer`.
You'll also need the original `tokenizer.model` file. To get it, request the download [link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and download the file following the [instructions](https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/download.sh#L20).

### llama-cpp-python model

I'm using the [TheBloke](https://huggingface.co/TheBloke/Llama-2-7B-GGML) models. Go to the link and download the model `llama-2-7b-chat.ggmlv3.q2_K.bin`, which is the smallest one.
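
The cells in this notebook read both files from under the `MODELS_PATH` environment variable. As a quick sanity check before running them, something like the following could be used. It is a minimal sketch: the helper name `missing_files` is hypothetical, and the directory layout is simply the one the cells in this notebook use.

```python
import os
from typing import List


def missing_files(models_path: str) -> List[str]:
    """Return the expected model files that do not exist under models_path."""
    expected = [
        # Layout taken from the cells in this notebook.
        os.path.join(models_path, "TheBloke", "llama-2-7b-chat.ggmlv3.q2_K.bin"),
        os.path.join(models_path, "Meta", "LLaMA2", "tokenizer.model"),
    ]
    return [path for path in expected if not os.path.isfile(path)]


# Falls back to the current directory when MODELS_PATH is not set.
for path in missing_files(os.environ.get("MODELS_PATH", "")):
    print("missing:", path)
```

If nothing is printed, both files are where the notebook expects them.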

In [13]:
import os
import llama_cpp
from llama_cpp import Llama
from typing import List
from chat_messages_formatter import Message, llama2_format_messages
from llama_cpp_wrapper import llama_cpp_tokenizer_encode
from tokenizer import Tokenizer
from functools import partial

In [14]:
model_path = os.path.join(os.environ["MODELS_PATH"], "TheBloke", "llama-2-7b-chat.ggmlv3.q2_K.bin")
llama_cpp_llm = Llama(model_path=model_path)
meta_tokenizer = Tokenizer(model_path=os.path.join(os.environ["MODELS_PATH"], 'Meta', 'LLaMA2', "tokenizer.model"))

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


In [15]:
messages: List[Message] = [
    Message(role="user", content="How are you?"),
    Message(role="assistant", content="I'm fine!"),
    Message(role="user", content="Write a four page long essay about Hawaii."),
]

In [16]:
llama_cpp_messages_tokens = llama2_format_messages(messages, tokenizer_encode=partial(llama_cpp_tokenizer_encode, llm=llama_cpp_llm))
print(llama_cpp_messages_tokens)

[1, 29961, 25580, 29962, 3532, 14816, 29903, 6778, 13, 3492, 526, 263, 8444, 29892, 3390, 1319, 322, 15993, 20255, 29889, 29849, 1234, 408, 1371, 3730, 408, 1950, 29892, 1550, 1641, 9109, 29889, 3575, 6089, 881, 451, 3160, 738, 10311, 1319, 29892, 443, 621, 936, 29892, 11021, 391, 29892, 7916, 391, 29892, 304, 27375, 29892, 18215, 29892, 470, 27302, 2793, 29889, 3529, 9801, 393, 596, 20890, 526, 5374, 635, 443, 5365, 1463, 322, 6374, 297, 5469, 29889, 13, 13, 3644, 263, 1139, 947, 451, 1207, 738, 4060, 29892, 470, 338, 451, 2114, 1474, 16165, 261, 296, 29892, 5649, 2020, 2012, 310, 22862, 1554, 451, 1959, 29889, 960, 366, 1016, 29915, 29873, 1073, 278, 1234, 304, 263, 1139, 29892, 3113, 1016, 29915, 29873, 6232, 2089, 2472, 29889, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 5328, 526, 366, 29973, 518, 29914, 25580, 29962, 306, 29915, 29885, 2691, 29991, 29871, 2, 1, 29961, 25580, 29962, 14350, 263, 3023, 1813, 1472, 3686, 388, 1048, 26901, 29875, 29889, 518, 29914, 25580, 29962]


In [17]:
meta_messages_tokens = llama2_format_messages(messages, tokenizer_encode=meta_tokenizer.encode)
print(meta_messages_tokens)

[1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 3492, 526, 263, 8444, 29892, 3390, 1319, 322, 15993, 20255, 29889, 29849, 1234, 408, 1371, 3730, 408, 1950, 29892, 1550, 1641, 9109, 29889, 3575, 6089, 881, 451, 3160, 738, 10311, 1319, 29892, 443, 621, 936, 29892, 11021, 391, 29892, 7916, 391, 29892, 304, 27375, 29892, 18215, 29892, 470, 27302, 2793, 29889, 3529, 9801, 393, 596, 20890, 526, 5374, 635, 443, 5365, 1463, 322, 6374, 297, 5469, 29889, 13, 13, 3644, 263, 1139, 947, 451, 1207, 738, 4060, 29892, 470, 338, 451, 2114, 1474, 16165, 261, 296, 29892, 5649, 2020, 2012, 310, 22862, 1554, 451, 1959, 29889, 960, 366, 1016, 29915, 29873, 1073, 278, 1234, 304, 263, 1139, 29892, 3113, 1016, 29915, 29873, 6232, 2089, 2472, 29889, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 5328, 526, 366, 29973, 518, 29914, 25580, 29962, 306, 29915, 29885, 2691, 29991, 29871, 2, 1, 518, 25580, 29962, 14350, 263, 3023, 1813, 1472, 3686, 388, 1048, 26901, 29875, 29889, 518, 29914, 25580, 29962]


Note that the two outputs differ only where one has the token `518` and the other has `29961`.

In [18]:
assert len(meta_messages_tokens) == len(llama_cpp_messages_tokens)

for i, (a, b) in enumerate(zip(meta_messages_tokens, llama_cpp_messages_tokens)):
    if a != b:
        print(f"#{i:3d}", a, "!=", b)

#  1 518 != 29961
#149 518 != 29961
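
The positional comparison above relies on both sequences having the same length. When two tokenizers split a text into a different number of tokens, an alignment-based diff is easier to read. The following is a minimal sketch using `difflib.SequenceMatcher` from the standard library. The helper name `diff_token_ids` and the two short sequences are hypothetical; they only reuse ids that appear in this notebook.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def diff_token_ids(a: List[int], b: List[int]) -> List[Tuple[str, List[int], List[int]]]:
    """Return the spans where two token-id sequences differ, after aligning them."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sequences of different lengths, for illustration only.
meta_ids = [1, 29871, 518, 25580, 29962]
llama_cpp_ids = [1, 29961, 25580, 29962]

for tag, left, right in diff_token_ids(meta_ids, llama_cpp_ids):
    print(tag, left, "->", right)
```

Each reported span has a tag (`replace`, `delete`, or `insert`) and the differing ids from each side, so a one-token versus two-token split shows up as a single `replace` entry.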


Here I'm comparing how each tokenizer decodes the tokens `518` and `29961`. I also include the token `29871` because it appears later in this notebook. `|` is added as a delimiter to make leading whitespace visible.

In [19]:
for token in [518, 29961, 29871]:
    ld = llama_cpp_llm.detokenize([token])
    decoded_ld = ld.decode('utf-8')
    md = meta_tokenizer.decode([token])
    print(f"Detokenizing token {token} with llama_cpp_llm              : type |{type(ld)}| length |{len(ld)}| detokenized |{ld}|")
    print(f"Detokenizing token {token} with llama_cpp_llm and decoding : type |{type(decoded_ld)}| length |{len(decoded_ld)}| detokenized |{decoded_ld}|")
    print(f"Detokenizing token {token} with meta_tokenizer             : type |{type(md)}| length |{len(md)}| detokenized |{md}|")
    print()



Detokenizing token 518 with llama_cpp_llm              : type |<class 'bytes'>| length |2| detokenized |b' ['|
Detokenizing token 518 with llama_cpp_llm and decoding : type |<class 'str'>| length |2| detokenized | [|
Detokenizing token 518 with meta_tokenizer             : type |<class 'str'>| length |1| detokenized |[|

Detokenizing token 29961 with llama_cpp_llm              : type |<class 'bytes'>| length |1| detokenized |b'['|
Detokenizing token 29961 with llama_cpp_llm and decoding : type |<class 'str'>| length |1| detokenized |[|
Detokenizing token 29961 with meta_tokenizer             : type |<class 'str'>| length |1| detokenized |[|

Detokenizing token 29871 with llama_cpp_llm              : type |<class 'bytes'>| length |1| detokenized |b' '|
Detokenizing token 29871 with llama_cpp_llm and decoding : type |<class 'str'>| length |1| detokenized | |
Detokenizing token 29871 with meta_tokenizer             : type |<class 'str'>| length |0| detokenized ||



Here I'm comparing how each tokenizer encodes the texts decoded above, which are |` [`| and |`[`|. `|` is added as a delimiter to make leading whitespace visible.

In [20]:
for text in [" [", "["]:
    le = llama_cpp_llm.tokenize(text.encode('utf-8'), add_bos=False)
    lte = llama_cpp_tokenizer_encode(text, bos=False, eos=False, llm=llama_cpp_llm)
    me = meta_tokenizer.encode(text, bos=False, eos=False)
    print(f"Tokenizing text |{text}| with encoding + llama_cpp_llm   : type |{type(le)}| length |{len(le)}| tokenized |{le}|")
    print(f"Tokenizing text |{text}| with llama_cpp_tokenizer_encode : type |{type(lte)}| length |{len(lte)}| tokenized |{lte}|")
    print(f"Tokenizing text |{text}| with meta_tokenizer             : type |{type(me)}| length |{len(me)}| tokenized |{me}|")
    print()


Tokenizing text | [| with encoding + llama_cpp_llm   : type |<class 'list'>| length |1| tokenized |[518]|
Tokenizing text | [| with llama_cpp_tokenizer_encode : type |<class 'list'>| length |1| tokenized |[518]|
Tokenizing text | [| with meta_tokenizer             : type |<class 'list'>| length |2| tokenized |[29871, 518]|

Tokenizing text |[| with encoding + llama_cpp_llm   : type |<class 'list'>| length |1| tokenized |[29961]|
Tokenizing text |[| with llama_cpp_tokenizer_encode : type |<class 'list'>| length |1| tokenized |[29961]|
Tokenizing text |[| with meta_tokenizer             : type |<class 'list'>| length |1| tokenized |[518]|



Loading the vocabulary through the low-level API of [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) suggests that the difference comes from the vocabulary.

In [21]:
lparams = llama_cpp.llama_context_default_params()

ctx = llama_cpp.llama_init_from_file(str(model_path).encode("utf8"), lparams)

n_vocab = llama_cpp.llama_n_vocab(ctx)

strings = (llama_cpp.c_char_p * n_vocab)()
scores = (llama_cpp.c_float * n_vocab)()
n_vocab = llama_cpp.c_int(n_vocab)

assert llama_cpp.llama_get_vocab(llama_cpp.llama_context_p(ctx), strings, scores, n_vocab) == n_vocab.value

print(f"First 10 tokens in vocab of size {n_vocab.value}:")
print(strings[:10])
print()

for token in [518, 29961, 29871]:
    print(f"Token {token:5d} is mapped to |{strings[token]}| in vocab. (decode to utf-8 |{strings[token].decode('utf8')}|)")
print()

vocab_map = {token_bytes.decode('utf8', errors='ignore'): idx for idx, token_bytes in enumerate(strings)}
for text in [" [", "[", " "]:
    print(f"Text |{text}| has token {vocab_map[text]:5d} in vocab.")

First 10 tokens in vocab of size 32000:

Token   518 is mapped to |b' ['| in vocab. (decode to utf-8 | [|)
Token 29961 is mapped to |b'['| in vocab. (decode to utf-8 |[|)
Token 29871 is mapped to |b' '| in vocab. (decode to utf-8 | |)

Text | [| has token   518 in vocab.
Text |[| has token 29961 in vocab.
Text | | has token 29871 in vocab.
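
One reading that is consistent with every output in this notebook is that the Meta (SentencePiece) tokenizer prepends a space marker to the text before encoding and drops one leading space when decoding, while this llama.cpp build does neither. That is an assumption, not something verified against either implementation here. The toy sketch below only illustrates the idea: `MARK`, `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names, the vocabulary holds just the three entries inspected above, and greedy longest-match stands in for the real segmentation algorithm.

```python
from typing import Dict, List

# U+2581 is the marker SentencePiece-style vocabularies use in place of a space.
MARK = "\u2581"

# Toy vocabulary: only the three entries inspected above, with the same ids.
TOY_VOCAB: Dict[str, int] = {MARK + "[": 518, "[": 29961, MARK: 29871}
ID_TO_PIECE: Dict[int, str] = {idx: piece for piece, idx in TOY_VOCAB.items()}


def toy_encode(text: str, add_dummy_prefix: bool) -> List[int]:
    """Greedy longest-match over TOY_VOCAB (a stand-in, not the real algorithm)."""
    marked = text.replace(" ", MARK)
    if add_dummy_prefix:
        marked = MARK + marked  # assumption: a space marker is prepended to the input
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    ids: List[int] = []
    pos = 0
    while pos < len(marked):
        piece = next((p for p in pieces if marked.startswith(p, pos)), None)
        if piece is None:
            raise ValueError(f"no toy vocabulary entry matches at position {pos}")
        ids.append(TOY_VOCAB[piece])
        pos += len(piece)
    return ids


def toy_decode(ids: List[int], strip_dummy_prefix: bool) -> str:
    """Join the pieces, turn markers back into spaces, optionally drop one leading space."""
    text = "".join(ID_TO_PIECE[i] for i in ids).replace(MARK, " ")
    if strip_dummy_prefix and text.startswith(" "):
        text = text[1:]  # assumption: the prepended space is dropped again on decode
    return text


for text in [" [", "["]:
    without_prefix = toy_encode(text, add_dummy_prefix=False)
    with_prefix = toy_encode(text, add_dummy_prefix=True)
    print(f"|{text}| without prefix: {without_prefix} | with prefix: {with_prefix}")

for token in [518, 29961, 29871]:
    kept = toy_decode([token], strip_dummy_prefix=False)
    stripped = toy_decode([token], strip_dummy_prefix=True)
    print(f"{token}: kept |{kept}| stripped |{stripped}|")
```

Under these assumptions, the "without prefix" results line up with the `llama_cpp_llm` outputs above and the "with prefix" and "stripped" results line up with the `meta_tokenizer` outputs, which would place the mismatch in how a leading space is handled rather than in the ids themselves.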
