### Let's Explore the Tokenizers on Hugging Face (hf.co)

In [2]:
# Log in to Hugging Face (the token is stored as a Colab secret named HF_TOKEN)
from huggingface_hub import login

from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

login(hf_token, add_to_git_credential=True)

#### Let's explore the Meta Llama 3.1 tokenizer

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", trust_remote_code=True)

text = "Hello, I am AI Student learning LLM Tokenizers"
tokens = tokenizer.encode(text)

for token in tokens:
  print(f"{token:>10} => {tokenizer.decode([token]):<20}")

    128000 => <|begin_of_text|>   
      9906 => Hello               
        11 => ,                   
       358 =>  I                  
      1097 =>  am                 
     15592 =>  AI                 
     11988 =>  Student            
      6975 =>  learning           
       445 =>  L                  
     11237 => LM                  
      9857 =>  Token              
     12509 => izers               
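The output above shows that most tokens carry their leading space, and that less common words such as "LLM" and "Tokenizers" are split into several pieces. As a small sanity check, the sketch below copies the decoded pieces from the printed output (without the `<|begin_of_text|>` special token) and confirms that joining them reproduces the original text. The `pieces` list is hand-copied from the output rather than produced by the tokenizer, so treat it as an illustration only.

```python
# Decoded pieces copied by hand from the output above,
# excluding the <|begin_of_text|> special token.
pieces = ["Hello", ",", " I", " am", " AI", " Student",
          " learning", " L", "LM", " Token", "izers"]

text = "Hello, I am AI Student learning LLM Tokenizers"

# Joining the pieces with no separator should give back the input text,
# because the spaces live inside the tokens themselves.
reconstructed = "".join(pieces)
print(reconstructed == text)

# Compare the number of tokens with the number of whitespace-separated words.
num_words = len(text.split())
num_tokens = len(pieces)
print(f"{num_words} words -> {num_tokens} tokens")
```

The token count is higher than the word count because punctuation gets its own token and some words are split into sub-word pieces.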


##### Chat/Instruct models take the conversation as a list of message dictionaries. How do the models understand these dictionaries? For this, Hugging Face tokenizers provide the `apply_chat_template` method, which converts the messages into the prompt format the model expects

In [10]:
instruct_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
]

prompt = instruct_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
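To make the structure of the prompt above easier to see, here is a minimal hand-written sketch that builds a similar string with plain Python. The function name `render_llama3_prompt` is hypothetical and is not part of `transformers`. The real template is stored with the tokenizer and does more than this sketch: for example, it adds the "Cutting Knowledge Date" and "Today Date" lines seen in the output above, which this simplified version omits.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified, hypothetical re-creation of the Llama 3 style chat layout.

    Each message becomes: header with the role, a blank line, the content,
    and an <|eot_id|> end-of-turn marker.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]

prompt_sketch = render_llama3_prompt(messages)
print(prompt_sketch)
```

In practice, prefer `apply_chat_template`, since each model family ships its own template and the details differ between them.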




#### Let's explore the tokenizers for a few other models

In [13]:
### Set Models to explore
llama = "meta-llama/Meta-Llama-3.1-8B-Instruct"
qwen3 = "Qwen/Qwen3-8B"
phi4  = "microsoft/phi-4"

### Tokenizers
llama_tokenizer = AutoTokenizer.from_pretrained(llama, trust_remote_code=True)
qwen3_tokenizer = AutoTokenizer.from_pretrained(qwen3, trust_remote_code=True)
phi4_tokenizer  = AutoTokenizer.from_pretrained(phi4 , trust_remote_code=True)

### Text to tokenize
text = "Hello, I am AI Student learning LLM Tokenizers"

### Tokenize
llama_tokens = llama_tokenizer.encode(text)
qwen3_tokens = qwen3_tokenizer.encode(text)
phi4_tokens  =  phi4_tokenizer.encode(text)

### Look at each tokenizer's tokens
print("Llama :")
for token in llama_tokens:
  print(f"{token} => {llama_tokenizer.decode([token])}", end=" ")

print("\n\nQwen3 :")
for token in qwen3_tokens:
  print(f"{token} => {qwen3_tokenizer.decode([token])}", end=" ")

print("\n\nPhi4 :")
for token in phi4_tokens:
  print(f"{token} => {phi4_tokenizer.decode([token])}", end=" ")

Llama :
128000 => <|begin_of_text|> 9906 => Hello 11 => , 358 =>  I 1097 =>  am 15592 =>  AI 11988 =>  Student 6975 =>  learning 445 =>  L 11237 => LM 9857 =>  Token 12509 => izers 

Qwen3 :
9707 => Hello 11 => , 358 =>  I 1079 =>  am 15235 =>  AI 11726 =>  Student 6832 =>  learning 444 =>  L 10994 => LM 9660 =>  Token 12230 => izers 

Phi4 :
9906 => Hello 11 => , 358 =>  I 1097 =>  am 15592 =>  AI 11988 =>  Student 6975 =>  learning 445 =>  L 11237 => LM 9857 =>  Token 12509 => izers 
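The three outputs above can be compared directly. The sketch below hard-codes the token IDs copied from the printed output (so it runs without downloading any model) and checks two things: whether the Llama and Phi-4 IDs match once Llama's leading `<|begin_of_text|>` token is dropped, and at how many positions Qwen3 produces the same ID as Phi-4. The variable names are chosen for illustration, and the result only describes this one sentence, not the vocabularies in general.

```python
# Token IDs copied by hand from the output above.
llama_ids = [128000, 9906, 11, 358, 1097, 15592, 11988, 6975, 445, 11237, 9857, 12509]
qwen3_ids = [9707, 11, 358, 1079, 15235, 11726, 6832, 444, 10994, 9660, 12230]
phi4_ids = [9906, 11, 358, 1097, 15592, 11988, 6975, 445, 11237, 9857, 12509]

# Llama prepends <|begin_of_text|> (ID 128000); drop it before comparing.
llama_without_bos = llama_ids[1:]
llama_matches_phi4 = llama_without_bos == phi4_ids
print("Llama (without BOS) == Phi-4:", llama_matches_phi4)

# Count positions where Qwen3 and Phi-4 assign the same ID.
shared_positions = sum(q == p for q, p in zip(qwen3_ids, phi4_ids))
print(f"Qwen3 and Phi-4 agree at {shared_positions} of {len(phi4_ids)} positions")
```

All three tokenizers split this sentence into the same pieces, but the IDs assigned to those pieces depend on each model's vocabulary, which is why token IDs from one model cannot be fed to another.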