# Hugging Face model exploration

In [1]:
!uv add transformers accelerate

Resolved 258 packages in 0.57ms
Audited 153 packages in 0.99ms


## Choose a model you want to explore

In [2]:
model_name = "Qwen/Qwen3-0.6B"

## Load the model and generate one answer

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language models"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# Decode and display the output
output = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(f"MODEL: {model_name}\n")
print("PROMPT TEMPLATE:")
print(text)
print("ANSWER:")
print(output)

MODEL: Qwen/Qwen3-0.6B

PROMPT TEMPLATE:
<|im_start|>user
Give me a short introduction to large language models<|im_end|>
<|im_start|>assistant

ANSWER:
<think>
Okay, the user is asking for a short introduction to large language models. Let me start by recalling what I know about them. First, I should mention that they are big language models, which are AI systems designed to understand and generate human language.

I need to highlight their capabilities, like understanding complex texts and generating creative content. Also, their training data and how they learn from it. Maybe mention the different types, like GPT series or others. Oh, and their applications, like in various fields.

Wait, should I include something about their training process? Like how they are trained on massive datasets. Also, their ability to handle multiple languages. Oh, and maybe their use cases in different industries. Let me check if I'm covering all key points without being too technical. Keep it concise b
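
With `enable_thinking=True` the generated text starts with a `<think>` block, so the reasoning and the final answer usually need to be separated before use. The sketch below is one possible way to do that on the token IDs. It assumes the `</think>` ID 151668 listed by `tokenizer.get_added_vocab()` later in this notebook; `split_thinking` is a hypothetical helper and the example IDs are made up for illustration.

```python
# Token ID of '</think>' as reported by tokenizer.get_added_vocab() in this notebook.
END_THINK_ID = 151668

def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Split generated IDs into (thinking_ids, answer_ids) at the last '</think>' token."""
    try:
        # Position just past the last occurrence of the end-of-thinking token.
        cut = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No '</think>' found: treat everything as the answer.
        cut = 0
    return output_ids[:cut], output_ids[cut:]

# Hypothetical IDs for illustration only (not real model output).
thinking_ids, answer_ids = split_thinking([151667, 11, 12, 151668, 21, 22])
print(thinking_ids)  # [151667, 11, 12, 151668]
print(answer_ids)    # [21, 22]
```

Each part can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` separately.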

## Explore the tokenizer

A tokenizer converts raw text into smaller units called tokens and maps each token to the integer ID the model was trained with. A language model operates on these IDs only; it never sees characters or words directly.

Instead of using full words only, modern tokenizers often use subwords, which are pieces of words (like `un`, `break`, `##able`) that help handle rare, new, or misspelled terms. As a result, the model can read and generate such terms without needing every possible word in its vocabulary.

In short, the tokenizer forms the bridge between human text and the model’s internal numeric representation.
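
To make the text → subwords → IDs idea concrete, here is a minimal toy sketch. It uses a greedy longest-match rule over a tiny hypothetical vocabulary (`TOY_VOCAB` and `toy_tokenize` are invented for illustration). This is not how the Qwen tokenizer works internally; it only shows how a word missing from the vocabulary can still be covered by smaller pieces.

```python
# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
TOY_VOCAB = {"un": 0, "break": 1, "able": 2, "b": 3, "r": 4, "e": 5,
             "a": 6, "k": 7, "l": 8, "u": 9, "n": 10}

def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of a word into known subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no piece covers {word[start]!r}")
    return pieces

pieces = toy_tokenize("unbreakable")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['un', 'break', 'able']
print(ids)     # [0, 1, 2]
```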

**Main characteristics**

**1- Tokenizer type**

Hugging Face tokenizers come in two implementations: slow (pure Python) and fast (backed by the Rust `tokenizers` library). Both follow the same conceptual processing pipeline, but fast tokenizers are considerably more efficient and expose extra features such as offset mapping.

| Feature                    | Slow Tokenizer (Python) | Fast Tokenizer (Rust)            |
| -------------------------- | ----------------------- | -------------------------------- |
| Language backend           | Pure Python             | Rust (`tokenizers` library)      |
| Speed                      | Slower                  | Faster, especially on batches    |
| Offset mapping             | ❌ not available         | ✅ available                      |
| Consistency with HF models | Good                    | Best / canonical implementations |
| Best for custom logic      | Easier to modify        | More restrictive                 |
| Unicode + splitting        | Python-based            | Optimized Rust implementation    |

In [60]:
tokenizer.name_or_path

'Qwen/Qwen3-0.6B'

In [53]:
type(tokenizer)

transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast

In [56]:
tokenizer.is_fast

True

In [65]:
tokenizer.slow_tokenizer_class

transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer

**2- Tokenization strategy / model**

Defines how text gets broken into tokens. Common strategies include:
- BPE (Byte Pair Encoding): iteratively merges the most frequent byte or character pairs (e.g., GPT-2, RoBERTa)
- WordPiece: marks word-continuation subwords with a `##` prefix (e.g., BERT)
- Unigram (usually via the SentencePiece library): probabilistic subword model (e.g., T5, ALBERT)
- Character/Byte-level: tokens represent single characters or raw bytes

In [78]:
repr(tokenizer.backend_tokenizer.model)[:150] + " ..."

'BPE(dropout=None, unk_token=None, continuing_subword_prefix="", end_of_word_suffix="", fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab ...'
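
The output above shows that this tokenizer uses a BPE model. As a rough illustration of how BPE merge rules are learned, here is a character-level toy sketch. It is a simplification under stated assumptions: the real tokenizer works on bytes after pre-tokenization and ships with a pre-trained merge table, and `learn_bpe_merges` is a hypothetical helper, not a library function.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('o', 'w'), ('l', 'ow')]
print(dict(corpus))  # {('low',): 2, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```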

**3- Vocabulary**

The set of allowed tokens and their numeric IDs:
- Vocabulary size (e.g., 50k tokens)
- Special tokens (e.g., `<pad>`, `<s>`, `</s>`, `<unk>`, `<mask>`)
- Token format (e.g., `##sub` in WordPiece, `Ġword` in byte-level BPE)

In [68]:
tokenizer.vocab_size

151643

In [47]:
for token_name in tokenizer.special_tokens_map.keys():
    if token_name != "additional_special_tokens":
        print(f"{token_name} -> {tokenizer.special_tokens_map[token_name]}")

eos_token -> <|im_end|>
pad_token -> <|endoftext|>


In [70]:
tokenizer.all_special_tokens

['<|im_end|>',
 '<|endoftext|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|vision_pad|>',
 '<|image_pad|>',
 '<|video_pad|>']

In [69]:
tokenizer.get_added_vocab()

{'<|endoftext|>': 151643,
 '<|im_start|>': 151644,
 '<|im_end|>': 151645,
 '<|object_ref_start|>': 151646,
 '<|object_ref_end|>': 151647,
 '<|box_start|>': 151648,
 '<|box_end|>': 151649,
 '<|quad_start|>': 151650,
 '<|quad_end|>': 151651,
 '<|vision_start|>': 151652,
 '<|vision_end|>': 151653,
 '<|vision_pad|>': 151654,
 '<|image_pad|>': 151655,
 '<|video_pad|>': 151656,
 '<tool_call>': 151657,
 '</tool_call>': 151658,
 '<|fim_prefix|>': 151659,
 '<|fim_middle|>': 151660,
 '<|fim_suffix|>': 151661,
 '<|fim_pad|>': 151662,
 '<|repo_name|>': 151663,
 '<|file_sep|>': 151664,
 '<tool_response>': 151665,
 '</tool_response>': 151666,
 '<think>': 151667,
 '</think>': 151668}
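
Note that `tokenizer.vocab_size` (151643) equals the first added-token ID, which suggests that it counts only the base vocabulary and that the added tokens sit directly after it. The arithmetic below uses only numbers copied from the outputs above; the resulting total is an expectation that can be checked against `len(tokenizer)`, not a measured value.

```python
# Values copied from the cell outputs above.
base_vocab_size = 151643                          # tokenizer.vocab_size
first_added_id, last_added_id = 151643, 151668    # from tokenizer.get_added_vocab()

# Added tokens occupy a contiguous ID range right after the base vocabulary.
num_added = last_added_id - first_added_id + 1
total_ids = base_vocab_size + num_added

print(num_added)  # 26
print(total_ids)  # 151669
```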

In [86]:
",".join(tokenizer.convert_ids_to_tokens(list(range(256, 512))))

'ĠĠ,ĠĠĠĠ,in,Ġt,ĠĠĠĠĠĠĠĠ,er,ĠĠĠ,on,Ġa,re,at,st,en,or,Ġth,ĊĊ,Ġc,le,Ġs,it,an,ar,al,Ġthe,;Ċ,Ġp,Ġf,ou,Ġ=,is,ĠĠĠĠĠĠĠ,ing,es,Ġw,ion,ed,ic,Ġb,Ġd,et,Ġm,Ġo,ĉĉ,ro,as,el,ct,nd,Ġin,Ġh,ent,id,Ġn,am,ĠĠĠĠĠĠĠĠĠĠĠ,Ġto,Ġre,--,Ġ{,Ġof,om,);Ċ,im,čĊ,Ġ(,il,//,Ġand,ur,se,Ġl,ex,ĠS,ad,Ġ",ch,ut,if,**,Ġ},em,ol,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,th,)Ċ,Ġ{Ċ,Ġg,ig,iv,,Ċ,ce,od,Ġv,ate,ĠT,ag,ay,Ġ*,ot,us,ĠC,Ġst,ĠI,un,ul,ue,ĠA,ow,Ġ\',ew,Ġ<,ation,(),Ġfor,ab,ort,um,ame,Ġis,pe,tr,ck,âĢ,Ġy,ist,----,.ĊĊ,he,Ġe,lo,ĠM,Ġbe,ers,Ġon,Ġcon,ap,ub,ĠP,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,ass,int,>Ċ,ly,urn,Ġ$,;ĊĊ,av,port,ir,->,nt,ction,end,Ġde,ith,out,turn,our,ĠĠĠĠĠ,lic,res,pt,==,Ġthis,Ġwh,Ġif,ĠD,ver,age,ĠB,ht,ext,=",Ġthat,****,ĠR,Ġit,ess,ĠF,Ġr,os,and,Ġas,ect,ke,rom,Ġ//,con,ĠL,(",qu,lass,Ġwith,iz,de,ĠN,Ġal,op,up,get,Ġ}Ċ,ile,Ġan,ata,ore,ri,Ġpro,;čĊ,ĉĉĉĉ,ter,ain,ĠW,ĠE,Ġcom,Ġreturn,art,ĠH,ack,import,ublic,Ġor,est,ment,ĠG,able,Ġ-,ine,ill,ind,ere,::,ity,Ġ+,Ġtr,elf,ight,(\',orm,ult,str,..,",,Ġyou,ype,pl,Ġnew,Ġj,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,Ġfrom,Ġex,ĠO,ld,Ġ[,oc,:Ċ,Ġse'
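
Characters such as `Ġ`, `Ċ`, and `ĉ` in the listing above come from the byte-level encoding: each of the 256 byte values is mapped to a printable Unicode character so that tokens can be stored as ordinary strings. The sketch below reimplements the byte-to-character table popularized by GPT-2. It assumes that the `ByteLevel` pre-tokenizer used here follows the same table; `token_to_text` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    byte_values, code_points = keep[:], keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes (space, control characters, ...) are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def token_to_text(token):
    """Turn a byte-level token string back into readable text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

print(BYTE_TO_CHAR[ord(" ")])       # Ġ
print(BYTE_TO_CHAR[ord("\n")])      # Ċ
print(repr(token_to_text("Ġthe")))  # ' the'
print(repr(token_to_text(";ĊĊ")))   # ';\n\n'
```

Read this way, `Ġthe` is the word "the" with a leading space, and `ĉĉ` is two tab characters.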

**4- Language model interface**

- Names of the inputs the Hugging Face model expects
- Maximum sequence length of the language model
- Chat template applied to the messages before tokenization so that the text matches the format the instruct model was trained on, including special tokens

In [57]:
tokenizer.model_input_names

['input_ids', 'attention_mask']

In [58]:
tokenizer.model_max_length

131072

In [43]:
from IPython.display import Markdown, display
display(Markdown(f"```jinja\n{tokenizer.chat_template}\n```"))

```jinja
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}
```
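
For plain system and user messages without tools, the template above reduces to a simple layout. The Python sketch below mirrors only that simplified path, as a reading aid; `render_chatml` is a hypothetical helper and deliberately ignores assistant turns, tool calls, and tool responses, which the real template handles.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Simplified rendering of system/user turns, following the template above."""
    parts = []
    for message in messages:
        if message["role"] not in ("system", "user"):
            raise ValueError("this sketch only handles system and user messages")
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # The template pre-fills an empty think block when thinking is disabled.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)

messages = [{"role": "user", "content": "Give me a short introduction to large language models"}]
print(render_chatml(messages))
```

For the single user message used earlier, this prints the same text as the PROMPT TEMPLATE output at the top of the notebook.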

**Overall Tokenization Pipeline**

Regardless of slow or fast implementation, the tokenizer generally performs steps such as:

1. Normalization
2. Pre-tokenization
3. Subword tokenization / Model step
4. Post-processing (special tokens, padding/truncation, etc.)
5. Conversion to IDs
6. Output formatting

**1- Normalization**

Transforms the input string into a standard form.

Common operations:
- Lowercasing (if model is uncased)
- Unicode normalization (NFD/NFC, etc.)
- Stripping accents
- Replacing special characters
- Handling control characters or whitespace cleanup

In [23]:
tokenizer.backend_tokenizer.normalizer

NFC()
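
NFC composes a base character and its combining marks into a single code point where one exists, so visually identical strings end up with the same representation before tokenization. A small standard-library illustration (the example word is chosen here for illustration):

```python
import unicodedata

decomposed = "mode\u0300les"  # 'e' followed by a combining grave accent
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 8 7
print(composed == "mod\u00e8les")      # True
```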

**2- Pre-tokenization**

Splits text into basic token units before subword encoding.

Examples depending on tokenizer:
- Whitespace splitting
- Punctuation splitting ("hello," → ["hello", ","])
- Byte-level (GPT-2 style BPE), where the text is mapped to raw bytes so that any string can be encoded

In [30]:
tokenizer.backend_tokenizer.pre_tokenizer

Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"), behavior=Isolated, invert=False), ByteLevel(add_prefix_space=False, trim_offsets=False, use_regex=False)])
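
The `Split` regex above decides where word boundaries fall before BPE runs. Python's built-in `re` module does not support `\p{L}` or `\p{N}`, so the sketch below is only an approximation: it substitutes `[^\W\d_]` for letters and `\d` for numbers, which differs from the real pattern for some Unicode characters. It is meant for experimenting with the splitting behavior, not as a faithful reimplementation.

```python
import re

# Stdlib approximation of the pre-tokenizer regex shown above (assumption:
# [^\W\d_] stands in for \p{L} and \d stands in for \p{N}).
PRETOKENIZE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"       # English contractions
    r"|(?:(?![\r\n])[\W_])?[^\W\d_]+"     # optional leading space/symbol + letters
    r"|\d"                                # one digit at a time
    r"| ?(?:(?!\s)[\W_])+[\r\n]*"         # punctuation run with optional leading space
    r"|\s*[\r\n]+"                        # newlines with any preceding whitespace
    r"|\s+(?!\S)"                         # whitespace not followed by a non-space character
    r"|\s+"                               # any remaining whitespace
)

print(PRETOKENIZE.findall("Give me a short introduction"))
# ['Give', ' me', ' a', ' short', ' introduction']
print(PRETOKENIZE.findall("198254.17 + 14,76"))
# ['1', '9', '8', '2', '5', '4', '.', '1', '7', ' +', ' ', '1', '4', ',', '7', '6']
```

Because the number alternative matches a single character, digits are isolated one by one at this stage, and words keep their leading space.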

**3- Subword Tokenization (Model step)**

Applies the vocabulary and subword rules (depending on model type).

Examples:
- BPE (GPT-2, RoBERTa): merges frequent byte pairs into subwords
- WordPiece (BERT): marks word-continuation subwords with a "##" prefix
- Unigram, usually via SentencePiece (T5, ALBERT): probabilistic, language-agnostic model

In [78]:
repr(tokenizer.backend_tokenizer.model)[:150] + " ..."

'BPE(dropout=None, unk_token=None, continuing_subword_prefix="", end_of_word_suffix="", fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab ...'

**4- Post-processing**

Adds model-specific special tokens.

Examples:
- BERT: [CLS] tokens [SEP]
- GPT-2: no explicit BOS/EOS by default
- T5: appends `</s>` at the end of the sequence

Other applied options:
- Truncation (max length cutting)
- Padding (pad to max or dynamic length)

In [22]:
tokenizer.backend_tokenizer.post_processor

ByteLevel(add_prefix_space=False, trim_offsets=False, use_regex=False)
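
Padding and truncation can be pictured as simple list operations on the token IDs. The sketch below is a toy version under stated assumptions: `pad_batch` is a hypothetical helper, and the pad ID 151643 is the ID of `<|endoftext|>`, the `pad_token` shown earlier in this notebook. In practice the tokenizer does this itself when called with `padding=True` or `truncation=True`.

```python
PAD_ID = 151643  # '<|endoftext|>', the pad_token shown earlier in this notebook

def pad_batch(batch, pad_id=PAD_ID, max_length=None, padding_side="right"):
    """Truncate to max_length, then pad every sequence to the longest one."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        pad = [pad_id] * (target - len(ids))
        mask_pad = [0] * len(pad)  # padded positions are masked out
        if padding_side == "right":
            input_ids.append(ids + pad)
            attention_mask.append([1] * len(ids) + mask_pad)
        else:
            input_ids.append(pad + ids)
            attention_mask.append(mask_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[35127, 752, 264], [35127]], padding_side="left")
print(out["input_ids"])       # [[35127, 752, 264], [151643, 151643, 35127]]
print(out["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]
```

Left padding is shown because it is the usual choice when generating with decoder-only models, so that the new tokens directly follow the real input.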

**5- Convert Tokens -> IDs**

Maps tokens/subwords to integer ids using the vocabulary.

Example (BERT uncased vocabulary): ["[CLS]", "hello", "world", "[SEP]"] → [101, 7592, 2088, 102]

In [92]:
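import html

from IPython.display import HTML, display

def display_tokens(text):
    """Render each token of `text` as a colored box; hovering shows the source substring."""
    enc = tokenizer(text, return_offsets_mapping=True)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    offsets = enc["offset_mapping"]

    style = "background:#cce5ff; padding:2px; margin:2px; border-radius:3px;"
    spans = []
    for tok, (start, end) in zip(tokens, offsets):
        part = text[start:end]  # original characters covered by this token
        # Escape both strings so tokens such as '<' or '&' do not break the HTML.
        spans.append(
            f"<span title='{html.escape(part, quote=True)}' style='{style}'>{html.escape(tok)}</span>"
        )

    display(HTML("".join(spans)))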

In [93]:
display_tokens(prompt)

In [94]:
display_tokens("Donne-moi une introduction aux grands modèles de langage")

In [96]:
display_tokens("198254.17 + 14,76")

**6- Format Output**

Returns a dictionary like:

```python
{
  'input_ids': [...],
  'attention_mask': [...],
  'token_type_ids': [...],   # for some models (e.g., BERT)
  'offset_mapping': [...]    # fast tokenizers only
}
```

In [90]:
tokenizer(prompt, return_offsets_mapping=True)

{'input_ids': [35127, 752, 264, 2805, 16800, 311, 3460, 4128, 4119], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 4), (4, 7), (7, 9), (9, 15), (15, 28), (28, 31), (31, 37), (37, 46), (46, 53)]}
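
Each `offset_mapping` pair is a `(start, end)` character span in the original string, so slicing the prompt with these pairs recovers the text behind every token. The snippet below uses only the prompt and the offsets printed above:

```python
prompt = "Give me a short introduction to large language models"
# Offsets copied from the output above.
offsets = [(0, 4), (4, 7), (7, 9), (9, 15), (15, 28), (28, 31), (31, 37), (37, 46), (46, 53)]

pieces = [prompt[start:end] for start, end in offsets]
print(pieces)
# ['Give', ' me', ' a', ' short', ' introduction', ' to', ' large', ' language', ' models']

# The spans tile the prompt with no gaps or overlaps.
print("".join(pieces) == prompt)  # True
```

Here every word maps to exactly one token, and each token after the first carries its leading space.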

## Measure the tokenizer performance in 4 languages