# Hugging Face model exploration

In [None]:
!uv add transformers accelerate datasets

## Choose a model you want to explore

In [1]:
model_name = "Qwen/Qwen3-0.6B"

## Load the model and generate one answer

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language models"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# Decode and display the output
output = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(f"MODEL: {model_name}\n")
print("PROMPT TEMPLATE:")
print(text)
print("ANSWER:")
print(output)

MODEL: Qwen/Qwen3-0.6B

PROMPT TEMPLATE:
<|im_start|>user
Give me a short introduction to large language models<|im_end|>
<|im_start|>assistant

ANSWER:
<think>
Okay, the user wants a short introduction to large language models. Let me start by recalling what I know about them. Large language models are AI systems designed to understand and generate human language. They're used in various applications like chatbots, translation, and content generation.

I should mention their core capabilities. Maybe talk about understanding context, generating text, and adapting to different languages. Also, highlight their applications. But wait, I need to keep it concise. The user might be looking for a brief overview, so I need to make sure not to get too technical. 

Wait, should I include something about their training data and how they process information? That could be useful. But the user asked for a short introduction, so maybe just the main points. Let me check if there are any key features 

## Explore the tokenizer

A tokenizer converts raw text into smaller units called tokens, which a language model can process. LLMs cannot directly understand characters or words as humans do, so the tokenizer maps text to numeric IDs that the model was trained on. 

Instead of using full words only, modern tokenizers often use subwords, which are pieces of words (like un, break, ##able) that help handle rare, new, or misspelled terms more efficiently. This allows the model to understand and generate language flexibly, without needing every possible word in its vocabulary. 

In short, the tokenizer forms the bridge between human text and the model’s internal numeric representation.
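
As a rough illustration of the idea, the sketch below splits a word into subword pieces with a greedy longest-match rule and maps each piece to an ID. The vocabulary `toy_vocab` and the function `toy_tokenize` are made up for this example. Real tokenizers learn their vocabulary from data and usually rely on a different segmentation algorithm.

```python
# Hypothetical toy vocabulary: subword pieces and single letters mapped to integer IDs
toy_vocab = {"un": 0, "break": 1, "able": 2, "b": 3, "r": 4, "e": 5,
             "a": 6, "k": 7, "l": 8, "u": 9, "n": 10}

def toy_tokenize(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shorter and shorter ones
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[start:]!r} with this vocabulary")
    return pieces

pieces = toy_tokenize("unbreakable", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces, ids)  # ['un', 'break', 'able'] [0, 1, 2]
```

A word that is not covered by a long piece still gets tokenized, only less efficiently: `toy_tokenize("bake", toy_vocab)` falls back to single letters.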

### Main characteristics

**1. Tokenizer type**

Hugging Face tokenizers can have two implementations: slow (Python) and fast (Rust-backed). Both follow the same conceptual processing pipeline, but fast tokenizers are significantly more efficient and provide richer features (like offset mapping).

| Feature                    | Slow Tokenizer (Python) | Fast Tokenizer (Rust)            |
| -------------------------- | ----------------------- | -------------------------------- |
| Language backend           | Pure Python             | Rust (`tokenizers` library)      |
| Speed                      | Slower                  | Much faster, especially on batches |
| Offset mapping             | ❌ not supported         | ✅ available                      |
| Consistency with HF models | Good                    | Best / canonical implementations |
| Best for custom logic      | Easier to modify        | More restrictive                 |
| Unicode + splitting        | Python-based            | Optimized Rust implementation    |

In [60]:
tokenizer.name_or_path

'Qwen/Qwen3-0.6B'

In [53]:
type(tokenizer)

transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast

In [56]:
tokenizer.is_fast

True

In [65]:
tokenizer.slow_tokenizer_class

transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer

**2. Tokenization Strategy / Model**

Defines how text gets broken into tokens. Common strategies include:
- BPE (Byte Pair Encoding) — merges frequent byte or character pairs (e.g., GPT-2, RoBERTa)
- WordPiece — uses subwords prefixed with ## for continuation (e.g., BERT)
- Unigram / SentencePiece — probabilistic subword model (e.g., T5, ALBERT)
- Character/Byte-level — tokens represent characters or raw bytes
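
To make the BPE strategy more concrete, here is a minimal sketch of how merge rules can be learned from a tiny corpus. It works on characters rather than bytes and ignores many details of production trainers. The function name `learn_bpe_merges` and the sample words are made up for this illustration.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules with a toy character-level BPE."""
    # Each distinct word starts as a tuple of single characters, with its frequency
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair by one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["the", "the", "the", "then", "this"], num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

Each learned merge becomes a new vocabulary entry, which is why the most frequent character sequences end up as single tokens.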

In [78]:
repr(tokenizer.backend_tokenizer.model)[:150] + " ..."

'BPE(dropout=None, unk_token=None, continuing_subword_prefix="", end_of_word_suffix="", fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab ...'

**3. Vocabulary**

The set of allowed tokens and their numeric IDs:
- Vocabulary size (e.g., 50k tokens)
- Special tokens (e.g., `<pad>`, `<s>`, `</s>`, `<unk>`, `<mask>`)
- Token format (e.g., `##sub` in WordPiece, `Ġword` in byte-level BPE)

In [68]:
tokenizer.vocab_size

151643

In [47]:
for token_name in tokenizer.special_tokens_map.keys():
    if token_name != "additional_special_tokens":
        print(f"{token_name} -> {tokenizer.special_tokens_map[token_name]}")

eos_token -> <|im_end|>
pad_token -> <|endoftext|>


In [70]:
tokenizer.all_special_tokens

['<|im_end|>',
 '<|endoftext|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|vision_pad|>',
 '<|image_pad|>',
 '<|video_pad|>']

In [69]:
tokenizer.get_added_vocab()

{'<|endoftext|>': 151643,
 '<|im_start|>': 151644,
 '<|im_end|>': 151645,
 '<|object_ref_start|>': 151646,
 '<|object_ref_end|>': 151647,
 '<|box_start|>': 151648,
 '<|box_end|>': 151649,
 '<|quad_start|>': 151650,
 '<|quad_end|>': 151651,
 '<|vision_start|>': 151652,
 '<|vision_end|>': 151653,
 '<|vision_pad|>': 151654,
 '<|image_pad|>': 151655,
 '<|video_pad|>': 151656,
 '<tool_call>': 151657,
 '</tool_call>': 151658,
 '<|fim_prefix|>': 151659,
 '<|fim_middle|>': 151660,
 '<|fim_suffix|>': 151661,
 '<|fim_pad|>': 151662,
 '<|repo_name|>': 151663,
 '<|file_sep|>': 151664,
 '<tool_response>': 151665,
 '</tool_response>': 151666,
 '<think>': 151667,
 '</think>': 151668}

In [86]:
",".join(tokenizer.convert_ids_to_tokens([id for id in range(256,512)]))

'ĠĠ,ĠĠĠĠ,in,Ġt,ĠĠĠĠĠĠĠĠ,er,ĠĠĠ,on,Ġa,re,at,st,en,or,Ġth,ĊĊ,Ġc,le,Ġs,it,an,ar,al,Ġthe,;Ċ,Ġp,Ġf,ou,Ġ=,is,ĠĠĠĠĠĠĠ,ing,es,Ġw,ion,ed,ic,Ġb,Ġd,et,Ġm,Ġo,ĉĉ,ro,as,el,ct,nd,Ġin,Ġh,ent,id,Ġn,am,ĠĠĠĠĠĠĠĠĠĠĠ,Ġto,Ġre,--,Ġ{,Ġof,om,);Ċ,im,čĊ,Ġ(,il,//,Ġand,ur,se,Ġl,ex,ĠS,ad,Ġ",ch,ut,if,**,Ġ},em,ol,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,th,)Ċ,Ġ{Ċ,Ġg,ig,iv,,Ċ,ce,od,Ġv,ate,ĠT,ag,ay,Ġ*,ot,us,ĠC,Ġst,ĠI,un,ul,ue,ĠA,ow,Ġ\',ew,Ġ<,ation,(),Ġfor,ab,ort,um,ame,Ġis,pe,tr,ck,âĢ,Ġy,ist,----,.ĊĊ,he,Ġe,lo,ĠM,Ġbe,ers,Ġon,Ġcon,ap,ub,ĠP,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,ass,int,>Ċ,ly,urn,Ġ$,;ĊĊ,av,port,ir,->,nt,ction,end,Ġde,ith,out,turn,our,ĠĠĠĠĠ,lic,res,pt,==,Ġthis,Ġwh,Ġif,ĠD,ver,age,ĠB,ht,ext,=",Ġthat,****,ĠR,Ġit,ess,ĠF,Ġr,os,and,Ġas,ect,ke,rom,Ġ//,con,ĠL,(",qu,lass,Ġwith,iz,de,ĠN,Ġal,op,up,get,Ġ}Ċ,ile,Ġan,ata,ore,ri,Ġpro,;čĊ,ĉĉĉĉ,ter,ain,ĠW,ĠE,Ġcom,Ġreturn,art,ĠH,ack,import,ublic,Ġor,est,ment,ĠG,able,Ġ-,ine,ill,ind,ere,::,ity,Ġ+,Ġtr,elf,ight,(\',orm,ult,str,..,",,Ġyou,ype,pl,Ġnew,Ġj,ĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ,Ġfrom,Ġex,ĠO,ld,Ġ[,oc,:Ċ,Ġse'
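
The odd characters in this listing (`Ġ`, `Ċ`, `ĉ`, `č`) come from the byte-level representation: each of the 256 possible byte values is displayed as one printable character. The sketch below rebuilds such a table, assuming this tokenizer uses the same byte-to-character mapping as GPT-2 (the listing above is consistent with that assumption). The helper names are made up for this example.

```python
def bytes_to_unicode():
    """Build a byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(0x21, 0x7F))     # '!' to '~'
        + list(range(0xA1, 0xAD))   # '¡' to '¬'
        + list(range(0xAE, 0x100))  # '®' to 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    # The other bytes (space, control characters, ...) are shifted above U+0100
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_token_text(text):
    """Show `text` the way it appears inside the vocabulary."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_token_text(token_text):
    """Recover the original text from its vocabulary representation."""
    return bytes(CHAR_TO_BYTE[c] for c in token_text).decode("utf-8")

print(to_token_text(" the"))    # Ġthe
print(to_token_text("\n\n"))    # ĊĊ
print(from_token_text("Ġthe") == " the")  # True
```

With this table, `Ġ` stands for a space, `Ċ` for a line feed, `ĉ` for a tab and `č` for a carriage return, so `Ġthe` is the token for " the" with its leading space.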

**4. Language Model interface**

- Names of the inputs that the Hugging Face model expects
- Maximum sequence length of the language model
- Chat template applied to the messages before tokenization, so that the text matches the format (including special tokens) used to train the instruct model

In [57]:
tokenizer.model_input_names

['input_ids', 'attention_mask']

In [58]:
tokenizer.model_max_length

131072

In [43]:
from IPython.display import Markdown, display
display(Markdown(f"```jinja\n{tokenizer.chat_template}\n```"))

```jinja
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}
```
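
For a conversation that contains only system and user messages and no tools, the template above reduces to a few string concatenations. The function below is a simplified re-implementation written for this notebook, not part of the library: it ignores tool calls, assistant turns and reasoning content.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Render plain system/user messages the way the template above does (simplified)."""
    text = ""
    for message in messages:
        if message["role"] not in ("system", "user"):
            raise ValueError("this sketch only handles system and user messages")
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so that the model continues from here
        text += "<|im_start|>assistant\n"
        if not enable_thinking:
            # An empty thinking block tells the model to answer directly
            text += "<think>\n\n</think>\n\n"
    return text

print(render_chatml([{"role": "user", "content": "Give me a short introduction to large language models"}]))
```

The printed text is the same as the PROMPT TEMPLATE displayed at the top of this notebook. With `enable_thinking=False`, the only difference is the empty `<think>` block appended after the assistant header.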

### Tokenization Pipeline

Regardless of slow or fast implementation, the tokenizer generally performs steps such as:

1. Normalization
2. Pre-tokenization
3. Subword tokenization / Model step
4. Post-processing (special tokens, padding/truncation, etc.)
5. Conversion to IDs
6. Output formatting

**1- Normalization**

Transforms the input string into a standard form.

Common operations:
- Lowercasing (if model is uncased)
- Unicode normalization (NFD/NFC, etc.)
- Stripping accents
- Replacing special characters
- Handling control characters or whitespace cleanup

In [23]:
tokenizer.backend_tokenizer.normalizer

NFC()
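
NFC normalization matters because the same visible character can be encoded in several ways. The short example below uses Python's standard `unicodedata` module (not the tokenizer itself) to show two spellings of "modèles" that look identical but differ in their bytes until they are normalized.

```python
import unicodedata

decomposed = "mode\u0300les"  # 'e' followed by a combining grave accent (U+0300)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 8 7
print(composed == "mod\u00e8les")      # True
print(decomposed.encode("utf-8") == composed.encode("utf-8"))  # False
```

Without this step, the two spellings would produce different byte sequences and therefore different tokens.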

**2- Pre-tokenization**

Splits text into basic token units before subword encoding.

Examples depending on tokenizer:
- Whitespace splitting
- Punctuation splitting ("hello," → ["hello", ","])
- Byte-level (GPT/BPE), where raw bytes are used

In [30]:
tokenizer.backend_tokenizer.pre_tokenizer

Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"), behavior=Isolated, invert=False), ByteLevel(add_prefix_space=False, trim_offsets=False, use_regex=False)])
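
The regular expression above can be tried outside the tokenizer. Python's built-in `re` module does not support `\p{L}` and `\p{N}`, so the sketch below replaces them with close approximations (`[^\W\d_]` for letters, `\d` for digits). It is therefore not guaranteed to match the real pre-tokenizer on every input.

```python
import re

LETTER = r"[^\W\d_]"  # approximation of \p{L}
DIGIT = r"\d"         # approximation of \p{N}

PRETOKENIZE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # English contractions
    rf"|(?:[^\r\n\w]|_)?{LETTER}+"    # a word, with at most one leading non-letter
    rf"|{DIGIT}"                      # a single digit
    r"| ?(?:[^\s\w]|_)+[\r\n]*"       # punctuation, with an optional leading space
    r"|\s*[\r\n]+"                    # line breaks
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any other whitespace
)

def pre_tokenize(text):
    """Split `text` into the chunks that the subword model will process separately."""
    return PRETOKENIZE.findall(text)

print(pre_tokenize("Hello, world 123!"))
# ['Hello', ',', ' world', ' ', '1', '2', '3', '!']
print(pre_tokenize("198254.17 + 14,76"))
```

Two details stand out: the leading space stays attached to the following word, and digits are always isolated one by one, so a number is never merged into a single multi-digit token.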

**3- Subword Tokenization (Model step)**

Applies the vocabulary and subword rules (depending on model type).

Examples:
- BPE (GPT-2, RoBERTa): Merges frequent byte-pairs into subwords
- WordPiece (BERT): Uses "##" to join subwords
- SentencePiece/Unigram (T5, ALBERT): Probabilistic, language-agnostic model

In [78]:
repr(tokenizer.backend_tokenizer.model)[:150] + " ..."

'BPE(dropout=None, unk_token=None, continuing_subword_prefix="", end_of_word_suffix="", fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab ...'

**4- Post-processing**

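Adds model-specific special tokens around the tokenized sequence. For Qwen3, the post-processor shown below adds no special tokens: they are written into the text by the chat template instead.

Examples:
- BERT: `[CLS]` tokens `[SEP]`
- GPT-2: no explicit BOS/EOS by default
- T5: tokens `</s>` (an end-of-sequence token is appended)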

Other applied options:
- Truncation (max length cutting)
- Padding (pad to max or dynamic length)

In [22]:
tokenizer.backend_tokenizer.post_processor

ByteLevel(add_prefix_space=False, trim_offsets=False, use_regex=False)

**5- Convert Tokens -> IDs**

Maps tokens/subwords to integer ids using the vocabulary.

Example: ["[CLS]", "hello", "world", "[SEP]"] → [101, 7592, 2088, 102]

In [92]:
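import html
from IPython.display import HTML, display

def display_tokens(text):
    """Render each token of `text` as a colored box (hover to see the source characters)."""
    enc = tokenizer(text, return_offsets_mapping=True)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    offsets = enc["offset_mapping"]

    spans = []
    for tok, (start, end) in zip(tokens, offsets):
        part = text[start:end]  # characters of the original text covered by this token
        spans.append(
            f"<span title='{html.escape(part, quote=True)}' "
            "style='background:#cce5ff; padding:2px; margin:2px; border-radius:3px;'>"
            f"{html.escape(tok)}</span>"  # escape tokens such as '<' so they render as text
        )

    display(HTML("".join(spans)))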

In [93]:
display_tokens(prompt)

In [94]:
display_tokens("Donne-moi une introduction aux grands modèles de langage")

In [96]:
display_tokens("198254.17 + 14,76")

**6- Format Output**

Returns a dictionary like:

```python
{
  'input_ids': [...],
  'attention_mask': [...],
  'token_type_ids': [...],   # for some models (e.g., BERT)
  'offset_mapping': [...]    # fast tokenizers only
}
```

In [90]:
tokenizer(prompt, return_offsets_mapping=True)

{'input_ids': [35127, 752, 264, 2805, 16800, 311, 3460, 4128, 4119], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 4), (4, 7), (7, 9), (9, 15), (15, 28), (28, 31), (31, 37), (37, 46), (46, 53)]}
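
Each pair in `offset_mapping` is a `(start, end)` character range in the original string. The snippet below reuses the values printed above, copied by hand so that it runs without the tokenizer, to show which piece of the prompt each token ID covers.

```python
prompt = "Give me a short introduction to large language models"
input_ids = [35127, 752, 264, 2805, 16800, 311, 3460, 4128, 4119]
offsets = [(0, 4), (4, 7), (7, 9), (9, 15), (15, 28), (28, 31), (31, 37), (37, 46), (46, 53)]

# Slice the prompt with each (start, end) pair
pieces = [prompt[start:end] for start, end in offsets]
for token_id, piece in zip(input_ids, pieces):
    print(f"{token_id:>6} -> {piece!r}")

# The pieces cover the prompt exactly, with no gap and no overlap
print("".join(pieces) == prompt)  # True
```

Here every token after the first one starts with the space that precedes the word, and the whole prompt needs only one token per word.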

## Tokenizer performance in 4 languages

Load real-world datasets containing common business and finance vocabulary, scraped from the web in 2023-2024, in 4 languages: English, French, German and Spanish.

https://huggingface.co/datasets/frenchtext/bank-en-2401

https://huggingface.co/datasets/frenchtext/banque-fr-2311

https://huggingface.co/datasets/frenchtext/bank-de-2401

https://huggingface.co/datasets/frenchtext/bank-es-2401

In [102]:
from datasets import load_dataset

dataset_en = load_dataset("frenchtext/bank-en-2401",  split="train+valid+test")

Resolving data files:   0%|          | 0/43 [00:00<?, ?it/s]

In [103]:
dataset_fr = load_dataset("frenchtext/banque-fr-2311",  split="train+valid+test")

Resolving data files:   0%|          | 0/42 [00:00<?, ?it/s]

In [104]:
dataset_de = load_dataset("frenchtext/bank-de-2401",  split="train+valid+test")

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

In [107]:
dataset_es = load_dataset("frenchtext/bank-es-2401",  split="train+valid+test")

Resolving data files:   0%|          | 0/34 [00:00<?, ?it/s]

In [112]:
dataset_en.features

{'Uri': Value('string'),
 'ExtractedFromPDF': Value('bool'),
 'Timestamp': Value('string'),
 'Lang': Value('string'),
 'Title': Value('string'),
 'Text': Value('string'),
 'Words': Value('int32'),
 'AvgWordsLength': Value('int32'),
 'Chars': Value('int32'),
 'LetterChars': Value('int32'),
 'NumberChars': Value('int32'),
 'OtherChars': Value('int32')}

In [149]:
import time

def tokenize(dataitem):
    return tokenizer(dataitem["Text"])

def dataset_stats(dataset):    
    print(f"Analyzing dataset {dataset.info.dataset_name}")
    documents_count = len(dataset)
    print(f"- number of documents: {documents_count}")
    print()
    print("... tokenizing all documents ...")
    # Clear cached results so that map() really re-tokenizes and the timing below is meaningful
    dataset.cleanup_cache_files()
    start = time.time()
    dataset = dataset.map(tokenize, batched=True)
    end = time.time()
    tokenize_duration = end-start
    print(f"--> done in {tokenize_duration:.3f} sec")
    print()
    print("... computing all statistics ...")
    start = time.time()
    dataset.total_tokens_count = 0
    dataset.avg_tokens_count = 0
    dataset.min_tokens_count = 2**31
    dataset.max_tokens_count = 0
    dataset.total_words_count = 0
    dataset.avg_words_count = 0
    dataset.min_words_count = 2**31
    dataset.max_words_count = 0
    dataset.total_chars_count = 0
    dataset.avg_chars_count = 0
    dataset.min_chars_count = 2**31
    dataset.max_chars_count = 0
    for dataitem in dataset:
        tokens = len(dataitem['input_ids'])
        words = dataitem['Words']
        chars = dataitem['Chars'] 
        dataset.total_tokens_count = dataset.total_tokens_count + tokens
        if tokens < dataset.min_tokens_count:
            dataset.min_tokens_count = tokens
        if tokens > dataset.max_tokens_count:
            dataset.max_tokens_count = tokens
        dataset.total_words_count = dataset.total_words_count + words
        if words < dataset.min_words_count:
            dataset.min_words_count = words
        if words > dataset.max_words_count:
            dataset.max_words_count = words
        dataset.total_chars_count = dataset.total_chars_count + chars
        if chars < dataset.min_chars_count:
            dataset.min_chars_count = chars
        if chars > dataset.max_chars_count:
            dataset.max_chars_count = chars
    dataset.avg_tokens_count = dataset.total_tokens_count / documents_count                                                            
    dataset.avg_words_count = dataset.total_words_count / documents_count
    dataset.avg_chars_count = dataset.total_chars_count / documents_count
    end = time.time()
    stats_duration = end-start
    print(f"--> done in {stats_duration:.3f} sec")
    print()
    print(f"- tokens: total={dataset.total_tokens_count:_}, avg={dataset.avg_tokens_count:_.2f}, min={dataset.min_tokens_count}, max={dataset.max_tokens_count:_}".replace("_"," "))
    print(f"- words: total={dataset.total_words_count:_}, avg={dataset.avg_words_count:_.2f}, min={dataset.min_words_count}, max={dataset.max_words_count:_}".replace("_"," "))
    print(f"- chars: total={dataset.total_chars_count:_}, avg={dataset.avg_chars_count:_.2f}, min={dataset.min_chars_count}, max={dataset.max_chars_count:_}".replace("_"," "))
    print()
    print(f"- chars per word: {dataset.total_chars_count/dataset.total_words_count:.2f}")
    print(f"- chars per token: {dataset.total_chars_count/dataset.total_tokens_count:.2f}")
    print()
    print(f"- tokens per word: {dataset.total_tokens_count/dataset.total_words_count:.2f}")
    print(f"- tokens per second: {dataset.total_tokens_count/tokenize_duration:_.2f}".replace("_"," "))
    return dataset

In [150]:
dataset_en = dataset_stats(dataset_en)

Analyzing dataset bank-en-2401
- number of documents: 25585

... tokenizing all documents ...


--> done in 35.439 sec

... computing all statistics ...
--> done in 26.766 sec

- tokens: total=89 411 161, avg=3 494.67, min=0, max=944 733
- words: total=57 605 232, avg=2 251.52, min=0, max=609 542
- chars: total=367 496 436, avg=14 363.75, min=0, max=3 866 079

- chars per word: 6.38
- chars per token: 4.11

- tokens per word: 1.55
- tokens per second: 2 522 947.96


In [151]:
dataset_fr = dataset_stats(dataset_fr)

Analyzing dataset banque-fr-2311
- number of documents: 85229

... tokenizing all documents ...


--> done in 41.897 sec

... computing all statistics ...
--> done in 40.069 sec

- tokens: total=126 572 248, avg=1 485.08, min=1, max=565 555
- words: total=67 061 556, avg=786.84, min=0, max=271 697
- chars: total=427 224 314, avg=5 012.66, min=2, max=1 724 199

- chars per word: 6.37
- chars per token: 3.38

- tokens per word: 1.89
- tokens per second: 3 021 066.65


In [152]:
dataset_de = dataset_stats(dataset_de)

Analyzing dataset bank-de-2401
- number of documents: 29745

... tokenizing all documents ...


--> done in 26.994 sec

... computing all statistics ...
--> done in 25.957 sec

- tokens: total=87 282 651, avg=2 934.36, min=0, max=434 883
- words: total=37 262 822, avg=1 252.74, min=0, max=158 911
- chars: total=282 155 923, avg=9 485.83, min=0, max=1 281 290

- chars per word: 7.57
- chars per token: 3.23

- tokens per word: 2.34
- tokens per second: 3 233 463.64


In [153]:
dataset_es = dataset_stats(dataset_es)

Analyzing dataset bank-es-2401
- number of documents: 25455

... tokenizing all documents ...


--> done in 17.253 sec

... computing all statistics ...
--> done in 16.248 sec

- tokens: total=54 677 880, avg=2 148.02, min=1, max=231 497
- words: total=29 180 953, avg=1 146.37, min=0, max=118 731
- chars: total=185 519 177, avg=7 288.12, min=2, max=776 947

- chars per word: 6.36
- chars per token: 3.39

- tokens per word: 1.87
- tokens per second: 3 169 201.46


In [155]:
print(f"Language compression comparison for the tokenizer: {model_name}")
print(f"- french/english: {((dataset_fr.total_tokens_count/dataset_fr.total_words_count)/(dataset_en.total_tokens_count/dataset_en.total_words_count)-1)*100:.1f} % additional tokens")
print(f"- german/english: {((dataset_de.total_tokens_count/dataset_de.total_words_count)/(dataset_en.total_tokens_count/dataset_en.total_words_count)-1)*100:.1f} % additional tokens")
print(f"- spanish/english: {((dataset_es.total_tokens_count/dataset_es.total_words_count)/(dataset_en.total_tokens_count/dataset_en.total_words_count)-1)*100:.1f} % additional tokens")

Language compression comparison for the tokenizer: Qwen/Qwen3-0.6B
- french/english: 21.6 % additional tokens
- german/english: 50.9 % additional tokens
- spanish/english: 20.7 % additional tokens
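
These percentages can be reproduced from the totals printed for each dataset. The snippet below copies those totals by hand and recomputes the comparison, first per word as above and then per character. The per-character variant is an additional measure suggested here, not one computed elsewhere in the notebook.

```python
# Totals copied from the dataset statistics printed above
totals = {
    "english": {"tokens": 89_411_161, "words": 57_605_232, "chars": 367_496_436},
    "french": {"tokens": 126_572_248, "words": 67_061_556, "chars": 427_224_314},
    "german": {"tokens": 87_282_651, "words": 37_262_822, "chars": 282_155_923},
    "spanish": {"tokens": 54_677_880, "words": 29_180_953, "chars": 185_519_177},
}

def extra_tokens_percent(language, unit, reference="english"):
    """Percentage of additional tokens per `unit` ('words' or 'chars') versus the reference."""
    ratio = totals[language]["tokens"] / totals[language][unit]
    reference_ratio = totals[reference]["tokens"] / totals[reference][unit]
    return (ratio / reference_ratio - 1) * 100

for language in ("french", "german", "spanish"):
    per_word = extra_tokens_percent(language, "words")
    per_char = extra_tokens_percent(language, "chars")
    print(f"- {language}/english: {per_word:.1f} % per word, {per_char:.1f} % per character")
```

German words are longer on average in these datasets (7.57 characters per word against 6.38 for English), so part of the per-word overhead reflects word length rather than tokenizer efficiency. The per-character figure corrects for that effect.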


## Explore the model

### Main characteristics

In [199]:
model.name_or_path, model.config.architectures

('Qwen/Qwen3-0.6B', ['Qwen3ForCausalLM'])

In [206]:
f"{model.num_parameters():_} parameters".replace("_"," "), model.dtype

('596 049 920 parameters', torch.bfloat16)

In [198]:
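from huggingface_hub import model_info
info = model_info(model_name)
# info.safetensors.total is a number of parameters, not a number of bytes: multiply it by the
# bytes per parameter (assuming the checkpoint is stored in the same dtype as model.dtype)
bytes_per_param = model.dtype.itemsize
f"Size in memory: {model.get_memory_footprint()/1024**3:.3f} GB", f"Size on disk: {info.safetensors.total*bytes_per_param/1024**3:.3f} GB"

('Size in memory: 1.110 GB', 'Size on disk: 1.400 GB')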

In [205]:
model.vocab_size, tokenizer.model_max_length, model.config.tie_word_embeddings

(151936, 131072, True)
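
The model reports 151936 vocabulary entries while `tokenizer.vocab_size` returned 151643. The arithmetic below, based only on values printed earlier in this notebook, accounts for the difference. The interpretation of the leftover rows as padding is an assumption, not something these outputs confirm.

```python
base_vocab_size = 151_643                  # tokenizer.vocab_size (added tokens excluded)
added_token_ids = range(151_643, 151_669)  # IDs listed by get_added_vocab(): 151643..151668
embedding_rows = 151_936                   # model.vocab_size

tokenizer_length = base_vocab_size + len(added_token_ids)
unused_rows = embedding_rows - tokenizer_length

print(len(added_token_ids))  # 26
print(tokenizer_length)      # 151669
print(unused_rows)           # 267
print(embedding_rows % 128)  # 0
```

The embedding matrix therefore has 267 rows that no token ID produced by the tokenizer points to. Since 151936 is a multiple of 128, one plausible explanation is that the matrix was padded to a hardware-friendly size.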

In [160]:
model.base_model

Qwen3Model(
  (embed_tokens): Embedding(151936, 1024)
  (layers): ModuleList(
    (0-27): 28 x Qwen3DecoderLayer(
      (self_attn): Qwen3Attention(
        (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
        (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
        (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
        (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
        (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
        (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
      )
      (mlp): Qwen3MLP(
        (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
        (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
        (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
        (act_fn): SiLUActivation()
      )
      (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
      (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
    )
  )
  (norm): Qwen3RM

In [172]:
model.config.num_hidden_layers, set(model.config.layer_types), model.config.sliding_window

(28, {'full_attention'}, None)

In [208]:
model.config.hidden_size, model.config.intermediate_size, model.config.hidden_act

(1024, 3072, 'silu')
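
The parameter count reported earlier can be recomputed by hand from the layer shapes printed by `model.base_model`. This is a back-of-the-envelope sketch. It assumes that the final `norm` layer, whose description is cut off in the printout, holds one weight per hidden dimension like the other RMSNorm layers, and that the output head reuses the embedding matrix (`tie_word_embeddings` is `True`).

```python
hidden, intermediate, vocab, num_layers = 1024, 3072, 151_936, 28
q_out, kv_out, head_dim = 2048, 1024, 128

embedding = vocab * hidden
attention = (
    hidden * q_out        # q_proj
    + 2 * hidden * kv_out # k_proj and v_proj
    + q_out * hidden      # o_proj
    + 2 * head_dim        # q_norm and k_norm
)
mlp = 3 * hidden * intermediate  # gate_proj, up_proj, down_proj
layer_norms = 2 * hidden         # input_layernorm and post_attention_layernorm
per_layer = attention + mlp + layer_norms
final_norm = hidden              # assumption: one weight per hidden dimension

total = embedding + num_layers * per_layer + final_norm
print(f"{total:_}".replace("_", " "))            # 596 049 920
print(f"{total * 2 / 1024**3:.3f} GB in bfloat16")  # 1.110 GB in bfloat16
print(f"{embedding / total:.1%} of the parameters are in the embedding matrix")
```

The result matches `model.num_parameters()` and the memory footprint printed above, and it shows that the embedding matrix alone holds roughly a quarter of the parameters of this small model.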

In [175]:
model.config.num_key_value_heads, model.config.num_attention_heads, model.config.head_dim

(8, 16, 128)
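
With 16 query heads but only 8 key/value heads, two query heads share each key/value head (grouped-query attention). The estimate below shows what this means for the key/value cache. It assumes a standard cache that stores one key and one value vector per layer, per key/value head and per token in bfloat16, for a single sequence.

```python
num_layers, num_q_heads, num_kv_heads, head_dim = 28, 16, 8, 128
bytes_per_value = 2  # bfloat16

# Output sizes of the projections, consistent with the q_proj / k_proj shapes printed above
print(num_q_heads * head_dim, num_kv_heads * head_dim)  # 2048 1024

# Keys + values, for every layer and every key/value head
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
# Same estimate if every query head had its own key/value head
mha_bytes_per_token = 2 * num_layers * num_q_heads * head_dim * bytes_per_value

print(kv_bytes_per_token)                        # 114688
print(mha_bytes_per_token / kv_bytes_per_token)  # 2.0
for context_length in (32_768, 131_072):
    print(context_length, kv_bytes_per_token * context_length / 1024**3, "GiB")
```

Under these assumptions, a 32768-token context needs about 3.5 GiB of cache and the full 131072-token context about 14 GiB, which is far more than the weights of the model itself.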

### Model architecture

You don't need to understand the details of the code below: it is a utility function that displays the details of a PyTorch model architecture.

In [156]:
from torch.nn import ModuleList
import inspect

memory_unit_mb = 1024*1024

def display_modules(module, name_prefix=None, depth=0, max_depth=99, forward_methods=None):
    if forward_methods is None:
        forward_methods = {}
    header = module.__class__.__name__
    if name_prefix is not None:
        header = f"{name_prefix}#{header}" 
    depth_prefix = "  "*depth
    print(depth_prefix+"---------------------")
    print(depth_prefix+header)
    if len(list(module.named_parameters(recurse=False))) > 0:
        print(depth_prefix+"> parameters")
        for name,parameter in module.named_parameters(recurse=False):
            print(depth_prefix+f"- {name}: {get_tensor_description(parameter)}")
    if len(list(module.named_buffers(recurse=False))) > 0:
        print(depth_prefix+"> buffers")
        for name,buffer in module.named_buffers(recurse=False):
            print(depth_prefix+f"- {name}: {get_tensor_description(buffer)}")
    if len(list(module.named_children())) > 0:
        print(depth_prefix+"> submodules")
        for name,submodule in module.named_children():
            print(depth_prefix+f"- {name}: {submodule.__class__.__name__}")
    source_code = inspect.getsource(module.forward)
    forward_methods[module.__class__.__name__] = source_code
    if depth < max_depth:
        for name,submodule in module.named_children():
            if isinstance(submodule, ModuleList):
                display_module_list(submodule, name_prefix=name, depth=depth+1, max_depth=max_depth, forward_methods=forward_methods)
            else:
                display_modules(submodule, name_prefix=name, depth=depth+1, max_depth=max_depth, forward_methods=forward_methods)
    if depth==0:
        print()
        print()
        for module_type,source_code in forward_methods.items():
            print("---------------------")
            print(f"{module_type}.forward()")
            print("---------------------")
            print(source_code)
            
def display_module_list(module_list, name_prefix=None, depth=0, max_depth=1, forward_methods=None):
    # ------------------------------
    # Detect repeated layers in ModuleList: code inspired by PyTorch's ModuleList.__repr__
    list_of_reprs = [repr(item) for item in module_list]
    if len(list_of_reprs) == 0:
        return

    start_end_indices = [[0, 0]]
    repeated_blocks = [list_of_reprs[0]]
    for i, r in enumerate(list_of_reprs[1:], 1):
        if r == repeated_blocks[-1]:
            start_end_indices[-1][1] += 1
            continue

        start_end_indices.append([i, i])
        repeated_blocks.append(r)
    # -------------------------------
    
    depth_prefix = "  "*depth
    print(depth_prefix+"---------------------")
    print(depth_prefix+f"{name_prefix}#ModuleList")
    print(depth_prefix+"> submodules")
    named_submodules = []
    for (start_id, end_id) in start_end_indices:
        submodule = module_list[start_id]
        if start_id != end_id:      
            name = f"{start_id}..{end_id}"
            print(depth_prefix+f"- {name}: {(end_id-start_id+1)}X {submodule.__class__.__name__}")
        else:
            name = str(start_id)
            print(depth_prefix+f"- {name}: {submodule.__class__.__name__}")        
        named_submodules.append((name,submodule))
    if depth < max_depth:
        for name,submodule in named_submodules:
            if isinstance(submodule, ModuleList):
                display_module_list(submodule, name_prefix=name, depth=depth+1, max_depth=max_depth, forward_methods=forward_methods)
            else:
                display_modules(submodule, name_prefix=name, depth=depth+1, max_depth=max_depth, forward_methods=forward_methods)

def get_tensor_description(t):
    dtype = str(t.dtype)[6:]
    dimensions = str(t.size())[11:-1]
    total_byte_size = t.numel() * t.element_size()
    return f"{dtype} {dimensions} ({(total_byte_size/memory_unit_mb):.1f} MB)"

In [157]:
print(f"{model_name} architecture description")
display_modules(model)

Qwen/Qwen3-0.6B architecture description
---------------------
Qwen3ForCausalLM
> submodules
- model: Qwen3Model
- lm_head: Linear
  ---------------------
  model#Qwen3Model
  > submodules
  - embed_tokens: Embedding
  - layers: ModuleList
  - norm: Qwen3RMSNorm
  - rotary_emb: Qwen3RotaryEmbedding
    ---------------------
    embed_tokens#Embedding
    > parameters
    - weight: bfloat16 [151936, 1024] (296.8 MB)
    ---------------------
    layers#ModuleList
    > submodules
    - 0..27: 28X Qwen3DecoderLayer
      ---------------------
      0..27#Qwen3DecoderLayer
      > submodules
      - self_attn: Qwen3Attention
      - mlp: Qwen3MLP
      - input_layernorm: Qwen3RMSNorm
      - post_attention_layernorm: Qwen3RMSNorm
        ---------------------
        self_attn#Qwen3Attention
        > submodules
        - q_proj: Linear
        - k_proj: Linear
        - v_proj: Linear
        - o_proj: Linear
        - q_norm: Qwen3RMSNorm
        - k_norm: Qwen3RMSNorm
          -------

### Execute the model step by step

In [34]:
def show_tokens_probs(prompt, max_new_tokens=50):
    # Remove hooks left on the model by a previous call that raised an exception.
    # The handles are stored on the function object because a local variable does not survive between calls.
    for hook_handle in getattr(show_tokens_probs, "hook_handles", []):
        hook_handle.remove()
    hook_handles = show_tokens_probs.hook_handles = []
    
    # Functions to decode input and output tokens
    def get_input_tokens(inpTensor):
        return "[" + "] [".join([tokenizer.decode(element.item()).replace('\n',"\\n") for element in inpTensor]) + "]"
    
    def get_output_tokens(outTensor):
        preds = torch.softmax(outTensor.float(), dim=1)
        next_token_ids = torch.topk(preds[-1,:], k=5)
        return [(tokenizer.decode(token_id), f"{preds[-1,token_id].item():.3f}") for token_id in next_token_ids.indices]
    
    # Display input and output tokens
    def print_embed_in(module, input, output):
        inpTensor = input[0].squeeze(dim=0)
        print(f">> Input : {get_input_tokens(inpTensor)}")
        
    def print_embed_out(module, input, output):
        outTensor = output[0]
        print(f">> Output:  {get_output_tokens(outTensor)}")
    
    # Register hooks to the input embedding and output lm head modules
    hook_handles.append(model.get_input_embeddings().register_forward_hook(print_embed_in))
    hook_handles.append(model.get_output_embeddings().register_forward_hook(print_embed_out))
    
    # Prepare the model input
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False # Switches between thinking and non-thinking modes. Default is True.
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Generate an answer with the hooks in place, and always remove the hooks afterwards
    try:
        model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens
        )
    finally:
        for hook_handle in hook_handles:
            hook_handle.remove()
        hook_handles.clear()
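
The probabilities printed by `get_output_tokens` come from two steps: a softmax turns the logits of the last position into a probability distribution over the vocabulary, then the 5 most probable entries are kept. The sketch below reproduces those two steps in plain Python on a made-up four-token vocabulary. The logits and the `top_k_probs` helper are illustrative assumptions, not values taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(vocab, logits, k=5):
    """Return the k most probable (token, probability) pairs, most probable first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and logits (hypothetical values)
vocab = [" appears", " is", " looks", " seems"]
logits = [5.0, 3.0, -2.0, -3.0]

top3 = top_k_probs(vocab, logits, k=3)
for token, prob in top3:
    print(f"{token!r}: {prob:.3f}")
```

Because the softmax is exponential, a gap of a few units between logits is enough for one token to take almost all of the probability mass, which is why many of the candidates in the outputs below show `0.000`.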

In [36]:
show_tokens_probs("Why is the sky blue ?", max_new_tokens=25)

>> Input : [<|im_start|>] [user] [\n] [Why] [ is] [ the] [ sky] [ blue] [ ?] [<|im_end|>] [\n] [<|im_start|>] [assistant] [\n] [<think>] [\n\n] [</think>] [\n\n]
>> Output:  [('The', '1.000'), ('It', '0.000'), ('There', '0.000'), ('Actually', '0.000'), ('Well', '0.000')]
>> Input : [The]
>> Output:  [(' sky', '0.995'), (' color', '0.003'), (' blue', '0.000'), (' **', '0.000'), (' reason', '0.000')]
>> Input : [ sky]
>> Output:  [(' appears', '0.892'), (' is', '0.107'), (' appearing', '0.000'), (' looks', '0.000'), (' seems', '0.000')]
>> Input : [ appears]
>> Output:  [(' blue', '1.000'), (' **', '0.000'), (' to', '0.000'), ('蓝色', '0.000'), (' bl', '0.000')]
>> Input : [ blue]
>> Output:  [(' because', '0.605'), (' due', '0.324'), (' primarily', '0.044'), (' in', '0.011'), (' for', '0.004')]
>> Input : [ because]
>> Output:  [(' of', '0.572'), (' it', '0.186'), (' **', '0.164'), (' the', '0.047'), (' sunlight', '0.010')]
>> Input : [ of]
>> Output:  [(' the', '0.865'), (' **', '0.117')

In [37]:
show_tokens_probs("Tell me a funny story", max_new_tokens=25)

>> Input : [<|im_start|>] [user] [\n] [Tell] [ me] [ a] [ funny] [ story] [<|im_end|>] [\n] [<|im_start|>] [assistant] [\n] [<think>] [\n\n] [</think>] [\n\n]
>> Output:  [('Sure', '0.886'), ('Ah', '0.039'), ('Certainly', '0.021'), ('Here', '0.011'), ('Oh', '0.009')]
>> Input : [Sure]
>> Output:  [('!', '0.940'), (',', '0.060'), (' here', '0.000'), ('...', '0.000'), ('!*', '0.000')]
>> Input : [!]
>> Output:  [(' Here', '0.985'), (' Let', '0.005'), (' I', '0.005'), (' here', '0.002'), (' �', '0.001')]
>> Input : [ Here]
>> Output:  [("'s", '0.893'), ('’s', '0.107'), (' is', '0.000'), (' it', '0.000'), (' a', '0.000')]
>> Input : ['s]
>> Output:  [(' a', '1.000'), (' an', '0.000'), (' my', '0.000'), (' one', '0.000'), (' **', '0.000')]
>> Input : [ a]
>> Output:  [(' funny', '0.818'), (' **', '0.098'), (' fun', '0.025'), (' hilarious', '0.013'), (' light', '0.012')]
>> Input : [ funny]
>> Output:  [(' story', '0.849'), (' and', '0.115'), (' one', '0.029'), (' tale', '0.002'), (' short',

### Profile the GPU compute and memory

WARNING: for the benchmark below to be meaningful, restart the notebook kernel, re-execute only the first cell to define `model_name`, then execute the cells below.

In [2]:
import os
import psutil
import torch
from transformers.utils.hub import cached_file

def get_model_path_and_size_on_disk(model):    
    model_config_file = cached_file(model.name_or_path, "config.json", local_files_only=True)
    model_directory = os.path.dirname(model_config_file)
    
    total_size = 0
    for entry in os.listdir(model_directory):
        full_entry_path = os.path.join(model_directory, entry)
        if os.path.isfile(full_entry_path):
            total_size += os.path.getsize(full_entry_path)
    return model_directory,total_size

def get_used_cpu_memory():
    process = psutil.Process(os.getpid())
    process_memory = process.memory_info().rss
    return process_memory

def get_used_and_max_gpu_memory():
    used_memory = torch.cuda.memory_allocated(0)    
    max_used_memory = torch.cuda.max_memory_allocated(0)
    return used_memory,max_used_memory

def reset_max_gpu_memory():
    torch.cuda.reset_peak_memory_stats()

In [3]:
from time import perf_counter_ns
from transformers import AutoModelForCausalLM, AutoTokenizer

memory_unit_mb = 1024*1024
memory_unit_gb = 1024*1024*1024

time_unit_µs = 1000
time_unit_ms = 1000*1000
time_unit_s = 1000*1000*1000

class ModelBenchmark:   
    
    def __init__(self, pretrained_model_id):
        self.pretrained_model_id = pretrained_model_id
        self.tokenizer = None 
        self.model = None
        
        self.model_path = None
        self.model_size_on_disk = 0
        self.tokenizer_load_time_ns = 0
        self.tokenizer_cpu_memory = 0
        self.model_load_time_ns = 0
        self.model_cpu_memory = 0
        self.model_gpu_memory = 0
        self.model_load_max_gpu_memory = 0
        
    def trace_load_from_cache(self, **kwargs):
        cpu_memory_before = get_used_cpu_memory()
        gpu_memory_before = get_used_and_max_gpu_memory()[0]
        reset_max_gpu_memory()        
        time_before = perf_counter_ns()
        self.tokenizer = AutoTokenizer.from_pretrained(self.pretrained_model_id, **kwargs)
        cpu_memory_tokenizer = get_used_cpu_memory()
        time_tokenizer = perf_counter_ns()
        self.model = AutoModelForCausalLM.from_pretrained(self.pretrained_model_id, **kwargs)
        cpu_memory_model = get_used_cpu_memory()
        gpu_memory_model,max_gpu_memory_model = get_used_and_max_gpu_memory()     
        time_model = perf_counter_ns()
        
        self.model_path,self.model_size_on_disk = get_model_path_and_size_on_disk(self.model)
        self.tokenizer_load_time_ns = time_tokenizer-time_before
        self.tokenizer_cpu_memory = cpu_memory_tokenizer-cpu_memory_before
        self.model_load_time_ns = time_model-time_tokenizer
        self.model_cpu_memory = cpu_memory_model-cpu_memory_tokenizer
        self.model_gpu_memory = gpu_memory_model-gpu_memory_before
        self.model_load_max_gpu_memory = max_gpu_memory_model
        
        print(f"Model files: {(self.model_size_on_disk/memory_unit_gb):.2f} GB on disk")
        print(f"(cache path: {self.model_path})")
        print()
        print(f"Tokenizer load time : {(self.tokenizer_load_time_ns/time_unit_ms):.2f} ms")
        print(f"Tokenizer CPU memory: {(self.tokenizer_cpu_memory/memory_unit_mb):.2f} MB")
        print()
        print(f"Model load time : {(self.model_load_time_ns/time_unit_ms):.2f} ms")
        print(f"Model CPU memory: {(self.model_cpu_memory/memory_unit_gb):.2f} GB")
        print(f"Model GPU memory: {(self.model_gpu_memory/memory_unit_gb):.2f} GB")
        print(f"Max   GPU memory: {(self.model_load_max_gpu_memory/memory_unit_gb):.2f} GB")
        print()

In [4]:
benchmark = ModelBenchmark(model_name)

In [5]:
benchmark.trace_load_from_cache(device_map="auto")

Model files: 1.41 GB on disk
(cache path: /home/models/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca)

Tokenizer load time : 499.47 ms
Tokenizer CPU memory: 112.12 MB

Model load time : 986.08 ms
Model CPU memory: 0.10 GB
Model GPU memory: 2.22 GB
Max   GPU memory: 2.80 GB
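
A rough way to read these numbers: model weights take about (number of parameters) x (bytes per parameter). The sketch below assumes roughly 0.6 billion parameters, a figure rounded from the model name rather than counted exactly, and compares two storage dtypes. Under that assumption, the 2.22 GB measured on the GPU is close to the float32 estimate. That would be consistent with the weights being loaded in float32, since `trace_load_from_cache` is called here without a `dtype` argument, but this is an inference and not something the notebook verifies.

```python
MEMORY_UNIT_GB = 1024 * 1024 * 1024

def estimate_weights_gb(num_params, bytes_per_param):
    """Estimate the memory taken by model weights, in GB (1 GB = 1024**3 bytes)."""
    return num_params * bytes_per_param / MEMORY_UNIT_GB

approx_params = 0.6e9  # assumption: rounded from the "0.6B" in the model name

estimates = {
    dtype_name: estimate_weights_gb(approx_params, nbytes)
    for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]
}
for dtype_name, size_gb in estimates.items():
    print(f"{dtype_name}: about {size_gb:.1f} GB")
```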



In [12]:
import time

prompt = "Write a short novel about a large language model"
inputs = benchmark.tokenizer(prompt, return_tensors="pt").to("cuda")

# Warmup
_ = benchmark.model.generate(**inputs, max_new_tokens=10)

# Benchmark
max_new_tokens = 100
start = time.perf_counter()
output = benchmark.model.generate(**inputs, max_new_tokens=max_new_tokens)
end = time.perf_counter()

# Count only the generated tokens (not including prompt)
generated_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
tokens_per_second = generated_tokens / (end - start)

print(f"Generated tokens: {generated_tokens}")
print(f"Time: {end - start:.4f} sec")
print(f"Tokens/sec: {tokens_per_second:.2f}")

Generated tokens: 100
Time: 2.2130 sec
Tokens/sec: 45.19
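
A single timed run can be noisy. Here is a minimal sketch of a helper that repeats the measurement and keeps the median, assuming a callable that returns the number of generated tokens. The names `measure_tokens_per_second` and `fake_generate` are hypothetical, and the fake generator only sleeps so that the example runs without a model.

```python
import statistics
import time

def measure_tokens_per_second(generate_fn, runs=3):
    """Call generate_fn several times and return the median tokens/sec."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        generated_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append(generated_tokens / elapsed)
    return statistics.median(rates)

def fake_generate():
    """Stand-in for a real generation call: waits 10 ms and reports 100 tokens."""
    time.sleep(0.01)
    return 100

rate = measure_tokens_per_second(fake_generate, runs=3)
print(f"Tokens/sec (median of 3 runs): {rate:.2f}")
```

With the model loaded above, `generate_fn` could wrap `benchmark.model.generate(**inputs, max_new_tokens=max_new_tokens)` and return `output.shape[-1] - inputs["input_ids"].shape[-1]`, as in the cell above.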


In [24]:
import torch
from torch.profiler import ProfilerActivity
from IPython.display import display, Markdown

def add_call_stacks(event):
    filtered_stack = []
    torch_calls = []
    for frame in event.stack:
        if "profile_forward" in frame:
            break
        elif not frame.startswith("<built-in") and not frame.startswith("torch/"):
            function = frame.split(": ")[1]
            if function!="_call_impl":
                filtered_stack.append(function)
        elif frame.startswith("<built-in method"):
            frame_words = frame.split(" ")
            torch_calls.append(frame_words[2])
            torch_calls.append(frame_words[4])
        elif frame.startswith("<built-in function"):
            frame_words = frame.split(" ")
            torch_calls.append(frame_words[2][:-1])
    filtered_stack.reverse()    
    event.call_stack = ".".join(filtered_stack)
    torch_calls.reverse()
    event.torch_stack = ".".join(torch_calls)

def profile_forward(model, coalesce_layers, batch_size=1, seq_length=None, percent_threshold=0.2):
    # Execute one forward pass
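    if seq_length is None:
        seq_length = model.config.max_position_embeddings
    # Sample random token ids from the model's own vocabulary rather than a hardcoded range
    input_ids = torch.randint(low=0, high=model.config.vocab_size, size=(batch_size,seq_length), dtype=torch.int64, device=model.device)
    attention_mask = torch.ones(batch_size, seq_length, dtype=torch.int64, device=model.device)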
    model.eval()
    with torch.profiler.profile(activities=[ProfilerActivity.CPU,ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True, with_flops=True, with_modules=True, experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)) as prof:
        with torch.profiler.record_function("MODEL INFERENCE"):
            with torch.no_grad():
                outputs = model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False, output_attentions=False, output_hidden_states=False)        

    # Analyze profiling events
    events = prof.events()
    coalesced_events = []
    first_layer0_event_index = 0
    layer0_events_count = 0
    layer_index = 0
    event_index_in_layer = 0
    for event in events:
        if event.cpu_parent is not None and event.cpu_parent.id == events[0].id:
            add_call_stacks(event)            
            key = event.call_stack
            start_index = key.find(coalesce_layers)
            if start_index >= 0:
                dot_index = key.find('.', start_index)
                current_layer_index = int(key[start_index+len(coalesce_layers)+1:dot_index])
    
                if first_layer0_event_index == 0:
                    first_layer0_event_index = len(coalesced_events)
                if layer0_events_count == 0 and current_layer_index == 1:
                    layer0_events_count = len(coalesced_events) - first_layer0_event_index
                
                if current_layer_index > layer_index:
                    layer_index = current_layer_index 
                    if event_index_in_layer != layer0_events_count:
                        print(f"ERROR at layer {layer_index}: event count {event_index_in_layer} differs from the layer 0 event count {layer0_events_count}")
                        break
                    event_index_in_layer = 0                        
                
                if layer_index == 0:
                    event.layers_count = 1
                    event.layers_cpu_time = event.cpu_time
                    event.layers_cuda_time = event.device_time
                    coalesced_events.append(event)
                else:
                    first_event = coalesced_events[first_layer0_event_index + event_index_in_layer]
                    first_event.layers_count += 1
                    first_event.layers_cpu_time += event.cpu_time
                    first_event.layers_cuda_time += event.device_time    
                event_index_in_layer += 1
            else:
                coalesced_events.append(event)

    # Display profiling results
    table =  "| Cuda time (µs) | Cuda time (%) | Calls | Stack | PyTorch | Function |\n" 
    table += "| -------------- | ------------- | ----- | ----- | ------- | -------- |\n" 
    for event in coalesced_events:
        if getattr(event, "layers_count", 0) > 0:  
            percent_cuda_time = event.layers_cuda_time/events[0].device_time*100
            if percent_cuda_time >= percent_threshold:
                table += f"| {int(event.layers_cuda_time)} | {percent_cuda_time:.2f} | {event.layers_count} | {event.call_stack} | {event.torch_stack} | {event.name} |\n"
        else:
            percent_cuda_time = event.device_time/events[0].device_time*100
            if percent_cuda_time >= percent_threshold:
                table += f"| {int(event.device_time)} | {percent_cuda_time:.2f} | 1 | {event.call_stack} | {event.torch_stack} | {event.name} |\n"
    display(Markdown(table.replace("__","\\_\\_")))

In [27]:
profile_forward(benchmark.model, coalesce_layers="Qwen3DecoderLayer", batch_size=1, seq_length=200, percent_threshold=0.5)

| Cuda time (µs) | Cuda time (%) | Calls | Stack | PyTorch | Function |
| -------------- | ------------- | ----- | ----- | ------- | -------- |
| 400 | 0.74 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3RMSNorm_0.forward |  | aten::mul |
| 4161 | 7.66 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.Linear_0 | linear | aten::linear |
| 2325 | 4.28 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.Linear_1 | linear | aten::linear |
| 2309 | 4.25 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.Linear_2 | linear | aten::linear |
| 313 | 0.58 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.apply_rotary_pos_emb.rotate_half | type.cat | aten::cat |
| 1047 | 1.93 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.sdpa_attention_forward | scaled_dot_product_attention | aten::scaled_dot_product_attention |
| 4247 | 7.82 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3Attention_0.wrapped_func.forward.Linear_3 | linear | aten::linear |
| 396 | 0.73 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3RMSNorm_3.forward |  | aten::mul |
| 6136 | 11.30 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3MLP_0.forward.Linear_4 | linear | aten::linear |
| 6096 | 11.23 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3MLP_0.forward.Linear_5 | linear | aten::linear |
| 6592 | 12.14 | 28 | Qwen3ForCausalLM_0.wrapper.forward.Qwen3Model_0.wrapper.forward.\_\_call\_\_.Qwen3DecoderLayer_0.wrapped_func.forward.Qwen3MLP_0.forward.Linear_6 | linear | aten::linear |
| 14569 | 26.83 | 1 | Qwen3ForCausalLM_0.wrapper.forward.Linear_196 | linear | aten::linear |
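
To read this table at a glance, it helps to group the `aten::linear` rows by the module that issued them. The sketch below copies the "Cuda time (%)" values of those rows from the table and sums them per group. The group labels are my own reading of the call stacks: `Qwen3Attention` rows, `Qwen3MLP` rows, and the top-level `Linear_196`, which appears to be the `lm_head` since it is the only `Linear` directly under `Qwen3ForCausalLM` in the architecture listing.

```python
from collections import defaultdict

# (group, CUDA time %) for each aten::linear row of the table
linear_rows = [
    ("attention", 7.66), ("attention", 4.28), ("attention", 4.25), ("attention", 7.82),
    ("mlp", 11.30), ("mlp", 11.23), ("mlp", 12.14),
    ("lm_head", 26.83),
]

# Sum the percentages per group
totals = defaultdict(float)
for group, percent in linear_rows:
    totals[group] += percent

for group, percent in totals.items():
    print(f"{group:<10} {percent:6.2f} %")
print(f"{'all linear':<10} {sum(totals.values()):6.2f} %")
```

For this run, with a sequence of 200 tokens, the linear layers account for roughly 85% of the CUDA time, and the single `lm_head` projection alone takes more than the four attention projections of all 28 layers combined.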
