# Exploring Chat Templates with SmolLM2

This notebook demonstrates how to use chat templates with the `SmolLM2` model.
Key concepts covered:
- Loading and configuring a language model for chat
- Formatting messages using chat templates
- Understanding tokenization of chat messages

In [None]:
# Install the requirements in Google Colab
!pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face
from huggingface_hub import login

login()
# For convenience, you can store your Hub token in the HF_TOKEN environment variable
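
As a minimal sketch of the environment-variable approach mentioned in the comment above, the snippet below reads `HF_TOKEN` with the standard library only. The helper name `get_hub_token` is hypothetical (it is not part of `huggingface_hub`), and the sketch assumes you would pass the result to `login(token=...)` yourself.

```python
import os


def get_hub_token(env_var: str = "HF_TOKEN"):
    """Return the Hub token stored in the environment, or None if unset or blank."""
    # Hypothetical helper for illustration; not a huggingface_hub function.
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hub_token()
if token is None:
    print("HF_TOKEN is not set; fall back to the interactive login() prompt.")
else:
    # With a token available, login(token=token) skips the interactive prompt.
    print("Found a Hub token in the environment.")
```

The token itself is never printed, which keeps it out of saved notebook outputs.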

In [12]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer # Core Hugging Face components for model and tokenizer
from trl import setup_chat_format # Helper for chat formatting
import torch # Deep learning framework


## Set up device and model

In [14]:
# First determine best available hardware
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the base model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Configure the model and tokenizer for chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)


## SmolLM2 Chat Template

Chat templates define how a list of role-tagged messages is turned into the single text sequence a model consumes. Formatting every conversation the same way helps the model tell speakers apart and respond consistently.

Let's explore how to use a chat template with the `SmolLM2` model. We'll define a simple conversation and apply the chat template.

In [None]:
# Define messages for SmolLM2
"""
Chat messages are structured as a list of dictionaries.
Each message has:
- role: 'user' or 'assistant' (some templates also accept 'system')
- content: the actual message text
"""

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]

### Apply chat template without tokenization

With `tokenize=False`, the tokenizer returns the conversation as a single string in which special tokens mark each speaker's turn.

The chat template adds special tokens to structure the conversation:
- `<|im_start|>` and `<|im_end|>` mark message boundaries
- The role (user/assistant) is included before each message

This helps the model understand the conversation flow.
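
To make the structure concrete, here is a hand-written approximation of ChatML rendering in plain Python. It is only a sketch: `render_chatml` is a hypothetical helper, not the tokenizer's actual template, and it assumes the layout seen in this notebook's output (role on the first line, content next, `<|im_end|>` plus a newline closing each message).

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content dicts as a ChatML string (illustrative approximation)."""
    parts = []
    for message in messages:
        # Each message: start token, role, newline, content, end token, newline.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]
print(render_chatml(example))
```

In practice `tokenizer.apply_chat_template` should be preferred, because the real template may differ between models.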

In [None]:
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print("Conversation with template:", input_text)

Conversation with template: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>



### Apply chat template with tokenization
When preparing for model input, we:
1. Apply the chat template
2. Tokenize the text
3. Add a generation prompt for the model's response

In [None]:
input_text = tokenizer.apply_chat_template(
    messages, tokenize=True,
    add_generation_prompt=True
)

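### Decode the conversation
Decoding the token IDs reproduces the conversation shown above, followed by an extra `<|im_start|>assistant` header. This header is the generation prompt: it cues the model to write the next assistant turn.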

In [None]:

print("Conversation decoded:", tokenizer.decode(token_ids=input_text))

Conversation decoded: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>
<|im_start|>assistant



### Tokenize the conversation

The tokenizer also converts the conversation, including its special tokens, into IDs drawn from the model's vocabulary.

The model processes text as token IDs: integers that represent words or subwords.
The list below is the actual input format the model receives.


In [None]:
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print("Conversation tokenized:", input_text)

Conversation tokenized: [1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198, 1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17, 1073, 416, 339, 4237, 346, 1834, 47, 2, 198, 1, 520, 9531, 198]
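
The ID list printed above can be explored without the tokenizer. The sketch below assumes that ID `1` is `<|im_start|>` and ID `2` is `<|im_end|>`; this is inferred by lining up the printed IDs with the decoded text, not read from the vocabulary. `split_turns` is a hypothetical helper.

```python
IM_START, IM_END = 1, 2  # assumed IDs, inferred from the printed output above

token_ids = [
    1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198,
    1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17, 1073,
    416, 339, 4237, 346, 1834, 47, 2, 198,
    1, 520, 9531, 198,
]


def split_turns(ids, start_id=IM_START):
    """Split a flat ID list into one sub-list per turn, each starting at start_id."""
    turns = []
    for token_id in ids:
        if token_id == start_id or not turns:
            turns.append([])
        turns[-1].append(token_id)
    return turns


turns = split_turns(token_ids)
for turn in turns:
    # A turn without the end token is the open generation prompt.
    status = "closed" if IM_END in turn else "open (generation prompt)"
    print(len(turn), "tokens,", status)
```

Under these assumptions, the final turn contains no end token, which matches the trailing `<|im_start|>assistant` header in the decoded text.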


## Exercise: Processing Datasets for Supervised Fine-Tuning (SFT)


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Process a dataset for SFT</h2>
    <p>Take a dataset from the Hugging Face hub and process it for SFT. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Convert the <code>HuggingFaceTB/smoltalk</code> dataset into ChatML format.</p>
    <p>🐕 Convert the <code>openai/gsm8k</code> dataset into ChatML format.</p>
</div>

### What We'll Learn
- Loading datasets from Hugging Face Hub
- Converting raw data into chat format
- Applying chat templates for model training
- Processing and validating the formatted data

### 🐢 Beginner Exercise: Processing SmolTalk Dataset

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        """<iframe
  src="https://huggingface.co/datasets/HuggingFaceTB/smoltalk/embed/viewer/all/train?row=0"
  frameborder="0"
  width="100%"
  height="360px"
></iframe>
"""
    )
)

In [None]:
from datasets import load_dataset

# Load the dataset
ds = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations")

In [None]:
# Select the 'train' split
train_ds = ds["train"]

# Inspect the dataset structure
print("Dataset keys:", ds.column_names)  # Shows available keys for each split
print("First example in 'train' split:", train_ds[0])  # Inspect the first record in the 'train' split

Dataset keys: {'train': ['full_topic', 'messages'], 'test': ['full_topic', 'messages']}
First example in 'train' split: {'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?', 'role': 'user'}, {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.', 'role': 'assistant'}, {'content': "Okay, I'll look into those. Thank

In [None]:
# Define the function to process the dataset into ChatML format
def process_smoltalk_dataset(sample):
    """
    Converts the SmolTalk format into ChatML format, both tokenized and non-tokenized.

    The SmolTalk dataset has a 'messages' field containing conversation history.
    This function extracts the messages and formats them for the ChatML tokenizer.
    """

    # Extract the messages field from the dataset
    messages = sample["messages"]

    # Format the messages to fit the ChatML structure (lowercase role, consistent content)
    formatted_messages = [{"role": msg["role"].lower(), "content": msg["content"]} for msg in messages]

    # Apply chat template without tokenization
    input_text_no_tokenize = tokenizer.apply_chat_template(formatted_messages, tokenize=False)

    # Apply chat template with tokenization
    input_text_tokenized = tokenizer.apply_chat_template(formatted_messages, tokenize=True, add_generation_prompt=True)

    return {"formatted_text_no_tokenize": input_text_no_tokenize, "formatted_text_tokenized": input_text_tokenized}

In [None]:
# Apply the processing function to the dataset
processed_train_ds = train_ds.map(process_smoltalk_dataset)
print("\nDictionary structure now:")
print(processed_train_ds)

# Show the original and processed format
print("Original format:")
print(train_ds[0])  # Show the raw data in 'train' split


# Processed format with both ChatML (non-tokenized and tokenized) structures
print("\nProcessed format (non-tokenized ChatML structure):")
print(processed_train_ds[0]["formatted_text_no_tokenize"])

print("\nProcessed format (tokenized ChatML structure):")
print(processed_train_ds[0]["formatted_text_tokenized"])



Dictionary structure now:
Dataset({
    features: ['full_topic', 'messages', 'formatted_text_no_tokenize', 'formatted_text_tokenized'],
    num_rows: 2260
})
Original format:
{'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?', 'role': 'user'}, {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.', 'role': 'ass

In [None]:
# Decode the tokenized conversation to make it human-readable
decoded_text = tokenizer.decode(token_ids=processed_train_ds[0]["formatted_text_tokenized"])
print("\nDecoded conversation (from tokenized format):")
print(decoded_text)


Decoded conversation (from tokenized format):
<|im_start|>user
Hi there<|im_end|>
<|im_start|>assistant
Hello! How can I help you today?<|im_end|>
<|im_start|>user
I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?<|im_end|>
<|im_start|>assistant
Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.<|im_end|>
<|im_start|>user
That sounds great. Are there any resorts in the Caribbean that are good for families?<|im_end|>
<|im_start|>assistant
Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.<|im_end|>
<|im_start|>user
Okay, I'll look into those. Thanks for the recommendations!<|im_end|>
<|im_start|>assistant
You're welcome. I hope you find the perfect resort for your vacation.<|im_end|>
<|im_start|>assistant



### 🐕 Advanced Exercise: Processing GSM8K Dataset

The GSM8K dataset contains math word problems with step-by-step solutions.
This requires:
1. Formatting the problem as a user question
2. Structuring the solution as an assistant response
3. Preserving the step-by-step reasoning
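
The three steps above can be sketched in plain Python using the first training record of the dataset. Both helper names are hypothetical, and the split on `####` assumes every answer ends with that final-answer marker, as the first record does.

```python
def gsm8k_to_messages(sample):
    """Map a GSM8K record to a two-turn conversation, keeping the full solution."""
    return [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": sample["answer"]},
    ]


def split_gsm8k_answer(answer):
    """Split an answer into (reasoning, final_answer) at the last '####' marker."""
    reasoning, marker, final = answer.rpartition("####")
    if not marker:
        # Assumption check: fail loudly if a record lacks the marker.
        raise ValueError("answer has no '####' final-answer marker")
    return reasoning.strip(), final.strip()


sample = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?"
    ),
    "answer": (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    ),
}

messages = gsm8k_to_messages(sample)
reasoning, final_answer = split_gsm8k_answer(sample["answer"])
print(final_answer)
```

Keeping the whole answer as the assistant message preserves the reasoning steps; the separate split is only useful for checks such as comparing final answers.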

In [None]:
display(
    HTML(
        """<iframe
  src="https://huggingface.co/datasets/openai/gsm8k/embed/viewer/main/train"
  frameborder="0"
  width="100%"
  height="360px"
></iframe>
"""
    )
)

In [None]:
from datasets import load_dataset

# Load the GSM8K dataset
ds2 = load_dataset("openai/gsm8k", "main")

# Select the 'train' split
train_ds2 = ds2["train"]

# Inspect the dataset structure
print("Dataset keys:", ds2.column_names)  # Shows available keys for each split
print("First example in 'train' split:", train_ds2[0])  # Inspect the first record in the 'train' split



Dataset keys: {'train': ['question', 'answer'], 'test': ['question', 'answer']}
First example in 'train' split: {'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}


In [None]:
# Define the function to process the dataset into ChatML format
def process_gsm8k_dataset(sample):
    """
    Converts the GSM8K format into ChatML format, both tokenized and non-tokenized.

    The GSM8K dataset has 'question' and 'answer' fields.
    This function formats these fields for the ChatML tokenizer.
    """

    # Extract the question and answer
    question = sample["question"]
    answer = sample["answer"]

    # Structure the messages as a conversation
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer}
    ]

    # Apply chat template without tokenization
    input_text_no_tokenize = tokenizer.apply_chat_template(messages, tokenize=False)

    # Apply chat template with tokenization
    input_text_tokenized = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)

    return {
        "formatted_text_no_tokenize": input_text_no_tokenize,
        "formatted_text_tokenized": input_text_tokenized
    }

In [None]:
# Apply the processing function to the dataset
processed_train_ds2 = train_ds2.map(process_gsm8k_dataset)
print("\nDictionary structure now:")
print(processed_train_ds2)

# Show the original and processed format
print("Original format:")
print(train_ds2[0])  # Show the raw data in 'train' split


# Show examples for both formats
print("\nExample without tokenization:")
print(processed_train_ds2[0]["formatted_text_no_tokenize"])

print("\nExample with tokenization:")
print(processed_train_ds2[0]["formatted_text_tokenized"])





Dictionary structure now:
Dataset({
    features: ['question', 'answer', 'formatted_text_no_tokenize', 'formatted_text_tokenized'],
    num_rows: 7473
})
Original format:
{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}

Example without tokenization:
<|im_start|>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>
<|im_start|>assistant
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|im_end|>


Example with tokenization:
[1, 4093, 198, 62, 6927, 542, 3459, 23026, 288, 216, 36, 40, 282, 874, 2428, 281, 4124, 28, 284, 965, 1041, 3459

In [None]:
# Optionally, decode the tokenized conversation to make it human-readable
decoded_text = tokenizer.decode(token_ids=processed_train_ds2[0]["formatted_text_tokenized"])
print("\nDecoded conversation (from tokenized format):")
print(decoded_text)


Decoded conversation (from tokenized format):
<|im_start|>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>
<|im_start|>assistant
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|im_end|>
<|im_start|>assistant



### Key Learning Points

1. **Dataset Structure**: Different datasets require different processing approaches
   - SmolTalk: Simple conversation parsing
   - GSM8K: Complex problem-solution formatting

2. **Chat Templates**: The template adds special tokens and structure
   - Marks message boundaries
   - Identifies speaker roles
   - Maintains conversation flow

3. **Data Validation**: Always check your processed output
   - Verify format matches model expectations
   - Ensure no information is lost
   - Test with different examples
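
One way to check that no information is lost is a round trip: parse the formatted string back into messages and compare with the original. The sketch below uses a regular expression and the hypothetical helper `parse_chatml`; it assumes well-formed ChatML with single-word roles and no `<|im_end|>` inside message content.

```python
import re

# One turn: start token, role, newline, content, end token, newline.
CHATML_TURN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>\n", re.DOTALL)


def parse_chatml(text):
    """Recover role/content dicts from a ChatML string (assumes well-formed input)."""
    return [
        {"role": role, "content": content}
        for role, content in CHATML_TURN.findall(text)
    ]


original = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]
formatted = (
    "<|im_start|>user\nHello, how are you?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "I'm doing well, thank you! How can I assist you today?<|im_end|>\n"
)

recovered = parse_chatml(formatted)
print("Round trip preserved all messages:", recovered == original)
```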

Try modifying the processing functions to handle different dataset formats or add additional features like:
- Input validation
- Error handling
- Custom formatting options
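
As a starting point for input validation and error handling, the sketch below checks a conversation before it reaches the chat template. `validate_messages` and `ALLOWED_ROLES` are hypothetical names, and allowing a `system` role is an assumption; this notebook only uses `user` and `assistant`.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: adjust per template


def validate_messages(messages):
    """Return messages unchanged, or raise ValueError describing the first problem."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} is not a dict")
        missing = {"role", "content"} - set(message)
        if missing:
            raise ValueError(f"message {index} is missing keys: {sorted(missing)}")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has unknown role {message['role']!r}")
        content = message["content"]
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"message {index} has empty or non-string content")
    return messages


try:
    validate_messages([{"role": "bot", "content": "Hi"}])
except ValueError as error:
    print("Skipping sample:", error)
```

Calling this at the top of a processing function lets you skip or log bad records instead of failing partway through a `map` call.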

### WikiHow-es Dataset

In [4]:
from IPython.display import display, HTML

display(
    HTML(
        """<iframe
  src="https://huggingface.co/datasets/daqc/wikihow_es/embed/viewer/main/train"
  frameborder="0"
  width="100%"
  height="360px"
></iframe>
"""
    )
)

In [7]:
from datasets import load_dataset

# Load the dataset
ds = load_dataset("daqc/wikihow_es")

# Select the 'train' split
train_ds = ds["train"]

# Inspect the dataset structure
print("Dataset keys:", ds.column_names)  # Shows available keys for each split
print("First example in 'train' split:", train_ds[0])  # Inspect the first record in the 'train' split

Dataset keys: {'train': ['title', 'section_name', 'summary', 'document', 'english_section_name', 'english_url', 'url']}
First example in 'train' split: {'title': '¿Cómo calcular El rendimiento anualizado de una cartera de inversiones?', 'section_name': 'Calcular tu rendimiento anualizado', 'summary': 'Calcula tu rendimiento anualizado. Calcula el rendimiento semestral. Calcula un equivalente anualizado.', 'document': 'Una vez que hayas calculado el rendimiento total (como se muestra arriba), ingresa el resultado en esta ecuación: rendimiento anualizado = (1+ rendimiento)1/N-1 El producto de esta ecuación será el número correspondiente al rendimiento de cada año durante todo el periodo de tiempo.  En el exponente (el número pequeño que está afuera del paréntesis), el “1” representa la unidad que estamos midiendo, que es un año. Si deseas ser más específico, podrías usar “365” para obtener el rendimiento diario. La “N” representa el número de periodos que medirás. Entonces, si mides 

In [10]:
# Define the function to process the dataset into ChatML format
def process_wikihow_es_dataset(sample):
    """
    Converts the `wikihow_es` format into ChatML format.

    This dataset has fields such as 'title' (used as the question) and
    'summary' (used as the answer).
    This function formats these fields as a ChatML conversation.
    """

    # Extract the question and the answer
    question = sample["title"]  # Field used as the user's question
    answer = sample["summary"]  # Field used as the expected assistant response

    # Structure the messages as a conversation
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer}
    ]

    # Apply chat template without tokenization
    input_text_no_tokenize = tokenizer.apply_chat_template(messages, tokenize=False)

    # Apply chat template with tokenization
    input_text_tokenized = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)

    return {
        "formatted_text_no_tokenize": input_text_no_tokenize,
        "formatted_text_tokenized": input_text_tokenized
    }

In [15]:
# Apply the processing function to the dataset
processed_train_ds = train_ds.map(process_wikihow_es_dataset)
print("\nDictionary structure after processing:")
print(processed_train_ds)

# Show the original and processed format
print("Original format:")
print(train_ds[0])  # Show the raw data in 'train' split

# Show examples for both formats
print("\nExample without tokenization:")
print(processed_train_ds[0]["formatted_text_no_tokenize"])

print("\nExample with tokenization:")
print(processed_train_ds[0]["formatted_text_tokenized"])

# Optionally, decode the tokenized conversation to make it human-readable
decoded_text = tokenizer.decode(token_ids=processed_train_ds[0]["formatted_text_tokenized"])
print("\nDecoded conversation (from tokenized format):")
print(decoded_text)


Dictionary structure after processing:
Dataset({
    features: ['title', 'section_name', 'summary', 'document', 'english_section_name', 'english_url', 'url', 'formatted_text_no_tokenize', 'formatted_text_tokenized'],
    num_rows: 113160
})
Original format:
{'title': '¿Cómo calcular El rendimiento anualizado de una cartera de inversiones?', 'section_name': 'Calcular tu rendimiento anualizado', 'summary': 'Calcula tu rendimiento anualizado. Calcula el rendimiento semestral. Calcula un equivalente anualizado.', 'document': 'Una vez que hayas calculado el rendimiento total (como se muestra arriba), ingresa el resultado en esta ecuación: rendimiento anualizado = (1+ rendimiento)1/N-1 El producto de esta ecuación será el número correspondiente al rendimiento de cada año durante todo el periodo de tiempo.  En el exponente (el número pequeño que está afuera del paréntesis), el “1” representa la unidad que estamos midiendo, que es un año. Si deseas ser más específico, podrías 

## Conclusion

This notebook demonstrated how to apply chat templates with the `SmolLM2` model. Structuring conversations with a chat template presents every exchange to the model in the same format, which supports consistent and contextually relevant responses.

In the exercises, you converted datasets into ChatML format. Luckily, TRL will do this for you, but it's useful to understand what's going on under the hood.