In [38]:
!pip3 install transformers datasets



In [36]:
from IPython.display import display, Markdown
from transformers import AutoTokenizer

# Base Models vs Instruct Models

A base model is trained on raw text data to predict the next token, while an instruct model is fine-tuned specifically to follow instructions and engage in conversations. For example, SmolLM2-135M is a base model, while SmolLM2-135M-Instruct is its instruction-tuned variant.

Instruction-tuned models are trained to follow a specific conversational structure, which makes them better suited to chatbot applications. Instruct models can also handle more complex interactions, including tool use, multimodal inputs, and function calling.

When using an instruct model, always verify you're using the correct chat template format. Using the wrong template can result in poor model performance or unexpected behaviour. The easiest way to ensure this is to check the model tokenizer configuration on the Hub. 
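On the Hub, the chat template is commonly stored as a Jinja string under the `chat_template` key of a model's `tokenizer_config.json`. The sketch below shows one way to check for it. It uses a simplified, made-up config string rather than a real downloaded file, and `get_chat_template` is a hypothetical helper, not a library function.

```python
import json
from typing import Optional

# Illustrative stand-in for the contents of a tokenizer_config.json file.
# The template is simplified and is NOT copied from any real model.
SAMPLE_CONFIG = json.dumps({
    "chat_template": (
        "{% for message in messages %}"
        "{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>\\n' }}"
        "{% endfor %}"
    ),
})


def get_chat_template(config_text: str) -> Optional[str]:
    """Return the chat template stored in a tokenizer config, or None if absent."""
    config = json.loads(config_text)
    return config.get("chat_template")


template = get_chat_template(SAMPLE_CONFIG)
print("has template:", template is not None)
print("looks like ChatML:", template is not None and "<|im_start|>" in template)
```

Once a tokenizer is loaded with `AutoTokenizer.from_pretrained`, the same string is available as `tokenizer.chat_template`.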

## Common Template Formats

In [19]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    {"role": "user", "content": "What's the weather?"}
]

messages

[{'role': 'system', 'content': 'You are a helpful assistant.'},
 {'role': 'user', 'content': 'Hello!'},
 {'role': 'assistant', 'content': 'Hi! How can I help you today?'},
 {'role': 'user', 'content': "What's the weather?"}]

In [9]:

# This is the ChatML Template used in models like SmolLM2 and Qwen2

# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant
# Hi! How can I help you today?<|im_end|>
# <|im_start|>user
# What's the weather?<|im_end|>
# <|im_start|>assistant
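To make the ChatML layout concrete, here is a minimal hand-rolled renderer. `render_chatml` is a hypothetical helper written only for illustration. In practice the tokenizer's `apply_chat_template` method does this job, and its `add_generation_prompt` parameter plays the same role as the flag below.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as a ChatML string (hand-rolled illustration)."""
    text = ""
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should respond next.
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather?"},
]
print(render_chatml(example, add_generation_prompt=True))
```

Every completed turn ends with `<|im_end|>`; only the final assistant header is left open when a reply is expected.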

## Each instruct model family uses its own variation of this format. The Transformers library handles these variations automatically

In [24]:
# These will use different templates automatically
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
zephyr_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
smol_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Each will format according to its model's template
mistral_chat = mistral_tokenizer.apply_chat_template(messages, tokenize=False)
zephyr_chat = zephyr_tokenizer.apply_chat_template(messages, tokenize=False)
smol_chat = smol_tokenizer.apply_chat_template(messages, tokenize=False)

In [46]:
print(mistral_chat)

<s> [INST] You are a helpful assistant.

Hello! [/INST] Hi! How can I help you today?</s> [INST] What's the weather? [/INST]


In [47]:
print(zephyr_chat)

<|system|>
You are a helpful assistant.</s>
<|user|>
Hello!</s>
<|assistant|>
Hi! How can I help you today?</s>
<|user|>
What's the weather?</s>



In [48]:
print(smol_chat)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
<|im_start|>user
What's the weather?<|im_end|>



## For multimodal conversations, chat templates can include image references or base64-encoded images

In [50]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful vision assistant that can analyze images.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image", "image_url": "https://example.com/image.jpg"},
        ],
    },
]
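Because `content` can be either a plain string or a list of typed parts, code that consumes these messages usually has to handle both shapes. The sketch below separates text from image references. `split_content` is a hypothetical helper, and the `type`, `text`, and `image_url` keys simply mirror the message above; a given model's processor may expect different keys, so check its model card.

```python
def split_content(message):
    """Split one message's content into text parts and image references."""
    content = message["content"]
    if isinstance(content, str):
        # Plain-text messages carry a single string instead of a list of parts.
        return [content], []
    texts = [part["text"] for part in content if part["type"] == "text"]
    images = [part["image_url"] for part in content if part["type"] == "image"]
    return texts, images


user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image", "image_url": "https://example.com/image.jpg"},
    ],
}
texts, images = split_content(user_message)
print(texts)
print(images)
```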

When working with chat templates, follow these key practices:

* Consistent Formatting: Always use the same template format throughout your application
* Clear Role Definition: Clearly specify roles (system, user, assistant, tool) for each message
* Context Management: Be mindful of token limits when maintaining conversation history
* Error Handling: Include proper error handling for tool calls and multimodal inputs
* Validation: Validate message structure before sending to the model
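The validation and context-management points above can be sketched in plain Python. Both helpers are hypothetical, and the set of allowed roles is taken from the list above. `trim_history` counts messages rather than tokens, which is a simplification; a real application would measure length with the model's tokenizer.

```python
VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_messages(messages):
    """Raise ValueError if any message is malformed; return the list otherwise."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} is not a dict")
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"message {index} has invalid role: {message.get('role')!r}")
        if not isinstance(message.get("content"), (str, list)):
            raise ValueError(f"message {index} needs str or list content")
    return messages


def trim_history(messages, max_messages):
    """Keep a leading system message plus the most recent max_messages messages."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    return system + rest[max(0, len(rest) - max_messages):]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    {"role": "user", "content": "What's the weather?"},
]
validate_messages(history)
print(trim_history(history, max_messages=1))
```

Keeping the system message while dropping the oldest turns preserves the assistant's instructions even as the conversation grows.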

# Converting Dataset Values into a Chat Template Format

## 1. Load the Dataset

In [54]:
from datasets import load_dataset

dataset = load_dataset("HuggingFaceTB/smoltalk", "smol-summarize")

dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 96356
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 5072
    })
})

In [64]:
dataset["train"][0]["messages"]

[{'content': 'Extract and present the main key point of the input text in one very short sentence, including essential details like dates or locations if necessary.',
  'role': 'system'},
 {'content': "Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! Let me know if you have time to chat further about this.\n\nBest,\nEmily",
  'role': 'user'},
 {'content': 'Emily is seeking advice on strategies for a struggling reader in her class.',
  'role': 'assistant'}]

## 2. Create a processing function

In [65]:
def convert_to_chatml(sample):
    # Keep only the user and assistant turns; this drops the system prompt.
    return {
        "messages": [
            {"role": turn["role"], "content": turn["content"]}
            for turn in sample["messages"]
            if turn["role"] in ("user", "assistant")
        ]
    }


# map() applies the function to every row and overwrites the "messages" column
train = dataset["train"].map(convert_to_chatml)

In [72]:
train["messages"][0]

[{'content': "Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! Let me know if you have time to chat further about this.\n\nBest,\nEmily",
  'role': 'user'},
 {'content': 'Emily is seeking advice on strategies for a struggling reader in her class.',
  'role': 'assistant'}]

Above, we kept only the user and assistant messages and dropped the system prompt.
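The smoltalk rows already arrive as a `messages` list, but many datasets store prompts and responses in separate columns. The sketch below shows how such a row might be converted into the same structure. The `question` and `answer` column names and the `to_messages` helper are assumptions made for illustration, not part of any particular dataset.

```python
def to_messages(row, system_prompt=None):
    """Convert a row with hypothetical 'question'/'answer' columns into chat messages."""
    messages = []
    if system_prompt is not None:
        # Optionally prepend a system message shared by every row.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["question"]})
    messages.append({"role": "assistant", "content": row["answer"]})
    return {"messages": messages}


row = {"question": "What is 2 + 2?", "answer": "4"}
print(to_messages(row))
```

A function shaped like this can be passed to `Dataset.map` in the same way as `convert_to_chatml`, and the resulting `messages` column can then be given to `apply_chat_template`.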