## Chapter 4: Formatting Your Dataset

### Spoilers 

In this chapter, we will:

- Understand the importance of defining a proper chat template
- Discuss several formatting alternatives, including custom formatting functions and templates
- Configure the tokenizer and the model’s embedding layer
- Explore packed datasets and different data collators for loading data

### Setup

In [None]:
# If you're running on Colab
!pip install datasets bitsandbytes trl

In [None]:
# If you're running on runpod.io's Jupyter Template
#!pip install datasets bitsandbytes trl transformers peft huggingface-hub accelerate safetensors pandas matplotlib

### Imports

In [1]:
import torch
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig
from datasets import load_dataset, Dataset
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoConfig,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    DataCollatorWithFlattening,
    BitsAndBytesConfig,
)
from trl import setup_chat_format, DataCollatorForCompletionOnlyLM
from trl.extras.dataset_formatting import FORMAT_MAPPING, instructions_formatting_function, conversations_formatting_function
from trl.trainer import ConstantLengthDataset

### The Goal

We format the dataset to provide structure and cues to the LLM. We can easily steer its behavior (e.g., instruction-tuning) by carefully wrapping each component—the user’s prompt and the model’s completion—with appropriate tags and special tokens.

### Formatting in a Nutshell

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/base_prompt.png?raw=True)
<center>Figure 4.1 - Base model’s next token prediction</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/fine_tuned_prompt.png?raw=True)
<center>Figure 4.2 - Fine-tuned model triggered by response template</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/chat_prompt_new.png?raw=True)
<center>Figure 4.3 - Chat model using chat template</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/chat_example_new.png?raw=True)
<center>Figure 4.4 - General structure of a chat template</center>

### The Road so Far

In [2]:
supported = torch.cuda.is_bf16_supported(including_emulation=False)
compute_dtype = (torch.bfloat16 if supported else torch.float32)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=compute_dtype
)

model_q4 = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
                                                device_map='cuda:0',
                                                torch_dtype=compute_dtype,
                                                quantization_config=nf4_config)

model_q4 = prepare_model_for_kbit_training(model_q4)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model_q4, config)


### Applying Templates

****
**Summary of "Applying Templates"**

You have three options for formatting your dataset:
1. Your dataset is in one of the **two formats supported by the `SFTTrainer` class** (conversational or instruction):
   - Your **tokenizer must have a chat template** configured.
   - No need to define a formatting function or format the dataset before training.
   - **IMPORTANT**: **the instruction format is no longer properly supported by recent versions of the `trl` package**.
2. You want to use a **custom formatting function** (see "BYOFF, Bring Your Own Formatting Function"):
   - The custom function should be provided as the **`formatting_func` argument of the `SFTTrainer` class** (see Chapter 5).
   - Your formatting function **must handle batches of data**.
     - Test it by calling the dataset's `map()` method with `batched=True`.
   - No need to apply the function to the dataset before training.
   - If your tokenizer already **has a chat template**:
     - You may call its `apply_chat_template()` method in your function.
     - Stick to the template's general format (instruction and response templates).
     - If the template doesn’t include one, **you may append an `EOS` token to the end of the formatted output**.
   - If your tokenizer **does not have a chat template**:
     - You're free to define the general format, including instruction and response templates (see "Advanced: BYOT, Bring Your Own Template").
3. Your dataset is **already formatted** (see "BYOFD, Bring Your Own Formatted Data"):
   - The column containing the formatted data should be provided as the **`dataset_text_field` argument of the `SFTTrainer` class** (see Chapter 5).
   - Even though you can use your own formatting function to preprocess your dataset, it won't be used by the trainer class.
   - Ensure your **data is compatible with the tokenizer's template**.
****

In [3]:
tokenizer_phi = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-4k-instruct")
print(tokenizer_phi.chat_template)

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
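
To make the Jinja logic above easier to follow, here is a minimal plain-Python sketch of the same rendering rules. The function name `render_phi3_style` is hypothetical and exists only for illustration, and the default `EOS` string is an assumption about this tokenizer. In practice you would call the tokenizer's `apply_chat_template()` method instead.

```python
def render_phi3_style(messages, add_generation_prompt=False,
                      eos_token="<|endoftext|>"):
    """Hypothetical helper mirroring the printed template; not part of any library."""
    known_roles = ("system", "user", "assistant")
    text = ""
    for message in messages:
        # The template only emits output for these three roles
        if message["role"] in known_roles:
            text += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    # Either open the assistant's turn (inference) or close the text (training)
    text += "<|assistant|>\n" if add_generation_prompt else eos_token
    return text


messages = [
    {"role": "user", "content": "What is the capital of Argentina?"},
    {"role": "assistant", "content": "Buenos Aires."},
]
training_text = render_phi3_style(messages)
inference_text = render_phi3_style(messages[:-1], add_generation_prompt=True)
print(training_text)
print(inference_text)
```

The training text ends with the `EOS` token, while the inference text stops right after the assistant tag so that the model continues from there.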


In [4]:
messages = [
    {'role': 'system', 'content': 'You are a helpful AI assistant.'},
    {'role': 'user', 'content': 'What is the capital of Argentina?'},
    {'role': 'assistant', 'content': 'Buenos Aires.'}
]

formatted = tokenizer_phi.apply_chat_template(conversation=messages,
                                              tokenize=False,
                                              add_generation_prompt=False)
print(formatted)

<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
What is the capital of Argentina?<|end|>
<|assistant|>
Buenos Aires.<|end|>
<|endoftext|>


In [5]:
inference_input = tokenizer_phi.apply_chat_template(conversation=messages[:-1],
                                                    tokenize=False,
                                                    add_generation_prompt=True)
print(inference_input)

<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
What is the capital of Argentina?<|end|>
<|assistant|>



#### Supported Formats

##### Conversational

In [6]:
conversation_ds = Dataset.from_list([{'messages': messages}])
conversation_ds.features

{'messages': [{'content': Value(dtype='string', id=None),
   'role': Value(dtype='string', id=None)}]}

In [7]:
FORMAT_MAPPING['chatml'] == conversation_ds.features['messages']

True

In [8]:
formatting_func = conversations_formatting_function(tokenizer_phi, messages_field='messages')

print(formatting_func(conversation_ds[0]))

<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
What is the capital of Argentina?<|end|>
<|assistant|>
Buenos Aires.<|end|>
<|endoftext|>


```python
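# formatting function for conversational format
# (`tokenizer` and `messages_field` are captured from the enclosing
# `conversations_formatting_function` call)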
def format_dataset(examples):
    if isinstance(examples[messages_field][0], list):
        output_texts = []
        for i in range(len(examples[messages_field])):
            output_texts.append(tokenizer.apply_chat_template(examples[messages_field][i], tokenize=False))
        return output_texts
    else:
        return tokenizer.apply_chat_template(examples[messages_field], tokenize=False)
```

##### Instruction

**IMPORTANT UPDATE**: In more recent versions of the `trl` library, the "instruction" format is no longer properly supported, so the chat template is not applied to the dataset. To avoid this issue, use the "conversational" format instead.

In [10]:
instructions = [{'prompt': 'What is the capital of Argentina?',
                 'completion': 'Buenos Aires.'}]

instruction_ds = Dataset.from_list(instructions)
instruction_ds.features

{'prompt': Value(dtype='string', id=None),
 'completion': Value(dtype='string', id=None)}

In [11]:
FORMAT_MAPPING['instruction'] == instruction_ds.features

True

In [12]:
formatting_func = instructions_formatting_function(tokenizer_phi)
formatting_func

<function trl.extras.dataset_formatting.instructions_formatting_function.<locals>.format_dataset(examples)>

```python
# formatting function for instruction format
# (`tokenizer` is captured from the enclosing `instructions_formatting_function` call)
def format_dataset(examples):
    if isinstance(examples["prompt"], list):
        output_texts = []
        for i in range(len(examples["prompt"])):
            converted_sample = [
                {"role": "user", "content": examples["prompt"][i]},
                {"role": "assistant", "content": examples["completion"][i]},
            ]
            output_texts.append(tokenizer.apply_chat_template(converted_sample, tokenize=False))
        return output_texts
    else:
        converted_sample = [
            {"role": "user", "content": examples["prompt"]},
            {"role": "assistant", "content": examples["completion"]},
        ]
        return tokenizer.apply_chat_template(converted_sample, tokenize=False)
```

In [14]:
batch_prompts_completions = {
    'prompt': ['What is the capital of Argentina?',
               'What is the capital of the United States?'],
    'completion': ['Buenos Aires.',
                    'Washington D.C.']
}

In [15]:
batch_messages = [
    [{'role': 'user', 'content': 'What is the capital of Argentina?'},
     {'role': 'assistant', 'content': 'Buenos Aires.'}],
    [{'role': 'user', 'content': 'What is the capital of the United States?'},
     {'role': 'assistant', 'content': 'Washington D.C.'}]
]
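
Since the conversational format is the safer choice, it may help to convert prompt/completion pairs into messages before training. The sketch below does that for a batch shaped like `batch_prompts_completions`. The function name `to_messages` is an assumption made for illustration and is not part of `trl`.

```python
def to_messages(examples):
    """Hypothetical batched converter: prompt/completion columns -> 'messages' column."""
    conversations = []
    for prompt, completion in zip(examples["prompt"], examples["completion"]):
        conversations.append([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ])
    return {"messages": conversations}


batch = {
    "prompt": ["What is the capital of Argentina?",
               "What is the capital of the United States?"],
    "completion": ["Buenos Aires.", "Washington D.C."],
}
converted = to_messages(batch)
print(converted["messages"][1])
```

Because it takes and returns a dictionary of lists, a function like this should also work with the dataset's `map()` method and `batched=True`.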

#### BYOFF (Bring Your Own Formatting Function)

In [16]:
def byo_formatting_func1(examples):
    messages = examples["messages"]
    output_texts = tokenizer_phi.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return output_texts

In [17]:
ds_msg = Dataset.from_dict({'messages': batch_messages})
ds_msg.map(lambda v: tokenizer_phi(byo_formatting_func1(v)), batched=True)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Dataset({
    features: ['messages', 'input_ids', 'attention_mask'],
    num_rows: 2
})

In [18]:
def byo_formatting_func2(examples):
    response_template = '### Answer:'
    text = f"### Question: {examples['prompt']}\n{response_template} {examples['completion']}"
    text += tokenizer_phi.eos_token
    return text

In [19]:
ds_prompt = Dataset.from_dict(batch_prompts_completions)
print(byo_formatting_func2(ds_prompt[0]))

### Question: What is the capital of Argentina?
### Answer: Buenos Aires.<|endoftext|>


In [20]:
# this is going to raise an exception
ds_prompt.map(lambda v: tokenizer_phi(byo_formatting_func2(v)), batched=True)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

ArrowInvalid: Column 2 named input_ids expected length 2 but got length 44
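
The error is consistent with the function treating the whole batch as a single example: with `batched=True`, each column arrives as a list, so the f-string formats the lists themselves and returns one string for two rows. The tokenizer-free sketch below reproduces only that shape mismatch. The names `format_single` and `format_batch` are hypothetical, and the `EOS` string is assumed to match the tokenizer used here.

```python
EOS = "<|endoftext|>"  # assumed EOS token, matching the tokenizer used above


def format_single(example):
    # Written for one row: 'prompt' and 'completion' are expected to be strings
    return f"### Question: {example['prompt']}\n### Answer: {example['completion']}{EOS}"


def format_batch(examples):
    # Written for a batch: each column is a list with one entry per row
    return [
        f"### Question: {prompt}\n### Answer: {completion}{EOS}"
        for prompt, completion in zip(examples["prompt"], examples["completion"])
    ]


batch = {
    "prompt": ["What is the capital of Argentina?",
               "What is the capital of the United States?"],
    "completion": ["Buenos Aires.", "Washington D.C."],
}

wrong = format_single(batch)   # one string built from the lists' repr
right = format_batch(batch)    # one string per row
print(type(wrong).__name__, type(right).__name__, len(right))
```

A single string tokenizes into one long list of token IDs, which cannot be aligned with a two-row batch. Returning one string per row avoids the mismatch.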

In [21]:
def byo_formatting_func3(examples):
    output_texts = []
    response_template = '### Answer:'
    for i in range(len(examples['prompt'])):
        text = f"### Question: {examples['prompt'][i]}\n{response_template} {examples['completion'][i]}"
        text += tokenizer_phi.eos_token
        output_texts.append(text)
    return output_texts

In [22]:
ds_prompt.map(lambda v: tokenizer_phi(byo_formatting_func3(v)), batched=True)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask'],
    num_rows: 2
})

#### BYOFD (Bring Your Own Formatted Data)

In [23]:
def byofd_formatting_func(examples):
    messages = examples["messages"]
    output_texts = tokenizer_phi.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return {'text': output_texts}

In [24]:
formatted_ds = ds_msg.map(byofd_formatting_func, batched=True)
formatted_ds['text']

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

['<|user|>\nWhat is the capital of Argentina?<|end|>\n<|assistant|>\nBuenos Aires.<|end|>\n<|endoftext|>',
 '<|user|>\nWhat is the capital of the United States?<|end|>\n<|assistant|>\nWashington D.C.<|end|>\n<|endoftext|>']

#### Showdown

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/formatting_flow.png?raw=True)

<center>Figure 4.5 - Choosing the right configuration for your formatting needs</center>

### The Tokenizer

****
**Summary of "The Tokenizer"**
- The **tokenizer's vocabulary** is usually **smaller than the model's embedding layer**.
  - The difference in size consists of, quite literally, "empty slots" that you can use to **create new tokens without resizing** the embedding layer.
  - The **size of the embedding layer** is often a **multiple of a power of two** (32, 64, etc.) to optimize **memory allocation**.
- The `EOS` token should be **used solely to mark the end of the text** and nothing else.
  - Using the `EOS` token for padding may lead to _endless token generation_.
- The `PAD` token is often undefined, but you might still need it:
  - **DO NOT** assign the `EOS` token as the `PAD` token.
  - If the `UNK` token is defined, it is fine to assign it as the `PAD` token.
  - If the `UNK` token is undefined, create a new special token as the `PAD` token.
  - **WATCH OUT**: If the `PAD` token is left **undefined**, many libraries will **default to assigning it the `EOS` token** instead!
- For **generative** models, **padding** should be performed on the **left** side.
  - Padding on the _right_ side will train the model to generate _endless sequences of padding tokens_.
  - Many tutorials use `tokenizer.padding_side='right'` due to reported overflow issues with the `SFTTrainer` class.
    - This is fine **only if you're using packing or packing-like collators** (see the "Packed Dataset" section) instead of standard padding.
- If you **create new special tokens**, in theory, you should also **fine-tune the embedding layer** (since you're using those "empty slots").
  - In practice, your model _may_ still work if you **keep the embeddings frozen**.
  - Even though the new tokens' representation is _random_ (their embeddings aren't trained), the other trainable parts of the model may still learn to use them "as is."
****

In [25]:
tokenizer_phi = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-4k-instruct")
config_phi = AutoConfig.from_pretrained("microsoft/phi-3-mini-4k-instruct", trust_remote_code=True)

In [26]:
tokenizer_phi("Let's tokenize this sentence!")

{'input_ids': [2803, 29915, 29879, 5993, 675, 445, 10541, 29991], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

#### Vocabulary

In [27]:
len(tokenizer_phi), config_phi.vocab_size

(32011, 32064)
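
The gap between the two numbers is the pool of unused embedding rows. Here is a quick arithmetic sketch, assuming the embedding size was rounded up to a multiple of 64 (an assumption that happens to fit these numbers). The helper name is hypothetical.

```python
def round_up_to_multiple(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


tokenizer_size = 32011   # len(tokenizer_phi)
embedding_rows = 32064   # config_phi.vocab_size

empty_slots = embedding_rows - tokenizer_size
rounded = round_up_to_multiple(tokenizer_size, 64)
print(empty_slots)  # 53
print(rounded)      # 32064
```

With 53 free rows, adding a handful of new special tokens should not require resizing the embedding layer.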

In [28]:
sorted(tokenizer_phi.vocab.items(), key=lambda t: -t[1])[:12]

[('<|user|>', 32010),
 ('<|placeholder6|>', 32009),
 ('<|placeholder5|>', 32008),
 ('<|end|>', 32007),
 ('<|system|>', 32006),
 ('<|placeholder4|>', 32005),
 ('<|placeholder3|>', 32004),
 ('<|placeholder2|>', 32003),
 ('<|placeholder1|>', 32002),
 ('<|assistant|>', 32001),
 ('<|endoftext|>', 32000),
 ('给', 31999)]

In [29]:
tokenizer_phi.eos_token, tokenizer_phi.eos_token_id

('<|endoftext|>', 32000)

#### Special Tokens

In [30]:
tokenizer_phi.all_special_tokens

['<s>', '<|endoftext|>', '<unk>']

In [31]:
tokenizer_phi.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<unk>',
 'pad_token': '<|endoftext|>'}

In [33]:
tokenizer_phi.cls_token, tokenizer_phi.sep_token, tokenizer_phi.mask_token

(None, None, None)

In [34]:
tokenizer_phi.add_special_tokens({'cls_token': '<cls>', 'sep_token': '<sep>', 'mask_token': '<mask>'})
tokenizer_phi.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<unk>',
 'sep_token': '<sep>',
 'pad_token': '<|endoftext|>',
 'cls_token': '<cls>',
 'mask_token': '<mask>'}

In [35]:
sorted(tokenizer_phi.vocab.items(), key=lambda t: -t[1])[:15]

[('<mask>', 32013),
 ('<sep>', 32012),
 ('<cls>', 32011),
 ('<|user|>', 32010),
 ('<|placeholder6|>', 32009),
 ('<|placeholder5|>', 32008),
 ('<|end|>', 32007),
 ('<|system|>', 32006),
 ('<|placeholder4|>', 32005),
 ('<|placeholder3|>', 32004),
 ('<|placeholder2|>', 32003),
 ('<|placeholder1|>', 32002),
 ('<|assistant|>', 32001),
 ('<|endoftext|>', 32000),
 ('给', 31999)]

#### The `EOS` Token

In [32]:
tokenizer_phi.pad_token = tokenizer_phi.unk_token
tokenizer_phi.pad_token_id = tokenizer_phi.unk_token_id

tokenizer_phi.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<unk>',
 'sep_token': '<sep>',
 'pad_token': '<unk>',
 'cls_token': '<cls>',
 'mask_token': '<mask>'}

```python
# Updating model's configuration for the modified PAD token
if getattr(model, "config", None) is not None:
    model.config.pad_token_id = tokenizer_phi.pad_token_id
if getattr(model, "generation_config", None) is not None:
    model.generation_config.pad_token_id = tokenizer_phi.pad_token_id
```
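
One way to see why `EOS` should not double as `PAD`: language-modeling collators commonly replace the labels at padding positions with `-100` so the loss ignores them. If both tokens share an ID, the real `EOS` at the end of each sequence is masked too, and the model gets no signal about when to stop. The sketch below is a simplified illustration of that masking, not the library's code. The function name and the token IDs are illustrative assumptions.

```python
IGNORE_INDEX = -100


def labels_ignoring_padding(input_ids, pad_token_id):
    """Copy the inputs as labels, masking every position equal to the PAD ID."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


UNK_ID, EOS_ID = 0, 32000  # illustrative IDs
# A short, left-padded sequence that ends with a genuine EOS token
padded_with_unk = [UNK_ID, UNK_ID, 32010, 15043, 32007, EOS_ID]
padded_with_eos = [EOS_ID, EOS_ID, 32010, 15043, 32007, EOS_ID]

labels_unk = labels_ignoring_padding(padded_with_unk, pad_token_id=UNK_ID)
labels_eos = labels_ignoring_padding(padded_with_eos, pad_token_id=EOS_ID)
print(labels_unk)  # the final EOS survives as a label
print(labels_eos)  # the final EOS is masked along with the padding
```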

#### The `PAD` Token

In [36]:
tokenizer_phi.pad_token, tokenizer_phi.padding_side

('<unk>', 'left')

### Data Collators

****
**Summary of "Data Collators"**
- You can specify the `data_collator` argument in the `SFTTrainer` class (see Chapter 5).
- `DataCollatorForLanguageModeling` is the **default** collator for the `SFTTrainer` class:
  - It automatically **replicates the token IDs as labels**.
  - It **doesn't shift the labels**, as this is **handled automatically by the model**.
  - It includes the full text (both prompt and completion) as labels, making it ideal for instruction-tuning.
- If you're further fine-tuning an instruction or chat model, you can use `DataCollatorForCompletionOnlyLM` to **train only on the model's answer (completion)**.
  - It also replicates the token IDs as labels but **masks the prompt tokens by replacing their IDs with `-100`**.
  - In a **single interaction** (one prompt and one completion), the **response template is enough** to locate the completion.
  - In **multiple interactions** (a sequence of prompts and completions), both the **instruction and response templates** are needed to correctly identify and mask the prompt tokens.
****

In [37]:
dataset = load_dataset("dvgodoy/yoda_sentences", split="train")
dataset = dataset.rename_column("sentence", "prompt")
dataset = dataset.rename_column("translation_extra", "completion")
dataset = dataset.remove_columns(["translation"])
len(dataset), dataset[0]

(720,
 {'prompt': 'The birch canoe slid on the smooth planks.',
  'completion': 'On the smooth planks, the birch canoe slid. Yes, hrrrm.'})

In [38]:
formatting_func = instructions_formatting_function(tokenizer_phi)
dataset = dataset.map(lambda row: {'text': formatting_func(row)}, batched=True, batch_size=32)
sequences = dataset['text']
print(sequences[:2])

['<|user|>\nThe birch canoe slid on the smooth planks.<|end|>\n<|assistant|>\nOn the smooth planks, the birch canoe slid. Yes, hrrrm.<|end|>\n<|endoftext|>', '<|user|>\nGlue the sheet to the dark blue background.<|end|>\n<|assistant|>\nGlue the sheet to the dark blue background, you must.<|end|>\n<|endoftext|>']


In [39]:
tokenized_dataset = dataset.map(lambda row: tokenizer_phi(row['text']))
tokenized_dataset = tokenized_dataset.select_columns(['input_ids'])

#### `DataCollatorWithPadding`

In [40]:
pad_collator = DataCollatorWithPadding(tokenizer_phi)
pad_dloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=pad_collator)
pad_batch = next(iter(pad_dloader))
pad_batch

{'input_ids': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
         10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
          1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
          3869, 29892,   298, 21478,  1758, 29889, 32007, 32000],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         32010,  8467,   434,   278,  9869,   304,   278,  6501,  7254,  3239,
         29889, 32007, 32001,  8467,   434,   278,  9869,   304,   278,  6501,
          7254,  3239, 29892,   366,  1818, 29889, 32007, 32000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
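
The padded batch above can be mimicked in plain Python to make the mechanics explicit: shorter sequences receive `PAD` IDs on the chosen side, and the attention mask flags real tokens with ones. This is a sketch under the assumption of a `PAD` ID of 0, as in the output above. `pad_batch` is a hypothetical helper, not the collator's implementation.

```python
def pad_batch(sequences, pad_token_id=0, padding_side="left"):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_token_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        mask_seq = [1] * len(seq)
        if padding_side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_seq)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_seq + mask_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[32010, 450, 32007, 32000], [32010, 32000]])
print(batch["input_ids"])
print(batch["attention_mask"])
```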

#### Dude, Where's My Label?

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/shift_labels.png?raw=True)

<center>Figure 4.6 - Inputs and their corresponding shifted labels</center>

#### `DataCollatorForLanguageModeling`

In [41]:
lm_collator = DataCollatorForLanguageModeling(tokenizer_phi, mlm=False)
lm_dloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=lm_collator)
lm_batch = next(iter(lm_dloader))
lm_batch

{'input_ids': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
         10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
          1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
          3869, 29892,   298, 21478,  1758, 29889, 32007, 32000],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         32010,  8467,   434,   278,  9869,   304,   278,  6501,  7254,  3239,
         29889, 32007, 32001,  8467,   434,   278,  9869,   304,   278,  6501,
          7254,  3239, 29892,   366,  1818, 29889, 32007, 32000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
   

#### `DataCollatorForCompletionOnlyLM`

In [44]:
response_template = '<|assistant|>' # token id 32001
completion_collator = DataCollatorForCompletionOnlyLM(response_template=response_template, 
                                                      tokenizer=tokenizer_phi)
completion_dloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=completion_collator)
completion_batch = next(iter(completion_dloader))
completion_batch

{'input_ids': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
         10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
          1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
          3869, 29892,   298, 21478,  1758, 29889, 32007, 32000],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         32010,  8467,   434,   278,  9869,   304,   278,  6501,  7254,  3239,
         29889, 32007, 32001,  8467,   434,   278,  9869,   304,   278,  6501,
          7254,  3239, 29892,   366,  1818, 29889, 32007, 32000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
   

In [47]:
labels = completion_batch['labels'][0]
valid_tokens = (labels >= 0)
tokenizer_phi.decode(labels[valid_tokens])

'On the smooth planks, the birch canoe slid. Yes, hrrrm.<|end|><|endoftext|>'

##### Multiple Interactions

In [48]:
dummy_chat = """<|user|>Hello
<|assistant|>How are you?
<|user|>I'm fine! You?
<|assistant|>I'm fine too!
<|endoftext|>"""

dummy_ds = Dataset.from_dict({'text': [dummy_chat]})
dummy_ds = dummy_ds.map(lambda row: tokenizer_phi(row['text'])).select_columns(['input_ids'])

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [49]:
completion_dloader = DataLoader(dummy_ds, batch_size=1, collate_fn=completion_collator)
completion_batch = next(iter(completion_dloader))
completion_batch

{'input_ids': tensor([[32010, 15043,    13, 32001,  1128,   526,   366, 29973,    13, 32010,
           306, 29915, 29885,  2691, 29991,   887, 29973,    13, 32001,   306,
         29915, 29885,  2691,  2086, 29991,    13, 32000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,   306,
         29915, 29885,  2691,  2086, 29991,    13, 32000]])}

In [50]:
labels = completion_batch['labels']
tokenizer_phi.decode(labels[labels >= 0])

"I'm fine too!\n<|endoftext|>"

In [51]:
instruction_template = '<|user|>'
response_template = '<|assistant|>'
completion_collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template,
                                                      response_template=response_template, 
                                                      tokenizer=tokenizer_phi)
completion_dloader = DataLoader(dummy_ds, batch_size=1, collate_fn=completion_collator)
completion_batch = next(iter(completion_dloader))
completion_batch

{'input_ids': tensor([[32010, 15043,    13, 32001,  1128,   526,   366, 29973,    13, 32010,
           306, 29915, 29885,  2691, 29991,   887, 29973,    13, 32001,   306,
         29915, 29885,  2691,  2086, 29991,    13, 32000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  1128,   526,   366, 29973,    13,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,   306,
         29915, 29885,  2691,  2086, 29991,    13, 32000]])}
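
The two label tensors shown for the dummy chat can be reproduced with a small masking routine. The sketch below assumes each template is a single token ID and only approximates the collator's behavior. It is not the collator's actual implementation, and `mask_prompt_tokens` is a hypothetical name.

```python
IGNORE_INDEX = -100


def mask_prompt_tokens(input_ids, response_id, instruction_id=None):
    """Return labels where everything except the completions is set to -100."""
    labels = list(input_ids)
    if instruction_id is None:
        # Response template only: keep what follows the LAST response tag
        positions = [i for i, tok in enumerate(input_ids) if tok == response_id]
        cutoff = positions[-1] + 1 if positions else len(input_ids)
        return [IGNORE_INDEX] * cutoff + labels[cutoff:]
    in_completion = False
    for i, tok in enumerate(input_ids):
        if tok == response_id:
            in_completion = True       # completion starts after this tag
            labels[i] = IGNORE_INDEX
        elif tok == instruction_id:
            in_completion = False      # a new prompt begins
            labels[i] = IGNORE_INDEX
        elif not in_completion:
            labels[i] = IGNORE_INDEX
    return labels


dummy_ids = [32010, 15043, 13, 32001, 1128, 526, 366, 29973, 13, 32010,
             306, 29915, 29885, 2691, 29991, 887, 29973, 13, 32001, 306,
             29915, 29885, 2691, 2086, 29991, 13, 32000]

last_only = mask_prompt_tokens(dummy_ids, response_id=32001)
all_turns = mask_prompt_tokens(dummy_ids, response_id=32001, instruction_id=32010)
print([tok for tok in last_only if tok != IGNORE_INDEX])
print([tok for tok in all_turns if tok != IGNORE_INDEX])
```

With only the response template, just the final answer contributes to the loss. Adding the instruction template recovers the earlier answer as well.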

In [52]:
labels = completion_batch['labels']
tokenizer_phi.decode(labels[labels >= 0])

"How are you?\n I'm fine too!\n<|endoftext|>"

#### Label Shifting

```python
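# Excerpt of the loss computation inside a causal LM's forward pass
# (`lm_logits` holds the output logits; `CrossEntropyLoss` comes from `torch.nn`)
if labels is not None: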
    # move labels to correct device to enable model parallelism
    labels = labels.to(lm_logits.device)
    # we are doing next-token prediction; shift prediction scores and input ids by one
    shift_logits = lm_logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    lm_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), labels.view(-1))
```
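
To see the shift in isolation, the sketch below recomputes the same quantity for a single sequence in plain Python: the logits at position `t` are scored against the label at position `t + 1`, and labels equal to `-100` are skipped, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function name is hypothetical and the loop is for illustration only.

```python
import math


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence (lists, not tensors)."""
    shift_logits = logits[:-1]   # the last position has nothing left to predict
    shift_labels = labels[1:]    # the first token is never a prediction target
    losses = []
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue             # masked positions do not contribute
        log_normalizer = math.log(sum(math.exp(value) for value in row))
        losses.append(log_normalizer - row[target])
    # Assumes at least one unmasked target; otherwise this divides by zero
    return sum(losses) / len(losses)


vocab_size = 4
uniform_logits = [[0.0] * vocab_size for _ in range(3)]  # 3 positions
labels = [-100, 2, 1]
loss = shifted_cross_entropy(uniform_logits, labels)
print(loss)
```

With uniform logits, every target has probability 1/4, so the loss equals `log(4)` no matter which labels are used.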

### Packed Dataset

****
**Summary of "Packed Dataset"**
- Packing **concatenates** sequences and **splits** them into **equal-sized packs**:
  - **No padding tokens** are used.
  - Each pack's length must not exceed the **model’s maximum sequence length**.
- Packing is natively supported by the `SFTTrainer`:
  - Set its `packing` argument to `True`.
  - It creates an internal `ConstantLengthDataset` to handle the packing.
  - By default, you **cannot use packing and a collator simultaneously**.
- Some **collators can effectively pack** sequences:
  - In this case, the **`packing` argument must be set to `False`**, and the collator performs the packing.
  - `DataCollatorWithFlattening` is the packing equivalent of `DataCollatorForLanguageModeling`.
  - `DataCollatorForCompletionOnlyLM` includes a new argument (`padding_free`) that makes the completion-only collator function like packing.
  - Certain models (e.g. Llama, Phi, Mistral, Gemma, OLMo, and a few others) support these collators with Flash Attention 2:
    - These models use `position_ids` to **mark the boundaries** between the original sequences packed together.
****

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/packed_seq.png?raw=True)

<center>Figure 4.7 - Packed sequences</center>

In [53]:
sequences = dataset['text']
print(sequences[:2])

['<|user|>\nThe birch canoe slid on the smooth planks.<|end|>\n<|assistant|>\nOn the smooth planks, the birch canoe slid. Yes, hrrrm.<|end|>\n<|endoftext|>', '<|user|>\nGlue the sheet to the dark blue background.<|end|>\n<|assistant|>\nGlue the sheet to the dark blue background, you must.<|end|>\n<|endoftext|>']


In [54]:
iterator = ConstantLengthDataset(tokenizer_phi, dataset, 
                                 dataset_text_field='text', 
                                 seq_length=64, shuffle=False)

def data_generator(iterator):
    yield from iterator

packed_dataset = Dataset.from_generator(
    data_generator, 
    gen_kwargs={"iterator": iterator}
)
packed_dataset

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 351
})
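
The core idea behind packing can be sketched without any library: concatenate the tokenized sequences, separate them with an `EOS` ID, and cut the result into equal-sized packs, discarding the incomplete tail. This is a simplified illustration and not the `ConstantLengthDataset` implementation, which handles buffering and other details. `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(sequences, seq_length, eos_token_id):
    """Concatenate sequences (EOS-separated) and split into full packs only."""
    flat = []
    for seq in sequences:
        flat.extend(seq)
        flat.append(eos_token_id)   # marks the boundary between sequences
    last_start = len(flat) - seq_length
    return [flat[start:start + seq_length]
            for start in range(0, last_start + 1, seq_length)]


toy_sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_sequences(toy_sequences, seq_length=4, eos_token_id=0)
print(packs)
```

Every pack has exactly `seq_length` tokens, so no padding is needed, but a single pack may contain pieces of several original sequences.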

In [55]:
input_ids = packed_dataset['input_ids']
tokenizer_phi.decode(input_ids[0])

'<|user|> The birch canoe slid on the smooth planks.<|end|><|assistant|> On the smooth planks, the birch canoe slid. Yes, hrrrm.<|end|><|endoftext|><|endoftext|><|user|> Glue the sheet to the dark blue background.<|end|><|assistant|> Glue the sheet to the dark blue background, you must'

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/packing_flow.png?raw=True)

<center>Figure 4.8 - Choosing the right configuration for your data</center>

#### Collators for Packing

##### `DataCollatorWithFlattening`

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch4/collator_flat.png?raw=True)

<center>Figure 4.9 - Packing-like collator</center>

In [59]:
flat_collator = DataCollatorWithFlattening()
flat_dloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=flat_collator)
flat_batch = next(iter(flat_dloader))
flat_batch



{'input_ids': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
          10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
           1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
           3869, 29892,   298, 21478,  1758, 29889, 32007, 32000, 32010,  8467,
            434,   278,  9869,   304,   278,  6501,  7254,  3239, 29889, 32007,
          32001,  8467,   434,   278,  9869,   304,   278,  6501,  7254,  3239,
          29892,   366,  1818, 29889, 32007, 32000]]),
 'labels': tensor([[ -100,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
          10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
           1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
           3869, 29892,   298, 21478,  1758, 29889, 32007, 32000,  -100,  8467,
            434,   278,  9869,   304,   278,  6501,  7254,  3239, 29889, 32007,
          32001,  8467,   434,   278,  986

In [67]:
flat_batch['input_ids'].shape, flat_batch['position_ids'].max() + 1

(torch.Size([1, 66]), tensor(38))
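
The shapes above follow from how a flattening collator lays out a batch: sequences are concatenated into a single row, the first label of each sequence is replaced by `-100`, and `position_ids` restart at zero for every sequence. The sketch below reproduces this layout with dummy token IDs. It approximates the behavior and is not the collator's code, and `flatten_batch` is a hypothetical name.

```python
def flatten_batch(sequences, separator_id=-100):
    """Concatenate sequences into one row, tracking boundaries via position IDs."""
    input_ids, labels, position_ids = [], [], []
    for seq in sequences:
        input_ids += seq
        # The first token of each sequence must not be predicted from the
        # previous sequence's last token, so its label is masked
        labels += [separator_id] + seq[1:]
        position_ids += list(range(len(seq)))
    return {"input_ids": [input_ids],
            "labels": [labels],
            "position_ids": [position_ids]}


first = list(range(1000, 1038))    # 38 dummy token IDs
second = list(range(2000, 2028))   # 28 dummy token IDs
flat = flatten_batch([first, second])
print(len(flat["input_ids"][0]), max(flat["position_ids"][0]) + 1)
```

The total length is the sum of the two sequences, while the largest position ID plus one gives the length of the longest sequence in the pack.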

##### `DataCollatorForCompletionOnlyLM`

In [60]:
response_template = '<|assistant|>'
completion_nopad_collator = DataCollatorForCompletionOnlyLM(response_template=response_template, 
                                                            tokenizer=tokenizer_phi,
                                                            padding_free=True)
completion_nopad_dloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=completion_nopad_collator)
completion_nopad_batch = next(iter(completion_nopad_dloader))
completion_nopad_batch

{'input_ids': tensor([[32010,   450, 29773,   305,   508,  7297,  2243,   333,   373,   278,
         10597,   715,  1331, 29889, 32007, 32001,  1551,   278, 10597,   715,
          1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
          3869, 29892,   298, 21478,  1758, 29889, 32007, 32000, 32010,  8467,
           434,   278,  9869,   304,   278,  6501,  7254,  3239, 29889, 32007,
         32001,  8467,   434,   278,  9869,   304,   278,  6501,  7254,  3239,
         29892,   366,  1818, 29889, 32007, 32000]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  1551,   278, 10597,   715,
          1331, 29892,   278, 29773,   305,   508,  7297,  2243,   333, 29889,
          3869, 29892,   298, 21478,  1758, 29889, 32007, 32000,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  8467,   434,   278,  9869,   304,   

### Advanced: BYOT (Bring Your Own Template)

****
**Summary of "Advanced: BYOT"**
- Every template must define a **response template** and, ideally, **end with an `EOS` token**.
- Double-check your tokenizer's `EOS`, `PAD`, and `UNK` tokens:
  - The `EOS` token must be distinct from both `PAD` and `UNK` tokens.
  - The `PAD` and `UNK` tokens can be the same.
- **Only resize** the embedding layer if absolutely necessary—i.e., if all "empty slots" have already been used:
  - When calling the model's `resize_token_embeddings()`, use the `pad_to_multiple_of` argument to ensure the size remains a **multiple of a power of two**.
- If you don’t want to create a Jinja template yourself, you can use a default template like ChatML.
  - The `trl` package provides the `setup_chat_format()` function, but it has some drawbacks:
    - It assigns the `EOS` token to the `PAD` token (you'll need to **fix it manually** afterward).
    - It resizes the model's embedding layer by default, even if only to make it shorter (though **you can avoid resizing** by selecting the appropriate `resize_to_multiple_of`).
- You can define and apply a **custom template using a formatting function** instead of creating a Jinja template for your tokenizer:
  - If you specify the `formatting_func` in the `SFTTrainer` class (see Chapter 5), your tokenizer doesn't need to have a chat template.
  - Choose your response template carefully:
    - Using **regular words** (e.g., "## Answer:") **may cause issues**, as some tokenizers are "context-dependent" and might encode your response template differently depending on the text that surrounds it.
    - Creating an **additional special token for your response template** is safer, as it will be encoded as a **single token**.
****

#### Chat Template

In [270]:
model_opt = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer_opt = AutoTokenizer.from_pretrained("facebook/opt-350m")

print(tokenizer_opt.chat_template)

None


In [128]:
tokenizer_opt.special_tokens_map

{'bos_token': '</s>',
 'eos_token': '</s>',
 'unk_token': '</s>',
 'pad_token': '<pad>'}

**ChatML**
****
[ChatML](https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md), short for Chat Markup Language, was developed by OpenAI: 

_____
"_Traditionally, GPT models consumed unstructured text. ChatGPT models instead expect a structured format, called Chat Markup Language (ChatML for short). ChatML documents consist of a sequence of messages._"
_____

Each message should contain the role of the participant and their corresponding content, like the conversational format introduced earlier. This is ChatML's Jinja template:

```
{% for message in messages %}
  {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
{% endfor %}
```
****
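
To make the template's effect concrete, here is a minimal pure-Python sketch of the string it builds. `render_chatml` is a hypothetical helper written for illustration: it mirrors the expression inside the template and uses neither Jinja nor a tokenizer. The optional `add_generation_prompt` branch is an assumption that goes beyond the template shown above. It opens the assistant's turn so the model is cued to respond.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Builds a ChatML-formatted string (hypothetical helper, illustration only)."""
    text = ''
    for message in messages:
        text += ('<|im_start|>' + message['role'] + '\n'
                 + message['content'] + '<|im_end|>' + '\n')
    # Optionally opens the assistant's turn to trigger a response
    if add_generation_prompt:
        text += '<|im_start|>assistant\n'
    return text

messages = [{'role': 'user', 'content': 'What is the capital of Argentina?'},
            {'role': 'assistant', 'content': 'Buenos Aires.'}]
chatml_text = render_chatml(messages)
print(chatml_text)
```

Each message becomes one block that starts with `<|im_start|>` followed by the role, and ends with `<|im_end|>` and a line break.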

In [129]:
len(tokenizer_opt)

50265

In [131]:
model_opt.config.vocab_size

50272

In [132]:
def get_multiple_of(vocab_size):
    return 2**(bin(vocab_size)[::-1].find('1'))

pad_to_multiple_of = get_multiple_of(model_opt.config.vocab_size)
pad_to_multiple_of

32

In [133]:
model_opt.resize_token_embeddings(len(tokenizer_opt), 
                                  pad_to_multiple_of=pad_to_multiple_of)

Embedding(50272, 512, padding_idx=1)
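
The embedding layer kept its 50,272 rows because the requested size (50,265) is rounded up to the next multiple of 32. The sketch below reproduces that arithmetic in plain Python. `round_up` is a hypothetical helper, written under the assumption that the rounding behaves as the output above suggests, and `largest_power_of_two_divisor` is an equivalent bit trick for `get_multiple_of()`.

```python
def largest_power_of_two_divisor(n):
    # n & -n isolates the lowest set bit, that is,
    # the largest power of two that divides n
    return n & -n

def round_up(size, multiple):
    # Smallest multiple of `multiple` that is greater than or equal to `size`
    return ((size + multiple - 1) // multiple) * multiple

multiple = largest_power_of_two_divisor(50272)
padded_sizes = [round_up(n, multiple) for n in (50265, 50266, 50272, 50273)]
print(multiple, padded_sizes)
```

Any tokenizer length from 50,265 up to 50,272 maps to the same 50,272 rows, so there are seven "empty slots" available before the layer needs to grow to 50,304.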

In [217]:
def modify_tokenizer(tokenizer, 
                     alternative_bos_token='<|im_start|>', 
                     alternative_unk_token='<unk>', 
                     special_tokens=None, 
                     tokens=None):
    eos_token, bos_token = tokenizer.eos_token, tokenizer.bos_token
    pad_token, unk_token = tokenizer.pad_token, tokenizer.unk_token

    # BOS token must be different than EOS token
    if bos_token == eos_token:
        bos_token = alternative_bos_token

    # UNK token must be different than EOS token
    if unk_token == eos_token:
        unk_token = alternative_unk_token

    # PAD token must be different than EOS token
    # but can be the same as UNK token
    if pad_token == eos_token:
        pad_token = unk_token
        
    assert bos_token != eos_token, "Please choose a different BOS token."
    assert unk_token != eos_token, "Please choose a different UNK token."

    # Creates dict for BOS, PAD, and UNK tokens
    # Keeps the EOS token as it was originally defined
    special_tokens_dict = {'bos_token': bos_token, 
                           'pad_token': pad_token, 
                           'unk_token': unk_token}
    
    # If there are additional special tokens, add them
    if special_tokens is not None:
        if isinstance(special_tokens, list):
            special_tokens_dict.update({'additional_special_tokens': special_tokens})
        
    tokenizer.add_special_tokens(special_tokens_dict)
    
    # If there are new regular (not special) tokens to add
    if tokens is not None:
        if isinstance(tokens, list):
            tokenizer.add_tokens(tokens)
        
    return tokenizer
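
The resolution rules inside `modify_tokenizer()` can be checked without loading a tokenizer. The sketch below applies the same rules to a plain dictionary. `resolve_special_tokens` is a hypothetical helper used only for illustration, and the input mirrors OPT's original special tokens map.

```python
def resolve_special_tokens(tokens_map,
                           alternative_bos_token='<|im_start|>',
                           alternative_unk_token='<unk>'):
    """Applies the BOS/UNK/PAD rules to a plain dict (hypothetical helper)."""
    resolved = dict(tokens_map)
    eos_token = resolved['eos_token']
    # BOS and UNK tokens must be different from the EOS token
    if resolved['bos_token'] == eos_token:
        resolved['bos_token'] = alternative_bos_token
    if resolved['unk_token'] == eos_token:
        resolved['unk_token'] = alternative_unk_token
    # PAD token must be different from the EOS token,
    # but it can be the same as the UNK token
    if resolved['pad_token'] == eos_token:
        resolved['pad_token'] = resolved['unk_token']
    return resolved

opt_tokens = {'bos_token': '</s>', 'eos_token': '</s>',
              'unk_token': '</s>', 'pad_token': '<pad>'}
resolved_tokens = resolve_special_tokens(opt_tokens)
print(resolved_tokens)
```

Only the BOS and UNK tokens change for OPT, since its PAD token was already distinct from the EOS token.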

In [216]:
def jinja_template(tokenizer):
    return ("{% for message in messages %}"
            f"{{{{'{tokenizer.bos_token}' + message['role'] + '\n' + message['content'] + '{tokenizer.eos_token}' + '\n'}}}}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            f"{{{{ '{tokenizer.bos_token}assistant\n' }}}}"
            "{% endif %}")

def add_template(tokenizer, chat_template=None):
    # If no chat template was given, creates a ChatML template
    # using the BOS and EOS tokens
    if chat_template is None:
        chat_template = jinja_template(tokenizer)
        
    # Assigns chat template to tokenizer
    tokenizer.chat_template = chat_template
    
    return tokenizer

In [219]:
def get_multiple_of(vocab_size):
    return 2**(bin(vocab_size)[::-1].find('1'))

def modify_model(model, tokenizer):    
    # If new tokenizer length exceeds vocabulary size
    # resizes it while keeping it a multiple of the same value
    if len(tokenizer) > model.config.vocab_size:
        pad_to_multiple_of = get_multiple_of(model.config.vocab_size)
        model.resize_token_embeddings(len(tokenizer), 
                                      pad_to_multiple_of=pad_to_multiple_of)    

    # Updates token ids on model configurations
    if getattr(model, "config", None) is not None:
        model.config.pad_token_id = tokenizer.pad_token_id
        model.config.bos_token_id = tokenizer.bos_token_id
        model.config.eos_token_id = tokenizer.eos_token_id
    if getattr(model, "generation_config", None) is not None:
        model.generation_config.bos_token_id = tokenizer.bos_token_id
        model.generation_config.eos_token_id = tokenizer.eos_token_id
        model.generation_config.pad_token_id = tokenizer.pad_token_id
    
    return model

In [271]:
tokenizer_opt = modify_tokenizer(tokenizer_opt)
tokenizer_opt = add_template(tokenizer_opt)
model_opt = modify_model(model_opt, tokenizer_opt)

In [273]:
tokenizer_opt.special_tokens_map

{'bos_token': '<|im_start|>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'pad_token': '<pad>'}

In [281]:
len(tokenizer_opt)

50266

In [282]:
tokenizer_opt.convert_ids_to_tokens(50265)

'<|im_start|>'

In [283]:
model_opt.get_input_embeddings()

Embedding(50272, 512, padding_idx=1)

In [221]:
print(tokenizer_opt.chat_template)

{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '</s>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


In [274]:
messages = ds_msg['messages'][0]
print(tokenizer_opt.apply_chat_template(messages, tokenize=False))

<|im_start|>user
What is the capital of Argentina?</s>
<|im_start|>assistant
Buenos Aires.</s>



#### Custom Template

In [284]:
model_opt = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer_opt = AutoTokenizer.from_pretrained("facebook/opt-350m")

response_template = '##[YODA]##>'
tokenizer_opt = modify_tokenizer(tokenizer_opt, special_tokens=[response_template])
model_opt = modify_model(model_opt, tokenizer_opt)

In [285]:
def formatting_func_builder(response_template):
    def formatting_func(examples, add_generation_prompt=False):
        output_texts = []
        for i in range(len(examples['prompt'])):
            text = f"{examples['prompt'][i]}"
            try:
                text += f" {response_template} {examples['completion'][i]}{tokenizer_opt.eos_token}"
            except KeyError:
                if add_generation_prompt:
                    text += f" {response_template} "
            output_texts.append(text)
        return output_texts
    return formatting_func

yoda_formatting_func = formatting_func_builder(response_template)
yoda_formatting_func

<function __main__.formatting_func_builder.<locals>.formatting_func(examples, add_generation_prompt=False)>

In [289]:
formatted_seqs = yoda_formatting_func(dataset)
formatted_seqs[0]

'The birch canoe slid on the smooth planks. ##[YODA]##> On the smooth planks, the birch canoe slid. Yes, hrrrm.</s>'

In [291]:
tokenizer_opt(formatted_seqs[0])

{'input_ids': [2, 133, 23629, 611, 31728, 13763, 15, 5, 6921, 563, 2258, 4, 1437, 50266, 374, 5, 6921, 563, 2258, 6, 5, 23629, 611, 31728, 13763, 4, 3216, 6, 1368, 28015, 22900, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [293]:
tokenizer_opt.convert_ids_to_tokens(50266)

'##[YODA]##>'

In [294]:
yoda_formatting_func({'prompt': ['The Force is strong in you.', 
                                 'I am your father!']}, 
                     add_generation_prompt=True)

['The Force is strong in you. ##[YODA]##> ', 'I am your father! ##[YODA]##> ']

#### Special Tokens FTW

In [184]:
tokenizer_llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_llama.pad_token = tokenizer_llama.unk_token
tokenizer_llama.pad_token_id = tokenizer_llama.unk_token_id

In [185]:
prompt = """### User: Hello\n\n### Assistant: Hi, how can I help you?"""
print(prompt)

### User: Hello

### Assistant: Hi, how can I help you?


In [186]:
tokens = tokenizer_llama.tokenize(prompt, add_special_tokens=False)
token_ids = tokenizer_llama.encode(prompt, add_special_tokens=False)
list(zip(tokens, token_ids))[6:11]

[('##', 2277), ('#', 29937), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]

In [187]:
response_template = "### Assistant:"
tokens = tokenizer_llama.tokenize(response_template, add_special_tokens=False)
token_ids = tokenizer_llama.encode(response_template, add_special_tokens=False)
list(zip(tokens, token_ids))

[('▁###', 835), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]

In [188]:
dummy_ds = Dataset.from_dict({'text': [prompt]})
dummy_tokenized = dummy_ds.map(lambda row: tokenizer_llama(row['text'])).select_columns(['input_ids'])

response_template = "### Assistant:"

bad_collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer_llama)
bad_dloader = DataLoader(dummy_tokenized, batch_size=1, collate_fn=bad_collator)
bad_batch = next(iter(bad_dloader))
bad_batch

{'input_ids': tensor([[    1,   835,  4911, 29901, 15043,    13,    13,  2277, 29937,  4007,
         22137, 29901,  6324, 29892,   920,   508,   306,  1371,   366, 29973]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100]])}

In [189]:
modified_response_template = "\n### Assistant:"
tokens = tokenizer_llama.tokenize(modified_response_template, add_special_tokens=False)
token_ids = tokenizer_llama.encode(modified_response_template, add_special_tokens=False)
list(zip(tokens, token_ids))

[('▁', 29871),
 ('<0x0A>', 13),
 ('##', 2277),
 ('#', 29937),
 ('▁Ass', 4007),
 ('istant', 22137),
 (':', 29901)]

In [190]:
fixed_token_ids = token_ids[2:]
fixed_collator = DataCollatorForCompletionOnlyLM(fixed_token_ids, tokenizer=tokenizer_llama)
fixed_dloader = DataLoader(dummy_tokenized, batch_size=1, collate_fn=fixed_collator)
fixed_batch = next(iter(fixed_dloader))
fixed_batch

{'input_ids': tensor([[    1,   835,  4911, 29901, 15043,    13,    13,  2277, 29937,  4007,
         22137, 29901,  6324, 29892,   920,   508,   306,  1371,   366, 29973]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  6324, 29892,   920,   508,   306,  1371,   366, 29973]])}
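
The all-`-100` labels from the first collator, and the fix obtained by slicing the token IDs, both come down to a subsequence search. The sketch below is a simplified stand-in for what a completion-only collator does (`mask_prompt` is a hypothetical helper, not TRL's implementation, and it stops at the first match): it looks for the response template's token IDs inside `input_ids` and masks everything up to the end of the match. If the template is never found, every label is masked.

```python
IGNORE_INDEX = -100

def mask_prompt(input_ids, template_ids):
    """Masks labels up to the end of the response template (hypothetical helper)."""
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: nothing is left to learn from
    return [IGNORE_INDEX] * len(input_ids)

# Token IDs copied from the outputs above
input_ids = [1, 835, 4911, 29901, 15043, 13, 13, 2277, 29937, 4007,
             22137, 29901, 6324, 29892, 920, 508, 306, 1371, 366, 29973]
standalone_ids = [835, 4007, 22137, 29901]           # "### Assistant:" on its own
in_context_ids = [2277, 29937, 4007, 22137, 29901]   # same text after a line break

bad_labels = mask_prompt(input_ids, standalone_ids)
fixed_labels = mask_prompt(input_ids, in_context_ids)
print(bad_labels.count(IGNORE_INDEX), fixed_labels[12:])
```

The standalone IDs never appear in the sequence, so all 20 labels are masked, whereas the in-context IDs match at position 7 and leave the eight completion tokens as labels.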

In [191]:
response_template = "### Assistant:"
tokenizer_llama.add_special_tokens({'additional_special_tokens': [response_template]})

1

In [192]:
dummy_tokenized = dummy_ds.map(lambda row: tokenizer_llama(row['text'])).select_columns(['input_ids'])

In [193]:
special_collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer_llama)
special_dloader = DataLoader(dummy_tokenized, batch_size=1, collate_fn=special_collator)
special_batch = next(iter(special_dloader))
special_batch

{'input_ids': tensor([[    1,   835,  4911, 29901, 15043,    13,    13, 32000, 29871,  6324,
         29892,   920,   508,   306,  1371,   366, 29973]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100, 29871,  6324,
         29892,   920,   508,   306,  1371,   366, 29973]])}

### Coming Up in "Fine-Tuning LLMs"

Chat templates are key to reining in the untamed LLM monsters and teaching them how to have proper conversations with us humans. Cleverly placing cues, or special tokens, along the conversation enables them to learn how to respond when triggered by the right commanding keyword. The training procedure, though, is not without its perils: activations, gradients, and the optimizer all demand huge portions of precious RAM in order to do their jobs. Appeasing these memory-hungry components will take both skill and effort. Configuring the training loop isn’t for the faint of heart. Don’t miss the next challenging chapter of "Fine-Tuning LLMs."