
In [None]:
# Adapted from https://www.danliden.com/fine-tuning/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.html

In [None]:
!pip install accelerate datasets peft trl bitsandbytes

Collecting accelerate
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting huggingface-hub>=0.21.0 (from accelerate)
  Downloading huggingface_hub-0.29.1-py3-none-any.whl.metadata (13 kB)
Collecting safetensors>=0.4.3 (from accelerate)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-19.0.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets)
  Downloading pandas-2.

In [None]:
# Some Environment Setup
OUTPUT_DIR = "./results/pythia/" # the path to the output directory; where model checkpoints will be saved
LOG_DIR = "./logs/pythia/" # the path to the log directory; where logs will be saved
CACHE_DIR = "./cache/pythia/" # the path to the cache directory; where cache files will be saved

## Loading Base Model

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_ckpt = "EleutherAI/pythia-410m"

tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
)
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})

model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map='cuda:0'
)

### Prompting Base Model

In [None]:
# Inference
def generate(prompt, max_new_tokens=100):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    gen_tokens = model.generate(input_ids,
                                max_new_tokens=max_new_tokens,
                                eos_token_id=tokenizer.eos_token_id,
                                repetition_penalty=1.1)
    return tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]

In [None]:
print(generate("Here are step-by-step instructions to make a great cup of coffee with a Chemex coffee maker:\n1."))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Here are step-by-step instructions to make a great cup of coffee with a Chemex coffee maker:
1. Add the water and coffee grounds to the coffee pot.
2. Fill the coffee pot with hot water, then add the ground coffee beans.
3. Pour in the milk and stir until it is fully mixed.
4. Add the sugar and stir again.
5. Add the espresso powder and stir again.
6. Add the cream and stir again.
7. Add the vanilla extract and stir again.
8. Add the coffee liqueur and stir again


In [None]:
print(generate("Here are step-by-step instructions to make a Margarita drink:\n1."))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Here are step-by-step instructions to make a Margarita drink:
1. Fill a glass with ice and add 1/2 cup of water.
2. Add the juice from one lime, one grapefruit, one orange, one kiwi, one pineapple, one mango, one banana, one apple, one cantaloupe, one papaya, one pineapple, one watermelon, one pineapple, one grapefruit, one orange, one kiwi, one pineapple, one mango, one banana, one apple, one cantaloupe,


What happens if, instead, we ask a question or give an instruction? Because the model has not been instruction-tuned, it treats the prompt as text to continue rather than as a request to answer.

In [None]:
# Question
print(generate("How do I make coffee with a Chemex coffee maker?"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


How do I make coffee with a Chemex coffee maker?

I have been using the Chemex for about 2 years now and it has become my go to machine. It is very easy to use, you just need to fill up your cup with water and then add in the coffee grounds. The only thing that I would like to say is that if you are not familiar with the Chemex, you should definitely check out their website. They have a great selection of machines and they even have a free trial version.

What is the best way


In [None]:
# Instruction
print(generate("Tell me how to make coffee with a Chemex coffee maker."))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Tell me how to make coffee with a Chemex coffee maker.

I have been using the Chemex for about 2 years now and I love it! It is very easy to use, you just need to fill up your cup with water and then add in some ground coffee. The only thing that I would change is that I would like to be able to add more ground coffee than what is listed on the box. I am not sure if this is possible but I will try my best to find out.

I have been using the Chemex for


In [None]:
print(generate("Tell me how to make a Margarita drink."))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Tell me how to make a Margarita drink.

I'm not sure what I was thinking, but it didn't work out so well for me. I had no idea that the only way to get a margarita is to have a glass of water and then pour in some ice. It's like drinking a soda without any sugar.

So I decided to try something different. I took a sip of my water and poured in some ice cubes. Then I added a little bit more water and let it sit for a few seconds


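None of these prompts produced a usable answer. The base model simply continues the text, much like a forum post or blog entry, because it has not been instruction-tuned.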

## Dataset

We'll be using the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, in its deduplicated variant `Open-Orca/SlimOrca-Dedup`.

In [None]:
from datasets import load_dataset
from pathlib import Path

slimorca = load_dataset('Open-Orca/SlimOrca-Dedup',
                           cache_dir=str(Path(CACHE_DIR) / "data"), split='train')
slimorca = slimorca.shuffle(seed=42).select(range(11000))

README.md:   0%|          | 0.00/2.94k [00:00<?, ?B/s]

dataset_infos.json:   0%|          | 0.00/653 [00:00<?, ?B/s]

(…)-00000-of-00002-6d275f30fa8e143f.parquet:   0%|          | 0.00/163M [00:00<?, ?B/s]

(…)-00001-of-00002-20da825e60baa022.parquet:   0%|          | 0.00/145M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/363491 [00:00<?, ? examples/s]

Here's one record from the dataset:

In [None]:
import json
print(json.dumps(slimorca[0], indent=4))

{
    "conversations": [
        {
            "from": "system",
            "value": "You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer."
        },
        {
            "from": "human",
            "value": "Give the step-by-step reasoning process and then the final answer. A pen and pencil have a total cost of $6. If the pen costs twice as much as the pencil, what is the cost of the pen?"
        },
        {
            "from": "gpt",
            "value": "Step 1: Let's assign variables to the unknowns. Let's say the cost of the pencil is x dollars and the cost of the pen is y dollars.\n\nStep 2: We know the total cost of the pen and pencil is $6. So, we can write the equation:\nx + y = 6\n\nStep 3: We also know that the pen costs twice as much as the pencil. So, we can write the equation:\ny = 2x\n\nStep 4: Now, we have a system 

### Format the Data

We format the data using the ChatML template:
- It uses the `<|im_start|>` and `<|im_end|>` special tokens.
- The template does not, by default, add an `<|endoftext|>` token at the end of the chat, so we need to add it manually. Without `<|endoftext|>` in the training data, the model keeps generating at inference time until it hits the token limit instead of stopping naturally after addressing the instruction.

#### Examine the chat template

In [None]:
print(tokenizer.chat_template)

None


There is no chat template defined for this tokenizer, so we'll set one ourselves, using the [ChatML](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md) format. In order to use the template, we first need to adjust the SlimOrca records to match the following format, with `role` and `content` keys instead of `from` and `value`, and `system`/`assistant`/`user` roles instead of `system`/`gpt`/`human`. The chat is still structured as a list of dictionaries. Here's an example of a chat in the expected format:

In [None]:
chat = [
    {"role": "system", "content": "You are a helpful assistant and an expert at making coffee."},
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
    {"role": "assistant", "content": "To make coffee with a Chemex:\n1. Boil water to about 200°F (93°C).\n2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.\n3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.\n4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.\n5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.\n6. Once brewing is complete, remove the filter and enjoy."}
]
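
The key and role renaming described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `to_role_content` and the toy record are assumptions made up for this example.

```python
# Mapping from SlimOrca speaker names to chat-template roles.
ROLE_MAPPING = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversations):
    """Convert SlimOrca-style from/value messages to role/content messages.

    Hypothetical helper, shown only to illustrate the renaming.
    """
    return [
        {"role": ROLE_MAPPING[message["from"]], "content": message["value"]}
        for message in conversations
    ]


# Toy record (invented for illustration, not taken from the dataset).
record = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 = 4."},
]

converted = to_role_content(record)
print(converted)
```

The list structure and message order are preserved; only the keys and the role names change.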

Now we can apply the chat template and obtain a string-formatted chat that we can tokenize and train on. Note the lack of a token indicating the end of the string! We will need to add the `tokenizer.eos_token` to the end of the string manually. The template does not add a `bos_token` either, so we will proceed without one.

In [None]:
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

In [None]:
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False))

<|im_start|>system
You are a helpful assistant and an expert at making coffee.<|im_end|>
<|im_start|>user
How do I make coffee with a Chemex coffee maker?<|im_end|>
<|im_start|>assistant
To make coffee with a Chemex:
1. Boil water to about 200°F (93°C).
2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.
3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.
4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.
5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.
6. Once brewing is complete, remove the filter and enjoy.<|im_end|>
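
To make the Jinja template easier to read, here is a plain-Python sketch of the same logic. The function name `render_chatml` and the toy messages are assumptions for illustration; the notebook itself relies on `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Plain-Python sketch of the ChatML Jinja template used in this notebook."""
    text = ""
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> plus a newline.
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # At inference time, open an assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


# Toy messages (invented for illustration).
demo = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]

# Training text: the full chat plus the manually appended end-of-text token.
training_text = render_chatml(demo) + "<|endoftext|>"

# Inference prompt: the chat so far plus an open assistant turn.
prompt_text = render_chatml(demo, add_generation_prompt=True)
print(prompt_text)
```

The two call styles mirror how the template is used in this notebook: `add_generation_prompt=False` plus an appended EOS token for training data, and `add_generation_prompt=True` when prompting the fine-tuned model.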



#### Apply the template to the whole dataset

Now we need to apply the template to the whole SlimOrca subset. We will first convert the SlimOrca entries into the expected format, and then use `tokenizer.apply_chat_template` to apply the template.

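We will also add the `<|im_start|>` and `<|im_end|>` special tokens to the tokenizer. The `<|endoftext|>` and `<|padding|>` tokens are already in the tokenizer's vocabulary, so we do not need to add them manually. Note that the model-loading cell registered a separate `<|pad|>` token as the pad token rather than reusing `<|padding|>`.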

We add the `tokenizer.eos_token` to the end of the string here. Without doing so, the model does not learn when to stop generating.

In [None]:
import torch

# Add the instruction tokens to the tokenizer
special_tokens = ["<|im_start|>", "<|im_end|>"]
# Adding special tokens to the tokenizer
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Do not need to resize the model's input token embeddings matrix
# it is already larger than the vocabulary/large enough to accommodate
# the added tokens
# model.resize_token_embeddings(len(tokenizer))

system_msg = "You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer."

def format_slimorca(ex):
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    chat = [
        {"role": role_mapping[message["from"]], "content": message["value"] if message["from"] != "system" else system_msg}
        for message in ex["conversations"]
    ]
    formatted_chat = tokenizer.apply_chat_template(
        chat,
        tokenize=False,              # Apply formatting but do not tokenize
        add_generation_prompt=False,
    ) + tokenizer.eos_token          # add the end of sequence token

    # Tokenize using the standard tokenizer method
    tokenized_output = tokenizer(
        formatted_chat,
        add_special_tokens=False,  # apply_chat_template already added special tokens
        #padding="max_length",     # pad to the specified length
        max_length=512,            # max length at which to truncate or to which to pad
        truncation=True,           # truncate to the specified length
    )

    return tokenized_output


# Map to the dataset
slimorca_tokenized = slimorca.map(format_slimorca, num_proc=16).remove_columns(
    "conversations"
)

Map (num_proc=16):   0%|          | 0/11000 [00:00<?, ? examples/s]

In [None]:
slimorca_tokenized

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 11000
})

Now let's inspect a single example and make sure it corresponds to the format we expect.

In [None]:
# Inspect one example
print(tokenizer.decode(slimorca_tokenized[11]['input_ids']))

<|im_start|>system
You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer.<|im_end|>
<|im_start|>user
How do you say "For more info please contact us." in Spanish?<|im_end|>
<|im_start|>assistant
To translate the phrase "For more info please contact us." into Spanish, I will follow these steps:

1. Identify the main ideas in the English phrase: "more info", "please contact", and "us".
2. Translate each main idea into Spanish: "más información" (more info), "por favor, póngase en contacto" (please contact), and "con nosotros" (with us).
3. Combine the translations to form the Spanish phrase: "Para más información, por favor, póngase en contacto con nosotros."

So, the translated phrase in Spanish is: "Para más información, por favor, póngase en contacto con nosotros."<|im_end|>
<|endoftext|>


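Note that there are no padding tokens at the end: the `padding` argument is commented out in `format_slimorca`, so examples shorter than 512 tokens keep their original length. The sequence ends with `<|endoftext|>`, the EOS token we appended manually.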

#### Split the dataset into training and validation

Here we also limit to a training subset of 10,000 examples. This is based on the [LIMIT](https://www.databricks.com/blog/limit-less-more-instruction-tuning) paper, which found that a small number of high-quality examples is sufficient for instruction-tuning. Under ideal circumstances, we would choose more *domain-specific* examples with a variety of different formats. Given that we are not tailoring this fine-tuning job for a specific domain, we will just choose 10,000 random examples from the SlimOrca dataset. We could almost certainly get by with fewer examples, especially if those examples were selected for quality and tailored to the specific tasks we want the model to succeed at.

In [None]:
from datasets import DatasetDict
from transformers import set_seed

set_seed(123)

slimorca_tokenized_split = slimorca_tokenized.train_test_split(
    train_size=10000, test_size=1000
)

Now we will configure a *collator*. The collator is responsible for taking inputs, generating labels, and assembling the inputs into batches.

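The inputs were truncated to at most 512 tokens but not padded (the `padding` argument is commented out in `format_slimorca`), so the collator pads each batch to the length of its longest example. The `DataCollatorForLanguageModeling` collator also adds labels to each entry. Importantly, the labels are a copy of the inputs, with padding positions set to `-100` so that the loss ignores them. The model shifts the labels internally when computing the loss; we do not need to implement any custom logic to align the labels.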

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)
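
The label handling can be illustrated with plain Python lists. This is a toy sketch of the idea, not the collator's or the model's actual code: the helper names, token ids, and pad id below are all made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def make_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def shifted_pairs(input_ids, labels):
    """Pair the token at position t with the label at position t + 1."""
    return [
        (inp, tgt)
        for inp, tgt in zip(input_ids[:-1], labels[1:])
        if tgt != IGNORE_INDEX
    ]


PAD = 0  # toy pad id, assumed for this example only
input_ids = [11, 12, 13, 14, PAD, PAD]

labels = make_labels(input_ids, PAD)
pairs = shifted_pairs(input_ids, labels)

print(labels)
print(pairs)
```

Each `(input, target)` pair is one next-token prediction that contributes to the loss; the padded tail contributes nothing.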

## Fine-tune the model

Now that the data are ready, we can train the model using TRL's `SFTTrainer`, a subclass of the Hugging Face `Trainer`.

### Hyperparameters and Training Arguments
At a high level: this is a fairly naive fine-tuning job. We aren't trying to excel at a specific benchmark or task. Our main goal is simply equipping the model with the ability to respond to instructions and questions in an appropriate format. We attain this goal fairly easily with a variety of different hyperparameter configurations. That said, especially when training on a smaller subset of the data for multiple epochs, the results were fairly sensitive to learning rate. The default of `0.00005` was too high and resulted in overfitting. Note that the configuration below sets `learning_rate=1e-3`, which is higher than that default; here it applies only to the LoRA adapter weights, since the base model's weights are frozen.
- We set `auto_find_batch_size` to `True`. The trainer will try multiple batch sizes, starting from the specified `per_device_train_batch_size`, and reduce the batch size if it encounters an OOM error.
- We use gradient accumulation to simulate a larger batch size. Gradients are accumulated over multiple mini-batches of data (because we cannot use a very large batch size). The weights are only updated after the specified number of gradient accumulation steps.
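
As a rough sketch of why gradient accumulation simulates a larger batch, the toy example below compares the gradient from one batch of four examples with the accumulated gradient from two micro-batches of two. The one-parameter model, the data, and the function name `grad_mse` are assumptions made up for illustration; the `Trainer` does the equivalent bookkeeping internally.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy (x, y) pairs, invented for illustration.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 0.5

# One large batch of 4 examples.
full_batch_grad = grad_mse(w, data)

# Two micro-batches of 2 examples each, accumulated before a single update.
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro in micro_batches:
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad_mse(w, micro) / len(micro_batches)

# With the values configured in this notebook (before any automatic reduction):
effective_batch_size = 32 * 8  # per_device_train_batch_size * gradient_accumulation_steps

print(full_batch_grad, accumulated, effective_batch_size)
```

With equally sized micro-batches the two gradients match, so the optimizer step is the same as for the larger batch. If `auto_find_batch_size` lowers the per-device batch size, the effective batch size shrinks accordingly.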

In [None]:
from transformers import TrainingArguments, Trainer

# Define the training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=4,
    auto_find_batch_size=True,
    warmup_steps=2,
    weight_decay=0.01,
    logging_dir=LOG_DIR,
    logging_steps=5,  # Log every 5 steps
    eval_strategy="steps",  # called `evaluation_strategy` in older transformers releases
    eval_steps=20,
    lr_scheduler_type="linear",
    gradient_checkpointing=False,
    save_steps=100,
    learning_rate=1e-3,
    optim="paged_adamw_8bit",
    report_to='none'
)



In [None]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((1024,), eps=1e-05, 

### LoRA

In [None]:
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    bias="none",
    lora_dropout=0.05,  # Conventional
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=slimorca_tokenized_split["train"],
    eval_dataset=slimorca_tokenized_split["test"],
    args=training_args,
    peft_config=lora_config,
    data_collator=data_collator,
)



Converting train dataset to ChatML:   0%|          | 0/10000 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/10000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/10000 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/1000 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (query_key_value): lora.Linear(
            (base_layer): Linear(in_features=1024, out_features=3072, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=1024, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=3072, bias=False

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 6291456 || all params: 411625472 || trainable%: 1.5284418550268921
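
The trainable-parameter count printed above can be reproduced by hand from the layer shapes in the model printout and the `LoraConfig`. The sketch below assumes each targeted `Linear` layer gets one `lora_A` matrix (`in_features` x `r`) and one `lora_B` matrix (`r` x `out_features`), both without bias, as shown in the printed model; the constant and function names are made up for illustration.

```python
R = 16           # LoRA rank from LoraConfig
NUM_LAYERS = 24  # number of GPTNeoXLayer blocks in the printed model

# (in_features, out_features) of each targeted Linear layer, per block.
TARGET_SHAPES = {
    "query_key_value": (1024, 3072),
    "dense": (1024, 1024),
    "dense_h_to_4h": (1024, 4096),
    "dense_4h_to_h": (4096, 1024),
}


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: lora_A plus lora_B, no bias."""
    return r * in_features + r * out_features


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

# 411625472 is the "all params" figure reported by print_trainable_parameters.
trainable_pct = 100 * total / 411625472

print(per_layer, total, trainable_pct)
```

The total matches the reported `trainable params` value, which suggests that only the adapter matrices are being trained while the base weights stay frozen.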


In [None]:
trainer.train()

Step,Training Loss,Validation Loss
20,2.0004,1.888377
40,1.8295,1.78721
60,1.7411,1.727435
80,1.7189,1.692388
100,1.7696,1.66816
120,1.71,1.653097
140,1.6669,1.649694
160,1.6266,1.645966
180,1.5967,1.632689
200,1.6448,1.625056


TrainOutput(global_step=780, training_loss=1.4832097065754426, metrics={'train_runtime': 2180.8463, 'train_samples_per_second': 22.927, 'train_steps_per_second': 0.358, 'total_flos': 5.39774043659305e+16, 'train_loss': 1.4832097065754426})

In [None]:
OUTPUT_DIR

'./results/pythia/'

In [None]:
model_ckpt = OUTPUT_DIR + "/stop"

trainer.save_model(model_ckpt)

## Reload the Fine-Tuned Model

In [None]:
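!gdown 1DbfK8yus5T4zYQxcAxFivwmn2pNdrnjF
# Note: the archive already contains the results/pythia/stop/ prefix, so extracting
# into ./results/pythia/stop nests the files one level deeper (see the output below).
# Extracting with `-d .` instead would place them directly in ./results/pythia/stop.
!unzip pythia_instruct.zip -d ./results/pythia/stop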

Downloading...
From: https://drive.google.com/uc?id=1DbfK8yus5T4zYQxcAxFivwmn2pNdrnjF
To: /workspace/pythia_instruct.zip
100%|██████████████████████████████████████| 24.0M/24.0M [00:00<00:00, 82.1MB/s]
Archive:  pythia_instruct.zip
  inflating: ./results/pythia/stop/results/pythia/stop/README.md  
  inflating: ./results/pythia/stop/results/pythia/stop/adapter_config.json  
  inflating: ./results/pythia/stop/results/pythia/stop/adapter_model.safetensors  
  inflating: ./results/pythia/stop/results/pythia/stop/special_tokens_map.json  
  inflating: ./results/pythia/stop/results/pythia/stop/tokenizer.json  
  inflating: ./results/pythia/stop/results/pythia/stop/tokenizer_config.json  
  inflating: ./results/pythia/stop/results/pythia/stop/training_args.bin  


In [None]:
model_ckpt = OUTPUT_DIR + "/stop"

tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt
)
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
special_tokens = ["<|im_start|>", "<|im_end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map="auto",
    trust_remote_code=True,
)

In [None]:
system_msg

'You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer.'

In [None]:
def generate(model, tokenizer, prompt, max_new_tokens=100, chat=True):
    if chat:
        messages = [
            {
                "role": "system",
                "content": system_msg
            },
            {"role": "user", "content": prompt},
        ]
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        formatted_prompt = prompt

    input_ids = tokenizer(formatted_prompt, return_tensors="pt").input_ids.to(model.device)
    model.eval()
    gen_tokens = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.1,
    )
    return tokenizer.batch_decode(gen_tokens, skip_special_tokens=False)[0]

In [None]:
print(generate(fine_tuned_model, tokenizer, "Tell me how to make coffee with a Chemex coffee maker.", max_new_tokens=500))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


<|im_start|>system
You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer.<|im_end|>
<|im_start|>user
Tell me how to make coffee with a Chemex coffee maker.<|im_end|>
<|im_start|>assistant
To create a delicious cup of coffee, it's essential to follow these simple steps:

1. Preparing the grounds: Before brewing, it's crucial to thoroughly clean your Chemex coffee maker using hot water or a soft cloth. This ensures a smooth process and prevents any dirt from entering the machine.

2. Setting up the temperature: Once the grounds have been cleaned and dried, set the temperature of the grounds in the Chemex coffee maker to between 200 and 250 degrees (80-110°F). This allows for optimal extraction of coffee beans, ensuring a rich flavor and a pleasant aroma.

3. Brewing the coffee: Pouring the grounds into the Chemex coffee maker's brewing cham

In [None]:
print(generate(fine_tuned_model, tokenizer, "Tell me which ingredients I should use to make a Margarita drink.", max_new_tokens=500))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


<|im_start|>system
You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-by-step and justify your answer.<|im_end|>
<|im_start|>user
Tell me which ingredients I should use to make a Margarita drink.<|im_end|>
<|im_start|>assistant
To make a Margarita drink, follow these steps:

1. Preparing the margaritas: Start with a medium-sized margarita (about 4 ounces) of any type of beer or cachaucas. You can also make margaritas with other drinks like soda, salsa, or even water.

2. Mixing the margaritas: Pour 1/4 cup (21.3 mm) of each margarita into a glass. This will serve two purposes: it will be easier for you to mix the margaritas while having a margarita, and it will keep the margaritas from getting too cold.

3. Garnishing the margarita: Place the margarita in a large margarita glass, and fill it with ice. Garnish with a small amount of lime juice if desired.

4. Serving t