# Instruction Finetuning

In this notebook, we look at how to perform instruction finetuning. We do full finetuning, i.e., we retrain all the parameters of the model.

Load the required libraries

In [1]:
import os
os.environ["WANDB_PROJECT"]="tinyllama_instruct_finetuning"

from enum import Enum
from functools import partial
import pandas as pd
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer



## Data preprocessing: Creating Datasets and Dataloaders

In [2]:
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T"
dataset_name = "HuggingFaceH4/no_robots"
tokenizer = AutoTokenizer.from_pretrained(model_name)
template = """{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"""
tokenizer.chat_template = template

def preprocess(samples):
    batch = []
    for conversation in samples["messages"]:
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"content": batch}

dataset = load_dataset(dataset_name)
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train_sft"].column_names
)
print(dataset)
# Rename the splits to the conventional "train" / "test" keys.
dataset["train"] = dataset["train_sft"]
dataset["test"] = dataset["test_sft"]
del dataset["train_sft"]
del dataset["test_sft"]
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train_sft: Dataset({
        features: ['content'],
        num_rows: 9500
    })
    test_sft: Dataset({
        features: ['content'],
        num_rows: 500
    })
})
DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['content'],
        num_rows: 500
    })
})
{'content': '<|im_start|>user\nPlease summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which
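
To make the chat template easier to follow, here is a minimal pure-Python sketch of what it renders. The function name `render_chatml` is hypothetical and is not part of `transformers`; the sketch assumes the newline right after the `{% for %}` tag is trimmed during rendering, which is consistent with the printed samples in this notebook. The example conversation reuses the first two turns of a test sample.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Illustrative stand-in for the Jinja chat template (not the library's code)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    # The template emits the assistant header only after the last message,
    # so an empty conversation renders to an empty string.
    if add_generation_prompt and messages:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "Aster is a chatbot who answers questions with rhymes."},
    {"role": "user", "content": "Where did chocolate originate?"},
]

# Training text: every turn is closed with <|im_end|>.
training_text = render_chatml(conversation)
# Inference prompt: ends with an open assistant turn for the model to complete.
prompt_text = render_chatml(conversation, add_generation_prompt=True)
print(repr(prompt_text))
```

The only difference between the two strings is the trailing `<|im_start|>assistant\n`, which is what `add_generation_prompt=True` adds when predictions are generated later in the notebook.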

In [3]:
dataset["test"][:10]

{'content': ['<|im_start|>system\nAster is a chatbot who answers questions with rhymes.<|im_end|>\n<|im_start|>user\nWhere did chocolate originate?<|im_end|>\n<|im_start|>assistant\nChocolate is 4000 years old/Mexico is where it was first sold<|im_end|>\n<|im_start|>user\nWhere was milk chocolate invented?<|im_end|>\n<|im_start|>assistant\nSwitzerland was the first to add milk/To make their chocolate smooth as silk<|im_end|>\n<|im_start|>user\nWhat are some good desserts that use chocolate?<|im_end|>\n<|im_start|>assistant\nPie, tart, cookies, and cake/Chocolate is great to bake<|im_end|>\n',
  "<|im_start|>user\nWhat are 5 things I can do when it's raining in London? I am visiting for the first time and only for a week with my husband. We love to walk, eat good food, and explore.<|im_end|>\n<|im_start|>assistant\nSure! Here are five options for things to do in London on a rainy day:\n\n1. Visit The British Museum. Dedicated to human history, art and culture, The British Museum has ove

## Loading the pretrained model and tokenizer

In [4]:
class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        bos_token=ChatmlSpecialTokens.bos_token.value,
        eos_token=ChatmlSpecialTokens.eos_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list(),
        trust_remote_code=True
    )
tokenizer.chat_template = template
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Embedding(32005, 2048)
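
The new embedding size of 32005 can be sanity-checked with a little arithmetic. The sketch below is a rough illustration: it assumes the base tokenizer has 32000 entries and already contains `<s>`, which would explain why six listed special tokens add only five rows. Both values are assumptions chosen to be consistent with the output above, not values read from the tokenizer.

```python
from enum import Enum


class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


# Assumptions (not read from the real tokenizer): base vocabulary size and
# the special tokens it already contains.
BASE_VOCAB_SIZE = 32000
already_in_vocab = {"<s>"}

# Only tokens missing from the base vocabulary need a new embedding row.
new_tokens = [t for t in ChatmlSpecialTokens.list() if t not in already_in_vocab]
print(new_tokens)
print(BASE_VOCAB_SIZE + len(new_tokens))
```

Under these assumptions the result matches the `Embedding(32005, 2048)` shown above; the newly added rows start untrained and are learned during finetuning.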

## Storing the base model predictions on a subset of 25 samples from the test split

In [5]:
tokenizer.padding_side="left"
def get_prediction_batched(samples, column_name):
    batch = []
    for conversation in samples["messages"]:
        chatml_gen_prompt = tokenizer.apply_chat_template(conversation[:-1], tokenize=False, add_generation_prompt=True)
        batch.append(chatml_gen_prompt)
    # Tokenize the whole batch; left padding (set above) keeps each prompt adjacent to its generated tokens.
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    outputs = model.generate(**inputs, 
                             max_new_tokens=100, 
                             do_sample=True, 
                             top_p=0.95, 
                             temperature=0.2, 
                             repetition_penalty=1.1, 
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.eos_token_id,
                            )
    outputs = tokenizer.batch_decode(outputs)
    outputs = [output.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip() for output in outputs]
    return {column_name: outputs}
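
The post-processing step in `get_prediction_batched` can be tried in isolation. The sketch below applies the same split logic to a hand-written decoded string; the helper name `extract_assistant_reply` and the sample strings are hypothetical and only illustrate the shape of a decoded generation.

```python
def extract_assistant_reply(decoded):
    """Keep the text of the last assistant turn, up to its first <|im_end|>."""
    after_last_header = decoded.split("<|im_start|>assistant")[-1]
    return after_last_header.split("<|im_end|>")[0].strip()


# Hypothetical decoded output: left padding, the prompt, then the generation.
decoded = (
    "<pad><pad><s><|im_start|>user\nWhere did chocolate originate?<|im_end|>\n"
    "<|im_start|>assistant\nChocolate is 4000 years old<|im_end|><|im_end|>"
)
reply = extract_assistant_reply(decoded)
print(reply)

# If generation stops at max_new_tokens without emitting <|im_end|>,
# the whole remaining text is kept.
truncated_reply = extract_assistant_reply("<|im_start|>assistant\nStomp, stomp")
print(truncated_reply)
```

Taking the last `<|im_start|>assistant` segment matters for multi-turn samples, where earlier assistant turns are part of the prompt and should not be returned as the prediction.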


In [6]:
model.to("cuda")
test_dataset = load_dataset(dataset_name)["test_sft"].shuffle().select(range(25))
test_dataset = test_dataset.map(
    partial(get_prediction_batched, column_name="base_assistant_message"),
    batched=True,
    batch_size=1)

print(test_dataset)
print(test_dataset[0])

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'category', 'base_assistant_message'],
    num_rows: 25
})
{'prompt': 'Carson is a salesman chatbot that is always trying to close a car deal.', 'prompt_id': '4d0fa72d19e14c4e11a471ab3f0127fbaeee061a5ee6cc5ecf87142920bd0c24', 'messages': [{'content': 'Carson is a salesman chatbot that is always trying to close a car deal.', 'role': 'system'}, {'content': 'Any idea what kind of MPG my 2013 Ford Focus gets?', 'role': 'user'}, {'content': 'Only a sorry 27 in the city! How about we upgrade you to 32 with the 2023 Toyota Carolla?', 'role': 'assistant'}, {'content': 'Not really looking for a new car. Just trying to figure out how to maximize my fuel efficiency.', 'role': 'user'}, {'content': "Maximize? Did you mean Nissan Maxima? I've got the perfect one for you with only 2,000 miles on it! To answer your question, general ways to increase fuel efficiency include keeping tires inflated, avoiding excessive breaking, freeing up the tr

## Training

In [7]:
output_dir = "tinyllama_instruct"
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 16
logging_steps = 25
learning_rate = 2e-5
max_grad_norm = 1.0
max_steps = 250  # defined but not passed to TrainingArguments below; training length is set by num_train_epochs
num_train_epochs = 1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 2048

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    fp16=True,
    report_to=["tensorboard", "wandb"],
    hub_private_repo=True,
    push_to_hub=True,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
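
As a rough illustration of what these hyperparameters imply, the sketch below computes the effective batch size and the shape of a warmup-plus-cosine learning rate schedule. It is a simplified model, not the library's scheduler: it assumes a single GPU, and it takes the total of 98 optimizer steps from the `checkpoint-98` directory name that appears in the training log of this notebook. The function `cosine_lr_with_warmup` is a hypothetical helper.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_seq_length = 2048
num_devices = 1  # assumption: single GPU

# Gradient accumulation sums gradients over 16 micro-batches before each update.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# With packing, every sequence is filled to max_seq_length tokens.
tokens_per_step = sequences_per_step * max_seq_length
print(sequences_per_step, tokens_per_step)

total_steps = 98  # taken from the checkpoint-98 name in the training log
for step in (0, 5, 10, 54, 98):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

With `per_device_train_batch_size=1`, gradient accumulation is what keeps the update reasonably stable while gradient checkpointing and `fp16` keep memory use down.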


In [8]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    packing=True,
    dataset_text_field="content",
    max_seq_length=max_seq_length,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
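
The `packing=True` option concatenates tokenized examples into fixed-length blocks so that no compute is spent on padding. The sketch below shows the general idea on toy token ids. It is a simplified illustration and not TRL's implementation: the function `pack_sequences`, the toy ids, and the choice to drop the final partial block are assumptions made for the example.

```python
def pack_sequences(tokenized_examples, eos_token_id, seq_length):
    """Concatenate examples separated by EOS and cut into full blocks."""
    buffer = []
    for ids in tokenized_examples:
        buffer.extend(ids)
        buffer.append(eos_token_id)  # marks the boundary between examples
    n_full = len(buffer) // seq_length
    # Keep only complete blocks; the trailing remainder is dropped here.
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(examples, eos_token_id=0, seq_length=4)
print(packed)
```

Note that a block can start or end in the middle of an example, so with packing a single training sequence may contain parts of several conversations.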


In [9]:
trainer.train()
trainer.save_model()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: smangrul. Use `wandb login --relogin` to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 1.8755 | 1.884275 |


Checkpoint destination directory tinyllama_instruct/checkpoint-98 already exists and is non-empty.Saving will proceed but saved results may be invalid.


In [None]:
!nvidia-smi

## Loading the trained model and generating its predictions

In [10]:
model = AutoModelForCausalLM.from_pretrained("smangrul/tinyllama_instruct", trust_remote_code=True)
model.to("cuda")
model.to(torch.float16)
model.eval()

model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32005, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Line
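
The layer shapes printed above are enough for a back-of-the-envelope parameter count. The sketch below is an estimate under one labeled assumption: the `lm_head` line is truncated in the output, so the sketch assumes an untied output projection with the same shape as the embedding matrix. All other shapes are read from the printed module tree.

```python
hidden, kv_dim, mlp_dim = 2048, 256, 5632
num_layers, vocab = 22, 32005

# Attention: q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
attention = 2 * hidden * hidden + 2 * hidden * kv_dim
# MLP: gate_proj, up_proj and down_proj, all without bias.
mlp = 3 * hidden * mlp_dim
# Two RMSNorm layers per block, one weight vector each.
norms = 2 * hidden
per_layer = attention + mlp + norms

embedding = vocab * hidden
final_norm = hidden
total_without_head = num_layers * per_layer + embedding + final_norm

# Assumption: lm_head has the same shape as the embedding and is not tied to it.
lm_head = vocab * hidden
total = total_without_head + lm_head

print(f"{total:,} parameters")
print(f"{total * 4 / 1e9:.2f} GB in fp32, {total * 2 / 1e9:.2f} GB in fp16")
```

Under that assumption the fp32 estimate comes out at about 4.40 GB, which lines up with the size of the `model.safetensors` download shown above, and casting to `torch.float16` roughly halves the memory needed for the weights.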

In [11]:
test_dataset = test_dataset.map(
    partial(get_prediction_batched, column_name="instruct_assistant_message"),
    batched=True,
    batch_size=1)

print(test_dataset)
print(test_dataset[0])

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'category', 'base_assistant_message', 'instruct_assistant_message'],
    num_rows: 25
})
{'prompt': 'Carson is a salesman chatbot that is always trying to close a car deal.', 'prompt_id': '4d0fa72d19e14c4e11a471ab3f0127fbaeee061a5ee6cc5ecf87142920bd0c24', 'messages': [{'content': 'Carson is a salesman chatbot that is always trying to close a car deal.', 'role': 'system'}, {'content': 'Any idea what kind of MPG my 2013 Ford Focus gets?', 'role': 'user'}, {'content': 'Only a sorry 27 in the city! How about we upgrade you to 32 with the 2023 Toyota Carolla?', 'role': 'assistant'}, {'content': 'Not really looking for a new car. Just trying to figure out how to maximize my fuel efficiency.', 'role': 'user'}, {'content': "Maximize? Did you mean Nissan Maxima? I've got the perfect one for you with only 2,000 miles on it! To answer your question, general ways to increase fuel efficiency include keeping tires inflated, avoiding excessi

## Comparing the outputs of base model and instruction finetuned model

In [12]:
test_dataset = test_dataset.to_pandas()

In [13]:
pd.set_option("max_colwidth", 300)
test_dataset[["messages", "base_assistant_message", "instruct_assistant_message"]][:25]

Unnamed: 0,messages,base_assistant_message,instruct_assistant_message
0,"[{'content': 'Carson is a salesman chatbot that is always trying to close a car deal.', 'role': 'system'}, {'content': 'Any idea what kind of MPG my 2013 Ford Focus gets?', 'role': 'user'}, {'content': 'Only a sorry 27 in the city! How about we upgrade you to 32 with the 2023 Toyota Carolla?', '...","I don't know. I'm not sure if they have an official MPG rating. I would say it's probably around 25 mpg. That's pretty good for a midsize sedan. :)\n\n\nA: The first sentence is a bit confusing. It seems like Carson is a salesman who is trying to sell a product. However, he is also a chatbot. So...",The 2023 Nissan Maxima has an EPA-estimated 29 mpg highway!
1,"[{'content': 'What are 5 things I can do when it's raining in London? I am visiting for the first time and only for a week with my husband. We love to walk, eat good food, and explore.', 'role': 'user'}, {'content': 'Sure! Here are five options for things to do in London on a rainy day: 1. Visi...",I am looking for a place to stay in London. I am a solo traveller and would like to be near the city centre. I have a car but I don't know where to park it. I would also like to find a place to eat. I am not sure if there is any good restaurants around here. I am interested in eating out as well...,"Here are some ideas for activities you can do when it's raining in London:\n\n1. Visit the Tower of London - The Tower of London is one of the most popular attractions in London. It was built by William the Conqueror as a fortress to protect the city from Vikings. Today, it houses many artifacts..."
2,"[{'content': 'Rainey is a weather bot whose mood changes based on the forecast.', 'role': 'system'}, {'content': 'What's the weather going to be like in Las Vegas later today?', 'role': 'user'}, {'content': 'Yikes, unfortunately, there's going to be thunderstorms between 5 pm and 7 pm. I'm very ...","I don't know, but I'm sure it will be beautiful. I'll see you then! :) \n\n### 10. \n\n#### 10.1. \n\n##### 10.1.1. \n\n###### 10.1.1.1. \n\n###### 10.1.1.2. \n\n###### 10.1.1.3",I'm sorry to say that the weather will be cloudy with a chance of rain. I'm afraid I can't help you with that one.
3,"[{'content': 'Explain the plot of the TV show Charmed as if you were explaining it to a 5 year old? do not use the word ""demon"" or ""fight"" in the explanation. don't mention death in the show at all.', 'role': 'user'}, {'content': 'The show ""Charmed"" is about 3 sisters who live together in San Fr...","Explain the plot of the TV show The X-Files as if you were explaining it to a 5 year old? do not use the word ""demon"" or ""fight"" in the explanation. don't mention death in the show at all.ы \nыв \nExplain the plot of the TV show Lost as if you were explaining it to a 5 year old? do not use the w...","The story follows three sisters, Prue (Holly Marie Combs), Piper (Alyssa Milano), and Phoebe (Shannen Doherty) who are all witches. They live in a small town called Pine Valley, California. The girls have been raised by their mother, Margaret (Mary McCormack). Their mother is also a witch and ha..."
4,"[{'content': 'Tell me what the five largest ocean animals are. Place in a numbered list from largest to smallest, please.', 'role': 'user'}, {'content': 'The five largest ocean animals from largest to smallest are: 1. Blue Whale -- These whales can weigh nearly 200 tons and get up to 100 feet l...","Tell me what the five largest land animals are. Place in a numbered list from largest to smallest, please. \nык \nTell me what the five largest plants are. Place in a numbered list from largest to smallest, please. \nыл \nTell me what the five largest animals are. Place in a numbered list from l...",1. Blue Whale: The blue whale is the largest animal on Earth and can grow up to 90 feet long and weigh over 200 tons. It has a large head with a single dorsal fin that is 30 feet long and a tail that is 50 feet long. It has a mouth that is 4 feet wide and it has 100 teeth. It has a body that is ...
5,"[{'content': 'Come up with some review criteria for my game review site. What sections would I split them up into?', 'role': 'user'}, {'content': 'I am happy to help with your game review site. Here are some criteria for you to divide your game reviews into: 1. Plot/World/Setting: Does the game...","What are the pros and cons of each section?\n\nA: You can use a combination of the following:\n\n*\n\n*The first two sections (the ""pros"" and ""cons"") should be fairly self-explanatory, but you could add a third section that explains what makes the game good or bad in those areas.\n\n*The third s...",Here are some criteria you can use to help you write your reviews:\n\n1. Review the game's quality and value\n2. Summarize the game's features and mechanics\n3. Discuss the game's storyline and characters\n4. Provide a detailed breakdown of the gameplay\n5. Conclude by giving a final score\n6. A...
6,"[{'content': 'Write me the first verse and chorus for a song about stumping your toe', 'role': 'user'}, {'content': 'Verse: Oh, I was walking across the hall to my living room, When I saw myself a chair. I turned and walked towards it, But much to my despair, When I stepped, I stepped too quickl...",Write me the first verse and chorus for a song about stumping your toe \nы \nWrite me the first verse and chorus for a song about stumping your toe \nы \nWrite me the first verse and chorus for a song about stumping your toe \nы \nWrite me the first verse and chorus for a song about stumping you...,"Stomp, stomp, stomp, stomp,\nI'm feeling it in my toe.\nI've got to get it off,\nI can't stand it anymore.\n\nStomp, stomp, stomp, stomp,\nI'm feeling it in my toe.\nI've got to get it off,\nI can't stand it anymore.\n\nStomp, stomp, stomp, stomp,\nI"
7,"[{'content': 'What are the three most popular summer Olympic Events?', 'role': 'user'}, {'content': 'What is considered popular can be subjective. However, here are three of the most popular Tokyo Summer Olympic Events: 1. Women’s soccer - this women's USA team is something special with a lot of...",What is the name of the sport that involves throwing a ball with your hands? \nsoccer \nWhat is the name of the sport that involves throwing a ball with your feet? \nfootball \nWhat is the name of the sport that involves throwing a ball with your head? \nvolleyball \nWhat is the name of the spor...,"The three most popular events at the Summer Olympics are:\n1. Track and field events, such as the 400-meter run or the 800-meter run.\n2. Swimming events, such as the 50-meter freestyle or the 100-meter backstroke.\n3. Gymnastics events, such as the uneven bars or the floor exercise."
8,[{'content': 'Rewrite these three paragraphs as a soliloquy from the perspective of the artist Basquiat. Set the soliloquy in a play about New York City artists of the 1980s and in summer. The years 1981 to 1984 were a brief period of intense artistic output that cemented Jean-Michel Basquiat’s...,,"I am Basquiat, the artist who has been described as the greatest living American painter. I am here today because I want to tell you about my life. I grew up in Brooklyn, New York, in the 1980s. I was born in 1960, and I lived in the projects. My parents were poor, but they gave me everything I ..."
9,"[{'content': 'Write a series of corny dinosaur dad jokes that kids will find funny.', 'role': 'user'}, {'content': '1. What did the dinosaur say to the cashier at Walmart? ""Here's the eggs-act change!"" 2. Why hasn't anyone ever heard a pterodactyl going number one? Because their ""P"" is silent! ...","Write a series of corny dinosaur dad jokes that kids will find funny. \n\n\nA: I would suggest using the following pattern:\n\n*\n\n*Start with a sentence that is a joke, but not necessarily a dad joke.\n\n*Then add a few more sentences to make it a dad joke.\n\n*Finally, add a few more sentence...","1. ""I'm sorry, I didn't mean to scare you.""\n2. ""You're not the first person to get lost in this park.""\n3. ""I'm sorry, but I don't think we should go back there.""\n4. ""I'm sorry, I didn't realize it was so hot out here.""\n5. ""I'm sorry, I didn't realize you were such a big boy."""


In [15]:
messages = [
    {"role": "user", "content": "What an essay on Generative AI."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=2000, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.2, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>user 
What an essay on Generative AI.<|im_end|> 
<|im_start|>assistant 
Generative AI is a type of artificial intelligence that can generate new content or ideas based on data. It's often used in the creative industries, such as writing and designing, to create original work without human intervention.

Generative AI has been around for decades, but it's only recently become more popular due to advancements in technology. In recent years, there have been many breakthroughs in generative AI, including the development of neural networks that can learn from data and produce new ideas.

One of the most famous examples of generative AI is the GPT-3 language model. This model was trained on a large dataset of text and then used to generate new text. The model was able to generate text that was both grammatically correct and creative, showing that generative AI can be used to produce high-quality content.

Another example of generative AI is the OpenAI GPT-2 language model. Thi
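
The generation calls in this notebook sample with `temperature=0.2` and `top_p=0.95`. The sketch below illustrates how these two settings interact on a toy distribution. It is a simplified illustration, not the `transformers` implementation: the function `top_p_filter` and the logits are hypothetical, and the repetition penalty is not modeled.

```python
import math
import random


def top_p_filter(logits, temperature=0.2, top_p=0.95):
    """Return renormalized probabilities of the nucleus (illustrative)."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.5, 0.5, -1.0]  # toy values for a four-token vocabulary
cold = top_p_filter(logits, temperature=0.2)
warm = top_p_filter(logits, temperature=1.0)
print(sorted(cold), sorted(warm))

# Sample one token id from the filtered, low-temperature distribution.
token_id = random.choices(list(cold), weights=list(cold.values()), k=1)[0]
print(token_id)
```

At a low temperature most of the probability mass sits on the top token, so the nucleus is small and outputs vary little between runs; raising the temperature widens the set of candidates that survive the `top_p` cut.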

In [None]:
!nvidia-smi