# Fine-tuning an LLM

We will learn how to fine-tune a small-scale LLM: OPT-350m.

We will fine-tune OPT-350m to generate coherent stories, acknowledging that its limited capabilities may result in stories comparable to a first grader's level. However, this approach should still yield improved outcomes compared to using the model without fine-tuning.

First, connect to a T4 GPU instance.

Then we need to install and load the necessary packages.


In [None]:
! pip install accelerate bitsandbytes peft datasets transformers

`accelerate` and `bitsandbytes` both help reduce memory requirements: `accelerate` handles device placement (it backs `device_map='auto'`), and `bitsandbytes` provides the 8-bit quantization used when loading the model.

`peft` stands for parameter-efficient fine-tuning. This is where LoRA is housed.

`datasets` allows you to load datasets from the Hugging Face Hub, and `transformers` provides implementations of transformer-based models hosted there.

In [None]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import transformers
import torch
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map='auto',
    torch_dtype=torch.float16,
)

Tokenizers are required for LLMs. Complete the `tokenizer` variable by using the `AutoTokenizer` class, which loads the tokenizer class that matches a given checkpoint. Make sure you use the tokenizer that matches the model.

(You should read up on how to use Tokenizers: https://github.com/huggingface/tokenizers/blob/main/README.md)

In [None]:
tokenizer = None # Placeholder

## Tokenizers

Tokenizers split text into subword tokens and assign each token an integer ID. We will experiment with the tokenizer here.
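To make the idea of subwords and IDs concrete, here is a minimal pure-Python sketch. It is not the OPT tokenizer: the vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are hypothetical, and the greedy longest-match rule is a simplification of what trained subword tokenizers do. The IDs it prints are unrelated to the ones the real tokenizer produces.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
TOY_VOCAB = {"<unk>": 0, "north": 1, "western": 2, "wild": 3, "cat": 4, "s": 5, " ": 6}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into subwords, mapped to integer IDs."""
    text = text.lower()
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry matches: emit the unknown token for one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their subword strings and join them."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    return "".join(inverse[idx] for idx in ids)


ids = toy_encode("Northwestern Wildcats")
print(ids)
print(toy_decode(ids))
```

Note that decoding only recovers the original text up to the normalization applied during encoding (lowercasing, in this toy).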

Using the loaded tokenizer, find the token ids for the string "Northwestern Wildcats".

(Make sure you have the correct tokenizer, or the results for the rest of the assignment will not be correct).

In [None]:
token_ids = None # placeholder
print(token_ids)

None


An encoded message is shown below as a sequence of token IDs. Please decode the message with the tokenizer.

In [None]:
message = [2, 11073, 16507, 589, 36, 487, 791, 43, 16, 10, 940, 557, 2737, 11, 9771, 6712, 6,
 3882, 6, 315, 532, 4, 5441, 28477, 11, 504, 4708, 7, 1807, 5, 3575, 8535, 23463, 6,
 24, 16, 5, 7763, 5966, 3215, 2737, 11, 3882, 4, 20, 2737, 34, 63, 1049, 2894, 552, 5,
 20597, 9, 1777, 2293, 11, 5, 1568, 20887, 443, 4, 1437]

decoded_string = None # placeholder
print(decoded_string)

None


## LoRA

The transformer model and its tokenizer have been defined. The base weights are loaded in 8-bit and stay frozen, so we need to attach a LoRA adapter if we hope to train the model at all.

LoRA has some parameters for you to tune. Please fill out the appropriate `task_type`.

Please also fill out `r` and `lora_alpha`. These are tunable hyperparameters and you can come back and edit these two as you see fit.

Please read https://huggingface.co/docs/peft/main/en/developer_guides/lora for a guide on these parameters
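As a rough guide to what `r` and `lora_alpha` control, the sketch below counts the parameters one LoRA adapter adds to a single weight matrix and computes the standard `lora_alpha / r` scaling applied to the low-rank update. The helper names `lora_param_count` and `lora_scaling` are hypothetical, and the 1024 x 1024 matrix size is an assumption chosen for illustration rather than a value read from the model config.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return lora_alpha / r


# Assumed, illustrative dimensions (not read from the model config).
d_in = d_out = 1024
full = d_in * d_out
for r in (4, 8, 16):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r:2d}: {added:6d} adapter params vs {full} frozen ({100 * added / full:.2f}%)")

print("scaling for lora_alpha=16, r=8:", lora_scaling(16, 8))
```

A larger `r` gives the adapter more capacity at the cost of more trainable parameters, while `lora_alpha` relative to `r` sets how strongly the adapter's update is weighted.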


In [None]:
config = LoraConfig(
    r=0, # placeholder
    lora_alpha=0, # placeholder
    target_modules= ["q_proj", "v_proj"],
    lora_dropout= 0.05,
    bias="none",
    task_type= None # placeholder
)

lora_model = get_peft_model(model, config)

The LoRA model has been set up. To check that it has actually reduced the number of trainable parameters, complete the following function and apply it to your LoRA model.

In [None]:
def print_trainable_parameters(model):
    """
    Input: torch model

    Return: None. Print message instead

    Prints the number of trainable parameters in the model.
    Report the percentage of trainable parameters / all parameters
    """

    # Hint: keep two counters initialized at 0
    # iterate through all parameters and keep track of which
    # parameters require gradients
    # Report
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        pass # placeholder

    print(
        f"trainable params: {None}" # placeholder
    )

print_trainable_parameters(lora_model)

## TinyStories

The model has been set with the LoRA adapter. Now we are ready to collect our dataset. We will be using a subset of TinyStories which is a collection of ~2-5 sentence stories.


In [None]:
data = load_dataset("roneneldan/TinyStories", split='train[0:5000]')
data['text'][0]

The data has been tokenized for you in the cell below.

In [None]:
def tokenize(data):
    return tokenizer(data['text'])
tokenized_data = data.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
tokenized_data

Our dataset has 5000 rows, and it contains the columns `input_ids` and `attention_mask`.

Please describe what `input_ids` and `attention_mask` are.

_Your response here_

In order to speed up training, we concatenate all 5000 rows of stories into one long block of text. Then we will chunk the block of text into chunks of size 128. Feel free to experiment with this number.

In [None]:
def group_texts(examples, block_size=128):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} # input ids and attention masks, concat these lists
    total_length = len(concatenated_examples[list(examples.keys())[0]]) # get total length of input ids, should be equal to mask length
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size # delete remainder given block size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

processed_datasets = tokenized_data.map(group_texts,
                                        batched=True,
                                        batch_size=1000,
                                        num_proc=4,)
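To see what the chunking does, the sketch below runs the same logic on a tiny made-up batch with `block_size=4`. The function is repeated so that the snippet runs on its own, and the token IDs in `toy_batch` are invented for illustration.

```python
def group_texts(examples, block_size=128):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Made-up token IDs for three short "stories" of lengths 3, 4 and 2.
toy_batch = {
    "input_ids": [[11, 12, 13], [21, 22, 23, 24], [31, 32]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}

chunks = group_texts(toy_batch, block_size=4)
print(chunks["input_ids"])  # 9 tokens -> two chunks of 4; the last token is dropped
print(chunks["labels"])
```

Two things are worth noticing: a chunk can span the boundary between two stories, and the leftover tokens at the end of each mapped batch are discarded.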

Use your tokenizer to decode the input ids for chunk 1.

In [None]:
input_ids = processed_datasets[1]["input_ids"]
text = None # placeholder
print(text)

Before we train, we look at the model output when we prompt it with a story with "Alice and Bob". Run the cell below to see what the default OPT-350m will give when prompted with Alice and Bob.

Decode the model generated tokens and print the story.

In [None]:
model_inputs = tokenizer('Alice and Bob', return_tensors='pt').to('cuda')
greedy_output = model.generate(**model_inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)[0]
story = None # placeholder
print(story)

## Training Loop

Now we begin our training loop. We will use the HuggingFace trainer API since it has built-in efficiencies. Please fill in the `per_device_train_batch_size`, `gradient_accumulation_steps`, `learning_rate`, and `num_train_epochs`.

- `per_device_train_batch_size`: Assuming one device (one GPU), this determines the batch size you use.
- `gradient_accumulation_steps`: This determines the number of batches whose gradients are accumulated (one forward and backward pass each) before the optimizer updates the model parameters.

Together, these two parameters determine how much data goes into each gradient estimate: the effective batch size is their product. More data gives a more accurate gradient estimate, but a larger per-device batch uses more memory, whereas accumulation trades extra time for that memory. Modify these two parameters in tandem for efficiency.
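The relationship between the two settings can be checked on a toy problem. The sketch below uses a one-parameter model `y_hat = w * x` with a mean squared error loss (both are assumptions for illustration and have nothing to do with OPT) and shows that averaging the gradients of four micro-batches of size 2 gives the same gradient as one batch of size 8.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w for y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

per_device_train_batch_size = 2
gradient_accumulation_steps = 4

# Gradient from one large batch of 8 examples.
full_batch_grad = grad_mse(w, xs, ys)

# Gradient accumulated over 4 micro-batches of 2 examples each.
# Each micro-batch gradient is divided by the number of accumulation steps
# so that the sum is an average over all 8 examples.
accumulated_grad = 0.0
for step in range(gradient_accumulation_steps):
    lo = step * per_device_train_batch_size
    hi = lo + per_device_train_batch_size
    accumulated_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / gradient_accumulation_steps

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print("effective batch size:", effective_batch_size)
print("full-batch gradient: ", full_batch_grad)
print("accumulated gradient:", accumulated_grad)
```

Only one micro-batch needs to be held in memory at a time, which is why accumulation is useful on a small GPU.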

Make sure you train for enough epochs. Even with the built-in efficiencies, training takes a while. Be sure to budget your time for this portion.

In [None]:
trainer = transformers.Trainer(
    model=lora_model,
    train_dataset=processed_datasets,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=0, #placeholder,
        gradient_accumulation_steps=0, #placeholder,
        learning_rate=0, #placeholder,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
        num_train_epochs=0 # placeholder
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=None) # Fill out the None to be either True or False. Which one is it?
)
lora_model.config.use_cache = False
trainer.train()

Now that the model has trained, complete the following code to generate and print the output for the story prompt "Alice and Bob".

In [None]:
model_inputs = tokenizer('Alice and Bob', return_tensors='pt').to('cuda')
output = None # placeholder
print(output)

## Modify the Generator

How can we make this better?

`model.generate` takes the input, passes it through the LLM, and selects the next tokens according to a decoding strategy. It can be controlled by the following:

- `num_beams = k` (beam search): Instead of committing to a single next token, the model keeps the `k` most probable partial sequences (beams) at every step and returns the highest-scoring sequence.

- `do_sample`: Tells the model whether to sample the next token from the probability distribution (`True`) or to pick the most probable token (`False`).

- `top_k = k`: Over the probability distribution of the next possible tokens, we keep only the `k` tokens with the highest probabilities. The probability mass is renormalized over these `k` tokens and we sample from them.

- `top_p = p`: Over the probability distribution of the next possible tokens, we keep the smallest set of highest-probability tokens whose probabilities sum to at least `p`. Then we renormalize and sample from these tokens.

- `temperature = T`: Rescales the distribution over the next tokens. Higher temperatures make the distribution more uniform, while lower temperatures sharpen it by increasing the differences in probabilities between tokens. It is essentially a way to exaggerate or dampen probability differences in a distribution.

- `no_repeat_ngram_size = n`: Prevents any sequence of `n` tokens from appearing more than once in the output.

Think about how each of these parameters affects how we sample the next tokens. Modify your text generation by including these parameters.
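The effect of temperature, top-k and top-p can be seen on a toy distribution. The sketch below is pure Python and works on a made-up list of four logits; the helpers `softmax`, `top_k_filter` and `top_p_filter` are hypothetical and only approximate what happens inside `generate`, which applies comparable filters to the model's logits.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # lower T: more peaked
flat = softmax(logits, temperature=2.0)   # higher T: closer to uniform

print("T=1.0:", [round(q, 3) for q in base])
print("T=0.5:", [round(q, 3) for q in sharp])
print("T=2.0:", [round(q, 3) for q in flat])
print("top_k=2:  ", [round(q, 3) for q in top_k_filter(base, 2)])
print("top_p=0.9:", [round(q, 3) for q in top_p_filter(base, 0.9)])

# Greedy decoding picks the most probable token; sampling draws from the distribution.
greedy_token = max(range(len(base)), key=lambda i: base[i])
sampled_token = random.Random(0).choices(range(len(base)), weights=top_k_filter(base, 2))[0]
print("greedy:", greedy_token, "sampled:", sampled_token)
```

Tokens that a filter removes get probability zero, so sampling can never select them; this is how `top_k` and `top_p` cut off the unlikely tail of the distribution.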

In [None]:
output = lora_model.generate(**model_inputs,
                             max_new_tokens=200, # modify
                             top_k=0, # modify
                             top_p=0.0, # modify
                             temperature=0.0, # modify
                             num_beams=0, # modify
                             no_repeat_ngram_size = 0, # modify
                             do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)[0]
tuned_story = tokenizer.decode(output)
print(tuned_story)