
# Fine-tuning an LLM

We will learn how to fine-tune a small-scale LLM: OPT-350m.

We will fine-tune OPT-350m to generate coherent stories. Because the model is small, the stories may read like a first grader's, but they should still be better than what the model produces without fine-tuning.

First, connect to a T4 GPU instance.

Then install and load the necessary packages.


In [1]:
! pip install accelerate bitsandbytes peft datasets transformers

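`accelerate` and `bitsandbytes` both help reduce memory requirements and speed up training: `accelerate` handles device placement and mixed precision, and `bitsandbytes` provides the 8-bit quantization used when loading the model below.

`peft` stands for parameter-efficient fine-tuning. It provides the LoRA implementation.

`datasets` lets you load datasets from the Hugging Face Hub, and `transformers` provides the model, tokenizer, and training classes for transformer-based models.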

In [2]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import transformers
import torch
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map='auto',
    torch_dtype=torch.float16,
)

Tokenizers are required for LLMs. Complete the `tokenizer` variable using the `AutoTokenizer` class, which loads the tokenizer that matches a given checkpoint. Make sure you use the tokenizer that belongs to the model.

(You should read up on how to use Tokenizers https://github.com/huggingface/tokenizers/blob/main/README.md)

In [3]:
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") # placeholder

## Tokenizers

Tokenizers split text into subword tokens and assign each token an ID. We will learn to play with tokenizers here.
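
To make the text-to-subword-to-ID idea concrete, here is a toy sketch. It uses greedy longest-match splitting over a made-up vocabulary (`TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names for illustration). OPT's real tokenizer uses byte-level BPE with a learned vocabulary, so the pieces and IDs it produces will differ.

```python
# Toy greedy longest-match subword tokenizer (illustration only).
# TOY_VOCAB and its IDs are made up; they are not OPT's vocabulary.
TOY_VOCAB = {"<unk>": 0, "north": 1, "western": 2, "wild": 3,
             "cats": 4, "cat": 5, "s": 6, " ": 7}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}

def toy_encode(text):
    """Split text into the longest vocabulary pieces, left to right."""
    text = text.lower()
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry matches: emit <unk> and skip one character.
            ids.append(TOY_VOCAB["<unk>"])
            pos += 1
    return ids

def toy_decode(ids):
    """Map IDs back to their token strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = toy_encode("Northwestern Wildcats")
print(ids)              # [1, 2, 7, 3, 4]
print(toy_decode(ids))  # northwestern wildcats
```

The round trip loses the capitalization only because this toy lowercases its input; a real tokenizer keeps enough information to reconstruct the original string.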

Using the loaded tokenizer, find the token ids for the string "Northwestern Wildcats".

(Make sure you have the correct tokenizer, or the results for the rest of the assignment will not be correct).

In [4]:
token_ids = tokenizer.encode("Northwestern Wildcats", add_special_tokens=True) # placeholder
print(token_ids)

[2, 11073, 16507, 10828]


An encoded message is shown below as a sequence of token IDs. Please decode the message with the tokenizer.

In [5]:
message = [2, 11073, 16507, 589, 36, 487, 791, 43, 16, 10, 940, 557, 2737, 11, 9771, 6712, 6,
 3882, 6, 315, 532, 4, 5441, 28477, 11, 504, 4708, 7, 1807, 5, 3575, 8535, 23463, 6,
 24, 16, 5, 7763, 5966, 3215, 2737, 11, 3882, 4, 20, 2737, 34, 63, 1049, 2894, 552, 5,
 20597, 9, 1777, 2293, 11, 5, 1568, 20887, 443, 4, 1437]

decoded_string = tokenizer.decode(message) # placeholder
print(decoded_string)

</s>Northwestern University (NU) is a private research university in Evanston, Illinois, United States. Established in 1851 to serve the historic Northwest Territory, it is the oldest chartered university in Illinois. The university has its main campus along the shores of Lake Michigan in the Chicago metropolitan area. 


## LoRA

The transformer model and its tokenizer have been defined. Now we attach a LoRA adapter: the 8-bit base weights stay frozen, and only the small adapter matrices are trained.

LoRA has some parameters for you to tune. Please fill out the appropriate `task_type`.

Please also fill out `r` and `lora_alpha`. These are tunable hyperparameters and you can come back and edit these two as you see fit.

Please read https://huggingface.co/docs/peft/main/en/developer_guides/lora for a guide on these parameters
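
As a rough picture of what `r` and `lora_alpha` control, the sketch below computes a LoRA-style forward pass, `W x + (lora_alpha / r) * B (A x)`, on plain Python lists. The matrices, sizes, and helper names (`matvec`, `lora_forward`) are hypothetical and chosen only for illustration. The scaling `lora_alpha / r` and the zero-initialized `B` follow the default behaviour described in the LoRA guide linked above, stated here as an assumption.

```python
# Minimal LoRA forward pass on plain Python lists (illustration only).
# Shapes and values are hypothetical; real adapters live inside peft.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Return W x + (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d, r, lora_alpha = 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen d x d weight
A = [[0.1 * (i + j) for j in range(d)] for i in range(r)]           # r x d, trainable
B_zero = [[0.0] * r for _ in range(d)]                              # d x r, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapter contributes nothing, so the output equals W x.
print(lora_forward(W, A, B_zero, x, r, lora_alpha))  # [1.0, 2.0, 3.0, 4.0]

# Once B is non-zero (as after training), the output shifts.
B_trained = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
print(lora_forward(W, A, B_trained, x, r, lora_alpha))
```

A larger `r` gives the update more capacity at the cost of more trainable parameters, while `lora_alpha` rescales how strongly the update is applied.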


In [10]:
config = LoraConfig(
    r=12, # placeholder
    lora_alpha=4, # placeholder
    target_modules= ["q_proj", "v_proj"],
    lora_dropout= 0.05,
    bias="none",
    task_type="CAUSAL_LM" # placeholder
)

lora_model = get_peft_model(model, config)

The LoRA model has been set. To check that it has actually reduced the number of trainable parameters, apply the following function to your LoRA model.




In [11]:
def print_trainable_parameters(model):
    """
    Input: torch model

    Return: None. Print message instead

    Prints the number of trainable parameters in the model.
    Report the percentage of trainable parameters / all parameters
    """

    # keep two counters initialized at 0
    trainable_params = 0
    all_param = 0

    # iterate through all parameters and keep track of which parameters require gradients
    for name, param in model.named_parameters():
        all_param += param.numel()  # the number of all parameters
        if param.requires_grad:
            trainable_params += param.numel()  # the number of trainable parameters

    # Report
    percentage_of_trainable_params = (trainable_params / all_param) * 100

    print(
        f"Trainable params: {trainable_params} ({percentage_of_trainable_params:.2f}% of total)"
    )

print_trainable_parameters(lora_model)


Trainable params: 1179648 (0.35% of total)
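
The count printed above can be reproduced by hand. The sketch below assumes that OPT-350m has 24 decoder layers and that `q_proj` and `v_proj` each map 1024 features to 1024 features. These two values are not stated in this notebook, so treat them as assumptions to check against the model config.

```python
# Reproduce the trainable-parameter count by hand (a sketch under assumptions).
# ASSUMPTIONS: 24 decoder layers; q_proj and v_proj are 1024 -> 1024 linear layers.
hidden_size = 1024     # assumed input and output width of q_proj and v_proj
num_layers = 24        # assumed number of decoder layers
r = 12                 # LoRA rank from the LoraConfig above
targets_per_layer = 2  # q_proj and v_proj

# Each adapted linear layer adds A (r x in_features) and B (out_features x r).
params_per_adapter = r * hidden_size + hidden_size * r
lora_params = params_per_adapter * targets_per_layer * num_layers
print(lora_params)  # 1179648
```

Under these assumptions the result matches the reported `Trainable params: 1179648`, which suggests that only the adapter matrices require gradients.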


## TinyStories

The model has been set up with the LoRA adapter. Now we are ready to collect our dataset. We will be using a subset of TinyStories, which is a collection of ~2-5 sentence stories.


In [12]:
data = load_dataset("roneneldan/TinyStories", split='train[0:5000]')
data['text'][0]

'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'

The data has been tokenized for you in the cell below.

In [13]:
def tokenize(data):
    return tokenizer(data['text'])
tokenized_data = data.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
tokenized_data

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 5000
})

Our dataset has 5000 rows, and it contains the columns `input_ids` and `attention_mask`.

Please describe what `input_ids` and `attention_mask` are.

### input_ids
- **Definition**: `input_ids` are the numerical representations of the text after it has been processed by the tokenizer, where each number corresponds to a specific word or subword (token) in the dictionary. These IDs allow the model to know exactly which words or symbols it is dealing with.
- **Purpose**: They serve as input to the model for training or inference, enabling the model to understand and process the raw text data.

### attention_mask
- **Definition**: `attention_mask` is an array of the same length as `input_ids`, used to indicate to the model which parts of the data are actual content and which parts are padding. In `attention_mask`, a `1` usually marks actual data parts, and `0` marks the padding parts.
- **Purpose**: Since shorter texts may be padded with additional content to match the length of the longest text in a batch for processing, the `attention_mask` allows the model to differentiate between genuine text content and padding added for batching purposes. This ensures that the model's attention mechanism focuses only on the real text during processing, ignoring the padding, thus maintaining accuracy and performance.

Together, these two fields are what the model needs from a dataset such as `roneneldan/TinyStories`: `input_ids` carry the text content, and `attention_mask` tells the model which positions to ignore. Because the tokenizer above was called without padding, every mask in this dataset consists only of `1`s; zeros appear only once sequences are padded to a common length.
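
The relationship between padding and `attention_mask` can be sketched in a few lines. `pad_batch` is a hypothetical helper, and `PAD_ID = 1` is an assumed padding ID used only for illustration; in practice, read the real value from `tokenizer.pad_token_id`.

```python
# Pad a batch to a common length and build the matching attention masks.
# PAD_ID = 1 is an assumption for illustration; check tokenizer.pad_token_id.
PAD_ID = 1

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[2, 11073, 16507, 10828], [2, 100]])
print(batch["input_ids"])       # [[2, 11073, 16507, 10828], [2, 100, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```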

To speed up training, we concatenate the tokenized stories into one long sequence of tokens (the `map` call below does this separately for each batch of 1000 rows) and split it into chunks of 128 tokens. Feel free to experiment with this number.

In [14]:
def group_texts(examples, block_size=128):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} # input ids and attention masks, concat these lists
    total_length = len(concatenated_examples[list(examples.keys())[0]]) # get total length of input ids, should be equal to mask length
    # We drop the small remainder. We could pad instead of dropping if the model
    # supported it; customize this part to your needs.
    total_length = (total_length // block_size) * block_size # delete remainder given block size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

processed_datasets = tokenized_data.map(group_texts,
                                        batched=True,
                                        batch_size=1000,
                                        num_proc=4,)
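
To see the concatenate-then-chunk step on a small scale, here is a toy sketch with `block_size=4`. The token IDs are made up, and `chunk_tokens` is a hypothetical simplification of `group_texts` that handles a single list of sequences.

```python
# Toy version of the concatenate-then-chunk step with block_size=4.
from itertools import chain

def chunk_tokens(sequences, block_size):
    """Concatenate token lists, drop the remainder, and split into blocks."""
    flat = list(chain.from_iterable(sequences))
    usable = (len(flat) // block_size) * block_size  # drop the tail that does not fill a block
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]

stories = [[2, 10, 11, 12, 13], [2, 20, 21], [2, 30, 31, 32, 33]]  # made-up token IDs
blocks = chunk_tokens(stories, block_size=4)
print(blocks)  # [[2, 10, 11, 12], [13, 2, 20, 21], [2, 30, 31, 32]]
```

The second block starts with the tail of the first story and continues into the next one, so a chunk can begin in the middle of one story and end in another. The final token (`33`) is dropped because it does not fill a whole block.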

Use your tokenizer to decode the input ids for chunk 1.

In [15]:
input_ids = processed_datasets[1]["input_ids"]
text = tokenizer.decode(input_ids, skip_special_tokens=True) # placeholder
print(text)

 sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.

One day, Beep was driving in the park when he saw a big tree. The tree had many leaves that were falling. Beep liked how the leaves fall and wanted to play with them. Beep


Before we train, we look at what the model produces when prompted with the story opening "Alice and Bob". Run the cell below to see the output of the default OPT-350m.

Decode the model generated tokens and print the story.

In [16]:
model_inputs = tokenizer('Alice and Bob', return_tensors='pt').to('cuda')
greedy_output = model.generate(**model_inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)[0]
story = tokenizer.decode(greedy_output, skip_special_tokens=True) # placeholder
print(story)

Alice and Bobbie are the best.
Alice is the best.


## Training Loop

Now we begin our training loop. We will use the HuggingFace trainer API since it has built-in efficiencies. Please fill in the `per_device_train_batch_size`, `gradient_accumulation_steps`, `learning_rate`, and `num_train_epochs`.

- `per_device_train_batch_size`: Assuming one device (one GPU), this determines the batch size you use.
- `gradient_accumulation_steps`: The number of micro-batches whose gradients are accumulated (one forward and one backward pass each) before the optimizer updates the model parameters.

Together, these two parameters determine how much data goes into each gradient estimate: the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`. More data gives a more accurate gradient estimate, but a larger per-device batch needs more GPU memory, whereas more accumulation steps trade memory for time. Adjust the two in tandem for efficiency.
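
The sketch below illustrates why accumulation is a substitute for a larger batch. On a one-parameter least-squares model, averaging the gradients of four micro-batches of two examples gives the same gradient as one batch of eight. The data, the model, and the helper `mse_grad` are made up for illustration, and the Trainer's internal bookkeeping is more involved.

```python
# Gradient accumulation on a 1-parameter least-squares model (illustration only).
# The data, model, and micro-batch size are made up for this sketch.

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

# One large batch of 8 examples.
full_grad = mse_grad(w, xs, ys)

# The same 8 examples as 4 micro-batches of 2, averaging the gradients.
micro_batch_size, accumulation_steps = 2, 4
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Each micro-batch gets its own forward and backward pass;
    # dividing by accumulation_steps keeps the result an average.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)
print("effective batch size:", micro_batch_size * accumulation_steps)
```

Only one micro-batch has to fit in memory at a time, which is what makes the approach useful on a small GPU.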

Make sure you train for enough epochs. Even with the built-in efficiencies, training takes a while. Be sure to budget your time for this portion.

In [17]:
trainer = transformers.Trainer(
    model=lora_model,
    train_dataset=processed_datasets,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=16, # placeholder
        gradient_accumulation_steps=4, # placeholder
        learning_rate=2e-5, # placeholder (other values considered: 5e-5, 3e-4)
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
        num_train_epochs=3 # placeholder
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False selects causal (next-token) language modeling rather than masked language modeling
)
lora_model.config.use_cache = False
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1 | 2.5989 |
| 2 | 2.5677 |
| 3 | 2.6172 |
| 4 | 2.5651 |
| 5 | 2.5607 |
| 6 | 2.6026 |
| 7 | 2.5767 |
| 8 | 2.6043 |
| 9 | 2.5832 |
| 10 | 2.5755 |


TrainOutput(global_step=378, training_loss=2.457624185022223, metrics={'train_runtime': 1129.3747, 'train_samples_per_second': 21.429, 'train_steps_per_second': 0.335, 'total_flos': 5652061932748800.0, 'train_loss': 2.457624185022223, 'epoch': 2.99})

Now that the model has been trained, run the following code to generate a story from the prompt "Alice and Bob".

In [31]:
model_inputs = tokenizer('Alice and Bob', return_tensors='pt').to('cuda')
output = model.generate(**model_inputs, max_length=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)# placeholder
tuned_story = tokenizer.decode(output[0], skip_special_tokens=True)
print(tuned_story)


Alice and Bob were the only two people who could see the light.
Alice was the only one who could see the light.
Bob was the only one who could see the light.
Alice was the only one who could see the light.
Alice was the only one who could see the light.
Bob was the only one who could see the light.
Alice was the only one who could see the light.
Bob was the only one who could see the light.



## Modify the Generator

How can we make this better?

`model.generate` takes the input, passes it through the LLM, and selects the next tokens, either deterministically or by sampling. It can be controlled by the following:

- `num_beams = k` (beam search): Instead of committing to one token at each step, the model keeps the `k` most probable partial sequences and returns the highest-scoring one at the end.

- `do_sample`: Tells the model whether to sample the next token from the probability distribution or to pick the most probable token.

- `top_k = k`: Over the probability distribution of the next token, we keep only the `k` tokens with the highest probabilities. The probability is redistributed over these `k` tokens and we sample from them.

- `top_p = p`: Over the probability distribution of the next token, we keep the smallest set of highest-probability tokens whose probabilities sum to at least `p`. Then we sample from these tokens.

- `temperature = T`: Rescales the distribution over the next token. Higher temperatures make the distribution more uniform, while lower temperatures sharpen it by increasing the differences in probabilities between tokens.

- `no_repeat_ngram_size = n`: Stops the model from repeating any sequence of `n` tokens.

Think about how each of these parameters affects how we sample the next tokens. Modify your text generation by including these parameters.
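
The effect of temperature, top-k, and top-p on a next-token distribution can be sketched on a made-up four-token vocabulary. The helper names (`softmax`, `top_k_filter`, `top_p_filter`) and the logits are hypothetical. This is a simplified reading of the ideas above, not the code path that `generate` uses internally.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

# Lower temperature concentrates mass on the top token; higher spreads it out.
for t in (0.5, 1.0, 2.0):
    print(t, [round(q, 3) for q in softmax(logits, temperature=t)])

probs = softmax(logits)
print(top_k_filter(probs, k=2))    # only token indices 0 and 1 remain
print(top_p_filter(probs, p=0.9))  # token indices 0, 1, and 2 remain
```

With `do_sample=True`, the next token would then be drawn from the filtered, renormalized distribution rather than always taking the most probable entry.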

In [32]:
output = lora_model.generate(**model_inputs,
                             max_new_tokens=200, # modify
                             top_k=50, # modify
                             top_p=0.95, # modify
                             temperature=0.9, # modify
                             num_beams=5, # modify
                             no_repeat_ngram_size = 2, # modify
                             do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)[0]
tuned_story = tokenizer.decode(output)
print(tuned_story)

</s>Alice and Bob,

Hi, Alice.
I'm Bob and I'm glad you're here.  I'd like to ask you a few questions about your day. What did you do? What was your favorite part of the day? And what was the worst thing that happened to you?  Let me know what you think!  And remember, you can always count on me to answer your questions.</s>
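
The `no_repeat_ngram_size = 2` setting used in the cell above can be pictured as a ban list that is recomputed before every step. The sketch below is a simplified illustration with made-up token IDs; `banned_next_tokens` is a hypothetical helper, not the function that `transformers` uses.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if n < 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for start in range(len(generated) - n + 1):
        ngram = generated[start : start + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

# With n=2: the sequence ends in 5, and "5 7" has already occurred, so 7 is banned.
print(banned_next_tokens([5, 7, 9, 5], n=2))     # {7}
# With n=3: the sequence ends in "1 2", and "1 2 3" has already occurred.
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))  # {3}
```

During generation, banned tokens would have their scores pushed low enough that they can no longer be selected, which is one way to break loops like the repeated "could see the light" lines in the earlier output.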
