<a href="https://colab.research.google.com/github/sancayenne06/ACMAI/blob/main/ACM_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM finetuning

You should make a copy of this notebook and work on it there.

In this assignment we're going to use Hugging Face libraries to turn an LLM into a limited instruct model (all this means is that, rather than having the LLM simply predict the next token in a sequence, it writes a response to an instruction). This is important because we're going to be using trl to tune r1, and knowing how to use peft will make that much more efficient. The goal is not to test your understanding or whether you can use ChatGPT, but to make sure that you can read Hugging Face's documentation and work out what you need to do to make it work.
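
To make the efficiency point concrete, here is a rough back-of-the-envelope sketch in plain Python of how LoRA (the peft method imported below) shrinks the number of trainable weights. It assumes a single hypothetical 576 x 576 weight matrix, chosen only for illustration, and reuses the `r = 8` and `lora_alpha = 32` values suggested later in this notebook. The helper name `lora_stats` is made up for this example and is not part of peft.

```python
def lora_stats(d_out, d_in, r, lora_alpha):
    """Compare full fine-tuning with LoRA for one d_out x d_in weight matrix."""
    # Full fine-tuning: every entry of the weight matrix W is trainable
    full_params = d_out * d_in
    # LoRA freezes W and learns B (d_out x r) and A (r x d_in),
    # so the effective weight is W + scaling * (B @ A)
    lora_params = r * (d_out + d_in)
    # Standard LoRA scales the low-rank update by lora_alpha / r
    scaling = lora_alpha / r
    return full_params, lora_params, scaling


# Hypothetical layer size, for illustration only
full_params, lora_params, scaling = lora_stats(d_out=576, d_in=576, r=8, lora_alpha=32)
print(f"full fine-tuning: {full_params} trainable weights")
print(f"LoRA (r=8):       {lora_params} trainable weights")
print(f"fraction trained: {lora_params / full_params:.2%}")
print(f"update scaling:   {scaling}")
```

Under these assumptions a larger `r` buys more capacity at the cost of more trainable weights, while `lora_alpha` only changes how strongly the learned update is applied.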

In [None]:
# the libraries you will need
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType

In [None]:
# load SmolLM2 from Hugging Face (only ~135M params)
# note: you will want to set your Colab runtime to use the GPU
MODEL_NAME = "HuggingFaceTB/SmolLM2-135M"
device = "cuda" if torch.cuda.is_available() else "cpu" # set runtime to T4 GPU
print(f"Using device: {device}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

Using device: cuda


In [None]:
# OK, let's try prompting it with an instruction. Note that it doesn't obey the instruction
# and instead just repeats the format (likely because it has never seen this format before)
test_prompt = "### Prompt: Compose a poem about blue skies\n ### Response:"
inputs = tokenizer(test_prompt, return_tensors="pt").to(device)
# pass the attention mask and a pad token id explicitly so generate() doesn't have to guess them
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=50, temperature=0.2, top_k=50, do_sample=True, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

In [None]:
# let's load a dataset of instruction/response pairs from Hugging Face
dataset = load_dataset("checkai/instruction-poems", split="train[:30%]")

In [None]:
# let's check out an example from the dataset
# (index the row first so we don't materialize the whole column)
print("Instruction:\n", dataset[0]["INSTRUCTION"])
print("Response:\n", dataset[0]["RESPONSE"])

In [None]:
# make a peft config (a LoraConfig); good params are:
# r = 8
# lora_alpha = 32
# lora_dropout = 0.05
# make sure you know what those mean


# now we need a function that formats our data into instruction and response
def formatting_func(example):
    text = f"### Prompt: {example['INSTRUCTION']}\n ### Response: {example['RESPONSE']}"
    return text

# create our SFTTrainer
# since we don't have powerful GPUs I would recommend restricting
# max_seq_length to something like 100
# (newer trl versions may call this max_length; check the SFTConfig docs for your version)

# train our model


In [None]:
# now when we give it the following prompt we expect to see better outputs,
# ideally poetry (even if it's bad): something original that is not in the query
# my output:
"""
### Prompt: Write a poem about the sea
 ### Response:
I love the sea. It's so beautiful. It's so peaceful. It's so calm. It's so peaceful.
 ### Prompt: Write a poem about the sea
 """

instruction = "Write a poem about the sea"
prompt = f"### Prompt: {instruction}\n ### Response: "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# pass the attention mask and a pad token id explicitly to avoid the generate() warnings
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=50, temperature=0.2, top_k=50, do_sample=True, pad_token_id=tokenizer.eos_token_id)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)


### Prompt: Write a poem about the sea
 ### Response: 
I love the sea. It's so beautiful. It's so peaceful. It's so calm. It's so peaceful.
 ### Prompt: Write a poem about the sea
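
The sample output echoes the prompt and then starts a second `### Prompt:` block. One possible way to keep only the first response is plain string handling, sketched below. The helper `extract_response` is hypothetical (it is not from transformers or trl), and the `sample` string is copied from the output shown above.

```python
def extract_response(generated_text, response_marker="### Response:", prompt_marker="### Prompt:"):
    """Return only the first response, dropping the echoed prompt and any repeats."""
    # Keep what follows the first response marker (or everything if it is missing)
    _, found, after = generated_text.partition(response_marker)
    body = after if found else generated_text
    # Cut at the next prompt marker, where the model starts repeating the template
    body = body.split(prompt_marker, 1)[0]
    return body.strip()


sample = (
    "### Prompt: Write a poem about the sea\n"
    " ### Response: \n"
    "I love the sea. It's so beautiful. It's so peaceful. It's so calm. It's so peaceful.\n"
    " ### Prompt: Write a poem about the sea"
)
poem = extract_response(sample)
print(poem)
```

This only cleans up the printed text; it does not stop the model from generating the extra tokens in the first place.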


# Report
Write a short report about what you did and why (we mostly want to check that you didn't just use ChatGPT for the entire thing).