First, we check the GPU's CUDA compute capability and install dependencies that match it, which prevents version conflicts. GPUs with compute capability 8.0 or higher (Ampere and newer) additionally get `flash-attn` and its build dependencies.

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
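
The branch in the cell above amounts to a small lookup from compute capability to package list. The sketch below restates it as a pure function so the logic can be checked without a GPU. The function name `packages_for_capability` is hypothetical; the package lists are copied from the cell above.

```python
# Hypothetical helper: mirrors the install branch in the cell above.
BASE_PACKAGES = ["xformers", "trl", "peft", "accelerate", "bitsandbytes"]
FLASH_ATTN_PACKAGES = ["packaging", "ninja", "einops", "flash-attn"]

def packages_for_capability(major_version: int) -> list[str]:
    """Return the extra pip packages for a GPU's major compute capability."""
    if major_version >= 8:
        # Ampere and newer: flash-attn plus its build dependencies
        return FLASH_ATTN_PACKAGES + BASE_PACKAGES
    return list(BASE_PACKAGES)

print(packages_for_capability(7))
print(packages_for_capability(8))
```

The Tesla T4 used in this run reports compute capability 7.5 in the Unsloth banner, so it takes the second branch.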

Next, we install Triton and load a 4-bit quantized language model. Unsloth publishes several pre-quantized checkpoints, including Llama-3 8B, which was trained on 15 trillion tokens; 4-bit quantization keeps memory usage low.


In [None]:
!pip install triton
import triton

Collecting triton
  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 4.0 MB/s eta 0:00:00
Installing collected packages: triton
Successfully installed triton-3.0.0


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Llama 3 is up to 8k
dtype = None
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]
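
To see why 4-bit loading matters on a 15 GB GPU, here is a rough lower-bound estimate of weight storage at different precisions. It is a sketch only: the parameter count is an assumption read off the "8b" in the model name, and the helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: "8b" in the model name taken as 8 billion parameters

print(weight_memory_gb(N_PARAMS, 16))  # 16-bit weights
print(weight_memory_gb(N_PARAMS, 4))   # 4-bit weights
```

The 5.70 GB `model.safetensors` download shown above is larger than the 4-bit estimate. That is expected for a lower bound, plausibly because some tensors (for example the embeddings) stay in higher precision and quantization metadata is stored alongside the weights.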



---



Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.9.post4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
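
To make "a fraction of the model's parameters" concrete, the sketch below counts the weights LoRA adds for the settings above. The layer dimensions are assumptions about Llama-3-8B's architecture and are not stated in this notebook; `r = 16` and the list of target modules come from the cell above.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096         # model width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP width
NUM_LAYERS = 32
RANK = 16             # r from get_peft_model above

# (in_features, out_features) for each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the count matches the "Number of trainable parameters" line that the trainer prints further down, roughly half a percent of an 8-billion-parameter model.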


<a name="Data"></a>
### Data Prep

We define a grading prompt with slots for the question, the expected answer, the student's answer, and the score, then apply it to every row of the dataset. Each formatted example ends with the EOS token so the model learns where to stop generating.


In [None]:
# You are a teacher grading a quiz. You will be given the expected answers and the answers from a student. Your task is to grade the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

In [None]:
from datasets import load_dataset
import pandas as pd
dataset = pd.read_csv("/content/mohler_dataset_edited.csv")
dataset['text'] = 0
dataset.to_csv('modified_mohler_dataset.csv', index=False)
dataset.drop(['id', 'score_me', 'score_other'], inplace=True, axis=1)
dataset['score_avg'] = dataset['score_avg'] * 2
dataset

Unnamed: 0,question,desired_answer,student_answer,score_avg,text
0,What is the role of a prototype program in pro...,To simulate the behaviour of portions of the d...,High risk problems are address in the prototyp...,7.00,0
1,What is the role of a prototype program in pro...,To simulate the behaviour of portions of the d...,To simulate portions of the desired final prod...,10.00,0
2,What is the role of a prototype program in pro...,To simulate the behaviour of portions of the d...,A prototype program simulates the behaviors of...,8.00,0
3,What is the role of a prototype program in pro...,To simulate the behaviour of portions of the d...,Defined in the Specification phase a prototype...,10.00,0
4,What is the role of a prototype program in pro...,To simulate the behaviour of portions of the d...,It is used to let the users have a first idea ...,6.00,0
...,...,...,...,...,...
2268,How many steps does it take to search a node i...,The height of the tree.,log n,9.50,0
2269,How many steps does it take to search a node i...,The height of the tree.,( n(n-1) ) / 2,3.00,0
2270,How many steps does it take to search a node i...,The height of the tree.,2n-1,4.75,0
2271,How many steps does it take to search a node i...,The height of the tree.,"it takes at most h steps, where h is the heigh...",10.00,0


In [None]:
from datasets import Dataset

dataset = Dataset.from_pandas(dataset)

In [None]:
# this is basically the system prompt
alpaca_prompt = """You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
{}

### Expected Answer:
{}

### Student Answer:
{}

### Score out of 10:
{}"""

EOS_TOKEN = tokenizer.eos_token # do not forget this part!
def formatting_prompts_func(examples):
       # Access data as lists within the batch
       questions = examples["question"]
       expected_answers = examples["desired_answer"]
       student_answers = examples["student_answer"]
       outputs = examples["score_avg"]

       texts = []
       # Iterate through the batch
       for question, expected_answer, student_answer, output in zip(questions, expected_answers, student_answers, outputs):
           text = alpaca_prompt.format(question, expected_answer, student_answer, output) + EOS_TOKEN
           texts.append(text)
       return { "text" : texts }


dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/2273 [00:00<?, ? examples/s]
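
The batched formatting step can be tried without loading a model. The sketch below uses a shortened stand-in for `alpaca_prompt` with the same four slots and a placeholder EOS string; both are assumptions for illustration, and `format_batch` is a hypothetical name.

```python
# Shortened stand-in for the notebook's alpaca_prompt (same four slots).
PROMPT = """### Question:
{}

### Expected Answer:
{}

### Student Answer:
{}

### Score out of 10:
{}"""

EOS = "<|end_of_text|>"  # assumption: placeholder for tokenizer.eos_token

def format_batch(examples: dict) -> dict:
    """Turn a column-oriented batch into a list of training strings."""
    rows = zip(examples["question"], examples["desired_answer"],
               examples["student_answer"], examples["score_avg"])
    return {"text": [PROMPT.format(q, a, s, y) + EOS for q, a, s, y in rows]}

# One row taken from the dataset preview above
batch = {
    "question": ["How many steps does it take to search a node in a binary search tree?"],
    "desired_answer": ["The height of the tree."],
    "student_answer": ["log n"],
    "score_avg": [9.5],
}
print(format_batch(batch)["text"][0])
```

With `batched = True`, `Dataset.map` passes each column as a list, which is why the function zips the columns together instead of reading a single row.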

<a name="Train"></a>
### Train the model
- We do 60 steps to speed things up. For a full run, set `num_train_epochs=1` and remove `max_steps` (its default of `-1` disables the step limit).
- At this stage, we're configuring our model's training setup, where we define things like batch size and learning rate, to teach our model effectively with the data we have prepared.
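
As a rough guide to what these settings mean, the sketch below works out the effective batch size and how much of the 2,273-example dataset a 60-step run covers. The helper names are hypothetical, and the floor-style rounding of steps per epoch is an assumption about how the trainer counts updates; it may differ between library versions.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer update."""
    return per_device * grad_accum * num_gpus

def updates_per_epoch(num_examples: int, per_device: int, grad_accum: int) -> int:
    """Optimizer updates per epoch, flooring partial accumulation groups."""
    micro_batches = math.ceil(num_examples / per_device)
    return max(micro_batches // grad_accum, 1)

batch = effective_batch_size(2, 4)          # 8 examples per update
per_epoch = updates_per_epoch(2273, 2, 4)   # updates in one pass over the data
coverage = 60 * batch / 2273                # fraction of the data seen in 60 steps
print(batch, per_epoch, round(coverage, 3))  # 8 284 0.211
```

So a 60-step run sees about a fifth of the data once, which is why it is only suitable as a quick smoke test.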

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # increase this to make the model learn "better"
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/2273 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
10.109 GB of memory reserved.


In [None]:
# We're now kicking off the actual training of our model, which will spit out some statistics showing us how well it learns
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,273 | Num Epochs = 4
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,136
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,0.6899
2,0.6805
3,0.8696
4,0.6892
5,0.8523
6,0.5476
7,0.4357
8,0.701
9,1.2587
10,0.401


In [None]:

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

5491.0325 seconds used for training.
91.52 minutes used for training.
Peak reserved memory = 11.568 GB.
Peak reserved memory for training = 1.459 GB.
Peak reserved memory % of max memory = 78.438 %.
Peak reserved memory for training % of max memory = 9.893 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the question, the expected answer, and the student answer; leave the score blank so the model fills it in.

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "How many steps does it take to search a node in a binary search tree?", # Question
        "The height of the tree.", # Expected Answer
        "The height of the tree.", # Student Answer
        "", # Score out of 10
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

["<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.\n\n### Question:\nHow many steps does it take to search a node in a binary search tree?\n\n### Expected Answer:\nThe height of the tree.\n\n### Student Answer:\nThe height of the tree.\n\n### Score out of 10:\n10
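
Because `batch_decode` returns the prompt together with the completion, the score still has to be pulled out of the decoded string. A minimal sketch, assuming the model emits a number right after the `### Score out of 10:` marker; `extract_score` is a hypothetical helper, and the sample string (including its end-of-text token) is made up for illustration.

```python
import re

SCORE_MARKER = "### Score out of 10:"

def extract_score(decoded: str):
    """Return the first number after the last score marker, or None if absent."""
    _, found, tail = decoded.rpartition(SCORE_MARKER)
    if not found:
        return None  # marker missing, e.g. the output was cut off
    match = re.search(r"\d+(?:\.\d+)?", tail)
    return float(match.group()) if match else None

sample = ("### Student Answer:\nThe height of the tree.\n\n"
          "### Score out of 10:\n10<|end_of_text|>")
print(extract_score(sample))  # 10.0
```

Returning `None` rather than raising keeps a batch evaluation loop running when a generation is truncated before the score.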

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output.

In [None]:
question = "What is the role of a prototype program in problem solving?"
key = "To simulate the behaviour of portions of the desired software product."
student = "To find problem and errors in a program before it is finalized"
score = ""

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        question,
        key,
        student,
        score,
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
What is the role of a prototype program in problem solving?

### Expected Answer:
To simulate the behaviour of portions of the desired software product.

### Student Answer:
To find problem and error

### Finetuning with the SciEntsBank corpus

In [None]:
from datasets import load_dataset
import pandas as pd
dataset1 = pd.read_csv("/content/modified_SciEntsBank_dataset.csv")


In [None]:
from datasets import Dataset

dataset1 = Dataset.from_pandas(dataset1)

In [None]:
#@title Show current memory stats
gpu_stats1 = torch.cuda.get_device_properties(0)
start_gpu_memory1 = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory1 = round(gpu_stats1.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats1.name}. Max memory = {max_memory1} GB.")
print(f"{start_gpu_memory1} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
11.631 GB of memory reserved.


In [None]:


EOS_TOKEN = tokenizer.eos_token # do not forget this part!
def formatting_prompts_func1(examples):
       # Access data as lists within the batch
       questions = examples["question"]
       expected_answers = examples["reference_answer"]
       student_answers = examples["student_answer"]
       outputs = examples["score"]

       texts = []
       # Iterate through the batch
       for question, expected_answer, student_answer, output in zip(questions, expected_answers, student_answers, outputs):
           text = alpaca_prompt.format(question, expected_answer, student_answer, output) + EOS_TOKEN
           texts.append(text)
       return { "text" : texts }


dataset1 = dataset1.map(formatting_prompts_func1, batched = True)

Map:   0%|          | 0/4969 [00:00<?, ? examples/s]

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset1, # SciEntsBank data prepared above; the logged run below used `dataset` (2,273 Mohler examples)
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # increase this to make the model learn "better"
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/2273 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
# We're now kicking off the actual training of our model, which will spit out some statistics showing us how well it learns
trainer_stats1 = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,273 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 600
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,0.6903
2,0.6808
3,0.8697
4,0.6895
5,0.8525
6,0.5478
7,0.4357
8,0.7013
9,1.259
10,0.4011


In [None]:

used_memory1 = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora1 = round(used_memory1 - start_gpu_memory1, 3)
used_percentage1 = round(used_memory1/max_memory1*100, 3)
lora_percentage1 = round(used_memory_for_lora1/max_memory1*100, 3)
print(f"{trainer_stats1.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats1.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory1} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora1} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage1} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage1} %.")

2886.6429 seconds used for training.
48.11 minutes used for training.
Peak reserved memory = 13.219 GB.
Peak reserved memory for training = 1.588 GB.
Peak reserved memory % of max memory = 89.632 %.
Peak reserved memory for training % of max memory = 10.768 %.


In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "You used several methods to separate and identify the substances in mock rocks. How did you separate the salt from the water?",
        "The water was evaporated, leaving the salt.",
        "By letting it sit in a dish for a day.",
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
You used several methods to separate and identify the substances in mock rocks. How did you separate the salt from the water?

### Expected Answer:
The water was evaporated, leaving the salt.

### St

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "You used several methods to separate and identify the substances in mock rocks. How did you separate the salt from the water?",
        "The water was evaporated, leaving the salt.",
        "Let the water evaporate and the salt is left behind.",
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
You used several methods to separate and identify the substances in mock rocks. How did you separate the salt from the water?

### Expected Answer:
The water was evaporated, leaving the salt.

### St

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "How can an array be addressed in pointer/offset notation?",
        "By initializing a pointer to point to the first element of the array, and then incrementing this pointer with the index of the array element.",
        "multi-dimensional array",
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
How can an array be addressed in pointer/offset notation?

### Expected Answer:
By initializing a pointer to point to the first element of the array, and then incrementing this pointer with the index

In [None]:
# model.save_pretrained("lora_model") # Local saving
model.push_to_hub("rohand8/lora_model_review2", token = "hf_...") # Online saving; pass your own write token and never commit a real one

README.md:   0%|          | 0.00/574 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/rohand8/lora_model_review2
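
Rather than pasting a token into a cell, one option is to read it from an environment variable at run time. The sketch below assumes the token is stored in `HF_TOKEN`; the helper `get_hf_token` is hypothetical and only wraps a dictionary lookup.

```python
import os

def get_hf_token(env=os.environ, var_name: str = "HF_TOKEN") -> str:
    """Read a Hugging Face token from the environment instead of source code."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token

# Usage (commented out because it needs a real token and network access):
# model.push_to_hub("rohand8/lora_model_review2", token=get_hf_token())
```

Failing loudly when the variable is missing avoids a confusing authentication error later in the upload.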


In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "rohand8/lora_model_review2", # the adapter pushed to the Hub above; use "lora_model" if you saved locally instead
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

# alpaca_prompt = You MUST run cells from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is the role of a prototype program in problem solving?", # Question
        "To simulate the behaviour of portions of the desired software product.", # Expected Answer
        "To address major issues in the creation of the program. There is no way to account for all possible bugs in the program, but it is possible to prove the program is tangible.", # Student Answer
        "", # Score out of 10
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)


<|begin_of_text|>You are a teacher grading a quiz. You will be given the question, expected answers and the answer from a student. Your task is to assign some score to the student out of 10 marks. You will output the score out of 10 marks for each question. Grade the question with higher score if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem. Grade the question with lower marks if the student's answer is not factual or doesn't overlap with the expected answer.

### Question:
What is the role of a prototype program in problem solving?

### Expected Answer:
To simulate the behaviour of portions of the desired software product.

### Student Answer:
To address major issues i