<h3>Import libraries</h3>

In [1]:
from transformers import (
    T5Tokenizer, 
    T5ForConditionalGeneration, 
    Trainer, 
    TrainingArguments, 
    AutoTokenizer, 
    AutoModelForSeq2SeqLM, 
    RobertaForMultipleChoice,
    RobertaForQuestionAnswering,
    RobertaTokenizer
)
from datasets import load_dataset, concatenate_datasets, Dataset
import torch
import random
import sympy as sp
import sqlite3
import pandas as pd
import sqlite3
import re
import numpy as np

<h3>Import model and datasets</h3>

In [34]:
np.random.seed(42)
torch.manual_seed(42)

model_name = 'google-t5/t5-small'

aquarat = load_dataset('aqua_rat', split='train')
aquarat_reduced = aquarat.shuffle(seed=42).select(range(1000))
aquarat_test = load_dataset('aqua_rat', split='test')

dataset2_name = 'ChilleD/SVAMP'
svamp = load_dataset('ChilleD/SVAMP', split='train')
svamp_reduced = svamp.shuffle(seed=42).select(range(400))

t5_tokenizer = T5Tokenizer.from_pretrained('google-t5/t5-small')
tokenizer = t5_tokenizer

t5_small = T5ForConditionalGeneration.from_pretrained('google-t5/t5-small')
t5_small.save_pretrained('./t5_small')
t5_base = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')
t5_base.save_pretrained('./t5_base')

flan_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
flan_t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
flan_t5.save_pretrained('./flan_t5')

roberta = RobertaForQuestionAnswering.from_pretrained("FacebookAI/roberta-base")
roberta_tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
database_name = "../teacher_llm_dataset.db"
conn = sqlite3.connect(database_name)
query = "SELECT * FROM LLM_results"
df_custom = pd.read_sql_query(query, conn)
conn.close()

custom_dataset = Dataset.from_pandas(df_custom)
custom_reduced = custom_dataset.shuffle(seed=42).select(range(400))

In [4]:
print(aquarat_reduced)
print(type(svamp_reduced))
print(svamp)
print(aquarat_test)
display(df_custom)
print(custom_reduced)

Dataset({
    features: ['question', 'options', 'rationale', 'correct'],
    num_rows: 1000
})
<class 'datasets.arrow_dataset.Dataset'>
Dataset({
    features: ['ID', 'Question', 'Type', 'Answer', 'Body', 'Equation'],
    num_rows: 700
})
Dataset({
    features: ['question', 'options', 'rationale', 'correct'],
    num_rows: 254
})


Unnamed: 0,question,options,correct,LLMs_rationale,LLMs_answer
0,Peter invests a sum of money and gets back an ...,"[""A)653"", ""B)664"", ""C)698"", ""D)744"", ""E)700""]",A,**Explanation:**\n\nGiven information:\n\n* Pe...,A
1,5670/(28*13.5) = ?,"[""A)11"", ""B)15"", ""C)16"", ""D)19"", ""E)18""]",B,**Step 1: Calculate 28*13.5 = 369**\n\n**Step ...,B
2,If 11.25 m of a uniform steel rod weighs 42.75...,"[""A)22.8 kg"", ""B)26.6 kg"", ""C)28 kg"", ""D)26.5 ...",B,"Sure, here is the response:\n\nThe weight of a...",B
3,"7, 26, 63, 124, 215, 342, (....)","[""A)481"", ""B)511"", ""C)421"", ""D)391"", ""E)515""]",B,**Explanation:**\n\nThe given sequence is form...,B
4,A and B can do a piece of work in 6 days. With...,"[""A)40 days"", ""B)16 days"", ""C)6 days"", ""D)5 da...",C,**Explanation:**\n\nSince A and B can complete...,C
...,...,...,...,...,...
995,Car X began traveling at an average speed of 3...,"[""A)105"", ""B)120"", ""C)140"", ""D)147"", ""E)98""]",E,"Sure, here is the completed request:\n\nCar X ...",E
996,"Find the odd man out. 7, 14, 21, 28, 35, 50, 56","[""A)50"", ""B)25"", ""C)51"", ""D)90"", ""E)115""]",A,**Step 1:** Identify the pattern. The pattern ...,A
997,The average height of 15 girls out of a class ...,"[""A)132 cms"", ""B)141 cms"", ""C)142 cms"", ""D)152...",B,**Explanation:**\n\nThe average height of 15 g...,B
998,"In a group of 100 people, 90 have an age of mo...","[""A)0.1"", ""B)0.55"", ""C)0.65"", ""D)0.75"", ""E)0.85""]",A,"Sure, here's the explanation:\n\nThere are a t...",A


Dataset({
    features: ['question', 'options', 'correct', 'LLMs_rationale', 'LLMs_answer'],
    num_rows: 400
})


<h3>Data preprocessing</h3>

In [5]:
def preprocess_aquarat(examples):
    questions_and_options = [
        f"question: {q} options: {opts[0]} {opts[1]} {opts[2]} {opts[3]} {opts[4]}." 
        for q, opts in zip(examples["question"], examples["options"])]

    correct_answers = [opts[ord(examples["correct"][i]) - ord('A')] for i, opts in enumerate(examples["options"])]

    input_encodings = tokenizer(questions_and_options, padding="max_length", truncation=True, max_length=512)
    target_encodings = tokenizer(correct_answers, padding="max_length", truncation=True, max_length=128)
    
    return {
        "input_ids": input_encodings.input_ids,
        "attention_mask": input_encodings.attention_mask,
        "labels": target_encodings.input_ids,
        "inputs": questions_and_options,
        "answers": correct_answers
    }

def preprocess_svamp(examples):
    questions_and_options = [
        f"question: {q} context: {bod}" 
        for q, bod in zip(examples["Question"], examples["Body"])]

    correct_answers = [str(ans) for ans in examples["Answer"]]

    input_encodings = tokenizer(questions_and_options, padding="max_length", truncation=True, max_length=512)
    target_encodings = tokenizer(correct_answers, padding="max_length", truncation=True, max_length=128)
    
    return {
        "input_ids": input_encodings.input_ids,
        "attention_mask": input_encodings.attention_mask,
        "labels": target_encodings.input_ids
    }
    
def preprocess_custom(examples):
    options = [re.findall(r'"([^"]+)"', text) for text in examples["options"]]

    questions_and_options = [
        f"question: {q} options: {opts[0]} {opts[1]} {opts[2]} {opts[3]} {opts[4]}" 
        for q, opts in zip(examples["question"], options)]

    correct_answers = [corr for corr in examples["correct"]]

    input_encodings = tokenizer(questions_and_options, padding="max_length", truncation=True, max_length=512)
    target_encodings = tokenizer(correct_answers, padding="max_length", truncation=True, max_length=128)
    
    return {
        "input_ids": input_encodings.input_ids,
        "attention_mask": input_encodings.attention_mask,
        "labels": target_encodings.input_ids,
        "inputs": questions_and_options,
        "answers": correct_answers
    }

def ask_math_question(input_ids, model, tokenizer):
    # Generate the output ids with the model
    output_ids = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)[0]
    # Decode the generated ids to get the answer
    answer = tokenizer.decode(output_ids, skip_special_tokens=True)
    return answer



In [26]:
# Apply preprocessing
aqua_train_proc = aquarat_reduced.map(preprocess_aquarat, batched=True)
svamp_train_proc = svamp.map(preprocess_svamp, batched=True)
aqua_test_proc = aquarat_test.map(preprocess_aquarat, batched=True)
custom_train_proc = custom_reduced.map(preprocess_custom, batched=True)
print(type(custom_train_proc[0]['correct']))
print(custom_train_proc[0]['correct'])

#custom_train = df_custom.apply(lambda row: preprocess_custom(row), axis=1)

dataset_mixed_train = concatenate_datasets([aqua_train_proc, svamp_train_proc])

print(aqua_train_proc)
print(dataset_mixed_train)
print(custom_train_proc)
print(aqua_train_proc['answers'][0])


Map:   0%|          | 0/400 [00:00<?, ? examples/s]

<class 'str'>
A
Dataset({
    features: ['question', 'options', 'rationale', 'correct', 'input_ids', 'attention_mask', 'labels', 'inputs', 'answers'],
    num_rows: 1000
})
Dataset({
    features: ['question', 'options', 'rationale', 'correct', 'input_ids', 'attention_mask', 'labels', 'inputs', 'answers', 'ID', 'Question', 'Type', 'Answer', 'Body', 'Equation'],
    num_rows: 1700
})
Dataset({
    features: ['question', 'options', 'correct', 'LLMs_rationale', 'LLMs_answer', 'input_ids', 'attention_mask', 'labels', 'inputs', 'answers'],
    num_rows: 400
})
A)r / (r + b + w)


<h1>Training</h1>

In [7]:
#Select model and dataset

chosen_model = t5_small
chosen_dataset = aqua_train_proc
output_model = './t5-small_aquarat_fullfinetune'




In [44]:
#Select training parameters

training_args = TrainingArguments(
    output_dir="./full_checkpoints",
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=50,
    #load_best_model_at_end=True,
    #gradient_accumulation_steps=2,  # Accumulate gradients for 2 steps
    #max_grad_norm=1.0,  # Clip gradients to have a maximum norm of 1.0

)

trainer = Trainer(
    model=chosen_model,
    args=training_args,
    train_dataset=chosen_dataset,
    # eval_dataset=processed_eval_dataset, # If you have an evaluation dataset
)


In [45]:
#Train model

trainer.train()

chosen_model.save_pretrained(output_model)

  0%|          | 0/1000 [00:00<?, ?it/s]

{'loss': 15.2685, 'learning_rate': 5e-06, 'epoch': 0.4}
{'loss': 9.984, 'learning_rate': 1e-05, 'epoch': 0.8}
{'loss': 3.3519, 'learning_rate': 1.5e-05, 'epoch': 1.2}
{'loss': 0.7892, 'learning_rate': 2e-05, 'epoch': 1.6}
{'loss': 0.2485, 'learning_rate': 2.5e-05, 'epoch': 2.0}
{'loss': 0.1655, 'learning_rate': 3e-05, 'epoch': 2.4}
{'loss': 0.1064, 'learning_rate': 3.5e-05, 'epoch': 2.8}
{'loss': 0.0452, 'learning_rate': 4e-05, 'epoch': 3.2}
{'loss': 0.0271, 'learning_rate': 4.5e-05, 'epoch': 3.6}
{'loss': 0.0232, 'learning_rate': 5e-05, 'epoch': 4.0}
{'loss': 0.0202, 'learning_rate': 4.5e-05, 'epoch': 4.4}
{'loss': 0.0194, 'learning_rate': 4e-05, 'epoch': 4.8}
{'loss': 0.0195, 'learning_rate': 3.5e-05, 'epoch': 5.2}
{'loss': 0.0178, 'learning_rate': 3e-05, 'epoch': 5.6}
{'loss': 0.0172, 'learning_rate': 2.5e-05, 'epoch': 6.0}
{'loss': 0.0176, 'learning_rate': 2e-05, 'epoch': 6.4}
{'loss': 0.0162, 'learning_rate': 1.5e-05, 'epoch': 6.8}
{'loss': 0.0168, 'learning_rate': 1e-05, 'epoch':

<h1>Testing</h1>

In [35]:
# Load models for comparison
t5_small = T5ForConditionalGeneration.from_pretrained('./t5_small')
t5_base = T5ForConditionalGeneration.from_pretrained('./t5_base')
tuned_model = T5ForConditionalGeneration.from_pretrained('./custom_finetuning')
aquarat_fulltuned = T5ForConditionalGeneration.from_pretrained('./t5-small_aquarat_fullfinetune')
lora_tuned = T5ForConditionalGeneration.from_pretrained('../peft_tuned')

trainging_data = chosen_dataset
testing_data = ''

'NoneType' object has no attribute 'cadam32bit_grad_fp32'


  warn("The installed version of bitsandbytes was compiled without GPU support. "


In [39]:
#Method to prompt model

def generate_answer(model, tokenizer, prompt):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
    output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output

def evaluate_model_accuracy(model, tokenizer, dataset):
    model.eval()  # Put model in evaluation mode
    correct = 0
    total = len(dataset)

    for i, (input, answer) in enumerate(zip(dataset['input_ids'], dataset['answer'])):

        pred_answer = generate_answer(model, tokenizer, input)
        #print(pred_answer)
        #print(answer)
        # Compare predicted answer to the actual answer
        if pred_answer == answer:
            correct += 1
        
        if (i+1) % 10 == 0:
            print(f'Testing... {i+1}/{total}')
    
    accuracy = correct / total
    return accuracy

In [33]:
print('Testing T5-small on AquaRat')
t5_aquarat_accuracy = evaluate_model_accuracy(t5_small, t5_tokenizer, aqua_test_proc)
print(f'T5_small AquaRat accuracy: {t5_aquarat_accuracy}')

Testing T5-small on AquaRat
Testing... 10/254
Testing... 20/254
Testing... 30/254
Testing... 40/254
Testing... 50/254
Testing... 60/254
Testing... 70/254
Testing... 80/254
Testing... 90/254
Testing... 100/254
Testing... 110/254
Testing... 120/254
Testing... 130/254
Testing... 140/254
Testing... 150/254
Testing... 160/254
Testing... 170/254
Testing... 180/254
Testing... 190/254
Testing... 200/254
Testing... 210/254
Testing... 220/254
Testing... 230/254
Testing... 240/254
Testing... 250/254
T5_small AquaRat accuracy: 0.003937007874015748


In [32]:
print('Testing T5-small_Aqurat_fulltuned on AquaRat')
t5_aquaratfulltuned_aquarat_accuracy = evaluate_model_accuracy(aquarat_fulltuned, t5_tokenizer, aqua_test_proc)
print(f'T5-small_Aqurat_fulltuned AquaRat accuracy: {t5_aquaratfulltuned_aquarat_accuracy}')

Testing T5-small_Aqurat_fulltuned on AquaRat
Testing... 10/254
Testing... 20/254
Testing... 30/254
Testing... 40/254
Testing... 50/254
Testing... 60/254
Testing... 70/254
Testing... 80/254
Testing... 90/254
Testing... 100/254
Testing... 110/254
Testing... 120/254
Testing... 130/254
Testing... 140/254
Testing... 150/254
Testing... 160/254
Testing... 170/254
Testing... 180/254
Testing... 190/254
Testing... 200/254
Testing... 210/254
Testing... 220/254
Testing... 230/254
Testing... 240/254
Testing... 250/254
T5-small_Aqurat_fulltuned AquaRat accuracy: 0.16535433070866143


In [36]:
print('Testing T5-small_Aqurat_lora on AquaRat')
t5_aquaratlora_aquarat_accuracy = evaluate_model_accuracy(lora_tuned, t5_tokenizer, aqua_test_proc)
print(f'T5-small_Aqurat_lora AquaRat accuracy: {t5_aquaratlora_aquarat_accuracy}')

Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors


Testing T5-small_Aqurat_lora on AquaRat
Testing... 10/254
Testing... 20/254
Testing... 30/254
Testing... 40/254
Testing... 50/254
Testing... 60/254
Testing... 70/254
Testing... 80/254
Testing... 90/254
Testing... 100/254
Testing... 110/254
Testing... 120/254
Testing... 130/254
Testing... 140/254
Testing... 150/254
Testing... 160/254
Testing... 170/254
Testing... 180/254
Testing... 190/254
Testing... 200/254
Testing... 210/254
Testing... 220/254
Testing... 230/254
Testing... 240/254
Testing... 250/254
T5-small_Aqurat_lora AquaRat accuracy: 0.2440944881889764


In [40]:
print('Testing flan_T5 on AquaRat')
flanT5_aquarat_accuracy = evaluate_model_accuracy(flan_t5, t5_tokenizer, aqua_test_proc)
print(f'flan_T5 AquaRat accuracy: {flanT5_aquarat_accuracy}')

Testing flan_T5 on AquaRat
Testing... 10/254
Testing... 20/254
Testing... 30/254
Testing... 40/254
Testing... 50/254
Testing... 60/254
Testing... 70/254
Testing... 80/254
Testing... 90/254
Testing... 100/254
Testing... 110/254
Testing... 120/254
Testing... 130/254
Testing... 140/254
Testing... 150/254
Testing... 160/254
Testing... 170/254
Testing... 180/254
Testing... 190/254
Testing... 200/254
Testing... 210/254
Testing... 220/254
Testing... 230/254
Testing... 240/254
Testing... 250/254
flan_T5 AquaRat accuracy: 0.22440944881889763


In [12]:
#Training data

prompt_list = []
flan_t5_list = []
t5_small_list = []
t5_base_list = []
correct_list = []
t5_tuned = []

shuffled_dataset = custom_train_proc.shuffle(seed=42)

for qid in range(20):
    prompt = tokenizer.decode(shuffled_dataset['input_ids'][qid], skip_special_tokens=True)
    answer = tokenizer.decode(shuffled_dataset['labels'][qid], skip_special_tokens=True)
    promptflan = flan_tokenizer.decode(shuffled_dataset['input_ids'][qid], skip_special_tokens=True)

    prompt_list.append(prompt)
    flan_t5_list.append(generate_answer(flan_t5, flan_tokenizer, promptflan))
    t5_small_list.append(generate_answer(t5_small, tokenizer, prompt))
    t5_base_list.append(generate_answer(t5_base, tokenizer, prompt))
    t5_tuned.append(generate_answer(aquarat_fulltuned, tokenizer, prompt))
    correct_list.append(answer)


In [41]:
correct_dict = {
    'A':0,
    'B':0,
    'C':0,
    'D':0,
    'E':0
}

t5_fulltuned_dict = {
    'A':0,
    'B':0,
    'C':0,
    'D':0,
    'E':0
}

flant5_dict = {
    'A':0,
    'B':0,
    'C':0,
    'D':0,
    'E':0
}

for i in range(20):
    print(prompt_list[i])
    correct_dict[correct_list[i]] += 1
    t5_fulltuned_dict[t5_tuned[i][0]] += 1
    flant5_dict[flan_t5_list[i]] += 1
    print(f'Correct Answer: {correct_list[i]}')
    print(f'Flan-T5: {flan_t5_list[i]}')
    print(f'T5_small before Fine-tuning: {t5_small_list[i]}')
    print(f'T5_base before Fine-tuning: {t5_small_list[i]}')
    print(f'T5 after Fine-tuning: {t5_tuned[i]}\n')

print(f'Correct: {correct_dict}')
print(f't5 Fulltuned: {t5_fulltuned_dict}')
print(f'Flan  t5: {flant5_dict}')

question: If the average (arithmetic mean) of x + 2, x + 3, and x + 4 is 0, then x = options: A)u20134 B)u20133 C)u20132 D)u20131 E)0
Correct Answer: B
Flan-T5: C
T5_small before Fine-tuning: x = options
T5_base before Fine-tuning: x = options
T5 after Fine-tuning: C)u20132

question: How many seconds will it take for a car that is traveling at a constant rate of 90 miles per hour to travel a distance of 22 yards? (1 mile = 1,160 yards) options: A)0.8 B)0.5 C)1.0 D)1.1 E)1.2
Correct Answer: B
Flan-T5: C
T5_small before Fine-tuning: 22 yards
T5_base before Fine-tuning: 22 yards
T5 after Fine-tuning: C)1.0

question: Jack's toy box contains 13 toy soldiers, 7 model airplanes and 5 dolls. What is the probability that a randomly chosen toy from the box will be either a model airplane or a doll? options: A)35/252 B)12/25 C)(7/25)+(5/24) D)20/25 E)12
Correct Answer: B
Flan-T5: C
T5_small before Fine-tuning: model airplanes and 5 dolls
T5_base before Fine-tuning: model airplanes and 5 dolls
T

In [17]:
#Testing data

prompt_list_test = []
flan_t5_list_test = []
t5_list_test = []
correct_list_test = []
t5_tuned_test = []

for qid in range(20):
    prompt2 = tokenizer.decode(aqua_test_proc['input_ids'][qid], skip_special_tokens=True)
    answer2 = tokenizer.decode(aqua_test_proc['labels'][qid], skip_special_tokens=True)
    promptflan = flan_tokenizer.decode(aqua_test_proc['input_ids'][qid], skip_special_tokens=True)
    #promptroberta = roberta_tokenizer.decode(processed_dataset['input_ids'][qid], skip_special_tokens=True)

    prompt_list_test.append(prompt2)
    flan_t5_list_test.append(generate_answer(flan_t5, flan_tokenizer, promptflan))
    t5_list_test.append(generate_answer(t5_small, tokenizer, prompt2))
    t5_tuned_test.append(generate_answer(aquarat_fulltuned, tokenizer, prompt2))
    correct_list_test.append(answer2)

In [19]:
for i in range(20):
    print(prompt_list_test[i])
    print(f'Correct Answer: {correct_list_test[i]}')
    print(f'Flan-T5: {flan_t5_list_test[i]}')
    print(f'T5 before Fine-tuning: {t5_list_test[i]}')
    print(f'T5 after Fine-tuning: {t5_tuned_test[i]}')
    print(f'Correct: {correct_list_test[i] == t5_tuned_test[i]}\n')


question: A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45° to 60°. After how much more time will this car reach the base of the tower? options: A)5(3 + 1) B)6(3 + 2) C)7(3 – 1) D)8(3 – 2) E)None of these.
Correct Answer: A)5(3 + 1)
Flan-T5: C
T5 before Fine-tuning: a uniform speed
T5 after Fine-tuning: C)7(3 – 1)
Correct: False

question: The original price of an item is discounted 22%. A customer buys the item at this discounted price using a $20-off coupon. There is no tax on the item, and this was the only item the customer bought. If the customer paid $1.90 more than half the original price of the item, what was the original price of the item? options: A)$61 B)$65 C)$67.40 D)$70 E)$78.20.
Correct Answer: E)$78.20
Flan-T5: C
T5 before Fine-tuning: 78.20
T5 after Fine-tuning: C)$67.40
Correct: False

Negative results can be good for the result 
Probaly train on more samples than 200 to 400

multiple choice answer from a model:
BERT instead of seq2seq --> Score for every answer
T5 (seq2seq) --> prompt to gen multiple outputs and select most freq answer
Could ask model: How much from 0 to 100 do you believe this answer is correct?



In [36]:
print(generate_answer(flan_t5, tokenizer, 'question: Add 3 and 2'))

Add 2 cups of water to a bowl and mix well.
