# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

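* PEFT technique: LoRA (rank 16, applied to the query and value projections)
* Model: `allenai/scibert_scivocab_uncased` with a two-label sequence classification head
* Evaluation approach: accuracy and weighted F1 on the test split, computed with the Hugging Face `Trainer`
* Fine-tuning dataset: `deepmind/math_dataset`, `calculus__differentiate` configuration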

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
%%capture

!pip install transformers datasets peft evaluate torch

In [2]:
%%capture

!pip install -U scikit-learn scipy matplotlib

In [3]:
import sklearn

In [None]:
import torch

from peft import LoraConfig, get_peft_model

from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

from sklearn.metrics import f1_score, accuracy_score

#### Using an uncased model, on the assumption that casing matters little for this task

In [None]:
model_name = 'allenai/scibert_scivocab_uncased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# num_labels=2 frames the task as binary classification (correct vs. incorrect)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
dataset = load_dataset("deepmind/math_dataset", "calculus__differentiate")

In [None]:
config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices
    lora_alpha=32,                      # scaling factor (the update is scaled by lora_alpha / r)
    target_modules=["query", "value"],  # attention projections that receive LoRA adapters
    lora_dropout=0.05,                  # dropout applied inside the LoRA layers
    bias="none",                        # do not train any bias terms
    task_type="SEQ_CLS"                 # sequence classification: label each question as correct or incorrect
)
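To get a feel for what this configuration adds, the sketch below estimates the number of LoRA parameters by hand. The architecture numbers (12 layers, hidden size 768) are an assumption based on the BERT-base layout, not values read from the loaded model, and the count ignores the classification head.

```python
# Back-of-the-envelope LoRA sizing. The architecture numbers are assumptions
# (BERT-base layout), not values read from the loaded model.
hidden_size = 768        # assumed width of each query/value projection
num_layers = 12          # assumed number of transformer layers
r = 16                   # LoRA rank, as in the config above
lora_alpha = 32          # scaling numerator, as in the config above
targets_per_layer = 2    # "query" and "value"

# Each adapted projection gains two matrices: A (r x hidden) and B (hidden x r).
params_per_module = r * hidden_size + hidden_size * r
lora_params = params_per_module * targets_per_layer * num_layers

# The low-rank update is multiplied by lora_alpha / r before it is added.
scaling = lora_alpha / r

print(f"LoRA parameters: {lora_params:,}")
print(f"Scaling factor: {scaling}")
```

Once the model has been wrapped, `model.print_trainable_parameters()` reports the actual figure, which should be preferred over this estimate.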

In [None]:
model = get_peft_model(model, config)   # wrap the SciBERT classifier with the LoRA adapters

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [None]:
def preprocess_function(examples: dict) -> dict:
    """
    Tokenize a batch of questions and attach binary labels.

    Args:
        examples (dict): a batch of examples with "question" and "answer" columns.

    Returns:
        dict: tokenizer output (input_ids, attention_mask) plus a "labels" list.
    """
    questions = examples["question"]
    # Truncate to 512 tokens; padding is left to the data collator, which pads per batch.
    inputs = tokenizer(questions, truncation=True, max_length=512)
    # Label 1 only when the answer is the literal string "correct"; everything else is 0.
    inputs["labels"] = [1 if answer == "correct" else 0 for answer in examples["answer"]]
    return inputs
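The label rule in `preprocess_function` can be checked in isolation before mapping it over the dataset. This is a minimal sketch: `to_labels` is a hypothetical helper that mirrors the list comprehension above, and the answer strings are toy values invented for illustration.

```python
def to_labels(answers):
    """Mirror the label rule from preprocess_function on a plain list."""
    return [1 if answer == "correct" else 0 for answer in answers]


# Toy values, invented for illustration only.
toy_answers = ["correct", "incorrect", "3*x**2", "Correct"]
toy_labels = to_labels(toy_answers)
print(toy_labels)  # [1, 0, 0, 0]
```

The comparison is exact and case-sensitive, so any answer other than the literal string `"correct"` becomes label 0. It is worth inspecting a few raw `answer` values to confirm that both classes actually occur.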

In [None]:
preprocess_function??

In [None]:
dataset['train'][0], dataset['test'][0]

In [None]:
preprocessed_datasets = dataset.map(preprocess_function, batched=True)

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)   # pads each batch to the length of its longest sequence
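The idea behind dynamic padding can be shown in plain Python. This is an illustrative sketch, not the library's implementation: `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch only saves work when the sequences have not already been padded to a fixed length during tokenization.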

### Defining metrics and training arguments

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average='weighted')
    accuracy = accuracy_score(labels, preds)
    return {'f1': f1, 'accuracy': accuracy}
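To make `average='weighted'` concrete, here is a small hand-rolled version of the two metrics on toy labels. It is a sketch for understanding only; `weighted_f1` and `accuracy` are hypothetical helpers, and the notebook itself relies on scikit-learn for the real computation.

```python
from collections import Counter


def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its support."""
    support = Counter(labels)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(labels)


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


# Toy values, invented for illustration only.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

With heavily imbalanced labels, accuracy and weighted F1 can both look high even when the minority class is never predicted, so per-class scores are a useful extra check.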

In [None]:
training_args = TrainingArguments(
    output_dir="calculus_lora_scibert",    # where checkpoints and logs are written
    num_train_epochs=1,                    # one pass over the training set
    per_device_train_batch_size=16,        # samples per device per training step
    per_device_eval_batch_size=16,         # samples per device per evaluation step
    gradient_accumulation_steps=2,         # accumulate gradients over 2 batches per optimizer step (effective batch size 32 per device)
    learning_rate=1e-4,                    # initial learning rate
    fp16=torch.cuda.is_available(),        # mixed precision needs a GPU, so it is disabled on CPU
    logging_steps=50,                      # log training metrics every 50 steps
    save_steps=100,                        # save a checkpoint every 100 steps
    evaluation_strategy="epoch"            # evaluate at the end of each epoch; recent transformers releases name this eval_strategy
)
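The interaction between batch size and gradient accumulation is simple arithmetic. The sketch below assumes a single device and uses a hypothetical training-set size, so the update count is only a rough illustration.

```python
import math

per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 1               # assumption: a single GPU (or CPU)
num_train_examples = 10_000   # hypothetical size, for illustration only

# Samples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Rough number of optimizer updates in one epoch.
approx_updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)

print(effective_batch_size)
print(approx_updates_per_epoch)
```

Since `logging_steps` and `save_steps` count optimizer updates, this estimate also indicates roughly how many log lines and checkpoints one epoch would produce.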

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=preprocessed_datasets["train"],
    eval_dataset=preprocessed_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
# Saving our fine-tuned model
save_directory = "calculus_lora_scibert_saved"

In [None]:
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

print(f"Model saved to: {save_directory}")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.
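One possible way to reload the saved adapter is sketched below. It assumes the save directory and base model name used earlier in this notebook; `load_finetuned` is a hypothetical helper rather than part of any library, and it has not been run against the saved weights here.

```python
def load_finetuned(save_directory="calculus_lora_scibert_saved",
                   base_model_name="allenai/scibert_scivocab_uncased"):
    """Rebuild the base classifier and attach the saved LoRA adapter to it."""
    # Imports live inside the function so that defining it has no side effects.
    from peft import PeftModel
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # The base model must be created with the same number of labels as in training.
    base = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
    model = PeftModel.from_pretrained(base, save_directory)
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

The returned model could then be passed to a fresh `Trainer` with the same `compute_metrics` and evaluated on the test split. Running the same evaluation on the unmodified base model gives the pre-fine-tuning numbers to compare against.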

In [None]:
# Placeholder cell: I type here every hour so the session doesn't time out