<a href="https://colab.research.google.com/github/vjnadkarni/Udacity-GenerativeAI/blob/main/Udacity_Project_1_final_PEFT_LoRA_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Udacity Project #1: Apply Lightweight Fine-Tuning to a Foundation Model**
## **Multi-class classification of rich trove of news articles into 4 categories**

### By: Vijay Nadkarni
### Email: vjnadkarni@gmail.com
### Course: Generative AI

This project implements sequence classification using parameter-efficient fine-tuning (PEFT) by way of low-rank adaptation (LoRA). The base model is "FacebookAI/roberta-base", which is commonly used for sequence classification. The dataset is 'ag_news', which is drawn from a corpus of 1M+ news articles classified into 4 categories; the version loaded in this notebook has 120,000 training and 7,600 test examples.

In this project, the PEFT model is trained (fine-tuned) on the ag_news news articles, and the resulting LoRA adapter is combined with the roberta-base base model. News headlines are then fed to the model, which classifies each one into one of the 4 categories.

The choices made for each part of the project are described below.

* **PEFT technique:** This project performs parameter-efficient fine-tuning (PEFT) using low-rank adaptation (LoRA). The task is sequence classification, where news headlines are classified into one of 4 categories: World, Sports, Business and Sci/Tech. The 'peft' library freezes the weights of the selected base model (i.e. they receive no further training), and only the weights of the adapter are trained.
* **Model:** The specific model chosen is 'FacebookAI/roberta-base', which is very commonly used for sequence (text) classification and offers a good balance of performance and size. The AutoModelForSequenceClassification class from Hugging Face Transformers is used because this is a text classification (sequence classification) implementation.
* **Dataset:** The dataset used for fine-tuning is "ag_news". This dataset is a collection of more than 1 million news headlines that have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. These are classified into 4 categories: 'World', 'Sports', 'Business', 'Sci/Tech'. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
* **Evaluation approach:** The evaluation is done using the Hugging Face 'evaluate' library. Three scenarios are evaluated for accuracy of results:
  - Non fine-tuned model (original roberta-base)
  - PEFT fine-tuned model (roberta-base w/frozen weights merged with PEFT adapter)
  - Fully fine-tuned model (roberta-base retrained with ag_news dataset)
* **Fine-tuning dataset:** The 'ag_news' dataset is used. Its four-way labels map directly onto the classification head of AutoModelForSequenceClassification, which makes it well suited to this task. The underlying corpus contains over 1 million news headlines classified into four categories: 'World', 'Sports', 'Business' and 'Sci/Tech'.

## **Attribution**
I referenced a number of Hugging Face, Google DeepLearning, Medium, GitHub and YouTube tutorials and videos that covered PEFT and LoRA. In some cases, notably from Hugging Face, I took the liberty of including 1-10 line snippets of code in my project (but not the entire program). The URLs of the more important websites and videos are the following:
- https://huggingface.co/docs/peft/en/developer_guides/quantization
- https://colab.research.google.com/drive/1iERDk94Jp0UErsPf7vXyPKeiM4ZJUQ-a?usp=sharing
- https://www.youtube.com/watch?v=Us5ZFp16PaU
- https://www.youtube.com/live/g68qlo9Izf0
- https://www.youtube.com/watch?v=3fsn19OI_C8
- https://www.youtube.com/watch?v=iYr1xZn26R8
- https://www.youtube.com/watch?v=XpoKB3usmKc
- https://www.youtube.com/watch?v=YJNbgusTSF0
- https://github.com/adidror005/youtube-videos/blob/main/LLAMA_3_Fine_Tuning_for_Sequence_Classification_Actual_Video.ipynb
- https://github.com/achimoraites/machine-learning-playground/blob/main/NLP/Text%20classification/Lightweight_RoBERTa_PEFT_LORA_FineTuning.ipynb
- https://medium.com/@achillesmoraites/lightweight-roberta-sequence-classification-fine-tuning-with-lora-using-the-hugging-face-peft-8dd9edf99d19
- https://github.com/huggingface/peft/blob/main/examples/sequence_classification/LoRA.ipynb


# **DEPENDENCIES**

## **Summary**
- PEFT technique: LoRA
- Model: FacebookAI/roberta-base
- Evaluation approach: Accuracy
- Fine-tuning dataset: ag_news

### **Intro**
Fine-tuning large language models (LLMs) like RoBERTa can produce remarkable results when adapting them to specific tasks. Unfortunately, it can also be slow and computationally expensive. In a previous article, we explored Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library.

Here, we will explore how to make that fine-tuning process more efficient using LoRA (Low-Rank Adaptation) by leveraging the 🤗 PEFT (Parameter-Efficient Fine-Tuning) library.

## **Mount Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Retrieve Hugging Face User Access Token from Google Colab Secrets**

In [2]:
import os
from google.colab import userdata  # Import Colab Secrets userdata module

# Retrieve Hugging Face access token
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
hf_token = os.environ["HF_TOKEN"]

## **Log into Hugging Face Hub (non-blocking)**

In [3]:
from huggingface_hub import login, logout
login(hf_token, add_to_git_credential=True) # non-blocking login

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **(For Reviewer) Log into Hugging Face Hub manually with reviewer's HF Access Token**
### No dependence on student's token (uncomment the lines below)

In [4]:
# from huggingface_hub import notebook_login
# notebook_login()

## **Install and/or upgrade Hugging Face libraries as needed**

In [5]:
!pip install -U transformers -q
!pip install -U datasets -q
!pip install -U peft -q
!pip install -U loralib -q
!pip install -U trl -q
!pip install -U bitsandbytes -q
!pip install -U accelerate -q
!pip install -U evaluate -q

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 16.1.0 

## **Check for availability of GPU and list its type**

In [6]:
!nvidia-smi -L

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-c48eaee9-67fd-ae7d-2a4a-62b458033b4c)


# **TRAINING**

## **Dataset Preprocessing**

In [7]:
import torch
from transformers import AutoModelForSequenceClassification, RobertaModel, RobertaTokenizer, TrainingArguments, BitsAndBytesConfig, Trainer, DataCollatorWithPadding
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import random
from tqdm import tqdm

peft_model_name = 'roberta-base-peft'
modified_base = 'roberta-base-modified'
base_model = 'roberta-base'

## **Load the dataset and instantiate tokenizer**

In [8]:
dataset = load_dataset('ag_news')
tokenizer = RobertaTokenizer.from_pretrained(base_model)

## **Define function to preprocess dataset**

In [9]:
def preprocess(examples):
    tokenized = tokenizer(examples['text'], truncation=True, padding=True)
    return tokenized

## **Preprocess the dataset**

In [10]:
tokenized_dataset = dataset.map(preprocess, batched=True,  remove_columns=["text"])
train_dataset=tokenized_dataset['train']
eval_dataset=tokenized_dataset['test'].shard(num_shards=2, index=0)
test_dataset=tokenized_dataset['test'].shard(num_shards=2, index=1)
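
The two `shard` calls split the 7,600-example test split into two disjoint halves of 3,800 examples each. Which rows land in which half depends on the `contiguous` setting, whose default has differed between `datasets` releases. The pure-Python sketch below (the helper `shard_indices` is hypothetical and does not call the library) illustrates both layouts rather than asserting which one this notebook used.

```python
def shard_indices(n_rows, num_shards, index, contiguous):
    """Return the row indices that one shard would contain.

    Illustrative only: mimics the strided and contiguous layouts,
    it does not call the `datasets` library.
    """
    if contiguous:
        # Contiguous: consecutive blocks, earlier shards absorb any remainder
        base, extra = divmod(n_rows, num_shards)
        start = index * base + min(index, extra)
        stop = start + base + (1 if index < extra else 0)
        return list(range(start, stop))
    # Strided: shard `index` takes rows index, index + num_shards, ...
    return list(range(index, n_rows, num_shards))

n_test = 7600  # size of the ag_news test split in this notebook
for contiguous in (False, True):
    eval_idx = shard_indices(n_test, 2, 0, contiguous)
    test_idx = shard_indices(n_test, 2, 1, contiguous)
    # Either way the halves are disjoint and together cover the whole split
    assert set(eval_idx).isdisjoint(test_idx)
    assert len(eval_idx) + len(test_idx) == n_test
    print(contiguous, len(eval_idx), len(test_idx), eval_idx[:3], test_idx[:3])
```

In both layouts no example is shared between `eval_dataset` and `test_dataset`, which is the property the later accuracy comparison relies on.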

In [11]:
# Examine structure of 'dataset'
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [12]:
# Examine structure of 'tokenized_dataset'
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7600
    })
})

In [13]:
# Examine structure of 'train_dataset'
train_dataset

Dataset({
    features: ['label', 'input_ids', 'attention_mask'],
    num_rows: 120000
})

In [14]:
# Examine structure of 'eval_dataset'
eval_dataset

Dataset({
    features: ['label', 'input_ids', 'attention_mask'],
    num_rows: 3800
})

In [15]:
# Examine structure of 'test_dataset'
test_dataset

Dataset({
    features: ['label', 'input_ids', 'attention_mask'],
    num_rows: 3800
})

## **Obtain number of classes in dataset and their names**

In [16]:
# Extract the number of classes and their names
num_labels = dataset['train'].features['label'].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")

number of labels: 4
the labels: ['World', 'Sports', 'Business', 'Sci/Tech']


## **Create the id2label mapping which will be needed when categorizing results**

In [17]:
# Create an id2label mapping
id2label = {i: label for i, label in enumerate(class_names)}

In [18]:
# Print the mappings for reference
for i, label in enumerate(class_names):
    print(f"id: {i}, label: {label}")

id: 0, label: World
id: 1, label: Sports
id: 2, label: Business
id: 3, label: Sci/Tech
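
The reverse mapping can be handy when a label string needs to be turned back into an integer id. A minimal sketch, assuming the same four class names printed above; `label2id` is a name introduced here for illustration and is not used elsewhere in the notebook.

```python
class_names = ['World', 'Sports', 'Business', 'Sci/Tech']  # as printed above

id2label = {i: label for i, label in enumerate(class_names)}
# Inverse mapping: label string -> integer id
label2id = {label: i for i, label in id2label.items()}

# Round-trip check: every id maps to a label and back to the same id
assert all(label2id[id2label[i]] == i for i in id2label)
print(label2id)
```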


## **Instantiate data collator**

In [19]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
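
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to one global maximum. The idea can be sketched in plain Python as follows; `pad_batch` is a hypothetical helper, the token ids are made up, and the padding id of 1 is assumed to match RoBERTa's `<pad>` token.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the longest length in this batch.

    Simplified illustration of dynamic padding; the real collator also
    handles tensors, labels, and tokenizer-specific settings.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths
batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because `preprocess` already calls the tokenizer with `padding=True` inside a batched `map`, most sequences reach the collator pre-padded; dropping that argument would leave the padding to the collator, one training batch at a time.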

## **Training**
Train two models, one using LoRA and the other with full fine-tuning.

Note the LoRA training times and the number of trained parameters!

In [20]:
# use the same Training args for all models
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='steps',  # evaluate every eval_steps; recent transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
)
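
These arguments fix the length of each training run. The arithmetic below is a sketch that assumes a single GPU, no gradient accumulation, and an evaluation interval of 500 steps (the interval visible in the training logs); the variable names are introduced here for illustration.

```python
import math

train_examples = 120_000   # rows in the ag_news train split
batch_size = 16            # per_device_train_batch_size
num_epochs = 1             # num_train_epochs
eval_every = 500           # assumed evaluation/logging interval

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(total_steps, num_evaluations)
```

The resulting step count agrees with the `global_step=7500` value reported by both training runs in this notebook.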



In [21]:
def get_trainer(model):
    # Build a Trainer that shares the same arguments, datasets, and collator
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

## **Full Fine-Tuning Training**

In [22]:
full_finetuning_trainer = get_trainer(
    AutoModelForSequenceClassification.from_pretrained(
        base_model,
        id2label=id2label),
)

full_finetuning_trainer.train()

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


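| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500 | 0.4013 | 0.300813 |
| 1000 | 0.2642 | 0.249738 |
| 1500 | 0.2375 | 0.251835 |
| 2000 | 0.2229 | 0.245339 |
| 2500 | 0.2288 | 0.232494 |
| 3000 | 0.2417 | 0.210147 |
| 3500 | 0.2113 | 0.204921 |
| 4000 | 0.2034 | 0.205636 |
| 4500 | 0.1946 | 0.22046 |
| 5000 | 0.1906 | 0.198275 |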


TrainOutput(global_step=7500, training_loss=0.22287412719726563, metrics={'train_runtime': 1976.2895, 'train_samples_per_second': 60.72, 'train_steps_per_second': 3.795, 'total_flos': 2.0289992490004224e+16, 'train_loss': 0.22287412719726563, 'epoch': 1.0})

## **PEFT Training**

## **Train PEFT adapter**

### **Note:** I did *not* include BitsAndBytes in this model. After experimenting with BitsAndBytesConfig to try to improve the model, I found that the minimum PEFT loss plateaued at around 0.31 with BitsAndBytes, compared to around 0.24 without it. Furthermore, the accuracy was only around 90% with BitsAndBytes compared to 93% without it. Since my model was training perfectly well without BitsAndBytes, I decided *not* to use it, which resulted in a superior model.

### I kept the code for future reference if needed.

### **FYI:** The line "%%script false --no-raise-error" makes Jupyter Notebook ignore the entire cell.

#### *The cell below is ignored by Jupyter Notebook*



In [23]:
%%script false --no-raise-error
# The line above causes Jupyter Notebook to skip this entire cell
# Quantization config for LORA using BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # quantize quantized weights //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)

## **Instantiate the PEFT model**

In [24]:
model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    # quantization_config=quantization_config,
    id2label=id2label
    )

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### *The cell below is ignored by Jupyter Notebook*

In [25]:
%%script false --no-raise-error
# The line above causes Jupyter Notebook to skip this entire cell
# Prepare model for kbit training
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

#### *The cell below is ignored by Jupyter Notebook*

In [26]:
%%script false --no-raise-error
# The line above causes Jupyter Notebook to skip this entire cell
# Dequantize the classifier layer
for param in model.classifier.parameters():
    param.data = param.data.float()
    param.requires_grad = True

## **Configure LoRA**

In [27]:
peft_config = LoraConfig(
    task_type="SEQ_CLS", # sequence classification task
    inference_mode=False,
    r=8, # dimension of low-rank matrices
    lora_alpha=16, # scaling factor for LoRA activations
    lora_dropout=0.1,
    bias="none", # no bias weights
    )
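
For reference, the roles of `r` and `lora_alpha` can be written out as follows. This is the standard LoRA formulation, given here as background rather than taken from the notebook; $W_0$, $A$, $B$, $d$ and $k$ are notation introduced for this note.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

With `r=8` and `lora_alpha=16`, the scaling factor $\alpha / r$ is 2. The pretrained weight $W_0$ stays frozen and only $A$ and $B$ receive gradients, so each adapted $d \times k$ matrix contributes $r(d + k)$ trainable parameters instead of $dk$.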

## **Wrap the base model with the LoRA adapter and print the trainable parameters**

In [28]:
peft_model = get_peft_model(model, peft_config)

print('PEFT Model')
peft_model.print_trainable_parameters()

PEFT Model
trainable params: 888,580 || all params: 125,537,288 || trainable%: 0.7078
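
One way to account for the 888,580 figure is sketched below. It rests on two assumptions that the notebook does not state: that LoRA is attached to the query and value projections of every encoder layer, and that the classification head is kept fully trainable for the `SEQ_CLS` task. The arithmetic reproduces the printed count, which is consistent with those assumptions but does not prove them.

```python
hidden = 768       # roberta-base hidden size
num_layers = 12    # roberta-base encoder layers
r = 8              # LoRA rank from LoraConfig
num_labels = 4     # ag_news classes

# Assumption: two adapted matrices (query and value) per layer
adapted_per_layer = 2
# Each adapted matrix gets A (r x hidden) and B (hidden x r)
lora_params = num_layers * adapted_per_layer * (r * hidden + hidden * r)

# Assumption: the classification head stays fully trainable
head_dense = hidden * hidden + hidden          # classifier.dense weight + bias
head_out = hidden * num_labels + num_labels    # classifier.out_proj weight + bias

trainable = lora_params + head_dense + head_out
total = 125_537_288  # "all params" as printed above
percent = f"{100 * trainable / total:.4f}"

print(lora_params, head_dense + head_out, trainable, percent)
```

Under these assumptions the low-rank matrices themselves make up roughly a third of the trainable parameters; the rest belongs to the classification head.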


## **Get the PEFT trainer and train the PEFT model**

In [29]:
peft_lora_finetuning_trainer = get_trainer(peft_model)
peft_lora_finetuning_trainer.train()

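| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500 | 0.9322 | 0.341805 |
| 1000 | 0.3103 | 0.320526 |
| 1500 | 0.305 | 0.314143 |
| 2000 | 0.2872 | 0.305717 |
| 2500 | 0.2965 | 0.300594 |
| 3000 | 0.3077 | 0.291972 |
| 3500 | 0.2744 | 0.29505 |
| 4000 | 0.2867 | 0.290692 |
| 4500 | 0.2877 | 0.287533 |
| 5000 | 0.2739 | 0.287174 |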




TrainOutput(global_step=7500, training_loss=0.3284603535970052, metrics={'train_runtime': 1555.4345, 'train_samples_per_second': 77.149, 'train_steps_per_second': 4.822, 'total_flos': 2.0500492798385664e+16, 'train_loss': 0.3284603535970052, 'epoch': 1.0})
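
The two `TrainOutput` records can be compared directly. The short calculation below only reuses numbers already printed in this notebook; the dictionary and variable names are introduced here for illustration.

```python
# Values copied from the two TrainOutput records in this notebook
full_ft = {"runtime_s": 1976.2895, "samples_per_s": 60.72}
lora = {"runtime_s": 1555.4345, "samples_per_s": 77.149}

speedup = full_ft["runtime_s"] / lora["runtime_s"]
time_saved = 1 - lora["runtime_s"] / full_ft["runtime_s"]

print(f"LoRA speedup: {speedup:.2f}x")
print(f"Training time saved: {time_saved:.1%}")
```

The saving is smaller than the drop in trainable parameters might suggest, since the forward pass still runs through the full frozen model.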

## **Save PEFT Model**

In [30]:
tokenizer.save_pretrained(modified_base)
peft_model.save_pretrained(peft_model_name)



## **Load the saved PEFT model**

In [31]:
from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer

# Load the saved PEFT model
inference_model = AutoPeftModelForSequenceClassification.from_pretrained(peft_model_name, id2label=id2label)
tokenizer = AutoTokenizer.from_pretrained(modified_base)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
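
`AutoPeftModelForSequenceClassification` reloads the base weights and attaches the saved adapter on top, so the two remain separate objects. Folding the adapter into the base weights is a further, optional step that this notebook does not perform. The toy example below sketches what such a merge does numerically, using tiny hand-written matrices and helper functions (`matmul`, `add`, `scale`) that are purely illustrative and unrelated to the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

# Toy sizes: d = k = 3, rank r = 1, alpha = 2 (all values are made up)
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen base weight
B = [[0.5], [0.0], [-1.0]]                                 # d x r
A = [[1.0, 2.0, 0.0]]                                      # r x k
alpha, r = 2.0, 1
x = [[1.0], [1.0], [1.0]]                                  # column-vector input

# Adapter kept separate: base output plus scaled low-rank correction
separate = add(matmul(W0, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged: fold the correction into a single weight matrix first
W_merged = add(W0, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

assert separate == merged
print(merged)
```

For LoRA adapters, PEFT provides `merge_and_unload()` to perform this folding on a real model; after merging, inference needs no adapter-specific code path.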


## **Define function to categorize blocks of text**



In [32]:
def categorize(text):
  inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
  output = inference_model(**inputs)

  category = output.logits.argmax(dim=-1).item() # 'category' is a number
  label = id2label[category] # 'label' is a string identifier
  # print(f'\n Category: {category}, Label: {label}, Text: {text}') # 'text' is a text block
  return category, label
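
`categorize` returns only the winning class. If a confidence score is wanted as well, the logits can be passed through a softmax first. The sketch below uses plain Python and made-up logits; `softmax` and `top_label` are hypothetical helpers, not part of the notebook.

```python
import math

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits):
    """Return (category id, label, probability) for the highest logit."""
    probs = softmax(logits)
    category = max(range(len(probs)), key=probs.__getitem__)
    return category, id2label[category], probs[category]

# Made-up logits for illustration; not produced by the trained model
category, label, prob = top_label([-1.2, 0.3, 0.1, 4.0])
print(category, label, round(prob, 3))
```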

## **Verify the PEFT model for correct categorization of new articles from test dataset**

In [33]:
def evaluate_predictions(num_evals):
  correct_count = 0
  test_split = dataset['test']
  for i in range(num_evals):
    # randrange excludes the upper bound, so the index is always valid
    # (random.randint includes it and could index one past the end)
    random_index = random.randrange(len(test_split))
    example = test_split[random_index]  # fetch one row instead of whole columns
    test_dataset_label = example['label']
    test_dataset_id = id2label[test_dataset_label]
    test_dataset_text = example['text']
    category, label = categorize(test_dataset_text) # categorize text from dataset
    if test_dataset_label == category:
      correct_count += 1
    print(f"Random index in test dataset: {random_index}, Ground truth: {test_dataset_id}, Prediction: {label}\nNews item: {test_dataset_text}\n")

  # Percentage correct
  print(f"Number of samples evaluated: {num_evals}, Number correctly categorized: {correct_count}")
  print(f"Percentage correct: {correct_count / num_evals * 100}%")

evaluate_predictions(100)

Random index in test dataset: 5238, Ground truth: Sci/Tech, Prediction: Sci/Tech
News item: Photos Plus Music Equals an Expensive iPod (washingtonpost.com) washingtonpost.com - First Apple put some color on the iPod, when it offered the iPod mini in a palette of pastel hues, and now it has put some color inside it, in the form of the new iPod Photo.

Random index in test dataset: 912, Ground truth: Sci/Tech, Prediction: Sports
News item: Bryant Makes First Appearance at Trial (AP) AP - NBA star Kobe Bryant arrived at his sexual assault trial Monday as attorneys in the case who spent the weekend poring over questionnaires prepared to question potential jurors individually.

Random index in test dataset: 204, Ground truth: Sports, Prediction: Sports
News item: Owners Seek Best Ballpark Deal for Expos (AP) AP - Trying to get the best possible ballpark deal for the Montreal Expos, major league baseball instructed its lawyers to press ahead with negotiations involving four of the areas bidd
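
A spot check on 100 random headlines gives a fairly noisy accuracy estimate. The sketch below uses the normal approximation to the binomial distribution to compare its uncertainty with that of the 3,800-example evaluation; `accuracy_std_error` is a hypothetical helper, and the approximation is an assumption rather than anything computed by the notebook.

```python
import math

def accuracy_std_error(p, n):
    """Standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

p = 0.9155  # PEFT accuracy on the 3,800-example test shard, reported later in this notebook
for n in (100, 3800):
    se = accuracy_std_error(p, n)
    # Rough 95% interval under a normal approximation
    print(n, f"{p - 1.96 * se:.3f} to {p + 1.96 * se:.3f}")
```

With only 100 samples the interval spans several percentage points, so a spot-check result a point or two away from the full evaluation is expected.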

## **Load Models from Repository and Compare All 3 Models**
- Non fine-tuned model (base model)
- PEFT/LoRA model
- Fully fine-tuned model


In [34]:
from torch.utils.data import DataLoader
import evaluate
from tqdm import tqdm

metric = evaluate.load('accuracy')

def evaluate_model(inference_model, dataset):

    eval_dataloader = DataLoader(dataset.rename_column("label", "labels"), batch_size=8, collate_fn=data_collator)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inference_model.to(device)
    inference_model.eval()
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = batch.to(device)  # move input tensors and labels to the model's device
        with torch.no_grad():
            outputs = inference_model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(
            predictions=predictions,
            references=batch["labels"],
        )

    eval_metric = metric.compute()
    print(eval_metric)

In [35]:
# Evaluate the non fine-tuned model
evaluate_model(AutoModelForSequenceClassification.from_pretrained(base_model, id2label=id2label), test_dataset)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 475/475 [00:13<00:00, 36.46it/s]

{'accuracy': 0.25026315789473685}





In [36]:
# Evaluate the PEFT fine-tuned model
evaluate_model(inference_model, test_dataset)

100%|██████████| 475/475 [00:13<00:00, 34.88it/s]

{'accuracy': 0.9155263157894736}





In [37]:
# Evaluate the Fully fine-tuned model
evaluate_model(full_finetuning_trainer.model, test_dataset)

100%|██████████| 475/475 [00:13<00:00, 35.95it/s]


{'accuracy': 0.9452631578947368}
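
The three accuracy values are easier to compare as counts of correctly classified examples. The calculation below reuses the printed accuracies and the 3,800-row size of `test_dataset`; the variable names are introduced here for illustration.

```python
n_test = 3800  # rows in test_dataset
accuracies = {
    "base (not fine-tuned)": 0.25026315789473685,
    "PEFT/LoRA": 0.9155263157894736,
    "full fine-tuning": 0.9452631578947368,
}

counts = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, correct in counts.items():
    print(f"{name}: {correct}/{n_test} correct ({accuracies[name]:.1%})")

gap = counts["full fine-tuning"] - counts["PEFT/LoRA"]
print(f"Full fine-tuning classifies {gap} more examples correctly than LoRA")
```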


# **Conclusion**

### In this project, the PEFT/LoRA model that was implemented gave successful results in classifying news headlines into one of four categories: 'World', 'Sports', 'Business', 'Sci/Tech'. A set of 100 news headlines was drawn at random from the test dataset and presented to the PEFT model to categorize. The accuracy of responses from the PEFT model was excellent (approx 93%), and confirmed that the fine-tuning had worked.

### The PEFT model was compared against the non-tuned base model as well as against a fully fine-tuned model (non-PEFT/LoRA). The accuracy of the non-tuned base model was a poor 25%, roughly chance level for four classes, while the accuracy of the PEFT/LoRA model was significantly better at 91.6%. The fully fine-tuned model had a marginally better accuracy of 94.5%, but it took longer to train (about 1,976 s versus 1,555 s for one epoch) and had decidedly larger storage and memory requirements as well, which detracted from its overall efficiency.

### Learnings from the project were several:
- Understanding of models that are amenable for PEFT & LoRA improvements
- Learning how to use Hugging Face Hub, navigate its vast collection of models, tokenizers, and datasets, use its API, upload/download one's own models to/from it and understand the settings of various configuration parameters
- Insights into the selection of suitable models and datasets with which to perform the PEFT training, notably for sequence classification
- Comparing the percentage accuracy of non-finetuned, PEFT fine-tuned and fully fine-tuned models. Found that the accuracy of the PEFT model was almost the same as that of the fully fine-tuned model, but the untrained base model's accuracy was terrible.
- Learning how to merge the base model with PEFT fine-tuned adapters to create a super-model that provided superior results to those of the base model, while retaining all the weights and properties of the original base model.
- Learning how to use BitsAndBytes for quantization, to reduce the memory and storage requirements.
- Discovering that BitsAndBytes actually made the model slightly *worse*, relative to not using it, probably as a result of using 4-bit rather than 8-bit quantization. When using BitsAndBytes, the minimum loss for PEFT plateaued at 0.31 vs 0.24 with no BitsAndBytes. Furthermore, the accuracy was only 90% with BitsAndBytes vs 93% without BitsAndBytes. Since my model was training perfectly well without BitsAndBytes, I decided to omit it but kept the code for future reference (with a skip command for Jupyter Notebook).
