## 1 · Environment & Configuration  
This section prepares the execution environment and defines all hyper-parameters for LoRA-based 4-bit fine-tuning of a BLIP VQA model.

* **Hardware binding** – Forces the process to see **only GPU 0** and disables Weights & Biases logging for a completely self-contained run.  
* **Core libraries** – Loads PyTorch, Transformers, PEFT, Pandas, scikit-learn, TQDM, and PIL for fine-tuning, plus `bert_score` and `evaluate` for the final metrics.  
* **Dataset & model paths** – Points to a mini VQA dataset (`fullInput.csv` + image folder) and the *Salesforce/blip-vqa-base* checkpoint.  
* **Quantisation & adaptation** – Enables **4-bit QLoRA** (`USE_4BIT=True`) and sets LoRA hyper-parameters (`LORA_R`, `LORA_ALPHA`, …).  
* **Training regime** – Three epochs, batch sizes of 8 (train) and 16 (eval), and a learning rate of 5 × 10⁻⁵: conservative choices for stable convergence on a small GPU.  
* **Output structure** – `OUTPUT_DIR` is the experiment root; LoRA adapters are stored separately in `LORA_DIR` for clean deployment.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Expose only GPU 0 to avoid accidental multi-GPU allocation on Kaggle
os.environ["WANDB_DISABLED"] = "true"  # Turn off Weights & Biases logging

import sys
import traceback
import pandas as pd
import torch
import re
from bert_score import score
import evaluate
from PIL import Image
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split

from transformers import ( #Hugging Face Transformers objects for BLIP VQA
    BlipProcessor,
    BlipForQuestionAnswering,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
)
# PEFT utilities for parameter-efficient fine-tuning (LoRA, QLoRA, etc.)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training  # PeftModel is needed to reload the adapters for evaluation

DATASET_CSV = '../VR-mini-Proj-2/fullInput.csv'
IMAGE_BASE_DIR = '../images/small'
MODEL_NAME = "Salesforce/blip-vqa-base"
USE_4BIT = True
BATCH_SIZE = 8
EVAL_BATCH_SIZE = 16
NUM_EPOCHS = 3
LR = 5e-5
LORA_R = 16 # Rank of the LoRA update matrices
LORA_ALPHA = 32 # Scaling factor (α) for LoRA
LORA_DROPOUT = 0.05 # Dropout applied to LoRA layers
MAX_LEN = 128
OUTPUT_DIR = "./blip_vqa_lora_q_v8"
LORA_DIR = os.path.join(OUTPUT_DIR, "lora_adapters")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print("CUDA available:", torch.cuda.is_available(),"Device count:", torch.cuda.device_count())

CUDA available: False Device count: 0


### 2 · Custom `VQADataset`

This dataset class converts the VQA CSV into a PyTorch-friendly `Dataset` so the Hugging Face `Trainer` can stream data efficiently.

**What it does, step by step**

1. **Index sanitisation** – Iterates through every row and keeps only those whose `filename` actually exists on disk. This prevents runtime crashes caused by missing images.
2. **Image–text pairing** – For each remaining row, the corresponding image is loaded with Pillow (`RGB` mode) and paired with its *question* and *answer* strings from the CSV.
3. **Tokenisation** – A single `BlipProcessor` handles both the vision preprocessing and the text tokenisation. The ground-truth answer is passed via `text_target`, enabling teacher forcing during training.
4. **Padding and truncation** – All sequences are padded or truncated to `MAX_LEN`, so every sample has the same shape and can be stacked into a batch without a custom collator.
5. **Return format** – The method returns a dictionary whose keys match the BLIP model’s input signature (`input_ids`, `pixel_values`, `labels`, etc.). Each tensor is squeezed to remove the batch dimension so PyTorch’s default collate function can stack them later.
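
Steps 1 and 4 can be sketched without the model or any real images. The snippet below is a minimal, standard-library illustration only: `keep_existing` and `pad_or_truncate` are hypothetical helpers (they are not part of the notebook or of `BlipProcessor`), and the pad id of 0 is an assumption made for the example.

```python
import os
import tempfile

def keep_existing(filenames, img_dir):
    """Return only the filenames that exist inside img_dir (step 1)."""
    return [f for f in filenames if os.path.exists(os.path.join(img_dir, f))]

def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Force a token list to exactly max_len entries (step 4)."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Tiny demo: a temporary folder that holds a single empty "image" file.
with tempfile.TemporaryDirectory() as img_dir:
    open(os.path.join(img_dir, "a.jpg"), "wb").close()
    kept = keep_existing(["a.jpg", "missing.jpg"], img_dir)

padded = pad_or_truncate([101, 2054, 102], max_len=5)   # shorter than max_len
truncated = pad_or_truncate(list(range(10)), max_len=5)  # longer than max_len
print(kept, padded, truncated)
```

Because every sample ends up with the same length, stacking a list of such samples into a batch needs no further per-batch padding.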


In [None]:
class VQADataset(torch.utils.data.Dataset):
    """
    Thin wrapper around a Pandas DataFrame for Vision-Language Q&A.
    Performs on-disk image existence checks and tokenises each sample via
    a shared BLIP processor.
    """
    def __init__(self, df, processor, img_dir, max_len):
        self.df = df.reset_index(drop=True)
        self.proc = processor
        self.img_dir = img_dir
        self.max_len = max_len
        valid = []
        for i, r in tqdm(self.df.iterrows(), total=len(self.df), desc="Checking images"): #Filter out rows whose image files are missing
            if os.path.exists(os.path.join(img_dir, str(r['filename']))):
                valid.append(i)
        if not valid:
            raise RuntimeError("No valid images found.")
        self.df = self.df.loc[valid].reset_index(drop=True) #Keep only rows with existing images

    def __len__(self): 
        """
        Total number of valid (image, question, answer) triples
        """
        return len(self.df)

    def __getitem__(self, idx):
        """
        Load an image, pair it with its question & answer, and return the
        processor-encoded tensors expected by BLIP.
        """
        row = self.df.iloc[idx]
        img = Image.open(os.path.join(self.img_dir, str(row['filename']))).convert("RGB") #Load image as RGB
        enc = self.proc( #Tokenise + visual pre-processing in one call
            images=img,
            text=str(row['question']),
            text_target=str(row['answer']), ## ground-truth answer
            padding="max_length", #padding to max len
            truncation=True, #trunc. to max len
            max_length=self.max_len,
            return_tensors="pt", # return val are torch tensor
        )
        return {k: v.squeeze(0) for k, v in enc.items()} #Remove batch dimension because Trainer will stack later

### 3 · Data, Quantisation, and LoRA

* **Load & split** – Read the CSV, verify a `filename` column, then make an 80 / 20 train-val split with `random_state=42`.
* **Processor** – Grab `BlipProcessor` with `use_fast=True` for quicker tokenisation.
* **4-bit config** – If `USE_4BIT`, load weights in NF4 4-bit; compute in BF16 when supported (else FP16) and enable double-quant for extra compression.
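* **Model** – Load `BlipForQuestionAnswering` with `device_map="auto"` (which resolves to `cuda:0`, the only visible GPU), then call `prepare_model_for_kbit_training`.
* **LoRA** – Inject rank-16 LoRA adapters (α = 32, 5 % dropout) into the attention projections named in `target_modules`; only these adapters are trainable. The vision encoder's fused `qkv` layer is not in that list, so it stays frozen.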


In [None]:
df = pd.read_csv(DATASET_CSV)
if 'filename' not in df.columns:
    sys.exit("ERROR: 'filename' column missing in CSV")
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)  # 80/20 train/validation split

processor = BlipProcessor.from_pretrained(MODEL_NAME, use_fast=True) # Instantiate BLIP processor (handles both image transforms and tokenisation)
model_kwargs = {}
if USE_4BIT:
    quant_cfg = BitsAndBytesConfig( # Configure 4-bit weight loading via bitsandbytes
        load_in_4bit=True, 
        bnb_4bit_quant_type="nf4",  # Normal-Float 4 quant scheme
        bnb_4bit_compute_dtype=(
            torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        ),
        bnb_4bit_use_double_quant=True, # Additional compression pass
    )
    model_kwargs["quantization_config"] = quant_cfg
    model_kwargs["torch_dtype"] = (torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16)
    print("4-bit quantization configured (dtype:", model_kwargs["torch_dtype"], ")")

model = BlipForQuestionAnswering.from_pretrained(
    MODEL_NAME,
    device_map="auto",      # let Accelerate place the weights; with one visible GPU they land on cuda:0
    **model_kwargs # Injects quantisation config
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False) #Insert low-bit-friendly tweaks (casts, hooks, etc.)

lora_cfg = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "k_proj", "v_proj", "query", "key", "value"],
    lora_dropout=LORA_DROPOUT,
    bias="none",
)
model = get_peft_model(model, lora_cfg) #Inject trainable LoRA adapters
model.print_trainable_parameters()

4-bit quantization configured (dtype: torch.float16 )
trainable params: 3,538,944 || all params: 388,211,516 || trainable%: 0.9116
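
The trainable-parameter count above can be sanity-checked by hand. The sketch below rests on assumptions that the notebook does not state: that the adapters attach only to the `query`/`key`/`value` layers of the text encoder and text decoder, that each of those two stacks has 12 layers with one self-attention and one cross-attention block, and that every adapted projection is 768 → 768. Under those assumptions the arithmetic reproduces the reported figure, which is consistent with that reading but does not prove it.

```python
# Assumed architecture facts for Salesforce/blip-vqa-base (not stated in the notebook)
hidden_size = 768            # input and output width of each adapted projection
text_stacks = 2              # text encoder + text decoder
layers_per_stack = 12
attn_blocks_per_layer = 2    # self-attention + cross-attention
projections_per_block = 3    # query, key, value

LORA_R = 16
LORA_ALPHA = 32

# Each adapted Linear(in, out) gains two matrices: A (r x in) and B (out x r).
params_per_module = LORA_R * hidden_size + hidden_size * LORA_R
n_modules = text_stacks * layers_per_stack * attn_blocks_per_layer * projections_per_block
trainable = params_per_module * n_modules

# PEFT scales the low-rank update by alpha / r (default, non-rsLoRA setting).
scaling = LORA_ALPHA / LORA_R

print(n_modules, trainable, scaling)
```

With `LORA_ALPHA = 32` and `LORA_R = 16` the update is scaled by 2, so doubling the rank without touching α would halve that factor.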


### 4 · Dataset objects and training hyper-parameters

* **Datasets** – Wrap the train and validation splits in `VQADataset`, which handles image checks and tokenisation.
* **Warm-up / logging** – Warm-up steps are 10 % of total optimizer steps (but at least 100). Logging happens every ~10 % of a training epoch.
* **`TrainingArguments`** highlights  
  * 3 epochs, batch sizes 8 (train) / 16 (eval).  
  * Gradient checkpointing on (non-reentrant) to save memory.  
  * Mixed precision (`fp16=True`) when a CUDA GPU is present.  
  * Checkpoints and evaluations occur at the end of every epoch; we skip “best-model” selection to reduce disk I/O.  
  * Metrics are logged to **TensorBoard** only.
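
As a worked example of the warm-up and logging formulas, the sketch below plugs in the 39,708 training rows reported by the "Checking images" progress bar. It assumes that no rows were dropped by the image check, a single GPU, and no gradient accumulation.

```python
import math

n_train = 39708      # training rows, taken from the progress-bar output
BATCH_SIZE = 8
NUM_EPOCHS = 3

# Same expressions as in the notebook cell
warmup = max(100, int(0.1 * n_train * NUM_EPOCHS / BATCH_SIZE))
logs = max(10, int(n_train / BATCH_SIZE / 10))

# The formulas use a fractional step count (39708 / 8 = 4963.5); with the default
# drop_last=False the final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)

print(warmup, logs, steps_per_epoch)
```

Under these assumptions the run warms up for 1,489 steps and logs every 496 steps, roughly ten log lines per epoch of 4,964 steps.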


In [None]:
train_ds = VQADataset(train_df, processor, IMAGE_BASE_DIR, MAX_LEN)
val_ds   = VQADataset(val_df,   processor, IMAGE_BASE_DIR, MAX_LEN)

warmup = max(100, int(0.1 * len(train_ds) * NUM_EPOCHS / BATCH_SIZE))
logs   = max(10,  int(len(train_ds) / BATCH_SIZE / 10))
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=warmup,
    learning_rate=LR,
    weight_decay=0.01,
    logging_strategy="steps",
    logging_steps=logs,
    eval_strategy="epoch",
    save_strategy="epoch",
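    # load_best_model_at_end & metric_for_best_model are intentionally omitted
    eval_accumulation_steps=1,
    prediction_loss_only=True,
    label_names=["labels"],  # PeftModel hides the base model's signature; name the label field so the eval loss is computed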
    report_to="tensorboard",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=torch.cuda.is_available(),
)



Checking images: 100%|██████████| 39708/39708 [00:14<00:00, 2651.60it/s]
Checking images: 100%|██████████| 9928/9928 [00:02<00:00, 3561.05it/s]


### 5 · Custom `Trainer` and launch

* **CustomTrainer** – Subclasses `Trainer` only to expose a global handle (`trainer = self`) and to make `compute_loss` return the model-defined loss without extra keys.
* **Trainer instantiation** – Feeds in the LoRA-augmented model, `TrainingArguments`, and our train/val datasets.
* **Training loop** – Starts fine-tuning, then saves the LoRA adapter weights and processor to `LORA_DIR`. Any exception prints a traceback and aborts with a non-zero exit code.
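
The loss-extraction fallback inside `compute_loss` can be illustrated in isolation. This is a minimal sketch with stand-in objects: `extract_loss` is a hypothetical helper, and `SimpleNamespace` merely imitates a model output that exposes a `.loss` attribute.

```python
from types import SimpleNamespace

def extract_loss(out):
    """Return out.loss when the attribute exists, otherwise the first tuple element."""
    return out.loss if hasattr(out, "loss") else out[0]

structured = SimpleNamespace(loss=0.25, logits=[0.1, 0.9])  # stand-in for a structured model output
as_tuple = (0.25, [0.1, 0.9])                               # stand-in for a plain tuple output

print(extract_loss(structured), extract_loss(as_tuple))
```

Both calls return the same value, which is why the subclass works whether the model returns a structured output or a tuple.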


In [None]:
class CustomTrainer(Trainer):
    """
    Minimal subclass of Hugging Face Trainer.
    Adds a global handle for interactive debugging and
    returns the model's intrinsic loss without extra bookkeeping.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        global trainer
        trainer = self

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """
        Delegate forward pass to the model and extract `.loss`.
        This matches Trainer's expected signature while avoiding
        additional metric computation overhead.
        """
        out = model(**inputs)
        loss = out.loss if hasattr(out, "loss") else out[0] # fallback for tuple
        return (loss, out) if return_outputs else loss

trainer = CustomTrainer(model=model, args=training_args, train_dataset=train_ds, eval_dataset=val_ds)

print("Starting 4-bit LoRA fine-tuning…")
try:
    trainer.train()                                     # main training loop
    trainer.model.save_pretrained(LORA_DIR)            # save LoRA adapters
    processor.save_pretrained(LORA_DIR)                # save the processor
    print("Saved LoRA adapters to", LORA_DIR)
except Exception as e:
    print("Training failed:", e)
    traceback.print_exc()
    sys.exit(1)


CUDA available: True Device count: 1
4-bit quantization ready: torch.bfloat16
trainable params: 3,538,944 || all params: 388,211,516 || trainable%: 0.9116


Checking images:   0%|          | 0/39708 [00:00<?, ?it/s]

Checking images:   0%|          | 0/9928 [00:00<?, ?it/s]

No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting 4-bit LoRA fine-tuning…


Epoch,Training Loss,Validation Loss
1,8.4816,No log
2,8.4742,No log
3,8.4651,No log


Saved LoRA adapters to ./blip_vqa_lora_q_final/lora_adapters


## Post-Training Evaluation Setup

This block initializes the post-training evaluation phase. After fine-tuning the vision-language model with LoRA and 4-bit quantization, we now assess its performance using the validation dataset.

Key elements:

- **`SAVED_LORA_ADAPTER_DIR_EVAL`**: Path to the saved LoRA adapter directory. This allows the model to load only the low-rank adaptation layers for evaluation.
- **`USE_4BIT_FOR_BASE_MODEL_EVAL`**: Ensures the base model is loaded in 4-bit precision, consistent with how it was used during training.
- **`eval_df_for_final_eval`**: A copy of the original validation dataframe to preserve data integrity during evaluation.
- **`eval_processor`**: The processor (tokenizer + image processor) used for preparing inputs. The code attempts to load it from the LoRA adapter directory to ensure compatibility.


In [None]:
print("\n--- Starting Post-Training Evaluation ---")
SAVED_LORA_ADAPTER_DIR_EVAL = LORA_DIR 
USE_4BIT_FOR_BASE_MODEL_EVAL = USE_4BIT 
eval_df_for_final_eval = val_df.copy()


--- Starting Post-Training Evaluation ---


In [27]:
try:
    eval_processor = BlipProcessor.from_pretrained(SAVED_LORA_ADAPTER_DIR_EVAL, use_fast=True)
    print("Evaluation processor loaded from adapter directory.")
except Exception as e_proc:
    print(f"Could not load eval processor from adapter dir: {e_proc}. Using training processor.")
    eval_processor = processor # Fallback to the one used in training

Evaluation processor loaded from adapter directory.


### Load Base Model for Final Evaluation

This block loads the base vision-language model (`BlipForQuestionAnswering`) for final evaluation.

- If enabled, the model is loaded in 4-bit precision using `BitsAndBytesConfig` for efficient memory usage.
- Key quantization settings are the NF4 format, double quantization, and the compute dtype carried over from training (BF16 when supported, otherwise FP16).
- The model is loaded with `device_map="auto"` to utilize available hardware efficiently.
- This setup keeps the base model consistent with the training configuration, so the saved LoRA adapters can be attached to it.

In [None]:
print("Loading base model for final evaluation...")
model_kwargs_eval = {}
if USE_4BIT_FOR_BASE_MODEL_EVAL:
    # Use the same compute_dtype determined during training setup
    compute_dtype_eval = model_kwargs.get("torch_dtype", torch.float16)
    #quantization configuration for 4-bit inference
    quant_cfg_eval = BitsAndBytesConfig(
        load_in_4bit=True,                      # Enable 4-bit quantization
        bnb_4bit_quant_type="nf4",              # Use NF4 (normal float 4-bit) quantization scheme
        bnb_4bit_compute_dtype=compute_dtype_eval,  # Set compute dtype to match training
        bnb_4bit_use_double_quant=True,  
    )
    model_kwargs_eval["quantization_config"] = quant_cfg_eval #Inject quantization config and compute dtype into model loading kwargs
    model_kwargs_eval["torch_dtype"] = compute_dtype_eval
    print(f"Base model for final eval will be loaded with 4-bit. Compute dtype: {compute_dtype_eval}")

base_model_eval = BlipForQuestionAnswering.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    **model_kwargs_eval
)

Loading base model for final evaluation...
Base model for final eval will be loaded with 4-bit. Compute dtype: torch.bfloat16


In [None]:
# Confirm that the base model is correctly loaded in 4-bit mode before proceeding with evaluation
if USE_4BIT_FOR_BASE_MODEL_EVAL and hasattr(base_model_eval, 'is_loaded_in_4bit') and base_model_eval.is_loaded_in_4bit:
    print("Base model for final evaluation successfully loaded in 4-bit.")

Base model for final evaluation successfully loaded in 4-bit.


In [None]:
print(f"Loading best LoRA weights from {SAVED_LORA_ADAPTER_DIR_EVAL} for final evaluation...")
eval_model = PeftModel.from_pretrained(base_model_eval, SAVED_LORA_ADAPTER_DIR_EVAL) # Load the trained LoRA adapter into the base model for evaluation
eval_model.eval() # Set to evaluation mode
print("Best LoRA adapters loaded successfully for final evaluation.")

Loading best LoRA weights from ./blip_vqa_lora_q_final/lora_adapters for final evaluation...
Best LoRA adapters loaded successfully for final evaluation.


In [31]:
eval_model.eval()

PeftModel(
  (base_model): LoraModel(
    (model): BlipForQuestionAnswering(
      (vision_model): BlipVisionModel(
        (embeddings): BlipVisionEmbeddings(
          (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
        )
        (encoder): BlipEncoder(
          (layers): ModuleList(
            (0-11): 12 x BlipEncoderLayer(
              (self_attn): BlipAttention(
                (dropout): Dropout(p=0.0, inplace=False)
                (qkv): Linear4bit(in_features=768, out_features=2304, bias=True)
                (projection): Linear4bit(in_features=768, out_features=768, bias=True)
              )
              (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): BlipMLP(
                (activation_fn): GELUActivation()
                (fc1): Linear4bit(in_features=768, out_features=3072, bias=True)
                (fc2): Linear4bit(in_features=3072, out_features=768, bias=True)
              )
              (l

In [None]:
# Evaluation dataset and dataloader for final evaluation 
print("Creating dataset for final evaluation...")
final_eval_dataset = VQADataset(eval_df_for_final_eval, eval_processor, IMAGE_BASE_DIR, MAX_LEN)

Creating dataset for final evaluation...


Checking images:   0%|          | 0/9928 [00:00<?, ?it/s]

In [41]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
import math
predictions_ft = []  # normalised model predictions
ground_truths_normalized_ft = []  # normalised ground-truth answers
original_indices_ft = []  # original DataFrame index of each evaluated sample
num_batches_eval = math.ceil(len(eval_df_for_final_eval) / EVAL_BATCH_SIZE)  # total number of batches

### Code for Fine-Tuned Model Evaluation

This cell evaluates the fine-tuned model on the validation split. It walks through the data in batches of `EVAL_BATCH_SIZE`, loads each image (skipping any that fail to open), and generates an answer of at most 10 new tokens per question. Predictions and ground-truth answers are lower-cased and stripped of punctuation, then stored alongside the original row index for the metrics computed later.
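
The batching and normalisation logic can be tried out without the model. The sketch below is illustrative only: `normalise` and `batches` are hypothetical helper names, and the predictions and references are made-up strings, not notebook results. The regular expression is the one used in the evaluation cell.

```python
import math
import re

def normalise(answer):
    """Lower-case, trim, and strip punctuation (same regex as the evaluation cell)."""
    return re.sub(r"[^\w\s]", "", answer.strip().lower())

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up predictions and references, for illustration only
preds = ["Yes.", "two", " Gold ", "blue"]
refs = ["yes", "Two!", "gold", "red"]

n_batches = math.ceil(len(preds) / 3)
chunks = list(batches(preds, 3))

# Exact-match accuracy after normalising both sides
matches = sum(normalise(p) == normalise(r) for p, r in zip(preds, refs))
accuracy = matches / len(preds)

print(n_batches, chunks, accuracy)
```

With the notebook's 9,928 validation rows and `EVAL_BATCH_SIZE = 16`, the same ceiling division gives 621 batches, which matches the total shown by the evaluation progress bar.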

In [None]:
with torch.no_grad():  # Disable gradient computation during evaluation
    for i in tqdm(range(0, len(eval_df_for_final_eval), EVAL_BATCH_SIZE), total=num_batches_eval, desc="Evaluating Fine-tuned Model"):
        batch_df = eval_df_for_final_eval.iloc[i:i+EVAL_BATCH_SIZE]  # rows of the current batch (positional slice)
        batch_images_pil = []
        batch_questions = []
        current_batch_ground_truths = []
        current_batch_original_indices = []

        for original_df_idx, row in batch_df.iterrows():
            # Question, lower-cased answer, and image path for this row
            question = str(row['question'])
            true_answer = str(row['answer']).lower().strip()
            image_filename = str(row['filename'])
            img_path = os.path.join(IMAGE_BASE_DIR, image_filename)

            try:
                raw_image = Image.open(img_path).convert('RGB')  # Load as RGB, then add the sample to the batch lists
                batch_images_pil.append(raw_image)
                batch_questions.append(question)
                current_batch_ground_truths.append(true_answer)
                current_batch_original_indices.append(original_df_idx)
            except FileNotFoundError:
                print(f"Warning (Eval): Image not found at {img_path}. Skipping.")
            except Exception as e:
                print(f"Warning (Eval): Error loading image {img_path}: {e}. Skipping.")

        if not batch_images_pil: #Skip the batch if there are no valid images loaded
            print(f"Warning (Eval): No valid images for batch starting at {i}. Skipping.")
            continue
        
        #processing batch
        inputs = eval_processor(images=batch_images_pil, text=batch_questions, return_tensors="pt", padding=True, truncation=True).to(DEVICE)
        outputs = eval_model.generate(**inputs, max_new_tokens=10)
        batch_preds_decoded = eval_processor.batch_decode(outputs, skip_special_tokens=True)

        for pred_idx, decoded_pred in enumerate(batch_preds_decoded):
            predicted_answer = decoded_pred.strip().lower()
            predicted_answer = re.sub(r'[^\w\s]', '', predicted_answer)

            true_answer_normalized = current_batch_ground_truths[pred_idx]
            true_answer_normalized = re.sub(r'[^\w\s]', '', true_answer_normalized)

            predictions_ft.append(predicted_answer)
            ground_truths_normalized_ft.append(true_answer_normalized)
            original_indices_ft.append(current_batch_original_indices[pred_idx])

Evaluating Fine-tuned Model:   0%|          | 0/621 [00:00<?, ?it/s]

In [None]:
results_ft_df = pd.DataFrame({
    'original_index': original_indices_ft,
    'predicted_answer_ft': predictions_ft,
    'ground_truth_normalized': ground_truths_normalized_ft
})

In [52]:
results_ft_df.head()

Unnamed: 0,original_index,predicted_answer_ft,ground_truth_normalized
0,7055,hard,hard
1,49141,gold,gold
2,40947,three,three
3,12004,yes,yes
4,7905,two,two


In [53]:
eval_df = eval_df_for_final_eval

In [None]:
df_with_ft_results = eval_df.merge(results_ft_df, left_index=True, right_on='original_index', how='right')
results_ft_filename = 'vqa_results_lora_finetuned.csv'
df_with_ft_results.to_csv(results_ft_filename, index=False)
print(f"Fine-tuned results saved to {results_ft_filename}")

Fine-tuned results saved to vqa_results_lora_finetuned.csv


In [55]:
df_with_ft_results.keys()

Index(['id', 'question', 'answer', 'filename', 'original_index',
       'predicted_answer_ft', 'ground_truth_normalized'],
      dtype='object')

### Metrics

In [None]:
if not predictions_ft:
    print("Error: No valid fine-tuned predictions available to calculate metrics.")
else:
    correct_predictions_ft = sum(p == gt for p, gt in zip(predictions_ft, ground_truths_normalized_ft))
    total_valid_ft = len(predictions_ft)
    accuracy_ft = correct_predictions_ft / total_valid_ft if total_valid_ft > 0 else 0
    pred= df_with_ft_results['predicted_answer_ft']
    gt = df_with_ft_results['ground_truth_normalized']
    print(f"Fine-tuned Model Accuracy (Exact Match): {accuracy_ft:.4f}")
    try:
        bertscore = evaluate.load("bertscore")  # loaded but not used below; scoring goes through bert_score.score

        # Baseline-rescaled variant, kept for reference:
        # P, R, F1_bs = score(pred.tolist(), gt.tolist(), lang='en', verbose=False, rescale_with_baseline=True)
        # print(f"BERTScore - P: {P.mean().item():.4f}, R: {R.mean().item():.4f}, F1: {F1_bs.mean().item():.4f}")

        # Raw (unrescaled) BERTScore
        P1, R1, F1_bs1 = score(pred.tolist(), gt.tolist(), lang='en', verbose=False)
        print(f"BERTScore1 - P1: {P1.mean().item():.4f}, R1: {R1.mean().item():.4f}, F1: {F1_bs1.mean().item():.4f}")
    except Exception as e:
        print(f"Could not compute BERTScore for fine-tuned model: {e}")

print("LoRA Fine-tuning and Evaluation Script Finished.")

Fine-tuned Model Accuracy (Exact Match): 0.5701


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


▶ BERTScore → P: 0.8724, R: 0.8459, F1: 0.8572
▶ BERTScore1 → P1: 0.9785, R: 0.9740, F1: 0.9759
LoRA Fine-tuning and Evaluation Script Finished.
