## Step 1: Mounting Google Drive and Importing Dependencies

In [1]:
# Mount Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data	    LICENSE  notebooks	      qa_pairs	 results  wandb
deployment  models   project_plan.md  README.md  scripts


In [None]:
!pip install datasets --quiet

In [None]:
!pip install bert-score  --quiet

In [None]:
!pip install --upgrade openai

In [78]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
from huggingface_hub import login
import torch
from datasets import load_from_disk, load_dataset
import json
import os
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from google.colab import userdata
from google import generativeai as genai
import time
import openai
from openai import OpenAI
from getpass import getpass
import random
import numpy as np
from bert_score import score as bertscore

## Step 2: Merging the Saved Adapter Weights and Base Model

In [4]:
login()


In [4]:
# Load adapter + base model
adapter_model_path = "./models/finetuned-mistral"
base_model_name = "mistralai/Mistral-7B-Instruct-v0.3"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

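The next cell wraps the base model (`model`) with the LoRA adapter weights stored at `adapter_model_path`, which gives us the fine-tuned version of the LLM. At this point the base weights are unchanged and the adapter is applied alongside them; the merge step that follows folds the adapter into the base weights so the result can be saved and reloaded as a single standalone model.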

In [7]:
# Apply LoRA adapter
model = PeftModel.from_pretrained(model, adapter_model_path)

In [8]:
# Merge LoRA weights into base model
model = model.merge_and_unload()
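
To make the merge step less abstract, here is a minimal sketch of the arithmetic behind folding a LoRA update into a weight matrix. It uses plain Python lists, and every shape, value, and hyperparameter below is made up for illustration; it assumes the standard LoRA formulation (update `B @ A` scaled by `lora_alpha / rank`) and is not a copy of what PEFT runs internally.

```python
# Toy illustration of merging a LoRA update into a base weight.
# All matrices and hyperparameters are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
B = [[0.5], [1.0]]             # low-rank factor B (2x1)
A = [[2.0, -1.0]]              # low-rank factor A (1x2)
lora_alpha, rank = 2, 1        # hypothetical hyperparameters
scaling = lora_alpha / rank

x = [[3.0], [4.0]]             # column-vector input

# Unmerged: base path plus a separate low-rank path at every forward pass.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into one weight matrix once, then use it alone.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths produce the same output, which is why the merged model can be saved and served without the `peft` dependency.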

In [9]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((4096,), eps=1e-0

In [10]:
# Save merged model
merged_path = "./models/merged-finetuned-mistral"
model.save_pretrained(merged_path, safe_serialization=True)  # saves as safetensors
tokenizer.save_pretrained(merged_path)

print(f"Merged model saved at: {merged_path}")

Merged model saved at: ./models/merged-finetuned-mistral


## Step 3: Loading the Validation Set for Evaluation

In [5]:
eval_path = "./data/eval.jsonl"
eval_pairs = []

with open(eval_path, "r") as f:
    for line in f:
        eval_pairs.append(json.loads(line.strip()))

print(f"Loaded {len(eval_pairs)} QA pairs for evaluation.")

Loaded 30 QA pairs for evaluation.


In [6]:
eval_pairs

[{'question': 'What is the primary innovation introduced by the LoRI method for parameter-efficient fine-tuning?',
  'answer': 'LoRI introduces a novel approach that freezes the projection matrices A as random projections and sparsifies the matrices B using task-specific masks, thereby significantly reducing trainable parameters while minimizing cross-task interference.'},
 {'question': 'How does LoRI reduce the number of trainable parameters compared to traditional LoRA?',
  'answer': 'LoRI reduces the number of trainable parameters by keeping matrix A fixed as a random projection and sparsifying matrix B using task-specific masks, eliminating the need to train both matrices and reducing redundancy.'},
 {'question': 'Why is sparsity in matrix B important in LoRI?',
  'answer': 'Sparsity in matrix B enables LoRI to retain only the most critical elements necessary for adaptation, reducing parameter count and mitigating cross-task interference during adapter merging and continual learnin

## Step 4: Loading Saved Model

In [7]:
# Load merged model and tokenizer
model_path = "./models/merged-finetuned-mistral"

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [15]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

In [10]:
model.eval()
print("Model and tokenizer successfully loaded.")

Model and tokenizer successfully loaded.


## Step 5: Generating Model Predictions

In this section, we generate predictions from our **fine-tuned Mistral model** using a *closed-book* approach—i.e., without feeding any external context into the model at inference time.

We define a `generate_answer()` function that:
- Encodes the input prompt (`"Question: ... \nAnswer:"`) using the tokenizer.
- Applies greedy decoding (`do_sample=False`) with a `max_new_tokens` limit.
- Truncates the input to 512 tokens to avoid overflow.
- Returns only the portion of the output after `"Answer:"`.
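
As a rough illustration of what greedy decoding with a `max_new_tokens` cap means, the sketch below decodes from a hand-written next-token table. The table, its scores, and the `<eos>` marker are invented for this example; a real model computes scores over its whole vocabulary at each step.

```python
# Toy greedy decoder over a hypothetical next-token score table.
NEXT_TOKEN_SCORES = {
    "Answer:": {"LoRI": 2.1, "The": 1.3},
    "LoRI": {"freezes": 1.7, "is": 1.5},
    "freezes": {"A": 3.0, "B": 0.2},
    "A": {"<eos>": 4.0},
}

def greedy_decode(start, max_new_tokens):
    """Repeatedly append the highest-scoring next token (no sampling)."""
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            break
        best = max(scores, key=scores.get)  # argmax, the analogue of do_sample=False
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens[1:]  # only the newly generated tokens

generated = greedy_decode("Answer:", max_new_tokens=256)
short = greedy_decode("Answer:", max_new_tokens=2)
print(generated, short)
```

Because the choice at each step is deterministic, rerunning the evaluation with the same model and prompt should reproduce the same predictions.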

For each QA pair in our `eval_pairs` list, we:
- Format the question as an instruction-style prompt.
- Generate a prediction using our fine-tuned model.
- Save the original question, the reference answer, and the predicted answer into a results list.

Finally, we write the `results` to a file named `eval_predictions.json`, which will be used in subsequent evaluation steps (e.g., BLEU scoring, qualitative analysis, LLM-as-a-judge).

>  The last line:
> ```python
> with open("eval_predictions.json", "w") as f:
>     json.dump(results, f, indent=2)
> ```
> ...saves the full evaluation results to disk as a human-readable `.json` file. This allows us to persist the predictions for further metric computation and analysis—even if the Colab session resets.

In [16]:
def generate_answer(question: str) -> str:
    inputs = tokenizer(
        question,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=256,
            do_sample=False  # greedy decoding
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.split("Answer:")[-1].strip()
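
One caveat with the split-based extraction is worth seeing on plain strings: if the model keeps going and writes a second `Answer:`, `split("Answer:")[-1]` keeps only the text after the last one. The sketch below compares it with a prefix-based alternative. Both helper names and the sample strings are hypothetical, and the prefix approach assumes the decoded text starts with the exact prompt, which may not hold after tokenization round-trips.

```python
def extract_by_split(decoded: str) -> str:
    """Mirror of the split-based approach: keep text after the last 'Answer:'."""
    return decoded.split("Answer:")[-1].strip()

def extract_by_prefix(decoded: str, prompt: str) -> str:
    """Alternative: drop the known prompt prefix and keep everything generated after it."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()  # fall back to the full text if the prefix does not match

prompt = "Question: What does LoRI freeze?\nAnswer:"
clean = prompt + " The projection matrices A."
rambling = prompt + " The matrices A.\nQuestion: Why?\nAnswer: To cut parameters."

print(extract_by_split(clean))
print(extract_by_split(rambling))           # only the text after the second "Answer:"
print(extract_by_prefix(rambling, prompt))  # everything the model generated
```

Neither variant is strictly better: the split version can drop the real answer, while the prefix version keeps any trailing rambling, so spot-checking a few predictions by eye is still worthwhile.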

In [17]:
results = []

for item in eval_pairs:
    question = item["question"]
    reference = item["answer"]
    prediction = generate_answer(f"Question: {question}\nAnswer:")
    results.append({
        "question": question,
        "reference": reference,
        "prediction": prediction
    })

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [18]:
with open("eval_predictions.json", "w") as f:
    json.dump(results, f, indent=2)

In [76]:
output_path = "./data/evaluation/eval_predictions_closed_book.json"

os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(results, f, indent=2)

print(f"Predictions saved to {output_path}")

Predictions saved to ./data/evaluation/eval_predictions_closed_book.json


## Step 6: BLEU Score Evaluation

In this section, we evaluate our fine-tuned model using the **BLEU (Bilingual Evaluation Understudy)** score, a standard metric for evaluating the quality of generated text by comparing it to a reference answer.

### What is BLEU?
BLEU measures *n-gram overlap* between the model's prediction and the reference answer:
- **BLEU-1**: unigram overlap (word-level similarity)
- **BLEU-2**: bigram overlap (2-word chunks)
- **BLEU-3**: trigram overlap
- **BLEU-4**: 4-gram overlap (more stringent)
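
To show what "n-gram overlap" means concretely, here is a small sketch of clipped (modified) n-gram precision for a single reference, written with the standard library only. It is a simplified illustration rather than NLTK's implementation, and the two example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, prediction, n):
    """Share of predicted n-grams that also occur in the reference (counts clipped)."""
    pred_counts = Counter(ngrams(prediction, n))
    ref_counts = Counter(ngrams(reference, n))
    if not pred_counts:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / sum(pred_counts.values())

# Hypothetical reference/prediction pair that differs in one word.
reference = "LoRI freezes matrix A and sparsifies matrix B".split()
prediction = "LoRI freezes matrix A and trains matrix B".split()

p1 = modified_precision(reference, prediction, 1)  # 7 of 8 unigrams match
p2 = modified_precision(reference, prediction, 2)  # 5 of 7 bigrams match
print(p1, p2)
```

A single changed word costs one unigram but two bigrams, which is why the higher-order scores drop faster than BLEU-1.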

### Components of the Code:
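- `weights=(1, 0, 0, 0)`: Measures unigram overlap only (BLEU-1). The other calls spread the weight evenly over the first two, three, or four n-gram orders; `(0.33, 0.33, 0.33, 0)` approximates equal thirds.
- `smoothing_function=smooth` (NLTK's `SmoothingFunction().method1`): Keeps the score from collapsing to 0 when some n-gram order has no matches at all, which is common for short or paraphrased responses.
- We iterate over our evaluation dataset and compute BLEU-1 through BLEU-4 for each response, then average each metric across all responses.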
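
The sketch below shows how per-order precisions, weights, smoothing, and the brevity penalty combine into one score. It is a simplified stand-in, not NLTK's code: the function name is hypothetical, the match counts are invented, and the `epsilon=0.1` default reflects my understanding of `method1` and should be treated as an assumption.

```python
import math

def combine_bleu(counts, weights, pred_len, ref_len, epsilon=0.1):
    """Weighted geometric mean of n-gram precisions times a brevity penalty.

    counts: one (matches, total) pair per n-gram order.
    """
    if pred_len == 0:
        return 0.0
    log_sum = 0.0
    for (matches, total), weight in zip(counts, weights):
        if weight == 0:
            continue  # this n-gram order is switched off by the weights
        if total == 0:
            return 0.0
        if matches == 0:
            matches = epsilon  # smoothing: avoid log(0) when nothing matched
        log_sum += weight * math.log(matches / total)
    # Penalize predictions that are shorter than the reference.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_sum)

# Hypothetical match counts for 1- to 4-grams; no 4-gram matched.
counts = [(7, 8), (5, 7), (3, 6), (0, 5)]

bleu1 = combine_bleu(counts, (1, 0, 0, 0), pred_len=8, ref_len=8)
bleu4 = combine_bleu(counts, (0.25, 0.25, 0.25, 0.25), pred_len=8, ref_len=8)
short_pred = combine_bleu([(4, 4)], (1,), pred_len=4, ref_len=8)
print(bleu1, bleu4, short_pred)
```

Without the epsilon substitution, the zero 4-gram count would force BLEU-4 to 0 regardless of how well the lower orders matched.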

### Limitations:
BLEU is a **surface-level** metric:
- It penalizes paraphrasing.
- It doesn't understand meaning—only *form*.
- It is useful for rough comparison, but **not sufficient alone** to assess model quality.

Hence, we will also perform **qualitative evaluation** using *LLM-as-a-Judge* in the next step.

### Results:
Our average scores were:
- BLEU-1: 0.2244
- BLEU-2: 0.1126
- BLEU-3: 0.0787
- BLEU-4: 0.0514

These low scores are expected, since:
1. The evaluation was *closed-book* (no document context).
2. The questions were from **papers published in 2025**, after the model's training cutoff.
3. The model had not seen any of these papers during fine-tuning.

**Conclusion**: BLEU gives us a sense of lexical similarity. In high-difficulty settings like this one, it must be supplemented with qualitative evaluation.

In [20]:
# Load model predictions
with open("eval_predictions.json", "r") as f:
    eval_results = json.load(f)

In [21]:
smooth = SmoothingFunction().method1

In [23]:
bleu_scores = {
    "BLEU-1": [],
    "BLEU-2": [],
    "BLEU-3": [],
    "BLEU-4": []
}

In [25]:
for item in eval_results:
    reference = item["reference"].split()
    prediction = item["prediction"].split()

    bleu_scores["BLEU-1"].append(sentence_bleu([reference], prediction, weights=(1, 0, 0, 0), smoothing_function=smooth))
    bleu_scores["BLEU-2"].append(sentence_bleu([reference], prediction, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth))
    bleu_scores["BLEU-3"].append(sentence_bleu([reference], prediction, weights=(0.33, 0.33, 0.33, 0), smoothing_function=smooth))
    bleu_scores["BLEU-4"].append(sentence_bleu([reference], prediction, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth))

# Compute average BLEU scores
avg_bleu_scores = {metric: round(sum(scores)/len(scores), 4) for metric, scores in bleu_scores.items()}
print("Average BLEU Scores:", avg_bleu_scores)

Average BLEU Scores: {'BLEU-1': 0.2244, 'BLEU-2': 0.1126, 'BLEU-3': 0.0787, 'BLEU-4': 0.0514}


## Step 7: Using GPT-4o as LLM-as-a-Judge (OpenAI Evaluation)

In this section, we use **GPT-4o**—a state-of-the-art model from OpenAI—as a neutral third-party judge to evaluate the quality of our model’s predictions against ground truth answers. This is part of the **LLM-as-a-Judge** evaluation methodology, which is growing in popularity as a way to assess open-ended outputs where metrics like BLEU or ROUGE may fall short.

**What this section does:**

- Loads model predictions from `eval_predictions.json`
- Uses a GPT-4o prompt that provides:
  - The question
  - The model's generated answer
  - The reference (ground-truth) answer
- Asks GPT-4o to score the generated answer on a **scale from 1 to 5** using a rubric based on relevance, accuracy, and completeness
- Stores all outputs in `gpt4o_judgments.json` for analysis

**Key Functions:**

- `ask_gpt_judge()` → Sends a prompt to GPT-4o via the OpenAI API and returns the score as a string (or `None` if the request fails)
- `judged_results` → A list of evaluation records including the question, reference, model prediction, and GPT-4o's score
- `np.mean()` → Used at the end to compute the **average evaluation score** across all QA pairs
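
Since the judge's reply comes back as free text, it can help to normalize it before averaging. The helper below is a hypothetical addition (it is not part of the notebook) that pulls the first standalone digit from 1 to 5 out of a reply and returns `None` when there is none.

```python
import re
from typing import Optional

def parse_score(reply: Optional[str]) -> Optional[int]:
    """Return the first standalone digit 1-5 in the judge's reply, else None."""
    if not reply:
        return None
    # Lookarounds reject digits that are part of a longer number such as "10".
    match = re.search(r"(?<!\d)[1-5](?!\d)", reply)
    return int(match.group()) if match else None

for reply in ["4", "Score: 3", "4/5", "10", "I cannot judge this.", None]:
    print(repr(reply), "->", parse_score(reply))
```

A stricter variant could reject anything but a bare digit; which behaviour is preferable depends on how closely the judge follows the "Output ONLY the number" instruction.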

**Why use GPT-4o?**

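An LLM judge can compare the *meaning* of two answers rather than their surface overlap, which makes it a useful complement for open-ended outputs. GPT-4o is a strong general-purpose model, and running it with `temperature=0` keeps its scoring as repeatable as possible, but the scores remain one judge's opinion and should be read alongside the other metrics.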

This evaluation complements our BLEU score by offering a **semantic and qualitative assessment**, helping us better understand the strengths and weaknesses of our fine-tuned model.

---

In [47]:
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key:")

Enter your OpenAI API key:··········


In [48]:
openai.api_key = os.environ["OPENAI_API_KEY"]

In [56]:
# Load the API key from environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [58]:
def ask_gpt_judge(question, reference, prediction):
    prompt = f"""
You are an expert model evaluator. Given a question, a reference answer, and a model-generated answer, judge how good the model’s answer is on a scale of 1 to 5. Use the following rubric:

1 – Completely irrelevant or hallucinated.
2 – Partially related but mostly inaccurate.
3 – Mostly accurate but missing key details.
4 – Accurate and mostly complete.
5 – Nearly identical in meaning to the reference.

Be strict but fair. Output ONLY the number.

Question: {question}
Reference Answer: {reference}
Model Prediction: {prediction}

Score:"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error during evaluation:\n")
        print(e)
        return None

In [59]:
with open("eval_predictions.json") as f:
    eval_results = json.load(f)

In [60]:
judged_results = []

for i, item in enumerate(eval_results):  # use the predictions loaded from disk above
    print(f"Evaluating {i+1}/{len(eval_results)}")
    score = ask_gpt_judge(item["question"], item["reference"], item["prediction"])
    if score:
        judged_results.append({
            "question": item["question"],
            "reference": item["reference"],
            "prediction": item["prediction"],
            "gpt4o_score": score
        })
    time.sleep(1.2)

Evaluating 1/30
Evaluating 2/30
Evaluating 3/30
Evaluating 4/30
Evaluating 5/30
Evaluating 6/30
Evaluating 7/30
Evaluating 8/30
Evaluating 9/30
Evaluating 10/30
Evaluating 11/30
Evaluating 12/30
Evaluating 13/30
Evaluating 14/30
Evaluating 15/30
Evaluating 16/30
Evaluating 17/30
Evaluating 18/30
Evaluating 19/30
Evaluating 20/30
Evaluating 21/30
Evaluating 22/30
Evaluating 23/30
Evaluating 24/30
Evaluating 25/30
Evaluating 26/30
Evaluating 27/30
Evaluating 28/30
Evaluating 29/30
Evaluating 30/30
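
The loop above paces requests with a fixed `time.sleep(1.2)` and drops any item whose request fails. One possible refinement, sketched here under my own assumptions, is to retry a failed call a few times with exponential backoff. `call_with_retries` and `flaky_judge` are hypothetical names; the latter is only a stand-in for the real API call so the example runs offline.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                print(f"Giving up after {max_attempts} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)

# Stand-in for the judge call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_judge():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "4"

waits = []  # record the delays instead of actually sleeping
score = call_with_retries(flaky_judge, sleep=waits.append)
print(score, waits)
```

In the notebook this would wrap the call as `call_with_retries(lambda: ...)`, provided the wrapped function raises on failure instead of catching the exception itself.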


In [61]:
with open("gpt4o_judgments.json", "w") as f:
    json.dump(judged_results, f, indent=2)

In [65]:
for sample in judged_results:
    print(" Question:", sample["question"])
    print(" Reference Answer:", sample["reference"])
    print(" Model Prediction:", sample["prediction"])
    print(" GPT-4o Evaluation:", sample["gpt4o_score"])
    print("-" * 80)

 Question: What is the primary innovation introduced by the LoRI method for parameter-efficient fine-tuning?
 Reference Answer: LoRI introduces a novel approach that freezes the projection matrices A as random projections and sparsifies the matrices B using task-specific masks, thereby significantly reducing trainable parameters while minimizing cross-task interference.
 Model Prediction: LoRI introduces a novel method for parameter-efficient fine-tuning by learning a low-rank matrix that approximates the full weight matrix, enabling efficient adaptation without storing the entire weight matrix.
 GPT-4o Evaluation: 2
--------------------------------------------------------------------------------
 Question: How does LoRI reduce the number of trainable parameters compared to traditional LoRA?
 Reference Answer: LoRI reduces the number of trainable parameters by keeping matrix A fixed as a random projection and sparsifying matrix B using task-specific masks, eliminating the need to train

In [68]:
for res in judged_results[:3]:
    print(res["gpt4o_score"])

2
2
2


In [69]:
scores = [int(res["gpt4o_score"]) for res in judged_results if res["gpt4o_score"].isdigit()]
average_score = np.mean(scores)
print(f"Average GPT-4o Evaluation Score: {average_score:.2f} out of 5")

Average GPT-4o Evaluation Score: 2.27 out of 5
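
A single average hides how the scores are distributed and how many replies were skipped by the `isdigit()` filter. The sketch below summarizes both; the function name is hypothetical and the sample scores are invented, not taken from the run above.

```python
from collections import Counter
from statistics import mean

def summarize_scores(raw_scores):
    """Split judge replies into usable integer scores and skipped entries."""
    valid = [int(s.strip()) for s in raw_scores if isinstance(s, str) and s.strip().isdigit()]
    return {
        "n_valid": len(valid),
        "n_skipped": len(raw_scores) - len(valid),
        "mean": round(mean(valid), 2) if valid else None,
        "histogram": dict(sorted(Counter(valid).items())),
    }

# Made-up replies, including one the filter would drop.
sample = ["2", "2", "3", "1", "n/a", "4"]
summary = summarize_scores(sample)
print(summary)
```

Reporting the skipped count next to the mean makes it clear when the average is based on fewer than all 30 QA pairs.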


In [75]:
output_path = "./data/evaluation/eval_gpt4o_judgments_closed_book.json"

os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(judged_results, f, indent=2)

print(f"Judged results saved to {output_path}")

Judged results saved to ./data/evaluation/eval_gpt4o_judgments_closed_book.json


## Step 8: Evaluating with BERTScore (Semantic Similarity Metric)

In this section, we evaluate the semantic similarity between the model’s predictions and the ground truth answers using **BERTScore**, a metric that leverages contextual embeddings from large pretrained models (like BERT) to assess the *meaning* of the outputs.

Unlike BLEU, which only considers surface-level n-gram overlap, BERTScore measures how semantically close the answers are—even when the phrasing differs.

### Interpretation:
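- **BERTScore F1** reflects the degree of **semantic overlap** between the model output and the reference answer, combining precision (how much of the prediction is supported by the reference) and recall (how much of the reference is covered by the prediction).
- A score closer to **1.0** indicates stronger alignment of meaning. Because the scores here are computed with `rescale_with_baseline=True`, they are spread over a wider range and read lower than unrescaled BERTScore values would.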
- This metric is especially useful in open-ended QA or summarization settings where **exact matching isn't expected**.
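
To illustrate the matching idea behind these numbers, the sketch below computes precision, recall, and F1 by greedy cosine matching between token vectors. The 2-D vectors are made up; the real metric uses contextual embeddings from a pretrained model, and this version leaves out optional extras such as IDF weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(pred_vecs, ref_vecs):
    """Each token is matched to its most similar token on the other side."""
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical token embeddings: the reference has one token the prediction lacks.
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Here every predicted token has an exact counterpart, so precision is perfect, while the unmatched reference token pulls recall and F1 down.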

> Current Result:  
> `Average BERTScore (F1): 0.3464`  
> *This is modest, and expected for a closed-book setup on unseen 2025 research content. It will likely improve once context is added or RAG is enabled.*

In [79]:
# Replace `results` with `judged_results` if needed
predictions = [item["prediction"] for item in results]
references = [item["reference"] for item in results]

In [None]:
P, R, F1 = bertscore(predictions, references, lang="en", rescale_with_baseline=True)
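
For reading the numbers that follow, it helps to know what baseline rescaling does: as I understand it, a raw score is mapped linearly so that the baseline lands at 0 and a perfect match stays at 1. The sketch below shows that mapping. The baseline and raw values are invented for illustration; the actual baselines ship with the `bert_score` package and depend on the model and language.

```python
def rescale(raw_score: float, baseline: float) -> float:
    """Linear rescaling: baseline maps to 0.0, a perfect score stays at 1.0."""
    return (raw_score - baseline) / (1.0 - baseline)

baseline = 0.83  # hypothetical baseline similarity
raw = 0.89       # hypothetical raw BERTScore

rescaled = rescale(raw, baseline)
print(f"raw={raw:.2f} -> rescaled={rescaled:.4f}")
```

Under these made-up values a raw score that looks high ends up near the middle of the rescaled range, which is why rescaled results should not be compared directly with unrescaled ones reported elsewhere.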

In [87]:
print(f"Average Precision: {P.mean().item():.4f}")

Average Precision: 0.3634


In [86]:
print(f"Average Recall: {R.mean().item():.4f}")

Average Recall: 0.3280


In [83]:
print(f"Average BERTScore (F1): {F1.mean().item():.4f}")

Average BERTScore (F1): 0.3464
