# Open-Book Evaluation Summary (Instruction-Tuned Mistral-7B)

This notebook presents the **open-book evaluation** of a Mistral-7B model fine-tuned via LoRA on instruction-following QA pairs from scientific papers. Unlike closed-book evaluation, the model was given context passages (abstract, introduction, and conclusion) for each question to simulate information-grounded reasoning.

### Evaluation Metrics

| Metric             | Closed-Book Score | Open-Book Score | Δ Improvement |
|--------------------|------------------:|----------------:|--------------:|
| **BLEU-1**         | 0.2244            | **0.4010**       | +0.1766       |
| **BLEU-4**         | 0.0514            | **0.1625**       | +0.1111       |
| **BERTScore (F1)** | 0.3464            | **0.5295**       | +0.1831       |
| **LLM-as-Judge**   | 2.27 / 5          | **3.97 / 5**     | +1.70         |
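
The Δ column can be reproduced directly from the two score columns. The sketch below copies the scores from the table; the dictionary names and the relative-gain figure are illustrative additions, not part of the original notebook.

```python
# Scores copied from the table above.
closed_book = {"BLEU-1": 0.2244, "BLEU-4": 0.0514,
               "BERTScore (F1)": 0.3464, "LLM-as-Judge": 2.27}
open_book = {"BLEU-1": 0.4010, "BLEU-4": 0.1625,
             "BERTScore (F1)": 0.5295, "LLM-as-Judge": 3.97}

# Absolute improvement (the Δ column), rounded to four decimals.
deltas = {m: round(open_book[m] - closed_book[m], 4) for m in closed_book}

# Relative improvement over the closed-book score (illustrative extra).
relative = {m: deltas[m] / closed_book[m] for m in closed_book}

for metric in closed_book:
    print(f"{metric:<15} {deltas[metric]:+.4f}  ({relative[metric]:+.1%})")
```

Note that the LLM-as-Judge row is on a 1 to 5 scale, so its absolute Δ is not comparable with the other three rows.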


### Key Takeaways

- **Substantial quality gains** were observed across all four metrics on the 30-pair evaluation set when the model was supplied with contextual excerpts.
- **Judged answer quality rose sharply**: the mean GPT-4o score increased from 2.27 to 3.97 out of 5, which suggests fewer irrelevant or hallucinated answers (rubric score 1).
- The model retained its **instruction-following capability** while leveraging external context to improve factual grounding.
- These results support the **effectiveness of context-augmented prompting** even without full RAG infrastructure, though the evaluation set is small.

> Next Step: Implement **Retrieval-Augmented Generation (RAG)** using FAISS to dynamically select relevant context from a larger paper corpus during inference, followed by final evaluation.

---

## Step 1: Mounting Google Drive and Importing Dependencies

In [None]:
# Mount Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data		       gpt4o_judgments.json  notebooks	      README.md  wandb
deployment	       LICENSE		     project_plan.md  results
eval_predictions.json  models		     qa_pairs	      scripts


In [None]:
!pip install datasets bert-score openai --quiet


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
from huggingface_hub import login
import torch
from datasets import load_from_disk, load_dataset
import json
import os
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from google.colab import userdata
import time
import openai
from openai import OpenAI
from getpass import getpass
import random
import numpy as np
from bert_score import score as bertscore

## Step 2: Loading the Validation Set for Evaluation

In [None]:
eval_path = "./data/eval_with_context.jsonl"
eval_pairs = []

with open(eval_path, "r") as f:
    for line in f:
        eval_pairs.append(json.loads(line.strip()))

print(f"Loaded {len(eval_pairs)} QA pairs for evaluation.")

Loaded 30 QA pairs for evaluation.


In [None]:
eval_pairs[0]

{'question': 'What is the primary innovation introduced by the LoRI method for parameter-efficient fine-tuning?',
 'answer': 'LoRI introduces a novel approach that freezes the projection matrices A as random projections and sparsifies the matrices B using task-specific masks, thereby significantly reducing trainable parameters while minimizing cross-task interference.',
 'context': 'Abstract:\n\nLow-Rank Adaptation (LoRA) has emerged as a popular parameter- efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices A as random projections and sparsifies the matrices B using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in ad

## Step 3: Loading Saved Model

In [None]:
# Load merged model and tokenizer
model_path = "./models/merged-finetuned-mistral"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")


In [None]:
model.eval()
print("Model and tokenizer successfully loaded.")

Model and tokenizer successfully loaded.


## Step 4: Generating Model Predictions

In this section, we evaluate the performance of our fine-tuned Mistral-7B model on a set of *context-rich QA pairs*. Each sample contains:
- A **context** (drawn from the abstract, introduction, and conclusion of a scientific paper),
- A **question** (designed to probe the model's semantic reasoning), and
- A **reference answer** (manually crafted and grounded in the given context).

We pass the model the following formatted prompt:

``Context: {context}``

``Question: {question}
Answer:``


Using greedy decoding (`do_sample=False`) ensures deterministic outputs for reproducibility. The predictions are stripped and stored alongside their corresponding question and reference answer.

The results are saved to `eval_openbook_predictions.json` (with a second copy under `./data/evaluation/`). The same predictions are then scored with **BLEU**, **LLM-as-judge**, and **BERTScore** in the steps that follow.

This step measures the model's ability to answer in an open-book setting: it has access to the relevant context, but must still locate the right facts, synthesize them, and generate an answer that is semantically aligned with the reference.
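
The prompt template and the answer extraction described above can be sketched as two small string helpers. `build_prompt` and `extract_answer` are hypothetical names introduced for illustration; the notebook builds the prompt inline inside `generate_answer_with_context`.

```python
def build_prompt(context: str, question: str) -> str:
    """Format one evaluation sample using the template shown above."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def extract_answer(decoded: str, prompt: str) -> str:
    """Return only the generated answer from the decoded model output.

    Causal LMs echo the prompt, so strip it as a prefix when possible.
    Falling back to splitting on "Answer:" is less robust, because the
    context or the generation itself may contain that string.
    """
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.split("Answer:")[-1].strip()


# Toy usage with a made-up sample (not from the evaluation set).
prompt = build_prompt("LoRA adds low-rank adapters.", "What does LoRA add?")
decoded = prompt + " Low-rank adapters. "
print(extract_answer(decoded, prompt))
```

Stripping the prompt as a prefix assumes the tokenizer round-trips the prompt text exactly; when that is not guaranteed, slicing off the prompt tokens before decoding is the safer option.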

In [None]:
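def generate_answer_with_context(context: str, question: str) -> str:
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # NOTE: with the default right-side truncation, a context longer than
    # 2048 tokens would cut off the question and the "Answer:" cue.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=512,
            do_sample=False,  # Greedy decoding
            pad_token_id=tokenizer.pad_token_id
        )

    # Decode only the newly generated tokens, so an "Answer:" string inside
    # the context or the generation cannot corrupt the extracted answer.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()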

In [None]:
# Run predictions
results = []

for item in eval_pairs:
    question = item["question"]
    reference = item["answer"]
    context = item["context"]

    prediction = generate_answer_with_context(context, question)

    results.append({
        "question": question,
        "reference": reference,
        "context": context,
        "prediction": prediction
    })

In [None]:
# Save for evaluation
with open("eval_openbook_predictions.json", "w") as f:
    json.dump(results, f, indent=2)

In [None]:
output_path = "./data/evaluation/eval_predictions_open_book.json"

os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(results, f, indent=2)

print(f"Predictions saved to {output_path}")

Predictions saved to ./data/evaluation/eval_predictions_open_book.json


## Step 5: BLEU Score Evaluation

In this section, we evaluate our fine-tuned model using the **BLEU (Bilingual Evaluation Understudy)** score, a standard metric for evaluating the quality of generated text by comparing it to a reference answer.

### What is BLEU?
BLEU measures *n-gram overlap* between the model's prediction and the reference answer:
- **BLEU-1**: unigram overlap (word-level similarity)
- **BLEU-2**: bigram overlap (2-word chunks)
- **BLEU-3**: trigram overlap
- **BLEU-4**: 4-gram overlap (more stringent)
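
To make the n-gram overlap concrete, here is a minimal pure-Python sketch of the clipped (modified) n-gram precision that BLEU is built on. It is not the NLTK implementation used later in this notebook, and it omits the brevity penalty and the geometric mean; the function names and the toy sentences are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it
    # appears in the reference ("clipping").
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
print(precisions)  # 5/6, 3/5, 1/4 and 0 for n = 1..4
```

A single changed word leaves unigram precision high but drives the 4-gram precision to zero, which is why BLEU-4 is the more stringent score.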

### Components of the Code:
- `weights=(1, 0, 0, 0)`: Measures unigram overlap only (BLEU-1).
- `smoothing_function=method1`: Prevents the BLEU score from collapsing to 0 when some n-gram order has no matches, by substituting a small constant for the zero match count. This is useful for short or paraphrased responses.
- We iterate over our evaluation dataset and compute BLEU-1 through BLEU-4 for each response.
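
The effect of the weights and of smoothing can be seen in a small sketch that combines per-order match counts into a BLEU-style score. This is written in the spirit of NLTK's `method1` (a zero match count is replaced by a small `epsilon`), not a copy of it; `smoothed_bleu`, the example counts, and the fixed brevity penalty of 1.0 are assumptions made for illustration.

```python
import math


def smoothed_bleu(counts, weights, brevity_penalty=1.0, epsilon=0.1):
    """Weighted geometric mean of n-gram precisions with epsilon smoothing.

    counts:  list of (matched, total) pairs, one per n-gram order.
    weights: one weight per order; orders with weight 0 are ignored.
    """
    log_sum = 0.0
    for (matched, total), weight in zip(counts, weights):
        if weight == 0:
            continue
        if total == 0:
            return 0.0
        # Smoothing: a zero match count becomes epsilon instead of 0,
        # so log() stays finite and the score does not collapse to 0.
        numerator = matched if matched > 0 else epsilon
        log_sum += weight * math.log(numerator / total)
    return brevity_penalty * math.exp(log_sum)


# (matched, total) for n = 1..4 for a candidate that differs from its
# six-token reference by one word; the 4-gram order has no matches.
counts = [(5, 6), (3, 5), (1, 4), (0, 3)]

bleu1 = smoothed_bleu(counts, weights=(1, 0, 0, 0))
bleu4 = smoothed_bleu(counts, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.4f}  BLEU-4: {bleu4:.4f}")
```

Without smoothing, the zero 4-gram precision would force BLEU-4 to exactly 0 for this pair, regardless of how well the lower orders match.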

### Limitations:
BLEU is a **surface-level** metric:
- It penalizes paraphrasing.
- It doesn't understand meaning—only *form*.
- It is useful for rough comparison, but **not sufficient alone** to assess model quality.

Hence, we will also perform **qualitative evaluation** using *LLM-as-a-Judge* in the next step.

### Results:
For reference, the earlier closed-book evaluation produced average scores of approximately:
- BLEU-1: 0.22
- BLEU-2: 0.11
- BLEU-3: 0.08
- BLEU-4: 0.05

Those low scores were expected, since:
1. That evaluation was *closed-book* (no document context).
2. The questions were from **papers published in 2025**, after the model's training cutoff.
3. The model had not seen any of these papers during fine-tuning.

**Conclusion**: BLEU gives us a sense of lexical similarity. In high-difficulty settings like this one, it must be supplemented with qualitative evaluation.

> Current Result:  
> `Average BLEU Scores: {'BLEU-1': 0.401, 'BLEU-2': 0.2655, 'BLEU-3': 0.2033, 'BLEU-4': 0.1625}`  
> *This is a substantial improvement over the BLEU scores from the closed-book evaluation notebook:*
> `Average BLEU Scores: {'BLEU-1': 0.2244, 'BLEU-2': 0.1126, 'BLEU-3': 0.0787, 'BLEU-4': 0.0514}`. *The higher overlap at every n-gram order indicates that, given the context, the model's wording tracks the reference answers more closely. BLEU alone cannot confirm semantic faithfulness, which the later steps assess.*

In [None]:
# Load predictions with context
with open("eval_openbook_predictions.json", "r") as f:
    eval_results = json.load(f)

In [None]:
# Initialize smoothing function and score containers
smooth = SmoothingFunction().method1
bleu_scores = {f"BLEU-{n}": [] for n in range(1, 5)}

In [None]:
# Iterate over predictions and compute BLEU-1 to BLEU-4
for item in eval_results:
    reference = item["reference"].split()
    prediction = item["prediction"].split()

    bleu_scores["BLEU-1"].append(
        sentence_bleu([reference], prediction, weights=(1, 0, 0, 0), smoothing_function=smooth)
    )
    bleu_scores["BLEU-2"].append(
        sentence_bleu([reference], prediction, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth)
    )
    bleu_scores["BLEU-3"].append(
        sentence_bleu([reference], prediction, weights=(1/3, 1/3, 1/3, 0), smoothing_function=smooth)
    )
    bleu_scores["BLEU-4"].append(
        sentence_bleu([reference], prediction, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
    )

# Compute and display average scores
avg_bleu_scores = {metric: round(sum(scores)/len(scores), 4) for metric, scores in bleu_scores.items()}
print("Average BLEU Scores:", avg_bleu_scores)

Average BLEU Scores: {'BLEU-1': 0.401, 'BLEU-2': 0.2655, 'BLEU-3': 0.2033, 'BLEU-4': 0.1625}


## Step 6: Using GPT-4o as LLM-as-a-Judge (OpenAI Evaluation)

In this section, we use **GPT-4o**—a state-of-the-art model from OpenAI—as a neutral third-party judge to evaluate the quality of our model’s predictions against ground truth answers. This is part of the **LLM-as-a-Judge** evaluation methodology, which is growing in popularity as a way to assess open-ended outputs where metrics like BLEU or ROUGE may fall short.

**What this section does:**

- Loads model predictions from `eval_openbook_predictions.json`
- Uses a GPT-4o prompt that provides:
  - The question
  - The model's generated answer
  - The reference (ground-truth) answer
- Asks GPT-4o to score the generated answer on a **scale from 1 to 5**, using a rubric based on relevance, accuracy, and completeness relative to the reference
- Stores all outputs in `gpt4o_judgments_openbook.json` for analysis

**Key Functions:**

- `ask_gpt_judge()` → Sends a prompt to GPT-4o via the OpenAI API and returns the score as a string (or `None` if the request fails)
- `judged_results` → A list of evaluation records including the question, reference, model prediction, and GPT-4o's score
- `np.mean()` → Used at the end to compute the **average evaluation score** across all QA pairs
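
Because the judge replies with free text, the reply has to be turned into a number before averaging. Here is a minimal sketch of a tolerant parser; `parse_judge_score` and the sample replies are hypothetical and are not part of the notebook, which filters replies with `str.isdigit()` instead.

```python
import re
from statistics import mean
from typing import Optional


def parse_judge_score(raw: Optional[str]) -> Optional[int]:
    """Extract the first standalone digit 1-5 from a judge reply.

    Returns None when the reply is missing or contains no valid score,
    so the caller can count skipped items instead of dropping them silently.
    """
    if raw is None:
        return None
    match = re.search(r"\b([1-5])\b", raw)
    return int(match.group(1)) if match else None


# Hypothetical replies, including a failed request (None).
raw_replies = ["4", "Score: 3", "n/a", None, "5"]

scores = [s for s in map(parse_judge_score, raw_replies) if s is not None]
skipped = len(raw_replies) - len(scores)
average = mean(scores)

print(f"Average: {average:.2f} out of 5 ({skipped} replies skipped)")
```

Reporting the number of skipped replies alongside the average makes it clear how many of the QA pairs the mean actually covers. The regex is deliberately permissive; a reply such as `3.5` would be read as 3, so a stricter pattern may be preferable if the judge is allowed to deviate from the requested format.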

**Why use GPT-4o?**

An LLM judge can assess meaning rather than surface overlap, which makes it a practical proxy for human grading of open-ended answers. Its scores are still model judgments rather than ground truth, so we treat them as one signal alongside the automatic metrics.

This evaluation complements our BLEU score by offering a **semantic and qualitative assessment**, helping us better understand the strengths and weaknesses of our fine-tuned model.

---

In [None]:
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key:")

Enter your OpenAI API key:··········


In [None]:
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
# Load the API key from environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [None]:
def ask_gpt_judge(question, reference, prediction):
    prompt = f"""
You are an expert model evaluator. Given a question, a reference answer, and a model-generated answer that was generated with access to a relevant excerpt from a scientific paper, judge how good the model’s answer is on a scale of 1 to 5. Use the following rubric:

1 – Completely irrelevant or hallucinated.
2 – Partially related but mostly inaccurate.
3 – Mostly accurate but missing key details.
4 – Accurate and mostly complete.
5 – Nearly identical in meaning to the reference.

Be strict but fair. Output ONLY the number.

Question: {question}
Reference Answer: {reference}
Model Prediction: {prediction}

Score:"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print("Error during evaluation:\n")
        print(e)
        return None

In [None]:
with open("eval_openbook_predictions.json") as f:
    eval_results = json.load(f)

In [None]:
judged_results = []

for i, item in enumerate(eval_results):
    print(f"Evaluating {i+1}/{len(eval_results)}")
    score = ask_gpt_judge(item["question"], item["reference"], item["prediction"])
    if score:
        judged_results.append({
            "question": item["question"],
            "reference": item["reference"],
            "prediction": item["prediction"],
            "gpt4o_score": score
        })
    time.sleep(1.2)

Evaluating 1/30
Evaluating 2/30
Evaluating 3/30
Evaluating 4/30
Evaluating 5/30
Evaluating 6/30
Evaluating 7/30
Evaluating 8/30
Evaluating 9/30
Evaluating 10/30
Evaluating 11/30
Evaluating 12/30
Evaluating 13/30
Evaluating 14/30
Evaluating 15/30
Evaluating 16/30
Evaluating 17/30
Evaluating 18/30
Evaluating 19/30
Evaluating 20/30
Evaluating 21/30
Evaluating 22/30
Evaluating 23/30
Evaluating 24/30
Evaluating 25/30
Evaluating 26/30
Evaluating 27/30
Evaluating 28/30
Evaluating 29/30
Evaluating 30/30


In [None]:
with open("gpt4o_judgments_openbook.json", "w") as f:
    json.dump(judged_results, f, indent=2)

In [None]:
for sample in judged_results:
    print(" Question:", sample["question"])
    print(" Reference Answer:", sample["reference"])
    print(" Model Prediction:", sample["prediction"])
    print(" GPT-4o Evaluation:", sample["gpt4o_score"])
    print("-" * 80)

 Question: What is the primary innovation introduced by the LoRI method for parameter-efficient fine-tuning?
 Reference Answer: LoRI introduces a novel approach that freezes the projection matrices A as random projections and sparsifies the matrices B using task-specific masks, thereby significantly reducing trainable parameters while minimizing cross-task interference.
 Model Prediction: LoRI reduces the number of trainable parameters by freezing the projection matrices A as random projections and sparsifying the matrices B using task-specific masks, while maintaining strong task performance.
 GPT-4o Evaluation: 5
--------------------------------------------------------------------------------
 Question: How does LoRI reduce the number of trainable parameters compared to traditional LoRA?
 Reference Answer: LoRI reduces the number of trainable parameters by keeping matrix A fixed as a random projection and sparsifying matrix B using task-specific masks, eliminating the need to train b

In [None]:
# Calculating the average score
scores = [int(res["gpt4o_score"]) for res in judged_results if res["gpt4o_score"].isdigit()]
average_score = np.mean(scores)
print(f"Average GPT-4o Evaluation Score: {average_score:.2f} out of 5")

Average GPT-4o Evaluation Score: 3.97 out of 5


In [None]:
# Saving the results

output_path = "./data/evaluation/eval_gpt4o_judgments_open_book.json"

os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(judged_results, f, indent=2)

print(f"Judged results saved to {output_path}")

Judged results saved to ./data/evaluation/eval_gpt4o_judgments_open_book.json


## Step 7: Evaluating with BERTScore (Semantic Similarity Metric)

In this section, we evaluate the semantic similarity between the model’s predictions and the ground truth answers using **BERTScore**, a metric that leverages contextual embeddings from large pretrained models (like BERT) to assess the *meaning* of the outputs.

Unlike BLEU, which only considers surface-level n-gram overlap, BERTScore measures how semantically close the answers are—even when the phrasing differs.

### Interpretation:
- **BERTScore F1** reflects the degree of **semantic overlap** between model output and human-labeled answer.
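- A score closer to **1.0** indicates stronger alignment of meaning. Because we pass `rescale_with_baseline=True`, the scores are rescaled against a baseline and spread over a wider range than raw BERTScore, so they should not be compared with unrescaled values.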
- This metric is especially useful in open-ended QA or summarization settings where **exact matching isn't expected**.
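
The matching behind BERTScore can be illustrated with a toy, pure-Python sketch: each token is greedily matched to its most similar token on the other side by cosine similarity, and precision, recall, and F1 follow from those matches. The 2-D vectors stand in for contextual embeddings, the baseline value is made up, and the function names are hypothetical; the real `bert_score` package uses model embeddings and a model-specific baseline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 (no IDF weighting)."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def rescale(score, baseline):
    """Linear baseline rescaling: the baseline maps to 0, 1.0 stays 1.0."""
    return (score - baseline) / (1 - baseline)


# Toy "embeddings": the candidate covers two of three reference tokens exactly.
cand = [(1.0, 0.0), (0.0, 1.0)]
ref = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

p, r, f1 = greedy_bertscore(cand, ref)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
print(f"Rescaled F1 (hypothetical baseline 0.8): {rescale(f1, 0.8):.4f}")
```

Since F1 is computed per example and then averaged, the mean F1 reported for a dataset need not equal the harmonic mean of the mean precision and mean recall.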

> Current Result:  
> `Average BERTScore (F1): 0.5295`  
> *This is a +0.1831 improvement over the closed-book BERTScore (0.3464). It supports our earlier hypothesis that adding context to the prompt substantially improves the quality of the model's answers. We expect to see further improvement when RAG is implemented.*

In [None]:
# `results` comes from Step 4; after a kernel restart, use `eval_results` (loaded from JSON) instead
predictions = [item["prediction"] for item in results]
references = [item["reference"] for item in results]

In [None]:
P, R, F1 = bertscore(predictions, references, lang="en", rescale_with_baseline=True)

In [None]:
print(f"Average Precision: {P.mean().item():.4f}")

Average Precision: 0.5135


In [None]:
print(f"Average Recall: {R.mean().item():.4f}")

Average Recall: 0.5448


In [None]:
print(f"Average BERTScore (F1): {F1.mean().item():.4f}")

Average BERTScore (F1): 0.5295


## Step 8: Fixing Metadata

In [1]:
pip install nbformat --quiet

In [6]:
from google.colab import drive, files
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
import os
# List the notebook directory to confirm the file exists
os.listdir("/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks")

['.keep',
 '00_colab_setup.ipynb',
 '01_arxiv_scraper.ipynb',
 '02_pdf_downloader.ipynb',
 '04_prepare_finetuning_corpus.ipynb',
 '05_tokenization.ipynb',
 '03_qa_curation.ipynb',
 '06_finetuning.ipynb',
 '07_eval_qa_curation.ipynb',
 '08_evaluation_closed_book.ipynb',
 '09_evaluation_open_book.ipynb']

In [None]:
import nbformat

notebook_path = "/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks/09_evaluation_open_book.ipynb"

with open(notebook_path, "r") as f:
    nb = nbformat.read(f, as_version=4)

if "widgets" in nb.metadata:
    del nb.metadata["widgets"]

with open(notebook_path, "w") as f:
    nbformat.write(nb, f)

print("Notebook fixed and saved successfully!")

Notebook fixed and saved successfully!
