# RAG Summarization - Complete Pipeline

## Project Workflow
1. ✅ Set up retrieval (BM25 & FAISS dense)
2. ✅ Implement RAG pipeline
3. ✅ Ablation studies (retriever type, top-k)
4. ✅ Evaluate QA & summarisation, compute readability metrics
5. ✅ Log and compare ablation results

## Evaluation Metrics
- ROUGE-1, ROUGE-2, ROUGE-L
- BERTScore F1
- Avg Flesch-Kincaid Grade
- Individual FK grades (in CSV)
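
For reference, the Flesch-Kincaid grade can be written out from raw counts. The sketch below uses the standard published formula; the function name `fk_grade_from_counts` and the example counts are illustrative assumptions, and `textstat` applies its own sentence and syllable heuristics, so its values may differ slightly.

```python
def fk_grade_from_counts(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts (hypothetical helper)."""
    if total_words == 0 or total_sentences == 0:
        return 0.0  # avoid division by zero on empty text
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables
grade = fk_grade_from_counts(total_words=100, total_sentences=5, total_syllables=150)
print(round(grade, 2))
```

Longer sentences and more syllables per word both push the grade up, which is why dense biomedical summaries tend to score well above typical general-audience text.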

In [1]:
# Define your branch name as a string
BRANCH = "yunxiu-branch"

# Clone your GitHub repo and switch to your branch
!git clone -b $BRANCH https://github.com/xiuxiuface/dsa4213-project.git

# Move into the project folder
%cd dsa4213-project

Cloning into 'dsa4213-project'...
remote: Enumerating objects: 248, done.
remote: Counting objects: 100% (160/160), done.
remote: Compressing objects: 100% (132/132), done.
remote: Total 248 (delta 85), reused 56 (delta 27), pack-reused 88 (from 2)
Receiving objects: 100% (248/248), 84.72 MiB | 29.64 MiB/s, done.
Resolving deltas: 100% (98/98), done.
/kaggle/working/dsa4213-project


## Installation (Run Once)

In [2]:
!pip install -q rank-bm25 sentence-transformers faiss-cpu transformers torch rouge-score bert-score textstat pandas numpy tqdm openpyxl

## Imports

In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import json
import os
from typing import List, Dict, Tuple

# Retrieval
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

# Generation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Evaluation
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import textstat


2025-11-13 17:17:53.872077: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763054274.057973      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763054274.109476      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


## 1️⃣ Data Loading

In [4]:
# Load dataset
df = pd.read_excel("rag_dataset.xlsx", engine='openpyxl')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

Dataset shape: (46610, 5)
Columns: ['question_id', 'question', 'answer', 'passage', 'passage_id']


| | question_id | question | answer | passage | passage_id |
|---|---|---|---|---|---|
| 0 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | Hidradenitis suppurativa (HS) is a chronic inf... | 0 |
| 1 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | In this retrospective case-control study, we c... | 1 |
| 2 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | A total of 2292 patients at Massachusetts Gene... | 2 |
| 3 | 1 | Is admission hyperglycemia associated with fai... | In patients with STEMI who undergo FT, admissi... | Hyperglycemia on admission is associated with ... | 3 |
| 4 | 1 | Is admission hyperglycemia associated with fai... | In patients with STEMI who undergo FT, admissi... | This is a retrospective study of 304 STEMI pat... | 4 |


In [5]:
# Prepare corpus (unique passages)
unique_passages = df.drop_duplicates(subset=["passage_id"])
unique_passages = unique_passages[unique_passages['passage'].notna()]
corpus = unique_passages["passage"].tolist()
passage_ids = unique_passages["passage_id"].tolist()

print(f"Total unique passages: {len(corpus)}")

Total unique passages: 46609


## 2️⃣ Retriever Classes

In [6]:
# BM25 Retriever
class BM25Retriever:
    def __init__(self, corpus: List[str]):
        print("Initializing BM25...")
        tokenized_corpus = [doc.lower().split() for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)
        self.corpus = corpus
        print("✓ BM25 ready")
    
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_k_indices = np.argsort(scores)[::-1][:k]
        return [self.corpus[i] for i in top_k_indices]
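
To make the scoring step less opaque, here is a minimal standard-library sketch of textbook BM25 scoring over whitespace-tokenized documents. It is not the `rank_bm25` implementation: the function `bm25_scores`, the toy corpus, and the non-negative IDF variant used here are assumptions for illustration, and the library's IDF handling may differ.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query (illustrative sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, chosen for simplicity)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up), tokenized the same way as BM25Retriever: lowercase + split
toy_corpus = [
    "hidradenitis suppurativa is a chronic skin disease",
    "admission hyperglycemia in stemi patients",
    "chronic disease burden in patients",
]
toy_docs = [doc.lower().split() for doc in toy_corpus]
toy_scores = bm25_scores("chronic skin disease".split(), toy_docs)
ranked = sorted(range(len(toy_docs)), key=lambda i: toy_scores[i], reverse=True)
print(ranked)
```

Because matching is purely lexical, a query phrased with synonyms of the passage vocabulary gets no credit, which is the usual motivation for comparing against a dense retriever.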

In [7]:
# FAISS Dense Retriever
class FAISSRetriever:
    def __init__(self, corpus: List[str], model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Initializing FAISS with {model_name}...")
        self.embed_model = SentenceTransformer(model_name)
        self.corpus = corpus
        
        # Create embeddings
        print("Creating passage embeddings...")
        self.passage_embeddings = self.embed_model.encode(
            corpus, 
            show_progress_bar=True,
            convert_to_numpy=True
        )
        
        # Build FAISS index
        dimension = self.passage_embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner Product
        
        # Normalize for cosine similarity
        faiss.normalize_L2(self.passage_embeddings)
        self.index.add(self.passage_embeddings)
        print(f"✓ FAISS index ready with {self.index.ntotal} vectors")
    
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        query_vec = self.embed_model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(query_vec)
        scores, indices = self.index.search(query_vec, k)
        return [self.corpus[i] for i in indices[0]]
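
The retriever relies on the fact that an inner-product search over L2-normalized vectors ranks by cosine similarity. A small standard-library sketch of that identity follows; the helper names and the example vectors are illustrative only.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))  # what IndexFlatIP would compare
cos = cosine(a, b)
print(ip_after_norm, cos)
```

This is why both the passage embeddings and the query vector are normalized before they reach the index: skipping either step would turn the ranking into a raw inner-product ranking that favors longer vectors.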

## 3️⃣ Initialize Retrievers

In [8]:
# Initialize BM25
bm25_retriever = BM25Retriever(corpus)

Initializing BM25...
✓ BM25 ready


In [9]:
# Initialize FAISS
faiss_retriever = FAISSRetriever(corpus)

Initializing FAISS with all-MiniLM-L6-v2...


Creating passage embeddings...

✓ FAISS index ready with 46609 vectors


## 4️⃣ Summarization Model

In [20]:
summarisation_prompts = {
    "plain": "You are a medical research assistant.\nSummarise the biomedical text below concisely:\n\n",
    "cite_source": "You are a medical research assistant.\nSummarise the biomedical text below concisely while citing possible biomedical sources:\n\n"
}
class SummarizationModel:
    def __init__(self, model_name: str = "google/flan-t5-small"):
        print(f"Loading {model_name}...")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(self.device)
        print(f"✓ Model loaded on {self.device}")
    
    def generate(self, question: str, context: List[str], max_length: int = 150, prompt_type: str = "plain") -> str:
        # Note: `question` is currently unused; the prompt contains only the retrieved context.
        context_str = " ".join(context)
        prompt = summarisation_prompts[prompt_type] + context_str
        
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(self.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                num_beams=4,
                early_stopping=True
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Initialize
summarizer = SummarizationModel()

Loading google/flan-t5-small...
✓ Model loaded on cuda
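
Since `generate()` joins all retrieved passages and tokenizes with `truncation=True, max_length=512`, anything past that budget is dropped before the model sees it. The sketch below shows the bookkeeping with hypothetical token counts; `surviving_tokens` is an illustrative helper, real counts come from the tokenizer, and the tokenizer also spends part of the budget on special tokens.

```python
from typing import List


def surviving_tokens(prefix_len: int, passage_lens: List[int], budget: int = 512) -> List[int]:
    """Return how many tokens of each passage fit in the budget, in order (sketch)."""
    remaining = max(budget - prefix_len, 0)
    kept = []
    for length in passage_lens:
        take = min(length, remaining)
        kept.append(take)
        remaining -= take
    return kept


# Hypothetical lengths: a 20-token instruction followed by three 300-token passages
kept = surviving_tokens(prefix_len=20, passage_lens=[300, 300, 300])
print(kept)  # [300, 192, 0]
```

With numbers like these, the third passage would contribute nothing, so a larger `k` does not necessarily mean more usable context. Whether this affects the ablation results would need to be checked against actual token counts.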


## 5️⃣ Quick Test

In [21]:
# Test with one example
test_row = df.drop_duplicates(subset=["question_id"]).iloc[0]

print("Question:", test_row["question"])
print("\nTrue answer:", test_row["answer"])

# BM25
print("\n" + "="*60)
print("BM25 Retrieval (k=3):")
bm25_ctx = bm25_retriever.retrieve(test_row["question"], k=3)
bm25_summary = summarizer.generate(test_row["question"], bm25_ctx)
print("Summary:", bm25_summary)
print("FK Grade:", textstat.flesch_kincaid_grade(bm25_summary))

# FAISS
print("\n" + "="*60)
print("FAISS Retrieval (k=3):")
faiss_ctx = faiss_retriever.retrieve(test_row["question"], k=3)
faiss_summary = summarizer.generate(test_row["question"], faiss_ctx)
print("Summary:", faiss_summary)
print("FK Grade:", textstat.flesch_kincaid_grade(faiss_summary))

Question: Is hidradenitis suppurativa a systemic disease with substantial comorbidity burden?

True answer: Control subjects were not validated for absence of HS and comorbidity validation was not performed for either group.

BM25 Retrieval (k=3):
Summary: The prevalence and comorbidities of Hidradenitis suppurativa in diabetic patients.
FK Grade: 16.63

FAISS Retrieval (k=3):
Summary: Prevalence and comorbidity of HS in a large patient care database.
FK Grade: 10.154545454545456


## 6️⃣ Evaluation Functions

In [22]:
class RAGEvaluator:
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL'], 
            use_stemmer=True
        )
    
    def evaluate_batch(self, predictions: List[str], references: List[str]) -> Tuple[Dict, List]:
        # ROUGE scores
        rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
        
        for pred, ref in zip(predictions, references):
            scores = self.rouge_scorer.score(ref, pred)
            rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
            rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
            rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)
        
        # BERTScore
        P, R, F1 = bert_score(predictions, references, lang="en", verbose=False)
        
        # Flesch-Kincaid
        fk_grades = [textstat.flesch_kincaid_grade(p) for p in predictions]
        
        metrics = {
            "ROUGE-1": np.mean(rouge_scores['rouge1']),
            "ROUGE-2": np.mean(rouge_scores['rouge2']),
            "ROUGE-L": np.mean(rouge_scores['rougeL']),
            "BERTScore_F1": F1.mean().item(),
            "Avg_FK_Grade": np.mean(fk_grades)
        }
        
        return metrics, fk_grades

evaluator = RAGEvaluator()
print("✓ Evaluator ready")

✓ Evaluator ready
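
As a rough guide to what the ROUGE-1 F-measure captures, here is a simplified standard-library sketch based on clipped unigram overlap. It is not the `rouge_score` implementation: it uses plain whitespace tokenization and no stemming (the evaluator above sets `use_stemmer=True`), and `rouge1_f` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure between a reference and a prediction (sketch)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap)
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat sat")
print(round(score, 4))
```

In this toy case every predicted word appears in the reference (precision 1.0) but only half of the reference is covered (recall 0.5), so short summaries are penalized through recall.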


## 7️⃣ Ablation Study (FAISS only, since it outperformed BM25)

In [23]:
# Sample test set for ablation
TEST_SIZE = 500  # Adjust based on your computational resources
test_df = df.drop_duplicates(subset=["question_id"]).sample(n=TEST_SIZE, random_state=42)

print(f"Test set size: {len(test_df)}")

Test set size: 500


In [25]:
# Ablation configurations
configs = [
    {"name": "FAISS_k1_plain", "retriever": faiss_retriever, "k": 1, "prompt_type": "plain"},
    {"name": "FAISS_k1_cite", "retriever": faiss_retriever, "k": 1, "prompt_type": "cite_source"},
    {"name": "FAISS_k3_plain", "retriever": faiss_retriever, "k": 3, "prompt_type": "plain"},
    {"name": "FAISS_k3_cite", "retriever": faiss_retriever, "k": 3, "prompt_type": "cite_source"},
    {"name": "FAISS_k5_plain", "retriever": faiss_retriever, "k": 5, "prompt_type": "plain"},
    {"name": "FAISS_k5_cite", "retriever": faiss_retriever, "k": 5, "prompt_type": "cite_source"},
]

ablation_results = []

for config in configs:
    print(f"\n{'='*60}")
    print(f"Running: {config['name']}")
    print(f"{'='*60}")
    
    predictions = []
    references = []
    
    for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc=config['name']):
        ctx = config['retriever'].retrieve(row['question'], k=config['k'])
        pred = summarizer.generate(row['question'], ctx, prompt_type=config['prompt_type'])
        predictions.append(pred)
        references.append(row['answer'])
    
    # Evaluate
    metrics, _ = evaluator.evaluate_batch(predictions, references)
    
    result = {
        "config_name": config['name'],
        "retriever_type": config['name'].split('_')[0],
        "k": config['k'],
        "prompt_type": config.get('prompt_type', 'plain'),
        **metrics
    }
    ablation_results.append(result)
    
    print(f"\nResults:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")

# Save results
os.makedirs("results", exist_ok=True)
# Save the metrics for all ablation configurations in a single file
ablation_df = pd.DataFrame(ablation_results)
ablation_df.to_csv("results/ablation_study_all.csv", index=False)

# Select the best config for each prompt type (by BERTScore F1) and run the full evaluation
for prompt_type in ['plain', 'cite_source']:
    prompt_results = ablation_df[ablation_df['prompt_type'] == prompt_type]
    best_config = prompt_results.loc[prompt_results['BERTScore_F1'].idxmax()]
    
    print(f"\n{'='*80}")
    print(f"Best configuration for {prompt_type}:")
    print(best_config)
    print(f"{'='*80}")
    
    # Use the best retriever and k for this prompt type
    BEST_RETRIEVER = faiss_retriever if "FAISS" in best_config['config_name'] else bm25_retriever
    BEST_K = int(best_config['k'])
    
    # Full evaluation on entire dataset
    print(f"\nRunning full evaluation for {prompt_type}...")
    print(f"Using: {best_config['config_name']}")
    
    unique_df = df.drop_duplicates(subset=["question_id"]).reset_index(drop=True)
    
    predictions = []
    references = []
    
    for _, row in tqdm(unique_df.iterrows(), total=len(unique_df), desc=f"Full eval - {prompt_type}"):
        ctx = BEST_RETRIEVER.retrieve(row['question'], k=BEST_K)
        pred = summarizer.generate(row['question'], ctx, prompt_type=prompt_type)
        predictions.append(pred)
        references.append(row['answer'])
    
    # Add predictions to dataframe
    unique_df[f'rag_summary_{prompt_type}'] = predictions
    
    # Evaluate
    final_metrics, fk_grades = evaluator.evaluate_batch(predictions, references)
    unique_df[f'FK_grade_{prompt_type}'] = fk_grades
    
    # Save
    unique_df.to_csv(f"results/rag_full_outputs_{prompt_type}.csv", index=False)
    
    with open(f"results/rag_full_metrics_{prompt_type}.json", "w") as f:
        json.dump(final_metrics, f, indent=2)
    
    print(f"\n{'='*80}")
    print(f"FINAL METRICS for {prompt_type}")
    print(f"{'='*80}")
    for metric, value in final_metrics.items():
        print(f"{metric}: {value:.4f}")
    
    print(f"\n✓ Saved to results/")


Running: FAISS_k1_plain


FAISS_k1_plain: 100%|██████████| 500/500 [04:33<00:00,  1.83it/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Results:
  ROUGE-1: 0.2714
  ROUGE-2: 0.0980
  ROUGE-L: 0.2112
  BERTScore_F1: 0.8661
  Avg_FK_Grade: 15.5416

Running: FAISS_k1_cite


FAISS_k1_cite: 100%|██████████| 500/500 [04:36<00:00,  1.81it/s]



Results:
  ROUGE-1: 0.2748
  ROUGE-2: 0.0997
  ROUGE-L: 0.2112
  BERTScore_F1: 0.8666
  Avg_FK_Grade: 15.7351

Running: FAISS_k3_plain


FAISS_k3_plain: 100%|██████████| 500/500 [04:24<00:00,  1.89it/s]



Results:
  ROUGE-1: 0.2473
  ROUGE-2: 0.0850
  ROUGE-L: 0.1953
  BERTScore_F1: 0.8637
  Avg_FK_Grade: 15.6739

Running: FAISS_k3_cite


FAISS_k3_cite: 100%|██████████| 500/500 [04:28<00:00,  1.86it/s]



Results:
  ROUGE-1: 0.2476
  ROUGE-2: 0.0836
  ROUGE-L: 0.1951
  BERTScore_F1: 0.8637
  Avg_FK_Grade: 15.9728

Running: FAISS_k5_plain


FAISS_k5_plain: 100%|██████████| 500/500 [04:13<00:00,  1.97it/s]



Results:
  ROUGE-1: 0.2305
  ROUGE-2: 0.0773
  ROUGE-L: 0.1817
  BERTScore_F1: 0.8592
  Avg_FK_Grade: 15.4355

Running: FAISS_k5_cite


FAISS_k5_cite: 100%|██████████| 500/500 [04:09<00:00,  2.00it/s]



Results:
  ROUGE-1: 0.2211
  ROUGE-2: 0.0746
  ROUGE-L: 0.1728
  BERTScore_F1: 0.8573
  Avg_FK_Grade: 15.4645

Best configuration for plain:
config_name       FAISS_k1_plain
retriever_type             FAISS
k                              1
prompt_type                plain
ROUGE-1                 0.271444
ROUGE-2                  0.09801
ROUGE-L                 0.211209
BERTScore_F1            0.866089
Avg_FK_Grade           15.541614
Name: 0, dtype: object

Running full evaluation for plain...
Using: FAISS_k1_plain


Full eval - plain: 100%|██████████| 13738/13738 [2:03:07<00:00,  1.86it/s]  



FINAL METRICS for plain
ROUGE-1: 0.2686
ROUGE-2: 0.0962
ROUGE-L: 0.2081
BERTScore_F1: 0.8657
Avg_FK_Grade: 15.5026

✓ Saved to results/

Best configuration for cite_source:
config_name       FAISS_k1_cite
retriever_type            FAISS
k                             1
prompt_type         cite_source
ROUGE-1                0.274784
ROUGE-2                0.099689
ROUGE-L                0.211238
BERTScore_F1           0.866593
Avg_FK_Grade          15.735061
Name: 1, dtype: object

Running full evaluation for cite_source...
Using: FAISS_k1_cite


Full eval - cite_source: 100%|██████████| 13738/13738 [2:04:02<00:00,  1.85it/s]  



FINAL METRICS for cite_source
ROUGE-1: 0.2668
ROUGE-2: 0.0959
ROUGE-L: 0.2073
BERTScore_F1: 0.8651
Avg_FK_Grade: 15.4753

✓ Saved to results/
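
The per-prompt selection above can be reproduced without pandas. The sketch below hard-codes the BERTScore F1 values printed in the ablation output (rounded to four decimals) and picks the best row per prompt type; `best_per_prompt` is an illustrative helper, not part of the notebook.

```python
from typing import Dict, List

# BERTScore F1 values copied from the ablation output above (4-decimal rounding)
ablation_rows = [
    {"config_name": "FAISS_k1_plain", "k": 1, "prompt_type": "plain", "BERTScore_F1": 0.8661},
    {"config_name": "FAISS_k1_cite", "k": 1, "prompt_type": "cite_source", "BERTScore_F1": 0.8666},
    {"config_name": "FAISS_k3_plain", "k": 3, "prompt_type": "plain", "BERTScore_F1": 0.8637},
    {"config_name": "FAISS_k3_cite", "k": 3, "prompt_type": "cite_source", "BERTScore_F1": 0.8637},
    {"config_name": "FAISS_k5_plain", "k": 5, "prompt_type": "plain", "BERTScore_F1": 0.8592},
    {"config_name": "FAISS_k5_cite", "k": 5, "prompt_type": "cite_source", "BERTScore_F1": 0.8573},
]


def best_per_prompt(rows: List[dict], metric: str = "BERTScore_F1") -> Dict[str, dict]:
    """Keep the highest-scoring row per prompt type; ties keep the earlier row."""
    best: Dict[str, dict] = {}
    for row in rows:
        current = best.get(row["prompt_type"])
        if current is None or row[metric] > current[metric]:
            best[row["prompt_type"]] = row
    return best


best = best_per_prompt(ablation_rows)
for prompt_type, row in best.items():
    print(prompt_type, row["config_name"], row["k"])
```

On the 500-question sample, `k=1` wins for both prompt types, and the gap between the two prompts at `k=1` is 0.0005 BERTScore F1, so the choice between them rests on a very small margin.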


## 8️⃣ Full Evaluation


In [26]:
# Select best config from ablation
best_config = ablation_df.loc[ablation_df['BERTScore_F1'].idxmax()]
print("Best configuration:")
print(best_config)

# Use the best retriever and k
BEST_RETRIEVER = faiss_retriever if "FAISS" in best_config['config_name'] else bm25_retriever
BEST_K = int(best_config['k'])

Best configuration:
config_name       FAISS_k1_cite
retriever_type            FAISS
k                             1
prompt_type         cite_source
ROUGE-1                0.274784
ROUGE-2                0.099689
ROUGE-L                0.211238
BERTScore_F1           0.866593
Avg_FK_Grade          15.735061
Name: 1, dtype: object


In [29]:
# Full evaluation on entire dataset
print("Running full evaluation...")
print(f"Using: {best_config['config_name']}")

unique_df = df.drop_duplicates(subset=["question_id"]).reset_index(drop=True)

predictions = []
references = []

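for _, row in tqdm(unique_df.iterrows(), total=len(unique_df), desc="Full evaluation"):
    ctx = BEST_RETRIEVER.retrieve(row['question'], k=BEST_K)
    # prompt_type is not passed here, so generate() uses its default "plain" prompt;
    # only the retriever and k are taken from best_config.
    pred = summarizer.generate(row['question'], ctx)
    predictions.append(pred)
    references.append(row['answer'])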

# Add predictions to dataframe
unique_df['rag_summary'] = predictions

# Evaluate
final_metrics, fk_grades = evaluator.evaluate_batch(predictions, references)
unique_df['FK_grade'] = fk_grades

# Save
unique_df.to_csv("results/rag_full_outputs.csv", index=False)

with open("results/rag_full_metrics.json", "w") as f:
    json.dump(final_metrics, f, indent=2)

print("\n" + "="*80)
print("FINAL METRICS")
print("="*80)
for metric, value in final_metrics.items():
    print(f"{metric}: {value:.4f}")

print(f"\n✓ Saved to results/")

Running full evaluation...
Using: FAISS_k1_cite


Full evaluation: 100%|██████████| 13738/13738 [2:03:18<00:00,  1.86it/s]  



FINAL METRICS
ROUGE-1: 0.2686
ROUGE-2: 0.0962
ROUGE-L: 0.2081
BERTScore_F1: 0.8657
Avg_FK_Grade: 15.5026

✓ Saved to results/


## 9️⃣ Extract Required Passages

In [30]:
# Extract specific passages
required_passages = [16771, 12220, 29568]

subset_df = unique_df[unique_df['passage_id'].isin(required_passages)].copy()
subset_df.to_csv("results/rag_required_passages.csv", index=False)

print("Required passages extracted:")
subset_df[['passage_id', 'question', 'rag_summary', 'FK_grade']]

Required passages extracted:


| | passage_id | question | rag_summary | FK_grade |
|---|---|---|---|---|
| 3601 | 12220 | Is parathyroid hormone associated with biomark... | The association of 25-hydroxyvitamin D and tes... | 23.761667 |
| 4953 | 16771 | Does iGF-2 mediate intestinal mucosal hyperpla... | IGF2 and IGF1R mediate the Rb-IKO intestinal p... | 9.655 |
| 8722 | 29568 | Is insulin-like growth factor binding protein-... | Plasma IGFBP2 levels are associated with clini... | 12.69 |


## Summary

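### Generated Files:
1. `results/ablation_study_all.csv` - Metrics for all six ablation configurations
2. `results/rag_full_outputs_cite_source.csv` - Per-question summaries and FK grades for the `cite_source` prompt
3. `results/rag_full_outputs_plain.csv` - Per-question summaries and FK grades for the `plain` prompt
4. `results/rag_full_metrics_plain.json` - Aggregate metrics for the `plain` prompt
5. `results/rag_full_metrics_cite_source.json` - Aggregate metrics for the `cite_source` prompt
6. `results/rag_required_passages.csv` - Specific passages [16771, 12220, 29568]
7. `results/rag_full_outputs.csv` - Per-question summaries and FK grades from the Full Evaluation section
8. `results/rag_full_metrics.json` - Aggregate metrics from the Full Evaluation section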

### Metrics in JSON:
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore_F1
- Avg_FK_Grade ← **Average across all samples**
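
To compare two metric sets side by side, a small helper like the one below may be enough. The values are copied from the FINAL METRICS printed above for the `plain` and `cite_source` runs; `metric_deltas` is a hypothetical helper, and with the result files present the two dicts could instead be read with `json.load`.

```python
from typing import Dict

# Values copied from the FINAL METRICS output for each prompt type
plain = {"ROUGE-1": 0.2686, "ROUGE-2": 0.0962, "ROUGE-L": 0.2081,
         "BERTScore_F1": 0.8657, "Avg_FK_Grade": 15.5026}
cite_source = {"ROUGE-1": 0.2668, "ROUGE-2": 0.0959, "ROUGE-L": 0.2073,
               "BERTScore_F1": 0.8651, "Avg_FK_Grade": 15.4753}


def metric_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Candidate minus baseline for every metric, rounded to 4 decimals."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


deltas = metric_deltas(plain, cite_source)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

On the full set of 13,738 questions, every difference between the two prompts is below 0.002 for the overlap metrics, so the two prompt variants perform almost identically under these measures.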

### Individual FK Grades:
- Stored in the output CSV files (column `FK_grade`, or `FK_grade_plain` / `FK_grade_cite_source` in the per-prompt files)