# RAG Summarization - Complete Pipeline

## Project Workflow
1. ✅ Set up retrieval (BM25 & FAISS dense)
2. ✅ Implement RAG pipeline
3. ✅ Ablation studies (retriever type, top-k)
4. ✅ Evaluate QA & summarization, compute readability metrics
5. ✅ Log and compare ablation results

## Evaluation Metrics
- ROUGE-1, ROUGE-2, ROUGE-L
- BERTScore F1
- Avg Flesch-Kincaid Grade
- Individual FK grades (in CSV)

In [1]:
# Define your branch name as a string
BRANCH = "yunxiu-branch"

# Clone your GitHub repo and switch to your branch
!git clone -b $BRANCH https://github.com/xiuxiuface/dsa4213-project.git

# Move into the project folder
%cd dsa4213-project

Cloning into 'dsa4213-project'...
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (90/90), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 178 (delta 52), reused 41 (delta 26), pack-reused 88 (from 2)
Receiving objects: 100% (178/178), 78.87 MiB | 39.49 MiB/s, done.
Resolving deltas: 100% (65/65), done.
/kaggle/working/dsa4213-project


## Installation (Run Once)

In [2]:
!pip install -q rank-bm25 sentence-transformers faiss-cpu transformers torch rouge-score bert-score textstat pandas numpy tqdm openpyxl


## Imports

In [6]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import json
import os
from typing import List, Dict, Tuple

# Retrieval
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

# Generation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Evaluation
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import textstat


## 1️⃣ Data Loading

In [9]:
# Load dataset
df = pd.read_excel("rag_dataset.xlsx", engine='openpyxl')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

Dataset shape: (46610, 5)
Columns: ['question_id', 'question', 'answer', 'passage', 'passage_id']


| | question_id | question | answer | passage | passage_id |
|---|---|---|---|---|---|
| 0 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | Hidradenitis suppurativa (HS) is a chronic inf... | 0 |
| 1 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | In this retrospective case-control study, we c... | 1 |
| 2 | 0 | Is hidradenitis suppurativa a systemic disease... | Control subjects were not validated for absenc... | A total of 2292 patients at Massachusetts Gene... | 2 |
| 3 | 1 | Is admission hyperglycemia associated with fai... | In patients with STEMI who undergo FT, admissi... | Hyperglycemia on admission is associated with ... | 3 |
| 4 | 1 | Is admission hyperglycemia associated with fai... | In patients with STEMI who undergo FT, admissi... | This is a retrospective study of 304 STEMI pat... | 4 |


In [10]:
# Prepare corpus (unique passages)
unique_passages = df.drop_duplicates(subset=["passage_id"])
unique_passages = unique_passages[unique_passages['passage'].notna()]
corpus = unique_passages["passage"].tolist()
passage_ids = unique_passages["passage_id"].tolist()

print(f"Total unique passages: {len(corpus)}")

Total unique passages: 46609


## 2️⃣ Retriever Classes

In [13]:
# BM25 Retriever
class BM25Retriever:
    def __init__(self, corpus: List[str]):
        print("Initializing BM25...")
        tokenized_corpus = [doc.lower().split() for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)
        self.corpus = corpus
        print("✓ BM25 ready")
    
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_k_indices = np.argsort(scores)[::-1][:k]
        return [self.corpus[i] for i in top_k_indices]
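
To make the ranking step above more concrete, here is a minimal pure-Python sketch of a textbook BM25 score. It is an illustration only: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF are assumptions made for this sketch, and `rank_bm25.BM25Okapi` may differ in details such as how it treats very common terms.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a textbook BM25 formula."""
    # Same whitespace tokenization as BM25Retriever above
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed, always-positive IDF (an assumption of this sketch)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical three-document corpus, for illustration only
docs = [
    "hidradenitis suppurativa is a chronic inflammatory disease",
    "admission hyperglycemia in patients with STEMI",
    "insulin-like growth factor in lung cancer",
]
scores = bm25_scores("chronic inflammatory disease", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only documents that share at least one token with the query receive a non-zero score, which is why exact wording matters more for BM25 than for the dense retriever.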

In [15]:
# FAISS Dense Retriever
class FAISSRetriever:
    def __init__(self, corpus: List[str], model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Initializing FAISS with {model_name}...")
        self.embed_model = SentenceTransformer(model_name)
        self.corpus = corpus
        
        # Create embeddings
        print("Creating passage embeddings...")
        self.passage_embeddings = self.embed_model.encode(
            corpus, 
            show_progress_bar=True,
            convert_to_numpy=True
        )
        
        # Build FAISS index
        dimension = self.passage_embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner Product
        
        # Normalize for cosine similarity
        faiss.normalize_L2(self.passage_embeddings)
        self.index.add(self.passage_embeddings)
        print(f"✓ FAISS index ready with {self.index.ntotal} vectors")
    
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        query_vec = self.embed_model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(query_vec)
        scores, indices = self.index.search(query_vec, k)
        return [self.corpus[i] for i in indices[0]]
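
The class above normalizes both passage and query vectors before searching an inner-product index. The following pure-Python sketch, with made-up 2-D vectors and hypothetical helper names, shows why: after L2 normalization the inner product equals cosine similarity, so a long vector no longer wins just because of its magnitude.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_cosine(query: Sequence[float], passages: List[Sequence[float]], k: int = 3) -> List[int]:
    """Return indices of the k passages with the highest cosine similarity."""
    q = l2_normalize(query)
    sims = [dot(q, l2_normalize(p)) for p in passages]
    return sorted(range(len(passages)), key=lambda i: sims[i], reverse=True)[:k]


# Toy vectors, for illustration only
passages = [[1.0, 0.0], [10.0, 10.0], [0.0, 2.0]]
query = [3.0, 0.1]

raw = [dot(query, p) for p in passages]            # unnormalized inner products
raw_best = max(range(len(passages)), key=raw.__getitem__)
order = top_k_cosine(query, passages, k=2)          # cosine-based ranking
```

Here the raw inner product prefers the long vector at index 1, while the cosine ranking puts index 0 first because it points in nearly the same direction as the query.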

## 3️⃣ Initialize Retrievers

In [16]:
# Initialize BM25
bm25_retriever = BM25Retriever(corpus)

Initializing BM25...
✓ BM25 ready


In [17]:
# Initialize FAISS (this may take a few minutes)
faiss_retriever = FAISSRetriever(corpus)

Initializing FAISS with all-MiniLM-L6-v2...



Creating passage embeddings...



✓ FAISS index ready with 46609 vectors


## 4️⃣ Summarization Model

In [18]:
class SummarizationModel:
    def __init__(self, model_name: str = "google/flan-t5-base"):
        print(f"Loading {model_name}...")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(self.device)
        print(f"✓ Model loaded on {self.device}")
    
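    def generate(self, question: str, context: List[str], max_length: int = 150) -> str:
        # Concatenate the retrieved passages into a single context string
        context_str = " ".join(context)
        prompt = f"Question: {question}\n\nContext: {context_str}\n\nSummarize the answer based on the context:"
        
        # The input is truncated from the right at 512 tokens, so with long contexts
        # the later passages (and possibly the trailing instruction) may be cut off
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(self.device)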
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                num_beams=4,
                early_stopping=True
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Initialize
summarizer = SummarizationModel()

Loading google/flan-t5-base...



✓ Model loaded on cuda
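
Because `generate` truncates the tokenized prompt at 512 tokens, a long context can push the final instruction out of the input. One possible way to guard against that is to pack passages into a fixed budget before building the prompt. The sketch below is an assumption-laden illustration, not the notebook's method: `build_prompt` is a hypothetical helper, and the default `count_tokens` simply counts whitespace-separated words as a stand-in for a real tokenizer.

```python
from typing import Callable, List

TEMPLATE = "Question: {q}\n\nContext: {c}\n\nSummarize the answer based on the context:"


def build_prompt(
    question: str,
    passages: List[str],
    budget: int = 512,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Add passages in rank order until the token budget would be exceeded."""
    # Reserve room for the question and the instruction first
    used = count_tokens(TEMPLATE.format(q=question, c=""))
    kept: List[str] = []
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # drop this passage and every lower-ranked one
        kept.append(passage)
        used += cost
    return TEMPLATE.format(q=question, c=" ".join(kept))


# Three synthetic 300-word passages, for illustration only
passages = ["a " * 300, "b " * 300, "c " * 300]
prompt = build_prompt("Is X associated with Y?", passages, budget=512)
```

With a Hugging Face tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer(s)["input_ids"])`; that slightly overestimates the total because each piece is counted with its own end-of-sequence token, which keeps the budget conservative.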


## 5️⃣ Quick Test (Small Sample)

In [19]:
# Test with one example
test_row = df.drop_duplicates(subset=["question_id"]).iloc[0]

print("Question:", test_row["question"])
print("\nTrue answer:", test_row["answer"])

# BM25
print("\n" + "="*60)
print("BM25 Retrieval (k=3):")
bm25_ctx = bm25_retriever.retrieve(test_row["question"], k=3)
bm25_summary = summarizer.generate(test_row["question"], bm25_ctx)
print("Summary:", bm25_summary)
print("FK Grade:", textstat.flesch_kincaid_grade(bm25_summary))

# FAISS
print("\n" + "="*60)
print("FAISS Retrieval (k=3):")
faiss_ctx = faiss_retriever.retrieve(test_row["question"], k=3)
faiss_summary = summarizer.generate(test_row["question"], faiss_ctx)
print("Summary:", faiss_summary)
print("FK Grade:", textstat.flesch_kincaid_grade(faiss_summary))

Question: Is hidradenitis suppurativa a systemic disease with substantial comorbidity burden?

True answer: Control subjects were not validated for absence of HS and comorbidity validation was not performed for either group.

BM25 Retrieval (k=3):
Summary: HS is a chronic inflammatory disease with substantial comorbidities
FK Grade: 14.142222222222227

FAISS Retrieval (k=3):
Summary: HS is a chronic inflammatory disease involving intertriginous skin.
FK Grade: 16.764444444444447
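
The FK grades printed above follow the standard Flesch-Kincaid grade-level formula, which depends only on word, sentence, and syllable counts. The sketch below applies it directly; the function name is hypothetical, and the syllable counts (20 and 22) are assumptions inferred from the printed grades rather than values read out of `textstat`.

```python
def fk_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Both summaries above are single sentences of 9 words.
# Syllable counts are inferred from the printed grades (assumption).
bm25_grade = fk_grade_from_counts(words=9, sentences=1, syllables=20)
faiss_grade = fk_grade_from_counts(words=9, sentences=1, syllables=22)
```

Under these assumed counts the two values come out at roughly 14.14 and 16.76. For such short outputs a two-syllable difference moves the grade by more than two levels, so per-summary FK grades should be read with some caution.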


## 6️⃣ Evaluation Functions

In [20]:
class RAGEvaluator:
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL'], 
            use_stemmer=True
        )
    
    def evaluate_batch(self, predictions: List[str], references: List[str]) -> Tuple[Dict, List]:
        # ROUGE scores
        rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
        
        for pred, ref in zip(predictions, references):
            scores = self.rouge_scorer.score(ref, pred)
            rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
            rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
            rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)
        
        # BERTScore (bert_score.score loads its scoring model on every call)
        P, R, F1 = bert_score(predictions, references, lang="en", verbose=False)
        
        # Flesch-Kincaid
        fk_grades = [textstat.flesch_kincaid_grade(p) for p in predictions]
        
        metrics = {
            "ROUGE-1": np.mean(rouge_scores['rouge1']),
            "ROUGE-2": np.mean(rouge_scores['rouge2']),
            "ROUGE-L": np.mean(rouge_scores['rougeL']),
            "BERTScore_F1": F1.mean().item(),
            "Avg_FK_Grade": np.mean(fk_grades)
        }
        
        return metrics, fk_grades

evaluator = RAGEvaluator()
print("✓ Evaluator ready")

✓ Evaluator ready
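
As a rough illustration of what the ROUGE-1 F-measure collected above captures, the sketch below computes unigram-overlap precision, recall, and F1 with plain whitespace tokenization. It is a simplification: `rouge1_f1` is a hypothetical helper, and the `rouge_score` library applies its own tokenizer and, with `use_stemmer=True`, stemming, so its numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between a reference and a prediction."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy sentences, for illustration only
partial = rouge1_f1("the cat sat on the mat", "the cat sat")
identical = rouge1_f1("the cat sat", "the cat sat")
disjoint = rouge1_f1("the cat sat", "dogs bark loudly")
```

A short prediction that copies part of the reference gets perfect precision but low recall, which is one reason very brief summaries can still score moderately on ROUGE.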


## 7️⃣ Ablation Study

In [21]:
# Sample test set for ablation
TEST_SIZE = 500  # Adjust based on your computational resources
test_df = df.drop_duplicates(subset=["question_id"]).sample(n=TEST_SIZE, random_state=42)

print(f"Test set size: {len(test_df)}")

Test set size: 500


In [22]:
# Ablation configurations
configs = [
    {"name": "BM25_k3", "retriever": bm25_retriever, "k": 3},
    {"name": "BM25_k5", "retriever": bm25_retriever, "k": 5},
    {"name": "BM25_k10", "retriever": bm25_retriever, "k": 10},
    {"name": "FAISS_k3", "retriever": faiss_retriever, "k": 3},
    {"name": "FAISS_k5", "retriever": faiss_retriever, "k": 5},
    {"name": "FAISS_k10", "retriever": faiss_retriever, "k": 10},
]

ablation_results = []

for config in configs:
    print(f"\n{'='*60}")
    print(f"Running: {config['name']}")
    print(f"{'='*60}")
    
    predictions = []
    references = []
    
    for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc=config['name']):
        ctx = config['retriever'].retrieve(row['question'], k=config['k'])
        pred = summarizer.generate(row['question'], ctx)
        predictions.append(pred)
        references.append(row['answer'])
    
    # Evaluate
    metrics, _ = evaluator.evaluate_batch(predictions, references)
    
    result = {
        "config_name": config['name'],
        "retriever_type": config['name'].split('_')[0],
        "k": config['k'],
        **metrics
    }
    ablation_results.append(result)
    
    print(f"\nResults:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")

# Save results
os.makedirs("results", exist_ok=True)
ablation_df = pd.DataFrame(ablation_results)
ablation_df.to_csv("results/ablation_study.csv", index=False)

print("\n" + "="*80)
print("ABLATION STUDY COMPLETE")
print("="*80)
ablation_df


Running: BM25_k3


BM25_k3: 100%|██████████| 500/500 [08:35<00:00,  1.03s/it]



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Results:
  ROUGE-1: 0.2954
  ROUGE-2: 0.1239
  ROUGE-L: 0.2329
  BERTScore_F1: 0.8725
  Avg_FK_Grade: 16.6095

Running: BM25_k5


BM25_k5: 100%|██████████| 500/500 [08:24<00:00,  1.01s/it]



Results:
  ROUGE-1: 0.2851
  ROUGE-2: 0.1209
  ROUGE-L: 0.2262
  BERTScore_F1: 0.8702
  Avg_FK_Grade: 16.3056

Running: BM25_k10


BM25_k10: 100%|██████████| 500/500 [08:03<00:00,  1.03it/s]



Results:
  ROUGE-1: 0.2784
  ROUGE-2: 0.1197
  ROUGE-L: 0.2232
  BERTScore_F1: 0.8689
  Avg_FK_Grade: 16.0520

Running: FAISS_k3


FAISS_k3: 100%|██████████| 500/500 [06:11<00:00,  1.34it/s]



Results:
  ROUGE-1: 0.3052
  ROUGE-2: 0.1316
  ROUGE-L: 0.2423
  BERTScore_F1: 0.8758
  Avg_FK_Grade: 16.1163

Running: FAISS_k5


FAISS_k5: 100%|██████████| 500/500 [06:03<00:00,  1.38it/s]



Results:
  ROUGE-1: 0.2856
  ROUGE-2: 0.1216
  ROUGE-L: 0.2251
  BERTScore_F1: 0.8722
  Avg_FK_Grade: 16.1249

Running: FAISS_k10


FAISS_k10: 100%|██████████| 500/500 [05:57<00:00,  1.40it/s]



Results:
  ROUGE-1: 0.2810
  ROUGE-2: 0.1224
  ROUGE-L: 0.2242
  BERTScore_F1: 0.8711
  Avg_FK_Grade: 15.9115

ABLATION STUDY COMPLETE


| | config_name | retriever_type | k | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore_F1 | Avg_FK_Grade |
|---|---|---|---|---|---|---|---|---|
| 0 | BM25_k3 | BM25 | 3 | 0.295422 | 0.123913 | 0.232937 | 0.872539 | 16.609516 |
| 1 | BM25_k5 | BM25 | 5 | 0.285055 | 0.120936 | 0.226201 | 0.870154 | 16.305616 |
| 2 | BM25_k10 | BM25 | 10 | 0.278429 | 0.119725 | 0.223163 | 0.86888 | 16.052027 |
| 3 | FAISS_k3 | FAISS | 3 | 0.305214 | 0.131634 | 0.242267 | 0.87578 | 16.11625 |
| 4 | FAISS_k5 | FAISS | 5 | 0.285635 | 0.121625 | 0.225136 | 0.872206 | 16.124874 |
| 5 | FAISS_k10 | FAISS | 10 | 0.281001 | 0.122428 | 0.224152 | 0.87108 | 15.911467 |
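
One quick way to read the ablation table is to ask which configuration wins on each overlap metric and how much ROUGE-1 drops between k=3 and k=10. The sketch below does this on the values copied from the table; the variable names are illustrative, and since all numbers come from a single 500-question sample, small gaps should not be over-interpreted.

```python
# (config_name, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore_F1), copied from the table above
rows = [
    ("BM25_k3", 0.295422, 0.123913, 0.232937, 0.872539),
    ("BM25_k5", 0.285055, 0.120936, 0.226201, 0.870154),
    ("BM25_k10", 0.278429, 0.119725, 0.223163, 0.868880),
    ("FAISS_k3", 0.305214, 0.131634, 0.242267, 0.875780),
    ("FAISS_k5", 0.285635, 0.121625, 0.225136, 0.872206),
    ("FAISS_k10", 0.281001, 0.122428, 0.224152, 0.871080),
]
metrics = ["ROUGE-1", "ROUGE-2", "ROUGE-L", "BERTScore_F1"]

# Best configuration for each metric (column i + 1 holds metric i)
best_per_metric = {
    metric: max(rows, key=lambda r: r[i + 1])[0]
    for i, metric in enumerate(metrics)
}

# ROUGE-1 drop from k=3 to k=10 for each retriever
rouge1 = {name: r1 for name, r1, *_ in rows}
rouge1_drop = {
    retriever: rouge1[f"{retriever}_k3"] - rouge1[f"{retriever}_k10"]
    for retriever in ("BM25", "FAISS")
}
```

On this sample the same configuration comes out on top for all four metrics, and both retrievers score lower on ROUGE-1 as k grows.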


## 8️⃣ Full Evaluation

Use the best-performing configuration from the ablation study.

In [23]:
# Select best config from ablation
best_config = ablation_df.loc[ablation_df['BERTScore_F1'].idxmax()]
print("Best configuration:")
print(best_config)

# Use the best retriever and k
BEST_RETRIEVER = faiss_retriever if "FAISS" in best_config['config_name'] else bm25_retriever
BEST_K = int(best_config['k'])

Best configuration:
config_name       FAISS_k3
retriever_type       FAISS
k                        3
ROUGE-1           0.305214
ROUGE-2           0.131634
ROUGE-L           0.242267
BERTScore_F1       0.87578
Avg_FK_Grade      16.11625
Name: 3, dtype: object


In [24]:
# Full evaluation on entire dataset
print("Running full evaluation...")
print(f"Using: {best_config['config_name']}")

unique_df = df.drop_duplicates(subset=["question_id"]).reset_index(drop=True)

predictions = []
references = []

for _, row in tqdm(unique_df.iterrows(), total=len(unique_df), desc="Full evaluation"):
    ctx = BEST_RETRIEVER.retrieve(row['question'], k=BEST_K)
    pred = summarizer.generate(row['question'], ctx)
    predictions.append(pred)
    references.append(row['answer'])

# Add predictions to dataframe
unique_df['rag_summary'] = predictions

# Evaluate
final_metrics, fk_grades = evaluator.evaluate_batch(predictions, references)
unique_df['FK_grade'] = fk_grades

# Save
unique_df.to_csv("results/rag_full_outputs.csv", index=False)

with open("results/rag_full_metrics.json", "w") as f:
    json.dump(final_metrics, f, indent=2)

print("\n" + "="*80)
print("FINAL METRICS")
print("="*80)
for metric, value in final_metrics.items():
    print(f"{metric}: {value:.4f}")

print(f"\n✓ Saved to results/")

Running full evaluation...
Using: FAISS_k3


Full evaluation: 100%|██████████| 13738/13738 [2:50:23<00:00,  1.34it/s]  



FINAL METRICS
ROUGE-1: 0.3031
ROUGE-2: 0.1291
ROUGE-L: 0.2413
BERTScore_F1: 0.8760
Avg_FK_Grade: 16.1867

✓ Saved to results/
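
The full run above took close to three hours, and the loop keeps every prediction in memory until the end, so an interrupted session would lose all progress. A possible safeguard, sketched here under assumptions, is to append each result to a JSONL file and skip already-processed items on restart. The helper `run_with_checkpoint`, the record layout, and the file path are all hypothetical and not part of the notebook.

```python
import json
import os
from typing import Callable, Hashable, List, Tuple


def run_with_checkpoint(
    items: List[Tuple[Hashable, str]],
    process: Callable[[str], str],
    path: str,
) -> List[str]:
    """Process (id, text) pairs, appending each result to a JSONL checkpoint."""
    done = {}
    # Reload results from a previous, possibly interrupted, run
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["id"]] = record["output"]

    with open(path, "a", encoding="utf-8") as f:
        for item_id, text in items:
            if item_id in done:
                continue  # already computed in an earlier run
            output = process(text)
            f.write(json.dumps({"id": item_id, "output": output}) + "\n")
            f.flush()  # make the result durable before moving on
            done[item_id] = output

    # Return outputs in the original item order
    return [done[item_id] for item_id, _ in items]
```

In the notebook's setting, `items` could be built from `question_id` and `question`, and `process` could wrap the retrieve-then-generate step. The ids must survive a JSON round trip unchanged, which holds for plain Python integers and strings.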


## 9️⃣ Extract Required Passages

In [25]:
# Extract specific passages
required_passages = [16771, 12220, 29568]

subset_df = unique_df[unique_df['passage_id'].isin(required_passages)].copy()
subset_df.to_csv("results/rag_required_passages.csv", index=False)

print("Required passages extracted:")
subset_df[['passage_id', 'question', 'rag_summary', 'FK_grade']]

Required passages extracted:


| | passage_id | question | rag_summary | FK_grade |
|---|---|---|---|---|
| 3601 | 12220 | Is parathyroid hormone associated with biomark... | PTH and 25-hydroxyvitamin D correlate with met... | 15.64 |
| 4953 | 16771 | Does iGF-2 mediate intestinal mucosal hyperpla... | IGF2 mediates intestinal mucosal hyperplasia i... | 25.942222 |
| 8722 | 29568 | Is insulin-like growth factor binding protein-... | IGFBP2 is elevated in blood of lung cancer pat... | 11.5 |


## Summary

### Generated Files:
1. `results/ablation_study.csv` - Comparison of different retrievers and k values
2. `results/rag_full_outputs.csv` - All predictions with individual FK grades
3. `results/rag_full_metrics.json` - Overall metrics (ROUGE, BERTScore, Avg FK)
4. `results/rag_required_passages.csv` - Specific passages [16771, 12220, 29568]

### Metrics in JSON:
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore_F1
- Avg_FK_Grade ← **Average across all samples**

### Individual FK Grades:
- Stored in CSV files (column: `FK_grade`)