# Week 8 Assignment: Multimodal Summarization and Reward Modeling

## Homework Introduction

Effective summarization is critical in research because it distills large, complex documents into concise overviews that highlight key insights. Researchers often rely on summaries to quickly understand a paper’s contributions without reading every detail. However, automatically evaluating the quality of generated summaries is challenging. ROUGE measures n-gram overlap with a reference summary, and BERTScore compares contextual token embeddings against that reference; both depend on the reference and can miss qualities such as factual correctness or coherence.

Reward modeling offers a way to address this gap. In reinforcement learning from human feedback (RLHF), we train a reward model on examples of outputs labeled by humans. The reward model learns to predict which summary a person would prefer, serving as a proxy for human judgment. By training such a model on preference data, we can score new summaries according to human-aligned preferences, rather than just surface similarity.
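
To make "learns to predict which summary a person would prefer" concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) objective commonly used for reward models. The assignment does not name a specific loss, so treat this as an assumption; `pairwise_preference_loss` is a hypothetical helper, not part of any library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the chosen summary scores well above the rejected
    one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards mean the model is undecided, so the loss equals log(2).
print(pairwise_preference_loss(0.0, 0.0))
# A correctly ordered pair is penalized less than an incorrectly ordered one.
print(pairwise_preference_loss(2.0, 0.0), pairwise_preference_loss(0.0, 2.0))
```

Only the difference between the two rewards matters, which is why the reward model's absolute scores are meaningful mainly for ranking summaries against each other.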

## Learning Objectives

* Generate abstractive summaries of academic documents using LLaMA 3 (8B).
* Collect two candidate summaries per paper and have annotators select the better summary.
* Prepare the dataset of summary pairs and preference labels for reward model training.
* Train a reward model (e.g., DeBERTa-v3) on the collected preference data.
* Evaluate summaries using ROUGE and BERTScore, and compare these metrics to the reward model’s scores.

## Project Design

* **Data Collection:** Select 10 academic papers (including both text and figures) from arXiv or recent NLP conference proceedings.
* **Summary Generation:** For each paper, use the LLaMA 3 (8B) model to generate *two* different summaries. Vary the prompting strategy or sampling parameters to produce diverse outputs.
* **Human Annotation:** Have one or two human annotators compare the two summaries for each paper and choose the better one, judging criteria such as informativeness, coherence, and factual consistency. Record which summary is preferred.
* **Data Formatting:** Create a dataset (e.g. in JSONL format) of summary pairs and preference labels. Each entry should include the two summary texts and which one was chosen (for example, fields `chosen` and `rejected` as required by reward modeling tools).
* **Reward Model Training:** Fine-tune a reward model (such as DeBERTa-v3) on this preference data. Use the chosen/rejected summary pairs so the model learns to assign higher scores to the preferred summaries.
* **Evaluation:** Generate summaries (or summary pairs) for 10 new papers and score them using the trained reward model. Also compute ROUGE and BERTScore for these summaries.
* **Comparison:** Analyze how the reward model’s scores align with ROUGE and BERTScore. Discuss examples where the reward model and the automatic metrics agree or disagree on which summary is better.
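
For the comparison step, one simple option is to measure how often two scorers prefer the same summary within each pair. The sketch below assumes each scorer (the reward model, ROUGE, BERTScore, or a human label encoded as 1/0) yields one score per summary; `preferred_index` and `agreement_rate` are hypothetical helpers, and the numbers in the usage lines are made up for illustration.

```python
def preferred_index(score_first, score_second):
    """Return 0 or 1 for the higher-scoring summary, or None on a tie."""
    if score_first > score_second:
        return 0
    if score_second > score_first:
        return 1
    return None


def agreement_rate(scores_a, scores_b):
    """Fraction of pairs on which two scorers prefer the same summary.

    Each argument is a list of (score_for_summary_1, score_for_summary_2)
    tuples, one tuple per paper. Two ties on the same pair count as agreement.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must cover the same pairs")
    if not scores_a:
        raise ValueError("need at least one pair")
    matches = sum(
        preferred_index(*a) == preferred_index(*b)
        for a, b in zip(scores_a, scores_b)
    )
    return matches / len(scores_a)


# Illustrative values only: the scorers agree on paper 1 and disagree on paper 2.
reward_scores = [(0.9, 0.1), (0.2, 0.8)]
rouge_scores = [(0.5, 0.4), (0.7, 0.3)]
print(agreement_rate(reward_scores, rouge_scores))
```

The pairs where the two scorers disagree are good candidates for the qualitative discussion the assignment asks for.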

## Starter Code

* **Prompt Examples:** Prewritten prompt templates for LLaMA 3 summarization. For example: `"Summarize the following research paper excerpt:\n\n[insert paper text here]"`.
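
A minimal sketch of how two prompt variants and two sampling setups might be organized. The `structured` template and both sampling configurations are illustrative assumptions, not values prescribed by the assignment; the dictionary keys in `SAMPLING_CONFIGS` follow keyword arguments accepted by Hugging Face `generate()`.

```python
# Prompt templates; "plain" is the example template from this section,
# "structured" is a hypothetical variant added for diversity.
PROMPT_TEMPLATES = {
    "plain": "Summarize the following research paper excerpt:\n\n{paper_text}",
    "structured": (
        "Summarize the following research paper excerpt in 3-4 sentences, "
        "covering the problem, the method, and the main result:\n\n{paper_text}"
    ),
}

# Hypothetical sampling settings to pass as **kwargs to model.generate().
SAMPLING_CONFIGS = {
    "conservative": {"do_sample": True, "temperature": 0.3, "top_p": 0.9, "max_new_tokens": 256},
    "diverse": {"do_sample": True, "temperature": 0.9, "top_p": 0.95, "max_new_tokens": 256},
}


def build_prompt(paper_text: str, template: str = "plain") -> str:
    """Fill the named template with the (whitespace-trimmed) paper text."""
    return PROMPT_TEMPLATES[template].format(paper_text=paper_text.strip())


# Two candidate prompts for the same paper, one per template.
excerpt = "We propose a method for ..."
for name in PROMPT_TEMPLATES:
    print(f"--- {name} ---")
    print(build_prompt(excerpt, name))
```

Varying either the template or the sampling configuration (or both) between the two generations is enough to produce a pair of distinct candidates per paper.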


* **Dataset Format:** Example code showing how to store summary pairs and labels. For instance, a JSONL file where each record has `"chosen"` and `"rejected"` summary fields (matching the RewardTrainer input format).


In [None]:
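import json

# `summary_pairs` is assumed to be a list of dicts, one per paper, holding the
# annotator-preferred summary under "preferred" and the other under "other".
data = [
    {"chosen": pair["preferred"], "rejected": pair["other"]}
    for pair in summary_pairs
]

# Write one JSON object per line (JSONL).
with open("reward_data.jsonl", "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")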
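
The cell above takes a `summary_pairs` list as given. One way to build it from raw annotations is sketched below, assuming each record stores both summaries plus a list of annotator votes. The field names (`summary_a`, `summary_b`, `votes`) and the choice to drop tied pairs are assumptions for illustration, not requirements of the assignment.

```python
from collections import Counter


def build_summary_pairs(records):
    """Convert annotation records into {"preferred", "other"} dicts.

    Each record is a dict with "summary_a", "summary_b", and "votes", where
    "votes" is a list of "a"/"b" strings (one per annotator). Pairs without a
    clear majority are skipped because they carry no preference signal.
    """
    pairs = []
    for record in records:
        counts = Counter(record["votes"])
        if counts["a"] == counts["b"]:
            continue  # tie (or no votes): no usable preference
        if counts["a"] > counts["b"]:
            winner, loser = "summary_a", "summary_b"
        else:
            winner, loser = "summary_b", "summary_a"
        pairs.append({"preferred": record[winner], "other": record[loser]})
    return pairs


# Toy records for illustration: the second pair is a tie and gets dropped.
records = [
    {"summary_a": "Summary A1", "summary_b": "Summary B1", "votes": ["b", "b"]},
    {"summary_a": "Summary A2", "summary_b": "Summary B2", "votes": ["a", "b"]},
]
print(build_summary_pairs(records))
```

With two annotators, ties are likely on close pairs, so it may be worth recording how many pairs were dropped.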

* **Reward Training Loop:** Sample code (using Hugging Face `transformers` and `trl`) to fine-tune a reward model on the preference dataset. This should load the model (e.g. DeBERTa-v3) and train it on the chosen/rejected pairs.

In [None]:
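import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 makes the classification head output a single scalar reward.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

dataset = load_dataset("json", data_files="reward_data.jsonl", split="train")

def preprocess(batch):
    # Tokenize each summary on its own: the reward model scores the chosen and
    # rejected texts separately, so they must not be packed into one sequence.
    chosen = tokenizer(batch["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(batch["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

# RewardConfig extends TrainingArguments with reward-specific defaults.
training_args = RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_strategy="epoch",
    logging_steps=10,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

# NOTE: the expected column names and keyword arguments depend on the installed
# TRL release (older releases take `tokenizer=` instead of `processing_class=`);
# check the RewardTrainer documentation for your version.
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("reward_model")  # final weights for the deliverable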

* **Evaluation Script:** Example code to compute ROUGE and BERTScore (using the `evaluate` library) and to run the reward model scoring on a batch of summaries. The script can output metric scores and compare reward model rankings.

In [None]:
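from evaluate import load

# `generated_summaries` and `reference_summaries` are assumed to be equal-length
# lists of strings, aligned so that item i of each refers to the same paper.
rouge = load("rouge")
bertscore = load("bertscore")

results_rouge = rouge.compute(predictions=generated_summaries, references=reference_summaries)
results_bertscore = bertscore.compute(predictions=generated_summaries, references=reference_summaries, lang="en")

# ROUGE returns aggregated scores; BERTScore returns per-summary lists,
# so average the F1 values to get a single number.
mean_bertscore_f1 = sum(results_bertscore["f1"]) / len(results_bertscore["f1"])

print("ROUGE:", results_rouge)
print("BERTScore F1 (mean):", mean_bertscore_f1)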
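
The reward-model half of the evaluation script could be sketched as follows. `make_reward_scorer` and `rank_by_reward` are hypothetical helpers, and `model_dir` is assumed to be a directory that contains both the fine-tuned reward model and its tokenizer. The heavy imports live inside the factory so the ranking logic can be tried without loading a model.

```python
def make_reward_scorer(model_dir):
    """Return a function that maps a list of summaries to scalar rewards."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    def score(texts):
        inputs = tokenizer(
            texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**inputs).logits  # shape: (batch, 1) for num_labels=1
        return logits.squeeze(-1).tolist()

    return score


def rank_by_reward(summaries, score_fn):
    """Return (summary, score) tuples sorted from highest to lowest score."""
    scores = score_fn(summaries)
    return sorted(zip(summaries, scores), key=lambda item: item[1], reverse=True)


# Stand-in scorer (summary length) to show the ranking logic without a model.
# With a trained model: rank_by_reward(summaries, make_reward_scorer("reward_model"))
candidates = ["short", "a somewhat longer summary"]
print(rank_by_reward(candidates, lambda texts: [len(t) for t in texts]))
```

Because the reward is an unnormalized scalar, compare rankings or score differences within a paper instead of reading absolute values.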


## Environment Setup

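* Install required Python libraries: `transformers`, `datasets`, `evaluate`, `trl` (Hugging Face TRL), and `accelerate`. The ROUGE and BERTScore metrics in `evaluate` also need the `rouge_score` and `bert_score` packages.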
* (Optional) Install `peft` if you want to use parameter-efficient fine-tuning for the reward model.
* Ensure you have GPU access for model training (e.g., use Google Colab Pro, AWS, or a local GPU).
* Download or load the LLaMA 3 (8B) model checkpoint and a DeBERTa-v3 checkpoint (for example, via Hugging Face Hub).
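
A quick way to confirm the setup before starting is sketched below. It only checks that the packages listed above can be found and, if PyTorch is installed, whether a CUDA GPU is visible; `missing_packages` is a hypothetical helper.

```python
import importlib.util

REQUIRED_PACKAGES = ["transformers", "datasets", "evaluate", "trl", "accelerate"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
print("Missing packages:", missing or "none")

# Only query the GPU if PyTorch itself is available.
if importlib.util.find_spec("torch") is not None:
    import torch

    print("CUDA available:", torch.cuda.is_available())
```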

## Deliverables

* A JSONL file containing 20 summary pairs with preference labels (the dataset of chosen/rejected summaries).
* The fine-tuned reward model weights (saved model file).
* An evaluation notebook (or script) that computes ROUGE and BERTScore on the summaries and compares them to the reward model’s scores/rankings.

## Exploration Tips

* Experiment with alternative models if resources allow. For example, try Mixtral-8x7B (a Mixture-of-Experts LLM) or the DeepSeek-VL vision-language model for summarization. Compare their outputs.
* Incorporate structured content into the prompts: e.g. include figure captions or table content when generating summaries to make the task truly multimodal.
* Compare summaries on qualitative criteria (factual consistency, conciseness, readability, etc.) and see how these aspects correlate with the numeric scores from ROUGE/BERTScore and from the reward model.
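
To quantify how one set of scores tracks another (for example, reward model scores against ROUGE-L, or against a 1-5 human rating of factual consistency), a rank correlation is a reasonable starting point. Below is a dependency-free sketch of Spearman's coefficient with average ranks for ties; `scipy.stats.spearmanr` is an alternative if SciPy is installed. The function names are hypothetical and the sample values are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: one score per summary from each scorer.
reward_scores = [0.2, 1.5, -0.3, 0.9]
rouge_l_scores = [0.31, 0.40, 0.28, 0.35]
print(spearman(reward_scores, rouge_l_scores))
```

With only 10 evaluation papers the coefficient will be noisy, so it is best read alongside the individual examples.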

**Sources:** Summarization is often used to reduce long inputs and highlight key points. Evaluating summary quality is a known open challenge due to subjective references and aspects like coherence that metrics may miss. Reward modeling (from RLHF) involves training a model on human preference data so it can align generation with human judgments.

