# Llama 3.1 8B — QLoRA fine-tune on Abstract → Summary pairs
- Input: `data/abstract_pairs.parquet` (paper_id, title, input_text, summary, source)
- Model: `meta-llama/Meta-Llama-3.1-8B` in 4-bit (QLoRA)
- Goal: teach the model to summarize abstracts (100–200 words)
- We will:
  1. Load and format data
  2. Fine-tune with LoRA (PEFT + TRL)
  3. Save adapter
  4. Reload base vs fine-tuned and compare with ROUGE-L & BERTScore

## Ensure GPU is Connected

In [15]:
import torch
print("CUDA Available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name() if torch.cuda.is_available() else None)

CUDA Available: True
GPU: NVIDIA GeForce RTX 3090


## 1) Imports & config

In [16]:
import os, random, pandas as pd, numpy as np, torch
from pathlib import Path
from datasets import Dataset
from sklearn.model_selection import train_test_split

HF_MODEL_ID = "meta-llama/Meta-Llama-3.1-8B"
OUT_DIR = Path("outputs/llama31_8b_qlora_abstracts")
OUT_DIR.mkdir(parents=True, exist_ok=True)

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

print("CUDA available:", torch.cuda.is_available())
!nvidia-smi

# If you haven't logged into Hugging Face on this machine, run once:
# from huggingface_hub import login
# login()  # paste your HF token (must have Llama 3.1 access)

CUDA available: True
Fri Dec  5 13:18:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   32C    P8             39W /  390W |    8383MiB /  24576MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-------------------------

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## 2) Load dataset (abstract pairs) and build prompts
Expects: `data/abstract_pairs.parquet` with columns:
- `paper_id`
- `title`
- `input_text` (abstract)
- `summary` (target summary)
- `source`  (s2/gpt/etc.)
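
Before training, it can help to confirm the schema and check how many summaries actually fall in the 100–200 word target range. The sketch below works on plain Python values. `REQUIRED_COLUMNS`, `missing_columns`, and `summary_in_range` are hypothetical helper names that do not appear elsewhere in this notebook, and whitespace splitting is only a rough word count.

```python
# Hypothetical helpers for a quick schema and length sanity check.
REQUIRED_COLUMNS = ("paper_id", "title", "input_text", "summary", "source")


def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def summary_in_range(summary, min_words=100, max_words=200):
    """True if the whitespace-delimited word count is within the target range."""
    n_words = len(summary.split())
    return min_words <= n_words <= max_words


print(missing_columns(["paper_id", "title", "input_text", "summary"]))
print(summary_in_range("word " * 150))
```

Once the DataFrame is loaded, something like `missing_columns(df.columns)` and `df["summary"].map(summary_in_range).mean()` would report missing columns and the fraction of in-range summaries.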

In [17]:
df = pd.read_parquet("data/abstract_pairs.parquet")
print("Columns:", df.columns.tolist())
print("Total rows:", len(df))

# Keep up to 6800 examples
if len(df) > 6800:
    df = df.sample(6800, random_state=SEED).reset_index(drop=True)
print("Using rows:", len(df))

df[["title", "input_text", "summary"]].head(6)

Columns: ['paper_id', 'title', 'input_text', 'summary', 'source']
Total rows: 6800
Using rows: 6800


|   | title | input_text | summary |
|---|-------|------------|---------|
| 0 | Observations on LLMs for Telecom Domain: Capab... | The landscape for building conversational inte... | Recent advancements in artificial intelligence... |
| 1 | Connect the dots: Dataset Condensation, Differ... | Our work focuses on understanding the underpin... | This research explores how to improve a proces... |
| 2 | Grounding Language about Belief in a Bayesian ... | Despite the fact that beliefs are mental state... | Humans often discuss each other's beliefs, eve... |
| 3 | Open-world Semi-supervised Novel Class Discovery | Traditional semi-supervised learning tasks ass... | In many real-world situations, we encounter da... |
| 4 | Understanding Survey Paper Taxonomy about Larg... | As new research on Large Language Models (LLMs... | As research on Large Language Models (LLMs) gr... |
| 5 | Restore Anything Pipeline: Segment Anything Me... | Recent image restoration methods have produced... | Recent advancements in image restoration have ... |


### Build supervised prompts
We create a single text field that includes:
Instruction + Title + Abstract + `Output:` + gold summary

The model is trained with next-token prediction on this sequence (SFT style).
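
Because each training text ends with the gold summary right after `### Output:`, a generation prompt in the same format would stop at that header and let the model continue from there. The sketch below splits the template into a prompt part and a full training text. `build_prompt_only` and `build_training_text` are illustrative names, and the dict-based `row` stands in for a DataFrame row.

```python
INSTRUCTION = (
    "Summarize the following abstract into a clear, faithful "
    "100–200 word summary for a general audience."
)


def build_prompt_only(row):
    """Everything the model conditions on: instruction, title, abstract, output header."""
    return (
        "### Instruction:\n"
        f"{INSTRUCTION}\n\n"
        f"### Title:\n{row['title']}\n\n"
        f"### Abstract:\n{row['input_text']}\n\n"
        "### Output:\n"
    )


def build_training_text(row):
    """Prompt followed by the gold summary, i.e. the full SFT sequence."""
    return build_prompt_only(row) + row["summary"]


example = {
    "title": "A toy title",
    "input_text": "A toy abstract.",
    "summary": "A toy summary.",
}
print(build_training_text(example))
```

Keeping the prompt in one function means the same string can be reused at inference time, so the model sees the format it was trained on.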

In [18]:
def build_prompt(row):
    return (
        "### Instruction:\n"
        "Summarize the following abstract into a clear, faithful 100–200 word summary for a general audience.\n\n"
        f"### Title:\n{row['title']}\n\n"
        f"### Abstract:\n{row['input_text']}\n\n"
        "### Output:\n"
        f"{row['summary']}"
    )


df["text"] = df.apply(build_prompt, axis=1)

# Split into train/validation
train_df, val_df = train_test_split(df[["text", "title", "input_text", "summary"]],
                                    test_size=0.1, random_state=SEED)

train_ds = Dataset.from_pandas(train_df.reset_index(drop=True)[["text"]])
val_ds   = Dataset.from_pandas(val_df.reset_index(drop=True)[["text"]])

len(train_ds), len(val_ds)

print(train_df.head(6))

                                                   text  \
763   ### Instruction:\nSummarize the following abst...   
373   ### Instruction:\nSummarize the following abst...   
4836  ### Instruction:\nSummarize the following abst...   
2629  ### Instruction:\nSummarize the following abst...   
308   ### Instruction:\nSummarize the following abst...   
4100  ### Instruction:\nSummarize the following abst...   

                                                  title  \
763   Optimizing fairness tradeoffs in machine learn...   
373   Intuition emerges in Maximum Caliber models at...   
4836  Learning Computational Efficient Bots with Cos...   
2629  A Nonlinear Hash-based Optimization Method for...   
308   Boosting Theory-of-Mind Performance in Large L...   
4100  Data Formulator: AI-powered Concept-driven Vis...   

                                             input_text  \
763   Improving the fairness of machine learning mod...   
373   Whether large predictive models merely parrot ..

## 3) Tokenizer & model (4-bit with BitsAndBytes for QLoRA)
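
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a 24 GB card. The sketch assumes about 8.03 billion parameters and the block sizes described in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block). It covers weight storage only: activations, LoRA parameters, and optimizer state come on top, and layers that are not quantized (such as embeddings) will push the real figure higher.

```python
N_PARAMS = 8.03e9          # approximate parameter count (assumption)
BLOCK_SIZE = 64            # weights per quantization block (assumption from the QLoRA paper)
SECOND_LEVEL_BLOCK = 256   # constants per double-quantization block (same assumption)


def bits_per_param(double_quant=True):
    """4-bit payload plus the per-parameter cost of storing quantization constants."""
    if double_quant:
        # 8-bit constants per block, plus 32-bit second-level constants
        overhead = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
    else:
        # one 32-bit constant per block
        overhead = 32 / BLOCK_SIZE
    return 4 + overhead


def weight_gib(n_params, bits):
    """Weight storage in GiB for a given number of bits per parameter."""
    return n_params * bits / 8 / 2**30


print(f"bf16 weights          : {weight_gib(N_PARAMS, 16):.2f} GiB")
print(f"NF4, no double quant  : {weight_gib(N_PARAMS, bits_per_param(False)):.2f} GiB")
print(f"NF4 with double quant : {weight_gib(N_PARAMS, bits_per_param(True)):.2f} GiB")
```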

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

SEQ_LEN = 4096  # safe starting point for RTX 3090 + 4-bit

tok = AutoTokenizer.from_pretrained(HF_MODEL_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # good on Ampere (3090)
)

model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
)
model.config.use_cache = False  # for gradient checkpointing

print("Model loaded.")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded.


## 4) LoRA config + SFTTrainer setup
We use TRL's SFTTrainer to:
- apply LoRA (PEFT)
- tokenize and truncate the `text` field and run the training loop (packing is disabled in this run)
- log loss / eval metrics during training

In [10]:
import transformers
import trl
trl.__version__

'0.25.1'

In [20]:
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType

# LoRA config: rank-16 adapters on all attention and MLP projection layers
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# All trainer hyperparameters now go into SFTConfig
sft_config = SFTConfig(
    output_dir=str(OUT_DIR),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size = 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2,                 # start with 1 for smoke test, then 2–3
    bf16=True,                          # or set to False/use fp16 if bf16 fails
    logging_steps=25,                   # print train loss every 25 steps

    # eval/save strategy names in recent transformers
    eval_strategy="steps",              # <— NOT evaluation_strategy
    eval_steps=250,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,

    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    report_to="none",                   # change to "wandb" if you use W&B

    # SFT-specific bits
    dataset_text_field="text",          # column name in train_ds / val_ds
    max_length=SEQ_LEN,                 # <— use max_length, not max_seq_length
    packing=False,                      # set to True to pack multiple examples per sequence
)

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,
    args=sft_config,        # pass the SFTConfig here
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tok,   # tokenizer
)

print("Trainer initialized.")

Adding EOS to train dataset:   0%|          | 0/6120 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/6120 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/6120 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/680 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/680 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/680 [00:00<?, ? examples/s]

Trainer initialized.
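
To get a feel for how small the trained adapter is, the number of LoRA parameters can be estimated from the layer shapes. Each adapted linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The shapes below are assumptions taken from the publicly documented Llama 3.1 8B architecture (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); they are not read from the loaded model.

```python
# Assumed Llama 3.1 8B dimensions (not read from the model object).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads x head dimension 128
N_LAYERS = 32

RANK = 16          # r in the LoRA config
LORA_ALPHA = 16    # lora_alpha in the LoRA config

# (in_features, out_features) for each targeted module
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in the A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable LoRA parameters, scaling = {scaling}")
```

If the assumed shapes are right, the adapter is well under 1% of the base model's parameters; calling `trainer.model.print_trainable_parameters()` on the PEFT-wrapped model would be the way to confirm the actual count.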


## 5) Train and monitor progress
Watch `loss` in the logs. If you enabled W&B, you'll get a live dashboard.
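
The number of optimizer steps to expect follows from the dataset size and batch settings. A small sketch, using the 6120 training examples reported by the tokenization log and the batch settings from `sft_config`; `optimizer_steps` is an illustrative helper, and rounding the last partial batch up is an assumption that does not matter here because 6120 divides evenly by 8.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    n_examples=6120, per_device_batch=1, grad_accum=8, epochs=2
)
print("Effective batch size:", effective_batch)
print("Total optimizer steps:", total_steps)
```

With `eval_steps=250`, this step count gives six evaluation points (steps 250 through 1500).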

In [21]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 128001}.


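| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 250  | 1.4798 | 1.480633 | 1.491614 | 933542.0  | 0.638644 |
| 500  | 1.4648 | 1.463040 | 1.471080 | 1864268.0 | 0.642976 |
| 750  | 1.4671 | 1.447510 | 1.449367 | 2799763.0 | 0.647009 |
| 1000 | 1.3393 | 1.449603 | 1.378876 | 3734780.0 | 0.648484 |
| 1250 | 1.2980 | 1.445136 | 1.349375 | 4662373.0 | 0.649818 |
| 1500 | 1.3436 | 1.442196 | 1.353186 | 5597056.0 | 0.650690 |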




TrainOutput(global_step=1530, training_loss=1.4128860367669, metrics={'train_runtime': 8933.9005, 'train_samples_per_second': 1.37, 'train_steps_per_second': 0.171, 'total_flos': 2.5855568781312e+17, 'train_loss': 1.4128860367669, 'entropy': 1.324153670668602, 'num_tokens': 5710000.0, 'mean_token_accuracy': 0.6751571699976922, 'epoch': 2.0})
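
Validation loss is easier to interpret as perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language modeling), perplexity is simply `exp(loss)`. The values below are copied from the training log above.

```python
import math

# (step, validation loss) pairs copied from the training log
eval_log = [
    (250, 1.480633),
    (500, 1.463040),
    (750, 1.447510),
    (1000, 1.449603),
    (1250, 1.445136),
    (1500, 1.442196),
]

perplexities = [(step, math.exp(loss)) for step, loss in eval_log]
for step, ppl in perplexities:
    print(f"step {step:>4}: validation perplexity = {ppl:.3f}")
```

The drop is modest, which is consistent with the small change in validation loss over the run; most of the improvement happens within the first 750 steps.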

## 6) Save the LoRA adapter + tokenizer

In [22]:
trainer.model.save_pretrained(OUT_DIR)
tok.save_pretrained(OUT_DIR)
print("Adapter + tokenizer saved to:", OUT_DIR)

Adapter + tokenizer saved to: outputs/llama31_8b_qlora_abstracts


## 7) Reload base vs fine-tuned and compare on a held-out subset
We will:
- Take ~200 examples from `val_df`
- Prompt both base and fine-tuned models to summarize the abstract
- Compute ROUGE-L and BERTScore-F1
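
For intuition about what ROUGE-L measures, here is a minimal pure-Python sketch based on the longest common subsequence (LCS) of tokens. It uses lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation exactly; `lcs_length` and `rouge_l_f1` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the score rewards in-order word overlap, a prediction that is empty or rambles far beyond the reference scores close to zero, whatever its content.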

In [23]:
from transformers import pipeline
from peft import PeftModel
import evaluate
import bert_score
from tqdm.auto import tqdm

# Reload base model (same 4-bit config)
base_model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in favor of `dtype`
)
base_model.config.use_cache = True

# Reload fine-tuned: base + LoRA adapter
ft_model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
)
ft_model = PeftModel.from_pretrained(ft_model, OUT_DIR)
ft_model.config.use_cache = True

base_pipe = pipeline("text-generation", model=base_model, tokenizer=tok, device_map="auto")
ft_pipe   = pipeline("text-generation", model=ft_model,  tokenizer=tok, device_map="auto")

# Small validation subset for speed
VAL_N = min(200, len(val_df))
eval_slice = val_df.sample(VAL_N, random_state=SEED).reset_index(drop=True)
print("Eval examples:", len(eval_slice))

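# Note: this template ("Title:" / "Abstract:" / "Summary:") is not the
# "### Instruction / ### Title / ### Abstract / ### Output" template used for training.
def make_eval_prompt(abstract_text: str, title: str | None = None,
                     min_w: int = 100, max_w: int = 200) -> str: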
    title_str = f"Title: {title}\n\n" if title else ""
    return (
        f"Summarize the following abstract into a clear, faithful {min_w}–{max_w} word summary for a general audience.\n\n"
        f"{title_str}"
        f"Abstract:\n{abstract_text}\n\nSummary:"
    )


def generate_summary(pipe, prompt: str, max_new: int = 300) -> str:
    # Greedy decoding; `temperature` is omitted because it has no effect when do_sample=False.
    out = pipe(
        prompt,
        max_new_tokens=max_new,
        do_sample=False,
        eos_token_id=tok.eos_token_id,
    )[0]["generated_text"]
    # The pipeline returns prompt + continuation by default; keep only the text after "Summary:".
    if "Summary:" in out:
        out = out.split("Summary:", 1)[-1]
    return out.strip()


refs = []
base_preds = []
ft_preds = []

for _, row in tqdm(eval_slice.iterrows(), total=len(eval_slice), desc="Evaluating"):
    prompt = make_eval_prompt(row["input_text"], row["title"])
    refs.append(row["summary"])
    base_preds.append(generate_summary(base_pipe, prompt))
    ft_preds.append(generate_summary(ft_pipe, prompt))

# ROUGE-L
rouge = evaluate.load("rouge")
rg_base = rouge.compute(predictions=base_preds, references=refs, rouge_types=["rougeL"])
rg_ft   = rouge.compute(predictions=ft_preds,   references=refs, rouge_types=["rougeL"])

# BERTScore (F1)
P_b, R_b, F_b = bert_score.score(base_preds, refs, lang="en", rescale_with_baseline=True)
P_f, R_f, F_f = bert_score.score(ft_preds,   refs, lang="en", rescale_with_baseline=True)

print("\n=== ROUGE-L ===")
print("Base      :", rg_base["rougeL"])
print("Fine-tuned:", rg_ft["rougeL"])

print("\n=== BERTScore-F1 (mean) ===")
print("Base      :", float(F_b.mean()))
print("Fine-tuned:", float(F_f.mean()))


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0
Device set to use cuda:0


Eval examples: 200


Evaluating:   0%|          | 0/200 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== ROUGE-L ===
Base      : 0.04510833292555738
Fine-tuned: 0.3925438207974889

=== BERTScore-F1 (mean) ===
Base      : -3.84653377532959
Fine-tuned: 0.5303022861480713


## 8) Helper: load fine-tuned pipeline later and run a test query

In [24]:
def load_finetuned_pipeline(adapter_dir: str | Path = OUT_DIR):
    base = AutoModelForCausalLM.from_pretrained(
        HF_MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        dtype=torch.bfloat16,
    )
    ft = PeftModel.from_pretrained(base, adapter_dir)
    ft.config.use_cache = True
    return pipeline("text-generation", model=ft, tokenizer=tok, device_map="auto")


ft_pipe = load_finetuned_pipeline()

test_abstract = """Transformers use self-attention to capture long-range dependencies in sequences.
However, the quadratic complexity of standard attention limits practicality for very long inputs.
We propose a sparse attention mechanism that preserves performance while reducing computational cost,
and demonstrate improvements on language modeling and long-document summarization benchmarks."""

test_title = "Efficient Sparse Attention for Long-Context Transformers"

test_prompt = make_eval_prompt(test_abstract, test_title)
print("\n--- Fine-tuned model summary ---\n")
print(ft_pipe(test_prompt, max_new_tokens=280, do_sample=False)[0]["generated_text"])

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0



--- Fine-tuned model summary ---

Summarize the following abstract into a clear, faithful 100–200 word summary for a general audience.

Title: Efficient Sparse Attention for Long-Context Transformers

Abstract:
Transformers use self-attention to capture long-range dependencies in sequences.
However, the quadratic complexity of standard attention limits practicality for very long inputs.
We propose a sparse attention mechanism that preserves performance while reducing computational cost,
and demonstrate improvements on language modeling and long-document summarization benchmarks.

Summary: Researchers have developed a new method called sparse attention to improve how computers understand and process long texts, like articles or books. Traditional methods, known as self-attention, are effective but require a lot of computing power, making them impractical for very long documents. The new sparse attention method keeps the quality of the text analysis high while using less computing resou

## 9) (Optional) Merge LoRA into full weights (for export)
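This step is optional and needs more VRAM, because the base model is loaded in bf16 rather than 4-bit before merging. In most cases you can keep the LoRA adapter separate and load it on top of the base model, as in step 8.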
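
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy sketch below uses plain Python lists and made-up numbers to show that the merged matrix gives the same output as running the base weight and the adapter side by side. `matmul` and `merge_lora` are illustrative helpers, not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: out=2, in=3, rank=1 (values are arbitrary)
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # r x in
lora_b = [[0.5], [-1.0]]     # out x r
x = [[1.0], [1.0], [1.0]]    # in x 1 input column
alpha, rank = 2, 1

# Path 1: merged weights
merged = merge_lora(w, lora_a, lora_b, alpha, rank)
y_merged = matmul(merged, x)

# Path 2: base output plus scaled adapter output
base_out = matmul(w, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
y_adapter = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged, y_adapter)
```

After merging, inference no longer needs the PEFT wrapper, at the cost of storing a full copy of the model weights.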

In [None]:
# from peft import AutoPeftModelForCausalLM
# merged = AutoPeftModelForCausalLM.from_pretrained(
#     OUT_DIR,
#     device_map="auto",
#     dtype=torch.bfloat16,
# ).merge_and_unload()
# merged_dir = OUT_DIR / "merged_bf16"  # weights are saved in bfloat16
# merged.save_pretrained(merged_dir)
# tok.save_pretrained(merged_dir)
# print("Merged full model saved to:", merged_dir)