# Dialogue Summarization with FLAN-T5 Small  
**Research Portfolio, Rifky Setiawan, Universitas Gadjah Mada (UGM)**

This notebook documents an end-to-end prototype for automatic dialogue summarization. The pipeline applies consistent instruction prompting and deterministic tokenization, then fine-tunes FLAN-T5 small under a RAM-safe configuration. Evaluation emphasizes ROUGE-1, ROUGE-2, and ROUGE-L, paired with qualitative inspection to surface omissions and attribution errors that reference overlap may not reveal.

The objective is to demonstrate practical ability to design and implement a summarization pipeline that is clear, reproducible, and compute-friendly. The artifact is intended for academic study and research training, not for automated policy or legal decisions. This notebook is submitted as part of a research internship application.


## Abstract

This notebook investigates dialogue summarization on a constrained compute budget using FLAN-T5 small. The goal is a dependable baseline that remains reproducible while handling conversational turns with varied length and structure. The approach uses a fixed instruction prompt, bounded sequence lengths, and gradient checkpointing to keep training within the memory of modest GPUs. Evaluation relies on ROUGE with a RAM-safe batching scheme and includes qualitative checks to verify coverage of actions, outcomes, and commitments.

The workflow is organized for clarity from raw dataset to evaluated summaries. It covers environment setup, DialogSum loading, instruction-style input construction, tokenization with fixed limits, a short fine-tuning schedule with conservative hyperparameters, ROUGE computation on capped subsets, and qualitative inference on curated dialogues. The design favors simple components and consistent preprocessing so that results can be recreated on Google Colab or a comparable environment.

This work is intended for research and teaching. It highlights patterns for qualitative review and iterative improvement. It is not a decision system and a human remains in the loop.


## 1. Problem Statement and Motivation

Dialogue data contains interruptions, speaker switches, and pragmatic cues that can challenge sequence-to-sequence models. Two recurring difficulties are compression accuracy under limited context windows and faithful attribution of actions to the correct speaker. A baseline is needed that is easy to audit, stable to rerun, and informative for subsequent improvements.

This notebook investigates a compact encoder-decoder formulation:
- a fixed instruction prompt anchors the task across samples
- deterministic tokenization with bounded lengths keeps memory predictable
- evaluation emphasizes ROUGE with qualitative inspection to surface systematic errors

This approach:
- stabilizes behavior through consistent prompting and preprocessing
- maintains a modest compute footprint suitable for end-to-end reproduction
- supports careful assessment through paired quantitative and qualitative views

The intended use is research and teaching. Summaries assist analysis and model iteration. Outputs are not policy or legal determinations and a human remains in the loop.


## 2. Pipeline Overview

This pipeline is organized for clarity and reproducibility from dataset loading to evaluated predictions. It reflects the DialogSum corpus and the FLAN-T5 small backbone used in this notebook.

1. **Setup and Imports**: Transformers, Datasets, Evaluate, Accelerate, and PyTorch with version logging and CUDA check.
2. **Dataset Loading and Inspection**: load DialogSum splits, inspect schema, preview dialogue and reference.
3. **Preprocessing and Tokenization**: add a fixed instruction prompt, bound source and target lengths, mask label padding with −100, and store sequence length for grouping.
4. **Model Architecture (FLAN-T5 small)**: encoder-decoder initialized from the Hub with gradient checkpointing and cache disabled for training.
5. **Training**: single-epoch starter schedule with small effective batch size, conservative learning rate, and RAM-aware collator.
6. **Evaluation**: ROUGE-1, ROUGE-2, and ROUGE-L on capped subsets with greedy decoding, plus qualitative summaries on curated dialogues.
7. **Responsible Use and Limitations**: scope, dataset coverage, decoding trade-offs, and the requirement for a human in the loop.


## 3. Environment Setup and Dependencies

This cell prepares a compatible runtime for sequence-to-sequence training with Hugging Face libraries. It installs recent versions of Transformers, Datasets, Evaluate, Accelerate, and SentencePiece, along with ROUGE dependencies. After installation it prints package versions and CUDA availability to document the computing context, which helps when interpreting results later. Two environment flags are set with the aim of reducing non-critical RAM use by disabling tokenizer parallelism and dataset multiprocessing.

Checklist:
- Confirm printed versions match the intended experiment plan
- Verify `torch.cuda.is_available()` for GPU runs on Google Colab or a comparable environment


In [None]:
# Keep versions modern and compatible with Trainer flags.
!pip -q install -U "transformers>=4.45.0" "datasets>=2.20.0" "evaluate>=0.4.2" "accelerate>=0.34.0" "sentencepiece>=0.1.99" rouge_score packaging

import os, torch, transformers, datasets, evaluate
print("transformers:", transformers.__version__)
print("datasets    :", datasets.__version__)
print("evaluate    :", evaluate.__version__)
print("CUDA available:", torch.cuda.is_available())
# Safer defaults for RAM
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_DATASETS_DISABLE_PARALLEL"] = "1"   # avoid multiprocessing

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
transformers: 4.57.1
datasets    : 4.3.0
evaluate    
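
Because pip reports a `pyarrow` conflict above, it can help to record the installed versions of every relevant package in one place. The following is a minimal sketch that uses only the standard library; the helper name `report_versions` and the package list are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: version string}, marking packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Illustrative package list (assumption); adjust it to the experiment plan.
for name, version in report_versions(
    ["transformers", "datasets", "evaluate", "accelerate", "pyarrow"]
).items():
    print(f"{name:<12}: {version}")
```

Storing this dictionary next to the saved metrics is one way to make later reruns easier to compare.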

## 4. Setup and Imports

This section loads core libraries, defines a global random seed for reproducibility, and selects a compact model that is suitable for end-to-end experimentation. The configuration favors clarity and stability so that future readers can rerun the notebook without hidden state. The chosen backbone is `google/flan-t5-small`, which balances quality, speed, and memory use for dialogue summarization.

Scope covered:
- Standard warnings control and garbage collection utilities
- NumPy, pandas, PyTorch, and Hugging Face Datasets and Evaluate
- Transformers components for tokenization, modeling, data collation, and training
- A single `RANDOM_STATE` applied across Python, NumPy, and PyTorch
- Device selection to run on GPU when available


In [None]:
# Imports & Global Config

import gc
import inspect
import random
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import torch
import transformers
import evaluate

from datasets import load_dataset, DatasetDict
from tqdm.auto import tqdm
from transformers import (
    T5TokenizerFast,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

# Reproducibility
RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

# Model choice — start small; upgrade later when stable
MODEL_NAME = "google/flan-t5-small"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

## 5. Model and Tokenizer Initialization

Load the tokenizer and model from the Hub and place the model on the selected device. Enable gradient checkpointing to trade computation for memory, which is effective on small GPUs. Disable the autoregressive cache during training so the forward graph can be recomputed cleanly in backward passes. These settings keep memory growth controlled and make the training step more stable under the Trainer defaults.

Operational notes:
- Use the matching `T5TokenizerFast` so that the vocabulary and special tokens agree with the checkpoint
- Keep initialization early in the notebook so later cells assume a ready model


In [None]:
# Load Model & Tokenizer

tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
# Enable gradient checkpointing to reduce memory; disable cache for training
if hasattr(model, "gradient_checkpointing_enable"):
    model.gradient_checkpointing_enable()
if hasattr(model.config, "use_cache"):
    model.config.use_cache = False
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

## 6. Dataset: DialogSum

Load the DialogSum dataset and expose the standard `train`, `validation`, and `test` splits. This cell prints split sizes and column names for quick inspection and shows a truncated example dialogue with its reference summary. Keeping a short preview in the notebook helps verify that downstream preprocessing aligns with the dataset schema.


In [None]:
# Dataset: DialogSum

ds_all = load_dataset("knkarthick/dialogsum")
ds = {
    "train": ds_all["train"],
    "validation": ds_all["validation"],
    "test": ds_all["test"],
}
for split, d in ds.items():
    print(split, len(d), d.column_names)
print("Sample:", ds["train"][0]["dialogue"][:120].replace("\n"," "), "| summary:", ds["train"][0]["summary"][:120])

train 12460 ['id', 'dialogue', 'summary', 'topic']
validation 500 ['id', 'dialogue', 'summary', 'topic']
test 1500 ['id', 'dialogue', 'summary', 'topic']
Sample: #Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would be a good idea to get  | summary: Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information 
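
The preview shows that DialogSum marks turns with tags such as `#Person1#:`. When checking speaker attribution later, it can be convenient to split a dialogue into (speaker, utterance) pairs. The sketch below assumes that this tag format holds for every turn; `split_turns` is a hypothetical helper and the two-turn sample is invented for illustration.

```python
import re

# Assumes speakers are tagged as "#PersonN#:" as in the preview above.
TURN_PATTERN = re.compile(r"(#Person\d+#):\s*")


def split_turns(dialogue):
    """Split a DialogSum-style dialogue into (speaker, utterance) pairs."""
    parts = TURN_PATTERN.split(dialogue)
    # parts = [text before first tag, tag1, utterance1, tag2, utterance2, ...]
    turns = []
    for speaker, utterance in zip(parts[1::2], parts[2::2]):
        turns.append((speaker, " ".join(utterance.split())))
    return turns


# Invented toy dialogue, not taken from the dataset.
sample = "#Person1#: Can you send the report?\n#Person2#: Sure, by Friday."
for speaker, utterance in split_turns(sample):
    print(speaker, "->", utterance)
```

Pairs like these make it easier to check whether a generated summary assigns an action to the speaker who actually committed to it.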


## 7. Preprocessing and Tokenization

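Build instruction-style inputs by prefixing each dialogue with a concise task string. Tokenize sources and targets with fixed maximum lengths so that memory use remains predictable. Because padding is deferred to the collator, the tokenized labels contain no pad tokens at this stage; the replacement of pad ids with `-100` in `tok_fn` is a safeguard, and `DataCollatorForSeq2Seq` later pads labels with `-100` so that those positions are masked in the loss. Add a lightweight `length` column that records the source length; it is intended for length-grouped batching (`group_by_length`), which the training arguments in this notebook do not enable, and the collator drops it before the forward pass.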

Key choices and rationale:
- `MAX_SOURCE_LEN` and `MAX_TARGET_LEN` define a bounded memory budget
- The prompt `summarize the dialogue:` encourages consistent behavior across samples
- Padding deferred to the collator keeps the stored dataset compact and fast to map


In [None]:
# Preprocessing & Tokenization

MAX_SOURCE_LEN = 256
MAX_TARGET_LEN = 48

def build_io(example):
    src = f"summarize the dialogue:\n{example['dialogue']}"
    tgt = example["summary"]
    return {"src": src, "tgt": tgt}

ds_io = DatasetDict({
    "train": ds["train"].map(build_io, remove_columns=ds["train"].column_names),
    "validation": ds["validation"].map(build_io, remove_columns=ds["validation"].column_names),
    "test": ds["test"].map(build_io, remove_columns=ds["test"].column_names),
})

def tok_fn(batch):
    model_inputs = tokenizer(
        batch["src"],
        max_length=MAX_SOURCE_LEN,
        truncation=True,
        padding=False,
    )
    labels = tokenizer(
        text_target=batch["tgt"],
        max_length=MAX_TARGET_LEN,
        truncation=True,
        padding=False,
    )
    labels_ids = []
    for seq in labels["input_ids"]:
        seq = [tid if tid != tokenizer.pad_token_id else -100 for tid in seq]
        labels_ids.append(seq)
    model_inputs["labels"] = labels_ids
    return model_inputs


train_tok = ds_io["train"].map(tok_fn, batched=True, remove_columns=["src","tgt"])
val_tok   = ds_io["validation"].map(tok_fn, batched=True, remove_columns=["src","tgt"])
test_tok  = ds_io["test"].map(tok_fn, batched=True, remove_columns=["src","tgt"])

# Add length column for group_by_length
train_tok = train_tok.map(lambda b: {"length": [len(x) for x in b["input_ids"]]}, batched=True)
val_tok   = val_tok.map(lambda b: {"length": [len(x) for x in b["input_ids"]]}, batched=True)
test_tok  = test_tok.map(lambda b: {"length": [len(x) for x in b["input_ids"]]}, batched=True)

len(train_tok), len(val_tok), len(test_tok)

(12460, 500, 1500)
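
To make the role of `-100` concrete, the sketch below averages per-token losses while skipping positions labelled `-100`, which mirrors the default ignore index of PyTorch's cross-entropy loss. It is plain Python for illustration only; `masked_mean_loss` and the toy numbers are assumptions, not values from this run.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def masked_mean_loss(token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses over positions whose label is not ignore_index."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != ignore_index]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Toy example: a 3-token target padded to length 5 with -100.
labels = [71, 12, 1, -100, -100]
token_losses = [2.0, 1.0, 0.75, 9.0, 9.0]  # values at padded positions are arbitrary

print("masked mean:", masked_mean_loss(token_losses, labels))
print("naive mean :", sum(token_losses) / len(token_losses))
```

The naive mean is dominated by the padded positions, which is the distortion that label masking avoids.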

## 8. Training Subset and Effective Batch Size

Define the effective batch size as the per-device batch size multiplied by the number of gradient accumulation steps (here 2 × 4 = 8). Start with a fraction of the training data to validate the pipeline end-to-end before scaling to the full split. Report the approximate steps per epoch so learning rate schedules and early stopping choices remain interpretable.

Guidance for iteration:
- Increase `FRAC` after verifying stable loss curves and sane ROUGE
- Keep effective batch size constant when comparing runs to avoid confounds


In [None]:
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps

EFFECTIVE_BS = 8
FRAC = 0.20   # start small; raise to 0.5 or 1.0 once stable
N = int(len(train_tok) * FRAC)
train_small = train_tok.shuffle(seed=RANDOM_STATE).select(range(N))
steps_per_epoch = (N + EFFECTIVE_BS - 1) // EFFECTIVE_BS
print(f"Using {N} training examples (~{FRAC*100:.1f}%). Steps/epoch ≈ {steps_per_epoch}.")

Using 2492 training examples (~20.0%). Steps/epoch ≈ 312.


## 9. Training Setup and Trainer Construction

Create a collator that pads to a multiple of eight and removes the auxiliary `length` field before the forward pass. Configure `TrainingArguments` with conservative defaults that run on CPU or GPU, and avoid generation during evaluation to keep the memory footprint modest. Build the Trainer with the prepared datasets, tokenizer, and collator. ROUGE will be computed in a separate cell to maintain a RAM-safe path.

Practical settings:
- Single epoch for the first pass with a small learning rate and weight decay
- `remove_unused_columns=False` to preserve fields expected by the collator
- `report_to=[]` to keep logs minimal in a portfolio context


In [None]:
# Training Setup

# Base collator (pad to multiples of 8; safe on CPU too)
_base_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, pad_to_multiple_of=8)

# Drop 'length' before forward so Trainer doesn't pass it to the model
def collator_with_pop(features):
    for f in features:
        f.pop("length", None)
    return _base_collator(features)

# NOTE: `predict_with_generate` is a Seq2SeqTrainingArguments option; plain TrainingArguments
# keeps evaluation loss-only. Keep config minimal to match the reference and avoid version gotchas.
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/TextSummarizationCheckpoints",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch = 8
    num_train_epochs=1,              # start with 1; increase later if stable
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=200,
    logging_steps=100,
    save_total_limit=1,
    dataloader_num_workers=0,        # avoid multiprocessing RAM overhead
    dataloader_pin_memory=False,
    fp16=torch.cuda.is_available(),
    use_cpu=not torch.cuda.is_available(),  # `no_cuda` is deprecated in recent Transformers
    report_to=[],
    remove_unused_columns=False      # match the reference's permissive setting
)

# Choose the (smaller) training split if you created one earlier
train_ds = train_small if "train_small" in globals() else train_tok
eval_ds  = val_tok if "val_tok" in globals() else (test_tok if "test_tok" in globals() else None)

# Build Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,          # safe: loss-only eval (no generation)
    data_collator=collator_with_pop,
    tokenizer=tokenizer,
    compute_metrics=None           # compute ROUGE separately (RAM-safe)
)

print("Trainer ready. Train size:", len(train_ds), "| Eval size:", len(eval_ds) if eval_ds is not None else 0)

Trainer ready. Train size: 2492 | Eval size: 500
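
For intuition about what the collator does to a batch, here is a plain-Python sketch of dynamic padding to a multiple of eight, with `-100` used for label padding. It is a simplified stand-in for `DataCollatorForSeq2Seq`, not its implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption that matches T5's usual pad token.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_batch(features, pad_id=0, label_pad_id=-100, multiple=8):
    """Right-pad input_ids, attention_mask, and labels to shared lengths."""
    src_len = round_up(max(len(f["input_ids"]) for f in features), multiple)
    tgt_len = round_up(max(len(f["labels"]) for f in features), multiple)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_src, n_tgt = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * (src_len - n_src))
        batch["attention_mask"].append([1] * n_src + [0] * (src_len - n_src))
        batch["labels"].append(f["labels"] + [label_pad_id] * (tgt_len - n_tgt))
    return batch


# Toy token ids (invented); real ids come from the tokenizer.
features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 6, 7, 8, 9, 10, 11, 12, 1], "labels": [9, 9, 1]},
]
batch = pad_batch(features)
print([len(row) for row in batch["input_ids"]])
print(batch["labels"][0])
```

Padding per batch rather than to `MAX_SOURCE_LEN` means short dialogues do not pay for the longest possible sequence.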


## 10. Train and Save

Run supervised training and persist the fine-tuned weights along with training metrics. Record the number of training samples and save the Trainer state so the run can be resumed or inspected later. Optionally perform a loss-only evaluation to monitor overfitting without invoking text generation during training. Keeping these artifacts in a stable directory enables later inference and lightweight ROUGE scoring without retraining.


In [None]:
# Train & Save

train_result = trainer.train()

# Save model & basic metrics
trainer.save_model(training_args.output_dir)
metrics = dict(train_result.metrics)
metrics["train_samples"] = len(train_ds)
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

# Loss-only evaluation (no generation; very light on RAM)
if eval_ds is not None:
    eval_results = trainer.evaluate()
    eval_results["eval_samples"] = len(eval_ds)
    print("Eval (loss-only):", eval_results)
    trainer.log_metrics("eval", eval_results)
    trainer.save_metrics("eval", eval_results)

| Step | Training Loss |
| --- | --- |
| 100 | 2.0479 |
| 200 | 1.6554 |
| 300 | 1.5993 |


***** train metrics *****
  epoch                    =        1.0
  total_flos               =   193974GF
  train_loss               =     1.7609
  train_runtime            = 1:16:11.32
  train_samples            =       2492
  train_samples_per_second =      0.545
  train_steps_per_second   =      0.068


Eval (loss-only): {'eval_loss': 1.385675072669983, 'eval_runtime': 194.5651, 'eval_samples_per_second': 2.57, 'eval_steps_per_second': 1.285, 'epoch': 1.0, 'eval_samples': 500}
***** eval metrics *****
  epoch                   =        1.0
  eval_loss               =     1.3857
  eval_runtime            = 0:03:14.56
  eval_samples            =        500
  eval_samples_per_second =       2.57
  eval_steps_per_second   =      1.285
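
With `warmup_steps=200` and roughly 312 optimizer steps in the epoch, most of this run happens during warmup. The sketch below evaluates a linear warmup followed by linear decay, which is the default scheduler type for `TrainingArguments`. The function `linear_lr` is a hypothetical re-implementation for illustration, and 312 is the approximate step count printed earlier rather than an exact figure.

```python
def linear_lr(step, peak_lr=5e-5, warmup_steps=200, total_steps=312):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Steps at which the training loss was logged in this run.
for step in (100, 200, 300):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the peak rate is reached only at step 200, so a shorter warmup may be worth testing when training on a small fraction of the data.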


## 11. ROUGE Evaluation with Lightweight Generation

Compute ROUGE on small batches with a fixed cap on the number of validation and test examples to remain memory efficient. Generation uses greedy decoding by default to limit compute and RAM use. The helper converts various return formats into plain floats and rounds scores for concise reporting. This chunked path provides a stable readout of ROUGE-1, ROUGE-2, and ROUGE-L without exceeding typical academic hardware limits.

Practical tips:
- Lower `K` if memory pressure appears on long dialogues
- Keep `num_beams=1` unless you explicitly study decoding strategies


In [None]:
# RAM‑Safe Evaluation

rouge = evaluate.load("rouge")

def decode_batch(input_ids, attn_mask, max_new_tokens=48, num_beams=1):
    # Keep generation lightweight (no beams >1 unless needed)
    gen_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attn_mask,
        num_beams=num_beams,
        max_new_tokens=max_new_tokens,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        # early_stopping omitted: it only affects beam search (num_beams > 1)
    )
    return tokenizer.batch_decode(gen_ids, skip_special_tokens=True)

def _to_float(x):
    """Robustly extract a float from various ROUGE return formats."""
    # Newer `evaluate` often returns plain floats.
    if isinstance(x, (float, int, np.floating)):
        return float(x)
    # Older format: x.mid.fmeasure
    try:
        return float(x.mid.fmeasure)
    except Exception:
        pass
    # Dict-like fallback: {"fmeasure": ...}
    try:
        return float(x["fmeasure"])
    except Exception:
        return float(x)

def chunked_eval(dataset, K=128, batch_size=4):
    """
    Evaluate on at most K examples, batching to keep memory low.
    - Set K lower (e.g., 64) if RAM is tight.
    - num_beams=1 inside decode to reduce memory.
    """
    model.eval()
    K = min(K, len(dataset))
    subset = dataset.select(range(K))
    preds, refs = [], []

    for i in tqdm(range(0, K, batch_size), total=(K + batch_size - 1)//batch_size):
        batch = subset[i:i+batch_size]
        inputs = tokenizer.pad(
            {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]},
            padding=True, return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.inference_mode():
            out = decode_batch(inputs["input_ids"], inputs["attention_mask"], max_new_tokens=48, num_beams=1)
        preds.extend(p.strip() for p in out)

        # Convert -100 back to pad for decoding references
        labels = []
        for seq in batch["labels"]:
            labels.append([tid if tid != -100 else tokenizer.pad_token_id for tid in seq])
        refs.extend(tokenizer.batch_decode(labels, skip_special_tokens=True))

        # Cleanup
        del inputs
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Force aggregator to return floats
    scores = rouge.compute(predictions=preds, references=refs, use_stemmer=True, use_aggregator=True)
    return {
        "rouge1": round(_to_float(scores["rouge1"]), 4),
        "rouge2": round(_to_float(scores["rouge2"]), 4),
        "rougeL": round(_to_float(scores["rougeL"]), 4),
        "n_eval": K,
    }

# Run eval (lower K if memory is still tight)
val_scores  = chunked_eval(val_tok,  K=128, batch_size=4)
test_scores = chunked_eval(test_tok, K=128, batch_size=4)
val_scores, test_scores

({'rouge1': 0.3564, 'rouge2': 0.1337, 'rougeL': 0.2991, 'n_eval': 128},
 {'rouge1': 0.333, 'rouge2': 0.1111, 'rougeL': 0.2716, 'n_eval': 128})
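
To show what these overlap scores measure, the sketch below computes simplified ROUGE-N and ROUGE-L F1 values for a single pair of strings. It uses lowercased whitespace tokens without stemming, so it is not the `rouge_score` implementation and its numbers will not match the scores above exactly; the function names and toy sentences are illustrative assumptions.

```python
from collections import Counter


def _tokens(text):
    return text.lower().split()


def rouge_n_f1(prediction, reference, n=1):
    """F1 over overlapping n-grams, with counts clipped to the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(_tokens(prediction)), ngrams(_tokens(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(prediction, reference):
    """F1 based on the longest common subsequence (LCS) of tokens."""
    a, b = _tokens(prediction), _tokens(reference)
    if not a or not b:
        return 0.0
    # Dynamic-programming LCS table.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


# Invented toy pair, not taken from the dataset.
prediction = "the agent will process the exchange"
reference = "the agent processes the exchange"
print("ROUGE-1:", round(rouge_n_f1(prediction, reference, 1), 4))
print("ROUGE-2:", round(rouge_n_f1(prediction, reference, 2), 4))
print("ROUGE-L:", round(rouge_l_f1(prediction, reference), 4))
```

A faithful paraphrase with no shared words scores zero under all three functions, which is one reason the scores are paired with qualitative review.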

## 12. Inference Helper

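Provide a clean function for single-example inference. The dialogue is compacted to remove extra whitespace, prefixed with the same instruction used in training, tokenized at the same maximum source length, and summarized by the fine-tuned model. Keeping the prompt and length limit identical between training and inference limits distribution shift. Two differences remain and are worth keeping in mind when reading the outputs: `clean_dialogue` collapses the newlines that the training inputs kept, and the helper defaults to beam search (`num_beams=4`, up to 64 new tokens), whereas the ROUGE evaluation used greedy decoding with 48 new tokens.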


In [None]:
# Inference Helper

def clean_dialogue(text: str) -> str:
    return " ".join(str(text).split())

def summarize_dialogue(text, max_new_tokens=64, num_beams=4):
    cleaned = clean_dialogue(text)
    prompt = f"summarize the dialogue:\n{cleaned}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SOURCE_LEN).to(device)
    gen_ids = model.generate(
        **inputs,
        num_beams=num_beams,
        max_new_tokens=max_new_tokens,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    return tokenizer.decode(gen_ids[0], skip_special_tokens=True).strip()


## 13. Sample Selection

Select a validation dialogue for a quick sanity check. Keeping a direct reference to a single example simplifies qualitative inspection and helps confirm that the end-to-end path from raw text to generated summary is functioning as intended.


In [None]:
# Sample Inference

sample_text = ds["validation"][0]["dialogue"]
print("INPUT (truncated):", clean_dialogue(sample_text)[:300], "...")
print("\nSUMMARY:", summarize_dialogue(sample_text))

INPUT (truncated): #Person1#: Hello, how are you doing today? #Person2#: I ' Ve been having trouble breathing lately. #Person1#: Have you had any type of cold lately? #Person2#: No, I haven ' t had a cold. I just have a heavy feeling in my chest when I try to breathe. #Person1#: Do you have any allergies that you know ...

SUMMARY: #Person2# has a heavy feeling in his chest when he try to breathe. He will send him to a pulmonary specialist who can run tests on him for asthma.
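
The single-line INPUT above comes from the whitespace collapsing in `clean_dialogue`. The toy example below, with an invented two-turn dialogue, shows that newlines and tabs between turns are replaced by single spaces, so the speaker tags become the only turn boundary the model sees.

```python
def clean_dialogue(text: str) -> str:
    # Same one-liner as the notebook helper: split on any whitespace, rejoin with single spaces.
    return " ".join(str(text).split())


# Invented toy dialogue with mixed whitespace.
raw = "#Person1#: Can you send the report?\n\n#Person2#:   Sure,\tby Friday.\n"
cleaned = clean_dialogue(raw)
print(repr(cleaned))
```

If turn boundaries turn out to matter for attribution, skipping this step at inference so that inputs match the training format is a cheap variant to test.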


## 14. Qualitative Summaries on Custom Dialogues

Run qualitative inference on a few custom multi-turn dialogues and print compact summaries. Inputs are normalized with the same cleaning function and passed through the summarization helper with its default decoding setup (beam search with four beams). This qualitative view complements ROUGE by exposing strengths and failure modes that may not be captured by reference overlap alone.

Suggestions for review:
- Compare generated summaries against the main action items in each dialogue
- Inspect recurring omissions or hallucinations to guide future training changes


In [None]:
# Extra sample dialogues

custom_dialogues = {
    "return_item": """
Customer: I want to return these headphones. The left side stopped working after a week.
Agent: Do you have the receipt?
Customer: Yes, bought them last Tuesday.
Agent: You’re within our 30-day return window. Would you like a refund or exchange?
Customer: Exchange, please.
Agent: I’ll process the exchange and we’ll test the new pair before you leave.""",

    "tech_support_internet": """
User: My internet drops every evening around 8 p.m.
Support: Let’s check your modem logs. Are the downstream levels stable?
User: I see repeated T3 timeouts.
Support: I’ll schedule a line check and provide a replacement modem. Meanwhile, try bridging your router to isolate the issue.
User: Okay, please book the technician for tomorrow morning.""",

    "university_project": """
Lead: We need a draft by Sunday. Can each of you claim a section?
Sam: I’ll write the related work.
Nina: I’ll handle the experiment setup.
Leo: I can do results and figures.
Lead: Great. Submit PRs by Saturday evening so we can review Sunday morning."""
}

def preview_summary(name, text, max_chars=280):
    print(f"\n== {name} ==")
    print("INPUT (truncated):", clean_dialogue(text)[:max_chars], "...")
    print("SUMMARY:", summarize_dialogue(text))

# Run summaries for all custom samples (num_beams kept at function default)
for name, text in custom_dialogues.items():
    preview_summary(name, text)


== return_item ==
INPUT (truncated): Customer: I want to return these headphones. The left side stopped working after a week. Agent: Do you have the receipt? Customer: Yes, bought them last Tuesday. Agent: You’re within our 30-day return window. Would you like a refund or exchange? Customer: Exchange, please. Agent: ...
SUMMARY: Customer wants to return the headphones. Agent will process the exchange and test the new pair before leaving.

== tech_support_internet ==
INPUT (truncated): User: My internet drops every evening around 8 p.m. Support: Let’s check your modem logs. Are the downstream levels stable? User: I see repeated T3 timeouts. Support: I’ll schedule a line check and provide a replacement modem. Meanwhile, try bridging your router to isolate the is ...
SUMMARY: Support will schedule a line check and provide a replacement modem.

== university_project ==
INPUT (truncated): Lead: We need a draft by Sunday. Can each of you claim a section? Sam: I’ll write the related work. Ni
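
One lightweight way to compare a summary against the action items in a dialogue is to list those items by hand and check which ones the summary mentions. The sketch below does this with simple keyword matching for the `tech_support_internet` example, using the summary printed above. The checklist and the helper `missing_items` are illustrative assumptions, and keyword matching will miss paraphrases.

```python
def missing_items(summary, action_items):
    """Return names of action items none of whose keywords appear in the summary."""
    text = summary.lower()
    return [
        name
        for name, keywords in action_items.items()
        if not any(keyword.lower() in text for keyword in keywords)
    ]


# Hand-written checklist for the tech_support_internet dialogue (assumption).
action_items = {
    "line check": ["line check"],
    "replacement modem": ["replacement modem", "new modem"],
    "bridge router": ["bridg"],
    "technician tomorrow morning": ["technician", "tomorrow"],
}
summary = "Support will schedule a line check and provide a replacement modem."

print(missing_items(summary, action_items))
```

For this summary the check flags the router bridging advice and the technician booking as unmentioned, which is the kind of omission ROUGE alone would not localize.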

## 15. Results

Validation performance is reported with loss and ROUGE under a RAM-safe evaluation setup. After one epoch on 2,492 training examples, the mean training loss over the epoch is 1.7609 and the loss-only evaluation on 500 validation examples is 1.3857. ROUGE on capped subsets with greedy decoding yields ROUGE-1 = 0.3564, ROUGE-2 = 0.1337, and ROUGE-L = 0.2991 on 128 validation items, with test scores of ROUGE-1 = 0.3330, ROUGE-2 = 0.1111, and ROUGE-L = 0.2716 on 128 test items. Qualitative summaries preserve core actions and outcomes in short dialogues, while longer or denser turns show occasional omissions and role assignment drift. The training log shows the loss falling from 2.0479 at step 100 to 1.5993 at step 300.

Key readouts to cite:
- Mean train loss 1.7609 over epoch 1 and eval loss 1.3857 on 500 validation samples
- Validation ROUGE: R1 0.3564, R2 0.1337, RL 0.2991 on 128 items
- Test ROUGE: R1 0.3330, R2 0.1111, RL 0.2716 on 128 items
- Greedy decoding and capped K = 128 per split to keep evaluation memory predictable


## 16. Insights

Several practical lessons emerged during experimentation. A compact backbone such as FLAN-T5-small reaches a ROUGE-1 of about 0.36 on the capped validation subset after a single epoch on 20% of the training split, which gives a usable baseline under a modest compute budget when preprocessing is stable and prompts are consistent. Instruction prefixing is intended to improve reliability across heterogeneous dialogues by anchoring the model to a single task description, although this notebook does not ablate the prompt. Capping evaluation size and generating with simple decoding keeps the run auditable and easy to rerun end-to-end. Qualitative inspection is essential, since reference overlap alone does not expose omissions or subtle hallucinations.

Concise takeaways:
- Stable input construction with a fixed instruction prompt and bounded lengths improves robustness
- ROUGE should be paired with targeted qualitative checks to reveal omissions and phrasing artifacts
- Small models benefit from gradient checkpointing and conservative schedules on memory-limited hardware


## 17. Responsible Use and Limitations

This notebook is a research and teaching artifact. It is not intended for policy or legal decisions. Summaries support analysis and should be reviewed by a human.

Scope and risks:
- Dataset coverage is narrow and may not reflect new domains, speakers, or dialogue styles
- ROUGE measures lexical overlap, so it can underestimate quality when acceptable paraphrases are used and can miss attribution errors that reuse the reference wording
- Greedy decoding favors stability but can under-explore phrasing diversity
- The model provides no calibrated confidence estimate, and generation scores should not be interpreted as reliability guarantees

Operational guidance:
- Keep a human in the loop for any action that affects people or decisions
- Reevaluate performance when moving to longer dialogues, new domains, or different languages
- Pair ROUGE with qualitative spot checks before drawing conclusions


## 18. Conclusion

A lightweight summarization pipeline built on FLAN-T5-small provides a clear and reproducible baseline for dialogue summarization. Fixed preprocessing, consistent prompts, and a RAM-safe evaluation path make results easy to audit and to rerun on Google Colab or comparable environments. ROUGE combined with qualitative inspection yields a balanced view of content fidelity and stylistic adequacy under constrained resources.

The baseline is suitable for research and teaching. It establishes a practical foundation for experiments that explore decoding strategies, modest scaling, or data curation while retaining a compute friendly profile.


## 19. Next Steps

Future work can build on this baseline while keeping the footprint small. Decoding studies that compare greedy, top-p, and low-beam settings may improve factual coverage with limited cost. Light data curation such as trimming boilerplate or anonymizing named entities may reduce distractors. Training extensions worth testing include a few extra epochs with early stopping, simple instruction variations, or selective upsampling of longer dialogues to encourage richer compression.

Priorities to explore:
- Evaluate ROUGE under alternative decoding with identical preprocessing and report variance across seeds
- Add small instruction variants and measure sensitivity of ROUGE and qualitative faithfulness
- Increase training fraction and epochs cautiously and monitor overfitting through loss trends and spot checks
- Compile the model with `torch.compile` or export it to ONNX, and record latency for end-to-end inference on target hardware
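
For the latency item, a small timing harness is usually enough to get started. The sketch below times any callable with `time.perf_counter` and reports the median over repeated runs. `measure_latency` is a hypothetical helper, and the stand-in workload is a placeholder that would be replaced by a real call such as `summarize_dialogue(text)`.

```python
import statistics
import time


def measure_latency(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return per-call latencies in seconds after `warmup` untimed calls."""
    for _ in range(warmup):
        fn(*args, **kwargs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings


def stand_in_workload(n=50_000):
    # Placeholder for model inference (assumption); replace with the real call.
    return sum(i * i for i in range(n))


timings = measure_latency(stand_in_workload, repeats=5)
print(f"median latency: {statistics.median(timings) * 1000:.2f} ms over {len(timings)} runs")
```

On a GPU, calling `torch.cuda.synchronize()` before reading the clock is generally needed for the timings to reflect completed work.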


## Author / Contact

**Rifky Setiawan**  
Undergraduate Student, Department of Computer Science  
Universitas Gadjah Mada (UGM), Indonesia  
Email: rifkysetiawan@mail.ugm.ac.id
