<a href="https://colab.research.google.com/github/vlassner/dsml_4220_project/blob/main/dsml4220_prj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Project
By Victoria Lassner
DSML 4220

**Goal**: Fine tune a model for abstractive Summarization.

**Model:** T5-Base with its Tokenizer

Websites: https://huggingface.co/docs/transformers/tasks/summarization

**Future Models to Compare:**

https://wandb.ai/mostafaibrahim17/ml-articles/reports/Fine-Tuning-LLaMa-2-for-Text-Summarization--Vmlldzo2NjA1OTAy

https://wandb.ai/mostafaibrahim17/ml-articles/reports/Crafting-Superior-Summaries-The-ChatGPT-Fine-Tuning-Guide--Vmlldzo1Njc5NDI1

**Definitions:**

Abstractive summarization = oncise summary of a text by understanding its meaning and creating new sentences, rather than simply extracting phrases from the original text.

*****
**Dataset:**
CNN/DailyMail: https://paperswithcode.com/dataset/cnn-daily-mail-1
BillSum


In [1]:
# disables weights and biases
import os
os.environ["WANDB_DISABLED"] = "true"

In [2]:
# downloads packages for model, dataset and tokenzier
# --Quiet limits output of messages
!pip install transformers datasets evaluate sentencepiece rouge_score --quiet

In [3]:
# Download packages
from datasets import load_dataset, concatenate_datasets
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer, T5Tokenizer
import torch
from torch.utils.data import DataLoader
import torch

In [4]:
# Load CNN/Daily Mail Dataset from dataset package
# Limit samples to 7000 total.

train_sample_limit = 5000
val_sample_limit = 2000

dataset = load_dataset("cnn_dailymail", "3.0.0")
limited_train_data = dataset["train"].select(range(train_sample_limit))
limited_val_data = dataset["validation"].select(range(val_sample_limit))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
# preprocess data for model
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# limit length of input articles and output summary
max_input_length = 512
max_target_length = 250

chunk_size = 1000

def preprocess(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=max_target_length,
            truncation=True,
            padding="max_length"
        )

    # Replace pad token with -100 to ignore in loss
    labels["input_ids"] = [
      [(label if label != tokenizer.pad_token_id else -100) for label in label_seq]
      for label_seq in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def process_in_chunks(dataset, chunk_size, preprocess_fn):
    total_len = len(dataset)
    processed_chunks = []

    for i in range(0, total_len, chunk_size):
        chunk = dataset.select(range(i, min(i + chunk_size, total_len)))
        processed_chunk = chunk.map(
            preprocess_fn,
            batched=True,
            remove_columns=["article", "highlights", "id"]
        )
        processed_chunks.append(processed_chunk)

    return concatenate_datasets(processed_chunks)

# Process the training and validation data into chunks
train_dataset = process_in_chunks(limited_train_data, chunk_size, preprocess)
val_dataset = process_in_chunks(limited_val_data, chunk_size, preprocess)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [6]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # If preds are logits, convert to token IDs
    if isinstance(preds, tuple):
        preds = preds[0]

    if preds.ndim == 3:  # logits
        preds = np.argmax(preds, axis=-1)

    # Optional safety: clip token IDs to vocab size
    preds = np.clip(preds, 0, tokenizer.vocab_size - 1)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return {k: round(v * 100, 2) for k, v in result.items()}

In [7]:
# Load model T5-base
model = T5ForConditionalGeneration.from_pretrained("t5-base")

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-cnn-model",
    eval_steps=500,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    generation_max_length=128,
    logging_steps=100,
    save_steps=1000,
    num_train_epochs=3,
    fp16=True
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [8]:
# adds padding so shorter sequences match the longest one
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [9]:
# train model using hugging face's trainer class
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

  trainer = Seq2SeqTrainer(
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
100,1.734
200,1.6294
300,1.6312
400,1.6622
500,1.5909
600,1.5983
700,1.6182
800,1.6122
900,1.6116
1000,1.6086


TrainOutput(global_step=3750, training_loss=1.497725351969401, metrics={'train_runtime': 1893.5016, 'train_samples_per_second': 7.922, 'train_steps_per_second': 1.98, 'total_flos': 9134368358400000.0, 'train_loss': 1.497725351969401, 'epoch': 3.0})

In [10]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 1.878161907196045, 'eval_rouge1': 37.46, 'eval_rouge2': 16.15, 'eval_rougeL': 26.9, 'eval_rougeLsum': 26.91, 'eval_runtime': 1013.376, 'eval_samples_per_second': 1.974, 'eval_steps_per_second': 0.493, 'epoch': 3.0}


In [14]:
#saves current state of model and tokenzier
model.save_pretrained("/content/t5_cnn_model_base_v4")
tokenizer.save_pretrained("/content/t5_cnn_model_base_v4")

('/content/t5_cnn_model_base_v4/tokenizer_config.json',
 '/content/t5_cnn_model_base_v4/special_tokens_map.json',
 '/content/t5_cnn_model_base_v4/spiece.model',
 '/content/t5_cnn_model_base_v4/added_tokens.json')

In [15]:
from huggingface_hub import HfApi, HfFolder
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from huggingface_hub import login

login()

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("/content/t5_cnn_model_base_v4")
tokenizer = T5Tokenizer.from_pretrained("/content/t5_cnn_model_base_v4")

# Save to HuggingFace
model.push_to_hub("vlassner01/t5_cnn_model_base_v4")
tokenizer.push_to_hub("vlassner01/t5_cnn_model_base_v4")

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vlassner01/t5_cnn_model_base_v4/commit/d3dba902db2fdfb0705646bf717955fb9bf41845', commit_message='Upload tokenizer', commit_description='', oid='d3dba902db2fdfb0705646bf717955fb9bf41845', pr_url=None, repo_url=RepoUrl('https://huggingface.co/vlassner01/t5_cnn_model_base_v4', endpoint='https://huggingface.co', repo_type='model', repo_id='vlassner01/t5_cnn_model_base_v4'), pr_revision=None, pr_num=None)

In [16]:
# Step 1: Install and update Hugging Face Hub
!pip install --upgrade huggingface-hub

# Step 2: Authenticate to Hugging Face (ensure you enter your API token)
from huggingface_hub import login
login()  # Enter your Hugging Face API token when prompted

# Step 3: Prepare the tokenizer files (Ensure all tokenizer files are in this folder)
!mkdir -p /content/hf_tokenizer_upload
!cp -r /content/t5_cnn_model_base_v3/* /content/hf_tokenizer_upload/

# Step 4: Upload tokenizer files to Hugging Face using hf_hub
from huggingface_hub import upload_file

# Define your Hugging Face repo name
repo_name = "vlassner01/t5_cnn_model_base_v4"

# Path to the local folder containing your tokenizer files
folder_path = '/content/hf_tokenizer_upload'

# Upload individual files to Hugging Face
upload_file(
    path_or_fileobj=f"{folder_path}/spiece.model",  # Replace with actual file path
    path_in_repo="spiece.model",  # Path in the Hugging Face repo
    repo_id=repo_name,
    commit_message="Upload spiece.model"
)

upload_file(
    path_or_fileobj=f"{folder_path}/tokenizer_config.json",  # Replace with actual file path
    path_in_repo="tokenizer_config.json",  # Path in the Hugging Face repo
    repo_id=repo_name,
    commit_message="Upload tokenizer_config.json"
)

upload_file(
    path_or_fileobj=f"{folder_path}/special_tokens_map.json",  # Replace with actual file path
    path_in_repo="special_tokens_map.json",  # Path in the Hugging Face repo
    repo_id=repo_name,
    commit_message="Upload special_tokens_map.json"
)

upload_file(
    path_or_fileobj=f"{folder_path}/tokenizer.json",  # Replace with actual file path
    path_in_repo="tokenizer.json",  # Path in the Hugging Face repo
    repo_id=repo_name,
    commit_message="Upload tokenizer.json"
)

# Step 5: Verify the files
# Once the upload finishes, check your model page at:
# https://huggingface.co/vlassner01/t5_cnn_model_base_v3




VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


ValueError: Provided path: '/content/hf_tokenizer_upload/tokenizer.json' is not a file on the local file system