# Deep Learning Project
By Victoria Lassner
DSML 4220

**Goal**: Fine-tune a model for abstractive summarization.

**Model:** T5-Base with its Tokenizer

Websites: https://huggingface.co/docs/transformers/tasks/summarization

**Future Models to Compare:**

https://wandb.ai/mostafaibrahim17/ml-articles/reports/Fine-Tuning-LLaMa-2-for-Text-Summarization--Vmlldzo2NjA1OTAy

https://wandb.ai/mostafaibrahim17/ml-articles/reports/Crafting-Superior-Summaries-The-ChatGPT-Fine-Tuning-Guide--Vmlldzo1Njc5NDI1

**Definitions:**

Abstractive summarization = a concise summary of a text produced by understanding its meaning and generating new sentences, rather than simply extracting phrases from the original text.
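For contrast, the extractive approach mentioned in the definition can be sketched in a few lines. This is a minimal illustration only: `lead_n` is a hypothetical helper name, and the regex-based sentence splitting is a simplifying assumption, not something used elsewhere in this notebook.

```python
import re

def lead_n(text, n=3):
    """Extractive baseline: return the first n sentences of the text unchanged."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # real sentence segmentation is more involved).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

article = "A cat sat. It purred! Then it slept. Later it ate."
print(lead_n(article, 2))
```

An extractive summary like this can only copy sentences that already exist, whereas an abstractive model such as T5 generates new wording.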

*****
**Dataset:**
CNN/DailyMail: https://paperswithcode.com/dataset/cnn-daily-mail-1
BillSum


In [1]:
# disables Weights & Biases logging
import os
os.environ["WANDB_DISABLED"] = "true"

In [2]:
# installs packages for the model, dataset and tokenizer
# --quiet limits pip's output messages
!pip install --quiet transformers datasets sentencepiece huggingface_hub

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.8.4.1 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cudnn-cu12==9.1.0.70; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cudnn-cu12 9.3.0.75 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cufft-cu12==11.2.1.3; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cufft-cu12 11.3.3.83 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cu

In [3]:
# Import packages
from datasets import load_dataset, concatenate_datasets
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer, T5Tokenizer
import torch

2025-05-03 15:48:57.877781: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746287338.032346      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746287338.078950      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
# Load CNN/Daily Mail Dataset from dataset package
# Limit samples to 4000 total.

train_sample_limit = 3000
val_sample_limit = 1000

dataset = load_dataset("cnn_dailymail", "3.0.0")
limited_train_data = dataset["train"].select(range(train_sample_limit))
limited_val_data = dataset["validation"].select(range(val_sample_limit))


Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]
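The cell above takes the first 3000 training rows and the first 1000 validation rows. If a more representative subset were wanted, one option is to draw the row indices at random with a fixed seed. The sketch below is an assumption about how that could be done, not what this run used; `sample_indices` is a hypothetical helper, and the split sizes are the ones reported in the output above.

```python
import random

def sample_indices(total, k, seed=42):
    """Return k distinct, sorted row indices drawn uniformly from range(total)."""
    if k > total:
        raise ValueError("k cannot exceed total")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(total), k))

train_indices = sample_indices(287113, 3000)
val_indices = sample_indices(13368, 1000)

# Hypothetical usage with the objects from the cell above:
# limited_train_data = dataset["train"].select(train_indices)
# limited_val_data = dataset["validation"].select(val_indices)
```

Switching to a random subset would change the training data, so the loss values recorded later in this notebook would no longer apply.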

In [5]:
# preprocess data for model
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# limit length of input articles and output summary
max_input_length = 512
max_target_length = 150

chunk_size = 1000

def preprocess(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

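    # text_target tokenizes the summaries as labels; it replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=targets,
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]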
    return model_inputs


def process_in_chunks(dataset, chunk_size, preprocess_fn):
    total_len = len(dataset)
    processed_chunks = []

    for i in range(0, total_len, chunk_size):
        chunk = dataset.select(range(i, min(i + chunk_size, total_len)))
        processed_chunk = chunk.map(
            preprocess_fn,
            batched=True,
            remove_columns=["article", "highlights", "id"]
        )
        processed_chunks.append(processed_chunk)

    return concatenate_datasets(processed_chunks)

# Process the training and validation data into chunks
train_dataset = process_in_chunks(limited_train_data, chunk_size, preprocess)
val_dataset = process_in_chunks(limited_val_data, chunk_size, preprocess)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
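One detail of the preprocessing above is worth noting: because the labels are padded with `padding="max_length"`, they contain the pad token ID (0 for T5), and the loss is then, as far as I can tell, also computed on those padding positions. A common remedy is to replace pad IDs in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper and the token IDs are made up for illustration.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace pad token IDs in each label sequence with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Illustrative label IDs: 1 is T5's end-of-sequence token, 0 is padding.
example_labels = [[71, 1712, 1, 0, 0], [3, 9, 1, 0, 0]]
masked = mask_padding(example_labels)
print(masked)  # [[71, 1712, 1, -100, -100], [3, 9, 1, -100, -100]]

# Hypothetical usage inside preprocess():
# model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)
```

The run recorded in this notebook did not apply this masking, so its loss values include the padding positions and are not directly comparable to a masked run.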



In [6]:
# Load the pretrained T5-base model
model = T5ForConditionalGeneration.from_pretrained("t5-base")

training_args = TrainingArguments(
    output_dir="./t5-cnn-checkpoints",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    save_steps=1000,
    logging_dir='./logs',
    logging_steps=50,
    save_total_limit=2,
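    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is present
    report_to="none",  # disables logging integrations such as Weights & Biases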
)


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
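The batch settings above interact: the number of samples per optimizer update is the per-device batch size times the number of devices times the gradient accumulation steps. The sketch below works through that arithmetic for this configuration. The GPU count of 2 is an assumption, inferred from the evaluation output further down (about 20.68 samples/s at about 2.59 steps/s, i.e. roughly 8 samples per step with a per-device batch of 4). `estimate_schedule` is a hypothetical helper, and the exact step count computed by `Trainer` may differ slightly between library versions.

```python
import math

def estimate_schedule(num_samples, per_device_batch, grad_accum, epochs, n_gpus):
    """Estimate effective batch size and optimizer update counts."""
    samples_per_forward = per_device_batch * n_gpus
    effective_batch = samples_per_forward * grad_accum
    batches_per_epoch = math.ceil(num_samples / samples_per_forward)
    # One optimizer update happens every grad_accum batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, updates_per_epoch, updates_per_epoch * epochs

# 3000 training samples, batch 4 per device, 4 accumulation steps, 5 epochs.
effective_batch, per_epoch, total = estimate_schedule(3000, 4, 4, 5, n_gpus=2)
print(effective_batch, per_epoch, total)  # 32 93 465
```

Under these assumptions the estimate of roughly 465 updates is in line with the training log below, whose last entry (with `logging_steps=50`) is at step 450.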


In [7]:
# adds padding so shorter sequences match the longest one
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [8]:
# train the model using Hugging Face's Trainer class
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

trainer.train()

trainer.evaluate()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Step | Training Loss |
|------|---------------|
| 50   | 1.9047 |
| 100  | 0.7385 |
| 150  | 0.6862 |
| 200  | 0.6651 |
| 250  | 0.6566 |
| 300  | 0.6469 |
| 350  | 0.6356 |
| 400  | 0.6264 |
| 450  | 0.6391 |




{'eval_loss': 0.6198398470878601,
 'eval_runtime': 48.3505,
 'eval_samples_per_second': 20.682,
 'eval_steps_per_second': 2.585,
 'epoch': 4.949333333333334}
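`trainer.evaluate()` reports only the loss, which says little about how close generated summaries are to the reference highlights. Summarization is usually scored with overlap metrics such as ROUGE. The sketch below is a deliberately simplified unigram-overlap F1 in the spirit of ROUGE-1, using whitespace tokenization and no stemming; it is not the official ROUGE implementation, and `unigram_f1` is a hypothetical helper name. In practice a maintained package (for example the `rouge_score` package or the `evaluate` library's `rouge` metric) would be the safer choice.

```python
from collections import Counter

def unigram_f1(reference, candidate):
    """Simplified ROUGE-1-style F1 based on lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "the cat"))  # 0.5
```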

In [9]:
# saves the current state of the model and tokenizer
model.save_pretrained("/content/t5_cnn_model_base_v2")
tokenizer.save_pretrained("/content/t5_cnn_model_base_v2")

('/content/t5_cnn_model_base_v2/tokenizer_config.json',
 '/content/t5_cnn_model_base_v2/special_tokens_map.json',
 '/content/t5_cnn_model_base_v2/spiece.model',
 '/content/t5_cnn_model_base_v2/added_tokens.json')

In [11]:
# notebook_login prompts for a Hugging Face access token
from huggingface_hub import notebook_login

notebook_login()

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("/content/t5_cnn_model_base_v2")
tokenizer = T5Tokenizer.from_pretrained("/content/t5_cnn_model_base_v2")

# Save to HuggingFace
model.push_to_hub("vlassner01/t5_cnn_model_base_v2")
tokenizer.push_to_hub("vlassner01/t5_cnn_model_base_v2")

CommitInfo(commit_url='https://huggingface.co/vlassner01/t5_cnn_model_base_v2/commit/448396b176e65c1b8855c9e5a6882ba0f21b2343', commit_message='Upload tokenizer', commit_description='', oid='448396b176e65c1b8855c9e5a6882ba0f21b2343', pr_url=None, repo_url=RepoUrl('https://huggingface.co/vlassner01/t5_cnn_model_base_v2', endpoint='https://huggingface.co', repo_type='model', repo_id='vlassner01/t5_cnn_model_base_v2'), pr_revision=None, pr_num=None)
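Once the model is on the Hub, it can be loaded back and used to generate summaries. The following is a minimal sketch, assuming the repository pushed above is accessible; `summarize` and `load_from_hub` are hypothetical helper names, `article_text` is a placeholder, and the generation settings (`num_beams=4`, `max_new_tokens=150`) are illustrative choices rather than tuned values.

```python
def summarize(text, model, tokenizer, max_input_length=512,
              max_new_tokens=150, num_beams=4):
    """Generate a summary using the same "summarize: " prefix as in training."""
    inputs = tokenizer(
        "summarize: " + text,
        return_tensors="pt",
        max_length=max_input_length,
        truncation=True,
    )
    # Assumes the model and the inputs are on the same device (CPU by default).
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def load_from_hub(repo_id="vlassner01/t5_cnn_model_base_v2"):
    """Load the fine-tuned model and tokenizer (requires network access)."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    model = T5ForConditionalGeneration.from_pretrained(repo_id)
    tokenizer = T5Tokenizer.from_pretrained(repo_id)
    return model, tokenizer

# Hypothetical usage:
# model, tokenizer = load_from_hub()
# print(summarize(article_text, model, tokenizer))
```

Keeping the `"summarize: "` prefix and the 512-token input limit matches the preprocessing used during fine-tuning.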

In [13]:
# # Manually upload the files to Hugging Face
# # File: spiece.model wouldn't upload

# from huggingface_hub import Repository
# from transformers import T5Tokenizer

# # Load and save tokenizer
# tokenizer = T5Tokenizer.from_pretrained("/content/t5_cnn_model_base_v2")
# tokenizer.save_pretrained("/content/t5_cnn_model_base_v2")

# # Initialize Hugging Face repo
# repo = Repository(
#     local_dir="/content/t5_cnn_model_base_v2",
#     clone_from="vlassner01/t5_cnn_model_base_v2"
# )

# # Track and push all files, including spiece.model
# repo.git_add(auto_lfs_track=True)
# repo.git_commit("Uploading tokenizer with spiece.model")
# repo.git_push()
