# Fine-tuning a Model for Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `UBC-NLP/AraT5-base` if you face any problem you can use `google-t5/t5-small` but the first one is the correct one for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/news_articles_df` dataset. Ensure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [89]:
from datasets import load_dataset

ds = load_dataset("CUTD/news_articles_df")

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K  

In [2]:
import pandas as pd

df = pd.read_csv("hf://datasets/CUTD/news_articles_df/news_articles_df.csv")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [13]:
df.head()

Unnamed: 0.1,Unnamed: 0,summarizer,text
0,0,\nأشرف رئيس الجمهورية الباجي قايد السبسي اليوم...,اشرف رئيس الجمهوريه الباجي قايد السبسي اليوم ب...
1,1,"\nتحصل كتاب ""المصحف وقراءاته"" الذي ألفه باحثون...",تحصل كتاب المصحف وقراءاته الفه باحثون تونسيون ...
2,2,تونس حاضرة من جهة أخرى ستكون تونس حاضرة في قائ...,احتضن جناح تونس القريه الدوليه للافلام بمدينه ...
3,3,واستأجرت صاحبة المشروع المحامية والكاتبة سيران...,شهدت برلين الجمعه افتتاح مسجد فريد نوعه الاقل ...
4,4,\nنعت وزارة الشّؤون الثّقافيّة المنشد الصّوفي ...,نعت وزاره المنشد عز بن محمود انتقل جوار يوم تن...


In [4]:
!pip install evaluate rouge_score

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=4d3e32031729754dedbc46d4a94023f334429eb1494b448123c2148a4daa11a7
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.3 rouge_score-0.1.2


In [72]:
!pip install --force-reinstall pyarrow

Collecting pyarrow
  Using cached pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting numpy>=1.16.6 (from pyarrow)
  Downloading numpy-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.9/60.9 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hUsing cached pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
Downloading numpy-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.3/16.3 MB[0m [31m55.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: numpy, pyarrow
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 17.0.0
    Uninstalling pyarrow-17.0.0:
      Successfully unin

In [6]:
import evaluate
metric = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the gevin model checkpoint.

In [7]:
from transformers import AutoTokenizer , DataCollatorForSeq2Seq , AutoModelForSeq2SeqLM ,Seq2SeqTrainingArguments , Seq2SeqTrainer

In [108]:
checkpoint = "UBC-NLP/AraT5-base"
tokonezer = AutoTokenizer.from_pretrained(checkpoint)



In [9]:
checkpoint.split("/")[0]

'UBC-NLP'

## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [90]:
ds


DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 8378
    })
})

In [93]:
ds

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 8378
    })
})

In [94]:
ds_train = ds["train"]

In [109]:
prefix = "summarize:"
max_inp = 128
max_tar = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokonezer(inputs, max_length=max_inp, truncation=True,)

    labels = tokonezer(text_target=examples["summarizer"], max_length=max_tar, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [96]:
ds_proc =preprocess_function(ds_train)

In [97]:
ds_proc

{'input_ids': [[35880, 4633, 16410, 33, 16, 15566, 147, 9007, 13, 80175, 66111, 96848, 101, 26912, 41937, 18242, 6, 329, 89993, 585, 90, 56236, 4424, 11274, 96278, 67533, 10628, 1315, 41937, 17402, 13, 35397, 65, 69436, 257, 73662, 359, 22472, 44, 25460, 601, 7503, 206, 7872, 31985, 20, 1567, 37671, 16, 14, 2362, 37578, 61283, 56764, 102, 3926, 573, 787, 2523, 10848, 13, 20, 2133, 3663, 599, 47643, 2133, 69436, 487, 257, 941, 116, 14245, 13, 2133, 6164, 2133, 47665, 2231, 468, 2133, 3663, 69436, 1265, 1967, 2133, 41174, 20, 2133, 3663, 2133, 296, 3663, 8011, 9720, 9393, 2133, 69436, 2380, 56492, 116, 2133, 1], [35880, 4633, 16410, 33, 58732, 1227, 24409, 14, 28367, 2664, 371, 13, 36402, 38895, 65, 80762, 35153, 13, 33544, 5385, 2752, 2592, 1969, 515, 675, 12700, 11259, 614, 56255, 107, 1272, 291, 24595, 1082, 4424, 515, 86993, 13, 56187, 3, 48, 24409, 14, 28367, 2664, 750, 71664, 62, 1058, 11502, 13, 3183, 27580, 15631, 57205, 13712, 6001, 58726, 13, 70652, 257, 9742, 10891, 41619, 948

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [11]:
seqtoseq = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

pytorch_model.bin:   0%|          | 0.00/1.13G [00:00<?, ?B/s]

In [12]:
data_collator = DataCollatorForSeq2Seq(tokonezer, model=seqtoseq)

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [2]:
#Done

## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [13]:
batch_size = 8
model_name = "UBC-NLP"
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
)



In [None]:
args = Seq2SeqTrainingArguments()

## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [104]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokonezer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokonezer.pad_token_id)
    decoded_labels = tokonezer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokonezer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [98]:
ds_train = ds_proc[:5000]
ds_val = ds_proc[5000:]

In [110]:
trainer = Seq2SeqTrainer(
    seqtoseq,
    args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    data_collator=data_collator,
    tokenizer=tokonezer,
    compute_metrics=compute_metrics
)

## Step 8: Fine-tune the Model

In [112]:
trainer.train()

AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'

Train the model using the specified arguments and dataset.

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.