# Text Summarization from Past to Present Using the T5 LLM

![unsplash-textsummarization.jpg](attachment:30eeae7d-4afb-4339-8973-5333b1aab808.jpg)

> Part of this article describes the history of text summarization and briefly introduces some recent architectures (seq2seq, Transformers, etc.). If you would rather skip these sections, you can jump straight to the model setup using the table of contents below.

<a id = "table-of-content"></a>
# **Table of Contents**

- [Introduction](#intro)
- [A Brief History of Text Summarization](#history)
- [Text Summarization Methods](#approaches)
- [Extractive Method vs Abstractive Method](#ea)
- [Seq2Seq, Attention, Transformers, and T5](#seq)
- [Text Summarization with Google T5 Language Model](#t5)
    - [Import Libraries](#import)
    - [Preprocessing the Data](#prep)
    - [Fine-Tuning the Model](#fine)

<a id = "intro"></a>
# **Introduction**
*Natural language processing (NLP) is a field that combines computer science, linguistics and artificial intelligence. One of the important applications in this field is text summarization.*

**Text summarization is the process of expressing the main idea of a text or document in a shorter form while preserving its meaning. This complex process involves understanding the original text, selecting the important information and expressing it concisely.**

Text summarization is used by many industries today. It is widely applied in sectors such as insurance, law, journalism, education and research. Especially when dealing with large documents, scanning them in their entirety can be a time-consuming and inefficient process. Text summarization can speed up the search process by helping to shorten documents.

Text summarization has benefited significantly from advances in natural language processing. The development of technologies such as artificial intelligence and machine learning has enabled its applications to become more accurate and effective. Therefore, before developing a text summarization model with the T5 language model, it may be useful to take a brief look at the developments in this field.

<a id = "history"></a>
# **A Brief History of Text Summarization**

## 1950s and 1960s / Empirical Approaches: 
The first text summarization studies date back to the 1950s. They tried to determine the importance of each sentence by calculating various features of the sentence, starting with word frequency. When it became clear that word frequency alone was not sufficient, methods that also took the position of the sentence in the document into account were developed. However, the limited computing equipment of the period kept these methods from achieving the desired results, and text summarization research paused for a while.

## 1970s and 1980s / Rationalist Approaches:
In the 1970s, researchers started to focus on summarization by understanding the meaning of the text. They used approaches such as scripts, FRUMP and SUSY to identify important information in the text. However, these approaches required hand-coded information and domain-specific knowledge, so their results were often effective only in the specific domain they were built for. This encouraged the development of new methods and strategies to make text summarization more comprehensive and generalizable.

## 1990s / Return to the Empirical Approach:
The mid-1990s saw a return to the empirical approach of the 1950s and 1960s, but this time at a more advanced level. Rather than relying on hand-coded heuristics, researchers turned to procedures derived automatically from data to determine the importance of information. They adopted approaches such as basic sentence analysis and concept analysis, and brought a different perspective to the field than the earlier studies.

## 2000s / Machine Learning Based Approaches:
In the 2000s, text summarization studies benefited significantly from developments in the field of artificial intelligence. In this period, machine learning algorithms were used to create better summaries. Advances in data storage and processing technologies have enabled machine learning text summarization methods to be used more widely. In this period, almost all statistical models such as Bayesian Classifiers, Decision Trees, Hidden Markov Models, etc. were tried to improve summarization.

## 2015 and Today / Deep Neural Network Based Approaches:
Despite the improvements brought by traditional machine learning approaches in the 2000s, the deep learning-based neural approaches that emerged around 2015 marked a significant turning point in the field of text summarization. Deep neural networks are non-linear statistical modeling tools that can find patterns in data by extracting complex relationships. In this period, text summarization with seq2seq models made great progress thanks to architectures such as RNNs and LSTMs. These models represent the input and output text as sequences of numerical values and generate a summary of the text. Furthermore, the emergence of the Transformer architecture brought far-reaching progress in NLP tasks. With this evolution, powerful and versatile models such as BERT, GPT, PEGASUS and T5 were developed and new standards were set in the field of text summarization.

---

<a id = "approaches"></a>
# **Text Summarization Methods**

We have given an overview of the development of text summarization from past to present. There are different approaches that have been used to summarize texts in the course of past and present studies. These approaches may vary depending on the purpose of summarizing the text, the target audience and the length of the summary. The main approaches are as follows:

* Single document vs multi-document
* Generic vs User-Focused
* Indicative vs Informative
* Language of Summarizer
* Extracts vs Abstracts

Among these approaches, the **extracts vs abstracts** approach in particular provides a clear distinction in the practice of text summarization.

![abstractive-vs-extractive.jpg](attachment:26096ef8-cf3b-46a2-82fc-51c2bc0068ef.jpg)

**Extractive summarization** is the process of creating a summary by extracting important sentences and paragraphs from the text. Each sentence in a document is assigned a score based on its relationship to all other sentences in the document. The scoring may be different depending on the model/library you use. The sentences with the highest scores together form the summary of the document.
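To make the sentence-scoring idea concrete, here is a minimal sketch in plain Python. It is only one possible scoring scheme, chosen for illustration: each sentence is scored by its summed cosine similarity (over simple word counts) to every other sentence, and the top-scoring sentences are returned in their original order. The helper names (`split_sentences`, `bag_of_words`, `cosine`, `extractive_summary`) and the toy text are assumptions made for this example and do not come from any library.

```python
import re
from collections import Counter
from math import sqrt


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extractive_summary(text, num_sentences=2):
    """Score each sentence by its total similarity to all other sentences
    and return the top-scoring ones in their original order."""
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    scores = [
        sum(cosine(bags[i], bags[j]) for j in range(len(bags)) if j != i)
        for i in range(len(bags))
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


text = (
    "The cat sat on the mat. "
    "The cat chased a mouse across the mat. "
    "Stock prices fell sharply on Monday. "
    "A mouse hid under the mat while the cat slept."
)
print(extractive_summary(text, num_sentences=2))
```

The sentence about stock prices shares almost no words with the rest of the text, so it receives the lowest score and is left out. Real extractive systems typically use richer sentence representations, but the select-by-score structure is the same.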

**Abstractive summarization** is a method of creating a new and original summary by understanding the meaning of the original text. This method rewrites the text to capture the main idea and important details of the text. In contrast to extractive summarization, it takes a more creative approach and tries to express the text in a shorter form using original language.

Both methods have their advantages and challenges. Extractive summarization tends to preserve the wording and the important information of the original text, while abstractive summarization produces a more natural and flexible summary. In this article, we will work with the T5 model as an example of abstractive summarization.

---

<a id = "seq"></a>
# **Seq2Seq, Attention, Transformers, and T5**

For a deeper understanding of text summarization, I would also like to briefly mention **sequence-to-sequence, attention, transformers and T5**.

### Sequence-to-sequence:
* Seq2seq is a family of neural network models that map an input sequence to an output sequence. The underlying network can be an RNN, a CNN or any other architecture.
* It is especially used in natural language processing tasks such as translation, summarization, generative question answering and classification.
* It is based on the Encoder-Decoder structure, with an "encoder" to understand the input data and a "decoder" to generate the output data.

### Attention:
* Attention is a mechanism used in deep learning models that assigns different weights to different parts of the input, allowing the model to prioritize and emphasize the most important information when performing tasks such as translation or summarization.
* It facilitates information transfer, especially when working with long texts.
* It is a highly effective method for text summarization. For this reason, it is used in many architectures, such as Transformers.
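The weighting idea described above can be sketched with scaled dot-product attention, the variant used in Transformers. This is a minimal NumPy illustration under simplifying assumptions: the "token" vectors are random toy data, and the learned projection matrices that a real model applies to obtain queries, keys and values are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(queries, keys, values):
    """Return (output, weights) for scaled dot-product attention.

    queries: (num_queries, d_k), keys: (num_keys, d_k), values: (num_keys, d_v)
    """
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each row is a distribution over the inputs
    return weights @ values, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 toy "token" vectors of dimension 8

# Self-attention: the same vectors act as queries, keys and values.
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(weights.round(2))  # row i shows how much token i attends to every token
print(output.shape)
```

Each output vector is a weighted average of the value vectors, so the parts of the input with larger weights contribute more to the result.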

### Transformers:
* The Transformer architecture is a model structure that has achieved great success, especially in natural language processing tasks.
* It uses a self-attention mechanism, which allows better modeling of the interaction of each word with other words.
* It allows computations to be parallelized across the sequence, which reduces training times.

One of the big challenges in NLP is obtaining enough data to train a model. Fortunately, large pre-trained language models have taken on most of this workload. Thanks to pre-trained models like T5 and PEGASUS, we only need to fine-tune the model on our own task-specific data. Let's take a brief look at the pre-trained T5 model, which is often used in text summarization, and start building the model.

### T5 (Text-To-Text Transfer Transformer):

The T5 model is a multi-task model consisting of two main components, an encoder and a decoder, both built on the Transformer architecture. The encoder encodes the input sequence and the decoder generates the output sequence from the encoded data, using the attention mechanism to transform one sequence into the other. Thanks to its seq-to-seq structure and Transformer architecture, the T5 model can be used to generate many kinds of text, such as poems, code, scripts, pieces of music, e-mails and letters.

![image3.gif](attachment:b04d0a07-56cb-4b58-8b05-6c83f9a4858b.gif)

<a id = "t5"></a>
# **Text Summarization with Google T5 Language Model**

Now that we have looked at the approaches and techniques used in text summarization from past to present, it is time to develop a model.

First, load the libraries needed to create the model. We also need to log in to download and use language models from the Hugging Face Hub. If you haven't signed up for Hugging Face yet, you can do so at <https://huggingface.co/>. If you don't have a Hugging Face token, you can create one at <https://huggingface.co/settings/tokens> after signing up.

<a id = "import"></a>
## Import Libraries

In [None]:
# Uncomment to install the dependencies before running the imports below.
#!pip install datasets evaluate transformers rouge-score nltk
!apt install git-lfs

import nltk
import numpy as np

from datasets import load_dataset
from evaluate import load

import transformers
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

from huggingface_hub import notebook_login

# Sentence tokenizer data used by nltk.sent_tokenize in compute_metrics.
# Newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")

# Log in to the Hugging Face Hub (needed because we train with push_to_hub=True).
notebook_login()

print(transformers.__version__)

There are several T5 checkpoints with different numbers of parameters (t5-small, t5-base, t5-large, t5-3b, t5-11b). In this notebook, we will use the t5-small model and the XSum dataset for fine-tuning. We will use ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as the evaluation metric.

In [None]:
model_checkpoint = "t5-small"
raw_datasets = load_dataset("xsum")
metric = load("rouge")
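To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1 in plain Python. It counts overlapping unigrams between a generated summary and a reference on whitespace tokens only. The function name `rouge1` is an assumption made for this illustration; the `rouge` metric loaded above additionally supports stemming and reports further variants such as ROUGE-2 and ROUGE-L, so its numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

In this toy pair, five of the six words match, so precision, recall and F1 all equal 5/6.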

We can write a small function to get an overview of the dataset. The dataset is split into "train", "validation" and "test"; with the function below, let's take a look at a few random examples from the train split.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])

Some examples from the Train dataset:

![exptrain.png](attachment:5bd551c3-f93a-42fe-ae30-7451ec41a89e.png)

<a id = "prep"></a>
## Preprocessing the Data

Before training the model, we need to preprocess the data. In this step, we will transform the text into a form that the model accepts. We will do this using AutoTokenizer from transformers.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

As mentioned above, T5 is a multi-task model. Its developers solved the problem of telling the model which task to perform by prepending a task prefix to the input text, so as users we need to specify the task we want with the matching prefix.

In [None]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
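As a small illustration of what the prefix does to the input, the sketch below prepends a task prefix to a batch of documents. The helper `add_task_prefix` and the example sentences are made up for this illustration; "translate English to German: " is shown only as another prefix that the original T5 checkpoints were trained with.

```python
def add_task_prefix(documents, prefix="summarize: "):
    """Prepend the task prefix to every document in a batch."""
    return [prefix + doc for doc in documents]


batch = [
    "The city council approved the new budget on Tuesday.",
    "Heavy rain flooded several roads overnight.",
]

for text in add_task_prefix(batch):
    print(text)

# The same input text with a different prefix asks the model for a different task.
print(add_task_prefix(batch[:1], prefix="translate English to German: ")[0])
```

The model itself is unchanged between tasks; only the text it receives differs.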

Now we can write the function that preprocesses the samples. We pass them to the tokenizer with the argument truncation=True, so that any input or summary longer than the given max_length is cut off at that length.

In [None]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

We tokenize the raw dataset with this function. Passing batched=True lets map process several examples at once, which makes the tokenization more efficient.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

<a id = "fine"></a>
## Fine-Tuning the Model

Now that our data is ready, we can download the pre-trained model and fine-tune it. Since our task is sequence-to-sequence, we use the AutoModelForSeq2SeqLM class.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

To initialize a Seq2SeqTrainer we will need to do three more things. The most important one is Seq2SeqTrainingArguments, a class that contains all the attributes for customizing the training. It requires a folder name to be used to save the checkpoints of the model and all other arguments are optional.

In [None]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Here we set the evaluation to take place at the end of each epoch, adjust the learning rate, use the batch_size value defined at the top of the cell and set the weight decay. Since Seq2SeqTrainer saves checkpoints regularly and our dataset is quite large, we tell it to keep at most three checkpoints. We also train for a single epoch and enable mixed-precision training with fp16=True, which requires a CUDA GPU. Finally, we use the predict_with_generate option so that the summaries are produced with the model's generate method during evaluation.

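The last argument, push_to_hub=True, sets everything up so that we can push the model to the Hub regularly during training.

Next, we need a data collator that pads not only the inputs but also the labels to the longest sequence in each batch: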

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
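To illustrate what this collator does conceptually, here is a plain-Python sketch of dynamic padding. It is not the library's implementation: the helper `pad_batch` and the toy token ids are assumptions for this example. What it mirrors is that inputs are padded with the tokenizer's pad token id (0 for T5), while labels are padded with -100 by default so that the padded positions are ignored by the loss.

```python
PAD_TOKEN_ID = 0     # pad token id of the T5 tokenizer
LABEL_PAD_ID = -100  # positions with this label are ignored by the loss


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Toy token ids for two examples of different lengths.
input_ids = [[21, 7, 1], [21, 7, 13, 42, 1]]
labels = [[5, 1], [5, 9, 9, 1]]

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)

# 1 marks a real token, 0 marks padding.
max_len = len(padded_inputs[0])
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in input_ids]

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

Padding per batch rather than to a fixed global length keeps batches of short texts small, which saves memory and computation.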

The last thing we need to define for our Seq2SeqTrainer is how to calculate metrics from predictions. For this we need to define a function that will use the metric we loaded earlier and do some preprocessing to convert the predictions into text.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
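Two steps in this function are easy to miss, so here they are in isolation on toy arrays. This is a sketch with made-up token ids and an assumed pad token id of 0 (the value used by T5). First, the -100 entries in the labels are swapped back to the pad id so the tokenizer can decode them. Second, the generated length is the number of non-pad tokens in each prediction.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumed pad token id (0 for T5)

# Labels as they come out of the data collator: -100 marks padded positions.
labels = np.array([[5, 1, -100, -100],
                   [5, 9, 9, 1]])
decodable_labels = np.where(labels != -100, labels, PAD_TOKEN_ID)
print(decodable_labels)

# Toy generated sequences. T5 starts decoding from the pad token, and shorter
# sequences in a batch are padded at the end.
predictions = np.array([[0, 5, 8, 1, 0, 0],
                        [0, 5, 9, 9, 4, 1]])
prediction_lens = [np.count_nonzero(pred != PAD_TOKEN_ID) for pred in predictions]
print(prediction_lens, np.mean(prediction_lens))
```

Because pad tokens are excluded, gen_len reflects the average number of tokens the model actually produced rather than the padded width of the batch.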

Then we pass all of this, together with our datasets, to Seq2SeqTrainer.

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Now we can fine-tune our model by simply calling the train method.

In [None]:
trainer.train()

The long training run has finally completed. The ROUGE values of the model are as follows:

![train-rogue.png](attachment:b4befa9e-716e-4994-b7ac-876e074f32c5.png)

Let's try an example with our model. I took a passage about Mustafa Kemal Atatürk from the English Wikipedia and asked the model to summarize it:

*Mustafa Kemal Atatürk, also known as Mustafa Kemal Pasha until 1921, 
and Ghazi Mustafa Kemal from 1921 until the Surname Law of 1934 
(c. 1881 – 10 November 1938), was a Turkish field marshal, 
revolutionary statesman, author, and 
the founding father of the Republic of Turkey, 
serving as its first president from 1923 until his death in 1938. 
He undertook sweeping progressive reforms, 
which modernized Turkey into a secular, industrializing nation. 
Ideologically a secularist and nationalist, his policies 
and socio-political theories became known as Kemalism. 
Due to his military and political accomplishments, 
Atatürk is regarded as one of the most important 
political leaders of the 20th century.*

![ataturk-summary.png](attachment:c554205a-85e4-42f3-b24c-77a54907b864.png)

Finally, if you want to upload the fine-tuned model to the Hugging Face Hub and load it again later:

In [None]:
trainer.push_to_hub()

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("your_name/model_name")
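For completeness, here is a hedged sketch of how a summary could be generated with the loaded model. The helper `summarize` is hypothetical and written for this illustration. It assumes that the tokenizer was saved in the same repository as the model, that the model and the inputs live on the same device (for example a freshly loaded model on CPU), and that PyTorch is installed. The beam size of 4 is an arbitrary illustrative choice, not a tuned setting.

```python
def summarize(text, model, tokenizer, prefix="summarize: ",
              max_input_length=1024, max_new_tokens=128):
    """Generate a summary for one document with a seq2seq model."""
    inputs = tokenizer(prefix + text, return_tensors="pt",
                       max_length=max_input_length, truncation=True)
    # Beam search with 4 beams is an illustrative choice, not a tuned setting.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the model, so it needs network access):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("your_name/model_name")
# model = AutoModelForSeq2SeqLM.from_pretrained("your_name/model_name")
# print(summarize(article_text, model, tokenizer))
```

Here `article_text` stands for any string you want to summarize, and the prefix and length limits mirror the values used during preprocessing.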

Maybe we should have used a longer text, other parameters, a bigger model, or changed the structure of the T5 model. But no matter how hard you try to summarize a complex text with an NLP model, you will always end up with documents that the model fails to summarize. There is still a lot of room for improvement. Also, if your system hardware is up to it, I suggest you look into more powerful language models like FLAN-T5 or PEGASUS. Thank you for reading.

References:

Rothman, D., Transformers for Natural Language Processing Second Edition / Build, train, and fine-tune deep neural architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3

Singh, J., Natural Language Processing in the Real World / Text Processing, Analytics, and Classification

Zechner, K., (1997) A literature survey on information extraction and text summarization. Computational Linguistics Program, Carnegie Mellon University, [online] Fall 1996. Available at: http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/zechner/infoextr.pdf.

Zhang, H., Cai, J., Xu, J. and Wang, J., (2019a) Pretraining-Based Natural Language Generation for Text Summarization. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). [online] Stroudsburg, PA, USA: Association for Computational Linguistics, pp.789–797. Available at: https://www.aclweb.org/anthology/K19-1074.

Saggion, Horacio., Poibeau, Thierry., Automatic Text Summarization: Past, Present, and Future

Orasan, C., Automatic Summarisation: 25 Years On