<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/7-text-summarization-with-t5/text_summarization_with_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text summarization with T5

NLP summarization tasks extract the succinct parts of a text. We will first initialize a T5-large transformer model and explore its architecture. Then, we will see how to use T5 to
summarize different types of documents, including legal and corporate documents.

## Setup

In [None]:
%%shell

pip install transformers
pip install sentencepiece

In [2]:
import torch
import json 
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

# display the architecture of the model or not when we initialize the model
display_architecture=True

If we set display_architecture to True, the structure of the encoder layers, decoder layers, and feedforward sub-layers will be displayed.

## Initializing the T5-large transformer model

We will now import the T5-large conditional generation model to generate text and the T5-large tokenizer.

Initializing a pretrained tokenizer only takes one line. However, nothing guarantees that the tokenizer's vocabulary contains all the words we need.

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-large")
tokenizer = T5Tokenizer.from_pretrained("t5-large")
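
One hedged way to probe the vocabulary concern is to encode a sample of domain text and count how many token IDs map to the unknown token. The sketch below assumes a Hugging Face-style tokenizer exposing `encode` and `unk_token_id`. The helper name `count_unknown_tokens` and the tiny stand-in tokenizer are hypothetical, included only so the block runs without downloading a model.

```python
def count_unknown_tokens(tokenizer, text):
    """Return how many token IDs in `text` map to the tokenizer's unknown token."""
    ids = tokenizer.encode(text)
    return sum(1 for token_id in ids if token_id == tokenizer.unk_token_id)


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real one."""

    unk_token_id = 2

    def __init__(self, vocab):
        self.vocab = vocab  # maps word -> integer ID

    def encode(self, text):
        # Words missing from the vocabulary fall back to the unknown ID
        return [self.vocab.get(word, self.unk_token_id) for word in text.split()]


toy = ToyTokenizer({"the": 3, "corporation": 4, "can": 5, "sue": 6})
print(count_unknown_tokens(toy, "the corporation can sue"))        # 0
print(count_unknown_tokens(toy, "the corporation can indemnify"))  # 1
```

With the tokenizer loaded above, `count_unknown_tokens(tokenizer, text)` should work the same way, since `T5Tokenizer` also provides `encode` and `unk_token_id`.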

In [4]:
# initializes torch.device with 'cpu', CPU is enough for this notebook
device = torch.device('cpu')

We are ready to explore the architecture of the T5 model.

## Exploring the architecture of the T5 model

We will explore the architecture and configuration of a T5-large model.

If `display_architecture` is `True`, we can see the configuration of the model:

In [5]:
if display_architecture:
  print(model.config)

T5Config {
  "_name_or_path": "t5-large",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 4096,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "gradient_checkpointing": false,
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_

The model is a T5 transformer with 16 heads and 24 layers.
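
As a quick sanity check on these numbers, the configuration values printed above can be related to each other. The sketch below copies a few of them into a plain dictionary (the variable names are illustrative) and checks that 16 heads of 64 dimensions each add up to the 1,024-dimensional model width.

```python
# Values copied from the printed T5Config above
config = {
    "d_model": 1024,
    "d_kv": 64,
    "num_heads": 16,
    "num_layers": 24,
    "num_decoder_layers": 24,
}

# Each attention head projects to d_kv dimensions
attention_inner_dim = config["num_heads"] * config["d_kv"]
print(attention_inner_dim)                       # 1024
print(attention_inner_dim == config["d_model"])  # True

# The encoder and decoder stacks each contain 24 blocks
total_blocks = config["num_layers"] + config["num_decoder_layers"]
print(total_blocks)  # 48
```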

We can also see the text-to-text implementation of T5, which adds a prefix to an input sentence to trigger the task to perform. The prefix makes it possible to represent a wide range of tasks in a text-to-text format without modifying the model's parameters. 

In our case, the task is summarization, triggered by the `summarize: ` prefix:

```json
"task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    ---
    ---
}
```

We can see that T5:

- Implements beam search, which will expand the four most significant text
completion predictions (`num_beams` is 4).
- Applies early stopping when `num_beams` sentences are completed per batch.
- Makes sure not to repeat n-grams of size `no_repeat_ngram_size`.
- Controls the length of the samples with `min_length` and `max_length`.
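
The parameters in this list can be passed as keyword arguments to `model.generate`. As a minimal sketch, assuming we want to reuse the configuration's own summarization defaults, the snippet below separates the `prefix` entry (which belongs to the input text) from the remaining generation settings. The dictionary is copied from the configuration shown above. The helper name `split_task_params` is hypothetical.

```python
summarization_params = {
    "early_stopping": True,
    "length_penalty": 2.0,
    "max_length": 200,
    "min_length": 30,
    "no_repeat_ngram_size": 3,
    "num_beams": 4,
    "prefix": "summarize: ",
}


def split_task_params(task_params):
    """Separate the text prefix from the generation settings."""
    generation_kwargs = dict(task_params)  # copy so the input is not mutated
    prefix = generation_kwargs.pop("prefix", "")
    return prefix, generation_kwargs


prefix, generation_kwargs = split_task_params(summarization_params)
print(repr(prefix))               # 'summarize: '
print(sorted(generation_kwargs))  # the remaining generation settings
```

With a loaded model, the same dictionary should be available as `model.config.task_specific_params["summarization"]`, and the settings could then be applied with `model.generate(input_ids, **generation_kwargs)`.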


Another interesting parameter is the vocabulary size:

```
"vocab_size": 32128
```

Vocabulary size is a topic in itself. A vocabulary that is too large will lead to sparse
representations. A vocabulary that is too small will distort the NLP tasks.
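
To give a rough sense of what the vocabulary size costs, the sketch below estimates the size of the token embedding table, assuming one `d_model`-dimensional vector per vocabulary entry. Both values are taken from the configuration above.

```python
vocab_size = 32128  # from the model configuration
d_model = 1024      # embedding width, from the model configuration

# One d_model-dimensional vector per vocabulary entry
embedding_parameters = vocab_size * d_model
print(f"{embedding_parameters:,}")  # 32,899,072
```

Under this assumption, doubling the vocabulary would double the table, which is one reason vocabulary size is a trade-off rather than a free choice.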

We can also see the details of the transformer stacks by simply printing the model.

In [None]:
if display_architecture:
  print(model)

We can see that the model runs operations on 1,024 features for the attention
sub-layers and 4,096 for the inner calculations of the feedforward network sub-layer, which produces outputs of 1,024 features. The symmetrical structure of transformers is maintained through all of the layers.
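
The expansion and contraction inside the feedforward sub-layer can be illustrated with simple arithmetic. This sketch assumes two bias-free linear projections, from `d_model` to `d_ff` and back, which should correspond to the `wi` and `wo` layers in the printed model. The variable names are illustrative.

```python
d_model = 1024  # features entering and leaving the sub-layer
d_ff = 4096     # inner feedforward dimension

expand_weights = d_model * d_ff    # projection from 1,024 to 4,096 features
contract_weights = d_ff * d_model  # projection back to 1,024 features

print(d_ff // d_model)                           # 4
print(f"{expand_weights + contract_weights:,}")  # 8,388,608
```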

You can also choose to select a specific aspect of the model by only running the cells you wish:

In [None]:
if display_architecture:
  print(model.encoder)

In [None]:
if display_architecture:
  print(model.decoder)

In [None]:
if display_architecture:
  print(model.forward)

We have initialized the T5 transformer. Let's now summarize documents.

## Summarizing documents with T5-large

The T5 model has a unified structure whatever the task is, thanks to the prefix + input sequence approach. It may seem simple, but it takes NLP transformer models closer to universal training and zero-shot downstream tasks.
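
The prefix + input sequence idea can be sketched without the model at all. The prefixes below are the two visible in the configuration printed earlier. The helper `prepare_input` is a hypothetical name used for illustration.

```python
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}


def prepare_input(task, text):
    """Prepend the task prefix so one model can serve several tasks."""
    return TASK_PREFIXES[task] + text.strip()


print(prepare_input("summarization", "The corporation can sue and be sued."))
print(prepare_input("translation_en_to_de", "The corporation can sue."))
```

Only the string changes between tasks. The model and its parameters stay the same.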

We will summarize legal and financial examples. Finally, we will define the limits of the approach.

In [10]:
def summarize(text, ml):
  # The context text or ground truth is then stripped of the \n characters
  preprocess_text = text.strip().replace("\n", "")
  # We then apply the innovative T5 task prefix "summarize" to the input text
  t5_prepared_Text = "summarize: " + preprocess_text
  # We can display the processed (stripped) and prepared text (task prefix)
  print("Preprocessed and prepared text: \n", t5_prepared_Text)

  # The text is now encoded to tokens IDs and returns them as torch tensors
  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)

  # The encoded text is ready to be sent to the model to generate a summary
  summary_ids = model.generate(tokenized_text, 
                               num_beams=4,
                               no_repeat_ngram_size=2,
                               min_length=30,
                               max_length=ml,
                               early_stopping=True)
  
  # The generated output is now decoded with the tokenizer
  output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

  return output
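
Note that replacing `\n` with an empty string joins the last word of each line to the first word of the next, which is why fused words such as `Etextreleased` appear in the preprocessed text printed by the cells that follow. A possible alternative, shown here as a sketch rather than the notebook's method, collapses every run of whitespace into a single space. `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


sample = """
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.
"""

# Original approach: words on either side of a line break are fused
print(sample.strip().replace("\n", ""))

# Alternative: words on either side of a line break stay separate
print(normalize_whitespace(sample))
```

Swapping this into `summarize` would change the input the model sees, so the summaries may differ from the outputs recorded in this notebook.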

Let's now experiment with the T5 model with a general topic.

### A general topic sample

In this subsection, we will run a text written by Project Gutenberg through the T5 model. We will use the sample to run a test on our summarizing function. You can copy and paste any other text you wish or load a text by adding some code.

The goal of the program is to run a few samples to see how T5 works. The input text is the beginning of the Project Gutenberg e-book containing the
**Declaration of Independence of the United States of America**:

In [11]:
text = """
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971. The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval. The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%. Two tape backups were kept plus one on
paper tape. The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""

We then call our summarize function and send the text we want to summarize and
the maximum length of the summary:

In [12]:
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)

Number of characters: 530
Preprocessed and prepared text: 
 summarize: The United States Declaration of Independence was the first Etextreleased by Project Gutenberg, early in 1971. The title was storedin an emailed instruction set which required a tape or diskpack behand mounted for retrieval. The diskpack was the size of a largecake in a cake carrier, cost $1500, and contained 5 megabytes, ofwhich this file took 1-2%. Two tape backups were kept plus one onpaper tape. The 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in 2001.


Summarized text: 
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in


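The output shows that we sent 530 characters, the original text (ground truth) that was preprocessed, and the summary (prediction). The summary ends mid-sentence, most likely because generation was capped at `max_length=50` tokens.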

### The Bill of Rights sample

The following sample, taken from the Bill of Rights, is more difficult because it expresses the precise rights of a person.

In [13]:
text = """
No person shall be held to answer for a capital, or otherwise infamous
crime,
unless on a presentment or indictment of a Grand Jury,exceptin cases
arising
in the land or naval forces, or in the Militia, when in actual service
in time of War or public danger; nor shall any person be subject for
the same offense to be twice put in jeopardy of life or limb;
nor shall be compelled in any criminal case to be a witness against
himself,
nor be deprived of life, liberty, or property, without due process of
law;
nor shall private property be taken for public use without just
compensation.
"""

print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)

Number of characters: 588
Preprocessed and prepared text: 
 summarize: No person shall be held to answer for a capital, or otherwise infamouscrime,unless on a presentment or indictment of a Grand Jury,exceptin casesarisingin the land or naval forces, or in the Militia, when in actual servicein time of War or public danger; nor shall any person be subject forthe same offense to be twice put in jeopardy of life or limb;nor shall be compelled in any criminal case to be a witness againsthimself,nor be deprived of life, liberty, or property, without due process oflaw;nor shall private property be taken for public use without justcompensation.


Summarized text: 
 no person shall be held to answer for a capital, or otherwise infamouscrime unless ona presentment or indictment ofa Grand Jury. nor shall any person be subject for the same offense to be twice put


This sample is significant because it shows the limits that any transformer model, or any other NLP model, encounters when processing a text such as this one. We cannot just present samples that always work and make a user believe that transformers, no matter how innovative they are, have solved all of the NLP challenges we face.

Maybe we should have provided a longer text to summarize, used other parameters,
used a larger model, or changed the structure of the T5 model. However, no matter how hard you try to summarize a difficult text with an NLP model, you will always find documents that the model fails to summarize.

When a model fails on a task, we need to be humble and admit it. The SuperGLUE
human baseline is a difficult one to beat. We need to be patient, work harder, and improve transformer models until they can perform better than they do today. There is still room for a lot of progress.

### A corporate law sample

Corporate law contains many legal subtleties, which makes a summarizing task
quite tricky.

The input of this sample is an excerpt of the corporate law in the state of Montana, USA:

In [14]:
#Montana Corporate Law
#https://corporations.uslegal.com/state-corporation-law/montanacorporation-law/#:~:text=Montana%20Corporation%20Law,carrying%20out%20its%20business%20activities.
text = """
The law regarding corporations prescribes that a corporation
can be incorporated in the state of Montana to serve any lawful
purpose. In the state of Montana, a corporation has all the powers
of a natural person for carrying out its business activities. The
corporation can sue and be sued in its corporate name. It has
perpetual succession. The corporation can buy, sell or otherwise
acquire an interest in a real or personal property. It can conduct
business, carry on operations, and have offices and exercise the powers
in a state, territory or district in possession of the U.S., or in a
foreign country. It can appoint officers and agents of the corporation
for various duties and fix their compensation.
The name of a corporation must contain the word "corporation" or
its abbreviation "corp." The name of a corporation should not be
deceptively similar to the name of another corporation incorporated
in the same state. It should not be deceptively identical to the
fictitious name adopted by a foreign corporation having business
transactions in the state.
The corporation is formed by one or more natural persons by executing
and filing articles of incorporation to the secretary of state of
filing. The qualifications for directors are fixed either by articles
of incorporation or bylaws. The names and addresses of the initial
directors and purpose of incorporation should be set forth in the
articles of incorporation. The articles of incorporation should
contain the corporate name, the number of shares authorized to issue,
a brief statement of the character of business carried out by the
corporation, the names and addresses of the directors until successors
are elected, and name and addresses of incorporators. The shareholders
have the power to change the size of board of directors.
"""

print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)

Number of characters: 1805
Preprocessed and prepared text: 
 summarize: The law regarding corporations prescribes that a corporationcan be incorporated in the state of Montana to serve any lawfulpurpose. In the state of Montana, a corporation has all the powersof a natural person for carrying out its business activities. Thecorporation can sue and be sued in its corporate name. It hasperpetual succession. The corporation can buy, sell or otherwiseacquire an interest in a real or personal property. It can conductbusiness, carry on operations, and have offices and exercise the powersin a state, territory or district in possession of the U.S., or in aforeign country. It can appoint officers and agents of the corporationfor various duties and fix their compensation.The name of a corporation must contain the word "corporation" orits abbreviation "corp." The name of a corporation should not bedeceptively similar to the name of another corporation incorporatedin the same state. It should not 

This time, T5 found some of the essential aspects of the text to summarize. Take
some time to try samples of your own to see what happens. Play
with the parameters to see if they affect the outcome.
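
One way to organize such experiments is to enumerate a small grid of generation settings and run the summarizer once per combination. The sketch below only builds the grid. The specific values are arbitrary examples, and `build_parameter_grid` is a hypothetical helper. Each resulting dictionary could be passed to `model.generate` as keyword arguments.

```python
from itertools import product


def build_parameter_grid(options):
    """Return one dict per combination of the given option values."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]


grid = build_parameter_grid({
    "num_beams": [2, 4],
    "no_repeat_ngram_size": [2, 3],
    "max_length": [50, 100],
})

print(len(grid))  # 8
print(grid[0])    # {'num_beams': 2, 'no_repeat_ngram_size': 2, 'max_length': 50}
```

Keeping the grid small matters here, since each combination requires a full beam search pass on CPU.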