<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Text%20Summarization%20using%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Text Summarization using BERT**

**Summarization** creates a shorter version of a document or article that captures its most important information. Like translation, it can be formulated as a sequence-to-sequence task. Summarization can be:

- **Extractive**: extract the most relevant information from a document.
- **Abstractive**: generate new text that captures the most relevant information.
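
To make the extractive case concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and returns the top-scoring sentences unchanged. The function name `extractive_summary`, the scoring rule, and the sample text are illustrative assumptions, not part of this notebook's pipeline, which fine-tunes an abstractive model instead.

```python
import re
from collections import Counter


def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The bill amends the tax code. "
    "The bill lets a taxpayer assign a tax credit. "
    "Lunch was served at noon."
)
print(extractive_summary(doc, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model may produce wording that never appears in the source.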

**Load Dataset**

In [1]:
%pip install -qq datasets

In [2]:
from datasets import load_dataset
billsum = load_dataset("billsum", split="ca_test")


In [3]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

**Train and Test Split**

In [4]:
billsum = billsum.train_test_split(test_size=0.2)

In [5]:
billsum["train"][0]

{'summary': 'The Corporation Tax Law allows various credits against the taxes imposed by that law. That law allows, for each taxable year beginning on or after July 1, 2008, any credit that is an eligible credit, as defined, to be assigned to any eligible assignee, as defined.\nThis bill would make technical, nonsubstantive changes to this provision.',
 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 23663 of the Revenue and Taxation Code is amended to read:\n23663.\n(a) (1) Notwithstanding any other law\nto the contrary\n, for each taxable year beginning on or after July 1, 2008, any credit allowed to a taxpayer under this chapter that is an eligible credit may be assigned by that taxpayer to any eligible assignee.\n(2) A credit assigned under paragraph (1) may\nonly\nbe applied by the eligible assignee\nonly\nagainst the\n“tax” (as\n“tax,” as\ndefined in Section\n23036)\n23036,\nof the eligible assignee in a taxable year beginning on or af

**Preprocess**

The preprocessing function needs to:

- Prefix the input with a prompt so `T5` knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
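- Tokenize the summaries inside the `as_target_tokenizer()` context manager so they are encoded as targets (labels) rather than as inputs. Recent `transformers` releases deprecate this context manager in favor of the tokenizer's `text_target` argument.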
- Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.
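
The three steps above can be sketched without downloading a model. The snippet below uses a hypothetical whitespace tokenizer (`toy_tokenize`) as a stand-in for the real one, and the small length limits and the trailing space in the prompt are assumptions chosen for illustration. It only shows the shape of the data: a batch goes in as a dict of lists, and prefixed, truncated inputs and truncated labels come out.

```python
PREFIX = "summarize: "  # task prompt; the trailing space separates it from the document


def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a real tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]


def toy_preprocess(examples, max_input=8, max_target=4):
    # `examples` is a batch: each key maps to a list with one value per row.
    inputs = [PREFIX + doc for doc in examples["text"]]
    return {
        "input_tokens": [toy_tokenize(doc, max_input) for doc in inputs],
        "label_tokens": [toy_tokenize(s, max_target) for s in examples["summary"]],
    }


batch = {
    "text": ["Section 1 of the code is amended to read as follows and so on"],
    "summary": ["This bill makes technical changes to the code"],
}
out = toy_preprocess(batch)
print(out["input_tokens"][0])
print(out["label_tokens"][0])
```

Note that the function expects lists under `"text"` and `"summary"`. Passing a single row, where those values are plain strings, would make the list comprehension iterate over individual characters.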

In [6]:
%pip install -qq transformers

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [8]:
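prefix = "summarize: "  # task prompt for T5; the trailing space separates it from the text

def preprocess_function(examples):
  # `examples` is a batch: examples["text"] is a list of documents
  inputs = [prefix + doc for doc in examples["text"]]
  model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

  # Tokenize the summaries as targets (labels)
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples["summary"], max_length=128, truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs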

In [9]:
# Slice so the function receives a batch (a dict of lists). Indexing a single
# row with [0] passes plain strings, which get tokenized character by character.
pp = preprocess_function(billsum["train"][:1])
print(pp["input_ids"])
print(pp["attention_mask"])
print(pp["labels"])

In [10]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Use `DataCollatorForSeq2Seq` to create a batch of examples. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.
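
As a rough illustration of dynamic padding, the sketch below pads one batch to the length of its own longest sequence in plain Python. The helper `pad_batch` and the token IDs are made up for this example. The pad values are assumptions that mirror common defaults: `0` is T5's pad token ID, and `-100` is the label padding value that the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence in one batch to that batch's longest sequence."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


features = [
    {"input_ids": [21603, 10, 37, 1], "labels": [100, 1]},
    {"input_ids": [21603, 10, 1], "labels": [100, 200, 300, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because the target length is computed per batch, a batch of short documents is padded far less than it would be if every example were padded to a fixed global maximum.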

In [11]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, return_tensors="tf")

**Fine-tune**

To **fine-tune** a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with `to_tf_dataset`. Specify inputs and labels in columns, whether to shuffle the dataset order, batch size, and the data collator.

In [12]:
tf_train_set = tokenized_billsum["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = tokenized_billsum["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

**Optimizer**

In [13]:
from transformers import AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

**Model**

In [14]:
from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


**Compile**

In [15]:
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


**Fit**

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=1)