<a href="https://colab.research.google.com/github/sabre-code/text-summarisation/blob/main/text_summarisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarisation

#### PACKAGE INSTALLATION

In [2]:
!pip install transformers[torch]
!pip install datasets
!pip install sentencepiece
!pip install evaluate
!pip install sacrebleu
!pip install rouge_score



#### LOADING DATASET

In [2]:
from datasets import load_dataset

# "3.0.0" is the dataset configuration name; recent `datasets` releases expect it as the second argument.
dataset = load_dataset("cnn_dailymail", "3.0.0")

In [None]:
print(f"Features: {dataset['train'].column_names}")

Features: ['article', 'highlights', 'id']


In [None]:
sample = dataset['train'][1]

In [None]:
print(f'''
Article of 500 chars, total length : {len(sample['article'])}
      ''')
print(sample['article'][:500])
print(f"\n Summary (length : {len(sample['highlights'])})")
print(sample['highlights'])


Article of 500 chars, total length : 4051
      
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

 Summary (length : 281)
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


In [None]:
sample_text = dataset["train"][1]['article'][:2000]
print(sample_text)
summaries = {}

Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court. Most often, they face drug charges or charges of assaulting an officer --charges that Judge Steven Leifman says are usually "avoidable felonies." He says the arrests often result from confrontations with police. Mentally ill people often won't do what they're told when police arrive on the scene -- confrontation seems to exacerbate their illness and they become more paranoid, delusional, and less likely to follow dir

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
string = "This is first sentence. This is second sentence."
sent_tokenize(string)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This is first sentence.', 'This is second sentence.']

In [None]:
def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)
print(summaries)

{'baseline': 'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.\nHere, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.\nMIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."'}


## **GPT-2 Large**

In [None]:
from transformers import pipeline, set_seed

set_seed(1)
# pipe = pipeline("text-generation", model="gpt2-large")
# gpt2_query = sample_text + "\nTL;DR:\n"
# pipe_out = pipe(gpt2_query,  max_length=512, clean_up_tokenization_spaces = True)
# summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))

In [None]:
#summaries["gpt2"]

## **T5 Model**

In [None]:
pipe = pipeline("summarization", model="t5-large")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
summaries["t5"]

'mentally ill inmates are housed on the ninth floor of a florida jail .\nmost face drug charges or charges of assaulting an officer .\njudge says arrests often result from confrontations with police .\none-third of all people in Miami-dade county jails are mental ill .'

## **BART - facebook/bart-large-cnn**

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)

In [None]:
summaries['bart'] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
summaries['bart']

'Mentally ill inmates are housed on the "forgotten floor" of Miami-Dade jail.\nMost often, they face drug charges or charges of assaulting an officer.\nJudge Steven Leifman says the arrests often result from confrontations with police.\nHe says about one-third of all people in the county jails are mentally ill.'

## **PEGASUS MODEL**

In [None]:
pipe = pipeline("summarization",model= "google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)

summaries["pegasus"] = pipe_out[0]["summary_text"]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
summaries['pegasus']

'Mentally ill inmates in Miami are housed on the "forgotten floor"<n>The ninth floor is where they\'re held until they\'re ready to appear in court .<n>Most often, they face drug charges or charges of assaulting an officer .<n>They end up on the ninth floor severely mentally disturbed .'
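
Unlike the T5 and BART entries, which were joined with `\n`, the PEGASUS summary above still contains the literal marker `<n>` between sentences. Below is a minimal sketch of normalising it, assuming the marker always appears as the exact string `<n>`. The helper name `pegasus_to_lines` is hypothetical and not part of `transformers`.

```python
def pegasus_to_lines(summary: str) -> str:
    """Replace the literal "<n>" sentence markers with newlines."""
    # Split on the marker and trim stray whitespace around each sentence.
    return "\n".join(part.strip() for part in summary.split("<n>"))


# Shortened version of the PEGASUS output shown above.
example = (
    'Mentally ill inmates in Miami are housed on the "forgotten floor"'
    "<n>Most often, they face drug charges or charges of assaulting an officer ."
)
print(pegasus_to_lines(example))
```

Applying this before storing `summaries["pegasus"]` would keep all entries in `summaries` in the same one-sentence-per-line layout.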

In [None]:
import evaluate
bleu_metric = evaluate.load("sacrebleu")

## Loading CNN-Dailymail Dataset for Fine-tuning

In [3]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", version='3.0.0',split="train")

In [4]:
dataset

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [4]:
import torch

class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item
    def __len__(self):
        return len(self.labels['input_ids'])

## Tokenizing Data for training

In [5]:
from transformers import AutoTokenizer

checkpoint = "google/pegasus-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
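def preprocess_function(examples):
    # Tokenize the articles (model inputs) and the highlights (target summaries).
    # The CNN/DailyMail columns are 'article' and 'highlights'.
    model_inputs = tokenizer(examples["article"], max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs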

In [6]:


def tokenize_data(texts, labels):

  encodings = tokenizer(texts, truncation=True, padding=True)
  decodings = tokenizer(labels, truncation=True, padding=True)
  dataset_tokenized = PegasusDataset(encodings, decodings)
  return dataset_tokenized


train_texts, train_labels = dataset['article'][:1000], dataset['highlights'][:1000]
train_dataset = tokenize_data(train_texts, train_labels)

test_texts, test_labels = dataset['article'][1000:2000], dataset['highlights'][1000:2000]
test_dataset = tokenize_data(test_texts, test_labels)

In [27]:
dataset['article'][5000]

'LOS ANGELES, California (CNN) -- About 1.6 million fans registered for a chance at fewer than 9,000 pairs of tickets to Michael Jackson\'s memorial service next week, organizers said. Some memorial tickets went out to "friends and family" on Sunday. Registration ended at 6 p.m. Saturday. Officials will now "scrub" all entries to eliminate duplicates and those they suspect may have been registered using software that ticket scalpers use to generate multiple hits. A random drawing will follow. The winning 8,750 registrants will receive an e-mail Sunday after 11 a.m. (2 p.m. ET), AEG Live said. "I know I\'ll be hitting the \'refresh\' button on my inbox over and over again," said Jackie Flower, an arts student in San Diego, California. The e-mail will assign the selected registrants a unique code and direct them to a designated distribution center away from the Staples Center. There, they will each receive two tickets to either the memorial service at the Staples Center arena or a simulc

{'input_ids': tensor([8088,  131,  116,  ...,    0,    0,    0]),
 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]),
 'labels': tensor([11390,   445,  5313, 19105,   115,  4977,   127, 12771,   124,   109,
           198, 67262,  1030,   194,  8260,  8950, 86580,  1121,   649,   205,
           127,   186,   130,   114,   711,   113,   198, 55197,  1431, 67429,
           194,  1041, 11869,  4593,  1944,   108,  1532, 45259,   151,   198,
           187,   346,   109,  1601,   113,   109,  1977,   194, 86580,  1121,
           649,   109,   327,   117, 34348,   111,   178,   131,   116,  3780,
           118,   411,   110,   107,     1,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0])}
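
In the tensors above, `labels` is padded with token id 0, the same id used to pad `input_ids`. PyTorch's cross-entropy loss ignores targets equal to -100 by default, so a common convention is to replace padded label positions with -100 before training. The sketch below illustrates that replacement on plain Python lists. The helper name `mask_label_padding` and the default `pad_token_id=0` are assumptions for illustration, and the helper is not used elsewhere in this notebook.

```python
IGNORE_INDEX = -100  # default `ignore_index` of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id=0):
    """Return a copy of `label_ids` with padding positions set to -100."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in row]
        for row in label_ids
    ]


# Two short, made-up label rows padded with 0 to the same length.
labels = [[11390, 445, 1, 0, 0], [198, 67262, 1030, 1, 0]]
print(mask_label_padding(labels))
```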

In [7]:
from transformers import DataCollatorForSeq2Seq

# `model` is optional and expects a model instance, not a checkpoint name, so it is omitted here.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)

## Evaluation

In [8]:
import evaluate

rouge = evaluate.load("rouge")
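
To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1 F1 as clipped unigram overlap between a prediction and a reference. It assumes whitespace tokenisation and lowercasing only, so it will not reproduce the scores from `evaluate.load("rouge")`, which applies its own tokenisation and optional stemming. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "inmates are housed on the ninth floor",
    "inmates are housed on the forgotten floor",
)
print(round(score, 4))
```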

In [28]:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

## Training

In [16]:
#del model

In [18]:
#torch.cuda.empty_cache()

In [19]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

# Freeze the encoder so that only the remaining parameters are updated during fine-tuning.
for param in model.model.encoder.parameters():
    param.requires_grad = False

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:


training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-large-cnn-dailymail",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16 = True,
    logging_steps = 16,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Step | Training Loss |
|------|---------------|
| 16 | 7.1158 |
| 32 | 6.9799 |
| 48 | 6.995 |
| 64 | 6.9018 |
| 80 | 6.6608 |
| 96 | 6.8094 |
| 112 | 6.5426 |


TrainOutput(global_step=125, training_loss=6.848696716308594, metrics={'train_runtime': 125.7975, 'train_samples_per_second': 7.949, 'train_steps_per_second': 0.994, 'total_flos': 2889464414208000.0, 'train_loss': 6.848696716308594, 'epoch': 1.0})
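
The `global_step=125` and throughput figures in the output above follow from the training configuration. The short check below reproduces them, assuming a single device and no gradient accumulation (neither is set explicitly in the arguments above).

```python
import math

num_train_examples = 1000    # dataset['article'][:1000]
batch_size = 8               # per_device_train_batch_size
num_epochs = 1               # num_train_epochs
logging_steps = 16           # logging_steps
train_runtime_s = 125.7975   # 'train_runtime' reported in TrainOutput

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

# The last logged step is the largest multiple of logging_steps within total_steps.
last_logged_step = (total_steps // logging_steps) * logging_steps

print(total_steps, last_logged_step)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The last logged step comes out as 112 because the next multiple of 16 (128) exceeds the 125 steps in the run, which matches the final row of the loss table.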

In [17]:
# del trainer  # frees memory; run only after trainer.push_to_hub() below, which still needs `trainer`

In [26]:
from huggingface_hub import notebook_login

notebook_login()
trainer.push_to_hub()

'https://huggingface.co/sabre-code/pegasus-large-cnn-dailymail/tree/main/'