<a href="https://colab.research.google.com/github/tilaboy/nlp_transformer_tutorial/blob/main/learning_notes/ch6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install datasets --quiet
!pip install transformers --quiet
!pip install tensorflow --quiet
!pip install pandas --quiet
!pip install numpy --quiet
!pip install seqeval --quiet
!pip install torch --quiet
!pip install scikit-learn --quiet
!pip install nltk --quiet
!pip install matplotlib --quiet
!pip install sacrebleu --quiet
!pip install rouge_score --quiet
!pip install sentencepiece --quiet

In [3]:
from datasets import load_dataset, DatasetDict, load_metric
from datasets import get_dataset_config_names, concatenate_datasets
from collections import defaultdict, Counter
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import nltk 
from nltk.tokenize import sent_tokenize
from transformers import pipeline, set_seed, DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm import tqdm


In [3]:
dataset = load_dataset("cnn_dailymail", version="3.0.0")
print(f"Features: {dataset['train'].column_names}") 

Using custom data configuration default
Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)



Features: ['article', 'highlights', 'id']


In [4]:
sample = dataset["train"][1] 
print(f""" Article (excerpt of 500 characters, total length: {len(sample["article"])}): """) 
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])


 Article (excerpt of 500 characters, total length: 4051): 

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


In [5]:
sample_text = dataset["train"][1]["article"][:2000] 
# We'll collect the generated summaries of each model in a dictionary 
summaries = {}
nltk.download("punkt") 
string = "The U.S. are a country. The U.N. is an organization." 
print(f'{string} => {sent_tokenize(string)}\n')
for sent in sent_tokenize(sample_text):
    print(sent)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The U.S. are a country. The U.N. is an organization. => ['The U.S. are a country.', 'The U.N. is an organization.']

Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."
Here, inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court.
Most often, they face drug charges or charges of assaulting an officer --charges that Judge Steven Leifman says are usually "avoidable felonies."
He says the arrests often result from confront
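
The abbreviation test in the cell above hints at why the notebook uses NLTK's `sent_tokenize` rather than a plain string split. As a minimal sketch using only the standard library (the variable names here are illustrative, not part of the notebook), splitting on a period followed by a space breaks the same string at every abbreviation:

```python
# Naive sentence splitting: break on every ". " sequence.
# This is only an illustration of the failure mode, not a recommended tokenizer.
string = "The U.S. are a country. The U.N. is an organization."

naive_sentences = string.split(". ")
print(naive_sentences)
# ['The U.S', 'are a country', 'The U.N', 'is an organization.']
```

The naive split yields four fragments where `sent_tokenize` returned two sentences, which matters because the three-sentence baseline depends on correct sentence boundaries.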

In [6]:
def three_sentence_summary(text): 
    return "\n".join(sent_tokenize(text)[:3]) 
summaries["baseline"] = three_sentence_summary(sample_text)


In [7]:
set_seed(42) 
pipe = pipeline("text-generation", model="gpt2-medium") 
gpt2_query = sample_text + "\nTL;DR:\n" 
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)

summaries["gpt2"] = "\n".join( sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))
print(f'Summary:\n{summaries["gpt2"]}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary:
MIAMI-DADE COUNTY JAIL: Forgotten jail
Posted by CNN at 7:38pm


In [8]:
pipe = pipeline("summarization", model="t5-small") 
pipe_out = pipe(sample_text) 
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
print(f'Summary:\n{summaries["t5"]}')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Summary:
inmates with most severe mental illnesses are incarcerated until they're ready to appear in court .
most often, they face drug charges or charges of assaulting an officer .
they end up on the ninth floor severely mentally disturbed, but not getting real help .


In [9]:
pipe = pipeline("summarization", model="facebook/bart-base") 
pipe_out = pipe(sample_text) 
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
print(f'Summary:\n{summaries["bart"]}')

Summary:
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "Forgotten floor."
Here, inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court.
Most often, they


In [10]:
pipe = pipeline("summarization", model="google/pegasus-xsum") 
pipe_out = pipe(sample_text) 
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")
print(f'Summary:\n{summaries["pegasus"]}')

Summary:
Mentally ill inmates in Miami are housed on the "forgotten floor"<n>The ninth floor is where they're held until they're ready to appear in court.
Most often, they face drug charges or charges of assaulting an officer.
They end up on the ninth floor severely mentally disturbed .
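
In the output above, one `<n>` marker survives because the `replace(" .<n>", ".\n")` call only matches markers preceded by a space and a period. One possible way to handle every marker is a regular expression; the sketch below is an assumption about what a more tolerant cleanup could look like, and both `clean_pegasus_summary` and the `raw` string are hypothetical examples rather than actual model output.

```python
import re

def clean_pegasus_summary(text):
    """Turn every <n> marker into a newline and tidy ' .' into '.'."""
    # Replace each marker (and any whitespace around it) with a newline.
    text = re.sub(r"\s*<n>\s*", "\n", text)
    # Remove spaces or tabs that precede a period.
    text = re.sub(r"[ \t]+\.", ".", text)
    return text.strip()

# Hypothetical raw summary with both kinds of marker context.
raw = 'Inmates are housed on the "forgotten floor"<n>They wait for trial .<n>Most face drug charges .'
cleaned = clean_pegasus_summary(raw)
print(cleaned)
```

Each marker becomes a line break regardless of what precedes it, so the cleaned text can be passed to the metrics with one sentence per line.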


In [11]:
print("GROUND TRUTH") 
print(dataset["train"][1]["highlights"]) 
print("") 
for model_name in summaries: 
    print(model_name.upper()) 
    print(summaries[model_name]) 
    print("")

GROUND TRUTH
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .

BASELINE
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."

GPT2
MIAMI-DADE COUNTY JAIL: Forgotten jail
Posted by CNN at 7:38pm

T5
inmates with most severe mental illnesses are incarcerated until they're ready to appear in court .
most often, they face drug charges or charges of as

In [12]:
bleu_metric = load_metric("sacrebleu")
rouge_metric = load_metric("rouge")
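
To make the BLEU number easier to interpret, here is a rough standard-library sketch of the idea behind it: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization and a single reference, and it applies no smoothing; SacreBLEU uses its own tokenizer and options, so `simple_bleu` and `clipped_precision` are illustrative helpers and will not reproduce its scores exactly.

```python
import math
from collections import Counter

def clipped_precision(pred_tokens, ref_tokens, n):
    """Share of predicted n-grams found in the reference, with counts clipped."""
    pred = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram (the "clipping").
    return sum((pred & ref).values()) / total

def simple_bleu(prediction, reference, max_n=4):
    pred, ref = prediction.split(), reference.split()
    precisions = [clipped_precision(pred, ref, n) for n in range(1, max_n + 1)]
    # Without smoothing, a single zero precision zeroes the geometric mean.
    if min(precisions) == 0:
        return 0.0
    brevity_penalty = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat"
print(clipped_precision("the the the the the the".split(), reference.split(), 1))
print(simple_bleu("the the the the the the", reference))  # 0.0
print(simple_bleu(reference, reference))                  # 100.0
```

The repeated-word prediction gets a unigram precision of only 2/6 because "the" appears twice in the reference, and its overall score drops to zero as soon as no bigram matches.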

In [13]:
gold_ref = [dataset["train"][1]["highlights"]]
records = [] 
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
for model_name in summaries: 
    bleu_metric.add(prediction=summaries[model_name], reference=gold_ref)
    rouge_metric.add(prediction=summaries[model_name], reference=gold_ref)
    score_bleu = bleu_metric.compute(smooth_method="floor", smooth_value=0)
    score_rouge = rouge_metric.compute()
    score_dict = dict((rn, score_rouge[rn].mid.fmeasure) for rn in rouge_names)
    score_dict['bleu'] = score_bleu['score']
    records.append(score_dict)
pd.DataFrame.from_records(records, index=summaries.keys())


            rouge1    rouge2    rougeL  rougeLsum       bleu
baseline  0.365079  0.145161  0.206349   0.238095   8.379861
gpt2      0.098361  0.000000  0.098361   0.098361   0.000000
t5        0.195652  0.022222  0.108696   0.152174   0.000000
bart      0.326531  0.124138  0.176871   0.231293   6.709689
pegasus   0.323232  0.206186  0.282828   0.303030  22.382122
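
The ROUGE columns report an F-measure over n-gram overlap between prediction and reference. The following sketch shows that calculation for ROUGE-N under simplifying assumptions: lowercased whitespace tokens, no stemming, and a single reference. The `rouge_score` package tokenizes differently, so the hypothetical `rouge_n` helper illustrates the formula only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """F-measure of n-gram overlap between a prediction and one reference."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())  # overlap relative to the prediction
    recall = overlap / sum(ref.values())      # overlap relative to the reference
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(round(rouge_n(prediction, reference, n=1), 4))  # 0.8333
print(round(rouge_n(prediction, reference, n=2), 4))  # 0.6
```

Changing one word costs a single unigram match but two bigram matches, which is why ROUGE-2 is consistently lower than ROUGE-1 in the table.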


In [14]:
def evaluate_summaries_baseline(dataset, metric, column_text="article", column_summary="highlights"): 
    summaries = [three_sentence_summary(text) for text in dataset[column_text]] 
    metric.add_batch(predictions=summaries, references=dataset[column_summary]) 
    score = metric.compute() 
    return score

In [1]:
test_sampled = dataset["test"].shuffle(seed=42).select(range(100)) 
score = evaluate_summaries_baseline(test_sampled, rouge_metric) 
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["baseline"]).T



In [16]:
def chunks(list_of_elements, batch_size): 
    """Yield successive batch-sized chunks from list_of_elements.""" 
    for i in range(0, len(list_of_elements), batch_size): 
        yield list_of_elements[i : i + batch_size]

device = "cuda" if torch.cuda.is_available() else "cpu" 

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer, batch_size=16, device=device, column_text="article", column_summary="highlights"): 
    article_batches = list(chunks(dataset[column_text], batch_size)) 
    target_batches = list(chunks(dataset[column_summary], batch_size)) 
    for article_batch, target_batch in tqdm( zip(article_batches, target_batches), total=len(article_batches)): 
        inputs = tokenizer(article_batch, max_length=1024, truncation=True, padding="max_length", return_tensors="pt") 
        summaries = model.generate(input_ids=inputs["input_ids"].to(device), 
                                   attention_mask=inputs["attention_mask"].to(device), 
                                   length_penalty=0.8, 
                                   num_beams=8, 
                                   max_length=128)
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries] 
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries] 
        metric.add_batch(predictions=decoded_summaries, references=target_batch) 
    score = metric.compute() 
    return score
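
As a quick check of the batching helper, the sketch below repeats the `chunks` generator from the cell above and applies it to a small list of placeholder strings (the `articles` list is made up for illustration). The last batch is simply shorter when the list length is not a multiple of `batch_size`.

```python
import math

def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

# Placeholder data standing in for a column of articles.
articles = [f"article {i}" for i in range(10)]
batches = list(chunks(articles, 4))
print([len(batch) for batch in batches])  # [4, 4, 2]

# The number of batches is the ceiling of len / batch_size.
print(math.ceil(100 / 8))  # 13 batches for 100 examples with batch_size=8
```

Because articles and summaries are chunked with the same `batch_size`, zipping the two batch lists keeps each article aligned with its reference summary.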


In [None]:
model_ckpt = "google/pegasus-xsum" 
tokenizer = AutoTokenizer.from_pretrained(model_ckpt) 
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(test_sampled, rouge_metric, model, tokenizer, batch_size=8) 
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names) 
pd.DataFrame(rouge_dict, index=["pegasus"])



In [None]:
dataset_samsum = load_dataset("samsum") 
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
print(f"Split lengths: {split_lengths}") 
print(f"Features: {dataset_samsum['train'].column_names}") 
print("\nDialogue:") 
print(dataset_samsum["test"][0]["dialogue"]) 
print("\nSummary:") 
print(dataset_samsum["test"][0]["summary"])
