<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/summarization/T5_base_Finetune_multi_news_summarization_v2_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization

Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

- Extractive: extract the most relevant information from a document.
- Abstractive: generate new text that captures the most relevant information.
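
To make the distinction concrete, here is a minimal sketch of the extractive approach. It is a toy word-frequency heuristic written only for illustration (the function name and the scoring rule are assumptions, not something this notebook uses): it copies existing sentences verbatim, whereas the abstractive model fine-tuned below generates new text.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences whose words are most frequent."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average document frequency of the words in the sentence
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Pick the top-scoring sentences, then emit them in their original order
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


toy_text = (
    "The cat sat on the mat. "
    "The cat chased the mouse around the mat. "
    "Dinner was served late."
)
print(extractive_summary(toy_text, num_sentences=1))
```

The output is always a sentence copied from the input, which is the defining property of extractive summarization.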

https://huggingface.co/docs/transformers/tasks/summarization



### Dataset --> multi_news dataset for summarization
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.

There are two features:

- `document`: text of the news articles, separated by the special token "|||||".
- `summary`: the news summary.
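
Because each `document` concatenates several source articles, it can be useful to split it back into individual articles. A minimal sketch, assuming only the "|||||" separator described above (the helper name `split_articles` and the toy string are hypothetical):

```python
from typing import List

SEPARATOR = "|||||"  # special token that separates source articles


def split_articles(document: str) -> List[str]:
    """Split a Multi-News `document` field into its individual source articles."""
    return [part.strip() for part in document.split(SEPARATOR) if part.strip()]


# Toy stand-in for dataset['train'][i]['document']
toy_document = "First article text. ||||| Second article text. ||||| Third article text."
articles = split_articles(toy_document)
print(len(articles))
for article in articles:
    print(article)
```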


@misc{alex2019multinews,
    title={Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model},
    author={Alexander R. Fabbri and Irene Li and Tianwei She and Suyi Li and Dragomir R. Radev},
    year={2019},
    eprint={1906.01749},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


Other datasets:

https://huggingface.co/datasets/ccdv/pubmed-summarization

https://huggingface.co/datasets/samsum


In [3]:
# Transformers installation
! pip install -q --disable-pip-version-check py7zr sentencepiece loralib peft trl
! pip install -q    wandb bitsandbytes
! pip install datasets evaluate rouge_score -q
! pip install transformers[torch] -q
! pip install accelerate -U -q
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



1. Finetune [Flan-T5 base](https://huggingface.co/google/flan-t5-base) on the Multi-News [multi_news](https://huggingface.co/datasets/multi_news) dataset for abstractive summarization.
2. Use your finetuned model for inference.

<Tip>
Model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [Blenderbot](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot-small), [Encoder decoder](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/encoder-decoder), [FairSeq Machine-Translation](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fsmt), [GPTSAN-japanese](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptsan-japanese), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LongT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longt5), [M2M100](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/m2m_100), [Marian](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/marian), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mt5), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [NLLB](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb), [NLLB-MOE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb-moe), [Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus), [PEGASUS-X](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus_x), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/prophetnet), [SwitchTransformers](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/switch_transformers), 
[T5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/t5), [XLM-ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

</Tip>


In [5]:
# Standard library
import argparse
import os
from functools import partial

# Third-party libraries (duplicate imports removed)
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
import wandb
from datasets import load_dataset
from google.colab import userdata
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from torch import cuda, bfloat16
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, \
    BitsAndBytesConfig, DataCollatorForLanguageModeling

In [6]:
from google.colab import output
output.enable_custom_widget_manager()

In [7]:
PROJECT = "T5-base-Summarization"
MODEL_NAME = "google/flan-t5-base"
DATASET = "multi_news"

In [None]:


wandb_key = userdata.get('WANDB')
wandb.login(key=wandb_key)

wandb.init(project=PROJECT,  # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes=f"Fine tuning {MODEL_NAME} with the {DATASET} dataset. Text Summarization")

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: olonok (olonok69). Use `wandb login --relogin` to force relogin


In [8]:
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device


'cuda:0'

## Load multi_news dataset

https://huggingface.co/datasets/multi_news

In [9]:
from datasets import load_dataset

dataset  = load_dataset("multi_news")

The dataset already ships with train, validation, and test splits, so there is no need to call [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split):

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})

In [11]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"test dataset size: {len(dataset['test'])}")
print(f"Validation dataset size: {len(dataset['validation'])}")

Train dataset size: 44972
test dataset size: 5622
Validation dataset size: 5622


Then take a look at an example:

In [12]:
dataset['train'][100]['document']

'Katy Perry is all about breaking conventional beauty rules, from her love of everything technicolor and coated in glitter, to her no-brows, black lipstick Met Gala look. So, of course, the pop star — and face of CoverGirl — was the perfect person to help announce that the beauty brand has named its first-ever male CoverGirl, social media star James Charles. \n \n According to a press release from the brand, all CoverGirls “are role models and boundary-breakers, fearlessly expressing themselves, standing up for what they believe, and redefining what it means to be beautiful,” and who better to embody that ethos than Instagram sensation James Charles. After launching his beauty account a year ago, the teen has since quickly attracted hundreds of thousands of followers (427,000 to be exact) thanks to his unique, transformative approach to makeup artistry. \n \n RELATED PHOTOS: Katy Perry’s Most Outrageous Twitpics \n \n While Charles’ partnership with the brand kicks off today, we’ll hav

In [13]:
len(dataset['train'][100]['document'])

6217

In [14]:
dataset['train'][100]['summary']

'– If a woman can be president, who\'s to say a man can\'t be a CoverGirl. On Tuesday, the makeup company\'s current spokesperson, Katy Perry, announced James Charles as the first ever "CoverBoy" on her Instagram page. Charles, a 17-year-old "aspiring makeup artist," started using makeup only a year ago but has already amassed more than 430,000 followers on Instagram, the Huffington Post reports. According to People, Charles will appear in TV, print, and digital ads for "So Lashy" mascara later this month and will work with CoverGirl through 2017. "I am so thankful and excited," Charles posted on Instagram. "And yes I know I have lipstick on my teeth. It was a looonnnnggg day." CoverGirl says it wants to work with "role models and boundary-breakers, fearlessly expressing themselves, standing up for what they believe, and redefining what it means to be beautiful," Teen Vogue reports. The company calls Charles an inspiration. Teen Vogue is definitely on board, stating: "We\'re firm belie

In [15]:
len(dataset['train'][100]['summary'])

1268

There are two fields that you'll want to use:

- `document`: the text of the news articles, which will be the input to the model.
- `summary`: a condensed version of `document`, which will be the model target.

## Preprocess

The next step is to load a T5 tokenizer to process `document` and `summary`:

Model--> https://huggingface.co/google/flan-t5-base

In [16]:
from transformers import AutoTokenizer

checkpoint = model_id = MODEL_NAME
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [17]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=256, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [19]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

In [20]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5622
    })
})

In [21]:
len(tokenized_dataset['train'][100]['labels']), len(tokenized_dataset['train'][100]['input_ids'])

(256, 1024)

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! The model is loaded with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM) in the first code cell below.

Beyond loading the model, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. To push the model to the Hub during training you can set `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model); this notebook instead calls `push_to_hub()` after training. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) evaluates the loss on the evaluation set and saves a training checkpoint.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, the datasets, and the data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [22]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer


# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

Batches of examples are created with [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq), which is instantiated a few cells further down. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. First, define a helper that reports how many model parameters are trainable:

In [23]:
def print_number_of_trainable_model_parameters(model, tag="original_model", to_wandb=False):
    """Count trainable vs. total parameters and optionally log the counts to W&B."""
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()

    # Share of parameters that will receive gradient updates
    percentage = 100 * trainable_model_params / all_model_params

    if to_wandb:
        # Log all three values in one call so they are recorded at the same step
        wandb.log({tag: {
            "trainable_model_params": trainable_model_params,
            "all_model_params": all_model_params,
            "percentage_of_trainable_model_parameters": percentage,
        }})

    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {percentage}%"

In [24]:
repository_id = f"{checkpoint.split('/')[1]}-{DATASET}"
repository_id

'flan-t5-base-multi_news'

In [25]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
dataset_id = "multi_news"
# Hugging Face repository id
repository_id = f"{checkpoint.split('/')[1]}-{DATASET}"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=10,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # experiment tracking
    report_to="wandb",
)
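
The arguments above set neither a scheduler nor warmup. Assuming the Trainer defaults apply (a linear schedule with zero warmup steps), the learning rate decays from `1e-3` to zero over training, which is consistent with the `train/learning_rate` of 0.0 reported at the end of the run. The sketch below is a plain-Python approximation of that schedule, not the library's implementation; `linear_lr` is a hypothetical helper, and `total_steps` is taken from the `train/global_step` value logged later in this notebook.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 1e-3, warmup_steps: int = 0) -> float:
    """Linear warmup (optional) followed by linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return base_lr * (remaining / max(total_steps - warmup_steps, 1))


total_steps = 56220  # train/global_step reported at the end of the run
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```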



This notebook does not define a `compute_metrics` function, so the Trainer reports only the loss during evaluation; ROUGE is computed separately after training.

In [26]:
print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.0%


In [27]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
)
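
To make dynamic padding concrete, here is a pure-Python sketch of roughly what a seq2seq collator does with the settings above: inputs are padded to the longest sequence in the batch (rounded up to a multiple of 8), and labels are padded with `-100` so the padded positions are ignored by the loss. This is an illustration under stated assumptions, not the library's code; `pad_batch` is a hypothetical helper and `pad_token_id=0` assumes the T5 pad token id.

```python
import math


def pad_batch(features, pad_token_id=0, label_pad_token_id=-100, pad_to_multiple_of=8):
    """Pad a list of {'input_ids', 'labels'} dicts to batch-level lengths."""

    def target_len(key):
        longest = max(len(f[key]) for f in features)
        # Round up to the next multiple of pad_to_multiple_of
        return math.ceil(longest / pad_to_multiple_of) * pad_to_multiple_of

    input_len, label_len = target_len("input_ids"), target_len("labels")
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (label_len - len(f["labels"])))
    return batch


toy_features = [
    {"input_ids": list(range(1, 11)), "labels": [5, 6, 7]},
    {"input_ids": [1, 2, 3], "labels": [5]},
]
toy_batch = pad_batch(toy_features)
print([len(x) for x in toy_batch["input_ids"]])
print(toy_batch["labels"][1])
```

The longest input here has 10 tokens, so the batch is padded to 16 rather than to the 1024-token maximum used during tokenization.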

In [28]:
import gc
import torch
torch.cuda.empty_cache()
gc.collect()

103

In [29]:


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,

)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [None]:

with wandb.init(project=PROJECT, job_type="train", # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes =f"Fine tuning {MODEL_NAME} with {DATASET}. Summarization Prompt Instruction"):

  print_number_of_trainable_model_parameters(model,"original_model",to_wandb=True)

  trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.3803 | 2.181632 |
| 2 | 2.2688 | 2.15545 |
| 3 | 2.1995 | 2.134135 |
| 4 | 2.1854 | 2.13518 |
| 5 | 2.1352 | 2.129668 |
| 6 | 2.1199 | 2.124078 |
| 7 | 2.1218 | 2.121777 |
| 8 | 2.1056 | 2.122299 |
| 9 | 2.0928 | 2.122811 |
| 10 | 2.0834 | 2.1226 |
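
Since `load_best_model_at_end=True` is set and `metric_for_best_model` is commented out, the best checkpoint is presumably chosen by the lowest evaluation loss (the Trainer default, stated here as an assumption). The short sketch below applies that rule to the logged values to see which epoch would be selected.

```python
# (epoch, training loss, validation loss) copied from the training log above
history = [
    (1, 2.3803, 2.181632),
    (2, 2.2688, 2.15545),
    (3, 2.1995, 2.134135),
    (4, 2.1854, 2.13518),
    (5, 2.1352, 2.129668),
    (6, 2.1199, 2.124078),
    (7, 2.1218, 2.121777),
    (8, 2.1056, 2.122299),
    (9, 2.0928, 2.122811),
    (10, 2.0834, 2.1226),
]

# Lower validation loss is better
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(f"best epoch: {best_epoch}, validation loss: {best_val_loss}")
```

The training loss keeps falling through epoch 10, while the validation loss flattens after epoch 7, so additional epochs bring little further gain on held-out data.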


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


Run history: W&B sparkline charts for eval/loss, eval/runtime, eval/samples_per_second, eval/steps_per_second, train/epoch, train/global_step, train/learning_rate, train/loss, train/total_flos, and train/train_loss (not reproduced here).

Run summary:

| Metric | Value |
|---|---|
| eval/loss | 2.1226 |
| eval/runtime | 64.1215 |
| eval/samples_per_second | 87.677 |
| eval/steps_per_second | 10.964 |
| train/epoch | 10.0 |
| train/global_step | 56220.0 |
| train/learning_rate | 0.0 |
| train/loss | 2.0834 |
| train/total_flos | 6.1589806749116e+17 |
| train/train_loss | 2.17737 |
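
The summary values can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook) and infers the batch size that `auto_find_batch_size=True` ended up using from the step count.

```python
import math

train_rows = 44972      # size of the train split
eval_rows = 5622        # size of the split used for evaluation
global_steps = 56220    # train/global_step
epochs = 10             # num_train_epochs
eval_runtime = 64.1215  # eval/runtime in seconds

steps_per_epoch = global_steps // epochs

# Batch sizes b for which ceil(train_rows / b) matches the observed steps per epoch
candidates = [b for b in range(1, 129) if math.ceil(train_rows / b) == steps_per_epoch]

eval_throughput = eval_rows / eval_runtime
eval_steps = math.ceil(eval_rows / candidates[0])

print(steps_per_epoch, candidates)
print(round(eval_throughput, 3), round(eval_steps / eval_runtime, 3))
```

Under these assumptions the only batch size consistent with the step count is 8, and the recomputed throughput agrees with the logged `eval/samples_per_second` and `eval/steps_per_second`.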


Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
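# Note: the first argument of push_to_hub() is the commit message, not a repository id;
# the target repository is derived from the training arguments (see the commit URL below).
trainer.push_to_hub(commit_message="olonok/t5-base-multi_news-summarization")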

CommitInfo(commit_url='https://huggingface.co/olonok/flan-t5-base-multi_news/commit/7760e4c90a6aa23233608f7f765fc78f53e4b694', commit_message='olonok/t5-base-multi_news-summarization', commit_description='', oid='7760e4c90a6aa23233608f7f765fc78f53e4b694', pr_url=None, pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb).

</Tip>

In [None]:

! rm -rf t5-sum-checkpoint

In [None]:

!mkdir t5-sum-checkpoint
custom_path = "./t5-sum-checkpoint/"
trainer.save_model(output_dir=custom_path)

In [None]:

with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("T5-base_Summarization_model", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)


wandb: Adding directory to artifact (./t5-sum-checkpoint)... Done. 1.6s


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization you should prefix your input as shown below:

In [30]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [31]:
run = wandb.init()
artifact = run.use_artifact('olonok69/T5-base-Summarization/T5-base_Summarization_model:v0', type='model')
artifact_dir = artifact.download()

fine_tune_model = AutoModelForSeq2SeqLM.from_pretrained(artifact_dir, torch_dtype=torch.bfloat16)

wandb: Currently logged in as: olonok (olonok69). Use `wandb login --relogin` to force relogin
wandb: Downloading large artifact T5-base_Summarization_model:v0, 472.26MB. 4 files...
wandb:   4 of 4 files downloaded.
Done. 0:0:2.5


In [32]:
artifact_dir

'/content/artifacts/T5-base_Summarization_model:v0'

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for summarization with your model, and pass your text to it:

In [33]:
from transformers import pipeline

In [34]:


summarizer = pipeline("summarization", model=fine_tune_model, tokenizer=tokenizer)
response = summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


In [35]:
response[0]['summary_text']

'– The Inflation Reduction Act is the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying union jobs across the country. It will lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes, reports the New York Times.'

In [36]:
checkpoint

'google/flan-t5-base'

In [37]:
from transformers import pipeline

summarizer_ori = pipeline("summarization", model=checkpoint, tokenizer=tokenizer)
response_ori = summarizer_ori(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


In [38]:
response_ori[0]['summary_text']

"Inflation is a big issue in the United States, but it's also a problem in the world of health care. And that's why we're taking action to reduce the cost of healthcare."

In [39]:
len(text), len(response[0]['summary_text']), len(response_ori[0]['summary_text'])

(448, 397, 168)
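
These character counts can be turned into compression ratios to compare how aggressively each model shortens the input. A small sketch using only the three numbers printed above:

```python
input_chars = 448       # len(text)
fine_tuned_chars = 397  # len(response[0]['summary_text'])
original_chars = 168    # len(response_ori[0]['summary_text'])

# Ratio of summary length to input length (lower means more compression)
fine_tuned_ratio = fine_tuned_chars / input_chars
original_ratio = original_chars / input_chars

print(f"fine-tuned model: {fine_tuned_ratio:.3f}")
print(f"original model:   {original_ratio:.3f}")
```

On this short input the fine-tuned model keeps most of the source text, while the original model produces a much shorter output that drifts from the source wording.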

In [40]:
dataset['validation']

Dataset({
    features: ['document', 'summary'],
    num_rows: 5622
})

In [41]:
import time
import evaluate
import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

In [42]:
rouge = evaluate.load('rouge')

In [43]:
dialogues = dataset['validation'][0:100]['document']
human_baseline_summaries = dataset['validation'][0:100]['summary']

original_model_text = []
original_human_summaries = []
original_model_summaries = []
fine_tune_model_summaries = []

In [None]:
for idx, dialogue in enumerate(tqdm(dialogues)):
    prompt = f"""
Summarize:

{dialogue}
 """
    original_model_text.append(dialogue)
    original_human_summaries.append(human_baseline_summaries[idx])
    # summarize fine_tuned model
    response = summarizer(prompt)
    fine_tune_model_summaries.append(response[0]['summary_text'])
    # summarize original model
    response_ori = summarizer_ori(prompt)
    original_model_summaries.append(response_ori[0]['summary_text'])


Token indices sequence length is longer than the specified maximum sequence length for this model (2332 > 512). Running this sequence through the model will result in indexing errors
 24%|██▍       | 24/100 [27:17<1:00:26, 47.71s/it]

In [None]:
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, fine_tune_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'fine_tune_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | fine_tune_model_summaries |
|---|---|---|---|
| 0 | – The Da Vinci Code has sold so many copies—th... | – Dan Brown has topped Oxfam's 'most donated' ... | – The Da Vinci Code has sold more than 80 mill... |
| 1 | – A major snafu has hit benefit payments to st... | – For weeks, student veterans across the count... | – The Department of Veterans Affairs says it w... |
| 2 | – Yemen-based al-Qaeda in the Arabian Peninsul... | – Al Qaeda in Yemen has claimed responsibility... | – Charlie Hebdo's latest cover features a cart... |
| 3 | – Cambridge Analytica is calling it quits. The... | – Cambridge Analytica's parent company, the SC... | – Cambridge Analytica's parent company, the SC... |
| 4 | – A lengthy report in the New York Times, base... | The N.S.A. is an electronic omnivore of stagge... | – Edward Snowden's revelations about the Natio... |
| 5 | – Don Juan de Oñate sought a city of gold when... | – An archaeologist at Wichita State University... | – An archaeologist at Wichita State University... |
| 6 | – Another bad day for Anthony Weiner: Nancy Pe... | – Rep. Anthony Weiner has resigned from the Ho... | – Anthony Weiner has decided to take a leave o... |
| 7 | – An augmented reality startup is being sued f... | – Magic Leap is being sued for sex discriminat... | – A former Magic Leap VP is suing the company ... |
| 8 | – The length of a man's index and ring fingers... | – A new study suggests that the lengths of the... | – A new study suggests that the length of the ... |
| 9 | – A 71-year-old lawyer is suing United Airline... | – A 71-year-old Houston lawyer is suing United... | – A 71-year-old Houston man is suing United Ai... |


In [None]:
# https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)


In [None]:
print(original_model_results)

{'rouge1': 0.3662092840410307, 'rouge2': 0.1156408354117632, 'rougeL': 0.1787691923692278, 'rougeLsum': 0.1793605972496164}
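
For intuition about what these scores measure, ROUGE-1 is based on unigram overlap between a prediction and a reference. The sketch below is a simplified version written for illustration: it uses whitespace tokenization and no stemming, so it will not reproduce the exact numbers returned by `evaluate.load('rouge')` with `use_stemmer=True`. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.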


In [None]:
fine_tune_model_results = rouge.compute(
    predictions=fine_tune_model_summaries,
    references=human_baseline_summaries[0:len(fine_tune_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

In [None]:
print(fine_tune_model_results)

{'rouge1': 0.4197339424286126, 'rouge2': 0.13083171156689355, 'rougeL': 0.2001527106806991, 'rougeLsum': 0.200658349122237}
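
To compare the two result dictionaries directly, the following sketch computes the absolute and relative gain of the fine-tuned model over the original model for each metric, using the scores printed above on the evaluated validation subset.

```python
original_model_scores = {
    "rouge1": 0.3662092840410307,
    "rouge2": 0.1156408354117632,
    "rougeL": 0.1787691923692278,
    "rougeLsum": 0.1793605972496164,
}
fine_tuned_scores = {
    "rouge1": 0.4197339424286126,
    "rouge2": 0.13083171156689355,
    "rougeL": 0.2001527106806991,
    "rougeLsum": 0.200658349122237,
}

gains = {}
for metric, base in original_model_scores.items():
    absolute = fine_tuned_scores[metric] - base
    relative = 100 * absolute / base  # percent change relative to the original model
    gains[metric] = (absolute, relative)
    print(f"{metric}: +{absolute:.4f} absolute, +{relative:.1f}% relative")
```

All four metrics improve after fine-tuning, although the evaluated sample is small, so the differences should be read as indicative rather than conclusive.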


In [32]:
dialogues = dataset['validation'][:32]['document']

In [34]:
type(dialogues)

list

In [37]:
dialogues[0]

'Whether a sign of a good read; or a comment on the \'pulp\' nature of some genres of fiction, the Oxfam second-hand book charts have remained in The Da Vinci Code author\'s favour for the past four years. \n \n Dan Brown has topped Oxfam\'s \'most donated\' list again, his fourth consecutive year. Having sold more than 80 million copies of The Da Vinci Code and had all four of his novels on the New York Times bestseller list in the same week, it\'s hardly surprising that Brown\'s hefty tomes are being donated to charity by readers keen to make some room on their shelves. \n \n Another cult crime writer responsible to heavy-weight hardbacks, Stieg Larsson, is Oxfam\'s \'most sold\' author for the second time in a row. Both the \'most donated\' and \'most sold\' lists are dominated by crime fiction, trilogies and fantasy, with JK Rowling the only female author listed in either of the Top Fives. \n \n Click here or on "View Gallery" to see both charts in pictures ||||| A woman reads a co

In [45]:
summarizer = pipeline("summarization", model=fine_tune_model, tokenizer=tokenizer, device="cpu", batch_size=2)


In [None]:
response = summarizer(dialogues)

In [None]:
len(response)

In [None]:
summarizer_ori = pipeline("summarization", model=checkpoint, tokenizer=tokenizer, device_map='auto')
response_ori = summarizer_ori(dialogues)

In [None]:
len(response_ori)