In [1]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [2]:
import nltk
import pandas as pd
import torch
from datasets import load_dataset, load_metric
from nltk.tokenize import sent_tokenize
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BertForNextSentencePrediction,
    BertTokenizer,
    pipeline,
    set_seed,
)

# sent_tokenize needs the Punkt sentence tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Load the dataset
dataset = load_dataset("ccdv/pubmed-summarization", ignore_verifications=True)




In [4]:
# Use the first 1000 characters of the second training article as the sample input
sample_text = dataset["train"][1]["article"][:1000]
sample_text

'it occurs in more than 50% of patients and may reach 90% in certain types of cancers , especially in patients undergoing chemotherapy and/or radiation therapy.1 anemia is defined as an inadequate circulating level of hemoglobin ( hb ) ( hb < 12 g / dl ) and may arise as a result of the underlying disease , bleeding , poor nutrition , chemotherapy , or radiation therapy . \n preliminary studies suggest that survival and loco - regional control after radiation therapy , especially in head and neck cancers , may be compromised by anemia.24 anemia often worsens symptoms such as fatigue , weakness , and dyspnea , and thus may have a negative effect on quality of life ( qol ) and performance status in patients with cancer . \n thus , to improve physical functioning , qol , and prognosis in patients with cancer , it would be reasonable to take a proactive approach in identifying populations who need treatment for cancer - associated anemia ( caa ) and provide timely management . \n blood tra

In [5]:
summaries={}

In [6]:
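# Initialize the T5 pipeline and summarize the sample text
t5_pipeline = pipeline('summarization', model='t5-small')
t5_output = t5_pipeline(sample_text)
# Store one sentence per line, the format the rougeLsum score expects
summaries['t5'] = '\n'.join(sent_tokenize(t5_output[0]['summary_text']))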

In [7]:
# Initialize BART pipeline
bart_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
bart_output = bart_pipeline(sample_text)
summaries['bart'] = '\n'.join(sent_tokenize(bart_output[0]['summary_text']))

In [8]:
# Initialize PEGASUS pipeline
pegasus_tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
pegasus_model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")
pegasus_pipeline = pipeline("summarization", model=pegasus_model, tokenizer=pegasus_tokenizer)
pegasus_output = pegasus_pipeline(sample_text)
summaries['pegasus'] = '\n'.join(sent_tokenize(pegasus_output[0]['summary_text']))



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Your max_length is set to 256, but your input_length is only 214. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=107)
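
The last warning above suggests lowering `max_length` when the input is short. A minimal sketch of one way to derive length bounds from the input token count, assuming a target of half the input length (the ratio used in the warning's own example, 214 to 107). The helper `suggest_length_bounds` and its `ratio` and `min_ratio` parameters are hypothetical names for illustration, not part of `transformers`.

```python
def suggest_length_bounds(input_length, ratio=0.5, min_ratio=0.1):
    """Return (min_length, max_length) in tokens for an input of input_length tokens.

    ratio and min_ratio are illustrative defaults, not values from the library.
    """
    if input_length <= 0:
        raise ValueError("input_length must be positive")
    # Cap the summary at a fraction of the input, but never below one token.
    max_length = max(1, int(input_length * ratio))
    # Keep the lower bound at least one token and never above the cap.
    min_length = max(1, min(int(input_length * min_ratio), max_length))
    return min_length, max_length


# 214 is the input length reported in the warning above.
min_len, max_len = suggest_length_bounds(214)
print(min_len, max_len)
```

The resulting values could then be passed along with the text, for example `pegasus_pipeline(sample_text, min_length=min_len, max_length=max_len)`, in the same way the warning passes `max_length`.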


In [9]:
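# Initialize BERT. Note: bert-base-uncased is not a summarization model. This cell
# wraps the next-sentence-prediction head in a feature-extraction pipeline and uses
# its output to pick sentences, so treat the result as a rough extractive baseline.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
bert_pipeline = pipeline("feature-extraction", model=bert_model, tokenizer=bert_tokenizer)
bert_output = bert_pipeline(sample_text)
# Rank the entries of the first output vector and keep (up to) the three highest.
# The indices are positions in that vector and are reused below as sentence indices.
top_sentences = sorted(enumerate(bert_output[0]), key=lambda x: x[1], reverse=True)[:3]
sentences = sent_tokenize(sample_text)
summary_sentences = [sentences[index] for index, _ in top_sentences]
summaries['bert'] = '\n'.join(summary_sentences)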
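
To see what the ranking step in this cell can and cannot select, here is a small sketch with made-up numbers. It assumes the pipeline returns a single two-element logit vector for the input, as a next-sentence-prediction head would. That output shape is an assumption and is not verified here, and `fake_logits` is a hypothetical stand-in for `bert_output[0]`.

```python
# Hypothetical logits standing in for bert_output[0]; real values will differ.
fake_logits = [5.2, -4.7]

# Same ranking logic as the cell above: sort positions by value, keep up to three.
ranked = sorted(enumerate(fake_logits), key=lambda pair: pair[1], reverse=True)[:3]
selected_indices = [index for index, _ in ranked]
print(selected_indices)  # [0, 1]
```

Under that assumption only indices 0 and 1 can ever be chosen, so the "bert" summary would amount to the first two sentences of the sample rather than a ranking over all sentences.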


In [10]:
summaries

{'t5': 'anemia is defined as an inadequate circulating level of hemoglobin ( hb  12 g / dl ) and may arise as a result of the underlying disease .\npreliminary studies suggest survival and loco - regional control after radiation therapy may be compromised by anemia .',
 'bart': 'Anemia is defined as an inadequate circulating level of hemoglobin ( hb) It occurs in more than 50% of patients and may reach 90% in certain types of cancers.\nAnemia often worsens symptoms such as fatigue and dyspnea.\nIt can have a negative effect on quality of life ( qol) and performance status in patients with cancer.',
 'pegasus': 'preliminary studies suggest that survival and loco - regional control after radiation therapy , especially in head and neck cancers , may be compromised by anemia.24 anemia often worsens symptoms such as fatigue , weakness , and dyspnea , and thus may have a negative effect on quality of life ( qol ) and performance status in patients with cancer .',
 'bert': 'it occurs in more 

In [11]:
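# Print summaries
for model_name, summary in summaries.items():
    print(f"{model_name.capitalize()} Summary:\n{summary}\n")

# Load ROUGE metric. load_metric is deprecated in recent `datasets` releases;
# the `evaluate` library is its replacement.
rouge_metric = load_metric('rouge')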

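# Calculate ROUGE scores for each model against the article's full abstract.
# Note: the summaries were generated from only the first 1000 characters of the article.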
records = []
reference = dataset['train'][1]['abstract']

for model_name, summary in summaries.items():
    rouge_metric.add(prediction=summary, reference=reference)
    score = rouge_metric.compute()
    rouge_dict = {rn: score[rn].mid.fmeasure for rn in ["rouge1", "rouge2", "rougeL", "rougeLsum"]}
    print(f'{model_name.capitalize()} ROUGE Scores:', rouge_dict)
    records.append(rouge_dict)

# Convert results to DataFrame
df_results = pd.DataFrame.from_records(records, index=summaries.keys())
print(df_results)


T5 Summary:
anemia is defined as an inadequate circulating level of hemoglobin ( hb  12 g / dl ) and may arise as a result of the underlying disease .
preliminary studies suggest survival and loco - regional control after radiation therapy may be compromised by anemia .

Bart Summary:
Anemia is defined as an inadequate circulating level of hemoglobin ( hb) It occurs in more than 50% of patients and may reach 90% in certain types of cancers.
Anemia often worsens symptoms such as fatigue and dyspnea.
It can have a negative effect on quality of life ( qol) and performance status in patients with cancer.

Pegasus Summary:
preliminary studies suggest that survival and loco - regional control after radiation therapy , especially in head and neck cancers , may be compromised by anemia.24 anemia often worsens symptoms such as fatigue , weakness , and dyspnea , and thus may have a negative effect on quality of life ( qol ) and performance status in patients with cancer .

Bert Summary:
it occur



T5 ROUGE Scores: {'rouge1': 0.11956521739130434, 'rouge2': 0.027322404371584695, 'rougeL': 0.05978260869565217, 'rougeLsum': 0.10869565217391303}
Bart ROUGE Scores: {'rouge1': 0.1671018276762402, 'rouge2': 0.07349081364829396, 'rougeL': 0.10443864229765014, 'rougeLsum': 0.1566579634464752}
Pegasus ROUGE Scores: {'rouge1': 0.15748031496062992, 'rouge2': 0.058047493403693924, 'rougeL': 0.09448818897637797, 'rougeLsum': 0.12073490813648294}
Bert ROUGE Scores: {'rouge1': 0.2681818181818182, 'rouge2': 0.09132420091324202, 'rougeL': 0.1318181818181818, 'rougeLsum': 0.2227272727272727}
           rouge1    rouge2    rougeL  rougeLsum
t5       0.119565  0.027322  0.059783   0.108696
bart     0.167102  0.073491  0.104439   0.156658
pegasus  0.157480  0.058047  0.094488   0.120735
bert     0.268182  0.091324  0.131818   0.222727
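
To make the `rouge1` column concrete, here is a minimal sketch of a ROUGE-1 F-measure based on unigram overlap. It assumes plain lowercase whitespace tokenization, so its numbers will not exactly match the metric used above, which applies its own tokenizer. The function name `rouge1_f` and the example strings are illustrative only.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not taken from the dataset.
example = rouge1_f("anemia worsens fatigue", "anemia often worsens fatigue and dyspnea")
print(round(example, 3))  # 0.667
```

In the example, precision is 1.0 but recall is only 0.5 because the prediction is half the length of the reference. A similar effect may contribute to the low scores in the table, since each short summary is compared against a full abstract.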
