**Project Introduction: Abstractive Summarization**




This project focuses on abstractive summarization, where the goal is to generate concise summaries of articles while preserving their essential meaning. The dataset used is the **CNN/Daily Mail dataset**, which consists of news articles paired with human-written summaries.

**Summarization Methods Used:**
1. **BART (Bidirectional and Auto-Regressive Transformers)**: A transformer model that pairs a bidirectional encoder with an autoregressive decoder to generate coherent summaries.
2. **T5 (Text-to-Text Transfer Transformer)**: A versatile model that frames all NLP tasks, including summarization, as text-to-text problems, allowing for easy task adaptation.
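3. **Instruction-tuned LLM (FLAN-T5)**: The notebook labels this method a GPT-based LLM, but the model actually loaded is `google/flan-t5-large`, an instruction-tuned encoder-decoder model rather than a decoder-only GPT model. Like GPT-style models, its decoder generates text autoregressively, predicting each next token from the preceding context.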
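
The text-to-text framing mentioned for T5 can be illustrated with a minimal sketch. The helper name `to_text_to_text` and the `TASK_PREFIXES` table are hypothetical and exist only for this illustration; the prefixes themselves follow the convention T5 was trained with.

```python
# Hypothetical helper: in the text-to-text framing, the task is selected
# by a plain-text prefix, so every task shares one string-in/string-out interface.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Build the model input string for a given task."""
    return TASK_PREFIXES[task] + text.strip()


example = to_text_to_text(
    "summarize", "  The court opened a preliminary examination.  "
)
print(example)
```

Switching tasks only changes the prefix; the model and the decoding code stay the same.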

To evaluate the generated summaries, we use the **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** score. ROUGE measures the overlap between generated and reference summaries: ROUGE-1 and ROUGE-2 count shared unigrams and bigrams, and ROUGE-L is based on the longest common subsequence. Each variant is reported as precision, recall, and F-measure. In this project, ROUGE scores are computed to compare the performance of the three summarization methods.
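
To make the overlap idea concrete, here is a simplified sketch of ROUGE-N in plain Python. It is not the `rouge-score` implementation used later in the notebook: it assumes lowercase whitespace tokenization and no stemming, and the function name `rouge_n` is made up for this example.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Toy ROUGE-N: clipped n-gram overlap between candidate and reference."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # simplification: whitespace tokens
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here five of the six unigrams match, so precision and recall are both 5/6. Precision divides the overlap by the candidate length, recall by the reference length.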

In [1]:
!pip install datasets
!pip install transformers
!pip install rouge-score



In [2]:

import pandas as pd
from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from rouge_score import rouge_scorer



In [3]:


dataset = load_dataset("cnn_dailymail", "3.0.0")

# --------------------- BART Model ---------------------

bart_model_name = "facebook/bart-large-cnn"
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name)

# --------------------- T5 Model -----------------------

t5_model_name = "t5-large"
t5_tokenizer = T5Tokenizer.from_pretrained(t5_model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(t5_model_name)

# --------------------- Instruction-tuned LLM (FLAN-T5) -----------------------

llm_model_name = "google/flan-t5-large"
llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model = AutoModelForSeq2SeqLM.from_pretrained(llm_model_name)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split: 287113 examples

Generating validation split: 13368 examples

Generating test split: 11490 examples

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



In [4]:

def generate_bart_summary(text):
    # Tokenize the article, truncating it to at most 1024 tokens.
    inputs = bart_tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    # Beam search with 4 beams; the summary is constrained to 40-150 tokens.
    summary_ids = bart_model.generate(inputs['input_ids'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def generate_t5_summary(text):
    # T5 selects the task through a text prefix, here "summarize: ".
    inputs = t5_tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=1024, truncation=True)
    summary_ids = t5_model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def generate_llm_summary(text):
    # The article is passed as-is, without an instruction prefix.
    inputs = llm_tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = llm_model.generate(inputs['input_ids'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return llm_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
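
All three functions decode with beam search (`num_beams=4`) and a `length_penalty` of 2.0. The toy sketch below shows the general idea under simplifying assumptions: `TOY_MODEL` is a made-up next-token table rather than a real model, and finished hypotheses are ranked by the sum of log-probabilities divided by `length ** length_penalty`. The real `generate` implementation differs in details such as how length is counted and when search stops.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"a": 0.7, "b": 0.3},
    "a": {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b": {"a": 0.3, "b": 0.1, "</s>": 0.6},
}


def beam_search(num_beams=2, max_length=4, length_penalty=1.0):
    """Return the best finished (tokens, sum_logprob) hypothesis, or None."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_length):
        # Expand every live beam by every possible next token.
        candidates = [
            (tokens + [tok], logp + math.log(p))
            for tokens, logp in beams
            for tok, p in TOY_MODEL[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the top `num_beams` expansions; ended ones become finished.
        for tokens, logp in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, logp))
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    if not finished:
        return None
    # Log-probabilities are negative, so dividing by a larger power of the
    # length moves longer hypotheses closer to zero, i.e. ranks them higher.
    return max(finished, key=lambda h: h[1] / (len(h[0]) ** length_penalty))


short_best = beam_search(length_penalty=0.0)
long_best = beam_search(length_penalty=2.0)
print(short_best[0], long_best[0])
```

With no length normalization the shortest hypothesis wins; with `length_penalty=2.0` a longer one is preferred, which is why positive values favor longer summaries.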



In [5]:

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

for i in range(3):
    article = dataset['test'][i]['article']
    reference_summary = dataset['test'][i]['highlights']


    bart_summary = generate_bart_summary(article)
    t5_summary = generate_t5_summary(article)
    llm_summary = generate_llm_summary(article)

    print(f"Article {i+1}:\n")
    print("Generated Summary (BART):")
    print(bart_summary)
    print("\nGenerated Summary (T5):")
    print(t5_summary)
    print("\nGenerated Summary (LLM):")
    print(llm_summary)
    print("\nReference Summary:")
    print(reference_summary)
    print("-" * 80)

    # ROUGE scores
    print(f"ROUGE scores for Article {i+1}:")


    bart_scores = scorer.score(reference_summary, bart_summary)
    print("BART ROUGE Scores:", bart_scores)


    t5_scores = scorer.score(reference_summary, t5_summary)
    print("T5 ROUGE Scores:", t5_scores)


    llm_scores = scorer.score(reference_summary, llm_summary)
    print("LLM ROUGE Scores:", llm_scores)

    print("\n" + "=" * 100 + "\n")
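
The loop above prints scores one article at a time. To compare models it may help to average the F-measures over articles; the sketch below shows one way to do that. It assumes each score dict has the shape returned by `scorer.score` (metric name mapped to a tuple with `precision`, `recall`, and `fmeasure` fields). The `Score` namedtuple is a stand-in so the example runs without the library, the helper name `mean_fmeasure` is hypothetical, and the numbers are made up rather than taken from this notebook.

```python
from collections import namedtuple
from statistics import mean

# Stand-in for the score tuples produced by the ROUGE scorer.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def mean_fmeasure(per_article_scores):
    """Average the F-measure of each ROUGE metric over a list of score dicts."""
    metrics = per_article_scores[0].keys()
    return {m: mean(s[m].fmeasure for s in per_article_scores) for m in metrics}


# Made-up numbers for illustration only, not results from this notebook.
example_scores = [
    {"rouge1": Score(0.5, 0.4, 0.4), "rougeL": Score(0.3, 0.3, 0.2)},
    {"rouge1": Score(0.7, 0.5, 0.6), "rougeL": Score(0.5, 0.3, 0.4)},
]
averaged = mean_fmeasure(example_scores)
print(averaged)
```

In the notebook, one list per model (for example, appending `bart_scores` inside the loop) could be passed to this helper. Three articles are too few for a reliable comparison, so a larger sample of the test split would be needed for meaningful averages.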



Article 1:

Generated Summary (BART):
The Palestinian Authority becomes the 123rd member of the International Criminal Court. The move gives the court jurisdiction over alleged crimes in Palestinian territories. Israel and the United States opposed the Palestinians' efforts to join the body.

Generated Summary (T5):
the formal accession was marked with a ceremony at the hague, where the court is based. as members of the court, Palestinians may be subject to counter-charges as well.

Generated Summary (LLM):
The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday. The Palestinians signed the ICC's founding Rome Statute in January. The ICC opened a preliminary examination into the situation in Palestinian territories in January.

Reference Summary:
Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June .
Israel and the United States opposed the move, which could open the door to 