**T5 WITH ROUGE SCORE**

In [1]:
!pip install transformers
!pip install datasets
!pip install rouge-score



**Load T5 Model and Tokenizer**

In [2]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained T5 model and tokenizer
model_name = 't5-base'  # 't5-large' may give better summaries, at a higher memory and compute cost
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)



**Prepare Input Data**

In [3]:
text = """Software is an essential component of modern technology, encompassing a wide range of programs, applications, and systems that enable computers and devices to perform tasks and processes. From operating systems to mobile apps, software plays a crucial role in our daily lives and in the functioning of various industries.Software is a fundamental aspect of the technology ecosystem, influencing almost every aspect of life and business today. Its development involves a systematic approach to ensure quality and functionality, with an ever-evolving landscape driven by emerging technologies and user demands. As the world continues to digitalize, the role of software will only grow in importance, presenting both opportunities and challenges for developers and users alike. The future of software promises exciting advancements, fostering innovation and transforming industries across the globe."""


**Tokenize and Summarize**

In [4]:
# Prepend the task to the input text
input_text = "summarize: " + text

# Tokenize the input text
inputs = tokenizer(input_text, max_length=512, return_tensors='pt', truncation=True)

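# Generate the summary with beam search, passing the attention mask explicitly
summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)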

# Decode the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:")
print(summary)


Summary:
software is an essential component of modern technology. software is a fundamental aspect of the technology ecosystem. as the world continues to digitalize, the role of software will only grow in importance.


**Evaluate with ROUGE Score**

In [7]:
from rouge_score import rouge_scorer

# Reference summary for comparison. Here it is a copy of the source text, which inflates precision;
# replace it with a human-written summary for a meaningful evaluation.
reference_summary = "Software is an essential component of modern technology, encompassing a wide range of programs, applications, and systems that enable computers and devices to perform tasks and processes. From operating systems to mobile apps, software plays a crucial role in our daily lives and in the functioning of various industries.Software is a fundamental aspect of the technology ecosystem, influencing almost every aspect of life and business today. Its development involves a systematic approach to ensure quality and functionality, with an ever-evolving landscape driven by emerging technologies and user demands. As the world continues to digitalize, the role of software will only grow in importance, presenting both opportunities and challenges for developers and users alike. The future of software promises exciting advancements, fostering innovation and transforming industries across the globe."

# Initialize the scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate ROUGE scores; score() takes the reference (target) first, then the prediction
scores = scorer.score(reference_summary, summary)

print("ROUGE Scores:")
print(scores)


ROUGE Scores:
{'rouge1': Score(precision=1.0, recall=0.24615384615384617, fmeasure=0.39506172839506176), 'rouge2': Score(precision=0.9354838709677419, recall=0.2248062015503876, fmeasure=0.3625), 'rougeL': Score(precision=1.0, recall=0.24615384615384617, fmeasure=0.39506172839506176)}
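
To make the numbers above easier to read, here is a minimal sketch of how ROUGE-N precision, recall, and F-measure can be computed from clipped n-gram overlap. It is a simplified illustration, not the `rouge_score` implementation: the function name `simple_rouge_n` and its regex tokenizer are assumptions, and it applies no stemming, so its values may differ slightly from the library's.

```python
import re
from collections import Counter


def simple_rouge_n(reference, candidate, n=1):
    """Return (precision, recall, fmeasure) based on clipped n-gram overlap."""

    def ngram_counts(sentence):
        # Lowercase and keep alphanumeric runs only (simplified tokenizer, no stemming)
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngram_counts(reference)
    cand_counts = ngram_counts(candidate)

    # Counter intersection keeps the minimum count per n-gram, i.e. clipped matches
    overlap = sum((ref_counts & cand_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# Toy example: all three candidate words occur in the six-word reference
print(simple_rouge_n("the cat sat on the mat", "the cat sat"))
```

In the recorded output, `rouge1` precision is 1.0 because every summary token also appears in the reference, while recall is about 0.25 because the summary covers only about a quarter of the reference tokens. The F-measure is their harmonic mean: 2 × 1.0 × 0.246 / 1.246 ≈ 0.395.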


**Putting It All Together**

In [8]:
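def summarize_and_evaluate(text, reference_summary):
    """Summarize `text` with T5 and score the result against `reference_summary`.

    Relies on the `tokenizer`, `model`, and `rouge_scorer` objects created in the cells above.
    """
    # Prepend the task prefix that T5 expects for summarization
    input_text = "summarize: " + text

    # Tokenize (truncating to 512 tokens) and generate the summary with beam search
    inputs = tokenizer(input_text, max_length=512, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Calculate ROUGE scores; score() takes the reference first, then the prediction
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference_summary, summary)

    return summary, scores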

# Example usage (the strings below are placeholders; replace them with real text to get meaningful scores)
text = """Your long article or text goes here."""
reference_summary = "Your reference summary goes here."
summary, scores = summarize_and_evaluate(text, reference_summary)

print("Generated Summary:")
print(summary)
print("ROUGE Scores:")
print(scores)


Generated Summary:
.
ROUGE Scores:
{'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rougeL': Score(precision=0, recall=0, fmeasure=0)}
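
A single text gives only one data point. When `summarize_and_evaluate` is run over several text/reference pairs, the per-pair results can be averaged. The following is a hedged sketch: `mean_fmeasure` is a hypothetical helper, the `Score` tuple is a local stand-in that mirrors the fields shown in the printed output, and the example values are made up for illustration.

```python
from collections import namedtuple

# Local stand-in with the same field names as the Score tuples printed above
# (an assumption so that this example runs without the model or rouge_score)
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def mean_fmeasure(all_scores):
    """Average the F-measure per ROUGE type over a list of score dictionaries."""
    if not all_scores:
        return {}
    rouge_types = all_scores[0].keys()
    return {
        rouge_type: sum(scores[rouge_type].fmeasure for scores in all_scores) / len(all_scores)
        for rouge_type in rouge_types
    }


# Made-up scores for two hypothetical documents
example_scores = [
    {"rouge1": Score(0.5, 0.5, 0.5), "rougeL": Score(0.5, 0.5, 0.5)},
    {"rouge1": Score(1.0, 1.0, 1.0), "rougeL": Score(0.0, 0.0, 0.0)},
]
print(mean_fmeasure(example_scores))
```

With real data, each dictionary in the list would be the `scores` value returned by one call to `summarize_and_evaluate`.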
