In [2]:
%%capture
!pip install transformers[torch]
!pip install rouge_score

In [16]:
%%capture
!pip install -U datasets
!pip install fsspec==2023.9.2

In [2]:
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import torch
import numpy as np

Dataset : https://huggingface.co/datasets/alexfabbri/multi_news

In [None]:
dataset = load_dataset("multi_news")

In [10]:
!mv /root/.cache/huggingface/datasets/multi_news /content/

downloads
multi_news
_root_.cache_huggingface_datasets_multi_news_default_1.0.0_2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72.lock


In [23]:
!ls /content/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72/

dataset_info.json      multi_news-train-00000-of-00002.arrow
LICENSE		       multi_news-train-00001-of-00002.arrow
multi_news-test.arrow  multi_news-validation.arrow


In [3]:
from datasets import Dataset

train1 = Dataset.from_file("/content/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72/multi_news-train-00000-of-00002.arrow")
train2 = Dataset.from_file("/content/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72/multi_news-train-00001-of-00002.arrow")

# Concatenate both parts of the train set
from datasets import concatenate_datasets
train_dataset = concatenate_datasets([train1, train2])

# Load test and validation
test_dataset = Dataset.from_file("/content/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72/multi_news-test.arrow")
val_dataset = Dataset.from_file("/content/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72/multi_news-validation.arrow")


In [4]:
train_subset = train_dataset.select(range(10000))
test_subset = test_dataset.select(range(1000))
val_subset = val_dataset.select(range(1000))

In [5]:
checkpoint = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [6]:
def preprocess_function(examples):
    inputs = ["summarize: "+doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [7]:
tokenized_train = train_subset.map(preprocess_function, batched=True)
tokenized_test = test_subset.map(preprocess_function, batched=True)
tokenized_val = val_subset.map(preprocess_function, batched=True)

In [8]:
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

In [9]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference summaries (typically human-generated). ROUGE is particularly popular in the field of natural language processing for tasks such as summarization. The metrics focus on different aspects of the generated summary and provide insights into its quality. The main ROUGE metrics include:

## ROUGE-N
Measures the overlap of n-grams between the candidate summary and the reference summary. The most common versions are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).

### ROUGE-1
Counts the overlap of single words.
- **ROUGE-1 Recall**:
  $$
  \text{ROUGE-1 Recall} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference summary}}
  $$
- **ROUGE-1 Precision**:
  $$
  \text{ROUGE-1 Precision} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in candidate summary}}
  $$
- **ROUGE-1 F1-Score**:
  $$
  \text{ROUGE-1 F1-Score} = 2 \times \frac{\text{ROUGE-1 Recall} \times \text{ROUGE-1 Precision}}{\text{ROUGE-1 Recall} + \text{ROUGE-1 Precision}}
  $$

**Example Calculation for ROUGE-1:**

Given a reference summary "The cat sat on the mat." and a candidate summary "The cat is on the mat.", calculate ROUGE-1:
- Unigrams in Reference: {The, cat, sat, on, the, mat}
- Unigrams in Candidate: {The, cat, is, on, the, mat}
- Overlap: {The, cat, on, the, mat}
- Recall: $ \frac{5}{6} $
- Precision: $ \frac{5}{6} $
- F1-Score: $ 2 \times \frac{5/6 \times 5/6}{5/6 + 5/6} = 0.833 $

### ROUGE-2
Counts the overlap of two-word sequences.
- **ROUGE-2 Recall**:
  $$
  \text{ROUGE-2 Recall} = \frac{\text{Number of overlapping bigrams}}{\text{Total bigrams in reference summary}}
  $$
- **ROUGE-2 Precision**:
  $$
  \text{ROUGE-2 Precision} = \frac{\text{Number of overlapping bigrams}}{\text{Total bigrams in candidate summary}}
  $$
- **ROUGE-2 F1-Score**:
  $$
  \text{ROUGE-2 F1-Score} = 2 \times \frac{\text{ROUGE-2 Recall} \times \text{ROUGE-2 Precision}}{\text{ROUGE-2 Recall} + \text{ROUGE-2 Precision}}
  $$

**Example Calculation for ROUGE-2:**

Using the same reference and candidate summaries:
- Bigrams in Reference: {The cat, cat sat, sat on, on the, the mat}
- Bigrams in Candidate: {The cat, cat is, is on, on the, the mat}
- Overlap: {The cat, on the, the mat}
- Recall: $ \frac{3}{5} = 0.600 $
- Precision: $ \frac{3}{5} = 0.600 $
- F1-Score: $ 2 \times \frac{0.6 \times 0.6}{0.6 + 0.6} = 0.600 $

## ROUGE-L
Measures the longest common subsequence (LCS) between the candidate and reference summaries. This captures the longest sequence of words that appear in both summaries in the same order, reflecting the importance of sentence-level structure.
- **ROUGE-L Recall**:
  $$
  \text{ROUGE-L Recall} = \frac{\text{LCS}}{\text{Total words in reference summary}}
  $$
- **ROUGE-L Precision**:
  $$
  \text{ROUGE-L Precision} = \frac{\text{LCS}}{\text{Total words in candidate summary}}
  $$
- **ROUGE-L F1-Score**:
  $$
  \text{ROUGE-L F1-Score} = 2 \times \frac{\text{ROUGE-L Recall} \times \text{ROUGE-L Precision}}{\text{ROUGE-L Recall} + \text{ROUGE-L Precision}}
  $$

**Example Calculation for ROUGE-L:**

Using the same reference and candidate summaries:
- LCS: "The cat on the mat"
- Recall: $ \frac{5}{6} \approx 0.833 $
- Precision: $ \frac{5}{6} \approx 0.833 $
- F1-Score: $ 2 \times \frac{0.833 \times 0.833}{0.833 + 0.833} = 0.833 $


In [10]:
!!pip install evaluate  # if not already installed


['Collecting evaluate',
 '  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)',
 'Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)',
 '\x1b[?25l   \x1b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\x1b[0m \x1b[32m0.0/84.0 kB\x1b[0m \x1b[31m?\x1b[0m eta \x1b[36m-:--:--\x1b[0m',
 '\x1b[2K   \x1b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\x1b[0m \x1b[32m84.0/84.0 kB\x1b[0m \x1b[31m4.1 MB/s\x1b[0m eta \x1b[36m0:00:00\x1b[0m',
 '\x1b[?25hInstalling collected packages: evaluate',
 'Successfully installed evaluate-0.4.3']

In [11]:
# Define compute_metrics function
from evaluate import load

rouge = load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode the predictions and labels
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    # Compute ROUGE scores
    rouge_output = rouge.compute(predictions=pred_str, references=label_str, use_stemmer=True)

    # Aggregate the ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in rouge_output.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in pred_ids]
    result["gen_len"] = np.mean(prediction_lens)

    return result

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [13]:
# Seq2Seq training arguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",             # Directory to save model checkpoints and logs
    # evaluation_strategy="epoch",        # Evaluate the model at the end of each epoch
    learning_rate=2e-5,                 # Learning rate for the optimizer
    per_device_train_batch_size=16,     # Batch size for training
    per_device_eval_batch_size=16,      # Batch size for evaluation
    weight_decay=0.01,                  # Weight decay for regularization
    save_total_limit=3,                 # Limit the total number of checkpoints saved
    num_train_epochs=3,                 # Number of training epochs
    predict_with_generate=True,         # Use generation mode for prediction
    generation_max_length=150,          # Maximum length for generated sequences
    generation_num_beams=4,             # Number of beams for beam search during generation
)

In [15]:
## Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,                       # The model to be trained
    args=training_args,                # Training arguments defined with Seq2SeqTrainingArguments
    train_dataset=tokenized_train,     # The training dataset
    eval_dataset=tokenized_val,        # The evaluation dataset
    data_collator=data_collator,       # The data collator for processing data batches
    tokenizer=tokenizer,               # The tokenizer used for preprocessing
    compute_metrics=compute_metrics,   # The function to compute evaluation metrics
)

  trainer = Seq2SeqTrainer(


In [16]:
# Train the model
trainer.train()



<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mr8899814[0m ([33mr8899814-no[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
500,3.4235
1000,3.1893
1500,3.1616


TrainOutput(global_step=1875, training_loss=3.2378852213541665, metrics={'train_runtime': 1433.8465, 'train_samples_per_second': 20.923, 'train_steps_per_second': 1.308, 'total_flos': 4060254044160000.0, 'train_loss': 3.2378852213541665, 'epoch': 3.0})

In [17]:
# Evaluate the model on validation set
trainer.evaluate()

# Evaluate the model on test set
test_results = trainer.evaluate(eval_dataset=tokenized_test)

print(test_results)

IndexError: piece id is out of range.

## Testing

In [20]:
import torch
# Select a specific data point from the test dataset
test_index = 0  # Change this index to the specific data point you want to summarize
example_text = """
Machine Learning (ML) is a powerful and transformative subfield of artificial intelligence (AI) that focuses on creating systems and algorithms that can learn from data, identify patterns, and make decisions or predictions with minimal human intervention. Unlike traditional programming, where a developer writes explicit instructions for every possible scenario, machine learning enables computers to learn how to perform tasks by analyzing large volumes of data and refining their performance over time. This ability to "learn" from experience makes ML particularly effective for solving complex problems that are difficult or impossible to define using fixed rules.

At its core, machine learning involves feeding data into algorithms that are designed to detect structures, correlations, and trends. These algorithms then use statistical methods to build models that can generalize from the data and apply their understanding to new, unseen information. For example, in supervised learning, a model is trained on labeled data—data that already includes the correct output—such as images of animals with tags indicating the species. The model learns to associate features in the data (like size, shape, and color) with the correct label, enabling it to classify new images with a high degree of accuracy. In contrast, unsupervised learning deals with unlabeled data and attempts to discover hidden patterns or groupings without prior guidance. Reinforcement learning, another branch, trains agents to make decisions by rewarding desirable behaviors and penalizing undesired ones, often used in robotics and game-playing AI systems.

Machine learning has a vast range of applications that are increasingly embedded in our daily lives. In healthcare, ML models are used for diagnosing diseases from medical images, predicting patient outcomes, and personalizing treatment plans. In finance, ML is instrumental in detecting fraudulent transactions, managing risk, and developing algorithmic trading strategies. In the tech industry, it powers search engines, recommendation systems, voice recognition, and natural language processing—enabling digital assistants like Google Assistant, Siri, and ChatGPT itself. Furthermore, in the automotive industry, ML plays a key role in the development of autonomous vehicles, allowing them to recognize traffic signs, detect pedestrians, and make real-time driving decisions.

One of the reasons machine learning is advancing so rapidly is the combination of growing computational power, vast amounts of data generated every day, and the development of more sophisticated algorithms. Frameworks like TensorFlow, PyTorch, and Scikit-learn have made it easier for researchers and developers to experiment, build, and deploy ML models across various platforms. However, as powerful as machine learning is, it also raises important challenges and ethical questions. Issues such as data privacy, algorithmic bias, model transparency, and the potential impact on employment and society must be addressed responsibly as the technology continues to evolve.

In conclusion, machine learning represents a fundamental shift in how we approach problem-solving and automation. By enabling machines to learn from data and improve over time, ML not only enhances efficiency and decision-making across numerous sectors but also opens up new possibilities for innovation. As research and development in this field continue to progress, the role of machine learning in shaping the future of technology, business, healthcare, and society at large will only become more significant.
"""

# Preprocess the input text
input_text = "summarize: " + example_text
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = inputs.to(device)
# Generate the summary
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Original Text:\n", example_text)
print("\nGenerated Summary:\n", summary)

Original Text:
 
Machine Learning (ML) is a powerful and transformative subfield of artificial intelligence (AI) that focuses on creating systems and algorithms that can learn from data, identify patterns, and make decisions or predictions with minimal human intervention. Unlike traditional programming, where a developer writes explicit instructions for every possible scenario, machine learning enables computers to learn how to perform tasks by analyzing large volumes of data and refining their performance over time. This ability to "learn" from experience makes ML particularly effective for solving complex problems that are difficult or impossible to define using fixed rules.

At its core, machine learning involves feeding data into algorithms that are designed to detect structures, correlations, and trends. These algorithms then use statistical methods to build models that can generalize from the data and apply their understanding to new, unseen information. For example, in supervise