<a href="https://colab.research.google.com/github/samyarsworld/text-summarization-NLP/blob/main/text_summarization_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarizer Project

## Install dependencies:
- transformers provides the models used in this project; of the many available, Pegasus (hosted on Hugging Face) is used.
- sacrebleu implements BLEU (Bilingual Evaluation Understudy), a metric for evaluating the quality of machine-translated text.
- rouge_score implements the ROUGE metrics used to evaluate machine-generated text.
- datasets will be used to load and preprocess the dataset.
- SentencePiece is an unsupervised text tokenizer and detokenizer mainly used for neural network-based text generation tasks. [sentencepiece] is an optional extra of the Transformers package that installs it alongside the library.
- py7zr handles the 7z format, a high-compression file archive format commonly associated with the 7-Zip file archiver.

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr

# The Trainer class needs a recent accelerate release (0.20 or newer)
!pip install accelerate -U
# -y skips the interactive confirmation prompt, which would otherwise block the cell
!pip uninstall -y transformers accelerate
!pip install transformers accelerate
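
The version requirement mentioned in the cell above can be checked before training. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` and the `(0, 20)` threshold are illustrative assumptions, not part of the original notebook.

```python
import re
from importlib import metadata


def meets_minimum(version_string, minimum=(0, 20)):
    """Return True if the leading numeric parts of version_string are >= minimum."""
    parts = []
    for piece in version_string.split(".")[: len(minimum)]:
        # Keep only the leading digits so suffixes such as "rc1" are ignored.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts) >= tuple(minimum)


try:
    installed = metadata.version("accelerate")
    status = "OK" if meets_minimum(installed) else "too old for Trainer"
    print("accelerate", installed, status)
except metadata.PackageNotFoundError:
    print("accelerate is not installed")
```

If the check reports an old version, restarting the Colab runtime after reinstalling is usually needed for the new package to be picked up.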

## Initialization:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(DEVICE)

## Download and load the data:

The SAMSum dataset, available on Hugging Face, is used. The original paper can be found at https://arxiv.org/abs/1911.12237v2

In [5]:
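import requests

# Fetch the SAMSum dataset loading script from the Hugging Face Hub
url = "https://huggingface.co/datasets/samsum/raw/main/samsum.py"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
data_path = "samsum.py"
with open(data_path, "wb") as f:
    f.write(response.content)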

## Preprocess the data:

Tokenize inputs and outputs.

In [6]:
def tokenize(batch):
    # Tokenize the dialogues (model inputs), truncating to 1024 tokens
    input_encodings = tokenizer(batch['dialogue'], max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, truncating to 128 tokens;
    # text_target replaces the deprecated as_target_tokenizer() context manager
    target_encodings = tokenizer(text_target=batch['summary'], max_length=128, truncation=True)

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

In [7]:
from datasets import load_dataset

dataset = load_dataset(data_path)
dataset = dataset.map(tokenize, batched = True)

Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

## Train the model:

In [8]:
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)


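# Note: this run fine-tunes on the small "test" split (819 examples), which keeps it short;
# use dataset["train"] for a full fine-tuning run.
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset["test"],
                  eval_dataset=dataset["validation"])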


trainer.train()

You're using a PegasusTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss


TrainOutput(global_step=51, training_loss=3.0754146295435287, metrics={'train_runtime': 178.9453, 'train_samples_per_second': 4.577, 'train_steps_per_second': 0.285, 'total_flos': 313317832187904.0, 'train_loss': 3.0754146295435287, 'epoch': 1.0})
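
The logged `global_step=51` follows from the batch settings. With `per_device_train_batch_size=1` and `gradient_accumulation_steps=16`, each optimizer update consumes 16 examples, and the run processed about 819 examples (4.577 samples per second over 178.9 seconds). The sketch below reproduces that arithmetic. It assumes a single device and that a trailing, incomplete accumulation group is not counted as a step, which matches this log but may differ between library versions. The helper `optimizer_steps` is hypothetical.

```python
def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs=1, num_devices=1):
    """Estimate the number of optimizer updates in a training run.

    Assumes a trailing, incomplete accumulation group is not counted as a step.
    """
    # Batches yielded per epoch (ceiling division; the last batch may be partial).
    batches_per_epoch = -(-num_examples // (per_device_batch_size * num_devices))
    # Each optimizer update consumes grad_accum_steps batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


# Values from the run above: 819 examples, batch size 1, 16 accumulation steps, 1 epoch.
print(optimizer_steps(819, 1, 16))  # prints 51
```

Because `warmup_steps=500` and `eval_steps=500` both exceed 51, this short run ends while the learning rate is still warming up and before any evaluation is triggered, which is why the Step/Validation Loss table is empty.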

## Evaluate the model:
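
Summarization quality is commonly measured with ROUGE, which the rouge_score package installed above implements. To illustrate what the metric measures, here is a simplified pure-Python sketch of ROUGE-1 (unigram overlap) F1. It is a stand-in for illustration only, not the rouge_score implementation: it uses naive whitespace tokenization and no stemming, and the function name `rouge1_f1` and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap between a reference and a candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(round(rouge1_f1(reference, candidate), 3))  # prints 0.833
```

For reported results, the scores should come from an established implementation such as rouge_score, computed over generated summaries for the held-out split.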

## Save the model and tokenizer:

In [11]:

# Save the fine-tuned model
model_pegasus.save_pretrained("/content/pegasus-samsum-model")

# Save the tokenizer
tokenizer.save_pretrained("/content/tokenizer")

('/content/tokenizer/tokenizer_config.json',
 '/content/tokenizer/special_tokens_map.json',
 '/content/tokenizer/spiece.model',
 '/content/tokenizer/added_tokens.json',
 '/content/tokenizer/tokenizer.json')

In [13]:
!zip -r /content/pegasus-samsum2.zip /content/pegasus-samsum-model
!zip -r /content/tokenizer2.zip /content/tokenizer

from google.colab import files
# Download the archives created above (the file names must match the zip commands)
files.download("/content/pegasus-samsum2.zip")
files.download("/content/tokenizer2.zip")


  adding: content/pegasus-samsum-model/ (stored 0%)
  adding: content/pegasus-samsum-model/model.safetensors (deflated 7%)
  adding: content/pegasus-samsum-model/generation_config.json (deflated 45%)
  adding: content/pegasus-samsum-model/config.json (deflated 60%)
  adding: content/tokenizer/ (stored 0%)
  adding: content/tokenizer/tokenizer.json (deflated 78%)
  adding: content/tokenizer/tokenizer_config.json (deflated 94%)
  adding: content/tokenizer/spiece.model (deflated 50%)
  adding: content/tokenizer/special_tokens_map.json (deflated 82%)



## Load the model:

In [None]:
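# Load the saved tokenizer and fine-tuned model from the directories written above
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained("/content/pegasus-samsum-model").to(DEVICE)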


## Make predictions:
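
One possible way to generate a summary from the saved model is sketched below. It assumes the model and tokenizer directories written by the save cell above. The helper `summarize` and the generation settings in `DEFAULT_GEN_KWARGS` (beam count and maximum length) are illustrative assumptions rather than values from this notebook, and the example dialogue is made up.

```python
# Hypothetical default generation settings; max_length mirrors the 128-token
# summary limit used during preprocessing.
DEFAULT_GEN_KWARGS = {"num_beams": 4, "max_length": 128}


def summarize(dialogue, model_dir, tokenizer_dir, device="cpu", **gen_kwargs):
    """Generate a summary for one dialogue with a saved seq2seq model."""
    # Imported lazily so the helper can be defined without loading the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device)

    # Truncate long dialogues to the same 1024-token limit used in preprocessing.
    inputs = tokenizer(dialogue, max_length=1024, truncation=True,
                       return_tensors="pt").to(device)
    settings = {**DEFAULT_GEN_KWARGS, **gen_kwargs}
    output_ids = model.generate(**inputs, **settings)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (paths follow the save cell above):
# print(summarize("Anna: Lunch at noon?\nBen: Sure, see you then.",
#                 "/content/pegasus-samsum-model", "/content/tokenizer"))
```

Loading the model inside the helper keeps the sketch self-contained; when summarizing many dialogues, loading the model once and reusing it avoids repeated start-up cost.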