<a href="https://colab.research.google.com/github/rodrigorcarmo/multi_agent_chatbot/blob/main/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install transformers datasets evaluate rouge_score nltk

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)

# Summarization

This notebook fine-tunes the `t5-small` model from the Hugging Face Hub on a summarization task. It follows the steps of the summarization guide on the Hugging Face website; the goal was to get an overview of the library and study its capabilities.

In [2]:
# Mounting the Google Drive to access the customer feedback dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Split the dataset into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [None]:
billsum = billsum.train_test_split(test_size=0.2)
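The resulting split sizes can be sanity-checked with plain Python. The sketch below is not the 🤗 Datasets implementation. It assumes the library shuffles the rows and rounds the test share up, and it takes the total of 1,237 rows from the 989 + 248 examples reported by the `map` progress bars further down in this notebook. `toy_train_test_split` is a hypothetical helper.

```python
import math
import random


def toy_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle `rows` and split them into train/test lists (illustrative only)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Assumption: the test share is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * len(shuffled))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


split = toy_train_test_split(range(1237), test_size=0.2)
print(len(split["train"]), len(split["test"]))
```

In the real call, passing `seed=` to `train_test_split` makes the split reproducible across runs.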

Then take a look at an example:

In [None]:
billsum["train"][0]

 'summary': 'Existing law, the Swimming Pool Safety Act, provides that it does not apply to any pool within the jurisdiction of any political subdivision that adopts an ordinance for swimming pools, as specified. The act further requires, when a building permit is issued for construction of a new swimming pool or spa, or the remodeling of an existing pool or spa, at a private, single-family home, that the pool or spa be equipped with at least 1 of 7 drowning prevention safety features. The act requires the local building code official to inspect and approve the drowning safety prevention devices before the issuance of a final approval for the completion of permitted construction or remodeling work.\nThis bill would instead require, when a building permit is issued, that the pool or spa be equipped with at least 2 of the 7 drowning prevention safety features. By imposing additional duties on local officials, this bill would impose a state-mandated local program. The bill would remove th

## Preprocess

The next step is to load a T5 tokenizer to process `text` and `summary`:

In [None]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [None]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
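With `batched=True`, the function receives a dictionary of lists (one list per column) rather than a single row, which is why `preprocess_function` iterates over `examples["text"]`. A minimal pure-Python sketch of that calling convention follows. `toy_batched_map` and `add_prefix` are hypothetical stand-ins, not the library's implementation, and the sketch keeps only the columns the function returns.

```python
def toy_batched_map(rows, fn, batch_size=2):
    """Call `fn` on column-wise batches and flatten the results back into rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Turn the dict of column lists back into row dicts.
        n = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n))
    return out


def add_prefix(examples):
    # `examples["text"]` is a list of strings, one per row in the batch.
    return {"text": ["summarize: " + doc for doc in examples["text"]]}


rows = [{"text": "bill A"}, {"text": "bill B"}, {"text": "bill C"}]
print(toy_batched_map(rows, add_prefix))
```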

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
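To make dynamic padding concrete, here is a pure-Python sketch of what a collator does for one batch. It is a simplified stand-in for `DataCollatorForSeq2Seq`, not its implementation. It assumes a pad token id of 0 for the inputs and -100 for the labels, the value that the loss ignores and that `compute_metrics` swaps back to the pad token before decoding. `pad_batch` is a hypothetical helper.

```python
def pad_batch(features, pad_id=0, label_pad_id=-100):
    """Pad each field to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - len(f["labels"])))
    return batch


features = [
    {"input_ids": [21, 7, 9, 1], "labels": [5, 1]},
    {"input_ids": [21, 3, 1], "labels": [8, 4, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21, 7, 9, 1], [21, 3, 1, 0]]
print(batch["labels"])     # [[5, 1, -100], [8, 4, 1]]
```

Because the padded length depends only on the current batch, batches of short documents stay short instead of being padded to the dataset-wide maximum.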

## Evaluate

In [None]:
import evaluate

rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the ROUGE metric:

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
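ROUGE-1 scores the unigram overlap between a generated summary and its reference. As a rough illustration of what the numbers mean, the sketch below computes a ROUGE-1 F1 score in plain Python. It is a simplification, not the `rouge_score` implementation: it splits on whitespace and skips the stemming that `use_stemmer=True` enables. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped matches: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill requires two safety features",
    "this bill would require two safety features",
)
print(round(score, 4))
```

Here four unigrams match (`bill`, `two`, `safety`, `features`), giving precision 4/6, recall 4/7, and an F1 of 8/13. With stemming enabled, `requires` and `require` would likely also count as a match.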

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load T5 with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="billsum_t5-model_summarization",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



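| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | No log | 2.506792 | 0.1506 | 0.0575 | 0.1233 | 0.1232 | 19.0 |
| 2 | No log | 2.449792 | 0.1644 | 0.0671 | 0.1359 | 0.1357 | 19.0 |
| 3 | No log | 2.420736 | 0.1836 | 0.0826 | 0.1521 | 0.1522 | 19.0 |
| 4 | No log | 2.405171 | 0.1889 | 0.0896 | 0.1574 | 0.1573 | 19.0 |
| 5 | No log | 2.399938 | 0.1896 | 0.0907 | 0.1582 | 0.1581 | 19.0 |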




TrainOutput(global_step=310, training_loss=2.6301684964087704, metrics={'train_runtime': 349.6236, 'train_samples_per_second': 14.144, 'train_steps_per_second': 0.887, 'total_flos': 1338530416558080.0, 'train_loss': 2.6301684964087704, 'epoch': 5.0})
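The figures in `TrainOutput` can be reproduced from the training arguments with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and takes the 989 training examples from the `map` progress bar shown earlier.

```python
import math

train_examples = 989      # from the `map` progress bar for the train split
batch_size = 16           # per_device_train_batch_size
epochs = 5                # num_train_epochs
train_runtime = 349.6236  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)   # 62 310
print(round(samples_per_second, 3))   # 14.144
print(round(steps_per_second, 3))     # 0.887
```

The 310 total steps also explain the `No log` entries in the Training Loss column: with the default `logging_steps` of 500, the run ends before the first training-loss log is written.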

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/rodrigorcarmo/billsum_t5-model_summarization/commit/b58f81b0f1b91bf32cc4495f86f011cc26be3756', commit_message='End of training', commit_description='', oid='b58f81b0f1b91bf32cc4495f86f011cc26be3756', pr_url=None, pr_revision=None, pr_num=None)

In [48]:
from transformers import pipeline
import pandas as pd

# Load the fine-tuned checkpoint that was pushed to the Hub above
summarizer = pipeline(
    "summarization",
    model="rodrigorcarmo/billsum_t5-model_summarization",
    device_map="auto",
)
tokenizer_kwargs = {"truncation": True, "max_length": 512}

# The customer feedback dataset is a semicolon-separated CSV on Google Drive
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/Customer_Feedback_Dataset.csv", sep=";")

# Join all comments of each sentiment into a single input for the summarizer
positive_texts = df[df["sentiment"] == "positive"]["feedback_text"].tolist()
positive = "summarize: " + " ".join(positive_texts)
negative_texts = df[df["sentiment"] == "negative"]["feedback_text"].tolist()
negative = "summarize: " + " ".join(negative_texts)

In [51]:
negative

'summarize: Great product, but the delivery was late. Excellent value for money! Received a defective item. Easy to use website and quick checkout. The delivery was fast and the product is good. Great product, but the delivery was late. Not satisfied with the service. Excellent value for money! Not satisfied with the service. Excellent value for money! Excellent value for money! I had issues with the website. Great product, but the delivery was late. The customer service was very helpful. The pricing is too high for what you get. Not satisfied with the service. Excellent value for money! Easy to use website and quick checkout. Great product, but the delivery was late. Not satisfied with the service. The delivery was fast and the product is good. The customer service was very helpful. The delivery was fast and the product is good. I had issues with the website. The delivery was fast and the product is good. Great product, but the delivery was late. Not satisfied with the service. Receiv
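Joining every comment into one string produces an input far longer than the model can read at once, so with `truncation=True` only the beginning of the text reaches the model. One possible workaround, sketched below under the assumption that a rough word count is an acceptable proxy for token count, is to group comments into smaller chunks and summarize each chunk separately. `chunk_texts` and the `max_words` budget are hypothetical and not part of the original notebook.

```python
def chunk_texts(texts, max_words=300):
    """Greedily pack texts into chunks that stay within `max_words` words.

    A single text longer than the budget is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for text in texts:
        n_words = len(text.split())
        # Start a new chunk when adding this text would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(text)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


feedback = [
    "Received a defective item.",
    "Not satisfied with the service.",
    "I had issues with the website.",
]
chunks = chunk_texts(feedback, max_words=10)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to the summarizer on its own, for example `summarizer(chunk, truncation=True)`, and the partial summaries reviewed or combined afterwards.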

In [3]:
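# Needs `summarizer`, `negative`, and `tokenizer_kwargs` from the pipeline cell above in the current runtime
summarizer(negative, **tokenizer_kwargs)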

NameError: name 'summarizer' is not defined