<a href="https://colab.research.google.com/github/rodrigorcarmo/multi_agent_chatbot/blob/main/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install transformers datasets evaluate rouge_score nltk

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)

# Summarization

This notebook fine-tunes the `t5-small` model from the Hugging Face Hub, following the same steps as the summarization guide on the Hugging Face website. The goal is an abstractive summarizer that can condense the reviews that appear repeatedly in the customer feedback dataset.

In [2]:
# Mounting the Google Drive to access the customer feedback dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Logging on the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


### Dataset
For this task, the dataset is BillSum, a dataset for summarizing US Congressional and California state bills. Only the `ca_test` split (1,237 California bills) is loaded here.

In [4]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.27k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/91.8M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/15.8M [00:00<?, ?B/s]

ca_test-00000-of-00001.parquet:   0%|          | 0.00/6.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

Splitting the dataset into a train and test set

In [5]:
billsum = billsum.train_test_split(test_size=0.2)
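
The split sizes that appear later in the `map` progress bars (989 and 248 examples) follow from the 1,237 rows of `ca_test`. A minimal sketch of the arithmetic, assuming the test count is rounded up and the remainder goes to the training set:

```python
import math

total_examples = 1237   # size of the ca_test split
test_size = 0.2         # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(total_examples * test_size)
n_train = total_examples - n_test

print(n_train, n_test)
```

Because the split is random, rerunning the cell gives different train and test rows; passing a `seed` argument to `train_test_split` would make it reproducible.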

A sample from the dataset

In [6]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 6253 of the Government Code is amended to read:\n6253.\n(a) Public records are open to inspection at all times during the office hours of the state or local agency and every person has a right to inspect any public record, except as hereafter provided. Any reasonably segregable portion of a record shall be available for inspection by any person requesting the record after deletion of the portions that are exempted by law.\n(b) Except with respect to public records exempt from disclosure by express provisions of law, each state or local agency, upon a request for a copy of records that reasonably describes an identifiable record or records, shall make the records promptly available to any person upon payment of fees covering direct costs of duplication, or a statutory fee if applicable. Upon request, an exact copy shall be provided unless impracticable to do so.\n(c) Each agency, upon a request

## Preprocess

The model chosen for fine-tuning is Google's `t5-small`. Besides being very small (~60M parameters), it was designed for text-to-text activities, where both the input and the output are text strings.

Loading the T5 tokenizer to process `text` and `summary`

In [7]:
from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

T5 is a multi-task model, so the input needs the prefix `summarize: ` to tell it which task to perform.

In [8]:
def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
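
To make the shape of the data concrete: with `batched=True`, `map` hands the function a dict of columns (lists), not a single row. The sketch below mimics that contract without downloading anything. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins (one token per whitespace-separated word), not the real T5 tokenizer, and the tiny length limits are chosen only so truncation is visible.

```python
def toy_tokenize(text, max_length):
    """Hypothetical stand-in for the tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]


def toy_preprocess(examples, max_input=8, max_target=4):
    # Same structure as preprocess_function: prefix every document, then tokenize
    inputs = ["summarize: " + doc for doc in examples["text"]]
    return {
        "input_ids": [toy_tokenize(doc, max_input) for doc in inputs],
        "labels": [toy_tokenize(summary, max_target) for summary in examples["summary"]],
    }


# A batch is a dict of lists, one list per column
batch = {
    "text": [
        "Section 6253 of the Government Code is amended to read as follows",
        "Public records are open",
    ],
    "summary": ["Amends the public records act provisions", "Open records"],
}

out = toy_preprocess(batch)
for tokens, labels in zip(out["input_ids"], out["labels"]):
    print(len(tokens), tokens[0], "|", len(labels))
```

The first document is cut to the input limit while the second keeps its natural length, which is why padding is still needed later when examples are batched.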

In [9]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
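
The collator pads each batch to its longest sequence at batch time instead of padding the whole dataset up front. Labels are padded with `-100`, a value the loss ignores. The sketch below reproduces that idea in plain Python. It is an illustration, not the library's implementation: `pad_batch` is a hypothetical helper and the token ids are made up.

```python
PAD_TOKEN_ID = 0      # pad id of the T5 tokenizer
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad every sequence in the batch to the longest one (dynamic padding)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_inputs - len(f["input_ids"])
        pad_lab = max_labels - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * pad_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * pad_lab)
    return batch


# Made-up token ids, for illustration only
features = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
]
batch = pad_batch(features)

# Before decoding, -100 has to be mapped back to a real token id
decodable = [
    [t if t != LABEL_PAD_ID else PAD_TOKEN_ID for t in row]
    for row in batch["labels"]
]
print(batch["labels"])
print(decodable)
```

The last step mirrors the `np.where(labels != -100, ...)` line in the metric function of the Evaluate section: `-100` is not a valid token id, so it is swapped for the pad token before `batch_decode`.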

## Evaluate

The function below computes the ROUGE metrics between the generated summaries and the reference ones.

In [11]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
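
For intuition about what the `rouge1` number measures, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It is not the `rouge_score` implementation: it tokenizes on whitespace only and skips the stemming that `use_stemmer=True` enables, and `rouge1_f1` is a name chosen for this example.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the delivery was late",
    "the delivery was fast and the product is good",
)
print(round(score, 4))  # 3 shared words, 4 predicted, 9 reference -> 6/13
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L uses the longest common subsequence instead of n-gram counts.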

## Train

For training, the hyperparameters are the same as those provided in the Hugging Face guide, with a few more epochs to see whether the ROUGE metrics would improve.

In [13]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="billsum_t5-model_summarization",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



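| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | No log | 2.506792 | 0.1506 | 0.0575 | 0.1233 | 0.1232 | 19.0 |
| 2 | No log | 2.449792 | 0.1644 | 0.0671 | 0.1359 | 0.1357 | 19.0 |
| 3 | No log | 2.420736 | 0.1836 | 0.0826 | 0.1521 | 0.1522 | 19.0 |
| 4 | No log | 2.405171 | 0.1889 | 0.0896 | 0.1574 | 0.1573 | 19.0 |
| 5 | No log | 2.399938 | 0.1896 | 0.0907 | 0.1582 | 0.1581 | 19.0 |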




TrainOutput(global_step=310, training_loss=2.6301684964087704, metrics={'train_runtime': 349.6236, 'train_samples_per_second': 14.144, 'train_steps_per_second': 0.887, 'total_flos': 1338530416558080.0, 'train_loss': 2.6301684964087704, 'epoch': 5.0})
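
The numbers in `TrainOutput` can be cross-checked against the run configuration. The sketch below assumes a single device and no gradient accumulation, so one optimizer step consumes one batch of 16 examples.

```python
import math

train_examples = 989      # size of the training split
batch_size = 16           # per_device_train_batch_size
epochs = 5                # num_train_epochs
train_runtime = 349.6236  # seconds, from TrainOutput

# Assumption: one device, no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With only 310 steps in total, the run likely never reaches the default logging interval of 500 steps, which would explain why the Training Loss column shows `No log`; a smaller `logging_steps` would populate it.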

Over the five epochs, the ROUGE scores improve steadily (ROUGE-1 goes from 0.1506 to 0.1896) but remain low. The strategy here was not to fine-tune the model until it reached strong scores, but to get a working tool for the multi-agent system, test that system's design, and only then refine this model further.
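
To put the gains in perspective, the relative change between the first and the last epoch can be computed directly from the results reported for this run. This is only arithmetic on those reported values:

```python
# (epoch 1, epoch 5) values copied from the training results
scores = {
    "rouge1": (0.1506, 0.1896),
    "rouge2": (0.0575, 0.0907),
    "rougeL": (0.1233, 0.1582),
}

relative_gain = {
    name: (last - first) / first for name, (first, last) in scores.items()
}

for name, gain in relative_gain.items():
    print(f"{name}: {gain:.1%}")
```

The per-epoch improvement also shrinks after epoch 3, which suggests that simply adding epochs with the same settings would bring limited further gains.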

Pushing the model to the hub to use it on the multi agent system

In [None]:
trainer.push_to_hub()

events.out.tfevents.1728933596.e935b70787ed.2230.2:   0%|          | 0.00/8.91k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rodrigorcarmo/billsum_t5-model_summarization/commit/b58f81b0f1b91bf32cc4495f86f011cc26be3756', commit_message='End of training', commit_description='', oid='b58f81b0f1b91bf32cc4495f86f011cc26be3756', pr_url=None, pr_revision=None, pr_num=None)

### Evaluating the summarizer on the customer dataset

The approach was to concatenate the reviews and inspect the summary generated by the model:
* First, aggregate all the feedback entries that share the same sentiment label
* Then, pass the aggregated text to the summarizer and examine the output

In [14]:
from transformers import pipeline
import pandas as pd

# Load the fine-tuned model that was pushed to the Hub
summarizer = pipeline(
    "summarization",
    model="rodrigorcarmo/billsum_t5-model_summarization",
    device_map="auto",
)
tokenizer_kwargs = {'truncation': True, 'max_length': 512}

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/Customer_Feedback_Dataset.csv', sep=';')

# Concatenate all feedback with the same sentiment label into a single input
positive_texts = df[df['sentiment'] == 'positive']['feedback_text'].tolist()
positive = 'summarize: ' + ' '.join(positive_texts)
negative_texts = df[df['sentiment'] == 'negative']['feedback_text'].tolist()
negative = 'summarize: ' + ' '.join(negative_texts)

config.json:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.7k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Example of the concatenated negative feedback

In [15]:
negative

'summarize: Great product, but the delivery was late. Excellent value for money! Received a defective item. Easy to use website and quick checkout. The delivery was fast and the product is good. Great product, but the delivery was late. Not satisfied with the service. Excellent value for money! Not satisfied with the service. Excellent value for money! Excellent value for money! I had issues with the website. Great product, but the delivery was late. The customer service was very helpful. The pricing is too high for what you get. Not satisfied with the service. Excellent value for money! Easy to use website and quick checkout. Great product, but the delivery was late. Not satisfied with the service. The delivery was fast and the product is good. The customer service was very helpful. The delivery was fast and the product is good. I had issues with the website. The delivery was fast and the product is good. Great product, but the delivery was late. Not satisfied with the service. Receiv

The summarizer was able to condense all the feedback by removing the duplicates automatically. That was the idea: the provided data was not very helpful, but the task itself could be replicated for other purposes.

In [16]:
summarizer(negative,**tokenizer_kwargs)

[{'summary_text': 'Great product, but the delivery was late. Not satisfied with the service. Excellent value for money! Easy to use website and quick checkout. The delivery was fast and the product is good. The customer service was very helpful. The pricing is too high for what you get.'}]
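
Since the summary consists of sentences copied from the input, a deterministic baseline is useful for comparison. The sketch below is not what the model does internally: it splits the text into sentences with a simple regex and keeps the first occurrence of each. The helper name `unique_sentences` is an assumption for illustration, and the sample reuses the opening sentences of the negative set shown earlier.

```python
import re


def unique_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation and drop repeats, keeping first-seen order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(s for s in sentences if s))


sample = (
    "Great product, but the delivery was late. Excellent value for money! "
    "Received a defective item. Great product, but the delivery was late. "
    "Excellent value for money!"
)

deduped = unique_sentences(sample)
print(deduped)
```

Comparing such a baseline with the model output shows which sentences the summarizer dropped or reordered, which an exact deduplication would never do.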