<a href="https://colab.research.google.com/github/tmskss/ManPageSum/blob/main/colab/ModelComparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare T5 and CodeT5+ with ROUGE
### This notebook uses ROUGE to compare the T5 and CodeT5+ models on the Linux man pages tldr dataset, to help choose the better-fitting model for summarization

In [1]:
!git clone https://github.com/tmskss/ManPageSum.git
%cd ManPageSum/colab
from install import *
install_requirements()

Cloning into 'ManPageSum'...
remote: Enumerating objects: 1199, done.
remote: Counting objects: 100% (1199/1199), done.
remote: Compressing objects: 100% (923/923), done.
remote: Total 1199 (delta 287), reused 1182 (delta 275), pack-reused 0
Receiving objects: 100% (1199/1199), 9.18 MiB | 17.28 MiB/s, done.
Resolving deltas: 100% (287/287), done.
/content/ManPageSum/colab
⏳ Installing base requirements ...
✅ Base requirements installed!
✅ Summary requirements installed!


In [2]:
from utils import *
setup_chapter()

Using transformers v4.32.1
Using datasets v2.0.0


In [3]:
from transformers import pipeline, set_seed

## Get the dataset

In [4]:
from datasets import load_dataset

dataset = load_dataset("tmskss/linux-man-pages-tldr-summarized")
print(f"Features: {dataset['train'].column_names}")

Downloading and preparing dataset csv/tmskss--linux-man-pages-tldr-summarized to /root/.cache/huggingface/datasets/csv/tmskss--linux-man-pages-tldr-summarized-ae8bc80ae0d1d6c6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/tmskss--linux-man-pages-tldr-summarized-ae8bc80ae0d1d6c6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


Features: ['Command', 'Text', 'Summary']


Pick one sample to test the pretrained models on:

In [5]:
sample = dataset["train"][0]
print(f"""
Man page content (excerpt of 500 characters, total length: {len(sample["Text"])}):
""")
print(sample["Text"][:500])
print(f'\nSummary (length: {len(sample["Summary"])}):')
print(sample["Summary"])


Man page content (excerpt of 500 characters, total length: 2538):

 The chgrp utility shall set the group ID of the file named by each file operand
to the group ID specified by the group operand. For each file operand, or, if
the -R option is used, each file encountered while walking the directory trees
specified by the file operands, the chgrp utility shall perform actions
equivalent to the chown() function defined in the System Interfaces volume of
POSIX.1‐2017, called with the following arguments: * The file operand shall be
used as the path argument. * The

Summary (length: 574):
# chgrp
> Change group ownership of files and directories. More information:
> https://www.gnu.org/software/coreutils/chgrp.
  * Change the owner group of a file/directory:
`chgrp {{group}} {{path/to/file_or_directory}}`
  * Recursively change the owner group of a directory and its contents:
`chgrp -R {{group}} {{path/to/directory}}`
  * Change the owner group of a symbolic link:
`chgrp -h {{group}} {{pat

## Text summarization pipelines

In [6]:
sample_text = dataset["train"][0]["Text"][:2000]
# We'll collect the generated summaries of each model in a dictionary
summaries = {}

In [7]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Summarization baseline

In [8]:
def three_sentence_summary(text):
    # Baseline: use the first three sentences of the text as its summary
    return "\n".join(sent_tokenize(text)[:3])

In [9]:
summaries["baseline"] = three_sentence_summary(sample_text)

In [10]:
print(summaries["baseline"])

 The chgrp utility shall set the group ID of the file named by each file operand
to the group ID specified by the group operand.
For each file operand, or, if the -R option is used, each file encountered while
walking the directory trees specified by the file operands, the chgrp utility
shall perform actions equivalent to the chown() function defined in the System
Interfaces volume of POSIX.1‐2017, called with the following arguments: * The
file operand shall be used as the path argument.
* The user ID of the file shall be used as the owner argument.


## Flan-T5

In [11]:
pipe = pipeline("summarization", model="google/flan-t5-base")
pipe_out = pipe(sample_text)
summaries["flan-t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

In [12]:
print(summaries["flan-t5"])

Set the group ID of the file named by each file operand.
Set the set-user-ID and set-group-ID bits of a regular file.
Confirm the implementation of the chgrp utility.


## BART Large

In [23]:
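from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
# BartModel is the bare encoder-decoder without a language-modeling head,
# so calling it returns hidden states, not generated text
model = BartModel.from_pretrained('facebook/bart-large')

inputs = tokenizer(sample_text, return_tensors="pt")
outputs = model(**inputs)

# Iterating over the model output yields its field names, so this stores
# those names under the "codet5p" key rather than a BART summary
summaries["codet5p"] = "\n".join(outputs)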

In [24]:
print(summaries["codet5p"])

last_hidden_state
past_key_values
encoder_last_hidden_state
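
The printed value above is a list of output field names, not a summary, because `BartModel` has no generation head. A minimal sketch of how a BART summary could be generated instead is shown here. It assumes the `facebook/bart-large-cnn` checkpoint (a BART Large variant fine-tuned for summarization, not the `facebook/bart-large` checkpoint loaded above), and the helper name `summarize_with_bart` and its default parameters are hypothetical choices for illustration.

```python
def summarize_with_bart(text, model_name="facebook/bart-large-cnn", max_new_tokens=128):
    """Generate an abstractive summary with a BART checkpoint that has an LM head."""
    # Imported inside the function so the heavy dependency loads only on use.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(model_name)
    # Unlike BartModel, this class adds the language-modeling head
    # that generate() needs to produce token ids.
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # BART handles at most 1024 input positions, so longer texts are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=4,  # assumed beam width, not taken from the notebook
        max_new_tokens=max_new_tokens,
    )
    # Decode the best beam back into plain text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize_with_bart(sample_text)` would return a single string, which could then be split with `sent_tokenize` in the same way as the Flan-T5 output before being stored in `summaries`.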


## Comparing results

In [16]:
print("GROUND TRUTH")
print(dataset["train"][0]["Summary"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

GROUND TRUTH
# chgrp
> Change group ownership of files and directories. More information:
> https://www.gnu.org/software/coreutils/chgrp.
  * Change the owner group of a file/directory:
`chgrp {{group}} {{path/to/file_or_directory}}`
  * Recursively change the owner group of a directory and its contents:
`chgrp -R {{group}} {{path/to/directory}}`
  * Change the owner group of a symbolic link:
`chgrp -h {{group}} {{path/to/symlink}}`
  * Change the owner group of a file/directory to match a reference file:
`chgrp --reference={{path/to/reference_file}} {{path/to/file_or_directory}}`

BASELINE
 The chgrp utility shall set the group ID of the file named by each file operand
to the group ID specified by the group operand.
For each file operand, or, if the -R option is used, each file encountered while
walking the directory trees specified by the file operands, the chgrp utility
shall perform actions equivalent to the chown() function defined in the System
Interfaces volume of POSIX.1‐2017, 

## Measuring quality with ROUGE

In [18]:
from datasets import load_metric

rouge_metric = load_metric("rouge")

In [22]:
import pandas as pd

reference = dataset["train"][0]["Summary"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    # Score each model's summary against the single reference summary
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    # Keep the F-measure of the mid (central) estimate for each ROUGE variant
    rouge_dict = {rn: score[rn].mid.fmeasure for rn in rouge_names}
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.27027 | 0.010929 | 0.205405 | 0.237838 |
| flan-t5 | 0.25 | 0.016949 | 0.216667 | 0.233333 |
| codet5p | 0.008245 | 0.0 | 0.008245 | 0.007067 |
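
To make the columns above easier to interpret, here is a simplified sketch of what ROUGE-N and ROUGE-L measure: the F-measure of n-gram overlap and of the longest common subsequence between a prediction and a reference. It assumes plain lowercase whitespace tokenization, so its numbers may differ from those of the `rouge` metric used in this notebook, which applies its own tokenization. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F-measure based on clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F-measure based on the longest common subsequence."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    return f_measure(lcs_length(pred_tokens, ref_tokens), len(pred_tokens), len(ref_tokens))
```

For example, `rouge_n("change group of file", "change the group")` shares two unigrams with the reference, giving precision 2/4 and recall 2/3, while `rouge_n(..., n=2)` on the same pair is 0 because no bigram matches. This is why the rouge2 column tends to be much lower than rouge1 when wording differs.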
