In [1]:
!pip install datasets evaluate rouge_score
!pip install --upgrade transformers accelerate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=43692f8c45a3d5c68c701eda63b107e7cbff3f198e4eb98c353b63d48a0615d4
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.1 rouge_score-0.1.2
Collecting transformers
  Downloading transformers-4.39.2-py3-none-any.whl (8.8 MB)

# Hyperparameters

In [2]:

# Set hyperparameters
BATCH_SIZE = 4
NUM_TRAIN_EPOCHS = 3
LEARNING_RATE = 3e-5
WEIGHT_DECAY = 0.01
MAX_SOURCE_LENGTH = 512
MAX_TARGET_LENGTH = 64
TRAINING_DATASET_SIZE = 100
TESTING_DATASET_SIZE = 10

DATASET_PATH = "/kaggle/input/billsum-processed-train/ustrain_processed.csv"
TEST_DATASET_PATH = "/kaggle/input/billsum-processed-train/ustest_processed.csv"

# Change this path before changing any hyperparameter: if the directory already
# exists, the saved checkpoint is reused and training is skipped.
OUTPUT_DIR_CHECKPOINT = "/kaggle/working/model_testing_10k_6_14_2023_1"
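
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook); with several GPUs the effective batch per step grows and the step count shrinks accordingly.

```python
import math

# Values copied from the hyperparameter cell above.
BATCH_SIZE = 4
NUM_TRAIN_EPOCHS = 3
TRAINING_DATASET_SIZE = 100

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(TRAINING_DATASET_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_TRAIN_EPOCHS

print(steps_per_epoch, total_steps)  # 25 75
```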

# Load BillSum dataset
Load the BillSum train and test CSV files, then keep the first `TRAINING_DATASET_SIZE` training examples and the first `TESTING_DATASET_SIZE` test examples.

In [3]:
from datasets import load_dataset

# Load the dataset
billsum = load_dataset("csv", data_files={"train": DATASET_PATH, "test": TEST_DATASET_PATH})

billsum["train"] = billsum["train"].select(range(TRAINING_DATASET_SIZE))
billsum["test"] = billsum["test"].select(range(TESTING_DATASET_SIZE))

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-00c6f4fe04f4b1e3/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...




Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-00c6f4fe04f4b1e3/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



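A dataset can be split into train and test sets with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method. Here the call is left commented out, because the train and test splits are already loaded from separate CSV files: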

In [4]:
#billsum = billsum.train_test_split(test_size=0.2)

The BillSum dataset has two fields that you'll want to use:

- `clean_text`: the preprocessed text of the bill, which is the input to the model.
- `summary`: a condensed version of the bill, which is the model target.

In [5]:
billsum["train"][0]

{'bill_id': '107_hr2256',
 'clean_text': 'SECTIONHEADER SHORT TITLE. This Act may be cited as the "Border Hospital Survival and Illegal Immigrant Care Act". SECTIONHEADER FINDINGS. The Congress finds as follows: Immigration is a Federal responsibility. The Immigration and Naturalization Service does not take into custody all aliens who are unlawfully present in the United States. Section 1867 of the Social Security Act and State laws require that, if any individual comes to a hospital and the hospital determines that the individual has an emergency medical condition, the hospital must provide either, within the staff and facilities available at the hospital, for such further medical examination and such treatment as may be required to stabilize the medical condition, or, if appropriate, for transfer of the individual to another medical facility. The Southwest border region is ill-equipped to absorb the expense of providing health care to undocumented aliens because it ranks last in the

# Preprocess
The next step is to load a BART tokenizer to process `clean_text` and `summary`:

In [6]:
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint_dir = OUTPUT_DIR_CHECKPOINT

checkpoint = "facebook/bart-large-cnn"

if os.path.exists(checkpoint_dir):
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir)
    print("Using checkpoint model: ", checkpoint_dir)
else:
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)


The preprocessing function you want to create needs to:

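1. Prefix the input with the prompt `summarize: `. This convention comes from T5; BART checkpoints such as `facebook/bart-large-cnn` are not pretrained with task prefixes, so the prefix is optional here, but once used it must be applied consistently at training and inference time.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences so they are no longer than the limit set by the `max_length` parameter.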

In [7]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["clean_text"]]
    model_inputs = tokenizer(inputs, max_length=MAX_SOURCE_LENGTH, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
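
To make the shape of the output concrete, here is a toy version of the same steps. `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real BART tokenizer (which produces integer subword ids), and both the sample record and the tiny length limits are made up for illustration.

```python
PREFIX = "summarize: "

def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for a tokenizer: split on whitespace, then truncate."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def toy_preprocess(examples, max_source_length=6, max_target_length=3):
    # Prefix every document with the task prompt.
    inputs = [PREFIX + doc for doc in examples["clean_text"]]
    # Tokenize the inputs, truncating to the source limit.
    model_inputs = toy_tokenize(inputs, max_source_length)
    # Tokenize the targets separately and store them as labels.
    labels = toy_tokenize(examples["summary"], max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Made-up record, not taken from BillSum.
batch = {
    "clean_text": ["This Act may be cited as the Example Act of 2024."],
    "summary": ["Names the Example Act."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # ['summarize:', 'This', 'Act', 'may', 'be', 'cited']
print(out["labels"][0])     # ['Names', 'the', 'Example']
```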

Apply the preprocessing function over the entire dataset with the `map` method. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once:

In [8]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
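
The effect of `batched=True` can be pictured with a small pure-Python sketch: instead of one row at a time, the function receives a dictionary of column lists covering a whole batch. `batched_map` below is a hypothetical helper written for illustration, not the `datasets` implementation.

```python
def batched_map(rows, fn, batch_size=1000):
    """Illustrative sketch: call fn once per batch of rows, column-wise."""
    results = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of row dicts into a dict of column lists.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        out = fn(columns)
        # Convert the returned columns back into rows.
        for i in range(len(chunk)):
            results.append({key: out[key][i] for key in out})
    return results

calls = []

def add_length(batch):
    calls.append(len(batch["text"]))  # record how many rows each call sees
    return {"text": batch["text"], "n_chars": [len(t) for t in batch["text"]]}

rows = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
mapped = batched_map(rows, add_length, batch_size=2)
print(calls)       # [2, 1]
print(mapped[2])   # {'text': 'ccc', 'n_chars': 3}
```

With three rows and a batch size of two, the function is called twice rather than three times; on a real dataset the saving comes from the tokenizer processing many texts per call.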


Create batches of examples with `DataCollatorForSeq2Seq`, which dynamically pads the sequences to the longest length in each batch during collation instead of padding the whole dataset to the maximum length.

In [9]:
from transformers import DataCollatorForSeq2Seq

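# Pass the model object (not the checkpoint name string) so the collator
# can use the model to prepare decoder inputs from the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)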
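
Dynamic padding can be sketched in plain Python as follows. `pad_batch` is a hypothetical helper, not the collator's real code; it assumes a pad token id of 1 (the BART default) and pads labels with -100, the value that `compute_metrics` later swaps back to the pad token before decoding. The token ids in the example are made up.

```python
def pad_batch(features, pad_token_id=1, label_pad_token_id=-100):
    """Pad every sequence to the longest one in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Label padding uses -100 so the loss ignores those positions.
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

features = [
    {"input_ids": [0, 11, 12, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 13, 2], "labels": [0, 22, 23, 24, 2]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[0, 11, 12, 2], [0, 13, 2, 1]]
print(batch["labels"])     # [[0, 21, 2, -100, -100], [0, 22, 23, 24, 2]]
```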



# Writing the evaluation function
Load the ROUGE metric:

In [10]:
import evaluate
rouge = evaluate.load("rouge")


Then define `compute_metrics`, which decodes the BillSum predictions and labels and passes them to `compute` to calculate the ROUGE metric:

In [11]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
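
For intuition about what the scores mean, ROUGE-1 is an F-measure over unigram overlap between a prediction and its reference. The function below is a simplified sketch (lowercasing and whitespace splitting only, no stemming), not the `rouge_score` implementation that `evaluate` calls, so its numbers will not match the library exactly. The two strings are made up.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each unigram counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("amends the internal revenue code", "amends the tax code")
print(round(score, 4))  # 0.6667
```

Here three of the five predicted words appear in the four-word reference, giving a precision of 0.6 and a recall of 0.75.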

# Train
Fine-tune the `facebook/bart-large-cnn` model, which `AutoModelForSeq2SeqLM` loads with its pretrained weights:

In [12]:
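from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Reload the pretrained weights only when no fine-tuned checkpoint exists;
# otherwise keep the checkpoint model that was loaded in cell 6.
if not os.path.exists(checkpoint_dir):
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)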

At this point, only three steps remain:

1. Configure the training hyperparameters with `Seq2SeqTrainingArguments`. With `evaluation_strategy="steps"` and `logging_steps=1`, the `trainer` evaluates the ROUGE metric after every training step and saves checkpoints to `output_dir`.
2. Pass the training arguments to `Seq2SeqTrainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call `train()` and `save_model()` to train and save the model.

In [13]:
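from transformers import EarlyStoppingCallback

# Disable Weights & Biases logging. This environment variable is deprecated;
# passing report_to="none" to Seq2SeqTrainingArguments is the replacement.
os.environ["WANDB_DISABLED"] = "true"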

if not os.path.exists(checkpoint_dir):
    # Set up the training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir=checkpoint_dir,
        evaluation_strategy="steps",  # evaluate every eval_steps, which defaults to logging_steps
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        weight_decay=WEIGHT_DECAY,
        save_total_limit=1,
        num_train_epochs=NUM_TRAIN_EPOCHS,
        predict_with_generate=True,
        fp16=True,
        logging_steps=1,
        load_best_model_at_end=True
    )

    # Set up the trainer with early stopping
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_billsum["train"],
        eval_dataset=tokenized_billsum["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

    # Train the model
    trainer.train()
    trainer.save_model(checkpoint_dir)
else:
    print("Checkpoint already exists. Skipping training.")

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.0796 | 2.449455 | 0.4037 | 0.1921 | 0.2956 | 0.2956 | 79.9 |
| 2 | 2.4862 | 2.449455 | 0.4037 | 0.1921 | 0.2956 | 0.2956 | 79.9 |
| 3 | 2.9325 | 2.06271 | 0.3796 | 0.1729 | 0.2737 | 0.2729 | 85.5 |
| 4 | 2.5056 | 1.909763 | 0.4577 | 0.2648 | 0.3456 | 0.3449 | 100.1 |
| 5 | 2.1207 | 1.864363 | 0.4765 | 0.288 | 0.3809 | 0.3814 | 98.8 |
| 6 | 2.0326 | 1.809143 | 0.4824 | 0.2943 | 0.4025 | 0.4028 | 105.3 |
| 7 | 2.3645 | 1.74866 | 0.4553 | 0.2808 | 0.3775 | 0.3767 | 118.8 |
| 8 | 1.8818 | 1.704429 | 0.4735 | 0.3073 | 0.3984 | 0.398 | 120.5 |
| 9 | 2.233 | 1.686296 | 0.4869 | 0.3237 | 0.4112 | 0.4111 | 109.2 |
| 10 | 1.7334 | 1.683752 | 0.5153 | 0.3496 | 0.4593 | 0.46 | 95.3 |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
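
The patience mechanism behind `EarlyStoppingCallback(early_stopping_patience=3)` can be sketched as below. This is a simplified illustration, not the `transformers` implementation: it assumes the monitored metric is validation loss (lower is better) and that the best value is updated at every evaluation. The first list holds the ten validation losses reported in the training log above; the second is a made-up sequence that does trigger a stop.

```python
def first_stop_index(losses, patience=3):
    """Return the index at which training would stop, or None if it never does."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(losses):
        if loss < best:
            best = loss
            bad_evals = 0       # an improvement resets the counter
        else:
            bad_evals += 1      # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

val_losses = [2.449455, 2.449455, 2.06271, 1.909763, 1.864363,
              1.809143, 1.74866, 1.704429, 1.686296, 1.683752]
print(first_stop_index(val_losses))                    # None
print(first_stop_index([1.0, 0.9, 0.95, 0.93, 0.91]))  # 4
```

Under these assumptions the logged losses never go three evaluations without improving, so the patience limit is not reached within the ten steps shown.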


In [14]:
# Evaluation on training dataset
train_metrics = trainer.evaluate(eval_dataset=tokenized_billsum["train"])
print("Training ROUGE Scores:")
print("ROUGE-1:", train_metrics["eval_rouge1"])
print("ROUGE-2:", train_metrics["eval_rouge2"])
print("ROUGE-L:", train_metrics["eval_rougeL"])

# Evaluation on testing dataset
test_metrics = trainer.evaluate(eval_dataset=tokenized_billsum["test"])
print("Testing ROUGE Scores:")
print("ROUGE-1:", test_metrics["eval_rouge1"])
print("ROUGE-2:", test_metrics["eval_rouge2"])
print("ROUGE-L:", test_metrics["eval_rougeL"])

Training ROUGE Scores:
ROUGE-1: 0.5733
ROUGE-2: 0.4209
ROUGE-L: 0.4916
Testing ROUGE Scores:
ROUGE-1: 0.5264
ROUGE-2: 0.3477
ROUGE-L: 0.4446


# Inference
The model was fine-tuned on inputs prefixed with `summarize: `, so the same prefix must be applied at inference time. The `summarize_text` function defined in the next cell adds it, so the sample text itself is left unprefixed:

In [15]:
text = "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

`summarize_text` tokenizes the text and passes the resulting `input_ids` to `model.generate()` to produce a summary:

In [16]:
import torch

def summarize_text(input_text):

    text = "summarize: " + input_text

    inputs = tokenizer(text, max_length=MAX_SOURCE_LENGTH, truncation=True, return_tensors="pt").input_ids
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    # Move the model to the selected device (GPU if available, otherwise CPU)
    model.to(device)

    # Move the input ids to the same device as the model
    inputs = inputs.to(device)

    # Generate outputs
    outputs = model.generate(inputs, max_new_tokens=MAX_TARGET_LENGTH, do_sample=False)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return summary


Using the first sample of the `train` dataset:

In [17]:
print(billsum["train"][0]["summary"])

Border Hospital Survival and Illegal Immigrant Care Act - Amends the Public Health Service Act to direct the Secretary of Health and Human Services to establish a five-year pilot program of health care provider reimbursement for the costs associated with providing emergency medical and ambulance services in Arizona to: (1) illegal aliens who are not detained by any Federal, State, or local law enforcement authority. Or (2) aliens paroled into the United States for less than one year to receive emergency medical treatment.


In [18]:
summarize_text(billsum["train"][0]["clean_text"])

'Border Hospital Survival and Illegal Immigrant Care Act - Amends the Public Health Service Act to direct the Secretary of Health and Human Services to establish and implement a five-year pilot program to reimburse health care providers in Arizona for providing emergency medical care provided in Arizona to aliens who are unlawfully present in the United'

Using the first sample of the `test` dataset:

In [19]:
print(billsum["test"][0]["summary"])

National Science Education Tax Incentive for Businesses Act of 2007 - Amends the Internal Revenue Code to allow a general business tax credit for contributions of property or services to elementary and secondary schools and for teacher training to promote instruction in science, technology, engineering, or mathematics .


In [20]:
summarize_text(billsum["test"][0]["clean_text"])

'National Science Education Tax Incentive for Businesses Act of 2007 - Amends the Internal Revenue Code to extend the elementary and secondary science, technology, engineering, and mathematics (STEM) contributions credit determined under this section for the taxable year to 100 percent of qualified STEM contributions of the taxpayer for such taxable'

In [21]:
model.save_pretrained(checkpoint_dir)
tokenizer.save_pretrained(checkpoint_dir)


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('/kaggle/working/model_testing_10k_6_14_2023_1/tokenizer_config.json',
 '/kaggle/working/model_testing_10k_6_14_2023_1/special_tokens_map.json',
 '/kaggle/working/model_testing_10k_6_14_2023_1/vocab.json',
 '/kaggle/working/model_testing_10k_6_14_2023_1/merges.txt',
 '/kaggle/working/model_testing_10k_6_14_2023_1/added_tokens.json',
 '/kaggle/working/model_testing_10k_6_14_2023_1/tokenizer.json')

In [23]:

# Archive the saved checkpoint directory; zip needs at least one input path.
!zip -r filename.zip {checkpoint_dir}
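
As an alternative to shelling out to `zip`, the checkpoint directory could be archived with Python's standard library. The sketch below uses a temporary directory with a placeholder file so that it runs anywhere; `demo_dir` and `model_archive` are illustrative names, and in the notebook `checkpoint_dir` would take the place of `demo_dir`.

```python
import os
import shutil
import tempfile

# Placeholder directory standing in for the saved checkpoint directory.
work_dir = tempfile.mkdtemp()
demo_dir = os.path.join(work_dir, "model_dir")
os.makedirs(demo_dir)
with open(os.path.join(demo_dir, "config.json"), "w") as f:
    f.write("{}")

# make_archive appends the ".zip" extension to the base name itself.
archive_path = shutil.make_archive(
    os.path.join(work_dir, "model_archive"), "zip", root_dir=demo_dir
)
print(os.path.basename(archive_path))  # model_archive.zip
```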


In [24]:
print(billsum["train"][1]["summary"])

Farm to School Improvements Act of 2010 - Amends the Richard B. Russell National School Lunch Act to direct the Secretary of Agriculture to provide competitive matching grants to schools, nonprofit organizations, and other able entities for farm to school programs that improve the access of school lunch and breakfast program participants to local foods. Provides that each grant may include an implementation grant, training and technical assistance grant, and planning grant. Requires farm to school programs to be designed to: (1) improve the nutritional health and well being of children, (2) procure healthy local foods from small and medium-sized farms. (3) support experiential nutrition education by involving school children in farm and garden-based agricultural education activities. (4) commit public and private community stakeholders to the sustained success of such programs. And (5) increase farmers' income by facilitating their access to institutional markets. Directs the Secretary

In [27]:
print(billsum["test"][1]["summary"])

Small Business Expansion and Hiring Act of 2011 - Amends the Internal Revenue Code to allow nongovernmental employers who employ an average of fewer than 100 employees during a taxable year a retained worker tax credit until December 31, 2012, for the lesser of $4,000 or 6.2 of the wages paid to a retained worker during a period of not less than 52 consecutive weeks of employment. Limits the amount of such credit with respect to any business location of the employer to $400,000 and provides that the number of retained workers taken into account for such credit shall not exceed the excess of the number of employees of the taxpayer at the end of the taxable year over the number of such employees at the beginning of the taxable year. Defines retained worker to mean any qualified individual who was employed on any date during the taxable year for a period of not less than 52 weeks and whose wages during the last 26 weeks of such period equaled at least 80 of such wages for the first 26 wee

In [28]:
summarize_text(billsum["test"][1]["clean_text"])

'Small Business Expansion and Hiring Act of 2011 - Amends the Internal Revenue Code to extend the retained worker credit for the lesser of $4,000, or 6.2 percent of the wages (as defined in section 3401(a) paid by the taxpayer to such retained worker during the 52'