## Text Summarization

Steps:

1. Install the libraries
2. Load the data
3. Load the pretrained tokenizer
4. Process the data and generate the encodings
5. Train the summarization model
6. Evaluate the model
7. Save the model
8. Summarize the text

### 1. Install Libraries

In [None]:
!pip install transformers[torch] datasets

Collecting transformers[torch]
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers[torch])
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers[torch])
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers[t

### 2. Load Dataset

Start by loading the first 10% of the California state bill test split (`ca_test`) of the **BillSum dataset** from the Hugging Face Datasets library. The small slice keeps training fast.

In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test[:10%]")

Downloading builder script:   0%|          | 0.00/3.66k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.70k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/67.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

In [None]:
# Split the dataset: 80% for training, 20% for testing
billsum = billsum.train_test_split(test_size=0.2)
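
As a quick sanity check on the sizes, the arithmetic behind the slice and the split can be sketched as follows. This assumes the `[:10%]` slice rounds to the nearest whole example and that `train_test_split` rounds the test share up. Both are assumptions about library behavior, though they agree with the 1237-example `ca_test` split reported in the loading output and with the 99/25 counts printed when the data is mapped. The helper name `expected_split_sizes` is hypothetical.

```python
import math


def expected_split_sizes(total_examples, slice_percent, test_fraction):
    """Estimate (train, test) sizes for a percent slice followed by a split.

    Assumes the percent slice rounds to the nearest example and the test
    share is rounded up; neither is confirmed by the notebook itself.
    """
    sliced = round(total_examples * slice_percent / 100)
    n_test = math.ceil(sliced * test_fraction)
    n_train = sliced - n_test
    return n_train, n_test


# ca_test has 1237 examples; the notebook keeps 10% and holds out 20% of that.
n_train, n_test = expected_split_sizes(1237, 10, 0.2)
print(n_train, n_test)  # 99 25
```

Because the split is random, passing a fixed `seed` to `train_test_split` is one way to get the same partition on every run.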

In [None]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThis act shall be known, and may be cited, as the Third Validating Act of 2015.\nSEC. 2.\nAs used in this act:\n(a) “Public body” means all of the following:\n(1) The state and all departments, agencies, boards, commissions, and authorities of the state. Except as provided in paragraph (2), “public body” also means all cities, counties, cities and counties, districts, authorities, agencies, boards, commissions, and other entities, whether created by a general statute or a special act, including, but not limited to, the following:\nAgencies, boards, commissions, or entities constituted or provided for under or pursuant to the Joint Exercise of Powers Act (Chapter 5 (commencing with Section 6500) of Division 7 of Title 1 of the Government Code).\nAir pollution control districts of any kind.\nAir quality management districts.\nAirport districts.\nAssessment districts, benefit assessment districts, and sp

### 3. Load Tokenizer

In [None]:
from transformers import AutoTokenizer

# Use the tokenizer that matches the T5 checkpoint fine-tuned in step 5;
# a BERT tokenizer maps text to a different vocabulary than T5 expects.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

### 4. Process the Data

Prefix the input with a prompt so T5 knows this is a summarization task.

In [None]:
prefix = "summarize: "


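def preprocess_function(examples):
    # Prepend the task prefix to every bill, then tokenize and truncate long inputs.
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, padded or truncated to 128 tokens.
    labels = tokenizer(text_target=examples["summary"], padding="max_length", truncation=True, max_length=128)

    # The trainer reads the target token IDs from the "labels" field.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs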
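
Because the labels are padded to a fixed length, the padding positions count toward the training loss unless they are masked or the data collator handles them. A common convention is to replace pad IDs with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on plain lists. `mask_label_padding` is a hypothetical helper, and the pad ID of 0 is an assumption; in practice it should be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_label_padding(batch_label_ids, pad_token_id=0):
    """Replace pad IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in batch_label_ids
    ]


# Toy batch: two label sequences padded with 0 to length 5 (IDs are made up).
padded = [[71, 1080, 1, 0, 0], [363, 19, 48, 1, 0]]
print(mask_label_padding(padded))
# [[71, 1080, 1, -100, -100], [363, 19, 48, 1, -100]]
```

Inside `preprocess_function`, the same replacement could be applied to `labels["input_ids"]` before it is assigned to `model_inputs["labels"]`.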

In [None]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/99 [00:00<?, ? examples/s]

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

In [None]:
tokenized_billsum['train'][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThis act shall be known, and may be cited, as the Third Validating Act of 2015.\nSEC. 2.\nAs used in this act:\n(a) “Public body” means all of the following:\n(1) The state and all departments, agencies, boards, commissions, and authorities of the state. Except as provided in paragraph (2), “public body” also means all cities, counties, cities and counties, districts, authorities, agencies, boards, commissions, and other entities, whether created by a general statute or a special act, including, but not limited to, the following:\nAgencies, boards, commissions, or entities constituted or provided for under or pursuant to the Joint Exercise of Powers Act (Chapter 5 (commencing with Section 6500) of Division 7 of Title 1 of the Government Code).\nAir pollution control districts of any kind.\nAir quality management districts.\nAirport districts.\nAssessment districts, benefit assessment districts, and sp

### 5. Train the Model

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
training_args = Seq2SeqTrainingArguments(output_dir="./result", evaluation_strategy="epoch")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer
)

In [None]:
trainer.train()
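
The step list mentions evaluating the model, but with `evaluation_strategy="epoch"` and no metric function the trainer reports the evaluation loss rather than a summary-quality score. Summaries are usually scored with ROUGE. The sketch below is a simplified, whitespace-tokenized ROUGE-1 in plain Python, meant to illustrate the idea rather than reproduce an official scorer (there is no stemming, and the only normalization is lowercasing). The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1("the cat sat", "the cat sat on the mat")
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.5 0.67
```

A score like this could be computed over generated summaries and the reference `summary` field of the test split; a maintained ROUGE implementation is the better choice for reported results.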

In [None]:
# Save the model
model.save_pretrained('./model/')

### 6. Summarize Text

In [None]:
from transformers import AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained('./model/')

text = "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
summarizer(text, min_length=5, max_length=20)

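[{'summary_text': '[unused2] [unused8] the inflation reduction act lowers prescription drug costs, health care [unused2] reduction act czech [unused2] [unused8]'}]

The `[unused2]` and `[unused8]` strings in this output are reserved placeholder entries in the BERT vocabulary, a sign that the run that produced it paired the T5 model with the `bert-base-uncased` tokenizer. Token IDs only carry meaning within the vocabulary a checkpoint was trained on, so the tokenizer should come from the same checkpoint as the model (`t5-small` here).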