
# Training a model for text summarization


## 1 GPU Setup and Python Library Installation & Import 

### 1.1 GPU Setup

In [None]:
import torch
# Checking GPU availability
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


### 1.2 Library Installation

In [None]:
!pip install transformers



In [None]:
!pip install datasets



### 1.3 Library Import

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import torch


### 1.4 Connecting to Google Drive

In [None]:
# Run this cell only on Google Colab: it mounts Drive and adds the project folder to the import path
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.insert(0,'/content/drive/MyDrive/Applied_ML_Project/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from huggingface_hub import notebook_login

notebook_login()


## 2 Data Loading and Preprocessing

In [None]:
data = pd.read_pickle("/content/drive/MyDrive/Applied_ML_Project/data_preprocessing/summary_data1.pkl")
data

| Unnamed: 0 | original_text | reference_summary |
|---|---|---|
| 0 | On September 15, 2005, the Equal Employment Op... | Equal Employment Opportunity Commission brough... |
| 1 | NOTE: This is one of three identically named c... | The case was brought by a non-profit organizat... |
| 2 | On May 11, 2006, African-American employees of... | This case was brought by African American empl... |
| 3 | Pursuant to the Civil Rights of Institutionali... | Pursuant to the Civil Rights of Institutionali... |
| 4 | On July 30, 2015, the Freedom of the Press Fou... | A non-profit organization dealing with rights ... |
| ... | ... | ... |
| 30861 | BT program to beat dialler scams BT is introd... | BT is introducing two initiatives to help beat... |
| 30862 | Spam e-mails tempt net shoppers Computer user... | A third of them read unsolicited junk e-mail a... |
| 30863 | Be careful how you code A new European direct... | This goes to the heart of the European project... |
| 30864 | US cyber security chief resigns The man makin... | Amit Yoran was director of the National Cyber ... |


In [None]:
data['original_text'][0]

'On September 15, 2005, the Equal Employment Opportunity Commission (EEOC) filed suit against House of Philadelphia, Inc., on behalf of an employee who was allegedly fired because she was pregnant. Seeking monetary and injunctive relief for the employee (including economic damage, compensation for emotional harm, and punitive damages), the EEOC brought suit under Title VII of the Civil Rights Act of 1964 for unlawful discrimination on the basis of sex. The EEOC also sought to recover its costs. Via private counsel, the employee filed a motion to intervene in the suit, which was automatically granted after the period for filing objections passed without incident. The employee brought claims under Title VII and state law and sought substantially the same relief as the EEOC, except that the complaint specifically sought reinstatement. Eventually the parties came to a settlement agreement, which the Court (Judge Kristi K. DuBose) entered as a consent decree on Jan 10, 2009. The terms of th

In [None]:
from transformers import AutoTokenizer, BasicTokenizer
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
basic_tokenizer = BasicTokenizer()

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
import datasets
from datasets import Dataset, DatasetDict
tds = Dataset.from_pandas(data)


In [None]:
tds = tds.train_test_split(test_size=0.2)
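
The split above is random, so the train and test rows change on every run unless a seed is fixed (`train_test_split` accepts a `seed` argument for this). The sketch below is a plain-Python illustration of a seeded 80/20 split, not the `datasets` implementation. The row count 30866 is an assumption inferred from the `Map` progress bars later in this notebook (24692 + 6174), and `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test fraction."""
    # Round the test share up so no fractional row is lost.
    n_test = math.ceil(n_rows * test_size)
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 30866 is an assumed row count (24692 train + 6174 test).
train_idx, test_idx = split_indices(30866)
print(len(train_idx), len(test_idx))
```

Calling `split_indices` twice with the same seed returns the same partition, which is what makes evaluation scores comparable between runs.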

In [None]:
tds['train']['original_text'][0]

"SECTION 1. SHORT TITLE. This Act may be cited as the ``Emergency Unemployment Compensation Expansion Act''. SEC. 2. TEMPORARY EXTENSION OF UNEMPLOYMENT INSURANCE PROVISIONS. (a) In General.--(1) Section 4007 of the Supplemental Appropriations Act, 2008 (Public Law 110-252; 26 U.S.C. 3304 note) is amended-- (A) by striking ``November 30, 2010'' each place it appears and inserting ``January 3, 2012''; (B) in the heading for subsection (b)(2), by striking ``november 30, 2010'' and inserting ``january 3, 2012''; and (C) in subsection (b)(3), by striking ``April 30, 2011'' and inserting ``June 9, 2012''. (2) Section 2005 of the Assistance for Unemployed Workers and Struggling Families Act, as contained in Public Law 111-5 (26 U.S.C. 3304 note; 123 Stat. 444), is amended-- (A) by striking ``December 1, 2010'' each place it appears and inserting ``January 4, 2012''; and (B) in subsection (c), by striking ``May 1, 2011'' and inserting ``June 11, 2012''. (3) Section 5 of the Unemployment Compe

In [None]:
prefix = "summarize: "


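def preprocess_function(examples):
    # Prepend the T5 task prefix to every source document.
    inputs = [prefix + doc for doc in examples["original_text"]]
    # Truncate inputs to 1024 tokens; this is set explicitly because the
    # t5-small tokenizer's default limit is 512 (see the warning above).
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, capped at 128 tokens.
    labels = tokenizer(text_target=examples["reference_summary"], max_length=128, truncation=True)

    # The trainer reads the target token ids from the "labels" key.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs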

In [None]:
tokenized_tds = tds.map(preprocess_function, batched=True)

Map:   0%|          | 0/24692 [00:00<?, ? examples/s]

Map:   0%|          | 0/6174 [00:00<?, ? examples/s]

In [None]:
len(tokenized_tds['train']['input_ids'][0])

1024
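
A length of exactly 1024 means this example hit the `max_length` cap and was truncated. To judge how much text the cap discards overall, one could compare untruncated token counts against the limit. The sketch below assumes such counts are already available as a list of integers; `truncation_report` is a hypothetical helper and the lengths are made-up values, not measurements from this dataset.

```python
def truncation_report(token_lengths, max_length):
    """Return the share of truncated documents and the mean tokens lost per truncated document."""
    truncated = [n for n in token_lengths if n > max_length]
    share = len(truncated) / len(token_lengths)
    if not truncated:
        return share, 0.0
    mean_lost = sum(n - max_length for n in truncated) / len(truncated)
    return share, mean_lost


# Hypothetical untruncated token counts for five documents.
lengths = [300, 900, 1024, 1500, 4000]
share, mean_lost = truncation_report(lengths, max_length=1024)
print(share, mean_lost)
```

A document of exactly `max_length` tokens is not counted as truncated, since nothing is cut from it.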

In [None]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

In [None]:
!pip install evaluate
!pip install rouge_score
import evaluate

rouge = evaluate.load("rouge")



In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Turn generated token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Labels use -100 for positions ignored by the loss; swap in the pad
    # token id so they can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
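
To make the metric above less opaque, here is a minimal sketch of the idea behind ROUGE-1: an F1 score over unigrams shared by the prediction and the reference. It is a simplification, not the `evaluate` implementation, which uses its own tokenization, applies stemming when `use_stemmer=True`, and also reports ROUGE-2 and ROUGE-L variants. The function name `rouge1_f1` and the example strings are illustrative assumptions.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word to its minimum count in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Three of four predicted words appear among the five reference words.
score = rouge1_f1("the eeoc filed suit", "the eeoc filed a lawsuit")
print(round(score, 4))  # 0.6667
```

Here precision is 3/4 and recall is 3/5, so a summary is rewarded both for covering the reference and for not padding itself with unrelated words.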

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/Applied_ML_Project/Custom_Model/policy_summarization_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_tds["train"],
    eval_dataset=tokenized_tds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

/content/drive/MyDrive/Applied_ML_Project/Custom_Model/policy_summarization_model is already a clone of https://huggingface.co/Saish/policy_summarization_model. Make sure you pull the latest changes with `repo.git_pull()`.
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RuntimeError: ignored
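
For a sense of the size of the run configured above, the number of optimizer steps follows from the training-set size, the batch size, and the epoch count. The sketch below assumes a single GPU and no gradient accumulation (the default); `training_steps` is a hypothetical helper, and 24692 is the training-set size shown by the `Map` progress bar.

```python
import math


def training_steps(n_examples, batch_size, epochs):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    # The last, partially filled batch still counts as a step.
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = training_steps(n_examples=24692, batch_size=16, epochs=4)
print(per_epoch, total)  # 1544 6176
```

With `evaluation_strategy="epoch"`, the evaluation pass would then run four times, once per 1544 training steps.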

In [None]:
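# The first positional argument of push_to_hub is the commit message, not the
# repo name; the target repo is derived from output_dir (or hub_model_id).
trainer.push_to_hub()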

In [None]:
text = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

In [None]:
len(text.split(" "))

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/Applied_ML_Project/Saish/policy_summarization_model")
# Use the same task prefix at inference time that was used during training.
inputs = tokenizer("summarize: " + text, return_tensors="pt").input_ids

In [None]:
inputs


In [None]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Applied_ML_Project/Saish/policy_summarization_model")
outputs = model.generate(inputs, max_new_tokens=1000, do_sample=False)


In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=True)