<a href="https://colab.research.google.com/github/vvikasreddy/JargonAI/blob/main/Summa_rizzer_on_custom_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing and Downloading libraries

In [1]:
# downloading necessary libraries
!pip install datasets
!pip install rouge_score
!pip install bert-score
!pip install evaluate



In [2]:
# importing libraries
import pandas as pd
import torch
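from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation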
from torch.utils.data import DataLoader, Dataset
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sklearn.model_selection import train_test_split
import textwrap
from tqdm import tqdm

# Loading the data and Pre-processing




In [3]:
#loading the dataset
dataset_path = "/content/Corpus_all.csv"
data = pd.read_csv(dataset_path, encoding="ISO-8859-1")

In [4]:
data.head()

| Unnamed: 0 | Document | Summary |
|---|---|---|
| 0 | 412.106.1 English is not an official language ... | The Swiss Federal University for Vocational Ed... |
| 1 | 944.021.1 English is not an official language ... | The EAER Ordinance on the Declaration for Timb... |
| 2 | Title and commencement 1 This order may be cit... | The North West Water Authority (Solway Firth) ... |
| 3 | Citation and commencement 1 This order may be ... | The Trafford Park Development Corporation (Are... |
| 4 | Title, commencement and interpretation 1 1 Thi... | The North West Water Authority (Returns of Eel... |


In [5]:

# putting the text and summaries into the containers
input_texts = data["Document"].tolist()
target_texts = data["Summary"].tolist()

In [6]:
# taking a look at the first index
print(input_texts[0])
print(target_texts[0])

412.106.1 English is not an official language of the Swiss Confederation. This translation is provided for information purposes only, has no legal force and may not be relied on in legal proceedings. Ordinance on the Swiss Federal University for Vocational Education and Training(SFUVET Ordinance)of 18 June 2021 (Status as of 1 August 2021)The Swiss Federal Council,on the basis of Article 35 of the SFUVET Act of 25 September 20201,ordains:1 SR 412.106Art. 1 Registered location The Swiss Federal University for Vocational Education and Training (SFUVET) shall be based in Zollikofen.Art. 2 Regional campuses SFUVET shall offer its services through three regional campuses: one in the German-speaking region, one in the French-speaking region and one in the Italian-speaking region of Switzerland.Art. 3 Federal Council's strategic objectives The Federal Department of Economic Affairs, Education and Research (EAER) shall submit SFUVET's strategic objectives drafted by the Federal Council to the 

In [7]:
#initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [8]:
#define the custom dataset class

class CustomDataset(Dataset):
    def __init__(self, tokenizer, input_texts, target_texts, max_input_length=512, max_target_length=128):
        self.tokenizer = tokenizer
        self.input_texts = input_texts
        self.target_texts = target_texts
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.input_texts)

    def __getitem__(self, idx):
        input_encoding = self.tokenizer(
            self.input_texts[idx],
            max_length=self.max_input_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        target_encoding = self.tokenizer(
            self.target_texts[idx],
            max_length=self.max_target_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": input_encoding["input_ids"].squeeze(),
            "attention_mask": input_encoding["attention_mask"].squeeze(),
            "labels": target_encoding["input_ids"].squeeze(),
        }
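
One detail worth checking in `CustomDataset`: the `labels` tensor is padded to `max_target_length` with the tokenizer's pad token, and those pad positions are scored by the loss like any other token. Hugging Face seq2seq models skip label positions equal to `-100`, so a common variant replaces pad ids with `-100` before returning the labels. The sketch below shows the idea on plain Python lists. `mask_pad_tokens` and `unmask_pad_tokens` are hypothetical helpers, and the pad id of 0 (which matches T5 tokenizers) is passed in explicitly as an assumption.

```python
from typing import List

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_pad_tokens(label_ids: List[int], pad_token_id: int) -> List[int]:
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


def unmask_pad_tokens(label_ids: List[int], pad_token_id: int) -> List[int]:
    """Undo the masking so the ids can be decoded back to text."""
    return [pad_token_id if tok == IGNORE_INDEX else tok for tok in label_ids]


# Toy label sequence: three content ids, an end-of-sequence id (1), then padding (0).
padded = [37, 5224, 2822, 1, 0, 0, 0]
masked = mask_pad_tokens(padded, pad_token_id=0)
restored = unmask_pad_tokens(masked, pad_token_id=0)
print(masked)
print(restored)
```

On a tensor, the same masking is a one-liner such as `labels[labels == tokenizer.pad_token_id] = -100`. Labels masked this way need `-100` mapped back to the pad id before they are decoded as reference text.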



In [9]:
# Split the dataset into training and testing sets
train_input_texts, test_input_texts, train_target_texts, test_target_texts = train_test_split(
    input_texts, target_texts, test_size=0.1, random_state=42
)

# Create datasets for train and test (no separate validation set is created)
train_dataset = CustomDataset(tokenizer, train_input_texts, train_target_texts)
test_dataset = CustomDataset(tokenizer, test_input_texts, test_target_texts)

batch_size = 32
# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
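
The split above produces only a training set and a test set, and the training loop later reports its "validation" loss on that same test set. If a separate validation set is wanted, one option is a three-way split. The sketch below uses only the standard library rather than `train_test_split`; `three_way_split` and the 80/10/10 fractions are illustrative assumptions, not part of the notebook.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def three_way_split(
    inputs: Sequence[str],
    targets: Sequence[str],
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle (document, summary) pairs and cut them into train/val/test lists."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)  # seeded so the split is reproducible
    n_test = int(len(pairs) * test_fraction)
    n_val = int(len(pairs) * val_fraction)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test


# Placeholder data standing in for input_texts / target_texts.
docs = [f"document {i}" for i in range(20)]
sums = [f"summary {i}" for i in range(20)]
train, val, test = three_way_split(docs, sums)
print(len(train), len(val), len(test))
```

Each returned list holds `(document, summary)` tuples, so `zip(*train)` recovers the two parallel lists that `CustomDataset` expects.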


## Model training

In [10]:
#set up training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)



In [11]:
#training loop
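epochs = 65
for epoch in range(epochs):
    # Re-enable training mode every epoch: the validation pass below calls
    # model.eval(), which would otherwise leave dropout disabled from epoch 2 on.
    model.train()
    epoch_loss = 0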
    for batch in tqdm(train_dataloader):
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        epoch_loss += loss.item()

        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / len(train_dataloader)}")

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in test_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()
    print(f"Epoch {epoch + 1}/{epochs}, Validation Loss: {val_loss / len(test_dataloader)}")

  0%|          | 0/2 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
100%|██████████| 2/2 [00:01<00:00,  1.04it/s]


Epoch 1/65, Loss: 6.983645915985107
Epoch 1/65, Validation Loss: 4.949277400970459


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 2/65, Loss: 6.835392951965332
Epoch 2/65, Validation Loss: 4.703701972961426


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 3/65, Loss: 6.163495063781738
Epoch 3/65, Validation Loss: 4.47418737411499


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 4/65, Loss: 5.70934271812439
Epoch 4/65, Validation Loss: 4.266974925994873


100%|██████████| 2/2 [00:00<00:00,  2.91it/s]


Epoch 5/65, Loss: 5.405320167541504
Epoch 5/65, Validation Loss: 4.081233501434326


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 6/65, Loss: 5.148494243621826
Epoch 6/65, Validation Loss: 3.9104955196380615


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 7/65, Loss: 4.824320077896118
Epoch 7/65, Validation Loss: 3.745394706726074


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 8/65, Loss: 4.8122217655181885
Epoch 8/65, Validation Loss: 3.592782974243164


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 9/65, Loss: 4.336396932601929
Epoch 9/65, Validation Loss: 3.4476852416992188


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 10/65, Loss: 4.2925450801849365
Epoch 10/65, Validation Loss: 3.3197290897369385


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 11/65, Loss: 3.9417015314102173
Epoch 11/65, Validation Loss: 3.201822519302368


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 12/65, Loss: 3.8141684532165527
Epoch 12/65, Validation Loss: 3.099726915359497


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 13/65, Loss: 3.4979047775268555
Epoch 13/65, Validation Loss: 3.0242297649383545


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 14/65, Loss: 3.3972718715667725
Epoch 14/65, Validation Loss: 2.977613687515259


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 15/65, Loss: 3.3047356605529785
Epoch 15/65, Validation Loss: 2.9467687606811523


100%|██████████| 2/2 [00:00<00:00,  2.89it/s]


Epoch 16/65, Loss: 3.2809261083602905
Epoch 16/65, Validation Loss: 2.9181394577026367


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 17/65, Loss: 3.1962039470672607
Epoch 17/65, Validation Loss: 2.8882675170898438


100%|██████████| 2/2 [00:00<00:00,  2.91it/s]


Epoch 18/65, Loss: 3.132725715637207
Epoch 18/65, Validation Loss: 2.858240842819214


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 19/65, Loss: 3.0453161001205444
Epoch 19/65, Validation Loss: 2.8288300037384033


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 20/65, Loss: 3.0158296823501587
Epoch 20/65, Validation Loss: 2.8018243312835693


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 21/65, Loss: 2.984642267227173
Epoch 21/65, Validation Loss: 2.775897741317749


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 22/65, Loss: 2.9441686868667603
Epoch 22/65, Validation Loss: 2.750776529312134


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 23/65, Loss: 2.8398263454437256
Epoch 23/65, Validation Loss: 2.726334571838379


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 24/65, Loss: 2.849713683128357
Epoch 24/65, Validation Loss: 2.703078269958496


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 25/65, Loss: 2.756621837615967
Epoch 25/65, Validation Loss: 2.6803054809570312


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 26/65, Loss: 2.737100839614868
Epoch 26/65, Validation Loss: 2.6586968898773193


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 27/65, Loss: 2.7305967807769775
Epoch 27/65, Validation Loss: 2.6379778385162354


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 28/65, Loss: 2.65227735042572
Epoch 28/65, Validation Loss: 2.617579221725464


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 29/65, Loss: 2.6192179918289185
Epoch 29/65, Validation Loss: 2.5976884365081787


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 30/65, Loss: 2.561392068862915
Epoch 30/65, Validation Loss: 2.577934503555298


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 31/65, Loss: 2.529310941696167
Epoch 31/65, Validation Loss: 2.5588858127593994


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 32/65, Loss: 2.478235602378845
Epoch 32/65, Validation Loss: 2.5410640239715576


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 33/65, Loss: 2.453104853630066
Epoch 33/65, Validation Loss: 2.5249204635620117


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 34/65, Loss: 2.4386022090911865
Epoch 34/65, Validation Loss: 2.509505271911621


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 35/65, Loss: 2.4216370582580566
Epoch 35/65, Validation Loss: 2.4948127269744873


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 36/65, Loss: 2.3739136457443237
Epoch 36/65, Validation Loss: 2.4814624786376953


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 37/65, Loss: 2.370430111885071
Epoch 37/65, Validation Loss: 2.4696249961853027


100%|██████████| 2/2 [00:00<00:00,  2.89it/s]


Epoch 38/65, Loss: 2.3208178281784058
Epoch 38/65, Validation Loss: 2.4589147567749023


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 39/65, Loss: 2.292074203491211
Epoch 39/65, Validation Loss: 2.4491331577301025


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 40/65, Loss: 2.2865402698516846
Epoch 40/65, Validation Loss: 2.440375328063965


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 41/65, Loss: 2.220367193222046
Epoch 41/65, Validation Loss: 2.432511329650879


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 42/65, Loss: 2.2111854553222656
Epoch 42/65, Validation Loss: 2.425785779953003


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 43/65, Loss: 2.1981165409088135
Epoch 43/65, Validation Loss: 2.4197776317596436


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 44/65, Loss: 2.1719584465026855
Epoch 44/65, Validation Loss: 2.414459466934204


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 45/65, Loss: 2.1595932245254517
Epoch 45/65, Validation Loss: 2.409733533859253


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 46/65, Loss: 2.132827043533325
Epoch 46/65, Validation Loss: 2.4053287506103516


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 47/65, Loss: 2.1138858795166016
Epoch 47/65, Validation Loss: 2.401242971420288


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 48/65, Loss: 2.0757519006729126
Epoch 48/65, Validation Loss: 2.397233724594116


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 49/65, Loss: 2.041695773601532
Epoch 49/65, Validation Loss: 2.3933846950531006


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 50/65, Loss: 1.992041528224945
Epoch 50/65, Validation Loss: 2.389887571334839


100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Epoch 51/65, Loss: 2.002606749534607
Epoch 51/65, Validation Loss: 2.386909008026123


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 52/65, Loss: 1.9690972566604614
Epoch 52/65, Validation Loss: 2.384392023086548


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 53/65, Loss: 1.9523724913597107
Epoch 53/65, Validation Loss: 2.381934404373169


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 54/65, Loss: 1.9265254139900208
Epoch 54/65, Validation Loss: 2.379361152648926


100%|██████████| 2/2 [00:00<00:00,  2.92it/s]


Epoch 55/65, Loss: 1.889311671257019
Epoch 55/65, Validation Loss: 2.3761138916015625


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 56/65, Loss: 1.8696211576461792
Epoch 56/65, Validation Loss: 2.372945547103882


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 57/65, Loss: 1.8328993916511536
Epoch 57/65, Validation Loss: 2.3700568675994873


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 58/65, Loss: 1.8276124596595764
Epoch 58/65, Validation Loss: 2.3669216632843018


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 59/65, Loss: 1.7811239957809448
Epoch 59/65, Validation Loss: 2.3635332584381104


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 60/65, Loss: 1.777180552482605
Epoch 60/65, Validation Loss: 2.361196517944336


100%|██████████| 2/2 [00:00<00:00,  2.96it/s]


Epoch 61/65, Loss: 1.717605471611023
Epoch 61/65, Validation Loss: 2.359555959701538


100%|██████████| 2/2 [00:00<00:00,  2.94it/s]


Epoch 62/65, Loss: 1.7063341736793518
Epoch 62/65, Validation Loss: 2.3573296070098877


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 63/65, Loss: 1.6712150573730469
Epoch 63/65, Validation Loss: 2.355926036834717


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]


Epoch 64/65, Loss: 1.6553191542625427
Epoch 64/65, Validation Loss: 2.3555562496185303


100%|██████████| 2/2 [00:00<00:00,  2.95it/s]

Epoch 65/65, Loss: 1.6172003746032715
Epoch 65/65, Validation Loss: 2.3557913303375244
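
The log shows training loss still falling at epoch 65 (about 1.617) while validation loss has flattened near 2.356 and ticks up slightly between epochs 64 and 65. A patience-based early-stopping check is one common way to stop around that point. The sketch below is a minimal, framework-free version; the `EarlyStopping` class, `patience`, and `min_delta` are hypothetical names and values chosen for illustration, and the losses are the last six validation losses copied from the log above.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses for epochs 60 to 65, taken from the training log.
val_losses = {
    60: 2.361196517944336,
    61: 2.359555959701538,
    62: 2.3573296070098877,
    63: 2.355926036834717,
    64: 2.3555562496185303,
    65: 2.3557913303375244,
}

# patience=1 keeps the demo short; real runs usually tolerate more bad epochs.
stopper = EarlyStopping(patience=1)
stopped_at = None
for epoch, loss in val_losses.items():
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopper.best_loss, stopped_at)
```

In a real loop, the model weights would typically be saved whenever `best_loss` improves, so the best checkpoint can be restored after stopping.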





## Evaluating the Model

In [12]:

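def evaluate_model(dataset, tokenizer, model):
    # Build the loader from the dataset argument rather than the global test_dataloader
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    model.eval()
    predictions = []
    references = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):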
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=256)
            predictions.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
            references.extend([tokenizer.decode(ref, skip_special_tokens=True) for ref in batch["labels"]])
    return predictions, references

In [13]:
#run evaluation
predictions, references = evaluate_model(test_dataset, tokenizer, model)

Evaluating: 100%|██████████| 1/1 [00:03<00:00,  3.89s/it]


In [15]:

# Calculate ROUGE Scores
def calculate_rouge(predictions, references):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        for key in rouge_scores:
            rouge_scores[key].append(scores[key].fmeasure)
    return {key: sum(values) / len(values) for key, values in rouge_scores.items()}

rouge_scores = calculate_rouge(predictions, references)
print(f"ROUGE Scores: {rouge_scores}")

ROUGE Scores: {'rouge1': 0.3840698292336229, 'rouge2': 0.13882919331745788, 'rougeL': 0.25124390540259856}
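
To make these numbers easier to interpret: ROUGE-1 measures unigram overlap between prediction and reference, and ROUGE-L is based on their longest common subsequence; each is reported here as an F-measure of precision and recall. The sketch below recomputes both on a toy pair using lowercase whitespace tokenization. It is a simplification, not the `rouge_score` implementation: it skips stemming and the library's tokenizer, so its numbers can differ from `RougeScorer` output. The function names and example sentences are hypothetical.

```python
from collections import Counter
from typing import List


def f_measure(matches: int, pred_len: int, ref_len: int) -> float:
    """Harmonic mean of precision (matches/pred_len) and recall (matches/ref_len)."""
    if matches == 0 or pred_len == 0 or ref_len == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f(reference: str, prediction: str) -> float:
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    # Counter intersection clips each word's count to the smaller of the two sides.
    overlap = Counter(ref_tokens) & Counter(pred_tokens)
    return f_measure(sum(overlap.values()), len(pred_tokens), len(ref_tokens))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Longest common subsequence length, keeping only one previous DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference: str, prediction: str) -> float:
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    return f_measure(lcs_length(ref_tokens, pred_tokens), len(pred_tokens), len(ref_tokens))


reference = "the university is based in zollikofen"
prediction = "zollikofen is where the university is based"
r1 = rouge1_f(reference, prediction)
rl = rouge_l_f(reference, prediction)
print(f"ROUGE-1 F: {r1:.4f}, ROUGE-L F: {rl:.4f}")
```

The toy pair shares five unigrams but only a four-word common subsequence, because the prediction reorders the sentence. That is the same pattern as in the scores above, where ROUGE-L is lower than ROUGE-1.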


In [18]:
# Calculate BERT Scores
def calculate_bert_score(predictions, references):
    P, R, F1 = bert_score(predictions, references, model_type="bert-base-uncased", lang="en")
    return {
        "Precision": P.mean().item(),
        "Recall": R.mean().item(),
        "F1": F1.mean().item()
    }

bert_scores = calculate_bert_score(predictions, references)
print(f"BERT Scores: {bert_scores}")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

BERT Scores: {'Precision': 0.6510936617851257, 'Recall': 0.5887253284454346, 'F1': 0.6178234815597534}
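
BERTScore compares contextual token embeddings rather than surface n-grams: each predicted token is matched to its most similar reference token by cosine similarity (precision), each reference token to its most similar predicted token (recall), and F1 combines the two. The sketch below reproduces only that greedy-matching step on made-up 2-D vectors. The real library obtains embeddings from the chosen model (`bert-base-uncased` above) and supports options such as idf weighting that are omitted here; all names and vectors are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(
    pred_vecs: Sequence[Vector], ref_vecs: Sequence[Vector]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) using best-match cosine similarity per token."""
    # Precision: how well each predicted token is covered by some reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: how well each reference token is covered by some predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings": the prediction covers two of the three reference directions exactly.
pred_vecs = [(1.0, 0.0), (0.0, 1.0)]
ref_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
precision, recall, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In the toy case precision is higher than recall because every predicted token has an exact match while one reference token does not. The scores above show the same direction (precision 0.651, recall 0.589), which is consistent with predictions that omit part of the reference content.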


In [19]:
# printing the summary predictions on the test set

# code to beautify the print statements
line_width = 145

for indx, val in enumerate(predictions):
    print(f"Reference {indx + 1}:\n{textwrap.fill(references[indx], width=line_width)}")
    print(f"Prediction {indx + 1}:\n{textwrap.fill(val, width=line_width)}")
    print("-" * 30)

Reference 1:
The Swiss Federal University for Vocational Education and Training (SFUVET) is based in Zollikofen and has 3 campuses in Switzerland. These
campuses are separated by the language spoken in the region, which are German, French, and Italilan. It colabortates with industry groups and
regional authorities to plan courses, services, and research. The government sets their goals and consults with national organikzations like
trade unions and employers associations. Old rules have been replaced on August 1, 2021, but a few parts will continue to be active until the end
of 2021.
Prediction 1:
The SFUVET Ordinance is established in Zollikofen, Switzerland. Its strategic objectives are to develop a partnership with professional
organizations and cantonal authorities in its strategic planning activities, training courses and research activities. It may also develop
advisory boards for professional organizations, cantonal authorities, and other interested parties. It repeals the 2005 

## References
  
 https://huggingface.co/docs/transformers/model_doc/t5

 model : https://huggingface.co/google/flan-t5-small

 tokenizer : https://huggingface.co/docs/transformers/main_classes/tokenizer

 train test split, sklearn : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

 dataloader : https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

 optimizer : https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

 ROUGE score : https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/README.md

 bert_score : https://github.com/Tiiiger/bert_score

 to beautify the print statements: https://www.geeksforgeeks.org/textwrap-text-wrapping-filling-python/