### Flan-T5 Train
Training notebook for fine-tuning the Flan-T5 base model. Flan-T5 is a variant of the T5 (Text-To-Text Transfer Transformer) model that has been fine-tuned using the FLAN (Fine-tuned Language Net) methodology.

#### Step 1: Install Required Dependencies

In [1]:
!pip install evaluate
!pip install sacrebleu
!pip install bert-score


Import the libraries used to load the datasets, the large language model (LLM) and the tokenizer.

In [2]:
import os
import torch
import numpy as np
from datasets import load_dataset
import json
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
import evaluate
import bert_score

#### Step 2: Preprocess dataset
The restructure_json function processes a list of JSON file names, reads each file, extracts specific fields ('disfluent' and 'original'), and writes the restructured data to new output files. It constructs file paths dynamically and uses JSON operations to read and write the data.

In [3]:
def restructure_json(file_names):
    """
    Restructures the JSON files specified by the given list of file names.

    Parameters:
    file_names (list): A list of file names (without extension) to be processed.

    Returns:
    None
    """
    for file_name in file_names:
        input_path = os.path.join(os.getcwd(), f"{file_name}.json")
        output_path = os.path.join(os.getcwd(), f"{file_name}_output.json")

        #print(input_path)
        #print(output_path)

        with open(input_path, 'r') as f:
            raw_data = json.load(f)
        #print(raw_data)

        dataset = [{'disfluent': item['disfluent'], 'original': item['original']} for item in raw_data.values()]

        with open(output_path, 'w') as f:
            json.dump(dataset, f, indent=4)

In [4]:
# List the train, dev and test file names without the extension. If you are using a holdout dataset, rename it to the test file name.
# The files themselves must have the .json extension.
file_names = ["train", "dev", "test"]
restructure_json(file_names)
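
As a quick illustration of the restructuring step, the sketch below applies the same list comprehension to a small in-memory dictionary instead of a file. The record IDs and question texts are made-up placeholders; only the 'disfluent' and 'original' keys come from the notebook, and the assumption that the raw file is a dictionary keyed by record ID follows from the use of `raw_data.values()`.

```python
import json

# Hypothetical raw input: a dict keyed by record ID.
# The IDs and texts are invented for illustration only.
raw_data = {
    "id_001": {
        "original": "What is the capital of France?",
        "disfluent": "What is the largest no wait the capital of France?",
    },
    "id_002": {
        "original": "Who wrote Hamlet?",
        "disfluent": "When was um who wrote Hamlet?",
    },
}

# Same transformation as restructure_json: drop the IDs, keep the two text fields.
records = [
    {"disfluent": item["disfluent"], "original": item["original"]}
    for item in raw_data.values()
]

print(json.dumps(records, indent=4))
```

The result is a flat list of records, one per question pair, which is what gets written to the `*_output.json` files.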

Load the preprocessed train, dev and test dataset.

In [5]:
data_files = {k: os.path.join(os.getcwd(), f"{k}_output.json") for k in ["train", "dev", "test"]}
dataset = load_dataset("json", data_files=data_files)


Define the model and tokenizer. Make sure to use the correct model name, since the tokenizer is loaded from the same checkpoint.

In [6]:
# Initialize the model and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
model_name = "google/flan-t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)  # return_tensors is an argument for tokenizer calls, not for loading
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

cuda



#### Step 3: Tokenize the input and target
Define a function that tokenizes the disfluent questions (inputs) and the original questions (targets), truncating and padding each to a maximum length of 512 tokens.

In [7]:
max_length = 512
def tokenize_function(examples):
    """
    Tokenizes the disfluent inputs and original targets for training a disfluency correction model.

    Args:
        examples (dict): A dictionary containing the input examples with 'disfluent' and 'original' keys.

    Returns:
        dict: A dictionary containing the preprocessed model inputs with 'input_ids', 'attention_mask', and 'labels' keys.
    """

    inputs = examples['disfluent']
    targets = examples['original']
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=max_length, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = dataset.map(tokenize_function, batched=True, remove_columns=['disfluent', 'original'])

Map (train): 7182 examples
Map (dev): 1000 examples
Map (test): 3643 examples

In [8]:
# Data collator to handle dynamic padding and other pre-processing requirements
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
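
One point worth noting about the tokenization above: because the labels are already padded to `max_length` with the tokenizer's pad token, the collator has nothing left to pad, so the pad positions remain ordinary token ids and count toward the training loss. A common variant, not what this notebook does, replaces those positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_pad_labels` and the toy token ids below are hypothetical; this is a minimal sketch on plain Python lists rather than a drop-in change.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_pad_labels(label_ids, pad_token_id):
    """Replace pad token ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy label batch: 0 is the pad id used by T5 tokenizers, the other ids are arbitrary.
labels = [[363, 19, 8, 1, 0, 0], [2645, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(labels, pad_token_id=0)
print(masked)
# [[363, 19, 8, 1, -100, -100], [2645, 1, -100, -100, -100, -100]]
```

Inside `tokenize_function` this would be applied as `model_inputs["labels"] = mask_pad_labels(labels["input_ids"], tokenizer.pad_token_id)`. Whether it changes the scores on this dataset is not something the notebook measures.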

#### Step 4: Define the metric
Define the metric function used to evaluate the model. The model is evaluated with SacreBLEU and BERTScore F1.

In [9]:
# Load BLEU metrics
sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for prediction results.
    Args:
        eval_pred (tuple): A tuple containing predictions and labels.
    Returns:
        dict: A dictionary containing the computed evaluation metrics.
            - "bleu" (float): The BLEU score.
            - "Bert Score F1" (float): The average BERTScore F1 score, rounded to 4 decimals.
    """
    predictions, labels = eval_pred
    # In case the model returns more than the prediction logits
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]  # SacreBLEU expects a list of references for each prediction

    sacrebleu_result = sacrebleu.compute(predictions=decoded_preds, references=decoded_labels)

    # Calculate BERTScore
    P, R, F1 = bert_score.score(decoded_preds, decoded_labels, lang="en", verbose=True)

    return {
        "bleu": sacrebleu_result["score"],
        # Return a float: a formatted string cannot be logged as a scalar and is dropped with a warning
        "Bert Score F1": round(F1.mean().item(), 4)
    }
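
The two post-processing steps in `compute_metrics` (putting the pad id back in place of -100 before decoding, and wrapping each reference in its own list) can be illustrated on plain Python lists. This is a sketch only: the helper names are hypothetical, the token ids are arbitrary, and already-decoded strings stand in for the tokenizer output.

```python
IGNORE_INDEX = -100


def restore_pad_ids(label_ids, pad_token_id):
    """Swap IGNORE_INDEX back to the pad id so the sequences can be decoded."""
    return [
        [pad_token_id if tok == IGNORE_INDEX else tok for tok in seq]
        for seq in label_ids
    ]


def to_sacrebleu_inputs(pred_texts, label_texts):
    """Strip whitespace and wrap each reference in a list (one reference per prediction)."""
    predictions = [p.strip() for p in pred_texts]
    references = [[r.strip()] for r in label_texts]
    return predictions, references


restored = restore_pad_ids([[363, 19, 1, -100, -100]], pad_token_id=0)
print(restored)  # [[363, 19, 1, 0, 0]]

preds, refs = to_sacrebleu_inputs([" Who wrote Hamlet? "], ["Who wrote Hamlet?"])
print(preds)  # ['Who wrote Hamlet?']
print(refs)   # [['Who wrote Hamlet?']]
```

The nested list matters because SacreBLEU allows several references per prediction; with a single reference each inner list has exactly one element.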


#### Step 5: Define the training arguments

In [10]:
training_args = Seq2SeqTrainingArguments(
    output_dir="/output/",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    save_steps=10_000,
    save_total_limit=2,
    fp16=True,
    predict_with_generate=True,
    learning_rate=0.00017105019776419224,  # optimum learning rate found by hyperparameter tuning (using optuna)
    weight_decay=0.05238804154208795 # optimum weight decay found by hyperparameter tuning (using optuna)
)

#### Step 6: Initialize the trainer

In [11]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['dev'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



#### Step 7: Train the model

In [12]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Bert score f1 |
|---|---|---|---|---|
| 1 | 0.8318 | 0.004827 | 89.607818 | 0.9888 |
| 2 | 0.0044 | 0.004769 | 89.818564 | 0.9893 |
| 3 | 0.0031 | 0.005379 | 90.199489 | 0.9900 |
| 4 | 0.0021 | 0.006171 | 90.297948 | 0.9899 |
| 5 | 0.0018 | 0.007096 | 90.346927 | 0.9899 |





TrainOutput(global_step=4490, training_loss=0.09538775702361806, metrics={'train_runtime': 7869.6704, 'train_samples_per_second': 4.563, 'train_steps_per_second': 0.571, 'total_flos': 2.458963652640768e+16, 'train_loss': 0.09538775702361806, 'epoch': 5.0})
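
The reported `global_step=4490` can be checked with a little arithmetic. The sketch below assumes no gradient accumulation and that the last partial batch of each epoch is kept (both are the `Seq2SeqTrainingArguments` defaults); the helper name is hypothetical. With 7182 training examples and 5 epochs, a batch size of 4 on one device would give 8980 steps, while an effective batch size of 8 gives 4490. That is consistent with training on two devices, but the notebook does not state the hardware, so treat it as an inference.

```python
import math


def total_optimizer_steps(num_examples, effective_batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / effective_batch_size) * num_epochs


train_examples = 7182  # size of the train split
epochs = 5             # num_train_epochs

one_device = total_optimizer_steps(train_examples, 4, epochs)   # per-device batch of 4, one device
two_devices = total_optimizer_steps(train_examples, 8, epochs)  # per-device batch of 4, two devices
print(one_device, two_devices)  # 8980 4490
```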

In [None]:
# Save the model and tokenizer
# model.save_pretrained("/kaggle/working/disfl-fine-tuned-FlanT5-model")
# tokenizer.save_pretrained("/kaggle/working/disfl-fine-tuned-FlanT5-tokenizer")