### Flan-T5 optimal hyperparameter tuning
This notebook searches for the optimal learning rate and weight decay for the Flan-T5 base model. Flan-T5 is a variant of the T5 (Text-To-Text Transfer Transformer) model that has been fine-tuned using the FLAN (Fine-tuned Language Net) methodology.

#### Step 1: Install Required Dependencies

In [1]:
!pip install evaluate
!pip install sacrebleu
!pip install optuna

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.2
Collecting sacrebleu
  Downloading sacrebleu-2.4.3-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Downloading sacrebleu-2.4.3-py3-none-any.whl (103 kB)
Downloading portalocker-2.10.1-py3-none-any.whl (18 kB)
Installing collected packages: portalocker, sacrebleu
Successfully installed portalocker-

Import the libraries used to load the datasets, the Large Language Model (LLM), and the tokenizer.

In [2]:
import json
import os

import evaluate
import numpy as np
import optuna
import torch
from datasets import DatasetDict, load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

#### Step 2: Preprocess dataset
The `restructure_json` function takes a list of JSON file names, reads each file, keeps only the 'disfluent' and 'original' fields of every record, and writes the result to a new `<file_name>_output.json` file. Input and output paths are built from the current working directory.

In [3]:
def restructure_json(file_names):
    """
    Restructures the JSON files specified by the given list of file names.

    Parameters:
    file_names (list): A list of file names (without extension) to be processed.

    Returns:
    None
    """
    for file_name in file_names:
        input_path = os.path.join(os.getcwd(), f"{file_name}.json")
        output_path = os.path.join(os.getcwd(), f"{file_name}_output.json")

        # Each input file is a JSON object whose values are the records.
        with open(input_path, 'r') as f:
            raw_data = json.load(f)

        dataset = [{'disfluent': item['disfluent'], 'original': item['original']} for item in raw_data.values()]

        with open(output_path, 'w') as f:
            json.dump(dataset, f, indent=4)

In [4]:
# List the train, dev, and test file names without the extension. If you use a holdout dataset, rename it to "test".
# Each name must correspond to a <name>.json file in the current working directory.
file_names = ["train", "dev", "test"]
restructure_json(file_names)
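
The list comprehension in `restructure_json` iterates over `raw_data.values()`, so each input file is expected to be a JSON object that maps some key to a record containing at least `disfluent` and `original`. The sketch below shows that reshaping on a tiny in-memory example. The keys (`"q1"`, `"q2"`), the extra `"id"` field, and the sample questions are made up for illustration and are not taken from the dataset.

```python
import json

# Hypothetical input in the shape restructure_json expects:
# a JSON object whose values are records with 'disfluent' and 'original'.
raw_data = {
    "q1": {"id": 1, "disfluent": "What is no who wrote Hamlet?", "original": "Who wrote Hamlet?"},
    "q2": {"id": 2, "disfluent": "Where um when did the war end?", "original": "When did the war end?"},
}

# Same reshaping as in restructure_json: keep only the two fields, drop the keys.
records = [
    {"disfluent": item["disfluent"], "original": item["original"]}
    for item in raw_data.values()
]

print(json.dumps(records, indent=4))
```

The result is a flat list of records, which is the layout that `load_dataset("json", ...)` reads in the next cell.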

Load only 10% of the training data so that each trial trains faster.

In [5]:
data_files = {
    "train": os.path.join(os.getcwd(), "train_output.json"),
    "val": os.path.join(os.getcwd(), "dev_output.json"),
    "test": os.path.join(os.getcwd(), "test_output.json"),
}
dataset = load_dataset("json", data_files=data_files)
# train_test_split puts 10% of the rows in its 'test' split; that 10% becomes the training sample
train_dataset_sample = dataset['train'].train_test_split(test_size=0.1)['test']
dataset = DatasetDict({
    'train': train_dataset_sample,
    'val': dataset['val'],
    'test': dataset['test']
})

DatasetDict({
    train: Dataset({
        features: ['disfluent', 'original'],
        num_rows: 719
    })
    val: Dataset({
        features: ['disfluent', 'original'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['disfluent', 'original'],
        num_rows: 3643
    })
})

Define the model and tokenizer. Both are loaded from the same checkpoint, so make sure `model_name` points to the checkpoint you intend to tune.

In [6]:
# Initialize the model and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
model_name = "google/flan-t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

cuda


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


#### Step 3: Tokenize the input and target
Define the function that tokenizes the disfluent questions (inputs) and the original questions (targets), truncating and padding both to a maximum length.

In [7]:
max_length = 512
def tokenize_function(examples):
    """
    Tokenizes a batch of examples for training a disfluency correction model.

    Args:
        examples (dict): A batch with 'disfluent' (input) and 'original' (target) keys.

    Returns:
        dict: The tokenized model inputs with 'input_ids', 'attention_mask', and 'labels' keys.
    """

    inputs = examples['disfluent']
    targets = examples['original']
    # Both sides are truncated and padded to max_length tokens.
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=max_length, truncation=True, padding="max_length")

    # Note: padded label positions keep the pad token ID here; they are not replaced with -100.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = dataset.map(tokenize_function, batched=True, remove_columns=['disfluent', 'original'])
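
Because the targets are padded to `max_length` in this step, the padded label positions hold the pad token ID, while the loss in Hugging Face seq2seq models skips only positions labeled -100. A common adjustment is to replace pad IDs in the labels with -100 after tokenizing. The sketch below shows the replacement on plain lists. The token IDs are made up, the helper name `mask_padding` is hypothetical, and `pad_token_id = 0` is an assumption that matches the T5 tokenizer's default.

```python
# Assumed pad token ID (0 is the T5 tokenizer default); the IDs below are made up.
pad_token_id = 0
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_padding(label_ids, pad_id=pad_token_id):
    """Replace pad IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in seq]
        for seq in label_ids
    ]

padded_labels = [
    [363, 19, 8, 1, 0, 0],    # a short target followed by padding
    [2645, 1, 0, 0, 0, 0],
]

masked = mask_padding(padded_labels)
print(masked)
```

In `tokenize_function` this would be applied to `labels["input_ids"]` before it is assigned to `model_inputs["labels"]`. With masking in place, the validation loss would likely no longer be comparable to the values logged in this notebook, which were computed with the padding included.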

In [8]:
# Data collator to handle dynamic padding and other pre-processing requirements
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

#### Step 4: Define the metric
Define the metric function used to evaluate the model. The model is scored with SacreBLEU.

In [9]:
# Load the SacreBLEU metric
sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for prediction results.
    Args:
        eval_pred (tuple): A tuple containing predictions and labels.
    Returns:
        dict: A dictionary containing the computed evaluation metrics.
            - "bleu" (float): The SacreBLEU score.
    """
    predictions, labels = eval_pred
    # In case the model returns more than the prediction logits
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]  # SacreBLEU expects a list of references for each prediction

    sacrebleu_result = sacrebleu.compute(predictions=decoded_preds, references=decoded_labels)

    return {
        "bleu": sacrebleu_result["score"]
    }
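
Two details in `compute_metrics` are easy to miss: label positions set to -100 must be mapped back to a real token ID before decoding, and the metric expects one list of references per prediction. The sketch below walks through both steps on made-up data. `pad_token_id = 0` and the decoded strings are assumptions for illustration, and no tokenizer or metric is loaded.

```python
import numpy as np

pad_token_id = 0  # assumed; 0 is the T5 tokenizer default

# Made-up label IDs where -100 marks positions the loss ignores.
labels = np.array([
    [363, 19, 8, 1, -100, -100],
    [2645, 1, -100, -100, -100, -100],
])

# Same replacement as in compute_metrics: -100 cannot be decoded,
# so it is swapped for the pad token before batch_decode is called.
decodable = np.where(labels != -100, labels, pad_token_id)
print(decodable.tolist())

# Hypothetical stand-ins for the strings that batch_decode would return.
decoded_labels = [" Who wrote Hamlet? ", "When did the war end?"]

# One list of references per prediction, even with a single reference each.
references = [[label.strip()] for label in decoded_labels]
print(references)
```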

#### Step 5: Search for the optimal hyperparameters with Optuna
Define the Optuna objective function, which samples a learning rate and a weight decay, trains the model, and returns the validation BLEU score. Then run the study over 10 trials.
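
The objective function in this step samples both hyperparameters on a log scale, which spreads the draws evenly across orders of magnitude instead of clustering them near the upper bound. The sketch below reproduces that idea with the standard library only, so it illustrates the shape of the search space rather than Optuna's own sampler. The helper name `sample_log_uniform` and the seed are hypothetical.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Same bounds as the search space in the objective function.
learning_rates = [sample_log_uniform(rng, 5e-5, 5e-4) for _ in range(1000)]
weight_decays = [sample_log_uniform(rng, 1e-5, 1e-1) for _ in range(1000)]

# About half of the weight-decay draws fall below the geometric midpoint 1e-3.
below_mid = sum(w < 1e-3 for w in weight_decays) / len(weight_decays)
print(f"share of weight_decay draws below 1e-3: {below_mid:.2f}")
```

With a plain uniform distribution over the same weight-decay range, only about 1% of the draws would fall below 1e-3, so the small values would rarely be explored.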

In [14]:
def objective(trial):
    # Sample both hyperparameters on a log scale (suggest_loguniform is deprecated)
    learning_rate = trial.suggest_float('learning_rate', 5e-5, 5e-4, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True)

    # Define training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir="/kaggle/working/",
        eval_strategy="epoch",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=5,
        save_steps=10_000,
        save_total_limit=2,
        fp16=True,
        predict_with_generate=True,
        learning_rate=learning_rate,
        report_to="none",
        weight_decay=weight_decay
    )

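    # Initialize the Trainer
    # Note: `model` is the single instance loaded earlier, so weights updated in one trial carry over to the next.
    trainer = Seq2SeqTrainer(
        model=model,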
        args=training_args,
        train_dataset=tokenized_data['train'],
        eval_dataset=tokenized_data['val'],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    # Evaluate on the validation set
    eval_results = trainer.evaluate()
    sacrebleu_score = eval_results["eval_bleu"]

    return sacrebleu_score

# Create the Optuna study
study = optuna.create_study(direction="maximize")

# Optimize
study.optimize(objective, n_trials=10)

print(f"Best trial: {study.best_trial.value}")
print(f"Best hyperparameters: {study.best_trial.params}")
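
In the objective above, every trial trains the same `model` object, so each trial starts from the weights left by the previous one rather than from the pretrained checkpoint. One way to make the trials independent is the `model_init` argument of the Hugging Face `Trainer`, which builds a new model instance each time `train()` is called. A minimal sketch under that assumption follows. The factory name `make_model_init` is hypothetical, and the import sits inside the function only so that the snippet can be defined without loading the library.

```python
def make_model_init(model_name="google/flan-t5-base"):
    """Return a zero-argument callable that loads a fresh pretrained model."""
    def model_init():
        # Imported lazily so that defining the factory does not require transformers.
        from transformers import T5ForConditionalGeneration
        return T5ForConditionalGeneration.from_pretrained(model_name)
    return model_init

model_init = make_model_init()
print(callable(model_init))
```

Inside `objective`, the trainer would then be created with `model_init=model_init` in place of `model=model`, with the other arguments unchanged. Scores obtained this way would not be directly comparable to the values in the log that follows, which came from the shared model.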

[I 2024-08-28 16:35:11,411] A new study created in memory with name: no-name-0f4d54bf-3465-4d1a-8408-435a2cefa91b


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.007307 | 84.195271 |
| 2 | No log | 0.005505 | 87.500398 |
| 3 | No log | 0.006246 | 87.240432 |
| 4 | No log | 0.006789 | 88.333249 |
| 5 | No log | 0.007112 | 88.18489 |


[I 2024-08-28 17:07:58,006] Trial 0 finished with value: 88.1848900975486 and parameters: {'learning_rate': 0.00017105019776419224, 'weight_decay': 0.05238804154208795}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.008489 | 87.300789 |
| 2 | No log | 0.01148 | 87.17953 |
| 3 | No log | 0.012964 | 87.626895 |
| 4 | No log | 0.013303 | 87.756079 |
| 5 | No log | 0.013404 | 87.765808 |


[I 2024-08-28 17:41:01,633] Trial 1 finished with value: 87.7658078259573 and parameters: {'learning_rate': 0.00013745880961196785, 'weight_decay': 0.00790220462142866}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.013106 | 86.976857 |
| 2 | No log | 0.01265 | 86.655717 |
| 3 | No log | 0.016174 | 87.23215 |
| 4 | No log | 0.016822 | 87.798871 |
| 5 | No log | 0.016276 | 87.734338 |


[I 2024-08-28 18:14:09,192] Trial 2 finished with value: 87.73433792398204 and parameters: {'learning_rate': 0.0003358995047563211, 'weight_decay': 0.006998333923142502}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.019587 | 86.856742 |
| 2 | No log | 0.020044 | 86.623435 |
| 3 | No log | 0.018903 | 87.103534 |
| 4 | No log | 0.019972 | 87.44864 |
| 5 | No log | 0.019953 | 87.450772 |


[I 2024-08-28 18:47:24,488] Trial 3 finished with value: 87.45077232438695 and parameters: {'learning_rate': 0.0002010409051012435, 'weight_decay': 0.0001780414196956808}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.021414 | 86.899634 |
| 2 | No log | 0.023039 | 86.826536 |
| 3 | No log | 0.022576 | 87.095644 |
| 4 | No log | 0.02206 | 87.315836 |
| 5 | No log | 0.021902 | 87.322558 |


[I 2024-08-28 19:20:35,717] Trial 4 finished with value: 87.32255844534652 and parameters: {'learning_rate': 5.6042448397969894e-05, 'weight_decay': 0.00030040389923257266}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.023576 | 86.669784 |
| 2 | No log | 0.024857 | 87.085142 |
| 3 | No log | 0.02463 | 87.147317 |
| 4 | No log | 0.024295 | 87.443798 |
| 5 | No log | 0.023899 | 87.344895 |


[I 2024-08-28 19:53:36,632] Trial 5 finished with value: 87.3448946236834 and parameters: {'learning_rate': 9.569055016937815e-05, 'weight_decay': 0.027748992947179113}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.025309 | 86.649137 |
| 2 | No log | 0.026493 | 86.048067 |
| 3 | No log | 0.026429 | 86.820905 |
| 4 | No log | 0.025533 | 86.709145 |
| 5 | No log | 0.025285 | 87.04065 |


[I 2024-08-28 20:26:11,577] Trial 6 finished with value: 87.04065002132774 and parameters: {'learning_rate': 0.00013392890980223928, 'weight_decay': 0.0002548997442143131}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.026495 | 86.468194 |
| 2 | No log | 0.024716 | 86.613557 |
| 3 | No log | 0.02693 | 86.7168 |
| 4 | No log | 0.027411 | 86.733166 |
| 5 | No log | 0.026828 | 86.648823 |


[I 2024-08-28 20:59:18,713] Trial 7 finished with value: 86.64882333468762 and parameters: {'learning_rate': 0.00015981737475379912, 'weight_decay': 0.05614893996615956}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.02743 | 86.540366 |
| 2 | No log | 0.028312 | 86.341131 |
| 3 | No log | 0.028621 | 86.275607 |
| 4 | No log | 0.02851 | 86.503674 |
| 5 | No log | 0.028221 | 86.506609 |


[I 2024-08-28 21:32:29,430] Trial 8 finished with value: 86.50660936569152 and parameters: {'learning_rate': 5.926221665277455e-05, 'weight_decay': 0.022743061251765912}. Best is trial 0 with value: 88.1848900975486.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.02836 | 86.136077 |
| 2 | No log | 0.028748 | 85.418342 |
| 3 | No log | 0.026574 | 86.764277 |
| 4 | No log | 0.02798 | 86.63574 |
| 5 | No log | 0.027477 | 86.848107 |


[I 2024-08-28 22:05:41,023] Trial 9 finished with value: 86.84810746521097 and parameters: {'learning_rate': 8.751052344549526e-05, 'weight_decay': 0.0006760175877463329}. Best is trial 0 with value: 88.1848900975486.


Best trial: 88.1848900975486
Best hyperparameters: {'learning_rate': 0.00017105019776419224, 'weight_decay': 0.05238804154208795}
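
The objective returns the BLEU score from the evaluation after the final epoch, which is not always the best epoch of a trial. For Trial 0, the per-epoch scores in the log peak at epoch 4. A short sketch that compares the two, using the Trial 0 numbers copied from the log:

```python
# Per-epoch validation BLEU for Trial 0, copied from the training log.
trial_0_bleu = {1: 84.195271, 2: 87.500398, 3: 87.240432, 4: 88.333249, 5: 88.18489}

last_epoch = max(trial_0_bleu)                         # final epoch number
best_epoch = max(trial_0_bleu, key=trial_0_bleu.get)   # epoch with the highest BLEU

print(f"last epoch {last_epoch}: {trial_0_bleu[last_epoch]}")
print(f"best epoch {best_epoch}: {trial_0_bleu[best_epoch]}")
```

The gap is small here (about 0.15 BLEU), but it suggests that the number of epochs could be worth treating as a tunable setting too, or that the per-epoch maximum could serve as the trial value.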
