# Preparation
Almost the same as the T5 training script: it uses teacher forcing (the labels are passed to the model during training).

The difference is that we do not save the full model. We save only the LoRA parameters (plus the unfrozen parameters), and a function for loading the model back is provided further down.

During training, the optimizer must be built from the LoRA model's parameters, not the base model's. I lost a bit of money because of this one mistake.

*2024.08.20*

I think there is still a problem with this version of the trainer:
1. The loss and accuracy code assumes that the model output has the same sequence length as the labels. Here the inputs and the labels are both padded to length 40, so the shapes always line up. Note that when `labels` are passed without `decoder_input_ids`, the Hugging Face T5 and BART models build the decoder input by shifting the labels, so the logits follow the label length rather than the input length. To keep the accuracy calculation bug free, the loss function should still be checked for the case where the input sequence and the labels have different lengths.
2. Because I didn't do dynamic padding and instead padded every sequence in the dataset to the same length, the model also has to learn to generate the pad tokens. I am guessing this might be why my T5-large didn't reach the accuracy reported in the paper: the loss can get low simply because the output has the right number of pad tokens. (Usually the output has length 40, the same as the input, and the answers are only about 4 tokens long, so 90% of the output has to be pad tokens.)
3. We don't need to save the model every epoch; saving the final model is enough. We also don't need to evaluate during training. (Technically, after too much training the generalization error starts to increase, so we would want to find that crossing point before the model overfits. In practice, with fewer than 5 epochs, overfitting is almost never a problem.) So evaluating and saving during training should be optional, not the default.
4. Hyperparameters should be stored under each "sub train directory".
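
Point 2 can be made concrete with a small calculation. The sketch below is plain Python with no model involved, and the per-token losses are made up: 4 answer tokens with loss 2.0 and 36 pad tokens with loss 0.01. The names `mask_pad_labels`, `masked_mean_loss` and `IGNORE_INDEX` are hypothetical; the value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy, which is a common way to exclude pad positions from the loss.

```python
IGNORE_INDEX = -100  # same value as the default ignore_index of torch.nn.functional.cross_entropy


def mask_pad_labels(labels, pad_token_id):
    """Replace pad token ids in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in labels]


def masked_mean_loss(token_losses, labels):
    """Average per-token losses over the positions that are not ignored."""
    kept = [loss for loss, tok in zip(token_losses, labels) if tok != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical example: a 4-token answer padded to length 40 (pad id assumed to be 0).
labels = [11, 12, 13, 1] + [0] * 36
token_losses = [2.0] * 4 + [0.01] * 36

unmasked = sum(token_losses) / len(token_losses)
masked = masked_mean_loss(token_losses, mask_pad_labels(labels, pad_token_id=0))

print(f"mean over all 40 positions: {unmasked:.3f}")
print(f"mean over answer tokens only: {masked:.3f}")
```

With these made-up numbers the unmasked mean is roughly a tenth of the masked one, which is the effect described in point 2: easy pad positions dilute the signal from the few answer tokens.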

In [None]:
!pip install datasets
! pip install peft
from peft import LoraConfig, get_peft_model
# !pip install accelerate -U
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import os
import numpy as np
from transformers import get_scheduler
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.nn import functional as F
import json
from torch.nn import CrossEntropyLoss
# Mount Google Drive first (use the button in the Colab sidebar or a separate cell).
# Then change this to the Google Drive path where this notebook is located.
drive_path = '/content/drive/MyDrive/Projects/CryptoniteAnalysis/Baselines/Seq2Seq'
os.chdir(drive_path)



This part differs from the T5 script: when we save a model, we only save the LoRA parameters. So when loading later, we first have to rebuild the model from the same LoRA config.
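
The selection rule is just a filter on parameter names. A minimal sketch, using a plain list of made-up names in place of `model.named_parameters()` (the helper `select_trainable_names` is hypothetical, and the real names depend on the model and the PEFT version):

```python
def select_trainable_names(names, keywords=("lora", "lm_head")):
    """Keep the names that contain any of the given keywords."""
    return [name for name in names if any(key in name for key in keywords)]


# Made-up names that imitate what a LoRA-wrapped seq2seq model might expose.
example_names = [
    "base_model.model.encoder.layers.0.self_attn.q_proj.weight",
    "base_model.model.encoder.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.encoder.layers.0.self_attn.q_proj.lora_B.default.weight",
    "base_model.model.lm_head.weight",
]

selected = select_trainable_names(example_names)
for name in selected:
    print(name)
```

Only the frozen `q_proj.weight` entry is dropped; the two LoRA matrices and the head are kept, which is what ends up in the saved file.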

In [None]:
RECORD_HYPERPARAMETERS = 'record_hyperparameters'
RECORD_TRAIN_LOSS_AND_ACCURACY = 'record_train_loss_and_accuracy'
RECORD_TEST_LOSS_AND_ACCURACY = 'record_test_loss_and_accuracy'
RECORD_MODEL = 'record_model'

def save_lora_parameters(model, output_dir):
    '''
    Save the LoRA parameters together with the unfrozen parameters.
    This function varies from model to model because the unfrozen parameters differ.
    For both T5 and BART the head happens to be called 'lm_head'.
    '''
    # detach and move to CPU so the file holds plain tensors that can be loaded without a GPU
    lora_params = {k: v.detach().cpu() for k, v in model.named_parameters() if ('lora' in k or 'lm_head' in k)}
    torch.save(lora_params, os.path.join(output_dir, "lora_params.pth"))


def write_results(output_dir, result_type, results, **kwargs):
    # write the hyper parameter under output_dir
    if result_type == RECORD_HYPERPARAMETERS:
        hyper_parameters_file = os.path.join(output_dir, 'hyper_parameters.json')
        with open(hyper_parameters_file, 'w') as f:
            json.dump(results, f)
            return
    # write the train results under output_dir
    if result_type == RECORD_TRAIN_LOSS_AND_ACCURACY:
        json_file = os.path.join(output_dir,f'train_metrics.json')
        if os.path.exists(json_file):
            with open(json_file, 'r') as file:
                data = json.load(file)
        else:
            # Start a new list of records
            data = []

        # Append the results to the existing data
        data.append(results)
        with open(json_file, 'w') as f:
            json.dump(data, f, indent=4)
            return

    # write the validate and test results under output_dir
    if result_type == RECORD_TEST_LOSS_AND_ACCURACY:
        json_file = os.path.join(output_dir, f'validate_and_test_metrics.json')
        if os.path.exists(json_file):
            with open(json_file, 'r') as file:
                data = json.load(file)
        else:
            # Start a new list of records
            data = []

        # Append the results to the existing data
        data.append(results)
        with open(json_file, 'w') as f:
            json.dump(data, f, indent=4)
            return

    # store the model under output_dir (result is a model)
    if result_type == RECORD_MODEL:
        model = results
        save_lora_parameters(model, output_dir)
        return




This part is the same as in the T5 training script.
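
For readers who have not seen that script: the accuracy in the cell below is exact match on the decoded answers, and evaluation combines the per-batch accuracies weighted by batch size. A minimal sketch of both ideas on made-up strings (the function names are hypothetical and no tokenizer is involved):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of decoded predictions that equal the reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


def weighted_average(batch_values, batch_sizes):
    """Combine per-batch means into one mean, weighting each batch by its size."""
    total = sum(value * size for value, size in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)


# Made-up decoded answers for two batches of different sizes.
batch_1 = exact_match_accuracy(["apple", "pear", "plum", "fig"], ["apple", "pear", "lime", "fig"])
batch_2 = exact_match_accuracy(["kiwi", "date"], ["kiwi", "dale"])

overall = weighted_average([batch_1, batch_2], [4, 2])
print(batch_1, batch_2, overall)
```

Weighting by batch size matters when the last batch is smaller than the others; a plain mean of the two batch accuracies would overstate the influence of the small batch.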



In [None]:
EVAL_SUBSAMPLE_SIZE = 100
VAL_SET_SIZE = 1000
TEST_SET_SIZE = 1000

EVAL_PER_STEP = 100

def calculate_accuracy(logits, labels, tokenizer):
    '''
    There are two ways to calculate accuracy:
    1. Compare what percentage of the output tokens are the same (except for special tokens).
    To compare token by token, we can flatten the tensors and compare them one by one:
    predictions = predictions.view(-1)
    labels = labels.view(-1)

    2. Compare how many answers are correct in a batch (this is what the code below does).
    The simple way is to batch decode both sides and compare the decoded strings one by one.
    Another way is to compare the tokens without decoding, but I am not sure how to deal with special tokens
    (sometimes the model might not generate the correct end tokens).

    '''
    predictions = torch.argmax(logits, dim=-1)

    pred_words = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    gold_standard_words = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # calculate correct predictions
    correct_labels, total_labels = 0, 0
    for i in range(len(pred_words)):
        if pred_words[i] == gold_standard_words[i]:
            correct_labels += 1
        total_labels += 1
    accuracy = correct_labels / total_labels
    return accuracy

def customize_loss_and_accuracy(outputs, target, tokenizer):
    '''
    Potential bugs
    This function assumes the logits and the labels have the same sequence length.
    Here the inputs and the labels are both padded to 40, so the shapes line up. When labels are passed without
    decoder_input_ids, the Hugging Face seq2seq models build the decoder input from the labels, so the logits
    follow the label length. To keep the accuracy calculation bug free, the case where the input sequence and
    the labels have different lengths should still be checked.

    After adding batch_decode to the accuracy calculation, evaluation became very slow, so I suggest not calculating accuracy during training.
    I also feel we don't need to evaluate during training: it takes too much time.
    '''
    # flatten to the shapes cross_entropy expects: input is (batch * seq_len, vocab_size), target is (batch * seq_len)
    loss = F.cross_entropy(input=outputs.logits.view(-1, outputs.logits.size(-1)), target=target.view(-1))
    accuracy = calculate_accuracy(logits=outputs.logits, labels=target, tokenizer=tokenizer)
    return loss, accuracy

def train_batch(model, tokenizer, epoch, step, batch, device, optimizer, scheduler, epoch_dir):
    # set model to train mode
    model.train()

    # put everything on the right device
    batch =  {k: v.to(device) for k, v in batch.items()}
    batch_size = batch['labels'].shape[0]

    # clear gradients, same old as usual
    optimizer.zero_grad()

    # forward pass: the labels are passed in so the model can derive the decoder inputs from them (teacher forcing)
    outputs = model(**batch)

    # outputs.loss is also a cross-entropy loss (its grad_fn shows NllLossBackward0 because cross entropy is
    # log_softmax followed by NLL loss); it is recomputed here to keep the loss definition explicit
    loss, accuracy = customize_loss_and_accuracy(outputs, target=batch['labels'], tokenizer=tokenizer)

    # back propagation
    loss.backward()
    optimizer.step()

    # scheduler adjust lr
    scheduler.step()

    # record the train loss and accuracy
    # (F.cross_entropy already averages over tokens, so the loss is not divided by batch_size again)
    record = {"evaluate_set": 'train', "epoch":epoch, "batch":step,
              "avg_loss":loss.item(), "accuracy":accuracy, 'subsample_size':"None"}
    print(record)
    # WRITE: save the result for this epoch
    write_results(epoch_dir, result_type=RECORD_TRAIN_LOSS_AND_ACCURACY, results=record)

    return


def evaluate_model(model, tokenizer, epoch, step, dataloaders, device, subsample_size, evaluate_set, epoch_dir):
    '''evaluate means validate or test'''
    # set model to eval mode
    model.eval()
    # calculate number of samples being evaluated
    total_validated_samples = 0
    # calculate total loss and total number of correct labels (weighted accuracy)
    total_loss = 0
    total_accurate = 0
    # turn off grad computation
    with torch.no_grad():
        # evaluate batch by batch
        for batch in dataloaders[evaluate_set]:
            # stop once enough samples have been evaluated (when subsampling)
            if total_validated_samples >= subsample_size:
                break

            # put everything on the right device
            batch =  {k: v.to(device) for k, v in batch.items()}
            batch_size = batch['labels'].shape[0]

            # forward pass in the model
            outputs = model(**batch)

            # accumulate loss and accuracy, weighting each batch by its size
            # (the loss returned here is already a per-token mean for the batch)
            loss, accuracy = customize_loss_and_accuracy(outputs, target=batch['labels'], tokenizer=tokenizer)
            total_loss += loss.item() * batch_size
            total_accurate += accuracy * batch_size
            total_validated_samples += batch_size


    # calculate the loss and accuracy
    average_loss = total_loss/total_validated_samples
    accuracy = total_accurate/total_validated_samples

    # record the loss and accuracy
    record = {"evaluate_set": evaluate_set, "epoch":epoch, "batch":step,
              "avg_loss": average_loss, 'accuracy': accuracy, 'subsample_size': subsample_size}
    print(record)

    # write to file
    write_results(epoch_dir, result_type=RECORD_TEST_LOSS_AND_ACCURACY, results=record)
    return


def train_epoch(model, tokenizer, epoch, dataloaders, device, optimizer, scheduler, epoch_dir):
    # prepare output dir
    if not os.path.exists(epoch_dir):
        os.makedirs(epoch_dir)

    # evaluate at the beginning of the training
    evaluate_model(model=model, tokenizer=tokenizer, epoch=epoch, step=0, dataloaders=dataloaders, device=device, subsample_size=VAL_SET_SIZE, evaluate_set='validation', epoch_dir=epoch_dir)
    evaluate_model(model=model, tokenizer=tokenizer, epoch=epoch, step=0, dataloaders=dataloaders, device=device, subsample_size=TEST_SET_SIZE, evaluate_set='test', epoch_dir=epoch_dir)

    for step, batch in enumerate(dataloaders['train']):
        # train the batch
        train_batch(model, tokenizer, epoch, step, batch, device, optimizer, scheduler, epoch_dir)

        # validate the model once every EVAL_PER_STEP steps
        if step % EVAL_PER_STEP == 0:
            evaluate_model(model, tokenizer, epoch, step, dataloaders, device, subsample_size=EVAL_SUBSAMPLE_SIZE, evaluate_set='validation', epoch_dir=epoch_dir)

    # save the model
    write_results(epoch_dir, result_type=RECORD_MODEL, results=model)
    return


def train_model(model, tokenizer, output_dir, dataloaders, optimizer, scheduler, device, hyper_parameters):
    # get device
    model.to(device)

    # number of epochs
    num_train_epochs = hyper_parameters['num_train_epochs']

    # Create the subdirectory for the hyperparameters: this directory is where we will save the results of training
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # WRITE hyperparameter to subdirectory
    write_results(output_dir, result_type=RECORD_HYPERPARAMETERS, results=hyper_parameters)


    # train the model on the hyper parameters
    for epoch in range(num_train_epochs):
        epoch_dir = os.path.join(output_dir, f"epoch={epoch}")
        train_epoch(model, tokenizer, epoch, dataloaders, device, optimizer, scheduler, epoch_dir)

    # final evaluation
    evaluate_model(model=model, tokenizer=tokenizer, epoch=num_train_epochs, step="STOP", dataloaders=dataloaders, device=device, subsample_size=VAL_SET_SIZE, evaluate_set='validation', epoch_dir=epoch_dir)
    evaluate_model(model=model, tokenizer=tokenizer, epoch=num_train_epochs, step="STOP", dataloaders=dataloaders, device=device, subsample_size=TEST_SET_SIZE, evaluate_set='test', epoch_dir=epoch_dir)

    return

In [None]:
def load_lora_parameters(model, lora_config, lora_params_path):
    '''
    The inputs of this function should look like this
    (they should be the same as defined in training):


    model_name = 'facebook/bart-large-cnn'
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    rank = 32
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=32,  # scaling factor for the LoRA update
        lora_dropout=0.05, # dropout applied inside the LoRA layers
        # print(model) to see all the linear layers, and do LoRA on all of them
        target_modules=['q_proj', 'k_proj', 'v_proj', 'out_proj', 'fc1', 'fc2'],
        # unfreeze the head of the model too
        modules_to_save=['lm_head']
    )

    lora_params_path is the path to the lora_params.pth file written by save_lora_parameters,
    that is, os.path.join(output_dir, "lora_params.pth")
    '''
    lora_model = get_peft_model(model, lora_config)
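    # load on CPU first; load_state_dict copies the values onto the model's own device
    lora_params = torch.load(lora_params_path, map_location="cpu")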
    model_dict = lora_model.state_dict()
    model_dict.update(lora_params)
    lora_model.load_state_dict(model_dict)
    return lora_model

# Training bart-base
I am not planning to tune it, just use it as an example for now.
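
A note on the learning-rate schedule used in these cells: the `linear` scheduler is stepped once per batch, so its arguments are counted in optimizer steps rather than in training examples. The sketch below works through that arithmetic in plain Python. The helper names are hypothetical, the 10% warmup ratio is an assumption about the intent, and the 48,000 examples are the size of the training subset used in the T5 cell (16 * 3000); batch size, epochs and learning rate match the cells in this notebook.

```python
import math


def schedule_steps(num_examples, batch_size, num_epochs, warmup_ratio=0.1):
    """Return (total optimizer steps, warmup steps) for a scheduler stepped once per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(warmup_ratio * total_steps)
    return total_steps, warmup_steps


def linear_lr(step, base_lr, total_steps, warmup_steps):
    """Learning rate of a linear warmup followed by a linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps, warmup_steps = schedule_steps(num_examples=48_000, batch_size=16, num_epochs=3)
print(total_steps, warmup_steps)

# learning rate at the start, at the end of warmup, and at the last step
for step in (0, warmup_steps, total_steps):
    print(step, linear_lr(step, 5e-4, total_steps, warmup_steps))
```

If the warmup were counted in examples instead (10% of 48,000 = 4,800), it would cover more than half of the 9,000 steps, so it is worth checking which unit each argument uses.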

In [None]:
# Parameters I am tuning
########################################################################################################
########################################################################################################
########################################################################################################

model_fp = 'bart-base'
model_name = 'facebook/bart-base'

# define hyperparameters
per_device_train_batch_size = 16
learning_rate = 5e-04
num_train_epochs = 3

# define the best hyper parameter
hyper_parameters = {
            'learning_rate': learning_rate,
            'per_device_train_batch_size': per_device_train_batch_size,
            'num_train_epochs': num_train_epochs
        }

# Define LoRA configuration
lora_config_dict = {
    'r': 32,
    'lora_alpha':32,  # scaling factor: by default the LoRA update is scaled by lora_alpha / r
    'lora_dropout': 0.05, # dropout applied to the input of the LoRA layers during training
    # print(model) to see all the linear layers, and do LoRA on all of them
    'target_modules':['q_proj', 'k_proj', 'v_proj', 'out_proj', 'fc1', 'fc2'],
    # unfreeze the head of the model too
    'modules_to_save': ['lm_head']
}

# Parameters that technically I can tune but I am not tuning (it wouldn't be right to create a function for them)
########################################################################################################
########################################################################################################
########################################################################################################

# load the preprocessed dataset
tokenized_dataset_fp = f'ProcessedDatasets/{model_fp}/'
tokenized_datasets = load_from_disk(tokenized_dataset_fp)
tokenized_datasets.set_format("torch")
# tokenized_datasets = tokenized_datasets.filter(lambda x: x['enumeration'] == '(9)')
tokenized_datasets = tokenized_datasets.remove_columns(['enumeration'])
# # for testing purposes
# n = 16 * 10
# tokenized_datasets['test'] = tokenized_datasets['test'].select(range(n))
# tokenized_datasets['validation'] = tokenized_datasets['validation'].select(range(n))
# tokenized_datasets['train'] = tokenized_datasets['train'].select(range(n))
# initialize dataloaders
dataloaders = {}
dataloaders['train'] = DataLoader(tokenized_datasets['train'], batch_size=per_device_train_batch_size, shuffle=True)
dataloaders['test'] = DataLoader(tokenized_datasets['test'], batch_size=per_device_train_batch_size)
dataloaders['validation'] = DataLoader(tokenized_datasets['validation'], batch_size=per_device_train_batch_size, shuffle=True)  # shuffle because we want to subsample


# define model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Add LoRA:
lora_config = LoraConfig(**lora_config_dict)
model = get_peft_model(base_model, lora_config)
hyper_parameters['lora_config'] = lora_config_dict

# initialize optimizer: Notice: change the model here!
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# initialize scheduler
# the scheduler is stepped once per batch, so its arguments are counted in optimizer steps, not in samples
steps_per_epoch = len(dataloaders['train'])
num_training_steps = steps_per_epoch * num_train_epochs
lr_scheduler_type = 'linear'
lr_scheduler_kwargs = {'optimizer':optimizer,
                        # assuming the intent is to warm up over the first 10% of the training steps
                        'num_warmup_steps':int(0.1 * num_training_steps),
                        'num_training_steps':num_training_steps}
scheduler = get_scheduler(lr_scheduler_type, **lr_scheduler_kwargs)


# defining the output directory
output_dir = f'TrainingData/{model_fp}/epoch={num_train_epochs}_batch={per_device_train_batch_size}_lr={learning_rate}_LoRA_teacher/'

# Training models
########################################################################################################
########################################################################################################
########################################################################################################

# define training parameters
training_parameters = {
    'model': model,
    'tokenizer': tokenizer,
    'output_dir': output_dir,
    'dataloaders': dataloaders,
    'optimizer':optimizer,
    'scheduler':scheduler,
    'device':torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    'hyper_parameters':hyper_parameters
}

# train the model
train_model(**training_parameters)

{'evaluate_set': 'validation', 'epoch': 0, 'batch': 0, 'avg_loss': 1.0829987287521363, 'accuracy': 0.029218749999999998, 'subsample_size': 1000}
{'evaluate_set': 'test', 'epoch': 0, 'batch': 0, 'avg_loss': 1.0830321073532105, 'accuracy': 0.02953125, 'subsample_size': 1000}
{'evaluate_set': 'train', 'epoch': 0, 'batch': 0, 'avg_loss': 0.9581412076950073, 'accuracy': 0.0265625, 'subsample_size': 'None'}
{'evaluate_set': 'validation', 'epoch': 0, 'batch': 0, 'avg_loss': 1.0858865635735648, 'accuracy': 0.02879464285714286, 'subsample_size': 100}
{'evaluate_set': 'train', 'epoch': 0, 'batch': 1, 'avg_loss': 0.9848820567131042, 'accuracy': 0.0265625, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 0, 'batch': 2, 'avg_loss': 0.9708995819091797, 'accuracy': 0.025, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 0, 'batch': 3, 'avg_loss': 0.9629691243171692, 'accuracy': 0.0265625, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 0, 'batch': 4, 'avg_loss': 0

# Training bart-large-cnn

In [None]:
# Parameters I am tuning
########################################################################################################
########################################################################################################
########################################################################################################

model_fp = 'bart-large-cnn'
model_name = 'facebook/bart-large-cnn'

# define hyperparameters
per_device_train_batch_size = 16
learning_rate = 5e-04
num_train_epochs = 3

# define the best hyper parameter
hyper_parameters = {
            'learning_rate': learning_rate,
            'per_device_train_batch_size': per_device_train_batch_size,
            'num_train_epochs': num_train_epochs
        }

# Define LoRA configuration
lora_config_dict = {
    'r': 32,
    'lora_alpha':32,  # scaling factor: by default the LoRA update is scaled by lora_alpha / r
    'lora_dropout': 0.05, # dropout applied to the input of the LoRA layers during training
    # print(model) to see all the linear layers, and do LoRA on all of them
    'target_modules':['q_proj', 'k_proj', 'v_proj', 'out_proj', 'fc1', 'fc2'],
    # unfreeze the head of the model too
    'modules_to_save': ['lm_head']
}

# Parameters that technically I can tune but I am not tuning (it wouldn't be right to create a function for them)
########################################################################################################
########################################################################################################
########################################################################################################

# load the preprocessed dataset
tokenized_dataset_fp = f'ProcessedDatasets/{model_fp}/'
tokenized_datasets = load_from_disk(tokenized_dataset_fp)
tokenized_datasets.set_format("torch")
# tokenized_datasets = tokenized_datasets.filter(lambda x: x['enumeration'] == '(9)')
tokenized_datasets = tokenized_datasets.remove_columns(['enumeration'])
# # for testing purposes
# n = 16 * 10
# tokenized_datasets['test'] = tokenized_datasets['test'].select(range(n))
# tokenized_datasets['validation'] = tokenized_datasets['validation'].select(range(n))
# tokenized_datasets['train'] = tokenized_datasets['train'].select(range(n))
# initialize dataloaders
dataloaders = {}
dataloaders['train'] = DataLoader(tokenized_datasets['train'], batch_size=per_device_train_batch_size, shuffle=True)
dataloaders['test'] = DataLoader(tokenized_datasets['test'], batch_size=per_device_train_batch_size)
dataloaders['validation'] = DataLoader(tokenized_datasets['validation'], batch_size=per_device_train_batch_size, shuffle=True)  # shuffle because we want to subsample


# define model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Add LoRA:
lora_config = LoraConfig(**lora_config_dict)
model = get_peft_model(base_model, lora_config)
hyper_parameters['lora_config'] = lora_config_dict

# initialize optimizer: Notice: change the model here!
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# initialize scheduler
# the scheduler is stepped once per batch, so its arguments are counted in optimizer steps, not in samples
steps_per_epoch = len(dataloaders['train'])
num_training_steps = steps_per_epoch * num_train_epochs
lr_scheduler_type = 'linear'
lr_scheduler_kwargs = {'optimizer':optimizer,
                        # assuming the intent is to warm up over the first 10% of the training steps
                        'num_warmup_steps':int(0.1 * num_training_steps),
                        'num_training_steps':num_training_steps}
scheduler = get_scheduler(lr_scheduler_type, **lr_scheduler_kwargs)


# defining the output directory
output_dir = f'TrainingData/{model_fp}/epoch={num_train_epochs}_batch={per_device_train_batch_size}_lr={learning_rate}_LoRA_teacher/'

# Training models
########################################################################################################
########################################################################################################
########################################################################################################

# define training parameters
training_parameters = {
    'model': model,
    'tokenizer': tokenizer,
    'output_dir': output_dir,
    'dataloaders': dataloaders,
    'optimizer':optimizer,
    'scheduler':scheduler,
    'device':torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    'hyper_parameters':hyper_parameters
}

# train the model
train_model(**training_parameters)

{'evaluate_set': 'train', 'epoch': 2, 'batch': 24477, 'avg_loss': 0.03702634572982788, 'accuracy': 0.2894736842105263, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24478, 'avg_loss': 0.03103853203356266, 'accuracy': 0.4057971014492754, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24479, 'avg_loss': 0.031083565205335617, 'accuracy': 0.3835616438356164, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24480, 'avg_loss': 0.03497011214494705, 'accuracy': 0.4084507042253521, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24481, 'avg_loss': 0.03249131515622139, 'accuracy': 0.37333333333333335, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24482, 'avg_loss': 0.030218252912163734, 'accuracy': 0.3783783783783784, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 2, 'batch': 24483, 'avg_los

In [None]:
from google.colab import runtime
runtime.unassign()

# Training T5-large
Since LoRA performs horribly on BART without any modification, I will not waste more computational power on this. I will train on a subsample of about 1/10 of the dataset, just to get a simple baseline.
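
The cell below takes the first `n` rows of the training set. If a random subsample is preferred, one option is to draw the row indices with a fixed seed. A minimal sketch in plain Python; `subsample_indices` is a hypothetical helper and the dataset size of 480,000 is made up for illustration:

```python
import random


def subsample_indices(dataset_size, n, seed=0):
    """Pick n distinct row indices uniformly at random, reproducibly."""
    if n > dataset_size:
        raise ValueError("cannot sample more rows than the dataset contains")
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), n))


# Made-up dataset size; n matches the 16 * 3000 used in the cell below.
indices = subsample_indices(dataset_size=480_000, n=16 * 3000)
print(len(indices), indices[:5])
```

The resulting list could then be passed to `tokenized_datasets['train'].select(indices)` in place of `range(n)`.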

In [None]:
# Parameters I am tuning
########################################################################################################
########################################################################################################
########################################################################################################

model_fp = 't5-large'
model_name = "google-t5/t5-large"

# define hyperparameters
per_device_train_batch_size = 16
learning_rate = 5e-04
num_train_epochs = 3

# define the best hyper parameter
hyper_parameters = {
            'learning_rate': learning_rate,
            'per_device_train_batch_size': per_device_train_batch_size,
            'num_train_epochs': num_train_epochs
        }

# Define LoRA configuration
lora_config_dict = {
    'r': 32,
    'lora_alpha':32,  # scaling factor: by default the LoRA update is scaled by lora_alpha / r
    'lora_dropout': 0.05, # dropout applied to the input of the LoRA layers during training
    # print(model) to see all the linear layers, and do LoRA on all of them
    'target_modules':['q', 'k', 'v', 'o', 'wi_0', 'wi_1', 'wo'],
    # unfreeze the head of the model too
    'modules_to_save': ['lm_head']
}

# Parameters that technically I can tune but I am not tuning (it wouldn't be right to create a function for them)
########################################################################################################
########################################################################################################
########################################################################################################

# load the preprocessed dataset
tokenized_dataset_fp = f'ProcessedDatasets/{model_fp}/'
tokenized_datasets = load_from_disk(tokenized_dataset_fp)
tokenized_datasets.set_format("torch")
# tokenized_datasets = tokenized_datasets.filter(lambda x: x['enumeration'] == '(9)')
tokenized_datasets = tokenized_datasets.remove_columns(['enumeration'])
# subsample the training set: 16 * 3000 = 48,000 examples
n = 16 * 3000
# tokenized_datasets['test'] = tokenized_datasets['test'].select(range(n))
# tokenized_datasets['validation'] = tokenized_datasets['validation'].select(range(n))
tokenized_datasets['train'] = tokenized_datasets['train'].select(range(n))
# initialize dataloaders
dataloaders = {}
dataloaders['train'] = DataLoader(tokenized_datasets['train'], batch_size=per_device_train_batch_size, shuffle=True)
dataloaders['test'] = DataLoader(tokenized_datasets['test'], batch_size=per_device_train_batch_size)
dataloaders['validation'] = DataLoader(tokenized_datasets['validation'], batch_size=per_device_train_batch_size, shuffle=True)  # shuffle because we want to subsample


# define model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Add LoRA:
lora_config = LoraConfig(**lora_config_dict)
model = get_peft_model(base_model, lora_config)
hyper_parameters['lora_config'] = lora_config_dict

# initialize optimizer: Notice: change the model here!
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# initialize scheduler
# the scheduler is stepped once per batch, so its arguments are counted in optimizer steps, not in samples
steps_per_epoch = len(dataloaders['train'])
num_training_steps = steps_per_epoch * num_train_epochs
lr_scheduler_type = 'linear'
lr_scheduler_kwargs = {'optimizer':optimizer,
                        # assuming the intent is to warm up over the first 10% of the training steps
                        'num_warmup_steps':int(0.1 * num_training_steps),
                        'num_training_steps':num_training_steps}
scheduler = get_scheduler(lr_scheduler_type, **lr_scheduler_kwargs)


# defining the output directory
output_dir = f'TrainingData/{model_fp}/epoch={num_train_epochs}_batch={per_device_train_batch_size}_lr={learning_rate}_LoRA_teacher/'

# Training models
########################################################################################################
########################################################################################################
########################################################################################################

# define training parameters
training_parameters = {
    'model': model,
    'tokenizer': tokenizer,
    'output_dir': output_dir,
    'dataloaders': dataloaders,
    'optimizer':optimizer,
    'scheduler':scheduler,
    'device':torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    'hyper_parameters':hyper_parameters
}

# train the model
train_model(**training_parameters)

{'evaluate_set': 'train', 'epoch': 1, 'batch': 1052, 'avg_loss': 0.013883481733500957, 'accuracy': 0.54, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1053, 'avg_loss': 0.01566918008029461, 'accuracy': 0.48, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1054, 'avg_loss': 0.01699378713965416, 'accuracy': 0.47368421052631576, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1055, 'avg_loss': 0.018053151667118073, 'accuracy': 0.41818181818181815, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1056, 'avg_loss': 0.017987122759222984, 'accuracy': 0.4827586206896552, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1057, 'avg_loss': 0.01573948934674263, 'accuracy': 0.4642857142857143, 'subsample_size': 'None'}
{'evaluate_set': 'train', 'epoch': 1, 'batch': 1058, 'avg_loss': 0.02007896825671196, 'accurac

In [None]:
from google.colab import runtime
runtime.unassign()