# LLM Fine-Tuning using HuggingFace's 🤗 Encoder-Decoder models and 🔭 Galileo

In this tutorial, we will fine-tune an Encoder-Decoder model from HuggingFace 🤗 for instruction completion and explore the results in Galileo.

We use the well-known Alpaca instruction-tuning dataset from the [Stanford Alpaca project](https://github.com/tatsu-lab/stanford_alpaca). Along the way, we highlight several known data errors and limitations of this dataset!

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

# Install Dependencies [Including Setting up DQ] + Add Imports

In [None]:
#@title Install `dataquality`

# Upgrade pip
!pip install -U pip &> /dev/null

# Install all dependencies
!pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/
print('👋 Installed necessary libraries.')

# Select a small portion of the dataset for CI/QA.
import os
def _minimize_for_ci() -> bool:
    return os.getenv("MINIMIZE_FOR_CI", "false") == "true"

# 1. Initialize Galileo

In [None]:
import dataquality as dq
# 🔭🌕 Galileo log-in

dq.init(task_type="seq2seq",
        project_name="galileo-finetune",
        run_name="example_run_galileo-finetune_1")

# 2. Load Data
Load the data from Hugging Face and format it for fine-tuning an Encoder-Decoder model. Additionally, the original Alpaca dataset does not specify a val/test split so we randomly sample to get train/val/test with the ratios (0.85, 0.1, 0.05).

NOTE: We are working with LLMs (emphasis on Large) and Alpaca is a decently sized dataset with 52,000 data samples. Therefore, training times can be long. To speed up training during this tutorial, consider setting the flag `use_small_ds = True`. This downsamples the Alpaca dataset to its first 10,000 samples before splitting into train/val/test.

In [None]:
use_small_ds = True

In [None]:
#@title Load 🤗 HuggingFace Alpaca Dataset

from datasets import load_dataset, Dataset, DatasetDict

ds = load_dataset("tatsu-lab/alpaca", trust_remote_code=True)

if use_small_ds or _minimize_for_ci():
    total_n_samples = 50 if _minimize_for_ci() else 10_000
    ds = DatasetDict({'train': Dataset.from_dict(ds['train'][:total_n_samples])})
ds

In [None]:
#@title Format the Dataset For Encoder-Decoder Fine-Tuning
#@markdown We use the following data format to combine the `instruction` and `input` columns: ```Human: {instruction} Context: {input}``` (the `Context:` part is omitted when `input` is empty)

# FORMAT ALPACA
def format_alpaca(example, idx):
  return {"formatted_input": f"Human: {example['instruction']}" + f" Context: {example['input']}"*bool(example['input']),
          "id": idx}

if "formatted_input" not in ds['train'].features:
  ds = ds.map(format_alpaca, with_indices=True, remove_columns=['text', 'instruction', 'input'])
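
To make the formatting rule concrete, here is a minimal standalone sketch of the same string logic applied to two made-up examples. The helper name `format_prompt` and the sample rows are illustrative assumptions; they are not part of the notebook or taken from the Alpaca dataset.

```python
def format_prompt(instruction: str, input_text: str = "") -> str:
    """Hypothetical helper mirroring the string logic of `format_alpaca`."""
    prompt = f"Human: {instruction}"
    # Only append the context when the example actually has an input.
    if input_text:
        prompt += f" Context: {input_text}"
    return prompt


# Made-up rows for illustration only.
with_input = format_prompt("Sort the numbers.", "3 1 2")
without_input = format_prompt("Name a primary color.")

print(with_input)
print(without_input)
```

Examples without an `input` therefore produce a prompt that contains only the `Human:` part.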

In [None]:
#@title Split the data into train/val/test splits as (0.85/0.1/0.05)
#@markdown The original Alpaca dataset does not come with a designated val/test split so we randomly sample to create these.

if 'val' not in ds and 'valid' not in ds and 'validation' not in ds:
  ds = ds.shuffle(seed=8)

  num_samples = len(ds['train'])
  train_size = int(num_samples * 0.85)
  val_size = int(num_samples * 0.1)

  ds_train = Dataset.from_dict(ds['train'][:train_size])
  ds_val = Dataset.from_dict(ds['train'][train_size:train_size + val_size])
  ds_test = Dataset.from_dict(ds['train'][train_size + val_size:])

  ds = DatasetDict({
      'train': ds_train,
      'val': ds_val,
      'test': ds_test
  })

ds
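
As a quick sanity check on the split arithmetic, the sketch below reproduces the size computation in plain Python. The helper name `split_sizes` is hypothetical; the fractions and the two sample counts (10,000 for the small dataset, 50 for CI) come from the cells above.

```python
def split_sizes(num_samples: int, train_frac: float = 0.85, val_frac: float = 0.1):
    """Hypothetical helper reproducing the split arithmetic used above."""
    train_size = int(num_samples * train_frac)
    val_size = int(num_samples * val_frac)
    # The test split takes whatever remains after train and val.
    test_size = num_samples - train_size - val_size
    return train_size, val_size, test_size


small = split_sizes(10_000)  # sizes when use_small_ds = True
ci = split_sizes(50)         # sizes when MINIMIZE_FOR_CI is set

print(small, ci)
```

Because `int()` truncates, the test split absorbs any rounding remainder, so its share can be slightly above 0.05 for small datasets.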

# 3. Setup Logging with Galileo
Galileo "watches" (i.e. uses) your `model`, `tokenizer`, and `GenerationConfig` to aid in logging + computing token level statistics and to power generation after training.

In this tutorial, we use the Encoder-Decoder model [`google/flan-t5-small`](https://huggingface.co/google/flan-t5-small) and leverage a simple greedy decoding strategy.

To speed up training and reduce memory usage, we limit `max_target_tokens` (for the decoder block) to `128`, while leaving `max_input_tokens` (for the encoder block) at the default of 512. Feel free to raise these limits to reduce the number of truncated samples.

In [None]:
from transformers import AutoTokenizer, GenerationConfig, T5ForConditionalGeneration
from dataquality.integrations.seq2seq.core import watch
from dataquality.schemas.seq2seq import Seq2SeqModelType

# Load model and tokenizer
MODEL = "google/flan-t5-small"
MAX_INPUT_TOKENS = 512
MAX_TARGET_TOKENS = 128

# Generation
MAX_NEW_TOKENS = 128


tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True, model_max_length=MAX_INPUT_TOKENS)
model = T5ForConditionalGeneration.from_pretrained(MODEL)

generation_config = GenerationConfig(
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)

watch(
    model_type=Seq2SeqModelType.encoder_decoder.value,
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
    # generation_splits=["test"],  # 🔭🌕 Galileo generates on Test set only by default
    max_input_tokens=MAX_INPUT_TOKENS,
    max_target_tokens=MAX_TARGET_TOKENS
)

# 4. Log input data with Galileo
Input data can be logged via `log_dataset`, which accepts iterables such as `HuggingFace` or `PyTorch Dataset` classes. This step logs the `input` (referenced as the `text`) and `target` (referenced as `label`) data columns for each split.

In our setting we have the following data columns:
- `text`/`input` = `formatted_input`
- `label`/`target` = `output`

In [None]:
# Log datasets with dq
from functools import partial

def _log_dataset(_ds, split, input_col, target_col):
    dq.log_dataset(
        _ds,
        text=input_col,
        label=target_col,
        split=split
    )

log_dataset = partial(_log_dataset, input_col="formatted_input", target_col="output")


# Log the training, validation, and test splits
log_dataset(ds['train'], split="training")
log_dataset(ds['val'], split='validation')
log_dataset(ds['test'], split='test')

# 5. Tokenize and Format Data for Training

Here we tokenize the data and create per-split DataLoaders.

Two important parameters to control are `BATCH_SIZE` and `ACCUMULATION_STEPS`. To work within GPU memory limits, we default to `BATCH_SIZE=8` and `ACCUMULATION_STEPS=4`, giving an effective batch size of 32.

These settings require around `4 GB` of GPU RAM. If you run into memory issues, or want to use more of your GPU, feel free to change them!

In [None]:
BATCH_SIZE = 8
ACCUMULATION_STEPS = 4  # Effective batch size = 32
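
The following sketch works through what these two settings imply for one epoch. It assumes 8,500 training samples (the 10,000-sample subset with a 0.85 train fraction) and an optimizer update every `ACCUMULATION_STEPS` batches plus one on the final batch, which is the condition used in the training loop later in this notebook. The helper name `optimizer_steps_per_epoch` is hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_samples: int, batch_size: int, accumulation_steps: int):
    """Hypothetical helper: count batches and optimizer updates in one epoch."""
    num_batches = math.ceil(num_samples / batch_size)
    updates = 0
    for step in range(num_batches):
        # Update every `accumulation_steps` batches, and once more on the last batch.
        if (step + 1) % accumulation_steps == 0 or (step + 1) == num_batches:
            updates += 1
    return num_batches, updates


effective_batch_size = 8 * 4
num_batches, num_updates = optimizer_steps_per_epoch(8_500, batch_size=8, accumulation_steps=4)

print(effective_batch_size, num_batches, num_updates)
```

Gradient accumulation trades more forward/backward passes per update for a lower peak memory footprint, since only `BATCH_SIZE` samples are resident on the GPU at a time.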

In [None]:
#@title Tokenize Inputs and Targets
def tokenize(row, input_col, target_col, max_input_length=512, max_target_length=512):
  """Tokenize the inputs and targets

  Creates the following columns:
    - input_ids
    - attention_mask
    - labels

  Note: We keep the id column for use in Galileo logging
  """
  model_inputs = tokenizer(
        row[input_col],
        truncation=True,
        max_length=max_input_length,
        padding=False,
        return_tensors=None,
    )
  labels = tokenizer(
        row[target_col],
        truncation=True,
        max_length=max_target_length,
        padding=False,
        return_tensors=None,
    ).input_ids

  model_inputs['labels'] = labels
  model_inputs['id'] = row['id']
  return model_inputs


ds_tokenized = ds.map(lambda x: tokenize(x, input_col="formatted_input", target_col="output", max_input_length=MAX_INPUT_TOKENS, max_target_length=MAX_TARGET_TOKENS),
                      remove_columns=ds['train'].column_names,
                      batched=True,
                      desc="Tokenizing Datasets")

In [None]:
#@title Setup the dataloaders
from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader

data_collator = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)

train_dataset = ds_tokenized["train"]
eval_dataset = ds_tokenized['val']
test_dataset = ds_tokenized['test']

train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=BATCH_SIZE, pin_memory=True)
eval_dataloader = DataLoader(eval_dataset, shuffle=False, collate_fn=data_collator, batch_size=BATCH_SIZE, pin_memory=True)
test_dataloader = DataLoader(test_dataset, shuffle=False, collate_fn=data_collator, batch_size=BATCH_SIZE, pin_memory=True)

evaluation_dataloaders = {
    'validation': eval_dataloader,
    'test': test_dataloader
}
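
Since tokenization used `padding=False`, padding happens per batch in the collator. The pure-Python sketch below illustrates the idea under simplifying assumptions: it right-pads inputs with the pad token id (0 for T5 tokenizers) and pads labels with `-100`, the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, which the loss ignores. The helper name `pad_batch` and the token ids are made up; the real collator returns tensors and handles more cases.

```python
def pad_batch(rows, pad_token_id=0, label_pad_id=-100):
    """Hypothetical helper: pad a list of tokenized rows to the batch maximum."""
    max_in = max(len(r["input_ids"]) for r in rows)
    max_lab = max(len(r["labels"]) for r in rows)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for r in rows:
        n_in = len(r["input_ids"])
        n_lab = len(r["labels"])
        batch["input_ids"].append(r["input_ids"] + [pad_token_id] * (max_in - n_in))
        # Real tokens get mask 1, padding gets mask 0.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Padded label positions use -100 so they do not contribute to the loss.
        batch["labels"].append(r["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch


# Made-up token ids for illustration only.
rows = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [10, 11, 1]},
]
batch = pad_batch(rows)
print(batch)
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short examples small.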

# 6. Putting it all together - Training a Model!

Now we put it all together and train our model while logging to Galileo. This requires only two additional calls:
1. `dq.log_model_outputs`: Model "data" (i.e. logits) can be logged via `log_model_outputs` during the training and evaluation process.
2. `dq.set_epoch_and_split(split=<split>, epoch=<epoch>)`: Before logging model data for a given split we must indicate to Galileo which split and epoch we are logging for.

In [None]:
import torch

NUM_EPOCHS = 1
LR = 3e-4

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

In [None]:
#@title Run this cell to activate Tensorboard
# Load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir=/content/runs --load_fast=false

In [None]:
from torch.utils.tensorboard import SummaryWriter
from transformers import Adafactor
from tqdm import tqdm

writer = SummaryWriter()
writer.add_custom_scalars({
    "Losses": {
        # Tag names must match the (case-sensitive) tags passed to `add_scalar`
        "loss": ["Multiline", ["Loss/train", "Loss/validation", "Loss/test"]]
        }
    }
)

# training and evaluation
model = model.to(device)
optimizer = Adafactor(model.parameters(), lr=LR, scale_parameter=False, relative_step=False)

for epoch in range(NUM_EPOCHS):
    # 🔭🌕 Galileo set epoch and split
    dq.set_epoch_and_split(split="training", epoch=epoch)
    model.train()

    train_epoch_loss = 0.
    for step, batch in enumerate(tqdm(train_dataloader)):
      ids = batch['id']
      batch = {k: v.to(device) for k, v in batch.items() if k != 'id'}

      outputs = model(**batch)

      # 🔭🌕 Galileo logging
      logits = outputs.logits  # Shape - [bs, bs_seq_ln, vocab]
      dq.log_model_outputs(
        logits = logits.detach().cpu().numpy(),  # detach first: training logits carry gradients
        ids = ids
      )

      loss = outputs.loss / ACCUMULATION_STEPS

      loss.backward()
      # Grad Accumulation
      if ((step + 1) % ACCUMULATION_STEPS == 0) \
          or ((step + 1) == len(train_dataloader)):
        optimizer.step()
        optimizer.zero_grad()

      step_loss = loss.detach().cpu().item()
      train_epoch_loss += step_loss
      writer.add_scalar("Loss/train", step_loss, global_step=epoch*len(train_dataloader) + step) # Per step loss tracking

    train_epoch_loss = train_epoch_loss / len(train_dataloader) * ACCUMULATION_STEPS # Correct for the constant factor
    print(f"{epoch=}: {train_epoch_loss=}")

    # Evaluation
    model.eval()
    with torch.no_grad():
      for eval_split, dataloader in evaluation_dataloaders.items():
        eval_epoch_loss = 0
        dq.set_epoch_and_split(split=eval_split, epoch=epoch)
        for step, batch in enumerate(tqdm(dataloader, desc=f"Evaluation on split: {eval_split}")):
            ids = batch['id']
            batch = {k: v.to(device) for k, v in batch.items() if k != 'id'}

            outputs = model(**batch)

            # 🔭🌕 Galileo logging
            logits = outputs.logits  # Shape - [bs, bs_seq_ln, vocab]
            dq.log_model_outputs(
              logits = logits.cpu().numpy(),
              ids = ids
            )

            loss = outputs.loss
            eval_step_loss = loss.cpu().item()
            eval_epoch_loss += eval_step_loss

        # Look just at the loss in aggregate, averaged over this split's batches
        eval_epoch_loss = eval_epoch_loss / len(dataloader)
        writer.add_scalar(f"Loss/{eval_split}", eval_epoch_loss, global_step=epoch)

        print(f"{eval_split}: {eval_epoch_loss=}")
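
The training loop divides each batch loss by `ACCUMULATION_STEPS` before `backward()` and multiplies the epoch average back by the same factor when reporting. The sketch below checks that arithmetic on made-up loss values; the variable names are illustrative and no model is involved.

```python
import math

accumulation_steps = 4
batch_losses = [2.0, 1.5, 1.0, 0.5, 0.25]  # made-up per-batch losses

running = 0.0
for loss in batch_losses:
    # Same scaling as `outputs.loss / ACCUMULATION_STEPS` in the loop above.
    running += loss / accumulation_steps

# Undo the constant factor to recover the plain mean of the batch losses.
epoch_loss = running / len(batch_losses) * accumulation_steps
mean_loss = sum(batch_losses) / len(batch_losses)

print(epoch_loss, mean_loss)
```

Note that the per-step values written to TensorBoard in the loop are the scaled losses, so they are smaller than the reported epoch loss by a factor of `ACCUMULATION_STEPS`.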

# 7. Wrapping up - Pushing Data to Galileo

Now that we have finished training, the final step is to call `dq.finish()`, which kicks off data processing and uploads results to Galileo!


Some special things to note:
1. As a reminder, Galileo internally leverages 🤗 HuggingFace's standardized `generation` workflow to generate model completions over the specified data splits. This is why we have you log your `GenerationConfig`!
2. `dq.finish()` requires some additional GPU RAM (about `0.5-1` GB from `generation` and some other processes). Therefore, it is very helpful to clear out any unused GPU RAM to avoid memory issues. In general, deleting the `optimizer` should suffice (though you can delete all unused GPU data), which we demonstrate below.

In [None]:
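import gc

# Drop references to GPU-resident objects that are no longer needed
del optimizer
del batch
del outputs

# Collect garbage first so the freed tensors can be released from the CUDA cache
gc.collect()
torch.cuda.empty_cache()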

In [None]:
dq.finish(data_embs_col="input")