# Transformers Trainer API

We use the [Trainer API](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) to train our model.

Tutorial: [NLP Course / Fine-tuning](https://huggingface.co/learn/nlp-course/chapter3/)

This notebook does the following:
* Full model training
    * Full precision - 16 bit 
    * Train all layers
* Because of the heavy GPU memory requirements, a larger GPU such as an A100 is needed.
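
To make the memory requirement concrete, the following back-of-the-envelope sketch estimates the training state for full fine-tuning. It assumes roughly 2.5 billion parameters for the 2B Gemma checkpoint and an AdamW-style optimizer with two state tensors per parameter. It ignores activations and framework overhead, so treat the figures as rough lower bounds, not measurements.

```python
# Rough lower bound on memory for full fine-tuning; activations are ignored.
N_PARAMS = 2.5e9  # assumed approximate parameter count for the 2B Gemma checkpoint


def training_state_gib(n_params, bytes_per_value):
    """Weights + gradients + two AdamW moment tensors, all stored in one dtype."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adamw_moments = 2 * n_params * bytes_per_value
    return (weights + gradients + adamw_moments) / 2**30


for label, width in [("32-bit", 4), ("16-bit", 2)]:
    print(f"{label}: ~{training_state_gib(N_PARAMS, width):.0f} GiB")
```

Under these assumptions the 32-bit case needs tens of GiB before any activations are stored, which is why a card in the 80 GB class is the comfortable choice.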



## Goal

We will fine-tune the Gemma model (`google/gemma-2b`) on the OpenAssistant Guanaco dataset (`timdettmers/openassistant-guanaco`).


### 1. Data Prep

The data preparation steps are carried over from another notebook.

In [1]:
# HF login
import os
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")
login(HF_TOKEN)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/vijay/.cache/huggingface/token
Login successful


In [33]:
# TODO - Azure ML login
import azureml
from azureml.core import Workspace, Run

ws = Workspace.from_config()
print(f"Workspace name: {ws.name}, location: {ws.location}, subscription id: {ws.subscription_id}, resource group: {ws.resource_group}")

Workspace name: found-ml-workspace, location: southcentralus, subscription id: cc15f598-391e-45f7-a7ea-2b9ad5b3bffc, resource group: machine-learning


In [20]:
# import torch
# torch.cuda.empty_cache()

In [2]:
!nvidia-smi

Wed Mar 20 19:08:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          Off | 00000001:00:00.0 Off |                    0 |
| N/A   36C    P0              62W / 300W |      0MiB / 81920MiB |     21%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
DATASET_NAME = "timdettmers/openassistant-guanaco"
MODEL_NAME = "google/gemma-2b"

In [3]:
from datasets import load_dataset


raw_datasets = load_dataset(DATASET_NAME)

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
from transformers import AutoTokenizer, DataCollatorWithPadding, DataCollatorForLanguageModeling


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LENGTH = 16     # that will fit in GPU memory

# prepare data for the causal LM task
def tokenize_function(example):
    input_text = example["text"]
    # tokenize the text
    return tokenizer(
        input_text,
        text_target=input_text,     # labels are the same tokens as the input; the model shifts them by one position internally
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,               # pad here so the `labels` column has a uniform length; the collator's padding does not cover a precomputed `labels` column
        # return_tensors="pt",      # handled by data collator
    )


#### About `DataCollators`
* `DataCollatorForLanguageModeling` pads the tokenizer's own fields (`input_ids`, `attention_mask`) to the longest sequence in the batch, as `DataCollatorWithPadding` does. Neither pads a precomputed `labels` column, which is why the tokenizer call above pads.
* With `mlm=False`, it builds `labels` as a copy of `input_ids` and sets the label to -100 wherever the token is the pad token, so padded positions are ignored by the loss.
* Any `labels` already present in the examples are overwritten by this copy.

This [blog post](https://towardsdatascience.com/data-collators-in-huggingface-a0c76db798d2) summarizes the different data collators.
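
The label handling can be imitated in a few lines of plain Python. This is an illustrative sketch, not the library's code: `add_causal_lm_labels` is a hypothetical helper, the pad id of 0 is taken from `pad_token_id` in the Gemma config, and the token ids are sample values.

```python
PAD_ID = 0            # pad_token_id in the Gemma config
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def add_causal_lm_labels(batch_input_ids, pad_id=PAD_ID):
    """Hypothetical helper: copy input_ids into labels and mask padded positions."""
    labels = [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row]
        for row in batch_input_ids
    ]
    return {"input_ids": batch_input_ids, "labels": labels}


batch = add_causal_lm_labels(
    [[0, 0, 0, 2, 1596, 603], [2, 1596, 603, 2121, 235248, 235284]]
)
print(batch["labels"])
```

Because the mask is applied by token id, any real token that shares the pad id would be masked as well.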

In [5]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [6]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])


In [7]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [8]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 518
    })
})

In [9]:
tokenizer.decode(tokenized_datasets['train'][0]['input_ids'])

'<bos>### Human: Can you write a short introduction about the relevance of the term'

In [10]:
tokenizer.decode(tokenized_datasets['train'][0]['labels'])

'<bos>### Human: Can you write a short introduction about the relevance of the term'

In [11]:
tokenized_datasets['train'][0]['labels'] == tokenized_datasets['train'][0]['input_ids']

True
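
Identical `labels` and `input_ids` are what a causal LM expects, because the shift by one position happens inside the model's loss computation. The sketch below shows only the pairing logic in plain Python. `next_token_pairs` is a hypothetical helper, not a Transformers function, and the token ids are illustrative.

```python
def next_token_pairs(input_ids, labels, ignore_index=-100):
    """Pair the context input_ids[: t + 1] with the label at position t + 1."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != ignore_index:      # ignored labels contribute no loss
            pairs.append((input_ids[: t + 1], target))
    return pairs


ids = [2, 1596, 603, 2121]              # illustrative ids, starting with <bos> (id 2)
for context, target in next_token_pairs(ids, ids):
    print(context, "->", target)
```

Each prefix of the sequence is asked to predict the token that follows it, so no separate target text is needed.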

In [12]:
len(tokenized_datasets['train'][0]['input_ids'])

16

##### Dummy dataset test

In [13]:
from datasets import Dataset

dummy_dataset = {
    "text": [
        "This is test 1.",
        "This is test 2. Much longer seq.",
        "This is test 3. Much longer seq. Much longer seq. Much longer seq.",
        "This is test 4. Short and sweet."
    ]
}

dummy_dataset = Dataset.from_dict(dummy_dataset)
dummy_dataset

Dataset({
    features: ['text'],
    num_rows: 4
})

In [14]:
def dummy_tokenize_function(example):
    input_text = example["text"]
    # tokenize the text
    return tokenizer(
        input_text,
        # text_target=input_text,     # input is same as output. left shifting of labels is done by the transformers model
        truncation=True,
        max_length=10,
        padding=True,         
        # return_tensors="pt",
    )

tokenized_dummy_dataset = dummy_dataset.map(dummy_tokenize_function, batched=True, remove_columns=["text"])
tokenized_dummy_dataset


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 4
})

In [16]:
tokenized_dummy_dataset[:4]

{'input_ids': [[0, 0, 0, 2, 1596, 603, 2121, 235248, 235274, 235265],
  [2, 1596, 603, 2121, 235248, 235284, 235265, 19154, 5543, 28410],
  [2, 1596, 603, 2121, 235248, 235304, 235265, 19154, 5543, 28410],
  [2, 1596, 603, 2121, 235248, 235310, 235265, 13406, 578, 7786]],
 'attention_mask': [[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
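
The output above shows two effects of the tokenizer call: sequences longer than `max_length=10` are cut off, and the short first sequence is padded on the left with pad id 0, with matching zeros in `attention_mask`. Below is a minimal plain-Python imitation of that behaviour. `truncate_and_left_pad` is a hypothetical helper, and the second input is a made-up sequence that stands in for a longer text.

```python
def truncate_and_left_pad(sequences, max_length, pad_id=0):
    """Hypothetical helper: truncate, then left-pad to the longest kept sequence."""
    kept = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in kept)
    input_ids, attention_mask = [], []
    for seq in kept:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_seq = [2, 1596, 603, 2121, 235248, 235274, 235265]  # first row above, padding removed
long_seq = [2] + list(range(100, 111))                    # 12 made-up ids, longer than max_length
out = truncate_and_left_pad([short_seq, long_seq], max_length=10)
print(out["input_ids"][0])
print(out["attention_mask"][0])
```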

In [17]:
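# dummy_datacollator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")
dummy_datacollator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Slicing a Dataset returns one dict of columns. Wrapping that dict in a list makes
# the collator treat the four rows as a single example, so the tensors in the output
# have shape (1, 4, 10) rather than (4, 10).
samples = tokenized_dummy_dataset[:4]
samples = [samples]
dc_output = dummy_datacollator(samples)
dc_output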

{'input_ids': tensor([[[     0,      0,      0,      2,   1596,    603,   2121, 235248,
          235274, 235265],
         [     2,   1596,    603,   2121, 235248, 235284, 235265,  19154,
            5543,  28410],
         [     2,   1596,    603,   2121, 235248, 235304, 235265,  19154,
            5543,  28410],
         [     2,   1596,    603,   2121, 235248, 235310, 235265,  13406,
             578,   7786]]]), 'attention_mask': tensor([[[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]]), 'labels': tensor([[[  -100,   -100,   -100,      2,   1596,    603,   2121, 235248,
          235274, 235265],
         [     2,   1596,    603,   2121, 235248, 235284, 235265,  19154,
            5543,  28410],
         [     2,   1596,    603,   2121, 235248, 235304, 235265,  19154,
            5543,  28410],
         [     2,   1596,    603,   2121, 235248, 235310, 235265,  13406,
       

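A data collator expects a list with one dict per example, whereas slicing a `Dataset` returns a single dict of columns. One way to convert between the two in plain Python is sketched below. `columns_to_rows` is a hypothetical helper, and the values are shortened stand-ins for the token ids above.

```python
def columns_to_rows(columns):
    """Turn {'a': [1, 2], 'b': [3, 4]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


columns = {
    "input_ids": [[0, 2, 1596], [2, 1596, 603]],
    "attention_mask": [[0, 1, 1], [1, 1, 1]],
}
rows = columns_to_rows(columns)
print(len(rows), rows[0])
```

Passing such a list to the collator, for example `columns_to_rows(tokenized_dummy_dataset[:4])`, should give it four separate examples and two-dimensional tensors. That call has not been run in this notebook.
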
### 2. Training

Download the pre-trained model.

In [18]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")


In [19]:
model.config

GemmaConfig {
  "_name_or_path": "google/gemma-2b",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.1",
  "use_cache": true,
  "vocab_size": 256000
}

In [20]:
model.device

device(type='mps', index=0)

In [21]:
# test model
s1 = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)   # follow the model's device instead of hard-coding "mps"
g1 = model.generate(**s1, max_length=50, do_sample=True)
o1 = tokenizer.batch_decode(g1, skip_special_tokens=True)

print(o1[0])

Hello, my name is Anna and I am here to tell you about my journey with psoriasis.

I have suffered from psoriasis for over a year and a half. I went from not wearing certain clothes if I didn’t have a mask on


We first define `TrainingArguments` with the training parameters.

Then we define a `Trainer` and train.

* Note: the `Trainer` moves the model to an available accelerator (GPU or MPS) by default, even if the model was loaded on the CPU. To do: work out how to train the model on the CPU with the `Trainer`.
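
One option for the CPU question in the note above is the `use_cpu` flag of `TrainingArguments`, which exists in recent Transformers releases (older releases used `no_cuda`). The sketch below only collects the keyword arguments, so it runs without the library installed. The output directory name and step counts are placeholders, and the final call is left commented out because it has not been tried in this notebook.

```python
# Keyword arguments for a CPU-only smoke test (all values are placeholders).
cpu_training_kwargs = {
    "output_dir": "cpu-smoke-test",        # hypothetical directory name
    "use_cpu": True,                       # ask the Trainer not to pick CUDA or MPS
    "max_steps": 2,
    "per_device_train_batch_size": 1,
}

# Untested in this notebook:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**cpu_training_kwargs)
print(cpu_training_kwargs)
```

The model would probably also need to be loaded without `device_map="auto"` so that it starts on the CPU. That is an assumption, not something verified here.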

In [23]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    ".model/gemma-2b-oa-ft-test1",                     # directory to store the checkpoint
    # num_train_epochs=3,
    max_steps=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=5,
    # eval_steps=500,
    # label_names=["labels"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["test"],
    # eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,                
    tokenizer=tokenizer,
)
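
To see how little data this smoke-test configuration touches, the arithmetic can be spelled out. The sketch assumes a single device and the default `gradient_accumulation_steps=1`. The dataset size is the 518-row `test` split passed as `train_dataset`.

```python
max_steps = 10
per_device_train_batch_size = 4
logging_steps = 5
n_devices = 1                      # assumption: one accelerator
gradient_accumulation_steps = 1    # assumption: library default
dataset_rows = 518                 # rows in the split used for training

examples_seen = (
    max_steps * per_device_train_batch_size * n_devices * gradient_accumulation_steps
)
log_events = max_steps // logging_steps
fraction = examples_seen / dataset_rows

print(f"{examples_seen} examples, {log_events} log events, {fraction:.1%} of the split")
```

With these settings the run covers only a small fraction of one epoch, which suits a quick check that the training loop starts at all.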

In [24]:
trainer.accelerator.device

device(type='mps')

Start training.

In [25]:
trainer.train()

  0%|          | 0/10 [00:00<?, ?it/s]

RuntimeError: MPS backend out of memory (MPS allocated: 34.20 GB, other allocations: 1.95 GB, max allowed: 36.27 GB). Tried to allocate 128.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
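
The numbers in the error message explain the failure: the memory already allocated plus the new request exceeds the allowed maximum. The quick check below uses the rounded figures from the message and treats 128.00 MB as 0.125 GB, which is an assumption about the units PyTorch prints.

```python
mps_allocated_gb = 34.20
other_allocations_gb = 1.95
requested_gb = 128.00 / 1024       # 128 MB expressed in GB, assuming binary units
max_allowed_gb = 36.27

needed_gb = mps_allocated_gb + other_allocations_gb + requested_gb
print(f"needed {needed_gb:.3f} GB, allowed {max_allowed_gb:.2f} GB")
print("out of memory" if needed_gb > max_allowed_gb else "fits")
```

The margin is only a few megabytes and the inputs are rounded, so the check shows that the run sat at the limit. It is not an exact accounting.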