# Transformers Trainer API

We use the [Trainer API](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) to train our model.

Tutorial: [NLP Course / Fine-tuning](https://huggingface.co/learn/nlp-course/chapter3/)

This notebook does the following:
* Full model training
    * Full precision (16-bit)
    * Train all layers
* Because of the heavy GPU memory requirements, this needs a larger GPU such as an A100.



## Goal

We will fine-tune the Gemma 2B model (`google/gemma-2b`) on the OpenAssistant Guanaco dataset (`timdettmers/openassistant-guanaco`).


### 1. Data Prep

The data preparation steps are carried over from another notebook.

In [22]:
# HF login
import os
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")
login(HF_TOKEN)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/azureuser/.cache/huggingface/token
Login successful


In [4]:
# TODO - Azure ML login
import mlflow
import azureml
from azureml.core import Workspace, Run, Experiment

ws = Workspace.from_config()
print(f"Workspace name: {ws.name}, location: {ws.location}, subscription id: {ws.subscription_id}, resource group: {ws.resource_group}")

exp = Experiment(workspace=ws, name="gemma-2b-ft-with-openassistant-guanaco")


Workspace name: found-ml-workspace, location: southcentralus, subscription id: cc15f598-391e-45f7-a7ea-2b9ad5b3bffc, resource group: machine-learning


In [3]:
# import torch
# torch.cuda.empty_cache()

In [4]:
!nvidia-smi

Fri Mar 22 19:06:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          Off | 00000001:00:00.0 Off |                    0 |
| N/A   37C    P0              64W / 300W |      0MiB / 81920MiB |     21%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [5]:
DATASET_NAME = "timdettmers/openassistant-guanaco"
MODEL_NAME = "google/gemma-2b"

In [6]:
from datasets import load_dataset


raw_datasets = load_dataset(DATASET_NAME)

Repo card metadata block was not found. Setting CardData to empty.


In [7]:
from transformers import AutoTokenizer, DataCollatorWithPadding, DataCollatorForLanguageModeling


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LENGTH = 128      # truncate long sequences to MAX_LENGTH tokens so that batches fit in GPU memory

# prepare data for the causal LM task
def tokenize_function(example):
    input_text = example["text"]
    # tokenize the text
    return tokenizer(
        input_text,
        text_target=input_text,     # labels are the same as the inputs; the model shifts the labels internally
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,               # pad here so that `labels` is padded too; the collator cannot batch a ragged `labels` column
        # return_tensors="pt",      # handled by the data collator
    )



#### About `DataCollators`
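* `DataCollatorForLanguageModeling` pads `input_ids` and `attention_mask` through `tokenizer.pad`, as `DataCollatorWithPadding` does, but neither collator pads a token-level `labels` column. Because `tokenize_function` already returns `labels`, padding is applied in the tokenizer.
* With `mlm=False`, it builds `labels` as a copy of `input_ids`, whether or not the dataset already provides labels.
* It then sets the label to -100 at every pad-token position, so those positions are ignored by the loss.

This [blog post](https://towardsdatascience.com/data-collators-in-huggingface-a0c76db798d2) summarizes the different data collators.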
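
The label handling described above can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: the helper names `pad_batch` and `build_labels`, the right-side padding, and the toy token ids are assumptions made for the example. It shows what a padded batch and its labels look like, wherever the padding itself happens.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_batch(sequences, pad_token_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


def build_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing pad tokens with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]


# Toy batch; pad_token_id=0 matches the Gemma config printed later in this notebook.
batch = [[2, 17, 42, 99], [2, 8]]
input_ids, attention_mask = pad_batch(batch, pad_token_id=0)
labels = build_labels(input_ids, pad_token_id=0)

print(input_ids)
print(attention_mask)
print(labels)
```

Note that masking by token id also hides any genuine occurrence of the pad token inside a sequence, which is harmless as long as the pad token is not used for anything else.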

In [8]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [9]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])

In [10]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [11]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 518
    })
})

In [12]:
tokenizer.decode(tokenized_datasets['train'][0]['input_ids'])

'<bos>### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide'

In [13]:
tokenizer.decode(tokenized_datasets['train'][0]['labels'])

'<bos>### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide'

In [14]:
tokenized_datasets['train'][0]['labels'] == tokenized_datasets['train'][0]['input_ids']

True
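
The labels are identical to the inputs because the one-position shift is applied inside the model when it computes the loss. The sketch below illustrates that pairing in plain Python; `next_token_pairs` is a hypothetical helper written for this explanation, not a library function, and the token ids are made up.

```python
IGNORE_INDEX = -100  # targets with this value do not contribute to the loss


def next_token_pairs(input_ids, labels):
    """Return (context, target) pairs as a causal LM loss would see them.

    The prediction made at position i is scored against labels[i + 1];
    targets equal to IGNORE_INDEX are skipped.
    """
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], target))
    return pairs


tokens = [2, 10, 11, 12]  # toy ids; 2 stands in for <bos>
for context, target in next_token_pairs(tokens, labels=list(tokens)):
    print(context, "->", target)
```

The first token is never a target and the last token is never a context, so a sequence of length N yields at most N - 1 training pairs.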

In [15]:
len(tokenized_datasets['train'][0]['input_ids'])

128

### 2. Training

Download the pre-trained model.

In [16]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [17]:
model.config

GemmaConfig {
  "_name_or_path": "google/gemma-2b",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.1",
  "use_cache": true,
  "vocab_size": 256000
}

In [18]:
model.device

device(type='cuda', index=0)

In [19]:
# test model
s1 = tokenizer("### Human: What is capital of France? ### Assistant:", return_tensors="pt").to(model.device)
g1 = model.generate(**s1, max_length=100, do_sample=True)
o1 = tokenizer.batch_decode(g1, skip_special_tokens=True)

print(o1[0])

### Human: What is capital of France? ### Assistant: The capital of France is Paris
### Human: What is the capital of Japan? ### Assistant: Tokyo
### Human: what is the capital of India? ### Assistant: Delhi
### Human: what is the capital of Malaysia? ### Assistant: Kuala Lumpur
### Human: what is the capital of Vietnam? ### Assistant: Hanoi
### Human: what is the capital of the Republic of Korea(South Africa)? ### Assistant:


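We first define `TrainingArguments` with the training parameters, then create a `Trainer` and train.

* Note: `Trainer` moves the model to the GPU by default when one is available, even if the model was loaded on the CPU. TODO: figure out how to train on the CPU with `Trainer`; the `use_cpu` option of `TrainingArguments` looks like the relevant switch.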

In [20]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    ".model/gemma-2b-oa-ft-test1",                     # directory to store the checkpoint
    # num_train_epochs=1,
    max_steps=300,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=10,
    # eval_steps=500,
    # label_names=["labels"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,                
    tokenizer=tokenizer,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
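
With `max_steps=300` and `per_device_train_batch_size=8`, the run covers only part of the training split. A quick sanity check of that arithmetic follows; it assumes a single GPU (as in the `nvidia-smi` output above) and the default of no gradient accumulation, neither of which is set explicitly in the cell.

```python
import math

NUM_TRAIN_ROWS = 9846        # size of the train split shown earlier
PER_DEVICE_BATCH_SIZE = 8    # per_device_train_batch_size
MAX_STEPS = 300              # max_steps
NUM_DEVICES = 1              # assumption: single GPU
GRAD_ACCUM_STEPS = 1         # assumption: no gradient accumulation

examples_per_step = PER_DEVICE_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(NUM_TRAIN_ROWS / examples_per_step)
examples_seen = MAX_STEPS * examples_per_step
epochs_covered = examples_seen / NUM_TRAIN_ROWS

print(f"steps per epoch: {steps_per_epoch}")
print(f"examples seen:   {examples_seen}")
print(f"epochs covered:  {epochs_covered:.2f}")
```

Under these assumptions the run sees roughly a quarter of the training examples once, so it is a short test run rather than a full epoch.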


Evaluate before training.

In [23]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity before fine-tuning: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity before fine-tuning: 13.53
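
Perplexity here is the exponential of the evaluation loss, which is roughly the mean per-token cross-entropy in nats. The sketch below shows the relationship in both directions; the helper names and the per-token losses are hypothetical and only for illustration.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_losses) / len(token_losses))


def loss_from_perplexity(ppl):
    """Inverse mapping: the mean loss that corresponds to a given perplexity."""
    return math.log(ppl)


# Hypothetical per-token losses, for illustration only.
print(f"perplexity: {perplexity([2.0, 3.0, 2.5, 2.9]):.2f}")

# Mean evaluation loss implied by the perplexity printed above.
print(f"eval_loss:  {loss_from_perplexity(13.53):.3f}")
```

Because the mapping is exponential, small changes in the loss translate into noticeably larger changes in perplexity.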


#### Start training

In [26]:
# run = exp.start_logging(name="run-test1")                   # start interactive run
# # run = Run(exp)
# print(run.get_portal_url())                 # get link to studio

# # training
# trainer.train()

# run.complete()

# with exp.start_logging() as run:
#     print(run.get_portal_url())
#     exp.autolog()
    
#     # training
#     trainer.train()



with mlflow.start_run(experiment_id=exp.id) as run:
    # Your code
    mlflow.autolog()

    # training
    trainer.train()
    



2024/03/22 20:33:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
2024/03/22 20:33:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/03/22 20:33:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for tensorflow.
2024/03/22 20:33:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for keras.
2024/03/22 20:33:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.


| Step | Training Loss |
|------|---------------|
| 10   | 8.7116 |
| 20   | 4.333  |
| 30   | 3.8921 |
| 40   | 3.6181 |
| 50   | 3.8048 |
| 60   | 3.9928 |
| 70   | 4.2886 |
| 80   | 3.7259 |
| 90   | 3.7196 |
| 100  | 3.4976 |




In [27]:
mlflow.end_run()

### 3. Evaluate

In [28]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity after fine-tuning: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity after fine-tuning: 16.51


In [29]:
# test model
s1 = tokenizer("### Human: What is capital of France? ### Assistant:", return_tensors="pt").to(model.device)
g1 = model.generate(**s1, max_length=100, do_sample=True)
o1 = tokenizer.batch_decode(g1, skip_special_tokens=True)

print(o1[0])

### Human: What is capital of France? ### Assistant: Paris### Human: what role played France in european union.
What important events were there in french revolution.
What do France has in common with Belgium?### Assistant: France played significant role in French Revolution and was the first to establish American colonies in the United States. It is one of the world's oldest countries, with a long history and rich history and cultural background that includes a rich diversity of French fries, pastries,
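
Since the model keeps generating further `### Human:` turns after answering, it can help to cut the output at the first assistant reply. A minimal sketch follows, with hypothetical helpers `build_prompt` and `extract_first_reply`; the marker strings follow the prompt format used in this notebook, and the generated text is a shortened copy of the output above.

```python
HUMAN = "### Human:"
ASSISTANT = "### Assistant:"


def build_prompt(question):
    """Format a single-turn prompt in the dataset's Human/Assistant style."""
    return f"{HUMAN} {question} {ASSISTANT}"


def extract_first_reply(generated, prompt):
    """Return the text after the prompt, cut at the next human turn if any."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    reply, _, _ = completion.partition(HUMAN)
    return reply.strip()


prompt = build_prompt("What is capital of France?")
generated = prompt + " Paris### Human: what role played France in european union."
print(extract_first_reply(generated, prompt))
```

An alternative would be to stop generation itself at the next turn marker, but trimming the decoded string keeps the example independent of any particular generation API.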
