### Objective

In this notebook, we will look at the `trl` package, which stands for **Transformer Reinforcement Learning**. RL is a reward-based training process, and we will be looking at a very straightforward way of fine-tuning …

In [1]:
import trl

In [2]:
trl.__version__

'0.9.6'

#### Dataset

We will be using a small subset of the **IMDB** dataset for this experiment: 20% of the 25,000-review training split (5,000 reviews), divided 90/10 into train and test sets.

In [3]:
# import the dataset loader from Hugging Face `datasets`
from datasets import load_dataset

In [4]:
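# load the full IMDB training split (25,000 labelled reviews)
dataset = load_dataset("imdb", split="train")
# keep a random 20% of it, then split that subset 90/10 into 'train' and 'test';
# no seed is passed, so the sampled reviews differ on every run
dataset = dataset.train_test_split(test_size=0.2)['test'].train_test_split(test_size=0.1)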
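The chained call above first keeps a random 20% of the reviews and then splits that subset 90/10. The sketch below reproduces the same idea on plain row indices, without the `datasets` library. It assumes 25,000 input rows and that the test portion is rounded up (our reading of how `train_test_split` sizes float fractions); `nested_split` is a hypothetical helper, not part of `datasets`.

```python
import math
import random

def nested_split(n_rows, first_test_frac, second_test_frac, seed=0):
    """Hypothetical helper: mimic split(first)['test'].split(second) on row indices."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # stage 1: keep only the 'test' part of the first split
    subset_size = math.ceil(n_rows * first_test_frac)
    subset = indices[:subset_size]

    # stage 2: split that subset again into train/test
    test_size = math.ceil(subset_size * second_test_frac)
    return {"train": subset[test_size:], "test": subset[:test_size]}

splits = nested_split(25_000, 0.2, 0.1)
print({name: len(rows) for name, rows in splits.items()})  # {'train': 4500, 'test': 500}
```

The sizes match the `dataset.shape` output below, and the two resulting index lists never overlap.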

In [5]:
dataset.shape

{'train': (4500, 2), 'test': (500, 2)}

#### Define Training Arguments

In [6]:
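# these batch sizes were chosen for a GPU with 12 GB of VRAM
# note: in TRL 0.9 a plain TrainingArguments like this is superseded by SFTConfig
# (a TrainingArguments subclass that also holds SFT-specific options such as
# dataset_text_field and max_seq_length); the trainer below receives an SFTConfig, not these args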

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='/Users/rbalasubramaniam/dailyResearch/trainers/output',
    push_to_hub=False,
    report_to="none",
    per_device_eval_batch_size=3,
    per_device_train_batch_size=4,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    num_train_epochs=1
)
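With these batch sizes and the 4,500/500 split, the training schedule follows from simple arithmetic. A sketch, assuming a single device, no gradient accumulation (the default of 1) and that the last partial batch is kept (`dataloader_drop_last=False`, also the default):

```python
import math

n_train, n_eval = 4500, 500        # sizes reported by dataset.shape
train_bs, eval_bs = 4, 3           # per_device_train/eval_batch_size
eval_every, epochs = 200, 1        # eval_steps and num_train_epochs

# one optimizer step per training batch
total_steps = math.ceil(n_train / train_bs) * epochs
# with eval_strategy='steps', evaluation runs when the global step is a multiple of eval_steps
eval_points = list(range(eval_every, total_steps + 1, eval_every))
# forward passes needed for one full evaluation
eval_batches = math.ceil(n_eval / eval_bs)

print(total_steps, eval_points, eval_batches)
# 1125 [200, 400, 600, 800, 1000] 167
```

So one epoch is 1,125 steps with five evaluation passes of 167 batches each, and `save_strategy='epoch'` writes a single checkpoint at the end.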

In [7]:
# inspect the first training example
dataset['train'][0]

{'text': "Any Way the Wind Blows is Tom Barmans (who is also know as front man of the rock formation 'dEUS') debut movie. Entirely shot in Antwerp (Belgium), the movie starts on a sunny friday morning and skips rather superficially between the events that fill the day of a dozen of main characters. When the movie ends, you have a lot of stuff to think about, because most of the different story-lines are left wide open.<br /><br />The movie has a (purely instrumental) sound track that will rock your socks off. In most scenes, the music truly enhances the general atmosphere and feel, really making the movie hallucinating to watch at certain points of time. The main scene in the film, the party, is very well shot.<br /><br />The director didn't hesitate to use video clip techniques, having his main characters dancing on one of the best sound tracks I've heard lately.<br /><br />The screenplay is great stuff. Camera angles and colors are very well chosen. The 'costumes' are very hot and ve

#### Create Trainer

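Here we will look at the first trainer that TRL provides, **SFTTrainer** (supervised fine-tuning). Strictly speaking, SFT is not reinforcement learning: it trains on plain next-token prediction with no reward signal, and in RLHF pipelines it is typically the step that precedes the reward-based stage. SFT trainers are very streamlined and easy to work with, but they don't give much support for customized workflows, so they offer limited flexibility.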
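To make the contrast with reward-based training concrete: SFT minimizes the ordinary next-token cross-entropy on the text itself. The following is a toy, framework-free sketch; `causal_lm_loss` is a hypothetical helper, and the hand-written probabilities stand in for the model's softmax outputs over its vocabulary.

```python
import math

def causal_lm_loss(token_ids, next_token_probs):
    """Mean negative log-likelihood of each token given the tokens before it.

    next_token_probs[t] maps candidate token ids to the model's probability
    for position t + 1, having seen token_ids[: t + 1].
    """
    nlls = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]          # the label is simply the next token
        nlls.append(-math.log(next_token_probs[t][target]))
    return sum(nlls) / len(nlls)

tokens = [5, 7, 9]
probs = [
    {7: 0.5, 9: 0.5},    # prediction after seeing [5]
    {7: 0.75, 9: 0.25},  # prediction after seeing [5, 7]
]
loss = causal_lm_loss(tokens, probs)
print(round(loss, 4))  # 1.0397
```

No reward appears anywhere: the targets come straight from the dataset, which is what makes this supervised learning. The RL-style trainers in TRL instead optimize a reward signal.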

In [12]:
# load the base model and its tokenizer, and create an SFT config
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", clean_up_tokenization_spaces=True)

sft_config = SFTConfig(output_dir="/tmp")

In [13]:
# pass every option to the constructor so SFTConfig's __post_init__ validation runs;
# attributes assigned after construction bypass those checks
sft_config = SFTConfig(
    output_dir="/Users/rbalasubramaniam/dailyResearch/trainers/output/",
    push_to_hub=False,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=3,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    num_train_epochs=1,
    dataset_text_field="text",   # column holding the raw review text
    max_seq_length=512,          # cap on the tokenized length of each example
)
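`dataset_text_field` tells SFTTrainer which column holds the training text, and `max_seq_length` caps the tokenized length. Without packing, our understanding is that longer examples are truncated rather than split across several sequences. The following sketch shows that preprocessing step; `toy_tokenize` and `prepare_example` are hypothetical, and the whitespace tokenizer stands in for OPT's real BPE tokenizer.

```python
def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one token id per whitespace-separated word."""
    vocab = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def prepare_example(example, text_field="text", max_seq_length=512):
    """Pick the configured text column, tokenize it, and truncate to max_seq_length."""
    input_ids = toy_tokenize(example[text_field])[:max_seq_length]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

review = {"text": "great movie " * 400, "label": 1}   # 800 words
features = prepare_example(review)
print(len(features["input_ids"]))  # 512
```

Note that the IMDB `label` column (the sentiment) plays no part here: SFT only ever sees the text.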

In [14]:
trainer = SFTTrainer(
    model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    args=sft_config,
    tokenizer=tokenizer,  # reuse the tokenizer loaded above; otherwise SFTTrainer loads its own
)
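When no `data_collator` is passed, TRL 0.9's SFTTrainer, to our understanding, falls back to transformers' `DataCollatorForLanguageModeling` with `mlm=False`. That collator pads each batch and copies `input_ids` into `labels`, replacing pad tokens with -100 so the loss ignores them. The pure-Python sketch below mirrors that behaviour. `collate_for_causal_lm` is hypothetical, right padding is assumed, and the pad id of 1 is our assumption about OPT's tokenizer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default
PAD_ID = 1           # assumed pad token id for the OPT tokenizer

def collate_for_causal_lm(batch, pad_token_id=PAD_ID):
    """Hypothetical collator: right-pad a list of token-id lists and build labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # labels are the inputs themselves; the model shifts them internally
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_for_causal_lm([[2, 10, 11], [2, 12]])
print(batch["labels"])  # [[2, 10, 11], [2, 12, -100]]
```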


In [15]:
trainer.train()

RuntimeError: MPS does not support cumsum_out_mps op with int64 input. Support has been added in macOS 13.3
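The error says the MPS backend on this macOS version cannot run `cumsum` on int64 tensors, and that support arrived in macOS 13.3. Besides upgrading macOS, one workaround is to train on the CPU, for example by passing `use_cpu=True` to the config; that field exists in recent transformers `TrainingArguments`, which SFTConfig inherits. Below is a small stdlib sketch of that decision. `mps_cumsum_int64_supported` is a hypothetical helper, and the 13.3 threshold is taken from the error message.

```python
import platform

MIN_MACOS = (13, 3)  # threshold quoted in the RuntimeError above

def mps_cumsum_int64_supported(mac_version):
    """Return True if a macOS version string such as '13.2.1' is at least 13.3."""
    if not mac_version:
        return False
    parts = tuple(int(p) for p in mac_version.split(".")[:2])
    parts += (0,) * (2 - len(parts))  # treat '14' as (14, 0)
    return parts >= MIN_MACOS

# platform.mac_ver() returns '' on non-macOS systems, where MPS is not used at all
mac_version = platform.mac_ver()[0]
use_cpu = bool(mac_version) and not mps_cumsum_int64_supported(mac_version)
print("fall back to CPU:", use_cpu)
# e.g. SFTConfig(..., use_cpu=use_cpu)
```

CPU training of a 350M-parameter model will be much slower, so upgrading macOS is the better fix when it is an option.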