### Objective

In this notebook, we will look at the `trl` package, whose name stands for **Transformer Reinforcement Learning**. RL is a reward-based training process, and we will be looking at a very straightforward way of fine tu

In [1]:
import trl

In [2]:
trl.__version__

'0.9.6'

#### Dataset

We will be using a very small subset of the **IMDB** dataset for this experiment.

In [3]:
# importing the libraries for accessing dataset
from datasets import load_dataset

In [4]:
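# load the IMDB training split
dataset = load_dataset("imdb", split="train")
# keep a random 20% sample of it, then split that sample 90/10 into train and test
dataset = dataset.train_test_split(test_size=0.2)['test'].train_test_split(test_size=0.1)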

In [5]:
dataset.shape

{'train': (4500, 2), 'test': (500, 2)}
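
The shapes above follow from simple arithmetic on the 25,000 reviews in IMDB's train split. A minimal sketch of that arithmetic, assuming split sizes are rounded to the nearest integer; `split_sizes` is a hypothetical helper written for illustration, not part of `datasets`:

```python
def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test split."""
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test

# IMDB's train split has 25,000 reviews; only the 20% "test" part is kept
_, subset_rows = split_sizes(25_000, 0.2)
# that subset is then split 90/10 into train and test
train_rows, test_rows = split_sizes(subset_rows, 0.1)

print(subset_rows, train_rows, test_rows)  # 5000 4500 500
```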

#### Define Training Arguments

In [6]:
# these batch sizes were chosen for a GPU with 12 GB of VRAM
# unfortunately, passing arguments this way has been deprecated; the trainer below is configured through SFTConfig instead

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='/home/ubuntu/dailyResearch/trainers/output',
    push_to_hub=False,
    report_to="none",
    per_device_eval_batch_size=3,
    per_device_train_batch_size=4,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    num_train_epochs=1
)
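
To get a feel for what these values mean in practice, the sketch below works out how many optimizer steps one epoch takes and where the step-based evaluations fall. It assumes a single GPU and no gradient accumulation, neither of which the notebook states explicitly, and the variable names are illustrative only:

```python
import math

train_rows, eval_rows = 4500, 500   # from dataset.shape above
train_batch, eval_batch = 4, 3      # per-device batch sizes
eval_every = 200                    # eval_steps

# assumption: one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_rows / train_batch)
# steps at which a step-based evaluation is triggered during the epoch
eval_points = list(range(eval_every, steps_per_epoch + 1, eval_every))
# number of batches in each evaluation pass
eval_batches = math.ceil(eval_rows / eval_batch)

print(steps_per_epoch)  # 1125
print(eval_points)      # [200, 400, 600, 800, 1000]
print(eval_batches)     # 167
```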

In [7]:
# training dataset
dataset['train'][0]

{'text': 'Råzone is an awful movie! It is so simple. It seems they tried to make a movie to show the reel life. Just like Zappa did many years ago. But unfortunately Denmark lacks good young actors. Leon are by many still the little girl in "krummernes Jul", and Laura is simply not good enough to play such an important role. several times in the movie she plays with out soul and this is destroying the movie!<br /><br />Even though i consider it a movie you ought to see. I do not agree that the youth are behaving like this, but i think it can show how it can end, if you are letting your child down. Also it is important to support danish movies and new companies like "Film folket"!<br /><br />all in all I think people should see Råzone. Not because it is a great film, but because it is a movies which is dealing with important themes. I also think it is important to point out that there are some violent scenes in it, and actually it is in these scenes, Laura is acting best. - like the end

#### Create Trainer

Here we will look at the first trainer that `trl` provides, **SFTTrainer**. SFT stands for supervised fine-tuning: the model is trained directly on example text rather than on a reward signal. Keep in mind that SFTTrainer is very streamlined. It is easy to work with, but it offers little flexibility for customized workflows.

In [8]:
# load the base model and its tokenizer, then create a default SFT config
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", clean_up_tokenization_spaces=True)
tokenizer.clean_up_tokenization_spaces = True

sft_config = SFTConfig(output_dir="/tmp")

In [9]:
# report_to is left at its default here: setting it to "none" caused an error in this setup
sft_config.output_dir="/home/ubuntu/dailyResearch/trainers/output"
sft_config.push_to_hub=False
sft_config.per_device_train_batch_size=4
sft_config.per_device_eval_batch_size=3
sft_config.eval_strategy='steps'
sft_config.eval_steps=200
sft_config.save_strategy='epoch'
sft_config.num_train_epochs=1
sft_config.dataset_text_field="text"
sft_config.max_seq_length=512
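
The same settings could also be passed when the config is constructed instead of being assigned afterwards. The sketch below is one possible way to do that, not the approach used in this notebook. It assumes the installed `trl` version accepts `dataset_text_field` and `max_seq_length` as `SFTConfig` fields, as the cell above implies; `SFT_SETTINGS` and `build_sft_config` are hypothetical names:

```python
# values copied from the cell above
SFT_SETTINGS = {
    "output_dir": "/home/ubuntu/dailyResearch/trainers/output",
    "push_to_hub": False,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 3,
    "eval_strategy": "steps",
    "eval_steps": 200,
    "save_strategy": "epoch",
    "num_train_epochs": 1,
    "dataset_text_field": "text",
    "max_seq_length": 512,
}


def build_sft_config(settings: dict = SFT_SETTINGS):
    """Build an SFTConfig in a single call (hypothetical helper)."""
    # imported lazily so the settings can be inspected without trl installed
    from trl import SFTConfig
    return SFTConfig(**settings)
```

Calling `build_sft_config()` would return a config equivalent to the one built above. Passing the values to the constructor means they are present when the config object initializes itself, whereas attributes assigned afterwards are set without that initialization step running again.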

In [10]:
trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    args=sft_config,
)

In [None]:
trainer.train()

