# Instruction tuning

In [None]:
!pip install transformers
!pip install torch
!pip install accelerate
!pip install pyarrow
!pip install datasets

Today, we are breaking up the pipeline function from transformers that we have used previously. One of the things the pipeline was doing behind the curtain was tokenising the text, but we can just as easily do that in a separate step.

Hugging Face lets us initialize our model and tokenizer with the .from_pretrained() method, which ensures that:
- we get a tokenizer that corresponds to the model architecture we want to use, and
- we download the vocabulary used when pretraining this specific checkpoint

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, max_length=250)



Now let's try to tokenise some text!

In [None]:
input_text = "My name is "

tokenized_text = tokenizer(input_text, return_tensors="pt")
tokenized_text

The output is a dictionary. Its first entry, input_ids, contains the IDs of the tokens in the vocabulary. We can check this by decoding the IDs back into words.

In [None]:
tokenizer.decode([564])

The second part is the attention mask, which is a binary mask that tells the model which tokens to pay attention to and which to ignore (remember the causal vs fully-visible attention mask?). This is useful when we have padded our input to be the same length as the longest sequence in the batch, and we want the model to ignore the padding tokens.

Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model.

Try to add padding="max_length" to the tokenizer and see what happens. (You can find all the different possible values for the argument in the documentation: https://huggingface.co/transformers/main_classes/tokenizer.html)

Truncation works in the other direction by truncating long sequences to the maximum length the model can accept. Try to insert a long sentence and add the truncation argument.

In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well.
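
To make the idea concrete, here is a minimal pure-Python sketch of what padding and truncation do to a batch of token IDs. It is a toy, not the tokenizer's actual implementation: the function name `pad_and_truncate`, the padding ID of 0 and the example IDs are all assumptions made for illustration, and it pads to the longest sequence in the batch rather than to the model's maximum length.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Turn variable-length ID lists into a rectangular batch plus attention masks."""
    # Truncation: cut every sequence down to max_length (if one is given).
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the length of the longest one.
    target_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (target_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7, 8, 9], [5, 6], [5, 6, 7]]
padded = pad_and_truncate(toy_batch, max_length=4)
print(padded["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0], [5, 6, 7, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```

Every row now has the same length, so the batch can be converted to a tensor, and the attention mask records which positions are padding.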

Now, let's move on to another part of the pipeline, which corresponds to the .generate() method. This method takes the token IDs and generates the next token IDs. We can do this in a separate step as well.
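
Conceptually, generation is a loop: predict a token, append it, and repeat until an end-of-sequence token shows up or a length limit is reached. The sketch below shows the simplest strategy, greedy decoding, where we always pick the highest-scoring token. The lookup table is a made-up stand-in for a real model, and all names here are hypothetical; .generate() supports other decoding strategies as well.

```python
def greedy_generate(next_token_fn, start_ids, eos_id, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    generated = list(start_ids)
    for _ in range(max_new_tokens):
        scores = next_token_fn(generated)      # dict: token ID -> score
        next_id = max(scores, key=scores.get)  # greedy choice
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


# Hypothetical stand-in for a model: scores depend only on the last token.
TOY_TABLE = {
    0: {3: 0.9, 4: 0.1},
    3: {4: 0.6, 1: 0.4},
    4: {1: 0.8, 3: 0.2},
}


def toy_next_token(ids):
    return TOY_TABLE[ids[-1]]


print(greedy_generate(toy_next_token, start_ids=[0], eos_id=1))  # [0, 3, 4, 1]
```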


In [None]:
output = model.generate(tokenized_text["input_ids"])
output

We then only need to decode the IDs back into words to get the generated text.

In [None]:
tokenizer.decode(output[0])

Now, let's try and use the GPU for this task! We can do this by moving the model and the inputs to the GPU using the .to() method. 

In [None]:
model = model.to("cuda")
model.generate(tokenizer(input_text, return_tensors="pt").to("cuda")["input_ids"])

We see that the device used is now cuda (the GPU) and the processing time is way faster!
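
If you are not sure whether a GPU is available (for example when running the notebook locally), a small guard like the one below avoids an error from a hard-coded "cuda". This is just a sketch; the variable name `device` is my own choice.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# The model and its inputs must live on the same device, e.g.:
# model = model.to(device)
# inputs = tokenizer(input_text, return_tensors="pt").to(device)
```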

Task
- Make your own function that works like the pipeline, but using the tokenization and generation steps we just saw
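
If you get stuck, here is one possible shape for such a function. Treat it as a sketch rather than the solution: the name `generate_text` and its signature are hypothetical, and it expects the tokenizer and model to be passed in as arguments. Try writing your own version first!

```python
def generate_text(text, tokenizer, model, **generate_kwargs):
    """Tokenize `text`, generate with `model`, and decode the result to a string."""
    # Tokenize and move the tensors to whichever device the model is on.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # Unpacking the encoding passes input_ids and attention_mask together.
    output_ids = model.generate(**inputs, **generate_kwargs)
    # generate() returns a batch; take the first sequence and drop special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded earlier, it could be called as `generate_text("My name is ", tokenizer, model)`. Using `model.device` means the same function works whether the model sits on the CPU or the GPU.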

## Machine translation

This week, we will attempt machine translation.

The dataset is [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100), which contains translation pairs covering over 100 languages. I chose the Danish and English translation pairs because that makes it easier for me to evaluate the quality of the translations, but feel free to choose a different language pair if you prefer. You can see the available language pairs in the "Subset" part of the dataset viewer in the link above.

We'll use Hugging Face's datasets library to load the dataset.

In [None]:
from datasets import load_dataset

ds = load_dataset("Helsinki-NLP/opus-100", "da-en", split='train[:1%]')

In [None]:
ds

In [None]:
ds["translation"]

The translation pairs are nested in the translation column, so we need to flatten the dataset to get the source and target language in separate columns.

In [None]:
def unpack_cols(row):
    row["en"] = row["translation"]["en"]
    row["da"] = row["translation"]["da"]
    return row

train = ds.map(unpack_cols, remove_columns=["translation"])
train

In [None]:
train[150]

Now, try to pick a few sentences and see how well the model translates them out of the box, without any additional help: just give the source language sentence as input and let the model generate the translation.

In [None]:
your_pipeline_function(train[150]['en'])


- Did it work? If not, why?

If zero-shot prompting didn't work,

- try one-shot and
- few-shot prompting, to see if providing context helps the model generate better translations.
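
A few-shot prompt is just a string that shows the model some solved examples before the sentence we actually care about. Here is a small sketch of how such a prompt could be assembled. The prompt layout, the function name `build_few_shot_prompt` and the example pairs are all assumptions for illustration; in the notebook you would draw the examples from `train`.

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to Danish."):
    """Format (source, target) example pairs followed by the sentence to translate."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"Danish: {target}")
    # The final target is left open for the model to complete.
    lines.append(f"English: {query}")
    lines.append("Danish:")
    return "\n".join(lines)


# Made-up example pairs, only to show the format.
examples = [("Thank you.", "Tak."), ("Good morning.", "Godmorgen.")]
prompt = build_few_shot_prompt(examples, "Good night.")
print(prompt)
```

Passing an empty list gives a zero-shot prompt, a single pair gives a one-shot prompt, and so on.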

Now, let's try instruction tuning the model to hopefully get a better result!

The datasets library has a nice map method that we can use to apply a function to all the examples in the dataset. The map method can take a custom function, so we just need to write a function that prepares our data for the model.
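
It helps to know what the function receives when map is called with batched=True: a dictionary of columns, where each key maps to a list with one entry per row in the batch. The toy below mimics that shape with a plain dictionary so you can see it without loading anything. The rows, the instruction text and the name `add_prompt_column` are made up for illustration.

```python
# A batch is a dict of columns: one list entry per row (toy rows, not real data).
toy_batch = {
    "en": ["Thank you.", "Good morning."],
    "da": ["Tak.", "Godmorgen."],
}


def add_prompt_column(batch, instruction="Translate English to Danish: "):
    """Add a 'prompt' column by prefixing every English sentence with an instruction."""
    batch["prompt"] = [instruction + text for text in batch["en"]]
    # Returning the dict is what lets map() add the new column to the dataset.
    return batch


result = add_prompt_column(toy_batch)
print(result["prompt"])
```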

Write a preprocessing function that takes in a batch of the dataset and
- defines an instruction and prepends it to the input text
- creates a list of all input texts (hint: you can use a loop or list comprehension)
- creates a new column in the dataset called "input_ids" that contains the token ids of the input text (hint: you can use the tokenizer on the list of input texts)
- creates a list of all output texts
- creates a new column in the dataset called "labels" that contains the token ids of the output text
- returns the batch
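
One optional refinement for the "labels" column: the loss used by Hugging Face seq2seq models skips label positions set to -100, so padding IDs in the labels are commonly replaced with -100 to keep the model from being trained to predict padding. Below is a pure-Python sketch of that replacement. The helper name `mask_padding_in_labels` and the toy IDs are assumptions; in the notebook the padding ID would come from `tokenizer.pad_token_id`.

```python
def mask_padding_in_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding IDs in each label sequence so the loss skips those positions."""
    return [
        [ignore_index if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Toy label IDs, padded with 0 (assumed padding ID for this example).
toy_labels = [[71, 12, 1, 0, 0], [8, 1, 0, 0, 0]]
print(mask_padding_in_labels(toy_labels, pad_token_id=0))
# [[71, 12, 1, -100, -100], [8, 1, -100, -100, -100]]
```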

If you need a bit of help, I've started the first three steps for you.

In [None]:
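def preprocessing_func(batch):
    # TODO: write your instruction here (left empty as part of the exercise)
    instruction = ""
    input_texts = [instruction + row for row in batch["en"]]
    # No need to move anything to the GPU here: the Trainer handles device placement
    batch["input_ids"] = tokenizer(input_texts, padding="max_length", truncation=True).input_ids
    # TODO: tokenize the target texts into batch["labels"] and return the batch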

In [None]:
tokenized_train = train.map(preprocessing_func, batched=True)

In [None]:
tokenized_train = tokenized_train.remove_columns(["en", "da"])
tokenized_train

Now the data is ready for the model, so we can fine-tune it!

In [None]:
tokenized_train[0]

We then want to initialize a Trainer class.

To do this, we have to define the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional.

I have changed a few parameters, like the learning rate and weight decay, as well as setting the max number of steps (so it doesn't run for a very long time) and the logging steps (so we get updated more frequently on the loss) and the batch size (also for speed). If you want to play around with changing other parameters, you can find the full list of arguments in the documentation (https://huggingface.co/docs/transformers/en/main_classes/trainer).
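
To get a feel for how much data a run like this covers, you can multiply the number of steps by the batch size. The sketch below does that arithmetic with the step count and batch size used in this notebook. The helper name `examples_seen` is my own, and the dataset size is a placeholder, so check `len(tokenized_train)` for the real number.

```python
def examples_seen(max_steps, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Number of training examples processed over max_steps optimizer steps."""
    return max_steps * per_device_batch_size * num_devices * grad_accum_steps


seen = examples_seen(max_steps=3000, per_device_batch_size=4)
print(seen)  # 12000

# Hypothetical dataset size, only to show how to convert examples into epochs.
num_train_examples = 10_000
print(seen / num_train_examples)  # 1.2
```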

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(output_dir="./flan-t5-small-da-en",
   per_device_train_batch_size=4,
   learning_rate=1e-3,
   weight_decay=0.01,
   max_steps=3000,
   logging_steps=200,
)

trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
)

Now we're ready to train! Buckle up, this will probably take a bit of time...

In [None]:
trainer.train()

We can now save our model and load it in as a pretrained model!

In [None]:
trainer.save_model("instruct-model")

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("instruct-model").to("cuda")

You can adjust your pipeline to incorporate the new instruct_model!

In [None]:
your_pipeline_function()

Try to test the fine-tuned model on the examples from before.

- Does it perform better than before? Why/why not?
- What happens if you instruct the model to perform a different task, like summarisation or reasoning? Does the performance gain transfer? Why/why not?
- If you wanted to instruction-tune a model to be able to solve a wide variety of tasks (like ChatGPT), what kind of training data would you need?
- How would you produce that kind of data and what are the possible limitations?