<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai-winter/blob/main/1_text_simple_classification_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Models
> An introductory example on fine-tuning models

In this walkthrough, we'll explore the standard steps of fine-tuning a model, and we'll apply this towards the intuitive task of text classification.

We'll leverage the [`tweet_eval` dataset](https://huggingface.co/datasets/tweet_eval) to try to classify emotions of tweets into relevant categories.

# Initial Setup

### Install required packages
Note that this is mostly required if you're on Google Colab.

In [None]:
%%capture
! pip install transformers
! pip install datasets

### Import packages of interest

In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

from huggingface_hub import notebook_login

## Log into HuggingFace CLI
Why are we doing this? Below, we'll use our own user accounts to download datasets and upload models. Without logging in, we'd have to pass an auth token explicitly each time. That isn't hard, but logging in once streamlines things!

In [None]:
!git config --global credential.helper store

In [None]:
notebook_login()

# Load data from HuggingFace Hub, Datasets, or from disk

In this example, we'll pull from the Hugging Face Datasets repository. However, if you have your own dataset, you can use it here instead. We'll go over how to use your own datasets in future classes.

In [None]:
#Load tweet_eval dataset, emotion configuration


In [None]:
# View general structure of data


In [None]:
# Look at an example


In [None]:
# Look at labels


In [None]:
# Create id2label, label2id, and standard info from datasets


print(num_classes)
print(id2label)
label2id
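As a minimal sketch of what this cell is meant to produce, the two mappings can be built from an ordered list of class names. The list below is an assumption: it hard-codes the four `tweet_eval` emotion label names so the snippet runs without downloading anything. With the dataset loaded, you would typically read the names from the dataset's features instead (for example, `demo_ds['train'].features['label'].names`).

```python
# Assumed label names for the tweet_eval "emotion" configuration, in id order.
label_names = ["anger", "joy", "optimism", "sadness"]

num_classes = len(label_names)

# Map integer ids to human-readable names, and back again.
id2label = {idx: name for idx, name in enumerate(label_names)}
label2id = {name: idx for idx, name in id2label.items()}

print(num_classes)
print(id2label)
print(label2id)
```

These dictionaries are useful later: passing them to the model configuration lets predictions be reported as label names rather than bare integers.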

# Pre-process inputs
We've already learned about tokenizers, so let's see what this looks like as we approach training. A richer treatment of tokenizers can be found in the Hugging Face [instructions on tokenizers](https://huggingface.co/course/chapter2/4?fw=pt). Then, let's try it on our own!

In [None]:
#instantiate tokenizer


In [None]:
#define tokenizing function


In [None]:
#do the tokenizing using map function
tokenized_ds = demo_ds.map(tokenize_inputs, batched=True, remove_columns=['text'])

In [None]:
tokenized_ds

## An aside on tokenizer functionality
Tokenizers offer many utilities for tokenizing and processing our data. Let's check out these outputs further.

In [None]:
#check out input IDs
print(tokenized_ds['train']['input_ids'][0])

#compare against the text
print(demo_ds['train']['text'][0])

[101, 789, 160, 1766, 1616, 1110, 170, 1205, 7727, 1113, 170, 2463, 1128, 1336, 1309, 1138, 112, 119, 11882, 11545, 119, 108, 15710, 108, 3645, 108, 3994, 102]
“Worry is a down payment on a problem you may never have'.  Joyce Meyer.  #motivation #leadership #worry


In [None]:
#check out the length of the list of lists
print(len(tokenized_ds['train']['input_ids']))

#check out the length of a single element
print(len(tokenized_ds['train']['input_ids'][0]))

3257
28


In [None]:
#convert input_ids to token representation
input0_tokens = tokenizer.convert_ids_to_tokens(tokenized_ds['train']['input_ids'][0])
print(input0_tokens)

['[CLS]', '“', 'W', '##or', '##ry', 'is', 'a', 'down', 'payment', 'on', 'a', 'problem', 'you', 'may', 'never', 'have', "'", '.', 'Joyce', 'Meyer', '.', '#', 'motivation', '#', 'leadership', '#', 'worry', '[SEP]']
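
The split of `Worry` into `W`, `##or`, `##ry` above is typical of WordPiece-style tokenizers, which break a word into the longest vocabulary pieces they can find, working left to right. The following is a simplified sketch of that idea, not the library's implementation; `wordpiece_split` and `toy_vocab` are hypothetical names and the vocabulary is made up for illustration.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
toy_vocab = {"W", "##or", "##ry", "is", "worry"}

print(wordpiece_split("Worry", toy_vocab))
print(wordpiece_split("worry", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Note that the cased toy vocabulary treats `Worry` and `worry` differently, which mirrors the output above: the capitalized word is split into pieces while the lowercase hashtag word is a single token.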


In [None]:
#see what this looks like as a string
print(tokenizer.convert_tokens_to_string(input0_tokens))

#another method directly from the input ids
tokenizer.decode(tokenized_ds['train']['input_ids'][0])

[CLS] “ Worry is a down payment on a problem you may never have '. Joyce Meyer. # motivation # leadership # worry [SEP]


"[CLS] “ Worry is a down payment on a problem you may never have '. Joyce Meyer. # motivation # leadership # worry [SEP]"

In [None]:
#other information about tokenizer
print(tokenizer.vocab_size)

#see actual tokenizer vocab (we've abbreviated here)
#tokenizer.vocab
pd.DataFrame({'tokens': tokenizer.vocab.keys(), 'inds': tokenizer.vocab.values()}).set_index('inds').head(10)

28996


| inds | tokens |
|---|---|
| 15555 | Scene |
| 18102 | Auxiliary |
| 27474 | ##lanche |
| 318 | Ś |
| 12117 | Genesis |
| 6214 | corporate |
| 15685 | erect |
| 28478 | ##ث |
| 26174 | ##taining |
| 5331 | Ty |


## An aside on dynamically padded batch size
HF can dynamically pad your batches so that each input is padded only to the length of the longest input in its batch, rather than to a single global length. This saves memory and computation. You can learn more [here](https://huggingface.co/course/chapter3/2?fw=pt). For now, we'll simply instantiate a data collator and use it during training to demonstrate how we can do this.
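
To make the idea concrete, here is a plain-Python sketch of per-batch padding. It only illustrates the concept and is not how the library implements it; `pad_batch`, the pad id of `0`, and the token ids are all assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids of different lengths.
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]

# Two batches of two sequences each: every batch gets its own padded length.
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
padded = [pad_batch(b) for b in batches]

for p in padded:
    print(len(p["input_ids"][0]), p["attention_mask"])
```

The first batch is padded to length 5 and the second to length 8. Padding everything to the global maximum would instead make all four sequences length 8.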

In [None]:
#Instantiate data collator


# Model Training Preparation

## Define model and task architecture

In [None]:
# Choose the model type and instantiate it for the task


## Consideration of appropriate metrics

What are good metrics for us to use for classification?

### From HF Datasets Metrics
Some metrics are available to us through [HF Datasets Metrics repo](https://huggingface.co/metrics).

In [None]:
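#load a metric (note: `load_metric` is deprecated in newer `datasets` releases in favor of the `evaluate` library)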
metric = load_metric("accuracy")

#define the metric behavior
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

### Custom definitions
We can also define our own. The function inputs are a tuple of logits and labels, and the function must return a dictionary of key-value pairs. The keys should be the name of the metric and the values should be the values of that metric. We'll see an example below.

In [None]:
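from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    # Macro-averaged scores; the 'fscore' key matches metric_for_best_model below
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, fscore, _ = precision_recall_fscore_support(labels, predictions, average='macro')
    return {'precision': precision, 'recall': recall, 'fscore': fscore}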

## Settings for Model Training
Now, let's set some parameters that will govern the training loop. These include practical considerations such as:
* Where the model should be saved
* Whether the model should be pushed to hub
* How often to assess the performance of the model on the validation set

There are also settings for neural network training, including:
* Number of epochs to train
* Learning rate
* Optimizer parameters

We do this through [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and the [Trainer class](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer). Let's take a look!

In [None]:
#set new training arguments
training_args = TrainingArguments("bert-emotion",
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  load_best_model_at_end = True,
                                  metric_for_best_model='fscore',
                                  greater_is_better=True,
                                  per_device_train_batch_size = 4,
                                  per_device_eval_batch_size = 4,
                                  num_train_epochs=3,
                                  push_to_hub=True,
                                  hub_strategy='checkpoint',
                                  report_to='all')

#set data and functionality for trainer
trainer = Trainer(model=model,
                  args=training_args,
                  tokenizer=tokenizer,
                  data_collator=data_collator,
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['validation'],
                  compute_metrics=compute_metrics)
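
For a rough sense of scale, the number of optimizer steps implied by these settings can be worked out by hand. This is back-of-the-envelope arithmetic under stated assumptions (a single device, no gradient accumulation, and the last partial batch kept), not a value read from the trainer.

```python
import math

num_train_examples = 3257   # size of the train split printed earlier
per_device_batch_size = 4   # per_device_train_batch_size above
num_epochs = 3              # num_train_epochs above

# Assumes a single device, no gradient accumulation, and a kept final partial batch.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 815
print(total_steps)      # 2445
```

With the per-epoch strategies chosen above, logging, evaluation, and checkpointing each happen once per epoch, so three times over the run.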

# Train model
Now, let's actually train the model!

In [None]:
#actually train the model


In [None]:
#it's recommended to push the final version to HF after training completes.
#Note that the code below takes FOREVER depending on the size of your model so you might consider NOT running
#this line until the end of class
#trainer.push_to_hub(commit_message='end of training 3 epochs')

# Using trained model with `Trainer`


## Evaluate
We can assess the performance of the model over a large number of inputs (e.g., the test set). Here, we initially look at performance on the training set to make sure the model _can_ learn from the data we've provided.

In [None]:
#run model evaluation on train dataset


## Predict
We can also use the model to predict and have the actual logits returned to us. This is helpful if we want to better inspect the performance of the model to consider consistent reasons for misclassifications and ideas on how to improve the performance of our model.
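
Logits are unnormalized scores, so it can help to convert them into probabilities before inspecting them. Below is a minimal NumPy sketch of a softmax followed by an argmax. The logits are made-up values for three inputs and four classes; with real predictions you would apply the same steps to the logits array returned by the trainer.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for three inputs and four classes.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [-0.3, 1.2, 3.1, 0.4]])

probs = softmax(logits)
pred_ids = probs.argmax(axis=-1)   # predicted class id per input
confidence = probs.max(axis=-1)    # probability assigned to that class

print(pred_ids)
print(confidence.round(3))
```

Low-confidence predictions, such as the second row where every class is equally likely, are often a good place to start when looking for patterns in misclassifications.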

In [None]:
#use trainer to predict


In [None]:
#we'll build a confusion matrix, so import the tools for it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Create a dataframe for inspection
preds_df = pd.DataFrame({'pred_ids':np.argmax(preds.predictions, axis=-1),
                         'label_ids':preds.label_ids,
                         'text':demo_ds['train']['text']})
display(preds_df.head())

# Populate pred_labels
preds_df['pred_labels'] = preds_df['pred_ids'].replace(id2label)
preds_df['true_labels'] = preds_df['label_ids'].replace(id2label)

# Define misclassified
preds_df['is_misclassified'] = preds_df['pred_ids'] != preds_df['label_ids']
display(preds_df.query('is_misclassified == True'))

# Get confusion matrix
ConfusionMatrixDisplay.from_predictions(preds_df['true_labels'], preds_df['pred_labels'])

In [None]:
#an example of inspecting the results to see examples of incorrect labels
preds_df.query("true_labels=='joy' and pred_labels=='anger'")['text'].tolist()

# Using your fine-tuned model
You can use the model that you've saved locally or the model that you've pushed to hub within a pipeline. Let's see how this works.

In [None]:
#create pipeline from your classifier


#optionally, load from HF
#emotion_classifier = pipeline('text-classification', model='charreaubell/bert-emotion', use_auth_token=True)

#get output



In [None]:
#inspect results


# Homework
Now that you've learned the essentials of training, let's take a moment to reflect on what we've learned, augment our knowledge, and avoid some known pitfalls.

In [None]:
#@title Tokenizers: Verifying your understanding

#@markdown Run all the cells in the `An aside on tokenizer functionality`
#@markdown section and confirm that you understand the encoding and decoding
#@markdown functions.

In [None]:
#@title Updating our tokenization function
#@markdown Using the HF API, HF course, and Tunstall text,
#@markdown determine how you would pad each input during tokenization.
#@markdown What methods of padding and truncation are available?

In [None]:
#@title Expanding our knowledge of Datasets `map` method
#@markdown What if we wanted to remove all html from our
#@markdown data prior to tokenization? We can do this with the `map`
#@markdown method of Datasets. Use the following resource from the
#@markdown [HuggingFace Course](https://huggingface.co/course/chapter5/3?fw=pt#the-map-methods-superpowers) to understand how one might
#@markdown go about doing this. Implement this here.

## Training Arguments and the Trainer Class

In [None]:
#@title Model metrics
#@markdown Let's say that we wanted to see the precision and recall
#@markdown for each individual class, rather than only the `macro` averages
#@markdown computed by the `compute_metrics` function we wrote above.

#@markdown Using the same sklearn functions (or not, but sklearn may make it easier),
#@markdown return the precision and recall for each individual class label in addition
#@markdown to the macro scores. Recall that what is returned from the `compute_metrics`
#@markdown function must be a dictionary.

In [None]:
#@title TrainingArguments parameters
#@markdown We've logged, saved, and evaluated at each epoch. However,
#@markdown if we have an extremely large dataset, seeing one or more of these
#@markdown at the end of each epoch (e.g., if it takes 3 hours to make it through
#@markdown a single epoch) may conflict with our desire to monitor our model training.

#@markdown 1. Using the TrainingArguments API, change your model to log, evaluate, and save
#@markdown every 200 steps rather than every epoch.
#@markdown 2. How does this change your checkpointing directories?
#@markdown 3. How does this influence the intervals of evaluation?

## Your research project
You've successfully trained a model - great job!! Now, let's focus on what YOU need to do for your task. What is the task that best describes what you're after? Using the [Transformer Notebooks](https://huggingface.co/docs/transformers/notebooks) and their `Open in Colab` badges, explore what this task looks like. Note that even if your modality is different, you may still be able to use these notebooks directly with a few changes!