# Fine-tuning a HuggingFace model

## Code Preamble

In [220]:
import evaluate
import numpy as np
import pandas as pd
import re
from collections import Counter  # used later to count correct predictions

from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    pipeline,
    TrainingArguments, 
    Trainer
)

## Fine-Tuning

- A common use of LLMs is to leverage their **generalized** linguistic capacities by fine-tuning them for a **particular** task
- For instance: We could take an LLM and train it to... 
    - classify text sequences
    - classify tokens
    - produce dialogue
    - answer questions
    - etc etc etc

## Author Attribution

- The task I want to train the model on is identifying the author of a text
    - This is known as "author attribution"
    - E.g. Italian computer scientists tried to identify Elena Ferrante by comparing her work with that of known Italian authors and journalists
- We'll be using one of the few author attribution datasets on Huggingface
    - It uses text from 13 journalists at the Guardian
- We can find the [data here](https://huggingface.co/datasets/guardian_authorship)

We load it by calling ```load_dataset```. The function needs the name of the dataset on the Hub and the name of the configuration (subset) we want:

In [224]:
dataset = load_dataset('guardian_authorship', 'cross_genre_1')

In [252]:
dataset['train'][50]

{'author': 9,
 'topic': 2,
 'article': 'We had a traditional Christmas. The pilot light went out in the boiler. My mother was bad-tempered. The giblets made the dog sick. Sheffield Wednesday lost. But one unexpected incident shone more brightly than the star in the east. Princess Anne\'s behaviour immediately after she had celebrated the birth of the Prince of Peace made it the most memorable season of goodwill for years. I offer her royal highness my humble congratulations and hope it is not lese-majeste on my part to add: "Keep it up, Ma\'am. Keep it up." Endpiece readers may not be familiar with what happened outside Sandringham Church - and may find the details of the story difficult to believe even when they hear them. But they were reported in the tabloid newspapers and are therefore beyond dispute. A 75-year-old lady called Mrs Halfpenny made a basket - plaiting the wicker with her own hands - and filled it with flowers. \n',
 'label': 12,
 '__index_level_0__': 95}

There are some issues with the data, so I wrote a quick script to fix it.
- Merge the train, test and validation splits into one pandas DataFrame
- Create a new ```Dataset``` from it
- Do my own ```train_test_split```

Most often this kind of cleanup is **not** needed!

In [225]:
def fix_guardian_data(dataset):
    # Add a label column to the data
    dataset['train'] = dataset['train'].add_column("label", dataset['train']['author'])
    dataset['test'] = dataset['test'].add_column("label", dataset['test']['author'])
    dataset['validation'] = dataset['validation'].add_column("label", dataset['validation']['author'])
    
    # We want to do our own test-train split
    # To do this, I first make the data into one big dataframe
    train_df = pd.DataFrame(dataset['train'])
    test_df = pd.DataFrame(dataset['test'])
    val_df = pd.DataFrame(dataset['validation'])
    all_data = pd.concat([train_df, test_df, val_df])
    
    # Now I create a Huggingface dataset from that dataframe
    dataset = Dataset.from_pandas(all_data)
    
    # Cast 'label' to a ClassLabel feature, which stratify_by_column requires
    dataset = dataset.class_encode_column("label")
    
    # Then I take the train_test_split. I want 20% of the data to be in the test set
    dataset = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
    return dataset

dataset = fix_guardian_data(dataset)

Stringifying the column:   0%|          | 0/444 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/444 [00:00<?, ? examples/s]

In [226]:
dataset

DatasetDict({
    train: Dataset({
        features: ['author', 'topic', 'article', 'label', '__index_level_0__'],
        num_rows: 355
    })
    test: Dataset({
        features: ['author', 'topic', 'article', 'label', '__index_level_0__'],
        num_rows: 89
    })
})
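The split sizes follow from ```test_size=0.2``` applied to the 444 merged articles. A quick check of the arithmetic, assuming the test share is rounded up to a whole number of rows (which is consistent with the 355/89 counts shown above):

```python
import math

n_total = 444     # merged rows, as shown in the progress bars above
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test share is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_train, n_test)
```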

In [227]:
list(set(dataset['train']['label']))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
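One thing to be aware of: the new ```label``` ids are not necessarily equal to the original ```author``` ids. The sample printed earlier has ```'author': 9``` but ```'label': 12```. This is consistent with the "Stringifying the column" step: if the integer ids are turned into strings and the class names are then sorted as strings, ```'9'``` ends up last. The sketch below reproduces that mapping in plain Python; the sorting behaviour is an assumption inferred from the output, and the variable names are made up for illustration.

```python
# Author ids as they appear in the raw 'author' column
author_ids = list(range(13))

# Assumption: class names are the ids converted to strings, sorted as strings
class_names = sorted(str(a) for a in author_ids)

# New label id = position of the author's name in that sorted list
author_to_label = {int(name): idx for idx, name in enumerate(class_names)}

print(class_names)
print(author_to_label[9])
```

Because the labels are just a relabelling of the same 13 authors, this does not affect training, but it matters when mapping predictions back to authors.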

Now we tokenize, as always for NLP.
- Different LLMs use different tokenizers.
- Like the model, the tokenizer needs to know where on the Huggingface Hub to find its vocabulary and configuration
- We can use the ```AutoTokenizer``` class instead of picking a particular tokenizer class ourselves

We will be using DistilBERT, a smaller and more nimble version of BERT.

In [256]:
model_type = "distilbert-base-uncased"

In [257]:
tokenizer = AutoTokenizer.from_pretrained(model_type)

def tokenize_function(examples):
    return tokenizer(examples["article"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

Map:   0%|          | 0/89 [00:00<?, ? examples/s]
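To make ```padding="max_length"``` and ```truncation=True``` concrete, here is a toy sketch of what they do to a list of token ids. This is not the real tokenizer: the function ```pad_or_truncate```, the token ids and the small ```max_length``` are made up for illustration, and the real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True."""
    # Truncation: drop everything beyond max_length
    ids = token_ids[:max_length]
    # Attention mask: 1 for real tokens, 0 for padding
    mask = [1] * len(ids)
    # Padding: fill up to max_length with the pad id
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }

# A short sequence gets padded, a long one gets cut
short = pad_or_truncate([101, 2057, 2018, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the trainer stack them into batches.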

Next we set our hyperparameters:

In [258]:
batch_size = 8
epochs = 10
weight_decay = 0.01
learning_rate = 1e-5

We feed most hyperparameters to the [```TrainingArguments``` class](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments)

In [259]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,          # use the hyperparameters set above
    weight_decay=weight_decay,
    learning_rate=learning_rate,
    load_best_model_at_end=True       # reload the best checkpoint when training ends
)

Now we can specify our model.

In [234]:
model = AutoModelForSequenceClassification.from_pretrained(model_type, num_labels=13)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


When we train, we want to keep track of the model's performance. For this we give the ```Trainer``` a function that takes the evaluation predictions (logits and true labels) and returns a dictionary of metrics. We can use the ```evaluate``` library and write a function around it.

In [230]:
metric = evaluate.load("accuracy")

In [231]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
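To see what ```compute_metrics``` does with its input, the sketch below runs the same ```argmax``` step on a handful of made-up logits and computes the accuracy by hand with ```numpy```, without going through the ```evaluate``` library. The logits and labels are hypothetical, and only 3 classes are used to keep it readable.

```python
import numpy as np

# Hypothetical logits for 4 articles and 3 authors (the real task has 13)
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [-0.5, 0.9, 0.4],
    [1.2, 1.1, 0.0],
])
labels = np.array([0, 2, 2, 0])

# Same step as in compute_metrics: the predicted class is the largest logit
predictions = np.argmax(logits, axis=-1)

# Accuracy by hand: share of predictions that match the labels
accuracy = float((predictions == labels).mean())
print(predictions, {"accuracy": accuracy})
```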

We can also create so called "callbacks". 
- These are objects that customize the training loop
- Some of the default ones have [their own classes in HuggingFace](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/callback).

In our case, we want training to stop if the model hasn't improved for 3 consecutive evaluations (here: epochs).

In [248]:
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3)
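The idea behind the patience setting can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual ```EarlyStoppingCallback``` code: the function ```early_stop_epoch``` and the loss values are made up, and it assumes that the monitored metric is a loss where lower is better.

```python
def early_stop_epoch(losses, patience=3):
    """Return the epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: no improvement after epoch 2
stalling = [2.5, 2.4, 2.45, 2.41, 2.43, 2.3]
# Hypothetical validation losses that keep improving
improving = [2.5, 2.4, 2.3, 2.2, 2.1]

print(early_stop_epoch(stalling))
print(early_stop_epoch(improving))
```

In the first run the loss fails to beat its best value three times in a row, so training would stop before the later improvement is ever seen. That is the trade-off of a small patience value.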

Finally, we create a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
- This is HuggingFace's own training-loop class; it plays a role similar to the ```Trainer``` in [```PyTorch Lightning```](https://lightning.ai/docs/pytorch/stable/common/trainer.html)
    - The Lightning one is used in many other libraries, like TorchGeo
- Given an instance of a model class, it does the whole job of running the forward and backward passes

In [235]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42),
    compute_metrics=compute_metrics,
    callbacks = [early_stopping_callback]
)

Now we just call ```train()```, and the ```Trainer``` runs the training loop we would otherwise write by hand in ```PyTorch```!

In [236]:
trainer.train()



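| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 2.535404 | 0.168539 |
| 2 | No log | 2.484488 | 0.202247 |
| 3 | No log | 2.38249 | 0.348315 |
| 4 | No log | 2.289948 | 0.393258 |
| 5 | No log | 2.209553 | 0.382022 |
| 6 | No log | 2.15697 | 0.393258 |
| 7 | No log | 2.056545 | 0.505618 |
| 8 | No log | 1.997387 | 0.550562 |
| 9 | No log | 1.975646 | 0.561798 |
| 10 | No log | 1.968134 | 0.550562 |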


TrainOutput(global_step=450, training_loss=2.111922064887153, metrics={'train_runtime': 3744.6896, 'train_samples_per_second': 0.948, 'train_steps_per_second': 0.12, 'total_flos': 470351515699200.0, 'train_loss': 2.111922064887153, 'epoch': 10.0})
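The numbers in ```TrainOutput``` can be sanity-checked against the settings used above. The sketch below assumes training on a single device without gradient accumulation, so that one step is one batch of 8 articles.

```python
import math

n_train = 355               # rows in the training split
batch_size = 8
epochs = 10
train_runtime = 3744.6896   # seconds, copied from TrainOutput

# The last batch of each epoch is smaller, so we round up
steps_per_epoch = math.ceil(n_train / batch_size)
global_step = steps_per_epoch * epochs

samples_per_second = n_train * epochs / train_runtime
steps_per_second = global_step / train_runtime

print(global_step, round(samples_per_second, 3), round(steps_per_second, 2))
```

These match the reported ```global_step```, ```train_samples_per_second``` and ```train_steps_per_second```, so all 10 epochs ran and early stopping was never triggered.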

The model has finished training! 
- Now we can use it in a ```Huggingface``` pipeline

In [238]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

The pipeline outputs probabilities, so there is no need to work with logits:

In [239]:
pipe(tokenized_datasets["test"][50]['article'][:512])

[{'label': 'LABEL_3', 'score': 0.1491585373878479}]
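For a classifier with several labels, the score is what you get from applying a softmax to the logits and taking the largest value. Here is a small sketch of that step with ```numpy```; the logits are made up and the ```softmax``` helper is written out by hand for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for the 13 authors
logits = np.array([0.1, -0.3, 0.2, 0.9, 0.0, 0.4, -0.1, 0.3, 0.2, -0.2, 0.1, 0.5, 0.0])
probs = softmax(logits)

# Predicted label and its score, as the pipeline would report them
print(int(np.argmax(probs)), float(probs.max()))

# Score of a uniform guess over 13 authors, for comparison
print(1 / 13)
```

With 13 authors a uniform guess scores about 0.077, so the score of roughly 0.149 above means the model prefers ```LABEL_3``` but is far from certain.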

Now we'll compare the model's predictions on the test set with the true labels.
- We'll check if ```pred_lab == real_lab``` and count how many times it's ```True```.
    - This is our count for how many times we predicted correctly :)

In [246]:
correct = []
for idx in range(len(tokenized_datasets["test"])):
    # We pull the predicted label from each prediction
    # The model can only handle 512 tokens, so we crudely cut the text
    # to its first 512 characters (note: characters, not tokens)
    pred_lab = pipe(tokenized_datasets["test"][idx]['article'][:512])[0]['label']
    
    # The pipeline outputs strings like 'LABEL_3'.
    # We pull the number from it and turn it into an integer
    pred_lab = int(re.findall(r'\d+', pred_lab)[0])
    
    # We get the real label from the test data itself
    real_lab = tokenized_datasets["test"][idx]['label']
    
    # Now we compare and append to a list
    correct.append(pred_lab == real_lab)

In [247]:
Counter(correct)

Counter({False: 66, True: 23})
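Turning these counts into an accuracy is one line of arithmetic:

```python
from collections import Counter

# Counts copied from the output above
counts = Counter({False: 66, True: 23})

accuracy = counts[True] / sum(counts.values())
print(round(accuracy, 3))
```

This is clearly lower than the accuracy of about 0.55 reported at the end of training. One plausible contributor, which is not verified here, is that the loop only passes the first 512 characters of each article to the pipeline, while during training the model saw up to 512 tokens.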

We could do more analysis. For example:
- Which authors did the model struggle with?
- Which did it predict confidently?
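A possible starting point for the first question is a per-author accuracy. The sketch below assumes we stored every ```pred_lab``` and ```real_lab``` in two lists instead of only the comparison; the helper ```per_author_accuracy``` and the example values are hypothetical.

```python
from collections import defaultdict

def per_author_accuracy(predicted, actual):
    """Return {author_label: share of that author's articles predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, real in zip(predicted, actual):
        totals[real] += 1
        hits[real] += int(pred == real)
    return {author: hits[author] / totals[author] for author in sorted(totals)}

# Hypothetical predictions and true labels for three authors
predicted = [0, 0, 1, 2, 2, 2, 1]
actual = [0, 1, 1, 2, 2, 0, 1]

print(per_author_accuracy(predicted, actual))
```

Authors with a low share are the ones the model struggles with. With only 89 test articles spread over 13 authors, each share rests on very few examples, so the numbers should be read with care.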

Further work could include:
- Pull out the model weights to see if there are specific words that are important for predicting specific authors
- Test if we can deceive the model by performing style transfer