# Transfer Learning with LLMs

Notebook by: Samson Bakos

Based on the documentation available at: https://huggingface.co

## Example 1: Text Classification

We're going to do this using HF's default objects for simplicity, but you can also do it in TensorFlow/Keras or PyTorch.

#### Load the [Yelp Reviews Dataset](https://huggingface.co/datasets/yelp_review_full)

In [1]:
from datasets import load_dataset

# load dataset 
dataset = load_dataset("yelp_review_full")

# print example from training set
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

Each example in this dataset includes a text review from Yelp, along with a star rating (1-5, mapped to labels 0-4).

Our task is to predict the star rating given the review (5-class classification).

#### Preprocess Text

In [2]:
from transformers import AutoTokenizer

# use the tokenizer that ships with the pretrained checkpoint
# important so the input matches what the model saw in pretraining (same vocabulary, lowercasing, subword splitting)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # function to pass to .map()
    # padding and truncation handle variable-length sequences
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# apply to every split with .map(), a built-in method of HF datasets objects
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
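
To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a toy sketch of the idea in plain Python. It is not the real tokenizer: `pad_or_truncate` is a hypothetical helper, the token ids are made-up integers, `pad_id=0` is an assumption, and the real thing also handles special tokens and the vocabulary lookup.

```python
# Toy illustration only: a real tokenizer builds ids from a vocabulary and
# manages special tokens. The ids below are made-up integers.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: cut sequences that are too long
    n_pad = max_length - len(kept)       # padding: fill sequences that are too short
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 2307, 2833, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out the same length, which is what lets the examples be stacked into fixed-size batches. It is also where the `input_ids` and `attention_mask` columns in the tokenized dataset come from.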

This is slow as heck (unless the result is already cached on disk), and that was just loading and preprocessing.

Let's take a reduced subset of the data to speed things up for demo purposes.

In [3]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

#### Train

Load DistilBERT itself.

This will throw a warning, but it's fine.

It's basically just telling us that the classification layers on top are freshly initialized, so the model isn't trained on a specific task yet.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We have some setup to do before we start, but there are pretty useful premade functions for us here.

There are A LOT of [hyperparameters for training](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

Some settings matter more than others: number of epochs, optimizer (e.g. Adam, SGD), learning rate, loss function, etc. Reasonable values are built in as defaults, and a lot of the parameters are minor, so we're mostly going to leave this alone.

The only things we'll specify are our output directory and that we want to see evaluation results every epoch.

In [8]:
from transformers import TrainingArguments, Trainer

# note: newer transformers releases rename evaluation_strategy to eval_strategy
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

By default, the model is not evaluated during training, and the loss it optimizes isn't accuracy (it's something like cross-entropy). We need to pass our Trainer an evaluation function so we have an interpretable way to see how we're doing.

In [10]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred # raw outputs, actual labels
    predictions = np.argmax(logits, axis=-1) # predicted class is the index of the largest logit
    return metric.compute(predictions=predictions, references=labels) # accuracy computation
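
To see why the training loss and the accuracy metric are different things, here is a small stdlib-only sketch. The logits are made-up numbers for three imaginary reviews, and `softmax`, `cross_entropy`, and `argmax` are hand-written helpers for illustration, not the functions the Trainer uses internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits over the five star classes
batch_logits = [
    [2.0, 0.5, 0.1, -1.0, -2.0],   # confident and correct
    [0.2, 0.3, 0.1, 0.0, -0.1],    # correct, but barely
    [0.1, 0.2, 1.5, 0.3, 0.0],     # predicts class 2, true class is 3
]
batch_labels = [0, 1, 3]

predictions = [argmax(row) for row in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)
losses = [cross_entropy(row, y) for row, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)

print("predictions:", predictions)
print("accuracy:", accuracy)
print("per-example loss:", losses)
```

The first two examples both count as correct for accuracy, but the second has a much higher loss because the model was barely sure. That is the extra signal the loss gives the optimizer, and why accuracy is the easier number to read.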

Build the actual Trainer object

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Run it!

This is slow even with a GPU, a smaller BERT model, and a smaller dataset, but that's the cost of doing business with huge models.

Bear in mind that with this setup, the model will use your GPU by default if you have one (it's using 'mps' on my Apple M1). If you don't have one, this will be even slower.

These models are more often run on external cloud compute or distributed systems where possible.

In [12]:
trainer.train() 

  0%|          | 0/375 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 1.155847191810608, 'eval_accuracy': 0.493, 'eval_runtime': 69.0357, 'eval_samples_per_second': 14.485, 'eval_steps_per_second': 1.811, 'epoch': 1.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.98539799451828, 'eval_accuracy': 0.579, 'eval_runtime': 82.9272, 'eval_samples_per_second': 12.059, 'eval_steps_per_second': 1.507, 'epoch': 2.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.9947499632835388, 'eval_accuracy': 0.588, 'eval_runtime': 76.4124, 'eval_samples_per_second': 13.087, 'eval_steps_per_second': 1.636, 'epoch': 3.0}
{'train_runtime': 904.7474, 'train_samples_per_second': 3.316, 'train_steps_per_second': 0.414, 'train_loss': 0.9811983235677083, 'epoch': 3.0}


TrainOutput(global_step=375, training_loss=0.9811983235677083, metrics={'train_runtime': 904.7474, 'train_samples_per_second': 3.316, 'train_steps_per_second': 0.414, 'train_loss': 0.9811983235677083, 'epoch': 3.0})
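
The step counts in the progress bars can be reproduced with a bit of arithmetic. This sketch assumes the `TrainingArguments` defaults we left untouched (a per-device batch size of 8 for both training and evaluation, 3 epochs) and a single device; the runtime is copied from the log above.

```python
import math

num_train_examples = 1000   # rows in small_train_dataset
num_eval_examples = 1000    # rows in small_eval_dataset
batch_size = 8              # assumption: default per-device batch size, one device
num_epochs = 3              # assumption: default num_train_epochs
train_runtime = 904.7474    # seconds, from the training log

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(num_eval_examples / batch_size)
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print("training steps per epoch:", steps_per_epoch)
print("total training steps:", total_steps)
print("evaluation steps per epoch:", eval_steps)
print("train samples per second:", round(samples_per_second, 3))
print("train steps per second:", round(steps_per_second, 3))
```

These line up with the 375-step training bar, the 125-step evaluation bars, and the throughput figures reported in `TrainOutput`.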

RESULTS:

Time: about 15 minutes (`train_runtime` of roughly 905 seconds)

Evaluation accuracy per epoch:

Epoch 1: 0.493

Epoch 2: 0.579

Epoch 3: 0.588

This hasn't fully converged yet, but it's starting to slow down: accuracy only moved from 0.579 to 0.588 in the last epoch, and the evaluation loss ticked up slightly.
- Could go further, but your computer might be heating up at this point

Close only counts in horseshoes and hand grenades - we're not rewarding the model for guessing 2/5 when the actual score is 1/5. As far as accuracy is concerned, that's just as wrong as guessing 5/5.
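
To illustrate the point, here is a small sketch with made-up labels comparing accuracy against mean absolute error (MAE), a metric that does give credit for near misses. MAE is not something this notebook uses; it is only here for contrast, and `accuracy` and `mean_absolute_error` are hand-written helpers.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the label exactly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mean_absolute_error(preds, labels):
    """Average distance, in stars, between prediction and label."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

# Made-up labels on the 0-4 scale
true_labels = [0, 0, 2, 4]
near_miss_preds = [1, 1, 3, 3]   # always off by one star
far_miss_preds = [4, 4, 0, 0]    # badly wrong every time

print("accuracy (near):", accuracy(near_miss_preds, true_labels))
print("accuracy (far): ", accuracy(far_miss_preds, true_labels))
print("MAE (near):", mean_absolute_error(near_miss_preds, true_labels))
print("MAE (far): ", mean_absolute_error(far_miss_preds, true_labels))
```

Accuracy scores both sets of predictions identically, while MAE separates them, which is the sense in which accuracy is a harsh metric for an ordered rating scale.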

100% accuracy on this task is impossible because not everyone has the same understanding of what each star rating corresponds to - there isn't perfect alignment between the text and the rating.

We've also shot ourselves in the foot with a really small training dataset.

#### Comparison

In [15]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [20]:
type(small_train_dataset['label'])

list

#### Classical ML for Comparison

TF-IDF for representation, Multinomial Naive Bayes for classification

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text):
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])
    return text

X_train = [preprocess_text(text) for text in small_train_dataset['text']]
y_train = small_train_dataset['label']

X_test = [preprocess_text(text) for text in small_eval_dataset['text']]
y_test = small_eval_dataset['label']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.438


We're still beating classical ML/NLP by a fair bit (0.588 vs. 0.438 accuracy on the same 1,000 training examples)!

We can imagine that our more complex LLM approach would be able to extract proportionally more value from a larger dataset.

Both would (probably) see increased accuracy on the full dataset, but the LLM more so.