# Transfer Learning with LLMs

Notebook by: Samson Bakos

Based on the documentation available at: https://huggingface.co

## Example 1: Text Classification

We're going to do this using Hugging Face's default objects for simplicity, but you can also do it in TensorFlow/Keras or PyTorch.

#### Load the [Yelp Reviews Dataset](https://huggingface.co/datasets/yelp_review_full)

In [1]:
from datasets import load_dataset

# load dataset 
dataset = load_dataset("yelp_review_full")

# print example from training set
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

Each example in this dataset includes a text review from Yelp, along with a star rating (1-5, mapped to labels 0-4).

Our task is to predict the star rating given the review (5-class classification).

#### Preprocess Text

In [2]:
from transformers import AutoTokenizer

# use the tokenizer that matches the pretrained model
# important so the input looks like what the model saw in pretraining (same vocabulary, subword splitting, lowercasing)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # function to map over the dataset
    # padding and truncation handle variable-length sequences
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# apply to all datasets with .map(). Built in function of the HF datasets class
tokenized_datasets = dataset.map(tokenize_function, batched=True)
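
To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a minimal sketch in plain Python. The token ids, the `pad_and_truncate` helper, and `pad_id=0` are made up for illustration, and the sketch ignores details a real tokenizer handles (such as keeping the final special token when truncating).

```python
# Toy illustration of padding="max_length" and truncation=True.
# The token ids are invented; a real tokenizer produces them from text.

def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad              # padding: fill up to max_length
    return input_ids, attention_mask

short_ids, short_mask = pad_and_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_and_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out the same length, which is what lets the examples be stacked into rectangular batches.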

Tokenizing the full dataset is slow as heck (unless the result is already cached from a previous run), and that was just loading and preprocessing.

Let's take a reduced subset of the data to speed things up for demo purposes.

In [3]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
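
The `shuffle(seed=42).select(range(1000))` pattern is a seeded shuffle followed by taking the first rows. A rough sketch of the idea in plain Python follows; `shuffled_subset` is a hypothetical helper, and it will not reproduce the exact rows the `datasets` library picks, only the principle that a fixed seed gives the same subset on every run.

```python
import random

def shuffled_subset(n_rows, n_keep, seed):
    """Pick n_keep row indices out of n_rows using a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same order every run
    return indices[:n_keep]                # keep the first n_keep shuffled indices

subset_a = shuffled_subset(n_rows=10_000, n_keep=5, seed=42)
subset_b = shuffled_subset(n_rows=10_000, n_keep=5, seed=42)

print(subset_a == subset_b)  # identical because the seed is identical
```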

#### Train

Load DistilBERT itself.

This will throw a warning, but it's fine.

It's basically just telling us that the classification layers are newly initialized, so the model isn't trained on a specific task yet.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We have some setup to do before we start, but there are pretty useful premade functions for us here.

There are A LOT of [hyperparameters for training](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). 

Some of the more important settings are the number of epochs, optimizer (e.g. Adam, SGD), learning rate, loss function, etc. Reasonable values are built in as defaults, and a lot of the parameters are minor, so we're mostly going to leave this alone. 

The only things we'll specify are our output directory and that we want to see intermediate results every epoch.

In [5]:
from transformers import TrainingArguments, Trainer

# note: newer transformers releases rename evaluation_strategy to eval_strategy
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

By default, the Trainer doesn't report any interpretable metric during training (the loss function isn't accuracy, it's cross-entropy) - we need to pass our Trainer an evaluation function to have an interpretable way to see how we're doing.

In [6]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred # raw outputs, actual labels
    predictions = np.argmax(logits, axis=-1) # prediction is the class with the highest logit (logits are raw scores, not probabilities)
    return metric.compute(predictions=predictions, references=labels) # accuracy computation
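
To see why accuracy and the cross-entropy loss tell different stories, here is a small sketch in plain Python. The logits and labels are invented for illustration, and `softmax`, `cross_entropy`, and `argmax` are hand-written stand-ins rather than the functions the Trainer uses internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for three reviews over the five star classes
batch_logits = [
    [2.0, 0.5, 0.1, -1.0, -2.0],   # confident and correct (label 0)
    [0.2, 0.3, 0.1, 0.0, -0.1],    # barely correct (label 1)
    [0.1, 0.2, 1.5, 0.3, 0.0],     # wrong: predicts class 2, label is 3
]
labels = [0, 1, 3]

predictions = [argmax(row) for row in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]

print("predictions:", predictions)
print("accuracy:", accuracy)
print("per-example loss:", [round(l, 3) for l in losses])
```

The first two reviews both count as correct for accuracy, but the loss is much lower for the confident one. Softmax also never changes which class is largest, which is why taking the argmax of the raw logits is enough.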

Build the actual Trainer Object

In [7]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Run it!

This is slow even with a GPU, a smaller BERT model, and a smaller dataset - but that's the cost of doing business with huge models.

Bear in mind that with this setup, the model will use your GPU by default if you have one (it's using 'mps' on my Apple M1). If you don't have one, this will be even slower.

These models are more often used with external cloud computation / distributed systems where possible. 

In [8]:
trainer.train() 

  0%|          | 0/375 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 1.1380902528762817, 'eval_accuracy': 0.505, 'eval_runtime': 67.8238, 'eval_samples_per_second': 14.744, 'eval_steps_per_second': 1.843, 'epoch': 1.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.9756078124046326, 'eval_accuracy': 0.602, 'eval_runtime': 69.2094, 'eval_samples_per_second': 14.449, 'eval_steps_per_second': 1.806, 'epoch': 2.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.9829638600349426, 'eval_accuracy': 0.609, 'eval_runtime': 71.9185, 'eval_samples_per_second': 13.905, 'eval_steps_per_second': 1.738, 'epoch': 3.0}
{'train_runtime': 836.4285, 'train_samples_per_second': 3.587, 'train_steps_per_second': 0.448, 'train_loss': 0.94194873046875, 'epoch': 3.0}


TrainOutput(global_step=375, training_loss=0.94194873046875, metrics={'train_runtime': 836.4285, 'train_samples_per_second': 3.587, 'train_steps_per_second': 0.448, 'train_loss': 0.94194873046875, 'epoch': 3.0})
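
The step counts in the progress bars (375 training steps, 125 evaluation steps) can be reproduced with a little arithmetic. This sketch assumes the `TrainingArguments` defaults of batch size 8 and 3 epochs on a single device, since the notebook doesn't override them; `steps_per_epoch` is a helper written for this example.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

train_examples = 1000
eval_examples = 1000
batch_size = 8    # assumed: default per-device batch size
num_epochs = 3    # assumed: default number of training epochs

total_train_steps = steps_per_epoch(train_examples, batch_size) * num_epochs
eval_steps = steps_per_epoch(eval_examples, batch_size)

print(total_train_steps, eval_steps)
```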

RESULTS:

Time: ~15 minutes on 2022 Mac M1

Epoch 1: 0.505

Epoch 2: 0.602

Epoch 3: 0.609

This hasn't fully converged yet, but the gains are starting to slow down (the evaluation loss even ticked up slightly between epochs 2 and 3).
- Could go further, but your computer might be heating up at this point

Close only counts in horseshoes and hand grenades - we're not rewarding the model for guessing 2/5 when the actual score is 1/5 - that's just as wrong as 5/5.

100% accuracy on this task is impossible because not everyone has the same understanding of what each star rating corresponds to - there isn't perfect alignment between the text and the rating.

We've also shot ourselves in the foot with a really small training dataset
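
The point about near misses can be made concrete with a toy comparison. Mean absolute error is used here only as an illustrative alternative that does reward being close; it is not a metric the notebook computes, and the labels below are invented.

```python
# Star labels on the 0-4 scale used by the dataset (values invented for illustration)
true_labels = [0, 0, 4, 2]
near_miss   = [1, 1, 3, 2]   # off by one star on three reviews
far_miss    = [4, 4, 0, 2]   # off by four stars on three reviews

def accuracy(preds, refs):
    """Fraction of exact matches."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def mean_absolute_error(preds, refs):
    """Average distance in stars between prediction and truth."""
    return sum(abs(p - r) for p, r in zip(preds, refs)) / len(refs)

print(accuracy(near_miss, true_labels), accuracy(far_miss, true_labels))
print(mean_absolute_error(near_miss, true_labels), mean_absolute_error(far_miss, true_labels))
```

Both sets of predictions get the same accuracy, even though one is clearly closer to the truth.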

#### Comparison

In [9]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [10]:
type(small_train_dataset['text'])

list

#### Classical ML for Comparison

TF-IDF for representation, Multinomial Naive Bayes for classification
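
For intuition about what the TF-IDF step produces, here is a hand-rolled sketch on three toy documents. It assumes raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is meant to mirror scikit-learn's default settings; the documents and the `tfidf_vector` helper are made up for this example.

```python
import math
from collections import Counter

docs = [
    "great food great service",
    "terrible food",
    "great place",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed IDF (assumed formula): rarer words get larger weights
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times IDF for each vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

In "terrible food", the rarer word "terrible" ends up with more weight than "food", which appears in two of the three documents.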

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text):
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])
    return text

X_train = [preprocess_text(text) for text in small_train_dataset['text']]
y_train = small_train_dataset['label']

X_test = [preprocess_text(text) for text in small_eval_dataset['text']]
y_test = small_eval_dataset['label']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.438


We're still beating classical ML/NLP by a fair bit! 

We can imagine that our more complex LLM approach would be able to extract proportionally more value from a larger dataset.

Both would (probably) see increased accuracy on the full dataset, but the LLM more so.

## Example 2: Summarizing Text

Let's try something we can't do with classical ML!

We're going to summarize text.

In [12]:
billsum = load_dataset("billsum", split="ca_test") #legal bill text from California

In [13]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

In [14]:
# Shrink it cause the full dataset made my computer memory overflow :(
billsum = billsum.shuffle(seed=42).select(range(100))

In [15]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 100
})

In [16]:
billsum = billsum.train_test_split(test_size=0.2)

billsum['train'][0]

# We have a long version - our input
# And a summary - our desired output

# Training on the task of transforming a 'text' into its corresponding 'summary', we'll create a legal explanation model

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 3212 of the Labor Code is repealed.\nSEC. 2.\nSection 3212 is added to the Labor Code, to read:\n3212.\n(a) As used in this act, the term “injury” includes both of the following:\n(1) With respect to the following members, a hernia, when any part of the hernia develops or manifests itself during a period while the member is in the service of the office, staff, division, department, or unit:\n(A) Members of a sheriff’s office or the California Highway Patrol, district attorney’s staff of inspectors and investigators, or police or fire departments of cities, counties, cities and counties, districts, or other public or municipal corporations or political subdivisions, whether those members are volunteers, or are partly paid or fully paid.\n(B) Active firefighting members of the Department of Forestry and Fire Protection whose duties require firefighting or of any county forestry or firefighting d

In [17]:
checkpoint = "google-t5/t5-small" # We need an encoder-decoder model since we're going text-text
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # Use the right tokenizer

In [18]:
prefix = "summarize: " # This is a multipurpose model - we need to attach a task to tell it what we want
# Just like how you prompt ChatGPT with specific questions


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]] # add summarize to texts
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True) # tokenize inputs
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True) # tokenize outputs

    model_inputs["labels"] = labels["input_ids"] # match up the text and summary. Specific to this application
    return model_inputs

In [19]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

In [20]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) # batch examples
# Better for computation and memory
# Pads each batch to the length of its own longest sequence, which is less wasteful than padding the whole dataset to one length
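
A quick back-of-the-envelope sketch of why per-batch padding helps: it counts how many token slots are needed when padding everything to the longest sequence in the dataset versus padding each batch to its own longest sequence. The sequence lengths and the two helper functions are invented for illustration.

```python
# Made-up token counts for eight documents, in the order they are batched
sequence_lengths = [12, 15, 14, 13, 200, 16, 11, 180]
batch_size = 4

def padded_tokens_global(lengths):
    """Pad every sequence to the longest sequence in the whole dataset."""
    return max(lengths) * len(lengths)

def padded_tokens_per_batch(lengths, batch_size):
    """Pad each batch only to the longest sequence inside that batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

global_cost = padded_tokens_global(sequence_lengths)
dynamic_cost = padded_tokens_per_batch(sequence_lengths, batch_size)

print(global_cost, dynamic_cost)
```

The batch of short documents no longer pays for the one very long document elsewhere in the dataset.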

In [21]:
rouge = evaluate.load("rouge") # metric
# specially designed metric for summarization tasks: https://huggingface.co/spaces/evaluate-metric/rouge
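
For a feel of what ROUGE measures, here is a simplified ROUGE-1 sketch: unigram overlap between a predicted summary and a reference, reported as precision, recall, and F1. It is a hand-written approximation (whitespace tokenization, no stemming) rather than the `evaluate` implementation, and the example sentences are made up.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Simplified ROUGE-1: clipped unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())   # shared words, clipped by count
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    "the bill lowers drug costs",
    "the bill lowers prescription drug costs and taxes",
)
print(scores)
```

Every predicted word appears in the reference, so precision is perfect, but the prediction misses part of the reference, so recall is lower.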

In [22]:
# From the documentation, don't worry about syntax here

def compute_metrics(eval_pred):
    predictions, labels = eval_pred # generated token ids, reference token ids
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True) # ids -> text
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) # -100 marks ignored label positions; swap in the pad id so they can be decoded
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True) # ROUGE scores
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions] # non-padding tokens per prediction
    result["gen_len"] = np.mean(prediction_lens) # average generated length

    return {k: round(v, 4) for k, v in result.items()}
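
The `np.where(labels != -100, ...)` line is easier to follow on a tiny example. Padded label positions are filled with -100 so the loss ignores them, but -100 is not a real token id, so it has to be swapped for the pad id before decoding. The sketch below does the same replacement in plain Python; the token ids are invented and `pad_token_id = 0` is an assumption made for this example.

```python
IGNORE_INDEX = -100   # value used to mark label positions the loss should skip
pad_token_id = 0      # assumed pad id for this illustration

def restore_padding(label_rows, pad_id):
    """Replace every ignored position with the pad id so the rows can be decoded."""
    return [[tok if tok != IGNORE_INDEX else pad_id for tok in row] for row in label_rows]

# Two made-up label sequences; the first was padded to match the second
labels = [
    [37, 1082, 1, -100, -100],
    [8, 2876, 19, 3, 1],
]
restored = restore_padding(labels, pad_token_id)
print(restored)
```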


In [23]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) # load t5

In [24]:
# Define training args:

training_args = Seq2SeqTrainingArguments(
    output_dir="billsum_model", 
    evaluation_strategy="epoch", 
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01, # regularization in optimizer
    save_total_limit=3, # maximum number of versions to have saved
    num_train_epochs=3,
    predict_with_generate=True # generate summaries during evaluation so ROUGE can be computed
)

In [25]:
# build trainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
) 

In [26]:
import os
os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0' # disables the MPS memory limit, which can crash the system if memory runs out - don't try this at home

In [27]:
trainer.train()

  0%|          | 0/60 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/5 [00:00<?, ?it/s]

{'eval_loss': 3.7612648010253906, 'eval_rouge1': 0.1369, 'eval_rouge2': 0.0436, 'eval_rougeL': 0.1222, 'eval_rougeLsum': 0.1229, 'eval_gen_len': 19.0, 'eval_runtime': 10.5324, 'eval_samples_per_second': 1.899, 'eval_steps_per_second': 0.475, 'epoch': 1.0}


  0%|          | 0/5 [00:00<?, ?it/s]

{'eval_loss': 3.336975574493408, 'eval_rouge1': 0.1355, 'eval_rouge2': 0.044, 'eval_rougeL': 0.1219, 'eval_rougeLsum': 0.1223, 'eval_gen_len': 19.0, 'eval_runtime': 7.739, 'eval_samples_per_second': 2.584, 'eval_steps_per_second': 0.646, 'epoch': 2.0}


  0%|          | 0/5 [00:00<?, ?it/s]

{'eval_loss': 3.2518959045410156, 'eval_rouge1': 0.129, 'eval_rouge2': 0.0382, 'eval_rougeL': 0.1123, 'eval_rougeLsum': 0.1127, 'eval_gen_len': 19.0, 'eval_runtime': 8.1974, 'eval_samples_per_second': 2.44, 'eval_steps_per_second': 0.61, 'epoch': 3.0}
{'train_runtime': 301.3856, 'train_samples_per_second': 0.796, 'train_steps_per_second': 0.199, 'train_loss': 3.7708613077799478, 'epoch': 3.0}


TrainOutput(global_step=60, training_loss=3.7708613077799478, metrics={'train_runtime': 301.3856, 'train_samples_per_second': 0.796, 'train_steps_per_second': 0.199, 'train_loss': 3.7708613077799478, 'epoch': 3.0})

In [34]:
trainer.save_model('billsum_model')

In [35]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [39]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("billsum_model", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("billsum_model", local_files_only=True)

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

# Generate summary
summary_ids = model.generate(inputs["input_ids"], num_beams=4, min_length=None, max_length=60, early_stopping=True)

# Decode the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the summary
print("Summary:", summary)


Summary: the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in American history. no one making under $400,000 per year will pay a penny more in taxes.


Pretty similar, but with a bit less fluff. Note that this was VERY minimal training to allow it to run locally.