# Lightweight Fine-Tuning Project

In this cell, describe your choices for each of the following

* PEFT technique: `LoRA`
* Model: `GPT-2` 
* Evaluation approach: `Accuracy on test set` 
* Fine-tuning dataset: `yelp_review_full`

In [144]:
# Import required packages for this notebook
import torch
import evaluate

import numpy as np

from evaluate import evaluator
from collections import Counter
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from peft import get_peft_model, AutoPeftModelForSequenceClassification
from peft import LoraConfig, TaskType

## Some helper functions used in the notebook

In [3]:
def print_trainable_parameters(model):
    """Determine and print the number of trainable parameters and of all parameters."""
    # Code was taken from Hugging Face (https://huggingface.co/).
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

In [139]:
def compute_metrics(eval_pred):
    """Returns the accuracy of a model by providing predictions and ground truth labels."""
    # Code was taken from the Udacity example notebooks.
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': (predictions == labels).mean()}

In [14]:
def predict(tokenizer, model, text):
    """Returns the logits, the predicted class id and the corresponding label."""
    # Move the inputs to the model's device instead of hard-coding 'cuda:0'
    inputs = tokenizer(text, padding='max_length', truncation=True, return_tensors='pt').to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
        predicted_class_id = logits.argmax().item()

    return {'logits': logits, 'class_id': predicted_class_id, 'label': model.config.id2label[predicted_class_id]}

## Loading and Evaluating a Foundation Model

In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### Loading the data

In [6]:
# Load the train and test splits of the yelp_review_full dataset

splits = ['train', 'test']
sizes_percent = {'train': 0.1, 'test': 0.1} # use only 10% of each split to reduce compute time and memory

ds = {split: ds for split, ds in zip(
      splits, load_dataset('yelp_review_full', split=splits, trust_remote_code=True))}

# Thin out the dataset to make it run faster
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(int(ds[split].shape[0]*sizes_percent[split])))

# Show the dataset
ds

{'train': Dataset({
     features: ['label', 'text'],
     num_rows: 65000
 }),
 'test': Dataset({
     features: ['label', 'text'],
     num_rows: 5000
 })}

In [153]:
# Count the samples per split and class to check that the classes are roughly balanced
from collections import Counter
for split in splits:
    lbls = dict(Counter(ds[split]['label']))
    for lbl in lbls:
        print(f'Split: {split} \t Label: {lbl} \t Amount: {lbls[lbl]}')

Split: train 	 Label: 4 	 Amount: 12900
Split: train 	 Label: 2 	 Amount: 12875
Split: train 	 Label: 0 	 Amount: 13109
Split: train 	 Label: 3 	 Amount: 13108
Split: train 	 Label: 1 	 Amount: 13008
Split: test 	 Label: 2 	 Amount: 974
Split: test 	 Label: 4 	 Amount: 953
Split: test 	 Label: 1 	 Amount: 986
Split: test 	 Label: 3 	 Amount: 1055
Split: test 	 Label: 0 	 Amount: 1032
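
The counts above give a useful reference point for the accuracies reported later: a classifier that always predicts the most frequent label. The following is a minimal sketch that uses only the counts printed above; the helper names `majority_baseline` and `max_deviation_from_uniform` are illustrative and not part of the notebook.

```python
# Label counts copied from the output above (10% subsample of yelp_review_full).
train_counts = {0: 13109, 1: 13008, 2: 12875, 3: 13108, 4: 12900}
test_counts = {0: 1032, 1: 986, 2: 974, 3: 1055, 4: 953}


def majority_baseline(counts):
    """Accuracy of a classifier that always predicts the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total


def max_deviation_from_uniform(counts):
    """Largest absolute gap between a class share and the uniform share 1/k."""
    total = sum(counts.values())
    uniform = 1 / len(counts)
    return max(abs(c / total - uniform) for c in counts.values())


for name, counts in (("train", train_counts), ("test", test_counts)):
    print(
        f"{name}: majority baseline={majority_baseline(counts):.4f}, "
        f"max deviation from uniform={max_deviation_from_uniform(counts):.4f}"
    )
```

With five nearly balanced classes, the baseline sits close to 20 %, so any accuracy well above that reflects learned signal rather than class imbalance.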


### Creating a tokenizer and tokenized datasets

In [6]:
# Create tokenizer for GPT2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

def preprocess_function(samples):
    """Preprocess the yelp dataset by returning tokenized samples."""
    return tokenizer(samples['text'], padding='max_length', truncation=True)

# Create tokenized datasets.
tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)
    # rename and remove columns
    tokenized_ds[split] = tokenized_ds[split].rename_column('label', 'labels')
    tokenized_ds[split] = tokenized_ds[split].remove_columns(['attention_mask', 'text'])

# Show the dataset
tokenized_ds

{'train': Dataset({
     features: ['labels', 'input_ids'],
     num_rows: 65000
 }),
 'test': Dataset({
     features: ['labels', 'input_ids'],
     num_rows: 5000
 })}

### Loading the base model and customizing it for the classification task

In [7]:
# Load a gpt-2 foundation model for sequence classification
model = GPT2ForSequenceClassification.from_pretrained(
    'gpt2',
    num_labels=5,
    id2label={0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'},
    label2id={'1 star': 0, '2 stars': 1, '3 stars': 2, '4 stars': 3, '5 stars': 4},
)

# Inform the model about the pad_token_id specified in the tokenizer!
model.config.pad_token_id = model.config.eos_token_id

# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False

# Print the model
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=5, bias=False)
)


In [8]:
# Show the amount of trainable parameters.
print_trainable_parameters(model)

trainable params: 3840 || all params: 124443648 || trainable%: 0.00
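
The 3840 trainable parameters are exactly the weights of the new classification head. A short check, assuming only the sizes shown in the printed model (`Linear(in_features=768, out_features=5, bias=False)`) and the total reported above:

```python
hidden_size = 768          # GPT-2 hidden width, from the printed model
num_labels = 5             # star ratings 1 to 5
all_params = 124_443_648   # total parameter count reported above

# The score head is a bias-free linear layer mapping 768 features to 5 logits.
head_params = hidden_size * num_labels
share = 100 * head_params / all_params

print(f"head params: {head_params}")
print(f"share of all params: {share:.4f}%")
```

The share is so small that it rounds to `0.00` with the two-decimal format used by `print_trainable_parameters`.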


### Training the model

In [9]:
# Create HuggingFace Trainer to handle the training and evaluation loop for PyTorch.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='./data/sentiment_analysis_base',
        learning_rate=2e-3,
        per_device_train_batch_size=40,
        per_device_eval_batch_size=40,
        num_train_epochs=5,
        weight_decay=0.01,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        label_names=["labels"]
    ),
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [10]:
# Train the classifier
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.1322 | 1.165133 | 0.4956 |
| 2 | 1.1059 | 1.075997 | 0.528 |
| 3 | 1.079 | 1.069036 | 0.5272 |
| 4 | 1.0619 | 1.044171 | 0.5394 |
| 5 | 1.0339 | 1.032026 | 0.5508 |


TrainOutput(global_step=8125, training_loss=1.1012529146634615, metrics={'train_runtime': 14766.1504, 'train_samples_per_second': 22.01, 'train_steps_per_second': 0.55, 'total_flos': 1.69847488512e+17, 'train_loss': 1.1012529146634615, 'epoch': 5.0})

In [145]:
# Save the model
#model.save_pretrained('./models/gpt-2-pretrained')

# Save tokenizer
#tokenizer.save_pretrained('./models/gpt-2-tokenizer')

# Optionally, load the trained model later
#model_reloaded = GPT2ForSequenceClassification.from_pretrained('./models/gpt-2-pretrained').to('cuda:0')

### Evaluating the model

In [12]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 1.032025933265686,
 'eval_accuracy': 0.5508,
 'eval_runtime': 192.5453,
 'eval_samples_per_second': 25.968,
 'eval_steps_per_second': 0.649,
 'epoch': 5.0}

In [26]:
# Test an example text from yelp

example = '''Top notch doctor in a top notch practice. Can't say I am surprised when \
I was referred to him by another doctor who I think is wonderful and because he went \
to one of the best medical schools in the country. \nIt is really easy to get an appointment. \
There is minimal wait to be seen and his bedside manner is great.'''

# yelp label: 5 stars

# Get model prediction
prediction = predict(tokenizer, model, example)

print(f'Logits: {prediction["logits"]}')
print(f'Class-Id: {prediction["class_id"]}')
print(f'Label: {prediction["label"]}')

Logits: tensor([[ 3.6169,  4.5685,  6.8587,  9.6815, 11.6487]], device='cuda:0')
Class-Id: 4
Label: 5 stars
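
The logits printed above are unnormalized scores. The following is a minimal sketch, using only the values from that output, of how they can be turned into class probabilities with a numerically stable softmax. The `softmax` helper is illustrative and not part of the notebook; in the notebook itself, `torch.softmax` would be the natural choice.

```python
import math

# Logits copied from the output above.
logits = [3.6169, 4.5685, 6.8587, 9.6815, 11.6487]
id2label = {0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'}


def softmax(values):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

for idx, p in enumerate(probs):
    print(f"{id2label[idx]}: {p:.4f}")
print(f"Predicted: {id2label[best]}")
```

The predicted class is the same as with `argmax` on the logits, since softmax preserves their ordering.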


## Performing Parameter-Efficient Fine-Tuning

In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

### Create a PEFT model

In [140]:
# Load a gpt-2 foundation model for sequence classification
model = GPT2ForSequenceClassification.from_pretrained(
    'gpt2',
    num_labels=5,
    id2label={0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'},
    label2id={'1 star': 0, '2 stars': 1, '3 stars': 2, '4 stars': 3, '5 stars': 4},
)

# Inform the model about the pad_token_id specified in the tokenizer!
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
# Create a Lora configuration
config = LoraConfig(
    r=16,
    lora_alpha=16,
    use_rslora=True,
    lora_dropout=0.05,
    modules_to_save=["score"],
    task_type=TaskType.SEQ_CLS
)
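
With `use_rslora=True`, the LoRA update is scaled by `lora_alpha / sqrt(r)` (rank-stabilized LoRA) rather than the standard `lora_alpha / r`. A small sketch of what that means for the values chosen above:

```python
import math

r = 16            # LoRA rank, as in the configuration above
lora_alpha = 16   # LoRA alpha, as in the configuration above

standard_scaling = lora_alpha / r            # standard LoRA scaling
rslora_scaling = lora_alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

print(f"standard LoRA scaling: {standard_scaling}")
print(f"rsLoRA scaling:        {rslora_scaling}")
```

For this configuration, the rank-stabilized variant therefore applies a four times larger factor to the low-rank update than standard LoRA would.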

In [34]:
# Get Lora model for gpt-2 using the Lora configuration
lora_model = get_peft_model(model, config)

# Show the amount of trainable parameters.
lora_model.print_trainable_parameters()

trainable params: 593,664 || all params: 125,037,312 || trainable%: 0.4747894772402017
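
The reported numbers can be reproduced by hand. The sketch below assumes that only the `c_attn` projection in each of the 12 GPT-2 blocks receives LoRA matrices (the printed model further down shows `c_attn` wrapped, and the arithmetic is consistent with this) and that `modules_to_save=["score"]` adds a trainable copy of the classification head.

```python
num_blocks = 12        # GPT-2 blocks, from the printed model
r = 16                 # LoRA rank
in_features = 768      # input width of c_attn
out_features = 2304    # output width of c_attn (query, key, value concatenated)
head_params = 768 * 5  # trainable copy of the score head
base_params = 124_443_648  # parameter count of the base model reported earlier

# Per block: lora_A is (768 x 16) and lora_B is (16 x 2304), both without bias.
lora_per_block = in_features * r + r * out_features
lora_params = num_blocks * lora_per_block

trainable = lora_params + head_params
total = base_params + trainable
pct = 100 * trainable / total

print(f"LoRA params: {lora_params:,}")
print(f"trainable params: {trainable:,} || all params: {total:,} || trainable%: {pct}")
```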




In [35]:
# Print the Lora model
print(lora_model)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDi

### Fine-tune the LoRA Model

In [36]:
# Create HuggingFace Trainer to handle the training and evaluation loop for PyTorch.
trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir='./data/sentiment_analysis_lora',
        learning_rate=5e-4,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=5,
        weight_decay=0.05,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        label_names=["labels"]
    ),
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [37]:
# Train the classifier
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8971 | 0.888117 | 0.6122 |
| 2 | 0.8653 | 0.90392 | 0.6226 |
| 3 | 0.8278 | 0.854013 | 0.651 |
| 4 | 0.7568 | 0.854632 | 0.658 |
| 5 | 0.7402 | 0.866431 | 0.6606 |


TrainOutput(global_step=81250, training_loss=0.8381380267803485, metrics={'train_runtime': 37848.7196, 'train_samples_per_second': 8.587, 'train_steps_per_second': 2.147, 'total_flos': 1.710329167872e+17, 'train_loss': 0.8381380267803485, 'epoch': 5.0})
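
The `global_step` values of both training runs follow directly from the dataset size, batch size and number of epochs. A minimal sketch, assuming a single device and no gradient accumulation; `total_steps` is an illustrative helper, not a `transformers` function:

```python
import math

train_rows = 65_000   # size of the training split used above
epochs = 5


def total_steps(rows, batch_size, epochs):
    """Optimizer steps for a run: batches per epoch (rounded up) times epochs."""
    return math.ceil(rows / batch_size) * epochs


base_steps = total_steps(train_rows, batch_size=40, epochs=epochs)  # head-only run
lora_steps = total_steps(train_rows, batch_size=4, epochs=epochs)   # LoRA run

print(f"base model run: {base_steps} steps")
print(f"LoRA run:       {lora_steps} steps")
```

The LoRA run takes ten times as many optimizer steps because its batch size is ten times smaller, which should be kept in mind when comparing the two training runtimes.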

### Save the model and the fine-tuned parameters

In [150]:
# Save fine-tuned parameters
#lora_model.save_pretrained("./models/gpt-2-lora-ft-parameters")

In [151]:
# Save the complete model
#merged_model = lora_model.merge_and_unload()
#merged_model.save_pretrained('./models/gpt-2-lora-full')

## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

### Calculate accuracy of trained base model (without PEFT) for test set.

In [147]:
# Load base model
model_reloaded = GPT2ForSequenceClassification.from_pretrained('./models/gpt-2-pretrained').to('cuda:0')

In [148]:
# Load the tokenizer if not already loaded above
tokenizer = GPT2Tokenizer.from_pretrained('models/gpt-2-tokenizer')

In [149]:
label_mapping = {'1 star': 0, '2 stars': 1, '3 stars': 2, '4 stars': 3, '5 stars': 4}
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(model_or_pipeline=model_reloaded,
                                 tokenizer=tokenizer,
                                 data=ds['test'],
                                 input_column='text',
                                 label_column='label',
                                 metric='accuracy',
                                 label_mapping=label_mapping,)
print(results)

{'accuracy': 0.5508, 'total_time_in_seconds': 74.39159498299705, 'samples_per_second': 67.21189404720789, 'latency_in_seconds': 0.01487831899659941}


### Load gpt-2 model and add the saved model weights

Alternatively, you could load the complete (merged) LoRA model as follows:

`lora_reloaded = GPT2ForSequenceClassification.from_pretrained('models/gpt-2-lora-full').to('cuda:0')`

In [141]:
# Load model with saved PEFT model weights
lora_reloaded = AutoPeftModelForSequenceClassification.from_pretrained(
    'models/gpt-2-lora-ft-parameters',
    num_labels=5,
    id2label={0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'},
    label2id={'1 star': 0, '2 stars': 1, '3 stars': 2, '4 stars': 3, '5 stars': 4},
).to('cuda:0')

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [143]:
# Evaluate the reloaded PEFT model on the test set

# Get predictions for each test sample
pred_logits = []
for sample in ds['test']:
    logits = predict(tokenizer, lora_reloaded, sample['text'])['logits']
    pred_logits.append(logits.detach().cpu().squeeze().numpy())

# Compute accuracy
compute_metrics((pred_logits, np.array(ds['test']['label'])))

{'accuracy': 0.6354}

---

### Result

After reloading the data and re-creating the `train` and `test` splits, the base model (with only the classification head trained) reaches an accuracy of 55.08 % on the test set. That is 8.46 percentage points below the PEFT model's accuracy of 63.54 %.

Interestingly, reloading the complete merged model as described below gives a slightly higher accuracy of 65.1 %. This equals the epoch-3 accuracy in the LoRA training log (0.651), the epoch with the lowest validation loss.

---

The fine-tuned model reloaded with `lora_reloaded = GPT2ForSequenceClassification.from_pretrained('models/gpt-2-lora-full').to('cuda:0')` can be evaluated as follows (this approach is not supported for PEFT models):

```
import evaluate
from evaluate import evaluator

label_mapping = {'1 star': 0, '2 stars': 1, '3 stars': 2, '4 stars': 3, '5 stars': 4}
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(model_or_pipeline=lora_reloaded,
                                 tokenizer=tokenizer,
                                 data=ds['test'],
                                 input_column='text',
                                 label_column='label',
                                 metric='accuracy',
                                 label_mapping=label_mapping,)
print(results)
```

In [146]:
# Test an example text from yelp

example = '''Top notch doctor in a top notch practice. Can't say I am surprised when \
I was referred to him by another doctor who I think is wonderful and because he went \
to one of the best medical schools in the country. \nIt is really easy to get an appointment. \
There is minimal wait to be seen and his bedside manner is great.'''

# yelp label: 5 stars

# Get model prediction
prediction = predict(tokenizer, lora_reloaded, example)

print(f'Logits: {prediction["logits"]}')
print(f'Class-Id: {prediction["class_id"]}')
print(f'Label: {prediction["label"]}')

Logits: tensor([[-3.2397, -3.5468, -1.6284,  1.9804,  4.6484]], device='cuda:0')
Class-Id: 4
Label: 5 stars
