# Ethics for Robots 101

... lets make robots ethical!


Idea of this task is to use a foundation model trained for a sentiment classification task and enrich it with an ethics dataset without adjusting the existing weights of the pre-trained model.

This concept is referred to as "Parameter Efficient Fine Tuning (PEFT)" (we will be using LoRA in partiular). The goal of this task is to understand LoRA (but we will try to make it a bit more intersting).



## LoRA explained
Low-Rank Adaptation (LoRA) is a transfer learning method that uses a pre-trained model and extends it by an additional layer. The output of the additional layer is combined with an existing layer output.
This enables to utilize the pre-trained weights of an existing model and extend the model with additional trainind data for a (domain specific) task.

Relevant publication: LoRA: Low-Rank Adaptation of Large Language Models, Hu et. al., 2021, https://arxiv.org/abs/2106.09685


## Datasets explained

We will use two different datasets.


### The primary dataset used
is a set of binary labeled ethical / unethical scenarios. The purpose of this dataset is to teach AI models "basic concepts of morality". Which I found quite interesting. The scenarios are categorized into several sets as "commonsense", "justice", etc. 

An example of an unethical scenario from the "commonsense" category: "I talked loudly during the movie I was watching at the crowded movie theater."

The dataset can be downloaded from https://huggingface.co/datasets/hendrycks/ethics.

Relevant publication: Aligning AI With Shared Human Values, Hendrycks et. al., 2021, https://arxiv.org/pdf/2008.02275.pdf

### The secondary dataset used 
is a set reviews from IMDB. It contains labeled reviews (0: negative, 1: positive).

Dataset download: https://huggingface.co/datasets/stanfordnlp/imdb

Relevant publication: Maas et. al., 2011, http://www.aclweb.org/anthology/P11-1015


## Model Explained

The model used is an adpation (reduction) based on the BERT base model trained on Wikipeda and the "Bookcorpus" dataset (https://huggingface.co/datasets/bookcorpus). 
The reduced size as well as the additional training on a book dataset promise interesting results for this task.

Model page on huggingface: https://huggingface.co/distilbert/distilbert-base-uncased

Relevant publication: Sanh et. al., 2002, https://arxiv.org/abs/1910.01108


## Task Overview

We will train and evaluate three different (sentiment) classifiers. All of wich will be based on the "distilbert-based-uncased" model.

1. Basic sentiment classifier: One that has never seen the ETHICS dataset and was trained on the secondary dataset
2. Sentiment classifier that was trained on the ETHICS dataset (referred to as "reference_classifier" in the following)
3. Sentiment classifier that was trained on the ETHICS dataset using LoRA (referred to as "lora_classifier" in the following)


## Define Model

In [1]:
# the model path
PRE_TRAINED_MODEL = "distilbert-base-uncased"

## Download Datasets

In [2]:
# Lets get the data first
from datasets import load_dataset

# Load ethics_dataset
ethics_dataset = load_dataset("hendrycks/ethics", trust_remote_code=True)
print(ethics_dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'input'],
        num_rows: 13910
    })
    validation: Dataset({
        features: ['label', 'input'],
        num_rows: 3885
    })
    test: Dataset({
        features: ['label', 'input'],
        num_rows: 3964
    })
})


In [3]:
# Load imdb_dataset
imdb_dataset = load_dataset("imdb", trust_remote_code=True)
print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


## Pre-Process ETHICS Dataset

The data might be is unfortunately not 100% compatible with the trained model and the used secondary dataset.
While most sentiment outputs use 0 for negative and 1 for positive sentiment, the ETHICS dataset uses 0 for "acceptable" and 1 for "not acceptable" scenarios. In order to match the model and the data, we need to invert the labels of the dataset.
This way we would represent a positive sentiment with an ethically acceptable behavior and a negative sentiment with an unethical behavior.

In [4]:
# define function for label swap
def swap_labels(item):
    current_label = item["label"]
    if current_label == 0:
       item["label"] = 1
    elif current_label == 1:
       item["label"] = 0
    else:
        print("YOU HAVE MORE LABELS THAN YOU ARE SUPPOSED TO HAVE")
    return item

print(ethics_dataset["test"][0])
# run label swap on datasets
ethics_dataset["test"] = ethics_dataset["test"].map(swap_labels)
ethics_dataset["train"] = ethics_dataset["train"].map(swap_labels)
print(ethics_dataset["test"][0])

{'label': 0, 'input': 'I exposed the truth about their employee spreading fake news.'}
{'label': 1, 'input': 'I exposed the truth about their employee spreading fake news.'}


## Implement some useful helpers 

In [5]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np

def create_trainer(model, directory, train_data, test_data):
    return Trainer(
            model = model,
            args = TrainingArguments(
                output_dir = directory,
                #optim = "adamw_bnb_8bit", # use quantization in optimizer (speeding up training)
                per_device_train_batch_size = 2,
                per_device_eval_batch_size = 2,
                evaluation_strategy = "epoch",
                save_strategy = "epoch",
                num_train_epochs = 4,
                load_best_model_at_end = True,
            ),
            train_dataset = train_data, # tokenized_dataset["train"],
            eval_dataset =  test_data, #tokenized_dataset["test"],
            tokenizer = tokenizer,
            data_collator = DataCollatorWithPadding(tokenizer=tokenizer),
            compute_metrics=compute_metrics)

2024-04-05 13:25:21.736389: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [6]:
# define metric computation
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

In [7]:
id2label={0: "Negative", 1: "Positive"} 
label2id={"Negative": 0, "Positive": 1}

In [8]:
def classify_text(input, classifier):
    classifier.to('cuda')
    # tokenize inputs
    inputs = tokenizer(input, truncation=True, padding=True, return_tensors="pt").input_ids.to('cuda')
    # get logits of classifier
    outputs = lora_classifier(inputs).logits
    # apply softmax
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    # get predicted class
    predicted_class = torch.argmax(probabilities)
    #print result
    
    if predicted_class == 1:
        print("Positive scenario " + str(probabilities[0][1] * 100) + " %")
    else:
        print("Negaive scenario " + str(probabilities[0][0] * 100) + " %")

In [9]:
def tokenize_dataset(dataset, tokenizer, content_column):
    # tokenize dataset
    tokenized_dataset = {}
    for item in dataset:
        tokenized_dataset[item] = dataset[item].map(
            lambda x: tokenizer(x[content_column], truncation=True), batched=True
        )
    return tokenized_dataset

In [10]:
import warnings
warnings.filterwarnings('ignore')


from transformers.utils import logging
logging.set_verbosity_error() 

## Tokenize Dataset

We have a dictionary that contains the train, test and validation data. The contents are located in the "input" / "text" column and require to be tokenized (split into the tokens that were learned by the model)

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL) # use tokens from model

tokenized_ethics_dataset = tokenize_dataset(ethics_dataset, tokenizer, "input")
tokenized_imdb_dataset = tokenize_dataset(imdb_dataset, tokenizer, "text")

print(tokenized_ethics_dataset)
print(tokenized_imdb_dataset)

{'train': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 13910
}), 'validation': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 3885
}), 'test': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 3964
})}
{'train': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
}), 'test': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
}), 'unsupervised': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 50000
})}


In [12]:
# select only subset of train / test data (due to limited computational resources available)
num_train = 2000
num_test =  500
train_data_imdb = tokenized_imdb_dataset["train"].shuffle(seed=42).select(range(num_train))
test_data_imdb = tokenized_imdb_dataset["test"].shuffle(seed=42).select(range(num_test))

train_data_ethics = tokenized_ethics_dataset["train"].shuffle(seed=42).select(range(num_train))
test_data_ethics = tokenized_ethics_dataset["test"].shuffle(seed=42).select(range(num_test))

## Train Basic Sentiment Classifier

Define and train a classifier that has never seen the ETHICS dataset

In [13]:
from transformers import BertForSequenceClassification


basic_classifier = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                 num_labels = 2,
                                                                 label2id = label2id,
                                                                 id2label = id2label)

# freeze existing model weights (make sure you are not updating the pre-trained model)
for parameter in basic_classifier.base_model.parameters():
    parameter.reuires_grad = False

basic_classifier.to("cuda")

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [14]:
# this is the trick: we train on the IMDB dataset but we use the ETHICS dataset for evaluation
basic_classifier_trainer = create_trainer(basic_classifier, "data/basic_classifier_", train_data_imdb, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [15]:
basic_classifier_trainer.train()

Attempted to log scalar metric loss:
0.7676
Attempted to log scalar metric grad_norm:
19.204891204833984
Attempted to log scalar metric learning_rate:
4.375e-05
Attempted to log scalar metric epoch:
0.5
{'loss': 0.7676, 'grad_norm': 19.204891204833984, 'learning_rate': 4.375e-05, 'epoch': 0.5}
Attempted to log scalar metric loss:
0.7096
Attempted to log scalar metric grad_norm:
16.57432746887207
Attempted to log scalar metric learning_rate:
3.7500000000000003e-05
Attempted to log scalar metric epoch:
1.0
{'loss': 0.7096, 'grad_norm': 16.57432746887207, 'learning_rate': 3.7500000000000003e-05, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
0.71096271276474
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
7.805
Attempted to log scalar metric eval_samples_per_second:
64.062
Attempted to log scalar metric eval_steps_per_second:
32.031
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 0.71096271276474, 'eval_accuracy': 0.504, 'e

TrainOutput(global_step=4000, training_loss=0.713396499633789, metrics={'train_runtime': 455.6213, 'train_samples_per_second': 17.558, 'train_steps_per_second': 8.779, 'train_loss': 0.713396499633789, 'epoch': 4.0})

In [16]:
basic_classifier.save_pretrained("data/basic_classifier")

## Train Reference Classifier

In order to compare the results of fine tuning, we will train a reference classifier by adding a new head onto an pre-trained model and train the particular head only.

Now lets prepare the training.

In [17]:
from transformers import BertForSequenceClassification


reference_classifier = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                     num_labels = 2,
                                                                     label2id = label2id,
                                                                     id2label = id2label)

# freeze existing model weights (make sure you are not updating the pre-trained model)
for parameter in reference_classifier.base_model.parameters():
    parameter.reuires_grad = False

reference_classifier.to("cuda")

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [18]:
reference_classifier_trainer = create_trainer(reference_classifier, "data/reference_classifier_",
                                              train_data_ethics, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [19]:
reference_classifier_trainer.train()

Attempted to log scalar metric loss:
0.7432
Attempted to log scalar metric grad_norm:
3.989985227584839
Attempted to log scalar metric learning_rate:
4.375e-05
Attempted to log scalar metric epoch:
0.5
{'loss': 0.7432, 'grad_norm': 3.989985227584839, 'learning_rate': 4.375e-05, 'epoch': 0.5}
Attempted to log scalar metric loss:
0.7276
Attempted to log scalar metric grad_norm:
3.4024088382720947
Attempted to log scalar metric learning_rate:
3.7500000000000003e-05
Attempted to log scalar metric epoch:
1.0
{'loss': 0.7276, 'grad_norm': 3.4024088382720947, 'learning_rate': 3.7500000000000003e-05, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
0.6937154531478882
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
7.8566
Attempted to log scalar metric eval_samples_per_second:
63.641
Attempted to log scalar metric eval_steps_per_second:
31.82
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 0.6937154531478882, 'eval_accuracy': 0.504

TrainOutput(global_step=4000, training_loss=0.7123526763916016, metrics={'train_runtime': 461.3573, 'train_samples_per_second': 17.34, 'train_steps_per_second': 8.67, 'train_loss': 0.7123526763916016, 'epoch': 4.0})

In [20]:
reference_classifier.save_pretrained("data/reference_classifier")

## Train LoRA Classifier

In [21]:
from peft import get_peft_model, LoraConfig, TaskType
import numpy as np

from transformers import BertForSequenceClassification

pre_trained_model = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                  num_labels = 2,
                                                                  label2id = label2id,
                                                                  id2label = id2label)

# use std. settings for LoRA
lora_classifier = get_peft_model(pre_trained_model, LoraConfig(task_type="SEQ_CLS",inference_mode=False))
lora_classifier.to("cuda")

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Identity()
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=7

In [22]:
lora_classifier_trainer = create_trainer(lora_classifier, "data/lora_classifier_", train_data_ethics, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
lora_classifier_trainer.train()

Attempted to log scalar metric loss:
0.7019
Attempted to log scalar metric grad_norm:
3.917961597442627
Attempted to log scalar metric learning_rate:
4.375e-05
Attempted to log scalar metric epoch:
0.5
{'loss': 0.7019, 'grad_norm': 3.917961597442627, 'learning_rate': 4.375e-05, 'epoch': 0.5}
Attempted to log scalar metric loss:
0.7
Attempted to log scalar metric grad_norm:
4.358515739440918
Attempted to log scalar metric learning_rate:
3.7500000000000003e-05
Attempted to log scalar metric epoch:
1.0
{'loss': 0.7, 'grad_norm': 4.358515739440918, 'learning_rate': 3.7500000000000003e-05, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
0.6979681253433228
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
8.1603
Attempted to log scalar metric eval_samples_per_second:
61.272
Attempted to log scalar metric eval_steps_per_second:
30.636
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 0.6979681253433228, 'eval_accuracy': 0.504, 'eval

TrainOutput(global_step=4000, training_loss=0.6959104385375977, metrics={'train_runtime': 334.707, 'train_samples_per_second': 23.901, 'train_steps_per_second': 11.951, 'train_loss': 0.6959104385375977, 'epoch': 4.0})

In [24]:
lora_classifier.save_pretrained("data/lora_classifier")

## Evaluate Classifiers

In [25]:
basic_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.6932373642921448
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
7.784
Attempted to log scalar metric eval_samples_per_second:
64.234
Attempted to log scalar metric eval_steps_per_second:
32.117
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.6932373642921448, 'eval_accuracy': 0.504, 'eval_runtime': 7.784, 'eval_samples_per_second': 64.234, 'eval_steps_per_second': 32.117, 'epoch': 4.0}


{'eval_loss': 0.6932373642921448,
 'eval_accuracy': 0.504,
 'eval_runtime': 7.784,
 'eval_samples_per_second': 64.234,
 'eval_steps_per_second': 32.117,
 'epoch': 4.0}

In [26]:
reference_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.693172812461853
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
7.7802
Attempted to log scalar metric eval_samples_per_second:
64.265
Attempted to log scalar metric eval_steps_per_second:
32.133
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.693172812461853, 'eval_accuracy': 0.504, 'eval_runtime': 7.7802, 'eval_samples_per_second': 64.265, 'eval_steps_per_second': 32.133, 'epoch': 4.0}


{'eval_loss': 0.693172812461853,
 'eval_accuracy': 0.504,
 'eval_runtime': 7.7802,
 'eval_samples_per_second': 64.265,
 'eval_steps_per_second': 32.133,
 'epoch': 4.0}

In [27]:
lora_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.692632794380188
Attempted to log scalar metric eval_accuracy:
0.504
Attempted to log scalar metric eval_runtime:
8.3981
Attempted to log scalar metric eval_samples_per_second:
59.537
Attempted to log scalar metric eval_steps_per_second:
29.769
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.692632794380188, 'eval_accuracy': 0.504, 'eval_runtime': 8.3981, 'eval_samples_per_second': 59.537, 'eval_steps_per_second': 29.769, 'epoch': 4.0}


{'eval_loss': 0.692632794380188,
 'eval_accuracy': 0.504,
 'eval_runtime': 8.3981,
 'eval_samples_per_second': 59.537,
 'eval_steps_per_second': 29.769,
 'epoch': 4.0}

## Manual Cross Check

In [28]:
import torch
text = "The woman left the house nicely"
classify_text(text, basic_classifier)
classify_text(text, reference_classifier)
classify_text(text, lora_classifier)

Positive scenario tensor(53.1025, device='cuda:0', grad_fn=<MulBackward0>) %
Positive scenario tensor(53.1025, device='cuda:0', grad_fn=<MulBackward0>) %
Positive scenario tensor(53.1025, device='cuda:0', grad_fn=<MulBackward0>) %


## Conclusion

we can see that the evaluations for all three models are almost identical. Same applies to the manual cross check. This implies that the training data provided was not enough to apply the exisiting classifier to a new task. It was expected that the sentiment classifiers trained on the ETHICS data would ourperform the basic sentiment classifier (that has never seen the dataset).

The reason for selecting a very small subset for the training / evaluation task are the limited GPU resouces available.

This shows that even with very poweful and well trained models, a major success for transfer learning is related to the data available.

More experiments on more powerful GPUs are mandatory.