# Ethics for Robots 101

... lets make robots ethical!


Idea of this task is to use a foundation model trained for a sentiment classification task and enrich it with an ethics dataset without adjusting the existing weights of the pre-trained model.

This concept is referred to as "Parameter Efficient Fine Tuning (PEFT)" (we will be using LoRA in partiular). The goal of this task is to understand LoRA (but we will try to make it a bit more intersting).



## LoRA explained
Low-Rank Adaptation (LoRA) is a transfer learning method that uses a pre-trained model and extends it by an additional layer. The output of the additional layer is combined with an existing layer output.
This enables to utilize the pre-trained weights of an existing model and extend the model with additional trainind data for a (domain specific) task.

Relevant publication: LoRA: Low-Rank Adaptation of Large Language Models, Hu et. al., 2021, https://arxiv.org/abs/2106.09685


## Datasets explained

We will use two different datasets.


### The primary dataset used
is a set of binary labeled ethical / unethical scenarios. The purpose of this dataset is to teach AI models "basic concepts of morality". Which I found quite interesting. The scenarios are categorized into several sets as "commonsense", "justice", etc. 

An example of an unethical scenario from the "commonsense" category: "I talked loudly during the movie I was watching at the crowded movie theater."

The dataset can be downloaded from https://huggingface.co/datasets/hendrycks/ethics.

Relevant publication: Aligning AI With Shared Human Values, Hendrycks et. al., 2021, https://arxiv.org/pdf/2008.02275.pdf

### The secondary dataset used 
is a set reviews from IMDB. It contains labeled reviews (0: negative, 1: positive).

Dataset download: https://huggingface.co/datasets/stanfordnlp/imdb

Relevant publication: Maas et. al., 2011, http://www.aclweb.org/anthology/P11-1015


## Model Explained

The model used is an adpation (reduction) based on the BERT base model trained on Wikipeda and the "Bookcorpus" dataset (https://huggingface.co/datasets/bookcorpus). 
The reduced size as well as the additional training on a book dataset promise interesting results for this task.

Model page on huggingface: https://huggingface.co/distilbert/distilbert-base-uncased

Relevant publication: Sanh et. al., 2002, https://arxiv.org/abs/1910.01108


## Task Overview

We will train and evaluate three different (sentiment) classifiers. All of wich will be based on the "distilbert-based-uncased" model.

1. Basic sentiment classifier: One that has never seen the ETHICS dataset and was trained on the secondary dataset
2. Sentiment classifier that was trained on the ETHICS dataset (referred to as "reference_classifier" in the following)
3. Sentiment classifier that was trained on the ETHICS dataset using LoRA (referred to as "lora_classifier" in the following)


## Define Model

In [1]:
# the model path
PRE_TRAINED_MODEL = "distilbert-base-uncased"

## Download Datasets

In [2]:
# Lets get the data first
from datasets import load_dataset

# Load ethics_dataset
ethics_dataset = load_dataset("hendrycks/ethics", trust_remote_code=True)
print(ethics_dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'input'],
        num_rows: 13910
    })
    validation: Dataset({
        features: ['label', 'input'],
        num_rows: 3885
    })
    test: Dataset({
        features: ['label', 'input'],
        num_rows: 3964
    })
})


In [3]:
# Load imdb_dataset
imdb_dataset = load_dataset("imdb", trust_remote_code=True)
print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


## Pre-Process ETHICS Dataset

The data might be is unfortunately not 100% compatible with the trained model and the used secondary dataset.
While most sentiment outputs use 0 for negative and 1 for positive sentiment, the ETHICS dataset uses 0 for "acceptable" and 1 for "not acceptable" scenarios. In order to match the model and the data, we need to invert the labels of the dataset.
This way we would represent a positive sentiment with an ethically acceptable behavior and a negative sentiment with an unethical behavior.

In [4]:
# define function for label swap
def swap_labels(item):
    current_label = item["label"]
    if current_label == 0:
       item["label"] = 1
    elif current_label == 1:
       item["label"] = 0
    else:
        print("YOU HAVE MORE LABELS THAN YOU ARE SUPPOSED TO HAVE")
    return item

print(ethics_dataset["test"][0])
# run label swap on datasets
ethics_dataset["test"] = ethics_dataset["test"].map(swap_labels)
ethics_dataset["train"] = ethics_dataset["train"].map(swap_labels)
print(ethics_dataset["test"][0])

{'label': 0, 'input': 'I exposed the truth about their employee spreading fake news.'}
{'label': 1, 'input': 'I exposed the truth about their employee spreading fake news.'}


## Implement some useful helpers 

In [5]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np

def create_trainer(model, directory, train_data, test_data):
    return Trainer(
            model = model,
            args = TrainingArguments(
                output_dir = directory,
                #optim = "adamw_bnb_8bit", # use quantization in optimizer (speeding up training)
                per_device_train_batch_size = 2,
                per_device_eval_batch_size = 2,
                evaluation_strategy = "epoch",
                save_strategy = "epoch",
                num_train_epochs = 4,
                load_best_model_at_end = True,
            ),
            train_dataset = train_data, # tokenized_dataset["train"],
            eval_dataset =  test_data, #tokenized_dataset["test"],
            tokenizer = tokenizer,
            data_collator = DataCollatorWithPadding(tokenizer=tokenizer),
            compute_metrics=compute_metrics)

2024-04-05 13:18:29.795633: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [6]:
# define metric computation
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

In [7]:
id2label={0: "Negative", 1: "Positive"} 
label2id={"Negative": 0, "Positive": 1}

In [8]:
def classify_text(input, classifier):
    classifier.to('cuda')
    # tokenize inputs
    inputs = tokenizer(input, truncation=True, padding=True, return_tensors="pt").input_ids.to('cuda')
    # get logits of classifier
    outputs = lora_classifier(inputs).logits
    # apply softmax
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    # get predicted class
    predicted_class = torch.argmax(probabilities)
    #print result
    
    if predicted_class == 1:
        print("Positive scenario " + str(probabilities[0][1] * 100) + " %")
    else:
        print("Negaive scenario " + str(probabilities[0][0] * 100) + " %")

In [9]:
def tokenize_dataset(dataset, tokenizer, content_column):
    # tokenize dataset
    tokenized_dataset = {}
    for item in dataset:
        tokenized_dataset[item] = dataset[item].map(
            lambda x: tokenizer(x[content_column], truncation=True), batched=True
        )
    return tokenized_dataset

In [10]:
import warnings
warnings.filterwarnings('ignore')


from transformers.utils import logging
logging.set_verbosity_error() 

## Tokenize Dataset

We have a dictionary that contains the train, test and validation data. The contents are located in the "input" / "text" column and require to be tokenized (split into the tokens that were learned by the model)

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL) # use tokens from model

tokenized_ethics_dataset = tokenize_dataset(ethics_dataset, tokenizer, "input")
tokenized_imdb_dataset = tokenize_dataset(imdb_dataset, tokenizer, "text")

print(tokenized_ethics_dataset)
print(tokenized_imdb_dataset)

{'train': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 13910
}), 'validation': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 3885
}), 'test': Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask'],
    num_rows: 3964
})}
{'train': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
}), 'test': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
}), 'unsupervised': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 50000
})}


In [12]:
# select only subset of train / test data (due to limited computational resources available)
num_train = 2000
num_test =  500
train_data_imdb = tokenized_imdb_dataset["train"].shuffle(seed=42).select(range(num_train))
test_data_imdb = tokenized_imdb_dataset["test"].shuffle(seed=42).select(range(num_test))

train_data_ethics = tokenized_ethics_dataset["train"].shuffle(seed=42).select(range(num_train))
test_data_ethics = tokenized_ethics_dataset["test"].shuffle(seed=42).select(range(num_test))

## Train Basic Sentiment Classifier

Define and train a classifier that has never seen the ETHICS dataset

In [13]:
from transformers import BertForSequenceClassification


basic_classifier = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                 num_labels = 2,
                                                                 label2id = label2id,
                                                                 id2label = id2label)

# freeze existing model weights (make sure you are not updating the pre-trained model)
for parameter in basic_classifier.base_model.parameters():
    parameter.reuires_grad = False

basic_classifier.to("cuda")

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [14]:
# this is the trick: we train on the IMDB dataset but we use the ETHICS dataset for evaluation
basic_classifier_trainer = create_trainer(basic_classifier, "data/basic_classifier_", train_data_imdb, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [15]:
basic_classifier_trainer.train()

Attempted to log scalar metric eval_loss:
1.7884544134140015
Attempted to log scalar metric eval_accuracy:
0.2
Attempted to log scalar metric eval_runtime:
0.1505
Attempted to log scalar metric eval_samples_per_second:
66.427
Attempted to log scalar metric eval_steps_per_second:
33.214
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 1.7884544134140015, 'eval_accuracy': 0.2, 'eval_runtime': 0.1505, 'eval_samples_per_second': 66.427, 'eval_steps_per_second': 33.214, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
1.4312607049942017
Attempted to log scalar metric eval_accuracy:
0.2
Attempted to log scalar metric eval_runtime:
0.1526
Attempted to log scalar metric eval_samples_per_second:
65.518
Attempted to log scalar metric eval_steps_per_second:
32.759
Attempted to log scalar metric epoch:
2.0
{'eval_loss': 1.4312607049942017, 'eval_accuracy': 0.2, 'eval_runtime': 0.1526, 'eval_samples_per_second': 65.518, 'eval_steps_per_second': 32.759, 'epoch': 2.0}
Attempted to log s

TrainOutput(global_step=20, training_loss=0.8477251052856445, metrics={'train_runtime': 89.286, 'train_samples_per_second': 0.448, 'train_steps_per_second': 0.224, 'train_loss': 0.8477251052856445, 'epoch': 4.0})

In [16]:
basic_classifier.save_pretrained("data/basic_classifier")

## Train Reference Classifier

In order to compare the results of fine tuning, we will train a reference classifier by adding a new head onto an pre-trained model and train the particular head only.

Now lets prepare the training.

In [17]:
from transformers import BertForSequenceClassification


reference_classifier = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                     num_labels = 2,
                                                                     label2id = label2id,
                                                                     id2label = id2label)

# freeze existing model weights (make sure you are not updating the pre-trained model)
for parameter in reference_classifier.base_model.parameters():
    parameter.reuires_grad = False

reference_classifier.to("cuda")

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [18]:
reference_classifier_trainer = create_trainer(reference_classifier, "data/reference_classifier_",
                                              train_data_ethics, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [19]:
reference_classifier_trainer.train()

Attempted to log scalar metric eval_loss:
0.49261993169784546
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1481
Attempted to log scalar metric eval_samples_per_second:
67.523
Attempted to log scalar metric eval_steps_per_second:
33.762
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 0.49261993169784546, 'eval_accuracy': 0.8, 'eval_runtime': 0.1481, 'eval_samples_per_second': 67.523, 'eval_steps_per_second': 33.762, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
0.7714373469352722
Attempted to log scalar metric eval_accuracy:
0.2
Attempted to log scalar metric eval_runtime:
0.1495
Attempted to log scalar metric eval_samples_per_second:
66.882
Attempted to log scalar metric eval_steps_per_second:
33.441
Attempted to log scalar metric epoch:
2.0
{'eval_loss': 0.7714373469352722, 'eval_accuracy': 0.2, 'eval_runtime': 0.1495, 'eval_samples_per_second': 66.882, 'eval_steps_per_second': 33.441, 'epoch': 2.0}
Attempted to log

TrainOutput(global_step=20, training_loss=0.9754878997802734, metrics={'train_runtime': 166.721, 'train_samples_per_second': 0.24, 'train_steps_per_second': 0.12, 'train_loss': 0.9754878997802734, 'epoch': 4.0})

In [20]:
reference_classifier.save_pretrained("data/reference_classifier")

## Train LoRA Classifier

In [21]:
from peft import get_peft_model, LoraConfig, TaskType
import numpy as np

from transformers import BertForSequenceClassification

pre_trained_model = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL,
                                                                  num_labels = 2,
                                                                  label2id = label2id,
                                                                  id2label = id2label)

# use std. settings for LoRA
lora_classifier = get_peft_model(pre_trained_model, LoraConfig(task_type="SEQ_CLS",inference_mode=False))
lora_classifier.to("cuda")

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Identity()
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=7

In [22]:
lora_classifier_trainer = create_trainer(lora_classifier, "data/lora_classifier_", train_data_ethics, test_data_ethics)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
lora_classifier_trainer.train()

Attempted to log scalar metric eval_loss:
0.5669976472854614
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1621
Attempted to log scalar metric eval_samples_per_second:
61.692
Attempted to log scalar metric eval_steps_per_second:
30.846
Attempted to log scalar metric epoch:
1.0
{'eval_loss': 0.5669976472854614, 'eval_accuracy': 0.8, 'eval_runtime': 0.1621, 'eval_samples_per_second': 61.692, 'eval_steps_per_second': 30.846, 'epoch': 1.0}
Attempted to log scalar metric eval_loss:
0.5683746337890625
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1526
Attempted to log scalar metric eval_samples_per_second:
65.535
Attempted to log scalar metric eval_steps_per_second:
32.767
Attempted to log scalar metric epoch:
2.0
{'eval_loss': 0.5683746337890625, 'eval_accuracy': 0.8, 'eval_runtime': 0.1526, 'eval_samples_per_second': 65.535, 'eval_steps_per_second': 32.767, 'epoch': 2.0}
Attempted to log s

TrainOutput(global_step=20, training_loss=0.6570723056793213, metrics={'train_runtime': 2.78, 'train_samples_per_second': 14.389, 'train_steps_per_second': 7.194, 'train_loss': 0.6570723056793213, 'epoch': 4.0})

In [24]:
lora_classifier.save_pretrained("data/lora_classifier")

## Evaluate Classifiers

In [25]:
basic_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.5965095162391663
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1408
Attempted to log scalar metric eval_samples_per_second:
70.998
Attempted to log scalar metric eval_steps_per_second:
35.499
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.5965095162391663, 'eval_accuracy': 0.8, 'eval_runtime': 0.1408, 'eval_samples_per_second': 70.998, 'eval_steps_per_second': 35.499, 'epoch': 4.0}


{'eval_loss': 0.5965095162391663,
 'eval_accuracy': 0.8,
 'eval_runtime': 0.1408,
 'eval_samples_per_second': 70.998,
 'eval_steps_per_second': 35.499,
 'epoch': 4.0}

In [26]:
reference_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.49261993169784546
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1349
Attempted to log scalar metric eval_samples_per_second:
74.155
Attempted to log scalar metric eval_steps_per_second:
37.077
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.49261993169784546, 'eval_accuracy': 0.8, 'eval_runtime': 0.1349, 'eval_samples_per_second': 74.155, 'eval_steps_per_second': 37.077, 'epoch': 4.0}


{'eval_loss': 0.49261993169784546,
 'eval_accuracy': 0.8,
 'eval_runtime': 0.1349,
 'eval_samples_per_second': 74.155,
 'eval_steps_per_second': 37.077,
 'epoch': 4.0}

In [27]:
lora_classifier_trainer.evaluate()

Attempted to log scalar metric eval_loss:
0.5659557580947876
Attempted to log scalar metric eval_accuracy:
0.8
Attempted to log scalar metric eval_runtime:
0.1449
Attempted to log scalar metric eval_samples_per_second:
69.029
Attempted to log scalar metric eval_steps_per_second:
34.515
Attempted to log scalar metric epoch:
4.0
{'eval_loss': 0.5659557580947876, 'eval_accuracy': 0.8, 'eval_runtime': 0.1449, 'eval_samples_per_second': 69.029, 'eval_steps_per_second': 34.515, 'epoch': 4.0}


{'eval_loss': 0.5659557580947876,
 'eval_accuracy': 0.8,
 'eval_runtime': 0.1449,
 'eval_samples_per_second': 69.029,
 'eval_steps_per_second': 34.515,
 'epoch': 4.0}

## Manual testing

In [28]:
import torch
text = "The woman left the house nicely"
classify_text(text, basic_classifier)
classify_text(text, reference_classifier)
classify_text(text, lora_classifier)

Negaive scenario tensor(68.4437, device='cuda:0', grad_fn=<MulBackward0>) %
Negaive scenario tensor(68.4437, device='cuda:0', grad_fn=<MulBackward0>) %
Negaive scenario tensor(68.4437, device='cuda:0', grad_fn=<MulBackward0>) %
