# IMPORTANT
1. Shut down all existing kernels (otherwise you will run into memory problems). To do so, go to the navigation bar and click Kernel > Shut Down All Kernels.
2. **After** step 1, and in contrast to part 2, start this notebook with the **Torch** kernel. To do so, go to the navigation bar and click Kernel > Change Kernel... > select Torch > click Select (alternatively, use the kernel button at the top right of the notebook).

## Binary Judgement Prediction with Transformer language models

The last model class we'll experiment with is Transformer language models. We will *not* train a model from scratch on this dataset because Transformer language models are typically very large networks, with millions of parameters, which would likely overfit to the dataset at hand. Instead, we will use a pre-trained language model, an autoregressive Transformer optimised to predict the next word in texts from many different domains.

We suggest you use [GPT-neo-125m](https://huggingface.co/EleutherAI/gpt-neo-125m), a model designed to replicate the architecture of OpenAI's GPT-3 in its smallest version (125 million parameters). Feel free to substitute this with another pretrained autoregressive language model from the Hugging Face [model hub](https://huggingface.co/models?sort=trending) but beware of model size.

First, let's load the necessary Python libraries.

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from sklearn.metrics import accuracy_score, PrecisionRecallDisplay
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, pipeline
from torch.utils.data import Dataset, DataLoader
import torch
import pandas as pd

Load the data as preprocessed in the preparation notebook.

In [None]:
data = pd.read_csv("data.csv")

Next, we define the helper functions that were also used in part 2.

A note from the ***software engineering point of view:*** Defining these functions twice, once in each notebook, is not nice. However, because we run the TensorFlow and PyTorch parts in separate kernels, it is an easy way to make them available under the appropriate framework.

In [None]:
def load_ECHR_dataset_for_binary_judgement_classification(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_binary_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds


# Convenience functions: prepare data splits in scikit-friendly format
# You don't need to read the code in this cell, but please make sure you execute it.

def load_input_from_ECHR_dataset(dataframe):
    # Input: text (use the dataframe argument, not the global `data`)
    X_train = dataframe[dataframe.partition == 'train'].text.to_list()
    X_val = dataframe[dataframe.partition == 'dev'].text.to_list()
    X_test = dataframe[dataframe.partition == 'test'].text.to_list()
    return X_train, X_val, X_test

def load_binary_output_from_ECHR_dataset(dataframe):
    # Binary output: violation judgement
    y_train_binary = dataframe[dataframe.partition == 'train'].binary_judgement.to_numpy()
    y_val_binary = dataframe[dataframe.partition == 'dev'].binary_judgement.to_numpy()
    y_test_binary = dataframe[dataframe.partition == 'test'].binary_judgement.to_numpy()
    return y_train_binary, y_val_binary, y_test_binary

def load_regression_output_from_ECHR_dataset(dataframe):
    # Regression output: case importance score
    y_train_regression = dataframe[dataframe.partition == 'train'].importance.astype(float).to_numpy()
    y_val_regression = dataframe[dataframe.partition == 'dev'].importance.astype(float).to_numpy()
    y_test_regression = dataframe[dataframe.partition == 'test'].importance.astype(float).to_numpy()
    return y_train_regression, y_val_regression, y_test_regression

def load_multiclass_output_from_ECHR_dataset(dataframe):
    # Multiclass output: case importance label
    y_train_multiclass = dataframe[dataframe.partition == 'train'].importance.to_numpy()
    y_val_multiclass = dataframe[dataframe.partition == 'dev'].importance.to_numpy()
    y_test_multiclass = dataframe[dataframe.partition == 'test'].importance.to_numpy()
    return y_train_multiclass, y_val_multiclass, y_test_multiclass

def load_ECHR_dataset_for_case_importance_regression(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_regression_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds

def load_ECHR_dataset_for_case_importance_classification(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_multiclass_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds


def load_classification_model_and_tokenizer(model_name_or_path):
    lm = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)

    # Load the tokenizer suitable for this model
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    if not lm.config.pad_token_id:
        lm.config.pad_token_id = lm.config.eos_token_id
        tokenizer.pad_token = tokenizer.eos_token

    return lm, tokenizer

class EHRCDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(self.texts[idx],
                                  truncation=True,
                                  padding='max_length',
                                  max_length=self.max_length,
                                  return_attention_mask=True,
                                  return_tensors='pt')

        item = {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

        return item
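
As a rough illustration of what `truncation=True` and `padding='max_length'` do inside `__getitem__`, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. The helper name `pad_or_truncate`, the toy ids, and the choice of right-side padding are assumptions made for illustration; the real tokenizer performs this step internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Hypothetical helper: fix a token-id list to exactly max_length entries."""
    kept = token_ids[:max_length]                    # truncation: drop tokens beyond max_length
    n_pad = max_length - len(kept)                   # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad              # right-side padding (an assumption of this sketch)
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=5, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padding positions, which is why the dataset returns it alongside `input_ids`.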

Load the data in model-friendly format using the convenience functions above.

In [None]:
# Load the data using our convenience functions
train_ds, val_ds, test_ds = load_ECHR_dataset_for_binary_judgement_classification(data)

# Load the model and the tokenizer suitable for it
MODEL_NAME = "EleutherAI/gpt-neo-125m"
lm, tokenizer = load_classification_model_and_tokenizer(MODEL_NAME)

# Create datasets for training and validation
train_dataset = EHRCDataset(train_ds['texts'], train_ds['labels'], tokenizer, max_length=2048)
val_dataset = EHRCDataset(val_ds['texts'], val_ds['labels'], tokenizer, max_length=2048)

### Zero-shot classification and prompting

Note that this model is pre-trained on the general language modelling task (predicting the next word in a text) and not on the legal judgement prediction task. This is different from the setup you have seen in the tutorial on pre-trained Transformers. The type of classification we will perform with this model is typically referred to as *zero-shot classification*, meaning that the model is asked to classify without having seen *any* labelled examples from the dataset.

In [None]:
zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device="cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU is available
)

Instead of using a `text-classification` pipeline, we are using a `zero-shot-classification` pipeline. These two are almost equivalent except that `zero-shot-classification` doesn't require a hardcoded number of potential classes. They can be chosen at runtime:

In [None]:
candidate_labels = ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

Why should this work? The language model is essentially asked whether "innocent" is more or less likely to follow the court case text than "guilty".
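
The shape of the pipeline's answer can be sketched without loading a model: each candidate label receives a raw score, the scores are normalised with a softmax, and the labels are returned sorted from most to least probable. The scores below are made up, and `rank_labels` is a hypothetical helper, not part of `transformers`.

```python
import math

def rank_labels(raw_scores):
    """Hypothetical helper: softmax-normalise per-label scores and sort descending."""
    max_score = max(raw_scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - max_score) for label, s in raw_scores.items()}
    total = sum(exp_scores.values())
    ranked = sorted(exp_scores, key=exp_scores.get, reverse=True)
    return {"labels": ranked, "scores": [exp_scores[label] / total for label in ranked]}

candidate_labels = ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

result = rank_labels({"innocent": -1.2, "guilty": 0.3})  # made-up scores
prediction = label2id[result["labels"][0]]                # id of the top-ranked label
print(result["labels"], prediction)
```

This is why the cells that follow read the prediction from `result["labels"][0]`: the first entry is always the highest-scoring label.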

But does it work in practice?

In [None]:
predictions_binary_classifier_7 = []

for text in tqdm(val_ds["texts"]):

    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    predictions_binary_classifier_7.append(prediction)

# Calculate the accuracy
acc_classifier7 = accuracy_score(val_ds["labels"], predictions_binary_classifier_7)
print("\nAccuracy:", acc_classifier7)

To further steer the model towards giving sensible answers, it is good practice to prepend or append a templated string to the input example. In this case, we could for instance use the template "The party being sued in this court case is", which makes the model much less surprised to see "innocent" or "guilty" as continuations and gives the model a context to interpret those continuations as we would like it to. This technique is referred to as *prompting*.
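
Concretely, a template contains a `{}` placeholder that is filled once per candidate label with Python's `str.format`, so every label is scored inside the same sentence. A small sketch of the strings this produces (the variable names are chosen for illustration only):

```python
template = "The party being sued in this court case is {}"
labels = ["innocent", "guilty"]

# One filled-in sentence per candidate label
hypotheses = [template.format(label) for label in labels]
for hypothesis in hypotheses:
    print(hypothesis)
```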


In [None]:
prompt = "The party being sued in this court case is {}"
candidate_labels = ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

Does this work better?

In [None]:
predictions_binary_classifier_8 = []

for text in tqdm(val_ds["texts"]):

    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels,
        hypothesis_template=prompt  # here we prompt the model with our template
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    predictions_binary_classifier_8.append(prediction)

# Calculate the accuracy
acc_classifier8 = accuracy_score(val_ds["labels"], predictions_binary_classifier_8)
print("\nAccuracy:", acc_classifier8)

**Exercise:** Try at least one more combination of prompt and labels and test the corresponding zero-shot classifier.

In [None]:
# prompt = "..."  # fill in a prompt
# candidate_labels = ["...", "..."]  # fill in potential labels
label2id = {label: i for i, label in enumerate(candidate_labels)}

predictions_binary_classifier_9 = []

for text in tqdm(val_ds["texts"]):
  # ... # forward pass
  )

  # ... # get the model prediction
  
# ... # calculate the accuracy

print("\nAccuracy:", acc_classifier9)                                                

### Fine-tuning

Finally, we fine-tune the pre-trained language model on the binary prediction task. By showing it examples of court cases and supervised labels, we obtain a Transformer model specialized for the judgement prediction task. Note that this might result in the model forgetting previous knowledge and becoming less performant in other tasks, including next-word prediction.

Let's launch the fine-tuning and save the fine-tuned model checkpoint.

In [None]:
# define convenience functions
def finetune_lm(model, train_dataset, val_dataset, n_epochs, batch_size, learning_rate, output_dir):
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=n_epochs,
        logging_dir="./logs",
        load_best_model_at_end=True,
        save_strategy="epoch",
        eval_strategy="epoch",
        save_total_limit=1,
        learning_rate=learning_rate,
        gradient_accumulation_steps=4 # added due to out-of-memory CUDA error
    )

    # Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        # accuracy_score expects (y_true, y_pred): gold labels first, then the argmax over the logits
        compute_metrics=lambda p: {"accuracy": accuracy_score(p.label_ids, p.predictions.argmax(-1))},
    )

    # Train the model
    trainer.train()

    return model, trainer
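
The `gradient_accumulation_steps=4` setting trades memory for time: gradients from four small batches are accumulated before each optimizer update, so one update sees more examples than fit on the GPU at once. A back-of-the-envelope sketch of the resulting effective batch size, assuming a single GPU and the batch size of 3 used later in this notebook (the helper name is hypothetical):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, n_devices=1):
    """Hypothetical helper: number of examples contributing to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * n_devices

effective = effective_batch_size(per_device_batch_size=3, gradient_accumulation_steps=4)
print(effective)
```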


In [None]:
# Load the data using our convenience functions
train_ds, val_ds, test_ds = load_ECHR_dataset_for_binary_judgement_classification(data)

# Load the model and the tokenizer suitable for it
MODEL_NAME = "EleutherAI/gpt-neo-125m"
lm, tokenizer = load_classification_model_and_tokenizer(MODEL_NAME)

# Create datasets for training and validation
train_dataset = EHRCDataset(train_ds['texts'], train_ds['labels'], tokenizer, max_length=2048)
val_dataset = EHRCDataset(val_ds['texts'], val_ds['labels'], tokenizer, max_length=2048)


Now we can either fine-tune the model ourselves (`train_from_scratch = True`) or load an already fine-tuned checkpoint. Note that fine-tuning takes significant time (>3h), so we do not expect you to run it yourself.

In [None]:
import os

train_from_scratch = False

if train_from_scratch:
    N_EPOCHS = 5
    BATCH_SIZE = 3
    LEARNING_RATE = 1e-5
    modelFile_finetuned = "models_trained/lm_for_classification_5ep"

    lm_finetuned, lm_trainer = finetune_lm(lm, train_dataset, val_dataset, N_EPOCHS, BATCH_SIZE, LEARNING_RATE, modelFile_finetuned)
    
    # Save or use the trained model as needed
    lm_finetuned.save_pretrained(modelFile_finetuned)
    
else:
    # Load the fine-tuned model
    folder_path = "./models_trained/"
    os.makedirs(folder_path, exist_ok=True)
    modelFile_finetuned = "models_trained/lm_for_classification_5ep"
    if not os.path.exists(modelFile_finetuned):
        # Download the model archive (saved as a file named "download")
        !wget https://polybox.ethz.ch/index.php/s/uBs2WuEvu4TVfCo/download
        # Unpack the gzipped tar archive into the destination directory
        !tar -xvzf download -C $folder_path

    lm_finetuned = AutoModelForSequenceClassification.from_pretrained(modelFile_finetuned)

Now we obtain predictions from the model and evaluate its accuracy.

In [None]:
# Check if GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

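# Move the model to the device and switch to evaluation mode (disables dropout)
lm_finetuned.to(device)
lm_finetuned.eval()

# List to store predicted labels
predictions_binary_classifier_10 = []

# Tokenize and predict labels for each example in the dataset
for text in tqdm(val_ds['texts']):

    # Tokenize input text, truncated to the same maximum length used for training
    tokenized_input = tokenizer(text, truncation=True, max_length=2048, return_tensors='pt').to(device)

    # Forward pass (no gradients are needed for inference)
    with torch.no_grad():
        output = lm_finetuned(**tokenized_input)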

    # Get predicted label
    predicted_label = torch.argmax(output.logits, dim=1).item()

    # Store predicted label in the list
    predictions_binary_classifier_10.append(predicted_label)

# Calculate the accuracy
acc_classifier10 = accuracy_score(val_ds["labels"], predictions_binary_classifier_10)
print(acc_classifier10)
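
Turning logits into a label and scoring the result is simple enough to check by hand. The sketch below uses made-up two-class logits and plain Python in place of `torch.argmax` and `accuracy_score`; all names and values are illustrative.

```python
toy_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]  # made-up scores for class 0 and class 1
toy_labels = [0, 1, 0, 0]                                        # made-up gold labels

# argmax over the class dimension: index of the largest logit per example
toy_predictions = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]

# accuracy: fraction of predictions that match the gold labels
toy_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(toy_predictions, toy_accuracy)
```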

Now we export the accuracies of the individual models for the last notebook, in which we compare the different models.

In [None]:
import json

acc = {
    "acc_classifier7": acc_classifier7,
    "acc_classifier8": acc_classifier8,
    "acc_classifier9": acc_classifier9,
    "acc_classifier10": acc_classifier10
}

# Save to a JSON file
with open("accuracy.json", "w") as f:
    json.dump(acc, f)
data.to_csv("data.csv", index=False)  # index=False avoids adding an extra index column on every save
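
On the reading side, the comparison notebook only needs `json.load` to recover the dictionary. A minimal sketch with made-up accuracy values, written to a temporary directory so that the real `accuracy.json` is left untouched:

```python
import json
import os
import tempfile

example_acc = {"acc_classifier7": 0.5, "acc_classifier10": 0.8}  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "accuracy.json")
    with open(path, "w") as f:
        json.dump(example_acc, f)
    with open(path) as f:
        loaded_acc = json.load(f)

# Name of the best-scoring classifier
best_name = max(loaded_acc, key=loaded_acc.get)
print(best_name, loaded_acc[best_name])
```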

# Evaluation on Test Set

In [None]:
# Load test set
data = pd.read_csv("data.csv")
_, _, test_set = load_ECHR_dataset_for_binary_judgement_classification(data)

test_documents = test_set['texts']
test_labels = test_set['labels']

Example evaluation with Transformers.

In [None]:
from tqdm import tqdm
from sklearn.metrics import classification_report

prompt = "Is this a case of 'violation' of human rights or a case of 'absolution'? It is a case of {}"
candidate_labels = ["violation", "absolution"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model="EleutherAI/gpt-neo-125m",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Make predictions with Transformers
binary_predictions = []

for text in tqdm(test_documents):
    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels,
        hypothesis_template=prompt
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    binary_predictions.append(prediction)

# Evaluation report
report = classification_report(
    y_true=test_labels,
    y_pred=binary_predictions
)

print(report)
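
`classification_report` lists precision and recall per class rather than a single accuracy number. As a sketch of what those two columns measure, the hypothetical helper below computes them for one class on made-up labels:

```python
def precision_recall(y_true, y_pred, positive):
    """Hypothetical helper: precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted as positive
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly predicted as positive
    fn = sum(t == positive and p != positive for t, p in pairs)  # positives that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

toy_true = [1, 1, 0, 1, 0, 0]  # made-up gold labels
toy_pred = [1, 0, 0, 1, 1, 0]  # made-up predictions
toy_precision, toy_recall = precision_recall(toy_true, toy_pred, positive=1)
print(toy_precision, toy_recall)
```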

**Exercise:** Compare the results to the evaluation on the test set in part 2 (logistic regression/LSTM).