<a href="https://colab.research.google.com/github/tosittig/CASAIS/blob/main/CAS_project3_tsittig.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Legal judgement prediction

For this project, we will use the **ECHR dataset**, a collection of 11.5K court cases extracted from the public database of the European Court of Human Rights and further annotated by human experts (more info [here](https://www.aclweb.org/anthology/P19-1424/)). You will develop NLP models that, given the facts of a case, predict whether a human rights article or protocol has been violated. We call such problems *binary classification*.

We will start from simple logistic regression classifiers that use bag-of-words representations of a court case as features, then move to bidirectional LSTM classifiers with frozen and adaptive embeddings, and conclude with pre-trained and fine-tuned Transformer language models.

For those who want to go above and beyond, or simply exercise their NLP classification skills further, it is possible to work on a non-mandatory project extension. Here, you will build models that predict a court case's "importance score". This is a value from 1 to 4 that allows legal practitioners to identify pivotal cases. You will address this as a *multi-class classification* problem. But more on this later!

All of the binary classification tasks, which are mandatory, are based on notions and code that you have been exposed to through lectures and/or tutorials.

## Preliminary data analysis

Let's begin by loading the dataset. The ECHR dataset is open-source and can be downloaded from [this web page](https://archive.org/details/ECHR-ACL2019), but we are going to load a cleaner version of it, which has been pre-processed for this course.

For that, we will need the `datasets` library installed.

In [None]:
!pip install datasets

Now we can import `load_dataset` from `datasets`, as well as the `pandas` library.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
from datasets import load_dataset


We load the data from the Hugging Face dataset hub and we store it in a pandas dataframe.

In [None]:
dataset = load_dataset("glnmario/ECHR")
full_data = pd.DataFrame(dataset['train'])

# Here, 'train' is just the default name for single-partition datasets.
# The actual training, development, and test set are defined in the
# first column of the dataframe ('partition').


***Display and inspect the first 5 rows of the dataset.***

In [None]:
... # fill in this line

As is common for datasets used in Machine Learning projects, the dataset is split into 3 partitions: training, development, and test set. The training and development sets contain cases from 1959 through 2013, and the test set from 2014 through 2018.
> Note: *It's good practice to never look at the test set during development, as the test set represents the data your Machine Learning system will have to deal with once deployed, which you can't observe at development time. Here, we will keep the test set at hand but you should avoid making any modelling decision based on its content or features. Furthermore, for data which covers a significant period of time (as we have it here), it's best to use the most recent portion of the data as test data, as this will be most similar to the real-world data for which we will use the system.*

The sizes of the partitions, in terms of number of court cases, are the following:

In [None]:
print("Training set     ", len(full_data[full_data.partition == "train"]))
print("Development set  ", len(full_data[full_data.partition == "dev"]))
print("Test set         ", len(full_data[full_data.partition == "test"]))
print("Total           ", len(full_data))

Each instance in this dataset is a court case. Each court case is annotated with the following properties (the columns of the dataframe):

*   `partition`: a label indicating the dataset partition this court case belongs to ("train", "dev", or "test")
*   `itemid`: a code which uniquely identifies this court case
*   `languageisocode`: an [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) describing the language in which the case is reported
*   `respondent`: the ISO code of the party being sued or tried (respondents are nation states)
*   `branch`: the branch of the Court dealing with the case, indicating at which stage of the trial a judgement was made (it can be one out of "ADMISSIBILITY", "CHAMBER", "GRANDCHAMBER", "COMMITTEE")
*   `date`: the date of the judgement
*   `docname`: the title of the court case (for example, "ERIKSON v. ITALY")
*   `importance`: an "importance score" from 1 (key case) to 4 (unimportant), denoting a case's contribution to the development of case-law
*   `conclusion`: a short summary of the case conclusion (for example, "Inadmissible" or "Violation of Art. 6-1; No violation of Art. 10")
*   `judges`: the names of the judges
*   `text`: the facts brought to the attention of the Court
*   `binary_judgement`: a binary label indicating whether an article or protocol was (1) or wasn't (0) violated


In [None]:
full_data.columns

### Filter court cases based on length
We are now going to filter out from the dataset the court cases with the longest texts. We will do this for two reasons. First, this will speed up the experiments. Second, the Transformer model that we will use at the end of the project has, like most Transformers, a limited *window size*, which cannot fit more than 2048 tokens. This is the maximum sequence length that a Transformer can process at a time.

***Set a threshold by inspecting how many data points it tosses out and how balanced the sizes of the different partitions are (see the next four code cells). The threshold should be smaller than 2048, but greater than or equal to 300.***

In [None]:
THRESHOLD = ...

Let's look at basic text length statistics and how many court cases are left out when using a certain threshold.

First, we measure the length of every text in the dataset. We do this by splitting each text into words as indicated by whitespace characters, and then counting the number of resulting words.

In [None]:
# Extract text lengths using whitespaces as a simple criterion to separate words
text_lengths = []
for text in full_data.text:
  word_list = text.split()
  num_words = len(word_list)
  text_lengths.append(num_words)

Now we can plot the distribution of text lengths, marking the threshold with a vertical line.

In [None]:
import matplotlib.pyplot as plt

# Plot text lengths
plt.hist(text_lengths, bins=100, alpha=0.5)
plt.ylabel('Frequency')
plt.xlabel('Text length (number of words)')

# Add a vertical bar corresponding to the threshold
plt.axvline(THRESHOLD, color='k', linestyle='dashed', linewidth=1)

plt.show()

As you can see, this leaves out quite a few court cases, but it is okay for the purposes of this project.

In [None]:
# Add text length as an extra column to the dataset
full_data['text_length'] = text_lengths

# Calculate how many cases are discarded
n_left_out = sum(full_data.text_length > THRESHOLD)
print(f"Omitting {n_left_out} long cases.")

# Filter out court cases with a text length larger than the threshold
data = full_data[full_data.text_length <= THRESHOLD]

Let's also make sure the dataset is still reasonably balanced with respect to the training, validation, and test partitions.

In [None]:
print("Training set     ", len(data[data.partition == "train"]))
print("Development set  ", len(data[data.partition == "dev"]))
print("Test set         ", len(data[data.partition == "test"]))
print("Total           ", len(data))
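
To compare several candidate thresholds before committing to one, it can help to tabulate how many cases each partition keeps. The sketch below does this on a small made-up dataframe. The helper name `retained_per_partition` and the toy values are illustrative assumptions, not part of the course code; with the real data you would pass `full_data` (once the `text_length` column has been added) instead of `toy`.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "partition": ["train", "train", "train", "dev", "dev", "test", "test"],
    "text_length": [120, 450, 2300, 310, 1900, 800, 5000],
})

def retained_per_partition(df, thresholds):
    """Hypothetical helper: count cases with text_length <= t, per partition."""
    rows = {}
    for t in thresholds:
        kept = df[df.text_length <= t]
        rows[t] = kept.partition.value_counts().reindex(
            ["train", "dev", "test"], fill_value=0
        )
    # One row per threshold, one column per partition
    return pd.DataFrame(rows).T

table = retained_per_partition(toy, thresholds=[300, 1000, 2000])
print(table)
```

Each row of the resulting table corresponds to one threshold, which makes it easy to spot a value that empties out one of the partitions.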

### Data visualization

Now that we have our final version of the dataset, let's visualise the distribution of some of the dataset properties (date, branch, respondent, etc.) to get a sense of the data. What time span does the dataset cover? How many cases make it to the Grand Chamber? Which countries have been sued most often?

***Fill in the code for the second plot.***

In [None]:
import seaborn as sns

plt.figure(figsize=(15, 10))

# Plot number of instances per date
plt.subplot(3, 1, 1)
sns.countplot(x='date', data=data, palette='viridis')
plt.xticks(rotation=90)  # Rotate x-axis labels
plt.title('Number of Instances by Date')

# Plot number of instances per branch
plt.subplot(3, 1, 2)
... # fill in this line
plt.title('Number of Instances by Branch')

# Plot number of instances per top 10 respondents
plt.subplot(3, 1, 3)
top_respondents = data['respondent'].value_counts().nlargest(10)
sns.barplot(x=top_respondents.index, y=top_respondents.values, palette='colorblind')
plt.title('Number of Instances by Top 10 Respondents')

plt.tight_layout()
plt.show()

Let's now look at how many cases in this dataset actually resulted in violations of human rights articles or protocols. This is typically called the *class label distribution*. It will give us an idea of the dataset *class balance* (or *class imbalance*), an important property to look out for when making modelling and evaluation decisions.

In [None]:
plt.figure(figsize=(10, 7))

# Plot binary class label distribution per partition
sns.countplot(
    x='partition',
    hue='binary_judgement',
    data=data,
    palette='colorblind',
    order=['train', 'dev', 'test']
)

# Annotate plot
plt.legend(title='Judgement', labels=['0: no violation', '1: violation'])
plt.title('Distribution of Binary Judgement Labels for Each Dataset Partition')
plt.xlabel('Partition')
plt.ylabel('Number of Cases')
plt.show()
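
Besides the plot, the class balance can also be summarised with a couple of numbers. The following sketch, on made-up labels, computes the fraction of violation cases per partition and the accuracy that a trivial classifier would reach by always predicting the majority class. The toy dataframe and the variable names are assumptions for illustration; on the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Toy stand-in for the filtered dataframe (labels are made up)
toy = pd.DataFrame({
    "partition": ["train"] * 4 + ["dev"] * 2 + ["test"] * 2,
    "binary_judgement": [1, 1, 1, 0, 1, 0, 1, 1],
})

# Fraction of cases with a violation (label 1) in each partition
violation_rate = toy.groupby("partition")["binary_judgement"].mean()

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = violation_rate.apply(lambda p: max(p, 1 - p))

print(violation_rate)
print(majority_baseline)
```

A majority-class accuracy of this kind is a useful reference point: a trained model should at least beat it.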

Finally, let's look at the class distribution of importance scores. Remember: importance scores range from 1 (key case) to 4 (unimportant).

***Write code that plots the class distribution per data partition.***

In [None]:
# Plot importance score distribution per partition

# ... # write the code snippet that plots class distribution by partition


## Binary Judgement Prediction with Bag of Words

Let's finally start with the task of predicting the outcome of a case given the text describing the main facts brought to the attention of the court. As we have just seen, each court case is annotated with a binary judgement label: whether the respondent has (label 1) or has not (label 0) violated any human rights article or protocol. This is a similar scenario to the sentiment classification task you have worked on previously in this course.

### Set-up
First, we load the necessary Python libraries. Similarly to the sentiment classification example, we will use `keras` and `tensorflow`.

***Fix the random seed of `tensorflow` and `numpy` to ensure reproducibility.***

In [None]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Conv1D, MaxPooling1D, Flatten, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant

# Initialize random number generators to ensure reproducibility
... # fill in this line
... # fill in this line

In [None]:
# @title Convenience functions: prepare data splits in scikit-friendly format
# @markdown You don't need to read the code in this cell, but please make sure you execute it.

def load_input_from_ECHR_dataset(dataframe):
    # Input: text
    X_train = dataframe[dataframe.partition == 'train'].text.to_list()
    X_val = dataframe[dataframe.partition == 'dev'].text.to_list()
    X_test = dataframe[dataframe.partition == 'test'].text.to_list()
    return X_train, X_val, X_test

def load_binary_output_from_ECHR_dataset(dataframe):
    # Binary output: violation judgement
    y_train_binary = dataframe[dataframe.partition == 'train'].binary_judgement.to_numpy()
    y_val_binary = dataframe[dataframe.partition == 'dev'].binary_judgement.to_numpy()
    y_test_binary = dataframe[dataframe.partition == 'test'].binary_judgement.to_numpy()
    return y_train_binary, y_val_binary, y_test_binary

def load_regression_output_from_ECHR_dataset(dataframe):
    # Regression output: case importance score
    y_train_regression = dataframe[dataframe.partition == 'train'].importance.astype(float).to_numpy()
    y_val_regression = dataframe[dataframe.partition == 'dev'].importance.astype(float).to_numpy()
    y_test_regression = dataframe[dataframe.partition == 'test'].importance.astype(float).to_numpy()
    return y_train_regression, y_val_regression, y_test_regression

def load_multiclass_output_from_ECHR_dataset(dataframe):
    # Multiclass output: case importance label
    y_train_multiclass = dataframe[dataframe.partition == 'train'].importance.to_numpy()
    y_val_multiclass = dataframe[dataframe.partition == 'dev'].importance.to_numpy()
    y_test_multiclass = dataframe[dataframe.partition == 'test'].importance.to_numpy()
    return y_train_multiclass, y_val_multiclass, y_test_multiclass

def load_ECHR_dataset_for_binary_judgement_classification(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_binary_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds

def load_ECHR_dataset_for_case_importance_regression(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_regression_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds

def load_ECHR_dataset_for_case_importance_classification(dataframe, for_tensorflow=False):
    X_train, X_val, X_test = load_input_from_ECHR_dataset(dataframe)
    y_train, y_val, y_test = load_multiclass_output_from_ECHR_dataset(dataframe)
    if for_tensorflow:
        train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
        val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
        test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    else:
        train_ds = {"texts": X_train, "labels": y_train}
        val_ds = {"texts": X_val, "labels": y_val}
        test_ds = {"texts": X_test, "labels": y_test}
    return train_ds, val_ds, test_ds


### Loading the data

We now load the data needed for the binary classification task in a model-friendly format, using some convenience functions defined in the cell above. As we have seen, the ECHR dataset comes with a predefined train-validation-test split.

In [None]:
train_ds, val_ds, test_ds = load_ECHR_dataset_for_binary_judgement_classification(data, for_tensorflow=True)

# Print 3 examples from the dataset
for example, label in train_ds.take(3):
  print("Input: ", example)
  print(10*".")
  print('Target labels: ', label)
  print(50*"-")


### Fit and evaluate

The following piece of code defines a function that trains (fits) a model on the training data and evaluates it on the development set. It then returns the training history, a dictionary that contains the training and validation accuracy obtained by the model at each training epoch.

Please take some time to read this code and to understand all of its components.

In [None]:
def fit_and_eval_binary_classifier(
    train_ds,
    val_ds,
    model,
    learning_rate,
    buffer_size,
    batch_size,
    n_epochs,
    patience_n_epochs=5
    ):

    # compile
    model.compile(
        loss='binary_crossentropy',
        metrics=['accuracy'],
        optimizer=Adam(learning_rate=learning_rate)
    )

    # preliminaries
    tf.random.set_seed(42)
    np.random.seed(42)
    tf.config.run_functions_eagerly(True)

    # train
    history = model.fit(
        train_ds.shuffle(buffer_size=buffer_size).batch(batch_size),
        validation_data=val_ds.batch(batch_size),
        epochs=n_epochs,
        verbose=1,
        callbacks=[EarlyStopping(
            monitor='val_accuracy', patience=patience_n_epochs, verbose=False, restore_best_weights=True
        )]
    )

    return history.history
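
The `EarlyStopping` callback used above stops training once the monitored metric has not improved for `patience_n_epochs` epochs in a row. The plain-Python sketch below illustrates that stopping rule on made-up validation accuracies. It is a simplified illustration, not the Keras implementation, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a metric to maximise.

    Simplified illustration: training stops once `patience` consecutive
    epochs have passed without a new best score.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran through all epochs without triggering the stopping criterion
    return len(val_scores), best_epoch

# Made-up validation accuracies for six epochs
val_acc = [0.70, 0.74, 0.73, 0.72, 0.75, 0.71]
print(early_stopping_epoch(val_acc, patience=2))
print(early_stopping_epoch(val_acc, patience=5))
```

With a small patience, training halts early and may miss a later improvement; with a large one, all epochs are completed. Since `restore_best_weights=True` is set above, the model ends up with the weights from its best epoch in either case.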

Now that the data is loaded and the training and evaluation procedure is in place, we can move to modelling.

### Creating Bag-of-Words text representations

We will use [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) to obtain bag-of-words representations of texts.

These representations will depend on two main parameters:
* the vocabulary size `VOCAB_SIZE`, which limits the number of words considered to the `VOCAB_SIZE` most frequent ones
* the type of bag-of-words representation: based on raw word counts (`count`) or on word counts weighted by inverse document frequency (`tf_idf`)
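
To make the difference between the two representations concrete, here is a small sketch in plain Python on a made-up three-document corpus. It deliberately does not use `TextVectorization`. The inverse document frequency is computed with the textbook formula log(N / df), which is an assumption for illustration; library implementations typically use smoothed variants, so the exact numbers will differ.

```python
import math
from collections import Counter

# Tiny made-up corpus; each document is split on whitespace
docs = [
    "the court found a violation",
    "the court found no violation",
    "the applicant complained",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Count-based representation: one raw count per vocabulary word
count_vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Document frequency: in how many documents each word occurs
n_docs = len(tokenized)
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Textbook idf = log(N / df); an assumption for illustration only
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}
tfidf_vectors = [
    [count * idf[word] for count, word in zip(vec, vocab)]
    for vec in count_vectors
]

for word in ["the", "court", "applicant"]:
    print(word, doc_freq[word], round(idf[word], 3))
```

A word that occurs in every document, such as "the", gets an idf of zero and therefore vanishes from the tf-idf vectors, while rarer words are weighted up.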

***Write code to create count-based and tf-idf text representations.***

In [None]:
VOCAB_SIZE = 1000

# Create count-based features
# ----------------------------
encoder_bow_count = ... # fill in this line
... # fill in this line

# Create tf-idf features
# ----------------------------
encoder_bow_tfidf = ... # fill in this line
... # fill in this line


In [None]:
# Let's take a peek at the vocabulary
vocab = np.array(encoder_bow_count.get_vocabulary())
vocab[:30]

### Binary classifier 1
As a first model, we will implement a logistic regression classifier with count-based BOW representations.
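
As a reminder of what such a classifier computes, the sketch below spells out the logistic regression prediction for a single document in NumPy: a weighted sum of the bag-of-words counts plus a bias, passed through a sigmoid. The count vector and the parameter values are made up for illustration; in the Keras model, the weights are learned during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up count vector for one document over a 4-word vocabulary
x = np.array([2.0, 0.0, 1.0, 3.0])

# Made-up parameters; in practice these are learned during training
w = np.array([0.5, -1.0, 0.25, -0.1])
b = -0.2

# Predicted probability of the positive class (label 1: violation)
p_violation = sigmoid(w @ x + b)
predicted_label = int(p_violation >= 0.5)
print(round(float(p_violation), 3), predicted_label)
```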

***Write code to define the model architecture, the training objectives, and the evaluation metric.***

In [None]:
# Define main hyperparameters
# --------------------------------------------------------------
LEARNING_RATE = 0.005
N_EPOCHS = 20
BUFFER_SIZE = 10000
BATCH_SIZE = 50


# Define model architecture
# --------------------------------------------------------------
binary_classifier_1 = Sequential(
    name = f'Logistic regression, count-based BOW, |V| = {VOCAB_SIZE}'
)
# binary_classifier_1.add(...)  # fill in this line
# binary_classifier_1.add(...)  # fill in this line
# binary_classifier_1.add(...)  # fill in this line


# Define training objective, evaluation metric, and optimizer
# --------------------------------------------------------------
binary_classifier_1.compile(
    loss='...', # fill in this line
    metrics=['...'], # fill in this line
    optimizer=Adam(learning_rate=LEARNING_RATE)
)
print(binary_classifier_1.summary())


Let's fit and evaluate this first classifier. This runs quickly, but if you want, you can skip the next two cells and directly load a pre-trained model with its training history.

In [None]:
# fit_and_eval_binary_classifier returns the training history: a dictionary
# with the loss and accuracy on the training and validation sets for each epoch
history_classifier_1 = fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_1,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS,
    patience_n_epochs=N_EPOCHS
)

In [None]:
# create an output directory
!mkdir -p models_BoW

# save classifier
binary_classifier_1.save('models_BoW/logistic_regression_count_based_BOW_V_1000.keras')

# save training history
np.save('models_BoW/logistic_regression_count_based_BOW_V_1000.history.npy', history_classifier_1)


Here, you can load a pre-trained classifier and its training history. The code below assumes you have uploaded `models_BoW.zip` onto the (Colab) file system.

In [None]:
# !unzip models_BoW.zip

# binary_classifier_1 = load_model('models_BoW/logistic_regression_count_based_BOW_V_1000.keras')
# history_classifier_1 = np.load('models_BoW/logistic_regression_count_based_BOW_V_1000.history.npy', allow_pickle=True).item()

In [None]:
train_acc_model_1 = history_classifier_1['accuracy']
val_acc_model_1 = history_classifier_1['val_accuracy']

These are its training and validation accuracy over epochs. (Note that the model stops training after `patience_n_epochs` consecutive epochs in which its validation accuracy doesn't improve. We set this value equal to the number of epochs, so the model completes them all. However, you can set this parameter to a lower value for more efficient training. This is what you'd likely do in practice.)

In [None]:
plt.plot(
    range(1, len(train_acc_model_1) + 1),  # the epochs for the x-axis
    train_acc_model_1,  # the training accuracy
    'b:',  # for dotted blue line
    label='Logreg count-based BOW, Training acc'
)
plt.plot(
    range(1, len(val_acc_model_1) + 1),  # the epochs for the x-axis
    val_acc_model_1,  # the validation accuracy
    'b',  # for solid blue line
    label='Logreg count-based BOW, Validation acc'
)
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

### Binary classifier 2
Next is a logistic regression classifier with tfidf-based BOW representations.


***Write code to define the model architecture, the training objectives, and the evaluation metric.***

In [None]:
# Define main hyperparameters
# --------------------------------------------------------------
LEARNING_RATE = 0.005
N_EPOCHS = 20
BUFFER_SIZE = 10000
BATCH_SIZE = 50


# Define model architecture
# --------------------------------------------------------------
binary_classifier_2 = Sequential(
    name = f'Logistic regression, tfidf-based BOW, |V| = {VOCAB_SIZE}'
)
binary_classifier_2.add(...)  # fill in this line
binary_classifier_2.add(...)  # fill in this line
binary_classifier_2.add(...)  # fill in this line


# Define training objective, evaluation metric, and optimizer
# --------------------------------------------------------------
binary_classifier_2.compile(
    loss='...',  # fill in this line
    metrics=['...'],  # fill in this line
    optimizer=Adam(learning_rate=LEARNING_RATE)
)
print(binary_classifier_2.summary())


Fit and evaluate, then plot accuracy.

In [None]:
history_classifier_2 =  fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_2,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS,
    patience_n_epochs=N_EPOCHS
)

In [None]:
# save classifier
binary_classifier_2.save('models_BoW/logistic_regression_tfidf_based_BOW_V_1000.keras')

# save training history
np.save('models_BoW/logistic_regression_tfidf_based_BOW_V_1000.history.npy', history_classifier_2)


Here, you can load a pre-trained classifier and its training history.

In [None]:
# binary_classifier_2 = load_model('models_BoW/logistic_regression_tfidf_based_BOW_V_1000.keras')
# history_classifier_2 = np.load('models_BoW/logistic_regression_tfidf_based_BOW_V_1000.history.npy', allow_pickle=True).item()


In [None]:
train_acc_model_2 = history_classifier_2['accuracy']
val_acc_model_2 = history_classifier_2['val_accuracy']

In [None]:
plt.plot(
    range(1, len(train_acc_model_2) + 1),
    train_acc_model_2,
    'g:',
    label='Logreg tfidf-based BOW, Training acc'
)
plt.plot(
    range(1, len(val_acc_model_2) + 1),
    val_acc_model_2,
    'g',
    label='Logreg tfidf-based BOW, Validation acc'
)
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

### Model comparison

To compare the two models visually, ***plot the training and validation accuracy of the two bag-of-words models.***

In [None]:
...  # fill in this line
...  # fill in this line
...  # fill in this line
...  # fill in this line
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

***Briefly describe the results.***

*Enter your response here (two or three sentences should suffice).*

### Analysis
To understand the models' performance beyond the evaluation scores, it is useful to carry out what could be called an *interpretability analysis*.

We interpret what the model has learned by analysing its weights.

***Write code to extract the weights from the two classifiers above and to obtain the vocabulary entries with the highest weights.***

In [None]:
vocab1 = np.array(encoder_bow_count.get_vocabulary())
vocab2 = np.array(encoder_bow_tfidf.get_vocabulary())

# Extract the classifier weights
classifier_1_vocab_weights = ...  # fill in this line
classifier_2_vocab_weights = ...  # fill in this line

# Sort the weights and get the correspondingly sorted vocabulary indices
classifier_1_vocab_weights_sorted = ...  # fill in this line
classifier_2_vocab_weights_sorted = ...  # fill in this line

# The indices with the largest values indicate which words are most indicative of violations
print("Words predictive of violations")
print("Model 1:\n", vocab1[classifier_1_vocab_weights_sorted[-10:]])
print()
print("Model 2:\n", vocab2[classifier_2_vocab_weights_sorted[-10:]])

# ... and vice versa
print("\n\nWords predictive of absolution")
print("Model 1:\n", vocab1[classifier_1_vocab_weights_sorted[:10]])
print("Model 2:\n", vocab2[classifier_2_vocab_weights_sorted[:10]])

Do the words with the highest weights correspond to sensible violation or absolution cues?
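
The ranking step used in the cell above can be illustrated on a toy example. The vocabulary and the weights below are made up and merely stand in for those of a trained classifier; `np.argsort` returns the indices that would sort the weights in ascending order, so the last indices point to the most positive weights and the first ones to the most negative.

```python
import numpy as np

# Made-up vocabulary and per-word weights, standing in for a trained model
toy_vocab = np.array(["[UNK]", "detention", "dismissed", "police", "inadmissible"])
toy_weights = np.array([0.02, 0.91, -0.40, 0.55, -0.87])

# argsort returns the indices that sort the weights in ascending order
order = np.argsort(toy_weights)

print("Most positive weights:", toy_vocab[order[-2:]])
print("Most negative weights:", toy_vocab[order[:2]])
```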

## Binary Judgement Prediction with LSTMs

As a next model class, we will consider recurrent neural models — in particular, LSTMs. As you have learned, these models are able to take into account the order of words in sentences, which is in principle a big advantage over bag-of-words models. "The woman sued Switzerland" is not the same as "Switzerland sued the woman"!
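
The limitation of bag-of-words models can be checked directly. In the minimal sketch below, a hypothetical `bag_of_words` helper lowercases a sentence, splits it on whitespace, and counts the words. The two example sentences end up with exactly the same representation.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())

a = bag_of_words("The woman sued Switzerland")
b = bag_of_words("Switzerland sued the woman")

# Same multiset of words, so a bag-of-words model sees identical inputs
print(a == b)
```

A sequence model, in contrast, receives the words one after another and can therefore distinguish the two sentences.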

### BiLSTM with embeddings trained from scratch

First, we'll design a simple one-layer bidirectional LSTM with word embeddings learned from scratch.

***Write code to create word embeddings for the vocabulary of this dataset.***

First, define the right encoder.

In [None]:
EMBEDDING_DIM = 50
VOCAB_SIZE = 1000

encoder_embed = ... # fill in this line
... # fill in this line

# print the vocabulary id of the word "human"
encoder_embed("human").numpy()

Then, create the embedding matrix.

In [None]:
embedding_layer = Embedding(
    input_dim=...,  # fill in this line
    output_dim=...,  # fill in this line
    embeddings_initializer="uniform",
    trainable=True,
)

***Write code to define the model architecture***. Remember, this should include an input layer, encoder and embedding layers, a bidirectional LSTM layer and an output layer. Keep the dimensionality of the LSTM layer low (for example, 16).

In [None]:
binary_classifier_3 = Sequential(
    name=f"1-layer BiLSTM classifier (embeddings from scratch)"
)
binary_classifier_3.add(...)  # fill in this line
binary_classifier_3.add(...)  # fill in this line
binary_classifier_3.add(...)  # fill in this line
binary_classifier_3.add(...)  # fill in this line
binary_classifier_3.add(...)  # fill in this line

binary_classifier_3.summary()

Fit and evaluate. Note: the LSTM takes longer to train than the logistic regression. We set the patience parameter to 5 to avoid redundant epochs. You can also skip the next two cells and load pre-trained model weights directly.

In [None]:
LEARNING_RATE = 0.005
BATCH_SIZE = 50
BUFFER_SIZE = 10000
N_EPOCHS = 20

history_classifier_3 =  fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_3,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS,
    patience_n_epochs=5
)

In [None]:
# create an output directory
!mkdir -p models_LSTM

# save classifier
binary_classifier_3.save('models_LSTM/1_layer_BiLSTM_embeds_from_scratch.keras')

# save training history
np.save('models_LSTM/1_layer_BiLSTM_embeds_from_scratch.history.npy', history_classifier_3)


Here, you can load the pre-trained model. Please upload `models_LSTM.zip` first.

In [None]:
# !unzip models_LSTM.zip

# binary_classifier_3 = load_model('models_LSTM/1_layer_BiLSTM_embeds_from_scratch.keras')
# history_classifier_3 = np.load('models_LSTM/1_layer_BiLSTM_embeds_from_scratch.history.npy', allow_pickle=True).item()

In [None]:
train_acc_model_3 = history_classifier_3['accuracy']
val_acc_model_3 = history_classifier_3['val_accuracy']

### Deeper network
Next, let's try a deeper two-layer LSTM network. Word embeddings will still be learned from scratch.


***Define the full two-layer Bidirectional LSTM in the cell below.***  This is identical to the one-layer model, but with an extra Bidirectional LSTM layer. Again, keep the dimensionality of the LSTM layers low.

In [None]:
encoder_embed = ... # fill in this line
... # fill in this line


embedding_layer = Embedding(
    input_dim=...,  # fill in this line
    output_dim=...,  # fill in this line
    embeddings_initializer="uniform",
    trainable=True,
)

binary_classifier_4 = Sequential(
    name=f"2-layer BiLSTM classifier (embeddings from scratch)"
)
binary_classifier_4.add(...)  # fill in this line
binary_classifier_4.add(...)  # fill in this line
binary_classifier_4.add(...)  # fill in this line
binary_classifier_4.add(...)  # fill in this line
binary_classifier_4.add(...)  # fill in this line
binary_classifier_4.add(...)  # fill in this line

binary_classifier_4.summary()

Fit and evaluate.

In [None]:
LEARNING_RATE = 0.005
BATCH_SIZE = 50
BUFFER_SIZE = 10000
N_EPOCHS = 20

history_classifier_4 =  fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_4,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS,
    patience_n_epochs=5
)


In [None]:
# save classifier
binary_classifier_4.save('models_LSTM/2_layer_BiLSTM_embeds_from_scratch.keras')

# save training history
np.save('models_LSTM/2_layer_BiLSTM_embeds_from_scratch.history.npy', history_classifier_4)


Here, you can load the pre-trained model.

In [None]:
# binary_classifier_4 = load_model('models_LSTM/2_layer_BiLSTM_embeds_from_scratch.keras')
# history_classifier_4 = np.load('models_LSTM/2_layer_BiLSTM_embeds_from_scratch.history.npy', allow_pickle=True).item()


In [None]:
train_acc_model_4 = history_classifier_4['accuracy']
val_acc_model_4 = history_classifier_4['val_accuracy']

To compare the two bidirectional LSTMs, ***plot the training and validation accuracy of the two models.***

In [None]:
...  # fill in this line
...  # fill in this line
...  # fill in this line
...  # fill in this line
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

### Pre-trained word embeddings

The dataset at hand is very domain-specific and not particularly large, so it is unlikely that the model will be able to learn the general meaning of words. Luckily, the network can be initialised with pre-trained word embeddings, which were trained on generalist corpora to capture the meaning of all words in the vocabulary. We will download pre-trained GloVe embeddings of dimensionality 50, trained on a corpus of 6 billion tokens.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

In [None]:
# Load pre-trained GloVe embeddings
# ----------------------------------
glove_file_path = '/content/glove.6B.50d.txt'
EMBEDDING_DIM = 50

embeddings_index = {}
with open(glove_file_path, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))


hits = 0
misses = 0

# Prepare embedding matrix
# Rows for words not found in the embedding index stay all-zeros.
# This includes the representation for "padding" and "OOV".
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for i, word in enumerate(encoder_embed.get_vocabulary()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
        print(word)  # list the words that have no pre-trained vector

print("Converted %d words (%d misses)" % (hits, misses))
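
A quick way to sanity-check loaded word vectors is to compare them with cosine similarity: related words should score higher than unrelated ones. The sketch below uses a tiny, made-up dictionary (`toy_embeddings`) in place of `embeddings_index`, and the helper names are hypothetical.

```python
import numpy as np

# Toy stand-in for `embeddings_index`; the real GloVe vectors used here have 50 dimensions
toy_embeddings = {
    "court": np.array([0.9, 0.1, 0.0]),
    "judge": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    query = embeddings[word]
    candidates = [w for w in embeddings if w != word]
    return max(candidates, key=lambda w: cosine_similarity(query, embeddings[w]))


print(most_similar("court", toy_embeddings))
```

With the real data, passing `embeddings_index` instead of the toy dictionary runs the same check over the full GloVe vocabulary, although a linear scan over all its words is slow.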


#### Frozen embeddings

Here, we are going to leave the word embeddings "frozen". That is, they will not be updated throughout the training of the LSTM. In this way, the embeddings will remain general representations of word meaning while the rest of the network will specialise for the legal judgement prediction task.
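
To make the idea of freezing concrete, here is a conceptual NumPy sketch of a single gradient-descent step on an embedding matrix. It is not how Keras implements training internally; the matrix, the gradient and the `sgd_step` helper are all made up for illustration.

```python
import numpy as np


def sgd_step(embeddings, gradient, learning_rate, trainable):
    """Apply one gradient-descent step; frozen embeddings are returned unchanged."""
    if not trainable:
        return embeddings.copy()
    return embeddings - learning_rate * gradient


# Toy pre-trained embedding matrix (3 words, 2 dimensions) and a made-up gradient
pretrained = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])
gradient = np.array([[0.2, 0.0], [0.0, -0.1], [0.3, 0.3]])

frozen_after = sgd_step(pretrained, gradient, learning_rate=0.1, trainable=False)
adaptive_after = sgd_step(pretrained, gradient, learning_rate=0.1, trainable=True)

# Frozen embeddings stay identical to their initialisation; adaptive ones move away from it
print(np.array_equal(frozen_after, pretrained))
print(np.array_equal(adaptive_after, pretrained))
```

In the frozen case only the remaining layers of the network receive updates, which is what lets the embeddings keep their general-purpose meaning.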

***Define a Bidirectional LSTM with frozen, pre-trained word embeddings.*** You can make the LSTM one-layer for faster training.

In [None]:
pretrained_embedding_layer_frozen = Embedding(
    input_dim=...,  # fill in this line
    output_dim=...,  # fill in this line
    embeddings_initializer=Constant(embedding_matrix),
    trainable=...,  # fill in this line
)

binary_classifier_5 = Sequential(
    name=f"1-layer BiLSTM classifier (frozen pre-trained embeddings)"
)
# Fill in the following lines to build the LSTM
binary_classifier_5.add(...)
binary_classifier_5.add(...)
binary_classifier_5.add(...)
binary_classifier_5.add(...)
binary_classifier_5.add(...)

binary_classifier_5.summary()

Fit and evaluate. Alternatively, skip the next two cells and load the pre-trained model weights.

In [None]:
LEARNING_RATE = 0.005
BATCH_SIZE = 50
BUFFER_SIZE = 10000
N_EPOCHS = 20

history_classifier_5 =  fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_5,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS,
    patience_n_epochs=5
)


In [None]:
# save classifier
binary_classifier_5.save('models_LSTM/1_layer_BiLSTM_embeds_pretrained_frozen.keras')

# save training history
np.save('models_LSTM/1_layer_BiLSTM_embeds_pretrained_frozen.history.npy', history_classifier_5)

In [None]:
# binary_classifier_5 = load_model('models_LSTM/1_layer_BiLSTM_embeds_pretrained_frozen.keras')
# history_classifier_5 = np.load('models_LSTM/1_layer_BiLSTM_embeds_pretrained_frozen.history.npy', allow_pickle=True).item()


In [None]:
train_acc_model_5 = history_classifier_5['accuracy']
val_acc_model_5 = history_classifier_5['val_accuracy']

#### Adaptive embeddings

Now let's unfreeze the embeddings and allow them to be updated throughout training.

***Define a Bidirectional LSTM with adaptive embeddings.***

In [None]:
pretrained_embedding_layer_adaptive = Embedding(
    input_dim=...,  # fill in this line
    output_dim=...,  # fill in this line
    embeddings_initializer=Constant(embedding_matrix),
    trainable=...,  # fill in this line
)

binary_classifier_6 = Sequential(
    name="1-layer BiLSTM classifier (adaptive pre-trained embeddings)"
)
# Fill in the following lines to build the LSTM
binary_classifier_6.add(...)
binary_classifier_6.add(...)
binary_classifier_6.add(...)
binary_classifier_6.add(...)
binary_classifier_6.add(...)

binary_classifier_6.summary()

In [None]:
LEARNING_RATE = 0.005
BATCH_SIZE = 50
BUFFER_SIZE = 10000
N_EPOCHS = 20

history_classifier_6 =  fit_and_eval_binary_classifier(
    train_ds=train_ds,
    val_ds=val_ds,
    model=binary_classifier_6,
    learning_rate=LEARNING_RATE,
    buffer_size=BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS
)


In [None]:
# save classifier
binary_classifier_6.save('models_LSTM/1_layer_BiLSTM_embeds_pretrained_adaptive.keras')

# save training history
np.save('models_LSTM/1_layer_BiLSTM_embeds_pretrained_adaptive.history.npy', history_classifier_6)


In [None]:
# binary_classifier_6 = load_model('models_LSTM/1_layer_BiLSTM_embeds_pretrained_adaptive.keras')
# history_classifier_6 = np.load('models_LSTM/1_layer_BiLSTM_embeds_pretrained_adaptive.history.npy', allow_pickle=True).item()


In [None]:
train_acc_model_6 = history_classifier_6['accuracy']
val_acc_model_6 = history_classifier_6['val_accuracy']

***Plot the training and validation accuracy of the four LSTM models.***

In [None]:
... # fill in this code block

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

## Binary Judgement Prediction with Transformer language models

The last class of models we'll experiment with is Transformer language models. We will *not* train a model from scratch on this dataset because Transformer language models are typically very large networks, with millions of parameters, which would likely overfit to the dataset at hand. Instead, we will use a pre-trained language model, an autoregressive Transformer optimised to predict the next word in texts of many different domains.

We suggest you use [GPT-neo-125m](https://huggingface.co/EleutherAI/gpt-neo-125m), a model designed to replicate the architecture of OpenAI's GPT-3 in its smallest version (125 million parameters). Feel free to substitute this with another pretrained autoregressive language model from the Hugging Face [model hub](https://huggingface.co/models?sort=trending) but beware of model size.

If you are running this notebook on Google Colab, *change the runtime type to `T4-GPU` using the dropdown menu on the top right.* After that, you might need to reload the data and the convenience functions defined above.


First, let's install and load the necessary Python libraries. If you run this notebook in the ETH Jupyter hub, you can load the libraries directly. If not, please install them using the cell below and restart the runtime session.

In [None]:
!pip install transformers sacremoses accelerate -U

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from sklearn.metrics import accuracy_score, PrecisionRecallDisplay
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, pipeline
from torch.utils.data import Dataset, DataLoader
import torch


In [None]:
# @title Convenience functions for Huggingface transformers
# @markdown You don't need to read the code in this cell, but please make sure you execute it.

def load_classification_model_and_tokenizer(model_name_or_path):
    lm = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)

    # Load the tokenizer suitable for this model
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    if lm.config.pad_token_id is None:
        lm.config.pad_token_id = lm.config.eos_token_id
        tokenizer.pad_token = tokenizer.eos_token

    return lm, tokenizer

class EHRCDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(self.texts[idx],
                                  truncation=True,
                                  padding='max_length',
                                  max_length=self.max_length,
                                  return_attention_mask=True,
                                  return_tensors='pt')

        item = {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

        return item


def finetune_lm(model, train_dataset, val_dataset, n_epochs, batch_size, learning_rate, output_dir):
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=n_epochs,
        logging_dir="./logs",
        load_best_model_at_end=True,
        save_strategy="epoch",
        evaluation_strategy="epoch",  # recent transformers releases rename this argument to `eval_strategy`
        save_total_limit=1,
        learning_rate=learning_rate
    )

    # Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=lambda p: {"accuracy": accuracy_score(p.predictions.argmax(-1), p.label_ids)},
    )

    # Train the model
    trainer.train()

    return model, trainer


Load the data in model-friendly format using the convenience functions above.

In [None]:
# Load the data using our convenience functions
train_ds, val_ds, test_ds = load_ECHR_dataset_for_binary_judgement_classification(data)

# Load the model and the tokenizer suitable for it
MODEL_NAME = "EleutherAI/gpt-neo-125m"
lm, tokenizer = load_classification_model_and_tokenizer(MODEL_NAME)

# Create dataset and data loaders for training and validation
train_dataset = EHRCDataset(train_ds['texts'], train_ds['labels'], tokenizer, max_length=2048)
val_dataset = EHRCDataset(val_ds['texts'], val_ds['labels'], tokenizer, max_length=2048)
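
For intuition about what the dataset class hands to the model, the sketch below mimics truncation and `max_length` padding for a single sequence in plain Python. The token ids, the padding id of 0 and the choice to pad on the right are illustrative assumptions; the real tokenizer handles all of this internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation plus 'max_length' padding for a single sequence."""
    kept = token_ids[:max_length]                     # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad               # padding on the right (assumed)
    attention_mask = [1] * len(kept) + [0] * n_pad    # 0 marks padding positions
    return input_ids, attention_mask


# Hypothetical token ids; 0 is used as the padding id purely for illustration
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5, pad_id=0)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example therefore reaches the model with the same length, and the attention mask tells it which positions carry real tokens.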

### Zero-shot classification and prompting

Note that this model is pre-trained on the general language modelling task (predicting the next word in a text) and not on the legal judgement prediction task. This is different from the setup you have seen in the tutorial on pre-trained Transformers. The type of classification we will perform with this model is typically referred to as *zero-shot classification*, meaning that the model is asked to classify without seeing *any* examples from the dataset.

In [None]:
zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device="cuda:0"
)

Instead of using a `text-classification` pipeline, we are using a `zero-shot-classification` pipeline. These two are almost equivalent except that `zero-shot-classification` doesn't require a hardcoded number of potential classes. They can be chosen at runtime:

In [None]:
candidate_labels = ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}


Why should this work? The language model is essentially asked if "innocent" is more or less likely to follow the court case text than "guilty".
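
The comparison described above can be sketched as a toy decision rule: score each candidate label, normalise the scores, and pick the most probable one. The log-probabilities below are made up, `pick_label` is a hypothetical helper, and the pipeline's internal scoring may differ from this simplification.

```python
import math


def pick_label(label_log_probs):
    """Turn per-label log-probabilities into normalised probabilities and pick the best."""
    max_lp = max(label_log_probs.values())
    # Subtract the maximum before exponentiating for numerical stability (softmax)
    exp_scores = {label: math.exp(lp - max_lp) for label, lp in label_log_probs.items()}
    total = sum(exp_scores.values())
    probs = {label: score / total for label, score in exp_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Made-up scores: how plausible each label is as a continuation of the case text
toy_scores = {"innocent": -4.2, "guilty": -3.1}
best_label, label_probs = pick_label(toy_scores)
print(best_label, label_probs)
```

Only the relative order of the scores matters for the prediction; the normalisation just makes them easier to read as probabilities.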

But does it work in practice?

In [None]:
predictions_binary_classifier_7 = []

for text in tqdm(val_ds["texts"]):

    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    predictions_binary_classifier_7.append(prediction)

# Calculate the accuracy
acc_classifier7 = accuracy_score(val_ds["labels"], predictions_binary_classifier_7)
print("\nAccuracy:", acc_classifier7)

To further steer the model towards giving sensible answers, it is good practice to prepend or append a templated string to the input example. In this case, we could for instance use the template "The party being sued in this court case is", which makes the model much less surprised to see "innocent" or "guilty" as continuations and gives the model a context to interpret those continuations as we would like it to. This technique is referred to as *prompting*.
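
As a small illustration of how such a template is used, each candidate label is substituted into the `{}` slot, producing one statement per label for the model to score. The variable names in this sketch are illustrative.

```python
template = "The party being sued in this court case is {}"
toy_labels = ["innocent", "guilty"]

# One filled-in statement per candidate label
statements = [template.format(label) for label in toy_labels]

for statement in statements:
    print(statement)
```

A template without a `{}` placeholder would produce the same statement for every label, so the placeholder is essential.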


In [None]:
prompt = "The party being sued in this court case is {}"
candidate_labels = ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

Does this work better?

In [None]:
predictions_binary_classifier_8 = []

for text in tqdm(val_ds["texts"]):

    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels,
        hypothesis_template=prompt  # here we prompt the model with our template
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    predictions_binary_classifier_8.append(prediction)

# Calculate the accuracy
acc_classifier8 = accuracy_score(val_ds["labels"], predictions_binary_classifier_8)
print("\nAccuracy:", acc_classifier8)


***Try at least one more combination of prompt and labels and test the corresponding zero-shot classifier.***

In [None]:
# prompt = "..."  # fill in a prompt
# candidate_labels = ["...", "..."]  # fill in potential labels
prompt = "Is this a case of 'violation' of human rights or a case of 'absolution'? It is a case of {}"
candidate_labels = ["absolution", "violation"]  # index 0 = no violation, index 1 = violation, as in ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

predictions_binary_classifier_9 = []

for text in tqdm(val_ds["texts"]):
  ... # forward pass

  ... # get the model prediction


... # calculate the accuracy

### Fine-tuning

Finally, we fine-tune the pre-trained language model on the binary prediction task. By showing it examples of court cases and supervised labels, we obtain a Transformer model specialised for the judgement prediction task. Note that this might result in the model forgetting previous knowledge and performing worse on other tasks, including next-word prediction.

Let's launch the fine-tuning and save the fine-tuned model checkpoint.

In [None]:
N_EPOCHS = 5
BATCH_SIZE = 3
LEARNING_RATE = 1e-5
OUTPUT_DIR = "/content/lm_for_classification_5ep"

lm_finetuned, lm_trainer = finetune_lm(lm, train_dataset, val_dataset, N_EPOCHS, BATCH_SIZE, LEARNING_RATE, OUTPUT_DIR)

# Save the fine-tuned model
lm_finetuned.save_pretrained(OUTPUT_DIR)

Now we obtain predictions from the model and evaluate its accuracy.

In [None]:
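# List to store predicted labels
predictions_binary_classifier_10 = []

# Run inference on the same device as the fine-tuned model, in evaluation mode
device = next(lm_finetuned.parameters()).device
lm_finetuned.eval()

# Tokenize and predict labels for each example in the dataset
for text in val_ds['texts']:

    # Tokenize input text, truncating to the length used during fine-tuning
    tokenized_input = tokenizer(text, truncation=True, max_length=2048, return_tensors='pt').to(device)

    # Forward pass (no gradients are needed for prediction)
    with torch.no_grad():
        output = lm_finetuned(**tokenized_input)

    # Get predicted label
    predicted_label = torch.argmax(output.logits, dim=1).item()

    # Store predicted label in the list
    predictions_binary_classifier_10.append(predicted_label)

# Calculate the accuracy
acc_classifier10 = accuracy_score(val_ds["labels"], predictions_binary_classifier_10)
print(acc_classifier10)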

You can also load the fine-tuned model weights.

In [None]:
# !unzip models_Transformer.zip

In [None]:
# lm_finetuned, tokenizer = load_classification_model_and_tokenizer("models_Transformer/checkpoint-2204")

In [None]:
# device = 'cuda' if torch.cuda.is_available() else "cpu"
# lm_finetuned.to(device)

# # List to store predicted labels
# predictions_binary_classifier_10 = []

# # Tokenize and predict labels for each example in the dataset
# for text in val_ds['texts']:

#     # Tokenize input text
#     tokenized_input = tokenizer(text, return_tensors='pt').to(device)

#     # Forward pass
#     output = lm_finetuned(**tokenized_input)

#     # Get predicted label
#     predicted_label = torch.argmax(output.logits, dim=1).item()

#     # Store predicted label in the list
#     predictions_binary_classifier_10.append(predicted_label)

# # Calculate the accuracy
# acc_classifier10 = accuracy_score(val_ds["labels"], predictions_binary_classifier_10)
# print(acc_classifier10)

## Evaluate on the test set

You have compared at least 10 different classifiers so far. ***Now evaluate the best 3 on the test set and report their accuracy.***

In [None]:
# Load test set

_, _, test_set = load_ECHR_dataset_for_binary_judgement_classification(data)

test_documents = test_set['texts']
test_labels = test_set['labels']


Example evaluation with logistic regression classifiers and LSTMs.

In [None]:
from sklearn.metrics import classification_report

# Make prediction for the test set sentences
predictions = binary_classifier_1.predict(
    test_documents
)

# Turn predicted probabilities into binary classification scores
binary_predictions = [1 if pred > 0.5 else 0 for pred in predictions]

# Evaluate model by comparing its prediction to the gold labels
report = classification_report(
    y_true=test_labels,
    y_pred=binary_predictions
)

print(report)
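
To make the thresholding step and the numbers in the report concrete, here is a small plain-Python sketch that computes accuracy, precision and recall by hand. The gold labels and probabilities are made up, and the helper names are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    """Map predicted probabilities to 0/1 labels."""
    return [1 if p > cutoff else 0 for p in probabilities]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up gold labels and predicted probabilities
toy_gold = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.4, 0.3, 0.8, 0.7, 0.1]

toy_pred = threshold(toy_probs)
toy_metrics = binary_metrics(toy_gold, toy_pred)
print(toy_pred, toy_metrics)
```

Accuracy alone can hide an imbalance between the two classes, which is why the report also lists precision and recall per class.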

Example evaluation with Transformers.

In [None]:
prompt = "Is this a case of 'violation' of human rights or a case of 'absolution'? It is a case of {}"
candidate_labels = ["absolution", "violation"]  # index 0 = no violation, index 1 = violation, as in ["innocent", "guilty"]
label2id = {label: i for i, label in enumerate(candidate_labels)}

zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model="EleutherAI/gpt-neo-125m",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Make predictions with Transformers
binary_predictions = []

for text in test_documents:
    # Forward pass of zero-shot classification
    result = zero_shot_classifier(
        text,
        candidate_labels,
        hypothesis_template=prompt
    )

    # Get the model prediction (labels ordered according to their probability)
    prediction = label2id[result["labels"][0]]
    binary_predictions.append(prediction)

# Evaluation report
report = classification_report(
    y_true=test_labels,
    y_pred=binary_predictions
)

print(report)


## [Optional] Case importance prediction

The main task of this project was binary legal judgement classification, but each court case in the ECHR dataset is also annotated with an importance score, a value from 1 to 4 that allows legal practitioners to identify pivotal cases.

> Note: Importance scores can be thought of as values on a continuous scale from 1 to 4, or they can be considered as four separate classes, each with its specific meaning. Depending on which interpretation we decide to go with, predicting importance scores can be cast as a:
*   *regression task*: predicting a continuous score from 1 to 4
*   *multi-class classification*: predicting a categorical label out of 4 options
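
The two framings lead to different target encodings. The sketch below shows one possible encoding for each, using made-up scores and hypothetical helper names; it is only a starting point for the open-ended task.

```python
def to_class_index(score):
    """Multi-class view: importance scores 1..4 become class indices 0..3."""
    return score - 1


def to_one_hot(score, n_classes=4):
    """One-hot target vector for a categorical loss."""
    return [1 if i == to_class_index(score) else 0 for i in range(n_classes)]


def regression_output_to_score(prediction):
    """Regression view: round a continuous prediction and clip it to the 1..4 range."""
    return min(4, max(1, round(prediction)))


toy_scores = [1, 4, 2]  # made-up importance scores
class_indices = [to_class_index(s) for s in toy_scores]
rounded_scores = [regression_output_to_score(p) for p in [0.2, 2.6, 4.9]]

print(class_indices)
print(to_one_hot(3))
print(rounded_scores)
```

The regression view keeps the ordering of the scores (predicting 3 for a true 4 is a smaller error than predicting 1), whereas the multi-class view treats all mistakes alike.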

**Your (optional and open-ended) task is now to train and compare multi-class classifiers that predict the importance score of a court case.**




In [None]:
# ...