# Legal judgement prediction

For this project, we will use the **ECHR dataset**, a collection of 11.5K court cases extracted from the public database of the European Court of Human Rights and further annotated by human experts (more info [here](https://www.aclweb.org/anthology/P19-1424/)). You will develop NLP models that, given the facts of a case, predict whether a human rights article or protocol has been violated. We call such problems *binary classification*.

We will start from simple logistic regression classifiers that use bag-of-words representations of a court case as features, then move to bidirectional LSTM classifiers with frozen and adaptive embeddings, and conclude with pre-trained and fine-tuned Transformer language models.

For those who want to go above and beyond, or simply exercise their NLP classification skills further, it is possible to work on a non-mandatory project extension. Here, you will build models that predict a court case's "importance score". This is a value from 1 to 4 that allows legal practitioners to identify pivotal cases. You will address this as a *multi-class classification* problem. But more on this later!

All of the binary classification tasks, which are mandatory, are based on notions and code that you have been exposed to through lectures and/or tutorials.

## Preliminary data analysis

Let's begin by loading the dataset. The ECHR dataset is open-source and can be downloaded from [this web page](https://archive.org/details/ECHR-ACL2019), but we are going to load a cleaner version of it, which has been pre-processed for this course.

Next, we import the `load_dataset` function from `datasets`, together with the `pandas` and `numpy` libraries.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
from datasets import load_dataset
import os

We load the data from the Hugging Face dataset hub and we store it in a pandas dataframe.

In [None]:
dataset = load_dataset("glnmario/ECHR")
full_data = pd.DataFrame(dataset['train'])

# Here, 'train' is just the default name for single-partition datasets.
# The actual training, development, and test set are defined in the
# first column of the dataframe ('partition').

**Exercise:** Display and inspect the first 5 rows of the dataset.

In [None]:
... # fill in this line

<details>
  <summary>Click to see the solution</summary>
  
  ```python
full_data.head()
  ```
</details>

As is common for datasets used in Machine Learning projects, the dataset is split into 3 partitions: training, development, and test set. The training and development sets contain cases from 1959 through 2013, and the test set from 2014 through 2018.
> Note: *It's good practice to never look at the test set during development, as the test set represents the data your Machine Learning system will have to deal with once deployed, which you can't observe at development time. Here, we will keep the test set at hand but you should avoid making any modelling decision based on its content or features. Furthermore, for data which covers a significant period of time (as we have it here), it's best to use the most recent portion of the data as test data, as this will be most similar to the real-world data for which we will use the system.*
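
The idea of holding out the most recent data can be illustrated with a minimal sketch. The toy cases, the `toy_cases` name, and the cutoff date below are made up for illustration; the ECHR dataset already ships with its own `partition` column, so nothing like this needs to be run in the project.

```python
import pandas as pd

# Toy data: hypothetical cases with judgement dates (not taken from the ECHR dataset)
toy_cases = pd.DataFrame({
    "itemid": ["a", "b", "c", "d", "e"],
    "date": pd.to_datetime(
        ["1999-05-01", "2008-11-20", "2013-12-31", "2014-01-15", "2017-06-30"]
    ),
})

# Hypothetical cutoff: everything from 2014 onwards is held out as test data
cutoff = pd.Timestamp("2014-01-01")
toy_cases["partition"] = "train"
toy_cases.loc[toy_cases["date"] >= cutoff, "partition"] = "test"

print(toy_cases)
```

Splitting by date rather than at random keeps information from the "future" out of the training data, which is closer to how a deployed system would be used.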

The sizes of the partitions, in terms of number of court cases, are the following:

In [None]:
length_train = len(full_data[full_data.partition == "train"])
length_dev = len(full_data[full_data.partition == "dev"])
length_test = len(full_data[full_data.partition == "test"])
length_total = len(full_data)

print("Training set     ", length_train, "(", round(length_train/length_total*100,2), "% )")
print("Development set  ", length_dev, "(", round(length_dev/length_total*100,2), "% )")
print("Test set         ", length_test, "(", round(length_test/length_total*100,2), "% )")
print("Total           ", length_total)
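
As an aside, the same counts and percentages can be obtained more compactly with `value_counts`. The sketch below uses a made-up `toy_partitions` series as a stand-in for `full_data.partition`, so the numbers it prints are not those of the ECHR dataset.

```python
import pandas as pd

# Toy stand-in for the `partition` column of `full_data` (values are made up)
toy_partitions = pd.Series(
    ["train"] * 7 + ["dev"] * 1 + ["test"] * 2, name="partition"
)

# Absolute counts and relative shares (in percent) per partition
counts = toy_partitions.value_counts()
shares = (toy_partitions.value_counts(normalize=True) * 100).round(2)

# Combine both into one small table, aligned on the partition label
summary = pd.DataFrame({"cases": counts, "percent": shares})
print(summary)
```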

Each instance in this dataset is a court case. Each court case is annotated with the following properties (the columns of the dataframe):

*   `partition`: a label indicating the dataset partition this court case belongs to ("train", "dev", or "test")
*   `itemid`: a code which uniquely identifies this court case
*   `languageisocode`: an [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) describing the language in which the case is reported
*   `respondent`: the ISO code of the party being sued or tried (respondents are nation states)
*   `branch`: the branch of the Court dealing with the case, indicating at which stage of the trial a judgement was made (it can be one out of "ADMISSIBILITY", "CHAMBER", "GRANDCHAMBER", "COMMITTEE")
*   `date`: the date of the judgement
*   `docname`: the title of the court case (for example, "ERIKSON v. ITALY")
*   `importance`: an "importance score" from 1 (key case) to 4 (unimportant), denoting a case's contribution to the development of case-law
*   `conclusion`: a short summary of the case conclusion (for example, "Inadmissible" or "Violation of Art. 6-1; No violation of Art. 10")
*   `judges`: the names of the judges
*   `text`: the facts brought to the attention of the Court
*   `binary_judgement`: a binary label indicating whether an article or protocol was (1) or wasn't (0) violated


In [None]:
full_data.columns
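
Before modelling, it can help to sanity-check a few columns, for instance by counting missing values and listing the distinct values of a categorical column. The following is only a sketch on a made-up frame called `toy` that borrows some of the column names above; whether the real dataset contains missing values is not something this example claims.

```python
import pandas as pd

# Toy frame reusing a few of the column names described above (values are made up)
toy = pd.DataFrame({
    "branch": ["CHAMBER", "COMMITTEE", "CHAMBER", "GRANDCHAMBER"],
    "importance": [4, 4, 2, 1],
    "binary_judgement": [1, 0, 1, 1],
    "judges": ["A. Judge", None, "B. Judge", "C. Judge"],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()

# Distinct values of a categorical column
branch_values = sorted(toy["branch"].unique())

print(missing_per_column)
print(branch_values)
```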

### Filter court cases based on length
We are now going to filter out from the dataset the court cases with the longest texts. We will do this for two reasons. First, this will speed up the experiments. Second, the Transformer model that we will use at the end of the project has, like most Transformers, a limited *window size*, which cannot fit more than 2048 tokens. This is the maximum sequence length that a Transformer can process at a time.

Therefore, we will set a threshold. We keep all texts whose length is less than or equal to this threshold and filter out the others.

***Set a threshold by inspecting how many data points it tosses out and how balanced the sizes of the different partitions are (see the next four code cells). The threshold should be smaller than 2048, but greater than or equal to 300.***
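
One possible way to compare candidate thresholds side by side is sketched below. It runs on a made-up frame (`toy`) with its own `text_length` column, and the list of candidates is a hypothetical choice; in the notebook you would apply the same idea to the real data once the text lengths have been computed.

```python
import pandas as pd

# Toy stand-in for the dataset: word counts and partition labels are made up
toy = pd.DataFrame({
    "partition": ["train", "train", "train", "dev", "test", "test"],
    "text_length": [250, 900, 3000, 500, 1200, 4000],
})

# Hypothetical candidate thresholds within the allowed range [300, 2048)
candidates = [300, 512, 1024, 2000]

rows = []
for candidate in candidates:
    # Keep only the texts that fit within the candidate threshold
    kept = toy[toy["text_length"] <= candidate]
    shares = kept["partition"].value_counts(normalize=True)
    rows.append({
        "threshold": candidate,
        "kept": len(kept),
        "kept_pct": round(len(kept) / len(toy) * 100, 2),
        "train_pct": round(shares.get("train", 0.0) * 100, 2),
        "dev_pct": round(shares.get("dev", 0.0) * 100, 2),
        "test_pct": round(shares.get("test", 0.0) * 100, 2),
    })

# One row per threshold: how much data survives and how it is distributed
sweep = pd.DataFrame(rows).set_index("threshold")
print(sweep)
```

A table like this makes the trade-off visible: lower thresholds discard more cases and may skew the partition shares.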

In [None]:
THRESHOLD = ... # fill this line

We chose `THRESHOLD = 1024` so that the texts we keep are long enough but not too long.

Let's look at basic text length statistics and how many court cases are left out when using a certain threshold.

First, we measure the length of every text in the dataset. We do this by splitting each text into words as indicated by whitespace characters, and then counting the number of resulting words.

In [None]:
# Extract text lengths using whitespaces as a simple criterion to separate words
text_lengths = []
for text in full_data.text:
  word_list = text.split()
  num_words = len(word_list)
  text_lengths.append(num_words)
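
The explicit loop above can also be written with pandas string methods. The sketch below checks, on two made-up sentences (`toy_texts`), that both approaches give the same word counts; it is an alternative formulation, not a required change.

```python
import pandas as pd

# Made-up example texts with irregular whitespace
toy_texts = pd.Series([
    "The applicant was born in 1970.",
    "He   lodged a complaint\nwith the court.",
])

# Loop-based count, as in the cell above
loop_lengths = [len(text.split()) for text in toy_texts]

# Vectorised count: split on whitespace, then take the length of each word list
vectorised_lengths = toy_texts.str.split().str.len().tolist()

print(loop_lengths, vectorised_lengths)
```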

Now we can plot the distribution of text lengths, marking the threshold with a vertical line.

In [None]:
import matplotlib.pyplot as plt

# Plot text lengths
plt.hist(text_lengths, bins=100, alpha=0.5)
plt.ylabel('Frequency')
plt.xlabel('Text length (number of words)')

# Add a vertical bar corresponding to the threshold
plt.axvline(THRESHOLD, color='k', linestyle='dashed', linewidth=1)

plt.show()

In [None]:
# Use <= so that the count matches the filter applied below
percentage_included = np.mean(np.array(text_lengths) <= THRESHOLD) * 100
print(f"Percentage of texts included: {round(percentage_included, 2)} %")

With `THRESHOLD = 1024`, we can include around 35.67 % of all texts.

As you can see, this leaves out quite a few court cases, but that is acceptable for the purposes of this project.
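
Alongside the histogram, a few summary statistics can make the length distribution easier to reason about. This is a small sketch on made-up word counts (`toy_lengths`) with a hypothetical threshold of 1024; the values have nothing to do with the real dataset.

```python
import numpy as np

# Made-up word counts standing in for `text_lengths`
toy_lengths = np.array([120, 450, 800, 1024, 1500, 2600, 5200, 9800])
toy_threshold = 1024  # hypothetical threshold

# Central tendency and upper tail of the length distribution
median_length = np.median(toy_lengths)
p90_length = np.percentile(toy_lengths, 90)

# Fraction of texts that fit within the threshold (inclusive)
share_kept = np.mean(toy_lengths <= toy_threshold)

print(median_length, p90_length, share_kept)
```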

In [None]:
# Add text length as an extra column to the dataset
full_data['text_length'] = text_lengths

# Calculate how many cases are discarded
n_left_out = sum(full_data.text_length > THRESHOLD)
print(f"Omitting {n_left_out} long cases.")

# Filter out court cases with a text length larger than the threshold
data = full_data[full_data.text_length <= THRESHOLD]

Let's also make sure the dataset is still reasonably balanced with respect to the training, development, and test partitions.

In [None]:
length_train_partitioned = len(data[data.partition == "train"])
length_dev_partitioned = len(data[data.partition == "dev"])
length_test_partitioned = len(data[data.partition == "test"])
length_total_partitioned = len(data)

print("Training set     ", length_train_partitioned, "(", round(length_train_partitioned/length_total_partitioned*100,2), "% )")
print("Development set   ", length_dev_partitioned, "(", round(length_dev_partitioned/length_total_partitioned*100,2), "% )")
print("Test set         ", length_test_partitioned, "(", round(length_test_partitioned/length_total_partitioned*100,2), "% )")
print("Total            ", len(data))
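
Another way to check the effect of filtering is to look at the fraction of each partition that survives it. The sketch below does this on a made-up frame (`toy_full`) with a hypothetical threshold; the names and numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for `full_data` with made-up text lengths
toy_full = pd.DataFrame({
    "partition": ["train"] * 4 + ["dev"] * 2 + ["test"] * 4,
    "text_length": [200, 700, 1500, 3000, 400, 2500, 900, 5000, 6000, 7000],
})
toy_threshold = 1024  # hypothetical threshold

# Same filter as in the notebook: keep texts up to and including the threshold
toy_filtered = toy_full[toy_full["text_length"] <= toy_threshold]

# Fraction of each partition that is kept after filtering
before = toy_full["partition"].value_counts()
after = toy_filtered["partition"].value_counts()
retention = (after / before).reindex(["train", "dev", "test"])

print(retention)
```

If one partition retains a much smaller fraction than the others, the filter is affecting it disproportionately.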

### Data visualization

Now that we have our final version of the dataset, let's visualise the distribution of some of the dataset properties (date, branch, respondent, etc.) to get a sense of the data. What time span does the dataset cover? How many cases make it to the Grand Chamber? Which countries have been sued most often?

**Exercise:** Fill in the code for the second plot.

In [None]:
import seaborn as sns

plt.figure(figsize=(15, 10))

# Plot number of instances per date
plt.subplot(3, 1, 1)
sns.countplot(x='date', data=data, palette='viridis')
plt.xticks(rotation=90)  # Rotate x-axis labels
plt.title('Number of Instances by Date')

# Plot number of instances per branch
plt.subplot(3, 1, 2)
... # fill in this line
plt.title('Number of Instances by Branch')

# Plot number of instances per top 10 respondents
plt.subplot(3, 1, 3)
top_respondents = data['respondent'].value_counts().nlargest(10)
sns.barplot(x=top_respondents.index, y=top_respondents.values, palette='colorblind')
plt.title('Number of Instances by Top 10 Respondents')

plt.tight_layout()
plt.show()

Let's now look at how many cases in this dataset actually resulted in violations of human rights articles or protocols. This is typically called the *class label distribution*. It will give us an idea of the dataset *class balance* (or *class imbalance*), an important property to look out for when making modelling and evaluation decisions.

In [None]:
plt.figure(figsize=(10, 7))

# Plot binary class label distribution per partition
sns.countplot(
    x='partition',
    hue='binary_judgement',
    data=data,
    palette='colorblind',
    order=['train', 'dev', 'test']
)

# Annotate plot
plt.legend(title='Judgement', labels=['0: no violation', '1: violation'])
plt.title('Distribution of Binary Judgement Labels for Each Dataset Partition')
plt.xlabel('Partition')
plt.ylabel('Number of Cases')
plt.show()
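
The plot can be complemented with the exact class proportions per partition. A minimal sketch using `pd.crosstab` on a made-up frame (`toy`) is shown below; the labels are invented and do not reflect the real class balance.

```python
import pandas as pd

# Toy stand-in for `data` with made-up labels
toy = pd.DataFrame({
    "partition": ["train", "train", "train", "train", "dev", "dev", "test", "test"],
    "binary_judgement": [1, 1, 1, 0, 1, 0, 1, 1],
})

# Share of each class within every partition (each row sums to 1)
class_shares = pd.crosstab(
    toy["partition"], toy["binary_judgement"], normalize="index"
)
class_shares = class_shares.reindex(["train", "dev", "test"])

print(class_shares.round(2))
```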

Finally, let's look at the class distribution of importance scores. Remember: importance scores range from 1 (key case) to 4 (unimportant).

**Exercise:** Write code that plots the class distribution per data partition.

In [None]:
# Plot importance score distribution per partition

# ... # write the code snippet that plots class distribution by partition

# Annotate plot
plt.legend(title='Importance')
plt.title('Distribution of Case Importance Score for Each Partition')
plt.xlabel('Partition')
plt.ylabel('Number of Cases')
plt.show()

Let's save the data for the next step.

In [None]:
data.to_csv("data.csv", index=False)
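
Since the case texts can contain commas, quotes, and line breaks, it may be worth confirming that a CSV round trip preserves them. The sketch below does this on a made-up frame (`toy`) written to a temporary file, so it does not touch `data.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with text that contains a line break, a comma, and quotes
toy = pd.DataFrame({
    "itemid": ["001-1", "001-2"],
    "text": ["First line.\nSecond line, with a comma.", 'A "quoted" fact.'],
    "binary_judgement": [1, 0],
})

# Write to a temporary CSV file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare shape and text content before and after the round trip
print(reloaded.shape == toy.shape)
print(reloaded["text"].tolist() == toy["text"].tolist())
```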