# Generate dataset used for finetuning

This notebook outlines the steps taken to generate the dataset used for finetuning. We generated the train and test datasets previously and have not yet used the train dataset in any experiment. Before finetuning, we want to ensure there is no data leakage between the train and test datasets at the opinion, cluster, and docket levels.

# Import libraries

In [1]:
import pandas as pd

# Load the datasets

In [2]:
train = pd.read_csv("outputs/3.train.csv")
test = pd.read_csv("outputs/3.test.csv")

# Check for possible leakage

In [3]:
# Number of train rows whose identifier also appears in test, one line per identifier
print(len(train[train["docket_id"].isin(test["docket_id"])]))
print(len(train[train["docket_number"].isin(test["docket_number"])]))
print(len(train[train["cluster_id"].isin(test["cluster_id"])]))
print(len(train[train["opinion_id"].isin(test["opinion_id"])]))

0
113
0
0


In [4]:
# Same check in the other direction: test rows whose identifier also appears in train
print(len(test[test["docket_id"].isin(train["docket_id"])]))
print(len(test[test["docket_number"].isin(train["docket_number"])]))
print(len(test[test["cluster_id"].isin(train["cluster_id"])]))
print(len(test[test["opinion_id"].isin(train["opinion_id"])]))

0
103
0
0
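
The two checks above can be folded into one helper that reports, per identifier column, how many rows of one frame have a value that also occurs in the other. This is a minimal sketch on made-up toy frames, not the notebook's data; the function name `overlap_counts` and the `KEYS` list are illustrative assumptions. It also drops missing values before matching, which the cells above do not do, because `isin` can otherwise count two missing docket numbers as a match.

```python
import pandas as pd

# Identifier columns checked in the cells above.
KEYS = ["docket_id", "docket_number", "cluster_id", "opinion_id"]


def overlap_counts(left: pd.DataFrame, right: pd.DataFrame, keys=KEYS) -> pd.Series:
    """Count rows of `left` whose key value also appears in `right`, per key column."""
    counts = {}
    for key in keys:
        # Assumption: missing values should not count as overlap, so drop them first.
        right_values = right[key].dropna().unique()
        counts[key] = int(left[key].isin(right_values).sum())
    return pd.Series(counts, name="overlapping_rows")


# Toy frames (made-up values) that share one docket number but no IDs.
toy_train = pd.DataFrame({
    "docket_id": [1, 2, 3],
    "docket_number": ["A-1", "A-2", "A-2"],
    "cluster_id": [10, 11, 12],
    "opinion_id": [100, 101, 102],
})
toy_test = pd.DataFrame({
    "docket_id": [4, 5],
    "docket_number": ["A-2", "B-9"],
    "cluster_id": [13, 14],
    "opinion_id": [103, 104],
})

print(overlap_counts(toy_train, toy_test))  # rows of toy_train found in toy_test
print(overlap_counts(toy_test, toy_train))  # rows of toy_test found in toy_train
```

In the toy data, two train rows and one test row carry the shared docket number, so the two directions report different counts. The same effect explains why the notebook sees 113 overlapping rows in one direction and 103 in the other: the counts are rows, not distinct docket numbers.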


# Resolve leakage and generate validation set

In [5]:
# Move the train rows whose docket number also appears in test into a validation set
val = train[train["docket_number"].isin(test["docket_number"])]
len(val)

113

In [6]:
# Keep only the train rows whose docket number does not appear in test
train = train[~train["docket_number"].isin(test["docket_number"])]
len(train)

390

In [7]:
len(test)

450

Ideally, the test set would be closer in size to the validation set. However, the test set had already been generated for prior experiments, so we kept it unchanged to keep metric measurements consistent across experiments.

# Confirm no leakage

In [8]:
assert len(train[train["docket_id"].isin(test["docket_id"])]) == 0
assert len(train[train["docket_number"].isin(test["docket_number"])]) == 0
assert len(train[train["cluster_id"].isin(test["cluster_id"])]) == 0
assert len(train[train["opinion_id"].isin(test["opinion_id"])]) == 0

In [9]:
assert len(train[train["docket_id"].isin(val["docket_id"])]) == 0
assert len(train[train["docket_number"].isin(val["docket_number"])]) == 0
assert len(train[train["cluster_id"].isin(val["cluster_id"])]) == 0
assert len(train[train["opinion_id"].isin(val["opinion_id"])]) == 0

In [10]:
assert len(test[test["docket_id"].isin(train["docket_id"])]) == 0
assert len(test[test["docket_number"].isin(train["docket_number"])]) == 0
assert len(test[test["cluster_id"].isin(train["cluster_id"])]) == 0
assert len(test[test["opinion_id"].isin(train["opinion_id"])]) == 0

In [11]:
assert len(val[val["docket_id"].isin(train["docket_id"])]) == 0
assert len(val[val["docket_number"].isin(train["docket_number"])]) == 0
assert len(val[val["cluster_id"].isin(train["cluster_id"])]) == 0
assert len(val[val["opinion_id"].isin(train["opinion_id"])]) == 0
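
The assertion cells above can be written more compactly with a set intersection. Because an intersection is symmetric, one call covers both directions, so the train/test and test/train cells collapse into a single check. This is a sketch on made-up toy frames; `shared_values` and `KEYS` are hypothetical names, and dropping missing values before comparing is an assumption that the cells above do not make. As in the notebook, only the train/test and train/val pairs are meant to be checked this way: val is built from the train rows whose docket numbers appear in test, so val and test are not compared.

```python
import pandas as pd

# Identifier columns checked in the assertion cells above.
KEYS = ["docket_id", "docket_number", "cluster_id", "opinion_id"]


def shared_values(a: pd.DataFrame, b: pd.DataFrame, keys=KEYS) -> dict:
    """Map each key column to the set of values present in both frames."""
    return {key: set(a[key].dropna()) & set(b[key].dropna()) for key in keys}


# Toy frames (made-up values) with no identifier in common.
toy_train = pd.DataFrame({
    "docket_id": [1, 2],
    "docket_number": ["A-1", "A-2"],
    "cluster_id": [10, 11],
    "opinion_id": [100, 101],
})
toy_val = pd.DataFrame({
    "docket_id": [3, 4],
    "docket_number": ["B-1", "B-2"],
    "cluster_id": [12, 13],
    "opinion_id": [102, 103],
})

leaks = shared_values(toy_train, toy_val)
# Passes only if every per-column intersection is empty; on failure the
# assertion message lists the offending values.
assert not any(leaks.values()), leaks
```

In the notebook this would be called once as `shared_values(train, test)` and once as `shared_values(train, val)`.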

# Remove short records

Records with 50 or fewer words (roughly 100 tokens) do not have enough meaningful context for retrieval, so we remove them from all three splits.

In [13]:
train = train[train["opinion_word_count"] > 50]
len(train)

315

In [14]:
val = val[val["opinion_word_count"] > 50]
len(val)

93

In [15]:
test = test[test["opinion_word_count"] > 50]
len(test)

362
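
The same filter is applied to all three splits, so it can be factored into a helper that also reports how many rows were dropped. In the notebook, train went from 390 to 315 rows, val from 113 to 93, and test from 450 to 362. The sketch below uses a made-up toy frame; `drop_short` and `MIN_WORDS` are illustrative names, and it assumes the `opinion_word_count` column is already present, as it is in the cells above.

```python
import pandas as pd

MIN_WORDS = 50  # same cutoff as the cells above


def drop_short(df: pd.DataFrame, min_words: int = MIN_WORDS,
               column: str = "opinion_word_count"):
    """Keep rows whose word count is strictly greater than `min_words`.

    Returns the filtered frame and the number of rows removed. Rows with a
    missing word count are removed too, because NaN > min_words is False.
    """
    kept = df[df[column] > min_words]
    return kept, len(df) - len(kept)


# Toy frame (made-up values) with counts on both sides of the cutoff.
toy = pd.DataFrame({
    "opinion_id": [1, 2, 3, 4],
    "opinion_word_count": [12, 50, 51, 900],
})

kept, n_dropped = drop_short(toy)
print(f"kept {len(kept)} rows, dropped {n_dropped}")
```

A record with exactly 50 words is dropped, since the comparison is strict.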

# Save the dataset for future use

In [16]:
train.to_csv("outputs/4.finetune_train.csv", index=False)
val.to_csv("outputs/4.finetune_val.csv", index=False)
test.to_csv("outputs/4.finetune_test.csv", index=False)
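
When the saved files are read back later, `read_csv` infers column types again, so a docket number that looks numeric could lose leading zeros. One way to guard against this, sketched here under the assumption that `docket_number` should stay a string, is to pass an explicit `dtype` on load. The toy frame, the temporary directory, and the file name below are made up for illustration and are not the notebook's outputs.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (made-up values) whose docket numbers look numeric.
toy = pd.DataFrame({
    "opinion_id": [100, 101],
    "docket_number": ["0012", "0345"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_finetune_train.csv"  # hypothetical file name
    toy.to_csv(path, index=False)

    # Force docket_number to be read as text so leading zeros survive.
    reloaded = pd.read_csv(path, dtype={"docket_number": str})

# The round trip should preserve the shape and the string values.
assert reloaded.shape == toy.shape
print(reloaded["docket_number"].tolist())
```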