# Create query-opinion pairs of train & validation data for finetuning

1. I used the BERT tokenizer for convenience (even though we will fine-tune several non-BERT models).
2. I set max_tokens to 480 to leave headroom for tokenization differences between models that share a 512-token context window.
3. I used GPT-4o to generate both a relevant and an irrelevant query for each chunk.
4. Since most of the opinions exceed the 512-token context window, the opinions are first split into chunks of at most max_tokens, and query pairs are then generated for each chunk.

# Import libraries

In [1]:
import pandas as pd

import finetune_utils as util

PATH = ""

# Load the data

In [2]:
cols = ["opinion_id", "opinion", "opinion_word_count"]

train = pd.read_csv(f"{PATH}outputs/4.finetune_train.csv")
train = train[cols]

val = pd.read_csv(f"{PATH}outputs/4.finetune_val.csv")
val = val[cols]

# Take a look at the word count distribution & split train/val opinions into chunks

In [3]:
train["opinion_word_count"].describe()

count      315.000000
mean      2288.469841
std       2606.012431
min         52.000000
25%        613.000000
50%       1533.000000
75%       3026.000000
max      21186.000000
Name: opinion_word_count, dtype: float64

In [4]:
val["opinion_word_count"].describe()

count      93.000000
mean     1426.150538
std      1474.447332
min        55.000000
25%       395.000000
50%       994.000000
75%      1863.000000
max      8475.000000
Name: opinion_word_count, dtype: float64

Since the opinions are long but the base models we fine-tune have shorter context windows, we split the opinions into chunks based on token counts, generate queries for each chunk, and use the chunked data for fine-tuning. This makes the best use of our train/val dataset.
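
The chunking idea can be sketched as follows. This is a minimal illustration and not the actual `util.process_opinions`, whose implementation is not shown in this notebook. It assumes greedy sentence packing up to `max_tokens`, and it uses a whitespace tokenizer as a stand-in for the BERT tokenizer so that it runs without any model download. The function names `whitespace_tokens` and `chunk_opinion` are hypothetical; the `chunk_id` format (`<opinion_id>_<n>`) follows the outputs shown in this notebook.

```python
import re
from typing import Callable, Dict, List


def whitespace_tokens(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): a real run would use the BERT tokenizer."""
    return text.split()


def chunk_opinion(
    opinion_id,
    text: str,
    max_tokens: int = 480,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> List[Dict]:
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[Dict] = []
    current: List[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            chunks.append(
                {
                    "chunk_id": f"{opinion_id}_{len(chunks) + 1}",
                    "chunk_size": current_len,
                    "chunked_opinion": " ".join(current),
                }
            )
            current, current_len = [], 0

    for sentence in sentences:
        tokens = tokenize(sentence)
        # A sentence longer than max_tokens is cut into fixed-size windows.
        pieces = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
        for piece in pieces:
            if current_len + len(piece) > max_tokens:
                flush()
            current.append(" ".join(piece))
            current_len += len(piece)
    flush()
    return chunks
```

With a subword tokenizer, the pieces would need to be decoded back to text rather than joined with spaces; the join here only works because the stand-in tokenizer splits on whitespace.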

## Chunk the opinions

In [5]:
train = util.process_opinions(train)
train.head()

Token indices sequence length is longer than the specified maximum sequence length for this model (719 > 512). Running this sequence through the model will result in indexing errors


Unnamed: 0,chunk_id,chunk_size,chunked_opinion,opinion_id,opinion
0,4200405_1,197,DISTRICT COURT OF APPEAL OF THE STATE OF FLORI...,4200405,DISTRICT COURT OF APPEAL OF THE STATE OF FLOR...
1,2790710_1,416,Twersky v Incorporated Vil. of Great Neck ( 20...,2790710,Twersky v Incorporated Vil. of Great Neck ( 2...
2,2790710_2,480,"), entered June 17, 2014, as denied their moti...",2790710,Twersky v Incorporated Vil. of Great Neck ( 2...
3,2790710_3,267,"The appellants failed to establish, prima faci...",2790710,Twersky v Incorporated Vil. of Great Neck ( 2...
4,1470544_1,478,951 A. 2d 180 ( 2008 ) Philip S. HORNER v. GOV...,1470544,951 A.2d 180 (2008) Philip S. HORNER v. GOVERN...


In [6]:
len(train)

2828

In [7]:
train["chunk_size"].describe()

count    2828.000000
mean      432.561174
std        75.628180
min        30.000000
25%       435.000000
50%       458.000000
75%       471.000000
max       480.000000
Name: chunk_size, dtype: float64

In [8]:
val = util.process_opinions(val)
val.head()

Unnamed: 0,chunk_id,chunk_size,chunked_opinion,opinion_id,opinion
0,6237262_1,416,Mr. Justice Mercur delivered the opinion of th...,6237262,Mr. Justice Mercur delivered the opinion of t...
1,6237262_2,283,We think the plaintiff has no reason to compla...,6237262,Mr. Justice Mercur delivered the opinion of t...
2,1467030_1,442,217 N. J. Super. 541 ( 1987 ) 526 A. 2d 290 AL...,1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...
3,1467030_2,475,Prior to plaintiff ' s retirement in August 19...,1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...
4,1467030_3,463,"That precise issue was never raised, for in D ...",1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...


In [9]:
len(val)

489

In [10]:
val["chunk_size"].describe()

count    489.000000
mean     416.613497
std       91.049189
min       22.000000
25%      418.000000
50%      451.000000
75%      468.000000
max      480.000000
Name: chunk_size, dtype: float64

# Make GPT calls on train data

In [None]:
%%time

train_pred = util.predict(train)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 3
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 0
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 2
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 1
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 4
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 5
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 7
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 6
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 O
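
The `Completed:` indices in the log arrive out of order, which suggests that the requests run concurrently. The sketch below shows one way such a fan-out could look. It is an assumption about the general pattern and not the actual `util.predict`, which is not shown here. `generate` is a hypothetical callable that wraps the API call, `fake_generate` is an offline stand-in for it, and the JSON reply format with `relevant` and `irrelevant` keys is also assumed.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def parse_query_pair(chunk_id: str, raw_reply: str) -> Dict[str, str]:
    """Parse a reply assumed to be JSON with 'relevant' and 'irrelevant' keys."""
    data = json.loads(raw_reply)
    return {
        "chunk_id": chunk_id,
        "relevant": data["relevant"],
        "irrelevant": data["irrelevant"],
    }


def predict_all(
    chunks: List[Dict[str, str]],
    generate: Callable[[str], str],
    max_workers: int = 4,
) -> List[Dict[str, str]]:
    """Run generate() on every chunk in a thread pool and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its chunk_id, since completion order is arbitrary.
        futures = {
            pool.submit(generate, chunk["chunked_opinion"]): chunk["chunk_id"]
            for chunk in chunks
        }
        for future in as_completed(futures):
            results.append(parse_query_pair(futures[future], future.result()))
    return results


def fake_generate(text: str) -> str:
    """Offline stand-in for the model call (hypothetical)."""
    return json.dumps(
        {
            "relevant": f"What does the court hold about {text}?",
            "irrelevant": "What are the tax implications of forming an LLC?",
        }
    )
```

Because results come back in completion order, each record carries its `chunk_id` so that it can be joined back to the chunk table afterwards.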

In [14]:
train_queries = pd.DataFrame(train_pred)
train_queries.head()

Unnamed: 0,model,input_tokens,output_tokens,relevant,irrelevant,chunk_id
0,gpt-4o-2024-11-20,466,38,What are the legal standards for determining l...,What are the requirements for establishing int...,2790710_3
1,gpt-4o-2024-11-20,402,44,What are the procedural outcomes of appealing ...,What are the tax implications of forming an LL...,4200405_1
2,gpt-4o-2024-11-20,692,44,What is the legal responsibility of property o...,What are the tax implications of selling real ...,2790710_2
3,gpt-4o-2024-11-20,606,45,What is the appellate court's role in reviewin...,What are the tax implications of selling real ...,2790710_1
4,gpt-4o-2024-11-20,699,42,What are the legal challenges to mandatory sex...,What are the requirements for adopting a child...,1470544_1


## Confirm all datapoints were predicted & join prediction to original

In [15]:
assert len(train) == len(train_queries)

In [16]:
train_df = train.merge(train_queries, how="left", on="chunk_id")
train_df.head()

Unnamed: 0,chunk_id,chunk_size,chunked_opinion,opinion_id,opinion,model,input_tokens,output_tokens,relevant,irrelevant
0,4200405_1,197,DISTRICT COURT OF APPEAL OF THE STATE OF FLORI...,4200405,DISTRICT COURT OF APPEAL OF THE STATE OF FLOR...,gpt-4o-2024-11-20,402,44,What are the procedural outcomes of appealing ...,What are the tax implications of forming an LL...
1,2790710_1,416,Twersky v Incorporated Vil. of Great Neck ( 20...,2790710,Twersky v Incorporated Vil. of Great Neck ( 2...,gpt-4o-2024-11-20,606,45,What is the appellate court's role in reviewin...,What are the tax implications of selling real ...
2,2790710_2,480,"), entered June 17, 2014, as denied their moti...",2790710,Twersky v Incorporated Vil. of Great Neck ( 2...,gpt-4o-2024-11-20,692,44,What is the legal responsibility of property o...,What are the tax implications of selling real ...
3,2790710_3,267,"The appellants failed to establish, prima faci...",2790710,Twersky v Incorporated Vil. of Great Neck ( 2...,gpt-4o-2024-11-20,466,38,What are the legal standards for determining l...,What are the requirements for establishing int...
4,1470544_1,478,951 A. 2d 180 ( 2008 ) Philip S. HORNER v. GOV...,1470544,951 A.2d 180 (2008) Philip S. HORNER v. GOVERN...,gpt-4o-2024-11-20,699,42,What are the legal challenges to mandatory sex...,What are the requirements for adopting a child...
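
Comparing lengths confirms that the number of predictions matches the number of chunks, but it would not catch a duplicated or mismatched `chunk_id`. A stricter variant of the join could be sketched as below, using the `validate` and `indicator` options of pandas `merge`. The helper name `merge_checked` is hypothetical and is not part of `finetune_utils`.

```python
import pandas as pd


def merge_checked(
    chunks: pd.DataFrame, queries: pd.DataFrame, key: str = "chunk_id"
) -> pd.DataFrame:
    """Left-join queries onto chunks and fail loudly on duplicate or missing keys."""
    merged = chunks.merge(
        queries,
        how="left",
        on=key,
        validate="one_to_one",  # raises pandas.errors.MergeError on duplicate keys
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    missing = merged.loc[merged["_merge"] != "both", key].tolist()
    if missing:
        raise ValueError(
            f"{len(missing)} chunks have no generated queries, e.g. {missing[:5]}"
        )
    return merged.drop(columns="_merge")
```

For example, `train_df = merge_checked(train, train_queries)` would produce the same frame as the plain left merge whenever every chunk has exactly one query pair.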


# Make GPT calls on val data

In [17]:
%%time

val_pred = util.predict(val)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 2
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 1
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 3
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 0
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 5
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 8
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 6
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Completed: 10
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 

CPU times: user 22.6 s, sys: 1.24 s, total: 23.8 s
Wall time: 2min 33s


In [18]:
val_queries = pd.DataFrame(val_pred)
val_queries.head()

Unnamed: 0,model,input_tokens,output_tokens,relevant,irrelevant,chunk_id
0,gpt-4o-2024-11-20,672,42,Can alimony obligations be modified or termina...,What are the tax implications of inheriting pr...,1467030_1
1,gpt-4o-2024-11-20,504,42,What legal principles govern the consideration...,What are the tax implications of selling inher...,6237262_2
2,gpt-4o-2024-11-20,703,43,Can pension income be considered for alimony p...,What are the rules for child custody determina...,1467030_2
3,gpt-4o-2024-11-20,645,44,What constitutes fraud in a sheriff’s sale and...,What are the requirements for filing a patent ...,6237262_1
4,gpt-4o-2024-11-20,697,49,What are the legal principles guiding alimony ...,What are the tax implications of receiving an ...,1467030_4


## Confirm all datapoints were predicted & join prediction to original

In [19]:
assert len(val) == len(val_queries)

In [20]:
val_df = val.merge(val_queries, how="left", on="chunk_id")
val_df.head()

Unnamed: 0,chunk_id,chunk_size,chunked_opinion,opinion_id,opinion,model,input_tokens,output_tokens,relevant,irrelevant
0,6237262_1,416,Mr. Justice Mercur delivered the opinion of th...,6237262,Mr. Justice Mercur delivered the opinion of t...,gpt-4o-2024-11-20,645,44,What constitutes fraud in a sheriff’s sale and...,What are the requirements for filing a patent ...
1,6237262_2,283,We think the plaintiff has no reason to compla...,6237262,Mr. Justice Mercur delivered the opinion of t...,gpt-4o-2024-11-20,504,42,What legal principles govern the consideration...,What are the tax implications of selling inher...
2,1467030_1,442,217 N. J. Super. 541 ( 1987 ) 526 A. 2d 290 AL...,1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...,gpt-4o-2024-11-20,672,42,Can alimony obligations be modified or termina...,What are the tax implications of inheriting pr...
3,1467030_2,475,Prior to plaintiff ' s retirement in August 19...,1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...,gpt-4o-2024-11-20,703,43,Can pension income be considered for alimony p...,What are the rules for child custody determina...
4,1467030_3,463,"That precise issue was never raised, for in D ...",1467030,217 N.J. Super. 541 (1987) 526 A.2d 290 ALAN C...,gpt-4o-2024-11-20,667,43,How are pensions considered when determining a...,What are the tax implications of withdrawing e...


# Save the data for future use

In [21]:
# Use the same PATH prefix as the input files so reads and writes stay consistent
train_df.to_csv(f"{PATH}outputs/5.finetune_train_512.csv", index=False)
val_df.to_csv(f"{PATH}outputs/5.finetune_val_512.csv", index=False)