# Check similarities

This notebook documents the steps I took to confirm that each relevant query is more similar to its opinion than the corresponding irrelevant query is.
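The check proceeds as a cascade: a cheap TF-IDF comparison first, Legal-BERT only for the pairs TF-IDF flags, and a manual review for whatever both flag. A minimal sketch of that decision logic follows. The function `triage`, its return labels, and the toy `word_overlap` scorer are hypothetical names used only for illustration; the notebook itself applies the same comparisons column-wise on DataFrames.

```python
def triage(relevant, irrelevant, opinion, cheap_similarity, costly_similarity):
    """Decide which stage of the cascade settles a (relevant, irrelevant) query pair."""
    # Stage 1: cheap check; the pair passes if the relevant query is at least as similar
    if cheap_similarity(relevant, opinion) >= cheap_similarity(irrelevant, opinion):
        return "passed_tfidf"
    # Stage 2: costly check, only reached by pairs flagged in stage 1
    if costly_similarity(relevant, opinion) >= costly_similarity(irrelevant, opinion):
        return "passed_llm"
    # Stage 3: both measures rank the irrelevant query higher
    return "manual_review"


def word_overlap(query, text):
    """Toy stand-in for a similarity measure: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


if __name__ == "__main__":
    print(triage("assault appeal", "tax law", "appeal of assault conviction",
                 word_overlap, word_overlap))
```

Passing the similarity functions as arguments keeps the costly one from being evaluated for pairs that the cheap one already settles.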

# Import libraries

Restart the kernel after installing torch.

In [1]:
#%pip install transformers -q
#%pip install torch -q

import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Global variables

In [2]:
TOKENIZER = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
MODEL = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
MAX_LENGTH = 512  # maximum input length (in tokens) for the Legal-BERT model
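Because inputs are truncated to `MAX_LENGTH` tokens, Legal-BERT only sees the beginning of a long opinion. The sketch below estimates how much of a text survives truncation, assuming its token count is known. The helper name `retained_fraction` is hypothetical, and a count produced by a different tokenizer (for example the `opinion_4omini_tokens` column) would give only a rough estimate.

```python
MAX_LENGTH = 512  # same limit as the notebook's global


def retained_fraction(token_count, max_length=MAX_LENGTH):
    """Approximate share of a text that fits within the model's input window.

    Ignores the special tokens ([CLS], [SEP]) that also occupy positions,
    so the true share is slightly lower for truncated texts.
    """
    if token_count <= 0:
        return 1.0
    return min(1.0, max_length / token_count)


if __name__ == "__main__":
    for tokens in (85, 1873, 11223):
        print(tokens, f"{retained_fraction(tokens):.1%}")
```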

# Helper functions

In [3]:
def calculate_cosine(vector1, vector2):
    # Cosine similarity of two 1-D vectors; defined as 0 when either vector is all zeros
    denominator = np.sqrt(np.dot(vector1, vector1) * np.dot(vector2, vector2))
    return 0.0 if denominator == 0 else np.dot(vector1, vector2) / denominator

def get_tfidf_similarity(query, result):
    # Fit on the result only, so query terms absent from the result are ignored
    vectorizer = TfidfVectorizer().fit([result])
    result_embedding = vectorizer.transform([result]).toarray().flatten()
    query_embedding = vectorizer.transform([query]).toarray().flatten()
    return calculate_cosine(query_embedding, result_embedding)
    
def get_llm_similarity(query, result):
    # get the CLS token representing the query sentence from the model
    inputs = TOKENIZER(query,
                       return_tensors="pt",
                       padding='max_length',
                       truncation=True, 
                       max_length=MAX_LENGTH)
    outputs = MODEL(**inputs)
    query_embedding = outputs.last_hidden_state[0][0].detach().numpy()

    # get the CLS token representing the result sentence from the model
    inputs = TOKENIZER(result,
                       return_tensors="pt",
                       padding='max_length',
                       truncation=True, 
                       max_length=MAX_LENGTH)
    outputs = MODEL(**inputs)
    result_embedding = outputs.last_hidden_state[0][0].detach().numpy()

    # calculate the cosine similarity between the two CLS tokens
    return calculate_cosine(query_embedding, result_embedding)
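Since `get_tfidf_similarity` fits the vectorizer on a single document, every term gets the same inverse document frequency under the default `TfidfVectorizer` settings, so the score should reduce to a cosine over raw term counts restricted to the result's vocabulary. The standard-library sketch below illustrates that reading; it is an approximation for intuition, not the notebook's implementation, and `single_doc_similarity` and the toy strings are hypothetical.

```python
import math
import re
from collections import Counter

# Mirrors TfidfVectorizer's default token pattern: words of two or more characters
TOKEN = re.compile(r"\b\w\w+\b")


def term_counts(text, vocabulary=None):
    """Count lowercase tokens, optionally keeping only those in `vocabulary`."""
    counts = Counter(TOKEN.findall(text.lower()))
    if vocabulary is not None:
        counts = Counter({t: c for t, c in counts.items() if t in vocabulary})
    return counts


def single_doc_similarity(query, result):
    """Cosine similarity over term counts, with the vocabulary taken from `result` only."""
    result_counts = term_counts(result)
    query_counts = term_counts(query, vocabulary=result_counts.keys())
    dot = sum(query_counts[t] * result_counts[t] for t in query_counts)
    norm = (math.sqrt(sum(c * c for c in query_counts.values()))
            * math.sqrt(sum(c * c for c in result_counts.values())))
    # A query sharing no terms with the result becomes an all-zero vector
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(single_doc_similarity("zoning rules", "aggravated assault appeal"))
    print(single_doc_similarity("assault", "aggravated assault"))
```

This also explains why a query with no words in common with the opinion scores exactly 0: its vector is empty once restricted to the opinion's vocabulary.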

# Load the data

In [4]:
stmt = pd.read_csv("outputs/2a.queries_generated.csv")
qstn = pd.read_csv("outputs/2b.questions_generated.csv")

# Check query-result sets

## Calculate the cosine similarity based on TF-IDF matrix

In [5]:
stmt.loc[:, "relevant_tfidf"] = stmt.apply(lambda row: get_tfidf_similarity(row["relevant_query_stmt"], row["opinion"]), axis=1)
stmt.loc[:, "irrelevant_tfidf"] = stmt.apply(lambda row: get_tfidf_similarity(row["irrelevant_query_stmt"], row["opinion"]), axis=1)


In [6]:
stmt.head()

|   | opinion_id | opinion | opinion_4omini_tokens | input_opinion | relevant_query_stmt | irrelevant_query_stmt | relevant_tfidf | irrelevant_tfidf |
|---|---|---|---|---|---|---|---|---|
| 0 | 444587 | 748 F.2d 972 UNITED STATES of America, Plainti... | 1873 | 748 F.2d 972 UNITED STATES of America, Plainti... | Juan Jose Velasquez appeal of sentence correct... | Procedural rules in civil cases | 0.280198 | 0.111282 |
| 1 | 9410469 | Nebraska Supreme Court Online Library www.nebr... | 11223 | Nebraska Supreme Court Online Library www.nebr... | Landlord tenant law and eviction rules in Nebr... | Criminal sentencing and appeals | 0.218676 | 0.10814 |
| 2 | 714663 | 78 F.3d 599 U.S. v. Johnston ** NO. 94-2273 Un... | 85 | 78 F.3d 599 U.S. v. Johnston ** NO. 94-2273 Un... | Sufficiency of evidence in aggravated assault ... | Commercial contract disputes | 0.174078 | 0.0 |
| 3 | 2729050 | Pursuant to Ind. Appellate Rule 65(D), this Me... | 6099 | Pursuant to Ind. Appellate Rule 65(D), this Me... | Conviction for felony dealing with cocaine bas... | Child custody disputes | 0.140683 | 0.0 |
| 4 | 692963 | 51 F.3d 288 311 U.S.App.D.C. 145 UNITED STATES... | 5918 | 51 F.3d 288 311 U.S.App.D.C. 145 UNITED STATES... | Possession of instruments for making false ide... | Real estate contract disputes | 0.195962 | 0.004313 |


## Use Legal-BERT as a double check

In [7]:
double_check = stmt[stmt["relevant_tfidf"] < stmt["irrelevant_tfidf"]]
len(double_check)

163

In [8]:
double_check = double_check.copy()  # copy first so the new columns are not set on a slice of stmt
double_check.loc[:, "relevant_llm"] = double_check.apply(lambda row: get_llm_similarity(row["relevant_query_stmt"], row["opinion"]), axis=1)
double_check.loc[:, "irrelevant_llm"] = double_check.apply(lambda row: get_llm_similarity(row["irrelevant_query_stmt"], row["opinion"]), axis=1)


## Manual check

In [9]:
manual_check = double_check[double_check["relevant_llm"] < double_check["irrelevant_llm"]]
len(manual_check)

57

In [10]:
manual_review = manual_check.sample(5, random_state=42)[["opinion", "relevant_query_stmt", "irrelevant_query_stmt"]]
manual_review

|   | opinion | relevant_query_stmt | irrelevant_query_stmt |
|---|---|---|---|
| 12 | NOT PRECEDENTIAL UNITED STATES COURT OF APPEA... | United States Sentencing Guidelines applicatio... | Federal tax laws and their application to chil... |
| 103 | Joyce Brown, a Kentucky resident proceeding p... | IRS levy and wrongful seizure case under vario... | Compensation case related to industrial accident. |
| 428 | i i i i i i MEMORANDUM OPINION No. 04-10-00519... | Mandamus denied in habeas corpus proceeding Texas | Petition for a divorce ruling |
| 229 | The following order has been entered on the mo... | Order dismissing charges as per NC Appellate P... | Request for retrial in a traffic violation case |
| 498 | 122 Mich. App. 449 (1983) 332 N.W.2d 501 BULLA... | Workers' compensation dependency determination... | Construction and building regulations Michigan... |


In [24]:
manual_review.iloc[4]["relevant_query_stmt"]

"Workers' compensation dependency determination Michigan case"

In [25]:
manual_review.iloc[4]["irrelevant_query_stmt"]

'Construction and building regulations Michigan opinion'

In [26]:
manual_review.iloc[4]["opinion"]

"122 Mich. App. 449 (1983) 332 N.W.2d 501 BULLARD v. TITUS CONSTRUCTION COMPANY Docket No. 56153. Michigan Court of Appeals. Decided January 19, 1983. Williams, Klukowski, Wood, Drew & Fotieo, P.C. (by Stephen R. Drew and Ronald C. Love), for plaintiff-appellee. Smith, Haughey, Rice & Roegge (by Craig R. Noland), for Titus Construction Company and Hartford Accident and Indemnity Company. Cholette, Perkins & Buchanan (by Edward D. Wells), for A.F. Murch Company and Pacific Employers Insurance Company. *451 Before: MacKENZIE, P.J., and D.E. HOLBROOK, JR., and D.S. DEWITT, [*] JJ. (ON REHEARING) D.S. DEWITT, J. We granted defendants A.F. Murch Company's and Pacific Employers Insurance Company's application for rehearing in this case to consider whether we incorrectly ordered the case remanded to the Workers' Compensation Appeal Board for a determination of whether plaintiff's two stepchildren are dependents in fact. We now conclude that remand was unnecessary. In this Court's original opi

# Check question-result sets

## Calculate the cosine similarity based on TF-IDF matrix

In [27]:
qstn.loc[:, "relevant_tfidf"] = qstn.apply(lambda row: get_tfidf_similarity(row["relevant_query_qstn"], row["opinion"]), axis=1)
qstn.loc[:, "irrelevant_tfidf"] = qstn.apply(lambda row: get_tfidf_similarity(row["irrelevant_query_qstn"], row["opinion"]), axis=1)


In [28]:
qstn.head()

|   | opinion_id | opinion | opinion_4omini_tokens | input_opinion | relevant_query_qstn | irrelevant_query_qstn | relevant_tfidf | irrelevant_tfidf |
|---|---|---|---|---|---|---|---|---|
| 0 | 444587 | 748 F.2d 972 UNITED STATES of America, Plainti... | 1873 | 748 F.2d 972 UNITED STATES of America, Plainti... | What are the legal grounds for vacating and re... | What is the history of illegal immigration in ... | 0.470931 | 0.663938 |
| 1 | 9410469 | Nebraska Supreme Court Online Library www.nebr... | 11223 | Nebraska Supreme Court Online Library www.nebr... | What are the implications of mootness in legal... | What are the steps to forming a business in Ne... | 0.442592 | 0.528665 |
| 2 | 714663 | 78 F.3d 599 U.S. v. Johnston ** NO. 94-2273 Un... | 85 | 78 F.3d 599 U.S. v. Johnston ** NO. 94-2273 Un... | What elements constitute the crime of making f... | What are the latest trends in environmental la... | 0.174078 | 0.0 |
| 3 | 2729050 | Pursuant to Ind. Appellate Rule 65(D), this Me... | 6099 | Pursuant to Ind. Appellate Rule 65(D), this Me... | What constitutes sufficient evidence to sustai... | How do zoning laws affect urban development? | 0.218522 | 0.01355 |
| 4 | 692963 | 51 F.3d 288 311 U.S.App.D.C. 145 UNITED STATES... | 5918 | 51 F.3d 288 311 U.S.App.D.C. 145 UNITED STATES... | What qualifies as an 'identification document'... | What are the regulations concerning restaurant... | 0.133713 | 0.380651 |


## Use Legal-BERT as a double check

In [29]:
double_check = qstn[qstn["relevant_tfidf"] < qstn["irrelevant_tfidf"]]
len(double_check)

366

In [30]:
double_check = double_check.copy()  # copy first so the new columns are not set on a slice of qstn
double_check.loc[:, "relevant_llm"] = double_check.apply(lambda row: get_llm_similarity(row["relevant_query_qstn"], row["opinion"]), axis=1)
double_check.loc[:, "irrelevant_llm"] = double_check.apply(lambda row: get_llm_similarity(row["irrelevant_query_qstn"], row["opinion"]), axis=1)


## Manual check

In [31]:
manual_check = double_check[double_check["relevant_llm"] < double_check["irrelevant_llm"]]
len(manual_check)

107
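The counts reported by the cells above (163 and 57 for the statement queries, 366 and 107 for the question queries) can be summarized as the share of TF-IDF-flagged pairs for which Legal-BERT does rank the relevant query at least as high. The short calculation below is only an arithmetic sketch over those reported counts; the helper name `resolved_share` is hypothetical.

```python
# Counts reported in the notebook: (flagged by TF-IDF, still flagged by Legal-BERT)
counts = {
    "statement queries": (163, 57),
    "question queries": (366, 107),
}


def resolved_share(flagged_by_tfidf, flagged_by_llm):
    """Fraction of TF-IDF-flagged pairs that Legal-BERT no longer flags."""
    return (flagged_by_tfidf - flagged_by_llm) / flagged_by_tfidf


if __name__ == "__main__":
    for name, (tfidf_flagged, llm_flagged) in counts.items():
        share = resolved_share(tfidf_flagged, llm_flagged)
        print(f"{name}: {tfidf_flagged - llm_flagged} of {tfidf_flagged} "
              f"no longer flagged ({share:.1%})")
```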

In [39]:
manual_review = manual_check.sample(5, random_state=42)[["opinion", "relevant_query_qstn", "irrelevant_query_qstn"]]
manual_review

|   | opinion | relevant_query_qstn | irrelevant_query_qstn |
|---|---|---|---|
| 676 | UNITED STATES DISTRICT COURT FOR THE DISTRICT... | What legal rights do individuals have when int... | What are the personal preferences for seafood ... |
| 92 | BUTTLER, P. J. Claimant seeks judicial review... | What factors determine a claimant's eligibilit... | What are the historical origins of labor union... |
| 20 | Wilde J. _ _ delivered the opinion of the Cou... | What constitutes sufficient consideration for ... | What are the implications of bankruptcy on cor... |
| 865 | PLUMMER, J. The appellant was convicted upon ... | What evidence was presented to support the con... | What are the historical origins of cannabis us... |
| 631 | Hon. William G. Connelie Superintendent State ... | Can the District Attorney delegate prosecutori... | What are the historical cases involving land a... |


In [52]:
manual_review.iloc[4]["relevant_query_qstn"]

'Can the District Attorney delegate prosecutorial duties to a police officer for petty offenses?'

In [53]:
manual_review.iloc[4]["irrelevant_query_qstn"]

'What are the historical cases involving land acquisitions in New Jersey?'

In [54]:
manual_review.iloc[4]["opinion"]

'Hon. William G. Connelie Superintendent State Police This response is in answer to the two questions you pose in your letter dated May 7, 1979. In your first question you ask whether a District Attorney may delegate his duty to conduct all prosecutions for petty offenses cognizable by the local courts of his county to the police officer who was either the arresting officer or the complainant upon the accusatory instrument. Section 700 (1) of the County Law provides that the District Attorney has a duty to prosecute all crimes and offenses cognizable by the courts of his county. Section 10 of the Penal Law defines an "offense" as conduct punishable by imprisonment or by a fine as provided by any State law, any local law, or any order or regulation of any governmental instrumentality authorized by law to adopt such a regulation. The court of Appeals in People v Van Sickle, 13 N.Y.2d 61 (1963) ruled that a lay complaining witness may conduct a prosecution, but that the District Attorney 