# Filter the dataset from the oyez_gather script
This notebook takes as input the oyez.json file generated by the oyez_gather script.
A pre-generated dataset is available:

Compact JSON form (216 MB):
https://www.dropbox.com/s/9kyk0dr2gf3ls23/oyez.json?dl=0

Prettified JSON human-readable form (431 MB):
https://www.dropbox.com/s/52a58aac8iujupv/oyez_pretty.json?dl=0

## Imports

In [1]:
# json, re, string, difflib, os.path and pickle ship with Python and need no installation
# !pip install pandas
# !pip install numpy
# !pip install nltk
# !pip install beautifulsoup4
# !pip install contractions

import json
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('punkt')
import re
from bs4 import BeautifulSoup
import contractions
from collections import defaultdict
import string
import difflib
import os.path
import pickle

# Just for visuals
pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sanavesa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sanavesa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 1. Dataset Preparation

## Load the dataset

In [2]:
with open('../oyez.json', 'r') as f:
    data = json.load(f)

## Keep only the columns we need, for cases with no missing values

In [3]:
# Returns true if the given case's judgment has been decided, false otherwise
def is_decided(entry):
    for timeline in entry['timeline']:
        if timeline['event'] == "Decided":
            return True
    return False

# Returns true if the given case has the necessary columns for our analysis, false otherwise
# A case is kept only if:
#    1- its judgment has been decided
#    2- it has a non-empty facts field
#    3- its facts field has 4 or more words
#    4- it has a non-empty decisions field
#    5- its first and second parties differ
# Whether a winning party exists is checked later, when rows with missing values are dropped
def is_entry_complete(entry):
    try:
        if not is_decided(entry):
            return False
        facts = entry['facts_of_the_case']
        if facts is None or len(str(facts)) == 0:
            return False
        # Ignore cases whose facts are placeholders such as Currently unknown, Not Available, Currently unavailable.
        if len(str(facts).split()) <= 3:
            return False
        if entry['decisions'] is None or len(entry['decisions']) == 0:
            return False
        if entry['first_party'] == entry['second_party'] and entry['first_party'] is not None:
            return False
        return True
    except Exception:
        # Malformed entries (missing keys, unexpected types) are treated as incomplete
        return False

# Returns a dict with only the necessary columns we need from a given case
# Currently the following columns are kept:
#    1-  case ID (assigned by oyez.org)
#    2-  case name
#    3-  href URL to the oyez.org case
#    4-  name of the first party
#    5-  name of the second party
#    6-  facts of the case
#    7-  name of the winning party
def filter_entry(entry):
    row = {}
    row['ID'] = entry['ID']
    row['name'] = entry['name']
    row['href'] = entry['href']
    row['first_party'] = entry['first_party']
    row['second_party'] = entry['second_party']
    row['winning_party'] = None
    for decision in entry['decisions']:
        winning_party = decision['winning_party']
        if winning_party is not None and len(winning_party) > 0 and str(winning_party).lower() not in ['dismissal', 'n/a']:
            row['winning_party'] = winning_party
            break
    row['facts'] = entry['facts_of_the_case']
    return row
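The winning-party selection in `filter_entry` can be tried on its own. The following is a minimal sketch, assuming a hypothetical helper name (`first_usable_winner`) and made-up decision records; it is not part of the notebook, only an illustration of how placeholder values are skipped.

```python
# Hypothetical helper mirroring the loop inside filter_entry:
# return the first winning party that is not empty, 'dismissal' or 'n/a'.
def first_usable_winner(decisions):
    for decision in decisions:
        winner = decision.get('winning_party')
        if winner and str(winner).lower() not in ('dismissal', 'n/a'):
            return winner
    return None

# Made-up decision lists for illustration only
decisions = [
    {'winning_party': 'N/A'},
    {'winning_party': None},
    {'winning_party': 'Jane Roe'},
]
winner = first_usable_winner(decisions)
no_winner = first_usable_winner([{'winning_party': 'Dismissal'}])

print(winner)
print(no_winner)
```

Cases for which this returns `None` stay in `filtered_data` at first and are removed later by the `dropna` call on `winning_party`.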

In [4]:
filtered_data = []
for entry in data:
    if is_entry_complete(entry):
        filtered_data.append(filter_entry(entry))
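To see which kinds of entries the filter keeps, here is a small sketch run on made-up records. The records are heavily trimmed and hypothetical (real oyez.json entries carry many more fields), and `keep` is an illustrative condensed stand-in for `is_entry_complete`, without its error handling.

```python
# Condensed, illustrative version of the completeness criteria
def keep(entry):
    decided = any(t['event'] == 'Decided' for t in entry['timeline'])
    facts = entry.get('facts_of_the_case') or ''
    enough_facts = len(str(facts).split()) > 3
    has_decisions = bool(entry.get('decisions'))
    first, second = entry['first_party'], entry['second_party']
    distinct_parties = first is None or first != second
    return decided and enough_facts and has_decisions and distinct_parties

# Hypothetical toy records
base = {
    'timeline': [{'event': 'Argued'}, {'event': 'Decided'}],
    'facts_of_the_case': '<p>The petitioner challenged a state statute.</p>',
    'decisions': [{'winning_party': 'Petitioner'}],
    'first_party': 'Petitioner',
    'second_party': 'State',
}
toy_entries = [
    dict(base, name='complete'),
    dict(base, name='not decided', timeline=[{'event': 'Granted'}]),
    dict(base, name='placeholder facts', facts_of_the_case='Currently unknown'),
    dict(base, name='no decisions', decisions=[]),
    dict(base, name='same parties', second_party='Petitioner'),
]

kept = [e['name'] for e in toy_entries if keep(e)]
print(kept)
```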

## Prepare the pandas DataFrame for the dataset

In [5]:
df = pd.DataFrame(filtered_data)

# Remove rows with missing values in the listed columns
df.dropna(inplace=True, subset=['first_party', 'second_party', 'facts', 'winning_party'], how='any')
df.reset_index(inplace=True)

print(f'There are {len(df)} cases.')

There are 2384 cases.


In [6]:
display(df.head(n=3))

Unnamed: 0,index,ID,name,href,first_party,second_party,winning_party,facts
0,0,50606,Roe v. Wade,https://api.oyez.org/cases/1971/70-18,Jane Roe,Henry Wade,Jane Roe,"<p>In 1970, Jane Roe (a fictional name used in court documents to protect the plaintiff’s identity) filed a lawsuit against Henry Wade, the district attorney of Dallas County, Texas, where she resided, challenging a Texas law making abortion illegal except by a doctor’s orders to save a woman’s life. In her lawsuit, Roe alleged that the state laws were unconstitutionally vague and abridged her right of personal privacy, protected by the First, Fourth, Fifth, Ninth, and Fourteenth Amendments.</p>\n"
1,1,50613,Stanley v. Illinois,https://api.oyez.org/cases/1971/70-5014,"Peter Stanley, Sr.",Illinois,Stanley,"<p>Joan Stanley had three children with Peter Stanley. The Stanleys never married, but lived together off and on for 18 years. When Joan died, the State of Illinois took the children. Under Illinois law, unwed fathers were presumed unfit parents regardless of their actual fitness and their children became wards of the state. Peter appealed the decision, arguing that the Illinois law violated the Equal Protection Clause of the Fourteenth Amendment because unwed mothers were not deprived of their children without a showing that they were actually unfit parents. The Illinois Supreme Court rejected Stanley’s Equal Protection claim, holding that his actual fitness as a parent was irrelevant because he and the children’s mother were unmarried.</p>\n"
2,2,50623,Giglio v. United States,https://api.oyez.org/cases/1971/70-29,John Giglio,United States,Giglio,"<p>John Giglio was convicted of passing forged money orders. While his appeal to the U.S. Court of Appeals for the Second Circuit was pending, Giglio’s counsel discovered new evidence. The evidence indicated that the prosecution failed to disclose that it promised a key witness immunity from prosecution in exchange for testimony against Giglio. The district court denied Giglio’s motion for a new trial, finding that the error did not affect the verdict. The Court of Appeals affirmed.</p>\n"


# 2. Preprocess Dataset

## Statistics before Preprocessing

In [7]:
avg_char_before_preprocessing = df['facts'].apply(lambda x: len(str(x))).mean()
print(f'Average facts character length (before preprocessing): {avg_char_before_preprocessing:.0f}')

avg_word_before_preprocessing = df['facts'].apply(lambda x: len(str(x).split())).mean()
print(f'Average facts word length (before preprocessing): {avg_word_before_preprocessing:.0f}')

Average facts character length (before preprocessing): 1142
Average facts word length (before preprocessing): 177


## Remove HTML/URL

In [8]:
# Function to remove HTML tags and URLs from a string
def sanitize_review(text):
    # remove HTML tags
    text = BeautifulSoup(str(text), 'html.parser').get_text()   
    # remove URLS
    text = re.sub(r'http\S+', '', str(text))
    return text

df['facts'] = df['facts'].apply(sanitize_review)
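One detail of the URL pattern is worth knowing: `http\S+` consumes everything up to the next whitespace, so punctuation glued to the end of a URL is removed as well. A minimal sketch with a made-up sentence:

```python
import re

# Hypothetical sentence in which the URL is followed directly by a period
text = 'See https://www.oyez.org/cases/1971/70-18. The Court affirmed.'

# Same pattern as in the cell above
cleaned = re.sub(r'http\S+', '', text)
print(repr(cleaned))
```

The leftover double space is harmless here because a later step collapses repeated whitespace.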

## Remove non-UTF8 characters

In [9]:
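# Drops characters that cannot be encoded as UTF-8 (such as lone surrogates);
# valid non-ASCII characters, for example curly apostrophes, are kept
def remove_non_utf8(text):
    return text.encode(encoding='utf-8', errors='ignore').decode()

df['facts'] = df['facts'].apply(remove_non_utf8)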
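The encode/decode round trip above targets UTF-8, so it leaves ordinary non-ASCII text untouched, which is why curly apostrophes still appear in the facts afterwards. The sketch below contrasts it with an ASCII-only variant; the helper names `keep_utf8` and `keep_ascii` are hypothetical and used only for this comparison.

```python
def keep_utf8(text):
    # Same round trip as in the cell above
    return text.encode('utf-8', errors='ignore').decode()

def keep_ascii(text):
    # Stricter variant for comparison: drops every non-ASCII character
    return text.encode('ascii', errors='ignore').decode()

curly = 'plaintiff\u2019s'    # contains a right single quotation mark
broken = 'bad\ud800char'      # contains a lone surrogate, not encodable as UTF-8

utf8_curly = keep_utf8(curly)
utf8_broken = keep_utf8(broken)
ascii_curly = keep_ascii(curly)

print(repr(utf8_curly), repr(utf8_broken), repr(ascii_curly))
```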

## Expand Contractions

In [10]:
def fix_contractions(text):
    return contractions.fix(text)

df['facts'] = df['facts'].apply(fix_contractions)

## Remove extra spaces

In [11]:
def remove_extra_spaces(text):
    return ' '.join(str(text).split())

df['facts'] = df['facts'].apply(remove_extra_spaces)

## Statistics after Preprocessing

In [12]:
avg_char_after_preprocessing = df['facts'].apply(lambda x: len(str(x))).mean()
print(f'Average facts character length (after preprocessing): {avg_char_after_preprocessing:.0f}')

avg_word_after_preprocessing = df['facts'].apply(lambda x: len(str(x).split())).mean()
print(f'Average facts word length (after preprocessing): {avg_word_after_preprocessing:.0f}')

Average facts character length (after preprocessing): 1126
Average facts word length (after preprocessing): 177


In [13]:
display(df.head(n=3))

Unnamed: 0,index,ID,name,href,first_party,second_party,winning_party,facts
0,0,50606,Roe v. Wade,https://api.oyez.org/cases/1971/70-18,Jane Roe,Henry Wade,Jane Roe,"In 1970, Jane Roe (a fictional name used in court documents to protect the plaintiff’s identity) filed a lawsuit against Henry Wade, the district attorney of Dallas County, Texas, where she resided, challenging a Texas law making abortion illegal except by a doctor’s orders to save a woman’s life. In her lawsuit, Roe alleged that the state laws were unconstitutionally vague and abridged her right of personal privacy, protected by the First, Fourth, Fifth, Ninth, and Fourteenth Amendments."
1,1,50613,Stanley v. Illinois,https://api.oyez.org/cases/1971/70-5014,"Peter Stanley, Sr.",Illinois,Stanley,"Joan Stanley had three children with Peter Stanley. The Stanleys never married, but lived together off and on for 18 years. When Joan died, the State of Illinois took the children. Under Illinois law, unwed fathers were presumed unfit parents regardless of their actual fitness and their children became wards of the state. Peter appealed the decision, arguing that the Illinois law violated the Equal Protection Clause of the Fourteenth Amendment because unwed mothers were not deprived of their children without a showing that they were actually unfit parents. The Illinois Supreme Court rejected Stanley’s Equal Protection claim, holding that his actual fitness as a parent was irrelevant because he and the children’s mother were unmarried."
2,2,50623,Giglio v. United States,https://api.oyez.org/cases/1971/70-29,John Giglio,United States,Giglio,"John Giglio was convicted of passing forged money orders. While his appeal to the you.S. Court of Appeals for the Second Circuit was pending, Giglio’s counsel discovered new evidence. The evidence indicated that the prosecution failed to disclose that it promised a key witness immunity from prosecution in exchange for testimony against Giglio. The district court denied Giglio’s motion for a new trial, finding that the error did not affect the verdict. The Court of Appeals affirmed."


# 3. Export dataset

## Determine winner_index

In [14]:
# Returns the Levenshtein (edit) distance between two strings
def levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

# Returns the initials of the capitalized words, e.g. 'United States' -> 'US'
def get_abbreviation(word):
    return ''.join(w[0] for w in word.split() if w[0].isupper())

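# Returns 0 if the winning party best matches the first party, 1 if it best matches the second party
# Tries a difflib fuzzy match first, then substring containment, then an abbreviation match,
# and finally falls back to the smaller Levenshtein distance
def compute_winner(row):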
    first_party = row['first_party']
    second_party = row['second_party']
    winning_party = row['winning_party']
    winner = difflib.get_close_matches(winning_party, [first_party, second_party])
    if len(winner) == 0:
        if first_party in winning_party or winning_party in first_party:
            return 0
        elif second_party in winning_party or winning_party in second_party:
            return 1
        else:
            if winning_party.isupper():
                if winning_party == get_abbreviation(first_party):
                    return 0
                elif winning_party == get_abbreviation(second_party):
                    return 1
            d1 = levenshtein_distance(first_party, winning_party)
            d2 = levenshtein_distance(second_party, winning_party)
            if d1 <= d2:
                return 0
            else:
                return 1
    else:
        winner = winner[0]
        if winner == first_party:
            return 0
        elif winner == second_party:
            return 1
        else:
            return -1

df['winner_index'] = df.apply(compute_winner, axis=1)
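The fallback branches matter because `difflib.get_close_matches` uses a default similarity cutoff of 0.6, and a short winning-party name can fall below it even when it is a substring of a party name. A minimal sketch using party names from the rows displayed earlier (the helper `match_party` is hypothetical):

```python
import difflib

def match_party(winning_party, first_party, second_party):
    # Same call as in compute_winner, with difflib's default cutoff of 0.6
    return difflib.get_close_matches(winning_party, [first_party, second_party])

# 'Giglio' is similar enough to 'John Giglio' for a fuzzy match
giglio = match_party('Giglio', 'John Giglio', 'United States')

# 'Stanley' falls below the cutoff against 'Peter Stanley, Sr.',
# so compute_winner would rely on its substring check instead
stanley = match_party('Stanley', 'Peter Stanley, Sr.', 'Illinois')
ratio = difflib.SequenceMatcher(None, 'Peter Stanley, Sr.', 'Stanley').ratio()

print(giglio, stanley, ratio)
```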

In [15]:
display(df.head(n=3))

Unnamed: 0,index,ID,name,href,first_party,second_party,winning_party,facts,winner_index
0,0,50606,Roe v. Wade,https://api.oyez.org/cases/1971/70-18,Jane Roe,Henry Wade,Jane Roe,"In 1970, Jane Roe (a fictional name used in court documents to protect the plaintiff’s identity) filed a lawsuit against Henry Wade, the district attorney of Dallas County, Texas, where she resided, challenging a Texas law making abortion illegal except by a doctor’s orders to save a woman’s life. In her lawsuit, Roe alleged that the state laws were unconstitutionally vague and abridged her right of personal privacy, protected by the First, Fourth, Fifth, Ninth, and Fourteenth Amendments.",0
1,1,50613,Stanley v. Illinois,https://api.oyez.org/cases/1971/70-5014,"Peter Stanley, Sr.",Illinois,Stanley,"Joan Stanley had three children with Peter Stanley. The Stanleys never married, but lived together off and on for 18 years. When Joan died, the State of Illinois took the children. Under Illinois law, unwed fathers were presumed unfit parents regardless of their actual fitness and their children became wards of the state. Peter appealed the decision, arguing that the Illinois law violated the Equal Protection Clause of the Fourteenth Amendment because unwed mothers were not deprived of their children without a showing that they were actually unfit parents. The Illinois Supreme Court rejected Stanley’s Equal Protection claim, holding that his actual fitness as a parent was irrelevant because he and the children’s mother were unmarried.",0
2,2,50623,Giglio v. United States,https://api.oyez.org/cases/1971/70-29,John Giglio,United States,Giglio,"John Giglio was convicted of passing forged money orders. While his appeal to the you.S. Court of Appeals for the Second Circuit was pending, Giglio’s counsel discovered new evidence. The evidence indicated that the prosecution failed to disclose that it promised a key witness immunity from prosecution in exchange for testimony against Giglio. The district court denied Giglio’s motion for a new trial, finding that the error did not affect the verdict. The Court of Appeals affirmed.",0


In [16]:
# Show the data imbalance
index, counts = np.unique(df['winner_index'].values, return_counts=True)
print(index)
print(counts)

[0 1]
[2114  270]
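To put the imbalance in perspective, the counts above can be turned into class shares. The sketch below hard-codes the two counts printed in the output; the majority-class baseline is an added illustration, not something the notebook computes.

```python
# Class counts copied from the output above
counts = {0: 2114, 1: 270}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent class
majority_baseline = max(shares.values())
# How many class-0 cases there are for each class-1 case
imbalance_ratio = counts[0] / counts[1]

print(f'total cases: {total}')
print(f'class shares: {shares}')
print(f'majority baseline accuracy: {majority_baseline:.3f}')
print(f'imbalance ratio: {imbalance_ratio:.2f}')
```

Any model trained on this data should therefore be judged against roughly 89% accuracy rather than 50%.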


## Export

In [18]:
# export each class into its own file
df[df['winner_index']==0].to_pickle('class0.pkl')
df[df['winner_index']==1].to_pickle('class1.pkl')

df.to_pickle('preprocessed_dataset.pkl')

# test importing
# df0 = pd.read_pickle('class0.pkl')
# df1 = pd.read_pickle('class1.pkl')