# Final Project

CS 109 <br>
Madhu, Max, and Yunhan

## Data Collection | Yunhan

We collected the full unstructured text of every available Supreme Court opinion through the CourtListener API. We used the Bulk Data API rather than the REST API, since the latter would have required upwards of 60,000 requests.
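
Unpacking the bulk archive can be sketched as follows. This is a minimal sketch, assuming the archive has already been downloaded to a local file; the function name `extract_bulk_archive` and the paths are hypothetical and not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_bulk_archive(archive_path, dest_dir):
    """Extract the .json members of a gzipped tarball into dest_dir.

    Returns the sorted list of extracted file paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a JSON file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            # Flatten to the bare file name so nothing is written outside dest.
            target = dest / Path(member.name).name
            source = archive.extractfile(member)
            target.write_bytes(source.read())
            extracted.append(target)
    return sorted(extracted)
```

Calling `extract_bulk_archive("scotus.tar.gz", ".")` would leave the JSON files in the working directory, which is where the loading cell looks for them with `glob.glob("*.json")`.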

In [1]:
%%time

# ------ use this URL to download all opinions as multiple json files ------
# https://www.courtlistener.com/api/bulk-data/document/scotus.tar.gz

# ------ turn json files into pandas dataframe ------
import os
import glob
import json

# lists for pandas dataframe cols
raw_dict = {}
list_name,list_blocked,list_id,list_docket,list_lexis,list_date,list_url,list_text = [],[],[],[],[],[],[],[]

# iterate through each json file, append data to col lists
for filename in glob.glob("*.json"):
    with open(filename) as json_file:
        json_data = json.load(json_file)
        list_name.append(json_data["citation"]["case_name"])
        list_docket.append(json_data["citation"]["docket_number"])
        list_lexis.append(json_data["citation"]["lexis_cite"])
        list_blocked.append(json_data["blocked"])
        list_id.append(json_data["id"])
        list_date.append(json_data["date_filed"])
        if json_data["download_url"] != "":
            list_url.append(json_data["download_url"])
        else:
            list_url.append(json_data["citation"]["resource_uri"])
        if json_data["plain_text"] != "":
            list_text.append(json_data["plain_text"])
        else:
            list_text.append(json_data["html_with_citations"])
            
raw_dict["name"] = list_name
raw_dict["blocked"] = list_blocked
raw_dict["id"] = list_id
raw_dict["docket"] = list_docket
raw_dict["lexis"] = list_lexis
raw_dict["date"] = list_date
raw_dict["url"] = list_url
raw_dict["text"] = list_text

CPU times: user 43.5 s, sys: 16.6 s, total: 1min
Wall time: 1min 29s
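
The eight parallel lists above could also be collapsed into one record per opinion. The following is a sketch of that alternative, not the code that produced the timings shown; the helper name `opinion_to_record` is hypothetical. It assumes that a missing value (`None`) should trigger the same fallback as an empty string, which differs slightly from the `!= ""` checks in the cell.

```python
def opinion_to_record(json_data):
    """Flatten one opinion's JSON into a single row (a plain dict)."""
    citation = json_data["citation"]
    return {
        "name": citation["case_name"],
        "docket": citation["docket_number"],
        "lexis": citation["lexis_cite"],
        "blocked": json_data["blocked"],
        "id": json_data["id"],
        "date": json_data["date_filed"],
        # `or` falls back when the preferred field is empty ("" or None).
        "url": json_data["download_url"] or citation["resource_uri"],
        "text": json_data["plain_text"] or json_data["html_with_citations"],
    }
```

A list of such records, one per JSON file, can be passed directly to `pd.DataFrame`, which keeps each row's fields together and removes the risk of the column lists drifting out of step.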


In [2]:
%%time

# strip html tags from text
from bs4 import BeautifulSoup

list_cleantext = []
for rawtext in list_text:
    # name the parser explicitly so the result does not depend on which parsers are installed
    list_cleantext.append(BeautifulSoup(rawtext, "lxml").text.replace("\n",""))

raw_dict["text"] = list_cleantext

CPU times: user 5min 40s, sys: 11.9 s, total: 5min 52s
Wall time: 6min 16s
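
Most of the runtime above goes to parsing every opinion as HTML, including those that are already plain text. A possible shortcut is sketched below using only the standard library's `html.parser` instead of BeautifulSoup; this is a swapped-in technique, the names `_TextCollector` and `strip_tags` are hypothetical, and its handling of malformed markup may differ from the cell's output.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, discarding the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return markup with tags removed and character references decoded."""
    # Text with no tags and no character references needs no parsing at all.
    if "<" not in markup and "&" not in markup:
        return markup
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)
```

To mirror the cell above, the result would still need `.replace("\n", "")` applied afterwards.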






In [3]:
import pandas as pd
pd.set_option("display.width", 500)
pd.set_option("display.max_columns", 100)
pd.set_option("display.notebook_repr_html", True)

raw_df = pd.DataFrame(raw_dict)

In [4]:
raw_df.head()

| | blocked | date | docket | id | lexis | name | text | url |
|---|---|---|---|---|---|---|---|---|
| 0 | False | 1922-05-29 | 65 | 100000 | | Morrisdale Coal Co. v. United States | 259 U.S. 188 (1922)MORRISDALE COAL COMPANYv.UN... | |
| 1 | False | 1922-05-29 | 101 | 100001 | | Pine Hill Coal Co. v. United States | 259 U.S. 191 (1922)PINE HILL COAL COMPANY, INC... | |
| 2 | False | 1922-05-29 | Nos. 108, 109 | 100002 | | Santa Fe Pacific R. Co. v. Fall | 259 U.S. 197 (1922)SANTA FE PACIFIC RAILROAD C... | |
| 3 | False | 1922-05-29 | 204 | 100003 | | Federal Baseball Club of Baltimore, Inc. v. Na... | 259 U.S. 200 (1922)FEDERAL BASEBALL CLUB OF BA... | |
| 4 | False | 1922-05-29 | 215 | 100004 | | Mutual Life Ins. Co. of NY v. Liebing | 259 U.S. 209 (1922)MUTUAL LIFE INSURANCE COMPA... | |


In [5]:
# the verification data set only has cases from 1946 forward
# in order to ensure that all cases that might potentially be sampled into the training data
# can be joined to the verification data, we should limit our data to the cases from 1946 forward
raw_df["year"] = [int(x.split("-")[0]) for x in raw_df["date"]]
raw_df["month"] = [int(x.split("-")[1]) for x in raw_df["date"]]
raw_df["day"] = [int(x.split("-")[2]) for x in raw_df["date"]]
year_df = raw_df[raw_df["year"] >= 1946]
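
The same date split can also be written with pandas' datetime accessor. This is a sketch on a hypothetical three-row stand-in for `raw_df` (the case names and the frames `demo_df` and `demo_year_df` are invented for illustration), assuming every date is an ISO `YYYY-MM-DD` string as in the output above.

```python
import pandas as pd

# Hypothetical three-row stand-in for raw_df.
demo_df = pd.DataFrame({
    "name": ["Case A", "Case B", "Case C"],
    "date": ["1922-05-29", "1946-01-02", "1987-11-30"],
})

# Parse once, then read the components off the datetime accessor.
parsed = pd.to_datetime(demo_df["date"], format="%Y-%m-%d")
demo_df["year"] = parsed.dt.year
demo_df["month"] = parsed.dt.month
demo_df["day"] = parsed.dt.day

# .copy() makes the subset independent of demo_df, so later edits
# to it do not trigger chained-assignment warnings.
demo_year_df = demo_df[demo_df["year"] >= 1946].copy()
```

Parsing with an explicit `format` also fails loudly on a malformed date instead of producing a wrong integer.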

In [6]:
year_df.to_csv("raw_clistener_data.csv", sep='\t', encoding="utf-8")
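
Note that the file is tab-separated despite its `.csv` extension, so it has to be read back with the same separator. A minimal round-trip sketch follows, using an in-memory buffer and a hypothetical one-row frame in place of the real file; `index_col=0` is included because pandas would otherwise load the saved index as an extra column named `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for year_df; the comma in the name is
# harmless because the separator is a tab.
demo_df = pd.DataFrame({"name": ["Smith, Inc. v. Jones"], "year": [1946]})

buffer = io.StringIO()
demo_df.to_csv(buffer, sep="\t")
buffer.seek(0)

# Read back with the same separator and restore the saved index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
```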