# Data Collection

After deciding that processing the full text of opinions would be inefficient, we looked for ways to collect summarized legal data. In Supreme Court opinions, the *syllabus* is a formal, abbreviated section of the text that covers the major facts, arguments, and outcome of the case. Fortunately, Justia has curated an openly accessible repository of Supreme Court cases (syllabi and opinions) organized by year and volume. We scraped every syllabus from 1946 onward because our validation data set begins in that year.

In [1]:
from bs4 import BeautifulSoup
import requests as rq
from time import sleep
import pandas as pd
import re

### Scraping Data

The Justia website organizes cases by year and volume. Because this organization is hierarchical (each year and volume has its own directory "landing" page), we can collect every individual case URL from the directory pages and then visit each case separately. We limit the data to cases from 1946 onward because that is where the validation set begins.
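
As a small illustration of the directory structure described above, the sketch below builds the year landing-page URLs and resolves a relative case link against the site root. It makes no network requests. The helper names (`year_page_urls`, `absolute_case_url`) and the sample `href` are hypothetical and chosen only for illustration; the URL template is the one used in the scraping cell.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

SITE_ROOT = "https://supreme.justia.com"
YEAR_PAGE = SITE_ROOT + "/cases/federal/us/year/%s.html"


def year_page_urls(first_year=1946, last_year=2015):
    """Return the directory-page URL for every year in the inclusive range."""
    return [YEAR_PAGE % year for year in range(first_year, last_year + 1)]


def absolute_case_url(href):
    """Resolve a relative link from a directory page against the site root."""
    return urljoin(SITE_ROOT, href)


year_urls = year_page_urls()

# hypothetical href, shaped like the relative links on a year page
example_case = absolute_case_url("/cases/federal/us/347/483/")
```

Using `urljoin` rather than string concatenation is a design choice of this sketch: it also handles links that are already absolute.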

In [2]:
# ---------- grab all case urls from respective year directory pages ----------

# base url directory page for each year
base_url = "https://supreme.justia.com/cases/federal/us/year/%s.html"

# base url text page for each case
case_url = "https://supreme.justia.com"
id_url = []

# iterate through years 1946 to 2015
years = range(1946,2016)
for year in years:
    soup = BeautifulSoup(rq.get(base_url % year).text, "lxml")
    results = soup.findAll("div", attrs={"class":"result"})
    
    # collect all case urls on each year page
    for result in results:
        id_url.append(case_url + result.a["href"])
    
    # pause briefly between requests to avoid connection errors
    sleep(0.1)

Unfortunately, Justia provides the cases only as HTML. Because the DOM structure is inconsistent (some cases have no syllabus, and some years lack metadata or organize the syllabus text differently), we (1) check whether a syllabus exists, skipping cases that include only a full opinion so the data stays consistent, and (2) scrape the entire text of the syllabus directly. We clean the data after it has been gathered in full.
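
The two steps above can be sketched on plain strings, independent of any HTML parser. This is a minimal illustration that assumes the header text and paragraph texts have already been extracted from the page; `has_syllabus`, `join_paragraphs`, and the sample inputs are hypothetical names and values, not part of the scraper itself.

```python
import re


def has_syllabus(header_text):
    """Return True when the page's section list mentions a syllabus."""
    return "syllabus" in header_text.lower()


def join_paragraphs(paragraphs):
    """Join paragraph strings, dropping blank ones and collapsing whitespace."""
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return " ".join(p for p in cleaned if p)


# hypothetical inputs standing in for text pulled out of a case page
sample_header = "Syllabus | Case"
sample_paragraphs = ["Held:  the judgment\nis affirmed.", "   ", "Pp. 1-10."]

if has_syllabus(sample_header):
    syllabus_text = join_paragraphs(sample_paragraphs)
```

Keeping the check and the text assembly as separate functions makes each easy to test against the irregular pages without re-running the scrape.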

In [3]:
# ---------- visit each case page, scrape syllabus, store data ----------

# parallel lists: page title ("metadata"), syllabus text, U.S. citation, source url
# (page formatting is irregular and some fields are missing)
metadata,syllabus,citations,urls=[],[],[],[]

# iterate through the case urls collected above
for url in id_url:
    soup = BeautifulSoup(rq.get(url).text, "lxml")
    
    # check if syllabus exists: the section list in the page header mentions it
    header = soup.find("ul", attrs={"class":"centered-list clear"})
    if header is None or "syllabus" not in header.text.lower():
        continue

    # save name of case
    name = soup.find("h1", attrs={"class":"title"}).text

    # go to section of the DOM with text (the container differs between pages)
    page_text = soup.find("span", attrs={"class":"headertext"})
    if page_text is None:
        page_text = soup.find("div", attrs={"id":"opinion"})
    if page_text is None:
        # neither container found, so there is nothing to scrape
        continue

    # collect syllabus text
    syllabus_list = ""
    for paragraph in page_text.findAll("p"):
        # don't append blank lines or returns
        if paragraph.text.strip() != "":
            syllabus_list += paragraph.text + " "

    # volume and page are the last two directory components of the url
    url_parts = url.split("/")

    metadata.append(name)
    syllabus.append(syllabus_list)
    citations.append(url_parts[-3] + " U.S. " + url_parts[-2])
    urls.append(url)

### Storing Data

Now that we have stored the "metadata" (the title of the page, which includes the full citation), the syllabus, the U.S. Reports citation, and the original Justia URL for each case, we can do a bit of cleaning to make the citations mergeable and then save the result as a CSV for ease of sharing among group members.
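
To make the cleaning step concrete, here is a hedged sketch that splits a page title into case name, U.S. citation, and year with a single regular expression. It assumes titles look like `Name, 347 U.S. 483 (1954)`, which may not hold for every page. `TITLE_PATTERN` and `parse_title` are hypothetical names, and this is an alternative illustration rather than the code used in the cells that follow.

```python
import re

# Assumed title shape: "<case name>[,] <volume> U.S. <page> (<year>)"
TITLE_PATTERN = re.compile(
    r"^(?P<case>.+?),?\s+(?P<volume>\d+)\s+U\.\s?S\.\s+(?P<page>\d+)"
    r"\s+\((?P<year>\d{4})\)"
)


def parse_title(full_cite):
    """Return case name, U.S. citation, and year, or None if the title does not fit."""
    match = TITLE_PATTERN.search(full_cite)
    if match is None:
        return None
    return {
        "case": match.group("case"),
        "us_cite": "%s U.S. %s" % (match.group("volume"), match.group("page")),
        "year": match.group("year"),
    }


parsed = parse_title("Brown v. Board of Education, 347 U.S. 483 (1954)")
unparsed = parse_title("A title without a citation")
```

Returning `None` for titles that do not fit makes irregular pages easy to find and inspect instead of raising an `IndexError` partway through the loop.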

In [4]:
# ---------- create dataframe ----------
rawdict = {}
rawdict["full_cite"] = metadata
rawdict["us_cite"] = citations
rawdict["text"] = syllabus
rawdict["url"] = urls
dfclean = pd.DataFrame(rawdict)

In [5]:
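# ---------- clean up full_cite column -----------
years,names = [],[]
for full_cite in dfclean.full_cite:
    # year: text after the first " (", with the trailing ")" dropped
    years.append(re.findall(r"\s\((.*)",full_cite)[0][:-1])
    # case name: everything before the first whitespace that is followed by a digit
    names.append(re.findall(r".+?(?=\s\d)",full_cite)[0])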
    
dfclean["year"] = years
dfclean["case"] = names

In [6]:
dfclean.to_csv("final_justia_data_merge.csv", sep=',', encoding='utf-8',index=False)

In [8]:
print("Number of cases scraped from Justia: %d" % len(dfclean))

Number of cases scraped from Justia: 11224
