# Scraping Court Opinion PDFs from law.justia.com

Data are collected from law.justia.com, a repository of court opinions across the United States. <br>
Mississippi Court of Appeals opinions can be found at URLs with the following pattern: https://law.justia.com/cases/mississippi/court-of-appeals/2023/
- `/cases/` targets Justia's case repository
- `/mississippi/` targets the state of Mississippi
- `/court-of-appeals/` targets the Court of Appeals in the target state
- `/2023/` narrows the cases to a single year, here 2023 (this is necessary to keep all cases on the same page rather than spread across multiple pages)
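
The pattern above can be assembled from its parts. This is a minimal sketch; the helper name `build_index_url` is hypothetical and not part of the original notebook.

```python
BASE_URL = "https://law.justia.com"

def build_index_url(state, court, year):
    """Assemble a Justia index URL from the path segments described above."""
    # state, court, and year map onto the /mississippi/, /court-of-appeals/, and /2023/ segments
    return f"{BASE_URL}/cases/{state}/{court}/{year}/"

print(build_index_url("mississippi", "court-of-appeals", 2023))
```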

To generalize the code below so that it collects data from any set of courts and years, we will need to 'iterate' the code over them.

Start out by testing this code to better understand how iteration loops operate:

    years = [2022, 2021, 2020]
    root_url = "https://law.justia.com/cases/mississippi/court-of-appeals/"
    for year in years:
        # year is an integer, so convert it to a string before concatenating
        print(root_url + str(year) + "/")
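
Iterating over several courts as well as several years only needs a second loop variable. The sketch below uses `itertools.product` to pair every court with every year; the list `example_courts` and its two paths are illustrative assumptions, not values taken from the scraped data.

```python
from itertools import product

# Illustrative court paths, written in the "/cases/.../" form used on law.justia.com
example_courts = [
    "/cases/mississippi/court-of-appeals/",
    "/cases/mississippi/supreme-court/",
]
years = [2022, 2021]

# product() yields every (court, year) pair, varying the year fastest
urls = ["https://law.justia.com" + court + str(year) + "/"
        for court, year in product(example_courts, years)]

for url in urls:
    print(url)
```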

### Install beautifulsoup4 (and any other libraries you need)

In [None]:
# pip install beautifulsoup4

### Import required packages and find current working directory

In [None]:
from bs4 import BeautifulSoup
import requests
import os
from tqdm import tqdm
import re
import pandas as pd
import shutil
from datetime import datetime

os.getcwd()

### Scrape list of all Federal Appellate Courts, Federal District Courts, and State Appellate Courts

In [None]:
url = "https://law.justia.com/cases/federal"

req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
soup = BeautifulSoup(req.text, "html.parser")

Federal Court of Appeals

In [None]:
fed_coas = soup.find_all("ul", {"class": "indented"})
fed_coas = [fed_coa.find_all("a", href=True) for fed_coa in fed_coas]
fed_coas = [fed_coa.get('href') for fed_coa in fed_coas[1]]

Federal District Courts

In [None]:
fed_dcs = soup.find_all("ul", {"class": "list-columns list-columns-three list-no-styles"})
fed_dcs = [fed_dc.find_all("a", href=True) for fed_dc in fed_dcs]
fed_dcs = [fed_dc.get('href') for fed_dc in fed_dcs[0]]

for i in range(0,len(fed_dcs)):
    url = "https://law.justia.com" + fed_dcs[i]

    req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
    soup = BeautifulSoup(req.text, "html.parser")

    dcs = soup.find_all("div", {"class": "indented"})
    dcs = [dc.find_all("a", href=True) for dc in dcs]
    fed_dcs[i] = [dc.get('href') for dc in dcs[0]]

fed_dcs = sum([v if isinstance(v, list) else [v] for v in fed_dcs],[])
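
The `sum(..., [])` call flattens the per-state lists of district courts into a single list. An equivalent sketch with `itertools.chain.from_iterable` is shown below on placeholder paths (the values in `nested` are made up for illustration); it avoids building a new intermediate list at every addition.

```python
from itertools import chain

# Placeholder data: some entries are lists of paths, one is still a bare string
nested = [["/a/1/", "/a/2/"], "/b/", ["/c/1/"]]

# Wrap bare strings in a list so that every element can be chained uniformly
flat = list(chain.from_iterable(v if isinstance(v, list) else [v] for v in nested))

print(flat)
```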

State Courts

In [None]:
url = "https://law.justia.com/cases"

req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
soup = BeautifulSoup(req.text, "html.parser")

scas = soup.find_all("ul", {"class": "list-columns list-columns-three list-no-styles"})
scas = [sca.find_all("a", href=True) for sca in scas]
scas = [sca.get('href') for sca in scas[1]]

for i in range(0,len(scas)):
    url = "https://law.justia.com" + scas[i]

    req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
    soup = BeautifulSoup(req.text, "html.parser")

    sc = soup.find_all("div", {"class": "indented"})
    sc = [c.find_all("a", href=True) for c in sc]
    scas[i] = [c.get('href') for c in sc[0]]

scas = sum([v if isinstance(v, list) else [v] for v in scas],[])

### Define function that can 'slugify' court URLs, converting into a directory-friendly format

In [None]:
def slugify(value):
    """Converts text into a directory-friendly format. 

    Args:
        value (string): input text for conversion

    Returns:
        string: converted text
    """
    import re
    
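    # Raw strings keep \w and \s from being read as (invalid) string escapes
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()  # drop punctuation, lowercase
    value = re.sub(r'[-\s]+', '-', value)                   # collapse spaces/hyphens into one hyphen
    value = re.sub(r'^_|_$', '', value)                     # trim one leading/trailing underscore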

    return value
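
To see what the function does to typical inputs, the sketch below repeats it in compact form and applies it to a few made-up strings (none of them are real citations).

```python
import re

def slugify(value):
    """Compact copy of the function above, repeated so this example runs on its own."""
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    value = re.sub(r'[-\s]+', '-', value)
    value = re.sub(r'^_|_$', '', value)
    return value

examples = [
    "Mississippi_Court-of-Appeals_2023_2021-CA-01234-COA",  # underscores and hyphens survive
    "Smith v. Jones (2023)",                                 # punctuation is dropped
    "_cases federal  appellate courts_",                     # outer underscores are trimmed
]
for text in examples:
    print(slugify(text))
```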

### Define the *justia_scrape* function

This function collects the citations for all cases associated with a given court and year, along with the URL of each case page. The citations (which serve as unique identifiers) and URLs are entered into a data frame, and the PDF copy of each court opinion is then downloaded and written, year by year, into the "./data/court_opinions/" subdirectory of the current working directory. Make sure the "./data/court_opinions/" subdirectory exists in your current working directory before running this code. Note that the function creates an additional level of subdirectories, one per court scraped (e.g., "cases_federal_appellate-courts_caaf"), each containing one folder per year.
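
A minimal sketch of preparing that directory tree is shown below. The helper `court_dir_name` is a hypothetical name introduced here; it mirrors the substitution the function applies to the court path. The sketch works inside a temporary directory so that nothing is written to your working directory.

```python
import os
import re
import tempfile

def court_dir_name(court):
    """Turn a court path such as "/cases/.../" into a folder name."""
    return re.sub(r"^_|_$", "", court.replace("/", "_"))

with tempfile.TemporaryDirectory() as root:
    # In a real run, `root` would be the current working directory
    base = os.path.join(root, "data", "court_opinions")
    os.makedirs(base, exist_ok=True)  # no error if the folders already exist

    target = os.path.join(base, court_dir_name("/cases/federal/appellate-courts/caaf/"))
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)
    print(os.path.basename(target))
```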

In [None]:
def justia_scrape(years, court):
    """Scrapes portable document format (PDF) legal opinion documents from law.justia.com.
    The script or Jupyter JSON file that executes this function must be located in the root
    alongside a subdirectory named "data", within which there must be a "court_opinions" 
    subdirectory. This function will create a directory structure within the "court_opinions" 
    subdirectory, arranging downloaded PDFs by year and court. 

    Args:
        years (list): list of years for which you wish to download data.
        court (string): target court, must match law.justia.com URL format. See the 'courts' object.
    """
    print("+++ " + str(datetime.now()) + " +++\n")
    print("//LAWJUSTIASCRAPER")
    print("//"+court.upper()+"\n")
    
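    # makedirs with exist_ok=True does not fail when the court folder already exists
    # (e.g., on a re-run); os.path.join keeps the path valid outside Windows as well.
    os.makedirs(os.path.join("data", "court_opinions",
                             re.sub("^_|_$","",re.sub("/","_",str(court)))),
                exist_ok=True)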
    
    for year in years:
        print("+++ " + str(year))
        
        url = "https://law.justia.com" + court + str(year) + "/"
        req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
        soup = BeautifulSoup(req.text, "html.parser")
        
        pages = soup.find_all("span", {"class": "pagination page"})
        pages = [page.find_all("a", href=True) for page in pages]
        pages = [page[0].get('href') if len(page) > 0 else '' for page in pages]
        pages = [page for page in pages if page]
        pages = [i for n, i in enumerate(pages) if i not in pages[:n]]
        
        urls = [url] + ["https://law.justia.com" + page for page in pages]
        reqs = [requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}) for url in urls]
        
        print("Parsing: " + str(len(reqs)) + " page(s)")

        citations = []
        links = []
        for req in reqs:
            soup = BeautifulSoup(req.text, "html.parser")
            opinions = soup.find_all("div", {"class": "has-padding-content-block-30 -zb"})
            
            temp = [opinion.find_all("a", {"class": "case-name"}, href=True) for opinion in opinions]
            temp = [link[0].get('href') if len(link) > 0 else 'No Link' for link in temp]
            # Escape the dot so that only a literal ".html" suffix is stripped
            temp = [re.sub("/","_",re.sub(r"^/cases/|\.html$","",citation)) for citation in temp]
            citations = citations + temp

            temp = [opinion.find_all("a", {"class": "case-name"}, href=True) for opinion in opinions]
            temp = ['https://law.justia.com' + link[0].get('href') if len(link) > 0 else 'No Link' for link in temp]
            links = links + temp
            

        df = pd.DataFrame([{'citation': citation, 'url': link} for citation, link in zip(citations, links)])
        r2 = 'Dropping: ' + str(len(df[df['url'] == 'No Link']))
        df = df.drop(df[df['url'] == 'No Link'].index)
        r1 = 'Collecting: ' + str(len(df))

        print(r1)
        print(r2)

        # Create the per-year output folder (and any missing parents); exist_ok=True allows re-runs
        year_dir = os.path.join("data", "court_opinions",
                                re.sub("^_|_$","",re.sub("/","_",str(court))),
                                str(year))
        os.makedirs(year_dir, exist_ok=True)

        # total lets tqdm show a proper progress bar, since zip() has no length
        for url, citation in tqdm(zip(df.url, df.citation), total=len(df)):
            req = requests.get(url, headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
            soup = BeautifulSoup(req.text, "html.parser")
            link = soup.find_all("a", {"class": "pdf-icon pull-right has-margin-bottom-20"}, href=True)
            if len(link)!=0:
                link = "https:" + link[0].get('href')
                # The second positional argument of requests.get is `params`,
                # so the stray 'wb' was being sent as a query string; it is removed here.
                response = requests.get(link)
                if response.status_code==200:
                    # The context manager closes the file even if the write fails
                    with open(os.path.join(year_dir, slugify(str(citation)) + ".pdf"), "wb") as pdf:
                        pdf.write(response.content)
                else:
                    print('Error: ' +
                        re.sub(r"\.pdf$", "", str(citation)).upper() +
                        ' aborted with ' +
                        str(response.status_code) +
                        ' status')
        print(" ")
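
Two small steps inside the function are easy to check in isolation: deriving a citation identifier from a case link, and removing duplicate pagination links while keeping their order. The sketch below uses made-up hrefs and a hypothetical helper name, `citation_from_href`.

```python
import re

def citation_from_href(href):
    """Strip the "/cases/" prefix and ".html" suffix, then replace slashes with underscores."""
    return re.sub("/", "_", re.sub(r"^/cases/|\.html$", "", href))

# Made-up case link, written in the shape the function expects
href = "/cases/mississippi/court-of-appeals/2023/2021-ca-01234-coa.html"
citation = citation_from_href(href)
print(citation)

# Made-up pagination links; dict.fromkeys drops repeats and preserves first-seen order
pages = ["/p/?page=2", "/p/?page=3", "/p/?page=2"]
unique_pages = list(dict.fromkeys(pages))
print(unique_pages)
```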

### Define the *court_search* function
This function searches the list of court URLs scraped earlier in this notebook (see code chunks 3-6) and returns every court whose URL contains a given keyword. It reads from the `courts` object, so that list must be defined before the function is called. In our use case, this simple function identifies the courts for a specific state.

In [None]:
def court_search(text):
    """searches the courts object, a list of valid input strings for the justia_scrape function.

    Args:
        text (string): input search string

    Returns:
        list: all strings containing the search term
    """
    output = []
    
    for court in courts:
        if text.lower() in court:
            output.append(court)
            
    return output
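
The `courts` object is not defined in the cells above. The sketch below assumes it is the concatenation of the scraped lists (`fed_coas + fed_dcs + scas`) and substitutes a short, made-up list so that the example runs on its own.

```python
# Assumption: in the notebook this would be `courts = fed_coas + fed_dcs + scas`.
# The three paths below are illustrative stand-ins for the scraped values.
courts = [
    "/cases/federal/district-courts/louisiana/laedce/",
    "/cases/louisiana/supreme-court/",
    "/cases/mississippi/court-of-appeals/",
]

def court_search(text):
    """Return every entry of `courts` that contains the search term (case-insensitive on the term)."""
    output = []
    for court in courts:
        if text.lower() in court:
            output.append(court)
    return output

matches = court_search("Louisiana")
print(matches)
```

Because the search term is lower-cased but the court paths are not, the match relies on the paths already being lower case.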

### Find all courts associated with a given keyword (e.g., state), and loop the *justia_scrape* through all of those courts
Note that this can take hours, and you will need a stable internet connection.

In [None]:
from JuDe import slugify, court_search, justia_scrape

In [None]:
court_search('Louisiana')[8]

In [None]:
for court in [court_search('Louisiana')[8]]:
    justia_scrape(list(range(2002, 2023+1)), court)