A pipeline that reads briefs from PDF, preprocesses them, extracts the argument headers from the table of contents, and splits each brief into sections.

In [None]:
%pip install fuzzywuzzy

In [3]:
import pandas as pd


In [4]:
toc_df = pd.read_csv('../data/new_dataset/new_preprocessed.csv')

In [5]:
toc_df.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,token_count,arguments
0,Docket23-1275_Brief001.pdf,\n No. 23-1275 \nIn the Supreme Court of the...,\n No. 23-1275 \nIn the Supreme Court of the...,................................ ..............,23-1275,SCOTUS,11528,I. The any-qualified-provider provision does n...
1,Docket23-1275_Brief002.pdf,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,................................................,23-1275,SCOTUS,14888,I. The any-qualified-provider provision does n...
2,Docket23-1275_Brief003.pdf,No. 23-1275 \nIN THE \nSupreme Court of the ...,No. 23-1275 \nIN THE \nSupreme Court of the ...,................................ ..............,23-1275,SCOTUS,12633,I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING VI...
3,Docket23-1275_Brief004.pdf,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,................................................,23-1275,SCOTUS,16883,The Free-Choice-Of-Provider Provision Unambigu...
4,Docket24-249_Brief001.pdf,\n No. 24-249 \n \nIN THE \nSupreme Court of...,\n No. 24-249 \n \nIN THE \nSupreme Court of...,................................................,24-249,SCOTUS,4409,I. Congress Enacted The IDEA To Supplement The...


In [None]:
'''
# May need to redo this to ensure toc and content are split properly after pre-processing
# Apply split_text to the 'text' field and store the results in new columns
toc_df[['toc', 'content']] = toc_df.apply(lambda row: pd.Series(split_text(row['text'])), axis=1)

# Now, toc_df contains all the original fields, plus the 'toc' and 'content' columns with the extracted data
print("toc_df updated with 'toc' and 'content' columns.")
'''


In [None]:
'''
# Remove briefs where toc or content is null,
# usually because of an issue with text parsing
old_len = len(toc_df)
# & (toc_df['content'].str.len() >= 4000)
toc_df = toc_df[toc_df['toc'].notnull() & toc_df['content'].notnull()]
toc_df = toc_df.reset_index(drop=True)

print(f"Dropped {old_len - len(toc_df)} rows of empty toc/content")
'''


In [None]:
'''
# Remove briefs whose content is empty or very short,
# usually because of an issue with text parsing
# Based on manual testing below for content, 4000 is a tight bound and 10000 is a loose one for finding short content.
old_len = len(toc_df)
toc_df = toc_df[(toc_df['content'].str.len() >= 5000)]
toc_df = toc_df.reset_index(drop=True)

print(f"Dropped {old_len - len(toc_df)} rows of empty or very short content")
'''


In [6]:
# Count number of unique cases
unique_ids = list(toc_df['docket_num'].unique())
print(f"Number of cases: {len(unique_ids)}")

Number of cases: 10


In [7]:
# Tokenize each entry and count tokens
toc_df['token_count'] = toc_df['text'].apply(lambda x: len(x.split()) if pd.notnull(x) else 0)

# Calculate the average number of tokens
average_tokens = toc_df['token_count'].mean()

print("Average number of tokens per entry:", average_tokens)

Average number of tokens per entry: 12516.444444444445


Preprocess the toc to extract only the argument headers

In [8]:
import re
# Function to split the arguments on newlines into a list
def list_args(toc_text):
    if toc_text is None or not isinstance(toc_text, str):
        return []
    if toc_text == "Null":
        return []

    arg_list = toc_text.split('\n')
    arg_list = [arg.strip() for arg in arg_list if arg.strip()] # Is this causing there to be more empty lists?
    return arg_list

In [9]:
import pandas as pd
import re

# Assuming toc_df is already defined and populated

# Return the brief text that follows a table-of-contents marker (the first pattern that matches wins)
def extract_content(content):
    toa_pattern = r"T\s*A\s*B\s*L\s*E\s*O\s*F\s*A\s*U\s*T\s*H\s*O\s*R\s*I\s*T\s*I\s*E\s*S\s*\.\s*\.\s*"
    # toa_pattern = r"TABLE OF AUTHORITIES"
    toca_pattern = r"t\s*a\s*b\s*l\s*e\s*o\s*f\s*c\s*i\s*t\s*e\s*d\s*a\s*u\s*t\s*h\s*o\s*r\s*i\s*t\s*i\s*e\s*s\s*"
    conclusion_pattern = r"c\s*o\s*n\s*c\s*l\s*u\s*s\s*i\s*o\s*n\s*s?\.\s*\.\s*"

    def extract_after_pattern(pattern, content):
        # Search for the pattern (case-insensitive)
        match = re.search(pattern, content, re.IGNORECASE)
        if match:
            # Return the content after the matched pattern
            return content[match.end():].strip()
        return None

    # First, try the text after the CONCLUSION entry (matched together with its dot leaders)
    content_after_conclusion = extract_after_pattern(conclusion_pattern, content)
    if content_after_conclusion:
        return content_after_conclusion

    # Otherwise, try the text after the TABLE OF AUTHORITIES entry
    content_after_toa = extract_after_pattern(toa_pattern, content)
    if content_after_toa:
        return content_after_toa

    # If others aren't found, try after table of cited authorities
    content_after_toca = extract_after_pattern(toca_pattern, content)
    if content_after_toca:
        return content_after_toca
    # If no pattern is found, return the original content unchanged
    return content

In [10]:
import re

def remove_conclusion(content_text):
  # Find all matches of "CONCLUSION" using re.finditer, which returns an iterator yielding match objects
  pattern = r'(?:c\s*o\s*n\s*c\s*l\s*u\s*s\s*i\s*o\s*n)' # Match on 'c on clusi on'
  matches = list(re.finditer(pattern, content_text, flags=re.MULTILINE | re.IGNORECASE))

  if matches:
    # If matches are found, take the last match
    last_match = matches[-1]
    # Remove conclusion and everything after the last occurrence of "CONCLUSION"
    content_text = content_text[:last_match.start()]

  return content_text

In [11]:
# Function to remove extra newlines in the contents
def clean_content(content):
  if not isinstance(content, str):
    return None

  # return re.sub(r'(?<![\.\?!])\n(?!\n)', ' ', content)
  # Try this method from cleaning the ToC instead.
  # Remove newline characters and replace them with a single space unless the next line starts with a section symbol
  # cleaned_text = re.sub(r'\n(?!\s*(I\.|II\.|III\.|IV\.|V\.|VI\.|VII\.|VIII\.|IX\.|X\.|1\.|2\.|3\.|4\.|5\.|6\.|7\.|8\.|9\.|10\.|A\.|B\.|C\.|D\.|E\.|F\.|G\.|H\.|I\.|J\.|K\.|L\.|M\.|N\.|O\.|P\.|Q\.|R\.))', ' ', content, flags=re.IGNORECASE)
  cleaned_text = re.sub(r'(?<!\n)\n(?!\n|\s*(I\.|II\.|III\.|IV\.|V\.|VI\.|VII\.|VIII\.|IX\.|X\.|1\.|2\.|3\.|4\.|5\.|6\.|7\.|8\.|9\.|10\.|A\.|B\.|C\.|D\.|E\.|F\.|G\.|H\.|I\.|J\.|K\.|L\.|M\.|N\.|O\.|P\.|Q\.|R\.))', ' ', content, flags=re.IGNORECASE)

  cleaned_text = re.sub(r' {2,}', ' ', cleaned_text)

  return cleaned_text

In [12]:
def preprocess_text(text):
    if not isinstance(text, str):
        return None
    text = re.sub(r'‚Äú', '"', text)
    text = re.sub(r'‚Äù', '"', text)
    text = re.sub(r'‚Äô', "'", text)
    text = re.sub(r'Äî', '—', text)
    text = re.sub(r'¬ß', 'section', text)
    text = re.sub(r'¬†', ' ', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'Ä¶', '', text)
    text = re.sub(r'\*', '', text)
    text = re.sub(r'\’', "'", text)
    text = re.sub(r'\“', '"', text)
    text = re.sub(r'\”', '"', text)
    text = re.sub(r'\‘', "'", text)
    return text

In [13]:
# In this new version, iterate through the entire list of headers before getting the indices of the sections
import re
from fuzzywuzzy import fuzz
import json


def create_flexible_regex_fuzzy(text):
    words = text.strip().split()

    def flexible_word(word):
        parts = re.split(r'([*+?{},])', word)
        flexible_parts = []
        for i, part in enumerate(parts):
            if i % 2 == 0:  # This is not a quantifier
                flexible_parts.append(r'[-\s.]*'.join(re.escape(char) for char in part))
            else:  # This is a quantifier, add it as is
                flexible_parts.append(part)
        return ''.join(flexible_parts)

    pattern = r'[-\s.]*'.join(flexible_word(word) for word in words)
    # Remove multiple consecutive [-\s.]* sequences
    pattern = re.sub(r'(?:\[-\\s\.\]\*){2,}', r'[-\\s.]*', pattern)
    return pattern

def fuzzy_match_by_line(row, threshold=70):
    headers = row['cleaned_args']
    content = row['cleaned_content_alt']
    content_buf = 1.1

    if headers is None or content is None or not isinstance(headers, list):
        return None

    # First pass: find best matches for all headers
    header_matches = []
    for header in headers:
        best_match = None
        best_match_index = None
        best_score = 0
        header_lower = header.lower()

        # line_starts = [m.start() for m in re.finditer(r'(?m)^(?=\S)', content) if content[m.start()].lower() == header[0].lower()]
        # Try instead to strip spaces to account for leading white space.
        line_starts = [
            m.start() for m in re.finditer(r'(?m)^\s*\S', content)  # Lines with optional leading whitespace and at least one visible character
            if content[m.end() - 1].lower() == header[0].lower()  # m.end() - 1 is the first non-whitespace character; avoids re-slicing content
        ]
        '''
        print("THE LOWERCASE HEADER IS:")
        print(header_lower)
        print()
        '''

        for start in line_starts:

            end = min(start + int(len(header)*content_buf), len(content))
            line_to_compare = content[start:end]
            line_to_compare_lower = line_to_compare.lower()
            # score = fuzz.ratio(header, line_to_compare)
            score = fuzz.ratio(header_lower, line_to_compare_lower)
            '''
            print("THE LINE TO COMPARE IS:")
            print(line_to_compare_lower)
            print(f"THE SCORE IS:{score}")
            '''
            if score > best_score:
                best_score = score
                best_match = line_to_compare # Do I save the lowercase or normal version?
                best_match_index = start

            if score > threshold:
                # print("MATCH FOUND")
                break

        if best_match and best_score > threshold:
            last_4_words = ' '.join(header.split()[-4:])
            flexible_regex = create_flexible_regex_fuzzy(last_4_words)
            regex_match = re.search(flexible_regex, best_match, re.IGNORECASE)
            '''
            print()
            print("ADDING NEW SECTION")
            print(f"THE HEADER IS {header}")
            print(f"THE FIRST 25 CHARS OF BEST MATCH IS {best_match[:25]}")
            print(f"THE MATCH INDEX IS {best_match_index}")
            print(f"THE MATCH END IS {best_match_index + (regex_match.end() if regex_match else len(best_match))}")
            print(f"THE REGEX MATCH IS {regex_match}")
            print()
            '''

            header_matches.append({
                'header': header,
                'match_index': best_match_index,
                'match_end': best_match_index + (regex_match.end() if regex_match else len(best_match)),
                'matched_line': best_match,
                'regex_match': regex_match,
                'pattern': flexible_regex
            })

    # Sort matches by their position in the content
    # In cases of repeated headers, this causes the order to get mixed up
    header_matches.sort(key=lambda x: x['match_index'])

    # Second pass: extract content for each section
    sections = []
    for i, match in enumerate(header_matches):
        header = match['header']
        start = match['match_end'] # Remove content buffer because match_end should already have a buffer
        end = header_matches[i+1]['match_index'] if i+1 < len(header_matches) else len(content)

        section_content = content[start:end].strip()
        if not section_content:
            section_content = None
            '''
            print()
            print(f'NO CONTENT FOUND FOR HEADER: {header}')
            print()
            '''
        '''
        print()
        print(f"THE HEADER IS {header}")
        print(f"THE HEADER START IS {start}")
        print(f"THE HEADER END IS {end}")
        # print(f"THE SECTION CONTENT IS {section_content[:25]}")
        print()
        '''
        sections.append({
            header: section_content,
            # "matched": match['regex_match'],
            # "pattern": match['pattern'],
            # "matched_line": match['matched_line']
        })

    return sections



In [None]:
# toc_df['cleaned_text_test1'] = toc_df['text'].apply(lambda x: clean_content(x) if pd.notnull(x) else x) #cleaned_content_alt is trying to remove lines unless there is a start of a section header

In [14]:
toc_df['cleaned_text_alt'] = toc_df['text'].apply(lambda x: preprocess_text(x) if pd.notnull(x) else x)

In [15]:
toc_df['cleaned_text_alt'] = toc_df['cleaned_text_alt'].apply(lambda x: remove_conclusion(x) if pd.notnull(x) else x)

In [16]:
toc_df['cleaned_content_alt'] = toc_df['cleaned_text_alt'].apply(lambda x: extract_content(x) if pd.notnull(x) else x)

In [17]:
toc_df['cleaned_args'] = toc_df['arguments'].apply(lambda x: list_args(x) if pd.notnull(x) else x)

In [18]:
toc_df.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,token_count,arguments,cleaned_text_alt,cleaned_content_alt,cleaned_args
0,Docket23-1275_Brief001.pdf,\n No. 23-1275 \nIn the Supreme Court of the...,\n No. 23-1275 \nIn the Supreme Court of the...,................................ ..............,23-1275,SCOTUS,11528,I. The any-qualified-provider provision does n...,\n No. 23-1275 \nIn the Supreme Court of the...,.............................. ..................,[I. The any-qualified-provider provision does ...
1,Docket23-1275_Brief002.pdf,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,................................................,23-1275,SCOTUS,14888,I. The any-qualified-provider provision does n...,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,.................................................,[I. The any-qualified-provider provision does ...
2,Docket23-1275_Brief003.pdf,No. 23-1275 \nIN THE \nSupreme Court of the ...,No. 23-1275 \nIN THE \nSupreme Court of the ...,................................ ..............,23-1275,SCOTUS,12633,I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING VI...,No. 23-1275 \nIN THE \nSupreme Court of the ...,.............................. ..................,[I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING V...
3,Docket23-1275_Brief004.pdf,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,................................................,23-1275,SCOTUS,16883,The Free-Choice-Of-Provider Provision Unambigu...,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,.................................................,[The Free-Choice-Of-Provider Provision Unambig...
4,Docket24-249_Brief001.pdf,\n No. 24-249 \n \nIN THE \nSupreme Court of...,\n No. 24-249 \n \nIN THE \nSupreme Court of...,................................................,24-249,SCOTUS,4409,I. Congress Enacted The IDEA To Supplement The...,\n No. 24-249 \n \nIN THE \nSupreme Court of...,.................................................,[I. Congress Enacted The IDEA To Supplement Th...


Use this to test and debug the functions

After debugging, I understand at least part of the problem with Docket18-9526_Brief005.pdf. Some of the short headers are repeated, and for each header I search the entire text line by line from the start, so repeated headers hit the same line more than once. Then, because I sort the matches by start index, the duplicate matches end up next to each other, which breaks the step that finds the start and end of each section, since each section's boundaries are derived from the neighboring matches. So first, I need a way to avoid repeated matches. I should also probably not sort the list.
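
One possible way to avoid the repeated hits, sketched under assumptions: search for each header only at or after the end of the previous match, so a repeated header cannot land on the same line twice and no sorting is needed. `match_headers_in_order` and `similarity` are hypothetical names that do not exist in this notebook, and `difflib.SequenceMatcher` stands in for `fuzz.ratio` to keep the sketch dependency-free, so scores will differ somewhat from fuzzywuzzy's. The sketch also assumes headers appear in the body in table-of-contents order.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_headers_in_order(headers, content, threshold=70):
    """Find each header at or after the end of the previous match.

    Returns (header, start, end) tuples in table-of-contents order.
    Headers that never score above the threshold are skipped.
    """
    # Offsets at which a line begins
    line_starts = [0] + [i + 1 for i, ch in enumerate(content) if ch == "\n"]
    matches = []
    cursor = 0  # never search before the end of the previous match
    for header in headers:
        for start in line_starts:
            if start < cursor:
                continue
            # Compare a window as long as the header; it may span line breaks
            window = content[start:start + len(header)]
            if similarity(header, window) > threshold:
                end = min(start + len(header), len(content))
                matches.append((header, start, end))
                cursor = end
                break
    return matches
```

A trade-off of this design: a false positive early in the text moves the cursor forward and can hide the true location of later headers, so the threshold matters more than in the search-everything approach.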


Another problem: some sections contain large chunks of repeated text. The repetition appears within a section, not across sections. Test with the following file names:


In [None]:
# This is a weird example where some of the sections were repeated and empty, and one of the parent sections was missing because it didn't start from the
# beginning of the line.
# weird_example = "Docket18-9526_Brief005.pdf"
# weird_example = "Docket20-843_Brief010.pdf"
# weird_example = "Docket22-148_Brief021.pdf"
weird_example = "Docket18-1584_Brief001.pdf"

In [None]:
# search_df = toc_df[toc_df['filename'] == 'Docket20-828_Brief010.pdf']

In [None]:
search_df = toc_df[toc_df['filename'] == weird_example]

In [None]:
search_df.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,text_short,gpt_response_1,gpt_response_2,arguments,token_count,cleaned_text_alt,cleaned_content_alt,cleaned_args
1973,Docket18-1584_Brief001.pdf,\n Nos. 18-1584 and 18-1587 \n===============...,\n Nos. 18-1584 and 18-1587 \n===============...,...............................................,18-1584,SCOTUS,\n Nos. 18-1584 and 18-1587 \n===============...,**TABLE OF CONTENTS** \nPage \n**QUESTION PR...,TABLE OF CONTENTS \nPage \n QUESTION PRESENTED...,I. The Appalachian Trail Segment Crossed By Th...,7752,\n Nos. 18-1584 and 18-1587 \n===============...,.................................................,[I. The Appalachian Trail Segment Crossed By T...


In [20]:
test_idx = 0
print(toc_df.iloc[test_idx]['arguments'])

I. The any-qualified-provider provision does not create individual rights enforceable under 42 U.S.C.  1983
A. Spending Clause statutes must unambiguously confer individual rights to be privately enforceable under Section 1983
B. The any-qualified-provider provision does not unambiguously confer individual federal rights
C. Finding a privately enforceable individual right in this case would create line-drawing problems
D. Other enforcement mechanisms protect beneficiaries
II. The court of appeals erred in finding an individual federal right



In [21]:
print(toc_df.iloc[test_idx]['cleaned_args'])

['I. The any-qualified-provider provision does not create individual rights enforceable under 42 U.S.C.  1983', 'A. Spending Clause statutes must unambiguously confer individual rights to be privately enforceable under Section 1983', 'B. The any-qualified-provider provision does not unambiguously confer individual federal rights', 'C. Finding a privately enforceable individual right in this case would create line-drawing problems', 'D. Other enforcement mechanisms protect beneficiaries', 'II. The court of appeals erred in finding an individual federal right']


In [22]:
print(toc_df.iloc[test_idx]['cleaned_text_alt'])

 
 No. 23-1275  
In the Supreme Court of the United States  
 
EUNICE MEDINA , INTERIM DIRECTOR , SOUTH CAROLINA 
DEPARTMENT OF HEALTH AND  HUMAN SERVICES ,  
PETITIONER  
v. 
PLANNED PARENTHOOD SOUTH ATLANTIC , ET AL. 
 
ON WRIT OF CERTIORARI  
TO THE UNITED STATES COURT OF APPEALS  
FOR THE FOURTH  CIRCUIT  
 
BRIEF FOR THE UNITED STATES  
AS AMICUS CURIAE SUPPORTING PETITIONER  
 
  SARAH M. HARRIS  
Acting Solicitor  General  
Counsel  of Record  
BRETT A. SHUMATE  
Acting  Assistant  Attorney 
General  
EDWIN S. KNEEDLER  
Deputy Solicitor General  
ZOE A. JACOBY  
Assistant to the Solicitor 
General  
JOSHUA M. SALZMAN  
LAURA E. MYRON  
Attorneys  
Department  of Justice  
Washington, D.C. 20530 -0001  
SupremeCtBriefs @usdoj.gov  
(202) 514 -2217  
(I) QUESTION  PRESENTED  
To participate in the Medicaid program, States must 
submit and maintain  a "plan for medical assistance" 
that satisfies a comprehensive  list of federal require-
ments.  42 U.S.C. 1396a(a)  and (b) .  One 

In [None]:
# Test line by line fuzzy approach
test_row = toc_df.iloc[test_idx]
test_sections = fuzzy_match_by_line(test_row)

In [None]:
for sec in test_sections:
  print()
  print(sec)
  '''
  res1 = sec["matched"]
  res2 = sec["pattern"]
  res3 = sec["matched_line"]
  print(f"The regex match result is:\n{res1}")
  print(f"The flexible pattern is:\n{res2}")
  print(f"The matched line is:\n{res3}")
  '''


{'I. The Appalachian Trail Segment Crossed By The ACP Project Is Not "Land In The National Park System" Under The Mineral Leasing Act': '. \n The Forest Service used its management discre-\ntion eight decades ago to approve and assist with con-\nstruction of the footpath, sign s, and shelters on national \nforest lands to help create the Appalachian Trail. But \nit did not transfer the lands traversed by that footpath \nto the Park Service. Neither did the Trails Act. \n At times, the Park Service has characterized the \nAppalachian Trail as a unit of the National Park Sys-tem for its internal labeling purposes. See, e.g., Pet. \nApp. 55a (No. 18-1584). And so me segments of the Trail \nare on lands inside national parks. But there is no ba-sis in the law for concluding that the entire 2,200-mile \nfootpath has been transformed into "lands in the Na-\ntional Park System" by a label. See 54 U.S.C. §§ 100102, \n100501. The term is a conven ient administrative catch-\nall, evident in the

In [None]:
for index, dictionary in enumerate(test_sections):
    print(f"Dictionary {index + 1}:")
    for key, value in dictionary.items():
        print(f"  Key: {key}, Value: {value}")
    print() 

Dictionary 1:
  Key: I. The Appalachian Trail Segment Crossed By The ACP Project Is Not "Land In The National Park System" Under The Mineral Leasing Act, Value: . 
 The Forest Service used its management discre-
tion eight decades ago to approve and assist with con-
struction of the footpath, sign s, and shelters on national 
forest lands to help create the Appalachian Trail. But 
it did not transfer the lands traversed by that footpath 
to the Park Service. Neither did the Trails Act. 
 At times, the Park Service has characterized the 
Appalachian Trail as a unit of the National Park Sys-tem for its internal labeling purposes. See, e.g., Pet. 
App. 55a (No. 18-1584). And so me segments of the Trail 
are on lands inside national parks. But there is no ba-sis in the law for concluding that the entire 2,200-mile 
footpath has been transformed into "lands in the Na-
tional Park System" by a label. See 54 U.S.C. §§ 100102, 
100501. The term is a conven ient administrative catch-
all, evide

In [None]:
# print(create_flexible_regex_fuzzy(toc_df.iloc[test_idx]['cleaned_args'][2]))

Try splitting on "conclusion" now after applying various text cleaning functions
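
A minimal sketch of one way to do that split, assuming the conclusion heading starts its own line (an assumption; `remove_conclusion` above matches the word anywhere in the text). `spaced_pattern` and `cut_at_last_heading` are hypothetical helper names, not functions defined in this notebook.

```python
import re


def spaced_pattern(phrase):
    """Build a regex that tolerates stray whitespace between letters."""
    letters = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return r"\s*".join(letters)


def cut_at_last_heading(text, phrase="conclusion"):
    """Drop everything from the last line that begins with `phrase`."""
    pattern = re.compile(
        r"^[ \t]*" + spaced_pattern(phrase),
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    # No heading found: return the text unchanged
    return text[:matches[-1].start()] if matches else text
```

Anchoring at the start of a line keeps a sentence such as "the conclusion of the court" in the body from being treated as the heading, at the cost of missing headings that were merged into the previous line during extraction.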

In [None]:
# toc_df['sections_re'] = toc_df.apply(re_match_headers_to_sections, axis=1,)

In [None]:
# toc_df['sections_re'] = [re_match_headers_to_sections(row, idx, toc_df) for idx, row in toc_df.iterrows()] # try this to avoid chain indexing warning

In [None]:
print(toc_df.iloc[0]["cleaned_text_alt"])

No. 20-5279  
 
IN THE 
Supreme Court of the United States 
__________ 
 
WILLIAM DALE WOODEN , 
Petitioner , 
 
v. 
 
 
UNITED STATES OF AMERICA , 
Respondent . 
__________ 
 
On Writ of Certiorari 
to the United States Court of Appeals 
for the Sixth Circuit  
__________ 
 
BRIEF OF FAMM 
AS AMICUS CURIAE  
IN SUPPORT OF PETITIONER 
__________ 
 
 M
ARY PRICE 
GENERAL COUNSEL  
FAMM 
1100 H Street, N.W. 
Suite 1000 
Washington, D.C. 20005 (202) 822-6700 
 
  
May 10, 2021
  
GREGORY G. RAPAWY  
   Counsel of Record 
MINSUK HAN 
KELLOGG , HANSEN , TODD, 
   FIGEL & FREDERICK , 
   P.L.L.C. 
1615 M Street, N.W. 
Suite 400 
Washington, D.C. 20036 (202) 326-7900 
(grapawy@kellogghansen.com) 
 
TABLE OF CONTENTS 
Page 
TABLE OF AUTH ORITIES ....................................... ii 
INTEREST OF AMICUS CURIAE  ............................ 1 
INTRODUCTION AN D SUMMARY ......................... 2 
ARGUMEN T ............................................................... 6 
I. THE RULE OF LE

In [None]:
toc_df.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,text_short,gpt_response_1,gpt_response_2,arguments,token_count,cleaned_text_alt,cleaned_content_alt,cleaned_args
0,Docket20-5279_Brief007.pdf,No. 20-5279 \n \nIN THE \nSupreme Court of th...,No. 20-5279 \n \nIN THE \nSupreme Court of th...,................................................,20-5279,SCOTUS,No. 20-5279 \n \nIN THE \nSupreme Court of th...,ARGUMEN T \nI. THE RULE OF LENITY SHOULD \n...,ARGUMEN T: \nI. THE RULE OF LENITY SHOULD \nB...,I. THE RULE OF LENITY SHOULD BE APPLIED RIGORO...,9067,No. 20-5279 \n \nIN THE \nSupreme Court of th...,.................................................,[I. THE RULE OF LENITY SHOULD BE APPLIED RIGOR...
1,Docket20-5279_Brief008.pdf,No. 20-5279 \nIN THE \nSupreme Court of the Un...,No. 20-5279 \nIN THE \nSupreme Court of the Un...,28 \niii \nTABLE OF AUTHORITIES \nPage(s) \nC...,20-5279,SCOTUS,No. 20-5279 \nIN THE \nSupreme Court of the Un...,I. AN EXPANSIVE INTERPRETATION OF THE ARMED CA...,SECTION I. AN EXPANSIVE INTERPRETATION OF THE ...,I. AN EXPANSIVE INTERPRETATION OF THE ARMED CA...,7180,No. 20-5279 \nIN THE \nSupreme Court of the Un...,No. 20-5279 \nIN THE \nSupreme Court of the Un...,[I. AN EXPANSIVE INTERPRETATION OF THE ARMED C...
2,Docket20-5279_Brief009.pdf,\n No. 20-5279 \nIn the Supreme Court of the...,\n No. 20-5279 \nIn the Supreme Court of the...,................................ ..............,20-5279,SCOTUS,\n No. 20-5279 \nIn the Supreme Court of the...,III) TABLE OF CONTENTS\n\nOpinions below ........,(I) QUESTION PRESENTED \nWhether petitioner’s...,Petitioner's 1997 burglary convictions are for...,15564,\n No. 20-5279 \nIn the Supreme Court of the...,.............................. ..................,[Petitioner's 1997 burglary convictions are fo...
3,Docket20-5279_Brief010.pdf,\n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...,\n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...,", but only as a matter of statutory pur-\npos...",20-5279,SCOTUS,\n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...,I. The Text of the “Occasions” Clause Supports...,I. The Text of the “Occasions” Clause Supports...,I. The Text of the 'Occasions' Clause Supports...,7537,\n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...,\n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...,[I. The Text of the 'Occasions' Clause Support...
4,Docket20-828_Brief001.pdf,\n No. 20-828 \n=============================...,\n No. 20-828 \n=============================...,...............................................,20-828,SCOTUS,\n No. 20-828 \n=============================...,QUESTION PRESENTED \nPARTIES TO THE PROCEEDIN...,QUESTION PRESENTED\n\nPARTIES TO THE PROCEEDIN...,A. The Ninth Circuit's Decision Must Be Revers...,2566,\n No. 20-828 \n=============================...,.................................................,[A. The Ninth Circuit's Decision Must Be Rever...


In [23]:
toc_df['sections_alt'] = toc_df.apply(fuzzy_match_by_line, axis=1, threshold=70)

In [24]:
def create_url(docket_number):
    return f"https://www.supremecourt.gov/docket/docketfiles/html/public/{docket_number}.html"

toc_df['url'] = toc_df['docket_num'].apply(create_url)

In [25]:
print(toc_df.iloc[0]["sections_alt"])

[{'I. The any-qualified-provider provision does not create individual rights enforceable under 42 U.S.C.  1983': 'Private individuals seeking to enforce Spending \nClause legislation through an action under 42 U.S.C. \n1983 face a demanding bar:  Congress must have unam-\nbiguously conferred individual federal rights in the stat-\nute.  Gonzaga Univ.  v. Doe, 536 U.S. 273, 280 (2002).  That  \n"stringent standard" will be satisfied only in the "atyp-\nical case."  Health & Hosp . Corp. of Marion County v. \nTalevski , 599 U.S. 166, 183, 186 (2023).  This case is not \natypical.  The Medicaid statute\'s any -qualified -provider \nprovision, 42 U.S.C. 1396a(a)(23)(A),  is buried in a long, \nundifferentiated list of requirements for state Medicaid \nplans , and its text lacks "explicit rights -creating" lan-\nguage .  Gonzaga , 536 U.S. at 284. \n16'}, {'A. Spending Clause statutes must unambiguously confer individual rights to be privately enforceable under Section 1983': '1. Sect ion 1

In [26]:
print(toc_df.head())

                     filename  \
0  Docket23-1275_Brief001.pdf   
1  Docket23-1275_Brief002.pdf   
2  Docket23-1275_Brief003.pdf   
3  Docket23-1275_Brief004.pdf   
4   Docket24-249_Brief001.pdf   

                                                text  \
0   \n No. 23-1275  \nIn the Supreme Court of the...   
1  NO. 23-1275 \nIN THE \nSupreme Court of the Un...   
2  No. 23-1275  \nIN THE  \nSupreme Court of the ...   
3   \n \n \nNo. 23-1275 \nIn the Supreme Court of...   
4   \n No. 24-249 \n \nIN THE  \nSupreme Court of...   

                                                 toc  \
0   \n No. 23-1275  \nIn the Supreme Court of the...   
1  NO. 23-1275 \nIN THE \nSupreme Court of the Un...   
2  No. 23-1275  \nIN THE  \nSupreme Court of the ...   
3   \n \n \nNo. 23-1275 \nIn the Supreme Court of...   
4   \n No. 24-249 \n \nIN THE  \nSupreme Court of...   

                                             content docket_num   court  \
0    ................................ ..............

Confirm that extraction worked properly
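
The checks in this part only flag briefs with no sections at all, so a brief where some headers failed to match would still pass. As a rough complement, the sketch below reports which headers came back with content and which are missing or empty for a single brief. `header_coverage` is a hypothetical helper, not part of this notebook; it assumes `sections` has the shape produced by `fuzzy_match_by_line` (a list of single-key dicts, or None).

```python
def header_coverage(headers, sections):
    """Split headers into (matched, missing_or_empty) for one brief."""
    found = {}
    for section in sections or []:
        for header, body in section.items():
            found[header] = body
    # A header counts as matched only if it has non-empty content
    matched = [h for h in headers if found.get(h)]
    missing = [h for h in headers if not found.get(h)]
    return matched, missing
```

It could be applied per row, for example `header_coverage(row['cleaned_args'], row['sections_alt'])`, to list the briefs that need manual inspection.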

In [27]:
mask = toc_df['sections_alt'].apply(lambda x: len(x) == 0 if x is not None else True)

empty_sec_df = toc_df[mask].reset_index(drop=True)
# Invert the mask to keep rows where the condition is False
df_complete = toc_df[~mask]

print(f"There are {len(empty_sec_df)} rows with empty sections with a different method of cleaning content")

There are 0 rows with empty sections with a different method of cleaning content


In [29]:
# Keep only rows where 'arguments' is present; copy so the column assignment below does not trigger SettingWithCopyWarning
filtered_df = toc_df[toc_df['arguments'].notna()].copy()

filtered_df['sections_alt'] = filtered_df['sections_alt'].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

# Reset the indices
filtered_df = filtered_df.reset_index(drop=True)

mask = filtered_df['sections_alt'].apply(lambda x: len(x) == 0 if x is not None else True)

empty_sec_df = filtered_df[mask].reset_index(drop=True)
# Invert the mask to keep rows where the condition is False

df_complete = filtered_df[~mask].reset_index(drop=True)

print(f"There are {len(empty_sec_df)} rows with empty sections after dropping NaN rows")

There are 0 rows with empty sections after dropping NaN rows


In [30]:
print(len(filtered_df))

27


In [31]:
empty_list_rows = toc_df[toc_df['sections_alt'].apply(lambda x: x == [])]

print(len(empty_list_rows))

0


In [32]:
empty_list_rows.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,token_count,arguments,cleaned_text_alt,cleaned_content_alt,cleaned_args,sections_alt,url


In [None]:
search_df = toc_df[toc_df['filename'] == 'Docket20-828_Brief001.pdf']

In [None]:
search_df.head()

Unnamed: 0,filename,text,content,docket_num,court,arguments,token_count,cleaned_text_alt,cleaned_content_alt,cleaned_args,sections_alt,url
4,Docket20-828_Brief001.pdf,\n No. 20-828 \n=============================...,...............................................,20-828,SCOTUS,A. The Ninth Circuit's Decision Must Be Revers...,2566,\n No. 20-828 \n=============================...,.................................................,[A. The Ninth Circuit's Decision Must Be Rever...,[{'A. The Ninth Circuit's Decision Must Be Rev...,https://www.supremecourt.gov/docket/docketfile...


In [None]:
test_idx = 4
print(toc_df.iloc[test_idx]['arguments'])

A. The Ninth Circuit's Decision Must Be Reversed Under The Canon Of Constitutional Avoidance  
B. If The Canon Of Constitutional Avoidance Does Not Apply, The Decision Below Should Be Reversed And The Ninth Circuit Required To Address The Respondent Agents' Seventh Amendment Rights


In [None]:
print(toc_df.iloc[test_idx]['cleaned_text_alt'])

 
 No. 20-828 
In The 
Supreme Court of the United States 
--------------------------------- ♦ --------------------------------- 
FEDERAL BUREAU OF INVESTIGATION, et al., 
Petitioners,        
v. 
YASSIR FAZAGA, et al., 
Respondents.        
--------------------------------- ♦ --------------------------------- 
On Petition For A Writ Of Certiorari 
To The United States Court Of Appeals 
For The Ninth Circuit 
--------------------------------- ♦ --------------------------------- 
BRIEF OF RESPONDENTS PAT ROSE, 
PAUL ALLEN AND KEVIN ARMSTRONG 
IN SUPPORT OF CERTIORARI 
--------------------------------- ♦ --------------------------------- 
ALEXANDER  H. C OTE 
SCHEPER  KIM & H ARRIS  LLP 
601 West Fifth Street, 12th Floor Los Angeles, CA 90071-2025 (213) 613-4655 acote@scheperkim.com Counsel for Respondents Pat Rose,  Paul Allen and Kevin Armstrong  
COCKLE LEGAL BRIEFS (800) 225-6964 
WWW.COCKLELEGALBRIEFS.COM 
i 
 
QUESTION PRESENTED 
 
  Section 1806 of the Foreign Intelligence Surveil

In [None]:
test_row = toc_df.iloc[test_idx]
test_sections = fuzzy_match_by_line(test_row)
for sec in test_sections:
  print()
  print(sec)

THE LOWERCASE HEADER IS:
a. the ninth circuit's decision must be reversed under the canon of constitutional avoidance  

THE LINE TO COMPARE IS:
are confident that they would defeat these claims, if 
only they had the same op portunity to defend th
THE SCORE IS:20
THE LINE TO COMPARE IS:
assertion in the district court that the identity of 
the individuals under invest igation in operation
THE SCORE IS:30
THE LINE TO COMPARE IS:
and the reasons for investigating those individuals are state secrets. as the district court found, tha
THE SCORE IS:35
THE LINE TO COMPARE IS:
act of 1978 (fisa), 50 u.s.c. § 1801, et seq. , displaces 
the state secrets privilege. as a result, th
THE SCORE IS:25
THE LINE TO COMPARE IS:
as requiring the district court to hear the state secrets evidence in camera and ex parte , and on that
THE SCORE IS:34
THE LINE TO COMPARE IS:
agents' "seventh amendment argument is prema-
ture," because plaintiffs' claims may be resolved before 
THE SCORE IS:31
THE LINE TO COM
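
The trace above scores each candidate line against the lowercased header. `fuzzy_match_by_line` is defined elsewhere in the notebook, so the following is only a sketch of the same idea. It swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy, so its scores are not comparable to the ones printed above; the function name `best_line_for_header` and the 0-100 scaling are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def best_line_for_header(header, lines):
    """Return (index, score) of the line most similar to the header.

    Scores are SequenceMatcher ratios scaled to 0-100. This stands in for
    a fuzzywuzzy score and is not the same metric.
    """
    target = header.lower().strip()
    best_index, best_score = -1, -1
    for i, line in enumerate(lines):
        ratio = SequenceMatcher(None, target, line.lower().strip()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

lines = [
    "are confident that they would defeat these claims, if",
    "A. The Ninth Circuit's Decision Must Be Reversed",
    "act of 1978 (fisa), 50 u.s.c. 1801, et seq.",
]
header = ("A. The Ninth Circuit's Decision Must Be Reversed "
          "Under The Canon Of Constitutional Avoidance")
index, score = best_line_for_header(header, lines)
print(index, score)
```

Because headers in the body are often wrapped across several lines, a line that contains only the first part of the header can still be the best match, as in this example.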

In [None]:
# use this to check the weird examples after running the pipeline
search_df = df_complete[df_complete['filename'] == weird_example]
search_df = search_df.reset_index(drop=True)
test_sections = search_df.iloc[0]['sections_alt']

for index, dictionary in enumerate(test_sections):
    print(f"Dictionary {index + 1}:")
    for key, value in dictionary.items():
        print(f"  Key: {key}, Value: {value}")
    print()

Dictionary 1:
  Key: I. The Appalachian Trail Segment Crossed By The ACP Project Is Not "Land In The National Park System" Under The Mineral Leasing Act, Value: . 
 The Forest Service used its management discre-
tion eight decades ago to approve and assist with con-
struction of the footpath, sign s, and shelters on national 
forest lands to help create the Appalachian Trail. But 
it did not transfer the lands traversed by that footpath 
to the Park Service. Neither did the Trails Act. 
 At times, the Park Service has characterized the 
Appalachian Trail as a unit of the National Park Sys-tem for its internal labeling purposes. See, e.g., Pet. 
App. 55a (No. 18-1584). And so me segments of the Trail 
are on lands inside national parks. But there is no ba-sis in the law for concluding that the entire 2,200-mile 
footpath has been transformed into "lands in the Na-
tional Park System" by a label. See 54 U.S.C. §§ 100102, 
100501. The term is a conven ient administrative catch-
all, evide

In [None]:
# use this to check the weird examples after running the pipeline
search_df = df_complete[df_complete['filename'] == weird_example]
search_df = search_df.reset_index(drop=True)
test_sections = search_df.iloc[0]['sections_alt']

for index, dictionary in enumerate(test_sections):
    print(f"Dictionary {index + 1}:")
    for key, value in dictionary.items():
        print(f"  Key: {key}, Value: {value}")
    print()

Dictionary 1:
  Key: I. The Appalachian Trail Segment Crossed By The ACP Project Is Not "Land In The National Park System" Under The Mineral Leasing Act, Value: . 
 The Forest Service used its management discre-
tion eight decades ago to approve and assist with con-
struction of the footpath, sign s, and shelters on national 
forest lands to help create the Appalachian Trail. But 
it did not transfer the lands traversed by that footpath 
to the Park Service. Neither did the Trails Act. 
 At times, the Park Service has characterized the 
Appalachian Trail as a unit of the National Park Sys-tem for its internal labeling purposes. See, e.g., Pet. 
App. 55a (No. 18-1584). And so me segments of the Trail 
are on lands inside national parks. But there is no ba-sis in the law for concluding that the entire 2,200-mile 
footpath has been transformed into "lands in the Na-
tional Park System" by a label. See 54 U.S.C. §§ 100102, 
100501. The term is a conven ient administrative catch-
all, evide

In [None]:
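# Note: `head` is referenced without being called, so this prints the bound
# method's repr (the whole frame); use filtered_df.head() for the first five rows
print(filtered_df.head)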

<bound method NDFrame.head of                         filename  \
0     Docket20-5279_Brief007.pdf   
1     Docket20-5279_Brief008.pdf   
2     Docket20-5279_Brief009.pdf   
3     Docket20-5279_Brief010.pdf   
4      Docket20-828_Brief001.pdf   
...                          ...   
3817  Docket16-1027_Brief009.pdf   
3818  Docket16-1027_Brief010.pdf   
3819   Docket17-387_Brief001.pdf   
3820   Docket17-387_Brief002.pdf   
3821   Docket17-387_Brief003.pdf   

                                                   text  \
0     No. 20-5279  \n \nIN THE \nSupreme Court of th...   
1     No. 20-5279 \nIN THE \nSupreme Court of the Un...   
2      \n No. 20-5279  \nIn the Supreme Court of the...   
3      \n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...   
...                                                 ...   
3817  No. 16-1027\nIn the Supreme Court of the Unite...   
3819   \n \nNo. 17 -387 \n \n \nIN THE \nSUPREME COU...   
3820   \n No. 17-387 \nIn the Supreme Court of the U...   
3821  

In [None]:
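# Note: `head` is referenced without being called, so this prints the bound
# method's repr (the whole frame); use df_complete.head() for the first five rows
print(df_complete.head)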

<bound method NDFrame.head of                         filename  \
0     Docket20-5279_Brief007.pdf   
1     Docket20-5279_Brief008.pdf   
2     Docket20-5279_Brief009.pdf   
3     Docket20-5279_Brief010.pdf   
4      Docket20-828_Brief001.pdf   
...                          ...   
3748  Docket16-1027_Brief009.pdf   
3749  Docket16-1027_Brief010.pdf   
3750   Docket17-387_Brief001.pdf   
3751   Docket17-387_Brief002.pdf   
3752   Docket17-387_Brief003.pdf   

                                                   text docket_num   court  \
0     No. 20-5279  \n \nIN THE \nSupreme Court of th...    20-5279  SCOTUS   
1     No. 20-5279 \nIN THE \nSupreme Court of the Un...    20-5279  SCOTUS   
2      \n No. 20-5279  \nIn the Supreme Court of the...    20-5279  SCOTUS   
3      \n \n \n \n \n \nNo. 20-5279 \n \n In the Sup...    20-5279  SCOTUS   
...                                                 ...        ...     ...   
3748  No. 16-1027\nIn the Supreme Court of the Unite...    16-1027  S

In [None]:
'''
columns_to_remove = ["content"]

# Removing the columns
df_complete = df_complete.drop(columns=columns_to_remove)
'''

In [None]:
test_idx = 0
test_args = df_complete.iloc[test_idx]["cleaned_args"]
for sec in test_args:
  print(sec)

I. THE RULE OF LENITY SHOULD BE APPLIED RIGOROUSLY TO MANDATORY MINIMUM SENTENCING STATUTES
A. The Rule of Lenity Helps Avoid the Particularly High Costs of Reading Mandatory Minimums Too Broadly
B. The Risk of Reading Mandatory Minimums Too Broadly Is Also Particularly High
II. JUDICIAL EXPERIENCE WITH 18 U.S.C. section 924(e)(1) SUPPORTS APPLYING THE RULE OF LENITY
A. Entries of Separate Structures on a Single Day or Night Are Not Clearly Multiple Different "Occasions"
B. Evading or Resisting Arrest Does Not Clearly Involve Multiple Different "Occasions"


In [None]:
test_secs = df_complete.iloc[test_idx]['sections_alt']
for sec in test_secs:
  print(sec)


{'A. The Rule of Lenity Helps Avoid the Particularly High Costs of Reading Mandatory Minimums Too Broadly': '1. The costs of erroneously construing mandatory \nminimum sentencing provisions too broadly are especially high, and the costs of construing them too \nnarrowly are especially low.  The greatest cost of \nreading a mandatory minimu m too broadly is that \nindividual defendants lose  their liberty.  Mandatory \nminimum provisions are, by design, severe.  They \noften tie an additional prison sentence of years or \ndecades (here, a decade and a half) to a single factual \ndetermination.  Such provisions therefore speak to \nthe core concern that has motivated courts to apply \nthe rule of lenity for centuries:  the "instinctive \n 9 \ndistaste[] against men lang uishing in prison unless \nthe lawmaker has clearly sa id they should."  Henry \nJ. Friendly, "Mr. Justice Frankfurter and the Reading of Statutes," in Benchmarks  196, 209 (1967), \nquoted in United States v. Bass , 404 

In [None]:
print(df_complete.iloc[test_idx]['cleaned_text_alt'])

No. 20-5279  
 
IN THE 
Supreme Court of the United States 
__________ 
 
WILLIAM DALE WOODEN , 
Petitioner , 
 
v. 
 
 
UNITED STATES OF AMERICA , 
Respondent . 
__________ 
 
On Writ of Certiorari 
to the United States Court of Appeals 
for the Sixth Circuit  
__________ 
 
BRIEF OF FAMM 
AS AMICUS CURIAE  
IN SUPPORT OF PETITIONER 
__________ 
 
 M
ARY PRICE 
GENERAL COUNSEL  
FAMM 
1100 H Street, N.W. 
Suite 1000 
Washington, D.C. 20005 (202) 822-6700 
 
  
May 10, 2021
  
GREGORY G. RAPAWY  
   Counsel of Record 
MINSUK HAN 
KELLOGG , HANSEN , TODD, 
   FIGEL & FREDERICK , 
   P.L.L.C. 
1615 M Street, N.W. 
Suite 400 
Washington, D.C. 20036 (202) 326-7900 
(grapawy@kellogghansen.com) 
 
TABLE OF CONTENTS 
Page 
TABLE OF AUTH ORITIES ....................................... ii 
INTEREST OF AMICUS CURIAE  ............................ 1 
INTRODUCTION AN D SUMMARY ......................... 2 
ARGUMEN T ............................................................... 6 
I. THE RULE OF LE

Test code to get the section structure

In [None]:
test_idx = 1974
test_args = df_complete.iloc[test_idx]["cleaned_args"]
for sec in test_args:
  print(sec)

I. Marcel Cannot Deny The Circuit Split
II. Marcel Cannot Reconcile The Decision Below With This Court's Precedents Or With The Federal Rules of Civil Procedure
III. Marcel's Vehicle Objections Are Illusory


In [33]:
import re
import pandas as pd

def determine_hierarchy(data):
    """Infer parent/child links for a flat list of argument headers.

    Levels are inferred from the numbering prefix: upper-case Roman numerals
    (top level), upper-case letters, numbers or lower-case Roman numerals,
    then lower-case letters. Returns one dict per header with its index,
    header text, parent index (or None), and list of child indices.
    """
    result = []
    stack = []  # (index, section_type) pairs for the currently open ancestors

    # Limited Roman numeral patterns to I-XX (1-20)
    roman_upper_re = re.compile(r'^(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII|XIV|XV|XVI|XVII|XVIII|XIX|XX)\.\s')
    roman_lower_re = re.compile(r'^(i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\.\s')
    letter_upper_re = re.compile(r'^[A-Z]\.\s')
    number_re = re.compile(r'^\d+\.\s')
    letter_lower_re = re.compile(r'^[a-z]\.\s')

    def get_section_type(item):
        # Order matters: the Roman patterns are tested before the letter
        # patterns, so 'I.', 'V.' and 'X.' (and their lower-case forms) are
        # classified as Roman numerals rather than letters.
        if roman_upper_re.match(item):
            return 'roman_upper'
        elif letter_upper_re.match(item):
            return 'letter_upper'
        elif number_re.match(item):
            return 'number'
        elif roman_lower_re.match(item):
            return 'roman_lower'
        elif letter_lower_re.match(item):
            return 'letter_lower'
        else:
            return 'root'

    def get_parent_index(stack, section_type):
        # Pop entries at the same or a deeper level than the new header;
        # whatever remains on top of the stack is its parent.
        if section_type == 'roman_upper':
            while stack and stack[-1][1] not in {'root', 'roman_upper'}:
                stack.pop()
            return None  # Ensure no parent for top-level sections
        elif section_type == 'letter_upper':
            while stack and stack[-1][1] not in {'roman_upper', 'root'}:
                stack.pop()
        elif section_type in {'number', 'roman_lower'}:
            while stack and stack[-1][1] not in {'letter_upper', 'roman_upper', 'root'}:
                stack.pop()
        elif section_type == 'letter_lower':
            while stack and stack[-1][1] not in {'number', 'roman_lower', 'letter_upper', 'roman_upper', 'root'}:
                stack.pop()
        # Headers without a recognised prefix ('root') skip every branch
        # above and attach to whatever is currently on top of the stack.
        return stack[-1][0] if stack else None

    for i, item in enumerate(data):
        section_type = get_section_type(item)
        parent_index = get_parent_index(stack, section_type)
        result.append((item, parent_index, []))
        if parent_index is not None:
            # Register this header as a child of its parent
            result[parent_index][2].append(i)
        stack.append((i, section_type))

    structured_result = [{
        "index": i,
        "header": item,
        "parent": result[i][1],
        "children": result[i][2]
    } for i, item in enumerate(data)]

    return structured_result
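
One consequence of the check order in `get_section_type` is that the single-letter prefixes `I.`, `V.`, and `X.` satisfy both the upper-case Roman pattern and the upper-case letter pattern, and the Roman branch wins. The short sketch below reuses two of the patterns above to list the ambiguous prefixes; the variable name `ambiguous` and the placeholder heading text are illustrative only.

```python
import re
import string

roman_upper_re = re.compile(r'^(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII|XIV|XV|XVI|XVII|XVIII|XIX|XX)\.\s')
letter_upper_re = re.compile(r'^[A-Z]\.\s')

# Build a header for every capital letter and keep those matching both patterns.
ambiguous = [
    letter for letter in string.ascii_uppercase
    if roman_upper_re.match(f"{letter}. Heading")
    and letter_upper_re.match(f"{letter}. Heading")
]
print(ambiguous)
```

In practice this means a run of lettered subsections that continues past `H.` would have its `I.` entry classified as a new top-level section.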

In [34]:
def apply_hierarchy(row):
    cleaned_args = row['cleaned_args']
    if isinstance(cleaned_args, list):
        return determine_hierarchy(cleaned_args)
    else:
        return None

df_complete['arg_structure'] = df_complete.apply(apply_hierarchy, axis=1)

print(df_complete[['cleaned_args', 'arg_structure']])

                                         cleaned_args  \
0   [I. The any-qualified-provider provision does ...   
1   [I. The any-qualified-provider provision does ...   
2   [I.  PPSAT IS AN EXEMPLARY PROVIDER OFFERING V...   
3   [The Free-Choice-Of-Provider Provision Unambig...   
4   [I. Congress Enacted The IDEA To Supplement Th...   
5   [A. Proof of discriminatory intent is not requ...   
6   [I. NOTICES OF APPEAL FILED AFTER THE ORDINARY...   
7   [I. Rule 23(b)(3) Prohibits Certification If A...   
8   [I. Article III prohibits the certification of...   
9   [I. Damages classes containing members without...   
10  [I. The Structure Of The Task Force Does Not V...   
11  [I.  ELIMINATION OF FATAL EPIDEMICS IN THE UNI...   
12  [I. The Preventive Services Provision Rests on...   
13  [I. Colorectal cancer screening saves lives, I...   
14  [I. Members of the U.S. Preventive Services Ta...   
15  [I. Revenue-Raising Is a Legislative Power, II...   
16  [I. THE ELEVENTH CIRCUIT ’S
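
Since `arg_structure` stores each link twice (a `parent` on the child and a `children` list on the parent), a quick consistency check can catch records that were edited by hand or corrupted on reload. This is a sketch only; `find_link_errors` is a hypothetical name, and the sample records are made up to match the shape of the output of `determine_hierarchy`.

```python
def find_link_errors(structure):
    """Return (parent, child) index pairs whose two-way links disagree."""
    errors = []
    for node in structure:
        # Every listed child should point back to this node.
        for child in node["children"]:
            if structure[child]["parent"] != node["index"]:
                errors.append((node["index"], child))
        # Every node with a parent should appear in that parent's children.
        parent = node["parent"]
        if parent is not None and node["index"] not in structure[parent]["children"]:
            errors.append((parent, node["index"]))
    return errors

good = [
    {"index": 0, "header": "I. First", "parent": None, "children": [1]},
    {"index": 1, "header": "A. Sub", "parent": 0, "children": []},
]
bad = [
    {"index": 0, "header": "I. First", "parent": None, "children": []},
    {"index": 1, "header": "A. Sub", "parent": 0, "children": []},
]
print(find_link_errors(good), find_link_errors(bad))
```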

In [35]:
to_print = df_complete.iloc[0]['arg_structure']
for sec in to_print:
  print(sec)

{'index': 0, 'header': 'I. The any-qualified-provider provision does not create individual rights enforceable under 42 U.S.C.  1983', 'parent': None, 'children': [1, 2, 3, 4]}
{'index': 1, 'header': 'A. Spending Clause statutes must unambiguously confer individual rights to be privately enforceable under Section 1983', 'parent': 0, 'children': []}
{'index': 2, 'header': 'B. The any-qualified-provider provision does not unambiguously confer individual federal rights', 'parent': 0, 'children': []}
{'index': 3, 'header': 'C. Finding a privately enforceable individual right in this case would create line-drawing problems', 'parent': 0, 'children': []}
{'index': 4, 'header': 'D. Other enforcement mechanisms protect beneficiaries', 'parent': 0, 'children': []}
{'index': 5, 'header': 'II. The court of appeals erred in finding an individual federal right', 'parent': None, 'children': []}
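
The records above are easier to review as an indented outline. The sketch below walks the `children` lists recursively; `format_outline` is a hypothetical helper, the headers are shortened stand-ins for the real ones, and it assumes each record's position in the list equals its `index`, as in the output above.

```python
def format_outline(structure):
    """Render determine_hierarchy-style records as an indented outline."""
    lines = []

    def visit(index, depth):
        node = structure[index]
        lines.append("  " * depth + node["header"])
        for child in node["children"]:
            visit(child, depth + 1)

    # Start from the records that have no parent.
    for node in structure:
        if node["parent"] is None:
            visit(node["index"], 0)
    return lines

# Shortened headers modeled on the output above.
structure = [
    {"index": 0, "header": "I. First argument", "parent": None, "children": [1, 2]},
    {"index": 1, "header": "A. Sub-argument", "parent": 0, "children": []},
    {"index": 2, "header": "B. Sub-argument", "parent": 0, "children": []},
    {"index": 3, "header": "II. Second argument", "parent": None, "children": []},
]
outline = format_outline(structure)
print("\n".join(outline))
```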


In [36]:
test_idx = 0  # reset: the earlier value (1974) is out of range for this 27-row dataset
test_secs = df_complete.iloc[test_idx]['sections_alt']
for sec in test_secs:
  print(sec)

{'I. The any-qualified-provider provision does not create individual rights enforceable under 42 U.S.C.  1983': 'Private individuals seeking to enforce Spending \nClause legislation through an action under 42 U.S.C. \n1983 face a demanding bar:  Congress must have unam-\nbiguously conferred individual federal rights in the stat-\nute.  Gonzaga Univ.  v. Doe, 536 U.S. 273, 280 (2002).  That  \n"stringent standard" will be satisfied only in the "atyp-\nical case."  Health & Hosp . Corp. of Marion County v. \nTalevski , 599 U.S. 166, 183, 186 (2023).  This case is not \natypical.  The Medicaid statute\'s any -qualified -provider \nprovision, 42 U.S.C. 1396a(a)(23)(A),  is buried in a long, \nundifferentiated list of requirements for state Medicaid \nplans , and its text lacks "explicit rights -creating" lan-\nguage .  Gonzaga , 536 U.S. at 284. \n16'}
{'A. Spending Clause statutes must unambiguously confer individual rights to be privately enforceable under Section 1983': '1. Sect ion 198

In [37]:
print(df_complete.iloc[test_idx]['text'])

 
 No. 23-1275  
In the Supreme Court of the United States  
 
EUNICE MEDINA , INTERIM DIRECTOR , SOUTH CAROLINA 
DEPARTMENT OF HEALTH AND  HUMAN SERVICES ,  
PETITIONER  
v. 
PLANNED PARENTHOOD SOUTH ATLANTIC , ET AL. 
 
ON WRIT OF CERTIORARI  
TO THE UNITED STATES COURT OF APPEALS  
FOR THE FOURTH  CIRCUIT  
 
BRIEF FOR THE UNITED STATES  
AS AMICUS CURIAE SUPPORTING PETITIONER  
 
  SARAH M. HARRIS  
Acting Solicitor  General  
Counsel  of Record  
BRETT A. SHUMATE  
Acting  Assistant  Attorney 
General  
EDWIN S. KNEEDLER  
Deputy Solicitor General  
ZOE A. JACOBY  
Assistant to the Solicitor 
General  
JOSHUA M. SALZMAN  
LAURA E. MYRON  
Attorneys  
Department  of Justice  
Washington, D.C. 20530 -0001  
SupremeCtBriefs @usdoj.gov  
(202) 514 -2217  
(I) QUESTION  PRESENTED  
To participate in the Medicaid program, States must 
submit and maintain  a “plan for medical assistance” 
that satisfies a comprehensive  list of federal require-
ments.  42 U.S.C. 1396a(a)  and (b) .  One 

These are additional tests for clean_table_of_contents. They should no longer be needed and can be ignored.

In [None]:
# df_complete.to_json('../data/complete_datasets/dataset_complete_08222024.jsonl', orient='records', lines=True)

In [38]:
df_complete.head()

Unnamed: 0,filename,text,toc,content,docket_num,court,token_count,arguments,cleaned_text_alt,cleaned_content_alt,cleaned_args,sections_alt,url,arg_structure
0,Docket23-1275_Brief001.pdf,\n No. 23-1275 \nIn the Supreme Court of the...,\n No. 23-1275 \nIn the Supreme Court of the...,................................ ..............,23-1275,SCOTUS,11528,I. The any-qualified-provider provision does n...,\n No. 23-1275 \nIn the Supreme Court of the...,.............................. ..................,[I. The any-qualified-provider provision does ...,[{'I. The any-qualified-provider provision doe...,https://www.supremecourt.gov/docket/docketfile...,"[{'index': 0, 'header': 'I. The any-qualified-..."
1,Docket23-1275_Brief002.pdf,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,................................................,23-1275,SCOTUS,14888,I. The any-qualified-provider provision does n...,NO. 23-1275 \nIN THE \nSupreme Court of the Un...,.................................................,[I. The any-qualified-provider provision does ...,[{'I. The any-qualified-provider provision doe...,https://www.supremecourt.gov/docket/docketfile...,"[{'index': 0, 'header': 'I. The any-qualified-..."
2,Docket23-1275_Brief003.pdf,No. 23-1275 \nIN THE \nSupreme Court of the ...,No. 23-1275 \nIN THE \nSupreme Court of the ...,................................ ..............,23-1275,SCOTUS,12633,I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING VI...,No. 23-1275 \nIN THE \nSupreme Court of the ...,.............................. ..................,[I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING V...,[{'I. PPSAT IS AN EXEMPLARY PROVIDER OFFERING...,https://www.supremecourt.gov/docket/docketfile...,"[{'index': 0, 'header': 'I. PPSAT IS AN EXEMP..."
3,Docket23-1275_Brief004.pdf,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,................................................,23-1275,SCOTUS,16883,The Free-Choice-Of-Provider Provision Unambigu...,\n \n \nNo. 23-1275 \nIn the Supreme Court of...,.................................................,[The Free-Choice-Of-Provider Provision Unambig...,[{'D. The Free-Choice-Of-Provider Provision’s ...,https://www.supremecourt.gov/docket/docketfile...,"[{'index': 0, 'header': 'The Free-Choice-Of-Pr..."
4,Docket24-249_Brief001.pdf,\n No. 24-249 \n \nIN THE \nSupreme Court of...,\n No. 24-249 \n \nIN THE \nSupreme Court of...,................................................,24-249,SCOTUS,4409,I. Congress Enacted The IDEA To Supplement The...,\n No. 24-249 \n \nIN THE \nSupreme Court of...,.................................................,[I. Congress Enacted The IDEA To Supplement Th...,[{'I. Congress Enacted The IDEA To Supplement ...,https://www.supremecourt.gov/docket/docketfile...,"[{'index': 0, 'header': 'I. Congress Enacted T..."


In [None]:
columns_to_copy = ['filename', 'url', 'cleaned_text_alt', 'cleaned_content_alt', 'cleaned_args', 'sections_alt', 'arg_structure']

# Take a random sample of up to 100 rows with the specified columns
# (sample() raises ValueError if n exceeds the number of rows)
sample_df = df_complete[columns_to_copy].sample(n=min(100, len(df_complete)), random_state=1)

In [None]:
sample_idx = 0
print(sample_df.iloc[sample_idx]['cleaned_text_alt'])

 No. 17-269 
In The   
Supreme Court of the United States   
STATE OF WASHINGTON ,  
 Petitioner , 
v.  
UNITED STATES OF AMERICA , ET AL .  
 Respondents .  
ON PETITION FOR A WRIT OF CERTIORARI  
TO THE UNITED STATES COURT OF APPEALS  
FOR THE NINTH CIRCUIT   
REPLY TO BRIEF S IN OPPOSITION   
 
ROBERT W. FERGUSON  
   Attorney General  
NOAH G. PURCELL  
   Solicitor General  
   Counsel of Record  
FRONDA C. WOODS  
   Assistant Attorney General  
JAY D. GECK 
   Deputy Solicitor General  
1125 Washington Street SE  
Olympia, WA   98504 -0100  
360-753-6200  
noah.purcell@atg.wa.go v  
 
i 
 
 
 TABLE OF CONTENTS  
 
REPLY BRIEF FOR PETITIONER  ........................... 1 
I. Respondents Mischaracterize the Ninth 
Circuit's Opinion  ................................ ............... 2 
II. The Conflict with Fishing Vessel Is Real  .......... 5 
III. The Ninth Circuit's Rejection of  
Equitable Defenses in Treaty  
Cases Creates a Real Conflict  .......................... 7 
IV. The

In [None]:
print(sample_df.iloc[sample_idx]["cleaned_args"])

["I. Respondents Mischaracterize the Ninth Circuit's Opinion", 'II. The Conflict with Fishing Vessel Is Real', "III. The Ninth Circuit's Rejection of Equitable Defenses in Treaty Cases Creates a Real Conflict", "IV. The Injunction is Irreconcilable with This Court's Precedent"]


In [None]:
print(sample_df.iloc[sample_idx]['sections_alt'])

[{"I. Respondents Mischaracterize the Ninth Circuit's Opinion": 'To minimize the conflict with Fishing  Vessel  \nand the importance of this case, Respondents \nmischaracterize the Ninth Circuit \'s opinion in two \ncrucial respects.  \n First, Respondents claim that the Ninth \nCircuit \'s opinion narrowly "addressed whether the \nTreaties place any limits on the State \'s ability to \ndestroy the fisheries that form the Treaty res, and \nheld that they do. " Tribes BIO at 17; id. at 18; U.S. \nBIO at 14. If that were all that the panel held, this \ncase would be as unimportant as Respondents claim. \nIndeed, the Ninth Circuit held 40 years ago that \n"neither the treaty Indians nor the state on behalf of \nits citizens may permit the subject matter of these \ntreaties to be destroyed. " United States v. Wash ington, \n520 F.2d 676, 685 (9th Cir. 1975) . The State has made \nclear that it has no objection to that rule. Ninth \nCircuit Oral Argument at 50:20 to 51:10  (Oct. 16, \n2015)
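
Comparing `cleaned_args` with `sections_alt` by eye gets slow for longer briefs. The following sketch lists headers that have no section body, assuming each entry of `sections_alt` is a dict mapping one header to its text, as in the output above; `headers_without_sections` and the sample data are hypothetical.

```python
def headers_without_sections(cleaned_args, sections):
    """Return the headers in cleaned_args that have no non-empty section body."""
    found = {}
    for entry in sections:
        # Each entry is assumed to be a dict mapping one header to its body text.
        found.update(entry)
    return [h for h in cleaned_args if not found.get(h, "").strip()]

cleaned_args = ["I. First", "II. Second", "III. Third"]
sections = [{"I. First": "Body of the first argument."}, {"III. Third": "  "}]
missing = headers_without_sections(cleaned_args, sections)
print(missing)
```

A header counts as missing both when it never appears as a key and when its body is only whitespace.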

In [None]:
sample_idx = 1
to_print = sample_df.iloc[sample_idx]['cleaned_args']
for sec in to_print:
  print(sec)

I. The ability of EPA to issue sporadic, unforeseeable, and largely unreviewable exemptions to small refineries imposes significant economic damage on biofuels producers
II. Affirming the Tenth Circuit's opinion will rightfully undercut EPA's ability to arbitrarily award secret financial windfalls to small refineries that escape judicial review and disrupt biofuels markets


In [None]:
print(sample_df.iloc[sample_idx]['cleaned_text_alt'])

 
 
 No. 20-472 
IIn the Supreme Court of the United States 
 
HOLLYFRONTIER CHEYENNE REFINING, LLC, 
et al.,  
 
Petitioners, 
v. 
 
RENEWABLE FUELS ASSOCIATION, et al., 
Respondents. 
       
 
On Writ of Certiorari to the 
United States Court of Appeals 
for the Tenth Circuit 
       
 
BRIEF OF ADVANCED BIOFUELS 
ASSOCIATION AS AMICUS CURIAE 
SUPPORTING RESPONDENTS 
      
Rafe Petersen 
Counsel of Record 
Holland & Knight LLP 800 17
th Street, N.W., Suite 1100  
Washington, D.C. 20006 
(202) 419-2481 
rafe.petersen@hklaw.com  
 
Counsel for Amicus Curiae 
Advanced Biofuels Association 
March 31, 2021 
 i 
 
 TTABLE OF CONTENTS 
 
INTEREST OF AM ICUS CURIAE ..................... 1  
OVERVIEW OF THE RFS PROGRAM AND 
SMALL REFINERY EXEMPTIONS ................... 2  
SUMMARY OF ARGUMENT ............................ 11  
ARGUMENT ...................................................... 13  
I. The ability of EPA to  issue sporadic, 
unforeseeable, and largely unreviewable 
exemptions to 

In [None]:
to_print = sample_df.iloc[sample_idx]['sections_alt']
for sec in to_print:
  print(sec)

{'I. The ability of EPA to issue sporadic, unforeseeable, and largely unreviewable exemptions to small refineries imposes significant economic damage on biofuels producers': '.  \nPetitioners portray the small refinery \nexemption provisions of the RFS program as a flexible \ntool created by Congress to guarantee small refineries \nperpetual financial success. Regardless of the underlying cause of a small refinery\'s financial trouble – from a global COVID-19 pandemic to its \ninability to adapt to the slow and foreseeable \nincreases in biofuel blending requirements of the RFS program – Petitioners believe the solution is for EPA to issue small refineries extensions of their \nexemptions from RFS obligations "at any time" \nregardless of how long it has been since the small \nrefinery last held an exemption. Curtailing EPA\'s \nsupposed authority to issue dozens of exemptions, Petitioners argue, will produce a wave of small \nrefinery failures. Pet\'rs\' Br. 4, 17. This argument \nign

In [43]:
print(df_complete.iloc[4]['arguments'])

A. Proof of discriminatory intent is not required to demonstrate a violation of Title II or Section 504, but such proof is required to recover damages under those provisions
1. To establish a violation of Title II or Section 504, a plaintiff need not prove that the defendant intended to engage in disability discrimination
2. To obtain compensatory damages for Title II and Section 504 violations, a plaintiff must prove intentional discrimination, which may be shown by proving deliberate indifference
B. Title II and Section 504 claims are subject to the same intent standards inside and outside the educational context
1. The texts of Title II and Section 504 indicate that the same intent standards apply inside and outside the educational context
2. The statutory context supports reading Title II and Section 504 to impose the same intent  standards inside and outside the educational  context
3. The Eighth Circuit and respondents have identified no sound basis for applying a heightened inte

In [42]:
# Drop the row whose index label is 4
df_complete = df_complete.drop(index=4)

# Reset the index so the remaining rows are numbered contiguously
df_complete = df_complete.reset_index(drop=True)

In [44]:
df_complete.to_json('../data/new_dataset/new_data_03212025.jsonl', orient='records', lines=True)
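
With `orient='records'` and `lines=True`, the export holds one JSON object per line, so it can be spot-checked without pandas. This is a minimal sketch using only the standard library; `read_jsonl`, the temporary file, and the sample records are illustrative and are not the notebook's data.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Two made-up records written in the same one-object-per-line layout.
records = [
    {"filename": "example_a.pdf", "cleaned_args": ["I. First"]},
    {"filename": "example_b.pdf", "cleaned_args": ["I. First", "A. Sub"]},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    loaded = read_jsonl(path)
print(len(loaded), loaded[1]["cleaned_args"])
```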