# RegRecall Dataset Construction

This notebook contains code to construct the RegRecall dataset.

## Part 0: Imports

In [3]:
%load_ext autoreload
%autoreload 2
from pypdf import PdfReader
from lit_scraper import scrape_sec_complaints
from sec_complaint_parser import SECComplaintParser
from parser_config import reg_headings
import random
import os
from tqdm import tqdm
import json

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Part 1: Litigation Web Scraping

First, we scrape all litigation proceedings opened by the SEC from [here](https://www.sec.gov/litigation/litreleases). Once scraping completes, move the downloaded documents from your default "Downloads" folder to a more permanent location; the cells below read them from data/sec_complaints.

In [5]:
scrape_sec_complaints()

Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26352.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26349.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26348.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26345.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26343.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26341.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26339.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26336.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26333.pdf
Stopped
Total size of downloaded PDFs: 0 bytes
Total number of downloaded files: 10
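
The move out of "Downloads" can be scripted. What follows is a minimal sketch, assuming the browser saves files to a single downloads folder and that complaint files follow the comp*.pdf naming seen in the URLs above. The helper name move_downloaded_complaints, its parameters, and the filename pattern are hypothetical and are not part of lit_scraper.

```python
import shutil
from pathlib import Path


def move_downloaded_complaints(downloads_dir, target_dir, pattern="comp*.pdf"):
    """Move files matching `pattern` from downloads_dir into target_dir.

    Hypothetical helper: both directories and the filename pattern are
    assumptions. Returns the destination paths, sorted by filename.
    """
    downloads = Path(downloads_dir).expanduser()
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for pdf_path in sorted(downloads.glob(pattern)):
        destination = target / pdf_path.name
        if destination.exists():
            # Leave files that were already moved alone instead of overwriting.
            continue
        shutil.move(str(pdf_path), str(destination))
        moved.append(destination)
    return moved
```

A call such as move_downloaded_complaints("~/Downloads", "data/sec_complaints") would then place the files where the parsing cells expect them. Skipping existing files keeps the step safe to re-run after a partial scrape.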


## Part 2: Initial Parsing

Next, the documents must be parsed to identify and extract information about the actions taken and the violations alleged.

First, parse the PDFs to extract their text.

In [6]:
all_texts = SECComplaintParser.get_all_pdf_texts("data/sec_complaints", save_text=True, save_directory="data/sec_complaints_text")

100%|██████████| 10/10 [00:01<00:00,  8.10it/s]


Alternatively, if the text has already been extracted, load the text in.

In [23]:
all_texts = SECComplaintParser.load_pdf_texts("data/sec_complaints_text")

print("{num_texts} texts loaded in!".format(num_texts=len(all_texts)))

100%|██████████| 10/10 [00:00<00:00, 7777.31it/s]

10 texts loaded in!


The extracted texts are then saved to a single JSON file.

In [7]:
output_dir = "data/sec_complaints_json"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "complaints.json")

with open(output_path, "w", encoding="utf-8") as json_file:
    json.dump(all_texts, json_file, indent=2, ensure_ascii=False)

print(f"Saved JSON to {output_path}")

Saved JSON to data/sec_complaints_json/complaints.json
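
To confirm that the saved file round-trips, the JSON can be read back and compared with the in-memory texts. This is a minimal sketch, assuming all_texts is a dictionary mapping a complaint key to its text (the notebook does not show its exact type); save_and_reload is a hypothetical helper name.

```python
import json
import os


def save_and_reload(texts, output_path):
    """Write `texts` as UTF-8 JSON, then read the file back and return it."""
    # Fall back to the current directory if output_path has no folder part.
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)

    with open(output_path, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False writes characters such as "§" as-is
        # instead of as \uXXXX escapes.
        json.dump(texts, json_file, indent=2, ensure_ascii=False)

    with open(output_path, "r", encoding="utf-8") as json_file:
        return json.load(json_file)
```

If the reloaded object equals the original, the file is safe to use as the input for later steps. Opening the file with an explicit UTF-8 encoding matters here because ensure_ascii=False leaves non-ASCII characters unescaped.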


Now let us split each text into sections.

In [18]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)
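
The internals of SECComplaintParser.section_all_texts are not shown in this notebook. As a rough illustration of heading-based sectioning, the sketch below assumes that a heading is a short line with no lowercase letters, that heading keys are lowercased and reduced to letters only, and that text before the first heading is stored under DOC_START. All three function names are hypothetical, and the heuristic is an assumption rather than the parser's actual rule.

```python
import re


def normalize_heading(line):
    """Lowercase a heading and keep only a-z, e.g. 'COUNT I' -> 'counti'."""
    return re.sub(r"[^a-z]", "", line.lower())


def looks_like_heading(line):
    """Assumed heuristic: a short line that has letters, none of them lowercase."""
    stripped = line.strip()
    has_letters = any(ch.isalpha() for ch in stripped)
    return has_letters and stripped == stripped.upper() and len(stripped) <= 80


def section_text(text):
    """Split text into {heading_key: body}, with 'DOC_START' as the first key."""
    sections = {"DOC_START": []}
    current = "DOC_START"
    for line in text.splitlines():
        key = normalize_heading(line) if looks_like_heading(line) else ""
        if key:
            # Start (or resume) the section named by this heading.
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {heading: "\n".join(body).strip() for heading, body in sections.items()}
```

Under these assumptions, a line such as "JURISDICTION AND VENUE" becomes the key jurisdictionandvenue. Repeated headings are merged into one section, which is a simplification.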

In [10]:
# Count how many complaints contain at least one heading that matches
# one of the patterns in reg_headings.
rule_section = 0
no_rule_section = 0
no_rule_sections = []  # keys of complaints with no matching heading

print(reg_headings)

for text_key, sectioned_text in sectioned_texts.items():
    # True as soon as any heading fully matches any pattern.
    has_rule_section = any(
        SECComplaintParser.regex_fullmatch_any(heading, reg_headings)
        for heading in sectioned_text
    )

    if has_rule_section:
        rule_section += 1
    else:
        no_rule_section += 1
        no_rule_sections.append(text_key)

print(rule_section)
print(no_rule_section)
print(no_rule_sections)

['.*claimforrelief', 'prayerforrelief', 'claimsforrelief', 'claimsforaction', 'claimforrelief', '.*claimforaction', '.*causeofaction', 'count.*']
9
1
['comp26345']
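
For reference, the matching step can be reproduced with the standard library alone. This is a minimal sketch, assuming regex_fullmatch_any returns a truthy value when a heading fully matches at least one pattern; the helpers fullmatch_any and find_rule_headings are hypothetical stand-ins, not the parser's own code. The pattern list is copied from the printed output above.

```python
import re

# Patterns copied from the printed value of reg_headings.
REG_HEADINGS = [
    '.*claimforrelief', 'prayerforrelief', 'claimsforrelief',
    'claimsforaction', 'claimforrelief', '.*claimforaction',
    '.*causeofaction', 'count.*',
]


def fullmatch_any(heading, patterns):
    """Return True if `heading` fully matches at least one regex pattern."""
    return any(re.fullmatch(pattern, heading) for pattern in patterns)


def find_rule_headings(headings, patterns=REG_HEADINGS):
    """Return the headings that match any pattern, in their original order."""
    return [heading for heading in headings if fullmatch_any(heading, patterns)]
```

Because re.fullmatch requires the whole string to match, a heading like complaintforinjunctiveandotherrelief is not picked up by '.*claimforrelief'. Note that 'count.*' accepts any heading that starts with "count", so a heading such as "counterclaims" would also match under this sketch.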


## Part 3: Section Parsing
 
Next, we extract each section from each text filing using the functions in `sec_complaint_parser.py`.
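
PDF extraction can fuse stray digits, such as margin line numbers, onto the front of a heading. The sketch below shows one possible cleanup, assuming heading keys are meant to start with a letter. strip_leading_digits is a hypothetical helper; it is not SECComplaintParser.clean_string, whose behavior is not shown in this notebook.

```python
import re


def strip_leading_digits(heading):
    """Drop any run of digits at the start of a heading key."""
    return re.sub(r"^\d+", "", heading)
```

Only digits at the very start are removed, so keys such as DOC_START and counti pass through unchanged.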

In [22]:
# TODO: clean each heading with SECComplaintParser.clean_string; for now, print the raw headings.
for text_key, sectioned_text in sectioned_texts.items():
    for heading in sectioned_text:
        print(heading)
        # print(SECComplaintParser.clean_string(heading))

DOC_START
defendants
complaint
jurisdictionandvenue
counti
countii
DOC_START
defendants
jurisdictionandvenue
firstclaimforrelief
secondclaimforrelief
thirdclaimforrelief
fourthclaimforrelief
prayerforrelief
DOC_START
defendants
complaint
summary
jurisdictionandvenue
defendant
firstclaimforrelief
prayerforrelief
DOC_START
defendants
complaintforinjunctiveandotherrelief
counti
complaint
countii
countiii
countiv
countv
countvi
countvii
DOC_START
summary
complaint
defendants
claimsforrelief
firstclaimforrelief
10111213141516171819202122232425262728secondclaimforrelief
thirdclaimforrelief
prayerforrelief
DOC_START
defendant
complaint
summary
jurisdictionandvenue
firstclaimforrelief
secondclaimforrelief
prayerforrelief
DOC_START
DOC_START
defendant
complaint
preliminarystatement
jurisdictionandvenue
firstclaimforrelief
secondclaimforrelief
prayerforrelief
DOC_START
defendants
complaint
firstclaimforrelief
secondclaimforrelief
thirdclaimforrelief
fourthclaimforrelief
DOC_START
defendants
comp