# RegRecall Dataset Construction

This notebook contains code to construct the RegRecall dataset.

## Part 0: Imports

In [1]:
%load_ext autoreload
%autoreload 2
from pypdf import PdfReader
from lit_scraper import scrape_sec_complaints
from sec_complaint_parser import SECComplaintParser
from parser_config import reg_headings
import random
import os
from tqdm import tqdm
import json



## Part 1: Litigation Web Scraping

First, we scrape all litigation proceedings opened by the SEC from [here](https://www.sec.gov/litigation/litreleases). Once scraping completes, move the downloaded documents from your default "Downloads" folder to a more permanent location.

In [5]:
scrape_sec_complaints()

Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26349.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26348.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26345.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26343.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26341.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26339.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26336.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26333.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2024/comp26331.pdf
Stopped
Total size of downloaded PDFs: 0 bytes
Total number of downloaded files: 10
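
Once the scraper has finished, the move out of the "Downloads" folder can be scripted. The following is a minimal sketch, not part of lit_scraper: the helper name `move_downloaded_pdfs` is hypothetical, the `comp*.pdf` pattern is inferred from the file names in the log above, and the target directory is assumed to be the `data/sec_complaints` folder used in Part 2.

```python
import shutil
from pathlib import Path


def move_downloaded_pdfs(download_dir, target_dir, pattern="comp*.pdf"):
    """Move files matching `pattern` from download_dir into target_dir.

    Hypothetical helper; returns the list of destination paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for pdf_path in sorted(Path(download_dir).glob(pattern)):
        destination = target / pdf_path.name
        shutil.move(str(pdf_path), str(destination))
        moved.append(destination)
    return moved


# Example call (both paths are assumptions; adjust them to your machine):
# move_downloaded_pdfs(Path.home() / "Downloads", "data/sec_complaints")
```

Files that do not match the pattern are left where they are, so unrelated downloads are not touched.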


## Part 2: Initial Parsing

Next, the documents must be parsed to identify and extract information about the actions taken and the rules allegedly violated.

First, parse the PDFs to extract their text.

In [None]:
all_texts = SECComplaintParser.get_all_pdf_texts("data/sec_complaints", save_text=True, save_directory="data/sec_complaints_text")

100%|██████████| 10/10 [00:01<00:00,  9.58it/s]
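
For readers without access to sec_complaint_parser.py, the extraction step above can be approximated as follows. This is a sketch under assumptions, not the actual body of `get_all_pdf_texts`: the function names `extract_pdf_text` and `extract_directory` are hypothetical, and keying the result by file stem is inferred from keys such as `comp26328` used later in this notebook. It relies only on pypdf's `PdfReader`, its `pages` attribute, and `extract_text()`.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Concatenate the text of every page of one PDF using pypdf."""
    # Imported lazily so extract_directory can be exercised without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_directory(pdf_dir, text_dir, extract=extract_pdf_text):
    """Extract every *.pdf in pdf_dir and save one .txt file per document.

    Hypothetical helper; returns a dict mapping file stem to extracted text.
    """
    out_dir = Path(text_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    texts = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extract(pdf_path)
        texts[pdf_path.stem] = text
        (out_dir / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")
    return texts
```

Passing the extractor as a parameter is a design choice made for this sketch so that a different PDF backend can be swapped in without touching the directory loop.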


Alternatively, if the text has already been extracted, load it from disk.

In [23]:
all_texts = SECComplaintParser.load_pdf_texts("data/sec_complaints_text")

print("{num_texts} texts loaded in!".format(num_texts=len(all_texts)))

100%|██████████| 10/10 [00:00<00:00, 7777.31it/s]

10 texts loaded in!
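
Loading previously extracted text could look roughly like the sketch below. It assumes one UTF-8 `.txt` file per complaint, named after the complaint ID; the helper name `load_texts` is hypothetical and is not the implementation of `load_pdf_texts`.

```python
from pathlib import Path


def load_texts(text_dir):
    """Read every *.txt file in text_dir into a dict keyed by file stem."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(text_dir).glob("*.txt"))
    }
```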





In [24]:
output_dir = "data/sec_complaints_json"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "complaints.json")

with open(output_path, "w", encoding="utf-8") as json_file:
    json.dump(all_texts, json_file, indent=2, ensure_ascii=False)

print(f"Saved JSON to {output_path}")

Saved JSON to data/sec_complaints_json/complaints.json


Now let us split the text up by sections.

In [25]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)

In [26]:
print(sectioned_texts['comp26328']["complaint"])

['1. between 2019 and 2024, defendants peter scalise iii (“scalise”) and ', 'the3rdbevco inc. (“the3rdbevco”  or the “company” ), a beverage company that scalise ', 'founded, controls, and operates, perpetrated a $3.6 million  offering fraud.  ', '2. defendants misled investors about the use of the  invest or funds  and a potential ', 'collaboration with “individual 1,” a celebrity described by defendants as a “global superstar ', 'and music icon.”   ', '3. contrary to what investors were told, scalise misappropriated and misused more ', 'than $856,000 of investor funds , including for personal expenses, such as tuition, mortgage  ', 'payments , and landscaping.     ', '4. in addition, in communications to existing  investors and potential investors, ', 'defendants promoted a potential collaboration with individual 1 on a supposed rum alcohol product, using individual 1’s nickname and trademark in its product brand name  and us ing ', 'individual 1’s name, image, trademark, and music  
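
The output above suggests that each complaint becomes a dict mapping a normalized heading (such as `complaint`) to a list of lower-cased lines. One way such sectioning might work is sketched below. Everything here is an assumption made for illustration: the all-caps heading heuristic, the 60-character limit, the `preamble` default key, and the function names are hypothetical and may differ from what `section_all_texts` actually does.

```python
import re


def normalize_heading(line):
    """Lowercase a heading and keep letters only.

    For example, 'CLAIMS FOR RELIEF' becomes 'claimsforrelief'.
    """
    return re.sub(r"[^a-z]", "", line.lower())


def section_text(text, default_heading="preamble"):
    """Split text into {normalized heading: [lower-cased lines]}."""
    sections = {default_heading: []}
    current = default_heading

    for line in text.splitlines():
        stripped = line.strip()
        # Assumed heuristic: a heading is a short line written entirely in upper case.
        is_heading = bool(stripped) and stripped.isupper() and len(stripped) <= 60
        key = normalize_heading(stripped) if is_heading else ""
        if key:
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(line.lower())
    return sections
```

Normalizing headings to letters only would explain why the patterns in `reg_headings` contain no spaces, although that link is an inference from the printed output rather than something the notebook states.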

In [27]:
# Count how many complaints contain at least one heading that matches
# a claims/counts pattern from reg_headings.
rule_section = 0
no_rule_section = 0
no_rule_sections = []  # keys of complaints without a matching heading

print(reg_headings)

for text_key, sectioned_text in sectioned_texts.items():
    # A complaint has a rule section if any heading fully matches a pattern.
    has_rule_section = any(
        SECComplaintParser.regex_fullmatch_any(heading, reg_headings)
        for heading in sectioned_text
    )

    if has_rule_section:
        rule_section += 1
    else:
        no_rule_section += 1
        no_rule_sections.append(text_key)

print(rule_section)
print(no_rule_section)
print(no_rule_sections)

['.*claimforrelief', 'prayerforrelief', 'claimsforrelief', 'claimsforaction', 'claimforrelief', '.*claimforaction', '.*causeofaction', 'count.*']
9
1
['comp26345']
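
To illustrate how the heading patterns behave, here is a small stand-in for `regex_fullmatch_any` built on `re.fullmatch`. The function name `fullmatch_any` and the three sample headings are hypothetical; the pattern list is copied from the printed output above. Whether the real method is implemented this way is an assumption.

```python
import re


def fullmatch_any(text, patterns):
    """Return True if `text` fully matches at least one regex in `patterns`."""
    return any(re.fullmatch(pattern, text) for pattern in patterns)


# Patterns copied from the printed reg_headings output.
patterns = [
    ".*claimforrelief", "prayerforrelief", "claimsforrelief", "claimsforaction",
    "claimforrelief", ".*claimforaction", ".*causeofaction", "count.*",
]

# Sample headings are made up for illustration.
for heading in ["firstclaimforrelief", "counti", "complaint"]:
    print(heading, fullmatch_any(heading, patterns))
# firstclaimforrelief True
# counti True
# complaint False
```

Because `re.fullmatch` requires the whole string to match, a pattern without a leading `.*` (such as `claimsforrelief`) only matches a heading that is exactly that string.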


## Part 3: Section Parsing
 
Next, we clean the text of each section in each filing using functions from sec_complaint_parser.py.

In [None]:
# Clean the text under each heading with SECComplaintParser.clean_string
for text_key, sectioned_text in sectioned_texts.items():
    for heading in sectioned_text:
        sectioned_text[heading] = SECComplaintParser.clean_string(sectioned_text[heading])
        print(f"Cleaned text for {text_key} under heading {heading}")
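
The behavior of `clean_string` is not shown in this notebook. As a rough idea of the kind of cleanup the extracted lines need (stray double spaces, a space before a comma), here is a minimal sketch. The helper name `clean_lines` and both regex rules are assumptions for illustration, not the parser's actual logic.

```python
import re


def clean_lines(lines):
    """Join extracted lines into one string and tidy the whitespace."""
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"\s+([,.;:)])", r"\1", text)  # drop spaces before closing punctuation
    return text.strip()


# Two lines taken from the sectioned output shown earlier.
sample = [
    "than $856,000 of investor funds , including for personal expenses, such as tuition, mortgage  ",
    "payments , and landscaping.     ",
]
print(clean_lines(sample))
# than $856,000 of investor funds, including for personal expenses, such as tuition, mortgage payments, and landscaping.
```

Words split by the PDF extractor, such as "invest or" in the sample output, are not repaired by this sketch.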