# RegRecall Dataset Construction

This notebook contains code to construct the RegRecall dataset.

## Part 0: Imports

In [2]:
%load_ext autoreload
%autoreload 2
from pypdf import PdfReader
from lit_scraper import scrape_sec_complaints
from sec_complaint_parser import SECComplaintParser
from parser_config import reg_headings
import random
import os
from tqdm import tqdm
import json



## Part 1: Litigation Web Scraping

First, we scrape all litigation proceedings opened by the SEC from [here](https://www.sec.gov/litigation/litreleases). Once scraping completes, move the downloaded documents from your default "Downloads" folder to a more permanent location; the cells below read them from `data/sec_complaints`.

In [3]:
scrape_sec_complaints()

Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26360.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26359.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26358.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/litreleases/2025/comp26355.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26354.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26353.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26352.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26349.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26348.pdf
Stopped
Total size of downloaded PDFs: 0 bytes
Total number of downloaded files: 10
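
The manual move described above could be scripted. The following is a minimal sketch, not part of `lit_scraper`: the function name `move_downloaded_pdfs` and its `prefix` parameter are hypothetical, and it assumes the complaint files are named like `comp26349.pdf`, as in the links printed above.

```python
import shutil
from pathlib import Path


def move_downloaded_pdfs(src_dir, dest_dir, prefix="comp"):
    """Move complaint PDFs from src_dir into dest_dir; return the moved file names."""
    src = Path(src_dir).expanduser()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    moved = []
    # Only touch files that look like SEC complaint PDFs (assumed naming: comp*.pdf).
    for pdf in sorted(src.glob(f"{prefix}*.pdf")):
        target = dest / pdf.name
        if target.exists():
            continue  # never overwrite a copy that is already in place
        shutil.move(str(pdf), str(target))
        moved.append(pdf.name)
    return moved
```

A call such as `move_downloaded_pdfs("~/Downloads", "data/sec_complaints")` would then place the files where the parsing cells expect them. Skipping existing targets makes the function safe to rerun after a partial scrape.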


## Part 2: Initial Parsing

Next, we parse the documents to identify and extract information about the actions taken and the violations alleged.

First, parse the PDFs to extract their text.

In [4]:
all_texts = SECComplaintParser.get_all_pdf_texts("data/sec_complaints", save_text=True, save_directory="data/sec_complaints_text")

100%|██████████| 10/10 [00:01<00:00,  6.00it/s]


Alternatively, if the text has already been extracted, load the text in.

In [5]:
all_texts = SECComplaintParser.load_pdf_texts("data/sec_complaints_text")

print("{num_texts} texts loaded in!".format(num_texts=len(all_texts)))

100%|██████████| 10/10 [00:00<00:00, 9267.13it/s]

10 texts loaded in!
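
For readers without access to `sec_complaint_parser.py`, a loader of this kind can be sketched with the standard library alone. This is an illustration under assumptions, not the actual `load_pdf_texts` implementation: the name `load_texts` is hypothetical, and it assumes one UTF-8 `.txt` file per complaint, keyed by file stem (matching keys such as `comp26349` seen later in the notebook). The real loader may return lists of lines rather than raw strings.

```python
from pathlib import Path


def load_texts(text_dir):
    """Return {file stem: file contents} for every .txt file in text_dir."""
    texts = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        # Key by stem so "comp26349.txt" is stored under "comp26349".
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts
```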





In [6]:
output_dir = "data/sec_complaints_json"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "complaints.json")

with open(output_path, "w", encoding="utf-8") as json_file:
    json.dump(all_texts, json_file, indent=2, ensure_ascii=False)

print(f"Saved JSON to {output_path}")

Saved JSON to data/sec_complaints_json/complaints.json


Now let us split the text up by sections.

In [7]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)
print(sectioned_texts)
for file in sectioned_texts:
    print(f"File: {file}")
    for section, text in sectioned_texts[file].items():
        print(f"  Section: {section}")
        print(f"  Text: {text[:100]}...")  # text is a list of lines, so this shows up to the first 100 lines
        print()  # Newline for better readability

File: comp26349
  Section: DOC_START
  Text: ['in the united states district court  ', 'for the western  district of texas  ', 'san antonio  division  ', '________________________________________________ ', 'securities and exchange commission,  ', 'plaintiff,  ) ', ') ', ') civil action no.  5:24-cv-805', ') ', 'v.', ' ) ', ')  ', 'imer gomez, individually and d/b/a              )  jury trial demanded  ', 'k&g investment solutions, llc, and  ) ', 'heli os venture fund, llc,   ) ) ']...

  Section: defendants
  Text: ['8. imer gomez, age 28, is a dual citizen of the u.s. and mexico, and resided in san ', 'antonio, texas  during the relevant time period .  gomez was the president and cfo of helios.  ', 'k&g appears to be an assumed name used by gomez to transact business and engage with ', 'advisory clients .   ', '9. helios venture fund, llc is a texas limited liability company with its ', 'principal place of business in san antonio.  helios purports to be an investment adviser.  ', 're
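
The output above suggests that each file maps to a dictionary of heading to list of lines, with text before the first heading stored under `DOC_START`. A minimal sketch of that kind of sectioning follows. It is an assumption about the general approach, not the code behind `section_all_texts`: the function name `section_lines` and the way headings are detected (a full regex match on the lowercased line with whitespace removed) are hypothetical.

```python
import re


def section_lines(lines, heading_patterns):
    """Group lines under the most recent heading; leading text goes to DOC_START."""
    sections = {"DOC_START": []}
    current = "DOC_START"
    for line in lines:
        key = line.strip().lower()
        # Compare against the patterns with all whitespace removed.
        squashed = re.sub(r"\s+", "", key)
        if squashed and any(re.fullmatch(p, squashed) for p in heading_patterns):
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections
```

Removing whitespace before matching makes detection tolerant of the irregular spacing that PDF extraction produces, at the cost of occasionally treating a short body line as a heading.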

In [8]:
rule_section = 0
no_rule_section = 0
no_rule_sections = []

print(reg_headings)

for text_key, sectioned_text in sectioned_texts.items():
    has_rule_section = False
    
    for heading in sectioned_text:
        if SECComplaintParser.regex_fullmatch_any(heading, reg_headings):
            has_rule_section = True

    if has_rule_section:
        rule_section += 1
    else:
        no_rule_section += 1
        no_rule_sections.append(text_key)

print(rule_section)
print(no_rule_section)
print(no_rule_sections)

['.*claimforrelief', 'prayerforrelief', 'claimsforrelief', 'claimsforaction', 'claimforrelief', '.*claimforaction', '.*causeofaction', 'count.*']
7
3
['comp26360', 'comp26355', 'comp26345']
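
The heading check above can be approximated as follows. This is a sketch under assumptions rather than the real `regex_fullmatch_any`: the helper names `fullmatch_any` and `split_by_rule_section` are hypothetical, and the normalization (lowercasing and removing whitespace) is inferred from the space-free patterns in `reg_headings`.

```python
import re

# Patterns copied from the reg_headings list printed above.
REG_HEADINGS = [
    ".*claimforrelief", "prayerforrelief", "claimsforrelief", "claimsforaction",
    "claimforrelief", ".*claimforaction", ".*causeofaction", "count.*",
]


def fullmatch_any(heading, patterns):
    """True if the heading, lowercased with whitespace removed, fully matches any pattern."""
    squashed = re.sub(r"\s+", "", heading.lower())
    return any(re.fullmatch(p, squashed) for p in patterns)


def split_by_rule_section(sectioned_texts, patterns=REG_HEADINGS):
    """Partition file keys into those with and without a matching heading."""
    with_rules, without_rules = [], []
    for key, sections in sectioned_texts.items():
        has_rule = any(fullmatch_any(h, patterns) for h in sections)
        (with_rules if has_rule else without_rules).append(key)
    return with_rules, without_rules
```

Under this normalization, `count.*` matches any heading that begins with "count", so a heading such as "counterclaims" would also be counted as a rule section. The three files without a match are worth inspecting by hand before tightening or extending the patterns.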


## Part 3: Section Parsing
 
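Here we extract each section from each text filing using functions from `sec_complaint_parser.py`. The lines of every section are joined into a single string, whitespace is normalized, and the result is cleaned.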

In [None]:
import json
for text_key, sectioned_text in sectioned_texts.items():
    # print(json.dumps(sectioned_text, indent=2))  
    for heading, text_list in sectioned_text.items():
        # Join the list of strings into a single string
        joined_text = " ".join(text_list)
        # Optional: clean up extra spaces
        joined_text = " ".join(joined_text.split())
        print(len(joined_text))
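        # Clean the joined section text (not the stale `text` variable left over from an earlier cell)
        cleaned = SECComplaintParser.clean_string(joined_text)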
        print(len(cleaned))
        print(f"joined_text: {joined_text}")
        # print(f"Joined text for '{heading}':\n{joined_text}\n")

joined_text: in the united states district court for the western district of texas san antonio division ________________________________________________ securities and exchange commission, plaintiff, ) ) ) civil action no. 5:24-cv-805 ) v. ) ) imer gomez, individually and d/b/a ) jury trial demanded k&g investment solutions, llc, and ) heli os venture fund, llc, ) )
joined_text: 8. imer gomez, age 28, is a dual citizen of the u.s. and mexico, and resided in san antonio, texas during the relevant time period . gomez was the president and cfo of helios. k&g appears to be an assumed name used by gomez to transact business and engage with advisory clients . 9. helios venture fund, llc is a texas limited liability company with its principal place of business in san antonio. helios purports to be an investment adviser. relief defendants 10. eric claxton , age 47, r esides in san antonio. 11. heather claxton , age 47, resides in san antonio.
joined_text: 12. the commission brings this action 

In [10]:
import os
import re

file_path = "data/sec_complaints_text/comp26352.txt"

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()
    print(len(text))
    cleaned = SECComplaintParser.clean_string(text)
    print(len(cleaned))  # Length after cleaning, for comparison with the raw length above

61998
48497
