# RegRecall Dataset Construction

This notebook contains code to construct the RegRecall dataset.

## Part 0: Imports

In [1]:
%load_ext autoreload
%autoreload 2
from pypdf import PdfReader
from lit_scraper import scrape_sec_complaints
from sec_complaint_parser import SECComplaintParser
from parser_config import reg_headings
import random
import os
from tqdm import tqdm
import json



<br>

## Part 1: Litigation Web Scraping

First, we scrape all litigation proceedings opened by the SEC from its [litigation releases page](https://www.sec.gov/litigation/litreleases). Once scraping completes, move the downloaded documents from your default "Downloads" folder to a more permanent location, such as the `data/sec_complaints` directory that Part 2 reads from.

In [2]:
scrape_sec_complaints()

Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26360.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26359.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26358.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/litreleases/2025/comp26355.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26354.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26353.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26352.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26349.pdf
Skipped non-PDF link: https://www.sec.gov/files/litigation/complaints/2025/comp26348.pdf
Stopped
Total size of downloaded PDFs: 0 bytes
Total number of downloaded files: 10
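
The manual step of moving the downloaded PDFs out of "Downloads" could be scripted along the following lines. This is a minimal sketch, not part of `lit_scraper`: the helper name `move_downloaded_pdfs` is hypothetical, and the `comp` filename prefix and the `data/sec_complaints` destination are assumptions taken from the filenames and paths that appear elsewhere in this notebook.

```python
import shutil
from pathlib import Path


def move_downloaded_pdfs(src_dir, dest_dir, prefix="comp"):
    """Move PDFs whose names start with `prefix` from src_dir to dest_dir.

    Hypothetical helper: the prefix and directory layout are assumptions.
    Returns the names of the files that were moved, in sorted order.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(src.glob(f"{prefix}*.pdf")):
        target = dest / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target.name)
    return moved


# Example call (paths are assumptions; adjust to your machine):
# move_downloaded_pdfs(Path.home() / "Downloads", "data/sec_complaints")
```

Filtering on the filename prefix keeps unrelated PDFs in the source folder untouched.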


<br>

## Part 2: Initial Parsing

Next, we parse the documents to identify and extract information about the actions taken and the violations alleged.

First, extract the text from the PDFs.

In [3]:
all_texts = SECComplaintParser.get_all_pdf_texts("data/sec_complaints", save_text=True, save_directory="data/sec_complaints_text")

100%|██████████| 10/10 [00:01<00:00,  6.61it/s]
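
For readers without access to `sec_complaint_parser.py`, per-file text extraction with `pypdf` (imported in Part 0) might look roughly like the sketch below. The helper names `join_pages` and `pdf_to_text` are hypothetical, and this is not necessarily how `get_all_pdf_texts` is implemented.

```python
def join_pages(pages):
    """Concatenate the text of page-like objects exposing extract_text().

    Pages with no extractable text contribute an empty string.
    """
    return "\n".join((page.extract_text() or "") for page in pages)


def pdf_to_text(path):
    """Read one PDF with pypdf and return the text of all its pages."""
    from pypdf import PdfReader  # third-party; already imported in Part 0

    return join_pages(PdfReader(path).pages)


# Example call (the path is an assumption based on the filenames above):
# text = pdf_to_text("data/sec_complaints/comp26348.pdf")
```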


Alternatively, if the text has already been extracted, load the text in.

In [4]:
all_texts = SECComplaintParser.load_pdf_texts("data/sec_complaints_text")

print("{num_texts} texts loaded in!".format(num_texts=len(all_texts)))

100%|██████████| 10/10 [00:00<00:00, 7794.66it/s]

10 texts loaded in!





In [5]:
output_dir = "data/sec_complaints_json"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "complaints.json")

with open(output_path, "w", encoding="utf-8") as json_file:
    json.dump(all_texts, json_file, indent=2, ensure_ascii=False)

print(f"Saved JSON to {output_path}")

Saved JSON to data/sec_complaints_json/complaints.json


Now let us split the text up by sections.

In [6]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)
for file in sectioned_texts:
    print(f"File: {file}")
    for section, text in sectioned_texts[file].items():
        print(f"  Section: {section}")
        # print(f"  Text: {text}")  # Uncomment to print the full content of each section
        # print()  # Newline for better readability

File: comp26349
  Section: DOC_START
File: comp26348
  Section: DOC_START
File: comp26360
  Section: DOC_START
File: comp26358
  Section: DOC_START
  Section: count i 
  Section: count ii 
  Section: count iii 
File: comp26359
  Section: DOC_START
File: comp26354
  Section: DOC_START
File: comp26355
  Section: DOC_START
File: comp26345
  Section: DOC_START
File: comp26352
  Section: DOC_START
File: comp26353
  Section: DOC_START
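
To make the output above easier to interpret, here is a minimal sketch of heading-based sectioning. It is an illustration under stated assumptions, not the logic of `section_all_texts`: the heading pattern `HEADING_RE` and the function `split_into_sections` are hypothetical, and only `COUNT <roman numeral>` lines are treated as headings. Text before the first heading is collected under `DOC_START`, mirroring the label seen in the output.

```python
import re

# Hypothetical heading pattern: a line such as "COUNT I" or "Count II".
HEADING_RE = re.compile(r"^\s*count\s+[ivxlc]+\b.*$", re.IGNORECASE)


def split_into_sections(text, start_label="DOC_START"):
    """Split text into {heading: [lines]} using HEADING_RE to find headings."""
    sections = {start_label: []}
    current = start_label
    for line in text.splitlines():
        if HEADING_RE.match(line):
            # Start a new section keyed by the lowercased heading line.
            current = line.strip().lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


sample = "SEC v. Example\nintro\nCOUNT I\nfraud\nCount II\nmore"
print(list(split_into_sections(sample)))
```

A file in which no line matches the heading pattern ends up with a single `DOC_START` section, which is the shape most files show above.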


In [7]:
rule_section = 0
no_rule_section = 0
no_rule_sections = []

print(reg_headings)

for text_key, sectioned_text in sectioned_texts.items():
    has_rule_section = False
    
    for heading in sectioned_text:
        if SECComplaintParser.regex_fullmatch_any(heading, reg_headings):
            has_rule_section = True

    if has_rule_section:
        rule_section += 1
    else:
        no_rule_section += 1
        no_rule_sections.append(text_key)

print(rule_section)
print(no_rule_section)
print(no_rule_sections)

['.*claimforrelief', 'prayerforrelief', 'claimsforrelief', 'claimsforaction', 'claimforrelief', '.*claimforaction', '.*causeofaction', 'count.*']
1
9
['comp26349', 'comp26348', 'comp26360', 'comp26359', 'comp26354', 'comp26355', 'comp26345', 'comp26352', 'comp26353']
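
The check above can be approximated as follows. This is a sketch, not the source of `regex_fullmatch_any`: the normalization step (lowercasing and dropping every non-letter) is an assumption, suggested by the fact that the patterns in `reg_headings` contain no spaces, and the names `normalize_heading` and `fullmatch_any` are hypothetical.

```python
import re

# Patterns copied from the printed value of reg_headings above.
reg_headings = ['.*claimforrelief', 'prayerforrelief', 'claimsforrelief',
                'claimsforaction', 'claimforrelief', '.*claimforaction',
                '.*causeofaction', 'count.*']


def normalize_heading(heading):
    """Lowercase the heading and keep letters only (assumed normalization)."""
    return re.sub(r"[^a-z]", "", heading.lower())


def fullmatch_any(heading, patterns):
    """Return True if the normalized heading fully matches any pattern."""
    key = normalize_heading(heading)
    return any(re.fullmatch(pattern, key) for pattern in patterns)


print(fullmatch_any("count i ", reg_headings))   # heading from comp26358
print(fullmatch_any("DOC_START", reg_headings))  # default section label
```

Under these assumptions, a file whose only section is `DOC_START` can never count as having a rule section, which is consistent with the 1-versus-9 split reported above.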


<br>

## Part 3: Section Parsing
 
Here we extract every section from each text filing using the functions in `sec_complaint_parser.py`.

In [8]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)
# print(json.dumps(sectioned_texts, indent=2))

### Testing for single file

#### Before the preamble

In [11]:
with open("data/sec_complaints_text/comp26352.txt", "r") as f:
    text = f.read()
    
attributes = SECComplaintParser.parse_sec_complaint_attributes(text)
# print(json.dumps(attributes, indent=2))

#### Including preamble and item content

In [12]:
with open("data/sec_complaints_text/comp26348.txt", "r", encoding="utf-8") as f:
    text = f.read()
parsed = SECComplaintParser.parse_sec_complaint_full(text)
print(json.dumps(parsed, indent=2))

{
  "attributes": {
    "court": null,
    "plaintiff": null,
    "defendants": "trijya vakil and neeraj visen, \n  \n                                             defendant s.",
    "attorneys": "attorneys for plaintiff  \nsecurities and exchange commission",
    "case_number": "25 civ. _____",
    "jury_trial": "jury trial demanded"
  },
  "sections": {
    "preamble": [
      "joseph g. sansone",
      "assunta vivolo  derek m. schoenmann",
      "jawad b. muaddi  attorneys for plaintiff",
      "securities and exchange commission new york regional office",
      "100 pearl street, suite 20 -100",
      "new york, new york 10 004-2616",
      "(212) 336-9113 ( schoenmann)",
      "schoenmannd@sec.gov   united states district court",
      "southern district of new york",
      "",
      "securities and exchange",
      "commission,",
      "plaintiff,",
      "-against-",
      "trijya vakil and neeraj visen,",
      ""
    ],
    "defendant": [
      "defendant",
      "11. vakil , 

Since the single-file test works, we can proceed to cleaning the files and processing the section-wise content for all of them.

<br>

## Part 4: File Cleaning

In [13]:
output_dir = "data/cleaned"
os.makedirs(output_dir, exist_ok=True)

for text_key, sectioned_text in sectioned_texts.items():
    # print(f"Processing text key: {text_key}")
    full_text = []

    for heading, text_list in sectioned_text.items():
        joined_text = " ".join(text_list)
        joined_text = " ".join(joined_text.split())  # Collapse runs of whitespace into single spaces
        cleaned = SECComplaintParser.clean_string(joined_text)
        full_text.append(cleaned)
        # print(f"Joined text for '{heading}': {len(cleaned)} characters")

    # Combine all cleaned sections into one string
    output_text = "\n\n".join(full_text)

    # Write to file
    output_path = os.path.join(output_dir, f"{text_key}.txt")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(output_text)

    # print(f"Saved cleaned text to: {output_path}")


### Writing the section-wise JSON for every file

In [14]:
input_folder = "data/sec_complaints_text"
output_folder = "data/sec_sections_json"
os.makedirs(output_folder, exist_ok=True)

for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        input_path = os.path.join(input_folder, filename)
        # Swap only the trailing extension so a ".txt" inside the name is left alone
        output_path = os.path.join(output_folder, os.path.splitext(filename)[0] + ".json")
        with open(input_path, "r", encoding="utf-8") as f:
            text = f.read()
        parsed = SECComplaintParser.parse_sec_complaint_full(text)
        with open(output_path, "w", encoding="utf-8") as out_f:
            json.dump(parsed, out_f, indent=2)
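
Because the single-file test above returned `null` for some attributes (for example `court` and `plaintiff`), a quick pass over the written JSON files can show how often each attribute is missing. The sketch below assumes each file has the top-level `"attributes"` object seen in that output; the helper name `count_missing_attributes` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_missing_attributes(json_dir):
    """Count, per attribute name, how many parsed files hold a null value."""
    missing = Counter()
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            parsed = json.load(f)
        # Files without an "attributes" object simply contribute nothing.
        for name, value in parsed.get("attributes", {}).items():
            if value is None:
                missing[name] += 1
    return dict(missing)


# Example call, using the output folder written by the cell above:
# print(count_missing_attributes("data/sec_sections_json"))
```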