# RegRecall Dataset Construction

This notebook contains code to construct the RegRecall dataset.

## Part 0: Imports

In [None]:
%load_ext autoreload
%autoreload 2
from pypdf import PdfReader
from lit_scraper import scrape_sec_complaints
from sec_complaint_parser import SECComplaintParser
from parser_config import reg_headings
import random

## Part 1: Litigation Web Scraping

First, we scrape all litigation proceedings opened by the SEC from its [litigation releases page](https://www.sec.gov/litigation/litreleases). Once scraping has completed, move the downloaded documents from your default "Downloads" folder to a more permanent location, such as the `data/sec_complaints` directory that the parsing step reads from.

In [None]:
scrape_sec_complaints()
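
The move out of "Downloads" can be scripted rather than done by hand. The cell below is a minimal sketch, not part of the scraper: `move_downloaded_pdfs` is a hypothetical helper, and it assumes the scraped complaints are the only `.pdf` files in the source folder, since it moves every PDF it finds there.

```python
import shutil
from pathlib import Path


def move_downloaded_pdfs(downloads_dir, target_dir):
    """Move every PDF in downloads_dir into target_dir and return the new paths.

    Hypothetical helper: it does not check that a PDF came from the scraper.
    """
    downloads = Path(downloads_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for pdf_path in sorted(downloads.glob("*.pdf")):
        destination = target / pdf_path.name
        if destination.exists():
            # Skip rather than overwrite a file that was moved earlier.
            continue
        shutil.move(str(pdf_path), str(destination))
        moved.append(destination)
    return moved
```

A call such as `move_downloaded_pdfs(Path.home() / "Downloads", "data/sec_complaints")` would then place the files where the parsing step expects them, assuming your browser saves to `~/Downloads`.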

## Part 2: Parsing

Next, the documents must be parsed to identify and extract information about the actions taken and the rules violated.

First, parse the PDFs to extract their text.

In [None]:

all_texts = SECComplaintParser.get_all_pdf_texts(
    "data/sec_complaints",
    save_text=True,
    save_directory="data/sec_complaints_text",
)

Alternatively, if the text has already been extracted, load it from disk.

In [None]:
all_texts = SECComplaintParser.load_pdf_texts("data/segmented_sec_complaints_text/subfolder_1")

print(f"{len(all_texts)} texts loaded in!")
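
For readers without access to `SECComplaintParser`, the loading step can be approximated as follows. This is a sketch under assumptions, not the parser's actual implementation: it supposes that each complaint was saved as one UTF-8 `.txt` file named after its key (for example `comp-pr2012-178.txt`), and `load_texts_from_directory` is a hypothetical name.

```python
from pathlib import Path


def load_texts_from_directory(text_dir):
    """Return {file stem: file contents} for every .txt file in text_dir.

    Assumes one UTF-8 text file per complaint, named after the complaint key.
    """
    texts = {}
    for text_path in sorted(Path(text_dir).glob("*.txt")):
        texts[text_path.stem] = text_path.read_text(encoding="utf-8")
    return texts
```

Keying the dictionary by file stem keeps the lookup style used in the rest of the notebook, where a complaint is addressed by a key such as `'comp-pr2012-178'`.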

Now let us split each text into sections.

In [8]:
sectioned_texts = SECComplaintParser.section_all_texts(all_texts)

In [None]:
print(sectioned_texts['comp-pr2012-178']["complaint"])
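
The cells above index the result by complaint key and then by a lowercase heading. One way such a structure could be produced is sketched below. The heading rule (a line consisting only of capital letters, spaces, and basic punctuation), the `"preamble"` default key, and the name `section_text` are all assumptions made for illustration; the actual logic in `SECComplaintParser.section_all_texts` may differ.

```python
import re

# Hypothetical heading rule: a line of capital letters, spaces, and basic
# punctuation, at least four characters long.
HEADING_PATTERN = re.compile(r"^[A-Z][A-Z ,.'&-]{3,}$")


def section_text(text, default_heading="preamble"):
    """Split text into {lowercased heading: body} using HEADING_PATTERN."""
    sections = {default_heading: []}
    current = default_heading
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_PATTERN.match(stripped):
            # Start a new section; a repeated heading extends the earlier one.
            current = stripped.lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {heading: "\n".join(lines).strip() for heading, lines in sections.items()}
```

Applying `section_text` to every value of `all_texts` would yield a nested dictionary of the same shape as `sectioned_texts`.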

In [None]:
# Count the complaints that contain at least one section whose heading
# matches a pattern in reg_headings, and record the keys of those that do not.
rule_section = 0
no_rule_section = 0
no_rule_sections = []

print(reg_headings)

for text_key, sectioned_text in sectioned_texts.items():
    # Iterating over sectioned_text yields its section headings.
    has_rule_section = any(
        SECComplaintParser.regex_fullmatch_any(heading, reg_headings)
        for heading in sectioned_text
    )

    if has_rule_section:
        rule_section += 1
    else:
        no_rule_section += 1
        no_rule_sections.append(text_key)

print(rule_section)
print(no_rule_section)
print(no_rule_sections)
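
The check in the loop above relies on `regex_fullmatch_any`, whose definition lives in `SECComplaintParser` and is not shown here. A plausible reading of the name is "does the whole heading match any of these patterns?", which can be sketched with `re.fullmatch`. The helper `fullmatch_any`, the case-insensitive flag, and the example patterns are assumptions; the real contents of `reg_headings` come from `parser_config`.

```python
import re


def fullmatch_any(text, patterns):
    """Return True if the entire text matches at least one regex pattern.

    Case-insensitive matching is an assumption of this sketch.
    """
    return any(
        re.fullmatch(pattern, text, flags=re.IGNORECASE) is not None
        for pattern in patterns
    )


# Hypothetical patterns, for illustration only.
example_headings = [
    r"claims? for relief",
    r"(first|second|third) claim( for relief)?",
]
```

Because `re.fullmatch` requires the pattern to cover the whole string, a heading such as "claims for relief and other matters" would not count as a match under these example patterns, whereas `re.search` would accept it.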