# Introduction
This series of notebooks demonstrates a comprehensive approach to developing and evaluating Retrieval-Augmented Generation (RAG) systems specifically designed for SEC 10-K annual reports.



## What is RAG?
Retrieval-Augmented Generation (RAG) combines the power of large language models (LLMs) with external knowledge retrieval systems. Instead of relying solely on an LLM's internal knowledge, RAG systems first retrieve relevant information from a corpus of documents and then use this information to generate more accurate, up-to-date, and contextually relevant responses. This approach is particularly valuable for domain-specific applications where factual accuracy and source attribution are critical.
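
To make the retrieve-then-generate flow concrete, here is a toy sketch. It is not the pipeline built in this project: the bag-of-words scoring, the `retrieve` and `build_prompt` helpers, and the two example passages are illustrative assumptions. A real system would swap in an embedding model for the scoring and send the prompt to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, top_k=1):
    """Return the top_k passages most similar to the question."""
    query_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(query_vec, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]


def build_prompt(question, context):
    """Combine retrieved context and the question into a prompt for an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical mini-corpus, loosely based on sentences found in a 10-K.
passages = [
    "The Company paid a total of $12.6 billion in dividends during 2017.",
    "The Company's common stock is traded on the Nasdaq under the symbol AAPL.",
]
question = "How much did the company pay in dividends during 2017?"
context = retrieve(question, passages, top_k=1)
prompt = build_prompt(question, context)  # this string would be sent to the LLM
```

The generation step is deliberately left out: the point is only that the model answers from retrieved text rather than from its internal knowledge.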

## What are SEC 10-K reports?
SEC 10-K reports are comprehensive annual filings required by the U.S. Securities and Exchange Commission (SEC) for publicly traded companies. These documents provide a detailed overview of a company's financial performance, business operations, risks, and strategic direction. Key sections include:

- Business overview
- Risk factors
- Management's discussion and analysis (MD&A)
- Financial statements and supplementary data
- Corporate governance information

These reports are extensive (often exceeding 100 pages), highly structured yet variable across companies, and contain critical information for investors, analysts, and regulators. Their complexity makes them an ideal candidate for testing advanced information retrieval and generation systems.

## Project goals
This project aims to develop a robust benchmarking framework for evaluating RAG systems on financial documents, with a specific focus on SEC 10-K reports. Our key objectives include:

1. **Creating a multi-reference benchmark dataset**: Developing questions that require synthesizing information from multiple sections within a single report or across different companies' reports
2. **Establishing evaluation metrics**: Implementing comprehensive metrics to assess both retrieval accuracy and generation quality
3. **Testing various RAG pipeline configurations**: Evaluating different chunking strategies, embedding models, retrieval methods, and reranking approaches

Before we can work on these tasks we need to download and process the data.

# SEC 10-K Report Downloader and Processing
This series of notebooks will guide you through the entire process of building and evaluating a RAG system for SEC 10-K reports. In this first notebook we'll focus on:

- **Data Collection**: Downloading SEC 10-K reports using the SEC EDGAR database
- **Data Processing**: Parsing, cleaning, and structuring the reports for efficient retrieval

By automating these tasks, we can efficiently gather and structure financial data from multiple companies, enabling more comprehensive financial analysis and research.


## Setup
First we need to import the relevant libraries and set some variables.

In [3]:
#|include: false
%load_ext autoreload
%autoreload 2

In [4]:
import os
from pprint import pprint

The functions and classes below are code developed for this project and stored in the `src` directory of the repository.

In [5]:
from src.data.downloader import download_10k_reports
from src.processing.sec_filing_parser import SECFilingParser
from src.processing.sec_reports import find_sec_reports, process_all_sec_reports

Define where to store the raw and processed data:

In [6]:
DATA_DIR = "./data"
RAW_DATA_DIR = f"{DATA_DIR}/raw"
PROCESSED_DATA_DIR = f"{DATA_DIR}/processed"

# Download SEC 10-K Filing reports
The first task is to download some data to work with. Here we'll use the [sec-edgar-downloader](https://sec-edgar-downloader.readthedocs.io/en/latest/) package, which is available on PyPI. For now we'll just work with the five companies listed below.

**Note:** You can swap in other tickers, but be aware that the current code does not work for every company (for instance, the file parser defined later fails on Walmart reports). Report formatting varies between companies, which makes it difficult to build a parser that works for all of them. Only the ticker (e.g. AAPL, NVDA) needs to be accurate; the company name is optional and can be left empty.

In [7]:
# Define the companies we want to analyze.
companies = {
    'AAPL': 'Apple Inc',
    'GOOG': 'Alphabet Inc',
    'AMZN': 'Amazon.com Inc',
    'MSFT': 'Microsoft Corp',
    'NVDA': 'NVIDIA Corp'
}

Now we'll pass these to the `download_10k_reports` function (which was placed in the `src` directory of the repository) to get the data. We also tell it the output directory and the number of years that we want to download data for.

**Note**: When you run the cell below, a pop-up box will prompt you to enter a company name and email address. You can enter anything here; the values are required by `sec-edgar-downloader`.

In [6]:
results_df = download_10k_reports(companies, RAW_DATA_DIR, num_years=10)

Downloading 10-K reports for Apple Inc (AAPL)...
Downloading 10-K reports for Alphabet Inc (GOOG)...
Downloading 10-K reports for Amazon.com Inc (AMZN)...
Downloading 10-K reports for Microsoft Corp (MSFT)...
Downloading 10-K reports for NVIDIA Corp (NVDA)...


Check the status of the download:

In [7]:
results_df

Unnamed: 0,ticker,company,status
0,AAPL,Apple Inc,success
1,GOOG,Alphabet Inc,success
2,AMZN,Amazon.com Inc,success
3,MSFT,Microsoft Corp,success
4,NVDA,NVIDIA Corp,success
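
The project's `download_10k_reports` helper is not reproduced in this notebook. As a rough idea of what such a helper might do, the sketch below loops over the tickers, asks a downloader object for the 10-K filings, and records one status row per company. The function name `download_reports`, the `FakeDownloader` stand-in, and the `"BAD"` ticker are all hypothetical and exist only so the example runs offline.

```python
def download_reports(companies, downloader, num_years=10):
    """Request 10-K filings for each ticker and record a status row per company.

    `downloader` is any object exposing get(form, ticker, limit=...). This is
    assumed to match the call shape of sec_edgar_downloader.Downloader.
    """
    rows = []
    for ticker, name in companies.items():
        print(f"Downloading 10-K reports for {name} ({ticker})...")
        try:
            # One 10-K is filed per fiscal year, so limit=num_years
            # approximates "the last num_years annual reports".
            downloader.get("10-K", ticker, limit=num_years)
            status = "success"
        except Exception as exc:  # keep going if one ticker fails
            status = f"failed: {exc}"
        rows.append({"ticker": ticker, "company": name, "status": status})
    return rows


class FakeDownloader:
    """Offline stand-in used only to demonstrate the control flow."""

    def get(self, form, ticker, limit=None):
        if ticker == "BAD":
            raise ValueError("unknown ticker")
        return limit


rows = download_reports({"AAPL": "Apple Inc", "BAD": ""}, FakeDownloader(), num_years=2)
```

With the real package, the `downloader` argument would be something like `Downloader("MyCompany", "me@example.com", RAW_DATA_DIR)` from `sec_edgar_downloader`. As far as we know this matches recent releases, but check the package documentation for the version you have installed. The list of rows can be passed to `pandas.DataFrame` to get a status table like the one above.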


All the data was successfully downloaded, so we have a total of 50 reports to work with (5 companies × 10 years). Now let's review the structure of the reports to see what we're dealing with.

# Data review
The `sec-edgar-downloader` saves each requested filing as a text file. On inspection, each file appears to start with metadata in a structured text format, while the actual filing content is in HTML.

A sample from the start of one of the downloaded files is displayed below as an example.


In [9]:
#| echo: false
# Display a sample of the report content
report_path = f'{RAW_DATA_DIR}/sec-edgar-filings/AAPL/10-K/0000320193-17-000070/full-submission.txt'
print("Sample of the SEC report content:")
with open(report_path, 'r') as file:
    sample = file.read(2000)  # Read the first 2000 characters
print(sample)
print("...")

Sample of the SEC report content:
<SEC-DOCUMENT>0000320193-17-000070.txt : 20171103
<SEC-HEADER>0000320193-17-000070.hdr.sgml : 20171103
<ACCEPTANCE-DATETIME>20171103080137
ACCESSION NUMBER:		0000320193-17-000070
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		97
CONFORMED PERIOD OF REPORT:	20170930
FILED AS OF DATE:		20171103
DATE AS OF CHANGE:		20171103

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			APPLE INC
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0930

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		171174673

	BUSINESS ADDRESS:	
		STREET 1:		ONE INFINITE LOOP
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:	
		STREET 1:		ONE INFINITE LOOP
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:	
		FORMER CONFORMED NAME:
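
The header shown above is followed by the filing documents themselves. The sketch below pulls the main 10-K HTML body out of a submission, assuming each embedded document is wrapped in `<DOCUMENT>` ... `</DOCUMENT>` with a `<TYPE>` line and a `<TEXT>` body. That layout is our understanding of the EDGAR full-submission format rather than something shown in this notebook, and `extract_document` is a hypothetical helper, not part of `SECFilingParser`.

```python
import re

DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_document(submission, doc_type="10-K"):
    """Return the <TEXT> body of the first embedded document of the given type."""
    for match in DOCUMENT_RE.finditer(submission):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        if type_match and type_match.group(1) == doc_type:
            text_match = TEXT_RE.search(block)
            return text_match.group(1).strip() if text_match else None
    return None


# Tiny synthetic submission used only for illustration.
sample_submission = """<SEC-HEADER>
CONFORMED SUBMISSION TYPE: 10-K
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<TEXT>
<html><body><p>Item 1. Business</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21.1
<SEQUENCE>2
<TEXT>
<html><body>Subsidiaries</body></html>
</TEXT>
</DOCUMENT>"""

main_html = extract_document(sample_submission)
```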


## SECFilingParser
In the `src` module we have a class named `SECFilingParser` which allows us to do various things with these files, including:

* Load the 10-K filing reports
* Extract metadata
* View sections in HTML format
* Process the data ready for ingestion into our embedding database

Let's see the functionality using an example report. First we need to create an instance of the class and then use the `read_file` method to load a report.

In [7]:
report_path = f'{RAW_DATA_DIR}/sec-edgar-filings/AAPL/10-K/0000320193-17-000070/full-submission.txt'

In [8]:
parser = SECFilingParser()
parser.read_file(report_path)

### View metadata
Useful metadata (such as the report filing date and CIK number) can be accessed using the `extract_metadata` method:

In [10]:
metadata = parser.extract_metadata()
pprint(metadata)

{'cik': '0000320193',
 'company_name': 'APPLE INC',
 'filing_date': '20171103',
 'fiscal_year_end': '0930',
 'industry': 'ELECTRONIC COMPUTERS [3571]',
 'period_end_date': '20170930'}
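
The internals of `extract_metadata` are not shown here. One plausible way to obtain the same fields is to match the `LABEL: value` lines of the header with regular expressions, as in this sketch. The `HEADER_FIELDS` mapping and the `extract_header_metadata` function are hypothetical names chosen to mirror the output above.

```python
import re

# Mapping from header labels (as they appear in the file) to output keys.
HEADER_FIELDS = {
    "COMPANY CONFORMED NAME": "company_name",
    "CENTRAL INDEX KEY": "cik",
    "FILED AS OF DATE": "filing_date",
    "CONFORMED PERIOD OF REPORT": "period_end_date",
    "FISCAL YEAR END": "fiscal_year_end",
    "STANDARD INDUSTRIAL CLASSIFICATION": "industry",
}


def extract_header_metadata(header_text):
    """Pull selected 'LABEL: value' pairs out of the SEC header block."""
    metadata = {}
    for label, key in HEADER_FIELDS.items():
        # A label sits at the start of a line (after optional indentation),
        # followed by a colon, whitespace, and the value up to end of line.
        pattern = rf"^[ \t]*{re.escape(label)}:[ \t]*(.+)$"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match:
            metadata[key] = match.group(1).strip()
    return metadata


# Excerpt of the header shown earlier (indentation simplified to spaces).
header = """CONFORMED PERIOD OF REPORT: 20170930
FILED AS OF DATE:           20171103
    COMPANY DATA:
        COMPANY CONFORMED NAME:             APPLE INC
        CENTRAL INDEX KEY:                  0000320193
        STANDARD INDUSTRIAL CLASSIFICATION: ELECTRONIC COMPUTERS [3571]
        FISCAL YEAR END:                    0930
"""
metadata = extract_header_metadata(header)
```

Anchoring each label to the start of a line keeps `COMPANY CONFORMED NAME` from being confused with `FORMER CONFORMED NAME`.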


### View section data (HTML)
SEC 10-K reports follow a standardized structure mandated by the Securities and Exchange Commission, with specific sections identified by "Item" numbers. This standardization helps investors and analysts navigate these lengthy documents (frequently exceeding 100 pages) and locate specific information across different companies' filings.

Some of the important sections include:

* **Item 1**: Business - Overview of the company's primary operations, products, services, markets, and competitive landscape
* **Item 1A**: Risk Factors - Detailed discussion of risks and uncertainties that could affect the company's business and financial performance
* **Item 5**: Market for Registrant's Common Equity - Information about the company's stock, including market data and dividend history
* **Item 7**: Management's Discussion and Analysis (MD&A) - Management's perspective on the company's financial condition, results of operations, and future outlook
* **Item 7A**: Quantitative and Qualitative Disclosures About Market Risk - Analysis of the company's exposure to market risks
* **Item 8**: Financial Statements and Supplementary Data - Audited financial statements and related notes
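
Because sections are identified by "Item" numbers, a simple way to locate them in plain text is to search for item headings and slice between consecutive matches. The sketch below illustrates the idea on made-up text. It is an assumption about one workable approach, not a description of how `SECFilingParser` finds sections, and `split_into_items` is a hypothetical helper.

```python
import re

# Matches headings such as "Item 1.", "ITEM 1A." or "Item 7A:" at the start of a line.
ITEM_HEADING_RE = re.compile(
    r"^[ \t]*item\s+(\d{1,2}[AB]?)\s*[.:]", re.IGNORECASE | re.MULTILINE
)


def split_into_items(text):
    """Map each item label (e.g. 'Item 1A') to the text up to the next item heading."""
    matches = list(ITEM_HEADING_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = f"Item {match.group(1).upper()}"
        # A later heading with the same label overwrites an earlier one.
        sections[label] = text[match.start():end].strip()
    return sections


# Made-up miniature "report" for illustration only.
sample_report = """Item 1. Business
The Company designs and sells devices.

Item 1A. Risk Factors
Demand may fluctuate.

Item 5. Market for Registrant's Common Equity
The stock trades on Nasdaq.
"""
sections = split_into_items(sample_report)
```

Real filings are messier: item headings also appear in the table of contents, so a production parser needs extra checks to pick the right occurrence.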


Several options to display section data are illustrated below. All of these require that you specify which section (i.e. Item number) to show.

The `display_section_html` method shows the extracted HTML content from the report. For instance, `Item 1` contains the basic company background information:

In [10]:
parser.display_section_html('Item 1')

Removed 7 page footers and copyright statements


0,1
,
Item 1A.,Risk Factors


The section named `Item 5` includes more information about the company stock:

In [11]:
parser.display_section_html('Item 5')

Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Removed 3 duplicate tables from Item 5


0,1,2,3,4,5,6,7
,,,,,,,
,,,,,,,
,Fourth Quarter,,Third Quarter,,Second Quarter,,First Quarter
2017 price range per share,$164.94 – $142.41,,$156.65 – $140.06,,$144.50 – $114.76,,$118.69 – $104.08
2016 price range per share,$116.18 – $91.50,,$112.39 – $89.47,,$109.43 – $92.39,,$123.82 – $105.57

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,,,,,,,,,,,,,,
,,,,,,,,,,,,,,
Periods,,Total Number of Shares Purchased,Total Number of Shares Purchased,,AveragePricePaid Per Share,AveragePricePaid Per Share,AveragePricePaid Per Share,,Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs,Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs,,Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1),Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1),Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1)
"July 2, 2017 to August 5, 2017:",,,,,,,,,,,,,,
Open market and privately negotiated purchases,,10076,,,$,148.87,,,10076,,,,,
,,,,,,,,,,,,,,
"August 6, 2017 to September 2, 2017:",,,,,,,,,,,,,,
May 2017 ASR,,4510,,,(2),(2),,,4510,,,,,
August 2017 ASR,,15069,,(3),(3),(3),,,15069,,(3),,,
Open market and privately negotiated purchases,,9684,,,$,160.06,,,9684,,,,,

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,September 2012,September 2012,September 2012,,September 2013,September 2013,September 2013,,September 2014,September 2014,September 2014,,September 2015,September 2015,September 2015,,September 2016,September 2016,September 2016,,September 2017,September 2017,September 2017
Apple Inc.,,$,100,,,$,74,,,$,111,,,$,128,,,$,129,,,$,179,
S&P 500 Index,,$,100,,,$,119,,,$,143,,,$,142,,,$,164,,,$,194,
S&P Information Technology Index,,$,100,,,$,107,,,$,138,,,$,141,,,$,173,,,$,223,
Dow Jones U.S. Technology Supersector Index,,$,100,,,$,105,,,$,137,,,$,137,,,$,167,,,$,214,

0,1
,
Item 6.,Selected Financial Data


### View section text
We can also view a section in markdown format using the `display_section_text` method. Note that some minor issues remain here, such as bullet points being split across two lines. Attempts to fix this have so far introduced additional errors.

In [12]:
parser.display_section_text('Item 5')

Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Removed 3 duplicate tables from Item 5 in simple format

First 500 characters of markdown:
**Item 5.**

**Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities**

The Company’s common stock is traded on the Nasdaq Stock Market LLC (“Nasdaq”) under the symbol AAPL.

 **Price Range of Common Stock**

The price range per share of common stock presented below represents the highest and lowest intraday sales prices for the Company’s common stock on the Nasdaq during each quarter of the two most recent years.

 **Holders**

As of Octobe




**Item 5.**

**Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities**

The Company’s common stock is traded on the Nasdaq Stock Market LLC (“Nasdaq”) under the symbol AAPL.

 **Price Range of Common Stock**

The price range per share of common stock presented below represents the highest and lowest intraday sales prices for the Company’s common stock on the Nasdaq during each quarter of the two most recent years.

 **Holders**

As of October 20,2017 , there were 25,333 shareholders of record.

 **Dividends**

The Company paid a total of $12.6 billion and $12.0 billion in dividends during 2017 and 2016 , respectively, and expects to pay quarterly dividends of $0.63 per common share each quarter, subject to declaration by the Board of Directors. The Company also plans to increase its dividend on an annual basis, subject to declaration by the Board of Directors.

**Purchases of Equity Securities by the Issuer and Affiliated Purchasers**

Share repurchase activity during the three months ended September 30,2017 was as follows (in millions, except number of shares, which are reflected in thousands, and per share amounts):

 (1)

In May 2017 , the Company’s Board of Directors increased the Company’s share repurchase authorization from $175 billion to $210 billion of the Company’s common stock, of which $166 billion had been utilized as of September 30,2017 . The remaining $44 billion in the table represents the amount available to repurchase shares under the authorized repurchase program as of September 30,2017 . The Company’s share repurchase program does not obligate it to acquire any specific number of shares. Under the program, shares may be repurchased in privately negotiated and/or open market transactions, including under plans complying with Rule 10 b 5-1 under the Exchange Act.

 (2)

In May 2017, the Company entered into an accelerated share repurchase arrangement (“ASR”) to purchase up to $3.0 billion of the Company’s common stock. In August 2017, the purchase period for this ASR ended and an additional 4.5 million shares were delivered and retired. In total, 20.1 million shares were delivered under this ASR at an average repurchase price of $149.20 .

 (3)

In August 2017, the Company entered into a new ASR to purchase up to $3.0 billion of the Company’s common stock. In exchange for an up-front payment of $3.0 billion , the financial institution party to the arrangement committed to deliver shares to the Company during the ASR’s purchase period, which will end in November 2017 . The total number of shares ultimately delivered, and therefore the average price paid per share, will be determined at the end of the applicable purchase period based on the volume-weighted average price of the Company’s common stock during that period.

 **Company Stock Performance**

The following graph shows a comparison of cumulative total shareholder return, calculated on a dividend reinvested basis, for the Company, the S&P 500 Index, the S&P Information Technology Index and the Dow Jones U.S. Technology Supersector Index for the five years ended September 30,2017 . The graph assumes $100 was invested in each of the Company’s common stock, the S&P 500 Index, the S&P Information Technology Index and the Dow Jones U.S. Technology Supersector Index as of the market close on September 28,2012. Note that historic stock price performance is not necessarily indicative of future stock price performance.

 *

$100 invested on 9/28/12 in stock or index, including reinvestment of dividends. Data points are the last day of each fiscal year for the Company’s common stock and September 30 th for indexes.

 **Item 6.**

### View section tables
The `display_section_tables` method can also be used to view all tables from a specific report section.

While converting tabular data to markdown format would typically enhance readability and processing, the complex structure of financial tables in SEC 10-K reports presents significant challenges. These tables frequently contain:

* Currency symbols in dedicated columns that appear inconsistently across rows
* Footnote reference numbers that create irregular cell structures
* Multi-level headers and nested relationships
* Mixed numeric formats (percentages, dollar amounts, ratios)
* Row and column spans that don't translate cleanly to simple markdown
* Missing titles and summaries, which would otherwise provide contextual meaning

For our RAG pipeline development, we'll maintain these tables in their cleaned HTML format (i.e. after removing most of the HTML formatting). This approach preserves the original structure and relationships between data points, ensuring that when our retrieval system pulls financial information, it maintains the proper context and formatting that financial analysts would expect.

In future iterations, we could explore specialized table extraction tools or custom parsers designed specifically for financial tables. For the current benchmarking objectives, the cleaned HTML representation offers a reasonable balance between preserving information and usability within our system.
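
As an illustration of what "cleaned HTML" can mean, the sketch below rebuilds a table while keeping only the structural tags, the `colspan`/`rowspan` attributes, and the cell text. It uses the standard-library `html.parser` module. The `TableCleaner` class and the sample markup are hypothetical, and the project's own cleaning code may work differently.

```python
import html
from html.parser import HTMLParser

STRUCTURAL_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}
KEPT_ATTRIBUTES = {"colspan", "rowspan"}


class TableCleaner(HTMLParser):
    """Rebuild table markup, keeping structure and cell text but dropping styling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            kept = "".join(
                f' {name}="{value}"'
                for name, value in attrs
                if name in KEPT_ATTRIBUTES and value
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.parts.append(html.escape(text, quote=False))


def clean_table_html(raw_html):
    """Return a stripped-down copy of an HTML table."""
    cleaner = TableCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)


# Made-up markup imitating the heavy inline styling found in filings.
raw = (
    '<table style="border-collapse:collapse" cellpadding="0">'
    '<tr><td colspan="2" style="padding:2px"><font size="2"><b>Fourth Quarter</b></font></td></tr>'
    '<tr><td style="text-align:right">$164.94</td><td>$142.41</td></tr>'
    "</table>"
)
cleaned = clean_table_html(raw)
```

Keeping `colspan` and `rowspan` matters because multi-level headers rely on them; dropping them would shift cells out of alignment.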

In [13]:
# Display all tables from a section
parser.display_section_tables('Item 5')

Processed 4 footnote tables in Item 5
Removed 3 duplicate tables from Item 5


0,1,2,3,4,5,6,7
,,,,,,,
,,,,,,,
,Fourth Quarter,,Third Quarter,,Second Quarter,,First Quarter
2017 price range per share,$164.94 – $142.41,,$156.65 – $140.06,,$144.50 – $114.76,,$118.69 – $104.08
2016 price range per share,$116.18 – $91.50,,$112.39 – $89.47,,$109.43 – $92.39,,$123.82 – $105.57


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,,,,,,,,,,,,,,
,,,,,,,,,,,,,,
Periods,,Total Number of Shares Purchased,Total Number of Shares Purchased,,AveragePricePaid Per Share,AveragePricePaid Per Share,AveragePricePaid Per Share,,Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs,Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs,,Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1),Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1),Approximate Dollar Value of Shares That May Yet Be Purchased Under the Plans or Programs (1)
"July 2, 2017 to August 5, 2017:",,,,,,,,,,,,,,
Open market and privately negotiated purchases,,10076,,,$,148.87,,,10076,,,,,
,,,,,,,,,,,,,,
"August 6, 2017 to September 2, 2017:",,,,,,,,,,,,,,
May 2017 ASR,,4510,,,(2),(2),,,4510,,,,,
August 2017 ASR,,15069,,(3),(3),(3),,,15069,,(3),,,
Open market and privately negotiated purchases,,9684,,,$,160.06,,,9684,,,,,


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,September 2012,September 2012,September 2012,,September 2013,September 2013,September 2013,,September 2014,September 2014,September 2014,,September 2015,September 2015,September 2015,,September 2016,September 2016,September 2016,,September 2017,September 2017,September 2017
Apple Inc.,,$,100,,,$,74,,,$,111,,,$,128,,,$,129,,,$,179,
S&P 500 Index,,$,100,,,$,119,,,$,143,,,$,142,,,$,164,,,$,194,
S&P Information Technology Index,,$,100,,,$,107,,,$,138,,,$,141,,,$,173,,,$,223,
Dow Jones U.S. Technology Supersector Index,,$,100,,,$,105,,,$,137,,,$,137,,,$,167,,,$,214,


0,1
,
Item 6.,Selected Financial Data



Footnotes:
  (1) InMay 2017, the Company’s Board of Directors increased the Company’s share repurchase authorization from $175 billion to$210 billionof the Company’s common stock, of which$166 billionhad been utilized as ofSeptember 30, 2017. The remaining$44 billionin the table represents the amount available to repurchase shares under the authorized repurchase program as ofSeptember 30, 2017. The Company’s share repurchase program does not obligate it to acquire any specific number of shares. Under the program, shares may be repurchased in privately negotiated and/or open market transactions, including under plans complying with Rule 10b5-1 under the Exchange Act.
  (2) In May 2017, the Company entered into an accelerated share repurchase arrangement (“ASR”) to purchase up to$3.0 billionof the Company’s common stock. In August 2017, the purchase period for this ASR ended and an additional4.5 millionshares were delivered and retired. In total,20.1 millionshares were delivered under t

### Process section

The `parse_section` method processes the data into a form suitable for a RAG workflow.

Some of the steps applied include:

- Removing duplicate tables (it is unclear why the source files contain duplicates in the first place)
- Cleaning tables so only the basic HTML structure remains (e.g. removing formatting)
- Removing footers (such as company name and page number)
- Outputting text in markdown format, while keeping tables in HTML with most of the formatting removed
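
The table de-duplication step can be sketched as follows: fingerprint each table by its visible text, then keep only the first table for each fingerprint. This is a minimal illustration under our own assumptions (tables are considered duplicates when their text matches after whitespace and case normalization). The helper names are hypothetical and this is not the logic used inside `parse_section`.

```python
import hashlib
import re


def table_fingerprint(table_html):
    """Hash a table's visible text so formatting differences do not matter."""
    text = re.sub(r"<[^>]+>", " ", table_html)  # drop tags
    text = re.sub(r"\s+", " ", text).strip().lower()  # normalize whitespace and case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate_tables(tables):
    """Keep the first occurrence of each table, preserving document order."""
    seen = set()
    unique = []
    for table in tables:
        fingerprint = table_fingerprint(table)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(table)
    return unique


tables = [
    "<table><tr><td>Item 6.</td><td>Selected Financial Data</td></tr></table>",
    '<table class="x"><tr><td>Item 6.</td><td> Selected  Financial Data</td></tr></table>',
    "<table><tr><td>2017 price range per share</td></tr></table>",
]
unique = deduplicate_tables(tables)
removed = len(tables) - len(unique)  # number of duplicates dropped
```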

In [21]:
# Parse a specific section
section_data = parser.parse_section('Item 5', output_format='llm')

Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Found 4 unique tables in section Item 5 (removed -1 duplicates, converted 4 footnotes to text)


In [22]:
print(section_data['text'])

**Item 5.**

**Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities**

The Company’s common stock is traded on the Nasdaq Stock Market LLC (“Nasdaq”) under the symbol AAPL.

 **Price Range of Common Stock**

The price range per share of common stock presented below represents the highest and lowest intraday sales prices for the Company’s common stock on the Nasdaq during each quarter of the two most recent years.

 

<table class="cleaned-financial-table"><tr><td colspan="8"></td></tr><tr><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td>Fourth Quarter</td><td></td><td>Third Quarter</td><td></td><td>Second Quarter</td><td></td><td>First Quarter</td></tr><tr><td>2017 price range per share</td><td>$164.94 – $142.41</td><td></td><td>$156.65 – $140.06</td><td></td><td>$144.50 – $114.76</td><td></td><td>$118.69 – $104.08</td></tr><tr><td>2016 price range per share</td><td>$116.18 – $91.50</td

After the data has been parsed and cleaned, we can check how the tables look to ensure they are still okay:

In [23]:
# Display all tables from a section
parser.display_section_tables_from_parsed('Item 5')

Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Removed 3 duplicate tables from Item 5 in simple format
Found 4 data tables in Item 5:

Table 1:


0,1,2,3,4,5,6,7
,,,,,,,
,,,,,,,
,Fourth Quarter,,Third Quarter,,Second Quarter,,First Quarter
2017 price range per share,$164.94 – $142.41,,$156.65 – $140.06,,$144.50 – $114.76,,$118.69 – $104.08
2016 price range per share,$116.18 – $91.50,,$112.39 – $89.47,,$109.43 – $92.39,,$123.82 – $105.57



Table 2:


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,,,,,,,,,,,,,,
,,,,,,,,,,,,,,
Periods,,Total Numberof Shares Purchased,Total Numberof Shares Purchased,,AveragePricePaid Per Share,AveragePricePaid Per Share,AveragePricePaid Per Share,,Total Number of SharesPurchased as Part of PubliclyAnnounced Plans or Programs,Total Number of SharesPurchased as Part of PubliclyAnnounced Plans or Programs,,Approximate Dollar Value ofShares That May Yet Be PurchasedUnder the Plans or Programs(1),Approximate Dollar Value ofShares That May Yet Be PurchasedUnder the Plans or Programs(1),Approximate Dollar Value ofShares That May Yet Be PurchasedUnder the Plans or Programs(1)
"July 2, 2017 to August 5, 2017:",,,,,,,,,,,,,,
Open market and privately negotiated purchases,,10076,,,$,148.87,,,10076,,,,,
,,,,,,,,,,,,,,
"August 6, 2017 to September 2, 2017:",,,,,,,,,,,,,,
May 2017 ASR,,4510,,,-2,-2,,,4510,,,,,
August 2017 ASR,,15069,,-3.0,-3,-3,,,15069,,-3.0,,,
Open market and privately negotiated purchases,,9684,,,$,160.06,,,9684,,,,,



Table 3:


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,September2012,September2012,September2012,,September2013,September2013,September2013,,September2014,September2014,September2014,,September2015,September2015,September2015,,September2016,September2016,September2016,,September2017,September2017,September2017
Apple Inc.,,$,100,,,$,74,,,$,111,,,$,128,,,$,129,,,$,179,
S&P 500 Index,,$,100,,,$,119,,,$,143,,,$,142,,,$,164,,,$,194,
S&P Information Technology Index,,$,100,,,$,107,,,$,138,,,$,141,,,$,173,,,$,223,
Dow Jones U.S. Technology Supersector Index,,$,100,,,$,105,,,$,137,,,$,137,,,$,167,,,$,214,



Table 4:


0,1
,
Item 6.,Selected Financial Data


# Process all reports
Now we'll process all of the downloaded reports. The full SEC 10-K filings include many sections, so to reduce processing time we'll focus on Item 1 (business overview), Item 1A (risk factors), Item 5 (company stock), Item 7 (management analysis), Item 7A (quantitative and qualitative disclosures) and Item 8 (financial statements).

In [14]:
reports_dir = f"{RAW_DATA_DIR}/sec-edgar-filings"
sections_to_extract = ["Item 1", "Item 1A", "Item 5", "Item 7", "Item 7A", "Item 8"]
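
For orientation, the sketch below shows one way to enumerate the downloaded filings. It relies on the directory layout seen earlier in this notebook (`<TICKER>/10-K/<ACCESSION>/full-submission.txt` under `reports_dir`). The function `find_full_submissions` is a hypothetical stand-in, not the project's `find_sec_reports`, and the second accession number in the demo is a placeholder.

```python
import tempfile
from pathlib import Path


def find_full_submissions(base_dir):
    """Return (ticker, accession_number, path) for every downloaded 10-K."""
    reports = []
    # Layout: <base_dir>/<TICKER>/10-K/<ACCESSION>/full-submission.txt
    for path in sorted(Path(base_dir).glob("*/10-K/*/full-submission.txt")):
        accession = path.parent.name
        ticker = path.parent.parent.parent.name
        reports.append((ticker, accession, path))
    return reports


# Build a tiny fake directory tree to exercise the function offline.
with tempfile.TemporaryDirectory() as tmp:
    fake_filings = [
        ("AAPL", "0000320193-17-000070"),
        ("MSFT", "0000000000-00-000001"),  # placeholder accession number
    ]
    for ticker, accession in fake_filings:
        folder = Path(tmp) / ticker / "10-K" / accession
        folder.mkdir(parents=True)
        (folder / "full-submission.txt").write_text("<SEC-DOCUMENT>")
    found = [(t, a) for t, a, _ in find_full_submissions(tmp)]
```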

In [19]:
#| output: false
output_dir = process_all_sec_reports(base_dir=reports_dir, sections_to_extract=sections_to_extract, output_dir=PROCESSED_DATA_DIR)

Finding all company directories...
Found 5 company directories
Finding all SEC 10-K reports...


Scanning companies: 100%|██████████| 5/5 [00:00<00:00, 7342.97it/s]


Found 50 SEC 10-K reports
Processing reports to extract metadata and available sections...


Processing reports:   0%|          | 0/50 [00:00<?, ?it/s]

Processed 14 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -6 duplicates, converted 14 footnotes to text)
Processed 57 footnote tables in Item 1A
Found 2 unique tables in section Item 1A (removed -31 duplicates, converted 57 footnotes to text)
Processed 2 footnote tables in Item 5
Found 7 unique tables in section Item 5 (removed 11 duplicates, converted 2 footnotes to text)
Processed 56 footnote tables in Item 7
Found 23 unique tables in section Item 7 (removed -4 duplicates, converted 56 footnotes to text)
Found 3 unique tables in section Item 7A (removed 5 duplicates, converted 0 footnotes to text)
Processed 30 footnote tables in Item 8


Processing reports:   2%|▏         | 1/50 [00:16<13:07, 16.06s/it]

Found 53 unique tables in section Item 8 (removed 98 duplicates, converted 30 footnotes to text)
Processed 12 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -4 duplicates, converted 12 footnotes to text)
Processed 53 footnote tables in Item 1A
Found 2 unique tables in section Item 1A (removed -31 duplicates, converted 53 footnotes to text)
Processed 2 footnote tables in Item 5
Found 7 unique tables in section Item 5 (removed 7 duplicates, converted 2 footnotes to text)
Processed 70 footnote tables in Item 7
Found 27 unique tables in section Item 7 (removed -10 duplicates, converted 70 footnotes to text)
Found 2 unique tables in section Item 7A (removed 4 duplicates, converted 0 footnotes to text)
Processed 30 footnote tables in Item 8


Processing reports:   4%|▍         | 2/50 [00:27<10:47, 13.48s/it]

Found 68 unique tables in section Item 8 (removed 154 duplicates, converted 30 footnotes to text)
Processed 14 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -6 duplicates, converted 14 footnotes to text)
Processed 53 footnote tables in Item 1A
Found 2 unique tables in section Item 1A (removed -31 duplicates, converted 53 footnotes to text)
Found 6 unique tables in section Item 5 (removed 8 duplicates, converted 0 footnotes to text)
Processed 61 footnote tables in Item 7
Found 23 unique tables in section Item 7 (removed -13 duplicates, converted 61 footnotes to text)
Found 2 unique tables in section Item 7A (removed 4 duplicates, converted 0 footnotes to text)
Processed 24 footnote tables in Item 8


Processing reports:   6%|▌         | 3/50 [00:36<08:52, 11.34s/it]

Found 56 unique tables in section Item 8 (removed 118 duplicates, converted 24 footnotes to text)
Found 1 unique tables in section Item 1 (removed 13 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 25 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 6 duplicates, converted 0 footnotes to text)
Found 15 unique tables in section Item 7 (removed 45 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 4 duplicates, converted 0 footnotes to text)


Processing reports:   8%|▊         | 4/50 [00:45<07:51, 10.26s/it]

Found 51 unique tables in section Item 8 (removed 131 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1 (removed 9 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 25 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 4 duplicates, converted 0 footnotes to text)
Found 15 unique tables in section Item 7 (removed 41 duplicates, converted 0 footnotes to text)
Found 63 unique tables in section Item 7A (removed 163 duplicates, converted 0 footnotes to text)


Processing reports:  10%|█         | 5/50 [00:56<08:00, 10.68s/it]

Found 68 unique tables in section Item 8 (removed 180 duplicates, converted 0 footnotes to text)
Processed 15 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -7 duplicates, converted 15 footnotes to text)
Processed 52 footnote tables in Item 1A
Found 2 unique tables in section Item 1A (removed -26 duplicates, converted 52 footnotes to text)
Processed 2 footnote tables in Item 5
Found 3 unique tables in section Item 5 (removed 5 duplicates, converted 2 footnotes to text)
Processed 53 footnote tables in Item 7
Found 22 unique tables in section Item 7 (removed -6 duplicates, converted 53 footnotes to text)
Found 3 unique tables in section Item 7A (removed 7 duplicates, converted 0 footnotes to text)
Processed 32 footnote tables in Item 8


Processing reports:  12%|█▏        | 6/50 [01:05<07:19,  9.99s/it]

Found 55 unique tables in section Item 8 (removed 98 duplicates, converted 32 footnotes to text)
Found 1 unique tables in section Item 1 (removed 9 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 27 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 6 duplicates, converted 0 footnotes to text)
Found 17 unique tables in section Item 7 (removed 49 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 6 duplicates, converted 0 footnotes to text)


Processing reports:  14%|█▍        | 7/50 [01:12<06:27,  9.02s/it]

Found 55 unique tables in section Item 8 (removed 137 duplicates, converted 0 footnotes to text)
Processed 18 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -15 duplicates, converted 18 footnotes to text)
Processed 56 footnote tables in Item 1A
Found 2 unique tables in section Item 1A (removed -44 duplicates, converted 56 footnotes to text)
Processed 2 footnote tables in Item 5
Found 3 unique tables in section Item 5 (removed 0 duplicates, converted 2 footnotes to text)
Processed 61 footnote tables in Item 7
Found 20 unique tables in section Item 7 (removed -46 duplicates, converted 61 footnotes to text)
Found 3 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)
Processed 29 footnote tables in Item 8


Processing reports:  16%|█▌        | 8/50 [01:21<06:20,  9.05s/it]

Found 59 unique tables in section Item 8 (removed 14 duplicates, converted 29 footnotes to text)
Found 1 unique tables in section Item 1 (removed 11 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 27 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 4 duplicates, converted 0 footnotes to text)
Found 15 unique tables in section Item 7 (removed 41 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 6 duplicates, converted 0 footnotes to text)


Processing reports:  18%|█▊        | 9/50 [01:28<05:41,  8.32s/it]

Found 55 unique tables in section Item 8 (removed 135 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1 (removed 11 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 29 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 4 duplicates, converted 0 footnotes to text)
Found 15 unique tables in section Item 7 (removed 43 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 6 duplicates, converted 0 footnotes to text)


Processing reports:  20%|██        | 10/50 [01:36<05:28,  8.22s/it]

Found 51 unique tables in section Item 8 (removed 133 duplicates, converted 0 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  22%|██▏       | 11/50 [01:41<04:42,  7.25s/it]

Found 44 unique tables in section Item 8 (removed 44 duplicates, converted 0 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Processed 58 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -57 duplicates, converted 58 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 10 footnote tables in Item 7
Found 11 unique tables in section Item 7 (removed 1 duplicates, converted 10 footnotes to text)
Found 3 unique tables in section Item 7A (removed 3 duplicates, converted 0 footnotes to text)
Processed 31 footnote tables in Item 8


Processing reports:  24%|██▍       | 12/50 [01:47<04:25,  7.00s/it]

Found 41 unique tables in section Item 8 (removed 10 duplicates, converted 31 footnotes to text)
Found 3 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Processed 62 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -62 duplicates, converted 62 footnotes to text)
Found 1 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Processed 10 footnote tables in Item 7
Found 9 unique tables in section Item 7 (removed -10 duplicates, converted 10 footnotes to text)
Found 3 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Processed 32 footnote tables in Item 8


Processing reports:  26%|██▌       | 13/50 [01:56<04:36,  7.48s/it]

Found 44 unique tables in section Item 8 (removed -32 duplicates, converted 32 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Processed 61 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -60 duplicates, converted 61 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Processed 8 footnote tables in Item 7
Found 9 unique tables in section Item 7 (removed 1 duplicates, converted 8 footnotes to text)
Found 3 unique tables in section Item 7A (removed 3 duplicates, converted 0 footnotes to text)
Processed 33 footnote tables in Item 8


Processing reports:  28%|██▊       | 14/50 [02:02<04:17,  7.16s/it]

Found 43 unique tables in section Item 8 (removed 10 duplicates, converted 33 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  30%|███       | 15/50 [02:07<03:45,  6.43s/it]

Found 44 unique tables in section Item 8 (removed 44 duplicates, converted 0 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  32%|███▏      | 16/50 [02:11<03:19,  5.86s/it]

Found 44 unique tables in section Item 8 (removed 44 duplicates, converted 0 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Processed 58 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -57 duplicates, converted 58 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 16 footnote tables in Item 7
Found 11 unique tables in section Item 7 (removed -5 duplicates, converted 16 footnotes to text)
Found 3 unique tables in section Item 7A (removed 3 duplicates, converted 0 footnotes to text)
Processed 28 footnote tables in Item 8


Processing reports:  34%|███▍      | 17/50 [02:18<03:19,  6.06s/it]

Found 42 unique tables in section Item 8 (removed 14 duplicates, converted 28 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Processed 60 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -59 duplicates, converted 60 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 8 footnote tables in Item 7
Found 9 unique tables in section Item 7 (removed 1 duplicates, converted 8 footnotes to text)
Found 3 unique tables in section Item 7A (removed 3 duplicates, converted 0 footnotes to text)
Processed 38 footnote tables in Item 8


Processing reports:  36%|███▌      | 18/50 [02:27<03:40,  6.88s/it]

Found 42 unique tables in section Item 8 (removed 4 duplicates, converted 38 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  38%|███▊      | 19/50 [02:32<03:19,  6.42s/it]

Found 44 unique tables in section Item 8 (removed 44 duplicates, converted 0 footnotes to text)
Found 3 unique tables in section Item 1 (removed 3 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 7A (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  40%|████      | 20/50 [02:37<03:02,  6.09s/it]

Found 42 unique tables in section Item 8 (removed 42 duplicates, converted 0 footnotes to text)
Processed 6 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -4 duplicates, converted 6 footnotes to text)
Processed 55 footnote tables in Item 1A
Found 0 unique tables in section Item 1A (removed -55 duplicates, converted 55 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 3 footnote tables in Item 7


Processing reports:  42%|████▏     | 21/50 [02:40<02:29,  5.16s/it]

Found 7 unique tables in section Item 7 (removed 4 duplicates, converted 3 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)


Processing reports:  44%|████▍     | 22/50 [02:43<02:01,  4.32s/it]

Found 6 unique tables in section Item 7 (removed 6 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Processed 4 footnote tables in Item 1
Found 1 unique tables in section Item 1 (removed -3 duplicates, converted 4 footnotes to text)
Processed 54 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -53 duplicates, converted 54 footnotes to text)
Found 3 unique tables in section Item 5 (removed 3 duplicates, converted 0 footnotes to text)
Processed 5 footnote tables in Item 7


Processing reports:  46%|████▌     | 23/50 [02:46<01:50,  4.09s/it]

Found 8 unique tables in section Item 7 (removed 3 duplicates, converted 5 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Processed 7 footnote tables in Item 1
Found 2 unique tables in section Item 1 (removed -7 duplicates, converted 7 footnotes to text)
Processed 63 footnote tables in Item 1A
Found 0 unique tables in section Item 1A (removed -63 duplicates, converted 63 footnotes to text)
Found 1 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Processed 3 footnote tables in Item 7
Found 7 unique tables in section Item 7 (removed -3 duplicates, converted 3 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)


Processing reports:  48%|████▊     | 24/50 [02:49<01:33,  3.59s/it]

Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Processed 6 footnote tables in Item 1
Found 1 unique tables in section Item 1 (removed -5 duplicates, converted 6 footnotes to text)
Processed 58 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -57 duplicates, converted 58 footnotes to text)
Processed 1 footnote tables in Item 5
Found 4 unique tables in section Item 5 (removed 3 duplicates, converted 1 footnotes to text)
Processed 9 footnote tables in Item 7


Processing reports:  50%|█████     | 25/50 [02:52<01:28,  3.53s/it]

Found 9 unique tables in section Item 7 (removed 0 duplicates, converted 9 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  52%|█████▏    | 26/50 [02:55<01:22,  3.43s/it]

Found 10 unique tables in section Item 7 (removed 10 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 5 (removed 1 duplicates, converted 0 footnotes to text)


Processing reports:  54%|█████▍    | 27/50 [03:00<01:26,  3.74s/it]

Found 6 unique tables in section Item 7 (removed 6 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  56%|█████▌    | 28/50 [03:03<01:17,  3.54s/it]

Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Processed 6 footnote tables in Item 1
Found 1 unique tables in section Item 1 (removed -5 duplicates, converted 6 footnotes to text)
Processed 55 footnote tables in Item 1A
Found 1 unique tables in section Item 1A (removed -54 duplicates, converted 55 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 4 footnote tables in Item 7


Processing reports:  58%|█████▊    | 29/50 [03:06<01:11,  3.39s/it]

Found 7 unique tables in section Item 7 (removed 3 duplicates, converted 4 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 1 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)


Processing reports:  60%|██████    | 30/50 [03:08<01:03,  3.17s/it]

Found 8 unique tables in section Item 7 (removed 8 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Found 7 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Found 22 unique tables in section Item 7 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 50 unique tables in section Item 8 (removed 0 duplicates, converted 0 footnotes to text)


Processing reports:  62%|██████▏   | 31/50 [03:14<01:15,  3.96s/it]

Removed 4 page footers and copyright statements
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 10 page footers and copyright statements
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 3 page footers and copyright statements
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Removed 8 page footers and copyright statements
Found 9 unique tables in section Item 7 (removed 9 duplicates, converted 0 footnotes to text)
Removed 2 page footers and copyright statements
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 32 page footers and copyright statements
Processed 2 footnote tables in Item 8


Processing reports:  64%|██████▍   | 32/50 [03:19<01:15,  4.19s/it]

Found 43 unique tables in section Item 8 (removed 41 duplicates, converted 2 footnotes to text)
Removed 7 page footers and copyright statements
Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 8 page footers and copyright statements
Found 1 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Found 4 unique tables in section Item 5 (removed -1 duplicates, converted 4 footnotes to text)
Removed 14 page footers and copyright statements
Processed 4 footnote tables in Item 7
Found 19 unique tables in section Item 7 (removed 14 duplicates, converted 4 footnotes to text)
Removed 2 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 34 page footers and copyright statements
Processed 10 footnote tables in Item 8


Processing reports:  66%|██████▌   | 33/50 [03:26<01:28,  5.19s/it]

Found 45 unique tables in section Item 8 (removed 36 duplicates, converted 10 footnotes to text)
Removed 7 page footers and copyright statements
Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 9 page footers and copyright statements
Found 1 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 5 page footers and copyright statements
Processed 4 footnote tables in Item 5
Found 4 unique tables in section Item 5 (removed -1 duplicates, converted 4 footnotes to text)
Removed 14 page footers and copyright statements
Processed 3 footnote tables in Item 7
Found 19 unique tables in section Item 7 (removed 15 duplicates, converted 3 footnotes to text)
Removed 2 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 34 page footers and copyright statements
Processed 9 footnote tables in Item 8


Processing reports:  68%|██████▊   | 34/50 [03:36<01:42,  6.39s/it]

Found 43 unique tables in section Item 8 (removed 35 duplicates, converted 9 footnotes to text)
Removed 4 page footers and copyright statements
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 11 page footers and copyright statements
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 1 page footers and copyright statements
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Removed 6 page footers and copyright statements
Found 6 unique tables in section Item 7 (removed 6 duplicates, converted 0 footnotes to text)
Removed 1 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Removed 25 page footers and copyright statements
Processed 2 footnote tables in Item 8


Processing reports:  70%|███████   | 35/50 [03:40<01:27,  5.83s/it]

Found 37 unique tables in section Item 8 (removed 35 duplicates, converted 2 footnotes to text)
Removed 4 page footers and copyright statements
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 12 page footers and copyright statements
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 2 page footers and copyright statements
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Removed 6 page footers and copyright statements
Found 7 unique tables in section Item 7 (removed 7 duplicates, converted 0 footnotes to text)
Removed 2 page footers and copyright statements
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 25 page footers and copyright statements
Processed 2 footnote tables in Item 8


Processing reports:  72%|███████▏  | 36/50 [03:44<01:13,  5.27s/it]

Found 35 unique tables in section Item 8 (removed 33 duplicates, converted 2 footnotes to text)
Removed 5 page footers and copyright statements
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 11 page footers and copyright statements
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 2 page footers and copyright statements
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Removed 6 page footers and copyright statements
Found 7 unique tables in section Item 7 (removed 7 duplicates, converted 0 footnotes to text)
Removed 2 page footers and copyright statements
Found 0 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 27 page footers and copyright statements
Processed 2 footnote tables in Item 8


Processing reports:  74%|███████▍  | 37/50 [03:48<01:03,  4.90s/it]

Found 35 unique tables in section Item 8 (removed 33 duplicates, converted 2 footnotes to text)
Removed 4 page footers and copyright statements
Found 0 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 12 page footers and copyright statements
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 1 page footers and copyright statements
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Removed 6 page footers and copyright statements
Found 6 unique tables in section Item 7 (removed 6 duplicates, converted 0 footnotes to text)
Removed 1 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Removed 23 page footers and copyright statements
Processed 2 footnote tables in Item 8


Processing reports:  76%|███████▌  | 38/50 [03:53<00:57,  4.77s/it]

Found 34 unique tables in section Item 8 (removed 32 duplicates, converted 2 footnotes to text)
Removed 7 page footers and copyright statements
Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 9 page footers and copyright statements
Found 1 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 3 page footers and copyright statements
Processed 2 footnote tables in Item 5
Found 3 unique tables in section Item 5 (removed 0 duplicates, converted 2 footnotes to text)
Removed 13 page footers and copyright statements
Processed 6 footnote tables in Item 7
Found 17 unique tables in section Item 7 (removed 10 duplicates, converted 6 footnotes to text)
Removed 2 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 30 page footers and copyright statements
Processed 11 footnote tables in Item 8


Processing reports:  78%|███████▊  | 39/50 [03:59<00:58,  5.30s/it]

Found 40 unique tables in section Item 8 (removed 28 duplicates, converted 11 footnotes to text)
Removed 4 page footers and copyright statements
Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Removed 9 page footers and copyright statements
Found 1 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Removed 3 page footers and copyright statements
Processed 3 footnote tables in Item 5
Found 3 unique tables in section Item 5 (removed -1 duplicates, converted 3 footnotes to text)
Removed 8 page footers and copyright statements
Processed 5 footnote tables in Item 7
Found 9 unique tables in section Item 7 (removed 3 duplicates, converted 5 footnotes to text)
Removed 2 page footers and copyright statements
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Removed 31 page footers and copyright statements
Processed 12 footnote tables in Item 8


Processing reports:  80%|████████  | 40/50 [04:08<01:04,  6.45s/it]

Found 44 unique tables in section Item 8 (removed 31 duplicates, converted 12 footnotes to text)
Found 38 unique tables in section Item 1 (removed 38 duplicates, converted 0 footnotes to text)
Found 16 unique tables in section Item 1A (removed 16 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 12 footnote tables in Item 7
Found 54 unique tables in section Item 7 (removed 44 duplicates, converted 12 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 44 footnote tables in Item 8
Found 76 unique tables in section Item 8 (removed 32 duplicates, converted 44 footnotes to text)


Processing reports:  82%|████████▏ | 41/50 [04:24<01:21,  9.08s/it]

Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Found 13 unique tables in section Item 7 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 57 unique tables in section Item 8 (removed 51 duplicates, converted 0 footnotes to text)


Processing reports:  84%|████████▍ | 42/50 [04:38<01:24, 10.56s/it]

Found 1 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 0 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Found 13 unique tables in section Item 7 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Found 58 unique tables in section Item 8 (removed 53 duplicates, converted 0 footnotes to text)


Processing reports:  86%|████████▌ | 43/50 [04:53<01:23, 12.00s/it]

Found 38 unique tables in section Item 1 (removed 38 duplicates, converted 0 footnotes to text)
Found 18 unique tables in section Item 1A (removed 18 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 12 footnote tables in Item 7
Found 65 unique tables in section Item 7 (removed 53 duplicates, converted 12 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 22 footnote tables in Item 8
Found 77 unique tables in section Item 8 (removed 57 duplicates, converted 22 footnotes to text)


Processing reports:  88%|████████▊ | 44/50 [05:10<01:20, 13.46s/it]

Found 39 unique tables in section Item 1 (removed 39 duplicates, converted 0 footnotes to text)
Found 18 unique tables in section Item 1A (removed 18 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 12 footnote tables in Item 7
Found 63 unique tables in section Item 7 (removed 53 duplicates, converted 12 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 24 footnote tables in Item 8
Found 92 unique tables in section Item 8 (removed 70 duplicates, converted 24 footnotes to text)


Processing reports:  90%|█████████ | 45/50 [05:27<01:12, 14.58s/it]

Found 53 unique tables in section Item 1 (removed 53 duplicates, converted 0 footnotes to text)
Found 24 unique tables in section Item 1A (removed 24 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 12 footnote tables in Item 7
Found 44 unique tables in section Item 7 (removed 34 duplicates, converted 12 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 14 footnote tables in Item 8
Found 86 unique tables in section Item 8 (removed 72 duplicates, converted 14 footnotes to text)


Processing reports:  92%|█████████▏| 46/50 [05:42<00:58, 14.75s/it]

Found 35 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 16 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Processed 16 footnote tables in Item 8
Found 69 unique tables in section Item 8 (removed -16 duplicates, converted 16 footnotes to text)


Processing reports:  94%|█████████▍| 47/50 [05:50<00:38, 12.78s/it]

Found 64 unique tables in section Item 1 (removed 64 duplicates, converted 0 footnotes to text)
Found 25 unique tables in section Item 1A (removed 25 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 8 footnote tables in Item 7
Found 45 unique tables in section Item 7 (removed 39 duplicates, converted 8 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 22 footnote tables in Item 8
Found 86 unique tables in section Item 8 (removed 64 duplicates, converted 22 footnotes to text)


Processing reports:  96%|█████████▌| 48/50 [06:05<00:26, 13.49s/it]

Found 26 unique tables in section Item 1 (removed 0 duplicates, converted 0 footnotes to text)
Found 17 unique tables in section Item 1A (removed 0 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 0 duplicates, converted 0 footnotes to text)
Processed 6 footnote tables in Item 7
Found 36 unique tables in section Item 7 (removed -6 duplicates, converted 6 footnotes to text)
Found 1 unique tables in section Item 7A (removed 0 duplicates, converted 0 footnotes to text)
Processed 17 footnote tables in Item 8
Found 78 unique tables in section Item 8 (removed -17 duplicates, converted 17 footnotes to text)


Processing reports:  98%|█████████▊| 49/50 [06:16<00:12, 12.68s/it]

Found 39 unique tables in section Item 1 (removed 39 duplicates, converted 0 footnotes to text)
Found 24 unique tables in section Item 1A (removed 24 duplicates, converted 0 footnotes to text)
Found 2 unique tables in section Item 5 (removed 2 duplicates, converted 0 footnotes to text)
Processed 12 footnote tables in Item 7
Found 64 unique tables in section Item 7 (removed 54 duplicates, converted 12 footnotes to text)
Found 1 unique tables in section Item 7A (removed 1 duplicates, converted 0 footnotes to text)
Processed 18 footnote tables in Item 8
Found 85 unique tables in section Item 8 (removed 67 duplicates, converted 18 footnotes to text)


Processing reports: 100%|██████████| 50/50 [06:32<00:00,  7.85s/it]

Found 50 valid reports with parseable sections

=== PROCESSING SUMMARY ===
Total reports processed: 50
Reports with at least one valid section: 50 (100.0%)
Reports with no valid sections: 0 (0.0%)

=== SECTION SUCCESS RATES ===
Item 1: 50/50 successful (100.0%)
Item 1A: 50/50 successful (100.0%)
Item 5: 50/50 successful (100.0%)
Item 7: 49/50 successful (98.0%)
Item 7A: 49/50 successful (98.0%)
Item 8: 49/50 successful (98.0%)

=== SECTION STATISTICS ===

Item 1 statistics (50 sections):
  Word count: avg=2666.7, median=2141, min=6, max=5931
  Token count: avg=3780.4, median=3599, min=38, max=7963
  Table count: avg=7.8, median=2, min=0, max=64

Item 1A statistics (50 sections):
  Word count: avg=7449.8, median=7864, min=8, max=19226
  Token count: avg=9021.2, median=9283, min=41, max=23522
  Table count: avg=3.9, median=1, min=0, max=25

Item 5 statistics (50 sections):
  Word count: avg=462.5, median=525, min=9, max=1403
  Token count: avg=1782.5, median=1956, min=63, max=4723
  Tabl




The processed reports are too long to display in full here, but an example report (`report_GOOG_2016.json`) is available in the repository's `examples` folder.

Let's inspect the processing summary that was written out alongside the reports:

In [10]:
#| echo: false
import json
from pprint import pprint

# Load the summary JSON file
with open('./data/processed/sec_report_processing_final_summary.json', 'r') as f:
    summary = json.load(f)

# Print the summary in a nicely formatted way
print("SEC Report Processing Summary:")
print("==============================")
pprint(summary, width=100, sort_dicts=False)

SEC Report Processing Summary:
{'timestamp': '2025-03-12 13:59:24',
 'summary': {'total_reports': 50,
             'successful_reports': 50,
             'failed_reports': 0,
             'section_failures': {'Item 1': {'count': 0, 'errors': {}},
                                  'Item 1A': {'count': 0, 'errors': {}},
                                  'Item 5': {'count': 0, 'errors': {}},
                                  'Item 7': {'count': 1, 'errors': {'empty_or_short_content': 1}},
                                  'Item 7A': {'count': 1, 'errors': {'empty_or_short_content': 1}},
                                  'Item 8': {'count': 1, 'errors': {'empty_or_short_content': 1}}},
             'section_success_rates': {'Item 1': {'success_count': 50,
                                                  'failure_count': 0,
                                                  'success_rate': 100.0},
                                       'Item 1A': {'success_count': 50,
                      

A few things stand out here, particularly for Item 8. This section contains many tables (44 on average) and consumes a large number of tokens (64,192 on average). That makes it difficult to work with if we keep the tabular data, and we may struggle to fit Item 8 from multiple reports into a single LLM prompt when generating synthetic questions.
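To make the size problem concrete, here is a rough back-of-the-envelope sketch rather than part of the processing pipeline. It estimates how many reports' worth of a given section would fit into one prompt, using the average token counts reported above. The context window (`CONTEXT_BUDGET`) and the tokens reserved for instructions and the model's answer (`RESERVED_TOKENS`) are hypothetical values chosen for illustration; substitute the limits of whichever model you use. Item 7 and Item 7A are omitted because their averages are not shown in the output above.

```python
# Average token counts per section, taken from the statistics reported above.
AVG_TOKENS = {
    "Item 1": 3780.4,
    "Item 1A": 9021.2,
    "Item 5": 1782.5,
    "Item 8": 64192,
}

# Hypothetical limits, for illustration only (not taken from the notebook).
CONTEXT_BUDGET = 128_000   # assumed model context window, in tokens
RESERVED_TOKENS = 8_000    # assumed room for instructions and the generated answer


def max_reports_per_prompt(avg_tokens, budget=CONTEXT_BUDGET, reserved=RESERVED_TOKENS):
    """Estimate how many average-sized sections fit in one prompt."""
    usable = budget - reserved
    return int(usable // avg_tokens)


fits = {section: max_reports_per_prompt(tokens) for section, tokens in AVG_TOKENS.items()}

for section, count in fits.items():
    print(f"{section}: about {count} report(s) per prompt")
```

Under these assumed limits, a single average Item 8 nearly fills the usable budget on its own, while a dozen or more Item 1A sections would fit. Because these are averages, an individual long section can still overflow the budget.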

# Summary

In this notebook, we've established the foundation for our RAG benchmarking system by collecting and processing SEC 10-K annual reports. Here's what we've accomplished:

- **Data Collection**: We've implemented a systematic approach to download 10-K reports from the SEC EDGAR database, focusing on major technology companies across multiple years to ensure a diverse and representative dataset.
- **Document Parsing**: We've developed parsing techniques to handle the HTML/XML structure of SEC filings, extracting the textual content while preserving important structural elements.
- **Metadata Enrichment**: We've added valuable metadata to each document, including company identifiers, filing dates, and section information, which will be crucial for our retrieval system.
- **Storage Optimization**: We've organized the processed documents in a structured JSON format that preserves the hierarchical nature of the reports while making them easily accessible for our RAG pipeline.

This processed dataset provides us with a clean, structured collection of financial documents that will serve as the knowledge base for our RAG system. The standardized format will allow for consistent chunking and embedding in subsequent steps.

In the next notebook, we'll build on this foundation to develop a comprehensive benchmarking dataset. We'll create challenging questions that require synthesizing information from multiple sections or reports, establish ground-truth answers with clear source attributions, and design evaluation metrics tailored to the financial domain. This dataset will let us rigorously compare different RAG pipeline configurations and identify the approaches that work best for financial document analysis.