# L3 M7.3: Financial Document Parsing & Chunking

## Learning Arc

1. Download SEC filings from the EDGAR API
2. Extract regulatory sections (Item 1, 1A, 7, 8)
3. Parse XBRL financial data (200 core tags)
4. Create compliance-aware chunks with metadata
5. Tag chunks for temporal queries and audit trails

## Concepts Covered

- **SEC Filing Structure**: 10-K sections (Item 1, 1A, 7, 8) with mandated regulatory boundaries
- **Compliance-Aware Chunking**: Chunk without crossing section boundaries or splitting tables, so retrieved text stays traceable for SOX Section 404 audits
- **XBRL Financial Data Parsing**: Extract structured financial data (balance sheet, income statement)
- **Metadata Tagging**: Tag chunks for temporal queries (fiscal periods, tickers, sections)
- **Audit Trail Generation**: SHA-256 hashes for compliance and tampering detection

## By the end of this notebook, you'll have:

✅ Working EDGAR API downloader with rate limiting  
✅ Section parser preserving regulatory boundaries  
✅ XBRL parser extracting 200 core financial tags  
✅ Compliance-aware chunker with metadata  
✅ Understanding of SOX Section 404 requirements  

## OFFLINE Mode Check

This notebook checks whether EDGAR, OpenAI, and Pinecone are enabled in `.env`. If EDGAR is disabled, the notebook runs in OFFLINE mode with mock data.

In [None]:
import os
import sys
from dotenv import load_dotenv

# Add parent directory to path for imports
sys.path.insert(0, os.path.abspath('..'))

# Load environment variables
load_dotenv('../.env')

EDGAR_ENABLED = os.getenv("EDGAR_ENABLED", "false").lower() == "true"
OPENAI_ENABLED = os.getenv("OPENAI_ENABLED", "false").lower() == "true"
PINECONE_ENABLED = os.getenv("PINECONE_ENABLED", "false").lower() == "true"

print("="*60)
print("L3 M7.3: Financial Document Parsing & Chunking")
print("="*60)
print()

if not EDGAR_ENABLED:
    print("⚠️  EDGAR service is DISABLED")
    print("⚠️  This notebook will run in OFFLINE mode")
    print("⚠️  Network calls will be skipped")
    print("⚠️  Mock data will be used for demonstrations")
    print()
    print("To enable:")
    print("  1. Set EDGAR_ENABLED=true in .env")
    print("  2. Set SEC_USER_AGENT with company name + email")
    print("  3. Example: SEC_USER_AGENT='YourCompany yourteam@company.com'")
else:
    print("✅ EDGAR service is ENABLED")
    print("✅ Notebook will make live API calls to SEC EDGAR")
    print(f"✅ User-Agent: {os.getenv('SEC_USER_AGENT', 'Not set')}")

print()
print(f"OpenAI: {'✅ Enabled' if OPENAI_ENABLED else '⚠️  Disabled'}")
print(f"Pinecone: {'✅ Enabled' if PINECONE_ENABLED else '⚠️  Disabled'}")
print("="*60)

## Section 1: Imports and Setup

Import all necessary libraries and initialize configuration.

In [None]:
# Core imports
import logging
from typing import Dict, List, Any

# Import our package components
from src.l3_m7_financial_data_ingestion_compliance import (
    EDGARDownloader,
    SECFilingParser,
    XBRLParser,
    FinancialDocumentChunker,
    chunk_filing,
    extract_sections,
    parse_xbrl_data
)

# Import configuration
from config import (
    EDGAR_ENABLED,
    SEC_USER_AGENT,
    CHUNK_SIZE,
    CHUNK_OVERLAP,
    get_config,
    get_edgar_client,
    get_openai_client,
    get_pinecone_client
)

# Configure logging for notebook
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

print("✅ Imports successful")
print(f"   EDGAR: {EDGAR_ENABLED}")
print(f"   Chunk size: {CHUNK_SIZE}")
print(f"   Chunk overlap: {CHUNK_OVERLAP}")

## Section 2: Initialize Components

Create instances of our main classes.

In [None]:
# Get configuration
config = get_config()
print("Configuration loaded:")
for key, value in config.items():
    if 'key' not in key.lower():  # Don't print API keys
        print(f"  {key}: {value}")

# Initialize EDGAR downloader (if enabled)
downloader = None
if EDGAR_ENABLED and SEC_USER_AGENT:
    try:
        downloader = EDGARDownloader(SEC_USER_AGENT)
        print("\n✅ EDGAR downloader initialized")
    except ValueError as e:
        print(f"\n⚠️  EDGAR downloader initialization failed: {e}")
else:
    print("\n⚠️  EDGAR downloader not initialized (service disabled or no User-Agent)")

# Initialize parser and chunker (always available)
parser = SECFilingParser('10-K')
xbrl_parser = XBRLParser()
chunker = FinancialDocumentChunker(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print("✅ Parser and chunker initialized")
print(f"   Filing type: 10-K")
print(f"   Chunk size: {CHUNK_SIZE} chars")
print(f"   Overlap: {CHUNK_OVERLAP} chars")

## Section 3: Download SEC Filing

Download a sample 10-K filing from SEC EDGAR.

**Note:** If EDGAR is disabled, this will use mock data.
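
The downloader used below is described as rate limited, but its internals are not shown in this notebook. As a rough illustration of what such throttling can look like, here is a minimal sketch that spaces calls so they never exceed a fixed rate. `RateLimiter` and `build_headers` are hypothetical names for this example and are not part of `EDGARDownloader`; the injectable `clock` and `sleep` arguments exist only to make the class easy to test.

```python
import time
from typing import Callable, Dict, Optional


class RateLimiter:
    """Spaces calls at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


def build_headers(user_agent: str) -> Dict[str, str]:
    """Build request headers identifying the caller (company name + email)."""
    if not user_agent or "@" not in user_agent:
        raise ValueError("User-Agent should include a contact email")
    return {"User-Agent": user_agent}
```

Calling `limiter.wait()` immediately before each HTTP request keeps a single-threaded client under the configured rate; a multi-threaded client would additionally need a lock around `wait()`.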

In [None]:
# Download Microsoft 10-K
ticker = 'MSFT'
filing_type = '10-K'
fiscal_year = 2023

print(f"Downloading {filing_type} for {ticker} (FY {fiscal_year})...")

if downloader:
    # Real download from EDGAR
    filing = downloader.download_filing(ticker, filing_type, fiscal_year)
    print("✅ Downloaded from SEC EDGAR")
else:
    # Mock data: downloader is None in this branch, so use the chunker's mock HTML
    filing = {
        'html': chunker._get_mock_html(ticker, filing_type),
        'ticker': ticker,
        'filing_type': filing_type,
        'filing_date': '2023-07-27',
        'accession_number': 'mock-0000789019-23-000090'
    }
    print("⚠️  Using mock data (EDGAR disabled)")

print(f"\nFiling metadata:")
print(f"  Ticker: {filing['ticker']}")
print(f"  Type: {filing['filing_type']}")
print(f"  Date: {filing['filing_date']}")
print(f"  Accession: {filing['accession_number']}")
print(f"  HTML size: {len(filing['html'])} characters")

# Expected: Filing metadata with ticker, type, date, and HTML content

## Section 4: Extract Regulatory Sections

Parse the HTML filing and extract regulatory sections (Item 1, 1A, 7, 8).

**Key Point:** Section boundaries are preserved for SOX Section 404 compliance.
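
To make the idea of boundary-preserving extraction concrete, here is a minimal sketch that splits already-extracted plain text on "Item N." headings. It is not how `SECFilingParser` works internally (that code is not shown here); `split_items` and `ITEM_HEADING` are hypothetical names, and the heading pattern is an assumption that real filings will not always satisfy.

```python
import re
from typing import Dict, Iterable

# Matches headings such as "Item 1.", "ITEM 1A:", "Item 7 -" at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:\-]", re.IGNORECASE | re.MULTILINE
)


def split_items(text: str, wanted: Iterable[str] = ("1", "1A", "7", "8")) -> Dict[str, str]:
    """Return {"Item 1": ..., "Item 1A": ...} for the requested items.

    Every item heading ends the previous section, including items that are
    not requested, so a returned section never contains a neighbour's text.
    """
    wanted_set = {w.upper() for w in wanted}
    matches = list(ITEM_HEADING.finditer(text))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        item = match.group(1).upper()
        if item not in wanted_set:
            continue
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.start():end].strip()
        name = f"Item {item}"
        # A table of contents repeats the headings; keep the longest body.
        if len(body) > len(sections.get(name, "")):
            sections[name] = body
    return sections
```

Matching every item heading, and only then filtering to the requested ones, is what stops Item 1A from absorbing Items 1B through 6.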

In [None]:
# Extract sections from HTML
print("Extracting regulatory sections...")
sections = parser.extract_sections(filing['html'])

print(f"\n✅ Extracted {len(sections)} sections:")
for section_name, section_content in sections.items():
    print(f"\n{section_name}:")
    print(f"  Length: {len(section_content)} characters")
    print(f"  Preview: {section_content[:150]}...")

# Expected: Dictionary with Item 1, Item 1A, Item 7, Item 8 sections

## Section 5: Parse XBRL Financial Data

Extract structured financial data from Item 8 (Financial Statements).

**XBRL Tags:** Standardized tags like `us-gaap:Assets`, `us-gaap:Revenues`, etc.
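
In filings that use Inline XBRL, numeric facts are embedded in the HTML as `ix:nonFraction` elements whose `name` attribute carries the tag. The sketch below shows one way to pull those facts out with regular expressions. It is only an illustration: `parse_inline_facts` is a hypothetical helper, not the `XBRLParser` method used in the next cell, and the sample values in the usage note are invented.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Dict

FACT = re.compile(
    r"<ix:nonFraction\b([^>]*)>(.*?)</ix:nonFraction>", re.IGNORECASE | re.DOTALL
)
ATTR = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def parse_inline_facts(html: str) -> Dict[str, Decimal]:
    """Map tag name (e.g. "us-gaap:Assets") to its scaled numeric value."""
    facts: Dict[str, Decimal] = {}
    for attr_text, raw in FACT.findall(html):
        attrs = {key.lower(): value for key, value in ATTR.findall(attr_text)}
        name = attrs.get("name")
        # Drop nested markup, thousands separators, whitespace, and parentheses.
        digits = re.sub(r"<[^>]+>|[,\s()]", "", raw)
        if not name or not digits:
            continue
        try:
            value = Decimal(digits)
        except InvalidOperation:
            continue  # non-numeric content such as a dash placeholder
        # scale="6" means the displayed number is in millions.
        value *= Decimal(10) ** int(attrs.get("scale", "0"))
        if attrs.get("sign") == "-":
            value = -value
        # Keep the first occurrence; later ones are usually other periods.
        facts.setdefault(name, value)
    return facts
```

For example, `<ix:nonFraction name="us-gaap:Assets" scale="6" ...>1,000</ix:nonFraction>` yields `Decimal("1000000000")` under `"us-gaap:Assets"`. A production parser would also read `contextRef` to tell fiscal periods apart instead of keeping only the first fact per tag.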

In [None]:
# Parse XBRL data from Item 8
if 'Item 8' in sections:
    print("Parsing XBRL financial data from Item 8...")
    xbrl_data = xbrl_parser.parse_xbrl_from_html(sections['Item 8'])

    print("\n✅ XBRL data extracted:")
    print(f"\nBalance Sheet:")
    for tag, value in xbrl_data['balance_sheet'].items():
        print(f"  {tag}: {value}")

    print(f"\nIncome Statement:")
    for tag, value in xbrl_data['income_statement'].items():
        print(f"  {tag}: {value}")

    print(f"\nFiscal Period: {xbrl_data['fiscal_period']}")
else:
    print("⚠️  Item 8 not found in sections")
    xbrl_data = {}

# Expected: Structured financial data with balance sheet and income statement

## Section 6: Create Compliance-Aware Chunks

Chunk the filing while preserving:
1. Regulatory section boundaries (no cross-section chunks)
2. Financial table integrity (tables not split)
3. Context through overlap (`CHUNK_OVERLAP`, e.g. 200 chars shared by adjacent chunks)
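
The first and third rules can be sketched in a few lines: slide a fixed-size window over each section separately, stepping by `chunk_size - overlap`. This is a simplified stand-in, assuming character-based sizes as printed by this notebook; `chunk_text` and `chunk_sections` are hypothetical names and do not reflect how `FinancialDocumentChunker` is written.

```python
from typing import Any, Dict, List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size chars sharing `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def chunk_sections(
    sections: Dict[str, str], chunk_size: int = 1000, overlap: int = 200
) -> List[Dict[str, Any]]:
    """Chunk each section on its own so no chunk spans two sections."""
    out: List[Dict[str, Any]] = []
    for name, content in sections.items():
        for index, piece in enumerate(chunk_text(content, chunk_size, overlap)):
            out.append({"text": piece, "metadata": {"section": name, "chunk_index": index}})
    return out
```

Because the window restarts at position 0 for every section, overlap only ever repeats text from the same section; the default sizes here are illustrative, not the package's configuration.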

In [None]:
# Chunk the filing
print(f"Creating compliance-aware chunks...")
print(f"  Chunk size: {CHUNK_SIZE} chars")
print(f"  Overlap: {CHUNK_OVERLAP} chars")

user_agent = SEC_USER_AGENT if EDGAR_ENABLED else None
chunks = chunker.chunk_filing(ticker, filing_type, fiscal_year, user_agent)

print(f"\n✅ Created {len(chunks)} chunks")

# Show sample chunks
print("\nSample chunks:")
for i, chunk in enumerate(chunks[:3]):  # Show first 3 chunks
    print(f"\nChunk #{i+1}:")
    print(f"  Section: {chunk['metadata']['section']}")
    print(f"  Length: {len(chunk['text'])} chars")
    print(f"  Hash: {chunk['metadata']['chunk_hash']}")
    print(f"  Text preview: {chunk['text'][:100]}...")

# Expected: List of chunks with metadata (section, ticker, fiscal_period, hash)

## Section 7: Examine Chunk Metadata

Each chunk includes complete metadata for audit trails and temporal queries.
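
As a rough picture of how such metadata supports both uses, the sketch below attaches a SHA-256 content hash to each chunk and then filters chunks by ticker and fiscal year. The field name `fiscal_year` and the helpers `sha256_hex`, `make_chunk`, and `filter_chunks` are assumptions made for this example; the package's own metadata uses fields such as `fiscal_period` whose exact format is not shown here.

```python
import hashlib
from datetime import datetime, timezone
from typing import Any, Container, Dict, List


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_chunk(text: str, ticker: str, section: str, fiscal_year: int) -> Dict[str, Any]:
    """Bundle chunk text with the metadata needed to audit and filter it."""
    return {
        "text": text,
        "metadata": {
            "ticker": ticker,
            "section": section,
            "fiscal_year": fiscal_year,  # hypothetical field for this sketch
            "chunk_hash": sha256_hex(text),
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
    }


def filter_chunks(
    chunks: List[Dict[str, Any]], ticker: str, fiscal_years: Container[int]
) -> List[Dict[str, Any]]:
    """Temporal query: keep chunks for one ticker within the given years."""
    return [
        chunk
        for chunk in chunks
        if chunk["metadata"]["ticker"] == ticker
        and chunk["metadata"]["fiscal_year"] in fiscal_years
    ]
```

Passing `range(2022, 2024)` as `fiscal_years` selects fiscal years 2022 and 2023; the stored hash lets a later audit confirm the text has not changed since `created_at`.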

In [None]:
# Examine metadata structure
sample_chunk = chunks[0]

print("Chunk metadata structure:")
print(f"\nText field:")
print(f"  Type: {type(sample_chunk['text'])}")
print(f"  Length: {len(sample_chunk['text'])} chars")

print(f"\nMetadata fields:")
for key, value in sample_chunk['metadata'].items():
    print(f"  {key}: {value}")

# Count chunks by section
section_counts = {}
for chunk in chunks:
    section = chunk['metadata']['section']
    section_counts[section] = section_counts.get(section, 0) + 1

print(f"\nChunks per section:")
for section, count in section_counts.items():
    print(f"  {section}: {count} chunks")

# Expected: Complete metadata with ticker, section, fiscal_period, hash, etc.

## Section 8: Financial Statement Chunks

Examine how financial statements (Item 8) are chunked.

**Key Point:** Tables are treated as atomic units (not split).
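
One way to picture "atomic" tables: separate the section into table and narrative blocks first, then apply the size limit only to narrative. The sketch below does this with a regular expression, which is an assumption that breaks on nested tables; `split_blocks` and `chunk_blocks` are hypothetical helpers, not the chunker's actual methods.

```python
import re
from typing import Any, Dict, List, Tuple

TABLE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def split_blocks(html: str) -> List[Tuple[str, str]]:
    """Return ("table" | "narrative", text) blocks in document order."""
    blocks: List[Tuple[str, str]] = []
    pos = 0
    for match in TABLE.finditer(html):
        before = html[pos:match.start()].strip()
        if before:
            blocks.append(("narrative", before))
        blocks.append(("table", match.group(0)))
        pos = match.end()
    tail = html[pos:].strip()
    if tail:
        blocks.append(("narrative", tail))
    return blocks


def chunk_blocks(blocks: List[Tuple[str, str]], chunk_size: int) -> List[Dict[str, Any]]:
    """Emit each table as one chunk, even if oversized; slice narrative text."""
    chunks: List[Dict[str, Any]] = []
    for kind, text in blocks:
        if kind == "table":
            chunks.append({"text": text, "is_table": True})
        else:
            for start in range(0, len(text), chunk_size):
                chunks.append({"text": text[start:start + chunk_size], "is_table": False})
    return chunks
```

The trade-off is that a table chunk may exceed the configured chunk size, so downstream embedding limits need to allow for large tables.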

In [None]:
# Find financial statement chunks
financial_chunks = [c for c in chunks if 'Item 8' in c['metadata']['section']]

print(f"Financial statement chunks: {len(financial_chunks)}")

for i, chunk in enumerate(financial_chunks[:2]):  # Show first 2
    print(f"\nFinancial Chunk #{i+1}:")
    print(f"  Table type: {chunk['metadata'].get('table_type', 'narrative')}")
    print(f"  Sensitivity: {chunk['metadata']['sensitivity']}")
    print(f"  Fiscal period: {chunk['metadata'].get('fiscal_period', 'N/A')}")
    print(f"  Length: {len(chunk['text'])} chars")
    print(f"  Preview: {chunk['text'][:200]}...")

# Expected: Financial chunks with table_type metadata (balance_sheet, income_statement)

## Section 9: Verify Compliance Requirements

Verify that chunks meet SOX Section 404 requirements:
1. Section boundaries preserved (no cross-section chunks)
2. Tables not split (each table = one chunk)
3. Audit trail (SHA-256 hashes)
4. Complete metadata
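
Checks like these are easier to trust when they return findings instead of printing a fixed success message. The following sketch collects problems per chunk and recomputes each hash. It assumes `chunk_hash` is the hex SHA-256 of the UTF-8 chunk text, which may not match how the package computes it; `verify_chunks` is a hypothetical helper.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def verify_chunks(
    chunks: List[Dict[str, Any]],
    required_fields: Iterable[str],
    allowed_sections: Iterable[str],
) -> List[str]:
    """Return a list of human-readable issues; an empty list means all passed."""
    required = list(required_fields)
    allowed = set(allowed_sections)
    issues: List[str] = []
    for index, chunk in enumerate(chunks):
        meta = chunk.get("metadata", {})
        missing = [field for field in required if field not in meta]
        if missing:
            issues.append(f"chunk {index}: missing metadata {missing}")
        if meta.get("section") not in allowed:
            issues.append(f"chunk {index}: unexpected section {meta.get('section')!r}")
        # Assumption: the stored hash covers exactly the chunk text.
        expected = hashlib.sha256(chunk.get("text", "").encode("utf-8")).hexdigest()
        if meta.get("chunk_hash") != expected:
            issues.append(f"chunk {index}: hash mismatch")
    return issues
```

A caller can then print a success banner only when the returned list is empty, and log each issue otherwise.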

In [None]:
print("Compliance Verification:\n")

# 1. Check section boundary preservation
sections_found = set(c['metadata']['section'] for c in chunks)
print(f"✅ Section boundary preservation:")
print(f"   Sections found: {len(sections_found)}")
print(f"   Sections: {', '.join(sorted(sections_found))}")

# 2. Check all chunks have hashes
chunks_with_hash = sum(1 for c in chunks if c['metadata'].get('chunk_hash'))
hashes_ok = bool(chunks) and chunks_with_hash == len(chunks)
print(f"\n{'✅' if hashes_ok else '⚠️ '} Audit trail (hashes):")
print(f"   Chunks with hash: {chunks_with_hash}/{len(chunks)}")
if chunks:
    print(f"   Sample hash: {chunks[0]['metadata'].get('chunk_hash', 'N/A')}")

# 3. Check required metadata fields
required_fields = ['ticker', 'filing_type', 'section', 'filing_date', 
                   'accession_number', 'sensitivity', 'chunk_hash', 'created_at']
missing_fields = set()
for chunk in chunks:
    for field in required_fields:
        if field not in chunk['metadata']:
            missing_fields.add(field)

if not missing_fields:
    print("\n✅ Metadata completeness:")
    print(f"   All {len(required_fields)} required fields present in all chunks")
else:
    print(f"\n⚠️  Missing metadata fields: {missing_fields}")

# 4. Count financial table chunks (the chunker is responsible for not splitting them)
table_chunks = [c for c in chunks if c['metadata'].get('table_type')]
print("\n✅ Financial table integrity:")
print(f"   Financial table chunks: {len(table_chunks)}")
print("   Tables treated as atomic units (not split)")

# Report success only when the hash and metadata checks actually passed
print("\n" + "="*60)
if hashes_ok and not missing_fields:
    print("SOX Section 404 compliance verified ✅")
else:
    print("SOX Section 404 compliance checks FAILED ⚠️")
print("="*60)

## Section 10: Optional - Vector Database Integration

If OpenAI and Pinecone are enabled, this section demonstrates:
1. Generating embeddings for chunks
2. Storing in Pinecone vector database
3. Querying with semantic search
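
Two details are worth handling before upserting: vector IDs should stay unique across filings, and metadata values should be of types the vector store accepts. Pinecone's documentation restricts metadata values to strings, numbers, booleans, and lists of strings. The helpers below are a hedged sketch of both steps; `make_vector_id` and `flatten_metadata` are hypothetical names, and they assume the metadata fields checked in Section 9 (`ticker`, `filing_type`, `accession_number`, `chunk_hash`) are present.

```python
from typing import Any, Dict


def make_vector_id(metadata: Dict[str, Any]) -> str:
    """Build an ID that differs between filings and between chunks."""
    parts = [
        metadata["ticker"],
        metadata["filing_type"],
        metadata["accession_number"],
        metadata["chunk_hash"][:12],  # short content fingerprint
    ]
    return "-".join(str(part) for part in parts)


def flatten_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce values to str/number/bool or list of str; drop None values."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif isinstance(value, (list, tuple)):
            flat[key] = [str(item) for item in value]
        else:
            flat[key] = str(value)  # e.g. nested dicts or datetimes
    return flat
```

Including the accession number in the ID keeps chunks from different fiscal years of the same company from overwriting each other, which an index-only ID would not guarantee.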

In [None]:
# Check if vector DB features are available
if OPENAI_ENABLED and PINECONE_ENABLED:
    print("Setting up vector database integration...\n")

    # Get clients
    openai_client = get_openai_client()
    pinecone_index = get_pinecone_client()

    if openai_client and pinecone_index:
        print("✅ OpenAI and Pinecone clients initialized")

        # Generate embeddings for first 5 chunks (demo)
        print("\nGenerating embeddings for first 5 chunks...")
        for i, chunk in enumerate(chunks[:5]):
            # Generate embedding
            response = openai_client.embeddings.create(
                model='text-embedding-3-small',
                input=chunk['text']
            )
            embedding = response.data[0].embedding

            # Upsert to Pinecone
            chunk_id = f"{chunk['metadata']['ticker']}-{chunk['metadata']['filing_type']}-chunk-{i}"
            pinecone_index.upsert(vectors=[{
                'id': chunk_id,
                'values': embedding,
                'metadata': chunk['metadata']
            }])

            print(f"  Chunk {i+1}: {chunk['metadata']['section'][:30]}... → Pinecone")

        print(f"\n✅ {len(chunks[:5])} chunks indexed in Pinecone")
    else:
        print("⚠️  Client initialization failed")
else:
    print("⚠️  Vector database features disabled")
    print("   To enable: Set OPENAI_ENABLED=true and PINECONE_ENABLED=true in .env")

# Expected: Embeddings generated and stored in Pinecone (if enabled)

## Section 11: Query Example

If the vector DB is enabled, this section demonstrates a semantic search query.
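
Semantic search becomes more precise when combined with the chunk metadata, for example by restricting results to one ticker and one section. The sketch below builds a metadata filter using the `$eq` and `$in` operators from Pinecone's filter syntax. `build_filter` is a hypothetical helper, and `fiscal_year` is an assumed metadata field that the chunks in this notebook may not carry.

```python
from typing import Any, Dict, Iterable, Optional


def build_filter(
    ticker: Optional[str] = None,
    sections: Optional[Iterable[str]] = None,
    fiscal_year: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a metadata filter dict; omitted arguments add no condition."""
    conditions: Dict[str, Any] = {}
    if ticker:
        conditions["ticker"] = {"$eq": ticker}
    if sections:
        # $in compares exact values, so pass section names as stored in metadata.
        conditions["section"] = {"$in": list(sections)}
    if fiscal_year is not None:
        conditions["fiscal_year"] = {"$eq": fiscal_year}  # assumed field
    return conditions
```

The result can be passed as the `filter` argument of the index's `query` call alongside `vector`, `top_k`, and `include_metadata`.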

In [None]:
if OPENAI_ENABLED and PINECONE_ENABLED:
    openai_client = get_openai_client()
    pinecone_index = get_pinecone_client()

    if openai_client and pinecone_index:
        # Example query
        query_text = "What are Microsoft's total assets?"
        print(f"Query: {query_text}\n")

        # Generate query embedding
        query_response = openai_client.embeddings.create(
            model='text-embedding-3-small',
            input=query_text
        )
        query_embedding = query_response.data[0].embedding

        # Search Pinecone
        results = pinecone_index.query(
            vector=query_embedding,
            top_k=3,
            include_metadata=True
        )

        print(f"Top {len(results['matches'])} results:\n")
        for i, match in enumerate(results['matches']):
            print(f"Result {i+1}:")
            print(f"  Score: {match['score']:.4f}")
            print(f"  Section: {match['metadata']['section']}")
            print(f"  Chunk ID: {match['id']}")
            print()
    else:
        print("⚠️  Clients not available")
else:
    print("⚠️  Vector search requires OpenAI and Pinecone")
    print("   This is an optional enhancement feature")

# Expected: Top 3 relevant chunks for the query

## Summary

In this notebook, you learned how to:

1. ✅ **Download SEC filings** from the EDGAR API while respecting its rate limit
2. ✅ **Extract regulatory sections** (Item 1, 1A, 7, 8) while preserving boundaries
3. ✅ **Parse XBRL financial data** from standardized tags
4. ✅ **Create compliance-aware chunks** that preserve SOX Section 404 requirements
5. ✅ **Generate audit trails** with SHA-256 hashes for each chunk
6. ✅ **Tag chunks with metadata** for temporal queries and compliance reporting

### Key Takeaways:

- **Section boundaries must be preserved** - Don't split regulatory sections mid-content
- **Financial tables are atomic** - Balance sheets, income statements must stay intact
- **Chunk overlap is critical** - Prevents context loss at chunk boundaries within a section (15-20% overlap)
- **Metadata enables compliance** - Fiscal periods, tickers, sections required for audit trails
- **Rate limiting is enforced** - SEC may block IPs that exceed 10 requests per second, so throttling is not optional

### Next Steps:

1. Process multiple companies' filings in batch
2. Implement cross-company comparative queries
3. Add material event monitoring (8-K filings)
4. Build audit-ready reporting system

See the **README.md** for the complete Decision Card and Common Failures reference.