# JSON Processor Wash Deep Dive

This notebook walks through the JSON Processor Wash module, which extracts chapter metadata from plumbing code documents and attaches it to every document in the chapter.

## Setup and Imports

In [None]:
import json
import logging
import os
import re
from typing import Any, Dict, List, Optional

# Add project root to Python path (machine-specific; adjust to your own checkout)
import sys
project_root = '/Users/aaronjpeters/PlumbingCodeAi/BuildingCodeai'
if project_root not in sys.path:
    sys.path.append(project_root)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Import our processor
from main.utils.json_processor_wash import extract_metadata, update_metadata, process_json_data

## 1. Understanding Metadata Extraction

The JSON Processor Wash module extracts metadata from plumbing code documents. Let's explore how it works with an example:

In [None]:
# Example document data
example_data = [
    {
        "file_path": "NYCP4ch_1pg.txt",
        "sections": [
            {
                "section": "401.1",
                "content": "GENERAL. This chapter shall govern the materials, design and installation of plumbing fixtures."
            },
            {
                "section": "401.2",
                "content": "PLUMBING FIXTURES\nAll plumbing fixtures shall meet the following requirements..."
            }
        ]
    }
]

# Extract metadata
metadata = extract_metadata(example_data, 4)
print("Extracted Metadata:")
print(json.dumps(metadata, indent=2))

### How Metadata Extraction Works

The `extract_metadata` function:
1. Looks for documents whose `file_path` matches the chapter pattern (e.g., `NYCP4ch_` for chapter 4)
2. Extracts the title from section X01.1, where X is the chapter number (e.g., 401.1 for chapter 4)
3. Finds the chapter title (usually in all caps)
4. Creates a metadata dictionary with:
   - Chapter number
   - Title
   - Chapter title
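
The steps above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the module's actual code: the function name `sketch_extract_metadata`, the returned key names, and the two heuristics (title is the text before the first period; chapter title is the first all-caps opening line) are assumptions.

```python
from typing import Any, Dict, List, Optional


def sketch_extract_metadata(data: List[Dict[str, Any]], chapter: int) -> Optional[Dict[str, Any]]:
    """Illustrative stand-in for extract_metadata (assumed behavior)."""
    prefix = f"NYCP{chapter}ch_"        # step 1: chapter file pattern
    first_section = f"{chapter}01.1"    # step 2: section X01.1
    title = None
    chapter_title = None
    matched = False

    for doc in data:
        if not doc.get("file_path", "").startswith(prefix):
            continue
        matched = True
        for section in doc.get("sections", []):
            content = section.get("content", "")
            # Assumption: the title is the text before the first period.
            if title is None and section.get("section") == first_section:
                title = content.split(".", 1)[0].strip()
            # Step 3 (assumption): the chapter title is the first opening
            # line written entirely in uppercase.
            first_line = content.split("\n", 1)[0].strip()
            if chapter_title is None and first_line.isupper():
                chapter_title = first_line

    if not matched:
        return None
    # Step 4: key names here are hypothetical.
    return {"chapter": chapter, "title": title, "chapter_title": chapter_title}
```

Whether the real `extract_metadata` returns `None`, an empty dictionary, or raises when no document matches is not shown in this notebook; the sketch picks `None` for simplicity.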

## 2. Updating Document Metadata

Once metadata is extracted, it needs to be added to all documents in the chapter:

In [None]:
# Update metadata for all documents
updated_data = update_metadata(example_data, 4)
print("Updated Document:")
print(json.dumps(updated_data[0], indent=2))

### How Metadata Updates Work

The `update_metadata` function:
1. Extracts metadata using `extract_metadata`
2. Updates all documents in the chapter with the metadata
3. Returns the updated document list
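
A minimal sketch of step 2, assuming the metadata is stored under a `metadata` key on each document (the key read back in the processing example later in this notebook). The name `sketch_apply_metadata` is hypothetical, and copying the input rather than mutating it is a choice made for this sketch; the real `update_metadata` may update in place.

```python
import copy
from typing import Any, Dict, List


def sketch_apply_metadata(
    data: List[Dict[str, Any]], chapter: int, metadata: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Attach `metadata` to every document belonging to `chapter`."""
    prefix = f"NYCP{chapter}ch_"
    updated = copy.deepcopy(data)  # leave the caller's list untouched
    for doc in updated:
        if doc.get("file_path", "").startswith(prefix):
            # Each document gets its own copy so later edits do not leak.
            doc["metadata"] = dict(metadata)
    return updated
```

Documents from other chapters pass through unchanged, so the function can be called once per chapter on the same list.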

## 3. Processing JSON Files

Let's see how to process an entire JSON file:

In [None]:
def demonstrate_json_processing(input_file: str, output_file: str):
    """Demonstrate JSON processing with example data."""
    try:
        # Process the JSON file
        process_json_data(input_file, output_file)
        
        # Load and display results
        with open(output_file, 'r', encoding='utf-8') as f:
            processed_data = json.load(f)
            
        # Show first document's metadata
        if processed_data:
            print("Processed Document Metadata:")
            print(json.dumps(processed_data[0].get('metadata', {}), indent=2))
            
    except Exception:
        # Log the full traceback instead of printing only the message
        logger.exception("Failed to process %s", input_file)

# Example usage:
# input_file = 'path/to/input.json'
# output_file = 'path/to/output.json'
# demonstrate_json_processing(input_file, output_file)

### How JSON Processing Works

The `process_json_data` function:
1. Loads the input JSON file
2. Finds all unique chapter numbers
3. Updates metadata for each chapter
4. Saves the processed data to the output file
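
Step 2 can be illustrated with a regular expression over each document's `file_path`. This is a sketch under the assumption that chapter numbers are read from the `NYCP{chapter}ch_` file-name prefix; `CHAPTER_RE` and `sketch_find_chapters` are hypothetical names, and the real module may detect chapters differently.

```python
import re
from typing import Any, Dict, List

# Assumed pattern: the chapter number is the run of digits between "NYCP" and "ch_".
CHAPTER_RE = re.compile(r"NYCP(\d+)ch_")


def sketch_find_chapters(data: List[Dict[str, Any]]) -> List[int]:
    """Return the sorted, de-duplicated chapter numbers found in `data`."""
    chapters = set()
    for doc in data:
        match = CHAPTER_RE.search(doc.get("file_path", ""))
        if match:
            chapters.add(int(match.group(1)))
    return sorted(chapters)
```

Documents whose `file_path` does not follow the naming pattern are skipped here rather than treated as errors.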

## Best Practices and Tips

1. **File Organization**:
   - Keep input and output directories separate
   - Use consistent file naming (`NYCP{chapter}ch_{page}pg.txt`)

2. **Error Handling**:
   - The processor includes its own error handling; also catch exceptions at the call site, as `demonstrate_json_processing` does
   - Check logs for processing status and errors

3. **Metadata Format**:
   - Chapter numbers are integers
   - Titles preserve original formatting
   - Chapter titles are typically in uppercase

4. **Performance**:
   - Process files in batches if dealing with large datasets
   - Use appropriate logging levels for different environments
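
The two performance tips can be sketched as small helpers. Both are illustrative assumptions rather than part of the module: `sketch_batches` splits a list of work items into fixed-size groups, and `sketch_log_level` reads a hypothetical `LOG_LEVEL` environment variable so that each environment can choose its own verbosity.

```python
import logging
import os
from typing import Any, Iterator, List, Sequence


def sketch_batches(items: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def sketch_log_level(default: int = logging.INFO) -> int:
    """Map the (hypothetical) LOG_LEVEL environment variable to a logging level."""
    name = os.environ.get("LOG_LEVEL", "").upper()
    level = getattr(logging, name, None) if name else None
    # Fall back to the default for unset or unrecognized names.
    return level if isinstance(level, int) else default
```

A batch could then be a list of `(input_file, output_file)` pairs passed one at a time to `process_json_data`, with `logging.basicConfig(level=sketch_log_level())` called once at startup.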