# JSON Processor Wash - Detailed Explanation

This notebook provides a detailed explanation of the JSON Processor Wash module, which is responsible for cleaning and standardizing JSON data in our plumbing code processing pipeline.

## Setup and Imports

In [None]:
import json
import ast
import os
import logging
from typing import Dict, List, Any, Optional

# Add project root to Python path (adjust this path to your local checkout)
import sys
project_root = '/Users/aaronjpeters/PlumbingCodeAi/BuildingCodeai'
if project_root not in sys.path:
    sys.path.append(project_root)

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Import our processor
from main.utils.json_processor_wash import JsonProcessorWash

## 1. Understanding AST Processing

The JSON Processor Wash uses Python's `ast` module to parse source code into an Abstract Syntax Tree (AST), which it then analyzes. Let's explore how AST parsing works:

In [None]:
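import textwrap

def demonstrate_ast_parsing():
    # Example code snippet; dedent it first, because ast.parse raises
    # IndentationError when the source starts with leading whitespace
    code = textwrap.dedent("""
    def example_function(x):
        return x * 2
    """)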
    
    # Parse with AST
    tree = ast.parse(code)
    
    # Print AST structure (the indent argument requires Python 3.9+)
    print(ast.dump(tree, indent=2))

# Uncomment to run
# demonstrate_ast_parsing()
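
The full dump can be hard to read for anything beyond a few lines. As a complementary sketch, the tree can be summarized by walking it with `ast.walk` and counting node types. The helper name `count_node_types` is an assumption made for this illustration and is not part of the processor.

```python
import ast
from collections import Counter


def count_node_types(source: str) -> Counter:
    """Count how many AST nodes of each type appear in the source."""
    tree = ast.parse(source)
    # ast.walk yields every node in the tree, in no guaranteed order
    return Counter(type(node).__name__ for node in ast.walk(tree))


source = "def example_function(x):\n    return x * 2\n"
counts = count_node_types(source)

# One function definition, one return statement, one binary operation
print(counts["FunctionDef"], counts["Return"], counts["BinOp"])
```

A summary like this is handy as a quick sanity check that a snippet parsed into the structure you expected.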

## 2. JSON Processor Wash Components

Let's break down the main components of our JSON Processor Wash:

In [None]:
# Initialize the processor
json_wash = JsonProcessorWash()

# Example input JSON
sample_json = {
    "metadata": {
        "file_path": "example.txt",
        "raw_text": "Sample text"
    },
    "sections": [
        {
            "id": "101.1",
            "title": "Scope",
            "content": "This section defines the scope."
        }
    ]
}

# Write the sample to a temporary file, process it, and always clean up
def process_sample():
    temp_file = 'temp_input.json'
    with open(temp_file, 'w') as f:
        json.dump(sample_json, f)

    try:
        # Process the file
        result = json_wash.process_file(temp_file)
    finally:
        # Remove the temporary file even if processing raises
        os.remove(temp_file)

    return result

# Uncomment to run
# result = process_sample()
# print(json.dumps(result, indent=2))
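
Writing to a fixed file name in the working directory can collide with other runs. One alternative, sketched here under assumptions, is to use `tempfile.TemporaryDirectory` so the file is removed automatically. The helper `process_via_temp_file` and the stand-in `echo_processor` are hypothetical; in practice the callable would be something like `json_wash.process_file`, whose exact behavior is not shown in this notebook.

```python
import json
import os
import tempfile


def process_via_temp_file(data, process_file):
    """Write data to a temporary JSON file and pass its path to process_file."""
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = os.path.join(temp_dir, "input.json")
        with open(temp_path, "w", encoding="utf-8") as f:
            json.dump(data, f)
        # The directory and file are deleted when the with-block exits,
        # even if process_file raises
        return process_file(temp_path)


def echo_processor(path):
    """Hypothetical stand-in processor: just loads the JSON back."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


sample = {"sections": [{"id": "101.1", "title": "Scope"}]}
round_tripped = process_via_temp_file(sample, echo_processor)
print(round_tripped == sample)
```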

### 2.1 AST Node Visitors

The processor uses AST node visitors to analyze code structure. Here's how they work:

In [None]:
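import textwrap

def demonstrate_node_visitor():
    class SimpleVisitor(ast.NodeVisitor):
        # Called once for every function definition in the tree
        def visit_FunctionDef(self, node):
            print(f"Found function: {node.name}")
            # Keep descending so nested functions are visited too
            self.generic_visit(node)
    
    # Example code; dedent it first, because ast.parse raises
    # IndentationError when the source starts with leading whitespace
    code = textwrap.dedent("""
    def process_data(data):
        return data.strip()
    
    def validate_input(input_str):
        return bool(input_str)
    """)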
    
    # Parse and visit
    tree = ast.parse(code)
    visitor = SimpleVisitor()
    visitor.visit(tree)

# Uncomment to run
# demonstrate_node_visitor()
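
Printing from inside a visitor is fine for a demo, but collecting results on the visitor instance is usually easier to reuse and test. The following is a minimal sketch of that pattern; the class name `FunctionCollector` is an assumption for illustration and does not describe the processor's actual visitors.

```python
import ast


class FunctionCollector(ast.NodeVisitor):
    """Collect (function name, positional argument names) pairs."""

    def __init__(self):
        self.functions = []

    def visit_FunctionDef(self, node):
        arg_names = [arg.arg for arg in node.args.args]
        self.functions.append((node.name, arg_names))
        # Continue into the body so nested definitions are also recorded
        self.generic_visit(node)


source = (
    "def process_data(data):\n"
    "    return data.strip()\n"
    "\n"
    "def validate_input(input_str):\n"
    "    return bool(input_str)\n"
)

collector = FunctionCollector()
collector.visit(ast.parse(source))
print(collector.functions)
```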

## 3. Data Cleaning Process

Let's examine how the processor cleans and standardizes data:

In [None]:
def demonstrate_cleaning():
    # Example raw data
    raw_data = {
        "sections": [
            {
                "id": "101.1",
                "content": "\n  This is some   poorly  formatted   content  \n",
                "references": ["102.1", "103.2"]
            }
        ]
    }
    
    # Clean the data
    def clean_text(text):
        # Remove extra whitespace
        return " ".join(text.split())
    
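    # Process sections, building new dicts so raw_data is left unmodified
    # (dict.copy() is shallow and would share the nested section dicts)
    cleaned_data = {
        **raw_data,
        "sections": [
            {**section, "content": clean_text(section["content"])}
            for section in raw_data["sections"]
        ],
    }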
    
    return cleaned_data

# Uncomment to run
# cleaned = demonstrate_cleaning()
# print(json.dumps(cleaned, indent=2))
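
The same whitespace normalization can be applied to every string in a nested JSON structure, not only to `content`. The recursive helper below is a sketch under that assumption; `clean_strings` is a hypothetical name, and whether the processor cleans fields other than `content` is not stated in this notebook.

```python
def clean_strings(value):
    """Return a copy of a JSON-like value with whitespace collapsed in all strings."""
    if isinstance(value, str):
        return " ".join(value.split())
    if isinstance(value, list):
        return [clean_strings(item) for item in value]
    if isinstance(value, dict):
        return {key: clean_strings(item) for key, item in value.items()}
    # Numbers, booleans, and None pass through unchanged
    return value


raw = {
    "sections": [
        {"id": "101.1", "title": "  Scope ", "content": "\n  Poorly   formatted  \n"}
    ]
}
cleaned = clean_strings(raw)
print(cleaned["sections"][0])
```

Because the helper builds new lists and dicts at every level, the input structure is never mutated.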

## 4. Reference Resolution

The processor handles references between sections:

In [None]:
def demonstrate_reference_resolution():
    # Example data with references
    data = {
        "sections": [
            {
                "id": "101.1",
                "content": "See Section 101.2 for details",
                "references": []
            },
            {
                "id": "101.2",
                "content": "This is the referenced section",
                "references": []
            }
        ]
    }
    
    # Find references
    import re  # local import keeps this cell self-contained

    def find_references(content):
        # Capture dotted section numbers that follow the word "Section",
        # e.g. "See Section 101.2 for details" -> ["101.2"]
        return re.findall(r"Section\s+(\d+(?:\.\d+)*)", content)
    
    # Process sections
    for section in data["sections"]:
        section["references"] = find_references(section["content"])
    
    return data

# Uncomment to run
# resolved = demonstrate_reference_resolution()
# print(json.dumps(resolved, indent=2))
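
Finding references is only half of resolution; the other half is checking that each referenced ID actually exists in the document. The sketch below illustrates one way to do that. The pattern `SECTION_REF`, the helper `resolve_references`, and the report layout are all assumptions made for this example, not the processor's actual behavior.

```python
import re

# Assumed reference format: the word "Section" followed by a dotted number
SECTION_REF = re.compile(r"Section\s+(\d+(?:\.\d+)*)")


def resolve_references(sections):
    """Split each section's references into resolved and unresolved IDs."""
    known_ids = {section["id"] for section in sections}
    report = {}
    for section in sections:
        found = SECTION_REF.findall(section["content"])
        report[section["id"]] = {
            "resolved": [ref for ref in found if ref in known_ids],
            "unresolved": [ref for ref in found if ref not in known_ids],
        }
    return report


sections = [
    {"id": "101.1", "content": "See Section 101.2 and Section 999.9."},
    {"id": "101.2", "content": "This is the referenced section"},
]
report = resolve_references(sections)
print(report["101.1"])
```

Keeping unresolved references in the report, rather than silently dropping them, makes dangling cross-references easy to spot in the logs.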

## 5. Complete Processing Example

Let's put it all together with a complete example:

In [None]:
def process_complete_example():
    # Sample input
    input_data = {
        "metadata": {
            "file_path": "example.txt",
            "raw_text": "Sample text"
        },
        "sections": [
            {
                "id": "101.1",
                "title": "Scope",
                "content": "\n  This section defines the scope. See Section 101.2  \n"
            },
            {
                "id": "101.2",
                "title": "References",
                "content": "Referenced section content"
            }
        ]
    }
    
    # Save to file
    temp_file = 'complete_example.json'
    with open(temp_file, 'w') as f:
        json.dump(input_data, f)
    
    try:
        # Process with JsonProcessorWash
        processor = JsonProcessorWash()
        result = processor.process_file(temp_file)
    finally:
        # Remove the temporary file even if processing raises
        os.remove(temp_file)
    
    return result

# Uncomment to run
# final_result = process_complete_example()
# print(json.dumps(final_result, indent=2))

## 6. Best Practices and Tips

1. **Input Validation**
   - Always validate JSON structure before processing
   - Check for required fields
   - Verify section IDs are properly formatted

2. **Error Handling**
   - Handle malformed JSON gracefully
   - Log processing errors
   - Provide meaningful error messages

3. **Performance**
   - Use efficient data structures
   - Minimize file I/O operations
   - Consider batch processing for large files

4. **Maintenance**
   - Keep regular backups of processed files
   - Monitor processing logs
   - Update reference patterns as needed

5. **Testing**
   - Create unit tests for each component
   - Test with various input formats
   - Verify reference resolution accuracy
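
The input validation and error handling points above can be sketched as follows. The required fields (`id`, `content`), the section ID pattern, and the function names `validate_document` and `load_json_text` are assumptions chosen for illustration; the processor's real schema may differ.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Assumed ID format: dotted numbers such as "101.1"
SECTION_ID = re.compile(r"^\d+(\.\d+)*$")


def validate_document(data):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    sections = data.get("sections")
    if not isinstance(sections, list):
        return ["missing or non-list 'sections' field"]

    errors = []
    for index, section in enumerate(sections):
        if not isinstance(section, dict):
            errors.append(f"section {index}: must be an object")
            continue
        for field in ("id", "content"):
            if field not in section:
                errors.append(f"section {index}: missing '{field}'")
        section_id = section.get("id")
        if isinstance(section_id, str) and not SECTION_ID.match(section_id):
            errors.append(f"section {index}: malformed id {section_id!r}")
    return errors


def load_json_text(text):
    """Parse JSON text, logging a precise location on failure instead of raising."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        logger.error("Malformed JSON at line %d, column %d: %s",
                     exc.lineno, exc.colno, exc.msg)
        return None


good = {"sections": [{"id": "101.1", "content": "Scope"}]}
bad = {"sections": [{"id": "abc"}]}
print(validate_document(good))
print(validate_document(bad))
print(load_json_text("{not valid json"))
```

Returning a list of problems, rather than raising on the first one, lets a batch run report every issue in a file in a single pass.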