# JSON Processing Script Explanation

This notebook explains how the `process_json_wash.py` script works. The script converts OCR text files of the NYC Plumbing Code, together with their associated table files, into a compact JSON format with short keys and one file per chapter.

## Overview

The script performs these main tasks:
1. Reads the OCR text files
2. Attaches each page's table text and analytics image path, when they exist
3. Groups pages by chapter and splits their text into sections
4. Writes a compact JSON structure for each chapter

Let's go through each part in detail.

## 1. Script Setup and Configuration

First, we import required modules and set up paths:

In [None]:
import json
import logging
import os
import re
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

# Module-level logger used by the functions below
logger = logging.getLogger(__name__)

# Define paths
BASE_DIR = Path("/Users/aaronjpeters/PlumbingCodeAi/BuildingCodeai")
MEDIA_ROOT = BASE_DIR / "media"
PLUMBING_CODE_DIR = MEDIA_ROOT / "plumbing_code"

# Directory structure
PLUMBING_CODE_DIRS = {
    "ocr": PLUMBING_CODE_DIR / "OCR",        # OCR text files
    "json": PLUMBING_CODE_DIR / "json",      # Initial JSON files
    "json_processed": PLUMBING_CODE_DIR / "json_processed",  # Final JSON files
    "tables": PLUMBING_CODE_DIR / "tables",  # Table data files
    "analytics": PLUMBING_CODE_DIR / "analytics",  # Analytics images
    "optimized": PLUMBING_CODE_DIR / "optimized"   # Optimized images
}
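
Because every path hangs off one hard-coded `BASE_DIR`, a quick check that the expected folders exist can save a confusing failure later. The following is a minimal sketch and not part of `process_json_wash.py`; the helper name `check_dirs` and its `required`/`create` parameters are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List


def check_dirs(dirs: Dict[str, Path], required: List[str], create: List[str]) -> List[str]:
    """Create output directories and return the keys of missing input ones.

    Hypothetical helper: `required` and `create` are lists of keys into `dirs`.
    """
    for key in create:
        # parents=True builds intermediate folders; exist_ok makes reruns safe
        dirs[key].mkdir(parents=True, exist_ok=True)
    # Report required directories that are absent instead of raising
    return [key for key in required if not dirs[key].is_dir()]
```

With the configuration above, a call such as `check_dirs(PLUMBING_CODE_DIRS, required=["ocr", "tables"], create=["json_processed"])` would create the output folder and return the keys of any input folders that are missing.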

## 2. Reading Table Data

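The `read_table_data` function reads a table's raw text from a file and returns it together with the file path. It does not parse the table, and it returns `None` if the file is missing or cannot be read: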

In [None]:
def read_table_data(table_path: str) -> Optional[Dict]:
    """Read and parse table data from file."""
    try:
        if not os.path.exists(table_path):
            return None
            
        with open(table_path, 'r', encoding='utf-8') as f:
            table_content = f.read()
            
        return {
            'table_content': table_content,
            'table_path': table_path,
        }
    except Exception as e:
        logger.error(f"Error reading table data from {table_path}: {str(e)}")
        return None
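
To illustrate the return contract (a dictionary on success, `None` otherwise), here is a self-contained sketch of a `pathlib`-based variant run against a temporary directory. The name `read_table_text`, the file name, and the table text are made up for the example. Unlike the original, this variant catches only I/O and decoding errors rather than every `Exception`.

```python
import logging
import tempfile
from pathlib import Path
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def read_table_text(table_path: Path) -> Optional[Dict[str, str]]:
    """Pathlib-based variant of read_table_data (hypothetical name)."""
    try:
        content = table_path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return None  # a missing table is an expected case, so nothing is logged
    except (OSError, UnicodeDecodeError) as exc:
        logger.error("Error reading table data from %s: %s", table_path, exc)
        return None
    return {"table_content": content, "table_path": str(table_path)}


with tempfile.TemporaryDirectory() as tmp:
    table_file = Path(tmp) / "NYCP3CH_2pg.txt"  # made-up file name
    table_file.write_text("Pipe | Spacing\n", encoding="utf-8")  # made-up content
    found = read_table_text(table_file)
    missing = read_table_text(Path(tmp) / "absent.txt")
```

Here `found` holds the raw text and the path, while `missing` is `None`, so callers need to check for `None` before indexing the result.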

## 3. Processing Individual Files

The `process_file` function handles a single OCR text file and its associated data:

In [None]:
def process_file(text_path: str) -> Dict:
    """Process a single text file and its associated table data."""
    # Read OCR text
    with open(text_path, 'r', encoding='utf-8') as f:
        text_content = f.read()
        
    # Extract page number from filename
    filename = Path(text_path).stem
    pg_num = int(''.join(filter(str.isdigit, filename.split('_')[-1])))
    
    # Check for associated files
    table_file = PLUMBING_CODE_DIRS["tables"] / f"{filename}.txt"
    analytics_file = PLUMBING_CODE_DIRS["analytics"] / f"{filename}.png"
    
    # Create file entry
    file_entry = {
        "i": pg_num,              # Page index
        "p": str(text_path),      # Path to text file
        "o": str(PLUMBING_CODE_DIRS["optimized"] / f"{filename}.jpg"),  # Optimized image
        "pg": pg_num,             # Page number
        "t": text_content         # Text content
    }
    
    # Add table data if the file exists and could be read
    if table_file.exists():
        table_data = read_table_data(str(table_file))
        if table_data is not None:  # read_table_data returns None on failure
            file_entry["tb"] = table_data["table_content"]
            file_entry["tb_data"] = str(table_file)
        
    # Add analytics image if exists
    if analytics_file.exists():
        file_entry["tb_img"] = str(analytics_file)
        
    return file_entry
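
The page-number expression in `process_file` raises `ValueError` when the last underscore-separated part of the file name contains no digits. A minimal sketch of a more forgiving version follows; the helper name `extract_page_number` and the sample file stems are assumptions, since the actual naming scheme is not shown here.

```python
import re
from typing import Optional


def extract_page_number(stem: str) -> Optional[int]:
    """Return the digits in the last underscore-separated part of a file stem.

    Follows the same idea as the expression in process_file, but returns None
    instead of raising ValueError when that part contains no digits.
    """
    last_part = stem.split("_")[-1]
    digits = "".join(re.findall(r"\d", last_part))
    return int(digits) if digits else None


page_a = extract_page_number("NYCP3CH_2pg")    # hypothetical stem
page_b = extract_page_number("NYCP3CH_cover")  # no digits after the underscore
page_c = extract_page_number("NYCP3CH")        # no underscore at all
```

One caveat the last call exposes: when the stem has no underscore, the whole stem is the "last part", so the chapter digit would be picked up as the page number.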

## 4. Processing the Directory

The `process_directory` function processes every text file in the OCR directory, groups the results by chapter, and splits each page's text into sections:

In [None]:
def process_directory(base_dir: str) -> Dict[str, Dict]:
    """Process all text files in the OCR directory."""
    processed_data = {}
    
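    # Process each text file in a stable order (glob order is not guaranteed).
    # Note: base_dir is currently unused; the OCR directory comes from PLUMBING_CODE_DIRS.
    for file_path in sorted(PLUMBING_CODE_DIRS["ocr"].glob("*.txt")):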
        # Extract chapter number (e.g., NYCP1CH -> 1)
        chapter_match = re.search(r'NYCP(\d+)CH', file_path.stem, re.IGNORECASE)
        if not chapter_match:
            continue
            
        chapter_num = chapter_match.group(1)
        chapter_key = f"NYCP{chapter_num}CH_"
        
        # Initialize chapter data
        if chapter_key not in processed_data:
            processed_data[chapter_key] = {
                "m": {                # Metadata
                    "c": chapter_num,  # Chapter number
                    "t": "NYCPC",     # Title
                    "ct": ""          # Chapter title
                },
                "f": [],             # Files
                "r": [],             # Raw text
                "s": [],             # Sections
                "tb": []             # Tables
            }
        
        # Process the file
        file_entry = process_file(str(file_path))
        
        # Add file entry
        processed_data[chapter_key]["f"].append(file_entry)
        
        # Add raw text
        if "t" in file_entry:
            processed_data[chapter_key]["r"].append({
                "i": file_entry["i"],
                "t": file_entry["t"]
            })
        
        # Add table if exists
        if "tb" in file_entry:
            processed_data[chapter_key]["tb"].append({
                "i": file_entry["i"],
                "t": file_entry["tb"],
                "f": file_entry["i"],
                "d": file_entry.get("tb_data"),
                "img": file_entry.get("tb_img")
            })
        
        # Process sections
        if "t" in file_entry:
            sections = []
            current_section = None
            
            for line in file_entry["t"].split("\n"):
                line = line.strip()
                if not line:
                    continue
                
                # Look for section headers (e.g., "308.5 Interval of support")
                section_match = re.match(r"^(\d+\.\d+(?:\.\d+)?)\s+(.+)$", line)
                if section_match:
                    if current_section:
                        sections.append(current_section)
                    section_id = section_match.group(1)
                    current_section = {
                        "i": section_id,    # Section ID
                        "t": line,          # Section title
                        "c": "",           # Section content
                        "f": file_entry["i"]  # Source file
                    }
                elif current_section:
                    current_section["c"] += line + "\n"
            
            if current_section:
                sections.append(current_section)
            
            # Add sections
            processed_data[chapter_key]["s"].extend(sections)
    
    # Sort sections numerically by ID, so that 308.5 comes before 308.10
    for chapter_data in processed_data.values():
        chapter_data["s"].sort(key=lambda s: tuple(int(p) for p in s["i"].split(".")))
    
    return processed_data
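
The section-splitting and sorting logic is easier to follow in isolation. The sketch below pulls it into two small functions and runs them on made-up page text; the names `split_sections` and `section_key` and the sample headings are assumptions for illustration. The key uses `int` for each ID part, which orders plain digit strings the same way as `float`.

```python
import re
from typing import Dict, List, Optional, Tuple

# Same pattern as in process_directory: two or three dot-separated numbers, then a title
SECTION_RE = re.compile(r"^(\d+\.\d+(?:\.\d+)?)\s+(.+)$")


def split_sections(text: str, page: int) -> List[Dict]:
    """Split one page of text into section entries keyed like the script's output."""
    sections: List[Dict] = []
    current: Optional[Dict] = None
    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            if current:
                sections.append(current)
            current = {"i": match.group(1), "t": line, "c": "", "f": page}
        elif current:
            current["c"] += line + "\n"
        # Lines before the first header on a page fall through and are dropped
    if current:
        sections.append(current)
    return sections


def section_key(section: Dict) -> Tuple[int, ...]:
    """Numeric sort key: '308.10' becomes (308, 10)."""
    return tuple(int(part) for part in section["i"].split("."))


sample = (
    "CHAPTER 3\n"
    "308.10 Example heading B\n"
    "Body text for the second example.\n"
    "308.5 Interval of support\n"
    "Body text for the first example.\n"
)
sections = sorted(split_sections(sample, page=2), key=section_key)
ids = [s["i"] for s in sections]
```

Three behaviors are worth noting. Text that appears before the first header on a page is discarded, so a section that continues from the previous page loses that continuation. The pattern accepts at most three numeric levels, so a header such as `308.5.1.2 ...` is not recognized and is appended to the preceding section's content. And a plain string sort would place `308.10` before `308.5`, which is why the key converts the parts to numbers.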

## 5. Saving JSON Files

The `save_json` function writes each chapter to its own JSON file, named after the chapter key:

In [None]:
def save_json(data: Dict[str, Dict], output_dir: str) -> None:
    """Save processed data to JSON files."""
    os.makedirs(output_dir, exist_ok=True)
    
    for filename, chapter_data in data.items():
        output_file = os.path.join(output_dir, f"{filename}.json")
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(chapter_data, f, indent=2, ensure_ascii=False)
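
A round trip through a temporary directory shows what ends up on disk. This is a sketch under stated assumptions: `save_json` is repeated from above so the snippet runs on its own, and the chapter content (including the "½ inch" text) is made up.

```python
import json
import os
import tempfile
from typing import Dict


def save_json(data: Dict[str, Dict], output_dir: str) -> None:
    """Save processed data to JSON files (copied from the cell above)."""
    os.makedirs(output_dir, exist_ok=True)
    for filename, chapter_data in data.items():
        output_file = os.path.join(output_dir, f"{filename}.json")
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(chapter_data, f, indent=2, ensure_ascii=False)


# Made-up chapter data in the shape produced by process_directory
chapter = {
    "m": {"c": "3", "t": "NYCPC", "ct": ""},
    "f": [],
    "r": [{"i": 2, "t": "½ inch"}],
    "s": [],
    "tb": [],
}

with tempfile.TemporaryDirectory() as tmp:
    save_json({"NYCP3CH_": chapter}, tmp)
    written = sorted(os.listdir(tmp))
    path = os.path.join(tmp, written[0])
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    loaded = json.loads(raw_text)
```

Two details follow from the code: the file name keeps the trailing underscore of the chapter key (`NYCP3CH_.json`), and `ensure_ascii=False` writes characters such as `½` as-is instead of as `\u00bd` escapes, which keeps the files readable but requires readers to open them as UTF-8.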

## Output JSON Structure

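The final JSON for each chapter has the structure shown below. The `//` comments are annotations for this explanation only; JSON does not allow comments, and the files written by `json.dump` contain none.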

```json
{
  "m": {                    // Metadata
    "c": "3",              // Chapter number
    "t": "NYCPC",          // Title
    "ct": ""               // Chapter title (left empty by the script)
  },
  "f": [                    // Files
    {
      "i": 2,              // Page index
      "p": "path/to/text.txt",  // Text file path
      "o": "path/to/image.jpg", // Optimized image path
      "pg": 2,             // Page number
      "t": "text content", // Text content
      "tb": "table content", // Table content (if exists)
      "tb_data": "path/to/table.txt", // Table file path
      "tb_img": "path/to/analytics.png" // Analytics image path
    }
  ],
  "r": [                    // Raw text entries
    {
      "i": 2,              // Page index
      "t": "text content"  // Text content
    }
  ],
  "s": [                    // Sections
    {
      "i": "308.5",        // Section ID
      "t": "308.5 Interval of support", // Section title
      "c": "section content", // Section content
      "f": 2               // Source file index
    }
  ],
  "tb": [                   // Tables
    {
      "i": 2,              // Page index
      "t": "table content", // Table content
      "f": 2,              // Source file index
      "d": "path/to/table.txt", // Table file path
      "img": "path/to/analytics.png" // Analytics image path
    }
  ]
}
```

Grouping pages, sections, and tables under one chapter object lets an application load a single file and then look up a section by ID or find the tables that belong to a given page.
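
As an illustration of how an application might query a loaded chapter, here is a minimal sketch. The helper names `index_sections` and `tables_on_page` are hypothetical, and the chapter dictionary reuses the placeholder values from the structure above.

```python
from typing import Dict, List


def index_sections(chapter: Dict) -> Dict[str, Dict]:
    """Map section ID to section entry; a repeated ID keeps the last entry."""
    return {section["i"]: section for section in chapter.get("s", [])}


def tables_on_page(chapter: Dict, page: int) -> List[Dict]:
    """Return the table entries whose source file index equals `page`."""
    return [table for table in chapter.get("tb", []) if table["f"] == page]


# Placeholder chapter in the documented shape (normally loaded with json.load)
chapter = {
    "s": [
        {"i": "308.5", "t": "308.5 Interval of support", "c": "section content", "f": 2}
    ],
    "tb": [
        {"i": 2, "t": "table content", "f": 2, "d": "path/to/table.txt", "img": None}
    ],
}

by_id = index_sections(chapter)
section = by_id["308.5"]
page_tables = tables_on_page(chapter, section["f"])
```

Because each section stores its source file index in `"f"`, the tables on the same page can be found from the section alone. The `"img"` value may be `null` in the JSON when a page has a table file but no analytics image, since the script fills it with `file_entry.get("tb_img")`.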