# Ground Truth Preparation - Improved Version

**Purpose**: Convert block-level annotation data into field-level ground truth CSV for model evaluation.

**Improvements over original**:
- Configurable paths (easy to switch between environments)
- Comprehensive validation and error reporting
- Modular functions for reusability
- Better handling of boolean fields
- Detailed logging of transformations
- Automatic backup of existing ground truth
- Summary statistics and quality checks

**Data Flow**:
1. Load and merge annotation files
2. Filter out 'other' annotations and questions
3. Apply semantic chunking (group by image, block, and annotator)
4. Transform from tall to wide format
5. Map annotation fields to standard field names
6. Clean and normalize values
7. Validate and save ground truth

## 1. Setup and Configuration

In [None]:
from pathlib import Path
from datetime import datetime
import shutil
import pandas as pd

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 200)

print("✅ Imports loaded successfully")

In [None]:
# ============================================================================
# CONFIGURATION - EDIT THESE PATHS IF NEEDED
# ============================================================================

# Base paths
ANNOTATIONS_DIR = Path('/efs/shared/annotations')
OUTPUT_DIR = Path('/efs/shared/PoC_data/evaluation_data')

# Input files
INPUT_FILES = {
    'annotator1': ANNOTATIONS_DIR / 'annotator1_block_ids.csv',
    'layoutlm': ANNOTATIONS_DIR / 'LayoutLM_annotation.csv'
}

# Output files
OUTPUT_FILES = {
    'merged': ANNOTATIONS_DIR / 'annotations_merged_block_ids.csv',
    'filtered': ANNOTATIONS_DIR / 'annotations_filtered.csv',
    'grouped': ANNOTATIONS_DIR / 'grouped_annotations_merged_block_ids.csv',
    'ground_truth': OUTPUT_DIR / 'ground_truth.csv'
}

# Expected field names (for validation)
EXPECTED_FIELDS = [
    'image_name', 'DOCUMENT_TYPE', 'BUSINESS_ABN', 'SUPPLIER_NAME', 'BUSINESS_ADDRESS',
    'PAYER_NAME', 'PAYER_ADDRESS', 'INVOICE_DATE', 'LINE_ITEM_DESCRIPTIONS',
    'LINE_ITEM_QUANTITIES', 'LINE_ITEM_PRICES', 'LINE_ITEM_TOTAL_PRICES',
    'IS_GST_INCLUDED', 'GST_AMOUNT', 'TOTAL_AMOUNT', 'STATEMENT_DATE_RANGE',
    'TRANSACTION_DATES', 'TRANSACTION_AMOUNTS_PAID', 'TRANSACTION_AMOUNTS_RECEIVED',
    'ACCOUNT_BALANCE'
]

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"✅ Configuration loaded")
print(f"📂 Annotations directory: {ANNOTATIONS_DIR}")
print(f"📁 Output directory: {OUTPUT_DIR}")

## 2. Utility Functions

In [None]:
def validate_input_files():
    """Validate that all required input files exist."""
    print("🔍 Validating input files...")
    missing_files = []
    
    for name, path in INPUT_FILES.items():
        if path.exists():
            print(f"  ✅ {name}: {path.name}")
        else:
            print(f"  ❌ {name}: {path} NOT FOUND")
            missing_files.append(name)
    
    if missing_files:
        raise FileNotFoundError(f"Missing required files: {', '.join(missing_files)}")
    
    print("✅ All input files validated")


def backup_existing_ground_truth():
    """Backup existing ground truth file if it exists."""
    gt_path = OUTPUT_FILES['ground_truth']
    
    if gt_path.exists():
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_path = gt_path.parent / f"ground_truth_backup_{timestamp}.csv"
        shutil.copy2(gt_path, backup_path)
        print(f"💾 Backed up existing ground truth to: {backup_path.name}")
    else:
        print("📝 No existing ground truth to backup")


def dedupe_join(series, separator=' | '):
    """Join unique values maintaining insertion order."""
    return separator.join(dict.fromkeys(series.astype(str)))


def remove_duplicate_strings(text):
    """Remove duplicate strings separated by ' | ' while preserving order."""
    # Handle NaN and NOT_FOUND values
    if pd.isna(text):
        return text
    if text == 'NOT_FOUND':
        return text
    
    # Convert to string to check if it contains a separator
    text_str = str(text)
    if ' | ' not in text_str:
        return text
    
    # Split by separator
    parts = text_str.split(' | ')
    
    # Remove duplicates while preserving order
    unique_parts = list(dict.fromkeys(parts))
    
    # Rejoin with separator
    return ' | '.join(unique_parts)


def clean_abn(text):
    """Extract and format 11-digit Australian Business Number.
    
    Removes common prefixes like 'abn', 'ABN', 'a.b.n.', pipes, colons, etc.
    and formats as 'XX XXX XXX XXX'.
    """
    import re
    
    # Handle NaN and NOT_FOUND values
    if pd.isna(text):
        return text
    if text == 'NOT_FOUND':
        return text
    
    # Convert to string and normalize
    text_str = str(text).strip()
    
    # Remove common ABN prefixes (case-insensitive)
    text_str = re.sub(r'\b(abn|a\.b\.n\.?)\b', '', text_str, flags=re.IGNORECASE)
    
    # Remove pipes, colons, and extra whitespace
    text_str = text_str.replace('|', '').replace(':', '').strip()
    
    # Extract all digits
    digits = re.sub(r'\D', '', text_str)
    
    # Check if we have exactly 11 digits
    if len(digits) == 11:
        # Format as XX XXX XXX XXX
        return f"{digits[0:2]} {digits[2:5]} {digits[5:8]} {digits[8:11]}"
    elif len(digits) > 0:
        # Return digits as-is if not exactly 11 (for debugging)
        return digits
    else:
        # No digits found, return NOT_FOUND
        return 'NOT_FOUND'


def normalize_single_date(text):
    """Normalize single date to DD/MM/YYYY format.
    
    Matches extraction_parser.py _normalize_date() function.
    Handles formats: DD/MM/YY, DD/MM/YYYY, DD mon YYYY, YYYY-MM-DD, etc.
    Strips timezone info like "(UTC+10:00)".
    
    For fields with multiple dates separated by ' | ', takes FIRST date only.
    Use this for: INVOICE_DATE, STATEMENT_DATE_RANGE
    """
    from dateutil import parser
    
    # Handle NaN and NOT_FOUND values
    if pd.isna(text):
        return text
    if text == 'NOT_FOUND':
        return text
    
    text_str = str(text).strip()
    
    # Handle multiple dates - take first one
    if ' | ' in text_str:
        dates = text_str.split(' | ')
        # Remove duplicates while preserving order
        unique_dates = list(dict.fromkeys(dates))
        # Take the first date
        text_str = unique_dates[0].strip()
    
    try:
        # Remove timezone info and extra content for cleaner parsing
        # Strip anything after ( like "(UTC+10:00)"
        clean_str = text_str.split('(')[0].strip()
        
        # Parse with dayfirst=True for Australian DD/MM/YYYY preference
        parsed_date = parser.parse(clean_str, dayfirst=True)
        
        # Format as DD/MM/YYYY (matches extraction_parser.py)
        return parsed_date.strftime('%d/%m/%Y')
    except (ValueError, OverflowError):
        # ParserError subclasses ValueError; OverflowError covers out-of-range values.
        # If parsing fails, return original
        return text_str


def normalize_transaction_dates(text):
    """Normalize multiple dates to DD/MM/YYYY format with pipe separator.
    
    Matches extraction_parser.py handling of TRANSACTION_DATES.
    Keeps ALL dates (including duplicates - legitimate repeated transactions).
    Normalizes each date individually, maintains pipe separation.
    Use this for: TRANSACTION_DATES
    
    IMPORTANT: Does NOT remove duplicate dates - multiple transactions 
    on the same date are legitimate (e.g., two purchases on the same day).
    """
    from dateutil import parser
    
    # Handle NaN and NOT_FOUND values
    if pd.isna(text):
        return text
    if text == 'NOT_FOUND':
        return text
    
    text_str = str(text).strip()
    
    # Split by pipe separator
    if ' | ' in text_str:
        dates = [d.strip() for d in text_str.split(' | ')]
    else:
        # Single date, wrap in list for consistent handling
        dates = [text_str]
    
    # DO NOT remove duplicates - they may be legitimate repeated transactions
    
    normalized_dates = []
    for date_str in dates:
        try:
            # Remove timezone info
            clean_str = date_str.split('(')[0].strip()
            
            # Parse with dayfirst=True
            parsed_date = parser.parse(clean_str, dayfirst=True)
            
            # Format as DD/MM/YYYY
            normalized_dates.append(parsed_date.strftime('%d/%m/%Y'))
        except (ValueError, OverflowError):
            # ParserError subclasses ValueError; OverflowError covers out-of-range values.
            # If parsing fails, keep original
            normalized_dates.append(date_str)
    
    # Return pipe-separated dates (including duplicates)
    return ' | '.join(normalized_dates)


def display_dataframe_summary(df, df_name="DataFrame"):
    """Display summary statistics for a DataFrame."""
    print(f"\n📊 {df_name} Structure:")
    print("=" * 70)
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    print(f"\n📋 {df_name} Head (first 5 rows):")
    print("=" * 70)
    print(df.head().to_string(index=False))
    
    print(f"\n📈 {df_name} Column Summary:")
    print("=" * 70)
    
    # Percentage filled (excluding 'NOT_FOUND')
    for col in df.columns:
        if col != 'image_name':
            non_empty = (df[col] != 'NOT_FOUND').sum()
            total = len(df)
            pct = (non_empty / total) * 100
            print(f"{col:30s}: {non_empty:3d}/{total:3d} filled ({pct:5.1f}%)")


print("✅ Utility functions defined")

## 3. Validate Input Files

In [None]:
validate_input_files()
backup_existing_ground_truth()

## 4. Load and Merge Annotation Files

In [None]:
print("📥 Loading annotation files...\n")

# Load annotator1 annotations
annotations = pd.read_csv(INPUT_FILES['annotator1'])
print(f"  ✅ Loaded annotator1: {annotations.shape[0]} rows")

# Remove rows with missing annotators
annotations = annotations.dropna(subset=['annotator'])
print(f"  ✅ Removed rows with missing annotators: {annotations.shape[0]} rows remaining")

# Load LayoutLM annotation file
LayoutLM_annotation = pd.read_csv(INPUT_FILES['layoutlm'])
print(f"  ✅ Loaded LayoutLM annotations: {LayoutLM_annotation.shape[0]} rows")

# Select only required columns from LayoutLM to avoid unnecessary data
layout_subset = LayoutLM_annotation[['page_id', 'case_id']].drop_duplicates()

# Perform left join to preserve all annotation data
merged_table = pd.merge(
    annotations,
    layout_subset,
    left_on='image_id',
    right_on='page_id',
    how='left'
).drop(columns=['page_id'])

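print(f"  ✅ Merged annotations with layout data: {merged_table.shape[0]} rows")

# Report annotations with no matching case_id (these would produce 'nan_<image_id>' names)
num_unmatched = merged_table['case_id'].isna().sum()
if num_unmatched:
    print(f"  ⚠️  {num_unmatched} rows have no matching case_id in the LayoutLM file")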

# Reorder columns to put case_id first
cols = ['case_id'] + [col for col in merged_table.columns if col != 'case_id']
merged_table = merged_table[cols]

# Find and report duplicate rows
duplicates = merged_table[merged_table.duplicated()]
num_duplicates = len(duplicates)
print(f"  📋 Number of duplicate rows: {num_duplicates}")

# Drop duplicate rows
df_final = (
    merged_table
    .drop_duplicates()
    .assign(image_name=lambda x: x['case_id'].astype(str) + '_' + x['image_id'].astype(str))
)

# Keep only required columns
cols_to_keep = ['image_name'] + [
    col for col in df_final.columns if col not in ['case_id', 'image_id', 'image_name']
]
df_final = df_final[cols_to_keep]

# Save merged result
df_final.to_csv(OUTPUT_FILES['merged'], index=False)
print(f"\n💾 Saved merged annotations: {OUTPUT_FILES['merged'].name}")
print(f"   Shape: {df_final.shape}")

# Load back for verification
annotations_merged_block_ids = pd.read_csv(OUTPUT_FILES['merged'])
print(f"\n✅ Verified merged file: {annotations_merged_block_ids.shape[0]} rows")
print(f"   Sample: {annotations_merged_block_ids.head(3)['image_name'].tolist()}")

## 5. Filter Out 'Other' and Questions

In [None]:
print("🔍 Filtering annotations...\n")

# Show unique annotators before filtering
print(f"  📋 Unique annotators before filtering: {sorted(annotations_merged_block_ids['annotator'].unique())}")

# Filter out 'other' annotations
annotations_filtered = annotations_merged_block_ids[
    annotations_merged_block_ids['annotator'] != 'other'
]
print(f"  ✅ Removed 'other' annotations: {len(annotations_merged_block_ids) - len(annotations_filtered)} rows removed")

# Filter out question annotations (containing '_q_')
annotations_filtered = annotations_filtered[
    ~annotations_filtered['annotator'].str.contains('_q_', na=False)
]
print(f"  ✅ Removed question annotations: {annotations_filtered.shape[0]} rows remaining")

# Show unique annotators after filtering
print(f"  📋 Unique annotators after filtering: {sorted(annotations_filtered['annotator'].unique())}")

# Save filtered result
annotations_filtered.to_csv(OUTPUT_FILES['filtered'], index=False)
print(f"\n💾 Saved filtered annotations: {OUTPUT_FILES['filtered'].name}")
print(f"   Shape: {annotations_filtered.shape}")

## 6. Apply Semantic Chunking (Group by Image, Block, and Annotator)

In [None]:
print("🔄 Applying semantic chunking...\n")

# Group and aggregate
grouped_df = (
    annotations_filtered
    .groupby(['image_name', 'block_ids', 'annotator'])
    .agg({
        'words': lambda x: ' '.join(x.astype(str).str.lower()),
        'pred': dedupe_join
    })
    .reset_index(drop=False)
)

print(f"  ✅ Grouped by image_name, block_ids, annotator: {grouped_df.shape[0]} groups")

# Select columns and drop duplicates
final_df = grouped_df[['image_name', 'block_ids', 'words', 'pred', 'annotator']].drop_duplicates(
    subset=['image_name', 'block_ids', 'pred', 'annotator'], keep='first'
)

print(f"  ✅ Dropped duplicates: {final_df.shape[0]} unique rows")

# Save to new CSV
final_df.to_csv(OUTPUT_FILES['grouped'], index=False)
print(f"\n💾 Saved grouped annotations: {OUTPUT_FILES['grouped'].name}")
print(f"   Shape: {final_df.shape}")
print(f"   Sample: {final_df.head(3)[['image_name', 'annotator']].to_dict('records')}")

## 7. Transform from Tall to Wide Format

In [None]:
print("🔄 Transforming from tall to wide format...\n")

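# Load grouped annotations (separate name so the filtered frame from section 5 is not overwritten)
grouped_annotations = pd.read_csv(OUTPUT_FILES['grouped'])

# Group by image_name and annotator, then concatenate words.
# read_csv turns strings such as "nan" or "NA" back into NaN, so drop those and
# cast to str before joining; a group with no usable words becomes NOT_FOUND.
result_df = (
    grouped_annotations.groupby(['image_name', 'annotator'])['words']
    .apply(lambda x: ' | '.join(x.dropna().astype(str)) or 'NOT_FOUND')
    .reset_index()
)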

print(f"  ✅ Concatenated words by image and annotator: {result_df.shape[0]} rows")

# Transform from "tall" to "wide"
wide_result_df = result_df.pivot(index='image_name', columns='annotator', values='words')
wide_result_df = wide_result_df.reset_index()
wide_result_df.columns.name = None

print(f"  ✅ Pivoted to wide format: {wide_result_df.shape}")
print(f"  📋 Columns: {list(wide_result_df.columns)}")

# Replace missing values with "NOT_FOUND"
wide_result_df = wide_result_df.fillna("NOT_FOUND")

# Save intermediate result
intermediate_path = OUTPUT_DIR / 'my_ground_truth_intermediate.csv'
wide_result_df.to_csv(intermediate_path, index=False)
print(f"\n💾 Saved intermediate wide format: {intermediate_path.name}")

# Load back for next step
transformed_df = pd.read_csv(intermediate_path)
print(f"   Shape: {transformed_df.shape}")

## 8. Map Annotation Fields to Standard Field Names

In [None]:
print("🗺️  Mapping annotation fields to standard names...\n")

# First, print available columns for debugging
print(f"  📋 Available columns in transformed_df: {list(transformed_df.columns)}")

# Column mapping dictionary (one-to-one mappings)
column_mapping = {
    "header_a_pg": 'DOCUMENT_TYPE',  # Will be normalized to INVOICE/RECEIPT/BANK_STATEMENT
    "supplierABN_a_pgs": 'BUSINESS_ABN',
    "supplier_a_pgs": 'SUPPLIER_NAME',
    "address_extra": 'BUSINESS_ADDRESS',  # Will be duplicated to PAYER_ADDRESS
    "payer_a_pgs": 'PAYER_NAME',
    "invDate_a_pgs": 'INVOICE_DATE',
    "desc_a_li": 'LINE_ITEM_DESCRIPTIONS',
    "quantity_a_li": 'LINE_ITEM_QUANTITIES',
    "unit_price_a_li": 'LINE_ITEM_PRICES',
    "total_a_li": 'LINE_ITEM_TOTAL_PRICES',
    "tax_a_pg": 'GST_AMOUNT',
    "total_a_pg": 'TOTAL_AMOUNT',
    "date_a_li": 'TRANSACTION_DATES',
    "due_a_li": 'TRANSACTION_AMOUNTS_PAID',
    "received_a_li": 'TRANSACTION_AMOUNTS_RECEIVED',
    "balance_a_li": 'ACCOUNT_BALANCE'
}

# Apply column mapping
cols_to_keep = [col for col in transformed_df.columns if col in column_mapping]
transformed_df = transformed_df[['image_name'] + cols_to_keep]
transformed_df = transformed_df.rename(columns=column_mapping)

print(f"  ✅ Mapped {len(cols_to_keep)} columns to standard names")
print(f"  📋 Columns after mapping: {list(transformed_df.columns)}")

# Normalize DOCUMENT_TYPE to INVOICE/RECEIPT/BANK_STATEMENT
if 'DOCUMENT_TYPE' in transformed_df.columns:
    def normalize_document_type(text):
        """Extract and normalize document type."""
        import re
        if pd.isna(text) or text == 'NOT_FOUND':
            return 'NOT_FOUND'
        
        text_upper = str(text).upper()
        
        # Check for BANK STATEMENT patterns
        if re.search(r'BANK.*STATEMENT|STATEMENT.*BANK|ACCOUNT.*STATEMENT', text_upper):
            return 'BANK_STATEMENT'
        # Check for INVOICE patterns
        elif re.search(r'INVOICE|TAX.*INVOICE', text_upper):
            return 'INVOICE'
        # Check for RECEIPT patterns
        elif re.search(r'RECEIPT|RCPT', text_upper):
            return 'RECEIPT'
        else:
            # Return original if no pattern matched
            return text
    
    transformed_df['DOCUMENT_TYPE'] = transformed_df['DOCUMENT_TYPE'].apply(normalize_document_type)
    print(f"  ✅ Normalized DOCUMENT_TYPE to INVOICE/RECEIPT/BANK_STATEMENT")

# Duplicate BUSINESS_ADDRESS to PAYER_ADDRESS if it exists
if 'BUSINESS_ADDRESS' in transformed_df.columns:
    transformed_df['PAYER_ADDRESS'] = transformed_df['BUSINESS_ADDRESS']
    print(f"  ✅ Duplicated BUSINESS_ADDRESS to PAYER_ADDRESS")

# Add STATEMENT_DATE_RANGE as duplicate of INVOICE_DATE if it exists
if 'INVOICE_DATE' in transformed_df.columns:
    transformed_df['STATEMENT_DATE_RANGE'] = transformed_df['INVOICE_DATE']
    print(f"  ✅ Duplicated INVOICE_DATE to STATEMENT_DATE_RANGE")

# Derive IS_GST_INCLUDED from GST_AMOUNT
if 'GST_AMOUNT' in transformed_df.columns:
    transformed_df['IS_GST_INCLUDED'] = transformed_df['GST_AMOUNT'].apply(
        lambda x: 'false' if str(x).strip().upper() in ['NOT_FOUND', 'NAN', ''] else 'true'
    )
    print(f"  ✅ Derived IS_GST_INCLUDED from GST_AMOUNT presence")
else:
    # If GST_AMOUNT column doesn't exist, create it as NOT_FOUND and set IS_GST_INCLUDED to 'false'
    transformed_df['GST_AMOUNT'] = 'NOT_FOUND'
    transformed_df['IS_GST_INCLUDED'] = 'false'
    print(f"  ⚠️  GST_AMOUNT column not found, created with 'NOT_FOUND' and IS_GST_INCLUDED='false'")

# Reorder columns to match expected field order
final_columns = ['image_name'] + [
    col for col in EXPECTED_FIELDS if col in transformed_df.columns and col != 'image_name'
]
transformed_df = transformed_df[final_columns]

print(f"\n  ✅ Reordered columns to match expected format")
print(f"   Shape: {transformed_df.shape}")
print(f"   Final columns: {list(transformed_df.columns)}")

## 9. Clean and Normalize Values

In [None]:
print("🧹 Cleaning and normalizing values...\n")

# Clean and format BUSINESS_ABN field
if 'BUSINESS_ABN' in transformed_df.columns:
    transformed_df['BUSINESS_ABN'] = transformed_df['BUSINESS_ABN'].apply(clean_abn)
    print(f"  ✅ Cleaned and formatted BUSINESS_ABN")
    # Show sample of cleaned ABNs
    sample_abns = transformed_df[transformed_df['BUSINESS_ABN'] != 'NOT_FOUND']['BUSINESS_ABN'].head(3).tolist()
    if sample_abns:
        print(f"     Sample: {sample_abns}")

# Normalize single date fields (INVOICE_DATE, STATEMENT_DATE_RANGE)
# Takes first date if multiple dates present
single_date_fields = ['INVOICE_DATE', 'STATEMENT_DATE_RANGE']
for field in single_date_fields:
    if field in transformed_df.columns:
        transformed_df[field] = transformed_df[field].apply(normalize_single_date)
        print(f"  ✅ Normalized {field} to DD/MM/YYYY format (first date only)")
        # Show sample
        sample_dates = transformed_df[transformed_df[field] != 'NOT_FOUND'][field].head(3).tolist()
        if sample_dates:
            print(f"     Sample: {sample_dates}")

# Normalize TRANSACTION_DATES (keeps ALL dates pipe-separated)
if 'TRANSACTION_DATES' in transformed_df.columns:
    transformed_df['TRANSACTION_DATES'] = transformed_df['TRANSACTION_DATES'].apply(normalize_transaction_dates)
    print(f"  ✅ Normalized TRANSACTION_DATES to DD/MM/YYYY format (all dates preserved)")
    # Show sample
    sample_dates = transformed_df[transformed_df['TRANSACTION_DATES'] != 'NOT_FOUND']['TRANSACTION_DATES'].head(3).tolist()
    if sample_dates:
        print(f"     Sample: {sample_dates}")

# Apply deduplication to PAYER_NAME field
if 'PAYER_NAME' in transformed_df.columns:
    transformed_df['PAYER_NAME'] = transformed_df['PAYER_NAME'].apply(remove_duplicate_strings)
    print(f"  ✅ Applied deduplication to PAYER_NAME")

# Replace pipe separators with spaces in text fields
text_fields = ['DOCUMENT_TYPE', 'PAYER_NAME', 'PAYER_ADDRESS', 'SUPPLIER_NAME', 'BUSINESS_ADDRESS']
for field in text_fields:
    if field in transformed_df.columns:
        transformed_df[field] = transformed_df[field].astype(str).str.replace(r"\s*\|\s*", " ", regex=True)
        print(f"  ✅ Replaced pipe separators in {field}")

print(f"\n✅ Cleaning complete")

## 10. Validate and Save Ground Truth

In [None]:
print("🔍 Validating ground truth data...\n")

# Check for expected fields
missing_fields = [field for field in EXPECTED_FIELDS if field not in transformed_df.columns]
extra_fields = [field for field in transformed_df.columns if field not in EXPECTED_FIELDS]

if missing_fields:
    print(f"  ⚠️  Missing expected fields: {missing_fields}")
else:
    print(f"  ✅ All expected fields present")

if extra_fields:
    print(f"  ⚠️  Extra fields found: {extra_fields}")
else:
    print(f"  ✅ No extra fields")

# Check for duplicate image names
duplicates = transformed_df[transformed_df['image_name'].duplicated()]
if len(duplicates) > 0:
    print(f"  ⚠️  WARNING: {len(duplicates)} duplicate image names found:")
    print(f"     {duplicates['image_name'].tolist()}")
else:
    print(f"  ✅ No duplicate image names")

# Save final ground truth
transformed_df.to_csv(OUTPUT_FILES['ground_truth'], index=False)
print(f"\n💾 Saved ground truth: {OUTPUT_FILES['ground_truth']}")
print(f"   Shape: {transformed_df.shape}")
print(f"   Columns: {len(transformed_df.columns)}")

# Reload and verify
verified_df = pd.read_csv(OUTPUT_FILES['ground_truth'])
print(f"\n✅ Verified saved file: {verified_df.shape}")

## 11. Display Summary Statistics

In [None]:
display_dataframe_summary(transformed_df, "Ground Truth")

## 12. Quality Checks

In [None]:
print("\n🔍 Quality Checks:")
print("=" * 70)

# Check 1: Field coverage
print("\n1. Field Coverage (non-NOT_FOUND values):")
for col in transformed_df.columns:
    if col != 'image_name':
        non_empty = (transformed_df[col] != 'NOT_FOUND').sum()
        pct = (non_empty / len(transformed_df)) * 100
        status = "✅" if pct > 50 else "⚠️ "
        print(f"  {status} {col:30s}: {pct:5.1f}% ({non_empty}/{len(transformed_df)})")

# Check 2: Document type distribution
if 'DOCUMENT_TYPE' in transformed_df.columns:
    print("\n2. Document Type Distribution:")
    doc_types = transformed_df['DOCUMENT_TYPE'].value_counts()
    for doc_type, count in doc_types.items():
        pct = (count / len(transformed_df)) * 100
        print(f"  • {doc_type}: {count} ({pct:.1f}%)")

# Check 3: Boolean field validation
if 'IS_GST_INCLUDED' in transformed_df.columns:
    print("\n3. Boolean Field Validation (IS_GST_INCLUDED):")
    value_counts = transformed_df['IS_GST_INCLUDED'].value_counts()
    for value, count in value_counts.items():
        print(f"  • {value}: {count}")
    
    # Check for invalid boolean values
    invalid = transformed_df[
        ~transformed_df['IS_GST_INCLUDED'].isin(['true', 'false', 'NOT_FOUND'])
    ]
    if len(invalid) > 0:
        print(f"  ⚠️  Invalid boolean values found: {invalid['IS_GST_INCLUDED'].unique()}")
    else:
        print(f"  ✅ All boolean values are valid (true/false/NOT_FOUND)")

print("\n✅ Quality checks complete")

## Summary

Ground truth preparation complete! The final ground truth CSV has been saved to:

**Output file**: the path configured as `OUTPUT_FILES['ground_truth']` (by default `/efs/shared/PoC_data/evaluation_data/ground_truth.csv`)

**Next steps**:
1. Review the quality check results above
2. Verify field coverage is acceptable for your use case
3. Use the ground truth CSV for model evaluation

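**Intermediate files saved** (for debugging; default paths shown):
- Merged annotations: `/efs/shared/annotations/annotations_merged_block_ids.csv`
- Filtered annotations: `/efs/shared/annotations/annotations_filtered.csv`
- Grouped annotations: `/efs/shared/annotations/grouped_annotations_merged_block_ids.csv`
- Wide-format intermediate: `/efs/shared/PoC_data/evaluation_data/my_ground_truth_intermediate.csv`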

---

## Data Transformations Applied

The following transformations were applied to `annotator1_block_ids.csv`:

### **1. Data Loading & Merging**
- Loaded `annotator1_block_ids.csv` as primary annotation source
- Removed rows with missing annotators
- Merged with `LayoutLM_annotation.csv` to add `case_id` mapping
- Created `image_name` as `{case_id}_{image_id}`
- Removed duplicate rows

### **2. Filtering**
- Removed annotations where `annotator == 'other'`
- Removed question annotations (containing `_q_`)
- Retained only valid annotation types

### **3. Semantic Chunking**
- Grouped by `image_name`, `block_ids`, and `annotator`
- Concatenated `words` within each group (lowercase, space-separated)
- Deduplicated `pred` values using pipe separator
- Removed duplicate groups
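
The grouping step can be pictured without pandas. The sketch below is illustrative only: the sample rows, the `pred` labels, and the `chunk_rows` helper are made up, and a plain dictionary stands in for the notebook's `groupby` on `image_name`, `block_ids`, and `annotator`.

```python
from collections import defaultdict


def chunk_rows(rows):
    """Collapse word-level rows into one record per (image_name, block_ids, annotator)."""
    groups = defaultdict(lambda: {"words": [], "pred": []})
    for row in rows:
        key = (row["image_name"], row["block_ids"], row["annotator"])
        groups[key]["words"].append(str(row["words"]).lower())
        groups[key]["pred"].append(str(row["pred"]))

    chunks = []
    for (image_name, block_ids, annotator), group in groups.items():
        chunks.append({
            "image_name": image_name,
            "block_ids": block_ids,
            "annotator": annotator,
            # Words are lowercased and joined with single spaces
            "words": " ".join(group["words"]),
            # dict.fromkeys keeps the first occurrence of each value, in order
            "pred": " | ".join(dict.fromkeys(group["pred"])),
        })
    return chunks


# Hypothetical word-level annotations for one image
sample_rows = [
    {"image_name": "c1_p1", "block_ids": 3, "annotator": "supplier_a_pgs", "words": "ACME", "pred": "supplier"},
    {"image_name": "c1_p1", "block_ids": 3, "annotator": "supplier_a_pgs", "words": "Pty", "pred": "supplier"},
    {"image_name": "c1_p1", "block_ids": 3, "annotator": "supplier_a_pgs", "words": "Ltd", "pred": "other"},
    {"image_name": "c1_p1", "block_ids": 7, "annotator": "total_a_pg", "words": "$110.00", "pred": "total"},
]

chunks = chunk_rows(sample_rows)
for chunk in chunks:
    print(chunk)
```

Unlike pandas `groupby`, this version keeps groups in first-seen order rather than sorting the keys.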

### **4. Data Reshaping**
- Transformed from "tall" format (multiple rows per image) to "wide" format (one row per image)
- Concatenated words by `image_name` and `annotator` using pipe separator
- Pivoted annotator fields into columns
- Filled missing values with `"NOT_FOUND"`
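
As a rough picture of the reshaping, the following standard-library sketch pivots chunk records into one row per image. The data and the `tall_to_wide` helper are hypothetical; the notebook itself uses `groupby` followed by `DataFrame.pivot`.

```python
def tall_to_wide(chunks, fill="NOT_FOUND"):
    """Pivot (image_name, annotator, words) records into one dict per image."""
    joined = {}
    for chunk in chunks:
        key = (chunk["image_name"], chunk["annotator"])
        joined.setdefault(key, []).append(chunk["words"])

    images = sorted({image for image, _ in joined})
    annotators = sorted({annotator for _, annotator in joined})

    wide = []
    for image in images:
        row = {"image_name": image}
        for annotator in annotators:
            words = joined.get((image, annotator))
            # Several blocks for the same field are pipe-joined; absent fields get the fill value
            row[annotator] = " | ".join(words) if words else fill
        wide.append(row)
    return wide


# Hypothetical chunked annotations: the second image has no total
sample_chunks = [
    {"image_name": "c1_p1", "annotator": "supplier_a_pgs", "words": "acme pty ltd"},
    {"image_name": "c1_p1", "annotator": "total_a_pg", "words": "$110.00"},
    {"image_name": "c1_p1", "annotator": "total_a_pg", "words": "110.00"},
    {"image_name": "c2_p1", "annotator": "supplier_a_pgs", "words": "widgets co"},
]

wide_rows = tall_to_wide(sample_chunks)
for row in wide_rows:
    print(row)
```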

### **5. Field Mapping & Derivation**
**Mapped annotation fields to standard names:**
- `header_a_pg` → `DOCUMENT_TYPE` (normalized to INVOICE/RECEIPT/BANK_STATEMENT)
- `supplierABN_a_pgs` → `BUSINESS_ABN`
- `supplier_a_pgs` → `SUPPLIER_NAME`
- `address_extra` → `BUSINESS_ADDRESS`
- `payer_a_pgs` → `PAYER_NAME`
- `invDate_a_pgs` → `INVOICE_DATE`
- `desc_a_li` → `LINE_ITEM_DESCRIPTIONS`
- `quantity_a_li` → `LINE_ITEM_QUANTITIES`
- `unit_price_a_li` → `LINE_ITEM_PRICES`
- `total_a_li` → `LINE_ITEM_TOTAL_PRICES`
- `tax_a_pg` → `GST_AMOUNT`
- `total_a_pg` → `TOTAL_AMOUNT`
- `date_a_li` → `TRANSACTION_DATES`
- `due_a_li` → `TRANSACTION_AMOUNTS_PAID`
- `received_a_li` → `TRANSACTION_AMOUNTS_RECEIVED`
- `balance_a_li` → `ACCOUNT_BALANCE`

**Derived fields:**
- `PAYER_ADDRESS` = duplicate of `BUSINESS_ADDRESS`
- `STATEMENT_DATE_RANGE` = duplicate of `INVOICE_DATE`
- `IS_GST_INCLUDED` = `"true"` if `GST_AMOUNT` is present (not `NOT_FOUND`, `nan`, or empty), else `"false"`

### **6. Data Cleaning & Normalization**

**BUSINESS_ABN:**
- Removed prefixes: `abn`, `ABN`, `a.b.n.`, etc.
- Removed pipes (`|`), colons (`:`), extra whitespace
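- Extracted all digits; exactly 11 digits are formatted as `XX XXX XXX XXX`, any other non-empty digit string is returned unformatted (for debugging), and values without digits become `NOT_FOUND`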
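
Formatting alone does not confirm that an 11-digit string is a well-formed ABN. As an optional extra check, which is not part of this notebook, the standard ABN checksum could be applied: subtract 1 from the first digit, multiply each digit by its weight, and test whether the sum is divisible by 89. `is_valid_abn` is a hypothetical helper name.

```python
import re

# Weights used by the ABN checksum, one per digit
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(text):
    """Return True if the digits in `text` form an 11-digit ABN with a valid checksum."""
    digits = [int(ch) for ch in re.sub(r"\D", "", str(text))]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    weighted_sum = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return weighted_sum % 89 == 0


print(is_valid_abn("51 824 753 556"))  # True
print(is_valid_abn("51 824 753 557"))  # False (last digit changed)
print(is_valid_abn("1234"))            # False (wrong length)
```

A check like this could flag annotation or OCR errors in `BUSINESS_ABN` before the ground truth is used for scoring.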

**Date Fields (aligned with extraction_parser.py):**

**Single Date Fields (INVOICE_DATE, STATEMENT_DATE_RANGE):**
- Takes **FIRST date only** if multiple dates present
- Strips timezone info like `(UTC+10:00)`
- Strips day names (Monday, Tuesday, Wednesday, etc.) - handled by dateutil
- Strips ordinal indicators (1st, 2nd, 24th, etc.) - handled by dateutil
- Standardized to **`DD/MM/YYYY`** format (Australian format)
- Supported input formats: `DD/MM/YY`, `DD/MM/YYYY`, `DD mon YYYY`, `YYYY-MM-DD`, `Wednesday, 24th August 2022`
- Uses `dayfirst=True` for Australian date parsing preference
- Examples:
  - `26 Apr 2023` → `26/04/2023`
  - `2023-04-14 11:22 AM (UTC+10:00)` → `14/04/2023`
  - `Wednesday, 24th August 2022` → `24/08/2022`
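
One behaviour worth checking in your environment: with `dayfirst=True`, `dateutil` may also swap day and month in ISO-style strings whose day is 12 or less (for example, reading `2023-04-05` as 4 May 2023). If that affects your data, ISO-formatted values could be handled explicitly before the general parser runs. The sketch below uses only the standard library; `iso_to_ddmmyyyy` is a hypothetical helper, not part of `extraction_parser.py`, and any such change would need to be mirrored on the prediction side so both stay comparable.

```python
import re
from datetime import datetime

# Matches a leading YYYY-MM-DD, ignoring anything after it (time, timezone, etc.)
ISO_DATE = re.compile(r"^\s*(\d{4})-(\d{2})-(\d{2})")


def iso_to_ddmmyyyy(text):
    """Return DD/MM/YYYY for strings starting with YYYY-MM-DD, else None."""
    match = ISO_DATE.match(str(text))
    if not match:
        return None
    try:
        parsed = datetime.strptime("-".join(match.groups()), "%Y-%m-%d")
    except ValueError:
        # Looks like ISO but is not a real calendar date
        return None
    return parsed.strftime("%d/%m/%Y")


print(iso_to_ddmmyyyy("2023-04-05"))                       # 05/04/2023
print(iso_to_ddmmyyyy("2023-04-14 11:22 AM (UTC+10:00)"))  # 14/04/2023
print(iso_to_ddmmyyyy("26 Apr 2023"))                      # None
```

A caller could try this helper first and fall back to the day-first parser only when it returns `None`.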

**Multi-Date Field (TRANSACTION_DATES):**
- Keeps **ALL dates** (including duplicates - legitimate repeated transactions)
- Each date normalized individually to **`DD/MM/YYYY`** format
- Maintains pipe separation: `DD/MM/YYYY | DD/MM/YYYY`
- Strips timezone info and day names from each date
- **DOES NOT remove duplicate dates** - multiple transactions on same date are valid
- Example: `15 Mar 2024 | 15 Mar 2024 | 20 Mar 2024` → `15/03/2024 | 15/03/2024 | 20/03/2024` (dates must already be separated by ` | `; comma-separated input is not split)

**DOCUMENT_TYPE:**
- Normalized using regex patterns:
  - `BANK.*STATEMENT|STATEMENT.*BANK|ACCOUNT.*STATEMENT` → `BANK_STATEMENT`
  - `INVOICE|TAX.*INVOICE` → `INVOICE`
  - `RECEIPT|RCPT` → `RECEIPT`
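
Because the patterns are tried in order, a header that mentions more than one document type resolves to the first match, and headers that match nothing pass through unchanged. The standalone sketch below reuses the patterns listed above on made-up headers; `classify` and `DOC_TYPE_PATTERNS` are illustrative names.

```python
import re

# Same patterns as above, in the order they are tested
DOC_TYPE_PATTERNS = [
    ("BANK_STATEMENT", r"BANK.*STATEMENT|STATEMENT.*BANK|ACCOUNT.*STATEMENT"),
    ("INVOICE", r"INVOICE|TAX.*INVOICE"),
    ("RECEIPT", r"RECEIPT|RCPT"),
]


def classify(text):
    """Return the first matching document type, or the original text if none match."""
    upper = str(text).upper()
    for label, pattern in DOC_TYPE_PATTERNS:
        if re.search(pattern, upper):
            return label
    return text


# Hypothetical header strings
for header in ["tax invoice", "Tax Invoice / Receipt", "account statement",
               "cash rcpt", "statement of account"]:
    print(f"{header!r} -> {classify(header)!r}")
```

In this sketch `statement of account` is left as-is, since none of the three bank statement patterns cover that word order; unmatched values like this show up in the document type distribution check.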

**PAYER_NAME:**
- Removed duplicate strings separated by `|`

**Text Fields (DOCUMENT_TYPE, PAYER_NAME, PAYER_ADDRESS, SUPPLIER_NAME, BUSINESS_ADDRESS):**
- Replaced pipe separators (`|`) with spaces
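
The order of these two steps matters: once pipes are replaced with spaces, repeated values can no longer be told apart. A small standalone sketch with a made-up value (the helper names are illustrative):

```python
import re


def dedupe_pipe_separated(text, separator=" | "):
    """Drop repeated parts of a pipe-separated string, keeping first occurrences."""
    return separator.join(dict.fromkeys(text.split(separator)))


def pipes_to_spaces(text):
    """Replace each pipe (and surrounding whitespace) with a single space."""
    return re.sub(r"\s*\|\s*", " ", text)


raw = "jane citizen | jane citizen | 12 example st"  # hypothetical annotation value

deduped_then_flat = pipes_to_spaces(dedupe_pipe_separated(raw))
flat_only = pipes_to_spaces(raw)

print(deduped_then_flat)  # jane citizen 12 example st
print(flat_only)          # jane citizen jane citizen 12 example st
```

In the notebook only `PAYER_NAME` goes through the deduplication step, so the other text fields behave like `flat_only` and keep any repeated values.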

### **7. Validation**
- Checked for all expected fields
- Checked for duplicate image names
- Verified field coverage percentages
- Validated boolean field values (IS_GST_INCLUDED)

### **Final Output**
- 20 columns per image: `image_name` plus 19 standardized fields
- One row per image (identified by `image_name`)
- Consistent data types and formats
- Missing values marked as `"NOT_FOUND"`
- **Date format matches extraction_parser.py**: `DD/MM/YYYY` (Australian format)
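
When the CSV is consumed downstream, it may be worth reading every column as text. With default type inference, pandas `read_csv` can turn a column of `true`/`false` strings into booleans, which would no longer compare equal to string predictions. The standard-library sketch below keeps all values as strings; the inline CSV is made-up sample data, not real output, and `coverage` is a hypothetical helper.

```python
import csv
import io

# Made-up sample with a few of the ground truth columns
sample_csv = io.StringIO(
    "image_name,IS_GST_INCLUDED,GST_AMOUNT,TOTAL_AMOUNT\n"
    "c1_p1,true,$10.00,$110.00\n"
    "c2_p1,false,NOT_FOUND,$55.00\n"
)

# csv.DictReader performs no type inference: every value stays a string
rows = list(csv.DictReader(sample_csv))


def coverage(rows, field, missing="NOT_FOUND"):
    """Fraction of rows whose value for `field` is not the missing marker."""
    filled = sum(1 for row in rows if row[field] != missing)
    return filled / len(rows)


print(rows[0]["IS_GST_INCLUDED"])    # true (a string, not a bool)
print(coverage(rows, "GST_AMOUNT"))  # 0.5
```

With pandas, a similar effect could be obtained with `pd.read_csv(path, dtype=str, keep_default_na=False)`.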