# Ground Truth XML Parser

This notebook parses XML files containing character information (name, gender, aliases) for novels and extracts the ground truth data into a Pandas DataFrame. This structured data can then be used for evaluating the NLP pipeline.
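
The parsing function in Cell 3 looks for a `<characters>` element directly under the root, with one `<character>` child per character that carries `name`, `gender`, and `aliases` attributes. The sketch below shows a file shaped that way and how those attributes come back from `ElementTree`. The root tag `doc`, the sample characters, and the `;` separator inside `aliases` are assumptions made for illustration; they are not taken from the real ground truth files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use a different root tag and alias format.
SAMPLE_XML = """
<doc>
  <characters>
    <character name="Jane_Doe" gender="female" aliases="Jane; Miss Doe"/>
    <character name="John_Roe" gender="MALE" aliases=""/>
    <character name="The_Stranger"/>
  </characters>
</doc>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Element.get returns None for a missing attribute unless a default is given.
rows = [
    (c.get("name"), c.get("gender"), c.get("aliases", ""))
    for c in root.find("characters").findall("character")
]

for row in rows:
    print(row)
```

A missing `gender` attribute comes back as `None`, which is why the parser treats a falsy gender as `Unknown`.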

In [1]:
# Cell 1: Import Libraries
import xml.etree.ElementTree as ET
import pandas as pd
import os

print("Libraries imported.")

Libraries imported.


In [7]:
# Cell 2: Configuration
# --- Input Files ---
# Directory containing the ground truth XML files
GROUND_TRUTH_DIR = "" 

# List of XML files to parse
XML_FILES = [
    "emma_full.xml",
    "steppe_full.xml",
    "pp_full.xml",
]

# --- Output File ---
# Where to save the parsed ground truth data
OUTPUT_CSV_PATH = "data/eval/ground_truth_characters.csv"

# --- Constants ---
# Define standardized gender labels 
GENDER_FEMALE = "Female"
GENDER_MALE = "Male"
GENDER_UNKNOWN = "Unknown"

# Define gender labels expected in the XML (lowercase)
XML_GENDER_FEMALE = "female"
XML_GENDER_MALE = "male"
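# Labels in the XML that map to Unknown. Kept for reference only: the parser
# treats every label other than female/male as Unknown, so this list is not consulted.
XML_GENDER_OTHER = ["none", "unknown", "", None]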

print("Configuration set.")
print(f"Looking for XML files in: {GROUND_TRUTH_DIR}")
print(f"Files to parse: {XML_FILES}")
print(f"Output CSV: {OUTPUT_CSV_PATH}")

Configuration set.
Looking for XML files in: 
Files to parse: ['emma_full.xml', 'steppe_full.xml', 'pp_full.xml']
Output CSV: data/eval/ground_truth_characters.csv


In [3]:
# Cell 3: XML Parsing Function

def parse_ground_truth_xml(xml_path):
    """Parses a single XML file and returns a list of dicts with each character's name, gender, and aliases."""
    characters = []
    novel_name = os.path.splitext(os.path.basename(xml_path))[0].replace('_full', '')
    
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
        # Find the 'characters' element
        characters_element = root.find('characters')
        if characters_element is None:
            print(f"Warning: <characters> tag not found in {xml_path}")
            return [] # Return empty list if structure is wrong

        for char_element in characters_element.findall('character'):
            name = char_element.get('name')
            gender = char_element.get('gender') # Might be None
            aliases = char_element.get('aliases', '') # Get aliases, default to empty string

            if not name: 
                continue # Skip characters without a name

            # Standardize gender
            standard_gender = GENDER_UNKNOWN
            if gender:
                gender_lower = gender.lower()
                if gender_lower == XML_GENDER_FEMALE:
                    standard_gender = GENDER_FEMALE
                elif gender_lower == XML_GENDER_MALE:
                    standard_gender = GENDER_MALE
                # Any other label (including those in XML_GENDER_OTHER) stays GENDER_UNKNOWN

            characters.append({
                'novel': novel_name,
                'canonical_key': name, 
                'true_gender': standard_gender,
                'aliases': aliases
            })

        print(f"Successfully parsed {len(characters)} characters from {xml_path}")
        return characters

    except ET.ParseError as e:
        print(f"Error parsing XML file {xml_path}: {e}")
        return []
    except FileNotFoundError:
        print(f"Error: Ground truth file not found at {xml_path}")
        return []
    except Exception as e:
        print(f"An unexpected error occurred during parsing {xml_path}: {e}")
        return []

print("XML parsing function defined.")

XML parsing function defined.
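
The gender standardization inside `parse_ground_truth_xml` can also be written as a small lookup. This is a minimal sketch of the same rule, not a replacement for the function above: `normalize_gender` and `GENDER_MAP` are hypothetical names introduced here, and the constants are repeated so the snippet runs on its own.

```python
GENDER_FEMALE = "Female"
GENDER_MALE = "Male"
GENDER_UNKNOWN = "Unknown"

# Lowercase XML label -> standardized label; any label not listed maps to Unknown.
GENDER_MAP = {"female": GENDER_FEMALE, "male": GENDER_MALE}


def normalize_gender(raw):
    """Hypothetical helper mirroring the standardization in parse_ground_truth_xml."""
    if not raw:  # None or empty string
        return GENDER_UNKNOWN
    return GENDER_MAP.get(raw.lower(), GENDER_UNKNOWN)


for label in ["Female", "MALE", "none", "unknown", "", None]:
    print(repr(label), "->", normalize_gender(label))
```

As in the original code, matching is case-insensitive but does not strip surrounding whitespace, so a label such as `" male"` would still fall through to `Unknown`.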


In [8]:
# Cell 4: Parse All Specified XML Files and Combine

all_ground_truth_chars = []
for xml_filename in XML_FILES:
    full_xml_path = os.path.join(GROUND_TRUTH_DIR, xml_filename)
    parsed_chars = parse_ground_truth_xml(full_xml_path)
    all_ground_truth_chars.extend(parsed_chars)

if not all_ground_truth_chars:
    print("\nNo characters were parsed from any XML file. Cannot create DataFrame.")
    ground_truth_df = pd.DataFrame()
else:
    ground_truth_df = pd.DataFrame(all_ground_truth_chars)
    print(f"\nCombined ground truth data from {len(XML_FILES)} file(s). Total characters: {len(ground_truth_df)}")
    
    # Display info about the combined DataFrame
    print("\nCombined Ground Truth DataFrame Info:")
    ground_truth_df.info()
    print("\nCombined Ground Truth Head:")
    print(ground_truth_df.head())
    print("\nCombined Ground Truth Gender Distribution:")
    print(ground_truth_df['true_gender'].value_counts())

Successfully parsed 49 characters from emma_full.xml
Successfully parsed 65 characters from steppe_full.xml
Successfully parsed 32 characters from pp_full.xml

Combined ground truth data from 3 file(s). Total characters: 146

Combined Ground Truth DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   novel          146 non-null    object
 1   canonical_key  146 non-null    object
 2   true_gender    146 non-null    object
 3   aliases        146 non-null    object
dtypes: object(4)
memory usage: 4.7+ KB

Combined Ground Truth Head:
  novel        canonical_key true_gender  \
0  emma       Emma_Woodhouse      Female   
1  emma   Isabella_Woodhouse      Female   
2  emma    Mr_John_Knightley        Male   
3  emma          Miss_Taylor      Female   
4  emma  Mr_George_Knightley        Male   

                                        

In [9]:
# Cell 5: Save Parsed Data to CSV

if not ground_truth_df.empty:
    try:
        
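        # Create the output directory if needed; dirname is empty for a bare
        # filename, and os.makedirs("") would raise an error.
        output_dir = os.path.dirname(OUTPUT_CSV_PATH)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)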
        
        # Save the DataFrame to CSV
        ground_truth_df.to_csv(OUTPUT_CSV_PATH, index=False, encoding='utf-8')
        print(f"\nSuccessfully saved ground truth data to: {OUTPUT_CSV_PATH}")
    except Exception as e:
        print(f"\nError saving ground truth data to CSV: {e}")
else:
    print("\nSkipping saving to CSV as the DataFrame is empty.")


Successfully saved ground truth data to: data/eval/ground_truth_characters.csv


--- 
**Next Steps:**

The file `ground_truth_characters.csv` now contains the structured ground truth data.
I can now load this CSV file in my main evaluation notebook (`04_evaluations.ipynb`) to compare it against the pipeline's results.
---
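
A rough sketch of what that comparison might look like, assuming the pipeline produces a table keyed by `novel` and `canonical_key` with a `predicted_gender` column. That predictions table and its values are hypothetical. The CSV text is an inline stand-in for `ground_truth_characters.csv`, using three rows from the head output above with the `aliases` column left blank as a placeholder.

```python
import io

import pandas as pd

# Inline stand-in for data/eval/ground_truth_characters.csv (aliases left blank).
CSV_TEXT = """novel,canonical_key,true_gender,aliases
emma,Emma_Woodhouse,Female,
emma,Mr_John_Knightley,Male,
emma,Miss_Taylor,Female,
"""

# keep_default_na=False keeps empty alias fields as "" instead of NaN.
ground_truth = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

# Hypothetical pipeline output; the column name predicted_gender is an assumption.
predictions = pd.DataFrame(
    {
        "novel": ["emma", "emma"],
        "canonical_key": ["Emma_Woodhouse", "Mr_John_Knightley"],
        "predicted_gender": ["Female", "Female"],
    }
)

# Left merge keeps every ground truth character, even ones the pipeline missed.
merged = ground_truth.merge(predictions, on=["novel", "canonical_key"], how="left")
merged["predicted_gender"] = merged["predicted_gender"].fillna("Unknown")

accuracy = (merged["true_gender"] == merged["predicted_gender"]).mean()
print(merged[["canonical_key", "true_gender", "predicted_gender"]])
print(f"Gender accuracy: {accuracy:.2f}")
```

The left merge is a deliberate choice here: characters that the pipeline never detected stay in the table and count as errors rather than silently dropping out of the score.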