Below is a practical, step-by-step method to investigate whether fields like CRD, LEI, or FIGI exist in the original XML but might not be captured by your parser. The goal is to figure out whether your code is simply missing certain tags or if they truly do not appear in the raw data.
1. High-Level Approach

    Collect the Raw XML: Gather the raw <XML> fragments from the same filings you suspect have missing tags (e.g., CRD, LEI, FIGI).

    Search for the Tags: Use either an XPath query or even a plain text search to see if <crdNumber>, <leiNumber>, <figi> (or any variant) exist in the XML.

    Compare to Parser Code: If you find them, confirm whether your code has a line that captures them or a fallback tag name. If not, add the missing logic.

    Log Found vs. Missing: Consider printing statements or collecting stats to see how often these tags are present and how often they appear in your final CSV.
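
The plain-text search from the second step can be sketched with a regular expression before any XML parsing is involved. This is a minimal sketch, not part of the original parser: the function name count_tags_in_text, the TAGS tuple, and the sample string are made up for illustration. The search is case-insensitive and tolerates a namespace prefix, so it can surface spelling or capitalization variants that an exact tag-name match would miss.

```python
import re

# Tag names to look for; extend this tuple with any variants you suspect.
TAGS = ("crdNumber", "leiNumber", "leiNumberOM", "figi")

def count_tags_in_text(text, tags=TAGS):
    """Count opening tags by plain-text search, allowing a namespace prefix."""
    counts = {}
    for tag in tags:
        # Matches <crdNumber>, <ns:crdNumber>, and <crdNumber attr="...">,
        # but not </crdNumber> or longer names such as <crdNumberOther>.
        pattern = re.compile(
            r"<(?:[\w.-]+:)?" + re.escape(tag) + r"(?=[\s>/])",
            re.IGNORECASE,
        )
        counts[tag] = len(pattern.findall(text))
    return counts

# Made-up sample text standing in for the contents of one raw filing.
sample = "<XML><inf:crdNumber>12345</inf:crdNumber><figi>EXAMPLEFIGI</figi></XML>"
print(count_tags_in_text(sample))
```

If this search reports hits that the XPath-based check does not, the difference is usually a naming or capitalization variant worth adding to the parser.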

Below is a suggested notebook snippet that does exactly that: it scans each N-PX file, extracts the <XML> blocks, and counts how many fragments contain those tags. You can adapt it to check more tags, or store the counts in a dictionary for a fuller analysis.
2. Example: Investigating Coverage of <crdNumber>, <leiNumber>, <figi>

In [None]:
import os
import re
import lxml.etree as ET

def extract_xml_blocks(file_path):
    """
    Return a list of <XML> ... </XML> substrings from the file.
    """
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    pattern = re.compile(r"<XML>(.*?)</XML>", re.IGNORECASE | re.DOTALL)
    return pattern.findall(text)

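def parse_xml_fragment(xml_string):
    """
    Parse one fragment leniently; return the root element, or None on failure.
    """
    parser = ET.XMLParser(recover=True, encoding="utf-8")
    try:
        # Strip surrounding whitespace so an <?xml ...?> declaration, if present,
        # sits at the very start of the input, where XML requires it to be.
        root = ET.fromstring(xml_string.strip().encode("utf-8"), parser=parser)
        return root
    except ET.XMLSyntaxError as e:
        print(f"  [Warning] parse error: {e}")
        return None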

def investigate_tag_coverage(folder_path="npx_filings"):
    # Keep counters or logs
    crd_found = 0
    lei_found = 0
    figi_found = 0
    files_processed = 0

    all_files = os.listdir(folder_path)
    txt_files = [f for f in all_files if f.lower().endswith(".txt")]

    for fname in txt_files:
        file_path = os.path.join(folder_path, fname)
        xml_fragments = extract_xml_blocks(file_path)
        if not xml_fragments:
            continue

        files_processed += 1
        for frag in xml_fragments:
            root = parse_xml_fragment(frag)
            if root is None:
                continue

            # 1) Check for <crdNumber>
            crd_elems = root.xpath(".//*[local-name()='crdNumber']")
            if crd_elems:
                crd_found += 1

            # 2) Check for <leiNumber> or <leiNumberOM>
            lei_elems = root.xpath(".//*[local-name()='leiNumber' or local-name()='leiNumberOM']")
            if lei_elems:
                lei_found += 1

            # 3) Check for <figi>
            figi_elems = root.xpath(".//*[local-name()='figi']")
            if figi_elems:
                figi_found += 1

    print(f"Processed {files_processed} N-PX files.")
    print(f"  CRD tags found in {crd_found} XML fragments.")
    print(f"  LEI tags found in {lei_found} XML fragments.")
    print(f"  FIGI tags found in {figi_found} XML fragments.")

# Run the coverage check
investigate_tag_coverage("npx_filings")


How This Helps

    If you see “CRD tags found in 0 XML fragments”: No <crdNumber> tag appears under that exact name in the raw data, so the parser has nothing to extract. Unless the tag uses a different name or capitalization (local-name() matching is case-sensitive), the missing data is “real.”

    If you see “CRD tags found in 10 XML fragments” but your CSV has 0 CRD values: The parser code is skipping them. You’d add or fix the lines that extract <crdNumber>.
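
To make the CSV side of that comparison concrete, the number of rows with a non-empty value can be counted per column. This is a minimal sketch under stated assumptions: the helper name count_filled is hypothetical, and the column names lei_number and figi are guesses at what the parser's CSV uses (only crd_number appears in the parser code discussed here).

```python
import csv
import io

def count_filled(csv_file, columns):
    """Return {column: number of rows with a non-empty value} for the given columns."""
    filled = {col: 0 for col in columns}
    for row in csv.DictReader(csv_file):
        for col in columns:
            value = (row.get(col) or "").strip()
            # Treat a literal "nan" (as written by some CSV exporters) as empty.
            if value and value.lower() != "nan":
                filled[col] += 1
    return filled

# Tiny in-memory stand-in for the parser's CSV output (hypothetical columns and values).
sample_csv = io.StringIO(
    "crd_number,lei_number,figi\n"
    "12345,,\n"
    ",,\n"
    "67890,EXAMPLELEI,nan\n"
)
print(count_filled(sample_csv, ["crd_number", "lei_number", "figi"]))
```

For a real run, pass an open file handle instead of the in-memory sample, for example open("your_output.csv", newline="", encoding="utf-8"), where the file name is whatever your parser writes.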

3. Comparing the Parser Logic

Once you know a tag does appear in some fragments, check your parser code to ensure:

    XPath: You have a line like:

    crd = node.xpath(".//*[local-name()='crdNumber']/text()")

    or it appears in a fallback scenario.

    Assignment: The code takes that found text and puts it into the correct dictionary field, guarding against an empty result since xpath() returns a list (e.g., row_im["crd_number"] = crd[0].strip() if crd else "").

    Columns: That dictionary field gets written out to CSV. (In your final code, that’s typically in write_to_csv() or the DB insertion code.)

If you discover an alternate official tag name (e.g. <crdNumberAlt> or <leiNumberOM>), you’d add that to your parser logic in a fallback check.
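
A fallback check of this kind might look like the sketch below. It deliberately uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; the helper names local_name and first_text, the sample element, and the lei_number and figi keys are assumptions made for illustration, not part of the original parser.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def first_text(node, candidate_names):
    """Return the stripped text of the first element matching any candidate name.

    Candidates are tried in order, so put the preferred tag first.
    Returns "" when none of them is present or the matches have no text.
    Note: node.iter() also visits node itself, unlike the XPath './/*'.
    """
    for name in candidate_names:
        for elem in node.iter():
            if isinstance(elem.tag, str) and local_name(elem.tag) == name:
                text = (elem.text or "").strip()
                if text:
                    return text
    return ""

# Made-up fragment: the LEI is only available under the fallback tag name.
sample = ET.fromstring(
    '<manager xmlns="urn:example">'
    "<leiNumberOM> EXAMPLELEI </leiNumberOM>"
    "<crdNumber>12345</crdNumber>"
    "</manager>"
)
row_im = {
    "crd_number": first_text(sample, ["crdNumber"]),
    "lei_number": first_text(sample, ["leiNumber", "leiNumberOM"]),
    "figi": first_text(sample, ["figi"]),
}
print(row_im)
```

Keeping the candidate names in one ordered list per field makes it easy to add a newly discovered variant without touching the extraction logic.
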
4. What If the Tags Exist but Are Rare

Some tags, such as <figi>, may appear in only a small share of vote records (for example, fewer than 1%). If your coverage check shows a small number, it might mean:

    Only certain filers provide FIGI.

    The code works but you see mostly NaN or empty strings because most filers omit it.

That’s normal. The main thing is confirming your parser does handle it when it appears.
5. Next Steps

    Run the coverage snippet to see if the tags exist at all in your raw data.

    Update or confirm your parser logic if you find a mismatch.

    Spot-check a file that you suspect has missing CRD or LEI in your CSV. Open the raw text, find <crdNumber> or <leiNumber> lines, and ensure your code is reading them.

    Repeat for any other fields that might be in the official N-PX spec but missing in your CSV (like <icaOr13FFileNumber>, <otherFileNumber>, etc.).
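
Rather than adding one counter per tag, the check can be generalized into an inventory of every tag name seen, stored in a dictionary-like Counter. The sketch below is an illustration only: tag_inventory is a hypothetical helper, the sample fragments are made up, and it uses the strict standard-library parser, so fragments that lxml's recover=True mode would salvage are simply skipped here.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def tag_inventory(xml_strings):
    """Count how many fragments contain each local tag name at least once."""
    fragments_with_tag = Counter()
    for xml_string in xml_strings:
        try:
            root = ET.fromstring(xml_string.strip())
        except ET.ParseError:
            continue  # skip fragments the strict stdlib parser rejects
        # A set, so a tag repeated inside one fragment is counted once.
        names = {elem.tag.rsplit("}", 1)[-1] for elem in root.iter()}
        fragments_with_tag.update(names)
    return fragments_with_tag

# Made-up fragments; in practice, pass the output of extract_xml_blocks().
fragments = [
    "<a><crdNumber>1</crdNumber><crdNumber>2</crdNumber></a>",
    "<a><otherFileNumber>x</otherFileNumber></a>",
    "<a><broken></a>",
]
inventory = tag_inventory(fragments)
for name, count in inventory.most_common():
    print(f"{name}: {count}")
```

Comparing the keys of this inventory against your CSV columns gives a quick list of fields the parser may not be capturing.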

By doing this, you’ll know exactly whether your parser missed tags or if the raw data truly does not contain them.