Optimize xml_path() and xml_find_all() for large node sets #477

@astruebi

xml_path() and xml_find_all() can spend substantial time in wrapper-side work when operating on large node sets.

I have a local branch with two general optimizations, both intended to preserve existing semantics:

  1. Speed up xml_path() for large xml_nodesets by caching reusable ancestor paths within a single call.
  2. Speed up xml_find_all.xml_node() materialization by avoiding an unnecessary duplicate pass for single-context XPath node sets and reducing per-node R object construction overhead.
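
To illustrate the idea behind the first item, here is a minimal sketch of per-call ancestor path caching. It is written in Python with only the standard library, purely as an illustration; it is not the code on the branch and does not use xml2 or libxml2. The helper names (`build_parent_map`, `step`, `path_uncached`, `paths_cached`) are hypothetical, and the path format only approximates what xml_path() returns.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointer)."""
    return {child: parent for parent in root.iter() for child in parent}


def step(node, parent):
    """One path step, e.g. 'b[2]'; the index is omitted for a lone sibling."""
    if parent is None:
        return node.tag
    same = [sib for sib in parent if sib.tag == node.tag]
    if len(same) == 1:
        return node.tag
    position = next(i for i, sib in enumerate(same, 1) if sib is node)
    return f"{node.tag}[{position}]"


def path_uncached(node, parents):
    """Per-node variant: walks all the way to the root on every call."""
    parts = []
    while node is not None:
        parent = parents.get(node)
        parts.append(step(node, parent))
        node = parent
    return "/" + "/".join(reversed(parts))


def paths_cached(nodes, parents):
    """Batch variant: each ancestor path is built once per call and reused."""
    cache = {}  # element -> path; discarded when the call returns

    def path_of(node):
        # Climb until an already cached ancestor (or the root) is reached.
        chain = []
        cur = node
        while cur is not None and cur not in cache:
            chain.append(cur)
            cur = parents.get(cur)
        prefix = cache[cur] if cur is not None else ""
        # Extend the cached prefix downwards, caching every new level.
        for el in reversed(chain):
            prefix = prefix + "/" + step(el, parents.get(el))
            cache[el] = prefix
        return cache[node]

    return [path_of(n) for n in nodes]


root = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")
parents = build_parent_map(root)
nodes = list(root.iter())
batch = paths_cached(nodes, parents)

# The batch result must match the per-node result exactly.
assert batch == [path_uncached(n, parents) for n in nodes]
```

Because the cache lives only for the duration of one call, it cannot go stale if the document is modified between calls, which is one way to keep the existing semantics intact.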

Is this a direction you would be open to reviewing as a PR?

Isolated benchmark results from my local machine:

  • xml_path() on a synthetic node set with 124,001 nodes:
    • before: ~0.77s
    • after: ~0.07s
  • xml_path() on a large real-world XML bundle with 415,022 selected nodes:
    • before: ~2.43s
    • after: ~0.28s
  • xml_find_all() / XPath result materialization on large attribute result sets:
    • synthetic 300k attributes: roughly 1.5-2x faster in the materialization-heavy path
    • real-world XML bundle, //@*, 155,009 attributes: roughly 1.6-2x faster in isolated runs

The changes are not specific to any particular XML schema or downstream package. They target repeated ancestor path construction and large XPath result materialization.

I added regression tests comparing batch xml_path() output to per-node xml_path() output, and checking that single-context xml_find_all() XPath node sets remain unique.
