Optimize xml_path() and xml_find_all() for large node sets

`xml_path()` and `xml_find_all()` can spend substantial time in wrapper-side work when operating on large node sets.

I have a local branch with two general optimizations, both intended to preserve existing semantics:

1. Speed up `xml_path()` for large `xml_nodeset`s by caching reusable ancestor paths within a single call.
2. Speed up `xml_find_all.xml_node()` materialization by avoiding an unnecessary duplicate pass for single-context XPath node sets and reducing per-node R object construction overhead.
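
To illustrate the idea behind item 1, here is a minimal sketch of ancestor-path caching on a toy tree. It is not the code from my branch and does not touch xml2 internals. The `parent`/`step` vectors and the `paths_cached()` helper are hypothetical names made up for this example. Each ancestor prefix is built once per call and reused by every descendant in the requested set.

```r
# Toy tree (hypothetical data): node i has parent parent[i] (0 marks the root)
# and contributes the path step step[i].
parent <- c(0L, 1L, 2L, 2L, 1L)
step   <- c("root", "a", "b[1]", "b[2]", "c")

# Build paths for the requested node ids, caching every ancestor path
# for the duration of this single call.
paths_cached <- function(ids, parent, step) {
  cache <- character(length(parent))  # "" means "not computed yet"
  path_of <- function(i) {
    if (nzchar(cache[i])) return(cache[i])
    prefix <- if (parent[i] == 0L) "" else path_of(parent[i])
    cache[i] <<- paste0(prefix, "/", step[i])
    cache[i]
  }
  vapply(ids, path_of, character(1))
}

paths_cached(3:5, parent, step)
# "/root/a/b[1]" "/root/a/b[2]" "/root/c"
```

In this sketch the cache lives only inside one `paths_cached()` call, which mirrors the "within a single call" scope described above, so nothing can go stale if the tree changes between calls.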

Is this a direction you would be open to reviewing as a PR?

Isolated benchmark results from my local machine:

- `xml_path()` on a synthetic node set with 124,001 nodes:
  - before: ~0.77s
  - after: ~0.07s

- `xml_path()` on a large real-world XML bundle with 415,022 selected nodes:
  - before: ~2.43s
  - after: ~0.28s

- `xml_find_all()` / XPath result materialization on large attribute result sets:
  - synthetic 300k attributes: roughly 1.5-2x faster in the materialization-heavy path
  - real-world XML bundle, `//@*`, 155,009 attributes: roughly 1.6-2x faster in isolated runs
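
For anyone who wants to try a similar measurement, the following is a rough sketch of how such a timing can be set up. The `make_doc()` helper and its document shape are assumptions made for this example. It is not the synthetic document or the real-world bundle measured above, so absolute timings will differ.

```r
library(xml2)

# Hypothetical synthetic document: `n_items` <item> elements, each with an
# `id` attribute and a single <value> child.
make_doc <- function(n_items = 1000L) {
  idx <- seq_len(n_items)
  items <- sprintf('<item id="%d"><value>%d</value></item>', idx, idx)
  read_xml(paste0("<root>", paste(items, collapse = ""), "</root>"))
}

doc <- make_doc(1000L)

# Batch path construction over every element in the document.
nodes <- xml_find_all(doc, "//*")
path_time <- system.time(paths <- xml_path(nodes))

# Materialization of a large attribute result set.
attr_time <- system.time(attrs <- xml_find_all(doc, "//@*"))

print(path_time[["elapsed"]])
print(attr_time[["elapsed"]])
```

Raising `n_items` to a few hundred thousand gives node sets closer in size to the ones reported above.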

The changes are not specific to any particular XML schema or downstream package. They target repeated ancestor path construction and large XPath result materialization.

I added regression tests that compare batch `xml_path()` output against per-node `xml_path()` output, and that check that single-context `xml_find_all()` XPath node sets remain unique.
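
The checks are roughly of the following shape. This is a simplified sketch using base `stopifnot()` and a small inline document chosen for illustration, not the actual test code from the branch.

```r
library(xml2)

doc <- read_xml("<root><a><b/><b/></a><a><b/></a></root>")
nodes <- xml_find_all(doc, "//*")

# Batch xml_path() must match calling xml_path() on each node individually.
batch <- xml_path(nodes)
per_node <- vapply(
  seq_along(nodes),
  function(i) xml_path(nodes[[i]]),
  character(1)
)
stopifnot(identical(batch, per_node))

# A single-context XPath whose branches overlap must still yield unique nodes.
overlap <- xml_find_all(doc, "//b | //a/b")
stopifnot(anyDuplicated(xml_path(overlap)) == 0L)
```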

Optimize xml_path() and xml_find_all() for large node sets #477
