xml_path() and xml_find_all() can spend substantial time in wrapper-side work when operating on large node sets.
I have a local branch with two general optimizations, both intended to preserve existing semantics:
- Speed up xml_path() for large xml_nodesets by caching reusable ancestor paths within a single call.
- Speed up xml_find_all.xml_node() materialization by avoiding an unnecessary duplicate pass for single-context XPath node sets and reducing per-node R object construction overhead.
Is this a direction you would be open to reviewing as a PR?
Isolated benchmark results from my local machine:
- xml_path() on a synthetic node set with 124,001 nodes:
  - before: ~0.77s
  - after: ~0.07s
- xml_path() on a large real-world XML bundle with 415,022 selected nodes:
  - before: ~2.43s
  - after: ~0.28s
- xml_find_all() / XPath result materialization on large attribute result sets:
  - synthetic 300k attributes: roughly 1.5-2x faster in the materialization-heavy path
  - real-world XML bundle, //@*, 155,009 attributes: roughly 1.6-2x faster in isolated runs
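A minimal sketch of how timings of this kind could be reproduced. The document shape, the sizes, and the make_doc() helper are assumptions for illustration; this is not the data or the harness behind the numbers above.

```r
library(xml2)

# Hypothetical synthetic document: n_groups <group> elements, each holding
# n_items <item id="..."/> children. Sizes are illustrative only.
make_doc <- function(n_groups = 200, n_items = 100) {
  items <- paste0('<item id="', seq_len(n_items), '"/>', collapse = "")
  groups <- paste0(rep(paste0("<group>", items, "</group>"), n_groups),
                   collapse = "")
  read_xml(paste0("<root>", groups, "</root>"))
}

doc <- make_doc()
nodes <- xml_find_all(doc, "//*")   # every element node
attrs <- xml_find_all(doc, "//@*")  # every attribute node

# Wall-clock time for batch path construction and for attribute materialization.
path_time <- system.time(paths <- xml_path(nodes))[["elapsed"]]
find_time <- system.time(xml_find_all(doc, "//@*"))[["elapsed"]]
```

Single system.time() calls are noisy, so repeated runs (or a dedicated benchmarking package) give more trustworthy comparisons.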
The changes are not specific to any particular XML schema or downstream package. They target repeated ancestor path construction and large XPath result materialization.
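To illustrate the ancestor-path caching idea in isolation, here is a toy sketch in plain R. It is not the code in the branch: the tree encoding (parent and tag vectors), the helper names, and the simplified path format (every step gets a positional index, unlike libxml2's output) are all assumptions made for the example.

```r
# Toy tree: node i has parent parent[i] (0 = no parent) and a tag name.
parent <- c(0L, 1L, 1L, 2L, 2L, 3L)
tag    <- c("root", "a", "a", "b", "b", "b")

# 1-based position of node i among siblings that share its tag.
sibling_index <- function(i) {
  same <- which(parent == parent[i] & tag == tag[i])
  match(i, same)
}

# Build paths for the requested nodes, reusing any ancestor path that was
# already computed during this call.
paths_cached <- function(ids) {
  cache <- character(length(parent))  # "" means "not computed yet"
  path_of <- function(i) {
    if (nzchar(cache[i])) return(cache[i])
    prefix <- if (parent[i] == 0L) "" else path_of(parent[i])
    cache[i] <<- paste0(prefix, "/", tag[i], "[", sibling_index(i), "]")
    cache[i]
  }
  vapply(ids, path_of, character(1))
}

paths_cached(4:6)
```

Nodes 4 and 5 share the ancestor chain root/a[1], so that prefix is built once and reused; the cache lives only for the duration of one call, which matches the "within a single call" scope described above.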
I added regression tests comparing batch xml_path() output to per-node xml_path() output, and checking that single-context xml_find_all() XPath node sets remain unique.
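A rough sketch of what checks of this kind could look like, using base stopifnot() rather than any particular test framework. The sample document is made up for illustration, and uniqueness is approximated here by comparing xml_path() values; the actual tests in the branch may differ.

```r
library(xml2)

doc <- read_xml('<root><a id="1"><b/><b/></a><a id="2"><b/></a></root>')

# Batch xml_path() on a node set should match calling it node by node.
nodes <- xml_find_all(doc, "//*")
batch <- xml_path(nodes)
per_node <- vapply(seq_along(nodes),
                   function(i) xml_path(nodes[[i]]),
                   character(1))
stopifnot(identical(batch, per_node))

# A single-context XPath result should contain each node exactly once.
attrs <- xml_find_all(doc, "//@*")
stopifnot(length(attrs) == 2L)
stopifnot(anyDuplicated(xml_path(attrs)) == 0L)
```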