Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim all xml_text results from a nodeset at the same time #386

Closed
WerthPADOH opened this issue Mar 16, 2023 · 0 comments · Fixed by #400
Closed

Trim all xml_text results from a nodeset at the same time #386

WerthPADOH opened this issue Mar 16, 2023 · 0 comments · Fixed by #400
Labels
feature a feature request or enhancement

Comments

@WerthPADOH
Copy link

xml_text can take a long time when used with a large nodeset when trim is TRUE, and most of that time is spent calling sub twice for each node. One experience I had was getting the text for a nodeset with about 4.5 million nodes, where Rprof showed that calls to sub made up 72% of the nearly two minutes spent in xml_text.

I'll happily make a pull request if you'd like

Here's a brief demo showing time saved:

library(xml2)
library(microbenchmark)

blob <- paste0(c("<x>", rep("<y> Hi </y>", 1000), "</x>"), collapse = "")
tree <- read_xml(blob)
y_tags <- xml_find_all(tree, "//y")

xml_text_sub_after <- function(x, trim = FALSE) {
  res <- vapply(x, xml_text, trim = FALSE, FUN.VALUE = character(1))
  if (isTRUE(trim)) {
      res <- sub("^[[:space:] ]+", "", res)
      res <- sub("[[:space:] ]+$", "", res)
  }
  res
}

microbenchmark(
  as_is = xml_text(y_tags, trim = TRUE),
  proposed = xml_text_sub_after(y_tags, trim = TRUE),
  check = "identical"
)
# Unit: milliseconds
#      expr     min      lq      mean   median       uq     max neval
#     as_is 35.2209 35.9433 37.787917 36.64865 38.15000 58.4016   100
#  proposed  6.7719  6.9522  7.625734  7.16715  7.40195 16.2116   100
@hadley hadley added upkeep maintenance, infrastructure, and similar feature a feature request or enhancement and removed upkeep maintenance, infrastructure, and similar labels Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants