print.xml_nodeset very slow for document with one huge node #366

Closed
MichaelChirico opened this issue Jun 3, 2022 · 1 comment · Fixed by #413
Labels
feature a feature request or enhancement

Comments

MichaelChirico commented Jun 3, 2022

Found in the XML representation of an edge case R file:

library(xml2)
library(xmlparsedata)

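# keep.source = TRUE so that xml_parse_data() has source references in non-interactive sessions too
p = parse("https://raw.githubusercontent.com/mwaldstein/edgarWebR/fb9a38e6a57186ffd1c93cc1aa00c4fdf1bc5514/tests/cache/browse-edgar-11457c.R", keep.source = TRUE)
xml = read_xml(xml_parse_data(p))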

Printing this is painfully slow:

system.time(print(xml))
# {xml_document}
# <exprlist>
# [1] <expr line1="1" col1="1" line2="5944" col2="43" start="145" end="855979">\n  <expr line1="1" col1="1" line2="1" col2="9" start="145" end="153">\n    <SYMBOL_FUNCTION_CALL li ...
#    user  system elapsed 
#   2.906   0.048   2.958 

From a brief look, encodeString() appears to be the culprit:

# ** debugging inside show_nodes() **
system.time(vapply(x, as.character, FUN.VALUE = character(1)))
#    user  system elapsed 
#   0.248   0.017   0.268 
system.time(encodeString(vapply(x, as.character, FUN.VALUE = character(1))))
#    user  system elapsed 
#   2.959   0.024   3.007
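
For anyone who wants to see the effect without downloading the file, here is a rough synthetic sketch. The string `big` is a made-up stand-in for one huge serialized node, not the document from this issue, and `width` is an assumed console width. It only illustrates that encodeString() does work proportional to the whole string even when just a short prefix will be shown; actual timings will vary by machine.

```r
# Synthetic stand-in for one huge serialized node (NOT the file from the issue).
big <- strrep("<expr line1=\"1\" col1=\"1\">\n  <SYMBOL>x</SYMBOL>\n</expr>\n", 1e5)
width <- 180L  # assumed console width

# Encode the whole string versus only the prefix that could ever be displayed.
t_full   <- system.time(enc_full   <- encodeString(big))
t_prefix <- system.time(enc_prefix <- encodeString(substring(big, 1, width)))

nchar(big)
c(full = t_full[["elapsed"]], prefix = t_prefix[["elapsed"]])
```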

Is it possible to apply substr() twice -- once after as.character(), then again after encodeString()?

chr = vapply(x, as.character, FUN.VALUE = character(1))
nchar(chr)
# [1] 18965721

This is clearly already way too wide (width = 180 for me).

I believe we can always just apply

x %>%
  substring(1, n) %>%
  encodeString() %>%
  substring(1, n)

since the default behavior of encodeString() is simply to add \ escapes to non-printable characters, which makes its output a weakly wider version of the input.
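
A minimal base-R sketch of that idea, for illustration only. `truncate_then_encode` is a hypothetical name and not an xml2 function, and the sketch assumes encodeString() is called with its defaults (no quoting, no padding). The check at the end compares it against encoding the full string and truncating afterwards.

```r
# Hypothetical helper (not part of xml2): truncate, encode, then truncate again.
truncate_then_encode <- function(x, n) {
  short <- substring(x, 1, n)  # cheap: drop everything that cannot be shown
  enc <- encodeString(short)   # escaping now touches at most n characters
  substring(enc, 1, n)         # escapes may have widened the string; trim again
}

# Small input with non-printable characters near the start and a long tail.
x <- paste0("a\nb\tc", strrep("x", 1e5))
n <- 4L

# Same result as the slow path that encodes everything first.
identical(truncate_then_encode(x, n), substring(encodeString(x), 1, n))
```

The first substring() is what saves time; the second one is needed because each escaped character takes up more than one column after encoding.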

Happy to file a PR if that sounds good.

MichaelChirico changed the title from "print.xml_nodeset very slow for huge document" to "print.xml_nodeset very slow for document with one huge node" on Jun 3, 2022
hadley added the "feature" (a feature request or enhancement) label on Oct 30, 2023
hadley commented Oct 30, 2023

A PR would be great if you still care about this problem 😄
