print(xml_document) does not scale with large documents #441

Open
zeehio opened this issue Mar 12, 2024 · 0 comments

When I load a fairly large XML file (~500 MB) and print() the document, it takes a long time and cannot be interrupted.

However, printing the child nodes individually is fast.

I believe the reprex below eventually calls show_nodes, which calls as.character on every node in the following line. That call takes a long time and blocks the interpreter.

contents <- vapply(x, as.character, FUN.VALUE = character(1L))
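A minimal sketch of why that line might be expensive, assuming as.character() on a node serializes its whole subtree while format() only builds the opening tag (as the reprex output suggests). The make_doc() helper and the synthetic <list>/<item> document are hypothetical stand-ins, not Cellosaurus data:

```r
library(xml2)

# Hypothetical helper: build a synthetic document with one <list> element
# that holds n <item> children.
make_doc <- function(n) {
  items <- paste0("<item id=\"", seq_len(n), "\">value</item>", collapse = "")
  read_xml(paste0("<root><list>", items, "</list></root>"))
}

doc <- make_doc(10000)
list_node <- xml_child(doc, 1)

# format() only describes the element itself (name and attributes).
opening_tag <- format(list_node)

# as.character() writes out the element and everything below it, so its
# cost grows with the size of the subtree.
full_text <- as.character(list_node)

nchar(opening_tag)
nchar(full_text)
```

If this assumption holds, truncating to the first 20 nodes does not help when one of those nodes is itself a very large subtree, because the full string is built before it is shortened for display.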

library(xml2)
# Download 490 MB:
if (!file.exists("cellosaurus.xml")) {
  download.file(
    "https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml",
    "cellosaurus.xml"
  )
}
# Read XML:
cellosaurus_xml <- xml2::read_xml("cellosaurus.xml")

# My print (a fast version, closer to what I would expect)

cat(format(cellosaurus_xml))
#> <Cellosaurus>
# xml_children() is exported, so `::` is enough; show_nodes() is internal.
children <- xml2::xml_children(cellosaurus_xml)
for (child in children) {
  cat(format(child), "\n")
  xml2:::show_nodes(xml2::xml_children(child))
}
#> <header> 
#> [1] <terminology-name>Cellosaurus</terminology-name>
#> [2] <description>Cellosaurus: a controlled vocabulary of cell lines</descript ...
#> [3] <release version="48.0" updated="2024-01-30" nb-cell-lines="152231" nb-pu ...
#> [4] <terminology-list>\n  <terminology name="NCBI-Taxonomy" source="National  ...
#> <cell-line-list> 
#>  [1] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [2] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [3] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#>  [4] <cell-line category="Hybridoma" created="2017-08-22" last-updated="2023- ...
#>  [5] <cell-line category="Cancer cell line" created="2017-05-15" last-updated ...
#>  [6] <cell-line category="Hybridoma" created="2012-06-06" last-updated="2023- ...
#>  [7] <cell-line category="Hybridoma" created="2014-07-17" last-updated="2023- ...
#>  [8] <cell-line category="Hybridoma" created="2022-12-15" last-updated="2023- ...
#>  [9] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#> [10] <cell-line category="Hybridoma" created="2013-02-11" last-updated="2023- ...
#> [11] <cell-line category="Cancer cell line" created="2018-05-14" last-updated ...
#> [12] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [13] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [14] <cell-line category="Finite cell line" created="2013-11-05" last-updated ...
#> [15] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [16] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [17] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [18] <cell-line category="Spontaneously immortalized cell line" created="2019 ...
#> [19] <cell-line category="Transformed cell line" created="2021-12-16" last-up ...
#> [20] <cell-line category="Cancer cell line" created="2024-01-30" last-updated ...
#> ...
#> <publication-list> 
#>  [1] <publication date="2005" type="article" journal-name="AAPS J." volume="7 ...
#>  [2] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [3] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [4] <publication date="2016" type="article" journal-name="AAPS J." volume="1 ...
#>  [5] <publication date="2000" type="article" journal-name="AAPS PharmSci" vol ...
#>  [6] <publication date="2004" type="article" journal-name="AAPS PharmSci" vol ...
#>  [7] <publication date="2008" type="article" journal-name="ACS Chem. Biol." v ...
#>  [8] <publication date="2014" type="article" journal-name="ACS Chem. Biol." v ...
#>  [9] <publication date="2018" type="article" journal-name="ACS Infect. Dis."  ...
#> [10] <publication date="2023" type="article" journal-name="ACS Materials Au"  ...
#> [11] <publication date="2022" type="article" journal-name="ACS Omega" volume= ...
#> [12] <publication date="2017" type="article" journal-name="ACS Synth. Biol."  ...
#> [13] <publication date="2001" type="article" journal-name="Acta Astronaut." v ...
#> [14] <publication date="2013" type="article" journal-name="Acta Astronaut." v ...
#> [15] <publication date="2005" type="article" journal-name="Acta Biochim. Biop ...
#> [16] <publication date="2004" type="article" journal-name="Acta Biochim. Pol. ...
#> [17] <publication date="1988" type="article" journal-name="Acta Biol. Hung."  ...
#> [18] <publication date="2015" type="article" journal-name="Acta Biol. Hung."  ...
#> [19] <publication date="2016" type="article" journal-name="Acta Crystallogr.  ...
#> [20] <publication date="2001" type="article" journal-name="Acta Cytol." volum ...
#> ...
#> <copyright>

# This is extremely slow, and non-interruptible:
# print(cellosaurus_xml)

Created on 2024-03-12 with reprex v2.1.0

Is this expected? Or should the print() function scale better with larger XML files?
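For illustration, here is a rough sketch of a summary that never serializes a subtree. print_shallow() is a hypothetical helper, not part of xml2, and is not meant as a proposed patch; it only uses format(), xml_children() and xml_length(), so its cost should depend on the number of top-level children rather than on their size:

```r
library(xml2)

# Hypothetical helper (not an xml2 function): show the opening tag and the
# number of child elements for up to max_n top-level children.
print_shallow <- function(doc, max_n = 20) {
  children <- xml_children(doc)
  n <- length(children)
  shown <- utils::head(seq_len(n), max_n)
  lines <- vapply(shown, function(i) {
    child <- children[[i]]
    # format() gives only the opening tag; xml_length() counts child elements.
    sprintf("[%d] %s (%d children)", i, format(child), xml_length(child))
  }, FUN.VALUE = character(1L))
  cat(format(doc), "\n", sep = "")
  cat(lines, sep = "\n")
  if (n > max_n) cat("...\n")
  invisible(lines)
}

# Small in-memory example document.
small_doc <- read_xml("<root><a><b/><b/></a><c/></root>")
shallow_lines <- print_shallow(small_doc)
```

The trade-off is that no node content is shown at all, only the element names, attributes and child counts.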
