How to ignore "insignificant whitespace"? #49
Comments
I don't think this is actually a problem. While it is true the whitespace is insignificant in terms of the XML tree, text nodes are a separate entity in libxml2.
The error you are getting is because you used write() rather than write_xml():
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'
write_xml(cd, "cd_catalog.xml")
We could add an option to
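For reference, a minimal round-trip sketch of the write_xml() call shown above. The inline document and the temporary file path are stand-ins chosen for illustration, not the cd_catalog.xml file from the thread.

```r
library(xml2)

# Hypothetical inline document standing in for cd_catalog.xml
doc <- read_xml("<CATALOG><CD><TITLE>Empire Burlesque</TITLE></CD></CATALOG>")

# write_xml() serialises an xml2 document; base write() cannot handle it
path <- tempfile(fileext = ".xml")
write_xml(doc, path)

# Read the file back and compare the element text
back <- read_xml(path)
titles_before <- xml_text(xml_find_all(doc, "//TITLE"))
titles_after  <- xml_text(xml_find_all(back, "//TITLE"))
round_trip_ok <- identical(titles_before, titles_after)
```

This only checks that element text survives the round trip; it says nothing about how whitespace-only text nodes are serialised.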
I think whether whitespace text is significant depends on the element. It could be specified in the XML schema with a whitespace facet (see Restrictions on Whitespace Characters in http://www.w3schools.com/xml/schema_facets.asp), but I don't believe there is any way to know de novo whether whitespace within a node is significant.
Sorry about the
Yeah, I guess all of my confusion is indeed around what happens when you coerce to a list. I can't tell whether these whitespace nodes really exist or not. They don't influence length and the XPath queries below don't get them. But they show up in the result of as_list().
library(magrittr) # provides %>%, which the code below relies on
library(XML)
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb_XML <- cd_XML %>%
  xpathSApply("/CATALOG") %>%
  .[[1]] %>%
  xmlChildren() %>%
  .[[1]])
#> <CD>
#>   <TITLE>Empire Burlesque</TITLE>
#>   <ARTIST>Bob Dylan</ARTIST>
#>   <COUNTRY>USA</COUNTRY>
#>   <COMPANY>Columbia</COMPANY>
#>   <PRICE>10.90</PRICE>
#>   <YEAR>1985</YEAR>
#> </CD>
length(xmlChildren(eb_XML))
#> [1] 6
length(xmlToList(eb_XML))
#> [1] 6
cd_XML %>%
  xpathSApply("//CD/*[1]", xmlValue) %>%
  unlist() %>%
  head()
#> [1] "Empire Burlesque" "Hide your heart" "Greatest Hits"
#> [4] "Still got the blues" "Eros" "One night only"
cd_XML %>%
  xmlToList() %>%
  vapply(`[[`, character(1), 1) %>%
  head()
#> CD CD CD
#> "Empire Burlesque" "Hide your heart" "Greatest Hits"
#> CD CD CD
#> "Still got the blues" "Eros" "One night only"
library(xml2)
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb <- cd %>%
  xml_find_one("/CATALOG") %>%
  xml_children() %>%
  .[[1]])
#> {xml_node}
#> <CD>
#> [1] <TITLE>Empire Burlesque</TITLE>
#> [2] <ARTIST>Bob Dylan</ARTIST>
#> [3] <COUNTRY>USA</COUNTRY>
#> [4] <COMPANY>Columbia</COMPANY>
#> [5] <PRICE>10.90</PRICE>
#> [6] <YEAR>1985</YEAR>
length(xml_children(eb))
#> [1] 6
length(as_list(eb))
#> [1] 13
cd %>%
  xml_find_all("//CD/*[1]") %>%
  xml_text() %>%
  head()
#> [1] "Empire Burlesque" "Hide your heart" "Greatest Hits"
#> [4] "Still got the blues" "Eros" "One night only"
cd %>%
  as_list() %>%
  vapply(`[[`, character(1), 1) %>%
  head()
#> CD CD CD
#> "\n " "\n " "\n " "\n " "\n " "\n "
@jennybc You can select the text nodes explicitly with an XPath expression.
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_text(cd_text_nodes)
#> [1] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [8] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [15] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [22] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [29] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [36] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [43] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [50] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [57] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [64] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [71] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [78] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [85] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [92] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [99] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [106] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [113] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [120] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [127] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [134] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [141] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [148] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [155] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [162] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [169] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
#> [176] "\n " "\n " "\n " "\n " "\n " "\n " "\n "
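The same kind of XPath selection can be narrowed to whitespace-only text nodes, which can then be dropped. This is a sketch under assumptions, not a recommended fix: it uses a small inline document in place of the catalog, assumes a version of xml2 whose read_xml() takes an options argument and which provides xml_remove(), and passes options = "NONET" only as a harmless option set that leaves out NOBLANKS, so the blank nodes are kept at parse time.

```r
library(xml2)

# Hypothetical pretty-printed document standing in for the catalog
x <- "<CATALOG>\n  <CD>\n    <TITLE>Empire Burlesque</TITLE>\n  </CD>\n</CATALOG>"

# Assumption: any option set without NOBLANKS keeps blank text nodes
doc <- read_xml(x, options = "NONET")

# normalize-space() collapses whitespace, so blank nodes compare equal to ''
blank <- xml_find_all(doc, "//text()[normalize-space(.) = '']")
n_blank <- length(blank)

# Detach the blank nodes from the tree, then see which text nodes remain
xml_remove(blank)
remaining <- xml_text(xml_find_all(doc, "//text()"))
```

Whether removing these nodes is safe depends on the document: in mixed content, whitespace between elements can be meaningful.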
But according to the XML package:
library(XML)
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_XML %>%
xpathSApply("//CD/text()", xmlValue)
#> list()
BTW is "text node" some sort of special vocabulary? I think of all of those nodes (title, artist, etc.) as text nodes. I'm so confused.
title, artist, etc. are 'element' nodes (see http://www.w3schools.com/xml/dom_nodetype.asp for all the types). You can see the node type for any node with xml_type():
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_type(cd_text_nodes)[1:6]
#> [1] "text" "text" "text" "text" "text" "text"
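The element/text distinction may be easier to see on a tiny mixed-content example. The sketch below uses a made-up inline document (not the catalog) and compares xml_children(), which returns element children only, with xml_contents(), which returns every child node.

```r
library(xml2)

# Hypothetical mixed-content document: text, an element, then more text
doc <- read_xml("<p>Hello <b>bold</b> world</p>")

kids      <- xml_children(doc)   # element nodes only
all_nodes <- xml_contents(doc)   # element and text nodes
types     <- xml_type(all_nodes)
```

Here the "Hello " and " world" pieces are text nodes and b is an element node. In this example the title, artist, etc. elements from the catalog would play the role of b, with their own text nodes nested inside them.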
FWIW
and apparently it is ON by default. Maybe
Yeah, that sounds reasonable to me. Could incorporate into #85
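For readers arriving later: a sketch of how such an option can look, assuming a version of xml2 whose read_xml() accepts an options argument of libxml2 parser flags, with NOBLANKS among them. The inline document is a stand-in, and "NONET" is used only as a harmless option set that omits NOBLANKS.

```r
library(xml2)

# Hypothetical pretty-printed record standing in for one <CD> entry
x <- "<CD>\n  <TITLE>Empire Burlesque</TITLE>\n  <ARTIST>Bob Dylan</ARTIST>\n</CD>"

# Without NOBLANKS the whitespace between elements stays as text nodes
keep_blanks <- read_xml(x, options = "NONET")

# With NOBLANKS libxml2 drops blank text nodes while parsing
drop_blanks <- read_xml(x, options = "NOBLANKS")

n_keep <- length(xml_contents(keep_blanks))
n_drop <- length(xml_contents(drop_blanks))
```

xml_children() gives the same two elements in both cases; only the count of all child nodes differs.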
I'm using the term "insignificant whitespace" as defined in What You Need to Know About Whitespace in XML.
Consider XML that has been formatted for human eyeballs. xml2 can read it w/o error and well-formed XPath expressions do what one expects. But once you use as_list() or try to write it back out with write_xml(), you learn there are problems with whitespace.
Targeted queries work fine:
But as_list() reveals some problems
and indeed you can't invert things with write_xml()