How to ignore "insignificant whitespace"? #49

Closed
jennybc opened this Issue Aug 26, 2015 · 9 comments

jennybc commented Aug 26, 2015

I'm using the term "insignificant whitespace" as defined in What You Need to Know About Whitespace in XML:

Insignificant whitespace is used when editing XML documents for readability. These whitespaces are typically not intended for inclusion in the delivery of the document.

Consider XML that has been formatted for human eyeballs. xml2 can read it without error, and well-formed XPath expressions do what one expects. But once you use as_list() or try to write it back out with write_xml(), you learn there are problems with whitespace.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")

Targeted queries work fine:

xml_find_all(cd, ".//TITLE")
#> {xml_nodeset (26)}
#>  [1] <TITLE>Empire Burlesque</TITLE>
#>  [2] <TITLE>Hide your heart</TITLE>
#>  [3] <TITLE>Greatest Hits</TITLE>
#>  [4] <TITLE>Still got the blues</TITLE>
#>  [5] <TITLE>Eros</TITLE>
#>  [6] <TITLE>One night only</TITLE>
#>  [7] <TITLE>Sylvias Mother</TITLE>
#>  [8] <TITLE>Maggie May</TITLE>
#>  [9] <TITLE>Romanza</TITLE>
#> [10] <TITLE>When a man loves a woman</TITLE>
#> [11] <TITLE>Black angel</TITLE>
#> [12] <TITLE>1999 Grammy Nominees</TITLE>
#> [13] <TITLE>For the good times</TITLE>
#> [14] <TITLE>Big Willie style</TITLE>
#> [15] <TITLE>Tupelo Honey</TITLE>
#> [16] <TITLE>Soulsville</TITLE>
#> [17] <TITLE>The very best of</TITLE>
#> [18] <TITLE>Stop</TITLE>
#> [19] <TITLE>Bridge of Spies</TITLE>
#> [20] <TITLE>Private Dancer</TITLE>
#> ...

But as_list() reveals some problems:

str(as_list(cd), max.level = 1, list.len = 5)
#> List of 53
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>   [list output truncated]

and indeed you can't invert things with write_xml():

write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'

jimhester commented May 13, 2016

I don't think this is actually a problem. While it is true that the whitespace is insignificant in terms of the XML tree, text nodes are a separate entity in libxml2, so as_list() looks like it is doing the proper thing to me.

The error you are getting is because you used base::write() instead of write_xml(); the latter works without error on this example.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'
write_xml(cd, "cd_catalog.xml")

hadley commented May 13, 2016

We could add an option to as_list() to drop insignificant whitespace if that would help. (Assuming it's not too difficult to determine which whitespace is insignificant)
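
A rough sketch of what such an option might do, written as a post-processing step on the list rather than as a change to as_list() itself. The helper name drop_blank_text() and the hand-built cd_list are hypothetical, and the sketch assumes every whitespace-only string is insignificant, which may not hold for every document.

```r
# Hypothetical helper: recursively drop list entries that are a single
# whitespace-only string, keeping everything else in order.
drop_blank_text <- function(x) {
  if (!is.list(x)) {
    return(x)
  }
  is_blank <- vapply(
    x,
    function(el) is.character(el) && length(el) == 1 && !nzchar(trimws(el)),
    logical(1)
  )
  lapply(x[!is_blank], drop_blank_text)
}

# Hand-built stand-in shaped like the as_list() output shown earlier
# (whitespace strings interleaved with named element entries).
cd_list <- list(
  "\n    ",
  TITLE = list("Empire Burlesque"),
  "\n    ",
  ARTIST = list("Bob Dylan"),
  "\n  "
)

cleaned <- drop_blank_text(cd_list)
names(cleaned)
```

Subsetting with `[` keeps names but drops other R attributes, so any attributes on the input list are not carried over by this sketch.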

jimhester commented May 13, 2016

I think it is element dependent as to whether whitespace is significant or not. It could be specified in the XML schema with a whitespace facet (see Restrictions on Whitespace Characters in http://www.w3schools.com/xml/schema_facets.asp), but I don't believe there is any way to know de novo if whitespace within a node is significant or not.

jennybc commented May 13, 2016

Sorry about the write() vs write_xml() mixup. My bad.

Yeah I guess all of my confusion is indeed around what happens when you coerce to a list. I can't tell if these whitespace nodes really exist or not? They don't influence length and the XPath queries below don't get them. But they show up in the result of xml2::as_list(). Which is different from XML::xmlToList().

library(XML)
library(magrittr)  # provides the %>% pipe used below
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb_XML <- cd_XML %>% 
  xpathSApply("/CATALOG") %>% 
  .[[1]] %>% 
  xmlChildren() %>% 
  .[[1]])
#> <CD>
#>   <TITLE>Empire Burlesque</TITLE>
#>   <ARTIST>Bob Dylan</ARTIST>
#>   <COUNTRY>USA</COUNTRY>
#>   <COMPANY>Columbia</COMPANY>
#>   <PRICE>10.90</PRICE>
#>   <YEAR>1985</YEAR>
#> </CD>
length(xmlChildren(eb_XML))
#> [1] 6
length(xmlToList(eb_XML))
#> [1] 6
cd_XML %>%
  xpathSApply("//CD/*[1]", xmlValue) %>% 
  unlist() %>% 
  head()
#> [1] "Empire Burlesque"    "Hide your heart"     "Greatest Hits"      
#> [4] "Still got the blues" "Eros"                "One night only"
cd_XML %>%
  xmlToList() %>% 
  vapply(`[[`, character(1), 1) %>% 
  head()
#>                    CD                    CD                    CD 
#>    "Empire Burlesque"     "Hide your heart"       "Greatest Hits" 
#>                    CD                    CD                    CD 
#> "Still got the blues"                "Eros"      "One night only"

library(xml2)
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb <- cd %>%
  xml_find_one("/CATALOG") %>% 
  xml_children() %>% 
  .[[1]])
#> {xml_node}
#> <CD>
#> [1] <TITLE>Empire Burlesque</TITLE>
#> [2] <ARTIST>Bob Dylan</ARTIST>
#> [3] <COUNTRY>USA</COUNTRY>
#> [4] <COMPANY>Columbia</COMPANY>
#> [5] <PRICE>10.90</PRICE>
#> [6] <YEAR>1985</YEAR>
length(xml_children(eb))
#> [1] 6
length(as_list(eb))
#> [1] 13
cd %>%
  xml_find_all("//CD/*[1]") %>% 
  xml_text() %>% 
  head()
#> [1] "Empire Burlesque"    "Hide your heart"     "Greatest Hits"      
#> [4] "Still got the blues" "Eros"                "One night only"
cd %>%
  as_list() %>% 
  vapply(`[[`, character(1), 1) %>% 
  head()
#>                CD                CD                CD 
#>   "\n  " "\n    "   "\n  " "\n    "   "\n  " "\n    "

jimhester commented May 13, 2016

@jennybc You can select the text nodes explicitly with an XPath expression.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_text(cd_text_nodes)
#>   [1] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>   [8] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [15] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [22] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [29] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [36] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [43] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [50] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [57] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [64] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [71] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [78] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [85] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [92] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [99] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [106] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [113] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [120] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [127] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [134] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [141] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [148] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [155] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [162] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [169] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [176] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "
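
Building on the text() query above, an XPath predicate can separate whitespace-only text nodes from text nodes that carry content. This is a minimal sketch on an inline stand-in document (not the real catalog). It assumes an xml2 version whose read_xml() takes an options argument; an option other than "NOBLANKS" is passed so that the whitespace nodes are kept and there is something to select.

```r
library(xml2)

# Inline stand-in for one catalog record, indented for readability.
txt <- paste0(
  "<CD>\n",
  "  <TITLE>Empire Burlesque</TITLE>\n",
  "  <ARTIST>Bob Dylan</ARTIST>\n",
  "</CD>"
)

# Assumption: `options` is available; "RECOVER" has no effect on
# well-formed input, it only avoids asking for "NOBLANKS".
doc <- read_xml(txt, options = "RECOVER")

# normalize-space() trims and collapses whitespace, so an empty result
# means the text node held nothing but whitespace.
blank_text   <- xml_find_all(doc, "//text()[normalize-space(.) = '']")
content_text <- xml_find_all(doc, "//text()[normalize-space(.) != '']")

length(blank_text)
xml_text(content_text)
```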

jennybc commented May 13, 2016

😶

But according to the XML package, there are no such nodes?

library(XML)
library(magrittr)  # provides the %>% pipe used below
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_XML %>%
  xpathSApply("//CD/text()", xmlValue)
#> list()

BTW is "text node" some sort of special vocabulary? I think of all of those nodes (title, artist, etc.) as text nodes. I'm so confused.

jimhester commented May 13, 2016

title, artist, etc. are 'element' nodes (see http://www.w3schools.com/xml/dom_nodetype.asp for all the types).

You can see the node type for any node with xml_type().

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_type(cd_text_nodes)[1:6]
#> [1] "text" "text" "text" "text" "text" "text"
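
To make the element/text distinction concrete, here is a small sketch that compares xml_children() (element children only) with xml_contents() (all child nodes) on one record. The inline document is a stand-in for the catalog, and the options value is an assumption: it presumes an xml2 version where read_xml() has an options argument and is there only to keep whitespace-only text nodes, as in the outputs above.

```r
library(xml2)

# Inline stand-in for a single indented record; not the real file.
txt <- paste0(
  "<CATALOG>\n",
  "  <CD>\n",
  "    <TITLE>Empire Burlesque</TITLE>\n",
  "    <ARTIST>Bob Dylan</ARTIST>\n",
  "  </CD>\n",
  "</CATALOG>"
)

# Assumption: any option set without "NOBLANKS" keeps the indentation nodes.
doc <- read_xml(txt, options = "RECOVER")
cd_node <- xml_find_first(doc, "//CD")

n_elements <- length(xml_children(cd_node))   # TITLE and ARTIST
n_nodes    <- length(xml_contents(cd_node))   # elements plus text nodes
node_types <- xml_type(xml_contents(cd_node)) # type of each child, in order

n_elements
n_nodes
node_types
```

The indentation shows up as "text" entries interleaved with the "element" entries, which would explain why the list from as_list() is longer than xml_children() suggests.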

jennybc commented May 13, 2016

FWIW, the XML package uses the NOBLANKS option to:

Remove text nodes that are made entirely of white space between nodes. This can be used to remove formatting content that is used for indenting nodes.

and apparently it is ON by default. Maybe xml2 should do the same?
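
For anyone wanting to see the effect of this parser option from xml2, the following is a hedged sketch. It assumes an xml2 version in which read_xml() accepts an options argument that understands "NOBLANKS"; the inline document and the count_cd_text() helper are made up for illustration.

```r
library(xml2)

# Inline stand-in for one indented record; not the real catalog.
txt <- paste0(
  "<CATALOG>\n",
  "  <CD>\n",
  "    <TITLE>Empire Burlesque</TITLE>\n",
  "  </CD>\n",
  "</CATALOG>"
)

# Same input parsed twice: once without and once with NOBLANKS.
# "RECOVER" is used only as a way of not requesting "NOBLANKS".
with_blanks <- read_xml(txt, options = "RECOVER")
no_blanks   <- read_xml(txt, options = "NOBLANKS")

# Hypothetical helper: count text nodes that are direct children of <CD>.
count_cd_text <- function(doc) length(xml_find_all(doc, "//CD/text()"))

n_with    <- count_cd_text(with_blanks)
n_without <- count_cd_text(no_blanks)

c(with_blanks = n_with, no_blanks = n_without)
```

NOBLANKS relies on the parser's own heuristics, so whitespace in mixed content may still be kept; this sketch only covers the simple indentation case.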

hadley commented May 13, 2016

Yeah, that sounds reasonable to me. Could incorporate into #85

jimhester added a commit to jimhester/xml2 that referenced this issue May 17, 2016

jimhester added a commit to jimhester/xml2 that referenced this issue May 18, 2016

jimhester added a commit to jimhester/xml2 that referenced this issue May 19, 2016

jimhester closed this in 1477806 on May 19, 2016
