How to ignore "insignificant whitespace"? #49

Closed
jennybc opened this issue Aug 26, 2015 · 9 comments

Comments

@jennybc
Member

jennybc commented Aug 26, 2015

I'm using the term "insignificant whitespace" as defined in What You Need to Know About Whitespace in XML:

Insignificant whitespace is used when editing XML documents for readability. These whitespaces are typically not intended for inclusion in the delivery of the document.

Consider XML that has been formatted for human eyeballs. xml2 can read it without error, and well-formed XPath expressions do what one expects. But once you use as_list() or try to write it back out with write_xml(), you learn there are problems with whitespace.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")

Targeted queries work fine:

xml_find_all(cd, ".//TITLE")
#> {xml_nodeset (26)}
#>  [1] <TITLE>Empire Burlesque</TITLE>
#>  [2] <TITLE>Hide your heart</TITLE>
#>  [3] <TITLE>Greatest Hits</TITLE>
#>  [4] <TITLE>Still got the blues</TITLE>
#>  [5] <TITLE>Eros</TITLE>
#>  [6] <TITLE>One night only</TITLE>
#>  [7] <TITLE>Sylvias Mother</TITLE>
#>  [8] <TITLE>Maggie May</TITLE>
#>  [9] <TITLE>Romanza</TITLE>
#> [10] <TITLE>When a man loves a woman</TITLE>
#> [11] <TITLE>Black angel</TITLE>
#> [12] <TITLE>1999 Grammy Nominees</TITLE>
#> [13] <TITLE>For the good times</TITLE>
#> [14] <TITLE>Big Willie style</TITLE>
#> [15] <TITLE>Tupelo Honey</TITLE>
#> [16] <TITLE>Soulsville</TITLE>
#> [17] <TITLE>The very best of</TITLE>
#> [18] <TITLE>Stop</TITLE>
#> [19] <TITLE>Bridge of Spies</TITLE>
#> [20] <TITLE>Private Dancer</TITLE>
#> ...

But as_list() reveals some problems:

str(as_list(cd), max.level = 1, list.len = 5)
#> List of 53
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>   [list output truncated]

and indeed you can't invert things with write_xml()

write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'
@jimhester
Member

jimhester commented May 13, 2016

I don't think this is actually a problem. While it is true that the whitespace is insignificant in terms of the XML tree, text nodes are a separate entity in libxml2, so as_list() looks like it is doing the proper thing to me.

The error you are getting is because you used base::write() instead of write_xml(); the latter works without error on this example.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'
write_xml(cd, "cd_catalog.xml")

@hadley
Member

hadley commented May 13, 2016

We could add an option to as_list() to drop insignificant whitespace if that would help. (Assuming it's not too difficult to determine which whitespace is insignificant)

@jimhester
Member

I think it is element dependent as to whether text is significant or not. It could be specified in the XML schema with a whitespace facet (see Restrictions on Whitespace Characters in http://www.w3schools.com/xml/schema_facets.asp), but I don't believe there is any way to know de novo if whitespace within a node is significant or not.
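
If one is willing to treat every whitespace-only text node as insignificant, which is an assumption and not something the document itself can confirm, a post-hoc cleanup could look roughly like the sketch below. It assumes an xml2 version that provides xml_remove() and the options argument of read_xml(); the inline document and the use of "NOERROR" (chosen only because it is an option set without "NOBLANKS") are illustrative.

```r
library(xml2)

# Small inline document standing in for the catalog (hypothetical sample data).
txt <- "<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
  </CD>
</CATALOG>"

# Parse without "NOBLANKS" so the indentation survives as text nodes.
doc <- read_xml(txt, options = "NOERROR")

# Text nodes that are empty after XPath whitespace normalization.
blank <- xml_find_all(doc, "//text()[normalize-space(.) = '']")
n_blank <- length(blank)

# Unlink those nodes from the tree (modifies doc in place).
xml_remove(blank)

n_after <- length(xml_find_all(doc, "//text()[normalize-space(.) = '']"))
cd_list <- as_list(xml_find_first(doc, "//CD"))
names(cd_list)
```

This would also drop whitespace that an element really does mean to keep, for example under xml:space="preserve", so it is only appropriate for data-style XML like the catalog.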

@jennybc
Member Author

jennybc commented May 13, 2016

Sorry about the write() vs write_xml() mixup. My bad.

Yeah I guess all of my confusion is indeed around what happens when you coerce to a list. I can't tell if these whitespace nodes really exist or not. They don't influence the length of xml_children() and the XPath queries below don't get them. But they show up in the result of xml2::as_list(), which is different from XML::xmlToList().

library(XML)
library(magrittr)  # provides %>%, which neither XML nor xml2 attaches
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb_XML <- cd_XML %>% 
  xpathSApply("/CATALOG") %>% 
  .[[1]] %>% 
  xmlChildren() %>% 
  .[[1]])
#> <CD>
#>   <TITLE>Empire Burlesque</TITLE>
#>   <ARTIST>Bob Dylan</ARTIST>
#>   <COUNTRY>USA</COUNTRY>
#>   <COMPANY>Columbia</COMPANY>
#>   <PRICE>10.90</PRICE>
#>   <YEAR>1985</YEAR>
#> </CD>
length(xmlChildren(eb_XML))
#> [1] 6
length(xmlToList(eb_XML))
#> [1] 6
cd_XML %>%
  xpathSApply("//CD/*[1]", xmlValue) %>% 
  unlist() %>% 
  head()
#> [1] "Empire Burlesque"    "Hide your heart"     "Greatest Hits"      
#> [4] "Still got the blues" "Eros"                "One night only"
cd_XML %>%
  xmlToList() %>% 
  vapply(`[[`, character(1), 1) %>% 
  head()
#>                    CD                    CD                    CD 
#>    "Empire Burlesque"     "Hide your heart"       "Greatest Hits" 
#>                    CD                    CD                    CD 
#> "Still got the blues"                "Eros"      "One night only"

library(xml2)
cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
(eb <- cd %>%
  xml_find_one("/CATALOG") %>% 
  xml_children() %>% 
  .[[1]])
#> {xml_node}
#> <CD>
#> [1] <TITLE>Empire Burlesque</TITLE>
#> [2] <ARTIST>Bob Dylan</ARTIST>
#> [3] <COUNTRY>USA</COUNTRY>
#> [4] <COMPANY>Columbia</COMPANY>
#> [5] <PRICE>10.90</PRICE>
#> [6] <YEAR>1985</YEAR>
length(xml_children(eb))
#> [1] 6
length(as_list(eb))
#> [1] 13
cd %>%
  xml_find_all("//CD/*[1]") %>% 
  xml_text() %>% 
  head()
#> [1] "Empire Burlesque"    "Hide your heart"     "Greatest Hits"      
#> [4] "Still got the blues" "Eros"                "One night only"
cd %>%
  as_list() %>% 
  vapply(`[[`, character(1), 1) %>% 
  head()
#>                CD                CD                CD 
#>   "\n  " "\n    "   "\n  " "\n    "   "\n  " "\n    "

@jimhester
Member

@jennybc You can select the text nodes explicitly with a XPath expression.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_text(cd_text_nodes)
#>   [1] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>   [8] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [15] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [22] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [29] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [36] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [43] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [50] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [57] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [64] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [71] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [78] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [85] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [92] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#>  [99] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [106] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [113] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [120] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [127] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [134] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [141] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [148] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [155] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [162] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [169] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "  
#> [176] "\n    " "\n    " "\n    " "\n    " "\n    " "\n    " "\n  "

@jennybc
Member Author

jennybc commented May 13, 2016

😶

But according to the XML package, there are no such nodes?

library(XML)
cd_XML <- xmlParse("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_XML %>%
  xpathSApply("//CD/text()", xmlValue)
#> list()

BTW is "text node" some sort of special vocabulary? I think of all of those nodes (title, artist, etc.) as text nodes. I'm so confused.

@jimhester
Member

title, artist, etc. are 'element' nodes (see http://www.w3schools.com/xml/dom_nodetype.asp for all the types).

You can see the node type for any node with xml_type().

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")
cd_text_nodes <- xml_find_all(cd, "//CD/text()")
xml_type(cd_text_nodes)[1:6]
#> [1] "text" "text" "text" "text" "text" "text"
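
The 6 versus 13 mismatch earlier in the thread can be reproduced on a smaller scale by comparing xml_children(), which returns element children only, with xml_contents(), which returns every child node. The sketch below uses a hypothetical two-field CD and assumes a read_xml() that accepts options; "NOERROR" is passed only so that "NOBLANKS" is not in effect.

```r
library(xml2)

# Hypothetical, shortened CD record with indentation between elements.
txt <- "<CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
</CD>"

# Keep blank text nodes by parsing without "NOBLANKS".
cd_node <- read_xml(txt, options = "NOERROR")

# Element children only: TITLE and ARTIST.
n_elements <- length(xml_children(cd_node))

# All child nodes, including the whitespace text nodes around the elements.
all_nodes <- xml_contents(cd_node)
n_all <- length(all_nodes)

# Node type of each child, in document order.
types <- xml_type(all_nodes)
types
```

as_list() walks the same nodes that xml_contents() sees, which is why the whitespace shows up there but not in xml_children().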

@jennybc
Member Author

jennybc commented May 13, 2016

FWIW XML uses the NOBLANKS option to:

Remove text nodes that are made entirely of white space between nodes. This can be used to remove formatting content that is used for indenting nodes.

and apparently it is ON by default. Maybe xml2 should do the same?
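
For comparison, here is a sketch of what the same switch looks like in xml2, assuming a version whose read_xml() has an options argument that accepts "NOBLANKS" (this did not exist when the issue was opened). The inline document is hypothetical, and "NOERROR" is used only as an option set that leaves blank nodes alone.

```r
library(xml2)

# Hypothetical indented document standing in for the catalog.
txt <- "<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
  </CD>
</CATALOG>"

# Same input parsed twice: once keeping blank text nodes, once dropping them.
with_blanks <- read_xml(txt, options = "NOERROR")
no_blanks <- read_xml(txt, options = "NOBLANKS")

# as_list() on the first CD node reflects whichever nodes the parser kept.
n_with <- length(as_list(xml_find_first(with_blanks, "//CD")))
n_without <- length(as_list(xml_find_first(no_blanks, "//CD")))

c(with_blanks = n_with, no_blanks = n_without)
```

libxml2 decides which blank nodes to drop using heuristics, so results on mixed-content documents may differ from this simple case.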

@hadley
Member

hadley commented May 13, 2016

Yeah, that sounds reasonable to me. Could incorporate into #85

jimhester added a commit to jimhester/xml2 that referenced this issue May 17, 2016
jimhester added a commit to jimhester/xml2 that referenced this issue May 17, 2016
jimhester added a commit to jimhester/xml2 that referenced this issue May 18, 2016
jimhester added a commit to jimhester/xml2 that referenced this issue May 19, 2016
jimhester added a commit to jimhester/xml2 that referenced this issue May 19, 2016