How to ignore "insignificant whitespace"? #49

Closed
@jennybc

I'm using the term "insignificant whitespace" as defined in "What You Need to Know About Whitespace in XML":

Insignificant whitespace is used when editing XML documents for readability. These whitespaces are typically not intended for inclusion in the delivery of the document.

Consider XML that has been formatted for human eyeballs. xml2 can read it without error, and well-formed XPath expressions do what one expects. But once you use as_list() or try to write it back out with write_xml(), you learn there are problems with whitespace.

cd <- read_xml("http://www.xmlfiles.com/examples/cd_catalog.xml")

Targeted queries work fine:

xml_find_all(cd, ".//TITLE")
#> {xml_nodeset (26)}
#>  [1] <TITLE>Empire Burlesque</TITLE>
#>  [2] <TITLE>Hide your heart</TITLE>
#>  [3] <TITLE>Greatest Hits</TITLE>
#>  [4] <TITLE>Still got the blues</TITLE>
#>  [5] <TITLE>Eros</TITLE>
#>  [6] <TITLE>One night only</TITLE>
#>  [7] <TITLE>Sylvias Mother</TITLE>
#>  [8] <TITLE>Maggie May</TITLE>
#>  [9] <TITLE>Romanza</TITLE>
#> [10] <TITLE>When a man loves a woman</TITLE>
#> [11] <TITLE>Black angel</TITLE>
#> [12] <TITLE>1999 Grammy Nominees</TITLE>
#> [13] <TITLE>For the good times</TITLE>
#> [14] <TITLE>Big Willie style</TITLE>
#> [15] <TITLE>Tupelo Honey</TITLE>
#> [16] <TITLE>Soulsville</TITLE>
#> [17] <TITLE>The very best of</TITLE>
#> [18] <TITLE>Stop</TITLE>
#> [19] <TITLE>Bridge of Spies</TITLE>
#> [20] <TITLE>Private Dancer</TITLE>
#> ...

But as_list() reveals some problems:

str(as_list(cd), max.level = 1, list.len = 5)
#> List of 53
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>  $ CD:List of 13
#>   .. [list output truncated]
#>  $   : chr "\n  "
#>   [list output truncated]
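
The "\n  " entries above are whitespace-only text nodes sitting between the CD elements. As a minimal sketch of one way to avoid them, assuming an xml2 version whose read_xml() accepts an `options` argument (this is not something the original report relies on), the "NOBLANKS" parser option asks libxml2 to drop such nodes at parse time. The inline `txt` document is a hypothetical stand-in for the CD catalog, and "NONET" is passed only so that "NOBLANKS" is not in effect for the comparison.

```r
library(xml2)

# Hypothetical pretty-printed stand-in for the CD catalog.
txt <- paste0(
  "<CATALOG>\n",
  "  <CD>\n    <TITLE>Empire Burlesque</TITLE>\n  </CD>\n",
  "  <CD>\n    <TITLE>Hide your heart</TITLE>\n  </CD>\n",
  "</CATALOG>"
)

# Parse once without and once with the NOBLANKS parser option.
# "NONET" is used here only as an option set that omits "NOBLANKS".
keep <- read_xml(txt, options = "NONET")
drop <- read_xml(txt, options = "NOBLANKS")

# xml_contents() returns all child nodes of the root, text nodes included.
xml_type(xml_contents(keep))  # whitespace text nodes interleaved with elements
xml_type(xml_contents(drop))  # element nodes only
```

Note that NOBLANKS is a parser heuristic, so it may keep blank text in documents with mixed content.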

And indeed, writing the document back out fails:

write(cd, "cd_catalog.xml")
#> Error in cat(list(...), file, sep, fill, labels, append): argument 1 (type 'list') cannot be handled by 'cat'
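
For the round trip, one possible workaround is to strip whitespace-only text nodes from an already-parsed document and then serialise it with write_xml(). The sketch below assumes an xml2 version that provides xml_remove() and the `options` argument of read_xml(); `txt`, `doc`, `out`, and `back` are hypothetical names, and the inline document stands in for the CD catalog.

```r
library(xml2)

# Hypothetical pretty-printed stand-in for the CD catalog.
txt <- paste0(
  "<CATALOG>\n",
  "  <CD>\n    <TITLE>Empire Burlesque</TITLE>\n  </CD>\n",
  "  <CD>\n    <TITLE>Hide your heart</TITLE>\n  </CD>\n",
  "</CATALOG>"
)

# Parse with an option set that omits "NOBLANKS", so blank text nodes are kept.
doc <- read_xml(txt, options = "NONET")

# Select every text node that contains nothing but whitespace, then remove it.
blank <- xml_find_all(doc, "//text()[normalize-space(.) = '']")
xml_remove(blank)

# Write the cleaned document to a temporary file and read it back.
out <- tempfile(fileext = ".xml")
write_xml(doc, out)
back <- read_xml(out)

xml_text(xml_find_all(back, ".//TITLE"))
```

Removing every whitespace-only text node is only safe when the whitespace really is insignificant; in mixed content, such as a space between two inline elements, it can change the meaning of the document.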
