xml_text does not trim "&nbsp" #151

Closed
rentrop opened this Issue Nov 30, 2016 · 3 comments

rentrop commented Nov 30, 2016

In my HTML code I have `&nbsp;` (i.e. a non-breaking space). If I try to get the text via xml_text(..., trim=TRUE), it returns the non-breaking space instead of an empty string.
Is this a feature? IMHO the expected behavior would be to return an empty string...

Minimal-Example:

require(xml2)
space <- rawToChar(as.raw(c(0xc2, 0xa0)))
doc <- read_xml(paste0('<td style="text-align:left;">', space, '</td>'))
xml_text(doc, trim = TRUE) == "" # FALSE
charToRaw(xml_text(doc, trim = TRUE)) #[1] c2 a0

Workaround:
stringi::stri_trim_both(xml_text(doc, trim = TRUE)) or stringr::str_trim
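The same behaviour can be seen on a plain string, without any XML parsing. This is only a minimal sketch: it assumes the stringi package is installed, and `cell_text` is an illustrative name, not something from xml2. It contrasts base `trimws()`, whose default pattern covers only space, tab, CR and LF, with `stringi::stri_trim_both()`, which trims by the Unicode White_Space property.

```r
library(stringi)

# U+00A0 (no-break space), written as an escape so the example
# does not depend on the encoding of the source file
cell_text <- "\u00a0"

# Base R: the default trimws() pattern does not include U+00A0,
# so the string comes back unchanged
base_trimmed <- trimws(cell_text)

# stringi: the default pattern is "\\P{Wspace}", so U+00A0 is trimmed
uni_trimmed <- stri_trim_both(cell_text)

base_trimmed == ""  # FALSE
uni_trimmed == ""   # TRUE
```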

@rentrop rentrop changed the title from xml_text does not delete &nbsp to xml_text does not trim "&nbsp" Dec 1, 2016

@jimhester jimhester closed this in 13ec091 Dec 6, 2016

rentrop commented Dec 15, 2016

@jimhester thanks for getting on this so fast. Unfortunately, this fix introduces another bug:

Take this example:

devtools::install_github("hadley/xml2") # Version 1.0.0.9002
require(xml2)
doc <- read_html('<td>31.12.2010&nbsp;<br>€
                 </td>')
text_nodes <- xml_find_all(doc, ".//text()[normalize-space()]")
xml_text(text_nodes, trim = TRUE) # "31.12.2010" ""

So xml_text now removes the €-sign.

The expected result would be:

stringi::stri_trim_both(xml_text(text_nodes)) # "31.12.2010" "€"
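The expectation can also be written as a small check on plain strings, independent of xml2. This is a sketch under assumptions: it needs the stringi package, and `nodes_text` is a hand-built stand-in for the content of the two text nodes above, not the test used in the xml2 repository.

```r
library(stringi)

# Stand-ins for the two text nodes: a date followed by U+00A0,
# and a euro sign (U+20AC) followed by a newline and indentation
nodes_text <- c("31.12.2010\u00a0", "\u20ac\n                 ")

trimmed <- stri_trim_both(nodes_text)

# Whitespace (including U+00A0) must go; the euro sign must stay
stopifnot(all(trimmed == c("31.12.2010", "\u20ac")))
```

A check like this fails if a trimming routine treats every non-ASCII character as whitespace, which is the symptom reported here.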

@jimhester jimhester reopened this Dec 15, 2016

@jimhester jimhester closed this in 0eaa61c Dec 15, 2016

Member

jimhester commented Dec 15, 2016

Ok I refactored how this was being done, thanks for the reproducible example.

rentrop commented Dec 15, 2016

Perfect, thanks
