inner_text joins text without whitespace #636

Closed
DougPuchalski opened this issue Mar 19, 2012 · 10 comments

Comments

@DougPuchalski
Contributor

Joining <span>one</span><span>two</span> inner_text elements without whitespace results in onetwo rather than one two. Votes for a single space when joining?

https://github.com/tenderlove/nokogiri/commit/be67bd3eb17bda8b08a14a512980983691376c65#lib/nokogiri/xml/node_set.rb-P7

@tenderlove
Member

I don't think this is a good idea. Would this only impact span tags? What if the tags already have a space between them? What about two spaces? How do you differentiate the inner_text between tags separated by spaces and tags without?

There's too much ambiguity, not to mention it would break backwards compatibility.

@DougPuchalski
Contributor Author

That was only an example. Is it the intention for inner_text to concatenate content from separate tags without delineation?

@tenderlove
Member

Ya, it literally just recursively joins the inner_text from each tag. If you want to do text indexing, it's probably best to start at the leaf nodes of the tree and walk back up, only asking for text in that node.

@flavorjones you have any ideas?
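
A minimal sketch of the behaviour described above, using a hypothetical `Node` struct rather than Nokogiri's real classes: `inner_text` concatenates the text of every descendant with no separator, while collecting the text leaves first leaves the choice of separator to the caller. The names `Node`, `text_node`, `element`, and `text_leaves` are assumptions made for illustration only.

```ruby
# Hypothetical stand-in for a parsed tree; NOT Nokogiri's actual implementation.
# A node either carries text (a text leaf) or has children (an element).
Node = Struct.new(:name, :text, :children) do
  # Concatenate the text of all descendants, with no separator between them.
  def inner_text
    text || children.map(&:inner_text).join
  end

  # Collect the text-bearing leaves in document order.
  def text_leaves
    return [self] if text
    children.flat_map(&:text_leaves)
  end
end

def text_node(str)
  Node.new('text', str, [])
end

def element(name, *children)
  Node.new(name, nil, children)
end

div = element('div',
              element('span', text_node('one')),
              element('span', text_node('two')))

concatenated = div.inner_text                       # "onetwo"
separated    = div.text_leaves.map(&:text).join(' ') # "one two"
```

Working from the leaves keeps the separator decision (space, newline, or nothing) in the caller's hands instead of baking one into `inner_text`.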

@DougPuchalski
Contributor Author

Wouldn't want to break anything, but I'm not imagining a use-case for munging text like that. Should the method accept a block or options perhaps?

WebKit appears to use newlines, at least in the console.

@flavorjones
Member

You may want to use the Loofah gem (I wrote it). It's got a #to_text method that inserts whitespace after an inline element, and a newline after a block element.

@DougPuchalski
Contributor Author

Excellent, looks like just what I need. Thanks flavor.

@feliperaul

feliperaul commented Apr 27, 2017

@tenderlove Please consider reopening this one. It's a bug. When I ask for the text of an element, you can't change it. If I have <span>Experts</span><span>exchange</span>, .text outputs Expertsexchange... it's actually changing the text!

Regarding the ambiguity, you just add one space and that's all. If there is already other whitespace there, it's up to the user to use .strip or something like that. But please don't concatenate strings that belong to different HTML elements.

Had a nasty bug because of this.

@feliperaul

In the meantime, a workaround is using this:
Nokogiri::HTML(str).xpath('//text()').map(&:text).join(' ')
source: http://stackoverflow.com/questions/28449761/replace-html-tags-with-whitespaces

@knu
Member

knu commented Apr 27, 2017

Whitespace is a language-specific choice for a word separator. There are languages in the world such as Japanese that do not use space for separating words or sentences.

As you may know, the XML/HTML and DOM specifications are clear about how to deal with whitespace, so unless this argument is based on some standard and takes backward compatibility fully into account, we could never accept it.

I, for one, have never seen a DOM API that supplies implicit whitespace between nodes, and I think the XPath equivalent of inner_text() is string(.), which simply concatenates texts without any separator:

Nokogiri::HTML(str).xpath('string(.)')  #=> "onetwo"

@ar7max

ar7max commented Apr 27, 2017

Firefox
JavaScript

D = the div element containing '<span>one</span><span>two</span>'
D.innerText returns 'onetwo'
D.textContent returns 'some whitespaces onetwo some whitespaces'

So, what's the problem? What did you expect from inner_text?

dgcliff added a commit to NEU-Libraries/cerberus that referenced this issue Aug 21, 2017
…er work - see sparklemotion/nokogiri#636 x-path join and split replicates the old behavior. Will look into Loofah gem, but this works for now. Need to modularize the code and get these methods out of the mods metadata datastream