Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConstructingParser adds space between two predefined entities when no other text is present #107

Open
cayhorstmann opened this issue Jul 27, 2016 · 4 comments

Comments

@cayhorstmann
Copy link

scala.xml.parsing.ConstructingParser.fromSource(scala.io.Source.fromString("<div>&lt;&lt;</div>"), false).document
res0: scala.xml.Document = Document(<div>&lt; &lt;</div>)

Note the space between the two &lt;.

@cayhorstmann
Copy link
Author

The problem is that predefined entities are turned into Atom(c), where c is the character represented by the entity. For example, < turns into an Atom('<'). This happens in MarkupParser:

case '&' => // EntityRef or CharRef
      nextch(); ch match {
        case '#' => // CharacterRef
          nextch()
          val theChar = handle.text(tmppos, xCharRef(() => ch, () => nextch()))
          xToken(';')
         ts &+ theChar

The &+ operator is defined in NodeBuffer as follows:

def &+(o: Any): NodeBuffer = {
  o match {
    case null | _: Unit | Text("") => // ignore
    case it: Iterator[_]           => it foreach &+
    case n: Node                   => super.+=(n)
    case ns: Iterable[_]           => this &+ ns.iterator
    case ns: Array[_]              => this &+ ns.iterator
    case d                         => super.+=(new Atom(d))
  }
  ...
}

That sounds innocuous, but there is no good way to save such a thing. Calling XML.write or PrettyPrinter.formatNodes ultimately goes through Utilities.sequenceToXML and this branch:

else if (children forall isAtomAndNotText) { // add space
    val it = children.iterator
    val f = it.next()
    serialize(f, pscope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags)
    while (it.hasNext) {
      val x = it.next()
      sb.append(' ')
      serialize(x, pscope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags)
    }
}

When saving an element that contains only non-text atoms, they are separated by spaces.
The simplest remedy would be to change

ts &+ theChar 

to

ts &+ Text(theChar.toString)

in MarkupParser.

@ashawley
Copy link
Member

Good analysis of the problem. Not sure if that change to MarkupParser will fix it though.

The parser will need some work to be smart enough to preserve white space between entities. Obviously, white space shouldn't be preserved, when it's not desired, between other types of XML "nodes". However, that feature seems to be the source of the problem. An Atom is considered a Node, but entities are indeed a special case. They should be treated like text, so any adjacent whitespace should be treated as text.

The MarkupParser doesn't have a method to consume white space characters in to a Text object. The appendText method would be useful here. Unfortunately, the preserveWS affects the behavior of appendText.

ashawley added a commit to ashawley/scala-xml that referenced this issue Jul 29, 2016
Add unit tests to check for white space between XML entities.  One
test is the original example of two adjacent entities. An additional
test case *has* white space between entities, and is just an
affirmation of the existing behavior, and should be preserved.

This bug pops up when the preserveWS option is either enabled or
disabled for ConstructingParser.  This doubles the number of unit
tests to four.

The fifth unit test, listed first, of adjacent text characters is
rhetoric, and doesn't need to be included in the final merge.  The
previous sentence shouldn't be merged either.

[error] Test XMLTest.preserveNoSpaceBetweenEntitiesOptionEnabled failed:
[error]   expected: <<div>&lt;[]&lt;</div>>
[error]   but was: <<div>&lt;[ ]&lt;</div>>
[error] Test XMLTest.preserveNoSpaceBetweenEntitiesOptionDisabled failed:
[error]   expected: <<div>&lt;[]&lt;</div>>
[error]   but was: <<div>&lt;[ ]&lt;</div>>
@ashawley
Copy link
Member

I've taken a first pass at writing unit tests should anyone want to take this one on.

@piyush-jaiswal
Copy link

Hi @ashawley, I have tried to fix this issue and also added the tests you provided. It would be great if you could take a look at it. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants