When trying to use XMLEventReader with an utf-8 encoded XML document that starts with a byte order mark (EF BB BF), an empty iterator is returned, and an error message is printed to stderr #95

Closed
bbary opened this issue Mar 25, 2016 · 16 comments

@bbary

bbary commented Mar 25, 2016

When trying to use XMLEventReader with a UTF-8 encoded XML document that starts with a byte order mark (EF BB BF), an empty iterator is returned and an error message is printed to stderr.
test.scala
import scala.io.Source
import scala.xml.pull.XMLEventReader

val t = new XMLEventReader(Source.fromFile("hasBOM.xml"))

println(t)
Output of scala test.scala
file:/tmp/hasBOM.xml:1:1: < expected^
empty iterator

The attached file hasBOM.xml has CR+LF newlines, but LF or CR newlines do not change the behaviour.

https://drive.google.com/file/d/0BzQoj9XC6BxUUWZfdzgwb3J4OWc/view?usp=sharing

@SethTisue
Member

Is this an issue with XMLEventReader, or with Source.fromFile?

@bbary
Author

bbary commented Mar 25, 2016

It's an issue with XMLEventReader.
When I convert the file to US-ASCII it works, but I'm not allowed to do that.

@SethTisue
Member

> The attached file hasBOM.xml

There is no attached file?

@bbary
Author

bbary commented Mar 25, 2016

Ah yes, sorry, I will attach the file.

@SethTisue
Member

> it's an issue with XMLEventReader.

How do you know it isn't just an issue with Source.fromFile?

fromFile has overloads that accept the name of an encoding or a Codec object. Are you sure there is no way to make this work simply by using one of those overloads to tell Source how your file is encoded?
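
As a minimal sketch of the two overloads mentioned above: the object name, helper names, and the temporary file (standing in for hasBOM.xml) are hypothetical. It reads the first character both ways. The JVM's UTF-8 decoder is generally understood to pass a leading BOM through as U+FEFF rather than drop it, so naming the encoding alone should not be expected to remove it.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.{Codec, Source}

object EncodingOverloads {
  // Writes a small UTF-8 file that starts with a BOM (EF BB BF) and returns its path.
  def writeSample(): String = {
    val path = Files.createTempFile("hasBOM", ".xml")
    val bom = Array(0xEF, 0xBB, 0xBF).map(_.toByte)
    Files.write(path, bom ++ "<root/>".getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // First decoded character, using the overload that takes an encoding name.
  def firstCharByName(path: String): Char = {
    val src = Source.fromFile(path, "UTF-8")
    try src.next() finally src.close()
  }

  // First decoded character, using the overload that takes a Codec.
  def firstCharByCodec(path: String): Char = {
    val src = Source.fromFile(path)(Codec.UTF8)
    try src.next() finally src.close()
  }
}
```

Calling `.toInt` on either result shows which code point the decoder produced for the first character.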

@bbary
Author

bbary commented Mar 25, 2016

I tested Source.fromFile and it's working fine with this file.

@SethTisue
Member

Attaching the file as a screen shot is useless. You'll need to attach the actual file, or link to the actual file, since the actual bytes matter.

@bbary
Author

bbary commented Mar 25, 2016

I updated my post with a link, because I couldn't upload an XML file.

@SethTisue
Member

SethTisue commented Mar 25, 2016

I see this was reported back in 2012 at https://issues.scala-lang.org/browse/SI-6741 (scala/bug#6741). (But this repo is now the right place for the issue, if it is determined to be scala-xml specific.)

@SethTisue
Member

SethTisue commented Mar 25, 2016

Closely related and perhaps relevant: here is a still-open ticket from 2009 on essentially the same issue, but in the context of scalac rather than scala-xml, and thus involving scala.tools.nsc.io.SourceReader rather than scala.io.Source: https://issues.scala-lang.org/browse/SI-2109 (scala/bug#2109)

@SethTisue
Member

judging from http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java and http://mindprod.com/jgloss/bom.html, I think this is just standard JVM stuff and is not scala-xml specific. if it were to be addressed, it would be addressed in io.Source, not in scala-xml. so, closing the issue.

both of the links I've provided suggest multiple strategies for working around this.
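
One byte-level strategy of the kind those links discuss is to peek at the first three bytes and push them back if they are not a UTF-8 BOM. The following is a sketch under that assumption, using only the JDK and scala.io; `SkipUtf8Bom`, `skipBom`, and `readAll` are hypothetical names, and only the UTF-8 BOM (EF BB BF) is handled.

```scala
import java.io.{InputStream, PushbackInputStream}
import scala.io.Source

object SkipUtf8Bom {
  private val Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // Returns a stream positioned after a leading UTF-8 BOM, if one is present.
  def skipBom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Bom.length)
    val head = new Array[Byte](Bom.length)
    var n = 0
    var r = 0
    // read() may return fewer bytes than requested, so loop until full or EOF.
    while (n < head.length && r != -1) {
      r = pushback.read(head, n, head.length - n)
      if (r > 0) n += r
    }
    val isBom = n == Bom.length && java.util.Arrays.equals(head, Bom)
    // Not a BOM: put back whatever was read so the caller sees every byte.
    if (!isBom && n > 0) pushback.unread(head, 0, n)
    pushback
  }

  // Convenience for checking the result: decode the rest of the stream as UTF-8.
  def readAll(in: InputStream): String =
    Source.fromInputStream(skipBom(in), "UTF-8").mkString
}
```

The stream returned by `skipBom` can be handed to `Source.fromInputStream` in the same way as any other InputStream.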

@SethTisue
Member

> i tested Source.fromFile and it's working fine with this file

not for me:

scala> io.Source.fromFile("cosi2.xml").getLines.next
res5: String = <root>

that looks OK, but appearances are misleading:

scala> io.Source.fromFile("cosi2.xml").getLines.next.size
res6: Int = 7

scala> io.Source.fromFile("cosi2.xml").getLines.next.head.toInt
res7: Int = 65279

oops — that's garbage from the BOM.
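
The value 65279 is 0xFEFF, the code point the BOM decodes to. A character-level sketch of removing it after decoding follows; `BomStrip`, `stripBom`, and `sourceWithoutBom` are hypothetical names, and the helper assumes the file is UTF-8 and small enough to read into memory.

```scala
import scala.io.Source

object BomStrip {
  // U+FEFF is the character that shows up as 65279 in the session above.
  val Bom: Char = '\uFEFF'

  // Drops a single leading U+FEFF, if present; otherwise returns the text unchanged.
  def stripBom(text: String): String =
    if (text.nonEmpty && text.charAt(0) == Bom) text.substring(1) else text

  // Reads the whole file as UTF-8 and returns a Source over the BOM-free text.
  def sourceWithoutBom(path: String): Source = {
    val in = Source.fromFile(path, "UTF-8")
    try Source.fromString(stripBom(in.mkString)) finally in.close()
  }
}
```

Since XMLEventReader takes a Source, the result of `sourceWithoutBom("hasBOM.xml")` could presumably be passed to it directly, though that combination is not verified here.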

@bbary
Author

bbary commented Mar 27, 2016

@SethTisue: Thank you for your answers.
But I found nothing on the internet that solves this; all the Stack Overflow answers are about Java.

@biswanaths
Member

Hey @thatismypath, here is a working Scala snippet:

import org.apache.commons.io.input.BOMInputSteam 
import scala.io.Source
import java.io.FileInputStream
import scala.xml.pull.XMLEventReader

val t = new XMLEventReader( 
               Source.fromInputStream(
                      new BOMInputStream(new FileInputStream("hasBOM.xml"))))

println(t)

For sbt, this is the dependency:

libraryDependencies += "commons-io" % "commons-io" % "2.3"

Interestingly enough, XML.loadFile("hasBOM.xml") seems to work fine. I'm not sure what handles the BOM there: the SAXParser that scala-xml uses, or something else?

Regards.

@bbary
Author

bbary commented Mar 29, 2016

Thank you @biswanaths, it works perfectly with your code. Cool!
Just correct one typo in the imports:
import org.apache.commons.io.input.BOMInputSteam should be
import org.apache.commons.io.input.BOMInputStream

Regards.

@pauloflamyob

The XML returned from BizTalk contains a BOM character. Java can't handle this. We need to handle this ourselves. See here: #95 (comment)

4 participants