XML-parser chokes on some usernames #205

Closed
Nadyita opened this issue Feb 12, 2023 · 7 comments

@Nadyita

Nadyita commented Feb 12, 2023

Some usernames contain characters that make the XML parser refuse loading a file. If that happens to be in a relation, then a large portion of the map cannot be rendered. Example:

<relation id='2500638' timestamp='2023-02-11T15:38:37Z' uid='12419182' user='osm-pt-account 😎' visible='true' version='148' changeset='132409937'>

Everything by the user https://www.openstreetmap.org/user/osm-pt-account%20%f0%9f%98%8e is just causing issues.

java.io.IOException: could not read OSM file (not even with workaround for JOSM files)
	at org.osm2world.core.osm.creation.OSMFileReader.getData(OSMFileReader.java:54)
	at org.osm2world.viewer.model.Data.loadOSMData(Data.java:63)
	at org.osm2world.viewer.control.actions.AbstractLoadOSMAction$LoadOSMThread.run(AbstractLoadOSMAction.java:84)
Caused by: java.lang.RuntimeException: error while processing input
	at de.topobyte.osm4j.xml.dynsax.OsmXmlIterator.hasNext(OsmXmlIterator.java:138)
	at de.topobyte.osm4j.core.dataset.MapDataSetLoader.read(MapDataSetLoader.java:82)
	at org.osm2world.core.osm.creation.OSMStreamReader.getDataFromStream(OSMStreamReader.java:103)
	at org.osm2world.core.osm.creation.OSMStreamReader.getData(OSMStreamReader.java:81)
	at org.osm2world.core.osm.creation.OSMFileReader.getData(OSMFileReader.java:52)
	... 2 more
Caused by: de.topobyte.osm4j.core.access.OsmInputException: error while parsing xml data
	at de.topobyte.osm4j.xml.dynsax.OsmXmlReader.read(OsmXmlReader.java:90)
	at de.topobyte.osm4j.xml.dynsax.OsmXmlIterator$1.run(OsmXmlIterator.java:94)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.xml.sax.SAXParseException; lineNumber: 5392; columnNumber: 125; Character reference "&#
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:399)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:326)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1466)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanCharReferenceValue(XMLScanner.java:1339)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(XMLScanner.java:896)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanAttribute(XMLDocumentFragmentScannerImpl.java:1547)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1314)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2783)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:601)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:504)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:642)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:326)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at de.topobyte.osm4j.xml.dynsax.OsmXmlReader.read(OsmXmlReader.java:88)
	... 2 more
tordanik added the bug and upstream (issues originating from a library or service OSM2World depends on) labels Feb 12, 2023
@tordanik
Owner

tordanik commented Feb 12, 2023

Thanks for the report, this is indeed a rather major problem. I've reported it upstream with the osm4j library (issue #11), I hope we can collaborate to resolve this.

I've also attached a small example to reproduce the problem to the upstream issue.

@sebkur
Contributor

sebkur commented Feb 18, 2023

Hey, I've added a test in osm4j as reported in the upstream issue and was not able to reproduce the problem.

I've taken a look at OSM2World's code, and it looks like you're completely rewriting the file whenever it is detected as having been created by JOSM, here:

protected static final File createTempFileWithJosmWorkarounds(InputStream josmDataInputStream) throws IOException {

Is it possible that something is going wrong with the charsets during that conversion?

Could you take a look at the temporary file created from the example file?

@sebkur
Contributor

sebkur commented Feb 18, 2023

Looks like that piece of code converts the emoji to this: user="account with emoji &#128526;"

@sebkur
Contributor

sebkur commented Feb 18, 2023

I can still read that modified file in my test though...
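
For anyone following along, here is a small sketch of why `&#128526;` on its own should be readable. 128526 is the decimal code point of U+1F60E, which is a legal XML character reference, and its UTF-8 bytes are the `f0 9f 98 8e` seen in the user URL above. For contrast, the sketch also feeds the parser references to the two UTF-16 surrogate halves of the same emoji, which XML does not allow. That second case is only an illustration of how an invalid character reference can come about; the thread does not say which reference the failing file actually contained. The class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class EmojiCharRefDemo {

    /** Parses a single-element XML document and returns its user attribute. */
    static String parseUserAttribute(String xml) throws Exception {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                result[0] = attrs.getValue("user");
            }
        };
        // No XML declaration, so the parser assumes UTF-8 input.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return result[0];
    }

    /** Returns true if the JDK's SAX parser refuses to parse the document. */
    static boolean isRejected(String xml) {
        try {
            parseUserAttribute(xml);
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // U+1F60E lies outside the BMP, so Java stores it as two chars.
        String emoji = new String(Character.toChars(128526));
        System.out.println("Java chars: " + emoji.length());

        // A reference to the whole code point is valid XML.
        String ok = "<relation user='account with emoji &#128526;'/>";
        System.out.println("parsed: " + parseUserAttribute(ok));

        // Hypothetical failure mode: one reference per surrogate half.
        String bad = "<relation user='account with emoji &#55357;&#56846;'/>";
        System.out.println("rejected: " + isRejected(bad));
    }
}
```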

@tordanik
Owner

That's interesting!

I'm indeed rewriting files which have been generated by JOSM because (I believe) osm4j does not have built-in support for the JOSM dialect of OSM XML with additional attributes such as action=delete.

For some reason, I did not initially have that on my radar as a possible cause – but I now believe that it's likely to be involved because OSM2World is able to read the problem files if I change the generator attribute (and therefore disable my JOSM-specific code).

I'll investigate further and report the results.

tordanik removed the upstream label Feb 18, 2023
@sebkur
Contributor

sebkur commented Feb 18, 2023

> I'll investigate further and report the results.

thanks!

> I'm indeed rewriting files which have been generated by JOSM because (I believe) osm4j does not have built-in support for the JOSM dialect of OSM XML with additional attributes such as action=delete.

Maybe we could improve osm4j to at least parse the files from JOSM even if it cannot currently store the additional data in its data model. Will need to find a suitable test file... I've created topobyte/osm4j#12 to track this.

@tordanik
Owner

I've created a fix in 2ae6340. It writes the transformed XML to a Java String (UTF-16 internally) and relies on Apache Commons' IOUtils to output it as correctly encoded UTF-8. In 26e9bbd, I also took the opportunity to replace the temporary files with an in-memory representation.
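
Roughly, the idea looks like the sketch below. This is not the code from those commits: `transform` is a hypothetical stand-in for the JOSM workaround (it just strips one attribute with a string replace), and the UTF-8 step uses only the JDK, whereas Commons IO offers `IOUtils.toInputStream(String, Charset)` for the same purpose. The point is that the emoji stays a literal character in the String and is encoded exactly once, when the String is turned into bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class InMemoryXmlDemo {

    /** Hypothetical stand-in for the JOSM workaround: drops one attribute. */
    static String transform(String josmXml) {
        return josmXml.replace(" action='modify'", "");
    }

    /** Encodes the String as UTF-8 and exposes it as a stream, no temp file. */
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** Parses the stream and returns the user attribute of the first element. */
    static String readUser(InputStream in) throws Exception {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (result[0] == null) {
                    result[0] = attrs.getValue("user");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String emoji = new String(Character.toChars(0x1F60E));
        String josmXml = "<relation id='2500638' action='modify' "
                + "user='osm-pt-account " + emoji + "'/>";
        String cleaned = transform(josmXml);
        System.out.println(readUser(toUtf8Stream(cleaned)));
    }
}
```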

I've added one unit test and did some manual testing as well. It appears to work fine now. Not the most elegant approach, but to achieve a nice long-term solution, I'd rather help improve osm4j. 🙂️

The next OSM2World build will contain the fix.
