Error running ./WiktionarySplitter.sh #81

Gitsaibot · 2018-01-06T10:54:12Z

I always get this error when I try to run ./WiktionarySplitter.sh. What can I do to avoid it? I'm using a Debian 9 system.

endPage: hoggeries, count=2800000
title with colon: Reconstruction:Proto-Germanic/bikjǭ
Exception during parse, lastPageTitle=testamentation, titleBuilder=naseinai
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 94192868; columnNumber: 8; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:105)
at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

rdoeffinger · 2018-01-06T12:25:09Z

As far as I can tell from the messages, the XML file is not valid UTF-8. Maybe a newer or different version of Xerces would be less picky, but I doubt it.
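
One way to check whether the file really contains invalid UTF-8 is to scan it outside the XML parser. The following is only a minimal sketch, not part of DictionaryPC; the function name `first_invalid_utf8` is made up for illustration, and it reports a byte column, which may not match the character-based column in the parser's message.

```python
def first_invalid_utf8(stream):
    """Return (line_number, byte_column) of the first invalid UTF-8 sequence
    in a binary stream, or None if every line decodes cleanly.

    Hypothetical helper for illustration; not part of DictionaryPC.
    """
    # Iterating a binary stream splits on b"\n". That byte never occurs inside
    # a valid multi-byte UTF-8 sequence, so valid characters are never cut in half.
    for line_number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the 0-based byte offset within this line.
            return line_number, exc.start + 1
    return None
```

It could be called on a file opened in binary mode, e.g. `open("data/inputs/enwiktionary-pages-articles.xml", "rb")`. If it returns `None`, the bytes on disk are valid UTF-8 and the cause is more likely elsewhere than in the data.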

Gitsaibot · 2018-01-07T09:37:51Z

Do I have to run ./WiktionarySplitter.sh if I use my own DE-EN.txt file, or can I generate the dictionary directly? I found test files in DictionaryPC that I want to try...

rdoeffinger · 2018-01-08T10:49:06Z

You need WiktionarySplitter (and the download scripts for downloading Wiktionary data) only if you actually want to use the data from Wiktionary. So I guess the answer should be "no".

kenden · 2019-06-06T16:44:34Z

I'm getting a similar issue:

$ ./WiktionarySplitter.sh 
(...)
title with colon: Reconstruction:Proto-Iranian/páyHah
title with colon: Reconstruction:Proto-Germanic/marjaną
title with colon: Reconstruction:Proto-Indo-Iranian/mazǰʰás
title with colon: Reconstruction:Sanskrit/स्यालभार्या
title with colon: Reconstruction:Old Persian/𐎲𐎡𐎺𐎼
Exception during parse, lastPageTitle=femsplained, titleBuilder=Waidbruck of file data/inputs/enwiktionary-pages-articles.xml
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/home/redacted/dev/DictionaryPC/data/inputs/enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:330)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:113)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
	... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

rdoeffinger · 2019-06-16T08:20:26Z

I guess the best you can do is to fix the encoding at the position reported in the error:
enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence
Ideally, also report the issue to Wiktionary, as it seems they have broken data...
I haven't checked whether the XML parser can be configured to be more permissive, but unfortunately I suspect it can't.
Running iconv from UTF-8 to UTF-8 on the XML file might also work to clean up the broken encoding.
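
For anyone without iconv at hand, the same kind of clean-up can be sketched in Python. This is an illustrative assumption rather than something DictionaryPC ships: the name `scrub_utf8` is hypothetical, and the sketch replaces each invalid byte sequence with U+FFFD instead of dropping it, so the damaged spots stay visible in the output.

```python
import codecs


def scrub_utf8(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to binary stream dst, replacing every invalid
    UTF-8 sequence with U+FFFD (the Unicode replacement character).

    Hypothetical helper for illustration; not part of DictionaryPC.
    """
    # The incremental decoder buffers incomplete multi-byte sequences, so a
    # character split across two chunks is still decoded correctly.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = src.read(chunk_size)
        # An empty chunk means end of input; final=True flushes the buffer.
        text = decoder.decode(chunk, final=not chunk)
        dst.write(text.encode("utf-8"))
        if not chunk:
            break
```

Both streams would be opened in binary mode (`"rb"` and `"wb"`), with the cleaned copy written to a new file rather than over the original dump.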

rdoeffinger · 2019-06-16T08:37:23Z

Really, really short answer: Wiktionary really ought to run XML validation on their data, in which case they would catch and fix this themselves instead of us having to deal with bad data...

rdoeffinger · 2020-04-11T23:58:57Z

This time I got this error intermittently when running dictionary generation multiple times.
That suggests a thread synchronization problem or some other race condition here.
I.e. it is not related to the Wiktionary data itself, at least in that case...

rdoeffinger · 2020-04-25T09:58:14Z

I think it might be fixed actually... I've run it quite a few times and not seen this anymore.
If anyone is still interested, can you test as well?
Otherwise I might close this ticket.

kenden mentioned this issue Jun 6, 2019

Add more detailed steps to update the dictionaries #112

Open
