Error running ./WiktionarySplitter.sh #81

Open
Gitsaibot opened this issue Jan 6, 2018 · 8 comments

Comments

@Gitsaibot

Gitsaibot commented Jan 6, 2018

I always get this error when I try to run ./WiktionarySplitter.sh. What can I do to avoid it? I am using a Debian 9 system.

endPage: hoggeries, count=2800000
title with colon: Reconstruction:Proto-Germanic/bikjǭ
Exception during parse, lastPageTitle=testamentation, titleBuilder=naseinai
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 94192868; columnNumber: 8; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:105)
at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

@rdoeffinger
Owner

As best I can tell from the messages, the XML file is not valid UTF-8. A newer or different version of Xerces might be less picky, but I doubt it.
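
To confirm that diagnosis independently of the XML parser, one option is to scan the dump with a strict UTF-8 decoder and report where it first fails. The following is a minimal sketch, not part of DictionaryPC; the class name `Utf8Check` and the method `firstInvalidOffset` are hypothetical names chosen for illustration. It uses only the standard `java.nio.charset` API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DictionaryPC.
class Utf8Check {
    /**
     * Returns the byte offset of the first malformed UTF-8 sequence in the
     * stream, or -1 if the whole stream decodes cleanly.
     */
    static long firstInvalidOffset(InputStream in) throws IOException {
        // REPORT makes the decoder signal errors instead of silently replacing them.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(1 << 16);
        CharBuffer chars = CharBuffer.allocate(1 << 16);
        long consumed = 0; // bytes successfully decoded so far
        boolean eof = false;
        while (!eof) {
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n < 0) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            while (true) {
                int before = bytes.position();
                CoderResult result = decoder.decode(bytes, chars, eof);
                consumed += bytes.position() - before;
                chars.clear(); // the decoded text itself is not needed
                if (result.isError()) {
                    return consumed; // position stops at the start of the bad sequence
                }
                if (result.isUnderflow()) {
                    break; // need more input (or finished)
                }
            }
            bytes.compact(); // keep a trailing incomplete sequence for the next read
        }
        return -1;
    }
}
```

Passing a `FileInputStream` for `data/inputs/enwiktionary-pages-articles.xml` would give a byte offset that can be compared against the line and column in the `SAXParseException`. If this returns -1 while the splitter still fails, the file itself is probably not the problem.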

@Gitsaibot
Author

Do I have to run ./WiktionarySplitter.sh if I use my own DE-EN.txt file, or can I generate the dictionary directly? I found test files in DictionaryPC that I want to try...

@rdoeffinger
Owner

You need WiktionarySplitter (and likewise the scripts that download the Wiktionary data) only if you actually want to use the data from Wiktionary. So I guess the answer should be "no".

@kenden

kenden commented Jun 6, 2019

I'm getting a similar issue:

$ ./WiktionarySplitter.sh 
(...)
title with colon: Reconstruction:Proto-Iranian/páyHah
title with colon: Reconstruction:Proto-Germanic/marjaną
title with colon: Reconstruction:Proto-Indo-Iranian/mazǰʰás
title with colon: Reconstruction:Sanskrit/स्यालभार्या
title with colon: Reconstruction:Old Persian/𐎲𐎡𐎺𐎼
Exception during parse, lastPageTitle=femsplained, titleBuilder=Waidbruck of file data/inputs/enwiktionary-pages-articles.xml
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/home/redacted/dev/DictionaryPC/data/inputs/enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:330)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:113)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
	... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

@rdoeffinger
Owner

I guess the best you can do is to fix the encoding at this location:
enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence
Ideally, also report the issue to Wiktionary, as it seems they have broken data...
I haven't checked whether the XML parser can be configured to be more permissive, but unfortunately I suspect it can't.
Running iconv from UTF-8 to UTF-8 on the XML file might also work to clean up the broken encoding.
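
A Java-side alternative to the iconv pass is to decode the bytes leniently before the parser ever sees them. The sketch below is an assumption about how that could look, not the project's actual code; `LenientUtf8`, `lenientReader`, `parse` and `cleanCopy` are hypothetical names. When a SAX parser is given a character stream through `InputSource`, it does not do its own byte decoding, so malformed sequences are replaced with U+FFFD instead of raising `MalformedByteSequenceException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of DictionaryPC.
class LenientUtf8 {
    /** Wraps a byte stream in a Reader that replaces malformed UTF-8 with U+FFFD. */
    static Reader lenientReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    /** Parses XML from already-decoded characters, bypassing the parser's UTF-8 reader. */
    static void parse(InputStream in, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(lenientReader(in)), handler);
    }

    /** Rough equivalent of a UTF-8 to UTF-8 cleanup pass: rewrites the input as valid UTF-8. */
    static void cleanCopy(InputStream in, OutputStream out) throws IOException {
        Reader reader = lenientReader(in);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[1 << 16];
        for (int n; (n = reader.read(buffer)) != -1; ) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }
}
```

Note the trade-off: this replaces each bad sequence with U+FFFD rather than dropping it, so the affected entry is still damaged, just parseable. With iconv itself, the `-c` option is what makes it omit invalid input instead of stopping at the first bad byte.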

@rdoeffinger
Owner

Really short answer: Wiktionary ought to run XML validation on their data; then they would catch and fix this themselves instead of leaving us to deal with bad data...

@rdoeffinger
Owner

This time I got this error at random points when running dictionary generation multiple times.
That suggests a thread synchronization problem or some other race condition here.
I.e., at least in that case, it is not related to the Wiktionary data itself...
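
When chasing an intermittent failure like this, it can help to separate the primary error from follow-on noise. The message "Write end dead" is one that `java.io.PipedInputStream` throws when the thread that last wrote to the pipe has exited without closing it. Whether DictionaryPC uses piped streams in this code path is an assumption on my part; if it does, the trailing "Error writing to file" lines in the logs above may simply be a consequence of the thread that hit the `SAXParseException` dying, not a separate bug. A minimal sketch that reproduces the message in isolation (hypothetical class name `WriteEndDeadDemo`):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Hypothetical demo, not part of DictionaryPC.
class WriteEndDeadDemo {
    /** Returns the message of the IOException the reader sees, or null if none. */
    static String readAfterWriterDies() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        Thread writer = new Thread(() -> {
            try {
                out.write('x');
                // The thread ends here WITHOUT out.close(), much like a thread
                // that is terminated by an uncaught exception.
            } catch (IOException ignored) {
                // not expected in this demo
            }
        });
        writer.start();
        writer.join(); // make sure the writer thread is dead before reading on

        in.read(); // consumes the buffered 'x'
        try {
            in.read(); // pipe is empty, writer is dead and never closed
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

Closing the output side in a `finally` block (or try-with-resources) on the writing thread lets readers see a normal end of stream instead of this exception.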

@rdoeffinger
Owner

I think it might actually be fixed... I've run it quite a few times and haven't seen this anymore.
If anyone is still interested, can you test as well?
Otherwise I might close this ticket.

3 participants