Fix Oracle CLOB read: charset-aware character stream #1679
Merged
DbDriver.getLOB's Oracle branch was calling clob.getAsciiStream().read(byte[])
and then constructing a String via the JVM's platform-default charset. Both
steps lose information for any non-ASCII content:
- getAsciiStream coerces each char into a single byte; for U+2013 (en-dash)
that low byte is 0x13, an invalid XML 1.0 character. The same mechanism
has produced 0x1C control chars and apparent U+FFFD sequences in
BioModel CLOBs read by the GUI client (e.g. biomodels 311226221 and
311875206 from users shiVcell / mblinov).
- new String(byte[]) decodes via Charset.defaultCharset(), so the same
bytes decode differently under Cp1252 on Windows and UTF-8 on
Linux/macOS.
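The two lossy steps can be reproduced without a database. The following is a minimal sketch, not the VCell code: the class and method names are hypothetical, the cast to byte only mimics the low-byte truncation described above, and ISO-8859-1 stands in for Cp1252 because it is guaranteed to be available on every JVM and maps 0xB5 to the same character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not call the Oracle driver.
class LossyClobReadDemo {
    // Keeps only the low 8 bits of a UTF-16 code unit, mimicking the
    // truncation attributed to getAsciiStream in the description above.
    public static byte lowByte(char c) {
        return (byte) c;
    }

    // Decodes with an explicit charset, so the effect of different
    // platform defaults can be compared side by side.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Step 1: the en-dash U+2013 loses its high byte and becomes 0x13.
        System.out.printf("en-dash low byte: 0x%02X%n", lowByte('\u2013') & 0xFF);

        // Step 2: a surviving 0xB5 byte decodes differently per charset.
        byte[] micro = { lowByte('\u00B5') };
        String latin = decode(micro, StandardCharsets.ISO_8859_1);
        String utf8 = decode(micro, StandardCharsets.UTF_8);
        System.out.printf("ISO-8859-1 -> U+%04X%n", (int) latin.charAt(0));
        System.out.printf("UTF-8      -> U+%04X%n", (int) utf8.charAt(0));
    }
}
```

A lone 0xB5 is a malformed UTF-8 sequence, so the UTF-8 decode substitutes the replacement character, which would be consistent with the apparent U+FFFD sequences mentioned above.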
Replace both with clob.getCharacterStream() and a char[] buffer sized by
clob.length() (which is in characters per the JDBC spec). Net effect: every
caller of getLOB on Oracle now sees the actual stored Unicode, and CLOBs
with multi-byte UTF-8 sequences (en-dashes, μ, etc.) load correctly.
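A character-stream read along these lines might look like the sketch below. It is an illustration under assumptions, not the merged DbDriver.getLOB code: ClobReader and readClob are hypothetical names, and the main method uses the JDK's SerialClob as a stand-in for an Oracle CLOB.

```java
import java.io.IOException;
import java.io.Reader;
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

// Hypothetical helper; names are illustrative only.
class ClobReader {
    // Reads the whole CLOB as characters. No byte-to-char decoding
    // happens in this code, so no charset choice is involved here.
    public static String readClob(Clob clob) throws SQLException, IOException {
        long length = clob.length(); // length in characters
        if (length > Integer.MAX_VALUE) {
            throw new IOException("CLOB too large for one char[]: " + length);
        }
        char[] buffer = new char[(int) length];
        try (Reader reader = clob.getCharacterStream()) {
            int filled = 0;
            // A single read() may return fewer chars than requested,
            // so keep reading until the buffer is full or the stream ends.
            while (filled < buffer.length) {
                int n = reader.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return new String(buffer, 0, filled);
        }
    }

    public static void main(String[] args) throws Exception {
        // SerialClob is an in-memory stand-in for a database CLOB.
        Clob clob = new SerialClob("Na\u2013K pump, 5 \u00B5M".toCharArray());
        String text = readClob(clob);
        System.out.println(text.length() + " chars, en-dash kept: "
                + (text.indexOf('\u2013') >= 0));
    }
}
```

The loop matters because Reader.read is allowed to return fewer characters than requested; sizing the buffer from clob.length() only sets an upper bound on what is read.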
Verified locally against the two failing biomodels: previously SAX rejected
"Unicode 0x13 in attribute Name" / "0x1C in SbmlName" during
ServerDocumentManager.getBioModelXML; with the fix, both rehydrate, build
full BioModel objects, and pass updateAll(true). The stored CLOBs were never
corrupt; the read path was.
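Whether a string contains characters that XML 1.0 forbids can be checked directly against the Char production of the XML 1.0 specification. The sketch below is one way to do that; XmlCharScan and its methods are hypothetical names, and this is not the scan that was actually run against the database.

```java
// Hypothetical helper; not part of VCell.
class XmlCharScan {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first invalid character, or -1 if none.
    // Iterates by code point so that valid surrogate pairs are accepted
    // while unpaired surrogates are flagged.
    public static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!isXml10Char(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        // An intact en-dash is valid; the truncated value 0x13 is not.
        System.out.println(firstInvalidIndex("A\u2013B"));
        System.out.println(firstInvalidIndex("A\u0013B"));
    }
}
```

Both 0x13 and 0x1C fall below 0x20 and are not among the three permitted control characters, which is why SAX rejects them.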
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
DbDriver.getLOB's Oracle branch was reading every CLOB through clob.getAsciiStream() into a byte[], then converting that byte buffer to a String via the JVM's platform-default charset. Both steps are lossy:
- getAsciiStream coerces each character into a single byte (truncating to the low 8 bits); a stored – (en-dash, U+2013) becomes 0x13, an invalid XML 1.0 control character. A stored μ (U+00B5) becomes 0xB5, which is then mis-decoded again on the new String(byte[]) step.
- new String(byte[]) uses Charset.defaultCharset() (Cp1252 on legacy Windows, UTF-8 on modern macOS/Linux), so the decoded result was platform-dependent for any non-ASCII byte that survived step 1.
Both are replaced with clob.getCharacterStream() reading into a char[] sized by clob.length() (chars, per JDBC spec). One-method change, Oracle branch only; the Postgres branch already used rs.getString and was correct.
Why
The two biomodels (311226221 / shiVcell, 311875206 / mblinov) that have been failing to load in VCellClientMain from the database both contain en-dash characters in reaction names. The stored CLOBs are intact: a direct clob.getCharacterStream() scan of all 116k biomodel + mathmodel rows finds zero invalid XML chars. The corruption was being injected at read time by the buggy getLOB, and SAX then rejected the result on the client. PR #1676 (charset hygiene on import paths) and PR #1677 (input validation on setSbmlName/setName) were both addressing the consequences; this PR fixes the actual cause.
Test plan
- mvn -pl vcell-server,vcell-admin -am -DskipTests install
- Before the fix: serverDocumentManager.getBioModelXML(qh, mblinov, 311875206, false) → SAX "invalid XML char Unicode 0x13 in attribute Name"
- With the fix: XMLToBioModel succeeds, updateAll(true) succeeds, returned XML preserves the original 12 en-dashes (0xE2 0x80 0x93) and contains zero invalid bytes
- shiVcell).
- getLOB: kinetics, geometry curves, math descriptions, analysis tasks).
Notes
getLOB callers (MathDescTable, GeomDbDriver, SimulationContextDbDriver, generic clob_text) get the same fix automatically: any non-ASCII char they were silently mangling is now preserved.
🤖 Generated with Claude Code