@jpallas
Contributor

@jpallas jpallas commented Mar 13, 2020

Oracle has a non-standard interpretation of character field lengths. The SQL standard and other implementations say that the length of a character field is in characters unless explicitly specified to be in bytes. Oracle does it the other way around (of course). This means that the library's Schema::addTableFromRow method can produce a table with columns that are not wide enough (when multi-byte characters are involved) if the source row comes from a non-Oracle database and the destination is an Oracle database.

This change makes the Oracle flavor specify that character lengths are in characters.
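
To illustrate the idea, here is a minimal sketch of what "specify that character lengths are in characters" looks like in generated DDL. The class and method names are hypothetical and are not the library's actual API; the only Oracle-specific fact relied on is the `varchar2(n char)` / `varchar2(n byte)` qualifier syntax.

```java
// Hypothetical helper, not part of the library: shows the DDL difference only.
final class OracleLengthSketch {
    private OracleLengthSketch() {}

    /** Column type with explicit character semantics, e.g. "varchar2(50 char)". */
    static String varchar2Chars(int length) {
        requirePositive(length);
        return "varchar2(" + length + " char)";
    }

    /** Column type with explicit byte semantics, e.g. "varchar2(50 byte)". */
    static String varchar2Bytes(int length) {
        requirePositive(length);
        return "varchar2(" + length + " byte)";
    }

    /** Builds a one-column CREATE TABLE statement using character semantics. */
    static String createTable(String table, String column, int length) {
        return "create table " + table + " (" + column + " " + varchar2Chars(length) + ")";
    }

    private static void requirePositive(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be positive: " + length);
        }
    }
}
```

With an unqualified `varchar2(50)`, Oracle falls back to the session or database length semantics, which is the ambiguity the explicit qualifier removes.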

@jpallas jpallas requested review from garricko and jmesterh March 13, 2020 18:59
@garricko
Collaborator

This gets worse the more I read. So if I understand correctly:

Oracle interprets varchar(n) as bytes or chars based on database/session configuration (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-0DC7FFAA-F03F-4448-8487-F2592496A510) and therefore recommends explicitly specifying which you want.

Microsoft always treats them as bytes (https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15#arguments).

PostgreSQL always treats them as char (https://www.postgresql.org/docs/9.6/datatype-character.html).

So likely still problems if SQL Server is the destination, but at least we will have deterministic behavior on Oracle, and closer to the standard.
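
The disagreement above comes down to which unit `n` counts. As a rough sketch (the class name is made up for illustration, and UTF-8 is only an assumed encoding, since the thread does not say which character sets the databases use), the same Java string can be measured three ways:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: measures one string in the units a length limit might use.
final class LengthUnitsSketch {
    private LengthUnitsSketch() {}

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of bytes when encoded as UTF-8 (an assumed storage encoding). */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For `"na\u00efve"` the three counts are 5, 5, and 6; for a single emoji such as `"\uD83D\uDE00"` they are 1, 2, and 4. A limit of `n` therefore admits different values depending on which of these a flavor means.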

Collaborator

@garricko garricko left a comment

I don't like the ambiguity of length, but I guess it reflects the reality (either byte or char depending on flavor).

@jpallas
Contributor Author

jpallas commented Mar 24, 2020

@garricko Thanks for uncovering the MS SQL Server info. (Side note: I love the MS line "A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters." Maybe it's a common misconception because that's what the SQL standard specifies?)

That actually puts a different light on what we are seeing with the ETLs. Previously I thought that the SQL Server VARCHAR(50) field was holding 50 characters. Now I realize that it is holding 50 bytes in some unknown encoding that turns into 50 characters of UTF-16 by the time JDBC delivers it to the client.

@garricko
Collaborator

The way I read those docs, things should be ok going from mssql to oracle. However many characters fit in 50 bytes in mssql should come back as <= 50 characters of Java string (UTF-16), and should fit into oracle varchar2(50 char). It seems it should also fit into oracle varchar(50 byte) unless there are characters in whatever codeset MS uses where the equivalent character in whatever codeset oracle uses has more bytes. Or unless the conversion into Java ends up non-normalized (in the Unicode sense). On second thought, re-reading what I just wrote, I guess I wouldn’t be shocked if some value didn’t fit.
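
A small sketch of the two failure modes mentioned here, under loudly stated assumptions: ISO-8859-1 stands in for whatever single-byte code page SQL Server uses, UTF-8 stands in for whatever character set Oracle uses, and the class and method names are invented for illustration. Neither encoding is confirmed anywhere in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper: checks whether a value fits a byte or character limit.
final class RoundTripSketch {
    private RoundTripSketch() {}

    /** Decodes stored bytes using ISO-8859-1 (assumed stand-in for the source code page). */
    static String decodeSingleByte(byte[] stored) {
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    /** True if the value fits a byte limit when encoded as UTF-8 (assumed destination encoding). */
    static boolean fitsInBytes(String value, int limit) {
        return value.getBytes(StandardCharsets.UTF_8).length <= limit;
    }

    /** True if the value fits a limit counted in Unicode code points. */
    static boolean fitsInChars(String value, int limit) {
        return value.codePointCount(0, value.length()) <= limit;
    }

    /** Decomposed (NFD) form, to show how normalization can change the character count. */
    static String decompose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD);
    }
}
```

Fifty bytes of `0xE9` decode to fifty `é` characters. That string passes `fitsInChars(value, 50)` but fails `fitsInBytes(value, 50)`, because each `é` takes two bytes in UTF-8. After `decompose`, each `é` becomes two code points, so even the character limit is exceeded.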

@jpallas
Contributor Author

jpallas commented Mar 25, 2020

> On second thought, re-reading what I just wrote, I guess I wouldn’t be shocked if some value didn’t fit.

That is precisely the situation that prompted this. The ETL copying LPCH Clarity on MS SQL to Oracle keeps hitting cases where the data returned from SQL Server doesn't fit in an Oracle field of the same length.

@jpallas jpallas merged commit 77d5cf6 into master Mar 30, 2020
@jpallas jpallas deleted the oracle_chars branch May 28, 2020 00:28