two-byte characters in char and varchar are truncated to one byte #1294

Ceshion · 2021-07-21T21:27:43Z

Duplicate of #723

Char and varchar parameters are currently encoded purely as ASCII, but char and varchar columns in MSSQL can support the entire BMP (i.e. one- and two-byte UTF-16 characters). On read this is handled by decoding records with iconv, but on write, storing a two-byte character such as "\u2021" (‡) in a char or varchar column currently truncates that character to only its low (second) byte, so an incorrect character such as "\u0021" (!) is stored instead.
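The truncation can be reproduced with Node's `Buffer` alone, without tedious. This is a minimal sketch; `encodeAsAscii` is a hypothetical helper name, not part of tedious, and it only mirrors the `Buffer.from(value, 'ascii')` call used for string values.

```typescript
// Hypothetical helper: encodes a string the way a plain 'ascii' conversion does.
// Node keeps only the low byte of each UTF-16 code unit when encoding as 'ascii'.
function encodeAsAscii(value: string): Buffer {
  return Buffer.from(value, 'ascii');
}

const stored = encodeAsAscii('\u2021'); // U+2021 DOUBLE DAGGER
console.log(stored);                    // <Buffer 21>
console.log(stored.toString('ascii'));  // !
```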

We could resolve this by encoding char and varchar parameters as "ucs2" with nchar and nvarchar type IDs when serializing them into the RPC stream, as in the attached PR and similar to the approach used by ADO and JDBC per the linked issue. Is there any reason not to do this?

Only changing the encoding does not work, since it seems like downstream from tedious the bytes are read as separate characters.
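A rough illustration of why changing only the encoding is not enough: if the value is sent as UCS-2 bytes but still labeled with a single-byte type, each byte can end up being read as its own character. The sketch below assumes the consumer decodes with a single-byte codepage (Latin-1 is used here as a stand-in); `encodeAsUcs2` is a hypothetical helper name.

```typescript
// Hypothetical helper: encodes a string as little-endian UTF-16 code units.
function encodeAsUcs2(value: string): Buffer {
  return Buffer.from(value, 'ucs2');
}

const bytes = encodeAsUcs2('\u2021');
console.log(bytes); // <Buffer 21 20>

// Assumption: the receiving side treats every byte as one character.
const misread = bytes.toString('latin1');
console.log(JSON.stringify(misread)); // "! "
```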

arthurschreiber · 2021-07-22T08:10:01Z

👋 @Ceshion Hey there!

> Is there any reason not to do this?

Well, yeah, because it's not doing what it's supposed to. 😬 Your solution "works", but it basically puts the task of converting from multibyte characters to single byte characters on the server, instead of handling this at the client level. If a user wants this conversion to happen on the database level, they should be using nvarchar/nchar directly. 😅

SQL Server's varchar and char do not store UTF-16 encoded characters; they store characters in whatever encoding is specified on the database, table, or column. Some of these encodings support multibyte characters, but their byte sequences don't have to correspond to the bytes used in UTF-16 / UCS-2.

Based on the discussion over in #723, I think the proper fix would be to:

- Allow specifying the target encoding / collation for varchar and char parameters.
- Transcode passed-in values from UCS-2 to the correct target encoding.
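As a rough sketch of what client-side transcoding means, the function below converts a JavaScript string to Windows-1252 bytes. Everything here is hypothetical and deliberately incomplete: `CP1252_EXTRAS` covers only three characters, and a real implementation would rely on a full codec library and the column's actual collation rather than a hand-written table.

```typescript
// Hypothetical, incomplete table: Unicode code point -> Windows-1252 byte.
const CP1252_EXTRAS = new Map<number, number>([
  [0x20ac, 0x80], // EURO SIGN
  [0x2020, 0x86], // DAGGER
  [0x2021, 0x87], // DOUBLE DAGGER
]);

// Hypothetical helper: transcodes UTF-16 code units to Windows-1252 bytes.
// Characters missing from the table become '?' (0x3f); a surrogate pair
// therefore turns into two '?' bytes.
function encodeCp1252Subset(value: string): Buffer {
  const out = Buffer.alloc(value.length);
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    if (unit < 0x80 || (unit >= 0xa0 && unit <= 0xff)) {
      out[i] = unit; // ASCII and the upper Latin-1 range map 1:1 in Windows-1252
    } else {
      const mapped = CP1252_EXTRAS.get(unit);
      out[i] = mapped === undefined ? 0x3f : mapped;
    }
  }
  return out;
}

console.log(encodeCp1252Subset('\u2021')); // <Buffer 87>
```

With this mapping, "\u2021" becomes the single byte 0x87 that a Windows-1252 column uses for ‡, instead of being truncated to 0x21.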

What do you think? Does that make sense to you? 🙇‍♂️

Ceshion · 2021-07-22T13:46:16Z

Hi @arthurschreiber! I appreciate you bearing with me here. I have only been learning most of the relevant info about encodings over the past few days, but I'm trying my best to keep up! Yes, I see the error in my explanation and understanding: we can't store the entire BMP in one column, and characters are single-byte, just within a specific codepage. Thank you for pointing that out 😁

🤔 My initial thought had been that the server already has the information it needs to convert whichever multibyte characters it can for a given column (based on the collation, which it knows about), so it would take less negotiation to just let it convert what is technically a Unicode parameter to whatever it should be. A client would need to get that info from the server anyway (or else know specific details about the server ahead of time), wouldn't it? Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.

What do you think? Should we still put the responsibility of interpreting the correct codepage for a column (in a specific table and database) on the client?

arthurschreiber · 2021-07-22T14:59:35Z

> Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.

That is a valid decision to make, and it's something that I think the application that uses tedious can already do, by using nvarchar / nchar parameters directly. This has the same effect as the patch you proposed, with the additional "benefit" of not muddying the waters between n(var)char and (var)char.

This probably requires better documentation, something along the lines of "if you just want to send JavaScript string values to the database, use nvarchar and nchar, even if your target columns are varchar and char". 🤷

On another note, I still think we should fix the varchar and char data types to encode strings correctly from UCS2 to either what the user specifies as the collation to use (or the default database collation if no explicit collation was specified), and maybe also support passing in Buffer values (where no conversion would happen because we can assume the user knows what they're doing).
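The idea of transcoding strings while passing `Buffer` values through untouched could look roughly like the sketch below. The names `Encoder` and `toVarcharBytes` are hypothetical and not tedious's API; the Latin-1 encoder is only a stand-in for whatever codepage the chosen collation implies.

```typescript
// Hypothetical type: turns a JavaScript string into bytes for the target codepage.
type Encoder = (value: string) => Buffer;

// Hypothetical helper: strings are transcoded, Buffers are assumed to be
// already encoded by the caller and are passed through unchanged.
function toVarcharBytes(value: string | Buffer, encode: Encoder): Buffer {
  if (Buffer.isBuffer(value)) {
    return value;
  }
  return encode(value);
}

// Stand-in encoder for a Latin-1 style codepage.
const latin1Encoder: Encoder = (s) => Buffer.from(s, 'latin1');

console.log(toVarcharBytes('\u00e9', latin1Encoder));             // <Buffer e9>
console.log(toVarcharBytes(Buffer.from([0x87]), latin1Encoder)); // <Buffer 87>
```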

Ceshion · 2021-07-22T16:00:03Z

Oh I agree, allowing a consumer to specify a codepage for var/char (and text for that matter) definitely seems preferable to automatically interpreting whatever is sent as already correctly encoded. And it looks like we do currently support using Buffer values:

tedious/src/data-types/varchar.ts, lines 102 to 119 in ceb73d3:

    if (parameter.length! <= this.maximumLength) {
      if (Buffer.isBuffer(value)) {
        yield value;
      } else {
        yield Buffer.from(value, 'ascii');
      }
    } else {
      const length = Buffer.byteLength(value, 'ascii');
      if (length > 0) {
        const buffer = Buffer.alloc(4);
        buffer.writeUInt32LE(length, 0);
        yield buffer;
        if (Buffer.isBuffer(value)) {
          yield value;
        } else {
          yield Buffer.from(value, 'ascii');

So I can take the idea of using unicode parameters instead higher up the chain (again 😅) and work on adding encoding options.

arthurschreiber · 2021-08-31T16:58:28Z

@Ceshion have you tried the latest tedious@12.2.0? Tedious now comes with proper collation support for varchar/char/text (as "proper" as SQL Server allows us to be). It also now supports UTF-8 encoding when used together with SQL Server 2019 or Azure SQL.

Can this issue be closed?

This was referenced Jul 21, 2021:

- use nchar and nvarchar type IDs plus ucs2 encoding for char and varchar #1295 (Closed)
- non-ascii characters assigned to var/char columns in SQL are truncated to one byte typeorm/typeorm#7932 (Closed)

arthurschreiber mentioned this issue Jul 22, 2021:

- VarChar encoding problem. #723 (Open)
