non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

Ceshion · 2021-07-19T16:40:05Z

Issue Description

We are using MS-SQL in typeorm, and our data model includes some varchar columns. When we are inserting or updating records using strings that contain non-ASCII characters, those characters are stored without any errors, but using only their second byte. For example, if we store "\u2021" (‡), the value actually stored to the database will be equal to "\u0021" (!). Note also that the object returned from e.g. save shows the original value, "\u2021" (‡). The collation setting for the database, tables, and columns are all SQL_Latin1_General_CP1_CI_AS, and the characters we are attempting to store can therefore be represented by the corresponding codepage.

Expected Behavior

If a value can be stored correctly in a given var/char column in MS-SQL based on the relevant collation settings, we would expect that it would be stored unchanged. If it cannot, we would expect an error to be thrown when trying to store it. It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

Actual Behavior

Attempting to store any value that contains characters that do not fit into one byte in MS-SQL results in each character being stored as if only the second byte were set.

Steps to Reproduce

Set up a project with the mssql driver, and an ormconfig with a type: 'mssql' connection which has an entity type that has at least one char or varchar column
Store any string into that or those columns that contain a character that must be encoded in ucs2 or above
Inspect the actual value stored to the database

./index.ts

import {
    createConnection,
    getConnection,
    Repository
} from 'typeorm';
import { OneModel } from './entity/OneModel';

async function main() {
    await createConnection();
    const connection = getConnection();
    const repository: Repository<OneModel> =
        connection.getRepository(OneModel);

    const one = new OneModel();
    one.content = '\u2021';

    const claimedSavedOne = await repository.save(one);
    const actualSavedOne = 
        await repository.findOne({ order: { created: 'DESC' } });

    // expect(claimedSavedOne.content).toBe(one.content) // assertion succeeds
    // expect(actualSavedOne.content).toBe(one.content); // assertion fails
    // or
    console.log(claimedSavedOne.content === one.content); // Output: true
    console.log(claimedSavedOne.content, one.content); // Output: ‡ ‡

    console.log(actualSavedOne?.content === one.content); // Output: false
    console.log(actualSavedOne?.content, one.content); // Output: ! ‡
}

main().then(() => process.exit(0));

./entity/OneModel.ts

import {
    Entity,
    PrimaryGeneratedColumn,
    Column,
    CreateDateColumn
} from 'typeorm';

@Entity()
export class OneModel {

    @PrimaryGeneratedColumn('uuid')
    id?: string;

    @CreateDateColumn({ type: 'datetime' })
    created?: Date;

    @Column('varchar', { length: 10 })
    content: string = '';
}

My Environment

Dependency	Version
Operating System	Windows 10.2004 (build 19041.1110) and macOS v11.2.3
Node.js version	v14.15.5 and v8.9.3
Typescript version	v4.3.5 and v2.9.2
TypeORM version	v0.2.34 and v0.2.22
`mssql` driver version	v7.1.3 and v4.1.0
Database version	Microsoft SQL Server 2019 Express and Microsoft SQL Server 2017

Additional Context

I have tracked down the point where truncation occurs to be within tedious' data-types/varchar module, in which varchar values are converted to bytes to be written into the SQL RPC request stream using Buffer.from(value, 'ascii'). Based on this issue it seems that the onus is on consumers to use nvarchar parameters when appropriate as a solution to this issue, even when setting a varchar column.

Relevant Database Driver(s)

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.
Yes, I have the time, but I don't know how to start. I would need guidance.
No, I don't have the time, although I believe I could do it if I had the time...
No, I don't have the time and I wouldn't even know how to start.

The text was updated successfully, but these errors were encountered:

convert typeorm char and varchar parameters to n' ' in order to allow saving non-ascii characters Closes: typeorm#7932

imnotjames · 2021-07-20T05:28:16Z

It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

This isn't a TypeORM issue, is it? We don't mangle anything here. Wouldn't this be a problem for the database itself or for the mssql node driver?

pleerock · 2021-07-20T11:22:03Z

as it is mentioned in MSDN varchar/char is non-Unicode string data, and nvarchar/nchar is unicode string data. so when you have a Unicode string data you should use nvarchar.

as mentioned here. I also don't understand why you are trying to store unicode characters on something that doesn't support unicode. It doesn't make sense to do that and bring monkey patches to it. Unless I didn't get something.

Ceshion · 2021-07-20T11:51:08Z

It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

This isn't a TypeORM issue, is it? We don't mangle anything here. Wouldn't this be a problem for the database itself or for the mssql node driver?

In theory I totally agree that converting like this is something the database driver (or tedious itself, really) should be handling, however per the linked tedious issue, solving it there seems prohibitively difficult. Following this comment, it seems that just using nvar/char parameters by default is a common solution, and as typeorm is the originator of the parameters to use, it seems appropriate to change their type here if anywhere, right?

Edit: thinking it over again, it may make more sense to handle a conversion like this in the driver- where it is invisible to the consumer i.e. typeorm, but I would like if we could still consider this solution in case another one couldn't be reached or agreed upon there.

Ceshion · 2021-07-20T12:01:00Z

as it is mentioned in MSDN varchar/char is non-Unicode string data, and nvarchar/nchar is unicode string data. so when you have a Unicode string data you should use nvarchar.

as mentioned here. I also don't understand why you are trying to store unicode characters on something that doesn't support unicode. It doesn't make sense to do that and bring monkey patches to it. Unless I didn't get something.

Before all this, that's what I thought too! However, I found that empirically, MS-SQL absolutely can store 2-byte characters in char and varchar columns, or at least it can as of SQL Server 2017. I have not tested backward unless the test runner for typeorm does.

Given that is the case though, I think we by some means need to get whatever the end consumer wants to store to the database intact, and let it decide if it can or cannot do it. In this case and to that end, storing something that cannot be represented by the codepage specified by the database's collation settings will cause an error, correctly.

Ceshion · 2021-07-20T14:06:13Z

All that said, I would be willing to write a revision for tedious and/or mssql as well- if it can be resolved there, to the satisfaction of the maintainers of those projects, it would of course make this irrelevant.

Ceshion · 2021-07-21T21:38:55Z

I opened this issue for tedious (with a PR) to attempt to address the same issue at a lower level.

…ns retains data add tests to ensure that using nchar and nvarchar parameters will correctly store data, and that the actual underlying data type has not changed. Validates typeorm#7932 test: verify that storing strings with characters that are too long for var/char throws ensure that we will fail loudly when an attempt is made to store a string that cannot be converted to the underlying type because it has too-long characters Validates typeorm#7932

…ns retains data add tests to ensure that using nchar and nvarchar parameters will correctly store data, and that the actual underlying data type has not changed. test: verify that storing strings with characters that are too long for var/char throws ensure that we will fail loudly when an attempt is made to store a string that cannot be converted to the underlying type because it has too-long characters Validates typeorm#7932

…ns retains data add tests to ensure that using nchar and nvarchar parameters will correctly store data, and that the actual underlying data type has not changed. test: verify that storing strings with characters that are too wide for var/char throws ensure that we will fail loudly when an attempt is made to store a string that cannot be converted to the underlying type because it has characters that are not in the codepage used by the database/table/column Validates typeorm#7932

…de equivalent this allows the server to handle mapping extended ASCII characters into the correct codepage for a given column without needing to pre-check collation settings Fixes typeorm#7932

Ceshion · 2021-07-27T18:13:42Z

@pleerock per discussion on tedious #1294, it appears to still be up to consumers (such as typeorm) to specify how to encode any non-ASCII characters. Today, the only option we have for that is to send parameters as unicode types regardless of the column type. Encoding strings in the consuming code and passing them as a buffer into tedious appears to be an intended use, but is currently not possible, since tedious' validation for varchar and char calls .toString on values passed to it.

As it is, I think it seems more efficient to handle this as a consumer by just using unicode parameters by default, since then there is no need to preprocess the data according to the specific configuration of a database- query for column/table/database collation, store it, use it to encode incoming data according to its intended destination. . . I think it is pretty apparent that gets messy fast.

By using unicode parameters to begin with, typeorm can avoid needing to be concerned with all of these specifics about a database and queries to run in order to store data correctly, without having to rely on a similar conversion to unicode parameters happening on a lower level- we can just be explicit about it. On the other hand, if we leave this up to consumers of typeorm to handle, then they have to make a decision between benefitting from the lower storage requirement of (var)char but not being able to store half of the characters they should be able to in any case, or accepting the larger storage requirement of n(var)char in order to even just store strings that fit in extended (single-byte) ASCII.

Or otherwise using a patch like, for example, creating a column as varchar in the database and marking it as nvarchar in its column decorator so that parameters with values to be stored in that column will be nvarchar- which does work, but is by far hackier than just automatically changing only the parameter type. Could we consider using this solution after all?

Ceshion added bug requires triage labels Jul 19, 2021

Ceshion mentioned this issue Jul 19, 2021

fix: allow char and varchar cases to fall through when creating parameters #7933

Merged

7 tasks

Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 19, 2021

fix: allow char and varchar cases to fall through

73816ff

convert typeorm char and varchar parameters to n' ' in order to allow saving non-ascii characters Closes: typeorm#7932

pleerock added needs more info and removed bug requires triage labels Jul 20, 2021

imnotjames closed this as completed in #7933 Jul 31, 2021

imnotjames removed the needs more info label Jul 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

Ceshion commented Jul 19, 2021 •

edited

Loading

imnotjames commented Jul 20, 2021

pleerock commented Jul 20, 2021

Ceshion commented Jul 20, 2021 •

edited

Loading

Ceshion commented Jul 20, 2021 •

edited

Loading

Ceshion commented Jul 20, 2021

Ceshion commented Jul 21, 2021 •

edited

Loading

Ceshion commented Jul 27, 2021 •

edited

Loading

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

Comments

Ceshion commented Jul 19, 2021 • edited Loading

Issue Description

Expected Behavior

Actual Behavior

Steps to Reproduce

./index.ts

./entity/OneModel.ts

My Environment

Additional Context

Relevant Database Driver(s)

Are you willing to resolve this issue by submitting a Pull Request?

imnotjames commented Jul 20, 2021

pleerock commented Jul 20, 2021

Ceshion commented Jul 20, 2021 • edited Loading

Ceshion commented Jul 20, 2021 • edited Loading

Ceshion commented Jul 20, 2021

Ceshion commented Jul 21, 2021 • edited Loading

Ceshion commented Jul 27, 2021 • edited Loading

Ceshion commented Jul 19, 2021 •

edited

Loading

Ceshion commented Jul 20, 2021 •

edited

Loading

Ceshion commented Jul 20, 2021 •

edited

Loading

Ceshion commented Jul 21, 2021 •

edited

Loading

Ceshion commented Jul 27, 2021 •

edited

Loading