Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

Closed
2 of 21 tasks
Ceshion opened this issue Jul 19, 2021 · 7 comments · Fixed by #7933
Closed
2 of 21 tasks

non-ascii characters assigned to var/char columns in SQL are truncated to one byte #7932

Ceshion opened this issue Jul 19, 2021 · 7 comments · Fixed by #7933

Comments

@Ceshion
Copy link
Contributor

Ceshion commented Jul 19, 2021

Issue Description

We are using MS-SQL in typeorm, and our data model includes some varchar columns. When we are inserting or updating records using strings that contain non-ASCII characters, those characters are stored without any errors, but using only their second byte. For example, if we store "\u2021" (‡), the value actually stored to the database will be equal to "\u0021" (!). Note also that the object returned from e.g. save shows the original value, "\u2021" (‡). The collation setting for the database, tables, and columns are all SQL_Latin1_General_CP1_CI_AS, and the characters we are attempting to store can therefore be represented by the corresponding codepage.

Expected Behavior

If a value can be stored correctly in a given var/char column in MS-SQL based on the relevant collation settings, we would expect that it would be stored unchanged. If it cannot, we would expect an error to be thrown when trying to store it. It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

Actual Behavior

Attempting to store any value that contains characters that do not fit into one byte in MS-SQL results in each character being stored as if only the second byte were set.

Steps to Reproduce

  1. Set up a project with the mssql driver, and an ormconfig with a type: 'mssql' connection which has an entity type that has at least one char or varchar column
  2. Store any string into that or those columns that contain a character that must be encoded in ucs2 or above
  3. Inspect the actual value stored to the database

./index.ts

import {
    createConnection,
    getConnection,
    Repository
} from 'typeorm';
import { OneModel } from './entity/OneModel';

async function main() {
    await createConnection();
    const connection = getConnection();
    const repository: Repository<OneModel> =
        connection.getRepository(OneModel);

    const one = new OneModel();
    one.content = '\u2021';

    const claimedSavedOne = await repository.save(one);
    const actualSavedOne = 
        await repository.findOne({ order: { created: 'DESC' } });

    // expect(claimedSavedOne.content).toBe(one.content) // assertion succeeds
    // expect(actualSavedOne.content).toBe(one.content); // assertion fails
    // or
    console.log(claimedSavedOne.content === one.content); // Output: true
    console.log(claimedSavedOne.content, one.content); // Output: ‡ ‡

    console.log(actualSavedOne?.content === one.content); // Output: false
    console.log(actualSavedOne?.content, one.content); // Output: ! ‡
}

main().then(() => process.exit(0));

./entity/OneModel.ts

import {
    Entity,
    PrimaryGeneratedColumn,
    Column,
    CreateDateColumn
} from 'typeorm';

@Entity()
export class OneModel {

    @PrimaryGeneratedColumn('uuid')
    id?: string;

    @CreateDateColumn({ type: 'datetime' })
    created?: Date;

    @Column('varchar', { length: 10 })
    content: string = '';
}

My Environment

Dependency Version
Operating System Windows 10.2004 (build 19041.1110) and macOS v11.2.3
Node.js version v14.15.5 and v8.9.3
Typescript version v4.3.5 and v2.9.2
TypeORM version v0.2.34 and v0.2.22
mssql driver version v7.1.3 and v4.1.0
Database version Microsoft SQL Server 2019 Express and Microsoft SQL Server 2017

Additional Context

I have tracked down the point where truncation occurs to be within tedious' data-types/varchar module, in which varchar values are converted to bytes to be written into the SQL RPC request stream using Buffer.from(value, 'ascii'). Based on this issue it seems that the onus is on consumers to use nvarchar parameters when appropriate as a solution to this issue, even when setting a varchar column.

Relevant Database Driver(s)

  • aurora-data-api
  • aurora-data-api-pg
  • better-sqlite3
  • cockroachdb
  • cordova
  • expo
  • mongodb
  • mysql
  • nativescript
  • oracle
  • postgres
  • react-native
  • sap
  • sqlite
  • sqlite-abstract
  • sqljs
  • sqlserver

Are you willing to resolve this issue by submitting a Pull Request?

  • Yes, I have the time, and I know how to start.
  • Yes, I have the time, but I don't know how to start. I would need guidance.
  • No, I don't have the time, although I believe I could do it if I had the time...
  • No, I don't have the time and I wouldn't even know how to start.
Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 19, 2021
convert typeorm char and varchar parameters to n' ' in order to allow saving non-ascii characters

Closes: typeorm#7932
@imnotjames
Copy link
Contributor

It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

This isn't a TypeORM issue, is it? We don't mangle anything here. Wouldn't this be a problem for the database itself or for the mssql node driver?

@pleerock
Copy link
Member

as it is mentioned in MSDN varchar/char is non-Unicode string data, and nvarchar/nchar is unicode string data. so when you have a Unicode string data you should use nvarchar.

as mentioned here. I also don't understand why you are trying to store unicode characters on something that doesn't support unicode. It doesn't make sense to do that and bring monkey patches to it. Unless I didn't get something.

@Ceshion
Copy link
Contributor Author

Ceshion commented Jul 20, 2021

It seems that determining whether a value can be stored or not should be the responsibility of the database server, and we could just forward any error from it to the end consumer.

This isn't a TypeORM issue, is it? We don't mangle anything here. Wouldn't this be a problem for the database itself or for the mssql node driver?

In theory I totally agree that converting like this is something the database driver (or tedious itself, really) should be handling, however per the linked tedious issue, solving it there seems prohibitively difficult. Following this comment, it seems that just using nvar/char parameters by default is a common solution, and as typeorm is the originator of the parameters to use, it seems appropriate to change their type here if anywhere, right?

Edit: thinking it over again, it may make more sense to handle a conversion like this in the driver- where it is invisible to the consumer i.e. typeorm, but I would like if we could still consider this solution in case another one couldn't be reached or agreed upon there.

@Ceshion
Copy link
Contributor Author

Ceshion commented Jul 20, 2021

as it is mentioned in MSDN varchar/char is non-Unicode string data, and nvarchar/nchar is unicode string data. so when you have a Unicode string data you should use nvarchar.

as mentioned here. I also don't understand why you are trying to store unicode characters on something that doesn't support unicode. It doesn't make sense to do that and bring monkey patches to it. Unless I didn't get something.

Before all this, that's what I thought too! However, I found that empirically, MS-SQL absolutely can store 2-byte characters in char and varchar columns, or at least it can as of SQL Server 2017. I have not tested backward unless the test runner for typeorm does.

Given that is the case though, I think we by some means need to get whatever the end consumer wants to store to the database intact, and let it decide if it can or cannot do it. In this case and to that end, storing something that cannot be represented by the codepage specified by the database's collation settings will cause an error, correctly.

@Ceshion
Copy link
Contributor Author

Ceshion commented Jul 20, 2021

All that said, I would be willing to write a revision for tedious and/or mssql as well- if it can be resolved there, to the satisfaction of the maintainers of those projects, it would of course make this irrelevant.

@Ceshion
Copy link
Contributor Author

Ceshion commented Jul 21, 2021

I opened this issue for tedious (with a PR) to attempt to address the same issue at a lower level.

Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 22, 2021
…ns retains data

add tests to ensure that using nchar and nvarchar parameters will correctly store data,
and that the actual underlying data type has not changed.

Validates typeorm#7932

test: verify that storing strings with characters that are too long for var/char throws

ensure that we will fail loudly when an attempt is made to store a string that cannot be
converted to the underlying type because it has too-long characters

Validates typeorm#7932
Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 22, 2021
…ns retains data

add tests to ensure that using nchar and nvarchar parameters will correctly store data,
and that the actual underlying data type has not changed.

test: verify that storing strings with characters that are too long for var/char throws

ensure that we will fail loudly when an attempt is made to store a string that cannot be
converted to the underlying type because it has too-long characters

Validates typeorm#7932
Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 22, 2021
…ns retains data

add tests to ensure that using nchar and nvarchar parameters will correctly store data,
and that the actual underlying data type has not changed.

test: verify that storing strings with characters that are too wide for var/char throws

ensure that we will fail loudly when an attempt is made to store a string that cannot be
converted to the underlying type because it has characters that are not in the codepage
used by the database/table/column

Validates typeorm#7932
Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 22, 2021
…de equivalent

this allows the server to handle mapping extended ASCII characters into the correct
codepage for a given column without needing to pre-check collation settings

Fixes typeorm#7932
Ceshion added a commit to Ceshion/typeorm that referenced this issue Jul 23, 2021
…de equivalent

this allows the server to handle mapping extended ASCII characters into the correct
codepage for a given column without needing to pre-check collation settings

Fixes typeorm#7932
@Ceshion
Copy link
Contributor Author

Ceshion commented Jul 27, 2021

@pleerock per discussion on tedious #1294, it appears to still be up to consumers (such as typeorm) to specify how to encode any non-ASCII characters. Today, the only option we have for that is to send parameters as unicode types regardless of the column type. Encoding strings in the consuming code and passing them as a buffer into tedious appears to be an intended use, but is currently not possible, since tedious' validation for varchar and char calls .toString on values passed to it.

As it is, I think it seems more efficient to handle this as a consumer by just using unicode parameters by default, since then there is no need to preprocess the data according to the specific configuration of a database- query for column/table/database collation, store it, use it to encode incoming data according to its intended destination. . . I think it is pretty apparent that gets messy fast.

By using unicode parameters to begin with, typeorm can avoid needing to be concerned with all of these specifics about a database and queries to run in order to store data correctly, without having to rely on a similar conversion to unicode parameters happening on a lower level- we can just be explicit about it. On the other hand, if we leave this up to consumers of typeorm to handle, then they have to make a decision between benefitting from the lower storage requirement of (var)char but not being able to store half of the characters they should be able to in any case, or accepting the larger storage requirement of n(var)char in order to even just store strings that fit in extended (single-byte) ASCII.

Or otherwise using a patch like, for example, creating a column as varchar in the database and marking it as nvarchar in its column decorator so that parameters with values to be stored in that column will be nvarchar- which does work, but is by far hackier than just automatically changing only the parameter type. Could we consider using this solution after all?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants