Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character in a string is not displayed as a string #4090

Closed
pgulutzan opened this issue Mar 31, 2019 · 12 comments
Closed

Character in a string is not displayed as a string #4090

pgulutzan opened this issue Mar 31, 2019 · 12 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@pgulutzan
Copy link
Contributor

pgulutzan commented Mar 31, 2019

I have Ubuntu 18.04.
I have Tarantool 2.2 pulled from source today.

I use a character 馃啒 which is Unicode 1F198 SQUARED SOS.
I insert it using the hex of the UTF8 i.e. X'F09F8698'.
I could also insert it using the decimal of the code point i.e. char(127384).
I could also insert it directly (on my platform) as '馃啒'.

I issue these requests:
\set language sql
\set delimiter ;
CREATE TABLE L (s1 VARCHAR(5) COLLATE "unicode_ci" PRIMARY KEY);
INSERT INTO L VALUES (CAST(X'F09F8698' AS VARCHAR(5)));
INSERT INTO L VALUES ('SOS');
SELECT * FROM L;
\set language lua
\set delimiter
x = box.space.L:select()
utf8.len(x[1])

The INSERT INTO L VALUES ('SOS'); fails, duplicate key,
which is good -- that proves the character is correctly
recognized and is known to be equal to 'SOS'.

The SELECT * FROM l; fails, it says

    • [!!binary 8J+GmA==]
      but it should show the result as a string.

The utf8.len(x[1]) fails, it says

  • error: 'Usage: utf8.len(, [i, [j]])'
    but there is no error, it is a valid 1-character string.

This is not the same as issue#2344 which mentioned disallowing
invalid strings. This is about failing to recognize valid strings.

@Totktonada
Copy link
Member

!!binary is how yaml encodes a string with special characters (don't sure about exact character set). It has nothing with cdata or SQL types.

Re utf8.len(x[1]): x[1] is a tuple, use x[1][1] to pass a string to utf8.len.

@pgulutzan
Copy link
Contributor Author

pgulutzan commented Apr 1, 2019

Indeed my use of x[1] was my mistake, sorry. Disregarding my nonsense about cdata, though, we are left with: why is the client displaying nonsense about "!!binary 8J+GmA=="? We know that the character set is UTF-8, and what I call special characters can be displayed, but not this one. Perhaps it cannot display any characters whose code point value is greater than FFFF? Update: I changed the title to remove 'cdata'.

@pgulutzan pgulutzan changed the title Character in a string is treated as cdata Character in a string is not displayed as a string Apr 1, 2019
@kyukhin kyukhin added bug Something isn't working sql labels May 28, 2019
@kyukhin kyukhin added this to the 2.2.0 milestone May 28, 2019
@Totktonada
Copy link
Member

Just note (I see the label sql): this is yaml formatting and it is not specific for sql.

@Totktonada
Copy link
Member

I would not change the default behaviour of our yaml encoder, because our tests (in tarantool and other projects) often generates yaml-formatted result files to compare them byte-per-byte with expected ones. I propose to add an option that will let yaml produce a regular string when it is valid UTF8. We need to elaborate should non-ascii strings be always quoted: it depends how our decoder and other decoders usually work.

@Totktonada
Copy link
Member

Aside of a yaml option we need to introduce a console option (\set ... directive?) to enable / disable it for console's output formatter.

@SudoBobo
Copy link
Contributor

@pgulutzan
Could you please explain your opinion on the topic "How should Tarantool output emojis?" As far as I understood your comments, you propose to output such characters as they are (as code points - like this 馃啒) Could you please also provide some justification.

I tried the commands in different DBs:

CREATE TABLE L (s1 VARCHAR(5) COLLATE "unicode_ci" PRIMARY KEY);
INSERT INTO L VALUES (CAST(X'F09F8698' AS VARCHAR(5)));
INSERT INTO L VALUES ('SOS');
SELECT * FROM L;

with the following results:
IBM DB2: printed 'SOS'
Postgresql: printed byte sequence
Oralce : printed 'SOS'
MSSQL: printed some random unicode symbols
MySQL: printed 'SOS'

@kostja
Copy link
Contributor

kostja commented Jun 21, 2019

I don't see any need for directive. This is a valid UTF8 character, just like 褝 褘 褞 褟. Why should it get a special treatment?

@pgulutzan
Copy link
Contributor Author

pgulutzan commented Jun 21, 2019

Replying to SudoBobo:
(1) My opinion of emojis is not important. I used an emoji in this example but there are many characters outside the BMP that are not emojis.
(2) For PostgreSQL you should use a different syntax:
"
CREATE TABLE L (s1 VARCHAR(5));
INSERT INTO L VALUES (U&'+01F198');
SELECT * FROM L;
"
It works here so I think the correct comment is:
PostgreSQL : printed 'SOS'
(3) You want "justification" for displaying a character correctly. Well, my only justification is: it would be correct. If that is not good enough, okay.
By the way, I cannot check with SQL Server, but I will suggest: if the default character set is not UTF-8 (it usually isn't), then maybe the insert could be tried with nchar().

@Totktonada
Copy link
Member

I don't see any need for directive. This is a valid UTF8 character, just like 褝 褘 褞 褟. Why should it get a special treatment?

I'll cite myself (see the whole comment above):

I would not change the default behaviour of our yaml encoder, because our tests (in tarantool and other projects) often generates yaml-formatted result files to compare them byte-per-byte with expected ones.

@kostja
Copy link
Contributor

kostja commented Jun 21, 2019

I don't understand this argument. byte-by-byte comparison will need to be updated, yes, so what?

@Totktonada
Copy link
Member

Maybe. At least we should consider that it can break some external (testing?) code. That is especially matters when one intend to support several tarantool versions: (s)he will need to support several variants of a test case for different tarantool versions.

@kostja
Copy link
Contributor

kostja commented Jun 21, 2019

The test output is broken now - it prints a printable character as a binary escaped sequence. By not fixing it we prolong broken legacy. Basically you're saying that the change needs to go into 2.x only.

SudoBobo added a commit that referenced this issue Jun 28, 2019
Before the patch unicode characters encoded with 4 bytes
were always treated as non-printable and displayed as byte
sequences (with 'binary' tag).
With the patch, range of printable characters is extended and
include characters encoded with 4 bytes.
Currently it is: (old printable range) U (icu printable range).
Corresponding changes are also made in tarantool/libyaml.

Closes: #4090
@kyukhin kyukhin assigned sergepetrenko and unassigned SudoBobo Jul 15, 2019
sergepetrenko pushed a commit to tarantool/libyaml that referenced this issue Jul 16, 2019
Before the patch IS_PRINTABLE macros was used
to determine if given character is printable or not.
This macros did not take into account characters
encoded with 4 bytes.
After the patch IS_PRINTABLE is replaced with new
corresponding function. Now the range of printable
characters is: (libyaml old range) U (icu range). This
new range include characters encoded with 4 bytes.

Related to tarantool/tarantool#4090
kyukhin pushed a commit to tarantool/libyaml that referenced this issue Jul 18, 2019
Before the patch IS_PRINTABLE macros was used
to determine if given character is printable or not.
This macros did not take into account characters
encoded with 4 bytes.
After the patch IS_PRINTABLE is replaced with new
corresponding function. Now the range of printable
characters is: (libyaml old range) U (icu range). This
new range include characters encoded with 4 bytes.

Related to tarantool/tarantool#4090
sergepetrenko pushed a commit that referenced this issue Jul 18, 2019
Before the patch unicode characters encoded with 4 bytes
were always treated as non-printable and displayed as byte
sequences (with 'binary' tag).
With the patch, range of printable characters is extended and
include characters encoded with 4 bytes.
Currently it is: (old printable range) U (icu printable range).
Corresponding changes are also made in tarantool/libyaml.

Closes: #4090
sergepetrenko pushed a commit to tarantool/libyaml that referenced this issue Jul 18, 2019
Before the patch IS_PRINTABLE macros was used
to determine if given character is printable or not.
This macros did not take into account characters
encoded with 4 bytes.
After the patch IS_PRINTABLE is replaced with new
corresponding function. Now the range of printable
characters is: (libyaml old range) U (icu range). This
new range include characters encoded with 4 bytes.

Related to tarantool/tarantool#4090
sergepetrenko added a commit that referenced this issue Jul 21, 2019
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090
sergepetrenko added a commit that referenced this issue Jul 21, 2019
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090
sergepetrenko added a commit that referenced this issue Jul 23, 2019
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090
sergepetrenko pushed a commit to tarantool/libyaml that referenced this issue Jul 23, 2019
Before the patch IS_PRINTABLE macros was used
to determine if given character is printable or not.
This macros did not take into account characters
encoded with 4 bytes.
After the patch IS_PRINTABLE is replaced with new
corresponding function. Now the range of printable
characters is: (libyaml old range) U (icu range). This
new range include characters encoded with 4 bytes.

Related to tarantool/tarantool#4090
kyukhin pushed a commit that referenced this issue Jul 24, 2019
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090
@kyukhin kyukhin added the tmp label Aug 2, 2019
avtikhon pushed a commit that referenced this issue Dec 11, 2019
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090
avtikhon pushed a commit that referenced this issue Jun 17, 2020
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090

(cherry picked from commit 47b91e9)
avtikhon pushed a commit that referenced this issue Jun 17, 2020
Before the patch unicode characters encoded with 4 bytes
were always treated as non-printable and displayed as byte
sequences (with 'binary' tag).
With the patch, range of printable characters is extended and
include characters encoded with 4 bytes.
Currently it is: (old printable range) U (icu printable range).
Corresponding changes are also made in tarantool/libyaml.

Closes: #4090
avtikhon added a commit that referenced this issue Jun 18, 2020
Bumped libyaml with the following commits:

  7f41d551a44a9b947dc341dd5b6c8779b5d83f00 'Make sure libyaml is C89 compliant'
  f1d1e5e0a5f6e6adeebe0e2c5e95a6ee729426e4 'cmake: make sure yaml is built statically when used in tarantool'
  74a00faf5c4c8720149f7a726a5eda5e8fff86df 'Extend range of printable unicode characters'

Needed for #4090
avtikhon pushed a commit that referenced this issue Jun 18, 2020
Before the patch unicode characters encoded with 4 bytes
were always treated as non-printable and displayed as byte
sequences (with 'binary' tag).
With the patch, range of printable characters is extended and
include characters encoded with 4 bytes.
Currently it is: (old printable range) U (icu printable range).
Corresponding changes are also made in tarantool/libyaml.

Closes: #4090

(cherry picked from commit cdf3787)
avtikhon pushed a commit that referenced this issue Jun 22, 2020
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090

(cherry picked from commit 47b91e9)
avtikhon pushed a commit to tarantool/libyaml that referenced this issue Jun 22, 2020
avtikhon pushed a commit that referenced this issue Jun 22, 2020
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090

(cherry picked from commit 47b91e9)

After the app/digest.test.lua and app/socket.test.lua where unblocked
by the commit:

  f588066
  commit f588066
  Author: Alexander V. Tikhonov <avtikhon@tarantool.org>
  Date:   Sun May 10 09:28:15 2020 +0300

    test: return tests to packaging testing

    Found that issues #1227 and #1322 were closed, returned
    the tests blocked by it into the testing.

    Part of #4599

THe following issue appeared in the both tests:

 [032] --- app/digest.result  Thu May 21 03:21:02 2020
 [032] +++ app/digest.reject  Thu May 21 03:26:47 2020
 [032] @@ -333,7 +333,7 @@
 [032]
 [032]    YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh
 [032]
 [032] -  '
 [032] +'
 [032]  ...
 [032]  digest.base64_decode(b) == s
 [032]  ---
 [032]

Found that the issue was already fixed in the main trunk of Tarantool
2.x version where bundled libyaml was used. To fix the tests in 1.10
version decided to use bundled libyaml too.
avtikhon pushed a commit to tarantool/libyaml that referenced this issue Jun 23, 2020
avtikhon pushed a commit that referenced this issue Jun 23, 2020
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090

(cherry picked from commit 47b91e9)

After the app/digest.test.lua and app/socket.test.lua where unblocked
by the commit:

  f588066
  commit f588066
  Author: Alexander V. Tikhonov <avtikhon@tarantool.org>
  Date:   Sun May 10 09:28:15 2020 +0300

    test: return tests to packaging testing

    Found that issues #1227 and #1322 were closed, returned
    the tests blocked by it into the testing.

    Part of #4599

THe following issue appeared in the both tests:

 [032] --- app/digest.result  Thu May 21 03:21:02 2020
 [032] +++ app/digest.reject  Thu May 21 03:26:47 2020
 [032] @@ -333,7 +333,7 @@
 [032]
 [032]    YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh
 [032]
 [032] -  '
 [032] +'
 [032]  ...
 [032]  digest.base64_decode(b) == s
 [032]  ---
 [032]

Found that the issue was already fixed in the main trunk of Tarantool
2.x version where bundled libyaml was used. To fix the tests in 1.10
version decided to use bundled libyaml too.
kyukhin pushed a commit that referenced this issue Jun 26, 2020
After we fixed bundled libyaml to correctly print 4-byte Unicode
characters, it is no longer compatible with the upstream version, so
enable building with bundled libyaml for every platform.
This way the tests will pass.

Follow-up #4090

(cherry picked from commit 47b91e9)

After the app/digest.test.lua and app/socket.test.lua where unblocked
by the commit:

  f588066
  commit f588066
  Author: Alexander V. Tikhonov <avtikhon@tarantool.org>
  Date:   Sun May 10 09:28:15 2020 +0300

    test: return tests to packaging testing

    Found that issues #1227 and #1322 were closed, returned
    the tests blocked by it into the testing.

    Part of #4599

THe following issue appeared in the both tests:

 [032] --- app/digest.result  Thu May 21 03:21:02 2020
 [032] +++ app/digest.reject  Thu May 21 03:26:47 2020
 [032] @@ -333,7 +333,7 @@
 [032]
 [032]    YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh
 [032]
 [032] -  '
 [032] +'
 [032]  ...
 [032]  digest.base64_decode(b) == s
 [032]  ---
 [032]

Found that the issue was already fixed in the main trunk of Tarantool
2.x version where bundled libyaml was used. To fix the tests in 1.10
version decided to use bundled libyaml too.
ligurio pushed a commit to tarantool/libyaml that referenced this issue Jul 19, 2022
Before the patch IS_PRINTABLE macros was used
to determine if given character is printable or not.
This macros did not take into account characters
encoded with 4 bytes.
After the patch IS_PRINTABLE is replaced with new
corresponding function. Now the range of printable
characters is: (libyaml old range) U (icu range). This
new range include characters encoded with 4 bytes.

Related to tarantool/tarantool#4090
ligurio pushed a commit to tarantool/libyaml that referenced this issue Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants