Character in a string is not displayed as a string #4090

pgulutzan · 2019-03-31T17:05:26Z

I have Ubuntu 18.04.
I have Tarantool 2.2 pulled from source today.

I use a character 🆘 which is Unicode 1F198 SQUARED SOS.
I insert it using the hex of the UTF8 i.e. X'F09F8698'.
I could also insert it using the decimal of the code point i.e. char(127384).
I could also insert it directly (on my platform) as '🆘'.

I issue these requests:
\set language sql
\set delimiter ;
CREATE TABLE L (s1 VARCHAR(5) COLLATE "unicode_ci" PRIMARY KEY);
INSERT INTO L VALUES (CAST(X'F09F8698' AS VARCHAR(5)));
INSERT INTO L VALUES ('SOS');
SELECT * FROM L;
\set language lua
\set delimiter
x = box.space.L:select()
utf8.len(x[1])

The INSERT INTO L VALUES ('SOS'); fails, duplicate key,
which is good -- that proves the character is correctly
recognized and is known to be equal to 'SOS'.

The SELECT * FROM l; fails, it says

- [!!binary 8J+GmA==]
  but it should show the result as a string.

The utf8.len(x[1]) fails, it says

error: 'Usage: utf8.len(, [i, [j]])'
but there is no error, it is a valid 1-character string.

This is not the same as issue#2344 which mentioned disallowing
invalid strings. This is about failing to recognize valid strings.

Totktonada · 2019-03-31T23:33:18Z

!!binary is how yaml encodes a string with special characters (don't sure about exact character set). It has nothing with cdata or SQL types.

Re utf8.len(x[1]): x[1] is a tuple, use x[1][1] to pass a string to utf8.len.

pgulutzan · 2019-04-01T13:04:12Z

Indeed my use of x[1] was my mistake, sorry. Disregarding my nonsense about cdata, though, we are left with: why is the client displaying nonsense about "!!binary 8J+GmA=="? We know that the character set is UTF-8, and what I call special characters can be displayed, but not this one. Perhaps it cannot display any characters whose code point value is greater than FFFF? Update: I changed the title to remove 'cdata'.

Totktonada · 2019-06-18T06:57:30Z

Just note (I see the label sql): this is yaml formatting and it is not specific for sql.

Totktonada · 2019-06-18T07:05:40Z

I would not change the default behaviour of our yaml encoder, because our tests (in tarantool and other projects) often generates yaml-formatted result files to compare them byte-per-byte with expected ones. I propose to add an option that will let yaml produce a regular string when it is valid UTF8. We need to elaborate should non-ascii strings be always quoted: it depends how our decoder and other decoders usually work.

Totktonada · 2019-06-18T14:48:01Z

Aside of a yaml option we need to introduce a console option (\set ... directive?) to enable / disable it for console's output formatter.

SudoBobo · 2019-06-21T14:20:51Z

@pgulutzan
Could you please explain your opinion on the topic "How should Tarantool output emojis?" As far as I understood your comments, you propose to output such characters as they are (as code points - like this 🆘) Could you please also provide some justification.

I tried the commands in different DBs:

CREATE TABLE L (s1 VARCHAR(5) COLLATE "unicode_ci" PRIMARY KEY);
INSERT INTO L VALUES (CAST(X'F09F8698' AS VARCHAR(5)));
INSERT INTO L VALUES ('SOS');
SELECT * FROM L;

with the following results:
IBM DB2: printed 'SOS'
Postgresql: printed byte sequence
Oralce : printed 'SOS'
MSSQL: printed some random unicode symbols
MySQL: printed 'SOS'

kostja · 2019-06-21T15:16:10Z

I don't see any need for directive. This is a valid UTF8 character, just like э ы ю я. Why should it get a special treatment?

pgulutzan · 2019-06-21T15:58:03Z

Replying to SudoBobo:
(1) My opinion of emojis is not important. I used an emoji in this example but there are many characters outside the BMP that are not emojis.
(2) For PostgreSQL you should use a different syntax:
"
CREATE TABLE L (s1 VARCHAR(5));
INSERT INTO L VALUES (U&'+01F198');
SELECT * FROM L;
"
It works here so I think the correct comment is:
PostgreSQL : printed 'SOS'
(3) You want "justification" for displaying a character correctly. Well, my only justification is: it would be correct. If that is not good enough, okay.
By the way, I cannot check with SQL Server, but I will suggest: if the default character set is not UTF-8 (it usually isn't), then maybe the insert could be tried with nchar().

Totktonada · 2019-06-21T16:45:06Z

I don't see any need for directive. This is a valid UTF8 character, just like э ы ю я. Why should it get a special treatment?

I'll cite myself (see the whole comment above):

I would not change the default behaviour of our yaml encoder, because our tests (in tarantool and other projects) often generates yaml-formatted result files to compare them byte-per-byte with expected ones.

kostja · 2019-06-21T18:04:28Z

I don't understand this argument. byte-by-byte comparison will need to be updated, yes, so what?

Totktonada · 2019-06-21T18:52:05Z

Maybe. At least we should consider that it can break some external (testing?) code. That is especially matters when one intend to support several tarantool versions: (s)he will need to support several variants of a test case for different tarantool versions.

kostja · 2019-06-21T21:28:25Z

The test output is broken now - it prints a printable character as a binary escaped sequence. By not fixing it we prolong broken legacy. Basically you're saying that the change needs to go into 2.x only.

Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090

Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090

Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090

Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090

Follow-up tarantool/tarantool#4090

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090

Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9)

Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090

Bumped libyaml with the following commits: 7f41d551a44a9b947dc341dd5b6c8779b5d83f00 'Make sure libyaml is C89 compliant' f1d1e5e0a5f6e6adeebe0e2c5e95a6ee729426e4 'cmake: make sure yaml is built statically when used in tarantool' 74a00faf5c4c8720149f7a726a5eda5e8fff86df 'Extend range of printable unicode characters' Needed for #4090

Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090 (cherry picked from commit cdf3787)

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9)

Follow-up tarantool/tarantool#4090

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9) After the app/digest.test.lua and app/socket.test.lua where unblocked by the commit: f588066 commit f588066 Author: Alexander V. Tikhonov <avtikhon@tarantool.org> Date: Sun May 10 09:28:15 2020 +0300 test: return tests to packaging testing Found that issues #1227 and #1322 were closed, returned the tests blocked by it into the testing. Part of #4599 THe following issue appeared in the both tests: [032] --- app/digest.result Thu May 21 03:21:02 2020 [032] +++ app/digest.reject Thu May 21 03:26:47 2020 [032] @@ -333,7 +333,7 @@ [032] [032] YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh [032] [032] - ' [032] +' [032] ... [032] digest.base64_decode(b) == s [032] --- [032] Found that the issue was already fixed in the main trunk of Tarantool 2.x version where bundled libyaml was used. To fix the tests in 1.10 version decided to use bundled libyaml too.

Follow-up tarantool/tarantool#4090 (cherry picked from commit f1d1e5e)

After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9) After the app/digest.test.lua and app/socket.test.lua where unblocked by the commit: f588066 commit f588066 Author: Alexander V. Tikhonov <avtikhon@tarantool.org> Date: Sun May 10 09:28:15 2020 +0300 test: return tests to packaging testing Found that issues #1227 and #1322 were closed, returned the tests blocked by it into the testing. Part of #4599 THe following issue appeared in the both tests: [032] --- app/digest.result Thu May 21 03:21:02 2020 [032] +++ app/digest.reject Thu May 21 03:26:47 2020 [032] @@ -333,7 +333,7 @@ [032] [032] YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh [032] [032] - ' [032] +' [032] ... [032] digest.base64_decode(b) == s [032] --- [032] Found that the issue was already fixed in the main trunk of Tarantool 2.x version where bundled libyaml was used. To fix the tests in 1.10 version decided to use bundled libyaml too.

Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090

Follow-up tarantool/tarantool#4090

pgulutzan changed the title ~~Character in a string is treated as cdata~~ Character in a string is not displayed as a string Apr 1, 2019

kyukhin added bug Something isn't working sql labels May 28, 2019

kyukhin added this to the 2.2.0 milestone May 28, 2019

Totktonada removed the sql label Jun 18, 2019

Totktonada assigned SudoBobo Jun 18, 2019

kyukhin assigned sergepetrenko and unassigned SudoBobo Jul 15, 2019

sergepetrenko mentioned this issue Jul 18, 2019

Extend range of printable unicode characters to the ones encoded with 4 bytes. yaml/libyaml#148

Closed

sergepetrenko added the ready for review label Jul 19, 2019

kyukhin closed this as completed in cdf3787 Jul 19, 2019

sergepetrenko added a commit to tarantool/libyaml that referenced this issue Jul 22, 2019

cmake: make sure yaml is built statically when used in tarantool

f1d1e5e

Follow-up tarantool/tarantool#4090

sergepetrenko mentioned this issue Jul 23, 2019

Extend range of printable unicode characters yaml/libyaml#150

Open

Totktonada mentioned this issue Jul 30, 2019

Tarantool cannot be installed on current Fedora #4313

Closed

kyukhin added the tmp label Aug 2, 2019

sergepetrenko removed the ready for review label Aug 22, 2019

avtikhon mentioned this issue May 26, 2020

test: on 1.10 tests were unblocked and began to show the lack of feature #5022

Closed

avtikhon mentioned this issue Jun 17, 2020

test: on 1.10 tests were unblocked and began to show the lack of feature on 1.10 #5007

Closed

avtikhon pushed a commit to tarantool/libyaml that referenced this issue Jun 22, 2020

cmake: make sure yaml is built statically when used in tarantool

f75e138

Follow-up tarantool/tarantool#4090

avtikhon pushed a commit to tarantool/libyaml that referenced this issue Jun 23, 2020

cmake: make sure yaml is built statically when used in tarantool

261c7ee

Follow-up tarantool/tarantool#4090 (cherry picked from commit f1d1e5e)

Totktonada mentioned this issue Aug 6, 2020

Tarantool 2.6.0 tests fail at Arch Linux #5221

Closed

Totktonada mentioned this issue Sep 7, 2020

gmake building errors for yaml #5186

Open

ghost mentioned this issue Jun 9, 2021

Implement emoji support ruby/psych#508

Open

Totktonada mentioned this issue Dec 7, 2021

tarantool: use brewed libs & fix build on Monterey Homebrew/homebrew-core#90034

Closed

6 tasks

ligurio pushed a commit to tarantool/libyaml that referenced this issue Jul 19, 2022

cmake: make sure yaml is built statically when used in tarantool

df47189

Follow-up tarantool/tarantool#4090

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character in a string is not displayed as a string #4090

Character in a string is not displayed as a string #4090

pgulutzan commented Mar 31, 2019 •

edited

Totktonada commented Mar 31, 2019

pgulutzan commented Apr 1, 2019 •

edited

Totktonada commented Jun 18, 2019

Totktonada commented Jun 18, 2019

Totktonada commented Jun 18, 2019

SudoBobo commented Jun 21, 2019

kostja commented Jun 21, 2019

pgulutzan commented Jun 21, 2019 •

edited

Totktonada commented Jun 21, 2019

kostja commented Jun 21, 2019

Totktonada commented Jun 21, 2019

kostja commented Jun 21, 2019 •

edited

Character in a string is not displayed as a string #4090

Character in a string is not displayed as a string #4090

Comments

pgulutzan commented Mar 31, 2019 • edited

Totktonada commented Mar 31, 2019

pgulutzan commented Apr 1, 2019 • edited

Totktonada commented Jun 18, 2019

Totktonada commented Jun 18, 2019

Totktonada commented Jun 18, 2019

SudoBobo commented Jun 21, 2019

kostja commented Jun 21, 2019

pgulutzan commented Jun 21, 2019 • edited

Totktonada commented Jun 21, 2019

kostja commented Jun 21, 2019

Totktonada commented Jun 21, 2019

kostja commented Jun 21, 2019 • edited

pgulutzan commented Mar 31, 2019 •

edited

pgulutzan commented Apr 1, 2019 •

edited

pgulutzan commented Jun 21, 2019 •

edited

kostja commented Jun 21, 2019 •

edited