New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Character in a string is not displayed as a string #4090
Comments
Re utf8.len(x[1]): x[1] is a tuple, use x[1][1] to pass a string to utf8.len. |
Indeed my use of x[1] was my mistake, sorry. Disregarding my nonsense about cdata, though, we are left with: why is the client displaying nonsense about "!!binary 8J+GmA=="? We know that the character set is UTF-8, and what I call special characters can be displayed, but not this one. Perhaps it cannot display any characters whose code point value is greater than FFFF? Update: I changed the title to remove 'cdata'. |
Just note (I see the label |
I would not change the default behaviour of our yaml encoder, because our tests (in tarantool and other projects) often generates yaml-formatted result files to compare them byte-per-byte with expected ones. I propose to add an option that will let yaml produce a regular string when it is valid UTF8. We need to elaborate should non-ascii strings be always quoted: it depends how our decoder and other decoders usually work. |
Aside of a yaml option we need to introduce a console option ( |
@pgulutzan I tried the commands in different DBs:
with the following results: |
I don't see any need for directive. This is a valid UTF8 character, just like 褝 褘 褞 褟. Why should it get a special treatment? |
Replying to SudoBobo: |
I'll cite myself (see the whole comment above):
|
I don't understand this argument. byte-by-byte comparison will need to be updated, yes, so what? |
Maybe. At least we should consider that it can break some external (testing?) code. That is especially matters when one intend to support several tarantool versions: (s)he will need to support several variants of a test case for different tarantool versions. |
The test output is broken now - it prints a printable character as a binary escaped sequence. By not fixing it we prolong broken legacy. Basically you're saying that the change needs to go into 2.x only. |
Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090
Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090
Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090
Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090
Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090
Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090
Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090
Bumped libyaml with the following commits: 7f41d551a44a9b947dc341dd5b6c8779b5d83f00 'Make sure libyaml is C89 compliant' f1d1e5e0a5f6e6adeebe0e2c5e95a6ee729426e4 'cmake: make sure yaml is built statically when used in tarantool' 74a00faf5c4c8720149f7a726a5eda5e8fff86df 'Extend range of printable unicode characters' Needed for #4090
Before the patch unicode characters encoded with 4 bytes were always treated as non-printable and displayed as byte sequences (with 'binary' tag). With the patch, range of printable characters is extended and include characters encoded with 4 bytes. Currently it is: (old printable range) U (icu printable range). Corresponding changes are also made in tarantool/libyaml. Closes: #4090 (cherry picked from commit cdf3787)
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9) After the app/digest.test.lua and app/socket.test.lua where unblocked by the commit: f588066 commit f588066 Author: Alexander V. Tikhonov <avtikhon@tarantool.org> Date: Sun May 10 09:28:15 2020 +0300 test: return tests to packaging testing Found that issues #1227 and #1322 were closed, returned the tests blocked by it into the testing. Part of #4599 THe following issue appeared in the both tests: [032] --- app/digest.result Thu May 21 03:21:02 2020 [032] +++ app/digest.reject Thu May 21 03:26:47 2020 [032] @@ -333,7 +333,7 @@ [032] [032] YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh [032] [032] - ' [032] +' [032] ... [032] digest.base64_decode(b) == s [032] --- [032] Found that the issue was already fixed in the main trunk of Tarantool 2.x version where bundled libyaml was used. To fix the tests in 1.10 version decided to use bundled libyaml too.
Follow-up tarantool/tarantool#4090 (cherry picked from commit f1d1e5e)
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9) After the app/digest.test.lua and app/socket.test.lua where unblocked by the commit: f588066 commit f588066 Author: Alexander V. Tikhonov <avtikhon@tarantool.org> Date: Sun May 10 09:28:15 2020 +0300 test: return tests to packaging testing Found that issues #1227 and #1322 were closed, returned the tests blocked by it into the testing. Part of #4599 THe following issue appeared in the both tests: [032] --- app/digest.result Thu May 21 03:21:02 2020 [032] +++ app/digest.reject Thu May 21 03:26:47 2020 [032] @@ -333,7 +333,7 @@ [032] [032] YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh [032] [032] - ' [032] +' [032] ... [032] digest.base64_decode(b) == s [032] --- [032] Found that the issue was already fixed in the main trunk of Tarantool 2.x version where bundled libyaml was used. To fix the tests in 1.10 version decided to use bundled libyaml too.
After we fixed bundled libyaml to correctly print 4-byte Unicode characters, it is no longer compatible with the upstream version, so enable building with bundled libyaml for every platform. This way the tests will pass. Follow-up #4090 (cherry picked from commit 47b91e9) After the app/digest.test.lua and app/socket.test.lua where unblocked by the commit: f588066 commit f588066 Author: Alexander V. Tikhonov <avtikhon@tarantool.org> Date: Sun May 10 09:28:15 2020 +0300 test: return tests to packaging testing Found that issues #1227 and #1322 were closed, returned the tests blocked by it into the testing. Part of #4599 THe following issue appeared in the both tests: [032] --- app/digest.result Thu May 21 03:21:02 2020 [032] +++ app/digest.reject Thu May 21 03:26:47 2020 [032] @@ -333,7 +333,7 @@ [032] [032] YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh [032] [032] - ' [032] +' [032] ... [032] digest.base64_decode(b) == s [032] --- [032] Found that the issue was already fixed in the main trunk of Tarantool 2.x version where bundled libyaml was used. To fix the tests in 1.10 version decided to use bundled libyaml too.
Before the patch IS_PRINTABLE macros was used to determine if given character is printable or not. This macros did not take into account characters encoded with 4 bytes. After the patch IS_PRINTABLE is replaced with new corresponding function. Now the range of printable characters is: (libyaml old range) U (icu range). This new range include characters encoded with 4 bytes. Related to tarantool/tarantool#4090
I have Ubuntu 18.04.
I have Tarantool 2.2 pulled from source today.
I use a character 馃啒 which is Unicode 1F198 SQUARED SOS.
I insert it using the hex of the UTF8 i.e. X'F09F8698'.
I could also insert it using the decimal of the code point i.e. char(127384).
I could also insert it directly (on my platform) as '馃啒'.
I issue these requests:
\set language sql
\set delimiter ;
CREATE TABLE L (s1 VARCHAR(5) COLLATE "unicode_ci" PRIMARY KEY);
INSERT INTO L VALUES (CAST(X'F09F8698' AS VARCHAR(5)));
INSERT INTO L VALUES ('SOS');
SELECT * FROM L;
\set language lua
\set delimiter
x = box.space.L:select()
utf8.len(x[1])
The INSERT INTO L VALUES ('SOS'); fails, duplicate key,
which is good -- that proves the character is correctly
recognized and is known to be equal to 'SOS'.
The SELECT * FROM l; fails, it says
but it should show the result as a string.
The utf8.len(x[1]) fails, it says
but there is no error, it is a valid 1-character string.
This is not the same as issue#2344 which mentioned disallowing
invalid strings. This is about failing to recognize valid strings.
The text was updated successfully, but these errors were encountered: