Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON column with Unicode emojis breaks #56

Open
mpskovvang opened this issue Feb 21, 2023 · 8 comments
Open

JSON column with Unicode emojis breaks #56

mpskovvang opened this issue Feb 21, 2023 · 8 comments

Comments

@mpskovvang
Copy link

Inserting emojis becomes ������ in JSON columns.

CREATE TABLE `test` (
  `a` JSON COLLATE utf8mb4_unicode_ci,
  `b` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
DB::table('test')->insert([
    'a' => json_encode(['text' => 'Hello world! æøå 😀']),
    'b' => 'Hello world! æøå 😀',
]);

image

Using the JSON_UNESCAPED_UNICODE flag seems to work:

image

This affects Eloquent models with JSON casting attributes (like array and object).

Is this related to Laravel, PDO MySQL driver, or SingleStore DB?

Normally, I would expect the stored JSON value to use backslashes to escape Unicode characters (PHP default with json_encode), but that does not seem to be the way SingleStore DB handles it.

@AdalbertMemSQL
Copy link
Collaborator

Hi Martin,
Here is more information on Unicode characters in SingleStoreDB JSON type:
https://docs.singlestore.com/db/v8.0/en/create-a-database/using-json.html#unicode-support-652809

According to this article, the default behavior of json_encode (escape as \uXXXX) should work well. I'm surprised that it doesn't.
Thanks for reporting the issue.

@mpskovvang
Copy link
Author

mpskovvang commented Feb 22, 2023

Hi Adalbert

Thanks for your input.

Actually, the docs make me a bit more confused. It says, only the Basic Multilingual Plane is supported in JSON columns, but the emoji from my example uses the U+1F600 codepoint (Supplementary Multilingual Plane). However, it seems supported when not escaping the character with backslashes.

SingleStoreDB (like MySQL) does not support characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0xFFFF to 0x10FFFF).

https://www.compart.com/en/unicode/U+1F600

@michabbb
Copy link

hi there,

i can confirm this problem.
i have a XML with this content:

<SubTitle>TOP PREIS ❗️ BESTE QUALITÄT 😍 KOSTENLOSE LIEFERUNG ✔</SubTitle>

I convert this to an array and from there I do

json_encode($array,true)

in a native JSON column, the smiley is broken, but if I store the JSON into a LONGTEXT, everything works fine
and after selecting the json-string I am able to get the smiley back from "json_decode".

here´s a sandbox of what I do: https://phpsandbox.io/n/singlestore-json-example-tg36z

I contacted singlestore support, and they told me:

We tested a similar scenario with the syntax that you have provided for the table as well as using the same JSON values and the output returned is correct and contains the emoji.

so I guess the problem is related to this package, somehow.

@AdalbertMemSQL I would be happy, if you could take a look into this issue (again).

THANKS!

@michabbb
Copy link

and yes, i can confirm, if I use my own "custom Cast" like this:

protected $casts = [
   'body'   => Json::class,
];

// Casts/Json.php
class Json implements CastsAttributes
{
    public function get(Model $model, string $key, mixed $value, array $attributes): mixed
    {
		return json_decode($value, true, 512, JSON_INVALID_UTF8_IGNORE);
    }

    public function set(Model $model, string $key, mixed $value, array $attributes): string|false
	{
		return json_encode($value, JSON_UNESCAPED_UNICODE);
    }
}

I see a correct smiley in my singlestore json column.
so maybe this whole json/array casting inside laravel is the problem 🤔 🤷

@AdalbertMemSQL
Copy link
Collaborator

Thanks, @michabbb
I created an internal ticket for this issue.

@michabbb
Copy link

michabbb commented Apr 10, 2023

after testing some more, i start thinking, that singlestore only supports "UNESCAPED UNICODE", because inserting a string like this:

{"SubTitle":"TOP PREIS \u2757\ufe0f BESTE QUALIT\u00c4T  \ud83d\ude0d  KOSTENLOSE LIEFERUNG \u2714"}

simply is not working. i asked the singlestore support, will update this posting, if I got feedback.

btw.... laravel offers no support for adding any options to auto-casts like array<->json - "json_encode", I guess that´s why there are "custom casts" 😏

https://github.com/laravel/framework/blob/9588b4e7d1c71f08444db66ed3652cf967b468b2/src/Illuminate/Database/Eloquent/Casts/Json.php#L26

@danmatthews
Copy link

Just jumping in here to say that we've been fighting with the exact same issue, and hanging in here for any follow up!

@AdalbertMemSQL
Copy link
Collaborator

A small update regarding this.
Despite this doc - escaped characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0x10000 to 0x10FFFF) are not handled correctly. 

That's why 😀 (0x1F600) doesn't work well. There is a ticket to fix this on the database side.

MySQL [db]> insert into t value ('{"a":"\\u1f600"}');
Query OK, 1 row affected (0.033 sec)

MySQL [db]> select * from t;
+--------------+
| a            |
+--------------+
| {"a":"ὠ0"}   |
+--------------+
1 row in set (0.029 sec)

Such characters still work well when they are not escaped

MySQL [db]> insert into t values ('{"a":"😀"}');
Query OK, 1 row affected (0.029 sec)

MySQL [db]> select * from t;
+--------------+
| a            |
+--------------+
| {"a":"😀"}     |
+--------------+
1 row in set (0.030 sec)

So, in conclusion, characters from the Basic Multilingual Plane work well

        DB::table('test')->insert([
            'data' => json_encode(['a' => 'ם']),
        ]);
+------------+
| data       |
+------------+
| {"a":"ם"}  |
+------------+
1 row in set (0.009 sec)

but characters outside of the Basic Multilingual Plane should be used with the JSON_UNESCAPED_UNICODE flag until their support is added to the database

        DB::table('test')->insert([
            'data' => json_encode(['a' => '😀'], JSON_UNESCAPED_UNICODE),
        ]);
+--------------+
| data         |
+--------------+
| {"a":"😀"}     |
+--------------+
1 row in set (0.019 sec)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants