MySQL database not using full UTF-8 charset #394

ophian · 2016-03-11T09:04:35Z

The 😊 is what this is about! (GitHub changes the smiley if not in code tags)

Insert text
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 😊 Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur? At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere
see preview is working
save the entry
see frontend entry text cut off after "voluptatem."
back in backend, see the second part isn't saved at all

Who is doing this?

The text was updated successfully, but these errors were encountered:

garvinhicking · 2016-03-11T09:12:14Z

This smilie is a high-byte UTF character. The reason it might not be saved is:

a.) html entities/transcoding
b.) Database charset/collation collision
c.) Maybe some mysql_escape_string() thing where it strips this

Definitely try to do this without a WYSIWYG editor to see if it makes a difference. Also try to directly insert it into the serendipity_entries DB table find out if it's the DB or s9y that's removing the character.

Also, it could be that those Smilies are not UTF-8, but UTF-16 (this could be), in that case we would need to convert DB tables etc. to UTF-16.

Just a few ideas, maybe this helps.

yellowled · 2016-03-11T09:15:03Z

Exactly the same behaviour in CKE, but I had to use the source code view in CKE because otherwise it adopts some inline formatting from GitHub.

yellowled · 2016-03-11T09:18:16Z

For the record, in CKE the preview is working as well, and it is the same if the text is not pasted in srouce code view (had to copy it without the code block fencing).

ophian · 2016-03-11T09:22:24Z

Definitely try to do this without a WYSIWYG editor to see if it makes a difference.

;-) Well, I did. (I do maintain the cke plugin, but this does not mean I use it alle the time!)
I thought of UTF-16 or something like that too...

yellowled · 2016-03-11T09:24:23Z

„Unicode code points in "Private Use"-Bereichen (UTF-8)“ – so no UTF-16.

https://twitter.com/fhemberger/status/708221689511940096

ophian · 2016-03-11T09:25:38Z

Putting it into DB via PhpMyAdmin editor gives
Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

garvinhicking · 2016-03-11T09:37:10Z

@ophian What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

ophian · 2016-03-11T09:39:37Z

What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

It is utf8_unicode_ci.
Have to look up how to convert it to UTF-16 though...

ophian · 2016-03-11T09:46:06Z

I cannot change the body collision... to any of UTF-16 unicode_ci or utf8mb4 unicode_ci.
It gives #1283 Column 'body' cannot be part of the FULLTEXT index.

garvinhicking · 2016-03-11T09:52:07Z

Hm. I have no UTF-16 knowledge with MYSQL yet. Would suck if UTF-16 columns couldn't be fulltext searched. (Well, UTF-16 would suck nevertheless, because of other index lengths, and also because we would need to add charsets UTF-16 SET NAMES et all). Better to somehow encoded the UTF-16 string to a entity like &#blablabla on saving, so it can be stored as UTF-8...

yellowled · 2016-03-11T10:02:15Z

FWIW, http://apps.timwhitlock.info/emoji/tables/unicode has the unicode characters, codes or whatever, for the 😊 for example it's http://apps.timwhitlock.info/unicode/inspect/hex/1F60A (which has UTF-8 and UTF-16 LE).

onli · 2016-03-11T10:34:54Z

Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

That's the UTS-8 hexcode for that smiley. I think you are on the wrong track with UTF-16. See http://graphemica.com/%F0%9F%98%8A (but YLs link shows it as well).

Maybe the text encoding of the input is not utf-8?

ophian · 2016-03-11T12:16:05Z

This was a PhpMyAdmin error. (They probably have some internal decoding structures.)
Serendipity does not throw anything.

ophian · 2016-03-13T12:04:41Z

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking Seems so, read http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets. The exception is that for Unicode, the utf8 character set can be used, but not the ucs2 character set. However, although FULLTEXT indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2 also apply to utf16 and utf32.

garvinhicking · 2016-03-13T20:02:45Z

Well, ok...then we should probably stick to UTF-8 and make somehow sure
that Non UTF-8 characters (UCS2/UTF-16) are transformed to their &#XXXX;
entities to properly show up.

I guess there must be some PHP string function to do these operations,
maybe mbstring or so, but I haven't worked with that yet...

On 13.03.2016 13:04 , Ian wrote:

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking https://github.com/garvinhicking Seems so, read
http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets.
The exception is that for Unicode, the utf8 character set can be
used, but not the ucs2 character set. However, although FULLTEXT
indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN
MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2
also apply to utf16 and utf32.

—
Reply to this email directly or view it on GitHub
#394 (comment).

ophian · 2016-03-14T08:18:43Z

No, we can't use that I assume, since that is a "private use" utf8 area, which means, will never be ported to utf8 natively.
We have to convert it by a Symbol to unicode map list on a loop. I found something promising. The question is where to place that exactly and how this could be done for these rare cases only... with some try {} catch {} logic?

ophian · 2016-03-14T08:58:35Z

No, we can't use that I assume,

Oh wait.... Did you mean in a php.ini ?

default_charset = "utf-8"
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = Off
mbstring.detect_order = auto
mbstring.substitute_character = none
mbstring.func_overload = 0

or better ?

default_charset = "utf-8"
#[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On

garvinhicking · 2016-04-23T14:11:50Z

(Yeah, I meant this extension, but not through automatic encoding, but by use of mbstring_detect_* or so.

The replacements, if we can perform any, we should build into our serendipity_updertEntry function.

I'm still now so sure what these emoji characters are actually transformed in, and how we could translate them to whatever HTML entity easily :-?

(Leaving this open)

bauigel · 2016-04-23T15:13:11Z

For emojis seems we have to use UTF8MB4. See https://mathiasbynens.be/notes/mysql-utf8mb4 for instructions, what has to be done.

yellowled · 2016-04-26T20:30:25Z

Also reproducable in comments, BTW, as I just found out. Comment is cut off after the Emoji.

garvinhicking · 2017-04-08T10:00:06Z

We need a new upgrader task method that will be executed on every update, that checks if the supported MySQL version is 5.5.3 or higher, and migration has not yet been executed. If so, we issue the ALTER TABLE commands that change utf8 to utf8mb4 and set an internal config variable that indicates that this migration has been executed.

garvinhicking · 2017-04-26T08:46:16Z

Ok all, I committed my "stuff" to the "feature_utf8mb4" branch. I have decided that the whole migration logic is so dangerous, we cannot force it on existing users.

The basic issue is that in UTF-8 we have about 1000 bytes in length for indexes. Due to UTF-8 this means we can use a character length of 333 characters (1000 / 3). With UTF-8 MB4 this is reduced to 250 characters (1000 / 4). A single index is allowed to reach 767 bytes (191 characters). This means, wherever we recently used an index length of 200 or 250 or 255 we can now only use 191 bytes. This means, all indexes need adjustment before the charset of a table can be successfully changed. While this can be automated, the problem is that it is not easy to deduce the new proper key length, because in some tables, reducing an index could lead to failing unique key constraints.

What I want to do is twofold:

1.) New Installations shall use utf8mb4 as charset on installation, if their mysql version matches. If not, UTF-8 will be used. I have adapted the internal SQL key statements to work with utf8mb4, also in the additional_plugins repository all created indexes should be adjusted.

2.) Existing installations now get a maintenance task, with which they can convert to UTF-8. There are two steps; first the "simulation" which yields all SQL statements that would get executed, and second the "execution" of those statements. In my commits, only the simulation has currently been implemented.

Now what I would need help with is some testing of the functionality. In fresh installations, in my tests, I could successfully insert "raw" emoji codes properly, so everyone could try that.

Then I would need help with testing the actual migration. For that you would need to have a test Serendipity 2.1 installation configured as UTF-8, ideally with some entries with special UTF-8 characters.

It would be great if someone can create a "template" s9y installation (with s9y 2.1, not the feature branch), once with UTF-8 tables and once with "Native charsets" (ISO-8859-1). Then create some categories, authors, entries, html nuggets - all with special characters. Then you could create a mysqldump of those tables, so that other people can use the dump to test the migration with (two dumps, one for UTF-8 and one for ISO-8859-1), plus the serendipity_config_local_inc.php file with the corresponding configuration.

Once we have that, we could execute the maintenance task simulation and see what tables are converted how, and try to execute the listed SQL commands in phpMyAdmin/Adminer/MySQL console and see if the migration task needs fixing.

So: Who's in for some testing? @bauigel @mariohommel @onli @yellowled @donchambers @th-h ? :-)

mariohommel · 2017-04-28T21:52:19Z

I will provide database dumps this weekend.

mariohommel · 2017-04-29T15:59:39Z

Hm, I setup two testblogs the same way, one with a database with kollation UTF8-general-ci and one with latin1-general-ci. The utf8 works fine. When I log in the other one, and click on any menu, I get back to the login page. Is this an issue with die kollation of the database?

mariohommel · 2017-05-01T12:49:20Z

Here are the database dumps for testing.
serendipity.zip

The ISO-Files are from a database with kollation latin1-general-ci but in the configuration I selected UTF8 as charset becaus with "Native" it doesn't even show german umlauts.
You can see both blogs live:
http://vps.hommel-net.de/s9yiso
http://vps.hommel-net.de/s9yutf8

I hope we can play around with this. Let me know if I can do anything further.

garvinhicking · 2017-05-01T20:52:02Z

Hi! If you use "native" charset, did you also enable the "set names/charaet conversion" option? Otherwise a mysql server running utf8 as default could interfere...?

…

On 1. May 2017, at 14:49, Mario Hommel ***@***.***> wrote: Here are the database dumps for testing. serendipity.zip The ISO-Files are from a database with kollation latin1-general-ci but in the configuration I selected UTF8 as charset becaus with "Native" it doesn't even show german umlauts. You can see both blogs live: http://vps.hommel-net.de/s9yiso http://vps.hommel-net.de/s9yutf8 I hope we can play around with this. Let me know if I can do anything further. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

surrim · 2020-02-08T01:39:18Z

[...]
What I want to do is twofold:

1.) New Installations shall use utf8mb4 as charset on installation, if their mysql version matches. If not, UTF-8 will be used. I have adapted the internal SQL key statements to work with utf8mb4, also in the additional_plugins repository all created indexes should be adjusted.

[...]

Could you please rebase the feature branch to current master? I would like to test, at least for my new installation. I have problems with emojis in categories.
At least setting the right charset on new installations would be great.

fe-hicking · 2020-02-12T09:07:32Z

Sadly, this feature hasn't found adoption for testers, and there were outstanding issues of keys and using plugins. It does sadly mean a rather big change in index treatment; and the automatic update/migration was never tested properly.

Due to the keys and handling in the core and plugins we can't really have our code support both running utf8mb4 OR utf-8, so if we want to utilize utf8mb, we would need to do so globally.

Having said this, I have just merged master into the feature branch. I haven't tested it, but it only had 3 conflicts that I could manually merge and shouldn't cause trouble. So if you have the time to test this with your blog, it would be really awesome and a big help. Be sure to make backups. :-)

onli · 2020-03-23T22:10:21Z

Maybe it's time for an update here. @surrim especially :)

In current master, we have three things:

New blogs will install with utf8mb4, if the core thinks the database supports that with our database scheme (we need the support for large indexes from innodb).
There is an upgrade task to move utf8 databases to utf8mb4. That's also toggle-able via a maintenance task (the interface is not done, texts are missing, but the buttons are there and it does what it's supposed to)
There is a second maintenance task to move databases that are not utf8 already to utf8.

Relevant code is in https://github.com/s9y/Serendipity/blob/master/include/db/mysqli.inc.php and https://github.com/s9y/Serendipity/blob/master/include/functions_upgrader.inc.php. The new solution is a lot better than what was the initial approach: Those partial indexes as workaround would have had a big negative impact on performance.

The third point is problematic though. I'm not sure whether we can test what is in the database: Is it utf8, despite the table being marked as latin1? Or is it really a non-utf8 charset? Or is it both? Depending on the situation we'd need different migration strategies: Either just changing the collation to utf8, or moving it to binary first and then marking it as utf8, and if it's both I'm not sure what we could do. Either we find a good solution there or old blogs with non-utf8 database tables have to solve this manually. The maintenance task is for the alpha, not necessarily supposed to stay in there.

But new and newish blogs should be fine. If everything is utf8, going to utf8mb4 will work with an adequate database version and settings by default or upgrade task, s9y will trigger that only when it thinks it will work fine. We just needs more testing to confirm that.

stephanbrunker · 2020-04-25T18:08:30Z

I think this needs more work. To check out #704 , I copied the current master on a test server and used an existing database like it would be in an upgrade scenario. As it should, it recognized the old version and showed the upgrader, told me it would do some tasks, success. But after that I checked via PhpMyAdmin and it showed me no changes in the collation of the tables, still latin1-german2-ci.

Then I logged in and tried via the maintenance button. The 'native to UTF-8'-button converted most of the tables, with the exception of:

authorgroups
entrycat
exits
permalinks
plugincategories

These were the same I got errors (index too large) for when I made the changes manually on the serendipity camp. And with this status, the utf8 to utfmb4 doesn't work at all.

As I have enough to do with my own new feature, what can I do to help the debugging? I can send you an SQL file with my database.

PHP is version 7.3.17 (recently updated), MariaDB is version 5.5.64 but I can upgrade that to 10.x if it makes a difference.

onli · 2020-04-25T18:44:12Z

The upgrader did not upgrade your system to utf8mb4 because your database is not utf8 yet. Note how

Serendipity/include/admin/upgrader.inc.php

Line 280 in 4df3ca6

'function' => 'serendipity_upgradeUTF8_UTF8mb4',

calls only utf8_to_ut8mb4, which checks for that. Native to utf8 is not called, I'm not convinced it is safe to do, more testing needed :)

But that native to utf8 was not possible is indeed unexpected. Do you get an index too large error for them when just trying to go to utf-8?

In that case we would need an additional check to block that conversion in native_to_utf8.

stephanbrunker · 2020-04-25T20:08:22Z

After amusing myself with my testserver, i got that fixed. In your code is mentioned that it only works with MariaDB 10.x - versions. And I forgot to update the database, it still ran with MariaDB 5.5 from back when i put that thing up. After I did that and blew up and re-installed a lot of other things too, the upgrader worked as it should.

So the only thing I have to mention is to put a safety check behind that native -> utf-8 button to check if the database is MariaDB >= 10.0 and abort if it is not. In my case, the code was executed and failed for the tables with the index problem. And output the mb4ready check that the user gets informed when the conversion fails because not all tables are in utf-8.

But with this I found out that #704 is a result of the changes made by the upgrader.

onli · 2020-04-27T14:21:30Z

I'm still not sure we should expose those buttons at all after the alpha. But showing the error message would be good of course.

So the only thing I have to mention is to put a safety check behind that native -> utf-8 button to check if the database is MariaDB >= 10.0 and abort if it is not

And the corresponding mysql version, if it is a problem there as well. Do you have some insight which versions exactly are problematic with even 3-byte utf8?

stephanbrunker · 2020-04-27T14:31:40Z

Unfortunately, no. Normally in a professionally hosted environment, the database should be on a recent version so that problem shouldn't appear (same as PHP where 5.x versions also died out). But as the comments in the code say are the indices compatible with utf-8 with MariaDB 10.x and that seems reasonable.

As for the buttons: If the conversion fails or is not possible, there should be the possibility of running the conversion again after the environmental problems are solved. The 'official' way would be a button, a simple hack is changing the version number in the config.local.inc back to the previous version ...

onli · 2020-05-28T10:41:07Z

Okay, let's recap:

We now have a function to determine whether utf8mb4 is possible
Based on that detection new blogs get utf8mb4
Old blogs with utf8 get converted to utf8mb4 by an upgrade task
We do have code to convert native charsets to utf8, but that seems risky
The utf8 -> utf8mb4 conversion was tested in multiple blogs

I just removed the buttons in the maintenance section. It's not something our users should click on outside of alpha testing. I think the ideal here would be to let the upgrader handle the utf8mb4 upgrade, for the next few versions repeatedly. Maybe we can expose the nativetoutf8 conversion functions to admins in a more manual way?

I'd also like to close here. If issues arise they could get a new dedicated issue thread. Keeps things less confusing. Re-open if I missed something.

ophian added the bugs label Mar 11, 2016

ophian added this to the 2.x.0 milestone Mar 11, 2016

ophian added the non-blocking label Mar 12, 2016

onli changed the title ~~text saving broken with certain smiley~~ MySQL database not using full UTF-8 charset May 10, 2016

onli mentioned this issue Aug 22, 2016

Smileys lost ... #416

Closed

onli mentioned this issue Feb 6, 2017

HTTP500: comment preview and submit crash on array #445

Closed

garvinhicking modified the milestones: x.0.0, 2.x.0 Apr 8, 2017

garvinhicking self-assigned this Apr 8, 2017

garvinhicking removed the non-blocking label Apr 8, 2017

th-h modified the milestones: Major, 2.4 Aug 15, 2019

onli added a commit that referenced this issue Mar 23, 2020

Upgrade task to move utf8 mysql databases to utf8mb4 (#394)

fa17e07

onli added a commit that referenced this issue Mar 23, 2020

Reworked version and function requirement for utf8mb4 (#394)

3ac81b9

th-h assigned onli Mar 23, 2020

onli added a commit that referenced this issue May 28, 2020

MySQL: Set charse tto ut8mb4, needed for some versions (#394)

0def9e9

onli closed this as completed May 28, 2020

onli mentioned this issue Dec 28, 2020

wrong charset after updating to utf8mb4 #747

Closed

MySQL database not using full UTF-8 charset #394

MySQL database not using full UTF-8 charset #394

Comments

ophian commented Mar 11, 2016

garvinhicking commented Mar 11, 2016

yellowled commented Mar 11, 2016

yellowled commented Mar 11, 2016

ophian commented Mar 11, 2016

yellowled commented Mar 11, 2016

ophian commented Mar 11, 2016

garvinhicking commented Mar 11, 2016

ophian commented Mar 11, 2016

ophian commented Mar 11, 2016

garvinhicking commented Mar 11, 2016

yellowled commented Mar 11, 2016

onli commented Mar 11, 2016

ophian commented Mar 11, 2016

ophian commented Mar 13, 2016

garvinhicking commented Mar 13, 2016

ophian commented Mar 14, 2016

ophian commented Mar 14, 2016

garvinhicking commented Apr 23, 2016

bauigel commented Apr 23, 2016

yellowled commented Apr 26, 2016

garvinhicking commented Apr 8, 2017

garvinhicking commented Apr 26, 2017

mariohommel commented Apr 28, 2017

mariohommel commented Apr 29, 2017

mariohommel commented May 1, 2017

garvinhicking commented May 1, 2017 via email

surrim commented Feb 8, 2020

fe-hicking commented Feb 12, 2020

onli commented Mar 23, 2020 • edited

stephanbrunker commented Apr 25, 2020

onli commented Apr 25, 2020

stephanbrunker commented Apr 25, 2020 • edited

onli commented Apr 27, 2020 • edited

stephanbrunker commented Apr 27, 2020

onli commented May 28, 2020

onli commented Mar 23, 2020 •

edited

stephanbrunker commented Apr 25, 2020 •

edited

onli commented Apr 27, 2020 •

edited