Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL database not using full UTF-8 charset #394

Closed
ophian opened this issue Mar 11, 2016 · 35 comments
Closed

MySQL database not using full UTF-8 charset #394

ophian opened this issue Mar 11, 2016 · 35 comments
Assignees
Labels
Milestone

Comments

@ophian
Copy link
Member

ophian commented Mar 11, 2016

The 😊 is what this is about! (GitHub changes the smiley if not in code tags)

  1. Insert text
    Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 😊 Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur? At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere
  2. see preview is working
  3. save the entry
  4. see frontend entry text cut off after "voluptatem."
  5. back in backend, see the second part isn't saved at all

Who is doing this?

@ophian ophian added the bugs label Mar 11, 2016
@ophian ophian added this to the 2.x.0 milestone Mar 11, 2016
@garvinhicking
Copy link
Member

This smilie is a high-byte UTF character. The reason it might not be saved is:

a.) html entities/transcoding
b.) Database charset/collation collision
c.) Maybe some mysql_escape_string() thing where it strips this

Definitely try to do this without a WYSIWYG editor to see if it makes a difference. Also try to directly insert it into the serendipity_entries DB table find out if it's the DB or s9y that's removing the character.

Also, it could be that those Smilies are not UTF-8, but UTF-16 (this could be), in that case we would need to convert DB tables etc. to UTF-16.

Just a few ideas, maybe this helps.

@yellowled
Copy link
Member

Exactly the same behaviour in CKE, but I had to use the source code view in CKE because otherwise it adopts some inline formatting from GitHub.

@yellowled
Copy link
Member

For the record, in CKE the preview is working as well, and it is the same if the text is not pasted in srouce code view (had to copy it without the code block fencing).

@ophian
Copy link
Member Author

ophian commented Mar 11, 2016

Definitely try to do this without a WYSIWYG editor to see if it makes a difference.

;-) Well, I did. (I do maintain the cke plugin, but this does not mean I use it alle the time!)
I thought of UTF-16 or something like that too...

@yellowled
Copy link
Member

„Unicode code points in "Private Use"-Bereichen (UTF-8)“ – so no UTF-16.

https://twitter.com/fhemberger/status/708221689511940096

@ophian
Copy link
Member Author

ophian commented Mar 11, 2016

Putting it into DB via PhpMyAdmin editor gives
Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

@garvinhicking
Copy link
Member

@ophian What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

@ophian
Copy link
Member Author

ophian commented Mar 11, 2016

What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

It is utf8_unicode_ci.
Have to look up how to convert it to UTF-16 though...

@ophian
Copy link
Member Author

ophian commented Mar 11, 2016

I cannot change the body collision... to any of UTF-16 unicode_ci or utf8mb4 unicode_ci.
It gives #1283 Column 'body' cannot be part of the FULLTEXT index.

@garvinhicking
Copy link
Member

Hm. I have no UTF-16 knowledge with MYSQL yet. Would suck if UTF-16 columns couldn't be fulltext searched. (Well, UTF-16 would suck nevertheless, because of other index lengths, and also because we would need to add charsets UTF-16 SET NAMES et all). Better to somehow encoded the UTF-16 string to a entity like &#blablabla on saving, so it can be stored as UTF-8...

@yellowled
Copy link
Member

FWIW, http://apps.timwhitlock.info/emoji/tables/unicode has the unicode characters, codes or whatever, for the 😊 for example it's http://apps.timwhitlock.info/unicode/inspect/hex/1F60A (which has UTF-8 and UTF-16 LE).

@onli
Copy link
Member

onli commented Mar 11, 2016

Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

That's the UTS-8 hexcode for that smiley. I think you are on the wrong track with UTF-16. See http://graphemica.com/%F0%9F%98%8A (but YLs link shows it as well).

Maybe the text encoding of the input is not utf-8?

@ophian
Copy link
Member Author

ophian commented Mar 11, 2016

This was a PhpMyAdmin error. (They probably have some internal decoding structures.)
Serendipity does not throw anything.

@ophian
Copy link
Member Author

ophian commented Mar 13, 2016

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking Seems so, read http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets. The exception is that for Unicode, the utf8 character set can be used, but not the ucs2 character set. However, although FULLTEXT indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2 also apply to utf16 and utf32.

@garvinhicking
Copy link
Member

Well, ok...then we should probably stick to UTF-8 and make somehow sure
that Non UTF-8 characters (UCS2/UTF-16) are transformed to their &#XXXX;
entities to properly show up.

I guess there must be some PHP string function to do these operations,
maybe mbstring or so, but I haven't worked with that yet...

On 13.03.2016 13:04 , Ian wrote:

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking https://github.com/garvinhicking Seems so, read
http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets.
The exception is that for Unicode, the utf8 character set can be
used, but not the ucs2 character set. However, although FULLTEXT
indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN
MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2
also apply to utf16 and utf32.


Reply to this email directly or view it on GitHub
#394 (comment).

@ophian
Copy link
Member Author

ophian commented Mar 14, 2016

No, we can't use that I assume, since that is a "private use" utf8 area, which means, will never be ported to utf8 natively.
We have to convert it by a Symbol to unicode map list on a loop. I found something promising. The question is where to place that exactly and how this could be done for these rare cases only... with some try {} catch {} logic?

@ophian
Copy link
Member Author

ophian commented Mar 14, 2016

No, we can't use that I assume,

Oh wait.... Did you mean in a php.ini ?

default_charset = "utf-8"
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = Off
mbstring.detect_order = auto
mbstring.substitute_character = none
mbstring.func_overload = 0

or better ?

default_charset = "utf-8"
#[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On

@garvinhicking
Copy link
Member

(Yeah, I meant this extension, but not through automatic encoding, but by use of mbstring_detect_* or so.

The replacements, if we can perform any, we should build into our serendipity_updertEntry function.

I'm still now so sure what these emoji characters are actually transformed in, and how we could translate them to whatever HTML entity easily :-?

(Leaving this open)

@bauigel
Copy link

bauigel commented Apr 23, 2016

For emojis seems we have to use UTF8MB4. See https://mathiasbynens.be/notes/mysql-utf8mb4 for instructions, what has to be done.

@yellowled
Copy link
Member

Also reproducable in comments, BTW, as I just found out. Comment is cut off after the Emoji.

@onli onli changed the title text saving broken with certain smiley MySQL database not using full UTF-8 charset May 10, 2016
@onli onli mentioned this issue Aug 22, 2016
@garvinhicking
Copy link
Member

We need a new upgrader task method that will be executed on every update, that checks if the supported MySQL version is 5.5.3 or higher, and migration has not yet been executed. If so, we issue the ALTER TABLE commands that change utf8 to utf8mb4 and set an internal config variable that indicates that this migration has been executed.

@garvinhicking garvinhicking modified the milestones: x.0.0, 2.x.0 Apr 8, 2017
@garvinhicking garvinhicking self-assigned this Apr 8, 2017
@garvinhicking
Copy link
Member

Ok all, I committed my "stuff" to the "feature_utf8mb4" branch. I have decided that the whole migration logic is so dangerous, we cannot force it on existing users.

The basic issue is that in UTF-8 we have about 1000 bytes in length for indexes. Due to UTF-8 this means we can use a character length of 333 characters (1000 / 3). With UTF-8 MB4 this is reduced to 250 characters (1000 / 4). A single index is allowed to reach 767 bytes (191 characters). This means, wherever we recently used an index length of 200 or 250 or 255 we can now only use 191 bytes. This means, all indexes need adjustment before the charset of a table can be successfully changed. While this can be automated, the problem is that it is not easy to deduce the new proper key length, because in some tables, reducing an index could lead to failing unique key constraints.

What I want to do is twofold:

1.) New Installations shall use utf8mb4 as charset on installation, if their mysql version matches. If not, UTF-8 will be used. I have adapted the internal SQL key statements to work with utf8mb4, also in the additional_plugins repository all created indexes should be adjusted.

2.) Existing installations now get a maintenance task, with which they can convert to UTF-8. There are two steps; first the "simulation" which yields all SQL statements that would get executed, and second the "execution" of those statements. In my commits, only the simulation has currently been implemented.

Now what I would need help with is some testing of the functionality. In fresh installations, in my tests, I could successfully insert "raw" emoji codes properly, so everyone could try that.

Then I would need help with testing the actual migration. For that you would need to have a test Serendipity 2.1 installation configured as UTF-8, ideally with some entries with special UTF-8 characters.

It would be great if someone can create a "template" s9y installation (with s9y 2.1, not the feature branch), once with UTF-8 tables and once with "Native charsets" (ISO-8859-1). Then create some categories, authors, entries, html nuggets - all with special characters. Then you could create a mysqldump of those tables, so that other people can use the dump to test the migration with (two dumps, one for UTF-8 and one for ISO-8859-1), plus the serendipity_config_local_inc.php file with the corresponding configuration.

Once we have that, we could execute the maintenance task simulation and see what tables are converted how, and try to execute the listed SQL commands in phpMyAdmin/Adminer/MySQL console and see if the migration task needs fixing.

So: Who's in for some testing? @bauigel @mariohommel @onli @yellowled @donchambers @th-h ? :-)

@mariohommel
Copy link
Member

I will provide database dumps this weekend.

@mariohommel
Copy link
Member

Hm, I setup two testblogs the same way, one with a database with kollation UTF8-general-ci and one with latin1-general-ci. The utf8 works fine. When I log in the other one, and click on any menu, I get back to the login page. Is this an issue with die kollation of the database?

@mariohommel
Copy link
Member

Here are the database dumps for testing.
serendipity.zip

The ISO-Files are from a database with kollation latin1-general-ci but in the configuration I selected UTF8 as charset becaus with "Native" it doesn't even show german umlauts.
You can see both blogs live:
http://vps.hommel-net.de/s9yiso
http://vps.hommel-net.de/s9yutf8

I hope we can play around with this. Let me know if I can do anything further.

@garvinhicking
Copy link
Member

garvinhicking commented May 1, 2017 via email

@th-h th-h modified the milestones: Major, 2.4 Aug 15, 2019
@surrim
Copy link
Contributor

surrim commented Feb 8, 2020

[...]
What I want to do is twofold:

1.) New Installations shall use utf8mb4 as charset on installation, if their mysql version matches. If not, UTF-8 will be used. I have adapted the internal SQL key statements to work with utf8mb4, also in the additional_plugins repository all created indexes should be adjusted.

[...]

Could you please rebase the feature branch to current master? I would like to test, at least for my new installation. I have problems with emojis in categories.
At least setting the right charset on new installations would be great.

@fe-hicking
Copy link
Contributor

Sadly, this feature hasn't found adoption for testers, and there were outstanding issues of keys and using plugins. It does sadly mean a rather big change in index treatment; and the automatic update/migration was never tested properly.

Due to the keys and handling in the core and plugins we can't really have our code support both running utf8mb4 OR utf-8, so if we want to utilize utf8mb, we would need to do so globally.

Having said this, I have just merged master into the feature branch. I haven't tested it, but it only had 3 conflicts that I could manually merge and shouldn't cause trouble. So if you have the time to test this with your blog, it would be really awesome and a big help. Be sure to make backups. :-)

@onli
Copy link
Member

onli commented Mar 23, 2020

Maybe it's time for an update here. @surrim especially :)

In current master, we have three things:

  1. New blogs will install with utf8mb4, if the core thinks the database supports that with our database scheme (we need the support for large indexes from innodb).
  2. There is an upgrade task to move utf8 databases to utf8mb4. That's also toggle-able via a maintenance task (the interface is not done, texts are missing, but the buttons are there and it does what it's supposed to)
  3. There is a second maintenance task to move databases that are not utf8 already to utf8.

Relevant code is in https://github.com/s9y/Serendipity/blob/master/include/db/mysqli.inc.php and https://github.com/s9y/Serendipity/blob/master/include/functions_upgrader.inc.php. The new solution is a lot better than what was the initial approach: Those partial indexes as workaround would have had a big negative impact on performance.

The third point is problematic though. I'm not sure whether we can test what is in the database: Is it utf8, despite the table being marked as latin1? Or is it really a non-utf8 charset? Or is it both? Depending on the situation we'd need different migration strategies: Either just changing the collation to utf8, or moving it to binary first and then marking it as utf8, and if it's both I'm not sure what we could do. Either we find a good solution there or old blogs with non-utf8 database tables have to solve this manually. The maintenance task is for the alpha, not necessarily supposed to stay in there.

But new and newish blogs should be fine. If everything is utf8, going to utf8mb4 will work with an adequate database version and settings by default or upgrade task, s9y will trigger that only when it thinks it will work fine. We just needs more testing to confirm that.

@stephanbrunker
Copy link
Contributor

I think this needs more work. To check out #704 , I copied the current master on a test server and used an existing database like it would be in an upgrade scenario. As it should, it recognized the old version and showed the upgrader, told me it would do some tasks, success. But after that I checked via PhpMyAdmin and it showed me no changes in the collation of the tables, still latin1-german2-ci.

Then I logged in and tried via the maintenance button. The 'native to UTF-8'-button converted most of the tables, with the exception of:

  • authorgroups
  • entrycat
  • exits
  • permalinks
  • plugincategories

These were the same I got errors (index too large) for when I made the changes manually on the serendipity camp. And with this status, the utf8 to utfmb4 doesn't work at all.

As I have enough to do with my own new feature, what can I do to help the debugging? I can send you an SQL file with my database.

PHP is version 7.3.17 (recently updated), MariaDB is version 5.5.64 but I can upgrade that to 10.x if it makes a difference.

@onli
Copy link
Member

onli commented Apr 25, 2020

The upgrader did not upgrade your system to utf8mb4 because your database is not utf8 yet. Note how

'function' => 'serendipity_upgradeUTF8_UTF8mb4',
calls only utf8_to_ut8mb4, which checks for that. Native to utf8 is not called, I'm not convinced it is safe to do, more testing needed :)

But that native to utf8 was not possible is indeed unexpected. Do you get an index too large error for them when just trying to go to utf-8?

In that case we would need an additional check to block that conversion in native_to_utf8.

@stephanbrunker
Copy link
Contributor

stephanbrunker commented Apr 25, 2020

After amusing myself with my testserver, i got that fixed. In your code is mentioned that it only works with MariaDB 10.x - versions. And I forgot to update the database, it still ran with MariaDB 5.5 from back when i put that thing up. After I did that and blew up and re-installed a lot of other things too, the upgrader worked as it should.

So the only thing I have to mention is to put a safety check behind that native -> utf-8 button to check if the database is MariaDB >= 10.0 and abort if it is not. In my case, the code was executed and failed for the tables with the index problem. And output the mb4ready check that the user gets informed when the conversion fails because not all tables are in utf-8.

But with this I found out that #704 is a result of the changes made by the upgrader.

@onli
Copy link
Member

onli commented Apr 27, 2020

I'm still not sure we should expose those buttons at all after the alpha. But showing the error message would be good of course.

So the only thing I have to mention is to put a safety check behind that native -> utf-8 button to check if the database is MariaDB >= 10.0 and abort if it is not

And the corresponding mysql version, if it is a problem there as well. Do you have some insight which versions exactly are problematic with even 3-byte utf8?

@stephanbrunker
Copy link
Contributor

Unfortunately, no. Normally in a professionally hosted environment, the database should be on a recent version so that problem shouldn't appear (same as PHP where 5.x versions also died out). But as the comments in the code say are the indices compatible with utf-8 with MariaDB 10.x and that seems reasonable.

As for the buttons: If the conversion fails or is not possible, there should be the possibility of running the conversion again after the environmental problems are solved. The 'official' way would be a button, a simple hack is changing the version number in the config.local.inc back to the previous version ...

@onli
Copy link
Member

onli commented May 28, 2020

Okay, let's recap:

  1. We now have a function to determine whether utf8mb4 is possible
  2. Based on that detection new blogs get utf8mb4
  3. Old blogs with utf8 get converted to utf8mb4 by an upgrade task
  4. We do have code to convert native charsets to utf8, but that seems risky
  5. The utf8 -> utf8mb4 conversion was tested in multiple blogs

I just removed the buttons in the maintenance section. It's not something our users should click on outside of alpha testing. I think the ideal here would be to let the upgrader handle the utf8mb4 upgrade, for the next few versions repeatedly. Maybe we can expose the nativetoutf8 conversion functions to admins in a more manual way?

I'd also like to close here. If issues arise they could get a new dedicated issue thread. Keeps things less confusing. Re-open if I missed something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants