MySQL database not using full UTF-8 charset #394

Open
ophian opened this Issue Mar 11, 2016 · 26 comments

6 participants
@ophian
Member

ophian commented Mar 11, 2016

The 😊 is what this is about! (GitHub changes the smiley if not in code tags)

  1. Insert text
    Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 😊 Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur? At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere
  2. see preview is working
  3. save the entry
  4. see frontend entry text cut off after "voluptatem."
  5. back in backend, see the second part isn't saved at all

Who is doing this?

@ophian ophian added the bugs label Mar 11, 2016

@ophian ophian added this to the 2.x.0 milestone Mar 11, 2016

@garvinhicking

Member

garvinhicking commented Mar 11, 2016

This smilie is a high-byte UTF character. The reason it might not be saved is:

a.) html entities/transcoding
b.) Database charset/collation collision
c.) Maybe some mysql_escape_string() thing where it strips this

Definitely try to do this without a WYSIWYG editor to see if it makes a difference. Also try to insert it directly into the serendipity_entries DB table to find out if it's the DB or s9y that's removing the character.

Also, those smilies might not be UTF-8 but UTF-16; in that case we would need to convert the DB tables etc. to UTF-16.

Just a few ideas, maybe this helps.

@yellowled

Member

yellowled commented Mar 11, 2016

Exactly the same behaviour in CKE, but I had to use the source code view in CKE because otherwise it adopts some inline formatting from GitHub.

@yellowled

Member

yellowled commented Mar 11, 2016

For the record, in CKE the preview is working as well, and it is the same if the text is not pasted in source code view (had to copy it without the code block fencing).

@ophian

Member

ophian commented Mar 11, 2016

Definitely try to do this without a WYSIWYG editor to see if it makes a difference.

;-) Well, I did. (I do maintain the cke plugin, but this does not mean I use it all the time!)
I thought of UTF-16 or something like that too...

@yellowled

Member

yellowled commented Mar 11, 2016

"Unicode code points in 'Private Use' areas (UTF-8)", so no UTF-16.

https://twitter.com/fhemberger/status/708221689511940096

@ophian

Member

ophian commented Mar 11, 2016

Putting it into DB via PhpMyAdmin editor gives
Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

@garvinhicking

Member

garvinhicking commented Mar 11, 2016

@ophian What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

@ophian

Member

ophian commented Mar 11, 2016

What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

It is utf8_unicode_ci.
Have to look up how to convert it to UTF-16 though...

@ophian

Member

ophian commented Mar 11, 2016

I cannot change the body collation to either utf16_unicode_ci or utf8mb4_unicode_ci.
It gives #1283 Column 'body' cannot be part of the FULLTEXT index.

@garvinhicking

Member

garvinhicking commented Mar 11, 2016

Hm. I have no UTF-16 knowledge with MySQL yet. Would suck if UTF-16 columns couldn't be fulltext searched. (Well, UTF-16 would suck nevertheless, because of other index lengths, and also because we would need to add UTF-16 charsets, SET NAMES et al.) Better to somehow encode the UTF-16 string to an entity like &#blablabla on saving, so it can be stored as UTF-8...

@yellowled

Member

yellowled commented Mar 11, 2016

FWIW, http://apps.timwhitlock.info/emoji/tables/unicode has the Unicode characters, codes or whatever. For the 😊, for example, it's http://apps.timwhitlock.info/unicode/inspect/hex/1F60A (which has UTF-8 and UTF-16 LE).

@onli

Member

onli commented Mar 11, 2016

Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

That's the UTF-8 hex code for that smiley. I think you are on the wrong track with UTF-16. See http://graphemica.com/%F0%9F%98%8A (but YL's link shows it as well).

Maybe the text encoding of the input is not utf-8?
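
The byte sequence from the phpMyAdmin warning can be checked independently. The following is a minimal sketch using only the Python standard library (the variable names are illustrative, not from Serendipity): it decodes the four bytes as UTF-8 and looks up what the resulting code point is.

```python
import unicodedata

# Bytes quoted in the phpMyAdmin warning (#1366).
raw = b"\xf0\x9f\x98\x8a"

# They are valid UTF-8 and decode to exactly one character.
char = raw.decode("utf-8")
codepoint = f"U+{ord(char):04X}"
name = unicodedata.name(char)

# General category "So" (Symbol, other); a private-use character would be "Co".
category = unicodedata.category(char)

# The same character in UTF-16 LE is a surrogate pair, also 4 bytes.
utf16le = char.encode("utf-16-le")

print(codepoint, name, category, len(raw), utf16le.hex())
```

So the input is plain UTF-8, but the character needs four bytes, one more than MySQL's `utf8` charset stores per character.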

@ophian

Member

ophian commented Mar 11, 2016

This was a PhpMyAdmin error. (They probably have some internal decoding structures.)
Serendipity does not throw anything.

@ophian ophian added the non-blocking label Mar 12, 2016

@ophian

Member

ophian commented Mar 13, 2016

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking Seems so, read http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets. The exception is that for Unicode, the utf8 character set can be used, but not the ucs2 character set. However, although FULLTEXT indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2 also apply to utf16 and utf32.

@garvinhicking

Member

garvinhicking commented Mar 13, 2016

Well, ok... then we should probably stick to UTF-8 and somehow make sure that non-UTF-8 characters (UCS2/UTF-16) are transformed to their &#XXXX; entities to properly show up.

I guess there must be some PHP string function to do these operations, maybe mbstring or so, but I haven't worked with that yet...
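
The entity idea can be sketched independently of PHP. The snippet below is a rough illustration in Python, not Serendipity code; the function name `escape_supplementary` is made up for this example. It assumes the characters that fail to save are those above U+FFFF (the ones needing four bytes in UTF-8) and replaces only those with numeric character references.

```python
def escape_supplementary(text: str) -> str:
    """Replace code points above U+FFFF with &#xHHHH; references.

    Characters in the Basic Multilingual Plane (umlauts, the euro sign,
    and so on) fit into 3 UTF-8 bytes and are left untouched.
    """
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


sample = "voluptatem. \U0001F60A Ut enim"
print(escape_supplementary(sample))
```

The drawback of this approach is that the stored text no longer contains the raw character, so anything that searches or measures the stored string sees the entity instead.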

@ophian

Member

ophian commented Mar 14, 2016

No, we can't use that I assume, since that is a "private use" utf8 area, which means it will never be ported to utf8 natively.
We have to convert it using a symbol-to-Unicode map in a loop. I found something promising. The question is where exactly to place that and how this could be done for these rare cases only... with some try {} catch {} logic?

@ophian

Member

ophian commented Mar 14, 2016

No, we can't use that I assume,

Oh wait.... Did you mean in a php.ini ?

default_charset = "utf-8"
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = Off
mbstring.detect_order = auto
mbstring.substitute_character = none
mbstring.func_overload = 0

or better ?

default_charset = "utf-8"
#[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On

@garvinhicking

Member

garvinhicking commented Apr 23, 2016

Yeah, I meant this extension, not through automatic encoding, but by using mbstring_detect_* or so.

Any replacements we can perform should be built into our serendipity_updertEntry function.

I'm still not so sure what these emoji characters are actually transformed into, and how we could easily translate them to whatever HTML entity :-?

(Leaving this open)

@bauigel

bauigel commented Apr 23, 2016

For emojis it seems we have to use utf8mb4. See https://mathiasbynens.be/notes/mysql-utf8mb4 for instructions on what has to be done.

@yellowled

Member

yellowled commented Apr 26, 2016

Also reproducible in comments, BTW, as I just found out. Comment is cut off after the Emoji.
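
The cut-off is consistent with how a 3-byte `utf8` column treats a 4-byte character when the server is not in strict SQL mode: the value is stored only up to the first character it cannot represent. The following is a model of that symptom in Python under that assumption, not MySQL or Serendipity code; `simulate_utf8mb3_store` is a hypothetical name.

```python
def simulate_utf8mb3_store(text: str) -> str:
    """Cut text at the first code point that needs 4 bytes in UTF-8.

    Assumption: the column uses the 3-byte utf8 charset and the server
    truncates instead of rejecting the value (non-strict mode).
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return text[:index]
    return text


entry = "magnam aliquam quaerat voluptatem. \U0001F60A Ut enim ad minima veniam"
stored = simulate_utf8mb3_store(entry)
print(repr(stored))
```

This matches the report at the top of the thread, where the entry ends after "voluptatem." and everything from the emoji onward is missing.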

@onli onli changed the title from text saving broken with certain smiley to MySQL database not using full UTF-8 charset May 10, 2016

@onli onli referenced this issue Aug 22, 2016

Closed

Smileys lost ... #416

@garvinhicking

Member

garvinhicking commented Apr 8, 2017

We need a new upgrader task method that will be executed on every update, that checks if the supported MySQL version is 5.5.3 or higher, and migration has not yet been executed. If so, we issue the ALTER TABLE commands that change utf8 to utf8mb4 and set an internal config variable that indicates that this migration has been executed.
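
The decision logic of such an upgrader task could look roughly like this. It is only a sketch in Python: the function names, the flag argument, and the way the version string is obtained are assumptions for illustration, not the Serendipity upgrader API.

```python
import re

# Minimum MySQL version named in the comment above.
MIN_VERSION = (5, 5, 3)


def parse_mysql_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '5.7.17-log'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def should_migrate(version: str, already_migrated: bool) -> bool:
    """True only if the server is new enough and the task has not run yet."""
    return not already_migrated and parse_mysql_version(version) >= MIN_VERSION


print(should_migrate("5.5.2-log", already_migrated=False))
print(should_migrate("5.7.17", already_migrated=False))
```

Comparing tuples rather than strings avoids ordering mistakes such as treating "5.10.0" as older than "5.5.3".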

@garvinhicking garvinhicking modified the milestones: x.0.0, 2.x.0 Apr 8, 2017

@garvinhicking garvinhicking self-assigned this Apr 8, 2017

@garvinhicking

Member

garvinhicking commented Apr 26, 2017

Ok all, I committed my "stuff" to the "feature_utf8mb4" branch. I have decided that the whole migration logic is so dangerous, we cannot force it on existing users.

The basic issue is that with UTF-8 we have about 1000 bytes of total length for indexes. At up to 3 bytes per character, this means we can use a key length of 333 characters (1000 / 3). With utf8mb4 this is reduced to 250 characters (1000 / 4). A single index is allowed to reach 767 bytes (191 characters at 4 bytes each). This means that wherever we previously used an index length of 200 or 250 or 255, we can now only use 191 characters. So all indexes need adjustment before the charset of a table can be successfully changed. While this can be automated, the problem is that it is not easy to deduce the proper new key length, because in some tables, reducing an index could lead to failing unique key constraints.
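
The numbers in the paragraph above follow from integer division of the byte limits by the maximum bytes per character. A small Python check, using the two limits exactly as quoted in the comment (the constant names are illustrative):

```python
TOTAL_KEY_BYTES = 1000    # approximate total index length quoted above
SINGLE_INDEX_BYTES = 767  # single-index limit quoted above


def max_chars(byte_limit: int, bytes_per_char: int) -> int:
    """Largest number of characters guaranteed to fit into byte_limit."""
    return byte_limit // bytes_per_char


for charset, width in (("utf8", 3), ("utf8mb4", 4)):
    print(
        charset,
        max_chars(TOTAL_KEY_BYTES, width),
        max_chars(SINGLE_INDEX_BYTES, width),
    )
```

This also shows why existing keys of length 200, 250 or 255 were fine under `utf8` (767 / 3 = 255) but exceed the limit once each character may take four bytes.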

What I want to do is twofold:

1.) New installations shall use utf8mb4 as charset on installation, if their MySQL version supports it. If not, utf8 will be used. I have adapted the internal SQL key statements to work with utf8mb4; all indexes created in the additional_plugins repository should be adjusted as well.

2.) Existing installations now get a maintenance task with which they can convert to utf8mb4. There are two steps: first the "simulation", which yields all SQL statements that would get executed, and second the "execution" of those statements. In my commits, only the simulation has been implemented so far.
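
The two-step design (simulate first, execute later) can be sketched as follows. This is a simplified Python illustration, not the code in the feature branch: the function names, the default collation, and the second table name are assumptions, and the index-length adjustments discussed above are left out entirely.

```python
def simulate_migration(tables, charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Step 1: return the SQL statements without touching the database."""
    return [
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table in tables
    ]


def execute_migration(statements, run):
    """Step 2: hand each reviewed statement to a caller-supplied runner."""
    for statement in statements:
        run(statement)


# 'serendipity_comments' is only an illustrative table name here.
planned = simulate_migration(["serendipity_entries", "serendipity_comments"])
for line in planned:
    print(line)

# Dry run: collect the statements instead of sending them to a server.
executed = []
execute_migration(planned, executed.append)
```

Keeping the simulation free of side effects means its output can be pasted into phpMyAdmin, Adminer or the MySQL console and checked by hand before anything is changed.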

Now what I would need help with is some testing of the functionality. In fresh installations, in my tests, I could successfully insert "raw" emoji codes properly, so everyone could try that.

Then I would need help with testing the actual migration. For that you would need to have a test Serendipity 2.1 installation configured as UTF-8, ideally with some entries with special UTF-8 characters.

It would be great if someone can create a "template" s9y installation (with s9y 2.1, not the feature branch), once with UTF-8 tables and once with "Native charsets" (ISO-8859-1). Then create some categories, authors, entries, html nuggets - all with special characters. Then you could create a mysqldump of those tables, so that other people can use the dump to test the migration with (two dumps, one for UTF-8 and one for ISO-8859-1), plus the serendipity_config_local_inc.php file with the corresponding configuration.

Once we have that, we could execute the maintenance task simulation and see what tables are converted how, and try to execute the listed SQL commands in phpMyAdmin/Adminer/MySQL console and see if the migration task needs fixing.

So: Who's in for some testing? @bauigel @mariohommel @onli @yellowled @donchambers @th-h ? :-)

@mariohommel

Contributor

mariohommel commented Apr 28, 2017

I will provide database dumps this weekend.

@mariohommel

Contributor

mariohommel commented Apr 29, 2017

Hm, I set up two test blogs the same way, one with a database with collation UTF8-general-ci and one with latin1-general-ci. The utf8 one works fine. When I log in to the other one and click on any menu, I get sent back to the login page. Is this an issue with the collation of the database?

@mariohommel

Contributor

mariohommel commented May 1, 2017

Here are the database dumps for testing.
serendipity.zip

The ISO files are from a database with collation latin1-general-ci, but in the configuration I selected UTF8 as charset because with "Native" it doesn't even show German umlauts.
You can see both blogs live:
http://vps.hommel-net.de/s9yiso
http://vps.hommel-net.de/s9yutf8

I hope we can play around with this. Let me know if I can do anything further.

@garvinhicking

Member

garvinhicking commented May 1, 2017
