MySQL database not using full UTF-8 charset #394

Open
ophian opened this Issue Mar 11, 2016 · 20 comments

@ophian
Member
ophian commented Mar 11, 2016

The 😊 is what this is about! (GitHub changes the smiley if not in code tags)

  1. Insert text
    Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 😊 Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur? At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere
  2. see preview is working
  3. save the entry
  4. see frontend entry text cut off after "voluptatem."
  5. back in backend, see the second part isn't saved at all

Who is doing this?

@ophian ophian added the bugs label Mar 11, 2016
@ophian ophian added this to the 2.x.0 milestone Mar 11, 2016
@garvinhicking
Member

This smiley is a high-byte UTF character. The reason it might not be saved could be:

a.) html entities/transcoding
b.) Database charset/collation collision
c.) Maybe some mysql_escape_string() thing where it strips this

Definitely try to do this without a WYSIWYG editor to see if it makes a difference. Also try to insert it directly into the serendipity_entries DB table to find out whether it's the DB or s9y that's removing the character.

Also, it could be that those smileys are not UTF-8 but UTF-16; in that case we would need to convert DB tables etc. to UTF-16.

Just a few ideas, maybe this helps.

@yellowled
Member

Exactly the same behaviour in CKE, but I had to use the source code view in CKE because otherwise it adopts some inline formatting from GitHub.

@yellowled
Member

For the record, in CKE the preview works as well, and the result is the same if the text is not pasted in source code view (had to copy it without the code block fencing).

@ophian
Member
ophian commented Mar 11, 2016

Definitely try to do this without a WYSIWYG editor to see if it makes a difference.

;-) Well, I did. (I do maintain the cke plugin, but this does not mean I use it all the time!)
I thought of UTF-16 or something like that too...

@yellowled
Member

"Unicode code points in 'Private Use' areas (UTF-8)" – so no UTF-16.

https://twitter.com/fhemberger/status/708221689511940096

@ophian
Member
ophian commented Mar 11, 2016

Putting it into DB via PhpMyAdmin editor gives
Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

@garvinhicking
Member

@ophian What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

@ophian
Member
ophian commented Mar 11, 2016

What's the collation of the "body" column? Maybe try to set it to UTF-16 just to see if that changes things.

It is utf8_unicode_ci.
Have to look up how to convert it to UTF-16 though...

@ophian
Member
ophian commented Mar 11, 2016

I cannot change the body collation... to either utf16_unicode_ci or utf8mb4_unicode_ci.
It gives #1283 Column 'body' cannot be part of the FULLTEXT index.

@garvinhicking
Member

Hm. I have no UTF-16 experience with MySQL yet. Would suck if UTF-16 columns couldn't be fulltext searched. (Well, UTF-16 would suck nevertheless, because of other index lengths, and also because we would need to add UTF-16 charsets, SET NAMES et al.) Better to somehow encode the UTF-16 string to an entity like &#blablabla on saving, so it can be stored as UTF-8...

@yellowled
Member

FWIW, http://apps.timwhitlock.info/emoji/tables/unicode lists the Unicode characters with their codes; for the 😊, for example, it's http://apps.timwhitlock.info/unicode/inspect/hex/1F60A (which shows UTF-8 and UTF-16 LE).

@onli
Member
onli commented Mar 11, 2016

Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8A b...' for column 'body' at row 1

That's the UTF-8 hex code for that smiley. I think you are on the wrong track with UTF-16. See http://graphemica.com/%F0%9F%98%8A (but YL's link shows it as well).

Maybe the text encoding of the input is not utf-8?
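
A quick way to check this is to decode the bytes quoted in the phpMyAdmin warning. The following is only a small Python sketch for illustration (Python is not part of Serendipity); it decodes the four bytes as UTF-8 and prints the UTF-16 LE form for comparison.

```python
# Bytes quoted in the phpMyAdmin warning (#1366), without the trailing " b...".
raw = bytes.fromhex("F09F988A")

# Decoding as UTF-8 succeeds and yields a single code point.
char = raw.decode("utf-8")
code_point = f"U+{ord(char):04X}"

# The same character in UTF-16 LE is a surrogate pair (two 16-bit units).
utf16_le = char.encode("utf-16-le")

print(code_point)               # U+1F60A
print(len(raw))                 # 4 (bytes in UTF-8)
print(utf16_le.hex().upper())   # 3DD80ADE
```

So the input is valid UTF-8; it just needs four bytes for this one character.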

@ophian
Member
ophian commented Mar 11, 2016

This was a PhpMyAdmin error. (They probably have some internal decoding structures.)
Serendipity does not throw anything.

@ophian ophian added the non-blocking label Mar 12, 2016
@ophian
Member
ophian commented Mar 13, 2016

Would suck if UTF-16 columns couldn't be fulltext searched.

@garvinhicking Seems so, read http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

Full-text searches can be used with most multibyte character sets. The exception is that for Unicode, the utf8 character set can be used, but not the ucs2 character set. However, although FULLTEXT indexes on ucs2 columns cannot be used, you can perform IN BOOLEAN MODE searches on a ucs2 column that has no such index.

The remarks for utf8 also apply to utf8mb4, and the remarks for ucs2 also apply to utf16 and utf32.

@garvinhicking
Member

Well, ok... then we should probably stick to UTF-8 and somehow make sure that non-UTF-8 characters (UCS2/UTF-16) are transformed to their &#XXXX; entities to properly show up.

I guess there must be some PHP string function to do these operations, maybe mbstring or so, but I haven't worked with that yet...
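
The entity idea could look roughly like the sketch below. It is written in Python purely for illustration (Serendipity itself is PHP), and the helper name `astral_to_entities` is hypothetical. The assumption is that the column only accepts characters of up to three bytes, so every character above U+FFFF, which needs four bytes in UTF-8, is replaced by its numeric HTML entity before saving.

```python
import re

# Matches code points above U+FFFF, which take four bytes in UTF-8.
ASTRAL = re.compile(r"[\U00010000-\U0010FFFF]")

def astral_to_entities(text: str) -> str:
    """Replace 4-byte characters with numeric HTML entities (hypothetical helper)."""
    return ASTRAL.sub(lambda m: f"&#x{ord(m.group()):X};", text)

sample = "voluptatem. \U0001F60A Ut enim"
print(astral_to_entities(sample))  # voluptatem. &#x1F60A; Ut enim
```

The trade-off of this approach is that the stored text contains the entity rather than the character, so the output must not escape the ampersand again, and a search for the emoji itself would not match.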

@ophian
Member
ophian commented Mar 14, 2016

No, we can't use that I assume, since that is a "private use" UTF-8 area, which means it will never be ported to UTF-8 natively.
We would have to convert it via a symbol-to-Unicode map list in a loop. I found something promising. The question is where exactly to place that, and how this could be done for these rare cases only... with some try {} catch {} logic?

@ophian
Member
ophian commented Mar 14, 2016

No, we can't use that I assume,

Oh wait.... Did you mean in a php.ini ?

default_charset = "utf-8"
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = Off
mbstring.detect_order = auto
mbstring.substitute_character = none
mbstring.func_overload = 0

or better ?

default_charset = "utf-8"
#[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On

@garvinhicking
Member

Yeah, I meant this extension, though not through automatic encoding, but by use of mb_detect_* or so.

The replacements, if we can perform any, we should build into our serendipity_updertEntry function.

I'm still not so sure what these emoji characters are actually transformed into, and how we could easily translate them to whatever HTML entity :-?

(Leaving this open)

@bauigel
bauigel commented Apr 23, 2016

For emojis it seems we have to use utf8mb4. See https://mathiasbynens.be/notes/mysql-utf8mb4 for instructions on what has to be done.
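
To see why utf8mb4 matters here: MySQL's `utf8` charset stores at most three bytes per character, while the smiley needs four. Below is a minimal Python sketch (illustration only; `first_4byte_char` is a hypothetical helper, not part of Serendipity) that locates the first such character in a text. That position matches where the entry got cut off in the original report.

```python
from typing import Optional

def first_4byte_char(text: str) -> Optional[int]:
    """Return the index of the first character needing 4 bytes in UTF-8, or None."""
    for index, char in enumerate(text):
        if len(char.encode("utf-8")) == 4:
            return index
    return None

entry = "quaerat voluptatem. \U0001F60A Ut enim ad minima veniam"
cut = first_4byte_char(entry)
print(repr(entry[:cut]))  # 'quaerat voluptatem. '
```

A check like this could also be used to find existing entries or comments that would be affected, before deciding whether to convert the tables.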

@yellowled
Member

Also reproducible in comments, BTW, as I just found out. The comment is cut off after the emoji.

@onli onli changed the title from text saving broken with certain smiley to MySQL database not using full UTF-8 charset May 10, 2016