Change MySQL UTF-8 examples to use utf8mb4 #5100

Closed
wants to merge 8 commits into from

Contributor

@DHager DHager commented Mar 21, 2015

You might think MySQL's utf8 is the right choice, but it's actually got some problems handling certain character inputs. The later, corrected mode of utf8mb4 has fewer surprises.


If you are using MySQL, its `utf8` character set has some shortcomings
which may cause problems. Prefer the `utf8mb4` character set instead, if
your version supports it.
Member

You will have to indent all these lines by four spaces for them to be rendered inside the enclosing sidebar block.

Contributor Author

Hmm, I guess I did want to have a caution-block inside the sidebar-block, but offhand I'm not sure if that's a thing that is done elsewhere in the documentation.

@xabbuh
Member

xabbuh commented Mar 21, 2015

@DHager Thanks for your suggestion. Do you have some resources to which we can refer here?

@DHager
Contributor Author

DHager commented Mar 24, 2015

@xabbuh Sure. First, the MySQL 5.5 docs, section 10.1.10.6, "The utf8mb4 Character Set":

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters [...]

For a supplementary character, utf8 cannot store the character at all

In addition, other people have reported that the failure mode is complete string truncation, losing everything past the first problem-symbol:

The content got truncated at the first astral Unicode symbol [...] MySQL returned a warning message
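To make the three-byte limit concrete, here is a small Python sketch that is not part of the original thread. It counts the UTF-8 bytes needed for a few characters, flags text containing supplementary (non-BMP) characters, and crudely models the truncation reported above. The function names are made up for illustration. The truncation helper only imitates the reported symptom and makes no claim about how MySQL is implemented.

```python
def utf8_len(ch):
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """Return True if any character lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def truncate_at_first_astral(text):
    """Rough model of the reported symptom: keep only the text that
    precedes the first non-BMP character. This is an illustration,
    not MySQL's actual code path."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return text[:i]
    return text


print(utf8_len("\u00e9"))      # 2 (U+00E9, inside the BMP)
print(utf8_len("\u20ac"))      # 3 (U+20AC, inside the BMP)
print(utf8_len("\U0001F4A9"))  # 4 (U+1F4A9, outside the BMP)

sample = "caf\u00e9 \U0001F4A9 tail"
print(needs_utf8mb4(sample))                   # True
print(repr(truncate_at_first_astral(sample)))  # 'café '
```

A check like `needs_utf8mb4` could be used to scan existing input for characters that a three-byte `utf8` column would not be able to hold.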

@DHager
Contributor Author

DHager commented Mar 24, 2015

Crap, an unrelated doc-fix seems to have become part of the pull request; I forgot that GitHub automatically draws such commits in. I'll see if I can fix it.


.. caution::

If you are using MySQL, its `utf8` character set actually only supports
Member

In reStructuredText, you have to use double backticks.

@DHager
Contributor Author

DHager commented Mar 26, 2015

OK, my local make html still doesn't have the same styles as the main site, but they should look like code/keywords now.

@weaverryan
Member

Hi @DHager!

Thanks for bringing this up - I think it's a good note, especially since it causes silent issues (truncations).

Since this character set is new in MySQL 5.5.3, I think we should:

A) Add 2 new (commented out) lines in the [mysqld] code block with the old character set, and a comment above them saying you need to use these before 5.5.3
B) Add a quick comment right below the [mysqld] line (so above where you set the character set to your new type) that explains (briefly) that this character set is better, with maybe a link to the MySQL docs about this
C) Remove the caution block you added, because this will basically be covered by the above comment.

What do you think? If you agree, can you make these changes?

Thanks!

@ricardclau
Contributor

Personally, I would not add the commented lines, but would reword the caution section a bit and add a link to the official MySQL docs.

MySQL 5.5.3 was released 5 years ago, and I think it is good for this setting to become more popular, as many people (myself included, until I read this PR) still believe utf8 is the way to go.

Plus, I think we should take the opportunity to review all the Symfony and Silex docs regarding this. A good opportunity to get easy doc badges as well :)

Just my 2 cents @weaverryan and @DHager :)

@DHager
Contributor Author

DHager commented Apr 29, 2015

@weaverryan, @ricardclau : I'm inclined to put utf8mb4 as the default because:

  1. If it isn't supported by their DB, the user will find out immediately.
  2. The string is very distinctive, so they can easily find the configuration-file and search online for more information.

The reverse-case, using utf8, is trickier:

  1. Everything will seem to work until weeks, months, or years later.
  2. They won't necessarily know/remember to consult this documentation or their my.cnf comments. Instead they will probably waste time searching for "mysql text truncated" and (if they saw the MySQL warnings) "mysql incorrect string value".

I've "promoted" the caution block to another sentence, and added comments to the configuration sample.
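For readers without the diff at hand, the sketch below embeds a hypothetical `my.cnf` fragment along the lines described above (utf8mb4 as the default, with comments naming the older values) and uses Python's `configparser` to read back the configured character set. The fragment is an illustration, not the exact text that was merged. `character-set-server` and `collation-server` are standard MySQL server options, but the collation shown is only one possible choice, and the helper name `server_charset` is made up here.

```python
import configparser

# Hypothetical my.cnf fragment: utf8mb4 is the default, and the comments
# record which values it replaces for servers older than MySQL 5.5.3.
SAMPLE_MY_CNF = """
[mysqld]
# Version 5.5.3 introduced "utf8mb4", which is recommended
collation-server     = utf8mb4_general_ci  # Replaces utf8_general_ci
character-set-server = utf8mb4             # Replaces utf8
"""


def server_charset(cnf_text):
    """Return the character-set-server value from a my.cnf-style string,
    or None if the [mysqld] section does not set it."""
    parser = configparser.ConfigParser(
        allow_no_value=True,            # my.cnf allows bare flags
        strict=False,                   # tolerate repeated options
        inline_comment_prefixes=("#",)  # strip trailing "# ..." comments
    )
    parser.read_string(cnf_text)
    return parser.get("mysqld", "character-set-server", fallback=None)


print(server_charset(SAMPLE_MY_CNF))  # utf8mb4
```

On a server older than 5.5.3 the two values would need to be changed back to the ones named in the trailing comments.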

@ricardclau
Contributor

I am 100% with @DHager on this one, but of course up to you @weaverryan :)

@weaverryan
Member

I think it's great now - you're right that the new one should be the default, and I like that you show the old one and have some (short) words explaining. I'll merge this shortly :).

Thanks!

@ricardclau
Contributor

@DHager do you want to add this to the Silex / Doctrine docs as well? I can have a look at them, but since you opened this PR I think it is fair that you go for them if you have time.

@weaverryan
Member

Thanks again! And yes, I think adding this to Silex or Doctrine, if they have similar notes, makes great sense. Cheers!

weaverryan added a commit that referenced this pull request May 3, 2015
…ien Hager)

This PR was submitted for the 2.6 branch but it was merged into the 2.3 branch instead (closes #5100).

Discussion
----------

Change MySQL UTF-8 examples to use utf8mb4

You might think MySQL's `utf8` is the right choice, but it's actually got some problems handling certain character inputs. The later, corrected mode of `utf8mb4` has fewer surprises.

Commits
-------

7d7d94e Rewrite utf8mb4 cautions, add comment into sample configuration
55874c4 Add backticks for code-styling
e3c2fb6 Indenting caution block to nest it inside the sidebar
6406f22 Revert "Fix example name to avoid breaking collision with standard data-collectors"
dfc5620 Revert "Add a cautionary note telling users where the "standard" data-collector names can be found."
216ae51 Add a cautionary note telling users where the "standard" data-collector names can be found.
f0ced91 Fix example name to avoid breaking collision with standard data-collectors
f9cae6c Change MySQL UTF-8 examples to use utf8mb4, which is closer to the standard most people would expect
@weaverryan weaverryan closed this May 3, 2015
@DHager
Contributor Author

DHager commented May 5, 2015

Forgot to mention: Using utf8mb4 does have one risk, which I believe to be minor and easily detectable. Depending on a lot of MySQL configuration details, it's possible that certain text columns cannot be indexed or made unique because their new byte requirements are larger than what the engines (e.g. InnoDB) are configured to support.

I'd call this minor because if someone already has a database, those columns will already be in some other character set, and if they're creating a new one... they probably shouldn't be trying to index longer text fields in the first place.
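The arithmetic behind that risk can be sketched in a few lines of Python. This assumes a 767-byte index key limit, the figure commonly cited for older InnoDB row formats; the actual limit depends on the server version and configuration, so treat the constant and the function names below as illustrative assumptions.

```python
# Assumed limit: 767 bytes is commonly cited for older InnoDB row formats.
# Check your own server's documentation and settings for the real value.
ASSUMED_INDEX_LIMIT_BYTES = 767


def max_indexable_chars(max_bytes_per_char, limit_bytes=ASSUMED_INDEX_LIMIT_BYTES):
    """Longest column length (in characters) whose worst-case byte size
    still fits within the index key limit."""
    return limit_bytes // max_bytes_per_char


def fits_in_index(column_chars, max_bytes_per_char, limit_bytes=ASSUMED_INDEX_LIMIT_BYTES):
    """True if a column of column_chars characters fits the limit when
    every character takes the maximum number of bytes."""
    return column_chars * max_bytes_per_char <= limit_bytes


print(max_indexable_chars(3))  # 255 (utf8: up to 3 bytes per character)
print(max_indexable_chars(4))  # 191 (utf8mb4: up to 4 bytes per character)
print(fits_in_index(255, 3))   # True  (765 bytes)
print(fits_in_index(255, 4))   # False (1020 bytes)
```

Under that assumption, a fully indexed 255-character column that was fine with `utf8` no longer fits with `utf8mb4`, which matches the kind of failure described above.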

fabpot added a commit to silexphp/Silex that referenced this pull request May 28, 2015
This PR was merged into the 1.2 branch.

Discussion
----------

Changed Doctrine page to use utf8mb4 as sample

MySQL's `utf8` character set is a little broken, and does not cover 4-byte UTF-8 characters. In most cases it will quietly truncate the string whenever it sees one, saving incomplete text data.

In MySQL 5.5.3 they introduced `utf8mb4` to fix this inconsistency, and given that it's been 5 years, it's probably safe to encourage people to use it. If their MySQL installation is older, it should be easy for them to find the distinctive string and change it back to `utf8`, and for a new project.

Additional details can be found in the equivalent [pull-request for Symfony-2](symfony/symfony-docs#5100).

Commits
-------

a20f8f6 Changed Doctrine page to use utf8mb4 as sample