Displaying Column Charset and Collation Incorrectly #2237

Open
ye opened this Issue Sep 23, 2015 · 16 comments

Projects

None yet

2 participants

@ye
ye commented Sep 23, 2015

Running Sequel Pro 1.0.2 Build 4096 and MySQL 5.5.38 on Ubuntu.

Table definition statement below.

CREATE TABLE `user` (
  `modified_at` datetime DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `middle_name` varchar(40) CHARACTER SET utf8mb4 DEFAULT NULL,
  `last_name` varchar(40) CHARACTER SET utf8mb4 DEFAULT NULL,
  `username` varchar(100) CHARACTER SET utf8mb4 DEFAULT NULL,
  `active` tinyint(1) NOT NULL,
  `email` varchar(120) CHARACTER SET utf8mb4 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 

Under Structure tab of the selected table, charset and collation are incorrectly displayed as pointed out below in RED boxes. Two issues:

  1. Column level charset/collation override are NOT taken into consideration (or perhaps when overridden at columns, the defaults is not properly inferred as MySQL would). That is, I believe Sequel Pro assumes table level default collation rather than given charset's default collation as MySQL does. In this case, middle_name, last_name, username all should have been overridden by column level charset utf8mb4, which has a default collation utf8mb4_general_ci. But Sequel Pro displays the table level default collation utf8mb4_unicode_ci instead. See chart1
  2. Charset on Table structure tab view doesn't recognize utf8mb4 - the 4-byte correct implementation of full utf-8 character code points. Sequel Pro still displays UTF-8 Unicode charset, which is indifferent from utf8, where it should have distinguish between utf8mb4 (correct UTF-8 unicode implementation) vs. utf8 (broken UTF-8 unicode implementation as it only covers BMP, Plane 0 or about 5% of all UTF-8 characters). See chart2.

screen shot 2015-09-23 at 3 11 50 pm

screen shot 2015-09-23 at 3 36 58 pm

@dmoagx
Collaborator
dmoagx commented Sep 23, 2015

Retest with 1.1: http://www.sequelpro.com/blog/2015.08/sequel-pro-1-1-rc2/

utf8mb4 is only officially supported starting with 1.1

@ye
ye commented Oct 22, 2015

Still happening on 1.1

Create a test table (see CREATE TABLE statement below), examined on MySQL command line client below vs. latest stable Sequal Pro 1.1 Build 4499 (0bac177).

Charset and Collation should MATCH what command line gives me, that is, charset=UTF-8 4-byte Unicode, Collation=utf8mb4_general_ci

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1234
Server version: 5.5.38-0ubuntu0.14.04.1 (Ubuntu)

mysql> use emoji_test;
Database changed
mysql> show create table tbl_collate_utf8mb4_general_ci;
+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table                          | Create Table                                                                                                                                                                                                                          |
+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tbl_collate_utf8mb4_general_ci | CREATE TABLE `tbl_collate_utf8mb4_general_ci` (
  `id` int(11) NOT NULL DEFAULT '0',
  `txt_col` varchar(255) NOT NULL DEFAULT '',
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show full columns from tbl_collate_utf8mb4_general_ci;
+------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field      | Type         | Collation          | Null | Key | Default | Extra | Privileges                      | Comment |
+------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| id         | int(11)      | NULL               | NO   | PRI | 0       |       | select,insert,update,references |         |
| txt_col    | varchar(255) | utf8mb4_general_ci | NO   |     |         |       | select,insert,update,references |         |
| created_at | datetime     | NULL               | YES  |     | NULL    |       | select,insert,update,references |         |
+------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
3 rows in set (0.00 sec)

mysql> show create database emoji_test;
+------------+---------------------------------------------------------------------------------------------------+
| Database   | Create Database                                                                                   |
+------------+---------------------------------------------------------------------------------------------------+
| emoji_test | CREATE DATABASE `emoji_test` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
+------------+---------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

screen shot 2015-10-22 at 12 05 35 pm

@ye
ye commented Oct 22, 2015

@dmoagx re: utf8mb4 is only officially supported starting with 1.1 - if you are officially supporting utf8mb4, which is different from utf8, then you should make a visually different display name to denote the full support of 4-byte Unicode code map since utf8 only supports <6% of what utf8mb4 supports.

Also, regardless the of visual change to differentiate charsets between utf8mb4 vs. utf8, collation has been consistently wrong through the different versions - no ambiguity, just flat out wrong.

Just a little FYI, MySQL determines collation based on 4 levels of defaults when there is no collation override. That is, charset override will override collations implicitly as well.

  • Database
  • Table
  • Column
  • Charset

I believe Sequel Pro is missing the logic to figure out charset default. In other words, whenever there is a charset override but without collation override, MySQL uses charset default collation instead of parent hierarchy's default collation (in my test case, charset default is utf8mb4_general_ci, parent hierarchy's default, which is database default is utf8mb4_unicode_ci).

Getting Collation right matters, because string indexes are depending on it.

@dmoagx
Collaborator
dmoagx commented Oct 24, 2015

then you should make a visually different display name

The name comes from MySQL, as you can see in the information_schema.CHARACTER_SETS table.

@dmoagx
Collaborator
dmoagx commented Oct 24, 2015

Fixing this issue is going to be a lot more work than what it looks like.

I went through our code and there is a lot of things going wrong IMHO, especially concerning the pre 4.1 charset handling.

Inferring the correct collation will also mean a lot of code changes as the default collation can't be guessed from the SHOW CREATE TABLE statement (which we are doing now), but has to be loaded from the db in a second query.

Then, I'm not even sure which one the grey collation should be:

  • The collation that is also the default for the table --> Makes it easy to visually distinguish deviating columns
  • The collation that is the default for the charset
  • Both?

Perhaps we'll have to revert back to just displaying an empty field if SHOW CREATE TABLE gives no collation info for 1.1.1. Everything else would be too big of a change for a bugfix update.

@dmoagx dmoagx added a commit that referenced this issue Oct 25, 2015
@dmoagx dmoagx Improve the way Sequel Pro inferrs the collation of a column. (#2237)
This does not entirely fix the bug of SP sometimes displaying the wrong collation, but should work in >= 99% of cases.
2a8528d
@dmoagx dmoagx added a commit that referenced this issue Oct 25, 2015
@dmoagx dmoagx Improve the way Sequel Pro inferrs the collation of a column. (#2237)
This does not entirely fix the bug of SP sometimes displaying the wrong collation, but should work in >= 99% of cases.
07b2773
@ye
ye commented Oct 26, 2015

@dmoagx thanks for looking into the code to determine the plan of action. I don't blame you for this bug - it's really MySQL's messy unicode implementation and its own invention of convoluted utf8mb4 that's causing a lot of headaches in the first place.

A little back story: I've been spending more than a month to figure out an emoji bug in our system that eventually leaded to this shocking discovery of MySQL's. That is, charset level default collations. And to our surprised, Sequel Pro, being one of our most favorite tools, is showing us the wrong information; should we have the right information, we'd have found out the issue sooner. So opening this ticket and sharing what we discovered is our way to contribute to the community in the hope that it saves other people who baffle at the same issue.

PS: it's pretty easy to get the default collation for each charset, as shown in the documentation above. These defaults for charsets are fixed (not overridable AFAIK.) But yes, it's tricky as you will have to look for database, table, column overrides first before you look for the charset defaults. In any case, I will be happy to contribute a few test cases if you necessary for the fix(I am not familiar with Objective C so let me know what else I can be more helpful).

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+
39 rows in set (0.00 sec)
@dmoagx dmoagx added a commit that referenced this issue Nov 1, 2015
@dmoagx dmoagx Change the internal handling of charset/collation from offset-based t…
…o name-based in structure view

This is a major change to the charset/collation code. Please watch out for issues!
(part of #2237)
f02fb78
@dmoagx dmoagx added a commit that referenced this issue Nov 2, 2015
@dmoagx dmoagx Add a visual indication for the defaults in the encoding/collation me…
…nus in table structure view (final part of #2237)
7c7660b
@dmoagx
Collaborator
dmoagx commented Nov 2, 2015

In 7c7660b I added a visual indicator to make it easier to understand what the gray color means:

bildschirmfoto 2015-11-02 um 04 22 36 638

@ye
ye commented Nov 3, 2015

@dmoagx 👍 thank you Max!

@ye
ye commented Nov 10, 2015

Another issue was discovered regarding charset/encoding: currently version of sequelpro seems to disregard the table's charset/encoding and using utf8 charset regardless the table setting. In other words, if I have a table that's set to use utf8mb4 charset the export SQL file will still blindly setting client connection to utf8 charset.

For instance: test_default table was set with
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
but the exported SQL dump file still sets client connection with
/*!40101 SET NAMES utf8 */;

The consequence of this bug is that if I have data content outside of utf8 charset, then this export is broken as it cannot represent those characters and will be replaced by ?.

# ************************************************************
# Sequel Pro SQL dump
# Version 4499
#
# http://www.sequelpro.com/
# https://github.com/sequelpro/sequelpro
#
# Host: 127.0.0.1 (MySQL 5.5.38-0ubuntu0.14.04.1)
# Database: emoji_test
# Generation Time: 2015-11-10 21:14:08 +0000
# ************************************************************


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;


# Dump of table test_default
# ------------------------------------------------------------

CREATE TABLE `test_default` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `foo` varchar(250) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `sticky_priority` enum('HIGH','MEDIUM','LOW') COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;




/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
@dmoagx
Collaborator
dmoagx commented Nov 10, 2015

Please open a new issue for that.

@ye
ye commented Nov 10, 2015

@dmoagx you got it. A new issue is opened as advised. #2321

@ye
ye commented Jan 29, 2016

It looks like this fix hasn't been made it to the new release :(

http://www.sequelpro.com/blog/2016.01/security-update-1-1-0-1/

@dmoagx
Collaborator
dmoagx commented Jan 29, 2016

No, since the next release would be 1.1.1
1.1.0.1 is an unplanned security fix only.

@ye
ye commented Jan 29, 2016

Ok good to know. Thanks @dmoagx. When is version 1.1.1 planned?

@ye
ye commented May 16, 2016

Retested Sequel Pro 1.1.2 Build 4541 (602e11a). The display of database/table collation seems to be correct. However, the visual indicator of "Default" are not present in the dropdowns.

@dmoagx
Collaborator
dmoagx commented May 16, 2016

Yeah, that didn't make it into the 1.1.x branch. It will probably be included in 1.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment