vt: case preserve column names #1688

sougou · 2016-05-09T05:37:34Z

@erzel for split query
@michael-berlin for vtworker
@enisoc and @alainjobart for the rest

I've created two interchangeable types: cistring.CIString and sqlparser.ColIdent. Those who don't want to depend on sqlparser are supposed to use CIString. Others can use either type depending on which is more convenient. sqlparser.ColIdent has the necessary functions to be in the AST.

Cases are mostly preserved. There are some exceptions:

View names are always lowercased, because they get stored as table names in AST structs. This affects only DDLs. So, it should have no material impact.
Function name cases are preserved, except for the 'if' function. Since that is a reserved word, it gets lowercased by the tokenizer. Again, there should be no material impact.

The general approach I've used is to convert case-insensitive data into CI types as early as possible. This way, the compiler is likely to catch more accidental comparisons. The main caveat is that this does not work in situations where a variable gets anonymized into an interface. I found a few places: Printf, DeepEqual & JSON. Hopefully, there shouldn't be more. If there are, we just have to fix forward.
The worker code uses protobufs to store the schema. So, column names have remained normal strings there.

Ive added new tests using some reasonable judgment.

This change is

sougou · 2016-05-09T05:41:22Z

#1620

erzel · 2016-05-09T18:40:40Z

LGTM on splitquery after you address my comments.

Previously, sougou wrote…

#1620

Reviewed 7 of 47 files at r1.
Review status: 7 of 47 files reviewed at latest revision, 3 unresolved discussions.

go/vt/tabletserver/splitquery/full_scan_algorithm.go, line 198 [r1] (raw file):

  result := make([]string, 0, len(splitColumns))
  for _, splitColumn := range splitColumns {
      result = append(result, prevBindVariablePrefix+splitColumn.Val())

Same comment here. Use the canonicalized (lowered) name for the splitColumn name.

go/vt/tabletserver/splitquery/split_params.go, line 210 [r1] (raw file):

// isStringSlicePrefix returns true if 'potentialPrefix' is a prefix of the slice
// 'slice'.
func isStringSlicePrefix(potentialPrefix []sqlparser.ColIdent, slice []cistring.CIString) bool {

Rename this method to isColIdentSlicePrefix since it uses ColIdent's now.

go/vt/tabletserver/splitquery/splitter.go, line 32 [r1] (raw file):

  for _, splitColumn := range splitter.splitParams.splitColumns {
      splitter.startBindVariableNames = append(
          splitter.startBindVariableNames, startBindVariablePrefix+splitColumn.Val())

It makes more sense to me to have the bind variable name use the lower-case version so it's not dependent on the user's query.
This should be fine with the user, since this bind variable is internal and the user should not depend on its name.

Comments from Reviewable

@erzel

@erzel code review comments

sougou · 2016-05-09T19:49:23Z

Review status: 4 of 47 files reviewed at latest revision, 3 unresolved discussions.

go/vt/tabletserver/splitquery/full_scan_algorithm.go, line 198 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

Same comment here. Use the canonicalized (lowered) name for the splitColumn name.

Done.

go/vt/tabletserver/splitquery/split_params.go, line 210 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

Rename this method to isColIdentSlicePrefix since it uses ColIdent's now.

Done.

go/vt/tabletserver/splitquery/splitter.go, line 32 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

It makes more sense to me to have the bind variable name use the lower-case version so it's not dependent on the user's query.
This should be fine with the user, since this bind variable is internal and the user should not depend on its name.

Done.

Comments from Reviewable

enisoc · 2016-05-09T19:59:26Z

Reviewed 40 of 47 files at r1.
Review status: 44 of 47 files reviewed at latest revision, 11 unresolved discussions.

go/cistring/cistring.go, line 15 [r1] (raw file):

// CIString is an immutable case-insensitive string.
type CIString struct {
  val, lowered string

Can you add to the doc comment an explanation of why you chose the "extra storage" implementation rather than "lowercase on compare"? Is it mainly for the CPU during comparisons? Or do you expect the conversion-on-compare to put extra load on the GC?

go/cistring/cistring.go, line 18 [r1] (raw file):

  // nocompare prevents this struct from being compared
  // with itself.
  nocompare func()

At lunch, we discussed the zero byte version of this, used here: https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/replication/replication.go#L31

Will that work here?

go/cistring/cistring.go, line 22 [r1] (raw file):

// NewCIString creates a new CIString.
func NewCIString(str string) CIString {

This should be called New() as in cistring.New(). Or if you foresee other case-insensitive types later, you could name the package ci and have ci.NewString().

go/cistring/cistring.go, line 34 [r1] (raw file):

// Val returns the case-preserved value of the string.
func (s CIString) Val() string {

The name Val() seems ambiguous. What about Original() and Lowered()?

go/cistring/cistring.go, line 46 [r1] (raw file):

// Equal returns true if the input is case-insensitive
// equal to the string. If the input is already lower-cased,

For cases where you want to lowercase once and compare in a loop, maybe you could recommend a pattern like this:

ciInput := ci.NewString(input)
for ... {
  if ciName.Equal(ciInput) {
    ...
  }
}

This has the benefit of avoiding manual calls to strings.ToLower() in places where you interoperate with CIS, which will help if the specific lowercasing behavior of CIS is ever changed (for locale fixes, for example).

go/cistring/cistring.go, line 48 [r1] (raw file):

// equal to the string. If the input is already lower-cased,
// it's more efficient to check if s.Lowered()==in.
func (s CIString) Equal(in string) bool {

Do you ever need to compare CIString to another CIString? Usually I would expect CIString.Equal() to have signature CIString.Equal(CIString), and then the version for comparing to something other than CIString would be suffixed like CIString.EqualString(string).

go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

// ColIdent is a case insensitive SQL identifier. It will be escaped with
// backquotes if it matches a keyword.
type ColIdent cistring.CIString

If you embed, you won't have to duplicate the CIString API (String(), Val(), Lowered(), etc.):

type ColIdent struct {
  cistring.CIString
}

Is there a reason you prefer wrapping over embedding?

test/vtgatev3_test.py, line 359 [r1] (raw file):

    # Test case sensitivity
    cursor.execute('select Id, Name from vt_user where iD = :id', {'id': 1})

Is there a test for giving the case used in schema if they do select *?

Comments from Reviewable

@enisoc

@enisoc code review comments

sougou · 2016-05-09T21:18:20Z

Review status: 23 of 47 files reviewed at latest revision, 11 unresolved discussions.

go/cistring/cistring.go, line 15 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Can you add to the doc comment an explanation of why you chose the "extra storage" implementation rather than "lowercase on compare"? Is it mainly for the CPU during comparisons? Or do you expect the conversion-on-compare to put extra load on the GC?

Done.

go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

At lunch, we discussed the zero byte version of this, used here: https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/replication/replication.go#L31

Will that work here?

It turns out that the zero-byte version (array) still consumes one word. I tested both options and the size of the struct was the same 40 bytes. So, I chose the simpler looking code.

go/cistring/cistring.go, line 22 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This should be called New() as in cistring.New(). Or if you foresee other case-insensitive types later, you could name the package ci and have ci.NewString().

Done.

go/cistring/cistring.go, line 34 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

The name Val() seems ambiguous. What about Original() and Lowered()?

Done.

go/cistring/cistring.go, line 46 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

For cases where you want to lowercase once and compare in a loop, maybe you could recommend a pattern like this:
ciInput := ci.NewString(input)
for ... {
  if ciName.Equal(ciInput) {
    ...
  }
}
This has the benefit of avoiding manual calls to strings.ToLower() in places where you interoperate with CIS, which will help if the specific lowercasing behavior of CIS is ever changed (for locale fixes, for example).

I'm neutral both ways. This still doesn't address the situation where the input is already lowercased, but one always has the option of using Lowered: ciName.Lowered() == input. I'll leave those parts alone for now.

go/cistring/cistring.go, line 48 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Do you ever need to compare CIString to another CIString? Usually I would expect CIString.Equal() to have signature CIString.Equal(CIString), and then the version for comparing to something other than CIString would be suffixed like CIString.EqualString(string).

Done.

go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

If you embed, you won't have to duplicate the CIString API (String(), Val(), Lowered(), etc.):
type ColIdent struct {
  cistring.CIString
}
Is there a reason you prefer wrapping over embedding?

I've avoided embedding because it has generally led to confusion in the past. Also, the last time I tested, it didn't work for things like MarshalJSON. The wrapping also lets you freely cast the two types back and forth.

test/vtgatev3_test.py, line 359 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Is there a test for giving the case used in schema if they do select *?

Quite a few: data/test/tabletserver/exec_cases.txt. I changed the schema to contain mixed case column names.
I did the same thing for the v3 tests also (data/test/vtgate/*.txt).

Comments from Reviewable

enisoc · 2016-05-09T23:25:49Z

LGTM for the parts outside splitquery.

Previously, erzel (Erez Louidor) wrote…

LGTM on splitquery after you address my comments.

Reviewed 21 of 22 files at r3.
Review status: 44 of 47 files reviewed at latest revision, 5 unresolved discussions.

go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, sougou wrote…

It turns out that the zero-byte version (array) still consumes one word. I tested both options and the size of the struct was the same 40 bytes. So, I chose the simpler looking code.

Ah they must have "fixed the glitch" that I was taking advantage of :) Your way is safer anyway, since they may someday decide that the compiler can compare two zero-length arrays of uncomparable things...

go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

Previously, sougou wrote…

I've avoided embedding because it has generally led to confusion in the past. Also, the last time I tested, it didn't work for things like MarshalJSON. The wrapping also lets you freely cast the two types back and forth.

ok

Comments from Reviewable

enisoc · 2016-05-09T23:39:23Z

Review status: 44 of 47 files reviewed at latest revision, 4 unresolved discussions.

go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Ah they must have "fixed the glitch" that I was taking advantage of :) Your way is safer anyway, since they may someday decide that the compiler can compare two zero-length arrays of uncomparable things...

It turns out, you can still make it take zero space, as long as it's not the last field in the struct:

golang/go@6f07ac2

I'll make a PR to fix the other use so it's not the last field.

Comments from Reviewable

Use the newly-sugested zero-size construct that prevents the struct from comparing with itself. The previous construct used a nil function pointer that consumed one word.

sougou · 2016-05-10T03:42:59Z

Review status: 42 of 47 files reviewed at latest revision, 4 unresolved discussions.

go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

It turns out, you can still make it take zero space, as long as it's not the last field in the struct:

golang/go@6f07ac2

I'll make a PR to fix the other use so it's not the last field.

Done.

Comments from Reviewable

enisoc · 2016-05-10T05:12:29Z

Reviewed 2 of 2 files at r4.
Review status: 44 of 47 files reviewed at latest revision, 3 unresolved discussions.

Comments from Reviewable

Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com> Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>

sougou added 3 commits May 6, 2016 22:14

vt: case preserve column names (WIP)

a35ae57

vt: case preserve column names, fixed tests (WIP)

4bf3615

vt: case preserve column names, new tests

5290b99

googlebot added the cla: yes label May 9, 2016

vt: case preserve column names

0cf3512

@erzel code review comments

vt: case preserve column names

514e95e

@enisoc code review comments

cistring: use zero-size artifact

559d32f

Use the newly-sugested zero-size construct that prevents the struct from comparing with itself. The previous construct used a nil function pointer that consumed one word.

sougou merged commit c32dc3c into vitessio:master May 10, 2016

frouioui pushed a commit to planetscale/vitess that referenced this pull request Nov 21, 2023

backport of 1651 (vitessio#1688)

5ea2414

Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com> Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vt: case preserve column names #1688

vt: case preserve column names #1688

sougou commented May 9, 2016 •

edited by enisoc

sougou commented May 9, 2016

erzel commented May 9, 2016 •

edited by enisoc

sougou commented May 9, 2016

enisoc commented May 9, 2016

sougou commented May 9, 2016

enisoc commented May 9, 2016 •

edited

enisoc commented May 9, 2016

sougou commented May 10, 2016

enisoc commented May 10, 2016

vt: case preserve column names #1688

vt: case preserve column names #1688

Conversation

sougou commented May 9, 2016 • edited by enisoc

sougou commented May 9, 2016

erzel commented May 9, 2016 • edited by enisoc

sougou commented May 9, 2016

enisoc commented May 9, 2016

sougou commented May 9, 2016

enisoc commented May 9, 2016 • edited

enisoc commented May 9, 2016

sougou commented May 10, 2016

enisoc commented May 10, 2016

sougou commented May 9, 2016 •

edited by enisoc

erzel commented May 9, 2016 •

edited by enisoc

enisoc commented May 9, 2016 •

edited