Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vt: case preserve column names #1688

Merged
merged 6 commits into from May 10, 2016
Merged

vt: case preserve column names #1688

merged 6 commits into from May 10, 2016

Conversation

sougou
Copy link
Contributor

@sougou sougou commented May 9, 2016

@erzel for split query
@michael-berlin for vtworker
@enisoc and @alainjobart for the rest

I've created two interchangeable types: cistring.CIString and sqlparser.ColIdent. Those who don't want to depend on sqlparser are supposed to use CIString. Others can use either type depending on which is more convenient. sqlparser.ColIdent has the necessary functions to be in the AST.

Cases are mostly preserved. There are some exceptions:

  • View names are always lowercased, because they get stored as table names in AST structs. This affects only DDLs. So, it should have no material impact.
  • Function name cases are preserved, except for the 'if' function. Since that is a reserved word, it gets lowercased by the tokenizer. Again, there should be no material impact.

The general approach I've used is to convert case-insensitive data into CI types as early as possible. This way, the compiler is likely to catch more accidental comparisons. The main caveat is that this does not work in situations where a variable gets anonymized into an interface. I found a few places: Printf, DeepEqual & JSON. Hopefully, there shouldn't be more. If there are, we just have to fix forward.
The worker code uses protobufs to store the schema. So, column names have remained normal strings there.

Ive added new tests using some reasonable judgment.


This change is Reviewable

@sougou
Copy link
Contributor Author

sougou commented May 9, 2016

#1620

@erzel
Copy link
Contributor

erzel commented May 9, 2016

LGTM on splitquery after you address my comments.

Previously, sougou wrote…

#1620


Reviewed 7 of 47 files at r1.
Review status: 7 of 47 files reviewed at latest revision, 3 unresolved discussions.


go/vt/tabletserver/splitquery/full_scan_algorithm.go, line 198 [r1] (raw file):

  result := make([]string, 0, len(splitColumns))
  for _, splitColumn := range splitColumns {
      result = append(result, prevBindVariablePrefix+splitColumn.Val())

Same comment here. Use the canonicalized (lowered) name for the splitColumn name.


go/vt/tabletserver/splitquery/split_params.go, line 210 [r1] (raw file):

// isStringSlicePrefix returns true if 'potentialPrefix' is a prefix of the slice
// 'slice'.
func isStringSlicePrefix(potentialPrefix []sqlparser.ColIdent, slice []cistring.CIString) bool {

Rename this method to isColIdentSlicePrefix since it uses ColIdent's now.


go/vt/tabletserver/splitquery/splitter.go, line 32 [r1] (raw file):

  for _, splitColumn := range splitter.splitParams.splitColumns {
      splitter.startBindVariableNames = append(
          splitter.startBindVariableNames, startBindVariablePrefix+splitColumn.Val())

It makes more sense to me to have the bind variable name use the lower-case version so it's not dependent on the user's query.
This should be fine with the user, since this bind variable is internal and the user should not depend on its name.


Comments from Reviewable

Approved with PullApprove

@sougou
Copy link
Contributor Author

sougou commented May 9, 2016

Review status: 4 of 47 files reviewed at latest revision, 3 unresolved discussions.


go/vt/tabletserver/splitquery/full_scan_algorithm.go, line 198 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

Same comment here. Use the canonicalized (lowered) name for the splitColumn name.


Done.


go/vt/tabletserver/splitquery/split_params.go, line 210 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

Rename this method to isColIdentSlicePrefix since it uses ColIdent's now.


Done.


go/vt/tabletserver/splitquery/splitter.go, line 32 [r1] (raw file):

Previously, erzel (Erez Louidor) wrote…

It makes more sense to me to have the bind variable name use the lower-case version so it's not dependent on the user's query.
This should be fine with the user, since this bind variable is internal and the user should not depend on its name.


Done.


Comments from Reviewable

@enisoc
Copy link
Member

enisoc commented May 9, 2016

Reviewed 40 of 47 files at r1.
Review status: 44 of 47 files reviewed at latest revision, 11 unresolved discussions.


go/cistring/cistring.go, line 15 [r1] (raw file):

// CIString is an immutable case-insensitive string.
type CIString struct {
  val, lowered string

Can you add to the doc comment an explanation of why you chose the "extra storage" implementation rather than "lowercase on compare"? Is it mainly for the CPU during comparisons? Or do you expect the conversion-on-compare to put extra load on the GC?


go/cistring/cistring.go, line 18 [r1] (raw file):

  // nocompare prevents this struct from being compared
  // with itself.
  nocompare func()

At lunch, we discussed the zero byte version of this, used here: https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/replication/replication.go#L31

Will that work here?


go/cistring/cistring.go, line 22 [r1] (raw file):

// NewCIString creates a new CIString.
func NewCIString(str string) CIString {

This should be called New() as in cistring.New(). Or if you foresee other case-insensitive types later, you could name the package ci and have ci.NewString().


go/cistring/cistring.go, line 34 [r1] (raw file):

// Val returns the case-preserved value of the string.
func (s CIString) Val() string {

The name Val() seems ambiguous. What about Original() and Lowered()?


go/cistring/cistring.go, line 46 [r1] (raw file):

// Equal returns true if the input is case-insensitive
// equal to the string. If the input is already lower-cased,

For cases where you want to lowercase once and compare in a loop, maybe you could recommend a pattern like this:

ciInput := ci.NewString(input)
for ... {
  if ciName.Equal(ciInput) {
    ...
  }
}

This has the benefit of avoiding manual calls to strings.ToLower() in places where you interoperate with CIS, which will help if the specific lowercasing behavior of CIS is ever changed (for locale fixes, for example).


go/cistring/cistring.go, line 48 [r1] (raw file):

// equal to the string. If the input is already lower-cased,
// it's more efficient to check if s.Lowered()==in.
func (s CIString) Equal(in string) bool {

Do you ever need to compare CIString to another CIString? Usually I would expect CIString.Equal() to have signature CIString.Equal(CIString), and then the version for comparing to something other than CIString would be suffixed like CIString.EqualString(string).


go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

// ColIdent is a case insensitive SQL identifier. It will be escaped with
// backquotes if it matches a keyword.
type ColIdent cistring.CIString

If you embed, you won't have to duplicate the CIString API (String(), Val(), Lowered(), etc.):

type ColIdent struct {
  cistring.CIString
}

Is there a reason you prefer wrapping over embedding?


test/vtgatev3_test.py, line 359 [r1] (raw file):

    # Test case sensitivity
    cursor.execute('select Id, Name from vt_user where iD = :id', {'id': 1})

Is there a test for giving the case used in schema if they do select *?


Comments from Reviewable

@sougou
Copy link
Contributor Author

sougou commented May 9, 2016

Review status: 23 of 47 files reviewed at latest revision, 11 unresolved discussions.


go/cistring/cistring.go, line 15 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Can you add to the doc comment an explanation of why you chose the "extra storage" implementation rather than "lowercase on compare"? Is it mainly for the CPU during comparisons? Or do you expect the conversion-on-compare to put extra load on the GC?


Done.


go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

At lunch, we discussed the zero byte version of this, used here: https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/replication/replication.go#L31

Will that work here?


It turns out that the zero-byte version (array) still consumes one word. I tested both options and the size of the struct was the same 40 bytes. So, I chose the simpler looking code.


go/cistring/cistring.go, line 22 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

This should be called New() as in cistring.New(). Or if you foresee other case-insensitive types later, you could name the package ci and have ci.NewString().


Done.


go/cistring/cistring.go, line 34 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

The name Val() seems ambiguous. What about Original() and Lowered()?


Done.


go/cistring/cistring.go, line 46 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

For cases where you want to lowercase once and compare in a loop, maybe you could recommend a pattern like this:

ciInput := ci.NewString(input)
for ... {
  if ciName.Equal(ciInput) {
    ...
  }
}

This has the benefit of avoiding manual calls to strings.ToLower() in places where you interoperate with CIS, which will help if the specific lowercasing behavior of CIS is ever changed (for locale fixes, for example).


I'm neutral both ways. This still doesn't address the situation where the input is already lowercased, but one always has the option of using Lowered: ciName.Lowered() == input. I'll leave those parts alone for now.


go/cistring/cistring.go, line 48 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Do you ever need to compare CIString to another CIString? Usually I would expect CIString.Equal() to have signature CIString.Equal(CIString), and then the version for comparing to something other than CIString would be suffixed like CIString.EqualString(string).


Done.


go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

If you embed, you won't have to duplicate the CIString API (String(), Val(), Lowered(), etc.):

type ColIdent struct {
  cistring.CIString
}

Is there a reason you prefer wrapping over embedding?


I've avoided embedding because it has generally led to confusion in the past. Also, the last time I tested, it didn't work for things like MarshalJSON. The wrapping also lets you freely cast the two types back and forth.


test/vtgatev3_test.py, line 359 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Is there a test for giving the case used in schema if they do select *?


Quite a few: data/test/tabletserver/exec_cases.txt. I changed the schema to contain mixed case column names.
I did the same thing for the v3 tests also (data/test/vtgate/*.txt).


Comments from Reviewable

@enisoc
Copy link
Member

enisoc commented May 9, 2016

LGTM for the parts outside splitquery.

Previously, erzel (Erez Louidor) wrote…

LGTM on splitquery after you address my comments.


Reviewed 21 of 22 files at r3.
Review status: 44 of 47 files reviewed at latest revision, 5 unresolved discussions.


go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, sougou wrote…

It turns out that the zero-byte version (array) still consumes one word. I tested both options and the size of the struct was the same 40 bytes. So, I chose the simpler looking code.


Ah they must have "fixed the glitch" that I was taking advantage of :) Your way is safer anyway, since they may someday decide that the compiler can compare two zero-length arrays of uncomparable things...


go/vt/sqlparser/ast.go, line 1746 [r1] (raw file):

Previously, sougou wrote…

I've avoided embedding because it has generally led to confusion in the past. Also, the last time I tested, it didn't work for things like MarshalJSON. The wrapping also lets you freely cast the two types back and forth.


ok


Comments from Reviewable

Approved with PullApprove

@enisoc
Copy link
Member

enisoc commented May 9, 2016

Review status: 44 of 47 files reviewed at latest revision, 4 unresolved discussions.


go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

Ah they must have "fixed the glitch" that I was taking advantage of :) Your way is safer anyway, since they may someday decide that the compiler can compare two zero-length arrays of uncomparable things...


It turns out, you can still make it take zero space, as long as it's not the last field in the struct:

golang/go@6f07ac2

I'll make a PR to fix the other use so it's not the last field.


Comments from Reviewable

Use the newly-sugested zero-size construct that prevents the
struct from comparing with itself. The previous construct used
a nil function pointer that consumed one word.
@sougou
Copy link
Contributor Author

sougou commented May 10, 2016

Review status: 42 of 47 files reviewed at latest revision, 4 unresolved discussions.


go/cistring/cistring.go, line 18 [r1] (raw file):

Previously, enisoc (Anthony Yeh) wrote…

It turns out, you can still make it take zero space, as long as it's not the last field in the struct:

golang/go@6f07ac2

I'll make a PR to fix the other use so it's not the last field.


Done.


Comments from Reviewable

@enisoc
Copy link
Member

enisoc commented May 10, 2016

Reviewed 2 of 2 files at r4.
Review status: 44 of 47 files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@sougou sougou merged commit c32dc3c into vitessio:master May 10, 2016
frouioui pushed a commit to planetscale/vitess that referenced this pull request Nov 21, 2023
Signed-off-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>
Co-authored-by: Vitess Cherry-Pick Bot <vitess-cherrypick-bot@planetscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants