Exact nature of case-insensitive match in schema compatibility #551

iherman · 2015-05-13T13:47:58Z

I am just reading the (draft) Character Model for the World Wide Web: String Matching and Searching document, and hit the section on string identity matching. This section specifies what it means to have "case-insensitive matching" and contains advice on how this should be used in specifications. We may want to look at the section on schema compatibility, where we also specify case-insensitive matching, and refine that section. (Or ask for help from the I18N group, specifically @aphillips.)

JeniT · 2015-05-14T19:38:24Z

Specifically it says that specs shouldn't specify case-insensitive matching, but where they do they should specify "Unicode case-folding C+F" comparisons. I propose that we adopt that terminology. (The alternative is that we remove the case-insensitive matching.)

aphillips · 2015-05-14T19:54:21Z

The specific challenge for tabular data if you were to adopt Unicode C+F is that there are many existing tabular data files which use a legacy character encoding (that is, not a Unicode encoding) or which contain labels or values in a specific natural language. Making user-defined labels "case-insensitive" invites confusion for users of languages that are affected by language-specific case folding. The example given in our WG's document is of Turkish.

This is why #i18n recommends that specs use case-sensitive matching, particularly when user-defined values are to be matched later.
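
To make the Turkish pitfall concrete, here is a minimal sketch in Python. It assumes that Python's built-in `str.casefold()` is an acceptable stand-in for language-neutral Unicode full case folding (C+F); the helper name `fold_equal` and the sample labels are hypothetical and do not come from any spec text.

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two labels after language-neutral Unicode full case folding."""
    return a.casefold() == b.casefold()


# ASCII labels behave the way most users expect.
print(fold_equal("CountryCode", "countrycode"))

# Full folding expands U+00DF (sharp s) to "ss", so these two labels match.
print(fold_equal("Stra\u00dfe", "STRASSE"))

# Turkish: the upper case of dotless U+0131 is "I", but language-neutral
# folding maps "I" to dotted "i", so a Turkish user's labels do not match.
print(fold_equal("DI\u015e", "d\u0131\u015f"))

# U+0130 (capital I with dot above) folds to "i" + U+0307, not to plain "i".
print(fold_equal("\u0130stanbul", "istanbul"))
```

The first two comparisons succeed and the last two fail, which is the kind of surprise a health warning in the spec would have to cover.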

iherman · 2015-05-15T07:45:20Z

@aphillips I am afraid we indeed do not have too much choice.

First of all, @aphillips, the comparison does not occur on cell values, but only on column titles and names. In some sense, this is a much more controlled set of data, i.e., the source of errors is reduced (though still exists).

On the other hand, while the values for title are genuinely natural language texts (with a natural usage of upper or lower case), name is more akin to program variables, where there is a more natural tendency to use lower-case (or camel-case :-( ) terms. In other words, case-sensitive comparison of those probably goes against general usage.

Finally, titles are obviously not restricted to ASCII. Combining all these, I think that @JeniT's conclusion is the right one.

As for the encoding: of course that is an issue. We do recommend that authors use UTF-8 (which is the default) in the CSV files. However, @JeniT, I do not think we require that, in the model, texts MUST be in UTF-8. Shouldn't we specify that? I created a new issue for this (#557).

iherman · 2015-05-15T07:50:34Z

@aphillips, an admin question: alas!, the charmod-norm document is not yet a Rec, so we cannot normatively refer to it. Is "Unicode [UNICODE] Section 5.18" the right reference for "Unicode C+F"? I mean, is this a term from the Unicode document or from charmod-norm?

aphillips · 2015-05-15T20:40:16Z

@iherman I agree that name items are more likely to be expressed in an "identifier-like" way, but not that this removes the casing issue. Although a general convention helps the end-users avoid pitfalls, you still have to normatively specify what to do for those that ignore or cannot follow the convention.

If you choose to require case insensitivity, I would define it for name (at least) based on Unicode C+F and ensure that there is a health warning in the text.

@iherman That's correct, charmod-norm is being actively worked on, but I don't think you want to wait six months for us to publish something you can reference! Unicode section 5.18 does not, alas, define C+F directly. The letter forms are defined in the UCD, inside the file CaseFolding.txt here: ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

They are also mentioned in UAX#44
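
As a rough illustration of what "C+F" selects from CaseFolding.txt, the sketch below parses a few lines in that file's `code; status; mapping; # name` layout and keeps only the entries with status C (common) or F (full). The embedded sample is a small hand-picked excerpt rather than the whole file, and the helper names `load_folding` and `fold` are hypothetical.

```python
# A hand-picked excerpt in the CaseFolding.txt layout (not the full file).
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_folding(text: str, statuses: tuple = ("C", "F")) -> dict:
    """Build a char -> folded-string table from the selected status classes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(s: str, table: dict) -> str:
    """Fold a string character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


CF = load_folding(SAMPLE)            # C+F: the language-neutral full folding
print(fold("AI\u00df", CF))          # folds using only the sample entries
```

The T (Turkic) and S (simple) lines are skipped under C+F, which is why "I" folds to "i" here regardless of language.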

iherman · 2015-05-16T07:29:54Z

On 15 May 2015, at 22:40, aphillips notifications@github.com wrote:

> @iherman I agree that name items are more likely to be expressed in an "identifier-like" way, but not that this removes the casing issue. Although a general convention helps the end-users avoid pitfalls, you still have to normatively specify what to do for those that ignore or cannot follow the convention.
>
> If you choose to require case insensitivity, I would define it for name (at least) based on Unicode C+F and ensure that there is a health warning in the text.

I will let @JeniT or @gkellogg (the editors of that document) decide on the text to be added; I agree some health warning would be good.

> @iherman That's correct, charmod-norm is being actively worked on, but I don't think you want to wait six months for us to publish something you can reference! Unicode section 5.18 does not, alas, define C+F directly. The letter forms are defined in the UCD, inside the file CaseFolding.txt here: ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
>
> They are also mentioned in UAX#44

@aphillips, thanks. Can you suggest normative text for how to refer to this exactly?

JeniT · 2015-05-20T12:00:34Z

I'd like to discuss this today on the call. There's clear guidance here not to include case-insensitive matching, and I'm not sure that it buys us enough functionality to justify the additional implementation burden of supporting case folding.

@aphillips should we (or should we not) be saying anything about Unicode normalization prior to case-sensitive comparisons? I think from http://w3c.github.io/charmod-norm/#h-formal-language that the recommendation is that we do not specify that the strings are Unicode normalized prior to comparison, but do add a note pointing out that this might mean that canonically equivalent but disjoint Unicode character sequences will not match. Have I got that right?

JeniT · 2015-05-20T14:55:29Z

We discussed on the call; @gkellogg, @danbri and I propose that in 5.5.1 Schema Compatibility, when not validating, columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed. This would mean no more case-insensitive matching. @iherman, @6a6d74 what do you think?

aphillips · 2015-05-20T15:08:32Z

@JeniT

> Have I got that right?

Exactly right. I18N has been recommending that specifications not require Unicode normalization in most cases, creating (or, rather, preserving) the potential for visually indistinguishable mismatches. Include a health warning as a note saying exactly what your comment says, pointing to Charmod-Norm or to Unicode Standard Annex 15 (UAX15) as a reference.

We also recommend specs use case-sensitive matching unless they have an operational reason not to. That's more of a judgement call. Case-sensitivity is certainly the easiest to implement.
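
The "visually indistinguishable mismatch" can be reproduced in a few lines. This is only a sketch using Python's standard `unicodedata` module; the sample string and the helper name `nfc_equal` are made up for illustration, and normalizing before comparison is shown for contrast, not as something the spec would require.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain code-point comparison: canonically equivalent, but not a match.
print(composed == decomposed)

# After normalization the two sequences compare equal.
print(nfc_equal(composed, decomposed))
```

A spec that compares without normalizing gets the first behavior, which is what the suggested note would warn readers about.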

iherman · 2015-05-20T18:05:17Z

On 20 May 2015, at 16:55, Jeni Tennison notifications@github.com wrote:

> We discussed on the call; @gkellogg, @danbri and I propose that in 5.5.1 Schema Compatibility, when not validating, columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed. This would mean no more case-insensitive matching. @iherman, @6a6d74 what do you think?

I am not sure I understand. I mean: it is fine to match on indices (of non-virtual columns), but what happens if we end up with two names that are not equal? Is this an error? What should a non-validating processor, e.g., a converter, do?

gkellogg · 2015-05-20T19:47:04Z

It's an error when validating, but this would be ignored when not validating. Of course, this can be avoided if the metadata includes titles, in which case it's a positive match. A Column having only name properties would use those names for reference, but not to actually match against the tabular data.

A non-validating processor will ensure that the number of non-virtual columns matches the number of actual input columns, and perform a compatibility check for intersecting titles, but otherwise continue processing.

iherman · 2015-05-21T04:41:06Z

On 20 May 2015, at 21:47, Gregg Kellogg notifications@github.com wrote:

> It's an error when validating, but this would be ignored when not validating.

But what would then happen? The name plays a role in the conversion, so we must say what a non-validating tool should do in this case.

> Of course, this can be avoided if the metadata includes titles, in which case it's a positive match. A Column having only name properties would use those names for reference, but not to actually match against the tabular data.

But if there are more, which?

I seem to miss something essential here.

> A non-validating processor will ensure that the number of non-virtual columns match the actual input columns, and perform compatibility check for intersecting titles, but otherwise continue processing.

gkellogg · 2015-05-21T06:09:45Z

As we don't merge, there is no titles annotation on the column. There may be a name annotation, if it is defined in the metadata; otherwise the conversion documents have a provision to default name. We already have cases when there is no title. The annotations come entirely from the metadata document, and the conversion algorithms operate off of annotations.

What was taken out is a compatibility criterion for the case where there is no titles metadata, so without this, columns are presumed to be compatible. The only way to be incompatible (in the absence of titles) is for the column counts to not match up. Since all conversions operate off of model annotations, they should continue to operate as expected.

The case is not unlike that where the dialect says that there is no header row, and the first row is skipped.

From a processing perspective, the compatibility check would succeed, and processing would continue as before. Note that we don't use the title from the CSV except for the compatibility check, so there should be no difference in downstream processing.

iherman · 2015-05-21T09:29:00Z

Hm. Maybe I should wait for the new text of 5.5.1, because I do not seem to understand what you mean. The text in Jeni's comment simply says (for non-validating processors):

> columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed

But then, for example, what would happen if the two metadata documents contain:

  "columns": [{
    "name": "countryCode"
  }, {

and

  "columns": [{
    "name": "somethingElse"
  }, {

Surely these two columns should not coincide, right? So there should be some consideration of names or titles...

Bottom line: can one of you give a new version of section 5.5.1? Because I am obviously lost here.

gkellogg · 2015-05-21T23:08:58Z

@iherman, that case you cite can't really happen, because that would be a merge case. As described now, if one has name but not titles and the other has titles but not name, the columns are assumed to match. If they both had either name or titles, those would be used in a case-sensitive match. Note the test case "test124" to check this.
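
One possible reading of the relaxed, non-validating check described in this thread is sketched below in Python. It is not the spec text or any implementation's code: the function names are hypothetical, titles are simplified to a flat list of strings, and checking titles before names when both are present is an assumption, since the thread does not say which takes precedence.

```python
from typing import Any

Column = dict[str, Any]  # hypothetical, simplified column description


def column_pair_compatible(a: Column, b: Column) -> bool:
    """Non-validating check for two columns at the same index."""
    titles_a, titles_b = set(a.get("titles", [])), set(b.get("titles", []))
    if titles_a and titles_b:
        # Both have titles: they must intersect, compared case-sensitively.
        return bool(titles_a & titles_b)
    name_a, name_b = a.get("name"), b.get("name")
    if name_a is not None and name_b is not None:
        # Both have a name: plain case-sensitive string equality.
        return name_a == name_b
    # Nothing comparable (e.g. name on one side, titles on the other):
    # the columns are assumed to match.
    return True


def schemas_compatible(cols_a: list, cols_b: list) -> bool:
    """Match non-virtual columns by index; the counts must agree."""
    a = [c for c in cols_a if not c.get("virtual", False)]
    b = [c for c in cols_b if not c.get("virtual", False)]
    return len(a) == len(b) and all(
        column_pair_compatible(x, y) for x, y in zip(a, b)
    )


# name on one side and titles on the other: assumed compatible.
print(schemas_compatible([{"name": "countryCode"}],
                         [{"titles": ["Country Code"]}]))
# Titles differing only in case: no match under case-sensitive comparison.
print(schemas_compatible([{"titles": ["Country"]}],
                         [{"titles": ["country"]}]))
```

Under this reading no case folding or normalization is needed anywhere in the check.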

iherman · 2015-05-22T05:12:49Z

O.k., next time we should discuss real texts:-) Now that I see #564, I am fine with that. I will merge #564 and close this issue. Pfew:-)

iherman added Metadata vocabulary document Requires telcon discussion/decision For LCCR LC official review: I18N labels May 13, 2015

iherman mentioned this issue May 15, 2015

Specify (in the model) the encoding #557

Closed

gkellogg self-assigned this May 21, 2015

gkellogg added a commit that referenced this issue May 21, 2015

When not validating, relax compatibility check on columns so that names to not need to match titles. Fixes #551.

5284dad

gkellogg mentioned this issue May 21, 2015

When not validating, relax compatibility check on columns so that nam… #564

Merged

gkellogg removed their assignment May 21, 2015

gkellogg removed the Requires telcon discussion/decision label May 21, 2015

iherman closed this as completed May 22, 2015

This was referenced Jun 3, 2015

What are the rules for string equality when column names are matched with annotations #576

Closed

What are the rules for string equality when column names are matched with annotations #578

Closed
