Exact nature of case-insensitive match in schema compatibility #551

Closed
iherman opened this issue May 13, 2015 · 16 comments · Fixed by #564

@iherman
Member

iherman commented May 13, 2015

I am just reading the (draft) Character Model for the World Wide Web: String Matching and Searching document, and hit the section on string identity matching. This section specifies what "case-insensitive matching" means and contains advice on how it should be used in specifications. We may want to look at the section on schema compatibility, where we also specify case-insensitive matching, and refine that section. (Or ask for help from the I18N group, specifically @aphillips.)

@JeniT

JeniT commented May 14, 2015

Specifically it says that specs shouldn't specify case-insensitive matching, but where they do they should specify "Unicode case-folding C+F" comparisons. I propose that we adopt that terminology. (The alternative is that we remove the case-insensitive matching.)

@aphillips

The specific challenge for tabular data if you were to adopt Unicode C+F is that there are many existing tabular data files which use a legacy character encoding (that is, not a Unicode encoding) or which contain labels or values in a specific natural language. Making user-defined labels "case-insensitive" invites confusion for users of languages that are affected by language-specific case folding. The example given in our WG's document is of Turkish.

This is why #i18n recommends that specs use case-sensitive matching, particularly when user-defined values are to be matched later.
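
To make the Turkish concern concrete, here is a minimal sketch in Python. It assumes that `str.casefold()` applies language-independent full case folding (which is my understanding of what "C+F" refers to); the helper name `fold_equal` and the sample strings are purely illustrative and do not come from the spec.

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two strings after language-independent case folding."""
    return a.casefold() == b.casefold()


# ASCII titles behave the way most people expect.
print(fold_equal("Country Code", "COUNTRY CODE"))  # True

# Full folding expands sharp s to "ss", so these compare equal too.
print(fold_equal("Straße", "STRASSE"))  # True

# Under Turkish casing rules, "I" is the capital of dotless "ı".
# Language-independent folding maps "I" to dotted "i" instead, so a
# Turkish author who expects these two to be case variants gets a mismatch.
print(fold_equal("ILIK", "ılık"))  # False
```

The last comparison is the kind of surprise a "case-insensitive" rule can produce for user-defined labels in a specific natural language.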

@iherman
Member Author

iherman commented May 15, 2015

@aphillips I am afraid we indeed do not have too much choice.

First of all, @aphillips, the comparison does not occur on cell values, but only on column titles and names. In some sense, this is a much more controlled set of data, i.e., the source of errors is reduced (though still exists).

On the other hand, while the values for title are genuinely natural language texts (with a natural usage of upper or lower case), name is more akin to program variables, where there is a natural tendency to use lower-case (or camel-case :-( ) terms. In other words, case-sensitive comparison of those probably goes against general usage.

Finally, titles are obviously not restricted to ASCII. Combining all these, I think that @JeniT's conclusion is the right one.

As for the encoding: of course that is an issue. We do recommend that authors use UTF-8 (which is the default) in the CSV files. However, @JeniT, I do not think we require that, in the model, texts MUST be in UTF-8. Shouldn't we specify that? Created a new issue for this (#557).

@iherman
Member Author

iherman commented May 15, 2015

@aphillips, an admin question: alas!, the charmod-norm document is not yet a Rec, so we cannot normatively refer to it. Is "Unicode [UNICODE] Section 5.18" the right reference for "Unicode C+F"? I mean, is this a term from the Unicode document or from charmod-norm?

@aphillips

@iherman I agree that name items are more likely to be expressed in an "identifier-like" way, but not that this removes the casing issue. Although a general convention helps the end-users avoid pitfalls, you still have to normatively specify what to do for those that ignore or cannot follow the convention.

If you choose to require case insensitivity, I would define it for name (at least) based on Unicode C+F and ensure that there is a health warning in the text.

@iherman That's correct, charmod-norm is being actively worked on, but I don't think you want to wait six months for us to publish something you can reference! Unicode section 5.18 does not, alas, define C+F directly. The letter forms are defined in the UCD, inside the file CaseFolding.txt here: ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

They are also mentioned in UAX#44
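
As a rough illustration of what "C+F" selects from that file, the sketch below parses a small excerpt written in the CaseFolding.txt line format (`code; status; mapping; # name`) and keeps only the entries with status C (common) or F (full). The function names `load_folding` and `fold` are hypothetical, and the excerpt is a handful of lines rather than the full data file.

```python
# A few lines in the CaseFolding.txt format: code; status; mapping; # name
EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_folding(text: str, statuses=("C", "F")) -> dict:
    """Build a char -> folded-string table from the selected statuses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(s: str, table: dict) -> str:
    """Fold a string character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


table = load_folding(EXCERPT)
print(fold("Aß", table) == fold("ASS".replace("S", "s"), table))
```

Choosing C+F skips the S (simple) and T (Turkic) lines, which is why the Turkic mapping of "I" to dotless "ı" is not applied.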

@iherman
Member Author

iherman commented May 16, 2015

On 15 May 2015, at 22:40 , aphillips notifications@github.com wrote:

@iherman I agree that name items are more likely to be expressed in an "identifier-like" way, but not that this removes the casing issue. Although a general convention helps the end-users avoid pitfalls, you still have to normatively specify what to do for those that ignore or cannot follow the convention.

If you choose to require case insensitivity, I would define it for name (at least) based on Unicode C+F and ensure that there is a health warning in the text.

I will let @JeniT or @gkellogg (the editors of that document) decide on the text to be added; I agree some health warning would be good.

@iherman That's correct, charmod-norm is being actively worked on, but I don't think you want to wait six months for us to publish something you can reference! Unicode section 5.18 does not, alas, define C+F directly. The letter forms are defined in the UCD, inside the file CaseFolding.txt here: ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

They are also mentioned in UAX#44

@aphillips, thanks. Can you suggest normative text for referring to this exactly?

Cheers



@JeniT

JeniT commented May 20, 2015

I'd like to discuss this today on the call. There's clear guidance here not to include case-insensitive matching, and I'm not sure that it buys us enough functionality to justify the additional implementation burden of supporting case folding.

@aphillips should we (or should we not) be saying anything about Unicode normalization prior to case-sensitive comparisons? I think from http://w3c.github.io/charmod-norm/#h-formal-language that the recommendation is that we do not specify that the strings are Unicode normalized prior to comparison, but do add a note pointing out that this might mean that canonically equivalent but disjoint Unicode character sequences will not match. Have I got that right?

@JeniT

JeniT commented May 20, 2015

We discussed on the call; @gkellogg, @danbri and I propose that in 5.5.1 Schema Compatibility, when not validating, columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed. This would mean no more case-insensitive matching. @iherman, @6a6d74 what do you think?

@aphillips

@JeniT

Have I got that right?

Exactly right. I18N has been recommending that specifications not require Unicode normalization in most cases, creating (or, rather, preserving) the potential for visually indistinguishable mismatches. I would include a health warning as a note saying exactly what your comment says, with Charmod-Norm or Unicode Standard Annex #15 (UAX15) as a reference.

We also recommend specs use case-sensitive matching unless they have an operational reason not to. That's more of a judgement call. Case-sensitivity is certainly the easiest to implement.
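
A small Python sketch of the mismatch such a note would warn about. The sample strings are made up for illustration; `unicodedata.normalize` is from the standard library.

```python
import unicodedata

# The same visible title, encoded two ways.
composed = "caf\u00e9"     # "é" as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# A plain case-sensitive comparison looks at code points only, so the
# canonically equivalent strings do not match.
print(composed == decomposed)  # False

# They match only if both sides are normalized first, which is the step
# the spec would deliberately not require.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True
```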

@iherman
Member Author

iherman commented May 20, 2015

On 20 May 2015, at 16:55 , Jeni Tennison notifications@github.com wrote:

We discussed on the call; @gkellogg, @danbri and I propose that in 5.5.1 Schema Compatibility, when not validating, columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed. This would mean no more case-insensitive matching. @iherman, @6a6d74 what do you think?

I am not sure I understand. I mean: it is fine to match on indices (of non-virtual columns), but what happens if we end up with two names that are not equal? Is this an error? What should a non-validating processor, e.g., a converter, do?

@gkellogg
Member

It's an error when validating, but this would be ignored when not validating. Of course, this can be avoided if the metadata includes titles, in which case it's a positive match. A Column having only name properties would use those names for reference, but not to actually match against the tabular data.

A non-validating processor will ensure that the number of non-virtual columns matches the actual input columns, and perform a compatibility check for intersecting titles, but otherwise continue processing.

@iherman
Member Author

iherman commented May 21, 2015

On 20 May 2015, at 21:47 , Gregg Kellogg notifications@github.com wrote:

It's an error when validating, but this would be ignored when not validating.

But what would then happen? The name plays a role in the conversion, so we must say what a non-validating tool should do in this case.

Of course, this can be avoided if the metadata includes titles, in which case it's a positive match. A Column having only name properties would use those names for reference, but not to actually match against the tabular data.

But if there are more, which?

I seem to miss something essential here.

A non-validating processor will ensure that the number of non-virtual columns match the actual input columns, and perform compatibility check for intersecting titles, but otherwise continue processing.

@gkellogg
Member

As we don't merge, there is no titles annotation on the column. There may be a name annotation if it is defined in the metadata; otherwise the conversion documents have a provision for a default name. We already have cases where there is no title. The annotations come entirely from the metadata document, and the conversion algorithms operate off of annotations.

What was taken out is a compatibility criterion for the case where there is no titles metadata, so without it, columns are presumed to be compatible. The only way to be incompatible (in the absence of titles) is for the column counts not to match up. Since all conversions operate off of model annotations, they should continue to operate as expected.

The case is not unlike that where the dialect says that there is no header row, and the first row is skipped.

From a processing perspective, the compatibility check would succeed, and processing would continue as before. Note that we don't use the title from the CSV except for the compatibility check, so there should be no difference in downstream processing.

@iherman
Member Author

iherman commented May 21, 2015

Hm. Maybe I should wait for the new text of 5.5.1., because I do not seem to understand what you mean. The text in Jeni's comment simply says (for non validating processors)

columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed

But then, for example, what would happen if the two metadata are:

  "columns": [{
    "name": "countryCode"
  }, {

and

  "columns": [{
    "name": "somethingElse"
  }, {

Surely these two columns should not coincide, right? So there should be some consideration of names or titles...

Bottom line: can one of you give a new version of section 5.5.1? Because I am obviously lost here.

gkellogg self-assigned this May 21, 2015
gkellogg added a commit that referenced this issue May 21, 2015
@gkellogg
Member

@iherman, that case you cite can't really happen, because that would be a merge case. As described now, if one has name but not titles and the other has titles but not name, the columns are assumed to match. If they both had either name or titles, those would be used in a case-sensitive match. Note the test case "test124" to check this.
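
The rule as described in this thread can be sketched roughly as follows. This is not the text of section 5.5.1 or of #564: the function names are hypothetical, titles are simplified to plain lists of strings (no language tags), and treating a match on either name or titles as sufficient when both are present is my own assumption.

```python
def columns_compatible(a: dict, b: dict) -> bool:
    """Case-sensitive check for two column descriptions at the same index."""
    comparable = False

    # Both sides carry a name: compare them exactly.
    if "name" in a and "name" in b:
        if a["name"] == b["name"]:
            return True
        comparable = True

    # Both sides carry titles: they must share at least one exact title.
    a_titles = set(a.get("titles", []))
    b_titles = set(b.get("titles", []))
    if a_titles and b_titles:
        if a_titles & b_titles:
            return True
        comparable = True

    # Nothing to compare (e.g. name on one side, titles on the other):
    # the columns are assumed to match.
    return not comparable


def schemas_compatible(a_cols: list, b_cols: list) -> bool:
    """Match non-virtual columns by index; the counts must agree."""
    a_real = [c for c in a_cols if not c.get("virtual", False)]
    b_real = [c for c in b_cols if not c.get("virtual", False)]
    if len(a_real) != len(b_real):
        return False
    return all(columns_compatible(x, y) for x, y in zip(a_real, b_real))


# Two differing names are comparable and do not match.
print(schemas_compatible([{"name": "countryCode"}], [{"name": "somethingElse"}]))
# Name on one side, titles on the other: assumed to match.
print(schemas_compatible([{"name": "countryCode"}], [{"titles": ["Country Code"]}]))
```

Under this reading, the two-name example above is rejected because both sides have a name and the names differ, while a name-only column against a titles-only column passes.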

@iherman
Member Author

iherman commented May 22, 2015

O.k., next time we should discuss real texts:-) Now that I see #564, I am fine with it. I will merge #564 and close this issue. Phew:-)
