Exact nature of case-insensitive match in schema compatibility #551
Specifically it says that specs shouldn't specify case-insensitive matching, but where they do they should specify "Unicode case-folding C+F" comparisons. I propose that we adopt that terminology. (The alternative is that we remove the case-insensitive matching.) |
The specific challenge for tabular data if you were to adopt Unicode C+F is that there are many existing tabular data files which use a legacy character encoding (that is, not a Unicode encoding) or which contain labels or values in a specific natural language. Making user-defined labels "case-insensitive" invites confusion for users of languages that are affected by language-specific case folding. The example given in our WG's document is of Turkish. This is why #i18n recommends that specs use case-sensitive matching, particularly when user-defined values are to be matched later. |
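To make the Turkish concern concrete, here is a minimal Python sketch. It assumes that Python's built-in `str.casefold()` is an acceptable stand-in for language-neutral Unicode case folding; the column title "TITLE" and the variable names are made up for illustration.

```python
# Illustrative only: the title "TITLE" and all variable names are hypothetical.
upper_title = "TITLE"

# A Turkish user lowercasing "TITLE" would expect a dotless i (U+0131).
turkish_lower = "t\u0131tle"

# Language-neutral folding maps "I" to a dotted "i" and leaves U+0131 alone.
folded_upper = upper_title.casefold()
folded_turkish = turkish_lower.casefold()

# The two spellings therefore do not compare equal after folding.
matches = folded_upper == folded_turkish

# The dotted capital I (U+0130) folds to "i" plus COMBINING DOT ABOVE (U+0307),
# so it does not match a plain "i" either.
dotted_folded = "\u0130".casefold()
```

So a label that looks like a case variant to a Turkish author can fail a "case-insensitive" comparison defined without language tailoring.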
@aphillips I am afraid we indeed do not have too much choice. First of all, @aphillips, the comparison does not occur on cell values, but only on column titles and names. In some sense, this is a much more controlled set of data, i.e., the source of errors is reduced (though still exists). On the other hand, while the values for Finally, titles are obviously not restricted to ASCII. Combining all these, I think that @JeniT's conclusion is the right one. As for the encoding: of course that is an issue. We do recommend authors to use UTF-8 (which is the default) in the CSV files. However, @JeniT, I do not think we require that, in the model, texts MUST be in UTF-8. Shouldn't we specify that? Created a new issue for this (#557) |
@aphillips, an admin question: alas!, the charmod-norm document is not yet a Rec, so we cannot normatively refer to it. Is "Unicode [UNICODE] Section 5.18" the right reference for "Unicode C+F"? I mean, is this a term from the Unicode document or from charmod-norm? |
@iherman I agree that If you choose to require case insensitivity, I would define it for @iherman That's correct, charmod-norm is being actively worked on, but I don't think you want to wait six months for us to publish something you can reference! Unicode section 5.18 does not, alas, define C+F directly. The letter forms are defined in the UCD, inside the file CaseFolding.txt here: ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt They are also mentioned in UAX#44 |
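For readers who want to see what a C+F style comparison looks like in practice, here is a hedged Python sketch. It assumes that `str.casefold()` applies the common and full mappings from CaseFolding.txt, which is my understanding of CPython's behaviour; the helper name `cf_match` is hypothetical and not part of any spec text.

```python
def cf_match(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


# Full folding expands the sharp s to "ss", so these two titles match.
folded_match = cf_match("Stra\u00dfe", "STRASSE")

# Simple lowercasing keeps the sharp s, so the same pair does not match.
lower_match = "Stra\u00dfe".lower() == "STRASSE".lower()
```

The difference between the two results is the reason a spec has to say which folding it means rather than just "case-insensitive".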
I let @JeniT or @gkellogg (the editors of that document) decide on the text to be added; I agree some health warning would be good.
@aphillips, thanks. Can you suggest a normative text how to refer to this exactly? Cheers Ivan Herman, W3C |
I'd like to discuss this today on the call. There's clear guidance here not to include case-insensitive matching, and I'm not sure that it buys us enough functionality to justify the additional implementation burden of supporting case folding. @aphillips should we (or should we not) be saying anything about Unicode normalization prior to case-sensitive comparisons? I think from http://w3c.github.io/charmod-norm/#h-formal-language that the recommendation is that we do not specify that the strings are Unicode normalized prior to comparison, but do add a note pointing out that this might mean that canonically equivalent but disjoint Unicode character sequences will not match. Have I got that right? |
We discussed on the call; @gkellogg, @danbri and I propose that in 5.5.1 Schema Compatibility, when not validating, columns should simply match on index without consideration of name or titles - so long as the number of columns matches then you can proceed. This would mean no more case-insensitive matching. @iherman, @6a6d74 what do you think? |
Exactly right. I18N has been recommending that specifications not require Unicode normalization in most cases, creating (or, rather, preserving) the potential for visually indistinguishable mismatches. Include a health warning as a note saying exactly what your comment says, pointing to Charmod-Norm or to Unicode Standard Annex #15 (UAX15) as a reference. We also recommend specs use case-sensitive matching unless they have an operational reason not to. That's more of a judgement call. Case-sensitivity is certainly the easiest to implement. |
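The kind of mismatch such a health warning would describe can be shown with a short Python sketch. The sample string "café" is made up for illustration; `unicodedata.normalize` is the standard-library function for the normalization forms defined in UAX15.

```python
import unicodedata

# Two canonically equivalent spellings of the same visible title.
composed = "caf\u00e9"     # precomposed e with acute (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# A plain case-sensitive comparison, with no normalization, does not match.
raw_equal = composed == decomposed

# After normalizing both sides to NFC the strings are identical.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

A processor that follows the recommendation above would behave like `raw_equal`, which is why the note to authors matters.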
I am not sure I understand. I mean: it is fine to match on indices (of non-virtual columns) but what happens if we end up with two names that are not equal? Is this an error? What should a non-validating processor, e.g., a converter, do? |
It's an error when validating, but this would be ignored when not validating. Of course, this can be avoided if the metadata includes A non-validating processor will ensure that the number of non-virtual columns match the actual input columns, and perform compatibility check for intersecting |
But what would then happen? The name plays a role in the conversion, so we must say what a non-validating tool should do in this case.
But if there are more, which? I seem to miss something essential here.
|
As we don't merge, there is no What was taken out is a compatibility criterion in the case that there is no The case is not unlike that where the dialect says that there is no header row, and the first row is skipped. From a processing perspective, the compatibility check would succeed, and processing would continue as before. Note that we don't use the title from the CSV except for the compatibility check, so there should be no difference in downstream processing. |
Hm. Maybe I should wait for the new text of 5.5.1., because I do not seem to understand what you mean. The text in Jeni's comment simply says (for non validating processors)
But then, for example, what would happen if the two metadata are:
and
Surely these two columns should not coincide, right? So there should be some consideration of names or titles... Bottom line: can one of you give a new version of section 5.5.1? Because I am obviously lost here. |
…es to not need to match titles. Fixes #551.
@iherman, that case you cite can't really happen, because that would be a merge case. As described now, if one has name but not titles and the other has titles but not name, the columns are assumed to match. If they both had either name or titles, those would be used in a case-sensitive match. Note the test case "test124" to check this. |
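A rough Python sketch of the non-validating rule as described in this comment may help. It is not the text of section 5.5.1: the `Column` class and both function names are hypothetical, and two details are assumptions of this sketch, namely that names are compared before titles when both are present, and that titles match when the two title lists share at least one exact string.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Column:
    """Hypothetical stand-in for a non-virtual column description."""
    name: Optional[str] = None
    titles: Tuple[str, ...] = ()


def columns_compatible(a: Column, b: Column) -> bool:
    """Non-validating check for one pair of columns at the same index."""
    if a.name is not None and b.name is not None:
        # Both sides have a name: plain case-sensitive comparison.
        return a.name == b.name
    if a.titles and b.titles:
        # Both sides have titles: assumed here to need one exact shared title.
        return bool(set(a.titles) & set(b.titles))
    # Nothing comparable (e.g. name only vs. titles only): assumed to match.
    return True


def schemas_compatible(left: Sequence[Column], right: Sequence[Column]) -> bool:
    """Columns are paired by index, so the counts must agree first."""
    if len(left) != len(right):
        return False
    return all(columns_compatible(a, b) for a, b in zip(left, right))
```

Under these assumptions, `schemas_compatible([Column(name="a")], [Column(titles=("A",))])` succeeds because there is nothing to compare, while `Column(name="a")` against `Column(name="A")` fails the case-sensitive name check.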
I am just reading the (draft) Character Model for the World Wide Web: String Matching and Searching document, and hit the section on string identity matching. This section specifies what it means to have "case-insensitive matching" and contains advice on how this should be used in specifications. We may want to look at the section on schema compatibility where we also specify case-insensitive matching and refine that section. (Or ask for help from the I18N group, specifically @aphillips)