Limiting the length of strings #1516
Comments
Those terms are borrowed from JSON schema (see https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3)
The TD document should make that clear(er).
You should reopen this issue. The term "character" is rather non-specific, and its meaning here depends critically on whether you mean UTF-16 code units or Unicode code points. If one carefully reads JSON Schema and RFC8259, then it means Unicode code points. But this is potentially different from what JavaScript's String.length counts.
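The gap between those two interpretations is easy to demonstrate. A small sketch in plain JavaScript (function names are illustrative, not from any library) comparing UTF-16 code units, which JavaScript's `String.length` reports, with Unicode code points, which JSON Schema counts:

```javascript
// UTF-16 code units: what JavaScript's String.length reports.
function utf16Length(s) {
  return s.length;
}

// Unicode code points: what JSON Schema's maxLength/minLength count.
// Spreading a string iterates by code point, so surrogate pairs
// are kept together.
function codePointLength(s) {
  return [...s].length;
}

const emoji = '\u{1F600}'; // U+1F600 GRINNING FACE, outside the BMP
console.log(utf16Length(emoji));     // 2 (a surrogate pair)
console.log(codePointLength(emoji)); // 1
```

For strings containing only BMP characters the two counts agree; they diverge exactly when surrogate pairs appear.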
I reopened the issue (took some time since I was on vacation). The part in question is taken from JSON Schema (see for example maxLength), which points to the JSON format, which speaks about "Unicode characters". Do you think using the term "Unicode characters" is clear enough, or would you argue "Unicode code points" is the right term? Note: we don't consider JavaScript; instead we would like to be compliant with JSON Schema. Hence any potential difference with JavaScript is okay for us... I think.
@danielpeintner Thanks! It's much easier on the reader when specs use the Unicode jargon explicitly. Casual readers who think they know what the ambiguous term "character" means can easily be misled, while RFC8259's use of the term is easy to overlook. So I would suggest changing the term, and I would consider adding a note of reminder that code points != characters. It's not so much the "difference with JavaScript" as it is reminding folks that "length" can be measured in more than one unit.
@aphillips thanks for your detailed information. I created #1568, which explicitly states "Unicode code points" and gives an example that I found quite useful (see original source). I hope this makes it clear now and resolves the issue.
I'll copy my comment on the PR here for posterity. Alas, the I18N community is a bit fastidious/fussy about our jargon, particularly when it relates to the ambiguity of the term character. Hopefully my comments can help resolve the problems here.

Some Unicode code points do not represent characters--because they are surrogate code points, unassigned code points, or non-character code points. But all Unicode code points that are assigned represent "characters" in Unicode. The example you inserted is trying to describe what Unicode jargon refers to as a grapheme or (more formally) a grapheme cluster.

The description here is only part of the problem when dealing with length limitations. The I18N WG maintains guidelines and help for spec authors, and there is a specific section (which I should have pointed to before 😞) describing the terminology and requirements. See here. What I would suggest is that you word this paragraph thusly:
Notice that RFC8259 discusses the potential for partial truncation in section 8.2 and that JSON Schema section 6.3.1 says nothing about grapheme boundaries--only character (code point) boundaries. While truncating text arbitrarily mid-string can definitely change its meaning, I think your specification is safer if you do not break with these requirements. A stronger health warning in your specification, however, seems like overkill.
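To make the truncation point concrete, here is a sketch of code-point-safe truncation in plain JavaScript (`truncateCodePoints` is a name invented for this example):

```javascript
// Truncate a string to at most `max` Unicode code points, never splitting
// a surrogate pair (which would produce invalid text).
function truncateCodePoints(s, max) {
  const codePoints = [...s]; // spread iterates by code point
  if (codePoints.length <= max) return s;
  return codePoints.slice(0, max).join('');
}

// Naive UTF-16 slicing can cut a surrogate pair in half:
const s = 'a\u{1F600}b'; // 3 code points, 4 UTF-16 code units
console.log(truncateCodePoints(s, 2)); // 'a' + intact emoji
console.log(s.slice(0, 2));            // 'a' + a lone high surrogate
```

Note that, as the comment above says, even code-point-safe truncation can still split a grapheme cluster (e.g. separate a base letter from its combining accent), which is why truncating mid-string can change meaning.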
Thank you very much. I updated #1568 accordingly. |
Looks good to me. Thank you!! |
I sweated a lot about this topic. Not wanting to open a can of worms, but I believe this topic still deserves more attention. If we consider what string length constraints in a schema are actually used for, I can think of these use cases:
I am very happy that JSON Schema is clarifying the meaning of these length keywords. I think it would be extremely helpful for schema authors to be able to better specify their intent. One way to do that could be to specify something like:
The problem with the current approach is that length constraints based on code points cannot be used to express storage size or graphemes. In a way, the current confusion about what is being counted reflects exactly that ambiguity.
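The storage-size point can be made concrete: strings that each count as a single code point occupy different numbers of bytes in UTF-8. A small sketch using the standard TextEncoder API:

```javascript
// UTF-8 storage size in bytes -- what a database column limit usually means.
function utf8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

console.log(utf8ByteLength('a'));         // 1 byte,  1 code point
console.log(utf8ByteLength('\u00E9'));    // 2 bytes, 1 code point (é)
console.log(utf8ByteLength('\u{1F600}')); // 4 bytes, 1 code point (😀)
```

So a maxLength of 1 (code points) admits anywhere from 1 to 4 bytes in UTF-8; a byte budget cannot be expressed with a code point count alone.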
@mutech I think that your points are valid; however, we do not want to break interoperability with JSON Schema. However, we can define additional keywords that allow for more precise length requirements. What do you think?
I guess that's the only meaningful solution. The root of the problem is that, with Unicode, the characters users see on the screen diverged from the characters platforms handle in storage or memory, and support for Unicode features (such as counting graphemes) is far from universal.
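Where the runtime does support grapheme counting, it looks like the sketch below, which assumes an engine providing `Intl.Segmenter` (e.g. Node.js 16+ or current browsers):

```javascript
// Grapheme clusters: user-perceived characters. Requires a runtime with
// Intl.Segmenter support.
function graphemeLength(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(s)].length;
}

const accented = 'e\u0301'; // 'é' as base letter + combining acute accent
console.log([...accented].length);     // 2 code points
console.log(graphemeLength(accented)); // 1 user-perceived character
```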
Personally, I think it is not useful to argue that code points (what we and JSON Schema use to specify the length) are not useful. Your arguments are all valid, but there is no one-size-fits-all solution here. I think it is important not to invent another way of defining the length, and that (if properly calculated) all implementations would arrive at the same length (in code points).
I completely agree, there is no one-size-fits-all solution other than one that allows expressing the intention in use cases that depend on what is counted as length. I'm not sure if I understand your comment correctly: "all would come to the same length (code-points)". If you mean that code point counting is sufficient, if properly calculated, then I disagree:
Again, I am aware that in practice these issues usually are not a problem, because the vast majority of users are either rarely concerned about characters outside of the BMP range, or are very aware and careful to handle these cases correctly because their languages do not fit into the ASCII/UCS-2 scheme where counting characters is trivial. Anyway, I am happy about the clarification that length is counted in code points.
I see your points, but once again I think we cannot solve that. It is not that we do not care about characters outside of BMP. On the contrary!
In fact, we re-use the definition of JSON schema. It is defined there as code points and we just aligned with it. |
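For reference, the JSON Schema definition being reused is simple to state in code. A minimal sketch (`checkLength` is an illustrative name, not part of any library) of the maxLength/minLength semantics from JSON Schema validation, section 6.3:

```javascript
// Validate just the minLength/maxLength keywords, counting Unicode code
// points as JSON Schema (draft 2020-12, section 6.3) specifies.
function checkLength(schema, value) {
  if (typeof value !== 'string') return true; // string keywords ignore non-strings
  const n = [...value].length; // code points, not UTF-16 code units
  if (schema.maxLength !== undefined && n > schema.maxLength) return false;
  if (schema.minLength !== undefined && n < schema.minLength) return false;
  return true;
}

console.log(checkLength({ maxLength: 1 }, '\u{1F600}')); // true: 1 code point
console.log(checkLength({ maxLength: 1 }, 'ab'));        // false: 2 code points
```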
StringSchema

https://w3c.github.io/wot-thing-description/#stringschema

This section has two metadata fields related to the length of a string: minLength and maxLength. What are their units? Are they Unicode code points?

See "Truncating or limiting the length of strings" for the I18N WG's best practices for spec developers.