Limiting the length of strings #1516

Closed
xfq opened this issue May 31, 2022 · 14 comments · Fixed by #1526 or #1568
Labels: i18n-needs-resolution, PR available

Comments

@xfq (Member) commented May 31, 2022

StringSchema
https://w3c.github.io/wot-thing-description/#stringschema

This section has two metadata fields related to the length of a string: minLength and maxLength. What are their units? Is it Unicode code points?

See Truncating or limiting the length of strings for i18n WG's best practices for spec developers.

@xfq added the i18n-needs-resolution label on May 31, 2022
@github-actions bot added the needs-triage label on May 31, 2022
@danielpeintner (Contributor) commented

Those terms are borrowed from JSON schema (see https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3)

The length of a string instance is defined as the number of its characters as defined by RFC 8259.

The TD document should make that clear(er).
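
For reference, a minimal sketch of where these keywords appear in a string schema (the values here are hypothetical):

  const stringSchema = {
    type: "string",
    minLength: 1,   // at least 1 "character"
    maxLength: 10   // at most 10 "characters" -- but in which unit?
  };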

@sebastiankb added the PR needed label and removed the needs-triage label on Jun 7, 2022
@aphillips commented

You should reopen this issue. The term "character" is rather non-specific, and its meaning here depends critically on whether you mean UTF-16 code units or Unicode code points. If one carefully reads JSON Schema and RFC8259, it means Unicode code points. But this is potentially different from what JavaScript's String.length returns (counted in UTF-16 code units). That is, the string "😃" (U+1F603) has a .length of 2 (because it is encoded as \uD83D\uDE03). Because this is potentially surprising to developers, you should call it out. And I'm not sure this is your intention in any case (it's actually somewhat painful to measure and truncate strings in terms of code points in JavaScript).
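
A quick JavaScript illustration of the surprise being described here (illustrative only, not from the spec):

  const s = "😃";              // U+1F603, stored as the surrogate pair \uD83D\uDE03
  console.log(s.length);       // 2 -- UTF-16 code units
  console.log([...s].length);  // 1 -- code points (string iteration is per code point)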

@danielpeintner (Contributor) commented

I reopened the issue (took some time since I was on vacation).

The part

characters as defined by [[RFC8259]]

is taken from JSON Schema (see for example maxLength), which points to the JSON format, which in turn speaks of "Unicode characters".

Do you think using the term "Unicode characters" is clear enough or would you argue "Unicode code points" is the right term?

Note: we don't consider JavaScript. Instead we would like to be compliant with JSON schema. Hence any potential difference with JavaScript is okay for us... I think.

@danielpeintner reopened this on Jul 5, 2022
@aphillips commented

@danielpeintner Thanks!

It's much easier on the reader when specs use the Unicode jargon explicitly. Casual readers who think they know what the ambiguous term character means are not surprised later. And experienced readers don't have to click through to and closely read a couple of "far away documents" to puzzle out the meaning and to verify that it was, in fact, what you intended.

RFC8259's use of the term character or Unicode character actually means Unicode code point, but it does require careful reading to ascertain this and then ensure that this was, indeed, what was intended. (There are also some corner cases in section 8.2, since unpaired surrogate code points are permitted by the ABNF.) JSON Schema's definitions of maxLength and minLength also mean code points.

So I would suggest changing the term, and I would consider adding a reminder note that code points != chars. It's not so much the "difference with JavaScript" as it is reminding folks that String.prototype.length in JS gives the wrong answer (cf. here). Also, while lots of JSON users never use JavaScript, many of them have the same problem in their local programming language--Java, for example, also uses UTF-16 code units for char.
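
For readers validating these keywords in JavaScript, a small sketch of counting by code points rather than by String.prototype.length (a sketch only, not part of any spec):

  // Each step of string iteration yields one code point, so this counts
  // code points rather than UTF-16 code units.
  function codePointLength(str) {
    let n = 0;
    for (const _cp of str) n++;
    return n;
  }

  codePointLength("abc"); // 3
  codePointLength("😃");  // 1, even though "😃".length === 2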

@danielpeintner (Contributor) commented

@aphillips thanks for your detailed information.

I created #1568, which explicitly states "Unicode code points" and gives an example that I found quite useful (see original source).

I hope this makes it clear now and resolves the issue.

@aphillips commented

I'll copy my comment on the PR here for posterity. Alas, the I18N community is a bit fastidious/fussy about our jargon, particularly when it relates to the ambiguity of the term character. Hopefully my comments can help resolve the problems here.

Some Unicode code points do not represent characters--because they are surrogate code points, unassigned code points, or non-character code points. But all Unicode code points that are assigned represent "characters" in Unicode. The example you inserted is trying to describe what Unicode jargon refers to as a grapheme or (more formally) a grapheme cluster. The description here is only part of the problem when dealing with length limitations.

The I18N WG maintains guidelines and help for spec authors and there is a specific section (which I should have pointed to before 😞) describing the terminology and requirements. See here.

What I would suggest is that you word this paragraph thusly:

The length of a string (i.e. minLength and maxLength) is defined as the number of Unicode code points, as specified by [RFC8259]. Note that some user-perceived characters are composed of more than one Unicode code point. Arbitrary index values might not fall on these grapheme boundaries, so truncation according to maxLength might alter the appearance or meaning of the string.

Notice that RFC8259 discusses the potential for partial truncation in section 8.2 and that JSON Schema section 6.3.1 says nothing about grapheme boundaries--only character (code point) boundaries. While truncating arbitrarily mid-string definitely can change the meaning of text, I think your specification is safer if you do not break with these requirements. A stronger health warning in your specification, however, seems like overkill.
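
To make the grapheme-versus-code-point distinction concrete, a small JavaScript sketch using Intl.Segmenter (illustrative only):

  const s = "👩‍👩‍👧‍👦"; // family emoji: four emoji joined by three zero-width joiners

  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  console.log([...seg.segment(s)].length); // 1  -- user-perceived character (grapheme)
  console.log([...s].length);              // 7  -- Unicode code points
  console.log(s.length);                   // 11 -- UTF-16 code units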

@danielpeintner (Contributor) commented

Thank you very much. I updated #1568 accordingly.

@aphillips commented

Looks good to me. Thank you!!

@mutech commented Aug 23, 2022

I have sweated a lot over this topic. Not wanting to open a can of worms, but I believe this topic still deserves more attention.

If we consider what string length constraints in schemas are actually used for, I can think of these use cases:

  • Deployment limitations such as database columns. Here a maximum length depends on the encoding used to store strings and is typically measured in encoded bytes.
  • Formats such as a national postal code. These often use a restricted character set (digits in this example) and often use ASCII or otherwise single-code-unit characters (one might argue that these would be better expressed as patterns, though).
  • Semantic constraints such as a Twitter message length. These most likely mean user-visible characters. Those in turn are hard to count, and considering that Intl.Segmenter uses locales to do its job, and that locale is actually expected to affect grapheme segmentation in the future, this smells scary in a validation context.
  • Formatting constraints such as limiting the length of a formatted address so that it fits into an envelope window. That of course suffers from all kinds of problems.

I am very happy that JSON Schema is clarifying the meaning of character as code point. Sadly, the choice of code points as the measure is the least helpful one. Strings are rarely stored in UTF-32, so it is probably not useful for expressing storage limitations. Code points have no user-visible meaning (and thus no meaning for user-level requirements). The most common reason to express a length constraint in code points is to assume that this measure is the same as JS String.length, because most applications requiring a constraint also use a restricted character set that actually sanitises the meaning of character. The problem is that this sanitisation is not expressed explicitly and is often lost.

I think it would be extremely helpful for schema authors to be able to better specify their intent. One way to do that could be to specify something like:

length: {
  utf8?: { min?: integer, max?: integer } | integer
  utf16CodeUnits?: { min?: integer, max?: integer } | integer
  codePoints?: { min?: integer, max?: integer } | integer
  graphemes?: { min?: integer, max?: integer } | integer
} | { min?: integer, max?: integer } | integer

So maxLength: 254 would probably become length: { utf8: { max: 254 } }, maxLength: 280 would become length: { graphemes: { max: 280 } }, and in contexts in which the application has implicit assumptions about what character length means (i.e. assuming that graphemes, code points, and code units all coincide as the same measure), length: { max: X } can be used.
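
A sketch of how a validator might compute these four measures in JavaScript (the length keyword itself is hypothetical and not part of JSON Schema or the TD spec):

  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

  function measures(str) {
    return {
      utf8: encoder.encode(str).length,                  // encoded bytes
      utf16CodeUnits: str.length,                        // what JS String.length reports
      codePoints: [...str].length,                       // JSON Schema's measure
      graphemes: [...segmenter.segment(str)].length      // user-perceived characters
    };
  }

  measures("😃"); // { utf8: 4, utf16CodeUnits: 2, codePoints: 1, graphemes: 1 }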

The problem with the current approach is that length constraints based on code points cannot be used to express storage size or graphemes. In a way, the current confusion about what character actually means is the reason why people use length constraints at all--basically a "let's hope for the best" approach that works surprisingly well but does not really seem adequate for schema validation purposes.

@egekorkan (Contributor) commented

@mutech I think that your points are valid; however, we do not want to break interoperability with JSON Schema. We could, though, define additional keywords that allow for more precise length requirements. What do you think?

@mutech commented Sep 7, 2022

@mutech I think that your points are valid; however, we do not want to break interoperability with JSON Schema. We could, though, define additional keywords that allow for more precise length requirements. What do you think?

I guess that's the only meaningful solution. The root of the problem is that, with Unicode, the notion of users seeing characters on the screen diverged from how various platforms handle characters in storage or memory, and support for Unicode features (such as counting graphemes) is far from universal.

@danielpeintner (Contributor) commented

Personally, I think it is not useful to argue that code points (what we and JSON Schema use to specify the length) are not useful:

  • for user-visible characters
  • to express storage limitations
  • to assume it matches JS String.length
  • ...

Your arguments are all valid, but there is no one-size-fits-all solution here.

I think it is important not to invent yet another way of defining the length, so that (if properly calculated) everyone arrives at the same length (code points).
In any case, someone needs to adapt...

@mutech commented Sep 8, 2022

I completely agree: there is no one-size-fits-all solution, other than a solution that allows one to express the intention in use cases that depend on what is counted as length.

I'm not sure if I understand your comment right: "all would come to the same length (code-points)".

If you mean to say that code point counting is sufficient, if properly calculated, then I disagree:

  • The length of a tweet is understood by users as graphemes, and recently is also implemented as such: an emoji is counted as 1 character. You cannot reasonably express that in code point lengths, considering that there is no reasonable limit to the number of code points in a grapheme (e.g. Zalgo text).
  • Most databases understand storage allocation as UTF-8 bytes. If you compute the maximum number of code points guaranteed to fit in 255 bytes, it would be a maxLength of 63 (at the UTF-8 worst case of 4 bytes per code point; see the sketch after this list). But while that is sufficient to limit the data so that it fits into a column, it does not express the intention, which is probably to allow as many characters as would fit into 255 bytes.
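
A sketch of that database use case in JavaScript: trimming a string to a UTF-8 byte budget without splitting a code point (graphemes are a separate, harder problem, as noted above):

  function truncateToUtf8Bytes(str, maxBytes) {
    const encoder = new TextEncoder();
    let out = "";
    let bytes = 0;
    for (const cp of str) {                    // iterate code point by code point
      const size = encoder.encode(cp).length;  // UTF-8 bytes for this code point
      if (bytes + size > maxBytes) break;
      out += cp;
      bytes += size;
    }
    return out;
  }

  truncateToUtf8Bytes("aé😃", 3); // "aé" (1 + 2 bytes; the emoji needs 4 more)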

Again, I am aware that in practice these issues are usually not a problem, because the vast majority of users are either rarely concerned with characters outside of the BMP range, or are very aware of and careful to correctly handle these cases because their languages do not fit into the ASCII/UCS-2 scheme, where counting characters is trivial.

Anyway, I am happy about the clarification that character actually means code point, which was not really defined before and, to my knowledge, is still not defined in JSON Schema. After all, regular expressions can be used to express many of these measurements through Unicode categories--not portable, not complete, but probably good enough for practical purposes. But that's the SHOULD realm anyway.

@danielpeintner (Contributor) commented

I'm not sure if I understand your comment right: "all would come to the same length (code-points)".

If you mean to say that code point counting is sufficient, if properly calculated, then I disagree:

I see your points, but once again I think we cannot solve that. It is not that we do not care about characters outside of the BMP. On the contrary!
And yes, a database that uses UTF-8 bytes needs to account for the worst case. Hence a 10-character string might be longer than 10 bytes.

Anyway, I am happy about the clarification that character actually means code point which was not really defined before and to my knowledge is still not defined in JSON schema

In fact, we re-use the definition from JSON Schema. It is defined there as code points, and we just aligned with it.
