textarea[maxlength] interoperability issue #1467

Open
tkent-google opened this Issue Jun 28, 2016 · 61 comments

@tkent-google
Contributor

tkent-google commented Jun 28, 2016

Current specification:

https://html.spec.whatwg.org/multipage/forms.html#limiting-user-input-length:-the-maxlength-attribute

... the code-unit length of the element's value is greater than the element's maximum allowed value length, then the element is suffering from being too long.

https://html.spec.whatwg.org/multipage/forms.html#the-textarea-element:concept-textarea-api-value

... The API value is the value used in the value IDL attribute. It is normalised so that line breaks use U+000A LINE FEED (LF) characters. Finally, there is the value, as used in form submission and other processing models in this specification. It is normalised so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, ...

So, a single linebreak should be counted as 2 when applying maxlength. Also, a non-BMP character such as an Emoji should be counted as 2 because it consists of two code units.
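A sketch of how the spec's count works out in JavaScript (specLength is a hypothetical helper for illustration, not a platform API; the CRLFs inserted by hard-wrapping are ignored here):

```javascript
// Hypothetical helper illustrating the spec's count for textarea[maxlength]:
// UTF-16 code units of the submission value, where line breaks are CRLF pairs.
function specLength(apiValue) {
  const submissionValue = apiValue.replace(/\n/g, "\r\n");
  return submissionValue.length; // String.prototype.length counts code units
}

console.log(specLength("a\nb")); // 4: the linebreak counts as 2
console.log(specLength("😀"));   // 2: a non-BMP character is two code units
```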

Current implementations:

The following table shows how characters are counted in maxlength computation in major browsers.

                         Google Chrome   WebKit   Firefox   Edge, IE
A linebreak                    2            2        1         1
Non-BMP character              2            1        2         1
Letter + combining char        2            1        2         2

Only Google Chrome conforms to the specification.

Issues

  • Not interoperable at all 😂
  • I think web authors need both limiting the submission value and limiting the API value. It should be configurable.
  • Non-BMP characters are getting popular because of Emoji. Web authors might need to count an Emoji as 1. <input maxlength=2> and <input pattern=".{2}"> are not equivalent now because the pattern attribute counts an Emoji as 1.
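The two counting schemes can be compared directly in JavaScript, since String.prototype.length counts code units while spreading a string iterates code points:

```javascript
const emoji = "😀"; // U+1F600, outside the BMP

console.log(emoji.length);      // 2: code units, what maxlength uses per spec
console.log([...emoji].length); // 1: code points, what pattern uses

// pattern="" compiles with the 'u' flag, so "." matches a whole code point
// and a single emoji fails a two-character pattern:
console.log(new RegExp("^(?:.{2})$", "u").test(emoji)); // false
```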

Proposal

Introduce a new enumerated attribute, maxlengthmode, to specify the maxlength counting algorithm. Its value should be something like:

  • codeunit-submission (Google Chrome)
  • codeunit-api (Firefox, etc.)
  • codepoint-submission (non-BMP is 1, linebreak is 2)
  • codepoint-api (non-BMP is 1, linebreak is 1)

@domenic

Member

domenic commented Jun 28, 2016

Hmm.

I think web authors need both limiting the submission value and limiting the API value. It should be configurable.

I'd be interested in hearing from more web authors before we add complexity for this. To me it seems better to limit submission value only, like other constraints.

Non-BMP characters are getting popular because of Emoji.

I agree this is an issue. I wonder if we could just update the spec to measure code points, in the same spirit as changing pattern="" to use Unicode. I think it would be more intuitive for authors.

In total, I would weakly prefer saying that maxlength always has what you called codepoint-submission behavior (non-BMP is 1, linebreak is 2), and work to converge all browsers to that. I am eager to hear from others on this subject, both implementers and authors.

Maybe codepoint-api is better, actually...

@domenic domenic added the compat label Jun 28, 2016

@phistuck

phistuck commented Jun 28, 2016

It seems to me that maxlength is used to guard against values going over the database limit of a field, which is defined in bytes most of the time. For this reason, I believe it should reflect the number of bytes (so code units, right?) and not the number of displayed characters or glyphs.

@annevk

Member

annevk commented Jun 28, 2016

See https://bugs.webkit.org/show_bug.cgi?id=120030 and https://lists.w3.org/Archives/Public/public-whatwg-archive/2013Aug/thread.html#msg184 for prior discussion and debate.

Safari uses grapheme clusters, which seems a little heavy-handed and probably not something anyone expects. Using code points by default seems reasonable. @rniwa any new thoughts on this in the meantime?

(The way Firefox and Edge count linebreaks seems like a bug that should just be fixed.)

@tkent-google

Contributor

tkent-google commented Jun 28, 2016

The Chromium project has received multiple bug reports (fewer than 10, I think) that textarea[maxlength] miscounts characters. All of them expected a linebreak to be counted as 1.

@annevk

Member

annevk commented Jun 28, 2016

@tkent-google none of those folks found it weird that they got two line breaks on the server? Or were they all extracting the value client-side?

@tkent-google

Contributor

tkent-google commented Jun 28, 2016

I don't know the details of their systems. They just mentioned that they expected maxlength to be applied to the API value.
They might normalize CRLF to LF on the server side.

Let's forget about non-BMP characters and grapheme clusters, and focus on the pain of linebreak counting.

@annevk

Member

annevk commented Jun 28, 2016

I guess if that's the main problem you see and neither Firefox nor Edge is planning on switching, it seems reasonable to change the standard for linebreaks.

@tkent-google

Contributor

tkent-google commented Jun 28, 2016

FYI.
I searched bugzilla.mozilla.org and bugs.webkit.org for related issues.

https://bugzilla.mozilla.org/show_bug.cgi?id=702296
This says a linebreak should be counted as 2.
https://bugs.webkit.org/show_bug.cgi?id=154342
This says a linebreak should be counted as 1.

https://bugzilla.mozilla.org/show_bug.cgi?id=670837
https://bugzilla.mozilla.org/show_bug.cgi?id=1277820
They are codepoint counting issues.

@mathiasbynens

Member

mathiasbynens commented Jun 28, 2016

It seems to me that maxlength is used to guard against values going over the database limit of a field, which is defined in bytes most of the time. For this reason, I believe it should reflect the number of bytes (so code units, right?) and not the number of displayed characters or glyphs.

Bytes ≠ code units. The term “code units” generally refers to UTF-16/UCS-2 blocks, where each BMP symbol consists of a single unit and each astral symbol of two code units.

Byte values can be anything, as they depend on the encoding used.
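A quick Node.js illustration of the point (Buffer is Node-specific; in browsers, TextEncoder covers only the UTF-8 case):

```javascript
// Byte counts vary with the encoding; code-unit counts do not.
const s = "héllo"; // "é" here is the precomposed U+00E9

console.log(s.length);                        // 5 UTF-16 code units
console.log(Buffer.byteLength(s, "utf8"));    // 6 bytes ("é" takes 2)
console.log(Buffer.byteLength(s, "utf16le")); // 10 bytes (2 per code unit)

const astral = "😀"; // one astral code point
console.log(astral.length);                     // 2 code units
console.log(Buffer.byteLength(astral, "utf8")); // 4 bytes
```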

@phistuck

phistuck commented Jun 28, 2016

Bytes ≠ code units. The term “code units” generally refers to UTF-16/UCS-2 blocks, where each BMP symbol consists of a single unit and each astral symbol of two code units.

Byte values can be anything, as they depend on the encoding used.

So this should be bytes according to the page encoding, I guess, since this is what is transferred to the server (in regular form submissions).

@tkent-google

Contributor

tkent-google commented Jun 28, 2016

Correction: Edge and IE count a non-BMP character as 1.

@aaron-nance

aaron-nance commented Jun 28, 2016

I am one of the web developers who would like to see this work consistently in all browsers. I work in the assessment industry. We use textareas for long-form responses. These can be in any language, as we support IMEs.

What most web developers do to get around this maxlength vs. field.value.length inconsistency is not use maxlength at all. Instead, most use JavaScript to enforce a character limit on the response. This introduces a maintenance burden into the software, since you have to consider not only normal character entry (which can be handled via key events, the input event, or browser-specific events), but also composition events and cut/paste events. Why do all that in your web app when the maxlength attribute should be able to do it for you?

There would be more bugs filed, but most web developers have, unfortunately, accepted that this will never be consistent and given up. Search Stack Overflow for this issue and you'll find plenty of discussion.

In the work I do it is not acceptable for a user on one browser to have a different number of characters available than a user on another. We also display a counter to the user, based on the field length, to tell them how many characters they have left. We end up having to adjust the maxlength in Chrome and WebKit to achieve parity, which is a silly thing to have to do.
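A sketch of the kind of client-side workaround described above, assuming a code-point-based limit; clampToCodePoints is a hypothetical helper:

```javascript
// Hypothetical helper: clamp a string to a maximum number of code points,
// never splitting a surrogate pair.
function clampToCodePoints(value, max) {
  const points = [...value]; // iterates code points, not code units
  return points.length > max ? points.slice(0, max).join("") : value;
}

// Browser wiring (sketch): the input event also fires after IME composition,
// paste, and cut, so one listener covers the cases listed above.
// textarea.addEventListener("input", () => {
//   textarea.value = clampToCodePoints(textarea.value, 140);
//   counter.textContent = 140 - [...textarea.value].length;
// });

console.log(clampToCodePoints("😀😀😀", 2)); // "😀😀"
```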

@domenic

Member

domenic commented Jun 28, 2016

maxlength cannot be used to enforce that a certain number of bytes get sent to the server so that should not be a consideration. And introducing encoding as a factor is way more complicated than this needs to be.

It sounds like, based on Chromium bugs filed and other web developers' anecdotes, web developers prefer a linebreak to count as 1 character. Partially because it makes sense; partially because it matches textarea.value.length. There does not seem to be a lot of people favoring Chromium's/WebKit's/the spec's behavior of linebreaks being 2 characters, to match what is sent to the server. (I also note that Twitter counts linebreaks as 1 character against the 140 limit.)

I am not sure what to do about non-BMP characters. I think users might prefer grapheme clusters. Some developers might prefer consistency with pattern="", which would be code points. Some developers might prefer consistency with textarea.value.length, which would be code units. :-/ Maybe it is best to focus on linebreaks first.

@tkent-google, it sounds like you are willing to change linebreaks to 1, which would make it 3/4 browsers. WebKit people, any thoughts on that? Would you prefer to keep your current linebreak behavior, which matches the current spec, or do you agree with @tkent-google that the current spec behavior is causing problems for users? /cc @cdumez as I feel he has been working on lots of compat and spec compliance cases like this...

@cdumez

Contributor

cdumez commented Jun 28, 2016

I'll defer to @rniwa on this one since he is already familiar with this (based on https://bugs.webkit.org/show_bug.cgi?id=120030).

@tkent-google

Contributor

tkent-google commented Jun 29, 2016

My preference is to make the behavior configurable, as I wrote in the first message. I'm not sure what the majority of web developers demand.

dstockwell pushed a commit to dstockwell/chromium that referenced this issue Jun 29, 2016

UseCounter: Count usage of textarea[maxlength] and textarea[minlength].
This is a preparation for a possible specification change.
whatwg/html#1467

BUG=624361
TBR=isherman@chromium.org

Review-Url: https://codereview.chromium.org/2104893005
Cr-Commit-Position: refs/heads/master@{#402806}
@zcorpan

Member

zcorpan commented Jul 6, 2016

I don't particularly like the idea of being able to configure the behavior and having lots of modes. I think it is better to reason about what the best design is, and choose that. It seems to me most people expect maxlength to behave like codepoint-api, but it also seems useful to me to match up with elm.value.length so JS code doesn't get confused. (edit: hmm, elm.value.length is code units...)

If we want to solve the "guard against values going over the database limit of a field" use case, we could add a separate attribute like maxbytelength, which counts the number of bytes that will be sent (in the encoding in which the form data will be sent).
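A rough sketch of what such a check could do, assuming the form submits as UTF-8 (TextEncoder only encodes UTF-8, so other form encodings would need different handling); the maxbytelength attribute itself is hypothetical:

```javascript
// Hypothetical maxbytelength-style check: count the bytes a value would
// occupy when submitted as UTF-8.
function utf8ByteLength(value) {
  return new TextEncoder().encode(value).length;
}

console.log(utf8ByteLength("abc")); // 3
console.log(utf8ByteLength("é"));   // 2
console.log(utf8ByteLength("😀"));  // 4
```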

Searching on github for maxbytelength I find at least one (broken) JS implementation of such an attribute:
https://github.com/jiangyukun/2015July/blob/709a347f3fe9b80674dcabab7d26004626826529/src/main/webapp/helper/h2/jsPlumbDemo/js/jquery-extend/jquery.maxbytelength.js

@tkent-google

Contributor

tkent-google commented Jul 7, 2016

I agree that most web developers want codepoint-api.

Web developers can re-implement their own user-input restrictions by hooking keypress and paste events, right? If so, providing only the codepoint-api behavior would be reasonable.

@domenic

Member

domenic commented Jul 7, 2016

I started looking into fixing this. A couple interesting points:

  • Does this impact minlength too?
  • Do we count linebreaks inserted by hard-wrapping? Currently minlength and maxlength operate on the element's value, which is the result of the textarea wrapping transformation. That is what transforms LF and CR into CRLF, in addition to inserting CRLFs for hard-wrapping.
  • Does this impact textarea.textLength? This currently returns the number of code units (not code points) of the API value. The API value normalizes linebreaks to LF and doesn't include linebreaks inserted by hard-wrapping.
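For reference, the two values in question can be approximated in plain JavaScript (this cannot reproduce the CRLFs inserted by hard-wrapping, which depend on layout):

```javascript
// API value: line breaks normalized to LF.
function apiValue(raw) {
  return raw.replace(/\r\n?/g, "\n");
}

// Submission value (minus hard-wrapping): line breaks normalized to CRLF.
function submissionValue(raw) {
  return apiValue(raw).replace(/\n/g, "\r\n");
}

const raw = "one\r\ntwo";
console.log(apiValue(raw).length);        // 7, what textLength reports
console.log(submissionValue(raw).length); // 8, what maxlength checks per spec
```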

My preferences:

  • minlength is also affected
  • minlength and maxlength count the number of code points in the API value (so linebreaks are counted as 1 and inserted hard-wraps are not counted)
  • textLength would change to return the number of code units in the API value

I think these preferences are consistent with the idea that we are not trying to control the number of bytes on the wire (the element's value) but instead something closer to the user visible characters.

domenic added a commit that referenced this issue Jul 7, 2016

Change minlength/maxlength behavior around linebreaks and code points
As discussed in #1467, the current situation around these attributes is
not very interoperable in two ways:

* Some browsers count line breaks as two characters (per the spec),
others as one character.
* Some browsers count code units (per the spec), others code points,
others grapheme clusters.

Per discussions, this updates minlength and maxlength, as well as
textarea's textLength property, to count line breaks as one character
and to count code points. We believe this is the most developer-friendly
approach. Notably the choice of code points aligns with the pattern=""
attribute.

Fixes #1467.
@domenic

Member

domenic commented Jul 7, 2016

I changed my mind on textLength. We appear to already have interop there. So let's not change anything, even if it is a bit inconsistent. http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4305

domenic added a commit that referenced this issue Jul 7, 2016

Change minlength/maxlength behavior around linebreaks and code points
As discussed in #1467, the current situation around these attributes is
not very interoperable in two ways:

* Some browsers count line breaks as two characters (per the spec),
others as one character.
* Some browsers count code units (per the spec), others code points,
others grapheme clusters.

Per discussions, this updates minlength and maxlength to count line
breaks as one character and to count code points. We believe this is the
most developer-friendly approach. Notably the choice of code points
aligns with the pattern="" attribute.

While here, updated the textLength property to return the code-unit
length of the element's API value, instead of the element's value, since
as per http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4304
that is what browsers actually do.

Fixes #1467.

@tkent-google

Contributor

tkent-google commented Jul 8, 2016

Oops, I made a wrong reply.

I agree that most web developers want codepoint-api.

I don't think web developers want codepoint-api. I think it's codeunit-api, which is consistent with textarea.value.length.

@rniwa

Collaborator

rniwa commented Jul 8, 2016

I don't think that matches your earlier statement that non-BMP characters should be counted as one character due to the popularity of emoji.

I think what we really need is a new JavaScript API that lets authors count the number of grapheme clusters.
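For what it's worth, such an API has since materialized as Intl.Segmenter (shipped years after this discussion); counting grapheme clusters from script looks roughly like this:

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
function graphemeCount(value) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(value)].length;
}

console.log(graphemeCount("e\u0301")); // 1, though "e\u0301".length is 2
console.log(graphemeCount("😀"));      // 1, though it is 2 code units
```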

@domenic

Member

domenic commented Jul 8, 2016

I will reply to @rniwa's #1517 (comment) here so we can keep the discussion in one place. Sorry for splitting it; I'm not sure why I thought that was a good idea.

I don't understand why we want to use code points at all if we're not trying to count the number of bytes. What's the point?

I don't think we're trying to count the number of bytes. That seems to match nobody's interests: not the web developer's, and not the user's. If you want to match bytes then you have to start worrying about encodings and your answer changes depending on your <meta charset> and so on. No fun.

It doesn't match what length property returns in JavaScript either since it counts code units. e.g. '😀'.length returns 2.

Yeah, this is the biggest problem. I thought we were going for consistency with pattern (which uses code points) and with some of the newer JS APIs (such as for-of over a string or Unicode regexes). But per #1467 (comment) I guess @tkent-google thought consistency with textarea.value.length was more important.

Given that the user visible number of characters is only related to the number of grapheme clusters, I don't think it's acceptable to treat them differently. e.g. "é" and "é".

I agree there is a definite tension between what users might expect (grapheme clusters) and what developers might expect (either code points or code units depending on what other platform and JS APIs they are using). I'm not sure which you are arguing for.

I thought code points was a good medium (it at least allows emoji entry to count as 1 character, and matches Twitter at least). But it seems like both you and @tkent-google are not a big fan of that. I can't tell if you are arguing that minlength/maxlength should use code units, or grapheme clusters, or bytes.

I think what we really need is a new JavaScript API that lets authors count the number of grapheme clusters.

I think ideally we'd want a whole new JS string library that deals with grapheme clusters, instead of the current one that mostly deals with code units but in some cases code points. Or maybe we just expect developers to normalize? I'm not sure. I think such wishes are a bit off-topic for discussing maxlength/minlength, though.


I'm unsure where this leaves us. My PR was clearly premature. If we went with "codeunit-api", that would require:

  • Chrome:
    • Change linebreaks from 2 characters to 1 character
  • WebKit:
    • Move from grapheme clusters to code units
    • Change linebreaks from 2 characters to 1 character
  • Firefox: no change
  • Edge:
    • Move from code points to code units
@domenic

Member

domenic commented Jul 14, 2016

@rniwa, do you have thoughts on moving from grapheme clusters to code units for minlength/maxlength (matching the current spec), and on moving from 2 characters to 1 character for linebreaks (a spec change)?

@rniwa

Collaborator

rniwa commented Jul 16, 2016

I'm arguing for using grapheme clusters for maxlength and minlength. I can be convinced either way for line breaks.

@kojiishi

kojiishi commented Jul 21, 2016

"write their own verification code" can apply to both of us, so I don't think we can break the web with that argument. UTF-16 is defined in ANSI SQL; one product not supporting it doesn't convince me either. You're right that other choices are available, but that's what people do in reality, and browsers used to support it.

We already agreed to add. Do you want to keep discussing until we're happy to break the web?

@rniwa

Collaborator

rniwa commented Jul 21, 2016

What do you mean by "break the web"? There is no interoperable behavior here. If interoperability and backwards compatibility are an issue (please provide a list of websites on which either behavior is required for compatibility, not some Stack Overflow post), we should implement what IE does.

UTF-16 is defined in ANSI SQL; one product not supporting it doesn't convince me either.

Why is that even remotely relevant? Nobody implements ANSI SQL and nobody uses it. Also, with the increasing popularity of NoSQL databases, it's almost irrelevant what a popular RDBMS uses as its primary text encoding. And NoSQL databases such as MongoDB and Cassandra use UTF-8 as their text encoding of choice. May I remind you that your own App Engine uses UTF-8 as its text encoding of choice?

You're right that other choices are available, but that's what people do in reality, and browsers used to support it.

I don't follow what you're saying here. In particular, "that" in "that's what people do" is ambiguous. Could you clarify?

We already agreed to add.

We agreed to add what?

@kojiishi

kojiishi commented Jul 25, 2016

What do you mean by "break the web"?

That only one browser recently started to change the behavior does not guarantee that it's safe for all browsers to change the behavior.

If interoperability and backwards compatibility is an issue ... we should implement what IE does.

IIUC that's what tkent originally proposed, and what Hixie resolved before. Do you have data on why it's safe to revert the previous resolution?

Why is that even remotely relevant?

I already mentioned I'm ok with other encodings if preferred. Whichever encoding is used, developers can compute the max length after conversion. Graphemes are a different animal; developers cannot compute the max.

We agreed to add what?

Agreed to add a mode that handles grapheme clusters in the original proposal of this issue. We haven't seen the data that shows it's safe to change yet.

@rniwa

Collaborator

rniwa commented Jul 25, 2016

What do you mean by "break the web"?
That only one browser recently started to change the behavior does not guarantee that it's safe for all browsers to change the behavior.

Which browser changed what behavior? WebKit's behavior of using grapheme clusters for maxlength has existed for as long as maxlength has been supported in WebKit. Blink is the one that recently changed its behavior, in August 2013, to use the number of code units for maxlength.

Why is that even remotely relevant?
I already mentioned I'm ok with other encodings if preferred. Whichever encoding is used, developers can compute the max length after conversion. Graphemes are a different animal; developers cannot compute the max.

This is precisely why grapheme clusters are more useful. If an author needed to limit the number of characters using UTF-7, 8, 16, or 32, he/she could easily implement that with a dozen or so lines of JavaScript. Counting grapheme clusters is a lot harder, and it's better implemented by the UA.

Agreed to add a mode that handles grapheme clusters in the original proposal of this issue. We haven't seen the data that shows it's safe to change yet.

Well, we can make the same argument that nobody has shown us that changing our behavior to not use grapheme clusters is safe. Again, WebKit has always used the number of grapheme clusters for maxlength.

At this point, I don't see any value in continuing this discussion until someone comes back with data.

@domenic

Member

domenic commented Jul 25, 2016

I'm getting tired of this discussion without anyone providing any new information. I don't think there is any use in continuing this discussion until someone comes back with data.

It sounds like WebKit is not interested in arriving at consensus with the other browsers on this issue :(. In other words, I guess Darin Adler's "Anyway, we should match the standard" from https://bugs.webkit.org/show_bug.cgi?id=120030#c4 is no longer the WebKit team's policy. It may be best to seek consensus among the majority for the purposes of creating a spec, instead.

Right now the best we can spec with regard to non-linebreaks is the existing spec (code units), which gets 2/4 browsers. If Edge is willing to change, we can get 3/4. If Firefox and Chrome are both willing to change to code points, that is another path to 3/4.

(Linebreaks, it sounds like there is still hope of consensus at 1 instead of 2.)

@rniwa

Collaborator

rniwa commented Jul 25, 2016

I'm getting tired of this discussion without anyone providing any new information. I don't think there is any use in continuing this discussion until someone comes back with data.
It sounds like WebKit is not interested in arriving at consensus with the other browsers on this issue :(. In other words, I guess Darin Adler's "Anyway, we should match the standard" from https://bugs.webkit.org/show_bug.cgi?id=120030#c4 is no longer the WebKit team's policy. It may be best to seek consensus among the majority for the purposes of creating a spec, instead.

WTF? That's not what I'm saying at all. What I'm saying is that there is no point in continuing the discussion without anyone showing any compatibility data, given that @kojiishi's repeated argument is that we can't use grapheme clusters for Web compatibility, but without any data showing as such.

I find your commentary here extremely offensive. I've spent so much of my time replying to the discussion here for the last couple of weeks. Why on earth would I do that if I weren't interested in coming to a consensus?

Right now the best we can spec with regard to non-linebreaks is the existing spec (code units)

Why? By what criteria is that "best"?

I also find all these arguments about what is "best" and "good" for developers extremely distracting and counterproductive. They're extremely subjective measures and of no merit to anyone who doesn't share the same idea of what's "good" and "bad" for the Web.

@rniwa

Collaborator

rniwa commented Jul 25, 2016

Here's the summary of the discussion thus far on UTF-16 code units versus grapheme clusters. Feel free to add more arguments / counter-arguments.

Arguments for UTF-16 code units

  • Web developers want code units because it matches what input.value.length returns.
  • Matches what popular databases such as ANSI SQL, Microsoft SQL Server, and Oracle use.
    Counter: Postgres, MongoDB, Cassandra, and AppEngine use UTF-8.
  • Chrome, Firefox, and the spec match this behavior.
    Counter: Chrome changed its behavior in August 2013. WebKit always used (and still uses) grapheme clusters.

Arguments for grapheme clusters

  • Emoji are getting popular, and accented characters such as "é" and "é" should have the same behavior for end users.
  • WebKit always used and still uses grapheme clusters.
    Counter: Chrome changed its behavior in August 2013 and hasn't gotten complaints.
  • Checking the number of grapheme clusters is harder than doing the same for UTF-8, 16, or 32 or any other text encoding, so it is better implemented by the UA.
    Counter: Web developers can't easily compute the length of text using grapheme clusters.
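To make the candidates concrete, here is how the counting schemes discussed in this thread diverge on one short string (grapheme counting via Intl.Segmenter, an API that postdates this thread):

```javascript
const s = "e\u0301😀\n"; // e + combining acute, one emoji, one linebreak

const codeUnitsSubmission = s.replace(/\n/g, "\r\n").length; // 6: LF -> CRLF
const codeUnitsApi = s.length;                               // 5
const codePoints = [...s].length;                            // 4
const graphemes = [...new Intl.Segmenter(undefined, {
  granularity: "grapheme",
}).segment(s)].length;                                       // 3

console.log(codeUnitsSubmission, codeUnitsApi, codePoints, graphemes);
```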
@domenic

Member

domenic commented Jul 26, 2016

WTF? That's not what I'm saying at all.

My sincere apologies then @rniwa. I must have misinterpreted. Perhaps you can see how people might interpret your statements that way. By all means, let's continue discussing; hopefully we can all move toward a compromise.

Why? By what criteria is that "best"?

In this sentence, I meant "best" as in "has the most browsers implementing it".

@tkent-google

Contributor

tkent-google commented Jul 26, 2016

If a form with <input maxlength> is submitted to a server, the server should validate the submitted data length again anyway. I don't think servers built in the last 20 years count grapheme cluster length with the same algorithm as WebKit.

@rniwa

Collaborator

rniwa commented Jul 26, 2016

If a form with <input maxlength> is submitted to a server, the server should validate the submitted data length again anyway.

Right. This is why I don't think guarding against the length of some UTF encoded text is a valid use case for maxlength.

I don't think servers built in the last 20 years count grapheme cluster length with the same algorithm as WebKit.

I don't think so either. However, I don't think that's an argument for or against using grapheme clusters for maxlength, given the first point you just made. Since the server needs to do some sort of validation orthogonal or in addition to what the Web browser does, maxlength can do whatever it needs to do regardless of how the backend server counts the number of characters.

domenic added a commit that referenced this issue Aug 24, 2016

Change minlength/maxlength behavior around linebreaks
As discussed in #1467, the current situation around these attributes is
not very interoperable. Some browsers count line breaks as two
characters (per the spec before this change), others as one character.

Per discussions, this updates minlength and maxlength to count line
breaks as one character. We believe this is the most developer-friendly
approach, as evidenced in part by repeated complaints against Chromium
for its behavior following the standard.

While here, updated the textLength property to return the code-point
length of the element's API value, instead of the element's value, since
as per http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4304
that is what browsers actually do.

Fixes part of #1467, but the debate remains about code-unit length vs.
code-point length vs. number of grapheme clusters.


annevk added a commit that referenced this issue Aug 26, 2016

Change minlength/maxlength behavior around linebreaks
As discussed in #1467, the current situation around these attributes is
not very interoperable. Some browsers count line breaks as two
characters (per the spec before this change), others as one character.

Per discussions, this updates minlength and maxlength to count line
breaks as one character. We believe this is the most developer-friendly
approach, as evidenced in part by repeated complaints against Chromium
for its behavior following the previous standard.

While here, updated the textLength property to return the code-point
length of the element's API value, instead of the element's value, since
as per http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4304
that is what browsers actually do. Similarly, updated the conformance
requirement on the text content of the textarea element to match
maxlength's actually-implemented behavior.

Fixes part of #1467, but the debate remains about code-unit length vs.
code-point length vs. number of grapheme clusters.
@crislin2046


crislin2046 commented Nov 17, 2016

In relation to what @domenic notes

There is a real tension here between users (who probably expect grapheme clusters) and developers (who probably expect 16-bit code units). There is also a tension between client-side developers (who probably expect linebreaks to be 1, to match .value.length) and server-side developers (who expect it to be 2, to match what is submitted with the form---and may have further expectations around non-ASCII characters depending on encoding).

Would it work to experiment with some new DOM and IDL attributes on textarea and relevant input, something like:

  • lengthinchars for grapheme clusters
  • lengthincodes for code units

With each having two constraint setters prefixed by max and min, and also the unprefixed getter that returns the relevant view of the length?

As for adding API or spec to clarify the newline behaviour for textarea, it's not clear what could work better than what's implemented.
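
For illustration, the three candidate measures being debated can all be computed in JavaScript. Note that `Intl.Segmenter` is a later addition to the language (it would not have been available when this comment was written) and is used here only to show grapheme-cluster counting:

```javascript
// Three ways to measure the "length" of a string, matching the three
// candidate semantics discussed in this thread.

function codeUnitLength(s) {
  return s.length; // String.prototype.length counts UTF-16 code units
}

function codePointLength(s) {
  return [...s].length; // string iteration yields code points
}

function graphemeLength(s) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(s)].length;
}

const emoji = "\u{1F600}"; // 😀, one code point outside the BMP
// 👩‍❤️‍👩 "two women with heart": a ZWJ emoji sequence
const family = "\u{1F469}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F469}";

console.log(codeUnitLength(emoji), codePointLength(emoji), graphemeLength(emoji));    // 2 1 1
console.log(codeUnitLength(family), codePointLength(family), graphemeLength(family)); // 8 6 1
```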

@kojiishi


kojiishi commented Nov 18, 2016

I agree with @rniwa and @tkent that servers will check it again anyway for errors, but it will probably give different experiences.

So from rniwa's comment above, the arguments that look reasonable to me are compatibility with other parts of the DOM vs. using maxlength for emoji and non-precomposed characters?

I have changed my position to neutral, since my understanding of which cases authors use maxlength for is not backed by data, and I hope for a consensus among implementations.

@rianby64


rianby64 commented Dec 20, 2016

Hello! I was looking for information about maxlength and couldn't find anything useful for one case.

My case is:
I have an <input> and I want to write a very long string to it.

  1. str.length === 1002587.
  2. After setting the input's value with input.value = str,
  3. I test its length and find that input.value.length === 524288, which differs from str.length.

If I use a <textarea> instead of an <input>, the above test passes.

So, my question is: where can I find information about the input's length restriction in the spec? If such a restriction depends on the browser, it would be great to see a warning about that.

And why doesn't textarea have any restriction on .value's length?

Hope my question fits into this scope.
Thanks a lot!

@phistuck


phistuck commented Dec 20, 2016

@rianby64 - you just bumped into an old WebKit-based bug - https://crbug.com/450544 - fixed in Chrome 56 (it might still exist in Safari - see https://trac.webkit.org/browser/trunk/Source/WebCore/html/HTMLInputElement.cpp#L91 for the code).
I believe there is no restriction in the specification; it was an implementation detail of the browser.
Off topic, anyway.

@rianby64


rianby64 commented Dec 21, 2016

@phistuck Thanks a lot.

I think there should be a mention of the <input>'s default maxlength and related tags, something like "The input's value has no length restriction" or similar.

@thany


thany commented Oct 17, 2018

I have created a pen to make it easy to check a browser against this issue:
https://codepen.io/thany/pen/zmRZKM

I feel maxlength should check characters, not bytes or whatever. Every non-BMP character is specified in unicode as singular characters and should be treated as such. We have some legacy issues regarding non-BMP characters, but I don't see a problem in allowing maxlength to count emoji as a single character. Systems that don't 'support' emoji will break anyway. There are legacy issues with BMP characters as well, so systems that are badly designed are bad systems, and I don't think we should accommodate them with an HTML standard that is expected to greatly outlive those systems.

@jfkthame


jfkthame commented Oct 17, 2018

I feel maxlength should check characters, not bytes or whatever. Every non-BMP character is specified in unicode as singular characters and should be treated as such.

What about emoji that are encoded as multi-character sequences (but rendered as single glyphs)? Something like 👩‍❤️‍👩 "two women with heart" appears to the user as a single emoji, but is encoded as a sequence of 6 unicode characters (or 8 code units in UTF-16). Counting "characters" (as encoded in Unicode) won't really help here.

@thany


thany commented Oct 18, 2018

Let's say it should count codepoints.

It's going to be an absolute nightmare (in HTML, JavaScript, and on servers alike) to count characters with combining characters as one. Remember zalgo? That's too gnarly to deal with, if not impossible. And that goes for legitimate combining characters as well.

You might see a combined emoji like that, or just a flag, as a ligature. They're not quite ligatures afaik, but they're at least closely related. So would you count the "fi" ligature as a single character too? Of course not.

So I think a combined emoji should be counted as 6 (or however many codepoints it is), because it's 6 codepoints, as per the Unicode spec. This is not a great experience for the end user, who might assume it to be 1 character, but at least counting codepoints is more correct than counting UTF-16 code units (surrogate pairs).

When dealing with normal text that is so international that its characters sit both inside and outside the BMP, it's not easy to know which character is going to count as two. It's not just emoji.

@jfkthame


jfkthame commented Oct 18, 2018

When dealing with normal text that is so international that its characters sit both inside and outside the BMP, it's not easy to know which character is going to count as two. It's not just emoji.

It's not just whether characters are in the BMP or not, either; what about precomposed accented letters vs decomposed sequences? They may be indistinguishable to the user (assuming proper font support), yet their lengths will be counted differently.
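
That discrepancy is easy to see in a small sketch (the variable names are just for illustration):

```javascript
// "é" can be one precomposed code point or a letter plus a combining
// mark; both render identically but measure differently under every
// code-unit or code-point counting scheme.
const precomposed = "caf\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed  = "cafe\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed.length); // 4
console.log(decomposed.length);  // 5
console.log(precomposed === decomposed);                                   // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```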

Perhaps we need to back up and reconsider what the use-cases are, and whether minlength/maxlength attributes on form fields can really make sense in an internationalized world.

@phistuck


phistuck commented Oct 18, 2018

Due to the name (length) and the tight relationship the web already has with JavaScript, I think that maxlength should use the JavaScript definition of String.prototype.length, as awkward as it is. In addition, maxunitlength and maxcharacterlength and whatever else needs to be added for the other (reasonable) use cases. Use cases -

  • Limit the text in order to show it in a small space (maxcharacterlength).
  • Limit the text in order to put it in a database (maxunitlength).
  • Limit the text in the same way you limit JavaScript strings (maxlength).

(I also advocate for similar properties on String.prototype)

I realize the use cases I mentioned do not yet handle ligatures and similar, but those are just initial examples for fleshing this out.

@thany


thany commented Oct 19, 2018

@jfkthame

It's not just whether characters are in the BMP or not, either; what about precomposed accented letters vs decomposed sequences?

This has been addressed previously in this thread. And I personally believe maxlength should count codepoints. Otherwise it'll just get too hairy with all that zalgo text floating around. Moreover, maxlength is often used to limit the amount of data sent over the line, and to limit the number of characters written to some database or file. If we allowed combining characters to be counted as one with their base character, we could make a virtually infinitely large text that is 1 character in length. I don't think it's a good idea to support that. Because abused it will be :)
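
The abuse scenario is easy to demonstrate: under grapheme counting, a base letter plus any number of combining marks is still one "character" (Intl.Segmenter, a later language addition, is used here only to show the grapheme view):

```javascript
// One base letter followed by 1000 combining acute accents: huge in
// code units and code points, but a single grapheme cluster.
const zalgo = "e" + "\u0301".repeat(1000);

console.log(zalgo.length); // 1001 code units (and 1001 code points)

const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(zalgo)].length); // 1 grapheme cluster
```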

Can I also say that ligatures and certain combining characters like combined emoji (flags, families, etc.) are heavily dependent on the font, and therefore on the browser & OS as well. That makes it even harder to detect whether they are supposed to look like (and be counted as) a single character or not. To me that's just one more reason to drop trying to account for such things, and simply count codepoints.

@rniwa


Collaborator

rniwa commented Nov 6, 2018

See #1467 (comment) for a summary of the discussion. Making the same anecdotal arguments again won't help sway the discussion one way or another, short of some hard, cold data on web compatibility.

@domenic


Member

domenic commented Nov 6, 2018

I mean, right now the evidence seems to suggest that any of code units, code points, or grapheme clusters are web-compatible. (Or, if not, then people are doing browser sniffing and assuming the status quo, which is problematic in a different way.) So I am not sure any web-compatibility data can help us here.
