AccName trims whitespace but doesn't define which code points are whitespace #55

dd8 · 2019-07-10T16:02:13Z

This seems important to define because there's a lot of inconsistency between the whitespace definitions in different W3C specs.

HTML 5 uses two different definitions for white space:

https://www.w3.org/TR/html51/single-page.html#space-characters

HTML.1) White_space characters - defined as code points with the Unicode property "White_Space" in the Unicode PropList.txt data file
This definition is only used in the HTML spec to determine if table cells are empty. This definition includes non-ASCII spaces like non-breaking spaces (U+00A0) but excludes zero width spaces (U+200B and U+FEFF)

HTML.2) space characters
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)
This definition is used in lots of places in the HTML spec.

CSS has another two definitions for whitespace - both different to the HTML definitions:

CSS.1) White space: the 'white-space' property
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), and U+000D CARRIAGE RETURN (CR)
This is different from the HTML space characters definition - it doesn't include U+000C FORM FEED (FF)
https://www.w3.org/TR/CSS2/text.html#white-space-prop
https://drafts.csswg.org/css-text-3/#white-space-processing

CSS.2) The grammar for CSS files uses yet another definition of whitespace:
https://www.w3.org/TR/css-syntax-3/#whitespace

XML.1) XML uses another definition inherited by XML based formats like SVG and MathML
(#x20 | #x9 | #xD | #xA)+
but this looks equivalent to the CSS.1 definition
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), and U+000D CARRIAGE RETURN (CR)
https://www.w3.org/TR/xml/#NT-S
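To make the differences concrete, here is a minimal sketch that transcribes the code point lists quoted above into sets and compares them. The names `HTML_SPACE`, `CSS_WHITE_SPACE`, `XML_S` and `describe` are made up for this illustration and don't come from any spec.

```python
# Code point sets transcribed from the definitions quoted above.
# All names here are hypothetical, chosen only for this comparison.
HTML_SPACE = frozenset("\u0020\u0009\u000A\u000C\u000D")   # HTML.2 "space characters"
CSS_WHITE_SPACE = frozenset("\u0020\u0009\u000A\u000D")    # CSS.1 'white-space' processing
XML_S = frozenset("\u0020\u0009\u000D\u000A")              # XML.1 production S


def describe(chars):
    """Return the code points in `chars` as sorted U+XXXX labels."""
    return sorted(f"U+{ord(c):04X}" for c in chars)


# The only difference between HTML.2 and CSS.1 is FORM FEED.
print(describe(HTML_SPACE - CSS_WHITE_SPACE))  # ['U+000C']

# CSS.1 and XML.1 list the same four code points.
print(CSS_WHITE_SPACE == XML_S)  # True
```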

dd8 · 2019-08-14T22:20:32Z

Using the whitespace definition from 'Flat string' should work for trimming at step 2C, but it's worth being explicit in 'Flat string' about the exact code points involved, for consistency with other W3C/WHATWG specs.

dd8 · 2019-08-20T08:52:46Z

One other thing to consider - the CSS spec is explicit about not rendering certain code points (Default_Ignorable_Code_Point), which are distinct from Unicode White_Space. AccName should take this into consideration:

https://drafts.csswg.org/css-text-3/#white-space-processing
http://unicode.org/L2/L2002/02368-default-ignorable.html
http://unicode.org/faq/unsup_char.html#3

Without taking this into account you can have an element with an AccName that contains no visible glyphs:

<!-- zero width no-break space and zero-width space -->
<h1>&#xFEFF; &#x200B;</h1>

The zero width no-break space U+FEFF pops up quite often because it also functions as the byte order mark in UTF-16 (every Unicode file saved by Windows Notepad starts with this character), so it's easy to get it into a page using cat or doing something like this:

<h1><!--#include file="title.txt" --></h1>
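As a rough illustration of why a generic "trim whitespace" step doesn't catch this, the sketch below runs the heading's text content through Python's built-in `str.strip()`. Python's notion of whitespace is neither the HTML nor the CSS definition, so this only shows the general problem and not what any user agent actually does; `heading_text` is a hypothetical variable holding the text of the `<h1>` above.

```python
import unicodedata

# Text content of the <h1> above: U+FEFF, U+0020, U+200B.
heading_text = "\uFEFF \u200B"

# str.strip() with no arguments only removes characters for which
# str.isspace() is true. Both zero-width characters are category Cf
# (format), so neither end of the string is trimmed.
stripped = heading_text.strip()

for ch in heading_text:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())

print(len(stripped))  # 3: nothing was removed, yet no glyph is visible
```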

pkra · 2019-08-20T09:23:19Z

To add to this list (thanks for this!): In a recent ARIA WG telco we discussed if U+2800 (blank braille pattern) should be considered whitespace (in the context of w3c/aria#924 and the requirement for UAs to ignore role descriptions containing only whitespace).

dd8 · 2019-08-20T09:29:57Z

From discussion on ACT-R I think you probably need two different concepts:

syntactic whitespace (e.g. blank roles, spaces separating IDREFs in aria-labelledby, headers etc.) for consumption by machines (the HTML/CSS/XML specs agree on ASCII-only whitespace, but disagree on whether U+000C FORM FEED is whitespace)
user-visible code points (i.e. excluding Unicode White_Space and Default_Ignorable_Code_Point) for consumption by humans
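A minimal sketch of how the two concepts could be kept apart, under some loud assumptions: `split_tokens` and `has_visible_text` are hypothetical helpers, the token separator set is the HTML one (including FORM FEED), and the "visible" check only approximates the Unicode properties, because Python's standard library exposes neither White_Space nor Default_Ignorable_Code_Point directly.

```python
import re
import unicodedata

# Syntactic whitespace: ASCII only (HTML "space characters").
# Whether U+000C belongs here is exactly the point the specs disagree on.
ASCII_WS_RUN = re.compile(r"[\t\n\f\r ]+")


def split_tokens(value: str) -> list[str]:
    """Split an IDREF-style attribute value on runs of ASCII whitespace."""
    return [token for token in ASCII_WS_RUN.split(value) if token]


def has_visible_text(text: str) -> bool:
    """Approximate 'contains a user-visible code point'.

    Assumption: str.isspace() stands in for White_Space and general
    category Cf stands in for Default_Ignorable_Code_Point. Neither is
    an exact match for the real Unicode properties.
    """
    return any(
        not ch.isspace() and unicodedata.category(ch) != "Cf"
        for ch in text
    )


# A no-break space is not a token separator for machines...
print(split_tokens("id1\u00A0id2 id3"))    # ['id1\xa0id2', 'id3']
# ...and a heading of zero-width characters has nothing for humans to read.
print(has_visible_text("\uFEFF \u200B"))   # False
print(has_visible_text("Title"))           # True
```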

dd8 · 2019-08-20T14:45:57Z

@pkra wrote:

To add to this list (thanks for this!): In a recent ARIA WG telco we discussed if U+2800 (blank braille pattern) should be considered whitespace (in the context of w3c/aria#924 and the requirement for UAs to ignore role descriptions containing only whitespace).

My $0.02 would be to use the 'space-separated tokens' microsyntax for roles since this is used in lots of other places (e.g. the class, headers and rel attributes) where the tokens are consumed by a machine:
https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#space-separated-tokens

A counter-argument is this could make authoring harder for braille users, but the problem already exists for other HTML attributes like headers and class, and using a different whitespace syntax for role and class is probably going to make authoring harder in general.

Plus there are already 4 incompatible whitespace definitions in HTML/CSS/XML...

Edit: sorry, I misunderstood role descriptions above - ignore this comment

accdc · 2020-08-20T23:30:18Z

Hi,
Based on our discussion today in the ARIA WG call, I recommend we do the following:

As a baseline for Core AccName, add the following characters to explicitly define which are considered baseline whitespace characters.

U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)

Though this matches the HTML spec definition for whitespace, these characters exist within any text editor and don't require a specific user agent such as a browser to interpret them, so they would make a good core baseline.

I agree with the HTML spec in not including zero-width characters in this list, though. Including them would break the accessible names of human-readable strings that contain them, because the algorithm flattens each run of characters from the above list into a single space character " ".

E.g., if you had a string such as "Here we are now\n\nentertain us.", where \n\n represents two newline characters, it would be flattened into "Here we are now entertain us.": the two characters are replaced and condensed into one space character separating the human-readable words.

If you added zero-width characters to this list and applied the same logic, a string meant to be read as a single word, but with a zero-width character between each pair of letters, would have its letters separated by single spaces even though the word is displayed with no spacing. In this case the most accessible solution is to ignore all zero-width characters and replace them with nothing, so that the computed name matches the content that is visually displayed.
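The behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions and not the AccName algorithm itself: `flatten` and `flatten_naive` are hypothetical helpers, and the `ZERO_WIDTH` set is just a small, assumed sample of zero-width code points.

```python
import re

# Baseline whitespace from the list above.
ASCII_WS_RUN = re.compile("[\t\n\f\r ]+")

# Assumed sample of zero-width code points (not an exhaustive list).
ZERO_WIDTH = re.compile("[\u200B\u200C\u200D\u2060\uFEFF]")

# What happens if zero-width characters are simply added to the whitespace list.
NAIVE_WS_RUN = re.compile("[\t\n\f\r \u200B\u200C\u200D\u2060\uFEFF]+")


def flatten(text: str) -> str:
    """Drop zero-width characters, then collapse whitespace runs to one space."""
    text = ZERO_WIDTH.sub("", text)
    return ASCII_WS_RUN.sub(" ", text).strip(" ")


def flatten_naive(text: str) -> str:
    """Treat zero-width characters as ordinary whitespace (the problematic case)."""
    return NAIVE_WS_RUN.sub(" ", text).strip(" ")


print(flatten("Here we are now\n\nentertain us."))   # Here we are now entertain us.
print(flatten("w\u200Bo\u200Br\u200Bd"))             # word
print(flatten_naive("w\u200Bo\u200Br\u200Bd"))       # w o r d
```

Removing the zero-width characters before collapsing means that a zero-width character sitting between two spaces doesn't leave a doubled space behind.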

So, with the above list as a baseline, user agents can then extend it where native host semantics require additions, as defined by specific specs such as SVG, HTML, and CSS within their respective algorithms.

Does this make sense?

All the best,
Bryan

MelSumner · 2020-09-17T16:39:13Z

Adding a note: I recently worked on a template linting rule, no-whitespace-within-word, and we came up with this list after a fairly exhaustive search. I think we don't cover the Mongolian vowel separator (U+180E), but we attempted to include everything else we could find:
https://github.com/ember-template-lint/ember-template-lint/blob/master/lib/rules/no-whitespace-within-word.js

cookiecrook · 2020-09-17T22:44:12Z

I took an action this morning to check with other WebKit engineers on the whitespace implementation and preferences. WebKit has several implementations differentiating "whitespace" in the contexts where the various specs disagree. For example:

WebCore::RFC7230::isWhitespace()
WebCore::isHTMLSpace()
WebCore::isCSSSpace()

So from an implementation perspective, it doesn't really matter which the spec uses... Ideally not another one though.

There's a mild preference for HTML ASCII Whitespace unless there's a specific reason to use CSS Whitespace. Please double-check that advice during i18n review.

cookiecrook · 2020-09-17T22:48:14Z

@accdc wrote:

I recommend we do the following: As a baseline for Core AccName, add the following characters to explicitly define which are considered baseline whitespace characters. U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)

I don't think any of these characters should be listed in AccName. Instead of AccName hosting a copy of the HTML values, link across to the HTML Spec with prose indicating it's the definitive source.

jnurthen · 2021-09-02T17:18:05Z

proposed for Sep 30 Deep Dive

pkra · 2023-01-12T18:21:32Z

I think this was resolved via #165 and w3c/core-aam#128.

Please re-open if there's something missing.

dd8 mentioned this issue Jul 11, 2019

Some space characters missing from glossary/whitespace.md act-rules/act-rules.github.io#642

Open

joanmarie added this to the 1.2 milestone Jul 11, 2019

pkra mentioned this issue Sep 9, 2019

Whitespace alone should be ignored in accName algorithms w3c/html-aam#238

Closed

Jym77 mentioned this issue Jan 10, 2020

Metadata in glossary act-rules/act-rules.github.io#1085

Merged

jnurthen added this to Needs triage in Whitespace Jul 21, 2020

jnurthen moved this from Needs triage to High priority in Whitespace Jul 23, 2020

jnurthen added the deep-dive label Jul 23, 2020

jnurthen added the i18n-tracker label Sep 17, 2020

w3cbot mentioned this issue Sep 21, 2020

AccName trims whitespace but doesn't define which code points are whitespace w3c/i18n-activity#973

Closed

xfq mentioned this issue Jul 5, 2022

"Whitespace characters" underspecified w3c/core-aam#128

Closed

pkra closed this as completed Jan 12, 2023

Whitespace automation moved this from High priority to Closed Jan 12, 2023
