AccName trims whitespace but doesn't define which code points are whitespace #55

Closed
dd8 opened this issue Jul 10, 2019 · 11 comments
Labels
deep-dive i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

@dd8

dd8 commented Jul 10, 2019

This seems important to define because there's a lot of inconsistency between the whitespace definitions in different W3C specs.

HTML 5 uses two different definitions for white space:

https://www.w3.org/TR/html51/single-page.html#space-characters

HTML.1) White_Space characters - defined as code points with the Unicode property "White_Space" in the Unicode PropList.txt data file
This definition is only used in the HTML spec to determine if table cells are empty. It includes non-ASCII spaces like the no-break space (U+00A0) but excludes the zero-width characters U+200B and U+FEFF.

HTML.2) space characters
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)
This definition is used in lots of places in the HTML spec.

CSS has another two definitions for whitespace - both different to the HTML definitions:

CSS.1) White space: the 'white-space' property
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), and U+000D CARRIAGE RETURN (CR)
This is different from the HTML space characters definition - it doesn't include U+000C FORM FEED (FF)
https://www.w3.org/TR/CSS2/text.html#white-space-prop
https://drafts.csswg.org/css-text-3/#white-space-processing

CSS.2) The grammar for CSS files uses yet another definition of whitespace:
https://www.w3.org/TR/css-syntax-3/#whitespace

XML.1) XML uses another definition, inherited by XML-based formats like SVG and MathML:
(#x20 | #x9 | #xD | #xA)+
but this looks equivalent to the CSS.1 definition
U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), and U+000D CARRIAGE RETURN (CR)
https://www.w3.org/TR/xml/#NT-S
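
To make the differences concrete, here is a minimal Python sketch of the definitions listed above. The set names are illustrative only, and the White_Space set is hard-coded from the Unicode PropList.txt data as I understand it rather than read from the file, so treat it as an assumption. CSS.2 is left out because its code points are not listed in this comment.

```python
# Whitespace sets as listed in this comment; all names are illustrative.
HTML_SPACE = frozenset(" \t\n\f\r")        # HTML.2 "space characters"
CSS_WHITE_SPACE = frozenset(" \t\n\r")     # CSS.1, as described above
XML_S = frozenset("\x20\x09\x0d\x0a")      # XML production S

# HTML.1: Unicode White_Space property, hard-coded here (assumption:
# this mirrors PropList.txt; a real implementation should read the data file).
UNICODE_WHITE_SPACE = frozenset(
    [chr(c) for c in range(0x0009, 0x000E)]      # TAB, LF, VT, FF, CR
    + [chr(c) for c in range(0x2000, 0x200B)]    # EN QUAD .. HAIR SPACE
    + [chr(c) for c in (0x0020, 0x0085, 0x00A0, 0x1680,
                        0x2028, 0x2029, 0x202F, 0x205F, 0x3000)]
)


def code_points(chars):
    """Format a set of characters as sorted U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in sorted(chars)]


if __name__ == "__main__":
    # Only FORM FEED separates the HTML and CSS.1 definitions.
    print("HTML minus CSS.1:", code_points(HTML_SPACE - CSS_WHITE_SPACE))
    # XML S and CSS.1 contain the same four code points.
    print("XML S == CSS.1:", XML_S == CSS_WHITE_SPACE)
    # NBSP is White_Space; the zero-width characters are not.
    for ch in ("\u00a0", "\u200b", "\ufeff"):
        print(f"U+{ord(ch):04X} in White_Space:", ch in UNICODE_WHITE_SPACE)
```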

@dd8
Author

dd8 commented Aug 14, 2019

Using the whitespace definition from 'Flat string' should work for trimming at step 2C, but it is worth being explicit in 'Flat string' about the exact code points involved, for consistency with other W3C/WHATWG specs.

@dd8
Author

dd8 commented Aug 20, 2019

One other thing to consider - the CSS spec is explicit about not rendering certain code points (Default_Ignorable_Code_Point), which are distinct from Unicode White_Space. AccName should take this into consideration:

https://drafts.csswg.org/css-text-3/#white-space-processing
http://unicode.org/L2/L2002/02368-default-ignorable.html
http://unicode.org/faq/unsup_char.html#3

Without taking this into account you can have an element with an AccName that contains no visible glyphs:

<!-- zero width no-break space and zero-width space -->
<h1>&#xFEFF; &#x200B;</h1>

The zero width no-break space U+FEFF pops up quite often because it also functions as the byte order mark in UTF-16 and as the optional UTF-8 signature (Windows Notepad has traditionally written it at the start of the Unicode files it saves), so it's easy to get it into a page using cat or doing something like this:

<h1><!--#include file="title.txt" --></h1>
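
A rough Python sketch of the check this implies. The function name and the two sets are hypothetical: `ZERO_WIDTH` holds only the two code points from the example above, not the full Default_Ignorable_Code_Point set, and the White_Space list is hard-coded as an assumption. The `utf-8-sig` round trip simulates a file saved with a byte order mark and then read without stripping it.

```python
# Hypothetical helper: does a computed name contain any visible glyph?
# ZERO_WIDTH covers only the two code points from the example above,
# not the whole Default_Ignorable_Code_Point set.
ZERO_WIDTH = frozenset("\u200b\ufeff")

# Unicode White_Space, hard-coded (assumption: mirrors PropList.txt).
UNICODE_WHITE_SPACE = frozenset(
    [chr(c) for c in range(0x0009, 0x000E)]
    + [chr(c) for c in range(0x2000, 0x200B)]
    + [chr(c) for c in (0x0020, 0x0085, 0x00A0, 0x1680,
                        0x2028, 0x2029, 0x202F, 0x205F, 0x3000)]
)


def has_visible_text(name: str) -> bool:
    """True if at least one character is neither whitespace nor zero-width."""
    return any(
        ch not in UNICODE_WHITE_SPACE and ch not in ZERO_WIDTH
        for ch in name
    )


# Text content of the <h1> in the example: U+FEFF, a space, U+200B.
heading = "\ufeff \u200b"

# Simulate including a file that was saved with a BOM: encoding with
# "utf-8-sig" prepends the BOM, and decoding as plain "utf-8" keeps it.
included = "Title".encode("utf-8-sig").decode("utf-8")

if __name__ == "__main__":
    print(has_visible_text(heading))       # no visible glyphs
    print(included.startswith("\ufeff"))   # BOM survives the include
    print(has_visible_text(included))      # the rest is still visible
```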

@pkra
Member

pkra commented Aug 20, 2019

To add to this list (thanks for this!): In a recent ARIA WG telco we discussed if U+2800 (blank braille pattern) should be considered whitespace (in the context of w3c/aria#924 and the requirement for UAs to ignore role descriptions containing only whitespace).

@dd8
Author

dd8 commented Aug 20, 2019

From discussion on ACT-R I think you probably need two different concepts:

  • syntactic whitespace (e.g. blank roles, spaces separating IDREFs in aria-labelledby, headers etc) for consumption by machines (the HTML/CSS/XML specs agree on ASCII-only whitespace, but disagree on whether U+000C FORM FEED is whitespace)
  • user-visible code points (i.e. excluding Unicode White_Space and Default_Ignorable_Code_Point) for consumption by humans
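
For the first concept, a small Python sketch of splitting an IDREF list on syntactic whitespace. It assumes the HTML "space characters" set (including U+000C) as the separator; `split_tokens` is a hypothetical name, not something defined by any of the specs.

```python
import re

# Syntactic whitespace, assuming the HTML "space characters" set
# (SPACE, TAB, LF, FF, CR). The helper name is hypothetical.
ASCII_WS_RUN = re.compile(r"[ \t\n\f\r]+")


def split_tokens(value: str) -> list[str]:
    """Split an attribute value such as aria-labelledby into tokens."""
    return [token for token in ASCII_WS_RUN.split(value) if token]


if __name__ == "__main__":
    # Runs of ASCII whitespace separate tokens; leading/trailing runs vanish.
    print(split_tokens(" id1\tid2\n id3 "))
    # A no-break space is not syntactic whitespace, so this stays one token.
    print(split_tokens("a\u00a0b"))
    # Python's own str.split() uses a Unicode-aware definition instead,
    # which is one more example of the inconsistency discussed here.
    print("a\u00a0b".split())
```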

@dd8
Author

dd8 commented Aug 20, 2019

@pkra wrote:

To add to this list (thanks for this!): In a recent ARIA WG telco we discussed if U+2800 (blank braille pattern) should be considered whitespace (in the context of w3c/aria#924 and the requirement for UAs to ignore role descriptions containing only whitespace).

My $0.02 would be to use the 'space-separated tokens' microsyntax for roles since this is used in lots of other places (e.g. the class, headers and rel attributes) where the tokens are consumed by a machine:
https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#space-separated-tokens

A counter-argument is this could make authoring harder for braille users, but the problem already exists for other HTML attributes like headers and class, and using a different whitespace syntax for role and class is probably going to make authoring harder in general.

Plus there are already 4 incompatible whitespace definitions in HTML/CSS/XML...

Edit: sorry, I misunderstood role descriptions above - please ignore this comment

@accdc
Contributor

accdc commented Aug 20, 2020

Hi,
Based on our discussion today in the ARIA WG call, I recommend we do the following:

As a baseline for Core AccName, add the following characters to explicitly define which are considered baseline whitespace characters.

U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)

Though this matches the HTML spec definition for whitespace, these characters exist within any text editor and don't require a specific user agent such as a browser to interpret them, so they would make a good core baseline.

I agree with the HTML spec in not including zero-width characters in this list, though, because doing so would break the accessible names of human-readable strings that contain them: the algorithm flattens each run of the characters listed above into a single space character " ".

E.g., if you had a string such as "Here we are now\n\nentertain us.", where \n\n represents two newline characters, it would be flattened into the string "Here we are now entertain us.", where the two newlines are replaced by a single space character separating the human-readable words.

If you added a zero-width character to this list and applied the same logic, then a string meant to be read as a single word, but with a zero-width character between each pair of letters, would have every letter separated by a space even though the word visually appears to have no spacing. In this case the most accessible solution is to ignore all zero-width characters and replace them with nothing, so that the computed name matches the content that is visually displayed.
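
A minimal Python sketch of the two behaviours described above, not the AccName algorithm itself. Both function names are hypothetical, and the zero-width set is limited to U+200B and U+FEFF as an assumption.

```python
import re

# Baseline whitespace proposed above: SPACE, TAB, LF, FF, CR.
BASELINE_WS_RUN = re.compile(r"[ \t\n\f\r]+")
# Assumption: only these two zero-width code points are handled here.
ZERO_WIDTH = re.compile("[\u200b\ufeff]")


def flatten(text: str) -> str:
    """Drop zero-width characters, then collapse whitespace runs to one space."""
    without_zero_width = ZERO_WIDTH.sub("", text)
    return BASELINE_WS_RUN.sub(" ", without_zero_width).strip(" ")


def flatten_zero_width_as_space(text: str) -> str:
    """The rejected alternative: treat zero-width characters as whitespace."""
    return re.sub("[ \t\n\f\r\u200b\ufeff]+", " ", text).strip(" ")


if __name__ == "__main__":
    print(flatten("Here we are now\n\nentertain us."))
    word = "w\u200bo\u200br\u200bd"            # renders as "word"
    print(flatten(word))                       # matches what is displayed
    print(flatten_zero_width_as_space(word))   # letters pulled apart
```

Removing the zero-width characters before collapsing whitespace also keeps a run such as space, U+200B, space from turning into two spaces.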

So, with the above list as a baseline, user agents can add to it when native host semantics require additions within specific specs such as SVG, HTML, and CSS and their respective algorithms.

Does this make sense?

All the best,
Bryan

@jnurthen jnurthen added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Sep 17, 2020
@MelSumner
Contributor

Adding a note: I recently worked on a template linting rule, no-whitespace-within-word, and we came up with this list after a fairly exhaustive search. I think we don't cover the Mongolian vowel separator (U+180E) but attempted to include everything else we could find:
https://github.com/ember-template-lint/ember-template-lint/blob/master/lib/rules/no-whitespace-within-word.js

@cookiecrook
Contributor

I took an action this morning to check with other WebKit engineers on the whitespace implementation and preferences. WebKit has several implementations differentiating "whitespace" in the contexts where the various specs disagree. For example:

WebCore::RFC7230::isWhitespace()
WebCore::isHTMLSpace()
WebCore::isCSSSpace()

So from an implementation perspective, it doesn't really matter which the spec uses... Ideally not another one though.

There's a mild preference for HTML ASCII Whitespace unless there's a specific reason to use CSS Whitespace. Please double-check that advice during i18n review.

@cookiecrook
Contributor

@accdc wrote:

I recommend we do the following: As a baseline for Core AccName, add the following characters to explicitly define which are considered baseline whitespace characters. U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR)

I don't think any of these characters should be listed in AccName. Instead of AccName hosting a copy of the HTML values, link across to the HTML Spec with prose indicating it's the definitive source.

@jnurthen
Member

jnurthen commented Sep 2, 2021

proposed for Sep 30 Deep Dive

@pkra
Member

pkra commented Jan 12, 2023

I think this was resolved via #165 and w3c/core-aam#128.

Please re-open if there's something missing.

@pkra pkra closed this as completed Jan 12, 2023
Whitespace automation moved this from High priority to Closed Jan 12, 2023
7 participants