Email: Comment: WCAG Accessibility Guidelines - Accessing a "virtual" meeting using a teleconferencing application such as Zoom - generating accurate text content #474

Closed
jspellman opened this issue Mar 12, 2021 · 3 comments
Labels
a11y-tracker: Group bringing to attention of a11y, or tracked by the a11y Group but not needing response.
migration: guidelines: Issues that apply to guidelines.
status: assigned to subgroup: Ask subgroup for proposal.
Subgroup: XR - Captions: Directly related to XR SubGroup.

Comments

@jspellman
Contributor

Comment from Email:

Re: W3C Accessibility Guidelines (WCAG) 3.0 - W3C First Public Working Draft 21 January 2021

Issue: Accessing a "virtual" meeting using a teleconferencing application such as Zoom - generating accurate text content

Background: For purposes of this discussion, individuals who are Deaf-Blind can be grouped into four sub-groups:

(1) The Deaf-Blind individual retains sufficient residual hearing, with amplification and/or other enhancements, to access web content in the manner of a hearing person.

(2) The Deaf-Blind individual retains sufficient residual vision, with screen magnification and/or other enhancements, to access web content in the manner of a sighted person.

(3) The Deaf-Blind individual cannot access web content via speech or hearing, but can do so using braille.

(4) The Deaf-Blind individual cannot access web content using vision, hearing or braille, and thus cannot access web content at all. (There may be extremely rare cases when the use of unusual technology may circumvent this.)

This discussion refers exclusively to Deaf-Blind individuals in the third group - braille users.

Problem: At present there are two ways in which a person can access virtual meeting content, absent the ability to hear the discussion or follow the proceedings via visual means such as a provided sign language interpreter: streamed text or a subsequent text transcript. This issue deals with the creation of the text to be streamed or encapsulated in a transcript.

A "live" captioner, using technology similar to that of a court reporter, can and does include additional information to make the text more meaningful. In my experience, however, some practitioners of this art are not up to the task, and standards would be helpful.

Automated speech-to-text applications are a work in progress and need considerable improvement in order to be truly accessible. Generally, these apps focus on generating an accurate text reproduction of the words spoken. Significant progress has been made in this area. The problem lies in capturing the context:

(1) Who is speaking?

A live captioner usually provides speaker identification. Captioned text is frequently displayed on-screen in the vicinity of the speaker, making it clear who is speaking - if you can see the screen. A hearing person can usually identify the speaker by voice. Automated speech-to-text streamed text is usually presented as a continuous moving line of text at the bottom of the screen, and knowing who is speaking can be problematic. Observing meeting participants on-screen can help, but for a person relying entirely on a stream of text, it is often impossible to know the identity of the speaker, leading to erroneous assumptions and incorrect understanding of an individual's spoken comments and beliefs. AI should be able to assign speaker identity in some manner - if not the speaker's actual name, then at least a virtual identity, such as a sequence number for each person speaking.

(2) When is there a change of speaker?

Along with the issues raised above, it is very important to one's understanding of meeting content to know when there is a change of speaker. A solution would be for the speech-to-text application to at least insert a hard line break when a new speaker emerges.
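
To make these two suggestions concrete, here is a minimal sketch of how a caption formatter could add speaker labels and insert a hard line break at each change of speaker. It assumes, hypothetically, that the speech-to-text engine already emits diarized segments tagged with a speaker index; the `Segment` type, `format_captions` function, and example names are illustrative only and are not part of any existing product or API.

```python
# Minimal sketch (not from any particular captioning product): format diarized
# speech-to-text output so that braille and other text-only users can tell who
# is speaking and when the speaker changes. Assumes the recognizer already
# yields segments tagged with a speaker index (speaker diarization).

from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Segment:
    speaker_id: int   # index assigned by diarization, not necessarily a real name
    text: str         # recognized words for this segment


def format_captions(segments: Iterable[Segment],
                    names: Optional[Dict[int, str]] = None) -> str:
    """Prefix each utterance with a speaker label and start a new line
    whenever the speaker changes, so a plain text stream or refreshable
    braille display still conveys who is speaking and when that changes."""
    names = names or {}
    lines: List[str] = []
    current: Optional[int] = None
    for seg in segments:
        label = names.get(seg.speaker_id, f"Speaker {seg.speaker_id}")
        if seg.speaker_id != current:
            # Hard line break on change of speaker, with an explicit label.
            lines.append(f"{label}: {seg.text}")
            current = seg.speaker_id
        else:
            # Same speaker keeps flowing on the same line.
            lines[-1] += " " + seg.text
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Segment(1, "Can everyone see the agenda?"),
        Segment(1, "Good."),
        Segment(2, "I have a question about item three."),
        Segment(1, "Go ahead."),
    ]
    print(format_captions(demo, names={1: "Chair"}))
    # Chair: Can everyone see the agenda? Good.
    # Speaker 2: I have a question about item three.
    # Chair: Go ahead.
```

Even where the speaker's real name is unknown, a stable label such as "Speaker 2" plus a line break at each change of speaker would address both of the needs described above.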

(3) What is the meaning of the words spoken?

Consider this: The words "WOMAN WITHOUT HER MAN IS A SAVAGE" are spoken.

Do you mean "Woman, without her man, is a savage."?

Or perhaps "Woman: Without her, man is a savage."?

Or even "Woman without! Her man is a savage!!!"?

The meaning of the words spoken is conveyed via cadence, tone of voice, pitch, pauses in speech, facial expression, and body language. Capturing the true meaning of speech translated to text is a major issue.

These issues need to be addressed in order to make speech-to-text applications truly accessible.

jspellman added the "status: assigned to subgroup" and "Subgroup: XR - Captions" labels on Mar 12, 2021
@jspellman
Contributor Author

Thank you for your comment. Project members are working on your comment. You may see discussion in the comment thread and we may ask for additional information as we work on it. We will mark the official response when we are finished and close the issue.

@RealJoshue108

@jspellman We will need to discuss further in RQTF but for myself - having reviewed this issue I think the RTC Accessibility User Requirements deals well with the first two user needs that are referred to by the OP.

The twin user needs of 'Identify who is speaking and if there is a change of speaker' look like they are covered by the requirements under 'Window anchoring and pinning':

REQ 1a: Provide the ability to anchor or pin specific windows so the user can associate the sign language interpreter with the correct speaker.

and

REQ 1c: Ensure the source of any captions, transcriptions or other alternatives is clear to the user, even when second screen devices are used.

My reading of the last use case, 'What is the meaning of the words spoken?', is that it may not technically be an accessibility issue, or at least may be beyond the scope of the RTC Accessibility User Requirements document, as it really looks like a broader matter of punctuation, emphasis, and prosody (linguistic functions such as intonation, tone, stress, and rhythm).

@WilcoFiers
Contributor

Thank you for reviewing and commenting on the WCAG 3 first public working draft. We have an updated working draft and the group is starting a new workflow process. To facilitate this transition we have recorded your concern as part of w3c/wcag3#23. You can engage in the discussion around this issue there.

Please continue to review future drafts and provide feedback through Github.

If you disagree with closing this issue, please reopen it and add your reasons for reopening it to the comments.
