Validity of independent review (retitled) #1622
Comments
Hi @DavidMacDonald, sorry, which document? |
That's an important catch @DavidMacDonald, so I hope you can track down the right GitHub page to suggest an edit! I will note that a core motivation for the methodology underpinning DHS Trusted Tester is to have a repeatable process which is as unambiguous as humanly possible. They aim for much, much higher agreement on any one test than 80%! Moreover, the TT credentialing process is aimed at allowing inexperienced evaluators to achieve high inter-rater reliability. Finally, I would note that the ACT rules aim for 100% inter-rater reliability. I would love to read the article though. EDIT: Here is a cite from ResearchGate. Title is |
I am of the opinion that just deleting the last bullet (exactly the bit that David excerpted) is a reasonable fix for now. My pull request also corrects the spelling of author's last name, adds the article title, and provides a link. I don't think this article needs to be added to the References section at this time. |
I ran the article abstract by my colleague @kengdoj and thought I would share her observations:
|
I think we need a different solution, as I have spent several hours searching old archives to see if I can find the original paper. We had a data loss when the structure of Google Drive that belonged to a no-longer-active member was deleted. The data still exists (so I am told), but I can't find it. As a side note, we are encouraging W3C to find a Google Drive solution, because Drive is accessible to some people with disabilities and we expect to keep using it in the future. I have been thinking about the possibilities of addressing the problem. First, the paper with the 80% figure is dated. It would be helpful to find the date, but I remember it as being associated with the release of WCAG 2.0, so I would suspect it is in the 2008-2012 time frame. If there is more recent research with a different percentage, then I would recommend using it. I don't think the Silver Task Force would object to using updated research. Otherwise, use the 80% with the note that the research is associated with the release of WCAG 2.0. Members of the Silver Task Force (myself included) have been loath to see the Silver Problem Statements submerged in the Challenges document because they were the result of research with academic and corporate researchers. However, I would like to propose a way forward. I would be amenable to paraphrasing the Silver research results as long as there are frequent references to the Silver Problem Statements. The Silver research was broader in scope than the Challenges, because the Silver research addressed a wider population than large organizations. I still do not want to see the Challenges document used to justify changes to the WCAG3 Requirements or to WCAG3 itself. The Challenges document is the opinion of a relatively small (but influential) group of people and should not be considered of greater importance than the research. A paragraph in the Introduction could explain that. I am open to further discussion and ideas of a way forward. 
I would also like to hear from @slauriat on this issue. I have flagged it as a topic for a Silver leadership discussion. |
@jspellman I am pretty sure I linked to the article in question in my first reply in this issue thread. Here is that URL: https://www.researchgate.net/publication/235339930_Is_accessibility_conformance_an_elusive_property_A_study_of_validity_and_reliability_of_WCAG_20 March 2012 is the date. I tried to get the article text directly via ResearchGate but they have not approved my request (even after I ticked the boxes for reconsideration). I choose to believe that it is an automaton making that choice! @sajkaj - I think you may have renamed this issue with maybe what was supposed to be a comment. I cannot quite tell what is going on. |
From the referenced URL:
Date:
March 2012
Abstract:
The Web Content Accessibility Guidelines (WCAG) 2.0 separate testing into
both “Machine” and “Human” audits; and further classify “Human Testability”
into “Reliably Human Testable” and “Not Reliably Testable”;* it is human
testability that is the focus of this paper*. We wanted to investigate the
likelihood that “at least 80% of knowledgeable human evaluators would agree
on the conclusion” of an accessibility audit, and therefore understand the
percentage of success criteria that could be described as reliably human
testable, and those that could not.
In this case, we recruited twenty-five experienced evaluators to audit four
pages for WCAG 2.0 conformance. These pages were chosen to differ in
layout, complexity, and accessibility support, thereby creating a small but
variable sample. *We found that an 80% agreement between experienced
evaluators almost never occurred and that the average agreement was at the
70--75% mark, while the error rate was around 29%.* Further, trained—but
novice—evaluators performing the same audits exhibited the same agreement
to that of our more experienced ones, but a reduction on validity of 6--13%;
the validity that an untrained user would attain can only be a
conjecture. Expertise appears to improve (by 19%) the ability to avoid
false positives.
Finally, pooling the results of two independent experienced evaluators
would be the best option, capturing at most 76% of the true problems and
producing only 24% of false positives. Any other independent combination of
audits would achieve worse results. This means that an 80% target for
agreement, when audits are conducted without communication between
evaluators, is not attainable, even with experienced evaluators, when
working on pages similar to the ones used in this experiment; that the
error rate even for experienced evaluators is relatively high and further,
that untrained accessibility auditors be they developers or quality testers
from other domains, would do much worse than this.
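To make the two measures in the abstract concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the abstract does not give the authors' exact formulas, so `majority_agreement`, `reliably_testable`, and `pooled_validity` are hypothetical helpers, and the assumed definitions are (a) agreement as the share of evaluators whose verdict matches the most common verdict, checked against the 80% target, and (b) pooling as the union of the problems reported by independent evaluators, scored against a known set of true problems.

```python
from collections import Counter


def majority_agreement(verdicts):
    """Share of evaluators whose verdict matches the most common verdict.

    Assumed definition; `verdicts` is a non-empty list such as
    ["pass", "fail", ...], one entry per evaluator.
    """
    counts = Counter(verdicts)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(verdicts)


def reliably_testable(verdicts, threshold=0.8):
    """True if agreement reaches the threshold (the 80% target by default)."""
    return majority_agreement(verdicts) >= threshold


def pooled_validity(reports, true_problems):
    """Pool independent reports and score them against the true problems.

    `reports` is a list of sets of reported problems, one set per evaluator.
    `true_problems` is a non-empty set (assumed known for illustration).
    Returns (share of true problems captured,
             share of pooled reports that are false positives).
    """
    pooled = set().union(*reports)
    captured = len(pooled & true_problems) / len(true_problems)
    false_share = len(pooled - true_problems) / len(pooled) if pooled else 0.0
    return captured, false_share


# Hypothetical data: ten evaluators judging one success criterion on one page.
split_verdicts = ["pass"] * 7 + ["fail"] * 3
print(majority_agreement(split_verdicts), reliably_testable(split_verdicts))

# Hypothetical data: two evaluators' reports pooled against four true problems.
print(pooled_validity([{"a", "b", "x"}, {"b", "c"}], {"a", "b", "c", "d"}))
```

Under these assumed definitions, a 7-to-3 split falls short of the 80% target even though a clear majority agrees, which is the kind of result the abstract describes.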
While the data is 10 years old, I believe the main conclusions remain
relevant, as they were evaluating "...human testability that is the focus of
this paper..." and the ability to elicit reliable results rather than which
version of WCAG they were using. If newer research is available I'm all for
reviewing it.
JF
…On Mon, Mar 8, 2021 at 7:48 AM Bruce Bailey ***@***.***> wrote:
@jspellman <https://github.com/jspellman> I am pretty sure I linked to
the article in question in my first reply in this issue thread. Here is the
URL:
https://www.researchgate.net/publication/235339930_Is_accessibility_conformance_an_elusive_property_A_study_of_validity_and_reliability_of_WCAG_20
March 2012 is the date.
@sajkaj <https://github.com/sajkaj> - I think you may have renamed this
issue with maybe what was supposed to be a comment. I cannot quite tell
what is going on.
--
*John Foliot* | Principal Accessibility Specialist
"I made this so long because I did not have time to make it shorter." -
Pascal "links go places, buttons do things"
|
@johnfoliot et al., the problematic bullet @DavidMacDonald cites in this issue is a direct excerpt from that abstract you pasted in: "This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators." See Conformance Challenges -- Themes from Research. From the abstract, it does seem to be true that these researchers came to that conclusion. It is not, however, a factually correct statement. It does not IMHO belong in the Challenges document. Moreover, the formatting ascribes more authority than the bullet warrants. Maybe it is just me, but before digging out that citation, I didn't realize the bullet was a quotation. After reading the abstract, I would argue that characterizing that bullet as a |
@bruce-usab I'm not really sure what your
point is: are you suggesting that the conclusion is not a fact?
I disagree - it is a fact. It may not be a conclusion that everyone agrees
to, but the conclusion as published is a fact: it's their conclusion.
More importantly, it is research that is supporting concerns raised by
multiple parties (including myself) around the need for non-subjective
measurements for conformance. Reliance on individual subjective
determinations will certainly introduce the types of concerns addressed by
this research paper, and with the current trajectory, likely introduce more
concern, not lessen it.
And while it may only be one bullet point (data point) it is none-the-less
a significant one, and one (again) backed by *some* research by academics.
…On Mon, Mar 8, 2021 at 11:17 AM Bruce Bailey ***@***.***> wrote:
@johnfoliot <https://github.com/johnfoliot> et al., the problematic
bullet @DavidMacDonald <https://github.com/DavidMacDonald> cites in this
issue is a direct excerpt from that abstract you pasted in: This means
that an 80% target for agreement, when audits are conducted without
communication between evaluators, is not attainable, even with experienced
evaluators. See Conformance Challenges -- Themes from Research
<https://www.w3.org/TR/2020/WD-accessibility-conformance-challenges-20200619/#theme>
.
From the abstract, it does seem to be true that these researchers came to
that conclusion.
It is not, however, a factually correct statement. It does not IMHO belong
in the Challenges document. Moreover, the formatting ascribes more
authority than the sentence warrants. Maybe it is just me, but before
digging out that citation, I didn't realize the bullet was a quotation. Now
I would argue that characterizing that bullet as a *theme* from research
really overstates what is really just one data point.
|
@johnfoliot GitHub put my addy in plain text so I edited your comment. (Not that my email is hard to find, but who needs the extra spam?) FWIW, I don't seem to have your current email (or I would have asked you to edit your comment). Also, I find it surprising that I could edit your comment! Correct, I am saying that the conclusion is not fact. It may be a fact that the paper authors made such a conclusion, but I regard that as irrelevant to the premise that AG WG should include this particular bullet in the Conformance Issues document. But without reading the article, I am not confident that the authors reach this conclusion. The phrasing used in the abstract is not entirely unambiguous. FWIW, I agree that inter-rater reliability is not what we want it to be. But I strongly disagree that an 80% target for inter-rater reliability is not attainable. The assertion that |
I'm sorry Bruce, but the authors of that paper came to a conclusion: that
is an undisputed fact.
You may not agree with the conclusion, you may believe that it is not
relevant to the discussion, but it is factual that they came to a
conclusion, and their conclusion is one that supports concerns and comments
that others have articulated (including myself).
I noted with interest the comments from Gregg Vanderheiden
<https://lists.w3.org/Archives/Public/public-silver/2021Mar/0002.html>,
former chair of the WCAG 2.0 Working Group, who wrote:
"I know the pain that can lead one to do this. We had the same problem in
WCAG 2.0. We actually spent enormous time on cognitive language learning
disabilities for example (more than on any other single disability) trying
to find provisions that would address their needs and yet would be
objective and meet the criteria necessary for a testable provision. We
called in Nancy Ward, Clayton Lewis and a whole host of other people to
talk with us and propose provisions that might work. John Slaton and I
launched two, many-months-long efforts on both the cognitive language and
learning disability area and the use of plain language in the guidelines.
It was the most frustrating thing I have ever done in my life. Seeing the
needs, but being unable to identify or find ways to qualify as strategies
from all the materials we read, and people we talk to, was the most
difficult and frustrating part of the work on WCAG.
In the end the group will need to either rename the document and have it be
a really wonderful guidance document with broad scope for including
guidance provisions, or return to the WCAG 2x like criteria for selecting
provisions that is needed in a standard that could be adopted in
regulation. This latter choice would, of course, put you back in the same
bind as the existing WCAG 2 thread. It is aggravating,
bang-head-against-the-wall frustrating, etc. but that is the situation."
Gregg also notes:
"If the provisions in a standard are not objective, the very first time it
shows up in court, the defendants will cite, accurately, that the provision
is not objective but rather is subjective. *And as a result, it is not
enforceable.*"
One of the conclusions of that paper (as I understand it) is that when it
comes to subjective evaluations, they were unable to demonstrate that even
experienced evaluators could agree on some of the subjective determinations
we already have in WCAG 2.x. Gregg continues:
In order for something to be a standard, particularly a standard that is
going to be used in regulation of any type,
- all of the provisions that are normative (that is, the only ones you
have to pay attention to in order to conform) must be objectively testable.
That is, one must either pass them or fail them.
- And *you must have high inter-rater-reliability (that is, if you have
a number of people who are aware of the technologies in use, a very high
percentage of them would all come up with the same answer as to whether
something passes or fails.)*
*[JF: this is where the cited research paper is relevant, as the research
concluded that this "high-level 'inter-rater-reliability'" could not be
proven today] *
- If a number is used, (e.g. the volume must be above 3 dB), then the
scale, the number that is the threshold, and the tool that can be used to
measure the threshold value (number) that causes it to pass or fail, needs
to be cited and the instrument creating the number needs to be objective.
- For example “The background noise must be 20 dB down from the
foreground speech" would qualify. But “The evaluator's opinion is that the
score is a three or better" does not qualify since it relies on the opinion
of an individual rather than on a measurement. Any measurements that rely
on opinion are called qualitative rather than quantitative and are
subjective rather than objective measures (unless one is measuring opinions
rather than conformance).
This constrains the types of provisions or requirements that you can have
in a standard. Often leaving out guidance you would like to include but
cannot reduce to an objectively testable requirement."
I eagerly anticipate the WG's response to Gregg's comments.
JF
…On Mon, Mar 8, 2021 at 12:28 PM Bruce Bailey ***@***.***> wrote:
@johnfoliot <https://github.com/johnfoliot> GitHub put my email in plain
text so I edited your comment. (Not that my email is hard to find, but who
needs the extra spam?) FWIW, I don't seem to have your current email.
Correct, I am saying that the conclusion is not fact.
It may be a fact that the paper authors made such a conclusion, but I
regard that as irrelevant to the premise that AG WG should include this
particular bullet in the Conformance Issues document. But without reading
the article, I am not confident that the authors reach this conclusion. The
phrasing used in the abstract is not entirely unambiguous.
|
@johnfoliot, sticking to the very narrow issue raised by @DavidMacDonald at the start of this thread, which I now find that @sajkaj has (accidentally) over-written, we can recognize the real concern raised by these researchers without this particular quotation from the abstract. Further, I would argue that including the quote (because it is so easily debunked) is counter-productive to the important lesson that inter-rater reliability needs to be improved. |
My apologies to everyone, and especially to @DavidMacDonald, for over-writing |
The section of the Challenges document being discussed in this issue is a straight |
Reopening because the pull request did not actually address the issue raised by @DavidMacDonald. In my opinion, this is something the AG WG would appreciate having called to their attention. |
Hi @sajkaj, as the Silver problem statements are not an official (draft) note, there is a higher bar. This issue should remain open until the original point is addressed, or it comes to the group to agree not to address it. |
Agreed. My bad--yet again in this issue. I meant to hit "Comment," not "comment and close." But, I was in too much of a hurry to post before being late to the Silver call. I agree the underlying question remains unresolved, even if the citation is now available. |
The document cites a Brajnik et al. (2012) study which is not provided and is not in the reference list. Based on one bullet point it cites from that study, it states:
I think this is unnecessarily disparaging to WCAG. I think this sentence should be removed. I recently evaluated an international site. I was in Canada and another professional in Paris conducted an evaluation of the same pages without any communication. We had a strong correlation. Much higher than 80%.