Validity of independent review (retitled) #1622

DavidMacDonald · 2021-02-06T02:08:46Z

The document cites a Brajnick et al., 2012 study that is not provided and is not in the reference list. Based on one bullet point it cites from that study, it states:

"This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators." link to quote in Challenges doc

I think this is unnecessarily disparaging to WCAG. I think this sentence should be removed. I recently evaluated an international site. I was in Canada and another professional in Paris conducted an evaluation of the same pages without any communication. We had a strong correlation. Much higher than 80%.

alastc · 2021-02-06T17:12:55Z

Hi @DavidMacDonald, sorry, which document?

bruce-usab · 2021-02-08T15:47:41Z

That's an important catch @DavidMacDonald, so I hope you can track down the right GitHub page to be suggesting an edit!

The "not attainable" conclusion (as represented in the Challenges doc) is just factually incorrect (1) because I am pretty sure it misrepresents (mathematically) what we as WG members understand as 80% reliability for inter-rater agreement, and (2) because it is disproven (as it is currently stated) by a single counter-example (as you provided).

I will note that a core motivation for the methodology underpinning DHS Trusted Tester is to have a repeatable process which is as unambiguous as humanly possible. They aim for much, much higher agreement on any one test than 80%! Moreover, the TT credentialing process is aimed at allowing inexperienced evaluators to achieve high inter-rater reliability.

Finally, I would note that the ACT rules aim for 100% inter-rater reliability.

I would love to read the article though.

EDIT: Here is a cite from ResearchGate. The title is "Is accessibility conformance an elusive property? A study of validity and reliability of WCAG 2.0" and the author's last name is Brajnik (not Brajnick, so no c).

bruce-usab · 2021-02-08T18:30:32Z

I am of the opinion that just deleting the last bullet (exactly the bit that David excerpted) is a reasonable fix for now.

My pull request also corrects the spelling of author's last name, adds the article title, and provides a link. I don't think this article needs to be added to the References section at this time.

bruce-usab · 2021-02-08T18:43:39Z

I ran the article abstract by my colleague @kengdoj and thought I would share her observations:

The research that yielded this rating evaluated testers using different methodologies, which we know is the source of varying test results. Not surprising but it further supports the work of the ACT and ICT Baseline.
It would be interesting if the researchers had administered their test pages to testers all following the same methodology. I bet these scores would be much higher. I would hope that TTs would be around 90%, but 80%+ should be easy assuming the TTs have been consistently testing.
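
For anyone following the percentages in this thread, here is a minimal sketch of one possible reading of "at least 80% of evaluators agree": for each success criterion, take the share of evaluators who gave the most common verdict. The function names, the sample verdicts, and the criterion numbers used as keys are all hypothetical and are not data from the Brajnik study, which may define agreement differently.

```python
from collections import Counter


def agreement_rate(verdicts):
    """Return the share of evaluators who gave the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)


def meets_target(verdicts, target=0.8):
    """True when the most common verdict reaches the agreement target."""
    return agreement_rate(verdicts) >= target


# Hypothetical verdicts from five evaluators auditing the same page.
audits = {
    "1.1.1": ["fail", "fail", "fail", "fail", "pass"],  # 4 of 5 agree
    "2.4.4": ["fail", "pass", "fail", "pass", "fail"],  # 3 of 5 agree
}

for criterion, verdicts in audits.items():
    print(criterion, agreement_rate(verdicts), meets_target(verdicts))
```

Under this reading, agreement is measured per criterion across many evaluators, so a single pair of evaluators who agree closely is a different measurement from a 25-person study average.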

jspellman · 2021-03-07T16:35:11Z

I think we need a different solution, as I have spent several hours searching old archives to see if I can find the original paper. We had a data loss when the structure of Google Drive that belonged to a no-longer-active member was deleted. The data still exists (so I am told), but I can't find it. As a side note, we are encouraging W3C to find a Google Drive solution, because Drive is accessible to some people with disabilities and we expect to keep using it in the future.

I have been thinking about the possibilities of addressing the problem. First, the paper with the 80% figure is dated. It would be helpful to find the date, but I remember it as being associated with the release of WCAG 2.0, so I would suspect it is in the 2008-2012 time frame. If there is more recent research with a different percentage, then I would recommend using it. I don't think the Silver Task Force would object to using updated research. Otherwise, use the 80% with the note that the research is associated with the release of WCAG 2.0.

Members of the Silver Task Force (myself included) have been loath to see the Silver Problem Statements submerged in the Challenges document because they were the result of research with academic and corporate researchers. However, I would like to propose a way forward. I would be amenable to paraphrasing the Silver research results as long as there are frequent references to the Silver Problem Statements.

The Silver research was broader in scope than the Challenges, because the Silver research addressed a wider population than large organizations. I still do not want to see the Challenges document used to justify changes to the WCAG3 Requirements or to WCAG3 itself. The Challenges document is the opinion of a relatively small (but influential) group of people and should not be considered of greater importance than the research.

A paragraph in the Introduction could explain that. I am open to further discussion and ideas of a way forward. I would also like to hear from @slauriat on this issue. I have flagged it as a topic for a Silver leadership discussion.

bruce-usab · 2021-03-08T12:48:15Z

@jspellman I am pretty sure I linked to the article in question in my first reply in this issue thread. Here is that URL:
https://www.researchgate.net/publication/235339930_Is_accessibility_conformance_an_elusive_property_A_study_of_validity_and_reliability_of_WCAG_20

March 2012 is the date. I tried to get the article text directly via ResearchGate but they have not approved my request (even after I ticked the boxes for reconsideration). I choose to believe that it is an automaton making that choice!

@sajkaj - I think you may have renamed this issue with maybe what was supposed to be a comment. I cannot quite tell what is going on.

johnfoliot · 2021-03-08T14:26:59Z

From the referenced URL:

Date: March 2012

Abstract: The Web Content Accessibility Guidelines (WCAG) 2.0 separate testing into both “Machine” and “Human” audits; and further classify “Human Testability” into “Reliably Human Testable” and “Not Reliably Testable”; *it is human testability that is the focus of this paper*. We wanted to investigate the likelihood that “at least 80% of knowledgeable human evaluators would agree on the conclusion” of an accessibility audit, and therefore understand the percentage of success criteria that could be described as reliably human testable, and those that could not. In this case, we recruited twenty-five experienced evaluators to audit four pages for WCAG 2.0 conformance. These pages were chosen to differ in layout, complexity, and accessibility support, thereby creating a small but variable sample. *We found that an 80% agreement between experienced evaluators almost never occurred and that the average agreement was at the 70-75% mark, while the error rate was around 29%.* Further, trained, but novice, evaluators performing the same audits exhibited the same agreement to that of our more experienced ones, but a reduction on validity of 6-13%; the validity that an untrained user would attain can only be a conjecture. Expertise appears to improve (by 19%) the ability to avoid false positives. Finally, pooling the results of two independent experienced evaluators would be the best option, capturing at most 76% of the true problems and producing only 24% of false positives. Any other independent combination of audits would achieve worse results.

This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators, when working on pages similar to the ones used in this experiment; that the error rate even for experienced evaluators is relatively high and further, that untrained accessibility auditors be they developers or quality testers from other domains, would do much worse than this.

While the data is 10 years old, I believe the main conclusions remain relevant, as they were evaluating "...human testability that is the focus of this paper..." and the ability to elicit reliable results rather than which version of WCAG they were using. If newer research is available I'm all for reviewing it.

JF

-- *John Foliot* | Principal Accessibility Specialist "I made this so long because I did not have time to make it shorter." - Pascal "links go places, buttons do things"

bruce-usab · 2021-03-08T16:17:08Z

@johnfoliot et al., the problematic bullet @DavidMacDonald cites in this issue is a direct excerpt from that abstract you pasted in: This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators. See Conformance Challenges — Themes from Research.

From the abstract, it does seem to be true that these researchers came to that conclusion.

It is not, however, a factually correct statement. It does not IMHO belong in the Challenges document. Moreover, the formatting ascribes more authority than the bullet warrants. Maybe it is just me, but before digging out that citation, I didn't realize the bullet was a quotation. After reading the abstract, I would argue that characterizing that bullet as a theme from research overstates what is really just one data point. It is an assertion from one study, where the authors' own abstract provides evidence of a flawed methodology.

johnfoliot · 2021-03-08T16:31:31Z

@bruce-usab I'm not really sure what your point is: are you suggesting that the conclusion is not a fact? I disagree - it is a fact. It may not be a conclusion that everyone agrees with, but the conclusion as published is a fact: it's their conclusion. More importantly, it is research that is supporting concerns raised by multiple parties (including myself) around the need for non-subjective measurements for conformance. Reliance on individual subjective determinations will certainly introduce the types of concerns addressed by this research paper, and with the current trajectory, likely introduce more concern, not lessen it. And while it may only be one bullet point (data point) it is nonetheless a significant one, and one (again) backed by *some* research by academics.

bruce-usab · 2021-03-08T17:28:30Z

@johnfoliot GitHub put my addy in plain text so I edited your comment. (Not that my email is hard to find, but who needs the extra spam?) FWIW, I don't seem to have your current email (or I would have asked you to edit your comment). Also, I find it surprising that I could edit your comment!

Correct, I am saying that the conclusion is not fact.

It may be a fact that the paper authors made such a conclusion, but I regard that as irrelevant to the premise that AG WG should include this particular bullet in the Conformance Issues document. But without reading the article, I am not confident that the authors reach this conclusion. The phrasing used in the abstract is not entirely unambiguous.

FWIW, I agree that inter-rater reliability is not what we want it to be. But I strongly disagree that an 80% target for inter-rater reliability is not attainable. The assertion that an 80% target for agreement is not attainable is simply not credible. Assertions which are not credible should not be repeated verbatim in an AG WG document. (Or at least not without lots of context and/or caveats.)

johnfoliot · 2021-03-08T18:11:01Z

I'm sorry Bruce, but the authors of that paper came to a conclusion: that is an undisputed fact. You may not agree with the conclusion, you may believe that it is not relevant to the discussion, but it is factual that they came to a conclusion, and their conclusion is one that supports concerns and comments that others have articulated (including myself).

I noted with interest the comments from Gregg Vanderheiden <https://lists.w3.org/Archives/Public/public-silver/2021Mar/0002.html>, former chair of the WCAG 2.0 Working Group, who wrote:

"I know the pain that can lead one to do this. We had the same problem in WCAG 2.0. We actually spent enormous time on cognitive language learning disabilities for example (more than on any other single disability) trying to find provisions that would address their needs and yet would be objective and meet the criteria necessary for a testable provision. We called in Nancy Ward, Clayton Lewis and a whole host of other people to talk with us and propose provisions that might work. John Slaton and I launched two, many-months-long efforts on both the cognitive language and learning disability area and the use of plain language in the guidelines. It was the most frustrating thing I have ever done in my life. Seeing the needs, but being unable to identify or find ways to qualify as strategies from all the materials we read, and people we talk to, was the most difficult and frustrating part of the work on WCAG.

In the end the group will need to either rename the document and have it be a really wonderful guidance document with broad scope for including guidance provisions, or return to the WCAG 2x like criteria for selecting provisions that is needed in a standard that could be adopted in regulation. This latter choice would, of course, put you back in the same bind as the existing WCAG 2 thread. It is aggravating, bang-head-against-the-wall frustrating, etc. but that is the situation."

Gregg also notes:

"If the provisions in a standard are not objective, the very first time it shows up in court, the defendants will cite, accurately, that the provision is not objective but rather is subjective. *And as a result, it is not enforceable.*"

One of the conclusions of that paper (as I understand it) is that when it comes to subjective evaluations, they were unable to demonstrate that even experienced evaluators could agree on some of the subjective determinations we already have in WCAG 2.x.

Gregg continues:

"In order for something to be a standard, particularly a standard that is going to be used in regulation of any type,

- all of the provisions that are normative (that is, the only ones you have to pay attention to in order to conform) must be objectively testable. That is, one must either pass them or fail them.
- And *you must have high inter-rater-reliability (that is, if you have a number of people who are aware of the technologies in use, a very high percentage of them would all come up with the same answer as to whether something passes or fails.)* *[JF: this is where the cited research paper is relevant, as the research concluded that this "high-level 'inter-rater-reliability'" could not be proven today]*
- If a number is used, (e.g. the volume must be above 3 dB), then the scale, the number that is the threshold, and the tool that can be used to measure the threshold value (number) that causes it to pass or fail, needs to be cited and the instrument creating the number needs to be objective.
- For example “The background noise must be 20 dB down from the foreground speech" would qualify. But “The evaluator's opinion is that the score is a three or better" does not qualify since it relies on the opinion of an individual rather than on a measurement. Any measurements that rely on opinion are called qualitative rather than quantitative and are subjective rather than objective measures (unless one is measuring opinions rather than conformance).

This constrains the types of provisions or requirements that you can have in a standard. Often leaving out guidance you would like to include but cannot reduce to an objectively testable requirement."

I eagerly anticipate the WG's response to Gregg's comments.

JF

bruce-usab · 2021-03-08T23:00:32Z

@johnfoliot, sticking to the very narrow issue raised by @DavidMacDonald at the start of this thread, which I now find that @sajkaj has (accidentally) over-written, we can recognize the real concern raised by these researchers without this particular quotation from the abstract. Further, I would argue that including the quote (because it is so easily debunked) is counter-productive to the important lesson that inter-rater reliability needs to be improved.

sajkaj · 2021-03-09T14:13:10Z

My apologies to everyone, and especially to @DavidMacDonald, for over-writing
the head of this issue. David's original text has now been restored. It had
been my intent to comment, but I misused the hub command. My apologies.

sajkaj · 2021-03-09T14:25:44Z

The section of the Challenges document being discussed in this issue is a straight
copy and paste from Silver Problem
Statements.
As @jspellman notes above, there was a data loss event that resulted in a loss
of all the hyperlinks in
the original, as well as in the copy submitted into the Challenges doc.
So, I am gratefully accepting the citation on behalf of the Challenges doc,
and I'm sure a PR against the original would also be welcome. If you can help
with additional citations missing from Challenges (and from the upstream doc),
I'm confident we'd all appreciate having those.
I am, however, leaving the conclusion drawn from Silver Research for further discussion
in Silver and AGWG. I don't feel it's appropriate for me, as document Editor,
to make that substantive change on my own.
Meanwhile, please note the current Editor's Draft for Challenges has moved
Section 5 to
[Appendix C in the latest Challenges draft](https://raw.githack.com/w3c/wcag/conformance-challenges-5aside/conformance-challenges/index.html#silver-research-problem-statements).
Please now create a PR against that draft.

bruce-usab · 2021-03-09T14:36:29Z

Reopening because the pull request did not actually address the issue raised by @DavidMacDonald. In my opinion, this is something the AG WG would appreciate having called to their attention.

alastc · 2021-03-09T14:53:06Z

Hi @sajkaj, as the Silver problem statements are not an official (draft) note, there is a higher bar. This issue should remain open until the original point is addressed, or it comes to the group to agree not to address it.

sajkaj · 2021-03-09T18:11:01Z

Agreed. My bad--yet again in this issue. I meant to hit "Comment," not "comment and close." But, I was in too much of a hurry to post before being late to the Silver call. I agree the underlying question remains unresolved, even if the citation is now available.

alastc added the Challenges with Conformance Issues relating to the document at https://w3c.github.io/wcag/conformance-challenges/ label Feb 7, 2021

bruce-usab mentioned this issue Feb 8, 2021

Correct a citation under Themes from Research #1629

Closed

bruce-usab linked a pull request Feb 8, 2021 that will close this issue

Correct a citation under Themes from Research #1629

Closed

sajkaj self-assigned this Mar 3, 2021

sajkaj closed this as completed Mar 9, 2021

alastc reopened this Mar 9, 2021

sajkaj closed this as completed Mar 9, 2021

bruce-usab reopened this Mar 9, 2021
