
Define the role of automation #909

Open
4 tasks
jugglinmike opened this issue Mar 14, 2023 · 6 comments

jugglinmike commented Mar 14, 2023

Since the inception of ARIA-AT, its participants have anticipated that some aspects of testing ATs would be automated. That general expectation has been sufficient to drive development of the requisite automation standards and tools, but it is too vague to inform the process of delivering AT interoperability reports. Indeed, over the years, participants have expressed a variety of expectations regarding the role of automation.

During today's meeting, we started discussing the extent to which we intend to integrate automation into the Working Mode.

We started off by recognizing that the term test result (as we have been using it) actually describes two different pieces of information:

  1. a description of the way the AT-under-test has responded to the test script (e.g. the text that a screen reader verbalizes), and
  2. an interpretation of the response (in other words: a judgement as to whether the response recorded for the test satisfies that test's assertions)

In other words: a "test result" comprises an AT response and an assertion analysis.
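
For illustration only, here is a minimal sketch of that split. Every name and field below is a hypothetical assumption of mine, not the ARIA-AT App's actual schema:

```ts
// Hypothetical sketch of the two pieces of a "test result".
// All names here are illustrative assumptions, not the app's schema.

// 1. What the AT actually did in response to the test script.
interface AtResponse {
  testId: string;
  command: string;   // the command/key sequence exercised by the test
  output: string;    // e.g. the text a screen reader verbalizes
}

// 2. A judgement about whether that response satisfies an assertion.
interface AssertionVerdict {
  assertionId: string;
  verdict: "pass" | "fail";
}

// A "test result" pairs the response with the analysis of it.
interface TestResult {
  response: AtResponse;
  verdicts: AssertionVerdict[];
}
```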

We still had (at least) two open questions when the meeting concluded:

  • When do we want to allow/disallow automation in the collection of AT responses?
  • Do we expect to automate assertion analysis?

This issue is intended to augment the meeting minutes and to host public asynchronous conversation. Folks in attendance today informally planned to hold a one-off meeting to continue the discussion synchronously (I'll post an invitation here in just a moment).

On 2023-03-27, we identified the following tasks:

  • extend the glossary with a definition for the term "AT response" (assignee: @mcking65)
  • extend the glossary with a definition for the term "assertion verdict" (assignee: @mcking65)
  • update ARIA-AT App to use the term "AT response" rather than "output" (assignee: @jugglinmike)
  • determine how automation will factor into the working mode, select a place to document that information, and write the documentation (assignee: @jugglinmike)

An observation from @mcking65 during that meeting: "Just like the working mode doesn't say how to run the [ARIA-AT App] website, it won't say how to run the automation."

@jugglinmike
Contributor Author

It looks like I'll need more than "just a moment" to create that event. I hope to post it here within 24 hours.

@jugglinmike
Contributor Author

@jugglinmike
Contributor Author

At the end of today's meeting, @mcking65 expressed a desire for AT responses to be collected by a human "at least once." I'd like to nail that down a little further (here in this discussion thread if possible, or during next week's one-off meeting if not).

First: does a manual collection for one AT satisfy the requirement for ALL ATs? In other words, do we want to require that AT responses are manually collected "at least once [for each AT]" or "at least once [for all ATs, collectively]"? (I have a similar question for web browsers.)

Second: should test modification refresh the need for manual validation?

Maybe another way to think about this is to consider which component (or group of components) the manual collection is intended to validate: the tests, the automation system, the browsers, or the ATs.

@jugglinmike
Contributor Author

The meeting minutes are available on w3.org.

The full IRC log of that discussion

<jugglinmike> MEETING: ARIA-AT Community Group Automation Workstream
<jugglinmike> present+ jugglinmike
<jugglinmike> scribe+ jugglinmike
<jugglinmike> Topic: Considering the direction of automation
<jugglinmike> present+ Matt_King
<jugglinmike> present+: Michael_Fairchild
<jugglinmike> present+ Michael_Fairchild
<mzgoddard> +present
<jugglinmike> jugglinmike: When do we want to allow/disallow automation in the collection of AT responses?
<jugglinmike> Matt_King: I'm going to make some proposals to change the glossary to separate "test results" into "AT responses" and "response analysis"
<jugglinmike> Matt_King: The "test case" (the thing you're testing) and the assertions are designed to be AT-agnostic (for the set of ATs which are in-scope for the command)
<jugglinmike> Matt_King: Then you have commands or events which generate responses. Right now, we use the word "output", but we might want to use the word "response" instead.
<jugglinmike> Matt_King: I think those responses should be called "command responses" because they are tied to a command. That's one proposal
<jugglinmike> Matt_King: Command responses aren't part of the test. Running the test generates command responses
<jugglinmike> Matt_King: Should the analysis be considered part of "running the test"?
<jugglinmike> Matt_King: Another proposal I have for the glossary is to label the analysis of command responses "verdicts"
<jugglinmike> present+ mzgoddard
<jugglinmike> mzgoddard: Does that include unexpected responses?
<jugglinmike> Matt_King: No
<jugglinmike> Matt_King: Unexpected behaviors are an attribute of the response
<jugglinmike> Matt_King: I think there are two aspects of unexpected behavior: a token which documents if it's present and what kind it is, and a textual description
<jugglinmike> Matt_King: When it comes to assertion verdicts, I think the realm of automation is pretty limited
<jugglinmike> Matt_King: Having automation simply detect congruence with previously-interpreted responses is reasonable
<jugglinmike> Michael_Fairchild: in the future, we might be able to train an AI model to perform analysis, but that seems far off right now
<jugglinmike> github: #909
<jugglinmike> github #909
<jugglinmike> jugglinmike: Verdicts need to be tied to the content of the tests. A change to the test would mean the system should not "trust" or "reuse" verdicts previously reported by a human
<jugglinmike> Matt_King: I was anticipating that almost everything here is versioned
<jugglinmike> Matt_King: If we get feedback from an AT vendor that they want some assertion to change, that drives a change to the test plan
<jugglinmike> Matt_King: If a screen reader command changes, that also drives a change to the test plan
<jugglinmike> Matt_King: What if we change a test plan and we want to generate new results for the updated test plan? We would expect the automation to collect all the responses, but in the cases where the assertions have changed, it would say "I don't have a verdict"
<jugglinmike> Matt_King: We need to be able to insert a test between "test 1" and "test 2" without invalidating either of their verdicts
<jugglinmike> Matt_King: Could changes to instructions ever invalidate the verdict?
<jugglinmike> Matt_King: The setup scripts could definitely. Ditto with the commands and the assertions
<jugglinmike> mzgoddard: Whatever we land on, we can agree that there are a group of traits which we can use to look up prior verdicts and reuse.
<jugglinmike> Matt_King: Exactly
<jugglinmike> Matt_King: Automation is allowed to run a test, execute commands, collect responses. In the event that human-approved verdicts exist, and IF those verdicts are still applicable, then assign those verdicts to the responses.
<jugglinmike> Michael_Fairchild: I hesitate to say that any change to the APG would invalidate the verdicts
<jugglinmike> Michael_Fairchild: It ought to be enough to look at the AT response, because a meaningful change to the APG example would change the AT response
<jugglinmike> jugglinmike: But if the source changes meaningfully and the AT incorrectly does not change its output, then the system would be "fooled" into re-using a prior verdict which is actually no longer valid
<jugglinmike> jugglinmike: It seems like these considerations need to be addressed in the working mode. Is that right?
<jugglinmike> Matt_King: I'm worried about the working mode becoming too opaque. I think the working mode is more for humans to understand the high level. Maybe certain sections of the work could link to separate documents which get into mechanics
<jugglinmike> Matt_King: These considerations definitely need to be documented, though I'm not sure how yet
<jugglinmike> Matt_King: The first thing I want to do is update the glossary and then use that to inform how we want to update the working mode
<jugglinmike> Matt_King: There is a need for stakeholders (e.g. AT implementers) to understand the working mode. Like, "what's the appeals process?"
<jugglinmike> Matt_King: Complexity will be the enemy of clarity there
<jugglinmike> Zakim, end the meeting
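
To make the verdict-reuse idea from the log above concrete, here is a minimal sketch. It assumes the "traits" in question are the setup script, the command, the assertion text, and the collected AT response; every name below is hypothetical, not part of any existing ARIA-AT API:

```ts
import { createHash } from "node:crypto";

// Hypothetical traits that, taken together, determine whether a prior
// human-approved verdict is still applicable (an assumption for illustration).
interface VerdictTraits {
  setupScript: string;
  command: string;
  assertion: string;
  atResponse: string; // the text the AT produced when the test ran
}

// Key prior verdicts by a digest of the traits, so a change to any of them
// (test edit, reworded assertion, different AT response) misses the lookup.
function verdictKey(traits: VerdictTraits): string {
  return createHash("sha256").update(JSON.stringify(traits)).digest("hex");
}

// Reuse a human-approved verdict only when the traits match exactly;
// otherwise the system reports that it has no verdict for this response.
function lookUpVerdict(
  approved: Map<string, "pass" | "fail">,
  traits: VerdictTraits
): "pass" | "fail" | "no verdict" {
  return approved.get(verdictKey(traits)) ?? "no verdict";
}
```

Including the setup script, command, and assertion in the key (rather than the response text alone) is one way to address the concern raised above about the system being "fooled" when a meaningful source change does not change the AT's output.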

@jugglinmike changed the title from "Considering the direction of automation" to "Define the role of automation" on Apr 5, 2023
@jugglinmike
Contributor Author

@mcking65 as per our conversation on 2023-03-27, I've updated the name of this issue and added a checklist describing the next steps. Does that reflect your understanding of the work ahead?

@jugglinmike
Contributor Author

@mcking65 In addition to the above, I'm wondering if the following change makes sense for the app.

Currently, the app includes "unexpected behaviors" in its description of test results. If I understand the new terminology correctly, it would be appropriate to call these "unexpected responses." That would be an improvement in my mind for a couple of reasons (a rough sketch follows the list):

  1. it would avoid insinuating the presence of some other classification of data (instead of documenting both "responses" and "behaviors", we only consider "responses" and denote some of them as "unexpected")
  2. it would reinforce the relationship between these two pieces of information (the two have identical data types)
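
Here is the rough sketch mentioned above; it is a hypothetical shape only, and the field names are mine rather than the app's:

```ts
// A single record type: some responses are simply flagged as unexpected,
// rather than tracking "behaviors" as a separate class of data.
interface CollectedResponse {
  output: string;       // what the AT verbalized
  unexpected: boolean;  // token: is this response unexpected?
  kind?: string;        // what kind of unexpected response, if any
  description?: string; // tester's textual description, if any
}
```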

What do you think?
