
How to summarize and display test results? #19

Closed
spectranaut opened this issue Nov 22, 2019 · 8 comments

Comments

@spectranaut
Contributor

Hey all! This is a somewhat urgent issue because I need to get something in by Wednesday the 27th, so I'm hoping to spend all of Tuesday the 26th, at the latest, implementing a summary page. Each test records a lot of information, and I am not sure which information is the most relevant to show. I'm hoping to get feedback in this issue so I can make a first draft of a test result summary.

Here is an example HTML page with the summary after you complete the "read-checkbox.html" test, which I used because it is the longest test:

[Image: read checkbox results]

That is the result for just one test. What we need to discuss here is:

  1. Is this a reasonable way to show a summary of one test?
  2. How should we ultimately show results for "read checkbox" along with results from "operate checkbox" and "read checkbox grouping"?
  3. What does it mean for a test to pass?

Current implementation of algorithm for test passing or failing:

  1. The test passes if:
    1. All assertions pass for every AT command.
    2. There are no unexpected bad behaviors (such as irrelevant extra information or the AT crashing) after any AT command.

Therefore, any failing assertion or undesirable behavior after any key command will result in the test failing. We could have a different state for a test result if all assertions pass but some additional undesirable behavior occurs.
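For concreteness, here is a minimal TypeScript sketch of that rule; the type and field names below are hypothetical, not the actual test harness data model:

```ts
interface CommandResult {
  assertionsPassed: boolean[];   // one entry per assertion for this AT command
  unexpectedBehaviors: string[]; // e.g. "irrelevant extra information", "AT crashed"
}

function testPasses(commands: CommandResult[]): boolean {
  // The test passes only if every assertion passes for every AT command
  // and no command produced an unexpected behavior.
  return commands.every(
    (cmd) =>
      cmd.assertionsPassed.every((passed) => passed) &&
      cmd.unexpectedBehaviors.length === 0
  );
}
```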

What will cause an assertion to pass or fail?

An assertion will fail if any screen reader command results in incorrect information (in this case, the tester will have marked "incorrect output"). In the checkbox case, if the accessible name, role, or state is actually wrong for any tested key command, then the assertion fails.

Additionally, an assertion will fail if the test author includes an additional assertion that is not related to the output of the screen reader and that assertion fails. So far we only have one example of this kind of assertion: the assertion about JAWS and NVDA changing modes when you use TAB to read a checkbox.

An assertion will be considered to pass if there is only "missing" information (for example, if a tester marks "no output" because the role "checkbox" is not announced).
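As a sketch of that rule, assuming a tester's marking is reduced to one of three values (the names here are illustrative, not the real test format):

```ts
type OutputMark = "correct output" | "no output" | "incorrect output";

function assertionPasses(mark: OutputMark, extraAssertionPassed: boolean = true): boolean {
  // "Incorrect output" always fails the assertion.
  if (mark === "incorrect output") return false;
  // Merely missing information ("no output") still counts as a pass,
  // but a separate non-output assertion (e.g. mode switching on TAB) can still fail it.
  return extraAssertionPassed;
}
```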

Not all assertions are equal

The test design so far does not have a way to record the necessity of an assertion (for example, "mandatory" or "nice to have"). This will take some thinking to fit into the test design (as the tests are already quite complicated), so I do not think that the ability to mark some assertions as necessary for passing and others as optional will make it into the prototype for this phase of the project.

@spectranaut
Contributor Author

@mcking65 and @jfhector and @mfairchild365 and @Yohta89, can you take a look at this issue? :)

@ghost

ghost commented Nov 22, 2019

I liked the overall structure of the result page! I'm assuming the primary purpose of the result page is to help stakeholders understand the deficiencies of each screen reader based on the results, i.e. to get them on the same page for a conversation about what's wrong with current experiences. With that in mind, here are some thoughts.

  • Is this a reasonable way to show a summary of one test?

- For a more complicated and longer test, add a summary score after the test results, e.g. "test result: 1 of 5 failed."
- To help implementers skim the results quickly, mark failures in a different color or in italics.

  • How should we ultimately show results for "read checkbox" along with results from "operate checkbox" and "read checkbox grouping"?

- The only structure I can think of for now is a tree structure. Under the root umbrella of a checkbox, have the three kinds of test results. The root page would hold the summary of the test results and could potentially record scoring (though I think scoring would be the next phase).

  • What does it mean for a test to pass?

- The current definition you've shared makes sense to me. I'll get back to this once I come up with other thoughts.

And this relates to the testing page, but since we don't have free-form text sections to capture other details, I was wondering how we could see the detail of why the tester marked something as a fail in the current result page.

@mfairchild365
Contributor

  1. Is this a reasonable way to show a summary of one test?

I think it is certainly a good start, and I agree with @Yohta89's comments. Additionally, it would be good to give the tables row and column headers.

  2. How should we ultimately show results for "read checkbox" along with results from "operate checkbox" and "read checkbox grouping"?

If I'm understanding this correctly, we should have a summary of support at the test suite level, where the user could dive into the summaries for each test. Side note, what are we calling a group of related tests? A test plan? A test suite?

  3. What does it mean for a test to pass?

I can't help but wonder if we should mark tests as 'partial' when some assertions pass and others fail. That could help stakeholders quickly determine 'oh, it looks like this one is completely failing but that one at least has some support'.

@isaacdurazo
Member

@spectranaut I was taking a look at the summary page and was wondering how we could simplify it.

I was thinking that, ultimately, what we want to get out of it is a picture of what's failing, so why display what is passing?

Showing only what's failing, in addition to the summary score that @Yohta89 is suggesting, could make parsing the summary page information easier.

What do you all think?

@mcking65
Contributor

What is passing is as important with these tests as what is failing.

@mcking65
Contributor

I have a hard time parsing this because of all the words, headings, tables ... it feels difficult to get a picture of what happened during the testing.

Here is my suggestion.

Put the entire report into a single table with the following columns:

  1. Task
  2. Must-Have Pass/Fail
  3. Should-Have Pass/Fail
  4. Nice-to-Have Pass/Fail
  5. Unexpected Behavior Count

At the bottom there are two summary rows:

  1. Totals: column 1 has the task count, e.g., 5 tasks. The pass/fail columns total all the passes and all the fails in the column. The unexpected column totals the number of unexpected behaviors.
  2. Percentages: Column 1 has "Percent supported" and the pass/fail columns have (passes/(passes+fails))*100

To calculate pass/fail counts, let's consider a command/assertion pair as a single expected behavior.
That is, down arrow conveying role is 1 expected behavior; down arrow announcing name is another.
If there was no output, or if the output was incorrect, that command/assertion pair is counted as a fail.
So, if there were 20 expected must-have behaviors and 18 passed, put 18/2 in that column.

For unexpected behaviors, such as excess verbosity, just count them for column 5.
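A minimal TypeScript sketch of that summary-row math, assuming each task row already carries per-priority pass/fail counts (all names below are illustrative, not a real report schema; only the must-have column is shown, the should-have and nice-to-have columns would follow the same pattern):

```ts
interface TaskRow {
  task: string;
  mustPass: number;
  mustFail: number;
  unexpected: number;
}

function percentSupported(pass: number, fail: number): number {
  // (passes / (passes + fails)) * 100, as proposed above.
  const total = pass + fail;
  return total === 0 ? 100 : (pass / total) * 100;
}

function summaryRows(rows: TaskRow[]) {
  const sum = (pick: (r: TaskRow) => number) =>
    rows.reduce((acc, r) => acc + pick(r), 0);
  const mustPass = sum((r) => r.mustPass);
  const mustFail = sum((r) => r.mustFail);
  return {
    taskCount: rows.length,               // e.g. "5 tasks"
    mustTotal: `${mustPass}/${mustFail}`, // e.g. "18/2"
    mustPercent: percentSupported(mustPass, mustFail),
    unexpectedTotal: sum((r) => r.unexpected),
  };
}
```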

The task column is a link that opens details for that task in a new tab. The title is "Results for task TASK_NAME"

When you get down to the task detail page, the command is typically going to be the primary element of concern, e.g., how well a specific command is supported. Screen reader bugs will typically be associated with either a command or an unexpected behavior.

This page would have a single table with columns:

  1. command
  2. Support level
  3. Details

Note: The command will serve as a row header for each row.

The support level column will have one of the following values:

  1. Full: All assertions had expected behavior and there were no unexpected behaviors. This value is only possible if there are should-have or nice-to-have assertions.
  2. All Required: All Must-Have assertions had expected behavior and there were no unexpected behaviors.
  3. Failing: There was some kind of failure or unexpected behavior.
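A small TypeScript sketch of how that support level could be derived, under the assumption that each command's results are already reduced to a few counts (the names are hypothetical):

```ts
type SupportLevel = "Full" | "All Required" | "Failing";

interface CommandSummary {
  mustFailures: number;           // failing must-have command/assertion pairs
  optionalFailures: number;       // failing should-have or nice-to-have pairs
  hasOptionalAssertions: boolean; // "Full" is only possible when these exist
  unexpectedBehaviors: number;
}

function supportLevel(c: CommandSummary): SupportLevel {
  if (c.mustFailures > 0 || c.unexpectedBehaviors > 0) return "Failing";
  if (c.optionalFailures === 0 && c.hasOptionalAssertions) return "Full";
  // All must-have assertions passed and nothing unexpected happened.
  return "All Required";
}
```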

Later, I'd like to refine the above to include a partial support option that is distinct from failing support, but we would first need to make some adjustments to the way we categorize unexpected behaviors. For instance, some excess speech does not introduce errors, while other excess speech could actually be incorrect and is worth noting as a worse kind of failure.

The details column shows the output and then lists assertions, grouped by pass/fail.

Output = ...

Passing assertions:

  • Must: The role 'checkbox' is conveyed
  • Must: The name 'Lettuce' is spoken
  • ...

Failing assertions:

  • Must: assertion that failed...
  • ...

Unexpected behaviors: none

@mfairchild365
Contributor

@mcking65 I like the direction this is going. Clarification: should "Must-Have Pass/Fail" be two columns? 1 for pass and 1 for fail? I'm struggling to visualize what cell data under that single column would look like, maybe "x/y" where x is the number of passing commands and y is the number of failing commands?

@mcking65
Contributor

@mfairchild365 commented:

@mcking65 I like the direction this is going. Clarification: should "Must-Have Pass/Fail" be two columns? 1 for pass and 1 for fail? I'm struggling to visualize what cell data under that single column would look like, maybe "x/y" where x is the number of passing commands and y is the number of failing commands?

Yes, 18/2 would mean 18 pass and 2 fail. This is a way to 1) give more space for column 1 and 2) make it easier to get more info quickly by reading down a single column. As a screen reader user, I can get a lot more info with fewer keystrokes this way. The number of passes and the number of fails are numbers the user will often want to consume simultaneously when skimming. The only disadvantage is if you are purely focused on the failures; in that case, you would have to listen to extra info. Given the nature of the data, that seems like a good trade-off to me.
