ENH: Add ability to compute metrics by accumulating across samples when desired #595

Open · 4 tasks
NickleDave opened this issue Nov 23, 2022 · 0 comments
Assignees: NickleDave
Labels: ENH: enhancement (enhancement; new feature or request)
NickleDave commented Nov 23, 2022

edit:
After discussing with @yardencsGitHub, we are not considering it a bug that we average metrics on a per-sample basis.

In fact, we want to be able to get a statistic like the average across samples/songs, because in many cases this is more useful from a user's point of view. You want to know something like "how well does this model perform on average?", understanding that the distribution of predictions will include some cases that are worse than that average.

but we do need to

  • be very explicit about when we are averaging
  • better understand what the standards in the literature are, and possibly provide those as options

I think it might be the case that in ASR and NLP, researchers want the WER for the entire corpus as a benchmark, so they accumulate errors across batches and then compute the metric at the end. You would also want this if you were using an edit distance as a loss function. I think this might be why accumulating is the common approach in that literature.

In our case, we want an estimate of the metric on a per-sample basis, where a sample is a given bout of song or other unit of vocalization, i.e., the level at which we usually predict.
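
To make the distinction concrete, here is a toy sketch (plain Python, made-up label strings, not vak code): a per-sample average weights every song equally, while a corpus-level rate weights songs by length, so a single long, badly-predicted song can dominate.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

# toy "songs": (predicted labels, reference labels)
pairs = [("abcd", "abcd"), ("ab", "aa"), ("a" * 100, "b" * 100)]

# per-sample average: each song counts once, regardless of its length
per_sample = sum(levenshtein(p, r) / len(r) for p, r in pairs) / len(pairs)

# corpus-level: accumulate edits and reference lengths, divide once at the end
corpus = sum(levenshtein(p, r) for p, r in pairs) / sum(len(r) for _, r in pairs)

print(per_sample)  # (0.0 + 0.5 + 1.0) / 3 = 0.5
print(corpus)      # (0 + 1 + 100) / (4 + 2 + 100) ≈ 0.95
```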

context
The engine.Model._eval method computes metrics by averaging the metric values across every batch in the validation/test set.

Here:

        # ---- compute metrics averaged across batches -----------------------------------------------------------------
        # iterate over list of keys, to avoid "dictionary changed size" error when adding average metrics
        for metric_name in list(metric_vals):
            if metric_name in ["loss", "acc", "levenshtein", "segment_error_rate"]:
                avg_metric_val = (
                    torch.tensor(metric_vals[metric_name]).sum().cpu().numpy()
                    / n_batches
                ).item()
                metric_vals[f"avg_{metric_name}"] = avg_metric_val
            else:
                raise NotImplementedError(
                    f"calculation of metric across batches not yet implemented for {metric_name}"
                )

This has the effect of deflating error rates, producing an overly optimistic estimate.
We should instead accumulate errors and the number of samples across batches, then compute the metric after running through all batches.
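
As a rough sketch of the difference for accuracy (toy tensors and placeholder names, not vak's actual _eval variables): averaging per-batch values only agrees with accumulate-then-compute when every batch has the same number of time bins, which is not the case when each batch is a single variable-length sample.

```python
import torch

# two toy "batches" of (predicted frame labels, true frame labels)
batches = [
    (torch.tensor([0, 1, 1]), torch.tensor([0, 1, 0])),          # 3 time bins, 2 correct
    (torch.tensor([1] * 97), torch.tensor([1] * 90 + [0] * 7)),  # 97 time bins, 90 correct
]

n_correct, n_total = 0, 0
per_batch_accs = []

for y_pred, y_true in batches:
    n_correct += (y_pred == y_true).sum().item()
    n_total += y_true.numel()
    per_batch_accs.append((y_pred == y_true).float().mean().item())

pooled_acc = n_correct / n_total                               # accumulate, then compute: 0.92
avg_of_batch_accs = sum(per_batch_accs) / len(per_batch_accs)  # what _eval does now: ≈ 0.80
```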

Instead of re-implementing this by hand, I think it's better to depend on another library. torchmetrics is actively maintained, used by 5k other projects on GitHub, and has 1.2k stars. An initial comparison with our test cases suggests we would get the same values on a per-batch basis for both accuracy and segment error rate (in torchmetrics, character error rate).
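
For example, per-batch values could be checked against the functional API roughly like this (made-up label strings; this assumes the torchmetrics.functional.char_error_rate import path):

```python
from torchmetrics.functional import char_error_rate

# two "batches" of predicted / reference label sequences, as strings
print(char_error_rate(["abcde"], ["abcd"]))  # 1 edit / 4 reference chars = 0.25
print(char_error_rate(["ab"], ["aa"]))       # 1 edit / 2 reference chars = 0.50
```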

We should

  • add the dependency
  • use their functional versions in our unit tests as a form of regression test
  • also test the classes
  • rewrite engine.Model._eval to use the class-based metrics, so that a single metric.compute() at the end replaces the averaging logic above (see the sketch below); as a nice side effect, this will make the logic in _eval more concise
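
A rough sketch of what the rewritten loop could look like, assuming torchmetrics' CharErrorRate class; eval_loader, labels_from_output, and labels_from_batch are placeholders for whatever _eval already produces, not actual vak names:

```python
from torchmetrics.text import CharErrorRate

def eval_with_torchmetrics(model, eval_loader):
    # torchmetrics calls this metric "character error rate"; it is the same
    # computation as our segment error rate when labels are single characters
    segment_error_rate = CharErrorRate()
    # other metrics (accuracy, loss) would follow the same update()/compute() pattern

    for batch in eval_loader:
        out = model(batch["source"])
        # update() accumulates edit counts and reference lengths internally
        segment_error_rate.update(labels_from_output(out), labels_from_batch(batch))

    # a single compute() at the end replaces the per-batch averaging shown above
    return {"segment_error_rate": segment_error_rate.compute().item()}
```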
@NickleDave NickleDave added the BUG Something isn't working label Nov 23, 2022
@NickleDave NickleDave self-assigned this Nov 23, 2022
@NickleDave NickleDave changed the title BUG: Metrics are averaged across batches in eval instead of accumulating then computing ENH: Add ability to compute metrics by accumulating across batches when desired Nov 23, 2022
@NickleDave NickleDave changed the title ENH: Add ability to compute metrics by accumulating across batches when desired ENH: Add ability to compute metrics by accumulating across samples when desired Nov 23, 2022
@NickleDave NickleDave added ENH: enhancement enhancement; new feature or request and removed BUG Something isn't working labels Nov 23, 2022