edit:
After discussing with @yardencsGitHub, we are not considering it a bug that we average metrics on a per-sample basis.
In fact, we want to be able to get a statistic like the average across samples/songs, because in many cases this is more useful from a user's point of view. You want to know something like "how well does this model perform on average?", understanding that the distribution of predictions will include some cases that are worse.
But we do need to:
- be very explicit about when we are averaging
- better understand what the standards are in the literature, and possibly provide those options
I think it might be the case that in ASR + NLP, researchers want the WER for the entire corpus as a benchmark, so they accumulate error across batches and then compute the metric at the end. You would also want this if you were using an edit distance as a loss function. I think this might be why accumulate-then-compute is the common approach in those fields.
In our case we want an estimate of the metric on a per-sample basis, where a sample is a given bout of song or other unit of vocalization, the level at which we usually predict.
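To make the difference concrete, here is a toy sketch with made-up numbers, contrasting the per-sample average of an error rate with the corpus-level rate you get by accumulating edits and reference lengths first:

```python
# Hypothetical edit counts and reference lengths for two samples (e.g., bouts of song)
edits = [0, 5]
ref_lens = [2, 100]

# per-sample: compute an error rate for each sample, then average the rates
per_sample_avg = sum(e / n for e, n in zip(edits, ref_lens)) / len(edits)  # 0.025

# corpus-level: accumulate edits and reference lengths, then compute once
corpus_rate = sum(edits) / sum(ref_lens)  # ~0.049

print(per_sample_avg, corpus_rate)
```

Neither number is wrong; they answer different questions, which is why we need to be explicit about which one we report.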
context:
The `engine.Model._eval` method computes metrics by averaging the metric values across every batch in the validation/test set. Here (`src/vak/engine/model.py`, line 301 at 0a5d18b):

```python
# ---- compute metrics averaged across batches -----------------------------------------------------------------
# iterate over list of keys, to avoid "dictionary changed size" error when adding average metrics
for metric_name in list(metric_vals):
    if metric_name in ["loss", "acc", "levenshtein", "segment_error_rate"]:
        avg_metric_val = (
            torch.tensor(metric_vals[metric_name]).sum().cpu().numpy()
            / n_batches
        ).item()
        metric_vals[f"avg_{metric_name}"] = avg_metric_val
    else:
        raise NotImplementedError(
            f"calculation of metric across batches not yet implemented for {metric_name}"
        )
```
This has the effect of deflating error rates, producing an overly optimistic estimate.
We should instead accumulate errors and number of samples across batches, and then compute the metric after running through all batches.
Instead of re-implementing this by hand, I think it's better to depend on another library for it. torchmetrics is actively maintained, used by 5k other projects on GitHub, and has 1.2k stars. An initial comparison with our test cases suggests we would get the same values on a per-batch basis for both accuracy and segment error rate (in torchmetrics, character error rate).
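As a rough sketch of the kind of regression test this enables (exact import paths depend on the torchmetrics version, and the data here are made up; the corresponding vak call is omitted because its exact API isn't shown in this issue):

```python
import torch
from torchmetrics.functional import char_error_rate

# hypothetical per-sample label strings for one batch
preds  = ["abcb", "aabb"]
target = ["abab", "aabb"]

# torchmetrics' character error rate over the batch: total edits / total reference length
cer = char_error_rate(preds=preds, target=target)
assert torch.isclose(cer, torch.tensor(0.125))  # 1 edit over 8 reference characters
```

In a unit test we would assert that vak's segment error rate produces the same value for the same inputs.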
We should:
- add the dependency
- use their functional versions in our unit tests as a form of regression test
- also test the classes
- rewrite `engine.Model._eval` to use the class-based metrics; specifically, calling `metric.compute()` at the end should replace the averaging logic above (see the sketch after this list) -- as a nice side effect, this will make the logic in `_eval` more concise
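A minimal sketch of what the class-based usage could look like inside an eval loop, assuming a `CharErrorRate`-style metric; the loader, the `predict_label_strings` helper, and the variable names are hypothetical, not the actual vak code:

```python
import torchmetrics

# instantiate once, before iterating over the validation/test set
segment_error_rate = torchmetrics.CharErrorRate()

metric_vals = {}
for batch in eval_loader:  # hypothetical DataLoader over the eval set
    # hypothetical helper returning per-sample label strings for this batch
    pred_labels, target_labels = predict_label_strings(batch)
    # update() accumulates edit counts and reference lengths internally
    segment_error_rate.update(pred_labels, target_labels)

# compute() produces a single value from the accumulated state,
# replacing the per-batch averaging shown above
metric_vals["segment_error_rate"] = segment_error_rate.compute().item()
segment_error_rate.reset()
```

Note that `compute()` here gives the accumulated (corpus-level) value; the per-sample average discussed in the edit above would still be computed separately when we want to report it.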