edit:
After discussing with @yardencsGitHub, we are not considering it a bug that we average metrics on a per-sample basis.
In fact, we want to be able to get a statistic like the average across samples/songs, because in many cases this is more useful from a user's point of view. You want to know something like "how well does this model perform on average?", understanding that the distribution of predictions will include some cases that are worse.
But we do need to:
- be very explicit about when we are averaging
- better understand what the standards are in the literature, and possibly provide those options
I think it might be the case that in ASR + NLP, researchers want the WER for the entire corpus as a benchmark, so they accumulate error across batches and then compute the metric at the end. You would also want this if you were using an edit distance as a loss function. I think this might be why accumulate-then-compute is the common approach in those fields.
In our case we want an estimate of the metric on a per-sample basis, where a sample is a given bout of song or other unit of vocalization, the level at which we usually predict.
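To make the difference concrete, here is a toy sketch with made-up numbers, contrasting the per-sample average of an error rate with the corpus-level rate you get by accumulating edits and reference lengths first:

```python
# Hypothetical edit counts and reference lengths for two samples (e.g., bouts of song)
edits = [0, 5]
ref_lens = [2, 100]

# per-sample: compute an error rate for each sample, then average the rates
per_sample_avg = sum(e / n for e, n in zip(edits, ref_lens)) / len(edits)  # 0.025

# corpus-level: accumulate edits and reference lengths, then compute once
corpus_rate = sum(edits) / sum(ref_lens)  # ~0.049

print(per_sample_avg, corpus_rate)
```

Neither number is wrong; they answer different questions, which is why we need to be explicit about which one we report.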
context:
The `engine.Model._eval` method computes metrics by averaging the metric values across every batch in the validation/test set. Here (`src/vak/engine/model.py`, line 301 at 0a5d18b):

```python
# ---- compute metrics averaged across batches -----------------------------------------------------------------
# iterate over list of keys, to avoid "dictionary changed size" error when adding average metrics
for metric_name in list(metric_vals):
    if metric_name in ["loss", "acc", "levenshtein", "segment_error_rate"]:
        avg_metric_val = (
            torch.tensor(metric_vals[metric_name]).sum().cpu().numpy()
            / n_batches
        ).item()
        metric_vals[f"avg_{metric_name}"] = avg_metric_val
    else:
        raise NotImplementedError(
            f"calculation of metric across batches not yet implemented for {metric_name}"
        )
```
This has the effect of deflating error rates, producing an overly optimistic estimate.
We should instead accumulate errors and number of samples across batches, and then compute the metric after running through all batches.
Instead of re-implementing this by hand, I think it's better to depend on another library for it. torchmetrics is actively maintained, used by 5k other projects on GitHub, and has 1.2k stars. An initial comparison with our test cases suggests we would get the same values on a per-batch basis for both accuracy and segment error rate (in torchmetrics, character error rate).
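As a rough sketch of the kind of regression test this enables (exact import paths depend on the torchmetrics version, and the data here are made up; the corresponding vak call is omitted because its exact API isn't shown in this issue):

```python
import torch
from torchmetrics.functional import char_error_rate

# hypothetical per-sample label strings for one batch
preds  = ["abcb", "aabb"]
target = ["abab", "aabb"]

# torchmetrics' character error rate over the batch: total edits / total reference length
cer = char_error_rate(preds=preds, target=target)
assert torch.isclose(cer, torch.tensor(0.125))  # 1 edit over 8 reference characters
```

In a unit test we would assert that vak's segment error rate produces the same value for the same inputs.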
We should:
- add the dependency
- use their functional versions in our unit tests as a form of regression test
- also test the classes
- rewrite `engine.Model._eval` to use the class-based metrics; specifically, calling `metric.compute()` at the end should replace the averaging logic above (see the sketch after this list) -- as a nice side effect, this will make the logic in `_eval` more concise
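A minimal sketch of what the class-based usage could look like inside an eval loop, assuming a `CharErrorRate`-style metric; the loader, the `predict_label_strings` helper, and the variable names are hypothetical, not the actual vak code:

```python
import torchmetrics

# instantiate once, before iterating over the validation/test set
segment_error_rate = torchmetrics.CharErrorRate()

metric_vals = {}
for batch in eval_loader:  # hypothetical DataLoader over the eval set
    # hypothetical helper returning per-sample label strings for this batch
    pred_labels, target_labels = predict_label_strings(batch)
    # update() accumulates edit counts and reference lengths internally
    segment_error_rate.update(pred_labels, target_labels)

# compute() produces a single value from the accumulated state,
# replacing the per-batch averaging shown above
metric_vals["segment_error_rate"] = segment_error_rate.compute().item()
segment_error_rate.reset()
```

Note that `compute()` here gives the accumulated (corpus-level) value; the per-sample average discussed in the edit above would still be computed separately when we want to report it.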