
What are good loss and accuracy scores? #21

Open
jaaaamIron opened this issue Aug 17, 2021 · 3 comments
jaaaamIron commented Aug 17, 2021

(This is more a question than an issue).

I am wondering what are generally considered good loss and accuracy scores on Tensorboard during training - and at what point (if any) you've found you get good/interesting results etc?

EDIT: And also how you'd expect these to change over time if training is going well? (NB I know "right answers" aren't totally easy to come by)

relativeflux (Member) commented Aug 18, 2021

Thanks for the question, and apologies for the delay in answering. Indeed it's a very difficult question to answer definitively. One possible answer (with respect to the loss metric) is 'as low as possible' (or as high as possible for accuracy). But you don't want it to go too low, since then you'll just be reproducing the original... instead you want it to be able to generalise, so it generates output that is in some way a variation on the original. I like what the Dadabots said in one of their papers, which is that it should overfit on the local scale, but underfit on the wider scale - so you get something that has an uncanny quality of being reminiscent in some way of the original, but also clearly not the original.

I generally aim to get the loss below 1, but that doesn't necessarily mean that results above that won't be interesting or useful. Recently I've been training on a large orchestral dataset - at the moment the loss is at about 1.5 but it's producing some really impressive and interesting results even at that level.

One thing that is pretty much certain is that if your training loss is still descending while your validation loss is ascending then you are overfitting, meaning that the network has learnt features (or even noise) from the original too well and can't generalise. Having said that, the results of that can be interesting in themselves, from an aesthetic point of view. The main technique for combating overfitting is to increase the size of your dataset. I tend to work with datasets around 2000 to 3000 chunks in size. Another anti-overfitting technique is to decrease the capacity of the network (fewer layers, lower dimensionality).
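
To make that concrete, here is a minimal Keras-style sketch of watching the validation loss and stopping when it stops improving. The model and data here are toy stand-ins, not the repository's training loop; only the callback wiring is the point.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data and model, purely to illustrate the callback wiring;
# in practice you would be watching your SampleRNN training/validation losses.
x = np.random.rand(256, 16).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss, not the training loss
    patience=10,                 # epochs to wait for an improvement before stopping
    restore_best_weights=True,   # roll back to the best weights seen so far
)

model.fit(x, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```

If TensorBoard shows the training loss still falling while val_loss climbs, that widening gap is exactly the overfitting signal described above.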

You could also try building a dataset from different versions (perhaps different performances or recordings) of the same music. This is a form of data augmentation, which is widespread in the computer vision domain - for example using different versions of the same image (by flipping, adding noise, etc.) - but not so common when working with audio. I've found this technique can help, especially when you don't have much data (the overlap feature when producing the chunks is also a form of data augmentation).
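
As a rough illustration of the overlap idea, here is a minimal sketch of overlapped chunking. This is not the repository's own chunk_audio.py; the function and parameter names are placeholders, but the principle is the same: a smaller hop between chunk starts yields more training examples from the same audio.

```python
import numpy as np
import librosa

def chunk_with_overlap(path, chunk_seconds=8.0, overlap_seconds=1.0, sr=16000):
    """Split an audio file into fixed-length chunks with a given overlap."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    chunk_len = int(chunk_seconds * sr)
    hop = chunk_len - int(overlap_seconds * sr)   # smaller hop => more chunks
    return [audio[i:i + chunk_len]
            for i in range(0, len(audio) - chunk_len + 1, hop)]

# e.g. 8-second chunks with a 1-second overlap give roughly one seventh more
# chunks than non-overlapping 8-second windows over the same recording.
```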

One thing that has struck me recently is that the closer the training and validation losses are, the better, even if the absolute values of these metrics are higher than at a later point when they are lower but further apart.

Take a look at my article on the Beethoven Piano Sonatas training I ran a few months ago. Those are some of the best results I've achieved so far, although the training was at a 16kHz sample rate so the audio quality is not the best. One thing that made a noticeable difference is the type of quantisation - linear quantisation trained much quicker and achieved better metrics, but the audio quality was not as good, with a lot of background hiss.
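
For reference, the difference between the two quantisation schemes can be sketched in a few lines of NumPy (assuming floating-point audio in [-1, 1] and 256 levels). Mu-law spends more of its levels on quiet samples, which is why it tends to sound cleaner than linear quantisation at the same bit depth, even if it trains more slowly.

```python
import numpy as np

def mu_law_encode(audio, q_levels=256):
    """Mu-law companding followed by quantisation to integer levels."""
    mu = q_levels - 1
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.int32)

def linear_encode(audio, q_levels=256):
    """Plain linear quantisation, for comparison."""
    return np.round((audio + 1) / 2 * (q_levels - 1)).astype(np.int32)

quiet = np.array([0.001, 0.01, 0.05])
print(mu_law_encode(quiet))   # quiet samples are spread across many levels
print(linear_encode(quiet))   # quiet samples all cluster near the middle level
```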

A new feature I am currently working on is weight normalisation, which should help push the metrics further in the right direction, and also speed up their convergence. Due to some issues with the TensorFlow Keras API I was not able to incorporate this feature before now. Hopefully it should be available within the next month.
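
To illustrate what that involves (this is not the repository's implementation, just one common way to get weight normalisation in Keras via TensorFlow Addons): the idea is to reparameterise each weight vector as a direction times a learned magnitude, w = g * v / ||v||, which often smooths and speeds up convergence.

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Wrap an ordinary layer so its kernel is reparameterised as w = g * v / ||v||.
dense = tf.keras.layers.Dense(1024, activation="relu")
wn_dense = tfa.layers.WeightNormalization(dense)

x = tf.random.normal([8, 256])
print(wn_dense(x).shape)   # (8, 1024)
```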

jaaaamIron (Author) commented

Thank you - and thanks for the article, it's very informative.
I'm often triggering the early stopping patience with my datasets, and find that when I continue to train, after say 100 epochs it has definitely overfit, as I can hear the original dataset. If it's not possible to increase the length of the dataset, would increasing the chunk quantity and decreasing their size lead to broadly 'better' results in your experience (e.g. each chunk is 3 seconds instead of 8)?

relativeflux (Member) commented

@jaaaamIron I usually work with chunks of 6 or 8 seconds, with an overlap if that generates more chunks. I have experimented with smaller chunk sizes - I think the smallest I've tried is 4 seconds, with an overlap of 3 seconds. I wouldn't like to say that alone improved the results, as I also modified other parameters. I have found reducing the RNN dimensionality can work in cases where data is limited - I've gone as low as 512.

Recently I've been training with the following configuration, which I initially used with the Beethoven sonatas dataset:

{
    "seq_len": 512,
    "frame_sizes": [2,8],
    "dim": 1024,
    "rnn_type": "gru",
    "num_rnn_layers": 1,
    "q_type": "mu-law",
    "q_levels": 256,
    "emb_size": 256
}

That has worked well for me across a range of datasets - indeed I am minded to make it the default config. Another thing you might try is the tuning scripts, particularly ray_tune.py. That works best with multiple cards, or on a cluster if you have access to one, so it can run multiple trials simultaneously. Unfortunately it doesn't save checkpoints, just prints the best trial results at the end, but you do have access to the Ray Tune dashboard, which integrates TensorBoard very nicely.
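
For anyone unfamiliar with Ray Tune, a minimal sketch of a search space looks something like the following. The train_samplernn function and the reported val_loss here are hypothetical stand-ins, not the actual trainable in ray_tune.py.

```python
from ray import tune

def train_samplernn(config):
    # Hypothetical trainable: build and train a model using config["dim"],
    # config["seq_len"], etc., then report the metric Ray Tune should optimise.
    val_loss = 1.0  # placeholder for the real validation loss
    tune.report(val_loss=val_loss)

analysis = tune.run(
    train_samplernn,
    config={
        "dim": tune.grid_search([512, 1024]),
        "seq_len": tune.choice([512, 1024]),
    },
    metric="val_loss",
    mode="min",
)
print(analysis.best_config)  # hyperparameters of the best trial
```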
