
Analysing neural networks


Related also: Debugging, and the RETURNN Debugging doc (e.g. for inf/nan, or other bad/wrong behavior).

Say you are trying out a new neural network model, or a new model part, e.g. layer normalization, residual connections, a new attention variant, a new architecture, a new regularization, or a new training or optimization method. But it doesn't quite work, although maybe other people have reported that it works well. Or it's a new idea and you don't know yet. Or even if it works well, maybe you want to better understand how and why it works.

How do you analyse your neural network? The behavior at inference? The training convergence behavior?

This is a loose list of things you could look at, together with some external references.

For any of these, you may assume it behaves well or works well, but it can still be worth taking a short look: when it actually is not as you would expect, that is a hint and a good starting point to better understand why it behaves that way.

This page is not about the technical details. For technical questions, like how to actually collect running statistics of gradients, please check the documentation, or ask in the discussion forum here. Usually this is also less specific to RETURNN and more about how you would do it in general in TensorFlow. (Maybe we would need to add some new option in RETURNN, to make some of these things easier, or to allow this more easily for the user. In that case, make a feature request.)

  • Depending on the model, it might have specific explicit representations which are interesting:

    • Attention weights.
    • Probability distributions of various kinds, e.g. for the output, but also maybe for your latent variables.
    • If you have a latent variable (e.g. the alignment), estimate it, e.g. calculate the best values according to the model (Viterbi alignment).
    • Batch norm, layer norm, other norm statistics.
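
    For example, attention weights can simply be dumped and plotted as a matrix. A minimal sketch in plain Python (assuming you already fetched the weights for one sequence as a NumPy array of shape (decoder steps, encoder frames), e.g. via a forward pass; the file name is just a placeholder):

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical dump of attention weights for one sequence,
    # shape (decoder steps, encoder frames).
    att_weights = np.load("att_weights_seq0.npy")

    plt.imshow(att_weights, origin="lower", aspect="auto", cmap="Blues")
    plt.xlabel("encoder frame")
    plt.ylabel("decoder step")
    plt.colorbar(label="attention weight")
    plt.savefig("att_weights_seq0.png")
    ```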
  • Layer output activities, of the output layer and of hidden layers. Calculate some statistics (min, max, mean, var; also running statistics), as in the sketch below.
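
    A minimal sketch of such statistics with plain TensorFlow (`y` stands for whatever layer output tensor you fetch; RETURNN may have more direct options for this, check the documentation):

    ```python
    import tensorflow as tf

    def activation_stats(y, name="layer"):
        """Print simple statistics of a layer output tensor."""
        y = tf.convert_to_tensor(y)
        tf.print(name,
                 "min:", tf.reduce_min(y),
                 "max:", tf.reduce_max(y),
                 "mean:", tf.reduce_mean(y),
                 "var:", tf.math.reduce_variance(y))

    activation_stats(tf.random.normal([8, 100, 512]), name="encoder_out")
    ```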

  • Trainable parameter values. Calculate statistics.

    • Relative changes of the parameters across training: how much variance, the diff across epochs, etc. See e.g. compare_epochs.py, or the sketch below.
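
    A sketch of comparing parameters between two epochs directly from TF checkpoints (checkpoint file names are hypothetical; compare_epochs.py does something along these lines, more elaborate):

    ```python
    import numpy as np
    import tensorflow as tf

    def param_rel_change(ckpt_a, ckpt_b):
        """Relative change of each parameter between two checkpoints."""
        a = tf.train.load_checkpoint(ckpt_a)
        b = tf.train.load_checkpoint(ckpt_b)
        for name in sorted(a.get_variable_to_shape_map()):
            va, vb = a.get_tensor(name), b.get_tensor(name)
            if not np.issubdtype(np.asarray(va).dtype, np.floating):
                continue  # skip step counters, metadata, etc.
            rel = np.linalg.norm((vb - va).ravel()) / (np.linalg.norm(va.ravel()) + 1e-8)
            print(f"{name}: relative change {rel:.4f}")

    param_rel_change("net-model/network.042", "net-model/network.043")
    ```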
  • Gradients, both w.r.t. the layer activities and w.r.t. the parameters. Calculate statistics.
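
    A sketch with TF2-style eager code and tf.GradientTape (inside RETURNN's graph-mode net you would rather use tf.gradients or existing RETURNN options; the toy model here is just for illustration):

    ```python
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                                 tf.keras.layers.Dense(10)])
    x = tf.random.normal([16, 64])
    targets = tf.random.uniform([16], maxval=10, dtype=tf.int32)

    with tf.GradientTape() as tape:
        hidden = model.layers[0](x)
        tape.watch(hidden)  # also track the activation itself
        logits = model.layers[1](hidden)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=targets, logits=logits))

    grads = tape.gradient(loss, model.trainable_variables + [hidden])
    for var, g in zip(model.trainable_variables, grads[:-1]):
        tf.print(var.name, "grad mean abs:", tf.reduce_mean(tf.abs(g)))
    tf.print("hidden grad mean abs:", tf.reduce_mean(tf.abs(grads[-1])))
    ```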

  • Convergence rate, looking at the loss scores. Does it converge fast? If so, that's usually a good sign; if it then goes bad later, there might be another problem, e.g. overfitting, and such problems can usually be solved. If it does not converge fast, that's usually bad. You can help with pretraining, learning rate scheduling, curriculum learning, or other scheduling tricks like using less regularization initially. But maybe the model/method is actually not so good then?

  • Convergence behavior. For all the things you might watch (attention weights, statistics of parameters, gradients, whatever), watch how they develop over the training. Especially the very beginning of the training might be very relevant and interesting.

  • Overfitting. How much? Measure this by also taking a subset of the train data, which you evaluate the same way as your CV data (i.e. no chunking, no dropout, etc.), as sketched below. (The obvious solution is to use more regularization. But the question is of course which variant of regularization: e.g. just adding lots of random noise will reduce the overfitting, but the model will also perform much worse. So this needs to be done in a good way.)
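
    One way to set this up in a RETURNN config is an extra eval dataset over a fixed subset of the train data (often called "devtrain"). A rough sketch; the dataset class and option names here are assumptions, check the RETURNN documentation for your version:

    ```python
    # Sketch of a RETURNN config fragment. The idea: evaluate a fixed subset of
    # the train data exactly like the dev/CV data (no chunking, no dropout, etc.)
    # to measure the train/dev gap, i.e. the amount of overfitting.
    train = {"class": "HDFDataset", "files": ["data/train.hdf"],
             "seq_ordering": "random"}
    dev = {"class": "HDFDataset", "files": ["data/dev.hdf"]}
    eval_datasets = {
        "devtrain": {"class": "HDFDataset", "files": ["data/train.hdf"],
                     "fixed_random_subset": 3000},  # option name is an assumption
    }
    ```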

  • Add other auxiliary losses, just for debugging, i.e. losses which are not used for training. E.g. we always also have the frame error rate. Maybe you can measure other things: think about what behavior you expect from some internal aspect of your model, and measure that in some way.
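
    Frame error rate as a debug-only metric is cheap to compute; a minimal NumPy sketch (assuming framewise logits and targets):

    ```python
    import numpy as np

    def frame_error_rate(logits, targets):
        """Fraction of frames where argmax(logits) != target.
        logits: (time, num_classes), targets: (time,)."""
        preds = np.argmax(logits, axis=-1)
        return float(np.mean(preds != targets))

    logits = np.random.randn(100, 40)  # dummy data, just for the shapes
    targets = np.random.randint(0, 40, size=100)
    print("FER:", frame_error_rate(logits, targets))
    ```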

  • Is there some reference code which works? Import its model parameters into your implementation and verify that you get exactly the same outputs. Maybe you just have some bug in your implementation; you will easily find it this way. Or at least carefully study the other code and look for subtle details, e.g. default values of hyperparameters.
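
    A sketch of such an equivalence check (`ref_model` and `my_model` are hypothetical callables; the reference parameters are assumed to be already imported into your model):

    ```python
    import numpy as np

    def check_equivalence(ref_model, my_model, x):
        """Both arguments are hypothetical callables mapping the same input
        to an output array; my_model uses the imported reference parameters."""
        y_ref = np.asarray(ref_model(x))
        y_my = np.asarray(my_model(x))
        np.testing.assert_allclose(y_ref, y_my, rtol=1e-5, atol=1e-5)
        print("outputs match")

    # check_equivalence(ref_model, my_model,
    #                   np.random.randn(3, 50, 80).astype("float32"))
    ```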

  • Create an artificial toy task, i.e. synthetic data, where it definitely should work. If it does not work, you can usually better pin down where and why it fails. Also, you can make it very easy and then get away with a very small network, which can be trained fast on your local computer; that helps a lot for debugging and analysing. In principle, you should be able to construct the network (the weights) even by hand. Thinking about how the neural network could solve the task often leads to a better understanding of whether you maybe need some additional layer or other function inside your network.
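
    For example, a simple copy task as synthetic data, a minimal sketch:

    ```python
    import numpy as np

    def make_copy_task(num_seqs=1000, seq_len=10, num_symbols=8, seed=42):
        """Synthetic toy data: the target is just the input sequence itself.
        Any reasonable sequence model should solve this (nearly) perfectly."""
        rnd = np.random.RandomState(seed)
        inputs = rnd.randint(0, num_symbols, size=(num_seqs, seq_len))
        return inputs, inputs.copy()

    x, y = make_copy_task()
    print(x.shape, y.shape)  # (1000, 10) (1000, 10)
    ```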

  • Get it to converge on a very small subset of the train data (in the extreme case, a single sequence only). You should be able to quickly reach a loss close to 0.0. If the model is not able to do that, something is wrong.
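
    A sketch of this sanity check with a tiny Keras model on a single sequence (dummy data, just to show the idea):

    ```python
    import numpy as np
    import tensorflow as tf

    x = np.random.randn(1, 20, 8).astype("float32")  # one sequence
    y = np.random.randint(0, 5, size=(1, 20))        # framewise targets

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(5),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    hist = model.fit(x, y, epochs=300, verbose=0)
    print("final loss:", hist.history["loss"][-1])  # should be close to 0.0
    ```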

  • Test a trained model on some specifically prepared input. E.g. does it recognize individual words? Does it work on very long sequences (e.g. by concatenating them)?

  • Saliency maps / methods. E.g. calculate the gradients w.r.t. the input, to see the impact of individual frames or features. This is a whole research topic on its own, and you will find many publications on it. E.g. see Sanity Checks for Saliency Maps, 2018. A sketch follows below.
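
    A sketch of a plain gradient saliency map via tf.GradientTape (a random toy model here, just for the shapes; with a real model you would look at a trained checkpoint):

    ```python
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="tanh"),
                                 tf.keras.layers.Dense(10)])
    x = tf.random.normal([1, 100, 40])  # (batch, time, features)

    with tf.GradientTape() as tape:
        tape.watch(x)
        logits = model(x)  # (batch, time, classes)
        # Score of the top class, after averaging the logits over time.
        score = tf.reduce_max(tf.reduce_mean(logits, axis=1), axis=-1)

    saliency = tf.abs(tape.gradient(score, x))  # impact of each frame/feature
    print(saliency.shape)  # (1, 100, 40)
    ```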

Other resources: