4DS: NaNs in inputs #5

Closed
snjsnj opened this issue Mar 12, 2021 · 4 comments

snjsnj commented Mar 12, 2021

@lisurui6

Brief Description:

While training the 4DS network (demo_validateDL.py) using PH data from the cardiac drive, a 'NaNs detected in inputs, please correct or drop' error is raised:

[screenshot of the error output]

About the screenshot:

  1. I have suppressed the network training progress output.
  2. The training starts fine with step (1).
  3. For the first bootstrap sample, the error is observed in step (2a), the hyperparameter search (raised from concordance.py).
  4. The error is 'NaNs detected in inputs, please correct or drop'.

My understanding:

During the hyperparameter search, the inputs are becoming NaNs.
Possible causes:

  1. Exploding/vanishing gradient problem
  2. The solver not being seeded properly
  3. The inputs passed to DL_single_run() inside hypersearch_DL() needing to be changed (a quick NaN check is sketched below)
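
To help narrow down which of these it is, a minimal NaN check could be dropped in just before the concordance computation. This is only a sketch: the variable names preds, times and events are placeholders, not names from the repo.

import numpy as np

def report_nans(name, arr):
    # Print how many NaNs an array contains, so the offending input is obvious.
    arr = np.asarray(arr, dtype=float)
    n_nan = int(np.isnan(arr).sum())
    print(f"{name}: {n_nan} NaN(s) out of {arr.size} values")
    return n_nan

# Example usage just before the concordance index is computed:
# report_nans("predictions", preds)
# report_nans("event times", times)
# report_nans("event indicator", events)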

Reproducing the error:

  1. screen

  2. Run the latest docker image, mounting the volume that contains the file $PWD/cardiac/for_ghalib/backup/home/gbello@isd.csc.mrc.ac.uk/gbello/RWorkspace1/mesh_motiondescriptor_PHAB0267Adecmesh_CTEPH302.pkl:

nvidia-docker run -ti --rm -v $PWD/cardiac/for_ghalib/backup/home/gbello@isd.csc.mrc.ac.uk/gbello/RWorkspace1:/data lisurui6/4dsurvival-gpu:latest

  3. cd demo

  4. CUDA_VISIBLE_DEVICES='available_device' python3 demo_validateDL.py

  5. Wait.
@lisurui6
Collaborator

How long do I have to wait before the NaN error occurs?

lisurui6 self-assigned this Mar 12, 2021
@lisurui6
Collaborator

I would need more information. Does the loss in step 2a become NaN at some point? What iteration?
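
One way to pin that down, as a rough sketch: stop training as soon as the loss becomes non-finite and log the batch. This assumes the repo's training code can be modified to pass a callbacks list to model.fit(); the fit call below is illustrative, not taken from the repo.

import numpy as np
from keras.callbacks import Callback, TerminateOnNaN

class LogNaNBatch(Callback):
    # Record the batch at which the loss first becomes NaN or inf.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and not np.isfinite(loss):
            print(f"Non-finite loss {loss} at batch {batch}")

# model.fit(x, y, epochs=..., callbacks=[LogNaNBatch(), TerminateOnNaN()])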

@lisurui6
Collaborator

A few points to note:

  1. I don't think such an extensive hyperparameter search is useful in practice, and dropping it would make the code much more efficient to run. Once the hyperparameters are found in step 1a, they can be reused in the bootstrap sampling steps. Things like dropout rate and number of hidden units are not that sensitive, and the learning rate isn't very sensitive either if the Adam optimiser is used, so there is little point in repeating the search.
  2. Most likely the NaN comes from exploding gradients. Adding batch norm layers is definitely recommended and should solve this issue. Note that they go after the linear layer and before the nonlinear activation (and of course not in the output layer); see the sketch after this list.
  3. The repo uses keras==2.2.2 and TF1. I highly recommend upgrading to at least TF2, and ideally switching to PyTorch; it will be much easier to debug.
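
A minimal sketch of that placement in Keras (the layer sizes, dropout rate, and input dimension here are placeholders, not the repo's values):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout

model = Sequential([
    Dense(64, input_dim=100),       # linear layer, activation deferred
    BatchNormalization(),           # normalise before the nonlinearity
    Activation('relu'),
    Dropout(0.1),
    Dense(1, activation='linear'),  # output layer: no batch norm here
])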

@doregan
Contributor

doregan commented Mar 17, 2021

The lifelines function had a bugfix in April 2019: it had been silently converting NaNs in the event vector to True. In the current version these now produce an error, which is what we were picking up when we ran the code. Reducing the dropout search range to [0.1, 0.5] allows the code to run with lifelines v0.23.9 and achieves similar discrimination performance.
[screenshot: discrimination performance results]
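
For anyone hitting the same thing, a minimal sketch of a guard on the event vector, so NaNs are caught explicitly before they reach lifelines (the variable name events is a placeholder, not a name from the repo):

import numpy as np

def check_event_vector(events):
    # Raise early if the event indicator contains NaNs, instead of letting
    # them be silently coerced (old lifelines) or fail deep inside the library.
    events = np.asarray(events, dtype=float)
    if np.isnan(events).any():
        bad = np.flatnonzero(np.isnan(events))
        raise ValueError(f"NaNs in event vector at indices {bad[:10]}")
    return events.astype(bool)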

doregan closed this as completed Mar 17, 2021