4DS: NaNs in inputs #5

Closed
snjsnj opened this issue Mar 12, 2021 · 4 comments

snjsnj commented Mar 12, 2021

@lisurui6

Brief Description:

While training the 4DS network (demo_validateDL.py) using PH data from the cardiac drive, a 'NaNs detected in inputs, please correct or drop' error is raised:

[screenshot of the error output]

About the screenshot:

  1. I have suppressed the network training progress output.
  2. The training starts fine with step (1).
  3. For the first bootstrap sample, the error is observed in step (2a), the hyperparameter search (raised from concordance.py).
  4. The error is 'NaNs detected in inputs, please correct or drop'.

My understanding:

During the hyperparameter search, the inputs are becoming NaNs.
Possible causes:

  1. Exploding/vanishing gradient problem
  2. The solver not being seeded properly
  3. The inputs passed to DL_single_run() inside hypersearch_DL() needing to be changed (a quick NaN check is sketched below)
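
To help narrow down which of these it is, a minimal NaN check could be dropped in just before the concordance computation. This is only a sketch: the variable names preds, times and events are placeholders, not names from the repo.

import numpy as np

def report_nans(name, arr):
    # Print how many NaNs an array contains, so the offending input is obvious.
    arr = np.asarray(arr, dtype=float)
    n_nan = int(np.isnan(arr).sum())
    print(f"{name}: {n_nan} NaN(s) out of {arr.size} values")
    return n_nan

# Example usage just before the concordance index is computed:
# report_nans("predictions", preds)
# report_nans("event times", times)
# report_nans("event indicator", events)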

Reproducing the error:

  1. screen

  2. Run the latest docker image, mounting the volume that contains the file $PWD/cardiac/for_ghalib/backup/home/gbello@isd.csc.mrc.ac.uk/gbello/RWorkspace1/mesh_motiondescriptor_PHAB0267Adecmesh_CTEPH302.pkl:

nvidia-docker run -ti --rm -v $PWD/cardiac/for_ghalib/backup/home/gbello@isd.csc.mrc.ac.uk/gbello/RWorkspace1:/data lisurui6/4dsurvival-gpu:latest

  3. cd demo

  4. CUDA_VISIBLE_DEVICES='available_device' python3 demo_validateDL.py

  5. Wait.
@lisurui6
Collaborator

How long do I have to wait before the NaN error occurs?

lisurui6 self-assigned this Mar 12, 2021
@lisurui6
Collaborator

I would need more information. Does the loss in step 2a become NaN at some point? What iteration?
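
One way to pin that down, as a rough sketch: stop training as soon as the loss becomes non-finite and log the batch. This assumes the repo's training code can be modified to pass a callbacks list to model.fit(); the fit call below is illustrative, not taken from the repo.

import numpy as np
from keras.callbacks import Callback, TerminateOnNaN

class LogNaNBatch(Callback):
    # Record the batch at which the loss first becomes NaN or inf.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and not np.isfinite(loss):
            print(f"Non-finite loss {loss} at batch {batch}")

# model.fit(x, y, epochs=..., callbacks=[LogNaNBatch(), TerminateOnNaN()])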

@lisurui6
Collaborator

A few points to note:

  1. I don't think such an extensive hyperparameter search is useful in practice, and dropping it would make the code much more efficient to run. Once the hyperparameters are found in step 1a, they can be reused in the bootstrap sampling steps. Things like dropout rate and number of hidden units are not that sensitive, and the learning rate isn't very sensitive either if the Adam optimiser is used, so there is little point in repeating the search.
  2. Most likely the NaN comes from exploding gradients. Adding batch norm layers is definitely recommended and should solve this issue. Note that they go after the linear layer and before the nonlinear activation (and of course not in the output layer); see the sketch after this list.
  3. The repo uses keras==2.2.2 and TF1. I highly recommend upgrading to at least TF2, and ideally switching to PyTorch; it will be much easier to debug.
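
A minimal sketch of that placement in Keras (the layer sizes, dropout rate, and input dimension here are placeholders, not the repo's values):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout

model = Sequential([
    Dense(64, input_dim=100),       # linear layer, activation deferred
    BatchNormalization(),           # normalise before the nonlinearity
    Activation('relu'),
    Dropout(0.1),
    Dense(1, activation='linear'),  # output layer: no batch norm here
])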

@doregan
Contributor

doregan commented Mar 17, 2021

The lifelines function had a bugfix in April 2019: it had been silently converting NaNs in the event vector to True. In the current version these now produce an error, which is what we were picking up when we ran the code. Reducing the dropout search range to [0.1, 0.5] allows the code to run with lifelines v0.23.9 and achieves similar discrimination performance.
[screenshot: discrimination performance results]
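
For anyone hitting the same thing, a minimal sketch of a guard on the event vector, so NaNs are caught explicitly before they reach lifelines (the variable name events is a placeholder, not a name from the repo):

import numpy as np

def check_event_vector(events):
    # Raise early if the event indicator contains NaNs, instead of letting
    # them be silently coerced (old lifelines) or fail deep inside the library.
    events = np.asarray(events, dtype=float)
    if np.isnan(events).any():
        bad = np.flatnonzero(np.isnan(events))
        raise ValueError(f"NaNs in event vector at indices {bad[:10]}")
    return events.astype(bool)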

doregan closed this as completed Mar 17, 2021