train_loss is not found #26

Closed
ilkarman opened this issue Mar 1, 2020 · 2 comments

@ilkarman

ilkarman commented Mar 1, 2020

I run the plot_surface code like so:

    /usr/bin/python -u /local/mnt/workspace/ikarmano/Gitlab/sagd/loss-landscape/plot_surface.py --cuda \
    --x=-1:1:51 --y=-1:1:51 --model_file models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t \
    --dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn --plot

And it seems to calculate the loss fine:

    Evaluating rank 2  90/2601  (3.5%)  coord=[ 0.56 -0.96] 	train_loss= 21.470 	train_acc=14.54 	time=5.28 	sync=0.00
    Evaluating rank 2  91/2601  (3.5%)  coord=[ 0.6  -0.96] 	train_loss= 22.225 	train_acc=14.10 	time=5.65 	sync=0.00
    Evaluating rank 2  92/2601  (3.5%)  coord=[ 0.64 -0.96] 	train_loss= 23.044 	train_acc=13.67 	time=5.92 	sync=0.00
    Evaluating rank 2  93/2601  (3.6%)  coord=[ 0.68 -0.96] 	train_loss= 23.935 	train_acc=13.33 	time=5.71 	sync=0.00
    Evaluating rank 2  94/2601  (3.6%)  coord=[ 0.72 -0.96] 	train_loss= 24.905 	train_acc=13.02 	time=5.65 	sync=0.00
    Evaluating rank 2  95/2601  (3.7%)  coord=[ 0.76 -0.96] 	train_loss= 25.958 	train_acc=12.66 	time=5.50 	sync=0.00
    Evaluating rank 2  96/2601  (3.7%)  coord=[ 0.8  -0.96] 	train_loss= 27.100 	train_acc=12.37 	time=5.99 	sync=0.00
    Evaluating rank 2  97/2601  (3.7%)  coord=[ 0.84 -0.96] 	train_loss= 28.334 	train_acc=12.13 	time=5.85 	sync=0.00
    Evaluating rank 2  98/2601  (3.8%)  coord=[ 0.88 -0.96] 	train_loss= 29.666 	train_acc=11.91 	time=5.71 	sync=0.00
    Evaluating rank 2  99/2601  (3.8%)  coord=[ 0.92 -0.96] 	train_loss= 31.101 	train_acc=11.69 	time=5.58 	sync=0.00

However, the plot functions do not work because 'train_loss' is not found:

    train_loss is not found in ../models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5_[-1.0,1.0,51]x[-1.0,1.0,51].h5

And if I print the keys(), it's just:

    <KeysViewHDF5 ['dir_file', 'xcoordinates', 'ycoordinates']>

Not sure what I'm doing wrong?
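
The key check described above can be sketched as a small helper. This is only an illustration: `missing_keys` is a hypothetical name, not part of the loss-landscape code, and it assumes the surface datasets are stored under the same names as the log columns (`train_loss`, `train_acc`).

```python
def missing_keys(available, required=("train_loss", "train_acc")):
    """Return the required dataset names that are absent from `available`.

    `available` can be any iterable of key names, e.g. the keys of an
    opened HDF5 file. The default `required` names are an assumption
    based on the columns printed in the evaluation log.
    """
    available = set(available)
    return [name for name in required if name not in available]


# Keys reported in the issue: only the direction file and the grid were written.
keys_from_issue = ["dir_file", "xcoordinates", "ycoordinates"]
print(missing_keys(keys_from_issue))  # -> ['train_loss', 'train_acc']
```

With h5py installed, the same helper could be applied to a real surface file by opening it with `h5py.File(path, "r")` and passing `f.keys()`.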

@ljk628
Collaborator

ljk628 commented Mar 1, 2020

Hi @ilkarman, by default our code saves the surface values from the rank 0 process, after collecting the values calculated by multiple processes, as you can see in https://github.com/tomgoldstein/loss-landscape/blob/master/plot_surface.py#L88 and https://github.com/tomgoldstein/loss-landscape/blob/master/plot_surface.py#L136.

It seems that you are not using MPI, yet your process rank is 2, which might be why the surface values are not saved into the h5 file. An easy fix would be to change the default rank value to 2, or to figure out why the rank is not zero.
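
The rank-gated save described above can be illustrated with a minimal sketch. It is not the code in `plot_surface.py`: `save_surface` and `writer_rank` are hypothetical names, and a plain dict stands in for the h5 file.

```python
def save_surface(rank, losses, store, writer_rank=0):
    """Persist collected losses into `store` only on the writer rank.

    `store` is a plain dict standing in for the HDF5 file (an assumption
    made for illustration). Returns True if this process wrote the data.
    """
    if rank != writer_rank:
        # Non-writer ranks still compute losses but never persist them.
        return False
    store["train_loss"] = list(losses)
    return True


losses = [21.470, 22.225, 23.044]  # first three values from the log in the issue

# A process that believes it is rank 2 skips the write, so only the grid remains.
store_rank2 = {"xcoordinates": [], "ycoordinates": []}
print(save_surface(2, losses, store_rank2), sorted(store_rank2))
# -> False ['xcoordinates', 'ycoordinates']

# The rank 0 process writes the surface values as expected.
store_rank0 = {"xcoordinates": [], "ycoordinates": []}
print(save_surface(0, losses, store_rank0), sorted(store_rank0))
# -> True ['train_loss', 'xcoordinates', 'ycoordinates']
```

In a single-process run the rank is expected to be 0, so a log line showing `rank 2` is a hint that the rank argument received the wrong value somewhere upstream.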

@ilkarman
Author

ilkarman commented Mar 2, 2020

Thank you! One of the params for crunch() was being overwritten instead of rank.

ilkarman closed this as completed Mar 2, 2020