Training on Synthetic Shapes (loss nan, precision nan, recall 0.0000) #189

Closed
aelsaer opened this issue Jan 30, 2021 · 9 comments

aelsaer commented Jan 30, 2021

Hi, I followed the instructions in #173 and I am now trying to run the first step:
python experiment.py train configs/magic-point_shapes.yaml magic-point_synth
Iteration 0 gives me
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451

but after that all I get is NaN values.

Here is my full log.

/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
experiment.py:153: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[01/30/2021 15:21:32 INFO] Running command TRAIN
[01/30/2021 15:21:32 INFO] Number of GPUs detected: 1
[01/30/2021 15:21:33 INFO] Extracting archive for primitive draw_lines.
[01/30/2021 15:21:36 INFO] Extracting archive for primitive draw_polygon.
[01/30/2021 15:21:38 INFO] Extracting archive for primitive draw_multiple_polygons.
[01/30/2021 15:21:40 INFO] Extracting archive for primitive draw_ellipses.
[01/30/2021 15:21:43 INFO] Extracting archive for primitive draw_star.
[01/30/2021 15:21:45 INFO] Extracting archive for primitive draw_checkerboard.
[01/30/2021 15:21:48 INFO] Extracting archive for primitive draw_stripes.
[01/30/2021 15:21:50 INFO] Extracting archive for primitive draw_cube.
[01/30/2021 15:21:52 INFO] Extracting archive for primitive gaussian_noise.
[01/30/2021 15:21:55 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
2021-01-30 15:21:56.304294: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-01-30 15:21:56.390506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-30 15:21:56.390821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
totalMemory: 7.78GiB freeMemory: 7.41GiB
2021-01-30 15:21:56.390834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:29.360087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:29.360104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:29.360108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:29.360172: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-01-30 15:22:29.360196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer. (this message is repeated 30 times)
2021-01-30 15:22:30.575953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:30.575984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:30.575988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:30.575990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:30.576031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:33 INFO] Start training
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451
/home/Projects/pythonProject1/SuperPoint/superpoint/models/base_model.py:387: RuntimeWarning: Mean of empty slice
metrics = {m: np.nanmean(metrics[m], axis=0) for m in metrics}
[01/30/2021 15:35:14 INFO] Iter 1000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:22 INFO] Iter 2000: loss nan, precision nan, recall 0.0000
... (loss nan, precision nan, recall 0.0000 repeats for every logged iteration up to Iter 21000)

Any suggestions? Thanks in advance.

rpautrat (Owner) commented

Hi,
I have never seen such behavior before. Have you tried to track down where the NaNs start appearing? That would tell us where the problem is.

Can you also visualize the synthetic dataset and its ground truth, to check that everything is fine there? You can do so in this notebook: https://github.com/rpautrat/SuperPoint/blob/master/notebooks/visualize_synthetic-shapes.ipynb


aelsaer commented Jan 30, 2021

Thank you for your fast reply. I managed to run the notebook and the results look fine (attached). I am now trying to figure out at which step the NaN values appear. So far I can see that at iteration 385/50000 the loss is around 3.63, and after that it becomes NaN. I will investigate this further and come back as soon as possible.

Thank you very much. I appreciate your help.

visualize_synthetic-shapes.pdf

rpautrat (Owner) commented

The ground truth in the notebook does indeed look fine.

The main question to answer is: do the NaNs first appear in the loss computation (e.g. after a division by zero), or do they appear in the backward pass (i.e. at one iteration the loss is fine, and at the next iteration the weights of the network contain NaNs)? The first case is usually easier to fix (avoid division by 0, log(0), etc.). The second usually means exploding or unstable gradients, but I would be surprised if that were the case here, as no one has ever reported such behavior.
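
One rough way to tell the two cases apart (just a sketch, TF1-style; sess and loss stand for whatever session and loss tensor the training loop already uses) would be to check, after a training step, whether the loss itself or the trainable weights contain NaNs:

import numpy as np
import tensorflow as tf

# One boolean per trainable variable: does it contain any NaN?
weight_has_nan = [tf.reduce_any(tf.is_nan(v)) for v in tf.trainable_variables()]

# Inside the training loop, after the train op has run:
loss_value, nan_flags = sess.run([loss, weight_has_nan])
if np.isnan(loss_value):
    print('NaN produced in the forward pass (loss computation)')
elif any(nan_flags):
    print('NaN in the weights, i.e. coming from the backward pass / optimizer update')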


aelsaer commented Feb 4, 2021

Hi again. I tried reinstalling everything, and I even tried inside Docker, but I always get the same issue. I also tried different TensorFlow versions.

I don't know if it helps, but I am attaching the whole self dictionary as it is right before training starts (L308):

self dictionary

Any suggestions? Maybe something is wrong because of my GPU (RTX 3070)?


rpautrat commented Feb 4, 2021

Hi, I have never tried to run this code on an RTX 3070, so I can't say. But did you try to locate the origin of the NaNs? For example, using the function tf.math.is_nan and checking at various points in the code where the NaNs start appearing.
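
Something along these lines could work (just a sketch; logits is only a placeholder name for whichever intermediate tensor you want to inspect):

import tensorflow as tf

# Abort with a clear error message as soon as NaNs or Infs show up in the tensor:
logits = tf.check_numerics(logits, message='NaN/Inf in detector logits')

# Or count the NaNs explicitly and print the count during the run:
nan_count = tf.reduce_sum(tf.cast(tf.math.is_nan(logits), tf.int32))
logits = tf.Print(logits, [nan_count], message='Number of NaNs in logits: ')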


aelsaer commented Feb 5, 2021

Yes, I am also investigating the loss. As far as I can see, after the first iteration the loss is fine (around 4.17), but during the evaluation the recall is NaN.

It's really bizarre: when I change my batch size to 128, the NaN values only appear after 3000 steps. And it's strange that it runs at all with that batch size, even though most users suggest small values of 1-2. Is there any other way to check my input data?


rpautrat commented Feb 7, 2021

A batch size of 1-2 is only needed for training the descriptor, which is quite expensive in terms of GPU memory. A batch size of 128 to train the detector alone should be fine.

If only the recall is NaN, it probably means that the denominator in

recall = tf.reduce_sum(pred * labels) / tf.reduce_sum(labels)

is zero at some point. That would mean a whole batch of images without a single labelled point in it, which does not seem realistic. You can check this by printing tf.reduce_sum(labels), for example. If that really is the case, you need to add a check so that when the denominator is 0 the recall is set to 1, for example.
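
A possible guard would look something like this (a sketch only, not the exact code of the repo; pred and labels are the predicted and ground-truth keypoint maps used above):

import tensorflow as tf

n_labels = tf.reduce_sum(labels)

# Log the denominator during the run to see whether it ever reaches zero:
n_labels = tf.Print(n_labels, [n_labels], message='Sum of labels in batch: ')

# If the batch really contains no labelled point, report a recall of 1 instead
# of dividing by zero:
recall = tf.where(tf.equal(n_labels, 0.),
                  tf.ones_like(n_labels),
                  tf.reduce_sum(pred * labels) / n_labels)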


aelsaer commented Feb 8, 2021

Thanks a lot for your reply. Unfortunately, as far as I can see (reading also here), there is a fundamental compatibility issue between the Ampere architecture (SM_86) and old CUDA + TensorFlow versions: NaN values appear as soon as the values are fetched from the tensors. Luckily, I found the PyTorch implementation and was able to train SuperPoint with it. Thank you again for your support. I will close the issue now.
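
For anyone hitting the same problem, a quick way to confirm what TensorFlow sees on the GPU side is a sketch like the following, using TensorFlow's device listing (an Ampere card reports compute capability 8.6, which pre-CUDA-11 TF 1.x builds were not compiled for):

from tensorflow.python.client import device_lib

# Print the description of every GPU TensorFlow can see; it includes the
# compute capability (8.6 for an RTX 3070).
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.physical_device_desc)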

aelsaer closed this as completed on Feb 8, 2021

rpautrat commented Feb 8, 2021

I see, that is unfortunate... Yes, the PyTorch version is quite similar to this repo, so it's good that you found it and were able to use it!
