Training on Synthetic Shapes (loss nan, precision nan, recall 0.0000) #189

Closed
aelsaer opened this issue Jan 30, 2021 · 9 comments

aelsaer commented Jan 30, 2021

Hi, I followed the instructions in #173 and I am now trying to run the first step:
python experiment.py train configs/magic-point_shapes.yaml magic-point_synth
Iteration 0 gives me
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451

but after that all I get is NaN values.

Here is my full log.

/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
experiment.py:153: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[01/30/2021 15:21:32 INFO] Running command TRAIN
[01/30/2021 15:21:32 INFO] Number of GPUs detected: 1
[01/30/2021 15:21:33 INFO] Extracting archive for primitive draw_lines.
[01/30/2021 15:21:36 INFO] Extracting archive for primitive draw_polygon.
[01/30/2021 15:21:38 INFO] Extracting archive for primitive draw_multiple_polygons.
[01/30/2021 15:21:40 INFO] Extracting archive for primitive draw_ellipses.
[01/30/2021 15:21:43 INFO] Extracting archive for primitive draw_star.
[01/30/2021 15:21:45 INFO] Extracting archive for primitive draw_checkerboard.
[01/30/2021 15:21:48 INFO] Extracting archive for primitive draw_stripes.
[01/30/2021 15:21:50 INFO] Extracting archive for primitive draw_cube.
[01/30/2021 15:21:52 INFO] Extracting archive for primitive gaussian_noise.
[01/30/2021 15:21:55 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
2021-01-30 15:21:56.304294: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-01-30 15:21:56.390506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-30 15:21:56.390821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
totalMemory: 7.78GiB freeMemory: 7.41GiB
2021-01-30 15:21:56.390834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:29.360087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:29.360104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:29.360108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:29.360172: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-01-30 15:22:29.360196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer. (this message is repeated 30 times)
2021-01-30 15:22:30.575953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:30.575984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:30.575988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:30.575990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:30.576031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:33 INFO] Start training
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451
/home/Projects/pythonProject1/SuperPoint/superpoint/models/base_model.py:387: RuntimeWarning: Mean of empty slice
metrics = {m: np.nanmean(metrics[m], axis=0) for m in metrics}
[01/30/2021 15:35:14 INFO] Iter 1000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:22 INFO] Iter 2000: loss nan, precision nan, recall 0.0000
... (loss nan, precision nan, recall 0.0000 repeats for every logged iteration up to Iter 21000)

Any suggestions? Thanks in advance.

rpautrat (Owner) commented

Hi,
I have never seen such behavior before. Have you tried to track down where the NaNs start appearing? That would tell us where the problem is.

Can you also visualize the synthetic dataset and its ground truth, to check that everything is fine there? You can do so in this notebook: https://github.com/rpautrat/SuperPoint/blob/master/notebooks/visualize_synthetic-shapes.ipynb


aelsaer commented Jan 30, 2021

Thank you for your fast reply. I managed to run the notebook and the results look fine (attached). I am now trying to figure out at which step the NaN values appear. So far I can see that at iteration 385/50000 the loss is around 3.63, and after that it becomes NaN. I will investigate this further and come back as soon as possible.

Thank you very much. I appreciate your help.

visualize_synthetic-shapes.pdf

rpautrat (Owner) commented

The ground truth in the notebook does indeed look fine.

The main question to answer is: do the NaNs first appear in the loss computation (e.g. after a division by zero), or do they appear in the backward pass (i.e. at one iteration the loss is fine, and at the next iteration the weights of the network contain NaNs)? The first case is usually easier to fix (avoid division by 0, log(0), etc.). The second usually means exploding or unstable gradients, but I would be surprised if that were the case here, as no one has ever reported such behavior.
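
One rough way to tell the two cases apart (just a sketch, TF1-style; sess and loss stand for whatever session and loss tensor the training loop already uses) would be to check, after a training step, whether the loss itself or the trainable weights contain NaNs:

import numpy as np
import tensorflow as tf

# One boolean per trainable variable: does it contain any NaN?
weight_has_nan = [tf.reduce_any(tf.is_nan(v)) for v in tf.trainable_variables()]

# Inside the training loop, after the train op has run:
loss_value, nan_flags = sess.run([loss, weight_has_nan])
if np.isnan(loss_value):
    print('NaN produced in the forward pass (loss computation)')
elif any(nan_flags):
    print('NaN in the weights, i.e. coming from the backward pass / optimizer update')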


aelsaer commented Feb 4, 2021

Hi again. I tried reinstalling everything, and I even tried inside Docker, but I always get the same issue. I also tried different TensorFlow versions.

I don't know if it helps, but I am attaching the whole self dictionary as it is right before training starts (L308):

self dictionary

Any suggestions? Maybe something is wrong because of my GPU (RTX 3070)?


rpautrat commented Feb 4, 2021

Hi, I have never tried to run this code on an RTX 3070, so I can't say. But did you try to locate the origin of the NaNs? For example, using the function tf.math.is_nan and checking at various points in the code where the NaNs start appearing.
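
Something along these lines could work (just a sketch; logits is only a placeholder name for whichever intermediate tensor you want to inspect):

import tensorflow as tf

# Abort with a clear error message as soon as NaNs or Infs show up in the tensor:
logits = tf.check_numerics(logits, message='NaN/Inf in detector logits')

# Or count the NaNs explicitly and print the count during the run:
nan_count = tf.reduce_sum(tf.cast(tf.math.is_nan(logits), tf.int32))
logits = tf.Print(logits, [nan_count], message='Number of NaNs in logits: ')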


aelsaer commented Feb 5, 2021

Yes, I am also investigating the loss. As far as I can see, after the first iteration the loss is fine (around 4.17), but during the evaluation the recall is NaN.

It's really bizarre: when I change my batch size to 128, the NaN values only appear after 3000 steps. And it's strange that it runs at all with that batch size, even though most users suggest small values of 1-2. Is there any other way to check my input data?


rpautrat commented Feb 7, 2021

A batch size of 1-2 is only needed for training the descriptor, which is quite expensive in terms of GPU memory. A batch size of 128 to train the detector alone should be fine.

If only the recall is NaN, it probably means that the denominator in

recall = tf.reduce_sum(pred * labels) / tf.reduce_sum(labels)

is zero at some point. That would mean a whole batch of images without a single labelled point in it, which does not seem realistic. You can check this by printing tf.reduce_sum(labels), for example. If that really is the case, you need to add a check so that when the denominator is 0 the recall is set to 1, for example.
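
A possible guard would look something like this (a sketch only, not the exact code of the repo; pred and labels are the predicted and ground-truth keypoint maps used above):

import tensorflow as tf

n_labels = tf.reduce_sum(labels)

# Log the denominator during the run to see whether it ever reaches zero:
n_labels = tf.Print(n_labels, [n_labels], message='Sum of labels in batch: ')

# If the batch really contains no labelled point, report a recall of 1 instead
# of dividing by zero:
recall = tf.where(tf.equal(n_labels, 0.),
                  tf.ones_like(n_labels),
                  tf.reduce_sum(pred * labels) / n_labels)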


aelsaer commented Feb 8, 2021

Thanks a lot for your reply. Unfortunately, as far as I can see (reading also here), there is a fundamental compatibility issue between the Ampere architecture (SM_86) and old CUDA + TensorFlow versions: NaN values appear as soon as the values are fetched from the tensors. Luckily, I found the PyTorch implementation and was able to train SuperPoint with it. Thank you again for your support. I will close the issue now.
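
For anyone hitting the same problem, a quick way to confirm what TensorFlow sees on the GPU side is a sketch like the following, using TensorFlow's device listing (an Ampere card reports compute capability 8.6, which pre-CUDA-11 TF 1.x builds were not compiled for):

from tensorflow.python.client import device_lib

# Print the description of every GPU TensorFlow can see; it includes the
# compute capability (8.6 for an RTX 3070).
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.physical_device_desc)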

aelsaer closed this as completed on Feb 8, 2021

rpautrat commented Feb 8, 2021

I see, that is unfortunate... Yes, the PyTorch version is quite similar to this repo, so it's good that you found it and were able to use it!
