Training on Synthetic Shapes (loss nan, precision nan, recall 0.0000) #189
Hi, can you also visualize the synthetic dataset and its ground truth, to check that everything is fine there? You can do so in this notebook: https://github.com/rpautrat/SuperPoint/blob/master/notebooks/visualize_synthetic-shapes.ipynb
Thank you for your fast reply. I managed to run the notebook and the results seem fine (attached). I am now trying to figure out at which step the nan values appear. I can see that at iteration 385/50000 the loss is ~3.63, and after that it becomes nan. I will investigate this further and come back asap. Thank you very much, I appreciate your help.
The ground truth in the notebook indeed looks fine. The main question to answer is: do the nans start in the loss computation (e.g. after a division by zero), or do they appear in the backward pass (so at some iteration your loss is fine, and at the next iteration the weights of the network contain nans)? The first option is usually easier to fix (avoid division by 0, log(0), etc.). The second one usually means exploding or unstable gradient computations, but I would be surprised if that were the case here, as no one has ever reported such behavior.
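For the first case, the usual guard in TF1-style code is to clamp denominators and log arguments with a small epsilon. A minimal runnable sketch with illustrative values (not code taken from the repo):

```python
import tensorflow as tf

eps = 1e-10  # small constant keeping divisions and logs finite

# Illustrative values; in the real code these would be loss intermediates.
numerator = tf.constant(0.)
denominator = tf.constant(0.)
probabilities = tf.constant([0., 0.5, 1.])

ratio = numerator / tf.maximum(denominator, eps)   # avoids 0 / 0
safe_log = tf.log(tf.maximum(probabilities, eps))  # avoids log(0)

with tf.Session() as sess:
    print(sess.run([ratio, safe_log]))  # finite values, no nan
```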
Hi again, I tried to reinstall everything, I even tried within docker, but I always get the same issue. I also tried with different tf versions. I don't know if it helps, but I am attaching the whole dictionary as it is before it starts training L308. Any suggestion? Maybe something is wrong because of my gpu? (RTX 3070)
Hi, I have never tried to run this code on an RTX 3070, so I can't say. But did you try locating the origin of the nans? For example using the function
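One function that serves this purpose in TF1 is tf.add_check_numerics_ops, which adds an assertion on every floating-point tensor in the graph; the session then fails with an InvalidArgumentError naming the first op that produces a nan or inf. A self-contained sketch with a toy loss (only for illustration, not the repo's model):

```python
import tensorflow as tf

# Toy stand-in graph; in the real code this would be the magic-point model.
x = tf.Variable(1.0)
loss = tf.log(x)  # becomes nan as soon as x goes negative
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Must be called after the whole graph (including gradients) is built.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        # Fails with InvalidArgumentError at the first op yielding nan/inf,
        # which pinpoints where the nans originate.
        sess.run([train_op, check_op])
```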
Yes, I am also investigating the loss. As far as I can see, after the first iteration the loss is fine (around 4.17), but during the evaluation the recall is nan. It's really bizarre: when I change my batch size to 128, nan values appear after 3000 steps. And it's strange that it runs with that batch size at all, even though most users propose small values of 1-2. Is there any other way to check my input data?
A batch size of 1-2 is for training the descriptor, which is quite expensive in terms of GPU memory. But 128 to train the detector only should be fine. If only recall is nan, it probably means that its denominator, tf.reduce_sum(labels) for example, is zero. If that is really the case, then you need to add a check: when the denominator is 0, the recall should be set to 1, for example.
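A minimal sketch of that check, assuming pred and labels are binary float maps as in the detector metrics (names and example values here are illustrative, not code from the repo):

```python
import tensorflow as tf

# Illustrative binary maps; in the real code these come from the detector head.
pred = tf.constant([1., 0., 1., 0.])
labels = tf.constant([0., 0., 0., 0.])  # an all-zero label map triggers 0/0

tp = tf.reduce_sum(pred * labels)  # true positives
n_pos = tf.reduce_sum(labels)      # number of positive labels

# tp / n_pos would be 0/0 = nan here; define recall as 1 when there are no
# positives, and clamp the denominator so the other branch stays finite too.
recall = tf.where(tf.equal(n_pos, 0.),
                  tf.constant(1.),
                  tp / tf.maximum(n_pos, 1.))

with tf.Session() as sess:
    print(sess.run(recall))  # 1.0 instead of nan
```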
Thanks a lot for your reply. Unfortunately, as far as I can see (reading also here), there is a fundamental compatibility issue between the Ampere architecture (SM_86) and old CUDA + tf versions: nan values appear instantly when the values are fetched from the tensors. Luckily, I found the PyTorch implementation and I was able to train SuperPoint with it. Thank you again for your support. I will close the issue now.
I see, that is unfortunate... Yes, the PyTorch version is quite similar to this repo, so it's good that you found it and were able to use it!
Hi, I followed the instructions of #173 and now I am trying to run the first step:
python experiment.py train configs/magic-point_shapes.yaml magic-point_synth
Iteration 0 gives me
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451
but then all I get is nan values.
Here is my full log:
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/anaconda3/envs/pythonProject1/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
experiment.py:153: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[01/30/2021 15:21:32 INFO] Running command TRAIN
[01/30/2021 15:21:32 INFO] Number of GPUs detected: 1
[01/30/2021 15:21:33 INFO] Extracting archive for primitive draw_lines.
[01/30/2021 15:21:36 INFO] Extracting archive for primitive draw_polygon.
[01/30/2021 15:21:38 INFO] Extracting archive for primitive draw_multiple_polygons.
[01/30/2021 15:21:40 INFO] Extracting archive for primitive draw_ellipses.
[01/30/2021 15:21:43 INFO] Extracting archive for primitive draw_star.
[01/30/2021 15:21:45 INFO] Extracting archive for primitive draw_checkerboard.
[01/30/2021 15:21:48 INFO] Extracting archive for primitive draw_stripes.
[01/30/2021 15:21:50 INFO] Extracting archive for primitive draw_cube.
[01/30/2021 15:21:52 INFO] Extracting archive for primitive gaussian_noise.
[01/30/2021 15:21:55 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
[01/30/2021 15:21:56 INFO] Caching data, fist access will take some time.
2021-01-30 15:21:56.304294: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-01-30 15:21:56.390506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-30 15:21:56.390821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
totalMemory: 7.78GiB freeMemory: 7.41GiB
2021-01-30 15:21:56.390834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:29.360087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:29.360104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:29.360108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:29.360172: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-01-30 15:22:29.360196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:29 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
[01/30/2021 15:22:30 INFO] Scale of 0 disables regularizer.
2021-01-30 15:22:30.575953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-01-30 15:22:30.575984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-30 15:22:30.575988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-01-30 15:22:30.575990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-01-30 15:22:30.576031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
[01/30/2021 15:22:33 INFO] Start training
[01/30/2021 15:35:05 INFO] Iter 0: loss 4.1807, precision 0.0006, recall 0.0451
/home/Projects/pythonProject1/SuperPoint/superpoint/models/base_model.py:387: RuntimeWarning: Mean of empty slice
metrics = {m: np.nanmean(metrics[m], axis=0) for m in metrics}
[01/30/2021 15:35:14 INFO] Iter 1000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:22 INFO] Iter 2000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:31 INFO] Iter 3000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:40 INFO] Iter 4000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:49 INFO] Iter 5000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:35:58 INFO] Iter 6000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:07 INFO] Iter 7000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:15 INFO] Iter 8000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:24 INFO] Iter 9000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:33 INFO] Iter 10000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:42 INFO] Iter 11000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:36:51 INFO] Iter 12000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:00 INFO] Iter 13000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:08 INFO] Iter 14000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:17 INFO] Iter 15000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:26 INFO] Iter 16000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:35 INFO] Iter 17000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:44 INFO] Iter 18000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:37:53 INFO] Iter 19000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:38:02 INFO] Iter 20000: loss nan, precision nan, recall 0.0000
[01/30/2021 15:38:12 INFO] Iter 21000: loss nan, precision nan, recall 0.0000
Any suggestions? Thanks in advance.