bbox_transform.py:48: RuntimeWarning: overflow encountered in exp ... #65
I have the same issue here.
After many tries and tests, I found it is related to RNG_SEED. For my dataset, using the VGG16 pre-trained model with the default RNG_SEED = 3, it always leads to instability. Symptom: if you print out dw.max() and dh.max() during training, you can see them blow up. Using the default random seed, I still had a very low chance of making it through the whole training process successfully even when I tried many times. So my conclusion is: try changing the RNG_SEED value if you meet the same problem, and watch out for the maximum values of dw and dh. Btw, I tried to change the batch size and the RPN batch size in the yml file; after Caffe loaded, it did show the batch size values I set, but the GPU memory usage seems the same. Is that normal?
@ZhengRui I have met a similar problem too. Following your instruction, I printed out dw.max() and dh.max(); they come out around 3 and 4 respectively, but then they suddenly become NaN without any gradual change. It is confusing!
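For anyone who wants to reproduce this check: in py-faster-rcnn's lib/fast_rcnn/bbox_transform.py the deltas are laid out as (dx, dy, dw, dh) repeated per class, and the overflow comes from the exp() applied to dw and dh. A minimal debug helper one could call just before those exp() calls (the function name and the ~10 threshold in the comment are assumptions, not part of the repo):

    import numpy as np

    def log_delta_stats(deltas):
        # deltas: (N, 4*K) array laid out as (dx, dy, dw, dh) per class,
        # as in bbox_transform_inv(); dw/dh live in columns 2::4 and 3::4.
        dw = deltas[:, 2::4]
        dh = deltas[:, 3::4]
        # exp(dw) scales the anchor width, so values creeping past ~10
        # mean the "overflow encountered in exp" warning is imminent.
        print('dw.max() = %.4f, dh.max() = %.4f' % (dw.max(), dh.max()))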
RNG_SEED is just the seed for the random number generator; you can change it to whatever number works. So change it to some number with which you have a lower chance of hitting this issue. I don't think it will impact the speed of convergence or anything else; it's just a random seed.
I got this issue, too. Since I use training images that are cropped tightly around the target objects, I suspect that bounding boxes close to the image size cause this problem. I haven't found a solution yet.
You can try to pad zeros around your training images before feeding them to the network, so that the ratio of your target objects to the image is smaller. The network always rescales your input images to around 600x1000 or 1000x600 in the first step, so simply downsizing images won't work; you have to do padding.
I also encountered this problem recently; here is my suggestion for finding where the bug comes from. Most importantly, I found this bug shows up when erroneous ROI bounding-box information is loaded into the db (e.g. a coordinate of 65535), which is caused by a faulty implementation of the annotation loading.
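Building on that: a quick sanity check over the parsed boxes can catch this before training. The value 65535 is exactly what you get when a uint16 coordinate of 0 is decremented by the common "minus 1" step in VOC-style loaders. A minimal sketch (the helper name is an assumption; boxes are [x1, y1, x2, y2] rows):

    import numpy as np

    def check_boxes(boxes, im_width, im_height):
        # boxes: (N, 4) array of [x1, y1, x2, y2]; flags the classic
        # symptoms of a broken annotation loader.
        boxes = np.asarray(boxes, dtype=np.float64)
        assert (boxes[:, :2] >= 0).all(), 'negative coordinates'
        assert (boxes[:, 2] > boxes[:, 0]).all(), 'x2 <= x1 (0 - 1 underflow?)'
        assert (boxes[:, 3] > boxes[:, 1]).all(), 'y2 <= y1 (0 - 1 underflow?)'
        # a coordinate like 65535 lands far outside any real image
        assert (boxes[:, 2] < im_width).all() and (boxes[:, 3] < im_height).all(), \
            'box outside image (uint16 underflow, e.g. 65535)'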
Has anyone solved this? I tried @RyanLiuNtust's and @ZhengRui's methods, but they don't work for me. I printed out ws and hs, and they come out as NaN.
A possible solution could be to decrease the base learning rate in solver.prototxt.
I did try to change the base_lr value and now the NaN values have disappeared.
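For reference, that field lives in Caffe's solver.prototxt; the concrete value below is an assumption, just a few times smaller than the 0.001 default the stock py-faster-rcnn end-to-end solvers ship with:

    # solver.prototxt (Caffe): lower the base learning rate if the loss goes NaN
    base_lr: 0.0001  # assumed value; the stock VGG16 end2end solver uses 0.001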
A question... @ZhengRui what do you mean by "You can try to pad zeros around your training images"? Can you give a quick example? Thanks!
Is there a Caffe argument to zero-pad all the images?
I am not sure if Caffe has it or not, but I did it myself as a data-augmentation step before everything else:

    import numpy as np

    def paddingzeros(im, desMin, desMax):
        # Zero-pad an HxWxC image so the shorter side reaches desMin and
        # the longer side reaches desMax, splitting the padding evenly
        # between both sides (the +1 handles odd differences).
        if im.shape[0] <= im.shape[1]:
            if im.shape[0] < desMin:
                im = np.pad(im, (((desMin - im.shape[0]) // 2, (desMin + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
            if im.shape[1] < desMax:
                im = np.pad(im, ((0, 0), ((desMax - im.shape[1]) // 2, (desMax + 1 - im.shape[1]) // 2), (0, 0)), 'constant')
        else:
            if im.shape[0] < desMax:
                im = np.pad(im, (((desMax - im.shape[0]) // 2, (desMax + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
            if im.shape[1] < desMin:
                im = np.pad(im, ((0, 0), ((desMin - im.shape[1]) // 2, (desMin + 1 - im.shape[1]) // 2), (0, 0)), 'constant')
        print('after padding:', im.shape)
        return im

    im = paddingzeros(im, 600, 1000)
Doing this would also require changing the values of the bounding boxes in the annotation XML files, right?
Yes, you also have to adjust the bounding-box annotations for the padded images by adding the offsets in the width and height directions.
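A minimal sketch of that annotation fix, assuming [x1, y1, x2, y2] boxes and the symmetric padding from paddingzeros above (the helper name is an assumption):

    import numpy as np

    def shift_boxes(boxes, pad_left, pad_top):
        # pad_left / pad_top are the zeros added on the left / top, e.g.
        # (desMin - height) // 2 from the padding step above.
        boxes = np.asarray(boxes, dtype=np.float64).copy()
        boxes[:, [0, 2]] += pad_left  # shift x1 and x2
        boxes[:, [1, 3]] += pad_top   # shift y1 and y2
        return boxes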
Can someone explain the intuition behind this issue? Is the backpropagation algorithm oscillating with increasingly bigger steps around the optimum, eventually causing overflows in Python? I'm trying to get a better understanding of what is going wrong here :).
For me it was @RyanLiuNtust's method that worked: fixing the ground-truth XML files so coordinates are 1-based.
@neuleaf Have you solved the problem? I met the same problem as yours; my output is also NaN. I have checked the outputs of bottom[0], bottom[1], and bottom[2] in proposal_layer.py; only bottom[2] has values. And I have tried the solution of @azamattokhtaev, but it doesn't help.
Where do I change the RNG_SEED value?
For me the issue was resolved by lowering the learning rate in solver.prototxt. The RNG_SEED can be changed in the config file (see below).
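To be concrete about where: the default lives in lib/fast_rcnn/config.py as __C.RNG_SEED = 3, and any yml file passed to the training script via --cfg can override it. A minimal override (the value 17 is just the example used earlier in this thread):

    # your experiment cfg yml; overrides __C.RNG_SEED = 3 in lib/fast_rcnn/config.py
    RNG_SEED: 17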
@ZhengRui @fernandorovai Have you solved the problem? I am also running into the same issue.
Hey, I changed the lr to 0.00001, removed the minus 1 while reading the annotations, and changed the RNG seed to 17. I still keep getting the error.
I also encountered this problem, and changing RNG_SEED to 4 solved it for me.
I have this problem when running on my own dataset; things to check are:
@skyuuka Better to check your load_annotation function. How is it written?
Since bbox_transform_inv is just for drawing the bbox on the image, it has nothing to do with training. Reducing the learning rate can avoid the occurrence of NaN.
The issue that causes this NaN in the loss is that the dw/dh deltas fed to exp() are unbounded. The new Detectron code released has a fix for this: update your config file to add a clip constant, and add clamping lines just before the exp ops in bbox_transform.py (see the sketch below). It works with any RNG_SEED and very high learning rates (lr = 0.01).
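For reference, the Detectron fix referred to here clamps dw/dh before exponentiation so exp() can never overflow. A sketch (the log(1000/16) constant matches Detectron's BBOX_XFORM_CLIP; the helper wrapper and its wiring into py-faster-rcnn's bbox_transform_inv are assumptions):

    import numpy as np

    # Detectron caps the log-space size deltas so exp(dw) can scale a box
    # by at most 1000/16 = 62.5x, which keeps exp() finite.
    BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

    def clip_deltas(dw, dh):
        # call just before the np.exp() lines in bbox_transform_inv()
        dw = np.minimum(dw, BBOX_XFORM_CLIP)
        dh = np.minimum(dh, BBOX_XFORM_CLIP)
        return dw, dh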
@ZhengRui @pyoguy @MenglaiWang @LiberiFatali @RyanLiuNtust
I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
I tried changing the lr from 0.001 to 0.0001, but it didn't work. I also changed RNG_SEED, and that didn't work either.
@meetshah1995 In CPU-only mode, after applying your solution, the problem still exists, though there are no NaN values.
@MenglaiWang I use multi-GPU mode with your solution; the problem still exists.
@kingchenchina Seeing so many 1., the problem is very likely that the generated proposals are all just one pixel in length; then the min-size filter in the proposal layer throws everything away and leaves an empty set. You can use the following strategy to avoid empty proposals:

    keep = _filter_boxes(proposals, min_size * im_info[2])
    if len(keep) != 0:
        proposals = proposals[keep, :]
        scores = scores[keep]

But then the loss becomes NaN, so turning down the learning rate would be a better approach.
I have met a similar problem too. Following your instruction, I printed out dw.max() and dh.max(); they come out around 7489.9507 and 11519.379 respectively. I can't understand why there are such large numbers. I hope someone can give us some advice.
I solved my 'Floating point exception (core dumped)' by modifying the is_valid function inside filter_roidb in da-faster-rcnn-master/lib/fast_rcnn/train.py.
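The upstream is_valid in filter_roidb keeps an image if it has at least one foreground or one background RoI. One plausible hardening along these lines (an assumed variant, not necessarily the commenter's exact change) is to require both, so no image reaches the sampler with an empty fg or bg set:

    import numpy as np

    def is_valid(entry, fg_thresh=0.5, bg_thresh_hi=0.5, bg_thresh_lo=0.1):
        # entry['max_overlaps']: per-RoI max IoU with any ground-truth box;
        # the thresholds mirror the upstream cfg.TRAIN defaults.
        overlaps = entry['max_overlaps']
        fg_inds = np.where(overlaps >= fg_thresh)[0]
        bg_inds = np.where((overlaps < bg_thresh_hi) &
                           (overlaps >= bg_thresh_lo))[0]
        # stricter than upstream: require BOTH fg and bg RoIs (assumption)
        return len(fg_inds) > 0 and len(bg_inds) > 0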
In proposal_layer.py's forward() function, when I print out bbox_deltas.min() and bbox_deltas.max(), at some point they suddenly become very large and cause an overflow and a core dump; here is the log. Can anyone help figure out what the problem could be?