error while training #5

Closed
stexandev opened this Issue Nov 13, 2014 · 16 comments


After executing the following command (on 156 files of ground-truth text and images):
ocropus-rtrain gt/????/*.png -F 10000 -o mub_combined &
I get the following reproducible error:

454 150.32 (1486, 48) gt/0001/01000b.bin.png
TRU: u'quod dicitur Fulda, quod est situm in pago Grapfeld, constructum in honore sancti'
ALN: u'quuod dicituur Fuulda, qquod et situumm in pagoo Grapfeld, construuctuuumm in honnore '
OUT: u' iiii ii te ti imm tm e iii eutmut m mi eii '

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
go[t] = ffunc(gox[t])
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
network = ocrolib.load_object(last_save)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
fname = ocropus_find_file(fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
if os.path.exists(fname):
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Another case, with half of the files (directory 0001 only):

960 110.63 (1490, 48) gt/0001/010022.bin.png
TRU: u'in honorem\u2074 domini salvatoris Jesu Christi et beate Marie genetricis\u2075 eius episco-'
ALN: u'in honorem~ domini salvatoris Jesu Christi et beate MMarie genetricis eius episco-'
OUT: u'iu bouoreu ouiui salvatoris lesu bristi et beate arie geuetricis eius episoo-'

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
self.outputs = array(self.lstm.forward(xs))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
xs = net.forward(xs)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
outputs = [net.forward(xs) for net in self.nets]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 619, in forward
return self.net.forward(xs[::-1])[::-1]
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
self.WIP,self.WFP,self.WOP)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
go[t] = ffunc(gox[t])
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
network = ocrolib.load_object(last_save)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
fname = ocropus_find_file(fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
if os.path.exists(fname):
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found
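
Both reports end in a second traceback. Reading it, ocropus-rtrain appears to react to the FloatingPointError by reloading `last_save`, which is still `None` because no model had been saved yet, so the reload crashes as well and hides the original error. The following is a minimal sketch of a guard, not the actual ocropus-rtrain code: `reload_last_model` is a hypothetical helper, and `loader` stands in for whatever function loads a model from a path (presumably `ocrolib.load_object`).

```python
def reload_last_model(network, last_save, loader):
    """Return the model to continue with after a numerical error.

    Hypothetical helper: `loader` is any callable that loads a saved model
    from a path. If nothing has been saved yet (last_save is None), keep the
    in-memory network instead of handing None to the loader.
    """
    if last_save is None:
        return network
    return loader(last_save)


if __name__ == "__main__":
    fake_loader = lambda path: "model loaded from " + path
    print(reload_last_model("current network", None, fake_loader))
    print(reload_last_model("current network", "model-00001000.pyrnn.gz", fake_loader))
```

Whether to keep the current network or to abort when there is no checkpoint is a judgment call; the point is only that `None` should never reach the loader.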

tmbdev Nov 24, 2014

Owner

I haven't seen that happen in the past. It's fairly easy to fix by clipping the value, but I'm concerned that your weights are getting large enough to trigger this in the first place. What learning rate and number of hidden units are you using?

stexandev Nov 26, 2014

I'm sorry, but I can't answer your question about learning rates and hidden units: I didn't change anything in the source code (except for extending the codec) and ran the command exactly as stated above.

tmbdev Nov 26, 2014

Owner

If you changed the codec, how many output classes are there?

stexandev Nov 27, 2014

I just added a class of superdigits = u"⁰¹²³⁴⁵⁶⁷⁸⁹" and attached it to the default codec.

tmbdev Dec 3, 2014

Owner

OK, thanks. I'll keep the bug open and will incorporate a workaround (or you can send me a patch; basically, to avoid the overflow just clip the argument to the exp to some reasonable range and test it).
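
The workaround described here can be sketched as follows. This is an illustration under assumptions, not the ocropy patch itself: the names `ffunc_unclipped` and `ffunc_clipped` and the limit of 20 are chosen for the example, and NumPy is configured to raise on overflow, as the FloatingPointError in the traceback suggests it is during training.

```python
import numpy as np


def ffunc_unclipped(x):
    """Logistic sigmoid as in the traceback; exp(-x) overflows for very negative x."""
    return 1.0 / (1.0 + np.exp(-x))


def ffunc_clipped(x, limit=20.0):
    """Logistic sigmoid with the argument of exp clipped to [-limit, limit].

    The limit is an illustrative choice; the sigmoid is already saturated
    far inside the range where exp would overflow.
    """
    x = np.clip(x, -limit, limit)
    return 1.0 / (1.0 + np.exp(-x))


if __name__ == "__main__":
    gate_inputs = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
    with np.errstate(over="raise"):
        try:
            ffunc_unclipped(gate_inputs)
        except FloatingPointError as err:
            print("unclipped:", err)
        print("clipped:", ffunc_clipped(gate_inputs))
```

Clipping only removes the crash. If the gate inputs reach such magnitudes because the weights are diverging, the training run still needs attention.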

danvk Jan 5, 2015

Contributor

I'm also running into this error. I started to see it increasingly often as I let my model train. The quality of its outputs started decreasing around the same time; my assumption is that the model started diverging, causing both of these problems. I have sample data, command lines and model files if they would be helpful.

hzhangwd May 8, 2015

Also running into this issue. I think it has something to do with the exploding/vanishing gradient problem of RNNs, which can occur even with LSTM.

tmbdev May 12, 2015

Owner

I've run a lot of benchmarks now, and generally, the gradients don't explode haphazardly. They explode at high learning rates, but lowering the learning rate reliably makes things work.
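
The link between learning rate and divergence can be seen in a toy problem. The sketch below is not ocropy's LSTM: it runs plain gradient descent on f(w) = 0.5 * w**2, where each update multiplies w by (1 - lrate), so any rate above 2 makes |w| grow without bound. The function name and the chosen rates are assumptions made for the illustration.

```python
def gradient_descent(lrate, steps=50, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 (gradient: w) and return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= lrate * w  # each step multiplies w by (1 - lrate)
    return abs(w)


if __name__ == "__main__":
    for lrate in (0.5, 1.5, 2.5):
        print("lrate=%.1f  final |w| = %.3e" % (lrate, gradient_descent(lrate)))
```

In a real network the stability threshold depends on the data and the architecture, which fits the observation that lowering the learning rate is what reliably helps.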

QuLogic pushed a commit to QuLogic/ocropy that referenced this issue Jun 6, 2015

slimanef Jun 18, 2015

I used ocropus-rtrain to train ocropus on handwritten historical word images. It gave the following error after 13605 iterations:
ocropus-rtrain -o /home/p/models/test /home/p/WordImages/*.jpg

Could you help me with this error?

Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 270, in <module>
line = network.lnorm.normalize(line,cval=amax(line))
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lineest.py", line 59, in normalize
dewarped = self.dewarp(img,cval=cval,dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lineest.py", line 56, in dewarp
dewarped = array(dewarped,dtype=dtype).T
ValueError: setting an array element with a sequence.

and also this error:

/usr/lib/python2.7/dist-packages/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
warnings.warn("Mean of empty slice.", RuntimeWarning)
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 269, in <module>
network.lnorm.measure(amax(line)-line)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lineest.py", line 43, in measure
self.mad = mean(deltas[line!=0])
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2716, in mean
out=out, keepdims=keepdims)
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 67, in _mean
ret = ret.dtype.type(ret / rcount)
FloatingPointError: invalid value encountered in double_scalars

Thank you
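
One reading of the second traceback: `mean(deltas[line!=0])` reports "Mean of empty slice" when the mask selects no pixels at all, which happens when the inverted image `amax(line)-line` is zero everywhere, that is, when the input image is uniform. A minimal sketch of a pre-check for such images follows; `is_blank_line` is a hypothetical helper and not part of ocrolib.

```python
import numpy as np


def is_blank_line(line):
    """True if the image is empty or has no contrast.

    For a uniform image, amax(line) - line is all zeros, so a mask such as
    `inverted != 0` selects nothing and a mean over it is undefined.
    """
    line = np.asarray(line, dtype=float)
    if line.size == 0:
        return True
    inverted = np.amax(line) - line
    return not np.any(inverted != 0)


if __name__ == "__main__":
    blank = np.ones((48, 200))
    inked = blank.copy()
    inked[20:28, 50:150] = 0.0  # a dark stroke on a white background
    print(is_blank_line(blank), is_blank_line(inked))
```

Running such a check over the training images before starting would show whether blank word images are present in the set.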

anupamaray May 5, 2016

I am using ocropus-rtrain and get this error right at the start. I am running: ocropus-rtrain -o model words/*.bin.png
and the error is:

got FloatingPointError divide by zero encountered in double_scalars
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 286, in <module>
pcs = network.trainSequence(line,cs,update=do_update,key=fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 902, in trainSequence
self.error_log.append(self.error**.5/len(cs))
FloatingPointError: divide by zero encountered in double_scalars
Traceback (most recent call last):
File "/usr/local/bin/ocropus-rtrain", line 290, in <module>
network = load_lstm(last_save)
File "/usr/local/bin/ocropus-rtrain", line 176, in load_lstm
network = ocrolib.load_object(last_save)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 503, in load_object
fname = ocropus_find_file(fname)
File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 682, in ocropus_find_file
if os.path.exists(fname):
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Can someone help me with this error, please?
Thank you

kba May 5, 2016

Collaborator

@anupamaray Can you open a new issue for this one? Please give your operating system and put the backtrace in a fenced code block, so it's more readable.

os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Are you sure that the pictures are in words/*.bin.png? What does ls words/*.bin.png return?
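
The same check can be scripted. The divide-by-zero in the report above is raised in `self.error**.5/len(cs)`, which suggests that `cs`, the encoded transcript, was empty for that sample, so the sketch below also looks for missing or empty transcripts. It assumes that the transcript for `name.bin.png` sits next to it as `name.gt.txt`; `check_training_files` is a hypothetical helper, not an ocropy command.

```python
import glob
import os


def check_training_files(pattern):
    """Return (images, problems) for a glob pattern of line images.

    Assumes the transcript of `name.bin.png` is stored as `name.gt.txt`
    in the same directory; adjust this if your layout differs.
    """
    images = sorted(glob.glob(pattern))
    problems = []
    for image in images:
        if image.endswith(".bin.png"):
            base = image[:-len(".bin.png")]
        else:
            base = os.path.splitext(image)[0]
        gt_path = base + ".gt.txt"
        if not os.path.exists(gt_path):
            problems.append((image, "missing transcript"))
            continue
        with open(gt_path, "rb") as stream:
            if not stream.read().strip():
                problems.append((image, "empty transcript"))
    return images, problems


if __name__ == "__main__":
    images, problems = check_training_files("words/*.bin.png")
    print("%d images matched" % len(images))
    for image, reason in problems:
        print("%s: %s" % (image, reason))
```

An empty list of images means the pattern matched nothing, which is what the `ls` question is meant to find out.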

ChillarAnand Dec 30, 2016

Contributor

Any updates on this issue? I am using all the default values except for this code.

 ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహఽాిుూృౄెేౘౙౠౡౢౣ౦౧౨౩౪౫౬౭౮౯

For the first 3K samples, everything went fine. After that, 4 out of 10 samples fail with a FloatingPointError.

maluz Apr 17, 2017

I think I have the same issue. Can someone tell me what I should do to avoid this kind of error? I'm only just starting out with ocropy and ocrocis. Thank you!

# oops, got FloatingPointError overflow encountered in exp
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 289, in <module>
    pcs = network.trainSequence(line,cs,update=do_update,key=fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 900, in trainSequence
    self.outputs = array(self.lstm.forward(xs))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 612, in forward
    xs = net.forward(xs)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 668, in forward
    outputs = [net.forward(xs) for net in self.nets]
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 566, in forward
    self.WIP,self.WFP,self.WOP)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 435, in forward_py
    go[t] = ffunc(gox[t])
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 383, in ffunc
    return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 293, in <module>
    network = load_lstm(last_save)
  File "/usr/local/bin/ocropus-rtrain", line 179, in load_lstm
    network = ocrolib.load_object(last_save)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 436, in load_object
    fname = ocropus_find_file(fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 639, in ocropus_find_file
    full = os.path.join(prefix, basename, fname)
  File "/usr/lib/python2.7/posixpath.py", line 75, in join
    if b.startswith('/'):
AttributeError: 'NoneType' object has no attribute 'startswith'
[OCROCIS] [ERROR] Ocropus command failed: ocropus-rtrain --ntrain 30000 --savefreq 1000 --codec ./book/charset.txt --output ./iterations/01/models/model ./training/*/*.bin.png 2>&1

zuphilip added a commit that referenced this issue Apr 18, 2017

Clip exponential in ffunc to avoid overflow
This should (hopefully) avoid some of the possible FloatingPointError overflow errors.

For any x < -20 or x > 20, the sigmoid function ffunc is already within roughly 10^-9 of 0 or 1, respectively,
so clipping there will not change the function substantially.

This idea is from @tmbdev in #5 (comment)
Implemented first in #49 (comment)
Additional information from #79 (comment)
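
The size of the change introduced by clipping at ±20 can be checked numerically. The sketch below uses only the standard library; `sigmoid` and `clipping_error` are names chosen for this example and are not part of ocrolib.

```python
import math


def sigmoid(x):
    """Logistic sigmoid for a scalar argument."""
    return 1.0 / (1.0 + math.exp(-x))


def clipping_error(limit=20.0):
    """Upper bound on how much clipping x to [-limit, limit] changes sigmoid(x).

    Below -limit the true value lies between 0 and sigmoid(-limit); above
    +limit it lies between sigmoid(limit) and 1.
    """
    return max(sigmoid(-limit), 1.0 - sigmoid(limit))


if __name__ == "__main__":
    print("max deviation when clipping at 20: %.3e" % clipping_error(20.0))
```

The bound comes out on the order of 10^-9, far below anything that would matter for the gate activations.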
zuphilip Apr 18, 2017

Collaborator

I tried to implement a possible fix in #201 for ocropy. Can someone check this out?

I don't have any details for ocrocis but maybe @uvius can help there.

kba Dec 11, 2017

Collaborator

Since #201 was merged, can we close this?

zuphilip Dec 12, 2017

Collaborator

Yes, closing this issue, which was resolved by #201.

If you encounter new problems, then please open a new issue.

@zuphilip zuphilip closed this Dec 12, 2017
