
Higher error probability with first letter of line #195

Closed
urhub opened this issue Mar 21, 2017 · 14 comments

Comments

urhub commented Mar 21, 2017

Expected Behavior

The error rate should be roughly the same at any position on a line.

Current Behavior

Hello. I am training Ocropus on the pages of Hume's Dialogues, following a look-ahead methodology: a trained model, starting from the default model, is applied to the Hume images, beginning with page 8. Lines with errors are then picked as the training set and used to train the model; the result is applied to the pages that follow, and so on. One thing I am observing is that there are more errors on the first letter of a line than on a letter at any other position. Is this expected, or is it a bug or a deficiency?

Possible Solution

Steps to Reproduce (for bugs)

  1. Download the Hume Dialogues pages.
  2. Run Ocropus on these images (the image segmentation is easier starting from page 8, so I started with that).
  3. Pick the lines that show errors and generate text files with the corrected lines.
  4. Train Ocropus with the lines so produced.
  5. Repeat steps 2, 3, and 4, each time running Ocropus on the subsequent pages (a command-line sketch of one iteration follows this list).
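
For reference, one iteration of this loop with the standard ocropy tools might look roughly as follows; the page, directory, and model names are placeholders, and the exact flags should be checked against your ocropy revision:

```bash
# Binarize one page and segment it into line images (names are placeholders).
ocropus-nlbin pages/0008.png -o book
ocropus-gpageseg 'book/????.bin.png'

# Recognize the line images with the current model.
ocropus-rpred -m mymodel.pyrnn.gz 'book/????/??????.bin.png'

# After manually correcting the erroneous lines into *.gt.txt files,
# continue training from the previous model on those lines.
ocropus-rtrain --load mymodel.pyrnn.gz -o mymodel 'book/????/??????.bin.png'
```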

Your Environment

  • Python version: 2.7.10 for training, 2.7.6 while running on images
  • Git revision of ocropy: not sure; I downloaded it on 11 Feb 2017
  • Operating System and version: a Cray supercomputer for training; Bash on Ubuntu on Windows 10 for running on images
zuphilip (Collaborator) commented:

Do you have an example page of this corpus, or some URL where the images can be found? Scanned images can be skewed at the borders, which could lead to more errors...

Is the problem you describe independent of the training, i.e., do you see more errors at the beginning of the lines with every model?

urhub (Author) commented Mar 21, 2017 via email

zuphilip (Collaborator) commented:

Okay. I looked at the recognition output of pages 0008 and 0009 with the standard model and it looked fine; I didn't see anything special at the beginning of the lines. If you can point me to an example page where the standard model produces this behavior, I can try to look at it.

In general I don't think that the beginning of a line should make a difference, but other factors can: initials, dirt, bad printing, ligatures, italics, letter spacing, ...

urhub (Author) commented Mar 30, 2017 via email

urhub (Author) commented Mar 31, 2017 via email

zuphilip (Collaborator) commented Apr 1, 2017

I looked at the two files and see that in the first one most errors appear at the beginning of the lines (there is also an error at the end of line 010013). I am not aware of this observation having been made before.

One idea to verify your observation: recognize a newly, artificially created line, built from the second half of line 01000c and the first half of line 01000d, and check whether the v is recognized correctly then (a sketch of how to stitch such a line together follows the image):

[Image: combined-line]
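
A minimal sketch of the stitching with Pillow, assuming hypothetical paths like book/0100/01000c.bin.png for the binarized line images:

```python
from PIL import Image

# Hypothetical paths to the two binarized line images.
a = Image.open("book/0100/01000c.bin.png")
b = Image.open("book/0100/01000d.bin.png")

# Second half of the first line, first half of the second line.
half_a = a.crop((a.width // 2, 0, a.width, a.height))
half_b = b.crop((0, 0, b.width // 2, b.height))

# Paste them side by side on a white grayscale canvas.
canvas = Image.new("L", (half_a.width + half_b.width,
                         max(half_a.height, half_b.height)), 255)
canvas.paste(half_a, (0, 0))
canvas.paste(half_b, (half_a.width, 0))
canvas.save("combined-line.bin.png")
```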

What is known about OCR errors in general is that certain confusions (e.g. e vs. c) are frequently encountered.


Is your model trained only on this one corpus, or does it work with other, similar text as well? It is interesting to see that your model works for normal text as well as for italics. If you have any stable OCRopus model which you want to share, you can add it to the wiki here: https://github.com/tmbdev/ocropy/wiki/Models .

urhub (Author) commented Apr 1, 2017

The bigger problem is that I am not able to find what influences or causes this problem. For example, when I use the model generated at 228,000 iterations, the results for page 141 look good, but the results for page 231 exhibit this beginning-of-line issue; please see the attached file p231_m36pi228_0001.pdf. BTW, lines 010006 through 01000c of page 141 are part of my training set.

p231_m36pi228_0001.pdf

At the moment I am targeting only the Hume Dialogues book. My model works only for the italics that are part of the training set: it fails for italic words outside the training set, while it does not fail for normal words outside the training set. What do you suggest I do for words in italics? Could the italics be a reason for this (higher probability of) beginning-of-line errors? Once I have fully trained and tested the models I will surely add them to the wiki you mention, thanks.

zuphilip (Collaborator) commented Apr 2, 2017

Did you try to recognize, with exactly the same model, an artificial, manually created line where the problematic parts are not at the beginning, as I suggested above?

Actually, it also looks like you have trained for a very long time, and I am not sure that longer is always better. You should observe your error probability during the whole training process and maybe stop a little earlier; a sketch of one way to do that is below. See also this related blog post: http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html
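
For instance, each saved model snapshot can be evaluated against a held-out set of lines with ground truth; a rough sketch, assuming the usual ocropy tools and placeholder paths:

```bash
# Evaluate every saved snapshot on held-out lines that have *.gt.txt
# ground truth; the model and test paths here are placeholders.
for model in models/mymodel-*.pyrnn.gz; do
    ocropus-rpred -m "$model" 'test/????/??????.bin.png'
    echo "$model"
    ocropus-errs 'test/????/??????.gt.txt'
done
```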

> What do you suggest I do for words in italics?

You can try to train them alongside the normal text, or you could come up with something more elaborate: detect italic text, separate those parts from the rest, and train an individual model for the italics. I don't have any experience with that, except that italics usually cause more problems.

amitdo (Contributor) commented Apr 2, 2017

> Actually, it also looks like you have trained for a very long time, and I am not sure that longer is always better. You should observe your error probability during the whole training process and maybe stop a little earlier.

It's called 'overfitting'.
See here: https://deeplearning4j.org/lstm.html

urhub (Author) commented Apr 4, 2017

Thanks for the information, Philipp and Amit. I did follow danvk's post on training the OCR model, but I do not have a way of automatically calculating the error probability at the moment; I will look into it. I am inclined to follow Philipp's suggestion to train multiple models, one on normal text and one on italics. I figure I can then pick the best results from running both models.

urhub (Author) commented Apr 14, 2017

> You can try to train them alongside the normal text, or you could come up with something more elaborate: detect italic text, separate those parts from the rest, and train an individual model for the italics.

Done, Philipp. I have added the error information, the (optimal) models, and example results in PDF format for page 230 of the Hume Dialogues: https://github.com/tmbdev/ocropy/wiki/Models .

zuphilip (Collaborator) commented:

Thank you, that is interesting! Do you have some code to combine the output of the normal and the italics models?

urhub (Author) commented Apr 21, 2017

Not yet; I need a dictionary program. I also noticed that the performance is not so good for capital letters. I have since tried training an individual model for capitalized text, and the best I got is a 5% error rate.

urhub (Author) commented Jun 10, 2017

Hi Philipp, I have checked in a model for italics with normal text mapped to symbols, a model for normal text with italics mapped to symbols, and the results of collating the two: https://github.com/urhub/ocropy/tree/master/models
I also have the rather trivial Python code for collating the results, if anybody is interested; a sketch of the idea is below.
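
In a nutshell, the collation merges the two per-line outputs word by word; the placeholder character '#' below is made up, so substitute whatever symbol your models actually emit for the style they were not trained on:

```python
# Merge the per-line outputs of the two models word by word, assuming each
# model emits a placeholder character (here '#', hypothetical) for words in
# the style it was not trained on, and both outputs have the same word count.
PLACEHOLDER = "#"

def collate(normal_out, italics_out):
    merged = []
    for norm_word, ital_word in zip(normal_out.split(), italics_out.split()):
        # Prefer whichever model actually recognized the word.
        merged.append(ital_word if PLACEHOLDER in norm_word else norm_word)
    return " ".join(merged)

print(collate("Hume # # nature", "# wrote on #"))
# -> "Hume wrote on nature"
```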

urhub closed this as completed Sep 27, 2017