
Higher error probability with first letter of line #195

Closed
urhub opened this issue Mar 21, 2017 · 14 comments

Comments

urhub commented Mar 21, 2017

Expected Behavior

The error rate should be roughly the same at any position on a line.

Current Behavior

Hello. I am training Ocropus on the pages of Hume's Dialogues, following a look-ahead methodology: a trained model, starting from the default model, is applied to the Hume images, beginning with page 8. Lines with errors are then picked as the training set and used to train the model; the result is applied to the pages that follow, and so on. One thing I am observing is that there are more errors on the first letter of a line than on a letter at any other position. Is this expected, or is it a bug or a deficiency?

Possible Solution

Steps to Reproduce (for bugs)

  1. Download the Hume Dialogues pages.
  2. Run Ocropus on these images (the image segmentation is easier starting from page 8, so I started with that).
  3. Pick the lines that show errors and generate text files with the corrected lines.
  4. Train Ocropus with the lines so produced.
  5. Repeat steps 2, 3, and 4, each time running Ocropus on the subsequent pages (a command-line sketch of one iteration follows this list).
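
For reference, one iteration of this loop with the standard ocropy tools might look roughly as follows; the page, directory, and model names are placeholders, and the exact flags should be checked against your ocropy revision:

```bash
# Binarize one page and segment it into line images (names are placeholders).
ocropus-nlbin pages/0008.png -o book
ocropus-gpageseg 'book/????.bin.png'

# Recognize the line images with the current model.
ocropus-rpred -m mymodel.pyrnn.gz 'book/????/??????.bin.png'

# After manually correcting the erroneous lines into *.gt.txt files,
# continue training from the previous model on those lines.
ocropus-rtrain --load mymodel.pyrnn.gz -o mymodel 'book/????/??????.bin.png'
```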

Your Environment

  • Python version: 2.7.10 for training, 2.7.6 while running on images
  • Git revision of ocropy: not sure; I downloaded it on 11 Feb 2017
  • Operating System and version: a Cray supercomputer for training; Bash on Ubuntu on Windows 10 for running on images
zuphilip (Collaborator) commented:

Do you have an example page of this corpus, or some URL where the images can be found? Scanned images can be skewed at the borders, which could lead to more errors...

Is the problem you describe independent of the training, i.e., do you see more errors at the beginning of the lines with every model?

urhub (Author) commented Mar 21, 2017 via email

zuphilip (Collaborator) commented:

Okay. I looked at the recognition output of pages 0008 and 0009 with the standard model and it looked fine; I didn't see anything special at the beginning of the lines. If you can point me to an example page where the standard model produces this behavior, I can try to look at it.

In general I don't think that the beginning of a line should make a difference, but other factors can: initials, dirt, bad printing, ligatures, italics, letter spacing, ...

urhub (Author) commented Mar 30, 2017 via email

urhub (Author) commented Mar 31, 2017 via email

zuphilip (Collaborator) commented Apr 1, 2017

I looked at the two files and see that in the first one most errors appear at the beginning of the lines (there is also an error at the end of line 010013). I am not aware of this observation having been made before.

One idea to verify your observation: recognize a newly, artificially created line, built from the second half of line 01000c and the first half of line 01000d, and check whether the v is recognized correctly then (a sketch of how to stitch such a line together follows the image):

[Image: combined-line]
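
A minimal sketch of the stitching with Pillow, assuming hypothetical paths like book/0100/01000c.bin.png for the binarized line images:

```python
from PIL import Image

# Hypothetical paths to the two binarized line images.
a = Image.open("book/0100/01000c.bin.png")
b = Image.open("book/0100/01000d.bin.png")

# Second half of the first line, first half of the second line.
half_a = a.crop((a.width // 2, 0, a.width, a.height))
half_b = b.crop((0, 0, b.width // 2, b.height))

# Paste them side by side on a white grayscale canvas.
canvas = Image.new("L", (half_a.width + half_b.width,
                         max(half_a.height, half_b.height)), 255)
canvas.paste(half_a, (0, 0))
canvas.paste(half_b, (half_a.width, 0))
canvas.save("combined-line.bin.png")
```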

What is known about OCR errors in general is that certain confusions (e.g. e vs. c) are frequently encountered.


Is your model trained only on this one corpus, or does it work with other, similar text as well? It is interesting to see that your model works for normal text as well as for italics. If you have any stable OCRopus model which you want to share, you can add it to the wiki here: https://github.com/tmbdev/ocropy/wiki/Models .

urhub (Author) commented Apr 1, 2017

The bigger problem is that I am not able to find what influences or causes this problem. For example, when I use the model generated at 228,000 iterations, the results for page 141 look good, but the results for page 231 exhibit this beginning-of-line issue; please see the attached file p231_m36pi228_0001.pdf. BTW, lines 010006 through 01000c of page 141 are part of my training set.

p231_m36pi228_0001.pdf

At the moment I am targeting only the Hume Dialogues book. My model works only for the italics that are part of the training set: it fails for italic words outside the training set, while it does not fail for normal words outside the training set. What do you suggest I do for words in italics? Could the italics be a reason for this (higher probability of) beginning-of-line errors? Once I have fully trained and tested the models I will surely add them to the wiki you mention, thanks.

zuphilip (Collaborator) commented Apr 2, 2017

Did you try to recognize, with exactly the same model, an artificial, manually created line where the problematic parts are not at the beginning, as I suggested above?

Actually, it also looks like you have trained for a very long time, and I am not sure that longer is always better. You should observe your error probability during the whole training process and maybe stop a little earlier; a sketch of one way to do that is below. See also this related blog post: http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html
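
For instance, each saved model snapshot can be evaluated against a held-out set of lines with ground truth; a rough sketch, assuming the usual ocropy tools and placeholder paths:

```bash
# Evaluate every saved snapshot on held-out lines that have *.gt.txt
# ground truth; the model and test paths here are placeholders.
for model in models/mymodel-*.pyrnn.gz; do
    ocropus-rpred -m "$model" 'test/????/??????.bin.png'
    echo "$model"
    ocropus-errs 'test/????/??????.gt.txt'
done
```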

> What do you suggest I do for words in italics?

You can try to train them alongside the normal text, or you could come up with something more elaborate: detect italic text, separate those parts from the rest, and train an individual model for the italics. I don't have any experience with that, except that italics usually cause more problems.

amitdo (Contributor) commented Apr 2, 2017

> Actually, it also looks like you have trained for a very long time, and I am not sure that longer is always better. You should observe your error probability during the whole training process and maybe stop a little earlier.

It's called 'overfitting'.
See here: https://deeplearning4j.org/lstm.html

urhub (Author) commented Apr 4, 2017

Thanks for the information, Philipp and Amit. I did follow danvk's post on training the OCR model, but I do not have a way of automatically calculating the error probability at the moment; I will look into it. I am inclined to follow Philipp's suggestion to train multiple models, one on normal text and one on italics. I figure I can then pick the best results from running both models.

urhub (Author) commented Apr 14, 2017

> You can try to train them alongside the normal text, or you could come up with something more elaborate: detect italic text, separate those parts from the rest, and train an individual model for the italics.

Done, Philipp. I have added the error information, the (optimal) models, and example results in PDF format for page 230 of the Hume Dialogues: https://github.com/tmbdev/ocropy/wiki/Models .

zuphilip (Collaborator) commented:

Thank you, that is interesting! Do you have some code to combine the output of the normal and the italics models?

urhub (Author) commented Apr 21, 2017

Not yet; I need a dictionary program. I also noticed that the performance is not so good for capital letters. I have since tried training an individual model for capitalized text, and the best I got is a 5% error rate.

urhub (Author) commented Jun 10, 2017

Hi Philipp, I have checked in a model for italics with normal text mapped to symbols, a model for normal text with italics mapped to symbols, and the results of collating the two: https://github.com/urhub/ocropy/tree/master/models
I also have the rather trivial Python code for collating the results, if anybody is interested; a sketch of the idea is below.
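
In a nutshell, the collation merges the two per-line outputs word by word; the placeholder character '#' below is made up, so substitute whatever symbol your models actually emit for the style they were not trained on:

```python
# Merge the per-line outputs of the two models word by word, assuming each
# model emits a placeholder character (here '#', hypothetical) for words in
# the style it was not trained on, and both outputs have the same word count.
PLACEHOLDER = "#"

def collate(normal_out, italics_out):
    merged = []
    for norm_word, ital_word in zip(normal_out.split(), italics_out.split()):
        # Prefer whichever model actually recognized the word.
        merged.append(ital_word if PLACEHOLDER in norm_word else norm_word)
    return " ".join(merged)

print(collate("Hume # # nature", "# wrote on #"))
# -> "Hume wrote on nature"
```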

urhub closed this as completed Sep 27, 2017