Higher error probability with first letter of line #195
Do you have an example page of this corpus? Or some URL where the images can be found? Scanned images can be skewed at the borders, which could lead to more errors... Is the problem you describe independent of the training, i.e. with each model do you see more errors at the beginning of the lines? |
Hi Philipp,
You can find the Hume Dialogues pages here:
https://archive.org/details/dialoguesconcern1779hume
The letters it makes errors on are not skewed; they are simply the first
letters of their lines.
Yes, it is independent of the training. The errors do not increase
with training. The only observation is that more errors occur for the
letter in the first position of a line than at any other position.
I used the default settings and also experimented with the settings for
nlbin.
Umesh
|
Okay. I looked at the recognition output. In general I don't think that the beginning of lines should make a difference, but other factors can: initials, dirt, bad printing, ligatures, italics, letter spacing, ... |
Hi Philipp,
You have to follow all the (five) steps to reproduce the problem.
It is a long process, so I understand if you do not want to go through
with it. Anyway, I had to restart the procedure since we did not want to
include the title and the last line of each page. This time I am
not coming across this problem. However, I would appreciate any comments
you may have about this procedure. Thanks.
Umesh
|
Hi Philipp,
I am attaching two files. The first contains the results of processing page
141 with a model generated at 214000 iterations of training,
p141_m36pi214_0001.pdf. The second contains the results of processing page 141 with
a model generated at 220000 iterations of training, p141_m36pi220_0001.pdf.
Please look at the results for 010005.png, 01000d.png and 010017.png. The first
letter shows an error in the former and is correct in the latter. This is the
observation I made.
Best,
Umesh
[p141_m36pi214_0001.pdf](https://github.com/tmbdev/ocropy/files/887108/p141_m36pi214_0001.pdf)
[p141_m36pi220_0001.pdf](https://github.com/tmbdev/ocropy/files/887109/p141_m36pi220_0001.pdf)
|
I looked at the two files and see that in the first one most errors appear at the beginning of the lines (there is also an error at the end of the line for 010013). I am not aware that this observation has been made before. One way to verify it would be to recognize an artificially created line, built from the second half of 01000c and the first half of 01000d, and check whether the errors still occur once those characters are no longer at the beginning of a line. What is known about OCR errors in general is that some character confusions occur much more often than others. Is your model trained only for this one corpus, or will it work with other similar text as well? It is interesting to see that your model works for normal text as well as for italics. If you have a stable OCRopus model which you want to share, you can add it to the wiki here: https://github.com/tmbdev/ocropy/wiki/Models . |
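The splice test suggested above can be done directly on the line images before recognition. A minimal sketch, using a plain rows-of-pixels representation (hypothetical; a real run would slice the binarized `.bin.png` arrays and feed the result to ocropus-rpred):

```python
def splice_lines(img_a, img_b):
    """Build an artificial line image from the right half of img_a and
    the left half of img_b (images as equal-height lists of pixel rows)."""
    if len(img_a) != len(img_b):
        raise ValueError("line images must have the same height")
    return [row_a[len(row_a) // 2:] + row_b[:len(row_b) // 2]
            for row_a, row_b in zip(img_a, img_b)]

# Toy example: two 2x4 "images"
a = [[1, 1, 0, 0],
     [1, 1, 0, 0]]
b = [[5, 5, 9, 9],
     [5, 5, 9, 9]]
print(splice_lines(a, b))  # → [[0, 0, 5, 5], [0, 0, 5, 5]]
```

If the errors disappear when the problematic characters sit in the middle of the spliced line, that supports the position-dependence hypothesis.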
The bigger problem is that I am not able to find what influences or causes this problem. For example, when I use the model generated at 228000 iterations, the results for page 141 look good, but the results for page 231 exhibit this beginning-of-line issue; please see the attached file p231_m36pi228_0001.pdf. BTW, lines 010006 through 01000c of page 141 are part of my training set. At the moment I am targeting only the Hume Dialogues book. My model works only for the italics that are part of the training set; it fails for italic words outside the training set, while it does not fail for normal words outside the training set. What do you suggest I do for words in italics? Could the italics be a reason for this (higher probability of) beginning-of-line errors? Once I have fully trained and tested the models I will surely add them to the repository you mention, thanks. |
Did you try to recognize, with exactly the same model, an artificial manually created line where the problematic parts are not at the beginning, as I suggested above? Actually, it also looks like you have trained for a very long time, and I am not sure that longer is always better. You should observe your error rate during the whole training process and maybe stop a little earlier. See also this related blog post: http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html
You can try to train them alongside the normal text, or you could do something more elaborate: detect italic text, separate those parts from the rest, and train an individual model for italics. I don't have any experience with that, except that italics usually cause more problems. |
It's called 'overfitting'. |
Thanks for the information, Philipp and Amit. I did follow danvk's post on training OCRopus, but I do not have a way of automatically calculating the error rate at the moment. I will look into it. I am inclined to follow Philipp's suggestion and train multiple models, one on normal text and one on italics. I figure I can then pick the best results from running both models. |
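The per-line error rate mentioned here can be computed with a small character-error-rate function. A sketch only (ocropy ships its own evaluation tool, ocropus-errs, for this; the function below is a lightweight stand-in):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ground_truth, prediction):
    """Character error rate: edit distance normalized by truth length."""
    return levenshtein(ground_truth, prediction) / max(1, len(ground_truth))

print(cer("Dialogues", "Oialogues"))  # one substitution over 9 characters
```

Tracking this value on a held-out set over the training iterations is what lets you pick the stopping point danvk's post recommends.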
DONE, Philipp. I have added the error information, the (optimal) models, and example results in PDF format for page 230 of the Hume Dialogues to https://github.com/tmbdev/ocropy/wiki/Models. |
Thank you that is interesting! Do you have some code to combine the output of the normal and italics text? |
Not yet. I need a dictionary program. I also noticed that the performance is not so good for capital letters. I have since tried training an individual model for capitals, and the best I got is a 5% error rate. |
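One simple way to collate the two models' outputs without a full dictionary program is to score each hypothesis per line against a word list and keep the better one. A sketch under stated assumptions (the vocabulary and the line-by-line pairing of the two outputs are hypothetical):

```python
def dict_score(line, vocab):
    """Fraction of the line's words found in the vocabulary."""
    words = [w.strip(".,;:!?\"'()").lower() for w in line.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def collate(normal_lines, italic_lines, vocab):
    """For each line, keep whichever model's output has the higher
    dictionary hit rate (ties go to the normal-text model)."""
    return [n if dict_score(n, vocab) >= dict_score(i, vocab) else i
            for n, i in zip(normal_lines, italic_lines)]

vocab = {"the", "nature", "of", "religion"}
print(collate(["the nature ot religion"],
              ["the nature of religion"], vocab))
# → ['the nature of religion']
```

A real dictionary program would use a proper lexicon (or per-character model confidences, if those are exported), but this per-line vote is often enough to merge a normal-text and an italics model.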
Hi Philipp, I have checked in the model for italics (with normal characters mapped to symbols), the model for normal text (with italics mapped to symbols), and the results of collating the two outputs: https://github.com/urhub/ocropy/tree/master/models |
Expected Behavior
The error rate would be the same at any position of a line.
Current Behavior
Hello. I am training OCRopus on the Hume Dialogues pages, following a look-ahead methodology: a trained model (starting with the default model) is applied to the Hume images, beginning with page 8; lines with errors are then picked as the training set and used to train the model, which is then applied to the pages that follow, and so on. One thing I am observing is that there are more errors on the first letter of a line than on a letter at any other position. Is this expected, or is it a bug or a deficiency?
Possible Solution
Steps to Reproduce (for bugs)
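One round of the look-ahead cycle described above can be laid out with the standard OCRopus command-line tools. A sketch only (the file globs, directory names, and model names here are hypothetical placeholders):

```python
def lookahead_round_cmds(pages_glob, model, out_model="next"):
    """Commands for one look-ahead round: binarize, segment, recognize,
    then (after manually collecting misread lines into train/ along with
    corrected .gt.txt files) continue training from the current model."""
    return [
        ["ocropus-nlbin", pages_glob, "-o", "book"],
        ["ocropus-gpageseg", "book/????.bin.png"],
        ["ocropus-rpred", "-m", model, "book/????/??????.bin.png"],
        ["ocropus-rtrain", "--load", model, "-o", out_model,
         "train/*.bin.png"],
    ]

for cmd in lookahead_round_cmds("pages/p*.png", "hume.pyrnn.gz"):
    print(" ".join(cmd))
```

The manual step between recognition and retraining (inspecting the output and transcribing the misread lines) is where the "lines with errors" enter the training set.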
Your Environment
For training: a Cray supercomputer; for running the images: bash on Ubuntu on Windows 10.