Can't encode transcription #2695
That error message also occurs with other languages, for example Greek. Until there is a fix for this problem, I suggest removing the characters which trigger it.
@stweil Is removing those characters safe?
I tested just now on Ubuntu, using the Andalus font, and I get the following errors (sorted from the log). So it seems to be related to those characters.
This is a known issue. langdata and langdata_lstm have not been updated with new (tess4) language training data for Arabic.
While these characters are present in the current training_text, they are not in the ara.lstm-unicharset extracted from tessdata_best/ara.traineddata; that is the reason for the error. The error will not be there for …
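A quick way to find which training_text characters are absent from an extracted unicharset is a check along these lines. This is a sketch: the file names are placeholders, it assumes the common unicharset layout (first line is an entry count, each later line starts with the glyph followed by its properties), and multi-codepoint entries such as ligatures are not handled.

```python
def load_unicharset(path):
    """Collect the glyphs from a unicharset file.

    Assumes the standard layout: the first line is the entry count,
    and each following line starts with the glyph, separated from
    its properties by a space.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return {line.split(" ")[0] for line in lines[1:] if line}


def missing_chars(training_text_path, unicharset_path):
    """Report characters used in the training text but absent from
    the unicharset (candidates for "Can't encode transcription")."""
    known = load_unicharset(unicharset_path)
    with open(training_text_path, encoding="utf-8") as f:
        text = f.read()
    return sorted({c for c in text if not c.isspace() and c not in known})
```

Run it against ara.training_text and the unicharset extracted from the traineddata you fine-tune from; any characters it reports will not be encodable by that model.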
That does not explain my encoding errors, because I trained from scratch. Nevertheless, the critical characters are somehow missing.

Some characters can be encoded in different ways, notably characters with diacritical marks. For example, the character …
I have seen similar errors for Devanagari for letters with nukta (https://r12a.github.io/scripts/devanagari/block#char093C). There are both precomposed and decomposed forms of these letters in Unicode. However, the training_texts in the langdata and langdata_lstm repos have the precomposed forms. Ray added normalization checks for training but probably didn't update the language training data in the repos.
I wonder whether those characters are handled correctly when calculating error rates. Programs like …
@stweil Please see the comments by Ray in this thread: tesseract-ocr/langdata#63
I think this is a duplicate of issue #1012.
There are other related issues too; e.g. see #2267.
I marked that issue as a duplicate of issue #1012 now, so no need to reopen it. |
@Shreeshrii so you believe this is a solution for now?
Do you need the characters % & = } with Arabic? If so, try plusminus training. However, when I tried it yesterday the unicharset size was reduced from 80+ to 70+, so some characters got dropped. I didn't compare the unicharsets or investigate further. However, the accuracy of the fine-tuned data on the Andalus font training set was better; I didn't try separately on an eval set.

If you don't need those characters, then remove just those chars from the training_text and try the Impact-style training again.

Please be aware that there are issues while training RTL languages, so fine-tuning with minimal changes to the model works best. A couple of things that I am aware of:
FYI, I do not know Arabic or any RTL language for that matter.
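As a minimal sketch of the "remove just those chars from the training_text" option above, assuming the problematic set is % & = } as reported (reading and writing the file is left to the caller):

```python
DROP = set("%&=}")  # characters reported in the encoding errors


def strip_chars(text, drop=DROP):
    """Remove the given characters from each line, collapsing any
    doubled spaces the removal leaves behind."""
    lines = []
    for line in text.splitlines():
        cleaned = "".join(c for c in line if c not in drop)
        lines.append(" ".join(cleaned.split()))  # collapse runs of spaces
    return "\n".join(lines)
```

After filtering, regenerate the training data from the cleaned training_text so the box/lstmf files match it.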
@Shreeshrii, I decided to start fine-tuning with …
Output:
I tried combining the new … I believe that this isn't a bug; I must be doing something wrong! Any help please.
@peterbence3 The unicharset is extracted from the training_text file. You don't need so many steps. I used the following script:
@Shreeshrii that was so useful, but there is one point that I didn't get yet. You said …

Does the lstmtraining tool extract it automatically from the training_text, or should I do it myself? If so, how? Regards
It is generated by tesstrain.sh - see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L346
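Conceptually, that step just collects the distinct characters of the training_text. A rough Python approximation of what the unicharset entries are derived from (the real tooling also records script and per-glyph properties, and handles grapheme clusters):

```python
import unicodedata


def extract_charset(training_text):
    """Collect the distinct non-space characters of a training text,
    NFC-normalized, sorted by code point."""
    text = unicodedata.normalize("NFC", training_text)
    return sorted({c for c in text if not c.isspace()})


# Every character here must end up in the unicharset, otherwise
# training aborts with "Can't encode transcription".
print(extract_charset("abc ab %&"))  # ['%', '&', 'a', 'b', 'c']
```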
@Shreeshrii thanks a lot, now it's working super fine. Steps I followed:

1. Clone tesseract and follow the installation instructions to build and install from source.
2. Clone langdata_lstm.
3. Get tessdata_best/ara.traineddata and place it at …
4. Edit …
5. Download a set of Arabic fonts that I need to fine-tune for (place them in any folder).
6. Generate the training data files as follows: … This will generate new training data, including the …
7. Extract …
8. Now everything is ready; execute the fine-tuning like: …
9. Enjoy, with no encoding errors.

Thanks all
@peterbence3 For extending the training_text for fine-tuning, you can try to re-engineer the files from the traineddata.
@peterbence3 Can you please tell me how to run step 6 in your list? How should I run tesstrain.sh? There is no tesstrain.sh present in either of the repositories.
@Shreeshrii I understand that when calling lstmeval, the words will be reversed in Arabic, as this is intentional.
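For manually comparing ground truth against such reversed output, a trivial helper (illustrative only; it reverses word order, not the characters inside each word):

```python
def reverse_word_order(line):
    """Reverse the order of whitespace-separated words in a line."""
    return " ".join(reversed(line.split()))


print(reverse_word_order("alpha beta gamma"))  # gamma beta alpha
```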
Any updates?
Unable to fine-tune the Arabic model for the font 'Andalus'; I am getting this error:
Please note that the line causing the error is the second-to-last line in the ara.training_text file, which contains: && التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و
I'm using langdata_lstm for generating my training data and the ara.traineddata to continue from.

Generating the data: …
Extracting the old LSTM:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm
Fine-tuning: …
I checked the generated training data, where everything seems to be good, and the tiff files include all the training_text lines, including the line causing the error. I also tried to generate training data and fine-tune for different fonts like 'Arial' and 'Tahoma', but I still get the same error.
I was thinking about removing the error line from the training_text file, but I don't know whether that is safe. Besides, I think that 80 lines for training Arabic models is very small, isn't it? So if I decide to train on more lines of data, what should I do, and what files would be affected in that case?
Regards