extra spaces in result when ocr chinese #991

Closed
BackT0TheFuture opened this issue Jun 14, 2017 · 21 comments

Comments

@BackT0TheFuture

BackT0TheFuture commented Jun 14, 2017

win8.1 64bit tesseract 4.0.0alpha leptonica-1.74.2 (Jun 6 2017, 21:45:59) [MSC v.1910 LIB Release x64]

The result always contains extra spaces between characters when using oem LstmOnly or TesseractAndLstm. oem TesseractOnly does not insert the extra spaces, but its recognition result is bad.

image

oem: Default psm: SingleLine time: 114 ms. result: 伦 敦 楼 房 发 生 火 灾 中 使 馆 关 注 : 暂 无 中 国 公 民 受 伤

oem: TesseractOnly psm: AutoOsd time: 518 ms. result: 伦敦夺委房发生火火中使馆大汪 二 暂无中 ` 又伤

result_lines_chi.txt

@BackT0TheFuture
Author

@zdenop @Shreeshrii @stweil
Can you give me some advice about this? Thanks.

@Shreeshrii
Collaborator

@theraysmith Is there any config which will change this?

@Shreeshrii
Collaborator

#988 regarding Japanese is also about similar results.

Could it be related in any way to handling of non-space delimited languages/CJK?

@Shreeshrii
Collaborator

Possibly because of

// We also want to force a separate word for every non
// space-delimited character when not in a dictionary context.

https://github.com/tesseract-ocr/tesseract/blob/a1c22fb0d0f6bde165ec7b7c3125420b0ba1d541/lstm/recodebeam.cpp

@Shreeshrii
Collaborator

#1009

Similar problem with Korean.

@BackT0TheFuture
Author

BackT0TheFuture commented Jun 27, 2017

I've also tested jpn and kor images, and the results have a similar problem.
There might be something wrong with the LSTM engine or its parameters.

@hoangtocdo90

Sorry for my bad English.
In my case (Japanese), this may be related to the OCR mode being set to LSTM.
I'm trying to get single characters using ResultIterator, but sometimes the final character in a word doesn't have the right coordinates. If I change the OCR mode to Tesseract Only, it's okay.
Example:
大阪株式会社
[debug] -Char value = 大 left= 16 top = 243 right = 68 bottom = 295 conf = 99
[debug] -Char value = 阪 left= 75 top = 244 right = 128 bottom = 295 conf = 99
[debug] -Char value = 株 left= 130 top = 244 right = 185 bottom = 296 conf = 99
[debug] -Char value = 式 left= 190 top = 243 right = 281 bottom = 296 conf = 99
[debug] -Char value = 会 left= 306 top = 243 right = 362 bottom = 296 conf = 99
[debug] -Char value = 社 left= 2214 top = 2991 right = 2214 bottom = 2991 conf = 99
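
The last box in this output has zero width and zero height (left equals right, top equals bottom), which makes it easy to flag automatically. A minimal sketch, assuming boxes are (left, top, right, bottom) tuples copied from the debug lines above; `is_degenerate` is a hypothetical helper for illustration, not part of the Tesseract API:

```python
def is_degenerate(box):
    """Return True if a (left, top, right, bottom) box has no area."""
    left, top, right, bottom = box
    return right <= left or bottom <= top


# Character boxes copied from the debug output in this comment.
boxes = [
    ("大", (16, 243, 68, 295)),
    ("阪", (75, 244, 128, 295)),
    ("株", (130, 244, 185, 296)),
    ("式", (190, 243, 281, 296)),
    ("会", (306, 243, 362, 296)),
    ("社", (2214, 2991, 2214, 2991)),
]

# Collect the characters whose reported box cannot be a real glyph box.
flagged = [char for char, box in boxes if is_degenerate(box)]
print(flagged)
```

This only detects the symptom; it does not recover the correct coordinates.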

@Shreeshrii
Collaborator

Shreeshrii commented Jun 28, 2017

Using the solution suggested in #1009, the extra spaces are removed and only inter-word spaces are preserved.

See attached output files:

--oem 1 --psm 6 -l chi_sim
chineses-1.txt

--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1

chineses-spaces.txt

Please note that this differs from the behavior of preserve_interword_spaces in the 3.05 branch, where it preserves multiple spaces between words and does not compress them to a single space.

@theraysmith Is this intentional for 4.0? Since there are reports of preserve_interword_spaces not working with 4.0 - see #781
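
The two invocations above can be compared from a script. A minimal sketch, assuming the `tesseract` binary is on PATH and the chi_sim model is installed; the helper names `build_tesseract_cmd` and `run_ocr` are hypothetical, and `stdout` is used as the output base so the text is returned instead of written to a file:

```python
import subprocess


def build_tesseract_cmd(image, lang="chi_sim", preserve_spaces=True):
    """Build the command line from this comment, with or without the workaround."""
    cmd = ["tesseract", image, "stdout", "--oem", "1", "--psm", "6", "-l", lang]
    if preserve_spaces:
        # The setting suggested in #1009.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image, **kwargs):
    """Run tesseract and return the recognized text (requires tesseract installed)."""
    result = subprocess.run(
        build_tesseract_cmd(image, **kwargs),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling `run_ocr("page.png", preserve_spaces=False)` and `run_ocr("page.png")` on the same image (the file name is a placeholder) should reproduce the difference between the two attached output files.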

@BackT0TheFuture
Author

Great, it works. I had actually noticed this variable, but I misunderstood it and set it to false, so of course it didn't work as expected.

SetVariable("preserve_interword_spaces", false);

@hoangtocdo90

@Shreeshrii can you help me? I'm getting the wrong coordinates for the last character in a word. Thanks!

@Shreeshrii
Collaborator

Sorry, @hoangtocdo90. I do not know about ResultIterator.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 13, 2018

@jbreiden

This fix can be applied by adding the following to the config file and then running combine_tessdata.

# workaround to remove extra spaces in OCR result
# https://github.com/tesseract-ocr/tesseract/issues/991, 988 and 1009
preserve_interword_spaces 1

This applies to Chinese, Japanese and Korean.
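
A minimal sketch of the config edit, assuming the language's config component is already available as a plain text file; `ensure_config_line` is a hypothetical helper, and repacking the traineddata with combine_tessdata is left out:

```python
def ensure_config_line(text, key="preserve_interword_spaces", value="1"):
    """Return config text with `key value` set, replacing an old value or appending."""
    out = []
    found = False
    for line in text.splitlines():
        parts = line.split()
        # Skip comment lines; only rewrite an actual setting for this key.
        if parts and not line.lstrip().startswith("#") and parts[0] == key:
            out.append(f"{key} {value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"
```

The function is idempotent, so it can be run over each language's config file without duplicating the line.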

@Shreeshrii
Collaborator

Fixed via
tesseract-ocr/tessdata_fast#7
tesseract-ocr/tessdata_fast#8

@zdenop Please close this issue, after PR is merged in tessdata_fast.

@xhuvom

This comment was marked as off-topic.

@stweil
Contributor

stweil commented May 21, 2019

@Shreeshrii, I just want to add preserve_interword_spaces 1 for tessdata and tessdata_best, too (see issue #2450). As far as I see, these files need to be fixed:

tessdata/chi_sim/chi_sim.config
tessdata/chi_tra/chi_tra.config
tessdata/jpn/jpn.config
tessdata/tha/tha.config
tessdata_best/chi_sim/chi_sim.config
tessdata_best/chi_tra/chi_tra.config
tessdata_best/jpn/jpn.config
tessdata_best/tha/tha.config

Is that correct?

@Shreeshrii
Collaborator

@stweil Yes, that's right. Thank you.

@stweil
Contributor

stweil commented May 21, 2019

I fixed tessdata_best/jpn_vert/jpn_vert.config which is included by tessdata_best/jpn/jpn.config.

@stweil
Contributor

stweil commented Aug 24, 2021

What about chi_sim_vert, chi_tra_vert, and jpn_vert in tessdata_fast? Do they need preserve_interword_spaces=1, too?

@ZetaLin

ZetaLin commented Jun 9, 2023

@stweil Actually, I used this method to extract the training data before I saw your article, but the PDF copy still had spaces. That is to say, the exported txt does not have extra spaces, but the PDF copy text still does.

I should also note that another workaround is to use a config setting, without having to modify the training data.
The command line:

ocrmypdf -l chi_sim --tesseract-config my.cfg input.pdf out.pdf
where my.cfg is a local file with the following contents:

preserve_interword_spaces 1
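
A minimal sketch of the same workaround driven from a script, assuming `ocrmypdf` is installed and on PATH; `prepare_ocrmypdf_cmd` is a hypothetical helper that writes the config file and returns the command line shown above:

```python
from pathlib import Path


def prepare_ocrmypdf_cmd(input_pdf, output_pdf, cfg_path="my.cfg", lang="chi_sim"):
    """Write the one-line Tesseract config and build the ocrmypdf command."""
    cfg = Path(cfg_path)
    cfg.write_text("preserve_interword_spaces 1\n", encoding="utf-8")
    return [
        "ocrmypdf",
        "-l", lang,
        "--tesseract-config", str(cfg),
        input_pdf,
        output_pdf,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Whether this also removes the spaces from the PDF text layer is not confirmed in this thread.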

@legistek

@ZetaLin same issue here. The raw text output is right but when output to PDF there is a space between every character.

@languagemaniac

@stweil Actually, I used this method to extract the training data before I saw your article, but the PDF copy still had spaces. That is to say, the exported txt does not have extra spaces, but the PDF copy text still does.

I should also note that another workaround is to use a config setting, without having to modify the training data. The command line:

ocrmypdf -l chi_sim --tesseract-config my.cfg input.pdf out.pdf
where my.cfg is a local file with the following contents:

preserve_interword_spaces 1

Does that still work for you? I tried it, but it's not working for me, and I don't know what to do.
