extra spaces in result when ocr chinese #991

BackT0TheFuture · 2017-06-14T13:54:17Z

win8.1 64bit tesseract 4.0.0alpha leptonica-1.74.2 (Jun 6 2017, 21:45:59) [MSC v.1910 LIB Release x64]

The result always contains extra spaces between characters when using OEM LstmOnly or TesseractAndLstm. OEM TesseractOnly does not add the spaces, but its recognition result is bad.

oem: Default psm: SingleLine time: 114 ms. result: 伦敦楼房发生火灾中使馆关注 : 暂无中国公民受伤

oem: TesseractOnly psm: AutoOsd time: 518 ms. result: 伦敦夺委房发生火火中使馆大汪二暂无中 ` 又伤

result_lines_chi.txt

BackT0TheFuture · 2017-06-23T11:04:04Z

@zdenop @Shreeshrii @stweil
Could you give me some advice about this? Thanks.

Shreeshrii · 2017-06-23T12:14:10Z

@theraysmith Is there any config which will change this?

Shreeshrii · 2017-06-23T12:18:10Z

#988, regarding Japanese, reports similar results.

Could it be related in any way to handling of non-space delimited languages/CJK?

Shreeshrii · 2017-06-26T17:13:53Z

Possibly because of this comment in lstm/recodebeam.cpp:

// We also want to force a separate word for every non
// space-delimited character when not in a dictionary context.

https://github.com/tesseract-ocr/tesseract/blob/a1c22fb0d0f6bde165ec7b7c3125420b0ba1d541/lstm/recodebeam.cpp

Shreeshrii · 2017-06-27T11:20:10Z

#1009

Similar problem with Korean.

BackT0TheFuture · 2017-06-27T15:32:46Z

I've also tested jpn and kor images, and the results have a similar problem.
There might be something wrong with the LSTM engine or its parameters.

hoangtocdo90 · 2017-06-28T04:18:25Z

Sorry for my bad English.
My case, Japanese with the OCR mode set to LSTM, may be related.
I'm trying to get single characters using ResultIterator, but sometimes the final character in a word doesn't have the right coordinates. If I change the OCR mode to Tesseract Only, it's okay.
Example:
大阪株式会社
[debug] -Char value = 大 left= 16 top = 243 right = 68 bottom = 295 conf = 99
[debug] -Char value = 阪 left= 75 top = 244 right = 128 bottom = 295 conf = 99
[debug] -Char value = 株 left= 130 top = 244 right = 185 bottom = 296 conf = 99
[debug] -Char value = 式 left= 190 top = 243 right = 281 bottom = 296 conf = 99
[debug] -Char value = 会 left= 306 top = 243 right = 362 bottom = 296 conf = 99
[debug] -Char value = 社 left= 2214 top = 2991 right = 2214 bottom = 2991 conf = 99
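The bad last box can be spotted mechanically. A minimal Python sketch, assuming the debug lines keep exactly the format shown above (the regex and the `find_degenerate_boxes` helper are illustrative and not part of Tesseract), flags any character whose box has zero width or zero height:

```python
import re

# Matches the "[debug]" lines above; the format is assumed from this comment only.
BOX_RE = re.compile(
    r"Char value = (?P<ch>\S+)\s+left=\s*(?P<left>\d+)\s+top\s*=\s*(?P<top>\d+)"
    r"\s+right\s*=\s*(?P<right>\d+)\s+bottom\s*=\s*(?P<bottom>\d+)"
)


def find_degenerate_boxes(lines):
    """Return the characters whose bounding box has zero width or zero height."""
    bad = []
    for line in lines:
        m = BOX_RE.search(line)
        if m is None:
            continue  # not a character line
        left, top, right, bottom = (
            int(m.group(k)) for k in ("left", "top", "right", "bottom")
        )
        if right <= left or bottom <= top:
            bad.append(m.group("ch"))
    return bad


DEBUG_LINES = [
    "[debug] -Char value = 大 left= 16 top = 243 right = 68 bottom = 295 conf = 99",
    "[debug] -Char value = 阪 left= 75 top = 244 right = 128 bottom = 295 conf = 99",
    "[debug] -Char value = 株 left= 130 top = 244 right = 185 bottom = 296 conf = 99",
    "[debug] -Char value = 式 left= 190 top = 243 right = 281 bottom = 296 conf = 99",
    "[debug] -Char value = 会 left= 306 top = 243 right = 362 bottom = 296 conf = 99",
    "[debug] -Char value = 社 left= 2214 top = 2991 right = 2214 bottom = 2991 conf = 99",
]

print(find_degenerate_boxes(DEBUG_LINES))  # ['社']
```

This only detects the symptom (here the last character, 社, collapses to a single point); it does not explain why the LSTM engine reports that box.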

Shreeshrii · 2017-06-28T12:57:06Z

Using the solution suggested in #1009, the extra spaces are removed and only inter-word spaces are preserved.

See attached output files:

--oem 1 --psm 6 -l chi_sim
chineses-1.txt

--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1

chineses-spaces.txt

Please note that this differs from the behavior of preserve_interword_spaces in the 3.05 branch, where it preserves multiple spaces between words and does not compress them to a single space.

@theraysmith Is this intentional for 4.0? There are reports of preserve_interword_spaces not working with 4.0; see #781.
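For anyone scripting the comparison, here is a small Python sketch that assembles the second command line above. The helper name `build_tesseract_cmd` and the file names are hypothetical; the flags are the ones quoted in this comment, and actually running the command assumes the `tesseract` CLI and the chi_sim traineddata are installed.

```python
import subprocess


def build_tesseract_cmd(image, output_base, lang="chi_sim", fix_spaces=True):
    """Build the argv for: tesseract IMAGE OUTPUTBASE --oem 1 --psm 6 -l LANG [-c ...]."""
    cmd = ["tesseract", image, output_base, "--oem", "1", "--psm", "6", "-l", lang]
    if fix_spaces:
        # -c NAME=VALUE sets a config variable for this run only.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run Tesseract; raises CalledProcessError if the CLI reports failure."""
    subprocess.run(build_tesseract_cmd(image, output_base, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_tesseract_cmd("chinese.png", "chinese-spaces")))
```

Calling `build_tesseract_cmd(..., fix_spaces=False)` gives the first command, which makes it easy to diff the two outputs.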

BackT0TheFuture · 2017-06-28T15:09:52Z

Great, it works. I had actually noticed this variable, but I misunderstood it and set it to false, so of course it didn't work as expected.

SetVariable("preserve_interword_spaces", false);

hoangtocdo90 · 2017-06-29T10:45:19Z

@Shreeshrii Can you help me? I'm getting the wrong coordinates for the last character in a word. Thanks!

Shreeshrii · 2017-06-29T11:18:18Z

Sorry, @hoangtocdo90. I do not know about ResultIterator.

Shreeshrii · 2018-01-13T11:15:37Z

@jbreiden

This fix can be applied by adding the following to the config file and then running combine_tessdata.

# workaround to remove extra spaces in OCR result
# https://github.com/tesseract-ocr/tesseract/issues/991, 988 and 1009
preserve_interword_spaces 1

This applies to Chinese, Japanese and Korean.
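A rough Python sketch of those two steps follows. It assumes the config component is named `<lang>.config` and sits next to `<lang>.traineddata`, and that `combine_tessdata -o` overwrites individual components of an existing traineddata file, as I understand its man page; please check both against your Tesseract version. The helper names are made up for illustration.

```python
import tempfile
from pathlib import Path

CONFIG_TEXT = (
    "# workaround to remove extra spaces in OCR result\n"
    "preserve_interword_spaces 1\n"
)


def write_lang_config(directory, lang):
    """Write <lang>.config into `directory` and return its path."""
    path = Path(directory) / f"{lang}.config"
    path.write_text(CONFIG_TEXT, encoding="utf-8")
    return path


def combine_cmd(directory, lang):
    """argv asking combine_tessdata to overwrite the config component (assumed -o usage)."""
    base = Path(directory)
    return [
        "combine_tessdata",
        "-o",
        str(base / f"{lang}.traineddata"),
        str(base / f"{lang}.config"),
    ]


if __name__ == "__main__":
    # Demo in a scratch directory; a real run would point at your tessdata copy.
    with tempfile.TemporaryDirectory() as scratch:
        print(write_lang_config(scratch, "chi_sim").read_text(encoding="utf-8"))
        print(" ".join(combine_cmd(scratch, "chi_sim")))
```

Work on a copy of the traineddata file, since the command modifies it in place.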

Shreeshrii · 2018-02-20T17:44:36Z

Fixed via
tesseract-ocr/tessdata_fast#7
tesseract-ocr/tessdata_fast#8

@zdenop Please close this issue after the PR is merged in tessdata_fast.

stweil · 2019-05-21T10:01:44Z

@Shreeshrii, I just want to add preserve_interword_spaces 1 for tessdata and tessdata_best, too (see issue #2450). As far as I see, these files need to be fixed:

tessdata/chi_sim/chi_sim.config
tessdata/chi_tra/chi_tra.config
tessdata/jpn/jpn.config
tessdata/tha/tha.config
tessdata_best/chi_sim/chi_sim.config
tessdata_best/chi_tra/chi_tra.config
tessdata_best/jpn/jpn.config
tessdata_best/tha/tha.config

Is that correct?

Shreeshrii · 2019-05-21T11:21:09Z

@stweil Yes, that's right. Thank you.

stweil · 2019-05-21T15:55:58Z

I fixed tessdata_best/jpn_vert/jpn_vert.config which is included by tessdata_best/jpn/jpn.config.

stweil · 2021-08-24T12:14:21Z

What about chi_sim_vert, chi_tra_vert and jpn_vert in tessdata_fast? Do they need preserve_interword_spaces=1, too?

ZetaLin · 2023-06-09T07:03:38Z

@stweil Actually, I used this method to extract the training data before I saw your article, but the PDF output still had spaces. That is to say, the exported txt does not have extra spaces, but the text in the PDF output still does.

I should also note that another workaround is to use the config setting without having to modify the training data.
The command line:

ocrmypdf -l chi_sim --tesseract-config my.cfg input.pdf out.pdf
where my.cfg is a local file with the following contents:

preserve_interword_spaces 1
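The same setup can be scripted. This Python sketch only reproduces the command above: `write_cfg` and `ocrmypdf_cmd` are hypothetical helper names, and running the result assumes `ocrmypdf` is installed. Whether it removes the spaces from the PDF text layer is exactly what is in question in this thread, so treat it as something to test rather than a confirmed fix.

```python
from pathlib import Path

CFG_TEXT = "preserve_interword_spaces 1\n"


def write_cfg(path="my.cfg"):
    """Create the local Tesseract config file and return its path."""
    cfg = Path(path)
    cfg.write_text(CFG_TEXT, encoding="utf-8")
    return cfg


def ocrmypdf_cmd(cfg, src="input.pdf", dst="out.pdf", lang="chi_sim"):
    """Build the argv for: ocrmypdf -l LANG --tesseract-config CFG SRC DST."""
    return ["ocrmypdf", "-l", lang, "--tesseract-config", str(cfg), src, dst]


if __name__ == "__main__":
    print(" ".join(ocrmypdf_cmd("my.cfg")))
```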

legistek · 2023-07-17T15:28:32Z

@ZetaLin same issue here. The raw text output is right but when output to PDF there is a space between every character.

languagemaniac · 2023-10-21T23:31:09Z

@ZetaLin Regarding your comment above (the ocrmypdf --tesseract-config my.cfg workaround, with my.cfg containing preserve_interword_spaces 1): does that still work for you? I tried it, but it's not working for me, and I don't know what to do.

abhishekchopde mentioned this issue Jun 28, 2017

Detection of extra spaces while running own trained tesseract for Korean OCR #1009

Closed

Shreeshrii mentioned this issue Jun 28, 2017

Spacing Between Words "Japanese" Recognition #988

Closed

This was referenced Feb 20, 2018

Fixes extra intra-word spacing in Chinese for 4.0 tesseract-ocr/langdata#109

Merged

Fix extra intra-word spaces by adding config file tesseract-ocr/tessdata_fast#7

Merged

zdenop closed this as completed in tesseract-ocr/tessdata_fast@719cfd4 Feb 21, 2018

stweil mentioned this issue May 21, 2019

Fix extra intra-word spacing for several Asian languages (GitHub issue #991) tesseract-ocr/tessdata_best#37

Merged

This was referenced Apr 22, 2020

Extra spaces in any output except txt for non space delimited languages #2702

Open

Chinese recognition was incorrectly segmented by spaces #2814

Closed

Eyxxxxx mentioned this issue Jan 15, 2021

extra space in the result pdf when the input pdf is in Chinese ocrmypdf/OCRmyPDF#715

Open

kazupon mentioned this issue Mar 25, 2021

Words array not correctly generated for Japanese/Korean/Chinese naptha/tesseract.js#413

Closed

amitdo added the non spaced words label Aug 24, 2021

shq251 mentioned this issue Sep 6, 2022

Observe white space spaces in result when OCR Japanese. #3916

Closed

Blu3train mentioned this issue Sep 21, 2022

Ground-truth preview: "Recognized value" display problems buliasz/tesstrain-windows-gui#18

Closed

sidazhou mentioned this issue Nov 7, 2022

Documentations for new users MohrJonas/obsidian-ocr#21

Closed

ghost mentioned this issue Jan 30, 2023

TextExtractor: Spaces are added before and after full-width punctuation marks in Chinese. | 在中文中的全形标点符号前后被添加空格。 microsoft/PowerToys#23016

Closed

ffchung mentioned this issue Dec 10, 2023

extra spaces in result when ocr chinese simon987/sist2#443

Closed

tenpai-git mentioned this issue Mar 20, 2024

Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR eikek/docspell#2505

Merged
