API: Different results for the same image depending on the order in which the files are processed #3452

Closed
nagadomi opened this issue Jun 7, 2021 · 9 comments
@nagadomi
Contributor

nagadomi commented Jun 7, 2021

Separated from #3200

Environment

  • Tesseract Version: 5.0.0 Alpha (master branch)
  • Commit Number: bf979c8
  • Platform: Linux mpn1 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

When processing multiple files through the API, the same image can produce different results depending on the order in which the files are processed.
Processing the files in the same order always produces the same result.

Expected Behavior:

The same result should be produced for the same image regardless of the processing order.

Note 1: I understand that multithreading and SIMD can cause minor differences in results, but this seems to be a different issue.
Note 2: The differences are mainly caused by the diplopia issue.

Simple reproduction code (with tesserocr)

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, tesseract_version

if __name__ == "__main__":
    TESSDATA_DIR="/home/nagadomi/dev/tesseract-git/tessdata_fast" # fill tessdata directory
    test_image = Image.open("test1.png")

    print(tesseract_version(), "\n")

    print("* case1 API re-use")
    with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
        api.SetVariable("preserve_interword_spaces", "1")
        variants = set()
        for t in range(100):
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
        print(f"{len(variants)} different results")
        print("----\n".join(variants))

    print("* case2 API re-create")
    variants = set()
    for t in range(100):
        with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
            api.SetVariable("preserve_interword_spaces", "1")
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
    print(f"{len(variants)} different results")
    print("----\n".join(variants))

test1.png:
(image: test1)
result:

% OMP_THREAD_LIMIT=1 python3 variants.py
tesseract 5.0.0-alpha-20210401-108-gbf342
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 

* case1 API re-use
2 different results
パバイボ
パバイボ
パイポの
シューリンガン
----
パバイボ
パバイボ
パバイポの
シューリンガン

* case2 API re-create
1 different results
パバイボ
パバイボ
パバイポの
シューリンガン

Note 1: This code uses tesserocr to make the issue easier to reproduce, but our codebase uses the Tesseract C-API via ctypes (it calls TessBaseAPISetImage, TessBaseAPIGetUTF8Text, TessBaseAPIClear), so I don't think it's a tesserocr issue.
Note 2: test1.png does not reproduce this problem with tessdata_best, but I have confirmed that the same problem can occur with tessdata_best (float model) on other images. I have not succeeded in creating a publishable test image that reproduces it.

@amitdo
Collaborator

amitdo commented Jun 7, 2021

Can you reproduce this issue with a document written in English or another language that uses the Latin script? You say it is related to the diplopia issue, so maybe an image uploaded by other people in one of the diplopia issues can be used to reproduce your issue.

Also, try to reproduce the issue on the command line:

tesseract images out

images should be a text file that lists the image files, one full path per line. You can give this file any name you want.

In your case it should contain something like this:

/path/to/test1.png
/path/to/test1.png
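
A list file like this can also be generated with a few lines of Python. This is only a sketch: the helper name write_list_file is made up for illustration, and the image names passed to it are placeholders.

```python
from pathlib import Path


def write_list_file(list_path, image_paths):
    """Write one absolute image path per line, keeping the given order."""
    lines = [str(Path(p).resolve()) for p in image_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Example call (listing the same image twice makes it get processed twice):
# write_list_file("images", ["test1.png", "test1.png"])
```

The resulting file is then passed as the first argument, as in the tesseract images out command above.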

@nagadomi
Contributor Author

nagadomi commented Jun 7, 2021

Steps to reproduce this issue using the command line and ocrd-testset.zip:

% unzip ocrd-testset.zip -d ocrd-testset
% find ocrd-testset -name "*.tif" | sort > list1.txt
% find ocrd-testset -name "*.tif" | shuf --random-source list1.txt > list2.txt
% tesseract list1.txt out1 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list2.txt out2 -l frk --tessdata-dir ../../tessdata_fast/
% tesseract list1.txt out3 -l frk --tessdata-dir ../../tessdata_fast/

check_diff.py

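def load_result(list_file, result_file):
    # Map each input file name to its OCR text.
    # The text output is split on form feed ("\x0c"), one chunk per page.
    with open(list_file) as f1, open(result_file) as f2:
        files = f1.read().splitlines()
        results = f2.read().split("\x0c")
        return {fn: ret for fn, ret in zip(files, results)}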

def print_diff(result1, result2):
    for key in result1.keys():
        if result1[key] != result2[key]:
            print(f"---- {key}")
            print(result1[key])
            print(result2[key])

result1 = load_result("list1.txt", "out1.txt")
result2 = load_result("list2.txt", "out2.txt")
result3 = load_result("list1.txt", "out3.txt")
assert(set(result1.keys()) == set(result2.keys()))

print("* out1 x out2")
print_diff(result1, result2)

print("* out1 x out3")
print_diff(result1, result3)

% python3 check_diff.py
* out1 x out2
---- ocrd-testset/bismarck_erinnerungen02_1898_0274_002.tif
obligatur fann durc4 feine Bertragsclaujel außer Kraft gejebt

obligatur fann durc; feine Bertragsclaujel außer Kraft gejebt

---- ocrd-testset/mueller_waldhornist_1821_0051_015.tif
n Sturm und Regen und Schnee,

In Sturm und Regen und Schnee,

* out1 x out3

out1 and out2 were produced from differently ordered lists, and their results differ. out1 and out3 were produced from the same order, and there is no difference. This is why I suspect that the results depend on the processing order.

@nagadomi
Contributor Author

nagadomi commented Jun 8, 2021

I could not find any images in the diplopia issue threads that reproduce this issue.
I guess this issue occurs when some of the outputs of the softmax function are close to each other, so it is hard to reproduce for languages with few symbols, such as English.
Japanese has many symbols, some of which are very similar to each other, so it is easy to reproduce.
With a lot of images, as in the ocrd-testset.zip example above, it can be reproduced, but I think debugging will be more difficult.
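
A toy illustration of that guess, not Tesseract code: the logit values below are made up, and softmax and argmax are plain helper functions written for this example. When two classes are nearly tied, a very small perturbation of one logit is enough to change which symbol wins.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits: two look-alike symbols nearly tied, one clearly different.
logits = [2.0000, 2.0001, -1.0]
# The same logits with a tiny change to the first one.
perturbed = [logits[0] + 0.0002, logits[1], logits[2]]

print(argmax(softmax(logits)), argmax(softmax(perturbed)))
```

With many similar-looking symbols, such near-ties are more likely, which would fit the observation that Japanese reproduces the problem more easily than English.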

The issue with test1.png above can be reproduced in two lines.

% echo test1.png > list.txt; echo test1.png >> list.txt
% tesseract list.txt stdout -l jpn_vert --psm 5 --tessdata-dir ../../tessdata_fast
Page 0 : test2.png
パ バ イボ
パ バ イボ
パ バイ ポ の
シュ ー リ ン ガ ン

Page 1 : test2.png
パ バ イボ
パ バ イボ
パイ ポ の
シュ ー リ ン ガ ン

@nagadomi
Contributor Author

nagadomi commented Jun 30, 2021

This issue seems to be fixed by #3474 (#3473). It can no longer be reproduced on the latest master branch.
I have a feeling that the diplopia issue occurs more frequently than before, but I have no evidence for that.

@nocun
Contributor

nocun commented Jun 30, 2021

Is this type of issue added to some test suite? It would be great to be able to catch these types of errors as they happen.

@amitdo
Collaborator

amitdo commented Jun 30, 2021

@nocun
Contributor

nocun commented Jun 30, 2021

Are those tests run daily on CI? Could you check whether they no longer fail?

@stweil
Contributor

stweil commented Jul 3, 2021

I don't think that we already have a test which would have detected this issue, otherwise we would have noticed the failure early. Maybe you want to add one?

@nocun
Contributor

nocun commented Jul 4, 2021

@stweil That is a good idea, I will try to write one.
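
A minimal sketch of what such a test could look like. It is not taken from the Tesseract test suite: recognize_all is a hypothetical callable that processes a list of files in order with one API instance and returns one text per file, and the two stand-ins below only simulate a stateless and a stateful recognizer.

```python
def find_order_dependent(recognize_all, files):
    """Return the files whose text differs between two processing orders.

    Assumes the entries of `files` are unique.
    """
    forward = list(files)
    backward = forward[::-1]
    text_fwd = dict(zip(forward, recognize_all(forward)))
    text_bwd = dict(zip(backward, recognize_all(backward)))
    return sorted(f for f in forward if text_fwd[f] != text_bwd[f])


def stateless_recognizer(files):
    # Stand-in: the result depends only on the file itself.
    return [f.upper() for f in files]


def stateful_recognizer(files):
    # Stand-in: state left over from the previous file leaks into the result.
    results, previous = [], ""
    for f in files:
        results.append(f.upper() + previous[:1])
        previous = f
    return results


files = ["a.tif", "b.tif", "c.tif"]
print(find_order_dependent(stateless_recognizer, files))
print(find_order_dependent(stateful_recognizer, files))
```

A real test would replace the stand-ins with a function that runs the actual API over the files and would assert that the returned list is empty.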

amitdo added the bug label Jul 11, 2021