Unicharset_extractor meet ICU ERROR with 64 bit Windows installer #1625

Closed
yzy1996 opened this issue Jun 3, 2018 · 26 comments

@yzy1996 yzy1996 commented Jun 3, 2018

Environment

  • Tesseract Version: <4.00>
  • Platform: <Windows 64>

Current Behavior:

C:\Users\Jerry\Desktop\新建文件夹>unicharset_extractor chi_my.font.exp0.box
Extracting unicharset from box file chi_my.font.exp0.box
ICU ERROR: U_FILE_ACCESS_ERROR

However, I found that the error does not occur with [tesseract-ocr-setup-4.00.00dev.exe]. It occurs with [tesseract-ocr-w64-setup-v4.0.0-beta.1.20180414.exe].

Collaborator

@Shreeshrii Shreeshrii commented Jun 3, 2018

Please try the w32 version of beta.1. Does it produce the same error?

Author

@yzy1996 yzy1996 commented Jun 4, 2018

Thank you! I have tried the win32 version of beta.1. It works well.
So why does this occur with the win64 version? I have tried many ways to fix the ICU error, and all of them failed.

Collaborator

@Shreeshrii Shreeshrii commented Jun 4, 2018

@stweil has mentioned on the page that the w64 version is experimental.

Contributor

@stweil stweil commented Jun 4, 2018

@yzy1996, can you provide chi_my.font.exp0.box or another file with which it is possible to reproduce the problem?

Author

@yzy1996 yzy1996 commented Jun 5, 2018

@stweil Thank you! I am glad to help find and fix the problem.

Attached is a simple file named chi_my.font.exp0.box. You need to rename it first (delete the .txt suffix). I found that the error has nothing to do with the file's content.

This file is enough: run unicharset_extractor chi_my.font.exp0.box in the directory that contains the file.

chi_my.font.exp0.box.txt

Contributor

@stweil stweil commented Jun 5, 2018

This line fails. I currently have no idea why icu::Normalizer2::getInstance(nullptr, "nfkc", UNORM2_COMPOSE, error_code) does not work on 64 bit Windows. The same code works earlier when called as icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, error_code).

Collaborator

@Shreeshrii Shreeshrii commented Jun 5, 2018

@egorpugin Can you check whether this works with the Visual Studio 64bit version?

Contributor

@egorpugin egorpugin commented Jun 5, 2018

With a 64 bit build, I get:

Extracting unicharset from box file chi_my.font.exp0.box.txt
Other case A of a is not in unicharset
Wrote unicharset file unicharset

Contributor

@egorpugin egorpugin commented Jun 5, 2018

@yzy1996 Are you using cppan builds or the prebuilt installers from Mannheim?
The cppan executables work fine.

Collaborator

@Shreeshrii Shreeshrii commented Jun 5, 2018

@egorpugin Thanks for testing.

Are the artifacts from the Appveyor builds directly usable?

https://ci.appveyor.com/project/zdenop/tesseract/build/job/xh1atjcupa6qxlox/artifacts

Contributor

@egorpugin egorpugin commented Jun 5, 2018

Sure, they have been for a long time.

Collaborator

@Shreeshrii Shreeshrii commented Jun 5, 2018

Do these include the training tools?

I would like to add the info to https://github.com/tesseract-ocr/tesseract/wiki#windows

What would be the best way to describe this so that people know that VS2017 and VS2015 builds are available for download?

Contributor

@stweil stweil commented Jun 5, 2018

The download is a little bit tricky because there is no fixed URL. You have to select an Appveyor build that was successful and get its artifacts, a ZIP file which contains the executables.

By the way, it looks like the current Appveyor build needs maintenance. It reports "Error uploading cache entry to the cache storage: Remote server returned 500: There is not enough space on the disk." The build time increased from about 15 minutes to about 80 minutes.

Contributor

@egorpugin egorpugin commented Jun 5, 2018

By the way, it looks like the current Appveyor build needs maintenance.

That is because of the additional VS2015 builds. The cppan cache on Appveyor now exceeds 100 MB for 4 builds, so it cannot be saved. At the moment I don't have the funds to create a cppan binary cache on the internet (to cache and download binaries instead of running long source builds, which would be cheaper in terms of server resources).
So build times are longer.

Collaborator

@Shreeshrii Shreeshrii commented Jun 5, 2018

to cache and download binaries

Can they not be posted in a github repo?

Contributor

@egorpugin egorpugin commented Jun 5, 2018

If you are talking about every cppan package, I'm afraid GitHub will ban me for abuse. :)
It could take up 100-300 GB now and TBs later, because you need to cache lots of configurations.
If you mean only the tesseract deps, I have not thought about it yet.

Collaborator

@Shreeshrii Shreeshrii commented Jun 5, 2018

You could try it only for those projects where the cppan cache on Appveyor exceeds 100 MB, starting with the tesseract deps :-)

Author

@yzy1996 yzy1996 commented Jun 6, 2018

@egorpugin I am using the prebuilt installers from Mannheim.

Contributor

@stweil stweil commented Jun 6, 2018

It looks like the problem is caused by a buggy Cygwin package.

Extract from Cygwin package files:

$ tar tvJf mingw64-i686-icu-57.1-2.tar.xz | grep icu.*57.dll
-rwxr-xr-x Yaakov/None 25680896 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icudata57.dll
-rwxr-xr-x Yaakov/None  2405376 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icui18n57.dll
-rwxr-xr-x Yaakov/None    44032 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icuio57.dll
-rwxr-xr-x Yaakov/None   298496 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icule57.dll
-rwxr-xr-x Yaakov/None    40960 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/iculx57.dll
-rwxr-xr-x Yaakov/None    78848 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icutest57.dll
-rwxr-xr-x Yaakov/None   195072 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icutu57.dll
-rwxr-xr-x Yaakov/None  1467904 2016-11-10 04:59 usr/i686-w64-mingw32/sys-root/mingw/bin/icuuc57.dll

$ tar tvJf mingw64-x86_64-icu-57.1-2.tar.xz | grep icu.*57.dll
-rwxr-xr-x Yaakov/None   15872 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icudata57.dll
-rwxr-xr-x Yaakov/None 2220032 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icui18n57.dll
-rwxr-xr-x Yaakov/None   51200 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icuio57.dll
-rwxr-xr-x Yaakov/None  315904 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icule57.dll
-rwxr-xr-x Yaakov/None   46080 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/iculx57.dll
-rwxr-xr-x Yaakov/None   83456 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icutest57.dll
-rwxr-xr-x Yaakov/None  199168 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icutu57.dll
-rwxr-xr-x Yaakov/None 1488384 2016-11-10 05:08 usr/x86_64-w64-mingw32/sys-root/mingw/bin/icuuc57.dll

While the 32 bit package has a large icudata57.dll, that dll is much smaller in the 64 bit package. That might explain the failing call of icu::Normalizer2::getInstance(nullptr, "nfkc", UNORM2_COMPOSE, error_code).
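
The size comparison above can be automated. The following is only a minimal sketch, assuming listings in the tar tv layout shown above; ParseTarListing, LooksLikeStubIcuData and the 1 MiB threshold are hypothetical names and values chosen for illustration, not part of Tesseract, ICU or Cygwin.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One entry of a `tar tv` listing: file size in bytes and path in the archive.
struct TarEntry {
  std::uint64_t size;
  std::string path;
};

// Parse lines of the form
//   -rwxr-xr-x Yaakov/None 15872 2016-11-10 05:08 usr/.../icudata57.dll
// Lines that do not match this layout (for example the shell prompt line)
// are skipped.
std::vector<TarEntry> ParseTarListing(const std::string& listing) {
  std::vector<TarEntry> entries;
  std::istringstream lines(listing);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string mode, owner, date, time;
    TarEntry entry;
    if (fields >> mode >> owner >> entry.size >> date >> time >> entry.path) {
      entries.push_back(entry);
    }
  }
  return entries;
}

// Hypothetical heuristic: an ICU data DLL smaller than min_bytes is treated
// as a stub without real data. The 1 MiB default is an assumption for this
// sketch, not a documented ICU limit.
bool LooksLikeStubIcuData(const TarEntry& entry,
                          std::uint64_t min_bytes = 1024 * 1024) {
  const std::size_t slash = entry.path.find_last_of('/');
  const std::string name =
      (slash == std::string::npos) ? entry.path : entry.path.substr(slash + 1);
  // rfind(prefix, 0) == 0 is a "starts with" test.
  return name.rfind("icudata", 0) == 0 && entry.size < min_bytes;
}
```

Applied to the two listings above, this heuristic would flag only the 15872-byte icudata57.dll from the x86_64 package; the 25680896-byte file from the i686 package passes.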

Collaborator

@Shreeshrii Shreeshrii commented Jun 6, 2018

cppan is using a newer version of icudata:

Unpacking : pvt.cppan.demo.unicode.icu.i18n-61.1.0...
Unpacking : pvt.cppan.demo.unicode.icu.data-61.1.0...

Contributor

@stweil stweil commented Jun 6, 2018

A local build of the code used by Cygwin works fine for me, so I have working 64 bit DLL files.

Here is a short test program:

#include <cstdio>
#include <unicode/errorcode.h>
#include <unicode/normalizer2.h>

// Usage: pass a normalizer name such as "nfc" or "nfkc" as the only argument.
int main(int argc, char *argv[])
{
  // Default to the name that fails with the broken 64 bit DLL files.
  const char *name = "nfkc";
  if (argc == 2) {
    name = argv[1];
  }
  icu::ErrorCode error_code;
  error_code.reset();
  // Sanity check: the error code must be clean before the call under test.
  if (error_code.isFailure()) {
    printf("failed before icu::Normalizer2::getInstance\n");
  }
  // Same call that fails in Tesseract (package name nullptr).
  icu::Normalizer2::getInstance(nullptr, name, UNORM2_COMPOSE, error_code);
  if (error_code.isFailure()) {
    printf("FAILED: icu::Normalizer2::getInstance(NULL, \"%s\", UNORM2_COMPOSE, error_code)\n", name);
  } else {
    printf("PASSED: icu::Normalizer2::getInstance(NULL, \"%s\", UNORM2_COMPOSE, error_code)\n", name);
  }
  return 0;
}

Contributor

@egorpugin egorpugin commented Jun 6, 2018

While the 32 bit package has a large icudata57.dll, that dll is much smaller in the 64 bit package.

It means you have no ICU data (or only part of it).
Could you test the cppan-produced binaries? Maybe they could be used instead of the Mannheim installer?

Contributor

@stweil stweil commented Jun 6, 2018

I reported the problem on the Cygwin mailing list. The next update of the Mannheim installer will include fixed ICU DLL files.

Contributor

@stweil stweil commented Jun 8, 2018

The new installer is now available and includes two ICU DLL files which fix this issue. @yzy1996, maybe you can give it a try. I also suggest adding some information to the title of this issue: "Unicharset_extractor meet ICU ERROR with 64 bit Windows installer".

@yzy1996 yzy1996 changed the title from "Unicharset_extractor meet ICU ERROR" to "Unicharset_extractor meet ICU ERROR with 64 bit Windows installer" Jun 11, 2018
Author

@yzy1996 yzy1996 commented Jun 11, 2018

Thank you for helping to solve the problem. I still have a lot to learn, and I hope to become one of you one day.

Collaborator

@Shreeshrii Shreeshrii commented Jun 11, 2018

Please close the issue as solved.

@yzy1996 yzy1996 closed this Jun 13, 2018