RFC: Add initial support for traineddata files in compressed archive formats (don't merge) #911

Closed
stweil wants to merge 8 commits

Conversation

stweil
Contributor

@stweil stweil commented May 12, 2017

This requires libminizip-dev, so expect failures from CI.

So far, only little-endian Tesseract works with the new zip format.

More work is needed for the training tools, for big-endian support, and to maintain
compatibility with the current proprietary format.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

@stweil
Contributor Author

stweil commented May 12, 2017

Open questions:

  • Do we want a new format? As I'm not the first one who had the idea, I think the answer is yes.
  • Do we need support for both old (current) and new format? I'd drop support for the old format
    and remove combine_tessdata.
  • Should the traineddata files in the new format add .zip to the file names? I'd omit .zip.
  • Should the code for minizip be added to the Tesseract sources, or should we add an external dependency to libminizip-dev?
  • Which one is better, zip or compressed tar?
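If the `.zip` suffix is omitted, a reader has to recognize the container from the file contents instead of the name. A minimal sketch of that idea, assuming detection by the well-known leading "magic" bytes of each format; the helper name `sniff_format` is hypothetical and is not part of this pull request or of Tesseract:

```python
import bz2
import gzip
import lzma

# Leading magic bytes of common archive and stream formats.
# A plain tar file is not covered: its "ustar" marker sits at offset 257.
MAGIC = [
    (b"PK\x03\x04", "zip"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"BZh", "bzip2"),
    (b"\x1f\x8b", "gzip"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\x04\x22\x4d\x18", "lz4"),
]


def sniff_format(header: bytes) -> str:
    """Guess the format from the first bytes of a file (hypothetical helper)."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return "unknown"  # anything else, for example the current Tesseract format


# Demonstration on freshly compressed in-memory data.
for blob in (gzip.compress(b"demo"), bz2.compress(b"demo"), lzma.compress(b"demo")):
    print(sniff_format(blob[:8]))
```

With a check like this, `eng.traineddata` could keep its name regardless of the container used inside.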

@Shreeshrii
Collaborator

Yes, there have been requests for more compact/compressed traineddata files.

Another question:

  • Should the new format be limited to tesseract 4.0 or also applied to 3.05?

@stweil
Contributor Author

stweil commented May 12, 2017

libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

@Shreeshrii
Collaborator

> libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

Why not use a different compression library that is available on other operating systems as well as on older Ubuntu versions?

@stweil
Contributor Author

stweil commented May 12, 2017

On my Debian system I find these libraries: minizip (in Ubuntu since 16.04), libzip (since 12.04), zziplib (since 12.04) and libarchive (since 12.04). As far as I know, all of them use licenses which are compatible with Tesseract. I assume any of them can be used. I expect that none of them is available as a binary for Windows (maybe not for macOS either), but I have not checked.

@stweil
Contributor Author

stweil commented May 12, 2017

By default, the zip format reduces eng.traineddata from about 30.4 MiB to 15.7 MiB (a 48 % size reduction). zip -9 improves that to 49 %. Some other compressed formats achieve even better compression (sizes in bytes):

31887360 eng.traineddata.tar
31873501 eng.traineddata
18121906 eng.traineddata.lz4
16461487 eng.traineddata.zip (default)
16372645 eng.traineddata.zip (maximum compression)
15193532 eng.traineddata.tar.bz2
13274164 eng.traineddata.tar.xz
13273173 eng.traineddata.7z

75100160 mya.traineddata.tar
75085274 mya.traineddata
42274775 mya.traineddata.lz4
39468033 mya.traineddata.tar.bz2
36296750 mya.traineddata.tar.gz
36075469 mya.traineddata.zip
28097639 mya.traineddata.7z
27937332 mya.traineddata.tar.xz
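The percentages quoted above follow directly from the byte counts. A short sketch of the arithmetic, using only sizes copied from the eng listing; the function name `saving_percent` is made up for this illustration:

```python
# Sizes in bytes, copied from the eng.traineddata listing above.
ENG_ORIGINAL = 31873501
eng = {
    "lz4": 18121906,
    "zip (default)": 16461487,
    "zip (maximum)": 16372645,
    "tar.bz2": 15193532,
    "tar.xz": 13274164,
    "7z": 13273173,
}


def saving_percent(original: int, compressed: int) -> float:
    """Size reduction relative to the original file, in percent."""
    return 100.0 * (1.0 - compressed / original)


for name, size in eng.items():
    print(f"{name:14s} {saving_percent(ENG_ORIGINAL, size):5.1f} %")
```

Rounded to whole percent, the two zip entries give the 48 % and 49 % mentioned in the comment.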

@zdenop
Contributor

zdenop commented May 12, 2017

Please move the discussion to the tesseract-dev forum. This is a significant change.

@egorpugin
Contributor

egorpugin commented May 12, 2017

What about lz4?


btw, libarchive handles all formats.
https://github.com/libarchive/libarchive

@stweil
Contributor Author

stweil commented May 12, 2017

> Please move the discussion to the tesseract-dev forum. This is a significant change.

See this discussion in the forum. I added a link to GitHub there.

@stweil
Contributor Author

stweil commented May 12, 2017

> libarchive handles all formats

It is also supported by current Linux distributions and would be interesting if compressed tar instead of zip is preferred. I added it to my previous post.

@egorpugin
Contributor

egorpugin commented May 12, 2017

libarchive has been supported since very early Ubuntu versions and is available in almost every other Linux distribution.
http://packages.ubuntu.com/search?suite=precise&searchon=names&keywords=libarchive

Personally, I'm using libarchive in cppan. The code for working with any format is very simple, see:
https://github.com/egorpugin/primitives/blob/master/pack/include/primitives/pack.h
https://github.com/egorpugin/primitives/blob/master/pack/src/pack.cpp

Of course, this is pack/unpack archive code, but streaming code should be pretty similar and simple too.

@stweil
Contributor Author

stweil commented May 12, 2017

> What about lz4?

It does not compress very well (see the result added to the list above).

@egorpugin
Contributor

egorpugin commented May 12, 2017

Actually, I meant to say lzma, which is used by the .xz and .7z formats.
Sorry. :)

@stweil
Contributor Author

stweil commented May 13, 2017

Rebased and added support for libzip.

@egorpugin
Contributor

What libraries are currently in use in your PR?
libarchive? minizip? libzip?
I see libarchive in the build scripts, but not in the code.
Maybe it is worth using only one implementation (library)? I don't like multiple implementations of the same thing.

@stweil
Contributor Author

stweil commented May 13, 2017

This is experimental code, as there is still no decision on whether compressed archives should be supported at all and, if so, with which format and which library.

The current code uses libzip or, if that is not found, minizip. If neither of them is found, it uses the normal code. I prepared the code for more experiments, for example to support compressed tar archives with libarchive.

@stweil
Contributor Author

stweil commented May 13, 2017

As you can see here, the implementations for the two currently supported libraries are very similar.

@stweil
Contributor Author

stweil commented May 13, 2017

The latest code also supports libarchive (highest priority). With that library, all kinds of compressed archives should work (up to now I tested with zip only).

@stweil stweil changed the title from "RFC: Add initial support for traineddata files in zip format (don't merge)" to "RFC: Add initial support for traineddata files in zip and other compressed archive formats (don't merge)" May 13, 2017
@stweil
Contributor Author

stweil commented May 14, 2017

As libarchive indeed supports all formats, I could compare the time needed for each format. Tesseract was run 5 times on each format with English on a simple hello world text. Below is the result sorted by time in seconds for each test. Interpretation:

  • The original Tesseract format, uncompressed tar and lz4 tar are similar and fastest.
  • zip needs about 150 ms more time than the original Tesseract format.
  • 7z and xz tar need about 850 ms more time than the original Tesseract format.
  • bz2 tar is slowest and needs about 1450 ms more time than the original Tesseract format.

The file i/o from disk did not play a role in this test because of the Linux file cache and the SSD of my computer.

  0.13 eng.traineddata.tar
  0.14 eng.traineddata
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.15 eng.traineddata
  0.15 eng.traineddata.tar
  0.15 eng.traineddata.lz4
  0.16 eng.traineddata
  0.16 eng.traineddata
  0.17 eng.traineddata.lz4
  0.17 eng.traineddata.lz4
  0.18 eng.traineddata
  0.18 eng.traineddata.lz4
  0.22 eng.traineddata.lz4
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.97 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.99 eng.traineddata.7z
  0.99 eng.traineddata.tar.xz
  0.99 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.04 eng.traineddata.7z
  1.55 eng.traineddata.tar.bz2
  1.56 eng.traineddata.tar.bz2
  1.61 eng.traineddata.tar.bz2
  1.62 eng.traineddata.tar.bz2
  1.66 eng.traineddata.tar.bz2
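The measurements above come from running Tesseract itself. To get a rough feeling for the relative decompression cost of the underlying codecs, a sketch like the following can be used. It is not the benchmark from this comment: it assumes Python's standard-library codecs and a synthetic payload instead of libarchive and a real traineddata file, so the absolute numbers will differ.

```python
import bz2
import lzma
import os
import time
import zlib


def time_decompress(decompress, blob, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in payload of 1 MiB (assumption): not a traineddata file.
payload = (os.urandom(1024) + bytes(3072)) * 256

codecs = {
    "deflate (zip, gz)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma (xz, 7z)": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    assert decompress(blob) == payload  # sanity check before timing
    results[name] = time_decompress(decompress, blob)

# Print the codecs sorted by time, fastest first.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{seconds:8.4f} s  {name}")
```

Taking the best of several runs reduces noise from other processes; the listing above instead reports every run.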

@Shreeshrii
Collaborator

Shreeshrii commented May 14, 2017 via email

@stweil
Contributor Author

stweil commented May 14, 2017

Test results with libarchive for mya.traineddata (the largest of all traineddata files). I did not test lz4, but added a test with gz format.

0.48 mya.traineddata.tar
0.49 mya.traineddata
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.50 mya.traineddata
0.52 mya.traineddata
0.52 mya.traineddata.tar
0.54 mya.traineddata
0.54 mya.traineddata
0.79 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.82 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.86 mya.traineddata.tar.gz
0.86 mya.traineddata.zip
0.88 mya.traineddata.tar.gz
0.90 mya.traineddata.zip
2.38 mya.traineddata.7z
2.38 mya.traineddata.7z
2.38 mya.traineddata.tar.xz
2.40 mya.traineddata.7z
2.41 mya.traineddata.tar.xz
2.45 mya.traineddata.7z
2.45 mya.traineddata.tar.xz
2.46 mya.traineddata.7z
2.46 mya.traineddata.tar.xz
2.49 mya.traineddata.tar.xz
3.69 mya.traineddata.tar.bz2
3.74 mya.traineddata.tar.bz2
3.75 mya.traineddata.tar.bz2
3.79 mya.traineddata.tar.bz2
3.84 mya.traineddata.tar.bz2

libzip gives similar results, but only supports the zip format:

0.83 mya.traineddata.zip
0.84 mya.traineddata.zip
0.87 mya.traineddata.zip
0.88 mya.traineddata.zip
0.93 mya.traineddata.zip

libminizip:

0.84 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.87 mya.traineddata.zip
0.92 mya.traineddata.zip

libzzip:

0.75 mya.traineddata.zip
0.78 mya.traineddata.zip
0.78 mya.traineddata.zip
0.79 mya.traineddata.zip
0.84 mya.traineddata.zip

@egorpugin
Contributor

Does lzma compress more slowly but better? Or does it also decompress more slowly?

@stweil
Contributor Author

stweil commented May 14, 2017

lzma created the xz files. 7zip and lzma gave the best compression ratios, but both also need some time for the decompression (which is relevant for Tesseract): they need about 1.9 s more time (but still are faster than bz2).

Please note that the current code for all formats reads all parts of the tessdata file, no matter whether they are used or not, so the decompression overhead could be reduced.
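One way to reduce that overhead is to decompress only the components that are actually needed. The sketch below assumes a zip container, where every member is compressed on its own and can be read individually; in a compressed tar the whole stream has to be decompressed up to the wanted member. The function name and the component names in the demonstration are illustrative and are not taken from this pull request.

```python
import io
import zipfile


def load_components(archive_bytes, wanted_suffixes):
    """Decompress only the members whose name ends with a wanted suffix."""
    loaded = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.filename.endswith(tuple(wanted_suffixes)):
                loaded[info.filename] = archive.read(info)
    return loaded


# Build a small demo archive in memory (illustrative member names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("eng.unicharset", b"u" * 100)
    archive.writestr("eng.inttemp", b"i" * 100)
    archive.writestr("eng.lstm", b"l" * 100)

needed = load_components(buffer.getvalue(), [".lstm", ".unicharset"])
print(sorted(needed))
```

Members that are skipped are never decompressed, so the saving grows with the size of the unused components.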

@Shreeshrii
Collaborator

@theraysmith wrote on 4/18/14

I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.

and on 4/20/14

I spent some time looking at zlib. It doesn't seem to make it easy to randomly access named entities in a gzip file, unless I am missing something. The memory compress/uncompress functions are quite nice though.

For the next version it would be nice to:
Update tessdatamanager to cope with compressed components.
Eliminate fread/fscanf from file input code and allow everything to read from a memory buffer.
These can probably both be achieved with the TFile class that I added for 3.03.

This is a change in direction from my previous work with new classifier experiments, where I have been writing everything to use Serialize/DeSerialize and FILE streams, but this doesn't seem to be as portable as I had hoped, due to its reliance on fmemopen. It seems it would be better to make everything use memory buffers and push the file I/O responsibility out to TessDataManager/TFile, which could then just as easily deal with compressed files or in-memory data.

@stweil Do all the methods you tested support randomly accessing named entities?

@theraysmith Is there a particular reason for zip (with no tar)?

@stweil
Contributor Author

stweil commented May 15, 2017

The current Tesseract code reads the whole tessdata file into memory and gets all data from memory. My implementation for compressed archive files does that, too. Therefore random access is trivial: all component files are in a vector of byte arrays.
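The approach described here, unpacking every component into memory once and then serving all reads from memory, can be sketched as follows. This is an illustration with Python's `tarfile` module, not the C++ code of the pull request; a dictionary of byte strings stands in for the vector of byte arrays, and the helper names are hypothetical.

```python
import io
import tarfile


def read_all_components(archive_bytes):
    """Unpack every regular file of a (possibly compressed) tar into memory."""
    components = {}
    # Mode "r:*" lets tarfile detect gzip, bzip2 or xz compression itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                components[member.name] = tar.extractfile(member).read()
    return components


def make_demo_archive(files, mode="w:xz"):
    """Build a small tar archive in memory for the demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


demo = {"eng.unicharset": b"abc", "eng.lstm": bytes(100)}
components = read_all_components(make_demo_archive(demo))
# After loading, any component can be accessed in any order.
print(len(components["eng.lstm"]))
```

Once everything is in memory, the access pattern no longer depends on whether the container format supports random access.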

@tesseract-ocr tesseract-ocr deleted a comment from stweil Jul 6, 2018
@ghost

ghost commented Jul 8, 2018

@stweil @amitdo @egorpugin have you tested zstd compression? I have tested it, and it's very fast. Also, if you add a dictionary to it, the compression ratio would be even better. I think it's a game changer.
https://github.com/facebook/zstd

Compressing with a zstd dictionary:

  • Create the dictionary
    zstd --train FullPathToTrainingSet/* -o dictionaryName

  • Compress with dictionary
    zstd -D dictionaryName FILE

  • Decompress with dictionary
    zstd -D dictionaryName --decompress FILE.zst

  • Increase dictionary size
    zstd --train dirSamples/* -o dictionaryName --maxdict=1024KB
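The effect of a shared dictionary can be illustrated without zstd. The sketch below swaps in a different technique, the preset-dictionary feature of DEFLATE in Python's `zlib` module, because zstd bindings are not assumed to be available. The dictionary is written by hand rather than trained, and the sample strings are made up for the illustration.

```python
import zlib


def compress_with_dict(data, dictionary):
    """Compress data with a preset dictionary (zlib, not zstd)."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, dictionary):
    """Decompress data that was compressed with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# Handwritten dictionary and sample (assumptions for this illustration).
dictionary = b"unicharset inttemp pffmtable normproto shapetable"
sample = b"normproto unicharset shapetable inttemp"

plain = zlib.compress(sample)
with_dict = compress_with_dict(sample, dictionary)
assert decompress_with_dict(with_dict, dictionary) == sample
print(len(plain), len(with_dict))
```

As with `zstd -D`, the same dictionary must be available for decompression, and the gain is largest for small inputs that share content with the dictionary.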

@stweil
Contributor Author

stweil commented Jul 8, 2018

I have not tested it yet, but it looks like we get Zstandard support with libarchive. Pull request libarchive/libarchive#905 added Zstandard there.

@zdenop
Contributor

zdenop commented Jul 8, 2018

AFAIR the intention was to use libraries that are already in use, i.e. not to increase the number of dependencies.
Building Tesseract with VS on Windows without cppan is already a pain...

@egorpugin
Contributor

egorpugin commented Jul 8, 2018

With the next libarchive release I'll add the zstd dependency to it in cppan, so Tesseract will get it automatically.
(libarchive is used inside cppan extensively.)

@stweil
Contributor Author

stweil commented Jul 8, 2018

Tesseract only needs to add a dependency on libarchive to get support for compressed archives.

@zdenop
Contributor

zdenop commented Jul 8, 2018

So if I understand it right, if we compress the datafiles with Zstandard, users on all platforms will need to compile libarchive + Zstandard...

@stweil
Contributor Author

stweil commented Jul 8, 2018

That's correct. Therefore I still would distribute the datafiles with zip format which hopefully has good support on all platforms. But users who need maximum performance then could repack their needed datafiles with a different compression standard.
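Repacking a datafile into a different container could look like the following sketch. It assumes the file is distributed as a zip archive and converts it into a tar archive with a selectable compression; the function name `repack_zip_as_tar` is hypothetical, and the sketch uses Python's standard library rather than any Tesseract tool.

```python
import io
import tarfile
import zipfile


def repack_zip_as_tar(zip_bytes, mode="w:xz"):
    """Copy every member of a zip archive into a tar written with `mode`.

    Use "w:" for an uncompressed tar, "w:gz", "w:bz2" or "w:xz" otherwise.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            tarfile.open(fileobj=out, mode=mode) as dst:
        for info in src.infolist():
            data = src.read(info)
            member = tarfile.TarInfo(info.filename)
            member.size = len(data)
            dst.addfile(member, io.BytesIO(data))
    return out.getvalue()


# Demonstration with an in-memory zip (illustrative member name).
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("eng.lstm", b"x" * 1000)

tar_bytes = repack_zip_as_tar(zip_buffer.getvalue(), mode="w:")
print(len(tar_bytes))
```

According to the timings earlier in this thread, an uncompressed tar loads about as fast as the original format, while xz gives the smallest files at the cost of slower loading.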

@Shreeshrii
Collaborator

A fast compressor/decompressor

https://github.com/google/snappy

@zdenop
Contributor

zdenop commented Feb 26, 2019

The milestone is set to 4.1.0. Is it time to merge it? There have not been a lot of changes here in the last months...

This requires libarchive-dev, libzip-dev or libminizip-dev.

Up to now, little endian tesseract works with the new format.
More work is needed for training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor Author

stweil commented Mar 5, 2019

Pull request #2290 now includes the implementation with libarchive, so this proof of concept is now obsolete and can be closed.

@stweil stweil closed this Mar 5, 2019
@ghost ghost removed the review label Mar 5, 2019
@amitdo amitdo added the RFC label Mar 21, 2021