RFC: Add initial support for traineddata files in compressed archive formats (don't merge) #911

Closed
wants to merge 8 commits

Conversation

stweil
Contributor

@stweil stweil commented May 12, 2017

This requires libminizip-dev, so expect failures from CI.

So far, little-endian Tesseract works with the new zip format.

More work is needed for the training tools, for big-endian support, and also to maintain
compatibility with the current proprietary format.

Signed-off-by: Stefan Weil sw@weilnetz.de

@stweil
Contributor Author

stweil commented May 12, 2017

Open questions:

  • Do we want a new format? As I'm not the first one who had the idea, I think the answer is yes.
  • Do we need support for both old (current) and new format? I'd drop support for the old format
    and remove combine_tessdata.
  • Should the traineddata files in the new format add .zip to the file names? I'd omit .zip.
  • Should the code for minizip be added to the Tesseract sources, or should we add an external dependency to libminizip-dev?
  • Which one is better, zip or compressed tar?

@Shreeshrii
Collaborator

Yes, there have been requests for more compact/compressed traineddata files.

Another question:

  • Should the new format be limited to tesseract 4.0 or also applied to 3.05?

@stweil
Contributor Author

stweil commented May 12, 2017

libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

@Shreeshrii
Collaborator

libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

Why not use a different compression library that is available on different operating systems as well as on older Ubuntu versions?

@stweil
Contributor Author

stweil commented May 12, 2017

On my Debian system I find these libraries: minizip (supported since 16.04), libzip (supported since 12.04), zzlib (supported since 12.04) and libarchive (supported since 12.04). As far as I know, all of them use licenses which are compatible with Tesseract. I assume any of them can be used, and I expect that none of them is available as a binary for Windows (maybe also not for macOS), but I did not check.

@stweil
Contributor Author

stweil commented May 12, 2017

The zip format reduces eng.traineddata from about 31 MiB to 16 MiB (48 % compression) at the default level. zip -9 improves the compression to 49 %. Other compressed formats achieve even better compression ratios (sizes in bytes):

31887360 eng.traineddata.tar
31873501 eng.traineddata
18121906 eng.traineddata.lz4
16461487 eng.traineddata.zip (default)
16372645 eng.traineddata.zip (maximum compression)
15193532 eng.traineddata.tar.bz2
13274164 eng.traineddata.tar.xz
13273173 eng.traineddata.7z

75100160 mya.traineddata.tar
75085274 mya.traineddata
42274775 mya.traineddata.lz4
39468033 mya.traineddata.tar.bz2
36296750 mya.traineddata.tar.gz
36075469 mya.traineddata.zip
28097639 mya.traineddata.7z
27937332 mya.traineddata.tar.xz

@zdenop
Contributor

zdenop commented May 12, 2017

Please move the discussion to the tesseract-dev forum. This is a significant change.

@egorpugin
Contributor

egorpugin commented May 12, 2017

What about lz4?


btw, libarchive handles all formats.
https://github.com/libarchive/libarchive

@stweil
Contributor Author

stweil commented May 12, 2017

Please move the discussion to the tesseract-dev forum. This is a significant change.

See this discussion in the forum. I added a link to GitHub there.

@stweil
Contributor Author

stweil commented May 12, 2017

libarchive handles all formats

It is also supported by current Linux distributions and would be interesting if compressed tar is preferred over zip. I added it to my previous post.

@egorpugin
Contributor

egorpugin commented May 12, 2017

libarchive has been supported since very early Ubuntu versions and in almost all other Linux distributions.
http://packages.ubuntu.com/search?suite=precise&searchon=names&keywords=libarchive

Personally, I'm using libarchive in cppan. The code for working with any format is very simple, see:
https://github.com/egorpugin/primitives/blob/master/pack/include/primitives/pack.h
https://github.com/egorpugin/primitives/blob/master/pack/src/pack.cpp

Of course, this is pack/unpack archive code, but streaming code should be pretty similar and simple too.
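
To illustrate the kind of code involved, here is a minimal read-loop sketch with the libarchive C API (an illustration only, not the code from cppan or from this PR; error handling is reduced to the bare minimum, and ReadArchive is a hypothetical helper name):

  // Sketch: iterate over the entries of a (possibly compressed) archive
  // and load each entry into a memory buffer.
  #include <archive.h>
  #include <archive_entry.h>
  #include <cstdio>
  #include <string>
  #include <vector>

  static bool ReadArchive(const char* filename) {
    struct archive* a = archive_read_new();
    archive_read_support_filter_all(a);   // gz, bz2, xz, lz4, ...
    archive_read_support_format_all(a);   // tar, zip, 7z, ...
    if (archive_read_open_filename(a, filename, 10240) != ARCHIVE_OK) {
      archive_read_free(a);
      return false;
    }
    struct archive_entry* entry;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
      std::string name = archive_entry_pathname(entry);
      std::vector<char> data(static_cast<size_t>(archive_entry_size(entry)));
      archive_read_data(a, data.data(), data.size());
      printf("%s: %zu bytes\n", name.c_str(), data.size());
    }
    archive_read_free(a);
    return true;
  }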

@stweil
Contributor Author

stweil commented May 12, 2017

What about lz4?

It does not compress very well (see the result added to the list above).

@egorpugin
Contributor

egorpugin commented May 12, 2017

Actually I meant lzma, which uses the .xz/.7z extensions.
Sorry. :)

@stweil
Contributor Author

stweil commented May 13, 2017

Rebased and added support for libzip.

@egorpugin
Contributor

What libraries are currently in use in your PR?
libarchive? minizip? libzip?
I see libarchive in the build scripts, but not in the code.
Maybe it is worth using only one implementation (library)? I don't like multiple implementations of the same thing.

@stweil
Contributor Author

stweil commented May 13, 2017

This is experimental code, as there is still no decision whether compressed archives should be supported at all, and if so, with which format and which library.

The current code uses libzip; if that is not found, it falls back to minizip. If neither of those is found, it uses the normal code. I prepared the code for further experiments, for example to support compressed tar archives with libarchive.
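
That fallback order could be expressed at compile time roughly like this (a hypothetical sketch; the HAVE_* and TESSDATA_USE_* macro names are placeholders, not necessarily the ones used in this PR):

  // Prefer libzip, fall back to minizip, otherwise use the plain
  // (uncompressed) traineddata reader. libarchive could be added later
  // as a further branch with even higher priority.
  #if defined(HAVE_LIBZIP)
  #  include <zip.h>
  #  define TESSDATA_USE_LIBZIP 1
  #elif defined(HAVE_MINIZIP)
  #  include <unzip.h>
  #  define TESSDATA_USE_MINIZIP 1
  #else
  #  define TESSDATA_USE_PLAIN 1
  #endif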

@stweil
Contributor Author

stweil commented May 13, 2017

As you can see here, the implementations for the two currently supported libraries are very similar.
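
For example, the libzip variant of such a reader boils down to something like this (a simplified sketch that assumes each tessdata component is stored as a named zip entry; it is not the exact code of this PR):

  // Sketch: read one named entry of a zip archive into a memory buffer.
  #include <zip.h>
  #include <vector>

  static bool ReadZipEntry(const char* archive_name, const char* entry_name,
                           std::vector<char>* buffer) {
    int err = 0;
    zip_t* archive = zip_open(archive_name, ZIP_RDONLY, &err);
    if (archive == nullptr) return false;
    zip_stat_t st;
    if (zip_stat(archive, entry_name, 0, &st) != 0) {
      zip_close(archive);
      return false;
    }
    buffer->resize(static_cast<size_t>(st.size));
    zip_file_t* file = zip_fopen(archive, entry_name, 0);
    bool ok = file != nullptr &&
              zip_fread(file, buffer->data(), buffer->size()) ==
                  static_cast<zip_int64_t>(buffer->size());
    if (file != nullptr) zip_fclose(file);
    zip_close(archive);
    return ok;
  }

The minizip variant follows the same pattern with unzLocateFile(), unzOpenCurrentFile() and unzReadCurrentFile().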

@stweil
Contributor Author

stweil commented May 13, 2017

The latest code also supports libarchive (highest priority). With that library, all kinds of compressed archives should work (so far I have only tested zip).

@stweil stweil changed the title from "RFC: Add initial support for traineddata files in zip format (don't merge)" to "RFC: Add initial support for traineddata files in zip and other compressed archive formats (don't merge)" May 13, 2017
@stweil
Contributor Author

stweil commented May 14, 2017

As libarchive indeed supports all formats, I could compare the time needed for each format. Tesseract was run 5 times on each format with English on a simple "hello world" text. Below are the results, sorted by time in seconds for each run. Interpretation:

  • The original Tesseract format, uncompressed tar and lz4 tar are similar and fastest.
  • zip needs about 150 ms more time than the original Tesseract format.
  • 7z and xz tar need about 850 ms more time than the original Tesseract format.
  • bz2 tar is slowest and needs about 1450 ms more time than the original Tesseract format.

File I/O from disk did not play a role in this test because of the Linux file cache and the SSD in my computer.

  0.13 eng.traineddata.tar
  0.14 eng.traineddata
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.15 eng.traineddata
  0.15 eng.traineddata.tar
  0.15 eng.traineddata.lz4
  0.16 eng.traineddata
  0.16 eng.traineddata
  0.17 eng.traineddata.lz4
  0.17 eng.traineddata.lz4
  0.18 eng.traineddata
  0.18 eng.traineddata.lz4
  0.22 eng.traineddata.lz4
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.97 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.99 eng.traineddata.7z
  0.99 eng.traineddata.tar.xz
  0.99 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.04 eng.traineddata.7z
  1.55 eng.traineddata.tar.bz2
  1.56 eng.traineddata.tar.bz2
  1.61 eng.traineddata.tar.bz2
  1.62 eng.traineddata.tar.bz2
  1.66 eng.traineddata.tar.bz2

@Shreeshrii
Collaborator

Shreeshrii commented May 14, 2017 via email

@stweil
Contributor Author

stweil commented May 14, 2017

Test results with libarchive for mya.traineddata (the largest of all traineddata files). I did not test lz4, but added a test with the gz format.

0.48 mya.traineddata.tar
0.49 mya.traineddata
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.50 mya.traineddata
0.52 mya.traineddata
0.52 mya.traineddata.tar
0.54 mya.traineddata
0.54 mya.traineddata
0.79 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.82 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.86 mya.traineddata.tar.gz
0.86 mya.traineddata.zip
0.88 mya.traineddata.tar.gz
0.90 mya.traineddata.zip
2.38 mya.traineddata.7z
2.38 mya.traineddata.7z
2.38 mya.traineddata.tar.xz
2.40 mya.traineddata.7z
2.41 mya.traineddata.tar.xz
2.45 mya.traineddata.7z
2.45 mya.traineddata.tar.xz
2.46 mya.traineddata.7z
2.46 mya.traineddata.tar.xz
2.49 mya.traineddata.tar.xz
3.69 mya.traineddata.tar.bz2
3.74 mya.traineddata.tar.bz2
3.75 mya.traineddata.tar.bz2
3.79 mya.traineddata.tar.bz2
3.84 mya.traineddata.tar.bz2

libzip gives similar results, but only supports the zip format:

0.83 mya.traineddata.zip
0.84 mya.traineddata.zip
0.87 mya.traineddata.zip
0.88 mya.traineddata.zip
0.93 mya.traineddata.zip

libminizip:

0.84 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.87 mya.traineddata.zip
0.92 mya.traineddata.zip

libzzip:

0.75 mya.traineddata.zip
0.78 mya.traineddata.zip
0.78 mya.traineddata.zip
0.79 mya.traineddata.zip
0.84 mya.traineddata.zip

@egorpugin
Contributor

Does lzma compress slower but better? Or does it also decompress slower?

@stweil
Contributor Author

stweil commented May 14, 2017

lzma created the xz files. 7zip and lzma gave the best compression ratios, but both also need some time for decompression (which is what matters for Tesseract): they need about 1.9 s more time (but are still faster than bz2).

Please note that the current code reads all parts of the tessdata file for all formats, no matter whether they are used or not, so the decompression overhead could still be reduced.
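
One way to reduce that overhead (sketched here with libarchive; ReadOneComponent is a hypothetical helper, not code from this PR, and assumes the archive handle has already been opened with archive_read_open_filename()) would be to decompress only the requested component and skip the bodies of all other entries:

  // Sketch: extract a single wanted entry and skip the rest. How cheap
  // the skipping is depends on the archive format and compression filter.
  #include <archive.h>
  #include <archive_entry.h>
  #include <cstring>
  #include <vector>

  static bool ReadOneComponent(struct archive* a, const char* wanted,
                               std::vector<char>* out) {
    struct archive_entry* entry;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
      if (std::strcmp(archive_entry_pathname(entry), wanted) == 0) {
        out->resize(static_cast<size_t>(archive_entry_size(entry)));
        return archive_read_data(a, out->data(), out->size()) ==
               static_cast<la_ssize_t>(out->size());
      }
      archive_read_data_skip(a);  // do not read the body of unneeded entries
    }
    return false;
  }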

@Shreeshrii
Collaborator

@theraysmith wrote on 4/18/14

I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.

and on 4/20/14

I spent some time looking at zlib. It doesn't seem to make it easy to randomly access named entities in a gzip file, unless I am missing something. The memory compress/uncompress functions are quite nice though.

For the next version it would be nice to:

  • Update tessdatamanager to cope with compressed components.
  • Eliminate fread/fscanf from file input code and allow everything to read from a memory buffer.

These can probably both be achieved with the TFile class that I added for 3.03.

This is a change in direction from my previous work with new classifier experiments, where I have been writing everything to use Serialize/DeSerialize and FILE streams, but this doesn't seem to be as portable as I had hoped, due to its reliance on fmemopen. It seems it would be better to make everything use memory buffers and push the file I/O responsibility out to TessDataManager/TFile, which could then just as easily deal with compressed files or in-memory data.

@stweil Do all the methods you tested support randomly accessing named entities?

@theraysmith Is there a particular reason for zip (with no tar)?

@stweil
Contributor Author

stweil commented May 15, 2017

The current Tesseract code reads the whole tessdata file into memory and gets all data from memory. My implementation for compressed archive files does that, too. Therefore random access is trivial: all component files are in a vector of byte arrays.
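
Conceptually that in-memory layout is just a list of (name, bytes) pairs, roughly like this (a simplified illustration, not the actual TessdataManager code; the struct and function names are hypothetical):

  // Sketch: every component is fully decompressed into memory once;
  // afterwards, random access is a simple lookup by name.
  #include <string>
  #include <vector>

  struct TessdataComponent {
    std::string name;        // component name inside the archive
    std::vector<char> data;  // decompressed bytes of that component
  };

  static const std::vector<char>* FindComponent(
      const std::vector<TessdataComponent>& components,
      const std::string& name) {
    for (const auto& c : components) {
      if (c.name == name) return &c.data;
    }
    return nullptr;
  }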

@tesseract-ocr tesseract-ocr deleted a comment from stweil Jul 6, 2018
@ghost

ghost commented Jul 8, 2018

@stweil @amitdo @egorpugin have you tested zstd compression? I have tested it, and it's very fast. Also, if you add a dictionary to it, the compression ratio is even better; I think it's a game changer.
https://github.com/facebook/zstd

Compressing with a zstd dictionary:

  • Create the dictionary
    zstd --train FullPathToTrainingSet/* -o dictionaryName

  • Compress with dictionary
    zstd -D dictionaryName FILE

  • Decompress with dictionary
    zstd -D dictionaryName --decompress FILE.zst

  • Increase dictionary size
    zstd --train dirSamples/* -o dictionaryName --maxdict=1024KB
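
For reference, the corresponding in-memory calls of the zstd C API are also simple; below is a minimal one-shot sketch without a dictionary (the ZSTD_compress_usingCDict()/ZSTD_decompress_usingDDict() functions add dictionary support); ZstdCompress and ZstdDecompress are hypothetical helper names:

  // Sketch: one-shot Zstandard compression and decompression in memory.
  #include <zstd.h>
  #include <vector>

  static std::vector<char> ZstdCompress(const std::vector<char>& src, int level) {
    std::vector<char> dst(ZSTD_compressBound(src.size()));
    size_t n = ZSTD_compress(dst.data(), dst.size(), src.data(), src.size(), level);
    dst.resize(ZSTD_isError(n) ? 0 : n);
    return dst;
  }

  static std::vector<char> ZstdDecompress(const std::vector<char>& src) {
    unsigned long long raw = ZSTD_getFrameContentSize(src.data(), src.size());
    if (raw == ZSTD_CONTENTSIZE_ERROR || raw == ZSTD_CONTENTSIZE_UNKNOWN) return {};
    std::vector<char> dst(static_cast<size_t>(raw));
    size_t n = ZSTD_decompress(dst.data(), dst.size(), src.data(), src.size());
    dst.resize(ZSTD_isError(n) ? 0 : n);
    return dst;
  }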

@stweil
Contributor Author

stweil commented Jul 8, 2018

I have not tested it yet, but it looks like we get Zstandard support with libarchive. Pull request libarchive/libarchive#905 added Zstandard there.

@zdenop
Contributor

zdenop commented Jul 8, 2018

AFAIR there was an intention to use already-used libraries, i.e. not to increase the number of dependencies.
Building tesseract with VS without cppan on Windows is already a pain...

@egorpugin
Contributor

egorpugin commented Jul 8, 2018

With the next libarchive release I'll add the zstd dependency to it in cppan, so tesseract will get it automatically.
(libarchive is used extensively inside cppan.)

@stweil
Contributor Author

stweil commented Jul 8, 2018

Tesseract only needs to add a dependency on libarchive to get support for compressed archives.

@zdenop
Contributor

zdenop commented Jul 8, 2018

So if I understand it right, if we compress the datafiles with Zstandard, users on all platforms will need to compile libarchive + Zstandard...

@stweil
Contributor Author

stweil commented Jul 8, 2018

That's correct. Therefore I would still distribute the datafiles in zip format, which hopefully has good support on all platforms. Users who need maximum performance could then repack the datafiles they need with a different compression standard.

@Shreeshrii
Collaborator

A fast compressor/decompressor

https://github.com/google/snappy

@zdenop
Contributor

zdenop commented Feb 26, 2019

Milestone is set to 4.1.0. Is it time to merge it? There have not been a lot of changes here in the last months...

This requires libarchive-dev, libzip-dev or libminizip-dev.

Up to now, little endian tesseract works with the new format.
More work is needed for training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor Author

stweil commented Mar 5, 2019

Pull request #2290 now includes the implementation with libarchive, so this proof of concept is now obsolete and can be closed.

@stweil stweil closed this Mar 5, 2019
@ghost ghost removed the review label Mar 5, 2019
@amitdo amitdo added the RFC label Mar 21, 2021