Compressed docset #138

Open
char101 opened this issue May 4, 2014 · 34 comments

Comments

@char101

char101 commented May 4, 2014

Hi,

Do you have any plans to support compressed docsets, like zipped archives? It would help a lot in reducing both the size and the disk read time.

@Kapeli
Contributor

Kapeli commented May 4, 2014

This is something I tried to add to Dash, so for what it's worth I'll share my progress. I'm writing this from memory, so sorry for any mistakes.

What seems to be needed is an indexed archive format which lets you extract individual files really fast.

Archive formats I've tried:

  1. Zip has the best index as far as I can tell, and extraction of individual files is really fast. The problem with zip is that the compression benefits are minimal. Zip seems to be really bad at archiving folders with a lot of small files; some docsets even get bigger when you compress them with zip.
  2. 7Zip has an index, but it sucks. As far as I can tell, when you ask 7Zip to unarchive an individual file, it searches through its entire index to find files that match. This takes a very long time for large docsets.
  3. Tar has no index at all.
  4. There is a way to index data inside a gzip-compressed file, using https://code.google.com/p/zran/, so what I've tried is to make my own archive format which appends all of the files into one huge file, then compresses it with gzip and indexes it (a rough sketch of this append-and-index idea is below). This works great, but unarchiving individual files sometimes takes a lot longer for some files than for others (in tests on my Mac, most files unarchived in 0.01s with this format, but some took 0.1-0.2s). I couldn't figure out why.

If anyone has any experience with archive formats, help would be appreciated 👍
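For what it's worth, here is a minimal Python sketch of the same append-and-index idea (not zran itself, which builds seek points inside a single gzip stream; this simpler variant compresses each file as its own gzip member, so the index gives direct random access at the cost of losing cross-file compression context; all paths are illustrative):

```python
import gzip
import json
import os

def build_archive(src_dir, archive_path, index_path):
    """Append one gzip member per file and record (offset, size) for each."""
    index = {}
    with open(archive_path, "wb") as out:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, src_dir)
                with open(path, "rb") as f:
                    blob = gzip.compress(f.read())
                index[rel] = [out.tell(), len(blob)]
                out.write(blob)
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_member(archive_path, index, rel_path):
    """Random access: seek straight to one member and decompress only it."""
    offset, size = index[rel_path]
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(size))
```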

@char101
Author

char101 commented May 5, 2014

Hi,

Thanks for your explanation on this.

Personally I don't think the docset size itself really matters, given the size of current hard disks. The problem I'm facing is the sheer number of small files: the on-disk size can be much larger than the data itself because each small file takes up at least one filesystem block (4 KB?), even when the file is only 1 KB. Also, moving the docsets around takes a very long time.
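To put a number on that overhead, here is a quick Python sketch (POSIX only, since it relies on st_blocks; the docset path is just an example) that compares logical file sizes with the space actually allocated on disk:

```python
import os

def disk_usage(root):
    """Compare logical file sizes with the space actually allocated on disk."""
    logical = allocated = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            logical += st.st_size
            allocated += st.st_blocks * 512  # st_blocks is in 512-byte units on POSIX
    return logical, allocated

logical, allocated = disk_usage(os.path.expanduser("~/.local/share/Zeal/Zeal/docsets"))
print(f"logical: {logical:,} bytes, allocated on disk: {allocated:,} bytes")
```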

I tried compressing several of the docsets with zip

Yii (sizes in bytes)

  Total size       15,270,914
  Size on disk     15,904,768
  Zipped (max)      1,828,455
  7z (PPMd)           604,276

J2SE (sizes in bytes)

  Total size      295,540,254
  Size on disk    318,820,352
  Zipped (max)     52,252,656
  7z (PPMd)        18,412,863

While zip does not compress as well as 7z with PPMd, it still achieves a pretty good compression ratio.

@Kapeli
Contributor

Kapeli commented May 5, 2014

Unfortunately, I have no interest in pursuing this other than for file size issues. The size of the current HDDs is not an issue, but the size of SSDs is.

There are also hosting and bandwidth issues, so size matters there as well, but I think one way around that would be to compress using one format (zip) and then recompress that with tgz or something else.

Zip does work for some docsets, but fails with others. I can't remember which. Sorry.

@char101
Author

char101 commented May 5, 2014

I understand your reasoning, but speaking of file size, having the docsets in zip format will surely still take up much less space for the user than the uncompressed files.

What do you think about distributing the docsets in 7z format (less distribution bandwidth, faster download times; why 7z, because AFAIK only 7z supports the PPMd algorithm, which gives the fastest and smallest compression for text files) and converting them into zip after the user downloads them? Hopefully this can be done without using a temporary file (longer lifetime for the SSD).
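As a sketch of what that client-side conversion could look like (assuming the third-party py7zr package for reading the downloaded 7z; file names are made up), the conversion can stay entirely in memory:

```python
import zipfile

import py7zr  # third-party package, assumed here for reading the 7z download

# Convert a downloaded 7z docset to zip without writing temporary files to disk.
with py7zr.SevenZipFile("docset.7z", mode="r") as sz:
    files = sz.readall()  # {name: BytesIO} held in memory

with zipfile.ZipFile("docset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data.read())
```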

@lobrien

lobrien commented Dec 6, 2014

We use a zip format for storing text documents in Mono's documentation tool. We use our own indexing code. If you're interested, I can point you to the specific code in the mono project.

@trollixx
Member

trollixx commented Feb 9, 2015

Adding some thoughts...

I am planning to add QCH (Qt Assistant format) and CHM support to Zeal at some point in the future. Both formats provide everything in a single file.

CHM files are compressed with the LZX algorithm.
QCH is just an SQLite database and does not provide any compression.

As a next step I'd like to evaluate a Zeal-specific format (most likely extended from QCH) which would provide some level of compression for the data. I am not sure how that would work out with the planned full-text search.

@char101
Author

char101 commented Feb 10, 2015

Compressing a single row in an SQLite database would be less effective since the compression dictionary would be limited to that single text, wouldn't it?

I think it's more practical to use zip as the archive format and embed the ToC and index as JSON files inside the zip. The full-text search index can be created when the documentation is first added. Converters can be written to go from CHM/QCH to the zip format. This would also keep the binary size smaller, since you would not have to embed the decoding libraries in Zeal. Users who want to create their own documentation can simply zip the HTML files and add them to Zeal.
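A rough Python sketch of that layout, using the standard zipfile module (the entry names and JSON structure are invented purely for illustration):

```python
import json
import zipfile

# Build a docset archive: HTML pages plus a toc.json embedded in the same zip.
with zipfile.ZipFile("example.docset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("Documents/functions.html", "<html>...</html>")
    zf.writestr("toc.json", json.dumps([
        {"name": "len", "type": "Function", "path": "Documents/functions.html"},
    ]))

# Reading a single page later is cheap random access: the zip central
# directory acts as the per-file index, so nothing else is decompressed.
with zipfile.ZipFile("example.docset.zip") as zf:
    toc = json.loads(zf.read("toc.json"))
    page = zf.read(toc[0]["path"])
```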

@zjzdy

zjzdy commented Mar 28, 2015

@Kapeli Could you try testing lrzip?
http://ck.kolivas.org/apps/lrzip/
You can use -l or -Ul (LZO) for comparison if you want fast decompression.
Benchmarks: http://ck.kolivas.org/apps/lrzip/README.benchmarks

@Kapeli
Contributor

Kapeli commented Mar 29, 2015

A lot has changed since I last posted in this issue. I forgot it even exists. Sorry!

Anyways, Dash for iOS supports archived docsets right now. Dash for OS X will get support for archived docsets in a future update too. Archived docsets are only supported for my "official" docsets (i.e. the ones at https://kapeli.com/docset_links) and for user contributed docsets. This is enough, as these are the docsets that can be quite large, others are not really an issue.

I still use tgz for the archived docset format; the only difference is that I compress the docsets using tarix, which has proven to be very reliable.

Performance-wise, it takes about 5-10 times longer to read a file from archive than it takes to read it directly from disk. Directly from disk on my Mac it takes up to 0.001s for the larger doc pages, while from an archived docset it takes up to 0.01s.

Despite that, there's no noticeable impact, as when a page is loaded the actual reading of the files takes very little time compared to the loading of the WebView, the DOM and so on (the WebView takes up about 90% of the load time).
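For reference, extracting a single page from a tgz without unpacking the whole archive looks roughly like this in Python; note that this plain tarfile version still scans the compressed stream sequentially to locate the member, which is exactly the cost a tarix-style index removes (paths are made up):

```python
import tarfile

# Read one file out of a .tgz without extracting anything to disk.
# Without an external index, locating the member requires a sequential
# scan of the gzip stream; tarix stores per-member offsets to skip it.
with tarfile.open("Example.docset.tgz", "r:gz") as tar:
    member = tar.getmember("Example.docset/Contents/Resources/Documents/index.html")
    data = tar.extractfile(member).read()
```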

@zjzdy

zjzdy commented Mar 29, 2015

@Kapeli @trollixx So Dash doesn't need to fully decompress the tgz file? (I'm not sure if that's what you meant.) But Zeal still needs to fully decompress the tgz file, so I think you see what I'm getting at. :)
I looked at the tarix project; it's a good project, but it seems it hasn't been updated in a long time.

@Kapeli
Contributor

Kapeli commented Mar 29, 2015

Dash does not need to decompress the tgz file anymore, no.

@trollixx
Member

Sounds interesting. I'll look into handling of tarix indices to eliminate docset unpacking. I haven't heard about tarix before.

@zjzdy

zjzdy commented Apr 1, 2015

Kind reminder: you could extract the index file in advance, because accessing the index file is I/O intensive.

@RJVB

RJVB commented Feb 4, 2017

In the meantime, Mac users can use HFS compression, and Linux users can put their docset folder on a filesystem with transparent compression like btrfs or ZFS.

@reclaimed

About bundling, in numbers:

  • a VHD container (NTFS, compression enabled) holding the docsets is 19 GB
  • the ~700 thousand files inside the VHD have a total size of ~9 GB

It seems about 10 GB is spent on file tables, attributes, etc.

I think bundling (compressed or not) is a must.

@RJVB

RJVB commented Mar 31, 2017 via email

@char101
Author

char101 commented Mar 31, 2017

> That many files will almost unavoidably lead to disk space overhead ("waste") because chances are slim that the majority will be an exact multiple of the disk block size (4096 for most modern disks). Not to mention the free-space fragmentation they can cause.

I think what you meant was the filesystem block. A disk block (sector) is only used for addressing, while a single file cannot occupy less than a filesystem block.

@livelazily

How about using dar to store and compress docsets?

@RJVB

RJVB commented Nov 28, 2018

If the goal is not to preserve the docset bundle "as is", couldn't you use a lightweight key/value database engine like LMDB? File names (or paths) would be the keys, and then you can use whatever compression gives the desired cost/benefit trade-off to store the values (i.e. the file contents). I've used this approach (with LZ4 compression) to replace a file-based data cache in my personal KDevelop fork, and it works quite nicely (with an API that mimics the file I/O API). This gives me 2 files on disk instead of thousands, which is evidently a lot more efficient.

FWIW my docset collection is over 3Gb before HFS compression, just over 1Gb after. I have enough diskspace not to compress, but that doesn't mean I spit on saving 2Gb. "There are no small economies" as they say in France, and following that guideline is probably why I still have lots of free disk space.
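A minimal sketch of that LMDB-plus-LZ4 approach using the Python lmdb and lz4 bindings (the database name and paths are illustrative; Zeal itself would presumably use the C APIs):

```python
import lmdb
import lz4.frame

# One memory-mapped environment on disk (a data file plus a lock file).
env = lmdb.open("docset.lmdb", map_size=2**30)

def put_file(rel_path, content):
    """Store a file's content under its relative path, LZ4-compressed."""
    with env.begin(write=True) as txn:
        txn.put(rel_path.encode(), lz4.frame.compress(content))

def get_file(rel_path):
    """Return the decompressed content, or None if the path is unknown."""
    with env.begin() as txn:
        blob = txn.get(rel_path.encode())
        return None if blob is None else lz4.frame.decompress(blob)

put_file("Documents/index.html", b"<html>...</html>")
print(get_file("Documents/index.html"))
```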

@trollixx
Member

trollixx commented Dec 2, 2018

SQLite with LZ4 or zstd for blob compression is what I have in mind. There are also some larger goals that I hope to achieve with moving to the new docset format, such as embedded metadata, ToC support, etc.
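Nothing here is decided, but a minimal sketch of the SQLite-with-compressed-blobs idea, using the python-zstandard bindings purely for illustration (the table layout and names are hypothetical):

```python
import sqlite3

import zstandard as zstd

con = sqlite3.connect("docset.db")
con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")

cctx = zstd.ZstdCompressor(level=19)
dctx = zstd.ZstdDecompressor()

def store(path, content):
    """Insert or replace a file, storing its zstd-compressed bytes as a blob."""
    con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                (path, cctx.compress(content)))
    con.commit()

def load(path):
    """Fetch and decompress a single file by path."""
    row = con.execute("SELECT data FROM files WHERE path = ?", (path,)).fetchone()
    return dctx.decompress(row[0]) if row else None

store("Documents/index.html", b"<html>...</html>")
print(load("Documents/index.html"))
```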

@char101
Author

char101 commented Dec 2, 2018

Zstandard supports precomputed dictionaries, which should be beneficial for compressing a lot of small files.

@RJVB

RJVB commented Dec 2, 2018 via email

@char101
Author

char101 commented Dec 3, 2018

> I think that argument is largely moot when you combine files in a single compressed file

When storing the files in a key-value database or SQLite, each file is compressed independently, which is why a precomputed dictionary would improve the compression significantly; it also avoids effectively duplicating the same dictionary data in every row.
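A short sketch of that shared-dictionary approach with python-zstandard (the docset path is hypothetical, and dictionary training needs a reasonably large set of sample files to be worthwhile):

```python
import glob
from pathlib import Path

import zstandard as zstd

# Any folder of small, similar HTML files will do; this path is made up.
pages = [Path(p).read_bytes()
         for p in glob.glob("docset/Documents/**/*.html", recursive=True)]

# Train one dictionary per docset, then reuse it for every row so each small
# file compresses as if it had the whole corpus as shared context.
dictionary = zstd.train_dictionary(112_640, pages)  # ~110 KB dictionary
cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
dctx = zstd.ZstdDecompressor(dict_data=dictionary)

compressed = [cctx.compress(page) for page in pages]
assert dctx.decompress(compressed[0]) == pages[0]

# The dictionary itself (dictionary.as_bytes()) must be stored once alongside
# the database so readers can rebuild it for decompression.
```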

@trollixx
Member

trollixx commented Dec 3, 2018

Using one dictionary per docset is an interesting idea, definitely worth benchmarking.

Regarding LZ4 and zstd, I just mentioned these two as an example, nothing has been decided so far.

@fearofshorts

Just wanted to say that I feel this is the number one issue with Zeal and it should be given much higher priority. Tens to hundreds of thousands of files mean that any time I perform a large disk I/O task on any of my systems, it gets choked on the Zeal docsets.

If I try to use a large-directory-finding program (like WinDirStat or KDirStat) I have to wait as Zeal's docsets take up roughly a third to a half of the total search time. Making backups or copies of my home directory takes ages because the overhead of reading each of these files is huge. I bet the search cache must be much larger and slower on each of my systems because of having to index all of Zeal's docsets.

Even Doom (a very, very early example of a game we have the source code to) solved this problem back in the day. Almost all of the game's data is stored in a few "WAD" files (standing for "Where's All the Data?"). If users want to play user-made mods or back up their game data, they just need to copy and paste a WAD file.

Sorry if this comes across as bitching or complaining; I'm just trying to express how much this issue matters to me (and presumably many other users). I'm going to try to dust off my programming skills and work on this too.

@coding-moding

coding-moding commented Jul 3, 2020

Here is a workaround for this issue:

```bash
#!/bin/bash
cd /home/user/.local/share/Zeal/Zeal
mkdir -p mnt mnt/lowerdir mnt/upperdir mnt/workdir
#sudo mount docsets.sqsh mnt/lowerdir -t squashfs -o loop
#sudo mount -t overlay -o lowerdir=mnt/lowerdir,upperdir=mnt/upperdir,workdir=mnt/workdir overlay docsets
mount mnt/lowerdir # squashfs image, via the user-mountable fstab entry below
mount docsets      # writable overlay on top of the read-only image, via fstab
/usr/bin/zeal "$@"
umount docsets
umount mnt/lowerdir

#prepare:
#mksquashfs docsets docsets.sqsh

#fstab:
#/home/user/.local/share/Zeal/Zeal/docsets.sqsh /home/user/.local/share/Zeal/Zeal/mnt/lowerdir squashfs user,loop,ro 0 0
#/dev/loop0 /home/user/.local/share/Zeal/Zeal/mnt/lowerdir squashfs user,loop,ro 0 0
#overlay /home/user/.local/share/Zeal/Zeal/docsets overlay noauto,lowerdir=/home/user/.local/share/Zeal/Zeal/mnt/lowerdir,upperdir=/home/user/.local/share/Zeal/Zeal/mnt/upperdir,workdir=/home/user/.local/share/Zeal/Zeal/mnt/workdir,user 0 0

#name the file /usr/local/bin/zeal to have higher priority over /usr/bin/zeal
#so docsets.sqsh will be mounted before running zeal and unmounted after it exits
```

@RJVB

RJVB commented Jul 3, 2020 via email

@coding-moding

coding-moding commented Jul 3, 2020

Nope, you completely missed the point. Look at the year:

> char101 commented on 4 May 2014

and there still wasn't a solution, just speculation like yours.

@ashtonian

Any progress or formal thoughts on this, or where to start? Text compression would be 🔥. Docsets are currently using a third of my expensive 256 GB Mac NVMe.

@char101
Author

char101 commented Sep 16, 2020

Maybe you can zip your data directory, mount it using fuse, then put an overlay filesystem over it to allow for modifications.

@RJVB

RJVB commented Sep 16, 2020 via email

@LawrenceJGD

LawrenceJGD commented Jul 24, 2021

Currently I don't use Zeal because the docsets take up a lot of hard disk space and I don't have much space available. However, I've seen a format that's used to archive web pages called WARC (Web ARChive); it supports compression and indexing. Here I leave some links with information about the format:

@macsunmood

Any updates? Will this feature be considered?

@tophf

tophf commented Apr 16, 2023

FWIW, there's a multi-platform kit, PFM, with which a docset can be transparently turned into a private compressed virtual file system, so hopefully there would be no need to modify the existing code much, at least conceptually.
