I'd suggest a lower bound of 64 bytes and an upper bound of 1 Mbyte [for uncompressed chunks].
There are options ZCK_CHUNK_MIN and ZCK_CHUNK_MAX that can be set, but the defaults are 1 byte for the minimum and 10 MB for the maximum.
Things which need to be implemented:
Basic design goals
Manual chunking
Split string (which goes to the NEXT chunk)
Split string (which goes to the CURRENT chunk, not the next one)
calculate optimal chunk size (combine / split created chunks).
Maybe we need a 2-pass algorithm, which would first try to chunk the input at the given string; then, if a chunk is > avg*4 || > ZCK_CHUNK_MAX, split it, and if a chunk is < avg/4 || < ZCK_CHUNK_MIN, merge it with the next one.
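The 2-pass idea above could look roughly like this. Everything here is a sketch: the function name, the assumed ZCK_CHUNK_MIN/ZCK_CHUNK_MAX defaults (taken from the bounds suggested earlier in this thread), and the choice to cut oversized chunks at a fixed point are all illustrative, not zchunk's actual implementation.

```python
# Sketch of the proposed 2-pass chunking (hypothetical; not the zck API).
# Pass 1: split the input at every occurrence of the separator string.
# Pass 2: split chunks that are > avg*4 or > ZCK_CHUNK_MAX, and merge chunks
#         that are < avg/4 or < ZCK_CHUNK_MIN into the following chunk.

ZCK_CHUNK_MIN = 64            # assumed defaults for illustration
ZCK_CHUNK_MAX = 1024 * 1024

def two_pass_chunk(data: bytes, sep: bytes):
    # Pass 1: naive split, keeping the separator at the start of the NEXT chunk
    chunks, start = [], 0
    pos = data.find(sep, 1)
    while pos != -1:
        chunks.append(data[start:pos])
        start = pos
        pos = data.find(sep, start + 1)
    chunks.append(data[start:])

    avg = len(data) / max(len(chunks), 1)
    # Cut point for oversized chunks; guaranteed smaller than any chunk
    # that triggers the split condition, so the loop always terminates.
    cut = max(1, min(ZCK_CHUNK_MAX, int(avg * 4)))

    # Pass 2a: split oversized chunks
    fixed = []
    for c in chunks:
        while len(c) > avg * 4 or len(c) > ZCK_CHUNK_MAX:
            fixed.append(c[:cut])
            c = c[cut:]
        fixed.append(c)

    # Pass 2b: merge undersized chunks into the next one
    merged, carry = [], b""
    for c in fixed:
        c = carry + c
        if len(c) < avg / 4 or len(c) < ZCK_CHUNK_MIN:
            carry = c
        else:
            merged.append(c)
            carry = b""
    if carry:
        merged.append(carry)
    return merged
```

Note that merging a leftover carry into a neighbour can push a chunk slightly past the average, but never past ZCK_CHUNK_MAX by more than ZCK_CHUNK_MIN bytes.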
automatic splitting using buzhash
calculate optimal chunk size.
If it is text, try chunking at line breaks with ZCK_CHUNK_MIN in mind.
For all other files, create a 2-pass-algorithm similar to above algorithm.
Some intelligent separators for common file formats.
If ZCK_CHUNK_MIN and/or _MAX are not set, choose from a standard config.
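To illustrate the buzhash-based automatic splitting, here is a sketch of content-defined chunking with a buzhash-style rolling hash that honors ZCK_CHUNK_MIN and ZCK_CHUNK_MAX. The table, window size, and mask are arbitrary choices for this sketch, not zchunk's actual parameters.

```python
import random

# Build a fixed random byte->32-bit table for the rolling hash.
random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]
WINDOW = 48            # bytes in the rolling window
MASK = (1 << 13) - 1   # ~8 KiB average chunk size

def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def buzhash_chunks(data, min_size=64, max_size=1024 * 1024):
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        pos = i - start
        h = rol32(h, 1) ^ TABLE[data[i]]
        if pos >= WINDOW:
            # Remove the byte that just left the window; it has been
            # rotated WINDOW times since it was added.
            h ^= rol32(TABLE[data[i - WINDOW]], WINDOW % 32)
        size = pos + 1
        # Cut when the hash hits the mask (but never below min_size),
        # or force a cut at max_size.
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because cut points depend only on the content inside the window, inserting or removing bytes early in a file shifts only nearby boundaries; later chunks resynchronize, which is exactly the property that makes unchanged chunks reusable across file versions.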
What is a good default separator? We should make even binary formats more likely to produce the same chunks across different versions.
About ZCK_CHUNK_MIN / ZCK_CHUNK_MAX
About ZCK_CHUNK_MIN: The HTTP(2) headers alone take some bytes. Connection overhead + request headers + response headers can easily add up to 1024 bytes, i.e. 1 KiB. Given that, I'd set ZCK_CHUNK_MIN to at least 5 KiB, better 10 KiB, because it does not make sense to spend 1 KiB of connection overhead to receive 1 byte of usable data.
I think ZCK_CHUNK_MAX depends on the file format and the target audience. If you are downloading ISOs (say you have ubuntu-18.04.iso and want to download ubuntu-18.04.2, both about 1.9 GB), you might want a different ZCK_CHUNK_MAX than for a 6 to 40 MiB repository metadata file. But: the bigger the chunk, the less likely it is to hash the same across versions, especially for binary formats.
This new interface method is needed:
Create chunks from an input file, compress them, and write a header.
Or in short: Create a zchunk file.
The same rules as in zchunk/zchunk#4 should apply.
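The requested method (chunk, compress, write a header) can be modeled as below. This only illustrates the idea: the function name, the JSON header, and the use of zlib are stand-ins for this sketch, and the real zchunk header format and API differ.

```python
import hashlib
import json
import zlib

def create_zchunk(data: bytes, chunker) -> bytes:
    """Chunk `data`, compress each chunk independently, and prepend a
    header listing each chunk's checksum and compressed size.
    `chunker` is any function mapping bytes -> list of byte chunks."""
    chunks = chunker(data)
    compressed = [zlib.compress(c) for c in chunks]
    header = {
        "chunks": [
            {"sha256": hashlib.sha256(c).hexdigest(), "size": len(z)}
            for c, z in zip(chunks, compressed)
        ]
    }
    head = json.dumps(header).encode()
    # Length-prefix the header so a reader can locate the chunk data.
    return len(head).to_bytes(4, "big") + head + b"".join(compressed)
```

Because every chunk is compressed independently and its checksum is in the header, a client that already has some chunks can verify them locally and fetch only the compressed byte ranges for the chunks it is missing.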
Default properties file / fileformats.chunking.yml
We could create a file for predefined file formats. Something like this: https://gist.github.com/bmhm/18b57655e0c0c8a5a38d6cdf487866e4
This file could easily be extended for other file formats.
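Such a file might look like the following. The layout and field names here are invented for illustration; see the linked gist for the actual proposal.

```yaml
# fileformats.chunking.yml — hypothetical layout, not the gist's actual content
text/plain:
  separator: "\n"
  chunk_min: 5KiB
  chunk_max: 100KiB
application/x-iso9660-image:
  separator: null        # no useful separator; fall back to buzhash
  chunk_min: 64KiB
  chunk_max: 10MiB
```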