Create chunking algorithm #2

Open
11 tasks
bmarwell opened this issue May 20, 2019 · 0 comments
bmarwell commented May 20, 2019

This new interface method is needed:

Create chunks from an input file, compress them, and write a header.
Or in short: create a zchunk file.
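A minimal sketch of what such an interface method could look like. All names here (Chunker, FixedSizeChunker, chunk) are hypothetical and not part of any existing API; the fixed-size implementation only illustrates the contract, not the intended content-defined chunking.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunking interface sketch. */
interface Chunker {
    /** Splits the input stream into raw (uncompressed) chunks. */
    List<byte[]> chunk(InputStream input) throws IOException;
}

/** Naive reference implementation: fixed-size chunks, for illustration only. */
class FixedSizeChunker implements Chunker {
    private final int chunkSize;

    FixedSizeChunker(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public List<byte[]> chunk(InputStream input) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] chunk;
        // readNBytes blocks until chunkSize bytes are read or EOF is reached.
        while ((chunk = input.readNBytes(chunkSize)).length > 0) {
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

A real implementation would then compress each returned chunk and prepend the zchunk header.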

The same rules as in zchunk/zchunk#4 should apply.

I'd suggest a lower bound of 64 bytes and an upper bound of 1 Mbyte [for uncompressed chunks].

there are options ZCK_CHUNK_MIN and ZCK_CHUNK_MAX that can be set, but the defaults are 1 byte for the min and 10MB for the max.

Things which need to be implemented:

Basic design goals

  • Manual chunking
    • Split string (which goes to the next chunk)
    • Split string (which goes to the current chunk, not the next one)
    • Calculate the optimal chunk size (combine / split created chunks).
      We may need a 2-pass algorithm: first chunk at the given split string; then, if a chunk is > avg*4 || > ZCK_CHUNK_MAX, split it, and if a chunk is < avg/4 || < ZCK_CHUNK_MIN, merge it with the next one.
  • Automatic splitting using buzhash
    • Calculate the optimal chunk size.
      • If the input is text, try chunking at line breaks, keeping ZCK_CHUNK_MIN in mind.
      • For all other files, use a 2-pass algorithm similar to the one above.
    • Some intelligent separators for common file formats.
    • If ZCK_CHUNK_MIN and/or ZCK_CHUNK_MAX are not set, choose them from a standard config.
    • What is a good default separator? Even binary formats should be likely to produce identical chunks across different versions.
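The second pass described above (merge undersized chunks into their successor, split oversized ones) can be sketched on chunk sizes alone. The class name and the ZCK_CHUNK_MIN/MAX default values below are assumptions for illustration; a real implementation would move the actual bytes, not just sizes.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the proposed 2-pass size normalization (second pass only):
 * chunks smaller than avg/4 or ZCK_CHUNK_MIN are merged into their
 * successor; chunks larger than avg*4 or ZCK_CHUNK_MAX are split.
 */
class ChunkSizeNormalizer {
    static final int ZCK_CHUNK_MIN = 5 * 1024;     // assumed default (5 KiB)
    static final int ZCK_CHUNK_MAX = 1024 * 1024;  // assumed default (1 MiB)

    static List<Integer> normalize(List<Integer> sizes) {
        long total = 0;
        for (int s : sizes) {
            total += s;
        }
        long avg = sizes.isEmpty() ? 0 : total / sizes.size();
        // Upper bound for a single chunk after splitting.
        long splitAt = Math.min(ZCK_CHUNK_MAX, Math.max(ZCK_CHUNK_MIN, avg * 4));

        List<Integer> out = new ArrayList<>();
        int carry = 0; // bytes accumulated from undersized predecessors
        for (int i = 0; i < sizes.size(); i++) {
            int s = carry + sizes.get(i);
            carry = 0;
            boolean last = (i == sizes.size() - 1);
            if (!last && (s < avg / 4 || s < ZCK_CHUNK_MIN)) {
                carry = s; // too small: merge into the next chunk
                continue;
            }
            while (s > splitAt) { // too big: split into bounded pieces
                out.add((int) splitAt);
                s -= splitAt;
            }
            out.add(s);
        }
        return out;
    }
}
```

For example, two tiny chunks followed by a 2 MB chunk would be merged forward and the oversized result split at the 1 MiB cap.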

About ZCK_CHUNK_MIN / ZCK_CHUNK_MAX

About ZCK_CHUNK_MIN: the HTTP(2) headers alone take some bytes; connection overhead plus request and response headers can easily reach 1024 bytes, i.e. 1 KiB. I would therefore set ZCK_CHUNK_MIN to at least 5 KiB, better 10 KiB, because it makes no sense to spend 1 KiB of connection overhead to receive 1 byte of usable data.

I think ZCK_CHUNK_MAX depends on the file format and the target audience. If you are downloading ISOs (say you already have ubuntu-18.04.iso and want to download ubuntu-18.04.2, both about 1.9 GB), you will want a different ZCK_CHUNK_MAX than for a 6 to 40 MiB repository metadata file. But: the bigger the chunk, the less likely it is to hash identically across versions, especially for binary formats.

Default properties file / fileformats.chunking.yml

We could create a file for predefined file formats. Something like this: https://gist.github.com/bmhm/18b57655e0c0c8a5a38d6cdf487866e4

This file could easily be extended for other file formats.
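For illustration, such a file could look roughly like the following. This is a hypothetical sketch; the actual layout in the linked gist may differ, and all keys and values here are assumptions.

```yaml
# Hypothetical sketch of fileformats.chunking.yml
formats:
  - name: text
    extensions: [txt, log]
    separator: "\n"       # chunk at line breaks
    chunkMin: 5KiB
    chunkMax: 1MiB
  - name: xml
    extensions: [xml, pom]
    separator: "<"        # chunk at element starts
    chunkMin: 5KiB
    chunkMax: 1MiB
```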

@bmarwell bmarwell added this to the 1.0.0 milestone May 20, 2019