This tool estimates the amount of chunk-level deduplication available in a collection of Parquet files. All files are chunked together, so if a file has multiple versions, provide all of them at once to see how much space a CDC-based content-addressable storage system could save.
It is primarily designed to evaluate the deduplication effectiveness of a content-defined chunking feature for Apache Parquet.
Content Defined Chunking (CDC) is a technique used to divide data into variable-sized chunks based on the content of the data itself, rather than fixed-size boundaries. This method is particularly effective for deduplication because it ensures that identical data segments are consistently chunked in the same way, even if their positions within the dataset change. By using CDC, systems can more accurately identify and eliminate duplicate data, leading to significant storage savings.
Apache Parquet is a binary, columnar storage format widely used in the data engineering community; its layout allows for efficient reads and writes. In the context of Apache Parquet, CDC can improve the efficiency of data storage and retrieval by producing chunks that are robust to data modifications such as updates, inserts, and deletes.
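To make the idea concrete, here is a minimal sketch of content-defined chunking using a gear-style rolling hash. It is illustrative only; the actual Parquet implementation uses its own gearhash table, normalization, and minimum/maximum chunk-size constraints.

```python
import random

# Per-byte random lookup table for the gear hash (fixed seed for reproducibility).
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 16) - 1  # ~64 KiB average chunk size

def cdc_chunks(data: bytes, min_size=8 * 1024, max_size=256 * 1024):
    """Yield (start, end) chunk boundaries chosen by content, not by position."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        # Cut when the rolling hash matches the mask (content-defined),
        # or force a cut when the chunk grows past max_size.
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)
```

Because a boundary depends only on the bytes near it, inserting or deleting data shifts at most the neighbouring chunks, while identical regions elsewhere keep producing identical chunks.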
The implementation has not yet been merged upstream and is currently available on the following branch: content-defined-chunking.
This implementation introduces a gearhash-based chunker for each leaf column within the `ColumnWriter` objects, which persist throughout the lifecycle of the Parquet file writer. As new Arrow arrays are appended to the column writers, the record-shredded arrays (`def_levels`, `rep_levels`, `leaf_array`) are processed by the chunker to identify chunk boundaries.
Unlike the traditional approach, where a page is closed once its size reaches the default limit (typically 1MB), this implementation splits pages at the chunk boundaries identified by the chunker. The resulting pages have variable sizes but are far more resilient to data modifications such as updates, inserts, and deletes, because identical data segments are chunked the same way regardless of where they appear in the file.
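A simplified sketch of this page-splitting idea is shown below; the class and method names (`CdcColumnWriter`, `is_boundary`, `write_page`) are illustrative stand-ins, not the actual `ColumnWriter` API.

```python
class CdcColumnWriter:
    """Toy column writer that flushes pages at content-defined boundaries."""

    def __init__(self, chunker, write_page):
        self.chunker = chunker        # persists for the lifetime of the file writer
        self.write_page = write_page  # callback that serializes one page
        self.buffer = []              # pending (def_level, rep_level, value) triples

    def write_batch(self, def_levels, rep_levels, values):
        for triple in zip(def_levels, rep_levels, values):
            self.buffer.append(triple)
            # The chunker hashes the shredded data and flags chunk boundaries;
            # each boundary closes the current page instead of a ~1 MB size limit.
            if self.chunker.is_boundary(triple):
                self.write_page(self.buffer)  # variable-sized, content-aligned page
                self.buffer = []

    def close(self):
        if self.buffer:
            self.write_page(self.buffer)
```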
Compile:
maturin develop -r
This will install a command-line tool called `de`.
Run the deduplication estimator on two or more files:
de dedup a.parquet b.parquet
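Conceptually, the estimate chunks every input file with the same CDC function, hashes the chunks, and compares the total bytes against the bytes of unique chunks. A rough sketch of that measurement, reusing the `cdc_chunks` sketch above (the real tool uses its own chunker and hashing):

```python
import hashlib

def estimate_dedup(paths):
    """Return unique-chunk bytes divided by total bytes; lower means more savings."""
    total, unique, seen = 0, 0, set()
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        for start, end in cdc_chunks(data):  # CDC sketch from above
            chunk = data[start:end]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:  # count each distinct chunk only once
                seen.add(digest)
                unique += len(chunk)
    return unique / total

# e.g. estimate_dedup(["a.parquet", "b.parquet"])
```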
Generate synthetic data and visualize deduplication:
de synthetic -s 1 -e 1 '{"a": "int", "b": "str", "c": ["float"]}'
Check out all revisions of a file within a git repository:
https://huggingface.co/datasets/cfahlgren1/hub-stats/blob/main/datasets.parquet
❯ de revisions -d /tmp/datasets ~/Datasets/hub-stats/datasets.parquet
datasets.parquet has 194 revisions
Checking out 50a6ff0
Checking out 1ae78e6
Checking out 03613ea
...
Generate deduplication statistics for a directory of parquet files:
❯ de stats --with-json /tmp/datasets
100%|██████████████████████████████████████████| 194/194 [01:18<00:00, 2.46it/s]
Writing CDC Parquet files
100%|██████████████████████████████████████████| 194/194 [00:12<00:00, 16.01it/s]
100%|██████████████████████████████████████████| 194/194 [00:09<00:00, 19.48it/s]
Estimating deduplication for JSONLines
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
Title | Total Size | Chunk Size | Compressed Chunk Size | Deduplication Ratio | Compressed Deduplication Ratio |
---|---|---|---|---|---|
JSONLines | 93.0 GiB | 64.9 GiB | 12.4 GiB | 70% | 13% |
Parquet | 16.2 GiB | 15.0 GiB | 13.4 GiB | 93% | 83% |
CDC Snappy | 16.2 GiB | 8.3 GiB | 7.9 GiB | 51% | 48% |
CDC ZSTD | 8.9 GiB | 5.5 GiB | 5.5 GiB | 61% | 61% |
The deduplication ratio is the unique chunk size divided by the total size (the compressed variant uses the compressed chunk size), so lower means more deduplication. The tool also generates a plot comparing the results:
The experiment can be reproduced by running the following command:
❯ de synthetic -s $S -e $E '{"a": "int"}'
This will generate `$S` million (2^20) records of schema `<a: int>` with variants including `$E` edits (an inserted/deleted/updated block of 10 records) evenly distributed in the table, and another variant with 5% of the records appended to the end. The resulting Parquet files have `$S` row groups.
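For illustration, the kind of base/variant pair being compared could be produced with plain `pyarrow` as below; this is not the tool's actual generator, which also creates the inserted/deleted variants, the 5% append variant, and the JSONLines/SQLite outputs.

```python
import pyarrow as pa
import pyarrow.parquet as pq

S = 1
n = S * 2**20  # S "million" (2^20) records

# Base table with a single int64 column.
base = pa.table({"a": pa.array(range(n), type=pa.int64())})

# Variant: update a block of 10 records in the middle of the table.
values = base.column("a").to_pylist()
mid = n // 2
values[mid:mid + 10] = [-1] * 10
updated = pa.table({"a": pa.array(values, type=pa.int64())})

# One row group per 2^20 rows, i.e. S row groups per file.
pq.write_table(base, "base.parquet", row_group_size=2**20)
pq.write_table(updated, "updated.parquet", row_group_size=2**20)
```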
The tool also generates .jsonlines files for comparison, since content-defined chunking is well suited to row-major formats.
The experiment can be reproduced by running the following command:
❯ de synthetic -s 1 -e 1 '{"a": "int"}'
❯ de synthetic -s 1 -e 2 '{"a": "int"}'
[Figures omitted: comparison plots for each compression codec (None, Snappy, ZSTD) across Vanilla Parquet, CDC Parquet, JSONLines, and SQLite.]
The experiment can be reproduced by running the following command:
❯ de synthetic -s 4 -e 1 '{"a": "int"}'
❯ de synthetic -s 4 -e 2 '{"a": "int"}'
[Figures omitted: comparison plots for each compression codec (None, Snappy, ZSTD) across Vanilla Parquet, CDC Parquet, JSONLines, and SQLite.]
The experiment can be reproduced by running the following command:
❯ de synthetic -s 1 -e 1 '{"a": "int", "b": "str", "c": ["float"]}'
❯ de synthetic -s 1 -e 2 '{"a": "int", "b": "str", "c": ["float"]}'
[Figures omitted: comparison plots for each compression codec (None, Snappy, ZSTD) across Vanilla Parquet, CDC Parquet, JSONLines, and SQLite.]
The experiment can be reproduced by running the following command:
❯ de synthetic -s 4 -e 1 '{"a": "int", "b": "str", "c": ["float"]}'
❯ de synthetic -s 4 -e 2 '{"a": "int", "b": "str", "c": ["float"]}'
[Figures omitted: comparison plots for each compression codec (None, Snappy, ZSTD) across Vanilla Parquet, CDC Parquet, JSONLines, and SQLite.]
https://huggingface.co/datasets/cfahlgren1/hub-stats/blob/main/spaces.parquet
Check out all revisions of spaces.parquet:
❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/cfahlgren1/hub-stats.git ~/Datasets/hub-stats
❯ de revisions -d /tmp/spaces ~/Datasets/hub-stats/spaces.parquet
This will take some time depending on the network speed. Then run the evaluation, which automatically (a sketch of this loop follows the list below):
- Converts the parquet files to JSONLines
- Converts the parquet files to CDC Parquet with Snappy and ZSTD compressions
- Estimates the deduplication ratio for each format
- Prints the results and generates a plot for comparison
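The loop behind these steps looks roughly like the following; it is illustrative only, and `write_cdc_parquet` is a hypothetical helper standing in for the CDC-enabled writer from the branch mentioned above (the JSONLines conversion is omitted here).

```python
import glob

def evaluate(directory):
    """Estimate deduplication for the vanilla and CDC variants of every revision."""
    revisions = sorted(glob.glob(f"{directory}/*.parquet"))
    variants = {"Parquet": revisions}
    for codec in ("snappy", "zstd"):
        variants[f"CDC {codec}"] = [
            write_cdc_parquet(path, compression=codec)  # hypothetical CDC rewrite helper
            for path in revisions
        ]
    # estimate_dedup() is the chunk-hashing sketch shown earlier.
    return {title: estimate_dedup(paths) for title, paths in variants.items()}
```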
❯ de stats --with-json /tmp/spaces
Writing JSONLines files
100%|██████████████████████████████████████████| 193/193 [07:14<00:00, 2.25s/it]
Writing CDC Parquet files
100%|██████████████████████████████████████████| 193/193 [00:38<00:00, 4.97it/s]
100%|██████████████████████████████████████████| 193/193 [00:31<00:00, 6.12it/s]
Estimating deduplication for JSONLines
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
Title | Total Size | Chunk Size | Compressed Chunk Size | Deduplication Ratio | Compressed Deduplication Ratio |
---|---|---|---|---|---|
JSONLines | 157.8 GiB | 12.8 GiB | 2.7 GiB | 8% | 2% |
Parquet | 24.3 GiB | 23.5 GiB | 22.6 GiB | 97% | 93% |
CDC Snappy | 24.3 GiB | 8.8 GiB | 8.5 GiB | 36% | 35% |
CDC ZSTD | 15.3 GiB | 6.2 GiB | 6.2 GiB | 41% | 41% |
❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/openfoodfacts/product-database.git ~/Datasets/product-database
❯ de revisions -d /tmp/food ~/Datasets/product-database/food.parquet
food.parquet has 32 revisions
Checking out 2e19b51
Checking out 1f84d31
Checking out d31d108
Checking out 9cd809c
Checking out 41e5f38
Checking out 9a30ddd
...
Generate the stats, but skip the JSONLines conversion: the compressed Parquet files are around 6GB each, so the uncompressed JSONLines files would require too much disk space.
❯ de stats /tmp/food --max-processes 8
Writing CDC Parquet files
100%|████████████████████████████████████████████| 32/32 [07:17<00:00, 13.67s/it]
100%|████████████████████████████████████████████| 32/32 [06:48<00:00, 12.77s/it]
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
Title | Total Size | Chunk Size | Compressed Chunk Size | Deduplication Ratio | Compressed Deduplication Ratio |
---|---|---|---|---|---|
Parquet | 182.6 GiB | 148.0 GiB | 140.5 GiB | 81% | 77% |
CDC Snappy | 178.3 GiB | 75.6 GiB | 73.1 GiB | 42% | 41% |
CDC ZSTD | 109.6 GiB | 55.9 GiB | 55.5 GiB | 51% | 51% |