# Summary

- data set:
    - wind from direction
    - 24 consecutive time steps
    - 3 ensemble members, 4 height levels

- combining the files that each hold a single time slice into one dataset reduced the size by 4% (100M to 96M) by deduplicating metadata
- different chunk sizes did not result in smaller files

- the currently used compression scheme is quite expensive:
    - compress: 7:14 CPU time, 1:56 Wall time => 96MB
    - decompress: 0:13 CPU time, 0:04 Wall time
- an alternative compression scheme was found that gives only a minimally lower compression ratio for this dataset, but is several times faster:
    - compress: 0:58 CPU time, 0:15 Wall time => 99MB (using Zstd level 7)
    - decompress: 0:02 CPU time, 0:01 Wall time
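
As a rough cross-check of the figures above, the following sketch recomputes compression ratios and speedups. It assumes the uncompressed size of 449.76 MB that dask reports for the combined array further down; `du` reports binary units, so the ratios are approximate. The helper `to_seconds` is introduced here for illustration only.

```python
# Figures copied from the summary; RAW_MB is the uncompressed size
# reported by dask for the combined float32 array (an assumption
# about which size to use as the baseline).
RAW_MB = 449.76


def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


schemes = {
    "lzma": {"size_mb": 96, "cpu": "7:14", "wall": "1:56"},
    "zstd-7": {"size_mb": 99, "cpu": "0:58", "wall": "0:15"},
}

for name, s in schemes.items():
    ratio = RAW_MB / s["size_mb"]
    print(f"{name}: ratio ~{ratio:.1f}, wall {to_seconds(s['wall'])} s")

cpu_speedup = to_seconds("7:14") / to_seconds("0:58")
wall_speedup = to_seconds("1:56") / to_seconds("0:15")
size_penalty = 99 / 96 - 1
print(f"CPU speedup {cpu_speedup:.1f}x, wall speedup {wall_speedup:.1f}x, "
      f"size penalty {size_penalty:.1%}")
```

With these inputs, Zstd level 7 is roughly 7 to 8 times faster to compress while producing files about 3% larger.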

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
from pathlib import Path

import xarray as xr
import zarr

# Data stored as fields

In [3]:
# size on disk
! du -sh wind_from_direction

100M	wind_from_direction


In [4]:
# size of single dataset
! du -sh wind_from_direction/2020/03/15/15/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T21.zarr/

4.1M	wind_from_direction/2020/03/15/15/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T21.zarr/


In [5]:
# chunking
ds = xr.open_zarr('wind_from_direction/2020/03/15/15/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T21.zarr/')
ds.wind_from_direction.data

| | Array | Chunk |
|---|---|---|
| Bytes | 18.74 MB | 782.25 kB |
| Shape | (3, 4, 706, 553) | (1, 2, 353, 277) |
| Count | 25 Tasks | 24 Chunks |
| Type | float32 | numpy.ndarray |


# Store as single dataset

## First, let's rewrite the data as individual files and check whether the size differs from the given dataset

In [6]:
import lzma
import numcodecs

lzma_filters = [
    dict(id=lzma.FILTER_DELTA, dist=4),  # data is 32bit floats
    dict(id=lzma.FILTER_LZMA2, preset=9)]
compressor = numcodecs.LZMA(filters=lzma_filters, format=lzma.FORMAT_RAW)
var_name = 'wind_from_direction'
encoding = {var_name: {'compressor': compressor}}

for i, path in enumerate(Path('wind_from_direction').rglob('*.zarr')):
    ds = xr.open_zarr(str(path))
    ds.to_zarr(f'test_output/small_files/{i}.zarr', mode='w', consolidated=True, encoding=encoding)

In [7]:
! du -sh test_output/small_files

100M	test_output/small_files
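
The `dist=4` setting of the delta filter matches the 4-byte width of a float32: each byte is replaced by its difference to the byte four positions earlier, that is, the same byte of the previous value. The sketch below applies the same filter chain with the standard-library `lzma` module to synthetic data (a sine wave, not the MOGREPS-UK fields), so the sizes it prints say nothing about the real dataset. The helper names are made up for this illustration.

```python
import lzma
import math
import struct

# Synthetic stand-in for a smooth field: 10,000 little-endian float32 values.
values = [180.0 + 90.0 * math.sin(i / 500.0) for i in range(10_000)]
raw = struct.pack(f"<{len(values)}f", *values)


def make_filters(use_delta: bool) -> list:
    """Build a raw LZMA filter chain, optionally preceded by a delta filter."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 9}]
    if use_delta:
        # dist=4: difference to the byte 4 positions back (one float32 earlier).
        filters.insert(0, {"id": lzma.FILTER_DELTA, "dist": 4})
    return filters


def compress(data: bytes, use_delta: bool) -> bytes:
    return lzma.compress(data, format=lzma.FORMAT_RAW,
                         filters=make_filters(use_delta))


def decompress(blob: bytes, use_delta: bool) -> bytes:
    return lzma.decompress(blob, format=lzma.FORMAT_RAW,
                           filters=make_filters(use_delta))


for use_delta in (False, True):
    blob = compress(raw, use_delta)
    assert decompress(blob, use_delta) == raw  # lossless round trip
    print(f"delta={use_delta}: {len(raw)} -> {len(blob)} bytes")
```

Whether the delta filter helps depends on the data, so it is worth comparing both variants on a representative sample.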


## Store as one dataset

Ok, now let's read all the data and store it together with the same compression method.

In [8]:
all_data = xr.concat([xr.open_zarr(str(path)) for path in Path('wind_from_direction').rglob('*.zarr')], dim='time')

In [9]:
all_data

| | Array | Chunk |
|---|---|---|
| Bytes | 106.18 kB | 4.42 kB |
| Shape | (24, 553, 2) | (1, 553, 2) |
| Count | 96 Tasks | 24 Chunks |
| Type | float32 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 135.55 kB | 5.65 kB |
| Shape | (24, 706, 2) | (1, 706, 2) |
| Count | 96 Tasks | 24 Chunks |
| Type | float32 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 449.76 MB | 782.25 kB |
| Shape | (24, 3, 4, 706, 553) | (1, 1, 2, 353, 277) |
| Count | 1752 Tasks | 576 Chunks |
| Type | float32 | numpy.ndarray |


In [10]:
%%time
_ = (all_data
   .to_zarr('test_output/one_dataset.zarr', mode='w', consolidated=True, encoding=encoding)
)

CPU times: user 7min 9s, sys: 5.18 s, total: 7min 14s
Wall time: 1min 56s


In [11]:
! du -sh test_output/one_dataset.zarr

96M	test_output/one_dataset.zarr


In [12]:
all_data.wind_from_direction.data

| | Array | Chunk |
|---|---|---|
| Bytes | 449.76 MB | 782.25 kB |
| Shape | (24, 3, 4, 706, 553) | (1, 1, 2, 353, 277) |
| Count | 1752 Tasks | 576 Chunks |
| Type | float32 | numpy.ndarray |


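The chunks are still the same as in the individual fields, so the compressed data chunks should be unchanged. Thus the size reduction from 100M to 96M comes from storing the coordinates and metadata once instead of once per file.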
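
One way to check where the bytes go is to sum file sizes per top-level entry of a store. The sketch below assumes a Zarr directory store in which each array is a subdirectory and metadata sits in dot-files at the root; `size_by_member` is a hypothetical helper, and the demo builds a tiny fake store in a temporary directory rather than touching the real data.

```python
import tempfile
from pathlib import Path


def size_by_member(store: str) -> dict:
    """Sum file sizes in bytes per top-level entry of a directory store."""
    root = Path(store)
    sizes = {}
    for path in root.rglob("*"):
        if path.is_file():
            member = path.relative_to(root).parts[0]
            sizes[member] = sizes.get(member, 0) + path.stat().st_size
    return sizes


# Demo on a fake store layout (file names and sizes are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.zarr"
    (demo / "wind_from_direction").mkdir(parents=True)
    (demo / "time").mkdir()
    (demo / "wind_from_direction" / "0.0").write_bytes(b"x" * 1000)
    (demo / "time" / "0").write_bytes(b"x" * 24)
    (demo / ".zmetadata").write_bytes(b"{}")
    demo_sizes = size_by_member(str(demo))

print(demo_sizes)
```

Running `size_by_member` on `test_output/one_dataset.zarr` and on each store under `test_output/small_files` would show how much of the total is coordinate and metadata overhead.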

## LZMA is quite slow, is there an alternative?

Zstandard combined with a delta filter and shuffling (via Blosc) achieves a similar compression ratio, but is considerably faster.
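
To illustrate what shuffling does, the sketch below implements a plain byte shuffle: the bytes of each 4-byte value are regrouped so that all first bytes come first, then all second bytes, and so on. This is a simplified stand-in for Blosc's shuffle filter, and `zlib` stands in for Zstandard so that the example only needs the standard library; the data is synthetic and the printed sizes are not representative of the real dataset.

```python
import math
import struct
import zlib


def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1, and so on."""
    return b"".join(data[i::itemsize] for i in range(itemsize))


def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle; len(data) must be a multiple of itemsize."""
    n = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n:(i + 1) * n]
    return bytes(out)


# Synthetic float32 samples (a sine wave, not the wind data).
samples = [180.0 + 90.0 * math.sin(i / 500.0) for i in range(10_000)]
buf = struct.pack(f"<{len(samples)}f", *samples)

shuffled = byte_shuffle(buf, 4)
assert byte_unshuffle(shuffled, 4) == buf  # lossless round trip

print("plain:   ", len(zlib.compress(buf, 6)))
print("shuffled:", len(zlib.compress(shuffled, 6)))
```

The idea is that the high-order bytes of neighbouring values tend to be similar, so grouping them produces longer runs for the compressor to exploit.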

In [13]:
def compress_with_zstd(data, level):
    encoding = {
        var_name: {
            'filters': [zarr.Delta(dtype='float32')],
            'compressor': zarr.Blosc(cname='zstd', clevel=level, shuffle=zarr.Blosc.AUTOSHUFFLE)
        }
    }
    _ = (
        data
        .to_zarr(f'test_output/one_dataset_zstd_{level}.zarr', mode='w', consolidated=True, encoding=encoding)
    )

In [14]:
for level in range(1, 10):
    print(f'Level {level}')
    %time compress_with_zstd(all_data, level)

Level 1
CPU times: user 14.3 s, sys: 456 ms, total: 14.8 s
Wall time: 4.06 s
Level 2
CPU times: user 15.9 s, sys: 596 ms, total: 16.5 s
Wall time: 4.52 s
Level 3
CPU times: user 19.3 s, sys: 576 ms, total: 19.9 s
Wall time: 5.39 s
Level 4
CPU times: user 29.2 s, sys: 421 ms, total: 29.6 s
Wall time: 7.89 s
Level 5
CPU times: user 36.9 s, sys: 449 ms, total: 37.4 s
Wall time: 9.89 s
Level 6
CPU times: user 40.5 s, sys: 454 ms, total: 41 s
Wall time: 11 s
Level 7
CPU times: user 56.9 s, sys: 572 ms, total: 57.5 s
Wall time: 15 s
Level 8
CPU times: user 1min 2s, sys: 573 ms, total: 1min 3s
Wall time: 20 s
Level 9
CPU times: user 3min 1s, sys: 578 ms, total: 3min 1s
Wall time: 48.4 s


In [15]:
! du -sh test_output/one_dataset_zstd_*.zarr

113M	test_output/one_dataset_zstd_1.zarr
109M	test_output/one_dataset_zstd_2.zarr
107M	test_output/one_dataset_zstd_3.zarr
106M	test_output/one_dataset_zstd_4.zarr
106M	test_output/one_dataset_zstd_5.zarr
101M	test_output/one_dataset_zstd_6.zarr
99M	test_output/one_dataset_zstd_7.zarr
99M	test_output/one_dataset_zstd_8.zarr
97M	test_output/one_dataset_zstd_9.zarr
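
To make the trade-off between speed and size explicit, the sketch below takes the wall times and sizes measured above and keeps only those levels that are not beaten on both counts by another level. The function `pareto_levels` is a hypothetical helper written for this comparison.

```python
# level: (wall time in seconds, size in MB as reported by du), from the runs above
results = {
    1: (4.06, 113),
    2: (4.52, 109),
    3: (5.39, 107),
    4: (7.89, 106),
    5: (9.89, 106),
    6: (11.0, 101),
    7: (15.0, 99),
    8: (20.0, 99),
    9: (48.4, 97),
}


def pareto_levels(measurements: dict) -> list:
    """Return levels for which no other level is at least as fast and as small."""
    keep = []
    for level, (t, size) in measurements.items():
        dominated = any(
            t2 <= t and s2 <= size and (t2 < t or s2 < size)
            for other, (t2, s2) in measurements.items()
            if other != level
        )
        if not dominated:
            keep.append(level)
    return keep


print(pareto_levels(results))  # [1, 2, 3, 4, 6, 7, 9]
```

Levels 5 and 8 drop out because levels 4 and 7 reach the same size in less time. Level 7 is the last step before the large jump in run time at level 9.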


# Let's use Zstd to experiment with different chunkings

In [16]:
all_data.wind_from_direction.data

| | Array | Chunk |
|---|---|---|
| Bytes | 449.76 MB | 782.25 kB |
| Shape | (24, 3, 4, 706, 553) | (1, 1, 2, 353, 277) |
| Count | 1752 Tasks | 576 Chunks |
| Type | float32 | numpy.ndarray |


Let's use Zarr's heuristic to determine a chunk size from the data:

In [17]:
from zarr.util import guess_chunks

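# the second argument is the item size; note that a float32 occupies 4 bytes,
# the value 32 is kept because it produced the output below
guess_chunks(all_data.wind_from_direction.data.shape, 32)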

(6, 1, 1, 177, 139)
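
For comparing chunkings it helps to know how many chunks a layout produces and how large each chunk is before compression. The sketch below computes both from the array shape, assuming 4 bytes per float32 value; `chunk_stats` is a helper introduced here, not part of Zarr.

```python
import math

SHAPE = (24, 3, 4, 706, 553)  # time, realization, height, y, x
ITEMSIZE = 4                  # bytes per float32 value


def chunk_stats(shape, chunks, itemsize=ITEMSIZE):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


layouts = {
    "original": (1, 1, 2, 353, 277),
    "guessed": (6, 1, 1, 177, 139),
}

for name, chunks in layouts.items():
    n_chunks, chunk_bytes = chunk_stats(SHAPE, chunks)
    print(f"{name}: {n_chunks} chunks of {chunk_bytes / 1e3:.2f} kB")
```

For the original layout this reproduces the 576 chunks of 782.25 kB shown by dask; the guessed layout gives 768 chunks of about 590 kB.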

In [18]:
for level in range(1, 10):
    print(f'Level {level}')
    %time compress_with_zstd(all_data.chunk(chunks={'time': 3, 'realization': 1, 'height': 1, 'projection_y_coordinate': 177, 'projection_x_coordinate': 139}), level)

Level 1
CPU times: user 15.7 s, sys: 805 ms, total: 16.5 s
Wall time: 5.03 s
Level 2
CPU times: user 18.2 s, sys: 840 ms, total: 19 s
Wall time: 6 s
Level 3
CPU times: user 21.1 s, sys: 837 ms, total: 22 s
Wall time: 6.46 s
Level 4
CPU times: user 30.5 s, sys: 933 ms, total: 31.5 s
Wall time: 9.28 s
Level 5
CPU times: user 37.7 s, sys: 940 ms, total: 38.6 s
Wall time: 10.6 s
Level 6
CPU times: user 35.3 s, sys: 937 ms, total: 36.3 s
Wall time: 10 s
Level 7
CPU times: user 50.6 s, sys: 1.01 s, total: 51.6 s
Wall time: 13.7 s
Level 8
CPU times: user 52 s, sys: 1.1 s, total: 53.1 s
Wall time: 14.3 s
Level 9
CPU times: user 4min 27s, sys: 1.12 s, total: 4min 29s
Wall time: 1min 8s


In [19]:
! du -sh test_output/one_dataset_zstd_*.zarr

114M	test_output/one_dataset_zstd_1.zarr
111M	test_output/one_dataset_zstd_2.zarr
110M	test_output/one_dataset_zstd_3.zarr
108M	test_output/one_dataset_zstd_4.zarr
108M	test_output/one_dataset_zstd_5.zarr
105M	test_output/one_dataset_zstd_6.zarr
105M	test_output/one_dataset_zstd_7.zarr
105M	test_output/one_dataset_zstd_8.zarr
101M	test_output/one_dataset_zstd_9.zarr


In [20]:
for level in range(1, 10):
    print(f'Level {level}')
    %time compress_with_zstd(all_data.chunk(chunks={'time': 6, 'realization': 1, 'height': 1, 'projection_y_coordinate': 177, 'projection_x_coordinate': 139}), level)

Level 1
CPU times: user 15.9 s, sys: 653 ms, total: 16.5 s
Wall time: 4.89 s
Level 2
CPU times: user 16.8 s, sys: 617 ms, total: 17.4 s
Wall time: 4.98 s
Level 3
CPU times: user 20.7 s, sys: 549 ms, total: 21.2 s
Wall time: 5.94 s
Level 4
CPU times: user 29.3 s, sys: 691 ms, total: 30 s
Wall time: 8.28 s
Level 5
CPU times: user 36.6 s, sys: 640 ms, total: 37.2 s
Wall time: 9.99 s
Level 6
CPU times: user 36.7 s, sys: 612 ms, total: 37.3 s
Wall time: 10.3 s
Level 7
CPU times: user 53.1 s, sys: 554 ms, total: 53.6 s
Wall time: 14.5 s
Level 8
CPU times: user 57.9 s, sys: 671 ms, total: 58.6 s
Wall time: 16 s
Level 9
CPU times: user 3min 16s, sys: 728 ms, total: 3min 16s
Wall time: 50.8 s


In [21]:
! du -sh test_output/one_dataset_zstd_*.zarr

113M	test_output/one_dataset_zstd_1.zarr
109M	test_output/one_dataset_zstd_2.zarr
108M	test_output/one_dataset_zstd_3.zarr
106M	test_output/one_dataset_zstd_4.zarr
106M	test_output/one_dataset_zstd_5.zarr
101M	test_output/one_dataset_zstd_6.zarr
100M	test_output/one_dataset_zstd_7.zarr
100M	test_output/one_dataset_zstd_8.zarr
98M	test_output/one_dataset_zstd_9.zarr


In [22]:
for level in range(1, 10):
    print(f'Level {level}')
    %time compress_with_zstd(all_data.chunk(chunks={'time': 3, 'realization': 1, 'height': 1, 'projection_y_coordinate': 353, 'projection_x_coordinate': 277}), level)

Level 1
CPU times: user 15.5 s, sys: 498 ms, total: 15.9 s
Wall time: 4.6 s
Level 2
CPU times: user 16.4 s, sys: 429 ms, total: 16.9 s
Wall time: 4.7 s
Level 3
CPU times: user 19.7 s, sys: 522 ms, total: 20.2 s
Wall time: 5.6 s
Level 4
CPU times: user 28.4 s, sys: 462 ms, total: 28.8 s
Wall time: 8.34 s
Level 5
CPU times: user 35.3 s, sys: 545 ms, total: 35.9 s
Wall time: 9.56 s
Level 6
CPU times: user 34 s, sys: 584 ms, total: 34.6 s
Wall time: 9.1 s
Level 7
CPU times: user 53.1 s, sys: 485 ms, total: 53.6 s
Wall time: 14.1 s
Level 8
CPU times: user 55.6 s, sys: 552 ms, total: 56.2 s
Wall time: 14.6 s
Level 9
CPU times: user 3min 24s, sys: 600 ms, total: 3min 24s
Wall time: 52.4 s


In [23]:
! du -sh test_output/one_dataset_zstd_*.zarr

112M	test_output/one_dataset_zstd_1.zarr
108M	test_output/one_dataset_zstd_2.zarr
107M	test_output/one_dataset_zstd_3.zarr
105M	test_output/one_dataset_zstd_4.zarr
105M	test_output/one_dataset_zstd_5.zarr
100M	test_output/one_dataset_zstd_6.zarr
99M	test_output/one_dataset_zstd_7.zarr
99M	test_output/one_dataset_zstd_8.zarr
97M	test_output/one_dataset_zstd_9.zarr


# How fast can the entire dataset be read?

In [24]:
%%time
ds = xr.open_zarr('test_output/one_dataset.zarr')
_ = ds.wind_from_direction.values

CPU times: user 12.3 s, sys: 511 ms, total: 12.8 s
Wall time: 3.74 s


In [25]:
%%time
ds = xr.concat([xr.open_zarr(str(path)) for path in Path('test_output/small_files').rglob('*.zarr')], dim='time')
_ = ds.wind_from_direction.values

CPU times: user 12.7 s, sys: 1.1 s, total: 13.8 s
Wall time: 4.56 s


In [26]:
for level in range(1, 10):
    print(f'Level {level}')
    %time ds = xr.open_zarr(f'test_output/one_dataset_zstd_{level}.zarr'); _ = ds.wind_from_direction.values

Level 1
CPU times: user 1.88 s, sys: 292 ms, total: 2.17 s
Wall time: 814 ms
Level 2
CPU times: user 1.9 s, sys: 629 ms, total: 2.53 s
Wall time: 1.17 s
Level 3
CPU times: user 1.88 s, sys: 235 ms, total: 2.11 s
Wall time: 835 ms
Level 4
CPU times: user 1.84 s, sys: 661 ms, total: 2.5 s
Wall time: 1.21 s
Level 5
CPU times: user 1.75 s, sys: 292 ms, total: 2.05 s
Wall time: 818 ms
Level 6
CPU times: user 1.79 s, sys: 949 ms, total: 2.74 s
Wall time: 1.54 s
Level 7
CPU times: user 1.74 s, sys: 257 ms, total: 2 s
Wall time: 820 ms
Level 8
CPU times: user 1.82 s, sys: 637 ms, total: 2.45 s
Wall time: 1.24 s
Level 9
CPU times: user 1.76 s, sys: 308 ms, total: 2.07 s
Wall time: 816 ms
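
The `%time` and `%%time` magics only work inside IPython. For repeating these measurements in a plain script, a rough stand-in could look like the sketch below. It reports process CPU time (user plus system) and wall time, so CPU time can exceed wall time when several threads work in parallel, as in the results above. The name `cpu_and_wall` is made up for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def cpu_and_wall(label: str):
    """Measure process CPU time and wall time of the enclosed block."""
    result = {}
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    try:
        yield result
    finally:
        result["cpu"] = time.process_time() - cpu_start
        result["wall"] = time.perf_counter() - wall_start
        print(f"{label}: CPU {result['cpu']:.2f} s, wall {result['wall']:.2f} s")


# Demo with a placeholder workload; replace it with the read to be timed,
# e.g. opening a store and accessing `.values`.
with cpu_and_wall("demo") as timing:
    total = sum(i * i for i in range(200_000))
```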
