# Using ZFP to compress data

## Importing the required libraries/packages


We will import the following packages:

- `numpy`,
- `xarray`,
- `math`,
- `zfpy`, a Python wrapper for the ZFP compression algorithm,
- `getsizeof` (from `sys`), to check object sizes in bytes.

For the full `zfpy` documentation, see the
[zfp Python bindings documentation](https://zfp.readthedocs.io/en/release0.5.5/python.html).


In [1]:
import numpy as np
import xarray as xr
import math
import zfpy
from sys import getsizeof

I will provide separate instructions on how to install `zfpy`.
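
Before running the import cell, it can help to confirm that every package is importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and is not part of any of the packages above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Package names taken from the import cell above.
required = ["numpy", "xarray", "zfpy"]
print("missing packages:", missing_packages(required))
```

An empty list means the import cell should run without an `ImportError`.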


## Compressing a numpy array

First, we compress NumPy arrays of integer (`int64`) and double-precision (`float64`) values:


In [2]:
int_arr = np.arange(0, 100, dtype="int64")
print(f"type(int_arr): {type(int_arr)}")
print(f"int_arr.dtype: {int_arr.dtype}")
print(f"number of elements: {int_arr.size}")
print(f"int_arr.nbytes: {int_arr.nbytes}")
print(f"size in memory: {getsizeof(int_arr)}")

type(int_arr): <class 'numpy.ndarray'>
int_arr.dtype: int64
number of elements: 100
int_arr.nbytes: 800
size in memory: 896


So our integer `ndarray` has 100 `int64` elements at 8B per element, and therefore
uses 800B for the data portion. The total size in memory (as visible to Python)
is 896B.
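
To make the two numbers above concrete, this small sketch recomputes the data size from the element count and separates out the object overhead. The overhead is whatever `getsizeof` reports beyond `nbytes`; it is 96B in the output above, though the exact figure may differ between NumPy versions.

```python
from sys import getsizeof

import numpy as np

int_arr = np.arange(0, 100, dtype="int64")

# Data portion: number of elements times bytes per element.
data_bytes = int_arr.size * int_arr.itemsize

# Everything getsizeof reports beyond the data buffer is ndarray bookkeeping.
overhead = getsizeof(int_arr) - int_arr.nbytes

print(f"data bytes: {data_bytes}")
print(f"object overhead: {overhead}")
```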

Now let's compress it using `zfpy`:


In [3]:
int_arr_compressed = zfpy.compress_numpy(int_arr)
print(f"type(int_arr_compressed): {type(int_arr_compressed)}")
print(f"number of elements: {len(int_arr_compressed)}")
print(f"size in memory: {getsizeof(int_arr_compressed)}")

type(int_arr_compressed): <class 'bytes'>
number of elements: 264
size in memory: 297


The compressed version is a `bytes` object holding 264 bytes of data; including
the Python object overhead, it occupies 297B in memory.
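
The gap between the length (264) and the size in memory (297) in the output above is the fixed overhead of a Python `bytes` object. A minimal sketch, assuming CPython, where this overhead does not depend on the payload length:

```python
from sys import getsizeof


def bytes_overhead(n):
    """Return how many bytes getsizeof reports beyond the n payload bytes."""
    return getsizeof(bytes(n)) - n


# The overhead is the same for any payload length.
for n in (0, 128, 264):
    print(f"len={n:4d}  overhead={bytes_overhead(n)}")
```

For large buffers, `len()` alone is therefore a good measure of the compressed size.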

If we do not provide any other parameters/arguments, lossless compression is
used. This means that, once decompressed, we should get back exactly the original
data. Let's check it:


In [4]:
int_arr_decompressed = zfpy.decompress_numpy(int_arr_compressed)
print(f"min difference: {(int_arr_decompressed- int_arr).min()}")
print(f"max difference: {(int_arr_decompressed- int_arr).max()}")

min difference: 0
max difference: 0


BTW, now you know how to decompress data using `zfpy` as well.
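
Comparing the minimum and maximum differences works well for these arrays. A stricter alternative, sketched here as an assumption rather than as part of `zfpy`, compares shape, dtype and raw bytes, which also catches dtype changes and is not confused by NaN values. The helper name `is_bit_exact` is hypothetical.

```python
import numpy as np


def is_bit_exact(original, restored):
    """Return True if both arrays have the same shape, dtype and raw bytes."""
    return (
        original.shape == restored.shape
        and original.dtype == restored.dtype
        and original.tobytes() == restored.tobytes()
    )


# Stand-in arrays for illustration; in the notebook one would pass
# int_arr and int_arr_decompressed instead.
a = np.arange(0, 100, dtype="int64")
print(is_bit_exact(a, a.copy()))             # identical copy
print(is_bit_exact(a, a.astype("float64")))  # same values, different dtype
```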

Let's repeat the same procedure for a `double` array:


In [5]:
double_arr = np.arange(0, 100, dtype="double")
print(f"type(double_arr): {type(double_arr)}")
print(f"double_arr.dtype: {double_arr.dtype}")
print(f"number of elements: {double_arr.size}")
print(f"double_arr.nbytes: {double_arr.nbytes}")
print(f"size in memory: {getsizeof(double_arr)}")

type(double_arr): <class 'numpy.ndarray'>
double_arr.dtype: float64
number of elements: 100
double_arr.nbytes: 800
size in memory: 896


Now compressing:


In [6]:
double_arr_compressed = zfpy.compress_numpy(double_arr)
print(f"type(double_arr_compressed): {type(double_arr_compressed)}")
print(f"number of elements: {len(double_arr_compressed)}")
print(f"size in memory: {getsizeof(double_arr_compressed)}")

type(double_arr_compressed): <class 'bytes'>
number of elements: 128
size in memory: 161


Although `int_arr` and `double_arr` store the same numbers (in different data
types), the `double` array compresses better: 128B versus 264B. Let's
decompress and make sure it was lossless:


In [7]:
double_arr_decompressed = zfpy.decompress_numpy(double_arr_compressed)
print(f"min difference: {(double_arr_decompressed- double_arr).min()}")
print(f"max difference: {(double_arr_decompressed- double_arr).max()}")

min difference: 0.0
max difference: 0.0


All looking good so far.

Now let's try a real data set.


## Compressing a real data set

Let's open some real data and check it out:


In [8]:
ds = xr.open_dataset("../../../data/cam-fv/orig.TS.100days.nc")
TS = ds.TS.values
print(f"type(TS): {type(TS)}")
print(f"TS.dtype: {TS.dtype}")
print(f"number of elements: {TS.size}")
print(f"TS.nbytes: {TS.nbytes}B or {TS.nbytes / 1024 / 1024:.2f}MB")
print(f"size in memory: {getsizeof(TS)}")

type(TS): <class 'numpy.ndarray'>
TS.dtype: float32
number of elements: 5529600
TS.nbytes: 22118400B or 21.09MB
size in memory: 128
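
Note that `getsizeof(TS)` reports only 128B although the data take 21.09MB. One plausible explanation (an assumption, not verified against this data set) is that `TS` does not own its data buffer; for NumPy arrays, `getsizeof` only includes the buffer when the array owns it. The sketch below shows the effect on a small stand-in array:

```python
from sys import getsizeof

import numpy as np

base = np.zeros(1000, dtype="float32")  # owns its 4000-byte buffer
view = base[:]                          # shares the buffer with base

for name, arr in (("base", base), ("view", view)):
    print(
        f"{name}: owndata={arr.flags.owndata} "
        f"nbytes={arr.nbytes} getsizeof={getsizeof(arr)}"
    )
```

For this reason `nbytes` is the more reliable measure of the uncompressed size.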


Now compressing:


In [9]:
TS_compressed = zfpy.compress_numpy(TS)
print(f"type(TS_compressed): {type(TS_compressed)}")
size_mb = len(TS_compressed) / 1024 / 1024
print(f"number of elements: {len(TS_compressed)} ({size_mb:.2f}MB)")
print(f"size in memory: {getsizeof(TS_compressed)}")

type(TS_compressed): <class 'bytes'>
number of elements: 12464768 (11.89MB)
size in memory: 12464801


Let's check the compression ratio, defined here as the compressed size as a
percentage of the original size (so lower is better):


In [10]:
print("compression ratio: %0.2f%%\n" % (len(TS_compressed) / TS.nbytes * 100))

compression ratio: 56.35%
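
The percentage above is the compressed size relative to the original. Some texts instead quote the inverse, original size divided by compressed size. The sketch below computes both from the sizes printed earlier in this notebook; the function names are hypothetical.

```python
def size_percentage(compressed_bytes, original_bytes):
    """Compressed size as a percentage of the original size (lower is better)."""
    return compressed_bytes / original_bytes * 100


def reduction_factor(compressed_bytes, original_bytes):
    """Original size divided by compressed size (higher is better)."""
    return original_bytes / compressed_bytes


# (compressed bytes, original bytes) taken from the outputs above.
cases = {
    "int_arr": (264, 800),
    "double_arr": (128, 800),
    "TS": (12464768, 22118400),
}
for name, (compressed, original) in cases.items():
    pct = size_percentage(compressed, original)
    factor = reduction_factor(compressed, original)
    print(f"{name:10s} {pct:6.2f}%  {factor:5.2f}x")
```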



Now let's decompress and make sure we are getting the original input:


In [11]:
TS_decompressed = zfpy.decompress_numpy(TS_compressed)
print(f"min difference: {(TS_decompressed- TS).min()}")
print(f"max difference: {(TS_decompressed- TS).max()}")

min difference: 0.0
max difference: 0.0


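All is good. Now let's try different tolerances and see how the error and
compression ratio change. A tolerance of `-1` (the `zfpy` default) means no
tolerance is set, so the compression is lossless: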


In [12]:
def evaluate(tolerance):
    """Compress TS at the given tolerance and return the summary metrics."""
    compressed = zfpy.compress_numpy(TS, tolerance=tolerance)
    # Compressed size as a percentage of the original size.
    compression_ratio = len(compressed) / TS.nbytes * 100
    diff = zfpy.decompress_numpy(compressed) - TS
    rmse = np.sqrt(np.power(diff, 2).mean())
    return (tolerance, compression_ratio, rmse, diff.min(), diff.max())


# tolerance = -1 means "not set", which gives the lossless baseline.
results = [evaluate(-1)]

max_p = 5
print(f"Trying tolerance = 10^p; -{max_p} <= p <= 0. Please wait ", end="")
for p in range(-max_p, 1):
    print(".", end="")
    results.append(evaluate(math.pow(10, p)))

print("\n\n")
print(" Tolerance     CR               RMSE (min, max err)")
print("-----------------------------------------------------------")
for e in results:
    print((f"%9.{max_p}f     %5.2f    %.8f (%11.8f,%11.8f)") % e)
print("-----------------------------------------------------------")

Trying tolerance = 10^p; -5 <= p <= 0. Please wait ......


 Tolerance     CR               RMSE (min, max err)
-----------------------------------------------------------
 -1.00000     56.35    0.00000000 ( 0.00000000, 0.00000000)
  0.00001     58.65    0.00000028 (-0.00001526, 0.00000000)
  0.00010     54.91    0.00000135 (-0.00003052, 0.00001526)
  0.00100     42.41    0.00004350 (-0.00027466, 0.00024414)
  0.01000     33.03    0.00034111 (-0.00196838, 0.00222778)
  0.10000     23.71    0.00266330 (-0.01647949, 0.01643372)
  1.00000     12.70    0.03611637 (-0.27249146, 0.26344299)
-----------------------------------------------------------
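
It is worth comparing the largest absolute error in each row with the requested tolerance. The sketch below does this using values copied from the table above. The last two lines rest on an assumption that is not verified here: that the `TS` values lie between 256 and 512 (for example, surface temperatures in kelvin), in which case half the spacing between neighbouring `float32` values would be about 1.526e-05.

```python
# (tolerance, min_err, max_err) copied from the results table above.
rows = [
    (1e-5, -0.00001526, 0.00000000),
    (1e-4, -0.00003052, 0.00001526),
    (1e-3, -0.00027466, 0.00024414),
    (1e-2, -0.00196838, 0.00222778),
    (1e-1, -0.01647949, 0.01643372),
    (1e0, -0.27249146, 0.26344299),
]

exceeded = []
for tolerance, min_err, max_err in rows:
    worst = max(abs(min_err), abs(max_err))
    within = worst <= tolerance
    if not within:
        exceeded.append(tolerance)
    print(f"tolerance={tolerance:<8g} worst error={worst:.8f} within={within}")

# Assumption: values in [256, 512) stored as float32 (23 fraction bits),
# so neighbouring representable values are 2**(8 - 23) apart.
half_spacing = 2.0 ** (8 - 23) / 2
print(f"half the float32 spacing in [256, 512): {half_spacing:.8f}")
```

Only the 1e-05 row exceeds its tolerance, and its worst error matches that half-spacing. If the assumption about the value range holds, this would be consistent with the error coming from `float32` rounding rather than from ZFP itself.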
