# Demo of saving huge arrays with h5 to reduce file sizes

_Warning_: multiple files saved in this script will have sizes of approx. 37MB and 159MB.

We'll compare the sizes of files containing huge arrays that are created by these libraries:

In [1]:
import numpy as np
import pickle
import h5py

| Test | Size reduced? | File size (MB) | Comment |
|---|---|---|---|
| `pickle` | - | 159 | Starting point. |
| `np.save` | No | 159 | No difference. |
| `h5` as `float32` | Yes | 80 | Halves the file size, because the dataset is created with `dtype='float32'` while the original array is `float64`. Values keep about 7 significant figures. The array also stays half-sized when loaded back into memory. |
| `h5` with loss, `scaleoffset=4` | Yes | 40 | Reduces the precision to 4 decimal places. |
| Reshape each 2D grid to 1D row | No | 159 | No difference for any library. |
| Reshape each 2D grid to 1D row and lose repeat values | Yes | est. 80 | Almost halves the size: each 2D grid is symmetric about its diagonal, so most values appear twice. |
| Change dtype | Yes | 40 | Changing the dtype from the default `float64` to `float16` quarters the file size and reduces the precision to 3-4 significant figures. |

Combining the best reductions in size:

+ Change dtype to lose precision
+ Reshape each grid to row and lose repeats

gives a __final file size of 20MB__, down from the starting 159MB. This is ~1/8 or 12.5% of the starting size. At this point, saving with `h5` or `pickle` makes no difference.
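
The sizes quoted above follow directly from the array shape and the number of bytes per element. A quick arithmetic check, assuming the 1000 x 141 x 141 array generated below and decimal megabytes (1 MB = 1e6 bytes); the variable names are only illustrative:

```python
# Expected sizes in bytes for the array shape used in this demo.
n_grids, n = 1000, 141

full_f64 = n_grids * n * n * 8      # float64, every grid value kept
full_f32 = n_grids * n * n * 4      # float32
full_f16 = n_grids * n * n * 2      # float16
n_unique = n * (n + 1) // 2         # diagonal plus one triangle of each grid
combo_f16 = n_grids * n_unique * 2  # float16, repeats removed

for label, size in [('float64, full grids', full_f64),
                    ('float32, full grids', full_f32),
                    ('float16, full grids', full_f16),
                    ('float16, no repeats', combo_f16)]:
    print(f'{label:<22} {size / 1e6:7.2f} MB')
```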

# Generate some data:

Generate an array of random data:

In [2]:
fake_array = np.random.rand(1000, 141, 141)

# Shift some values outside the [0, 1) range produced by rand:
fake_array[4:14, 4:14] += 10
fake_array[24:34, 24:34] -= 10

# Make the arrays symmetrical around the diagonal.
for i in range(fake_array.shape[0]):
    a = fake_array[i]
    fake_array[i] = a + np.transpose(a) 

Check the size of this in memory: 

In [3]:
fake_array.nbytes

159048000

# Save with different libraries

## pickle:

Save with pickle:

In [4]:
with open('fakearray_frompickle.p', 'wb') as filehandler:
    pickle.dump(fake_array, filehandler)

Load the pickled file:

In [5]:
with open('fakearray_frompickle.p', 'rb') as filehandler:
    fake_array_from_pickle = pickle.load(filehandler)

Check it works as expected:

In [6]:
print(type(fake_array_from_pickle))
print(fake_array_from_pickle.shape)

<class 'numpy.ndarray'>
(1000, 141, 141)


Check the size in memory now: 

In [7]:
fake_array_from_pickle.nbytes

159048000

## numpy:

Save this array to file using numpy's save function: 

In [8]:
np.save('fakearray_fromsave.npy', fake_array)

Load it back in:

In [9]:
fake_array_from_save = np.load('fakearray_fromsave.npy')

Check the size in memory now: 

In [10]:
fake_array_from_save.nbytes

159048000
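
Note that `nbytes` reports the size of the array in memory, not the size of the file. To measure the file itself, something like the sketch below could be used. It saves a smaller stand-in array to a temporary directory so that it runs quickly; `file_size_mb` is a hypothetical helper name, not part of the original notebook.

```python
import os
import tempfile

import numpy as np


def file_size_mb(path):
    """Return the size of a file on disk in decimal megabytes."""
    return os.path.getsize(path) / 1e6


# A smaller stand-in for fake_array, so the example is cheap to run.
small_array = np.random.rand(10, 141, 141)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'small_array.npy')
    np.save(path, small_array)
    size_on_disk = os.path.getsize(path)
    print(f'{file_size_mb(path):.3f} MB on disk')

# The .npy file is the raw array data plus a short header.
header_bytes = size_on_disk - small_array.nbytes
print(f'{header_bytes} bytes of header')
```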

## h5:

Save this array to file using h5.

In [11]:
with h5py.File('fakearray_fromh5.hdf5', 'w') as f:
    # Create an empty dataset...
    dset = f.create_dataset('fake_array', fake_array.shape, dtype='float32')
    # ... and then fill it with values from our array.
    dset[...] = fake_array

Load the array back in: 

In [12]:
with h5py.File('fakearray_fromh5.hdf5', 'r') as f:
    dset = f['fake_array']
    
    # Convert to numpy array:
    fake_array_from_h5 = dset[:] 

Check it works as expected:

In [13]:
print(type(fake_array_from_h5))
print(fake_array_from_h5.shape)

<class 'numpy.ndarray'>
(1000, 141, 141)


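Check the size in memory now. It is half the original size because the dataset was created with `dtype='float32'`, whereas `np.random.rand` returns `float64` values: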

In [14]:
fake_array_from_h5.nbytes

79524000
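
The halving comes from the `float64` to `float32` cast rather than from the file format, and the cast is not strictly lossless. A minimal sketch of the effect using NumPy only, on a small random sample rather than the full array:

```python
import numpy as np

rng = np.random.default_rng(0)
values_f64 = rng.random(1000) * 2            # float64 values in [0, 2)
values_f32 = values_f64.astype(np.float32)   # same cast the h5 dataset applies

# Largest difference introduced by the cast.
max_abs_diff = np.max(np.abs(values_f64 - values_f32.astype(np.float64)))

print(values_f32.nbytes / values_f64.nbytes)  # 0.5
print(max_abs_diff)                           # tiny, but not zero
```

For values of this magnitude the difference is far below the six decimal places printed in the comparison later on, which is why it shows up there as zero.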

# Lossy save with `h5`

Save this array to file using h5, this time with the lossy `scaleoffset` option.

`scaleoffset` is the number of decimal places that the file will retain. Any digits beyond that will not be retained, and when the array is loaded back in, the values of these later digits will have changed.

"Decimal places" here means digits after the point in fixed notation (e.g. 10.333333), not in scientific notation (e.g. 1.03333e+01).

Increasing `scaleoffset` increases the size of the saved file. For this `fake_array`, each increase in scaleoffset of 1 causes an increase in the file size of ~8MB.
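
The ~8MB step can be roughly reproduced by arithmetic. This is an estimate only, and it assumes that the scale-offset filter stores each value as an integer after scaling by a power of ten, so that each extra decimal digit costs about log2(10) extra bits per value:

```python
import math

n_values = 1000 * 141 * 141               # number of elements in fake_array
bits_per_extra_digit = math.log2(10)      # about 3.32 bits per decimal digit

# Extra bytes needed across the whole array for one more decimal digit.
extra_bytes = n_values * bits_per_extra_digit / 8
print(f'{extra_bytes / 1e6:.2f} MB per extra decimal digit')
```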

In [15]:
with h5py.File('fakearray_lossy_fromh5.hdf5', 'w') as f:
    # Create an empty dataset...
    dset = f.create_dataset('fake_array', fake_array.shape, 
                            dtype='float32', scaleoffset=4)
    # ... and then fill it with values from our array.
    dset[...] = fake_array

Load the array back in: 

In [16]:
with h5py.File('fakearray_lossy_fromh5.hdf5', 'r') as f:
    dset = f['fake_array']
    
    # Convert to numpy array:
    fake_array_lossy_from_h5 = dset[:] 

Check it works as expected:

In [17]:
print(type(fake_array_lossy_from_h5))
print(fake_array_lossy_from_h5.shape)

<class 'numpy.ndarray'>
(1000, 141, 141)


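Check the size in memory now. The loaded array is still `float32`, so `scaleoffset` shrinks the file on disk but not the array in memory: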

In [18]:
fake_array_lossy_from_h5.nbytes

79524000

In [19]:
size_orig = fake_array.nbytes
size_h5lossy = fake_array_lossy_from_h5.nbytes

print('Original size | Reduced size | New as % of old')
print(f'    {size_orig}    |     {size_h5lossy}    | {100*size_h5lossy/size_orig}%')

Original size | Reduced size | New as % of old
    159048000    |     79524000    | 50.0%


### Compare lossy array with original

In [20]:
print('Diff from lossless h5   | Diff from lossy h5')
print('--------------------------------------')
for i, orig_value in enumerate(fake_array[0][0][:10]):
    print(f'    {(orig_value-fake_array_from_h5[0][0][i]):9.6f}    |', 
          f'    {(orig_value-fake_array_lossy_from_h5[0][0][i]):9.6f}    ')
print('... plus more values that aren\'t printed.')

Diff from lossless h5   | Diff from lossy h5
--------------------------------------
     0.000000    |      0.000034    
     0.000000    |     -0.000014    
     0.000000    |     -0.000045    
    -0.000000    |      0.000035    
     0.000000    |      0.000016    
     0.000000    |      0.000034    
     0.000000    |     -0.000019    
     0.000000    |      0.000034    
     0.000000    |      0.000036    
     0.000000    |      0.000022    
... plus more values that aren't printed.
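
All of the lossy differences above are smaller than 0.00005, which is what rounding to four decimal places would give. The sketch below emulates that rounding with NumPy on random values. It does not call `h5py`, so it only illustrates the expected error bound and is not the filter itself:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(10_000) * 2       # float64 values in [0, 2)

decimals = 4                          # plays the role of scaleoffset=4
rounded = np.round(values, decimals).astype(np.float32)

# Largest error introduced by rounding and the float32 cast together.
max_abs_err = np.max(np.abs(values - rounded.astype(np.float64)))
print(f'{max_abs_err:.7f}')
```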


---

# Reshape grids from 2D to 1D

Reshape so that each 141x141 grid becomes a single, very long row:

In [21]:
# Use -1 to let numpy work out the row length (141 * 141 = 19881):
fake_array_2d = fake_array.reshape(fake_array.shape[0], -1)

fake_array_2d.shape

(1000, 19881)

Convert back to the original 3D shape:

In [22]:
fake_array_back_to_3d = fake_array_2d.reshape(
    fake_array_2d.shape[0], 
    int(fake_array_2d.shape[1]**0.5), 
    int(fake_array_2d.shape[1]**0.5)
)

fake_array_back_to_3d.shape

(1000, 141, 141)

Check that this array matches the original: 

In [23]:
np.all(fake_array_back_to_3d == fake_array)

True

Check size in memory: 

In [24]:
fake_array_2d.nbytes

159048000
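
The size in memory is unchanged because reshaping a contiguous array only changes how the same buffer is indexed. A small check of this, using a stand-in array of five grids:

```python
import numpy as np

grids = np.random.rand(5, 141, 141)
rows = grids.reshape(grids.shape[0], -1)   # one long row per grid

print(rows.shape)
# The reshaped array is a view onto the same memory, not a copy.
print(np.shares_memory(grids, rows))
```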

But how much room does it take up on disk?

## numpy:

Save this array to file using numpy's save function: 

In [25]:
np.save('fakearray2d_fromsave.npy', fake_array_2d)

Load it back in:

In [26]:
fake_array2d_from_save = np.load('fakearray2d_fromsave.npy')

Check the size in memory now: 

In [27]:
fake_array2d_from_save.nbytes

159048000

The size on disk is the same as saving the 3D array from earlier. 

## pickle:

Save with pickle:

In [28]:
with open('fakearray2d_frompickle.p', 'wb') as filehandler:
    pickle.dump(fake_array_2d, filehandler)

Load the pickled file:

In [29]:
with open('fakearray2d_frompickle.p', 'rb') as filehandler:
    fake_array_2d_from_pickle = pickle.load(filehandler)

Check it works as expected:

In [30]:
print(type(fake_array_2d_from_pickle))
print(fake_array_2d_from_pickle.shape)

<class 'numpy.ndarray'>
(1000, 19881)


Check the size in memory now: 

In [31]:
fake_array_2d_from_pickle.nbytes

159048000

The size on disk is the same as saving the 3D array from earlier.

## h5:

Save this array to file using h5, again with the lossy `scaleoffset=4` setting described in the earlier lossy save.

In [32]:
with h5py.File('fakearray2d_fromh5.hdf5', 'w') as f:
    # Create an empty dataset...
    dset = f.create_dataset('fake_array_2d', fake_array_2d.shape, 
                            dtype='float32', scaleoffset=4)
    # ... and then fill it with values from our array.
    dset[...] = fake_array_2d

Load the array back in: 

In [33]:
with h5py.File('fakearray2d_fromh5.hdf5', 'r') as f:
    dset = f['fake_array_2d']
    
    # Convert to numpy array:
    fake_array_2d_from_h5 = dset[:] 

Check it works as expected:

In [34]:
print(type(fake_array_2d_from_h5))
print(fake_array_2d_from_h5.shape)

<class 'numpy.ndarray'>
(1000, 19881)


Check the size in memory now: 

In [35]:
fake_array_2d_from_h5.nbytes

79524000

The size on disk is the same as saving the 3D array with `scaleoffset=4` earlier.

# Remove symmetric values and reshape to 2D

## Reshape 2D grid to 1D row

Define a function to reduce the 2D SHAP grid to a single 1D row.

In [36]:
def reshape_triangle_to_row(arr):
    """
    The values are symmetrical across a diagonal: 
    00 01 02 03 04 05 ...
    01 11 12 13 14 15 ...
    02 12 22 23 24 25 ... 
    03 13 23 33 34 35 ... 
    04 14 24 34 44 45 ... 
    05 15 25 35 45 55 ...
    :: :: :: :: :: ::
    So we can convert a 141x141 grid into a 1x19,881 row and retain 
    all of the information, *or* use the symmetry to ditch the 
    repeats and keep only a 1x10,011 row.
    """
    
    # Number of columns in this array: 
    n_cols = arr.shape[0]
    
    # Collect the values in this array (float64 by default):
    row = np.array([]) 
    
    for col in range(n_cols): 
        # For first column, keep all values.
        # For second, keep all except the first.
        # For third, keep all except the first and second.
        # ... for last column, keep only the final value. 
        
        # The actual values to store: 
        values_to_keep = arr[col, col:] 
        
        row = np.concatenate((row, values_to_keep))

    return row 
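
As an alternative to the loop, the same upper-triangle values can be picked out in one step with `np.triu_indices`. This is a sketch rather than a drop-in replacement: `triangle_to_row_vectorised` is a hypothetical name, and unlike the loop version it keeps the dtype of the input instead of promoting to `float64`.

```python
import numpy as np


def triangle_to_row_vectorised(arr):
    """
    Return the upper triangle of a square array, including the
    diagonal, as a 1D row. Values are read row by row, starting
    from the diagonal element of each row.
    """
    rows, cols = np.triu_indices(arr.shape[0])
    return arr[rows, cols]


# Small symmetric example grid:
grid = np.random.rand(6, 6)
grid = grid + grid.T

row = triangle_to_row_vectorised(grid.astype(np.float16))
print(row.shape, row.dtype)
```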

Move the grid values into a row: 

In [37]:
%%time

fake_array_short_row = reshape_triangle_to_row(fake_array[0])

CPU times: user 998 µs, sys: 259 µs, total: 1.26 ms
Wall time: 904 µs


In [38]:
fake_array_short_row.shape

(10011,)

In [39]:
fake_array_short_row.nbytes

80088

In [40]:
size_grid = fake_array[0].nbytes
size_short_row = fake_array_short_row.nbytes

print('Grid size | Short row size | New as % of old')
print(f'  {size_grid}  |      {size_short_row}     | ' + 
      f'{100*size_short_row/size_grid:4.2f}%')

Grid size | Short row size | New as % of old
  159048  |      80088     | 50.35%
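
The short row is slightly more than half the size of the grid because the diagonal values are not repeated, so they cannot be dropped. For an n x n grid the short row keeps n(n+1)/2 of the n² values, which is a fraction of (n+1)/(2n). A quick check for n = 141:

```python
n = 141

n_full = n * n                  # values in the full grid
n_unique = n * (n + 1) // 2     # diagonal plus one triangle

ratio = n_unique / n_full       # equals (n + 1) / (2 * n)
print(n_unique, f'{100 * ratio:.2f}%')
```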


## Reshape the 1D row to a 2D grid with symmetry

Define another function to restore that 1D row to a 2D grid.

In [92]:
def reshape_row_to_sym_grid(row, n_cols=141, dtype=np.float64):
    """
    Take a row of values [00, 01, 02, 03, ... ] and place them into
    a grid with diagonal symmetry:
    00 01 02 03 04 05 ...
    01 11 12 13 14 15 ...
    02 12 22 23 24 25 ... 
    03 13 23 33 34 35 ... 
    04 14 24 34 44 45 ... 
    05 15 25 35 45 55 ...
    :: :: :: :: :: ::
    """
    # Sanity check that this row will fit in the required grid.
    # A symmetric n x n grid has n(n+1)/2 unique values:
    if len(row) != n_cols * (n_cols + 1) // 2:
        raise ValueError(
            f'This row can\'t fit into a {n_cols}x{n_cols} grid.')
            
    # Create a new empty grid:
    grid = np.zeros((n_cols, n_cols), dtype=dtype)
    
    # Fill the grid with data from input row.
    # Keep track of how many values in the row are put in the grid:
    values_moved_to_grid = 0
    for col in range(0, n_cols):
        n_values_to_move_here = n_cols - col 
        
        values_to_move = row[values_moved_to_grid:(values_moved_to_grid + 
                                                   n_values_to_move_here)]
        grid[col, col:] = values_to_move
        grid[col:, col] = values_to_move
        
        values_moved_to_grid += n_values_to_move_here
    
    return grid
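
The restore step can also be written without the loop by assigning to both triangles through `np.triu_indices`. Again this is only a sketch under the same row ordering as above; `row_to_sym_grid_vectorised` is a hypothetical name, and it takes the dtype from the input row rather than from a `dtype` argument.

```python
import numpy as np


def row_to_sym_grid_vectorised(row, n_cols):
    """Place a row of upper-triangle values into a symmetric grid."""
    row = np.asarray(row)
    if len(row) != n_cols * (n_cols + 1) // 2:
        raise ValueError(
            f"This row can't fit into a {n_cols}x{n_cols} grid.")

    grid = np.zeros((n_cols, n_cols), dtype=row.dtype)
    rows, cols = np.triu_indices(n_cols)
    grid[rows, cols] = row      # upper triangle and diagonal
    grid[cols, rows] = row      # mirror image below the diagonal
    return grid


# Round trip on a small symmetric grid:
original = np.random.rand(6, 6)
original = original + original.T
short_row = original[np.triu_indices(6)]

restored = row_to_sym_grid_vectorised(short_row, n_cols=6)
print(np.array_equal(restored, original))
```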

Restore the row to a grid:

In [42]:
%%time

fake_array_sym_grid = reshape_row_to_sym_grid(fake_array_short_row, n_cols=141)

CPU times: user 456 µs, sys: 118 µs, total: 574 µs
Wall time: 416 µs


In [43]:
fake_array_sym_grid.shape

(141, 141)

In [44]:
fake_array_sym_grid.nbytes

159048

Check that this restored grid matches the original:

In [45]:
np.all(fake_array_sym_grid == fake_array[0])

True

# Change dtype

Data was originally in float64 (double precision). Reduce the stored precision by converting it to float16 (half precision):

In [46]:
size_f64 = fake_array.nbytes
size_f16 = fake_array.astype(np.float16).nbytes

print('Original size | Reduced size | New as % of old')
print(f'  {size_f64}   |   {size_f16}   | {100*size_f16/size_f64}%')

Original size | Reduced size | New as % of old
  159048000   |   39762000   | 25.0%


Check the grid: 

In [47]:
fake_array[0].astype(np.float16)

array([[0.4465, 1.012 , 1.565 , ..., 1.17  , 1.068 , 1.529 ],
       [1.012 , 1.101 , 1.254 , ..., 0.8364, 1.276 , 1.535 ],
       [1.565 , 1.254 , 0.3174, ..., 0.5176, 0.3518, 1.539 ],
       ...,
       [1.17  , 0.8364, 0.5176, ..., 0.0852, 0.795 , 1.281 ],
       [1.068 , 1.276 , 0.3518, ..., 0.795 , 1.483 , 1.034 ],
       [1.529 , 1.535 , 1.539 , ..., 1.281 , 1.034 , 1.449 ]],
      dtype=float16)

We've lost all of the data after roughly three to four significant figures, because float16 has only an 11-bit significand.
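
The limits of `float16` can be inspected with `np.finfo`. The sketch below also shows the other drawback of the narrower type: values beyond its range overflow to infinity when cast. The values in this demo stay well inside that range, so this only matters for other data.

```python
import numpy as np

info = np.finfo(np.float16)
print(info.eps)         # gap between 1.0 and the next float16
print(info.precision)   # approximate number of reliable decimal digits
print(info.max)         # largest finite float16

# Values beyond info.max overflow to infinity when cast.
big = np.array([1.0e5], dtype=np.float64)
with np.errstate(over='ignore'):
    overflowed = np.isinf(big.astype(np.float16))
print(overflowed)
```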

Save with pickle:

In [48]:
with open('fakearray_float16_frompickle.p', 'wb') as filehandler:
    pickle.dump(fake_array.astype(np.float16), filehandler)

# Combine the biggest gains

Start from the original array and make these changes:

+ Store as `np.float16`
+ Flatten grids to rows and remove repeats
+ Save with `h5`

Build the array to save:

In [49]:
combo_array = []

for i in range(fake_array.shape[0]):
    combo_array.append(
        reshape_triangle_to_row(
            fake_array[i].astype(np.float16)
        )
    )

combo_array = np.array(combo_array)

## pickle:

In [50]:
with open('fakearray_combo.p', 'wb') as filehandler:
    pickle.dump(combo_array.astype(np.float16), filehandler)

Load the pickled file:

In [93]:
with open('fakearray_combo.p', 'rb') as filehandler:
    combo_array_from_pickle = pickle.load(filehandler).astype(np.float16)

Check the size in memory now: 

In [83]:
combo_array_from_pickle.nbytes

20022000

Same as the size of the saved file.

## `h5`:

In [53]:
with h5py.File('fakearray_combo.hdf5', 'w') as f:
    # Create an empty dataset...
    dset = f.create_dataset('combo_array', combo_array.shape, dtype='float16')
    # ... and then fill it with values from our array.
    dset[...] = combo_array

Load the array back in: 

In [54]:
with h5py.File('fakearray_combo.hdf5', 'r') as f:
    dset = f['combo_array']
    
    # Convert to numpy array:
    combo_array_from_h5 = dset[:] 

Check the size in memory now: 

In [55]:
combo_array_from_h5.nbytes

20022000

Same as the size of the saved file.

## Recover the original array from the pickled file:

In [94]:
# Convert the array to a list to more easily overwrite the values:
combo_array_from_pickle = list(combo_array_from_pickle)

for i, short_row in enumerate(combo_array_from_pickle):
    grid = reshape_row_to_sym_grid(short_row, n_cols=141, dtype=np.float16)
    # Overwrite the short row with the grid:
    combo_array_from_pickle[i] = grid
    
# Convert back to array:
combo_array_from_pickle = np.array(combo_array_from_pickle)

Check that it matches the original array once that is cast to `float16`:

In [95]:
combo_array_from_pickle.shape

(1000, 141, 141)

In [96]:
np.all(combo_array_from_pickle == fake_array.astype(np.float16))

True
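
The flatten and recover steps can also be applied to the whole stack of grids at once by indexing the last two axes with `np.triu_indices`. This is a sketch on a small stand-in stack, assuming the same symmetric layout as `fake_array`; it is an alternative to the per-grid loops above, not the method used in this notebook.

```python
import numpy as np

# Small stand-in stack: 4 symmetric 6x6 grids stored as float16.
stack = np.random.rand(4, 6, 6)
stack = stack + np.transpose(stack, (0, 2, 1))
stack = stack.astype(np.float16)

n = stack.shape[1]
rows, cols = np.triu_indices(n)

# Flatten: keep only the upper triangle of every grid.
short_rows = stack[:, rows, cols]

# Recover: write the values back into both triangles.
restored = np.zeros_like(stack)
restored[:, rows, cols] = short_rows
restored[:, cols, rows] = short_rows

print(short_rows.shape, short_rows.dtype)
print(np.array_equal(restored, stack))
```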