## Exporting data with numpy and h5py

This notebook shows several ways to export data for eoflow using numpy or h5py.

In [None]:
import os
import numpy as np
import h5py

In [None]:
# Create the temp dir (no error if it already exists)
os.makedirs('temp', exist_ok=True)

### Method 1: saving arrays using numpy

Let's create some numpy arrays to represent our features and labels.

In [None]:
features = np.random.random(size=(1024, 32, 32, 13))
labels = np.random.randint(10, size=(1024,))

features.shape, labels.shape

With numpy, use the `np.savez` function to save multiple arrays into a single `.npz` file.

In [None]:
np.savez('temp/data.npz', features=features, labels=labels)
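To check the export, the archive can be read back with `np.load`. The following is a minimal sketch that assumes a throwaway temporary directory and small stand-in arrays rather than the notebook's `temp/data.npz`:

```python
import os
import tempfile

import numpy as np

# Small stand-in arrays (sizes chosen for illustration only)
features = np.random.random(size=(8, 32, 32, 13))
labels = np.random.randint(10, size=(8,))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.npz')
    np.savez(path, features=features, labels=labels)

    # np.load returns a dict-like NpzFile; each array is read when accessed by key
    with np.load(path) as data:
        keys = sorted(data.files)
        loaded_features = data['features']
        loaded_labels = data['labels']

print(keys)
print(loaded_features.shape, loaded_labels.shape)
```

The keys in the archive are the keyword names passed to `np.savez`.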

Numpy writes each array in one go, and loading an array reads it into memory in full. Each file should therefore be kept small to limit the memory and loading overhead.

If the dataset is too large to fit into memory, it is better to split it into multiple `.npz` files or to use the HDF5 format.
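Splitting into multiple `.npz` files could look roughly like the sketch below. The shard size, the `data_000.npz` naming scheme, and the temporary output directory are assumptions made for illustration. The arrays are held in memory here only to keep the example short; for a dataset that really does not fit into memory, each shard would be produced separately.

```python
import os
import tempfile

import numpy as np

# Stand-in data (sizes chosen for illustration only)
features = np.random.random(size=(100, 32, 32, 13))
labels = np.random.randint(10, size=(100,))

shard_size = 32               # hypothetical number of examples per file
out_dir = tempfile.mkdtemp()  # hypothetical output location

shard_paths = []
for shard_idx, start in enumerate(range(0, len(features), shard_size)):
    end = start + shard_size  # slicing past the end is safe; the last shard is shorter
    path = os.path.join(out_dir, f'data_{shard_idx:03d}.npz')
    np.savez(path, features=features[start:end], labels=labels[start:end])
    shard_paths.append(path)

print(len(shard_paths))
```

Each shard can then be loaded on its own with `np.load`, so only one shard needs to be in memory at a time.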

### Method 2: saving arrays using h5py

Let's save the same data using the h5py library.

In [None]:
with h5py.File('temp/data.hdf5', 'w') as file:
    file.create_dataset('features', data=features)
    file.create_dataset('labels', data=labels)

h5py lets us create separate datasets (and groups of datasets) within one file and save data to them. The format also allows the data to be read in parts, so only the portion that is needed has to be loaded. Splitting the dataset into smaller pieces is therefore no longer necessary.
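Reading only part of a dataset can be sketched as follows. The example writes its own small file to a temporary directory (an assumption, so that it runs independently of the cells above) and then slices the dataset handle like a numpy array:

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in data (sizes chosen for illustration only)
features = np.random.random(size=(64, 32, 32, 13))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.hdf5')
    with h5py.File(path, 'w') as f:
        f.create_dataset('features', data=features)

    with h5py.File(path, 'r') as f:
        ds = f['features']     # dataset handle; the data itself is not loaded yet
        full_shape = ds.shape  # shape is available without reading the data
        batch = ds[16:24]      # reads only these 8 examples into a numpy array

print(full_shape, batch.shape)
```

The slice is returned as a regular numpy array, so it remains usable after the file is closed.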

However, this method still needs the whole dataset in memory at export time, so it cannot be used if the dataset is too big to fit. That's where **Method 3** comes in.

### Method 3: saving arrays iteratively using h5py

h5py also lets us write the data in parts (e.g. row by row). The datasets we create can be indexed and written to much like numpy arrays. Let's export a dataset produced by a generator.

In [None]:
def _generate_data(num_examples):
    """Generates the specified number of examples, one example at a time."""
    
    for i in range(num_examples):
        features = np.random.random(size=(32, 32, 13))
        labels = np.random.randint(10, size=())
        
        yield features, labels

In [None]:
with h5py.File('temp/data_gen.hdf5', 'w') as file:
    num_examples = 1024
    
    # Define datasets (total shape)
    features_ds = file.create_dataset('features', (num_examples, 32, 32, 13), dtype=np.float32)
    labels_ds = file.create_dataset('labels', (num_examples,), dtype=np.int32)
    
    # Store the generated data into the datasets
    for i, (features, labels) in enumerate(_generate_data(num_examples)):
        features_ds[i] = features
        labels_ds[i] = labels

**NOTE**: `data_gen.hdf5` is smaller than `data.hdf5` (roughly half the size), because we set the dtype of the features dataset to float32, while the original array is float64.
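The effect of the dtype on file size can be checked with a small sketch like the one below. It uses its own temporary files and a smaller array (both assumptions for illustration) and compares a float64 export with a float32 export of the same values:

```python
import os
import tempfile

import h5py
import numpy as np

# np.random.random produces float64 values
features = np.random.random(size=(64, 32, 32, 13))

with tempfile.TemporaryDirectory() as tmp_dir:
    path64 = os.path.join(tmp_dir, 'f64.hdf5')
    path32 = os.path.join(tmp_dir, 'f32.hdf5')

    # Dataset dtype is taken from the array (float64)
    with h5py.File(path64, 'w') as f:
        f.create_dataset('features', data=features)

    # Dataset dtype is set explicitly; values are converted to float32 on write
    with h5py.File(path32, 'w') as f:
        ds = f.create_dataset('features', features.shape, dtype=np.float32)
        ds[...] = features
        stored_dtype = ds.dtype

    size64 = os.path.getsize(path64)
    size32 = os.path.getsize(path32)

print(stored_dtype, size64, size32)
```

Storing float32 halves the space used per value at the cost of precision, so whether the conversion is acceptable depends on the data.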