# Storing and Organizing Array Data

#### A short excursion on how to store, organize, and handle large array data sets

#### So far we have used:
***NumPy*** files: binary arrays
* [save binary: numpy.save()](https://numpy.org/doc/stable/reference/generated/numpy.save.html)
* [save multiple arrays: numpy.savez()](https://numpy.org/doc/stable/reference/generated/numpy.savez.html)
* [save multiple compressed: numpy.savez_compressed()](https://numpy.org/doc/stable/reference/generated/numpy.savez_compressed.html)
* ...

In [None]:
import numpy as np
d1 = np.random.random(size = (1000,20))
d2 = np.random.random(size = (1000,200))

In [None]:
#save to binary 
np.save("myArray.npy", d1)

In [None]:
#load
d3=np.load("myArray.npy")

In [None]:
#check
(d1==d3).all()
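The cells above only exercise `numpy.save()`. As a minimal sketch of the multi-array variant from the list, the following stores both arrays in one compressed `.npz` archive and reads them back. The file name `myArrays.npz`, the keys `first` and `second`, and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Two arrays of different shape, as in the cells above
d1 = np.random.random(size=(1000, 20))
d2 = np.random.random(size=(1000, 200))

# Write into a temporary directory so no existing file is overwritten
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "myArrays.npz")  # hypothetical file name

# The keyword names become the keys of the archive
np.savez_compressed(archive_path, first=d1, second=d2)

# np.load returns a dict-like NpzFile; each array is read when it is accessed
with np.load(archive_path) as archive:
    keys = sorted(archive.files)
    d1_loaded = archive["first"]
    d2_loaded = archive["second"]

print(keys)
print(np.array_equal(d1, d1_loaded), np.array_equal(d2, d2_loaded))
```

An `.npz` archive is flat: it maps names to arrays and has no notion of nested groups or per-array metadata.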

## The HDF5 Data Container Format
<img src="IMG/HDF_logo.png">

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data with APIs for many programming languages.

#### HDF5 Structure
<img src="IMG/hdf5-folder.png" width=800>
<font size=5>[Image Source: https://www.sphenisc.com/doku.php/software/development/hdf5-phdf5]</font>

### HDF5 Key Features:
* POSIX-like path syntax for internal data structures: `/path/to/resource`
    * folders
    * metadata
    * comments (even code)
    * arrays 
* fast $n$-D data access 
* data compression
* APIs for many programming languages 

### In Python:
* ***h5py***: http://docs.h5py.org/en/stable/index.html
* ***HDF5 Docs:*** https://portal.hdfgroup.org/display/support

## Creating a Data Set

In [None]:
import numpy as np
import h5py #this is the HDF5 lib 

In [None]:
#create some random data
matrix1 = np.random.random(size = (1000,1000))
matrix2 = np.random.random(size = (10000,100))

In [None]:
# write it to the same file - in two different arrays
with h5py.File('hdf5_data.h5', 'w') as hdf: #note the write mode 'w'
    hdf.create_dataset('dataset1', data=matrix1)
    hdf.create_dataset('dataset2', data=matrix2)

## Reading 

In [None]:
#opening, listing and reading files
with h5py.File('hdf5_data.h5','r') as hdf:
    ls = list(hdf.keys())
    print('List of datasets in this file: \n', ls)
    data = hdf.get('dataset1') #here data is still an h5py Dataset object, not a numpy array
    dataset1 = np.array(data) #convert it into numpy while the file is still open
    print('Shape of dataset1: \n', dataset1.shape)

In [None]:
dataset1

In [None]:
f = h5py.File('hdf5_data.h5', 'r')
ls = list(f.keys())
f.close()

In [None]:
ls
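A dataset handle can be inspected before any values are loaded. The sketch below assumes a scratch file `inspect_demo.h5` (a hypothetical name, created in a temporary directory so the example does not depend on earlier cells) and reads the shape and dtype of each dataset from the handle. It then loads one dataset completely with the `[()]` indexing form.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file with the same layout as hdf5_data.h5
path = os.path.join(tempfile.mkdtemp(), "inspect_demo.h5")
with h5py.File(path, "w") as hdf:
    hdf.create_dataset("dataset1", data=np.random.random(size=(1000, 1000)))
    hdf.create_dataset("dataset2", data=np.random.random(size=(10000, 100)))

info = {}
with h5py.File(path, "r") as hdf:
    for name in hdf.keys():
        dset = hdf[name]                      # h5py.Dataset handle, no values read yet
        info[name] = (dset.shape, dset.dtype)
    full = hdf["dataset2"][()]                # [()] reads the whole dataset into numpy

for name, (shape, dtype) in info.items():
    print(name, shape, dtype)
print(type(full).__name__, full.shape)
```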

## Array Slicing
HDF5 supports array slicing, including fancy indexing, so we can read just a slice from disk instead of loading the whole dataset: http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

In [None]:
f = h5py.File('hdf5_data.h5', 'r')
subset = f['dataset1'][100:120,:] # this notation mostly follows numpy notation -> try different slices!
f.close() # close the file again; the slice is already a numpy array
subset
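To try a few slice forms side by side, here is a small sketch. It assumes a scratch file `slice_demo.h5` (a hypothetical name) and a matrix of consecutive numbers, so each result can be compared against the same slice taken in NumPy.

```python
import os
import tempfile

import h5py
import numpy as np

# Consecutive values make it easy to see which elements were read
matrix = np.arange(1000 * 50, dtype=np.float64).reshape(1000, 50)

path = os.path.join(tempfile.mkdtemp(), "slice_demo.h5")  # hypothetical file name
with h5py.File(path, "w") as hdf:
    hdf.create_dataset("dataset1", data=matrix)

with h5py.File(path, "r") as hdf:
    dset = hdf["dataset1"]
    rows = dset[100:120, :]        # contiguous block of 20 rows
    strided = dset[::100, 0]       # every 100th row of the first column
    picked = dset[[1, 5, 7], :]    # index list; h5py expects increasing indices

print(rows.shape, strided.shape, picked.shape)
print(np.array_equal(rows, matrix[100:120, :]))
```

Each result is an ordinary NumPy array, so it stays usable after the file is closed.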

## Creating Groups
We can organize data in groups, just like in a file system, where files (here: datasets) live in folders (here: groups).

In [None]:
matrix1 = np.random.random(size = (1000,1000))
matrix2 = np.random.random(size = (1000,1000))
matrix3 = np.random.random(size = (1000,1000))
matrix4 = np.random.random(size = (1000,1000))

In [None]:
with h5py.File('hdf5_groups.h5', 'w') as hdf:
    G1 = hdf.create_group('Group1')
    G1.create_dataset('dataset1', data = matrix1)
    G1.create_dataset('dataset4', data = matrix4)
 
    G21 = hdf.create_group('Group2/SubGroup1')
    G21.create_dataset('dataset3', data = matrix3)
    
    G22 = hdf.create_group('Group2/SubGroup2')
    G22.create_dataset('dataset2', data = matrix2)

## Reading Groups

In [None]:
with h5py.File('hdf5_groups.h5','r') as hdf:
    base_items = list(hdf.items())
    print('Items in the base directory:', base_items)
    G2 = hdf.get('Group2')
    G2_items = list(G2.items())
    print('Items in Group2:', G2_items)
    G21 = G2.get('/Group2/SubGroup1')
    G21_items = list(G21.items())
    print('Items in Group21:', G21_items)
    dataset3 = np.array(G21.get('dataset3'))
    print(dataset3.shape)


### What is happening? Interpret the results.
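One way to explore a hierarchy without knowing its layout in advance is `visititems()`, which calls a function for every group and dataset below the starting point. The sketch below assumes a small scratch file `walk_demo.h5` and a helper `collect` (both hypothetical names). It also shows that a path starting with `/` is resolved from the file root, even when it is used on a subgroup.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "walk_demo.h5")  # hypothetical file name
with h5py.File(path, "w") as hdf:
    hdf.create_group("Group1").create_dataset("dataset1", data=np.zeros((10, 10)))
    hdf.create_group("Group2/SubGroup1").create_dataset("dataset3", data=np.ones((5, 5)))

found = []

def collect(name, obj):
    """Record the path and kind of every object visited."""
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))
    # returning None tells h5py to continue the traversal

with h5py.File(path, "r") as hdf:
    hdf.visititems(collect)
    G2 = hdf["Group2"]
    relative = G2["SubGroup1/dataset3"].shape          # path relative to Group2
    absolute = G2["/Group2/SubGroup1/dataset3"].shape  # leading '/' starts at the root

for name, kind in sorted(found):
    print(f"{kind:8s} {name}")
print(relative, absolute)
```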

## Compress Data
HDF5 also supports native data compression:

In [None]:
matrix1 = np.random.random(size = (1000,1000))
matrix2 = np.random.random(size = (1000,1000))
matrix3 = np.random.random(size = (1000,1000))
matrix4 = np.random.random(size = (1000,1000))

In [None]:
with h5py.File('hdf5_groups_compressed.h5', 'w') as hdf:
    G1 = hdf.create_group('Group1')
    G1.create_dataset('dataset1', data = matrix1, compression="gzip", compression_opts=9)
    G1.create_dataset('dataset4', data = matrix4, compression="gzip", compression_opts=9)
 
    G21 = hdf.create_group('Group2/SubGroup1')
    G21.create_dataset('dataset3', data = matrix3, compression="gzip", compression_opts=9)
    
    G22 = hdf.create_group('Group2/SubGroup2')
    G22.create_dataset('dataset2', data = matrix2, compression="gzip", compression_opts=9)
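How much space compression saves depends strongly on the data. As a rough sketch, the following writes a random array and an all-zero array with and without gzip and compares the resulting file sizes. The helper `write_and_measure` and all file names are hypothetical. Uniform random values carry little redundancy, so they usually shrink only slightly, while repetitive data shrinks a lot.

```python
import os
import tempfile

import h5py
import numpy as np

tmp_dir = tempfile.mkdtemp()

def write_and_measure(file_name, data, **dataset_kwargs):
    """Write `data` as a single dataset and return the file size in bytes."""
    path = os.path.join(tmp_dir, file_name)
    with h5py.File(path, "w") as hdf:
        hdf.create_dataset("data", data=data, **dataset_kwargs)
    return os.path.getsize(path)

random_data = np.random.random(size=(500, 500))   # high-entropy values
constant_data = np.zeros((500, 500))              # highly repetitive values

gzip_opts = dict(compression="gzip", compression_opts=9)
sizes = {
    "random_raw": write_and_measure("random_raw.h5", random_data),
    "random_gzip": write_and_measure("random_gzip.h5", random_data, **gzip_opts),
    "zeros_raw": write_and_measure("zeros_raw.h5", constant_data),
    "zeros_gzip": write_and_measure("zeros_gzip.h5", constant_data, **gzip_opts),
}

# Compression is transparent on read: the values come back unchanged
with h5py.File(os.path.join(tmp_dir, "random_gzip.h5"), "r") as hdf:
    restored = hdf["data"][()]

for name, size in sizes.items():
    print(f"{name:12s} {size:>10d} bytes")
print("lossless:", np.array_equal(restored, random_data))
```

The `compression_opts` value for gzip ranges from 0 to 9, and higher levels trade write speed for smaller files.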

## Attributes
We can add metadata in the form of attributes on files, groups, and datasets:

In [None]:
matrix1 = np.random.random(size = (1000,1000))
matrix2 = np.random.random(size = (10000,100))

In [None]:
# Create the HDF5 file
hdf = h5py.File('test.h5', 'w')

# Create the datasets
dataset1 = hdf.create_dataset('dataset1', data=matrix1)
dataset2 = hdf.create_dataset('dataset2', data=matrix2)

# Set attributes
dataset1.attrs['CLASS'] = 'DATA MATRIX'
dataset1.attrs['VERSION'] = '1.1'

hdf.close()

In [None]:
# Read the HDF5 file
hdf = h5py.File('test.h5', 'r')
ls = list(hdf.keys())
print('List of datasets in this file: \n', ls)
data = hdf.get('dataset1')
dataset1 = np.array(data)
print('Shape of dataset1: \n', dataset1.shape)
#read the attributes
k = list(data.attrs.keys())
v = list(data.attrs.values())
print(k[0])
print(v[0])
print(data.attrs[k[0]])

hdf.close()
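The cells above attach attributes to a dataset only. The sketch below does the same for the file itself and for a group, then reads all attributes back as plain dictionaries. The file name `attrs_demo.h5` and the attribute names `description`, `experiment_id`, and `scale` are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as hdf:
    hdf.attrs["description"] = "demo file"        # attribute on the file (root group)
    grp = hdf.create_group("Group1")
    grp.attrs["experiment_id"] = 42               # attribute on a group
    dset = grp.create_dataset("dataset1", data=np.random.random(size=(100, 10)))
    dset.attrs["CLASS"] = "DATA MATRIX"           # attributes on a dataset
    dset.attrs["VERSION"] = "1.1"
    dset.attrs["scale"] = np.array([0.5, 2.0])    # small arrays work as well

with h5py.File(path, "r") as hdf:
    # .attrs behaves like a dictionary, so dict() copies all key/value pairs
    file_attrs = dict(hdf.attrs)
    group_attrs = dict(hdf["Group1"].attrs)
    dset_attrs = dict(hdf["Group1/dataset1"].attrs)

print(file_attrs)
print(group_attrs)
for key, value in dset_attrs.items():
    print(key, "=", value)
```

Copying the attributes into dictionaries inside the `with` block keeps them available after the file is closed.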