# Gearing up Python for High-Speed HDF Adventures
## Khairulmizam Samsudin, PhD
#### Faculty of Engineering, Universiti Putra Malaysia
#### khairulmizam@upm.edu.my

In [None]:
from util import *

In [28]:
# Generate diagram for Data Flow slide
graph = """
sequenceDiagram
    participant Sensor as Sensor
    participant LocalStorage as Local Storage (HDF5)
    participant Preprocessing as Preprocessing
    participant AISystem as AI System
    participant Decision as Decision/Action

    Sensor->>LocalStorage: Collects Raw Data
    LocalStorage->>Preprocessing: Stores Raw Data
    Preprocessing->>LocalStorage: Processes and Saves Clean Data
    LocalStorage->>AISystem: Provides Clean Data
    AISystem->>Decision: Predicts/Analyzes Data
    Decision-->>AISystem: Feedback for System Improvement
    AISystem-->>Preprocessing: Requests More Data (if needed)
"""
mermaid_ink(graph, "images/context1.jpg")

graph = """
sequenceDiagram
    box white Focus of this talk
    participant Sensor as Sensor
    participant LocalStorage as Local Storage (HDF5)
    participant Preprocessing as Preprocessing
    end
    participant AISystem as AI System
    participant Decision as Decision/Action

    Sensor->>LocalStorage: Collects Raw Data
    LocalStorage->>Preprocessing: Stores Raw Data
    Note over Sensor, LocalStorage: Focus of this talk
    Preprocessing->>LocalStorage: Processes and Saves Clean Data
    LocalStorage->>AISystem: Provides Clean Data
    AISystem->>Decision: Predicts/Analyzes Data
    Decision-->>AISystem: Feedback for System Improvement
    AISystem-->>Preprocessing: Requests More Data (if needed)
"""
#mermaid_ink(graph)
mermaid_ink(graph, "images/context1a.jpg")

### Data Flow
![Context of HDF5](./images/context1.jpg)

### Data Flow
![Context of HDF5](./images/context1a.jpg)

### Background
- Focus: optimizing data storage so that large datasets can be handled efficiently and at high speed during the data acquisition stage
- **Data acquisition** is the crucial first step in AI and engineering applications.
  - The quality and efficiency of data collection directly impact the performance and accuracy of AI/ML models
  - Challenges: size, bandwidth, latency and cost

### Introduction to HDF5
- **Overview**:
  - [HDF5](https://www.hdfgroup.org/) is an open file format and set of tools for managing complex data.
  - Ideal for storing large volumes of numerical data efficiently (e.g. MRI, LIDAR point clouds, genomic, geospatial and earth science data).
  - Supported programming languages: **C**, C++, Fortran, **Python**, Java and many more.

### Why HDF5?
- **Performance**: Faster read/write speeds compared to traditional formats like CSV
- **Scalability**: Capable of handling large datasets without significant performance degradation.
- **Flexibility**: Supports a variety of data types and complex data structures, with metadata stored alongside the data.

### Limitations of HDF5
1. **Inflexibility of Datasets**: Once a dataset is created in HDF5, its data type and storage layout cannot be changed in place; changing them means writing a new dataset.
2. **Non-Extendable Datasets**: By default, datasets in HDF5 have a fixed size, so they cannot grow beyond their initial allocation unless they are created as resizable (chunked, with a maximum shape).
3. **Fragmentation Issues**: Appending datasets can lead to fragmentation, where the file system has to manage scattered blocks of data.
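
The fixed-size default in limitation 2 can be seen with a short h5py sketch. The file name and dataset names below are hypothetical and used only for illustration. It contrasts a dataset created with default settings against one created with `maxshape`.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "resizable_demo.h5")

with h5py.File(path, "w") as f:
    # Default settings: the shape is fixed at creation, so resize() is rejected.
    fixed = f.create_dataset("fixed", data=np.zeros((100, 10)))
    try:
        fixed.resize((200, 10))
        fixed_resizable = True
    except (TypeError, ValueError):
        fixed_resizable = False

    # maxshape=(None, 10) allows unlimited growth along axis 0.
    # Resizable datasets are chunked; h5py picks a chunk shape here.
    growable = f.create_dataset(
        "growable", shape=(100, 10), maxshape=(None, 10), dtype="float64"
    )
    growable.resize((200, 10))
    growable_shape = growable.shape

print(fixed_resizable, growable_shape)
```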

### Demo setup

In [39]:
import numpy as np
import csv
import h5py
import os

%load_ext memory_profiler

# Generate sample data
large_data = np.random.rand(int(1e5), 10)



### HDF5 vs CSV Read/Write performance

In [49]:
def csv_write(filename, data):
    """Write a 2-D array to CSV, one row per line."""
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data)

def h5_write(filename, data, dtype='float64', chunk_size=True):
    """Write data to a chunked HDF5 dataset; chunks=True lets h5py pick the chunk shape."""
    with h5py.File(filename, 'w') as file:
        file.create_dataset('dataset', data=data.astype(dtype), chunks=chunk_size)

def csv_read(filename):
    """Read every CSV row into a list of string lists."""
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file)
        return [row for row in reader]

def h5_read(filename):
    """Read the whole HDF5 dataset into a NumPy array."""
    with h5py.File(filename, 'r') as file:
        return file['dataset'][:]

print(f"Profiling Write")
%timeit csv_write('sample_data.csv', large_data)
%timeit h5_write('sample_data.h5', large_data)
print(f"Profiling Read")
%timeit csv_read('sample_data.csv')
%timeit h5_read('sample_data.h5')

Profiling Write
1.21 s ± 84.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
72.2 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Profiling Read
239 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.6 ms ± 44.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### HDF5 vs CSV Size

In [18]:
csv_size = os.path.getsize('sample_data.csv')
hdf5_size = os.path.getsize('sample_data.h5')

print(f"CSV File Size: {csv_size / 1024**2:.2f} MB")
print(f"HDF5 File Size: {hdf5_size / 1024**2:.2f} MB")

CSV File Size: 18.47 MB
HDF5 File Size: 7.63 MB


### Alternatives

| Feature | **h5py** | **PyTables** | **Tiled** | **Parquet** | **Feather** | **CSV** |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: |
| **Hierarchical Data**              | X | X | X | X |  |  |
| **Dense Data**                     | X | X | X | X | X |  |
| **Sparse Data**                    | ~ | X | X | X | X |  |
| **Compression**                    | X | X | X | X | X |  |
| **Data Querying**                  |   | X | X | X |   |  |
| **Parallel I/O**                   | X | X | X | X |   |  |
| **Hardware Interface (C/C++)**     | X | X |   |  |   |  |
| **Portability** | X | X  |  |   |  |  |

### Best Practices (BP) for Using HDF5
   1. Use the Right Data Type
   2. Store Data Incrementally
   3. Proper Chunking
   4. Enable Compression
   5. Optimize Access Patterns
   6. Leverage Virtual Datasets (VDS)

#### [BP-1] Use the Right Data Type
   - Choose the smallest suitable data type that can accurately represent your data.
   - Avoid excessive precision (e.g. `float64` vs `float32` vs `float16`)

In [51]:
print(f"Writing RGB8 as int64")
%timeit h5_write('sample_data_int64.h5', large_data, dtype = 'int64')
print(f"Writing RGB8 as int8")
%timeit h5_write('sample_data_int8.h5', large_data, dtype = 'int8')

hdf5_int64_size = os.path.getsize('sample_data_int64.h5')
print(f"HDF5 int64 File Size: {hdf5_int64_size / 1024**2:.2f} MB")
hdf5_int8_size = os.path.getsize('sample_data_int8.h5')
print(f"HDF5 int8 File Size: {hdf5_int8_size / 1024**2:.2f} MB")

Writing RGB8 as int64
61.9 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Writing RGB8 as int8
13.3 ms ± 6.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
HDF5 int64 File Size: 7.63 MB
HDF5 int8 File Size: 0.96 MB
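
Before narrowing a data type, it may help to check that the values survive the conversion. The sketch below uses hypothetical arrays that are not part of the demo data. Casting `np.random.rand` values, which lie in [0, 1), to an integer type truncates them all to 0, so integer data in the 0-255 range stands in for RGB8 here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor readings in [0, 1): float32 halves the memory footprint.
readings = rng.random((100_000, 10))           # float64 by default
readings32 = readings.astype(np.float32)
max_err = np.max(np.abs(readings - readings32))

# Hypothetical 8-bit colour values held in a wasteful int64 array.
pixels = rng.integers(0, 256, size=(100_000, 3), dtype=np.int64)
info = np.iinfo(np.uint8)
fits_uint8 = pixels.min() >= info.min and pixels.max() <= info.max
pixels8 = pixels.astype(np.uint8) if fits_uint8 else pixels

print(f"float64 -> float32: {readings.nbytes} -> {readings32.nbytes} bytes, "
      f"max abs error {max_err:.2e}")
print(f"int64 -> uint8: {pixels.nbytes} -> {pixels8.nbytes} bytes")
```

Whether the float32 rounding error is acceptable depends on the application; the range check only guards against integer overflow.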


#### [BP-2] Store Data Incrementally
   - Avoid keeping everything in memory
   - Write data to the HDF5 file incrementally rather than accumulating everything in memory and writing it all at once.
   - This is particularly important for large datasets that might not fit into memory.
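
One possible way to write incrementally with h5py is to create a resizable dataset and extend it as each batch arrives. This is a minimal sketch: `append_batches` and `fake_sensor` are hypothetical helpers, and the batch size and chunk shape are arbitrary assumptions.

```python
import os
import tempfile

import h5py
import numpy as np

def append_batches(filename, batches, n_cols, dtype="float32"):
    """Append each (rows x n_cols) batch to a growable dataset as it arrives."""
    with h5py.File(filename, "w") as f:
        dset = f.create_dataset(
            "dataset",
            shape=(0, n_cols),
            maxshape=(None, n_cols),   # unlimited growth along axis 0
            dtype=dtype,
            chunks=(1000, n_cols),     # assumed chunk shape; tune for your data
        )
        for batch in batches:
            start = dset.shape[0]
            dset.resize(start + len(batch), axis=0)
            dset[start:] = batch       # only this batch is held in memory
        return dset.shape

def fake_sensor(n_batches=5, rows=1000, n_cols=10):
    """Hypothetical stand-in for a sensor: yields one batch at a time."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        yield rng.random((rows, n_cols))

path = os.path.join(tempfile.mkdtemp(), "incremental_demo.h5")
final_shape = append_batches(path, fake_sensor(), n_cols=10)
print(final_shape)
```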

#### [BP-3] Proper Chunking
   - Choose chunk sizes that match the typical access pattern to minimize I/O
   - Find a balance that works for your specific use case by testing different chunk sizes.

In [58]:
chunk_sizes = [(10, 10), (1000, 10), (10000, 10)]
for chunk_size in chunk_sizes:
    filename = f'sample_data_chunk_{chunk_size[0]}x{chunk_size[1]}.h5'
    print(f"Writing {filename}")
    %timeit h5_write(filename, large_data, chunk_size = chunk_size)

Writing sample_data_chunk_10x10.h5
117 ms ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Writing sample_data_chunk_1000x10.h5
54.3 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Writing sample_data_chunk_10000x10.h5
56.1 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
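
Chunking also matters on the read side. As a rough sketch, assuming h5py 3.x (which provides `Dataset.iter_chunks()`), a dataset can be processed one chunk at a time so that each read lines up with the stored layout. The file name and array below are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")
data = np.arange(10_000 * 10, dtype="float64").reshape(10_000, 10)

with h5py.File(path, "w") as f:
    f.create_dataset("dataset", data=data, chunks=(1000, 10))

n_chunks = 0
total = 0.0
with h5py.File(path, "r") as f:
    dset = f["dataset"]
    chunk_shape = dset.chunks              # the chunk shape stored in the file
    for chunk_slices in dset.iter_chunks():
        block = dset[chunk_slices]         # reads exactly one stored chunk
        total += block.sum()
        n_chunks += 1

print(chunk_shape, n_chunks, total)
```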


#### [BP-4] Enable Compression
   - Use compression filters like `gzip`, `lzf`, or `szip` to reduce file size. `gzip` is a good default choice that balances speed and compression ratio.
   - Compression is more effective on data with redundancy (e.g. sparse matrices, repeated values)
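
The effect of compression on redundant data can be checked with a small sketch like the one below. The mostly-zero array is a hypothetical stand-in for sparse data, `write_with` is an illustrative helper, and the gzip level of 4 is an arbitrary choice.

```python
import os
import tempfile

import h5py
import numpy as np

def write_with(filename, data, **kwargs):
    """Write data as a chunked dataset with the given filter options; return file size."""
    with h5py.File(filename, "w") as f:
        f.create_dataset("dataset", data=data, chunks=(1000, data.shape[1]), **kwargs)
    return os.path.getsize(filename)

# Hypothetical redundant data: only every 100th row is non-zero.
rng = np.random.default_rng(0)
sparse_like = np.zeros((100_000, 10))
sparse_like[::100] = rng.random((1000, 10))

tmp = tempfile.mkdtemp()
sizes = {
    "none": write_with(os.path.join(tmp, "raw.h5"), sparse_like),
    "gzip": write_with(os.path.join(tmp, "gzip.h5"), sparse_like,
                       compression="gzip", compression_opts=4),
    "lzf": write_with(os.path.join(tmp, "lzf.h5"), sparse_like,
                      compression="lzf"),
}
for name, size in sizes.items():
    print(f"{name}: {size / 1024**2:.2f} MB")
```

On dense random data such as `large_data` the savings would likely be much smaller, so it is worth measuring on representative samples.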

#### [BP-5] Optimize Access Patterns
   - Sequential access is faster than random access. Structure your data and access patterns to read and write data sequentially whenever possible.
   - Store related data in the same chunk or group to reduce the number of I/O operations needed to access the data.
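
To make the access-pattern point concrete, the sketch below reads the same rows twice: once as a single contiguous slice and once row by row. Both return identical data, but the second issues many small reads. The file name, sizes and timings are illustrative assumptions, and actual timings depend on hardware and caching.

```python
import os
import tempfile
import time

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "access_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset", data=np.random.rand(10_000, 10), chunks=(1000, 10))

with h5py.File(path, "r") as f:
    dset = f["dataset"]

    t0 = time.perf_counter()
    block = dset[2000:2200]                    # one contiguous read
    t_block = time.perf_counter() - t0

    t0 = time.perf_counter()
    rows = np.stack([dset[i] for i in range(2000, 2200)])  # 200 separate reads
    t_rows = time.perf_counter() - t0

same_data = np.array_equal(block, rows)
print(f"slice: {t_block * 1e3:.2f} ms, row by row: {t_rows * 1e3:.2f} ms")
```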

### Common Mistakes
   - Assuming default settings are optimal. Always evaluate and adjust.
   - Large or complex metadata stored as attributes can degrade performance. 
   - Compression reduces file size but can add overhead to I/O operations. It’s important to test if the benefits outweigh the costs in terms of performance.
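
For the metadata point, one common arrangement is to keep attributes small and move bulky auxiliary data into its own dataset. The sketch below assumes hypothetical names (`units`, `sample_rate_hz`, `calibration_table`) purely for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "metadata_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset", data=np.random.rand(1000, 10))

    # Small, descriptive metadata fits well in attributes.
    dset.attrs["units"] = "volts"
    dset.attrs["sample_rate_hz"] = 1000.0

    # Bulky auxiliary data is stored as a separate dataset instead of an attribute.
    f.create_dataset("calibration_table", data=np.linspace(0.0, 1.0, 100_000))

with h5py.File(path, "r") as f:
    units = f["dataset"].attrs["units"]
    sample_rate = float(f["dataset"].attrs["sample_rate_hz"])
    calibration_len = f["calibration_table"].shape[0]

print(units, sample_rate, calibration_len)
```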

### Conclusion
   - HDF5 is ideal for data-intensive AI and engineering applications.
   - HDF5 is effective for handling large, complex datasets.
   - Proper data type selection, incremental storage, and chunking are essential.
   - Optimization ensures efficiency in speed, memory, and storage.

## Thank You
https://github.com/x0urc3/talks/pyconmy2024