# How to create an HDF5 Dataset
Hey guys! As already mentioned in the discussions <br>
( https://www.kaggle.com/competitions/google-research-identify-contrails-reduce-global-warming/discussion/409401 ),<br>
**HDF5** is a great data format for managing huge datasets like this one and getting fast read access.

That's why I wrote some code to convert the dataset to HDF5. As I couldn't find a way to upload datasets of this size (300+ GB) to Kaggle, this notebook only processes the first 2500 instances as an example. I'm currently working on finding a nice way to share the full HDF5 file.
<br><br>
I also want to mention that this is V1, which means **it is only a prototype**. I'm not an HDF5 expert myself, and I will constantly improve this version. So if you have any advice, **feel free to help!!!**
Things I want to implement soon include parallel processing, basic preprocessing, and a better dataloader.
<br><br>
To demonstrate the performance speedup, I used this PyTorch dataloader: <br>
https://www.kaggle.com/code/thomasrochefort/pytorch-dataloader-example
<br>So **check out this notebook** for more details about this part!!


### Versions:
**V1:**
- 2x faster with num_workers = 1
- Problems with num_workers = 4: almost no speedup (the bottleneck is probably opening the HDF5 file)

**V2:**
- dtype float16 is now used, since we don't lose much information, as mentioned in the discussions
- The HDF5 file is no longer opened in `__getitem__` but in the constructor
- Much faster with num_workers = 1
- 1.5x faster with num_workers = 4
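
To get a feeling for how much the float16 cast from V2 actually costs, here is a minimal sketch that round-trips some values through float16 and measures the worst-case error. It uses random stand-in data, and the value range of 180 to 330 (roughly what brightness temperatures in Kelvin would look like) is my assumption, not something taken from the dataset itself.

```python
import numpy as np

# Assumption: band values lie roughly between 180 and 330 (stand-in data only).
rng = np.random.default_rng(0)
band = rng.uniform(180.0, 330.0, size=(256, 256)).astype(np.float32)

# Round-trip through float16 and measure the worst-case absolute error.
roundtrip = band.astype(np.float16).astype(np.float32)
max_abs_err = float(np.abs(band - roundtrip).max())

print(f"max absolute error after float16 round-trip: {max_abs_err:.4f}")
```

Between 256 and 512, float16 values are spaced 0.25 apart, so the rounding error in that range can be at most 0.125. Whether that is negligible depends on the model, so it is worth running the same check on a few real band files.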

In [2]:
%pip install h5py

Collecting h5py
  Downloading h5py-3.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
Installing collected packages: h5py
Successfully installed h5py-3.9.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Imports
import numpy as np
import pandas as pd
import os
import json
import h5py
from tqdm import tqdm
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [2]:
BASE_DIR = "/kaggle/input/google-research-identify-contrails-reduce-global-warming"
HDF_DIR = "/kaggle/working/dataset_train/hdf5/"

In [3]:
train_ids = os.listdir(BASE_DIR +'/'+ "train")
val_ids = os.listdir(BASE_DIR +'/'+ "validation")
test_ids = os.listdir(BASE_DIR +'/'+ "test")

print(f"num train samples: \t\t{len(train_ids)}")
print(f"num validation samples: \t{len(val_ids)}")
print(f"num test samples: \t\t{len(test_ids)}")

num train samples: 		20529
num validation samples: 	1856
num test samples: 		2


**Note:** Every instance stores one array per band (band_08 to band_16), and training instances also store two arrays for the "ground truth". <br>
Validation instances don't have human_individual_masks.npy, and test instances obviously don't have any ground truth arrays:

In [5]:
def saveToHDF5(id_, type_, group):
    if type_ not in ["train", "validation", "test"]:
        raise ValueError("type_ has to be one of ['train', 'validation', 'test']")
    
    instance = group.create_group(id_)
    path = BASE_DIR + f"/{type_}/" + id_
    
    bands_data = []
    for i in range(8, 17):
        band = f"band_{str(i).zfill(2)}"
        array = np.load(f"{path}/{band}.npy")
        bands_data.append(array)
    bands_data = np.stack(bands_data, axis=-1).astype(np.float16) 
    instance.create_dataset("bands_data", data=bands_data)
    
    # Only train and validation have human_pixel_masks
    if type_ in ["train", "validation"]:
        agg_masks = np.load(path + "/human_pixel_masks.npy").astype(np.float16)
        instance.create_dataset("human_pixel_masks", data=agg_masks)
    if type_ == "train":
        ind_masks = np.load(path + "/human_individual_masks.npy").astype(np.float16)
        instance.create_dataset("human_individual_masks", data = ind_masks)
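
Since the full file ends up at 300+ GB, one thing that might be worth trying is h5py's built-in gzip compression, which `create_dataset` supports through its `compression` argument. The sketch below is only an illustration on a made-up, mostly-zero mask (the shape and content are assumptions), and I have not measured how much it helps on the real data or what it costs in read speed.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up mask: mostly zeros with a small labelled region (an assumption).
mask = np.zeros((256, 256, 1), dtype=np.float16)
mask[100:110, 50:200, 0] = 1

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.hdf5")
gzip_path = os.path.join(tmp_dir, "gzip.hdf5")

# Same call as in saveToHDF5: no compression.
with h5py.File(plain_path, "w") as f:
    f.create_dataset("human_pixel_masks", data=mask)

# compression="gzip" switches the dataset to chunked, compressed storage.
with h5py.File(gzip_path, "w") as f:
    f.create_dataset("human_pixel_masks", data=mask, compression="gzip")

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes")

# Reading works exactly as before; decompression happens transparently.
with h5py.File(gzip_path, "r") as f:
    restored = f["human_pixel_masks"][()]
```

Compression trades disk space for CPU time on every read, so it could eat into the speedup measured below. The band data is much less sparse than the masks and will probably compress less well.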

In [35]:
# Create the HDF5 file (make sure the output directory exists first)
os.makedirs(HDF_DIR, exist_ok=True)
with h5py.File(os.path.join(HDF_DIR, "Google_Contrails.hdf5"), "w") as f:
    
    print("Processing Training Data...")
    train = f.create_group("train")
    for train_id in tqdm(train_ids):
        saveToHDF5(train_id, "train", train)
        
    print("\nProcessing Validation Data...")
    validation = f.create_group("validation")
    for val_id in tqdm(val_ids):
        saveToHDF5(val_id, "validation", validation)
    
    print("\nProcessing Test Data...")
    test = f.create_group("test")
    for test_id in tqdm(test_ids):
        saveToHDF5(test_id, "test", test)
        
    print("\nFinished all jobs!")  

Processing Training Data...


100%|██████████| 20529/20529 [54:14<00:00,  6.31it/s]  



Processing Validation Data...


100%|██████████| 1856/1856 [04:47<00:00,  6.46it/s]



Processing Test Data...


100%|██████████| 2/2 [00:00<00:00,  6.60it/s]


Finished all jobs!
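
To double-check what ended up in the file, you can walk it with `visititems`. The sketch below builds a tiny stand-in file with the same group layout as `saveToHDF5` (split, then record ID, then the datasets) so that it runs on its own. The record ID `"0001"` and the small array shapes are made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Tiny stand-in file with the same layout as the real one (made-up ID and shapes).
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(tmp_path, "w") as f:
    rec = f.create_group("train").create_group("0001")
    rec.create_dataset("bands_data", data=np.zeros((4, 4, 8, 9), dtype=np.float16))
    rec.create_dataset("human_pixel_masks", data=np.zeros((4, 4, 1), dtype=np.float16))

layout = []

def describe(name, obj):
    # visititems calls this for every group and dataset below the root.
    if isinstance(obj, h5py.Dataset):
        layout.append((name, obj.shape, str(obj.dtype)))

with h5py.File(tmp_path, "r") as f:
    f.visititems(describe)

for name, shape, dtype in layout:
    print(name, shape, dtype)
```

On the real file, point `h5py.File` at `Google_Contrails.hdf5` instead. Walking all 20k+ records prints a lot, so you may want to visit a single record group rather than the root.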





## Now let's compare the speedup!
As mentioned earlier, I measure the speedup by iterating over each dataloader once. I used this dataloader:<br>
https://www.kaggle.com/code/thomasrochefort/pytorch-dataloader-example 
<br>
For the HDF5 dataloader, I had to make some small adjustments.

### 1) No HDF5

In [36]:
class ContrailDataset(Dataset):
    def __init__(self, base_dir, data_type='train', transform=None):
        assert data_type in ['train', 'validation', 'test'], \
            "'data_type' should be one of 'train', 'validation', or 'test'"

        self.base_dir = base_dir
        self.data_type = data_type
        self.transform = transform
        self.record = os.listdir(self.base_dir +'/'+ self.data_type)

    def __len__(self):
        return len(self.record)

    def __getitem__(self, idx):
        record_id = self.record[idx]
        record_dir = os.path.join(self.base_dir, self.data_type, record_id)

        bands_data = []
        for i in range(8, 17):
            band_file = os.path.join(record_dir, f'band_{str(i).zfill(2)}.npy')
            band_data = np.load(band_file)
            bands_data.append(band_data)
        bands_data = np.stack(bands_data, axis=-1)

        if self.data_type in ['train', 'validation']:
            pixel_masks_file = os.path.join(record_dir, 'human_pixel_masks.npy')
            pixel_masks = np.load(pixel_masks_file)
        else:
            pixel_masks = None
        return bands_data, pixel_masks


def get_dataloader(base_dir, data_type, batch_size, transform=None):
    dataset = ContrailDataset(base_dir, data_type=data_type, transform=transform)
    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers = 4)
    return dataloader

In [37]:
train_dataloader = get_dataloader(BASE_DIR, 'train', batch_size=16)

In [38]:
for bands, masks in tqdm(train_dataloader):
    continue

 17%|█▋        | 212/1284 [02:01<10:13,  1.75it/s]


KeyboardInterrupt: 

### 2) With HDF5

In [10]:
class ContrailDatasetHDF5(Dataset):
    def __init__(self, hdf5_file, data_type='train', transform=None):
        assert data_type in ['train', 'validation', 'test'], \
            "'data_type' should be one of 'train', 'validation', or 'test'"

        self.hdf5_file = hdf5_file
        self.data_type = data_type
        self.transform = transform
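        # Keep the file handle on the instance so it stays open for __getitem__
        self.file = h5py.File(hdf5_file, "r")
        self.group = self.file[self.data_type]
        # Only the first 2000 records are used for this timing demo
        self.records = list(self.group.keys())[:2000]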

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record_id = self.records[idx]
        
        bands = self.group[f"{record_id}/bands_data"][:,:,:,:]
        # print(bands.shape)
        if self.data_type in ['train', 'validation']:
            pixel_masks = self.group[f"{record_id}/human_pixel_masks"][()]
        else: 
            pixel_masks = None
        return bands, pixel_masks
    

def get_dataloader_hdf5(path, data_type, batch_size, transform=None):
    dataset = ContrailDatasetHDF5(path, data_type=data_type, transform=transform)
    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=1)
    return dataloader
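
The class above opens the file once in the constructor, so with `num_workers > 1` every worker process inherits the same handle. A pattern that is often suggested for h5py with multiple workers is to open the file lazily on first access, so that each worker ends up with its own handle. The sketch below shows that idea. `LazyHDF5Dataset` is a hypothetical name, it only covers splits that have `human_pixel_masks`, and I have not benchmarked it on this dataset.

```python
import os
import tempfile

import h5py
import numpy as np
from torch.utils.data import Dataset


class LazyHDF5Dataset(Dataset):
    """Hypothetical variant: opens the file on first access, not in __init__."""

    def __init__(self, hdf5_file, data_type="train"):
        self.hdf5_file = hdf5_file
        self.data_type = data_type
        self._file = None  # opened lazily, once per process
        # Read the record IDs with a short-lived handle that is closed again.
        with h5py.File(hdf5_file, "r") as f:
            self.records = list(f[data_type].keys())

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        if self._file is None:
            # The first call in each process opens that process's own handle.
            self._file = h5py.File(self.hdf5_file, "r")
        record = self._file[self.data_type][self.records[idx]]
        return record["bands_data"][()], record["human_pixel_masks"][()]


# Tiny made-up file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as f:
    rec = f.create_group("train").create_group("0001")
    rec.create_dataset("bands_data", data=np.ones((4, 4, 8, 9), dtype=np.float16))
    rec.create_dataset("human_pixel_masks", data=np.zeros((4, 4, 1), dtype=np.float16))

dataset = LazyHDF5Dataset(demo_path)
bands, masks = dataset[0]
print(len(dataset), bands.shape, masks.shape)
```

It can be passed to `DataLoader` exactly like `ContrailDatasetHDF5`. The open cost is paid once per worker instead of once per item (as in V1), which might help with the num_workers = 4 case.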

In [9]:
path = HDF_DIR + "/Google_Contrails.hdf5"
train_dataloader_hdf5 = get_dataloader_hdf5(path, 'train', batch_size=1)
for bands, masks in tqdm(train_dataloader_hdf5):
    continue
        # break

100%|██████████| 2000/2000 [00:56<00:00, 35.71it/s]


In [11]:
path = HDF_DIR + "/Google_Contrails.hdf5"
train_dataloader_hdf5 = get_dataloader_hdf5(path, 'train', batch_size=1)
for bands, masks in tqdm(train_dataloader_hdf5):
    continue
        # break

100%|██████████| 2000/2000 [00:44<00:00, 44.94it/s]


### Results
As you can see, the speedup is roughly a factor of 1.5. I'm confident we can do much better, and there are many things to improve. But the HDF5 file isn't just faster to read: we can also store preprocessed data in it to save even more time, and the bands are already concatenated.
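
For reference, here is the arithmetic behind that factor, using the tqdm rates printed above. The two loaders ran with different batch sizes, so the rates have to be converted to samples per second first. Keep in mind that this is only a rough comparison: the .npy run was interrupted at 17% and used 4 workers, while the HDF5 runs used 1 worker and only 2000 records.

```python
# Numbers copied from the tqdm outputs in this notebook.
npy_batches_per_s = 1.75   # "No HDF5" run: batch_size=16, num_workers=4
npy_batch_size = 16
hdf5_batches_per_s = {"In [9]": 35.71, "In [11]": 44.94}  # batch_size=1, num_workers=1
hdf5_batch_size = 1

# Convert batches/s to samples/s so both loaders are on the same scale.
npy_samples_per_s = npy_batches_per_s * npy_batch_size

speedups = {
    cell: rate * hdf5_batch_size / npy_samples_per_s
    for cell, rate in hdf5_batches_per_s.items()
}

print(f".npy loader: {npy_samples_per_s:.1f} samples/s")
for cell, factor in speedups.items():
    print(f"HDF5 run {cell}: {factor:.2f}x")
```

The two HDF5 runs come out at about 1.3x and 1.6x the .npy loader.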
<br><br>
It would be great if somebody could tell me whether there is a way to upload the full HDF5 file (300+ GB) to Kaggle. Otherwise, I will share a link to my Google Drive. <br><br><br>
PS: If you want to save the dataset folder of this notebook (the `HDF_DIR` used above), use: <br>
!kaggle datasets create -p "/kaggle/working/dataset_train/hdf5/" --dir-mode zip