In [None]:
import os
from pathlib import Path
import cv2 
import h5py
import numpy as np
from tqdm import tqdm
import random
import sys

# Dataset generation

## First step: Extracting images from the .h5 file

### Importing the function

In [None]:
sys.path.append("..")  # Temporary workaround so the parent folder's `dataset` package can be imported (TODO: fix it)
from dataset.extract_images_from_h5 import extract_images_from_h5

### Defining the paths

A `.h5` file containing a U-Net example is provided in this example (`UNET_dataset.h5`).

The following U-Net dataset was created using the image annotation tool available [here](https://github.com/sinmec/multilabellerg).

The current implementation can read multiple `.h5` files: simply place all the dataset files in the same `h5_dataset_path` folder.

The images are extracted to the `output_path` folder.
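
To check which files would be picked up from `h5_dataset_path`, a listing like the one below can help. This is only a sketch: `list_h5_files` is a hypothetical helper, and the non-recursive `*.h5` pattern is an assumption about how the files are discovered, not the actual logic inside `extract_images_from_h5`.

```python
from pathlib import Path


def list_h5_files(folder):
    """Return the .h5 files located directly inside `folder`, sorted by name.

    Hypothetical helper: the real discovery logic may differ.
    """
    return sorted(Path(folder).glob("*.h5"))


# List the candidate dataset files in the current working directory
for h5_file in list_h5_files(Path.cwd()):
    print(h5_file.name)
```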

In [None]:
h5_dataset_path = Path(os.getcwd())
output_path = Path(r"dataset_UNET")


### Running the function

In [None]:
extract_images_from_h5(h5_dataset_path, output_path)

Running this function extracts the raw images (`imgs_full`) and masks (`masks_full`) from the `.h5` file to the `output_path` folder.

## Second step: Generating the dataset

### Importing the function

In [None]:
from dataset.create_dataset import create_dataset

### Defining parameters

In [None]:
# Side length (in pixels) of the square sub-images
sub_image_size = 128

# Number of random samples drawn from each full image
random_samples = 64

# Number of full images reserved for validation
N_VALIDATION = 2

# Number of full images reserved for verification
N_VERIFICATION = 2

# Path to the dataset
path = output_path

The U-Net is not trained on the full images but on sub-images cropped from them. In this example, the "full" images have a `(252 x 1024)` shape, while the sub-images, whose size is set by the `sub_image_size` variable, have a size of `(128 x 128)`.

The dataset samples are created in a stochastic manner. First, a random number generator picks a random point in the image. From this point, a square sub-image with side `sub_image_size` is defined. The corresponding area is then extracted from both the raw image and the mask, creating a sample (sub-image and mask). This process is repeated `random_samples` times for each full image.
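
The sampling procedure described above can be sketched as follows. This is a minimal illustration and not the code inside `create_dataset`: `sample_sub_images` is a hypothetical function, and it assumes that the random point is the top-left corner of the sub-image and that it is drawn so the sub-image lies fully inside the full image.

```python
import numpy as np


def sample_sub_images(image, mask, sub_image_size, random_samples, rng=None):
    """Crop `random_samples` random square patches from an image/mask pair.

    Assumption: the random point is the top-left corner of the patch and is
    drawn so that the patch lies fully inside the image.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height < sub_image_size or width < sub_image_size:
        raise ValueError("sub_image_size is larger than the image")

    samples = []
    for _ in range(random_samples):
        # The upper bound of `integers` is exclusive, hence the + 1
        top = int(rng.integers(0, height - sub_image_size + 1))
        left = int(rng.integers(0, width - sub_image_size + 1))
        rows = slice(top, top + sub_image_size)
        cols = slice(left, left + sub_image_size)
        # The same window is applied to both arrays so they stay aligned
        samples.append((image[rows, cols], mask[rows, cols]))
    return samples


# Dummy data with the full-image shape used in this example
image = np.zeros((252, 1024), dtype=np.uint8)
mask = np.zeros((252, 1024), dtype=np.uint8)
samples = sample_sub_images(image, mask, sub_image_size=128, random_samples=64)
print(len(samples), samples[0][0].shape)
```

Passing a seeded generator (for example `np.random.default_rng(0)`) as `rng` makes the sampled positions reproducible.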


### Running the function

In [None]:
create_dataset(path, sub_image_size, random_samples, N_VALIDATION, N_VERIFICATION)

Running this function creates three new folders in the chosen path:
 - Training: Samples used during the U-Net training
 - Validation: Samples used for validation purposes during training (taken from a total of `N_VALIDATION` full images)
 - Verification: Samples used to evaluate the U-Net accuracy after the training step. This data is not seen during training (taken from a total of `N_VERIFICATION` full images)
 
Each folder contains 4 folders, including:
 - `images`: Sub-images with size (`sub_image_size x sub_image_size`) extracted from the raw images
 - `masks`: Sub-images with size (`sub_image_size x sub_image_size`) extracted from the labelled masks
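
A quick way to inspect the generated folders is to count the files in each one. The sketch below assumes the folder names listed above and one file per sample; `count_samples` is a hypothetical helper that is not part of the repository.

```python
from pathlib import Path


def count_samples(dataset_path):
    """Count the files in the images/ and masks/ folder of each split.

    Assumes the layout described above and one file per sample.
    """
    counts = {}
    for split in ("Training", "Validation", "Verification"):
        for kind in ("images", "masks"):
            folder = Path(dataset_path) / split / kind
            # A missing folder is reported as zero instead of raising
            if folder.is_dir():
                n_files = sum(1 for p in folder.iterdir() if p.is_file())
            else:
                n_files = 0
            counts[(split, kind)] = n_files
    return counts


for (split, kind), n_files in count_samples("dataset_UNET").items():
    print(f"{split}/{kind}: {n_files} files")
```

Under these assumptions, the `images` and `masks` counts of a split should match. With the parameters used here, one would expect the Validation split to hold `N_VALIDATION * random_samples = 2 * 64 = 128` samples.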

The folder created in this step should be sent to the `UNET/dataset` folder for training.
