# Exporting SETI - BL competition data to TFRecord

I will export the [seti-breakthrough-listen competition's dataset](https://www.kaggle.com/c/seti-breakthrough-listen) to TFRecord format for faster data handling and to enable [TPU](https://www.kaggle.com/docs/tpu) processing. I will preprocess the data before saving it, in order to speed up later model training. To try different preprocessing routines, I will commit a new version of this notebook each time I update the preprocessing function, and the dataset will be updated accordingly.

I will use a simple tool that I have developed for this task called [TFRecord Dataset](https://github.com/ChusJM/tfrecord_dataset). This tool was inspired by some very interesting work I found here on Kaggle; please read the [acknowledgements section](https://github.com/ChusJM/tfrecord_dataset#acknowledgements) on the repository page for further information.

_Disclaimer_: To avoid distributing data outside the competition scope, which might be against the competition rules (I'm not sure), I've kept the generated dataset private, but you can create your own copy by running this notebook. This way, you can also include your custom preprocessing routine by editing the notebook before running it.

## Issues

The dataset is saved and later loaded correctly, but I have not been able to use it to train models on TPUs, although everything works fine on GPU. I do not know whether the problem lies in the dataset or in the TPU training code itself.

## Preparing Kaggle Dataset

Before doing anything with the data, it is necessary to initialize the Kaggle Dataset where the data will be uploaded. Thanks to [xhlulu](https://www.kaggle.com/xhlulu) for these notebooks ([train](https://www.kaggle.com/xhlulu/seti-create-training-tf-records), [test](https://www.kaggle.com/xhlulu/seti-create-test-tf-records)), which serve as a great explanation of the process.

First, we initialize the local environment to work with Kaggle's API and the local directory with the needed metadata. For this snippet to work, you must add your own Kaggle credentials as *Secrets* in the notebook (see this [post](https://www.kaggle.com/product-feedback/114053) for more information).

In [None]:
import os
from kaggle_secrets import UserSecretsClient

secrets = UserSecretsClient()
os.environ['KAGGLE_USERNAME'] = secrets.get_secret("KAGGLE_USERNAME")
os.environ['KAGGLE_KEY'] = secrets.get_secret("KAGGLE_KEY")

Next, we create the local directory where the files will be saved. The directory is initialized with a metadata `*.json` file.

In [None]:
import json
from pathlib import Path


OUTPUT_DATASET_PATH = Path('/kaggle/dataset')
DATASET_NAME = 'seti-breaktrough-listen-preprocessed'
SUBSET_NAME = 'train'
DATASET_TITLE = 'SETI Breakthrough Listen - Preprocessed train set'
# Create the output directory if it does not exist yet.
OUTPUT_DATASET_PATH.mkdir(parents=True, exist_ok=True)

meta = dict(
    id=f"chusjm/{DATASET_NAME}-{SUBSET_NAME}",
    title=DATASET_TITLE,
    isPrivate=True,
    licenses=[
        {"name": "copyright-authors"}
    ]
)

with open(OUTPUT_DATASET_PATH / 'dataset-metadata.json', 'w') as f:
    json.dump(meta, f)

The Kaggle Dataset itself must also be created, but only the **first time** the notebook is run. To do so, uncomment the following lines.

In [None]:
#!touch /kaggle/dataset/dummy.txt
#!kaggle datasets create -p "/kaggle/dataset" --dir-mode zip
#!rm /kaggle/dataset/dummy.txt

## Getting the tool

This tool will allow us to handle all dataset files, split the dataset into even partitions and export the chunks to `*.tfrecord` format.

In [None]:
!git clone https://github.com/ChusJM/tfrecord_dataset.git

## Loading dataset structure

In this dataset, each ``*.npy`` file contains one data sample (_cadence snippet_), and there is a `train_labels.csv` file that contains the correspondence between each file name (without extension) and its label (``0`` or ``1``).

The first part of the tool will allow me to list all dataset files and read the labels.

In [None]:
import tfrecord_dataset.tfrecord_dataset.dataset as dataset
import tfrecord_dataset.tfrecord_dataset.tfrecords as tfrecords
from pathlib import Path

# The dataset root directory contains several subdirectories with *.npy files, one per example (cadence snippet).
DATASET_PATH = Path('../input/seti-breakthrough-listen/')
# Labels are integers: 1 when the cadence is labeled as positive (E.T. signal) and 0 when the cadence is labeled as negative.
label_type = int
# Used to ignore the first row of the CSV, which is just the names of the columns.
labels_file_has_header = True
# Load dataset structure.
seti_dataset = dataset.Dataset(
    train_set_dir=str(DATASET_PATH / 'train'),  # Train subdirectory
    test_set_dir=str(DATASET_PATH / 'test'),    # Test subdirectory
    file_format='*.npy',
    train_labels_file=str(DATASET_PATH / 'train_labels.csv'),
    label_type=label_type,
    train_labels_file_has_header=labels_file_has_header
)

## Exporting to TFRecord

To create the TFRecord dataset, it is necessary to distribute the original data into chunks (splits) of equal size, each of which will be saved in a different TFRecord file. Thus, the content of several ``*.npy`` files will go together in one TFRecord. The best way to do this, as it is done in [awsaf49's](https://www.kaggle.com/awsaf49) [notebook](https://www.kaggle.com/awsaf49/seti-bl-256x256-tfrec-data/), is to distribute the ``*.npy`` files at random, but **ensuring** that the original label distribution is kept in each chunk. This way, the TFRecord dataset can later be split into further subsets (train and validation, K-folds, etc.) just by distributing the TFRecord files.

Once the dataset structure has been loaded, the tool will help me split the dataset and iterate through the list of partitions in order to save them to `*.tfrecord` files.
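
To make the idea of label-preserving chunks concrete, here is a minimal NumPy sketch. It is not the tool's implementation: `stratified_chunks` and the toy labels are hypothetical names and data used only for illustration. Each class is shuffled and dealt out round-robin, so every chunk receives roughly the same share of each label.

```python
import numpy as np


def stratified_chunks(labels, n_chunks, seed=42):
    """Assign shuffled example indices to n_chunks, keeping the label ratio in each chunk."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    chunks = [[] for _ in range(n_chunks)]
    for label in np.unique(labels):
        # Shuffle the indices of this class, then deal them out round-robin.
        indices = rng.permutation(np.flatnonzero(labels == label))
        for position, index in enumerate(indices):
            chunks[position % n_chunks].append(int(index))
    # Shuffle inside each chunk so that classes are not grouped together.
    return [rng.permutation(chunk) for chunk in chunks]


# Toy labels: 90 negatives and 10 positives (not the competition's real ratio).
toy_labels = np.array([0] * 90 + [1] * 10)
toy_chunks = stratified_chunks(toy_labels, n_chunks=5)
chunk_sizes = [len(chunk) for chunk in toy_chunks]
positives_per_chunk = [int(toy_labels[chunk].sum()) for chunk in toy_chunks]
print(chunk_sizes, positives_per_chunk)
```

With these toy labels, every chunk ends up with 20 examples, 2 of which are positive, which is the same 10% ratio as the full set.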

### Custom loading and preprocessing function

Before proceeding with the export, it is necessary to define a function that will load and preprocess the data. The tool includes a basic one for `*.npy` files, but more preprocessing steps are needed to try to improve model performance for the competition.

In [None]:
import cv2
import numpy as np

# Define output shape for the data. This one is selected to match ImageNet, since most
# models are pretrained using this dataset and thus they have this shape.
OUTPUT_SHAPE = (224, 224)


def custom_npy_data_preprocessor(example_path):
    """
    Loads and preprocesses an example, given its path. The example must be a `*.npy` file that contains an array
    that can (and will) be cast to float32. It also returns the data type and the shape before flattening the array,
    to allow serialization and later reconstruction, and a string to serve as example identifier (e.g. the file name).
    :param example_path: Path to the `*.npy` file.
    :type example_path: str
    :return: Flattened numpy array of type np.float32 (it must be compatible with a list of float), data type (to allow
        for serialization and later parsing), data shape before flattening (to allow for unflattening), and example ID.
    :rtype: tuple[numpy.ndarray, type, tuple[int], str]
    """
    # Load cadence snippet.
    data = np.load(example_path).astype(np.float32)
    # Normalize each observation of the cadence. Observations are placed along the first axis.
    # The normalization maps the input range to [0.0, 255.0]. This is needed to compute the
    # equalization correctly.
    data = np.array([_normalize(obs) for obs in data])
    # Equalize each observation of the cadence
    data = np.array([_equalize(obs) for obs in data])
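    # Transpose from (OBS, H, W) to (H, W, OBS).
    data = data.transpose((1, 2, 0))
    # Resize to OUTPUT_SHAPE. The interpolation flag must be passed by keyword:
    # the third positional argument of cv2.resize is `dst`, not `interpolation`.
    data = cv2.resize(data, OUTPUT_SHAPE, interpolation=cv2.INTER_AREA)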
    
    return data.ravel(), data.dtype, data.shape, Path(example_path).stem

def _normalize(input_array):
    """
    Normalizes a 2D array linearly using its maximum and minimum values to map its range to [0.0, 255.0]
    
    :param input_array: Input 2D array of type float.
    :type input_array: numpy.ndarray
    :return: Normalized array of type float in the range [0.0, 255.0]
    :rtype: numpy.ndarray
    """
    # Normalize and map range to [0.0, 255.0]
    min_value, max_value = input_array.min(), input_array.max()
    input_array = (input_array - min_value) / (max_value - min_value) * 255.0
    
    return input_array

def _equalize(input_array):
    """
    Equalizes a 2D array using Contrast Limited Adaptive Histogram Equalization (CLAHE).
    This enhances the local contrast in regions of the image.
    
    :param input_array: Input 2D array to be equalized, of type float. It will be cast to byte values, so its data
        must be in the range [0, 255] to minimize rounding loss and to avoid clipping.
    :type input_array: numpy.ndarray
    :return: Equalized version of the same 2D array, cast back to float values (single precision).
    :rtype: numpy.ndarray
    """
    # Cast to integer.
    input_array = np.round(input_array).astype(np.uint8)
    # Apply CLAHE
    clahe = cv2.createCLAHE(clipLimit=4)
    input_array = clahe.apply(input_array)
    # Back to float
    input_array = input_array.astype(np.float32)

    return input_array
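
The preprocessor returns the flattened values together with their data type and original shape, so that a reader can rebuild the array after parsing. The following is a small sketch of that round trip using only NumPy; the 4x4x6 array is a stand-in for a real preprocessed cadence, and the conversion through a Python list only imitates storing the values as a list of floats.

```python
import numpy as np

# Stand-in for a preprocessed cadence: height x width x 6 observations.
preprocessed = np.arange(4 * 4 * 6, dtype=np.float32).reshape(4, 4, 6)

# What the preprocessor hands over: flat values plus the metadata needed to rebuild them.
flat_values = preprocessed.ravel()
data_type = preprocessed.dtype
data_shape = preprocessed.shape

# What a reader has to do after parsing: restore the dtype and the shape.
restored = np.asarray(flat_values.tolist(), dtype=data_type).reshape(data_shape)

print(restored.shape, np.array_equal(restored, preprocessed))
```

This is why `data_type` and `data_shape` are kept from the first example during the export: without them, the flat list of floats could not be reshaped when loading.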

### Dataset export

Once the preprocessing function has been defined, we are ready to iterate through the list of chunks and save each chunk to a different TFRecord file.

In [None]:
import tqdm


N_WORKERS = 8  # It seems that Kaggle's TPU support allocates 8 replicas.
N_SPLITS = 10 * N_WORKERS  # Following TensorFlow's recommendations, the number of splits should be ~10*N,
                           # where N is the number of workers, provided each file is large enough
                           # (ideally 100+ MB). In this case, the train and test subsets have tens of GB
                           # of data, so with 8 workers each TFRecord file will be in the order of
                           # hundreds of MB, or even around a thousand MB.
SHUFFLE = True  # To ensure the sample distribution is truly random
RANDOM_STATE = 42  # To ensure reproducibility (the same dataset will be generated across different executions)

data_type = None
data_shape = None

# Create splits after shuffling the examples using the specified seed for the random number generator.
for idx, train_set_split in tqdm.tqdm(
        enumerate(seti_dataset.train_set_in_splits_generator(n_splits=N_SPLITS, shuffle=SHUFFLE, seed=RANDOM_STATE)),
        total=N_SPLITS):
    # Get the data type and shape from the first example, using the specific data preprocessing function.
    if idx == 0:
        _, data_type, data_shape, _ = custom_npy_data_preprocessor(train_set_split[0, 0])
    # Save the split into a tfrecord file.
    tfrecords.write_dataset_to_file(
        dataset=train_set_split,
        file_path=str(OUTPUT_DATASET_PATH / f'{DATASET_NAME}_{SUBSET_NAME}_{idx}.tfrecord'),
        data_preprocessing_function=custom_npy_data_preprocessor
    )
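
As a rough sanity check of the file size reasoning in the comments above, the payload of one shard can be estimated with simple arithmetic. This is only a sketch: `estimated_shard_size_mb` is a hypothetical helper, the example count of 50,000 is a made-up figure for illustration, and the estimate assumes values are stored as 4-byte floats while ignoring labels, IDs and TFRecord framing.

```python
def estimated_shard_size_mb(n_examples, example_shape, bytes_per_value, n_shards):
    """Rough payload size of one shard in MB, ignoring labels, IDs and TFRecord framing."""
    values_per_example = 1
    for dim in example_shape:
        values_per_example *= dim
    total_bytes = n_examples * values_per_example * bytes_per_value
    return total_bytes / n_shards / 1e6


N_EXAMPLES = 50_000  # Hypothetical example count, for illustration only.
shard_mb = estimated_shard_size_mb(
    n_examples=N_EXAMPLES,
    example_shape=(224, 224, 6),  # OUTPUT_SHAPE plus the six observations of a cadence.
    bytes_per_value=4,            # float32
    n_shards=80,                  # 10 * N_WORKERS
)
print(f"{shard_mb:.0f} MB per shard")
```

Under these assumptions each shard holds about 753 MB, which is consistent with the "hundreds of MB" estimate and above the 100 MB target.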

In [None]:
!ls -las "/kaggle/dataset"

## Loading from TFRecord

For demonstration purposes, the code to load the dataset from the generated files is included here.

In [None]:
# Load the train set from the tfrecords.
tf_train_set = tfrecords.load_dataset_from_files(
    file_paths=list(map(str, OUTPUT_DATASET_PATH.glob(f'{DATASET_NAME}_{SUBSET_NAME}_*.tfrecord'))),
    data_shape=data_shape, data_type=data_type, label_type=label_type
)


# Plot an example to check it looks correct.
import matplotlib.pyplot as plt

def plt_cadence(cadence, title):
    cadence_labels = 'ABACAD'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    fig = plt.figure(figsize=(9,9))
    gs = fig.add_gridspec(cadence.shape[0], hspace=0)
    axs = gs.subplots(sharex=True, sharey=True)
    fig.suptitle(title)
    for ax, (label, obs) in zip(axs, zip(cadence_labels, cadence)):
        im = ax.imshow(obs.astype(float), aspect='auto')
        ax.label_outer()
        # place a text box in upper left in axes coords
        ax.text(0.05, 0.95, label, transform=ax.transAxes, fontsize=12,
                verticalalignment='top', bbox=props)
        fig.colorbar(im, ax=ax)
    plt.tight_layout()
    return fig


# Read each recovered example.
for recovered in tf_train_set:
    data, label, example_id = recovered['data'], recovered['label'], recovered['example_id']
    plt_cadence(data.numpy().transpose((2, 0, 1)), f'Example cadence #{example_id} (label: {label})')
    # Show only the first and break the loop. One is enough for demonstration.
    break
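
Since each TFRecord file is meant to keep the original label distribution, a train/validation split or a K-fold split can be made at file level, before any data is read. Below is a minimal sketch of that idea in plain Python. `split_files_into_folds` is a hypothetical helper, and the file names are generated to follow this notebook's naming pattern rather than read from disk.

```python
def split_files_into_folds(file_paths, n_folds):
    """Deal sorted file paths into n_folds groups of (almost) equal size."""
    ordered = sorted(file_paths)
    return [ordered[fold::n_folds] for fold in range(n_folds)]


# Hypothetical shard names following this notebook's naming pattern.
shard_paths = [f"seti-breaktrough-listen-preprocessed_train_{i}.tfrecord" for i in range(80)]
folds = split_files_into_folds(shard_paths, n_folds=5)

# Use the first fold for validation and the remaining ones for training.
validation_files = folds[0]
training_files = [path for fold in folds[1:] for path in fold]
print(len(training_files), len(validation_files))
```

Each list of files can then be passed as `file_paths` to the loading function shown above to build the corresponding subset.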


## Uploading as a Kaggle Dataset

Finally, I will update the dataset with the newly generated files using Kaggle's API. Although the `-d` option tells Kaggle Datasets to delete the previous versions, it seems that the old data must still be removed manually first to avoid reaching the storage limit. Because of that, before running this notebook, it is necessary to go to the dataset's page and upload a new version with all the previous files removed and only an empty "placeholder" `*.txt` file, to make room for the new files that will be uploaded with the next command.

In [None]:
!kaggle datasets version -p "/kaggle/dataset" -m "Updated via notebook" --dir-mode zip -d