# Working with Complex and Scarce Datasets [Notebook 2]

## Introduction

In deep learning applications, data is the most essential component of the development process. Training data should flow into the network unobstructed and should contain information that is meaningful to the model. Before it enters the network, the dataset usually has to be prepared through various transformations.

Today, datasets may come from complex structures or be stored on heterogeneous devices, which makes them harder to handle. All of this assumes that the data is readily available; there are also cases where the relevant training data (images or annotations) is unavailable or scarce.

This project deals with these kinds of cases and explores the framework that TensorFlow provides for setting up optimised data pipelines (the tf.data API).


## Breakdown of this Project:
- Building efficient input pipelines with "tf.data" for extracting and processing data samples of all kinds. (Notebook 1 & 2)
- Augmenting and rendering images to help compensate for the scarcity of training data. (Notebook 3 & 4)
- Different types of Domain Adaptation methods and how they help to train more robust models. (Notebook 5 & 6)
- Creating or generating novel images with generative models such as Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs). (Notebook 7 & 8)

## Requirements:
1. Vispy
2. TensorFlow 2.x
3. NumPy
4. os (Python standard library)
5. tensorflow-datasets

## Dataset:

The dataset can be obtained from http://www.laurencemoroney.com/rock-paper-scissors-dataset/ or from https://www.tensorflow.org/datasets/catalog/rock_paper_scissors

Rock Paper Scissors is a dataset of 2,892 3D-rendered images of different hands in Rock/Paper/Scissors poses. Each image is 300×300 pixels in 24-bit colour.

In [1]:
import os
import tensorflow as tf
import numpy as np

# Run on GPU: make only the first GPU (index 0) visible to TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Set the random seed number, for reproducibility.
Seed_nb = 42

# Flag for the code examples only: 0 = run the code block, 1 = don't run it.
dont_run = 0

## 1 - Generating and Parsing TFRecords:

The previous notebook covered building efficient input pipelines. This notebook (2) looks at __TFRecords__, a file format for persisting datasets with TensorFlow.

## 1.1 - Reminder of the ETL process:

##### The diagram below shows the overall ETL process for computer vision tasks:

<img src="Description Images/ETL_ComputerVision.png" width="750">

Image Ref -> self-made.

## 1.2 - About TFRecords: Transform by Parsing TFRecord Files

Recalling from the previous notebook: a more efficient way of feeding data into the pipeline is the "TFRecord" file format, which stores a large number of images together in one binary file and makes them directly accessible for read-from-disk operations. Iterating through the individual image files, as in the methods covered earlier, is less efficient.

In more detail, TFRecord files are binary files that aggregate data samples such as images, labels and metadata. Each sample is serialised from a "tf.train.Example" instance, which is essentially a dictionary that names each of the data elements (features). For example:
- {'img': image_1, 'label': label_1, ...}
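As a minimal sketch of the structure described above, the snippet below builds one "tf.train.Example" by hand, serialises it to bytes and restores it again. The placeholder image bytes and the label value are assumptions made purely for illustration; the keys 'img' and 'label' follow the example above.

```python
import tensorflow as tf

# Hypothetical sample: placeholder bytes standing in for an encoded image,
# plus an integer class label.
image_bytes = b"placeholder-image-bytes"
label = 1

# Each feature wraps its value in a list of bytes, floats or integers.
example = tf.train.Example(features=tf.train.Features(feature={
    "img": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
}))

# Serialise to the binary string that would be written into a TFRecord file.
serialized = example.SerializeToString()

# Round trip: restore the message and read the features back by name.
restored = tf.train.Example.FromString(serialized)
restored_image = restored.features.feature["img"].bytes_list.value[0]
restored_label = restored.features.feature["label"].int64_list.value[0]
```

Here each value is passed as a one-element list, which is why the restored values are read back with index 0.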

Each element or feature of a sample is an instance of "tf.train.Feature", which stores its value as a list of bytes, floats or integers.
To use TFRecord files in an input pipeline, the records can be read with:
- tf.data.TFRecordDataset(filename)

More examples can be found at: https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset.
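To make the reading side concrete, here is a small sketch that writes a toy TFRecord file to a temporary directory and reads it back with "tf.data.TFRecordDataset". The file name, the feature keys and the toy values are assumptions for illustration only; parsing with "tf.io.parse_single_example" is one common way to decode the records, not necessarily the one used later in this project.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical file location, used only for this illustration.
record_path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")

# Write three toy samples so that there is something to read back.
with tf.io.TFRecordWriter(record_path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "img": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[f"img-{i}".encode()])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Describe the expected features so that the serialised bytes can be parsed.
feature_spec = {
    "img": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    """Parse one serialised tf.train.Example into a dict of tensors."""
    return tf.io.parse_single_example(serialized, feature_spec)

# Read the file and map the parser over every record.
dataset = tf.data.TFRecordDataset(record_path).map(parse_example)
labels = [int(sample["label"].numpy()) for sample in dataset]
```

The feature specification has to match how the records were written, since the TFRecord file itself does not carry a schema.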

## 2 - Writing TFRecords:

As mentioned in the section above, a TFRecord file can store the complete training dataset as one binary archive, so that it can be reused and iterated over efficiently. The following demonstrates the use of the "tf.data" and "tf.train" APIs to process a chosen dataset into a TFRecord file.
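Before turning to the real dataset, the general idea can be sketched on synthetic data: iterate over a "tf.data.Dataset" of (image, label) pairs, encode each image, wrap it in a "tf.train.Example" and write it with "tf.io.TFRecordWriter". The random 8×8 images, the PNG encoding, the helper name `to_serialized_example` and the output file name are all assumptions made for this illustration.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

# Synthetic stand-in data (assumption): four random 8x8 RGB images with labels.
rng = np.random.default_rng(42)
images = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
labels = np.array([0, 1, 2, 0], dtype=np.int64)
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

def to_serialized_example(image, label):
    """Hypothetical helper: encode one (image, label) pair as serialised bytes."""
    encoded = tf.io.encode_png(image).numpy()  # lossless PNG bytes
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label.numpy())])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Write every sample into a single binary archive.
output_path = os.path.join(tempfile.mkdtemp(), "toy_images.tfrecord")
with tf.io.TFRecordWriter(output_path) as writer:
    for image, label in dataset:
        writer.write(to_serialized_example(image, label))

# Sanity check: count the records that ended up in the file.
num_written = sum(1 for _ in tf.data.TFRecordDataset(output_path))
```

Storing encoded images rather than raw pixel arrays keeps the file smaller, at the cost of a decoding step when the records are read back.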

## 2.1 - Preparing the Dataset:

Here, the chosen dataset will be the __Rock-Paper-Scissors__ dataset from "tensorflow-datasets".

In [2]:
# Import the required library:
import tensorflow_datasets as tfds

In [3]:
# Load in the Dataset:
hands_builder = tfds.builder(name="rock_paper_scissors")
hands_builder.download_and_prepare()

# After the download, print out some information on the dataset:
print(hands_builder.info)

tfds.core.DatasetInfo(
    name='rock_paper_scissors',
    version=3.0.0,
    description='Images of hands playing rock, paper, scissor game.',
    homepage='http://laurencemoroney.com/rock-paper-scissors-dataset',
    features=FeaturesDict({
        'image': Image(shape=(300, 300, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    }),
    total_num_examples=2892,
    splits={
        'test': 372,
        'train': 2520,
    },
    supervised_keys=('image', 'label'),
    citation="""@ONLINE {rps,
    author = "Laurence Moroney",
    title = "Rock, Paper, Scissors Dataset",
    month = "feb",
    year = "2019",
    url = "http://laurencemoroney.com/rock-paper-scissors-dataset"
    }""",
    redistribution_info=,
)


