# Working with Complex and Scarce Datasets [Notebook 1]

## Introduction

In deep learning, data is the most essential component of development. Training data should flow into the network unobstructed, and it should contain meaningful information for the model. Before entering the network, the dataset is usually prepared through a series of transformations.

Today, datasets may come from complex structures or be stored on heterogeneous devices, which makes them harder to handle. That still assumes the data is readily available; there are also cases where the relevant training data (images or annotations) is unavailable or scarce.

This project deals with these kinds of cases, and it also explores the framework that TensorFlow provides for setting up optimised data pipelines (the tf.data API).


## Breakdown of this Project:
- Building efficient input pipelines with "tf.data" for extracting and processing data samples of all kinds. (Notebooks 1 & 2)
- Augmenting and rendering images to help compensate for the scarcity of training data. (Notebooks 3 & 4)
- Exploring different types of Domain Adaptation methods and how they help to train more robust models. (Notebooks 5 & 6)
- Generating novel images with generative models such as Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs). (Notebooks 7 & 8)

## Requirements:
1. VisPy
2. TensorFlow 2.x
3. NumPy
4. os (Python standard library)

## Dataset:

The dataset can be obtained from the link: https://www.cityscapes-dataset.com/dataset-overview/.

Quoted from the website: "The Cityscapes Dataset focuses on semantic understanding of urban street scenes." It consists of >5,000 images with fine-grained semantic labels and 20,000 images with coarser annotations, all shot from the viewpoint of a car driving around different cities in Germany.


## 1 - TensorFlow Data Pipelines and Their Structure:

__Extract, Transform, Load (ETL)__ is an established paradigm for data processing in general. For computer vision tasks, __ETL pipelines__ are used to process the raw data into training data before it is fed into the models/networks.

##### The following diagram shows the overall ETL process for computer vision tasks:

<img src="Description Images/ETL_ComputerVision.png" width="750">

Image Ref -> self-made.

In more detail, each stage can be described as follows:
- For __Extract__: This stage consists of selecting the desired data sources and extracting their contents. Sources can vary: CSV files listing image filenames, images already stored in a folder, and so on. Sources can also live on different types of devices, such as a local or remote machine/storage device. Part of the extractor's task is to list these sources for each piece of originating content/data.

- For __Transform__: Once the data has been fetched from its respective sources, the next stage is to transform it. These transformations can include parsing the extracted data into a common format for the task, for example parsing the bytes of images (JPEG or PNG) into a matrix representation (tensors). Additional transformations can also be applied, such as cropping/scaling the images or augmenting them with other operations. The same can be done to the annotations for supervised learning. After parsing, the final format is usually tensors (if not exclusively tensors) so that it is compatible with the computation of the loss function during model training.

- For __Load__: After the previous stage, the data is "loaded" into the target structure. In machine learning terms, this means batching the samples and sending them to the device that will run the model (such as a GPU). After this process, the dataset can be cached or saved.
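The three stages above can be sketched as a small "tf.data" pipeline. This is only a minimal illustration under stated assumptions: the random arrays stand in for real image files, the `preprocess` function and the chosen sizes are hypothetical, and `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or newer.

```python
import numpy as np
import tensorflow as tf

# Extract: a hypothetical in-memory source standing in for real image files.
raw_images = np.random.randint(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
raw_labels = np.arange(8, dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices((raw_images, raw_labels))


# Transform: convert each image to float32 in [0, 1] and resize it.
def preprocess(image, label):
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, (16, 16))
    return image, label


dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# Load: group samples into batches and prefetch them for the training device.
dataset = dataset.batch(4).prefetch(tf.data.AUTOTUNE)

# Collect the batch shapes to inspect what the model would receive.
batch_shapes = [
    (tuple(images.shape), tuple(labels.shape)) for images, labels in dataset
]
print(batch_shapes)
```

Each element of the final dataset is one batch of four preprocessed images with their labels.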

#### API Information:

Link: https://www.tensorflow.org/guide/data and https://www.tensorflow.org/api_docs/python/tf/data

## 2 - Brief Description of the API Methods:

This section will go through some of the methods that are frequently used for the ETL process.

### 2.1 - Extract: from tensors, text files, TFRecord files and so on.

The __1st Stage__ of the ETL process is to extract the files needed and process them.

#### 2.1.1 - From NumPy and TensorFlow data:

When the dataset source is already in NumPy or TensorFlow format, it can be passed directly into "tf.data". The methods that can be used are:
- tf.data.Dataset.from_tensors()
- tf.data.Dataset.from_tensor_slices()
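The difference between the two methods is easiest to see on a small array. In this sketch (the array `data` is made up for illustration), `from_tensors()` wraps the whole array as a single dataset element, while `from_tensor_slices()` slices it along the first axis into one element per row.

```python
import numpy as np
import tensorflow as tf

# A made-up array with 3 rows of 2 values each.
data = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int32)

# One element containing the entire array.
whole = tf.data.Dataset.from_tensors(data)

# Three elements, one per row of the array.
sliced = tf.data.Dataset.from_tensor_slices(data)

whole_elements = [element.numpy().tolist() for element in whole]
sliced_elements = [element.numpy().tolist() for element in sliced]

print(len(whole_elements), len(sliced_elements))
```

For per-sample training data, `from_tensor_slices()` is usually the one that matches the intent, since each image or label becomes its own element.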

#### 2.1.2 - From Files:

When the files of interest to the task reside within folders, they can be read by first listing them with "tf.data.Dataset.list_files()", which returns an iterable dataset of file paths. The individual files can then be opened with:
- tf.io.read_file()
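As a rough sketch of this listing-then-reading pattern, the example below writes a few placeholder text files to a temporary folder (the file names and contents are invented for the demonstration) and reads them back. With real images, the raw bytes returned by `tf.io.read_file()` would then be decoded in the Transform stage.

```python
import os
import tempfile

import tensorflow as tf

# Create a temporary folder with a few placeholder files (hypothetical data).
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp_dir, f"sample_{i}.txt"), "wb") as f:
        f.write(f"content {i}".encode("utf-8"))

# List every file matching the glob pattern; shuffle=False keeps a fixed order.
pattern = os.path.join(tmp_dir, "*.txt")
files = tf.data.Dataset.list_files(pattern, shuffle=False)

# Read the raw bytes of each listed file.
contents = files.map(tf.io.read_file)

decoded = [c.numpy().decode("utf-8") for c in contents]
print(decoded)
```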

Additionally, the API also accounts for binary or text files, CSV files, or image/label information in text files (from public datasets). The following can be used to iterate and read them:
- tf.data.TextLineDataset()
- tf.data.experimental.CsvDataset()
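A short sketch of both readers on the same file follows. The CSV file, its column names and its values are invented for illustration; `TextLineDataset` yields each line as a raw string, whereas `CsvDataset` parses the columns into typed tensors according to `record_defaults`.

```python
import os
import tempfile

import tensorflow as tf

# Write a small, made-up annotation file: one image filename and label per row.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "labels.csv")
with open(csv_path, "w") as f:
    f.write("filename,label\n")
    f.write("img_0.png,3\n")
    f.write("img_1.png,7\n")

# TextLineDataset: one string element per line; skip(1) drops the header line.
lines = tf.data.TextLineDataset(csv_path).skip(1)
raw_lines = [line.numpy().decode("utf-8") for line in lines]

# CsvDataset: parse each row into (string, int32) columns, ignoring the header.
records = tf.data.experimental.CsvDataset(
    csv_path, record_defaults=[tf.string, tf.int32], header=True
)
parsed = [
    (name.numpy().decode("utf-8"), int(label.numpy())) for name, label in records
]

print(raw_lines)
print(parsed)
```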

#### 2.1.3 - From other Input Sources: (generators, SQL Database, range and so on)

Note that "tf.data.Dataset" is a comprehensive API that also accounts for a range of other input sources. For example, it can be used to iterate over numbers with:
- .range()

Or work with Python Generators like:
- .from_generator()
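Both sources can be tried in a few lines. In this sketch the generator `squares` is a hypothetical stand-in for any Python generator that yields samples, and the `output_signature` argument assumes TensorFlow 2.4 or newer.

```python
import tensorflow as tf

# A dataset that iterates over the numbers 0..4.
numbers = tf.data.Dataset.range(5)
range_values = [int(n.numpy()) for n in numbers]


# A hypothetical Python generator yielding one scalar sample at a time.
def squares():
    for i in range(4):
        yield i * i


# Wrap the generator; output_signature describes the shape and dtype it yields.
generated = tf.data.Dataset.from_generator(
    squares,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)
generated_values = [int(v.numpy()) for v in generated]

print(range_values)
print(generated_values)
```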

Another source is data stored in a SQL database. TensorFlow has experimental tools for interacting with it, such as:
- tf.data.experimental.SqlDataset()


### 2.2 - Transform: with parsing, augmenting and so on.

The __2nd Stage__ of the ETL pipeline is to transform the files that were extracted from the source. Here, the transformations can be split into two kinds:
1. Performing the transformation on the data samples individually.
2. Performing the transformation on the entire dataset as a whole.
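The two kinds of transformation can be contrasted on a toy dataset. This is only an illustrative sketch with made-up values: `map()` applies a function to each sample individually, while `batch()` and `shuffle()` act on the dataset as a whole by regrouping or reordering its elements.

```python
import tensorflow as tf

# A toy dataset of the numbers 0..5.
dataset = tf.data.Dataset.range(6)

# (1) Per-sample transformation: the function is applied to every element.
doubled = dataset.map(lambda x: x * 2)

# (2) Dataset-level transformations: they change how elements are grouped
# or ordered, without touching the individual values.
batched = doubled.batch(4)
shuffled = doubled.shuffle(buffer_size=6, seed=0)

doubled_values = [int(x.numpy()) for x in doubled]
batched_values = [b.numpy().tolist() for b in batched]
shuffled_values = [int(x.numpy()) for x in shuffled]

print(doubled_values)
print(batched_values)
```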

The following will describe more on (1).

#### 2.2.1 - Transform by Parsing Images and Labels:





