# Data loading in Keras and TensorFlow

Modern deep learning comes with two considerations that affect the way we process input data:

1. The data are typically too big to fit in memory.
2. We usually have two separate computing devices, the CPU and the GPU.

Point 1 means that we have to process the data in _batches_, where we load a subset of the data into memory, run one step of gradient descent on it, and then load the next subset and proceed with another training step.

Point 2 means that while the GPU is running gradient descent and backpropagation on one batch, the CPU is free to load and pre-process the next batch in parallel. So whenever the GPU is done with one training iteration, it can immediately start with the next one, without waiting for the data to be loaded from disk.

TensorFlow provides the functionality to do this efficiently without requiring much effort from us, and in this notebook we will try it out for different types of input data.

In [None]:
import numpy as np
import tensorflow as tf
import keras
import matplotlib.pyplot as plt

The core object we will interact with is a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset), which has useful methods like

 - `batch(batch_size)` which makes batches of a given size,
 - `prefetch(buffer_size)` which loads the next batch in advance, and
 - `map(map_func)` which applies a function to each element, like Python's built-in `map()` function.

 For additional information, have a look at the `tf.data` [tutorial](https://www.tensorflow.org/guide/data), and the `tf.data.Dataset` [documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).
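
To get a feel for how these three methods chain together, here is a minimal sketch on a toy dataset of the integers 0 to 4. The name `toy_pipeline` and the multiplication by 10 are only illustrative choices, not part of the notebook's pipeline.

```python
import tensorflow as tf

# Toy dataset holding the integers 0, 1, 2, 3, 4.
toy_numbers = tf.data.Dataset.range(5)

toy_pipeline = (
    toy_numbers
    .map(lambda x: x * 10)  # element-wise transformation
    .batch(2)               # group elements into batches of (at most) 2
    .prefetch(1)            # prepare one batch ahead of the consumer
)

toy_batches = [batch.tolist() for batch in toy_pipeline.as_numpy_iterator()]
print(toy_batches)
```

Note that the last batch is smaller than the others, since five elements do not divide evenly into batches of two.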


First, let's make a `Dataset` from a list or an array, which is done by a function with the very non-obvious name `from_tensor_slices`:

In [None]:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

The contents of a `Dataset` are more or less hidden from us, since its elements usually don't exist before they are needed. But a `Dataset` is iterable:

In [None]:
for element in dataset:
    print(element)

The elements are `tf.Tensor`s, with their benefits and drawbacks, but remember we can convert to regular NumPy arrays by calling `.numpy()`.

### <span style="color: red; font-weight: bold;">Exercise:</span>

From the above `dataset`, extract the original Python list of integers, `[8, 3, 0, 8, 2, 1]`.

In [None]:
original_list = ...
print(original_list)

## Reading data

Let's start doing more realistic stuff, like reading in different types of file formats.

We will try CSV files, images, and a custom file format.

### CSV files

Columns of data stored in CSV (_comma-separated values_) files can be imported to TensorFlow through the common [Pandas](https://pandas.pydata.org/docs/index.html) format, which is shown in this [tutorial](https://www.tensorflow.org/tutorials/load_data/csv). But in case all the data fits in memory, there is not that much of a reason to use `tf.data.Dataset` in the first place.
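
As a minimal sketch of that in-memory route, assuming Pandas is installed: the small `DataFrame` below is a hypothetical stand-in for the contents of a CSV file, and the column names `feature_a`, `feature_b` and `target` are made up for illustration.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy table standing in for the contents of a CSV file.
toy_frame = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "target": [0.5, 1.5, 2.5],
})

toy_features = toy_frame[["feature_a", "feature_b"]].to_numpy(dtype="float32")
toy_targets = toy_frame["target"].to_numpy(dtype="float32")

# Slicing along the first axis gives one (features, target) pair per row.
in_memory_dataset = tf.data.Dataset.from_tensor_slices((toy_features, toy_targets))

first_x, first_y = next(in_memory_dataset.as_numpy_iterator())
print(first_x, first_y)
```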

Let's construct an example where we have data spread over many CSV files, and want to read them efficiently.

Here we get the _California housing prices_ dataset, and simply split the training set into 20 different files (and the validation and test sets into 10 each), for illustration purposes.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)


import numpy as np
from pathlib import Path

def save_to_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = Path() / "datasets" / "housing"
    housing_dir.mkdir(parents=True, exist_ok=True)
    filename_format = "my_{}_{:02d}.csv"

    filepaths = []
    m = len(data)
    chunks = np.array_split(np.arange(m), n_parts)
    for file_idx, row_indices in enumerate(chunks):
        part_csv = housing_dir / filename_format.format(name_prefix, file_idx)
        filepaths.append(str(part_csv))
        with open(part_csv, "w") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_csv_files(test_data, "test", header, n_parts=10)

One such file now looks like

In [None]:
print("".join(open(train_filepaths[0]).readlines()[:4]))

and we have these different file paths:

In [None]:
print(train_filepaths)

### Pipeline for reading multiple files

Now we start building our input pipeline. This can be done for any type of file (not just CSV), but we show it here for the CSV case.

In [None]:
# A dataset containing our list of files.
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

# (check that it works as expected)
for filepath in filepath_dataset:
    print(filepath)

We would like the files to be read in parallel, so we can process the contents of one while loading another.

This is achieved by _interleaving_ the files, using `tf.data.Dataset.interleave()`.

In [None]:
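# Number of files to read from at a time (the cycle length). This can also be
# set to `tf.data.AUTOTUNE`, in which case TensorFlow chooses the value itself.
# To fetch from the files in parallel, also pass `num_parallel_calls`.
n_readers = 5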

# Now make a new dataset from the file paths.
# We use `TextLineDataset` since we have text inputs,
# and `skip(1)` skips the header line.
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers
)

# Print the first five values:
for line in dataset.take(5):
    print(line)
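
To see what interleaving does to the element order, here is a small sketch on toy data. Each input number stands in for a "file" and is expanded into a sub-dataset that repeats it three times; the names `toy_inputs` and `interleaved_values` are illustrative only.

```python
import tensorflow as tf

# Three toy "files": each input element n becomes a dataset repeating n three times.
toy_inputs = tf.data.Dataset.range(1, 4)  # 1, 2, 3

toy_interleaved = toy_inputs.interleave(
    lambda n: tf.data.Dataset.from_tensors(n).repeat(3),
    cycle_length=2,  # draw from two sub-datasets at a time
    block_length=1,  # take one element from each before switching
)

interleaved_values = [int(v) for v in toy_interleaved.as_numpy_iterator()]
print(interleaved_values)
```

The output alternates between the first two sub-datasets, and the third one is only opened once a slot in the cycle becomes free.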

### Add preprocessing

From the above cell we see that our values are still strings, but we want floats, and probably some additional steps too, like standardisation of features.

Let's add a preprocessing function.

### <span style="color: red; font-weight: bold;">Exercise:</span>

Implement feature standardisation (you may use scikit-learn's `StandardScaler`) in the `preprocess` function below.

In [None]:
num_columns = 9  # 8 features + 1 target, matching the columns in the CSV files

def parse_csv_line(line):
    """
    TensorFlow is peculiar about its types, have a look
    at https://www.tensorflow.org/api_docs/python/tf/io/decode_csv
    """
    defs = [float()] * num_columns
    fields = tf.io.decode_csv(line, record_defaults=defs)

    # Return first the features, then the target
    return tf.stack(fields[:-1]), tf.stack(fields[-1:])


def preprocess(line):
    x, y = parse_csv_line(line)

    # TODO:
    # Feature standardisation

    return x, y
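
As a quick illustration of how `tf.io.decode_csv` treats types and missing values, here is a sketch on a made-up line with three fields (the line and the names `example_line` and `example_fields` are hypothetical). There must be exactly one default per column, and the type of each default determines the type of the parsed field.

```python
import tensorflow as tf

# A made-up line with three fields; the middle one is empty.
example_line = "1.5,,3.0"

# One default per column; float defaults make every field a float32 tensor,
# and an empty field is replaced by its default.
example_defaults = [0.0, 0.0, 0.0]
example_fields = tf.io.decode_csv(example_line, record_defaults=example_defaults)

example_values = [float(field) for field in example_fields]
print(example_values)

# `decode_csv` returns a list of scalar tensors; `tf.stack` joins them into one.
example_vector = tf.stack(example_fields)
print(example_vector)
```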

### Putting everything together

Now we apply the preprocessing function to our dataset, shuffle it, and add the remaining performance steps: batching and pre-fetching.

In [None]:
# Each step builds on the output of the previous one.
preprocessed_dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
shuffled_dataset = preprocessed_dataset.shuffle(buffer_size=10000)
batched_dataset = shuffled_dataset.batch(batch_size=128)
prefetched_dataset = batched_dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

As always, see if it works!

### <span style="color: red; font-weight: bold;">Exercise:</span>

Print the first three data points of the final dataset.

In [None]:
# Your code

## Images

For images we typically store one image per file, and Keras gives us a very nice convenience function for getting images into a `tf.data.Dataset`, `keras.utils.image_dataset_from_directory()`, which we have already used in previous notebooks.

The requirement for using it is that we save the images in a directory structure that looks like
```
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
```


Even though we are already familiar with it, let's give it a little test.

In [None]:
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip
!unzip -q kagglecatsanddogs_5340.zip
!ls

### <span style="color: red; font-weight: bold;">Exercise:</span>

Load the downloaded images into a dataset using `keras.utils.image_dataset_from_directory()`, satisfying the following conditions:
- images have a resolution of 124x124 pixels
- labels are categorical
- the batch size is 64
- the image order is shuffled.

Then plot the first three images in the dataset.

## Custom data

In case you need to load data of a custom format -- which often happens in research and development settings -- the step of getting a single file or data point into a `tf.Tensor` has to be specifically coded, but going from `tf.Tensor` to a `tf.data.Dataset` is still rather general.

Let's make a silly example just to illustrate.

In [None]:
# Some files in a wacky binary format, each containing an unknown number of
# data points.
my_files = [
    'file1.xyz',
    'file2.xyz',
    'file3.xyz',
    'file4.xyz',
    'file5.xyz'
]

# (just write small placeholder files)
for filename in my_files:
    with open(filename, 'w') as fout:
        fout.write('xyz')

def read_file(filename, num_columns=10):
    """
    Here we just generate some random stuff :)
    """

    # Random number of data points
    num_data_points = np.random.randint(1, 10)

    for _ in range(num_data_points):

        # Random data
        data = np.random.uniform(size=(num_columns,))

        # Convert to a float32 Tensor, matching the dtype in `output_signature`
        data = tf.constant(data, dtype=tf.float32)

        # Return one data point at a time.
        yield data

With our custom file reading function, we can use `tf.data.Dataset.from_generator`, which doesn't need to know beforehand how many data points each file contains. It **does** need to know, however, the shape and type of each data point, which is specified as `output_signature`.

Writing custom functions with `tf.data.Dataset` gets convoluted rather fast, but that is the price to pay for performance 🤷‍♂️
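
Before wiring this into the file pipeline, here is a minimal standalone sketch of `from_generator`. The helper `toy_generator` and its arguments are hypothetical and only serve to show how `args` and `output_signature` fit together.

```python
import numpy as np
import tensorflow as tf

def toy_generator(num_points, num_values):
    """Yield `num_points` vectors of length `num_values` (hypothetical helper)."""
    for i in range(num_points):
        yield np.full((num_values,), i, dtype=np.float32)

toy_generated = tf.data.Dataset.from_generator(
    toy_generator,
    args=(3, 2),  # passed on to `toy_generator` as NumPy values
    output_signature=tf.TensorSpec(shape=(2,), dtype=tf.float32),
)

toy_rows = [row.tolist() for row in toy_generated.as_numpy_iterator()]
print(toy_rows)
```

The dataset does not know in advance that the generator yields three elements; it simply keeps pulling until the generator is exhausted.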

In [None]:
num_features = 10

filepaths = tf.data.Dataset.list_files(my_files)

dataset = filepaths.interleave(
    lambda filepath: tf.data.Dataset.from_generator(
        read_file,
        output_signature=(
            tf.TensorSpec(shape=(num_features,), dtype=tf.float32)
        ),
        args=(filepath, num_features)
    ),
)

### <span style="color: red; font-weight: bold;">Exercise:</span>

Batch the dataset and read the first five data points.

In [None]:
# Your code
