# How to load data into a model?

Minibatch gradient descent is the most widespread training scheme (and presumably the one that generalizes most stably). It presupposes that we can iterate over the data several times, typically with some kind of generator; each full pass over the data is called an epoch.

We can build our own generators, use external tools (for example [Blaze](http://blaze.pydata.org/) or [Dask](http://docs.dask.org/en/latest/why.html)), or use TensorFlow's [Data API](https://www.tensorflow.org/guide/datasets). The choice is crucial for performance: a GPU has separate, dedicated memory that is only reachable through a copy operation from main RAM. That copy involves the CPU as well as the internal "bus" interfaces, so data loading can become the single biggest bottleneck for training.

Plain English version: even if you buy an expensive GPU, training will be super slow if you load data inefficiently.
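To make the "iterate through the data multiple times" idea concrete, here is a minimal sketch of a hand-written minibatch generator in plain NumPy. The function name `minibatches`, its parameters, and the toy arrays are assumptions made for illustration only; they do not come from any library.

```python
import numpy as np

def minibatches(features, labels, batch_size, epochs, seed=0):
    """Yield (epoch, batch_x, batch_y) tuples, reshuffling at every epoch."""
    rng = np.random.default_rng(seed)
    n = len(features)
    for epoch in range(epochs):
        order = rng.permutation(n)  # new sample order for each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield epoch, features[idx], labels[idx]

# Toy data: 10 samples with 3 features each (purely illustrative).
X = np.arange(30, dtype=np.float32).reshape(10, 3)
y = np.arange(10)

batches = list(minibatches(X, y, batch_size=4, epochs=2))
print(len(batches))  # 3 batches per epoch (4 + 4 + 2 samples), 2 epochs -> 6
```

A generator like this works, but it runs on a single Python thread and does nothing to overlap loading with training, which is exactly the gap that dedicated input pipelines try to close.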

For more guidance on designing high-performance input pipelines, see the [guide](https://www.tensorflow.org/guide/data_performance) by the TF team.

## The tf.data API

For the data API the documentation is surprisingly informative:

"The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces a tf.data.Dataset abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset:

A data source constructs a Dataset from data stored in memory or in one or more files.

A data transformation constructs a dataset from one or more tf.data.Dataset objects."

The whole approach bears some resemblance to scikit-learn's `Pipeline`, albeit with more emphasis on data cleaning and manipulation and less on chaining successive models. This is because we typically use TF for neural models, which contain the feature extractor hierarchy inside them.

The API also includes an elaborate mechanism for loading data from the filesystem in parallel and in a streaming manner, as well as efficient support for the TFRecord format.
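As a rough illustration of streaming from TFRecord files, the sketch below writes two tiny shards to a temporary directory and reads them back with two parallel readers. The shard file names, the `value` feature key, and the `parse` helper are hypothetical choices for this example; a TensorFlow 2.x installation with eager execution is assumed.

```python
import os
import tempfile

import tensorflow as tf

# Write two small TFRecord shards into a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for shard in range(2):
    path = os.path.join(tmp_dir, f"shard-{shard}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for value in range(shard * 5, shard * 5 + 5):
            example = tf.train.Example(features=tf.train.Features(feature={
                "value": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[value])),
            }))
            writer.write(example.SerializeToString())
    paths.append(path)

# Describe how to decode one serialized record.
feature_spec = {"value": tf.io.FixedLenFeature([], tf.int64)}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)["value"]

# Read both shards with two parallel readers; records arrive interleaved.
records = tf.data.TFRecordDataset(paths, num_parallel_reads=2).map(parse)
values = sorted(int(v.numpy()) for v in records)
print(values)
```

Because the records from the shards are interleaved, the example sorts the values before printing; in a real pipeline the interleaving is usually welcome, since the data gets shuffled anyway.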

## Dataset usage example

`Dataset` is used in a functional style: we chain together the preparation steps for our data.

**Definition:**

```python
import tensorflow as tf

# Add random integer noise drawn from [-10, 10) to each element of 0..99.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random.uniform([], -10, 10, tf.int64))
```

**Alternatively:**

```python
import numpy as np
import tensorflow as tf

np_array = np.random.randint(low=-10, high=10, size=10)

training_dataset = tf.data.Dataset.from_tensor_slices(np_array)
```


**Iteration:**

The resulting dataset is an iterable (with eager execution, the default in TF 2.x), so it can be used directly in a loop:


```python
for i in training_dataset:
    print(i)
```
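The chaining style extends naturally to the usual preparation steps. The following is a minimal sketch, assuming TensorFlow 2.x with eager execution; the concrete transformation, buffer size, and batch size are arbitrary values picked for illustration.

```python
import tensorflow as tf

pipeline = (
    tf.data.Dataset.range(10)
    .map(lambda x: x * 2)              # element-wise transformation
    .shuffle(buffer_size=10, seed=0)   # buffer covers the whole dataset
    .batch(4)                          # the last batch may be smaller
)

batch_sizes = [int(batch.shape[0]) for batch in pipeline]
print(batch_sizes)  # 10 elements in batches of 4 -> [4, 4, 2]
```

Each method returns a new `Dataset`, so the order of the calls matters: shuffling before batching mixes individual elements, whereas shuffling after batching would only reorder whole batches.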


## Further enhancements

As a default, it is advisable to feed your data through a `Dataset`, to add `Dataset.prefetch()` to gain some speed (see the discussion [here](https://stackoverflow.com/questions/47064693/tensorflow-data-api-prefetch)), and to store the data in the `TFRecord` file format (see this [post](https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)).
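A minimal sketch of prefetching might look like the following. It assumes TensorFlow 2.4 or newer, where `tf.data.AUTOTUNE` is available (older 2.x releases expose it as `tf.data.experimental.AUTOTUNE`); the `preprocess` function is a hypothetical stand-in for real decoding or augmentation work.

```python
import tensorflow as tf

def preprocess(x):
    # Stand-in for real decoding / augmentation work.
    return tf.cast(x, tf.float32) / 255.0

prefetched = (
    tf.data.Dataset.range(256)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # prepare upcoming batches while the current one is consumed
)

num_batches = sum(1 for _ in prefetched)
print(num_batches)  # 256 / 32 = 8
```

Placing `prefetch` as the last step is the usual choice, since it then overlaps the whole upstream pipeline with the training step that consumes the batches.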

The complete performance guide can be found [here](https://www.tensorflow.org/performance/datasets_performance).


## Getting things to work with Keras

### Fits in memory

The default case is when the whole dataset fits comfortably into memory.

The design pattern (in the case of the _sequential API_) is:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Load entire dataset
X, y = np.load('some_training_set_with_labels.npy')

# Design model
model = Sequential()
model.add(Dense(..., input_shape=(784,)))
# Note: for flat (1D) inputs, the following form is equivalent:
# model.add(Dense(..., input_dim=784))

[...] # Your architecture
model.compile()

# Train model on your dataset
model.fit(X,y,...)
```

### The `generator` way

In the less trivial case, when the data does not fit into memory, you either use one of the default generators (available in Keras for images, text and sequences) or write your own generator function.

The Keras documentation does not go into much detail about this (see [here](https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory)), but it does give a specific example [here](https://keras.io/utils/#sequence). A somewhat more annotated version can be found [here](https://medium.com/datadriveninvestor/keras-training-on-large-datasets-3e9d9dbc09d4).

```python
import math

import numpy as np
from keras.utils import Sequence
from skimage.io import imread
from skimage.transform import resize

# Here, `x_set` is a list of paths to the images
# and `y_set` holds the associated classes.

class CIFAR10Sequence(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out the file names and labels belonging to batch `idx`.
        start = idx * self.batch_size
        end = start + self.batch_size
        batch_x = self.x[start:end]
        batch_y = self.y[start:end]

        return np.array([
            resize(imread(file_name), (200, 200))
               for file_name in batch_x]), np.array(batch_y)
```

```python
my_generator = CIFAR10Sequence(X,y,batch_size)

model.fit(my_generator,...)
```
 
A good description of how to write your own generator functions can be found [here](https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly).


### Fitting directly on a `Dataset`

But if we have already familiarized ourselves with the TF Dataset API, why not just use that?

```python
from tensorflow.data import Dataset

my_dataset = Dataset.from_tensor_slices(x).repeat().batch(...)

...

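# The dataset repeats indefinitely, so `steps_per_epoch` defines how many
# batches make up one epoch. With 1, every epoch is a single batch; usually
# you want ceil(num_samples / batch_size) here.
model.fit(my_dataset, steps_per_epoch=1, epochs=...)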
```

This works because of the following (quoted from the source linked below): "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator."

[Source](https://stackoverflow.com/questions/46135499/how-to-properly-combine-tensorflows-dataset-api-and-keras)
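For a complete picture, here is a small end-to-end sketch of fitting a Keras model directly on a `Dataset`. It assumes TensorFlow 2.x with the bundled `tf.keras`; the synthetic data, the layer sizes, and the batch size are arbitrary illustrative choices, not recommendations.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 64 samples, 8 features, binary labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 8)).astype("float32")
labels = (features.sum(axis=1) > 0).astype("float32")

batch_size = 16
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=64)
    .batch(batch_size)
    .prefetch(1)
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The dataset is finite here (no .repeat()), so each epoch is one full pass
# and `steps_per_epoch` can be omitted.
history = model.fit(dataset, epochs=2, verbose=0)
print(len(history.history["loss"]))  # one loss value per epoch -> 2
```

Leaving out `.repeat()` lets Keras detect the end of each epoch on its own; with an infinitely repeating dataset, `steps_per_epoch` has to be given explicitly.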