# Setting up a Scalable ML Data Pipeline

As we have seen, in deep learning we often deal with large datasets, which might even exceed the memory available to us.  

In this lab you will learn how to set up a more scalable data pipeline where the data stays on disk until it is needed during training.

Once again, here is the code to download the Intel Image Classification dataset.

In [None]:
import os
if not os.path.exists('seg_train'):
  # Quote the URL so the shell does not treat '&' as "run in background"
  !wget -O archive.zip "https://www.dropbox.com/scl/fi/ribf92om67kpi34wukl7q/archive.zip?rlkey=qn5v9cwvaqwba8jhsr7diyxnm&dl=1"
  !unzip -qq archive.zip

In [None]:
import numpy as np
import keras
from matplotlib import pyplot as plt

This time we will use the Keras function `image_dataset_from_directory`.  It expects the images to be stored in separate directories according to their labels:

```
   dog/
       - dog1.jpg
       - dog2.jpg
       - ...
   cat/
       - cat1.jpg
       - cat2.jpg
       - ...
```

It returns a TensorFlow `Dataset` object.  Note that it does not load the images from disk -- it just looks at the directory and catalogs which images are available.
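To make the "catalog, don't load" idea concrete, here is a minimal sketch using only the standard library. The helper `catalog_images` is hypothetical (it is not part of Keras), and it only approximates what `image_dataset_from_directory` does: it records file paths and integer labels, with class names taken from the sub-directory names in alphabetical order, and never opens an image.

```python
import os
import tempfile

def catalog_images(root, extensions=('.jpg', '.jpeg', '.png')):
    """Hypothetical helper: list image paths and labels without reading any image."""
    # Sub-directory names, sorted alphabetically, become the class names.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    file_paths, labels = [], []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(extensions):
                file_paths.append(os.path.join(class_dir, fname))
                labels.append(label)
    return class_names, file_paths, labels

# Demo on a throwaway directory filled with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for name, files in {'dog': ['dog1.jpg', 'dog2.jpg'], 'cat': ['cat1.jpg']}.items():
        os.makedirs(os.path.join(root, name))
        for fname in files:
            open(os.path.join(root, name, fname), 'w').close()
    class_names, file_paths, labels = catalog_images(root)

print(class_names)  # ['cat', 'dog']
print(labels)       # [0, 1, 1]
```

Because only paths and labels are kept in memory, the catalog stays small no matter how large the images themselves are.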

In [None]:
train_ds = keras.preprocessing.image_dataset_from_directory('seg_train/seg_train')
train_ds

In [None]:
train_ds.class_names

When we iterate over the dataset, it loads batches of images from disk.  The batch size is set by the `batch_size` argument to `image_dataset_from_directory`.
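As a rough illustration of what the batch size implies, the sketch below computes how many batches one pass over the data produces. The function `batches_per_epoch` and the image counts are made up for the example; the default of 32 matches the default `batch_size` of `image_dataset_from_directory`, and the sketch assumes the final, smaller batch is kept rather than dropped.

```python
import math

def batches_per_epoch(num_images, batch_size=32):
    """Hypothetical helper: number of batches and size of the last batch."""
    num_batches = math.ceil(num_images / batch_size)
    last_batch_size = num_images - (num_batches - 1) * batch_size
    return num_batches, last_batch_size

print(batches_per_epoch(1000))       # (32, 8)
print(batches_per_epoch(1000, 100))  # (10, 100)
```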

Here `.take(1)` tells the dataset we only want the first batch.

The batches are returned as `EagerTensor`s, so we call `.numpy()` to convert them to NumPy arrays before inspecting or plotting them.

In [None]:
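for images, labels in train_ds.take(1):
  # Each iteration yields one batch: an image tensor and a label tensor
  print('images:', images.shape, images.dtype, 'labels:', labels.shape, labels.dtype)
  print('image data range:', images[0].numpy().min(), images[0].numpy().max())
  # Pixel values are floats in [0, 255]; cast to uint8 so imshow displays them correctly
  plt.imshow(images[0].numpy().astype('uint8'))
  plt.title(labels[0].numpy())
  plt.show()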

`image_dataset_from_directory` resizes the images so that they all have the same shape.  You can control the image size through the `image_size` argument.  The default is $256\times256$.

If the original image is not square, then the image will be somewhat squashed by the resize operation.  To avoid this, you can set `crop_to_aspect_ratio=True` so that it will center crop the image before resizing.
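The center crop can be sketched with plain integer arithmetic. The function `center_crop_box` below is a hypothetical illustration of the idea (take the largest centered window that has the target aspect ratio, then resize it); it is not the Keras implementation, and rounding details may differ.

```python
def center_crop_box(height, width, target_height, target_width):
    """Hypothetical helper: largest centered window with the target aspect ratio.

    Returns (top, left, crop_height, crop_width) in pixels.
    """
    # First try to keep the full width and derive the height from the target ratio.
    crop_w = width
    crop_h = width * target_height // target_width
    if crop_h > height:
        # That window is too tall, so keep the full height and shrink the width instead.
        crop_h = height
        crop_w = height * target_width // target_height
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, crop_h, crop_w

# A landscape image loses 25 pixels on each side before being resized to a square.
print(center_crop_box(150, 200, 128, 128))  # (0, 25, 150, 150)
# A portrait image loses 50 pixels from the top and bottom.
print(center_crop_box(300, 200, 128, 128))  # (50, 0, 200, 200)
```

An image that is already square is left uncropped, so only the resize is applied in that case.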

`image_dataset_from_directory` can automatically create a validation split for you, using the `validation_split` argument.  You need to call the function twice: once with `subset='training'` and once with `subset='validation'` to make both datasets.  You should also set the `seed` argument to the same value in both calls to ensure that the same split is used both times!

In [None]:
train_ds = keras.preprocessing.image_dataset_from_directory(
    'seg_train/seg_train',
    subset='training',
    validation_split=0.1,
    seed=42)

In [None]:
val_ds = keras.preprocessing.image_dataset_from_directory(
    'seg_train/seg_train',
    subset='validation',
    validation_split=0.1,
    seed=42)
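Why the seed matters can be seen in a small NumPy sketch. This is a conceptual model only, not the Keras code: the function `split_indices` is hypothetical, and it assumes the split works by shuffling all image indices with the given seed and reserving the tail of the shuffled order for validation.

```python
import numpy as np

def split_indices(num_images, validation_split, seed, subset):
    """Hypothetical helper: indices belonging to one subset of a seeded split."""
    # The same seed gives the same permutation on every call.
    order = np.random.RandomState(seed).permutation(num_images)
    num_val = int(validation_split * num_images)
    if subset == 'training':
        return order[:num_images - num_val].tolist()
    if subset == 'validation':
        return order[num_images - num_val:].tolist()
    raise ValueError("subset must be 'training' or 'validation'")

train_idx = split_indices(100, 0.1, seed=42, subset='training')
val_idx = split_indices(100, 0.1, seed=42, subset='validation')
print(len(train_idx), len(val_idx))        # 90 10
print(len(set(train_idx) & set(val_idx)))  # 0

# With a different seed, the validation subset comes from a different shuffle,
# so it is no longer guaranteed to be disjoint from the training subset.
mismatched_val_idx = split_indices(100, 0.1, seed=7, subset='validation')
print(len(set(train_idx) & set(mismatched_val_idx)))
```

Under this model, matching seeds make the two subsets a clean partition of the data, while mismatched seeds can leak training images into the validation set.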

## Exercises

Try using `image_dataset_from_directory` in your CNN training.

1. First, create the train, val, and test datasets using `image_dataset_from_directory`.

Set the image size to 128x128 with center cropping, and use a validation split of 0.1.


2. Set up a CNN for image classification.  We can use a bigger CNN than last time now that we are not using up all that memory to store the dataset.

Here's my suggested architecture:

* Input layer
* 2D convolution, 3x3 kernel, 32 channels, ReLU activation
* Max pooling: 2x2 kernel, stride of 2
* 2D convolution, 3x3 kernel, 64 channels, ReLU activation
* Max pooling: 2x2 kernel, stride of 2
* 2D convolution, 3x3 kernel, 128 channels, ReLU activation
* Max pooling: 2x2 kernel, stride of 2
* 2D convolution, 3x3 kernel, 256 channels, ReLU activation
* Max pooling: 2x2 kernel, stride of 2
* 2D convolution, 3x3 kernel, 512 channels, ReLU activation
* Max pooling: 2x2 kernel, stride of 2
* Flatten
* Dense output layer configured for multi-class classification
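Before building the model, it can help to check how the feature maps shrink through the five convolution and pooling blocks. The sketch below does that arithmetic in plain Python. It assumes a 128x128 input (from exercise 1), convolutions with stride 1 and no padding (the Keras default, `padding='valid'`), and 2x2 pooling with stride 2; if you choose `padding='same'` the numbers will differ.

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the input
    return size - kernel + 1

def pool_out(size, pool=2, stride=2):
    # Non-overlapping pooling; a leftover row or column is discarded
    return (size - pool) // stride + 1

size = 128  # assumed input height and width
channels_per_block = [32, 64, 128, 256, 512]
shapes = []
for channels in channels_per_block:
    size = pool_out(conv_out(size))
    shapes.append((size, size, channels))

flattened = size * size * channels_per_block[-1]

for shape in shapes:
    print(shape)
# (63, 63, 32)
# (30, 30, 64)
# (14, 14, 128)
# (6, 6, 256)
# (2, 2, 512)
print('flattened features:', flattened)  # flattened features: 2048
```

Under these assumptions the `Flatten` layer hands 2048 features to the dense output layer.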

However, we are missing something -- the data preprocessing!  Right now the images are in the [0, 255] range, which is not ideal for neural network training.

To address this, we can add a `Lambda` layer right after the `Input` layer.  It should look like this:

`Lambda(lambda x:x/128-1)`

This will rescale the images to roughly the [-1, 1] range on the fly, as the data passes through the network.
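A quick NumPy check shows what this rescaling does to the extreme pixel values. The array `pixels` is just a few hand-picked sample values, and the `127.5` variant is shown only for comparison; it is not the formula used in this lab.

```python
import numpy as np

# Sample pixel values spanning the full [0, 255] range
pixels = np.array([0.0, 127.5, 128.0, 255.0], dtype='float32')

# The formula used in the Lambda layer
scaled = pixels / 128 - 1
print(scaled.min(), scaled.max())  # -1.0 0.9921875

# For comparison: dividing by 127.5 maps 255 exactly to 1
symmetric = pixels / 127.5 - 1
print(symmetric.min(), symmetric.max())  # -1.0 1.0
```

The upper end with `x/128-1` falls just short of 1, which makes no practical difference for training.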

Try it out and see what accuracy you can get!  (I reached 82.6% test accuracy with this one.)

In [None]:
from keras import Sequential
from keras.optimizers import SGD, Adam
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten, Lambda
from keras.regularizers import L2