# Chapter 13: Loading and Preprocessing Data

There is not much to follow along with here, as the examples in the book are all short snippets rather than a full run.

The exercises will be the bulk of the learning.

# Exercises

1. Use the Data API to read data for a model in a repeatable and fast way. It also makes it easy to repeat or shuffle the input data, take smaller samples of it, and cache disk-based data in RAM for faster access.
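
The operations listed in this answer can be sketched in a few lines. This is a minimal sketch, assuming a toy in-memory tensor stands in for the real data; the `features` tensor, buffer size, and batch size are illustrative choices, not values from the book.

```python
import tensorflow as tf

# Toy data standing in for a real dataset (an assumption for illustration).
features = tf.range(10)

dataset = (
    tf.data.Dataset.from_tensor_slices(features)
    .cache()                           # keep elements in RAM after the first pass
    .shuffle(buffer_size=10, seed=42)  # seeded shuffle over the whole toy dataset
    .repeat(2)                         # two epochs
    .batch(5)
    .prefetch(1)                       # prepare the next batch while this one is used
)

batches = [batch.numpy().tolist() for batch in dataset]

# take() gives a smaller sample of the data.
sample = [int(x) for x in tf.data.Dataset.from_tensor_slices(features).take(3).as_numpy_iterator()]
```

Because `shuffle` comes before `repeat`, each epoch is a full permutation of the data before the next epoch starts.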

2. Splitting a large dataset across multiple files lets multiple processes read different files in parallel, with each process or thread acting on one subset of the input data, for map-reduce-like semantics.

 Also, it allows the data to be read by multiple machines.
 
 Finally, if the dataset is larger than RAM, it might be the only way to read the data: break it into RAM-sized chunks, then read one file, act on the data, evict it from memory, and move on to the next.
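
The chunk-by-chunk idea can be sketched in plain Python rather than with `tf.data`. The shard names, the round-robin split, and the per-shard sum are all hypothetical choices made for illustration.

```python
import csv
import tempfile
from pathlib import Path

def write_shards(rows, n_shards, directory):
    """Split rows round-robin across n_shards CSV files and return their paths."""
    paths = [Path(directory) / f"part_{i:02d}.csv" for i in range(n_shards)]
    for i, path in enumerate(paths):
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(rows[i::n_shards])
    return paths

def shard_sum(path):
    """Stream one shard and reduce it to a single number, so none of its rows stay in memory."""
    with path.open(newline="") as f:
        return sum(float(value) for row in csv.reader(f) for value in row)

rows = [[i, 2 * i] for i in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(rows, n_shards=4, directory=tmp)
    # Shards are handled one at a time here; they could equally go to separate workers.
    partial_sums = [shard_sum(path) for path in shard_paths]

total = sum(partial_sums)
```

Each partial sum is the "map" step on one file, and the final `sum` is the "reduce" step.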

3. You can instrument the time spent reading the data versus the time spent training the model. In many naive setups, more time goes to reading data from the cloud (as in [this example](https://towardsdatascience.com/music-genre-classification-with-tensorflow-3de38f0d4dbb)) than to model training. The input pipeline is also the bottleneck when the model is already built and preprocessing the incoming streaming data takes longer than inference.
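
One rough way to do that instrumentation is to time the wait for each batch separately from the work done on it. This is a plain-Python sketch: `profile_loop`, `slow_loader`, and the sleep-based delay are hypothetical stand-ins for a real input pipeline and training step.

```python
import time

def profile_loop(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step) for one pass."""
    load_time = compute_time = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        load_time += time.perf_counter() - start

        start = time.perf_counter()
        train_step(batch)
        compute_time += time.perf_counter() - start
    return load_time, compute_time

def slow_loader(n_batches, delay=0.02):
    """Yield small batches, sleeping first to imitate slow disk or network reads."""
    for i in range(n_batches):
        time.sleep(delay)
        yield [i] * 4

load_time, compute_time = profile_loop(slow_loader(5), train_step=lambda batch: sum(batch))
input_bound = load_time > compute_time
```

If the loop is input-bound, as it is here by construction, the input pipeline is the place to optimize first.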

4. My guess was that only serialized protobufs can be saved in TFRecord files. That is too strict: a TFRecord file is a sequence of binary records, and each record can hold arbitrary bytes. In practice, the records are usually serialized protobufs written to disk.
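
As a check on what a TFRecord file can hold, this minimal sketch writes one record of raw bytes and one serialized `tf.train.Example` to the same file, then reads both back. The file name and the `label` feature are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")

with tf.io.TFRecordWriter(path) as writer:
    # Record 1: raw bytes, not a protobuf.
    writer.write(b"any bytes at all")
    # Record 2: a serialized Example protobuf, the usual choice.
    example = tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    }))
    writer.write(example.SerializeToString())

records = [record.numpy() for record in tf.data.TFRecordDataset([path])]
parsed = tf.io.parse_single_example(
    records[1], {"label": tf.io.FixedLenFeature([], tf.int64)}
)
```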

5. If we had our own protobuf format, we'd have to write our own reader and writer methods. That is definitely possible, but it is much easier to use the TF format, since TensorFlow already ships parsing operations for it that cover most common data and dataset operations. This makes it much easier to read the serialized records directly.

6. I'd use compression when:
  * Using or transporting large files
  * Sending uncompressed data over the wire is prohibitively slow or expensive
  * The underlying data is relatively sparse (large, mostly empty files)
  * CPU is cheap but disk or network is expensive
  
 I wouldn't use compression when:
  * Data is small
  * Data is meant to be stored on disk, and not sent over networks.
  * Data can be generated trivially
  * Data is information-dense (already compressed), like zip, mp3, or jpg files, which compress poorly.
  * CPU is at a premium and disk is cheap.
    
I wouldn't use compression all the time. For small files, compression doesn't help much and might even increase the size of the data, the CPU time used, or both.
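
These trade-offs are easy to see with Python's standard `gzip` module. The three payloads below are illustrative stand-ins: a tiny file, a sparse file of zeros, and random bytes that behave like already-compressed data.

```python
import gzip
import os

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size; values above 1 mean the data grew."""
    return len(gzip.compress(data)) / len(data)

payloads = {
    "tiny": b"hello",                 # smaller than the gzip header and trailer
    "sparse": bytes(1_000_000),       # one million zero bytes
    "dense": os.urandom(1_000_000),   # random bytes, like zip/mp3/jpg content
}

ratios = {name: compressed_ratio(data) for name, data in payloads.items()}
```

The tiny payload grows, the sparse one shrinks dramatically, and the dense one stays about the same size while still costing CPU time. The same trade-off applies when writing TFRecord files with `tf.io.TFRecordOptions(compression_type="GZIP")`.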


7. Writing data files:
Pros:
  * Works when data is generated outside of TensorFlow (a server producing log files, for example).
  * Works when the data is ancient, already exists, and needs to be maintained as-is.
  * Works when the ML work is speculative and might not yield a benefit.
  * Persistent.
  
Cons:
  * Difficult to work with. You need to know what the data format is.
  * Can be inefficient to store on disk.
  * Has OS-level file-system quirks. File names might be case sensitive (Linux), case insensitive (Windows), or case sensitive but in a way that nobody understands (Mac).
  * Can get mangled without anyone noticing. The end of the file might get truncated during transfer, for example.
  * Relies on endianness, character encoding (Unicode, UTF-8), or precision (32-bit FP? 64-bit FP? 38-bit?).
  * Lives outside of the normal TF world: training, fitting, etc.
    
TF-data pipeline:
Pros:
  * A standard mechanism.
  * Works great across platforms.
  

Cons:
  * Memory only
  
Preprocessing layer within the model:
Pros:
  * Integrated within the normal training.
  * Reliable, can be used for both training and inference
  
Cons:
  * Not persistent.
  
TF Transform:
Pros:
  * Can be deterministically and correctly applied both during training and inference.
  * Can be used outside of TensorFlow for larger datasets in a parallel environment.
  
  
Cons:
  * Just for processing, not for creating/storing data.
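
To make the "preprocessing layer within the model" option concrete, here is a minimal sketch of a custom Keras layer that standardizes its inputs. The `Standardization` class, its `adapt` method, and the random data are assumptions for illustration, not the author's code.

```python
import numpy as np
from tensorflow import keras

class Standardization(keras.layers.Layer):
    """Hypothetical layer: learn per-feature mean/std once, apply them on every call."""

    def adapt(self, data_sample):
        # Statistics come from the training data only.
        self.means_ = np.mean(data_sample, axis=0, keepdims=True).astype(np.float32)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True).astype(np.float32)

    def call(self, inputs):
        # A small constant avoids division by zero for constant features.
        return (inputs - self.means_) / (self.stds_ + 1e-7)

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 3)).astype(np.float32)

layer = Standardization()
layer.adapt(X_demo)

train_out = layer(X_demo).numpy()       # same transform during training...
single_out = layer(X_demo[:1]).numpy()  # ...and for a single row at inference time
```

Because the statistics live inside the layer, the same transformation travels with the model, but nothing preprocessed is written to disk.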


8. Encoding categorical features. I'd use one of:
  * String enums where the values have some meaning, exported as one-hot vectors: the species label of an observation, for example.
  * One-hot vectors when the values don't have a simple English meaning: form number 1040-R versus 1040-EZ, for example.
  * Integers, when the values can be encoded that way: ZIP codes, for example, which carry no ordinal information (94303 is not *more* than 94043).


Text can be encoded as:
  * Sparse vectors of occurrence counts for a bag-of-words model: words in a news article, for example.
  * A one-hot vector when you only get a single value: "single-family home" versus "apartment" versus "empty-lot" for housing, for example.
  * Integers/floats when the words can be organized in some ordinal sense (colors of the rainbow can become a frequency).
  * Just text when they are unique identifiers: first name and last name of people.
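
Two of these encodings can be sketched in plain Python. The vocabularies and helper names (`one_hot`, `bag_of_words`) are hypothetical; a real pipeline would build the vocabulary from the training data.

```python
from collections import Counter

def one_hot(value, vocabulary):
    """Return a list with a single 1 at the position of value in vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text (case-insensitive)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

home_types = ["single-family home", "apartment", "empty-lot"]
encoded_home = one_hot("apartment", home_types)

news_vocabulary = ["rates", "rise", "fall", "again"]
encoded_text = bag_of_words("Rates rise and rise again", news_vocabulary)
```

Here `encoded_home` is `[0, 1, 0]` and `encoded_text` is `[1, 2, 0, 1]`; words outside the vocabulary, such as "and", are ignored.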
  

9. Load Fashion MNIST, split it, and save each dataset to multiple TFRecord files.

In [1]:
# Common imports

import matplotlib.cm as cm
from matplotlib.image import imread
import matplotlib.pyplot as plt

import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import make_blobs

from sklearn.metrics import accuracy_score
from sklearn.metrics import silhouette_samples
from sklearn.metrics import silhouette_score

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow import keras

import sys

print("TF version ", tf.__version__)
print("Keras version ", keras.__version__)




TF version  2.3.0
Keras version  2.4.0


In [10]:
# Loaded Chapter_10.py. So nice to have actual methods rather than the mess that is jupyter's code layout.
sys.path.insert(0, '/home/viki/ml/MachineLearning/Hands-On2')

# Change these to Chapter_10 hereafter once the conversion is complete and the sys.exit(0) is gone.
# import Chapter_10
import c10 as Chapter_10


In [11]:
X_train, X_valid, X_test, y_train, y_valid, y_test, class_names = Chapter_10.load_fashion_mnist()
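
One possible way to save a split to multiple TFRecord files is sketched below. It is not the author's solution: the helper names, the file-name pattern, and the round-robin sharding are assumptions, and small random arrays stand in for the Fashion MNIST splits so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def make_example(image, label):
    """Pack one image and its label into a tf.train.Example."""
    image_bytes = tf.io.serialize_tensor(image).numpy()
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))

def write_tfrecord_shards(images, labels, prefix, n_shards):
    """Write instances round-robin into n_shards files and return the file paths."""
    paths = [f"{prefix}-{i:03d}-of-{n_shards:03d}.tfrecord" for i in range(n_shards)]
    for shard, path in enumerate(paths):
        with tf.io.TFRecordWriter(path) as writer:
            for image, label in zip(images[shard::n_shards], labels[shard::n_shards]):
                writer.write(make_example(image, label).SerializeToString())
    return paths

def parse(serialized):
    """Decode one serialized Example back into an (image, label) pair."""
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    return tf.io.parse_tensor(features["image"], out_type=tf.uint8), features["label"]

# Random stand-ins with the Fashion MNIST image shape (28x28 uint8) and 10 classes.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(20, 28, 28), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=20)

prefix = os.path.join(tempfile.mkdtemp(), "demo_train")
paths = write_tfrecord_shards(X_demo, y_demo, prefix, n_shards=4)
restored = list(tf.data.TFRecordDataset(paths).map(parse).as_numpy_iterator())
```

With the real data, the same function would be called once per split, for example with `X_train` and `y_train`, then `X_valid` and `y_valid`, then `X_test` and `y_test`.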