# Loading and Preprocessing Data with TensorFlow

# Exercises

## 1.
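Ingesting a large dataset and preprocessing it efficiently can be challenging. The Data API makes this fairly simple: it can load data from various sources (such as text or binary files), read from multiple sources in parallel, and transform, interleave, shuffle, batch, and prefetch the data.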

## 2.
Splitting a large dataset into many files has many benefits:
- It makes it possible to shuffle the dataset at a coarse level before shuffling it at a finer level.
- It is simpler to manipulate many smaller files than one huge file.
- If the data is split into files spread across multiple servers, several files can be downloaded from different servers simultaneously.
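
The coarse-then-fine shuffling idea from the first bullet can be sketched in plain Python. This is only an illustration under stated assumptions: the shards are in-memory lists standing in for files, and `two_level_shuffle` is a hypothetical helper, not part of the Data API.

```python
import random

def two_level_shuffle(shards, buffer_size, seed=None):
    """Yield every record from `shards` (a list of lists standing in for files).

    Coarse level: the order of the shards is shuffled.
    Fine level: records pass through a shuffle buffer of `buffer_size` items.
    """
    rng = random.Random(seed)
    shard_order = list(range(len(shards)))
    rng.shuffle(shard_order)  # coarse shuffle: visit the "files" in random order

    buffer = []
    for shard_idx in shard_order:
        for record in shards[shard_idx]:
            buffer.append(record)
            if len(buffer) >= buffer_size:
                # Buffer is full: emit one randomly chosen record
                pick = rng.randrange(len(buffer))
                buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
                yield buffer.pop()

    # Input exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer

# Three hypothetical "files" with four records each
shards = [[f"file{i}-rec{j}" for j in range(4)] for i in range(3)]
shuffled = list(two_level_shuffle(shards, buffer_size=5, seed=42))
print(shuffled)
```

With a single huge file, only the buffer-level shuffle would be available, so records that are far apart in the file could never end up close together unless the buffer were as large as the dataset.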

## 3.
If the GPU is not being fully utilized, it is possible that the input pipeline is the bottleneck. This may be fixed by reading and preprocessing the data in multiple threads in parallel, and prefetching a few batches. A properly optimized preprocessing function can also help a lot. Saving the dataset into multiple TFRecord files and performing some of the preprocessing ahead of time might be a good idea as well.
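
To make the prefetching idea concrete, here is a minimal standard-library sketch. It is not how `tf.data` implements prefetching; the `prefetch` helper and its `buffer_size` parameter are hypothetical names chosen for this illustration. A background thread keeps a small queue filled so the consumer rarely has to wait for the next item.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread prepares the next ones."""
    sentinel = object()  # marks the end of the stream
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in iterable:
                q.put(item)  # blocks while the queue is full
        finally:
            q.put(sentinel)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# The consumer (standing in for the training loop) sees the items in order
batches = list(prefetch(range(5), buffer_size=2))
print(batches)
```

While the consumer is busy with one batch, the producer is already preparing the next, which is the same overlap of work that prefetching aims for between the CPU input pipeline and the GPU.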

## 4.
Any binary data can be stored in a TFRecord file. In practice, most TFRecord files contain sequences of serialized protocol buffers, which allows them to be easily read across multiple platforms.
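
The idea of a file holding a sequence of arbitrary binary records can be sketched with a simple length-prefixed layout. This is a simplified illustration, not the actual TFRecord format: real TFRecord files also store CRC checksums alongside the length and the data, so files written by this sketch cannot be read by TFRecord readers. The helper names are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Write each bytes record as an 8-byte little-endian length followed by the data."""
    for data in records:
        stream.write(struct.pack("<Q", len(data)))
        stream.write(data)

def read_records(stream):
    """Yield the bytes records written by write_records."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        data = stream.read(length)
        if len(data) < length:
            raise ValueError("truncated record")
        yield data

# Any bytes can be stored: text, raw binary, even an empty record
buffer = io.BytesIO()
write_records(buffer, [b"hello", b"\x00\x01\x02", b""])
buffer.seek(0)
print(list(read_records(buffer)))
```

Because the container only sees opaque bytes, the payload can be anything, which is why serialized protocol buffers are a natural fit.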

## 5.
TensorFlow provides some handy operations to parse the `Example` protobuf format, which is flexible enough to represent instances in most datasets. If this is not the case, a custom protocol buffer can be defined for a specific application, but doing so requires deploying the descriptor along with the model.

## 6.
Compression saves disk space and network bandwidth, at the cost of the CPU time needed to decompress the data. Whether to enable it depends on the application and on which resource is the scarcest.
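
The trade-off can be observed with the standard `gzip` module. This is only a sketch on made-up, highly repetitive data, so the compression ratio it prints says nothing about real datasets.

```python
import gzip

# Hypothetical, highly repetitive payload (real data will usually compress less well)
records = b"label=1;text=a great, great movie\n" * 1000

compressed = gzip.compress(records)
restored = gzip.decompress(compressed)  # this step costs CPU time on every read

print(f"raw: {len(records)} bytes, compressed: {len(compressed)} bytes")
print("round trip OK:", restored == records)
```

For TFRecord files, the same choice is exposed through the `compression_type` argument (for example `"GZIP"`) of `tf.io.TFRecordOptions` when writing and of `tf.data.TFRecordDataset` when reading.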

## 7.
- Preprocessing the data when creating the data files will speed up the training process since no preprocessing on the fly will be required. If the data contains a lot of noise that is going to be filtered out, some disk space will be saved as well. However, this approach will limit the flexibility of experimenting with various preprocessing pipelines. Moreover, the trained model will expect preprocessed data, which will add a layer of complexity to the deployed application (code to preprocess the data before feeding it to the model).

- Using the `tf.data` pipeline to preprocess data will make it much easier to tweak the preprocessing logic, and it makes it easy to create highly efficient preprocessing pipelines. However, this approach will slow down training, and each training instance will be preprocessed once per epoch rather than just once when preparing the data beforehand (as with the previous approach). Lastly, the trained model will still expect preprocessed data.

- Adding preprocessing layers to the model has the advantage that preprocessing code has to be written only once for both training and inference. The downside is that it will also slow down training, and each instance will be preprocessed once per epoch. Moreover, by default, these operations will be run on the GPU, but the upcoming Keras preprocessing layers should be able to benefit from multithreaded execution on the CPU.

- Using TF Transform gives many of the benefits of the previous options: each instance is preprocessed just once (which speeds up training), and the preprocessing layers are generated automatically, so the preprocessing code is written only once. The main drawback is the learning curve required to use this tool.

## 8.
If the categorical feature has a natural order, a simple option is to use ordinal encoding. If it does not have such a natural order, one-hot encoding can be used, or embeddings if there are many categories.
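
As a minimal sketch of the first two options, using plain Python and hypothetical helper names and categories (embeddings are left out because they require a trainable layer):

```python
def ordinal_encode(values, ordered_categories):
    """Map each value to its rank in `ordered_categories`."""
    index = {category: rank for rank, category in enumerate(ordered_categories)}
    return [index[value] for value in values]

def one_hot_encode(values, categories):
    """Map each value to a vector with a single 1 at the category's position."""
    index = {category: pos for pos, category in enumerate(categories)}
    return [[1 if index[value] == pos else 0 for pos in range(len(categories))]
            for value in values]

# "size" has a natural order, so ordinal encoding preserves useful information
print(ordinal_encode(["small", "large", "medium"], ["small", "medium", "large"]))

# "color" has no natural order, so one-hot encoding avoids implying one
print(one_hot_encode(["red", "blue"], ["red", "green", "blue"]))
```

One-hot vectors grow with the number of categories, which is why embeddings become attractive when there are many of them.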

One option to encode text is the bag-of-words representation. However, it can favor frequent words that are usually not very important, so TF-IDF is a popular way to reduce their weight. Also, instead of counting only individual words, it is common to count _n_-grams. Using word embeddings to encode the text is also a viable option.
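
These text encodings can be sketched in plain Python. The helpers below are hypothetical, and the weighting used, count × log(N / document frequency), is just one common TF-IDF variant among several.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (joined with spaces) found in `tokens`."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)

def tf_idf(docs_tokens):
    """Weight each token count by log(n_docs / number of docs containing the token)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    return [
        {token: count * math.log(n_docs / doc_freq[token])
         for token, count in Counter(tokens).items()}
        for tokens in docs_tokens
    ]

docs = [
    "the movie was great great".split(),
    "the movie was terrible".split(),
]
print(bag_of_words(docs[0]))  # raw counts: "the" counts as much as "movie"
print(tf_idf(docs)[0])        # words present in every document get weight 0
print(ngrams(docs[1], 2))     # bigrams keep some word-order information
```

In this variant, a word that appears in every document gets a weight of exactly zero, which shows how TF-IDF pushes down uninformative words.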

## 9.

In [1]:
from tensorflow import keras
import tensorflow as tf

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train, y_train = X_train_full[5000:], y_train_full[5000:]
X_val, y_val = X_train_full[:5000], y_train_full[:5000]

In [2]:
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))
val_set = tf.data.Dataset.from_tensor_slices((X_val, y_val))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [3]:
from tensorflow.train import BytesList, Int64List
from tensorflow.train import Feature, Features, Example

def create_example(image, label):
    image_data = tf.io.serialize_tensor(image)
    
    return Example(
        features=Features(
            feature={
                'image': Feature(bytes_list=BytesList(value=[image_data.numpy()])),
                'label': Feature(int64_list=Int64List(value=[label]))
            }))

In [4]:
from contextlib import ExitStack

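def save_tfrecords(name, data, n_shards=20):
    # The examples are spread across n_shards files: name_00.tfrecord, name_01.tfrecord, ...
    filepaths = [f"{name}_{idx:02d}.tfrecord" for idx in range(n_shards)]

    # ExitStack closes every writer on exit, even if an exception is raised
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path)) for path in filepaths]

        # Round-robin assignment: example number idx goes to file idx % n_shards
        for idx, (image, label) in data.enumerate():
            file_idx = idx % n_shards
            writers[file_idx].write(create_example(image, label).SerializeToString())

    return filepaths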

In [5]:
train_filepaths = save_tfrecords('fashion_mnist-train', train_set)
val_filepaths = save_tfrecords('fashion_mnist-val', val_set)
test_filepaths = save_tfrecords('fashion_mnist-test', test_set)

In [6]:
import os

def preprocess(tfrecord):
    feature_descriptions = {
        'image': tf.io.FixedLenFeature([], tf.string, default_value=""),
        'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    image = tf.io.parse_tensor(example['image'], out_type=tf.uint8)
    
    return tf.reshape(image, shape=[28, 28]), example['label']

def load_dataset(filepaths, batch_size=32, shuffle_buffer_size=None, cache=True):
    dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=os.cpu_count())
    
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    
    dataset = dataset.map(preprocess, num_parallel_calls=os.cpu_count())
    dataset = dataset.batch(batch_size)
    
    return dataset.prefetch(1)

In [7]:
train_set = load_dataset(train_filepaths, shuffle_buffer_size=60000)
val_set = load_dataset(val_filepaths)
test_set = load_dataset(test_filepaths)

In [8]:
import numpy as np

keras.backend.clear_session()

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
        
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())
    
standardization_layer = Standardization(input_shape=[28, 28])

sample_image_batches = train_set.take(100).map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()), axis=0).astype(np.float32)

standardization_layer.adapt(sample_images)

model = keras.models.Sequential([
    standardization_layer,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

In [9]:
history = model.fit(train_set, epochs=5, validation_data=val_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## 10.

### a.

In [10]:
from pathlib import Path

filename = 'aclImdb_v1.tar.gz'
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
download_path = keras.utils.get_file(filename, url, extract=True, cache_dir='/tmp/.keras/')
path = Path(download_path).parent / 'aclImdb'
path

PosixPath('/tmp/.keras/datasets/aclImdb')

In [11]:
def data_filepaths(dirpath):
    return np.array([str(path) for path in dirpath.glob('*.txt')])

train_pos = data_filepaths(path / 'train' / 'pos') 
train_neg = data_filepaths(path / 'train' / 'neg')
test_val_pos = data_filepaths(path / 'test' / 'pos')
test_val_neg = data_filepaths(path / 'test' / 'neg')

### b.

In [12]:
perm = np.random.RandomState(42).permutation(len(test_val_pos))
test_val_pos = test_val_pos[perm]
test_val_neg = test_val_neg[perm]

test_pos = test_val_pos[:5000]
test_neg = test_val_neg[:5000]
val_pos = test_val_pos[5000:]
val_neg = test_val_neg[5000:]

### c.

In [13]:
def create_dataset(filepaths_pos, filepaths_neg):
    reviews = []
    labels = []
    for filepaths, label in ((filepaths_pos, 1), (filepaths_neg, 0)):
        for filepath in filepaths:
            with open(filepath, encoding='utf-8') as review:
                reviews.append(review.read())
            labels.append(label)
    return tf.data.Dataset.from_tensor_slices((tf.constant(reviews), tf.constant(labels)))

In [14]:
batch_size = 32

train_set = create_dataset(train_pos, train_neg).shuffle(len(train_pos) + len(train_neg)).batch(batch_size).prefetch(1)
val_set = create_dataset(val_pos, val_neg).batch(batch_size).prefetch(1)
test_set = create_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)

### d.

In [15]:
def preprocess_text(X_batch, n_words=50):
    # Target shape [batch_size, n_words]: keep the batch dimension, fix the number of words
    shape = tf.shape(X_batch) * tf.constant([1, 0]) + tf.constant([0, n_words])
    Z = tf.strings.substr(X_batch, 0, 300)
    Z = tf.strings.lower(Z)
    Z = tf.strings.regex_replace(Z, b'<br\\s*/?>', b' ')
    Z = tf.strings.regex_replace(Z, b'[^a-z]', b' ')
    Z = tf.strings.split(Z)
    return Z.to_tensor(shape=shape, default_value=b'<pad>')

In [16]:
X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
tf.shape(X_example) * tf.constant([1, 0]) + tf.constant([0, 50])

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([ 2, 50], dtype=int32)>

In [17]:
from collections import Counter

def get_vocabulary(data_sample, max_size=1000):
    preprocessed_reviews = preprocess_text(data_sample).numpy()
    counter = Counter()
    for words in preprocessed_reviews:
        for word in words:
            if word != b'<pad>':
                counter[word] += 1
    return [b'<pad>'] + [word for word, count in counter.most_common(max_size)]

In [18]:
class TextVectorization(keras.layers.Layer):
    def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.max_vocabulary_size = max_vocabulary_size
        self.n_oov_buckets = n_oov_buckets
        
    def adapt(self, data_sample):
        self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
        words = tf.constant(self.vocab)
        word_ids = tf.range(len(self.vocab), dtype=tf.int64)
        vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets)
        
    def call(self, inputs):
        preprocessed_inputs = preprocess_text(inputs)
        return self.table.lookup(preprocessed_inputs)

In [19]:


text_vectorization = TextVectorization()

text_vectorization.adapt(X_example)
text_vectorization(X_example)



<tf.Tensor: shape=(2, 50), dtype=int64, numpy=
array([[ 1,  3,  4,  2,  2,  5,  6,  7,  1,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0],
       [ 1,  8,  9, 10, 11,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0]])>

In [20]:
max_vocabulary_size = 1000
n_oov_buckets = 100

sample_review_batches = train_set.map(lambda review, label: review)
sample_reviews = np.concatenate(list(sample_review_batches.as_numpy_iterator()), axis=0)

text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets, input_shape=[])

text_vectorization.adapt(sample_reviews)

In [21]:
class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.n_tokens = n_tokens
        
    def call(self, inputs):
        one_hot = tf.one_hot(inputs, self.n_tokens)
        return tf.reduce_sum(one_hot, axis=1)[:, 1:]

In [22]:
n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 token for <pad>
bag_of_words = BagOfWords(n_tokens)

In [23]:
model = keras.models.Sequential([
    text_vectorization,
    bag_of_words,
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])

In [24]:
history = model.fit(train_set, epochs=5, validation_data=val_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### e.

In [25]:
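def mean_embedding(inputs):
    # Count the non-padding words in each review (this assumes that padding
    # positions hold all-zero embedding vectors)
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    # mean * sqrt(n_words) = (sum / n_words) * sqrt(n_words) = sum / sqrt(n_words)
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words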

In [26]:
embedding_size = 20

model = keras.models.Sequential([
    text_vectorization,
    keras.layers.Embedding(input_dim=n_tokens, output_dim=embedding_size, mask_zero=True),
    keras.layers.Lambda(mean_embedding),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

### f.

In [27]:
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
history = model.fit(train_set, epochs=5, validation_data=val_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### g.

In [28]:
import tensorflow_datasets as tfds

datasets = tfds.load(name='imdb_reviews')
train_set, test_set = datasets['train'], datasets['test']



In [29]:
for example in train_set.take(1):
    print(example['text'])
    print(example['label'])

tf.Tensor(b"Oh yeah! Jenna Jameson did it again! Yeah Baby! This movie rocks. It was one of the 1st movies i saw of her. And i have to say i feel in love with her, she was great in this move.<br /><br />Her performance was outstanding and what i liked the most was the scenery and the wardrobe it was amazing you can tell that they put a lot into the movie the girls cloth were amazing.<br /><br />I hope this comment helps and u can buy the movie, the storyline is awesome is very unique and i'm sure u are going to like it. Jenna amazed us once more and no wonder the movie won so many awards. Her make-up and wardrobe is very very sexy and the girls on girls scene is amazing. specially the one where she looks like an angel. It's a must see and i hope u share my interests", shape=(), dtype=string)
tf.Tensor(1, shape=(), dtype=int64)
