* This notebook shows how to convert large `NumPy` arrays to `.tfrecord` files and train a model on them.
* It also compares training speed when loading a large `NumPy` array directly into the model versus streaming the data from `.tfrecord` files.

* [Source1](https://datamadness.github.io/tensorflow_estimator_large_dataset_feed)
* [Source2](https://www.tensorflow.org/tutorials/load_data/tfrecord#walkthrough_reading_and_writing_image_data)
* [Source3](https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)
* [Source4](https://medium.com/@prasad.pai/how-to-use-dataset-and-iterators-in-tensorflow-with-code-samples-3bb98b6b74ab)

# Imports

In [1]:
import tensorflow as tf
#tf.compat.v1.enable_eager_execution()
from tensorflow.python.client import device_lib
import numpy as np
import sys, pickle, os
import h5py, time, inspect
import IPython.display as display

In [2]:
config = tf.ConfigProto()
config.gpu_options.allow_growth=True

# Load a large `NumPy` array

In [3]:
filename = 'data/emnist_train_x.h5'
with h5py.File(filename, 'r') as hf:
    train_x = list(hf['pool1_spike_features'][:])

filename = 'data/emnist_test_x.h5'
with h5py.File(filename, 'r') as hf:
    test_x = list(hf['pool1_spike_features'][:])
    

In [4]:
filename = 'data/emnist_train_y.pkl'
with open(filename, 'rb') as filehandle:  # close the file once the labels are loaded
    train_y = pickle.load(filehandle).tolist()

filename = 'data/emnist_test_y.pkl'
with open(filename, 'rb') as filehandle:
    test_y = pickle.load(filehandle).tolist()

# Using `.tfrecord` in `Tensorflow`

## Helper functions that return lists of floats/bytes/ints

In [5]:
def _byte_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def array_bytes_feature(value):
    """Returns a bytes_list holding the raw bytes of a NumPy array."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # Convert an EagerTensor to a NumPy array first.
    # BytesList expects a list of byte strings, so wrap the single byte string in a list.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.tobytes()]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def array_floats_feature(value):
    """Returns a float_list from a NumPy array of floats (flattened)."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value.flatten()))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


In [6]:
def get_example_object(observation_id=0, data_x=None, data_y=None):
    """Builds a tf.train.Example holding one observation's label and features."""
    feature = {
        'label': _int64_feature(data_y[observation_id]),
        'image_raw': array_floats_feature(data_x[observation_id]),
    }
    example_obj = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_obj

## Serializing a single Image
* `get_example_object` returns a protocol buffer message (a `tf.train.Example`), which we then serialize to a byte string

In [7]:
example_obj = get_example_object(0, data_x=train_x, data_y=train_y)
print(example_obj)
serialized_example_obj = example_obj.SerializeToString()
print()
print()
serialized_example_obj

In [None]:
parsed_example_obj=tf.train.Example.FromString(serialized_example_obj)
parsed_example_obj

In [None]:
len(train_x)

* `tf.io.TFRecordWriter()` accepts, through its `options` argument, the same compression type strings as the `compression_type` argument of `tf.data.TFRecordDataset`.
* [Source](https://github.com/tensorflow/tensorflow/issues/32075#issuecomment-528117457)
* [tf.io.TFRecordWriter](https://www.tensorflow.org/api_docs/python/tf/io/TFRecordWriter)
* [tf.data.TFRecordDataset](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset)
* If you don't use compression, your `.tfrecord` files might end up much larger than the original dataset. If you are storing images, you can instead read the image file in its already-compressed form with `image_string = open(filename, 'rb').read()` and store those bytes, [see](https://github.com/tensorflow/tensorflow/issues/9675#issuecomment-302745553)

In [None]:
tf.io.TFRecordOptions.compression_type_map
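
To get a feel for why compression matters here, the sketch below compares the raw and GZIP-compressed size of a mostly-zero `float32` vector of length 3630 (the feature length used later in this notebook). The sparsity level and the random values are assumptions made for illustration, not properties measured on the EMNIST features, and the sketch uses Python's `gzip` module rather than `TFRecordWriter`, so it only approximates what the `'GZIP'` option does to a file.

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vector: 3630 float32 values, about 95% of them zero.
features = np.zeros(3630, dtype=np.float32)
nonzero_idx = rng.choice(features.size, size=180, replace=False)
features[nonzero_idx] = rng.random(180, dtype=np.float32)

raw_bytes = features.tobytes()               # 3630 values * 4 bytes each
compressed_bytes = gzip.compress(raw_bytes)

print('raw bytes:      ', len(raw_bytes))
print('gzip-compressed:', len(compressed_bytes))
print('round-trip ok:  ', gzip.decompress(compressed_bytes) == raw_bytes)
```

Dense, noisy data compresses far less well, so the benefit depends on how sparse the stored features actually are.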

## Make `.tfrecord` for the training data

In [None]:
t1 = time.time()
num_partitions = 10
partitions = np.linspace(0,len(train_x), num_partitions+1)
# range() below excludes its end value, so the last boundary stays at len(train_x)
partitions = partitions.astype(np.int32).tolist()
print(partitions)
for partition_id in range(num_partitions):
    with tf.io.TFRecordWriter('tfrecords/EMNIST_train_strings_' + str(partition_id) + '.tfrecord', 'GZIP') as tfwriter:
        for observation_id in range(partitions[partition_id],partitions[partition_id+1]):
            example_obj = get_example_object(observation_id, data_x=train_x, data_y=train_y)
            tfwriter.write(example_obj.SerializeToString())
print('Time taken:{}'.format(time.time()-t1))  
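
The partition boundaries can be sanity-checked without writing any files. The sketch below uses a hypothetical helper, `partition_bounds`, that builds the boundaries with `np.linspace` and then records which observation indices the half-open `range(start, stop)` loops would visit. With the final boundary left at the number of observations, every index is visited exactly once; lowering it by one would skip the last observation. The value 112799 is the training set size reported later in this notebook.

```python
import numpy as np

def partition_bounds(num_items, num_partitions):
    """Return num_partitions + 1 integer boundaries spanning [0, num_items]."""
    bounds = np.linspace(0, num_items, num_partitions + 1)
    return bounds.astype(np.int64).tolist()

bounds = partition_bounds(num_items=112799, num_partitions=10)

visited = []
for start, stop in zip(bounds[:-1], bounds[1:]):
    visited.extend(range(start, stop))  # range() excludes `stop`

print(bounds[0], bounds[-1])
print(len(visited) == 112799)
```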
            

## Make `.tfrecord` for the testing data

In [None]:
t1 = time.time()
num_partitions = 2
partitions = np.linspace(0,len(test_x), num_partitions+1)
# range() below excludes its end value, so the last boundary stays at len(test_x)
partitions = partitions.astype(np.int32).tolist()
print(partitions)
for partition_id in range(num_partitions):
    with tf.io.TFRecordWriter('tfrecords/EMNIST_test_strings_' + str(partition_id) + '.tfrecord', 'GZIP') as tfwriter:
        for observation_id in range(partitions[partition_id],partitions[partition_id+1]):
            example_obj = get_example_object(observation_id, data_x=test_x, data_y=test_y)
            tfwriter.write(example_obj.SerializeToString())
print('Time taken:{}'.format(time.time()-t1))  
            

## Read the raw dataset .tfrecord file
* `raw_data` is a string `Tensor` holding one serialized record
* `example` is the `tf.train.Example` protocol buffer message parsed from that string

In [None]:
filenames = ['tfrecords/EMNIST_train_strings_0.tfrecord']
raw_dataset = tf.data.TFRecordDataset(filenames, compression_type='GZIP',num_parallel_reads=1)
raw_dataset

* The next cells only work in eager mode (uncomment `tf.compat.v1.enable_eager_execution()` in the Imports section)

In [None]:
for raw_data in raw_dataset.take(1):
    print(repr(raw_data))
    print()
    print()
    example = tf.train.Example.FromString(raw_data.numpy())
    print(example)

In [None]:
print(example.features.feature['image_raw'].float_list.value[0])
print(example.features.feature['label'].int64_list.value[0])

## Parsing using `tf.io.FixedLenFeature`

In [None]:
image_feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([3630], tf.float32),
}

* Original image in the dataset

In [None]:
train_x[0].sum()

* Parsed Image

In [None]:
print(tf.io.parse_single_example(raw_data, image_feature_description)['image_raw'].numpy().sum())
print(tf.io.parse_single_example(raw_data, image_feature_description)['label'].numpy())

## Parsing using `tf.io.VarLenFeature`
* Note that `tf.io.VarLenFeature` returns a sparse tensor, which needs to be converted to a dense one

In [None]:
image_feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.VarLenFeature(tf.float32),
}

* Parsed Image

In [None]:
sparse_tensor = tf.io.parse_single_example(raw_data, image_feature_description)['image_raw']
dense_tensor = tf.sparse.to_dense(sparse_tensor)
print(dense_tensor.numpy().sum())
print(tf.io.parse_single_example(raw_data, image_feature_description)['label'].numpy())

## Parsing all the `.tfrecord` files to make an iterable dataset

In [14]:
def parser(record):
    image_feature_description = {
        'label': tf.io.FixedLenFeature([], tf.int64),
        'image_raw': tf.io.VarLenFeature(tf.float32),
    }
    parsed = tf.io.parse_single_example(record, image_feature_description)
    
    image = parsed['image_raw']
    image = tf.sparse.to_dense(image,default_value = 0)
    label = tf.cast(parsed["label"], tf.int32)
    
    #return {"image_data": image}, label
    return image, label
    

# Train using `.tfrecords` and `tf.keras.Model.fit`
* Load the `.tfrecord` files for the training data (the cells below expect them under `tfrecords/train` and `tfrecords/test`)
* See more about [shuffle_buffer_size](https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle)

In [15]:
def load_dataset(path='', shuffle_buffer_size=None, batch_size=None, compression='GZIP'):
    filenames = [file for file in os.listdir(os.path.join(os.getcwd(), path)) if file.endswith('.tfrecord')]
    filenames = [os.path.join(path, file) for file in filenames]
    print('.tfrecord files:{}'.format(filenames))
    raw_image_dataset = tf.data.TFRecordDataset(filenames, compression_type=compression)
    print('\n raw data:{}'.format(raw_image_dataset))
    parsed_image_dataset = raw_image_dataset.map(parser)
    print('\n parsed data:{}'.format(parsed_image_dataset))
    final_dataset = parsed_image_dataset.shuffle(shuffle_buffer_size).batch(batch_size)
    print('\n final dataset:{}'.format(final_dataset))

    return final_dataset
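
The effect of `shuffle_buffer_size` is easier to see in a small pure-Python sketch. The hypothetical `buffered_shuffle` generator below mimics the documented idea behind `Dataset.shuffle`: fill a buffer, emit a randomly chosen buffered element, and replace it with the next incoming element. It is an illustration of the concept only, not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered element
        buffer[idx] = item             # and replace it with the new item
    rng.shuffle(buffer)
    yield from buffer                  # drain whatever is left

# A buffer of size 1 cannot reorder anything: the input order is preserved.
print(list(buffered_shuffle(range(10), buffer_size=1)))
# A buffer as large as the dataset allows any ordering.
print(list(buffered_shuffle(range(10), buffer_size=10, seed=0)))
```

With a small buffer, an element can only move a limited distance from its original position, which is why the order of the records and files on disk still matters when the buffer is much smaller than the dataset.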

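* Take 10 images and inspect them. This only works in eager mode (enable it in the Imports section), and `parsed_image_dataset` is local to `load_dataset`, so run the snippet inside that function or return that dataset as well.

`for item in parsed_image_dataset.take(10):
    print()
    print('Image Data')
    print(item[0])
    print(item[0].numpy().sum())
    print()
    print('Label Data')
    print(item[1])
    print(item[1].numpy())`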

## Train a simple FCN with the loaded and parsed `.tfrecord` files
* [Source](https://www.tensorflow.org/tutorials/load_data/numpy)

* Setup the dataset
* Try multiplying `SHUFFLE_BUFFER_SIZE` by 10 or 100 and observe the effect on training time and GPU usage.

In [22]:
BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = BATCH_SIZE**2
train_dataset = load_dataset('tfrecords/train', SHUFFLE_BUFFER_SIZE, BATCH_SIZE)

.tfrecord files:['tfrecords/train/EMNIST_train_strings_4.tfrecord', 'tfrecords/train/EMNIST_train_strings_9.tfrecord', 'tfrecords/train/EMNIST_train_strings_6.tfrecord', 'tfrecords/train/EMNIST_train_strings_3.tfrecord', 'tfrecords/train/EMNIST_train_strings_5.tfrecord', 'tfrecords/train/EMNIST_train_strings_2.tfrecord', 'tfrecords/train/EMNIST_train_strings_0.tfrecord', 'tfrecords/train/EMNIST_train_strings_8.tfrecord', 'tfrecords/train/EMNIST_train_strings_7.tfrecord', 'tfrecords/train/EMNIST_train_strings_1.tfrecord']

 raw data:<TFRecordDatasetV1 shapes: (), types: tf.string>

 parsed data:<DatasetV1Adapter shapes: ((?,), ()), types: (tf.float32, tf.int32)>

 final dataset:<DatasetV1Adapter shapes: ((?, ?), (?,)), types: (tf.float32, tf.int32)>


In [23]:
def smol_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1500, input_dim=3630, activation='relu'),
        tf.keras.layers.Dense(47)
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),  # `lr` is a deprecated alias
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['sparse_categorical_accuracy'])
    return model
    

In [24]:
model = smol_model()
model.summary()
history = model.fit(train_dataset, epochs=3)

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_10 (Dense)             (None, 1500)              5446500   
_________________________________________________________________
dense_11 (Dense)             (None, 47)                70547     
Total params: 5,517,047
Trainable params: 5,517,047
Non-trainable params: 0
_________________________________________________________________
Train on None steps
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Test the model

In [19]:
test_dataset = load_dataset('tfrecords/test', SHUFFLE_BUFFER_SIZE, BATCH_SIZE)

.tfrecord files:['tfrecords/test/EMNIST_test_strings_0.tfrecord', 'tfrecords/test/EMNIST_test_strings_1.tfrecord']

 raw data:<TFRecordDatasetV1 shapes: (), types: tf.string>

 parsed data:<DatasetV1Adapter shapes: ((?,), ()), types: (tf.float32, tf.int32)>

 final dataset:<DatasetV1Adapter shapes: ((?, ?), (?,)), types: (tf.float32, tf.int32)>


In [20]:
model.evaluate(test_dataset)

    588/Unknown - 4s 7ms/step - loss: 1.5163 - sparse_categorical_accuracy: 0.7154

[1.516273353122124, 0.7154104]

# Train directly using the large `NumPy` array

In [9]:
np.array(train_y).shape

(112799,)

In [8]:
np.array(train_x).shape

(112799, 3630)

In [11]:
BATCH_SIZE = 32
model = smol_model()
model.summary()
history = model.fit(np.array(train_x),np.array(train_y), epochs=3, batch_size=BATCH_SIZE)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 1500)              5446500   
_________________________________________________________________
dense_5 (Dense)              (None, 47)                70547     
Total params: 5,517,047
Trainable params: 5,517,047
Non-trainable params: 0
_________________________________________________________________
Train on 112799 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Test the model

In [13]:
model.evaluate(np.array(test_x), np.array(test_y), batch_size=len(test_x))



[1.5333495140075684, 0.69712764]
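
The outputs above do not record wall-clock time, so comparing the two training approaches requires timing each `model.fit` call explicitly. One way to do that, sketched here with a hypothetical `timed` helper built only on the standard library, is a small context manager. The `time.sleep` call is a stand-in for the training call.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

with timed('placeholder'):
    time.sleep(0.05)  # stand-in for e.g. model.fit(train_dataset, epochs=3)

print({label: round(seconds, 2) for label, seconds in timings.items()})
```

In this notebook, the two `model.fit` calls could each be wrapped this way, for example under the labels `'tfrecord'` and `'numpy'`, and the resulting entries in `timings` compared.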

# Directly storing `2D NumPy` arrays
* The `Tensorflow` TFRecords documentation suggests storing images as serialized strings with some metadata (height, width, channels, etc.) as extra features. But what if you have large 2D `NumPy` arrays that you want to store as they are, without flattening them as we did above?
* [Source](https://stackoverflow.com/questions/47861084/how-to-store-numpy-arrays-as-tfrecord)
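
One possible approach, sketched here under the assumption that the array's shape and dtype are stored (or known) alongside its raw bytes, is to serialize with `ndarray.tobytes()` and restore with `np.frombuffer(...).reshape(...)`. In a TFRecord, the bytes would go into a `BytesList` feature and the shape into an `Int64List` feature; that part is left out so the example runs with NumPy alone. The 33 x 110 shape is an arbitrary choice whose product equals the 3630 features used above.

```python
import numpy as np

original = np.arange(33 * 110, dtype=np.float32).reshape(33, 110)

raw = original.tobytes()       # would be stored in a BytesList feature
shape = list(original.shape)   # would be stored in an Int64List feature

# Restoring needs the dtype and the shape, since raw bytes carry neither.
restored = np.frombuffer(raw, dtype=np.float32).reshape(shape)

print(restored.shape)
print(np.array_equal(original, restored))
```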