`Course Instructor`: **John Chiasson**

`Author (TA)`: **Ruthvik Vaila**
# Note
* This notebook shows how to use `.tfrecord` files and how they affect RAM usage and training time.
* We usually load all the training data into the system's RAM at once, but some datasets are too large to fit. In such cases we need a mechanism that streams the training data and labels to the model. `tf.io.TFRecordWriter` lets us split a large dataset into smaller shards that can be loaded sequentially into RAM and then fed to the input layer of the neural network.
* Here we convert large `NumPy` arrays to `.tfrecord` files and train a model on them. The same can be done for `.png` or `.jpg` images, or even text files.
* We also compare the training speed of loading a large `NumPy` array directly into the model against streaming the data from `.tfrecord` files.

* [Source1](https://datamadness.github.io/tensorflow_estimator_large_dataset_feed)
* [Source2](https://www.tensorflow.org/tutorials/load_data/tfrecord#walkthrough_reading_and_writing_image_data)
* [Source3](https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)
* [Source4](https://medium.com/@prasad.pai/how-to-use-dataset-and-iterators-in-tensorflow-with-code-samples-3bb98b6b74ab)
* Tested on `Python 3.7.5` with `Tensorflow 1.15.0` and `Keras 2.2.4`. 
* Tested on `Python 2.7.17` with `Tensorflow 1.15.3` and `Keras 2.2.4`. 
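
The streaming idea described above can be sketched without `Tensorflow` at all. The following is a minimal illustration, assuming plain `.npy` shards as a stand-in for `.tfrecord` files; the helper names `write_shards` and `stream_batches` are hypothetical and not part of this notebook.

```python
import os
import tempfile

import numpy as np


def write_shards(data, num_shards, folder):
    """Split `data` row-wise into `num_shards` files and return their paths."""
    paths = []
    for shard_id, shard in enumerate(np.array_split(data, num_shards)):
        path = os.path.join(folder, 'shard_{}.npy'.format(shard_id))
        np.save(path, shard)
        paths.append(path)
    return paths


def stream_batches(paths, batch_size):
    """Yield batches while holding only one shard in RAM at a time."""
    for path in paths:
        shard = np.load(path)
        for start in range(0, len(shard), batch_size):
            yield shard[start:start + batch_size]


# Toy data (made up): 10 observations with 5 features each.
data = np.arange(50, dtype=np.float32).reshape(10, 5)
folder = tempfile.mkdtemp()
paths = write_shards(data, num_shards=3, folder=folder)
total_rows = sum(len(batch) for batch in stream_batches(paths, batch_size=2))
```

Only the shard currently being read is resident in memory, which is the property that `.tfrecord` shards provide for datasets larger than RAM.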

# Imports

In [1]:
import sys, os, shutil
sys.version

'2.7.17 (default, Jul 20 2020, 15:37:01) \n[GCC 7.5.0]'

In [2]:
import tensorflow as tf
#tf.compat.v1.enable_eager_execution()
from tensorflow.python.client import device_lib
import numpy as np
import sys, pickle, os, IPython
import h5py, time, inspect
import keras, warnings
warnings.filterwarnings(action='once')
import IPython.display as display
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
# This makes sure that, when using a GPU, TensorFlow does not grab all of the
# GPU memory up front and instead allocates more as required.

Using TensorFlow backend.


In [3]:
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 6658834863401485198, name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 1509187708878976432
 physical_device_desc: "device: XLA_CPU device", name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 13930074261618042996
 physical_device_desc: "device: XLA_GPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 7419989197
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 14477355034444360244
 physical_device_desc: "device: 0, name: GeForce RTX 2080 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5"]

In [4]:
tf.__version__

'1.15.3'

In [5]:
keras.__version__

'2.2.4'

# Load a large `NumPy` array

In [6]:
filename = 'data/emnist_train_x.h5'
with h5py.File(filename, 'r') as hf:
    train_x = hf['pool1_spike_features'][:]

filename = 'data/emnist_test_x.h5'
with h5py.File(filename, 'r') as hf:
    test_x = hf['pool1_spike_features'][:]

print('Train data shape:{}'.format(train_x.shape))
print('Test data shape:{}'.format(test_x.shape))

train_x = list(train_x) #convert 2D numpy array to a list of 1D numpy arrays 
test_x = list(test_x)

Train data shape:(112799, 3630)
Test data shape:(18800, 3630)
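
As a rough check on why RAM usage matters here, the footprint of these arrays can be estimated from the shapes printed above. This is a back-of-the-envelope sketch that assumes 4-byte `float32` values; the dtype stored in the `.h5` files is not shown in this notebook, so treat the itemsize as an assumption.

```python
import numpy as np

train_shape = (112799, 3630)
test_shape = (18800, 3630)
bytes_per_value = np.dtype(np.float32).itemsize  # assumption: float32 storage


def footprint_gib(shape, itemsize):
    """Return the in-memory size of a dense array of `shape` in GiB."""
    num_values = int(np.prod(shape, dtype=np.int64))
    return num_values * itemsize / float(2 ** 30)


train_gib = footprint_gib(train_shape, bytes_per_value)
test_gib = footprint_gib(test_shape, bytes_per_value)
print('Train: {:.2f} GiB, Test: {:.2f} GiB'.format(train_gib, test_gib))
```

Under that assumption the training array alone occupies about 1.5 GiB, and it would double if the values were stored as `float64`.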


In [7]:
filename = 'data/emnist_train_y.pkl'
with open(filename, 'rb') as filehandle:
    train_y = pickle.load(filehandle)

filename = 'data/emnist_test_y.pkl'
with open(filename, 'rb') as filehandle:
    test_y = pickle.load(filehandle)

print('Train labels shape:{}'.format(train_y.shape))
print('Test labels shape:{}'.format(test_y.shape))

train_y = train_y.tolist() # convert the 1D NumPy label array to a list of Python ints
test_y = test_y.tolist()

Train labels shape:(112799,)
Test labels shape:(18800,)


# Using `.tfrecord` in `Tensorflow`

## Helper functions that return lists of floats/bytes/ints

In [8]:
def _byte_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def array_bytes_feature(value):
    """Returns a bytes_list holding the raw bytes of a NumPy array."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    # BytesList expects a list of byte strings, so wrap the serialized array in a list.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.tobytes()]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def array_floats_feature(value):
    """Returns a float_list from a NumPy array of floats (flattened to 1D)."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value.flatten()))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


In [9]:
def get_example_object(observation_id=0, data_x=None, data_y=None):
    """Builds a tf.train.Example holding one observation and its label."""
    feature = {
        'label': _int64_feature(data_y[observation_id]),
        'image_raw': array_floats_feature(data_x[observation_id]),
    }
    example_obj = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_obj

## Serializing a single Image
* The `get_example_object` function returns a [protocol buffer](https://developers.google.com/protocol-buffers) object (a `tf.train.Example`), which we then serialize.

In [10]:
example_obj = get_example_object(0, data_x=train_x, data_y=train_y)
print(example_obj)

features {
  feature {
    key: "image_raw"
    value {
      float_list {
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 12.0
        value: 0.0
        value: 0.0
        value: 12.0
        value: 12.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
    

* Serializing the example to a byte string

In [11]:
serialized_example_obj = example_obj.SerializeToString()
serialized_example_obj

'\n\xdfq\n\xccq\n\timage_raw\x12\xbeq\x12\xbbq\n\xb8q\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000A\x00\x000A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000A\x00\x00@A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@A\x00\x00@A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000

* Converting a serialized string back to a protocol buffer.

In [12]:
parsed_example_obj=tf.train.Example.FromString(serialized_example_obj)
parsed_example_obj

features {
  feature {
    key: "image_raw"
    value {
      float_list {
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 12.0
        value: 0.0
        value: 0.0
        value: 12.0
        value: 12.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 11.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
        value: 0.0
    

In [13]:
example_obj==parsed_example_obj

True
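
Beyond checking equality, the values stored in a parsed `tf.train.Example` can be read back through the protocol buffer fields directly. The sketch below uses a made-up toy example (label `7` and a 6-value vector) rather than the EMNIST data, and it relies only on the `tf.train.Example` message API, so it should not need a session.

```python
import numpy as np
import tensorflow as tf

# Toy example: a label plus a 6-value feature vector (values are made up).
toy_features = np.arange(6, dtype=np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    'image_raw': tf.train.Feature(
        float_list=tf.train.FloatList(value=toy_features.tolist())),
}))

# Round trip through the serialized byte string.
restored = tf.train.Example.FromString(example.SerializeToString())

# Each feature is looked up by key, then read from the list type it was stored in.
feature_map = restored.features.feature
label = feature_map['label'].int64_list.value[0]
image = np.array(feature_map['image_raw'].float_list.value, dtype=np.float32)
```

Reading a feature through the wrong list type (for example `bytes_list` for `image_raw`) returns an empty list rather than raising an error, so the key and list type must match what was written.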

* `tf.io.TFRecordWriter()` takes the same `compression_type` argument as `tf.data.TFRecordDataset`.
* [Source](https://github.com/tensorflow/tensorflow/issues/32075#issuecomment-528117457)
* [tf.io.TFRecordWriter](https://www.tensorflow.org/api_docs/python/tf/io/TFRecordWriter)
* [tf.data.TFRecordDataset](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset)
* If you don't use compression, your `.tfrecord` files might end up much larger than the original dataset. If you are storing images, simply read the already-compressed image file with `image_string = open(filename, 'rb').read()` and store those bytes; see [this comment](https://github.com/tensorflow/tensorflow/issues/9675#issuecomment-302745553).
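
To get a feel for how much compression can help on sparse feature vectors like the one printed earlier, here is a small sketch using Python's standard `zlib` module on a synthetic, mostly-zero `float32` vector. The data is made up, and `zlib` is used here only as a stand-in for the `'ZLIB'`/`'GZIP'` options of the TFRecord writer; actual file sizes will differ.

```python
import zlib

import numpy as np

rng = np.random.RandomState(0)

# Synthetic vector (assumption): 3630 values, about 200 of them non-zero.
features = np.zeros(3630, dtype=np.float32)
active = rng.choice(features.size, size=200, replace=False)
features[active] = rng.randint(1, 13, size=200)

raw = features.tobytes()
compressed = zlib.compress(raw)

# Decompression recovers the exact bytes, so no information is lost.
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float32)
ratio = len(raw) / float(len(compressed))
print('raw: {} bytes, compressed: {} bytes'.format(len(raw), len(compressed)))
```

The long runs of zeros are what make this kind of data compress well; dense, noisy data would see a much smaller benefit.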

In [14]:
tf.io.TFRecordOptions.compression_type_map

{0: '', 1: 'ZLIB', 2: 'GZIP'}

## Make `.tfrecord` for the training data

In [21]:
t1 = time.time()
num_partitions = 40  # write the NumPy array to disk as `num_partitions` shards of .tfrecord files
folder_name = str(num_partitions)+'_tfrecords'
if os.path.exists(folder_name):
    shutil.rmtree(folder_name)

os.mkdir(folder_name)
train_folder_name = folder_name + '/train' 
os.mkdir(train_folder_name)
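partitions = np.linspace(0,len(train_x), num_partitions+1)
# NOTE: range() below already excludes its end point, so this moves the last
# boundary to len(train_x)-1 and the final observation is not written to any shard.
partitions[-1] -= 1
partitions = partitions.astype(np.int32).tolist()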
print(partitions)
for partition_id in range(num_partitions):
    with tf.io.TFRecordWriter(train_folder_name+'/EMNIST_train_strings_' + str(partition_id) + '.tfrecord', 'GZIP') as tfwriter:
        for observation_id in range(partitions[partition_id],partitions[partition_id+1]):
            example_obj = get_example_object(observation_id, data_x=train_x, data_y=train_y)
            tfwriter.write(example_obj.SerializeToString())
print('Time taken:{}'.format(time.time()-t1))  
            

[0, 2819, 5639, 8459, 11279, 14099, 16919, 19739, 22559, 25379, 28199, 31019, 33839, 36659, 39479, 42299, 45119, 47939, 50759, 53579, 56399, 59219, 62039, 64859, 67679, 70499, 73319, 76139, 78959, 81779, 84599, 87419, 90239, 93059, 95879, 98699, 101519, 104339, 107159, 109979, 112798]
Time taken:73.6771140099
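
The shard boundaries printed above come from `np.linspace`, and each shard is written with `range(start, stop)`, which is half-open. The sketch below (with a hypothetical helper `shard_bounds`) shows that leaving the last boundary at the dataset length already covers every observation exactly once, whereas the `partitions[-1] -= 1` step in the cell above leaves the last index (112798) out.

```python
import numpy as np


def shard_bounds(num_items, num_shards):
    """Return num_shards + 1 integer boundaries from 0 to num_items."""
    return np.linspace(0, num_items, num_shards + 1).astype(np.int64).tolist()


bounds = shard_bounds(112799, 40)

# Shard i holds the half-open index range [bounds[i], bounds[i + 1]).
shard_sizes = [stop - start for start, stop in zip(bounds[:-1], bounds[1:])]
covered = sum(shard_sizes)
print('first boundaries: {}, last boundary: {}'.format(bounds[:3], bounds[-1]))
print('observations covered: {}'.format(covered))
```

Because the boundaries are truncated to integers, shard sizes differ by at most one observation.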


## Make `.tfrecord` for the testing data

In [22]:
t1 = time.time()
num_partitions = 2
test_folder_name = folder_name + '/test'
if os.path.exists(test_folder_name):
    shutil.rmtree(test_folder_name)
os.mkdir(test_folder_name)
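partitions = np.linspace(0,len(test_x), num_partitions+1)
# NOTE: as with the training shards, range() excludes its end point, so this
# leaves the final test observation out of the last shard.
partitions[-1] -= 1
partitions = partitions.astype(np.int32).tolist()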
print(partitions)
for partition_id in range(num_partitions):
    with tf.io.TFRecordWriter(test_folder_name+'/EMNIST_test_strings_' + str(partition_id) + '.tfrecord', 'GZIP') as tfwriter:
        for observation_id in range(partitions[partition_id],partitions[partition_id+1]):
            example_obj = get_example_object(observation_id, data_x=test_x, data_y=test_y)
            tfwriter.write(example_obj.SerializeToString())
print('Time taken:{}'.format(time.time()-t1))  
            

[0, 9400, 18799]
Time taken:12.3969511986


# Restart the notebook to free up the `GPU` and `RAM`.

In [23]:
IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel

{'restart': True, 'status': 'ok'}

# Directly storing `2D NumPy` arrays
* The `Tensorflow` TFRecord documentation suggests storing images as serialized strings, with metadata (height, width, channels, etc.) as extra features. What if you have large 2D `NumPy` arrays that you want to store as they are, without flattening them as we did above?
* [Source](https://stackoverflow.com/questions/47861084/how-to-store-numpy-arrays-as-tfrecord)
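
The core of that approach can be sketched with `NumPy` alone: keep the raw bytes of the array together with its shape and dtype, then rebuild it with `np.frombuffer` and `reshape`. In the sketch below the dictionary is only a stand-in for a `tf.train.Example` (a bytes feature plus an int64 list for the shape), and the helper names `array_to_record` and `record_to_array` are hypothetical.

```python
import numpy as np


def array_to_record(array):
    """Pack an array into raw bytes plus the metadata needed to rebuild it."""
    return {
        'raw': array.tobytes(),
        'shape': list(array.shape),
        'dtype': array.dtype.name,
    }


def record_to_array(record):
    """Rebuild the array from its raw bytes, dtype and shape."""
    flat = np.frombuffer(record['raw'], dtype=np.dtype(record['dtype']))
    return flat.reshape(record['shape'])


# Toy 2D array (made up) to check the round trip.
original = np.arange(12, dtype=np.float32).reshape(3, 4)
record = array_to_record(original)
restored = record_to_array(record)
```

Without the stored shape and dtype the byte string is ambiguous, which is why the metadata has to travel with the data.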