# Reading Data in TensorFlow

There are three main methods of getting data into a TensorFlow program:

* **Feeding**: Python code provides the data when running each step.
* **Preloaded data**: a constant or variable in the TensorFlow graph holds all the data (for small data sets).
* **Reading from files**: an input pipeline reads the data from files at the beginning of a TensorFlow graph.

## Feeding

TensorFlow's feed mechanism lets you:

* Inject data into any Tensor in a computation graph.
* Temporarily replace the output of an operation with a tensor value.

While you can replace any Tensor with feed data, including variables and constants, the best practice is to use a [placeholder op](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#placeholder) node. A placeholder exists solely to serve as the target of feeds. 

Feeding works well for small examples and for interacting with data directly from Python.

#### Feeding Example I

In [1]:
import tensorflow as tf
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.mul(input1, input2)
with tf.Session() as sess:
    print(sess.run([output], feed_dict={input1: [7.], input2: [2.]}))

[array([ 14.], dtype=float32)]


#### Feeding Example II - MNIST 
Full code is available at: [fully_connected_feed.py](https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/examples/tutorials/mnist/fully_connected_feed.py)

In [None]:
def placeholder_inputs(batch_size):
    images_placeholder = tf.placeholder(tf.float32, shape=(batch_size, mnist.IMAGE_PIXELS))
    labels_placeholder = tf.placeholder(tf.int32, shape=(batch_size))
    return images_placeholder, labels_placeholder
    
def fill_feed_dict(data_set, images_pl, labels_pl):
    images_feed, labels_feed = data_set.next_batch(FLAGS.batch_size, FLAGS.fake_data)
    feed_dict = {
        images_pl: images_feed,
        labels_pl: labels_feed,
    }
    return feed_dict

In [None]:
def run_training():
    data_sets = input_data.read_data_sets(FLAGS.input_data_dir, FLAGS.fake_data)
    with tf.Graph().as_default():
        images_placeholder, labels_placeholder = placeholder_inputs(FLAGS.batch_size)
        #...
        init = tf.global_variables_initializer()
        sess = tf.Session()
        sess.run(init)
        for step in xrange(FLAGS.max_steps):
            feed_dict = fill_feed_dict(data_sets.train, images_placeholder, labels_placeholder)
            _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
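
The feeding loop above depends on a data set object that hands out one mini-batch per step. The sketch below shows that idea in plain Python, without TensorFlow. `TinyDataSet` and the string keys `"images"` and `"labels"` are hypothetical stand-ins for the MNIST data set object and the placeholder tensors; the real `next_batch` may differ in detail.

```python
import random


class TinyDataSet(object):
    """Hypothetical in-memory data set that serves shuffled mini-batches."""

    def __init__(self, images, labels, seed=0):
        self._images = list(images)
        self._labels = list(labels)
        self._rng = random.Random(seed)
        self._order = list(range(len(self._images)))
        self._rng.shuffle(self._order)
        self._position = 0

    def next_batch(self, batch_size):
        # Reshuffle and start a new epoch when too few examples remain.
        if self._position + batch_size > len(self._order):
            self._rng.shuffle(self._order)
            self._position = 0
        picked = self._order[self._position:self._position + batch_size]
        self._position += batch_size
        return ([self._images[i] for i in picked],
                [self._labels[i] for i in picked])


def fill_feed_dict(data_set, images_key, labels_key, batch_size):
    # Map each feed target to the batch that replaces it for one step.
    images_feed, labels_feed = data_set.next_batch(batch_size)
    return {images_key: images_feed, labels_key: labels_feed}


data_set = TinyDataSet(images=[[i, i] for i in range(6)], labels=range(6))
feed_dict = fill_feed_dict(data_set, "images", "labels", batch_size=2)
```

Each call to `fill_feed_dict` yields a fresh dictionary, which is why the training loop rebuilds it on every step.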

## Preloaded Data
This is only used for small data sets that can be loaded entirely in memory. There are two approaches:

* Store the data in a constant.
* Store the data in a variable that you initialize and then never change.

#### How to Load Images in Python
##### Using TensorFlow

In [None]:
import tensorflow as tf
import numpy as np
from PIL import Image
import glob
data_dir = '/home/chentao/Pictures/'
filename_list = glob.glob('%s*.jpg' % (data_dir))
filename_queue = tf.train.string_input_producer(filename_list) 
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
img = tf.image.decode_jpeg(value) # use png or jpg decoder based on your files.

init_op = tf.global_variables_initializer()
sess = tf.InteractiveSession()

sess.run(init_op)

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord, sess=sess)

for i in range(len(filename_list)):  # length of your filename list
    image = img.eval()  # here is your image Tensor
    Image.fromarray(np.asarray(image)).show()

coord.request_stop()
coord.join(threads)

##### Using OpenCV

In [None]:
import cv2
import glob
data_dir = '/home/chentao/Pictures/'
for filename in glob.glob('%s*.jpg' % (data_dir)):
    image = cv2.imread(filename)
    # Only for display
    cv2.imshow(filename, image)
    cv2.waitKey(1000)

##### Using PIL

In [None]:
import numpy as np
from PIL import Image
import glob
data_dir = '/home/chentao/Pictures/'
for filename in glob.glob('%s*.jpg' % (data_dir)):
    image = np.asarray(Image.open(filename))
    Image.fromarray(image).show()

##### Using scikit-image

In [None]:
import skimage.io as ski_io
from PIL import Image
import glob
data_dir = '/home/chentao/Pictures/'
for filename in glob.glob('%s*.jpg' % (data_dir)):
    image = ski_io.imread(filename)
    Image.fromarray(image).show()

##### Using scipy

In [None]:
from scipy import misc
from scipy import ndimage
from PIL import Image
import glob
data_dir = '/home/chentao/Pictures/'
for filename in glob.glob('%s*.jpg' % (data_dir)):
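    # Either call works in older SciPy releases. Both imread functions were
    # deprecated in SciPy 1.0 and removed in SciPy 1.2.
    # image = misc.imread(filename)
    image = ndimage.imread(filename)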
    Image.fromarray(image).show()

#### Load Images into NHWC format

In [None]:
import tensorflow as tf
from PIL import Image
import glob
import numpy as np

data_dir = '/home/chentao/Pictures/'
images = []
for filename in glob.glob('%s*.jpg'%(data_dir)):
    image = np.asarray(Image.open(filename))
    images.append(image)
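# Method I: stack into one tensor of shape [N, H, W, C].
# All images must share the same height, width, and channel count.
image_tensor = tf.pack(images, axis=0)
# show the first image
sess = tf.InteractiveSession()
Image.fromarray(image_tensor[0].eval()).show()
# Method II: stack with NumPy instead (uses the original Python list).
image_array = np.array(images)
Image.fromarray(image_array[0]).show()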

#### Concatenate multiple images on the channel dimension

In [None]:
import tensorflow as tf
from PIL import Image
import glob
import numpy as np

data_dir = '/home/chentao/Pictures/'
images = []
for filename in glob.glob('%s*.jpg'%(data_dir)):
    image = np.asarray(Image.open(filename))
    images.append(image)
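# Method I: concatenate along axis 2, the channel dimension.
# In r0.12, tf.concat takes the axis as its first argument.
stacked_tensor = tf.concat(2, images)
sess = tf.InteractiveSession()
Image.fromarray(stacked_tensor[:,:,:3].eval()).show()
# Method II: concatenate with NumPy instead (uses the original Python list).
stacked_array = np.concatenate(images, axis=2)
Image.fromarray(stacked_array[:,:,:3]).show()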
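
To make the resulting shapes concrete, here is a small NumPy-only sketch of the two layouts above. The synthetic 4x6 RGB images are an assumption standing in for files loaded from disk; any set of equally sized images behaves the same way.

```python
import numpy as np

# Three synthetic images of shape [height, width, channels] = [4, 6, 3].
images = [np.full((4, 6, 3), fill, dtype=np.uint8) for fill in (0, 128, 255)]

# NHWC: a new leading axis indexes the images.
nhwc = np.stack(images, axis=0)

# Channel concatenation: the channel axis grows, the image count axis does not exist.
channel_stacked = np.concatenate(images, axis=2)

# The first three channels are the first image again.
first_image = channel_stacked[:, :, :3]
```

Stacking gives shape `(3, 4, 6, 3)`, while channel concatenation gives `(4, 6, 9)`.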

#### Preloaded Data - Using Constants

In [None]:
training_images = #...
training_labels = #...
with tf.Session():
    input_images = tf.constant(training_images)
    input_labels = tf.constant(training_labels)
    #...

#### Preloaded Data - Using Variables

In [None]:
training_images = #...
training_labels = #...
with tf.Session() as sess:
    images_initializer = tf.placeholder(dtype=training_images.dtype,
                                        shape=training_images.shape)
    label_initializer = tf.placeholder(dtype=training_labels.dtype,
                                       shape=training_labels.shape)
    input_images = tf.Variable(images_initializer, trainable=False,
                               collections=[])
    input_labels = tf.Variable(label_initializer, trainable=False,
                               collections=[])
    #...
    sess.run(input_images.initializer, feed_dict={images_initializer: training_images})
    sess.run(input_labels.initializer, feed_dict={label_initializer: training_labels})

#### Preloaded Data - Generate Batch
Full code is available at [fully_connected_preloaded.py](https://www.tensorflow.org/code/tensorflow/examples/how_tos/reading_data/fully_connected_preloaded.py) and [fully_connected_preloaded_var.py](https://www.tensorflow.org/code/tensorflow/examples/how_tos/reading_data/fully_connected_preloaded_var.py)

In [None]:
image, label = tf.train.slice_input_producer([input_images, input_labels],    
                                             num_epochs=FLAGS.num_epochs,
                                             shuffle=True)
label = tf.cast(label, tf.int32)
images, labels = tf.train.batch([image, label], 
                                batch_size=FLAGS.batch_size)
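
As a rough mental model of the two calls above, the following plain-Python sketch slices paired lists one example at a time, reshuffling every epoch, and then groups the slices into batches. `slice_producer` and `make_batches` are hypothetical names, not TensorFlow APIs. The sketch drops a final partial batch, which is assumed here to match the default behaviour of `tf.train.batch`.

```python
import random


def slice_producer(columns, num_epochs, shuffle=True, seed=0):
    """Yield one aligned slice (image, label, ...) at a time for num_epochs passes."""
    rng = random.Random(seed)
    size = len(columns[0])
    for _ in range(num_epochs):
        order = list(range(size))
        if shuffle:
            rng.shuffle(order)
        for index in order:
            yield tuple(column[index] for column in columns)


def make_batches(slices, batch_size):
    """Group consecutive slices into batches, dropping a final partial batch."""
    current = []
    for item in slices:
        current.append(item)
        if len(current) == batch_size:
            # Transpose [(image, label), ...] into ([images], [labels]).
            yield tuple(list(column) for column in zip(*current))
            current = []


input_images = ["img%d" % i for i in range(5)]
input_labels = list(range(5))
batches = list(make_batches(
    slice_producer([input_images, input_labels], num_epochs=2), batch_size=2))
```

Images and labels stay aligned because they are sliced with the same index before shuffling takes effect on the order.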

## Reading from Files
A typical pipeline for reading records from files has the following stages:

* The list of filenames
* Optional filename shuffling
* Optional epoch limit
* Filename queue
* A Reader for the file format
* A decoder for a record read by the reader
* Optional preprocessing
* Example queue


To get the list of filenames, use a constant string Tensor (like `["file0", "file1"] or [("file%d" % i) for i in range(2)]`), [glob.glob()](https://docs.python.org/2/library/glob.html), or [tf.train.match_filenames_once()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#match_filenames_once).

Pass the list of filenames to [tf.train.string_input_producer](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#string_input_producer), which creates a FIFO queue that holds the filenames until the reader needs them.

Select the [reader](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#readers) that matches your input file format and pass the filename queue to the reader's read method. The read method outputs a key identifying the file and record (useful for debugging if you have some weird records), and a scalar string value. Use one (or more) of the decoder and conversion ops to decode this string into the tensors that make up an [example](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#example-protocol-buffer).
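
The filename queue and reader stages can be pictured with a plain-Python sketch. `line_reader` is a hypothetical stand-in for a text reader. The `filename:line_number` key format is an assumption about how a line-based reader identifies records. The temporary files exist only so the example runs on its own.

```python
import os
import tempfile
from collections import deque


def line_reader(filename_queue):
    """Consume filenames from the queue and yield (key, value) per record."""
    while filename_queue:
        filename = filename_queue.popleft()
        with open(filename) as handle:
            for line_number, line in enumerate(handle, start=1):
                # The key identifies the file and record; the value is the raw string.
                yield "%s:%d" % (filename, line_number), line.rstrip("\n")


# Write two small files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
filenames = []
for i in range(2):
    path = os.path.join(tmp_dir, "file%d.csv" % i)
    with open(path, "w") as handle:
        handle.write("%d,a\n%d,b\n" % (i, i))
    filenames.append(path)

records = list(line_reader(deque(filenames)))
```

The value is still an undecoded string at this point; turning it into typed columns is the decoder's job.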

#### CSV Files
Two key functions:
* [tf.TextLineReader()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#TextLineReader)
* [tf.decode_csv()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#decode_csv)

Remember to call [tf.train.start_queue_runners()](https://www.tensorflow.org/versions/r0.12/api_docs/python/train.html#start_queue_runners) to populate the queue before you call run or eval to execute the read. Otherwise read will block while it waits for filenames from the queue.

In [None]:
import tensorflow as tf
filename_queue = tf.train.string_input_producer(['csv_data/file0.csv',             
                                                 'csv_data/file1.csv'])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
                                  value, record_defaults=record_defaults)
features = tf.pack([col1, col2, col3, col4])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(2000):
        key_in,value_in,example,label = sess.run([key, value, features, col5])
        print('key:',key_in, ' value:',value_in, '  example:',list(example), ' label:',label)

    coord.request_stop()
    coord.join(threads)
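
The role of `record_defaults` can be shown with a plain-Python sketch: each default supplies both the column type and the value used when a field is empty. `decode_csv_line` is a hypothetical helper that only approximates the behaviour described for `tf.decode_csv`; it is not the library implementation.

```python
import csv


def decode_csv_line(line, record_defaults):
    """Convert one CSV line into typed columns, filling empty fields with defaults."""
    fields = next(csv.reader([line]))
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d fields, got %d"
                         % (len(record_defaults), len(fields)))
    columns = []
    for field, default in zip(fields, record_defaults):
        default_value = default[0]
        if field == "":
            columns.append(default_value)
        else:
            # The type of the default decides how the field is parsed.
            columns.append(type(default_value)(field))
    return columns


record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = decode_csv_line("3,4,,6,1", record_defaults)
features = [col1, col2, col3, col4]
label = col5
```

Here the empty third field falls back to the default of `1`, and the last column is used as the label, as in the TensorFlow example above.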

#### Fixed Length Records
To read binary files in which each record is a fixed number of bytes, use [tf.FixedLengthRecordReader](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#FixedLengthRecordReader) with the [tf.decode_raw](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#decode_raw) operation.

Full code is available at:  [cifar10_input.py](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_input.py)

In [None]:
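import os
import tensorflow as tf

# Record layout. These values match the CIFAR-10 binary format read by
# cifar10_input.py; adjust them for your own data. data_dir is assumed to be
# defined elsewhere and to point at the directory holding the .bin files.
label_bytes = 1
height, width, depth = 32, 32, 3
image_bytes = height * width * depth
record_bytes = label_bytes + image_bytes

filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in range(1, 6)]
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
key, value = reader.read(filename_queue)
# Decode the raw string into a uint8 vector. Note that record_bytes is rebound
# here from the byte count to the decoded tensor.
record_bytes = tf.decode_raw(value, tf.uint8)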

# Suppose the first bytes represent the label
label = tf.cast(tf.slice(record_bytes, [0], [label_bytes]), tf.int32)

# Suppose the remaining bytes after the label represent the image, 
# which we reshape from [depth * height * width] to [depth, height, width].
depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]), [depth, height, width])

# Convert from [depth, height, width] to [height, width, depth].
image = tf.transpose(depth_major, [1, 2, 0])
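
The slice, reshape, and transpose steps can be checked on a tiny synthetic record with NumPy. The 2x2 image with 3 channels and the 1-byte label are assumptions chosen to keep the numbers readable. `decode_record` is a hypothetical helper, not part of `cifar10_input.py`.

```python
import numpy as np

label_bytes = 1
depth, height, width = 3, 2, 2
image_bytes = depth * height * width
record_bytes = label_bytes + image_bytes


def decode_record(raw):
    """Split one fixed-length record into a label and a [height, width, depth] image."""
    data = np.frombuffer(raw, dtype=np.uint8)
    label = int(data[0])
    # The image is stored depth-major: all of channel 0, then channel 1, and so on.
    depth_major = data[label_bytes:label_bytes + image_bytes].reshape(depth, height, width)
    image = depth_major.transpose(1, 2, 0)
    return label, image


# Two back-to-back records, as they would appear in a binary file.
file_contents = (bytes([7]) + bytes(range(0, 12)) +
                 bytes([3]) + bytes(range(12, 24)))
raw_records = [file_contents[start:start + record_bytes]
               for start in range(0, len(file_contents), record_bytes)]
decoded = [decode_record(raw) for raw in raw_records]
```

After the transpose, `image[h, w, d]` holds the byte that was stored at position `d * height * width + h * width + w` within the image part of the record.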

#### Standard TensorFlow format
Convert your data into a supported format, the **[TFRecords File](https://www.tensorflow.org/versions/r0.10/api_docs/python/python_io.html#tfrecords-format-details)**.

A TFRecords file contains [tf.train.Example protocol buffers](https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/core/example/example.proto), which contain [Features](https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/core/example/feature.proto) as a field. An **Example** always contains **Features**. **Features** contains a **map** of strings to **Feature**. Finally, a **Feature** contains one of a **FloatList**, a **BytesList**, or an **Int64List**.

To convert your data into a TFRecords file, write a small program that reads your data, stores it in an **Example** protocol buffer, **serializes** the protocol buffer to a string, and then **writes** the string to a TFRecords file using the [tf.python_io.TFRecordWriter](https://www.tensorflow.org/versions/r0.12/api_docs/python/python_io.html#TFRecordWriter) class.

##### What an example looks like

In [None]:
# construct the Example proto object
example = tf.train.Example(
    # Example contains a Features proto object
    features=tf.train.Features(
        # Features contains a map of string to Feature proto objects
        feature={
        # A Feature contains exactly one of an int64_list,
        # a float_list, or a bytes_list
        'label': tf.train.Feature(int64_list = tf.train.Int64List(value = [label])),
        'image': tf.train.Feature(int64_list = tf.train.Int64List(value = features.astype('int64'))),
}))

##### Convert to TFRecord
Full code is available at [convert_to_records.py](https://www.tensorflow.org/code/tensorflow/examples/how_tos/reading_data/convert_to_records.py)

In [None]:
def _int64_feature(value):
    return tf.train.Feature(int64_list = tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list = tf.train.BytesList(value=[value]))

filename = 'data.tfrecords'
writer = tf.python_io.TFRecordWriter(filename)
for index in range(dataset_size):
    image_raw = images[index].tostring()
    example = tf.train.Example(features=tf.train.Features(feature={'height': _int64_feature(rows),
                                                                   'width': _int64_feature(cols),
                                                                   'depth': _int64_feature(depth),
                                                                   'label': _int64_feature(int(labels[index])),
                                                                   'image_raw': _bytes_feature(image_raw)}))
    writer.write(example.SerializeToString())
writer.close()

##### Read TFRecord Method I

In [None]:
import tensorflow as tf

filename = "data.tfrecords"
for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)

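    # traverse the Example format to get data
    # (assuming the file was written by the conversion code above, where
    # 'image_raw' is stored as a bytes_list and 'label' as an int64_list)
    image = example.features.feature['image_raw'].bytes_list.value[0]
    label = example.features.feature['label'].int64_list.value[0]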

##### Read TFRecord Method II
Three key functions:
* [tf.TFRecordReader()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#TFRecordReader)
* [tf.parse_single_example()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#parse_single_example)
* [tf.decode_raw()](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#decode_raw)

It is important to remember that TensorFlow’s graphs contain **state**. It is this state that allows the **TFRecordReader** to remember the location of the **tfrecord** it’s reading and always return the next one. This is why for almost all TensorFlow work we need to initialize the graph. We can use the helper function **tf.global_variables_initializer()**, which constructs an op that initializes the state on the graph when you run it.

Full code is available at [fully_connected_reader.py](https://www.tensorflow.org/code/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py)


In [None]:
filename = "data.tfrecords"
filename_queue = tf.train.string_input_producer([filename], num_epochs=num_epochs)

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(serialized_example,
                                   features={'image_raw': tf.FixedLenFeature([], tf.string),
                                             'label': tf.FixedLenFeature([], tf.int64),
                                   })
image = tf.decode_raw(features['image_raw'], tf.uint8)
label = tf.cast(features['label'], tf.int32)

##### Preprocessing
You can then do any preprocessing of these examples you want. This would be any processing that doesn't depend on trainable parameters. Examples include normalization of your data, picking a random slice, adding noise or distortions, etc. 

Full code is available at [cifar10_input.py](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_input.py)

##### Batching
At the end of the pipeline we use another queue to batch together examples for training, evaluation, or inference.

Note:

* `tf.train.shuffle_batch(tensors, enqueue_many=False)`: `tensors` is a single example.
* `tf.train.shuffle_batch(tensors, enqueue_many=True)`: `tensors` is a batch of examples.

In [None]:
def read_my_file_format(filename_queue):
    reader = tf.SomeReader()
    key, record_string = reader.read(filename_queue)
    example, label = tf.some_decoder(record_string)
    processed_example = some_processing(example)
    return processed_example, label
# Method 1
# filenames is a list of image files like ['data/1.jpg', 'data/2.jpg',...]
# or a list of tfrecord files, csv files, binary files, etc.

def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch([example, label], 
                                                        batch_size=batch_size, 
                                                        capacity=capacity,
                                                        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
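
The effect of `capacity` and `min_after_dequeue` is easier to see in a single-threaded, plain-Python sketch of a shuffling buffer. `shuffle_batches` is a hypothetical helper and only an approximation: the real queue is filled by background threads, while this version refills the buffer before every dequeue.

```python
import random


def shuffle_batches(examples, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield batches drawn at random from a bounded buffer of examples."""
    if capacity <= min_after_dequeue:
        raise ValueError("capacity must be larger than min_after_dequeue")
    rng = random.Random(seed)
    source = iter(examples)
    buffer = []
    exhausted = False
    batch = []
    while True:
        # Refill up to capacity, so at least min_after_dequeue elements
        # remain after each dequeue while input is still available.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Dequeue one element chosen uniformly at random.
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A final partial batch is dropped in this sketch.


batches = list(shuffle_batches(range(20), batch_size=4,
                               capacity=10, min_after_dequeue=5))
```

A larger `min_after_dequeue` mixes examples more thoroughly, at the cost of more memory and a slower start while the buffer fills.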

In [None]:
# Method 2
# filenames is a list of image files like ['data/1.jpg', 'data/2.jpg',...]
# labels is a list of integer labels, one per file.
# NUM_CHANNELS, IMAGE_HEIGHT, and IMAGE_WIDTH are constants assumed to be
# defined elsewhere.
def input_pipeline(filenames, labels, batch_size, num_epochs=None):
    image_files = tf.convert_to_tensor(filenames, dtype=tf.string)
    labels = tf.convert_to_tensor(labels, dtype=tf.int32)
    
    train_input_queue = tf.train.slice_input_producer([image_files, labels],
                                                      num_epochs=num_epochs,
                                                      shuffle=True)
    file_content = tf.read_file(train_input_queue[0])
    train_image = tf.image.decode_jpeg(file_content, channels=NUM_CHANNELS)
    train_label = train_input_queue[1]
    train_image.set_shape([IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS])
    
    train_image_batch, train_label_batch = tf.train.batch([train_image, train_label],
                                                          batch_size=batch_size)
    return train_image_batch, train_label_batch

##### Prefetch by QueueRunner
Many of the **tf.train** functions used above add [QueueRunner](https://www.tensorflow.org/versions/r0.12/api_docs/python/train.html#QueueRunner) objects to your graph. You must call [tf.train.start_queue_runners](https://www.tensorflow.org/versions/r0.12/api_docs/python/train.html#start_queue_runners) before running any training or inference steps; otherwise the first dequeue will hang forever. The call starts threads that run the input pipeline and fill the example queue, so that the dequeue to get the examples will succeed. It is best combined with a [tf.train.Coordinator](https://www.tensorflow.org/versions/r0.12/api_docs/python/train.html#Coordinator) to cleanly shut down these threads when there are errors. If you set a limit on the number of epochs, the pipeline uses an epoch counter that must be initialized before the threads start. In r0.12 this counter is a local variable, so `tf.local_variables_initializer()` has to be run in addition to the global initializer.

In [None]:
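# The epoch counter created by num_epochs is a local variable,
# so initialize local variables as well as global ones.
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())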
sess = tf.Session()
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
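
The same start, consume, stop, join pattern can be sketched with the Python standard library alone. Here `threading.Event` plays the role of the Coordinator and `queue.Queue` the role of the example queue. `enqueue_worker` and `run_pipeline` are hypothetical names, and the sketch assumes Python 3.

```python
import queue
import threading


def enqueue_worker(example_queue, stop_event, items):
    """Producer thread: keep the queue filled until asked to stop."""
    for item in items:
        while not stop_event.is_set():
            try:
                example_queue.put(item, timeout=0.05)
                break
            except queue.Full:
                continue
        if stop_event.is_set():
            return


def run_pipeline(items, num_steps, capacity=4):
    example_queue = queue.Queue(maxsize=capacity)
    stop_event = threading.Event()
    thread = threading.Thread(target=enqueue_worker,
                              args=(example_queue, stop_event, items))
    thread.start()  # without this, the first get() below would never succeed
    consumed = []
    try:
        for _ in range(num_steps):
            # Stands in for a training step that dequeues one example.
            consumed.append(example_queue.get(timeout=1.0))
    finally:
        # When done, ask the thread to stop, then wait for it to finish.
        stop_event.set()
        thread.join()
    return consumed


consumed = run_pipeline(range(100), num_steps=10)
```

The `finally` block mirrors the TensorFlow loop above: the stop request is issued even if a step raises, so the producer thread never outlives the consumer.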

##### Sparse input data
SparseTensors don't play well with queues. If you use SparseTensors you have to decode the string records using [tf.parse_example](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#parse_example) after batching (instead of using [tf.parse_single_example](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops.html#parse_single_example) before batching).



##### In conclusion

First we create the graph. It will have a few pipeline stages that are connected by queues. 

The first stage will generate filenames to read and enqueue them in the filename queue. 

The second stage consumes filenames (using a Reader), produces examples, and enqueues them in an example queue. Depending on how you have set things up, you may actually have a few independent copies of the second stage, so that you can read from multiple files in parallel. 

At the end of these stages is an enqueue operation, which enqueues into a queue that the next stage dequeues from. We want to start threads running these enqueuing operations, so that our training loop can dequeue examples from the example queue.

## References
* [TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues](https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/)
* [TensorFlow Reading Data](https://www.tensorflow.org/versions/r0.12/how_tos/reading_data/index.html)