# Reading data
There are three ways to load data into a TensorFlow program:
- <font color='red'>Feeding</font> : Python code feeds the data at each training step.
- <font color='red'>Reading from files</font> : an input pipeline at the beginning of the TensorFlow graph reads the data from files.
- <font color='red'>Preloaded data</font> : a constant or variable in the TensorFlow graph holds all the data (for small datasets).

## Feeding
- The basic way to load data.
- Declare a tf.placeholder in the TensorFlow graph, then feed Python data (lists, NumPy arrays, ...) through sess.run() or eval().

In [1]:
import tensorflow as tf

In [2]:
with tf.Session() as sess:
    input = tf.placeholder(tf.float32)
    multiplier = input * 10
    print(multiplier.eval(feed_dict={input: 5}))

50.0


## Reading from files
Implementing an input pipeline that reads from files involves the following stages:
1. The list of filenames
2. Optional filename shuffling
3. Optional epoch limit
4. Filename queue
5. A reader for the file format
6. A decoder for a record read by the reader
7. Optional preprocessing
8. Example queue

#### Filenames, shuffling, and epoch limits
- filenames :
    + A list of filenames.
    + It can be written as ["file0", "file1"] or [("file%d" % i) for i in range(2)].
    + It can also be produced with tf.train.match_filenames_once.
- tf.train.string_input_producer :
    + Creates a FIFO queue that holds the filenames.
    + Options include shuffling and a maximum number of epochs.
    + For each epoch, the queue runner adds every element of filenames to the queue once.
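
The behaviour of the filename queue can be mimicked in plain Python. The sketch below is not TensorFlow code: `filename_epochs` is a hypothetical helper, written only for illustration, that yields every filename once per epoch (optionally shuffled within the epoch) and stops after `num_epochs`. That is roughly what the queue runner behind `tf.train.string_input_producer` enqueues.

```python
import random


def filename_epochs(filenames, num_epochs=None, shuffle=True, seed=None):
    """Yield every filename once per epoch (hypothetical helper, plain Python)."""
    rng = random.Random(seed)
    epoch = 0
    # num_epochs=None means "cycle forever", mirroring the optional epoch limit.
    while num_epochs is None or epoch < num_epochs:
        order = list(filenames)
        if shuffle:
            rng.shuffle(order)  # shuffling happens within a single epoch
        for name in order:
            yield name
        epoch += 1


filenames = ["file%d" % i for i in range(2)]
produced = list(filename_epochs(filenames, num_epochs=3, seed=0))
```

Each consecutive pair in `produced` contains both filenames, so every file is seen exactly once per epoch regardless of the shuffle order.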

#### File formats
- Create the Reader that matches the input file format.

#### CSV files
- tf.TextLineReader, tf.decode_csv : read files in CSV (comma-separated value) format
- tf.TextLineReader : reads a text file line by line
- tf.decode_csv : decodes a CSV line
    + record_defaults : determines the type of the resulting tensors and sets the default value to use if a value is missing in the input string.

- Always call tf.train.start_queue_runners before calling run() or eval(); otherwise the read blocks while it waits for filenames from the queue.

In [5]:
filename_queue = tf.train.string_input_producer(["../Species/data/train_labels.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1]]
col1, col2 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col2])

    coord.request_stop()
    coord.join(threads)
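
To make the role of `record_defaults` concrete, here is a plain-Python sketch. It is not the TensorFlow op: `decode_csv_line` is a hypothetical helper that only imitates the two jobs described above (each default fixes the column type and fills in a missing value), and it ignores quoting and other CSV details.

```python
def decode_csv_line(line, record_defaults):
    """Plain-Python imitation of how record_defaults is used (hypothetical helper)."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d columns, got %d"
                         % (len(record_defaults), len(fields)))
    columns = []
    for field, (default,) in zip(fields, record_defaults):
        if field == "":
            columns.append(default)               # empty column: use the default
        else:
            columns.append(type(default)(field))  # cast to the default's type
    return columns


record_defaults = [[1], [1]]          # two integer columns, default value 1
full_row = decode_csv_line("7,3", record_defaults)
row_with_gap = decode_csv_line("7,", record_defaults)
```

Here `full_row` holds the two parsed integers, while the empty second column of `row_with_gap` is replaced by its default.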

#### Fixed length records
- tf.FixedLengthRecordReader, tf.decode_raw : read a fixed number of bytes per record from a binary file (tf.decode_raw converts the string to a uint8 tensor)
- For example, in the CIFAR-10 dataset each record starts with a 1-byte label, followed by 3072 bytes of image data
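
The record layout can be illustrated without TensorFlow. The following sketch splits raw bytes into fixed-length records of 1 label byte plus 3072 image bytes. The helper `split_records` and the synthetic input are assumptions made for illustration, and the factorisation 32 * 32 * 3 reflects the 32x32 RGB images of CIFAR-10.

```python
LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3              # 3072 bytes of pixel data per record
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES


def split_records(raw):
    """Split a bytes object into (label, image_bytes) pairs, one per record."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("input is not a whole number of records")
    records = []
    for start in range(0, len(raw), RECORD_BYTES):
        record = raw[start:start + RECORD_BYTES]
        label = record[0]              # the first byte is the label
        image = record[LABEL_BYTES:]   # the remaining 3072 bytes are the image
        records.append((label, image))
    return records


# Two synthetic records standing in for a real CIFAR-10 binary file.
fake_file = (bytes([3]) + bytes(IMAGE_BYTES)
             + bytes([9]) + bytes([255]) * IMAGE_BYTES)
records = split_records(fake_file)
```

In the TensorFlow pipeline the same slicing is applied to the uint8 tensor that `tf.decode_raw` returns.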

#### Standard TensorFlow format
<p>Another approach is to convert whatever data you have into a supported format.
This approach makes it easier to mix and match data sets and network
architectures. The recommended format for TensorFlow is a
<a href="https://www.tensorflow.org/api_guides/python/python_io#tfrecords_format_details">TFRecords file</a>
containing
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/core/example/example.proto"><code>tf.train.Example</code> protocol buffers</a>
(which contain
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/core/example/feature.proto"><code>Features</code></a>
as a field).  You write a little program that gets your data, stuffs it in an
<code>Example</code> protocol buffer, serializes the protocol buffer to a string, and then
writes the string to a TFRecords file using the
<a href="https://www.tensorflow.org/api_docs/python/tf/python_io/TFRecordWriter"><code>tf.python_io.TFRecordWriter</code></a>.
For example,
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/how_tos/reading_data/convert_to_records.py"><code>tensorflow/examples/how_tos/reading_data/convert_to_records.py</code></a>
converts MNIST data to this format.</p>
<p>To read a file of TFRecords, use
<a href="https://www.tensorflow.org/api_docs/python/tf/TFRecordReader"><code>tf.TFRecordReader</code></a> with
the <a href="https://www.tensorflow.org/api_docs/python/tf/parse_single_example"><code>tf.parse_single_example</code></a>
decoder. The <code>parse_single_example</code> op decodes the example protocol buffers into
tensors. An MNIST example using the data produced by <code>convert_to_records</code> can be
found in
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py"><code>tensorflow/examples/how_tos/reading_data/fully_connected_reader.py</code></a>,
which you can compare with the <code>fully_connected_feed</code> version.</p>

#### Preprocessing
<p>You can then do any preprocessing of these examples you want. This would be any
processing that doesn't depend on trainable parameters. Examples include
normalization of your data, picking a random slice, adding noise or distortions,
etc.  See
<a href="https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_input.py"><code>tensorflow_models/tutorials/image/cifar10/cifar10_input.py</code></a>
for an example.</p>

#### Batching
To batch the data, another queue is created at the end of the pipeline for use in training, evaluation, or inference.
- tf.train.shuffle_batch : uses a queue that randomizes the order of examples

In [6]:
def read_my_file_format(filename_queue):
    reader = tf.SomeReader()
    key, record_string = reader.read(filename_queue)
    example, label = tf.some_decoder(record_string)
    processed_example = some_processing(example)
    return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # min_after_dequeue defines how big a buffer we will randomly sample
    #   from -- bigger means better shuffling but slower start up and more
    #   memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    #   determines the maximum we will prefetch.  Recommendation:
    #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

<p>If you need more parallelism or shuffling of examples between files, use
multiple reader instances using the
<a href="https://www.tensorflow.org/api_docs/python/tf/train/shuffle_batch_join"><code>tf.train.shuffle_batch_join</code></a>.</p>

In [7]:
def input_pipeline(filenames, batch_size, read_threads, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example_list = [read_my_file_format(filename_queue)
                    for _ in range(read_threads)]
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch_join(
        example_list, batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

<p>You still only use a single filename queue that is shared by all the readers.
That way we ensure that the different readers use different files from the same
epoch until all the files from the epoch have been started.  (It is also usually
sufficient to have a single thread filling the filename queue.)</p>
<p>An alternative is to use a single reader via the
<a href="https://www.tensorflow.org/api_docs/python/tf/train/shuffle_batch"><code>tf.train.shuffle_batch</code></a>
with <code>num_threads</code> bigger than 1.  This will make it read from a single file at
the same time (but faster than with 1 thread), instead of N files at once.
This can be important:</p>
<ul>
<li>If you have more reading threads than input files, to avoid the risk that
    you will have two threads reading the same example from the same file near
    each other.</li>
<li>Or if reading N files in parallel causes too many disk seeks.</li>
</ul>
<p>How many threads do you need? The <code>tf.train.shuffle_batch*</code> functions add a
summary to the graph that indicates how full the example queue is. If you have
enough reading threads, that summary will stay above zero.  You can
<a href="https://www.tensorflow.org/get_started/summaries_and_tensorboard">view your summaries as training progresses using TensorBoard</a>.</p>

#### Create threads to prefetch using QueueRunner objects
Most of the tf.train functions introduced above use tf.train.QueueRunner objects.
- You must therefore call tf.train.start_queue_runners before running any training or inference steps.
- Use tf.train.Coordinator to manage the threads.

In [None]:
# example
w = tf.Variable([5], dtype=tf.float32)
b = tf.constant([5], dtype=tf.float32)
train_op = w + b

# Create the graph, etc.
init_op = tf.global_variables_initializer()

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    #while not coord.should_stop():
    for i in range(10):
        # Run training steps or whatever
        sess.run(train_op)

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()

### Aside: What is happening here?
Building the graph creates a few pipeline stages that are connected by queues.
1. The first stage generates the filenames to read and enqueues them in the filename queue.
2. The second stage dequeues filenames (using a Reader), produces examples, and enqueues them in the example queue.
3. As the training step dequeues examples, the enqueue operations keep running to refill the queues.
    - Threads run these enqueuing operations.

<div style="width:70%; margin-left:12%; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="https://www.tensorflow.org/images/AnimatedFileQueues.gif">
</div>

<p>The helpers in <code>tf.train</code> that create these queues and enqueuing operations add
a <a href="https://www.tensorflow.org/api_docs/python/tf/train/QueueRunner"><code>tf.train.QueueRunner</code></a> to the
graph using the
<a href="https://www.tensorflow.org/api_docs/python/tf/train/add_queue_runner"><code>tf.train.add_queue_runner</code></a>
function. Each <code>QueueRunner</code> is responsible for one stage, and holds the list of
enqueue operations that need to be run in threads. Once the graph is
constructed, the
<a href="https://www.tensorflow.org/api_docs/python/tf/train/start_queue_runners"><code>tf.train.start_queue_runners</code></a>
function asks each QueueRunner in the graph to start its threads running the
enqueuing operations.</p>
<p>If all goes well, you can now run your training steps and the queues will be
filled by the background threads. If you have set an epoch limit, at some point
an attempt to dequeue examples will get an
<a href="https://www.tensorflow.org/api_docs/python/tf/errors/OutOfRangeError"><code>tf.errors.OutOfRangeError</code></a>. This
is the TensorFlow equivalent of "end of file" (EOF) -- this means the epoch
limit has been reached and no more examples are available.</p>
<p>The last ingredient is the
<a href="https://www.tensorflow.org/api_docs/python/tf/train/Coordinator"><code>tf.train.Coordinator</code></a>. This is responsible
for letting all the threads know if anything has signalled a shut down. Most
commonly this would be because an exception was raised, for example one of the
threads got an error when running some operation (or an ordinary Python
exception).</p>
<p>For more about threading, queues, QueueRunners, and Coordinators
<a href="https://www.tensorflow.org/programmers_guide/threading_and_queues">see here</a>.</p>
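
The interplay of an enqueuing thread, an epoch limit, and a coordinated shut-down can be imitated with the Python standard library. This is only an analogy, not TensorFlow code. A `threading.Event` stands in for the `Coordinator`, a bounded `queue.Queue` stands in for the filename queue, and a `None` sentinel stands in for the `OutOfRangeError` raised once the epoch limit is reached. The function `enqueue_filenames` is a hypothetical name.

```python
import queue
import threading


def enqueue_filenames(filenames, num_epochs, out_queue, stop_event):
    """Producer thread: enqueue each filename once per epoch, then signal the end."""
    for _ in range(num_epochs):
        for name in filenames:
            if stop_event.is_set():       # a stop was requested: exit early
                return
            out_queue.put(name)           # blocks while the queue is full
    out_queue.put(None)                   # sentinel: no more epochs


stop_event = threading.Event()            # plays the role of the Coordinator
filename_queue = queue.Queue(maxsize=4)   # bounded, so the producer can block
producer = threading.Thread(
    target=enqueue_filenames,
    args=(["file0", "file1"], 3, filename_queue, stop_event))
producer.start()                          # analogous to start_queue_runners

consumed = []
while True:
    item = filename_queue.get()           # blocks until something is enqueued
    if item is None:                      # epoch limit reached
        break
    consumed.append(item)

stop_event.set()                          # analogous to coord.request_stop()
producer.join()                           # analogous to coord.join(threads)
```

The consumer never has to know how many epochs were requested. It simply dequeues until the end-of-data signal arrives, which is the pattern the `try`/`except tf.errors.OutOfRangeError` loop follows.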

#### Aside: How clean shut-down when limiting epochs works
<p>Imagine you have a model that has set a limit on the number of epochs to train
on.  That means that the thread generating filenames will only run that many
times before generating an <code>OutOfRange</code> error. The QueueRunner will catch that
error, close the filename queue, and exit the thread. Closing the queue does two
things:</p>
<ul>
<li>Any future attempt to enqueue in the filename queue will generate an error.
    At this point there shouldn't be any threads trying to do that, but this
    is helpful when queues are closed due to other errors.</li>
<li>Any current or future dequeue will either succeed (if there are enough
    elements left) or fail (with an <code>OutOfRange</code> error) immediately.  They won't
    block waiting for more elements to be enqueued, since by the previous point
    that can't happen.</li>
</ul>
<p>The point is that when the filename queue is closed, there will likely still be
many filenames in that queue, so the next stage of the pipeline (with the reader
and other preprocessing) may continue running for some time.  Once the filename
queue is exhausted, though, the next attempt to dequeue a filename (e.g. from a
reader that has finished with the file it was working on) will trigger an
<code>OutOfRange</code> error.  In this case, though, you might have multiple threads
associated with a single QueueRunner.  If this isn't the last thread in the
QueueRunner, the <code>OutOfRange</code> error just causes the one thread to exit.  This
allows the other threads, which are still finishing up their last file, to
proceed until they finish as well.  (Assuming you are using a
<a href="https://www.tensorflow.org/api_docs/python/tf/train/Coordinator"><code>tf.train.Coordinator</code></a>,
other types of errors will cause all the threads to stop.)  Once all the reader
threads hit the <code>OutOfRange</code> error, only then does the next queue, the example
queue, get closed.</p>
<p>Again, the example queue will have some elements queued, so training will
continue until those are exhausted.  If the example queue is a
<a href="https://www.tensorflow.org/api_docs/python/tf/RandomShuffleQueue"><code>tf.RandomShuffleQueue</code></a>, say
because you are using <code>shuffle_batch</code> or <code>shuffle_batch_join</code>, it normally will
avoid ever having fewer than its <code>min_after_dequeue</code> attr elements buffered.
However, once the queue is closed that restriction will be lifted and the queue
will eventually empty.  At that point the actual training threads, when they
try and dequeue from example queue, will start getting <code>OutOfRange</code> errors and
exiting.  Once all the training threads are done,
<a href="https://www.tensorflow.org/api_docs/python/tf/train/Coordinator#join"><code>tf.train.Coordinator.join</code></a>
will return and you can exit cleanly.</p>
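
The `min_after_dequeue` rule, and how closing the queue lifts it, can be sketched in plain Python. `ShuffleBuffer` below is a hypothetical class written for illustration, not the implementation of `tf.RandomShuffleQueue`. Where a real queue would block (while open) or raise an `OutOfRange` error (once closed and empty), this sketch simply raises `IndexError`.

```python
import random


class ShuffleBuffer:
    """Plain-Python imitation of the min_after_dequeue rule (hypothetical class)."""

    def __init__(self, min_after_dequeue, seed=None):
        self.min_after_dequeue = min_after_dequeue
        self.items = []
        self.closed = False
        self.rng = random.Random(seed)

    def enqueue(self, item):
        if self.closed:
            raise RuntimeError("enqueue on a closed queue")
        self.items.append(item)

    def close(self):
        self.closed = True

    def can_dequeue(self):
        # While open, at least min_after_dequeue elements must remain after a
        # dequeue; once closed, the buffer is allowed to drain completely.
        floor = 0 if self.closed else self.min_after_dequeue
        return len(self.items) > floor

    def dequeue(self):
        if not self.can_dequeue():
            raise IndexError("nothing available to dequeue")
        index = self.rng.randrange(len(self.items))
        # Move the randomly chosen element to the end, then pop it.
        self.items[index], self.items[-1] = self.items[-1], self.items[index]
        return self.items.pop()


buffer = ShuffleBuffer(min_after_dequeue=3, seed=0)
for value in range(5):
    buffer.enqueue(value)
before_close = [buffer.dequeue() for _ in range(2)]  # only 5 - 3 = 2 available
blocked = not buffer.can_dequeue()                   # 3 left, but 3 must remain
buffer.close()
after_close = [buffer.dequeue() for _ in range(3)]   # restriction lifted
```

After the close, the remaining three elements drain and the buffer ends up empty, which is the point at which training threads would start seeing `OutOfRange` errors.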

#### Filtering records or producing multiple examples per record
<p>Instead of examples with shapes <code>[x, y, z]</code>, you will produce a batch of
examples with shape <code>[batch, x, y, z]</code>.  The batch size can be 0 if you want to
filter this record out (maybe it is in a hold-out set?), or bigger than 1 if you
are producing multiple examples per record.  Then simply set <code>enqueue_many=True</code>
when calling one of the batching functions (such as <code>shuffle_batch</code> or
<code>shuffle_batch_join</code>).</p>
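
A plain-Python sketch of the idea: each record is turned into a list of zero or more examples, and the lists are flattened into the queue, which is how `enqueue_many=True` treats the leading batch dimension. The record fields (`value`, `copies`, `held_out`) and both helper functions are hypothetical, chosen only to show filtering and multiplication in one place.

```python
def examples_from_record(record):
    """Return the list of examples for one record (hypothetical rule)."""
    if record["held_out"]:
        return []                      # batch of size 0: the record is filtered out
    # A batch larger than 1: several examples derived from the same record.
    return [record["value"] + offset for offset in range(record["copies"])]


def enqueue_many(records):
    """Flatten the per-record batches into one stream of examples."""
    queue_contents = []
    for record in records:
        queue_contents.extend(examples_from_record(record))
    return queue_contents


records = [
    {"value": 10, "copies": 1, "held_out": False},
    {"value": 20, "copies": 3, "held_out": False},
    {"value": 30, "copies": 2, "held_out": True},
]
queue_contents = enqueue_many(records)
```

The first record contributes one example, the second contributes three, and the held-out record contributes none.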

#### Sparse input data
<p>SparseTensors don't play well with queues. If you use SparseTensors you have
to decode the string records using
<a href="https://www.tensorflow.org/api_docs/python/tf/parse_example"><code>tf.parse_example</code></a> <strong>after</strong>
batching (instead of using <code>tf.parse_single_example</code> before batching).</p>

## Preloaded data
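Because all the data is loaded into memory, this approach is used only for small datasets.
- There are two approaches to preloading data:
    + Store the data in a constant.
    + Store the data in a variable that you initialize and then never change.

Using a constant is simpler, but it uses more memory, because the constant is stored inline in the graph data structure, which may be duplicated a few times.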

In [None]:
training_data = ...
training_labels = ...
with tf.Session():
    input_data = tf.constant(training_data)
    input_labels = tf.constant(training_labels)
    ...

When using a variable, you need to initialize it after the graph has been built.

In [None]:
training_data = ...
training_labels = ...
with tf.Session() as sess:
    data_initializer = tf.placeholder(dtype=training_data.dtype,
                                    shape=training_data.shape)
    label_initializer = tf.placeholder(dtype=training_labels.dtype,
                                     shape=training_labels.shape)
    input_data = tf.Variable(data_initializer, trainable=False, collections=[])
    input_labels = tf.Variable(label_initializer, trainable=False, collections=[])
    ...
    sess.run(input_data.initializer,
           feed_dict={data_initializer: training_data})
    sess.run(input_labels.initializer,
           feed_dict={label_initializer: training_labels})

<p>Setting <code>trainable=False</code> keeps the variable out of the
<code>GraphKeys.TRAINABLE_VARIABLES</code> collection in the graph, so we won't try and
update it when training.  Setting <code>collections=[]</code> keeps the variable out of the
<code>GraphKeys.GLOBAL_VARIABLES</code> collection used for saving and restoring checkpoints.</p>
<p>Either way,
<a href="https://www.tensorflow.org/api_docs/python/tf/train/slice_input_producer"><code>tf.train.slice_input_producer</code></a>
can be used to produce a slice at a time.  This shuffles the examples across an
entire epoch, so further shuffling when batching is undesirable.  So instead of
using the <code>shuffle_batch</code> functions, we use the plain
<a href="https://www.tensorflow.org/api_docs/python/tf/train/batch"><code>tf.train.batch</code></a> function.  To use
multiple preprocessing threads, set the <code>num_threads</code> parameter to a number
bigger than 1.</p>
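
Why plain batching suffices here can be seen in a small plain-Python sketch. The two generators are hypothetical stand-ins, not TensorFlow APIs: `slice_epochs` imitates producing one shuffled slice at a time over a whole epoch, and `plain_batches` groups consecutive slices without any further shuffling. A trailing partial batch is dropped in this sketch.

```python
import random


def slice_epochs(data, labels, num_epochs, seed=None):
    """Yield (example, label) slices, reshuffled once per epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)             # one shuffle across the entire epoch
        for i in order:
            yield data[i], labels[i]


def plain_batches(slices, batch_size):
    """Group consecutive slices into batches without reshuffling them."""
    batch = []
    for item in slices:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []


data = ["a", "b", "c", "d"]
labels = [0, 1, 2, 3]
batches = list(plain_batches(
    slice_epochs(data, labels, num_epochs=2, seed=0), batch_size=2))
first_epoch = batches[0] + batches[1]
```

Because the order is already random within each epoch, the first two batches together contain every example exactly once.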
<p>An MNIST example that preloads the data using constants can be found in
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/how_tos/reading_data/fully_connected_preloaded.py"><code>tensorflow/examples/how_tos/reading_data/fully_connected_preloaded.py</code></a>, and one that preloads the data using variables can be found in
<a href="https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/how_tos/reading_data/fully_connected_preloaded_var.py"><code>tensorflow/examples/how_tos/reading_data/fully_connected_preloaded_var.py</code></a>.
You can compare these with the <code>fully_connected_feed</code> and
<code>fully_connected_reader</code> versions above.</p>

## Multiple input pipeline
<p>Commonly you will want to train on one dataset and evaluate (or "eval") on
another.  One way to do this is to actually have two separate processes:</p>
<ul>
<li>The training process reads training input data and periodically writes
  checkpoint files with all the trained variables.</li>
<li>The evaluation process restores the checkpoint files into an inference
  model that reads validation input data.</li>
</ul>
<p>This is what is done in
<a href="https://www.tensorflow.org/tutorials/deep_cnn#save_and_restore_checkpoints">the example CIFAR-10 model</a>.  This has a couple of benefits:</p>
<ul>
<li>The eval is performed on a single snapshot of the trained variables.</li>
<li>You can perform the eval even after training has completed and exited.</li>
</ul>
<p>You can have the train and eval in the same graph in the same process, and share
their trained variables.  See
<a href="https://www.tensorflow.org/programmers_guide/variable_scope">the shared variables tutorial</a>.</p>

## References
- Tensorflow Reading data : https://www.tensorflow.org/programmers_guide/reading_data