In this notebook, we will look at how to handle text using the `tf.data` APIs in TensorFlow.

More specifically, we will look at how to create dataset iterators that can:
* Return one line per call from a text file
* Return a list of tokens per line
* Return a list of token **integers** by using a lookup table
* Append the length of the sentence (number of tokens)
* Return a **batch** of token integers (think multiple sentences)

### Download data
Download the English sentences `train.en` and the vocabulary `vocab.en` from the [TED Talks dataset, IWSLT'15 English-Vietnamese data](https://nlp.stanford.edu/projects/nmt/).
  ```bash
  wget https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/train.en
  wget https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.en
  
  head -100 train.en > sample.en
  ```
  
  You can also work with the entire training data `train.en`. We use the first 100 lines (`sample.en`) for simplicity!

In [1]:
# Well, we are using tensorflow, so..
import tensorflow as tf

# This is to get path for text files
import os

### Step 1: Data iterator for text

We will create a data iterator which returns a single sentence per call.

Replace `DATA_DIR` with the path of the directory where you downloaded the data.

In [3]:
DATA_DIR = '.'
sentences_file = os.path.join(DATA_DIR, 'sample.en')
vocab_file = os.path.join(DATA_DIR, 'vocab.en')

In [12]:
#Creates a dataset which returns a single sentence
dataset = tf.data.TextLineDataset(sentences_file)

# Dataset iterator. Needs to be initialized
iterator = dataset.make_initializable_iterator()

All we did above was create a dataset that returns one sentence per call, and an iterator over it.

Note that we cannot execute this as is: `iterator.get_next()` is an op that returns a tensor, so we need to run it inside a TensorFlow session. We also need to **initialize** our iterator first.

In [5]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    
    sentence = sess.run(iterator.get_next())
    print(sentence)

b'Rachel Pike : The science behind a climate headline'
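
As a rough mental model (plain Python, not TensorFlow), the dataset and its iterator behave much like a generator: creating it reads nothing, and each call pulls one more line. The sketch below uses an in-memory `io.StringIO` with two made-up lines in place of `sample.en`, and `line_dataset` is a hypothetical helper name.

```python
import io

def line_dataset(file_obj):
    """Yield one line per call, without the trailing newline."""
    for line in file_obj:
        yield line.rstrip("\n")

# Stand-in for sample.en: two made-up lines held in memory.
fake_file = io.StringIO("first sentence .\nsecond sentence .\n")

# Creating the generator reads nothing yet, much like building the dataset op.
lines = line_dataset(fake_file)

first = next(lines)   # first call returns the first line
second = next(lines)  # second call returns the next one
print(first)
print(second)
```

The analogy is loose: in TensorFlow the "call" is `sess.run(...)` on the `get_next` op, and the result comes back as a byte string.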


So far so good! But what happens when we run out of data items, in our case past sentence 100? Does it return to the beginning of the file? Well, let's check it out...

In [7]:
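num_sentences = 0
# Build the get_next op once. Calling iterator.get_next() inside the loop
# would add a new op to the graph on every iteration.
next_sentence = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        sentence = sess.run(next_sentence)
        num_sentences += 1
        
        #Print progress every 10 sentences
        if num_sentences % 10 == 0:
            print('S:%d :%s'%(num_sentences, sentence))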

S:10 :b'It &apos;s a big community . It &apos;s such a big community , in fact , that our annual gathering is the largest scientific meeting in the world .'
S:20 :b'Because it &apos;s important to the atmospheric system , we go to all lengths to study this thing .'
S:30 :b'I recently joined a field campaign in Malaysia . There are others .'
S:40 :b'We have to get special flight clearance .'
S:50 :b'So you can imagine the scale of the effort .'
S:60 :b'I &apos;m going to tell you about that technology .'
S:70 :b'We don &apos;t have to inject anything . We don &apos;t need radiation .'
S:80 :b'We all know that as we form thoughts , they form deep channels in our minds and in our brains .'
S:90 :b'As they do it , in the upper left is a display that &apos;s yoked to their brain activation of their own pain being controlled .'
S:100 :b'Where will we take it ?'


OutOfRangeError: End of sequence
	 [[Node: IteratorGetNext_202 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]



We get a `tf.errors.OutOfRangeError`. This indicates that we have run out of items to process. It is **not really an error**. So why is it raised as one?

Think of `tf.errors.OutOfRangeError` as a signal that you have finished an epoch. You can catch it and re-initialize the iterator at that point. This is useful if, say, you want to run an evaluation at the end of each epoch. The code below stops after two epochs. Notice how we need to call `sess.run(iterator.initializer)` again to start the next epoch.

In [9]:
num_sentences = 0
epoch_num = 0
# Build the get_next op once, outside the loop, so the graph does not grow
next_sentence = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            sentence = sess.run(next_sentence)
            num_sentences += 1
            if num_sentences % 10 == 0:
                print('S:%d :%s'%(num_sentences, sentence))
        except tf.errors.OutOfRangeError:
            print('End of epoch: %d '%epoch_num)
            epoch_num += 1
            
            #Stop after 2 epochs
            if epoch_num == 2:
                break
            #Re-initialize the iterator to start the next epoch
            sess.run(iterator.initializer)

S:10 :b'It &apos;s a big community . It &apos;s such a big community , in fact , that our annual gathering is the largest scientific meeting in the world .'
S:20 :b'Because it &apos;s important to the atmospheric system , we go to all lengths to study this thing .'
S:30 :b'I recently joined a field campaign in Malaysia . There are others .'
S:40 :b'We have to get special flight clearance .'
S:50 :b'So you can imagine the scale of the effort .'
S:60 :b'I &apos;m going to tell you about that technology .'
S:70 :b'We don &apos;t have to inject anything . We don &apos;t need radiation .'
S:80 :b'We all know that as we form thoughts , they form deep channels in our minds and in our brains .'
S:90 :b'As they do it , in the upper left is a display that &apos;s yoked to their brain activation of their own pain being controlled .'
S:100 :b'Where will we take it ?'
End of epoch: 0 
S:110 :b'It &apos;s a big community . It &apos;s such a big community , in fact , that our annual gathering is the 
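
The same control flow can be sketched in plain Python, where `StopIteration` plays the role of `tf.errors.OutOfRangeError` and calling `iter(...)` again plays the role of re-running `iterator.initializer`. This is only an analogy under those assumptions; `run_epochs` is a hypothetical helper, not part of `tf.data`.

```python
def run_epochs(sentences, num_epochs):
    """Iterate over `sentences` for `num_epochs`, restarting on exhaustion."""
    epochs_done = 0
    seen = 0
    iterator = iter(sentences)          # analogous to running iterator.initializer
    while True:
        try:
            next(iterator)              # analogous to sess.run(...) on the get_next op
            seen += 1
        except StopIteration:           # analogous to tf.errors.OutOfRangeError
            epochs_done += 1
            if epochs_done == num_epochs:
                break
            iterator = iter(sentences)  # re-initialize for the next epoch
    return seen, epochs_done

# Three made-up items, two epochs.
seen, epochs_done = run_epochs(["a", "b", "c"], num_epochs=2)
print(seen, epochs_done)
```

The exception is simply how the iterator says "nothing left", which is why catching it at the epoch boundary is the normal pattern rather than a workaround.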

### Step 2: Return a list of tokens

In [13]:
#We saw this before, just returns one line from text file
dataset = tf.data.TextLineDataset(sentences_file)

#This would convert a sentence to a list of tokens.
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)

#Good, ol iterator!
iterator = dataset.make_initializable_iterator()

In [14]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    
    sentence = sess.run(iterator.get_next())
    print(sentence)

[b'Rachel' b'Pike' b':' b'The' b'science' b'behind' b'a' b'climate'
 b'headline']


Note how this now returns a list of words (an array of byte strings) instead of a single string.
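
For intuition, here is a plain-Python sketch of the same split, assuming `tf.string_split` is used with its default single-space delimiter and drops empty strings. `split_on_spaces` is a hypothetical name used only for illustration.

```python
def split_on_spaces(sentence):
    """Split on single spaces and drop empty strings."""
    return [token for token in sentence.split(" ") if token]

tokens = split_on_spaces("Rachel Pike : The science behind a climate headline")
print(tokens)
```

Punctuation such as `:` comes out as its own token only because the corpus is already tokenized, with spaces around punctuation. Splitting on spaces does not separate punctuation by itself.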

### Step 3: Lookup Table, and list of token *integers*
OK, so we have made some progress. We know how to create an iterator, and how to return a list of tokens.

To convert tokens (strings) to integers, we need a lookup table. Think of a lookup table as a dictionary that returns a unique index for each token, and a default index for out-of-vocabulary words.

In [16]:
from tensorflow.python.ops import lookup_ops

In [17]:
# Returns a lookup table that converts a string to an integer:
# the zero-based line number of that token in vocab_file.
# Tokens not in the vocab map to default_value, here 0 (look at the first line of vocab_file)
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

In [21]:
#Oh come on, we have seen these two lines enough!
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)

# This does a lookup. Returns corresponding integers
dataset = dataset.map(lambda words: vocab_table.lookup(words))

#Ah, I know this!
iterator = dataset.make_initializable_iterator()

In [20]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.tables_initializer())
    
    sentence = sess.run(iterator.get_next())
    print(sentence)

[ 3  0  4  5  6  7  8  9 10]
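
The dictionary analogy can be made concrete with a small plain-Python sketch. It assumes, as with `index_table_from_file`, that each token's id is its zero-based line number in the vocab file. The five-entry vocab below is made up for illustration and is not the contents of `vocab.en`; `build_vocab` and `lookup` are hypothetical helpers.

```python
def build_vocab(lines):
    """Map each token to its zero-based line number in the vocab listing."""
    return {token: index for index, token in enumerate(lines)}

def lookup(vocab, tokens, default_index=0):
    """Return the id of each token, or default_index if it is not in the vocab."""
    return [vocab.get(token, default_index) for token in tokens]

# Made-up five-entry vocab, one token per line; NOT the contents of vocab.en.
toy_vocab = build_vocab(["<unk>", "the", "science", "of", "climate"])

# "behind" is not in the toy vocab, so it falls back to the default index.
ids = lookup(toy_vocab, ["the", "science", "behind", "climate"])
print(ids)
```

Any `0` in the real output above can be read the same way: that token got the default index.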


### Step 4: Append length of sentence (aka number of tokens)

Why would we need to append the number of tokens? Well, text is interesting: not all sentences are of the same length. So if we batch them together and pad the shorter ones, we need to know where each sentence ends.

In [24]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))

#We can compute length at runtime using tf.size
dataset = dataset.map(lambda words: (words, tf.size(words)))

iterator = dataset.make_initializable_iterator()

In [25]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.tables_initializer())
    
    sentence = sess.run(iterator.get_next())
    print(sentence)

(array([ 3,  0,  4,  5,  6,  7,  8,  9, 10]), 9)
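
One reason the explicit length matters: once sentences are padded to a common length, the padding value may coincide with a real token id. Here the out-of-vocabulary id is 0, and padding integers with 0 is (as far as I know) the default, so padding cannot be recognized by value alone. A minimal plain-Python sketch with a made-up padded row:

```python
def strip_padding(padded_ids, length):
    """Keep only the first `length` ids; everything after that is padding."""
    return padded_ids[:length]

# Hypothetical padded row: the real sentence ends with an unknown token (id 0),
# followed by two padding zeros.
padded_row = [3, 7, 0, 0, 0]
true_length = 3

real_ids = strip_padding(padded_row, true_length)
print(real_ids)
```

Stripping trailing zeros by value would have dropped the final unknown token as well, so the stored length is the reliable way to recover the sentence.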


### Final Step: Batching

This step involves grouping sentences together. This is where `tf.data` comes in really handy, as it takes care of batching for you. However, text comes with its own set of challenges.

Remember, sentences are not all of the same length. The solution is to use `padded_batch`, which pads every sentence in a batch to the length of the longest one.

In [30]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))


dataset = dataset.map(lambda words: (words, tf.size(words)))

#First argument is the batch size. padded_shapes gives the shape of each component:
#[None] is a vector of unknown length (the token ids), [] is a scalar (the sentence length)
dataset = dataset.padded_batch(32, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))
iterator = dataset.make_initializable_iterator()

In [34]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.tables_initializer())
    
    #This is now a tuple, where [0] is the batch of sentences and [1] is the batch of lengths
    sentences = sess.run(iterator.get_next())
    print(len(sentences[0]), len(sentences[1]))

32 32
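
To see what padded batching does to the data itself, here is a plain-Python sketch. It assumes each batch is padded with 0 to the length of its longest sentence and that a smaller final batch is kept rather than dropped. `padded_batches` is a hypothetical helper, not the `tf.data` implementation.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        lengths = [len(seq) for seq in chunk]
        width = max(lengths)
        padded = [seq + [pad_value] * (width - len(seq)) for seq in chunk]
        yield padded, lengths

# Three made-up id sequences of different lengths, batched two at a time.
toy_sequences = [[3, 4], [5, 6, 7, 8], [9]]
batches = list(padded_batches(toy_sequences, batch_size=2))
for padded, lengths in batches:
    print(padded, lengths)
```

Under the same assumptions, the 100 sentences in `sample.en` with a batch size of 32 would give three full batches and a final batch of 4.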
