In this notebook, we explore how to process text data using `tf.data`.

### Download data
Download the English sentences (`train.en`) and the vocabulary (`vocab.en`) from the [IWSLT'15 English-Vietnamese TED Talks dataset](https://nlp.stanford.edu/projects/nmt/). The last command saves the first 100 sentences as `sample.en` for quick experiments.
  ```bash
  wget https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/train.en
  wget https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.en
  
  head -100 train.en > sample.en
  ```
  
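### Install and set up TensorFlow
See https://www.tensorflow.org/install/ for installation instructions.

This notebook requires Python 3 and TensorFlow >= 1.4. It uses the TensorFlow 1.x graph-and-session API (`tf.Session`, `make_initializable_iterator`).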

In [None]:
import tensorflow as tf
import os
from tensorflow.python.ops import lookup_ops

Replace `DATA_DIR` with the folder where you downloaded the data. To process the full dataset, replace `sample.en` with `train.en`.

In [None]:
DATA_DIR = '.'
sentences_file = os.path.join(DATA_DIR, 'sample.en')
vocab_file = os.path.join(DATA_DIR, 'vocab.en')

The next cell builds the dataset pipeline and creates an iterator over it. It only defines the graph; the resulting ops can only be **executed** inside a session.

In [None]:
#Lookup table that converts a token to its integer ID (its line index in `vocab.en`).
#Tokens not found in the vocab map to ID 0, i.e. the ID of the token on the first line of `vocab.en`.
#Must be initialized with tf.tables_initializer() inside a session.
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

#Creates a dataset that returns a single sentence (one line of the file) per element
dataset = tf.data.TextLineDataset(sentences_file)

#Converts each sentence to a list of tokens
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)

#Converts list of tokens to list of token integers
dataset = dataset.map(lambda words: vocab_table.lookup(words))

#Adds length of sentence (number of tokens)
dataset = dataset.map(lambda words: (words, tf.size(words)))

#Convert to batches of size 32. Padded batch appends 0s to shorter sentences, up to the longest sentence in the batch.
dataset = dataset.padded_batch(batch_size=32, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))

# Dataset iterator. Needs to be initialized
iterator = dataset.make_initializable_iterator()
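
To make the per-sentence steps concrete, here is a minimal plain-Python sketch of what the three `map` calls compute for one sentence. It is not the `tf.data` implementation: `toy_vocab` and `encode` are hypothetical names, the vocabulary is made up, and Python's `split()` only approximates `tf.string_split`.

```python
# Plain-Python sketch of the per-sentence steps in the pipeline.
# toy_vocab is a made-up vocabulary; in the notebook the IDs come from vocab.en.
toy_vocab = ["<unk>", "<s>", "</s>", "the", "cat", "sat"]

# A token's ID is its position in the vocabulary, like a line index in a vocab file.
token_to_id = {token: index for index, token in enumerate(toy_vocab)}


def encode(sentence, default_id=0):
    """Split a sentence into tokens, map them to IDs, and return (ids, length)."""
    tokens = sentence.split()
    # Unknown tokens fall back to default_id, mirroring default_value=0.
    ids = [token_to_id.get(token, default_id) for token in tokens]
    return ids, len(ids)


ids, length = encode("the cat sat quietly")
print(ids, length)  # [3, 4, 5, 0] 4
```

The word `quietly` is not in the toy vocabulary, so it receives ID 0, the same fallback the lookup table applies.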

In [None]:
with tf.Session() as sess:
    #Iterator needs to be initialized
    sess.run(iterator.initializer)
    
    #This is required for our lookup table
    sess.run(tf.tables_initializer())
    
    sentences = sess.run(iterator.get_next())
    print ('Num Sentences: %d Shape[txt]: %s Shape[len]: %s'%(len(sentences[0]), sentences[0].shape, sentences[1].shape))
    print ('S[%d]:%s Length:%d\n'%(1, sentences[0][1], sentences[1][1]))
    print ('S[%d]:%s Length:%d'%(14, sentences[0][14], sentences[1][14]))

Notice that the token tensor has shape **32x50** (`batch_size` x the maximum number of tokens in the batch), and the length tensor has shape **32** (one entry per sentence in the batch). Sentence 14 has length 30; the rest of its row is padded with 0.
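
The padding behaviour can be sketched in plain Python. This illustrates the idea only and is not how `padded_batch` is implemented; `pad_batch` is a hypothetical helper and the token IDs are made up. Because 0 is also the ID assigned to out-of-vocabulary tokens (`default_value=0`), the stored lengths are what distinguish padding from real tokens.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    width = max(lengths) if lengths else 0
    padded = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    return padded, lengths


batch, lengths = pad_batch([[3, 4, 5], [7], [2, 9]])
print(batch)    # [[3, 4, 5], [7, 0, 0], [2, 9, 0]]
print(lengths)  # [3, 1, 2]

# Boolean mask derived from the lengths: True marks real tokens, False marks padding.
mask = [[position < n for position in range(len(row))]
        for row, n in zip(batch, lengths)]
print(mask)     # [[True, True, True], [True, False, False], [True, True, False]]
```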

What is the maximum sentence length for the next batch?

In [None]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    sess.run(tf.tables_initializer())
    
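    #Build the get_next() op once; every sess.run on it advances the iterator by one batch
    next_batch = iterator.get_next()
    sess.run(next_batch)              #first batch (discarded)
    sentences = sess.run(next_batch)  #second batch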
    print ('Num Sentences: %d Shape[txt]: %s Shape[len]: %s'%(len(sentences[0]), sentences[0].shape, sentences[1].shape))
    print ('S[%d]:%s Length:%d\n'%(1, sentences[0][1], sentences[1][1]))
    print ('S[%d]:%s Length:%d'%(14, sentences[0][14], sentences[1][14]))

Notice that the token tensor now has shape **32x40**, so the maximum sentence length in the second batch is 40. Each batch is padded only to the length of its own longest sentence.
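
The fact that each batch gets its own padded width can be sketched in plain Python. `batch_widths` is a hypothetical helper and the sentence lengths are made up; the sketch assumes consecutive batching with a smaller final batch when the number of sentences is not a multiple of the batch size.

```python
def batch_widths(sentence_lengths, batch_size):
    """Return the padded width (longest sentence) of each consecutive batch."""
    return [
        max(sentence_lengths[start:start + batch_size])
        for start in range(0, len(sentence_lengths), batch_size)
    ]


# Seven made-up sentence lengths, batched three at a time; the last batch has one sentence.
widths = batch_widths([5, 12, 7, 3, 9, 4, 8], batch_size=3)
print(widths)  # [12, 9, 8]
```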