In this notebook, we will build a dataset iterator.

Our dataset iterator will return these fields:
* init: Iterator initializer. We will see its use when we write the main model
* context: `batch_size x max_len` (`max_len` will not exceed 160, as we restrict the context length)
* len_context: `batch_size`. Length of context for each element in the batch
* response: `batch_size x max_len`
* len_response: `batch_size`. Length of response for each element in the batch
* flag: `batch_size`




### Credits
I have learnt how to use dataset APIs, dynamic RNN and namedtuple from the excellent [tutorial](https://github.com/tensorflow/nmt) and [code](https://github.com/tensorflow/nmt/blob/master/nmt/utils/iterator_utils.py) on NMT.

Let us first define all relevant file paths. 

This should not seem new. We created these files in the [data_prep notebook](https://github.com/vineetm/tensorflow-notes/blob/master/siamese/data_prep.ipynb)

In [1]:
import os
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

#Set this to actual path of the folder where your training, validation and vocab files reside
DATA_DIR = 'ubuntu-data'

#Training files: context, response and flag
train_context_file = os.path.join(DATA_DIR, 'train.context')
train_response_file = os.path.join(DATA_DIR, 'train.response')
train_flag_file = os.path.join(DATA_DIR, 'train.flag')

#Validation files: context, response and flag
valid_context_file = os.path.join(DATA_DIR, 'valid.context')
valid_response_file = os.path.join(DATA_DIR, 'valid.response')
valid_flag_file = os.path.join(DATA_DIR, 'valid.flag')

#Vocab file
vocab_file = os.path.join(DATA_DIR, 'vocab.txt')

Okay, now let us quickly test this API. We will create an iterator that returns word indexes for a sentence in `train.context` using the vocab file.

If this seems hurried, I would encourage you to refer to my [blog post](https://vineetm.github.io/tensorflow-textdata/) and [notebook](https://github.com/vineetm/tensorflow-notes/blob/master/dataset/notebooks/tf-text-data.ipynb), which explain the dataset API in more detail.
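
Before the TensorFlow version, the lookup logic itself can be sketched in plain Python. This is only an analogue for intuition, not the TensorFlow API: `build_vocab`, `sentence_to_indexes` and the toy vocabulary are hypothetical names and data made up for this illustration, and `str.split()` splits on any whitespace.

```python
# Plain-Python analogue of "split a sentence, then look up each word".
# Illustration only: names and toy data below are hypothetical.
UNK_INDEX = 0  # index returned for words that are not in the vocabulary


def build_vocab(words):
    """Map each word to its position, as a one-word-per-line vocab file would."""
    return {word: index for index, word in enumerate(words)}


def sentence_to_indexes(sentence, vocab):
    """Split on whitespace and look up each word, falling back to UNK_INDEX."""
    return [vocab.get(word, UNK_INDEX) for word in sentence.split()]


toy_vocab = build_vocab(['UNK', 'hello', 'ubuntu', 'linux'])
toy_indexes = sentence_to_indexes('hello ubuntu fans', toy_vocab)
print(toy_indexes)  # 'fans' is not in the toy vocab, so it maps to 0
```

The TensorFlow cell that follows does the same two steps, but as dataset transformations that run lazily over the whole file.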

In [2]:
#Create a vocab table. word -> index. Tell it if word is not found, use index 0 `UNK`
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

#Context dataset, one line per sentence
context_dataset = tf.data.TextLineDataset(train_context_file)

#Split sentence to words
context_dataset = context_dataset.map(lambda sentence: tf.string_split([sentence]).values)

#Convert words to indexes
context_dataset = context_dataset.map(lambda words: vocab_table.lookup(words))

#Create iterator
iterator = context_dataset.make_initializable_iterator()

Let us now test our iterator. We need to do that inside a *session*. We also need to
* initialize the iterator
* initialize the vocab tables with `tf.tables_initializer()`

In [3]:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(iterator.initializer)
    
    data = sess.run(iterator.get_next())
    print(data)

[    3    96   266   105  2089     4   307  2708   338  1505     5    24
    48    35   266    61     6   141   338  1153     8     3    96     9
    12   727   197  4965     4  1390    19   597   431    13   197  2089
  2118   115    13    35     1     2     9    81    38   230   416     6
   414    10  3615  2923    20     0     1   444   134     1     2    93
     1   567   266    21  3458    56  5523     0     1     2   152     7
     1     2   108     1     2    27   344     1     9    12  5480   119
    49 18947     7     1     2    80     1    24     9    30   159    27
     4   109   142   438     1    28    10   861  2615  9046    15   363
  1642     1     2    11    67    38  1506   155    49  2615  1116     7
     1  1775    10   402  1065   102  2952    23     6    16    32     4
 16527    45    27    16    23    11  1411     1     2    31  4935    12
     0     7     1     2    27  3455  4935     5    24    10  1814 12791
     1     2     3  1513    14   266    79   656   

It seems we need to do the same operation for `context` and `response`. Let us write a function that takes a text file and a vocab_table and returns a dataset with words converted to indexes.

In [4]:
def text_to_word_indexes(text_file, vocab_table):
    dataset = tf.data.TextLineDataset(text_file)
    
    #Split sentence to words
    dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)

    #Convert words to indexes
    dataset = dataset.map(lambda words: vocab_table.lookup(words))
    
    return dataset


Okay, what do we do about the flag? The flag needs to be converted to a float, which we can do with `tf.string_to_number`.

Finally, we need to return context, response and flag in a single iterator call. Enter, [`tf.data.Dataset.zip`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#zip)
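
As a rough mental model, zipping datasets behaves like Python's built-in `zip` over three line-aligned sources. The sketch below uses made-up toy values (`toy_contexts`, `toy_responses`, `toy_flag_lines` are hypothetical, not taken from the real files) and plain lists instead of datasets.

```python
# Plain-Python analogue of zipping three line-aligned sources.
# Toy data only; the real pipeline reads these from files.
toy_contexts = [[3, 96, 266], [5, 24]]
toy_responses = [[9, 12], [7]]
toy_flag_lines = ['1', '0']

# Convert each flag string to a float, mirroring what a string-to-number step does.
toy_flags = [float(line) for line in toy_flag_lines]

# zip yields one (context, response, flag) tuple per line position.
toy_examples = list(zip(toy_contexts, toy_responses, toy_flags))
for context, response, flag in toy_examples:
    print(context, response, flag)
```

The important property is alignment: element `i` of the zipped result always combines line `i` of each source, so the three files must have the same number of lines in the same order.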

Let us see how this all works!

In [5]:
#Create a vocab table. word -> index. Tell it if word is not found, use index 0 `UNK`
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

#Create context dataset, sentence -> word indexes
context_dataset = text_to_word_indexes(train_context_file, vocab_table)

#Create response dataset, sentence -> word indexes
response_dataset = text_to_word_indexes(train_response_file, vocab_table)

flag_dataset = tf.data.TextLineDataset(train_flag_file)
# Convert string to a float..
flag_dataset = flag_dataset.map(lambda sentence: tf.string_to_number(sentence))

#Join datasets together, using zip
dataset = tf.data.Dataset.zip((context_dataset, response_dataset, flag_dataset))

iterator = dataset.make_initializable_iterator()

Okay, let us test this out!

In [6]:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(iterator.initializer)
    
    data = sess.run(iterator.get_next())
    print('Len:{} data:{}'.format(len(data), data))

Len:3 data:(array([    3,    96,   266,   105,  2089,     4,   307,  2708,   338,
        1505,     5,    24,    48,    35,   266,    61,     6,   141,
         338,  1153,     8,     3,    96,     9,    12,   727,   197,
        4965,     4,  1390,    19,   597,   431,    13,   197,  2089,
        2118,   115,    13,    35,     1,     2,     9,    81,    38,
         230,   416,     6,   414,    10,  3615,  2923,    20,     0,
           1,   444,   134,     1,     2,    93,     1,   567,   266,
          21,  3458,    56,  5523,     0,     1,     2,   152,     7,
           1,     2,   108,     1,     2,    27,   344,     1,     9,
          12,  5480,   119,    49, 18947,     7,     1,     2,    80,
           1,    24,     9,    30,   159,    27,     4,   109,   142,
         438,     1,    28,    10,   861,  2615,  9046,    15,   363,
        1642,     1,     2,    11,    67,    38,  1506,   155,    49,
        2615,  1116,     7,     1,  1775,    10,   402,  1065,   102,
        

Now, you see that data is a tuple of length 3. 

* `data[0]` is `context`
* `data[1]` is `response`
* `data[2]` is `flag`

As you might have noticed, context is sometimes really long! We will take a cue from the [original paper](http://www.cs.toronto.edu/~lcharlin/papers/ubuntu_dialogue_dd17.pdf), and restrict context to 160 tokens. 

Further, we will keep only the last 160 tokens. Why? Because the bits most relevant to predicting the next response are at the end!
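
Keeping the last 160 tokens is just a negative slice. A minimal sketch in plain Python, with hypothetical names (`MAX_CONTEXT_LEN`, `keep_last_tokens`) and toy lists standing in for word indexes:

```python
# Keep only the last `max_len` tokens of a sequence using a negative slice.
# Names and toy data are hypothetical, for illustration only.
MAX_CONTEXT_LEN = 160


def keep_last_tokens(tokens, max_len=MAX_CONTEXT_LEN):
    """Return the final `max_len` tokens; shorter sequences come back unchanged."""
    return tokens[-max_len:]


long_context = list(range(200))   # 200 toy tokens: 0..199
short_context = list(range(5))    # 5 toy tokens: 0..4

truncated = keep_last_tokens(long_context)
print(len(truncated), truncated[0], truncated[-1])
print(keep_last_tokens(short_context))
```

Note that a sequence shorter than 160 tokens is returned as-is, so no special case is needed for short contexts.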

We now need to worry about two other things: 
* Add length of context and response. We will use `tf.size` on the words
* Batch data

Let us see how this can be accomplished!

In [7]:
#Create a vocab table. word -> index. Tell it if word is not found, use index 0 `UNK`
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

#Create context dataset, sentence -> word indexes
context_dataset = text_to_word_indexes(train_context_file, vocab_table)

#Restrict context to Last 160 tokens
context_dataset = context_dataset.map(lambda words: words[-160:])


#Create response dataset, sentence -> word indexes
response_dataset = text_to_word_indexes(train_response_file, vocab_table)

flag_dataset = tf.data.TextLineDataset(train_flag_file)
# Convert string to a float..
flag_dataset = flag_dataset.map(lambda sentence: tf.string_to_number(sentence))

#Join datasets together, using zip
dataset = tf.data.Dataset.zip((context_dataset, response_dataset, flag_dataset))

#Add length of context and response
dataset = dataset.map(lambda context, response, flag: (context, tf.size(context), response, tf.size(response), flag))

iterator = dataset.make_initializable_iterator()

In [8]:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(iterator.initializer)
    
    data = sess.run(iterator.get_next())
    print('Len:{} data:{}'.format(len(data), data))

Len:5 data:(array([    2,    80,     1,    24,     9,    30,   159,    27,     4,
         109,   142,   438,     1,    28,    10,   861,  2615,  9046,
          15,   363,  1642,     1,     2,    11,    67,    38,  1506,
         155,    49,  2615,  1116,     7,     1,  1775,    10,   402,
        1065,   102,  2952,    23,     6,    16,    32,     4, 16527,
          45,    27,    16,    23,    11,  1411,     1,     2,    31,
        4935,    12,     0,     7,     1,     2,    27,  3455,  4935,
           5,    24,    10,  1814, 12791,     1,     2,     3,  1513,
          14,   266,    79,   656,     4,     0,     1,     2,    93,
           1,     2,    81,    11,  1733,    10,   124,  2274,    10,
        1142,     7,     1,     2,  2834,     7,     1,     2,  9047,
           0,    95,    20,     0,     8,    65,  1891,    18,     4,
        1247,    20,     4,  1468,    13,    14,    30,     9,     1,
           2,  9047,     4,   300,   124,    11,   140,     7,     1,
        

Okay, the tuple length is now 5 (we have added len_context and len_response). But this seems to be getting a bit out of control.

We should not be expected to remember which field in tuple means what! Well, we can name these fields using [`namedtuple`](https://docs.python.org/2/library/collections.html)

In [9]:
from collections import namedtuple

#Notice a new (sixth) field init. This is initializer for iterator
class DataIterator(namedtuple('DataIterator', 'init context len_context response len_response flag')):
    pass
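
To see what named fields buy us, here is a small sketch that uses plain Python values in place of tensors. `ToyIterator` and its values are hypothetical stand-ins; only the field names match the class above.

```python
from collections import namedtuple

# Same field names as DataIterator; the values are toy placeholders, not tensors.
ToyIterator = namedtuple(
    'ToyIterator', 'init context len_context response len_response flag')

toy = ToyIterator(init=None, context=[3, 96, 266], len_context=3,
                  response=[9, 12], len_response=2, flag=1.0)

# Access by name instead of remembering positions.
print(toy.context, toy.len_context)

# A namedtuple is still a tuple, so indexing and len() keep working.
print(toy[1] == toy.context, len(toy))
```

Because a namedtuple is still an ordinary tuple, code that unpacks it positionally continues to work while new code can use the readable names.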

Now we return a `DataIterator` instead of the raw iterator. This allows us to access its contents via named fields!

Let us see how! We will put our code from before inside a function.

In [10]:
def create_dataset_iterator(vocab_table, context_file, response_file, flag_file):
    #vocab_table is supplied by the caller, so we use it directly instead of re-creating it from the global vocab_file

    #Create context dataset, sentence -> word indexes
    context_dataset = text_to_word_indexes(context_file, vocab_table)

    #Restrict context to Last 160 tokens
    context_dataset = context_dataset.map(lambda words: words[-160:])


    #Create response dataset, sentence -> word indexes
    response_dataset = text_to_word_indexes(response_file, vocab_table)

    flag_dataset = tf.data.TextLineDataset(flag_file)
    # Convert string to a float..
    flag_dataset = flag_dataset.map(lambda sentence: tf.string_to_number(sentence))

    #Join datasets together, using zip
    dataset = tf.data.Dataset.zip((context_dataset, response_dataset, flag_dataset))

    #Add length of context and response
    dataset = dataset.map(lambda context, response, flag: (context, tf.size(context), response, tf.size(response), flag))

    iterator = dataset.make_initializable_iterator()

    context, len_context, response, len_response, flag = iterator.get_next()

    return DataIterator(iterator.initializer, context, len_context, response, len_response, flag)

In [11]:
iterator = create_dataset_iterator(vocab_table, train_context_file, train_response_file, train_flag_file)

In [12]:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(iterator.init)
    
    datum = sess.run(iterator)
    print ('Context:{} Len:{}'.format(datum.context, datum.len_context))
    print ('Response:{} Len:{}'.format(datum.response, datum.len_response))
    print ('Flag: {}'.format(datum.flag))

Context:[    2    80     1    24     9    30   159    27     4   109   142   438
     1    28    10   861  2615  9046    15   363  1642     1     2    11
    67    38  1506   155    49  2615  1116     7     1  1775    10   402
  1065   102  2952    23     6    16    32     4 16527    45    27    16
    23    11  1411     1     2    31  4935    12     0     7     1     2
    27  3455  4935     5    24    10  1814 12791     1     2     3  1513
    14   266    79   656     4     0     1     2    93     1     2    81
    11  1733    10   124  2274    10  1142     7     1     2  2834     7
     1     2  9047     0    95    20     0     8    65  1891    18     4
  1247    20     4  1468    13    14    30     9     1     2  9047     4
   300   124    11   140     7     1     2    80     8   137   934   206
     8     1     2     3    81  1102     6  1624     9    75    42  2546
     8    42    12   114    14    97   935   155     0  2209     5     3
   947     7     1     2] Len:160
Response:[

Okay, now to the final part: batching.

Batching text is a bit tricky, as the length of each sentence differs! TensorFlow handles this with [`padded_batch`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset#padded_batch). Let us see how to use it!

```python
dataset = dataset.padded_batch(batch_size, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
```

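The first argument, `batch_size`, means exactly what it says: how many lines do you want to pool together?
The remaining items specify how padding should be done, one shape per field: `tf.TensorShape([None])` marks a variable-length vector that is padded to the longest one in the batch, while `tf.TensorShape([])` marks a scalar that needs no padding. For all practical purposes, yo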
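
The idea behind padded batching can be sketched in plain Python: pad every sequence in a batch to the length of the longest one. This is an analogue for intuition, not the TensorFlow implementation; `pad_batch` and the toy batch are hypothetical, and the pad value of 0 is an assumption chosen to match the usual default for numeric data.

```python
# Plain-Python analogue of padding a batch of variable-length sequences.
# Illustration only: `pad_batch` and the toy data are hypothetical.
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


toy_batch = [[3, 96, 266], [5], [7, 1]]

# Record the true lengths before padding, as len_context / len_response do.
toy_lengths = [len(seq) for seq in toy_batch]

padded = pad_batch(toy_batch)
print(padded)
print(toy_lengths)
```

Since index 0 is also the `UNK` index in our vocab table, a padded 0 cannot be told apart from an unknown word by value alone. That is one reason we carry the lengths alongside the padded sequences.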

In [13]:
def create_dataset_iterator(vocab_table, context_file, response_file, flag_file, batch_size):
    #vocab_table is supplied by the caller, so we use it directly instead of re-creating it from the global vocab_file

    #Create context dataset, sentence -> word indexes
    context_dataset = text_to_word_indexes(context_file, vocab_table)

    #Restrict context to Last 160 tokens
    context_dataset = context_dataset.map(lambda words: words[-160:])


    #Create response dataset, sentence -> word indexes
    response_dataset = text_to_word_indexes(response_file, vocab_table)

    flag_dataset = tf.data.TextLineDataset(flag_file)
    # Convert string to a float..
    flag_dataset = flag_dataset.map(lambda sentence: tf.string_to_number(sentence))

    #Join datasets together, using zip
    dataset = tf.data.Dataset.zip((context_dataset, response_dataset, flag_dataset))

    #Add length of context and response
    dataset = dataset.map(lambda context, response, flag: (context, tf.size(context), response, tf.size(response), flag))

    dataset = dataset.padded_batch(batch_size, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
    iterator = dataset.make_initializable_iterator()

    context, len_context, response, len_response, flag = iterator.get_next()

    return DataIterator(iterator.initializer, context, len_context, response, len_response, flag)

In [14]:
iterator = create_dataset_iterator(vocab_table, train_context_file, train_response_file, train_flag_file, batch_size=16)

In [15]:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(iterator.init)
    
    datum = sess.run(iterator)
    print('Len datum: {}'.format(len(datum)))
    print ('Context Shape:{}'.format(datum.context.shape))
    print ('Len_Context Shape:{}'.format(datum.len_context.shape))
        
    print ('Response Shape:{}'.format(datum.response.shape))
    print ('Len_Response Shape:{}'.format(datum.len_response.shape))
    
    print ('Flags Shape:{}'.format(datum.flag.shape))

Len datum: 6
Context Shape:(16, 160)
Len_Context Shape:(16,)
Response Shape:(16, 27)
Len_Response Shape:(16,)
Flags Shape:(16,)


That is all! We now have a data iterator with named fields that can return batched data.

Well, we are all set to use this data iterator for our Siamese Network for text correlation!