In this notebook, we will learn how to work with **Text Data** in TensorFlow using [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) in [eager](https://www.tensorflow.org/guide/eager) mode.

Let us start by exploring [tf.data.TextLineDataset](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset).

In [1]:
import tensorflow as tf
tf.enable_eager_execution()

### P1: TextLineDataset

Let us start with our favorite file `wiki.1M.txt.tokenized`, which we tokenized in the last notebook.

This file contains 1M sentences, with each token separated by a space.

A file with one sentence per line can be processed using TextLineDataset, which yields each line as a scalar string tensor. Let us see how this works.

In [2]:
sentences_file = 'wiki.100K.txt'

In [3]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset_iter = iter(dataset)

In [4]:
print(dataset)

<TextLineDataset shapes: (), types: tf.string>


In [5]:
print(dataset_iter)

<tensorflow.python.data.ops.iterator_ops.EagerIterator object at 0x1235a2f60>


In [6]:
print(next(dataset_iter))

tf.Tensor(b'anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions .', shape=(), dtype=string)
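To build intuition for what the iterator above yields, here is a minimal plain-Python sketch that mimics the behavior: it reads a file lazily and yields one line at a time as bytes, without the trailing newline. The helper `text_lines` and the two-sentence demo file are hypothetical stand-ins, not part of TensorFlow or of this notebook's data.

```python
import os
import tempfile


def text_lines(path):
    """Lazily yield each line of a file as bytes, without the trailing newline."""
    with open(path, "rb") as f:
        for line in f:
            yield line.rstrip(b"\n")


# Hypothetical two-sentence file standing in for the real corpus.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"anarchism is a political philosophy .\nit rejects the state .\n")
    demo_path = tmp.name

lines_iter = text_lines(demo_path)
first_line = next(lines_iter)   # one line is read per call to next()
remaining = list(lines_iter)    # exhausting the generator also closes the file
os.remove(demo_path)

print(first_line)
```

As with the dataset iterator, nothing is read until `next` is called, so the whole file never has to fit in memory.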


### P2: StringSplit
A useful operator in TF is [tf.string_split](https://www.tensorflow.org/api_docs/python/tf/string_split). It splits each string in a string *tensor* on a delimiter (a space by default) and returns a SparseTensor whose `.values` field holds the resulting tokens.

* We start by defining a dataset from a text file
* We now define what operation should be performed on each member of the dataset using **map**

In [7]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset_iter = iter(dataset)

In [8]:
print(next(dataset_iter))

tf.Tensor(
[b'anarchism' b'is' b'a' b'political' b'philosophy' b'that' b'advocates'
 b'self-governed' b'societies' b'based' b'on' b'voluntary' b'institutions'
 b'.'], shape=(14,), dtype=string)
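For a single sentence, the split above behaves much like an ordinary byte-string split on spaces that drops empty tokens. The sketch below shows that equivalent in plain Python; `split_tokens` is a hypothetical helper used only for illustration.

```python
def split_tokens(sentence, delimiter=b" "):
    """Split a byte string on `delimiter`, dropping empty tokens."""
    return [token for token in sentence.split(delimiter) if token]


tokens = split_tokens(b"anarchism is a political philosophy .")
print(tokens)
```

Dropping empty tokens means that repeated spaces do not produce empty strings in the result.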


### P3: Convert Words to integers

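TF provides lookup_ops, which can convert a key (here, a word) to an integer. All you need to construct the table is a vocab file with one word per line; each word is assigned its zero-based line number. Words missing from the vocabulary map to `default_value`, which is 0 here.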

In [9]:
from tensorflow.python.ops import lookup_ops

In [10]:
vocab_table = lookup_ops.index_table_from_file('vocab.txt', default_value=0)

In [11]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset_iter = iter(dataset)

In [12]:
print(next(dataset_iter))

tf.Tensor([7627   11    8  245 1004   17 6284    0 3460  188   20 6174 1370    3], shape=(14,), dtype=int64)
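The lookup can be pictured as a dictionary from each vocabulary line to its line number, with a fallback id for unknown words. The following is a minimal sketch of that idea in plain Python, not the TensorFlow implementation. `build_index_table` and `demo_vocab` are hypothetical, and reserving line 0 for an `<unk>` token is an assumption about how a vocab file might be laid out, not something this notebook states about `vocab.txt`.

```python
def build_index_table(vocab_lines, default_value=0):
    """Return a function mapping each word to its 0-based line number."""
    word_to_id = {word: index for index, word in enumerate(vocab_lines)}

    def lookup(words):
        # Words missing from the vocabulary fall back to default_value.
        return [word_to_id.get(word, default_value) for word in words]

    return lookup


# Hypothetical vocabulary; line 0 is reserved for unknown words (an assumption).
demo_vocab = ["<unk>", "is", "a", "anarchism"]
lookup = build_index_table(demo_vocab, default_value=0)

ids = lookup(["anarchism", "is", "a", "self-governed"])
print(ids)  # [3, 1, 2, 0]
```

If line 0 held an ordinary word instead, unknown words would be indistinguishable from that word, which is why a reserved first entry is a common layout choice.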


### P4: Batching

So, let us recap what we have achieved so far:

1. Read a text file and convert each sentence to a String Tensor
2. Build a lookup table using a vocabulary file, which assigns an integer to a word
3. For each string Tensor, assign word indexes using the lookup table in 2

Now, let us look at how to create mini-batches. This is a necessary step in almost any model you will work with: rather than feeding a single example, you typically feed a batch (say, 32 sentences).

Dataset provides a **batch** operation. Why does this fail?

In [13]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset = dataset.batch(batch_size=32)
dataset_iter = iter(dataset)

In [14]:
print(next(dataset_iter))

InvalidArgumentError: Cannot batch tensors with different shapes in component 0. First element had shape [14] and element 1 had shape [25]. [Op:IteratorGetNextSync]

Well, as you can see, this fails because you are trying to batch sentences of different lengths together.

This is a key challenge in NLP: sentences (i.e., lists of words) are rarely all the same length. So how can you batch together sentences of different lengths?

The solution is simple: size each batch to its *longest* sentence and pad the rest, using
[padded_batch](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset#padded_batch):

* batch_size: Number of examples in a batch.
* padded_shapes: The shape each example is padded to. A dimension of `None` means "pad to the longest example in the batch".
* padding_values: Optional. By default, numeric tensors are padded with 0.

In [15]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset = dataset.padded_batch(batch_size=32, padded_shapes=[None])
dataset_iter = iter(dataset)

In [16]:
print(next(dataset_iter))

tf.Tensor(
[[ 7627    11     8 ...     0     0     0]
 [   63    23   113 ...     0     0     0]
 [ 7627  2040     1 ...     0     0     0]
 ...
 [   12   120     4 ...     0     0     0]
 [  187     7  1232 ...     0     0     0]
 [  601 26681     2 ...     0     0     0]], shape=(32, 52), dtype=int64)
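The padding logic itself is easy to state in plain Python: group the sequences into batches, find the longest sequence in each batch, and extend the shorter ones with a fill value. The sketch below illustrates this under the assumption that each sentence is a list of integer ids; the `padded_batch` function here is a hypothetical stand-in, not the tf.data method, and `demo_ids` is made-up data.

```python
def padded_batch(sequences, batch_size, padding_value=0):
    """Yield batches in which every sequence is padded to the batch's longest."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [padding_value] * (max_len - len(seq)) for seq in batch]


# Hypothetical word-id sequences of different lengths.
demo_ids = [[7, 11, 8], [63, 23], [5], [9, 9, 9, 9]]

batches = list(padded_batch(demo_ids, batch_size=2))
print(batches[0])  # [[7, 11, 8], [63, 23, 0]]
print(batches[1])  # [[5, 0, 0, 0], [9, 9, 9, 9]]
```

Note that the padded length is decided per batch, so different batches can have different widths; this matches the meaning of a `None` dimension in `padded_shapes`.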


What if we want to pad the data with a specific value, say 345?

* Supply that value as a scalar tensor in padding_values. Its type must match the type of the data being padded, which is tf.int64 here.

In [17]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset = dataset.padded_batch(batch_size=32, padded_shapes=[None], padding_values=tf.convert_to_tensor(345, dtype=tf.int64))
dataset_iter = iter(dataset)

In [18]:
print(next(dataset_iter))

tf.Tensor(
[[ 7627    11     8 ...   345   345   345]
 [   63    23   113 ...   345   345   345]
 [ 7627  2040     1 ...   345   345   345]
 ...
 [   12   120     4 ...   345   345   345]
 [  187     7  1232 ...   345   345   345]
 [  601 26681     2 ...   345   345   345]], shape=(32, 52), dtype=int64)
