In this notebook, we will learn how to work with **Text Data** in TensorFlow.

More concretely, we will learn how to build a data pipeline using TensorFlow.

Some excellent resources (highly recommended) for a more detailed overview:

1. [Stanford CS230 Blog post](https://cs230-stanford.github.io/tensorflow-input-data.html) on data pipelines using TensorFlow
2. [Talk on tf.data](https://www.youtube.com/watch?v=uIcqeP7MFH0) by Derek Murray, the architect of tf.data
3. Official [tf.data guide](https://www.tensorflow.org/guide/datasets)

In [1]:
import tensorflow as tf
# TF 1.x only: eager execution is already the default in TF 2.x.
tf.enable_eager_execution()

In [2]:
sentences_file = '../data/wiki.1M.txt.tokenized'
vocab_file = '../data/vocab.txt'

### P1: TextLineDataset

Let us start with our favorite file `wiki.1M.txt.tokenized`, which we tokenized in the last notebook. 

This file contains 1M sentences where each token is separated by a space delimiter:
```
anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions .
```

A file which has one line per sentence can be processed using TextLineDataset. Let us see how this works...

In [3]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset_iter = iter(dataset)

In [4]:
print(dataset)

<TextLineDataset shapes: (), types: tf.string>


In [5]:
print(dataset_iter)

<tensorflow.python.data.ops.iterator_ops.EagerIterator object at 0x11da4da58>


In [6]:
print(next(dataset_iter))

tf.Tensor(b'anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions .', shape=(), dtype=string)
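
To make the behaviour concrete without the data file, here is a minimal plain-Python sketch of what iterating over a line-based dataset amounts to: each element is one line with its line terminator removed, produced lazily. The helper name `iter_lines` and the in-memory sample text are assumptions for illustration; this is not how `TextLineDataset` is implemented.

```python
import io

def iter_lines(text_stream):
    """Lazily yield one element per line, without the trailing newline."""
    for line in text_stream:
        yield line.rstrip("\n")

# Hypothetical two-line sample standing in for wiki.1M.txt.tokenized.
sample = io.StringIO(
    "anarchism is a political philosophy .\n"
    "it is a second sentence .\n"
)
lines_iter = iter_lines(sample)
first_line = next(lines_iter)
print(first_line)
```

As with the dataset iterator above, nothing is read until `next` is called, which is what makes this pattern workable for files with millions of lines.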


### P2: StringSplit and map
A useful operator in TF is [tf.string_split](https://www.tensorflow.org/api_docs/python/tf/string_split). It splits each string in a string *tensor* on a delimiter (a space by default) and returns the tokens as a sparse tensor; the `.values` field of the result is a 1-D string tensor holding the tokens.

* We start by defining a dataset from a text file
* **Transform**: A powerful function in datasets is **map**. As the name suggests, it applies an operation to each element of the dataset.

In [7]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset_iter = iter(dataset)

In [8]:
print(next(dataset_iter))

tf.Tensor(
[b'anarchism' b'is' b'a' b'political' b'philosophy' b'that' b'advocates'
 b'self-governed' b'societies' b'based' b'on' b'voluntary' b'institutions'
 b'.'], shape=(14,), dtype=string)
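
The same split-then-map idea can be sketched in plain Python. This assumes the default behaviour of `tf.string_split` (split on a single space and drop empty tokens); `split_on_space` is a hypothetical helper, not a TensorFlow function.

```python
def split_on_space(sentence):
    """Split on single spaces and drop empty tokens."""
    return [token for token in sentence.split(" ") if token]

# Hypothetical sample sentences; the second has a double space on purpose.
sentences = ["anarchism is a political philosophy .", "two  spaces here"]

# map applies the function to every element, one at a time.
tokenized = list(map(split_on_space, sentences))
print(tokenized)
```

Each sentence yields a list of a different length, which is why the shape in the output above is specific to that sentence (`(14,)`).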


### P3: Convert Words to integers

TF provides `lookup_ops`, which can build a table that maps a key (here, a word) to an integer. All you need to construct the table is a vocabulary file.

In [9]:
from tensorflow.python.ops import lookup_ops

In [11]:
vocab_table = lookup_ops.index_table_from_file(vocab_file, default_value=0)

In [12]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset_iter = iter(dataset)

In [13]:
print(next(dataset_iter))

tf.Tensor([7629   11    8  245 1004   17 6285    0 3460  188   20 6173 1370    3], shape=(14,), dtype=int64)
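
A rough plain-Python picture of the lookup, assuming (as the `index_table_from_file` documentation describes) that each vocabulary line is a key and its zero-based line number is the ID. The tiny vocabulary and the `build_vocab` and `lookup` helpers are made up for illustration.

```python
def build_vocab(vocab_lines):
    """Map each vocabulary entry to its zero-based line number."""
    return {word: index for index, word in enumerate(vocab_lines)}

def lookup(vocab, words, default_value=0):
    """Return the ID of each word, or default_value if the word is unknown."""
    return [vocab.get(word, default_value) for word in words]

# Hypothetical vocabulary; line 0 is reserved for unknown words.
vocab = build_vocab(["<unk>", "the", "is", "a"])
ids = lookup(vocab, ["anarchism", "is", "a", "philosophy"])
print(ids)  # [0, 2, 3, 0]
```

Because unknown words fall back to `default_value=0`, that ID is only unambiguous if line 0 of the vocabulary file is reserved for an unknown-word token. Whether `vocab.txt` does this is not shown in this notebook, but the 0 produced for `self-governed` above is consistent with an out-of-vocabulary word.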


### P4: Batching

So, let us recap what we have achieved so far:

1. Read a text file and convert each sentence to a String Tensor
2. Build a lookup table using a vocabulary file, which assigns an integer to a word
3. For each string Tensor, assign word indexes using the lookup table in 2

Now, let us look at how to create mini-batches. This is a necessary step in almost any model you will work with: you will almost never feed a single example, but rather a batch (say, 32 sentences).

Dataset provides a **batch** operation. Why does this fail?

In [16]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset = dataset.batch(batch_size=32)
dataset_iter = iter(dataset)

In [17]:
next(dataset_iter)

InvalidArgumentError: Cannot batch tensors with different shapes in component 0. First element had shape [14] and element 1 had shape [25]. [Op:IteratorGetNextSync]

This is a central challenge in NLP: sentences (i.e., lists of words) are not all the same length. So how can you batch together sentences of different lengths?

The solution is simple: pad every sentence in a batch to the length of the *longest* sentence in that batch, using
[padded_batch](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset#padded_batch)

* batch_size: Number of examples in a batch.
* padded_shapes: The shape each element should be padded to. Each element here is a 1-D tensor whose length we do not know in advance, so we pass `[None]`, which pads to the longest element in the batch.
* padding_values: Optional. By default, numeric tensors are padded with the value 0.

In [20]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(lambda words: vocab_table.lookup(words))
dataset = dataset.padded_batch(batch_size=32, padded_shapes=[None])
dataset_iter = iter(dataset)

In [23]:
words_batch = next(dataset_iter)

In [26]:
words_batch

<tf.Tensor: id=135, shape=(32, 79), dtype=int64, numpy=
array([[   0,   38, 1859, ...,    0,    0,    0],
       [   6, 6428,    2, ...,    0,    0,    0],
       [  47, 2243,  636, ...,    0,    0,    0],
       ...,
       [   1,  349,  114, ...,    0,    0,    0],
       [ 184,    4,    1, ...,    0,    0,    0],
       [   1, 3030,  114, ...,    0,    0,    0]])>
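
What `padded_batch` does to each group of sentences can be pictured in plain Python: find the longest sentence in the batch and right-pad the others with the padding value. `pad_batch` is a hypothetical helper for illustration only, and the IDs are made up.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

# Three hypothetical sentences of lengths 3, 1 and 2.
batch = pad_batch([[7, 11, 8], [245], [17, 6285]])
print(batch)  # [[7, 11, 8], [245, 0, 0], [17, 6285, 0]]
```

Padding is computed per batch, so the second dimension (79 in the output above) can differ from one batch to the next. Note also that with a padding value of 0 and `default_value=0` in the lookup table, padding positions and unknown words share the same ID.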

### Exercise 1: Add the number of words per sentence to words_batch
* Hint: Use `tf.size` to compute the size of a tensor.

### Exercise 2: For each source sentence, add a target sentence, which is the source shifted by one word

Thus, if the source is
`[2, 3, 4, 40]`, we want to create the pair `[2, 3, 4, 40]` (source) and `[3, 4, 40]` (target, i.e. the source without its first word)
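
In plain-Python terms, the requested pair looks like the following. `make_source_target` is a hypothetical name used only for this sketch; in the pipeline the same slicing would be applied to the ID tensor inside a `map` call.

```python
def make_source_target(ids):
    """Return the sentence and a copy with the first token dropped."""
    return ids, ids[1:]

source, target = make_source_target([2, 3, 4, 40])
print(source, target)  # [2, 3, 4, 40] [3, 4, 40]
```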

### Exercise 3: For each sentence, add the characters of the first word
* Hint: Use tf.string_split with `''` as the delimiter
* tf.string_split returns a sparse tensor. Use tf.sparse_tensor_to_dense to convert it to a dense tensor, with a default_value of UNK
* You will need to convert the char tensor to 1-D. Use tf.squeeze with axis=0

In [None]:
def get_chars_from_first_word(words):
    return words

In [1]:
dataset = tf.data.TextLineDataset(sentences_file)
dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
dataset = dataset.map(get_chars_from_first_word)

#Fill in the rest

dataset_iter = iter(dataset)
words_batch = next(dataset_iter)

NameError: name 'tf' is not defined

In [None]:
words_batch