Example of how to use ```tf.data.TextLineDataset``` to load examples from text files. ```TextLineDataset``` is designed to create a dataset from a text file, in which each example is a line of text from the original file. This is potentially useful for any text data that is primarily line-based, such as poetry or error logs.

This example uses three different English translations of the same work, Homer's Iliad, and trains a model to identify the translator given a single line of text.

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow_datasets as tfds
import os

In [2]:
DIRECTORY_URL = "https://storage.googleapis.com/download.tensorflow.org/data/illiad/"

FILE_NAMES = [
   "cowper.txt", 
   "derby.txt", 
   "butler.txt"
]

for name in FILE_NAMES:
   text_dir = tf.keras.utils.get_file(
      name, origin = DIRECTORY_URL + name
   )
  
parent_dir = os.path.dirname(text_dir)
print(parent_dir)

C:\Users\Ridzuan\.keras\datasets


### Load text into datasets

Iterate through the files, loading each one into its own dataset. Each example needs to be individually labeled, so use ```tf.data.Dataset.map``` to apply a labeler function to each one. This will iterate over every example in the dataset, returning ```(example, label)``` pairs.

In [3]:
def labeler(example, index):
   return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
   
   lines_dataset = tf.data.TextLineDataset(
      os.path.join(parent_dir, file_name)
   )
   
   labeled_dataset = lines_dataset.map(
      lambda ex: labeler(ex, i)
   )
   
   labeled_data_sets.append(labeled_dataset)

In [4]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [5]:
all_labeled_data = labeled_data_sets[0]

# Combine the labeled datasets into a single dataset and shuffle it
for labeled_dataset in labeled_data_sets[1:]:
   all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
   
all_labeled_data = all_labeled_data.shuffle(
   BUFFER_SIZE,
   reshuffle_each_iteration = False
)

In [6]:
for example in all_labeled_data.take(5):
   print(example)

(<tf.Tensor: shape=(), dtype=string, numpy=b'Ajax--Idomeneus--abstain ye both'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Trusting to heav'nly signs, and fav'ring Jove,">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Idomeneus captain of the Cretans was first to make out the running, for'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'finding himself in ambush, but is all the time longing to go into'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"He spake and sat, when Thestor's son arose">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)


### Encode text lines as numbers

Machine learning models work on numbers, not words, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

#### Vocabulary building

Build the vocabulary by tokenizing the text into a collection of unique words.
- Iterate over each example's ```numpy``` value.
- Use ```tfds.features.text.Tokenizer``` to split it into tokens.
- Collect these tokens into a set to remove duplicates.
- Get the size of the vocabulary for later use.
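
As a rough, library-free illustration of these steps, the sketch below tokenizes a few sample lines with a regular expression and collects the tokens into a set. The ```simple_tokenize``` helper and the sample lines are assumptions made for this illustration; the regular expression only approximates what the tfds tokenizer does.

```python
import re


def simple_tokenize(line):
    # Hypothetical stand-in for the tfds tokenizer: keep runs of
    # letters, digits and underscores, and drop everything else.
    return re.findall(r"\w+", line)


sample_lines = [
    "Ajax--Idomeneus--abstain ye both",
    "He spake and sat, when Thestor's son arose",
    "ye both arose",
]

vocabulary = set()
for line in sample_lines:
    # A set ignores tokens it has already seen.
    vocabulary.update(simple_tokenize(line))

vocabulary_size = len(vocabulary)
print(sorted(vocabulary))
print("Vocabulary size:", vocabulary_size)
```

The third sample line contains only words seen earlier, so it adds nothing to the set.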

In [7]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()

for text_tensor, _ in all_labeled_data:
   some_tokens = tokenizer.tokenize(text_tensor.numpy())
   vocabulary_set.update(some_tokens)
   
vocab_size = len(vocabulary_set)
print("Vocabulary size: ", vocab_size)

Vocabulary size:  17178


#### Encoder

Create an encoder by passing the ```vocabulary_set``` to ```tfds.features.text.TokenTextEncoder```. The encoder's ```encode``` method takes a string of text and returns a list of integers.
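
To make the idea of a token encoder concrete, here is a toy version in plain Python. It is not the tfds implementation: ```build_token_ids``` and ```encode_tokens``` are hypothetical helpers, and the choice to start ids at 1 (leaving 0 free for padding) and to give unknown tokens one shared id is an assumption made for this sketch.

```python
def build_token_ids(vocabulary):
    # Assign ids starting at 1 so that 0 stays free for padding.
    # Sorting makes the mapping reproducible across runs.
    return {token: i for i, token in enumerate(sorted(vocabulary), start=1)}


def encode_tokens(tokens, token_ids):
    # Tokens missing from the vocabulary share one extra id past the end.
    unknown_id = len(token_ids) + 1
    return [token_ids.get(token, unknown_id) for token in tokens]


token_ids = build_token_ids({"abstain", "both", "ye"})
print(encode_tokens(["ye", "both", "abstain"], token_ids))
print(encode_tokens(["ye", "gods"], token_ids))
```

The real encoder builds its own mapping, so the ids it returns for the same words will differ from the ones in this sketch.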

In [8]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [9]:
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

b'Ajax--Idomeneus--abstain ye both'


In [10]:
encoded_example_text = encoder.encode(example_text)
print(encoded_example_text)

[4687, 6673, 9499, 15439, 6050]


In [11]:
def encode(text_tensor, label):
   encoded_text = encoder.encode(text_tensor.numpy())
   return encoded_text, label

Use ```Dataset.map``` to apply the ```encode``` function to each element in the dataset. ```Dataset.map``` runs in graph mode:
- Graph tensors do not have a value.
- In graph mode, only TensorFlow ops and functions can be used.

Therefore the ```encode``` function can't be passed to ```Dataset.map``` directly, because it calls ```.numpy()```. Wrap it in a ```tf.py_function```, which passes regular tensors (with a value and a ```numpy``` method to access it) to the wrapped function.

In [12]:
def encode_map_fn(text, label):
   # tf.py_function doesn't set the shape of the returned tensors
   encoded_text, label = tf.py_function(
      encode,
      inp = [text, label],
      Tout = (tf.int64, tf.int64)
   )
   
   # Datasets work best if all components have a shape set,
   # so set the shapes manually
   
   encoded_text.set_shape([None])
   label.set_shape([])
   
   return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

In [13]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(
   BATCH_SIZE, 
   padded_shapes = ([None], [])
)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(
   BATCH_SIZE, 
   padded_shapes = ([None], [])
)

In [14]:
sample_text, sample_labels = next(iter(test_data))
print("Sample text: ", sample_text[0])
print("Sample label: ", sample_labels[0])

Sample text:  tf.Tensor(
[ 4687  6673  9499 15439  6050     0     0     0     0     0     0     0
     0     0     0     0], shape=(16,), dtype=int64)
Sample label:  tf.Tensor(0, shape=(), dtype=int64)


Padding introduces a new token encoding (zero), so increase the vocabulary size by one.
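
The effect of padding on the range of ids can be sketched without TensorFlow. ```pad_batch``` is a hypothetical helper that mimics right-padding with zeros, as seen in the sample output above; the token ids are made up for the illustration.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right up to the longest one in the batch.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Made-up encoded lines that use the ids 1..3 of a 3-word vocabulary.
batch = pad_batch([[3, 1, 2], [2], [1, 3]])
print(batch)

# Padding adds id 0 on top of ids 1..3, so a lookup table
# indexed by id needs 3 + 1 rows.
distinct_ids = {token_id for row in batch for token_id in row}
print(len(distinct_ids))
```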

In [15]:
vocab_size = vocab_size + 1

### Build the model

In [16]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings.

In [17]:
model.add(
   tf.keras.layers.Embedding(vocab_size, 64)
)

A Long Short-Term Memory (LSTM) layer lets the model understand words in their context with other words. A bidirectional wrapper on the LSTM layer helps it learn about each data point in relation to the data points that came before and after it.

In [18]:
model.add(
   tf.keras.layers.Bidirectional(
      tf.keras.layers.LSTM(64)
   )
)

Add one or more dense layers and an output layer. The output layer produces one score for each label (a logit, since the layer has no activation). The label with the highest score is the model's prediction for an example.
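
How three output values become a prediction can be sketched in plain Python. Softmax turns the raw output values into probabilities, and the index of the largest one is the predicted label. The scores below are made up for illustration and the ```softmax``` helper is written for this sketch only.

```python
import math


def softmax(scores):
    # Subtract the maximum first for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.2, -0.3, 2.5]  # made-up outputs for labels 0, 1 and 2
probabilities = softmax(scores)

# The predicted label is the index with the highest probability.
predicted_label = max(range(len(scores)), key=lambda i: probabilities[i])
print(probabilities, predicted_label)
```

Softmax preserves the ordering of the scores, so taking the largest raw score gives the same label as taking the largest probability.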

In [19]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
   model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3))

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          1099456   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 195       
Total params: 1,178,115
Trainable params: 1,178,115
Non-trainable params: 0
_________________________________________________________________
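
The parameter counts in the summary can be checked by hand. The arithmetic below assumes the standard parameter layout of embedding, LSTM and dense layers, and uses the vocabulary size printed earlier in this notebook.

```python
vocab_size = 17178 + 1  # vocabulary plus the padding id
embedding_dim = 64
lstm_units = 64

# One embedding vector per token id.
embedding = vocab_size * embedding_dim

# One LSTM has 4 gates, each with input weights, recurrent weights and a bias.
one_lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# The bidirectional wrapper holds a forward and a backward LSTM.
bidirectional = 2 * one_lstm

# Dense layers: inputs * units weights plus one bias per unit.
dense = (2 * lstm_units) * 64 + 64
dense_1 = 64 * 64 + 64
dense_2 = 64 * 3 + 3

total = embedding + bidirectional + dense + dense_1 + dense_2
print(embedding, bidirectional, dense, dense_1, dense_2, total)
```

Almost all of the parameters sit in the embedding layer, which is why the vocabulary size dominates the size of the model.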


In [21]:
model.compile(
   optimizer='adam',
   loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
   metrics=['accuracy'])

In [22]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x20c2ea29e80>