
# Load text with tf.data

This tutorial provides an example of how to use `tf.data.TextLineDataset` to load examples from text files. `TextLineDataset` creates a dataset from a text file, in which each example is one line of text from the original file. This is useful for any text data that is primarily line-based (for example, poetry or error logs).

In this tutorial, we'll use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.
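
To make the "one line, one example" idea concrete, here is a minimal plain-Python sketch of a line-based reader. It is only an analogue of what `TextLineDataset` provides, not its implementation; the helper `iter_lines` and the file `sample.txt` are hypothetical names introduced for illustration.

```python
import os
import tempfile


def iter_lines(path):
    """Yield each line of a text file as one example, without its newline."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Write a tiny hypothetical file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\nthird line\n")
    examples = list(iter_lines(sample_path))

print(examples)
```

Each element of `examples` corresponds to one line of the file, which is the same unit the tutorial later labels and encodes.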



## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

!pip install -q tensorflow==2.0.0-alpha0
import tensorflow as tf

import tensorflow_datasets as tfds
import os


The texts of the three translations are by:

* William Cowper — text

* Edward, Earl of Derby — text

* Samuel Butler — text

The text files used in this tutorial have undergone some typical preprocessing, mostly removal of the document header and footer, line numbers, and chapter titles. Download these lightly munged files locally.

In [2]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# tf.keras.utils.get_file downloads a file from a URL if it is not already in the cache.
# By default, the file at the URL `origin` is downloaded to the cache_dir ~/.keras,
# placed in the cache_subdir datasets, and given the filename fname.
# The final location of a file example.txt would therefore be ~/.keras/datasets/example.txt.
for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)
  
parent_dir = os.path.dirname(text_dir)

parent_dir

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


'/root/.keras/datasets'

In [3]:
text_dir

'/root/.keras/datasets/butler.txt'
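
The download loop keeps only the last returned path in `text_dir`, and `os.path.dirname` still gives the right folder because all three files land in the same cache directory. The sketch below rebuilds that default location by hand. `default_cache_path` is a hypothetical helper that mirrors the defaults described in the comments above; it is not part of Keras.

```python
import os


def default_cache_path(fname, cache_dir=None, cache_subdir="datasets"):
    """Compose <cache_dir>/<cache_subdir>/<fname>, defaulting cache_dir to ~/.keras."""
    if cache_dir is None:
        cache_dir = os.path.join(os.path.expanduser("~"), ".keras")
    return os.path.join(cache_dir, cache_subdir, fname)


file_names = ["cowper.txt", "derby.txt", "butler.txt"]
paths = [default_cache_path(name) for name in file_names]

# Every file shares one parent directory, so any of the paths can supply it.
parent_dirs = {os.path.dirname(path) for path in paths}
print(parent_dirs)
```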

## Load text into datasets

Iterate through the files, loading each one into its own dataset.

Each example needs to be labeled individually, so use `tf.data.Dataset.map` to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

In [0]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

In [14]:
for x in labeled_dataset.take(1):
  print(x)

(<tf.Tensor: id=148918, shape=(), dtype=string, numpy=b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'>, <tf.Tensor: id=148919, shape=(), dtype=int64, numpy=2>)


In [16]:
for x in labeled_data_sets[0]:
  print(x)

(<tf.Tensor: id=148926, shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: id=148927, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148930, shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: id=148931, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148934, shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: id=148935, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148938, shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: id=148939, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148942, shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: id=148943, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148946, shape=(), dtype=string, numpy=b'To dogs and to all ravening fowls a prey,'>, <tf.Tensor: id=148947, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=148950, shape=(), dtype=string, numpy=
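
One detail of the labeling loop is worth noting: the lambda closes over the loop variable `i`. The printed labels (0 for the first dataset, 2 for the last) show that each dataset received its own index, which is consistent with `map` capturing the value at the time it is called. In a plain-Python pipeline that evaluates lazily, the same pattern would attach the final value of `i` to every dataset. The sketch below shows the pitfall with the built-in `map` and the common default-argument fix; the sample lines are made up for illustration.

```python
names = ["cowper.txt", "derby.txt", "butler.txt"]
lines = ["a line", "another line"]


def labeler(example, index):
    return example, index


# Late binding: each lambda looks up `i` only when it finally runs, after the loop.
late_bound = []
for i, _ in enumerate(names):
    late_bound.append(map(lambda ex: labeler(ex, i), lines))
late_labels = [next(iter(mapped))[1] for mapped in late_bound]

# Default argument: `i=i` freezes the current value of `i` for each lambda.
early_bound = []
for i, _ in enumerate(names):
    early_bound.append(map(lambda ex, i=i: labeler(ex, i), lines))
early_labels = [next(iter(mapped))[1] for mapped in early_bound]

print(late_labels)   # [2, 2, 2]
print(early_labels)  # [0, 1, 2]
```

Writing `lambda ex, i=i: labeler(ex, i)` in the notebook's loop would make the intent explicit without changing the result shown above.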

Combine these labeled datasets into a single dataset, and shuffle it.

In [0]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [0]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

In [0]:
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
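
`shuffle` does not load the whole dataset and permute it. It keeps a buffer of `BUFFER_SIZE` elements and draws from that buffer at random, refilling it as it goes. The following plain-Python sketch illustrates the idea on three tiny concatenated lists. `buffered_shuffle` is a hypothetical helper, not the tf.data implementation, and the sample data is invented.

```python
import itertools
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    iterator = iter(items)
    # Fill the buffer with the first `buffer_size` items.
    buffer = list(itertools.islice(iterator, buffer_size))
    # For each further item, emit a random buffered item and put the new one in its slot.
    for item in iterator:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


cowper = [("cowper line %d" % n, 0) for n in range(4)]
derby = [("derby line %d" % n, 1) for n in range(4)]
butler = [("butler line %d" % n, 2) for n in range(4)]

# Concatenate the labeled lists, then shuffle with a buffer smaller than the data.
combined = list(itertools.chain(cowper, derby, butler))
shuffled = list(buffered_shuffle(combined, buffer_size=5, seed=0))

for pair in shuffled:
    print(pair)
```

With a buffer smaller than the dataset, early outputs can only come from early inputs, so the mixing is local. A buffer at least as large as the dataset gives a full shuffle.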

You can use `tf.data.Dataset.take` and `print` to see what the (example, label) pairs look like. The `numpy` property shows each Tensor's value.

In [8]:
for ex in all_labeled_data.take(5):
  print(ex)

(<tf.Tensor: id=49, shape=(), dtype=string, numpy=b'Equipp\'d with all things needed for the way?"'>, <tf.Tensor: id=50, shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: id=53, shape=(), dtype=string, numpy=b"Sands heaping o'er him and around him sands">, <tf.Tensor: id=54, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=57, shape=(), dtype=string, numpy=b'going down of the sun the line of beacon-fires blazes forth, flaring'>, <tf.Tensor: id=58, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=61, shape=(), dtype=string, numpy=b"So saying, she tumult raised in Helen's mind.">, <tf.Tensor: id=62, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=65, shape=(), dtype=string, numpy=b'Shall now be mine, for I despair to escape'>, <tf.Tensor: id=66, shape=(), dtype=int64, numpy=0>)


In [9]:
all_labeled_data

<ShuffleDataset shapes: ((), ()), types: (tf.string, tf.int64)>

In [12]:
for text_tensor, _ in all_labeled_data.take(1):
  print(text_tensor.numpy())

b'Equipp\'d with all things needed for the way?"'
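
The values returned by `.numpy()` here are Python `bytes`, not `str`. In the earlier output, the first line of a file also carries a UTF-8 byte-order mark (the `\xef\xbb\xbf` prefix). The sketch below shows how such a value could be turned into text in plain Python. Stripping the mark with the `utf-8-sig` codec is an optional step that the notebook itself does not perform.

```python
# First line of butler.txt as printed earlier, including the byte-order mark.
raw_first_line = b"\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought"
raw_other_line = b"His wrath pernicious, who ten thousand woes"

# Plain UTF-8 decoding keeps the mark as the character U+FEFF.
plain = raw_first_line.decode("utf-8")

# The "utf-8-sig" codec drops a leading mark and leaves other lines unchanged.
clean = raw_first_line.decode("utf-8-sig")
other = raw_other_line.decode("utf-8-sig")

print(repr(plain[0]))
print(clean)
print(other)
```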


## Encode text lines as numbers

Machine learning models work on numbers, not words, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

### Build vocabulary

First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are a few ways to do this in both TensorFlow and Python. For this tutorial:

* Iterate over each example's numpy value.
* Use `tfds.features.text.Tokenizer` to split it into tokens.
* Collect these tokens into a Python `set`, to remove duplicates.
* Get the size of the vocabulary for later use.

In [17]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
  some_tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178
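
The same steps can be sketched without TensorFlow to show what the loop accumulates. The regular-expression tokenizer below is a stand-in chosen for illustration and may split text differently from `tfds.features.text.Tokenizer`. The two sample lines are adapted from the output shown earlier.

```python
import re


def tokenize(text):
    """Stand-in tokenizer: runs of word characters (an assumption, not the tfds rule)."""
    return re.findall(r"\w+", text)


lines = [
    b"Sing, O goddess, the anger of Achilles",
    b"the anger of Achilles son of Peleus",
]

vocabulary_set = set()
for raw in lines:
    # Each example arrives as bytes, so decode before tokenizing.
    vocabulary_set.update(tokenize(raw.decode("utf-8")))

vocab_size = len(vocabulary_set)
print(sorted(vocabulary_set))
print(vocab_size)
```

Repeated words such as "the" and "of" are counted once, which is why a set is used rather than a list.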

### Encode Examples

Create an encoder by passing the `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes in a string of text and returns a list of integers.

In [0]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [0]:
# Take the first (text, label) tuple from the dataset
a = next(iter(all_labeled_data))

In [23]:
a[1]

<tf.Tensor: id=374331, shape=(), dtype=int64, numpy=1>

You can try this on a single line to see what the output looks like.

In [24]:
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

b'Equipp\'d with all things needed for the way?"'


In [25]:
encoded_example = encoder.encode(example_text)
print(encoded_example)

[1164, 991, 14985, 3329, 3616, 5999, 10549, 13184, 11240]
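
To show what a token encoder does conceptually, here is a minimal sketch that maps each vocabulary word to an integer and encodes a line as a list of those integers. The id layout (0 left free for padding, ids starting at 1, one extra id for unknown words) is an assumption made for this sketch, and `build_encoder` is a hypothetical helper, not the tfds class. Sorting the vocabulary is also a choice made here: iteration order of a Python set of strings can differ between runs, so sorting keeps the ids reproducible.

```python
import re


def build_encoder(vocabulary):
    """Return an encode(text) function for the given collection of tokens."""
    ordered = sorted(vocabulary)
    # Ids start at 1 so that 0 stays available for padding (assumed layout).
    token_to_id = {token: index + 1 for index, token in enumerate(ordered)}
    unknown_id = len(ordered) + 1

    def encode(text):
        tokens = re.findall(r"\w+", text)  # stand-in tokenizer
        return [token_to_id.get(token, unknown_id) for token in tokens]

    return encode


vocabulary = {"the", "anger", "of", "Achilles"}
encode = build_encoder(vocabulary)

known = encode("the anger of Achilles")
with_unknown = encode("the anger of Zeus")
print(known)
print(with_unknown)
```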


Now run the encoder on the dataset by wrapping it in `tf.py_function` and passing that to the dataset's `map` method.

In [0]:
def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

# tf.py_function wraps a Python function into a TensorFlow op that executes it eagerly.
#   func: A Python function that accepts a list of Tensor objects whose element types match
#         the corresponding tf.Tensor objects in inp, and returns a list of Tensor objects
#         (or a single Tensor, or None) whose element types match the corresponding values in Tout.
#   inp:  A list of Tensor objects.
#   Tout: A list or tuple of TensorFlow data types (or a single data type if there is only one)
#         indicating what func returns; an empty list if no value is returned
#         (i.e., if the return value is None).

def encode_map_fn(text, label):
  return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

all_encoded_data = all_labeled_data.map(encode_map_fn)

