# Welcome to the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) Dataset Colab!

Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and [accelerate ML research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).

**This colab shows you how to add your own dataset to T2T so that you can train one of the several preexisting models on your newly added dataset!**

For a tutorial that covers all the broader aspects of T2T using existing datasets and models, please see this [IPython notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb).

In [0]:
#@title
# Copyright 2018 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

>[Welcome to the Tensor2Tensor Dataset Colab!](#scrollTo=Wd48fv-zDMe6)

>>[Installation & Setup](#scrollTo=Urn4QmNfI3hw)

>>[Define the Problem](#scrollTo=LUoP57gOjlk9)

>>>[Run t2t_datagen](#scrollTo=Q1xBmlrFLSPX)

>>[Viewing the generated data.](#scrollTo=MCqJhdnYgiG-)

>>>[tf.python_io.tf_record_iterator](#scrollTo=uNpohcPXKsLN)

>>>[Using tf.data.Dataset](#scrollTo=6o_1BHGQC5w5)

>>[Terminology](#scrollTo=xRtfC0sHBlSo)

>>>[Problem](#scrollTo=xRtfC0sHBlSo)

>>>[Modalities](#scrollTo=xRtfC0sHBlSo)



## Installation & Setup


We'll install T2T and TensorFlow.

We also need to set up the directories that T2T will use:

*   `DATA_DIR`: where the dataset is generated, i.e. the TFRecord files for the training and eval sets, the vocabulary files, etc.
*   `OUTPUT_DIR`: where training runs and where the graph and checkpoint files are kept.
*   `TMP_DIR`: a scratch directory for downloading your dataset from a URL, unzipping it, etc.

In [0]:
#@title Run for installation.

! pip install -q -U tensor2tensor
! pip install -q tensorflow

In [0]:
#@title Run this only once - Sets up TF Eager execution.

import tensorflow as tf

# Enable Eager execution - useful for seeing the generated data.
tf.enable_eager_execution()

In [0]:
#@title Setting a random seed.

from tensor2tensor.utils import trainer_lib

# Set a seed so that we have deterministic outputs.
RANDOM_SEED = 301
trainer_lib.set_random_seed(RANDOM_SEED)

In [0]:
#@title Run for setting up directories.

import os

# Setup and create directories.
DATA_DIR = os.path.expanduser("/tmp/t2t/data")
OUTPUT_DIR = os.path.expanduser("/tmp/t2t/output")
TMP_DIR = os.path.expanduser("/tmp/t2t/tmp")

# Create them.
tf.gfile.MakeDirs(DATA_DIR)
tf.gfile.MakeDirs(OUTPUT_DIR)
tf.gfile.MakeDirs(TMP_DIR)

## Define the `Problem`

To keep things simple, our input text is sampled randomly from the letters [a, z]: each sentence has between 3 and 20 words, and each word is between 1 and 8 characters long.

Example input: "olrkpi z cldv xqcxisg cutzllf doteq" -- this will be generated by `sample_sentence()`

Our output will be the input words sorted by length, with words of equal length kept in their input order.

Example output: "z cldv doteq olrkpi xqcxisg cutzllf" -- this will be produced by `target_sentence()`
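
As a quick sanity check of the example pair above, here is a minimal sketch that sorts the example input by word length. The helper name `sort_by_length` is hypothetical and used only for illustration; it relies on Python's `sorted` being stable, which is why the two seven-letter words keep their input order.

```python
def sort_by_length(sentence):
    # Split on single spaces and sort by word length; ties keep input order
    # because Python's sort is stable.
    words = sentence.split(" ")
    return " ".join(sorted(words, key=len))

example_input = "olrkpi z cldv xqcxisg cutzllf doteq"
print(sort_by_length(example_input))
# z cldv doteq olrkpi xqcxisg cutzllf
```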

Let's dive right into our first problem -- we'll explain as we go on.

Take some time to read each line along with its comments -- or skip them and come back later to clarify your understanding.

In [0]:
#@title Define `sample_sentence()` and `target_sentence(input_sentence)`
import random
import string

def sample_sentence():
    # Our sentence has between 3 and 20 words
    num_words = random.randint(3, 20)
    words = []
    for i in range(num_words):
        # Our words have between 1 and 8 characters.
        num_chars = random.randint(1, 8)
        chars = []
        for j in range(num_chars):
            chars.append(random.choice(string.ascii_lowercase))
        words.append("".join(chars))
    return " ".join(words)

def target_sentence(input_sentence):
    words = input_sentence.split(" ")
    return " ".join(sorted(words, key=lambda x: len(x)))

In [0]:
# `Problem` is the base class for any dataset that we want to add to T2T -- it
# unifies the specification of the problem for generating training data,
# training, evaluation and inference.
#
# All its methods (except `generate_data`) have reasonable default
# implementations.
#
# A sub-class must implement `generate_data(data_dir, tmp_dir)` -- this method
# is called by t2t-trainer or t2t-datagen to actually generate TFRecord dataset
# files on disk.
from tensor2tensor.data_generators import problem

# Certain categories of problems are very common, such as those where either the
# input or the output is text. For such problems we define an (abstract)
# sub-class of `Problem` called `Text2TextProblem`, which implements
# `generate_data` in terms of another function, `generate_samples`. Sub-classes
# must override `generate_samples` and `is_generate_per_split`.
from tensor2tensor.data_generators import text_problems

# Every non-abstract problem sub-class (as well as models and hyperparameter
# sets) must be registered with T2T so that T2T knows about it and can look it
# up when you specify your problem on the commandline to t2t-trainer or
# t2t-datagen.
#
# One uses:
# `register_problem` for a new Problem sub-class.
# `register_model` for a new T2TModel sub-class.
# `register_hparams` for a new hyperparameter set. All hyperparameter sets
# typically extend `common_hparams.basic_params1` (directly or indirectly).
from tensor2tensor.utils import registry


# By default, when you register a problem (or model or hyperparameter set) the
# name with which it gets registered is the 'snake case' version -- so here
# the Problem class `SortWordsAccordingToLengthRandom` will be registered with
# the name `sort_words_according_to_length_random`.
#
# One can override this default by actually assigning a name as follows:
# `@registry.register_problem("my_awesome_problem")`
#
# The registered name is specified to the t2t-trainer or t2t-datagen using the
# commandline flag `--problem`.
#
# We inherit from `Text2TextProblem`, which takes care of many details of
# reading and writing the data to disk, which vocabulary type to use, its size,
# etc., so that we need not worry about them. One can, of course, override
# those.
@registry.register_problem
class SortWordsAccordingToLengthRandom(text_problems.Text2TextProblem):
  """Sort words on length in randomly generated text."""

  # START: Methods we should override.

  # The methods that need to be overridden from `Text2TextProblem` are:
  # `is_generate_per_split` and
  # `generate_samples`.

  @property
  def is_generate_per_split(self):
    # If we have pre-existing data splits for (train, eval, test) then we set
    # this to True, which will have generate_samples be called for each of the
    # dataset_splits.
    #
    # If we do not have pre-existing data splits, we set this to False, which
    # will have generate_samples be called just once and the Problem will
    # automatically partition the data into dataset_splits.
    return False

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    # Here we are generating the data in-situ using the `sample_sentence`
    # function, otherwise we would have downloaded the data and put it in
    # `tmp_dir` -- and read it from that location.
    del tmp_dir

    # Unused here, is used in `Text2TextProblem.generate_data`.
    del data_dir

    # This would have been useful if `self.is_generate_per_split` were True.
    # In that case we would have checked whether we were generating a training,
    # evaluation or test sample. This is of type `problem.DatasetSplit`.
    del dataset_split

    # Just an arbitrary limit to our number of examples, this can be set higher.
    MAX_EXAMPLES = 10

    for i in range(MAX_EXAMPLES):
      sentence_input = sample_sentence()
      sentence_target = target_sentence(sentence_input)
      yield {
          "inputs"  : sentence_input,
          "targets" : sentence_target,
      }

  # END: Methods we should override.

  # START: Overridable methods.

  @property
  def vocab_type(self):
    # We can use different types of vocabularies, `VocabType.CHARACTER`,
    # `VocabType.SUBWORD` and `VocabType.TOKEN`.
    #
    # SUBWORD and CHARACTER are fully invertible -- but SUBWORD provides a good
    # tradeoff between CHARACTER and TOKEN.
    return text_problems.VocabType.SUBWORD

  @property
  def approx_vocab_size(self):
    # Approximate vocab size to generate. Only for VocabType.SUBWORD.
    return 2**13  # ~8k

  @property
  def dataset_splits(self):
    # Since we are responsible for generating the dataset splits, we override
    # `Text2TextProblem.dataset_splits` to specify that we intend to keep
    # 80% data for training and 10% for evaluation and testing each.
    return [{
        "split": problem.DatasetSplit.TRAIN,
        "shards": 8,
    }, {
        "split": problem.DatasetSplit.EVAL,
        "shards": 1,
    }, {
        "split": problem.DatasetSplit.TEST,
        "shards": 1,
    }]

  # END: Overridable methods.

That's it!

To use this with `t2t-trainer` or `t2t-datagen`, save it to a directory, add an `__init__.py` that imports it, and then specify that directory with `--t2t_usr_dir`.

For example:

```
$ t2t-datagen \
  --problem=sort_words_according_to_length_random \
  --data_dir=/tmp/t2t/data \
  --tmp_dir=/tmp/t2t/tmp \
  --t2t_usr_dir=/tmp/t2t/usr

```
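
As a rough sketch of the directory layout that `--t2t_usr_dir` expects, the snippet below writes a module file plus an `__init__.py` that imports it. The helper `write_usr_dir`, the module name `sort_words`, and the placeholder source string are all hypothetical; in practice the module would contain the `Problem` class defined above.

```python
import os
import tempfile

def write_usr_dir(usr_dir, module_name, problem_source):
    """Writes `<module_name>.py` and an `__init__.py` that imports it."""
    os.makedirs(usr_dir, exist_ok=True)
    # The module holding the registered Problem sub-class.
    with open(os.path.join(usr_dir, module_name + ".py"), "w") as f:
        f.write(problem_source)
    # Importing the module is what triggers `@registry.register_problem`.
    with open(os.path.join(usr_dir, "__init__.py"), "w") as f:
        f.write("from . import {}\n".format(module_name))
    return sorted(os.listdir(usr_dir))

# A throwaway directory stands in for /tmp/t2t/usr here.
demo_dir = os.path.join(tempfile.mkdtemp(), "usr")
print(write_usr_dir(demo_dir, "sort_words", "# Problem definition goes here.\n"))
# ['__init__.py', 'sort_words.py']
```

The resulting directory is what would be passed as `--t2t_usr_dir`.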

Here, however, we'll generate the data from within the colab itself, which is essentially what `t2t-datagen` does.

## Generate the data.

We will now generate the data by calling `Problem.generate_data()` and inspect it.
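
Because `is_generate_per_split` is False, the generated examples are partitioned across the shards declared in `dataset_splits`. The sketch below illustrates the resulting proportions under the simplifying assumption that examples are dealt to shards in round-robin order; the helper `assign_round_robin` is hypothetical, and the real writer may differ in detail (for instance, it also shuffles the data).

```python
# Shard counts mirroring `dataset_splits` in the problem definition above.
SPLIT_SHARDS = [("train", 8), ("eval", 1), ("test", 1)]

def assign_round_robin(num_examples, split_shards):
    """Counts how many examples each split receives under round-robin dealing."""
    # One slot per shard, labelled with the split that owns it.
    shard_owner = [name for name, shards in split_shards for _ in range(shards)]
    counts = {name: 0 for name, _ in split_shards}
    for i in range(num_examples):
        counts[shard_owner[i % len(shard_owner)]] += 1
    return counts

print(assign_round_robin(10, SPLIT_SHARDS))
# {'train': 8, 'eval': 1, 'test': 1}
```

With 8, 1 and 1 shards, this gives the 80% / 10% / 10% split described in the comments on `dataset_splits`.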

In [0]:
sort_len_problem = SortWordsAccordingToLengthRandom()

sort_len_problem.generate_data(DATA_DIR, TMP_DIR)

## Viewing the generated data.

`tf.data.Dataset` is the recommended API for inputting data into a TensorFlow graph and the `Problem.dataset()` method returns a `tf.data.Dataset` object.


In [0]:
tfe = tf.contrib.eager

Modes = tf.estimator.ModeKeys

# We can iterate over our examples by making an iterator and calling next on it.
eager_iterator = tfe.Iterator(sort_len_problem.dataset(Modes.EVAL, DATA_DIR))
example = eager_iterator.next()

input_tensor = example["inputs"]
target_tensor = example["targets"]

# The tensors are actually encoded using the generated vocabulary file -- you
# can inspect the actual vocab file in DATA_DIR.
print("Tensor Input: " + str(input_tensor))
print("Tensor Target: " + str(target_tensor))

In [0]:

# We use the encoders to decode the tensors to the actual input text.
encoders = sort_len_problem.get_feature_encoders(data_dir=DATA_DIR)
input_encoder = encoders["inputs"]
target_encoder = encoders["targets"]

input_decoded = input_encoder.decode(input_tensor.numpy())
target_decoded = target_encoder.decode(target_tensor.numpy())

print("Decoded Input: " + input_decoded)
print("Decoded Target: " + target_decoded)

## To be continued ...

Stay tuned for additions to this notebook for adding problems with non-text modalities like Images, Audio and Video!