##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Collections

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/datasets/dataset_collections"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/datasets/blob/master/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/datasets/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

In [None]:
!pip install -q tfds-nightly tensorflow


In [None]:
import tensorflow as tf

import tensorflow_datasets as tfds

## Overview

Dataset collections provide a simple way to group together an arbitrary number
of existing TFDS datasets, and to perform simple operations over them.

They can be useful, for example, to group together different datasets related to the same task, or for easy [benchmarking](https://ruder.io/nlp-benchmarking/) of models over a fixed number of different tasks.

## Find available dataset collections

All dataset collection builders are subclasses of
`tfds.core.dataset_collection_builder.DatasetCollection`.

To get the list of available builders, use `tfds.list_dataset_collections()`.


In [None]:
tfds.list_dataset_collections()

['longt5', 'xtreme']

## Load and inspect a dataset collection

The easiest way of loading a dataset collection is to instantiate a `DatasetCollectionLoader` object using the [`tfds.dataset_collection`](https://www.tensorflow.org/datasets/api_docs/python/tfds/dataset_collection) command.


In [None]:
collection_loader = tfds.dataset_collection('longt5')

Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:

In [None]:
collection_loader = tfds.dataset_collection('longt5:1.0.0')

A dataset collection loader can display information about the collection:

In [None]:
collection_loader.print_info()

Dataset collection: longt5
Version: 1.0.0
Description: # Long T5 benchmark

This dataset collection comprises the evaluation benchmark used in the paper:
_LongT5: Efficient Text-To-Text Transformer for Long Sequences_

LongT5 is an extension of the T5 model that handles long sequence inputs more
efficiently. LongT5 achieves state-of-the-art performance on several
summarization benchmarks that required longer context or multi-document
understanding.



It can also display information about the datasets contained in the collection:

In [None]:
collection_loader.print_datasets()

The dataset collection longt5 (version: 1.0.0) contains the datasets:
 - natural_questions: DatasetReference(dataset_name='natural_questions', version=Version('0.1.0'), split_mapping=None, config='longt5')
 - media_sum: DatasetReference(dataset_name='media_sum', version=Version('1.0.0'), split_mapping=None, config=None)



### Loading datasets from a dataset collection

The easiest way to load one dataset from a collection is to use `DatasetCollectionLoader`'s `load_dataset` method, which loads the required dataset by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).

It will return a dictionary of split names and the corresponding `tf.data.Dataset`s:

In [None]:
collection_loader.load_dataset("natural_questions")

{'train': <PrefetchDataset element_spec={'all_answers': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'answer': TensorSpec(shape=(), dtype=tf.string, name=None), 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
 'validation': <PrefetchDataset element_spec={'all_answers': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'answer': TensorSpec(shape=(), dtype=tf.string, name=None), 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>}

`load_dataset` accepts the following optional parameters:

* `split`: which split(s) to load. It accepts a single split (`split="test"`) or a list of splits (`split=["train", "test"]`). If not specified, it will load all splits for the given dataset.
* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options.
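To make the shape of the returned value concrete, here is a minimal sketch that previews a few examples per split. It runs without downloading any data: plain Python lists stand in for the `tf.data.Dataset` objects, and `peek_splits` is a hypothetical helper, not part of TFDS.

```python
import itertools
from typing import Any, Dict, Iterable, List, Mapping


def peek_splits(
    splits: Mapping[str, Iterable[Any]], n: int = 2
) -> Dict[str, List[Any]]:
  """Returns up to `n` examples from each split of a {split_name: dataset} mapping."""
  return {
      name: list(itertools.islice(dataset, n))
      for name, dataset in splits.items()
  }


# Stand-in for the dictionary returned by `collection_loader.load_dataset(...)`.
# Assumption: plain lists replace the real `tf.data.Dataset` objects.
fake_splits = {"train": ["ex0", "ex1", "ex2"], "validation": ["ex3"]}
preview = peek_splits(fake_splits)
```

The helper only relies on each split being iterable, so the same loop should also apply to the datasets returned by `load_dataset` when running eagerly.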

### Loading multiple datasets from a dataset collection

The easiest way to load multiple datasets from a collection is to use `DatasetCollectionLoader`'s `load_datasets` method, which loads the required datasets by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).

It will return a dictionary of dataset names, each one of which is associated with a dictionary of split names and the corresponding `tf.data.Dataset`s, as in the following example:

In [None]:
collection_loader.load_datasets(['natural_questions', 'media_sum'])

{'media_sum': {'test': <PrefetchDataset element_spec={'date': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'program': TensorSpec(shape=(), dtype=tf.string, name=None), 'speaker': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'summary': TensorSpec(shape=(), dtype=tf.string, name=None), 'url': TensorSpec(shape=(), dtype=tf.string, name=None), 'utt': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
  'train': <PrefetchDataset element_spec={'date': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'program': TensorSpec(shape=(), dtype=tf.string, name=None), 'speaker': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'summary': TensorSpec(shape=(), dtype=tf.string, name=None), 'url': TensorSpec(shape=(), dtype=tf.string, name=None), 'utt': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
  'val': <PrefetchDataset element_spec={'date': TensorSpec

`load_all_datasets` will load *all* available datasets for a given collection:

In [None]:
collection_loader.load_all_datasets()

{'media_sum': {'test': <PrefetchDataset element_spec={'date': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'program': TensorSpec(shape=(), dtype=tf.string, name=None), 'speaker': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'summary': TensorSpec(shape=(), dtype=tf.string, name=None), 'url': TensorSpec(shape=(), dtype=tf.string, name=None), 'utt': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
  'train': <PrefetchDataset element_spec={'date': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'program': TensorSpec(shape=(), dtype=tf.string, name=None), 'speaker': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'summary': TensorSpec(shape=(), dtype=tf.string, name=None), 'url': TensorSpec(shape=(), dtype=tf.string, name=None), 'utt': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
  'val': <PrefetchDataset element_spec={'date': TensorSpec

`load_datasets` accepts the following optional parameters:

* `split`: which split(s) to load. It accepts a single split (`split="test"`) or a list of splits (`split=["train", "test"]`). If not specified, it will load all splits for the given datasets.
* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options.
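Because `load_datasets` and `load_all_datasets` return a nested dictionary (dataset name, then split name), it can be convenient to flatten the result before iterating over it. The sketch below shows one way to do that; `flatten_collection` is a hypothetical helper, and strings stand in for the real `tf.data.Dataset` objects so that the example runs without downloading anything.

```python
from typing import Any, Dict, Mapping, Tuple


def flatten_collection(
    loaded: Mapping[str, Mapping[str, Any]]
) -> Dict[Tuple[str, str], Any]:
  """Maps (dataset_name, split_name) pairs to the corresponding dataset."""
  return {
      (dataset_name, split_name): dataset
      for dataset_name, splits in loaded.items()
      for split_name, dataset in splits.items()
  }


# Stand-in for the result of `collection_loader.load_datasets([...])`.
# Assumption: strings replace the real `tf.data.Dataset` objects.
fake_loaded = {
    "natural_questions": {"train": "nq_train", "validation": "nq_validation"},
    "media_sum": {"train": "ms_train", "val": "ms_val", "test": "ms_test"},
}
flat = flatten_collection(fake_loaded)
```

Each key of `flat` identifies one split of one dataset, which makes it straightforward to run the same evaluation loop over every entry of a benchmark.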

### Specifying `loader_kwargs`

`loader_kwargs` are optional keyword arguments to be passed to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) function.
They can be specified in three ways:

1. When initializing the `DatasetCollectionLoader` class:


In [None]:
collection_loader = tfds.dataset_collection('longt5', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

2. Using `DatasetCollectionLoader`'s `set_loader_kwargs` method:

In [None]:
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))

3. As optional parameters to the `load_dataset`, `load_datasets` and `load_all_datasets` methods:

In [None]:
collection_loader.load_dataset('natural_questions', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

{'train': <PrefetchDataset element_spec={'all_answers': TensorSpec(shape=(None, None), dtype=tf.string, name=None), 'answer': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'context': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'id': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'question': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'title': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}

## Add a new dataset collection

All dataset collections are implemented as subclasses of `tfds.core.dataset_collection_builder.DatasetCollection`.

Here is a minimal example of a dataset collection builder, defined in the file `my_dataset_collection.py`:

In [None]:
import collections
from typing import Mapping
from tensorflow_datasets.core import dataset_collection_builder

class MyDatasetCollection(dataset_collection_builder.DatasetCollection):
  """Dataset collection builder my_dataset_collection."""

  @property
  def info(self) -> dataset_collection_builder.DatasetCollectionInfo:
    return dataset_collection_builder.DatasetCollectionInfo.from_cls(
        dataset_collection_class=self.__class__,
        description="my_dataset_collection description.",
        release_notes={
            "1.0.0": "Initial release",
        },
    )

  @property
  def datasets(
      self,
  ) -> Mapping[str, Mapping[str, dataset_collection_builder.DatasetReference]]:
    return collections.OrderedDict({
        "1.0.0":
            dataset_collection_builder.references_for({
                "dataset_1": "natural_questions/default:0.0.2",
                "dataset_2": "media_sum:1.0.0",
            }),
        "1.1.0":
            dataset_collection_builder.references_for({
                "dataset_1": "natural_questions/longt5:0.1.0",
                "dataset_2": "media_sum:1.0.0",
                "dataset_3": "squad:3.0.0"
            })
    })

Let's look in detail at the two abstract methods to override.

### `info`: dataset collection metadata

`info` returns the `dataset_collection_builder.DatasetCollectionInfo` containing the collection's metadata.

The dataset collection info contains four fields:

- `name`: the name of the dataset collection.
- `description`: a markdown-formatted description of the dataset collection. There are two ways to define a dataset collection's description:
  - As a (multi-line) string directly in the collection's `my_dataset_collection.py` file, as is already done for TFDS datasets;
  - In a `description.md` file, which must be placed in the dataset collection folder.
- `release_notes`: a mapping from the dataset collection's version to the corresponding release notes.
- `citation`: An optional (list of) `BibTeX` citation(s) for the dataset collection. There are two ways to define a dataset collection's citation:
  - As a (multi-line) string directly in the collection's `my_dataset_collection.py` file, as is already done for TFDS datasets;
  - In a `citations.bib` file, which must be placed in the dataset collection folder.

### `datasets`: define the datasets in the collection

`datasets` returns the TFDS datasets included as part of the collection.

`datasets` is defined as a dictionary of versions, which describe the evolution of the dataset collection.
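To illustrate how the version dictionary captures the evolution of a collection, the sketch below compares two versions and reports which entries were added, removed, or changed. It uses plain strings in place of `DatasetReference` objects, reusing the names from the `MyDatasetCollection` example; `diff_versions` is a hypothetical helper, not a TFDS function.

```python
from typing import Dict, List, Mapping


def diff_versions(
    old: Mapping[str, str], new: Mapping[str, str]
) -> Dict[str, List[str]]:
  """Lists the dataset keys added, removed, or changed between two versions."""
  return {
      "added": sorted(set(new) - set(old)),
      "removed": sorted(set(old) - set(new)),
      "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
  }


# Assumption: plain reference strings stand in for `DatasetReference` objects.
versions = {
    "1.0.0": {
        "dataset_1": "natural_questions/default:0.0.2",
        "dataset_2": "media_sum:1.0.0",
    },
    "1.1.0": {
        "dataset_1": "natural_questions/longt5:0.1.0",
        "dataset_2": "media_sum:1.0.0",
        "dataset_3": "squad:3.0.0",
    },
}
changes = diff_versions(versions["1.0.0"], versions["1.1.0"])
```

Here `changes` reports `dataset_3` as added and `dataset_1` as changed, while `dataset_2` is identical in both versions.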

For each version, the included TFDS datasets are stored as a dictionary from dataset names to `dataset_collection_builder.DatasetReference`. For example: 

In [None]:
  @property
  def datasets(self):
    return {
        "1.0.0": {
            "yes_no":
                dataset_collection_builder.DatasetReference(
                    dataset_name="yes_no", version="1.0.0"),
            "sst2":
                dataset_collection_builder.DatasetReference(
                    dataset_name="glue", config="sst2", version="2.0.0"),
            "assin2":
                dataset_collection_builder.DatasetReference(
                    dataset_name="assin2", version="1.0.0"),
        },
        # ...
    }

The `dataset_collection_builder.references_for` method provides a more compact way to express the same as above:

In [None]:
@property
def datasets(self):
  return {
      "1.0.0":
          dataset_collection_builder.references_for({
              "yes_no": "yes_no:1.0.0",
              "sst2": "glue/sst2:2.0.0",
              "assin2": "assin2:1.0.0",
          }),
      # ...
  }
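The compact strings appear to follow the pattern `name[/config]:version`. As a rough illustration of what each string encodes, the sketch below splits such a string into its parts. It is a simplified stand-in, not the parser TFDS uses: `ParsedReference` and `parse_reference` are hypothetical names, and only the forms shown in this guide are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedReference:
  """Hypothetical container mirroring the fields of a dataset reference."""
  dataset_name: str
  config: Optional[str]
  version: Optional[str]


def parse_reference(spec: str) -> ParsedReference:
  """Splits a `name[/config]:version` string into its components."""
  # Everything after the first ':' is treated as the version.
  name_and_config, _, version = spec.partition(":")
  # Everything after the first '/' is treated as the config.
  dataset_name, _, config = name_and_config.partition("/")
  return ParsedReference(dataset_name, config or None, version or None)


sst2 = parse_reference("glue/sst2:2.0.0")
yes_no = parse_reference("yes_no:1.0.0")
```

In this sketch, `sst2` carries the same three values passed explicitly to `DatasetReference` in the longer form (`dataset_name="glue"`, `config="sst2"`, `version="2.0.0"`), while `yes_no` has no config.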

### Send us feedback

We are continuously trying to improve the dataset creation workflow, but can
only do so if we are aware of the issues. Which issues or errors did you
encounter while creating the dataset collection? Was there a part that was confusing,
boilerplate, or not working the first time? Please share your feedback on
[GitHub](https://github.com/tensorflow/datasets/issues).