##### Copyright 2022 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Collections

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/datasets/dataset_collections"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/datasets/blob/master/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/datasets/docs/dataset_collections.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Overview

Dataset collections provide a simple way to group together an arbitrary number
of existing TFDS datasets, and to perform simple operations over them.

They can be useful, for example, to group together different datasets related to the same task, or for easy [benchmarking](https://ruder.io/nlp-benchmarking/) of models over a fixed number of different tasks.

## Setup

To get started, install a few packages:

In [None]:
# Use tfds-nightly to ensure access to the latest features.
!pip install -q tfds-nightly tensorflow
!pip install -U conllu

Import TensorFlow and the TensorFlow Datasets package into your development environment:

In [None]:
import pprint

import tensorflow as tf
import tensorflow_datasets as tfds

## Find available dataset collections

All dataset collection builders are subclasses of
`tfds.core.dataset_collection_builder.DatasetCollection`.

To get the list of available builders, use `tfds.list_dataset_collections()`.


In [None]:
tfds.list_dataset_collections()

## Load and inspect a dataset collection

The easiest way to load a dataset collection is to instantiate a `DatasetCollectionLoader` object using the [`tfds.dataset_collection`](https://www.tensorflow.org/datasets/api_docs/python/tfds/dataset_collection) function.


In [None]:
collection_loader = tfds.dataset_collection('xtreme')

Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:

In [None]:
collection_loader = tfds.dataset_collection('xtreme:1.0.0')

A dataset collection loader can display information about the collection:

In [None]:
collection_loader.print_info()

The loader can also display information about the datasets contained in the collection:

In [None]:
collection_loader.print_datasets()

### Loading datasets from a dataset collection

The easiest way to load one dataset from a collection is to use a `DatasetCollectionLoader` object's `load_dataset` method, which loads the requested dataset by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).

This call returns a dictionary of split names and the corresponding `tf.data.Dataset`s:

In [None]:
splits = collection_loader.load_dataset("ner")

pprint.pprint(splits)

`load_dataset` accepts the following optional parameters:

* `split`: which split(s) to load. It accepts a single split (`split="test"`) or a list of splits (`split=["train", "test"]`). If not specified, all splits of the given dataset are loaded.
* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options.
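
The `split` semantics described above can be illustrated without downloading any data. The following is a minimal pure-Python sketch, not the TFDS implementation: `select_splits` and the `available` mapping are hypothetical stand-ins, and plain strings take the place of `tf.data.Dataset` objects.

```python
from typing import Mapping, Optional, Sequence, Union


def select_splits(
    available: Mapping[str, object],
    split: Optional[Union[str, Sequence[str]]] = None,
) -> dict:
    """Hypothetical stand-in mirroring the documented `split` semantics."""
    if split is None:
        # No split requested: return every split of the dataset.
        names = list(available)
    elif isinstance(split, str):
        # A single split name is treated as a one-element list.
        names = [split]
    else:
        names = list(split)
    return {name: available[name] for name in names}


# Placeholder values; a real call would hold tf.data.Dataset objects.
available = {
    "train": "<train data>",
    "validation": "<validation data>",
    "test": "<test data>",
}

all_splits = select_splits(available)
one_split = select_splits(available, split="test")
two_splits = select_splits(available, split=["train", "test"])

print(sorted(all_splits))
print(sorted(one_split))
print(sorted(two_splits))
```

In every case the result is keyed by split name, matching the dictionary shape that `load_dataset` is described as returning.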

### Loading multiple datasets from a dataset collection

The easiest way to load multiple datasets from a collection is to use the `DatasetCollectionLoader` object's `load_datasets` method, which loads the requested datasets by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).

It returns a dictionary of dataset names, each one of which is associated with a dictionary of split names and the corresponding `tf.data.Dataset`s, as in the following example:

In [None]:
datasets = collection_loader.load_datasets(['xnli', 'bucc'])

pprint.pprint(datasets)
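
Because the result is a dictionary of dictionaries, it is often convenient to flatten it when applying the same operation to every split. The sketch below assumes only the nested shape described above: `iter_splits` is a hypothetical helper, and the `example` mapping uses placeholder lists and split names rather than real TFDS output.

```python
from typing import Iterator, Mapping, Tuple


def iter_splits(
    datasets: Mapping[str, Mapping[str, object]],
) -> Iterator[Tuple[str, str, object]]:
    """Yields (dataset_name, split_name, data) from a nested mapping."""
    for dataset_name, splits in datasets.items():
        for split_name, data in splits.items():
            yield dataset_name, split_name, data


# Stand-in with the same shape as the return value described above.
# The split names and values are placeholders, not real TFDS output.
example = {
    "xnli": {"train": [1, 2, 3], "test": [4]},
    "bucc": {"test": [5, 6]},
}

rows = [(name, split, len(data)) for name, split, data in iter_splits(example)]
for row in rows:
    print(row)
```

With the real return value, `data` would be a `tf.data.Dataset`, so `len(data)` would be replaced by whatever per-split operation is needed.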

The `load_all_datasets` method loads *all* available datasets for a given collection:

In [None]:
all_datasets = collection_loader.load_all_datasets()

pprint.pprint(all_datasets)

The `load_datasets` method accepts the following optional parameters:

* `split`: which split(s) to load. It accepts a single split (`split="test"`) or a list of splits (`split=["train", "test"]`). If not specified, all splits of each requested dataset are loaded.
* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options.

### Specifying `loader_kwargs`

The `loader_kwargs` are optional keyword arguments to be passed to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) function.
They can be specified in three ways:

1.  When initializing the `DatasetCollectionLoader` class:

In [None]:
collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

2.  Using `DatasetCollectionLoader`'s `set_loader_kwargs` method:

In [None]:
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))

3.  As an optional parameter of the `load_dataset`, `load_datasets` and `load_all_datasets` methods:

In [None]:
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
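
Since `loader_kwargs` is a plain dictionary, one set of defaults can be defined once and reused across all three approaches. The following is a small sketch of that pattern: `BASE_LOADER_KWARGS` and `with_overrides` are hypothetical names introduced here for illustration and are not part of TFDS.

```python
# Shared defaults, identical to the keyword arguments used above.
BASE_LOADER_KWARGS = dict(split='train', batch_size=10, try_gcs=False)


def with_overrides(base: dict, **overrides) -> dict:
    """Hypothetical helper: returns a copy of `base` with some keys replaced."""
    return {**base, **overrides}


# Derive a variant without mutating the shared defaults.
larger_batch_kwargs = with_overrides(BASE_LOADER_KWARGS, batch_size=32)

print(larger_batch_kwargs)
print(BASE_LOADER_KWARGS)
```

Either dictionary can then be passed as `loader_kwargs` in any of the three ways listed above.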

### Feedback

We are continuously trying to improve the dataset creation workflow, but can
only do so if we are aware of the issues. Which issues or errors did you
encounter while creating the dataset collection? Was any part confusing,
boilerplate, or not working the first time? Please share your feedback on
[GitHub](https://github.com/tensorflow/datasets/issues).