Commit

Merge branch 'master' of https://github.com/tensorflow/datasets into …
…todo-free-disk-size
us committed Mar 26, 2019
2 parents d3531af + c803e7f commit 554a2b3
Showing 126 changed files with 2,702 additions and 797 deletions.
9 changes: 6 additions & 3 deletions CONTRIBUTING.md
@@ -4,11 +4,14 @@ Thanks for thinking about contributing to our library !


## Before you start

* Please accept the [Contributor License Agreement](https://cla.developers.google.com) (see below)
* [Ask here](https://github.com/tensorflow/datasets/issues/142) to be added to
the list of collaborators so that issues can be assigned to you.
* Comment on the issue that you plan to work on so we can assign it to you and
there isn't unnecessary duplication of work.
there isn't unnecessary duplication of work. If this is your first time
contributing, we'll send you an invitation on GitHub to be a contributor;
you must accept this invitation
[here](https://github.com/tensorflow/datasets/settings/collaboration)
before we can assign you the issue.
* When you plan to work on something larger (for example, adding new
`FeatureConnectors`), please respond on the issue (or create one if there
isn't one) to explain your plan and give others a chance to discuss.
23 changes: 23 additions & 0 deletions docs/add_dataset.md
@@ -360,6 +360,11 @@ Note that most datasets will find the [current set of
`tfds.features.FeatureConnector`s](api_docs/python/tfds/features.md)
sufficient, but sometimes a new one may need to be defined.

Note: If you need a new `FeatureConnector` not present in the default set and
are planning to submit it to `tensorflow/datasets`, please open a
[new issue](https://github.com/tensorflow/datasets/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=)
on GitHub with your proposal.

[`tfds.features.FeatureConnector`s](api_docs/python/tfds/features/FeatureConnector.md)
in `DatasetInfo` correspond to the elements returned in the
`tf.data.Dataset` object. For instance, with:
@@ -445,14 +450,27 @@ import to its subdirectory's `__init__.py`

### 2. Run `download_and_prepare` locally.

If you're contributing the dataset to `tensorflow/datasets`, add a checksums
file for the dataset. On first download, the `DownloadManager` will
automatically add the sizes and checksums for all downloaded URLs to that file.
This ensures that on subsequent data generation, the downloaded files are
as expected.

```sh
touch tensorflow_datasets/url_checksums/my_new_dataset.txt
```

Run `download_and_prepare` locally to ensure that data generation works:

```
# default data_dir is ~/tensorflow_datasets
python -m tensorflow_datasets.scripts.download_and_prepare \
--register_checksums \
--datasets=my_new_dataset
```

Note that the `--register_checksums` flag must only be used while in development.
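
To make the role of the checksums file concrete, here is a minimal standalone sketch of recording a download's size and checksum once, then validating later downloads against it. It is not the `DownloadManager` implementation: the helper names `size_and_sha256` and `check_download`, the `<url> <size> <sha256>` line layout, and the example URL are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path):
    """Returns (size in bytes, hex SHA-256 digest) of the file at `path`."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a large download is never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return os.path.getsize(path), sha256.hexdigest()


def check_download(url, path, checksums_file, register_checksums=False):
    """Records (development) or validates (normal use) one downloaded file.

    Assumed line layout of `checksums_file`: `<url> <size> <sha256>`.
    Returns True if the file was recorded or matches its recorded entry.
    """
    size, digest = size_and_sha256(path)
    if register_checksums:
        # Development only: whatever was downloaded is trusted and recorded.
        with open(checksums_file, "a") as out:
            out.write("{} {} {}\n".format(url, size, digest))
        return True
    with open(checksums_file) as f:
        for line in f:
            if not line.strip():
                continue
            fields = line.strip().rsplit(" ", 2)
            recorded_url, recorded_size, recorded_digest = fields
            if recorded_url == url:
                return (int(recorded_size), recorded_digest) == (size, digest)
    return False  # No entry recorded for this URL.


# Usage: record once during development, then validate later downloads.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "data.bin")
checksums_path = os.path.join(workdir, "my_new_dataset.txt")
open(checksums_path, "a").close()  # Creates the empty file, like `touch`.
url = "https://example.com/data.bin"  # Placeholder URL.

with open(data_path, "wb") as f:
    f.write(b"hello")
recorded = check_download(url, data_path, checksums_path,
                          register_checksums=True)
unchanged_ok = check_download(url, data_path, checksums_path)

with open(data_path, "wb") as f:
    f.write(b"tampered")
tampered_ok = check_download(url, data_path, checksums_path)
```

In this sketch, recording mode accepts any content without question, which is one way to see why such a flag belongs in development only: left on, it would never detect a corrupted or changed download.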

Copy the contents of the `dataset_info.json` file(s) into a [GitHub gist](https://gist.github.com/) and link to it in your pull request.


@@ -483,6 +501,11 @@ Most datasets in TFDS should have a unit test and your reviewer may ask you
to add one if you haven't already. See the
[testing section](#testing-mydataset) below.

### 5. Send for review!

Send the pull request for review.


## Large datasets and distributed generation

Some datasets are so large as to require multiple machines to download and
4 changes: 2 additions & 2 deletions docs/api_docs/python/tfds/Split.md
@@ -29,8 +29,8 @@ stages of training and evaluation.
model architecture, etc.).
* `TEST`: the testing data. This is the data to report metrics on. Typically
you do not want to use this during model iteration as you may overfit to it.
* `ALL`: Special value corresponding to all existing splits of a dataset
merged together
* `ALL`: Special value, never defined by a dataset, but corresponding to all
defined splits of a dataset merged together.

Note: All splits, including compositions, inherit from <a href="../tfds/core/SplitBase.md"><code>tfds.core.SplitBase</code></a>

20 changes: 16 additions & 4 deletions docs/api_docs/python/tfds/_api_cache.json
@@ -1,5 +1,5 @@
{
"current_doc_full_name": "tfds.core.GeneratorBasedBuilder.__getattribute__",
"current_doc_full_name": "tfds.core.Version.__sizeof__",
"duplicate_of": {
"tfds.GenerateMode": "tfds.download.GenerateMode",
"tfds.GenerateMode.FORCE_REDOWNLOAD": "tfds.download.GenerateMode.FORCE_REDOWNLOAD",
@@ -50,6 +50,7 @@
"tfds.core.GeneratorBasedBuilder.__str__": "tfds.core.BuilderConfig.__str__",
"tfds.core.GeneratorBasedBuilder.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
"tfds.core.GeneratorBasedBuilder.builder_config": "tfds.core.DatasetBuilder.builder_config",
"tfds.core.GeneratorBasedBuilder.data_dir": "tfds.core.DatasetBuilder.data_dir",
"tfds.core.GeneratorBasedBuilder.info": "tfds.core.DatasetBuilder.info",
"tfds.core.NamedSplit.__delattr__": "tfds.core.BuilderConfig.__delattr__",
"tfds.core.NamedSplit.__format__": "tfds.core.BuilderConfig.__format__",
@@ -497,6 +498,7 @@
"tfds.testing.DummyDatasetSharedGenerator.__str__": "tfds.core.BuilderConfig.__str__",
"tfds.testing.DummyDatasetSharedGenerator.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
"tfds.testing.DummyDatasetSharedGenerator.builder_config": "tfds.core.DatasetBuilder.builder_config",
"tfds.testing.DummyDatasetSharedGenerator.data_dir": "tfds.core.DatasetBuilder.data_dir",
"tfds.testing.DummyDatasetSharedGenerator.info": "tfds.core.DatasetBuilder.info",
"tfds.testing.DummyMnist.BUILDER_CONFIGS": "tfds.core.DatasetBuilder.BUILDER_CONFIGS",
"tfds.testing.DummyMnist.__abstractmethods__": "tfds.core.NamedSplit.__abstractmethods__",
@@ -513,6 +515,7 @@
"tfds.testing.DummyMnist.__str__": "tfds.core.BuilderConfig.__str__",
"tfds.testing.DummyMnist.__weakref__": "tfds.core.DatasetBuilder.__weakref__",
"tfds.testing.DummyMnist.builder_config": "tfds.core.DatasetBuilder.builder_config",
"tfds.testing.DummyMnist.data_dir": "tfds.core.DatasetBuilder.data_dir",
"tfds.testing.DummyMnist.info": "tfds.core.DatasetBuilder.info",
"tfds.testing.FeatureExpectationItem.__delattr__": "tfds.core.BuilderConfig.__delattr__",
"tfds.testing.FeatureExpectationItem.__format__": "tfds.core.BuilderConfig.__format__",
@@ -641,6 +644,7 @@
"tfds.core.BuilderConfig.version": true,
"tfds.core.DatasetBuilder": false,
"tfds.core.DatasetBuilder.BUILDER_CONFIGS": true,
"tfds.core.DatasetBuilder.GOOGLE_DISABLED": true,
"tfds.core.DatasetBuilder.IN_DEVELOPMENT": true,
"tfds.core.DatasetBuilder.VERSION": true,
"tfds.core.DatasetBuilder.__abstractmethods__": true,
@@ -664,6 +668,7 @@
"tfds.core.DatasetBuilder.as_dataset": true,
"tfds.core.DatasetBuilder.builder_config": true,
"tfds.core.DatasetBuilder.builder_configs": true,
"tfds.core.DatasetBuilder.data_dir": true,
"tfds.core.DatasetBuilder.download_and_prepare": true,
"tfds.core.DatasetBuilder.info": true,
"tfds.core.DatasetBuilder.name": true,
@@ -690,13 +695,13 @@
"tfds.core.DatasetInfo.citation": true,
"tfds.core.DatasetInfo.compute_dynamic_properties": true,
"tfds.core.DatasetInfo.description": true,
"tfds.core.DatasetInfo.download_checksums": true,
"tfds.core.DatasetInfo.features": true,
"tfds.core.DatasetInfo.full_name": true,
"tfds.core.DatasetInfo.initialize_from_bucket": true,
"tfds.core.DatasetInfo.initialized": true,
"tfds.core.DatasetInfo.name": true,
"tfds.core.DatasetInfo.read_from_directory": true,
"tfds.core.DatasetInfo.redistribution_info": true,
"tfds.core.DatasetInfo.size_in_bytes": true,
"tfds.core.DatasetInfo.splits": true,
"tfds.core.DatasetInfo.supervised_keys": true,
@@ -706,6 +711,7 @@
"tfds.core.DatasetInfo.write_to_directory": true,
"tfds.core.GeneratorBasedBuilder": false,
"tfds.core.GeneratorBasedBuilder.BUILDER_CONFIGS": true,
"tfds.core.GeneratorBasedBuilder.GOOGLE_DISABLED": true,
"tfds.core.GeneratorBasedBuilder.IN_DEVELOPMENT": true,
"tfds.core.GeneratorBasedBuilder.VERSION": true,
"tfds.core.GeneratorBasedBuilder.__abstractmethods__": true,
@@ -729,6 +735,7 @@
"tfds.core.GeneratorBasedBuilder.as_dataset": true,
"tfds.core.GeneratorBasedBuilder.builder_config": true,
"tfds.core.GeneratorBasedBuilder.builder_configs": true,
"tfds.core.GeneratorBasedBuilder.data_dir": true,
"tfds.core.GeneratorBasedBuilder.download_and_prepare": true,
"tfds.core.GeneratorBasedBuilder.info": true,
"tfds.core.GeneratorBasedBuilder.name": true,
@@ -960,12 +967,12 @@
"tfds.download.DownloadManager.download": true,
"tfds.download.DownloadManager.download_and_extract": true,
"tfds.download.DownloadManager.download_kaggle_data": true,
"tfds.download.DownloadManager.download_sizes": true,
"tfds.download.DownloadManager.downloaded_size": true,
"tfds.download.DownloadManager.extract": true,
"tfds.download.DownloadManager.iter_archive": true,
"tfds.download.DownloadManager.manual_dir": true,
"tfds.download.DownloadManager.recorded_download_checksums": true,
"tfds.download.ExtractMethod": false,
"tfds.download.ExtractMethod.BZIP2": true,
"tfds.download.ExtractMethod.GZIP": true,
"tfds.download.ExtractMethod.NO_EXTRACT": true,
"tfds.download.ExtractMethod.TAR": true,
@@ -1659,6 +1666,7 @@
"tfds.testing.DatasetBuilderTestCase.BUILDER_CONFIG_NAMES_TO_TEST": true,
"tfds.testing.DatasetBuilderTestCase.DATASET_CLASS": true,
"tfds.testing.DatasetBuilderTestCase.DL_EXTRACT_RESULT": true,
"tfds.testing.DatasetBuilderTestCase.EXAMPLE_DIR": true,
"tfds.testing.DatasetBuilderTestCase.INTERNAL_DATASET": true,
"tfds.testing.DatasetBuilderTestCase.MOCK_MONARCH": true,
"tfds.testing.DatasetBuilderTestCase.MOCK_OUT_FORBIDDEN_OS_FUNCTIONS": true,
@@ -1836,6 +1844,7 @@
"tfds.testing.DatasetBuilderTestCase.test_session": true,
"tfds.testing.DummyDatasetSharedGenerator": false,
"tfds.testing.DummyDatasetSharedGenerator.BUILDER_CONFIGS": true,
"tfds.testing.DummyDatasetSharedGenerator.GOOGLE_DISABLED": true,
"tfds.testing.DummyDatasetSharedGenerator.IN_DEVELOPMENT": true,
"tfds.testing.DummyDatasetSharedGenerator.VERSION": true,
"tfds.testing.DummyDatasetSharedGenerator.__abstractmethods__": true,
@@ -1859,11 +1868,13 @@
"tfds.testing.DummyDatasetSharedGenerator.as_dataset": true,
"tfds.testing.DummyDatasetSharedGenerator.builder_config": true,
"tfds.testing.DummyDatasetSharedGenerator.builder_configs": true,
"tfds.testing.DummyDatasetSharedGenerator.data_dir": true,
"tfds.testing.DummyDatasetSharedGenerator.download_and_prepare": true,
"tfds.testing.DummyDatasetSharedGenerator.info": true,
"tfds.testing.DummyDatasetSharedGenerator.name": true,
"tfds.testing.DummyMnist": false,
"tfds.testing.DummyMnist.BUILDER_CONFIGS": true,
"tfds.testing.DummyMnist.GOOGLE_DISABLED": true,
"tfds.testing.DummyMnist.IN_DEVELOPMENT": true,
"tfds.testing.DummyMnist.VERSION": true,
"tfds.testing.DummyMnist.__abstractmethods__": true,
@@ -1887,6 +1898,7 @@
"tfds.testing.DummyMnist.as_dataset": true,
"tfds.testing.DummyMnist.builder_config": true,
"tfds.testing.DummyMnist.builder_configs": true,
"tfds.testing.DummyMnist.data_dir": true,
"tfds.testing.DummyMnist.download_and_prepare": true,
"tfds.testing.DummyMnist.info": true,
"tfds.testing.DummyMnist.name": true,
8 changes: 8 additions & 0 deletions docs/api_docs/python/tfds/core/DatasetBuilder.md
@@ -2,11 +2,13 @@
<meta itemprop="name" content="tfds.core.DatasetBuilder" />
<meta itemprop="path" content="Stable" />
<meta itemprop="property" content="builder_config"/>
<meta itemprop="property" content="data_dir"/>
<meta itemprop="property" content="info"/>
<meta itemprop="property" content="__init__"/>
<meta itemprop="property" content="as_dataset"/>
<meta itemprop="property" content="download_and_prepare"/>
<meta itemprop="property" content="BUILDER_CONFIGS"/>
<meta itemprop="property" content="GOOGLE_DISABLED"/>
<meta itemprop="property" content="IN_DEVELOPMENT"/>
<meta itemprop="property" content="VERSION"/>
<meta itemprop="property" content="builder_configs"/>
@@ -86,6 +88,10 @@ Callers must pass arguments as keyword arguments.

<a href="../../tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a> for this builder.

<h3 id="data_dir"><code>data_dir</code></h3>



<h3 id="info"><code>info</code></h3>

<a href="../../tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a> for this builder.
@@ -161,6 +167,8 @@ Downloads and prepares dataset for reading.

<h3 id="BUILDER_CONFIGS"><code>BUILDER_CONFIGS</code></h3>

<h3 id="GOOGLE_DISABLED"><code>GOOGLE_DISABLED</code></h3>

<h3 id="IN_DEVELOPMENT"><code>IN_DEVELOPMENT</code></h3>

<h3 id="VERSION"><code>VERSION</code></h3>
17 changes: 11 additions & 6 deletions docs/api_docs/python/tfds/core/DatasetInfo.md
@@ -5,11 +5,11 @@
<meta itemprop="property" content="as_proto"/>
<meta itemprop="property" content="citation"/>
<meta itemprop="property" content="description"/>
<meta itemprop="property" content="download_checksums"/>
<meta itemprop="property" content="features"/>
<meta itemprop="property" content="full_name"/>
<meta itemprop="property" content="initialized"/>
<meta itemprop="property" content="name"/>
<meta itemprop="property" content="redistribution_info"/>
<meta itemprop="property" content="size_in_bytes"/>
<meta itemprop="property" content="splits"/>
<meta itemprop="property" content="supervised_keys"/>
@@ -52,7 +52,8 @@ __init__(
features=None,
supervised_keys=None,
urls=None,
citation=None
citation=None,
redistribution_info=None
)
```

@@ -69,6 +70,10 @@ Constructs DatasetInfo.
supervised learning, if applicable for the dataset.
* <b>`urls`</b>: `list(str)`, optional, the homepage(s) for this dataset.
* <b>`citation`</b>: `str`, optional, the citation to use for this dataset.
* <b>`redistribution_info`</b>: `dict`, optional, information needed for
redistribution, as specified in `dataset_info_pb2.RedistributionInfo`.
The content of the `license` subfield will automatically be written to a
LICENSE file stored with the dataset.
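
As a rough illustration of the `license` behavior described for `redistribution_info`, the following sketch writes that subfield to a `LICENSE` file inside a dataset directory. The helper name `write_license_file`, the directory layout, and the license string are hypothetical; this is not the TFDS code path, only the behavior the parameter description implies.

```python
import os
import tempfile


def write_license_file(redistribution_info, dataset_dir):
    """Writes the `license` subfield to `<dataset_dir>/LICENSE`, if present.

    Returns the path written, or None when no license text was given.
    """
    license_text = (redistribution_info or {}).get("license")
    if not license_text:
        return None
    license_path = os.path.join(dataset_dir, "LICENSE")
    with open(license_path, "w") as f:
        f.write(license_text)
    return license_path


# Usage with a placeholder license string.
dataset_dir = tempfile.mkdtemp()
written = write_license_file({"license": "CC BY 4.0"}, dataset_dir)
skipped = write_license_file(None, dataset_dir)
```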



@@ -90,10 +95,6 @@ Constructs DatasetInfo.



<h3 id="download_checksums"><code>download_checksums</code></h3>



<h3 id="features"><code>features</code></h3>


@@ -110,6 +111,10 @@ Whether DatasetInfo has been fully initialized.



<h3 id="redistribution_info"><code>redistribution_info</code></h3>



<h3 id="size_in_bytes"><code>size_in_bytes</code></h3>


8 changes: 8 additions & 0 deletions docs/api_docs/python/tfds/core/GeneratorBasedBuilder.md
@@ -2,11 +2,13 @@
<meta itemprop="name" content="tfds.core.GeneratorBasedBuilder" />
<meta itemprop="path" content="Stable" />
<meta itemprop="property" content="builder_config"/>
<meta itemprop="property" content="data_dir"/>
<meta itemprop="property" content="info"/>
<meta itemprop="property" content="__init__"/>
<meta itemprop="property" content="as_dataset"/>
<meta itemprop="property" content="download_and_prepare"/>
<meta itemprop="property" content="BUILDER_CONFIGS"/>
<meta itemprop="property" content="GOOGLE_DISABLED"/>
<meta itemprop="property" content="IN_DEVELOPMENT"/>
<meta itemprop="property" content="VERSION"/>
<meta itemprop="property" content="builder_configs"/>
@@ -58,6 +60,10 @@ Builder constructor.

<a href="../../tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a> for this builder.

<h3 id="data_dir"><code>data_dir</code></h3>



<h3 id="info"><code>info</code></h3>

<a href="../../tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a> for this builder.
@@ -133,6 +139,8 @@ Downloads and prepares dataset for reading.

<h3 id="BUILDER_CONFIGS"><code>BUILDER_CONFIGS</code></h3>

<h3 id="GOOGLE_DISABLED"><code>GOOGLE_DISABLED</code></h3>

<h3 id="IN_DEVELOPMENT"><code>IN_DEVELOPMENT</code></h3>

<h3 id="VERSION"><code>VERSION</code></h3>
2 changes: 1 addition & 1 deletion docs/api_docs/python/tfds/core/SplitBase.md
@@ -24,7 +24,7 @@ See the
for more information.

There are three parts to the composition:
1) The splits are composed (defined, merged, splitted,...) together before
1) The splits are composed (defined, merged, split,...) together before
calling the `.as_dataset()` function. This is done with the `__add__`,
`__getitem__`, which return a tree of `SplitBase` (whose leaf
are the `NamedSplit` objects)
2 changes: 1 addition & 1 deletion docs/api_docs/python/tfds/core/lazy_imports.md
@@ -16,6 +16,6 @@ Defined in [`core/lazy_imports.py`](https://github.com/tensorflow/datasets/tree/
Lazy importer for heavy dependencies.

Some datasets require heavy dependencies for data generation. To allow for
the default installation to remain lean, those heavy depdencies are
the default installation to remain lean, those heavy dependencies are
lazily imported here.
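
The general lazy-import pattern can be sketched as follows. This is an illustration using only the standard library, not the contents of `core/lazy_imports.py`: the class name `LazyImporter` is hypothetical, and `json` merely stands in for a heavy optional dependency.

```python
import importlib


class LazyImporter(object):
    """Imports a module on first attribute access instead of at load time."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def _load(self):
        # Import once, then reuse the cached module object.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name):
        # Only reached for names not set on the instance, i.e. module members.
        return getattr(self._load(), name)


# `json` stands in for a heavy dependency; nothing is imported on this line.
lazy_json = LazyImporter("json")
loaded_before_use = lazy_json._module is not None
encoded = lazy_json.dumps([1, 2])  # First use triggers the real import.
loaded_after_use = lazy_json._module is not None
```

With this shape, users who never touch the code path that needs the dependency never pay for importing it, and a missing package only raises `ImportError` at the point of first use.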

5 changes: 4 additions & 1 deletion docs/api_docs/python/tfds/download/DownloadConfig.md
@@ -24,7 +24,8 @@ __init__(
manual_dir=None,
download_mode=None,
compute_stats=None,
max_examples_per_split=None
max_examples_per_split=None,
register_checksums=False
)
```

@@ -44,6 +45,8 @@ Constructs a `DownloadConfig`.
statistics over the generated data. Defaults to `AUTO`.
* <b>`max_examples_per_split`</b>: `int`, optional max number of examples to write
into each split.
* <b>`register_checksums`</b>: `bool`, defaults to False. If True, checksums of
downloaded files are recorded.


