
Added cityscapes dataset #225

Merged
tfds-copybara merged 15 commits into tensorflow:master from f1recracker:master
Jan 3, 2020

Conversation

@f1recracker
Contributor

Added Cityscapes dataset with segmentation labels. Fixes #214.

@googlebot googlebot added the cla: yes Author has signed CLA label Mar 11, 2019
@f1recracker
Contributor Author

f1recracker commented Mar 11, 2019

Hello! This is my first contribution to tensorflow-datasets! I actually had a few quick questions / design items to discuss before merging this.

  1. How are the unit tests run? (Is it via oss_scripts/oss_tests?) I haven't run any of the cityscapes tests yet, but I did check the build using the actual dataset in my local repository.

  2. What is the recommended size of fake datasets used for testing? I have tried to ensure each split has a different size, but I still end up with a 38 MB fake dataset (cityscapes has large images, 1024 × 2048).

  3. The current PR has only semantic segmentation labels, not instance-level segmentation labels. I plan on fixing this before the merge, but it requires parsing JSON files. The official cityscapes repository already has a parser and is also packaged as an installer. I came up with three approaches to solve this:

    • Re-implement the parser in this repository
    • Clone the repository with the dataset, and add it to the include path (seems kinda hacky)
    • Require a user installation of cityscapesScripts if the cityscapes dataset is loaded

Contributor

@us us left a comment


Welcome :)
Maybe if you reuse the same fake_data folder, it will help decrease the megabytes (assuming the images can be the same).

@f1recracker
Contributor Author

f1recracker commented Mar 11, 2019

I've used pngquant to compress the fake images!

@us
Contributor

us commented Mar 12, 2019

@f1recracker some datasets' tests generate their own images and labels, like testing/cifar.py. Maybe that can reduce the size even more.

@cyfra
Contributor

cyfra commented Mar 12, 2019

@f1recracker - for running the tests, you can also invoke them from command line directly (assuming you're in the top directory):

python3 -m tensorflow_datasets.image.celeba_test

@cyfra cyfra self-assigned this Mar 12, 2019
@f1recracker
Contributor Author

f1recracker commented Mar 12, 2019

I've made most of the changes, but I have observed that dl_manager.extract does not work as intended in unit tests (it actually returns DL_EXTRACT_RESULT or the example_path, from what I can see), whereas dl_manager.iter_archive works. Is mocking the extract method intentional?

If mocking extract is intentional, then I have a few workarounds to solve this issue:

  • Rewrite my code to use dl_manager.iter_archive. The problem is that I need to zip and synchronize the generators returned from two archives (images and labels), which would require loading all file objects into memory at once and then matching across both archives using ids.
  • Put the extracted data in fake_data directly and leave an empty zip file. This seems much cleaner but would create a few additional files in fake_data instead of a single archive.

@cyfra
Contributor

cyfra commented Mar 13, 2019

@f1recracker - yes, you're right, this is an inconsistency in our testing framework that we introduced when adding DL_EXTRACT_RESULT.

@pierrot0 - Pierre, can you comment on the extract issue and what is the recommended way @f1recracker should follow?

@f1recracker
Contributor Author

Have there been any updates on this? Sorry, I got a little caught up with some work during the past few weeks!

@Conchylicultor Conchylicultor added the dataset request Request for a new dataset to be added label Apr 25, 2019
@f1recracker
Contributor Author

It seems that the latest version fixed the inconsistency and also another error. I've also refactored my code to now use BuilderConfigs instead of having separate classes for Cityscapes - fine and Cityscapes - coarse.

@lgeiger
Collaborator

lgeiger commented Aug 2, 2019

@f1recracker Thanks for the PR! Would it be also possible to add the depth data of the Cityscapes dataset as an additional feature?

@f1recracker
Contributor Author

I wasn't aware of the depth maps - let me have a look at this.

@lgeiger
Collaborator

lgeiger commented Aug 2, 2019

Thanks for taking a look!

@lgeiger
Collaborator

lgeiger commented Aug 6, 2019

@f1recracker The stereo pair right images would also be a great addition to this PR.

@f1recracker
Contributor Author

f1recracker commented Aug 12, 2019

@lgeiger Added support for right images and disparity maps too. Cityscapes now has 4 configurations:

  • 'semantic_segmentation' - left + segmentations
  • 'semantic_segmentation_extra' - left + segmentations + coarse labels + train extra split
  • 'stereo_disparity' - left + right + disparity maps
  • 'stereo_disparity_extra' - left + right + disparity maps + train extra split

Let me know your thoughts on this configuration.

Could this be reviewed again? There are significant changes since the last review (BuilderConfigs, so no ABCs, and the recent change with disparity maps). Thanks!
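
The two axes behind these four configurations (task × extra split) can be sketched with a plain dataclass. The field names here are guesses for illustration, not the actual CityscapesConfig signature from the PR:

```python
from dataclasses import dataclass


@dataclass
class CityscapesConfigSketch:
    """Illustrative stand-in for the PR's CityscapesConfig."""
    name: str
    right_images: bool = False     # stereo pair right images
    disparity_maps: bool = False   # needs extra permissions from the authors
    extra_split: bool = False      # 'train_extra' split with coarse labels


BUILDER_CONFIGS_SKETCH = [
    CityscapesConfigSketch("semantic_segmentation"),
    CityscapesConfigSketch("semantic_segmentation_extra",
                           extra_split=True),
    CityscapesConfigSketch("stereo_disparity",
                           right_images=True, disparity_maps=True),
    CityscapesConfigSketch("stereo_disparity_extra",
                           right_images=True, disparity_maps=True,
                           extra_split=True),
]
```

Each config then only downloads and exposes the modalities it actually needs, which matters because several of them require extra permissions or very large downloads.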

@netw0rkf10w

@f1recracker Thanks for working on this.
@cyfra What would need to be changed for this to be merged please?

@f1recracker f1recracker requested a review from cyfra October 22, 2019 04:49
@cyfra cyfra added the kokoro:run Run Kokoro tests label Oct 24, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Oct 24, 2019
@cyfra
Contributor

cyfra commented Oct 24, 2019

Hey @netw0rkf10w
Unfortunately we have grown quite a backlog of PRs - sorry for that.
We were quite busy with making sure that current ones will work with Python3.

As I see that there is high interest in this dataset, I'll try to merge it soon.

@netw0rkf10w

@cyfra Thank you for your reply, and for your hard work! This is indeed a very popular dataset in computer vision, so the merge would benefit a lot of users. Looking forward to having this merged.


BUILDER_CONFIGS = [
CityscapesConfig(
name='semantic_segmentation',
Member


Instead of using configs, couldn't you merge all configs into a single one, and the user could afterwards select only the features they want? Or reduce the number of configs to only two?

Currently the code is quite difficult to read, with a lot of if/else conditions which make it difficult to understand.

Contributor Author


It was a deliberate choice. Disparity maps need additional permissions from the authors and aren't readily available, and the 'extra' splits, whether for segmentation or for disparity, require additional files (50 GB). I chose this route so that each configuration would work for all cases.

Contributor Author

@f1recracker f1recracker Nov 18, 2019


I gave this some more thought today to see if I could refactor the code to use fewer if/else statements. The problem arises because Cityscapes has many features that not every user will use.

  1. There are two 'datasets' for different tasks: semantic segmentation and disparity inference. Disparity tasks can also require the right images, another additional download.
  2. Every cityscapes 'dataset' (semantic seg / stereo depth) has an additional 'extra' split. This split is much larger, and any labels provided are coarse-grained instead.

I think we still need 4 (2 (task) x 2 (extra split)) configurations.

One change that might reduce some if/else would be to use a constant feature dict with a 'default' image tensor (since cityscapes labels are images), if this is acceptable.
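
That idea, a constant feature dict where absent modalities fall back to a default image, might look roughly like this. All key names and the placeholder value are illustrative, not the PR's actual feature spec:

```python
# Illustrative placeholder; in a real builder this would be a default
# (blank) image tensor, since Cityscapes labels are themselves images.
_DEFAULT_IMAGE = "blank_image"


def make_example(image_left, image_right=None,
                 disparity_map=None, segmentation_label=None):
    """Build an example dict with a constant key set.

    Instead of branching on the config to decide which keys exist, every
    example carries the same keys, and absent modalities fall back to the
    default image.
    """
    return {
        "image_left": image_left,
        "image_right": (image_right if image_right is not None
                        else _DEFAULT_IMAGE),
        "disparity_map": (disparity_map if disparity_map is not None
                          else _DEFAULT_IMAGE),
        "segmentation_label": (segmentation_label
                               if segmentation_label is not None
                               else _DEFAULT_IMAGE),
    }
```

The trade-off is that every config pays the storage cost of the placeholder tensors, in exchange for a single feature schema and fewer branches.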


Are there any tutorials on how to use this class please?

Contributor Author


You just have to do something along the lines of:

import tensorflow_datasets as tfds
dataset = tfds.load(name="cityscapes/semantic_segmentation_extra", split="train")
# Can also specify config from above.

I'd give a working Colab but unfortunately this is a pretty big dataset and requires manual download :)



@f1recracker thank you for the reply!
I downloaded the dataset manually and will attempt to access it with the syntax you suggested.
Thanks again.

@netw0rkf10w

Thanks a lot, @f1recracker, for continuing to work on this! Unfortunately I am not yet familiar enough with the code to be able to help you.

tfds-copybara pushed a commit that referenced this pull request Jan 3, 2020
PiperOrigin-RevId: 287961867
@tfds-copybara tfds-copybara merged commit b25c53b into tensorflow:master Jan 3, 2020
@cyfra
Contributor

cyfra commented Jan 3, 2020

Ok - dataset is submitted.

Thanks a lot @f1recracker for your contribution and patience.
