Dataset Zoo and MVP SDK interface #19

Merged
merged 35 commits into develop from interface on May 1, 2020

Conversation


@brimoor brimoor commented Apr 30, 2020

This PR is pretty big, but it accomplishes the following two goals:

  • establishes a FiftyOne Dataset Zoo
  • implements a functioning prototype interface for users to engage with their datasets

Run the example code in examples/interface.py and let me know what you think.

FiftyOne Dataset Zoo

A Dataset Zoo is established in FiftyOne that enables users to download any of a collection of datasets using the syntax:

import fiftyone.core.data as fod

fod.load_zoo_dataset("cifar10")

Behind the scenes, it downloads datasets using either the TF Zoo or the PyTorch Zoo, depending on which package is installed on the user's machine. All datasets are stored on disk in eta.core.datasets.LabeledImageDataset format. A sketch of the backend-selection idea follows.
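A minimal sketch of how that backend selection could look, consistent with the "don't import tensorflow or torch unless needed" principle below. The function name and the specific packages probed (tensorflow_datasets, torchvision) are illustrative assumptions, not the PR's actual internals:

import importlib.util

def _choose_zoo_backend():
    # Detect an installed zoo package without importing tensorflow or torch,
    # so that `import fiftyone` itself stays lightweight
    if importlib.util.find_spec("tensorflow_datasets") is not None:
        return "tensorflow"
    if importlib.util.find_spec("torchvision") is not None:
        return "torch"
    raise ImportError(
        "Loading a zoo dataset requires tensorflow_datasets or torchvision"
    )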

Prototype Dataset interface

My goal here was to have a working example of performing the following tasks:

  • loading a dataset with ground truth annotations into FiftyOne
  • allowing the user to iterate over (image, ImageLabels) pairs in the dataset, where the labels are either the ground truth labels or a set of predicted labels
  • allowing the user to select a specific frame attribute and iterate over the (img, label) pairs in the dataset, as would be fed to a classifier during training
  • allowing the user to register a new model against the dataset
  • allowing the user to iterate over (image, sample ID) pairs so they can add new predictions to the dataset

To accomplish this, I implemented prototype versions of the following classes:

  • fiftyone.experimental.data.Dataset: the underlying class that stores all information about the dataset, including paths to the raw images on disk, ground truth annotations, one or more sets of predicted labels, and additional metadata accompanying all of these items, such as sample hardness, annotation correctness, etc.

  • fiftyone.experimental.contexts.DatasetContext: classes that represent specific contexts into the dataset. Examples include:
    - ImageContext: pulls only the images from the dataset. Iterating over this yields imgs
    - LabeledImageContext: pulls images and a particular set of ImageLabels (e.g., ground truth) from the dataset. Iterating over contexts of this kind yields (img, ImageLabels) pairs
    - ImageClassificationContext: pulls images and a particular frame attribute from a LabeledImageContext from the dataset. Iterating over contexts of this kind yields (img, label) pairs
    - ModelContext: pulls a specific model from the dataset, so that additional predictions can be added to samples. Iterating over contexts of this kind yields (img, sample ID) pairs

  • fiftyone.experimental.views.DatasetView: classes that allow read-only operations to be performed on a DatasetContext. These allow things like sorting by X, shuffling, removing specific samples (from the view, not the underlying dataset), iterating over the samples in the view, and exporting the current state of the view as a dataset on disk. (A usage sketch follows this list.)
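To make the intended iteration patterns concrete, here is a hedged usage sketch; the accessor names (get_labeled_image_context, etc.) are illustrative assumptions rather than the exact API, which examples/interface.py demonstrates:

import fiftyone.core.data as fod

dataset = fod.load_zoo_dataset("cifar10")  # any Dataset instance works here

# LabeledImageContext: yields (img, ImageLabels) pairs
for img, image_labels in dataset.get_labeled_image_context("ground_truth"):
    ...

# ImageClassificationContext: yields (img, label) pairs for a frame attribute
for img, label in dataset.get_image_classification_context("label"):
    ...

# ModelContext: yields (img, sample ID) pairs so predictions can be added
for img, sample_id in dataset.get_model_context("my-model"):
    ...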

To achieve a functional prototype of the above, my fiftyone.experimental.data.Dataset class has an actual implementation. This implementation, of course, will be thrown out once its public interface is grafted onto the MongoDB backend.

Again, see examples/interface.py to see this functionality in action.

I'm not sure that there needs to be a difference between DatasetContexts and DatasetViews. But there is an interesting distinction to be made: one may want the "read-only" playground that DatasetViews offer, where any slicing and dicing of the dataset is not permanent.

Key Design Principles

  • all images are stored on disk
  • image paths and labels are stored in-memory (anticipating MongoDB being grafted onto this work in the backend)
  • don't import tensorflow or torch unless the user specifically requests functionality that requires them

@brimoor brimoor requested a review from a team April 30, 2020 09:17
@brimoor brimoor self-assigned this Apr 30, 2020
@brimoor brimoor added the feature Work on a feature request label Apr 30, 2020
@jasoncorso jasoncorso left a comment

Thanks for pulling this work together. I cannot review it all right now, but I have two high-level comments, both of which come from an "ease of adoption" standpoint; as a scientist, if you will:

  • (Overall sense of complexity of the tool) Thinking about numpy, torch, tensorflow, and scipy: three of these libraries, which are significant, are essentially imported with a single import thelibrary as foo. The design choices in this code instead take an approach similar to the fourth, and require several lines of imports. While we should make an effort to learn what users want, I think this conveys a sense of heaviness and complexity, when I'd rather directly use import fiftyone as fo or something like it, and perhaps import fiftyone.zoo as foz as a separate import that I choose to add. In sum, the overall structure conveys complexity and violates lightweightness.

  • (Science users) As a scientist using the tool, I am likely to want to keep the data under my control as long as I can, and simply take the necessary steps of preprocessing the data for things like equalization, sizing, etc. The interface provided here gives me pause: it suggests I should adopt fiftyone for all of my dataset work, which is not necessarily the case. Let's say I have some code like this:

(train, val) = load_dataset_from_wherever("dataset_name")
# train, val are lists of (numpy.ndarray image, target label) pairs
prep = Preprocessor(initialize_multiple_steps...)
prep.in_place_process(train)
prep.in_place_process(val)

# now I am ready to begin training
# In PyTorch, for example, I'd create a dataloader for batches, etc.
model = train_model(train, val)

# I have a model; let's get some information from fiftyone on the data
# (not a concrete suggestion for the interface)
fdata = fiftyone.dataset(train, "a name")  # assuming I will have a way to map whatever is in fdata back to list-indices in train
fdata.add_model(model, "an identifier")
with fdata.get_context() as context:
    for i, sample in enumerate(train):
        prediction_dict = model.run(sample)  # noting I'd probably even want to do this on "batches" of samples at once, via my own dataloader...
        context.add_predictions(i, prediction_dict)

    hardness_ranking = context.get_hardness_rank()

This puts "scientist me" first and fiftyone second. (I have real examples that do essentially the above, without fiftyone at this point.)
I guess my comments center around what I, as the data scientist, need to do in order to adopt fiftyone; as we agree this should be as light as possible. To have fiftyone support dataset loading, etc, seems fine, but my guess is that it would then also need to support things like preprocessing across the underlying backends; but then it starts to look a lot like keras. Do we want that? Working on the user's terms is probably the right design goal, etc.

In summary, these are comments from a user perspective relating to sense of complexity and value from using fiftyone.

@jasoncorso

(In reading this, I have presumed the two Dataset class implementations will be merged. If this is not the case, then it'll generate more discussion.)

@brimoor

brimoor commented Apr 30, 2020

I moved load_zoo_dataset() to the fiftyone.zoo namespace, and I imported the other main dataset ingestion methods into the fiftyone namespace. So, users will only import fiftyone or import fiftyone.zoo unless they are doing something very detailed.

examples/interface.py is updated to demonstrate this.
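Under the new layout, the zoo example from the PR description presumably becomes (using the aliases suggested above):

# the only imports a typical user should need
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10")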

@brimoor

brimoor commented Apr 30, 2020

(In reading this, I have presumed the two Dataset class implementations will be merged. If this is not the case, then it'll generate more discussion.)

Yes they will need to be merged. My thinking is the public interface of this one will be grafted onto the backend implementation of the other one.

@jasoncorso

Keeping the discussion here:

  • @brimoor, relating to your PR, are you envisioning that the calculation of some scalar measure (e.g., hardness) would be performed on a ModelContext?
  • One user aspect is the need to be able to refer to contexts by id/name/descriptor. It seems to me that I may exit a context but later want to be able to refer to it in some way. Is the vision that this is possible?

@brimoor

brimoor commented Apr 30, 2020

(fiftyone) tylerganter@tgmbp:~/source/fiftyone/examples$ python interface.py 
Available datasets: ['mnist', 'cifar10', 'imagenet', 'imagenet-2012', 'coco', 'coco-2017']
Using default split 'test'
Using default backing directory '/Users/tylerganter/.fiftyone/cifar10'
Using default dataset directory '/Users/tylerganter/fiftyone/cifar10/test'
Uncaught exception
Traceback (most recent call last):
  File "interface.py", line 67, in <module>
    dataset1 = foz.load_zoo_dataset("cifar10")
  File "/Users/tylerganter/source/fiftyone/fiftyone/zoo/base.py", line 108, in load_zoo_dataset
    return fod.Dataset.from_ground_truth_labeled_dataset(labeled_dataset)
TypeError: from_ground_truth_labeled_dataset() missing 2 required positional arguments: 'backing_dir' and 'labeled_dataset'

fixed

@brimoor

brimoor commented Apr 30, 2020

That implementation does not really support the natural (lightweight) workflow in:

# I have a model; let's get some information from fiftyone on the data
# (not a concrete suggestion for the interface)
fdata = fiftyone.dataset(train, "a name")  # assuming I will have a way to map whatever is in fdata back to list-indices in train
fdata.add_model(model, "an identifier")
with fdata.get_context() as context:
    for i, sample in enumerate(train):
        prediction_dict = model.run(sample)
        context.add_predictions(i, prediction_dict)
    hardness_ranking = context.get_hardness_rank()

We could introduce a class ImageClassifierContext(ModelContext, ImageClassificationContext); then hardness would be available directly in that context. A minimal sketch of the idea follows.
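The base classes here are stand-ins for the real ones in fiftyone.experimental.contexts, and get_hardness_rank is an assumed method name:

class ModelContext:
    """Stand-in: yields (img, sample ID) pairs so predictions can be added."""

class ImageClassificationContext:
    """Stand-in: yields (img, label) pairs for a frame attribute."""

class ImageClassifierContext(ModelContext, ImageClassificationContext):
    """Combined context in which predictions can be added and classification
    labels iterated, so a scalar measure like hardness can be computed
    directly."""

    def get_hardness_rank(self):
        # hypothetical method; the PR does not pin down its signature
        raise NotImplementedError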



def load_image_classification_dataset(
dataset, labels_map=None, sample_parser=None, name=None, backing_dir=None,

maybe rename dataset to iterable or sample_iterator. I think this was one of the things that sounded heavy-duty to Jason, and it definitely threw me off when I first read it.

@jasoncorso

Did our conversation yesterday lead to any decision on the use of the word load_ in many of the load function calls? This is mainly a semantic user consideration. And, load_ may be fine.

@brimoor

brimoor commented May 1, 2020

There are a variety of considerations:

  • We want to convey to the user when a given operation is lightweight or heavyweight. I define "lightweight" to mean the raw images/videos are not read from disk or copied, and "heavyweight" is when they are

  • At the same time, we need to convey to the user whether their work is expected to be persistent or not. We haven't fully fleshed this out yet, but it seems to me that we do need to persist operations that the user performs on their datasets, since a core value prop is tracking models over time on a dataset, and requesting "heavy" backend computation to add information to one's samples

Looking to other libraries for examples, the syntax for creating "lightweight" datasets is:

#
# TensorFlow
#

import tensorflow as tf

# Dataset creation syntax. This is "lightweight"
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
    parse_sample,  # a function that loads an image and parses the label, if necessary
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Example use
for img, label in dataset.as_numpy_iterator():
    pass

#
# PyTorch
#

import torch

class MyDataset(torch.utils.data.Dataset):

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return load_image(self.image_paths[idx]), self.labels[idx]

# Dataset creation syntax. This is "lightweight"
dataset = MyDataset(image_paths, labels)

# Example use
for img, label in torch.utils.data.DataLoader(dataset):
    ...

So, we could adopt a syntax like:

dataset = fiftyone.Dataset.from_image_classification_samples(samples)

But, our datasets are in fact persistent. When you "load/ingest" a dataset, its labels will be completely ingested and saved to a persistent database. Another consideration is that, if we're not creating copies of the user's data, then they need to know that they can't move the samples and expect to still use their dataset in the future.

As discussed yesterday, we may want to optionally provide the ability to cede control of one's raw samples to us as well, to be stored internally along with the labels and other metadata, in which case we're switching over to "heavyweight" mode.

@tylerganter

tylerganter commented May 1, 2020

@brimoor in response to your latest comment, a few thoughts:

I think the best start for us, and for the first interaction the user may have with the system, is to drop the database when the session is finished. It's simpler to manage on our end and to understand on the user's end, and it follows the "lightweight" principle.

Keeping track of backend operations could be as simple as tracking MongoDB logs. If not, I would recommend we connect to a service like Pachyderm that already supports this.

There is some fancy stuff we could do to make samples lightweight but persistent. We could use hardlinks to the data (sketched after this list), which has a lot of benefits:

  1. lightweight ingestion
  2. users can move their data wherever they want
  3. if they "delete" the data, it isn't deleted from disk until they clear the dataset from fiftyone as well. DVC uses reflinks by default, and we could potentially hook up with them as a backend: file-link-types-for-the-dvc-cache
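A quick standard-library sketch of the hardlink idea (illustrative only; DVC's reflink support is a separate mechanism, and the function name here is hypothetical):

import os

def ingest_image(src_path, backing_dir):
    # Hardlink into our backing directory instead of copying bytes. Both
    # paths then reference the same inode: ingestion is cheap, and the data
    # survives on disk until *both* links are removed. Note that os.link
    # requires src and dst to live on the same filesystem.
    os.makedirs(backing_dir, exist_ok=True)
    dst_path = os.path.join(backing_dir, os.path.basename(src_path))
    os.link(src_path, dst_path)
    return dst_path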

@brimoor

brimoor commented May 1, 2020

I agree that FiftyOne not persisting anything to disk is simpler and easier to understand, I just wonder if it will be essential in order to track changes to models over time.

If FiftyOne persists nothing, then we'll have to return everything we compute to the user. If they request sample hardness, we compute it and return it to them. We also store it internally in the current session so that they can visualize it, say. But, if the user wants to visualize the same information at a later date, for example, they'll be responsible for providing that metadata back to FiftyOne. Maybe that's the way it should be, it's just not clear to me yet.

In any case, I am fine with starting under the assumption that nothing is persisted.

@brimoor

brimoor commented May 1, 2020

I made a few changes to this PR:

  • renamed the dataset "loading" methods in fiftyone.core.data to from_*_samples and from_*_images to emphasize that the user is creating a handle to something that they have ultimate control over

  • moved my implementations of Dataset, DatasetContext and DatasetView to a fiftyone.experimental package. The core dataset loading methods remain in fiftyone.core.data, although they return fiftyone.experimental.data.Dataset instances for the time being

I couldn't see any clear way to land this PR without including my experimental definitions of Dataset functionality, since otherwise one can't do anything with the zoo/dataset loading code!

@tylerganter

I agree that FiftyOne not persisting anything to disk is simpler and easier to understand, I just wonder if it will be essential in order to track changes to models over time.

If FiftyOne persists nothing, then we'll have to return everything we compute to the user. If they request sample hardness, we compute it and return it to them. We also store it internally in the current session so that they can visualize it, say. But, if the user wants to visualize the same information at a later date, for example, they'll be responsible for providing that metadata back to FiftyOne. Maybe that's the way it should be, it's just not clear to me yet.

In any case, I am fine with starting under the assumption that nothing is persisted.

Yeah there are obvious limitations if nothing is persisted. You can compute hardness and get it back. Or you can quickly browse your samples. That's about it. But it's a simple start that gets'em comin' back!

@jasoncorso

In planning discussions, we agreed nothing persists for MVP. In the agile mindset, we should develop strictly for that right now. Get immediate value. Replan. Redevelop and move on.

@brimoor brimoor left a comment

@tylerganter some thoughts for you

sample: the sample

Returns:
an ``eta.core.image.ImageLabels`` instance

@tylerganter a way to merge this work with your work would be to change this to output a fiftyone.label.ImageLabels instead

sample: the sample

Returns:
an ``eta.core.image.ImageLabels`` instance

@tylerganter and then, critically, this would just emit a fiftyone.label.ClassificationLabel

sample: the sample

Returns:
an :class:`eta.core.image.ImageLabels` instance

@tylerganter and this would be a fiftyone.label.DetectionLabel

@tylerganter tylerganter left a comment

I'm approving, noting that we are at a very early stage; Brian and I just had a productive conversation about what we're going to change.

@brimoor brimoor merged commit 1a18d2a into develop May 1, 2020
@brimoor brimoor deleted the interface branch May 1, 2020 21:22
benjaminpkane pushed a commit that referenced this pull request Jan 20, 2022
kaixi-wang pushed a commit that referenced this pull request May 11, 2023
Adding support for reading/writing datasets from cloud storage