Dataset Zoo and MVP SDK interface #19

Merged
merged 35 commits into develop from interface on May 1, 2020

Conversation


@brimoor brimoor commented Apr 30, 2020

This PR is pretty big, but it accomplishes the following two goals:

  • establishes a FiftyOne Dataset Zoo
  • implements a functioning prototype interface for users to engage with their datasets

Run the example code in examples/interface.py and let me know what you think.

FiftyOne Dataset Zoo

A Dataset Zoo is established in FiftyOne that enables users to download any of a collection of datasets using the syntax:

import fiftyone.core.data as fod

fod.load_zoo_dataset("cifar10")

Behind the scenes, it downloads datasets using either the TF Zoo or the PyTorch Zoo, depending on which package is installed on the user's machine. All datasets are stored on disk in eta.core.datasets.LabeledImageDataset format. A sketch of the backend-selection idea follows.
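A minimal sketch of how that backend selection could look, consistent with the "don't import tensorflow or torch unless needed" principle below. The function name and the specific packages probed (tensorflow_datasets, torchvision) are illustrative assumptions, not the PR's actual internals:

import importlib.util

def _choose_zoo_backend():
    # Detect an installed zoo package without importing tensorflow or torch,
    # so that `import fiftyone` itself stays lightweight
    if importlib.util.find_spec("tensorflow_datasets") is not None:
        return "tensorflow"
    if importlib.util.find_spec("torchvision") is not None:
        return "torch"
    raise ImportError(
        "Loading a zoo dataset requires tensorflow_datasets or torchvision"
    )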

Prototype Dataset interface

My goal here was to have a working example of performing the following tasks:

  • loading a dataset with ground truth annotations into FiftyOne
  • allowing the user to iterate over (image, ImageLabels) pairs in the dataset, where the labels are either the ground truth labels or a set of predicted labels
  • allowing the user to select a specific frame attribute and iterate over the (img, label) pairs in the dataset, as would be fed to a classifier during training
  • allowing the user to register a new model against the dataset
  • allowing the user to iterate over (image, sample ID) pairs so they can add new predictions to the dataset

To accomplish this, I implemented prototype versions of the following classes:

  • fiftyone.experimental.data.Dataset: the underlying class that stores all information about the dataset, including paths to the raw images on disk, ground truth annotations, one or more sets of predicted labels, and additional metadata accompanying all of these items, such as sample hardness, annotation correctness, etc.

  • fiftyone.experimental.contexts.DatasetContext: classes that represent specific contexts into the dataset. Examples include:
    - ImageContext: pulls only the images from the dataset. Iterating over this yields imgs
    - LabeledImageContext: pulls images and a particular set of ImageLabels (e.g., ground truth) from the dataset. Iterating over contexts of this kind yields (img, ImageLabels) pairs
    - ImageClassificationContext: pulls images and a particular frame attribute from a LabeledImageContext from the dataset. Iterating over contexts of this kind yields (img, label) pairs
    - ModelContext: pulls a specific model from the dataset, so that additional predictions can be added to samples. Iterating over contexts of this kind yields (img, sample ID) pairs

  • fiftyone.experimental.views.DatasetView: classes that allow read-only operations to be performed on a DatasetContext. These allow things like sorting by X, shuffling, removing specific samples (from the view, not the underlying dataset), iterating over the samples in the view, and exporting the current state of the view as a dataset on disk. (A usage sketch follows this list.)
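To make the intended iteration patterns concrete, here is a hedged usage sketch; the accessor names (get_labeled_image_context, etc.) are illustrative assumptions rather than the exact API, which examples/interface.py demonstrates:

import fiftyone.core.data as fod

dataset = fod.load_zoo_dataset("cifar10")  # any Dataset instance works here

# LabeledImageContext: yields (img, ImageLabels) pairs
for img, image_labels in dataset.get_labeled_image_context("ground_truth"):
    ...

# ImageClassificationContext: yields (img, label) pairs for a frame attribute
for img, label in dataset.get_image_classification_context("label"):
    ...

# ModelContext: yields (img, sample ID) pairs so predictions can be added
for img, sample_id in dataset.get_model_context("my-model"):
    ...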

To achieve a functional prototype of the above, my fiftyone.experimental.data.Dataset class has an actual implementation. This implementation, of course, will be thrown out once its public interface is grafted onto the MongoDB backend.

Again, see examples/interface.py to see this functionality in action.

I'm not sure that there needs to be a difference between DatasetContexts and DatasetViews. But there is an interesting distinction to be made: one may want the "read-only" playground that DatasetViews offer, where any slicing and dicing of the dataset is not permanent.

Key Design Principles

  • all images are stored on disk
  • image paths and labels are stored in-memory (anticipating MongoDB being grafted onto this work in the backend)
  • don't import tensorflow or torch unless the user specifically requests functionality that requires them

@brimoor brimoor requested a review from a team April 30, 2020 09:17
@brimoor brimoor self-assigned this Apr 30, 2020
@brimoor brimoor added the feature Work on a feature request label Apr 30, 2020
@jasoncorso jasoncorso left a comment

Thanks for pulling this work together. I cannot review it all right now, but I have two high-level comments, both of which come from an "ease of adoption" standpoint; as a scientist, if you will:

  • (Overall sense of complexity of the tool) Thinking about numpy, torch, tensorflow, and scipy: three of these libraries, which are significant, are essentially imported with a single import thelibrary as foo. The design choices in this code instead take an approach similar to the fourth, and require several lines of imports. While we should make an effort to learn what users want, I think this conveys a sense of heaviness and complexity, when I'd rather directly use import fiftyone as fo or something like it, and perhaps import fiftyone.zoo as foz as a separate import that I choose to add. In sum, the overall structure conveys complexity and violates lightweightness.

  • (Science users) As a scientist using the tool, I am likely to want to keep the data under my control as long as I can, and simply take the necessary steps of preprocessing the data for things like equalization, sizing, etc. The interface provided here gives me pause: it suggests I should adopt fiftyone for all of my dataset work, which is not necessarily the case. Let's say I have some code like this:

(train, val) = load_dataset_from_wherever("dataset_name")
# train, val are lists of (numpy.ndarray image, target label) pairs
prep = Preprocessor(initialize_multiple_steps...)
prep.in_place_process(train)
prep.in_place_process(val)

# now I am ready to begin training
# In PyTorch, for example, I'd create a dataloader for batches, etc.
model = train_model(train, val)

# I have a model; let's get some information from fiftyone on the data
# (not a concrete suggestion for the interface)
fdata = fiftyone.dataset(train, "a name")  # assuming I will have a way to map whatever is in fdata back to list-indices in train
fdata.add_model(model, "an identifier")
with fdata.get_context() as context:
    for i, sample in enumerate(train):
        prediction_dict = model.run(sample)  # noting I'd probably even want to do this on "batches" of samples at once, via my own dataloader...
        context.add_predictions(i, prediction_dict)

    hardness_ranking = context.get_hardness_rank()

This puts "scientist me" first and fiftyone second. (I have real examples that do essentially the above, without fiftyone at this point.)
I guess my comments center around what I, as the data scientist, need to do in order to adopt fiftyone; as we agree this should be as light as possible. To have fiftyone support dataset loading, etc, seems fine, but my guess is that it would then also need to support things like preprocessing across the underlying backends; but then it starts to look a lot like keras. Do we want that? Working on the user's terms is probably the right design goal, etc.

In summary, these are comments from a user perspective relating to sense of complexity and value from using fiftyone.

@jasoncorso

(In reading this, I have presumed the two Dataset class implementations will be merged. If this is not the case, then it'll generate more discussion.)

@brimoor

brimoor commented Apr 30, 2020

I moved load_zoo_dataset() to the fiftyone.zoo namespace, and I imported the other main dataset ingestion methods into the fiftyone namespace. So, users will only import fiftyone or import fiftyone.zoo unless they are doing something very detailed.

examples/interface.py is updated to demonstrate this.
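Under the new layout, the zoo example from the PR description presumably becomes (using the aliases suggested above):

# the only imports a typical user should need
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10")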

@brimoor

brimoor commented Apr 30, 2020

(In reading this, I have presumed the two Dataset class implementations will be merged. If this is not the case, then it'll generate more discussion.)

Yes they will need to be merged. My thinking is the public interface of this one will be grafted onto the backend implementation of the other one.

@jasoncorso

Keeping the discussion here:

  • @brimoor, relating to your PR, are you envisioning that the calculation of some scalar measure (e.g., hardness) would be performed on a ModelContext?
  • One user aspect is the need to be able to refer to contexts by id/name/descriptor. It seems to me that I may exit a context but later want to be able to refer to it in some way. Is the vision that this is possible?

@brimoor

brimoor commented Apr 30, 2020

(fiftyone) tylerganter@tgmbp:~/source/fiftyone/examples$ python interface.py 
Available datasets: ['mnist', 'cifar10', 'imagenet', 'imagenet-2012', 'coco', 'coco-2017']
Using default split 'test'
Using default backing directory '/Users/tylerganter/.fiftyone/cifar10'
Using default dataset directory '/Users/tylerganter/fiftyone/cifar10/test'
Uncaught exception
Traceback (most recent call last):
  File "interface.py", line 67, in <module>
    dataset1 = foz.load_zoo_dataset("cifar10")
  File "/Users/tylerganter/source/fiftyone/fiftyone/zoo/base.py", line 108, in load_zoo_dataset
    return fod.Dataset.from_ground_truth_labeled_dataset(labeled_dataset)
TypeError: from_ground_truth_labeled_dataset() missing 2 required positional arguments: 'backing_dir' and 'labeled_dataset'

fixed

@brimoor

brimoor commented Apr 30, 2020

That implementation does not really support the natural (lightweight) workflow in:

# I have a model; let's get some information from fiftyone on the data
# (not a concrete suggestion for the interface)
fdata = fiftyone.dataset(train, "a name")  # assuming I will have a way to map whatever is in fdata back to list-indices in train
fdata.add_model(model, "an identifier")
with fdata.get_context() as context:
    for i, sample in enumerate(train):
        prediction_dict = model.run(sample)
        context.add_predictions(i, prediction_dict)
    hardness_ranking = context.get_hardness_rank()

We could introduce a class ImageClassifierContext(ModelContext, ImageClassificationContext); then hardness would be available directly in that context. A minimal sketch of the idea follows.
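The base classes here are stand-ins for the real ones in fiftyone.experimental.contexts, and get_hardness_rank is an assumed method name:

class ModelContext:
    """Stand-in: yields (img, sample ID) pairs so predictions can be added."""

class ImageClassificationContext:
    """Stand-in: yields (img, label) pairs for a frame attribute."""

class ImageClassifierContext(ModelContext, ImageClassificationContext):
    """Combined context in which predictions can be added and classification
    labels iterated, so a scalar measure like hardness can be computed
    directly."""

    def get_hardness_rank(self):
        # hypothetical method; the PR does not pin down its signature
        raise NotImplementedError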



def load_image_classification_dataset(
dataset, labels_map=None, sample_parser=None, name=None, backing_dir=None,

maybe rename dataset to iterable or sample_iterator. I think this was one of the things that sounded heavy-duty to Jason, and it definitely threw me off when I first read it.

@jasoncorso

Did our conversation yesterday lead to any decision on the use of the word load_ in many of the load function calls? This is mainly a semantic user consideration. And, load_ may be fine.

@brimoor

brimoor commented May 1, 2020

There are a variety of considerations:

  • We want to convey to the user when a given operation is lightweight or heavyweight. I define "lightweight" to mean the raw images/videos are not read from disk or copied, and "heavyweight" is when they are

  • At the same time, we need to convey to the user whether their work is expected to be persistent or not. We haven't fully fleshed this out yet, but it seems to me that we do need to persist operations that the user performs on their datasets, since a core value prop is tracking models over time on a dataset, and requesting "heavy" backend computation to add information to one's samples

Looking to other libraries for examples, the syntax for creating "lightweight" datasets is:

#
# TensorFlow
#

import tensorflow as tf

# Dataset creation syntax. This is "lightweight"
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
    parse_sample,  # a function that loads an image and parses the label, if necessary
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Example use
for img, label in dataset.as_numpy_iterator():
    pass

#
# PyTorch
#

import torch

class MyDataset(torch.utils.data.Dataset):

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return load_image(self.image_paths[idx]), self.labels[idx]

# Dataset creation syntax. This is "lightweight"
dataset = MyDataset(image_paths, labels)

# Example use
for img, label in torch.utils.data.DataLoader(dataset):
    ...

So, we could adopt a syntax like:

dataset = fiftyone.Dataset.from_image_classification_samples(samples)

But, our datasets are in fact persistent. When you "load/ingest" a dataset, its labels will be completely ingested and saved to a persistent database. Another consideration is that, if we're not creating copies of the user's data, then they need to know that they can't move the samples and expect to still use their dataset in the future.

As discussed yesterday, we may want to optionally provide the ability to cede control of one's raw samples to us as well, to be stored internally along with the labels and other metadata, in which case we're switching over to "heavyweight" mode.

@tylerganter

tylerganter commented May 1, 2020

@brimoor in response to your latest comment, a few thoughts:

I think the best start for us, and for the first interaction the user may have with the system, is to drop the database when the session is finished. It's simpler to manage on our end and to understand on the user's end, and it follows the "lightweight" principle.

Keeping track of backend operations could be as simple as tracking MongoDB logs. If not, I would recommend we connect to a service like Pachyderm that already supports this.

There is some fancy stuff we could do to make samples lightweight but persistent. We could use hardlinks to the data (sketched after this list), which has a lot of benefits:

  1. lightweight ingestion
  2. users can move their data wherever they want
  3. if they "delete" the data, it isn't deleted from disk until they clear the dataset from fiftyone as well. DVC uses reflinks by default, and we could potentially hook up with them as a backend: file-link-types-for-the-dvc-cache
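A quick standard-library sketch of the hardlink idea (illustrative only; DVC's reflink support is a separate mechanism, and the function name here is hypothetical):

import os

def ingest_image(src_path, backing_dir):
    # Hardlink into our backing directory instead of copying bytes. Both
    # paths then reference the same inode: ingestion is cheap, and the data
    # survives on disk until *both* links are removed. Note that os.link
    # requires src and dst to live on the same filesystem.
    os.makedirs(backing_dir, exist_ok=True)
    dst_path = os.path.join(backing_dir, os.path.basename(src_path))
    os.link(src_path, dst_path)
    return dst_path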

@brimoor

brimoor commented May 1, 2020

I agree that FiftyOne not persisting anything to disk is simpler and easier to understand, I just wonder if it will be essential in order to track changes to models over time.

If FiftyOne persists nothing, then we'll have to return everything we compute to the user. If they request sample hardness, we compute it and return it to them. We also store it internally in the current session so that they can visualize it, say. But, if the user wants to visualize the same information at a later date, for example, they'll be responsible for providing that metadata back to FiftyOne. Maybe that's the way it should be, it's just not clear to me yet.

In any case, I am fine with starting under the assumption that nothing is persisted.

@brimoor

brimoor commented May 1, 2020

I made a few changes to this PR:

  • renamed the dataset "loading" methods in fiftyone.core.data to from_*_samples and from_*_images to emphasize that the user is creating a handle to something that they have ultimate control over

  • moved my implementations of Dataset, DatasetContext and DatasetView to a fiftyone.experimental package. The core dataset loading methods remain in fiftyone.core.data, although they return fiftyone.experimental.data.Dataset instances for the time being

I couldn't see any clear way to land this PR without including my experimental definitions of Dataset functionality, since otherwise one can't do anything with the zoo/dataset loading code!

@tylerganter

I agree that FiftyOne not persisting anything to disk is simpler and easier to understand, I just wonder if it will be essential in order to track changes to models over time.

If FiftyOne persists nothing, then we'll have to return everything we compute to the user. If they request sample hardness, we compute it and return it to them. We also store it internally in the current session so that they can visualize it, say. But, if the user wants to visualize the same information at a later date, for example, they'll be responsible for providing that metadata back to FiftyOne. Maybe that's the way it should be, it's just not clear to me yet.

In any case, I am fine with starting under the assumption that nothing is persisted.

Yeah there are obvious limitations if nothing is persisted. You can compute hardness and get it back. Or you can quickly browse your samples. That's about it. But it's a simple start that gets'em comin' back!

@jasoncorso

In planning discussions, we agreed nothing persists for MVP. In the agile mindset, we should develop strictly for that right now. Get immediate value. Replan. Redevelop and move on.

@brimoor brimoor left a comment

@tylerganter some thoughts for you

sample: the sample

Returns:
an ``eta.core.image.ImageLabels`` instance

@tylerganter a way to merge this work with your work would be to change this to output a fiftyone.label.ImageLabels instead

sample: the sample

Returns:
an ``eta.core.image.ImageLabels`` instance

@tylerganter and then, critically, this would just emit a fiftyone.label.ClassificationLabel

sample: the sample

Returns:
an :class:`eta.core.image.ImageLabels` instance

@tylerganter and this would be a fiftyone.label.DetectionLabel

@tylerganter tylerganter left a comment

I'm approving, noting that we are at a very early stage; Brian and I just had a productive conversation about what we're going to change.

@brimoor brimoor merged commit 1a18d2a into develop May 1, 2020
@brimoor brimoor deleted the interface branch May 1, 2020 21:22
benjaminpkane pushed a commit that referenced this pull request Jan 20, 2022
kaixi-wang pushed a commit that referenced this pull request May 11, 2023
Adding support for reading/writing datasets from cloud storage