Dataset Zoo and MVP SDK interface #19
Conversation
Thanks for pulling this work together. I cannot review it all right now, but I have two high-level comments, both from an "ease of adoption" standpoint; as a scientist, if you will:

- (Overall sense of complexity of the tool) Thinking about numpy, torch, tensorflow, and scipy: three of these libraries, which are significant, are essentially imported with `import thelibrary as foo`. The design choices in this code instead take an approach similar to the fourth, and require a significant number of import lines. While we should have made an effort to see what users want, I think this conveys a sense of heaviness and complexity, when I'd rather directly use `import fiftyone as fo` or something like it, and perhaps `import fiftyone.zoo as foz` as a separate one I can choose. In sum, the overall structure conveys complexity and violates lightweightness.
- (Science users) As a scientist using the tool, I am likely to want to keep the data under my control as long as I can, and simply take the necessary steps of preprocessing the data for things like equalization, sizing, etc. The interface provided here suggests I need to give this up and adopt fiftyone for all of my dataset work, which is not necessarily the case. Let's say I have some code like this:

```python
(train, val) = load_dataset_from_wherever("dataset_name")
# train, val are lists of (numpy.ndarray image, target label) pairs

prep = Preprocessor(initialize_multiple_steps...)
prep.in_place_process(train)
prep.in_place_process(val)

# now I am ready to begin training
# In PyTorch, for example, I'd create a dataloader for batches, etc.
model = train_model(train, val)

# I have a model; let's get some information from fiftyone on the data
# (not a concrete suggestion for the interface)
fdata = fiftyone.dataset(train, "a name")  # assuming I will have a way to map
                                           # whatever is in fdata back to list
                                           # indices in train
fdata.add_model(model, "an identifier")
with fdata.get_context() as context:
    for i, sample in enumerate(train):
        # I'd probably even want to do this on "batches" of samples at once,
        # via my own dataloader...
        prediction_dict = model.run(sample)
        context.add_predictions(i, prediction_dict)
hardness_ranking = context.get_hardness_rank()
```
This puts "scientist me" first and fiftyone second. (I have real examples that do essentially the above, without fiftyone, at this point.)

I guess my comments center around what I, as the data scientist, need to do in order to adopt fiftyone; we agree this should be as light as possible. Having fiftyone support dataset loading, etc., seems fine, but my guess is that it would then also need to support things like preprocessing across the underlying backends; at that point it starts to look a lot like keras. Do we want that? Working on the user's terms is probably the right design goal.

In summary, these are comments from a user perspective, relating to the sense of complexity and the value of using fiftyone.
(In reading this, I have presumed the two Dataset class implementations will be merged. If this is not the case, then it'll generate more discussion.)

I moved

Yes, they will need to be merged. My thinking is that the public interface of this one will be grafted onto the backend implementation of the other one.
Keeping the discussion here:
fixed

We could introduce an
fiftyone/core/data.py (outdated)

```python
def load_image_classification_dataset(
    dataset, labels_map=None, sample_parser=None, name=None, backing_dir=None,
```
Maybe rename `dataset` to `iterable` or `sample_iterator`. I think this was one of the things that sounded heavy-duty to Jason, and it definitely threw me off guard when I first read it.
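For concreteness, the suggested rename might look like the stub below. The body is an illustrative guess at the semantics implied by the parameter names, not the actual implementation in this PR:

```python
def load_image_classification_dataset(
    sample_iterator,  # renamed from `dataset`: any iterable of (image_path, label)
    labels_map=None,
    sample_parser=None,
    name=None,         # unused in this stub
    backing_dir=None,  # unused in this stub
):
    """Illustrative stub: materializes samples, applying the optional
    parser and labels_map along the way."""
    samples = []
    for sample in sample_iterator:
        if sample_parser is not None:
            # parser converts a raw sample into an (image_path, label) pair
            sample = sample_parser(sample)
        image_path, label = sample
        if labels_map is not None:
            label = labels_map[label]
        samples.append((image_path, label))
    return samples
```

The point of the rename is that the first argument is *not* itself a Dataset; it is merely raw material from which one is built.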
Did our conversation yesterday lead to any decision on the use of the word
There are a variety of considerations.

Looking to other libraries for examples, the syntax for creating "lightweight" datasets is:

```python
#
# TensorFlow
#

# Dataset creation syntax. This is "lightweight"
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
    parse_sample,  # a function that loads an image and parses the label, if necessary
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Example use
for img, label in dataset.as_numpy_iterator():
    pass
```

```python
#
# PyTorch
#

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return load_image(self.image_paths[idx]), self.labels[idx]

# Dataset creation syntax. This is "lightweight"
dataset = MyDataset(image_paths, labels)

# Example use
for img, label in torch.utils.data.DataLoader(dataset):
    ...
```

So, we could adopt a syntax like:

```python
dataset = fiftyone.Dataset.from_image_classification_samples(samples)
```

But our datasets are in fact persistent. When you "load/ingest" a dataset, its labels will be completely ingested and saved to a persistent database.

Another consideration is that, if we're not creating copies of the user's data, then they need to know that they can't move the samples and expect to still use their dataset in the future.

As discussed yesterday, we may want to optionally provide the ability to cede control of one's raw samples to us as well, to be stored internally along with the labels and other metadata, in which case we're switching over to "heavyweight" mode.
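To make the comparison concrete, here is a minimal sketch of what the proposed `from_image_classification_samples` factory could look like on our side. The class shape, field names, and copy-vs-reference behavior shown here are all assumptions for illustration, not a decided design:

```python
class Dataset(object):
    """Hypothetical sketch of a lightweight dataset wrapper."""

    def __init__(self, name, image_paths, labels):
        self.name = name
        self._image_paths = list(image_paths)  # referenced in place, not copied
        self._labels = list(labels)  # "ingested", i.e., copied into the dataset

    @classmethod
    def from_image_classification_samples(cls, samples, name=None):
        """Creates a Dataset from an iterable of (image_path, label) pairs."""
        samples = list(samples)
        image_paths = [p for p, _ in samples]
        labels = [l for _, l in samples]
        return cls(name or "unnamed", image_paths, labels)

    def __len__(self):
        return len(self._labels)

    def __iter__(self):
        return iter(zip(self._image_paths, self._labels))


samples = [("/path/img1.jpg", "cat"), ("/path/img2.jpg", "dog")]
dataset = Dataset.from_image_classification_samples(samples, name="demo")
```

Note that even in this sketch the "persistent labels, referenced images" asymmetry is visible: moving the image files would silently break the dataset, which is exactly the caveat discussed above.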
@brimoor in response to your latest comment, a few thoughts: I think the best start for us, and the first interaction the user may have with the system, is to drop the database when the session is finished. It's simpler to manage on our end and to understand on the user's end, and it follows the "lightweight" principle. Keeping track of backend operations could be as simple as tracking MongoDB logs. If not, I would recommend we connect to a service like Pachyderm that already supports this. There is some fancy stuff we could do to make samples lightweight but persistent. We could use hardlinks to the data, which has a lot of benefits:
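As an illustration of the hardlink idea, here is a rough sketch (the function name and directory layout are hypothetical): ingesting a sample hardlinks it into an internal backing directory, so it remains readable even if the user later moves or deletes their original file, without duplicating bytes on disk.

```python
import os
import shutil


def ingest_sample(src_path, backing_dir):
    """Hardlink a user's sample file into our backing directory.

    A hardlink shares the same bytes on disk, so this is cheap, and the
    sample stays readable even if the user removes their copy. Falls back
    to a real copy if linking fails (e.g., across filesystems).
    """
    os.makedirs(backing_dir, exist_ok=True)
    dst_path = os.path.join(backing_dir, os.path.basename(src_path))
    try:
        os.link(src_path, dst_path)
    except OSError:
        shutil.copy2(src_path, dst_path)
    return dst_path
```

One caveat worth noting: hardlinks only work within a single filesystem, hence the copy fallback.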
I agree that FiftyOne not persisting anything to disk is simpler and easier to understand; I just wonder if persistence will be essential in order to track changes to models over time. If FiftyOne persists nothing, then we'll have to return everything we compute to the user. If they request sample hardness, we compute it and return it to them. We also store it internally in the current session so that they can visualize it, say. But if the user wants to visualize the same information at a later date, they'll be responsible for providing that metadata back to FiftyOne. Maybe that's the way it should be; it's just not clear to me yet. In any case, I am fine with starting under the assumption that nothing is persisted.
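In the no-persistence model, the compute-and-return pattern might look like the sketch below. Everything here, including the entropy-based hardness heuristic and the function name, is an assumption for illustration, not the actual design:

```python
import math


def compute_hardness(predictions):
    """Return {sample_id: hardness}, where hardness is the entropy of the
    model's predicted class distribution (higher = harder sample).

    ``predictions`` maps sample IDs to lists of class probabilities.
    Nothing is persisted: the caller owns the result, and would have to
    hand it back to us later to visualize it again.
    """
    hardness = {}
    for sample_id, probs in predictions.items():
        hardness[sample_id] = -sum(p * math.log(p) for p in probs if p > 0)
    return hardness


# Example: a confident prediction is "easier" than an uncertain one
scores = compute_hardness({
    "sample1": [0.98, 0.01, 0.01],  # confident
    "sample2": [0.40, 0.35, 0.25],  # uncertain
})
```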
I made a few changes to this PR:

I couldn't see any clear way to contribute this PR without including my experimental definitions of Dataset functionality, since otherwise one can't do anything with the zoo/dataset loading code!
Yeah, there are obvious limitations if nothing is persisted. You can compute hardness and get it back, or you can quickly browse your samples. That's about it. But it's a simple start that gets 'em comin' back!
In planning discussions, we agreed that nothing persists for the MVP. In the agile mindset, we should develop strictly for that right now: get immediate value, replan, redevelop, and move on.
@tylerganter some thoughts for you
```python
        sample: the sample

    Returns:
        an ``eta.core.image.ImageLabels`` instance
```
@tylerganter a way to merge this work with your work would be to change this to output a `fiftyone.label.ImageLabels` instead
```python
        sample: the sample

    Returns:
        an ``eta.core.image.ImageLabels`` instance
```
@tylerganter and then, critically, this would just emit a `fiftyone.label.ClassificationLabel`
```python
        sample: the sample

    Returns:
        an :class:`eta.core.image.ImageLabels` instance
```
@tylerganter and this would be a `fiftyone.label.DetectionLabel`
I'm approving, noting that we are at a very early stage, and Brian and I just had a productive conversation about what we're going to change.
Adding support for reading/writing datasets from cloud storage
This PR is pretty big, but it accomplishes the following two goals. Run the example code in `examples/interface.py` and let me know what you think.

FiftyOne Dataset Zoo

A Dataset Zoo is established in FiftyOne that enables users to download any of a collection of datasets using a simple syntax.

Behind the scenes, it grabs datasets using either the TF Zoo or the PyTorch Zoo, depending on which package is installed on the user's machine. All datasets are stored on disk in `eta.core.datasets.LabeledImageDataset` format.

Prototype Dataset interface
My goal here was to have a working example of performing the following tasks:

- iterating over `(image, ImageLabels)` pairs in the dataset, where the labels are either the ground truth labels or a set of predicted labels
- iterating over `(img, label)` pairs in the dataset, as would be fed to a classifier during training
- iterating over `(image, sample ID)` pairs so that users can add new predictions to the dataset

To accomplish this, I implemented prototype versions of the following classes:

- `fiftyone.experimental.data.Dataset`: the underlying class that stores all information about the dataset, including paths to the raw images on disk, ground truth annotations, one or more sets of predicted labels, and additional metadata accompanying all of these items, such as sample hardness, annotation correctness, etc.
- `fiftyone.experimental.contexts.DatasetContext`: classes that represent specific contexts into the dataset. Examples include:
    - `ImageContext`: pulls only the images from the dataset. Iterating over this yields `img`s
    - `LabeledImageContext`: pulls images and a particular set of ImageLabels (e.g., ground truth) from the dataset. Iterating over contexts of this kind yields `(img, ImageLabels)` pairs
    - `ImageClassificationContext`: pulls images and a particular frame attribute of a `LabeledImageContext` from the dataset. Iterating over contexts of this kind yields `(img, label)` pairs
    - `ModelContext`: pulls a specific model from the dataset, so that additional predictions can be added to samples. Iterating over contexts of this kind yields `(img, sample ID)` pairs
- `fiftyone.experimental.views.DatasetView`: classes that allow read-only operations to be performed on a `DatasetContext`. These allow things like sorting by `X`, shuffling, removing specific samples (from the view, not the underlying dataset), iterating over the samples in the view, and exporting the current state of the view as a dataset on disk

To achieve a functional prototype of the above, my `fiftyone.experimental.data.Dataset` class has an actual implementation. This, of course, will be thrown out and grafted onto the MongoDB interface.

Again, see `examples/interface.py` to see this functionality in action.

I'm not sure that there needs to be a difference between `DatasetContext`s and `DatasetView`s. But there is the interesting distinction to be made that one may want the "read-only" playground that `DatasetView`s offer, where any slicing and dicing of the dataset is not permanent.

Key Design Principles
- `tensorflow` and `torch` are not imported unless the user specifically requests functionality that requires them