# Fifteen Minutes to FiftyOne

### Two-sentence summary

`Dataset`s are composed of `Sample`s which contain `Field`s, all of which can
be dynamically created, modified and deleted.

`DatasetView`s allow one to easily view and manipulate subsets of `Dataset`s.

In [1]:
import fiftyone as fo

## Getting Started

`Dataset`s are the core of any project using `fiftyone`. Because it is easy to select and manipulate
subsets of a `Dataset`, a single `fiftyone` dataset can typically hold all of the data for a project,
including train and test splits, labeled and unlabeled data, and so on.

In [2]:
dataset = fo.Dataset(name="simple_dataset")
print("dataset.name: %s" % dataset.name)
print("len(dataset): %s" % len(dataset))

dataset.name: simple_dataset
len(dataset): 0


When instantiated, a `Sample` does not have an associated `Dataset`.

In [3]:
sample = fo.Sample(filepath="path/to/image.png")
print("sample.in_dataset: %s" % sample.in_dataset)
print("sample.dataset_name: %s" % sample.dataset_name)
print("sample.id: %s" % sample.id)
print("sample.ingest_time: %s" % sample.ingest_time)

sample.in_dataset: False
sample.dataset_name: None
sample.id: None
sample.ingest_time: None


But when it is added to a `Dataset`, the related attributes are automatically populated.

In [4]:
dataset.add_sample(sample)
print("len(dataset): %s" % len(dataset))
print("sample.in_dataset: %s" % sample.in_dataset)
print("sample.dataset_name: %s" % sample.dataset_name)
print("sample.id: %s" % sample.id)
print("sample.ingest_time: %s" % sample.ingest_time)

len(dataset): 1
sample.in_dataset: True
sample.dataset_name: simple_dataset
sample.id: 5ecd2bd5779f8c708d88b8f2
sample.ingest_time: 2020-05-26 14:46:45+00:00


## `Dataset` Basics

Let's create a dataset with a few samples in it.

In [5]:
dataset = fo.Dataset("fityone_in_fifteen")
dataset.add_samples(
    [
        fo.Sample(filepath="/path/to/img1.jpg"),
        fo.Sample(filepath="/path/to/img2.jpg"),
        fo.Sample(filepath="/path/to/img3.jpg"),
    ]
);

`Dataset`s are iterable.

In [6]:
for sample in dataset:
    print(type(sample))

<class 'fiftyone.core.sample.Sample'>
<class 'fiftyone.core.sample.Sample'>
<class 'fiftyone.core.sample.Sample'>


`Sample`s can be accessed from their `Dataset` by ID, using the ID as a key.

The returned sample is the same in-memory instance, not a copy.

In [7]:
print("Sample ID: %s" % sample.id)
same_sample = dataset[sample.id]
print("same_sample is sample: %s" % (same_sample is sample))

Sample ID: 5ecd2bd5779f8c708d88b8f6
same_sample is sample: True
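
The notebook does not show how FiftyOne returns the same instance. One common way to get this behavior is an identity map that remembers the live object for each ID. The sketch below is a plain-Python illustration of that idea only; `ToySample`, `ToyDataset`, and the ID `"abc123"` are hypothetical names, not part of the `fiftyone` API.

```python
import weakref


class ToySample:
    """Hypothetical stand-in for a sample; not FiftyOne's Sample class."""

    def __init__(self, sample_id, filepath):
        self.id = sample_id
        self.filepath = filepath


class ToyDataset:
    """Hypothetical dataset that hands out one in-memory object per ID."""

    def __init__(self):
        self._docs = {}  # stands in for the backing database
        self._live = weakref.WeakValueDictionary()  # ID -> live instance

    def add(self, sample):
        self._docs[sample.id] = {"filepath": sample.filepath}
        self._live[sample.id] = sample

    def __getitem__(self, sample_id):
        if sample_id not in self._docs:
            raise KeyError("No sample found with ID '%s'" % sample_id)
        sample = self._live.get(sample_id)
        if sample is None:
            # No live instance left: rebuild one from the stored document
            sample = ToySample(sample_id, self._docs[sample_id]["filepath"])
            self._live[sample_id] = sample
        return sample


toy = ToyDataset()
original = ToySample("abc123", "/path/to/img1.jpg")
toy.add(original)

print("same instance: %s" % (toy["abc123"] is original))
```

Because both names refer to one object, a change made through one is immediately visible through the other.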


## Modifying datasets

### Adding or deleting a sample

Use `Dataset.add_sample` to add a single sample to a dataset:

In [8]:
sample = fo.Sample(filepath="new1.jpg")
sample_id = dataset.add_sample(sample)
sample_id

'5ecd2bd5779f8c708d88b8f7'

Remove a sample from a dataset via its ID:

In [9]:
del dataset[sample_id]

try:
    print("Attempting to access sample '%s'" % sample_id)
    sample = dataset[sample_id]
except KeyError as e:
    print(e)

Attempting to access sample '5ecd2bd5779f8c708d88b8f7'
"No sample found with ID '5ecd2bd5779f8c708d88b8f7'"


`Sample`s can equivalently be removed from their `Dataset` via `Dataset.remove_sample`.

If the removed `Sample` object is still referenced in memory, it behaves like a `Sample` that
has never been added to a `Dataset`.

In [10]:
sample = next(iter(dataset))
print("Before removing:")
print("  in_dataset: %s" % sample.in_dataset)
print("  dataset_name: %s" % sample.dataset_name)
print("  id: %s" % sample.id)

dataset.remove_sample(sample)

print("After removing:")
print("  in_dataset: %s" % sample.in_dataset)
print("  dataset_name: %s" % sample.dataset_name)
print("  id: %s" % sample.id)

Before removing:
  in_dataset: True
  dataset_name: fityone_in_fifteen
  id: 5ecd2bd5779f8c708d88b8f4
After removing:
  in_dataset: False
  dataset_name: None
  id: None


### Batch addition/deletion of samples

Use `Dataset.add_samples` to add a batch of samples to a dataset:

In [11]:
sample_ids = dataset.add_samples(
    [
        fo.Sample(filepath="new_batch1.jpg"),
        fo.Sample(filepath="new_batch2.jpg"),
        fo.Sample(filepath="new_batch3.jpg"),
        fo.Sample(filepath="new_batch4.jpg"),
    ]
)
sample_ids

['5ecd2bd5779f8c708d88b8f8',
 '5ecd2bd5779f8c708d88b8f9',
 '5ecd2bd5779f8c708d88b8fa',
 '5ecd2bd5779f8c708d88b8fb']

Remove a batch of samples by passing an iterable of samples or IDs to `Dataset.remove_samples`:

In [12]:
print("Number of samples before: %d" % len(dataset))
dataset.remove_samples(sample_ids)
print("Number of samples after: %d" % len(dataset))

Number of samples before: 6
Number of samples after: 2


## `Field`s

`Field`s are special attributes of `Sample`s shared across all `Sample`s in a
`Dataset`.

> If `Dataset`s were tables, and `Sample`s were rows, `Field`s would be the columns.

`Sample.filepath` is an example of a `StringField` that is available by default on every `Sample`.

In [13]:
sample = fo.Sample(filepath="path/to/img.png")

sample.filepath

'path/to/img.png'

The field "schema" describes what fields are accessible on the sample.

`Sample.get_field_schema()` returns a dictionary of `(field_name, field)` pairs.

In [14]:
for field_name, field in sample.get_field_schema().items():
    print("Field name: %s" % field_name)
    print("Field type: %s" % type(field))
    print()

Field name: filepath
Field type: <class 'fiftyone.core.fields.StringField'>

Field name: tags
Field type: <class 'fiftyone.core.fields.ListField'>

Field name: metadata
Field type: <class 'fiftyone.core.fields.EmbeddedDocumentField'>



Print the sample itself to quickly see all fields present and their values:

In [15]:
print(sample)

{
    "filepath": "path/to/img.png",
    "tags": [],
    "metadata": null
}


### Adding Fields to Samples

New fields can be added to a `Sample` via item assignment (`sample["name"] = value`). Once set, a
field may be read by key or by attribute access.

In [16]:
sample["integer_field"] = 51

print(sample["integer_field"])
print(sample.integer_field)

51
51


Adding new fields automatically updates the field schema.

In [17]:
fields = sample.get_field_schema()

print(list(fields.keys()))

type(fields["integer_field"])

['filepath', 'tags', 'metadata', 'integer_field']


fiftyone.core.fields.IntField

`Sample` fields can hold any of the primitive types (`bool`, `int`, `float`, `str`, `list`, `dict`)
or more complex data structures such as `Label`s.

In [18]:
sample["ground_truth"] = fo.Classification(label="alligator")

print(type(sample.ground_truth))
print(sample)

<class 'fiftyone.core.labels.Classification'>
{
    "filepath": "path/to/img.png",
    "tags": [],
    "metadata": null,
    "integer_field": 51,
    "ground_truth": {
        "_cls": "Classification",
        "label": "alligator",
        "confidence": null,
        "logits": null
    }
}


## `Field`s and `Dataset`s

The real power of `Field`s is revealed when working with a `Dataset`.

Any `Sample` in a `Dataset` is guaranteed to have the same `Field`s
present.

In [19]:
dataset = fo.Dataset("dataset_with_dynamic_fields")

sample1 = fo.Sample(filepath="/path/to/img1.jpg")
sample2 = fo.Sample(filepath="/path/to/img2.jpg")

dataset.add_samples([sample1, sample2]);

The field schema is accessible on a dataset in the same way as on a sample.


In [20]:
for field_name, field in dataset.get_field_schema().items():
    print("Field name: %s" % field_name)
    print("Field type: %s" % type(field))
    print()

Field name: filepath
Field type: <class 'fiftyone.core.fields.StringField'>

Field name: tags
Field type: <class 'fiftyone.core.fields.ListField'>

Field name: metadata
Field type: <class 'fiftyone.core.fields.EmbeddedDocumentField'>



To get a quick overview of the contents of a dataset, including
the field schema, use `Dataset.summary()`.

In [21]:
print(dataset.summary())

Name:           dataset_with_dynamic_fields
Num samples:    2
Tags:           []
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)


As demonstrated earlier, we can add a new field to a `Sample`. However, because this sample is in a
`Dataset`, the operation also affects the sample's dataset and every other sample in it.

One detail to note: once a sample is part of a dataset, any modification of its fields must be
saved before the change propagates to the dataset. This is done with the `Sample.save()` method.

In [22]:
sample1["integer_field"] = 51
sample1.save()

sample1.integer_field

51

In [23]:
print(dataset.summary())

Name:           dataset_with_dynamic_fields
Num samples:    2
Tags:           []
Sample fields:
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
    integer_field: fiftyone.core.fields.IntField


If a `Field` is not set on a `Sample`, the default is `None`.

In [24]:
print(sample2.integer_field)

None


Setting a field to an inappropriate type raises a `ValidationError`.

In [25]:
try:
    sample2.integer_field = "a string"
except Exception as e:
    print(e)

a string could not be converted to int
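
The schema behavior shown in the last few cells can be summarized with a small plain-Python model. This is a sketch of the observable behavior only, not FiftyOne's implementation; `ToySchemaDataset`, `ToyValidationError`, and their methods are hypothetical names. The first assignment declares a field for the whole dataset, unset fields read as `None`, and a value of the wrong type is rejected.

```python
class ToyValidationError(Exception):
    """Hypothetical error type used only in this sketch."""


class ToySchemaDataset:
    """Hypothetical dataset whose rows all share one field schema."""

    def __init__(self):
        self.schema = {"filepath": str}  # field name -> expected type
        self.rows = []

    def add_row(self, filepath):
        row = {"filepath": filepath}
        self.rows.append(row)
        return row

    def set_field(self, row, name, value):
        expected = self.schema.get(name)
        if expected is None:
            # First assignment declares the field for the whole dataset
            self.schema[name] = type(value)
        elif not isinstance(value, expected):
            raise ToyValidationError(
                "%s could not be converted to %s" % (value, expected.__name__)
            )
        row[name] = value

    def get_field(self, row, name):
        if name not in self.schema:
            raise KeyError("Sample has no field '%s'" % name)
        # Declared but unset fields read as None
        return row.get(name)


toy = ToySchemaDataset()
row1 = toy.add_row("/path/to/img1.jpg")
row2 = toy.add_row("/path/to/img2.jpg")

toy.set_field(row1, "integer_field", 51)
print(toy.get_field(row2, "integer_field"))  # declared on the dataset, unset on row2

try:
    toy.set_field(row2, "integer_field", "a string")
except ToyValidationError as e:
    print(e)
```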


Fields can be entirely deleted from datasets at any time via the `Dataset.delete_sample_field()` method:

In [26]:
dataset.delete_sample_field("integer_field")

try:
    sample2["integer_field"]
except KeyError as e:
    print(e)
    
print(dataset.summary())

"Sample has no field 'integer_field'"
Name:           dataset_with_dynamic_fields
Num samples:    2
Tags:           []
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)


After a field has been deleted, it can be set again to a different field type:

In [27]:
sample1["integer_field"] = "a string instead"
sample1.save()

print(sample1.integer_field)
print()

print(dataset.summary())

a string instead

Name:           dataset_with_dynamic_fields
Num samples:    2
Tags:           []
Sample fields:
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
    integer_field: fiftyone.core.fields.StringField


### Tags

`Sample.tags` is a default `Field` which is simply a list of strings. Tags may refer to dataset
splits, but they are not constrained to be disjoint, and a `Sample` may have more than
one tag.

In [28]:
sample = fo.Sample(filepath="path/to/image.png", tags=["train"])
sample.tags

['train']

A set of all unique tags present on samples in a dataset can be accessed
by `Dataset.get_tags()`.

In [29]:
dataset = fo.Dataset("dataset_with_tags")

dataset.add_samples(
    [
        fo.Sample(filepath="path/to/image1.png", tags=["train"]),
        fo.Sample(filepath="path/to/image2.png", tags=["test"]),
    ]
)

print(dataset.get_tags())

{'test', 'train'}


Modify the tags of a sample as you would any `list`, then call `sample.save()` if the sample is in a dataset.

In [30]:
sample = next(iter(dataset))

sample.tags += ["my_tag"]
sample.save()

print(sample)
print(dataset.get_tags())

{
    "_id": {
        "$oid": "5ecd2bd6779f8c708d88b900"
    },
    "filepath": "path/to/image1.png",
    "tags": [
        "train",
        "my_tag"
    ],
    "metadata": null
}
{'test', 'train', 'my_tag'}


## `DatasetView` basics

Depending on your character, you may be either frustrated or delighted to hear
that we have not been using the ideal approach for accessing samples from a
dataset.

The easiest way to access samples on a dataset is through a `DatasetView`.

In [31]:
dataset = fo.Dataset("interesting_dataset")
samples = [
    fo.Sample(filepath="/path/to/img1.jpg", tags=["train"]),
    fo.Sample(
        filepath="/path/to/img2.jpg",
        tags=["train"],
        metadata=fo.Metadata(size_bytes=1024, mime_type=".jpg"),
    ),
    fo.Sample(filepath="/path/to/img3.jpg", tags=["test"])
]
dataset.add_samples(samples);

The default view on a dataset is easily accessible via:

In [32]:
type(dataset.view())

fiftyone.core.view.DatasetView

Basic exploratory commands are also available on views:

In [33]:
len(dataset.view())

3

In [34]:
print(dataset.view().summary())

Dataset:        interesting_dataset
Num samples:    3
Tags:           ['test', 'train']
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
Pipeline stages:
    ---


Note there is an extra section: "Pipeline stages", which refers to the pipeline of
filter, sort, and other operations of the view.

Use `first()` to get a single sample from the view.

In [35]:
sample = dataset.view().first()
print(sample)

{
    "_id": {
        "$oid": "5ecd2bd6779f8c708d88b903"
    },
    "filepath": "/path/to/img1.jpg",
    "tags": [
        "train"
    ],
    "metadata": null
}


Use `take()` to randomly sample the view.

In [36]:
for sample in dataset.view().take(2):
    print(sample)

{
    "_id": {
        "$oid": "5ecd2bd6779f8c708d88b904"
    },
    "filepath": "/path/to/img2.jpg",
    "tags": [
        "train"
    ],
    "metadata": {
        "_cls": "Metadata",
        "size_bytes": 1024,
        "mime_type": ".jpg"
    }
}
{
    "_id": {
        "$oid": "5ecd2bd6779f8c708d88b905"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}


### Sorting

Samples can be sorted by any field or subfield:

In [37]:
print("\nReverse sorting by filepath:")
for sample in dataset.view().sort_by("filepath", reverse=True):
    print(sample.filepath)

print("\nReverse sort by image size (samples with no metadata omitted):")
for sample in dataset.view().sort_by("metadata.size_bytes", reverse=True):
    if sample.metadata:
        print(sample.metadata.size_bytes)


Reverse sorting by filepath:
/path/to/img3.jpg
/path/to/img2.jpg
/path/to/img1.jpg

Reverse sort by image size (samples with no metadata omitted):
1024
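
Sorting by a subfield such as `metadata.size_bytes` has to cope with samples whose `metadata` is `None`. The sketch below models this in plain Python, assuming the view defers to MongoDB's ordering, in which missing or null values sort before numbers in ascending order and therefore last in a reverse sort. `get_path` and `sort_by` are hypothetical helpers, not `fiftyone` functions.

```python
def get_path(document, path):
    """Follow a dotted path such as "metadata.size_bytes"; None if missing."""
    value = document
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def sort_by(documents, path, reverse=False):
    """Sort dicts by a dotted path, placing missing values lowest."""

    def key(doc):
        value = get_path(doc, path)
        # (False, ...) sorts before (True, ...), so missing values come first
        return (value is not None, value if value is not None else 0)

    return sorted(documents, key=key, reverse=reverse)


docs = [
    {"filepath": "/path/to/img1.jpg", "metadata": None},
    {"filepath": "/path/to/img2.jpg", "metadata": {"size_bytes": 1024}},
    {"filepath": "/path/to/img3.jpg", "metadata": None},
]

for doc in sort_by(docs, "metadata.size_bytes", reverse=True):
    print(doc["filepath"], get_path(doc, "metadata.size_bytes"))
```

In the reverse sort, the one sample with metadata comes first and the two samples without it follow.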


### Selection

Ranges of samples can be accessed using `skip()` and `limit()`:

In [38]:
view = dataset.view()
len(view.skip(1).limit(2))

2

The same range can be selected with slice notation:

In [39]:
len(view[1:3])

2
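
The correspondence between slicing and `skip()`/`limit()` is simple arithmetic: a slice `[start:stop]` skips `start` items and keeps at most `stop - start`. The sketch below checks this on a plain list; `skip_limit` and `slice_to_skip_limit` are hypothetical helpers written for illustration, not `fiftyone` functions.

```python
from itertools import islice


def skip_limit(items, skip, limit):
    """Drop the first `skip` items, then keep at most `limit` items."""
    return list(islice(items, skip, skip + limit))


def slice_to_skip_limit(s):
    """Translate a contiguous, non-negative slice into (skip, limit)."""
    if s.step not in (None, 1):
        raise ValueError("Only contiguous slices are handled in this sketch")
    if s.stop is None:
        raise ValueError("Open-ended slices are not handled in this sketch")
    start = s.start or 0
    return start, max(s.stop - start, 0)


filepaths = ["/path/to/img1.jpg", "/path/to/img2.jpg", "/path/to/img3.jpg"]

skip, limit = slice_to_skip_limit(slice(1, 3))
print("skip=%d limit=%d" % (skip, limit))
print(skip_limit(filepaths, skip, limit) == filepaths[1:3])
```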

Indexing a view with `[]` requires either a slice (containing a `:`) or a sample ID; a bare integer index raises a `KeyError`.

In [40]:
try:
    view[0]
except KeyError as e:
    print(e)

'Accessing samples by numeric index is not supported. Use sample IDs or slices'


Access a sample in a view by its ID:

In [41]:
print("Loading sample '%s' from the view:" % sample.id)
print(view[sample.id])

Loading sample '5ecd2bd6779f8c708d88b905' from the view:
{
    "_id": {
        "$oid": "5ecd2bd6779f8c708d88b905"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}


### Querying

The core query function is `match()`, which uses [MongoDB query syntax](https://docs.mongodb.com/manual/tutorial/query-documents).

In [42]:
for sample in dataset.view().match({"tags": "train"}):
    print(sample.tags)

['train']
['train']
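
In MongoDB query syntax, a scalar condition on an array field such as `{"tags": "train"}` matches any document whose array contains that value. The plain-Python sketch below models only this equality case; `matches` is a hypothetical helper, and the extra `"my_tag"` entry is made up to show that other tags do not prevent a match.

```python
def matches(document, query):
    """Equality-only subset of MongoDB-style query matching (a sketch)."""
    for field, wanted in query.items():
        value = document.get(field)
        if isinstance(value, list) and not isinstance(wanted, list):
            # A scalar condition on an array field matches any element
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


docs = [
    {"filepath": "/path/to/img1.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img2.jpg", "tags": ["train", "my_tag"]},
    {"filepath": "/path/to/img3.jpg", "tags": ["test"]},
]

matched = [doc["filepath"] for doc in docs if matches(doc, {"tags": "train"})]
print(matched)
```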


Convenience functions are provided for common queries.

Use `select()` to keep only the samples whose IDs are in a given list, or `exclude()` to drop them.

In [43]:
sample_ids = [str(sample.id)]
print("\nsample_ids: %s" % sample_ids)

print("\nselect:")
for sample in dataset.view().select(sample_ids):
    print(" - ", sample.id)

print("\nexclude:")
for sample in dataset.view().exclude(sample_ids):
    print(" - ", sample.id)


sample_ids: ['5ecd2bd6779f8c708d88b904']

select:
 -  5ecd2bd6779f8c708d88b904

exclude:
 -  5ecd2bd6779f8c708d88b903
 -  5ecd2bd6779f8c708d88b905


Or keep only the samples where a field exists and is not `None` with `exists()`:

In [44]:
for sample in dataset.view().exists("metadata"):
    print(sample.metadata)

{
    "_cls": "Metadata",
    "size_bytes": 1024,
    "mime_type": ".jpg"
}


### Chaining `DatasetView` operations

Each of the above operations returns a new `DatasetView`, so the operations can be chained in any order.

In [45]:
very_complex_view = (
    dataset.view()
    .match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")[10:20]
    .take(5)
)

print(very_complex_view.summary())

Dataset:        interesting_dataset
Num samples:    0
Tags:           []
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
Pipeline stages:
    1. {'$match': {'tags': 'train'}}
    2. {'$match': {'file_hash': {'$exists': True, '$ne': None}}}
    3. {'$sort': {'filepath': 1}}
    4. {'$skip': 10}
    5. {'$limit': 10}
    6. {'$sample': {'size': 5}}
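
The "Pipeline stages" listing suggests that each view operation appends one or more MongoDB stages to a list. The sketch below reproduces that listing with a plain-Python builder. It is an illustration of the chaining pattern, not FiftyOne's code; `ToyView` is a hypothetical class whose stage dictionaries are copied from the output above, and the `-1` used for a reverse sort follows MongoDB's `$sort` convention.

```python
class ToyView:
    """Hypothetical view that only records pipeline stages."""

    def __init__(self, stages=None):
        self._stages = list(stages or [])

    def _add(self, *stages):
        # Return a new view so the receiver is left unchanged
        return ToyView(self._stages + list(stages))

    def match(self, query):
        return self._add({"$match": query})

    def exists(self, field):
        return self._add({"$match": {field: {"$exists": True, "$ne": None}}})

    def sort_by(self, field, reverse=False):
        return self._add({"$sort": {field: -1 if reverse else 1}})

    def skip(self, n):
        return self._add({"$skip": n})

    def limit(self, n):
        return self._add({"$limit": n})

    def take(self, size):
        return self._add({"$sample": {"size": size}})

    def stages(self):
        return list(self._stages)


base = ToyView()
chained = (
    base.match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")
    .skip(10)
    .limit(10)
    .take(5)
)

for i, stage in enumerate(chained.stages(), 1):
    print("%d. %s" % (i, stage))
print("stages on base view: %d" % len(base.stages()))
```

Returning a new object from every method is what makes chaining safe: the base view keeps an empty pipeline and can be reused for other queries.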


Deleting the samples in a view from a dataset is straightforward:

In [46]:
view = dataset.view().select(sample_ids)

print("Length before: %d" % len(dataset))
dataset.remove_samples(view)
print("Length after: %d" % len(dataset))

Length before: 3
Length after: 2


## Aggregation pipelines

Powerful custom aggregations are available on `Dataset`s and `DatasetView`s
via the [MongoDB aggregation API](https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/).

In [47]:
pipeline = [
    # deconstruct the `tags` array field of the samples to output a sample for each tag
    {"$unwind": "$tags"},
    # group by `tags` and count the number of instances for each
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
]

aggregation = dataset.view().aggregate(pipeline)
aggregation

<pymongo.command_cursor.CommandCursor at 0x1201ec860>

In [48]:
for d in aggregation:
    # d is a dictionary whose structure depends on the aggregation pipeline
    print(d)

{'_id': 'test', 'count': 1}
{'_id': 'train', 'count': 1}
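
For readers unfamiliar with `$unwind` and `$group`, the same tag count can be written in plain Python. The sketch below uses hand-written dictionaries standing in for the two samples left in the dataset; it does not call MongoDB or `fiftyone`. The results are sorted here only to make the output deterministic.

```python
from collections import Counter

# Stand-ins for the two samples remaining in the dataset
docs = [
    {"filepath": "/path/to/img1.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img3.jpg", "tags": ["test"]},
]

# $unwind: emit one record per (sample, tag) pair
unwound = [{**doc, "tags": tag} for doc in docs for tag in doc["tags"]]

# $group with {"$sum": 1}: count records per distinct tag
counts = Counter(record["tags"] for record in unwound)

results = [{"_id": tag, "count": n} for tag, n in sorted(counts.items())]
for d in results:
    print(d)
```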
