# Fifteen minutes to FiftyOne

### Two-sentence summary

`Dataset`s are composed of samples which contain fields, all of which can
be dynamically created, modified and deleted. `DatasetView`s allow one to
easily view and manipulate subsets of `Dataset`s.

### Core concepts

The fundamental FiftyOne object that a user interacts with is the
`DatasetView`. Users are constantly creating and chaining commands on views.
Any method on a `DatasetView` is also available on a `Dataset`. If appropriate,
the dataset creates a default view and calls the method on that view.

`Sample`s are the building blocks that `Dataset`s are composed of. Fields can
be added to `Sample`s dynamically. Fields can be of special types,
like the `fiftyone.core.labels.Labels` class, or they can be primitive serializable types
like dicts, lists, strings, scalars, etc.

## Creating a `Dataset`

Don't worry about the details of this code now. We are just creating a dataset so that we can explore it.
We will discuss the details of creating datasets soon.

In [1]:
import fiftyone as fo

dataset = fo.Dataset("fiftyone_in_fifteen")

samples = [
    fo.Sample(filepath="/path/to/img1.jpg", tags=["train"]),
    fo.Sample(
        filepath="/path/to/img2.jpg",
        tags=["train"],
        metadata=fo.Metadata(size_bytes=1024, mime_type=".jpg"),
    ),
    fo.Sample(filepath="/path/to/img3.jpg", tags=["test"])
]
dataset.add_samples(samples)

['5ec61397afa548973759e0a8',
 '5ec61397afa548973759e0a9',
 '5ec61397afa548973759e0aa']

## `Dataset` basics

To start, there are some basic things we can ask of any dataset.

In [2]:
len(dataset)

3

Every dataset has "sample fields", which are the fields accessible on any sample of the dataset. Some fields are present by default on every dataset.

In [3]:
dataset.get_sample_fields()

OrderedDict([('filepath', <mongoengine.fields.StringField at 0x115f23fd0>),
             ('tags', <mongoengine.fields.ListField at 0x116217860>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>),
             ('id', <mongoengine.base.fields.ObjectIdField at 0x11e235278>)])

Fields are available for Python primitives (`bool`, `int`, `str`, `list`, `dict`) as well as more semantically meaningful structures for samples such as metadata and labels.

We can filter by a particular field type, via the `ftype` argument, to see all fields of that type.

In [4]:
dataset.get_sample_fields(ftype=fo.Metadata)

OrderedDict([('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>)])

A useful helper is `summary()`, which succinctly summarizes basic information about a dataset.

In [5]:
print(dataset.summary())

Name:           fiftyone_in_fifteen
Num samples:    3
Tags:           ['test', 'train']
Sample fields:
    filepath: mongoengine.fields.StringField
    tags:     mongoengine.fields.ListField(field=mongoengine.fields.StringField)
    metadata: mongoengine.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
    id:       mongoengine.base.fields.ObjectIdField


##  `DatasetView` basics

The easiest way to access samples on a dataset is through a `DatasetView`.

The default view on a dataset is easily accessible via:

In [6]:
view = dataset.view()
view

<fiftyone.core.view.DatasetView at 0x10a210630>

Basic exploratory commands are also available on views:

In [7]:
len(view)

3

In [8]:
print(view.summary())

Dataset:        fiftyone_in_fifteen
Num samples:    3
Tags:           ['test', 'train']
Sample fields:
    filepath: mongoengine.fields.StringField
    tags:     mongoengine.fields.ListField(field=mongoengine.fields.StringField)
    metadata: mongoengine.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
    id:       mongoengine.base.fields.ObjectIdField
Pipeline stages:
    


Use `first()` to get a single sample from the view.

In [9]:
sample = view.first()
print(sample)

{
    "_id": {
        "$oid": "5ec61397afa548973759e0a8"
    },
    "filepath": "/path/to/img1.jpg",
    "tags": [
        "train"
    ],
    "metadata": null
}


Use `take()` to randomly sample the view.

In [10]:
for sample in view.take(2):
    print(sample)

{
    "_id": {
        "$oid": "5ec61397afa548973759e0a9"
    },
    "filepath": "/path/to/img2.jpg",
    "tags": [
        "train"
    ],
    "metadata": {
        "_cls": "Metadata",
        "size_bytes": 1024,
        "mime_type": ".jpg"
    }
}
{
    "_id": {
        "$oid": "5ec61397afa548973759e0aa"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}


### Sorting

Samples can be sorted by any field or subfield:

In [11]:
print("\nReverse sorting by filepath:")
for sample in view.sort_by("filepath", reverse=True):
    print(sample.filepath)

print("\nReverse sort by image size (samples with no metadata omitted):")
for sample in view.sort_by("metadata.size_bytes", reverse=True):
    if sample.metadata:
        print(sample.metadata.size_bytes)

print("\nSort by first tag:")
for sample in view.sort_by("tags[0]"):
    print(sample.tags)


Reverse sorting by filepath:
/path/to/img3.jpg
/path/to/img2.jpg
/path/to/img1.jpg

Reverse sort by image size (samples with no metadata omitted):
1024

Sort by first tag:
['train']
['train']
['test']
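
Note that the reverse sort by `metadata.size_bytes` still returns the samples without metadata; the loop above simply skips printing them. Assuming the sort is delegated to a MongoDB `$sort` stage (as the pipeline stages printed by `summary()` later in this walkthrough suggest), missing values order before numbers when ascending, and therefore last when reversed. A plain-Python sketch of that ordering, using dicts as stand-ins for samples (`size_key` is a hypothetical helper, not part of FiftyOne):

```python
docs = [
    {"filepath": "/path/to/img1.jpg", "metadata": None},
    {"filepath": "/path/to/img2.jpg", "metadata": {"size_bytes": 1024}},
    {"filepath": "/path/to/img3.jpg", "metadata": None},
]


def size_key(doc):
    """Sort key that orders missing sizes before any real size."""
    metadata = doc.get("metadata") or {}
    size = metadata.get("size_bytes")
    # (False, 0) for a missing size sorts before (True, size)
    return (size is not None, size or 0)


# Reverse sort: real sizes first (largest to smallest), missing sizes last
ordered = sorted(docs, key=size_key, reverse=True)
for doc in ordered:
    print(doc["filepath"], (doc["metadata"] or {}).get("size_bytes"))
```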


### Selection

Ranges of samples can be accessed using `skip()` and `limit()`:

In [12]:
len(view.skip(1).limit(2))

2

or using array slicing:

In [13]:
len(view[1:3])

2
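
Both forms select the same range: a slice `[start:stop]` corresponds to skipping `start` samples and keeping at most `stop - start`. A minimal plain-Python sketch of that correspondence, using a list as a stand-in for a view (`slice_to_skip_limit` is a hypothetical helper, not part of FiftyOne):

```python
def slice_to_skip_limit(start, stop):
    """Translate a non-negative slice [start:stop] into (skip, limit) counts."""
    if start < 0 or stop < 0:
        raise ValueError("this sketch only handles non-negative bounds")
    return start, max(stop - start, 0)


samples = ["img1", "img2", "img3"]  # stand-ins for the samples in a view
skip, limit = slice_to_skip_limit(1, 3)

# Skipping and then limiting selects the same items as the slice
by_skip_limit = samples[skip:][:limit]
by_slice = samples[1:3]
print(skip, limit, by_skip_limit == by_slice)
```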

Indexing with a plain integer is not supported; a slice must include a `:`.

In [14]:
try:
    view[0]
except KeyError as e:
    print(e)

'Accessing samples by numeric index is not supported. Use sample IDs or slices'


Access a sample in a view by its ID:

In [15]:
print("Loading sample '%s' from the view:" % sample.id)
print(view[sample.id])

Loading sample '5ec61397afa548973759e0aa' from the view:
{
    "_id": {
        "$oid": "5ec61397afa548973759e0aa"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}


### Querying

The core query function is `match()`, which uses [MongoDB query syntax](https://docs.mongodb.com/manual/tutorial/query-documents/#read-operations-query-argument).

In [16]:
for sample in view.match({"tags": "train"}):
    print(sample.tags)

['train']
['train']
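
The query `{"tags": "train"}` matches even though `tags` is a list: in MongoDB query syntax, an equality condition on a list field is satisfied when any element equals the value. A plain-Python sketch of that rule, using dicts as stand-ins for samples (the `matches` helper is hypothetical and handles only simple equality, not query operators such as `$gt`):

```python
def matches(doc, query):
    """Return True if `doc` satisfies every simple equality in `query`."""
    for field, wanted in query.items():
        value = doc.get(field)
        if isinstance(value, list):
            # A list field matches if any element equals the wanted value,
            # or if the whole list equals it
            if wanted not in value and value != wanted:
                return False
        elif value != wanted:
            return False
    return True


docs = [
    {"filepath": "/path/to/img1.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img2.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img3.jpg", "tags": ["test"]},
]

train_paths = [d["filepath"] for d in docs if matches(d, {"tags": "train"})]
print(train_paths)
```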


Convenience functions are provided for common queries.

We can `select()` only the samples matching a list of IDs, or `exclude()` those samples and keep the rest.

In [17]:
sample_ids = [str(sample.id)]
print("\nsample_ids: %s" % sample_ids)

print("\nselect:")
for sample in view.select(sample_ids):
    print(" - ", sample.id)

print("\nexclude:")
for sample in view.exclude(sample_ids):
    print(" - ", sample.id)


sample_ids: ['5ec61397afa548973759e0a9']

select:
 -  5ec61397afa548973759e0a9

exclude:
 -  5ec61397afa548973759e0a8
 -  5ec61397afa548973759e0aa
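
Conceptually, `select()` and `exclude()` are complementary membership filters on the sample ID. A plain-Python sketch using dicts as stand-ins for samples (the `select` and `exclude` functions here are hypothetical illustrations, not the library's implementation):

```python
def select(docs, ids):
    """Keep only the documents whose ID is in `ids`."""
    wanted = set(ids)
    return [d for d in docs if d["id"] in wanted]


def exclude(docs, ids):
    """Keep only the documents whose ID is NOT in `ids`."""
    unwanted = set(ids)
    return [d for d in docs if d["id"] not in unwanted]


docs = [
    {"id": "5ec61397afa548973759e0a8"},
    {"id": "5ec61397afa548973759e0a9"},
    {"id": "5ec61397afa548973759e0aa"},
]
sample_ids = ["5ec61397afa548973759e0a9"]

selected = select(docs, sample_ids)
excluded = exclude(docs, sample_ids)

# Together the two results partition the original collection
print(len(selected), len(excluded))
```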


Or check that a field exists and is not `None` with `exists()`:

In [18]:
for sample in view.exists("metadata"):
    print(sample.metadata)

{
    "_cls": "Metadata",
    "size_bytes": 1024,
    "mime_type": ".jpg"
}


### Chaining `DatasetView` operations

The above operations on views return `DatasetView` instances, so they can be chained in any order.

In [22]:
very_complex_view = (
    dataset.view()
    .match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")[10:20]
    .take(5)
)

print(very_complex_view.summary())

Dataset:        fiftyone_in_fifteen
Num samples:    0
Tags:           []
Sample fields:
    filepath: mongoengine.fields.StringField
    tags:     mongoengine.fields.ListField(field=mongoengine.fields.StringField)
    metadata: mongoengine.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)
    id:       mongoengine.base.fields.ObjectIdField
Pipeline stages:
    1. {'$match': {'tags': 'train'}}
    2. {'$match': {'file_hash': {'$exists': True, '$ne': None}}}
    3. {'$sort': {'filepath': 1}}
    4. {'$skip': 10}
    5. {'$limit': 10}
    6. {'$sample': {'size': 5}}
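
The pipeline stages listed above suggest how chaining works: each operation appends one or two stages and returns a new view. The following is a rough plain-Python sketch of that idea, not FiftyOne's actual implementation; `ToyView` and all of its methods are hypothetical stand-ins that only record stages and never touch a database.

```python
class ToyView:
    """Hypothetical stand-in for a view: an immutable tuple of pipeline stages."""

    def __init__(self, stages=()):
        self._stages = tuple(stages)

    def _add(self, stage):
        # Return a new view so the view being chained from is left unchanged
        return ToyView(self._stages + (stage,))

    def match(self, query):
        return self._add({"$match": query})

    def exists(self, field):
        return self._add({"$match": {field: {"$exists": True, "$ne": None}}})

    def sort_by(self, field, reverse=False):
        return self._add({"$sort": {field: -1 if reverse else 1}})

    def skip(self, n):
        return self._add({"$skip": n})

    def limit(self, n):
        return self._add({"$limit": n})

    def take(self, size):
        return self._add({"$sample": {"size": size}})

    @property
    def stages(self):
        return list(self._stages)


base = ToyView()
chained = (
    base.match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")
    .skip(10)
    .limit(10)
    .take(5)
)

for i, stage in enumerate(chained.stages, 1):
    print("%d. %s" % (i, stage))
```

Because every method returns a fresh object, `base` still has no stages after the chain is built, which is one way a view can stay read-only with respect to the view it was derived from.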


## Modifying datasets

### Adding and deleting single samples

Use `Dataset.add_sample` to add a single sample to a dataset:

In [23]:
sample = fo.Sample(filepath="new1.jpg")
sample_id = dataset.add_sample(sample)
sample_id

'5ec613c7afa548973759e0ab'

Delete a sample from a dataset via its ID:

In [24]:
del dataset[sample_id]

try:
    print("Attempting to access sample '%s'" % sample_id)
    sample = dataset[sample_id]
except Exception as e:
    print(e)

Attempting to access sample '5ec613c7afa548973759e0ab'
"No sample found with ID '5ec613c7afa548973759e0ab'"


Samples can **NOT** be added to a view (views are read-only):

In [25]:
try:
    view.add_sample(sample)
except Exception as e:
    print(e)

'DatasetView' object has no attribute 'add_sample'


### Batch addition/deletion of samples

Use `Dataset.add_samples` to add a batch of samples to a dataset:

In [26]:
sample_ids = dataset.add_samples(
    [
        fo.Sample(filepath="new_batch1.jpg"),
        fo.Sample(filepath="new_batch2.jpg"),
        fo.Sample(filepath="new_batch3.jpg"),
        fo.Sample(filepath="new_batch4.jpg"),
    ]
)
sample_ids

['5ec613cdafa548973759e0ac',
 '5ec613cdafa548973759e0ad',
 '5ec613cdafa548973759e0ae',
 '5ec613cdafa548973759e0af']

Deleting the samples in a view from a dataset is easy:

In [27]:
view = dataset.view().select(sample_ids)

print("Length before: %d" % len(dataset))
dataset.delete_samples(view)
print("Length after: %d" % len(dataset))

Length before: 7
Length after: 3


## Aggregation pipelines

Powerful custom aggregations are available via the [MongoDB aggregation API](https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/).


In [28]:
pipeline = [
    # deconstruct the `tags` array field of the samples to output a sample for each tag
    {"$unwind": "$tags"},
    # group by `tags` and count the number of instances for each
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
]

aggregation = dataset.view().aggregate(pipeline)
aggregation

<pymongo.command_cursor.CommandCursor at 0x11e2789e8>

In [29]:
for d in aggregation:
    # d is a dictionary whose structure depends on the aggregation pipeline
    print(d)

{'_id': 'test', 'count': 1}
{'_id': 'train', 'count': 2}
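
To make the two stages concrete, here is the same computation in plain Python, with dicts standing in for the three samples in the dataset. This is only an illustration of what `$unwind` followed by `$group` computes; it does not call MongoDB or FiftyOne.

```python
from collections import Counter

docs = [
    {"tags": ["train"]},
    {"tags": ["train"]},
    {"tags": ["test"]},
]

# $unwind: emit one entry per element of each document's `tags` list
unwound = [tag for doc in docs for tag in doc["tags"]]

# $group with {"$sum": 1}: count how many entries share each tag
counts = Counter(unwound)

# Sorted here only to make the printed order deterministic
results = [{"_id": tag, "count": n} for tag, n in sorted(counts.items())]
print(results)
```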


## Creating samples

### Default fields

Some fields are automatically available on all samples:

In [30]:
sample.id

'5ec613c7afa548973759e0ab'

In [31]:
sample.filepath

'new1.jpg'

In [32]:
sample.tags

[]

In [33]:
sample.metadata

### Dynamic fields

Fields can also be dynamically added to dataset samples.

We can check what fields exist on a dataset at any time via `dataset.get_sample_fields()`:

In [34]:
dataset.get_sample_fields()

OrderedDict([('filepath', <mongoengine.fields.StringField at 0x115f23fd0>),
             ('tags', <mongoengine.fields.ListField at 0x116217860>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>),
             ('id', <mongoengine.base.fields.ObjectIdField at 0x11e235278>)])

New fields **MUST** be added via item assignment syntax `sample[field] = value`:

In [35]:
sample["my_boolean"] = True

Fields can be accessed either via item or attribute getter syntax:

In [36]:
# equivalent
print(sample["my_boolean"])
print(sample.my_boolean)

True
True


All the usual built-in types are supported:

In [37]:
sample["my_int"] = 51
sample.my_int

51

In [38]:
sample["my_string"] = "fiftyone"
sample.my_string

'fiftyone'

In [39]:
sample["my_list"] = ["fifty", "one"]
sample.my_list

['fifty', 'one']

In [40]:
sample["my_dict"] = {"fifty": 50, "one": "uno"}
sample.my_dict

{'fifty': 50, 'one': 'uno'}

In [41]:
sample["my_label"] = fo.Classification(label="cow", confidence=0.98)
sample.my_label

<Classification: {
    "_cls": "Classification",
    "label": "cow",
    "confidence": 0.98,
    "logits": null
}>

Use the `Dataset.get_sample_fields()` method to retrieve the schema of the samples in the dataset:

In [42]:
dataset.get_sample_fields()

OrderedDict([('filepath', <mongoengine.fields.StringField at 0x115f23fd0>),
             ('tags', <mongoengine.fields.ListField at 0x116217860>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>),
             ('id', <mongoengine.base.fields.ObjectIdField at 0x11e235278>),
             ('my_boolean', <mongoengine.fields.BooleanField at 0x11e2784e0>),
             ('my_int', <mongoengine.fields.IntField at 0x11e2780f0>),
             ('my_string', <mongoengine.fields.StringField at 0x11e278080>),
             ('my_list', <mongoengine.fields.ListField at 0x11e278550>),
             ('my_dict', <mongoengine.fields.DictField at 0x11e278a90>),
             ('my_label',
              <mongoengine.fields.EmbeddedDocumentField at 0x11e278d68>)])

Setting a field to an inappropriate type raises a `ValidationError`.

In [43]:
try:
    sample.my_list = 15
except Exception as e:
    print(e)

Only lists and tuples may be used in a list field


Fields can be entirely deleted from datasets at any time via the `Dataset.delete_sample_field()` method:

In [44]:
dataset.delete_sample_field("my_list")

try:
    sample["my_list"]
except Exception as e:
    print(e)
    
dataset.get_sample_fields()

"Sample has no field 'my_list'"


OrderedDict([('filepath', <mongoengine.fields.StringField at 0x115f23fd0>),
             ('tags', <mongoengine.fields.ListField at 0x116217860>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>),
             ('id', <mongoengine.base.fields.ObjectIdField at 0x11e235278>),
             ('my_boolean', <mongoengine.fields.BooleanField at 0x11e2784e0>),
             ('my_int', <mongoengine.fields.IntField at 0x11e2780f0>),
             ('my_string', <mongoengine.fields.StringField at 0x11e278080>),
             ('my_dict', <mongoengine.fields.DictField at 0x11e278a90>),
             ('my_label',
              <mongoengine.fields.EmbeddedDocumentField at 0x11e278d68>)])

After a field has been deleted, it can be set again to a different field type:

In [45]:
# Add `my_list` back as an int
sample["my_list"] = 1

dataset.get_sample_fields()

OrderedDict([('filepath', <mongoengine.fields.StringField at 0x115f23fd0>),
             ('tags', <mongoengine.fields.ListField at 0x116217860>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x116217eb8>),
             ('id', <mongoengine.base.fields.ObjectIdField at 0x11e235278>),
             ('my_boolean', <mongoengine.fields.BooleanField at 0x11e2784e0>),
             ('my_int', <mongoengine.fields.IntField at 0x11e2780f0>),
             ('my_string', <mongoengine.fields.StringField at 0x11e278080>),
             ('my_dict', <mongoengine.fields.DictField at 0x11e278a90>),
             ('my_label',
              <mongoengine.fields.EmbeddedDocumentField at 0x11e278d68>),
             ('my_list', <mongoengine.fields.IntField at 0x11e278748>)])
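
The field lifecycle shown in this section (the first assignment fixes a field's type, a mismatched value is rejected, and deleting the field frees the name for a new type) can be sketched in plain Python. Everything below is a hypothetical toy: `ToySample` and `delete_field` are not FiftyOne classes or functions, a plain `TypeError` stands in for the library's validation error, and the type check is much cruder than real field validation.

```python
class ToySample:
    """Hypothetical sample whose field types are fixed by a shared schema."""

    def __init__(self, schema):
        self._schema = schema  # maps field name -> Python type
        self._values = {}

    def __setitem__(self, name, value):
        # The first assignment to a new field records its type in the schema
        expected = self._schema.setdefault(name, type(value))
        if not isinstance(value, expected):
            raise TypeError(
                "field %r expects %s, got %s"
                % (name, expected.__name__, type(value).__name__)
            )
        self._values[name] = value

    def __getitem__(self, name):
        return self._values[name]

    def drop(self, name):
        self._values.pop(name, None)


def delete_field(schema, samples, name):
    """Remove a field from the schema and from every sample."""
    schema.pop(name, None)
    for s in samples:
        s.drop(name)


schema = {}
toy = ToySample(schema)
toy["my_list"] = ["fifty", "one"]  # schema now records `list`

try:
    toy["my_list"] = 15  # wrong type for an existing field
    rejected = False
except TypeError as e:
    rejected = True
    print(e)

delete_field(schema, [toy], "my_list")
toy["my_list"] = 1  # the name can now be reused with a different type
print(schema)
```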