# Fifteen minutes to `fiftyone`

### Two-sentence summary:
> "`Dataset`s are composed of samples which contain fields, all of which can
be dynamically created, modified and deleted. `DatasetView`s allow one to
easily view and manipulate subsets of `Dataset`s."

### Core Concepts

The fundamental `fiftyone` object that a user interacts with is the
`DatasetView`. Users are constantly creating and chaining commands on views.
Any method on a `DatasetView` is also available on a `Dataset`. If appropriate,
the dataset creates a default view and calls the method on that view.
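
The delegation described above can be pictured with a small stand-in. The sketch below is plain Python with hypothetical `ToyDataset` and `ToyView` classes; it is not `fiftyone`'s actual implementation, only one way a dataset might forward view methods to a default view.

```python
class ToyView:
    """Hypothetical stand-in for a view over a list of samples."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def first(self):
        return self._samples[0]

    def limit(self, n):
        # Each operation returns a new view rather than mutating this one
        return ToyView(self._samples[:n])


class ToyDataset:
    """Hypothetical stand-in for a dataset that delegates to a default view."""

    def __init__(self, samples):
        self._samples = list(samples)

    def view(self):
        return ToyView(self._samples)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so methods defined
        # on the dataset itself take precedence over view methods
        return getattr(self.view(), name)


toy_dataset = ToyDataset(["img1.jpg", "img2.jpg", "img3.jpg"])
first_sample = toy_dataset.first()       # forwarded to ToyView.first()
limited_len = len(toy_dataset.limit(2))  # forwarded to ToyView.limit()
print(first_sample, limited_len)
```

Forwarding through `__getattr__` keeps the dataset class small, at the cost of making the available methods less discoverable than explicit wrappers would.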

Samples are the building blocks that `Dataset`s are composed of. Fields can
be added to samples dynamically. They can be of special types,
like the `fiftyone.core.labels.Labels` class, or they can be primitive serializable types
like dicts, lists, strings, scalars, etc.

## Setup

Don't worry about the details of this code now. We are just creating a dataset so that we can explore it.
We will discuss the details of creating datasets soon.

In [1]:
import fiftyone as fo

dataset = fo.Dataset("fiftyone_in_fifteen")
sample_id = dataset.add_sample(filepath="/path/to/img1.jpg", tags=["train"])
dataset.add_sample(
    filepath="/path/to/img2.jpg",
    tags=["train"],
    metadata=fo.Metadata(size_bytes=1024, mime_type=".jpg"),
)
dataset.add_sample(filepath="/path/to/img3.jpg", tags=["test"])
sample = dataset[sample_id]

## Exploring a Dataset

To start, there are a few basic things we can ask of any dataset.

In [2]:
len(dataset)

3

Every dataset has "sample fields", which are the fields accessible on any sample of the dataset. Some fields are present by default on every dataset.

In [3]:
dataset.get_sample_fields()

OrderedDict([('id', <mongoengine.base.fields.ObjectIdField at 0x10e84ffd0>),
             ('filepath', <mongoengine.fields.StringField at 0x11a8e64e0>),
             ('tags', <mongoengine.fields.ListField at 0x11a8e6588>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x11a8e6780>)])

Fields are available for Python primitives (`bool`, `int`, `str`, `list`, `dict`) as well as more semantically meaningful structures for samples such as metadata and labels.

We can filter by a particular `field_type` to see all fields of that type.

In [4]:
dataset.get_sample_fields(field_type=fo.Metadata)

OrderedDict([('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x11a8e6780>)])

A useful helper is `summary()`, which succinctly summarizes basic information about a dataset.

In [5]:
print(dataset.summary())

Name:           fiftyone_in_fifteen
Num samples:    3
Tags:           ['test', 'train']
Sample Fields:
	id      : <class 'mongoengine.base.fields.ObjectIdField'>
	filepath: <class 'mongoengine.fields.StringField'>
	tags    : <class 'mongoengine.fields.ListField'>
	metadata: <class 'mongoengine.fields.EmbeddedDocumentField'>


## Basics with `DatasetView`s

The easiest way to access samples on a dataset is through a `DatasetView`.

The default view on a dataset is easily accessible via:

In [6]:
view = dataset.view()
view

<fiftyone.core.view.DatasetView at 0x122910e48>

Basic exploratory commands are also available on views.

In [7]:
len(view)

3

In [8]:
print(view.summary())

Dataset:        fiftyone_in_fifteen
Num samples:    3
Tags:           ['train', 'test']
Sample Fields:
	id      : <class 'mongoengine.base.fields.ObjectIdField'>
	filepath: <class 'mongoengine.fields.StringField'>
	tags    : <class 'mongoengine.fields.ListField'>
	metadata: <class 'mongoengine.fields.EmbeddedDocumentField'>
Pipeline stages:
	


Use `first()` to get a single sample from the view.

In [9]:
sample = view.first()
sample

<fiftyone_in_fifteen: {
    "_id": {
        "$oid": "5ebdb6ccf02eeb2e38a465f2"
    },
    "filepath": "/path/to/img1.jpg",
    "tags": [
        "train"
    ],
    "metadata": null
}>

Use `take()` to randomly sample the view.

In [10]:
for sample in view.take(2):
    print(sample)

{
    "_id": {
        "$oid": "5ebdb6ccf02eeb2e38a465f2"
    },
    "filepath": "/path/to/img1.jpg",
    "tags": [
        "train"
    ],
    "metadata": null
}
{
    "_id": {
        "$oid": "5ebdb6ccf02eeb2e38a465f4"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}


### Sorting

Samples can be sorted by any field or subfield.

In [11]:
for sample in view.sort_by("filepath", reverse=True):
    print(sample.filepath)
print()

for sample in view.sort_by("metadata.size_bytes", reverse=True):
    if sample.metadata:
        print(sample.metadata.size_bytes)
print()
    
for sample in view.sort_by("tags[0]"):
    print(sample.tags)

/path/to/img3.jpg
/path/to/img2.jpg
/path/to/img1.jpg

1024

['train']
['train']
['test']
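
A dotted path such as `"metadata.size_bytes"` names a subfield. As a rough illustration only, the plain-Python sketch below resolves dotted paths on dictionaries and sorts by them. The `resolve` and `sort_by` helpers are hypothetical and say nothing about how `fiftyone` or MongoDB implement sorting; placing samples without a value at the end is a choice made for this sketch.

```python
def resolve(doc, path):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def sort_by(docs, path, reverse=False):
    """Sort docs by the value at `path`, keeping docs without a value at the end."""
    present = [d for d in docs if resolve(d, path) is not None]
    missing = [d for d in docs if resolve(d, path) is None]
    present.sort(key=lambda d: resolve(d, path), reverse=reverse)
    return present + missing


docs = [
    {"filepath": "/path/to/img1.jpg", "metadata": None},
    {"filepath": "/path/to/img2.jpg", "metadata": {"size_bytes": 1024}},
    {"filepath": "/path/to/img3.jpg", "metadata": {"size_bytes": 2048}},
]
by_size = [d["filepath"] for d in sort_by(docs, "metadata.size_bytes", reverse=True)]
print(by_size)
```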


### Selection

#### Slicing

Ranges of samples can be accessed using `skip()` and `limit()`

In [12]:
len(view.skip(1).limit(2))

2

or, equivalently, using array slicing.

In [13]:
len(view[1:3])

2
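
The equivalence between the two forms is simple arithmetic: a slice `[start:stop]` corresponds to skipping `start` samples and then limiting to `stop - start`. The sketch below checks this on an ordinary Python list; `slice_to_skip_limit` is a hypothetical helper written for this example and handles only simple slices with non-negative bounds and no step.

```python
samples = ["img1.jpg", "img2.jpg", "img3.jpg"]


def slice_to_skip_limit(s):
    """Translate a simple slice (non-negative bounds, no step) into (skip, limit)."""
    if s.step not in (None, 1):
        raise ValueError("step is not supported in this sketch")
    skip = s.start or 0
    limit = None if s.stop is None else max(s.stop - skip, 0)
    return skip, limit


skip, limit = slice_to_skip_limit(slice(1, 3))
via_skip_limit = samples[skip:][:limit]  # skip first, then limit
via_slice = samples[1:3]
print(skip, limit, via_skip_limit == via_slice)
```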

#### Key Indexing

Views are keyed the same way as datasets: by sample ID.

In [14]:
sample_id = sample.id
print("sample_id: '%s'" % sample_id)

view[sample_id]

sample_id: '5ebdb6ccf02eeb2e38a465f4'


<fiftyone_in_fifteen: {
    "_id": {
        "$oid": "5ebdb6ccf02eeb2e38a465f4"
    },
    "filepath": "/path/to/img3.jpg",
    "tags": [
        "test"
    ],
    "metadata": null
}>

Positional access only works as a slice, i.e. when a `:` is provided.

In [15]:
try:
    view[0]
except Exception as e:
    print("%s: %s" % (type(e), e))

<class 'KeyError'>: 'Accessing samples by numeric index is not supported. Use sample IDs or slices'
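
One way to picture this behavior is a `__getitem__` that dispatches on the type of the key. The `KeyedView` class below is a hypothetical plain-Python sketch, not the library's code: string keys look up a sample by ID, slices return a smaller view, and bare integers are rejected.

```python
class KeyedView:
    """Hypothetical view keyed by sample ID that also accepts slices."""

    def __init__(self, samples_by_id):
        # dicts preserve insertion order, which gives slices a stable meaning
        self._samples = dict(samples_by_id)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, key):
        if isinstance(key, slice):
            ids = list(self._samples)[key]
            return KeyedView({i: self._samples[i] for i in ids})
        if isinstance(key, int):
            raise KeyError("numeric index is not supported; use sample IDs or slices")
        return self._samples[key]


keyed_view = KeyedView({"a1": "img1.jpg", "a2": "img2.jpg", "a3": "img3.jpg"})
by_id = keyed_view["a2"]
sliced_len = len(keyed_view[1:3])
try:
    keyed_view[0]
    int_error = None
except KeyError as e:
    int_error = type(e).__name__
print(by_id, sliced_len, int_error)
```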


### Querying

The core query function is `match()`, which uses [MongoDB query syntax](https://docs.mongodb.com/manual/tutorial/query-documents/#read-operations-query-argument).

In [16]:
for sample in view.match({"tags": "train"}):
    print(sample.tags)

['train']
['train']


Convenience functions are provided for common queries.

We can `select()` only the samples matching a list of IDs, or `exclude()` them.

In [17]:
sample_ids = [str(sample.id)]
print("sample_ids: %s" % sample_ids)
print()

print("select:")
for sample in view.select(sample_ids):
    print(" - ", sample.id)
print()

print("exclude:")
for sample in view.exclude(sample_ids):
    print(" - ", sample.id)

sample_ids: ['5ebdb6ccf02eeb2e38a465f3']

select:
 -  5ebdb6ccf02eeb2e38a465f3

exclude:
 -  5ebdb6ccf02eeb2e38a465f2
 -  5ebdb6ccf02eeb2e38a465f4


Or check that a field exists and is not `None` with `exists()`.

In [18]:
for sample in view.exists("metadata"):
    print(sample.metadata)

{
    "_cls": "Metadata",
    "size_bytes": 1024,
    "mime_type": ".jpg"
}
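
The pipeline summary later in this walkthrough shows `exists()` as a `$match` stage of the form `{'$exists': True, '$ne': None}`. The sketch below expresses the same condition in plain Python on dictionaries; the `exists` helper is hypothetical and meant only to make the "present and not `None`" rule concrete.

```python
def exists(doc, field):
    """True when `field` is present on `doc` and its value is not None."""
    return doc.get(field) is not None


docs = [
    {"filepath": "/path/to/img1.jpg", "metadata": None},
    {"filepath": "/path/to/img2.jpg", "metadata": {"size_bytes": 1024}},
    {"filepath": "/path/to/img3.jpg"},  # field missing entirely
]
with_metadata = [d["filepath"] for d in docs if exists(d, "metadata")]
print(with_metadata)
```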


### Chaining `DatasetView` Operations

The above operations on views return `DatasetView` instances, so they can be chained in any order.

In [19]:
very_complex_view = (
    dataset.view()
    .match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")[10:20]
    .take(5)
)
very_complex_view

<fiftyone.core.view.DatasetView at 0x122945828>

In [20]:
print(very_complex_view.summary())

Dataset:        fiftyone_in_fifteen
Num samples:    0
Tags:           []
Sample Fields:
	id      : <class 'mongoengine.base.fields.ObjectIdField'>
	filepath: <class 'mongoengine.fields.StringField'>
	tags    : <class 'mongoengine.fields.ListField'>
	metadata: <class 'mongoengine.fields.EmbeddedDocumentField'>
Pipeline stages:
	1. {'$match': {'tags': 'train'}}
	2. {'$match': {'file_hash': {'$exists': True, '$ne': None}}}
	3. {'$sort': {'filepath': 1}}
	4. {'$skip': 10}
	5. {'$limit': 10}
	6. {'$sample': {'size': 5}}


## Modifying `Dataset`s (Inserting & Deleting Samples)

At the moment, the `kwargs` used to instantiate a sample must be passed to the dataset, which instantiates the sample internally. This is weird, I know. It's a TODO to make it less weird.

### Single add / delete

In [21]:
sample_id = dataset.add_sample(filepath="new1.jpg")
sample_id

'5ebdb6cdf02eeb2e38a465f5'

A single sample can be deleted using the sample's ID.

In [22]:
del dataset[sample_id]

try:
    print("Attempting to access sample: '%s'" % sample_id)
    sample = dataset[sample_id]
except Exception as e:
    print("%s: %s" % (type(e), e))

Attempting to access sample: '5ebdb6cdf02eeb2e38a465f5'
<class 'ValueError'>: No sample found with ID '5ebdb6cdf02eeb2e38a465f5'


Samples can **NOT** be added to a view.

In [23]:
try:
    view.add_sample(filepath="new1.jpg")
except Exception as e:
    print("%s: %s" % (type(e), e))

<class 'AttributeError'>: 'DatasetView' object has no attribute 'add_sample'


### Batch add / delete

To add a batch of samples, pass a list of `kwargs` `dict`s, one per sample to add.

In [24]:
sample_ids = dataset.add_samples(
    [
        {"filepath": "new_batch1.jpg"},
        {"filepath": "new_batch2.jpg"},
        {"filepath": "new_batch3.jpg"},
        {"filepath": "new_batch4.jpg"},
    ]
)
sample_ids

['5ebdb6cdf02eeb2e38a465f6',
 '5ebdb6cdf02eeb2e38a465f7',
 '5ebdb6cdf02eeb2e38a465f8',
 '5ebdb6cdf02eeb2e38a465f9']

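All samples in a view can be deleted from the dataset. The `delete_samples()` call below is still commented out pending a merge, so the length does not change.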

In [25]:
view = dataset.view().select(sample_ids)

print("Length before: %d" % len(dataset))
# @todo(Tyler) merge Brian's work
# dataset.delete_samples(view)
print("Length after: %d" % len(dataset))

Length before: 4
Length after: 4


## Operations (Aggregations)

Powerful custom aggregations are available via the [MongoDB aggregation API](https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/).


In [26]:
pipeline = [
    # deconstruct the `tags` array field of the samples to output a sample for each tag
    {"$unwind": "$tags"},
    # group by `tags` and count the number of instances for each
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
]

aggregation = dataset.view().aggregate(pipeline)
aggregation

<pymongo.command_cursor.CommandCursor at 0x122960ef0>

In [27]:
for d in aggregation:
    # d is a dictionary whose structure depends on the aggregation pipeline
    print(d)
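
To make the two stages concrete without a database, the sketch below reproduces them in plain Python on a handful of dictionaries. It is only an emulation under the assumption of MongoDB's default behavior, where `$unwind` emits one record per array element (so documents with an empty array drop out) and `$group` with `{"$sum": 1}` counts records per key. The sample documents are made up for the example.

```python
from collections import Counter

docs = [
    {"filepath": "/path/to/img1.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img2.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img3.jpg", "tags": ["test"]},
    {"filepath": "new1.jpg", "tags": []},  # contributes nothing after unwinding
]

# Emulates {"$unwind": "$tags"}: one record per (document, tag) pair
unwound = [dict(doc, tags=tag) for doc in docs for tag in doc["tags"]]

# Emulates {"$group": {"_id": "$tags", "count": {"$sum": 1}}}
counts = Counter(record["tags"] for record in unwound)
results = [{"_id": tag, "count": n} for tag, n in counts.items()]

for d in results:
    print(d)
```

Each result has the `_id` and `count` keys named in the `$group` stage. MongoDB does not guarantee the order of grouped results, so code that consumes them should not rely on it.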

## Fields of Samples

### Default Fields

Some fields are automatically available on all samples.

In [28]:
sample.id

ObjectId('5ebdb6ccf02eeb2e38a465f3')

In [29]:
sample.filepath

'/path/to/img2.jpg'

In [30]:
sample.tags

['train']

In [31]:
sample.metadata

<Metadata: {
    "_cls": "Metadata",
    "size_bytes": 1024,
    "mime_type": ".jpg"
}>

### Dynamically adding fields

Fields can also be dynamically added to a dataset. We can check what fields exist on a dataset at any time via `dataset.get_sample_fields()`.

In [32]:
dataset.get_sample_fields()

OrderedDict([('id', <mongoengine.base.fields.ObjectIdField at 0x10e84ffd0>),
             ('filepath', <mongoengine.fields.StringField at 0x11a8e64e0>),
             ('tags', <mongoengine.fields.ListField at 0x11a8e6588>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x11a8e6780>)])

Fields MUST be assigned via index assignment, but they may be accessed via indexing or object attributes.

In [33]:
sample["my_boolean"] = True

# equivalent:
print(sample["my_boolean"])
print(sample.my_boolean)

True
True


In [34]:
sample["my_int"] = 51
sample.my_int

51

In [35]:
sample["my_string"] = "fiftyone"
sample.my_string

'fiftyone'

In [36]:
sample["my_list"] = ["fifty", "one"]
sample.my_list

['fifty', 'one']

In [37]:
sample["my_dict"] = {"fifty": 50, "one": "uno"}
sample.my_dict

{'fifty': 50, 'one': 'uno'}

In [38]:
sample["my_label"] = fo.Classification(label="cow", confidence=0.98)
sample.my_label

<Classification: {
    "label": "cow",
    "confidence": 0.98,
    "logits": null
}>

The `OrderedDict` returned by `get_sample_fields()` tracks the order in which fields were added to the dataset.

In [39]:
dataset.get_sample_fields()

OrderedDict([('id', <mongoengine.base.fields.ObjectIdField at 0x10e84ffd0>),
             ('filepath', <mongoengine.fields.StringField at 0x11a8e64e0>),
             ('tags', <mongoengine.fields.ListField at 0x11a8e6588>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x11a8e6780>),
             ('my_boolean', <mongoengine.fields.BooleanField at 0x12295b940>),
             ('my_int', <mongoengine.fields.IntField at 0x12295be48>),
             ('my_string', <mongoengine.fields.StringField at 0x122960208>),
             ('my_list', <mongoengine.fields.ListField at 0x122960240>),
             ('my_dict', <mongoengine.fields.DictField at 0x122960908>),
             ('my_label',
              <mongoengine.fields.EmbeddedDocumentField at 0x122960a20>)])

Setting a field to an inappropriate type raises a `ValidationError`.

In [40]:
try:
    sample.my_list = 15
except Exception as e:
    print("%s: %s" % (type(e), e))

<class 'mongoengine.errors.ValidationError'>: Only lists and tuples may be used in a list field
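
The idea that a field's type is fixed for the whole dataset once it has been added can be sketched without `mongoengine`. The `ToySchema` class below is hypothetical: it records the type seen on first assignment, keeps fields in insertion order, and raises a plain `TypeError` (rather than a `ValidationError`) when a later value does not fit.

```python
class ToySchema:
    """Hypothetical dataset-level schema: a field's type is fixed on first assignment."""

    def __init__(self):
        self.field_types = {}  # dicts keep insertion order

    def set_field(self, sample, name, value):
        # The first assignment registers the type; later ones are checked against it
        expected = self.field_types.setdefault(name, type(value))
        if not isinstance(value, expected):
            raise TypeError(
                "field %r expects %s, got %s"
                % (name, expected.__name__, type(value).__name__)
            )
        sample[name] = value


schema = ToySchema()
sample_a, sample_b = {}, {}
schema.set_field(sample_a, "my_int", 51)
schema.set_field(sample_a, "my_list", ["fifty", "one"])

try:
    # A different sample, but the same dataset-level field
    schema.set_field(sample_b, "my_list", 15)
    error_message = None
except TypeError as e:
    error_message = str(e)

print(list(schema.field_types))
print(error_message)
```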


However, a field can be entirely deleted from a dataset, after which it can be set again to a different field type. (`delete_field` is not implemented yet, so the calls below are commented out.)

In [41]:
# @todo(Tyler) implement `delete_field`!
# dataset.delete_field("my_list")

# sample["my_list"] = 15

dataset.get_sample_fields()

OrderedDict([('id', <mongoengine.base.fields.ObjectIdField at 0x10e84ffd0>),
             ('filepath', <mongoengine.fields.StringField at 0x11a8e64e0>),
             ('tags', <mongoengine.fields.ListField at 0x11a8e6588>),
             ('metadata',
              <mongoengine.fields.EmbeddedDocumentField at 0x11a8e6780>),
             ('my_boolean', <mongoengine.fields.BooleanField at 0x12295b940>),
             ('my_int', <mongoengine.fields.IntField at 0x12295be48>),
             ('my_string', <mongoengine.fields.StringField at 0x122960208>),
             ('my_list', <mongoengine.fields.ListField at 0x122960240>),
             ('my_dict', <mongoengine.fields.DictField at 0x122960908>),
             ('my_label',
              <mongoengine.fields.EmbeddedDocumentField at 0x122960a20>)])