# Fifteen minutes to `fiftyone`

### Two-sentence summary
> "`Dataset`s are composed of samples which contain fields, all of which can
be dynamically created, modified and deleted. `DatasetView`s allow one to
easily view and manipulate subsets of `Dataset`s."

### Core Concepts

The fundamental `fiftyone` object that a user interacts with is the
`DatasetView`. Users are constantly creating and chaining commands on views.
Any method on a `DatasetView` is also available on a `Dataset`. If appropriate,
the dataset creates a default view and calls the method on that view.
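
The forwarding behavior described above can be pictured with a small pure-Python sketch. `ToyView` and `ToyDataset` are hypothetical stand-ins invented for this illustration; they are not `fiftyone` classes, and the library's actual mechanism may differ.

```python
class ToyView:
    """Hypothetical stand-in for a view: an ordered list of samples."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def first(self):
        return self._samples[0]

    def limit(self, n):
        # Every view operation returns a new view
        return ToyView(self._samples[:n])


class ToyDataset:
    """Hypothetical stand-in for a dataset that forwards view methods."""

    def __init__(self, samples):
        self._samples = list(samples)

    def view(self):
        # The default view covers every sample in the dataset
        return ToyView(self._samples)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so methods the
        # dataset lacks fall through to a freshly created default view
        return getattr(self.view(), name)


toy_dataset = ToyDataset(["img1.jpg", "img2.jpg", "img3.jpg"])
print(toy_dataset.first())
print(len(toy_dataset.limit(2)))
```

Here `first()` and `limit()` are never defined on `ToyDataset`, yet both calls succeed because they are resolved on the default view.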

Samples are the building blocks of `Dataset`s. Fields can be added to
samples dynamically. A field can hold a special type, such as the
`fiftyone.core.labels.Labels` class, or a primitive serializable type
such as a dict, list, string, or scalar.

## Setup

Don't worry about the details of this code now. We are just creating a dataset so that we can explore it.
We will discuss the details of creating datasets soon.

In [None]:
import fiftyone as fo

dataset = fo.Dataset("fiftyone_in_fifteen")
sample_id = dataset.add_sample(filepath="/path/to/img1.jpg", tags=["train"])
dataset.add_sample(
    filepath="/path/to/img2.jpg",
    tags=["train"],
    metadata=fo.Metadata(size_bytes=1024, mime_type="image/jpeg"),
)
dataset.add_sample(filepath="/path/to/img3.jpg", tags=["test"])
sample = dataset[sample_id]

## Exploring a Dataset

To start, there are a few basic things we can ask of any dataset.

In [None]:
len(dataset)

Every dataset has "sample fields", which are the fields accessible on any sample of the dataset. Some fields exist by default on every dataset.

In [None]:
dataset.get_sample_fields()

Fields are available for Python primitives (`bool`, `int`, `str`, `list`, `dict`) as well as for more semantically meaningful structures such as metadata and labels.

We can filter by a particular `field_type` to see all fields of that type.

In [None]:
dataset.get_sample_fields(field_type=fo.Metadata)

A useful helper is `summary()`, which returns a short description of basic information about a dataset.

In [None]:
print(dataset.summary())

## Basics with `DatasetView`s

The easiest way to access samples on a dataset is through a `DatasetView`.

The default view on a dataset is easily accessible via:

In [None]:
view = dataset.view()
view

Basic exploratory commands are also available on views.

In [None]:
len(view)

In [None]:
print(view.summary())

Use `first()` to get a single sample from the view.

In [None]:
sample = view.first()
sample

Use `take()` to randomly sample the view.

In [None]:
for sample in view.take(2):
    print(sample)

### Sorting

Samples can be sorted by any field or subfield.

In [None]:
for sample in view.sort_by("filepath", reverse=True):
    print(sample.filepath)
print()

for sample in view.sort_by("metadata.size_bytes", reverse=True):
    if sample.metadata:
        print(sample.metadata.size_bytes)
print()

for sample in view.sort_by("tags[0]"):
    print(sample.tags)

### Selection

#### Slicing

Ranges of samples can be accessed using `skip()` and `limit()`:

In [None]:
len(view.skip(1).limit(2))

or, equivalently, using array slicing:

In [None]:
len(view[1:3])
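
The correspondence between the two forms is that a slice `[start:stop]` skips `start` samples and then keeps at most `stop - start` of them. The sketch below checks this on a plain Python list. `skip_limit` is a hypothetical helper written for this illustration and is not part of `fiftyone`.

```python
from itertools import islice


def skip_limit(items, skip, limit):
    """Drop the first `skip` items, then keep at most `limit` items."""
    return list(islice(items, skip, skip + limit))


filepaths = ["/path/to/img1.jpg", "/path/to/img2.jpg", "/path/to/img3.jpg"]

# skip(1).limit(2) selects the same range as the slice [1:3]
print(skip_limit(filepaths, 1, 2) == filepaths[1:3])
```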

#### Key Indexing

Views are keyed the same way as datasets: by sample ID.

In [None]:
sample_id = sample.id
print("sample_id: '%s'" % sample_id)

view[sample_id]

Positional access only works as a slice, i.e. when a `:` is provided. A bare integer index raises an error.

In [None]:
try:
    view[0]
except Exception as e:
    print("%s: %s" % (type(e), e))

### Querying

The core query function is `match()`, which uses [MongoDB query syntax](https://docs.mongodb.com/manual/tutorial/query-documents/#read-operations-query-argument).

In [None]:
for sample in view.match({"tags": "train"}):
    print(sample.tags)
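
Note that `tags` is a list, yet the query compares it to a single string. In MongoDB, an equality condition on an array field matches when any element of the array equals the value. The sketch below imitates that rule for flat equality queries on plain dicts. `matches` is a hypothetical helper for illustration only: it handles neither query operators nor dotted field paths, and the documents are made-up stand-ins for the three samples created in the setup.

```python
def matches(document, query):
    """Evaluate a flat equality query such as {"tags": "train"}."""
    for field, expected in query.items():
        actual = document.get(field)
        if isinstance(actual, list) and not isinstance(expected, list):
            # Array field vs. scalar: match if any element equals the scalar
            if expected not in actual:
                return False
        elif actual != expected:
            return False
    return True


documents = [
    {"filepath": "/path/to/img1.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img2.jpg", "tags": ["train"]},
    {"filepath": "/path/to/img3.jpg", "tags": ["test"]},
]

matched = [d["filepath"] for d in documents if matches(d, {"tags": "train"})]
print(matched)
```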

Convenience functions are provided for common queries.

We can `select()` only the samples matching a list of IDs, or `exclude()` those samples and keep the rest.

In [None]:
sample_ids = [str(sample.id)]
print("sample_ids: %s" % sample_ids)
print()

print("select:")
for sample in view.select(sample_ids):
    print(" - ", sample.id)
print()

print("exclude:")
for sample in view.exclude(sample_ids):
    print(" - ", sample.id)

Or check that a field exists and is not `None` with `exists()`.

In [None]:
for sample in view.exists("metadata"):
    print(sample.metadata)

### Chaining `DatasetView` Operations

The above operations on views return `DatasetView` instances, so they can be chained in any order.

In [None]:
very_complex_view = (
    dataset.view()
    .match({"tags": "train"})
    .exists("file_hash")
    .sort_by("filepath")[10:20]
    .take(5)
)
very_complex_view

In [None]:
print(very_complex_view.summary())
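
Operations can be chained in any order, but the order affects the result. The sketch below is one way to picture this, assuming a view simply records its stages and applies them in sequence. `PipelineView` is a hypothetical toy class for illustration and says nothing definitive about how `DatasetView` is implemented.

```python
class PipelineView:
    """Hypothetical view that records stages and applies them in order."""

    def __init__(self, samples, stages=()):
        self._samples = samples
        self._stages = tuple(stages)

    def _add_stage(self, stage):
        # Return a new view; the original view is left untouched
        return PipelineView(self._samples, self._stages + (stage,))

    def sort_by(self, key, reverse=False):
        return self._add_stage(
            lambda docs: sorted(docs, key=lambda d: d[key], reverse=reverse)
        )

    def limit(self, n):
        return self._add_stage(lambda docs: docs[:n])

    def __iter__(self):
        # Stages run lazily, only when the view is iterated
        docs = list(self._samples)
        for stage in self._stages:
            docs = stage(docs)
        return iter(docs)


docs = [{"filepath": "b.jpg"}, {"filepath": "c.jpg"}, {"filepath": "a.jpg"}]
base = PipelineView(docs)

sorted_then_limited = [d["filepath"] for d in base.sort_by("filepath").limit(2)]
limited_then_sorted = [d["filepath"] for d in base.limit(2).sort_by("filepath")]
print(sorted_then_limited)
print(limited_then_sorted)
```

Sorting before limiting keeps the two smallest file paths overall, while limiting first sorts only the first two samples.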

## Modifying `Dataset`s (Inserting & Deleting Samples)

At the moment, the `kwargs` used to instantiate a sample must be passed to the dataset, which instantiates the sample internally. This is admittedly weird, and making it less weird is a TODO.

### Single add / delete

In [None]:
sample_id = dataset.add_sample(filepath="new1.jpg")
sample_id

A single sample can be deleted using the sample's ID.

In [None]:
del dataset[sample_id]

try:
    print("Attempting to access sample: '%s'" % sample_id)
    sample = dataset[sample_id]
except Exception as e:
    print("%s: %s" % (type(e), e))

Samples can **NOT** be added to a view.

In [None]:
try:
    view.add_sample(filepath="new1.jpg")
except Exception as e:
    print("%s: %s" % (type(e), e))

### Batch add / delete

To add a batch of samples, pass a list of `kwargs` `dict`s, one per sample to add.

In [None]:
sample_ids = dataset.add_samples(
    [
        {"filepath": "new_batch1.jpg"},
        {"filepath": "new_batch2.jpg"},
        {"filepath": "new_batch3.jpg"},
        {"filepath": "new_batch4.jpg"},
    ]
)
sample_ids

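All samples in a view can be deleted from the dataset. The delete call is commented out below pending a merge, so for now both printed lengths are the same.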

In [None]:
view = dataset.view().select(sample_ids)

print("Length before: %d" % len(dataset))
# @todo(Tyler) merge Brian's work
# dataset.delete_samples(view)
print("Length after: %d" % len(dataset))

## Operations (Aggregations)

Powerful custom aggregations are available via the [MongoDB aggregation API](https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/).


In [None]:
pipeline = [
    # deconstruct the `tags` array field of the samples to output a sample for each tag
    {"$unwind": "$tags"},
    # group by `tags` and count the number of instances for each
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
]

aggregation = dataset.view().aggregate(pipeline)
aggregation

In [None]:
for d in aggregation:
    # d is a dictionary whose structure depends on the aggregation pipeline
    print(d)
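
For intuition, the `$unwind` and `$group` stages above amount to counting every tag occurrence across all samples. The pure-Python sketch below reproduces that on made-up documents (plain dicts, not `fiftyone` samples). The ordering of its output is an artifact of Python dicts, since MongoDB does not guarantee the order of `$group` results.

```python
from collections import Counter

documents = [
    {"tags": ["train"]},
    {"tags": ["train"]},
    {"tags": ["test"]},
    {"tags": []},  # $unwind emits nothing for an empty array
]

# $unwind: one entry per (document, tag) pair; $group with $sum: count per tag
tag_counts = Counter(tag for doc in documents for tag in doc.get("tags", []))

# Shape the counts like the documents produced by the $group stage
results = [{"_id": tag, "count": n} for tag, n in tag_counts.items()]
print(results)
```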

## Fields of Samples

### Default Fields

Some fields are automatically available on all samples.

In [None]:
sample.id

In [None]:
sample.filepath

In [None]:
sample.tags

In [None]:
sample.metadata

### Dynamically adding fields

Fields can also be dynamically added to a dataset. We can check what fields exist on a dataset at any time via `dataset.get_sample_fields()`.

In [None]:
dataset.get_sample_fields()

Fields MUST be assigned via index assignment, but they may be accessed via indexing or object attributes.

In [None]:
sample["my_boolean"] = True

# equivalent:
print(sample["my_boolean"])
print(sample.my_boolean)

In [None]:
sample["my_int"] = 51
sample.my_int

In [None]:
sample["my_string"] = "fiftyone"
sample.my_string

In [None]:
sample["my_list"] = ["fifty", "one"]
sample.my_list

In [None]:
sample["my_dict"] = {"fifty": 50, "one": "uno"}
sample.my_dict

In [None]:
sample["my_label"] = fo.Classification(label="cow", confidence=0.98)
sample.my_label

The `OrderedDict` returned by `get_sample_fields()` tracks the order in which fields were added to the dataset.

In [None]:
dataset.get_sample_fields()

Setting a field to an inappropriate type raises a `ValidationError`.

In [None]:
try:
    sample.my_list = 15
except Exception as e:
    print("%s: %s" % (type(e), e))

However, a field can be entirely deleted from a dataset, after which it can be set again to a different field type. `delete_field` is not implemented yet, so the calls below are commented out.

In [None]:
# @todo(Tyler) implement `delete_field`!
# dataset.delete_field("my_list")

# sample["my_list"] = 15

dataset.get_sample_fields()
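
The rules in this section (a field's type is fixed when it is first set, mismatched values are rejected, and deleting the field frees the name) can be sketched in plain Python. `ToySchema` and its methods are hypothetical names made up for this illustration. The sketch raises a built-in `TypeError` rather than the `ValidationError` mentioned above, and it makes no claim about how `fiftyone` validates fields internally.

```python
class ToySchema:
    """Hypothetical schema: a field's type is fixed by its first assignment."""

    def __init__(self):
        # Plain dicts preserve insertion order (Python 3.7+), so the
        # order in which fields were added is tracked for free
        self.field_types = {}
        self.values = {}

    def set_field(self, name, value):
        expected = self.field_types.setdefault(name, type(value))
        if not isinstance(value, expected):
            raise TypeError(
                "field %r expects %s, got %s"
                % (name, expected.__name__, type(value).__name__)
            )
        self.values[name] = value

    def delete_field(self, name):
        # Forgetting the recorded type lets the name be reused
        self.field_types.pop(name, None)
        self.values.pop(name, None)


schema = ToySchema()
schema.set_field("my_int", 51)
schema.set_field("my_list", ["fifty", "one"])

try:
    schema.set_field("my_list", 15)
except TypeError as e:
    print("%s: %s" % (type(e).__name__, e))

schema.delete_field("my_list")
schema.set_field("my_list", 15)  # allowed now that the old type is gone
print(list(schema.field_types))
```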