# Image Deduplication with FiftyOne

This walkthrough demonstrates how to use FiftyOne to detect and remove
duplicate images from your dataset.

## Requirements

This walkthrough requires the `tensorflow` package.

```bash
pip install tensorflow
```

## Download the data

First we download the dataset to disk. The dataset is a 1,000-sample subset of
CIFAR-100, a dataset of 32x32 pixel images, each with one of 100
classification labels such as `apple`, `bicycle`, and `porcupine`.

In [1]:
from image_deduplication_helpers import download_dataset

download_dataset()

Downloading dataset of 1000 samples to:
	/tmp/fiftyone/cifar100_with_duplicates
and corrupting the data (5% duplicates)
Download successful


The above script uses `tensorflow.keras.datasets` to download the dataset, so
you must have [TensorFlow installed](https://www.tensorflow.org/install).

The dataset is organized on disk as follows:

```
/tmp/fiftyone/
└── cifar100_with_duplicates/
    ├── <classA>/
    │   ├── <image1>.jpg
    │   ├── <image2>.jpg
    │   └── ...
    ├── <classB>/
    │   ├── <image1>.jpg
    │   ├── <image2>.jpg
    │   └── ...
    └── ...
```
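
If you want to sanity-check this layout before loading it, you can walk the
directory tree yourself. The following is a minimal sketch, assuming the
`<root>/<class>/<image>.jpg` layout shown above; the helper name
`count_images_per_class` is hypothetical, and the demo builds a tiny throwaway
tree so that it runs without the real dataset:

```python
import os
import tempfile
from collections import Counter


def count_images_per_class(root_dir, extensions=(".jpg",)):
    """Count image files in each class subdirectory of `root_dir`.

    Hypothetical helper; assumes a `<root>/<class>/<image>.jpg` layout.
    """
    counts = Counter()
    for class_name in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for filename in os.listdir(class_dir):
            if filename.lower().endswith(extensions):
                counts[class_name] += 1
    return counts


# Demo on a tiny throwaway tree (not the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    for class_name, num_images in [("apple", 2), ("worm", 1)]:
        os.makedirs(os.path.join(tmp_dir, class_name))
        for i in range(num_images):
            open(os.path.join(tmp_dir, class_name, "%d.jpg" % i), "wb").close()

    demo_counts = count_images_per_class(tmp_dir)

print(dict(demo_counts))  # {'apple': 2, 'worm': 1}
```

To inspect the downloaded data, you would pass
`/tmp/fiftyone/cifar100_with_duplicates` as `root_dir` instead.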

As we will soon discover, some of these samples are duplicates, and we don't
yet know which ones!

## Create a dataset

First import the `fiftyone` package.

In [2]:
import fiftyone as fo

Let's use a utility method provided by FiftyOne to load the image
classification dataset from disk:

In [3]:
import os

import fiftyone.utils.data as foud

dataset_name = "cifar100_with_duplicates"

src_data_dir = os.path.join("/tmp/fiftyone", dataset_name)

samples, classes = foud.parse_image_classification_dir_tree(src_data_dir)
dataset = fo.Dataset.from_image_classification_samples(
    samples, name=dataset_name, classes=classes
)

 100% |███████████████████████████| 1000/1000 [227.0ms elapsed, 0s remaining, 4.4K samples/s]     


## Explore the dataset

We can poke around in the dataset:

In [4]:
# Print summary information about the dataset
print(dataset)

# Print a random sample
print(dataset.view().take(1).first())

Name:           cifar100_with_duplicates
Persistent:     False
Num samples:    1000
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
<Sample: {
    'dataset_name': 'cifar100_with_duplicates',
    'id': '5ef390bfe8c7261be7bd2cf0',
    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/caterpillar/178.jpg',
    'tags': BaseList([]),
    'ground_truth': <Classification: {'label': 'caterpillar'}>,
}>


Create a view that contains only samples whose ground truth label is
`mountain`:

In [5]:
view = dataset.view().match({"ground_truth.label": "mountain"})

# Print summary information about the view
print(view)

# Print the first sample in the view
print(view.first())

Dataset:        cifar100_with_duplicates
Num samples:    7
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
Pipeline stages:
    1. <fiftyone.core.stages.Match object at 0x7f2d2cb25a20>
<Sample: {
    'dataset_name': 'cifar100_with_duplicates',
    'id': '5ef390bfe8c7261be7bd2e2a',
    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/mountain/21.jpg',
    'tags': BaseList([]),
    'ground_truth': <Classification: {'label': 'mountain'}>,
}>


Create a view with samples sorted by their ground truth labels in reverse
alphabetical order:

In [6]:
view = dataset.view().sort_by("ground_truth.label", reverse=True)

print(view)
print(view.first())

Dataset:        cifar100_with_duplicates
Num samples:    1000
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
Pipeline stages:
    1. <fiftyone.core.stages.SortBy object at 0x7f2d2cb254a8>
<Sample: {
    'dataset_name': 'cifar100_with_duplicates',
    'id': '5ef390bfe8c7261be7bd3014',
    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/worm/167.jpg',
    'tags': BaseList([]),
    'ground_truth': <Classification: {'label': 'worm'}>,
}>
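
Conceptually, the two views above filter and sort the samples. The following
is a rough plain-Python analogue, not how FiftyOne implements views; the
`records` list and its keys are hypothetical stand-ins for samples:

```python
# Hypothetical records standing in for dataset samples
records = [
    {"filepath": "mountain/21.jpg", "label": "mountain"},
    {"filepath": "apple/113.jpg", "label": "apple"},
    {"filepath": "worm/167.jpg", "label": "worm"},
    {"filepath": "mountain/4.jpg", "label": "mountain"},
]

# Analogue of matching on the ground truth label `mountain`
mountains = [r for r in records if r["label"] == "mountain"]

# Analogue of sorting by label in reverse alphabetical order
by_label_desc = sorted(records, key=lambda r: r["label"], reverse=True)

print([r["filepath"] for r in mountains])  # ['mountain/21.jpg', 'mountain/4.jpg']
print(by_label_desc[0]["label"])           # worm
```

As with the views, neither operation changes the underlying `records`; each
produces a new selection over the same data.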


## Visualize the dataset

Start browsing the dataset:

In [7]:
session = fo.launch_app(dataset=dataset)

App launched


![dataset](images/dedup_1.png)

Narrow your scope to 10 random samples:

In [8]:
session.view = dataset.view().take(10)

![take](images/dedup_2.png)

Click on some samples in the GUI to select them and access their IDs from
code!

In [9]:
# Get the IDs of the currently selected samples in the App
sample_ids = session.selected

Create a view that contains your currently selected samples:

In [10]:
selected_view = dataset.view().select(session.selected)

Update the App to only show your selected samples:

In [11]:
session.view = selected_view

![selected](images/dedup_3.png)

## Compute file hashes

Iterate over the samples and compute their file hashes:

In [12]:
import fiftyone.core.utils as fou

for sample in dataset:
    sample["file_hash"] = fou.compute_filehash(sample.filepath)
    sample.save()

print(dataset)

Name:           cifar100_with_duplicates
Persistent:     False
Num samples:    1000
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    file_hash:    fiftyone.core.fields.IntField
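
A file hash maps the bytes of a file to a fixed-size value, so byte-identical
files always receive the same hash. The sketch below illustrates the idea with
SHA-256 from Python's standard library. It is not a description of what
`fou.compute_filehash` does internally (the samples in this walkthrough store
an integer, whereas this sketch returns a hex string), and the helper name
`sha256_of_file` is hypothetical:

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    Hypothetical helper for illustration; not FiftyOne's implementation.
    """
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo: two files with identical bytes and one with different bytes
with tempfile.TemporaryDirectory() as tmp_dir:
    contents = {
        "a.jpg": b"same bytes",
        "b.jpg": b"same bytes",
        "c.jpg": b"other bytes",
    }
    hashes = {}
    for name, data in contents.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        hashes[name] = sha256_of_file(path)

print(hashes["a.jpg"] == hashes["b.jpg"])  # True
print(hashes["a.jpg"] == hashes["c.jpg"])  # False
```

Note that hashing file contents only catches exact copies. Two images that
look the same but were re-encoded or resized have different bytes, and so will
generally have different hashes.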


We have two ways to visualize this new information:

-   From your terminal:

In [13]:
sample = dataset.view().first()
print(sample)

<Sample: {
    'dataset_name': 'cifar100_with_duplicates',
    'id': '5ef390bfe8c7261be7bd2c38',
    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/apple/113.jpg',
    'tags': BaseList([]),
    'ground_truth': <Classification: {'label': 'apple'}>,
    'file_hash': -6400261609252323776,
}>


-   By refreshing the App:

In [14]:
session.dataset = dataset

![dataset2](images/dedup_4.png)

## Check for duplicates

Now let's use a simple Python statement to locate the duplicate files in the
dataset, i.e., those with the same file hashes:

In [15]:
from collections import Counter

filehash_counts = Counter(sample.file_hash for sample in dataset)
dup_filehashes = [k for k, v in filehash_counts.items() if v > 1]

print("Number of duplicate file hashes: %d" % len(dup_filehashes))

Number of duplicate file hashes: 53
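
Counting hashes tells you how many are duplicated, but not which files share
them. One way to see the groups is to collect file paths per hash. This is a
minimal sketch on hypothetical `(filepath, file_hash)` pairs rather than on the
real dataset:

```python
from collections import defaultdict

# Hypothetical (filepath, file_hash) pairs standing in for dataset samples
records = [
    ("apple/1.jpg", 111),
    ("apple/2.jpg", 222),
    ("worm/7.jpg", 111),
    ("bicycle/3.jpg", 333),
    ("worm/9.jpg", 222),
    ("apple/5.jpg", 111),
]

# Group file paths by their hash
paths_by_hash = defaultdict(list)
for filepath, file_hash in records:
    paths_by_hash[file_hash].append(filepath)

# Keep only hashes shared by more than one file
dup_groups = {h: paths for h, paths in paths_by_hash.items() if len(paths) > 1}

for file_hash, paths in sorted(dup_groups.items()):
    print(file_hash, paths)
# 111 ['apple/1.jpg', 'worm/7.jpg', 'apple/5.jpg']
# 222 ['apple/2.jpg', 'worm/9.jpg']
```

With the real dataset, you could build `records` from `sample.filepath` and
`sample.file_hash` while iterating over the samples.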


Now let's create a view that contains only the samples with these duplicate
file hashes:

In [16]:
dup_view = (
    dataset.view()
    # Extract samples with duplicate file hashes
    .match({"file_hash": {"$in": dup_filehashes}})
    # Sort by file hash so duplicates will be adjacent
    .sort_by("file_hash")
)

print("Number of images that have a duplicate: %d" % len(dup_view))
print("Number of duplicates: %d" % (len(dup_view) - len(dup_filehashes)))

Number of images that have a duplicate: 108
Number of duplicates: 55
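
The second number is the first minus the number of duplicated hashes, because
one image per hash is the "original" and the rest are redundant copies. In the
output above, 108 - 53 = 55; since 108 is more than 2 x 53, at least one hash
must be shared by three or more images. The same arithmetic on a small set of
hypothetical hashes:

```python
from collections import Counter

# Hypothetical file hashes for eight samples
file_hashes = [10, 20, 10, 30, 20, 10, 40, 50]

counts = Counter(file_hashes)
dup_hashes = [h for h, n in counts.items() if n > 1]

# Samples whose hash is shared with at least one other sample
num_with_duplicate = sum(n for n in counts.values() if n > 1)

# Redundant copies: everything beyond the first sample of each group
num_duplicates = num_with_duplicate - len(dup_hashes)

print(len(dup_hashes))     # 2
print(num_with_duplicate)  # 5
print(num_duplicates)      # 3
```

Here hash `10` appears three times and contributes two redundant copies, while
hash `20` appears twice and contributes one.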


Of course, we can always use the App to visualize our work!

In [17]:
session.view = dup_view

![dup-view](images/dedup_5.png)

## Delete duplicates

Now let's delete the duplicate samples from the dataset using our `dup_view` to
restrict our attention to known duplicates:

In [18]:
print("Length of dataset before: %d" % len(dataset))

_dup_filehashes = set()
for sample in dup_view:
    if sample.file_hash not in _dup_filehashes:
        _dup_filehashes.add(sample.file_hash)
        continue

    del dataset[sample.id]

print("Length of dataset after: %d" % len(dataset))

# Verify that the dataset no longer contains any duplicates
print("Number of unique file hashes: %d" % len({s.file_hash for s in dataset}))

Length of dataset before: 1000
Length of dataset after: 943
Number of unique file hashes: 943
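
The loop above keeps the first sample seen for each hash and deletes every
later one. Here is the same keep-first pattern on plain Python data, as a
sketch; the `(sample_id, file_hash)` pairs are hypothetical, and instead of
deleting in place it collects the IDs into two lists:

```python
# Hypothetical (sample_id, file_hash) pairs, sorted by hash as in `dup_view`
samples = [
    ("id_a", 10),
    ("id_b", 10),
    ("id_c", 10),
    ("id_d", 20),
    ("id_e", 20),
]

seen_hashes = set()
ids_to_keep = []
ids_to_delete = []

for sample_id, file_hash in samples:
    if file_hash not in seen_hashes:
        # First time we see this hash: keep the sample
        seen_hashes.add(file_hash)
        ids_to_keep.append(sample_id)
    else:
        # Hash already seen: this sample is a redundant copy
        ids_to_delete.append(sample_id)

print(ids_to_keep)    # ['id_a', 'id_d']
print(ids_to_delete)  # ['id_b', 'id_c', 'id_e']
```

Collecting the IDs first and deleting afterwards is one possible design choice:
it lets you review exactly what will be removed before changing anything.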


## Export the deduplicated dataset

Finally, let's export a fresh copy of our now-duplicate-free dataset:

In [19]:
EXPORT_DIR = "/tmp/fiftyone/export"

dataset.export(label_field="ground_truth", export_dir=EXPORT_DIR)

Writing samples to '/tmp/fiftyone/export' in 'fiftyone.types.dataset_types.ImageClassificationDataset' format...
 100%  943/943 [546.8ms elapsed, 0s remaining, 1.7K s
Writing labels to '/tmp/fiftyone/export/labels.json'
Dataset created


Check out the contents of `/tmp/fiftyone/export` on disk to see how the data is
organized.

You can load the deduplicated dataset that you exported back into FiftyOne at
any time as follows:

In [20]:
no_dups_dataset = fo.Dataset.from_image_classification_dataset(
    EXPORT_DIR, name="no_duplicates"
)

print(no_dups_dataset)

 100%  943/943 [545.7ms elapsed, 0s remaining, 1.7K s
Name:           no_duplicates
Persistent:     False
Num samples:    943
Tags:           []
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


## Cleanup

You can clean up the files generated by this tutorial by running:

```shell
rm -rf /tmp/fiftyone
```