FastDup | A tool for gaining insights from a large image collection

Large Image Datasets Today are a Mess | Blog Post | Video Tutorial

FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near-duplicate images, and clusters of similar images, and it can learn the normal behavior and temporal interactions between images. It can be used for smart subsampling to build a higher-quality dataset, for outlier removal, and for novelty detection of new information to be sent for tagging. FastDup scales to millions of images while running on CPU only.
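To make the subsampling use case concrete, here is a minimal sketch of how a list of duplicate pairs could be turned into a deduplicated file list by keeping one representative per group of mutually linked duplicates. The function name `keep_one_per_duplicate_group`, the file names, and the pairs are all hypothetical illustrations; this is not FastDup's own implementation.

```python
def keep_one_per_duplicate_group(files, duplicate_pairs):
    """Return the files with only one representative kept per duplicate group."""
    # Every file starts as the root of its own single-element group.
    parent = {f: f for f in files}

    def find(x):
        # Walk up to the group's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller name as the root so the
            # chosen representative is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return sorted({find(f) for f in files})


# Hypothetical input: a.jpg, b.jpg and c.jpg are all linked as duplicates.
files = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg")]
kept = keep_one_per_duplicate_group(files, pairs)
print(kept)
```

Grouping by connected components (rather than dropping one file per pair) avoids removing every member of a chain such as a-b-c, at the cost of treating loosely linked images as one group.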

From the authors of GraphLab and Turi Create.

Duplicates and near-duplicates identified in the MS-COCO and ImageNet-21K datasets

Wrong labels in the ImageNet-21K dataset. Different labels are assigned to visually similar daisy flower images.

Cluster of wrong labels in the ImageNet-21K dataset. No human can tell these red wine flavors apart from the images alone.

Cluster of wrong labels in the ImageNet-21K dataset. Different labels are assigned to visually similar red-wine images.

Thousands of broken ImageNet images that have confusing labels of real objects.

IMDB-WIKI outliers (the dataset is intended for face detection and for gender and age detection)

Outliers in the Google Landmark Recognition 2021 dataset (the dataset is intended to capture recognizable landmarks, such as the Empire State Building)

Fun labels in the ImageNet-21K dataset

Results on Key Datasets

We have thoroughly tested FastDup on a range of well-known visual datasets, from pillar academic datasets to Kaggle competitions. A key finding is that there are ~1.2M (!) duplicate images in the ImageNet-21K dataset, of which 104K pairs span both the train and the val splits (this amounts to 20% of the validation set). This is a new, previously unknown result! Full results are below. * Train/val splits are taken from https://github.com/Alibaba-MIIL/ImageNet21 .
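The train/val overlap described above can be checked with a few lines of code once a list of duplicate pairs is available. The sketch below is only an illustration under stated assumptions: the function name `count_cross_split_pairs`, the file names, and the splits are hypothetical, and it does not reproduce how the reported ImageNet-21K numbers were computed.

```python
def count_cross_split_pairs(pairs, train_files, val_files):
    """Count duplicate pairs with one image in train and the other in val.

    Returns the number of such pairs and the set of validation images
    that have at least one duplicate in the training split.
    """
    train, val = set(train_files), set(val_files)
    cross_pairs = 0
    leaked_val_images = set()
    for a, b in pairs:
        if a in train and b in val:
            cross_pairs += 1
            leaked_val_images.add(b)
        elif a in val and b in train:
            cross_pairs += 1
            leaked_val_images.add(a)
    return cross_pairs, leaked_val_images


# Hypothetical splits and duplicate pairs.
train_files = ["t1.jpg", "t2.jpg", "t3.jpg"]
val_files = ["v1.jpg", "v2.jpg"]
pairs = [
    ("t1.jpg", "v1.jpg"),  # train/val pair
    ("v1.jpg", "t2.jpg"),  # train/val pair, same validation image
    ("t2.jpg", "t3.jpg"),  # both in train, not counted
]

cross_pairs, leaked = count_cross_split_pairs(pairs, train_files, val_files)
leaked_fraction = len(leaked) / len(val_files)
print(cross_pairs, sorted(leaked), leaked_fraction)
```

Note that the number of cross-split pairs and the number of affected validation images can differ, since one validation image may match several training images; the sketch reports both.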

Dataset	Total images	Cost [$]	Spot cost [$]	Processing [sec]	Identical pairs	Anomalies
imagenet21k-resized	11,582,724	4.98	1.24	11,561	1,194,059	Anomalies, Wrong Labels
imdb-wiki	514,883	0.65	0.16	1,509	187,965	View
places365-standard	2,168,460	1.01	0.25	2,349	93,109	View
herbarium-2022-fgvc9	1,050,179	0.69	0.17	1,598	33,115	View
landmark-recognition-2021	1,590,815	0.96	0.24	2,236	2,613	View
visualgenome	108,079	0.05	0.01	124	223	View
iwildcam2021-fgvc9	261,428	0.29	0.07	682	54	View
coco	163,957	0.09	0.02	218	54	View
sku110k	11,743	0.03	0.01	77	7	View

Experiments were run on a 32-core Google Cloud machine with 128GB RAM (no GPU required).
All experiments can also be reproduced on an 8-core, 32GB machine (excluding ImageNet-21K).
We ran on the full ImageNet-21K dataset (11.5M images), comparing all pairs of images in a little over 3 hours (11,561 seconds) WITHOUT a GPU, at a Google Cloud cost of about $5.
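The figures in the results table can be turned into throughput and unit-cost numbers with simple arithmetic. The sketch below copies three rows from the table and derives hours, images per second, implied cost per hour, and cost per million images; the helper name `summarize` and the derived metrics are illustrative choices, not figures published by the authors.

```python
# Rows copied from the results table:
# (total images, on-demand cost in USD, processing time in seconds).
RESULTS = {
    "imagenet21k-resized": (11_582_724, 4.98, 11_561),
    "places365-standard": (2_168_460, 1.01, 2_349),
    "coco": (163_957, 0.09, 218),
}


def summarize(total_images, cost_usd, seconds):
    """Derive throughput and unit-cost metrics from one table row."""
    hours = seconds / 3600
    return {
        "hours": hours,
        "images_per_second": total_images / seconds,
        "usd_per_hour": cost_usd / hours,
        "usd_per_million_images": cost_usd / total_images * 1_000_000,
    }


summaries = {name: summarize(*row) for name, row in RESULTS.items()}
for name, s in summaries.items():
    print(
        f"{name}: {s['hours']:.2f} h, "
        f"{s['images_per_second']:.0f} images/s, "
        f"${s['usd_per_hour']:.2f}/h, "
        f"${s['usd_per_million_images']:.2f} per 1M images"
    )
```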

Quick Installation

For Python 3.7 and 3.8 (Ubuntu 20.04 or Ubuntu 18.04 or Mac M1)

python3.8 -m pip install fastdup

Running the code

import fastdup
fastdup.run(input_dir="/path/to/your/folder")                       # main running function
fastdup.create_duplicates_gallery('similarity.csv', save_path='.')  # create a visual gallery of the duplicates found
fastdup.create_duplicates_gallery('outliers.csv', save_path='.')    # create a visual gallery of anomalies

Working on the Food-101 dataset: detecting identical pairs, similar pairs (search), and outliers (non-food images).
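Beyond the visual galleries, the pair list in `similarity.csv` can also be inspected programmatically. The sketch below uses only the Python standard library and rests on two assumptions that should be checked against the header of your own output file: that the CSV has columns named `from`, `to`, and `distance`, and that larger `distance` values mean more similar images. The helper `load_similar_pairs`, the threshold, and the sample rows are hypothetical.

```python
import csv
import io


def load_similar_pairs(csv_file, min_score=0.9):
    """Read (from, to, score) rows and keep those at or above min_score.

    Assumes columns named 'from', 'to' and 'distance'; adjust the names
    if your file uses a different header.
    """
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        score = float(row["distance"])
        if score >= min_score:
            pairs.append((row["from"], row["to"], score))
    # Most similar pairs first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Hypothetical sample standing in for a real similarity.csv file.
sample = io.StringIO(
    "from,to,distance\n"
    "a.jpg,c.jpg,0.95\n"
    "a.jpg,b.jpg,1.0\n"
    "d.jpg,e.jpg,0.42\n"
)
similar = load_similar_pairs(sample, min_score=0.9)
print(similar)
```

To run this on a real file, replace the `io.StringIO` sample with `open('similarity.csv', newline='')`.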

Getting started examples

Detailed instructions

Support

Join our Slack channel

Technology

We build upon several excellent open source tools: Microsoft's ONNX Runtime, Facebook's Faiss, OpenCV, Pillow Resize, Apple's Turi Create, Minio, and Amazon's awscli.

About Us

Danny Bickson, Amir Alush
