## Getting Started with FastDup

This walkthrough shows how to install and run FastDup to find duplicate and
near-duplicate images in the *hotel-id-to-combat-human-trafficking-2022-fgvc9*
dataset, from a CVPR 2022 workshop competition.

Kaggle competition link: https://www.kaggle.com/competitions/hotel-id-to-combat-human-trafficking-2022-fgvc9/data

### 1. Installing FastDup

In [44]:
!pip install fastdup



### 2. Downloading Competition Data
Note: see the installation instructions for the Kaggle API here: https://github.com/Kaggle/kaggle-api

In [4]:
import os
import shutil
data_dir = 'hotel-id-to-combat-human-trafficking-2022-fgvc9/data'

# Create the data directory if it does not already exist
os.makedirs(data_dir, exist_ok=True)

In [17]:
!kaggle competitions download -c hotel-id-to-combat-human-trafficking-2022-fgvc9

Downloading hotel-id-to-combat-human-trafficking-2022-fgvc9.zip to /home/jupyter/tmp/fastdup/examples
100%|██████████████████████████████████████▉| 14.0G/14.0G [00:57<00:00, 335MB/s]
100%|███████████████████████████████████████| 14.0G/14.0G [01:03<00:00, 236MB/s]


In [5]:
!unzip -q hotel-id-to-combat-human-trafficking-2022-fgvc9.zip -d hotel-id-to-combat-human-trafficking-2022-fgvc9/data
# Drop the train_masks folder before running FastDup on data_dir
shutil.rmtree(os.path.join(data_dir, 'train_masks'))

### 3. Running FastDup

In [10]:
import fastdup
print(fastdup.__version__)

JPY_PARENT_PID=1754
0.34


In [31]:
results_dir = 'hotel-id-to-combat-human-trafficking-2022-fgvc9/results'

In [40]:
fastdup.run(input_dir=data_dir, work_dir=results_dir)

Going to loop over dir hotel-id-to-combat-human-trafficking-2022-fgvc9/data
Found total 44704 images to run on
[■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% Estimated: 0 Minutes
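As a sanity check on the "Found total 44704 images" line above, the image files under the input directory can be counted independently. The following is a minimal sketch, not part of FastDup: `count_images` is a hypothetical helper, and the extension list is an assumption, so the result may differ from FastDup's own count if it recognizes other file types.

```python
import os

# Assumed set of image extensions; FastDup may recognize a different set.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def count_images(root):
    """Count files under root (recursively) whose name ends in an image extension."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(
            1 for name in filenames
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return total
```

Calling `count_images(data_dir)` after the unzip step should give a number that can be compared with the total reported by `fastdup.run`.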

### 4. Displaying Similarity Gallery

In [32]:
from IPython.display import HTML

#### 4.1 Displaying Top-Ranking Similar Images
Display the 15 top-ranking similar pairs. They are taken from the *similarity.csv*
output file, which contains similar pairs ordered by similarity.

In [42]:
similar_gallery_save_path = os.path.join(results_dir, 'similar-gallery-top')
if not os.path.exists(similar_gallery_save_path):
    os.makedirs(similar_gallery_save_path)
    
similarity_file = os.path.join(results_dir, 'similarity.csv')
fastdup.create_duplicates_gallery(similarity_file, save_path=similar_gallery_save_path, 
                                  num_images=15, descending=True)

gallery_file_name = os.path.join(similar_gallery_save_path, 'similarity.html')
HTML(filename=gallery_file_name)

100%|██████████| 15/15 [00:01<00:00, 11.19it/s]

Stored similarity visual view in  hotel-id-to-combat-human-trafficking-2022-fgvc9/results/similar-gallery-top/similarity.html





| index | distance | from | to |
|---|---|---|---|
| 0 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/73523/000002190.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/73523/000002200.jpg |
| 1 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/104342/000032993.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/104342/000032989.jpg |
| 2 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/83939/000024695.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/83939/000024691.jpg |
| 3 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006982.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006967.jpg |
| 4 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006959.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006974.jpg |
| 5 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/104342/000032990.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/104342/000032994.jpg |
| 6 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006977.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/22658/000006962.jpg |
| 7 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/306244/000027855.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/306244/000027859.jpg |
| 8 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/28166/000003785.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/28166/000003793.jpg |
| 9 | 1.0 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/28906/000037551.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/28906/000037555.jpg |
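The rows shown above can also be inspected without the HTML gallery. The sketch below reads a CSV with the standard library and keeps the pairs whose score is at or above a threshold. It assumes the file has `from`, `to`, and `distance` columns as in the rows above, and that a higher `distance` value means more similar (the top pairs here score 1.0). `load_similar_pairs` and the `min_score` default are hypothetical, not part of FastDup.

```python
import csv


def load_similar_pairs(csv_path, min_score=0.9):
    """Return (from, to, score) tuples with score >= min_score, highest first.

    Assumes the CSV has 'from', 'to' and 'distance' columns.
    """
    pairs = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            score = float(row['distance'])
            if score >= min_score:
                pairs.append((row['from'], row['to'], score))
    # Sort so the most similar pairs come first.
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs
```

For instance, `load_similar_pairs(similarity_file, min_score=1.0)` would list only the pairs with the maximum score, which are candidates for exact duplicates.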


#### 4.2 Displaying Outliers
Display the outliers: the images farthest from all other images in the dataset.
They are taken from the *outliers.csv* output file.

In [38]:
outliers_gallery_save_path = os.path.join(results_dir, 'outliers-gallery')
if not os.path.exists(outliers_gallery_save_path):
    os.makedirs(outliers_gallery_save_path)
    
outliers_file = os.path.join(results_dir, 'outliers.csv')
fastdup.create_duplicates_gallery(outliers_file, save_path=outliers_gallery_save_path, 
                                  num_images=15, descending=False)

gallery_file_name = os.path.join(outliers_gallery_save_path, 'similarity.html')
HTML(filename=gallery_file_name)

100%|██████████| 15/15 [00:02<00:00,  5.96it/s]


Stored similarity visual view in  hotel-id-to-combat-human-trafficking-2022-fgvc9/results/outliers-gallery/similarity.html


| index | distance | from | to |
|---|---|---|---|
| 26807 | 0.608458 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/18800/000004336.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/6231/000043880.jpg |
| 26808 | 0.606731 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/38889/000031826.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/199287/000006014.jpg |
| 26809 | 0.587809 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/8268/000005739.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/8268/000005719.jpg |
| 26810 | 0.582954 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/38889/000031826.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/5723/000015964.jpg |
| 26811 | 0.580804 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/8268/000005739.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/81391/000044396.jpg |
| 26812 | 0.572511 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/96908/000030872.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/32335/000030761.jpg |
| 26813 | 0.571884 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/688869/000043143.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/45168/000044932.jpg |
| 26814 | 0.568088 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/36539/000043058.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/36539/000043036.jpg |
| 26815 | 0.556116 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/36539/000043058.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/25890/000030408.jpg |
| 26816 | 0.551147 | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/7630/000026501.jpg | hotel-id-to-combat-human-trafficking-2022-fgvc9/data/train_images/80091/000023672.jpg |
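Several images appear more than once in the `from` column of the rows above (for example `38889/000031826.jpg`). A minimal sketch for tallying how often each image occurs is given below. It assumes the CSV exposes a `from` column as in the rows above, and `count_outlier_images` is a hypothetical helper, not a FastDup function.

```python
import csv
from collections import Counter


def count_outlier_images(csv_path):
    """Count how many rows list each image in the 'from' column.

    Assumes the CSV has a 'from' column.
    """
    counts = Counter()
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            counts[row['from']] += 1
    return counts
```

With this helper, `count_outlier_images(outliers_file).most_common(10)` would list the ten images that occur most often in the file.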
