# Preparing a labeled object dataset for training using fastdup V1.0

In [None]:
# download fastdup
!pip install pip -U
!pip install fastdup
!pip install pandas
!pip install plotly

In [2]:
!pip install pip -U
!pip install fastdup-0.211-cp38-cp38-manylinux_2_31_x86_64.whl --force-reinstall


In [None]:
import fastdup
import pandas as pd

# Get mini-coco dataset

In [1]:
# download mini-coco
!gdown --fuzzy https://drive.google.com/file/d/1t_l9uyBPfxSEzcajTk4a1TaQXzeRm9hw/view\?usp\=sharing
!tar xf coco_minitrain_25k.zip

Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1t_l9uyBPfxSEzcajTk4a1TaQXzeRm9hw 

tar: coco_minitrain_25k.zip: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


In [None]:
# A file called coco_minitrain2017.csv should be in coco_minitrain_25k/annotations/
# as part of the zip download. If it doesn't exist, this gdown command will download it.
!gdown --fuzzy https://drive.google.com/file/d/1i12p23cXlqp1QrXjAD_vu467r4q67Mq9/view\?usp\=sharing -O coco_minitrain_25k/annotations/

# Load annotations
We use a simple converter to turn the COCO-format JSON annotation file into a fastdup annotation dataframe. The same conversion applies to any dataset that uses the COCO format.

In [None]:
coco_csv = 'coco_minitrain_25k/annotations/coco_minitrain2017.csv'
coco_annotations = pd.read_csv(coco_csv, header=None, names=['img_filename', 'bbox_x', 'bbox_y',
                                                             'bbox_w', 'bbox_h', 'label', 'ext'])

coco_annotations['split'] = 'train'  # Only train files were loaded
coco_annotations = coco_annotations.drop_duplicates()

In [None]:
coco_annotations.head(3)

| | img_filename | bbox_x | bbox_y | bbox_w | bbox_h | label | ext | split |
|---|---|---|---|---|---|---|---|---|
| 0 | 000000131075.jpg | 20.23 | 55.98 | 313.49 | 326.5 | tv | 0 | train |
| 1 | 000000131075.jpg | 176.9 | 381.12 | 286.2 | 136.63 | laptop | 0 | train |
| 2 | 000000131075.jpg | 369.96 | 361.35 | 72.76 | 73.91 | laptop | 0 | train |
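
For readers starting from a raw COCO JSON file rather than a ready-made CSV, the conversion could look roughly like the sketch below. The helper name `coco_to_dataframe` and the toy dictionary are illustrative and not part of fastdup. The input keys (`images`, `annotations`, `categories`, and `bbox` as `[x, y, width, height]`) follow the COCO format, and the output columns mirror the dataframe used in this notebook, without the `ext` column, whose meaning is not documented here.

```python
import pandas as pd


def coco_to_dataframe(coco: dict, split: str = "train") -> pd.DataFrame:
    """Flatten a COCO-style annotation dict into one row per bounding box.

    Hypothetical helper for illustration; not a fastdup API.
    """
    # Lookup tables: image id -> file name, category id -> class name.
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    id_to_label = {cat["id"]: cat["name"] for cat in coco["categories"]}

    rows = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        rows.append({
            "img_filename": id_to_file[ann["image_id"]],
            "bbox_x": x, "bbox_y": y, "bbox_w": w, "bbox_h": h,
            "label": id_to_label[ann["category_id"]],
            "split": split,
        })

    columns = ["img_filename", "bbox_x", "bbox_y", "bbox_w", "bbox_h", "label", "split"]
    return pd.DataFrame(rows, columns=columns).drop_duplicates()


# Tiny in-memory example (made-up ids, box values taken from the table above).
toy_coco = {
    "images": [{"id": 1, "file_name": "000000131075.jpg"}],
    "categories": [{"id": 72, "name": "tv"}, {"id": 73, "name": "laptop"}],
    "annotations": [
        {"image_id": 1, "category_id": 72, "bbox": [20.23, 55.98, 313.49, 326.5]},
        {"image_id": 1, "category_id": 73, "bbox": [176.9, 381.12, 286.2, 136.63]},
    ],
}
toy_df = coco_to_dataframe(toy_coco)
print(toy_df)
```

With a real file, the dictionary would come from `json.load` on the annotation JSON instead of the inline `toy_coco`.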


# Run fastdup

In [None]:
# run fastdup with annotations
image_dir = 'coco_minitrain_25k/images/train2017/'
work_dir = 'fastdup_minicoco'

fd = fastdup.create(work_dir=work_dir, input_dir=image_dir)
fd.run(annotations=coco_annotations)

# Get class statistics

In [None]:
fd.annotations()['label'].value_counts()

person           50336
chair             7870
car               7703
bottle            4330
cup               4154
                 ...  
toothbrush         309
bear               296
parking meter      183
toaster             46
hair drier          40
Name: label, Length: 80, dtype: int64

## Class distribution
The dataset contains 25k images and 183k objects, an average of 7.3 objects per image. 

Interestingly, the class distribution is highly unbalanced. All 80 COCO classes are present, but the dataset is strongly skewed towards the person class, which accounts for over 56k instances (30.6%). The car and chair classes also contain over 8k instances each, while at the bottom of the list the toaster and hair drier classes contain as few as 40 instances.
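
Figures like the per-class share and the average number of objects per image can be derived with a few pandas calls. The sketch below runs on a small made-up dataframe so it is self-contained; the column names `img_filename` and `label` match the annotation dataframe used in this notebook, while the data itself is invented.

```python
import pandas as pd

# Made-up annotations: 10 objects spread over 4 images.
toy = pd.DataFrame({
    "img_filename": ["a.jpg"] * 4 + ["b.jpg"] * 3 + ["c.jpg"] * 2 + ["d.jpg"],
    "label": ["person"] * 6 + ["car"] * 3 + ["toaster"],
})

# Absolute and relative class frequencies, most common class first.
counts = toy["label"].value_counts()
share = toy["label"].value_counts(normalize=True)
summary = pd.DataFrame({"count": counts, "share_pct": (share * 100).round(1)})

# Average number of annotated objects per image.
objects_per_image = toy.groupby("img_filename").size().mean()

print(summary)
print(f"objects per image: {objects_per_image:.1f}")
```

Swapping `toy` for the real annotation dataframe gives the same summary for the full dataset.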

Using `Plotly` we get a useful interactive histogram. 

In [None]:
import plotly.express as px
fig = px.histogram(coco_annotations, x="label")
fig.show()

# Finding duplicates
First, we visualize the general list of duplicates.

In [None]:
fd.vis.component_gallery(min_items=2)

## Filtered & Sorted Galleries

In [None]:
# sorting by largest objects
fd.vis.component_gallery(sort_by='area', min_items=2, max_width=900)

In [None]:
# clusters with 'person' labels
fd.vis.component_gallery(sort_by='area', min_items=2, slice='person', max_width=900)

# And outliers

In [None]:
# visualize outliers
fd.vis.outliers_gallery()

# Find size and shape issues
Objects come in various shapes and sizes, and sometimes objects are incorrectly labeled or too small to be useful. We will now find the smallest, narrowest and widest objects, and assess their usefulness.

In [None]:
annot = fd.annotations()
annot['area'] = annot['bbox_w'] * annot['bbox_h']
annot['aspect'] = annot['bbox_w'] / annot['bbox_h']

In [None]:
# Smallest 5% of objects:
smallest_objects = annot[annot['area'] < annot['area'].quantile(0.05)].sort_values(by=['area'])

# 5% of extreme aspect ratios
aspect_ratio_objects = annot[(annot['aspect'] < annot['aspect'].quantile(0.05))
                             | (annot['aspect'] > annot['aspect'].quantile(0.95))].sort_values(by=['aspect'])


In [None]:
# let's see the smallest objects
smallest_objects.head(3)

| fd_index | img_filename | bbox_x | bbox_y | bbox_w | bbox_h | label | ext | split | fastdup_id | error_code | is_valid | area | aspect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110320 | 000000205392.jpg | 580.76 | 232.51 | 10.0 | 10.0 | person | 0 | train | 110320 | VALID | True | 100.0 | 1.0 |
| 131657 | 000000320957.jpg | 183.84 | 300.01 | 10.02 | 10.02 | apple | 0 | train | 131657 | VALID | True | 100.4004 | 1.0 |
| 21376 | 000000407413.jpg | 301.32 | 175.41 | 10.06 | 10.06 | person | 0 | train | 21376 | VALID | True | 101.2036 | 1.0 |


In [None]:
aspect_ratio_objects.head(3)

| fd_index | img_filename | bbox_x | bbox_y | bbox_w | bbox_h | label | ext | split | fastdup_id | error_code | is_valid | area | aspect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 134867 | 000000093298.jpg | 230.69 | 1.11 | 16.64 | 419.24 | kite | 0 | train | 134867 | VALID | True | 6976.1536 | 0.039691 |
| 3642 | 000000002444.jpg | 1.92 | 136.5 | 11.51 | 263.87 | person | 0 | train | 3642 | VALID | True | 3037.1437 | 0.04362 |
| 164216 | 000000116502.jpg | 222.84 | 63.49 | 16.19 | 318.2 | spoon | 0 | train | 164216 | VALID | True | 5151.658 | 0.05088 |


In [None]:
aspect_ratio_objects.tail(3)

| fd_index | img_filename | bbox_x | bbox_y | bbox_w | bbox_h | label | ext | split | fastdup_id | error_code | is_valid | area | aspect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4261 | 000000527098.jpg | 33.0 | 216.0 | 602.0 | 18.0 | boat | 0 | train | 4261 | VALID | True | 10836.0 | 33.444444 |
| 77850 | 000000444692.jpg | 1.12 | 58.6 | 638.88 | 15.74 | bench | 0 | train | 77850 | VALID | True | 10055.9712 | 40.589581 |
| 173218 | 000000516740.jpg | 17.27 | 197.91 | 601.64 | 13.43 | train | 0 | train | 173218 | VALID | True | 8080.0252 | 44.798213 |


Look at that! The slices reveal many items that are either tiny (10x10 pixels) or have extreme aspect ratios, as extreme as 45:1 for an object 601 pixels wide by only 13 pixels high.

### Objects that didn't make the cut:
Let's look at objects deemed invalid by fastdup. These are objects that are too small to be useful in our analysis (smaller than 10px), objects whose bounding boxes have illegal values (negative or beyond image boundaries), or objects that belong to missing images. We can tell which is which by the `error_code` column in our dataframe.

In [None]:
fd.invalid_instances().head(3)

| | img_filename | bbox_x | bbox_y | bbox_w | bbox_h | label | ext | split | fastdup_id | error_code | is_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000000262162.jpg | 437.17 | 244.79 | 19.52 | 9.93 | mouse | 0 | train | 16 | ERROR_BAD_BOUNDING_BOX | False |
| 1 | 000000524325.jpg | 137.84 | 332.22 | 8.92 | 11.5 | person | 0 | train | 60 | ERROR_BAD_BOUNDING_BOX | False |
| 2 | 000000524325.jpg | 177.35 | 294.13 | 5.32 | 11.92 | person | 0 | train | 65 | ERROR_BAD_BOUNDING_BOX | False |


#### Distribution of error codes:
A simple `value_counts` will tell us the distribution of the errors. We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them out would both save us gruesome debugging of training errors and failures and help us provide the model with objects of a useful size.

In [None]:
fd.invalid_instances()['error_code'].value_counts()

ERROR_BAD_BOUNDING_BOX    18592
Name: error_code, dtype: int64
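
The kind of check described above (boxes that are too small, negative, or outside the image) can be approximated in plain pandas. The sketch below is not fastdup's validation logic. It assumes the 10px minimum mentioned in the text and two hypothetical columns, `img_w` and `img_h`, holding each image's size, which the annotation dataframe in this notebook does not contain.

```python
import pandas as pd

MIN_SIDE_PX = 10  # assumed threshold, taken from the "smaller than 10px" remark


def flag_bad_boxes(df: pd.DataFrame, min_side: float = MIN_SIDE_PX) -> pd.Series:
    """Return a boolean Series that is True for boxes failing any check.

    Hypothetical helper; expects bbox_x, bbox_y, bbox_w, bbox_h, img_w, img_h.
    """
    too_small = (df["bbox_w"] < min_side) | (df["bbox_h"] < min_side)
    negative = (df["bbox_x"] < 0) | (df["bbox_y"] < 0)
    out_of_bounds = (
        (df["bbox_x"] + df["bbox_w"] > df["img_w"])
        | (df["bbox_y"] + df["bbox_h"] > df["img_h"])
    )
    return too_small | negative | out_of_bounds


# Toy boxes; the 640x480 image size is invented for the example.
boxes = pd.DataFrame({
    "label":  ["tv", "mouse", "person", "car"],
    "bbox_x": [20.23, 437.17, -5.0, 600.0],
    "bbox_y": [55.98, 244.79, 10.0, 100.0],
    "bbox_w": [313.49, 19.52, 50.0, 100.0],
    "bbox_h": [326.5, 9.93, 80.0, 60.0],
    "img_w":  [640, 640, 640, 640],
    "img_h":  [480, 480, 480, 480],
})

bad = flag_bad_boxes(boxes)
clean = boxes[~bad]  # rows that pass every check
print(boxes.assign(bad=bad)[["label", "bad"]])
```

Here the `mouse` box fails on height (9.93px), the `person` box on a negative coordinate, and the `car` box on exceeding the image width, leaving only the `tv` box in `clean`.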

# Find possible mislabels
The fastdup similarity search and gallery are a strong tool for finding objects that are possibly mislabeled. By looking at each object's nearest neighbors and their classes, we can find objects whose class contradicts that of their neighbors, which is a strong sign of a mislabel.

In [None]:
fd.vis.similarity_gallery(label_col='label', num_images=25)
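
The idea of comparing an object's label with the labels of its nearest neighbors can be illustrated without fastdup. The sketch below is a simplified stand-in, not how fastdup computes its gallery: it assumes each object already has an embedding vector, uses brute-force cosine similarity, and flags objects whose label differs from the majority label among their `k` nearest neighbors. The function name `suspect_labels` and the toy data are made up.

```python
import numpy as np


def suspect_labels(embeddings, labels, k=3):
    """Return indices whose label disagrees with the majority of their k nearest neighbors."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit length -> dot = cosine
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # an object is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # k most similar objects per row

    labels = np.asarray(labels)
    suspects = []
    for i, nbrs in enumerate(neighbors):
        values, counts = np.unique(labels[nbrs], return_counts=True)
        majority = values[np.argmax(counts)]
        if majority != labels[i]:
            suspects.append(i)
    return suspects


# Two tight clusters in 2D; index 3 sits in the "cat" cluster but is labeled "dog".
toy_embeddings = [
    [1.00, 0.00], [0.99, 0.10], [0.98, 0.15], [0.97, 0.05],
    [0.00, 1.00], [0.10, 0.99], [0.05, 0.98], [0.15, 0.97],
]
toy_labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]

flagged = suspect_labels(toy_embeddings, toy_labels, k=3)
print(flagged)
```

Flagged objects are candidates for manual review rather than confirmed errors, since visually similar objects can legitimately belong to different classes.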