# Quick Dataset Analysis
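This notebook shows how to use fastdup to quickly scan an image dataset for common quality issues: invalid files, duplicate pairs, outliers, dark, bright, and blurry images, and clusters of similar images.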

## Installation & Setting Up

In [None]:
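# Upgrade pip, then install fastdup and the helper packages used in this notebook.
!pip install -U pip
!pip install fastdup pandas matplotlib wurlitzer
# wurlitzer forwards C-level stdout/stderr (from fastdup's native code) to the notebook.
%load_ext wurlitzer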

## Download Oxford Pets Dataset

In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz
!tar xf images.tar.gz

## Import and Run fastdup

In [21]:
import fastdup
fastdup.__version__

'0.903'

In [22]:
fd = fastdup.create(work_dir="fastdup_work_dir/", input_dir="images/")
fd.run()



In [23]:
fd.summary()


 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 7390 images
    Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data
    For a detailed analysis, use `.invalid_instances()`.

    Similarity:  1.00% (74) belong to 3 similarity clusters (components).
    99.00% (7,316) images do not belong to any similarity cluster.
    Largest cluster has 6 (0.08%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).

    Outliers: 6.13% (453) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.


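The percentages in the summary can be reproduced from the raw counts. The sketch below is plain arithmetic on numbers copied from the output above. It does not call fastdup, and the dictionary keys are illustrative names, not fastdup fields.

```python
# Counts copied from the summary output above.
total = 7390
counts = {
    "valid": 7384,
    "invalid": 6,
    "in_similarity_clusters": 74,
    "not_in_any_cluster": 7316,
    "possible_outliers": 453,
}

# Share of the dataset for each count, rounded to two decimals like the summary.
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

Note that the outlier share (6.13%) is larger than the 5.00% quantile named in the same line of the summary. The summary does not explain the difference, so treat the outlier count as a starting point for review, not an exact cut.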

## Invalid Images

In [24]:
fd.invalid_instances()

| | img_filename | fastdup_id | error_code | is_valid |
|---|---|---|---|---|
| 0 | Abyssinian_34.jpg | 135 | ERROR_ZERO_SIZE_FILE | False |
| 1 | Egyptian_Mau_139.jpg | 2240 | ERROR_ZERO_SIZE_FILE | False |
| 2 | Egyptian_Mau_145.jpg | 2247 | ERROR_ZERO_SIZE_FILE | False |
| 3 | Egyptian_Mau_167.jpg | 2268 | ERROR_ZERO_SIZE_FILE | False |
| 4 | Egyptian_Mau_177.jpg | 2278 | ERROR_ZERO_SIZE_FILE | False |
| 5 | Egyptian_Mau_191.jpg | 2293 | ERROR_ZERO_SIZE_FILE | False |
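A typical next step is to exclude the invalid files from whatever file list feeds training. The sketch below is a minimal example that uses rows copied from the output above as plain Python data. The `all_files` list is hypothetical, and in the notebook the filenames would come from the `img_filename` column of the DataFrame returned by `fd.invalid_instances()`.

```python
from collections import Counter

# (img_filename, error_code) rows copied from the invalid_instances() output above.
invalid_rows = [
    ("Abyssinian_34.jpg", "ERROR_ZERO_SIZE_FILE"),
    ("Egyptian_Mau_139.jpg", "ERROR_ZERO_SIZE_FILE"),
    ("Egyptian_Mau_145.jpg", "ERROR_ZERO_SIZE_FILE"),
    ("Egyptian_Mau_167.jpg", "ERROR_ZERO_SIZE_FILE"),
    ("Egyptian_Mau_177.jpg", "ERROR_ZERO_SIZE_FILE"),
    ("Egyptian_Mau_191.jpg", "ERROR_ZERO_SIZE_FILE"),
]

# Hypothetical list of files you intend to use downstream.
all_files = [
    "Abyssinian_34.jpg",
    "Abyssinian_35.jpg",
    "Egyptian_Mau_139.jpg",
    "Bombay_1.jpg",
]

# Drop every file that was reported as invalid.
invalid_names = {name for name, _ in invalid_rows}
clean_files = [f for f in all_files if f not in invalid_names]

# Count how often each error code occurs, to see what kind of damage dominates.
errors_by_type = Counter(code for _, code in invalid_rows)

print(clean_files)
print(dict(errors_by_type))
```

Here every invalid file is a zero-size file, which usually points at a failed download or extraction rather than a corrupt image encoding.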


## Duplicate Image Pairs

In [25]:
fd.vis.duplicates_gallery()

100%|████████████████████████████████████████████████| 20/20 [00:00<00:00, 133.19it/s]


Stored similarity visual view in  fastdup_work_dir/galleries/duplicates.html


| From | To | Distance |
|---|---|---|
| Bombay_109.jpg | Bombay_206.jpg | 1.0 |
| Bombay_11.jpg | Bombay_192.jpg | 1.0 |
| Egyptian_Mau_131.jpg | Egyptian_Mau_202.jpg | 1.0 |
| Bombay_126.jpg | Bombay_220.jpg | 1.0 |
| boxer_114.jpg | boxer_82.jpg | 1.0 |
| Bombay_201.jpg | Bombay_92.jpg | 1.0 |
| newfoundland_147.jpg | newfoundland_152.jpg | 1.0 |
| english_cocker_spaniel_151.jpg | english_cocker_spaniel_162.jpg | 1.0 |
| Bombay_193.jpg | Bombay_22.jpg | 1.0 |
| english_cocker_spaniel_154.jpg | english_cocker_spaniel_164.jpg | 1.0 |
| Egyptian_Mau_10.jpg | Egyptian_Mau_183.jpg | 1.0 |
| keeshond_54.jpg | keeshond_59.jpg | 1.0 |
| Egyptian_Mau_210.jpg | Egyptian_Mau_41.jpg | 1.0 |
| Bombay_100.jpg | Bombay_192.jpg | 1.0 |
| Bombay_164.jpg | Bombay_189.jpg | 1.0 |
| english_cocker_spaniel_152.jpg | english_cocker_spaniel_163.jpg | 1.0 |
| Bombay_200.jpg | Bombay_85.jpg | 1.0 |
| Bombay_102.jpg | Bombay_203.jpg | 1.0 |
| english_cocker_spaniel_176.jpg | english_cocker_spaniel_179.jpg | 1.0 |
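Pairs can overlap: `Bombay_192.jpg` appears in two pairs above, so removing one file per pair would not be quite right. The sketch below groups the listed pairs into connected sets with a small union-find and keeps one file per set. It is plain Python on data copied from the gallery output, not a fastdup API, and it assumes that a `Distance` of 1.0 means the images are exact duplicates. Keeping the alphabetically first file is an arbitrary illustrative choice, so review the gallery before deleting anything.

```python
# (From, To) pairs copied from the duplicates gallery output above.
pairs = [
    ("Bombay_109.jpg", "Bombay_206.jpg"),
    ("Bombay_11.jpg", "Bombay_192.jpg"),
    ("Egyptian_Mau_131.jpg", "Egyptian_Mau_202.jpg"),
    ("Bombay_126.jpg", "Bombay_220.jpg"),
    ("boxer_114.jpg", "boxer_82.jpg"),
    ("Bombay_201.jpg", "Bombay_92.jpg"),
    ("newfoundland_147.jpg", "newfoundland_152.jpg"),
    ("english_cocker_spaniel_151.jpg", "english_cocker_spaniel_162.jpg"),
    ("Bombay_193.jpg", "Bombay_22.jpg"),
    ("english_cocker_spaniel_154.jpg", "english_cocker_spaniel_164.jpg"),
    ("Egyptian_Mau_10.jpg", "Egyptian_Mau_183.jpg"),
    ("keeshond_54.jpg", "keeshond_59.jpg"),
    ("Egyptian_Mau_210.jpg", "Egyptian_Mau_41.jpg"),
    ("Bombay_100.jpg", "Bombay_192.jpg"),
    ("Bombay_164.jpg", "Bombay_189.jpg"),
    ("english_cocker_spaniel_152.jpg", "english_cocker_spaniel_163.jpg"),
    ("Bombay_200.jpg", "Bombay_85.jpg"),
    ("Bombay_102.jpg", "Bombay_203.jpg"),
    ("english_cocker_spaniel_176.jpg", "english_cocker_spaniel_179.jpg"),
]

parent = {}  # union-find forest: file name -> parent file name


def find(x):
    """Return the representative of x's group, registering x if it is new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x


def union(a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for a, b in pairs:
    union(a, b)

# Collect the members of each duplicate group under its representative.
groups = {}
for name in parent:
    groups.setdefault(find(name), []).append(name)

# Keep one file per group (alphabetically first) and mark the rest for removal.
keep = sorted(min(members) for members in groups.values())
remove = sorted(set(parent) - set(keep))

print(f"{len(groups)} duplicate groups, {len(remove)} files to remove")
```

For these 19 pairs the grouping yields 18 sets, because the two pairs that share `Bombay_192.jpg` merge into a single set of three files.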


## Outliers

In [None]:
fd.vis.outliers_gallery() 

100%|███████████████████████████████████████████████| 20/20 [00:00<00:00, 9037.50it/s]


Stored outliers visual view in  fastdup_work_dir/galleries/outliers.html


| Path | Distance |
|---|---|
| Bengal_105.jpg | 0.59692 |
| Bengal_131.jpg | 0.611524 |
| staffordshire_bull_terrier_51.jpg | 0.617132 |
| miniature_pinscher_76.jpg | 0.621796 |
| Sphynx_128.jpg | 0.622757 |
| beagle_142.jpg | 0.62428 |
| american_pit_bull_terrier_72.jpg | 0.627605 |
| german_shorthaired_173.jpg | 0.630928 |
| staffordshire_bull_terrier_76.jpg | 0.634339 |
| Bombay_36.jpg | 0.635179 |
| chihuahua_6.jpg | 0.636152 |
| basset_hound_197.jpg | 0.641191 |
| Bombay_204.jpg | 0.642425 |
| Bengal_30.jpg | 0.642967 |
| boxer_149.jpg | 0.643354 |
| beagle_147.jpg | 0.643533 |
| german_shorthaired_121.jpg | 0.644183 |
| Bombay_188.jpg | 0.64548 |
| chihuahua_164.jpg | 0.646996 |
| Abyssinian_226.jpg | 0.653168 |
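The gallery lists outliers in ascending order of the `Distance` value. Given the summary's wording ("bottom 5.00% of similarity values"), a lower value appears to mean the image is less similar to anything else in the dataset. The sketch below shows one way to shortlist the strongest candidates and see which breeds they come from. It is plain Python on values copied from the output above. The `0.63` cutoff is a hypothetical choice for illustration, and reading the breed from the filename prefix is an assumption based on the naming pattern seen here.

```python
from collections import Counter

# (Path, Distance) rows copied from the outliers gallery output above.
outliers = [
    ("Bengal_105.jpg", 0.59692),
    ("Bengal_131.jpg", 0.611524),
    ("staffordshire_bull_terrier_51.jpg", 0.617132),
    ("miniature_pinscher_76.jpg", 0.621796),
    ("Sphynx_128.jpg", 0.622757),
    ("beagle_142.jpg", 0.62428),
    ("american_pit_bull_terrier_72.jpg", 0.627605),
    ("german_shorthaired_173.jpg", 0.630928),
    ("staffordshire_bull_terrier_76.jpg", 0.634339),
    ("Bombay_36.jpg", 0.635179),
    ("chihuahua_6.jpg", 0.636152),
    ("basset_hound_197.jpg", 0.641191),
    ("Bombay_204.jpg", 0.642425),
    ("Bengal_30.jpg", 0.642967),
    ("boxer_149.jpg", 0.643354),
    ("beagle_147.jpg", 0.643533),
    ("german_shorthaired_121.jpg", 0.644183),
    ("Bombay_188.jpg", 0.64548),
    ("chihuahua_164.jpg", 0.646996),
    ("Abyssinian_226.jpg", 0.653168),
]

# Hypothetical cutoff: review anything whose nearest-neighbor similarity is below it.
THRESHOLD = 0.63
flagged = [(path, dist) for path, dist in outliers if dist < THRESHOLD]

# Assumed naming pattern "<breed>_<number>.jpg": strip the trailing number to get the breed.
by_breed = Counter(path.rsplit("_", 1)[0] for path, _ in outliers)

for path, dist in flagged:
    print(f"{dist:.4f}  {path}")
print(by_breed.most_common(3))
```

An outlier is not necessarily a bad image. It may simply be an unusual pose or background, so the shortlist is meant for manual review in the gallery rather than automatic deletion.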


## Dark, Bright and Blurry Images

In [None]:
fd.vis.stats_gallery(metric='dark')

In [None]:
fd.vis.stats_gallery(metric='bright')

In [None]:
fd.vis.stats_gallery(metric='blur')

## Image Clusters

In [None]:
fd.vis.component_gallery()