[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# Analyzing Roboflow Universe Datasets

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)

This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision dataset from [Roboflow Universe](https://universe.roboflow.com/).

## Install Roboflow Python Package
The Roboflow Python Package is a Python wrapper around the core Roboflow web application and REST API.

This package lets us download any of the more than 200,000 datasets on [Roboflow Universe](https://universe.roboflow.com/).

In [3]:
!pip install -Uq roboflow

## Search Datasets

First, [sign up for a free account](https://app.roboflow.com/).

You can now search for any dataset on the [platform](https://universe.roboflow.com/) using keywords.

![image-3.png](attachment:image-3.png)

Once you find a dataset of interest, click the 'Download Dataset' button on the dataset page.

![image.png](attachment:image.png)

Copy the code snippet into your notebook.

![image-2.png](attachment:image-2.png)

## Download into a Local Folder

In this notebook, we will download the Pascal VOC 2012 dataset in COCO format into a local folder.

In [None]:
from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_API_KEY")
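To avoid pasting the key into a shared notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `ROBOFLOW_API_KEY` and the helper `get_roboflow_api_key` are assumptions made for this example, not part of the Roboflow package.

```python
import os


def get_roboflow_api_key(env_var="ROBOFLOW_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical convention for this notebook,
    not something the Roboflow package defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Usage (uncomment once the variable is set):
# from roboflow import Roboflow
# rf = Roboflow(api_key=get_roboflow_api_key())
```

Failing early with a clear message is easier to debug than an authentication error later in the download step.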

In [None]:
project = rf.workspace("jacob-solawetz").project("pascal-voc-2012")
dataset = project.version(1).download("coco")
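Before running any analysis, it can help to confirm what was actually downloaded. The following is a minimal sketch using only the standard library. It assumes the images sit in per-split subfolders such as `train` and `valid` under the dataset folder. The helper name `count_images_per_split` and the list of image extensions are choices made for this example.

```python
from collections import Counter
from pathlib import Path

# File extensions treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_split(dataset_dir):
    """Count image files under each top-level subfolder of dataset_dir."""
    root = Path(dataset_dir)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rel = path.relative_to(root)
            # Files directly in the root are grouped under "."
            split = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[split] += 1
    return dict(counts)


# Usage, assuming the dataset was downloaded to 'Pascal-VOC-2012-1':
# print(count_images_per_split("Pascal-VOC-2012-1"))
```

Annotation files such as JSON are skipped, so the counts reflect images only.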

## Install fastdup

Next, install fastdup and verify the installation.

In [6]:
!pip install -Uq fastdup 

Now, test the installation. If there's no error message, we are ready to go.

In [7]:
import fastdup
fastdup.__version__



'1.31'

## Run fastdup

To run fastdup, we only need to point `input_dir` to the folder containing images from the dataset.

In [16]:
fd = fastdup.create(input_dir='Pascal-VOC-2012-1')
fd.run()

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-01 15:41:47 [INFO] Going to loop over dir Pascal-VOC-2012-1
2023-08-01 15:41:47 [INFO] Found total 17112 images to run on, 17112 train, 0 test, name list 17112, counter 17112 
2023-08-01 15:42:27 [INFO] Found total 17112 images to run onimated: 0 Minutes
Finished histogram 3.974
Finished bucket sort 4.014
2023-08-01 15:42:28 [INFO] 1026) Finished write_index() NN model
2023-08-01 15:42:28 [INFO] Stored nn model index file work_dir/nnf.index
2023-08-01 15:42:29 [INFO] Total time took 41980 ms
2023-08-01 15:42:29 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %
2023-08-01 15:42:29 [INFO] Found a total of 2 nearly identical images(d>0.980), which are 0.01 %
2023-08-01 15:42:29 [INFO] Found a total of 261 above threshold images (d>0.900), which are 0.76 %
2023-08-01 15:42:29 [INFO] Found a total of 1711 outlier images         (d<0.050), which are 5.00 %
2023-08-01 15:42:29 [INFO] 

0

## Inspect Issues

The summary above reports 0 fully identical images, 2 nearly identical images, 261 images above the similarity threshold, and 1711 outlier images.

There are several methods we can use to inspect and visualize the issues found:

```python
fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of images statistics (e.g. blur)
fd.vis.similarity_gallery()    # create a gallery of similar images
```

In [17]:
fd.vis.duplicates_gallery()

100%|████████████████████████████████████████████████| 20/20 [00:00<00:00, 242.30it/s]


Stored similarity visual view in  work_dir/galleries/duplicates.html
########################################################################################
Would you like to see awesome visualizations for some of the most popular academic datasets?
Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup
########################################################################################


| Distance | From | To |
|---|---|---|
| 0.987515 | /train/2010_006803_jpg.rf.31aca29dff5114a2b42aef2e3e6fcbc4.jpg | /train/2008_008229_jpg.rf.2d23540e328f7956073c61ca6baba9d0.jpg |
| 0.951582 | /train/2008_002916_jpg.rf.cf113512549e960850271642cf9740d5.jpg | /train/2011_001005_jpg.rf.76c841beee524a4ceb759beff3c5681f.jpg |
| 0.949979 | /train/2011_004646_jpg.rf.a0c9277e66be9d1c483fa3e5e2ddb27c.jpg | /valid/2011_005394_jpg.rf.62a057e8d92da162e28c619f8edc1e6a.jpg |
| 0.945792 | /train/2009_003598_jpg.rf.9845b62b589d3874cee0c7f51e37f1d4.jpg | /train/2009_002240_jpg.rf.7f74c7c6dc6b0d092a16e4354dd3ea1d.jpg |
| 0.941233 | /train/2012_002992_jpg.rf.3cd83fe5b8fbbd28bed39707da595eec.jpg | /train/2011_006408_jpg.rf.a124a4d25da364267db5e83ed19460b4.jpg |
| 0.940866 | /train/2009_000156_jpg.rf.5601e7435a30413d0e444226f47312f6.jpg | /train/2009_004082_jpg.rf.609686abe085ec0c871b3b961fc7de08.jpg |
| 0.940558 | /train/2009_000817_jpg.rf.65f98a6a31826bf2832a7f9f5c35e49e.jpg | /train/2009_003066_jpg.rf.144e78fa0da27f9647d65462ff5f6726.jpg |
| 0.939534 | /valid/2011_001611_jpg.rf.65323de4c3cd01b833f6dd72b7c8d9a7.jpg | /train/2008_004171_jpg.rf.6b251f3dafb054bec3e4be1fc18ddacf.jpg |
| 0.939317 | /train/2008_006621_jpg.rf.89523d695fcb422f855cba6fd06391ca.jpg | /valid/2008_006619_jpg.rf.24e8059e198ad28307f441b2a0592f7e.jpg |
| 0.937513 | /train/2009_004738_jpg.rf.c47e351a4230030e8b266661e40510d6.jpg | /train/2009_005247_jpg.rf.bd7b35d9f418518b306e8ce48849d7e5.jpg |


0
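Three of the ten pairs listed above link an image in `train` to one in `valid`, which may indicate leakage between splits. The sketch below shows one way to flag such pairs from a list of `(from, to, distance)` tuples. The helpers `split_of` and `cross_split_pairs` are hypothetical names for this example, and the split names `train`, `valid`, and `test` are assumed from the folder layout seen in the output. The sample pairs are copied from the gallery output above.

```python
from pathlib import PurePosixPath

# Split folder names assumed from the paths in the gallery output.
SPLITS = {"train", "valid", "test"}


def split_of(path):
    """Return the first path component that names a split, or None."""
    for part in PurePosixPath(path).parts:
        if part in SPLITS:
            return part
    return None


def cross_split_pairs(pairs):
    """Keep only the pairs whose two images live in different splits."""
    return [(a, b, d) for a, b, d in pairs if split_of(a) != split_of(b)]


# Three of the pairs reported by the duplicates gallery.
pairs = [
    ("/train/2011_004646_jpg.rf.a0c9277e66be9d1c483fa3e5e2ddb27c.jpg",
     "/valid/2011_005394_jpg.rf.62a057e8d92da162e28c619f8edc1e6a.jpg", 0.949979),
    ("/train/2009_003598_jpg.rf.9845b62b589d3874cee0c7f51e37f1d4.jpg",
     "/train/2009_002240_jpg.rf.7f74c7c6dc6b0d092a16e4354dd3ea1d.jpg", 0.945792),
    ("/train/2008_006621_jpg.rf.89523d695fcb422f855cba6fd06391ca.jpg",
     "/valid/2008_006619_jpg.rf.24e8059e198ad28307f441b2a0592f7e.jpg", 0.939317),
]

leaks = cross_split_pairs(pairs)
for src, dst, distance in leaks:
    print(f"{distance:.3f}  {src}  <->  {dst}")
```

Whether a flagged pair is a real problem still needs a visual check in the gallery, since similar images are not necessarily duplicates.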

In [18]:
fd.vis.component_gallery()

100%|██████████████████████████████████████████████████| 1/1 [00:00<00:00, 141.64it/s]


Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg
Stored components visual view in  work_dir/galleries/components.html
Execution time in seconds 1.1


| component | num_images | mean_distance |
|---|---|---|
| 3850.0 | 2.0 | 0.9875 |


0

In [19]:
fd.vis.stats_gallery(metric='dark')

100%|████████████████████████████████████████████████| 20/20 [00:00<00:00, 396.04it/s]


Stored mean visual view in  work_dir/galleries/mean.html


| mean | filename |
|---|---|
| 9.8708 | Pascal-VOC-2012-1/train/2008_002296_jpg.rf.dd66653f5f3b8d2cf98685996bd8530b.jpg |
| 9.9376 | Pascal-VOC-2012-1/train/2011_006677_jpg.rf.74a0f19a0754c0ea92c16843686d8e40.jpg |
| 10.0613 | Pascal-VOC-2012-1/train/2012_000753_jpg.rf.0c4e0e21e2c4cd5c3cbbd8b4293f642b.jpg |
| 10.1653 | Pascal-VOC-2012-1/train/2011_005762_jpg.rf.a6dbf82a92df15164ff7d2ef645081ce.jpg |
| 10.6959 | Pascal-VOC-2012-1/train/2011_006638_jpg.rf.37ad27827a4c5485b47f736d02c99345.jpg |
| 11.0414 | Pascal-VOC-2012-1/train/2010_005754_jpg.rf.2e8eaf908b39ef02afeee1e0b92a2863.jpg |
| 11.095 | Pascal-VOC-2012-1/valid/2012_000217_jpg.rf.5740c3c8bbddda6dfb499168daa84190.jpg |
| 11.7396 | Pascal-VOC-2012-1/train/2008_001241_jpg.rf.9043ba15c0b55fe7f89eb284f79d6d99.jpg |
| 11.9203 | Pascal-VOC-2012-1/train/2009_005190_jpg.rf.eadcfe20bb063bbb5874cd5ae8ac6c7b.jpg |
| 12.2668 | Pascal-VOC-2012-1/valid/2009_000840_jpg.rf.c463a1f9fe62aa5d6de6ab077a2a0c75.jpg |
| 13.6745 | Pascal-VOC-2012-1/valid/2012_000646_jpg.rf.6dc09f772cd15620619b81cafd443227.jpg |
| 14.1303 | Pascal-VOC-2012-1/train/2011_005949_jpg.rf.4b7ce7fe1f9896e5dc1a041cc53a66f8.jpg |
| 14.4831 | Pascal-VOC-2012-1/train/2009_001617_jpg.rf.8be50380316934251bed518250826e22.jpg |
| 14.7667 | Pascal-VOC-2012-1/train/2010_002301_jpg.rf.f8e0ca2ec67d3c3444e562b718b6d33a.jpg |
| 15.2213 | Pascal-VOC-2012-1/train/2011_006235_jpg.rf.788e2d16ae3449623f785c00da3a14eb.jpg |
| 15.2462 | Pascal-VOC-2012-1/train/2012_000461_jpg.rf.76baa36ce713ba2d39bb61a1f2f90669.jpg |
| 16.3886 | Pascal-VOC-2012-1/train/2011_002236_jpg.rf.7195191ef11d8acad1fda2ec9cc4cbd0.jpg |
| 16.5388 | Pascal-VOC-2012-1/train/2011_006912_jpg.rf.34490d4d340a40c4fb80345707391eeb.jpg |
| 16.7958 | Pascal-VOC-2012-1/train/2010_004855_jpg.rf.f60ec7db6f2f702ebbe245f2f93f29c9.jpg |
| 16.8413 | Pascal-VOC-2012-1/valid/2009_001266_jpg.rf.4f23406cb77c954cf903b0f61f2dbb72.jpg |


0

In [20]:
fd.vis.stats_gallery(metric='bright')

100%|████████████████████████████████████████████████| 20/20 [00:00<00:00, 359.95it/s]

Stored mean visual view in  work_dir/galleries/mean.html







| mean | filename |
|---|---|
| 246.8335 | Pascal-VOC-2012-1/train/2008_001448_jpg.rf.2fde0878a01b34fa959108bffd348166.jpg |
| 243.2605 | Pascal-VOC-2012-1/train/2011_002269_jpg.rf.faf54b6d54aa3ae1b03ad09440331d34.jpg |
| 236.4897 | Pascal-VOC-2012-1/train/2010_001731_jpg.rf.79e722e6d48ecf5b2bc8b47a907133f5.jpg |
| 236.1122 | Pascal-VOC-2012-1/train/2011_000922_jpg.rf.cac8be0bbfc9c15ba9bb791448cba8d6.jpg |
| 231.1935 | Pascal-VOC-2012-1/train/2010_000091_jpg.rf.6516d1559dd4b10d6fe7fad76d45f92f.jpg |
| 229.4073 | Pascal-VOC-2012-1/train/2008_004433_jpg.rf.a6b081a2f34b68102d9346ce4cc7ae76.jpg |
| 229.24 | Pascal-VOC-2012-1/train/2010_004072_jpg.rf.19b78fe2704e0d1c812f7bbeedb6def7.jpg |
| 228.713 | Pascal-VOC-2012-1/valid/2008_003781_jpg.rf.06032c3ef628ed9dc75ed342b05021e2.jpg |
| 227.4939 | Pascal-VOC-2012-1/train/2011_007026_jpg.rf.815c697a47b67ed0c5aa35ad7ea4a929.jpg |
| 222.9629 | Pascal-VOC-2012-1/valid/2011_005791_jpg.rf.cf93e50dd9379d58f2cb1856bc655eca.jpg |
| 222.3015 | Pascal-VOC-2012-1/train/2009_003813_jpg.rf.9cbd222f12c23c6d227600449df352c3.jpg |
| 221.2277 | Pascal-VOC-2012-1/train/2009_002962_jpg.rf.f3e074926ce91f1fe9ff70346dd495cb.jpg |
| 220.2631 | Pascal-VOC-2012-1/train/2009_002429_jpg.rf.628f3c778231575bda2236fec12cfa84.jpg |
| 219.9182 | Pascal-VOC-2012-1/train/2010_002321_jpg.rf.76cdcf8479a3c63025d54da80b1d2763.jpg |
| 219.7747 | Pascal-VOC-2012-1/valid/2009_004247_jpg.rf.6ace82bc58c602943d058c2f0ad7a3e0.jpg |
| 217.7867 | Pascal-VOC-2012-1/train/2009_004319_jpg.rf.3086a8faad5834fd4164a1d12d8c09cb.jpg |
| 216.6105 | Pascal-VOC-2012-1/train/2009_002714_jpg.rf.0080cddf084ff4ce23353bbe378be16b.jpg |
| 215.6898 | Pascal-VOC-2012-1/train/2011_002177_jpg.rf.a3ad75aae85945ebbeec2c7ab215365d.jpg |
| 215.3194 | Pascal-VOC-2012-1/train/2010_001448_jpg.rf.c64d76af145d264577800ffe2179e3fc.jpg |
| 215.2172 | Pascal-VOC-2012-1/valid/2011_006205_jpg.rf.76167d7f949820b36d0489597dbf40d6.jpg |


0
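Both `stats_gallery` calls above report storing their view in `work_dir/galleries/mean.html`, so the bright gallery appears to overwrite the dark one. If you want to keep both, one option is to copy the file under a distinct name right after each call. This is a minimal sketch using only the standard library. `keep_gallery_copy` is a hypothetical helper, not a fastdup function.

```python
import shutil
from pathlib import Path


def keep_gallery_copy(gallery_path, label):
    """Copy a gallery HTML file next to itself with a label in its name.

    Example: mean.html with label 'dark' becomes mean_dark.html.
    """
    src = Path(gallery_path)
    dst = src.with_name(f"{src.stem}_{label}{src.suffix}")
    shutil.copyfile(src, dst)
    return dst


# Usage, assuming the paths printed in the output above:
# fd.vis.stats_gallery(metric='dark')
# keep_gallery_copy("work_dir/galleries/mean.html", "dark")
# fd.vis.stats_gallery(metric='bright')
# keep_gallery_copy("work_dir/galleries/mean.html", "bright")
```

The copy stays in the same folder as the original, so any relative links inside the HTML keep resolving to the same location.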

## Wrap Up

That's a wrap! In this notebook, we showed how to download a dataset from Roboflow Universe and analyze it using fastdup. You can use the same approach on other datasets from [Roboflow Universe](https://universe.roboflow.com/).

Try it out and let us know what issues you find.


Next, feel free to check out other tutorials:

+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled, ImageNet-style folder structure, have a go!
+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 


## VL Profiler
If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product, and it lets you visualize and inspect your dataset in your browser.

[Sign up](https://app.visual-layer.com) now, it's free.

[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)

As usual, feedback is welcome! 

Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).