Releases: woctezuma/steam-DINOv2
Matches
Image features were matched with feature-matcher:
matches_*.npy: the matched indices (as np.uint32) in a NumPy file,
scores_*.npy: the similarity scores (as np.float16) in a NumPy file.
all_model_names = [ 'dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14' ]
model_name = all_model_names[0]
%cd /content
!git clone https://github.com/woctezuma/feature-matcher.git
%cd feature-matcher
%pip install --quiet -r requirements.txt
!python match_fts.py \
--input_dir /content/feature-extractor/features \
--feature_filename fts_{model_name}.npy \
--numpy_matches matches_{model_name}.npy \
--numpy_similarity_scores scores_{model_name}.npy \
--num_neighbors 10
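To inspect the results, both files can be loaded with NumPy. The sketch below assumes that the two arrays share the shape (num_images, num_neighbors) and that row i lists the neighbors of image i, best match first. This layout is an assumption, so check the shapes before relying on it. The helper name neighbors_of is hypothetical.

```python
import numpy as np


def neighbors_of(matches, scores, query_index, top_k=5):
    """Return (neighbor_index, score) pairs for one query row.

    Assumes row `query_index` of `matches` holds neighbor indices and the
    same row of `scores` holds the corresponding similarity scores.
    """
    row_matches = matches[query_index][:top_k]
    # Scores are stored as float16; cast to float32 before further arithmetic.
    row_scores = scores[query_index][:top_k].astype(np.float32)
    return [(int(i), float(s)) for i, s in zip(row_matches, row_scores)]


# Usage (file names follow the pattern above for dinov2_vits14):
# matches = np.load("matches_dinov2_vits14.npy")
# scores = np.load("scores_dinov2_vits14.npy")
# print(matches.shape, matches.dtype, scores.shape, scores.dtype)
# print(neighbors_of(matches, scores, query_index=0))
```

If the search was run over the same set of images that provided the queries, the first neighbor of each row is likely the query image itself.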
Features @ DINOv2
Image features were extracted with feature-extractor:
fts_*.pth: the features (as torch.float32) in a PyTorch file,
fts_*.npy: the features (as np.float16) in a NumPy file,
img_list.json: the list of image paths corresponding to the features.
all_model_names = [ 'dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14' ]
model_name = all_model_names[0]
%cd /content
!git clone https://github.com/woctezuma/feature-extractor.git
%cd feature-extractor
%pip install --quiet -r requirements.txt
!python extract_fts.py \
--data_dir /content/images --batch_size 256 \
--resize_size 224 --keep_ratio --crop_size 224 \
--model_repo "facebookresearch/dinov2" --model_name {model_name} \
--torch_features fts_{model_name}.pth \
--numpy_features fts_{model_name}.npy
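Once extracted, the features can be compared directly. The sketch below is a brute-force cosine-similarity search, assuming the NumPy file holds an array of shape (num_images, feature_dim) whose rows follow the order of img_list.json. It is only an illustration: feature-matcher may compute its matches differently, and the helper name cosine_top_k is hypothetical.

```python
import numpy as np


def cosine_top_k(features, query_index, top_k=10):
    """Brute-force cosine-similarity search for one query image."""
    # Cast float16 features to float32 to limit rounding error.
    fts = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(fts, axis=1, keepdims=True)
    fts = fts / np.clip(norms, 1e-12, None)
    similarities = fts @ fts[query_index]
    # Sort by decreasing similarity and keep the first top_k indices.
    order = np.argsort(-similarities)[:top_k]
    return order, similarities[order]


# Usage (file name follows the pattern above for dinov2_vits14):
# features = np.load("fts_dinov2_vits14.npy")
# indices, sims = cosine_top_k(features, query_index=0)
```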
Images @ 224x336 resolution
On July 20, 2023, images were downloaded and resized from 300x450 to 224x336 with img2dataset:
!img2dataset --url_list=urls.txt --image_size=224 --resize_mode=keep_ratio
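The 224x336 target follows from the aspect ratio: with --resize_mode=keep_ratio, img2dataset scales each image so that its smaller side equals --image_size. The arithmetic can be checked with a small hypothetical helper (the exact rounding used by img2dataset for other sizes is not covered here):

```python
def keep_ratio_size(width, height, image_size):
    """Scale (width, height) so that the smaller side equals image_size."""
    scale = image_size / min(width, height)
    return round(width * scale), round(height * scale)


print(keep_ratio_size(300, 450, 224))  # (224, 336)
```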
Images are split into two archives because GitHub caps the size of each file at 2 GB:
images_partA.tar.gz
images_partB.tar.gz
The app IDs found in apps.json are those that were used to create urls.txt.
The matching between image indices and app IDs is entirely managed by apps.json. Keep these app IDs as a reference!
The app IDs found in filtered_apps.json are those for which images were successfully downloaded.
Indices appear in image filenames, so it is more convenient to handle indices when trimming the image dataset.
The indices of filtered app IDs can be found in filtered_indices.json, and are such that:
for index, app_id in zip(filtered_indices, filtered_apps, strict=True):
    assert apps[index] == app_id
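The check above can be turned into a lookup from image index to app ID. This is a minimal sketch, assuming that apps.json and filtered_indices.json each contain a flat JSON list and sit in the working directory. The helper names load_json and app_ids_by_index are hypothetical.

```python
import json


def load_json(path):
    """Read one of the JSON files shipped with the release."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def app_ids_by_index(apps, filtered_indices):
    """Map each image index to its app ID, with apps.json as the reference."""
    return {index: apps[index] for index in filtered_indices}


# Usage, assuming the three files sit in the working directory:
# apps = load_json("apps.json")
# filtered_indices = load_json("filtered_indices.json")
# filtered_apps = load_json("filtered_apps.json")
# lookup = app_ids_by_index(apps, filtered_indices)
# assert list(lookup.values()) == filtered_apps
```

Note that zip(..., strict=True), as used in the check above, requires Python 3.10 or later.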
Dataset Issues
Issues are detected with cleanvision.
The file detailed_issues.json contains a dictionary which maps each issue type to a list of images, where issue types are:
is_odd_size_issue
is_odd_aspect_ratio_issue
is_low_information_issue
is_light_issue
is_grayscale_issue
is_dark_issue
is_blurry_issue
The following issue types are omitted to avoid storing redundant data, because duplicates are handled separately below:
is_exact_duplicates_issue
is_near_duplicates_issue
The file exact_duplicates.json contains the list of sets of exact duplicates.
The file near_duplicates.json contains the list of sets of near duplicates.
The file indices_to_remove.json suggests a list of images to remove due to detected issues, including issues with duplicates.
In each JSON file, images are represented by their indices in apps.json. Moreover, lists of indices are sorted in ascending order.
from cleanvision import Imagelab
DATASET_PATH = "images/"
SAVE_PATH = "results"
imagelab = Imagelab.load(SAVE_PATH, DATASET_PATH)
imagelab.report()
Output:
Issue checks completed. 18088 issues found in the dataset.
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-----------------|-------------:|
| 0 | exact_duplicates | 13162 |
| 1 | near_duplicates | 2608 |
| 2 | low_information | 975 |
| 3 | grayscale | 924 |
| 4 | dark | 300 |
| 5 | blurry | 67 |
| 6 | light | 52 |
| 7 | odd_aspect_ratio | 0 |
| 8 | odd_size | 0 |
[exact duplicates] #sets = 6403
[near duplicates] #sets = 1223
#apps to remove: 9608
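The JSON files described in this section can be used to trim the dataset without re-running cleanvision. The sketch below counts flagged images per issue type and drops the suggested indices from the list of downloaded images. It assumes that detailed_issues.json loads as a dictionary of lists and that indices_to_remove.json and filtered_indices.json load as flat lists. The helper names are hypothetical, and how indices_to_remove.json was built from the detected issues is not specified here.

```python
def count_issues(detailed_issues):
    """Return the number of flagged images per issue type."""
    return {issue: len(indices) for issue, indices in detailed_issues.items()}


def trim_indices(filtered_indices, indices_to_remove):
    """Drop flagged indices while preserving the original order."""
    flagged = set(indices_to_remove)
    return [index for index in filtered_indices if index not in flagged]


# Usage, with the files loaded through json.load:
# print(count_issues(detailed_issues))
# kept_indices = trim_indices(filtered_indices, indices_to_remove)
```

Because image filenames contain the indices, the kept indices can then be used to select the image files to retain.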