Releases · woctezuma/steam-DINOv2

30 Jul 20:22

woctezuma

matches

f85f3ea

Matches

Image features were matched with feature-matcher:

matches_*.npy: the matched indices (as np.uint32) in a NumPy file,
scores_*.npy: the similarity scores (as np.float16) in a NumPy file.

all_model_names = [ 'dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14' ]
model_name = all_model_names[0]

%cd /content
!git clone https://github.com/woctezuma/feature-matcher.git
%cd feature-matcher
%pip install --quiet -r requirements.txt

!python match_fts.py \
 --input_dir /content/feature-extractor/features \
 --feature_filename fts_{model_name}.npy \
 --numpy_matches matches_{model_name}.npy \
 --numpy_similarity_scores scores_{model_name}.npy \
 --num_neighbors 10
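
The released files can be read back with NumPy. The sketch below is a minimal example, not part of feature-matcher: the helper names `load_matches` and `neighbors_of` are hypothetical, and it assumes that both arrays have one row per query image and one column per neighbor, and that the matched indices refer to positions in img_list.json.

```python
import numpy as np


def load_matches(matches_path, scores_path):
    """Load matched indices and similarity scores saved as .npy files."""
    matches = np.load(matches_path)  # stored as np.uint32 in the release
    scores = np.load(scores_path)    # stored as np.float16 in the release
    # Assumption: one row per query image, one column per neighbor.
    if matches.shape != scores.shape:
        raise ValueError(f"shape mismatch: {matches.shape} vs {scores.shape}")
    return matches, scores


def neighbors_of(query_index, matches, scores, img_list):
    """Return (image_path, score) pairs for one query image, in stored order."""
    # Assumption: each matched index is a position in img_list.
    return [
        (img_list[int(i)], float(s))
        for i, s in zip(matches[query_index], scores[query_index])
    ]
```

For instance, `neighbors_of(0, matches, scores, img_list)` would list the stored neighbors of the first image, with `img_list` loaded from img_list.json.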


30 Jul 17:36

woctezuma

features

f85f3ea

Features @ DINOv2

Image features were extracted with feature-extractor:

fts_*.pth: the features (as torch.float32) in a PyTorch file,
fts_*.npy: the features (as np.float16) in a NumPy file,
img_list.json: the list of image paths corresponding to the features.

all_model_names = [ 'dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14' ]
model_name = all_model_names[0]

%cd /content
!git clone https://github.com/woctezuma/feature-extractor.git
%cd feature-extractor
%pip install --quiet -r requirements.txt

!python extract_fts.py \
 --data_dir /content/images --batch_size 256 \
 --resize_size 224 --keep_ratio --crop_size 224 \
 --model_repo "facebookresearch/dinov2" --model_name {model_name} \
 --torch_features fts_{model_name}.pth \
 --numpy_features fts_{model_name}.npy
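
To illustrate how such features can be compared, here is a brute-force nearest-neighbor sketch based on cosine similarity. It is an assumption for illustration only and not necessarily what feature-matcher does; the function name `top_k_cosine` is hypothetical, and the features are assumed to form a 2-D array with one row per image.

```python
import numpy as np


def top_k_cosine(features, k):
    """Return the indices and cosine similarities of the k nearest rows.

    Each row is compared against every row, so the closest match of an
    image is normally the image itself.
    """
    # Upcast: the released .npy features are stored as float16.
    feats = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = feats @ feats.T                       # full n x n similarity matrix
    order = np.argsort(-sims, axis=1)[:, :k]     # highest similarity first
    top_scores = np.take_along_axis(sims, order, axis=1)
    return order.astype(np.uint32), top_scores.astype(np.float16)
```

The full similarity matrix grows quadratically with the number of images, so this form only suits small subsets; a dedicated similarity-search library would be the usual choice for the whole dataset.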


20 Jul 13:12

woctezuma

input

7466780

Images @ 224x336 resolution

On July 20, 2023, images were downloaded and resized from 300x450 to 224x336 with img2dataset:

!img2dataset --url_list=urls.txt --image_size=224 --resize_mode=keep_ratio
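
The output resolution follows from the resize arithmetic. The sketch below assumes that keep_ratio scales the smaller side to image_size while preserving the aspect ratio; the function name `keep_ratio_size` is hypothetical and is not part of img2dataset.

```python
def keep_ratio_size(width, height, image_size):
    """Scale (width, height) so that the smaller side equals image_size."""
    smaller_side = min(width, height)
    new_width = round(width * image_size / smaller_side)
    new_height = round(height * image_size / smaller_side)
    return new_width, new_height


# 300x450 source images with image_size=224: 450 * 224 / 300 = 336
print(keep_ratio_size(300, 450, 224))
```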

Images are split into two archives because GitHub limits each file to a maximum size of 2 GB:

images_partA.tar.gz
images_partB.tar.gz

The app IDs in apps.json are those used to create urls.txt.
The mapping between image indices and app IDs is defined entirely by apps.json. Keep these app IDs as a reference!

The app IDs in filtered_apps.json are those for which images were successfully downloaded.

Indices appear in image filenames, so it is more convenient to handle indices when trimming the image dataset.
The indices of filtered app IDs can be found in filtered_indices.json, and are such that:

for index, app_id in zip(filtered_indices, filtered_apps, strict=True):
  assert apps[index] == app_id


20 Jul 22:45

woctezuma

cleanvision

fe33da6

Dataset Issues

Issues are detected with cleanvision.

The file detailed_issues.json contains a dictionary which maps each issue type to a list of images, where issue types are:

is_odd_size_issue
is_odd_aspect_ratio_issue
is_low_information_issue
is_light_issue
is_grayscale_issue
is_dark_issue
is_blurry_issue

The following issue types are omitted to avoid storing redundant data, because duplicates are handled separately below:

is_exact_duplicates_issue
is_near_duplicates_issue

The file exact_duplicates.json contains the list of sets of exact duplicates.

The file near_duplicates.json contains the list of sets of near duplicates.

The file indices_to_remove.json suggests a list of images to remove due to detected issues, including issues with duplicates.

In each JSON file, images are represented by their indices in apps.json. Moreover, lists of indices are sorted in ascending order.
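
One way to combine these files into a removal list is sketched below. The policy is an assumption for illustration (drop every flagged image, and keep only the lowest index of each duplicate set); it is not necessarily how indices_to_remove.json was built, and the function name `suggest_removals` is hypothetical.

```python
def suggest_removals(detailed_issues, exact_duplicates, near_duplicates):
    """Return a sorted list of image indices to drop.

    detailed_issues: dict mapping an issue type to a list of indices.
    exact_duplicates, near_duplicates: lists of sets (or lists) of indices.
    """
    to_remove = set()
    # Assumed policy: drop every image flagged with any issue type.
    for indices in detailed_issues.values():
        to_remove.update(indices)
    # Assumed policy: keep the lowest index of each duplicate set.
    for group in list(exact_duplicates) + list(near_duplicates):
        members = sorted(group)
        to_remove.update(members[1:])
    return sorted(to_remove)
```

The result is sorted in ascending order, matching the convention used by the released JSON files.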

from cleanvision import Imagelab

DATASET_PATH = "images/"
SAVE_PATH = "results"

imagelab = Imagelab.load(SAVE_PATH, DATASET_PATH)
imagelab.report()

Output:

Issue checks completed. 18088 issues found in the dataset.
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | exact_duplicates |        13162 |
|  1 | near_duplicates  |         2608 |
|  2 | low_information  |          975 |
|  3 | grayscale        |          924 |
|  4 | dark             |          300 |
|  5 | blurry           |           67 |
|  6 | light            |           52 |
|  7 | odd_aspect_ratio |            0 |
|  8 | odd_size         |            0 |

[exact duplicates] #sets = 6403
[near duplicates] #sets = 1223

#apps to remove: 9608

