Releases: woctezuma/steam-DINOv2

Matches

30 Jul 20:22
f85f3ea

Image features were matched with feature-matcher:

  • matches_*.npy: the matched indices (as np.uint32) in a NumPy file,
  • scores_*.npy: the similarity scores (as np.float16) in a NumPy file.

all_model_names = ['dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14']
model_name = all_model_names[0]
%cd /content
!git clone https://github.com/woctezuma/feature-matcher.git
%cd feature-matcher
%pip install --quiet -r requirements.txt

!python match_fts.py \
--input_dir /content/feature-extractor/features \
--feature_filename fts_{model_name}.npy \
--numpy_matches matches_{model_name}.npy \
--numpy_similarity_scores scores_{model_name}.npy \
--num_neighbors 10
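
As a minimal sketch of how the two output files might be read back, the snippet below loads them with NumPy and lists the best matches for one image. It assumes that both arrays are 2-D, with one row per query image and one column per neighbor, which the release notes do not state. The helper names `load_matches` and `top_matches` are hypothetical and are not part of feature-matcher.

```python
import numpy as np


def load_matches(matches_path, scores_path):
    """Load matched indices and similarity scores saved as NumPy files."""
    matches = np.load(matches_path)  # documented dtype: np.uint32
    scores = np.load(scores_path)    # documented dtype: np.float16
    # Assumption: both arrays have the same shape (num_images, num_neighbors).
    if matches.shape != scores.shape:
        raise ValueError(f"shape mismatch: {matches.shape} vs {scores.shape}")
    return matches, scores


def top_matches(matches, scores, query_index, k=5):
    """Return up to k (matched_index, score) pairs for one image, best first."""
    # Cast to float32 before sorting to avoid float16 arithmetic quirks.
    row_scores = scores[query_index].astype(np.float32)
    order = np.argsort(-row_scores)[:k]
    return [(int(matches[query_index, j]), float(row_scores[j])) for j in order]
```

With `--num_neighbors 10` as above, each row would hold ten candidates under this assumption, so `top_matches(matches, scores, query_index=0)` would return the five highest-scoring ones for the first image.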

Features @ DINOv2

30 Jul 17:36
f85f3ea

Image features were extracted with feature-extractor:

  • fts_*.pth: the features (as torch.float32) in a PyTorch file,
  • fts_*.npy: the features (as np.float16) in a NumPy file,
  • img_list.json: the list of image paths corresponding to the features.

all_model_names = ['dinov2_vits14', 'dinov2_vitb14', 'dinov2_vitl14']
model_name = all_model_names[0]
%cd /content
!git clone https://github.com/woctezuma/feature-extractor.git
%cd feature-extractor
%pip install --quiet -r requirements.txt

!python extract_fts.py \
 --data_dir /content/images --batch_size 256 \
 --resize_size 224 --keep_ratio --crop_size 224 \
 --model_repo "facebookresearch/dinov2" --model_name {model_name} \
 --torch_features fts_{model_name}.pth \
 --numpy_features fts_{model_name}.npy
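
The following sketch shows one way the NumPy features and the image list might be loaded together and compared. It assumes that the feature array is 2-D with one row per image and that img_list.json is a JSON list in the same order as the rows; `load_features` and `cosine_similarity` are hypothetical helper names, not functions from feature-extractor.

```python
import json

import numpy as np


def load_features(features_path, img_list_path):
    """Load features (stored as float16) and the matching list of image paths."""
    # Upcast to float32 so that later arithmetic does not lose precision.
    features = np.load(features_path).astype(np.float32)
    with open(img_list_path, encoding="utf-8") as f:
        img_list = json.load(f)
    # Assumption: row i of the features corresponds to img_list[i].
    if len(img_list) != features.shape[0]:
        raise ValueError("number of image paths differs from number of feature rows")
    return features, img_list


def cosine_similarity(features, i, j):
    """Cosine similarity between the feature vectors of images i and j."""
    a, b = features[i], features[j]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The file names would follow the pattern above, e.g. `fts_dinov2_vits14.npy` for the first model.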

Images @ 224x336 resolution

20 Jul 13:12
7466780

On July 20, 2023, images were downloaded and resized from 300x450 to 224x336 with img2dataset:

!img2dataset --url_list=urls.txt --image_size=224 --resize_mode=keep_ratio

Images are split into two archives because GitHub limits each file to 2 GB:

  • images_partA.tar.gz
  • images_partB.tar.gz

The app IDs found in apps.json are the ones that were used to create urls.txt.
The mapping between image indices and app IDs is defined entirely by apps.json. Keep these app IDs as a reference!


The app IDs found in filtered_apps.json are those for which images were successfully downloaded.

Indices appear in image filenames, so it is more convenient to handle indices when trimming the image dataset.
The indices of filtered app IDs can be found in filtered_indices.json, and are such that:

for index, app_id in zip(filtered_indices, filtered_apps, strict=True):
  assert apps[index] == app_id
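
The check above presupposes that the three JSON files are already loaded. A minimal sketch of that step is given below, assuming each file holds a plain JSON list; `load_json` and `build_index_to_app_id` are hypothetical helper names. The lengths are compared explicitly instead of relying on `strict=True`, so the sketch also runs on Python versions older than 3.10.

```python
import json


def load_json(path):
    """Read a JSON file and return its content."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_index_to_app_id(apps, filtered_indices, filtered_apps):
    """Map each filtered index to its app ID, verifying it against apps."""
    if len(filtered_indices) != len(filtered_apps):
        raise ValueError("filtered_indices and filtered_apps differ in length")
    mapping = {}
    for index, app_id in zip(filtered_indices, filtered_apps):
        # Same invariant as the assertion above: apps[index] == app_id.
        if apps[index] != app_id:
            raise ValueError(f"index {index} does not map to app ID {app_id}")
        mapping[index] = app_id
    return mapping


# Usage, assuming the three files sit in the working directory:
# apps = load_json("apps.json")
# filtered_indices = load_json("filtered_indices.json")
# filtered_apps = load_json("filtered_apps.json")
# index_to_app_id = build_index_to_app_id(apps, filtered_indices, filtered_apps)
```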

Dataset Issues

20 Jul 22:45
fe33da6

Issues are detected with cleanvision.

The file detailed_issues.json contains a dictionary which maps each issue type to a list of images, where issue types are:

  • is_odd_size_issue
  • is_odd_aspect_ratio_issue
  • is_low_information_issue
  • is_light_issue
  • is_grayscale_issue
  • is_dark_issue
  • is_blurry_issue

The following issue types are omitted to avoid storing redundant data, because duplicates are handled separately below:

  • is_exact_duplicates_issue
  • is_near_duplicates_issue

The file exact_duplicates.json contains the list of sets of exact duplicates.

The file near_duplicates.json contains the list of sets of near duplicates.
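
As an illustration of how these sets of duplicates could be used, the sketch below keeps one representative per set and lists the others as candidates for removal. It assumes that each set is serialized as a JSON list of integer indices. Keeping the smallest index is only one possible policy and is not necessarily the one used to build indices_to_remove.json; `duplicates_to_drop` is a hypothetical helper name.

```python
def duplicates_to_drop(duplicate_sets):
    """Keep the smallest index of each set and return the others, sorted."""
    to_drop = set()
    for group in duplicate_sets:
        members = sorted(group)
        # members[0] is kept as the representative of the set.
        to_drop.update(members[1:])
    return sorted(to_drop)
```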

The file indices_to_remove.json suggests a list of images to remove due to detected issues, including issues with duplicates.

In each JSON file, images are represented by their indices in apps.json. Moreover, lists of indices are sorted in ascending order.
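
Since every file refers to images by their index in apps.json, trimming the dataset can be reduced to a set difference. The sketch below assumes that indices_to_remove.json holds a flat JSON list of integer indices; `trim_indices` is a hypothetical helper name.

```python
def trim_indices(filtered_indices, indices_to_remove):
    """Return the filtered indices that are not flagged for removal."""
    removed = set(indices_to_remove)  # set membership tests are O(1)
    # The input order is preserved, so a sorted input stays sorted.
    return [index for index in filtered_indices if index not in removed]
```

The surviving indices can then be turned back into app IDs by looking them up in apps.json.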


from cleanvision import Imagelab

DATASET_PATH = "images/"
SAVE_PATH = "results"

imagelab = Imagelab.load(SAVE_PATH, DATASET_PATH)
imagelab.report()

Output:

Issue checks completed. 18088 issues found in the dataset.
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | exact_duplicates |        13162 |
|  1 | near_duplicates  |         2608 |
|  2 | low_information  |          975 |
|  3 | grayscale        |          924 |
|  4 | dark             |          300 |
|  5 | blurry           |           67 |
|  6 | light            |           52 |
|  7 | odd_aspect_ratio |            0 |
|  8 | odd_size         |            0 | 
[exact duplicates] #sets = 6403
[near duplicates] #sets = 1223
#apps to remove: 9608