![Roboflow Notebooks banner](https://camo.githubusercontent.com/aec53c2b5fb6ed43d202a0ab622b58ba68a89d654fbe3abab0c0cc8bd1ff424e/68747470733a2f2f696b2e696d6167656b69742e696f2f726f626f666c6f772f6e6f7465626f6f6b732f74656d706c6174652f62616e6e657274657374322d322e706e673f696b2d73646b2d76657273696f6e3d6a6176617363726970742d312e342e33267570646174656441743d31363732393332373130313934)

# Image Classification with DINOv2

DINOv2, released by Meta Research in April 2023, implements a self-supervised method of training computer vision models.

DINOv2 was trained on roughly 142 million images without labels. The embeddings generated by DINOv2 can be used for classification, image retrieval, segmentation, and depth estimation. With that said, at the initial release Meta Research did not publish heads for segmentation and depth estimation.

In this guide, we are going to build an image classifier using embeddings from DINOv2. To do so, we will:

1. Load a folder of images
2. Compute embeddings for each image
3. Save all the embeddings to a JSON file
4. Train an SVM classifier to classify images

By the end of this notebook, we'll have a classifier trained on our dataset.

Without further ado, let's begin!

## Import Packages

First, let's import the packages we will need for this project.

In [1]:
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
import os
#import cv2
import json
import glob
from tqdm.notebook import tqdm

In [2]:
import roboflow
#import supervision as sv


In [3]:
cwd = os.getcwd()
cwd

'/workspaces/gc_quant_trading_research'

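Next, we load the folders containing the trading chart images. The name of each subfolder (for example `bear`, `bull`, or `range`) is used as the label for the PNG files inside it.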

In [4]:
ROOT_DIR = os.getcwd()

# Map each PNG path to the name of the folder that contains it (its label).
labels = {}

for folder in os.listdir(ROOT_DIR):
    print(folder)
    folder_path = os.path.join(ROOT_DIR, folder)
    if not os.path.isdir(folder_path):
        continue  # skip plain files such as README.md
    for file in os.listdir(folder_path):
        if file.endswith(".png"):
            labels[os.path.join(folder_path, file)] = folder

files = labels.keys()

all_embeddings.json
bear
range
now
Dinov2_classification_gc.ipynb
bull
README.md
dockerfile
notebooks
.git
data
Now
requirements.txt


In [5]:
list(files)

['/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 13.47.04.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 13.59.17.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.03.03.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 13.46.01.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.06.28.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 13.45.50.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.03.27.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-13 at 14.37.05.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.03.09.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.13.21.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-14 at 14.03.45.png',
 '/workspaces/gc_quant_trading_research/bear/Screenshot 2024-09-1

In [6]:
# Collect the label of each file, in the same order as `files`

values = [labels[key] for key in files]
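
One thing to watch in the loading loop above: every subfolder that contains PNG files becomes a class, and that would include the `Now` folder used later for prediction. Below is a minimal sketch that restricts loading to an explicit list of class folders. It assumes that `bear`, `bull`, and `range` are the intended classes (the notebook does not state this), and `collect_labels` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: these three folders are the intended classes.
CLASS_FOLDERS = ("bear", "bull", "range")

def collect_labels(root, class_folders=CLASS_FOLDERS):
    """Map each PNG path under the given class folders to its folder name."""
    labels = {}
    for name in class_folders:
        folder = Path(root) / name
        if not folder.is_dir():
            continue  # skip classes that have no folder yet
        for path in sorted(folder.glob("*.png")):
            labels[str(path)] = name
    return labels
```

Calling `labels = collect_labels(ROOT_DIR)` would then leave the images in `Now` out of the training data.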

## Load the Model and Compute Embeddings

To train our classifier, we need:

1. The embeddings associated with each image in our dataset, and;
2. The labels associated with each image.

To calculate embeddings, we'll use DINOv2. Below, we load the smallest DINOv2 weights and define functions that will load and compute embeddings for every image in a specified list.

We store all of our vectors in a dictionary that is saved to disk so we can reference them again if needed. Note that in production environments one may opt for another data structure, such as a vector index or database (e.g., faiss), for storing embeddings.

In [7]:
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

dinov2_vits14.to(device)

transform_image = T.Compose([T.ToTensor(),
                             T.Resize((70, 210)),
                             #T.CenterCrop(224),
                             T.Normalize([0.5], [0.5])])

Using cache found in /home/codespace/.cache/torch/hub/facebookresearch_dinov2_main
xFormers is not available (SwiGLU)
xFormers is not available (Attention)
xFormers is not available (Block)
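
A note on the `T.Resize((70, 210))` target used above: `dinov2_vits14` is a ViT with a patch size of 14, and to our understanding the DINOv2 patch embedding expects the input height and width to be multiples of that patch size. The sketch below checks a resize target and reports how many patches it produces. `patch_grid` is a hypothetical helper, not part of DINOv2.

```python
PATCH_SIZE = 14  # patch size of the ViT-S/14 backbone ("dinov2_vits14")

def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of patches, or raise if the size does not divide evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"{height}x{width} is not a multiple of the patch size {patch_size}"
        )
    return height // patch_size, width // patch_size

rows, cols = patch_grid(70, 210)
print(rows, cols, rows * cols)  # 5 15 75
```

If you change the resize target, picking values that pass this check avoids a shape error when the image is passed to the model.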


In [8]:
def load_image(img: str) -> torch.Tensor:
    """
    Load an image and return a tensor that can be used as an input to DINOv2.
    """
    img = Image.open(img)

    transformed_img = transform_image(img)[:3].unsqueeze(0)

    return transformed_img

def compute_embeddings(files: list) -> dict:
    """
    Compute a DINOv2 embedding for every image in the specified list of files,
    save them all to all_embeddings.json, and return them as a dictionary.
    """
    all_embeddings = {}

    with torch.no_grad():
        for file in files:
            embeddings = dinov2_vits14(load_image(file).to(device))

            # Store each embedding as a nested list (1 x embedding_dim) so it can be written as JSON.
            all_embeddings[file] = embeddings[0].cpu().numpy().reshape(1, -1).tolist()

    with open("all_embeddings.json", "w") as f:
        json.dump(all_embeddings, f)

    return all_embeddings
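
Because `compute_embeddings` writes everything to `all_embeddings.json`, the embeddings can be reloaded later without running the model again. The sketch below reloads that file and runs a brute-force cosine-similarity search with NumPy, as a small stand-in for what a vector index would do. `load_embeddings` and `nearest` are hypothetical helper names.

```python
import json
import numpy as np

def load_embeddings(path="all_embeddings.json"):
    """Return (file_names, matrix), where matrix has one row per image."""
    with open(path) as f:
        stored = json.load(f)
    names = list(stored)
    matrix = np.array([stored[n] for n in names], dtype=np.float32)
    return names, matrix.reshape(len(names), -1)

def nearest(query, names, matrix, k=3):
    """Return the k (file_name, similarity) pairs most cosine-similar to query."""
    q = np.asarray(query, dtype=np.float32).ravel()
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (matrix @ q) / norms
    top = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in top]
```

For example, `names, matrix = load_embeddings()` followed by `nearest(matrix[0], names, matrix)` lists the stored images closest to the first one. Brute-force search is fine for a few thousand vectors; larger collections are where a dedicated index starts to pay off.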

## Compute Embeddings

The code below computes the embeddings for all the images in our dataset. Every image has to be passed through DINOv2 once, so the running time grows with the number of images and is noticeably longer on a CPU than on a GPU.

In [9]:
embeddings = compute_embeddings(files)

In [10]:
embedding_list = list(embeddings.values())
embedding_arr = np.array(embedding_list).reshape(-1, 384)

In [11]:
embedding_arr

array([[-4.21441078,  1.71051049, -3.24294972, ..., -1.11730194,
        -1.10640061,  3.24252033],
       [-3.27280474,  2.08943319, -3.88683748, ..., -0.32761729,
         0.81283861,  2.35233927],
       [-3.4268477 ,  0.75939184, -3.20614815, ..., -0.32847378,
        -1.09211266,  3.56648684],
       ...,
       [-2.74673676,  1.05394149, -3.04941797, ..., -0.57670152,
         0.15265657,  4.90965033],
       [-3.81139922,  1.18707597, -4.90517712, ..., -1.76032627,
        -0.96873438,  4.04915285],
       [-2.99168348,  1.48524296, -3.75221324, ..., -0.04493006,
        -1.31210852,  3.28821325]])
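
The `reshape(-1, 384)` above hard-codes the 384-dimensional output of the ViT-S/14 backbone, and the labels are collected separately from the embeddings. A minimal sketch of building the feature matrix and the label list together, from the same key order and without a fixed dimension, could look like this. `to_matrix` is a hypothetical helper name.

```python
import numpy as np

def to_matrix(embeddings, labels):
    """Return (X, y) with one row of X and one entry of y per embedded file."""
    names = list(embeddings)
    # np.ravel flattens each stored 1 x D nested list into a vector of length D.
    X = np.array([np.ravel(embeddings[n]) for n in names])
    y = [labels[n] for n in names]
    return X, y
```

With the notebook's variables this would be called as `X, y = to_matrix(embeddings, labels)`, and the same code keeps working if a larger DINOv2 backbone with a different embedding size is loaded.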

In [18]:
# Generate a 2D dimensionality reduction of the embeddings with t-SNE

from sklearn.manifold import TSNE

# Reduce dimensionality of the embeddings
tsne = TSNE(n_components=2, perplexity=30)
tsne_embeddings = tsne.fit_transform(embedding_arr)
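
scikit-learn's `TSNE` requires `perplexity` to be smaller than the number of samples, which can matter for small datasets. The sketch below picks a perplexity that respects that limit. `choose_perplexity` is a hypothetical helper, and the default target of 30 simply mirrors the value used above.

```python
def choose_perplexity(n_samples, target=30.0):
    """Return a t-SNE perplexity that stays below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(target, float(n_samples - 1))

print(choose_perplexity(200))  # 30.0
print(choose_perplexity(12))   # 11.0
```

It could be used as `TSNE(n_components=2, perplexity=choose_perplexity(len(embedding_arr)), random_state=42)`; passing `random_state` also makes the layout reproducible between runs.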


In [19]:
# Reduce the embeddings to 2D with UMAP (requires the `umap-learn` package)
import numpy as np
import umap

# Initialize UMAP model
umap_model = umap.UMAP(n_components=2, random_state=42)

# Fit and transform the data to generate embeddings
umap_embeddings = umap_model.fit_transform(embedding_arr)



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [20]:
# prompt: generate embedding plot with plotly

import plotly.express as px

fig = px.scatter(x=umap_embeddings[:, 0], y=umap_embeddings[:, 1], color=values, hover_name=files)
fig.show()



In [21]:
fig = px.scatter(x=tsne_embeddings[:, 0], y=tsne_embeddings[:, 1], color=values, hover_name=files)
fig.show()

## Train a Classification Model

The embeddings we have computed can be used as input to a classification model. For this guide, we will use a support vector machine (SVM). Note that scikit-learn's `SVC` uses an RBF kernel by default, so the classifier fit below is nonlinear; pass `kernel='linear'` if you want a linear model.

Below, we build a list of the embeddings we have computed and a list of their associated labels, then fit our model on them.

In [22]:
from sklearn import svm

clf = svm.SVC(gamma='scale')

y = [labels[file] for file in files]

embedding_list = list(embeddings.values())

clf.fit(np.array(embedding_list).reshape(-1, 384), y)
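
The cell above fits the classifier on every labeled image, so no data is held out to estimate how well it generalizes. A minimal sketch of stratified k-fold cross-validation with scikit-learn follows. The synthetic data is only a stand-in for the real embeddings, and `cross_validate_svm` is a hypothetical helper name.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validate_svm(X, y, n_splits=5, seed=0):
    """Return the per-fold accuracy of an SVC with scikit-learn's default RBF kernel."""
    clf = svm.SVC(gamma="scale")
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=folds)

# Synthetic stand-in data: two well-separated clusters of 8-dimensional points.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3, 1, (20, 8)), rng.normal(3, 1, (20, 8))])
y_demo = ["bear"] * 20 + ["bull"] * 20

scores = cross_validate_svm(X_demo, y_demo)
print(scores.mean())
```

On the real data this would be called as `cross_validate_svm(np.array(embedding_list).reshape(-1, 384), y)`. Each class needs at least `n_splits` images for the stratified split to work, so a smaller `n_splits` may be needed for small folders.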

## Classify an Image

We now have a classifier we can use to classify images!

The cell below picks the first PNG file in the `Now` folder. To classify a different image, change the `input_file` value to the path of that file.

Then, run the cell to classify the image.

In [33]:
import cv2

# Use the first PNG file found in the folder named Now
input_file = glob.glob("Now/*.png")[0]
print(input_file)
new_image = load_image(input_file)

%matplotlib inline
#sv.plot_image(image=cv2.imread(input_file), size=(8, 8))

with torch.no_grad():
    embedding = dinov2_vits14(new_image.to(device))

    prediction = clf.predict(np.array(embedding[0].cpu()).reshape(1, -1))

    print()
    print("Predicted class: " + prediction[0])

Now/Screenshot 2024-09-14 at 14.33.20.png

Predicted class: bull
