# Image Comparison Model!

The Vision Transformer (ViT) is a deep learning model architecture introduced by Google Research for computer vision tasks. It represents a significant departure from traditional convolutional neural networks (CNNs) commonly used in image processing tasks. Instead of processing entire images directly, ViT divides input images into fixed-size patches and flattens them into sequences. Each patch is treated as a token and processed by the Transformer encoder. We import this model from Hugging Face here: https://huggingface.co/google/vit-base-patch16-224-in21k.
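To make the patching step concrete, here is a minimal pure-Python sketch of how an image can be cut into non-overlapping patches and flattened into a sequence. The function name `split_into_patches` and the single-channel dummy image are assumptions made for illustration; this is not the code the model itself runs.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk over the image in patch-sized steps, row-major order
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 224x224 single-channel image filled with zeros
image = [[0] * 224 for _ in range(224)]
patches = split_into_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 256
```

With three colour channels, each flattened 16x16 patch holds 16 * 16 * 3 = 768 values, and the model prepends a `[CLS]` token, so the encoder sees a sequence of 197 tokens for a 224x224 input.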

## Loading the Model ##

Here, we load the model from Hugging Face and import a fingerprint database. Note that this checkpoint is a general-purpose model rather than one trained specifically on fingerprints: it was pre-trained on ImageNet-21k (14 million images, 21,843 classes).

In [2]:
!!pip install transformers datasets -q

[]

In [5]:
!pip install torch

Collecting torch
  Downloading torch-2.2.2-cp311-none-macosx_10_9_x86_64.whl.metadata (25 kB)
Collecting networkx (from torch)
  Downloading networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Downloading torch-2.2.2-cp311-none-macosx_10_9_x86_64.whl (150.8 MB)
Downloading networkx-3.3-py3-none-any.whl (1.7 MB)
Installing collected packages: networkx, torch
Successfully installed networkx-3.3 torch-2.2.2


In [21]:
!pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/cpu
Collecting torchvision
  Downloading torchvision-0.17.2-cp311-cp311-macosx_10_13_x86_64.whl.metadata (6.6 kB)
Collecting torchaudio
  Downloading torchaudio-2.2.2-cp311-cp311-macosx_10_13_x86_64.whl.metadata (6.4 kB)
Downloading torchvision-0.17.2-cp311-cp311-macosx_10_13_x86_64.whl (1.7 MB)
Downloading torchaudio-2.2.2-cp311-cp311-macosx_10_13_x86_64.whl (3.4 MB)
Installing collected packages: torchvision, torchaudio
Successfully installed torchaudio-2.2.2 torchvision-0.17.2


In [None]:
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests
import os
import numpy as np
import torch.nn.functional as F

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device);

In [9]:
data_dir = "/Users/crosas/Downloads/CrossMatch_Sample_DB"
image_files = os.listdir(data_dir)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/crosas/Downloads/CrossMatch_Sample_DB'

## Helper Functions: Process & Compare ##

The following function loads a single image from our imported directory, converts it to RGB if needed, and passes it through the image processor so that its dimensions match what the model expects.

In [None]:
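def process_image(image_path):
    # Load the image and make sure it has three channels, as the model expects
    image = Image.open(image_path)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    # `processor` is the ViTImageProcessor loaded above; it resizes and normalizes the image
    inputs = processor(images=image, return_tensors="pt")
    # Move the tensors to the same device as the model
    inputs = inputs.to(device)
    return inputs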

In [None]:
def compute_similarity(embeddings1, embeddings2):
    embeddings1 = embeddings1.reshape(1, -1)
    embeddings2 = embeddings2.reshape(1, -1)
    similarity = F.cosine_similarity(torch.tensor(embeddings1), torch.tensor(embeddings2), dim=1).item()
    return similarity

This function performs a cosine similarity comparison on our embeddings. Cosine similarity quantifies how similar two vectors are by computing the cosine of the angle between them. It ranges from -1 to 1 (i.e., the closer to 1, the more similar the two vectors).
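As a small worked illustration of the measure itself, the sketch below computes cosine similarity as the dot product divided by the product of the vector norms, using only the standard library. The function name `cosine_similarity` and the toy vectors are assumptions for illustration. Unlike `F.cosine_similarity`, this version raises an error on a zero vector instead of guarding the denominator with a small epsilon.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Note that the magnitude of the vectors does not matter: `[1.0, 0.0]` and `[2.0, 0.0]` point the same way, so their similarity is 1.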

## Create Embeddings ##

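Below, we loop through the image files to get the embeddings from the model. For each TIFF image, we average the model's last hidden state over all tokens to obtain a single embedding vector.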

In [None]:
image_embeddings = []

for image_file in image_files:
    if image_file.endswith('.tif') or image_file.endswith('.tiff'):
        image_path = os.path.join(data_dir, image_file)
        inputs = process_image(image_path)
        
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = outputs.last_hidden_state.mean(dim=1)
        
        image_embeddings.append(embeddings.cpu().numpy())
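The call `outputs.last_hidden_state.mean(dim=1)` in the loop above collapses the per-token vectors into one vector per image by averaging over the token dimension. The pure-Python sketch below shows the same operation on a toy input; the helper name `mean_pool` and the example values are assumptions for illustration only.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    num_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    # For each hidden dimension, average that component across all tokens
    return [sum(vec[i] for vec in token_vectors) / num_tokens for i in range(dim)]


# Three hypothetical tokens with two hidden dimensions each
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]
```

For this checkpoint, a 224x224 input produces 197 token vectors (196 patches plus the `[CLS]` token) of size 768, so each image ends up as one 768-dimensional embedding. The average in the loop includes the `[CLS]` token along with the patch tokens.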


## Perform Comparisons! ##

In [20]:
image_embeddings = np.vstack(image_embeddings)
similarity = compute_similarity(image_embeddings[0], image_embeddings[99])
print("Similarity between the first and 100th images:", similarity)

Similarity between the first and 100th images: 0.7751970887184143
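A single score can be easier to interpret next to the scores for other images. As one possible extension, the sketch below ranks every other embedding by its cosine similarity to a query image. The helper names `cosine` and `rank_by_similarity` and the toy embeddings are hypothetical; with the notebook's data, one could pass `image_embeddings.tolist()` instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_by_similarity(query_index, embeddings):
    """Return (index, score) pairs for all other embeddings, most similar first."""
    query = embeddings[query_index]
    scores = [
        (i, cosine(query, emb))
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings standing in for real image embeddings
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
ranking = rank_by_similarity(0, toy)
print([index for index, _ in ranking])  # [1, 2, 3]
```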
