This is a quick experiment to test the assumption that embeddings taken straight from an Owl-VIT model, with no additional training, are good enough for zero-shot classification.

The assumption is that Owl-VIT, out of the box, produces a meaningful embedding for each detected object and places these embeddings in latent space such that the distances between different objects are meaningful. The test is as follows:

1. Pick an image containing *one* object on a blank background.
2. Gather the embeddings, reduce their dimensionality, and visualize them. We should see two clearly separable clusters: one for the object embeddings and one for the background noise embeddings (there may be more than one "background" cluster, since I'm not sure how Owl handles noise embeddings).
3. Use k-means with k=2 to assign each point to a cluster in an unsupervised manner.
4. Overlay the bounding boxes for each cluster on the image.

For each image, we should then see the non-noise cluster's boxes sitting around the object of interest, and the background cluster's boxes scattered chaotically across the image.
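
The reduce, cluster, and group-by-label steps can be sketched on synthetic data, without loading the model. This is only an illustration under stated assumptions: the embeddings and boxes are random stand-ins, PCA is assumed as the dimensionality reduction (the notebook's `notebook_helper.get_reduced` may use something else), and the sizes 40 and 160 are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-box embeddings (assumption, not real model output):
# 40 "object" vectors and 160 "background" vectors around two distant centers.
dim = 64
object_emb = rng.normal(loc=3.0, scale=0.5, size=(40, dim))
background_emb = rng.normal(loc=-3.0, scale=0.5, size=(160, dim))
embeddings = np.vstack([object_emb, background_emb])

# One placeholder box row per embedding (random values, not valid geometry).
boxes = rng.uniform(0.0, 1.0, size=(len(embeddings), 4))

# Step 2: reduce to 3 dimensions. PCA is an assumption made for this sketch.
reduced = PCA(n_components=3, random_state=0).fit_transform(embeddings)

# Step 3: split the points into two clusters without using any labels.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(reduced)

# Step 4 (selection only): group the boxes by cluster label with a boolean mask.
boxes_per_cluster = {int(k): boxes[labels == k] for k in np.unique(labels)}
for k, cluster_boxes in boxes_per_cluster.items():
    print(f"cluster {k}: {len(cluster_boxes)} boxes")
```

On real embeddings the clusters will not be this cleanly separated, so the cluster sizes and the 3D plot are worth inspecting before trusting the overlay.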

**Result:** As expected. The boxes cluster where you'd expect, hinting that the embeddings are useful right out of the box.

In [1]:
import notebook_helper
import sys

sys.path.append("../")
from transformers import OwlViTProcessor
import torch
from PIL import Image
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

from src.models import OwlViT, PostProcess
from src.util import BoxUtil
from src.main import model_output_to_image

n_kmeans_clusters = 2
impath = "assets/dog-on-white.jpg"
image = Image.open(impath)
w, h = image.size 

image_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViT(num_classes=0)  # no classes since we're not using the classifier, just the image embedder
model.eval()
post = PostProcess(confidence_threshold=0.0, iou_threshold=1.0)  # keep all boxes

image = image_processor(images=image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    pred_boxes, embeddings = model(image, return_with_embeddings=True)
    pred_boxes = model_output_to_image(pred_boxes, {"width": w, "height": h})
    embeddings = embeddings.squeeze(0).numpy()

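# Reduce the embeddings to 3 dimensions so they can be plotted and clustered
reduced = notebook_helper.get_reduced(embeddings, 3)
# Split the reduced points into clusters without using any labels
kmeans = KMeans(n_clusters=n_kmeans_clusters, random_state=0, n_init="auto").fit(reduced)
labels = torch.tensor(kmeans.labels_).unsqueeze(0)  # add a leading batch dimension
print(kmeans.labels_)
fig = notebook_helper.make_plot_3d(reduced, colors=kmeans.labels_)
display(fig)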

# Draw each cluster's boxes on the original image, one figure per cluster
for label in range(n_kmeans_clusters):
    _pred_boxes = pred_boxes[torch.where(labels == label)].unsqueeze(0)
    image_with_boxes = BoxUtil.draw_box_on_image(impath, _pred_boxes)
    # Reorder channels-first to channels-last for matplotlib
    plt.imshow(image_with_boxes.squeeze(0).permute(1, 2, 0).numpy(), interpolation="nearest")
    plt.show()

TypeError: OwlViT.__init__() got an unexpected keyword argument 'num_classes'