OWL-ViT
====

**Simple Open-Vocabulary Object Detection with Vision Transformers**

 * Paper: https://arxiv.org/abs/2205.06230

![OWL-ViT training](../assets/owlvit_training.jpg)



OWL-ViT is an open-vocabulary object detector: it localizes objects in a **class-agnostic** manner and only afterwards matches them against free-text queries.

 * Given an input image, it first identifies regions that may contain objects, without assuming any predefined categories.
 * Then, given a list of free-text queries, it scores each region by how likely it is to match each query.
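
The second step can be pictured as a similarity lookup between region embeddings and query embeddings. The following is a minimal sketch of that idea, not the model's actual classification head, which is learned and more involved. It assumes regions and queries have already been embedded into a shared space, uses made-up 3-dimensional vectors, and scores with plain cosine similarity; the helper names `cosine` and `score_regions` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def score_regions(region_embeds, query_embeds):
    """Return a [num_regions][num_queries] table of similarity scores."""
    return [[cosine(r, q) for q in query_embeds] for r in region_embeds]

# Hypothetical toy embeddings: 2 regions, 2 queries, 3 dimensions.
region_embeds = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
query_embeds = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

scores = score_regions(region_embeds, query_embeds)
# Index of the best-matching query for each region.
best_query = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(scores)
print(best_query)
```

Because the regions are proposed before any query is seen, the same region embeddings can be scored against a different list of queries without changing the localization step.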


**Installation**

```bash
pip install torch torchvision
pip install -q git+https://github.com/huggingface/transformers.git
pip install matplotlib
```

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = OwlViTProcessor.from_pretrained(
    "google/owlvit-base-patch32"
)
model = OwlViTForObjectDetection.from_pretrained(
    "google/owlvit-base-patch32"
)

model.eval().to(device);



In [6]:
from PIL import Image
import matplotlib.pyplot as plt

image_path = "../samples/plants.jpg"
text_queries = ["a plant", "a flower", "a tree", "a vase"]
image = Image.open(image_path).convert("RGB")

# Process image and text inputs
inputs = processor(
    text=text_queries, images=image,
    return_tensors="pt"
).to(device)

# Print input names and shapes
for key, val in inputs.items():
    print(f"{key}: {val.shape}")


input_ids: torch.Size([4, 16])
attention_mask: torch.Size([4, 16])
pixel_values: torch.Size([1, 3, 768, 768])
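
The `pixel_values` shape determines how many candidate regions the model produces. As a small worked calculation, assuming that the `patch32` suffix in the checkpoint name denotes a 32-pixel patch size and that one box is predicted per image patch:

```python
# Assumption: "patch32" in the checkpoint name means 32x32-pixel patches.
image_size = 768   # height and width of pixel_values above
patch_size = 32

patches_per_side = image_size // patch_size   # patches along one image edge
num_patches = patches_per_side ** 2           # patches in the full grid

print(patches_per_side, num_patches)
```

These values line up with the 24 x 24 grid in `image_embeds` and the 576 entries in `logits` and `pred_boxes` printed by the next cell. The patch size should also be readable from the loaded model's vision config, though the exact attribute path may depend on the installed `transformers` version.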


In [7]:
# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Detection-head outputs (the nested encoder outputs are printed separately)
for k, val in outputs.items():
    if k not in {"text_model_output", "vision_model_output"}:
        print(f"{k}: shape of {val.shape}")

# Raw outputs of the text encoder, one row per query
print("\nText model outputs")
for k, val in outputs.text_model_output.items():
    print(f"{k}: shape of {val.shape}")

# Raw outputs of the vision encoder
print("\nVision model outputs")
for k, val in outputs.vision_model_output.items():
    print(f"{k}: shape of {val.shape}")

logits: shape of torch.Size([1, 576, 4])
pred_boxes: shape of torch.Size([1, 576, 4])
text_embeds: shape of torch.Size([1, 4, 512])
image_embeds: shape of torch.Size([1, 24, 24, 768])
class_embeds: shape of torch.Size([1, 576, 512])

Text model outputs
last_hidden_state: shape of torch.Size([4, 16, 512])
pooler_output: shape of torch.Size([4, 512])

Vision model outputs
last_hidden_state: shape of torch.Size([1, 577, 768])
pooler_output: shape of torch.Size([1, 768])
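
The raw `logits` and `pred_boxes` still need to be turned into detections. The sketch below shows one way to do that by hand, under assumptions not stated above: a sigmoid maps each logit to an independent per-query score, each box is stored as normalized `(center_x, center_y, width, height)`, and the processor resized the image without padding. The function name `decode_detections`, the toy values, and the 0.1 threshold are illustrative and not part of the library.

```python
import math

def decode_detections(logits, boxes, image_size, labels, threshold=0.1):
    """Turn per-region logits and normalized boxes into pixel-space detections.

    logits:     [num_regions][num_queries] raw scores
    boxes:      [num_regions][4] normalized (cx, cy, w, h)  (assumed layout)
    image_size: (width, height) in pixels
    labels:     one text label per query
    """
    width, height = image_size
    detections = []
    for region_logits, (cx, cy, w, h) in zip(logits, boxes):
        # Keep only the best-matching query for this region.
        best = max(range(len(region_logits)), key=region_logits.__getitem__)
        score = 1.0 / (1.0 + math.exp(-region_logits[best]))  # sigmoid
        if score < threshold:
            continue
        # Center format -> corner format, then scale to pixels.
        x0, y0 = (cx - w / 2) * width, (cy - h / 2) * height
        x1, y1 = (cx + w / 2) * width, (cy + h / 2) * height
        detections.append((labels[best], score, (x0, y0, x1, y1)))
    return detections

# Hypothetical values for two regions and two queries.
toy_logits = [[2.0, -3.0], [-5.0, -6.0]]
toy_boxes = [[0.5, 0.5, 0.5, 0.25], [0.1, 0.1, 0.1, 0.1]]
for label, score, box in decode_detections(
    toy_logits, toy_boxes, image_size=(640, 480), labels=["a plant", "a vase"]
):
    print(label, round(score, 3), box)
```

With the tensors from the cell above, the same function could be called as `decode_detections(outputs.logits[0].tolist(), outputs.pred_boxes[0].tolist(), image.size, text_queries)`, where `image.size` is PIL's `(width, height)`. Recent `transformers` releases also ship a post-processing helper on the processor; its exact name has changed across versions, so check the documentation for the installed release.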
