# Deep Learning Assignment 2023: Visual Grounding

The goal of the project is to develop a deep learning framework to perform visual grounding on the RefCOCOg dataset.

Throughout the work, multiple approaches have been tested, yielding very different results and important differences in terms of performance. To provide a complete overview of the work done, the notebook goes through all the approaches tested, presenting every idea that has been discussed and implemented in these months of work and discussing their pros and limitations. The approaches taken into account include:

- a baseline pipeline using YOLO for object proposals and CLIP for the grounding task;
- a pipeline which involves image segmentation, CLIP embedding and bounding box proposal;
- a pipeline using DETR for object proposals and CLIP for the grounding task;
- a pipeline only using MDETR for the grounding task (for the sake of having a SOTA comparison);
- a framework involving Diffusion for bounding box detection and CLIP for the grounding task;
- a framework involving Reinforcement Learning for bounding box regression.

The notebook will cover all the aforementioned approaches:

- (1) starting from a brief introduction, covering (i) the problem of visual grounding, (ii) a quick presentation of the dataset and (iii) CLIP;
- (2) continuing through the implementation of the different frameworks, discussing pros and cons;
- (3) and finally presenting the overall results achieved and providing some final considerations.

The notebook is meant to be run standalone here on Google Colab but, should it be useful, the codebase is also available on GitHub at [synchroazel/visual-grounding](https://github.com/synchroazel/visual-grounding). Please refer to the README for further instructions on how to run it locally.

## 1. Brief introduction

Here is a brief introduction on some of the pillars of the project, namely visual grounding, the RefCOCOg dataset and CLIP.

### 1.1 Visual Grounding

Visual grounding refers to the process of linking visual information with corresponding linguistic symbols, bridging the gap between visual perception and linguistic understanding. The goal is to enable machines to comprehend and interpret visual information in a way that is similar to human understanding.

Concretely, visual grounding can be approached in different ways, depending on the specific task and context. One common approach is to use techniques like object detection, image segmentation, or scene understanding to extract relevant visual features from an image or video. These features are then matched or aligned with corresponding textual descriptions or concepts using a multitude of approaches.

Many of the current approaches to visual grounding can be classified as either **one-stage** or **two-stage**.

- Two-stage methods formulate visual grounding as a matching problem between language and regions. In the first stage, visual region proposals are extracted by a pre-trained detector; in the second stage, they are matched with the given expression. However, the performance of these methods is highly dependent on the detector used in the first stage. Besides, matching must be performed for every region proposal, which considerably slows down the network.

- One-stage methods overcome the reliance on detectors and speed up inference by locating the object described by the sentence query directly in the image.

Today, SOTA approaches to visual grounding rely more and more on one-stage methods, especially those using transformer-based architectures. In this project the main focus will be on how to include and repurpose CLIP for the task, yet an implementation of MDETR will still be shown and briefly discussed.

### 1.2 RefCOCOg Dataset

The dataset used for this task is RefCOCOg, consisting of 49822 annotated objects across 25799 images, each object with one or more referring expressions, for a total of 95010 sentences.

Unlike RefCOCO and RefCOCO+, which were obtained through the two-player Refer-It game, RefCOCOg was collected on Amazon Mechanical Turk in a non-interactive setting. One set of workers was asked to write natural language referring expressions for objects in MSCOCO images; another set of workers was then asked to click on the indicated object given a referring expression. If the click overlapped with the correct object, the referring expression was considered valid and added to the dataset. If not, another referring expression was collected for the object.

...


### 1.3 Contrastive Language-Image Pre-Training (CLIP)



### 1.4 Preliminary steps

Here are some preliminary steps before we start discussing each approach in more detail.

In [None]:
#@title Install all necessary packages

!pip install transformers
!pip install ultralytics
!pip install git+https://github.com/openai/CLIP.git
!pip install torchmultimodal-nightly


In [None]:
#@title Import all necessary modules

import copy
import json
import os
import pickle
import re
import time
from datetime import datetime

import clip
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torchvision.transforms as T
from PIL import Image
from skimage import io
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.measure import regionprops
from skimage.segmentation import slic, watershed
from skimage.util import img_as_float
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset
from torchmultimodal.models.mdetr.model import mdetr_for_phrase_grounding
from torchvision import transforms
from torchvision.ops import box_iou
from torchvision.ops.boxes import box_convert
from tqdm import tqdm
from transformers import AutoImageProcessor, DetrForObjectDetection
from transformers import RobertaTokenizerFast
from transformers.utils import logging
from ultralytics import YOLO

%matplotlib inline


In [None]:
#@title Set the best device for the machine running the notebook

def get_best_device():
    if torch.cuda.is_available():
        device_ = torch.device("cuda")  # for CUDA GPU
        print("[INFO] Using cuda.")
    elif torch.backends.mps.is_available():
        device_ = torch.device("mps")  # for Apple Silicon GPU
        print("[INFO] Using MPS.")
    else:
        device_ = torch.device("cpu")
        print("[INFO] No GPU found, using CPU instead.")

    return device_


device = get_best_device()


[INFO] No GPU found, using CPU instead.


It is also important to define our dataset and sample objects, which are going to be used throughout the notebook.

- The `RefCOCOgSample` object defines a single sample from the RefCOCOg dataset, containing all the information (as attributes) that may be needed for visual grounding tasks.

- The `RefCOCOg` object is a wrapper around the RefCOCOg dataset, providing some useful methods to access the dataset and its samples. Most notably, it implements (1) a `__getitem__` method to access a single sample and (2) a `__len__` method to get the number of samples in the dataset. Also notice that the `__getitem__` method returns a dictionary containing the information that can be used to populate a `RefCOCOgSample` object (simply with `RefCOCOgSample(**output)`). The `__init__` method also takes care of loading the correct split we might need, accepting either `train`, `val` or `test` as arguments and using the annotations from `refs(umd).p` to select the samples of each.

In [None]:
#@title Class definition for a sample from the RefCOCOg dataset

class RefCOCOgSample:
    """
    An annotated image from RefCOCOg dataset.

    """

    def __init__(self,
                 img: Image.Image,
                 shape: tuple[int, int],
                 path: str,
                 img_id: str,
                 split: str,
                 category: str,
                 category_id: int,
                 sentences: list[str],
                 bbox: list[float],
                 segmentation: list[float]):
        self.img = img
        self.shape = shape
        self.path = path
        self.id = img_id
        self.split = split
        self.category = category
        self.category_id = category_id
        self.sentences = sentences
        self.bbox = bbox
        self.segmentation = segmentation

    def __repr__(self):
        return str(vars(self))


In [None]:
#@title Class definition for the RefCOCOg dataset

class RefCOCOg(Dataset):
    """
    Dataset object for RefCOCOg dataset.

    """

    def __init__(self, ds_path: str, split=None, transform=None):
        super(RefCOCOg, self).__init__()

        self.transform = transform

        self.ds_path = ds_path

        with open(f'{ds_path}/annotations/refs(umd).p', 'rb') as f:
            self.refs = pickle.load(f)

        with open(f"{ds_path}/annotations/instances.json", "r") as f:
            self.instances = json.load(f)

        self.categories = {
            item["id"]: {
                "supercategory": item["supercategory"],
                "category": item["name"]
            }
            for item in self.instances['categories']
        }

        self.instances = {inst['id']: inst for inst in self.instances['annotations']}

        # Keep only the references belonging to the requested split, if any
        if split in ("train", "val", "test"):
            self.refs = [ref for ref in self.refs if ref["split"] == split]

        self.size = len(self.refs)

    def __getitem__(self, idx: int):

        refs_data = self.refs[idx]

        ann_data = self.instances[refs_data['ann_id']]

        image_path = os.path.join(
            self.ds_path,
            "images",
            re.sub(r"_[0-9]+\.jpg", ".jpg", refs_data["file_name"])
        )

        pil_img = Image.open(image_path)

        bbox = torch.tensor(ann_data["bbox"])
        bbox = box_convert(bbox, "xywh", "xyxy").numpy()

        sample = {
            "img": pil_img,
            "shape": transforms.ToTensor()(pil_img).shape,
            "path": image_path,
            "img_id": refs_data["image_id"],
            "split": refs_data["split"],
            "category": self.categories[refs_data["category_id"]]["category"],
            "category_id": refs_data["category_id"],
            "sentences": [sentence["raw"].lower() for sentence in refs_data["sentences"]],
            "bbox": bbox,
            "segmentation": ann_data["segmentation"]
        }

        if self.transform:
            # Transform only the image, so that the returned dict keeps all its keys
            sample["img"] = self.transform(sample["img"], dtype=torch.float32)

        return sample

    def __len__(self):
        return self.size  # return the number of annotated images available using `refs(umd).p`


With that taken care of, we can now import the dataset and its splits.

In [None]:
#@title Import the RefCOCOg dataset and create train/validation/test splits

data_path = "dataset/refcocog"

dataset = RefCOCOg(ds_path=data_path)

train_ds = RefCOCOg(ds_path=data_path, split='train')
val_ds = RefCOCOg(ds_path=data_path, split='val')
test_ds = RefCOCOg(ds_path=data_path, split='test')

print(f"[INFO] Dataset Size: {len(dataset)}")
print(f"[INFO] train split:  {len(train_ds)}")
print(f"[INFO] val split:    {len(val_ds)}")
print(f"[INFO] test split:   {len(test_ds)}")


## 2. The pipelines

In the following section we are going to present the different visual grounding pipelines that have been implemented and tested.

**About the metrics**

Before diving into each approach, it is worth mentioning the metrics that have been used to evaluate the performance of each pipeline. The metrics used aim to evaluate the performance of the framework in terms of

- (1) localization accuracy
- (2) grounding accuracy
- (3) semantic similarity

Localization accuracy refers to the ability of the model to localize an object in the image, and it is measured using **Intersection over Union (IoU)**, namely the ratio between the area of overlap between the predicted bounding box and the ground-truth bounding box and the area of union between the two.
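
As a quick sanity check of this definition, the following minimal sketch computes the IoU of two boxes in `(x_min, y_min, x_max, y_max)` format. The helper name `box_iou_xyxy` and the example boxes are illustrative assumptions, and the boxes are treated as continuous rectangles (no `+1` pixel correction).

```python
def box_iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


gt_box = (0, 0, 10, 10)
pred_box = (5, 5, 15, 15)
print(f"IoU = {box_iou_xyxy(gt_box, pred_box):.4f}")  # 25 / 175, about 0.1429
```

Two 10x10 boxes shifted by half their side in both directions share only a quarter of their area each, which already drops the IoU to about 0.14: the metric is quite strict with respect to misalignment.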

Semantic similarity refers to the similarity between the predicted bounding box and the ground-truth description, and it is measured using **cosine similarity**, **Euclidean distance** or **dot product** between the CLIP embedding of the image region inside the predicted bounding box and that of the ground-truth textual description.
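
To make the difference between the three measures concrete, here is a minimal sketch on toy vectors. The 3-d vectors below are assumed values standing in for CLIP embeddings, and the helper names are illustrative only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_sim(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 3-d vectors standing in for CLIP embeddings (assumed values)
text_emb = [1.0, 2.0, 2.0]
image_emb = [2.0, 4.0, 4.0]  # same direction, twice the norm

print(dot(text_emb, image_emb))             # 18.0
print(cosine_sim(text_emb, image_emb))      # 1.0
print(euclidean_dist(text_emb, image_emb))  # 3.0
```

Cosine similarity only looks at the angle between the two vectors, so it is maximal here, while the dot product and the Euclidean distance also depend on the norms of the embeddings. This is why the three values can rank candidates differently.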

Grounding accuracy refers to the ability of the model to ground the localized object to a language description, and it is measured using **recall**, namely the ratio between the number of correctly grounded objects and the total number of objects. In practice:

- we get a CLIP encoding of each available category, wrapped in a dummy `f"a photo of {category}"` sentence
- we get a CLIP encoding of the image cropped to the bounding box proposed by the pipeline
- we compare those and get the category that is most similar
- if the category is the same as the one in the ground truth, we have a correct grounding attempt
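
The steps above can be sketched as follows. This is only an illustration: the 2-d vectors are assumed values standing in for the CLIP encodings of category prompts and cropped predictions, and `predict_category` and `grounding_recall` are hypothetical helper names.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def predict_category(crop_emb, category_embs):
    """Return the category whose text embedding is closest to the crop embedding."""
    return max(category_embs, key=lambda name: cosine(crop_emb, category_embs[name]))


def grounding_recall(samples, category_embs):
    """Fraction of samples whose predicted category equals the ground-truth one."""
    hits = sum(predict_category(emb, category_embs) == true_cat for emb, true_cat in samples)
    return hits / len(samples)


# Toy 2-d vectors standing in for CLIP encodings of the category prompts (assumed values)
category_embs = {"dog": [1.0, 0.1], "cat": [0.1, 1.0], "car": [-1.0, -1.0]}

# (embedding of the predicted crop, ground-truth category)
samples = [
    ([0.9, 0.2], "dog"),    # closest to "dog": correct
    ([0.2, 0.8], "cat"),    # closest to "cat": correct
    ([0.7, 0.3], "cat"),    # closest to "dog": wrong
    ([-0.5, -0.6], "car"),  # closest to "car": correct
]

print(grounding_recall(samples, category_embs))  # 0.75
```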


**About the implementation**

To simplify our codebase, and to provide extra readability, we used a common superclass `VisualGroundingPipeline` for all the visual grounding frameworks approached. This superclass provides a common interface for all the pipelines, such as

- (1) initialization of common attributes
- (2) a pair of methods to encode text and images using CLIP, after taking care of tokenization/preprocessing
- (3) a method to compute IoU
- (4) a method to compute the grounding accuracy
- (5) a method used to embed, on pipeline instantiation, all the available categories (useful, as mentioned, for the grounding accuracy computation).

All pipelines inherit from this superclass and implement one or more custom methods, most notably a `__call__` method which contains the core logic and returns the metrics of interest.

At subclass level, each pipeline's `__call__` method also accepts a set of common arguments for displaying and testing purposes, namely

- `show=True` will display the image with the predicted bounding box
- `timeit=True` will print the time taken to run the pipeline (it is recommended to use it without any visualization)
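
To illustrate the calling convention only, here is a minimal sketch of the superclass/subclass pattern. `BasePipeline` and `WholeImagePipeline` are hypothetical stand-ins that load no model at all: the toy subclass always predicts the whole image as bounding box, and `show` simply prints the prediction instead of plotting it.

```python
import time


class BasePipeline:
    """Hypothetical stand-in for the shared superclass (no CLIP model is loaded here)."""

    def __init__(self, quiet=True):
        self.quiet = quiet

    @staticmethod
    def _iou(box_a, box_b):
        # Boxes are (x_min, y_min, x_max, y_max)
        iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        inter = iw * ih
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def __call__(self, *args, **kwargs):
        raise NotImplementedError


class WholeImagePipeline(BasePipeline):
    """Toy subclass: always predicts the whole image as the bounding box."""

    def __call__(self, img_size, gt_bbox, prompt, show=False, timeit=False):
        start = time.perf_counter()

        width, height = img_size
        pred_bbox = (0, 0, width, height)  # core logic of this toy pipeline
        metrics = {"IoU": self._iou(pred_bbox, gt_bbox)}

        if show:
            print(f'"{prompt}" -> predicted box {pred_bbox}')
        if timeit:
            print(f"[INFO] Time elapsed: {time.perf_counter() - start:.4f}s")

        return metrics


pipeline = WholeImagePipeline()
result = pipeline((100, 50), gt_bbox=(0, 0, 50, 50), prompt="the left half", timeit=True)
print(result)  # {'IoU': 0.5}
```

Keeping the metrics in a plain dictionary makes it easy to collect the results of several pipelines into a single table later on.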

In [None]:
#@title Define utility functions used throughout the notebook

def IoU(true_bbox, predicted_bbox):
    # Determine the (x, y)-coordinates of the intersection rectangle
    xA = max(true_bbox[0], predicted_bbox[0])
    yA = max(true_bbox[1], predicted_bbox[1])
    xB = min(true_bbox[2], predicted_bbox[2])
    yB = min(true_bbox[3], predicted_bbox[3])

    # Compute the area of intersection rectangle
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)

    # Compute the area of both the prediction and ground-truth rectangles
    true_bboxArea = (true_bbox[2] - true_bbox[0] + 1) * (true_bbox[3] - true_bbox[1] + 1)
    predicted_bboxArea = (predicted_bbox[2] - predicted_bbox[0] + 1) * (predicted_bbox[3] - predicted_bbox[1] + 1)

    # Compute the intersection over union by taking the intersection area
    # and dividing it by the sum of prediction + ground-truth areas - the intersection area
    iou = interArea / float(true_bboxArea + predicted_bboxArea - interArea)

    return iou


def get_data(dataset):
    texts, images = list(), list()

    for sample in tqdm(dataset, desc="[INFO] Loading images and captions"):
        sample = RefCOCOgSample(**sample)

        for sentence in sample.sentences:
            images.append(sample.path)
            texts.append(sentence)

    return images, texts


def cosine_similarity(images_z: torch.Tensor, texts_z: torch.Tensor):
    # normalise the image and the text (out of place, so the caller's tensors are left untouched)
    images_z = images_z / images_z.norm(dim=-1, keepdim=True)
    texts_z = texts_z / texts_z.norm(dim=-1, keepdim=True)

    # evaluate the cosine similarity between the sets of features
    similarity = (texts_z @ images_z.T)

    return similarity


def display_preds(img, prompt, pred_bbox, gt_bbox, model_name):
    fig, ax = plt.subplots()
    ax.imshow(img)

    pred_rect = plt.Rectangle(
        (pred_bbox[0], pred_bbox[1]), pred_bbox[2] - pred_bbox[0], pred_bbox[3] - pred_bbox[1],
        linewidth=1.5, edgecolor=(0, 1, 0), facecolor='none'
    )

    gt_rect = plt.Rectangle(
        (gt_bbox[0], gt_bbox[1]), gt_bbox[2] - gt_bbox[0], gt_bbox[3] - gt_bbox[1],
        linewidth=1.5, edgecolor=(1, 0, 0), facecolor='none'
    )

    ax.add_patch(pred_rect)
    ax.text(pred_bbox[0], pred_bbox[1], "predicted", color=(1, 1, 1),
            bbox={"facecolor": (0, 1, 0), "edgecolor": (0, 1, 0), "pad": 2})

    ax.add_patch(gt_rect)
    ax.text(gt_bbox[0], gt_bbox[3], "true", color=(1, 1, 1),
            bbox={"facecolor": (1, 0, 0), "edgecolor": (1, 0, 0), "pad": 2})

    ax.axis("off")
    plt.title(f"\"{prompt.capitalize()}\"\n")
    plt.text(0.5, -0.075, f"using {model_name}", size=10, ha="center", transform=ax.transAxes)
    plt.show()


In [None]:
#@title Class definition for the VisualGroundingPipeline superclass

class VisualGroundingPipeline:

    def __init__(self,
                 categories,
                 clip_ver="RN50",
                 device="cpu",
                 quiet=True):
        self.categories = copy.deepcopy(categories)
        self.clip_ver = clip_ver
        self.clip_model, self.clip_prep = clip.load(clip_ver, device="cpu")
        self.device = device
        self.quiet = quiet

        # model is loaded to cpu first, and eventually moved to gpu
        # (trick Mac M1 to use f16 tensors)
        if self.device != "cpu":
            self.clip_model = self.clip_model.to(self.device)

        self._embed_categories()

    def _encode_text(self, text):
        text_ = clip.tokenize(text).to(self.device)

        with torch.no_grad():
            return self.clip_model.encode_text(text_)

    def _encode_img(self, image):
        image_ = self.clip_prep(image).unsqueeze(0).to(self.device)

        with torch.no_grad():
            return self.clip_model.encode_image(image_)

    @staticmethod
    def _IoU(pred_bbox, gt_bbox):
        iou = box_iou(
            torch.tensor(pred_bbox).unsqueeze(0),
            torch.tensor(gt_bbox).unsqueeze(0)
        ).item()

        return iou

    def _grounding_accuracy(self, img_sample, pred_image_enc):
        all_c_sims = dict()

        for category_id in self.categories.keys():
            cur_categ = self.categories[category_id]['category']
            cur_categ_enc = self.categories[category_id]['encoding'].float()

            all_c_sims[cur_categ] = cosine_similarity(pred_image_enc, cur_categ_enc)

        pred_category = max(all_c_sims, key=all_c_sims.get)

        # if not self.quiet:
        #     print(f"[INFO] true: {img_sample.category} | predicted: {pred_category}")

        return 1 if pred_category == img_sample.category else 0

    def _embed_categories(self):
        for category_id in self.categories.keys():
            cur_category = self.categories[category_id]['category']
            with torch.no_grad():
                cur_category_enc = self._encode_text(f"a photo of {cur_category}")
            self.categories[category_id].update({"encoding": cur_category_enc})

    def __call__(self, *args, **kwargs):
        return None

Also, before moving on, we pick a random sample image from the dataset to use as a quick test in the following sections.

In [None]:
#@title Pick a random sample from the dataset

idx = np.random.randint(0, len(dataset))
sample = RefCOCOgSample(**dataset[idx])

plt.imshow(sample.img)
plt.axis("off")
plt.title(f"Sample #{idx}", loc="left")
plt.show()

for i, sentence in enumerate(sample.sentences):
    print(f"[INFO] Sentence #{i}: {sentence}")


### 2.1 YOLO + CLIP

The first approach that has been tested is a baseline pipeline using **YOLO** for object proposals and CLIP for the grounding task. The idea is to use YOLO to extract the bounding boxes of the objects in an image and then use CLIP to find the best match between the bounding boxes and the sentence query. This latter part is simply done by computing CLIP embeddings of the cropped bounding boxes and the sentence query and then computing the cosine similarity between the two embeddings. The bounding box with the highest similarity score is then selected as the best match.

**YOLO**, which funnily stands for "You Only Look Once," is a popular object detection model in computer vision. It revolutionized real-time object detection by proposing a unified framework that simultaneously predicts object bounding boxes and class probabilities in a single pass. YOLO divides the input image into a grid and applies convolutional neural networks to each grid cell to predict bounding boxes and class probabilities. This approach allows YOLO to achieve impressive detection speeds while maintaining competitive accuracy.

This approach has been tested using both different versions of YOLO (YOLOv8x and YOLOv5su, which recently replaced YOLOv5s) and different visual backbones for CLIP (ResNet50, ResNet101 and ViT-L/14). The results are shown in the following table.

In [None]:
#@title Class definition for the YOLO+CLIP pipeline

class YoloClip(VisualGroundingPipeline):

    def __init__(self,
                 categories,
                 yolo_ver="yolov8x",
                 clip_ver="RN50",
                 device="cpu",
                 quiet=True):

        VisualGroundingPipeline.__init__(self, categories, clip_ver, device, quiet)

        valid_yolo_versions = ["yolov8x", "yolov5su"]
        if yolo_ver not in valid_yolo_versions:
            raise ValueError(f"Invalid YOLO version '{yolo_ver}'. Must be one of {valid_yolo_versions}.")

        # Load the weights only after the requested version has been validated
        self.yolo_ver = yolo_ver
        self.yolo_model = YOLO(self.yolo_ver + ".pt")

        print("[INFO] Initializing YoloClip pipeline")
        print(f"[INFO] YOLO version: {yolo_ver}")
        print("")

    def __call__(self, img_sample, prompt, show=False, show_yolo=False, timeit=False):

        if timeit:
            start = time.time()

        # Get sample image
        img = img_sample.img

        # Use YOLO to propose relevant objects
        yolo_results_ = self.yolo_model(img_sample.path, verbose=False)[0]
        yolo_results = yolo_results_.boxes.xyxy
        if not self.quiet:
            print(f"[INFO] YOLO found {yolo_results.shape[0]} objects")
        if yolo_results.shape[0] == 0:
            print(f"[WARN] YOLO ({self.yolo_ver}) couldn't find any object in {img_sample.path}!")
            return {"IoU": 0, "cosine": np.nan, "euclidean": np.nan, "dotproduct": np.nan, "grounding": np.nan}

        # Use CLIP to encode each relevant object image
        images_encs = list()
        for i in range(yolo_results.shape[0]):
            bbox = yolo_results[i, 0:4].cpu().numpy()
            sub_img = img.crop(bbox)
            with torch.no_grad():
                sub_img_enc = self._encode_img(sub_img)
            images_encs.append(sub_img_enc)
        images_encs = torch.cat(images_encs, dim=0)

        # Use CLIP to encode the text prompt
        prompt_enc = self._encode_text(prompt)

        # Compute the best bbox according to cosine similarity
        c_sims = cosine_similarity(prompt_enc, images_encs).squeeze()
        best_idx = int(c_sims.argmax())

        # Get best bbox
        pred_bbox = yolo_results[best_idx, 0:4].tolist()

        # Crop around the best bbox and encode
        pred_image = img.crop(pred_bbox)
        pred_image_enc = self._encode_img(pred_image)

        # Get ground truth bbox
        gt_bbox = img_sample.bbox

        """ Metrics computation """

        # Compute IoU
        iou = self._IoU(pred_bbox, gt_bbox)

        # Compute grounding accuracy
        grd_correct = self._grounding_accuracy(img_sample, pred_image_enc)

        # Compute distance metrics
        dotproduct = prompt_enc @ pred_image_enc.T  # dot product
        cosine_sim = cosine_similarity(prompt_enc, pred_image_enc)  # cosine similarity
        euclidean_dist = torch.cdist(prompt_enc, pred_image_enc, p=2).squeeze()  # euclidean distance

        """ Display results """

        # Show objects found by YOLO, if requested
        if show_yolo:
            plt.imshow(yolo_results_.plot()[..., ::-1])  # plot() returns a BGR array, flip it to RGB
            plt.axis("off")
            plt.title("YOLO findings")

        # Show the final prediction, if requested
        if show:
            display_preds(img, prompt, pred_bbox, gt_bbox, model_name=f"{self.yolo_ver} + CLIP({self.clip_ver})")

        # Print execution time, if requested
        if timeit:
            end = time.time()
            print(f"[INFO] Time elapsed: {end - start:.2f}s")

        return {
            "IoU": float(iou),
            "cosine": float(cosine_sim),
            "euclidean": float(euclidean_dist),
            "dotproduct": float(dotproduct),
            "grounding": float(grd_correct),
        }


With that said, we can instantiate two objects for the YOLO+CLIP pipeline, using different YOLO versions, and quickly see them running.

In [None]:
#@title Instantiate YOLO+CLIP pipelines

# YOLOv5 + CLIP
yolo5clip = YoloClip(dataset.categories, yolo_ver="yolov5su", quiet=True, device=device)

# YOLOv8 + CLIP
yolo8clip = YoloClip(dataset.categories, yolo_ver="yolov8x", quiet=True, device=device)


In [None]:
#@title Test the YOLOv5 + CLIP pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.<br>
#@markdown Also, `show_yolo` will display YOLO detections.
show = True #@param {type:"boolean"}
timeit = False #@param {type:"boolean"}
show_yolo = True #@param {type:"boolean"}

yolo5clip(sample, sample.sentences[0], show_yolo=show_yolo, show=show, timeit=timeit)


In [None]:
#@title Test the YOLOv8 + CLIP pipeline

yolo8clip(sample, sample.sentences[0], show_yolo=True, show=True, timeit=False)


### 2.2 Segmentation + CLIP

This approach is fundamentally based on the idea of segmenting the sample image into a set of regions and interpreting the similarity of each section to the prompt as a heatmap. A heatmap in this context is obtained as follows:

1. The sample image is segmented using a segmentation algorithm (e.g. SLIC, Watershed)
2. Each region is encoded using CLIP
3. The similarity between the prompt and each region is computed
4. Each "pixel" inside each region is assigned the similarity score of that region, producing a score heatmap
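
Step 4 above can be sketched in a few lines. The label map and the scores below are assumed toy values (in the pipeline they would come from the segmentation algorithm and from CLIP respectively), and `build_heatmap` is a hypothetical helper name. Plain Python lists are used so that the snippet has no dependencies.

```python
def build_heatmap(segments, scores):
    """Paint every pixel with the score of the region it belongs to.

    segments: 2-d list of integer region labels, starting at 1
    scores:   scores[k] is the prompt similarity of region k + 1
    """
    return [[scores[label - 1] for label in row] for row in segments]


# Toy 4x4 label map with three regions (assumed values, e.g. from SLIC or Watershed)
segments = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
    [3, 3, 3, 3],
]

# Assumed prompt-to-region similarities, one per region
scores = [0.10, 0.80, 0.30]

for row in build_heatmap(segments, scores):
    print(row)
```

With NumPy arrays the same result is obtained with a single indexing expression, `scores[segments - 1]`.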

That being said, computing one single heatmap may not be enough to capture both coarse and fine-grained image-text similarities. For this reason, we compute several heatmaps, each using a different number of segments, and finally pool all the heatmaps together by taking the mean score of each pixel. This is not all, though. At this point:

1. All pixels below a certain threshold are turned off (set to 0)
2. The heatmap is downsampled with a certain factor for lighter computations
3. The heatmap is normalized to the range [-1, 1]
4. The best bounding box is extracted from the heatmap, by considering all possible bounding boxes and selecting the one with the highest pixel sum
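
Steps 1 to 3 can be sketched as follows on a toy heatmap. This is a simplified illustration under stated assumptions: the quantile is computed with a nearest-rank rule, the rescaling is applied globally over the whole map, and all function names are hypothetical.

```python
def quantile_threshold(hmap, q):
    """Set to 0 every pixel below the q-th quantile of the pixel values."""
    values = sorted(v for row in hmap for v in row)
    # Nearest-rank style quantile; an assumption of this sketch
    cutoff = values[min(len(values) - 1, int(q * len(values)))]
    return [[v if v >= cutoff else 0.0 for v in row] for row in hmap]


def downsample(hmap, factor):
    """Average non-overlapping factor x factor blocks (trailing rows/columns are dropped)."""
    blocks_h, blocks_w = len(hmap) // factor, len(hmap[0]) // factor
    out = []
    for bi in range(blocks_h):
        out_row = []
        for bj in range(blocks_w):
            block = [hmap[bi * factor + di][bj * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def rescale(hmap, low=-1.0, high=1.0):
    """Min-max rescale the whole map to [low, high]."""
    flat = [v for row in hmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: nothing to rescale
        return [[low for _ in row] for row in hmap]
    return [[low + (v - lo) * (high - low) / (hi - lo) for v in row] for row in hmap]


hmap = [
    [0.1, 0.1, 0.2, 0.2],
    [0.1, 0.1, 0.2, 0.2],
    [0.3, 0.3, 0.9, 0.9],
    [0.3, 0.3, 0.9, 0.9],
]

filtered = quantile_threshold(hmap, q=0.75)  # keeps only the 0.9 block
small = downsample(filtered, factor=2)       # 2x2 map of block averages
print(rescale(small))                        # [[-1.0, -1.0], [-1.0, 1.0]]
```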

Note that:
- the threshold used to filter out pixels is taken as the value of a certain quantile `q` of the pixel values distribution
- the bbox search algorithm searches over all possible boxes, and that is why the heatmap is downsampled first
- the normalization also serves to give turned-off pixels negative values, discouraging the algorithm from selecting areas that contain many of them
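
The exhaustive box search can be sketched as below. Note that this sketch swaps in a summed-area table (integral image) to obtain each box sum in constant time, which is an assumption of the example rather than what the class in this notebook does; `best_box` is a hypothetical helper name and the toy heatmap contains assumed values.

```python
def best_box(hmap):
    """Exhaustively search the axis-aligned box with the largest sum of values.

    Returns (x_min, y_min, x_max, y_max) with exclusive max coordinates, and the score.
    """
    h, w = len(hmap), len(hmap[0])

    # Summed-area table: sat[y][x] is the sum of hmap[0:y][0:x]
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = hmap[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x]

    best_score, best = float("-inf"), None
    for y1 in range(h):
        for y2 in range(y1 + 1, h + 1):
            for x1 in range(w):
                for x2 in range(x1 + 1, w + 1):
                    # Box sum in O(1) from four table look-ups
                    score = sat[y2][x2] - sat[y1][x2] - sat[y2][x1] + sat[y1][x1]
                    if score > best_score:
                        best_score, best = score, (x1, y1, x2, y2)
    return best, best_score


# Toy map already rescaled to [-1, 1]: turned-off pixels are negative
hmap = [
    [-1.0, -1.0, -1.0, -1.0],
    [-1.0,  0.5,  1.0, -1.0],
    [-1.0,  0.5,  1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
]

box, score = best_box(hmap)
print(box, score)  # (1, 1, 3, 3) 3.0
```

Even with constant-time sums, the number of candidate boxes still grows with the square of both the height and the width of the map, which is why the search is only practical on a downsampled heatmap. The negative values of turned-off pixels are what stops the best box from growing to cover the whole map.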

Although it is quite flexible in terms of experimentation with hyperparameters, this model suffers from a few limitations, one of them being the slow inference time. This could definitely be improved in the future by defining another bounding box search algorithm, such as [...].


**Some implementation details**

The class implements a `_compute_hmap` method, which is responsible for computing the heatmap given a segmentation method and the numbers of segments to use it with. The method is used by `__call__`, which performs the other steps described above and returns the metrics. The `_find_best_bbox` method is responsible for the bbox search, as described.

When calling the pipeline, the user can specify two additional parameters:

- `show_process=True` will show the resulting heatmap alongside its filtered and downsampled versions
- `show_masks=True` will show all *N* masks before pooling, with *N* being the number of segments used

Also, note that the hyperparameters specified in the class instantiation below are those which, experimentally, gave the best results.

In [None]:
#@title Class definition for the Segmentation + CLIP pipeline

class ClipSeg(VisualGroundingPipeline):

    def __init__(self,
                 categories,
                 method,
                 n_segments,
                 clip_ver="ViT-L/14",
                 q=0.95,
                 d=16,
                 device="cpu",
                 quiet=False):

        VisualGroundingPipeline.__init__(self, categories, clip_ver, device, quiet)

        self.method = method
        self.n_segments = n_segments
        self.q = q
        self.d = d

        valid_methods = ["s", "w"]
        if self.method not in valid_methods:
            raise ValueError(f"Method `{method}` not supported. Supported methods are: {valid_methods}.")

        print("[INFO] Initializing ClipSeg pipeline")
        print(f"[INFO] Segmentation method: {method}")
        print(f"[INFO] Number of segments: {n_segments}")
        print(f"[INFO] Threshold q.tile for filtering: {q}")
        print(f"[INFO] Downsampling factor: {d}")
        print("")

    @staticmethod
    def _downsample_map(self, hmap, factor):
        # number of blocks in each dimension
        blocks_h = hmap.shape[0] // factor
        blocks_w = hmap.shape[1] // factor

        # reshape the original matrix into blocks
        blocks = hmap[:blocks_h * factor, :blocks_w * factor].reshape(blocks_h, factor, blocks_w, factor)

        # calculate the average of each block
        averages = blocks.mean(axis=(1, 3))

        return averages

    def _compute_hmap(self, img_sample, np_image, prompt, method, masks):

        # Make sure np_image is an image with shape (h, w, 3)
        if len(np_image.shape) > 3 or (len(np_image.shape) == 3 and np_image.shape[-1] != 3):
            np_image = np_image[:, :, 0]

        if len(np_image.shape) == 2:
            np_image = np.stack((np_image,) * 3, axis=-1)

        hmaps = list()

        prompt_enc = self._encode_text(prompt)

        for i, n in enumerate(masks):

            # Compute regions according to chosen method
            segments = None
            if method == "s":
                # SLIC segmentation algorithm
                segments = slic(np_image, n_segments=n, compactness=10, sigma=1)
            elif method == "w":
                # Watershed segmentation algorithm
                segments = watershed(sobel(rgb2gray(np_image)), markers=n, compactness=0.001)

            if segments is None:
                raise Exception("Segments are None. Is method different from 's' or 'w'? ")

            regions = regionprops(segments)

            if len(regions) == 1:
                # If the algo returned only 1 region, skip this iteration
                # (may happen, with low-segments masks)
                continue

            # Compute CLIP encodings for each region

            images_encs = list()

            regions = tqdm(regions, desc=f"[INFO] Computing CLIP masks", leave=False) if not self.quiet else regions

            for region in regions:
                rect = region.bbox
                rect = (rect[1], rect[0], rect[3], rect[2])

                sub_image = img_sample.img.crop(rect)
                image_enc = self._encode_img(sub_image)
                images_encs.append(image_enc)

            # Assign a score to each region according to prompt similarity (creating a heatmap)

            images_encs = torch.cat(images_encs, dim=0)
            scores = prompt_enc @ images_encs.T
            scores = scores.squeeze().cpu().numpy()
            # Paint each pixel with the score of the segment it belongs to (labels start at 1)
            heatmap = scores[segments - 1].astype(np.float64)

            hmaps.append(heatmap)

        # Finally, return the pooled heatmap and the list of all heatmaps computed

        pmap = np.mean(np.array(hmaps), axis=0)

        return pmap, hmaps

    @staticmethod
    def _find_best_bbox(heatmap, lower_bound=-1.0, upper_bound=1.0):
        # Rescale the heatmap (note: MinMaxScaler rescales each column independently)
        heatmap = MinMaxScaler(feature_range=(lower_bound, upper_bound)).fit_transform(heatmap)

        # Initialize the best score and best box
        best_score = float('-inf')
        best_box = None

        # Loop over all possible box sizes and positions
        for w in range(1, heatmap.shape[1] + 1):
            for h in range(1, heatmap.shape[0] + 1):
                for i in range(heatmap.shape[1] - w + 1):
                    for j in range(heatmap.shape[0] - h + 1):

                        # Get current sub-region
                        candidate = heatmap[j:j + h, i:i + w]

                        # Compute the score for this box
                        score = candidate.sum()

                        # Update the best score and best box if necessary
                        if score > best_score:
                            best_score = score
                            best_box = (i, j, w, h)

        best_box = [best_box[0], best_box[1], best_box[2] + best_box[0], best_box[3] + best_box[1]]

        return best_box

    def __call__(self, img_sample, prompt, show=False, show_process=False, show_masks=False, timeit=False):

        if timeit:
            start = time.time()

        """ Pipeline core """

        # Get sample image
        img = img_sample.img

        # Convert image to np array
        np_image = img_as_float(io.imread(img_sample.path))

        # Compute a heatmap of CLIP scores
        p_heatmap, heatmaps = self._compute_hmap(img_sample, np_image, prompt, self.method, self.n_segments)

        # Shut down pixels below a certain threshold
        ths = np.quantile(p_heatmap.flatten(), self.q)
        fp_heatmap = p_heatmap.copy()
        fp_heatmap[p_heatmap < ths] = ths

        # Downsample the heatmap by a factor d
        dfp_heatmap = self._downsample_map(fp_heatmap, self.d)

        # Find the best bounding box
        pred_bbox = self._find_best_bbox(dfp_heatmap, lower_bound=-0.75)

        if pred_bbox is None:
            return {"IoU": 0, "cosine": np.nan, "euclidean": np.nan, "dotproduct": np.nan, "grounding": np.nan}

        if self.d > 1:
            pred_bbox = [pred_bbox[0] * self.d + self.d // 2,
                         pred_bbox[1] * self.d + self.d // 2,
                         pred_bbox[2] * self.d - self.d // 2,
                         pred_bbox[3] * self.d - self.d // 2]

        # Use CLIP to encode the text prompt
        prompt_enc = self._encode_text(prompt).float()

        # Crop around the best bbox and encode
        pred_image = img.crop(pred_bbox)
        pred_image_enc = self._encode_img(pred_image)

        # Get ground truth bbox
        gt_bbox = img_sample.bbox

        """ Metrics computation """

        # Compute IoU
        iou = self._IoU(pred_bbox, gt_bbox)

        # Compute grounding accuracy
        grd_correct = self._grounding_accuracy(img_sample, pred_image_enc)

        # Compute distance metrics
        dotproduct = prompt_enc @ pred_image_enc.T  # dot product
        cosine_sim = cosine_similarity(prompt_enc, pred_image_enc)  # cosine similarity
        euclidean_dist = torch.cdist(prompt_enc, pred_image_enc, p=2).squeeze()  # euclidean distance

        """ Display results """

        # Show all masks, if requested
        if show_masks:
            # squeeze=False keeps `axes` an array even when a single heatmap is available
            fig, axes = plt.subplots(1, len(heatmaps), figsize=(20, 5), squeeze=False)
            axes = axes.ravel()
            for i, heatmap in enumerate(heatmaps):
                axes[i].axis("off")
                axes[i].imshow(np_image, alpha=0.25)
                axes[i].imshow(heatmap, alpha=0.75)
                axes[i].set_title(f"#{i + 1}")

        # Show the mask processing pipeline, if requested
        if show_process:
            fig, axes = plt.subplots(1, 4, figsize=(20, 5))

            for ax in axes.ravel():
                ax.axis("off")

            axes[0].imshow(np_image)
            axes[0].set_title("original image")

            axes[1].imshow(np_image, alpha=0.25)
            axes[1].imshow(p_heatmap, alpha=0.75)
            axes[1].set_title("pooled heatmap")

            axes[2].imshow(np_image, alpha=0.25)
            axes[2].imshow(fp_heatmap, alpha=0.75)
            axes[2].set_title("filtered heatmap")

            axes[3].imshow(np_image, alpha=0.25)
            w, h = np_image.shape[1], np_image.shape[0]
            dfp_heatmap_ = cv2.resize(dfp_heatmap, (w, h), interpolation=cv2.INTER_NEAREST)
            axes[3].imshow(dfp_heatmap_, alpha=0.75)
            axes[3].set_title("dsampled heatmap")

        # Show the final prediction, if requested
        if show:
            methods = {"w": "Watershed", "s": "SLIC"}
            display_preds(img_sample.img, prompt, pred_bbox, img_sample.bbox,
                          f"{methods[self.method]} + CLIP ({self.clip_ver})")

        # Print execution time, if requested
        if timeit:
            end = time.time()
            print(f"[INFO] Time elapsed: {end - start:.2f}s")

        return {
            "IoU": float(iou),
            "cosine": float(cosine_sim),
            "euclidean": float(euclidean_dist),
            "dotproduct": float(dotproduct),
            "grounding": float(grd_correct),
        }


With that said, we can instantiate two objects for the Segmentation+CLIP pipeline, conveniently named after the segmentation method used.

In [None]:
#@title Instantiate the segmentation + CLIP pipelines

# Watershed seg. + CLIP | pooling maps with 4, 8, 16, 32 segments | filtering below 0.75 q.tile
wshedclip = ClipSeg(dataset.categories, method="w", n_segments=(4, 8, 16, 32), q=0.75, quiet=False, device=device)

# SLIC seg. + CLIP | pooling maps with 4, 8, 16, 32 segments | filtering below 0.75 q.tile
slicnclip = ClipSeg(dataset.categories, method="s", n_segments=(4, 8, 16, 32), q=0.75, quiet=False, device=device)


In [None]:
#@title Test the Watershed + CLIP pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.<br>
#@markdown `show_process` will display the processing steps of the heatmap.<br>
#@markdown `show_masks` will display the computed CLIP heatmaps.
show = True #@param {type:"boolean"}
timeit = False #@param {type:"boolean"}
show_process = True #@param {type:"boolean"}
show_masks = True #@param {type:"boolean"}


wshedclip(sample, sample.sentences[0], show_process=show_process, show_masks=show_masks, show=show, timeit=timeit)


In [None]:
#@title Test the SLIC + CLIP pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.<br>
#@markdown `show_process` will display the processing steps of the heatmap.<br>
#@markdown `show_masks` will display the computed CLIP heatmaps.
show = True #@param {type:"boolean"}
timeit = False #@param {type:"boolean"}
show_process = True #@param {type:"boolean"}
show_masks = True #@param {type:"boolean"}

slicnclip(sample, sample.sentences[0], show_process=show_process, show_masks=show_masks, show=show, timeit=timeit)
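
The exhaustive search in `_find_best_bbox` re-sums the pixels of every candidate box, which gets expensive as the heatmap grows. As an illustrative alternative (not what the class above does), a summed-area table lets each box sum be read in constant time. The sketch below is a minimal version under stated assumptions: `best_box_integral` is a hypothetical helper name, the heatmap is assumed to be already rescaled, and ties between equally good boxes may be resolved differently than in the nested loops.

```python
import numpy as np


def best_box_integral(heatmap):
    """Return [x_min, y_min, x_max, y_max] of the sub-rectangle with the largest sum."""
    rows, cols = heatmap.shape

    # Summed-area table with a leading row and column of zeros
    sat = np.zeros((rows + 1, cols + 1))
    sat[1:, 1:] = heatmap.cumsum(axis=0).cumsum(axis=1)

    best_score, best_box = float("-inf"), None
    for h in range(1, rows + 1):
        for w in range(1, cols + 1):
            # Sums of all h x w windows at once: entry (j, i) covers heatmap[j:j+h, i:i+w]
            sums = sat[h:, w:] - sat[:-h, w:] - sat[h:, :-w] + sat[:-h, :-w]
            j, i = np.unravel_index(np.argmax(sums), sums.shape)
            if sums[j, i] > best_score:
                best_score = sums[j, i]
                best_box = (i, j, i + w, j + h)

    return [int(v) for v in best_box]


# Toy heatmap: a positive 2x2 blob surrounded by negative background
toy = np.array([[-1., -1., -1., -1.],
                [-1.,  2.,  3., -1.],
                [-1.,  1.,  2., -1.],
                [-1., -1., -1., -1.]])
print(best_box_integral(toy))  # [1, 1, 3, 3]
```

The two outer loops over box sizes remain, but the two inner loops over positions are replaced by array arithmetic.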


### 2.3 SSD + CLIP

This approach combines CLIP's encoding power with the Single Shot Detection (SSD) algorithm from NVIDIA.


SSD has two components: a backbone model and an SSD head.

- The **backbone** is usually a pre-trained image classification network, used as a feature extractor.
- The **SSD head** consists of one or more convolutional layers added on top of the backbone; its outputs are interpreted as the bounding boxes and classes of the objects found at the spatial locations of the final layers' activations.

SSD divides the image using a grid and makes each grid cell responsible for detecting objects in its region of the image. Detecting an object simply means predicting its class and location within that region. If no object is present, the cell is assigned the background class and the location is ignored. For instance, with a 4x4 grid, each of the 16 cells can output the position and shape of the object it contains.
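
As a rough illustration of the grid idea, the sketch below maps an object's center to the cell that would be responsible for it. It follows the simplified description above: the actual SSD predicts from several feature maps at different scales, each with multiple default boxes per location. The function name `responsible_cell` and the 300x300 image size are assumptions made for the example.

```python
def responsible_cell(box, image_size, grid=4):
    """Return (row, col) of the grid cell containing the center of `box`.

    box: (x_min, y_min, x_max, y_max) in pixels
    image_size: (width, height) in pixels
    """
    width, height = image_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so that a center lying exactly on the right/bottom edge stays in the grid
    col = min(int(cx / width * grid), grid - 1)
    row = min(int(cy / height * grid), grid - 1)
    return row, col


print(responsible_cell((150, 30, 290, 140), image_size=(300, 300)))  # (1, 2)
```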

As in the baseline, the pipeline first uses a detector (here SSD) to find relevant objects in the image, and then relies on CLIP for the actual grounding task, by comparing the embedding of each detection with that of the prompt.
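
The ranking step can be sketched independently of the detector. The snippet below is a minimal illustration with NumPy arrays standing in for CLIP embeddings; `pick_best_proposal` is a hypothetical helper, and the toy 3-dimensional vectors are made up for the example.

```python
import numpy as np


def pick_best_proposal(prompt_enc, crop_encs):
    """Return the index of the crop most similar to the prompt, plus all similarities.

    prompt_enc: array of shape (d,)   -- stand-in for a CLIP text embedding
    crop_encs:  array of shape (n, d) -- stand-ins for CLIP embeddings of n crops
    """
    # Normalize both sides so that the dot product equals the cosine similarity
    prompt_enc = prompt_enc / np.linalg.norm(prompt_enc)
    crop_encs = crop_encs / np.linalg.norm(crop_encs, axis=1, keepdims=True)
    sims = crop_encs @ prompt_enc  # shape (n,)
    return int(np.argmax(sims)), sims


# Toy embeddings: the second crop points almost in the same direction as the prompt
prompt = np.array([1.0, 0.0, 1.0])
crops = np.array([[0.0, 1.0, 0.0],
                  [2.0, 0.1, 2.0],
                  [1.0, 1.0, 0.0]])
best_idx, sims = pick_best_proposal(prompt, crops)
print(best_idx)  # 1
```

Because cosine similarity ignores vector length, the second crop wins even though its embedding has a larger norm than the prompt's.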

In [None]:
#@title Class definition for the SSD + CLIP pipeline

class ClipSSD(VisualGroundingPipeline):

    def __new__(cls, *args, **kwargs):

        # Single Shot Detector (SSD) requires CUDA.
        # Check that the selected device is CUDA before instantiating the class.
        # (Accepts both strings and torch.device objects, e.g. "cuda" or "cuda:0".)

        if torch.device(kwargs.get("device", "cpu")).type != "cuda":
            print("[ERROR] Single Shot Detector requires CUDA. Returning empty object.")
            print("")
            return VisualGroundingPipeline.__new__(VisualGroundingPipeline)
        else:
            return super(ClipSSD, cls).__new__(cls)

    def __init__(self,
                 categories,
                 confidence_t=0.5,
                 clip_ver="ViT-L/14",
                 device="cpu",
                 quiet=True):

        VisualGroundingPipeline.__init__(self, categories, clip_ver, device, quiet)

        self.confidence_t = confidence_t

        self.ssd_model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd')
        self.utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd_processing_utils')

        self.ssd_model.to(device)
        self.ssd_model.eval()

        print("[INFO] Initializing ClipSSD pipeline")
        print(f"[INFO] Confidence threshold: {confidence_t}")
        print("")

    def _propose(self, image_path, original_size):

        def _resize_bbox(bbox, in_size, out_size):
            """
            Resize bounding boxes according to image resize.

            Args:
                bbox: (np.ndarray) bounding boxes of (y_min, x_min, y_max, x_max)
                in_size: (tuple) the height and the width of the image before resizing
                out_size: (tuple) the height and the width of the image after resizing
            Returns:
                (np.ndarray) bounding boxes rescaled according to the given image shapes

            """
            bbox = bbox.copy()
            y_scale = float(out_size[0]) / in_size[0]
            x_scale = float(out_size[1]) / in_size[1]
            bbox[:, 0] = y_scale * bbox[:, 0]
            bbox[:, 2] = y_scale * bbox[:, 2]
            bbox[:, 1] = x_scale * bbox[:, 1]
            bbox[:, 3] = x_scale * bbox[:, 3]
            return bbox

        bboxes = []

        inputs = [self.utils.prepare_input(image_path)]
        tensor = self.utils.prepare_tensor(inputs)

        with torch.no_grad():
            detections_batch = self.ssd_model(tensor)

        results_per_input = self.utils.decode_results(detections_batch)
        best_results_per_input = [self.utils.pick_best(results, self.confidence_t) for results in results_per_input]

        bbox, _, _ = best_results_per_input[0]
        bbox *= 300
        bbox = _resize_bbox(bbox, (300, 300), original_size)
        bboxes.append(bbox)

        return np.float32(bboxes[0]).tolist()

    def __call__(self, img_sample, prompt, show=False, timeit=False):

        if timeit:
            start = time.time()

        """ Pipeline core """

        # Get sample image
        image_path = img_sample.path
        img = img_sample.img

        # Use SSD to propose relevant objects
        bboxes = self._propose(image_path, (img_sample.shape[1], img_sample.shape[2]))

        # Handle case where no object is proposed
        if len(bboxes) == 0:
            return {"IoU": 0, "cosine": np.nan, "euclidean": np.nan, "dotproduct": np.nan, "grounding": np.nan}

        # Use CLIP to encode each relevant object detected
        images_encs = list()
        for bbox in bboxes:
            sub_img = img.crop(bbox)
            with torch.no_grad():
                sub_img_enc = self._encode_img(sub_img)
            images_encs.append(sub_img_enc)
        images_encs = torch.cat(images_encs, dim=0)

        # Use CLIP to encode the text prompt
        prompt_enc = self._encode_text(prompt)

        # Find the best object according to cosine similarity
        c_sims = cosine_similarity(prompt_enc, images_encs).squeeze()
        best_idx = int(c_sims.argmax())

        # Get best bbox
        pred_bbox = bboxes[best_idx]

        # Use CLIP to encode the prompt
        prompt_enc = self._encode_text(prompt).float()

        # Crop around the best bbox and encode
        pred_image = img.crop(pred_bbox)
        pred_image_enc = self._encode_img(pred_image)

        # Get ground truth bbox
        gt_bbox = img_sample.bbox

        """ Metrics computation """

        # Compute IoU
        iou = self._IoU(pred_bbox, gt_bbox)

        # Compute grounding accuracy
        grd_correct = self._grounding_accuracy(img_sample, pred_image_enc)

        # Compute distance metrics
        dotproduct = prompt_enc @ pred_image_enc.T  # dot product
        cosine_sim = cosine_similarity(prompt_enc, pred_image_enc)  # cosine similarity
        euclidean_dist = torch.cdist(prompt_enc, pred_image_enc, p=2).squeeze()  # euclidean distance

        """ Display results """

        # Show the final prediction, if requested
        if show:
            display_preds(img, prompt, pred_bbox, gt_bbox, model_name="SSD+CLIP")

        # Print execution time, if requested
        if timeit:
            end = time.time()
            print(f"[INFO] Time elapsed: {end - start:.2f}s")

        return {
            "IoU": float(iou),
            "cosine": float(cosine_sim),
            "euclidean": float(euclidean_dist),
            "dotproduct": float(dotproduct),
            "grounding": float(grd_correct),
        }


With that said, we can instantiate and use an object for the SSD+CLIP pipeline.

In [None]:
#@title Instantiate the SSD + CLIP pipeline

# SSD + CLIP | with 0.01 confidence
ssdnclip = ClipSSD(dataset.categories, confidence_t=0.01, device=device)


In [None]:
#@title Test the SSD + CLIP pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.<br>
show = True #@param {type:"boolean"}
timeit = False #@param {type:"boolean"}

ssdnclip(sample, sample.sentences[0], show=show, timeit=timeit)


### 2.4 DETR + CLIP

Unlike traditional object detectors, **DETR (which stands for DEtection TRansformer)** approaches object detection as a direct set prediction problem. It combines a set-based global loss, which forces unique predictions via bipartite matching, with a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Since all predictions are produced in a single pass, the model needs no hand-designed components such as anchor generation or non-maximum suppression.

The DETR architecture basically consists of three blocks: (1) a CNN backbone extracts features from the image, (2) the resulting feature map, alongside a positional encoding, is fed to a **Transformer with an Encoder-Decoder architecture**, and (3) each decoder output is forwarded to a **feed-forward network (FFN)**. The box head is a 3-layer perceptron that outputs four values, namely the normalized center coordinates, height and width of the predicted bounding box, while a separate linear layer predicts the class label.
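
Since the network expresses each box through normalized center coordinates plus width and height, the predictions have to be converted to pixel corners before cropping. The sketch below shows the arithmetic under stated assumptions: `cxcywh_to_xyxy` is a hypothetical helper, and the box is taken to be ordered as (cx, cy, w, h). In the notebook this step is delegated to library post-processing.

```python
def cxcywh_to_xyxy(box, image_size):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    img_w, img_h = image_size
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


# A box centered in a 640x480 image, a quarter of its width and half of its height
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), image_size=(640, 480)))
# (240.0, 120.0, 400.0, 360.0)
```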

In this approach DETR is combined with CLIP to perform visual grounding. The DETR model is once again used to detect objects in the image, and CLIP is used to encode the text prompt and the objects detected. The object with the highest cosine similarity with the text prompt is selected as the grounding prediction.

In [None]:
#@title Class definition for DETR + CLIP pipeline

class DetrClip(VisualGroundingPipeline):
    def __init__(self,
                 categories,
                 clip_ver="RN50",
                 device="cpu",
                 quiet=True):

        logging.set_verbosity_error()

        VisualGroundingPipeline.__init__(self, categories, clip_ver, device, quiet)

        self.image_prep = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
        self.detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        print("[INFO] Initializing DetrClip pipeline")
        print("")

    def __call__(self, img_sample, prompt, show=False, show_detr=False, timeit=False):

        if timeit:
            start = time.time()

        """ Pipeline core """

        # Get sample image
        img = img_sample.img

        # Make sure image has shape (h, w, 3)
        np_image = np.array(img)
        if len(np_image.shape) > 3 or (len(np_image.shape) == 3 and np_image.shape[-1] != 3):
            np_image = np_image[:, :, 0]
        if len(np_image.shape) == 2:
            np_image = np.stack((np_image,) * 3, axis=-1)
        img = Image.fromarray(np_image)

        # Use DETR to find relevant objects
        inputs = self.image_prep(images=img, return_tensors="pt")
        with torch.no_grad():
            outputs = self.detr(**inputs)
        target_sizes = torch.tensor([img_sample.img.size[::-1]])
        results = self.image_prep.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
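        detr_results = results['boxes']

        # Handle case where no object is detected
        if detr_results.shape[0] == 0:
            return {"IoU": 0, "cosine": np.nan, "euclidean": np.nan, "dotproduct": np.nan, "grounding": np.nan}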

        # Use CLIP to encode each relevant object image
        images_encs = list()
        for i in range(detr_results.shape[0]):
            bbox = results['boxes'][i, 0:4].cpu().numpy()
            sub_img = img_sample.img.crop(bbox)
            with torch.no_grad():
                sub_img_enc = self._encode_img(sub_img)
            images_encs.append(sub_img_enc)
        images_encs = torch.cat(images_encs, dim=0)

        # Use CLIP to encode the text prompt
        prompt_enc = self._encode_text(prompt)

        # Find the best object according to cosine similarity
        c_sims = cosine_similarity(prompt_enc, images_encs).squeeze()
        best_idx = int(c_sims.argmax())

        # Get best bbox
        pred_bbox = detr_results[best_idx, 0:4].tolist()

        # Crop around the best bbox and encode
        pred_image = img.crop(pred_bbox)
        pred_image_enc = self._encode_img(pred_image)

        # Get ground truth bbox
        gt_bbox = img_sample.bbox

        """ Metrics computation """

        # Compute IoU
        iou = self._IoU(pred_bbox, gt_bbox)

        # Compute grounding accuracy
        grd_correct = self._grounding_accuracy(img_sample, pred_image_enc)

        # Compute distance metrics
        dotproduct = prompt_enc @ pred_image_enc.T  # dot product
        cosine_sim = cosine_similarity(prompt_enc, pred_image_enc)  # cosine similarity
        euclidean_dist = torch.cdist(prompt_enc, pred_image_enc, p=2).squeeze()  # euclidean distance

        """ Display results """

        # Show objects found by DETR, if requested
        if show_detr:
            fig, ax = plt.subplots()
            ax.imshow(img)
            for i in range(detr_results.shape[0]):
                bbox = results['boxes'][i].cpu().numpy()
                # print(bbox)
                rect = plt.Rectangle(
                    (bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1],
                    linewidth=2, edgecolor=(0, 1, 0), facecolor="none"
                )
                ax.add_patch(rect)
            ax.axis("off")
            plt.title("DETR findings")
            plt.show()

        # Show the final prediction, if requested
        if show:
            display_preds(img, prompt, pred_bbox, gt_bbox, model_name=f"DETR + CLIP ({self.clip_ver})")

        # Print execution time, if requested
        if timeit:
            end = time.time()
            print(f"[INFO] Time elapsed: {end - start:.2f}s")

        return {
            "IoU": float(iou),
            "cosine": float(cosine_sim),
            "euclidean": float(euclidean_dist),
            "dotproduct": float(dotproduct),
            "grounding": float(grd_correct),
        }


We can now instantiate the pipeline and test it on a sample image and prompt.

In [None]:
#@title Instantiate the DETR + CLIP pipeline

# DETR + CLIP
detrnclip = DetrClip(dataset.categories, quiet=True, device=device)


In [None]:
#@title Test the DETR + CLIP pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.<br>
#@markdown Also check `show_detr` to display DETR findings.
show = True #@param {type:"boolean"}
timeit = False #@param {type:"boolean"}
show_detr = True #@param {type:"boolean"}

detrnclip(sample, sample.sentences[0], show_detr=show_detr, show=show, timeit=timeit)


### 2.5 MDETR (SOTA)


MDETR (which stands for Modulated Detection for End-to-End Multi-Modal Understanding) is a state-of-the-art object detection and instance segmentation model that combines image and textual information to perform tasks such as object detection and segmentation in an end-to-end manner.

The MDETR model incorporates a transformer-based architecture, similar to those used in natural language processing tasks, to process both visual and textual data. It takes as input an image along with a textual query or question describing the desired objects to be detected. The model then processes the information in a joint manner, attending to both the visual and textual input, and produces bounding box predictions and segmentations for the specified objects in the image.

It differs from traditional object detection models by eliminating the need for separate modules for region proposal and object classification. It directly attends to the entire image and textual query, enabling a more unified and efficient approach to multi-modal understanding tasks. Because detection is conditioned ("modulated") on the text query, the model only outputs the objects the query refers to.

In particular, MDETR uses the following components:

- a **visual backbone** (RN101 in our code), which processes the input image and extracts visual features
- a **text encoder** (RoBERTa in our code), which converts the textual input into contextualized embeddings
- a **transformer encoder-decoder**, which receives the extracted visual and textual embeddings, alongside positional encodings for the visual features
- finally, **detection heads**, which generate bounding box predictions and instance segmentations

These components allow MDETR to combine image and textual information for end-to-end object detection and instance segmentation. The model attends to both visual and textual features simultaneously, enhancing its ability to understand and localize objects based on natural language queries. For this reason, MDETR does not need CLIP to perform grounding (in the class below CLIP is only used to compute the similarity metrics), and it is reported here solely to show the performance of a SOTA method in visual grounding.
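
The class below ranks MDETR's candidate boxes by one minus the softmax probability of the last logit, which it treats as the "no object" slot. The sketch below reproduces that scoring on plain Python lists; `object_confidence` is a hypothetical helper and the logits are made-up numbers.

```python
import math


def object_confidence(logits):
    """Confidence that a query matched something: 1 - softmax probability of the last slot.

    logits: raw scores for one query; the last entry is assumed to be "no object".
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    p_no_object = exps[-1] / sum(exps)
    return 1.0 - p_no_object


queries = {
    "query_a": [4.0, 1.0, 0.0],  # strong evidence for the first slot
    "query_b": [0.0, 0.0, 5.0],  # mostly "no object"
}
scores = {name: object_confidence(logits) for name, logits in queries.items()}
best = max(scores, key=scores.get)
print(best)  # query_a
```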

In [None]:
#@title Class definition for MDETR

class MDETRvg(VisualGroundingPipeline):

    def __init__(self,
                 categories,
                 clip_ver="RN101",
                 device="cpu",
                 quiet=True):

        VisualGroundingPipeline.__init__(self, categories, clip_ver, device, quiet)

        cpt_url = "https://pytorch.s3.amazonaws.com/models/multimodal/mdetr/pretrained_resnet101_checkpoint.pth"

        self.MDETR = mdetr_for_phrase_grounding()
        self.MDETR.load_state_dict(torch.hub.load_state_dict_from_url(cpt_url)["model_ema"])
        self.RoBERTa = RobertaTokenizerFast.from_pretrained("roberta-base")
        self.img_preproc = T.Compose([
            T.Resize(800),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

        print("[INFO] Initializing MDETR pipeline")
        print("")

    @staticmethod
    def rescale_boxes(boxes, size):
        w, h = size
        b = box_convert(boxes, "cxcywh", "xyxy")
        b = b * torch.tensor([w, h, w, h], dtype=torch.float32)
        return b

    def __call__(self, img_sample, prompt, show=True, timeit=False):

        if timeit:
            start = time.time()

        """ Pipeline core """

        # Get sample image
        img = Image.open(img_sample.path)

        # Make sure image has shape (h, w, 3)
        np_image = np.array(img)
        if len(np_image.shape) > 3 or (len(np_image.shape) == 3 and np_image.shape[-1] != 3):
            np_image = np_image[:, :, 0]
        if len(np_image.shape) == 2:
            np_image = np.stack((np_image,) * 3, axis=-1)
        img = Image.fromarray(np_image)

        # Encode the prompt with RoBERTa
        enc_text = self.RoBERTa.batch_encode_plus([prompt], padding="longest", return_tensors="pt")

        # Preprocess the image for MDETR
        img_transformed = self.img_preproc(img)

        # Run MDETR on image and prompt
        with torch.no_grad():
            out = self.MDETR([img_transformed], enc_text["input_ids"]).model_output

        # Parse MDETR results to get detections bboxes and probabilities
        probs = 1 - out.pred_logits.softmax(-1)[0, :, -1]
        boxes_scaled = self.rescale_boxes(out.pred_boxes[0, :], img.size)
        mdetr_results = pd.DataFrame(boxes_scaled.squeeze().numpy().reshape(-1, 4))
        mdetr_results.columns = ["xmin", "ymin", "xmax", "ymax"]
        mdetr_results["prob"] = probs.numpy()
        mdetr_results = mdetr_results.sort_values(by=['prob'], ascending=False)

        # Get best bbox
        pred_bbox = mdetr_results.iloc[0, :4].tolist()

        # Use CLIP to encode the prompt
        prompt_enc = self._encode_text(prompt)

        # Crop around the best bbox and encode
        pred_image = img.crop(pred_bbox)
        pred_image_enc = self._encode_img(pred_image)

        # Get ground truth bbox
        gt_bbox = img_sample.bbox

        """ Metrics computation """

        # Compute IoU
        iou = self._IoU(pred_bbox, gt_bbox)

        # Compute grounding accuracy
        grd_correct = self._grounding_accuracy(img_sample, pred_image_enc)

        # Compute distance metrics
        cosine_sim = cosine_similarity(prompt_enc, pred_image_enc)
        euclidean_dist = torch.cdist(prompt_enc, pred_image_enc, p=2).squeeze()
        dotproduct = prompt_enc @ pred_image_enc.T

        """ Display results """

        # Show the final prediction, if requested
        if show:
            display_preds(img, prompt, pred_bbox, gt_bbox, model_name="MDETR")

        # Print execution time, if requested
        if timeit:
            end = time.time()
            print(f"[INFO] Time elapsed: {end - start:.2f}s")

        return {
            "IoU": float(iou),
            "cosine": float(cosine_sim),
            "euclidean": float(euclidean_dist),
            "dotproduct": float(dotproduct),
            "grounding": float(grd_correct),
        }


With that said, we can instantiate and use an object for the MDETR pipeline.

In [None]:
#@title Instantiate the MDETR pipeline

# MDETR for visual grounding
mdetr = MDETRvg(dataset.categories, quiet=True, device=device)


In [None]:
#@title Test the MDETR pipeline

#@markdown Please flag `show` to show prediction, or `timeit` to time the process.
show = True #@param {type:"boolean"}
timeit = True #@param {type:"boolean"}

mdetr(sample, sample.sentences[0], show=show, timeit=timeit)


## 3. Testing and results

To test each visual grounding framework on the whole test split, a handy function is defined to run the pipeline on each sentence for each sample and compute the average metrics returned. The metrics are also printed in real time to monitor the progress of the testing process.
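
One detail worth spelling out is how failed predictions enter the averages: the pipelines above return NaN similarity metrics (and an IoU of 0) when no box is proposed, so NaN entries have to be skipped when averaging. A minimal sketch of this aggregation, with a hypothetical helper `average_metrics` and made-up scores:

```python
import math


def average_metrics(scores):
    """Average each metric over a list of result dicts, skipping NaN entries."""
    averages = {}
    for metric in scores[0]:
        valid = [s[metric] for s in scores if not math.isnan(s[metric])]
        averages[metric] = sum(valid) / len(valid) if valid else float("nan")
    return averages


results = [
    {"IoU": 0.5, "cosine": 0.25},
    {"IoU": 0.0, "cosine": float("nan")},  # failed sample: no box was proposed
    {"IoU": 1.0, "cosine": 0.75},
]
print(average_metrics(results))  # {'IoU': 0.5, 'cosine': 0.5}
```

The check uses `math.isnan` rather than an identity comparison, since two NaN objects created in different places are generally not the same object.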

In [None]:
#@title Function to test a given visual grounding pipeline on a given dataset

def visual_grounding_test(vg_pipeline, dataset, logging=False):
    scores = list()

    pbar = tqdm(dataset)

    # Build the log file name once, so that a whole run is written to a single file
    if logging:
        pipeline_name = vg_pipeline.__class__.__name__.lower()
        run_tag = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
        log_path = f"logs/{pipeline_name}_log_{run_tag}.txt"

    for sample in pbar:

        sample = RefCOCOgSample(**sample)

        for sentence in sample.sentences:

            sc = vg_pipeline(sample, sentence, show=False)

            scores.append(sc)

            # Running average of each metric, skipping NaN entries (failed predictions)
            avg_metrics = list()

            for metric in scores[0].keys():
                avg_metric = np.mean([score[metric] for score in scores if not np.isnan(score[metric])])
                avg_metric = f"{metric}: {avg_metric:.3f}"
                avg_metrics.append(avg_metric)

            pbar_desc = " | ".join(avg_metrics)

            if logging:
                timestamp = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

                with open(log_path, "a") as f:
                    f.write("[" + timestamp + "] " + pbar_desc + "\n")

            pbar.set_description(pbar_desc)


(You can use the dropdown menu to choose one of the pipelines defined above and test it right now. Note, however, that the testing process can be pretty lengthy.)

In [None]:
#@title Test one of the pipelines defined

pipeline = yolo5clip  #@param ["yolo5clip", "yolo8clip", "wshedclip", "slicnclip", "detrnclip", "ssdnclip", "mdetr"]

visual_grounding_test(pipeline, test_ds)


### 3.1 Results discussion

The results achieved by testing each approach on the test split of RefCOCOg are reported below.
Note that the pipelines supporting a number of hyperparameters have been tested using the combination of parameters which achieved the best compromise between performance and execution time.

| Pipeline               | avg. IoU               | avg. cosine sim.     | avg. euclidean dist. | avg. dotproduct       | avg. grounding acc.   |
|------------------------|------------------------|----------------------|----------------------|-----------------------|-----------------------|
| YOLOv8x+CLIP(RN50)     | 0.554                  | 0.238                | 1.234                | 5.259                 | 0.353                 |
| YOLOv8x+CLIP(RN101)    | 0.550                  | 0.459 *              | 1.040 *              | 8.842                 | 0.486                 |
| YOLOv8x+CLIP(ViT-L/14) | 0.552                  | 0.254                | 1.221                | 0.254                 | 0.520 **              |
| YOLOv5s+CLIP(RN50)     | 0.552                  | 0.238                | 1.234                | 0.238                 | 0.495                 |
| YOLOv5s+CLIP(RN101)    | 0.550                  | 0.238                | 1.235                | 0.238                 | 0.496                 |
| YOLOv5s+CLIP(ViT-L/14) | 0.533                  | 0.253                | 1.222                | 0.253                 | 0.515                 |
| W.SHED+CLIP(ViT-L/14)  | 0.219                  | 0.243                | 1.230                | 0.243                 | 0.525 *               |
| SLIC+CLIP(ViT-L/14)    | 0.180                  | 0.228                | 1.242                | 3.185                 | 0.373                 |
| SSD+CLIP(RN50)         | 0.175                  | 0.217                | 1.251                | 2.319                 | 0.355                 |
| SSD+CLIP(RN101)        | 0.172                  | 0.442                | 1.056                | 3.478                 | 0.365                 |
| SSD+CLIP(ViT-L/14)     | 0.171                  | 0.225                | 1.245                | 3.138                 | 0.400                 |
| DETR+CLIP(RN50)        | 0.560 **               | 0.237                | 1.235                | 0.237                 | 0.496                 |
| DETR+CLIP(RN101)       | 0.547                  | 0.458 **             | 1.041 **             | 0.458                 | 0.482                 |
| DETR+CLIP(ViT-L/14)    | 0.537                  | 0.252                | 1.223                | 0.252                 | 0.514                 |
| MDETR (CLIPw/RN50)     | 0.617 *                | 0.225                | 1.244                | 0.225                 | 0.483                 |

`*` = best <br>
`**` = second best

Considering the results achieved, we can draw some conclusions.

Overall, scoring better than the YOLO baseline has proven pretty hard. Across the different approaches:

- **The YOLO+CLIP baseline** retained its primacy, at least in terms of semantic similarity (in particular with YOLOv8x and CLIP with a RN101 backbone).

- **The DETR+CLIP approach** is a close runner-up, with pretty similar metrics in terms of semantic similarity (again with a RN101 backbone for CLIP). The same approach (with a ResNet50 CLIP backbone) was also our second best framework in terms of localization accuracy.

- **MDETR** shows all its SOTA potential in terms of IoU, achieving the best result among all the tested pipelines.

- **The Segmentation + CLIP approach** surely deserves an honorable mention: its Watershed variant achieves the best grounding accuracy, despite unsatisfactory results in terms of IoU.
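
For reference, the IoU figures discussed above measure how much the predicted box overlaps the ground-truth one. The sketch below is a minimal version assuming boxes in (x_min, y_min, x_max, y_max) format; the `_IoU` method used by the pipelines may differ in its details, and `iou` is a hypothetical stand-in.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the intersection are clamped to zero for disjoint boxes
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes sharing half of their area: 50 / 150
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```

An IoU of 1 means a perfect match and 0 means no overlap, which is also the value the pipelines report when no box is proposed.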

### 3.2 Future directions

...

## References

