<a href="https://colab.research.google.com/github/thesteve0/impatient-computer-vision/blob/main/7_visual_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visual Language Models

The models we have used up to now could handle only one mode of unstructured data: images. The next type of model we are going to cover is called multi-modal because it can handle multiple modes of input. In this case, the models are trained on both images and text, using image and text pairs. The model captures and combines the features (embeddings) of both modalities, which allows it to generate predictions for either one.

What this means in practice is that you can query the images using text, such as "images with cats", and you can also ask the model to describe what is in an image. You can still do the single-mode tasks, such as object detection, and you gain the benefit of a model that understands more of how humans perceive the image.
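To make the idea of querying images with text more concrete, here is a minimal sketch of ranking images against a text query by cosine similarity in a shared embedding space. The vectors are hand-written toy values, not output from any real model, and the file names and the `rank_images` helper are hypothetical. Real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-d vectors standing in for image embeddings (assumed values)
image_embeddings = {
    "cat_on_sofa.jpg": [0.9, 0.1, 0.0],
    "city_street.jpg": [0.1, 0.9, 0.2],
    "kitten_in_grass.jpg": [0.8, 0.0, 0.3],
}

# Toy vector standing in for the embedding of the text "images with cats"
text_embedding = [1.0, 0.0, 0.1]


def rank_images(text_vec, image_vecs):
    """Return (file name, similarity) pairs, most similar first."""
    scores = {
        name: cosine_similarity(text_vec, vec)
        for name, vec in image_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_images(text_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the text and the images live in the same vector space, the same similarity measure works across modalities, and the two cat images rank above the street scene.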

One of the earliest models of this type was built by OpenAI and is called CLIP. You actually saw the embeddings generated from this model back in the embeddings notebook. Those embeddings were generated using an open weights version of CLIP called OpenCLIP. Because they came from a multi-modal model, they captured more of the semantic features (the human understanding) of what was in the images.

We are going to use another YOLO model to help with one of the most labor-intensive parts of training computer vision models: creating annotation data. This use is sometimes called generative labeling. We use the model to generate an initial set of labels. While we still have to edit the labels, we avoid doing everything from scratch.

And with that, time to get our hands dirty - let's do the housekeeping:

In [None]:
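# Mount Google Drive so the notebook can read the saved dataset
from google.colab import drive
drive.mount('/content/drive')

!pip install fiftyone==1.4.1 torch torchvision ultralytics

import fiftyone as fo
import fiftyone.zoo as foz

name = "our-photos"
# Named dataset_dir rather than dir to avoid shadowing the Python builtin
dataset_dir = "/content/drive/MyDrive/impatient-cv/flickr-labeled"

# Load the dataset that was exported in FiftyOne's native format
dataset = fo.Dataset.from_dir(
    dataset_dir=dataset_dir,
    dataset_type=fo.types.FiftyOneDataset,
    name=name
)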


## Running models

There are all sorts of interesting tasks you can do with these multi-modal models. For example, [this code](https://github.com/thesteve0/photo-explorer/blob/main/4_word_cloud.py) generates a word cloud from ideas in the dataset's images. Today we are going to use these models to help us generate annotations for a dataset that has none.

When we did our original object detections, the model could only produce predictions from the classes it was trained on. We can use the multi-modal models to generate custom object detection classes. Generating these labels would be one of the first steps in training or fine-tuning our own model to work for our use cases.

In [None]:
# Just default predictions
model = foz.load_zoo_model("yolov8l-world-torch")

# apply_model runs the model on every sample and stores the results in label_field
dataset.apply_model(model, label_field="default_predictions")

#
# Make zero-shot predictions with custom classes
#

model = foz.load_zoo_model(
    "yolov8l-world-torch",
    classes=["car", "motorcycle", "building", "flower", "animal"],
)

dataset.apply_model(model, label_field="custom_predictions")

# Launch the FiftyOne App without opening it automatically, then show its URL
session = fo.launch_app(dataset, auto=False)
session.url
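
Since generated labels still need editing, one possible way to organize that review is to split predictions by confidence, so low-confidence ones get looked at first. The sketch below is plain Python and does not use the FiftyOne API. The `predictions` list, the `split_by_confidence` helper, and the 0.5 threshold are all hypothetical, chosen only for illustration.

```python
# Hypothetical stand-ins for model detections; not output from a real model
predictions = [
    {"label": "car", "confidence": 0.91},
    {"label": "flower", "confidence": 0.34},
    {"label": "animal", "confidence": 0.62},
    {"label": "building", "confidence": 0.12},
]


def split_by_confidence(detections, threshold=0.5):
    """Split detections into (likely correct, needs review) lists.

    The threshold is an assumed value; a sensible cutoff depends on
    the model and the dataset.
    """
    likely_correct, needs_review = [], []
    for det in detections:
        if det["confidence"] >= threshold:
            likely_correct.append(det)
        else:
            needs_review.append(det)
    return likely_correct, needs_review


likely_correct, needs_review = split_by_confidence(predictions)
print("Likely correct:", [d["label"] for d in likely_correct])
print("Needs review:", [d["label"] for d in needs_review])
```

High-confidence predictions can still be wrong, so this split only sets the order of review. It does not replace checking the labels.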