# Understanding Image Space

Let's use state-of-the-art [OpenCLIP](https://github.com/mlfoundations/open_clip?tab=readme-ov-file) models to embed text or images into a multi-modal vector space.

In [1]:
%pip install superlinked==37.3.0

In [2]:
from io import BytesIO

import requests
import PIL.Image
import PIL.ImageFile

from superlinked import framework as sl

DATA_URL: str = "https://storage.googleapis.com/superlinked-notebook-feature-image-embedding"

Let's create our dataset and load some images into it.

In [3]:
def open_image_from_public_gcs(public_url: str) -> PIL.ImageFile.ImageFile:
    # Fetch the image using the public URL
    response = requests.get(public_url, timeout=5)

    # Ensure the request was successful
    if response.status_code == 200:
        # Open the image with PIL using the downloaded bytes
        downloaded_image = PIL.Image.open(BytesIO(response.content))
        return downloaded_image
    raise requests.RequestException(
        f"Failed to fetch image. Status code: {response.status_code}",
    )


image_labels: list[str] = [
    "blue-circle",
    "blue-square",
    "red-circle",
    "red-rectangle",
    "red-circle-with-black-frame",
]

data = [
    {
        "id": image_label,
        "image": open_image_from_public_gcs(f"{DATA_URL}/{image_label}.png"),
    }
    for image_label in image_labels
]

# we will use it for querying
blue_square_image = open_image_from_public_gcs(f"{DATA_URL}/blue-square.png")

# Superlinked config for image search

Multimodal vision transformers make it possible to embed an image and its caption or description together in a shared space. As a result, embeddings of the same concept end up close to each other regardless of whether the input is an image or a textual representation. With ImageSpace you can embed an image together with its description, or the image alone, into this multimodal space. Embedding only the text with the ViT is currently not possible for efficiency reasons.
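
To make the shared-space idea concrete, here is a minimal sketch that uses made-up 3-dimensional vectors in place of real OpenCLIP embeddings. The vectors and the `cosine_similarity` helper are illustrative assumptions and not part of Superlinked's API. The only point is that an image and a text describing the same concept should score higher than an image paired with an unrelated text.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real model outputs are much higher-dimensional.
image_of_blue_square = [0.9, 0.1, 0.2]
text_blue_square = [0.8, 0.2, 0.1]
text_red_circle = [0.1, 0.9, 0.3]

same_concept = cosine_similarity(image_of_blue_square, text_blue_square)
different_concept = cosine_similarity(image_of_blue_square, text_red_circle)

print(f"image vs. matching text:  {same_concept:.3f}")
print(f"image vs. unrelated text: {different_concept:.3f}")
```

In a real multimodal model the two encoders are trained so that this property holds for actual images and captions; the toy vectors above simply hard-code it.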

Use the Blob SchemaFieldType for images. It accepts:
- a PIL.Image object in memory
- a local path or remote URL to load the image from
- the image file as a byte array in string format
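
As a rough illustration of the third option, the sketch below turns raw image bytes into a string and back. It assumes the string format is base64, which is consistent with the `iVBORw0KGgo...` prefixes visible in the `image` column of the query outputs further down (that prefix is what the PNG file signature looks like in base64). The helper names are hypothetical, and the exact encoding Blob expects should be checked against the Superlinked documentation.

```python
import base64

# Every PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_bytes_to_string(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string (assumed format)."""
    return base64.b64encode(image_bytes).decode("ascii")


def string_to_image_bytes(encoded: str) -> bytes:
    """Decode a base64 string back into the raw image bytes."""
    return base64.b64decode(encoded)


def looks_like_png(image_bytes: bytes) -> bool:
    """Cheap sanity check based on the PNG file signature."""
    return image_bytes.startswith(PNG_SIGNATURE)


# Stand-in for the first bytes of a real PNG file (signature + IHDR chunk header).
png_header = PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"
encoded_header = image_bytes_to_string(png_header)

print(encoded_header)  # starts with "iVBORw0KGgoAAAANSUhEUg"
print(looks_like_png(string_to_image_bytes(encoded_header)))
```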

In [4]:
class Image(sl.Schema):
    id: sl.IdField
    image: sl.Blob
    description: sl.String


image = Image()

# the space is set up to aggregate the image and the description embeddings
image_embedding_space = sl.ImageSpace(image=image.image + image.description)

image_index = sl.Index(image_embedding_space)
source: sl.InMemorySource = sl.InMemorySource(image)
executor = sl.InMemoryExecutor(sources=[source], indices=[image_index])
app = executor.run()

simple_query = (
    sl.Query(image_index, weights={image_embedding_space: 1.0})
    .find(image)
    .similar(image_embedding_space.image, sl.Param("image_search"), sl.Param("image_weight"))
    .similar(
        image_embedding_space.description,
        sl.Param("text_search"),
        sl.Param("text_weight"),
    )
    .select_all()
)

In [5]:
# ingest data
source.put(data)

# Run queries

We first run queries on the original data, then extend the dataset with descriptions.

Let's search with an image of a blue square first (the same image as the ingested one, hence the 1.0 similarity).

In [6]:
result = app.query(simple_query, image_search=blue_square_image)
sl.PandasConverter.to_pandas(result)

| | image | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | iVBORw0KGgoAAAANSUhEUgAAA8AAAAPACAIAAAB1tIfMAA... | blue-square | 1.0 | 0 |
| 1 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | blue-circle | 0.868173 | 1 |
| 2 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | red-circle-with-black-frame | 0.826916 | 2 |
| 3 | iVBORw0KGgoAAAANSUhEUgAAAaQAAAEYCAYAAAATRII7AA... | red-rectangle | 0.825166 | 3 |
| 4 | iVBORw0KGgoAAAANSUhEUgAAAVwAAAFcBAMAAAB2OBsfAA... | red-circle | 0.809378 | 4 |


Interestingly enough, a blue circle is closer to a blue square than a red rectangle is. The red circle is the furthest (understandably), but the red circle with a black frame is more similar to the blue square than the red rectangle is - by a negligible margin.
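
To put a number on "negligible", the short sketch below takes the similarity scores from the result above and computes the gap between each pair of neighbouring ranks. The scores are copied from the output; the variable names are only illustrative.

```python
# Similarity scores copied from the image-only query result above.
scores = {
    "blue-square": 1.0,
    "blue-circle": 0.868173,
    "red-circle-with-black-frame": 0.826916,
    "red-rectangle": 0.825166,
    "red-circle": 0.809378,
}

# Sort from most to least similar, then compare each item with the next one.
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
gaps = [
    (higher_id, lower_id, round(higher_score - lower_score, 6))
    for (higher_id, higher_score), (lower_id, lower_score) in zip(ranked, ranked[1:])
]

for higher_id, lower_id, gap in gaps:
    print(f"{higher_id} vs. {lower_id}: {gap}")

smallest_gap = min(gaps, key=lambda entry: entry[2])
```

The smallest gap is the one between the framed red circle and the red rectangle, roughly 0.002, which is more than twenty times smaller than the gap between the blue circle and the framed red circle.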

Now let's add descriptions to the images. Their embeddings will be aggregated with the image embeddings. 
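
How the aggregation is computed internally is not described here, so the sketch below assumes one plausible scheme: normalize both vectors, average them, and normalize the result. The 2-dimensional toy vectors (axes loosely meaning "blue" and "red") and all helper names are hypothetical. It shows why a description that contradicts the image can pull an item towards a query it would otherwise be far from.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def aggregate(image_vec: list[float], text_vec: list[float]) -> list[float]:
    """Assumed aggregation: mean of the two unit vectors, renormalized."""
    image_unit = normalize(image_vec)
    text_unit = normalize(text_vec)
    mean = [(i + t) / 2 for i, t in zip(image_unit, text_unit)]
    return normalize(mean)


# Toy axes: [blue, red]. These values are made up for illustration.
red_rectangle_image = [0.1, 0.9]
misleading_description = [0.9, 0.1]  # "This is a blue rectangle."
blue_square_query = [1.0, 0.0]

image_only = cosine_similarity(blue_square_query, red_rectangle_image)
with_description = cosine_similarity(
    blue_square_query, aggregate(red_rectangle_image, misleading_description)
)

print(f"image only:       {image_only:.3f}")
print(f"with description: {with_description:.3f}")
```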

In [7]:
data_desc = [
    {
        "id": image_label,
        "image": open_image_from_public_gcs(f"{DATA_URL}/{image_label}.png"),
        "description": f"This is a {' '.join(image_label.split('-'))}.",
    }
    for image_label in image_labels
]

# let's try to deceive everyone by stating the red rectangle is blue and see how that affects our results
data_desc[3]["description"] = "This is a blue rectangle."

source.put(data_desc)

Continue running queries on the dataset that now contains descriptions.

In [8]:
result = app.query(simple_query, image_search=blue_square_image)
sl.PandasConverter.to_pandas(result)

| | image | description | id | similarity_score | rank |
|---|---|---|---|---|---|
| 0 | iVBORw0KGgoAAAANSUhEUgAAA8AAAAPACAIAAAB1tIfMAA... | This is a blue square. | blue-square | 0.853752 | 0 |
| 1 | iVBORw0KGgoAAAANSUhEUgAAAaQAAAEYCAYAAAATRII7AA... | This is a blue rectangle. | red-rectangle | 0.753377 | 1 |
| 2 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a blue circle. | blue-circle | 0.734472 | 2 |
| 3 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a red circle with black frame. | red-circle-with-black-frame | 0.688376 | 3 |
| 4 | iVBORw0KGgoAAAANSUhEUgAAAVwAAAFcBAMAAAB2OBsfAA... | This is a red circle. | red-circle | 0.67771 | 4 |


Our trick worked: the red rectangle (described as blue) is now the closest item to the blue square after the blue square itself. It has overtaken the blue circle, which is blue in both its image and its description.
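
The change is easier to see as a rank shift per item. The sketch below compares the result order of the image-only query before and after the descriptions were added; both orderings are copied from the outputs above, and `rank_shift` is a hypothetical helper written for this illustration.

```python
# Result order of the blue-square image query, best match first.
order_without_descriptions = [
    "blue-square",
    "blue-circle",
    "red-circle-with-black-frame",
    "red-rectangle",
    "red-circle",
]
order_with_descriptions = [
    "blue-square",
    "red-rectangle",
    "blue-circle",
    "red-circle-with-black-frame",
    "red-circle",
]


def rank_shift(before: list[str], after: list[str]) -> dict[str, int]:
    """Positive values mean the item moved up, negative values mean it moved down."""
    return {item: before.index(item) - after.index(item) for item in before}


shifts = rank_shift(order_without_descriptions, order_with_descriptions)
for item, shift in shifts.items():
    print(f"{item}: {shift:+d}")
```

Only the mislabeled red rectangle moves up (by two places); the blue circle and the framed red circle each drop one place as a result.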

Now let's try searching with some text in the vector space of the Vision Transformer, showing that textual queries can be used, too.

In [9]:
result = app.query(simple_query, text_search="black frame around a red circle")
sl.PandasConverter.to_pandas(result)

| | image | description | id | similarity_score | rank |
|---|---|---|---|---|---|
| 0 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a red circle with black frame. | red-circle-with-black-frame | 0.746052 | 0 |
| 1 | iVBORw0KGgoAAAANSUhEUgAAAVwAAAFcBAMAAAB2OBsfAA... | This is a red circle. | red-circle | 0.697133 | 1 |
| 2 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a blue circle. | blue-circle | 0.648808 | 2 |
| 3 | iVBORw0KGgoAAAANSUhEUgAAAaQAAAEYCAYAAAATRII7AA... | This is a blue rectangle. | red-rectangle | 0.596682 | 3 |
| 4 | iVBORw0KGgoAAAANSUhEUgAAA8AAAAPACAIAAAB1tIfMAA... | This is a blue square. | blue-square | 0.563707 | 4 |


Using the two `similar` clauses in the query, we can search with text and an image at the same time...

In [10]:
result = app.query(simple_query, image_search=blue_square_image, text_search="red circle")
sl.PandasConverter.to_pandas(result)

| | image | description | id | similarity_score | rank |
|---|---|---|---|---|---|
| 0 | iVBORw0KGgoAAAANSUhEUgAAA8AAAAPACAIAAAB1tIfMAA... | This is a blue square. | blue-square | 0.898397 | 0 |
| 1 | iVBORw0KGgoAAAANSUhEUgAAAVwAAAFcBAMAAAB2OBsfAA... | This is a red circle. | red-circle | 0.896603 | 1 |
| 2 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a blue circle. | blue-circle | 0.891934 | 2 |
| 3 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a red circle with black frame. | red-circle-with-black-frame | 0.891678 | 3 |
| 4 | iVBORw0KGgoAAAANSUhEUgAAAaQAAAEYCAYAAAATRII7AA... | This is a blue rectangle. | red-rectangle | 0.858931 | 4 |


... and even set different weights according to the importance of the respective search terms.

In [14]:
result = app.query(
    simple_query,
    image_search=blue_square_image,
    image_weight=0.1,
    text_search="red circle",
    text_weight=1.0,
)
sl.PandasConverter.to_pandas(result)

| | image | description | id | similarity_score | rank |
|---|---|---|---|---|---|
| 0 | iVBORw0KGgoAAAANSUhEUgAAAVwAAAFcBAMAAAB2OBsfAA... | This is a red circle. | red-circle | 0.775167 | 0 |
| 1 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a red circle with black frame. | red-circle-with-black-frame | 0.758307 | 1 |
| 2 | iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAA... | This is a blue circle. | blue-circle | 0.718329 | 2 |
| 3 | iVBORw0KGgoAAAANSUhEUgAAAaQAAAEYCAYAAAATRII7AA... | This is a blue rectangle. | red-rectangle | 0.651391 | 3 |
| 4 | iVBORw0KGgoAAAANSUhEUgAAA8AAAAPACAIAAAB1tIfMAA... | This is a blue square. | blue-square | 0.623732 | 4 |


Notice how the weight change moves everything red and circular up in the result ranking, while blue and rectangular items move down.
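
For intuition on why the weights have this effect, here is a minimal sketch under the assumption that the image and text query vectors are combined as a weighted sum of unit vectors before the search. Superlinked's actual weighting may work differently, and the 2-dimensional toy vectors and helper names are made up for illustration.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def weighted_query(image_vec, text_vec, image_weight, text_weight):
    """Assumed combination: weighted sum of the unit vectors, renormalized."""
    image_unit = normalize(image_vec)
    text_unit = normalize(text_vec)
    combined = [image_weight * i + text_weight * t for i, t in zip(image_unit, text_unit)]
    return normalize(combined)


def rank_items(query, items):
    """Item names ordered from most to least similar to the query."""
    return sorted(items, key=lambda name: cosine_similarity(query, items[name]), reverse=True)


# Toy axes: [blue / square-like, red / circle-like].
image_query = [1.0, 0.0]  # stands in for the blue square image
text_query = [0.0, 1.0]  # stands in for the text "red circle"
toy_items = {"blue-square": [0.95, 0.1], "red-circle": [0.1, 0.9]}

image_heavy = rank_items(weighted_query(image_query, text_query, 1.0, 0.1), toy_items)
text_heavy = rank_items(weighted_query(image_query, text_query, 0.1, 1.0), toy_items)

print("image_weight=1.0, text_weight=0.1:", image_heavy)
print("image_weight=0.1, text_weight=1.0:", text_heavy)
```

With the image dominating, the toy blue square ranks first; with the text dominating, the toy red circle does, which mirrors the direction of the shift in the last query above.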