### Tutorial

We will use the [**LlamaIndex**](https://huggingface.co/llamaindex/vdr-2b-multi-v1/tree/main) `vdr-2b-multi-v1` model to generate multimodal embeddings and [**Qdrant**](http://qdrant.tech) to store and retrieve them.

In [None]:
%pip install llama-index-embeddings-huggingface qdrant-client 

In [2]:
from qdrant_client import QdrantClient, models

# Requires a running Qdrant instance, e.g.: docker run -p 6333:6333 qdrant/qdrant
client = QdrantClient(url="http://localhost:6333/")

Let's embed a very short selection of images and their captions in the **shared embedding space**.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
)

documents = [
    {"caption": "An image about plane emergency safety.", "image": "images/image-1.png"},
    {"caption": "An image about airplane components.", "image": "images/image-2.png"},
    {"caption": "An image about COVID safety restrictions.", "image": "images/image-3.png"},
    {"caption": "A confidential image about UFO sightings.", "image": "images/image-4.png"},
    {"caption": "An image about unusual footprints on Aralar 2011.", "image": "images/image-5.png"},
]

text_embeddings = model.get_text_embedding_batch([doc["caption"] for doc in documents])
image_embeddings = model.get_image_embedding_batch([doc["image"] for doc in documents])
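Because captions and images land in the same embedding space, a caption vector can be compared directly with an image vector. The sketch below illustrates the idea with cosine similarity on toy 3-dimensional vectors. The names `toy_text_embeddings` and `toy_image_embeddings` and their values are made up for illustration; with the real model you would pass `text_embeddings` and `image_embeddings` instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins (assumption): two captions and two images in a 3-d space.
toy_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_image_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

# similarity_matrix[i][j] compares caption i with image j.
similarity_matrix = [
    [cosine_similarity(text_vec, image_vec) for image_vec in toy_image_embeddings]
    for text_vec in toy_text_embeddings
]

# For each caption, the index of the most similar image.
best_image_per_caption = [row.index(max(row)) for row in similarity_matrix]
print(best_image_per_caption)
```

If the model aligns the two modalities well, each caption should score highest against its own image, though that is not guaranteed for every pair.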

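Next, create a **Collection** with two named vectors, `image` and `text`, each sized to match the corresponding embeddings.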

In [4]:
COLLECTION_NAME = "llama-multi"

if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config={
            "image": models.VectorParams(size=len(image_embeddings[0]), distance=models.Distance.COSINE),
            "text": models.VectorParams(size=len(text_embeddings[0]), distance=models.Distance.COSINE),
        }
    )

Now let's upload our images with captions to the **Collection**. Each image with its caption will create a [Point](https://qdrant.tech/documentation/concepts/points/) in Qdrant.

In [5]:
client.upload_points(
    collection_name=COLLECTION_NAME,
    points=[
        models.PointStruct(
            id=idx,
            vector={
                "text": text_embeddings[idx],
                "image": image_embeddings[idx],
            },
            payload=doc
        )
        for idx, doc in enumerate(documents)
    ]
)

Let's see which image we get for the query "*Adventures on snow hills*".

In [None]:
from PIL import Image

find_image = model.get_query_embedding("Adventures on snow hills")

Image.open(client.query_points(
    collection_name=COLLECTION_NAME,
    query=find_image,
    using="image",
    with_payload=["image"],
    limit=1
).points[0].payload['image'])
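Conceptually, a query with `using="image"` ranks points by the similarity between the query vector and each point's `image` vector only. The sketch below mimics that with a linear scan over plain dictionaries. It is an illustration, not Qdrant's implementation (Qdrant uses an index rather than scanning every point), and `brute_force_query`, `toy_points`, and the file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_query(points, query, using, limit=1):
    """Rank points by similarity of `query` to the named vector `using`."""
    scored = [(cosine_similarity(query, point["vector"][using]), point) for point in points]
    # Sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [point for _, point in scored[:limit]]


# Toy points (assumption): each carries a "text" and an "image" vector.
toy_points = [
    {"id": 0, "vector": {"text": [1.0, 0.0], "image": [0.9, 0.1]},
     "payload": {"image": "images/toy-a.png"}},
    {"id": 1, "vector": {"text": [0.0, 1.0], "image": [0.2, 0.8]},
     "payload": {"image": "images/toy-b.png"}},
]

toy_query = [0.1, 0.9]  # stands in for an embedded text query
top = brute_force_query(toy_points, toy_query, using="image", limit=1)
print(top[0]["payload"]["image"])
```

Switching `using` to `"text"` would rank the same points by their caption vectors instead, which is the mechanism the reverse search relies on.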

Let's also run the same query in Italian and compare the results.

In [None]:
Image.open(client.query_points(
    collection_name=COLLECTION_NAME,
    query=model.get_query_embedding("Avventure sulle colline innevate"),
    using="image",
    with_payload=["image"],
    limit=1
).points[0].payload['image'])
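With `limit=1` the comparison is a simple check of whether both queries return the same image. For a larger `limit`, one way to compare the two languages is the share of point ids that both result lists have in common. The sketch below assumes the ids were collected from each response, for example with `[p.id for p in response.points]`; the helper `overlap_at_k` and the id lists are hypothetical.

```python
def overlap_at_k(ids_a, ids_b, k):
    """Fraction of the top-k ids shared by two ranked result lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ids_a[:k])
    top_b = set(ids_b[:k])
    return len(top_a & top_b) / k


# Hypothetical ranked ids for the English and Italian queries.
english_ids = [3, 1, 4]
italian_ids = [3, 4, 0]

print(overlap_at_k(english_ids, italian_ids, k=1))
print(overlap_at_k(english_ids, italian_ids, k=2))
```

A value close to 1 suggests the model treats the two phrasings as near-equivalent; with only five documents in the collection, treat the number as a rough signal rather than an evaluation.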

Now let's do a reverse search for the following image:

In [None]:
Image.open('images/image-2.png')

In [None]:
client.query_points(
    collection_name=COLLECTION_NAME,
    query=model.get_image_embedding("images/image-2.png"),  
    # Now we are searching only among text vectors with our image query
    using="text",
    with_payload=["caption"],
    limit=1
).points[0].payload['caption']