# Multimodal Search Using CLIP

## Introduction

This notebook demonstrates how SuperDuperDB can perform searches involving both text and images using the [CLIP multimodal architecture](https://openai.com/research/clip). 


## Prerequisites

Before we start, make sure you have the necessary tools by running these commands:

In [None]:
!pip install superduperdb
!pip install ipython openai-clip
!pip install -U datasets

## Connect to datastore 

Connect to a MongoDB datastore using SuperDuperDB. Adjust the connection URI to match your setup.
Here are some example MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`
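
To see which parts of a URI you would typically change, the sketch below splits a few of the example URIs with Python's standard `urllib.parse`. The helper `describe_uri` is hypothetical and only illustrates generic URI structure; it is not part of SuperDuperDB, and how SuperDuperDB itself interprets each part (for example the `test` in `mongomock://test`) is not shown here.

```python
from urllib.parse import urlparse


def describe_uri(uri: str) -> dict:
    """Split a connection URI into the parts that usually need adjusting.

    Hypothetical helper for illustration only; not a SuperDuperDB function.
    """
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,                       # e.g. "mongodb" or "mongomock"
        "host": parts.hostname,                       # server name or address
        "port": parts.port,                           # None when the URI omits it
        "database": parts.path.lstrip("/") or None,   # path component, if present
        "has_credentials": parts.username is not None,
    }


for uri in (
    "mongomock://test",
    "mongodb://localhost:27017",
    "mongodb://superduper:superduper@mongodb:27017/documents",
):
    print(describe_uri(uri))
```

Keeping the URI in an environment variable such as `MONGODB_URI`, as the next cell does, means credentials never need to be written into the notebook.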

In [None]:
import os
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection

# SuperDuper your Database!
mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")
db = superduper(mongodb_uri, artifact_store='filesystem://.data')

collection = Collection('multimodal')

## Load Dataset 

For simplicity, we'll work with a subset of the [Tiny-Imagenet dataset](https://paperswithcode.com/dataset/tiny-imagenet). To insert images into the database, we use the `Encoder`-`Document` framework, which saves Python class instances as blobs in the `Datalayer` and retrieves them as Python objects. We will use the `PIL.Image` encoder, but you can also create your own encoders for custom data types.
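
The idea behind an encoder can be sketched in plain Python: a pair of functions that turn an object into bytes and back again. The example below is a conceptual illustration under that assumption only. `ToyEncoder`, `Point`, `encode_point`, and `decode_point` are hypothetical names, and this is not SuperDuperDB's `Encoder` class or its API.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Point:
    """A custom data type we would like to store as a blob."""
    x: float
    y: float


@dataclass
class ToyEncoder:
    """Hypothetical stand-in for an encoder: an identifier plus two functions."""
    identifier: str
    encode: Callable[[Any], bytes]
    decode: Callable[[bytes], Any]


def encode_point(p: Point) -> bytes:
    # Pack both coordinates as little-endian 64-bit floats (16 bytes in total)
    return struct.pack("<dd", p.x, p.y)


def decode_point(blob: bytes) -> Point:
    # Reverse of encode_point: unpack the two floats and rebuild the object
    x, y = struct.unpack("<dd", blob)
    return Point(x, y)


point_encoder = ToyEncoder("point", encode=encode_point, decode=decode_point)

original = Point(1.5, -2.0)
blob = point_encoder.encode(original)    # what would be written to storage
restored = point_encoder.decode(blob)    # what would be handed back on read
print(len(blob), restored)
```

The same round trip applies to images: the encoder decides how a `PIL.Image` becomes bytes on insert and how those bytes become an image again on retrieval.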

Download images locally.

In [None]:
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/coco_sample.zip
!unzip -o coco_sample.zip

Convert images to Python objects.

In [None]:
from superduperdb import Document
from superduperdb.ext.pillow import pil_image as i
import glob

images = glob.glob('images_small/*.jpg')
documents = [Document({'image': i(uri=f'file://{img}')}) for img in images][:500]

Ensure that images are represented correctly as Python objects.

In [None]:
documents[1]

The wrapped Python objects can be inserted directly into the `Datalayer`:

In [None]:
db.execute(collection.insert_many(documents), encoders=(i,))

Verify that the images are stored:

In [None]:
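# Fetch one document and unpack the stored image back into a PIL object
x = db.execute(collection.find_one()).unpack()['image']
# Resize to a width of 300 pixels while preserving the aspect ratio
display(x.resize((300, int(300 * x.size[1] / x.size[0]))))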

## Build Models

Now, let's prepare the CLIP model for multimodal search, which involves two components: `text encoding` and `visual encoding`. 

In [None]:
import clip
from superduperdb import vector
from superduperdb.ext.torch import TorchModel

# Load the CLIP model
model, preprocess = clip.load("RN50", device='cpu')

# Define the vector encoder for the 1024-dimensional CLIP RN50 embeddings
e = vector(shape=(1024,))

# Create a TorchModel for text encoding
text_model = TorchModel(
    identifier='clip_text',
    object=model,
    preprocess=lambda x: clip.tokenize(x)[0],
    postprocess=lambda x: x.tolist(),
    encoder=e,
    forward_method='encode_text',    
)

# Create a TorchModel for visual encoding
visual_model = TorchModel(
    identifier='clip_image',
    object=model.visual,    
    preprocess=preprocess,
    postprocess=lambda x: x.tolist(),
    encoder=e,
)

## Create a Vector-Search Index

Next, let's build a vector index that can be queried with either text or images.

We'll add both the `visual_model` and the `text_model` to the search system, but they have different roles. 

The `visual_model` will be designated as the primary transformer (`indexing_listener`), in charge of creating vectors in the database.
The `text_model` will serve as the secondary transformer (`compatible_listener`), defining how an alternative model can search for those vectors.
This way, we can use different models for searching, even if they expect different types of information.
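
The division of roles described above can be sketched without any database. In this minimal illustration, one function stands in for the indexing model and writes the stored vectors, while a second function stands in for the compatible model and only encodes queries. All names (`embed_image`, `embed_text`, `like`, the toy 3-dimensional vectors) are hypothetical, and cosine similarity is an assumed measure; SuperDuperDB's actual implementation and default similarity may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# Toy stand-ins for the two CLIP encoders; real embeddings have 1024 dimensions.
TOY_TEXT_VECTORS = {"sports": [1.0, 0.0, 0.1], "pets": [0.0, 1.0, 0.5]}


def embed_image(image):
    # Indexing model: the only one that writes vectors into the index
    return image["features"]


def embed_text(text):
    # Compatible model: only used to encode queries, never to index
    return TOY_TEXT_VECTORS[text]


ENCODERS = {"image": embed_image, "text": embed_text}

images = {
    "img_ball": {"features": [0.9, 0.1, 0.0]},
    "img_cat": {"features": [0.0, 0.8, 0.6]},
    "img_car": {"features": [0.1, 0.0, 0.95]},
}

# Build the index once, using the indexing model only
index = {doc_id: embed_image(img) for doc_id, img in images.items()}


def like(query, index, n=3):
    """Encode the query with the model matching its key, then rank the index."""
    ((key, value),) = query.items()
    q = ENCODERS[key](value)
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:n]]


print(like({"text": "sports"}, index, n=2))
print(like({"image": {"features": [0.0, 0.7, 0.7]}}, index, n=1))
```

Because both encoders map into the same vector space, a text query and an image query can be ranked against the same stored vectors, which is what makes the two listeners interchangeable at search time.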

In [None]:
from superduperdb import VectorIndex
from superduperdb import Listener

# Create a VectorIndex and add it to the database
db.add(
    VectorIndex(
        'my-index',
        indexing_listener=Listener(
            model=visual_model,
            key='image',
            select=collection.find(),
            predict_kwargs={'batch_size': 10},
        ),
        compatible_listener=Listener(
            model=text_model,
            key='text',
            active=False,
            select=None,
        )
    )
)

## Search Images Using Text

Now we can demonstrate searching for images using text queries:

In [None]:
from IPython.display import display
from superduperdb import Document

query_string = 'sports'

search_results = db.execute(
    collection.like(
        Document({'text': query_string}), # Search image by text
        vector_index='my-index', 
        n=3,
    ).find({})
)

# Display the images from the search results
for r in search_results:
    x = r['image'].x
    display(x.resize((300, int(300 * x.size[1] / x.size[0]))))

## Search For Similar Images

Besides searching for images using text, we can use the same vectors to search for similar images.

To do so, let's pick an image from the collection and use it as a reference for finding similar ones.

In [None]:
# Pick an image from the collection to use as the reference.
ref_img = db.execute(collection.find_one({}))['image']
x = ref_img.x
display(x.resize((300, int(300 * x.size[1] / x.size[0]))))

The process is now the same as before:

In [None]:
cur = db.execute(
    collection.like(
        Document({'image': ref_img}), # Search similar images
        vector_index='my-index', 
        n=3,
    ).find({})
)

for r in cur:
    x = r['image'].x
    display(x.resize((300, int(300 * x.size[1] / x.size[0]))))