# Text vector search

<!-- TABS -->
## Connect to superduper

:::note
Note that this is only relevant if you are running superduper in development mode.
Otherwise refer to "Configuring your production system".
:::

In [2]:
APPLY = True
COLLECTION_NAME = '<var:table_name>' if not APPLY else 'sample_text_vector_search'

In [3]:
from superduper import superduper

db = superduper('mongomock:///test_db')

2025-Jan-13 12:45:15.81| INFO     | Duncans-MBP.fritz.box| superduper.misc.plugins:13   | Loading plugin: mongodb
2025-Jan-13 12:45:15.87| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:64   | Building Data Layer
2025-Jan-13 12:45:15.87| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:79   | Data Layer built
2025-Jan-13 12:45:15.87| INFO     | Duncans-MBP.fritz.box| superduper.backends.base.cluster:99   | Cluster initialized in 0.00 seconds.
2025-Jan-13 12:45:15.87| INFO     | Duncans-MBP.fritz.box| superduper.base.build:184  | Configuration: 
 +---------------+----------------------+
| Configuration |        Value         |
+---------------+----------------------+
|  Data Backend | mongomock:///test_db |
+---

<!-- TABS -->
## Get useful sample data

In [4]:
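import json
import requests

def getter():
    # Download the sample text passages used throughout this example.
    response = requests.get('https://superduperdb-public-demo.s3.amazonaws.com/text.json')
    # Fail early on an HTTP error instead of trying to parse an error page as JSON.
    response.raise_for_status()
    return json.loads(response.content.decode('utf-8'))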

In [5]:
if APPLY:
    data = getter()

<!-- TABS -->
## Insert simple data

With `auto_schema` turned on, we can insert data directly: superduper inspects the inserted values, infers their data types, and creates the table with a matching schema.
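To make the idea of schema inference concrete, here is a minimal sketch in plain Python. It is not superduper's implementation: `infer_schema` is a hypothetical helper that simply maps each field to the Python type name of its values, which is roughly the kind of information a schema like `('x', 'str')` records.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Map each field name to the type name of its values.

    Hypothetical helper for illustration only; not part of superduper.
    """
    schema: dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            type_name = type(value).__name__
            # Refuse to guess when one field holds values of different types.
            if schema.setdefault(field, type_name) != type_name:
                raise TypeError(
                    f"field {field!r} mixes {schema[field]} and {type_name}"
                )
    return schema


sample = [{"x": "first passage"}, {"x": "second passage"}]
schema = infer_schema(sample)
print(schema)
```

A real system would also need rules for nested values, missing fields, and binary data; the sketch only covers flat records with consistent types.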

In [6]:
if APPLY:
    from superduper import Document
    ids = db.execute(db[COLLECTION_NAME].insert([Document({'x': x}) for x in data]))

2025-Jan-13 12:45:21.08| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (table, sample_text_vector_search) not found in cache, loading from db
2025-Jan-13 12:45:21.08| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('table', 'sample_text_vector_search')) from metadata...
2025-Jan-13 12:45:21.08| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:331  | Table sample_text_vector_search does not exist, auto creating...
2025-Jan-13 12:45:21.59| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:337  | Creating table sample_text_vector_search with schema {('_fold', 'str'), ('x', 'str')}
2025-Jan-13 12:45:21.59| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593

## Apply a chunker for search

:::note
Note that applying a chunker is ***not*** mandatory for search.
If your data is already chunked (e.g. short text snippets or audio) or if you
are searching through something like images, which can't be chunked, then this
won't be necessary.
:::

In [7]:
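from superduper import Model

class Chunker(Model):
    # Maximum number of whitespace-separated words per chunk.
    chunk_size: int = 200
    signature: str = 'singleton'

    def predict(self, text):
        words = text.split()
        # Step through the words in strides of `chunk_size`; the last chunk may be shorter.
        chunks = [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
        return chunks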
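
The chunking logic itself does not depend on superduper. The following standalone sketch reproduces it as a plain function (`chunk_words` is a name chosen here for illustration) so that its behaviour can be checked in isolation.

```python
def chunk_words(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


text = "word " * 450
chunks = chunk_words(text, chunk_size=200)
sizes = [len(chunk.split()) for chunk in chunks]
print(sizes)  # [200, 200, 50]
```

Splitting on whitespace ignores sentence and paragraph boundaries, so a chunk can end mid-sentence. That is acceptable for a demonstration but may hurt retrieval quality on real documents.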

Now we apply this chunker to the data by wrapping it in a `Listener`:

In [8]:
from superduper import Listener

upstream_listener = Listener(
    model=Chunker('chunk_model', chunk_size=200, example='test ' * 50),
    select=db[COLLECTION_NAME].select(),
    key='x',
    identifier=f'chunker_{COLLECTION_NAME}',
    flatten=True,
)
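
With `flatten=True`, a model that returns a list per input produces one output row per list element, which is why the run logged in this document reports 187 model outputs but 336 inserted documents. The sketch below illustrates that fan-out in plain Python; `flatten_outputs` and the key names `source_id` and `chunk` are hypothetical and do not reflect superduper's internal storage format.

```python
def flatten_outputs(outputs_by_source: dict[str, list[str]]) -> list[dict[str, str]]:
    """Turn {source id: [chunk, ...]} into one row per chunk.

    Hypothetical helper for illustration only; not part of superduper.
    """
    rows = []
    for source_id, chunks in outputs_by_source.items():
        for chunk in chunks:
            # Each row remembers which input document it came from.
            rows.append({"source_id": source_id, "chunk": chunk})
    return rows


outputs = {"doc-1": ["a b", "c d"], "doc-2": ["e f"]}
rows = flatten_outputs(outputs)
print(len(outputs), "inputs ->", len(rows), "rows")  # 2 inputs -> 3 rows
```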

In [9]:
if APPLY:
    db.apply(upstream_listener, force=True)

2025-Jan-13 12:45:27.84| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (model, chunk_model) not found in cache, loading from db
2025-Jan-13 12:45:27.84| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('model', 'chunk_model')) from metadata...
2025-Jan-13 12:45:27.84| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:359  | Found new model:chunk_model:76436da6a29d4031
2025-Jan-13 12:45:27.85| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (listener, chunker_sample_text_vector_search) not found in cache, loading from db
2025-Jan-13 12:45:27.85| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('listener', 'chunker_sample_

187it [00:00, 29373.64it/s]

2025-Jan-13 12:45:27.88| INFO     | Duncans-MBP.fritz.box| superduper.components.model:672  | Adding 187 model outputs to `db`
2025-Jan-13 12:45:27.97| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:310  | Inserted 336 documents into _outputs__chunker_sample_text_vector_search__fc9ea0798a1e47e3
2025-Jan-13 12:45:27.97| INFO     | Duncans-MBP.fritz.box| superduper.backends.local.queue:120  | Consumed all events





## Select outputs of upstream listener

:::note
This is useful if you have performed a first step, such as pre-computing 
features, or chunking your data. You can use this query to 
operate on those outputs.
:::

<!-- TABS -->
## Build text embedding model

OpenAI:

In [10]:
from superduper.components.datatype import Vector
from superduper_openai import OpenAIEmbedding

openai_embedding = OpenAIEmbedding(
    identifier='text-embedding-ada-002',
    datatype=Vector(shape=(1536,)),
)

Sentence-transformers:

In [11]:
from superduper_sentence_transformers import SentenceTransformer

sentence_transformers_embedding = SentenceTransformer(
    identifier="sentence-transformers-embedding",
    model="BAAI/bge-small-en",
    datatype=Vector(shape=(384,)),
    postprocess=lambda x: x.numpy(),
    predict_kwargs={"show_progress_bar": True},
)

In [12]:
from superduper.components.model import ModelRouter

embedding_model = ModelRouter(
    'embedding',
    models={'openai': openai_embedding, 'sentence_transformers': sentence_transformers_embedding},
    model='<var:embedding_model>' if not APPLY else 'openai',
    example='this is a test',
)
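
Conceptually, a router holds several interchangeable models under string keys and forwards each call to the one that is currently selected. The sketch below shows that pattern with plain callables. `SimpleRouter` and the two toy "embedding" functions are hypothetical stand-ins, not superduper's `ModelRouter` or real embedding models.

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


class SimpleRouter:
    """Hypothetical stand-in that forwards predictions to one selected model."""

    def __init__(self, models: dict[str, Embedder], model: str):
        if model not in models:
            raise KeyError(f"unknown model {model!r}; choose from {sorted(models)}")
        self.models = models
        self.model = model

    def predict(self, text: str) -> list[float]:
        return self.models[self.model](text)


def length_embedding(text: str) -> list[float]:
    # Toy "embedding": the character count of the text.
    return [float(len(text))]


def vowel_embedding(text: str) -> list[float]:
    # Toy "embedding": the number of vowels in the text.
    return [float(sum(ch in "aeiou" for ch in text.lower()))]


router = SimpleRouter(
    {"length": length_embedding, "vowels": vowel_embedding},
    model="length",
)
print(router.predict("this is a test"))  # [14.0]
```

Selecting the model by key is what lets a template expose the choice as a single string variable, as with `'<var:embedding_model>'` in the cell above.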

## Create vector-index

In [13]:
from superduper import VectorIndex, Listener

vector_index_name = f'vector-index-{COLLECTION_NAME}'

vector_index = VectorIndex(
    vector_index_name,
    indexing_listener=Listener(
        key=upstream_listener.outputs,
        select=db[upstream_listener.outputs].select(),
        model=embedding_model,
        identifier=f'embedding-listener-{COLLECTION_NAME}',
        upstream=[upstream_listener],
    )
)

In [14]:
if APPLY:
    db.apply(vector_index, force=True)

2025-Jan-13 12:45:40.21| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:265  | Found identical model:chunk_model:76436da6a29d4031
2025-Jan-13 12:45:40.21| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:341  | Found update model:chunk_model:76436da6a29d4031
2025-Jan-13 12:45:40.21| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:265  | Found identical listener:chunker_sample_text_vector_search:fc9ea0798a1e47e3
2025-Jan-13 12:45:40.21| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (datatype, vector[1536]) not found in cache, loading from db
2025-Jan-13 12:45:40.21| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('datatype', 'vector[1536]')) from metadata.

2025-Jan-13 12:45:43.40| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:79   | Found these changes and/ or additions that need to be made:
2025-Jan-13 12:45:43.40| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:81   | ----------------------------------------------------------------------------------------------------
2025-Jan-13 12:45:43.40| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:82   | METADATA EVENTS:
2025-Jan-13 12:45:43.40| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:83   | ----------------------------------------------------------------------------------------------------
2025-Jan-13 12:45:43.40| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:92   | [0]: model:chunk_model:76

336it [00:00, 17893.20it/s]

2025-Jan-13 12:45:45.63| INFO     | Duncans-MBP.fritz.box| superduper.components.model:1345 | Predicting with model openai

100%|██████████| 4/4 [00:06<00:00,  1.66s/it]

2025-Jan-13 12:45:52.25| INFO     | Duncans-MBP.fritz.box| superduper.components.model:672  | Adding 336 model outputs to `db`
2025-Jan-13 12:45:53.85| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:310  | Inserted 336 documents into _outputs__embedding-listener-sample_text_vector_search__13c7d67953a04170
2025-Jan-13 12:45:54.01| INFO     | Duncans-MBP.fritz.box| superduper.backends.local.queue:120  | Consumed all events


Next we bundle the listener and the vector index into an `Application`. Once the application has been applied to the database, its components are accessible for use in other services.

In [15]:
from superduper import Application

app = Application(
    f'text-vector-search-app-{COLLECTION_NAME}',
    components=[
        upstream_listener,
        vector_index,
    ]
)

2025-Jan-13 12:45:56.04| INFO     | Duncans-MBP.fritz.box| superduper.components.application:39   | Resorting components based on topological order.
2025-Jan-13 12:45:56.04| INFO     | Duncans-MBP.fritz.box| superduper.components.application:56   | New order of components: ['listener:chunker_sample_text_vector_search:fc9ea0798a1e47e3', 'vector_index:vector-index-sample_text_vector_search:3a06ad54829f4776']


In [16]:
if APPLY:
    db.apply(app, force=True)

2025-Jan-13 12:46:01.50| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:265  | Found identical model:chunk_model:76436da6a29d4031
2025-Jan-13 12:46:01.50| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:341  | Found update model:chunk_model:76436da6a29d4031
2025-Jan-13 12:46:01.50| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:265  | Found identical listener:chunker_sample_text_vector_search:fc9ea0798a1e47e3
2025-Jan-13 12:46:01.51| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:265  | Found identical model:chunk_model:76436da6a29d4031
2025-Jan-13 12:46:01.51| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:341  | Found update model:chunk_model:76436da6a29d4031
2025-Jan-13 12:46:01.

2025-Jan-13 12:46:06.35| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:79   | Found these changes and/ or additions that need to be made:
2025-Jan-13 12:46:06.35| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:81   | ----------------------------------------------------------------------------------------------------
2025-Jan-13 12:46:06.35| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:82   | METADATA EVENTS:
2025-Jan-13 12:46:06.35| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:83   | ----------------------------------------------------------------------------------------------------
2025-Jan-13 12:46:06.35| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:92   | [0]: model:chunk_model:76

With the vector index in place, you can run a vector search over the chunked text using the following query.

In [17]:
search_term = 'tell me about the use of pylance and vector-search'

vector_search_query = db[f'_outputs__chunker_{COLLECTION_NAME}'].like(
    {f'_outputs__chunker_{COLLECTION_NAME}': search_term},
    n=10,
    vector_index=vector_index_name,
).select()
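
A `like` query of this kind amounts to a nearest-neighbour lookup: the search term is embedded, compared against the stored vectors, and the `n` closest matches are returned. The brute-force sketch below illustrates this with cosine similarity. Both the similarity measure and the helper names (`cosine`, `top_n`) are assumptions made for illustration; this document does not state which measure or search backend superduper uses.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query: list[float], index: dict[str, list[float]], n: int = 10):
    """Return the n (key, score) pairs most similar to the query, best first."""
    scored = [(key, cosine(query, vector)) for key, vector in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Toy two-dimensional "embeddings" standing in for real model outputs.
index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [1.0, 1.0],
}
results = top_n([1.0, 0.1], index, n=2)
print([key for key, _ in results])  # ['chunk-a', 'chunk-c']
```

Scanning every vector is fine for a few hundred chunks. Larger collections usually rely on an approximate nearest-neighbour index instead.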

In [19]:
if APPLY:
    print(vector_search_query.tolist())

2025-Jan-13 12:46:23.21| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:802  | Getting vector-index
2025-Jan-13 12:46:23.21| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:810  | {}
2025-Jan-13 12:46:23.21| INFO     | Duncans-MBP.fritz.box| superduper.components.model:1335 | Predicting with model openai
[Document({'_outputs__chunker_sample_text_vector_search__fc9ea0798a1e47e3': "--- sidebar_position: 7 --- # Vector-search Superduper allows users to implement vector-search in their database by either using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`. ## Philosophy In Superduper, from a user point-of-view vector-search isn't a completely different beast than other ways of using the system: - The vector-preparation is exactly the same as preparing outputs with any mode

In [21]:
from superduper import QueryTemplate, CFG

qt = QueryTemplate(
    'vector_search',
    template=vector_search_query,
    substitutions={
        COLLECTION_NAME: 'table_name',
        search_term: 'search_term',
        'mongodb': 'data_backend',
    },
    types={
        'search_term': {
            'type': 'str',
            'default': 'enter your question here...',
        },
        'table_name': {
            'type': 'str',
            'default': 'sample_text_vector_search'
        },
        'data_backend': {
            'type': 'mongodb',
            'choices': ['mongodb', 'ibis'],
            'default': 'mongodb'
        }
    },
    db=db
)
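
The `substitutions` mapping pairs a concrete value with the name of the variable that should replace it, using the same `<var:...>` placeholder style seen in the first cell. As a rough sketch of that idea (how `QueryTemplate` performs the replacement internally is an assumption), plain string replacement in both directions looks like this. `to_template` and `fill_template` are hypothetical helper names.

```python
def to_template(text: str, substitutions: dict[str, str]) -> str:
    """Replace each concrete value with a <var:name> placeholder."""
    for value, name in substitutions.items():
        text = text.replace(value, f"<var:{name}>")
    return text


def fill_template(text: str, variables: dict[str, str]) -> str:
    """Replace each <var:name> placeholder with the supplied value."""
    for name, value in variables.items():
        text = text.replace(f"<var:{name}>", value)
    return text


concrete = "_outputs__chunker_sample_text_vector_search"
templated = to_template(concrete, {"sample_text_vector_search": "table_name"})
print(templated)  # _outputs__chunker_<var:table_name>

restored = fill_template(templated, {"table_name": "my_docs"})
print(restored)  # _outputs__chunker_my_docs
```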

## Create template

In [23]:
from superduper import Template, CFG, Table, Schema
from superduper.components.dataset import RemoteData

template = Template(
    'text_vector_search',
    template=app,
    default_tables=[Table(
        'sample_text_vector_search',
        schema=Schema('sample_text_vector_search/schema', fields={'x': 'str'}),
        data=RemoteData(
            'superduper-docs',
            getter=getter,
        )
    )],
    queries=[qt],
    substitutions={COLLECTION_NAME: 'table_name', 'mongodb': 'data_backend'},
    template_variables=['embedding_model', 'table_name', 'data_backend'],
    types={
        'embedding_model': {
            'type': 'str',
            'choices': ['openai', 'sentence_transformers'],
            'default': 'openai',
        },
        'table_name': {
            'type': 'str',
            'default': 'sample_text_vector_search'
        },
        'data_backend': {
            'type': 'mongodb',
            'choices': ['mongodb', 'ibis'],
            'default': 'mongodb'
        }
    },
    db=db
)
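
The `types` dictionary describes, for each template variable, a default and optionally a fixed set of choices. The sketch below shows one way such a specification could be used to resolve user input. `resolve_variables` is a hypothetical helper written for this illustration, and whether superduper validates variables this way is an assumption.

```python
from typing import Any


def resolve_variables(
    types: dict[str, dict[str, Any]], supplied: dict[str, Any]
) -> dict[str, Any]:
    """Fill in defaults and check supplied values against the allowed choices.

    Hypothetical helper for illustration only; not part of superduper.
    """
    resolved = {}
    for name, spec in types.items():
        value = supplied.get(name, spec.get("default"))
        if value is None:
            raise ValueError(f"no value or default for {name!r}")
        choices = spec.get("choices")
        if choices is not None and value not in choices:
            raise ValueError(f"{name!r} must be one of {choices}, got {value!r}")
        resolved[name] = value
    return resolved


variable_types = {
    "embedding_model": {
        "type": "str",
        "choices": ["openai", "sentence_transformers"],
        "default": "openai",
    },
    "table_name": {"type": "str", "default": "sample_text_vector_search"},
}
resolved = resolve_variables(
    variable_types, {"embedding_model": "sentence_transformers"}
)
print(resolved)
```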



In [24]:
template.export('.')

