<!-- TABS -->
# Text Vector Search

You'll find this example as well as the saved template in the main repository of `superduper`.
See [here](https://github.com/superduper-io/superduper/tree/main/templates/text_vector_search).

If you'd like to modify the template or practice building it yourself, you can rerun the `build.ipynb` notebook
in the template directory.

<!-- TABS -->
## Connect to superduper

In [1]:
from superduper import superduper

db = superduper('mongomock://test_db')

2024-Aug-23 14:38:44.94| INFO     | Duncans-MacBook-Pro.local| superduper.misc.plugins:13   | Loading plugin: mongodb
2024-Aug-23 14:38:45.01| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:103  | Building Data Layer
2024-Aug-23 14:38:45.01| INFO     | Duncans-MacBook-Pro.local| superduper.base.build:171  | Configuration: 
+---------------+---------------------+
| Configuration |        Value        |
+---------------+---------------------+
|  Data Backend | mongomock://test_db |
+---------------+---------------------+


<!-- TABS -->
## Get useful sample data

In [2]:
# <tab: Text>
!curl -O https://superduperdb-public-demo.s3.amazonaws.com/text.json
import json

with open('text.json', 'r') as f:
    data = json.load(f)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  720k  100  720k    0     0   277k      0  0:00:02  0:00:02 --:--:--  278k


In [None]:
# <tab: PDF>
!curl -O https://superduperdb-public-demo.s3.amazonaws.com/pdfs.zip && unzip -o pdfs.zip
import os

data = [f'pdfs/{x}' for x in os.listdir('./pdfs') if x.endswith('.pdf')]

In [3]:
datas = [{'x': d} for d in data]

<!-- TABS -->
## Create datatype

SuperduperDB supports automatic data conversion, so you don't need to worry about whether a given data format (`PIL.Image`, `numpy.array`, `pandas.DataFrame`, etc.) is compatible with the database.

It also supports custom conversion methods, such as the datatype defined below.

In [4]:
# <tab: Text>
datatype = 'str'

In [None]:
# <tab: PDF>
from superduper import DataType

# Setting `encodable='file'` stores PDF values as files:
# each file is uploaded to the artifact store, a reference is recorded
# in the database, and the file is downloaded locally when needed.

datatype = DataType('pdf', encodable='file')
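
The comments in the PDF cell above describe a store-by-reference pattern: the file goes to an artifact store, and the database record keeps only a reference. The following is a toy illustration of that idea in plain Python, not superduper's implementation. `ToyArtifactStore`, the `file_id` field, and the content-hash naming are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class ToyArtifactStore:
    """Hypothetical stand-in for an artifact store: file copies keyed by content hash."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, path):
        # Hash the file contents so identical files share one stored copy.
        digest = hashlib.sha1(Path(path).read_bytes()).hexdigest()
        shutil.copyfile(path, self.root / digest)
        return digest

    def get(self, digest, target_dir):
        # "Download" the stored copy into a local directory and return its path.
        target = Path(target_dir) / digest
        shutil.copyfile(self.root / digest, target)
        return target


workdir = Path(tempfile.mkdtemp())
source = workdir / "example.pdf"
source.write_bytes(b"%PDF-1.4 toy content")

store = ToyArtifactStore(workdir / "artifacts")
# Only the reference is kept in the record, not the file bytes.
record = {"x": {"file_id": store.put(source)}}

(workdir / "downloads").mkdir()
restored = store.get(record["x"]["file_id"], workdir / "downloads")
print(restored.read_bytes() == source.read_bytes())
```

The record stays small regardless of file size, which is the practical reason for keeping large files outside the database.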

<!-- TABS -->
## Setup tables or collections

In [5]:
from superduper.components.table import Table
from superduper import Schema

schema = Schema(identifier="schema", fields={"x": datatype})
table = Table("docs", schema=schema)
select = db['docs'].select()

<!-- TABS -->
## Apply a chunker for search

:::note
Note that applying a chunker is ***not*** mandatory for search.
If your data is already chunked (e.g. short text snippets or audio) or if you
are searching through something like images, which can't be chunked, then this
won't be necessary.
:::

In [6]:
# <tab: Text>
from superduper import model

CHUNK_SIZE = 200

@model(flatten=True, model_update_kwargs={'document_embedded': False})
def chunker(text):
    text = text.split()
    chunks = [' '.join(text[i:i + CHUNK_SIZE]) for i in range(0, len(text), CHUNK_SIZE)]
    return chunks

In [None]:
# <tab: PDF>
!pip install -q "unstructured[pdf]"
from superduper import model
from unstructured.partition.pdf import partition_pdf

CHUNK_SIZE = 500

@model(flatten=True)
def chunker(pdf_file):
    elements = partition_pdf(pdf_file)
    text = '\n'.join([e.text for e in elements])
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    return chunks

Now we wrap this chunker as a `Listener`, so that it processes incoming data.

In [7]:
from superduper import Listener

upstream_listener = Listener(
    model=chunker,
    select=db['docs'].select(),
    key='x',
    uuid="chunk",
    identifier='chunker',
)
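
To see what the text chunker above returns, here is the same word-window split as a plain function, without the `@model` decorator. The name `chunk_words` and the synthetic sample text are illustrative only.

```python
CHUNK_SIZE = 200  # words per chunk, as in the Text tab


def chunk_words(text, chunk_size=CHUNK_SIZE):
    """Split text on whitespace and group the words into fixed-size chunks."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 450-word document: two full chunks plus a 50-word remainder.
sample = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(sample)
print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [200, 200, 50]
```

Note that the windows do not overlap, so a sentence that straddles a boundary is split across two chunks.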

## Select outputs of upstream listener

:::note
This is useful if you have performed a first step, such as pre-computing 
features, or chunking your data. You can use this query to 
operate on those outputs.
:::

In [8]:
indexing_key = upstream_listener.outputs
indexing_key

'_outputs__chunker'
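
The idea of one listener reading another listener's outputs can be sketched with a toy in-memory stand-in. This is not how superduper is implemented: `ToyDB`, `add_listener`, and the `_source` field are hypothetical names chosen for the sketch, and only the `_outputs__<identifier>` naming is taken from the output above.

```python
from collections import defaultdict


class ToyDB:
    """Hypothetical in-memory database with insert-triggered listeners."""

    def __init__(self):
        self.tables = defaultdict(list)
        self.listeners = defaultdict(list)

    def add_listener(self, table, key, func, identifier):
        out_table = f"_outputs__{identifier}"
        self.listeners[table].append((key, func, out_table))
        return out_table

    def insert(self, table, rows):
        start = len(self.tables[table])
        self.tables[table].extend(rows)
        for key, func, out_table in self.listeners[table]:
            outputs = []
            for offset, row in enumerate(rows):
                # One output row per returned item, each remembering
                # the position of the source row it came from.
                for item in func(row[key]):
                    outputs.append({out_table: item, "_source": start + offset})
            if outputs:
                # Inserting outputs may trigger downstream listeners.
                self.insert(out_table, outputs)


toy_db = ToyDB()
chunks_table = toy_db.add_listener("docs", "x", lambda t: t.split(". "), "chunker")
# A downstream listener keyed on the upstream output table.
toy_db.add_listener(chunks_table, chunks_table, lambda c: [len(c)], "length")

toy_db.insert("docs", [{"x": "First sentence. Second one"}, {"x": "Only one"}])
print(chunks_table)                      # _outputs__chunker
print(len(toy_db.tables[chunks_table]))  # 3
```

Here the second listener uses the first listener's output table both as its source table and as its key, which mirrors how `indexing_key` is used in the cells that follow.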

<!-- TABS -->
## Build text embedding model

In [9]:
# <tab: OpenAI>
from superduper_openai import OpenAIEmbedding
import os

os.environ['OPENAI_API_KEY'] = 'sk-<secret>'

embedding_model = OpenAIEmbedding(identifier='text-embedding-ada-002')

In [None]:
# <tab: JinaAI>
import os
from superduper_jina import JinaEmbedding

os.environ["JINA_API_KEY"] = "jina_xxxx"
 
# define the model
embedding_model = JinaEmbedding(identifier='jina-embeddings-v2-base-en')

In [None]:
# <tab: Sentence-Transformers>
!pip install sentence-transformers
from superduper import vector
import sentence_transformers
from superduper_sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(
    identifier="embedding",
    object=sentence_transformers.SentenceTransformer("BAAI/bge-small-en"),
    datatype=vector(shape=(384,)),  # bge-small-en produces 384-dimensional embeddings
    postprocess=lambda x: x.tolist(),
    predict_kwargs={"show_progress_bar": True},
)

In [None]:
print(len(embedding_model.predict("What is superduper")))

## Create vector-index

In [10]:
vector_index_name = 'my-vector-index'

In [11]:
from superduper import VectorIndex, Listener

vector_index = VectorIndex(
    vector_index_name,
    indexing_listener=Listener(
        key=indexing_key,                     # the `Document` key whose values `model` embeds
        select=db[indexing_key].select(),     # a `Select` query specifying which data to index
        model=embedding_model,                # the model that converts data to embeddings
        identifier=f'{embedding_model.identifier}-listener',
        upstream=[table, upstream_listener],  # ensures the table and chunker are set up before this listener is triggered
    )
)

In [13]:
from superduper import Application

application = Application(
    'text-vector-search', 
    components=[
        table,
        upstream_listener,
        vector_index,
    ]
)

In [14]:
db.apply(application)

2024-Aug-23 14:40:30.88| INFO     | Duncans-MacBook-Pro.local| superduper.components.listener:94   | Requesting listener setup on CDC service
2024-Aug-23 14:40:30.88| INFO     | Duncans-MacBook-Pro.local| superduper.components.listener:104  | Skipping listener setup on CDC service since no URI is set
2024-Aug-23 14:40:30.90| INFO     | Duncans-MacBook-Pro.local| superduper.components.listener:94   | Requesting listener setup on CDC service
2024-Aug-23 14:40:30.90| INFO     | Duncans-MacBook-Pro.local| superduper.components.listener:104  | Skipping listener setup on CDC service since no URI is set
2024-Aug-23 14:40:30.92| INFO     | Duncans-MacBook-Pro.local| superduper.components.vector_index:54   | Loading vectors of vector-index: 'm

Loading vectors into vector-table...: 0it [00:00, ?it/s]

2024-Aug-23 14:40:30.92| INFO     | Duncans-MacBook-Pro.local| superduper.components.vector_index:97   | Loaded 0 vectors into vector index succesfully

([],
 Application(identifier='text-vector-search', uuid='ba67055ba77d40ffa73d5cd3a17118aa', upstream=None, plugins=None, cache=False, components=[Table(identifier='documents', uuid='d05bd908810a43988dbe41c2167644f0', upstream=None, plugins=None, cache=False, schema=Schema(identifier='schema', uuid='ea43dd0a76f749afb3b416925715c3c1', upstream=None, plugins=None, cache=False, fields={'x': FieldType(identifier='str', uuid='9426229dbd614d10b476e979a96f148f'), '_fold': FieldType(identifier='str', uuid='2de7977645d2435aacc4d8e6c1e1d573')}), primary_id='id'), Listener(identifier='chunker', uuid='chunk', upstream=None, plugins=None, cache=False, key='x', model=ObjectModel(identifier='chunker', uuid='3855aff339c84bf78af50c537c61b5ba', upstream=None, plugins=None, cache=False, signature='*args,**kwargs', datatype=None, output_schema=None, flatten=True, model_update_kwargs={'document_embedded': False}, predict_kwargs={}, compute_kwargs={}, validation=None, metric_values={}, num_workers=0, object=

In [16]:
application.info(verbosity=2)



In [17]:
db['docs'].insert(datas).execute()
select = db['docs'].select()

2024-Aug-23 14:41:35.51| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:344  | Inserted 210 documents into documents
2024-Aug-23 14:41:35.52| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:404  | Created 210 events for insert on [documents]
2024-Aug-23 14:41:35.52| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:407  | Broadcasting 210 events
2024-Aug-23 14:41:35.52| INFO     | Duncans-MacBook-Pro.local| superduper.jobs.queue:210  | Running jobs for listener::chunker
2024-Aug-23 14:41:35.52| INFO     | Duncans-MacBook-Pro.local| superduper.backends.local.compute:67   | Submitting job. function:<function method_job at 0x112075300>
2024-Aug-23 14:41:35.52| INFO     |

100%|██████████| 5/5 [00:08<00:00,  1.67s/it]


2024-Aug-23 14:41:44.02| INFO     | Duncans-MacBook-Pro.local| superduper.components.model:853  | Adding 431 model outputs to `db`
2024-Aug-23 14:41:46.12| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:344  | Inserted 431 documents into _outputs__text-embedding-ada-002-listener
2024-Aug-23 14:41:46.15| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:404  | Created 431 events for insert on [_outputs__text-embedding-ada-002-listener]
2024-Aug-23 14:41:46.15| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:407  | Broadcasting 431 events
2024-Aug-23 14:41:46.15| INFO     | Duncans-MacBook-Pro.local| superduper.jobs.queue:210  | Running jobs for vector_index::my-vector-index
2024-A

In [19]:
db.databackend.db.list_collection_names()

['documents', '_outputs__chunker', '_outputs__text-embedding-ada-002-listener']

## Perform a vector search

In [21]:
from superduper import Document
# Perform the vector search based on the query
item = Document({indexing_key: "Tell me about vector-search"})

In [27]:
results = db[indexing_key].like(item, vector_index=vector_index_name, n=10).select().execute()

2024-Aug-23 14:43:39.94| INFO     | Duncans-MacBook-Pro.local| superduper.base.datalayer:889  | {}


In [28]:
for result in results:
    print("\n", '-' * 20, '\n')
    print(Document(result.unpack())[indexing_key])


 -------------------- 

--- sidebar_position: 7 --- # Vector-search SuperDuperDB allows users to implement vector-search in their database by either using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`. ## Philosophy In `superduperdb`, from a user point-of-view vector-search isn't a completely different beast than other ways of using the system: - The vector-preparation is exactly the same as preparing outputs with any model, with the special difference that the outputs are vectors, arrays or tensors. - Vector-searches are just another type of database query which happen to use the stored vectors. ## Algorithm Here is a schematic of how vector-search works: ![](/img/vector-search.png) ## Explanation A vector-search query has the schematic form: ```python table_or_collection .like(Document(<dict-to-search-with>)) # the operand is vectorized using registered models .filter_results(*args, **kwargs) # the results of vector-search are filtered ``` ```
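
Conceptually, the `like` query above embeds the query text and returns the stored chunks whose vectors are closest to it. The sketch below shows that ranking step in plain Python. It is an illustration only: cosine similarity is an assumed measure, the brute-force scan is not how a production index works, and the names `cosine`, `top_n`, and the three toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_n(query_vec, stored, n=10):
    """Return the keys of the n stored vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), key) for key, vec in stored.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:n]]


# Toy 3-dimensional "embeddings" standing in for real model outputs.
stored = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
print(top_n([0.9, 0.1, 0.0], stored, n=2))  # ['chunk-a', 'chunk-b']
```

The `n` argument plays the same role as `n=10` in the query above: it caps how many nearest chunks are returned.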

In [33]:
from superduper import Template

t = Template(
    'vector-search',
    template=application,
    substitutions={'docs': 'table_name'},
)
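
The `substitutions={'docs': 'table_name'}` argument turns the literal value `docs` into a variable named `table_name`, so the template can later be applied to a different table. The following sketch shows the general idea on a nested dictionary. The `<var:...>` placeholder syntax, the helpers `substitute` and `fill`, and the example `component` dictionary are assumptions made for illustration and are not taken from the superduper source.

```python
def substitute(obj, substitutions):
    """Recursively replace literal values in strings with `<var:name>` placeholders."""
    if isinstance(obj, str):
        for literal, variable in substitutions.items():
            obj = obj.replace(literal, f"<var:{variable}>")
        return obj
    if isinstance(obj, dict):
        return {substitute(k, substitutions): substitute(v, substitutions)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [substitute(v, substitutions) for v in obj]
    return obj


def fill(obj, **variables):
    """Recursively replace `<var:name>` placeholders with concrete values."""
    if isinstance(obj, str):
        for name, value in variables.items():
            obj = obj.replace(f"<var:{name}>", value)
        return obj
    if isinstance(obj, dict):
        return {fill(k, **variables): fill(v, **variables) for k, v in obj.items()}
    if isinstance(obj, list):
        return [fill(v, **variables) for v in obj]
    return obj


# Hypothetical serialized component, loosely shaped like the application above.
component = {
    "table": {"identifier": "docs"},
    "listener": {"key": "x", "select": "docs.select()"},
}
templated = substitute(component, {"docs": "table_name"})
print(templated["listener"]["select"])                              # <var:table_name>.select()
print(fill(templated, table_name="articles")["table"]["identifier"])  # articles
```

Because this sketch uses plain substring replacement, a literal such as `docs` would also be replaced inside unrelated strings that happen to contain it, so distinctive names are safer to substitute.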



In [35]:
t.export('.')

In [37]:
!cat component.json | jq .

{
  "_base": "?vector-search",
  "_builds": {
    "vector-search": {
      "_path": "superduper.components.template.Template",
      "template": {
        "_base": "?text-vector-search",
        "_builds": {
          "str": {
            "_path": "superduper.components.schema.FieldType"
          },
          "schema": {
            "_path": "superduper.components.schema.Schema",
            "fields": {
              "x": "?str",
              "_fold": 