In [None]:
!pip install "openai<1.0" "pymilvus>=2.3.2"

# Similarity Search with Zilliz Cloud and OpenAI

This page discusses integrating vector databases with OpenAI's embedding API.

We will demonstrate how to use OpenAI's [Embedding API](https://beta.openai.com/docs/guides/embeddings) with Zilliz Cloud to search for book titles. Many existing book search solutions, such as those used by public libraries, rely on keyword matching rather than a semantic understanding of what a title means. Semantic search instead uses a trained model to represent the input data as embedding vectors and compares those vectors, and it can be applied to a variety of text-based use cases, including anomaly detection and document search.
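
To make the limitation of keyword matching concrete, here is a minimal sketch. The `keyword_search` helper is hypothetical and written only for this illustration; the titles are a handful of those used later in this notebook. A query that shares no literal text with a title finds nothing, even when the meaning is close.

```python
# Naive keyword matching: a title matches only if it literally contains the query.
titles = [
    "One Hundred Years of Solitude",
    "The Philosophy of History",
    "On Crimes and Punishments",
    "Tell No One",
]

def keyword_search(query, titles):
    """Return the titles that contain `query` as a case-insensitive substring."""
    q = query.lower()
    return [t for t in titles if q in t.lower()]

history_hits = keyword_search("history", titles)
past_hits = keyword_search("the past", titles)

print(history_hits)  # ['The Philosophy of History']
print(past_hits)     # [] because no title contains the literal text "the past"
```

An embedding-based search has no such dependency on shared words, which is the gap the rest of this notebook addresses.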

## Get started

To follow along, you'll need an API key from the [OpenAI website](https://openai.com/api/). Also, be sure to visit our [cloud landing page](https://zilliz.com/cloud) for free credits that you can use to spin up a new cluster if you don’t have one already.

We'll also need to prepare the data for this example. You can download the book titles from [Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks). Let's create a function that loads book titles from our CSV.
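
As a rough sketch of what such a loader can look like, the snippet below reads titles by column name with `csv.DictReader`, which also skips the header row. It runs against a small in-memory sample instead of the real file. The sample rows, the `bookID` values, and the assumption that the header names the column `title` are illustrative only.

```python
import csv
import io

# Made-up stand-in for books.csv: a header row followed by one row per book.
SAMPLE_CSV = """bookID,title
1,The Iliad
2,Tell No One
3,"Europe and the People Without History"
"""

def load_titles(handle, column="title"):
    """Yield the value of `column` for every data row; the header row is consumed by DictReader."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row[column]

sample_titles = list(load_titles(io.StringIO(SAMPLE_CSV)))
print(sample_titles)
```

To read the downloaded file instead, pass an open file handle, for example `open(FILE, newline='')`, in place of the `io.StringIO` object.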

In [4]:
import csv
import random
import time

import openai
from pymilvus import connections, DataType, CollectionSchema, FieldSchema, Collection, utility

## Prepare arguments

In this section, we set the parameters used to embed the extracted titles, insert the embeddings into a Zilliz Cloud collection, and run an approximate nearest neighbor (ANN) search against them.
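
The cell below hardcodes placeholder strings for the cluster endpoint, token, and API key. One possible alternative, sketched here, is to read them from environment variables and fall back to the placeholders. The `read_setting` helper and the variable names `ZILLIZ_URI` and `ZILLIZ_TOKEN` are assumptions made for this example, not names required by Zilliz Cloud.

```python
import os

def read_setting(name, default):
    """Return the environment variable `name`, or `default` when it is unset or empty."""
    return os.environ.get(name) or default

# Hypothetical variable names; choose whatever fits your environment.
URI = read_setting("ZILLIZ_URI", "YOUR_CLUSTER_ENDPOINT")
TOKEN = read_setting("ZILLIZ_TOKEN", "YOUR_CLUSTER_TOKEN")
OPENAI_API_KEY = read_setting("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
```

Keeping secrets out of the notebook makes it safer to share, at the cost of one extra setup step before running it.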

In [7]:
# Set up arguments

# 1. Go to https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks, download the dataset, and save it locally.
FILE = '../books.csv' 

# 2. Set up the name of the collection to be created.
COLLECTION_NAME = 'title_db'

# 3. Set up the dimension of the embeddings.
DIMENSION = 1536

# 4. Set up the number of records to process.
COUNT = 100

# 5. Set up the connection parameters for your Zilliz Cloud cluster.
URI = 'YOUR_CLUSTER_ENDPOINT'

TOKEN = 'YOUR_CLUSTER_TOKEN'

# 6. Set up the OpenAI engine and API key to use.
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which engine to use
openai.api_key = 'YOUR_OPENAI_API_KEY'  # Use your own OpenAI API key here

## Work with Zilliz Cloud

The following snippet connects to Zilliz Cloud and prepares the cluster for this use case: it creates a collection, builds an index on the embedding field, and loads the collection. For more information on how to set up and use Zilliz Cloud, refer to the [quick start guide](https://docs.zilliz.com/docs/quick-start-1).

In [8]:
# Connect to Zilliz Cloud and create a collection

connections.connect(
    alias='default',
    # Public endpoint obtained from Zilliz Cloud
    uri=URI,
    token=TOKEN
)

if COLLECTION_NAME in utility.list_collections():
    utility.drop_collection(COLLECTION_NAME)

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='Ids', is_primary=True, auto_id=False),
    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)
]

schema = CollectionSchema(fields=fields, description='Title collection')

collection = Collection(
    name=COLLECTION_NAME,
    schema=schema,
)

index_params = {
    'metric_type': 'L2',
    'index_type': 'AUTOINDEX',
    'params': {}  # AUTOINDEX selects its own build parameters
}

collection.create_index(
    field_name='embedding', 
    index_params=index_params
)

collection.load()

Status(code=0, message=)

After setting up the collection, we can begin inserting our data, which involves three steps: reading the data, embedding the titles, and inserting into Zilliz Cloud.
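
The cell below embeds and inserts one title at a time. As an alternative sketch of the same three steps, the helpers here group titles into batches and build row dictionaries with the `id`, `title`, and `embedding` keys used by the collection. All names (`batched`, `truncate`, `build_rows`, `fake_embed_batch`) are hypothetical, and the embedder is a stand-in so the example runs offline; a real one would call the embedding API with a list of strings.

```python
from itertools import islice

def batched(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def truncate(title, limit=200):
    """Shorten a title to at most `limit` characters, ending in '..' when cut."""
    return title if len(title) <= limit else title[: limit - 2] + ".."

def build_rows(titles, embed_batch, batch_size=16, start_id=0):
    """Turn titles into row dictionaries with 'id', 'title', and 'embedding' keys.

    `embed_batch` is a hypothetical callable: list of strings -> list of vectors.
    """
    next_id = start_id
    for batch in batched(titles, batch_size):
        vectors = embed_batch(batch)
        for title, vector in zip(batch, vectors):
            yield {"id": next_id, "title": truncate(title), "embedding": vector}
            next_id += 1

# Stand-in embedder for a dry run: one-dimensional "vectors" holding the title length.
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

rows = list(build_rows(["The Iliad", "Tell No One", "Homebody"], fake_embed_batch, batch_size=2))
print(rows)
```

Batching reduces the number of embedding requests, which may matter when request rate limits rather than token limits are the bottleneck.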

In [9]:
# Load the csv file and extract embeddings from the text

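def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # Skip the header row so it is not treated as a title
        for row in reader:
            yield row[1]  # The title is the second column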

def embed(text):
    return openai.Embedding.create(
        input=text, 
        engine=OPENAI_ENGINE)["data"][0]["embedding"]

# Insert each title and its embeddings

for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k=COUNT)):
    ins = {
        'id': idx,
        'title': (text[:198] + '..') if len(text) > 200 else text,
        'embedding': embed(text)
    }
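    collection.insert(data=ins)
    time.sleep(3)  # Throttle requests to the embedding API
    print('Inserted: ', ins['title'])

# Flush so that the inserted rows are persisted before searching
collection.flush()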

Inserted:  Forever (Firstborn  #5)
Inserted:  A Kiss Remembered
Inserted:  Blubberina (Scrambled Legs  #5)
Inserted:  Nanny Ogg's Cookbook
Inserted:  One Hundred Years of Solitude
Inserted:  Key Lime Pie Murder (Hannah Swensen  #9)
Inserted:  The Collected Stories of Philip K. Dick 3: Second Variety
Inserted:  The Call of the Mall: How we shop
Inserted:  The Iliad
Inserted:  The Origin of Consciousness in the Breakdown of the Bicameral Mind
Inserted:  The Philosophy of History
Inserted:  City of Glass: The Graphic Novel
Inserted:  Shadow of the Moon (Moon #5)
Inserted:  A Circle of Quiet (Crosswicks Journals #1)
Inserted:  Demonic Males: Apes and the Origins of Human Violence
Inserted:  On Crimes and Punishments
Inserted:  Anleitung zum Zickigsein
Inserted:  Tell No One
Inserted:  For Her Own Good: Two Centuries of the Experts' Advice to Women
Inserted:  Up Country
Inserted:  Europe and the People Without History
Inserted:  Presidential Assassins (History Makers)
Inserted:  Homebody

In [10]:
# Search for similar titles
def search(text):
    res = collection.search(
        data=[embed(text)],
        anns_field='embedding',
        param={"metric_type": "L2", "params": {}},  # AUTOINDEX needs no nprobe
        output_fields=['title'],
        limit=5,
    )

    # Flatten the hits into [id, distance, title] rows
    return [
        [hit['id'], hit['distance'], hit['entity']['title']]
        for hits in res
        for hit in hits
    ]

search_terms = [
    'self-improvement',
    'landscape',
]

for term in search_terms:
    print('Search term: ', term)
    for row in search(term):
        print(row)
    print()

Search term:  self-improvement

Search term:  landscape
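
For intuition about what the search call computes, the sketch below ranks rows by L2 (Euclidean) distance with an exhaustive scan. It is not how Zilliz Cloud executes a query: an ANN index approximates this ranking without comparing the query against every vector. The helper names and the two-dimensional vectors are made up for the example, and only the ordering matters here, not the exact distance values.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, rows, limit=5):
    """Return up to `limit` (id, distance, title) tuples, closest first."""
    scored = [(r["id"], l2_distance(query, r["embedding"]), r["title"]) for r in rows]
    return sorted(scored, key=lambda item: item[1])[:limit]

# Made-up 2-dimensional vectors; the embeddings in this notebook have 1536 dimensions.
toy_rows = [
    {"id": 0, "title": "The Iliad", "embedding": [0.0, 1.0]},
    {"id": 1, "title": "Tell No One", "embedding": [1.0, 0.0]},
    {"id": 2, "title": "Homebody", "embedding": [3.0, 4.0]},
]

top = brute_force_search([0.0, 1.0], toy_rows, limit=2)
print(top)
```

An exhaustive scan like this is workable for the 100 titles used here, but its cost grows linearly with the collection size, which is why an index is built on the embedding field.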

