In [None]:
!pip install "openai<1.0" "pymilvus>=2.3.2"

# Similarity Search with Zilliz Cloud and OpenAI

This page discusses integrating vector databases with OpenAI's embedding API.

We will demonstrate how to use OpenAI's [Embedding API](https://beta.openai.com/docs/guides/embeddings) with our vector database to search for book titles. Many existing book search solutions, such as those used by public libraries, rely on keyword matching rather than a semantic understanding of what a title is about. Semantic search instead uses a trained model to turn each input into an embedding vector and compares those vectors. The same approach applies to a variety of text-based use cases, including anomaly detection and document search.
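
To make the contrast with keyword matching concrete, here is a minimal sketch. The three-dimensional vectors are made up and merely stand in for real embeddings, and `keyword_match` and `cosine_similarity` are illustrative helpers rather than part of any library.

```python
import math

def keyword_match(query, title):
    """Return True if any word of the query appears verbatim in the title."""
    title_words = set(title.lower().split())
    return any(word in title_words for word in query.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors invented for illustration; real embeddings come from the model
query = 'self-improvement'
query_vector = [0.9, 0.1, 0.2]
title_vectors = {
    'The Theory and Practice of Group Psychotherapy': [0.8, 0.3, 0.1],
    'The Wildlife of Star Wars': [0.1, 0.9, 0.4],
}

for title, vector in title_vectors.items():
    print(
        title,
        '| keyword match:', keyword_match(query, title),
        '| similarity:', round(cosine_similarity(query_vector, vector), 3),
    )
```

Neither title shares a word with the query, so keyword matching finds nothing, while the vector comparison can still rank one title above the other.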

## Get started

To follow along, you'll need an API key from the [OpenAI website](https://openai.com/api/). Also, be sure to visit our [cloud landing page](https://zilliz.com/cloud) for free credits that you can use to spin up a new cluster if you don’t have one already.

We'll also need to prepare the data for this example. You can download the book titles from [Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks). Later on, we'll define a function that loads the book titles from the CSV file.
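
Before connecting to any service, it can help to check how the titles come out of the file. The sketch below reads an in-memory sample instead of the real download. The sample rows, the `bookID` values, and the `load_titles` helper are made up for illustration, and it assumes the file has a header row with a `title` column.

```python
import csv
import io

# Made-up sample that mimics the assumed layout of books.csv
SAMPLE_CSV = """bookID,title,authors
1,Siddhartha,Hermann Hesse
2,The Elements of Style,William Strunk Jr.
3,The Hobbit  or There and Back Again,J.R.R. Tolkien
"""

def load_titles(handle):
    """Yield the `title` column from an open CSV file handle."""
    reader = csv.DictReader(handle)  # Uses the header row for column names
    for row in reader:
        yield row['title']

titles = list(load_titles(io.StringIO(SAMPLE_CSV)))
print(titles)
```

Reading by column name rather than by position keeps the loader working if the column order ever changes.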

In [3]:
# Use PyMilvus in development
# Should be replaced with `from pymilvus import *` in production
from pathlib import Path
import sys
sys.path.append(str(Path("/Users/anthony/Documents/play/refine_milvus/pymilvus")))

import csv, random, time
import openai
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

## Prepare arguments

In this section, we set the parameters needed to embed the extracted titles, insert the embeddings into a Zilliz Cloud collection, and run an approximate nearest neighbor (ANN) search against them.

In [6]:
# Set up arguments

# 1. Go to https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks, download the dataset, and save it locally.
FILE = '../books.csv' 

# 2. Set up the name of the collection to be created.
COLLECTION_NAME = 'title_db'

# 3. Set up the dimension of the embeddings.
DIMENSION = 1536

# 4. Set up the number of records to process.
COUNT = 100

# 5. Set up the connection parameters for your Zilliz Cloud cluster.
URI = 'https://in03-24426a264d9129a.api.gcp-us-west1.zillizcloud.com'

# For serverless clusters, use your API key as the token.
# For dedicated clusters, join your username and password with a colon, as in 'username:password'.
TOKEN = 'YOUR_ZILLIZ_CLOUD_TOKEN'  # Replace with your own token; never publish a real one

# 6. Set up the OpenAI engine and API key to use.
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which engine to use
openai.api_key = 'YOUR_OPENAI_API_KEY'  # Replace with your own OpenAI API key
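
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here, is to read them from environment variables. The variable names `ZILLIZ_URI`, `ZILLIZ_TOKEN`, and `OPENAI_API_KEY`, as well as the `load_settings` helper, are assumptions chosen for this example.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup
REQUIRED = ('ZILLIZ_URI', 'ZILLIZ_TOKEN', 'OPENAI_API_KEY')

def load_settings(env=None):
    """Return the required settings, failing early if any of them is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[name] for name in REQUIRED}

# Demo with a stand-in mapping; in the notebook, call load_settings() with no arguments
demo = load_settings({
    'ZILLIZ_URI': 'https://example.invalid',
    'ZILLIZ_TOKEN': 'placeholder',
    'OPENAI_API_KEY': 'placeholder',
})
print(sorted(demo))
```

With this in place, `URI`, `TOKEN`, and `openai.api_key` could be assigned from the returned dictionary instead of from string literals.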

## Work with Zilliz Cloud

The following snippet connects to Zilliz Cloud and prepares the cluster for this use case by creating a collection and indexing it. For more information on how to set up and use Zilliz Cloud, refer to the [quick start guide](https://docs.zilliz.com/docs/quick-start-1).

In [7]:
# Connect to Zilliz Cloud and create a collection

client = MilvusClient(uri=URI, token=TOKEN)

if COLLECTION_NAME in client.list_collections():
    client.drop_collection(COLLECTION_NAME)

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='Ids', is_primary=True, auto_id=False),
    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)
]

schema = CollectionSchema(fields=fields, description='Title collection')

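# AUTOINDEX leaves the choice of index settings to Zilliz Cloud.
# 'nlist' is an IVF-style parameter and may have no effect here.
index_params = {
    'metric_type': 'L2',
    'index_type': 'AUTOINDEX',
    'params': {'nlist': 1024}
}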

client.create_collection_with_schema(
    collection_name=COLLECTION_NAME, 
    schema=schema, 
    index_params=index_params
)

alloc_timestamp unimplemented, ignore it


After setting up the collection, we can begin inserting our data, which involves three steps: reading the data, embedding the titles, and inserting into Zilliz Cloud.

In [8]:
# Load the csv file and extract embeddings from the text

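def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # Skip the header row (assumes the file has one)
        for row in reader:
            yield row[1]  # The title is in the second column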

def embed(text):
    # openai.Embedding.create is the pre-1.0 interface of the openai package
    return openai.Embedding.create(
        input=text,
        engine=OPENAI_ENGINE)["data"][0]["embedding"]

# Insert each title and its embeddings

for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k=COUNT)):
    ins = {
        'id': idx,
        'title': (text[:198] + '..') if len(text) > 200 else text,
        'embedding': embed(text)
    }
    client.insert(collection_name=COLLECTION_NAME, data=ins)
    time.sleep(3)  # Pause between requests (assumed to be for API rate limits)
    print('Inserted: ', ins['title'])

# Flush the data to disk 
# Zilliz Cloud automatically flushes the data to disk once a segment is full. 
# You do not always need to call this method.
client.flush(COLLECTION_NAME)

Inserted:  The Man From St. Petersburg
Inserted:  The Deed (Deed  #1)
Inserted:  The Dragons at War (Dragonlance Dragons  #2)
Inserted:  Pioneer Girl: The Story of Laura Ingalls Wilder
Inserted:  Citizen of the Galaxy
Inserted:  The Church in Emerging Culture: Five Perspectives
Inserted:  I Was a Teenage Fairy
Inserted:  Poetry and Prose of Alexander Pope (Riverside Editions)
Inserted:  The Wildlife of Star Wars
Inserted:  Girls Think of Everything: Stories of Ingenious Inventions by Women
Inserted:  The Hobbit  or There and Back Again
Inserted:  The Closers (Harry Bosch  #11; Harry Bosch Universe  #14)
Inserted:  Ultimate Punishment
Inserted:  The Theory and Practice of Group Psychotherapy
Inserted:  The Hunt for Zero Point: Inside the Classified World of Antigravity Technology
Inserted:  Monsieur Ibrahim et les fleurs du Coran
Inserted:  La venganza de Opal (Artemis Fowl  #4)
Inserted:  Time for Ballet
Inserted:  Darwin's Dangerous Idea: Evolution and the Meanings of Life
Inserted:  
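
The loop above sends one request per title. A common alternative is to group rows and send each group in a single call. The sketch below shows only the grouping step. `chunked` is a hypothetical helper, and whether `client.insert` accepts a list of rows in your PyMilvus version is an assumption to check against its documentation.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError('size must be at least 1')
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in rows without embeddings, purely to show the grouping
rows = [{'id': i, 'title': f'title {i}'} for i in range(7)]

for batch in chunked(rows, 3):
    # In the notebook, this is where one insert call per batch would go
    print([row['id'] for row in batch])
```

Larger batches mean fewer round trips, at the cost of more data in flight per request.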

In [9]:
# Search for similar titles
def search(text):
    res = client.search(
        collection_name=COLLECTION_NAME,
        data=[embed(text)],
        output_fields=['title'],
        limit=5,
    )

    # Flatten the results into [id, distance, title] rows
    ret = []

    for hits in res:
        for hit in hits:
            ret.append([hit['id'], hit['distance'], hit['entity']['title']])

    return ret

search_terms = [
    'self-improvement',
    'landscape',
]

for term in search_terms:
    print('Search term: ', term)
    for row in search(term):
        print(row)
    print()

Search term:  self-improvement
[34, 0.4293707609176636, "Bridget Jones's Guide to Life"]
[13, 0.4357629418373108, 'The Theory and Practice of Group Psychotherapy']
[21, 0.43578940629959106, 'A Portrait of the Artist as a Young Man']
[51, 0.44489872455596924, 'Siddhartha']
[85, 0.4468797445297241, 'The Elements of Style']

Search term:  landscape
[88, 0.3831649422645569, 'Beauty and the Contemporary Sublime']
[21, 0.4067862033843994, 'A Portrait of the Artist as a Young Man']
[73, 0.4077138304710388, 'A Hundred Camels in the Courtyard']
[8, 0.41474688053131104, 'The Wildlife of Star Wars']
[82, 0.41780704259872437, 'Norden']
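
Each result row above is `[id, distance, title]`, and the rows are ordered by increasing distance, so the first row is the closest match. As a rough mental model, an ANN search approximates the exact ranking sketched below. The vectors are made up, `top_k` is a hypothetical helper, and the distances it prints are not meant to reproduce the values Zilliz Cloud returns.

```python
import heapq
import math

def top_k(query, vectors, k):
    """Return the k (distance, id) pairs closest to `query` by Euclidean distance."""
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# Made-up 2-d vectors keyed by id
vectors = {
    0: [0.0, 0.0],
    1: [1.0, 0.0],
    2: [0.0, 2.0],
    3: [3.0, 4.0],
}

for distance, vec_id in top_k([0.0, 0.1], vectors, k=2):
    print(vec_id, round(distance, 3))
```

An exact scan like this compares the query with every stored vector, which is what an index lets the database avoid on large collections.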

