# hello_milvus_openai Demo

`hello_milvus_openai.ipynb` demonstrates the basic operations of PyMilvus, the Python SDK for Milvus, together with the OpenAI embeddings API.
Before running, make sure that you have a running Milvus instance.

1. connect to Milvus
2. create collection
3. create embeddings via OpenAI API
4. data preparation
5. insert data
6. create index
7. search, query, and hybrid search on entities
8. delete entities by PK
9. drop collection

This notebook is based on the original `hello_milvus.ipynb` and is inspired by Pinecone's `gen_qa_openai.ipynb`.

In [45]:
import numpy as np
import time

from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

search_latency_fmt = "search latency = {:.4f}s"
dim = 1536
auto_id = False

## 1. connect to Milvus

Add a new connection alias `default` for the Milvus server at `localhost:19530`.

The `default` alias is built into PyMilvus. If your Milvus address is `localhost:19530`, you can omit all parameters and simply call `connections.connect()`.

Note: the `using` parameter of the following methods defaults to "default".

In [46]:
connections.connect("default", host="localhost", port="19530")

has = utility.has_collection("hello_milvus_openai")
print(f"Does collection hello_milvus_openai exist in Milvus: {has}")

Does collection hello_milvus_openai exist in Milvus: True


## 2. create collection
We're going to create a collection with 3 fields.

|   |field name  |field type |other attributes              |  field description      |
|---|:----------:|:---------:|:----------------------------:|:-----------------------:|
|1  |    "pk"    |   INT64/VARCHAR |is_primary=True, auto_id=True/False|      "primary field"    |
|2  |"embeddings"|FloatVector|     dim=1536                 |"float vector with dim 1536"|
|3  |   "text"   |   VARCHAR |                              |"varchar"|

In [69]:
if auto_id:
    pk_schema = FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True)
else:
    pk_schema = FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100)

fields = [
    pk_schema,
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096)
]

schema = CollectionSchema(fields, "hello_milvus_openai is the simplest demo to introduce the APIs")

hello_milvus_openai = Collection("hello_milvus_openai", schema, consistency_level="Strong")

## 3. create embeddings via OpenAI API
We are going to connect to OpenAI and test the connection by embedding two sample sentences.

In [61]:
import os
import openai

# create an API key in your OpenAI account and export it as OPENAI_API_KEY;
# never hard-code a real key in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [62]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)
len(res['data'][0]['embedding'])

1536

We will apply this same embedding logic to a dataset containing information relevant to our query (and many other queries on the topics of ML and AI).

## 4. data preparation

The dataset we will be using is the `jamescalam/youtube-transcriptions` from Hugging Face _Datasets_. It contains transcribed audio from several ML and tech YouTube channels. We download it with:

In [4]:
from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data


Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})

In [53]:
data[0]

{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'published': '2021-07-06 13:00:03 UTC',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'video_id': '35Pdoyi6ZoQ',
 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
 'id': '35Pdoyi6ZoQ-t0.0',
 'text': 'Hi, welcome to the video.',
 'start': 0.0,
 'end': 9.36}

The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information.

In [55]:
from tqdm.auto import tqdm

new_data = []

window = 10  # number of sentences to combine
stride = 5  # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

100%|███████████████████████████████████████████████████████████████████████████| 41724/41724 [00:24<00:00, 1687.12it/s]
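To see what `window` and `stride` do in the loop above, here is a minimal, self-contained sketch on a toy list of sentences. The helper name `merge_windows` and the toy data are hypothetical and not part of the notebook. It mirrors the `data[i:i_end]` slicing but, as a simplification, leaves out the title check and caps `i_end` at `len(sentences)`.

```python
def merge_windows(sentences, window=10, stride=5):
    """Join up to `window` consecutive sentences, advancing `stride` at a time."""
    chunks = []
    for i in range(0, len(sentences), stride):
        i_end = min(len(sentences), i + window)
        chunks.append(" ".join(sentences[i:i_end]))
    return chunks


# Toy data: six short "sentences" (hypothetical, for illustration only).
toy = [f"s{n}" for n in range(6)]
chunks = merge_windows(toy, window=4, stride=2)
for chunk in chunks:
    print(chunk)
# s0 s1 s2 s3
# s2 s3 s4 s5
# s4 s5
```

Because `stride` is smaller than `window`, consecutive chunks overlap, so a sentence near a chunk boundary still appears together with its neighbours in at least one chunk.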


In [59]:
max((len(d['text']) for d in new_data))

3893

In [60]:
new_data[0]

{'start': 0.0,
 'end': 39.56,
 'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video.",
 'id': '35Pdoyi6ZoQ-t0.0',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'published': '2021-07-06 13:00:03 UTC',
 'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}

Now we need a place to store the embeddings of these chunks and to run an efficient _vector search_ over them. For that we use the Milvus collection we created earlier.

## 5. insert data

We are going to insert roughly the first 10% of the rows of `new_data` into `hello_milvus_openai` (for faster execution). Data to be inserted must be organized by field: one list per field, in schema order.

The insert() method returns:
- either automatically generated primary keys by Milvus if auto_id=True in the schema;
- or the existing primary key field from the entities if auto_id=False in the schema.
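"Organized in fields" means column-based data: one list per field rather than one dict per row. The following is a minimal sketch of that conversion, assuming the `pk`, `embeddings`, `text` field order of the schema above. The helper `rows_to_columns` and the toy rows are hypothetical.

```python
def rows_to_columns(rows, field_names):
    """Turn a list of row dicts into one list per field, in the given order."""
    return [[row[name] for row in rows] for name in field_names]


# Toy rows with 2-dimensional vectors (the real collection uses dim=1536).
rows = [
    {"pk": "a-0", "embeddings": [0.1, 0.2], "text": "first"},
    {"pk": "a-1", "embeddings": [0.3, 0.4], "text": "second"},
]
columns = rows_to_columns(rows, ["pk", "embeddings", "text"])
print(columns[0])  # the primary-key column
print(columns[2])  # the text column
```

With `auto_id=True` the `pk` column is simply left out, which is what the `entities` list in the insert cell does.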

In [72]:
from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

# for i in tqdm(range(0, len(new_data), batch_size)):
for i in tqdm(range(0, len(new_data)//10, batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, retrying every 5 seconds while the API reports RateLimitError
    while True:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            break
        except openai.error.RateLimitError:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    entities = [ids_batch] if not auto_id else [] 
    entities += [
        embeds,
        texts
    ]
    insert_result = hello_milvus_openai.insert(entities)

print(f"Number of entities in Milvus: {hello_milvus_openai.num_entities}")  # check num_entities
# print(f"Primary keys of the inserted entities: {insert_result.primary_keys[:3]}") # check the autogenerated primary_keys

100%|█████████████████████████████████████████████████████████████████████████████████| 404/404 [09:06<00:00,  1.35s/it]

Number of entities in Milvus: 37945
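The insert loop retries the embedding call at a fixed interval. A common alternative is exponential backoff with a cap on the number of attempts. The sketch below is one way to write it. The helper `with_retries` and the `flaky` stand-in are hypothetical, and the `sleep` argument is injectable only so the example runs without real delays.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Toy stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = with_retries(flaky, retry_on=(RuntimeError,), sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the insert loop, `fn` could be something like `lambda: openai.Embedding.create(input=texts, engine=embed_model)` with `retry_on=(openai.error.RateLimitError,)`, assuming the pre-1.0 `openai` package that this notebook uses.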





## 6. create index
We are going to create an IVF_FLAT index for hello_milvus_openai collection.

create_index() can only be applied to `FloatVector` and `BinaryVector` fields.

In [73]:
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

hello_milvus_openai.create_index("embeddings", index)

Status(code=0, message=)
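To give an intuition for the `nlist` parameter here and the `nprobe` parameter used at search time, the following pure-Python toy sketches the general idea behind an IVF-style index: vectors are grouped into `nlist` clusters, and a search scans only the `nprobe` clusters whose centroids are closest to the query. This is an illustration under simplifying assumptions (a few k-means iterations, tiny random data, hypothetical function names), not how Milvus implements it.

```python
import random


def sq_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v, k=1):
    """Indices of the k centroids closest to v."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(centroids[c], v))
    return order[:k]


def assign(vectors, centroids):
    """Inverted lists: for each centroid, the indices of its member vectors."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)[0]].append(i)
    return lists


def build_ivf(vectors, nlist, iters=5, seed=0):
    """Toy k-means; returns (centroids, inverted lists)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a cluster is empty
                cols = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in cols]
    return centroids, assign(vectors, centroids)


def ivf_search(query, vectors, centroids, lists, nprobe, limit):
    """Scan only the nprobe closest clusters and return the best `limit` ids."""
    candidates = [i for c in nearest(centroids, query, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: sq_l2(vectors[i], query))
    return candidates[:limit]


rng = random.Random(42)
vectors = [[rng.random() for _ in range(4)] for _ in range(200)]
centroids, lists = build_ivf(vectors, nlist=8)

# Searching with a stored vector as the query finds that vector first.
print(ivf_search(vectors[0], vectors, centroids, lists, nprobe=1, limit=3)[0])  # 0
```

A larger `nprobe` scans more clusters, which trades speed for recall. With `nprobe` equal to `nlist` the toy search degenerates into an exhaustive scan.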

## 7. search, query, and hybrid search
After data were inserted into Milvus and indexed, you can perform:
- search based on vector similarity
- query based on scalar filtering (boolean, int, etc.)
- hybrid search based on vector similarity and scalar filtering.

Before conducting a search or a query, you need to load the data in `hello_milvus_openai` into memory.

In [17]:
hello_milvus_openai.load()
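Queries and hybrid searches take a boolean expression over scalar fields. The sketch below builds a primary-key filter and passes it to `Collection.query` and to `Collection.search` through the `expr` argument. The helper names are hypothetical, the field names `pk`, `embeddings` and `text` are the ones from the schema above, and the two wrapper functions are only defined here, not called, because they need a loaded collection and query embeddings.

```python
import json


def pk_in_expr(pks, field="pk"):
    """Build a boolean expression such as: pk in ["a", "b"]."""
    return f"{field} in {json.dumps(list(pks))}"


def query_by_pk(collection, pks):
    # Scalar query: fetch the entities whose primary key is in `pks`.
    return collection.query(expr=pk_in_expr(pks), output_fields=["pk", "text"])


def hybrid_search(collection, query_embeds, pks, limit=3):
    # Vector search restricted to entities that pass the scalar filter.
    params = {"metric_type": "L2", "params": {"nprobe": 10}}
    return collection.search(
        query_embeds, "embeddings", params, limit=limit,
        expr=pk_in_expr(pks), output_fields=["text"],
    )


print(pk_in_expr(["35Pdoyi6ZoQ-t0.0"]))  # pk in ["35Pdoyi6ZoQ-t0.0"]
```

For example, `query_by_pk(hello_milvus_openai, ["35Pdoyi6ZoQ-t0.0"])` would look up the first chunk by its ID when the collection was created with `auto_id=False`.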

**Text embedding**

In [38]:
new_text = "There is an ugly data loader. It's name is unknown :)"
original_text = new_data[100]['text']

search_texts = [original_text, new_text]
search_texts

["We have our data loader. What is the name of that data loader? I'm not sure. Data loader. Cool.",
 "There is an ugly data loader. It's name is unknown :)"]

In [39]:
res = openai.Embedding.create(input=search_texts, engine=embed_model)
search_embeds = [record['embedding'] for record in res['data']]
len(search_embeds[0])

1536

**Search based on vector similarity**

In [40]:
# search with the embeddings of the two query texts
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}

start_time = time.time()
result = hello_milvus_openai.search(search_embeds, "embeddings", search_params, limit=3, output_fields=["text"])
end_time = time.time()

for hits in result:
    print('\n')
    for hit in hits:
        print(f"hit: {hit}, text field: {hit.entity.get('text')}")
print(search_latency_fmt.format(end_time - start_time))



hit: id: 440952106819840763, distance: 0.0, entity: {'text': "We have our data loader. What is the name of that data loader? I'm not sure. Data loader. Cool."}, text field: We have our data loader. What is the name of that data loader? I'm not sure. Data loader. Cool.
hit: id: 440952106819840746, distance: 0.19315238296985626, entity: {'text': "And we're going to initialize our loop object using TQDM. So TQDM. We have our data loader. What is the name of that data loader? I'm not sure."}, text field: And we're going to initialize our loop object using TQDM. So TQDM. We have our data loader. What is the name of that data loader? I'm not sure.
hit: id: 440952106819840764, distance: 0.23596343398094177, entity: {'text': "I'm not sure. Data loader. Cool. Data loader. And we set leave equals true."}, text field: I'm not sure. Data loader. Cool. Data loader. And we set leave equals true.


hit: id: 440952106819840763, distance: 0.2407623827457428, entity: {'text': "We have our data loader.

## 9. drop collection
Finally, drop the `hello_milvus_openai` collection.

In [66]:
utility.drop_collection("hello_milvus_openai")